2026-02-17 · 5 min read
Building Axiora: Parsing 11,000 Companies' Financial Data From Raw XBRL
Japan mandates some of the most detailed corporate disclosures in the world. Every listed company files a 有価証券報告書 — a securities report that runs 100 to 200+ pages covering full financial statements, segment breakdowns, risk factors, executive compensation, cross-shareholdings, employee data. Every number is XBRL-tagged. Every filing is free and public through EDINET, Japan's equivalent of the SEC's EDGAR. 11,000+ entities. Up to a decade of history.
And almost nobody outside Japan uses any of it.
I spent months working with EDINET. Getting any of it into a usable format was miserable.
The filings are in XBRL, a markup language that makes XML look simple. Three accounting standards coexist (J-GAAP, IFRS, US-GAAP), each with different element names for the same financial concept. Fiscal years end in March, not December. Consolidated and unconsolidated reports sit side by side. There's no official English-language API. And the FSA updates the taxonomy annually — element names change, get deprecated, get replaced.
So I built Axiora.
Why not use an existing parser?
I evaluated all the open-source options. Arelle is the industry standard — handles every edge case in every jurisdiction. It's also a 200MB install designed for compliance, not for extracting 52 financial fields at 500ms per filing. edinet-tools doesn't parse iXBRL, which is ~60% of recent filings. edinet-mcp is the closest — 161 fields, three standards — but its non-consolidated filter is binary, which zeroes out standalone filers that have no consolidated data.
The gap across all of them: context resolution. A single EDINET filing contains the same element in 10+ contexts — consolidated current year, non-consolidated current year, prior year, by segment, by geography. Picking the right one requires knowing whether the filer has consolidated data at all, whether the context carries dimension members, and which accounting standard's elements take priority when multiple map to the same field.
What makes this hard
Building an XBRL parser is straightforward. Building one that produces correct data for 11,000+ entities across three accounting standards, two XBRL formats, and a decade of filings — that's where the interesting problems are.
Attribution ambiguity. Every consolidated filing contains multiple versions of the same metric at different scopes. Group total vs parent-attributed. For a conglomerate with large joint ventures, the difference can exceed 20%. In extreme cases, the group total is positive while the parent-attributed figure is negative — same filing, same year, opposite sign. EPS, ROE, and net margin should all use the parent-attributed figure. Get this wrong and every derived metric is inconsistent.
Two formats. Every filing arrives as a ZIP containing either traditional XBRL (pure XML, 10-50MB) or inline XBRL (HTML with embedded tags, 1-5MB). Traditional XBRL requires streaming to keep memory flat. iXBRL has its own traps: scaling attributes that multiply displayed values by powers of ten, sign attributes that override the visible number. Miss a scaling attribute and every number in your database is off by six orders of magnitude.
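The scale and sign handling reduces to a few lines. A minimal sketch, assuming the `ix:nonFraction` attributes have already been pulled out of the HTML:

```python
# Minimal iXBRL numeric handling. Per the Inline XBRL spec, the
# displayed text is multiplied by 10**@scale, and @sign="-" negates
# the result regardless of how the number is displayed.
def ixbrl_value(text: str, attrs: dict[str, str]) -> int:
    n = int(text.replace(",", ""))               # "1,234" -> 1234
    n *= 10 ** int(attrs.get("scale", "0"))      # scale="6" => millions
    if attrs.get("sign") == "-":                 # sign overrides the display
        n = -n
    return n
```

A filing that displays "1,234" with `scale="6"` means 1,234,000,000 yen; drop the attribute and you are off by exactly the six orders of magnitude mentioned above.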
Accounting standard transitions. Hundreds of companies have switched between J-GAAP, IFRS, and US-GAAP. Each transition leaves ghost elements from the old standard in subsequent filings. A parser that doesn't track transitions extracts values from the wrong standard — and the wrong fiscal period. The error can be tens of trillions of yen for Japan's largest companies.
Japanese numeric conventions. Negative numbers appear as triangles (△), filled triangles (▲), parentheses, minus signs, full-width minus signs, and em-dashes. Some mean "not applicable" depending on the filer. Older filings use Shift-JIS encoding. Every branch in my numeric parser exists because a real company filed a real document that triggered it.
Structural false positives. Companies can define their own XBRL elements beyond the standard taxonomy. Sometimes these naming conventions collide with standard financial concepts. A real estate line item shares a prefix with total equity. The parser selects the wrong element and the resulting error is orders of magnitude off — but only for the handful of companies with these collisions. You can't catch it with aggregate validation.
The pipeline
New filings appear on EDINET daily. The pipeline runs every 30 minutes:
- List — query EDINET API for recent dates, filter to reports with XBRL
- Download — fetch the ZIP, extract the instance document
- Parse — detect format, resolve taxonomy, extract values with correct context priority
- Persist — upsert company, filing, and financials to the database
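The List step looks roughly like this. The endpoint and field names reflect my reading of the EDINET v2 API (a `documents.json` endpoint returning a `results` array with `docID`, `docTypeCode`, and `xbrlFlag`) and should be treated as assumptions, not documentation:

```python
# Sketch of the "List" step. Endpoint and field names are assumptions
# based on the EDINET v2 API; docTypeCode "120" is the annual
# securities report, xbrlFlag "1" means XBRL is attached.
import json
from urllib.request import urlopen

def list_filings(date: str, api_key: str) -> list[str]:
    url = ("https://api.edinet-fsa.go.jp/api/v2/documents.json"
           f"?date={date}&type=2&Subscription-Key={api_key}")
    with urlopen(url) as resp:
        results = json.load(resp).get("results", [])
    return xbrl_annual_reports(results)

def xbrl_annual_reports(results: list[dict]) -> list[str]:
    # Keep only XBRL-tagged annual securities reports.
    return [r["docID"] for r in results
            if r.get("docTypeCode") == "120" and r.get("xbrlFlag") == "1"]
```

Each returned docID then flows through Download, Parse, and Persist.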
For historical backfill, I parallelize with bounded concurrency. A multi-year backfill completes in hours. Per-share values are stored as integers to avoid floating-point errors. No floats touch the money.
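Bounded concurrency here is the standard semaphore pattern. A minimal asyncio sketch (simplified; the production pipeline's actual concurrency primitive may differ):

```python
# Bounded-concurrency backfill sketch: at most `limit` filings are
# processed at once, so a multi-year backfill doesn't overwhelm the
# API or the parser.
import asyncio

async def backfill(doc_ids: list[str], process, limit: int = 8) -> list:
    sem = asyncio.Semaphore(limit)

    async def bounded(doc_id: str):
        async with sem:            # caps in-flight work at `limit`
            return await process(doc_id)

    return await asyncio.gather(*(bounded(d) for d in doc_ids))
```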
Testing
The test suite is roughly 5x the size of the parser: over 665 correctness checks, not smoke tests but specific scenarios reconstructed from production bugs. Every time a filing breaks something, the regression test goes in before the fix.
The scenarios cover all three accounting standards, both XBRL formats, consolidated and non-consolidated scoping, standard transitions, naming collisions, and cross-field consistency (net income and EPS must have the same sign). The suite grows monotonically. I haven't deleted a test yet.
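The cross-field checks are simple properties. For example, the sign-consistency rule stated above looks like this (field names illustrative):

```python
# Cross-field consistency: EPS is derived from parent-attributed net
# income, so the two must always carry the same sign.
def signs_consistent(net_income: int, eps: float) -> bool:
    def sign(x) -> int:
        return (x > 0) - (x < 0)
    return sign(net_income) == sign(eps)
```

A filing where net income is positive but EPS is negative means the parser picked values from two different contexts or standards, which is exactly the class of bug these checks catch.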
The result
52 normalized fields. Three accounting standards. Two XBRL formats. 11,224 companies. 278,701 financial data points. Screening, rankings, sector analytics, comparisons, bulk export, webhooks, an MCP server for AI agents, and a Python SDK.
```python
import httpx

resp = httpx.get(
    "https://api.axiora.dev/v1/companies/7203/financials",
    headers={"Authorization": "Bearer ax_live_YOUR_KEY"},
)
data = resp.json()["data"][0]
print(f"Revenue: ¥{data['revenue']:,}")
print(f"Net income: ¥{data['net_income']:,}")
print(f"Total assets: ¥{data['total_assets']:,}")
```
Bloomberg costs $24K/year. EDINET's raw XBRL is a maze. The gap between "the data is technically public" and "a developer can actually use it" is where Axiora sits.