
2026-02-17 · 5 min read

Parsing 11,000 Companies' Financial Data From Raw XBRL

Every listed company in Japan files a 有価証券報告書, a securities report running 100 to 200+ pages. Full financial statements, segment breakdowns, executive compensation, cross-shareholdings, all XBRL-tagged and free through EDINET. Over 11,000 entities with up to a decade of history.

Almost nobody outside Japan uses any of it.

The filings are in XBRL, a markup language that makes XML look simple. Three accounting standards coexist (J-GAAP, IFRS, US-GAAP), each with different element names for the same concept. Fiscal years end in March, not December. No official English-language API. The FSA updates the taxonomy annually, so element names change, get deprecated, or get swapped out for something new.

I spent months trying to get this data into a usable format. It was miserable. So I built Axiora.


Why not an existing parser?

Arelle is the industry standard. Handles every edge case in every jurisdiction. Also a 200MB install designed for compliance, not for extracting 52 fields at 500ms per filing. edinet-tools doesn't parse iXBRL, which is ~60% of recent filings. edinet-mcp gets closest (161 fields, three standards) but its non-consolidated filter is binary. Standalone filers with no consolidated data get zeroed out.

The gap across all of them: context resolution. A single filing contains the same element in 10+ contexts. Consolidated current year, non-consolidated current year, prior year, by segment, by geography. Picking the right one requires knowing whether the filer consolidates at all, whether the context carries dimension members, and which standard's elements take priority when multiple map to the same field.
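
Roughly, the selection logic has to look like this. The Context shape here is illustrative, not Axiora's actual internals, though EDINET context IDs do follow patterns like CurrentYearDuration:

from dataclasses import dataclass, field

@dataclass
class Context:
    id: str
    dimension_members: list[str] = field(default_factory=list)

def pick_context(contexts: list[Context], consolidates: bool) -> Context | None:
    # Consolidated filers: the plain current-year context, with no dimension
    # members (segment and geography breakdowns carry members).
    # Standalone filers: the non-consolidated context IS the real figure,
    # not something to zero out.
    wanted = ("CurrentYearDuration" if consolidates
              else "CurrentYearDuration_NonConsolidatedMember")
    for ctx in contexts:
        if ctx.id == wanted and not ctx.dimension_members:
            return ctx
    return None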


What makes this hard

An XBRL parser that handles the happy path is a weekend project. Getting one to produce correct data for 11,000+ entities across three standards, two formats, and a decade of filings is a different thing entirely.

Attribution ambiguity. Every consolidated filing contains the same metric at multiple scopes. Group total vs parent-attributed. For conglomerates with large joint ventures, the difference exceeds 20%. In extreme cases the group total is positive while parent-attributed is negative. Same filing, same year, but opposite sign. EPS, ROE, and net margin need the parent-attributed figure. Get this wrong and every derived metric is garbage.
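
In element terms that means a strict priority order: parent-attributed first, group total only as a fallback. The J-GAAP element names below are my reading of the EDINET taxonomy, so treat them as illustrative:

NET_INCOME_PRIORITY = [
    "jppfs_cor:ProfitLossAttributableToOwnersOfParent",  # parent-attributed
    "jppfs_cor:ProfitLoss",                              # group total, last resort
]

def resolve_net_income(facts: dict[str, int]) -> int | None:
    # First match wins; EPS, ROE, and net margin all derive from this value.
    for name in NET_INCOME_PRIORITY:
        if name in facts:
            return facts[name]
    return None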

Two formats. Every filing arrives as a ZIP: either traditional XBRL (pure XML, 10–50MB) or inline XBRL (HTML with embedded tags, 1–5MB). Traditional needs streaming to keep memory flat. iXBRL has scaling attributes that multiply values by powers of ten and sign attributes that override the displayed number. Miss one scaling attribute and every number in your database is off by six orders of magnitude.
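
Undoing the display formatting looks roughly like this. scale and sign are the attribute names from the inline XBRL spec; the transform (format) attributes are omitted for brevity:

from decimal import Decimal

def ixbrl_value(text: str, scale: str | None, sign: str | None) -> int:
    # scale is a power of ten applied to the displayed number;
    # sign="-" overrides whatever sign the page shows.
    n = Decimal(text.replace(",", ""))
    if scale:
        n *= Decimal(10) ** int(scale)
    if sign == "-":
        n = -n
    return int(n)

# "1,234" displayed at scale 6 is really 1,234,000,000.
assert ixbrl_value("1,234", "6", None) == 1_234_000_000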

Standard transitions. Hundreds of companies have switched between J-GAAP, IFRS, and US-GAAP. Each transition leaves ghost elements from the old standard. A parser that doesn't track transitions extracts from the wrong standard and the wrong fiscal period. For Japan's largest companies, that error is tens of trillions of yen.
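
The fix starts with trusting the filing's own declaration over whichever namespace happens to carry values. A sketch, assuming the DEI element and per-standard prefixes are as I remember them from the EDINET taxonomy:

PREFIX_FOR_STANDARD = {
    "Japan GAAP": "jppfs_cor",
    "IFRS": "jpigp_cor",
    "US GAAP": "jpusgaap_cor",  # prefix uncertain; illustrative
}

def active_prefix(facts: dict[str, str]) -> str | None:
    # The DEI section declares the standard; ghost elements from a
    # pre-transition standard never get to vote.
    declared = facts.get("jpdei_cor:AccountingStandardsDEI")
    return PREFIX_FOR_STANDARD.get(declared)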

Japanese numeric conventions. Negative numbers appear as triangles (△), filled triangles (▲), parentheses, minus signs, full-width minus signs, and em-dashes. Some mean "not applicable" depending on who filed it. Older filings use Shift-JIS encoding. Every branch in my numeric parser exists because a real company filed a real document that triggered it.
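
A condensed version of that parser. The real one has more branches, but every case here comes straight from the conventions above:

def parse_jp_number(raw: str) -> int | None:
    s = raw.strip().replace(",", "").replace("，", "")
    if s in {"", "―", "—", "－"}:  # bare dashes: usually "not applicable"
        return None
    neg = False
    if s[0] in "△▲":  # triangle notation for negatives
        neg, s = True, s[1:]
    elif s.startswith("(") and s.endswith(")"):
        neg, s = True, s[1:-1]
    elif s[0] in "-−－":  # ASCII hyphen, minus sign, full-width minus
        neg, s = True, s[1:]
    return -int(s) if neg else int(s)

assert parse_jp_number("△1,234") == -1234
assert parse_jp_number("(500)") == -500
assert parse_jp_number("―") is None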

Structural false positives. Companies define custom XBRL elements beyond the standard taxonomy. Sometimes these collide with standard financial concepts. A real estate line item shares a prefix with total equity. A parser that matches on prefixes grabs the wrong element. The error is orders of magnitude off, but only for the handful of companies with these collisions. Aggregate validation won't catch it.
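
The defense is boring: look up the exact qualified name, never a prefix. Element names here are illustrative, not the actual collision:

TOTAL_EQUITY = "jppfs_cor:NetAssets"

def find_equity(facts: dict[str, int]) -> int | None:
    # Exact QName lookup. A prefix match like name.startswith(TOTAL_EQUITY)
    # is what lets a custom real-estate element shadow the standard concept.
    return facts.get(TOTAL_EQUITY)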


The pipeline

Filings appear on EDINET daily. The pipeline runs every 30 minutes:

  1. Query the EDINET API for recent dates, filter to reports with XBRL (sketched below)
  2. Fetch the ZIP, extract the instance document
  3. Detect format, resolve taxonomy, extract values with correct context priority
  4. Upsert company, filing, and financials to the database
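
For concreteness, a sketch of step 1. The endpoint, parameters, and the xbrlFlag field are my reading of the EDINET API v2 docs; check the spec before relying on them:

import httpx

EDINET = "https://api.edinet-fsa.go.jp/api/v2"

def xbrl_filings_for(date: str, api_key: str) -> list[dict]:
    # List the day's filings, keep only those with XBRL attached.
    r = httpx.get(
        f"{EDINET}/documents.json",
        params={"date": date, "type": 2, "Subscription-Key": api_key},
    )
    r.raise_for_status()
    return [d for d in r.json()["results"] if d.get("xbrlFlag") == "1"]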

Historical backfill runs with bounded concurrency. Multi-year backfill completes in hours. Per-share values stored as integers. No floats touch the money.


665 tests

The test suite is 5x the size of the parser: 665 correctness checks. Not smoke tests, but specific scenarios reconstructed from production bugs. Every time a filing breaks something, the regression test goes in before the fix.

Covers all three accounting standards, both XBRL formats, consolidated and non-consolidated scoping, standard transitions, naming collisions, and cross-field consistency. Net income and EPS must have the same sign. The suite grows monotonically. I haven't deleted a test yet.
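
One of those cross-field checks, roughly. parse_filing and the fixture path are stand-ins, not the real internals:

def test_net_income_and_eps_agree_in_sign():
    result = parse_filing("tests/fixtures/sample_filing.zip")
    # A profitable year can't report negative EPS, and vice versa.
    assert (result["net_income"] >= 0) == (result["eps"] >= 0)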


The result

52 normalized fields across three accounting standards and two XBRL formats. 11,224 companies, 278,701 financial data points. Screening, rankings, sector analytics, bulk export, webhooks, an MCP server, and a Python SDK.

import httpx

# Pull the latest reported financials for a company by securities code
# (7203 is Toyota).
resp = httpx.get(
    "https://api.axiora.dev/v1/companies/7203/financials",
    headers={"Authorization": "Bearer ax_live_YOUR_KEY"},
)
resp.raise_for_status()
data = resp.json()["data"][0]

print(f"Revenue:      ¥{data['revenue']:,}")
print(f"Net income:   ¥{data['net_income']:,}")
print(f"Total assets: ¥{data['total_assets']:,}")

Bloomberg charges $24K/year. EDINET's raw XBRL is a maze. The gap between "the data is technically public" and "a developer can actually use it" is where Axiora sits.