Data Dumps & Bulk Downloads
For bulk access to OLDP data, the dump_api_data management command exports
every public API resource to gzipped JSONL files, accompanied by a
manifest.json describing the snapshot. This is the canonical way OLDP
publishes its data — including the HuggingFace dataset
openlegaldata/court-decisions-germany
(produced from a dump via oldp-toolkit).
Usage
./manage.py dump_api_data ./workingdir/snapshot-2026-04-29 --override
Arguments:
output (positional): directory path relative to WORKING_DIR.
--override: replace an existing output directory.
--limit N: cap the number of records per resource (default 0 = unlimited).
--include-lawbook-revisions: include every historical LawBook revision (and its child Law rows) in the dump. By default only books with latest=True are exported; see Latest-only LawBooks.
Output layout
snapshot-2026-04-29/
├── cases.jsonl.gz
├── courts.jsonl.gz
├── laws.jsonl.gz
├── law_books.jsonl.gz
├── cities.jsonl.gz
├── states.jsonl.gz
├── countries.jsonl.gz
├── annotation_labels.jsonl.gz
├── case_annotations.jsonl.gz
├── case_markers.jsonl.gz
└── manifest.json
Each *.jsonl.gz contains one record per line, serialized using the same
serializer the public REST API uses, so the dump shape mirrors the API.
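A minimal sketch of reading one of these files with the Python standard library is given here. The helper name iter_records is hypothetical and is not part of OLDP or oldp-toolkit; it only assumes the one-JSON-object-per-line layout described above.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_records(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line of a gzipped JSONL file.

    Hypothetical helper: streams the file, so large dumps such as
    cases.jsonl.gz never have to fit in memory at once.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_records(Path("snapshot-2026-04-29/courts.jsonl.gz")))` would count the exported courts without loading them all into memory.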
Snapshot contract
The dump command guarantees the following properties:
Always-accepted filter
Exported Case, Law, LawBook, and Court records always have
review_status == "accepted". Pending or declined records are never written
to a dump and therefore never reach published artifacts.
Stable ordering
Records are written in ascending primary-key order. Given the same database state, two consecutive dump runs produce byte-identical files. This makes downstream snapshots (HuggingFace dataset versions, benchmark resolution indices) reproducible.
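One way a consumer might check this property is to hash the files of two dump directories and list any that differ. The sketch below is illustrative only: file_digests and differing_files are hypothetical names, and it assumes both runs were taken against the same database state.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digests(snapshot_dir: Path) -> Dict[str, str]:
    """Map each *.jsonl.gz file name in a snapshot directory to its SHA-256 hex digest."""
    digests: Dict[str, str] = {}
    for path in sorted(snapshot_dir.glob("*.jsonl.gz")):
        sha = hashlib.sha256()
        with path.open("rb") as handle:
            # Hash in 1 MiB chunks so large dumps are not read into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha.update(chunk)
        digests[path.name] = sha.hexdigest()
    return digests


def differing_files(first: Path, second: Path) -> List[str]:
    """Return the sorted names of files that are missing from one side or differ in content."""
    a, b = file_digests(first), file_digests(second)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from differing_files means the two snapshots are byte-identical at the file level.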
Latest-only LawBooks
By default, only LawBook rows with latest=True are exported, and
laws.jsonl.gz is filtered to laws whose parent book is the latest
revision. This matches what the public website serves and what most
downstream consumers (citation matching, search indexing, the public HF
dataset) actually want.
For archival or law-revision-history use cases, pass
--include-lawbook-revisions to export every historical revision and its
laws. The active filter setting is recorded in manifest.json’s
filters.include_lawbook_revisions field.
Self-describing manifest
manifest.json records the snapshot identity:
{
"snapshot_date": "2026-04-29T12:34:56+00:00",
"oldp_version": "0.9.13",
"filters": {
"review_status": "accepted",
"include_lawbook_revisions": false
},
"files": {
"cases.jsonl.gz": {"row_count": 318442},
"laws.jsonl.gz": {"row_count": 47821},
"law_books.jsonl.gz": {"row_count": 612},
"courts.jsonl.gz": {"row_count": 1284}
}
}
Downstream consumers should pin against snapshot_date so that benchmark
scores or analyses remain reproducible across OLDP database growth.
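A consumer-side check along these lines could look like the following sketch. The function names are hypothetical, and the code assumes that each row_count in the manifest equals the number of non-empty lines in the corresponding file, which the example above suggests but this page does not state explicitly.

```python
import gzip
import json
from pathlib import Path
from typing import Dict


def count_rows(path: Path) -> int:
    """Count the non-empty lines of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_snapshot(snapshot_dir: Path, expected_date: str) -> Dict[str, int]:
    """Verify the pinned snapshot_date and the row counts listed in manifest.json.

    Raises ValueError on any mismatch; returns the verified counts otherwise.
    Only files listed under "files" in the manifest are checked.
    """
    manifest = json.loads((snapshot_dir / "manifest.json").read_text(encoding="utf-8"))
    if manifest["snapshot_date"] != expected_date:
        raise ValueError(
            f"expected snapshot {expected_date}, found {manifest['snapshot_date']}"
        )
    counts: Dict[str, int] = {}
    for name, info in manifest["files"].items():
        actual = count_rows(snapshot_dir / name)
        if actual != info["row_count"]:
            raise ValueError(f"{name}: manifest says {info['row_count']}, file has {actual}")
        counts[name] = actual
    return counts
```

Calling check_snapshot at the start of a benchmark run makes an accidental switch to a newer dump fail loudly instead of silently changing results.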
Citation-friendly fields
Some serializers denormalize fields that would otherwise require joining multiple JSONL files:
laws.jsonl.gz records carry both book_code (the parent LawBook.code, e.g., "BGB") and book_slug (e.g., "bgb") so consumers can build either a (book_code, slug) -> law_id index for citation parsing or a stable book_slug/law_slug canonical key (e.g., bgb/823) without loading law_books.jsonl.gz. Slug-based keys are recommended as the canonical target identifier in benchmarks because numeric id values are not stable across OLDP database rebuilds.
courts.jsonl.gz records carry both code (canonical, ECLI-derived) and aliases (newline-separated alternative names like “BGH” / “Bundesgerichtshof”) for resolving citations to a court_id.
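The indices described above can be sketched as follows. This is an illustration under assumptions, not OLDP code: the function names are hypothetical, the records are assumed to expose id and slug alongside the fields named above, and the case-insensitive alias matching is a choice made for this example.

```python
from typing import Any, Dict, Iterable, Tuple

Record = Dict[str, Any]


def build_law_index(laws: Iterable[Record]) -> Dict[Tuple[str, str], int]:
    """Map (book_code, slug) to the law's numeric id, for citation parsing."""
    return {(law["book_code"], law["slug"]): law["id"] for law in laws}


def canonical_law_key(law: Record) -> str:
    """Build the stable book_slug/law_slug key, e.g. "bgb/823"."""
    return f'{law["book_slug"]}/{law["slug"]}'


def build_court_index(courts: Iterable[Record]) -> Dict[str, int]:
    """Map the court code and every alias (case-folded) to the court's id.

    Aliases are newline-separated; on a name collision the first court wins.
    """
    index: Dict[str, int] = {}
    for court in courts:
        names = [court["code"]] + (court.get("aliases") or "").splitlines()
        for name in names:
            name = name.strip()
            if name:
                index.setdefault(name.casefold(), court["id"])
    return index
```

Because numeric ids change across database rebuilds, an index built this way is only valid for the snapshot it was built from; the slug-based key from canonical_law_key is the one to store in benchmark data.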
Downstream consumers
oldp-toolkit reads cases.jsonl.gz (auto-detects the .gz suffix), converts HTML to Markdown, extracts inline reference markers, and publishes the result as a HuggingFace parquet dataset.