Data Dumps & Bulk Downloads

For bulk access to OLDP data, the dump_api_data management command exports every public API resource to gzipped JSONL files, accompanied by a manifest.json that describes the snapshot. This is the canonical way OLDP publishes its data, including the HuggingFace dataset openlegaldata/court-decisions-germany, which is produced from a dump via oldp-toolkit.

Usage

./manage.py dump_api_data ./workingdir/snapshot-2026-04-29 --override

Arguments:

  • output (positional) — directory path relative to WORKING_DIR.

  • --override — replace an existing output directory.

  • --limit N — cap the number of records per resource (default 0 = unlimited).

  • --include-lawbook-revisions — include every historical LawBook revision (and its child Law rows) in the dump. By default only books with latest=True are exported; see Latest-only LawBooks.

Output layout

snapshot-2026-04-29/
├── cases.jsonl.gz
├── courts.jsonl.gz
├── laws.jsonl.gz
├── law_books.jsonl.gz
├── cities.jsonl.gz
├── states.jsonl.gz
├── countries.jsonl.gz
├── annotation_labels.jsonl.gz
├── case_annotations.jsonl.gz
├── case_markers.jsonl.gz
└── manifest.json

Each *.jsonl.gz contains one record per line, serialized using the same serializer the public REST API uses, so the dump shape mirrors the API.
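Because each file is plain gzipped JSONL, it can be streamed with the Python standard library alone. The following is a minimal sketch; the helper name iter_records is illustrative and not part of OLDP or oldp-toolkit.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one decoded record per non-empty line of a gzipped JSONL file."""
    # "rt" decompresses and decodes in one step, so the file is never
    # fully loaded into memory.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, list(iter_records(Path("snapshot-2026-04-29/courts.jsonl.gz"))) would load all court records, while iterating lazily is preferable for the much larger cases.jsonl.gz.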

Snapshot contract

The dump command guarantees the following properties:

Always-accepted filter

Exported Case, Law, LawBook, and Court records always have review_status == "accepted". Pending or declined records are never written to a dump and therefore never reach published artifacts.

Stable ordering

Records are written in ascending primary-key order. Given the same database state, two consecutive dump runs produce byte-identical files. This makes downstream snapshots (HuggingFace dataset versions, benchmark resolution indices) reproducible.
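A consumer can spot-check both properties. The sketch below assumes that each record exposes its primary key under an id field (an assumption about the serializer output) and uses illustrative helper names. It hashes files so that two dump runs can be compared, and checks that the keys in one file are strictly ascending.

```python
import gzip
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_sorted_by_id(path: Path, key: str = "id") -> bool:
    """Check that `key` is strictly ascending across all records.

    Assumption: the primary key is serialized under `key` ("id" by default).
    """
    previous = None
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            current = json.loads(line)[key]
            if previous is not None and current <= previous:
                return False
            previous = current
    return True
```

Equal file_sha256 digests for the same file name in two snapshot directories indicate that the two runs produced identical bytes for that resource.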

Latest-only LawBooks

By default, only LawBook rows with latest=True are exported, and laws.jsonl.gz is filtered to laws whose parent book is the latest revision. This matches what the public website serves and what most downstream consumers (citation matching, search indexing, the public HF dataset) actually want.

For archival or law-revision-history use cases, pass --include-lawbook-revisions to export every historical revision and its laws. The active filter setting is recorded in manifest.json’s filters.include_lawbook_revisions field.
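Code that assumes one law per book and slug may want to confirm which mode produced a snapshot before using it. A minimal sketch, reading the filters.include_lawbook_revisions field described above (the function name is illustrative):

```python
import json
from pathlib import Path


def includes_lawbook_revisions(snapshot_dir: Path) -> bool:
    """Return True if the snapshot was dumped with --include-lawbook-revisions."""
    manifest_path = snapshot_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return bool(manifest["filters"]["include_lawbook_revisions"])
```

A citation-matching pipeline could, for instance, refuse to run when this returns True, since historical revisions would otherwise introduce several candidate laws for the same citation.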

Self-describing manifest

manifest.json records the snapshot identity:

{
  "snapshot_date": "2026-04-29T12:34:56+00:00",
  "oldp_version": "0.9.13",
  "filters": {
    "review_status": "accepted",
    "include_lawbook_revisions": false
  },
  "files": {
    "cases.jsonl.gz": {"row_count": 318442},
    "laws.jsonl.gz": {"row_count": 47821},
    "law_books.jsonl.gz": {"row_count": 612},
    "courts.jsonl.gz": {"row_count": 1284}
  }
}

Downstream consumers should pin to a specific snapshot_date so that benchmark scores and analyses remain reproducible as the OLDP database grows.
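The manifest also allows a quick integrity check after download. The sketch below compares each row_count in the files section with the number of lines actually present in the corresponding file. It relies only on the manifest fields shown above; the function names are illustrative.

```python
import gzip
import json
from pathlib import Path


def count_rows(path: Path) -> int:
    """Count non-empty lines in a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def row_count_mismatches(snapshot_dir: Path) -> dict:
    """Map file name -> (expected, actual) for every file whose count differs.

    An empty dict means every file listed in manifest.json matches.
    """
    manifest = json.loads(
        (snapshot_dir / "manifest.json").read_text(encoding="utf-8")
    )
    mismatches = {}
    for name, meta in manifest["files"].items():
        actual = count_rows(snapshot_dir / name)
        if actual != meta["row_count"]:
            mismatches[name] = (meta["row_count"], actual)
    return mismatches
```

Recording the manifest's snapshot_date next to any derived result makes it possible to trace that result back to the exact dump it was computed from.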

Citation-friendly fields

Some serializers denormalize fields that would otherwise require joining multiple JSONL files:

  • laws.jsonl.gz records carry both book_code (the parent LawBook.code, e.g., "BGB") and book_slug (e.g., "bgb") so consumers can build either a (book_code, slug) -> law_id index for citation parsing or a stable book_slug/law_slug canonical key (e.g., bgb/823) without loading law_books.jsonl.gz. Slug-based keys are recommended as the canonical target identifier in benchmarks because numeric id values are not stable across OLDP database rebuilds.

  • courts.jsonl.gz records carry both code (canonical, ECLI-derived) and aliases (newline-separated alternative names like “BGH” / “Bundesgerichtshof”) for resolving citations to a court_id.
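The lookups described in this list could be built roughly as follows. This is a sketch under assumptions: it presumes that law records expose id and slug alongside book_code and book_slug, and that court records expose id alongside code and aliases. Case-insensitive alias matching is a choice made here, not documented OLDP behavior, and the sample records are invented for illustration.

```python
from typing import Iterable


def build_law_index(laws: Iterable[dict]) -> tuple[dict, dict]:
    """Return ((book_code, slug) -> law id, law id -> "book_slug/law_slug").

    Assumes a default (latest-only) dump, so each (book_code, slug) pair
    is unique; with historical revisions later rows would overwrite earlier ones.
    """
    by_citation = {}
    canonical_key = {}
    for law in laws:
        by_citation[(law["book_code"], law["slug"])] = law["id"]
        canonical_key[law["id"]] = f'{law["book_slug"]}/{law["slug"]}'
    return by_citation, canonical_key


def build_court_index(courts: Iterable[dict]) -> dict:
    """Map the case-folded code and each alias to the court id."""
    index = {}
    for court in courts:
        # aliases is newline-separated; treat a missing or null value as empty.
        names = [court["code"], *(court.get("aliases") or "").splitlines()]
        for name in names:
            name = name.strip()
            if name:
                # setdefault keeps the first court seen for an ambiguous name.
                index.setdefault(name.casefold(), court["id"])
    return index


# Invented sample records, shaped after the fields described above.
laws = [{"id": 10, "slug": "823", "book_code": "BGB", "book_slug": "bgb"}]
courts = [{"id": 1, "code": "BGH", "aliases": "BGH\nBundesgerichtshof"}]

by_citation, canonical_key = build_law_index(laws)
court_index = build_court_index(courts)
```

With these indexes, a parsed citation such as ("BGB", "823") resolves to a law id, which in turn maps to the rebuild-stable key bgb/823.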

Downstream consumers

  • oldp-toolkit reads cases.jsonl.gz (auto-detects the .gz suffix), converts HTML to Markdown, extracts inline reference markers, and publishes the result as a HuggingFace parquet dataset.