# Data Dumps & Bulk Downloads

For bulk access to OLDP data, the `dump_api_data` management command exports
every public API resource to gzipped JSONL files, accompanied by a
`manifest.json` describing the snapshot. This is the canonical way OLDP
publishes its data — including the [HuggingFace dataset
`openlegaldata/court-decisions-germany`](https://huggingface.co/datasets/openlegaldata/court-decisions-germany)
(produced from a dump via [`oldp-toolkit`](https://github.com/openlegaldata/oldp-toolkit)).

## Usage

```bash
./manage.py dump_api_data ./workingdir/snapshot-2026-04-29 --override
```

Arguments:

- `output` (positional) — directory path relative to `WORKING_DIR`.
- `--override` — replace an existing output directory.
- `--limit N` — cap the number of records per resource (default `0` = unlimited).
- `--include-lawbook-revisions` — include every historical `LawBook` revision
  (and its child `Law` rows) in the dump. By default only books with
  `latest=True` are exported; see [Latest-only LawBooks](#latest-only-lawbooks).

## Output layout

```
snapshot-2026-04-29/
├── cases.jsonl.gz
├── courts.jsonl.gz
├── laws.jsonl.gz
├── law_books.jsonl.gz
├── cities.jsonl.gz
├── states.jsonl.gz
├── countries.jsonl.gz
├── annotation_labels.jsonl.gz
├── case_annotations.jsonl.gz
├── case_markers.jsonl.gz
└── manifest.json
```

Each `*.jsonl.gz` contains one record per line, serialized using the same
serializer the public REST API uses, so the dump shape mirrors the API.
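
Because every data file is gzipped JSONL, it can be streamed with the Python standard library alone. The following is a minimal sketch; the helper name `iter_records` is hypothetical and not part of OLDP.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one decoded record per non-empty line of a gzipped JSONL file.

    Hypothetical helper for illustration; not shipped with OLDP.
    """
    # Text mode ("rt") decompresses and decodes on the fly, so the file is
    # never loaded into memory as a whole.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `for court in iter_records(Path("snapshot-2026-04-29/courts.jsonl.gz")): ...` iterates over all exported courts one record at a time.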

## Snapshot contract

The dump command guarantees the following properties:

### Always-accepted filter

Exported `Case`, `Law`, `LawBook`, and `Court` records always have
`review_status == "accepted"`. Pending or declined records are never written
to a dump and therefore never reach published artifacts.

### Stable ordering

Records are written in ascending primary-key order. Given the same database
state, two consecutive dump runs produce byte-identical data files
(`manifest.json` still differs in its `snapshot_date` timestamp). This makes
downstream snapshots (HuggingFace dataset versions, benchmark resolution
indices) reproducible.
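
A consumer that wants to check these properties on its own copies could use something like the sketch below. It assumes each record exposes its primary key as an `id` field, and both helper names are hypothetical. Hashing the decompressed content rather than the `.gz` file itself is a deliberate choice here: it compares the records regardless of any metadata stored in the gzip container.

```python
import gzip
import hashlib
import json
from pathlib import Path


def content_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a gzipped file's decompressed content."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # Read in 1 MiB chunks to keep memory use flat on large dumps.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_ascending_by_id(path: Path) -> bool:
    """Check that records appear in strictly ascending `id` order.

    Assumption: the primary key is serialized as the `id` field.
    """
    previous = None
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            current = json.loads(line)["id"]
            if previous is not None and current <= previous:
                return False
            previous = current
    return True
```

Two dumps of the same database state should then yield equal `content_digest` values for each data file.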

### Latest-only LawBooks

By default, only `LawBook` rows with `latest=True` are exported, and
`laws.jsonl.gz` is filtered to laws whose parent book is the latest
revision. This matches what the public website serves and what most
downstream consumers (citation matching, search indexing, the public HF
dataset) actually want.

For archival or law-revision-history use cases, pass
`--include-lawbook-revisions` to export every historical revision and its
laws. The active filter setting is recorded in `manifest.json`'s
`filters.include_lawbook_revisions` field.
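
A consumer that only expects latest revisions can read this field before loading `laws.jsonl.gz`. This is a minimal sketch with a hypothetical helper name; it relies only on the manifest layout documented on this page.

```python
import json
from pathlib import Path


def includes_lawbook_revisions(snapshot_dir: Path) -> bool:
    """Return True if the snapshot contains historical LawBook revisions."""
    manifest_path = snapshot_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # The field mirrors the --include-lawbook-revisions flag of the dump run.
    return bool(manifest["filters"]["include_lawbook_revisions"])
```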

### Self-describing manifest

`manifest.json` records the snapshot identity:

```json
{
  "snapshot_date": "2026-04-29T12:34:56+00:00",
  "oldp_version": "0.9.13",
  "filters": {
    "review_status": "accepted",
    "include_lawbook_revisions": false
  },
  "files": {
    "cases.jsonl.gz": {"row_count": 318442},
    "laws.jsonl.gz": {"row_count": 47821},
    "law_books.jsonl.gz": {"row_count": 612},
    "courts.jsonl.gz": {"row_count": 1284}
  }
}
```

Downstream consumers should pin to a specific `snapshot_date` so that
benchmark scores and analyses remain reproducible as the OLDP database grows.
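
The per-file `row_count` values also allow a quick integrity check after downloading a snapshot. The sketch below compares the manifest against the actual line counts; the function names are hypothetical, and it assumes every file listed under `files` sits next to `manifest.json`.

```python
import gzip
import json
from pathlib import Path


def count_rows(path: Path) -> int:
    """Count non-empty lines (one record each) in a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_manifest(snapshot_dir: Path) -> dict:
    """Return {file_name: (expected, actual)} for every row-count mismatch.

    An empty dict means all files listed in the manifest match.
    """
    manifest_path = snapshot_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    mismatches = {}
    for name, info in manifest["files"].items():
        actual = count_rows(snapshot_dir / name)
        if actual != info["row_count"]:
            mismatches[name] = (info["row_count"], actual)
    return mismatches
```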

## Citation-friendly fields

Some serializers denormalize fields that would otherwise require joining
multiple JSONL files:

- **`laws.jsonl.gz`** records carry both `book_code` (the parent
  `LawBook.code`, e.g., `"BGB"`) and `book_slug` (e.g., `"bgb"`) so
  consumers can build either a `(book_code, slug) -> law_id` index for
  citation parsing or a stable `book_slug/law_slug` canonical key
  (e.g., `bgb/823`) without loading `law_books.jsonl.gz`. Slug-based keys
  are recommended as the canonical target identifier in benchmarks because
  numeric `id` values are not stable across OLDP database rebuilds.
- **`courts.jsonl.gz`** records carry both `code` (canonical, ECLI-derived)
  and `aliases` (newline-separated alternative names like "BGH" /
  "Bundesgerichtshof") for resolving citations to a `court_id`.
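
The two indices described above can be sketched as follows. The field names `book_code`, `book_slug`, `code`, and `aliases` come from this page; that law records expose `slug` and `id`, and court records expose `id`, is an assumption, as are the helper names. Case-insensitive alias matching is an illustrative choice, not OLDP behavior.

```python
from typing import Dict, Iterable, Tuple


def build_law_index(laws: Iterable[dict]) -> Dict[Tuple[str, str], int]:
    """Map (book_code, slug) to the law's numeric id for citation parsing."""
    # Assumption: each law record has `slug` and `id` fields.
    return {(law["book_code"], law["slug"]): law["id"] for law in laws}


def canonical_law_key(law: dict) -> str:
    """Build the slug-based key, e.g. "bgb/823", that survives rebuilds."""
    return f'{law["book_slug"]}/{law["slug"]}'


def build_court_alias_index(courts: Iterable[dict]) -> Dict[str, int]:
    """Map each court code and alias (case-folded) to the court's id."""
    index: Dict[str, int] = {}
    for court in courts:
        # `aliases` is newline-separated; tolerate a missing or empty value.
        names = [court["code"]] + (court.get("aliases") or "").splitlines()
        for name in names:
            name = name.strip()
            if name:
                # Keep the first court seen if two courts share a name.
                index.setdefault(name.casefold(), court["id"])
    return index
```

With such an index, a citation string like "Bundesgerichtshof" can be looked up via `index["Bundesgerichtshof".casefold()]`.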

## Downstream consumers

- [`oldp-toolkit`](https://github.com/openlegaldata/oldp-toolkit) reads
  `cases.jsonl.gz` (auto-detects the `.gz` suffix), converts HTML to
  Markdown, extracts inline reference markers, and publishes the result as
  a HuggingFace parquet dataset.