arXiv OCR with Chandra OCR 2 (Merged Buckets)
This bucket merges the completed production shard buckets for the full missing-paper sweep.
Summary
- Output bucket: nielsr/arxiv-chandra-ocr-full-merged-20260406
- Source shard buckets merged: 16
- Processed IDs recorded in state/processed_ids.txt: 27,584
- Successes: 27,490
- Partial successes: 0
- Errors: 94
- Global part files: 2,770
- Updated at: 2026-04-06T10:38:44.915699+00:00
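The counters above can be cross-checked with a little arithmetic. The sketch below assumes that successes, partial successes, and errors together account for every processed ID; the README does not state this explicitly, but the published numbers are consistent with it.

```python
# Counters copied from the summary above.
processed = 27_584
successes = 27_490
partial_successes = 0
errors = 94


def counters_consistent(processed: int, successes: int, partial: int, errors: int) -> bool:
    """True when the three outcome counters add up to the processed total.

    Assumption: each processed ID lands in exactly one outcome category.
    """
    return successes + partial + errors == processed


# Share of processed IDs that ended in an error.
error_rate = errors / processed

print(counters_consistent(processed, successes, partial_successes, errors))
print(f"error rate: {error_rate:.2%}")
```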
Files
- data/part-*.jsonl.gz: merged OCR result shards with global part numbering
- state/processed_ids.txt: completed paper IDs used for resume
- state/summary.json: aggregate counters and bookkeeping
- shards/manifest.json: provenance for the original shard buckets and their merged part ranges
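A minimal sketch of reading the merged part files, assuming they have already been downloaded to a local directory and that each line holds one JSON object (as the .jsonl.gz extension suggests). The helper name `iter_records` and the local directory layout are illustrative, not part of the bucket.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(data_dir: str) -> Iterator[dict]:
    """Yield one dict per record from every part-*.jsonl.gz file in data_dir."""
    # Sorting by file name matches the global part numbering only if the
    # part numbers are zero-padded to a fixed width (an assumption).
    for part in sorted(Path(data_dir).glob("part-*.jsonl.gz")):
        with gzip.open(part, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines
                    yield json.loads(line)
```

Streaming the records keeps memory use flat, which matters with thousands of part files; call `list(iter_records("data"))` only if the whole set fits in memory.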
Each paper record includes:
- num_pages: total number of pages in the source PDF
- num_pages_processed: number of pages actually sent to OCR
- pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
- max_pages_per_paper: configured OCR page cap for the run
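The page fields make it possible to tell how much of each paper was actually OCRed. The sketch below assumes the two page counts are integers with num_pages_processed never exceeding num_pages; the helper names and the sample record values are made up for illustration.

```python
def page_coverage(record: dict) -> float:
    """Fraction of the source PDF's pages that were sent to OCR."""
    total = record["num_pages"]
    if total <= 0:
        return 0.0  # avoid dividing by zero on empty or unreadable PDFs
    return record["num_pages_processed"] / total


def is_truncated(record: dict) -> bool:
    """True when fewer pages were processed than the PDF contains."""
    return record["num_pages_processed"] < record["num_pages"]


# Hypothetical record; the values are not taken from the bucket.
example = {
    "num_pages": 40,
    "num_pages_processed": 25,
    "pdf_exceeds_page_limit": True,
    "max_pages_per_paper": 25,
}
print(page_coverage(example), is_truncated(example))
```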
This merged bucket was assembled from completed bucket-only HF Jobs outputs and preserves resume safety through the merged state/* files.
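Resuming from the merged state can be sketched as a set difference against state/processed_ids.txt. This assumes the file lists one paper ID per line, which the README does not specify; `load_processed_ids` and `remaining_ids` are hypothetical helpers, not the code used by the original jobs.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_processed_ids(path: str) -> Set[str]:
    """Read completed paper IDs, assuming one ID per line."""
    p = Path(path)
    if not p.exists():
        return set()  # nothing processed yet
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remaining_ids(candidates: Iterable[str], processed: Set[str]) -> List[str]:
    """Keep only the candidate IDs not yet processed, preserving order."""
    return [pid for pid in candidates if pid not in processed]
```

A follow-up run would pass its full candidate list through `remaining_ids` before submitting work, so papers already recorded in the state file are skipped.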