arXiv OCR with Chandra OCR 2 (Merged Buckets)

This bucket merges the completed production shard buckets for the full missing-paper sweep.

Summary

  • Output bucket: nielsr/arxiv-chandra-ocr-full-merged-20260406
  • Source shard buckets merged: 16
  • Processed IDs recorded in state/processed_ids.txt: 27,584
  • Successes: 27,490
  • Partial successes: 0
  • Errors: 94
  • Global part files: 2,770
  • Updated at: 2026-04-06T10:38:44.915699+00:00
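
The counters above should add up: every processed ID ends as a success, a partial success, or an error. A minimal sketch of that check, using the numbers from this summary (the function name is illustrative, not part of the pipeline):

```python
def counts_are_consistent(processed: int, successes: int, partial: int, errors: int) -> bool:
    """Return True when the outcome counters account for every processed ID."""
    return processed == successes + partial + errors


# Values copied from the Summary list above.
print(counts_are_consistent(processed=27_584, successes=27_490, partial=0, errors=94))
```

The same check could be run against state/summary.json after download, but the key names in that file are not documented here, so they are left out of the sketch.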

Files

  • data/part-*.jsonl.gz: merged OCR result shards with global part numbering
  • state/processed_ids.txt: IDs of completed papers, used to skip them when a run resumes
  • state/summary.json: aggregate counters and bookkeeping
  • shards/manifest.json: provenance for the original shard buckets and their merged part ranges
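
As a rough sketch of how the merged shards could be read, the snippet below streams records from every data/part-*.jsonl.gz file. It assumes the bucket's data/ directory has already been copied to local disk and that each line holds one JSON object; `iter_records` and `data_dir` are names chosen for illustration.

```python
import glob
import gzip
import json
from typing import Iterator


def iter_records(data_dir: str) -> Iterator[dict]:
    """Yield one dict per JSON line across all part files, in sorted filename order."""
    for path in sorted(glob.glob(f"{data_dir}/part-*.jsonl.gz")):
        # "rt" decompresses and decodes to text in one step.
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines defensively
                    yield json.loads(line)
```

Streaming with a generator keeps memory use flat, which matters with a few thousand part files.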

Each paper record includes:

  • num_pages: total number of pages in the source PDF
  • num_pages_processed: number of pages actually sent to OCR
  • pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
  • max_pages_per_paper: configured OCR page cap for the run
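
These four fields make it possible to tell how much of each paper was actually OCR'd. A small sketch, assuming the fields are stored as integers and a boolean as their descriptions suggest; the helper names and the example record are hypothetical:

```python
def page_coverage(record: dict) -> float:
    """Fraction of the PDF's pages that were sent to OCR (0.0 for an empty PDF)."""
    total = record["num_pages"]
    return record["num_pages_processed"] / total if total else 0.0


def is_truncated(record: dict) -> bool:
    """True when the PDF had more pages than the configured OCR cap."""
    return bool(record["pdf_exceeds_page_limit"])


# Hypothetical record; the values are invented for illustration only.
example = {
    "num_pages": 120,
    "num_pages_processed": 50,
    "pdf_exceeds_page_limit": True,
    "max_pages_per_paper": 50,
}
print(is_truncated(example), round(page_coverage(example), 3))
```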

This merged bucket was assembled from completed bucket-only HF Jobs outputs and preserves resume safety through the merged state/* files.
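
Resume safety here means a restarted run can skip papers already listed in state/processed_ids.txt. The sketch below shows one way that could look; it assumes the file holds one paper ID per line, and both function names are illustrative, not the pipeline's own.

```python
from pathlib import Path


def load_processed_ids(path: str) -> set[str]:
    """Read completed paper IDs; a missing file means nothing has been processed yet."""
    p = Path(path)
    if not p.exists():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_ids(candidates: list[str], processed: set[str]) -> list[str]:
    """Keep only candidate IDs that are not yet processed, preserving input order."""
    return [pid for pid in candidates if pid not in processed]
```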
