arXiv OCR with Chandra OCR 2 (Merged Buckets)

This bucket merges the completed production shard buckets for the full missing-paper sweep.

Summary

Output bucket: nielsr/arxiv-chandra-ocr-full-merged-20260406
Source shard buckets merged: 16
Processed IDs recorded in state/processed_ids.txt: 27,584
Successes: 27,490
Partial successes: 0
Errors: 94
Global part files: 2,770
Updated at: 2026-04-06T10:38:44.915699+00:00
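The counters above can be cross-checked with a little arithmetic: successes, partial successes, and errors should add up to the number of processed IDs. The sketch below uses the numbers from this README; the dictionary keys are illustrative only and are not claimed to match the actual keys in state/summary.json.

```python
# Counters copied from the Summary list above. The dictionary keys are
# illustrative; the real key names in state/summary.json may differ.
summary = {
    "processed": 27_584,
    "successes": 27_490,
    "partial_successes": 0,
    "errors": 94,
}

# Every processed ID should fall into exactly one outcome bucket.
outcomes = summary["successes"] + summary["partial_successes"] + summary["errors"]
counts_agree = outcomes == summary["processed"]
error_rate = summary["errors"] / summary["processed"]

print(f"counts agree: {counts_agree}")
print(f"error rate: {error_rate:.2%}")  # roughly 0.34%
```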

Files

data/part-*.jsonl.gz: merged OCR result shards with global part numbering
state/processed_ids.txt: completed paper IDs used for resume
state/summary.json: aggregate counters and bookkeeping
shards/manifest.json: provenance for the original shard buckets and their merged part ranges
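As a minimal sketch of how the merged shards might be read, the snippet below iterates over locally downloaded part files. It assumes each data/part-*.jsonl.gz file is gzip-compressed JSON Lines (one JSON object per line) and that the files have already been copied to a local directory; the function name iter_records and the directory argument are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(data_dir: str) -> Iterator[dict]:
    """Yield one dict per JSON line from every part-*.jsonl.gz file in data_dir."""
    # Lexicographic sort; this matches numeric part order only if the
    # part numbers are zero-padded, which is assumed here.
    for part in sorted(Path(data_dir).glob("part-*.jsonl.gz")):
        with gzip.open(part, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines
                    yield json.loads(line)
```

Because the function is a generator, only one record is held in memory at a time, which matters when walking thousands of part files.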

Each paper record includes:

num_pages: total number of pages in the source PDF
num_pages_processed: number of pages actually sent to OCR
pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
max_pages_per_paper: configured OCR page cap for the run
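The page fields make it possible to tell whether a paper was only partially OCRed. The following sketch assumes the fields are integers (and a boolean for pdf_exceeds_page_limit) with the meanings listed above; the helper names are hypothetical.

```python
def pages_skipped(record: dict) -> int:
    """Number of source pages that were not sent to OCR."""
    return max(0, record["num_pages"] - record["num_pages_processed"])


def is_truncated(record: dict) -> bool:
    """True if the PDF hit the page cap or some pages were otherwise skipped."""
    return bool(record["pdf_exceeds_page_limit"]) or pages_skipped(record) > 0


# Made-up example record, for illustration only.
example = {
    "num_pages": 120,
    "num_pages_processed": 50,
    "pdf_exceeds_page_limit": True,
    "max_pages_per_paper": 50,
}
print(pages_skipped(example), is_truncated(example))
```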

This merged bucket was assembled from the outputs of completed, bucket-only HF Jobs runs. Resuming stays safe because the merged state/* files record every paper ID that has already been processed.
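A resume step built on state/processed_ids.txt could look like the sketch below. It assumes the file holds one paper ID per line and has been downloaded locally; the function names and the candidate list are hypothetical, and this is not necessarily how the original jobs implemented resume.

```python
from typing import Iterable, List, Set


def load_processed_ids(path: str) -> Set[str]:
    """Read one paper ID per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def remaining_ids(candidates: Iterable[str], processed: Set[str]) -> List[str]:
    """Keep only the candidate IDs that have not been processed yet, in order."""
    return [paper_id for paper_id in candidates if paper_id not in processed]
```

Holding the processed IDs in a set keeps each membership check constant time, so filtering a large candidate list stays cheap.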
