Compare to LightOnOCR-2-1B
skipping a comparison with the only SOTA 1B OCR model? That’s… embarrassing.
LightOnOCR-2-1B's performance on OmniDocBench still lags behind the previous SOTA, as noted in the original paper.
You could post LightOnOCR-2-1B's OmniDocBench v1.5 results instead of asking others to provide the comparison.
We explained in the paper why OmniDocBench is not well suited: it covers only Chinese and English documents, and it uses edit distance, which is sensitive to formatting choices (e.g. for math, italics, emphasis, tables, reading order, etc.).
Plus, even if it lags behind, it doesn't make sense to leave it out of the comparison: it was released a week ago, is SOTA on OlmOCR-bench, multilingual (not just Chinese and English), end-to-end, only 1B parameters, and 1.73× faster than DeepSeek-OCR.
Even though OmniDocBench is not well suited to our case due to the issues mentioned above, we still included it for reference 😁
A benchmark better suited to OCR with VLMs is OlmOCR-bench, which covers multiple languages (not just Chinese and English) and uses binary unit tests to evaluate various aspects of the OCR output (formulas, table cell positions, etc.), much closer to how real-world extraction is evaluated.
Since OlmOCR-bench is not reported in the paper, I am currently evaluating DeepSeek-OCR-2 on it to see how it does!
Here are the results on OlmOCR-bench (w/o headers and footers).
Code used to run the model: https://gist.github.com/staghado/e0834f1afd105459030bac5ff8385ad1
Feedback welcome if I did something wrong while running the model!
@staghado While OlmOCR-bench is a solid evaluation set, it struggles with quite a few ambiguities. This often leads to "false negatives" where the model predicts the correct answer but is still flagged as incorrect. If you're interested, I'd love to dive deeper into this with you; it could be a great way to contribute to more objective OCR evaluation.
@ChengCui agreed that false negatives can happen; "all benchmarks are wrong, but some are useful". Compared to edit distance against a single "truth", unit tests are less brittle for OCR's many valid renderings: they avoid rewarding stripped-down, no-formatting outputs and penalising opinionated formats. They try to check properties like reading order, header/footer handling, math formula rendering, and table structure (e.g., expected row/column headers and specific cell values/relationships), which is very close to how a human would evaluate it.
Happy to discuss how to make OCR evaluation better as the current landscape mostly hurts progress.
@staghado I completely agree with your point. A major limitation of OlmOCR-bench is its requirement for strict string consistency, which is often impractical in real-world scenarios. For instance, a single mathematical formula can be represented by multiple valid LaTeX strings; in OlmOCR’s evaluation framework, these syntactic variations often lead to false negatives, flagging correct predictions as errors. This is why I prefer the CDM-based evaluation used in OmniDocBench, as it prioritizes visual structural features over literal string alignment.
That said, even a benchmark like OmniDocBench still has many challenges to address. If this is an area you're interested in, perhaps we could collaborate and work on something together; it would be a great way to contribute to more objective and robust OCR evaluation.
@staghado Along with the LaTeX issues in OlmOCR-bench, it also explicitly penalises predicting footers and watermarks, which should not be the case for an image-to-markdown model. We wrote a blog on all these issues: https://nanonets.com/blog/evaluating-ocr-to-markdown-systems-is-fundamentally-broken-and-why-thats-hard-to-fix/
Some subsets of OlmOCR-bench are good, like table and multi-column, but results on subsets like math, old scans math, headers & footers, and ArXiv are not of much use.