---
license: apache-2.0
license_link: LICENSE
language:
- multilingual
tags:
- vision-language
- ocr
- document-intelligence
- qianfan
pipeline_tag: image-text-to-text
library_name: transformers
model-index:
- name: Qianfan-OCR
results:
- task:
type: document-parsing
name: Document Parsing
dataset:
name: OmniDocBench v1.5
type: opendatalab/OmniDocBench
metrics:
- type: overall
value: 93.12
name: Overall Score
- task:
type: ocr
name: OCR
dataset:
name: OlmOCR Bench
type: allenai/olmOCR-bench
metrics:
- type: accuracy
value: 79.8
name: Overall Score
- task:
type: ocr
name: OCR
dataset:
name: OCRBench
type: echo840/OCRBench
metrics:
- type: accuracy
value: 880
name: Score
---
<div align="center">
<h1>Qianfan-OCR</h1>
<h3>A Unified End-to-End Model for Document Intelligence</h3>
[**🤖 Demo**](https://huggingface.co/spaces/baidu/Qianfan-OCR-Demo) |
[**📄 Technical Report**](https://arxiv.org/abs/2603.13398) |
[**🖥️ Qianfan Platform**](https://cloud.baidu.com/product-s/qianfan_home) |
[**💻 GitHub**](https://github.com/baidubce/Qianfan-VL) |
[**🧩 Skill**](https://github.com/baidubce/skills/tree/develop/skills/qianfanocr-document-intelligence)
</div>
## Introduction
**Qianfan-OCR** is a **4B-parameter end-to-end document intelligence model** developed by the Baidu Qianfan Team. It unifies document parsing, layout analysis, and document understanding within a single vision-language architecture.
Unlike traditional multi-stage OCR pipelines that chain separate layout detection, text recognition, and language comprehension modules, Qianfan-OCR performs **direct image-to-Markdown conversion** and supports a broad range of prompt-driven tasks — from structured document parsing and table extraction to chart understanding, document question answering, and key information extraction — all within one model.
### Key Highlights
- 🏆 **#1 End-to-End Model on OmniDocBench v1.5** — Achieves **93.12** overall score, surpassing DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33), and all other end-to-end models
- 🏆 **#1 End-to-End Model on OlmOCR Bench** — Scores **79.8**
- 🏆 **#1 on Key Information Extraction** — Overall mean score of **87.9** across five public KIE benchmarks, surpassing Gemini-3.1-Pro, Gemini-3-Pro, Seed-2.0, and Qwen3-VL-235B-A22B
- 🧠 **Layout-as-Thought** — An innovative optional thinking phase that recovers explicit layout analysis within the end-to-end paradigm via `<think>` tokens
- 🌍 **192 Languages** — Multilingual OCR support across diverse scripts
- ⚡ **Efficient Deployment** — Achieves **1.024 PPS** (pages per second) with W8A8 quantization on a single A100 GPU
## Architecture
Qianfan-OCR adopts the multimodal bridging architecture from [Qianfan-VL](https://arxiv.org/abs/2509.18189), consisting of three core components:
| Component | Details |
|---|---|
| **Vision Encoder** | Qianfan-ViT, 24 Transformer layers, AnyResolution design (up to 4K), 256 visual tokens per 448×448 tile, max 4,096 tokens per image |
| **Language Model** | Qwen3-4B (3.6B non-embedding), 36 layers, 2560 hidden dim, GQA (32 query / 8 KV heads), 32K context (extendable to 131K) |
| **Cross-Modal Adapter** | 2-layer MLP with GELU activation, projecting from 1024-dim to 2560-dim |
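As a rough sketch of the visual token budget implied by the table above: each 448×448 tile costs 256 tokens, capped at 4,096 tokens per image. The actual tiler picks a best-fit aspect-ratio grid (see the Quick Start code), so this ceiling-based estimate is an approximation, not the official implementation:

```python
import math

TILE_SIZE = 448        # tile side length from the table above
TOKENS_PER_TILE = 256  # visual tokens per 448x448 tile
MAX_TOKENS = 4096      # per-image cap from the table above

def visual_token_budget(width, height):
    """Rough estimate of (tiles, visual tokens) for a width x height page.
    Uses a simple ceiling grid; the real tiler picks a best-fit grid."""
    cols = max(1, math.ceil(width / TILE_SIZE))
    rows = max(1, math.ceil(height / TILE_SIZE))
    tiles = cols * rows
    tokens = min(tiles * TOKENS_PER_TILE, MAX_TOKENS)
    return tiles, tokens
```

For example, a 1240×1754 scan (A4 at 150 DPI) maps to a 3×4 grid of 12 tiles, i.e. 3,072 visual tokens, comfortably under the 4,096-token cap.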
### Layout-as-Thought
A key innovation is **Layout-as-Thought**: an optional thinking phase triggered by `<think>` tokens, where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final outputs.
This mechanism serves two purposes:
1. **Functional**: Recovers layout analysis capability within the end-to-end paradigm — users obtain structured layout results directly
2. **Enhancement**: Provides targeted accuracy improvements on documents with complex layouts, cluttered elements, or non-standard reading orders
> **When to use**: Enable thinking for heterogeneous pages with mixed element types (exam papers, technical reports, newspapers). Disable for homogeneous documents (single-column text, simple forms) for better results and lower latency.
## Benchmark Results
### OmniDocBench v1.5 (Document Parsing)
| Model | Type | Overall ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Table TEDS-S ↑ | Reading-Order Edit ↓ |
|---|---|---|---|---|---|---|---|
| **Qianfan-OCR (Ours)** | End-to-end | **93.12** | **0.041** | **92.43** | **91.02** | **93.85** | **0.049** |
| DeepSeek-OCR-v2 | End-to-end | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| Gemini-3 Pro | End-to-end | 90.33 | 0.065 | 89.18 | 88.28 | 90.29 | 0.071 |
| Qwen3-VL-235B | End-to-end | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| dots.ocr | End-to-end | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| PaddleOCR-VL 1.5 | Pipeline | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |
### General OCR Benchmarks
| Model | OCRBench | OCRBenchv2 (en/zh) | CCOCR-multilan | CCOCR-overall |
|---|---|---|---|---|
| **Qianfan-OCR (Ours)** | **880** | 56.0 / **60.77** | **76.7** | **79.3** |
| Qwen3-VL-4B | 873 | **60.68** / 59.13 | 74.2 | 76.5 |
| MonkeyOCR | 655 | 21.78 / 38.91 | 43.8 | 35.2 |
| DeepSeek-OCR | 459 | 15.98 / 38.31 | 32.5 | 27.6 |
### Document Understanding
| Benchmark | Qianfan-OCR | Qwen3-VL-4B | Qwen3-VL-2B |
|---|---|---|---|
| DocVQA | 92.8 | **94.9** | 92.7 |
| CharXiv_DQ | **94.0** | 81.8 | 69.7 |
| CharXiv_RQ | **85.2** | 48.5 | 41.3 |
| ChartQA | **88.1** | 83.3 | 78.3 |
| ChartQAPro | **42.9** | 36.2 | 24.5 |
| ChartBench | **85.9** | 74.9 | 73.2 |
| TextVQA | 80.0 | **81.8** | 79.9 |
| OCRVQA | **66.8** | 64.7 | 59.3 |
> 💡 Two-stage OCR+LLM systems score **0.0** on CharXiv (both DQ and RQ), demonstrating that chart structures discarded during text extraction are essential for reasoning.
### Key Information Extraction (KIE)
| Model | Overall | OCRBench KIE | OCRBenchv2 KIE (en) | OCRBenchv2 KIE (zh) | CCOCR KIE | Nanonets KIE (F1) |
|---|---|---|---|---|---|---|
| **Qianfan-OCR (Ours)** | **87.9** | 95.0 | 82.8 | **82.3** | 92.8 | **86.5** |
| Qwen3-VL-235B-A22B | 84.2 | 94.0 | 85.6 | 62.9 | **95.1** | 83.8 |
| Qwen3-4B-VL | 83.5 | 89.0 | 82.1 | 71.3 | 91.6 | 83.3 |
| Gemini-3.1-Pro | 79.2 | **96.0** | **87.8** | 63.4 | 72.5 | 76.1 |
### Inference Throughput
| Model | PPS (pages/sec) |
|---|---|
| **Qianfan-OCR (W8A8)** | **1.024** |
| Qianfan-OCR (W16A16) | 0.503 |
| MinerU 2.5 | 1.057 |
| MonkeyOCR-pro-1.2B | 0.673 |
| Dots OCR | 0.352 |
*All benchmarks on a single NVIDIA A100 GPU with vLLM 0.10.2.*
## Supported Tasks
Qianfan-OCR supports a comprehensive set of document intelligence tasks through prompt-driven control:
| Task Category | Specific Tasks |
|---|---|
| **Document Parsing** | Image-to-Markdown conversion, multi-page parsing, structured output (JSON/HTML) |
| **Layout Analysis** | Bounding box detection, element type classification (25 categories), reading order |
| **Table Recognition** | Complex table extraction (merged cells, rotated tables), HTML output |
| **Formula Recognition** | Inline and display math formulas, LaTeX output |
| **Chart Understanding** | Chart QA, trend analysis, data extraction from various chart types |
| **Key Information Extraction** | Receipts, invoices, certificates, medical records, ID cards |
| **Handwriting Recognition** | Chinese and English handwritten text |
| **Scene Text Recognition** | Street signs, product labels, natural scene text |
| **Multilingual OCR** | 192 languages including Latin, Cyrillic, Arabic, South/Southeast Asian, CJK scripts |
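To make the prompt-driven control concrete, here are a few example prompts per task category. Only the first two appear elsewhere in this card; the rest are hypothetical phrasings, not a confirmed prompt specification — adjust wording to your task:

```python
# Illustrative prompts for a few task categories. The parsing prompts come
# from this card's Quick Start examples; the others are assumed phrasings.
EXAMPLE_PROMPTS = {
    "document_parsing": "Parse this document to Markdown.",
    "document_parsing_with_layout": "Parse this document to Markdown.<think>",
    "table_recognition": "Extract the table in this image as HTML.",        # hypothetical
    "formula_recognition": "Transcribe the formula in this image as LaTeX.",  # hypothetical
    "chart_understanding": "What trend does this chart show?",              # hypothetical
}
```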
## Quick Start
### Basic Usage
```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # break ties in favor of grids that cover more of the image area
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # enumerate candidate tile grids (columns x rows) within the tile budget
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the grid whose aspect ratio is closest to the input image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image to the grid size
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # crop out one image_size x image_size tile
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-OCR"
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./Qianfan-OCR/examples/document.png").to(torch.bfloat16).to(model.device)

# Inference
prompt = "Parse this document to Markdown."
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 16384}
    )
print(response)
```
### With Layout-as-Thought (Thinking Mode)
```python
# Enable Layout-as-Thought by appending the <think> token to the query
pixel_values = load_image("./Qianfan-OCR/examples/complex_document.jpg").to(torch.bfloat16).to(model.device)
prompt = "Parse this document to Markdown.<think>"

with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 16384}
    )
print(response)
# The model first generates a structured layout analysis, then produces the final output
```
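If you want to handle the layout analysis and the final Markdown separately, a small helper like the following can split the response. It assumes the thinking segment is delimited by `<think>...</think>` in the generated text — an assumption about the output format, so verify against actual model output:

```python
import re

def split_thinking(response: str):
    """Split an assumed '<think>...</think>final' response into
    (layout_analysis, final_output). Returns (None, response) when no
    think segment is present. The delimiter format is an assumption."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not m:
        return None, response
    return m.group(1).strip(), response[m.end():].strip()
```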
### Key Information Extraction
```python
pixel_values = load_image("./Qianfan-OCR/examples/invoice.jpg").to(torch.bfloat16).to(model.device)
# i.e., "Please extract the following fields from the image: name, date,
# total amount. Output in standard JSON format."
prompt = "请从图片中提取以下字段信息:姓名、日期、总金额。使用标准JSON格式输出。"

with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 16384}
    )
print(response)
```
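When prompted for JSON, vision-language models sometimes return the object bare and sometimes wrap it in a Markdown code fence. A best-effort parser like this sketch (not part of the model's API) can normalize both cases:

```python
import json
import re

def parse_kie_json(response: str):
    """Best-effort extraction of a JSON object from model output, which may
    be bare JSON or wrapped in a ```json fence. Raises ValueError if no
    object is found."""
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", response, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else response
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(candidate[start:end + 1])
```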
### vLLM Deployment
```bash
# Serve with vLLM for high-throughput inference
vllm serve baidu/Qianfan-OCR --trust-remote-code
```
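Once served, vLLM exposes an OpenAI-compatible endpoint (by default `http://localhost:8000/v1/chat/completions`). The sketch below builds a request payload in the OpenAI chat-completions vision format that vLLM accepts, with the image inlined as a base64 data URL; the endpoint URL and `max_tokens` value are assumptions to adapt to your deployment:

```python
import base64

def build_request(image_bytes: bytes, prompt: str, model: str = "baidu/Qianfan-OCR"):
    """Build an OpenAI-style chat-completions payload with an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 16384,
    }
```

Send it with any HTTP client, e.g. `requests.post("http://localhost:8000/v1/chat/completions", json=build_request(open("document.png", "rb").read(), "Parse this document to Markdown."))`.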
## Skill
We provide a [Qianfan OCR Document Intelligence](https://github.com/baidubce/skills/tree/develop/skills/qianfanocr-document-intelligence) skill for image and PDF understanding workflows.
It can be used with OpenClaw, Claude Code, Codex, and other assistants that support this skill format.
This skill packages reusable instructions, scripts, and references so the agent can automatically apply Qianfan-powered document intelligence to tasks such as:
- document parsing to Markdown
- layout analysis
- element recognition
- general OCR
- key information extraction
- chart understanding
- document VQA
The skill is designed for visual understanding tasks over images and PDFs, and includes the execution flow needed to prepare inputs, choose the right analysis mode, and call the bundled CLI tools.
## Citation
```bibtex
@misc{dong2026qianfanocrunifiedendtoendmodel,
title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
year={2026},
eprint={2603.13398},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.13398},
}
```
## Acknowledgments
We thank the Baidu AI Cloud team for infrastructure support, the Baige and Kunlun teams for AI infrastructure assistance, and all contributors to the Qianfan platform.
## License
This project is licensed under the Apache License 2.0. See `LICENSE` for the
full license text.
Some bundled third-party source files are licensed under the MIT License. See
`NOTICE` for the file list and corresponding attribution details.