---
license: apache-2.0
license_link: LICENSE
language:
  - multilingual
tags:
  - vision-language
  - ocr
  - document-intelligence
  - qianfan
pipeline_tag: image-text-to-text
library_name: transformers
model-index:
  - name: Qianfan-OCR
    results:
      - task:
          type: document-parsing
          name: Document Parsing
        dataset:
          name: OmniDocBench v1.5
          type: opendatalab/OmniDocBench
        metrics:
          - type: overall
            value: 93.12
            name: Overall Score
      - task:
          type: ocr
          name: OCR
        dataset:
          name: OlmOCR Bench
          type: allenai/olmOCR-bench
        metrics:
          - type: accuracy
            value: 79.8
            name: Overall Score
      - task:
          type: ocr
          name: OCR
        dataset:
          name: OCRBench
          type: echo840/OCRBench
        metrics:
          - type: accuracy
            value: 880
            name: Score
---

<div align="center">

<h1>Qianfan-OCR</h1>

<h3>A Unified End-to-End Model for Document Intelligence</h3>

[**🤖 Demo**](https://huggingface.co/spaces/baidu/Qianfan-OCR-Demo) |
[**📄 Technical Report**](https://arxiv.org/abs/2603.13398) |
[**🖥️ Qianfan Platform**](https://cloud.baidu.com/product-s/qianfan_home) |
[**💻 GitHub**](https://github.com/baidubce/Qianfan-VL) |
[**🧩 Skill**](https://github.com/baidubce/skills/tree/develop/skills/qianfanocr-document-intelligence)

</div>

## Introduction

**Qianfan-OCR** is a **4B-parameter end-to-end document intelligence model** developed by the Baidu Qianfan Team. It unifies document parsing, layout analysis, and document understanding within a single vision-language architecture.

Unlike traditional multi-stage OCR pipelines that chain separate layout detection, text recognition, and language comprehension modules, Qianfan-OCR performs **direct image-to-Markdown conversion** and supports a broad range of prompt-driven tasks — from structured document parsing and table extraction to chart understanding, document question answering, and key information extraction — all within one model.

### Key Highlights

- 🏆 **#1 End-to-End Model on OmniDocBench v1.5** — Achieves **93.12** overall score, surpassing DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33), and all other end-to-end models
- 🏆 **#1 End-to-End Model on OlmOCR Bench** — Scores **79.8**
- 🏆 **#1 on Key Information Extraction** — Overall mean score of **87.9** across five public KIE benchmarks, surpassing Gemini-3.1-Pro, Gemini-3-Pro, Seed-2.0, and Qwen3-VL-235B-A22B
- 🧠 **Layout-as-Thought** — An innovative optional thinking phase that recovers explicit layout analysis within the end-to-end paradigm via `<think>` tokens
- 🌍 **192 Languages** — Multilingual OCR support across diverse scripts
- ⚡ **Efficient Deployment** — Achieves **1.024 PPS** (pages per second) with W8A8 quantization on a single A100 GPU

## Architecture

Qianfan-OCR adopts the multimodal bridging architecture from [Qianfan-VL](https://arxiv.org/abs/2509.18189), consisting of three core components:

| Component | Details |
|---|---|
| **Vision Encoder** | Qianfan-ViT, 24 Transformer layers, AnyResolution design (up to 4K), 256 visual tokens per 448×448 tile, max 4,096 tokens per image |
| **Language Model** | Qwen3-4B (3.6B non-embedding), 36 layers, 2560 hidden dim, GQA (32 query / 8 KV heads), 32K context (extendable to 131K) |
| **Cross-Modal Adapter** | 2-layer MLP with GELU activation, projecting from 1024-dim to 2560-dim |
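
The adapter is structurally just a small projector. Below is a minimal PyTorch sketch of a 2-layer GELU MLP with the dimensions from the table above; everything beyond those dimensions (e.g. normalization, initialization) is an assumption for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Illustrative 2-layer MLP projector (1024-dim -> 2560-dim) with GELU.

    Dimensions follow the architecture table; all other details here are
    assumptions, not the released Qianfan-OCR code.
    """
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual_tokens, vision_dim) from the ViT
        return self.proj(visual_tokens)
```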

### Layout-as-Thought

A key innovation is **Layout-as-Thought**: an optional thinking phase triggered by the `<think>` token, where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final outputs.

This mechanism serves two purposes:
1. **Functional**: Recovers layout analysis capability within the end-to-end paradigm — users obtain structured layout results directly
2. **Enhancement**: Provides targeted accuracy improvements on documents with complex layouts, cluttered elements, or non-standard reading orders

> **When to use**: Enable thinking for heterogeneous pages with mixed element types (exam papers, technical reports, newspapers). Disable for homogeneous documents (single-column text, simple forms) for better results and lower latency.

## Benchmark Results

### OmniDocBench v1.5 (Document Parsing)

| Model | Type | Overall ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Table TEDS-S ↑ | Reading Order Edit ↓ |
|---|---|---|---|---|---|---|---|
| **Qianfan-OCR (Ours)** | End-to-end | **93.12** | **0.041** | **92.43** | **91.02** | **93.85** | **0.049** |
| DeepSeek-OCR-v2 | End-to-end | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| Gemini-3 Pro | End-to-end | 90.33 | 0.065 | 89.18 | 88.28 | 90.29 | 0.071 |
| Qwen3-VL-235B | End-to-end | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| dots.ocr | End-to-end | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| PaddleOCR-VL 1.5 | Pipeline | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |

### General OCR Benchmarks

| Model | OCRBench | OCRBench v2 (en/zh) | CC-OCR (multi-lang) | CC-OCR (overall) |
|---|---|---|---|---|
| **Qianfan-OCR (Ours)** | **880** | 56.0 / **60.77** | **76.7** | **79.3** |
| Qwen3-VL-4B | 873 | **60.68** / 59.13 | 74.2 | 76.5 |
| MonkeyOCR | 655 | 21.78 / 38.91 | 43.8 | 35.2 |
| DeepSeek-OCR | 459 | 15.98 / 38.31 | 32.5 | 27.6 |

### Document Understanding

| Benchmark | Qianfan-OCR | Qwen3-VL-4B | Qwen3-VL-2B |
|---|---|---|---|
| DocVQA | 92.8 | **94.9** | 92.7 |
| CharXiv_DQ | **94.0** | 81.8 | 69.7 |
| CharXiv_RQ | **85.2** | 48.5 | 41.3 |
| ChartQA | **88.1** | 83.3 | 78.3 |
| ChartQAPro | **42.9** | 36.2 | 24.5 |
| ChartBench | **85.9** | 74.9 | 73.2 |
| TextVQA | 80.0 | **81.8** | 79.9 |
| OCRVQA | **66.8** | 64.7 | 59.3 |

> 💡 Two-stage OCR+LLM systems score **0.0** on CharXiv (both DQ and RQ), demonstrating that chart structures discarded during text extraction are essential for reasoning.

### Key Information Extraction (KIE)

| Model | Overall | OCRBench KIE | OCRBench v2 KIE (en) | OCRBench v2 KIE (zh) | CC-OCR KIE | Nanonets KIE (F1) |
|---|---|---|---|---|---|---|
| **Qianfan-OCR (Ours)** | **87.9** | 95.0 | 82.8 | **82.3** | 92.8 | **86.5** |
| Qwen3-VL-235B-A22B | 84.2 | 94.0 | 85.6 | 62.9 | **95.1** | 83.8 |
| Qwen3-4B-VL | 83.5 | 89.0 | 82.1 | 71.3 | 91.6 | 83.3 |
| Gemini-3.1-Pro | 79.2 | **96.0** | **87.8** | 63.4 | 72.5 | 76.1 |

### Inference Throughput

| Model | PPS (pages/sec) |
|---|---|
| **Qianfan-OCR (W8A8)** | **1.024** |
| Qianfan-OCR (W16A16) | 0.503 |
| MinerU 2.5 | 1.057 |
| MonkeyOCR-pro-1.2B | 0.673 |
| dots.ocr | 0.352 |

*All throughput numbers were measured on a single NVIDIA A100 GPU with vLLM 0.10.2.*

## Supported Tasks

Qianfan-OCR supports a comprehensive set of document intelligence tasks through prompt-driven control:

| Task Category | Specific Tasks |
|---|---|
| **Document Parsing** | Image-to-Markdown conversion, multi-page parsing, structured output (JSON/HTML) |
| **Layout Analysis** | Bounding box detection, element type classification (25 categories), reading order |
| **Table Recognition** | Complex table extraction (merged cells, rotated tables), HTML output |
| **Formula Recognition** | Inline and display math formulas, LaTeX output |
| **Chart Understanding** | Chart QA, trend analysis, data extraction from various chart types |
| **Key Information Extraction** | Receipts, invoices, certificates, medical records, ID cards |
| **Handwriting Recognition** | Chinese and English handwritten text |
| **Scene Text Recognition** | Street signs, product labels, natural scene text |
| **Multilingual OCR** | 192 languages including Latin, Cyrillic, Arabic, South/Southeast Asian, CJK scripts |

## Quick Start

### Basic Usage

```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (i x j) within the min/max block budget
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-OCR"
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./Qianfan-OCR/examples/document.png").to(torch.bfloat16).to(model.device)

# Inference
prompt = "Parse this document to Markdown."
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 16384}
    )
print(response)
```
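
### Multi-Page PDF Parsing

One simple way to handle multi-page documents is to rasterize the PDF and parse each page in turn. A minimal sketch using `pdf2image` (our choice for illustration; it requires poppler and is not bundled with the model), reusing `load_image`, `model`, and `tokenizer` from the snippet above:

```python
from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)

# Rasterize each PDF page to an image, then parse pages one by one
pages = convert_from_path("report.pdf", dpi=200)

results = []
for i, page in enumerate(pages):
    page_path = f"page_{i}.png"
    page.save(page_path)
    pixel_values = load_image(page_path).to(torch.bfloat16).to(model.device)
    with torch.no_grad():
        md = model.chat(
            tokenizer,
            pixel_values=pixel_values,
            question="Parse this document to Markdown.",
            generation_config={"max_new_tokens": 16384},
        )
    results.append(md)

full_markdown = "\n\n---\n\n".join(results)
print(full_markdown)
```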

### With Layout-as-Thought (Thinking Mode)

```python
# Enable Layout-as-Thought by appending the <think> token to the prompt

pixel_values = load_image("./Qianfan-OCR/examples/complex_document.jpg").to(torch.bfloat16).to(model.device)
prompt = "Parse this document to Markdown.<think>"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 16384}
    )
print(response)

# The model will first generate structured layout analysis, then produce the final output
```
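
If you want the layout analysis and the final Markdown as separate pieces, one option is to split the response on the thinking delimiters. The sketch below assumes the thinking phase is wrapped in `<think>...</think>` tags in the returned text; adjust the markers if the actual output format differs:

```python
import re

def split_thinking(response: str):
    """Best-effort split of a reply into (layout_analysis, final_answer).

    Assumes the thinking phase is delimited by <think>...</think>; if no
    such block is found, the whole reply is treated as the final answer.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return None, response.strip()
    layout = match.group(1).strip()
    answer = (response[:match.start()] + response[match.end():]).strip()
    return layout, answer

layout, markdown = split_thinking(response)
if layout:
    print("Layout analysis:\n", layout)
print("Final output:\n", markdown)
```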

### Key Information Extraction

```python
pixel_values = load_image("./Qianfan-OCR/examples/invoice.jpg").to(torch.bfloat16).to(model.device)
# Prompt (Chinese): "Extract the following fields from the image: name, date, total amount. Output in standard JSON format."
prompt = "请从图片中提取以下字段信息:姓名、日期、总金额。使用标准JSON格式输出。"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 16384}
    )
print(response)
```
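
Since the prompt asks for standard JSON, the reply usually needs to be parsed before downstream use. A small defensive sketch; stripping an optional Markdown code fence is an assumption about how the model may format its answer:

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Best-effort JSON extraction from a model reply.

    Removes an optional Markdown json code fence before parsing; raises
    json.JSONDecodeError if the remaining text is not valid JSON.
    """
    cleaned = text.strip()
    fence = re.search(r"```(?:json)?\s*(.*?)```", cleaned, flags=re.DOTALL)
    if fence:
        cleaned = fence.group(1).strip()
    return json.loads(cleaned)

fields = parse_json_reply(response)
print(fields)
```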

### vLLM Deployment

```bash
# Serve with vLLM for high-throughput inference
vllm serve baidu/Qianfan-OCR --trust-remote-code
```
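
Once the server is running, requests go through vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default endpoint at `http://localhost:8000/v1` and a local page image sent as a base64 data URL:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local page image as a data URL for the multimodal message
with open("document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="baidu/Qianfan-OCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Parse this document to Markdown."},
        ],
    }],
    max_tokens=16384,
)
print(response.choices[0].message.content)
```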

## Skill

We provide a [Qianfan OCR Document Intelligence](https://github.com/baidubce/skills/tree/develop/skills/qianfanocr-document-intelligence) skill for image and PDF understanding workflows.

It can be used with OpenClaw, Claude Code, Codex, and other assistants that support this skill format.
The skill packages reusable instructions, scripts, and references so an agent can automatically apply Qianfan-powered document intelligence to tasks such as:

- document parsing to Markdown
- layout analysis
- element recognition
- general OCR
- key information extraction
- chart understanding
- document VQA

The skill is designed for visual understanding tasks over images and PDFs, and includes the execution flow needed to prepare inputs, choose the right analysis mode, and call the bundled CLI tools.

## Citation

```bibtex
@misc{dong2026qianfanocrunifiedendtoendmodel,
  title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
  author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
  year={2026},
  eprint={2603.13398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13398},
}
```

## Acknowledgments

We thank the Baidu AI Cloud team for infrastructure support, the Baige and Kunlun teams for AI infrastructure assistance, and all contributors to the Qianfan platform.

## License

This project is licensed under the Apache License 2.0. See `LICENSE` for the
full license text.

Some bundled third-party source files are licensed under the MIT License. See
`NOTICE` for the file list and corresponding attribution details.