Qianfan-OCR in 8-bit

This is a drop-in replacement for baidu/Qianfan-OCR for GPUs with limited VRAM. The weights are pre-quantized to 8-bit so the model fits into a smaller memory budget.

The original model requires at least 16GB of VRAM to be served via vLLM. The 8-bit version fits into GPUs with 8GB of VRAM, although batching capability will be severely limited.
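As a rough back-of-the-envelope check, the weight footprint scales with bytes per parameter. A minimal sketch, assuming the ~4B-parameter figure from the paper abstract (activations, KV cache, and the unquantized vision tower add on top of this):

```python
def weight_gib(n_params: float, bytes_per_param: int) -> float:
    # Raw weight storage only: parameters times bytes per parameter,
    # expressed in GiB. Runtime overheads are not included.
    return n_params * bytes_per_param / 2**30

print(round(weight_gib(4e9, 2), 1))  # BF16 weights → 7.5 GiB
print(round(weight_gib(4e9, 1), 1))  # int8 weights → 3.7 GiB
```

This is why the 8-bit checkpoint leaves enough headroom on an 8GB card for the vision encoder and a modest KV cache, while the BF16 original does not.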

Only the LLM part is quantized. The vision model is untouched.

Serving via vLLM

It should work with Baidu's official vLLM examples by swapping baidu/Qianfan-OCR for FriskyFennec/Qianfan-OCR-8bit.

It was tested on an RTX 3060 (12GB) using the following command:

```shell
vllm serve FriskyFennec/Qianfan-OCR-8bit \
  --served-model-name "baidu/Qianfan-OCR" \
  --port 8000 --host 0.0.0.0 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 16384 \
  --trust-remote-code \
  --hf-overrides '{"architectures":["InternVLChatModel"],"model_type":"internvl_chat"}'
```
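Once the server is up, requests go through vLLM's OpenAI-compatible chat endpoint. A minimal sketch that builds such a request; the prompt text and `max_tokens` value here are illustrative assumptions, while the model name matches the `--served-model-name` used above:

```python
import base64

def build_ocr_request(image_bytes: bytes,
                      prompt: str = "Convert this page to Markdown.") -> dict:
    # Encode the page image as a base64 data URL, the format accepted by
    # the OpenAI-compatible /v1/chat/completions endpoint vLLM exposes.
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": "baidu/Qianfan-OCR",  # must match --served-model-name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 8192,  # generous budget for dense pages
    }

payload = build_ocr_request(b"...")  # pass real PNG/JPEG bytes in practice
```

The payload can then be POSTed to `http://localhost:8000/v1/chat/completions`, e.g. with `requests.post(url, json=payload)`.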

Sample output

The first page of arXiv:2603.13398:

arXiv:2603.13398v1[cs.CV] 11 Mar 2026

March 17, 2026

# Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Baidu Qianfan Team

https://github.com/baidubce/qianfan-VL

## Abstract

We present Qianfan-OCR, a 4B-parameter end-to-end document intelligence model that unifies document parsing, layout analysis, and document understanding within a single vision-language architecture. Unlike traditional multi-stage OCR pipelines that chain separate layout detection, text recognition, and language comprehension modules, Qianfan-OCR performs direct image-to-Markdown conversion and supports a broad range of prompt-driven tasks – from structured document parsing and table extraction to chart understanding, document question answering, and key information extraction – all within one model. A practical limitation of end-to-end OCR is the loss of explicit layout analysis, a capability that pipeline users routinely rely on for element localization and type classification. We introduce Layout-as-Thought to bridge this gap: an optional thinking phase triggered by (token) tokens, where the model generates structured layout representations (bounding boxes, element types, and reading orders) before producing final outputs. This mechanism serves two purposes: (1) it recovers layout analysis functionality within the end-to-end paradigm, enabling users to obtain spatial grounding results directly; and (2) it provides targeted accuracy improvements on documents with complex layouts, cluttered elements, or non-standard reading orders, where structural priors help resolve recognition ambiguities. On OCR-specific benchmarks, Qianfan-OCR ranks first among all end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8). It also achieves strong results on general OCR benchmarks including OOCRbench (880), OCRBenchv2, and CCOCR, as well as document understanding tasks such as DocVQA, ChartQA, and CharXiv, matching general vision-language models of comparable scale. On public Key Information Extraction benchmarks, Qianfan-OCR achieves the highest average score, surpassing Gemini-3.1-Pro, Gemini-3-Pro, Seed-2.0, and Qwen3-VL-235B-A22B. 
The model is publicly accessible through Baidu AI Cloud Qianfan platform, with usage examples and best practices available at https://github.com/baidubce/qianfan-VL.

## 1 Introduction

Current OCR systems face a three-way trade-off between cost, accuracy, and capability. Traditional OCR pipelines based on small specialized models offer low inference cost and high throughput, but require complex multi-stage preprocessing and postprocessing to handle diverse document layouts. Specialized OCR large models (Wei et al., 2024; 2025; Cui et al., 2025b; Poznanski et al., 2025) improve accuracy through two-stage architectures – layout detection followed by element-wise recognition – but introduce deployment complexity, inter-stage error propagation, and irreversible loss of visual context during text extraction. General vision-language models (Liu et al., 2024; Chen et al., 2024b) offer broad multimodal capabilities but incur higher inference costs and underperform specialized systems on structured document parsing tasks requiring precise layout preservation.

In practice, industrial OCR applications – document retrieval with chunking and indexing, contract review, key information extraction from receipts and certificates – often chain detection models, OCR models, and separate LLMs for downstream understanding. This fragmented approach increases deployment cost, limits end-to-end optimization, and requires careful orchestration of heterogeneous components.

We introduce Qianfan-OCR, a 4B-parameter unified end-to-end model that addresses these limitations with three key designs:

End-to-End Architecture: Qianfan-OCR integrates layout analysis, text recognition, and semantic understanding into a single vision-language model, eliminating inter-stage error propagation and enabling joint optimization across all tasks. The end-to-end design allows the model to retain full visual context throughout processing – spatial relationships, chart structures, and formatting that pipeline systems discard during text extraction. For tasks that do not require explicit layout analysis (e.g., simple document transcription or scene text recognition), the model directly outputs results without mandatory layout preprocessing.

Layout-as-Thought: A practical limitation of end-to-end OCR is the loss of explicit layout analysis –

1

It takes about 92 seconds to process the page above, which contains a large amount of text. Smaller inputs, such as screenshots, take less time (~15 seconds).

Important note

The weights are in plain Transformers format, which comes at the cost of suboptimal throughput: they will not benefit from Baidu's reported 2x throughput optimizations.

License

The weights of baidu/Qianfan-OCR are shared under Apache-2.0; refer to Qianfan-OCR's license.
