# PaddleOCR-VL-1.5 ONNX (Q4 Quantized)

ONNX conversion of PaddlePaddle/PaddleOCR-VL-1.5 for in-browser deployment with Transformers.js.
## Model Details
- Base model: PaddleOCR-VL-1.5 (0.9B parameters)
- Quantization: INT4 (MatMulNBits, 32-block symmetric)
- Total size: ~991 MB (down from 3.4 GB FP32)
- License: Apache 2.0
- Table recognition accuracy: 92.76% TEDS on PubTabNet
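The INT4 scheme above can be pictured as follows: each weight matrix is split into blocks of 32 values, and each block stores signed 4-bit integers plus one floating-point scale. This is a simplified sketch for intuition only; the actual MatMulNBits op packs two 4-bit values per byte and operates inside the ONNX graph.

```javascript
// Simplified view of 32-block symmetric INT4 quantization (MatMulNBits).
// Real ONNX storage packs two 4-bit values per byte; this sketch keeps
// the integers unpacked to show the round-trip.
function quantizeBlock(block) {
  // One scale per block; symmetric, so no zero point is stored.
  const scale = Math.max(...block.map(Math.abs)) / 7 || 1;
  const quant = block.map(v => Math.max(-8, Math.min(7, Math.round(v / scale))));
  return { quant, scale };
}

function dequantizeBlock(quant, scale) {
  return quant.map(q => q * scale);
}
```

One scale per 32 weights is why the quantized model lands near ~4.x bits per weight rather than exactly 4.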
## Files

| File | Size | Description |
|---|---|---|
| `onnx/embed_tokens.onnx` | 405 MB | Token embedding lookup |
| `onnx/vision_encoder.onnx` | 369 MB | NaViT vision encoder + MLP projector |
| `onnx/decoder_model_merged.onnx` | 217 MB | Ernie4.5 LLM decoder with KV cache |
## Architecture

The model is split into three ONNX sub-models, following the Florence-2 export pattern:
- embed_tokens: Converts token IDs to embeddings
- vision_encoder: Processes image patches into visual tokens (NaViT + projector MLP)
- decoder_model_merged: Autoregressive text decoder with KV cache support
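The glue between the three sub-models is mostly bookkeeping: the visual tokens produced by the vision encoder are spliced into the text embedding sequence before the decoder runs. A minimal sketch of that merge step, where the image-placeholder token id and the one-placeholder-per-image convention are assumptions for illustration (check the processor config for the real special tokens):

```javascript
// Splice visual token embeddings into the text embedding sequence at the
// position of an image-placeholder token. Pure bookkeeping, independent
// of the ONNX runtime. IMAGE_TOKEN_ID is a hypothetical value.
const IMAGE_TOKEN_ID = 100000;

function mergeEmbeddings(tokenIds, textEmbeds, imageEmbeds) {
  // tokenIds: number[]; textEmbeds: one vector per token;
  // imageEmbeds: one vector per visual token from the vision encoder.
  const merged = [];
  for (let i = 0; i < tokenIds.length; i++) {
    if (tokenIds[i] === IMAGE_TOKEN_ID) {
      merged.push(...imageEmbeds); // expand placeholder into visual tokens
    } else {
      merged.push(textEmbeds[i]);
    }
  }
  return merged;
}
```

The merged sequence is what the decoder sees during prefill; subsequent steps feed only the newly generated token's embedding plus the cached keys/values.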
Key features:
- M-RoPE (3D multimodal rotary position embeddings)
- GQA (16 attention heads, 2 KV heads, head_dim=128)
- Merged decoder supporting both prefill and generation via the `use_cache_branch` input
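The GQA ratio above means each of the 2 KV heads is shared by 16 / 2 = 8 query heads. That expansion happens inside the ONNX graph; the sketch below only illustrates the bookkeeping:

```javascript
// Expand grouped KV heads to match the number of query heads (GQA).
// Each KV head is reused by numHeads / kvHeads consecutive query heads;
// for this model that is 16 / 2 = 8.
function repeatKV(kvHeads, numHeads) {
  const groups = numHeads / kvHeads.length;
  const out = [];
  for (const head of kvHeads) {
    for (let g = 0; g < groups; g++) out.push(head);
  }
  return out;
}
```

This is also why the KV cache is comparatively small: only 2 heads × head_dim=128 per layer need to be stored, not 16.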
## Validation

All components were validated against the PyTorch reference implementation:
| Component | FP32 Cosine Sim | Q4 Cosine Sim |
|---|---|---|
| embed_tokens | 1.000000 | 1.000000 |
| vision_encoder | 1.000000 | 0.960662 |
| decoder (prefill) | 1.000000 | 0.970726 |
| decoder (with past) | 1.000000 | - |
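The similarity metric in the table is cosine similarity over the flattened output tensors; a minimal version of the comparison function:

```javascript
// Cosine similarity between two flat numeric arrays, as used to compare
// ONNX outputs against the PyTorch reference. Returns 1.0 for identical
// directions, 0.0 for orthogonal vectors.
function cosineSim(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```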
## Usage

Intended for browser-based OCR and table recognition via ONNX Runtime Web (WebGPU or WASM backends).
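With ONNX Runtime Web, sessions are typically created with WebGPU preferred and WASM as a fallback. A small helper for choosing the execution-provider list (the ordering is a suggestion, not part of this repo; the commented usage shows the standard `ort.InferenceSession.create` call):

```javascript
// Choose execution providers for ONNX Runtime Web: prefer WebGPU when
// the browser exposes it, keep WASM as a universal fallback.
function pickProviders(hasWebGPU) {
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// Usage in the browser (requires the onnxruntime-web package):
//   const providers = pickProviders('gpu' in navigator);
//   const session = await ort.InferenceSession.create(
//     'onnx/decoder_model_merged.onnx',
//     { executionProviders: providers });
```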
## Credits
- Original model: PaddlePaddle/PaddleOCR-VL-1.5
- Conversion approach inspired by Xenova's Florence-2 ONNX export