PaddleOCR-VL-1.5 ONNX (Q4 Quantized)

ONNX conversion of PaddlePaddle/PaddleOCR-VL-1.5 for browser deployment with Transformers.js.

Model Details

  • Base model: PaddleOCR-VL-1.5 (0.9B parameters)
  • Quantization: INT4 (MatMulNBits, 32-block symmetric)
  • Total size: ~991 MB (down from 3.4 GB FP32)
  • License: Apache 2.0
  • Table recognition accuracy: 92.76% TEDS on PubTabNet
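A minimal sketch of the quantization scheme named above: symmetric 4-bit quantization over blocks of 32 weights, with one floating-point scale per block and no zero point. The actual MatMulNBits kernel packs two 4-bit values per byte and its storage layout differs; this only illustrates the numerics.

```python
import numpy as np

def quantize_block_q4_symmetric(block: np.ndarray):
    """Symmetric 4-bit quantization of a 32-value block: one scale, no zero point."""
    # Signed 4-bit range is [-8, 7]; the scale maps the largest magnitude onto it.
    scale = np.max(np.abs(block)) / 7.0 if np.any(block) else 1.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: quantize one block and measure the reconstruction error,
# which is bounded by half a quantization step (scale / 2).
rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)
q, scale = quantize_block_q4_symmetric(block)
recon = dequantize_block(q, scale)
max_err = np.max(np.abs(block - recon))
```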

Files

| File | Size | Description |
|------|------|-------------|
| onnx/embed_tokens.onnx | 405 MB | Token embedding lookup |
| onnx/vision_encoder.onnx | 369 MB | NaViT vision encoder + MLP projector |
| onnx/decoder_model_merged.onnx | 217 MB | Ernie4.5 LLM decoder with KV cache |

Architecture

The model is split into 3 ONNX sub-models following the Florence-2 pattern:

  1. embed_tokens: Converts token IDs to embeddings
  2. vision_encoder: Processes image patches into visual tokens (NaViT + projector MLP)
  3. decoder_model_merged: Autoregressive text decoder with KV cache support
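The hand-off between these stages follows the usual vision-language pattern: embed_tokens produces embeddings for the full prompt, and the vision encoder's output is spliced in at the image-placeholder positions before the decoder runs. A toy numpy sketch of that merge step (the placeholder token ID and hidden size here are illustrative, not the model's actual values):

```python
import numpy as np

IMAGE_TOKEN_ID = 100  # hypothetical placeholder ID, not the real vocab entry
HIDDEN = 8            # toy hidden size for illustration only

def merge_vision_tokens(input_ids, text_embeds, vision_embeds):
    """Replace image-placeholder embeddings with visual tokens, in order."""
    merged = text_embeds.copy()
    positions = np.where(input_ids == IMAGE_TOKEN_ID)[0]
    assert len(positions) == vision_embeds.shape[0], "one visual token per placeholder"
    merged[positions] = vision_embeds
    return merged

# Toy prompt: two text tokens, three image placeholders, one text token.
input_ids = np.array([5, 7, 100, 100, 100, 9])
text_embeds = np.zeros((6, HIDDEN), dtype=np.float32)
vision_embeds = np.ones((3, HIDDEN), dtype=np.float32)
merged = merge_vision_tokens(input_ids, text_embeds, vision_embeds)
```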

Key features:

  • M-RoPE (3D multimodal rotary position embeddings)
  • GQA (16 attention heads, 2 KV heads, head_dim=128)
  • Merged decoder supporting both prefill and generation via use_cache_branch
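With this GQA layout, each of the 2 KV heads is shared by 8 of the 16 query heads, so the cached K/V tensors are repeated up to the query-head count before attention. A sketch of that expansion (shape convention assumed, not taken from the exported graph):

```python
import numpy as np

NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM = 16, 2, 128

def expand_kv(kv: np.ndarray) -> np.ndarray:
    """Repeat each KV head so every group of query heads sees its shared head.

    kv: (batch, num_kv_heads, seq_len, head_dim)
    returns: (batch, num_q_heads, seq_len, head_dim)
    """
    group = NUM_Q_HEADS // NUM_KV_HEADS  # 8 query heads per KV head
    return np.repeat(kv, group, axis=1)

k = np.random.randn(1, NUM_KV_HEADS, 10, HEAD_DIM).astype(np.float32)
k_expanded = expand_kv(k)
```

The cache itself stays at 2 heads, which is what keeps the KV-cache memory footprint an eighth of a full multi-head layout.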

Validation

All components validated against PyTorch reference:

| Component | FP32 Cosine Sim | Q4 Cosine Sim |
|-----------|-----------------|---------------|
| embed_tokens | 1.000000 | 1.000000 |
| vision_encoder | 1.000000 | 0.960662 |
| decoder (prefill) | 1.000000 | 0.970726 |
| decoder (with past) | 1.000000 | - |
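The figures above compare output tensors from the ONNX and PyTorch runs as flat vectors; a value of 1.000000 means the outputs agree in direction to the reported precision. The metric itself is simply:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, compared as flat vectors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.array([1.0, 2.0, 3.0])
sim = cosine_similarity(ref, 2 * ref)  # scaling preserves direction, so ~1.0
```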

Usage

Intended for browser-based OCR and table recognition via ONNX Runtime Web (WebGPU/WASM).
