PaddleOCR-VL-1.5

This version of PaddleOCR-VL-1.5 has been converted to run on the Axera NPU using w4a16 quantization.

Compatible with Pulsar2 version: 5.0

Convert tools links

If you are interested in model conversion, you can try exporting the axmodel through the original repo:

Support Platform

Image Process

Chips  Input size  Image num  ViT encoder  TTFT (640 tokens)  Decode speed     CMM  Flash
AX650  576x768     1          1685.554 ms  361.8 ms           44.6 tokens/sec  TBD  TBD

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the prebuilt executable from GitHub Actions CI (for users without a build environment):

If you do not have a build environment, download the latest CI-built executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

First create the model directory and enter it, then download into that directory:

mkdir -p AXERA-TECH/PaddleOCR-VL-1.5
cd AXERA-TECH/PaddleOCR-VL-1.5
hf download AXERA-TECH/PaddleOCR-VL-1.5 --local-dir .

# directory structure after download
tree -L 1
.
|-- README.md
|-- assets
|-- config.json
|-- model.embed_tokens.weight.bfloat16.bin
|-- paddleocr_vl_p128_l0_together.axmodel
...
|-- paddleocr_vl_p128_l17_together.axmodel
|-- paddleocr_vl_post.axmodel
|-- post_config.json
|-- python
|-- tokenizer.model
|-- vision_cache
`-- vit_576x768.axmodel
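
After the download finishes, a quick check that the key files listed above are present can save a confusing failure later. A minimal sketch (the helper name and file list below are taken from the tree output, not from any official tooling):

```python
from pathlib import Path

# Files expected after `hf download` completes (taken from the tree above).
REQUIRED = [
    "config.json",
    "model.embed_tokens.weight.bfloat16.bin",
    "paddleocr_vl_post.axmodel",
    "post_config.json",
    "tokenizer.model",
    "vit_576x768.axmodel",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).exists()]
```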

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO board

Run (CLI)

axllm run AXERA-TECH/PaddleOCR-VL-1.5/

After startup you enter interactive mode. After each prompt input, you are asked for image >>:

  • Press Enter directly: text-only chat
  • Enter an image path: image + text OCR chat

Example input image:

TestImage

root@ax650 # axllm run AXERA-TECH/PaddleOCR-VL-1.5/ # note: use the latest version of axllm
20:45:21.515 INF Init:218 | LLM init start
tokenizer_type = 0
 95% | ##############################   |  20 /  21 [1.97s<2.07s, 10.13 count/s] init post axmodel ok,remain_cmm(4330 MB)
20:45:23.490 INF Init:368 | max_token_len : 2047
20:45:23.490 INF Init:371 | kv_cache_size : 256, kv_cache_num: 2047
20:45:23.490 INF Init:374 | prefill_token_num : 128
20:45:23.490 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
20:45:23.490 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
20:45:23.490 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
20:45:23.490 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
20:45:23.490 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
20:45:23.490 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
20:45:23.490 INF Init:384 | prefill_max_token_num : 640
20:45:23.490 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  21 /  21 [1.98s<1.98s, 10.62 count/s] embed_selector init ok
20:45:24.028 INF Init:453 | PaddleOCRVL: encoder input nSize=5334336 -> eff_nSize=1333584 (float32 input)
20:45:24.029 WRN Init:469 | Qwen-VL vision size override: cfg=448x448 bytes=602112, model_input_bytes=5334336 -> 756x588 (factor-search).
20:45:24.029 INF Init:661 | PaddleOCR-VL token ids: vision_start=101305 image_pad=100295 video_pad=100295
20:45:24.029 INF Init:686 | VisionModule init ok: type=PaddleOCRVL, tokens_per_block=567, embed_size=1024, out_dtype=fp32
20:45:24.029 WRN Init:695 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
20:45:24.031 INF load_config:282 | load config:
20:45:24.031 INF load_config:282 | {
20:45:24.031 INF load_config:282 |     "enable_repetition_penalty": false,
20:45:24.031 INF load_config:282 |     "enable_temperature": false,
20:45:24.031 INF load_config:282 |     "enable_top_k_sampling": false,
20:45:24.031 INF load_config:282 |     "enable_top_p_sampling": false,
20:45:24.031 INF load_config:282 |     "penalty_window": 20,
20:45:24.031 INF load_config:282 |     "repetition_penalty": 1.0,
20:45:24.031 INF load_config:282 |     "temperature": 0.6,
20:45:24.031 INF load_config:282 |     "top_k": 1,
20:45:24.031 INF load_config:282 |     "top_p": 0.9
20:45:24.031 INF load_config:282 | }
20:45:24.031 INF Init:448 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kv cache
  /dd        delete the last dialogue round
  /pp        print the dialogue history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> OCR:
image >> /Path/To/Your/AXERA-TECH/PaddleOCR-VL-1.5/assets/IMG_0462.JPG
20:45:30.031 INF EncodeForContent:1058 | PaddleOCRVL pixel_values bytes=1333584 min=0 max=255 (w=756 h=588 ps=14)
20:45:31.726 INF EncodeForContent:1102 | vision cache store: /Path/To/Your/AXERA-TECH/PaddleOCR-VL-1.5/assets/IMG_0462.JPG
20:45:31.760 INF SetKVCache:747 | prefill_grpid:6 kv_cache_num:640 precompute_len:0 input_num_token:596
20:45:31.760 INF SetKVCache:749 | current prefill_max_token_num:640
20:45:31.760 INF SetKVCache:752 | first run
20:45:31.761 INF Run:805 | input token num : 596, prefill_split_num : 5
20:45:31.761 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
20:45:31.761 INF Run:868 | prefill indices shape: p=0 idx_elems=384 idx_rows=3 pos_rows=3
20:45:31.815 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=128
20:45:31.816 INF Run:868 | prefill indices shape: p=1 idx_elems=384 idx_rows=3 pos_rows=3
20:45:31.873 INF Run:845 | prefill chunk p=2 history_len=256 grpid=3 kv_cache_num=256 input_tokens=128
20:45:31.873 INF Run:868 | prefill indices shape: p=2 idx_elems=384 idx_rows=3 pos_rows=3
20:45:31.937 INF Run:845 | prefill chunk p=3 history_len=384 grpid=4 kv_cache_num=384 input_tokens=128
20:45:31.937 INF Run:868 | prefill indices shape: p=3 idx_elems=384 idx_rows=3 pos_rows=3
20:45:32.006 INF Run:845 | prefill chunk p=4 history_len=512 grpid=5 kv_cache_num=512 input_tokens=84
20:45:32.006 INF Run:868 | prefill indices shape: p=4 idx_elems=384 idx_rows=3 pos_rows=3
20:45:32.088 INF Run:1010 | ttft: 327.20 ms
James Landay-VR
14175

20:45:32.374 NTC Run:1132 | hit eos,avg 38.47 token/s
20:45:32.374 INF GetKVCache:721 | precompute_len:597, remaining:43

Start the server (OpenAI-compatible)

axllm serve AXERA-TECH/PaddleOCR-VL-1.5/

Reference log:

root@ax650 # axllm serve AXERA-TECH/PaddleOCR-VL-1.5/ # note: use the latest version of axllm
20:47:54.027 INF Init:218 | LLM init start
tokenizer_type = 0
 95% | ##############################   |  20 /  21 [1.96s<2.05s, 10.22 count/s] init post axmodel ok,remain_cmm(4330 MB)
20:47:55.983 INF Init:368 | max_token_len : 2047
20:47:55.983 INF Init:371 | kv_cache_size : 256, kv_cache_num: 2047
20:47:55.983 INF Init:374 | prefill_token_num : 128
20:47:55.983 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
20:47:55.983 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
20:47:55.983 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
20:47:55.983 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
20:47:55.983 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
20:47:55.983 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
20:47:55.983 INF Init:384 | prefill_max_token_num : 640
20:47:55.983 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  21 /  21 [1.96s<1.96s, 10.72 count/s] embed_selector init ok
20:47:56.526 INF Init:453 | PaddleOCRVL: encoder input nSize=5334336 -> eff_nSize=1333584 (float32 input)
20:47:56.526 WRN Init:469 | Qwen-VL vision size override: cfg=448x448 bytes=602112, model_input_bytes=5334336 -> 756x588 (factor-search).
20:47:56.526 INF Init:661 | PaddleOCR-VL token ids: vision_start=101305 image_pad=100295 video_pad=100295
20:47:56.526 INF Init:686 | VisionModule init ok: type=PaddleOCRVL, tokens_per_block=567, embed_size=1024, out_dtype=fp32
20:47:56.526 WRN Init:695 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
20:47:56.528 INF load_config:282 | load config:
20:47:56.528 INF load_config:282 | {
20:47:56.528 INF load_config:282 |     "enable_repetition_penalty": false,
20:47:56.528 INF load_config:282 |     "enable_temperature": false,
20:47:56.528 INF load_config:282 |     "enable_top_k_sampling": false,
20:47:56.528 INF load_config:282 |     "enable_top_p_sampling": false,
20:47:56.528 INF load_config:282 |     "penalty_window": 20,
20:47:56.528 INF load_config:282 |     "repetition_penalty": 1.0,
20:47:56.528 INF load_config:282 |     "temperature": 0.6,
20:47:56.528 INF load_config:282 |     "top_k": 1,
20:47:56.528 INF load_config:282 |     "top_p": 0.9
20:47:56.528 INF load_config:282 | }
20:47:56.528 INF Init:448 | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/PaddleOCR-VL-1.5'...
API URLs:
  GET  http://127.0.0.1:8000/health
  GET  http://127.0.0.1:8000/v1/models
  POST http://127.0.0.1:8000/v1/chat/completions
  GET  http://10.168.232.217:8000/health
  GET  http://10.168.232.217:8000/v1/models
  POST http://10.168.232.217:8000/v1/chat/completions
  GET  http://172.17.0.1:8000/health
  GET  http://172.17.0.1:8000/v1/models
  POST http://172.17.0.1:8000/v1/chat/completions
Aliases:
  GET  http://127.0.0.1:8000/models
  POST http://127.0.0.1:8000/chat/completions
  GET  http://10.168.232.217:8000/models
  POST http://10.168.232.217:8000/chat/completions
  GET  http://172.17.0.1:8000/models
  POST http://172.17.0.1:8000/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/PaddleOCR-VL-1.5
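
Once the server is up, the /v1/chat/completions endpoint can be exercised with an OpenAI-style request. Below is a sketch of building such a payload with an inline base64 image. The image_url data-URI shape follows the common OpenAI vision convention and is an assumption here; adjust it if ax-llm expects a different schema:

```python
import base64

def build_ocr_request(image_path: str, prompt: str = "OCR:") -> dict:
    """Build an OpenAI-style chat-completions payload with an inline base64 image.

    The image_url data-URI form follows the OpenAI vision schema; whether
    ax-llm's server accepts this exact shape is an assumption, so adapt it
    to whatever the server actually expects.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": "AXERA-TECH/PaddleOCR-VL-1.5",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }
        ],
        "stream": False,
    }

# To send it (requires the server started by `axllm serve` to be running):
# import json, urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:8000/v1/chat/completions",
#     data=json.dumps(build_ocr_request("assets/IMG_0462.JPG")).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```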

Python Inference (optional)

Python inference scripts are kept in the python/ directory and can be used to run inference tests on a PC or on the dev board.

python3 python/infer_axmodel.py \
  --hf_model ./python/paddleocr_vl_1-5_tokenizer \
  --axmodel_path . \
  --vit_model_path ./vit_576x768.axmodel \
  --image_path ./assets/IMG_0462.JPG \
  --task ocr

Supported task types (--task):

  • ocr — general text recognition
  • table — table recognition
  • chart — chart recognition
  • formula — formula recognition
  • spotting — text detection (spotting)
  • seal — seal recognition
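
The invocation above varies only in --task and --image_path across tasks. A small helper (ours, not part of the repo) that assembles the same command line for any supported task:

```python
import shlex

# Task names supported by infer_axmodel.py, per the list above.
TASKS = {"ocr", "table", "chart", "formula", "spotting", "seal"}

def infer_command(task: str, image_path: str) -> str:
    """Build the infer_axmodel.py command line for one of the supported tasks."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    args = [
        "python3", "python/infer_axmodel.py",
        "--hf_model", "./python/paddleocr_vl_1-5_tokenizer",
        "--axmodel_path", ".",
        "--vit_model_path", "./vit_576x768.axmodel",
        "--image_path", image_path,
        "--task", task,
    ]
    return " ".join(shlex.quote(a) for a in args)
```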

End-to-End Metrics (AX650N)

Metric                 Value
Max TTFT (640 tokens)  361.8 ms
Decode speed           44.6 tokens/s
ViT latency (576x768)  1685.554 ms
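
These numbers give a rough back-of-envelope estimate of end-to-end latency per image, assuming the three stages (ViT encode, prefill, decode) run strictly sequentially (an assumption; the table does not state whether they overlap):

```python
def estimated_latency_ms(output_tokens: int,
                         vit_ms: float = 1685.554,
                         ttft_ms: float = 361.8,
                         decode_tps: float = 44.6) -> float:
    """Rough end-to-end estimate for one image, assuming sequential stages:
    ViT encode, then prefill (TTFT), then token-by-token decode.
    Defaults are the AX650N numbers from the table above."""
    return vit_ms + ttft_ms + output_tokens / decode_tps * 1000.0
```

For example, a 446-token output would add about 10 s of decode on top of roughly 2 s of encode + prefill.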

Subgraph Latency

Stage         Subgraph  Latency
Prefill       g1        2.551 ms
Prefill       g2        2.883 ms
Prefill       g3        3.158 ms
Prefill       g4        3.413 ms
Prefill       g5        3.795 ms
Prefill       g6        4.007 ms
Decode        g0        0.949 ms
Post-process  -         5.313 ms
ViT           -         1685.554 ms

Discussion

  • GitHub Issues
  • QQ Group: 139953715