PaddleOCR-VL-1.5
This version of PaddleOCR-VL-1.5 has been converted to run on the Axera NPU using w4a16 quantization.
Compatible with Pulsar2 version: 5.0
Conversion tools:
If you are interested in model conversion, you can try exporting the axmodel yourself through the original repo.
Supported Platforms
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Image Processing Performance
| Chip | Input size | Image num | ViT encoder latency | TTFT (640 tokens) | Decode speed | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 576x768 | 1 | 1685.554 ms | 361.8 ms | 44.6 tokens/s | TBD | TBD |
How to use
Install axllm
Option 1: clone the repository and run the install script:
```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```
Option 2: one-line install (default branch axllm):
```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```
Option 3: download the prebuilt executable exported by GitHub Actions CI (for users without a build environment):
If you do not have a build environment, go to
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
and download the latest CI-exported executable (axllm), then:
```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```
Model Download (Hugging Face)
First create the model directory and enter it, then download into that directory:
```shell
mkdir -p AXERA-TECH/PaddleOCR-VL-1.5
cd AXERA-TECH/PaddleOCR-VL-1.5
hf download AXERA-TECH/PaddleOCR-VL-1.5 --local-dir .
```
Directory structure after download:
```shell
tree -L 1
.
|-- README.md
|-- assets
|-- config.json
|-- model.embed_tokens.weight.bfloat16.bin
|-- paddleocr_vl_p128_l0_together.axmodel
...
|-- paddleocr_vl_p128_l17_together.axmodel
|-- paddleocr_vl_post.axmodel
|-- post_config.json
|-- python
|-- tokenizer.model
|-- vision_cache
`-- vit_576x768.axmodel
```
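A quick sanity check before running: the model directory should contain 18 decoder-layer files (l0 through l17, per the listing above) plus the post-process, vision, and embedding files. The following sketch checks for them; the directory path is an assumption based on the download commands above.

```python
from pathlib import Path

# Assumed local model directory, matching the `hf download` step above.
MODEL_DIR = Path("AXERA-TECH/PaddleOCR-VL-1.5")

# 18 decoder-layer axmodels (l0..l17) plus the other required artifacts.
layer_files = [f"paddleocr_vl_p128_l{i}_together.axmodel" for i in range(18)]
other_files = [
    "paddleocr_vl_post.axmodel",
    "vit_576x768.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
    "tokenizer.model",
    "config.json",
]

missing = [f for f in layer_files + other_files if not (MODEL_DIR / f).exists()]
if missing:
    print(f"{len(missing)} file(s) missing, e.g. {missing[0]}")
else:
    print("all model files present")
```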
Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or AX650N DEMO Board
Run (CLI)
```shell
axllm run AXERA-TECH/PaddleOCR-VL-1.5/
```
After startup you enter interactive mode. After each prompt, you will be asked for image >>:
- Press Enter directly: text-only chat
- Enter an image path: image + text OCR chat
Example with an input image:
```shell
root@ax650 # axllm run AXERA-TECH/PaddleOCR-VL-1.5/  # make sure you are using the latest axllm
20:45:21.515 INF Init:218 | LLM init start
tokenizer_type = 0
95% | ############################## | 20 / 21 [1.97s<2.07s, 10.13 count/s] init post axmodel ok,remain_cmm(4330 MB)
20:45:23.490 INF Init:368 | max_token_len : 2047
20:45:23.490 INF Init:371 | kv_cache_size : 256, kv_cache_num: 2047
20:45:23.490 INF Init:374 | prefill_token_num : 128
20:45:23.490 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
20:45:23.490 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
20:45:23.490 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
20:45:23.490 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
20:45:23.490 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
20:45:23.490 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
20:45:23.490 INF Init:384 | prefill_max_token_num : 640
20:45:23.490 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 21 / 21 [1.98s<1.98s, 10.62 count/s] embed_selector init ok
20:45:24.028 INF Init:453 | PaddleOCRVL: encoder input nSize=5334336 -> eff_nSize=1333584 (float32 input)
20:45:24.029 WRN Init:469 | Qwen-VL vision size override: cfg=448x448 bytes=602112, model_input_bytes=5334336 -> 756x588 (factor-search).
20:45:24.029 INF Init:661 | PaddleOCR-VL token ids: vision_start=101305 image_pad=100295 video_pad=100295
20:45:24.029 INF Init:686 | VisionModule init ok: type=PaddleOCRVL, tokens_per_block=567, embed_size=1024, out_dtype=fp32
20:45:24.029 WRN Init:695 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
20:45:24.031 INF load_config:282 | load config:
20:45:24.031 INF load_config:282 | {
20:45:24.031 INF load_config:282 | "enable_repetition_penalty": false,
20:45:24.031 INF load_config:282 | "enable_temperature": false,
20:45:24.031 INF load_config:282 | "enable_top_k_sampling": false,
20:45:24.031 INF load_config:282 | "enable_top_p_sampling": false,
20:45:24.031 INF load_config:282 | "penalty_window": 20,
20:45:24.031 INF load_config:282 | "repetition_penalty": 1.0,
20:45:24.031 INF load_config:282 | "temperature": 0.6,
20:45:24.031 INF load_config:282 | "top_k": 1,
20:45:24.031 INF load_config:282 | "top_p": 0.9
20:45:24.031 INF load_config:282 | }
20:45:24.031 INF Init:448 | LLM init ok
Commands:
/q, /exit    quit
/reset       reset the kv cache
/dd          delete one dialogue round
/pp          print the dialogue history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> OCR:
image >> /Path/To/Your/AXERA-TECH/PaddleOCR-VL-1.5/assets/IMG_0462.JPG
20:45:30.031 INF EncodeForContent:1058 | PaddleOCRVL pixel_values bytes=1333584 min=0 max=255 (w=756 h=588 ps=14)
20:45:31.726 INF EncodeForContent:1102 | vision cache store: /Path/To/Your/AXERA-TECH/PaddleOCR-VL-1.5/assets/IMG_0462.JPG
20:45:31.760 INF SetKVCache:747 | prefill_grpid:6 kv_cache_num:640 precompute_len:0 input_num_token:596
20:45:31.760 INF SetKVCache:749 | current prefill_max_token_num:640
20:45:31.760 INF SetKVCache:752 | first run
20:45:31.761 INF Run:805 | input token num : 596, prefill_split_num : 5
20:45:31.761 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
20:45:31.761 INF Run:868 | prefill indices shape: p=0 idx_elems=384 idx_rows=3 pos_rows=3
20:45:31.815 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=128
20:45:31.816 INF Run:868 | prefill indices shape: p=1 idx_elems=384 idx_rows=3 pos_rows=3
20:45:31.873 INF Run:845 | prefill chunk p=2 history_len=256 grpid=3 kv_cache_num=256 input_tokens=128
20:45:31.873 INF Run:868 | prefill indices shape: p=2 idx_elems=384 idx_rows=3 pos_rows=3
20:45:31.937 INF Run:845 | prefill chunk p=3 history_len=384 grpid=4 kv_cache_num=384 input_tokens=128
20:45:31.937 INF Run:868 | prefill indices shape: p=3 idx_elems=384 idx_rows=3 pos_rows=3
20:45:32.006 INF Run:845 | prefill chunk p=4 history_len=512 grpid=5 kv_cache_num=512 input_tokens=84
20:45:32.006 INF Run:868 | prefill indices shape: p=4 idx_elems=384 idx_rows=3 pos_rows=3
20:45:32.088 INF Run:1010 | ttft: 327.20 ms
James Landay-VR
14175
20:45:32.374 NTC Run:1132 | hit eos,avg 38.47 token/s
20:45:32.374 INF GetKVCache:721 | precompute_len:597, remaining:43
```
Start the server (OpenAI-compatible)
```shell
axllm serve AXERA-TECH/PaddleOCR-VL-1.5/
```
Reference log:
```shell
root@ax650 # axllm serve AXERA-TECH/PaddleOCR-VL-1.5/  # make sure you are using the latest axllm
20:47:54.027 INF Init:218 | LLM init start
tokenizer_type = 0
95% | ############################## | 20 / 21 [1.96s<2.05s, 10.22 count/s] init post axmodel ok,remain_cmm(4330 MB)
20:47:55.983 INF Init:368 | max_token_len : 2047
20:47:55.983 INF Init:371 | kv_cache_size : 256, kv_cache_num: 2047
20:47:55.983 INF Init:374 | prefill_token_num : 128
20:47:55.983 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
20:47:55.983 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
20:47:55.983 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
20:47:55.983 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
20:47:55.983 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
20:47:55.983 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
20:47:55.983 INF Init:384 | prefill_max_token_num : 640
20:47:55.983 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 21 / 21 [1.96s<1.96s, 10.72 count/s] embed_selector init ok
20:47:56.526 INF Init:453 | PaddleOCRVL: encoder input nSize=5334336 -> eff_nSize=1333584 (float32 input)
20:47:56.526 WRN Init:469 | Qwen-VL vision size override: cfg=448x448 bytes=602112, model_input_bytes=5334336 -> 756x588 (factor-search).
20:47:56.526 INF Init:661 | PaddleOCR-VL token ids: vision_start=101305 image_pad=100295 video_pad=100295
20:47:56.526 INF Init:686 | VisionModule init ok: type=PaddleOCRVL, tokens_per_block=567, embed_size=1024, out_dtype=fp32
20:47:56.526 WRN Init:695 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
20:47:56.528 INF load_config:282 | load config:
20:47:56.528 INF load_config:282 | {
20:47:56.528 INF load_config:282 | "enable_repetition_penalty": false,
20:47:56.528 INF load_config:282 | "enable_temperature": false,
20:47:56.528 INF load_config:282 | "enable_top_k_sampling": false,
20:47:56.528 INF load_config:282 | "enable_top_p_sampling": false,
20:47:56.528 INF load_config:282 | "penalty_window": 20,
20:47:56.528 INF load_config:282 | "repetition_penalty": 1.0,
20:47:56.528 INF load_config:282 | "temperature": 0.6,
20:47:56.528 INF load_config:282 | "top_k": 1,
20:47:56.528 INF load_config:282 | "top_p": 0.9
20:47:56.528 INF load_config:282 | }
20:47:56.528 INF Init:448 | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/PaddleOCR-VL-1.5'...
API URLs:
  GET  http://127.0.0.1:8000/health
  GET  http://127.0.0.1:8000/v1/models
  POST http://127.0.0.1:8000/v1/chat/completions
  GET  http://10.168.232.217:8000/health
  GET  http://10.168.232.217:8000/v1/models
  POST http://10.168.232.217:8000/v1/chat/completions
  GET  http://172.17.0.1:8000/health
  GET  http://172.17.0.1:8000/v1/models
  POST http://172.17.0.1:8000/v1/chat/completions
Aliases:
  GET  http://127.0.0.1:8000/models
  POST http://127.0.0.1:8000/chat/completions
  GET  http://10.168.232.217:8000/models
  POST http://10.168.232.217:8000/chat/completions
  GET  http://172.17.0.1:8000/models
  POST http://172.17.0.1:8000/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/PaddleOCR-VL-1.5
```
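Once the server is up, you can query it with any OpenAI-compatible client. Below is a minimal stdlib-only sketch that posts an OCR request with a base64-encoded image to the `/v1/chat/completions` endpoint shown in the log above. The multimodal message shape (an `image_url` content part with a data URL) follows the common OpenAI vision convention and is an assumption here; adjust it if axllm expects a different schema.

```python
import base64
import json
import urllib.request

# Endpoint taken from the server log above; change the host/port as needed.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"

def build_ocr_request(image_path, prompt="OCR:"):
    """Build an OpenAI-style chat payload embedding the image as a base64 data URL.
    NOTE: the exact multimodal format accepted by axllm serve is assumed, not
    documented in this card."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "AXERA-TECH/PaddleOCR-VL-1.5",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

def chat(image_path):
    """POST the request and return the generated text."""
    payload = json.dumps(build_ocr_request(image_path)).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server running on the board):
# print(chat("assets/IMG_0462.JPG"))
```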
Python Inference (optional)
The Python inference scripts are kept in the python/ directory and can be used to run inference tests on a PC or on the dev board:
```shell
python3 python/infer_axmodel.py \
    --hf_model ./python/paddleocr_vl_1-5_tokenizer \
    --axmodel_path . \
    --vit_model_path ./vit_576x768.axmodel \
    --image_path ./assets/IMG_0462.JPG \
    --task ocr
```
Supported task types (--task):
- ocr — general text recognition
- table — table recognition
- chart — chart recognition
- formula — formula recognition
- spotting — text detection (spotting)
- seal — seal recognition
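To run one image through every task, you can wrap the invocation above in a small loop. This is a hypothetical convenience sketch, not part of the repo; it only mirrors the command-line flags shown above and assumes the model directory is the current working directory.

```python
import subprocess

# The six task types listed above.
TASKS = ["ocr", "table", "chart", "formula", "spotting", "seal"]

def make_cmd(task, image):
    """Assemble the infer_axmodel.py command line, mirroring the example above."""
    return [
        "python3", "python/infer_axmodel.py",
        "--hf_model", "./python/paddleocr_vl_1-5_tokenizer",
        "--axmodel_path", ".",
        "--vit_model_path", "./vit_576x768.axmodel",
        "--image_path", image,
        "--task", task,
    ]

# Preview the commands; to actually run one:
#   subprocess.run(make_cmd(task, image), check=True)
for task in TASKS:
    print(" ".join(make_cmd(task, "./assets/IMG_0462.JPG")))
```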
End-to-End Metrics (AX650N)
| Metric | Value |
|---|---|
| Max TTFT (640 tokens) | 361.8 ms |
| Decode speed | 44.6 tokens/s |
| ViT latency (576x768) | 1685.554 ms |
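From the numbers in the table, the latency of a single image request can be roughly estimated as vision encoding plus prefill plus decoding time. A back-of-the-envelope sketch (it ignores tokenization, image preprocessing, and I/O overhead, so treat it as a lower bound):

```python
# Figures from the End-to-End Metrics table above (AX650N, 576x768 input).
VIT_MS = 1685.554        # one-time vision encoder pass
TTFT_MS = 361.8          # worst-case prefill (640 tokens)
DECODE_TOK_PER_S = 44.6  # steady-state decode speed

def estimate_ms(output_tokens):
    """Rough total latency for one image request producing `output_tokens`."""
    return VIT_MS + TTFT_MS + output_tokens * 1000.0 / DECODE_TOK_PER_S

print(f"{estimate_ms(100):.0f} ms for 100 output tokens")
```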
Subgraph Latency
| Stage | Subgraph | Latency |
|---|---|---|
| Prefill | g1 | 2.551 ms |
| Prefill | g2 | 2.883 ms |
| Prefill | g3 | 3.158 ms |
| Prefill | g4 | 3.413 ms |
| Prefill | g5 | 3.795 ms |
| Prefill | g6 | 4.007 ms |
| Decode | g0 | 0.949 ms |
| Post-process | - | 5.313 ms |
| ViT | - | 1685.554 ms |
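As a rough cross-check, the table's decode speed can be reconstructed from the per-subgraph latencies: one decode step runs the g0 subgraph once per decoder layer (18 layers, l0 through l17 per the file listing) plus one post-process pass. The assumption that per-step cost is simply the sum of these latencies ignores embedding lookup and scheduling overhead, which accounts for the small gap to the measured 44.6 tokens/s:

```python
# Per-subgraph latencies from the table above.
DECODE_G0_MS = 0.949  # one decoder layer, one token
POST_MS = 5.313       # post-process per token
NUM_LAYERS = 18       # l0..l17 in the model directory

step_ms = NUM_LAYERS * DECODE_G0_MS + POST_MS
print(f"{step_ms:.2f} ms/step -> {1000 / step_ms:.1f} tokens/s")
```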
Discussion
- GitHub Issues
- QQ Group: 139953715
Model tree for AXERA-TECH/PaddleOCR-VL-1.5
- Base model: baidu/ERNIE-4.5-0.3B-Paddle
- Finetuned: PaddlePaddle/PaddleOCR-VL-1.5