Qwen3-VL-4B-Instruct-GPTQ-Int4

This version of Qwen3-VL-4B-Instruct has been converted to run on the Axera NPU using w4a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can export the axmodel yourself starting from the original repository:

- Pulsar2: How to Convert LLM from Huggingface to axmodel
- AXera NPU HOST LLM Runtime

Supported Platforms

Image Processing

| Chip | Input size | Image num | Image encoder | TTFT (168 tokens) | w4a16 decode | CMM | Flash |
|------|------------|-----------|---------------|-------------------|--------------|-----|-------|
| AX650 | 384×384 | 1 | 222 ms | 678 ms | 7.0 tokens/s | 5.6 GiB | 5.6 GiB |

Video Processing

| Chip | Input size | Image num | Image encoder | TTFT (600 tokens) | w4a16 decode | CMM | Flash |
|------|------------|-----------|---------------|-------------------|--------------|-----|-------|
| AX650 | 384×384 | 8 | 773 ms | 1887 ms | 7.1 tokens/s | 5.6 GiB | 5.6 GiB |

Image Processing (Image Encoder U8+U16 Quantization)

| Chip | Input size | Image num | Image encoder | TTFT (168 tokens) | w4a16 decode | CMM | Flash |
|------|------------|-----------|---------------|-------------------|--------------|-----|-------|
| AX650 | 384×384 | 1 | 143 ms | 678 ms | 7.0 tokens/s | 5.6 GiB | 5.6 GiB |

Video Processing (Image Encoder U8+U16 Quantization)

| Chip | Input size | Image num | Image encoder | TTFT (600 tokens) | w4a16 decode | CMM | Flash |
|------|------------|-----------|---------------|-------------------|--------------|-----|-------|
| AX650 | 384×384 | 8 | 498 ms | 1887 ms | 7.1 tokens/s | 5.6 GiB | 5.6 GiB |
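The U8+U16 encoder quantization only changes the image-encoder latency; TTFT contribution aside, decode speed and memory stay the same. From the numbers above, the encoder speedup works out as:

```python
# Image-encoder latency taken from the tables above (ms)
fp_image, u8_image = 222, 143   # single-image case
fp_video, u8_video = 773, 498   # 8-frame video case

print(f"image encoder speedup: {fp_image / u8_image:.2f}x")  # ≈ 1.55x
print(f"video encoder speedup: {fp_video / u8_video:.2f}x")  # ≈ 1.55x
```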

The DDR figure above refers to the CMM memory that will be consumed. Make sure the CMM memory allocated on the development board is larger than this value.
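As a rough sketch of that check: the runtime logs the remaining CMM in lines like `remain_cmm(6133 MB)` (see the transcripts below), which can be parsed and compared against the ~5.6 GiB this model needs. The parsing helper below is illustrative, not part of the toolchain:

```python
import re

def parse_remain_cmm_mb(text: str) -> int:
    """Extract remaining CMM size in MB from a log line
    such as 'init post axmodel ok,remain_cmm(6133 MB)'."""
    m = re.search(r"remain[_ ]?cmm\D*(\d+)\s*MB", text, re.IGNORECASE)
    if not m:
        raise ValueError("no remain_cmm value found")
    return int(m.group(1))

REQUIRED_MB = int(5.6 * 1024)  # ~5.6 GiB from the tables above

line = "init post axmodel ok,remain_cmm(6133 MB)"
remain = parse_remain_cmm_mb(line)
print(f"remaining CMM: {remain} MB, enough: {remain >= REQUIRED_MB}")
```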

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: install with a single command (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the prebuilt executable exported by GitHub Actions CI (for users without a build environment):

Download the latest CI-exported executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

Create the model directory and download the model into it:

mkdir -p AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4
hf download AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4 --local-dir AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4

# structure of the downloaded files
tree -L 3
.
└── AXERA-TECH
    └── Qwen3-VL-4B-Instruct-GPTQ-Int4
        ├── Qwen3-VL-4B-Instruct_vision.axmodel
        ├── Qwen3-VL-4B-Instruct_vision_u8.axmodel
        ├── README.md
        ├── config.json
        ├── images
        ├── model.embed_tokens.weight.bfloat16.bin
        ├── post_config.json
        ├── qwen3_tokenizer.txt
        ├── qwen3_vl_text_p128_l0_together.axmodel
        ...
        ├── qwen3_vl_text_p128_l9_together.axmodel
        ├── qwen3_vl_text_post.axmodel
        ├── requirements.txt
        └── video

4 directories, 45 files

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO board

Run (CLI)

(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4/
20:13:34.015 INF Init:218 | LLM init start
tokenizer_type = 1
 97% | ###############################  |  38 /  39 [11.25s<11.54s, 3.38 count/s] init post axmodel ok,remain_cmm(6133 MB)
20:13:45.263 INF Init:368 | max_token_len : 2047
20:13:45.263 INF Init:371 | kv_cache_size : 1024, kv_cache_num: 2047
20:13:45.263 INF Init:374 | prefill_token_num : 128
20:13:45.263 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
20:13:45.263 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
20:13:45.263 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
20:13:45.263 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
20:13:45.263 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
20:13:45.263 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
20:13:45.263 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 768
20:13:45.263 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 896
20:13:45.263 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1024
20:13:45.263 INF Init:379 | grp: 10, prefill_max_kv_cache_num : 1152
20:13:45.263 INF Init:384 | prefill_max_token_num : 1152
20:13:45.263 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  39 /  39 [11.25s<11.25s, 3.47 count/s] embed_selector init ok
20:13:47.224 WRN Init:511 | Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
20:13:47.224 INF Init:695 | Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
20:13:47.224 INF Init:728 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2560, out_dtype=fp32
20:13:47.224 INF Init:734 | VisionModule deepstack enabled: layers=3
20:13:47.224 INF load_config:282 | load config:
20:13:47.224 INF load_config:282 | {
20:13:47.224 INF load_config:282 |     "enable_repetition_penalty": false,
20:13:47.224 INF load_config:282 |     "enable_temperature": false,
20:13:47.224 INF load_config:282 |     "enable_top_k_sampling": false,
20:13:47.224 INF load_config:282 |     "enable_top_p_sampling": false,
20:13:47.224 INF load_config:282 |     "penalty_window": 20,
20:13:47.224 INF load_config:282 |     "repetition_penalty": 1.2,
20:13:47.224 INF load_config:282 |     "temperature": 0.9,
20:13:47.224 INF load_config:282 |     "top_k": 10,
20:13:47.224 INF load_config:282 |     "top_p": 0.8
20:13:47.224 INF load_config:282 | }
20:13:47.224 INF Init:448 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kv cache
  /dd        delete one round of dialogue
  /pp        print the dialogue history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe the image
image >> ./AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4/images/ssd_car.jpg
20:14:13.430 INF EncodeForContent:1121 | Qwen-VL pixel_values[0] bytes=884736 min=0 max=255 (w=384 h=384 tp=2 ps=16 sm=2)
20:14:13.594 INF EncodeForContent:1144 | vision cache store: ./AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4/images/ssd_car.jpg
20:14:13.616 INF SetKVCache:749 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
20:14:13.616 INF SetKVCache:757 | current prefill_max_token_num:1152
20:14:13.616 INF SetKVCache:760 | first run
20:14:13.618 INF Run:818 | input token num : 168, prefill_split_num : 2
20:14:13.618 INF Run:858 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
20:14:13.618 INF Run:881 | prefill indices shape: p=0 idx_elems=384 idx_rows=3 pos_rows=3
20:14:13.940 INF Run:858 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
20:14:13.940 INF Run:881 | prefill indices shape: p=1 idx_elems=384 idx_rows=3 pos_rows=3
20:14:14.295 INF Run:1023 | ttft: 677.29 ms
This is a vibrant street photograph taken in a city, likely London, featuring a classic red double-decker bus as the central subject.

**Key elements in the image:**

- **The Bus:** A bright red, vintage-style double-decker bus, which is a hallmark of London's public transport. The bus is parked or stopped on the street. A prominent advertisement is visible on its side: “WHEN YOU SAY ‘YES’” above the website “WIXMONEY.COM”. The bus has a classic design with large windows and ornate architectural details on its upper deck.

- **The Setting:** The background consists of tall, ornate, multi-story buildings with traditional European architecture, featuring large windows, stone facades, and decorative balconies. This strongly suggests a central or affluent district in a major European city.

- **The Person:** In the foreground, a person (likely a woman) is standing on the sidewalk, looking up at the bus. She is wearing a dark coat and a light-colored hat or head covering, and she is holding a small, light-colored handbag. Her posture and gaze suggest she is observing the bus or the scene.

- **The Atmosphere:** The photo has a bright, clear, and cheerful quality, with natural daylight illuminating the scene. The colors are vivid, especially the red of the bus, which stands out against the more muted tones of the buildings and the person’s clothing.

- **The Composition:** The image is framed to capture the bus and the surrounding architecture, with the person adding a human element and a sense of scale. The perspective is slightly elevated, looking down at the bus and the street.

Overall, the image captures a moment of urban life, blending the iconic imagery of a city bus with the everyday activity of a pedestrian, all set against a backdrop of classic architecture.

20:15:12.812 NTC Run:1145 | hit eos,avg 6.37 token/s
20:15:12.813 INF GetKVCache:721 | precompute_len:409, remaining:743
prompt >> how many people in the image?
image >>
20:15:33.058 INF EncodeForContent:1057 | vision cache hit (mem): ./AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4/images/ssd_car.jpg
20:15:33.067 INF SetKVCache:749 | prefill_grpid:5 kv_cache_num:512 precompute_len:409 input_num_token:17
20:15:33.067 INF SetKVCache:757 | current prefill_max_token_num:640
20:15:33.068 INF Run:818 | input token num : 17, prefill_split_num : 1
20:15:33.068 INF Run:858 | prefill chunk p=0 history_len=409 grpid=5 kv_cache_num=512 input_tokens=17
20:15:33.068 INF Run:881 | prefill indices shape: p=0 idx_elems=384 idx_rows=3 pos_rows=3
20:15:33.502 INF Run:1023 | ttft: 433.86 ms
Based on the image provided, there is **one person** clearly visible in the foreground — the woman standing on the sidewalk, looking up at the bus. She is the only person explicitly depicted in the photograph.

There may be other people on the bus or in the background, but they are not visible or identifiable in the image. Therefore, the answer is:

> **One person.**

20:15:45.526 NTC Run:1145 | hit eos,avg 6.49 token/s
20:15:45.526 INF GetKVCache:721 | precompute_len:503, remaining:649
prompt >> /q

Start the server (OpenAI-compatible)

(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4/
20:18:10.375 INF Init:218 | LLM init start
tokenizer_type = 1
 97% | ###############################  |  38 /  39 [6.45s<6.62s, 5.89 count/s] init post axmodel ok,remain_cmm(6133 MB)
20:18:16.826 INF Init:368 | max_token_len : 2047
20:18:16.826 INF Init:371 | kv_cache_size : 1024, kv_cache_num: 2047
20:18:16.826 INF Init:374 | prefill_token_num : 128
20:18:16.826 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
20:18:16.826 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
20:18:16.826 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
20:18:16.826 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
20:18:16.826 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
20:18:16.826 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
20:18:16.826 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 768
20:18:16.826 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 896
20:18:16.826 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1024
20:18:16.826 INF Init:379 | grp: 10, prefill_max_kv_cache_num : 1152
20:18:16.826 INF Init:384 | prefill_max_token_num : 1152
20:18:16.826 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  39 /  39 [6.45s<6.45s, 6.05 count/s] embed_selector init ok
20:18:17.190 WRN Init:511 | Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
20:18:17.191 INF Init:695 | Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
20:18:17.191 INF Init:728 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2560, out_dtype=fp32
20:18:17.191 INF Init:734 | VisionModule deepstack enabled: layers=3
20:18:17.191 INF load_config:282 | load config:
20:18:17.191 INF load_config:282 | {
20:18:17.191 INF load_config:282 |     "enable_repetition_penalty": false,
20:18:17.191 INF load_config:282 |     "enable_temperature": false,
20:18:17.191 INF load_config:282 |     "enable_top_k_sampling": false,
20:18:17.191 INF load_config:282 |     "enable_top_p_sampling": false,
20:18:17.191 INF load_config:282 |     "penalty_window": 20,
20:18:17.191 INF load_config:282 |     "repetition_penalty": 1.2,
20:18:17.191 INF load_config:282 |     "temperature": 0.9,
20:18:17.191 INF load_config:282 |     "top_k": 10,
20:18:17.191 INF load_config:282 |     "top_p": 0.8
20:18:17.191 INF load_config:282 | }
20:18:17.191 INF Init:448 | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4'...
API URLs:
  GET  http://127.0.0.1:8000/health
  GET  http://127.0.0.1:8000/v1/models
  POST http://127.0.0.1:8000/v1/chat/completions
  GET  http://10.126.35.203:8000/health
  GET  http://10.126.35.203:8000/v1/models
  POST http://10.126.35.203:8000/v1/chat/completions
  GET  http://172.18.0.1:8000/health
  GET  http://172.18.0.1:8000/v1/models
  POST http://172.18.0.1:8000/v1/chat/completions
  GET  http://172.17.0.1:8000/health
  GET  http://172.17.0.1:8000/v1/models
  POST http://172.17.0.1:8000/v1/chat/completions
Aliases:
  GET  http://127.0.0.1:8000/models
  POST http://127.0.0.1:8000/chat/completions
  GET  http://10.126.35.203:8000/models
  POST http://10.126.35.203:8000/chat/completions
  GET  http://172.18.0.1:8000/models
  POST http://172.18.0.1:8000/chat/completions
  GET  http://172.17.0.1:8000/models
  POST http://172.17.0.1:8000/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4
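The endpoints above can be called with any HTTP client, not just the `openai` package. This sketch only builds the request body for `POST /v1/chat/completions`; the field names follow the OpenAI chat-completions schema that the server mimics:

```python
import json

# Request body for POST http://<board-ip>:8000/v1/chat/completions
payload = {
    "model": "AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4",
    "messages": [
        {"role": "user", "content": "hello"},
    ],
    "stream": False,
}
body = json.dumps(payload)
print(body)

# Equivalent call against a running board, e.g.:
#   curl -s http://127.0.0.1:8000/v1/chat/completions \
#        -H 'Content-Type: application/json' -d "$BODY"
```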

OpenAI client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)

OpenAI streaming client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-VL-4B-Instruct-GPTQ-Int4"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print()
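The CLI session above passes an image path alongside each prompt. Over an OpenAI-compatible API, image input is conventionally sent as an `image_url` content part carrying a base64 data URL; whether this server accepts exactly that shape is an assumption based on the OpenAI schema, so treat this as a sketch of building such a message:

```python
import base64

def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a user message with a text part and a base64 data-URL image part
    (OpenAI chat-completions content-part convention)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Dummy bytes stand in for a real image file read with open(path, "rb").read()
msg = image_message("describe the image", b"\xff\xd8\xff")
print(msg["content"][1]["image_url"]["url"][:22])
```

Pass the resulting message in the `messages` list of `client.chat.completions.create(...)` exactly like the text-only examples above.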