Qwen3.5-4B

This version of Qwen3.5-4B has been converted to run on the Axera NPU using w4a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel from the original repo:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Support Platform

Image Process

| Chips | Input size | Image num | TTFT (168 tokens) | w4a16 | CMM | Flash |
|-------|------------|-----------|-------------------|-------|-----|-------|
| AX650 | 384*384 | 1 | 1551 ms | 5.7 tokens/sec | 3.58 GiB | 5.69 GiB |

Video Process

| Chips | Input size | Image num | TTFT (600 tokens) | w4a16 | CMM | Flash |
|-------|------------|-----------|-------------------|-------|-----|-------|
| AX650 | 384*384 | 8 | 4293 ms | 5.7 tokens/sec | 3.58 GiB | 5.69 GiB |

The DDR capacity above refers to the CMM memory the model consumes. Make sure the CMM memory allocated on the development board is larger than this value.
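To avoid a failure at load time, you can sanity-check the board's remaining CMM before starting. Below is a minimal sketch; the `/proc/ax_proc/mem_cmm_info` path and the `remain=... MB` field format are assumptions about the Axera BSP, so adjust the regex to what your board actually prints:

```python
import re

REQUIRED_MIB = 3.58 * 1024  # CMM needed by this model (from the table above), in MiB

def parse_remain_mib(cmm_info: str) -> int:
    """Extract the remaining CMM size in MiB from a mem_cmm_info dump.

    The 'remain=... MB' field name is an assumption; check your BSP's output.
    """
    m = re.search(r"remain\s*=\s*(\d+)\s*MB", cmm_info)
    if not m:
        raise ValueError("could not find remaining CMM size")
    return int(m.group(1))

# Hypothetical dump for illustration; on a real board you would read
# open("/proc/ax_proc/mem_cmm_info").read() instead.
sample = "total size=7168MB(7340032KB), used=1579MB, remain=5589MB(5723136KB)"

remain = parse_remain_mib(sample)
print(f"remaining CMM: {remain} MiB, required: {REQUIRED_MIB:.0f} MiB")
print("OK" if remain >= REQUIRED_MIB else "NOT ENOUGH CMM")
```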

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the executable exported by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, download the latest CI-exported executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

First create the model directory and enter it, then download the files into it:

mkdir -p AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047
cd AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047
hf download AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047 --local-dir .

# structure of the downloaded files
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache

3 directories, 39 files
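Before running, you can check that the download completed. The sketch below is a hypothetical helper; the file names are taken from the tree listing above, and the demo runs against a throwaway directory rather than a real download:

```python
from pathlib import Path
import tempfile

# A few files every complete download should contain (see the tree above).
REQUIRED = [
    "config.json",
    "post_config.json",
    "qwen3_5_tokenizer.txt",
    "qwen3_5_vision.axmodel",
    "qwen3_5_text_post.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

# Demo on a temporary directory: create all but one file, then report the gap.
with tempfile.TemporaryDirectory() as d:
    for name in REQUIRED[:-1]:
        (Path(d) / name).touch()
    print(missing_files(d))  # prints the one file we skipped
```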

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO board

Run (CLI)

root@ax650 ~/yongqiang/lhj/Qwen3_5.AXERA/ax-llm # axllm run Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047/
11:40:21.412 INF Init:218 | LLM init start
11:40:21.412 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
 96% | ##############################   |  26 /  27 [3.78s<3.93s, 6.87 count/s] init post axmodel ok,remain_cmm(5589 MB)
11:40:25.195 INF Init:368 | max_token_len : 2047
11:40:25.195 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
11:40:25.195 INF Init:374 | prefill_token_num : 128
11:40:25.195 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
11:40:25.195 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
11:40:25.195 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
11:40:25.195 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
11:40:25.195 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
11:40:25.195 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 768
11:40:25.195 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 1024
11:40:25.195 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 1152
11:40:25.195 INF Init:384 | prefill_max_token_num : 1152
11:40:25.195 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  27 /  27 [3.79s<3.79s, 7.13 count/s] embed_selector init ok
11:40:25.478 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
11:40:25.478 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
11:40:25.478 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
11:40:25.480 INF load_config:282 | load config: 
11:40:25.480 INF load_config:282 | {
11:40:25.480 INF load_config:282 |     "enable_repetition_penalty": false,
11:40:25.480 INF load_config:282 |     "enable_temperature": false,
11:40:25.480 INF load_config:282 |     "enable_top_k_sampling": true,
11:40:25.480 INF load_config:282 |     "enable_top_p_sampling": false,
11:40:25.480 INF load_config:282 |     "penalty_window": 20,
11:40:25.480 INF load_config:282 |     "repetition_penalty": 1.2,
11:40:25.480 INF load_config:282 |     "temperature": 0.9,
11:40:25.480 INF load_config:282 |     "top_k": 10,
11:40:25.480 INF load_config:282 |     "top_p": 0.8
11:40:25.480 INF load_config:282 | }
11:40:25.480 INF Init:448 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kv cache
  /dd        delete one dialogue round
  /pp        print the conversation history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> who are you
image >>
11:43:59.538 INF SetKVCache:747 | prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:22
11:43:59.538 INF SetKVCache:749 | current prefill_max_token_num:1152
11:43:59.538 INF SetKVCache:750 | first run
11:43:59.539 INF Run:805 | input token num : 22, prefill_split_num : 1
11:43:59.539 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
11:43:59.539 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
11:43:59.894 INF Run:1010 | ttft: 355.40 ms
<think>

</think>

I am **Qwen3.5**, the latest large language model developed by Tongyi Lab. I have a strong foundation in knowledge and language understanding, and I support multiple languages including English, and over 100 others, allowing me to handle queries from different regions. Whether you need help with complex tasks, creative writing, logical reasoning, or just want to chat and explore, I'm here to assist you!

11:44:09.734 NTC Run:1132 | hit eos,avg 9.04 token/s
11:44:09.734 INF GetKVCache:721 | precompute_len:111, remaining:1041
prompt >> describe the image
image >> ./AXERA-TECH/Qwen3-VL-4B-Instruct-AX650-c128_p1152-int4/image.png
11:45:32.726 INF EncodeForContent:973 | Qwen-VL pixel_values[0] bytes=884736 min=0 max=238 (w=384 h=384 tp=2 ps=16 sm=2)
11:45:32.877 INF EncodeForContent:996 | vision cache store: image.png
11:45:32.903 INF SetKVCache:747 | prefill_grpid:4 kv_cache_num:384 precompute_len:111 input_num_token:159
11:45:32.904 INF SetKVCache:749 | current prefill_max_token_num:1024
11:45:32.904 INF Run:805 | input token num : 159, prefill_split_num : 2
11:45:32.904 INF Run:845 | prefill chunk p=0 history_len=111 grpid=2 kv_cache_num=128 input_tokens=128
11:45:32.904 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
11:45:33.258 INF Run:845 | prefill chunk p=1 history_len=239 grpid=3 kv_cache_num=256 input_tokens=31
11:45:33.258 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
11:45:33.660 INF Run:1010 | ttft: 755.85 ms
<think>

</think>

This image is a striking, surreal blend of science fiction and nature that evokes a sense of otherworldly discovery or a dreamlike state. Here's a detailed description:

- **Subjects**: Three figures in full white astronaut suits dominate the scene. Their gear includes bulky helmets with reflective visors, thick life-support canisters strapped to each side, and integrated backpacks or external tanks. The suits appear futuristic and functional, lacking the sleek lines of modern spacecraft—they’re more rugged and utilitarian.

- **Setting**: The suitmen stand in a dense, monochromatic forest. The trees are tall and slender, their trunks appearing almost silhouetted, and the canopy is thick enough to filter out most sunlight. The forest floor is obscured but suggests earthy tones and undergrowth. Mist or fog clings to the lower areas, adding to the mysterious atmosphere.

- **Color Palette**: The dominant colors are shades of gray, beige, and muted olive, creating a neutral, desaturated tone that draws focus to the astronauts. The white of the suits contrasts sharply against the dark greens and browns of the forest, making them pop as silent sentinels within the environment.

- **Atmosphere & Mood**: The scene feels tranquil yet enigmatic. There’s no indication of action or movement from the figures—it’s a still frame caught in a moment of pause. The mood is contemplative, almost melancholic, as if these explorers have returned from a long journey or are observing something beyond their understanding.

- **Style**: The image has a cinematic, painterly quality, similar to digital art or concept art designed for media like video games or films. It avoids hyper-realism in favor of stylized composition and symbolic imagery.

In essence, this image captures the juxtaposition of human ambition (represented by the astronauts) with the quiet majesty of nature, suggesting themes of exploration, solitude, and the unknown.

11:46:17.731 NTC Run:1132 | hit eos,avg 9.08 token/s
11:46:17.732 INF GetKVCache:721 | precompute_len:538, remaining:614
prompt >> how many people in the image?
image >> 
11:46:58.653 INF EncodeForContent:928 | vision cache hit (mem): image.png
11:46:58.663 INF SetKVCache:747 | prefill_grpid:6 kv_cache_num:768 precompute_len:538 input_num_token:17
11:46:58.664 INF SetKVCache:749 | current prefill_max_token_num:512
11:46:58.664 INF Run:805 | input token num : 17, prefill_split_num : 1
11:46:58.664 INF Run:845 | prefill chunk p=0 history_len=538 grpid=6 kv_cache_num=768 input_tokens=17
11:46:58.664 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
11:46:59.155 INF Run:1010 | ttft: 491.45 ms
<think>

</think>

There are **three** people in the image.

All three figures are wearing identical futuristic space suits (resembling the *Battlemech* from *Cyberpunk 2077*), standing in a forest setting. Since they are all part of the same group, count each distinct individual.

11:47:06.292 NTC Run:1132 | hit eos,avg 9.11 token/s
11:47:06.292 INF GetKVCache:721 | precompute_len:556, remaining:596
prompt >> /q

Start the server (OpenAI-compatible)

root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][                            Init][ 199]: max_token_len : 2047
[I][                            Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][                            Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][                            Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][                            Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][                            Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][                            Init][ 214]: prefill_max_token_num : 1152
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][                            Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][                            Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][                            Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][                            Init][ 672]: VisionModule deepstack enabled: layers=3
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047

OpenAI API call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)

OpenAI streaming call example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-4B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")
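The CLI transcript above shows image input, but the examples here are text-only. If the server also implements the standard OpenAI multimodal `image_url` content parts (this is an assumption about the ax-llm build, not something the transcript confirms), a request body for an image prompt could be built like this; the `build_image_message` helper is hypothetical:

```python
import base64

def build_image_message(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style user message with an inline base64 image.

    Whether the ax-llm server accepts `image_url` content parts is an
    assumption; adjust to the API the server actually implements.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }

# With a real client you would pass this message to
# client.chat.completions.create(model=MODEL, messages=[msg]).
msg = build_image_message(b"\x89PNG...", "describe the image")
print(msg["content"][1]["text"])
```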