This version of Qwen3.5-0.8B has been converted to run on the Axera NPU using w4a16 quantization.
Compatible with Pulsar2 version: 5.0
If you are interested in model conversion, you can export the axmodel yourself from the original repo; see the Pulsar2 Link and the guide How to Convert LLM from Huggingface to axmodel.
Image Process
| Chips | Input Size | Image Num | TTFT (168 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|---|---|---|---|---|---|---|
| AX650 | 384×384 | 1 | 332 ms | 21.5 tokens/sec | 2.47 GiB | 2.9 GiB |
Video Process
| Chips | Input Size | Image Num | TTFT (600 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|---|---|---|---|---|---|---|
| AX650 | 384×384 | 8 | 838 ms | 21.8 tokens/sec | 2.47 GiB | 2.9 GiB |
The CMM Memory column is the amount of CMM (DDR) memory the model consumes at runtime. Make sure the CMM memory allocation on the development board is larger than this value.
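As a rough illustration of how the table numbers combine (an estimate, not a benchmark): end-to-end latency is approximately the TTFT plus the number of decoded tokens divided by the decode throughput.

```python
# Rough latency estimate from the benchmark table (illustrative only).
# Assumes decode throughput stays constant over the generation, which is
# an approximation -- real throughput varies with context length.

def estimate_latency_s(ttft_ms: float, out_tokens: int, tokens_per_s: float) -> float:
    """TTFT plus steady-state decode time, in seconds."""
    return ttft_ms / 1000.0 + out_tokens / tokens_per_s

# AX650 image case from the table: TTFT 332 ms, 21.5 tokens/s.
t = estimate_latency_s(332, 100, 21.5)
print(f"~{t:.1f} s for a 100-token reply")  # ~5.0 s
```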
Method 1: clone the repo and run the install script:
```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```
Method 2: one-line install (default branch axllm):
```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```
Method 3: download the prebuilt executable exported by GitHub Actions CI (for users without a build environment). Download the latest CI-exported executable (axllm) from
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
then:
```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```
First create the model directory and enter it, then download the model into it:
```shell
mkdir -p AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
cd AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
hf download AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047 --local-dir .
```
```
# structure of the downloaded files
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache
3 directories, 39 files
```
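Per the listing above, the text model is split into per-layer files (l0 through l23). A quick sanity check that the download is complete (the filename pattern below mirrors the listing; adjust if your export differs):

```python
# Verify that all per-layer text-model files are present after download.
from pathlib import Path

def missing_layers(model_dir: str, num_layers: int = 24) -> list:
    """Return the names of expected per-layer axmodel files that are absent."""
    d = Path(model_dir)
    expected = [f"qwen3_5_text_p128_l{i}_together.axmodel" for i in range(num_layers)]
    return [name for name in expected if not (d / name).exists()]

missing = missing_layers("AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047")
print("download complete" if not missing else f"missing: {missing}")
```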
root@ax650 ~/yongqiang/lhj/Qwen3_5.AXERA/ax-llm # axllm run Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047/
19:14:47.144 INF Init:218 | LLM init start
19:14:47.144 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
96% | ############################## | 26 / 27 [28.70s<29.80s, 0.91 count/s] init post axmodel ok,remain_cmm(5497 MB)
19:15:15.845 INF Init:368 | max_token_len : 2047
19:15:15.845 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
19:15:15.845 INF Init:374 | prefill_token_num : 128
19:15:15.845 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
19:15:15.845 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
19:15:15.845 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
19:15:15.845 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
19:15:15.845 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
19:15:15.845 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 768
19:15:15.845 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 896
19:15:15.845 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 1024
19:15:15.845 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1152
19:15:15.845 INF Init:384 | prefill_max_token_num : 1152
19:15:15.845 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 27 / 27 [28.71s<28.71s, 0.94 count/s] embed_selector init ok
19:15:17.168 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
19:15:17.168 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=1024, out_dtype=fp32
19:15:17.168 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
19:15:17.173 INF load_config:282 | load config:
19:15:17.173 INF load_config:282 | {
19:15:17.173 INF load_config:282 | "enable_repetition_penalty": false,
19:15:17.173 INF load_config:282 | "enable_temperature": false,
19:15:17.173 INF load_config:282 | "enable_top_k_sampling": true,
19:15:17.173 INF load_config:282 | "enable_top_p_sampling": false,
19:15:17.173 INF load_config:282 | "penalty_window": 20,
19:15:17.173 INF load_config:282 | "repetition_penalty": 1.2,
19:15:17.173 INF load_config:282 | "temperature": 0.9,
19:15:17.173 INF load_config:282 | "top_k": 10,
19:15:17.173 INF load_config:282 | "top_p": 0.8
19:15:17.173 INF load_config:282 | }
19:15:17.173 INF Init:448 | LLM init ok
Commands:
/q, /exit quit
/reset reset the kvcache
/dd delete one dialogue round
/pp print the dialogue history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> Describe the image content
image >> image.png
19:16:17.643 INF EncodeForContent:973 | Qwen-VL pixel_values[0] bytes=884736 min=0 max=238 (w=384 h=384 tp=2 ps=16 sm=2)
19:16:17.689 INF EncodeForContent:996 | vision cache store: image.png
19:16:17.719 INF SetKVCache:747 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
19:16:17.719 INF SetKVCache:749 | current prefill_max_token_num:1152
19:16:17.719 INF SetKVCache:750 | first run
19:16:17.766 INF Run:805 | input token num : 168, prefill_split_num : 2
19:16:17.766 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
19:16:17.766 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
19:16:17.921 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
19:16:17.921 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
19:16:18.098 INF Run:1010 | ttft: 332.49 ms
<think>
</think>
This image depicts three astronauts in a natural setting (possibly a forest or garden). They look as if they were in a space environment.
**Detailed description:**
- **Foreground (bottom):** an expanse of green plants, possibly grass, shrubs, or undergrowth.
- **Middle ground (middle):**
  - Three astronauts stand side by side, facing different directions.
  - The astronaut in the middle stands facing front-left, wearing a white spacesuit with a dark gray helmet.
  - The astronaut on the left also stands, facing left, in a similar spacesuit. Their postures differ slightly, as if caught in different motions.
  - The astronaut on the right stands facing front-right, likewise in a spacesuit.
- **Background:** tall trees and plants form a natural backdrop; their height adds depth and atmosphere to the image.
19:16:27.208 NTC Run:1132 | hit eos,avg 21.52 token/s
19:16:27.210 INF GetKVCache:721 | precompute_len:232, remaining:920
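The post_config.json printed above enables only top-k sampling (top_k=10), with temperature, top-p, and repetition penalty disabled. A minimal sketch of what such a sampler does, assuming standard top-k semantics (an illustration, not the runtime's actual code):

```python
# Minimal top-k sampling sketch matching the printed post_config:
# only top-k is enabled (top_k=10); temperature/top-p/penalty are off.
import math
import random

def sample_top_k(logits, k=10, rng=None):
    """Keep the k highest logits, softmax over them, sample one token id."""
    rng = rng or random.Random()
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # stable softmax numerator
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2]
print(sample_top_k(logits, k=3))  # one of the ids 3, 1, 5
```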
root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
[I][ Init][ 138]: LLM init start
tokenizer_type = 1
96% | ███████████████████████████████ | 30 / 31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][ Init][ 199]: max_token_len : 2047
[I][ Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 205]: prefill_token_num : 128
[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][ Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][ Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][ Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][ Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][ Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][ Init][ 214]: prefill_max_token_num : 1152
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][ Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][ Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][ Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][ Init][ 672]: VisionModule deepstack enabled: layers=3
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": false,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
```python
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

# The local server does not check the API key; any placeholder works.
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)
print(completion.choices[0].message.content)
```
```python
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,  # receive the reply incrementally
)
print("assistant:")
for ev in stream:
    # Each event carries a delta with the next chunk of generated text.
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")
```
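The CLI session above sends an image alongside the prompt. Whether this server's OpenAI-compatible endpoint also accepts image inputs is not shown here, but if it follows the standard OpenAI content-part format (an assumption), the message would be built by embedding the image as a base64 data URL:

```python
# Build an OpenAI-style multimodal message from a local image file.
# Whether axllm's server accepts image_url content parts is an assumption;
# this only demonstrates the standard payload shape.
import base64
from pathlib import Path

def image_message(prompt: str, image_path: str) -> dict:
    """Return a user message carrying the image (as a data URL) plus the text prompt."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }
```

The resulting dict can be passed in `messages` to `client.chat.completions.create` exactly as in the examples above.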