Qwen3.5-0.8B

This version of Qwen3.5-0.8B has been converted to run on the Axera NPU using w4a16 quantization.

Compatible with Pulsar2 version: 5.0

Convert tools links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

Image Processing

| Chips | Input Size | Image Num | TTFT (168 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|-------|------------|-----------|-------------------|--------------------|------------|--------------|
| AX650 | 384×384    | 1         | 332 ms            | 21.5 tokens/sec    | 2.47 GiB   | 2.9 GiB      |

Video Processing

| Chips | Input Size | Image Num | TTFT (600 tokens) | Throughput (w4a16) | CMM Memory | Flash Memory |
|-------|------------|-----------|-------------------|--------------------|------------|--------------|
| AX650 | 384×384    | 8         | 838 ms            | 21.8 tokens/sec    | 2.47 GiB   | 2.9 GiB      |

The CMM memory column above is the amount of CMM memory the model will consume. Make sure the CMM memory allocated on the development board is larger than this value.
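As a rough sanity check, end-to-end latency can be estimated from the table above as TTFT plus the number of decoded tokens divided by the decode throughput. A minimal sketch (the 100-token reply length is only an illustrative assumption):

```python
# Rough latency estimate from the AX650 image-processing numbers above.
TTFT_S = 0.332          # time to first token, 168-token prompt
THROUGHPUT_TPS = 21.5   # decode speed, w4a16

def estimated_latency(output_tokens: int) -> float:
    """TTFT plus decode time for the requested number of output tokens."""
    return TTFT_S + output_tokens / THROUGHPUT_TPS

# a 100-token reply takes roughly 5 seconds end to end
print(f"{estimated_latency(100):.2f} s")
```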

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the executable exported by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, download the latest CI-exported executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm , then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

First create the model directory and enter it, then download into it:

mkdir -p AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
cd AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
hf download AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047 --local-dir .

# structure of the downloaded files (tree run from the parent directory)
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache

3 directories, 39 files

Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or the AX650N DEMO Board

Run (CLI)

root@ax650 ~/yongqiang/lhj/Qwen3_5.AXERA/ax-llm # axllm run Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047/
19:14:47.144 INF Init:218 | LLM init start
19:14:47.144 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
 96% | ##############################   |  26 /  27 [28.70s<29.80s, 0.91 count/s] init post axmodel ok,remain_cmm(5497 MB)
19:15:15.845 INF Init:368 | max_token_len : 2047
19:15:15.845 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
19:15:15.845 INF Init:374 | prefill_token_num : 128
19:15:15.845 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
19:15:15.845 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
19:15:15.845 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
19:15:15.845 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
19:15:15.845 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
19:15:15.845 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 768
19:15:15.845 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 896
19:15:15.845 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 1024
19:15:15.845 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1152
19:15:15.845 INF Init:384 | prefill_max_token_num : 1152
19:15:15.845 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  27 /  27 [28.71s<28.71s, 0.94 count/s] embed_selector init ok
19:15:17.168 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
19:15:17.168 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=1024, out_dtype=fp32
19:15:17.168 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
19:15:17.173 INF load_config:282 | load config: 
19:15:17.173 INF load_config:282 | {
19:15:17.173 INF load_config:282 |     "enable_repetition_penalty": false,
19:15:17.173 INF load_config:282 |     "enable_temperature": false,
19:15:17.173 INF load_config:282 |     "enable_top_k_sampling": true,
19:15:17.173 INF load_config:282 |     "enable_top_p_sampling": false,
19:15:17.173 INF load_config:282 |     "penalty_window": 20,
19:15:17.173 INF load_config:282 |     "repetition_penalty": 1.2,
19:15:17.173 INF load_config:282 |     "temperature": 0.9,
19:15:17.173 INF load_config:282 |     "top_k": 10,
19:15:17.173 INF load_config:282 |     "top_p": 0.8
19:15:17.173 INF load_config:282 | }
19:15:17.173 INF Init:448 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kv cache
  /dd        delete the last dialogue round
  /pp        print the conversation history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> Describe the image content
image >> image.png
19:16:17.643 INF EncodeForContent:973 | Qwen-VL pixel_values[0] bytes=884736 min=0 max=238 (w=384 h=384 tp=2 ps=16 sm=2)
19:16:17.689 INF EncodeForContent:996 | vision cache store: image.png
19:16:17.719 INF SetKVCache:747 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
19:16:17.719 INF SetKVCache:749 | current prefill_max_token_num:1152
19:16:17.719 INF SetKVCache:750 | first run
19:16:17.766 INF Run:805 | input token num : 168, prefill_split_num : 2
19:16:17.766 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
19:16:17.766 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
19:16:17.921 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
19:16:17.921 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
19:16:18.098 INF Run:1010 | ttft: 332.49 ms
<think>

</think>

This image depicts three astronauts in a natural environment (possibly a forest or garden). They look as if they are in a space setting.

**Detailed description:**

-   **Foreground (bottom):** The foreground is an area of green vegetation, possibly grass, shrubs, or undergrowth.
-   **Middle ground (center):**
    -   Three astronauts stand side by side, facing different directions.
    -   The astronaut in the middle stands facing front-left, wearing a white space suit with a dark gray helmet.
    -   The astronaut on the left also stands facing left, wearing a similar suit. Their poses differ slightly, as if caught in different motions.
    -   The astronaut on the right stands facing front-right, likewise in a space suit.
-   **Background:** The background consists of tall trees and plants, forming a natural backdrop. The height of the trees adds depth and atmosphere to the image.
19:16:27.208 NTC Run:1132 | hit eos,avg 21.52 token/s
19:16:27.210 INF GetKVCache:721 | precompute_len:232, remaining:920
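The sampling settings printed during init come from post_config.json; with enable_top_k_sampling the runtime keeps only the top_k highest-probability tokens before drawing one. A minimal plain-Python sketch of that idea (an illustration, not the runtime's actual implementation):

```python
import math
import random

def top_k_sample(logits, k=10, rng=random.random):
    """Keep the k largest logits, softmax over them, draw one token index."""
    # indices of the k highest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # softmax restricted to the kept indices (subtract max for numerical stability)
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    total = sum(weights)
    # inverse-CDF draw over the kept tokens
    r = rng() * total
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]
```

With k=1 this degenerates to greedy decoding; raising k trades determinism for diversity, which is why the config above pairs it with a temperature setting.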

Start the server (OpenAI-compatible)

root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][                            Init][ 199]: max_token_len : 2047
[I][                            Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][                            Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][                            Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][                            Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][                            Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][                            Init][ 214]: prefill_max_token_num : 1152
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][                            Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][                            Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][                            Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][                            Init][ 672]: VisionModule deepstack enabled: layers=3
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047

OpenAI client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
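Since this is a VLM, the same endpoint can also be sent an image. The helper below only builds the OpenAI-style message payload with a base64 data URL; whether this ax-llm server build accepts image_url content parts in this form is an assumption you should verify against your version:

```python
import base64

def image_message(image_bytes: bytes, question: str) -> dict:
    """Build an OpenAI-style user message carrying a PNG as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": question},
        ],
    }

# usage: read the image first, then send it with the client shown above, e.g.
# messages = [image_message(open("image.png", "rb").read(), "Describe the image")]
```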

OpenAI streaming client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-0.8B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")