openPangu-R-7B-2512 Deployment Guide for vllm-ascend
Deployment Environment
openPangu-R-7B-2512 can be deployed on Atlas 800T A2 (64GB).
Building and Starting the A2 Image
Pull the base image:
docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
Build the image using the Dockerfile:
IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11
docker build -t $IMAGE -f ./Dockerfile .
Start the container:
export IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11 # Use correct image id
export NAME=XXX # Custom docker name
# Run the container using the defined variables
# Note: if you run Docker with a bridge network, expose the ports required for multi-node communication in advance
# To prevent device interference from other docker containers, add the argument "--privileged"
docker run -itd \
--privileged \
--ipc=host \
--name $NAME \
--network host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/:/mnt/ \
-v /data:/data \
-v /home/work:/home/work \
--entrypoint /bin/bash \
$IMAGE
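The eight `--device /dev/davinciN` flags above map one flag per NPU, plus the three shared management devices (`davinci_manager`, `devmm_svm`, `hisi_hdc`) that are always required. If your host has a different number of cards, a small helper can generate the flag list. This is a hypothetical sketch (`davinci_device_args` is not part of any official tooling):

```python
# Hypothetical helper: build the docker --device arguments for n Ascend NPUs.
# Per-card nodes are /dev/davinci0 .. /dev/davinci{n-1}; the three shared
# management devices are always appended.

def davinci_device_args(n: int) -> list[str]:
    """Return docker CLI arguments exposing n davinci devices."""
    devices = [f"/dev/davinci{i}" for i in range(n)]
    devices += ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]
    args: list[str] = []
    for dev in devices:
        args += ["--device", dev]
    return args

args = davinci_device_args(8)
print(" ".join(args))
```

For an 8-card Atlas 800T A2 this reproduces exactly the device flags shown in the `docker run` command above.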
Make sure the model weights and this project's code are accessible inside the container. If you are not yet inside the container, enter it as the root user:
docker exec -itu root $NAME /bin/bash
cd inference
pip install -r requirements.txt
bash ./cann910B-omni_inference_custom_ops-0.7.0-8.3.RC1-linux-aarch64.run --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/omni_custom_ops/bin/set_env.bash
pip install omni_inference_ascendc_custom_ops-0.7.0+8.3.rc1.pta2.7.1-cp311-cp311-linux_aarch64.whl --force-reinstall
openPangu-R-7B-2512 Inference
Launch script: inference/launch.sh
Run:
export LOCAL_CKPT_DIR=XXX/checkpoint/ # The pangu_7b bf16 weight
bash inference/launch.sh
Example launch script:
# Setting HOST=127.0.0.1 (localhost) makes the server reachable only from the host machine itself.
# Setting HOST=0.0.0.0 allows access to the vLLM server from other devices on the same network, or even from the internet, provided the network is configured accordingly (e.g., firewall rules, port forwarding).
HOST=xxx.xxx.xxx.xxx
python $SCRIPT_DIR/vllm_register.py \
--model $LOCAL_CKPT_DIR \
--served-model-name ${SERVED_MODEL_NAME:=pangu_7b} \
--tensor-parallel-size ${TENSOR_PARALLEL_SIZE:=8} \
--trust-remote-code \
--host $HOST \
--port ${PORT:=8000} \
--max-num-seqs ${MAX_NUM_SEQS:=256} \
--max-model-len ${MAX_MODEL_LEN:=40960} \
--tokenizer-mode "slow" \
--dtype bfloat16 \
--enable-log-requests \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:=4096} \
--no-enable-prefix-caching \
--enforce-eager \
--reasoning-parser pangu
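The launch script returns while the engine is still loading weights, so clients should wait until the OpenAI-compatible endpoint responds before sending traffic. Below is a minimal readiness poll, assuming the standard vLLM `/v1/models` route; the throwaway local server at the bottom only stands in for vLLM to demonstrate the helper, and `wait_for_server` is a sketch, not part of this repo:

```python
# Poll the vLLM OpenAI-compatible endpoint until it answers, then return the
# parsed model list. HOST/PORT must match the values used in launch.sh.
import json
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 600, interval: float = 5):
    """Return the parsed /v1/models payload once the server is up, else None."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return json.load(resp)
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval)
    return None

# --- demonstration against a throwaway local server (stands in for vLLM) ---
import http.server
import threading

class _FakeModels(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"object": "list", "data": [{"id": "pangu_7b"}]}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

_srv = http.server.HTTPServer(("127.0.0.1", 0), _FakeModels)
threading.Thread(target=_srv.serve_forever, daemon=True).start()
models = wait_for_server(
    f"http://127.0.0.1:{_srv.server_address[1]}/v1/models",
    timeout=10, interval=0.2,
)
_srv.shutdown()
print(models["data"][0]["id"])
```

Against a real deployment, point the helper at `http://$HOST:$PORT/v1/models` and only start sending requests once it returns a non-None payload.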
Sending a Test Request
Once the service is running, send a test request:
MASTER_NODE_IP=xxx.xxx.xxx.xxx # server node ip
curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'$SERVED_MODEL_NAME'",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
],
"max_tokens": 512,
"temperature": 0
}'
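The same request can be issued from Python using only the standard library. In the sketch below, the server address and model name are assumed to match the values from launch.sh, and `build_chat_request` / `post_chat` are hypothetical helpers, not part of this repo:

```python
# Build and send an OpenAI-style chat completion request with the standard
# library only. MASTER_NODE_IP / PORT / model name must match the deployment.
import json
import urllib.request

def build_chat_request(model: str, content: str,
                       max_tokens: int = 512, temperature: float = 0.0) -> dict:
    """Payload mirroring the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and parse the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("pangu_7b", "Who are you?")
print(json.dumps(payload, ensure_ascii=False))
# With the server running, the reply text would be read as:
#   reply = post_chat("http://<MASTER_NODE_IP>:8000", payload)
#   print(reply["choices"][0]["message"]["content"])
```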