openPangu-R-7B-2512 Deployment Guide for vllm-ascend
Deployment Environment
openPangu-R-7B-2512 can be deployed on Atlas 800T A2 (64GB).
Building and Starting the A2 Image
Pull the base image:
docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
Build the image using the Dockerfile:
IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11
docker build -t $IMAGE -f ./Dockerfile .
Start the container:
export IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11 # Use correct image id
export NAME=XXX # Custom docker name
# Run the container using the defined variables
# Note: if you run Docker with a bridge network, expose the ports required for multi-node communication in advance
# To prevent device interference from other docker containers, add the argument "--privileged"
docker run -itd \
--privileged \
--ipc=host \
--name $NAME \
--network host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/:/mnt/ \
-v /data:/data \
-v /home/work:/home/work \
--entrypoint /bin/bash \
$IMAGE
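The eight `--device /dev/davinciN` flags above map one flag per NPU, plus the three shared management devices (`davinci_manager`, `devmm_svm`, `hisi_hdc`) that are always required. If your host has a different number of cards, a small helper can generate the flag list. This is a hypothetical sketch (`davinci_device_args` is not part of any official tooling):

```python
# Hypothetical helper: build the docker --device arguments for n Ascend NPUs.
# Per-card nodes are /dev/davinci0 .. /dev/davinci{n-1}; the three shared
# management devices are always appended.

def davinci_device_args(n: int) -> list[str]:
    """Return docker CLI arguments exposing n davinci devices."""
    devices = [f"/dev/davinci{i}" for i in range(n)]
    devices += ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]
    args: list[str] = []
    for dev in devices:
        args += ["--device", dev]
    return args

args = davinci_device_args(8)
print(" ".join(args))
```

For an 8-card Atlas 800T A2 this reproduces exactly the device flags shown in the `docker run` command above.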
Make sure the model weights and this project's code are accessible inside the container. If you are not yet inside the container, enter it as the root user:
docker exec -itu root $NAME /bin/bash
cd inference
pip install -r requirements.txt
bash ./cann910B-omni_inference_custom_ops-0.7.0-8.3.RC1-linux-aarch64.run --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/omni_custom_ops/bin/set_env.bash
pip install omni_inference_ascendc_custom_ops-0.7.0+8.3.rc1.pta2.7.1-cp311-cp311-linux_aarch64.whl --force-reinstall
openPangu-R-7B-2512 Inference
Launch script: inference/launch.sh
Run:
export LOCAL_CKPT_DIR=XXX/checkpoint/ # The pangu_7b bf16 weight
bash inference/launch.sh
Example launch script:
# Setting HOST=127.0.0.1 (localhost) makes the server reachable only from the host machine itself.
# Setting HOST=0.0.0.0 allows access to the vLLM server from other devices on the same network, or even from the internet, provided the network is configured accordingly (e.g., firewall rules, port forwarding).
HOST=xxx.xxx.xxx.xxx
python $SCRIPT_DIR/vllm_register.py \
--model $LOCAL_CKPT_DIR \
--served-model-name ${SERVED_MODEL_NAME:=pangu_7b} \
--tensor-parallel-size ${TENSOR_PARALLEL_SIZE:=8} \
--trust-remote-code \
--host $HOST \
--port ${PORT:=8000} \
--max-num-seqs ${MAX_NUM_SEQS:=256} \
--max-model-len ${MAX_MODEL_LEN:=40960} \
--tokenizer-mode "slow" \
--dtype bfloat16 \
--enable-log-requests \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:=4096} \
--no-enable-prefix-caching \
--enforce-eager \
--reasoning-parser pangu
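The launch script returns while the engine is still loading weights, so clients should wait until the OpenAI-compatible endpoint responds before sending traffic. Below is a minimal readiness poll, assuming the standard vLLM `/v1/models` route; the throwaway local server at the bottom only stands in for vLLM to demonstrate the helper, and `wait_for_server` is a sketch, not part of this repo:

```python
# Poll the vLLM OpenAI-compatible endpoint until it answers, then return the
# parsed model list. HOST/PORT must match the values used in launch.sh.
import json
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 600, interval: float = 5):
    """Return the parsed /v1/models payload once the server is up, else None."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return json.load(resp)
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval)
    return None

# --- demonstration against a throwaway local server (stands in for vLLM) ---
import http.server
import threading

class _FakeModels(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"object": "list", "data": [{"id": "pangu_7b"}]}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

_srv = http.server.HTTPServer(("127.0.0.1", 0), _FakeModels)
threading.Thread(target=_srv.serve_forever, daemon=True).start()
models = wait_for_server(
    f"http://127.0.0.1:{_srv.server_address[1]}/v1/models",
    timeout=10, interval=0.2,
)
_srv.shutdown()
print(models["data"][0]["id"])
```

Against a real deployment, point the helper at `http://$HOST:$PORT/v1/models` and only start sending requests once it returns a non-None payload.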
Sending a Test Request
Once the service is running, send a test request:
MASTER_NODE_IP=xxx.xxx.xxx.xxx # server node ip
curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'$SERVED_MODEL_NAME'",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
],
"max_tokens": 512,
"temperature": 0
}'
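The same request can be issued from Python using only the standard library. In the sketch below, the server address and model name are assumed to match the values from launch.sh, and `build_chat_request` / `post_chat` are hypothetical helpers, not part of this repo:

```python
# Build and send an OpenAI-style chat completion request with the standard
# library only. MASTER_NODE_IP / PORT / model name must match the deployment.
import json
import urllib.request

def build_chat_request(model: str, content: str,
                       max_tokens: int = 512, temperature: float = 0.0) -> dict:
    """Payload mirroring the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and parse the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("pangu_7b", "Who are you?")
print(json.dumps(payload, ensure_ascii=False))
# With the server running, the reply text would be read as:
#   reply = post_chat("http://<MASTER_NODE_IP>:8000", payload)
#   print(reply["choices"][0]["message"]["content"])
```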