Deployment Guide for openPangu-R-72B-2512 on Omni-Infer
Hardware Environment and Deployment Method
PD (prefill-decode) hybrid deployment, requiring only 4 dies of a single Atlas 800T A3 machine.
Codes and Image
- Omni-Infer code version: release_v0.7.0
- Docker Image: Refer to the v0.7.0 image at https://gitee.com/omniai/omniinfer/releases. For example, for A3 hardware on the ARM architecture, use "docker pull swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm".
Deployment
1. Launch the image
IMAGE=swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm
NAME=omniinfer-v0.7.0 # Custom docker name
NPU_NUM=16 # Map all 16 dies of the A3 node into the container (serving itself uses only 4)
DEVICE_ARGS=$(for i in $(seq 0 $((NPU_NUM-1))); do echo -n "--device /dev/davinci${i} "; done)
# Run the container using the defined variables
# Note: if you are running Docker with a bridge network, expose the ports needed for multi-node communication in advance
# To prevent device interference from other Docker containers, add the argument "--privileged"
docker run -itd \
--name=${NAME} \
--network host \
--privileged \
--ipc=host \
$DEVICE_ARGS \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/:/mnt/ \
-v /data:/data \
-v /home/work:/home/work \
--entrypoint /bin/bash \
${IMAGE}
Ensure that the model checkpoint and the project code are accessible within the container. Enter the container:
docker exec -it $NAME /bin/bash
2. Copy examples/start_serving_openpangu_r_72b_2512.sh into the omniinfer/tools/scripts directory and start the serving script
git clone -b release_v0.7.0 https://gitee.com/omniai/omniinfer.git
cd omniinfer/tools/scripts
# Before launching, modify the model-path, the master-ip address, and PYTHONPATH in the serving script.
bash start_serving_openpangu_r_72b_2512.sh
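Loading a 72B checkpoint can take several minutes after the script starts. As an illustration, the snippet below polls the OpenAI-compatible /v1/models endpoint until the server answers; the port 8000 and the helper names (served_model_names, wait_until_ready) are assumptions for this sketch, not part of Omni-Infer itself.

```python
# Sketch: wait for the vLLM OpenAI-compatible server to come up.
# Assumes the serving script listens on port 8000, as in the curl examples below.
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:8000"

def served_model_names(body: bytes) -> list:
    """Extract model ids from a /v1/models response body."""
    data = json.loads(body)
    return [m["id"] for m in data.get("data", [])]

def wait_until_ready(timeout_s: int = 600) -> list:
    """Poll /v1/models until the server responds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(BASE_URL + "/v1/models", timeout=5) as resp:
                return served_model_names(resp.read())
        except (urllib.error.URLError, OSError):
            time.sleep(10)  # model loading can take a while; retry quietly
    raise TimeoutError("server did not become ready in time")
```

Once wait_until_ready() returns, the list it yields should contain "openpangu_r_72b_2512" and the requests in step 3 can be sent.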
3. Send Testing Requests
After the service has started, you can send test requests.
curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openpangu_r_72b_2512",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
],
"temperature": 1.0,
"top_p": 0.8,
"top_k": -1,
"vllm_xargs": {"top_n_sigma": 0.05},
"chat_template_kwargs": {"think": true, "reasoning_effort": "low"}
}'
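The same request can also be sent programmatically. The sketch below mirrors the curl example above using only the Python standard library; the helper names (build_request, send) are illustrative, and the endpoint and sampling parameters are taken directly from the example.

```python
# Sketch: send the chat-completions request above from Python.
import json
import urllib.request

def build_request(prompt: str, think: bool = True, effort: str = "low") -> dict:
    """Assemble a payload matching the curl example above."""
    return {
        "model": "openpangu_r_72b_2512",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.8,
        "top_k": -1,
        "vllm_xargs": {"top_n_sigma": 0.05},
        "chat_template_kwargs": {"think": think, "reasoning_effort": effort},
    }

def send(payload: dict,
         url: str = "http://127.0.0.1:8000/v1/chat/completions") -> str:
    """POST the payload and return the assistant message content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Usage: send(build_request("Who are you?")) returns the model's reply as a string.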
# Tool use
curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openpangu_r_72b_2512",
"messages": [
{"role": "system", "content": "你是华为公司开发的盘古模型。\n现在是2025年7月30日"},
{"role": "user", "content": "深圳明天的天气如何?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "获取指定城市的当前天气信息,包括温度、湿度、风速等数据。",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "城市名称,例如:北京、深圳。支持中文或拼音输入。"
},
"date": {
"type": "string",
"description": "查询日期,格式为 YYYY-MM-DD(遵循 ISO 8601 标准)。例如:2023-10-01。"
}
},
"required": ["location", "date"],
"additionalProperties": "false"
}
}
}
],
"temperature": 1.0,
"top_p": 0.8,
"top_k": -1,
"vllm_xargs": {"top_n_sigma": 0.05},
"chat_template_kwargs": {"think": true, "reasoning_effort": "high"}
}'
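When the model decides to call a tool, the OpenAI-compatible response carries the call in message.tool_calls, with the arguments serialized as a JSON string. A small parsing sketch (the helper name extract_tool_calls is illustrative; the response shape follows the standard OpenAI chat-completions format):

```python
# Sketch: pull (name, arguments) pairs out of a tool-use response.
import json

def extract_tool_calls(response: dict) -> list:
    """Return [(function_name, arguments_dict), ...] from a chat response.

    Assumes the standard OpenAI tool_calls shape; the "arguments" field
    arrives as a JSON-encoded string and is decoded here."""
    msg = response["choices"][0]["message"]
    return [
        (tc["function"]["name"], json.loads(tc["function"]["arguments"]))
        for tc in msg.get("tool_calls") or []
    ]
```

If the model answered with plain text instead of a tool call, extract_tool_calls returns an empty list.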
The model runs in slow-thinking mode by default. In this mode, you can balance accuracy against efficiency by setting the "reasoning_effort" parameter in "chat_template_kwargs" to "high" or "low".
openPangu-R-72B-2512 also supports switching between slow-thinking and fast-thinking modes by setting {"think": true/false} in "chat_template_kwargs".
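The two switches combine as described above; a small helper sketch for building "chat_template_kwargs" (the function name thinking_kwargs is an illustration, not part of the API):

```python
def thinking_kwargs(think: bool, reasoning_effort: str = "low") -> dict:
    """Build chat_template_kwargs for the desired mode.

    reasoning_effort ("high" or "low") only applies in slow-thinking mode,
    so it is omitted when think is False."""
    if not think:
        return {"think": False}  # fast-thinking mode
    if reasoning_effort not in ("high", "low"):
        raise ValueError("reasoning_effort must be 'high' or 'low'")
    return {"think": True, "reasoning_effort": reasoning_effort}
```

For example, thinking_kwargs(True, "high") selects the deepest reasoning, while thinking_kwargs(False) disables slow thinking entirely.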