---
title: VoxCPM Demo
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
python_version: '3.10'
pinned: true
license: apache-2.0
short_description: VoxCPM2 Nano-vLLM Demo
---

An experimental Gradio Space demo for VoxCPM2, powered by `nanovllm-voxcpm`.

This repo keeps the existing Gradio frontend layout and swaps only the backend inference path to Nano-vLLM.

Notes:

- This is the non-Docker experiment path. It relies on a persistent GPU Gradio Space.
- `flash-attn` and `nanovllm-voxcpm` are pinned in `requirements.txt`, so they install during the Space build instead of on the first request.
- ZipEnhancer denoising is supported for reference-audio cloning. The default denoiser model is `iic/speech_zipenhancer_ans_multiloss_16k_base`.
- The Space now defaults to a hardened runtime path:
  - If `/data` exists, request logs are written to daily JSONL files such as `/data/logs/2026-04-05.jsonl`.
  - Model, pip, and temporary caches stay on the default runtime paths instead of consuming persistent storage.
  - Backend prewarm is enabled by default, so dependency installation and model loading begin in the background at startup.
  - Gradio SSR is disabled by default for stability.
- The first cold start may still spend extra time installing dependencies, downloading the model, and loading the server.
- SenseVoiceSmall is downloaded from Hugging Face and cached locally before ASR initialization.
- `ASR_DEVICE` defaults to `cpu` to avoid competing with TTS for GPU memory.
- Reference audio longer than 50 seconds is rejected early, before denoising or Nano-vLLM encoding.
- The LocDiT flow-matching steps slider is wired to the Nano-vLLM server's `inference_timesteps`; changing it rebuilds the backend server.
- The existing normalize toggle is kept for UI compatibility, but Nano-vLLM currently ignores it.
- The existing denoise toggle now runs ZipEnhancer on the reference audio before encoding it to latents.
- `packages.txt` is required because this path needs extra system build dependencies.

Stability recommendations:

- Use a persistent GPU Space.
- Attach persistent storage so `/data` is available.
- Keep the default queue concurrency at 1 unless you have profiled GPU memory headroom.
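As a rough sketch of how these recommendations map onto Gradio's queue settings (a hypothetical helper, not the actual `app.py` wiring; only the env var names and defaults come from this README):

```python
import os


def queue_kwargs() -> dict:
    """Build Gradio queue() kwargs from the env vars this demo documents.

    The documented defaults are max_size=10 and default_concurrency_limit=4;
    the stability recommendation is to drop the concurrency limit to 1 unless
    GPU memory headroom has been profiled.
    """
    return {
        "max_size": int(os.environ.get("GRADIO_QUEUE_MAX_SIZE", "10")),
        "default_concurrency_limit": int(
            os.environ.get("GRADIO_DEFAULT_CONCURRENCY_LIMIT", "4")
        ),
    }


# Usage (assuming a gr.Blocks app named `demo`):
#   demo.queue(**queue_kwargs()).launch()
```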

Recommended environment variables:

- `HF_REPO_ID`: Hugging Face model repo id. Defaults to `openbmb/VoxCPM2`.
- `HF_TOKEN`: required if the model repo is private.
- `NANOVLLM_MODEL`: optional direct model reference override; can be a local path or a Hugging Face repo id.
- `NANOVLLM_MODEL_PATH`: optional local model path override.
- `ASR_DEVICE`: defaults to `cpu`.
- `ZIPENHANCER_MODEL_ID`: optional ModelScope denoiser model id or local path. Defaults to `iic/speech_zipenhancer_ans_multiloss_16k_base`.
- `NANOVLLM_INFERENCE_TIMESTEPS`: initial default is `10`.
- `NANOVLLM_PREWARM`: defaults to `true`.
- `NANOVLLM_SERVERPOOL_MAX_NUM_BATCHED_TOKENS`: defaults to `8192`.
- `NANOVLLM_SERVERPOOL_MAX_NUM_SEQS`: defaults to `16`.
- `NANOVLLM_SERVERPOOL_MAX_MODEL_LEN`: defaults to `4096`.
- `NANOVLLM_SERVERPOOL_GPU_MEMORY_UTILIZATION`: defaults to `0.95`.
- `NANOVLLM_SERVERPOOL_ENFORCE_EAGER`: defaults to `false`.
- `NANOVLLM_SERVERPOOL_DEVICES`: defaults to `0`.
- `NANOVLLM_MAX_GENERATE_LENGTH`: defaults to `2000`.
- `NANOVLLM_TEMPERATURE`: defaults to `1.0`.
- `REQUEST_LOG_DIR`: optional persistent request-log directory. Defaults to `/data/logs` when `/data` exists.
- `GRADIO_QUEUE_MAX_SIZE`: defaults to `10`.
- `GRADIO_DEFAULT_CONCURRENCY_LIMIT`: defaults to `4` (uses the async server-pool bridge for thread-safe concurrency).
- `DENOISE_MAX_CONCURRENT`: defaults to `1` (limits concurrent ZipEnhancer denoise requests to avoid GPU OOM).
- `GRADIO_SSR_MODE`: defaults to `false`.