---
title: VoxCPM Demo
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
python_version: "3.10"
pinned: true
license: apache-2.0
short_description: VoxCPM2 Nano-vLLM Demo
---

Experimental Gradio Space demo for `VoxCPM2` powered by `nanovllm-voxcpm`. This repo keeps the existing Gradio frontend layout and swaps only the backend inference path to Nano-vLLM.

Notes:

- This is the non-Docker experiment path. It relies on a persistent GPU Gradio Space.
- `flash-attn` and `nanovllm-voxcpm` are pinned in `requirements.txt`, so they install during the Space build instead of on the first request.
- ZipEnhancer denoising is supported for reference-audio cloning. The default denoiser model is `iic/speech_zipenhancer_ans_multiloss_16k_base`.
- The Space now defaults to a hardened runtime path:
  - If `/data` exists, request logs are written to daily JSONL files like `/data/logs/2026-04-05.jsonl`.
  - Model, pip, and temporary caches now stay on the default runtime paths instead of consuming persistent storage.
  - Backend prewarm is enabled by default, so startup can begin dependency install and model load in the background.
  - Gradio SSR is disabled by default for stability.
- The first cold start may still spend extra time installing dependencies, downloading the model, and loading the server.
- `SenseVoiceSmall` is downloaded from Hugging Face and cached locally before ASR initialization.
- `ASR_DEVICE` defaults to `cpu` to avoid competing with TTS GPU memory.
- Reference audio longer than 50 seconds is rejected early, before denoising or Nano-vLLM encoding.
- The `LocDiT flow-matching steps` slider is wired to the Nano-vLLM server's `inference_timesteps`; changing it rebuilds the backend server.
- The existing `normalize` toggle is kept for UI compatibility, but Nano-vLLM currently ignores it.
- The existing `denoise` toggle now runs ZipEnhancer on the reference audio before encoding it to latents.
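The daily JSONL request-log layout mentioned above can be sketched as follows. This is a hypothetical helper, not the actual logging code in `app.py`; the function name, the entry fields, and the use of UTC dates for the filename are all assumptions.

```python
import json
import os
from datetime import datetime, timezone


def append_request_log(entry: dict, log_dir: str = "/data/logs") -> str:
    """Append one request record to today's JSONL file (one JSON object per line).

    Illustrative only: app.py's real field names and date handling are not
    documented in this README.
    """
    os.makedirs(log_dir, exist_ok=True)
    # One file per day, e.g. /data/logs/2026-04-05.jsonl
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = os.path.join(log_dir, f"{day}.jsonl")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return path
```

Appending (rather than rewriting) keeps each request a single self-contained line, so a partially written file is still parseable line by line.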
- `packages.txt` is required because this path needs extra system build dependencies.

Stability recommendations:

- Use a persistent GPU Space.
- Attach persistent storage so `/data` is available.
- Keep the default queue concurrency at `1` unless you have profiled GPU memory headroom.

Recommended environment variables:

- `HF_REPO_ID`: Hugging Face model repo id. Defaults to `openbmb/VoxCPM2`
- `HF_TOKEN`: required if the model repo is private
- `NANOVLLM_MODEL`: optional direct model ref override. Can be a local path or HF repo id
- `NANOVLLM_MODEL_PATH`: optional local model path override
- `ASR_DEVICE`: defaults to `cpu`
- `ZIPENHANCER_MODEL_ID`: optional ModelScope denoiser model id or local path. Defaults to `iic/speech_zipenhancer_ans_multiloss_16k_base`
- `NANOVLLM_INFERENCE_TIMESTEPS`: initial default is `10`
- `NANOVLLM_PREWARM`: defaults to `true`
- `NANOVLLM_SERVERPOOL_MAX_NUM_BATCHED_TOKENS`: defaults to `8192`
- `NANOVLLM_SERVERPOOL_MAX_NUM_SEQS`: defaults to `16`
- `NANOVLLM_SERVERPOOL_MAX_MODEL_LEN`: defaults to `4096`
- `NANOVLLM_SERVERPOOL_GPU_MEMORY_UTILIZATION`: defaults to `0.95`
- `NANOVLLM_SERVERPOOL_ENFORCE_EAGER`: defaults to `false`
- `NANOVLLM_SERVERPOOL_DEVICES`: defaults to `0`
- `NANOVLLM_MAX_GENERATE_LENGTH`: defaults to `2000`
- `NANOVLLM_TEMPERATURE`: defaults to `1.0`
- `REQUEST_LOG_DIR`: optional persistent request log directory. Defaults to `/data/logs` when `/data` exists
- `GRADIO_QUEUE_MAX_SIZE`: defaults to `10`
- `GRADIO_DEFAULT_CONCURRENCY_LIMIT`: defaults to `4` (uses the async server pool bridge for thread-safe concurrency)
- `DENOISE_MAX_CONCURRENT`: defaults to `1` (limits concurrent ZipEnhancer denoise requests to avoid GPU OOM)
- `GRADIO_SSR_MODE`: defaults to `false`
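One way such env-var defaults are commonly read in a Python entrypoint is sketched below. The helper names (`env_int`, `env_bool`) and the parsing rules are assumptions for illustration; how `app.py` actually parses these variables may differ.

```python
import os


def env_int(name: str, default: int) -> int:
    # Hypothetical helper: read an integer env var, falling back to a default
    # when the variable is unset or empty.
    raw = os.environ.get(name)
    return int(raw) if raw not in (None, "") else default


def env_bool(name: str, default: bool) -> bool:
    # Hypothetical helper: accept common truthy spellings ("1", "true", ...).
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


# Defaults mirror the list above (assumed parsing, not app.py's actual code).
inference_timesteps = env_int("NANOVLLM_INFERENCE_TIMESTEPS", 10)
prewarm = env_bool("NANOVLLM_PREWARM", True)
max_generate_length = env_int("NANOVLLM_MAX_GENERATE_LENGTH", 2000)
```

Centralizing parsing like this keeps each default stated once, next to the variable name, instead of scattered through the startup code.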