---
title: VoxCPM Demo
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
python_version: "3.10"
pinned: true
license: apache-2.0
short_description: VoxCPM2 Nano-vLLM Demo
---

Experimental Gradio Space demo for `VoxCPM2` powered by `nanovllm-voxcpm`. This repo keeps the existing Gradio frontend layout and swaps only the backend inference path to Nano-vLLM.

Notes:

- This is the non-Docker experiment path. It relies on a persistent GPU Gradio Space.
- `flash-attn` and `nanovllm-voxcpm` are pinned in `requirements.txt`, so they install during the Space build instead of on the first request.
- ZipEnhancer denoising is supported for reference-audio cloning. The default denoiser model is `iic/speech_zipenhancer_ans_multiloss_16k_base`.
- The Space now defaults to a hardened runtime path:
  - If `/data` exists, request logs are written to daily JSONL files like `/data/logs/2026-04-05.jsonl`.
  - Model, pip, and temporary caches now stay on the default runtime paths instead of consuming persistent storage.
  - Backend prewarm is enabled by default, so startup can begin dependency install and model load in the background.
  - Gradio SSR is disabled by default for stability.
- The first cold start may still spend extra time installing dependencies, downloading the model, and loading the server.
- `SenseVoiceSmall` is downloaded from Hugging Face and cached locally before ASR initialization.
- `ASR_DEVICE` defaults to `cpu` to avoid competing with TTS GPU memory.
- Reference audio longer than 50 seconds is rejected early, before denoising or Nano-vLLM encoding.
- The `LocDiT flow-matching steps` slider is wired to the Nano-vLLM server's `inference_timesteps`; changing it rebuilds the backend server.
- The existing `normalize` toggle is kept for UI compatibility, but Nano-vLLM currently ignores it.
- The existing `denoise` toggle now runs ZipEnhancer on the reference audio before encoding it to latents.
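The daily JSONL request-log layout mentioned above can be sketched as follows. This is a hypothetical helper, not the actual logging code in `app.py`; the function name, the entry fields, and the use of UTC dates for the filename are all assumptions.

```python
import json
import os
from datetime import datetime, timezone


def append_request_log(entry: dict, log_dir: str = "/data/logs") -> str:
    """Append one request record to today's JSONL file (one JSON object per line).

    Illustrative only: app.py's real field names and date handling are not
    documented in this README.
    """
    os.makedirs(log_dir, exist_ok=True)
    # One file per day, e.g. /data/logs/2026-04-05.jsonl
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = os.path.join(log_dir, f"{day}.jsonl")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return path
```

Appending (rather than rewriting) keeps each request a single self-contained line, so a partially written file is still parseable line by line.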
- `packages.txt` is required because this path needs extra system build dependencies.

Stability recommendations:

- Use a persistent GPU Space.
- Attach persistent storage so `/data` is available.
- Keep the default queue concurrency at `1` unless you have profiled GPU memory headroom.

Recommended environment variables:

- `HF_REPO_ID`: Hugging Face model repo id. Defaults to `openbmb/VoxCPM2`
- `HF_TOKEN`: required if the model repo is private
- `NANOVLLM_MODEL`: optional direct model ref override. Can be a local path or HF repo id
- `NANOVLLM_MODEL_PATH`: optional local model path override
- `ASR_DEVICE`: defaults to `cpu`
- `ZIPENHANCER_MODEL_ID`: optional ModelScope denoiser model id or local path. Defaults to `iic/speech_zipenhancer_ans_multiloss_16k_base`
- `NANOVLLM_INFERENCE_TIMESTEPS`: initial default is `10`
- `NANOVLLM_PREWARM`: defaults to `true`
- `NANOVLLM_SERVERPOOL_MAX_NUM_BATCHED_TOKENS`: defaults to `8192`
- `NANOVLLM_SERVERPOOL_MAX_NUM_SEQS`: defaults to `16`
- `NANOVLLM_SERVERPOOL_MAX_MODEL_LEN`: defaults to `4096`
- `NANOVLLM_SERVERPOOL_GPU_MEMORY_UTILIZATION`: defaults to `0.95`
- `NANOVLLM_SERVERPOOL_ENFORCE_EAGER`: defaults to `false`
- `NANOVLLM_SERVERPOOL_DEVICES`: defaults to `0`
- `NANOVLLM_MAX_GENERATE_LENGTH`: defaults to `2000`
- `NANOVLLM_TEMPERATURE`: defaults to `1.0`
- `REQUEST_LOG_DIR`: optional persistent request log directory. Defaults to `/data/logs` when `/data` exists
- `GRADIO_QUEUE_MAX_SIZE`: defaults to `10`
- `GRADIO_DEFAULT_CONCURRENCY_LIMIT`: defaults to `4` (uses the async server pool bridge for thread-safe concurrency)
- `DENOISE_MAX_CONCURRENT`: defaults to `1` (limits concurrent ZipEnhancer denoise requests to avoid GPU OOM)
- `GRADIO_SSR_MODE`: defaults to `false`
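One way such env-var defaults are commonly read in a Python entrypoint is sketched below. The helper names (`env_int`, `env_bool`) and the parsing rules are assumptions for illustration; how `app.py` actually parses these variables may differ.

```python
import os


def env_int(name: str, default: int) -> int:
    # Hypothetical helper: read an integer env var, falling back to a default
    # when the variable is unset or empty.
    raw = os.environ.get(name)
    return int(raw) if raw not in (None, "") else default


def env_bool(name: str, default: bool) -> bool:
    # Hypothetical helper: accept common truthy spellings ("1", "true", ...).
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


# Defaults mirror the list above (assumed parsing, not app.py's actual code).
inference_timesteps = env_int("NANOVLLM_INFERENCE_TIMESTEPS", 10)
prewarm = env_bool("NANOVLLM_PREWARM", True)
max_generate_length = env_int("NANOVLLM_MAX_GENERATE_LENGTH", 2000)
```

Centralizing parsing like this keeps each default stated once, next to the variable name, instead of scattered through the startup code.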