刘鑫 committed
Commit ffbf382 · 1 Parent(s): 7d8de04

fix: add denoiser and reference audio guards

Enable ZipEnhancer preprocessing for reference audio, move SenseVoice downloads to Hugging Face caching, and reject overly long reference clips before ASR or NanoVLLM encoding.

Made-with: Cursor
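
The "reject overly long reference clips" guard from the commit message can be sketched with the standard library alone. This is a minimal illustration, not the commit's code: the commit reads the duration with `soundfile.info`, while this sketch swaps in the `wave` module so it runs without extra dependencies (and therefore handles PCM WAV files only). The function names `wav_duration_seconds` and `check_reference_duration` are hypothetical; the 50-second limit matches `MAX_REFERENCE_AUDIO_SECONDS` in the diff.

```python
import wave

MAX_REFERENCE_AUDIO_SECONDS = 50.0  # same limit as the constant added in app.py


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file using only its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / float(wav_file.getframerate())


def check_reference_duration(
    path: str, limit: float = MAX_REFERENCE_AUDIO_SECONDS
) -> float:
    """Raise ValueError if the clip exceeds `limit`; otherwise return its duration."""
    duration = wav_duration_seconds(path)
    if duration > limit:
        raise ValueError(
            f"Reference audio is {duration:.1f}s long; the limit is {int(limit)}s."
        )
    return duration
```

Because the check only reads header metadata, it is cheap enough to run before any denoising or latent encoding, which is the ordering the commit describes.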

Files changed (3)
  1. README.md +8 -1
  2. app.py +180 -62
  3. requirements.txt +1 -0
README.md CHANGED
@@ -20,14 +20,19 @@ Notes:
20
 
21
  - This is the non-Docker experiment path. It relies on a persistent GPU Gradio Space.
22
  - `flash-attn` and `nanovllm-voxcpm` are pinned in `requirements.txt`, so they install during Space build instead of on first request.
 
23
  - The Space now defaults to a hardened runtime path:
24
  - If `/data` exists, Hugging Face cache, pip cache, and Gradio temp files are stored there automatically.
 
25
  - Backend prewarm is enabled by default, so startup can begin dependency install + model load in the background.
26
  - Gradio SSR is disabled by default for stability.
27
  - The first cold start may still spend extra time installing dependencies, downloading the model, and loading the server.
 
28
  - `ASR_DEVICE` defaults to `cpu` to avoid competing with TTS GPU memory.
 
29
  - The `LocDiT flow-matching steps` slider is wired to Nano-vLLM server `inference_timesteps`; changing it rebuilds the backend server.
30
- - The existing `normalize` / `denoise` frontend toggles are kept for UI compatibility, but Nano-vLLM currently ignores them.
 
31
  - `packages.txt` is required because this path needs extra system build dependencies.
32
 
33
  Stability recommendation:
@@ -43,6 +48,7 @@ Recommended environment variables:
43
  - `NANOVLLM_MODEL`: optional direct model ref override. Can be a local path or HF repo id
44
  - `NANOVLLM_MODEL_PATH`: optional local model path override
45
  - `ASR_DEVICE`: defaults to `cpu`
 
46
  - `NANOVLLM_INFERENCE_TIMESTEPS`: initial default is `10`
47
  - `NANOVLLM_PREWARM`: defaults to `true`
48
  - `NANOVLLM_SERVERPOOL_MAX_NUM_BATCHED_TOKENS`: defaults to `8192`
@@ -53,6 +59,7 @@ Recommended environment variables:
53
  - `NANOVLLM_SERVERPOOL_DEVICES`: defaults to `0`
54
  - `NANOVLLM_MAX_GENERATE_LENGTH`: defaults to `2000`
55
  - `NANOVLLM_TEMPERATURE`: defaults to `1.0`
 
56
  - `GRADIO_QUEUE_MAX_SIZE`: defaults to `10`
57
  - `GRADIO_DEFAULT_CONCURRENCY_LIMIT`: defaults to `1`
58
  - `GRADIO_SSR_MODE`: defaults to `false`
 
20
 
21
  - This is the non-Docker experiment path. It relies on a persistent GPU Gradio Space.
22
  - `flash-attn` and `nanovllm-voxcpm` are pinned in `requirements.txt`, so they install during Space build instead of on first request.
23
+ - ZipEnhancer denoising is supported for reference audio cloning. The default denoiser model is `iic/speech_zipenhancer_ans_multiloss_16k_base`.
24
  - The Space now defaults to a hardened runtime path:
25
  - If `/data` exists, Hugging Face cache, pip cache, and Gradio temp files are stored there automatically.
26
+ - If `/data` exists, ModelScope cache is also persisted there for ZipEnhancer downloads.
27
  - Backend prewarm is enabled by default, so startup can begin dependency install + model load in the background.
28
  - Gradio SSR is disabled by default for stability.
29
  - The first cold start may still spend extra time installing dependencies, downloading the model, and loading the server.
30
+ - `SenseVoiceSmall` is downloaded from Hugging Face and cached locally before ASR initialization.
31
  - `ASR_DEVICE` defaults to `cpu` to avoid competing with TTS GPU memory.
32
+ - Reference audio longer than 50 seconds is rejected early before denoising or Nano-vLLM encoding.
33
  - The `LocDiT flow-matching steps` slider is wired to Nano-vLLM server `inference_timesteps`; changing it rebuilds the backend server.
34
+ - The existing `normalize` toggle is kept for UI compatibility, but Nano-vLLM currently ignores it.
35
+ - The existing `denoise` toggle now runs ZipEnhancer on the reference audio before encoding it to latents.
36
  - `packages.txt` is required because this path needs extra system build dependencies.
37
 
38
  Stability recommendation:
 
48
  - `NANOVLLM_MODEL`: optional direct model ref override. Can be a local path or HF repo id
49
  - `NANOVLLM_MODEL_PATH`: optional local model path override
50
  - `ASR_DEVICE`: defaults to `cpu`
51
+ - `ZIPENHANCER_MODEL_ID`: optional ModelScope denoiser model id or local path. Defaults to `iic/speech_zipenhancer_ans_multiloss_16k_base`
52
  - `NANOVLLM_INFERENCE_TIMESTEPS`: initial default is `10`
53
  - `NANOVLLM_PREWARM`: defaults to `true`
54
  - `NANOVLLM_SERVERPOOL_MAX_NUM_BATCHED_TOKENS`: defaults to `8192`
 
59
  - `NANOVLLM_SERVERPOOL_DEVICES`: defaults to `0`
60
  - `NANOVLLM_MAX_GENERATE_LENGTH`: defaults to `2000`
61
  - `NANOVLLM_TEMPERATURE`: defaults to `1.0`
62
+ - `MODELSCOPE_CACHE`: optional persistent cache path for ZipEnhancer downloads
63
  - `GRADIO_QUEUE_MAX_SIZE`: defaults to `10`
64
  - `GRADIO_DEFAULT_CONCURRENCY_LIMIT`: defaults to `1`
65
  - `GRADIO_SSR_MODE`: defaults to `false`
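
The two new README variables, `ZIPENHANCER_MODEL_ID` and `MODELSCOPE_CACHE`, follow an "override if set, else default" pattern. The sketch below shows that pattern in isolation, under a few assumptions: the function names are hypothetical, and the environment is passed in as a mapping (rather than read from `os.environ` directly) purely to make the logic easy to exercise. The `ZIPENHANCER_MODEL_PATH` fallback and the `modelscope` subdirectory come from the `app.py` changes in this commit.

```python
from pathlib import Path
from typing import Mapping, MutableMapping

DEFAULT_ZIPENHANCER_MODEL = "iic/speech_zipenhancer_ans_multiloss_16k_base"


def resolve_zipenhancer_model(env: Mapping[str, str]) -> str:
    """Return the first non-empty override, else the default model id."""
    for name in ("ZIPENHANCER_MODEL_ID", "ZIPENHANCER_MODEL_PATH"):
        value = env.get(name, "").strip()
        if value:
            return value
    return DEFAULT_ZIPENHANCER_MODEL


def resolve_modelscope_cache(env: MutableMapping[str, str], cache_root: Path) -> Path:
    """Pick the ModelScope cache dir and record it without clobbering an override."""
    cache_dir = Path(
        env.get("MODELSCOPE_CACHE", str(cache_root / "modelscope"))
    ).expanduser()
    # setdefault leaves a user-provided MODELSCOPE_CACHE untouched.
    env.setdefault("MODELSCOPE_CACHE", str(cache_dir))
    return cache_dir
```

In a real process, both functions would be called with `os.environ`. Blank or whitespace-only values are treated as unset, so an empty Space secret does not override the default model.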
app.py CHANGED
@@ -2,6 +2,7 @@ import atexit
2
  import logging
3
  import os
4
  import sys
 
5
  from pathlib import Path
6
  from threading import Lock, Thread
7
  from typing import Optional, Tuple
@@ -28,6 +29,9 @@ logging.basicConfig(
28
  handlers=[logging.StreamHandler(sys.stdout)],
29
  )
30
  logger = logging.getLogger(__name__)
 
 
 
31
 
32
 
33
  def _configure_cache_dirs() -> None:
@@ -43,8 +47,11 @@ def _configure_cache_dirs() -> None:
43
  os.environ.get("GRADIO_TEMP_DIR", str(cache_root / "gradio"))
44
  ).expanduser()
45
  pip_cache = Path(os.environ.get("PIP_CACHE_DIR", str(cache_root / "pip"))).expanduser()
 
 
 
46
 
47
- for path in (cache_root, hf_home, gradio_tmp, pip_cache):
48
  path.mkdir(parents=True, exist_ok=True)
49
 
50
  os.environ.setdefault("XDG_CACHE_HOME", str(cache_root))
@@ -52,6 +59,7 @@ def _configure_cache_dirs() -> None:
52
  os.environ.setdefault("HUGGINGFACE_HUB_CACHE", str(hf_home / "hub"))
53
  os.environ.setdefault("PIP_CACHE_DIR", str(pip_cache))
54
  os.environ.setdefault("GRADIO_TEMP_DIR", str(gradio_tmp))
 
55
  logger.info(f"Using persistent cache directories under {persistent_root}")
56
 
57
 
@@ -61,8 +69,10 @@ _asr_model = None
61
  _voxcpm_server = None
62
  _model_info = None
63
  _server_inference_timesteps = None
 
64
  _server_lock = Lock()
65
  _prewarm_lock = Lock()
 
66
  _prewarm_started = False
67
  _runtime_diag_logged = False
68
 
@@ -111,6 +121,18 @@ def _resolve_model_ref() -> str:
111
  return DEFAULT_MODEL_REF
112
 
113
 
114
  def _log_runtime_diagnostics_once() -> None:
115
  global _runtime_diag_logged
116
  if _runtime_diag_logged:
@@ -132,6 +154,54 @@ def _log_runtime_diagnostics_once() -> None:
132
  _runtime_diag_logged = True
133
 
134
 
135
  def _extract_asr_text(asr_result) -> str:
136
  if not asr_result:
137
  return ""
@@ -151,6 +221,39 @@ def _read_audio_bytes(audio_path: Optional[str]) -> tuple[bytes | None, str | No
151
  return path.read_bytes(), audio_format
152
 
153
 
154
  def _safe_prompt_wav_recognition(use_prompt_text: bool, prompt_wav: Optional[str]) -> str:
155
  try:
156
  return prompt_wav_recognition(use_prompt_text, prompt_wav)
@@ -378,11 +481,15 @@ def get_asr_model():
378
  global _asr_model
379
  if _asr_model is None:
380
  from funasr import AutoModel
 
381
 
382
  device = os.environ.get("ASR_DEVICE", "cpu").strip() or "cpu"
 
 
 
383
  logger.info(f"Loading ASR model on {device} ...")
384
  _asr_model = AutoModel(
385
- model="iic/SenseVoiceSmall",
386
  disable_update=True,
387
  log_level="INFO",
388
  device=device,
@@ -491,68 +598,79 @@ def _generate_tts_audio_once(
491
  denoise: bool = True,
492
  inference_timesteps: int = 10,
493
  ) -> Tuple[int, np.ndarray]:
494
- timesteps = int(inference_timesteps)
495
- server = get_voxcpm_server(timesteps)
496
- model_info = get_model_info(timesteps)
497
-
498
- text = (text_input or "").strip()
499
- if len(text) == 0:
500
- raise ValueError("Please input text to synthesize.")
501
-
502
- control = (control_instruction or "").strip()
503
- final_text = f"({control}){text}" if control and not use_prompt_text else text
504
-
505
- audio_bytes, audio_format = _read_audio_bytes(reference_wav_path_input)
506
- prompt_text_clean = (prompt_text_input or "").strip()
507
- if use_prompt_text and audio_bytes is None:
508
- raise ValueError("Ultimate Cloning Mode requires a reference audio clip.")
509
- if use_prompt_text and not prompt_text_clean:
510
- raise ValueError(
511
- "Ultimate Cloning Mode requires a transcript. Please wait for ASR or fill it in manually."
512
  )
513
- if not use_prompt_text:
514
- prompt_text_clean = ""
 
 
 
 
 
 
 
515
 
516
- if do_normalize:
517
- logger.info("Ignoring normalize option: nano-vLLM backend does not support per-request text normalization.")
518
- if denoise:
519
- logger.info("Ignoring denoise option: nano-vLLM backend does not support per-request reference denoising.")
520
-
521
- prompt_latents = None
522
- ref_audio_latents = None
523
- if audio_bytes is not None and audio_format is not None and use_prompt_text:
524
- logger.info(f"[Ultimate Cloning] encoding prompt audio as {audio_format}")
525
- prompt_latents = server.encode_latents(audio_bytes, audio_format)
526
- elif audio_bytes is not None and audio_format is not None:
527
- logger.info(f"[Controllable Cloning] encoding reference audio as {audio_format}")
528
- ref_audio_latents = server.encode_latents(audio_bytes, audio_format)
529
-
530
- if prompt_latents is not None:
531
- logger.info("[Ultimate Cloning] reference audio + transcript")
532
- elif ref_audio_latents is not None:
533
- logger.info("[Controllable Cloning] reference audio only")
534
- else:
535
- logger.info(f"[Voice Design] control: {control[:50] if control else 'None'}")
536
-
537
- chunks: list[np.ndarray] = []
538
- logger.info(f"Generating: '{final_text[:80]}...'")
539
- for chunk in server.generate(
540
- target_text=final_text,
541
- prompt_latents=prompt_latents,
542
- prompt_text=prompt_text_clean if prompt_latents is not None else "",
543
- max_generate_length=_get_int_env("NANOVLLM_MAX_GENERATE_LENGTH", 2000),
544
- temperature=_get_float_env("NANOVLLM_TEMPERATURE", 1.0),
545
- cfg_value=float(cfg_value_input),
546
- ref_audio_latents=ref_audio_latents,
547
- ):
548
- chunks.append(chunk)
549
-
550
- if not chunks:
551
- raise RuntimeError("The model returned no audio chunks.")
552
-
553
- wav = np.concatenate(chunks, axis=0).astype(np.float32, copy=False)
554
- wav = _float_audio_to_int16(wav)
555
- return (int(model_info["sample_rate"]), wav)
 
 
 
 
 
 
556
 
557
 
558
  def generate_tts_audio(
 
2
  import logging
3
  import os
4
  import sys
5
+ import tempfile
6
  from pathlib import Path
7
  from threading import Lock, Thread
8
  from typing import Optional, Tuple
 
29
  handlers=[logging.StreamHandler(sys.stdout)],
30
  )
31
  logger = logging.getLogger(__name__)
32
+ DEFAULT_ASR_MODEL_REF = "FunAudioLLM/SenseVoiceSmall"
33
+ DEFAULT_ZIPENHANCER_MODEL = "iic/speech_zipenhancer_ans_multiloss_16k_base"
34
+ MAX_REFERENCE_AUDIO_SECONDS = 50.0
35
 
36
 
37
  def _configure_cache_dirs() -> None:
 
47
  os.environ.get("GRADIO_TEMP_DIR", str(cache_root / "gradio"))
48
  ).expanduser()
49
  pip_cache = Path(os.environ.get("PIP_CACHE_DIR", str(cache_root / "pip"))).expanduser()
50
+ modelscope_cache = Path(
51
+ os.environ.get("MODELSCOPE_CACHE", str(cache_root / "modelscope"))
52
+ ).expanduser()
53
 
54
+ for path in (cache_root, hf_home, gradio_tmp, pip_cache, modelscope_cache):
55
  path.mkdir(parents=True, exist_ok=True)
56
 
57
  os.environ.setdefault("XDG_CACHE_HOME", str(cache_root))
 
59
  os.environ.setdefault("HUGGINGFACE_HUB_CACHE", str(hf_home / "hub"))
60
  os.environ.setdefault("PIP_CACHE_DIR", str(pip_cache))
61
  os.environ.setdefault("GRADIO_TEMP_DIR", str(gradio_tmp))
62
+ os.environ.setdefault("MODELSCOPE_CACHE", str(modelscope_cache))
63
  logger.info(f"Using persistent cache directories under {persistent_root}")
64
 
65
 
 
69
  _voxcpm_server = None
70
  _model_info = None
71
  _server_inference_timesteps = None
72
+ _denoiser = None
73
  _server_lock = Lock()
74
  _prewarm_lock = Lock()
75
+ _denoiser_lock = Lock()
76
  _prewarm_started = False
77
  _runtime_diag_logged = False
78
 
 
121
  return DEFAULT_MODEL_REF
122
 
123
 
124
+ def _resolve_asr_model_ref() -> str:
125
+ return DEFAULT_ASR_MODEL_REF
126
+
127
+
128
+ def _resolve_zipenhancer_model_ref() -> str:
129
+ for env_name in ("ZIPENHANCER_MODEL_ID", "ZIPENHANCER_MODEL_PATH"):
130
+ value = os.environ.get(env_name, "").strip()
131
+ if value:
132
+ return value
133
+ return DEFAULT_ZIPENHANCER_MODEL
134
+
135
+
136
  def _log_runtime_diagnostics_once() -> None:
137
  global _runtime_diag_logged
138
  if _runtime_diag_logged:
 
154
  _runtime_diag_logged = True
155
 
156
 
157
+ class _ZipEnhancer:
158
+ def __init__(self, model_ref: str):
159
+ import torchaudio
160
+ from modelscope.pipelines import pipeline
161
+ from modelscope.utils.constant import Tasks
162
+
163
+ self._torchaudio = torchaudio
164
+ self.model_ref = model_ref
165
+ self._pipeline = pipeline(Tasks.acoustic_noise_suppression, model=model_ref)
166
+
167
+ def _normalize_loudness(self, wav_path: str) -> None:
168
+ audio, sr = self._torchaudio.load(wav_path)
169
+ loudness = self._torchaudio.functional.loudness(audio, sr)
170
+ normalized_audio = self._torchaudio.functional.gain(audio, -20 - loudness)
171
+ self._torchaudio.save(wav_path, normalized_audio, sr)
172
+
173
+ def enhance(self, input_path: str) -> str:
174
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
175
+ output_path = tmp_file.name
176
+ try:
177
+ self._pipeline(input_path, output_path=output_path)
178
+ self._normalize_loudness(output_path)
179
+ return output_path
180
+ except Exception:
181
+ if os.path.exists(output_path):
182
+ try:
183
+ os.unlink(output_path)
184
+ except OSError:
185
+ pass
186
+ raise
187
+
188
+
189
+ def get_denoiser():
190
+ global _denoiser
191
+ if _denoiser is not None:
192
+ return _denoiser
193
+
194
+ with _denoiser_lock:
195
+ if _denoiser is not None:
196
+ return _denoiser
197
+
198
+ model_ref = _resolve_zipenhancer_model_ref()
199
+ logger.info(f"Loading ZipEnhancer denoiser from {model_ref} ...")
200
+ _denoiser = _ZipEnhancer(model_ref)
201
+ logger.info("ZipEnhancer denoiser loaded.")
202
+ return _denoiser
203
+
204
+
205
  def _extract_asr_text(asr_result) -> str:
206
  if not asr_result:
207
  return ""
 
221
  return path.read_bytes(), audio_format
222
 
223
 
224
+ def _validate_reference_audio_duration(audio_path: str) -> None:
225
+ import soundfile as sf
226
+
227
+ info = sf.info(audio_path)
228
+ duration_seconds = float(info.frames) / float(info.samplerate)
229
+ if duration_seconds > MAX_REFERENCE_AUDIO_SECONDS:
230
+ raise ValueError(
231
+ f"Reference audio is too long. Please upload audio no longer than {int(MAX_REFERENCE_AUDIO_SECONDS)} seconds."
232
+ )
233
+
234
+
235
+ def _prepare_audio_for_encoding(
236
+ audio_path: Optional[str], *, denoise: bool
237
+ ) -> tuple[bytes | None, str | None, Optional[str]]:
238
+ if audio_path is None or not audio_path.strip():
239
+ return None, None, None
240
+
241
+ _validate_reference_audio_duration(audio_path)
242
+
243
+ source_path = audio_path
244
+ temp_path = None
245
+ if denoise:
246
+ logger.info("Applying ZipEnhancer denoising to reference audio ...")
247
+ try:
248
+ temp_path = get_denoiser().enhance(audio_path)
249
+ source_path = temp_path
250
+ except Exception as exc:
251
+ raise RuntimeError(f"ZipEnhancer denoising failed: {exc}") from exc
252
+
253
+ audio_bytes, audio_format = _read_audio_bytes(source_path)
254
+ return audio_bytes, audio_format, temp_path
255
+
256
+
257
  def _safe_prompt_wav_recognition(use_prompt_text: bool, prompt_wav: Optional[str]) -> str:
258
  try:
259
  return prompt_wav_recognition(use_prompt_text, prompt_wav)
 
481
  global _asr_model
482
  if _asr_model is None:
483
  from funasr import AutoModel
484
+ from huggingface_hub import snapshot_download
485
 
486
  device = os.environ.get("ASR_DEVICE", "cpu").strip() or "cpu"
487
+ asr_model_ref = _resolve_asr_model_ref()
488
+ logger.info(f"Downloading ASR model from Hugging Face: {asr_model_ref}")
489
+ asr_model_path = snapshot_download(repo_id=asr_model_ref)
490
  logger.info(f"Loading ASR model on {device} ...")
491
  _asr_model = AutoModel(
492
+ model=asr_model_path,
493
  disable_update=True,
494
  log_level="INFO",
495
  device=device,
 
598
  denoise: bool = True,
599
  inference_timesteps: int = 10,
600
  ) -> Tuple[int, np.ndarray]:
601
+ temp_audio_path = None
602
+ try:
603
+ timesteps = int(inference_timesteps)
604
+ server = get_voxcpm_server(timesteps)
605
+ model_info = get_model_info(timesteps)
606
+
607
+ text = (text_input or "").strip()
608
+ if len(text) == 0:
609
+ raise ValueError("Please input text to synthesize.")
610
+
611
+ control = (control_instruction or "").strip()
612
+ final_text = f"({control}){text}" if control and not use_prompt_text else text
613
+
614
+ audio_bytes, audio_format, temp_audio_path = _prepare_audio_for_encoding(
615
+ reference_wav_path_input,
616
+ denoise=bool(denoise),
 
 
617
  )
618
+ prompt_text_clean = (prompt_text_input or "").strip()
619
+ if use_prompt_text and audio_bytes is None:
620
+ raise ValueError("Ultimate Cloning Mode requires a reference audio clip.")
621
+ if use_prompt_text and not prompt_text_clean:
622
+ raise ValueError(
623
+ "Ultimate Cloning Mode requires a transcript. Please wait for ASR or fill it in manually."
624
+ )
625
+ if not use_prompt_text:
626
+ prompt_text_clean = ""
627
 
628
+ if do_normalize:
629
+ logger.info(
630
+ "Ignoring normalize option: nano-vLLM backend does not support per-request text normalization."
631
+ )
632
+
633
+ prompt_latents = None
634
+ ref_audio_latents = None
635
+ if audio_bytes is not None and audio_format is not None and use_prompt_text:
636
+ logger.info(f"[Ultimate Cloning] encoding prompt audio as {audio_format}")
637
+ prompt_latents = server.encode_latents(audio_bytes, audio_format)
638
+ elif audio_bytes is not None and audio_format is not None:
639
+ logger.info(f"[Controllable Cloning] encoding reference audio as {audio_format}")
640
+ ref_audio_latents = server.encode_latents(audio_bytes, audio_format)
641
+
642
+ if prompt_latents is not None:
643
+ logger.info("[Ultimate Cloning] reference audio + transcript")
644
+ elif ref_audio_latents is not None:
645
+ logger.info("[Controllable Cloning] reference audio only")
646
+ else:
647
+ logger.info(f"[Voice Design] control: {control[:50] if control else 'None'}")
648
+
649
+ chunks: list[np.ndarray] = []
650
+ logger.info(f"Generating: '{final_text[:80]}...'")
651
+ for chunk in server.generate(
652
+ target_text=final_text,
653
+ prompt_latents=prompt_latents,
654
+ prompt_text=prompt_text_clean if prompt_latents is not None else "",
655
+ max_generate_length=_get_int_env("NANOVLLM_MAX_GENERATE_LENGTH", 2000),
656
+ temperature=_get_float_env("NANOVLLM_TEMPERATURE", 1.0),
657
+ cfg_value=float(cfg_value_input),
658
+ ref_audio_latents=ref_audio_latents,
659
+ ):
660
+ chunks.append(chunk)
661
+
662
+ if not chunks:
663
+ raise RuntimeError("The model returned no audio chunks.")
664
+
665
+ wav = np.concatenate(chunks, axis=0).astype(np.float32, copy=False)
666
+ wav = _float_audio_to_int16(wav)
667
+ return (int(model_info["sample_rate"]), wav)
668
+ finally:
669
+ if temp_audio_path and os.path.exists(temp_audio_path):
670
+ try:
671
+ os.unlink(temp_audio_path)
672
+ except OSError:
673
+ pass
674
 
675
 
676
  def generate_tts_audio(
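
The `app.py` changes above delete the denoised temp file in two places: inside `enhance` on failure, and in the `finally` block of `_generate_tts_audio_once`. One alternative way to express the same lifetime is a context manager, sketched below. This is an illustrative variant, not the commit's implementation: `temporary_wav` and its `produce` callback are hypothetical names, and the callback stands in for whatever writes the denoised audio.

```python
import contextlib
import os
import tempfile
from typing import Callable, Iterator


@contextlib.contextmanager
def temporary_wav(produce: Callable[[str], None]) -> Iterator[str]:
    """Create a temp .wav, let `produce` fill it, and always delete it on exit."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
        path = tmp_file.name
    try:
        produce(path)  # e.g. a denoiser writing its output to `path`
        yield path
    finally:
        # Best-effort cleanup, mirroring the OSError-tolerant unlink in the diff.
        with contextlib.suppress(OSError):
            os.unlink(path)
```

With this shape, the file is removed whether the producer fails, the body of the `with` block raises (for example while reading the bytes back), or everything succeeds, so the caller does not need to carry a temp path through its return values.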
requirements.txt CHANGED
@@ -1,6 +1,7 @@
1
  gradio==6.0.0
2
  huggingface-hub
3
  funasr
 
4
  numpy>=1.21.0
5
  torch==2.5.1
6
  torchaudio==2.5.1
 
1
  gradio==6.0.0
2
  huggingface-hub
3
  funasr
4
+ modelscope>=1.22.0
5
  numpy>=1.21.0
6
  torch==2.5.1
7
  torchaudio==2.5.1