Spaces: Running on A10G

刘鑫 committed
Commit · ffbf382
Parent(s): 7d8de04

fix: add denoiser and reference audio guards

Enable ZipEnhancer preprocessing for reference audio, move SenseVoice downloads to Hugging Face caching, and reject overly long reference clips before ASR or NanoVLLM encoding.

Made-with: Cursor

Files changed:
- README.md +8 -1
- app.py +180 -62
- requirements.txt +1 -0
README.md
CHANGED

@@ -20,14 +20,19 @@ Notes:
 
 - This is the non-Docker experiment path. It relies on a persistent GPU Gradio Space.
 - `flash-attn` and `nanovllm-voxcpm` are pinned in `requirements.txt`, so they install during Space build instead of on first request.
+- ZipEnhancer denoising is supported for reference audio cloning. The default denoiser model is `iic/speech_zipenhancer_ans_multiloss_16k_base`.
 - The Space now defaults to a hardened runtime path:
 - If `/data` exists, Hugging Face cache, pip cache, and Gradio temp files are stored there automatically.
+- If `/data` exists, ModelScope cache is also persisted there for ZipEnhancer downloads.
 - Backend prewarm is enabled by default, so startup can begin dependency install + model load in the background.
 - Gradio SSR is disabled by default for stability.
 - The first cold start may still spend extra time installing dependencies, downloading the model, and loading the server.
+- `SenseVoiceSmall` is downloaded from Hugging Face and cached locally before ASR initialization.
 - `ASR_DEVICE` defaults to `cpu` to avoid competing with TTS GPU memory.
+- Reference audio longer than 50 seconds is rejected early before denoising or Nano-vLLM encoding.
 - The `LocDiT flow-matching steps` slider is wired to Nano-vLLM server `inference_timesteps`; changing it rebuilds the backend server.
-- The existing `normalize`
+- The existing `normalize` toggle is kept for UI compatibility, but Nano-vLLM currently ignores it.
+- The existing `denoise` toggle now runs ZipEnhancer on the reference audio before encoding it to latents.
 - `packages.txt` is required because this path needs extra system build dependencies.
 
 Stability recommendation:

@@ -43,6 +48,7 @@ Recommended environment variables:
 - `NANOVLLM_MODEL`: optional direct model ref override. Can be a local path or HF repo id
 - `NANOVLLM_MODEL_PATH`: optional local model path override
 - `ASR_DEVICE`: defaults to `cpu`
+- `ZIPENHANCER_MODEL_ID`: optional ModelScope denoiser model id or local path. Defaults to `iic/speech_zipenhancer_ans_multiloss_16k_base`
 - `NANOVLLM_INFERENCE_TIMESTEPS`: initial default is `10`
 - `NANOVLLM_PREWARM`: defaults to `true`
 - `NANOVLLM_SERVERPOOL_MAX_NUM_BATCHED_TOKENS`: defaults to `8192`

@@ -53,6 +59,7 @@ Recommended environment variables:
 - `NANOVLLM_SERVERPOOL_DEVICES`: defaults to `0`
 - `NANOVLLM_MAX_GENERATE_LENGTH`: defaults to `2000`
 - `NANOVLLM_TEMPERATURE`: defaults to `1.0`
+- `MODELSCOPE_CACHE`: optional persistent cache path for ZipEnhancer downloads
 - `GRADIO_QUEUE_MAX_SIZE`: defaults to `10`
 - `GRADIO_DEFAULT_CONCURRENCY_LIMIT`: defaults to `1`
 - `GRADIO_SSR_MODE`: defaults to `false`
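The override precedence documented above can be sketched in a few lines. This is an illustrative helper, not the Space's actual code: it only checks `ZIPENHANCER_MODEL_ID` (the README also mentions a path variant) and falls back to the documented default id.

```python
import os

# Default denoiser id taken from the README above.
DEFAULT_ZIPENHANCER_MODEL = "iic/speech_zipenhancer_ans_multiloss_16k_base"


def resolve_zipenhancer_model() -> str:
    """Return the env override when set and non-blank, else the documented default."""
    value = os.environ.get("ZIPENHANCER_MODEL_ID", "").strip()
    return value or DEFAULT_ZIPENHANCER_MODEL


os.environ.pop("ZIPENHANCER_MODEL_ID", None)
print(resolve_zipenhancer_model())  # falls back to the default model id

os.environ["ZIPENHANCER_MODEL_ID"] = "/data/models/zipenhancer"
print(resolve_zipenhancer_model())  # override wins
```

The same `value or DEFAULT` pattern also treats an env var set to an empty or whitespace-only string as "unset", which matches how the app resolves its other model refs.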
app.py
CHANGED

@@ -2,6 +2,7 @@ import atexit
 import logging
 import os
 import sys
+import tempfile
 from pathlib import Path
 from threading import Lock, Thread
 from typing import Optional, Tuple

@@ -28,6 +29,9 @@ logging.basicConfig(
     handlers=[logging.StreamHandler(sys.stdout)],
 )
 logger = logging.getLogger(__name__)
+DEFAULT_ASR_MODEL_REF = "FunAudioLLM/SenseVoiceSmall"
+DEFAULT_ZIPENHANCER_MODEL = "iic/speech_zipenhancer_ans_multiloss_16k_base"
+MAX_REFERENCE_AUDIO_SECONDS = 50.0
 
 
 def _configure_cache_dirs() -> None:

@@ -43,8 +47,11 @@ def _configure_cache_dirs() -> None:
         os.environ.get("GRADIO_TEMP_DIR", str(cache_root / "gradio"))
     ).expanduser()
     pip_cache = Path(os.environ.get("PIP_CACHE_DIR", str(cache_root / "pip"))).expanduser()
+    modelscope_cache = Path(
+        os.environ.get("MODELSCOPE_CACHE", str(cache_root / "modelscope"))
+    ).expanduser()
 
-    for path in (cache_root, hf_home, gradio_tmp, pip_cache):
+    for path in (cache_root, hf_home, gradio_tmp, pip_cache, modelscope_cache):
         path.mkdir(parents=True, exist_ok=True)
 
     os.environ.setdefault("XDG_CACHE_HOME", str(cache_root))

@@ -52,6 +59,7 @@ def _configure_cache_dirs() -> None:
     os.environ.setdefault("HUGGINGFACE_HUB_CACHE", str(hf_home / "hub"))
     os.environ.setdefault("PIP_CACHE_DIR", str(pip_cache))
     os.environ.setdefault("GRADIO_TEMP_DIR", str(gradio_tmp))
+    os.environ.setdefault("MODELSCOPE_CACHE", str(modelscope_cache))
     logger.info(f"Using persistent cache directories under {persistent_root}")

@@ -61,8 +69,10 @@ _asr_model = None
 _voxcpm_server = None
 _model_info = None
 _server_inference_timesteps = None
+_denoiser = None
 _server_lock = Lock()
 _prewarm_lock = Lock()
+_denoiser_lock = Lock()
 _prewarm_started = False
 _runtime_diag_logged = False

@@ -111,6 +121,18 @@ def _resolve_model_ref() -> str:
     return DEFAULT_MODEL_REF
 
 
+def _resolve_asr_model_ref() -> str:
+    return DEFAULT_ASR_MODEL_REF
+
+
+def _resolve_zipenhancer_model_ref() -> str:
+    for env_name in ("ZIPENHANCER_MODEL_ID", "ZIPENHANCER_MODEL_PATH"):
+        value = os.environ.get(env_name, "").strip()
+        if value:
+            return value
+    return DEFAULT_ZIPENHANCER_MODEL
+
+
 def _log_runtime_diagnostics_once() -> None:
     global _runtime_diag_logged
     if _runtime_diag_logged:

@@ -132,6 +154,54 @@ def _log_runtime_diagnostics_once() -> None:
     _runtime_diag_logged = True
 
 
+class _ZipEnhancer:
+    def __init__(self, model_ref: str):
+        import torchaudio
+        from modelscope.pipelines import pipeline
+        from modelscope.utils.constant import Tasks
+
+        self._torchaudio = torchaudio
+        self.model_ref = model_ref
+        self._pipeline = pipeline(Tasks.acoustic_noise_suppression, model=model_ref)
+
+    def _normalize_loudness(self, wav_path: str) -> None:
+        audio, sr = self._torchaudio.load(wav_path)
+        loudness = self._torchaudio.functional.loudness(audio, sr)
+        normalized_audio = self._torchaudio.functional.gain(audio, -20 - loudness)
+        self._torchaudio.save(wav_path, normalized_audio, sr)
+
+    def enhance(self, input_path: str) -> str:
+        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
+            output_path = tmp_file.name
+        try:
+            self._pipeline(input_path, output_path=output_path)
+            self._normalize_loudness(output_path)
+            return output_path
+        except Exception:
+            if os.path.exists(output_path):
+                try:
+                    os.unlink(output_path)
+                except OSError:
+                    pass
+            raise
+
+
+def get_denoiser():
+    global _denoiser
+    if _denoiser is not None:
+        return _denoiser
+
+    with _denoiser_lock:
+        if _denoiser is not None:
+            return _denoiser
+
+        model_ref = _resolve_zipenhancer_model_ref()
+        logger.info(f"Loading ZipEnhancer denoiser from {model_ref} ...")
+        _denoiser = _ZipEnhancer(model_ref)
+        logger.info("ZipEnhancer denoiser loaded.")
+        return _denoiser
+
+
 def _extract_asr_text(asr_result) -> str:
     if not asr_result:
         return ""

@@ -151,6 +221,39 @@ def _read_audio_bytes(audio_path: Optional[str]) -> tuple[bytes | None, str | None]:
     return path.read_bytes(), audio_format
 
 
+def _validate_reference_audio_duration(audio_path: str) -> None:
+    import soundfile as sf
+
+    info = sf.info(audio_path)
+    duration_seconds = float(info.frames) / float(info.samplerate)
+    if duration_seconds > MAX_REFERENCE_AUDIO_SECONDS:
+        raise ValueError(
+            f"Reference audio is too long; please upload a clip of at most {int(MAX_REFERENCE_AUDIO_SECONDS)} seconds."
+        )
+
+
+def _prepare_audio_for_encoding(
+    audio_path: Optional[str], *, denoise: bool
+) -> tuple[bytes | None, str | None, Optional[str]]:
+    if audio_path is None or not audio_path.strip():
+        return None, None, None
+
+    _validate_reference_audio_duration(audio_path)
+
+    source_path = audio_path
+    temp_path = None
+    if denoise:
+        logger.info("Applying ZipEnhancer denoising to reference audio ...")
+        try:
+            temp_path = get_denoiser().enhance(audio_path)
+            source_path = temp_path
+        except Exception as exc:
+            raise RuntimeError(f"ZipEnhancer denoising failed: {exc}") from exc
+
+    audio_bytes, audio_format = _read_audio_bytes(source_path)
+    return audio_bytes, audio_format, temp_path
+
+
 def _safe_prompt_wav_recognition(use_prompt_text: bool, prompt_wav: Optional[str]) -> str:
     try:
         return prompt_wav_recognition(use_prompt_text, prompt_wav)

@@ -378,11 +481,15 @@ def get_asr_model():
     global _asr_model
     if _asr_model is None:
         from funasr import AutoModel
+        from huggingface_hub import snapshot_download
 
         device = os.environ.get("ASR_DEVICE", "cpu").strip() or "cpu"
+        asr_model_ref = _resolve_asr_model_ref()
+        logger.info(f"Downloading ASR model from Hugging Face: {asr_model_ref}")
+        asr_model_path = snapshot_download(repo_id=asr_model_ref)
         logger.info(f"Loading ASR model on {device} ...")
         _asr_model = AutoModel(
-            model=
+            model=asr_model_path,
             disable_update=True,
             log_level="INFO",
             device=device,

@@ -491,68 +598,79 @@ def _generate_tts_audio_once(
     denoise: bool = True,
     inference_timesteps: int = 10,
 ) -> Tuple[int, np.ndarray]:
-        raise ValueError(
-            "Ultimate Cloning Mode requires a transcript. Please wait for ASR or fill it in manually."
-        )
+    temp_audio_path = None
+    try:
+        timesteps = int(inference_timesteps)
+        server = get_voxcpm_server(timesteps)
+        model_info = get_model_info(timesteps)
+
+        text = (text_input or "").strip()
+        if len(text) == 0:
+            raise ValueError("Please input text to synthesize.")
+
+        control = (control_instruction or "").strip()
+        final_text = f"({control}){text}" if control and not use_prompt_text else text
+
+        audio_bytes, audio_format, temp_audio_path = _prepare_audio_for_encoding(
+            reference_wav_path_input,
+            denoise=bool(denoise),
+        )
+        prompt_text_clean = (prompt_text_input or "").strip()
+        if use_prompt_text and audio_bytes is None:
+            raise ValueError("Ultimate Cloning Mode requires a reference audio clip.")
+        if use_prompt_text and not prompt_text_clean:
+            raise ValueError(
+                "Ultimate Cloning Mode requires a transcript. Please wait for ASR or fill it in manually."
+            )
+        if not use_prompt_text:
+            prompt_text_clean = ""
 
+        if do_normalize:
+            logger.info(
+                "Ignoring normalize option: nano-vLLM backend does not support per-request text normalization."
+            )
+
+        prompt_latents = None
+        ref_audio_latents = None
+        if audio_bytes is not None and audio_format is not None and use_prompt_text:
+            logger.info(f"[Ultimate Cloning] encoding prompt audio as {audio_format}")
+            prompt_latents = server.encode_latents(audio_bytes, audio_format)
+        elif audio_bytes is not None and audio_format is not None:
+            logger.info(f"[Controllable Cloning] encoding reference audio as {audio_format}")
+            ref_audio_latents = server.encode_latents(audio_bytes, audio_format)
+
+        if prompt_latents is not None:
+            logger.info("[Ultimate Cloning] reference audio + transcript")
+        elif ref_audio_latents is not None:
+            logger.info("[Controllable Cloning] reference audio only")
+        else:
+            logger.info(f"[Voice Design] control: {control[:50] if control else 'None'}")
+
+        chunks: list[np.ndarray] = []
+        logger.info(f"Generating: '{final_text[:80]}...'")
+        for chunk in server.generate(
+            target_text=final_text,
+            prompt_latents=prompt_latents,
+            prompt_text=prompt_text_clean if prompt_latents is not None else "",
+            max_generate_length=_get_int_env("NANOVLLM_MAX_GENERATE_LENGTH", 2000),
+            temperature=_get_float_env("NANOVLLM_TEMPERATURE", 1.0),
+            cfg_value=float(cfg_value_input),
+            ref_audio_latents=ref_audio_latents,
+        ):
+            chunks.append(chunk)
+
+        if not chunks:
+            raise RuntimeError("The model returned no audio chunks.")
+
+        wav = np.concatenate(chunks, axis=0).astype(np.float32, copy=False)
+        wav = _float_audio_to_int16(wav)
+        return (int(model_info["sample_rate"]), wav)
+    finally:
+        if temp_audio_path and os.path.exists(temp_audio_path):
+            try:
+                os.unlink(temp_audio_path)
+            except OSError:
+                pass
 
 
 def generate_tts_audio(
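The 50-second guard in `_validate_reference_audio_duration` reduces to a frames-over-samplerate check on the file's metadata. A dependency-free sketch, with `soundfile`'s `info` call replaced by plain integers for illustration:

```python
# Cap taken from the app.py diff above.
MAX_REFERENCE_AUDIO_SECONDS = 50.0


def check_reference_duration(frames: int, samplerate: int) -> float:
    """Compute the clip length in seconds and reject clips over the cap."""
    duration = frames / float(samplerate)
    if duration > MAX_REFERENCE_AUDIO_SECONDS:
        raise ValueError(
            f"Reference audio is {duration:.1f}s; the limit is "
            f"{int(MAX_REFERENCE_AUDIO_SECONDS)}s."
        )
    return duration


print(check_reference_duration(16000 * 30, 16000))  # 30.0 – accepted
try:
    check_reference_duration(16000 * 60, 16000)     # a 60 s clip is rejected
except ValueError as exc:
    print(exc)
```

Because the check reads only header metadata (frame count and sample rate), it runs before any decoding, denoising, or latent encoding, which is what lets the app fail fast on oversized uploads.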
requirements.txt
CHANGED

@@ -1,6 +1,7 @@
 gradio==6.0.0
 huggingface-hub
 funasr
+modelscope>=1.22.0
 numpy>=1.21.0
 torch==2.5.1
 torchaudio==2.5.1