Spaces:
Running on Zero
github-actions[bot] committed
Commit 5b7cd5f · 1 parent: e5dfb46
deploy: switch to chatterbox requirements @ 4319730
- CLAUDE.md +4 -0
- app.py +2 -0
- graphify-out/.graphify_root +1 -0
- graphify-out/GRAPH_REPORT.md +152 -127
- graphify-out/graph.html +0 -0
- server.py +5 -1
- tools_api/__init__.py +17 -0
- tools_api/audio_cleanup.py +136 -0
- tools_api/router.py +248 -0
- tools_api/storage.py +73 -0
- tools_api/subtitles.py +288 -0
- tools_api/voice_clone.py +241 -0
CLAUDE.md CHANGED
@@ -1,3 +1,7 @@
+## Deployment
+
+HF Spaces deployment is fully automated via `.github/workflows/deploy-hf.yml`. Pushing to `origin/main` triggers the workflow which runs `./deploy.sh --force` and pushes to all three Spaces (Chatterbox, OmniVoice, Qwen3). Do not run `./deploy.sh` locally after a push — it is redundant. To verify a deploy, use `gh run list --workflow=deploy-hf.yml`.
+
 ## graphify

 This project has a graphify knowledge graph at graphify-out/.
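The verification step in the Deployment note can also be scripted. The sketch below is a minimal, hypothetical helper and is not part of this commit: it assumes the GitHub CLI is installed and authenticated, that `gh run list` returns the newest run first, and that the `--json` fields `status`, `conclusion`, and `headSha` are available. The function names `latest_run_succeeded` and `fetch_runs` are invented for illustration.

```python
import json
import subprocess

WORKFLOW = "deploy-hf.yml"  # workflow file named in CLAUDE.md


def latest_run_succeeded(runs_json: str) -> bool:
    """Return True if the first run in the JSON list completed successfully.

    Assumes the list is ordered newest first, as `gh run list` prints it.
    """
    runs = json.loads(runs_json)
    if not runs:
        return False  # no runs recorded for this workflow
    newest = runs[0]
    return (
        newest.get("status") == "completed"
        and newest.get("conclusion") == "success"
    )


def fetch_runs(limit: int = 5) -> str:
    """Call the GitHub CLI and return its JSON output as text.

    Hypothetical wrapper: requires `gh` on PATH and an authenticated session.
    """
    cmd = [
        "gh", "run", "list",
        f"--workflow={WORKFLOW}",
        "--limit", str(limit),
        "--json", "status,conclusion,headSha",
    ]
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout
```

Calling `latest_run_succeeded(fetch_runs())` from the repository root would then report whether the most recent deploy run finished cleanly. Keeping the parsing separate from the CLI call makes the check testable without network access.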
app.py CHANGED
@@ -25,8 +25,10 @@ demo = Server()
 # INTEGRATE SERVER.PY ROUTES
 # -----------------------------------------------------
 from server import router, limiter, enforce_content_length_limit
+from tools_api import router as tools_router

 demo.include_router(router)
+demo.include_router(tools_router)
 demo.state.limiter = limiter
 demo.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
graphify-out/.graphify_root ADDED
@@ -0,0 +1 @@
+/Users/rafa/MscAi/VideoVoice-be
graphify-out/GRAPH_REPORT.md CHANGED
@@ -1,12 +1,12 @@
-# Graph Report - VideoVoice-be (2026-

 ## Corpus Check
--
 - Verdict: corpus is large enough that graph structure adds value.

 ## Summary
--
-- Extraction:
 - Token cost: 0 input · 0 output

 ## Community Hubs (Navigation)
@@ -27,12 +27,12 @@
 - [[_COMMUNITY_Community 14|Community 14]]
 - [[_COMMUNITY_Community 15|Community 15]]
 - [[_COMMUNITY_Community 16|Community 16]]
 - [[_COMMUNITY_Community 18|Community 18]]
-- [[_COMMUNITY_Community
-- [[_COMMUNITY_Community
-- [[_COMMUNITY_Community
-- [[_COMMUNITY_Community
-- [[_COMMUNITY_Community 30|Community 30]]
 - [[_COMMUNITY_Community 31|Community 31]]
 - [[_COMMUNITY_Community 32|Community 32]]
 - [[_COMMUNITY_Community 33|Community 33]]
@@ -67,6 +67,11 @@
 - [[_COMMUNITY_Community 62|Community 62]]
 - [[_COMMUNITY_Community 63|Community 63]]
 - [[_COMMUNITY_Community 64|Community 64]]

 ## God Nodes (most connected - your core abstractions)
 1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
@@ -81,16 +86,16 @@
 10. `BasePoster` - 14 edges

 ## Surprising Connections (you probably didn't know these)
-- `_segments_from_pollinations()` --calls--> `post()` [INFERRED]
-steps/s2_transcribe.py → social_distributor/poster/platforms/base.py
 - `chatterbox-tts==0.1.7 --no-deps` --semantically_similar_to--> `omnivoice>=0.1.4` [INFERRED] [semantically similar]
 requirements.txt → requirements-omni.txt
 - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)` [INFERRED] [semantically similar]
 requirements.txt → requirements-omni.txt
 - `run_pipeline()` --calls--> `transcribe()` [INFERRED]
 pipeline.py → steps/s2_transcribe.py
-- `run_pipeline()` --calls--> `translate()` [INFERRED]
-pipeline.py → steps/s3_translate.py

 ## Hyperedges (group relationships)
 - **Six-step translation pipeline** — [EXTRACTED 1.00]
@@ -101,323 +106,343 @@

 ### Community 0 - "Community 0"
 Cohesion: 0.04
-Nodes (

 ### Community 1 - "Community 1"
-Cohesion: 0.
-Nodes (

 ### Community 2 - "Community 2"
-Cohesion: 0.
-Nodes (

 ### Community 3 - "Community 3"
-Cohesion: 0.
-Nodes (

 ### Community 4 - "Community 4"
-Cohesion: 0.
-Nodes (

 ### Community 5 - "Community 5"
-Cohesion: 0.
-Nodes (

 ### Community 6 - "Community 6"
-Cohesion: 0.
-Nodes (

 ### Community 7 - "Community 7"
 Cohesion: 0.07
-Nodes (

 ### Community 8 - "Community 8"
 Cohesion: 0.09
 Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)

-### Community
 Cohesion: 0.1
 Nodes (14): default(), DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb(), preprocess() (+6 more)

-### Community
 Cohesion: 0.1
 Nodes (31): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., Translate the text of each segment into target_language in batches. Args:, translate(), _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client() (+23 more)

-### Community
 Cohesion: 0.12
-Nodes (

-### Community
-Cohesion: 0.
-Nodes (

-### Community
 Cohesion: 0.24
 Nodes (11): extract_creator(), _extract_instagram(), _extract_tiktok(), _extract_youtube(), _load_cache(), Extract original creator @username from video URLs., YouTube: visit video page, extract channel name from meta tags., Extract the @username of the original creator from the video URL. Uses Play (+3 more)

-### Community
 Cohesion: 0.27
 Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline. Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)

-### Community
-Cohesion: 0.22
-Nodes (5): Qwen3TTSProcessor, r""" Constructs a Qwen3TTS processor. Args: tokenizer ([`Qwen2T, Main method to prepare for the model one or several sequences(s) and audio(s). T, This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedToke, ProcessorMixin
-
-### Community 16 - "Community 16"
 Cohesion: 0.33
 Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var

-### Community
 Cohesion: 1.0
 Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)

-### Community
 Cohesion: 1.0
 Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.

-### Community
 Cohesion: 1.0
 Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te

-### Community
 Cohesion: 1.0
 Nodes (1): Voice clone speech using the Base model. You can provide either:

-### Community
 Cohesion: 1.0
 Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct

-### Community
 Cohesion: 1.0
 Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option

-### Community
 Cohesion: 1.0
 Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.

-### Community
 Cohesion: 1.0
 Nodes (1): Reject oversized uploads before body parsing.

-### Community
 Cohesion: 1.0
 Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job

-### Community
 Cohesion: 1.0
 Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.

-### Community
 Cohesion: 1.0
 Nodes (1): Return curated showcase entries with resolved streaming URLs.

-### Community
 Cohesion: 1.0
 Nodes (1): Submit a video for translation.

-### Community
 Cohesion: 1.0
 Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.

-### Community
 Cohesion: 1.0
 Nodes (1): User selects a TTS model after previewing.

-### Community
 Cohesion: 1.0
 Nodes (1): Serve a preview audio WAV file.

-### Community
 Cohesion: 1.0
 Nodes (1): Download the translated video.

-### Community
 Cohesion: 1.0
 Nodes (1): Create artifact directories and start background cleanup.

-### Community
 Cohesion: 1.0
 Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.

-### Community
 Cohesion: 1.0
 Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura

-### Community
 Cohesion: 1.0
 Nodes (1): Insert extra silence distributed across detected pause points.

-### Community
 Cohesion: 1.0
 Nodes (1): Generate a silent WAV file of given duration.

-### Community
 Cohesion: 1.0
 Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single

-### Community
 Cohesion: 1.0
 Nodes (1): Translate the text of each segment into target_language in batches. Args:

-### Community
 Cohesion: 1.0
 Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope. ZeroGPU only int

-### Community
 Cohesion: 1.0
 Nodes (1): Remove trailing noise/artifacts after speech ends.

-### Community
 Cohesion: 1.0
 Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.

-### Community
 Cohesion: 1.0
 Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.

-### Community
 Cohesion: 1.0
 Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.

-### Community
 Cohesion: 1.0
 Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated

-### Community
 Cohesion: 1.0
 Nodes (1): Generate speech for all segments using OmniVoice voice cloning.

-### Community
 Cohesion: 1.0
 Nodes (1): Synthesise translated text for each segment using voice cloned from reference au

-### Community
 Cohesion: 1.0
 Nodes (1): torch==2.6.0

-### Community
 Cohesion: 1.0
 Nodes (1): fastapi

-### Community
 Cohesion: 1.0
 Nodes (1): yt-dlp

-### Community
 Cohesion: 1.0
 Nodes (1): diffusers==0.29.0

-### Community
 Cohesion: 1.0
 Nodes (1): ARTIFACTS_ROOT env

-### Community
 Cohesion: 1.0
 Nodes (1): AWS g4dn.xlarge alternative

-### Community
 Cohesion: 1.0
 Nodes (1): nodejs (system pkg)

-### Community
 Cohesion: 1.0
 Nodes (1): fonts-noto-core / cjk

-### Community
 Cohesion: 1.0
 Nodes (1): graphify project rules

 ## Knowledge Gaps
-- **
 These have ≤1 connection - possible missing edges or undocumented components.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community
 Too small to be a meaningful cluster - may be noise or needs more connections extracted.

 ## Suggested Questions
 _Questions this graph is uniquely positioned to answer:_

-- **Why does `
-_High betweenness centrality (0.
-- **Why does `
-_High betweenness centrality (0.
 - **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
 _`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
@@ -427,4 +452,4 @@ _Questions this graph is uniquely positioned to answer:_
 - **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
 _`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **What connects `server.py — FastAPI backend for VideoVoice. Endpoints: POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
-
@@ -1,12 +1,12 @@
+# Graph Report - VideoVoice-be (2026-05-16)

 ## Corpus Check
+- 59 files · ~253,292 words
 - Verdict: corpus is large enough that graph structure adds value.

 ## Summary
+- 1050 nodes · 1833 edges · 62 communities detected
+- Extraction: 79% EXTRACTED · 21% INFERRED · 0% AMBIGUOUS · INFERRED: 389 edges (avg confidence: 0.62)
 - Token cost: 0 input · 0 output

 ## Community Hubs (Navigation)
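As a quick sanity check on the Summary figures, the reported percentages can be recomputed from the edge counts. This is a sketch that assumes the 389 INFERRED edges are counted out of the 1833 total edges and that the remainder is EXTRACTED (the report lists 0% AMBIGUOUS); the variable names are illustrative.

```python
# Figures copied from the Summary section of the new GRAPH_REPORT.md.
total_edges = 1833
inferred_edges = 389

# Share of edges that were model-inferred rather than extracted.
inferred_pct = 100 * inferred_edges / total_edges

# Assumption: with 0% AMBIGUOUS, everything else is EXTRACTED.
extracted_pct = 100 - inferred_pct

print(f"INFERRED:  {inferred_pct:.1f}%")   # 21.2%
print(f"EXTRACTED: {extracted_pct:.1f}%")  # 78.8%
```

Both values round to the 21% and 79% shown in the report.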
@@ -27,12 +27,12 @@
 - [[_COMMUNITY_Community 14|Community 14]]
 - [[_COMMUNITY_Community 15|Community 15]]
 - [[_COMMUNITY_Community 16|Community 16]]
+- [[_COMMUNITY_Community 17|Community 17]]
 - [[_COMMUNITY_Community 18|Community 18]]
+- [[_COMMUNITY_Community 19|Community 19]]
+- [[_COMMUNITY_Community 20|Community 20]]
+- [[_COMMUNITY_Community 21|Community 21]]
+- [[_COMMUNITY_Community 23|Community 23]]
 - [[_COMMUNITY_Community 31|Community 31]]
 - [[_COMMUNITY_Community 32|Community 32]]
 - [[_COMMUNITY_Community 33|Community 33]]
@@ -67,6 +67,11 @@
 - [[_COMMUNITY_Community 62|Community 62]]
 - [[_COMMUNITY_Community 63|Community 63]]
 - [[_COMMUNITY_Community 64|Community 64]]
+- [[_COMMUNITY_Community 65|Community 65]]
+- [[_COMMUNITY_Community 66|Community 66]]
+- [[_COMMUNITY_Community 67|Community 67]]
+- [[_COMMUNITY_Community 68|Community 68]]
+- [[_COMMUNITY_Community 69|Community 69]]

 ## God Nodes (most connected - your core abstractions)
 1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
@@ -81,16 +86,16 @@
 10. `BasePoster` - 14 edges

 ## Surprising Connections (you probably didn't know these)
 - `chatterbox-tts==0.1.7 --no-deps` --semantically_similar_to--> `omnivoice>=0.1.4` [INFERRED] [semantically similar]
 requirements.txt → requirements-omni.txt
 - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)` [INFERRED] [semantically similar]
 requirements.txt → requirements-omni.txt
+- `content_length_middleware()` --calls--> `enforce_content_length_limit()` [INFERRED]
+app.py → server.py
+- `run_pipeline()` --calls--> `separate_audio()` [INFERRED]
+pipeline.py → steps/s1b_separate.py
 - `run_pipeline()` --calls--> `transcribe()` [INFERRED]
 pipeline.py → steps/s2_transcribe.py

 ## Hyperedges (group relationships)
 - **Six-step translation pipeline** — [EXTRACTED 1.00]
@@ -101,323 +106,343 @@

 ### Community 0 - "Community 0"
 Cohesion: 0.04
+Nodes (69): Qwen3TTSConfig, Qwen3TTSSpeakerEncoderConfig, Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig, r""" This is the configuration class to store the configuration of a [`Qwen3, r""" This is the configuration class to store the configuration of a [`Qwen3, This is the configuration class to store the configuration of a [`Qwen3TTSForCon, r""" This is the configuration class to store the configuration of a [`Qwen3 (+61 more)

 ### Community 1 - "Community 1"
+Cohesion: 0.02
+Nodes (118): api_run_pipeline(), content_length_middleware(), ZeroGPU-compatible entrypoint using gradio.Server. Server extends FastAPI, so al, Exposed through Gradio's API engine. ZeroGPU will allocate a GPU when this e, run_pipeline(), BaseHTTPMiddleware, BaseModel, _artifact_reaper_loop() (+110 more)

 ### Community 2 - "Community 2"
+Cohesion: 0.05
+Nodes (57): ABC, BasePoster, Abstract base class for platform posters., Save a debug screenshot on failure., BasePoster, _build_system_prompt(), _build_user_prompt(), format_caption() (+49 more)

 ### Community 3 - "Community 3"
+Cohesion: 0.05
+Nodes (59): _collect_output(), _log_step_done(), main(), pipeline.py — Core pipeline: CLI entrypoint + importable run_pipeline() for Grad, Print duration + separator line for a completed step., Collect all yields and the return value from the generator., Run the full translation pipeline, yielding progress messages. Args:, run_pipeline() (+51 more)

 ### Community 4 - "Community 4"
+Cohesion: 0.06
+Nodes (55): forward(), generate(), generate_speaker_prompt(), main(), _prefetch_chatterbox(), _prefetch_demucs(), _prefetch_faster_whisper(), Prefetch model weights into HF_HOME for faster cold starts on Spaces. (+47 more)

 ### Community 5 - "Community 5"
+Cohesion: 0.06
+Nodes (59): post(), _assign_words_to_segments(), _extract_words(), _get_faster_whisper_model(), _get_local_whisper_backend(), _get_openai_whisper_model(), _normalise_segments(), Step 3: Transcribe audio with timestamps. Primary local backend (device-depende (+51 more)

 ### Community 6 - "Community 6"
+Cohesion: 0.06
+Nodes (31): _audio_to_tuple(), _build_choices_and_map(), build_demo(), build_parser(), _collect_gen_kwargs(), _detect_model_kind(), _dtype_from_str(), main() (+23 more)

 ### Community 7 - "Community 7"
 Cohesion: 0.07
+Nodes (25): DistributedGroupResidualVectorQuantization, Efficient distributed group residual vector quantization implementation. Fol, dynamic_range_compression_torch(), MelSpectrogramFeatures, x: torch.Tensor, shape = (T, D) q: torch.Tensor, shape = (T, D), x : torch.Tensor, shape = (n_mels, n_ctx) the mel spectrogram of the, Calculate the BigVGAN style mel spectrogram of an input signal. Args:, spectral_normalize_torch() (+17 more)

 ### Community 8 - "Community 8"
+Cohesion: 0.05
+Nodes (49): FFmpeg concat list (synced TTS), Try-Now app panel, app.js script ref, Comparison table (HeyGen, Rask, ElevenLabs, Synthesia), Hero section + 23+ languages, Frontend index.html, Source/target language selectors, Pricing tiers (Free/Starter/Creator) (+41 more)
+
+### Community 9 - "Community 9"
 Cohesion: 0.09
 Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)

+### Community 10 - "Community 10"
 Cohesion: 0.1
 Nodes (14): default(), DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb(), preprocess() (+6 more)

+### Community 11 - "Community 11"
+Cohesion: 0.08
+Nodes (32): _apply_demucs(), _get_model(), _load_and_normalise(), Step 1b: Separate vocals from accompaniment using Demucs (Python API). In-proce, Lazy-load htdemucs once per process. Module-level semantics; we load on firs, GPU-bound inference call. `mix` shape: [1, channels, time]., Load WAV, resample/remix to match model requirements, z-normalise., Separate vocals from accompaniment using Demucs htdemucs (Python API). Args (+24 more)
+
+### Community 12 - "Community 12"
 Cohesion: 0.1
 Nodes (31): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., Translate the text of each segment into target_language in batches. Args:, translate(), _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client() (+23 more)

+### Community 13 - "Community 13"
 Cohesion: 0.12
+Nodes (27): build_for_job(), ensure_transcription(), extract_audio_hq(), extract_reference_audio(), get_audio_duration(), get_device(), load_chatterbox(), main() (+19 more)

+### Community 14 - "Community 14"
+Cohesion: 0.11
+Nodes (25): tools_api — Standalone endpoints for creator quick tools. Lives alongside the m, audio_cleanup_endpoint(), _ext_to_media_type(), APIRouter for /api/tools/* endpoints. Each endpoint is sync request-response (n, Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS., Manual reap trigger (mostly for testing). Auto-reap runs on a timer., Stream upload to disk, enforcing the tools size cap., _reap() (+17 more)

+### Community 15 - "Community 15"
+Cohesion: 0.12
+Nodes (23): build_t3_cond(), main(), prepare_sample(), prepare_sample.py — Turn one dataset.jsonl row into the exact tensors T3.loss(), Build the speaker conditioning (frozen during training)., MTLTokenizer + SOT/EOT padding (mirrors what generate() does internally)., S3Tokenizer on the target dubbed audio → speech tokens (the LABEL). Critica, Turn one dataset row into ready-to-train tensors. (+15 more)
+
+### Community 16 - "Community 16"
+Cohesion: 0.19
+Nodes (18): _burn_in(), _clamp(), _extract_audio(), _force_style_for(), _format_timestamp_srt(), _format_timestamp_vtt(), generate_subtitles(), _is_video() (+10 more)
+
+### Community 17 - "Community 17"
+Cohesion: 0.22
+Nodes (12): download_result(), _is_noise(), main(), Batch translate Instagram reels to English via the VideoVoice server API. Usage, Extract the Instagram reel shortcode from a URL, e.g. 'DWn_yPoDsYw'., Submit a single video URL and return the job_id., Return True if a log line is internal noise we don't want in the log., Poll job status until complete or error. Returns final messages and collected lo (+4 more)
+
+### Community 18 - "Community 18"
+Cohesion: 0.23
+Nodes (12): evaluate(), load_baseline(), load_with_lora(), main(), pick_held_out_samples(), print_summary(), eval.py — Evaluate the fine-tuned LoRA against the un-tuned baseline. Picks N s, Return overshoot samples (duration_diff > 0.2) — these are NOT in the asymme (+4 more)
+
+### Community 19 - "Community 19"
 Cohesion: 0.24
 Nodes (11): extract_creator(), _extract_instagram(), _extract_tiktok(), _extract_youtube(), _load_cache(), Extract original creator @username from video URLs., YouTube: visit video page, extract channel name from meta tags., Extract the @username of the original creator from the video URL. Uses Play (+3 more)

+### Community 20 - "Community 20"
 Cohesion: 0.27
 Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline. Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)

+### Community 21 - "Community 21"
 Cohesion: 0.33
 Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var

+### Community 23 - "Community 23"
 Cohesion: 1.0
 Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)

+### Community 31 - "Community 31"
 Cohesion: 1.0
 Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.

+### Community 32 - "Community 32"
 Cohesion: 1.0
 Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te

+### Community 33 - "Community 33"
 Cohesion: 1.0
 Nodes (1): Voice clone speech using the Base model. You can provide either:

+### Community 34 - "Community 34"
 Cohesion: 1.0
 Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct

+### Community 35 - "Community 35"
 Cohesion: 1.0
 Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option

+### Community 36 - "Community 36"
 Cohesion: 1.0
 Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.

+### Community 37 - "Community 37"
 Cohesion: 1.0
 Nodes (1): Reject oversized uploads before body parsing.

+### Community 38 - "Community 38"
 Cohesion: 1.0
 Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job

+### Community 39 - "Community 39"
 Cohesion: 1.0
 Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.

+### Community 40 - "Community 40"
 Cohesion: 1.0
 Nodes (1): Return curated showcase entries with resolved streaming URLs.

+### Community 41 - "Community 41"
 Cohesion: 1.0
 Nodes (1): Submit a video for translation.

+### Community 42 - "Community 42"
 Cohesion: 1.0
 Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.

+### Community 43 - "Community 43"
 Cohesion: 1.0
 Nodes (1): User selects a TTS model after previewing.

+### Community 44 - "Community 44"
 Cohesion: 1.0
 Nodes (1): Serve a preview audio WAV file.

+### Community 45 - "Community 45"
Cohesion: 1.0
|
| 257 |
Nodes (1): Download the translated video.
|
| 258 |
|
| 259 |
+
### Community 46 - "Community 46"
|
| 260 |
Cohesion: 1.0
|
| 261 |
Nodes (1): Create artifact directories and start background cleanup.
|
| 262 |
|
| 263 |
+
### Community 47 - "Community 47"
|
| 264 |
Cohesion: 1.0
|
| 265 |
Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.
|
| 266 |
|
| 267 |
+
### Community 48 - "Community 48"
|
| 268 |
Cohesion: 1.0
|
| 269 |
Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura
|
| 270 |
|
| 271 |
+
### Community 49 - "Community 49"
|
| 272 |
Cohesion: 1.0
|
| 273 |
Nodes (1): Insert extra silence distributed across detected pause points.
|
| 274 |
|
| 275 |
+
### Community 50 - "Community 50"
|
| 276 |
Cohesion: 1.0
|
| 277 |
Nodes (1): Generate a silent WAV file of given duration.
|
| 278 |
|
| 279 |
+
### Community 51 - "Community 51"
|
| 280 |
Cohesion: 1.0
|
| 281 |
Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single
|
| 282 |
|
| 283 |
+
### Community 52 - "Community 52"
|
| 284 |
Cohesion: 1.0
|
| 285 |
Nodes (1): Translate the text of each segment into target_language in batches. Args:
|
| 286 |
|
| 287 |
+
### Community 53 - "Community 53"
|
| 288 |
Cohesion: 1.0
|
| 289 |
Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope. ZeroGPU only int
|
| 290 |
|
| 291 |
+
### Community 54 - "Community 54"
|
| 292 |
Cohesion: 1.0
|
| 293 |
Nodes (1): Remove trailing noise/artifacts after speech ends.
|
| 294 |
|
| 295 |
+
### Community 55 - "Community 55"
|
| 296 |
Cohesion: 1.0
|
| 297 |
Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.
|
| 298 |
|
| 299 |
+
### Community 56 - "Community 56"
|
| 300 |
Cohesion: 1.0
|
| 301 |
Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.
|
| 302 |
|
| 303 |
+
### Community 57 - "Community 57"
|
| 304 |
Cohesion: 1.0
|
| 305 |
Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.
|
| 306 |
|
| 307 |
+
### Community 58 - "Community 58"
|
| 308 |
Cohesion: 1.0
|
| 309 |
Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated
|
| 310 |
|
| 311 |
+
### Community 59 - "Community 59"
|
| 312 |
Cohesion: 1.0
|
| 313 |
Nodes (1): Generate speech for all segments using OmniVoice voice cloning.
|
| 314 |
|
| 315 |
+
### Community 60 - "Community 60"
|
| 316 |
Cohesion: 1.0
|
| 317 |
Nodes (1): Synthesise translated text for each segment using voice cloned from reference au
|
| 318 |
|
| 319 |
+
### Community 61 - "Community 61"
|
| 320 |
Cohesion: 1.0
|
| 321 |
Nodes (1): torch==2.6.0
|
| 322 |
|
| 323 |
+
### Community 62 - "Community 62"
|
| 324 |
Cohesion: 1.0
|
| 325 |
Nodes (1): fastapi
|
| 326 |
|
| 327 |
+
### Community 63 - "Community 63"
|
| 328 |
Cohesion: 1.0
|
| 329 |
Nodes (1): yt-dlp
|
| 330 |
|
| 331 |
+
### Community 64 - "Community 64"
|
| 332 |
Cohesion: 1.0
|
| 333 |
Nodes (1): diffusers==0.29.0
|
| 334 |
|
| 335 |
+
### Community 65 - "Community 65"
|
| 336 |
Cohesion: 1.0
|
| 337 |
Nodes (1): ARTIFACTS_ROOT env
|
| 338 |
|
| 339 |
+
### Community 66 - "Community 66"
|
| 340 |
Cohesion: 1.0
|
| 341 |
Nodes (1): AWS g4dn.xlarge alternative
|
| 342 |
|
| 343 |
+
### Community 67 - "Community 67"
|
| 344 |
Cohesion: 1.0
|
| 345 |
Nodes (1): nodejs (system pkg)
|
| 346 |
|
| 347 |
+
### Community 68 - "Community 68"
|
| 348 |
Cohesion: 1.0
|
| 349 |
Nodes (1): fonts-noto-core / cjk
|
| 350 |
|
| 351 |
+
### Community 69 - "Community 69"
|
| 352 |
Cohesion: 1.0
|
| 353 |
Nodes (1): graphify project rules
|
| 354 |
|
| 355 |
## Knowledge Gaps
|
| 356 |
+
- **321 isolated node(s):** `server.py — FastAPI backend for VideoVoice. Endpoints: POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.`, `Read media duration from ffprobe.`, `Report CUDA/MPS availability.` (+316 more)
|
| 357 |
These have ≤1 connection - possible missing edges or undocumented components.
|
| 358 |
+
- **Thin community `Community 23`** (2 nodes): `gradio==6.8.0`, `gradio==6.12.0 (omni)`
|
| 359 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 360 |
+
- **Thin community `Community 31`** (1 nodes): `Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.`
|
| 361 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 362 |
+
- **Thin community `Community 32`** (1 nodes): `Build voice-clone prompt items from reference audio (and optionally reference te`
|
| 363 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 364 |
+
- **Thin community `Community 33`** (1 nodes): `Voice clone speech using the Base model. You can provide either:`
|
| 365 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 366 |
+
- **Thin community `Community 34`** (1 nodes): `Generate speech with the VoiceDesign model using natural-language style instruct`
|
| 367 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 368 |
+
- **Thin community `Community 35`** (1 nodes): `Generate speech with the CustomVoice model using a predefined speaker id, option`
|
| 369 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 370 |
+
- **Thin community `Community 36`** (1 nodes): `Delete stale per-job artifact directories from ARTIFACTS_ROOT.`
|
| 371 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 372 |
+
- **Thin community `Community 37`** (1 nodes): `Reject oversized uploads before body parsing.`
|
| 373 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 374 |
+
- **Thin community `Community 38`** (1 nodes): `Run the translation pipeline in a background thread, pushing progress to the job`
|
| 375 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 376 |
+
- **Thin community `Community 39`** (1 nodes): `List whitelisted MP4 demo videos from outputs/ and data/.`
|
| 377 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 378 |
+
- **Thin community `Community 40`** (1 nodes): `Return curated showcase entries with resolved streaming URLs.`
|
| 379 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 380 |
+
- **Thin community `Community 41`** (1 nodes): `Submit a video for translation.`
|
| 381 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 382 |
+
- **Thin community `Community 42`** (1 nodes): `Poll endpoint returning new messages since index `after`, plus live wait status.`
|
| 383 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 384 |
+
- **Thin community `Community 43`** (1 nodes): `User selects a TTS model after previewing.`
|
| 385 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 386 |
+
- **Thin community `Community 44`** (1 nodes): `Serve a preview audio WAV file.`
|
| 387 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 388 |
+
- **Thin community `Community 45`** (1 nodes): `Download the translated video.`
|
| 389 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 390 |
+
- **Thin community `Community 46`** (1 nodes): `Create artifact directories and start background cleanup.`
|
| 391 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 392 |
+
- **Thin community `Community 47`** (1 nodes): `Sync TTS audio using pause-aware strategy: compress silences first, then atempo.`
|
| 393 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 394 |
+
- **Thin community `Community 48`** (1 nodes): `Rewrite WAV with silence regions compressed to keep_ratio of their original dura`
|
| 395 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 396 |
+
- **Thin community `Community 49`** (1 nodes): `Insert extra silence distributed across detected pause points.`
|
| 397 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 398 |
+
- **Thin community `Community 50`** (1 nodes): `Generate a silent WAV file of given duration.`
|
| 399 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 400 |
+
- **Thin community `Community 51`** (1 nodes): `Sync each TTS segment to its original timestamp window and stitch into a single`
|
| 401 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 402 |
+
- **Thin community `Community 52`** (1 nodes): `Translate the text of each segment into target_language in batches. Args:`
|
| 403 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 404 |
+
- **Thin community `Community 53`** (1 nodes): `Load + run Chatterbox inside a single GPU-decorated scope. ZeroGPU only int`
|
| 405 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 406 |
+
- **Thin community `Community 54`** (1 nodes): `Remove trailing noise/artifacts after speech ends.`
|
| 407 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 408 |
+
- **Thin community `Community 55`** (1 nodes): `Hard-trim TTS output to orig_dur * headroom, with a short fade-out.`
|
| 409 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 410 |
+
- **Thin community `Community 56`** (1 nodes): `Clip audio to max_sec to prevent excessively slow voice cloning.`
|
| 411 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 412 |
+
- **Thin community `Community 57`** (1 nodes): `Numpy variant of _trim_trailing_noise for engines returning np.ndarray.`
|
| 413 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 414 |
+
- **Thin community `Community 58`** (1 nodes): `Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated`
|
| 415 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 416 |
+
- **Thin community `Community 59`** (1 nodes): `Generate speech for all segments using OmniVoice voice cloning.`
|
| 417 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 418 |
+
- **Thin community `Community 60`** (1 nodes): `Synthesise translated text for each segment using voice cloned from reference au`
|
| 419 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 420 |
+
- **Thin community `Community 61`** (1 nodes): `torch==2.6.0`
|
| 421 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 422 |
+
- **Thin community `Community 62`** (1 nodes): `fastapi`
|
| 423 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 424 |
+
- **Thin community `Community 63`** (1 nodes): `yt-dlp`
|
| 425 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 426 |
+
- **Thin community `Community 64`** (1 nodes): `diffusers==0.29.0`
|
| 427 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 428 |
+
- **Thin community `Community 65`** (1 nodes): `ARTIFACTS_ROOT env`
|
| 429 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 430 |
+
- **Thin community `Community 66`** (1 nodes): `AWS g4dn.xlarge alternative`
|
| 431 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 432 |
+
- **Thin community `Community 67`** (1 nodes): `nodejs (system pkg)`
|
| 433 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 434 |
+
- **Thin community `Community 68`** (1 nodes): `fonts-noto-core / cjk`
|
| 435 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 436 |
+
- **Thin community `Community 69`** (1 nodes): `graphify project rules`
|
| 437 |
Too small to be a meaningful cluster - may be noise or needs more connections extracted.
|
| 438 |
|
| 439 |
## Suggested Questions
|
| 440 |
_Questions this graph is uniquely positioned to answer:_
|
| 441 |
|
| 442 |
+
- **Why does `synthesise_segments()` connect `Community 4` to `Community 11`, `Community 3`?**
|
| 443 |
+
_High betweenness centrality (0.324) - this node is a cross-community bridge._
|
| 444 |
+
- **Why does `generate()` connect `Community 4` to `Community 0`, `Community 6`?**
|
| 445 |
+
_High betweenness centrality (0.209) - this node is a cross-community bridge._
|
| 446 |
- **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
|
| 447 |
_`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
|
| 448 |
- **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
|
| 452 |
- **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
|
| 453 |
_`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
|
| 454 |
- **What connects `server.py — FastAPI backend for VideoVoice. Endpoints: POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
|
| 455 |
+
_321 weakly-connected nodes found - possible documentation gaps or missing edges._
|
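The "Knowledge Gaps" section of the report flags nodes with at most one connection. As a rough illustration of that criterion (not graphify's actual implementation, which is not shown in this diff), the following sketch counts undirected degree from an edge list. The function name `weakly_connected` and the undirected treatment of edges are assumptions made for the example.

```python
from collections import Counter


def weakly_connected(nodes, edges, max_degree=1):
    """Return the nodes that have at most `max_degree` connections.

    Hypothetical helper: edges are treated as undirected pairs, which is
    an assumption, not something the report states.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Counter returns 0 for nodes that never appear in an edge.
    return [n for n in nodes if degree[n] <= max_degree]


if __name__ == "__main__":
    nodes = ["server.py", "yt-dlp", "fastapi", "torch==2.6.0"]
    edges = [("server.py", "yt-dlp"), ("server.py", "fastapi")]
    print(weakly_connected(nodes, edges))
```

Under this reading, a node listed in the report as "isolated" may still have a single edge; only nodes with two or more connections are excluded.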
graphify-out/graph.html
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|
server.py
CHANGED
|
@@ -75,7 +75,7 @@ ALLOWED_YTDLP_HOSTS = {
|
|
| 75 |
"tiktok.com",
|
| 76 |
"vm.tiktok.com",
|
| 77 |
}
|
| 78 |
-
PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp"}
|
| 79 |
REAPER_INTERVAL_SECONDS = 10 * 60
|
| 80 |
REAPER_MAX_AGE_SECONDS = 2 * 60 * 60
|
| 81 |
|
|
@@ -913,6 +913,10 @@ if __name__ == "__main__":
|
|
| 913 |
|
| 914 |
local_app.include_router(router)
|
| 915 |
|
| 916 |
# Serve the legacy static frontend at / so `python server.py` keeps the
|
| 917 |
# old dev UX (open http://localhost:8000 to hit frontend/index.html).
|
| 918 |
# The React SPA in production is deployed separately to S3.
|
|
|
|
| 75 |
"tiktok.com",
|
| 76 |
"vm.tiktok.com",
|
| 77 |
}
|
| 78 |
+
PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp", "tools"}
|
| 79 |
REAPER_INTERVAL_SECONDS = 10 * 60
|
| 80 |
REAPER_MAX_AGE_SECONDS = 2 * 60 * 60
|
| 81 |
|
|
|
|
| 913 |
|
| 914 |
local_app.include_router(router)
|
| 915 |
|
| 916 |
+
# Tools API — independent of pipeline; safe to include here too.
|
| 917 |
+
from tools_api import router as tools_router
|
| 918 |
+
local_app.include_router(tools_router)
|
| 919 |
+
|
| 920 |
# Serve the legacy static frontend at / so `python server.py` keeps the
|
| 921 |
# old dev UX (open http://localhost:8000 to hit frontend/index.html).
|
| 922 |
# The React SPA in production is deployed separately to S3.
|
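The `server.py` hunk above adds `"tools"` to `PERSISTENT_ARTIFACT_DIRS`, next to the reaper interval and max-age constants. The reaper itself is not part of this diff, so the sketch below only illustrates one plausible way such constants could be used: directories named in the persistent set are skipped, and anything else older than the max age is removed. The function `reap_stale_dirs` and the use of modification time as the age signal are assumptions.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

# Values copied from the diff; how server.py consumes them is assumed here.
PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp", "tools"}
REAPER_MAX_AGE_SECONDS = 2 * 60 * 60


def reap_stale_dirs(root: Path, now: Optional[float] = None) -> List[str]:
    """Delete stale top-level directories under `root`.

    Hypothetical sketch: persistent directories are never removed, and a
    directory counts as stale when its mtime is older than the max age.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir() or entry.name in PERSISTENT_ARTIFACT_DIRS:
            continue
        if now - entry.stat().st_mtime > REAPER_MAX_AGE_SECONDS:
            shutil.rmtree(entry, ignore_errors=True)
            removed.append(entry.name)
    return removed
```

If the real reaper works along these lines, adding `"tools"` to the set would keep the `tools/` parent directory itself from being deleted, while per-run cleanup would be left to `reap_old_runs` in `tools_api/storage.py`.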
tools_api/__init__.py
ADDED
|
@@ -0,0 +1,17 @@
| 1 |
+
"""
|
| 2 |
+
tools_api — Standalone endpoints for creator quick tools.
|
| 3 |
+
|
| 4 |
+
Lives alongside the main pipeline (server.py) but stays decoupled:
|
| 5 |
+
- No shared job state, no SSE, no GPU semaphore.
|
| 6 |
+
- Reuses step modules as libraries only (no edits to steps/).
|
| 7 |
+
- Artifacts written under ARTIFACTS_ROOT/tools/<run_id>/.
|
| 8 |
+
|
| 9 |
+
Endpoints (mounted by router.router):
|
| 10 |
+
POST /api/tools/subtitles — captions (sidecar or burn-in MP4)
|
| 11 |
+
POST /api/tools/voice-clone — single-segment TTS with voice clone
|
| 12 |
+
POST /api/tools/audio-cleanup — Demucs source separation
|
| 13 |
+
GET /api/tools/file/{run}/{f} — download generated artifact
|
| 14 |
+
"""
|
| 15 |
+
from .router import router
|
| 16 |
+
|
| 17 |
+
__all__ = ["router"]
|
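The module docstring above says artifacts are written under `ARTIFACTS_ROOT/tools/<run_id>/` and served from `GET /api/tools/file/{run}/{f}`. Because both path segments come from the request, a download handler would normally confirm that the resolved path stays inside the run directory. The real logic lives in `tools_api/storage.py`, which is not shown in this section. The sketch below is a hypothetical stand-in, and `resolve_artifact` is an invented name.

```python
from pathlib import Path


def resolve_artifact(tools_root: Path, run_id: str, filename: str) -> Path:
    """Map (run_id, filename) to a file directly inside tools_root/run_id.

    Hypothetical sketch: raises ValueError when either segment would
    escape its expected parent directory (for example via "..").
    """
    base = tools_root.resolve()
    run_root = (base / run_id).resolve()
    if run_root.parent != base:
        raise ValueError("run_id escapes the tools directory")
    candidate = (run_root / filename).resolve()
    if candidate.parent != run_root:
        raise ValueError("filename escapes the run directory")
    return candidate
```

Comparing resolved parents, rather than checking for the substring `..`, also covers absolute paths and symlinked segments, at the cost of only allowing files that sit directly in the run directory.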
tools_api/audio_cleanup.py
ADDED
|
@@ -0,0 +1,136 @@
| 1 |
+
"""
|
| 2 |
+
Audio source separation tool — three modes via Demucs.
|
| 3 |
+
|
| 4 |
+
Reuses internals from steps.s1b_separate (model loader, device picker, normaliser,
|
| 5 |
+
GPU-decorated apply). The existing separate_audio() returns only (vocals, accompaniment),
|
| 6 |
+
so we replicate its flow here and keep all four stems addressable.
|
| 7 |
+
"""
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
+
import subprocess
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
from typing import Literal
|
| 13 |
+
|
| 14 |
+
import torch
|
| 15 |
+
import torchaudio
|
| 16 |
+
|
| 17 |
+
# Reuse internals — no edits to s1b_separate.py.
|
| 18 |
+
from steps.s1b_separate import (
|
| 19 |
+
_apply_demucs,
|
| 20 |
+
_get_model,
|
| 21 |
+
_load_and_normalise,
|
| 22 |
+
_select_device,
|
| 23 |
+
)
|
| 24 |
+
|
| 25 |
+
Mode = Literal["vocals-only", "instrumental-only", "stems"]
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def _ensure_audio(input_path: Path, out_dir: Path) -> Path:
|
| 29 |
+
"""Convert input to a stable WAV format if it's a video or non-WAV audio."""
|
| 30 |
+
if input_path.suffix.lower() == ".wav":
|
| 31 |
+
return input_path
|
| 32 |
+
out = out_dir / "input.wav"
|
| 33 |
+
cmd = [
|
| 34 |
+
"ffmpeg", "-y", "-i", str(input_path),
|
| 35 |
+
"-vn", "-ac", "2", "-ar", "44100",
|
| 36 |
+
"-acodec", "pcm_s16le",
|
| 37 |
+
str(out),
|
| 38 |
+
]
|
| 39 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
|
| 40 |
+
if result.returncode != 0:
|
| 41 |
+
raise RuntimeError(f"ffmpeg input prep failed: {result.stderr[-300:]}")
|
| 42 |
+
return out
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def _separate_all_stems(audio_path: Path, out_dir: Path) -> dict[str, Path]:
|
| 46 |
+
"""Return {stem_name: wav_path} for every demucs source."""
|
| 47 |
+
model = _get_model()
|
| 48 |
+
device = _select_device()
|
| 49 |
+
target_sr = model.samplerate
|
| 50 |
+
target_ch = model.audio_channels
|
| 51 |
+
source_names = list(model.sources) # ["drums", "bass", "other", "vocals"]
|
| 52 |
+
|
| 53 |
+
mix, mean, std = _load_and_normalise(str(audio_path), target_sr, target_ch)
|
| 54 |
+
sources = _apply_demucs(mix, device)
|
| 55 |
+
sources = sources * std + mean
|
| 56 |
+
sources = sources[0] # [num_sources, channels, time]
|
| 57 |
+
|
| 58 |
+
stems: dict[str, Path] = {}
|
| 59 |
+
for idx, name in enumerate(source_names):
|
| 60 |
+
wav_path = out_dir / f"{name}.wav"
|
| 61 |
+
torchaudio.save(str(wav_path), sources[idx], target_sr)
|
| 62 |
+
stems[name] = wav_path
|
| 63 |
+
return stems
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def _sum_to_wav(stems: list[Path], dest: Path, sample_rate: int = 44100) -> Path:
|
| 67 |
+
"""Sum N stem WAVs into one — used to build the instrumental track."""
|
| 68 |
+
mix: torch.Tensor | None = None
|
| 69 |
+
sr_used = sample_rate
|
| 70 |
+
for path in stems:
|
| 71 |
+
wav, sr = torchaudio.load(str(path))
|
| 72 |
+
sr_used = sr
|
| 73 |
+
mix = wav if mix is None else mix + wav
|
| 74 |
+
if mix is None:
|
| 75 |
+
raise RuntimeError("No stems to sum.")
|
| 76 |
+
torchaudio.save(str(dest), mix, sr_used)
|
| 77 |
+
return dest
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
def separate(
|
| 81 |
+
*,
|
| 82 |
+
input_path: Path,
|
| 83 |
+
out_dir: Path,
|
| 84 |
+
mode: Mode,
|
| 85 |
+
) -> list[dict]:
|
| 86 |
+
"""
|
| 87 |
+
Run separation. Returns a list of output descriptors:
|
| 88 |
+
[{"name": "vocals.wav", "label": "Vocals", "filename": "vocals.wav"}, ...]
|
| 89 |
+
"""
|
| 90 |
+
audio_in = _ensure_audio(input_path, out_dir)
|
| 91 |
+
stems = _separate_all_stems(audio_in, out_dir)
|
| 92 |
+
|
| 93 |
+
if mode == "vocals-only":
|
| 94 |
+
return [{
|
| 95 |
+
"name": "vocals",
|
| 96 |
+
"label": "Vocals",
|
| 97 |
+
"filename": stems["vocals"].name,
|
| 98 |
+
"sub": "Dialogue track",
|
| 99 |
+
}]
|
| 100 |
+
|
| 101 |
+
if mode == "instrumental-only":
|
| 102 |
+
non_vocal_stems = [stems[n] for n in stems if n != "vocals"]
|
| 103 |
+
out = _sum_to_wav(non_vocal_stems, out_dir / "instrumental.wav")
|
| 104 |
+
# Cleanup intermediate stem files we won't expose
|
| 105 |
+
for path in stems.values():
|
| 106 |
+
try:
|
| 107 |
+
path.unlink()
|
| 108 |
+
except OSError:
|
| 109 |
+
pass
|
| 110 |
+
return [{
|
| 111 |
+
"name": "instrumental",
|
| 112 |
+
"label": "Instrumental",
|
| 113 |
+
"filename": out.name,
|
| 114 |
+
"sub": "Music + ambient (vocals removed)",
|
| 115 |
+
}]
|
| 116 |
+
|
| 117 |
+
# stems mode — return all four
|
| 118 |
+
label_map = {
|
| 119 |
+
"vocals": ("Vocals", "Dialogue track"),
|
| 120 |
+
"drums": ("Drums", "Percussion"),
|
| 121 |
+
"bass": ("Bass", "Low frequency"),
|
| 122 |
+
"other": ("Other", "Melodic / ambient"),
|
| 123 |
+
}
|
| 124 |
+
results: list[dict] = []
|
| 125 |
+
# Stable order: vocals first, then drums, bass, other
|
| 126 |
+
for stem_key in ("vocals", "drums", "bass", "other"):
|
| 127 |
+
if stem_key not in stems:
|
| 128 |
+
continue
|
| 129 |
+
label, sub = label_map[stem_key]
|
| 130 |
+
results.append({
|
| 131 |
+
"name": stem_key,
|
| 132 |
+
"label": label,
|
| 133 |
+
"filename": stems[stem_key].name,
|
| 134 |
+
"sub": sub,
|
| 135 |
+
})
|
| 136 |
+
return results
|
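In `instrumental-only` mode, `separate()` above builds the output by summing every stem except `vocals`. The arithmetic can be shown without `torch` or `torchaudio`: the sketch below does the same sample-wise sum over plain Python lists. The name `build_instrumental` and the explicit equal-length check are additions for illustration and are not part of the committed module.

```python
def build_instrumental(stems, exclude="vocals"):
    """Sum every stem except `exclude`, sample by sample.

    Hypothetical, dependency-free stand-in for the tensor addition in
    _sum_to_wav(); `stems` maps a stem name to a list of float samples.
    """
    kept = [samples for name, samples in stems.items() if name != exclude]
    if not kept:
        raise ValueError("No stems to sum.")
    length = len(kept[0])
    if any(len(samples) != length for samples in kept):
        raise ValueError("Stems must have equal length.")
    return [sum(values) for values in zip(*kept)]


if __name__ == "__main__":
    demo = {
        "drums": [1.0, 2.0],
        "bass": [0.5, 0.5],
        "other": [0.0, 1.0],
        "vocals": [9.0, 9.0],
    }
    print(build_instrumental(demo))
```

Since all stems come from one Demucs pass over the same mix, they should share length and sample rate; the length check simply makes that assumption explicit.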
tools_api/router.py
ADDED
|
@@ -0,0 +1,248 @@
| 1 |
+
"""
|
| 2 |
+
APIRouter for /api/tools/* endpoints.
|
| 3 |
+
|
| 4 |
+
Each endpoint is sync request-response (no SSE, no job state). Input files
|
| 5 |
+
land in a fresh per-run directory, outputs are returned as a download URL
|
| 6 |
+
to GET /api/tools/file/{run_id}/{filename}.
|
| 7 |
+
"""
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
+
import asyncio
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
from typing import Optional
|
| 13 |
+
|
| 14 |
+
from fastapi import APIRouter, File, Form, HTTPException, Request, UploadFile
|
| 15 |
+
from fastapi.responses import FileResponse, JSONResponse, PlainTextResponse
|
| 16 |
+
|
| 17 |
+
from server import limiter, _download_url, _is_allowed_video_host
|
| 18 |
+
|
| 19 |
+
from . import audio_cleanup, subtitles, voice_clone
|
| 20 |
+
from .storage import (
|
| 21 |
+
file_url,
|
| 22 |
+
new_run_dir,
|
| 23 |
+
reap_old_runs,
|
| 24 |
+
run_dir,
|
| 25 |
+
safe_filename,
|
| 26 |
+
)
|
| 27 |
+
|
| 28 |
+
router = APIRouter(prefix="/api/tools", tags=["tools"])
|
| 29 |
+
|
| 30 |
+
# Per-tool body size cap (separate from pipeline's MAX_UPLOAD_BYTES check).
|
| 31 |
+
TOOLS_MAX_BYTES = 50 * 1024 * 1024 # 50 MB
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
# ── Helpers ──────────────────────────────────────────────────────────
|
| 35 |
+
|
| 36 |
+
async def _save_upload(file: UploadFile, dest_dir: Path, default_name: str) -> Path:
|
| 37 |
+
"""Stream upload to disk, enforcing the tools size cap."""
|
| 38 |
+
dest = dest_dir / safe_filename(file.filename, default_name)
|
| 39 |
+
written = 0
|
| 40 |
+
with open(dest, "wb") as fh:
|
| 41 |
+
while chunk := await file.read(1024 * 1024):
|
| 42 |
+
written += len(chunk)
|
| 43 |
+
if written > TOOLS_MAX_BYTES:
|
| 44 |
+
fh.close()
|
| 45 |
+
dest.unlink(missing_ok=True)
|
| 46 |
+
raise HTTPException(413, f"File too large (max {TOOLS_MAX_BYTES // (1024*1024)} MB).")
|
| 47 |
+
fh.write(chunk)
|
| 48 |
+
return dest
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def _ext_to_media_type(filename: str) -> str:
|
| 52 |
+
ext = Path(filename).suffix.lower()
|
| 53 |
+
return {
|
| 54 |
+
".mp4": "video/mp4",
|
| 55 |
+
".mov": "video/quicktime",
|
| 56 |
+
".webm": "video/webm",
|
| 57 |
+
".mp3": "audio/mpeg",
|
| 58 |
+
".wav": "audio/wav",
|
| 59 |
+
".srt": "application/x-subrip",
|
| 60 |
+
".vtt": "text/vtt",
|
| 61 |
+
".txt": "text/plain",
|
| 62 |
+
}.get(ext, "application/octet-stream")
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
# ── Subtitles ────────────────────────────────────────────────────────
|
| 66 |
+
|
| 67 |
+
@router.post("/subtitles")
|
| 68 |
+
@limiter.limit("10/hour")
|
| 69 |
+
async def subtitles_endpoint(
|
| 70 |
+
request: Request,
|
| 71 |
+
file: Optional[UploadFile] = File(None),
|
| 72 |
+
url: Optional[str] = Form(None),
|
| 73 |
+
source_lang: str = Form("Auto-detect"),
|
| 74 |
+
target_lang: str = Form("Same as source"),
|
| 75 |
+
fmt: str = Form("srt"),
|
| 76 |
+
style: str = Form("tiktok"),
|
| 77 |
+
position: str = Form("bottom"),
|
| 78 |
+
h_align: str = Form("center"),
|
| 79 |
+
font_size: Optional[int] = Form(None),
|
| 80 |
+
margin_v: Optional[int] = Form(None),
|
| 81 |
+
):
|
| 82 |
+
if fmt not in ("srt", "vtt", "txt", "mp4"):
|
| 83 |
+
raise HTTPException(400, "fmt must be one of: srt, vtt, txt, mp4")
|
| 84 |
+
if style not in ("tiktok", "youtube", "minimal"):
|
| 85 |
+
raise HTTPException(400, "style must be one of: tiktok, youtube, minimal")
|
| 86 |
+
if position not in ("top", "middle", "bottom"):
|
| 87 |
+
raise HTTPException(400, "position must be one of: top, middle, bottom")
|
| 88 |
+
if h_align not in ("left", "center", "right"):
|
| 89 |
+
raise HTTPException(400, "h_align must be one of: left, center, right")
|
| 90 |
+
|
| 91 |
+
url = (url or "").strip()
|
| 92 |
+
if not file and not url:
|
| 93 |
+
raise HTTPException(400, "Provide either a file upload or a video URL.")
|
| 94 |
+
if file and url:
|
| 95 |
+
raise HTTPException(400, "Send a file OR a URL, not both.")
|
| 96 |
+
|
| 97 |
+
run_id, dest_dir = new_run_dir()
|
| 98 |
+
if file:
|
| 99 |
+
input_path = await _save_upload(file, dest_dir, "input.mp4")
|
| 100 |
+
else:
|
| 101 |
+
if not _is_allowed_video_host(url):
|
| 102 |
+
raise HTTPException(400, "URL host not supported. Use TikTok, YouTube, or Instagram.")
|
| 103 |
+
input_path = Path(dest_dir) / "input.mp4"
|
| 104 |
+
try:
|
| 105 |
+
await asyncio.to_thread(_download_url, url, str(input_path))
|
| 106 |
+
except Exception as e: # noqa: BLE001
|
| 107 |
+
raise HTTPException(400, f"Couldn't fetch the video URL: {e}")
|
| 108 |
+
|
| 109 |
+
try:
|
| 110 |
+
# Heavy: transcribe + (optional) translate + (optional) ffmpeg burn-in.
|
| 111 |
+
# Run off the event loop so concurrent requests don't starve.
|
| 112 |
+
info = await asyncio.to_thread(
|
| 113 |
+
subtitles.generate_subtitles,
|
| 114 |
+
input_path=input_path,
|
| 115 |
+
out_dir=dest_dir,
|
| 116 |
+
source_lang_name=source_lang,
|
| 117 |
+
target_lang_name=target_lang,
|
| 118 |
+
fmt=fmt, # type: ignore[arg-type]
|
| 119 |
+
style=style, # type: ignore[arg-type]
|
| 120 |
+
position=position, # type: ignore[arg-type]
|
| 121 |
+
h_align=h_align, # type: ignore[arg-type]
|
| 122 |
+
font_size=font_size,
|
| 123 |
+
margin_v=margin_v,
|
| 124 |
+
)
|
| 125 |
+
except ValueError as e:
|
| 126 |
+
raise HTTPException(400, str(e))
|
| 127 |
+
except Exception as e: # noqa: BLE001
|
| 128 |
+
raise HTTPException(500, f"Subtitle generation failed: {e}")
|
| 129 |
+
|
| 130 |
+
return JSONResponse({
|
| 131 |
+
"run_id": run_id,
|
| 132 |
+
"format": info["format"],
|
| 133 |
+
"filename": info["filename"],
|
| 134 |
+
"url": file_url(run_id, info["filename"]),
|
| 135 |
+
"segments": info["segments"],
|
| 136 |
+
"translated": info["translated"],
|
| 137 |
+
})
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
# ── Voice clone ──────────────────────────────────────────────────────
|
| 141 |
+
|
| 142 |
+
@router.post("/voice-clone")
|
| 143 |
+
@limiter.limit("10/hour")
|
| 144 |
+
async def voice_clone_endpoint(
|
| 145 |
+
request: Request,
|
| 146 |
+
sample: UploadFile = File(...),
|
| 147 |
+
text: str = Form(...),
|
| 148 |
+
language_id: str = Form("en"),
|
| 149 |
+
):
|
| 150 |
+
text = (text or "").strip()
|
| 151 |
+
if not text:
|
| 152 |
+
raise HTTPException(400, "text is required")
|
| 153 |
+
if len(text) > 1000:
|
| 154 |
+
raise HTTPException(400, "text exceeds 1000 char limit")
|
| 155 |
+
|
| 156 |
+
run_id, dest_dir = new_run_dir()
|
| 157 |
+
sample_path = await _save_upload(sample, dest_dir, "sample.wav")
|
| 158 |
+
|
| 159 |
+
try:
|
| 160 |
+
info = await asyncio.to_thread(
|
| 161 |
+
voice_clone.clone_voice,
|
| 162 |
+
sample_path=sample_path,
|
| 163 |
+
text=text,
|
| 164 |
+
out_dir=dest_dir,
|
| 165 |
+
language_id=language_id,
|
| 166 |
+
)
|
| 167 |
+
except ValueError as e:
|
| 168 |
+
raise HTTPException(400, str(e))
|
| 169 |
+
except Exception as e: # noqa: BLE001
|
| 170 |
+
raise HTTPException(500, f"Voice clone failed: {e}")
|
| 171 |
+
|
| 172 |
+
return JSONResponse({
|
| 173 |
+
"run_id": run_id,
|
| 174 |
+
"engine": info["engine"],
|
| 175 |
+
"chunks": info["chunks"],
|
| 176 |
+
"filename": info["filename"],
|
| 177 |
+
"url": file_url(run_id, info["filename"]),
|
| 178 |
+
})
|
| 179 |
+
|
| 180 |
+
|
| 181 |
+
# ── Audio cleanup ────────────────────────────────────────────────────
|
| 182 |
+
|
| 183 |
+
@router.post("/audio-cleanup")
|
| 184 |
+
@limiter.limit("10/hour")
|
| 185 |
+
async def audio_cleanup_endpoint(
|
| 186 |
+
request: Request,
|
| 187 |
+
file: UploadFile = File(...),
|
| 188 |
+
mode: str = Form("vocals-only"),
|
| 189 |
+
):
|
| 190 |
+
if mode not in ("vocals-only", "instrumental-only", "stems"):
|
| 191 |
+
raise HTTPException(400, "mode must be one of: vocals-only, instrumental-only, stems")
|
| 192 |
+
|
| 193 |
+
run_id, dest_dir = new_run_dir()
|
| 194 |
+
input_path = await _save_upload(file, dest_dir, "input.wav")
|
| 195 |
+
|
| 196 |
+
try:
|
| 197 |
+
stems = await asyncio.to_thread(
|
| 198 |
+
audio_cleanup.separate,
|
| 199 |
+
input_path=input_path,
|
| 200 |
+
out_dir=dest_dir,
|
| 201 |
+
mode=mode, # type: ignore[arg-type]
|
| 202 |
+
)
|
| 203 |
+
except ValueError as e:
|
| 204 |
+
raise HTTPException(400, str(e))
|
| 205 |
+
except Exception as e: # noqa: BLE001
|
| 206 |
+
raise HTTPException(500, f"Audio separation failed: {e}")
|
| 207 |
+
|
| 208 |
+
return JSONResponse({
|
| 209 |
+
"run_id": run_id,
|
| 210 |
+
"mode": mode,
|
| 211 |
+
"stems": [
|
| 212 |
+
{**stem, "url": file_url(run_id, stem["filename"])}
|
| 213 |
+
for stem in stems
|
| 214 |
+
],
|
| 215 |
+
})
|
| 216 |
+
|
| 217 |
+
|
| 218 |
+
# ── File download ────────────────────────────────────────────────────
|
| 219 |
+
|
| 220 |
+
@router.get("/file/{run_id}/{filename}")
|
| 221 |
+
async def tools_file(run_id: str, filename: str):
|
| 222 |
+
"""Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS."""
|
| 223 |
+
safe_name = safe_filename(filename)
|
| 224 |
+
if safe_name != filename:
|
| 225 |
+
raise HTTPException(400, "Invalid filename")
|
| 226 |
+
|
| 227 |
+
base = run_dir(run_id)
|
| 228 |
+
if base is None:
|
| 229 |
+
raise HTTPException(404, "Run not found or expired")
|
| 230 |
+
|
| 231 |
+
target = base / safe_name
|
| 232 |
+
if not target.exists() or not target.is_file():
|
| 233 |
+
raise HTTPException(404, "File not found")
|
| 234 |
+
|
| 235 |
+
return FileResponse(
|
| 236 |
+
path=str(target),
|
| 237 |
+
media_type=_ext_to_media_type(safe_name),
|
| 238 |
+
filename=safe_name,
|
| 239 |
+
)
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
# ── Cleanup hook ─────────────────────────────────────────────────────
|
| 243 |
+
|
| 244 |
+
@router.post("/_internal/reap")
|
| 245 |
+
async def _reap():
|
| 246 |
+
"""Manual reap trigger (mostly for testing). Auto-reap runs on a timer."""
|
| 247 |
+
removed = await asyncio.to_thread(reap_old_runs)
|
| 248 |
+
return {"removed": removed}
|
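The download route above rejects any filename that changes when passed through `safe_filename`. A minimal sketch of that round-trip check follows. It assumes a local copy of the helper, uses `PurePosixPath` rather than `Path` so the behaviour is the same on every OS, and introduces `is_servable` as a hypothetical name for illustration only.

```python
from pathlib import PurePosixPath


def safe_filename(name: str, fallback: str = "file") -> str:
    """Illustrative copy of the helper: keep only the final path component."""
    if not name:
        return fallback
    return PurePosixPath(name).name or fallback


def is_servable(filename: str) -> bool:
    """Hypothetical wrapper: accept a name only if sanitising leaves it unchanged."""
    return safe_filename(filename) == filename


if __name__ == "__main__":
    for candidate in ["captions.srt", "../captions.srt", "sub/voice.mp3", ""]:
        print(repr(candidate), is_servable(candidate))
```

Comparing the sanitised name with the raw one means a traversal attempt is refused outright instead of being silently rewritten to a different file.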
tools_api/storage.py
ADDED
|
@@ -0,0 +1,73 @@
|
| 1 |
+
"""
|
| 2 |
+
Per-run temp storage for tools_api.
|
| 3 |
+
|
| 4 |
+
Each tool request creates a fresh dir under ARTIFACTS_ROOT/tools/<run_id>/.
|
| 5 |
+
Files are reaped after TTL by reap_old_runs(). Kept independent of the main
|
| 6 |
+
job-tracker so a tool failure can't corrupt or block pipeline state.
|
| 7 |
+
"""
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
+
import shutil
|
| 11 |
+
import time
|
| 12 |
+
import uuid
|
| 13 |
+
from pathlib import Path
|
| 14 |
+
from typing import Optional
|
| 15 |
+
|
| 16 |
+
# Pull ARTIFACTS_ROOT from server.py without importing the heavy modules
|
| 17 |
+
# (server.py imports torch/whisper/etc. at top level — we already loaded it
|
| 18 |
+
# at app startup, so this is just a name lookup).
|
| 19 |
+
from server import ARTIFACTS_ROOT
|
| 20 |
+
|
| 21 |
+
TOOLS_ROOT = ARTIFACTS_ROOT / "tools"
|
| 22 |
+
TOOLS_ROOT.mkdir(parents=True, exist_ok=True)
|
| 23 |
+
|
| 24 |
+
# Tool runs are reaped 1h after creation (shorter than pipeline jobs since
|
| 25 |
+
# users typically download immediately).
|
| 26 |
+
RUN_TTL_SECONDS = 60 * 60
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def new_run_dir() -> tuple[str, Path]:
|
| 30 |
+
"""Allocate a fresh per-request directory. Returns (run_id, path)."""
|
| 31 |
+
run_id = uuid.uuid4().hex[:16]
|
| 32 |
+
path = TOOLS_ROOT / run_id
|
| 33 |
+
path.mkdir(parents=True, exist_ok=True)
|
| 34 |
+
return run_id, path
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def run_dir(run_id: str) -> Optional[Path]:
|
| 38 |
+
"""Resolve a run_id to its directory, or None if missing/invalid."""
|
| 39 |
+
if not run_id or "/" in run_id or ".." in run_id:
|
| 40 |
+
return None
|
| 41 |
+
candidate = TOOLS_ROOT / run_id
|
| 42 |
+
if not candidate.exists() or not candidate.is_dir():
|
| 43 |
+
return None
|
| 44 |
+
return candidate
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def file_url(run_id: str, filename: str) -> str:
|
| 48 |
+
"""Construct the public download URL for an artifact."""
|
| 49 |
+
return f"/api/tools/file/{run_id}/{filename}"
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def safe_filename(name: str, fallback: str = "file") -> str:
|
| 53 |
+
"""Strip path separators and dangerous chars from a user-supplied name."""
|
| 54 |
+
if not name:
|
| 55 |
+
return fallback
|
| 56 |
+
base = Path(name).name
|
| 57 |
+
return base or fallback
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def reap_old_runs() -> int:
|
| 61 |
+
"""Delete tool run dirs older than RUN_TTL_SECONDS. Returns count removed."""
|
| 62 |
+
if not TOOLS_ROOT.exists():
|
| 63 |
+
return 0
|
| 64 |
+
cutoff = time.time() - RUN_TTL_SECONDS
|
| 65 |
+
removed = 0
|
| 66 |
+
for child in TOOLS_ROOT.iterdir():
|
| 67 |
+
try:
|
| 68 |
+
if child.is_dir() and child.stat().st_mtime < cutoff:
|
| 69 |
+
shutil.rmtree(child, ignore_errors=True)
|
| 70 |
+
removed += 1
|
| 71 |
+
except OSError:
|
| 72 |
+
continue
|
| 73 |
+
return removed
|
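The TTL reaping in `storage.py` can be exercised in isolation. The sketch below is an assumption-laden stand-in, not the module itself: `reap_older_than` and `demo` are hypothetical names, the root is a throwaway temp directory, and one directory is back-dated with `os.utime` to simulate an expired run.

```python
from __future__ import annotations

import os
import shutil
import tempfile
import time
from pathlib import Path


def reap_older_than(root: Path, ttl_seconds: float) -> int:
    """Delete child directories whose mtime is older than the TTL; return the count."""
    cutoff = time.time() - ttl_seconds
    removed = 0
    for child in root.iterdir():
        try:
            if child.is_dir() and child.stat().st_mtime < cutoff:
                shutil.rmtree(child, ignore_errors=True)
                removed += 1
        except OSError:
            # The directory may vanish between iterdir() and stat(); skip it.
            continue
    return removed


def demo() -> tuple[int, list[str]]:
    """Create one fresh and one back-dated run dir, then reap with a 1 h TTL."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fresh = root / "fresh"
        stale = root / "stale"
        fresh.mkdir()
        stale.mkdir()
        two_hours_ago = time.time() - 2 * 3600
        os.utime(stale, (two_hours_ago, two_hours_ago))
        removed = reap_older_than(root, ttl_seconds=3600)
        return removed, sorted(p.name for p in root.iterdir())


if __name__ == "__main__":
    print(demo())
```

Note that a directory's mtime is refreshed whenever an entry is created or removed inside it, so an mtime-based TTL counts from the last write into the run directory and not strictly from its creation.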
tools_api/subtitles.py
ADDED
|
@@ -0,0 +1,288 @@
|
| 1 |
+
"""
|
| 2 |
+
Subtitle generation: sidecar files (.srt/.vtt/.txt) and burn-in MP4.
|
| 3 |
+
|
| 4 |
+
Reuses steps.s2_transcribe.transcribe and steps.s3_translate.translate as
|
| 5 |
+
libraries. ffmpeg burn-in goes through subprocess (matches existing s5_sync
|
| 6 |
+
pattern but without sharing code, since the styling needs are different).
|
| 7 |
+
"""
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
+
import subprocess
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
from typing import Literal
|
| 13 |
+
|
| 14 |
+
from steps.s2_transcribe import transcribe
|
| 15 |
+
from steps.s3_translate import translate
|
| 16 |
+
|
| 17 |
+
Format = Literal["srt", "vtt", "txt", "mp4"]
|
| 18 |
+
CaptionStyle = Literal["tiktok", "youtube", "minimal"]
|
| 19 |
+
Position = Literal["top", "middle", "bottom"]
|
| 20 |
+
HAlign = Literal["left", "center", "right"]
|
| 21 |
+
|
| 22 |
+
# Bounds for user-adjustable knobs. Backend clamps to these regardless of
|
| 23 |
+
# what the client sends.
|
| 24 |
+
FONT_SIZE_MIN = 12
|
| 25 |
+
FONT_SIZE_MAX = 40
|
| 26 |
+
MARGIN_V_MIN = 0
|
| 27 |
+
MARGIN_V_MAX = 240
|
| 28 |
+
|
| 29 |
+
# ISO-style short codes Whisper accepts. Names map to UI dropdown labels.
|
| 30 |
+
_LANG_CODE = {
|
| 31 |
+
"Auto-detect": "auto",
|
| 32 |
+
"English": "en", "Spanish": "es", "French": "fr", "German": "de",
|
| 33 |
+
"Portuguese": "pt", "Italian": "it", "Hindi": "hi", "Arabic": "ar",
|
| 34 |
+
"Chinese": "zh", "Japanese": "ja", "Korean": "ko", "Russian": "ru",
|
| 35 |
+
}
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def _is_video(path: Path) -> bool:
|
| 39 |
+
return path.suffix.lower() in {".mp4", ".mov", ".webm", ".mkv", ".avi", ".m4v"}
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def _extract_audio(input_path: Path, out_dir: Path) -> Path:
|
| 43 |
+
"""Pull a 16kHz mono WAV from the input — what whisper expects."""
|
| 44 |
+
audio_path = out_dir / "audio.wav"
|
| 45 |
+
cmd = [
|
| 46 |
+
"ffmpeg", "-y", "-i", str(input_path),
|
| 47 |
+
"-vn", "-ac", "1", "-ar", "16000",
|
| 48 |
+
"-acodec", "pcm_s16le",
|
| 49 |
+
str(audio_path),
|
| 50 |
+
]
|
| 51 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
|
| 52 |
+
if result.returncode != 0:
|
| 53 |
+
raise RuntimeError(f"ffmpeg audio extract failed: {result.stderr[-300:]}")
|
| 54 |
+
return audio_path
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def _resolve_lang(name: str) -> str:
|
| 58 |
+
return _LANG_CODE.get(name, "auto")
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
# ── Caption format writers ─────────────────────────────────────────────
|
| 62 |
+
|
| 63 |
+
def _seg_text(seg: dict, prefer_translation: bool) -> str:
|
| 64 |
+
if prefer_translation:
|
| 65 |
+
return (seg.get("translated_text") or seg.get("text") or "").strip()
|
| 66 |
+
return (seg.get("text") or "").strip()
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def _format_timestamp_srt(t: float) -> str:
|
| 70 |
+
h = int(t // 3600)
|
| 71 |
+
m = int((t % 3600) // 60)
|
| 72 |
+
s = int(t % 60)
|
| 73 |
+
ms = min(999, int(round((t - int(t)) * 1000)))  # clamp: rounding can otherwise yield 1000
|
| 74 |
+
return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def _format_timestamp_vtt(t: float) -> str:
|
| 78 |
+
return _format_timestamp_srt(t).replace(",", ".")
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
def write_srt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
|
| 82 |
+
lines = []
|
| 83 |
+
for i, seg in enumerate(segments, 1):
|
| 84 |
+
text = _seg_text(seg, prefer_translation)
|
| 85 |
+
if not text:
|
| 86 |
+
continue
|
| 87 |
+
lines.append(str(i))
|
| 88 |
+
lines.append(f"{_format_timestamp_srt(seg['start'])} --> {_format_timestamp_srt(seg['end'])}")
|
| 89 |
+
lines.append(text)
|
| 90 |
+
lines.append("")
|
| 91 |
+
dest.write_text("\n".join(lines), encoding="utf-8")
|
| 92 |
+
return dest
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
def write_vtt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
|
| 96 |
+
lines = ["WEBVTT", ""]
|
| 97 |
+
for seg in segments:
|
| 98 |
+
text = _seg_text(seg, prefer_translation)
|
| 99 |
+
if not text:
|
| 100 |
+
continue
|
| 101 |
+
lines.append(f"{_format_timestamp_vtt(seg['start'])} --> {_format_timestamp_vtt(seg['end'])}")
|
| 102 |
+
lines.append(text)
|
| 103 |
+
lines.append("")
|
| 104 |
+
dest.write_text("\n".join(lines), encoding="utf-8")
|
| 105 |
+
return dest
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
def write_txt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
|
| 109 |
+
text = " ".join(_seg_text(s, prefer_translation) for s in segments if _seg_text(s, prefer_translation))
|
| 110 |
+
dest.write_text(text, encoding="utf-8")
|
| 111 |
+
return dest
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
# ── Burn-in styling ────────────────────────────────────────────────────
|
| 115 |
+
|
| 116 |
+
# ASS-format alignment codes (libass), arranged as row + column:
|
| 117 |
+
# row: bottom=0, middle=3, top=6
|
| 118 |
+
# col: left=1, center=2, right=3
|
| 119 |
+
# So bottom-left=1, bottom-center=2, ..., top-right=9.
|
| 120 |
+
_POSITION_ROW = {"bottom": 0, "middle": 3, "top": 6}
|
| 121 |
+
_HALIGN_COL = {"left": 1, "center": 2, "right": 3}
|
| 122 |
+
_DEFAULT_MARGIN_V = {"bottom": 60, "middle": 0, "top": 60}
|
| 123 |
+
|
| 124 |
+
# Per-style baseline — font size, stroke/shadow choices. The user can override
|
| 125 |
+
# the font size via the slider; everything else stays tied to the style preset.
|
| 126 |
+
_STYLE_DEFAULTS: dict[CaptionStyle, dict] = {
|
| 127 |
+
"tiktok": {"font_size": 22, "bold": 1, "border_style": 1, "outline": 3, "shadow": 1},
|
| 128 |
+
"youtube": {"font_size": 18, "bold": 0, "border_style": 4, "outline": 8, "shadow": 0},
|
| 129 |
+
"minimal": {"font_size": 16, "bold": 0, "border_style": 1, "outline": 1, "shadow": 0},
|
| 130 |
+
}
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
def _clamp(value: int, lo: int, hi: int) -> int:
|
| 134 |
+
return max(lo, min(hi, value))
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
def _force_style_for(
|
| 138 |
+
style: CaptionStyle,
|
| 139 |
+
position: Position,
|
| 140 |
+
h_align: HAlign = "center",
|
| 141 |
+
font_size: int | None = None,
|
| 142 |
+
margin_v: int | None = None,
|
| 143 |
+
) -> str:
|
| 144 |
+
"""Return an ffmpeg `subtitles=...:force_style='...'` string.
|
| 145 |
+
|
| 146 |
+
Args:
|
| 147 |
+
style: Visual preset — sets weight, stroke, shadow defaults.
|
| 148 |
+
position: top / middle / bottom row.
|
| 149 |
+
h_align: left / center / right column.
|
| 150 |
+
font_size: Override the style's default font size (clamped to FONT_SIZE_MIN..MAX).
|
| 151 |
+
margin_v: Override vertical margin in pixels (clamped to MARGIN_V_MIN..MAX).
|
| 152 |
+
"""
|
| 153 |
+
defaults = _STYLE_DEFAULTS[style]
|
| 154 |
+
fs = _clamp(font_size if font_size is not None else defaults["font_size"],
|
| 155 |
+
FONT_SIZE_MIN, FONT_SIZE_MAX)
|
| 156 |
+
mv = _clamp(margin_v if margin_v is not None else _DEFAULT_MARGIN_V[position],
|
| 157 |
+
MARGIN_V_MIN, MARGIN_V_MAX)
|
| 158 |
+
align = _POSITION_ROW[position] + _HALIGN_COL[h_align]
|
| 159 |
+
|
| 160 |
+
parts = [
|
| 161 |
+
"FontName=Arial",
|
| 162 |
+
f"FontSize={fs}",
|
| 163 |
+
f"Bold={defaults['bold']}",
|
| 164 |
+
"PrimaryColour=&H00FFFFFF",
|
| 165 |
+
]
|
| 166 |
+
if style == "youtube":
|
| 167 |
+
# White on translucent black box
|
| 168 |
+
parts.append("BackColour=&HB8000000")
|
| 169 |
+
elif style == "minimal":
|
| 170 |
+
# Subtle semi-transparent stroke instead of hard black
|
| 171 |
+
parts.append("OutlineColour=&H80000000")
|
| 172 |
+
else: # tiktok — hard black stroke
|
| 173 |
+
parts.append("OutlineColour=&H00000000")
|
| 174 |
+
parts += [
|
| 175 |
+
f"BorderStyle={defaults['border_style']}",
|
| 176 |
+
f"Outline={defaults['outline']}",
|
| 177 |
+
f"Shadow={defaults['shadow']}",
|
| 178 |
+
f"Alignment={align}",
|
| 179 |
+
f"MarginV={mv}",
|
| 180 |
+
# Symmetric horizontal margins so left/right alignment has breathing room
|
| 181 |
+
"MarginL=40",
|
| 182 |
+
"MarginR=40",
|
| 183 |
+
]
|
| 184 |
+
return ",".join(parts)
|
| 185 |
+
|
| 186 |
+
|
| 187 |
+
def _burn_in(
|
| 188 |
+
video_path: Path,
|
| 189 |
+
srt_path: Path,
|
| 190 |
+
dest: Path,
|
| 191 |
+
style: CaptionStyle,
|
| 192 |
+
position: Position,
|
| 193 |
+
h_align: HAlign = "center",
|
| 194 |
+
font_size: int | None = None,
|
| 195 |
+
margin_v: int | None = None,
|
| 196 |
+
) -> Path:
|
| 197 |
+
"""Render captions into the video pixels via ffmpeg + libass."""
|
| 198 |
+
force_style = _force_style_for(style, position, h_align, font_size, margin_v)
|
| 199 |
+
# Escape path for ffmpeg subtitle filter (single quotes around path,
|
| 200 |
+
# and we backslash-escape any single quotes and colons, since they'd break the filter).
|
| 201 |
+
srt_str = str(srt_path).replace("'", r"\'").replace(":", r"\:")
|
| 202 |
+
vf = f"subtitles='{srt_str}':force_style='{force_style}'"
|
| 203 |
+
cmd = [
|
| 204 |
+
"ffmpeg", "-y",
|
| 205 |
+
"-i", str(video_path),
|
| 206 |
+
"-vf", vf,
|
| 207 |
+
"-c:a", "copy",
|
| 208 |
+
"-c:v", "libx264",
|
| 209 |
+
"-preset", "veryfast",
|
| 210 |
+
"-crf", "22",
|
| 211 |
+
str(dest),
|
| 212 |
+
]
|
| 213 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
|
| 214 |
+
if result.returncode != 0:
|
| 215 |
+
raise RuntimeError(f"ffmpeg burn-in failed: {result.stderr[-300:]}")
|
| 216 |
+
return dest
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
# ── Public entry point ────────────────────────────────────────────────
|
| 220 |
+
|
| 221 |
+
def generate_subtitles(
|
| 222 |
+
*,
|
| 223 |
+
input_path: Path,
|
| 224 |
+
out_dir: Path,
|
| 225 |
+
source_lang_name: str,
|
| 226 |
+
target_lang_name: str,
|
| 227 |
+
fmt: Format,
|
| 228 |
+
style: CaptionStyle = "tiktok",
|
| 229 |
+
position: Position = "bottom",
|
| 230 |
+
h_align: HAlign = "center",
|
| 231 |
+
font_size: int | None = None,
|
| 232 |
+
margin_v: int | None = None,
|
| 233 |
+
) -> dict:
|
| 234 |
+
"""
|
| 235 |
+
Run the full subtitle pipeline. Returns:
|
| 236 |
+
{
|
| 237 |
+
"format": "srt" | "vtt" | "txt" | "mp4",
|
| 238 |
+
"filename": <name in out_dir>,
|
| 239 |
+
"segments": <int>,
|
| 240 |
+
"translated": <bool>,
|
| 241 |
+
}
|
| 242 |
+
"""
|
| 243 |
+
is_burn = fmt == "mp4"
|
| 244 |
+
if is_burn and not _is_video(input_path):
|
| 245 |
+
raise ValueError("Burn-in requires a video file.")
|
| 246 |
+
|
| 247 |
+
# 1. Extract audio (or use as-is)
|
| 248 |
+
if _is_video(input_path):
|
| 249 |
+
audio_path = _extract_audio(input_path, out_dir)
|
| 250 |
+
else:
|
| 251 |
+
audio_path = input_path
|
| 252 |
+
|
| 253 |
+
# 2. Transcribe
|
| 254 |
+
src_code = _resolve_lang(source_lang_name)
|
| 255 |
+
segments = transcribe(str(audio_path), language=src_code)
|
| 256 |
+
if not segments:
|
| 257 |
+
raise RuntimeError("Transcription produced no segments.")
|
| 258 |
+
|
| 259 |
+
# 3. Translate if requested
|
| 260 |
+
translated = False
|
| 261 |
+
same_as_source = (
|
| 262 |
+
target_lang_name == "Same as source"
|
| 263 |
+
or target_lang_name.lower() == source_lang_name.lower()
|
| 264 |
+
)
|
| 265 |
+
if not same_as_source:
|
| 266 |
+
segments = translate(segments, target_lang_name)
|
| 267 |
+
translated = True
|
| 268 |
+
|
| 269 |
+
# 4. Emit
|
| 270 |
+
if fmt == "srt":
|
| 271 |
+
out = write_srt(segments, out_dir / "captions.srt", translated)
|
| 272 |
+
elif fmt == "vtt":
|
| 273 |
+
out = write_vtt(segments, out_dir / "captions.vtt", translated)
|
| 274 |
+
elif fmt == "txt":
|
| 275 |
+
out = write_txt(segments, out_dir / "transcript.txt", translated)
|
| 276 |
+
else: # mp4
|
| 277 |
+
srt_path = write_srt(segments, out_dir / "captions.srt", translated)
|
| 278 |
+
out = _burn_in(
|
| 279 |
+
input_path, srt_path, out_dir / "captioned.mp4",
|
| 280 |
+
style, position, h_align, font_size, margin_v,
|
| 281 |
+
)
|
| 282 |
+
|
| 283 |
+
return {
|
| 284 |
+
"format": fmt,
|
| 285 |
+
"filename": out.name,
|
| 286 |
+
"segments": len(segments),
|
| 287 |
+
"translated": translated,
|
| 288 |
+
}
|
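Formatting caption timestamps needs care near second boundaries, and cue numbers are easiest to keep consecutive with a dedicated counter. The sketch below shows one alternative approach, under assumptions: it rounds once to integer milliseconds and then splits with `divmod`, and it numbers only the cues that are actually written. `format_timestamp` and `build_srt` are hypothetical names, not functions from this commit.

```python
from __future__ import annotations


def format_timestamp(seconds: float, sep: str = ",") -> str:
    """Render HH:MM:SS<sep>mmm; use sep=',' for SRT and sep='.' for WebVTT."""
    total_ms = int(round(seconds * 1000))  # round once, then split as integers
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def build_srt(segments: list[dict]) -> str:
    """Build SRT text, skipping empty segments and numbering the rest 1..n."""
    lines: list[str] = []
    index = 0
    for seg in segments:
        text = (seg.get("text") or "").strip()
        if not text:
            continue
        index += 1
        lines += [
            str(index),
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}",
            text,
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    demo_segments = [
        {"start": 0.0, "end": 1.5, "text": "Hi"},
        {"start": 1.5, "end": 2.0, "text": "  "},
        {"start": 2.0, "end": 3.25, "text": "Bye"},
    ]
    print(build_srt(demo_segments))
```

Working in integer milliseconds means a value such as 1.9996 s carries into the seconds field instead of producing a four-digit millisecond part.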
tools_api/voice_clone.py
ADDED
|
@@ -0,0 +1,241 @@
|
| 1 |
+
"""
|
| 2 |
+
Voice clone playground — single-engine TTS from a sample + text input.
|
| 3 |
+
|
| 4 |
+
This Space runs only ONE engine (s4_tts enforces TTS_ENGINE match), so the
|
| 5 |
+
endpoint accepts no engine parameter. The frontend is responsible for fanning
|
| 6 |
+
out to multiple Spaces when the user wants comparison output.
|
| 7 |
+
|
| 8 |
+
Long text is split into ~200-char chunks at sentence/word boundaries and
|
| 9 |
+
synthesised as multiple segments, then concatenated into one MP3.
|
| 10 |
+
"""
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
import os
|
| 14 |
+
import re
|
| 15 |
+
import subprocess
|
| 16 |
+
from pathlib import Path
|
| 17 |
+
|
| 18 |
+
from steps.s4_tts import synthesise_segments
|
| 19 |
+
|
| 20 |
+
_AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".aac"}
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def _prepare_sample(sample_path: Path, out_dir: Path) -> Path:
|
| 24 |
+
"""Convert any uploaded sample (audio or video) to a clean 24kHz mono WAV.
|
| 25 |
+
|
| 26 |
+
TTS internals (s4_tts) call torchaudio.load via libsndfile, which only
|
| 27 |
+
understands WAV/FLAC. Anything else — including MP4 video, MP3, M4A —
|
| 28 |
+
has to be re-encoded first. We do this here so callers don't need to.
|
| 29 |
+
"""
|
| 30 |
+
out = out_dir / "sample_prepared.wav"
|
| 31 |
+
cmd = [
|
| 32 |
+
"ffmpeg", "-y", "-i", str(sample_path),
|
| 33 |
+
"-vn", # drop video stream if present
|
| 34 |
+
"-ac", "1", # mono
|
| 35 |
+
"-ar", "24000", # 24kHz — sweet spot for the TTS engines
|
| 36 |
+
"-acodec", "pcm_s16le",
|
| 37 |
+
str(out),
|
| 38 |
+
]
|
| 39 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
|
| 40 |
+
if result.returncode != 0:
|
| 41 |
+
raise ValueError(
|
| 42 |
+
"Couldn't read the uploaded sample. Use a clean audio file "
|
| 43 |
+
"(WAV, MP3, M4A) or a video with an audio track."
|
| 44 |
+
)
|
| 45 |
+
return out
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
def _isolate_vocals(prepared_sample: Path, out_dir: Path) -> Path:
|
| 49 |
+
"""Run Demucs source separation on the prepared sample and return a
|
| 50 |
+
vocals-only WAV (24kHz mono) suitable for TTS reference.
|
| 51 |
+
|
| 52 |
+
Mirrors what the dub pipeline (steps.s1b_separate) does so cloned voice
|
| 53 |
+
doesn't pick up music / ambient noise from the uploaded sample. Falls back
|
| 54 |
+
to the raw prepared sample if separation fails (model missing, oom, etc.)
|
| 55 |
+
rather than failing the whole clone request.
|
| 56 |
+
"""
|
| 57 |
+
try:
|
| 58 |
+
from steps.s1b_separate import separate_audio
|
| 59 |
+
except ImportError as e:
|
| 60 |
+
print(f"[voice_clone] Demucs unavailable, skipping vocal isolation: {e}")
|
| 61 |
+
return prepared_sample
|
| 62 |
+
|
| 63 |
+
separate_dir = out_dir / "separate"
|
| 64 |
+
separate_dir.mkdir(parents=True, exist_ok=True)
|
| 65 |
+
|
| 66 |
+
try:
|
| 67 |
+
vocals_16k_path, _accompaniment = separate_audio(str(prepared_sample), str(separate_dir))
|
| 68 |
+
except Exception as e:
|
| 69 |
+
print(f"[voice_clone] Demucs separation failed, using raw sample: {e}")
|
| 70 |
+
return prepared_sample
|
| 71 |
+
|
| 72 |
+
# Resample vocals from 16 kHz mono → 24 kHz mono for the TTS engines
|
| 73 |
+
vocals_24k = out_dir / "vocals_24k.wav"
|
| 74 |
+
cmd = [
|
| 75 |
+
"ffmpeg", "-y", "-i", vocals_16k_path,
|
| 76 |
+
"-ac", "1", "-ar", "24000",
|
| 77 |
+
"-acodec", "pcm_s16le",
|
| 78 |
+
str(vocals_24k),
|
| 79 |
+
]
|
| 80 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
|
| 81 |
+
if result.returncode != 0:
|
| 82 |
+
print(f"[voice_clone] Vocals resample failed, using 16kHz: {result.stderr[-200:]}")
|
| 83 |
+
return Path(vocals_16k_path)
|
| 84 |
+
|
| 85 |
+
return vocals_24k
|
| 86 |
+
|
| 87 |
+
CHUNK_TARGET_CHARS = 200
|
| 88 |
+
CHUNK_HARD_MAX = 280 # under chatterbox's 300-char per-segment ceiling
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
def _split_text(text: str) -> list[str]:
|
| 92 |
+
"""Split into chunks of ~CHUNK_TARGET_CHARS at sentence then word boundaries."""
|
| 93 |
+
text = text.strip()
|
| 94 |
+
if not text:
|
| 95 |
+
return []
|
| 96 |
+
if len(text) <= CHUNK_HARD_MAX:
|
| 97 |
+
return [text]
|
| 98 |
+
|
| 99 |
+
# First pass: sentence boundaries
|
| 100 |
+
sentences = re.split(r"(?<=[.!?])\s+", text)
|
| 101 |
+
chunks: list[str] = []
|
| 102 |
+
current = ""
|
| 103 |
+
for sent in sentences:
|
| 104 |
+
if not sent.strip():
|
| 105 |
+
continue
|
| 106 |
+
if len(current) + 1 + len(sent) <= CHUNK_TARGET_CHARS:
|
| 107 |
+
current = f"{current} {sent}".strip() if current else sent
|
| 108 |
+
else:
|
| 109 |
+
if current:
|
| 110 |
+
chunks.append(current)
|
| 111 |
+
# Sentence itself may exceed target — break it on words
|
| 112 |
+
if len(sent) > CHUNK_HARD_MAX:
|
| 113 |
+
words = sent.split()
|
| 114 |
+
buf = ""
|
| 115 |
+
for w in words:
|
| 116 |
+
if len(buf) + 1 + len(w) > CHUNK_HARD_MAX:
|
| 117 |
+
if buf:
|
| 118 |
+
chunks.append(buf)
|
| 119 |
+
buf = w
|
| 120 |
+
else:
|
| 121 |
+
buf = f"{buf} {w}".strip() if buf else w
|
| 122 |
+
if buf:
|
| 123 |
+
current = buf
|
| 124 |
+
else:
|
| 125 |
+
current = ""
|
| 126 |
+
else:
|
| 127 |
+
current = sent
|
| 128 |
+
if current:
|
| 129 |
+
chunks.append(current)
|
| 130 |
+
return chunks
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
def _build_segments(chunks: list[str], chunk_secs: float = 8.0) -> list[dict]:
|
| 134 |
+
"""Construct segment dicts for synthesise_segments — fake timing windows."""
|
| 135 |
+
segs = []
|
| 136 |
+
cursor = 0.0
|
| 137 |
+
for text in chunks:
|
| 138 |
+
# Allocate a generous window so _trim_to_duration doesn't clip output.
|
| 139 |
+
# Headroom is 1.4× so 8s window allows up to ~11s of audio per chunk.
|
| 140 |
+
segs.append({
|
| 141 |
+
"start": cursor,
|
| 142 |
+
"end": cursor + chunk_secs,
|
| 143 |
+
"text": text,
|
| 144 |
+
"translated_text": text,
|
| 145 |
+
"tts_text": text,
|
| 146 |
+
})
|
| 147 |
+
cursor += chunk_secs
|
| 148 |
+
return segs
|
| 149 |
+
|
| 150 |
+
|
| 151 |
+
def _concat_wavs_to_mp3(wav_paths: list[Path], dest: Path) -> Path:
|
| 152 |
+
"""Concat in order via ffmpeg concat demuxer, then encode MP3."""
|
| 153 |
+
if not wav_paths:
|
| 154 |
+
raise RuntimeError("No TTS chunks to concatenate.")
|
| 155 |
+
|
| 156 |
+
if len(wav_paths) == 1:
|
| 157 |
+
cmd = [
|
| 158 |
+
"ffmpeg", "-y", "-i", str(wav_paths[0]),
|
| 159 |
+
"-codec:a", "libmp3lame", "-b:a", "192k",
|
| 160 |
+
str(dest),
|
| 161 |
+
]
|
| 162 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
|
| 163 |
+
if result.returncode != 0:
|
| 164 |
+
raise RuntimeError(f"ffmpeg encode failed: {result.stderr[-300:]}")
|
| 165 |
+
return dest
|
| 166 |
+
|
| 167 |
+
list_file = dest.with_suffix(".txt")
|
| 168 |
+
list_file.write_text(
|
| 169 |
+
"\n".join(f"file '{p.as_posix()}'" for p in wav_paths),
|
| 170 |
+
encoding="utf-8",
|
| 171 |
+
)
|
| 172 |
+
cmd = [
|
| 173 |
+
"ffmpeg", "-y",
|
| 174 |
+
"-f", "concat", "-safe", "0",
|
| 175 |
+
"-i", str(list_file),
|
| 176 |
+
"-codec:a", "libmp3lame", "-b:a", "192k",
|
| 177 |
+
str(dest),
|
| 178 |
+
]
|
| 179 |
+
result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
|
| 180 |
+
list_file.unlink(missing_ok=True)
|
| 181 |
+
if result.returncode != 0:
|
| 182 |
+
raise RuntimeError(f"ffmpeg concat failed: {result.stderr[-300:]}")
|
| 183 |
+
return dest
|
| 184 |
+
|
| 185 |
+
|
| 186 |
+
def clone_voice(
|
| 187 |
+
*,
|
| 188 |
+
sample_path: Path,
|
| 189 |
+
text: str,
|
| 190 |
+
out_dir: Path,
|
| 191 |
+
language_id: str = "en",
|
| 192 |
+
) -> dict:
|
| 193 |
+
"""
|
| 194 |
+
Run TTS on `text` using the voice from `sample_path`. Returns:
|
| 195 |
+
{
|
| 196 |
+
"filename": "voice.mp3",
|
| 197 |
+
"engine": <current TTS_ENGINE>,
|
| 198 |
+
"chunks": <int>,
|
| 199 |
+
}
|
| 200 |
+
"""
|
| 201 |
+
text = (text or "").strip()
|
| 202 |
+
if not text:
|
| 203 |
+
raise ValueError("Text is required.")
|
| 204 |
+
|
| 205 |
+
chunks = _split_text(text)
|
| 206 |
+
segments = _build_segments(chunks)
|
| 207 |
+
|
| 208 |
+
# Normalise the sample (handles video, mp3, m4a, etc.) → 24kHz mono WAV
|
| 209 |
+
prepared_sample = _prepare_sample(sample_path, out_dir)
|
| 210 |
+
|
| 211 |
+
# Demucs source separation → isolate vocals so the clone doesn't pick up
|
| 212 |
+
# background music or ambient noise. Same step the dub pipeline uses.
|
| 213 |
+
reference_for_tts = _isolate_vocals(prepared_sample, out_dir)
|
| 214 |
+
|
| 215 |
+
seg_out_dir = out_dir / "tts"
|
| 216 |
+
seg_out_dir.mkdir(parents=True, exist_ok=True)
|
| 217 |
+
|
| 218 |
+
tts_result = None
|
| 219 |
+
for msg in synthesise_segments(
|
| 220 |
+
segments=segments,
|
| 221 |
+
reference_audio_path=str(reference_for_tts),
|
| 222 |
+
language_id=language_id,
|
| 223 |
+
output_dir=str(seg_out_dir),
|
| 224 |
+
):
|
| 225 |
+
if isinstance(msg, dict) and "__TTS_RESULT__" in msg:
|
| 226 |
+
tts_result = msg["__TTS_RESULT__"]
|
| 227 |
+
|
| 228 |
+
if not tts_result:
|
| 229 |
+
raise RuntimeError("TTS produced no output.")
|
| 230 |
+
|
| 231 |
+
wav_paths = [Path(seg["tts_path"]) for seg in tts_result if seg.get("tts_path")]
|
| 232 |
+
if not wav_paths:
|
| 233 |
+
raise RuntimeError("TTS result missing audio paths.")
|
| 234 |
+
|
| 235 |
+
mp3_path = _concat_wavs_to_mp3(wav_paths, out_dir / "voice.mp3")
|
| 236 |
+
|
| 237 |
+
return {
|
| 238 |
+
"filename": mp3_path.name,
|
| 239 |
+
"engine": os.getenv("TTS_ENGINE", "chatterbox").lower(),
|
| 240 |
+
"chunks": len(chunks),
|
| 241 |
+
}
|
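`clone_voice` drains a generator that yields progress messages and keeps only the dict carrying the `__TTS_RESULT__` key. The sketch below illustrates that consumption pattern under stated assumptions: `fake_synthesise` is a hypothetical stand-in that produces no audio, and the real `synthesise_segments` may yield different message shapes.

```python
from __future__ import annotations

from typing import Iterable, Iterator, Union

RESULT_KEY = "__TTS_RESULT__"


def fake_synthesise(texts: list[str]) -> Iterator[Union[str, dict]]:
    """Hypothetical generator: progress strings first, then one result dict."""
    results = []
    for i, text in enumerate(texts):
        yield f"synthesising chunk {i + 1}/{len(texts)}"
        results.append({"text": text, "tts_path": f"seg_{i:03d}.wav"})
    yield {RESULT_KEY: results}


def collect_result(messages: Iterable[Union[str, dict]]) -> list[dict]:
    """Drain the generator, ignore progress messages, return the final payload."""
    result = None
    for msg in messages:
        if isinstance(msg, dict) and RESULT_KEY in msg:
            result = msg[RESULT_KEY]
    if not result:
        raise RuntimeError("generator finished without a result")
    return result


if __name__ == "__main__":
    for seg in collect_result(fake_synthesise(["Hello there.", "General Kenobi."])):
        print(seg["tts_path"], seg["text"])
```

Draining the whole generator, instead of stopping at the first dict, lets the producer finish any work it performs after yielding.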