github-actions[bot] committed on
Commit 5b7cd5f · 1 Parent(s): e5dfb46

deploy: switch to chatterbox requirements @ 4319730

CLAUDE.md CHANGED
@@ -1,3 +1,7 @@
+ ## Deployment
+
+ HF Spaces deployment is fully automated via `.github/workflows/deploy-hf.yml`. Pushing to `origin/main` triggers the workflow which runs `./deploy.sh --force` and pushes to all three Spaces (Chatterbox, OmniVoice, Qwen3). Do not run `./deploy.sh` locally after a push — it is redundant. To verify a deploy, use `gh run list --workflow=deploy-hf.yml`.
+
  ## graphify
 
  This project has a graphify knowledge graph at graphify-out/.
app.py CHANGED
@@ -25,8 +25,10 @@ demo = Server()
  # INTEGRATE SERVER.PY ROUTES
  # -----------------------------------------------------
  from server import router, limiter, enforce_content_length_limit
+ from tools_api import router as tools_router
 
  demo.include_router(router)
+ demo.include_router(tools_router)
  demo.state.limiter = limiter
  demo.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
 
graphify-out/.graphify_root ADDED
@@ -0,0 +1 @@
+ /Users/rafa/MscAi/VideoVoice-be
graphify-out/GRAPH_REPORT.md CHANGED
@@ -1,12 +1,12 @@
1
- # Graph Report - VideoVoice-be (2026-04-26)
2
 
3
  ## Corpus Check
4
- - 47 files · ~206,237 words
5
  - Verdict: corpus is large enough that graph structure adds value.
6
 
7
  ## Summary
8
- - 796 nodes · 1429 edges · 57 communities detected
9
- - Extraction: 75% EXTRACTED · 25% INFERRED · 0% AMBIGUOUS · INFERRED: 357 edges (avg confidence: 0.6)
10
  - Token cost: 0 input · 0 output
11
 
12
  ## Community Hubs (Navigation)
@@ -27,12 +27,12 @@
27
  - [[_COMMUNITY_Community 14|Community 14]]
28
  - [[_COMMUNITY_Community 15|Community 15]]
29
  - [[_COMMUNITY_Community 16|Community 16]]
 
30
  - [[_COMMUNITY_Community 18|Community 18]]
31
- - [[_COMMUNITY_Community 26|Community 26]]
32
- - [[_COMMUNITY_Community 27|Community 27]]
33
- - [[_COMMUNITY_Community 28|Community 28]]
34
- - [[_COMMUNITY_Community 29|Community 29]]
35
- - [[_COMMUNITY_Community 30|Community 30]]
36
  - [[_COMMUNITY_Community 31|Community 31]]
37
  - [[_COMMUNITY_Community 32|Community 32]]
38
  - [[_COMMUNITY_Community 33|Community 33]]
@@ -67,6 +67,11 @@
67
  - [[_COMMUNITY_Community 62|Community 62]]
68
  - [[_COMMUNITY_Community 63|Community 63]]
69
  - [[_COMMUNITY_Community 64|Community 64]]
 
 
 
 
 
70
 
71
  ## God Nodes (most connected - your core abstractions)
72
  1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
@@ -81,16 +86,16 @@
81
  10. `BasePoster` - 14 edges
82
 
83
  ## Surprising Connections (you probably didn't know these)
84
- - `_segments_from_pollinations()` --calls--> `post()` [INFERRED]
85
- steps/s2_transcribe.py → social_distributor/poster/platforms/base.py
86
  - `chatterbox-tts==0.1.7 --no-deps` --semantically_similar_to--> `omnivoice>=0.1.4` [INFERRED] [semantically similar]
87
  requirements.txt → requirements-omni.txt
88
  - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)` [INFERRED] [semantically similar]
89
  requirements.txt → requirements-omni.txt
 
 
 
 
90
  - `run_pipeline()` --calls--> `transcribe()` [INFERRED]
91
  pipeline.py → steps/s2_transcribe.py
92
- - `run_pipeline()` --calls--> `translate()` [INFERRED]
93
- pipeline.py → steps/s3_translate.py
94
 
95
  ## Hyperedges (group relationships)
96
  - **Six-step translation pipeline** — [EXTRACTED 1.00]
@@ -101,323 +106,343 @@
101
 
102
  ### Community 0 - "Community 0"
103
  Cohesion: 0.04
104
- Nodes (70): Qwen3TTSConfig, Qwen3TTSSpeakerEncoderConfig, Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig, r""" This is the configuration class to store the configuration of a [`Qwen3, r""" This is the configuration class to store the configuration of a [`Qwen3, This is the configuration class to store the configuration of a [`Qwen3TTSForCon, r""" This is the configuration class to store the configuration of a [`Qwen3 (+62 more)
105
 
106
  ### Community 1 - "Community 1"
107
- Cohesion: 0.05
108
- Nodes (58): ABC, BasePoster, post(), Abstract base class for platform posters., Save a debug screenshot on failure., BasePoster, _build_system_prompt(), _build_user_prompt() (+50 more)
109
 
110
  ### Community 2 - "Community 2"
111
- Cohesion: 0.04
112
- Nodes (61): api_run_pipeline(), content_length_middleware(), ZeroGPU-compatible entrypoint using gradio.Server. Server extends FastAPI, so al, Exposed through Gradio's API engine. ZeroGPU will allocate a GPU when this e, run_pipeline(), BaseHTTPMiddleware, BaseModel, _artifact_reaper_loop() (+53 more)
113
 
114
  ### Community 3 - "Community 3"
115
- Cohesion: 0.06
116
- Nodes (55): forward(), generate(), generate_speaker_prompt(), main(), _prefetch_chatterbox(), _prefetch_demucs(), _prefetch_faster_whisper(), Prefetch model weights into HF_HOME for faster cold starts on Spaces. (+47 more)
117
 
118
  ### Community 4 - "Community 4"
119
- Cohesion: 0.07
120
- Nodes (24): DistributedGroupResidualVectorQuantization, Efficient distributed group residual vector quantization implementation. Fol, dynamic_range_compression_torch(), MelSpectrogramFeatures, x: torch.Tensor, shape = (T, D) q: torch.Tensor, shape = (T, D), x : torch.Tensor, shape = (n_mels, n_ctx) the mel spectrogram of the, Calculate the BigVGAN style mel spectrogram of an input signal. Args:, spectral_normalize_torch() (+16 more)
121
 
122
  ### Community 5 - "Community 5"
123
- Cohesion: 0.07
124
- Nodes (26): _audio_to_tuple(), _build_choices_and_map(), build_demo(), build_parser(), _collect_gen_kwargs(), _detect_model_kind(), _dtype_from_str(), main() (+18 more)
125
 
126
  ### Community 6 - "Community 6"
127
- Cohesion: 0.05
128
- Nodes (49): FFmpeg concat list (synced TTS), Try-Now app panel, app.js script ref, Comparison table (HeyGen, Rask, ElevenLabs, Synthesia), Hero section + 23+ languages, Frontend index.html, Source/target language selectors, Pricing tiers (Free/Starter/Creator) (+41 more)
129
 
130
  ### Community 7 - "Community 7"
131
  Cohesion: 0.07
132
- Nodes (35): _collect_output(), _log_step_done(), main(), pipeline.py — Core pipeline: CLI entrypoint + importable run_pipeline() for Grad, Print duration + separator line for a completed step., Collect all yields and the return value from the generator., Run the full translation pipeline, yielding progress messages. Args:, run_pipeline() (+27 more)
133
 
134
  ### Community 8 - "Community 8"
 
 
 
 
135
  Cohesion: 0.09
136
  Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)
137
 
138
- ### Community 9 - "Community 9"
139
  Cohesion: 0.1
140
  Nodes (14): default(), DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb(), preprocess() (+6 more)
141
 
142
- ### Community 10 - "Community 10"
 
 
 
 
143
  Cohesion: 0.1
144
  Nodes (31): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., Translate the text of each segment into target_language in batches. Args:, translate(), _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client() (+23 more)
145
 
146
- ### Community 11 - "Community 11"
147
  Cohesion: 0.12
148
- Nodes (29): _assign_words_to_segments(), _extract_words(), _get_faster_whisper_model(), _get_local_whisper_backend(), _get_openai_whisper_model(), _normalise_segments(), Step 3: Transcribe audio with timestamps. Primary local backend (device-depende, Split segments longer than _MAX_SEGMENT_DURATION using word timings. (+21 more)
149
 
150
- ### Community 12 - "Community 12"
151
- Cohesion: 0.13
152
- Nodes (26): _compress_silences(), _detect_pauses(), _distribute_padding(), _find_tts_silences(), _generate_silence(), _get_wav_duration(), _pad_silence(), _pause_aware_sync() (+18 more)
153
 
154
- ### Community 13 - "Community 13"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
155
  Cohesion: 0.24
156
  Nodes (11): extract_creator(), _extract_instagram(), _extract_tiktok(), _extract_youtube(), _load_cache(), Extract original creator @username from video URLs., YouTube: visit video page, extract channel name from meta tags., Extract the @username of the original creator from the video URL. Uses Play (+3 more)
157
 
158
- ### Community 14 - "Community 14"
159
  Cohesion: 0.27
160
  Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline. Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)
161
 
162
- ### Community 15 - "Community 15"
163
- Cohesion: 0.22
164
- Nodes (5): Qwen3TTSProcessor, r""" Constructs a Qwen3TTS processor. Args: tokenizer ([`Qwen2T, Main method to prepare for the model one or several sequences(s) and audio(s). T, This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedToke, ProcessorMixin
165
-
166
- ### Community 16 - "Community 16"
167
  Cohesion: 0.33
168
  Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var
169
 
170
- ### Community 18 - "Community 18"
171
  Cohesion: 1.0
172
  Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)
173
 
174
- ### Community 26 - "Community 26"
175
  Cohesion: 1.0
176
  Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.
177
 
178
- ### Community 27 - "Community 27"
179
  Cohesion: 1.0
180
  Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te
181
 
182
- ### Community 28 - "Community 28"
183
  Cohesion: 1.0
184
  Nodes (1): Voice clone speech using the Base model. You can provide either:
185
 
186
- ### Community 29 - "Community 29"
187
  Cohesion: 1.0
188
  Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct
189
 
190
- ### Community 30 - "Community 30"
191
  Cohesion: 1.0
192
  Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option
193
 
194
- ### Community 31 - "Community 31"
195
  Cohesion: 1.0
196
  Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.
197
 
198
- ### Community 32 - "Community 32"
199
  Cohesion: 1.0
200
  Nodes (1): Reject oversized uploads before body parsing.
201
 
202
- ### Community 33 - "Community 33"
203
  Cohesion: 1.0
204
  Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job
205
 
206
- ### Community 34 - "Community 34"
207
  Cohesion: 1.0
208
  Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.
209
 
210
- ### Community 35 - "Community 35"
211
  Cohesion: 1.0
212
  Nodes (1): Return curated showcase entries with resolved streaming URLs.
213
 
214
- ### Community 36 - "Community 36"
215
  Cohesion: 1.0
216
  Nodes (1): Submit a video for translation.
217
 
218
- ### Community 37 - "Community 37"
219
  Cohesion: 1.0
220
  Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.
221
 
222
- ### Community 38 - "Community 38"
223
  Cohesion: 1.0
224
  Nodes (1): User selects a TTS model after previewing.
225
 
226
- ### Community 39 - "Community 39"
227
  Cohesion: 1.0
228
  Nodes (1): Serve a preview audio WAV file.
229
 
230
- ### Community 40 - "Community 40"
231
  Cohesion: 1.0
232
  Nodes (1): Download the translated video.
233
 
234
- ### Community 41 - "Community 41"
235
  Cohesion: 1.0
236
  Nodes (1): Create artifact directories and start background cleanup.
237
 
238
- ### Community 42 - "Community 42"
239
  Cohesion: 1.0
240
  Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.
241
 
242
- ### Community 43 - "Community 43"
243
  Cohesion: 1.0
244
  Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura
245
 
246
- ### Community 44 - "Community 44"
247
  Cohesion: 1.0
248
  Nodes (1): Insert extra silence distributed across detected pause points.
249
 
250
- ### Community 45 - "Community 45"
251
  Cohesion: 1.0
252
  Nodes (1): Generate a silent WAV file of given duration.
253
 
254
- ### Community 46 - "Community 46"
255
  Cohesion: 1.0
256
  Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single
257
 
258
- ### Community 47 - "Community 47"
259
  Cohesion: 1.0
260
  Nodes (1): Translate the text of each segment into target_language in batches. Args:
261
 
262
- ### Community 48 - "Community 48"
263
  Cohesion: 1.0
264
  Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope. ZeroGPU only int
265
 
266
- ### Community 49 - "Community 49"
267
  Cohesion: 1.0
268
  Nodes (1): Remove trailing noise/artifacts after speech ends.
269
 
270
- ### Community 50 - "Community 50"
271
  Cohesion: 1.0
272
  Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.
273
 
274
- ### Community 51 - "Community 51"
275
  Cohesion: 1.0
276
  Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.
277
 
278
- ### Community 52 - "Community 52"
279
  Cohesion: 1.0
280
  Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.
281
 
282
- ### Community 53 - "Community 53"
283
  Cohesion: 1.0
284
  Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated
285
 
286
- ### Community 54 - "Community 54"
287
  Cohesion: 1.0
288
  Nodes (1): Generate speech for all segments using OmniVoice voice cloning.
289
 
290
- ### Community 55 - "Community 55"
291
  Cohesion: 1.0
292
  Nodes (1): Synthesise translated text for each segment using voice cloned from reference au
293
 
294
- ### Community 56 - "Community 56"
295
  Cohesion: 1.0
296
  Nodes (1): torch==2.6.0
297
 
298
- ### Community 57 - "Community 57"
299
  Cohesion: 1.0
300
  Nodes (1): fastapi
301
 
302
- ### Community 58 - "Community 58"
303
  Cohesion: 1.0
304
  Nodes (1): yt-dlp
305
 
306
- ### Community 59 - "Community 59"
307
  Cohesion: 1.0
308
  Nodes (1): diffusers==0.29.0
309
 
310
- ### Community 60 - "Community 60"
311
  Cohesion: 1.0
312
  Nodes (1): ARTIFACTS_ROOT env
313
 
314
- ### Community 61 - "Community 61"
315
  Cohesion: 1.0
316
  Nodes (1): AWS g4dn.xlarge alternative
317
 
318
- ### Community 62 - "Community 62"
319
  Cohesion: 1.0
320
  Nodes (1): nodejs (system pkg)
321
 
322
- ### Community 63 - "Community 63"
323
  Cohesion: 1.0
324
  Nodes (1): fonts-noto-core / cjk
325
 
326
- ### Community 64 - "Community 64"
327
  Cohesion: 1.0
328
  Nodes (1): graphify project rules
329
 
330
  ## Knowledge Gaps
331
- - **217 isolated node(s):** `server.py — FastAPI backend for VideoVoice. Endpoints: POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.`, `Read media duration from ffprobe.`, `Report CUDA/MPS availability.` (+212 more)
332
  These have ≤1 connection - possible missing edges or undocumented components.
333
- - **Thin community `Community 18`** (2 nodes): `gradio==6.8.0`, `gradio==6.12.0 (omni)`
334
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
335
- - **Thin community `Community 26`** (1 nodes): `Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.`
336
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
337
- - **Thin community `Community 27`** (1 nodes): `Build voice-clone prompt items from reference audio (and optionally reference te`
338
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
339
- - **Thin community `Community 28`** (1 nodes): `Voice clone speech using the Base model. You can provide either:`
340
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
341
- - **Thin community `Community 29`** (1 nodes): `Generate speech with the VoiceDesign model using natural-language style instruct`
342
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
343
- - **Thin community `Community 30`** (1 nodes): `Generate speech with the CustomVoice model using a predefined speaker id, option`
344
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
345
- - **Thin community `Community 31`** (1 nodes): `Delete stale per-job artifact directories from ARTIFACTS_ROOT.`
346
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
347
- - **Thin community `Community 32`** (1 nodes): `Reject oversized uploads before body parsing.`
348
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
349
- - **Thin community `Community 33`** (1 nodes): `Run the translation pipeline in a background thread, pushing progress to the job`
350
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
351
- - **Thin community `Community 34`** (1 nodes): `List whitelisted MP4 demo videos from outputs/ and data/.`
352
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
353
- - **Thin community `Community 35`** (1 nodes): `Return curated showcase entries with resolved streaming URLs.`
354
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
355
- - **Thin community `Community 36`** (1 nodes): `Submit a video for translation.`
356
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
357
- - **Thin community `Community 37`** (1 nodes): `Poll endpoint returning new messages since index `after`, plus live wait status.`
358
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
359
- - **Thin community `Community 38`** (1 nodes): `User selects a TTS model after previewing.`
360
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
361
- - **Thin community `Community 39`** (1 nodes): `Serve a preview audio WAV file.`
362
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
363
- - **Thin community `Community 40`** (1 nodes): `Download the translated video.`
364
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
365
- - **Thin community `Community 41`** (1 nodes): `Create artifact directories and start background cleanup.`
366
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
367
- - **Thin community `Community 42`** (1 nodes): `Sync TTS audio using pause-aware strategy: compress silences first, then atempo.`
368
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
369
- - **Thin community `Community 43`** (1 nodes): `Rewrite WAV with silence regions compressed to keep_ratio of their original dura`
370
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
371
- - **Thin community `Community 44`** (1 nodes): `Insert extra silence distributed across detected pause points.`
372
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
373
- - **Thin community `Community 45`** (1 nodes): `Generate a silent WAV file of given duration.`
374
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
375
- - **Thin community `Community 46`** (1 nodes): `Sync each TTS segment to its original timestamp window and stitch into a single`
376
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
377
- - **Thin community `Community 47`** (1 nodes): `Translate the text of each segment into target_language in batches. Args:`
378
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
379
- - **Thin community `Community 48`** (1 nodes): `Load + run Chatterbox inside a single GPU-decorated scope. ZeroGPU only int`
380
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
381
- - **Thin community `Community 49`** (1 nodes): `Remove trailing noise/artifacts after speech ends.`
382
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
383
- - **Thin community `Community 50`** (1 nodes): `Hard-trim TTS output to orig_dur * headroom, with a short fade-out.`
384
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
385
- - **Thin community `Community 51`** (1 nodes): `Clip audio to max_sec to prevent excessively slow voice cloning.`
386
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
387
- - **Thin community `Community 52`** (1 nodes): `Numpy variant of _trim_trailing_noise for engines returning np.ndarray.`
388
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
389
- - **Thin community `Community 53`** (1 nodes): `Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated`
390
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
391
- - **Thin community `Community 54`** (1 nodes): `Generate speech for all segments using OmniVoice voice cloning.`
392
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
393
- - **Thin community `Community 55`** (1 nodes): `Synthesise translated text for each segment using voice cloned from reference au`
394
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
395
- - **Thin community `Community 56`** (1 nodes): `torch==2.6.0`
396
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
397
- - **Thin community `Community 57`** (1 nodes): `fastapi`
398
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
399
- - **Thin community `Community 58`** (1 nodes): `yt-dlp`
400
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
401
- - **Thin community `Community 59`** (1 nodes): `diffusers==0.29.0`
402
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
403
- - **Thin community `Community 60`** (1 nodes): `ARTIFACTS_ROOT env`
404
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
405
- - **Thin community `Community 61`** (1 nodes): `AWS g4dn.xlarge alternative`
406
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
407
- - **Thin community `Community 62`** (1 nodes): `nodejs (system pkg)`
408
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
409
- - **Thin community `Community 63`** (1 nodes): `fonts-noto-core / cjk`
410
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
411
- - **Thin community `Community 64`** (1 nodes): `graphify project rules`
412
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
413
 
414
  ## Suggested Questions
415
  _Questions this graph is uniquely positioned to answer:_
416
 
417
- - **Why does `run_pipeline()` connect `Community 7` to `Community 3`, `Community 10`, `Community 11`, `Community 12`?**
418
- _High betweenness centrality (0.339) - this node is a cross-community bridge._
419
- - **Why does `synthesise_segments()` connect `Community 3` to `Community 7`?**
420
- _High betweenness centrality (0.299) - this node is a cross-community bridge._
421
  - **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
422
  _`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
423
  - **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
@@ -427,4 +452,4 @@ _Questions this graph is uniquely positioned to answer:_
427
  - **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
428
  _`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
429
  - **What connects `server.py — FastAPI backend for VideoVoice. Endpoints: POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
430
- _217 weakly-connected nodes found - possible documentation gaps or missing edges._
 
1
+ # Graph Report - VideoVoice-be (2026-05-16)
2
 
3
  ## Corpus Check
4
+ - 59 files · ~253,292 words
5
  - Verdict: corpus is large enough that graph structure adds value.
6
 
7
  ## Summary
8
+ - 1050 nodes · 1833 edges · 62 communities detected
9
+ - Extraction: 79% EXTRACTED · 21% INFERRED · 0% AMBIGUOUS · INFERRED: 389 edges (avg confidence: 0.62)
10
  - Token cost: 0 input · 0 output
11
 
12
  ## Community Hubs (Navigation)
 
27
  - [[_COMMUNITY_Community 14|Community 14]]
28
  - [[_COMMUNITY_Community 15|Community 15]]
29
  - [[_COMMUNITY_Community 16|Community 16]]
30
+ - [[_COMMUNITY_Community 17|Community 17]]
31
  - [[_COMMUNITY_Community 18|Community 18]]
32
+ - [[_COMMUNITY_Community 19|Community 19]]
33
+ - [[_COMMUNITY_Community 20|Community 20]]
34
+ - [[_COMMUNITY_Community 21|Community 21]]
35
+ - [[_COMMUNITY_Community 23|Community 23]]
 
36
  - [[_COMMUNITY_Community 31|Community 31]]
37
  - [[_COMMUNITY_Community 32|Community 32]]
38
  - [[_COMMUNITY_Community 33|Community 33]]
 
67
  - [[_COMMUNITY_Community 62|Community 62]]
68
  - [[_COMMUNITY_Community 63|Community 63]]
69
  - [[_COMMUNITY_Community 64|Community 64]]
70
+ - [[_COMMUNITY_Community 65|Community 65]]
71
+ - [[_COMMUNITY_Community 66|Community 66]]
72
+ - [[_COMMUNITY_Community 67|Community 67]]
73
+ - [[_COMMUNITY_Community 68|Community 68]]
74
+ - [[_COMMUNITY_Community 69|Community 69]]
75
 
76
  ## God Nodes (most connected - your core abstractions)
77
  1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
 
86
  10. `BasePoster` - 14 edges
87
 
88
  ## Surprising Connections (you probably didn't know these)
 
 
89
  - `chatterbox-tts==0.1.7 --no-deps` --semantically_similar_to--> `omnivoice>=0.1.4` [INFERRED] [semantically similar]
90
  requirements.txt → requirements-omni.txt
91
  - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)` [INFERRED] [semantically similar]
92
  requirements.txt → requirements-omni.txt
93
+ - `content_length_middleware()` --calls--> `enforce_content_length_limit()` [INFERRED]
94
+ app.py → server.py
95
+ - `run_pipeline()` --calls--> `separate_audio()` [INFERRED]
96
+ pipeline.py → steps/s1b_separate.py
97
  - `run_pipeline()` --calls--> `transcribe()` [INFERRED]
98
  pipeline.py → steps/s2_transcribe.py
 
 
99
 
100
  ## Hyperedges (group relationships)
101
  - **Six-step translation pipeline** — [EXTRACTED 1.00]
 
106
 
107
  ### Community 0 - "Community 0"
108
  Cohesion: 0.04
109
+ Nodes (69): Qwen3TTSConfig, Qwen3TTSSpeakerEncoderConfig, Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig, r""" This is the configuration class to store the configuration of a [`Qwen3, r""" This is the configuration class to store the configuration of a [`Qwen3, This is the configuration class to store the configuration of a [`Qwen3TTSForCon, r""" This is the configuration class to store the configuration of a [`Qwen3 (+61 more)
110
 
111
  ### Community 1 - "Community 1"
112
+ Cohesion: 0.02
113
+ Nodes (118): api_run_pipeline(), content_length_middleware(), ZeroGPU-compatible entrypoint using gradio.Server. Server extends FastAPI, so al, Exposed through Gradio's API engine. ZeroGPU will allocate a GPU when this e, run_pipeline(), BaseHTTPMiddleware, BaseModel, _artifact_reaper_loop() (+110 more)
114
 
115
  ### Community 2 - "Community 2"
116
+ Cohesion: 0.05
117
+ Nodes (57): ABC, BasePoster, Abstract base class for platform posters., Save a debug screenshot on failure., BasePoster, _build_system_prompt(), _build_user_prompt(), format_caption() (+49 more)
118
 
119
  ### Community 3 - "Community 3"
120
+ Cohesion: 0.05
121
+ Nodes (59): _collect_output(), _log_step_done(), main(), pipeline.py Core pipeline: CLI entrypoint + importable run_pipeline() for Grad, Print duration + separator line for a completed step., Collect all yields and the return value from the generator., Run the full translation pipeline, yielding progress messages. Args:, run_pipeline() (+51 more)
122
 
123
  ### Community 4 - "Community 4"
124
+ Cohesion: 0.06
125
+ Nodes (55): forward(), generate(), generate_speaker_prompt(), main(), _prefetch_chatterbox(), _prefetch_demucs(), _prefetch_faster_whisper(), Prefetch model weights into HF_HOME for faster cold starts on Spaces. (+47 more)
126
 
127
  ### Community 5 - "Community 5"
128
+ Cohesion: 0.06
129
+ Nodes (59): post(), _assign_words_to_segments(), _extract_words(), _get_faster_whisper_model(), _get_local_whisper_backend(), _get_openai_whisper_model(), _normalise_segments(), Step 3: Transcribe audio with timestamps. Primary local backend (device-depende (+51 more)
130
 
131
  ### Community 6 - "Community 6"
132
+ Cohesion: 0.06
+ Nodes (31): _audio_to_tuple(), _build_choices_and_map(), build_demo(), build_parser(), _collect_gen_kwargs(), _detect_model_kind(), _dtype_from_str(), main() (+23 more)

  ### Community 7 - "Community 7"
  Cohesion: 0.07
+ Nodes (25): DistributedGroupResidualVectorQuantization, Efficient distributed group residual vector quantization implementation. Fol, dynamic_range_compression_torch(), MelSpectrogramFeatures, x: torch.Tensor, shape = (T, D) q: torch.Tensor, shape = (T, D), x : torch.Tensor, shape = (n_mels, n_ctx) the mel spectrogram of the, Calculate the BigVGAN style mel spectrogram of an input signal. Args:, spectral_normalize_torch() (+17 more)

  ### Community 8 - "Community 8"
+ Cohesion: 0.05
+ Nodes (49): FFmpeg concat list (synced TTS), Try-Now app panel, app.js script ref, Comparison table (HeyGen, Rask, ElevenLabs, Synthesia), Hero section + 23+ languages, Frontend index.html, Source/target language selectors, Pricing tiers (Free/Starter/Creator) (+41 more)
+
+ ### Community 9 - "Community 9"
  Cohesion: 0.09
  Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)

+ ### Community 10 - "Community 10"
  Cohesion: 0.1
  Nodes (14): default(), DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb(), preprocess() (+6 more)

+ ### Community 11 - "Community 11"
+ Cohesion: 0.08
+ Nodes (32): _apply_demucs(), _get_model(), _load_and_normalise(), Step 1b: Separate vocals from accompaniment using Demucs (Python API). In-proce, Lazy-load htdemucs once per process. Module-level semantics; we load on firs, GPU-bound inference call. `mix` shape: [1, channels, time]., Load WAV, resample/remix to match model requirements, z-normalise., Separate vocals from accompaniment using Demucs htdemucs (Python API). Args (+24 more)
+
+ ### Community 12 - "Community 12"
  Cohesion: 0.1
  Nodes (31): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., Translate the text of each segment into target_language in batches. Args:, translate(), _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client() (+23 more)

+ ### Community 13 - "Community 13"
  Cohesion: 0.12
+ Nodes (27): build_for_job(), ensure_transcription(), extract_audio_hq(), extract_reference_audio(), get_audio_duration(), get_device(), load_chatterbox(), main() (+19 more)

+ ### Community 14 - "Community 14"
+ Cohesion: 0.11
+ Nodes (25): tools_api — Standalone endpoints for creator quick tools. Lives alongside the m, audio_cleanup_endpoint(), _ext_to_media_type(), APIRouter for /api/tools/* endpoints. Each endpoint is sync request-response (n, Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS., Manual reap trigger (mostly for testing). Auto-reap runs on a timer., Stream upload to disk, enforcing the tools size cap., _reap() (+17 more)

+ ### Community 15 - "Community 15"
+ Cohesion: 0.12
+ Nodes (23): build_t3_cond(), main(), prepare_sample(), prepare_sample.py — Turn one dataset.jsonl row into the exact tensors T3.loss(), Build the speaker conditioning (frozen during training)., MTLTokenizer + SOT/EOT padding (mirrors what generate() does internally)., S3Tokenizer on the target dubbed audio → speech tokens (the LABEL). Critica, Turn one dataset row into ready-to-train tensors. (+15 more)
+
+ ### Community 16 - "Community 16"
+ Cohesion: 0.19
+ Nodes (18): _burn_in(), _clamp(), _extract_audio(), _force_style_for(), _format_timestamp_srt(), _format_timestamp_vtt(), generate_subtitles(), _is_video() (+10 more)
+
+ ### Community 17 - "Community 17"
+ Cohesion: 0.22
+ Nodes (12): download_result(), _is_noise(), main(), Batch translate Instagram reels to English via the VideoVoice server API. Usage, Extract the Instagram reel shortcode from a URL, e.g. 'DWn_yPoDsYw'., Submit a single video URL and return the job_id., Return True if a log line is internal noise we don't want in the log., Poll job status until complete or error. Returns final messages and collected lo (+4 more)
+
+ ### Community 18 - "Community 18"
+ Cohesion: 0.23
+ Nodes (12): evaluate(), load_baseline(), load_with_lora(), main(), pick_held_out_samples(), print_summary(), eval.py — Evaluate the fine-tuned LoRA against the un-tuned baseline. Picks N s, Return overshoot samples (duration_diff > 0.2) — these are NOT in the asymme (+4 more)
+
+ ### Community 19 - "Community 19"
  Cohesion: 0.24
  Nodes (11): extract_creator(), _extract_instagram(), _extract_tiktok(), _extract_youtube(), _load_cache(), Extract original creator @username from video URLs., YouTube: visit video page, extract channel name from meta tags., Extract the @username of the original creator from the video URL. Uses Play (+3 more)

+ ### Community 20 - "Community 20"
  Cohesion: 0.27
  Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline. Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)

+ ### Community 21 - "Community 21"
  Cohesion: 0.33
  Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var

+ ### Community 23 - "Community 23"
  Cohesion: 1.0
  Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)

+ ### Community 31 - "Community 31"
  Cohesion: 1.0
  Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.

+ ### Community 32 - "Community 32"
  Cohesion: 1.0
  Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te

+ ### Community 33 - "Community 33"
  Cohesion: 1.0
  Nodes (1): Voice clone speech using the Base model. You can provide either:

+ ### Community 34 - "Community 34"
  Cohesion: 1.0
  Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct

+ ### Community 35 - "Community 35"
  Cohesion: 1.0
  Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option

+ ### Community 36 - "Community 36"
  Cohesion: 1.0
  Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.

+ ### Community 37 - "Community 37"
  Cohesion: 1.0
  Nodes (1): Reject oversized uploads before body parsing.

+ ### Community 38 - "Community 38"
  Cohesion: 1.0
  Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job

+ ### Community 39 - "Community 39"
  Cohesion: 1.0
  Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.

+ ### Community 40 - "Community 40"
  Cohesion: 1.0
  Nodes (1): Return curated showcase entries with resolved streaming URLs.

+ ### Community 41 - "Community 41"
  Cohesion: 1.0
  Nodes (1): Submit a video for translation.

+ ### Community 42 - "Community 42"
  Cohesion: 1.0
  Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.

+ ### Community 43 - "Community 43"
  Cohesion: 1.0
  Nodes (1): User selects a TTS model after previewing.

+ ### Community 44 - "Community 44"
  Cohesion: 1.0
  Nodes (1): Serve a preview audio WAV file.

+ ### Community 45 - "Community 45"
  Cohesion: 1.0
  Nodes (1): Download the translated video.

+ ### Community 46 - "Community 46"
  Cohesion: 1.0
  Nodes (1): Create artifact directories and start background cleanup.

+ ### Community 47 - "Community 47"
  Cohesion: 1.0
  Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.

+ ### Community 48 - "Community 48"
  Cohesion: 1.0
  Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura

+ ### Community 49 - "Community 49"
  Cohesion: 1.0
  Nodes (1): Insert extra silence distributed across detected pause points.

+ ### Community 50 - "Community 50"
  Cohesion: 1.0
  Nodes (1): Generate a silent WAV file of given duration.

+ ### Community 51 - "Community 51"
  Cohesion: 1.0
  Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single

+ ### Community 52 - "Community 52"
  Cohesion: 1.0
  Nodes (1): Translate the text of each segment into target_language in batches. Args:

+ ### Community 53 - "Community 53"
  Cohesion: 1.0
  Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope. ZeroGPU only int

+ ### Community 54 - "Community 54"
  Cohesion: 1.0
  Nodes (1): Remove trailing noise/artifacts after speech ends.

+ ### Community 55 - "Community 55"
  Cohesion: 1.0
  Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.

+ ### Community 56 - "Community 56"
  Cohesion: 1.0
  Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.

+ ### Community 57 - "Community 57"
  Cohesion: 1.0
  Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.

+ ### Community 58 - "Community 58"
  Cohesion: 1.0
  Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated

+ ### Community 59 - "Community 59"
  Cohesion: 1.0
  Nodes (1): Generate speech for all segments using OmniVoice voice cloning.

+ ### Community 60 - "Community 60"
  Cohesion: 1.0
  Nodes (1): Synthesise translated text for each segment using voice cloned from reference au

+ ### Community 61 - "Community 61"
  Cohesion: 1.0
  Nodes (1): torch==2.6.0

+ ### Community 62 - "Community 62"
  Cohesion: 1.0
  Nodes (1): fastapi

+ ### Community 63 - "Community 63"
  Cohesion: 1.0
  Nodes (1): yt-dlp

+ ### Community 64 - "Community 64"
  Cohesion: 1.0
  Nodes (1): diffusers==0.29.0

+ ### Community 65 - "Community 65"
  Cohesion: 1.0
  Nodes (1): ARTIFACTS_ROOT env

+ ### Community 66 - "Community 66"
  Cohesion: 1.0
  Nodes (1): AWS g4dn.xlarge alternative

+ ### Community 67 - "Community 67"
  Cohesion: 1.0
  Nodes (1): nodejs (system pkg)

+ ### Community 68 - "Community 68"
  Cohesion: 1.0
  Nodes (1): fonts-noto-core / cjk

+ ### Community 69 - "Community 69"
  Cohesion: 1.0
  Nodes (1): graphify project rules

  ## Knowledge Gaps
+ - **321 isolated node(s):** `server.py — FastAPI backend for VideoVoice. Endpoints: POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.`, `Read media duration from ffprobe.`, `Report CUDA/MPS availability.` (+316 more)
  These have ≤1 connection - possible missing edges or undocumented components.
+ - **Thin community `Community 23`** (2 nodes): `gradio==6.8.0`, `gradio==6.12.0 (omni)`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 31`** (1 nodes): `Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 32`** (1 nodes): `Build voice-clone prompt items from reference audio (and optionally reference te`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 33`** (1 nodes): `Voice clone speech using the Base model. You can provide either:`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 34`** (1 nodes): `Generate speech with the VoiceDesign model using natural-language style instruct`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 35`** (1 nodes): `Generate speech with the CustomVoice model using a predefined speaker id, option`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 36`** (1 nodes): `Delete stale per-job artifact directories from ARTIFACTS_ROOT.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 37`** (1 nodes): `Reject oversized uploads before body parsing.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 38`** (1 nodes): `Run the translation pipeline in a background thread, pushing progress to the job`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 39`** (1 nodes): `List whitelisted MP4 demo videos from outputs/ and data/.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 40`** (1 nodes): `Return curated showcase entries with resolved streaming URLs.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 41`** (1 nodes): `Submit a video for translation.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 42`** (1 nodes): `Poll endpoint returning new messages since index `after`, plus live wait status.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 43`** (1 nodes): `User selects a TTS model after previewing.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 44`** (1 nodes): `Serve a preview audio WAV file.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 45`** (1 nodes): `Download the translated video.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 46`** (1 nodes): `Create artifact directories and start background cleanup.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 47`** (1 nodes): `Sync TTS audio using pause-aware strategy: compress silences first, then atempo.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 48`** (1 nodes): `Rewrite WAV with silence regions compressed to keep_ratio of their original dura`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 49`** (1 nodes): `Insert extra silence distributed across detected pause points.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 50`** (1 nodes): `Generate a silent WAV file of given duration.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 51`** (1 nodes): `Sync each TTS segment to its original timestamp window and stitch into a single`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 52`** (1 nodes): `Translate the text of each segment into target_language in batches. Args:`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 53`** (1 nodes): `Load + run Chatterbox inside a single GPU-decorated scope. ZeroGPU only int`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 54`** (1 nodes): `Remove trailing noise/artifacts after speech ends.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 55`** (1 nodes): `Hard-trim TTS output to orig_dur * headroom, with a short fade-out.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 56`** (1 nodes): `Clip audio to max_sec to prevent excessively slow voice cloning.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 57`** (1 nodes): `Numpy variant of _trim_trailing_noise for engines returning np.ndarray.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 58`** (1 nodes): `Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 59`** (1 nodes): `Generate speech for all segments using OmniVoice voice cloning.`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 60`** (1 nodes): `Synthesise translated text for each segment using voice cloned from reference au`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 61`** (1 nodes): `torch==2.6.0`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 62`** (1 nodes): `fastapi`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 63`** (1 nodes): `yt-dlp`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 64`** (1 nodes): `diffusers==0.29.0`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 65`** (1 nodes): `ARTIFACTS_ROOT env`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 66`** (1 nodes): `AWS g4dn.xlarge alternative`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 67`** (1 nodes): `nodejs (system pkg)`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 68`** (1 nodes): `fonts-noto-core / cjk`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+ - **Thin community `Community 69`** (1 nodes): `graphify project rules`
  Too small to be a meaningful cluster - may be noise or needs more connections extracted.

  ## Suggested Questions
  _Questions this graph is uniquely positioned to answer:_

+ - **Why does `synthesise_segments()` connect `Community 4` to `Community 11`, `Community 3`?**
+ _High betweenness centrality (0.324) - this node is a cross-community bridge._
+ - **Why does `generate()` connect `Community 4` to `Community 0`, `Community 6`?**
+ _High betweenness centrality (0.209) - this node is a cross-community bridge._
  - **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
  _`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
  - **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**

  - **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
  _`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
  - **What connects `server.py — FastAPI backend for VideoVoice. Endpoints: POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
+ _321 weakly-connected nodes found - possible documentation gaps or missing edges._
graphify-out/graph.html CHANGED
The diff for this file is too large to render. See raw diff
 
server.py CHANGED
@@ -75,7 +75,7 @@ ALLOWED_YTDLP_HOSTS = {
      "tiktok.com",
      "vm.tiktok.com",
  }
- PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp"}
+ PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp", "tools"}
  REAPER_INTERVAL_SECONDS = 10 * 60
  REAPER_MAX_AGE_SECONDS = 2 * 60 * 60

@@ -913,6 +913,10 @@ if __name__ == "__main__":

      local_app.include_router(router)

+     # Tools API — independent of pipeline; safe to include here too.
+     from tools_api import router as tools_router
+     local_app.include_router(tools_router)
+
      # Serve the legacy static frontend at / so `python server.py` keeps the
      # old dev UX (open http://localhost:8000 to hit frontend/index.html).
      # The React SPA in production is deployed separately to S3.
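Review note on the `PERSISTENT_ARTIFACT_DIRS` change: the diff does not show how `server.py` consumes this set, so the following is only a sketch under the assumption that the background reaper skips any top-level directory named in the set and removes other directories older than `REAPER_MAX_AGE_SECONDS`. The function `reap_stale_dirs` and the directory name `job-123` are hypothetical and exist only for illustration.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path
from typing import List, Optional, Tuple

# Values copied from the server.py hunk above.
PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp", "tools"}
REAPER_MAX_AGE_SECONDS = 2 * 60 * 60


def reap_stale_dirs(root: Path, now: Optional[float] = None) -> List[str]:
    """Hypothetical reaper: delete stale top-level dirs, skipping persistent ones."""
    now = time.time() if now is None else now
    removed = []
    for entry in sorted(root.iterdir()):
        # Assumption: names in PERSISTENT_ARTIFACT_DIRS are never deleted.
        if not entry.is_dir() or entry.name in PERSISTENT_ARTIFACT_DIRS:
            continue
        if now - entry.stat().st_mtime > REAPER_MAX_AGE_SECONDS:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed


def demo() -> Tuple[List[str], List[str]]:
    """Create one persistent and one per-job dir, age both, then reap."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        old = time.time() - REAPER_MAX_AGE_SECONDS - 60
        for name in ("tools", "job-123"):
            (root / name).mkdir()
            os.utime(root / name, (old, old))
        removed = reap_stale_dirs(root)
        survivors = sorted(p.name for p in root.iterdir())
    return removed, survivors
```

If the real reaper works this way, adding `"tools"` keeps the `tools/` parent directory alive while per-run cleanup is left to `tools_api` itself.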
tools_api/__init__.py ADDED
@@ -0,0 +1,17 @@
+ """
+ tools_api — Standalone endpoints for creator quick tools.
+
+ Lives alongside the main pipeline (server.py) but stays decoupled:
+   - No shared job state, no SSE, no GPU semaphore.
+   - Reuses step modules as libraries only (no edits to steps/).
+   - Artifacts written under ARTIFACTS_ROOT/tools/<run_id>/.
+
+ Endpoints (mounted by router.router):
+   POST /api/tools/subtitles — captions (sidecar or burn-in MP4)
+   POST /api/tools/voice-clone — single-segment TTS with voice clone
+   POST /api/tools/audio-cleanup — Demucs source separation
+   GET /api/tools/file/{run}/{f} — download generated artifact
+ """
+ from .router import router
+
+ __all__ = ["router"]
tools_api/audio_cleanup.py ADDED
@@ -0,0 +1,136 @@
+ """
+ Audio source separation tool — three modes via Demucs.
+
+ Reuses internals from steps.s1b_separate (model loader, device picker, normaliser,
+ GPU-decorated apply). The existing separate_audio() returns only (vocals, accompaniment),
+ so we replicate its flow here and keep all four stems addressable.
+ """
+ from __future__ import annotations
+
+ import subprocess
+ from pathlib import Path
+ from typing import Literal
+
+ import torch
+ import torchaudio
+
+ # Reuse internals — no edits to s1b_separate.py.
+ from steps.s1b_separate import (
+     _apply_demucs,
+     _get_model,
+     _load_and_normalise,
+     _select_device,
+ )
+
+ Mode = Literal["vocals-only", "instrumental-only", "stems"]
+
+
+ def _ensure_audio(input_path: Path, out_dir: Path) -> Path:
+     """Convert input to a stable WAV format if it's a video or non-WAV audio."""
+     if input_path.suffix.lower() == ".wav":
+         return input_path
+     out = out_dir / "input.wav"
+     cmd = [
+         "ffmpeg", "-y", "-i", str(input_path),
+         "-vn", "-ac", "2", "-ar", "44100",
+         "-acodec", "pcm_s16le",
+         str(out),
+     ]
+     result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
+     if result.returncode != 0:
+         raise RuntimeError(f"ffmpeg input prep failed: {result.stderr[-300:]}")
+     return out
+
+
+ def _separate_all_stems(audio_path: Path, out_dir: Path) -> dict[str, Path]:
+     """Return {stem_name: wav_path} for every demucs source."""
+     model = _get_model()
+     device = _select_device()
+     target_sr = model.samplerate
+     target_ch = model.audio_channels
+     source_names = list(model.sources)  # ["drums", "bass", "other", "vocals"]
+
+     mix, mean, std = _load_and_normalise(str(audio_path), target_sr, target_ch)
+     sources = _apply_demucs(mix, device)
+     sources = sources * std + mean
+     sources = sources[0]  # [num_sources, channels, time]
+
+     stems: dict[str, Path] = {}
+     for idx, name in enumerate(source_names):
+         wav_path = out_dir / f"{name}.wav"
+         torchaudio.save(str(wav_path), sources[idx], target_sr)
+         stems[name] = wav_path
+     return stems
+
+
+ def _sum_to_wav(stems: list[Path], dest: Path, sample_rate: int = 44100) -> Path:
+     """Sum N stem WAVs into one — used to build the instrumental track."""
+     mix: torch.Tensor | None = None
+     sr_used = sample_rate
+     for path in stems:
+         wav, sr = torchaudio.load(str(path))
+         sr_used = sr
+         mix = wav if mix is None else mix + wav
+     if mix is None:
+         raise RuntimeError("No stems to sum.")
+     torchaudio.save(str(dest), mix, sr_used)
+     return dest
+
+
+ def separate(
+     *,
+     input_path: Path,
+     out_dir: Path,
+     mode: Mode,
+ ) -> list[dict]:
+     """
+     Run separation. Returns a list of output descriptors:
+         [{"name": "vocals.wav", "label": "Vocals", "filename": "vocals.wav"}, ...]
+     """
+     audio_in = _ensure_audio(input_path, out_dir)
+     stems = _separate_all_stems(audio_in, out_dir)
+
+     if mode == "vocals-only":
+         return [{
+             "name": "vocals",
+             "label": "Vocals",
+             "filename": stems["vocals"].name,
+             "sub": "Dialogue track",
+         }]
+
+     if mode == "instrumental-only":
+         non_vocal_stems = [stems[n] for n in stems if n != "vocals"]
+         out = _sum_to_wav(non_vocal_stems, out_dir / "instrumental.wav")
+         # Cleanup intermediate stem files we won't expose
+         for path in stems.values():
+             try:
+                 path.unlink()
+             except OSError:
+                 pass
+         return [{
+             "name": "instrumental",
+             "label": "Instrumental",
+             "filename": out.name,
+             "sub": "Music + ambient (vocals removed)",
+         }]
+
+     # stems mode — return all four
+     label_map = {
+         "vocals": ("Vocals", "Dialogue track"),
+         "drums": ("Drums", "Percussion"),
+         "bass": ("Bass", "Low frequency"),
+         "other": ("Other", "Melodic / ambient"),
+     }
+     results: list[dict] = []
+     # Stable order: vocals first, then drums, bass, other
+     for stem_key in ("vocals", "drums", "bass", "other"):
+         if stem_key not in stems:
+             continue
+         label, sub = label_map[stem_key]
+         results.append({
+             "name": stem_key,
+             "label": label,
+             "filename": stems[stem_key].name,
+             "sub": sub,
+         })
+     return results
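Review note on `audio_cleanup.py`: two numeric steps in this file are easy to misread, the `sources * std + mean` line that undoes the z-normalisation, and the sample-wise sum that builds the instrumental track. The sketch below illustrates both on plain Python lists. It is not the project's code: `z_normalise`, `denormalise`, and `sum_stems` are hypothetical names, and how `_load_and_normalise` actually computes `mean` and `std` is not shown in this diff.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Tuple


def z_normalise(samples: List[float]) -> Tuple[List[float], float, float]:
    """Shift to zero mean and scale to unit standard deviation."""
    mean = fmean(samples)
    std = pstdev(samples) or 1.0  # guard against an all-constant signal
    return [(s - mean) / std for s in samples], mean, std


def denormalise(samples: List[float], mean: float, std: float) -> List[float]:
    """Inverse of z_normalise; mirrors `sources * std + mean` in the diff."""
    return [s * std + mean for s in samples]


def sum_stems(stems: Dict[str, List[float]]) -> List[float]:
    """Sample-wise sum of equal-length stems, as _sum_to_wav does with tensors."""
    if len({len(s) for s in stems.values()}) != 1:
        raise ValueError("stems must all have the same length")
    return [sum(values) for values in zip(*stems.values())]


# Round trip: normalise, then restore the original scale.
mix = [0.0, 0.5, -0.5, 1.0]
normed, mean, std = z_normalise(mix)
restored = denormalise(normed, mean, std)

# Instrumental = every stem except vocals, summed.
stems = {
    "drums": [0.1, 0.2],
    "bass": [0.3, 0.1],
    "other": [0.0, -0.1],
    "vocals": [0.5, 0.5],
}
instrumental = sum_stems({k: v for k, v in stems.items() if k != "vocals"})
```

One thing worth checking in the real code: a plain sum of three stems can exceed the [-1, 1] range, so whether the saved `instrumental.wav` clips depends on the material.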
tools_api/router.py ADDED
@@ -0,0 +1,248 @@
+ """
+ APIRouter for /api/tools/* endpoints.
+
+ Each endpoint is sync request-response (no SSE, no job state). Input files
+ land in a fresh per-run directory, outputs are returned as a download URL
+ to GET /api/tools/file/{run_id}/{filename}.
+ """
+ from __future__ import annotations
+
+ import asyncio
+ from pathlib import Path
+ from typing import Optional
+
+ from fastapi import APIRouter, File, Form, HTTPException, Request, UploadFile
+ from fastapi.responses import FileResponse, JSONResponse, PlainTextResponse
+
+ from server import limiter, _download_url, _is_allowed_video_host
+
+ from . import audio_cleanup, subtitles, voice_clone
+ from .storage import (
+     file_url,
+     new_run_dir,
+     reap_old_runs,
+     run_dir,
+     safe_filename,
+ )
+
+ router = APIRouter(prefix="/api/tools", tags=["tools"])
+
+ # Per-tool body size cap (separate from pipeline's MAX_UPLOAD_BYTES check).
+ TOOLS_MAX_BYTES = 50 * 1024 * 1024  # 50 MB
+
+
+ # ── Helpers ──────────────────────────────────────────────────────────
+
+ async def _save_upload(file: UploadFile, dest_dir: Path, default_name: str) -> Path:
+     """Stream upload to disk, enforcing the tools size cap."""
+     dest = dest_dir / safe_filename(file.filename, default_name)
+     written = 0
+     with open(dest, "wb") as fh:
+         while chunk := await file.read(1024 * 1024):
+             written += len(chunk)
+             if written > TOOLS_MAX_BYTES:
+                 fh.close()
+                 dest.unlink(missing_ok=True)
+                 raise HTTPException(413, f"File too large (max {TOOLS_MAX_BYTES // (1024*1024)} MB).")
+             fh.write(chunk)
+     return dest
+
+
+ def _ext_to_media_type(filename: str) -> str:
+     ext = Path(filename).suffix.lower()
+     return {
+         ".mp4": "video/mp4",
+         ".mov": "video/quicktime",
+         ".webm": "video/webm",
+         ".mp3": "audio/mpeg",
+         ".wav": "audio/wav",
+         ".srt": "application/x-subrip",
+         ".vtt": "text/vtt",
+         ".txt": "text/plain",
+     }.get(ext, "application/octet-stream")
+
+
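Review note on `_save_upload`: the cap is enforced while streaming, so an oversized body is rejected after at most one extra chunk and the partial file is removed. A synchronous, framework-free sketch of the same pattern follows, assuming a file-like `stream` with a blocking `read()`. `save_capped` and `UploadTooLarge` are hypothetical names standing in for the async helper and the HTTP 413 in the diff.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO


class UploadTooLarge(Exception):
    """Stand-in for the HTTP 413 raised by the real endpoint."""


def save_capped(stream: BinaryIO, dest: Path, max_bytes: int,
                chunk_size: int = 1024 * 1024) -> int:
    """Copy stream to dest in chunks; abort and delete dest past max_bytes."""
    written = 0
    try:
        with open(dest, "wb") as fh:
            while chunk := stream.read(chunk_size):
                written += len(chunk)
                if written > max_bytes:
                    raise UploadTooLarge(f"max {max_bytes} bytes")
                fh.write(chunk)
    except UploadTooLarge:
        # The file handle is already closed here, so unlinking is safe
        # on platforms that refuse to delete open files.
        dest.unlink(missing_ok=True)
        raise
    return written
```

Compared with the diff, this variant moves the cleanup into an `except` clause instead of calling `fh.close()` inside the `with` block; the observable behaviour should be the same.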
+ # ── Subtitles ────────────────────────────────────────────────────────
+
+ @router.post("/subtitles")
+ @limiter.limit("10/hour")
+ async def subtitles_endpoint(
+     request: Request,
+     file: Optional[UploadFile] = File(None),
+     url: Optional[str] = Form(None),
+     source_lang: str = Form("Auto-detect"),
+     target_lang: str = Form("Same as source"),
+     fmt: str = Form("srt"),
+     style: str = Form("tiktok"),
+     position: str = Form("bottom"),
+     h_align: str = Form("center"),
+     font_size: Optional[int] = Form(None),
+     margin_v: Optional[int] = Form(None),
+ ):
+     if fmt not in ("srt", "vtt", "txt", "mp4"):
+         raise HTTPException(400, "fmt must be one of: srt, vtt, txt, mp4")
+     if style not in ("tiktok", "youtube", "minimal"):
+         raise HTTPException(400, "style must be one of: tiktok, youtube, minimal")
+     if position not in ("top", "middle", "bottom"):
+         raise HTTPException(400, "position must be one of: top, middle, bottom")
+     if h_align not in ("left", "center", "right"):
+         raise HTTPException(400, "h_align must be one of: left, center, right")
+
+     url = (url or "").strip()
+     if not file and not url:
+         raise HTTPException(400, "Provide either a file upload or a video URL.")
+     if file and url:
+         raise HTTPException(400, "Send a file OR a URL, not both.")
+
+     run_id, dest_dir = new_run_dir()
+     if file:
+         input_path = await _save_upload(file, dest_dir, "input.mp4")
+     else:
+         if not _is_allowed_video_host(url):
+             raise HTTPException(400, "URL host not supported. Use TikTok, YouTube, or Instagram.")
+         input_path = Path(dest_dir) / "input.mp4"
+         try:
+             await asyncio.to_thread(_download_url, url, str(input_path))
+         except Exception as e:  # noqa: BLE001
+             raise HTTPException(400, f"Couldn't fetch the video URL: {e}")
+
+     try:
+         # Heavy: transcribe + (optional) translate + (optional) ffmpeg burn-in.
+         # Run off the event loop so concurrent requests don't starve.
+         info = await asyncio.to_thread(
+             subtitles.generate_subtitles,
+             input_path=input_path,
+             out_dir=dest_dir,
+             source_lang_name=source_lang,
+             target_lang_name=target_lang,
+             fmt=fmt,  # type: ignore[arg-type]
+             style=style,  # type: ignore[arg-type]
+             position=position,  # type: ignore[arg-type]
+             h_align=h_align,  # type: ignore[arg-type]
+             font_size=font_size,
+             margin_v=margin_v,
+         )
+     except ValueError as e:
+         raise HTTPException(400, str(e))
+     except Exception as e:  # noqa: BLE001
+         raise HTTPException(500, f"Subtitle generation failed: {e}")
+
+     return JSONResponse({
+         "run_id": run_id,
+         "format": info["format"],
+         "filename": info["filename"],
+         "url": file_url(run_id, info["filename"]),
+         "segments": info["segments"],
+         "translated": info["translated"],
+     })
+
+
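Review note on `subtitles_endpoint`: the four choice checks and the file-or-URL check are pure input validation, so they can be exercised without FastAPI. The sketch below restates them as table-driven helpers. `ALLOWED`, `validate_choices`, and `pick_source` are hypothetical names, and `ValueError` stands in for the HTTP 400 responses in the diff; the allowed values and messages are copied from the endpoint.

```python
from typing import Dict, Optional, Tuple

# Allowed values copied from the endpoint's inline checks.
ALLOWED: Dict[str, Tuple[str, ...]] = {
    "fmt": ("srt", "vtt", "txt", "mp4"),
    "style": ("tiktok", "youtube", "minimal"),
    "position": ("top", "middle", "bottom"),
    "h_align": ("left", "center", "right"),
}


def validate_choices(**fields: str) -> None:
    """Raise ValueError for the first field whose value is not allowed."""
    for name, value in fields.items():
        allowed = ALLOWED[name]
        if value not in allowed:
            raise ValueError(f"{name} must be one of: {', '.join(allowed)}")


def pick_source(has_file: bool, url: Optional[str]) -> str:
    """Require exactly one input source and return which one was given."""
    url = (url or "").strip()
    if not has_file and not url:
        raise ValueError("Provide either a file upload or a video URL.")
    if has_file and url:
        raise ValueError("Send a file OR a URL, not both.")
    return "file" if has_file else "url"
```

A whitespace-only `url` counts as missing, matching the `(url or "").strip()` line in the endpoint.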
+ # ── Voice clone ──────────────────────────────────────────────────────
+
+ @router.post("/voice-clone")
+ @limiter.limit("10/hour")
+ async def voice_clone_endpoint(
+     request: Request,
+     sample: UploadFile = File(...),
+     text: str = Form(...),
+     language_id: str = Form("en"),
+ ):
+     text = (text or "").strip()
+     if not text:
+         raise HTTPException(400, "text is required")
+     if len(text) > 1000:
+         raise HTTPException(400, "text exceeds 1000 char limit")
+
+     run_id, dest_dir = new_run_dir()
+     sample_path = await _save_upload(sample, dest_dir, "sample.wav")
+
+     try:
+         info = await asyncio.to_thread(
+             voice_clone.clone_voice,
+             sample_path=sample_path,
+             text=text,
+             out_dir=dest_dir,
+             language_id=language_id,
+         )
+     except ValueError as e:
+         raise HTTPException(400, str(e))
+     except Exception as e:  # noqa: BLE001
+         raise HTTPException(500, f"Voice clone failed: {e}")
+
+     return JSONResponse({
+         "run_id": run_id,
+         "engine": info["engine"],
+         "chunks": info["chunks"],
+         "filename": info["filename"],
+         "url": file_url(run_id, info["filename"]),
+     })
+
+
181
+ # ── Audio cleanup ────────────────────────────────────────────────────
182
+
183
+ @router.post("/audio-cleanup")
184
+ @limiter.limit("10/hour")
185
+ async def audio_cleanup_endpoint(
186
+ request: Request,
187
+ file: UploadFile = File(...),
188
+ mode: str = Form("vocals-only"),
189
+ ):
190
+ if mode not in ("vocals-only", "instrumental-only", "stems"):
191
+ raise HTTPException(400, "mode must be one of: vocals-only, instrumental-only, stems")
192
+
193
+ run_id, dest_dir = new_run_dir()
194
+ input_path = await _save_upload(file, dest_dir, "input.wav")
195
+
196
+ try:
197
+ stems = await asyncio.to_thread(
198
+ audio_cleanup.separate,
199
+ input_path=input_path,
200
+ out_dir=dest_dir,
201
+ mode=mode, # type: ignore[arg-type]
202
+ )
203
+ except ValueError as e:
204
+ raise HTTPException(400, str(e))
205
+ except Exception as e: # noqa: BLE001
206
+ raise HTTPException(500, f"Audio separation failed: {e}")
207
+
208
+ return JSONResponse({
209
+ "run_id": run_id,
210
+ "mode": mode,
211
+ "stems": [
212
+ {**stem, "url": file_url(run_id, stem["filename"])}
213
+ for stem in stems
214
+ ],
215
+ })
216
+
217
+
218
+ # ── File download ────────────────────────────────────────────────────
219
+
220
+ @router.get("/file/{run_id}/{filename}")
221
+ async def tools_file(run_id: str, filename: str):
222
+ """Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS."""
223
+ safe_name = safe_filename(filename)
224
+ if safe_name != filename:
225
+ raise HTTPException(400, "Invalid filename")
226
+
227
+ base = run_dir(run_id)
228
+ if base is None:
229
+ raise HTTPException(404, "Run not found or expired")
230
+
231
+ target = base / safe_name
232
+ if not target.exists() or not target.is_file():
233
+ raise HTTPException(404, "File not found")
234
+
235
+ return FileResponse(
236
+ path=str(target),
237
+ media_type=_ext_to_media_type(safe_name),
238
+ filename=safe_name,
239
+ )
240
+
241
+
242
+ # ── Cleanup hook ─────────────────────────────────────────────────────
243
+
244
+ @router.post("/_internal/reap")
245
+ async def _reap():
246
+ """Manual reap trigger (mostly for testing). Auto-reap runs on a timer."""
247
+ removed = await asyncio.to_thread(reap_old_runs)
248
+ return {"removed": removed}
tools_api/storage.py ADDED
@@ -0,0 +1,73 @@
1
+ """
2
+ Per-run temp storage for tools_api.
3
+
4
+ Each tool request creates a fresh dir under ARTIFACTS_ROOT/tools/<run_id>/.
5
+ Files are reaped after TTL by reap_old_runs(). Kept independent of the main
6
+ job-tracker so a tool failure can't corrupt or block pipeline state.
7
+ """
8
+ from __future__ import annotations
9
+
10
+ import shutil
11
+ import time
12
+ import uuid
13
+ from pathlib import Path
14
+ from typing import Optional
15
+
16
+ # Import ARTIFACTS_ROOT from server.py. server.py imports torch/whisper/etc.
+ # at top level, but it was already loaded at app startup, so this import is
+ # just a cached module lookup and does not reload the heavy dependencies.
19
+ from server import ARTIFACTS_ROOT
20
+
21
+ TOOLS_ROOT = ARTIFACTS_ROOT / "tools"
22
+ TOOLS_ROOT.mkdir(parents=True, exist_ok=True)
23
+
24
+ # Tool runs are reaped 1h after creation (shorter than pipeline jobs since
25
+ # users typically download immediately).
26
+ RUN_TTL_SECONDS = 60 * 60
27
+
28
+
29
+ def new_run_dir() -> tuple[str, Path]:
30
+ """Allocate a fresh per-request directory. Returns (run_id, path)."""
31
+ run_id = uuid.uuid4().hex[:16]
32
+ path = TOOLS_ROOT / run_id
33
+ path.mkdir(parents=True, exist_ok=True)
34
+ return run_id, path
35
+
36
+
37
+ def run_dir(run_id: str) -> Optional[Path]:
38
+ """Resolve a run_id to its directory, or None if missing/invalid."""
39
+ if not run_id or "/" in run_id or ".." in run_id:
40
+ return None
41
+ candidate = TOOLS_ROOT / run_id
42
+ if not candidate.exists() or not candidate.is_dir():
43
+ return None
44
+ return candidate
45
+
46
+
47
+ def file_url(run_id: str, filename: str) -> str:
48
+ """Construct the public download URL for an artifact."""
49
+ return f"/api/tools/file/{run_id}/{filename}"
50
+
51
+
52
+ def safe_filename(name: str, fallback: str = "file") -> str:
53
+ """Reduce a user-supplied name to its final path component (no directories)."""
54
+ if not name:
55
+ return fallback
56
+ base = Path(name).name
57
+ return base or fallback
58
+
59
+
60
+ def reap_old_runs() -> int:
61
+ """Delete tool run dirs older than RUN_TTL_SECONDS. Returns count removed."""
62
+ if not TOOLS_ROOT.exists():
63
+ return 0
64
+ cutoff = time.time() - RUN_TTL_SECONDS
65
+ removed = 0
66
+ for child in TOOLS_ROOT.iterdir():
67
+ try:
68
+ if child.is_dir() and child.stat().st_mtime < cutoff:
69
+ shutil.rmtree(child, ignore_errors=True)
70
+ removed += 1
71
+ except OSError:
72
+ continue
73
+ return removed
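The TTL reaping in `reap_old_runs()` can be exercised in isolation against a temporary directory. The sketch below is a standalone approximation, not the module's code: `reap_older_than` and `demo` are hypothetical names, and the current time is passed in explicitly so the behaviour is easy to check.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path


def reap_older_than(root: Path, ttl_seconds: float, now: float) -> int:
    """Remove child directories of `root` whose mtime is older than the TTL."""
    cutoff = now - ttl_seconds
    removed = 0
    for child in root.iterdir():
        if child.is_dir() and child.stat().st_mtime < cutoff:
            shutil.rmtree(child, ignore_errors=True)
            removed += 1
    return removed


def demo() -> tuple[int, list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fresh = root / "fresh"
        stale = root / "stale"
        fresh.mkdir()
        stale.mkdir()
        now = time.time()
        # Backdate the stale directory by two hours (atime, mtime).
        os.utime(stale, (now - 7200, now - 7200))
        removed = reap_older_than(root, ttl_seconds=3600, now=now)
        return removed, sorted(p.name for p in root.iterdir())
```

One caveat worth keeping in mind: a directory's mtime changes when entries are added or removed, so on typical POSIX filesystems the TTL counts from the last file created in the run directory rather than from the directory's creation.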
tools_api/subtitles.py ADDED
@@ -0,0 +1,288 @@
1
+ """
2
+ Subtitle generation: sidecar files (.srt/.vtt/.txt) and burn-in MP4.
3
+
4
+ Reuses steps.s2_transcribe.transcribe and steps.s3_translate.translate as
5
+ libraries. ffmpeg burn-in goes through subprocess (matches existing s5_sync
6
+ pattern but without sharing code, since the styling needs are different).
7
+ """
8
+ from __future__ import annotations
9
+
10
+ import subprocess
11
+ from pathlib import Path
12
+ from typing import Literal
13
+
14
+ from steps.s2_transcribe import transcribe
15
+ from steps.s3_translate import translate
16
+
17
+ Format = Literal["srt", "vtt", "txt", "mp4"]
18
+ CaptionStyle = Literal["tiktok", "youtube", "minimal"]
19
+ Position = Literal["top", "middle", "bottom"]
20
+ HAlign = Literal["left", "center", "right"]
21
+
22
+ # Bounds for user-adjustable knobs. Backend clamps to these regardless of
23
+ # what the client sends.
24
+ FONT_SIZE_MIN = 12
25
+ FONT_SIZE_MAX = 40
26
+ MARGIN_V_MIN = 0
27
+ MARGIN_V_MAX = 240
28
+
29
+ # ISO-style short codes Whisper accepts. Names map to UI dropdown labels.
30
+ _LANG_CODE = {
31
+ "Auto-detect": "auto",
32
+ "English": "en", "Spanish": "es", "French": "fr", "German": "de",
33
+ "Portuguese": "pt", "Italian": "it", "Hindi": "hi", "Arabic": "ar",
34
+ "Chinese": "zh", "Japanese": "ja", "Korean": "ko", "Russian": "ru",
35
+ }
36
+
37
+
38
+ def _is_video(path: Path) -> bool:
39
+ return path.suffix.lower() in {".mp4", ".mov", ".webm", ".mkv", ".avi", ".m4v"}
40
+
41
+
42
+ def _extract_audio(input_path: Path, out_dir: Path) -> Path:
43
+ """Pull a 16kHz mono WAV from the input — what whisper expects."""
44
+ audio_path = out_dir / "audio.wav"
45
+ cmd = [
46
+ "ffmpeg", "-y", "-i", str(input_path),
47
+ "-vn", "-ac", "1", "-ar", "16000",
48
+ "-acodec", "pcm_s16le",
49
+ str(audio_path),
50
+ ]
51
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
52
+ if result.returncode != 0:
53
+ raise RuntimeError(f"ffmpeg audio extract failed: {result.stderr[-300:]}")
54
+ return audio_path
55
+
56
+
57
+ def _resolve_lang(name: str) -> str:
58
+ return _LANG_CODE.get(name, "auto")
59
+
60
+
61
+ # ── Caption format writers ─────────────────────────────────────────────
62
+
63
+ def _seg_text(seg: dict, prefer_translation: bool) -> str:
64
+ if prefer_translation:
65
+ return (seg.get("translated_text") or seg.get("text") or "").strip()
66
+ return (seg.get("text") or "").strip()
67
+
68
+
69
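+ def _format_timestamp_srt(t: float) -> str:
+     # Round once to whole milliseconds so a fraction such as 0.9996 s carries
+     # into the seconds field instead of rendering as ",1000".
+     total_ms = int(round(t * 1000))
+     h, rem = divmod(total_ms, 3_600_000)
+     m, rem = divmod(rem, 60_000)
+     s, ms = divmod(rem, 1000)
+     return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"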
75
+
76
+
77
+ def _format_timestamp_vtt(t: float) -> str:
78
+ return _format_timestamp_srt(t).replace(",", ".")
79
+
80
+
81
+ def write_srt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
+     lines = []
+     index = 0  # keep cue numbers sequential even when empty segments are skipped
+     for seg in segments:
+         text = _seg_text(seg, prefer_translation)
+         if not text:
+             continue
+         index += 1
+         lines.append(str(index))
+         lines.append(f"{_format_timestamp_srt(seg['start'])} --> {_format_timestamp_srt(seg['end'])}")
+         lines.append(text)
+         lines.append("")
+     dest.write_text("\n".join(lines), encoding="utf-8")
+     return dest
93
+
94
+
95
+ def write_vtt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
96
+ lines = ["WEBVTT", ""]
97
+ for seg in segments:
98
+ text = _seg_text(seg, prefer_translation)
99
+ if not text:
100
+ continue
101
+ lines.append(f"{_format_timestamp_vtt(seg['start'])} --> {_format_timestamp_vtt(seg['end'])}")
102
+ lines.append(text)
103
+ lines.append("")
104
+ dest.write_text("\n".join(lines), encoding="utf-8")
105
+ return dest
106
+
107
+
108
+ def write_txt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
109
+ text = " ".join(_seg_text(s, prefer_translation) for s in segments if _seg_text(s, prefer_translation))
110
+ dest.write_text(text, encoding="utf-8")
111
+ return dest
112
+
113
+
114
+ # ── Burn-in styling ────────────────────────────────────────────────────
115
+
116
+ # ASS-format alignment codes (libass), arranged as row + column:
117
+ # row: bottom=0, middle=3, top=6
118
+ # col: left=1, center=2, right=3
119
+ # So bottom-left=1, bottom-center=2, ..., top-right=9.
120
+ _POSITION_ROW = {"bottom": 0, "middle": 3, "top": 6}
121
+ _HALIGN_COL = {"left": 1, "center": 2, "right": 3}
122
+ _DEFAULT_MARGIN_V = {"bottom": 60, "middle": 0, "top": 60}
123
+
124
+ # Per-style baseline — font size, stroke/shadow choices. The user can override
125
+ # the font size via the slider; everything else stays tied to the style preset.
126
+ _STYLE_DEFAULTS: dict[CaptionStyle, dict] = {
127
+ "tiktok": {"font_size": 22, "bold": 1, "border_style": 1, "outline": 3, "shadow": 1},
128
+ "youtube": {"font_size": 18, "bold": 0, "border_style": 4, "outline": 8, "shadow": 0},
129
+ "minimal": {"font_size": 16, "bold": 0, "border_style": 1, "outline": 1, "shadow": 0},
130
+ }
131
+
132
+
133
+ def _clamp(value: int, lo: int, hi: int) -> int:
134
+ return max(lo, min(hi, value))
135
+
136
+
137
+ def _force_style_for(
138
+ style: CaptionStyle,
139
+ position: Position,
140
+ h_align: HAlign = "center",
141
+ font_size: int | None = None,
142
+ margin_v: int | None = None,
143
+ ) -> str:
144
+ """Return an ffmpeg `subtitles=...:force_style='...'` string.
145
+
146
+ Args:
147
+ style: Visual preset — sets weight, stroke, shadow defaults.
148
+ position: top / middle / bottom row.
149
+ h_align: left / center / right column.
150
+ font_size: Override the style's default font size (clamped to FONT_SIZE_MIN..MAX).
151
+ margin_v: Override vertical margin in pixels (clamped to MARGIN_V_MIN..MAX).
152
+ """
153
+ defaults = _STYLE_DEFAULTS[style]
154
+ fs = _clamp(font_size if font_size is not None else defaults["font_size"],
155
+ FONT_SIZE_MIN, FONT_SIZE_MAX)
156
+ mv = _clamp(margin_v if margin_v is not None else _DEFAULT_MARGIN_V[position],
157
+ MARGIN_V_MIN, MARGIN_V_MAX)
158
+ align = _POSITION_ROW[position] + _HALIGN_COL[h_align]
159
+
160
+ parts = [
161
+ "FontName=Arial",
162
+ f"FontSize={fs}",
163
+ f"Bold={defaults['bold']}",
164
+ "PrimaryColour=&H00FFFFFF",
165
+ ]
166
+ if style == "youtube":
167
+ # White on translucent black box
168
+ parts.append("BackColour=&HB8000000")
169
+ elif style == "minimal":
170
+ # Subtle semi-transparent stroke instead of hard black
171
+ parts.append("OutlineColour=&H80000000")
172
+ else: # tiktok — hard black stroke
173
+ parts.append("OutlineColour=&H00000000")
174
+ parts += [
175
+ f"BorderStyle={defaults['border_style']}",
176
+ f"Outline={defaults['outline']}",
177
+ f"Shadow={defaults['shadow']}",
178
+ f"Alignment={align}",
179
+ f"MarginV={mv}",
180
+ # Symmetric horizontal margins so left/right alignment has breathing room
181
+ "MarginL=40",
182
+ "MarginR=40",
183
+ ]
184
+ return ",".join(parts)
185
+
186
+
187
+ def _burn_in(
188
+ video_path: Path,
189
+ srt_path: Path,
190
+ dest: Path,
191
+ style: CaptionStyle,
192
+ position: Position,
193
+ h_align: HAlign = "center",
194
+ font_size: int | None = None,
195
+ margin_v: int | None = None,
196
+ ) -> Path:
197
+ """Render captions into the video pixels via ffmpeg + libass."""
198
+ force_style = _force_style_for(style, position, h_align, font_size, margin_v)
199
+ # Escape the path for the ffmpeg subtitles filter: the path is wrapped in
+ # single quotes, and any single quotes or colons inside it are backslash-escaped
+ # with the intent of keeping them from ending the quoting or splitting options.
201
+ srt_str = str(srt_path).replace("'", r"\'").replace(":", r"\:")
202
+ vf = f"subtitles='{srt_str}':force_style='{force_style}'"
203
+ cmd = [
204
+ "ffmpeg", "-y",
205
+ "-i", str(video_path),
206
+ "-vf", vf,
207
+ "-c:a", "copy",
208
+ "-c:v", "libx264",
209
+ "-preset", "veryfast",
210
+ "-crf", "22",
211
+ str(dest),
212
+ ]
213
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
214
+ if result.returncode != 0:
215
+ raise RuntimeError(f"ffmpeg burn-in failed: {result.stderr[-300:]}")
216
+ return dest
217
+
218
+
219
+ # ── Public entry point ────────────────────────────────────────────────
220
+
221
+ def generate_subtitles(
222
+ *,
223
+ input_path: Path,
224
+ out_dir: Path,
225
+ source_lang_name: str,
226
+ target_lang_name: str,
227
+ fmt: Format,
228
+ style: CaptionStyle = "tiktok",
229
+ position: Position = "bottom",
230
+ h_align: HAlign = "center",
231
+ font_size: int | None = None,
232
+ margin_v: int | None = None,
233
+ ) -> dict:
234
+ """
235
+ Run the full subtitle pipeline. Returns:
236
+ {
237
+ "format": "srt" | "vtt" | "txt" | "mp4",
238
+ "filename": <name in out_dir>,
239
+ "segments": <int>,
240
+ "translated": <bool>,
241
+ }
242
+ """
243
+ is_burn = fmt == "mp4"
244
+ if is_burn and not _is_video(input_path):
245
+ raise ValueError("Burn-in requires a video file.")
246
+
247
+ # 1. Extract audio (or use as-is)
248
+ if _is_video(input_path):
249
+ audio_path = _extract_audio(input_path, out_dir)
250
+ else:
251
+ audio_path = input_path
252
+
253
+ # 2. Transcribe
254
+ src_code = _resolve_lang(source_lang_name)
255
+ segments = transcribe(str(audio_path), language=src_code)
256
+ if not segments:
257
+ raise RuntimeError("Transcription produced no segments.")
258
+
259
+ # 3. Translate if requested
260
+ translated = False
261
+ same_as_source = (
262
+ target_lang_name == "Same as source"
263
+ or target_lang_name.lower() == source_lang_name.lower()
264
+ )
265
+ if not same_as_source:
266
+ segments = translate(segments, target_lang_name)
267
+ translated = True
268
+
269
+ # 4. Emit
270
+ if fmt == "srt":
271
+ out = write_srt(segments, out_dir / "captions.srt", translated)
272
+ elif fmt == "vtt":
273
+ out = write_vtt(segments, out_dir / "captions.vtt", translated)
274
+ elif fmt == "txt":
275
+ out = write_txt(segments, out_dir / "transcript.txt", translated)
276
+ else: # mp4
277
+ srt_path = write_srt(segments, out_dir / "captions.srt", translated)
278
+ out = _burn_in(
279
+ input_path, srt_path, out_dir / "captioned.mp4",
280
+ style, position, h_align, font_size, margin_v,
281
+ )
282
+
283
+ return {
284
+ "format": fmt,
285
+ "filename": out.name,
286
+ "segments": len(segments),
287
+ "translated": translated,
288
+ }
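The sidecar writers in `tools_api/subtitles.py` reduce to two steps: format each boundary as `HH:MM:SS,mmm`, then emit numbered cue blocks. The sketch below shows one way to do both. It is a self-contained illustration with hypothetical names (`srt_timestamp`, `render_srt`), not the module's code. It assumes each segment is a dict with `start`, `end`, and `text` keys, rounds to whole milliseconds once before splitting into fields, and numbers cues by how many have been emitted.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, rounding once to whole milliseconds."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def render_srt(segments: list[dict]) -> str:
    """Render non-empty segments as SRT cue blocks with sequential numbering."""
    blocks = []
    for seg in segments:
        text = (seg.get("text") or "").strip()
        if not text:
            continue  # skipped segments do not consume a cue number
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{len(blocks) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

A WebVTT variant would prepend a `WEBVTT` header line, drop the cue numbers, and use `.` instead of `,` as the millisecond separator.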
tools_api/voice_clone.py ADDED
@@ -0,0 +1,241 @@
1
+ """
2
+ Voice clone playground — single-engine TTS from a sample + text input.
3
+
4
+ This Space runs only ONE engine (s4_tts enforces TTS_ENGINE match), so the
5
+ endpoint accepts no engine parameter. The frontend is responsible for fanning
6
+ out to multiple Spaces when the user wants comparison output.
7
+
8
+ Long text is split into ~200-char chunks at sentence/word boundaries and
9
+ synthesised as multiple segments, then concatenated into one MP3.
10
+ """
11
+ from __future__ import annotations
12
+
13
+ import os
14
+ import re
15
+ import subprocess
16
+ from pathlib import Path
17
+
18
+ from steps.s4_tts import synthesise_segments
19
+
20
+ _AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".aac"}
21
+
22
+
23
+ def _prepare_sample(sample_path: Path, out_dir: Path) -> Path:
24
+ """Convert any uploaded sample (audio or video) to a clean 24kHz mono WAV.
25
+
26
+ TTS internals (s4_tts) call torchaudio.load via libsndfile, which only
27
+ understands WAV/FLAC. Anything else — including MP4 video, MP3, M4A —
28
+ has to be re-encoded first. We do this here so callers don't need to.
29
+ """
30
+ out = out_dir / "sample_prepared.wav"
31
+ cmd = [
32
+ "ffmpeg", "-y", "-i", str(sample_path),
33
+ "-vn", # drop video stream if present
34
+ "-ac", "1", # mono
35
+ "-ar", "24000", # 24kHz — sweet spot for the TTS engines
36
+ "-acodec", "pcm_s16le",
37
+ str(out),
38
+ ]
39
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
40
+ if result.returncode != 0:
41
+ raise ValueError(
42
+ "Couldn't read the uploaded sample. Use a clean audio file "
43
+ "(WAV, MP3, M4A) or a video with an audio track."
44
+ )
45
+ return out
46
+
47
+
48
+ def _isolate_vocals(prepared_sample: Path, out_dir: Path) -> Path:
49
+ """Run Demucs source separation on the prepared sample and return a
50
+ vocals-only WAV (24kHz mono) suitable for TTS reference.
51
+
52
+ Mirrors what the dub pipeline (steps.s1b_separate) does so cloned voice
53
+ doesn't pick up music / ambient noise from the uploaded sample. Falls back
54
+ to the raw prepared sample if separation fails (model missing, OOM, etc.)
55
+ rather than failing the whole clone request.
56
+ """
57
+ try:
58
+ from steps.s1b_separate import separate_audio
59
+ except ImportError as e:
60
+ print(f"[voice_clone] Demucs unavailable, skipping vocal isolation: {e}")
61
+ return prepared_sample
62
+
63
+ separate_dir = out_dir / "separate"
64
+ separate_dir.mkdir(parents=True, exist_ok=True)
65
+
66
+ try:
67
+ vocals_16k_path, _accompaniment = separate_audio(str(prepared_sample), str(separate_dir))
68
+ except Exception as e:
69
+ print(f"[voice_clone] Demucs separation failed, using raw sample: {e}")
70
+ return prepared_sample
71
+
72
+ # Resample vocals from 16 kHz mono → 24 kHz mono for the TTS engines
73
+ vocals_24k = out_dir / "vocals_24k.wav"
74
+ cmd = [
75
+ "ffmpeg", "-y", "-i", vocals_16k_path,
76
+ "-ac", "1", "-ar", "24000",
77
+ "-acodec", "pcm_s16le",
78
+ str(vocals_24k),
79
+ ]
80
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
81
+ if result.returncode != 0:
82
+ print(f"[voice_clone] Vocals resample failed, using 16kHz: {result.stderr[-200:]}")
83
+ return Path(vocals_16k_path)
84
+
85
+ return vocals_24k
86
+
87
+ CHUNK_TARGET_CHARS = 200
88
+ CHUNK_HARD_MAX = 280 # under chatterbox's 300-char per-segment ceiling
89
+
90
+
91
+ def _split_text(text: str) -> list[str]:
92
+ """Split into chunks of ~CHUNK_TARGET_CHARS at sentence then word boundaries."""
93
+ text = text.strip()
94
+ if not text:
95
+ return []
96
+ if len(text) <= CHUNK_HARD_MAX:
97
+ return [text]
98
+
99
+ # First pass: sentence boundaries
100
+ sentences = re.split(r"(?<=[.!?])\s+", text)
101
+ chunks: list[str] = []
102
+ current = ""
103
+ for sent in sentences:
104
+ if not sent.strip():
105
+ continue
106
+ if len(current) + 1 + len(sent) <= CHUNK_TARGET_CHARS:
107
+ current = f"{current} {sent}".strip() if current else sent
108
+ else:
109
+ if current:
110
+ chunks.append(current)
111
+ # Sentence itself may exceed target — break it on words
112
+ if len(sent) > CHUNK_HARD_MAX:
113
+ words = sent.split()
114
+ buf = ""
115
+ for w in words:
116
+ if len(buf) + 1 + len(w) > CHUNK_HARD_MAX:
117
+ if buf:
118
+ chunks.append(buf)
119
+ buf = w
120
+ else:
121
+ buf = f"{buf} {w}".strip() if buf else w
122
+ if buf:
123
+ current = buf
124
+ else:
125
+ current = ""
126
+ else:
127
+ current = sent
128
+ if current:
129
+ chunks.append(current)
130
+ return chunks
131
+
132
+
133
+ def _build_segments(chunks: list[str], chunk_secs: float = 8.0) -> list[dict]:
134
+ """Construct segment dicts for synthesise_segments — fake timing windows."""
135
+ segs = []
136
+ cursor = 0.0
137
+ for text in chunks:
138
+ # Allocate a generous window so _trim_to_duration doesn't clip output.
139
+ # Headroom is 1.4× so 8s window allows up to ~11s of audio per chunk.
140
+ segs.append({
141
+ "start": cursor,
142
+ "end": cursor + chunk_secs,
143
+ "text": text,
144
+ "translated_text": text,
145
+ "tts_text": text,
146
+ })
147
+ cursor += chunk_secs
148
+ return segs
149
+
150
+
151
+ def _concat_wavs_to_mp3(wav_paths: list[Path], dest: Path) -> Path:
152
+ """Concat in order via ffmpeg concat demuxer, then encode MP3."""
153
+ if not wav_paths:
154
+ raise RuntimeError("No TTS chunks to concatenate.")
155
+
156
+ if len(wav_paths) == 1:
157
+ cmd = [
158
+ "ffmpeg", "-y", "-i", str(wav_paths[0]),
159
+ "-codec:a", "libmp3lame", "-b:a", "192k",
160
+ str(dest),
161
+ ]
162
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
163
+ if result.returncode != 0:
164
+ raise RuntimeError(f"ffmpeg encode failed: {result.stderr[-300:]}")
165
+ return dest
166
+
167
+ list_file = dest.with_suffix(".txt")
168
+ list_file.write_text(
169
+ "\n".join(f"file '{p.as_posix()}'" for p in wav_paths),
170
+ encoding="utf-8",
171
+ )
172
+ cmd = [
173
+ "ffmpeg", "-y",
174
+ "-f", "concat", "-safe", "0",
175
+ "-i", str(list_file),
176
+ "-codec:a", "libmp3lame", "-b:a", "192k",
177
+ str(dest),
178
+ ]
179
+ result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
180
+ list_file.unlink(missing_ok=True)
181
+ if result.returncode != 0:
182
+ raise RuntimeError(f"ffmpeg concat failed: {result.stderr[-300:]}")
183
+ return dest
184
+
185
+
186
+ def clone_voice(
187
+ *,
188
+ sample_path: Path,
189
+ text: str,
190
+ out_dir: Path,
191
+ language_id: str = "en",
192
+ ) -> dict:
193
+ """
194
+ Run TTS on `text` using the voice from `sample_path`. Returns:
195
+ {
196
+ "filename": "voice.mp3",
197
+ "engine": <current TTS_ENGINE>,
198
+ "chunks": <int>,
199
+ }
200
+ """
201
+ text = (text or "").strip()
202
+ if not text:
203
+ raise ValueError("Text is required.")
204
+
205
+ chunks = _split_text(text)
206
+ segments = _build_segments(chunks)
207
+
208
+ # Normalise the sample (handles video, mp3, m4a, etc.) → 24kHz mono WAV
209
+ prepared_sample = _prepare_sample(sample_path, out_dir)
210
+
211
+ # Demucs source separation → isolate vocals so the clone doesn't pick up
212
+ # background music or ambient noise. Same step the dub pipeline uses.
213
+ reference_for_tts = _isolate_vocals(prepared_sample, out_dir)
214
+
215
+ seg_out_dir = out_dir / "tts"
216
+ seg_out_dir.mkdir(parents=True, exist_ok=True)
217
+
218
+ tts_result = None
219
+ for msg in synthesise_segments(
220
+ segments=segments,
221
+ reference_audio_path=str(reference_for_tts),
222
+ language_id=language_id,
223
+ output_dir=str(seg_out_dir),
224
+ ):
225
+ if isinstance(msg, dict) and "__TTS_RESULT__" in msg:
226
+ tts_result = msg["__TTS_RESULT__"]
227
+
228
+ if not tts_result:
229
+ raise RuntimeError("TTS produced no output.")
230
+
231
+ wav_paths = [Path(seg["tts_path"]) for seg in tts_result if seg.get("tts_path")]
232
+ if not wav_paths:
233
+ raise RuntimeError("TTS result missing audio paths.")
234
+
235
+ mp3_path = _concat_wavs_to_mp3(wav_paths, out_dir / "voice.mp3")
236
+
237
+ return {
238
+ "filename": mp3_path.name,
239
+ "engine": os.getenv("TTS_ENGINE", "chatterbox").lower(),
240
+ "chunks": len(chunks),
241
+ }
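The sentence-packing pass inside `_split_text` is easier to follow on a small input. The sketch below is a simplified, standalone version under a hypothetical name (`pack_sentences`). It keeps only the greedy sentence packing and deliberately omits the word-level fallback for over-long sentences, so it is not a drop-in replacement.

```python
import re


def pack_sentences(text: str, target: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `target` characters.

    A single sentence longer than `target` is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        candidate = f"{current} {sent}" if current else sent
        if current and len(candidate) > target:
            # Adding this sentence would overflow: close the chunk and start anew.
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

For example, `pack_sentences("One. Two! Three?", target=9)` yields two chunks, because the third sentence would push the first chunk past nine characters.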