Spaces: Rafii/videovoice

github-actions[bot] committed 8 days ago

Commit 5b7cd5f

1 Parent(s): e5dfb46

deploy: switch to chatterbox requirements @ 4319730

Files changed (12)

CLAUDE.md +4 -0
app.py +2 -0
graphify-out/.graphify_root +1 -0
graphify-out/GRAPH_REPORT.md +152 -127
graphify-out/graph.html +0 -0
server.py +5 -1
tools_api/__init__.py +17 -0
tools_api/audio_cleanup.py +136 -0
tools_api/router.py +248 -0
tools_api/storage.py +73 -0
tools_api/subtitles.py +288 -0
tools_api/voice_clone.py +241 -0

CLAUDE.md CHANGED

@@ -1,3 +1,7 @@
+## Deployment
+HF Spaces deployment is fully automated via `.github/workflows/deploy-hf.yml`. Pushing to `origin/main` triggers the workflow which runs `./deploy.sh --force` and pushes to all three Spaces (Chatterbox, OmniVoice, Qwen3). Do not run `./deploy.sh` locally after a push — it is redundant. To verify a deploy, use `gh run list --workflow=deploy-hf.yml`.
 ## graphify
 This project has a graphify knowledge graph at graphify-out/.
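
The verification step named in the added Deployment note can be scripted. The following is a minimal sketch, assuming the `gh` CLI is installed and authenticated; the helper names (`build_command`, `latest_deploy_ok`, `check_latest_deploy`) are hypothetical and not part of this repository. It relies on `gh run list` accepting `--limit` and `--json` with the `status` and `conclusion` fields.

```python
import json
import subprocess

WORKFLOW = "deploy-hf.yml"  # workflow file named in CLAUDE.md


def build_command(limit: int = 1) -> list[str]:
    """Build the `gh run list` invocation for the deploy workflow."""
    return [
        "gh", "run", "list",
        f"--workflow={WORKFLOW}",
        "--limit", str(limit),
        "--json", "status,conclusion,headSha",
    ]


def latest_deploy_ok(raw_json: str) -> bool:
    """Return True if the newest run in the JSON output finished successfully."""
    runs = json.loads(raw_json)
    if not runs:
        return False
    newest = runs[0]
    return newest["status"] == "completed" and newest["conclusion"] == "success"


def check_latest_deploy() -> bool:
    """Call gh and report whether the latest deploy succeeded (needs gh + auth)."""
    result = subprocess.run(
        build_command(), capture_output=True, text=True, check=True
    )
    return latest_deploy_ok(result.stdout)
```

Keeping the JSON parsing separate from the subprocess call means the success check can be exercised without network access.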

app.py CHANGED

@@ -25,8 +25,10 @@ demo = Server()
 # INTEGRATE SERVER.PY ROUTES
 # -----------------------------------------------------
 from server import router, limiter, enforce_content_length_limit
+from tools_api import router as tools_router
 demo.include_router(router)
+demo.include_router(tools_router)
 demo.state.limiter = limiter
 demo.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

graphify-out/.graphify_root ADDED

@@ -0,0 +1 @@
+/Users/rafa/MscAi/VideoVoice-be

graphify-out/GRAPH_REPORT.md CHANGED

@@ -1,12 +1,12 @@
-# Graph Report - VideoVoice-be  (2026-04-26)
 ## Corpus Check
-- 47 files · ~206,237 words
 - Verdict: corpus is large enough that graph structure adds value.
 ## Summary
-- 796 nodes · 1429 edges · 57 communities detected
-- Extraction: 75% EXTRACTED · 25% INFERRED · 0% AMBIGUOUS · INFERRED: 357 edges (avg confidence: 0.6)
 - Token cost: 0 input · 0 output
 ## Community Hubs (Navigation)
@@ -27,12 +27,12 @@
 - [[_COMMUNITY_Community 14|Community 14]]
 - [[_COMMUNITY_Community 15|Community 15]]
 - [[_COMMUNITY_Community 16|Community 16]]
 - [[_COMMUNITY_Community 18|Community 18]]
-- [[_COMMUNITY_Community 26|Community 26]]
-- [[_COMMUNITY_Community 27|Community 27]]
-- [[_COMMUNITY_Community 28|Community 28]]
-- [[_COMMUNITY_Community 29|Community 29]]
-- [[_COMMUNITY_Community 30|Community 30]]
 - [[_COMMUNITY_Community 31|Community 31]]
 - [[_COMMUNITY_Community 32|Community 32]]
 - [[_COMMUNITY_Community 33|Community 33]]
@@ -67,6 +67,11 @@
 - [[_COMMUNITY_Community 62|Community 62]]
 - [[_COMMUNITY_Community 63|Community 63]]
 - [[_COMMUNITY_Community 64|Community 64]]
 ## God Nodes (most connected - your core abstractions)
 1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
@@ -81,16 +86,16 @@
 10. `BasePoster` - 14 edges
 ## Surprising Connections (you probably didn't know these)
-- `_segments_from_pollinations()` --calls--> `post()`  [INFERRED]
-  steps/s2_transcribe.py → social_distributor/poster/platforms/base.py
 - `chatterbox-tts==0.1.7 --no-deps` --semantically_similar_to--> `omnivoice>=0.1.4`  [INFERRED] [semantically similar]
   requirements.txt → requirements-omni.txt
 - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)`  [INFERRED] [semantically similar]
   requirements.txt → requirements-omni.txt
 - `run_pipeline()` --calls--> `transcribe()`  [INFERRED]
   pipeline.py → steps/s2_transcribe.py
-- `run_pipeline()` --calls--> `translate()`  [INFERRED]
-  pipeline.py → steps/s3_translate.py
 ## Hyperedges (group relationships)
 - **Six-step translation pipeline** —  [EXTRACTED 1.00]
@@ -101,323 +106,343 @@
 ### Community 0 - "Community 0"
 Cohesion: 0.04
-Nodes (70): Qwen3TTSConfig, Qwen3TTSSpeakerEncoderConfig, Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig, r"""     This is the configuration class to store the configuration of a [`Qwen3, r"""     This is the configuration class to store the configuration of a [`Qwen3, This is the configuration class to store the configuration of a [`Qwen3TTSForCon, r"""     This is the configuration class to store the configuration of a [`Qwen3 (+62 more)
 ### Community 1 - "Community 1"
-Cohesion: 0.05
-Nodes (58): ABC, BasePoster, post(), Abstract base class for platform posters., Save a debug screenshot on failure., BasePoster, _build_system_prompt(), _build_user_prompt() (+50 more)
 ### Community 2 - "Community 2"
-Cohesion: 0.04
-Nodes (61): api_run_pipeline(), content_length_middleware(), ZeroGPU-compatible entrypoint using gradio.Server. Server extends FastAPI, so al, Exposed through Gradio's API engine.     ZeroGPU will allocate a GPU when this e, run_pipeline(), BaseHTTPMiddleware, BaseModel, _artifact_reaper_loop() (+53 more)
 ### Community 3 - "Community 3"
-Cohesion: 0.06
-Nodes (55): forward(), generate(), generate_speaker_prompt(), main(), _prefetch_chatterbox(), _prefetch_demucs(), _prefetch_faster_whisper(), Prefetch model weights into HF_HOME for faster cold starts on Spaces. (+47 more)
 ### Community 4 - "Community 4"
-Cohesion: 0.07
-Nodes (24): DistributedGroupResidualVectorQuantization, Efficient distributed group residual vector quantization implementation.     Fol, dynamic_range_compression_torch(), MelSpectrogramFeatures, x: torch.Tensor, shape = (T, D)             q: torch.Tensor, shape = (T, D), x : torch.Tensor, shape = (n_mels, n_ctx)             the mel spectrogram of the, Calculate the BigVGAN style mel spectrogram of an input signal.     Args:, spectral_normalize_torch() (+16 more)
 ### Community 5 - "Community 5"
-Cohesion: 0.07
-Nodes (26): _audio_to_tuple(), _build_choices_and_map(), build_demo(), build_parser(), _collect_gen_kwargs(), _detect_model_kind(), _dtype_from_str(), main() (+18 more)
 ### Community 6 - "Community 6"
-Cohesion: 0.05
-Nodes (49): FFmpeg concat list (synced TTS), Try-Now app panel, app.js script ref, Comparison table (HeyGen, Rask, ElevenLabs, Synthesia), Hero section + 23+ languages, Frontend index.html, Source/target language selectors, Pricing tiers (Free/Starter/Creator) (+41 more)
 ### Community 7 - "Community 7"
 Cohesion: 0.07
-Nodes (35): _collect_output(), _log_step_done(), main(), pipeline.py — Core pipeline: CLI entrypoint + importable run_pipeline() for Grad, Print duration + separator line for a completed step., Collect all yields and the return value from the generator., Run the full translation pipeline, yielding progress messages.      Args:, run_pipeline() (+27 more)
 ### Community 8 - "Community 8"
 Cohesion: 0.09
 Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)
-### Community 9 - "Community 9"
 Cohesion: 0.1
 Nodes (14): default(), DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb(), preprocess() (+6 more)
-### Community 10 - "Community 10"
 Cohesion: 0.1
 Nodes (31): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., Translate the text of each segment into target_language in batches.      Args:, translate(), _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client() (+23 more)
-### Community 11 - "Community 11"
 Cohesion: 0.12
-Nodes (29): _assign_words_to_segments(), _extract_words(), _get_faster_whisper_model(), _get_local_whisper_backend(), _get_openai_whisper_model(), _normalise_segments(), Step 3: Transcribe audio with timestamps.  Primary local backend (device-depende, Split segments longer than _MAX_SEGMENT_DURATION using word timings. (+21 more)
-### Community 12 - "Community 12"
-Cohesion: 0.13
-Nodes (26): _compress_silences(), _detect_pauses(), _distribute_padding(), _find_tts_silences(), _generate_silence(), _get_wav_duration(), _pad_silence(), _pause_aware_sync() (+18 more)
-### Community 13 - "Community 13"
 Cohesion: 0.24
 Nodes (11): extract_creator(), _extract_instagram(), _extract_tiktok(), _extract_youtube(), _load_cache(), Extract original creator @username from video URLs., YouTube: visit video page, extract channel name from meta tags., Extract the @username of the original creator from the video URL.      Uses Play (+3 more)
-### Community 14 - "Community 14"
 Cohesion: 0.27
 Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline.  Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)
-### Community 15 - "Community 15"
-Cohesion: 0.22
-Nodes (5): Qwen3TTSProcessor, r"""     Constructs a Qwen3TTS processor.      Args:         tokenizer ([`Qwen2T, Main method to prepare for the model one or several sequences(s) and audio(s). T, This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedToke, ProcessorMixin
-### Community 16 - "Community 16"
 Cohesion: 0.33
 Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var
-### Community 18 - "Community 18"
 Cohesion: 1.0
 Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)
-### Community 26 - "Community 26"
 Cohesion: 1.0
 Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.
-### Community 27 - "Community 27"
 Cohesion: 1.0
 Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te
-### Community 28 - "Community 28"
 Cohesion: 1.0
 Nodes (1): Voice clone speech using the Base model.          You can provide either:
-### Community 29 - "Community 29"
 Cohesion: 1.0
 Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct
-### Community 30 - "Community 30"
 Cohesion: 1.0
 Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option
-### Community 31 - "Community 31"
 Cohesion: 1.0
 Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.
-### Community 32 - "Community 32"
 Cohesion: 1.0
 Nodes (1): Reject oversized uploads before body parsing.
-### Community 33 - "Community 33"
 Cohesion: 1.0
 Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job
-### Community 34 - "Community 34"
 Cohesion: 1.0
 Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.
-### Community 35 - "Community 35"
 Cohesion: 1.0
 Nodes (1): Return curated showcase entries with resolved streaming URLs.
-### Community 36 - "Community 36"
 Cohesion: 1.0
 Nodes (1): Submit a video for translation.
-### Community 37 - "Community 37"
 Cohesion: 1.0
 Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.
-### Community 38 - "Community 38"
 Cohesion: 1.0
 Nodes (1): User selects a TTS model after previewing.
-### Community 39 - "Community 39"
 Cohesion: 1.0
 Nodes (1): Serve a preview audio WAV file.
-### Community 40 - "Community 40"
 Cohesion: 1.0
 Nodes (1): Download the translated video.
-### Community 41 - "Community 41"
 Cohesion: 1.0
 Nodes (1): Create artifact directories and start background cleanup.
-### Community 42 - "Community 42"
 Cohesion: 1.0
 Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.
-### Community 43 - "Community 43"
 Cohesion: 1.0
 Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura
-### Community 44 - "Community 44"
 Cohesion: 1.0
 Nodes (1): Insert extra silence distributed across detected pause points.
-### Community 45 - "Community 45"
 Cohesion: 1.0
 Nodes (1): Generate a silent WAV file of given duration.
-### Community 46 - "Community 46"
 Cohesion: 1.0
 Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single
-### Community 47 - "Community 47"
 Cohesion: 1.0
 Nodes (1): Translate the text of each segment into target_language in batches.      Args:
-### Community 48 - "Community 48"
 Cohesion: 1.0
 Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int
-### Community 49 - "Community 49"
 Cohesion: 1.0
 Nodes (1): Remove trailing noise/artifacts after speech ends.
-### Community 50 - "Community 50"
 Cohesion: 1.0
 Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.
-### Community 51 - "Community 51"
 Cohesion: 1.0
 Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.
-### Community 52 - "Community 52"
 Cohesion: 1.0
 Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.
-### Community 53 - "Community 53"
 Cohesion: 1.0
 Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated
-### Community 54 - "Community 54"
 Cohesion: 1.0
 Nodes (1): Generate speech for all segments using OmniVoice voice cloning.
-### Community 55 - "Community 55"
 Cohesion: 1.0
 Nodes (1): Synthesise translated text for each segment using voice cloned from reference au
-### Community 56 - "Community 56"
 Cohesion: 1.0
 Nodes (1): torch==2.6.0
-### Community 57 - "Community 57"
 Cohesion: 1.0
 Nodes (1): fastapi
-### Community 58 - "Community 58"
 Cohesion: 1.0
 Nodes (1): yt-dlp
-### Community 59 - "Community 59"
 Cohesion: 1.0
 Nodes (1): diffusers==0.29.0
-### Community 60 - "Community 60"
 Cohesion: 1.0
 Nodes (1): ARTIFACTS_ROOT env
-### Community 61 - "Community 61"
 Cohesion: 1.0
 Nodes (1): AWS g4dn.xlarge alternative
-### Community 62 - "Community 62"
 Cohesion: 1.0
 Nodes (1): nodejs (system pkg)
-### Community 63 - "Community 63"
 Cohesion: 1.0
 Nodes (1): fonts-noto-core / cjk
-### Community 64 - "Community 64"
 Cohesion: 1.0
 Nodes (1): graphify project rules
 ## Knowledge Gaps
-- **217 isolated node(s):** `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.`, `Read media duration from ffprobe.`, `Report CUDA/MPS availability.` (+212 more)
   These have ≤1 connection - possible missing edges or undocumented components.
-- **Thin community `Community 18`** (2 nodes): `gradio==6.8.0`, `gradio==6.12.0 (omni)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 26`** (1 nodes): `Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 27`** (1 nodes): `Build voice-clone prompt items from reference audio (and optionally reference te`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 28`** (1 nodes): `Voice clone speech using the Base model.          You can provide either:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 29`** (1 nodes): `Generate speech with the VoiceDesign model using natural-language style instruct`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 30`** (1 nodes): `Generate speech with the CustomVoice model using a predefined speaker id, option`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 31`** (1 nodes): `Delete stale per-job artifact directories from ARTIFACTS_ROOT.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 32`** (1 nodes): `Reject oversized uploads before body parsing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 33`** (1 nodes): `Run the translation pipeline in a background thread, pushing progress to the job`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 34`** (1 nodes): `List whitelisted MP4 demo videos from outputs/ and data/.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 35`** (1 nodes): `Return curated showcase entries with resolved streaming URLs.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 36`** (1 nodes): `Submit a video for translation.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 37`** (1 nodes): `Poll endpoint returning new messages since index `after`, plus live wait status.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 38`** (1 nodes): `User selects a TTS model after previewing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 39`** (1 nodes): `Serve a preview audio WAV file.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 40`** (1 nodes): `Download the translated video.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 41`** (1 nodes): `Create artifact directories and start background cleanup.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 42`** (1 nodes): `Sync TTS audio using pause-aware strategy: compress silences first, then atempo.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 43`** (1 nodes): `Rewrite WAV with silence regions compressed to keep_ratio of their original dura`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 44`** (1 nodes): `Insert extra silence distributed across detected pause points.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 45`** (1 nodes): `Generate a silent WAV file of given duration.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 46`** (1 nodes): `Sync each TTS segment to its original timestamp window and stitch into a single`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 47`** (1 nodes): `Translate the text of each segment into target_language in batches.      Args:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 48`** (1 nodes): `Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 49`** (1 nodes): `Remove trailing noise/artifacts after speech ends.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 50`** (1 nodes): `Hard-trim TTS output to orig_dur * headroom, with a short fade-out.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 51`** (1 nodes): `Clip audio to max_sec to prevent excessively slow voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 52`** (1 nodes): `Numpy variant of _trim_trailing_noise for engines returning np.ndarray.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 53`** (1 nodes): `Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 54`** (1 nodes): `Generate speech for all segments using OmniVoice voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 55`** (1 nodes): `Synthesise translated text for each segment using voice cloned from reference au`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 56`** (1 nodes): `torch==2.6.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 57`** (1 nodes): `fastapi`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 58`** (1 nodes): `yt-dlp`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 59`** (1 nodes): `diffusers==0.29.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 60`** (1 nodes): `ARTIFACTS_ROOT env`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 61`** (1 nodes): `AWS g4dn.xlarge alternative`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 62`** (1 nodes): `nodejs (system pkg)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 63`** (1 nodes): `fonts-noto-core / cjk`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 64`** (1 nodes): `graphify project rules`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
 ## Suggested Questions
 _Questions this graph is uniquely positioned to answer:_
-- **Why does `run_pipeline()` connect `Community 7` to `Community 3`, `Community 10`, `Community 11`, `Community 12`?**
-  _High betweenness centrality (0.339) - this node is a cross-community bridge._
-- **Why does `synthesise_segments()` connect `Community 3` to `Community 7`?**
-  _High betweenness centrality (0.299) - this node is a cross-community bridge._
 - **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
@@ -427,4 +452,4 @@ _Questions this graph is uniquely positioned to answer:_
 - **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **What connects `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
-  _217 weakly-connected nodes found - possible documentation gaps or missing edges._

+# Graph Report - VideoVoice-be  (2026-05-16)
 ## Corpus Check
+- 59 files · ~253,292 words
 - Verdict: corpus is large enough that graph structure adds value.
 ## Summary
+- 1050 nodes · 1833 edges · 62 communities detected
+- Extraction: 79% EXTRACTED · 21% INFERRED · 0% AMBIGUOUS · INFERRED: 389 edges (avg confidence: 0.62)
 - Token cost: 0 input · 0 output
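
The Summary percentages can be cross-checked against the edge counts. The following is a small worked calculation using only the figures printed in the 2026-04-26 and 2026-05-16 reports; the function names are illustrative.

```python
def inferred_share(inferred_edges: int, total_edges: int) -> float:
    """Percentage of all edges that are tagged INFERRED."""
    return 100.0 * inferred_edges / total_edges


def growth(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


# Figures copied from the two report versions in this diff.
old_share = inferred_share(357, 1429)   # 2026-04-26: 357 of 1429 edges
new_share = inferred_share(389, 1833)   # 2026-05-16: 389 of 1833 edges

node_growth = growth(796, 1050)
edge_growth = growth(1429, 1833)

print(f"inferred share: {old_share:.1f}% -> {new_share:.1f}%")
print(f"nodes grew {node_growth:.1f}%, edges grew {edge_growth:.1f}%")
```

Rounded to whole numbers, the shares match the reported 25% and 21%. The inferred share fell even though the absolute number of inferred edges rose.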
 ## Community Hubs (Navigation)
 - [[_COMMUNITY_Community 14|Community 14]]
 - [[_COMMUNITY_Community 15|Community 15]]
 - [[_COMMUNITY_Community 16|Community 16]]
+- [[_COMMUNITY_Community 17|Community 17]]
 - [[_COMMUNITY_Community 18|Community 18]]
+- [[_COMMUNITY_Community 19|Community 19]]
+- [[_COMMUNITY_Community 20|Community 20]]
+- [[_COMMUNITY_Community 21|Community 21]]
+- [[_COMMUNITY_Community 23|Community 23]]
 - [[_COMMUNITY_Community 31|Community 31]]
 - [[_COMMUNITY_Community 32|Community 32]]
 - [[_COMMUNITY_Community 33|Community 33]]
 - [[_COMMUNITY_Community 62|Community 62]]
 - [[_COMMUNITY_Community 63|Community 63]]
 - [[_COMMUNITY_Community 64|Community 64]]
+- [[_COMMUNITY_Community 65|Community 65]]
+- [[_COMMUNITY_Community 66|Community 66]]
+- [[_COMMUNITY_Community 67|Community 67]]
+- [[_COMMUNITY_Community 68|Community 68]]
+- [[_COMMUNITY_Community 69|Community 69]]
 ## God Nodes (most connected - your core abstractions)
 1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
 10. `BasePoster` - 14 edges
 ## Surprising Connections (you probably didn't know these)
 - `chatterbox-tts==0.1.7 --no-deps` --semantically_similar_to--> `omnivoice>=0.1.4`  [INFERRED] [semantically similar]
   requirements.txt → requirements-omni.txt
 - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)`  [INFERRED] [semantically similar]
   requirements.txt → requirements-omni.txt
+- `content_length_middleware()` --calls--> `enforce_content_length_limit()`  [INFERRED]
+  app.py → server.py
+- `run_pipeline()` --calls--> `separate_audio()`  [INFERRED]
+  pipeline.py → steps/s1b_separate.py
 - `run_pipeline()` --calls--> `transcribe()`  [INFERRED]
   pipeline.py → steps/s2_transcribe.py
 ## Hyperedges (group relationships)
 - **Six-step translation pipeline** —  [EXTRACTED 1.00]
 ### Community 0 - "Community 0"
 Cohesion: 0.04
+Nodes (69): Qwen3TTSConfig, Qwen3TTSSpeakerEncoderConfig, Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig, r"""     This is the configuration class to store the configuration of a [`Qwen3, r"""     This is the configuration class to store the configuration of a [`Qwen3, This is the configuration class to store the configuration of a [`Qwen3TTSForCon, r"""     This is the configuration class to store the configuration of a [`Qwen3 (+61 more)
 ### Community 1 - "Community 1"
+Cohesion: 0.02
+Nodes (118): api_run_pipeline(), content_length_middleware(), ZeroGPU-compatible entrypoint using gradio.Server. Server extends FastAPI, so al, Exposed through Gradio's API engine.     ZeroGPU will allocate a GPU when this e, run_pipeline(), BaseHTTPMiddleware, BaseModel, _artifact_reaper_loop() (+110 more)
 ### Community 2 - "Community 2"
+Cohesion: 0.05
+Nodes (57): ABC, BasePoster, Abstract base class for platform posters., Save a debug screenshot on failure., BasePoster, _build_system_prompt(), _build_user_prompt(), format_caption() (+49 more)
 ### Community 3 - "Community 3"
+Cohesion: 0.05
+Nodes (59): _collect_output(), _log_step_done(), main(), pipeline.py — Core pipeline: CLI entrypoint + importable run_pipeline() for Grad, Print duration + separator line for a completed step., Collect all yields and the return value from the generator., Run the full translation pipeline, yielding progress messages.      Args:, run_pipeline() (+51 more)
 ### Community 4 - "Community 4"
+Cohesion: 0.06
+Nodes (55): forward(), generate(), generate_speaker_prompt(), main(), _prefetch_chatterbox(), _prefetch_demucs(), _prefetch_faster_whisper(), Prefetch model weights into HF_HOME for faster cold starts on Spaces. (+47 more)
 ### Community 5 - "Community 5"
+Cohesion: 0.06
+Nodes (59): post(), _assign_words_to_segments(), _extract_words(), _get_faster_whisper_model(), _get_local_whisper_backend(), _get_openai_whisper_model(), _normalise_segments(), Step 3: Transcribe audio with timestamps.  Primary local backend (device-depende (+51 more)
 ### Community 6 - "Community 6"
+Cohesion: 0.06
+Nodes (31): _audio_to_tuple(), _build_choices_and_map(), build_demo(), build_parser(), _collect_gen_kwargs(), _detect_model_kind(), _dtype_from_str(), main() (+23 more)
 ### Community 7 - "Community 7"
 Cohesion: 0.07
+Nodes (25): DistributedGroupResidualVectorQuantization, Efficient distributed group residual vector quantization implementation.     Fol, dynamic_range_compression_torch(), MelSpectrogramFeatures, x: torch.Tensor, shape = (T, D)             q: torch.Tensor, shape = (T, D), x : torch.Tensor, shape = (n_mels, n_ctx)             the mel spectrogram of the, Calculate the BigVGAN style mel spectrogram of an input signal.     Args:, spectral_normalize_torch() (+17 more)
 ### Community 8 - "Community 8"
+Cohesion: 0.05
+Nodes (49): FFmpeg concat list (synced TTS), Try-Now app panel, app.js script ref, Comparison table (HeyGen, Rask, ElevenLabs, Synthesia), Hero section + 23+ languages, Frontend index.html, Source/target language selectors, Pricing tiers (Free/Starter/Creator) (+41 more)
+### Community 9 - "Community 9"
 Cohesion: 0.09
 Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)
+### Community 10 - "Community 10"
 Cohesion: 0.1
 Nodes (14): default(), DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb(), preprocess() (+6 more)
+### Community 11 - "Community 11"
+Cohesion: 0.08
+Nodes (32): _apply_demucs(), _get_model(), _load_and_normalise(), Step 1b: Separate vocals from accompaniment using Demucs (Python API).  In-proce, Lazy-load htdemucs once per process. Module-level semantics; we load     on firs, GPU-bound inference call. `mix` shape: [1, channels, time]., Load WAV, resample/remix to match model requirements, z-normalise., Separate vocals from accompaniment using Demucs htdemucs (Python API).      Args (+24 more)
+### Community 12 - "Community 12"
 Cohesion: 0.1
 Nodes (31): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., Translate the text of each segment into target_language in batches.      Args:, translate(), _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client() (+23 more)
+### Community 13 - "Community 13"
 Cohesion: 0.12
+Nodes (27): build_for_job(), ensure_transcription(), extract_audio_hq(), extract_reference_audio(), get_audio_duration(), get_device(), load_chatterbox(), main() (+19 more)
+### Community 14 - "Community 14"
+Cohesion: 0.11
+Nodes (25): tools_api — Standalone endpoints for creator quick tools.  Lives alongside the m, audio_cleanup_endpoint(), _ext_to_media_type(), APIRouter for /api/tools/* endpoints.  Each endpoint is sync request-response (n, Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS., Manual reap trigger (mostly for testing). Auto-reap runs on a timer., Stream upload to disk, enforcing the tools size cap., _reap() (+17 more)
+### Community 15 - "Community 15"
+Cohesion: 0.12
+Nodes (23): build_t3_cond(), main(), prepare_sample(), prepare_sample.py — Turn one dataset.jsonl row into the exact tensors T3.loss(), Build the speaker conditioning (frozen during training)., MTLTokenizer + SOT/EOT padding (mirrors what generate() does internally)., S3Tokenizer on the target dubbed audio → speech tokens (the LABEL).      Critica, Turn one dataset row into ready-to-train tensors. (+15 more)
+### Community 16 - "Community 16"
+Cohesion: 0.19
+Nodes (18): _burn_in(), _clamp(), _extract_audio(), _force_style_for(), _format_timestamp_srt(), _format_timestamp_vtt(), generate_subtitles(), _is_video() (+10 more)
+### Community 17 - "Community 17"
+Cohesion: 0.22
+Nodes (12): download_result(), _is_noise(), main(), Batch translate Instagram reels to English via the VideoVoice server API.  Usage, Extract the Instagram reel shortcode from a URL, e.g. 'DWn_yPoDsYw'., Submit a single video URL and return the job_id., Return True if a log line is internal noise we don't want in the log., Poll job status until complete or error. Returns final messages and collected lo (+4 more)
+### Community 18 - "Community 18"
+Cohesion: 0.23
+Nodes (12): evaluate(), load_baseline(), load_with_lora(), main(), pick_held_out_samples(), print_summary(), eval.py — Evaluate the fine-tuned LoRA against the un-tuned baseline.  Picks N s, Return overshoot samples (duration_diff > 0.2) — these are NOT in the     asymme (+4 more)
+### Community 19 - "Community 19"
 Cohesion: 0.24
 Nodes (11): extract_creator(), _extract_instagram(), _extract_tiktok(), _extract_youtube(), _load_cache(), Extract original creator @username from video URLs., YouTube: visit video page, extract channel name from meta tags., Extract the @username of the original creator from the video URL.      Uses Play (+3 more)
+### Community 20 - "Community 20"
 Cohesion: 0.27
 Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline.  Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)
+### Community 21 - "Community 21"
 Cohesion: 0.33
 Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var
+### Community 23 - "Community 23"
 Cohesion: 1.0
 Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)
+### Community 31 - "Community 31"
 Cohesion: 1.0
 Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.
+### Community 32 - "Community 32"
 Cohesion: 1.0
 Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te
+### Community 33 - "Community 33"
 Cohesion: 1.0
 Nodes (1): Voice clone speech using the Base model.          You can provide either:
+### Community 34 - "Community 34"
 Cohesion: 1.0
 Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct
+### Community 35 - "Community 35"
 Cohesion: 1.0
 Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option
+### Community 36 - "Community 36"
 Cohesion: 1.0
 Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.
+### Community 37 - "Community 37"
 Cohesion: 1.0
 Nodes (1): Reject oversized uploads before body parsing.
+### Community 38 - "Community 38"
 Cohesion: 1.0
 Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job
+### Community 39 - "Community 39"
 Cohesion: 1.0
 Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.
+### Community 40 - "Community 40"
 Cohesion: 1.0
 Nodes (1): Return curated showcase entries with resolved streaming URLs.
+### Community 41 - "Community 41"
 Cohesion: 1.0
 Nodes (1): Submit a video for translation.
+### Community 42 - "Community 42"
 Cohesion: 1.0
 Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.
+### Community 43 - "Community 43"
 Cohesion: 1.0
 Nodes (1): User selects a TTS model after previewing.
+### Community 44 - "Community 44"
 Cohesion: 1.0
 Nodes (1): Serve a preview audio WAV file.
+### Community 45 - "Community 45"
 Cohesion: 1.0
 Nodes (1): Download the translated video.
+### Community 46 - "Community 46"
 Cohesion: 1.0
 Nodes (1): Create artifact directories and start background cleanup.
+### Community 47 - "Community 47"
 Cohesion: 1.0
 Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.
+### Community 48 - "Community 48"
 Cohesion: 1.0
 Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura
+### Community 49 - "Community 49"
 Cohesion: 1.0
 Nodes (1): Insert extra silence distributed across detected pause points.
+### Community 50 - "Community 50"
 Cohesion: 1.0
 Nodes (1): Generate a silent WAV file of given duration.
+### Community 51 - "Community 51"
 Cohesion: 1.0
 Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single
+### Community 52 - "Community 52"
 Cohesion: 1.0
 Nodes (1): Translate the text of each segment into target_language in batches.      Args:
+### Community 53 - "Community 53"
 Cohesion: 1.0
 Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int
+### Community 54 - "Community 54"
 Cohesion: 1.0
 Nodes (1): Remove trailing noise/artifacts after speech ends.
+### Community 55 - "Community 55"
 Cohesion: 1.0
 Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.
+### Community 56 - "Community 56"
 Cohesion: 1.0
 Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.
+### Community 57 - "Community 57"
 Cohesion: 1.0
 Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.
+### Community 58 - "Community 58"
 Cohesion: 1.0
 Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated
+### Community 59 - "Community 59"
 Cohesion: 1.0
 Nodes (1): Generate speech for all segments using OmniVoice voice cloning.
+### Community 60 - "Community 60"
 Cohesion: 1.0
 Nodes (1): Synthesise translated text for each segment using voice cloned from reference au
+### Community 61 - "Community 61"
 Cohesion: 1.0
 Nodes (1): torch==2.6.0
+### Community 62 - "Community 62"
 Cohesion: 1.0
 Nodes (1): fastapi
+### Community 63 - "Community 63"
 Cohesion: 1.0
 Nodes (1): yt-dlp
+### Community 64 - "Community 64"
 Cohesion: 1.0
 Nodes (1): diffusers==0.29.0
+### Community 65 - "Community 65"
 Cohesion: 1.0
 Nodes (1): ARTIFACTS_ROOT env
+### Community 66 - "Community 66"
 Cohesion: 1.0
 Nodes (1): AWS g4dn.xlarge alternative
+### Community 67 - "Community 67"
 Cohesion: 1.0
 Nodes (1): nodejs (system pkg)
+### Community 68 - "Community 68"
 Cohesion: 1.0
 Nodes (1): fonts-noto-core / cjk
+### Community 69 - "Community 69"
 Cohesion: 1.0
 Nodes (1): graphify project rules
 ## Knowledge Gaps
+- **321 isolated node(s):** `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.`, `Read media duration from ffprobe.`, `Report CUDA/MPS availability.` (+316 more)
   These have ≤1 connection - possible missing edges or undocumented components.
+- **Thin community `Community 23`** (2 nodes): `gradio==6.8.0`, `gradio==6.12.0 (omni)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 31`** (1 nodes): `Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 32`** (1 nodes): `Build voice-clone prompt items from reference audio (and optionally reference te`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 33`** (1 nodes): `Voice clone speech using the Base model.          You can provide either:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 34`** (1 nodes): `Generate speech with the VoiceDesign model using natural-language style instruct`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 35`** (1 nodes): `Generate speech with the CustomVoice model using a predefined speaker id, option`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 36`** (1 nodes): `Delete stale per-job artifact directories from ARTIFACTS_ROOT.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 37`** (1 nodes): `Reject oversized uploads before body parsing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 38`** (1 nodes): `Run the translation pipeline in a background thread, pushing progress to the job`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 39`** (1 nodes): `List whitelisted MP4 demo videos from outputs/ and data/.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 40`** (1 nodes): `Return curated showcase entries with resolved streaming URLs.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 41`** (1 nodes): `Submit a video for translation.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 42`** (1 nodes): `Poll endpoint returning new messages since index `after`, plus live wait status.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 43`** (1 nodes): `User selects a TTS model after previewing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 44`** (1 nodes): `Serve a preview audio WAV file.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 45`** (1 nodes): `Download the translated video.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 46`** (1 nodes): `Create artifact directories and start background cleanup.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 47`** (1 nodes): `Sync TTS audio using pause-aware strategy: compress silences first, then atempo.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 48`** (1 nodes): `Rewrite WAV with silence regions compressed to keep_ratio of their original dura`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 49`** (1 nodes): `Insert extra silence distributed across detected pause points.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 50`** (1 nodes): `Generate a silent WAV file of given duration.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 51`** (1 nodes): `Sync each TTS segment to its original timestamp window and stitch into a single`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 52`** (1 nodes): `Translate the text of each segment into target_language in batches.      Args:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 53`** (1 nodes): `Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 54`** (1 nodes): `Remove trailing noise/artifacts after speech ends.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 55`** (1 nodes): `Hard-trim TTS output to orig_dur * headroom, with a short fade-out.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 56`** (1 nodes): `Clip audio to max_sec to prevent excessively slow voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 57`** (1 nodes): `Numpy variant of _trim_trailing_noise for engines returning np.ndarray.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 58`** (1 nodes): `Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 59`** (1 nodes): `Generate speech for all segments using OmniVoice voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 60`** (1 nodes): `Synthesise translated text for each segment using voice cloned from reference au`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 61`** (1 nodes): `torch==2.6.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 62`** (1 nodes): `fastapi`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 63`** (1 nodes): `yt-dlp`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 64`** (1 nodes): `diffusers==0.29.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 65`** (1 nodes): `ARTIFACTS_ROOT env`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 66`** (1 nodes): `AWS g4dn.xlarge alternative`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 67`** (1 nodes): `nodejs (system pkg)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 68`** (1 nodes): `fonts-noto-core / cjk`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 69`** (1 nodes): `graphify project rules`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
 ## Suggested Questions
 _Questions this graph is uniquely positioned to answer:_
+- **Why does `synthesise_segments()` connect `Community 4` to `Community 11`, `Community 3`?**
+  _High betweenness centrality (0.324) - this node is a cross-community bridge._
+- **Why does `generate()` connect `Community 4` to `Community 0`, `Community 6`?**
+  _High betweenness centrality (0.209) - this node is a cross-community bridge._
 - **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
 - **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **What connects `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
+  _321 weakly-connected nodes found - possible documentation gaps or missing edges._

graphify-out/graph.html CHANGED

The diff for this file is too large to render. See raw diff

server.py CHANGED

@@ -75,7 +75,7 @@ ALLOWED_YTDLP_HOSTS = {
     "tiktok.com",
     "vm.tiktok.com",
 }
-PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp"}
 REAPER_INTERVAL_SECONDS = 10 * 60
 REAPER_MAX_AGE_SECONDS = 2 * 60 * 60
@@ -913,6 +913,10 @@ if __name__ == "__main__":
     local_app.include_router(router)
     # Serve the legacy static frontend at / so `python server.py` keeps the
     # old dev UX (open http://localhost:8000 to hit frontend/index.html).
     # The React SPA in production is deployed separately to S3.

     "tiktok.com",
     "vm.tiktok.com",
 }
+PERSISTENT_ARTIFACT_DIRS = {"uploads", "outputs", "data", "tmp", "tools"}
 REAPER_INTERVAL_SECONDS = 10 * 60
 REAPER_MAX_AGE_SECONDS = 2 * 60 * 60
     local_app.include_router(router)
+    # Tools API — independent of pipeline; safe to include here too.
+    from tools_api import router as tools_router
+    local_app.include_router(tools_router)
     # Serve the legacy static frontend at / so `python server.py` keeps the
     # old dev UX (open http://localhost:8000 to hit frontend/index.html).
     # The React SPA in production is deployed separately to S3.

tools_api/__init__.py ADDED

	@@ -0,0 +1,17 @@

+"""
+tools_api — Standalone endpoints for creator quick tools.
+Lives alongside the main pipeline (server.py) but stays decoupled:
+  - No shared job state, no SSE, no GPU semaphore.
+  - Reuses step modules as libraries only (no edits to steps/).
+  - Artifacts written under ARTIFACTS_ROOT/tools/<run_id>/.
+Endpoints (mounted by router.router):
+  POST /api/tools/subtitles       — captions (sidecar or burn-in MP4)
+  POST /api/tools/voice-clone     — single-segment TTS with voice clone
+  POST /api/tools/audio-cleanup   — Demucs source separation
+  GET  /api/tools/file/{run}/{f}  — download generated artifact
+"""
+from .router import router
+__all__ = ["router"]

tools_api/audio_cleanup.py ADDED

	@@ -0,0 +1,136 @@

+"""
+Audio source separation tool — three modes via Demucs.
+Reuses internals from steps.s1b_separate (model loader, device picker, normaliser,
+GPU-decorated apply). The existing separate_audio() returns only (vocals, accompaniment),
+so we replicate its flow here and keep all four stems addressable.
+"""
+from __future__ import annotations
+import subprocess
+from pathlib import Path
+from typing import Literal
+import torch
+import torchaudio
+# Reuse internals — no edits to s1b_separate.py.
+from steps.s1b_separate import (
+    _apply_demucs,
+    _get_model,
+    _load_and_normalise,
+    _select_device,
+)
+Mode = Literal["vocals-only", "instrumental-only", "stems"]
+def _ensure_audio(input_path: Path, out_dir: Path) -> Path:
+    """Convert input to a stable WAV format if it's a video or non-WAV audio."""
+    if input_path.suffix.lower() == ".wav":
+        return input_path
+    out = out_dir / "input.wav"
+    cmd = [
+        "ffmpeg", "-y", "-i", str(input_path),
+        "-vn", "-ac", "2", "-ar", "44100",
+        "-acodec", "pcm_s16le",
+        str(out),
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
+    if result.returncode != 0:
+        raise RuntimeError(f"ffmpeg input prep failed: {result.stderr[-300:]}")
+    return out
+def _separate_all_stems(audio_path: Path, out_dir: Path) -> dict[str, Path]:
+    """Return {stem_name: wav_path} for every demucs source."""
+    model = _get_model()
+    device = _select_device()
+    target_sr = model.samplerate
+    target_ch = model.audio_channels
+    source_names = list(model.sources)  # ["drums", "bass", "other", "vocals"]
+    mix, mean, std = _load_and_normalise(str(audio_path), target_sr, target_ch)
+    sources = _apply_demucs(mix, device)
+    sources = sources * std + mean
+    sources = sources[0]  # [num_sources, channels, time]
+    stems: dict[str, Path] = {}
+    for idx, name in enumerate(source_names):
+        wav_path = out_dir / f"{name}.wav"
+        torchaudio.save(str(wav_path), sources[idx], target_sr)
+        stems[name] = wav_path
+    return stems
+def _sum_to_wav(stems: list[Path], dest: Path, sample_rate: int = 44100) -> Path:
+    """Sum N stem WAVs into one — used to build the instrumental track."""
+    mix: torch.Tensor | None = None
+    sr_used = sample_rate
+    for path in stems:
+        wav, sr = torchaudio.load(str(path))
+        sr_used = sr
+        mix = wav if mix is None else mix + wav
+    if mix is None:
+        raise RuntimeError("No stems to sum.")
+    torchaudio.save(str(dest), mix, sr_used)
+    return dest
+def separate(
+    *,
+    input_path: Path,
+    out_dir: Path,
+    mode: Mode,
+) -> list[dict]:
+    """
+    Run separation. Returns a list of output descriptors:
+      [{"name": "vocals", "label": "Vocals", "filename": "vocals.wav", "sub": "Dialogue track"}, ...]
+    """
+    audio_in = _ensure_audio(input_path, out_dir)
+    stems = _separate_all_stems(audio_in, out_dir)
+    if mode == "vocals-only":
+        return [{
+            "name": "vocals",
+            "label": "Vocals",
+            "filename": stems["vocals"].name,
+            "sub": "Dialogue track",
+        }]
+    if mode == "instrumental-only":
+        non_vocal_stems = [stems[n] for n in stems if n != "vocals"]
+        out = _sum_to_wav(non_vocal_stems, out_dir / "instrumental.wav")
+        # Cleanup intermediate stem files we won't expose
+        for path in stems.values():
+            try:
+                path.unlink()
+            except OSError:
+                pass
+        return [{
+            "name": "instrumental",
+            "label": "Instrumental",
+            "filename": out.name,
+            "sub": "Music + ambient (vocals removed)",
+        }]
+    # stems mode — return all four
+    label_map = {
+        "vocals": ("Vocals", "Dialogue track"),
+        "drums": ("Drums", "Percussion"),
+        "bass": ("Bass", "Low frequency"),
+        "other": ("Other", "Melodic / ambient"),
+    }
+    results: list[dict] = []
+    # Stable order: vocals first, then drums, bass, other
+    for stem_key in ("vocals", "drums", "bass", "other"):
+        if stem_key not in stems:
+            continue
+        label, sub = label_map[stem_key]
+        results.append({
+            "name": stem_key,
+            "label": label,
+            "filename": stems[stem_key].name,
+            "sub": sub,
+        })
+    return results
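
The `instrumental-only` mode above builds its output by summing every non-vocal stem sample by sample. The idea can be sketched without torch, using plain Python lists as stand-ins for the waveform tensors. The `sum_stems` helper and the tiny two-sample "stems" are hypothetical illustrations, not part of this commit:

```python
def sum_stems(stems: dict[str, list[float]], exclude: str = "vocals") -> list[float]:
    """Sample-wise sum of every stem except `exclude` (illustrative helper)."""
    kept = [samples for name, samples in stems.items() if name != exclude]
    if not kept:
        raise ValueError("No stems to sum.")
    length = len(kept[0])
    if any(len(samples) != length for samples in kept):
        raise ValueError("Stems must have equal length.")
    # zip(*kept) yields one tuple per sample index, one value per stem.
    return [sum(column) for column in zip(*kept)]


# Hypothetical two-sample stems; real stems are [channels, time] tensors.
stems = {
    "drums": [0.25, 0.5],
    "bass": [0.125, 0.0],
    "other": [0.0, 0.25],
    "vocals": [1.0, 1.0],
}
instrumental = sum_stems(stems)
```

In the diff itself the same addition happens on `torch.Tensor` objects loaded with `torchaudio.load`, so the stems must share shape and sample rate for `mix + wav` to be meaningful.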

tools_api/router.py ADDED

	@@ -0,0 +1,248 @@

+"""
+APIRouter for /api/tools/* endpoints.
+Each endpoint is sync request-response (no SSE, no job state). Input files
+land in a fresh per-run directory, outputs are returned as a download URL
+to GET /api/tools/file/{run_id}/{filename}.
+"""
+from __future__ import annotations
+import asyncio
+from pathlib import Path
+from typing import Optional
+from fastapi import APIRouter, File, Form, HTTPException, Request, UploadFile
+from fastapi.responses import FileResponse, JSONResponse, PlainTextResponse
+from server import limiter, _download_url, _is_allowed_video_host
+from . import audio_cleanup, subtitles, voice_clone
+from .storage import (
+    file_url,
+    new_run_dir,
+    reap_old_runs,
+    run_dir,
+    safe_filename,
+)
+router = APIRouter(prefix="/api/tools", tags=["tools"])
+# Per-tool body size cap (separate from pipeline's MAX_UPLOAD_BYTES check).
+TOOLS_MAX_BYTES = 50 * 1024 * 1024  # 50 MB
+# ── Helpers ──────────────────────────────────────────────────────────
+async def _save_upload(file: UploadFile, dest_dir: Path, default_name: str) -> Path:
+    """Stream upload to disk, enforcing the tools size cap."""
+    dest = dest_dir / safe_filename(file.filename, default_name)
+    written = 0
+    with open(dest, "wb") as fh:
+        while chunk := await file.read(1024 * 1024):
+            written += len(chunk)
+            if written > TOOLS_MAX_BYTES:
+                fh.close()
+                dest.unlink(missing_ok=True)
+                raise HTTPException(413, f"File too large (max {TOOLS_MAX_BYTES // (1024*1024)} MB).")
+            fh.write(chunk)
+    return dest
+def _ext_to_media_type(filename: str) -> str:
+    ext = Path(filename).suffix.lower()
+    return {
+        ".mp4": "video/mp4",
+        ".mov": "video/quicktime",
+        ".webm": "video/webm",
+        ".mp3": "audio/mpeg",
+        ".wav": "audio/wav",
+        ".srt": "application/x-subrip",
+        ".vtt": "text/vtt",
+        ".txt": "text/plain",
+    }.get(ext, "application/octet-stream")
+# ── Subtitles ────────────────────────────────────────────────────────
+@router.post("/subtitles")
+@limiter.limit("10/hour")
+async def subtitles_endpoint(
+    request: Request,
+    file: Optional[UploadFile] = File(None),
+    url: Optional[str] = Form(None),
+    source_lang: str = Form("Auto-detect"),
+    target_lang: str = Form("Same as source"),
+    fmt: str = Form("srt"),
+    style: str = Form("tiktok"),
+    position: str = Form("bottom"),
+    h_align: str = Form("center"),
+    font_size: Optional[int] = Form(None),
+    margin_v: Optional[int] = Form(None),
+):
+    if fmt not in ("srt", "vtt", "txt", "mp4"):
+        raise HTTPException(400, "fmt must be one of: srt, vtt, txt, mp4")
+    if style not in ("tiktok", "youtube", "minimal"):
+        raise HTTPException(400, "style must be one of: tiktok, youtube, minimal")
+    if position not in ("top", "middle", "bottom"):
+        raise HTTPException(400, "position must be one of: top, middle, bottom")
+    if h_align not in ("left", "center", "right"):
+        raise HTTPException(400, "h_align must be one of: left, center, right")
+    url = (url or "").strip()
+    if not file and not url:
+        raise HTTPException(400, "Provide either a file upload or a video URL.")
+    if file and url:
+        raise HTTPException(400, "Send a file OR a URL, not both.")
+    run_id, dest_dir = new_run_dir()
+    if file:
+        input_path = await _save_upload(file, dest_dir, "input.mp4")
+    else:
+        if not _is_allowed_video_host(url):
+            raise HTTPException(400, "URL host not supported. Use TikTok, YouTube, or Instagram.")
+        input_path = Path(dest_dir) / "input.mp4"
+        try:
+            await asyncio.to_thread(_download_url, url, str(input_path))
+        except Exception as e:  # noqa: BLE001
+            raise HTTPException(400, f"Couldn't fetch the video URL: {e}")
+    try:
+        # Heavy: transcribe + (optional) translate + (optional) ffmpeg burn-in.
+        # Run off the event loop so concurrent requests don't starve.
+        info = await asyncio.to_thread(
+            subtitles.generate_subtitles,
+            input_path=input_path,
+            out_dir=dest_dir,
+            source_lang_name=source_lang,
+            target_lang_name=target_lang,
+            fmt=fmt,  # type: ignore[arg-type]
+            style=style,  # type: ignore[arg-type]
+            position=position,  # type: ignore[arg-type]
+            h_align=h_align,  # type: ignore[arg-type]
+            font_size=font_size,
+            margin_v=margin_v,
+        )
+    except ValueError as e:
+        raise HTTPException(400, str(e))
+    except Exception as e:  # noqa: BLE001
+        raise HTTPException(500, f"Subtitle generation failed: {e}")
+    return JSONResponse({
+        "run_id": run_id,
+        "format": info["format"],
+        "filename": info["filename"],
+        "url": file_url(run_id, info["filename"]),
+        "segments": info["segments"],
+        "translated": info["translated"],
+    })
+# ── Voice clone ──────────────────────────────────────────────────────
+@router.post("/voice-clone")
+@limiter.limit("10/hour")
+async def voice_clone_endpoint(
+    request: Request,
+    sample: UploadFile = File(...),
+    text: str = Form(...),
+    language_id: str = Form("en"),
+):
+    text = (text or "").strip()
+    if not text:
+        raise HTTPException(400, "text is required")
+    if len(text) > 1000:
+        raise HTTPException(400, "text exceeds 1000 char limit")
+    run_id, dest_dir = new_run_dir()
+    sample_path = await _save_upload(sample, dest_dir, "sample.wav")
+    try:
+        info = await asyncio.to_thread(
+            voice_clone.clone_voice,
+            sample_path=sample_path,
+            text=text,
+            out_dir=dest_dir,
+            language_id=language_id,
+        )
+    except ValueError as e:
+        raise HTTPException(400, str(e))
+    except Exception as e:  # noqa: BLE001
+        raise HTTPException(500, f"Voice clone failed: {e}")
+    return JSONResponse({
+        "run_id": run_id,
+        "engine": info["engine"],
+        "chunks": info["chunks"],
+        "filename": info["filename"],
+        "url": file_url(run_id, info["filename"]),
+    })
+# ── Audio cleanup ────────────────────────────────────────────────────
+@router.post("/audio-cleanup")
+@limiter.limit("10/hour")
+async def audio_cleanup_endpoint(
+    request: Request,
+    file: UploadFile = File(...),
+    mode: str = Form("vocals-only"),
+):
+    if mode not in ("vocals-only", "instrumental-only", "stems"):
+        raise HTTPException(400, "mode must be one of: vocals-only, instrumental-only, stems")
+    run_id, dest_dir = new_run_dir()
+    input_path = await _save_upload(file, dest_dir, "input.wav")
+    try:
+        stems = await asyncio.to_thread(
+            audio_cleanup.separate,
+            input_path=input_path,
+            out_dir=dest_dir,
+            mode=mode,  # type: ignore[arg-type]
+        )
+    except ValueError as e:
+        raise HTTPException(400, str(e))
+    except Exception as e:  # noqa: BLE001
+        raise HTTPException(500, f"Audio separation failed: {e}")
+    return JSONResponse({
+        "run_id": run_id,
+        "mode": mode,
+        "stems": [
+            {**stem, "url": file_url(run_id, stem["filename"])}
+            for stem in stems
+        ],
+    })
+# ── File download ────────────────────────────────────────────────────
+@router.get("/file/{run_id}/{filename}")
+async def tools_file(run_id: str, filename: str):
+    """Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS."""
+    safe_name = safe_filename(filename)
+    if safe_name != filename:
+        raise HTTPException(400, "Invalid filename")
+    base = run_dir(run_id)
+    if base is None:
+        raise HTTPException(404, "Run not found or expired")
+    target = base / safe_name
+    if not target.exists() or not target.is_file():
+        raise HTTPException(404, "File not found")
+    return FileResponse(
+        path=str(target),
+        media_type=_ext_to_media_type(safe_name),
+        filename=safe_name,
+    )
+# ── Cleanup hook ─────────────────────────────────────────────────────
+@router.post("/_internal/reap")
+async def _reap():
+    """Manual reap trigger (mostly for testing). Auto-reap runs on a timer."""
+    removed = await asyncio.to_thread(reap_old_runs)
+    return {"removed": removed}
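
`_save_upload` enforces the 50 MB cap while streaming, so an oversized body is rejected as soon as the running total crosses the limit rather than after the whole file has been written. A minimal sketch of that pattern with in-memory streams follows; `copy_with_cap` and `TooLarge` are hypothetical names standing in for the endpoint code and its `HTTPException(413)`:

```python
import io

MAX_BYTES = 50 * 1024 * 1024  # mirrors TOOLS_MAX_BYTES above
CHUNK_SIZE = 1024 * 1024      # same 1 MiB read size as _save_upload


class TooLarge(Exception):
    """Hypothetical stand-in for the HTTPException(413) raised by the endpoint."""


def copy_with_cap(src, dst, max_bytes=MAX_BYTES, chunk_size=CHUNK_SIZE):
    """Copy src to dst chunk by chunk; raise TooLarge once the total exceeds max_bytes."""
    written = 0
    while chunk := src.read(chunk_size):
        written += len(chunk)
        if written > max_bytes:
            # Chunks written so far stay in dst; the caller must discard them.
            raise TooLarge(f"upload exceeds {max_bytes} bytes")
        dst.write(chunk)
    return written


sink = io.BytesIO()
total = copy_with_cap(io.BytesIO(b"a" * 10), sink, max_bytes=16, chunk_size=4)
```

Because earlier chunks have already been written when the cap trips, the partial output has to be removed by the caller. In the diff that is what `dest.unlink(missing_ok=True)` does before the 413 is raised.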

tools_api/storage.py ADDED

	@@ -0,0 +1,73 @@

+"""
+Per-run temp storage for tools_api.
+Each tool request creates a fresh dir under ARTIFACTS_ROOT/tools/<run_id>/.
+Files are reaped after TTL by reap_old_runs(). Kept independent of the main
+job-tracker so a tool failure can't corrupt or block pipeline state.
+"""
+from __future__ import annotations
+import shutil
+import time
+import uuid
+from pathlib import Path
+from typing import Optional
+# Pull ARTIFACTS_ROOT from server.py without importing the heavy modules
+# (server.py imports torch/whisper/etc. at top level — we already loaded it
+# at app startup, so this is just a name lookup).
+from server import ARTIFACTS_ROOT
+TOOLS_ROOT = ARTIFACTS_ROOT / "tools"
+TOOLS_ROOT.mkdir(parents=True, exist_ok=True)
+# Tool runs are reaped about 1h after the run directory was last modified
+# (reap_old_runs compares the directory mtime). This is shorter than pipeline
+# jobs since users typically download immediately.
+RUN_TTL_SECONDS = 60 * 60
+def new_run_dir() -> tuple[str, Path]:
+    """Allocate a fresh per-request directory. Returns (run_id, path)."""
+    run_id = uuid.uuid4().hex[:16]
+    path = TOOLS_ROOT / run_id
+    path.mkdir(parents=True, exist_ok=True)
+    return run_id, path
+def run_dir(run_id: str) -> Optional[Path]:
+    """Resolve a run_id to its directory, or None if missing/invalid."""
+    if not run_id or "/" in run_id or ".." in run_id:
+        return None
+    candidate = TOOLS_ROOT / run_id
+    if not candidate.exists() or not candidate.is_dir():
+        return None
+    return candidate
+def file_url(run_id: str, filename: str) -> str:
+    """Construct the public download URL for an artifact."""
+    return f"/api/tools/file/{run_id}/{filename}"
+def safe_filename(name: str, fallback: str = "file") -> str:
+    """Reduce a user-supplied name to its final path component (or fallback)."""
+    if not name:
+        return fallback
+    base = Path(name).name
+    return base or fallback
+def reap_old_runs() -> int:
+    """Delete tool run dirs older than RUN_TTL_SECONDS. Returns count removed."""
+    if not TOOLS_ROOT.exists():
+        return 0
+    cutoff = time.time() - RUN_TTL_SECONDS
+    removed = 0
+    for child in TOOLS_ROOT.iterdir():
+        try:
+            if child.is_dir() and child.stat().st_mtime < cutoff:
+                shutil.rmtree(child, ignore_errors=True)
+                removed += 1
+        except OSError:
+            continue
+    return removed
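
The TTL sweep above keys off each run directory's `st_mtime`. A small self-contained sketch of that behaviour follows; `reap` and `demo` are illustrative names, not functions from this commit, and the sketch works in a throwaway temp directory rather than `TOOLS_ROOT`.

```python
from __future__ import annotations

import os
import shutil
import tempfile
import time
from pathlib import Path


def reap(root: Path, ttl_seconds: float, now: float | None = None) -> int:
    """Remove child directories of `root` whose mtime is older than the TTL."""
    cutoff = (time.time() if now is None else now) - ttl_seconds
    removed = 0
    for child in root.iterdir():
        if child.is_dir() and child.stat().st_mtime < cutoff:
            shutil.rmtree(child)
            removed += 1
    return removed


def demo() -> tuple[int, list[str]]:
    """Create one fresh and one backdated directory, then sweep with a 1h TTL."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fresh, stale = root / "fresh", root / "stale"
        fresh.mkdir()
        stale.mkdir()
        # Backdate the stale directory by two hours.
        two_hours_ago = time.time() - 2 * 3600
        os.utime(stale, (two_hours_ago, two_hours_ago))
        removed = reap(root, ttl_seconds=3600)
        return removed, sorted(p.name for p in root.iterdir())
```

Because a directory's mtime moves forward whenever an entry is added to or removed from it, a run that keeps writing new top-level files is treated as younger than its creation time.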

tools_api/subtitles.py ADDED

	@@ -0,0 +1,288 @@

+"""
+Subtitle generation: sidecar files (.srt/.vtt/.txt) and burn-in MP4.
+Reuses steps.s2_transcribe.transcribe and steps.s3_translate.translate as
+libraries. ffmpeg burn-in goes through subprocess (matches existing s5_sync
+pattern but without sharing code, since the styling needs are different).
+"""
+from __future__ import annotations
+import subprocess
+from pathlib import Path
+from typing import Literal
+from steps.s2_transcribe import transcribe
+from steps.s3_translate import translate
+Format = Literal["srt", "vtt", "txt", "mp4"]
+CaptionStyle = Literal["tiktok", "youtube", "minimal"]
+Position = Literal["top", "middle", "bottom"]
+HAlign = Literal["left", "center", "right"]
+# Bounds for user-adjustable knobs. Backend clamps to these regardless of
+# what the client sends.
+FONT_SIZE_MIN = 12
+FONT_SIZE_MAX = 40
+MARGIN_V_MIN = 0
+MARGIN_V_MAX = 240
+# ISO-style short codes Whisper accepts. Names map to UI dropdown labels.
+_LANG_CODE = {
+    "Auto-detect": "auto",
+    "English": "en", "Spanish": "es", "French": "fr", "German": "de",
+    "Portuguese": "pt", "Italian": "it", "Hindi": "hi", "Arabic": "ar",
+    "Chinese": "zh", "Japanese": "ja", "Korean": "ko", "Russian": "ru",
+}
+def _is_video(path: Path) -> bool:
+    return path.suffix.lower() in {".mp4", ".mov", ".webm", ".mkv", ".avi", ".m4v"}
+def _extract_audio(input_path: Path, out_dir: Path) -> Path:
+    """Pull a 16kHz mono WAV from the input — what whisper expects."""
+    audio_path = out_dir / "audio.wav"
+    cmd = [
+        "ffmpeg", "-y", "-i", str(input_path),
+        "-vn", "-ac", "1", "-ar", "16000",
+        "-acodec", "pcm_s16le",
+        str(audio_path),
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
+    if result.returncode != 0:
+        raise RuntimeError(f"ffmpeg audio extract failed: {result.stderr[-300:]}")
+    return audio_path
+def _resolve_lang(name: str) -> str:
+    return _LANG_CODE.get(name, "auto")
+# ── Caption format writers ─────────────────────────────────────────────
+def _seg_text(seg: dict, prefer_translation: bool) -> str:
+    if prefer_translation:
+        return (seg.get("translated_text") or seg.get("text") or "").strip()
+    return (seg.get("text") or "").strip()
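+def _format_timestamp_srt(t: float) -> str:
+    # Work in whole milliseconds so rounding carries into the seconds field
+    # (e.g. 1.9996 s becomes 00:00:02,000 rather than 00:00:01,1000).
+    total_ms = max(0, int(round(t * 1000)))
+    h, rem = divmod(total_ms, 3_600_000)
+    m, rem = divmod(rem, 60_000)
+    s, ms = divmod(rem, 1000)
+    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"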
+def _format_timestamp_vtt(t: float) -> str:
+    return _format_timestamp_srt(t).replace(",", ".")
+def write_srt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
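+    lines = []
+    index = 0
+    for seg in segments:
+        text = _seg_text(seg, prefer_translation)
+        if not text:
+            continue
+        # Number only emitted cues so indices stay sequential when empty
+        # segments are skipped.
+        index += 1
+        lines.append(str(index))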
+        lines.append(f"{_format_timestamp_srt(seg['start'])} --> {_format_timestamp_srt(seg['end'])}")
+        lines.append(text)
+        lines.append("")
+    dest.write_text("\n".join(lines), encoding="utf-8")
+    return dest
+def write_vtt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
+    lines = ["WEBVTT", ""]
+    for seg in segments:
+        text = _seg_text(seg, prefer_translation)
+        if not text:
+            continue
+        lines.append(f"{_format_timestamp_vtt(seg['start'])} --> {_format_timestamp_vtt(seg['end'])}")
+        lines.append(text)
+        lines.append("")
+    dest.write_text("\n".join(lines), encoding="utf-8")
+    return dest
+def write_txt(segments: list[dict], dest: Path, prefer_translation: bool) -> Path:
+    texts = (_seg_text(s, prefer_translation) for s in segments)
+    text = " ".join(t for t in texts if t)
+    dest.write_text(text, encoding="utf-8")
+    return dest
+# ── Burn-in styling ────────────────────────────────────────────────────
+# ASS-format alignment codes (libass), arranged as row + column:
+#   row: bottom=0, middle=3, top=6
+#   col: left=1,   center=2,  right=3
+# So bottom-left=1, bottom-center=2, ..., top-right=9.
+_POSITION_ROW = {"bottom": 0, "middle": 3, "top": 6}
+_HALIGN_COL = {"left": 1, "center": 2, "right": 3}
+_DEFAULT_MARGIN_V = {"bottom": 60, "middle": 0, "top": 60}
+# Per-style baseline — font size, stroke/shadow choices. The user can override
+# the font size via the slider; everything else stays tied to the style preset.
+_STYLE_DEFAULTS: dict[CaptionStyle, dict] = {
+    "tiktok":  {"font_size": 22, "bold": 1, "border_style": 1, "outline": 3, "shadow": 1},
+    "youtube": {"font_size": 18, "bold": 0, "border_style": 4, "outline": 8, "shadow": 0},
+    "minimal": {"font_size": 16, "bold": 0, "border_style": 1, "outline": 1, "shadow": 0},
+}
+def _clamp(value: int, lo: int, hi: int) -> int:
+    return max(lo, min(hi, value))
+def _force_style_for(
+    style: CaptionStyle,
+    position: Position,
+    h_align: HAlign = "center",
+    font_size: int | None = None,
+    margin_v: int | None = None,
+) -> str:
+    """Return an ffmpeg `subtitles=...:force_style='...'` string.
+    Args:
+        style: Visual preset — sets weight, stroke, shadow defaults.
+        position: top / middle / bottom row.
+        h_align: left / center / right column.
+        font_size: Override the style's default font size (clamped to FONT_SIZE_MIN..MAX).
+        margin_v: Override vertical margin in pixels (clamped to MARGIN_V_MIN..MAX).
+    """
+    defaults = _STYLE_DEFAULTS[style]
+    fs = _clamp(font_size if font_size is not None else defaults["font_size"],
+                FONT_SIZE_MIN, FONT_SIZE_MAX)
+    mv = _clamp(margin_v if margin_v is not None else _DEFAULT_MARGIN_V[position],
+                MARGIN_V_MIN, MARGIN_V_MAX)
+    align = _POSITION_ROW[position] + _HALIGN_COL[h_align]
+    parts = [
+        "FontName=Arial",
+        f"FontSize={fs}",
+        f"Bold={defaults['bold']}",
+        "PrimaryColour=&H00FFFFFF",
+    ]
+    if style == "youtube":
+        # White on translucent black box
+        parts.append("BackColour=&HB8000000")
+    elif style == "minimal":
+        # Subtle semi-transparent stroke instead of hard black
+        parts.append("OutlineColour=&H80000000")
+    else:  # tiktok — hard black stroke
+        parts.append("OutlineColour=&H00000000")
+    parts += [
+        f"BorderStyle={defaults['border_style']}",
+        f"Outline={defaults['outline']}",
+        f"Shadow={defaults['shadow']}",
+        f"Alignment={align}",
+        f"MarginV={mv}",
+        # Symmetric horizontal margins so left/right alignment has breathing room
+        "MarginL=40",
+        "MarginR=40",
+    ]
+    return ",".join(parts)
+def _burn_in(
+    video_path: Path,
+    srt_path: Path,
+    dest: Path,
+    style: CaptionStyle,
+    position: Position,
+    h_align: HAlign = "center",
+    font_size: int | None = None,
+    margin_v: int | None = None,
+) -> Path:
+    """Render captions into the video pixels via ffmpeg + libass."""
+    force_style = _force_style_for(style, position, h_align, font_size, margin_v)
+    # Escape the path for the ffmpeg subtitles filter: wrap it in single quotes
+    # and backslash-escape embedded single quotes and colons, which would
+    # otherwise break filter parsing.
+    srt_str = str(srt_path).replace("'", r"\'").replace(":", r"\:")
+    vf = f"subtitles='{srt_str}':force_style='{force_style}'"
+    cmd = [
+        "ffmpeg", "-y",
+        "-i", str(video_path),
+        "-vf", vf,
+        "-c:a", "copy",
+        "-c:v", "libx264",
+        "-preset", "veryfast",
+        "-crf", "22",
+        str(dest),
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
+    if result.returncode != 0:
+        raise RuntimeError(f"ffmpeg burn-in failed: {result.stderr[-300:]}")
+    return dest
+# ── Public entry point ────────────────────────────────────────────────
+def generate_subtitles(
+    *,
+    input_path: Path,
+    out_dir: Path,
+    source_lang_name: str,
+    target_lang_name: str,
+    fmt: Format,
+    style: CaptionStyle = "tiktok",
+    position: Position = "bottom",
+    h_align: HAlign = "center",
+    font_size: int | None = None,
+    margin_v: int | None = None,
+) -> dict:
+    """
+    Run the full subtitle pipeline. Returns:
+      {
+        "format": "srt" | "vtt" | "txt" | "mp4",
+        "filename": <name in out_dir>,
+        "segments": <int>,
+        "translated": <bool>,
+      }
+    """
+    is_burn = fmt == "mp4"
+    if is_burn and not _is_video(input_path):
+        raise ValueError("Burn-in requires a video file.")
+    # 1. Extract audio (or use as-is)
+    if _is_video(input_path):
+        audio_path = _extract_audio(input_path, out_dir)
+    else:
+        audio_path = input_path
+    # 2. Transcribe
+    src_code = _resolve_lang(source_lang_name)
+    segments = transcribe(str(audio_path), language=src_code)
+    if not segments:
+        raise RuntimeError("Transcription produced no segments.")
+    # 3. Translate if requested
+    translated = False
+    same_as_source = (
+        target_lang_name == "Same as source"
+        or target_lang_name.lower() == source_lang_name.lower()
+    )
+    if not same_as_source:
+        segments = translate(segments, target_lang_name)
+        translated = True
+    # 4. Emit
+    if fmt == "srt":
+        out = write_srt(segments, out_dir / "captions.srt", translated)
+    elif fmt == "vtt":
+        out = write_vtt(segments, out_dir / "captions.vtt", translated)
+    elif fmt == "txt":
+        out = write_txt(segments, out_dir / "transcript.txt", translated)
+    else:  # mp4
+        srt_path = write_srt(segments, out_dir / "captions.srt", translated)
+        out = _burn_in(
+            input_path, srt_path, out_dir / "captioned.mp4",
+            style, position, h_align, font_size, margin_v,
+        )
+    return {
+        "format": fmt,
+        "filename": out.name,
+        "segments": len(segments),
+        "translated": translated,
+    }
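
The alignment arithmetic and clamping described in the burn-in styling comments can be checked with a short sketch. The names below (`alignment_code`, `style_fragment`) are hypothetical and only mirror the row + column scheme and the bounds stated in the diff. How a given ffmpeg/libass build interprets `Alignment` inside `force_style` is an assumption worth verifying against the build in use.

```python
# Row and column offsets as described in the diff's comments.
POSITION_ROW = {"bottom": 0, "middle": 3, "top": 6}
HALIGN_COL = {"left": 1, "center": 2, "right": 3}

# Bounds copied from the diff's FONT_SIZE_* and MARGIN_V_* constants.
FONT_SIZE_MIN, FONT_SIZE_MAX = 12, 40
MARGIN_V_MIN, MARGIN_V_MAX = 0, 240


def clamp(value: int, lo: int, hi: int) -> int:
    """Limit `value` to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def alignment_code(position: str, h_align: str) -> int:
    """Combine row and column offsets into a single 1..9 alignment code."""
    return POSITION_ROW[position] + HALIGN_COL[h_align]


def style_fragment(position: str, h_align: str, font_size: int, margin_v: int) -> str:
    """Build the size/alignment/margin part of a force_style string."""
    return ",".join([
        f"FontSize={clamp(font_size, FONT_SIZE_MIN, FONT_SIZE_MAX)}",
        f"Alignment={alignment_code(position, h_align)}",
        f"MarginV={clamp(margin_v, MARGIN_V_MIN, MARGIN_V_MAX)}",
    ])
```

For instance, `style_fragment("top", "right", 100, -5)` yields `FontSize=40,Alignment=9,MarginV=0`: the oversized font and the negative margin are both pulled back into range, and top + right gives 6 + 3 = 9.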

tools_api/voice_clone.py ADDED

	@@ -0,0 +1,241 @@

+"""
+Voice clone playground — single-engine TTS from a sample + text input.
+This Space runs only ONE engine (s4_tts enforces TTS_ENGINE match), so the
+endpoint accepts no engine parameter. The frontend is responsible for fanning
+out to multiple Spaces when the user wants comparison output.
+Long text is split into ~200-char chunks at sentence/word boundaries and
+synthesised as multiple segments, then concatenated into one MP3.
+"""
+from __future__ import annotations
+import os
+import re
+import subprocess
+from pathlib import Path
+from steps.s4_tts import synthesise_segments
+_AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".aac"}
+def _prepare_sample(sample_path: Path, out_dir: Path) -> Path:
+    """Convert any uploaded sample (audio or video) to a clean 24kHz mono WAV.
+    TTS internals (s4_tts) call torchaudio.load via libsndfile, which only
+    understands WAV/FLAC. Anything else — including MP4 video, MP3, M4A —
+    has to be re-encoded first. We do this here so callers don't need to.
+    """
+    out = out_dir / "sample_prepared.wav"
+    cmd = [
+        "ffmpeg", "-y", "-i", str(sample_path),
+        "-vn",                # drop video stream if present
+        "-ac", "1",           # mono
+        "-ar", "24000",       # 24kHz — sweet spot for the TTS engines
+        "-acodec", "pcm_s16le",
+        str(out),
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
+    if result.returncode != 0:
+        raise ValueError(
+            "Couldn't read the uploaded sample. Use a clean audio file "
+            "(WAV, MP3, M4A) or a video with an audio track."
+        )
+    return out
+def _isolate_vocals(prepared_sample: Path, out_dir: Path) -> Path:
+    """Run Demucs source separation on the prepared sample and return a
+    vocals-only WAV (24kHz mono) suitable for TTS reference.
+    Mirrors what the dub pipeline (steps.s1b_separate) does so cloned voice
+    doesn't pick up music / ambient noise from the uploaded sample. Falls back
+    to the raw prepared sample if separation fails (model missing, oom, etc.)
+    rather than failing the whole clone request.
+    """
+    try:
+        from steps.s1b_separate import separate_audio
+    except ImportError as e:
+        print(f"[voice_clone] Demucs unavailable, skipping vocal isolation: {e}")
+        return prepared_sample
+    separate_dir = out_dir / "separate"
+    separate_dir.mkdir(parents=True, exist_ok=True)
+    try:
+        vocals_16k_path, _accompaniment = separate_audio(str(prepared_sample), str(separate_dir))
+    except Exception as e:
+        print(f"[voice_clone] Demucs separation failed, using raw sample: {e}")
+        return prepared_sample
+    # Resample vocals from 16 kHz mono → 24 kHz mono for the TTS engines
+    vocals_24k = out_dir / "vocals_24k.wav"
+    cmd = [
+        "ffmpeg", "-y", "-i", vocals_16k_path,
+        "-ac", "1", "-ar", "24000",
+        "-acodec", "pcm_s16le",
+        str(vocals_24k),
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
+    if result.returncode != 0:
+        print(f"[voice_clone] Vocals resample failed, using 16kHz: {result.stderr[-200:]}")
+        return Path(vocals_16k_path)
+    return vocals_24k
+CHUNK_TARGET_CHARS = 200
+CHUNK_HARD_MAX = 280  # under chatterbox's 300-char per-segment ceiling
+def _split_text(text: str) -> list[str]:
+    """Split into chunks of ~CHUNK_TARGET_CHARS at sentence then word boundaries."""
+    text = text.strip()
+    if not text:
+        return []
+    if len(text) <= CHUNK_HARD_MAX:
+        return [text]
+    # First pass: sentence boundaries
+    sentences = re.split(r"(?<=[.!?])\s+", text)
+    chunks: list[str] = []
+    current = ""
+    for sent in sentences:
+        if not sent.strip():
+            continue
+        if len(current) + 1 + len(sent) <= CHUNK_TARGET_CHARS:
+            current = f"{current} {sent}".strip() if current else sent
+        else:
+            if current:
+                chunks.append(current)
+            # Sentence itself may exceed target — break it on words
+            if len(sent) > CHUNK_HARD_MAX:
+                words = sent.split()
+                buf = ""
+                for w in words:
+                    if len(buf) + 1 + len(w) > CHUNK_HARD_MAX:
+                        if buf:
+                            chunks.append(buf)
+                        buf = w
+                    else:
+                        buf = f"{buf} {w}".strip() if buf else w
+                # Carry the trailing words (possibly empty) into the next chunk.
+                current = buf
+            else:
+                current = sent
+    if current:
+        chunks.append(current)
+    return chunks
+def _build_segments(chunks: list[str], chunk_secs: float = 8.0) -> list[dict]:
+    """Construct segment dicts for synthesise_segments — fake timing windows."""
+    segs = []
+    cursor = 0.0
+    for text in chunks:
+        # Allocate a generous window so _trim_to_duration doesn't clip output.
+        # Headroom is 1.4× so 8s window allows up to ~11s of audio per chunk.
+        segs.append({
+            "start": cursor,
+            "end": cursor + chunk_secs,
+            "text": text,
+            "translated_text": text,
+            "tts_text": text,
+        })
+        cursor += chunk_secs
+    return segs
+def _concat_wavs_to_mp3(wav_paths: list[Path], dest: Path) -> Path:
+    """Concat in order via ffmpeg concat demuxer, then encode MP3."""
+    if not wav_paths:
+        raise RuntimeError("No TTS chunks to concatenate.")
+    if len(wav_paths) == 1:
+        cmd = [
+            "ffmpeg", "-y", "-i", str(wav_paths[0]),
+            "-codec:a", "libmp3lame", "-b:a", "192k",
+            str(dest),
+        ]
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
+        if result.returncode != 0:
+            raise RuntimeError(f"ffmpeg encode failed: {result.stderr[-300:]}")
+        return dest
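+    list_file = dest.with_suffix(".txt")
+    # The concat demuxer resolves relative entries against the list file's
+    # directory, so write absolute paths; a literal ' is escaped as '\''.
+    entries = [p.resolve().as_posix().replace("'", "'\\''") for p in wav_paths]
+    list_file.write_text(
+        "\n".join(f"file '{e}'" for e in entries),
+        encoding="utf-8",
+    )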
+    cmd = [
+        "ffmpeg", "-y",
+        "-f", "concat", "-safe", "0",
+        "-i", str(list_file),
+        "-codec:a", "libmp3lame", "-b:a", "192k",
+        str(dest),
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True, timeout=180)
+    list_file.unlink(missing_ok=True)
+    if result.returncode != 0:
+        raise RuntimeError(f"ffmpeg concat failed: {result.stderr[-300:]}")
+    return dest
+def clone_voice(
+    *,
+    sample_path: Path,
+    text: str,
+    out_dir: Path,
+    language_id: str = "en",
+) -> dict:
+    """
+    Run TTS on `text` using the voice from `sample_path`. Returns:
+      {
+        "filename": "voice.mp3",
+        "engine": <current TTS_ENGINE>,
+        "chunks": <int>,
+      }
+    """
+    text = (text or "").strip()
+    if not text:
+        raise ValueError("Text is required.")
+    chunks = _split_text(text)
+    segments = _build_segments(chunks)
+    # Normalise the sample (handles video, mp3, m4a, etc.) → 24kHz mono WAV
+    prepared_sample = _prepare_sample(sample_path, out_dir)
+    # Demucs source separation → isolate vocals so the clone doesn't pick up
+    # background music or ambient noise. Same step the dub pipeline uses.
+    reference_for_tts = _isolate_vocals(prepared_sample, out_dir)
+    seg_out_dir = out_dir / "tts"
+    seg_out_dir.mkdir(parents=True, exist_ok=True)
+    tts_result = None
+    for msg in synthesise_segments(
+        segments=segments,
+        reference_audio_path=str(reference_for_tts),
+        language_id=language_id,
+        output_dir=str(seg_out_dir),
+    ):
+        if isinstance(msg, dict) and "__TTS_RESULT__" in msg:
+            tts_result = msg["__TTS_RESULT__"]
+    if not tts_result:
+        raise RuntimeError("TTS produced no output.")
+    wav_paths = [Path(seg["tts_path"]) for seg in tts_result if seg.get("tts_path")]
+    if not wav_paths:
+        raise RuntimeError("TTS result missing audio paths.")
+    mp3_path = _concat_wavs_to_mp3(wav_paths, out_dir / "voice.mp3")
+    return {
+        "filename": mp3_path.name,
+        "engine": os.getenv("TTS_ENGINE", "chatterbox").lower(),
+        "chunks": len(chunks),
+    }
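
The sentence-packing idea behind `_split_text` can be sketched in a simplified form. This is an illustration under assumptions, not the function from the diff: `pack_sentences` is a hypothetical name, it uses only a single target length, and it omits the word-level fallback for oversized sentences.

```python
import re


def pack_sentences(text: str, target: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `target` chars.

    A single sentence longer than `target` is kept intact as its own chunk
    (the diff's version additionally splits such sentences on words).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        candidate = f"{current} {sent}" if current else sent
        if len(candidate) <= target or not current:
            # Either the sentence fits, or it starts a new (possibly long) chunk.
            current = candidate
        else:
            chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks
```

With a small target the behaviour is easy to follow: `pack_sentences("One. Two. Three.", target=10)` groups the first two sentences (9 characters) and starts a new chunk for the third.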