Spaces: Rafii/videovoice

Running on Zero


github-actions[bot] committed 8 days ago

Commit 12ab2ca · 1 parent: ffee483

deploy: switch to chatterbox requirements @ a95fda4


Files changed (6)

graphify-out/GRAPH_REPORT.md +121 -111
graphify-out/graph.html +0 -0
server.py +2 -2
tools_api/__init__.py +1 -0
tools_api/dramabox.py +181 -0
tools_api/router.py +71 -1
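The per-file counts above can be totalled mechanically. The following is a minimal Python sketch. It assumes each entry keeps the `<path> +<added> -<deleted>` shape shown on this page, and `total_diffstat` is a hypothetical helper, not part of graphify or the Hugging Face UI. `graph.html` is listed as changed but reports `+0 -0`, probably because its diff is not rendered inline (an assumption). The totals therefore cover only the rendered lines.

```python
import re

# Diffstat entries copied from the "Files changed (6)" list on this page.
DIFFSTAT = """\
graphify-out/GRAPH_REPORT.md +121 -111
graphify-out/graph.html +0 -0
server.py +2 -2
tools_api/__init__.py +1 -0
tools_api/dramabox.py +181 -0
tools_api/router.py +71 -1
"""

# Assumed entry shape: <path> +<added> -<deleted>
LINE_RE = re.compile(r"^(?P<path>\S+) \+(?P<added>\d+) -(?P<deleted>\d+)$")


def total_diffstat(text: str) -> tuple[int, int, int]:
    """Return (files, lines_added, lines_deleted) summed over all entries."""
    files = added = deleted = 0
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip blank or unrecognised lines
        files += 1
        added += int(m.group("added"))
        deleted += int(m.group("deleted"))
    return files, added, deleted


if __name__ == "__main__":
    files, added, deleted = total_diffstat(DIFFSTAT)
    print(f"{files} files changed, +{added} -{deleted}")
```

For this commit the sketch reports 6 files, 376 added lines and 114 deleted lines. Nearly half of the additions come from the new `tools_api/dramabox.py`.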

graphify-out/GRAPH_REPORT.md CHANGED
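The headline change in this report is in its Corpus Check and Summary counts. The following minimal Python sketch compares the old (2026-05-16) and new (2026-05-17) Summary lines. `parse_counts` and `deltas` are hypothetical helpers, and they assume the `<number> <label>` layout that these report lines use. They are not a graphify API.

```python
import re

# Summary lines from the old (2026-05-16) and new (2026-05-17) report.
OLD = "- 1050 nodes · 1833 edges · 62 communities detected"
NEW = "- 1065 nodes · 1859 edges · 64 communities detected"

# Matches "<number> <label>" pairs such as "1050 nodes" or "253,292 words".
PAIR_RE = re.compile(r"(\d[\d,]*) (\w+)")


def parse_counts(line: str) -> dict[str, int]:
    """Map each label (e.g. nodes, edges, communities) to its integer count."""
    return {label: int(num.replace(",", "")) for num, label in PAIR_RE.findall(line)}


def deltas(old: str, new: str) -> dict[str, int]:
    """Return new minus old for every label present in both lines."""
    a, b = parse_counts(old), parse_counts(new)
    return {k: b[k] - a[k] for k in a if k in b}


if __name__ == "__main__":
    for label, d in deltas(OLD, NEW).items():
        print(f"{label}: {d:+d}")
```

Applied to the two Summary lines, the sketch reports +15 nodes, +26 edges and +2 communities. The same helper applied to the Corpus Check lines (`59 files · ~253,292 words` versus `60 files · ~254,726 words`) gives +1 file and +1,434 words.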

@@ -1,12 +1,12 @@
-# Graph Report - VideoVoice-be  (2026-05-16)
 ## Corpus Check
-- 59 files · ~253,292 words
 - Verdict: corpus is large enough that graph structure adds value.
 ## Summary
-- 1050 nodes · 1833 edges · 62 communities detected
-- Extraction: 79% EXTRACTED · 21% INFERRED · 0% AMBIGUOUS · INFERRED: 389 edges (avg confidence: 0.62)
 - Token cost: 0 input · 0 output
 ## Community Hubs (Navigation)
@@ -32,9 +32,9 @@
 - [[_COMMUNITY_Community 19|Community 19]]
 - [[_COMMUNITY_Community 20|Community 20]]
 - [[_COMMUNITY_Community 21|Community 21]]
 - [[_COMMUNITY_Community 23|Community 23]]
-- [[_COMMUNITY_Community 31|Community 31]]
-- [[_COMMUNITY_Community 32|Community 32]]
 - [[_COMMUNITY_Community 33|Community 33]]
 - [[_COMMUNITY_Community 34|Community 34]]
 - [[_COMMUNITY_Community 35|Community 35]]
@@ -72,6 +72,8 @@
 - [[_COMMUNITY_Community 67|Community 67]]
 - [[_COMMUNITY_Community 68|Community 68]]
 - [[_COMMUNITY_Community 69|Community 69]]
 ## God Nodes (most connected - your core abstractions)
 1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
@@ -90,8 +92,8 @@
   requirements.txt → requirements-omni.txt
 - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)`  [INFERRED] [semantically similar]
   requirements.txt → requirements-omni.txt
-- `content_length_middleware()` --calls--> `enforce_content_length_limit()`  [INFERRED]
-  app.py → server.py
 - `run_pipeline()` --calls--> `separate_audio()`  [INFERRED]
   pipeline.py → steps/s1b_separate.py
 - `run_pipeline()` --calls--> `transcribe()`  [INFERRED]
@@ -106,47 +108,47 @@
 ### Community 0 - "Community 0"
 Cohesion: 0.04
-Nodes (69): Qwen3TTSConfig, Qwen3TTSSpeakerEncoderConfig, Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig, r"""     This is the configuration class to store the configuration of a [`Qwen3, r"""     This is the configuration class to store the configuration of a [`Qwen3, This is the configuration class to store the configuration of a [`Qwen3TTSForCon, r"""     This is the configuration class to store the configuration of a [`Qwen3 (+61 more)
 ### Community 1 - "Community 1"
 Cohesion: 0.02
 Nodes (118): api_run_pipeline(), content_length_middleware(), ZeroGPU-compatible entrypoint using gradio.Server. Server extends FastAPI, so al, Exposed through Gradio's API engine.     ZeroGPU will allocate a GPU when this e, run_pipeline(), BaseHTTPMiddleware, BaseModel, _artifact_reaper_loop() (+110 more)
 ### Community 2 - "Community 2"
-Cohesion: 0.05
-Nodes (57): ABC, BasePoster, Abstract base class for platform posters., Save a debug screenshot on failure., BasePoster, _build_system_prompt(), _build_user_prompt(), format_caption() (+49 more)
 ### Community 3 - "Community 3"
 Cohesion: 0.05
-Nodes (59): _collect_output(), _log_step_done(), main(), pipeline.py — Core pipeline: CLI entrypoint + importable run_pipeline() for Grad, Print duration + separator line for a completed step., Collect all yields and the return value from the generator., Run the full translation pipeline, yielding progress messages.      Args:, run_pipeline() (+51 more)
 ### Community 4 - "Community 4"
 Cohesion: 0.06
-Nodes (55): forward(), generate(), generate_speaker_prompt(), main(), _prefetch_chatterbox(), _prefetch_demucs(), _prefetch_faster_whisper(), Prefetch model weights into HF_HOME for faster cold starts on Spaces. (+47 more)
 ### Community 5 - "Community 5"
 Cohesion: 0.06
 Nodes (59): post(), _assign_words_to_segments(), _extract_words(), _get_faster_whisper_model(), _get_local_whisper_backend(), _get_openai_whisper_model(), _normalise_segments(), Step 3: Transcribe audio with timestamps.  Primary local backend (device-depende (+51 more)
 ### Community 6 - "Community 6"
-Cohesion: 0.06
-Nodes (31): _audio_to_tuple(), _build_choices_and_map(), build_demo(), build_parser(), _collect_gen_kwargs(), _detect_model_kind(), _dtype_from_str(), main() (+23 more)
-### Community 7 - "Community 7"
 Cohesion: 0.07
-Nodes (25): DistributedGroupResidualVectorQuantization, Efficient distributed group residual vector quantization implementation.     Fol, dynamic_range_compression_torch(), MelSpectrogramFeatures, x: torch.Tensor, shape = (T, D)             q: torch.Tensor, shape = (T, D), x : torch.Tensor, shape = (n_mels, n_ctx)             the mel spectrogram of the, Calculate the BigVGAN style mel spectrogram of an input signal.     Args:, spectral_normalize_torch() (+17 more)
-### Community 8 - "Community 8"
 Cohesion: 0.05
 Nodes (49): FFmpeg concat list (synced TTS), Try-Now app panel, app.js script ref, Comparison table (HeyGen, Rask, ElevenLabs, Synthesia), Hero section + 23+ languages, Frontend index.html, Source/target language selectors, Pricing tiers (Free/Starter/Creator) (+41 more)
 ### Community 9 - "Community 9"
 Cohesion: 0.09
 Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)
 ### Community 10 - "Community 10"
-Cohesion: 0.1
-Nodes (14): default(), DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb(), preprocess() (+6 more)
 ### Community 11 - "Community 11"
 Cohesion: 0.08
@@ -154,20 +156,20 @@ Nodes (32): _apply_demucs(), _get_model(), _load_and_normalise(), Step 1b: Separ
 ### Community 12 - "Community 12"
 Cohesion: 0.1
-Nodes (31): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., Translate the text of each segment into target_language in batches.      Args:, translate(), _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client() (+23 more)
 ### Community 13 - "Community 13"
 Cohesion: 0.12
 Nodes (27): build_for_job(), ensure_transcription(), extract_audio_hq(), extract_reference_audio(), get_audio_duration(), get_device(), load_chatterbox(), main() (+19 more)
 ### Community 14 - "Community 14"
-Cohesion: 0.11
-Nodes (25): tools_api — Standalone endpoints for creator quick tools.  Lives alongside the m, audio_cleanup_endpoint(), _ext_to_media_type(), APIRouter for /api/tools/* endpoints.  Each endpoint is sync request-response (n, Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS., Manual reap trigger (mostly for testing). Auto-reap runs on a timer., Stream upload to disk, enforcing the tools size cap., _reap() (+17 more)
-### Community 15 - "Community 15"
 Cohesion: 0.12
 Nodes (23): build_t3_cond(), main(), prepare_sample(), prepare_sample.py — Turn one dataset.jsonl row into the exact tensors T3.loss(), Build the speaker conditioning (frozen during training)., MTLTokenizer + SOT/EOT padding (mirrors what generate() does internally)., S3Tokenizer on the target dubbed audio → speech tokens (the LABEL).      Critica, Turn one dataset row into ready-to-train tensors. (+15 more)
 ### Community 16 - "Community 16"
 Cohesion: 0.19
 Nodes (18): _burn_in(), _clamp(), _extract_audio(), _force_style_for(), _format_timestamp_srt(), _format_timestamp_vtt(), generate_subtitles(), _is_video() (+10 more)
@@ -189,260 +191,268 @@ Cohesion: 0.27
 Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline.  Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)
 ### Community 21 - "Community 21"
 Cohesion: 0.33
 Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var
-### Community 23 - "Community 23"
 Cohesion: 1.0
 Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)
-### Community 31 - "Community 31"
 Cohesion: 1.0
 Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.
-### Community 32 - "Community 32"
 Cohesion: 1.0
 Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te
-### Community 33 - "Community 33"
 Cohesion: 1.0
 Nodes (1): Voice clone speech using the Base model.          You can provide either:
-### Community 34 - "Community 34"
 Cohesion: 1.0
 Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct
-### Community 35 - "Community 35"
 Cohesion: 1.0
 Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option
-### Community 36 - "Community 36"
 Cohesion: 1.0
 Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.
-### Community 37 - "Community 37"
 Cohesion: 1.0
 Nodes (1): Reject oversized uploads before body parsing.
-### Community 38 - "Community 38"
 Cohesion: 1.0
 Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job
-### Community 39 - "Community 39"
 Cohesion: 1.0
 Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.
-### Community 40 - "Community 40"
 Cohesion: 1.0
 Nodes (1): Return curated showcase entries with resolved streaming URLs.
-### Community 41 - "Community 41"
 Cohesion: 1.0
 Nodes (1): Submit a video for translation.
-### Community 42 - "Community 42"
 Cohesion: 1.0
 Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.
-### Community 43 - "Community 43"
 Cohesion: 1.0
 Nodes (1): User selects a TTS model after previewing.
-### Community 44 - "Community 44"
 Cohesion: 1.0
 Nodes (1): Serve a preview audio WAV file.
-### Community 45 - "Community 45"
 Cohesion: 1.0
 Nodes (1): Download the translated video.
-### Community 46 - "Community 46"
 Cohesion: 1.0
 Nodes (1): Create artifact directories and start background cleanup.
-### Community 47 - "Community 47"
 Cohesion: 1.0
 Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.
-### Community 48 - "Community 48"
 Cohesion: 1.0
 Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura
-### Community 49 - "Community 49"
 Cohesion: 1.0
 Nodes (1): Insert extra silence distributed across detected pause points.
-### Community 50 - "Community 50"
 Cohesion: 1.0
 Nodes (1): Generate a silent WAV file of given duration.
-### Community 51 - "Community 51"
 Cohesion: 1.0
 Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single
-### Community 52 - "Community 52"
 Cohesion: 1.0
 Nodes (1): Translate the text of each segment into target_language in batches.      Args:
-### Community 53 - "Community 53"
 Cohesion: 1.0
 Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int
-### Community 54 - "Community 54"
 Cohesion: 1.0
 Nodes (1): Remove trailing noise/artifacts after speech ends.
-### Community 55 - "Community 55"
 Cohesion: 1.0
 Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.
-### Community 56 - "Community 56"
 Cohesion: 1.0
 Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.
-### Community 57 - "Community 57"
 Cohesion: 1.0
 Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.
-### Community 58 - "Community 58"
 Cohesion: 1.0
 Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated
-### Community 59 - "Community 59"
 Cohesion: 1.0
 Nodes (1): Generate speech for all segments using OmniVoice voice cloning.
-### Community 60 - "Community 60"
 Cohesion: 1.0
 Nodes (1): Synthesise translated text for each segment using voice cloned from reference au
-### Community 61 - "Community 61"
 Cohesion: 1.0
 Nodes (1): torch==2.6.0
-### Community 62 - "Community 62"
 Cohesion: 1.0
 Nodes (1): fastapi
-### Community 63 - "Community 63"
 Cohesion: 1.0
 Nodes (1): yt-dlp
-### Community 64 - "Community 64"
 Cohesion: 1.0
 Nodes (1): diffusers==0.29.0
-### Community 65 - "Community 65"
 Cohesion: 1.0
 Nodes (1): ARTIFACTS_ROOT env
-### Community 66 - "Community 66"
 Cohesion: 1.0
 Nodes (1): AWS g4dn.xlarge alternative
-### Community 67 - "Community 67"
 Cohesion: 1.0
 Nodes (1): nodejs (system pkg)
-### Community 68 - "Community 68"
 Cohesion: 1.0
 Nodes (1): fonts-noto-core / cjk
-### Community 69 - "Community 69"
 Cohesion: 1.0
 Nodes (1): graphify project rules
 ## Knowledge Gaps
-- **321 isolated node(s):** `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.`, `Read media duration from ffprobe.`, `Report CUDA/MPS availability.` (+316 more)
   These have ≤1 connection - possible missing edges or undocumented components.
-- **Thin community `Community 23`** (2 nodes): `gradio==6.8.0`, `gradio==6.12.0 (omni)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 31`** (1 nodes): `Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 32`** (1 nodes): `Build voice-clone prompt items from reference audio (and optionally reference te`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 33`** (1 nodes): `Voice clone speech using the Base model.          You can provide either:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 34`** (1 nodes): `Generate speech with the VoiceDesign model using natural-language style instruct`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 35`** (1 nodes): `Generate speech with the CustomVoice model using a predefined speaker id, option`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 36`** (1 nodes): `Delete stale per-job artifact directories from ARTIFACTS_ROOT.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 37`** (1 nodes): `Reject oversized uploads before body parsing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 38`** (1 nodes): `Run the translation pipeline in a background thread, pushing progress to the job`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 39`** (1 nodes): `List whitelisted MP4 demo videos from outputs/ and data/.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 40`** (1 nodes): `Return curated showcase entries with resolved streaming URLs.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 41`** (1 nodes): `Submit a video for translation.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 42`** (1 nodes): `Poll endpoint returning new messages since index `after`, plus live wait status.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 43`** (1 nodes): `User selects a TTS model after previewing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 44`** (1 nodes): `Serve a preview audio WAV file.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 45`** (1 nodes): `Download the translated video.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 46`** (1 nodes): `Create artifact directories and start background cleanup.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 47`** (1 nodes): `Sync TTS audio using pause-aware strategy: compress silences first, then atempo.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 48`** (1 nodes): `Rewrite WAV with silence regions compressed to keep_ratio of their original dura`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 49`** (1 nodes): `Insert extra silence distributed across detected pause points.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 50`** (1 nodes): `Generate a silent WAV file of given duration.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 51`** (1 nodes): `Sync each TTS segment to its original timestamp window and stitch into a single`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 52`** (1 nodes): `Translate the text of each segment into target_language in batches.      Args:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 53`** (1 nodes): `Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 54`** (1 nodes): `Remove trailing noise/artifacts after speech ends.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 55`** (1 nodes): `Hard-trim TTS output to orig_dur * headroom, with a short fade-out.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 56`** (1 nodes): `Clip audio to max_sec to prevent excessively slow voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 57`** (1 nodes): `Numpy variant of _trim_trailing_noise for engines returning np.ndarray.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 58`** (1 nodes): `Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 59`** (1 nodes): `Generate speech for all segments using OmniVoice voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 60`** (1 nodes): `Synthesise translated text for each segment using voice cloned from reference au`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 61`** (1 nodes): `torch==2.6.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 62`** (1 nodes): `fastapi`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 63`** (1 nodes): `yt-dlp`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 64`** (1 nodes): `diffusers==0.29.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 65`** (1 nodes): `ARTIFACTS_ROOT env`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 66`** (1 nodes): `AWS g4dn.xlarge alternative`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 67`** (1 nodes): `nodejs (system pkg)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 68`** (1 nodes): `fonts-noto-core / cjk`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
-- **Thin community `Community 69`** (1 nodes): `graphify project rules`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
 ## Suggested Questions
 _Questions this graph is uniquely positioned to answer:_
-- **Why does `synthesise_segments()` connect `Community 4` to `Community 11`, `Community 3`?**
   _High betweenness centrality (0.324) - this node is a cross-community bridge._
-- **Why does `generate()` connect `Community 4` to `Community 0`, `Community 6`?**
-  _High betweenness centrality (0.209) - this node is a cross-community bridge._
 - **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
@@ -452,4 +462,4 @@ _Questions this graph is uniquely positioned to answer:_
 - **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **What connects `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
-  _321 weakly-connected nodes found - possible documentation gaps or missing edges._

+# Graph Report - VideoVoice-be  (2026-05-17)
 ## Corpus Check
+- 60 files · ~254,726 words
 - Verdict: corpus is large enough that graph structure adds value.
 ## Summary
+- 1065 nodes · 1859 edges · 64 communities detected
+- Extraction: 79% EXTRACTED · 21% INFERRED · 0% AMBIGUOUS · INFERRED: 397 edges (avg confidence: 0.62)
 - Token cost: 0 input · 0 output
 ## Community Hubs (Navigation)
 - [[_COMMUNITY_Community 19|Community 19]]
 - [[_COMMUNITY_Community 20|Community 20]]
 - [[_COMMUNITY_Community 21|Community 21]]
+- [[_COMMUNITY_Community 22|Community 22]]
 - [[_COMMUNITY_Community 23|Community 23]]
+- [[_COMMUNITY_Community 25|Community 25]]
 - [[_COMMUNITY_Community 33|Community 33]]
 - [[_COMMUNITY_Community 34|Community 34]]
 - [[_COMMUNITY_Community 35|Community 35]]
 - [[_COMMUNITY_Community 67|Community 67]]
 - [[_COMMUNITY_Community 68|Community 68]]
 - [[_COMMUNITY_Community 69|Community 69]]
+- [[_COMMUNITY_Community 70|Community 70]]
+- [[_COMMUNITY_Community 71|Community 71]]
 ## God Nodes (most connected - your core abstractions)
 1. `Qwen3TTSSpeakerEncoderConfig` - 49 edges
   requirements.txt → requirements-omni.txt
 - `gradio==6.8.0` --semantically_similar_to--> `gradio==6.12.0 (omni)`  [INFERRED] [semantically similar]
   requirements.txt → requirements-omni.txt
+- `enforce_content_length_limit()` --calls--> `content_length_middleware()`  [INFERRED]
+  server.py → app.py
 - `run_pipeline()` --calls--> `separate_audio()`  [INFERRED]
   pipeline.py → steps/s1b_separate.py
 - `run_pipeline()` --calls--> `transcribe()`  [INFERRED]
 ### Community 0 - "Community 0"
 Cohesion: 0.04
+Nodes (70): Qwen3TTSConfig, Qwen3TTSSpeakerEncoderConfig, Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig, r"""     This is the configuration class to store the configuration of a [`Qwen3, r"""     This is the configuration class to store the configuration of a [`Qwen3, This is the configuration class to store the configuration of a [`Qwen3TTSForCon, r"""     This is the configuration class to store the configuration of a [`Qwen3 (+62 more)
 ### Community 1 - "Community 1"
 Cohesion: 0.02
 Nodes (118): api_run_pipeline(), content_length_middleware(), ZeroGPU-compatible entrypoint using gradio.Server. Server extends FastAPI, so al, Exposed through Gradio's API engine.     ZeroGPU will allocate a GPU when this e, run_pipeline(), BaseHTTPMiddleware, BaseModel, _artifact_reaper_loop() (+110 more)
 ### Community 2 - "Community 2"
+Cohesion: 0.04
+Nodes (38): default(), DistributedGroupResidualVectorQuantization, DistributedResidualVectorQuantization, ema_inplace(), EuclideanCodebook, kmeans(), laplace_smoothing(), postprocess_emb() (+30 more)
 ### Community 3 - "Community 3"
 Cohesion: 0.05
+Nodes (57): ABC, BasePoster, Abstract base class for platform posters., Save a debug screenshot on failure., BasePoster, _build_system_prompt(), _build_user_prompt(), format_caption() (+49 more)
 ### Community 4 - "Community 4"
 Cohesion: 0.06
+Nodes (31): _audio_to_tuple(), _build_choices_and_map(), build_demo(), build_parser(), _collect_gen_kwargs(), _detect_model_kind(), _dtype_from_str(), main() (+23 more)
 ### Community 5 - "Community 5"
 Cohesion: 0.06
 Nodes (59): post(), _assign_words_to_segments(), _extract_words(), _get_faster_whisper_model(), _get_local_whisper_backend(), _get_openai_whisper_model(), _normalise_segments(), Step 3: Transcribe audio with timestamps.  Primary local backend (device-depende (+51 more)
 ### Community 6 - "Community 6"
 Cohesion: 0.07
+Nodes (50): forward(), generate(), generate_speaker_prompt(), from_pretrained(), _clip_audio(), _ensure_browser_wav(), _filter_preview_segments(), _free_memory() (+42 more)
+### Community 7 - "Community 7"
 Cohesion: 0.05
 Nodes (49): FFmpeg concat list (synced TTS), Try-Now app panel, app.js script ref, Comparison table (HeyGen, Rask, ElevenLabs, Synthesia), Hero section + 23+ languages, Frontend index.html, Source/target language selectors, Pricing tiers (Free/Starter/Creator) (+41 more)
+### Community 8 - "Community 8"
+Cohesion: 0.07
+Nodes (35): _collect_output(), _log_step_done(), main(), pipeline.py — Core pipeline: CLI entrypoint + importable run_pipeline() for Grad, Print duration + separator line for a completed step., Collect all yields and the return value from the generator., Run the full translation pipeline, yielding progress messages.      Args:, run_pipeline() (+27 more)
 ### Community 9 - "Community 9"
 Cohesion: 0.09
 Nodes (27): $(), clearFile(), createDemoCard(), detectPlatform(), formatBytes(), formatDemoDate(), formatDemoTitle(), getUsedVideos() (+19 more)
 ### Community 10 - "Community 10"
+Cohesion: 0.09
+Nodes (34): Step 4: Translate segment texts using Pollinations chat completions API (OpenAI-, Translate a batch of segments into target_language., _translate_batch(), bedrock_converse(), bedrock_fallback(), build_client(), log_llm_call(), parse_json_array() (+26 more)
 ### Community 11 - "Community 11"
 Cohesion: 0.08
 ### Community 12 - "Community 12"
 Cohesion: 0.1
+Nodes (28): tools_api — Standalone endpoints for creator quick tools.  Lives alongside the m, audio_cleanup_endpoint(), dramabox_endpoint(), _ext_to_media_type(), APIRouter for /api/tools/* endpoints.  Each endpoint is sync request-response (n, Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS., Manual reap trigger (mostly for testing). Auto-reap runs on a timer., Serve a generated artifact. Run dirs auto-expire after RUN_TTL_SECONDS. (+20 more)
 ### Community 13 - "Community 13"
 Cohesion: 0.12
 Nodes (27): build_for_job(), ensure_transcription(), extract_audio_hq(), extract_reference_audio(), get_audio_duration(), get_device(), load_chatterbox(), main() (+19 more)
 ### Community 14 - "Community 14"
 Cohesion: 0.12
 Nodes (23): build_t3_cond(), main(), prepare_sample(), prepare_sample.py — Turn one dataset.jsonl row into the exact tensors T3.loss(), Build the speaker conditioning (frozen during training)., MTLTokenizer + SOT/EOT padding (mirrors what generate() does internally)., S3Tokenizer on the target dubbed audio → speech tokens (the LABEL).      Critica, Turn one dataset row into ready-to-train tensors. (+15 more)
+### Community 15 - "Community 15"
+Cohesion: 0.13
+Nodes (26): _compress_silences(), _detect_pauses(), _distribute_padding(), _find_tts_silences(), _generate_silence(), _get_wav_duration(), _pad_silence(), _pause_aware_sync() (+18 more)
 ### Community 16 - "Community 16"
 Cohesion: 0.19
 Nodes (18): _burn_in(), _clamp(), _extract_audio(), _force_style_for(), _format_timestamp_srt(), _format_timestamp_vtt(), generate_subtitles(), _is_video() (+10 more)
 Nodes (9): get_fallback_mode(), _get_handler(), get_translation_prompt(), post_translate(), Language-specific handlers for the translation pipeline.  Each language that nee, Return a language-specific translation prompt, or the default., Return 'bedrock' or 'google' depending on the language., Run any language-specific post-processing after translation. (+1 more)
 ### Community 21 - "Community 21"
+Cohesion: 0.38
+Nodes (6): _ensure_server(), _generate_impl(), generate_scene(), Dramabox — Resemble AI directable speech engine.  Single-Space tool: generates a, Lazy-import the Dramabox model + load checkpoints once. Raises a clean     Runti, Run Dramabox on `prompt` and write the resulting WAV under `out_dir`.      Retur
+### Community 22 - "Community 22"
+Cohesion: 0.53
+Nodes (5): main(), _prefetch_chatterbox(), _prefetch_demucs(), _prefetch_faster_whisper(), Prefetch model weights into HF_HOME for faster cold starts on Spaces.
+### Community 23 - "Community 23"
 Cohesion: 0.33
 Nodes (6): app.py validation, pipeline.py simplified, steps/s4_preview.py, steps/s4_tts.py conditional imports, server.py /api/config, TTS_ENGINE env var
+### Community 25 - "Community 25"
 Cohesion: 1.0
 Nodes (2): gradio==6.8.0, gradio==6.12.0 (omni)
+### Community 33 - "Community 33"
 Cohesion: 1.0
 Nodes (1): Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.
+### Community 34 - "Community 34"
 Cohesion: 1.0
 Nodes (1): Build voice-clone prompt items from reference audio (and optionally reference te
+### Community 35 - "Community 35"
 Cohesion: 1.0
 Nodes (1): Voice clone speech using the Base model.          You can provide either:
+### Community 36 - "Community 36"
 Cohesion: 1.0
 Nodes (1): Generate speech with the VoiceDesign model using natural-language style instruct
+### Community 37 - "Community 37"
 Cohesion: 1.0
 Nodes (1): Generate speech with the CustomVoice model using a predefined speaker id, option
+### Community 38 - "Community 38"
 Cohesion: 1.0
 Nodes (1): Delete stale per-job artifact directories from ARTIFACTS_ROOT.
+### Community 39 - "Community 39"
 Cohesion: 1.0
 Nodes (1): Reject oversized uploads before body parsing.
+### Community 40 - "Community 40"
 Cohesion: 1.0
 Nodes (1): Run the translation pipeline in a background thread, pushing progress to the job
+### Community 41 - "Community 41"
 Cohesion: 1.0
 Nodes (1): List whitelisted MP4 demo videos from outputs/ and data/.
+### Community 42 - "Community 42"
 Cohesion: 1.0
 Nodes (1): Return curated showcase entries with resolved streaming URLs.
+### Community 43 - "Community 43"
 Cohesion: 1.0
 Nodes (1): Submit a video for translation.
+### Community 44 - "Community 44"
 Cohesion: 1.0
 Nodes (1): Poll endpoint returning new messages since index `after`, plus live wait status.
+### Community 45 - "Community 45"
 Cohesion: 1.0
 Nodes (1): User selects a TTS model after previewing.
+### Community 46 - "Community 46"
 Cohesion: 1.0
 Nodes (1): Serve a preview audio WAV file.
+### Community 47 - "Community 47"
 Cohesion: 1.0
 Nodes (1): Download the translated video.
+### Community 48 - "Community 48"
 Cohesion: 1.0
 Nodes (1): Create artifact directories and start background cleanup.
+### Community 49 - "Community 49"
 Cohesion: 1.0
 Nodes (1): Sync TTS audio using pause-aware strategy: compress silences first, then atempo.
+### Community 50 - "Community 50"
 Cohesion: 1.0
 Nodes (1): Rewrite WAV with silence regions compressed to keep_ratio of their original dura
+### Community 51 - "Community 51"
 Cohesion: 1.0
 Nodes (1): Insert extra silence distributed across detected pause points.
+### Community 52 - "Community 52"
 Cohesion: 1.0
 Nodes (1): Generate a silent WAV file of given duration.
+### Community 53 - "Community 53"
 Cohesion: 1.0
 Nodes (1): Sync each TTS segment to its original timestamp window and stitch into a single
+### Community 54 - "Community 54"
 Cohesion: 1.0
 Nodes (1): Translate the text of each segment into target_language in batches.      Args:
+### Community 55 - "Community 55"
 Cohesion: 1.0
 Nodes (1): Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int
+### Community 56 - "Community 56"
 Cohesion: 1.0
 Nodes (1): Remove trailing noise/artifacts after speech ends.
+### Community 57 - "Community 57"
 Cohesion: 1.0
 Nodes (1): Hard-trim TTS output to orig_dur * headroom, with a short fade-out.
+### Community 58 - "Community 58"
 Cohesion: 1.0
 Nodes (1): Clip audio to max_sec to prevent excessively slow voice cloning.
+### Community 59 - "Community 59"
 Cohesion: 1.0
 Nodes (1): Numpy variant of _trim_trailing_noise for engines returning np.ndarray.
+### Community 60 - "Community 60"
 Cohesion: 1.0
 Nodes (1): Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated
+### Community 61 - "Community 61"
 Cohesion: 1.0
 Nodes (1): Generate speech for all segments using OmniVoice voice cloning.
+### Community 62 - "Community 62"
 Cohesion: 1.0
 Nodes (1): Synthesise translated text for each segment using voice cloned from reference au
+### Community 63 - "Community 63"
 Cohesion: 1.0
 Nodes (1): torch==2.6.0
+### Community 64 - "Community 64"
 Cohesion: 1.0
 Nodes (1): fastapi
+### Community 65 - "Community 65"
 Cohesion: 1.0
 Nodes (1): yt-dlp
+### Community 66 - "Community 66"
 Cohesion: 1.0
 Nodes (1): diffusers==0.29.0
+### Community 67 - "Community 67"
 Cohesion: 1.0
 Nodes (1): ARTIFACTS_ROOT env
+### Community 68 - "Community 68"
 Cohesion: 1.0
 Nodes (1): AWS g4dn.xlarge alternative
+### Community 69 - "Community 69"
 Cohesion: 1.0
 Nodes (1): nodejs (system pkg)
+### Community 70 - "Community 70"
 Cohesion: 1.0
 Nodes (1): fonts-noto-core / cjk
+### Community 71 - "Community 71"
 Cohesion: 1.0
 Nodes (1): graphify project rules
 ## Knowledge Gaps
+- **329 isolated node(s):** `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.`, `Read media duration from ffprobe.`, `Report CUDA/MPS availability.` (+324 more)
   These have ≤1 connection - possible missing edges or undocumented components.
+- **Thin community `Community 25`** (2 nodes): `gradio==6.8.0`, `gradio==6.12.0 (omni)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 33`** (1 nodes): `Load a Qwen3 TTS model and its processor in HuggingFace `from_pretrained` style.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 34`** (1 nodes): `Build voice-clone prompt items from reference audio (and optionally reference te`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 35`** (1 nodes): `Voice clone speech using the Base model.          You can provide either:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 36`** (1 nodes): `Generate speech with the VoiceDesign model using natural-language style instruct`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 37`** (1 nodes): `Generate speech with the CustomVoice model using a predefined speaker id, option`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 38`** (1 nodes): `Delete stale per-job artifact directories from ARTIFACTS_ROOT.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 39`** (1 nodes): `Reject oversized uploads before body parsing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 40`** (1 nodes): `Run the translation pipeline in a background thread, pushing progress to the job`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 41`** (1 nodes): `List whitelisted MP4 demo videos from outputs/ and data/.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 42`** (1 nodes): `Return curated showcase entries with resolved streaming URLs.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 43`** (1 nodes): `Submit a video for translation.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 44`** (1 nodes): `Poll endpoint returning new messages since index `after`, plus live wait status.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 45`** (1 nodes): `User selects a TTS model after previewing.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 46`** (1 nodes): `Serve a preview audio WAV file.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 47`** (1 nodes): `Download the translated video.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 48`** (1 nodes): `Create artifact directories and start background cleanup.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 49`** (1 nodes): `Sync TTS audio using pause-aware strategy: compress silences first, then atempo.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 50`** (1 nodes): `Rewrite WAV with silence regions compressed to keep_ratio of their original dura`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 51`** (1 nodes): `Insert extra silence distributed across detected pause points.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 52`** (1 nodes): `Generate a silent WAV file of given duration.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 53`** (1 nodes): `Sync each TTS segment to its original timestamp window and stitch into a single`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 54`** (1 nodes): `Translate the text of each segment into target_language in batches.      Args:`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 55`** (1 nodes): `Load + run Chatterbox inside a single GPU-decorated scope.      ZeroGPU only int`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 56`** (1 nodes): `Remove trailing noise/artifacts after speech ends.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 57`** (1 nodes): `Hard-trim TTS output to orig_dur * headroom, with a short fade-out.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 58`** (1 nodes): `Clip audio to max_sec to prevent excessively slow voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 59`** (1 nodes): `Numpy variant of _trim_trailing_noise for engines returning np.ndarray.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 60`** (1 nodes): `Perform full OmniVoice processing (load + generate batch) inside a GPU-decorated`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 61`** (1 nodes): `Generate speech for all segments using OmniVoice voice cloning.`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 62`** (1 nodes): `Synthesise translated text for each segment using voice cloned from reference au`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 63`** (1 nodes): `torch==2.6.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 64`** (1 nodes): `fastapi`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 65`** (1 nodes): `yt-dlp`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 66`** (1 nodes): `diffusers==0.29.0`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 67`** (1 nodes): `ARTIFACTS_ROOT env`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 68`** (1 nodes): `AWS g4dn.xlarge alternative`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 69`** (1 nodes): `nodejs (system pkg)`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 70`** (1 nodes): `fonts-noto-core / cjk`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
+- **Thin community `Community 71`** (1 nodes): `graphify project rules`
   Too small to be a meaningful cluster - may be noise or needs more connections extracted.
 ## Suggested Questions
 _Questions this graph is uniquely positioned to answer:_
+- **Why does `synthesise_segments()` connect `Community 6` to `Community 8`, `Community 11`?**
   _High betweenness centrality (0.324) - this node is a cross-community bridge._
+- **Why does `generate()` connect `Community 6` to `Community 0`, `Community 4`?**
+  _High betweenness centrality (0.200) - this node is a cross-community bridge._
 - **Are the 44 inferred relationships involving `Qwen3TTSSpeakerEncoderConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSSpeakerEncoderConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **Are the 44 inferred relationships involving `Qwen3TTSTalkerCodePredictorConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
 - **Are the 44 inferred relationships involving `Qwen3TTSConfig` (e.g. with `Res2NetBlock` and `SqueezeExcitationBlock`) actually correct?**
   _`Qwen3TTSConfig` has 44 INFERRED edges - model-reasoned connections that need verification._
 - **What connects `server.py — FastAPI backend for VideoVoice.  Endpoints:   POST /api/jobs`, `Download video from Instagram/YouTube using yt-dlp.`, `Allow only trusted social platforms for yt-dlp.` to the rest of the system?**
+  _329 weakly-connected nodes found - possible documentation gaps or missing edges._

graphify-out/graph.html CHANGED

The diff for this file is too large to render.

server.py CHANGED

@@ -42,8 +42,8 @@ load_dotenv()
 # TTS_ENGINE controls which TTS backend this Space serves
 TTS_ENGINE = os.getenv("TTS_ENGINE", "chatterbox").lower()
-if TTS_ENGINE not in ("chatterbox", "omnivoice", "qwen3"):
-    raise ValueError(f"Invalid TTS_ENGINE: {TTS_ENGINE}. Use 'chatterbox', 'omnivoice', or 'qwen3'.")
+if TTS_ENGINE not in ("chatterbox", "omnivoice", "qwen3", "dramabox"):
+    raise ValueError(f"Invalid TTS_ENGINE: {TTS_ENGINE}. Use 'chatterbox', 'omnivoice', 'qwen3', or 'dramabox'.")
 # ── Config ────────────────────────────────────────────────
 PORT = int(os.getenv("PORT", "7860"))
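The patch widens the allow-list in two places, the tuple and the hand-written error message, which can drift apart the next time an engine is added. Below is a minimal sketch that derives the message from one constant. It assumes the same `TTS_ENGINE` variable and `"chatterbox"` default. `SUPPORTED_TTS_ENGINES` and `resolve_tts_engine()` are hypothetical names and are not part of server.py.

```python
import os
from collections.abc import Mapping

# Hypothetical single source of truth for the engines a Space can serve.
SUPPORTED_TTS_ENGINES = ("chatterbox", "omnivoice", "qwen3", "dramabox")


def resolve_tts_engine(environ: Mapping[str, str] = os.environ) -> str:
    """Read TTS_ENGINE (default "chatterbox"), lowercase it, and validate it."""
    engine = environ.get("TTS_ENGINE", "chatterbox").lower()
    if engine not in SUPPORTED_TTS_ENGINES:
        # The message is built from the same tuple, so it cannot go stale.
        quoted = ", ".join(f"'{e}'" for e in SUPPORTED_TTS_ENGINES)
        raise ValueError(f"Invalid TTS_ENGINE: {engine}. Use one of {quoted}.")
    return engine
```

Passing a plain dict instead of `os.environ` keeps the check testable without modifying the process environment.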

tools_api/__init__.py CHANGED

@@ -10,6 +10,7 @@ Endpoints (mounted by router.router):
   POST /api/tools/subtitles       — captions (sidecar or burn-in MP4)
   POST /api/tools/voice-clone     — single-segment TTS with voice clone
   POST /api/tools/audio-cleanup   — Demucs source separation
+  POST /api/tools/dramabox        — Resemble Dramabox directable speech (dramabox Space only)
   GET  /api/tools/file/{run}/{f}  — download generated artifact
 """
 from .router import router

tools_api/dramabox.py ADDED

@@ -0,0 +1,181 @@

+"""
+Dramabox — Resemble AI directable speech engine.
+Single-Space tool: generates a 48 kHz WAV "performance" from a scene prompt
+(quoted dialogue + stage directions) and an optional voice reference. Mirrors
+the official ResembleAI/Dramabox Space's on_generate(): same parameter order,
+same defaults, same model invocation.
+This module only runs on the videovoice-dramabox Space, which must vendor the
+Dramabox `src/` directory (inference_server.py + model_downloader.py) as
+`dramabox_src/` and install the requirements-dramabox.txt deps. On any other
+Space the lazy import below raises a clean RuntimeError rather than crashing
+app startup.
+The module loads the TTSServer once on first request (warm-load pattern from
+the upstream Space) and reuses it across calls.
+"""
+from __future__ import annotations
+import logging
+import os
+import threading
+import time
+from pathlib import Path
+# Backend env knobs — kept compatible with the upstream Space.
+_LTX_DTYPE = os.environ.get("LTX_DTYPE", "bf16")
+# Module-level warm load, guarded by a lock so a flurry of concurrent first
+# requests only triggers one load. Subsequent calls are ~2.5s on warm GPU.
+_tts_lock = threading.Lock()
+_tts_server = None  # populated lazily on first generate() call
+logger = logging.getLogger("tools_api.dramabox")
+def _ensure_server():
+    """Lazy-import the Dramabox model + load checkpoints once. Raises a clean
+    RuntimeError on Spaces that don't ship the Dramabox `src/` vendoring.
+    """
+    global _tts_server
+    if _tts_server is not None:
+        return _tts_server
+    with _tts_lock:
+        if _tts_server is not None:
+            return _tts_server
+        try:
+            # Vendored from ResembleAI/Dramabox: upstream `src/` is copied to
+            # `dramabox_src/`, which must be on sys.path. We add it here so
+            # app.py doesn't have to do the insert itself.
+            import sys
+            vendored_src = Path(__file__).parent.parent / "dramabox_src"
+            if vendored_src.exists() and str(vendored_src) not in sys.path:
+                sys.path.insert(0, str(vendored_src))
+            from inference_server import TTSServer  # type: ignore[import-not-found]
+            from model_downloader import get_all_paths  # type: ignore[import-not-found]
+        except ImportError as e:
+            raise RuntimeError(
+                "Dramabox is not installed on this Space. Vendor "
+                "ResembleAI/Dramabox's src/ directory at "
+                "VideoVoice-be/dramabox_src/ and install requirements-dramabox.txt."
+            ) from e
+        logger.info("Fetching Dramabox checkpoints (cached after first run)...")
+        paths = get_all_paths()
+        logger.info("Loading Dramabox warm server (Gemma + DiT + VAE + Decoder)...")
+        _tts_server = TTSServer(
+            checkpoint=paths["transformer"],
+            full_checkpoint=paths["audio_components"],
+            gemma_root=paths["gemma_root"],
+            device="cuda",
+            dtype=_LTX_DTYPE,
+            compile_model=False,   # torch.compile breaks under ZeroGPU's brief GPU windows
+            bnb_4bit=True,         # unsloth Gemma is pre-quantized
+        )
+        logger.info("Dramabox TTSServer ready.")
+        return _tts_server
+def generate_scene(
+    *,
+    prompt: str,
+    out_dir: Path,
+    audio_ref: Path | None = None,
+    cfg: float = 2.5,
+    stg: float = 1.5,
+    dur_mult: float = 1.1,
+    gen_dur: float = 0.0,
+    ref_dur: float = 10.0,
+    seed: int = 42,
+) -> dict:
+    """
+    Run Dramabox on `prompt` and write the resulting WAV under `out_dir`.
+    Returns:
+      {
+        "filename": "dramabox_<unix_ms>.wav",
+        "elapsed": <seconds>,
+        "settings": {...echo of inputs used...},
+      }
+    """
+    prompt = (prompt or "").strip()
+    if not prompt:
+        raise ValueError("Prompt is empty.")
+    # GPU-decorate at call time if `spaces` is available. On the ZeroGPU
+    # Space this maps weights onto the GPU for the duration of the call; on
+    # local dev (no `spaces`) the undecorated function runs directly.
+    try:
+        import spaces  # type: ignore[import-not-found]
+    except ImportError:
+        spaces = None
+    def _run():
+        return _generate_impl(
+            prompt=prompt,
+            out_dir=out_dir,
+            audio_ref=audio_ref,
+            cfg=cfg, stg=stg, dur_mult=dur_mult,
+            gen_dur=gen_dur, ref_dur=ref_dur, seed=seed,
+        )
+    # Guard only the import: an ImportError raised during generation must
+    # propagate, not silently trigger a second, undecorated run.
+    if spaces is not None:
+        _run = spaces.GPU(duration=60)(_run)
+    return _run()
+def _generate_impl(
+    *,
+    prompt: str,
+    out_dir: Path,
+    audio_ref: Path | None,
+    cfg: float,
+    stg: float,
+    dur_mult: float,
+    gen_dur: float,
+    ref_dur: float,
+    seed: int,
+) -> dict:
+    tts = _ensure_server()
+    out_dir.mkdir(parents=True, exist_ok=True)
+    output = out_dir / f"dramabox_{int(time.time() * 1000)}.wav"
+    ref_path: str | None = None
+    if audio_ref is not None and Path(audio_ref).exists():
+        ref_path = str(audio_ref)
+    t0 = time.time()
+    tts.generate_to_file(
+        prompt=prompt,
+        output=str(output),
+        voice_ref=ref_path,
+        cfg_scale=float(cfg),
+        stg_scale=float(stg),
+        duration_multiplier=float(dur_mult),
+        seed=int(seed),
+        gen_duration=float(gen_dur),
+        ref_duration=float(ref_dur),
+    )
+    elapsed = time.time() - t0
+    logger.info(f"Dramabox generated in {elapsed:.2f}s -> {output}")
+    return {
+        "filename": output.name,
+        "elapsed": elapsed,
+        "settings": {
+            "cfg": cfg,
+            "stg": stg,
+            "dur_mult": dur_mult,
+            "gen_dur": gen_dur,
+            "ref_dur": ref_dur,
+            "seed": seed,
+            "had_voice_ref": ref_path is not None,
+        },
+    }

tools_api/router.py CHANGED

@@ -16,7 +16,7 @@ from fastapi.responses import FileResponse, JSONResponse, PlainTextResponse
 from server import limiter, _download_url, _is_allowed_video_host
-from . import audio_cleanup, subtitles, voice_clone
+from . import audio_cleanup, dramabox, subtitles, voice_clone
 from .storage import (
     file_url,
     new_run_dir,
@@ -178,6 +178,76 @@ async def voice_clone_endpoint(
     })
+# ── Dramabox ─────────────────────────────────────────────────────────
+@router.post("/dramabox")
+@limiter.limit("10/hour")
+async def dramabox_endpoint(
+    request: Request,
+    prompt: str = Form(...),
+    audio_ref: Optional[UploadFile] = File(None),
+    cfg: float = Form(2.5),
+    stg: float = Form(1.5),
+    dur_mult: float = Form(1.1),
+    gen_dur: float = Form(0.0),
+    ref_dur: float = Form(10.0),
+    seed: int = Form(42),
+):
+    prompt = (prompt or "").strip()
+    if not prompt:
+        raise HTTPException(400, "prompt is required")
+    if len(prompt) > 2000:
+        raise HTTPException(400, "prompt exceeds 2000 char limit")
+    # Range guards mirror the upstream Dramabox sliders.
+    if not (1.0 <= cfg <= 10.0):
+        raise HTTPException(400, "cfg must be between 1 and 10")
+    if not (0.0 <= stg <= 5.0):
+        raise HTTPException(400, "stg must be between 0 and 5")
+    if not (0.8 <= dur_mult <= 2.0):
+        raise HTTPException(400, "dur_mult must be between 0.8 and 2.0")
+    if not (0.0 <= gen_dur <= 60.0):
+        raise HTTPException(400, "gen_dur must be between 0 and 60")
+    if not (3.0 <= ref_dur <= 30.0):
+        raise HTTPException(400, "ref_dur must be between 3 and 30")
+    run_id, dest_dir = new_run_dir()
+    ref_path: Optional[Path] = None
+    if audio_ref is not None and audio_ref.filename:
+        ref_path = await _save_upload(audio_ref, dest_dir, "voice_ref.wav")
+    try:
+        info = await asyncio.to_thread(
+            dramabox.generate_scene,
+            prompt=prompt,
+            out_dir=dest_dir,
+            audio_ref=ref_path,
+            cfg=cfg,
+            stg=stg,
+            dur_mult=dur_mult,
+            gen_dur=gen_dur,
+            ref_dur=ref_dur,
+            seed=seed,
+        )
+    except ValueError as e:
+        raise HTTPException(400, str(e))
+    except RuntimeError as e:
+        # Raised by dramabox._ensure_server() on Spaces that don't ship the
+        # vendored model. Surface clearly so the frontend can fall back.
+        raise HTTPException(503, str(e))
+    except Exception as e:  # noqa: BLE001
+        raise HTTPException(500, f"Dramabox generation failed: {e}")
+    return JSONResponse({
+        "run_id": run_id,
+        "filename": info["filename"],
+        "url": file_url(run_id, info["filename"]),
+        "elapsed": info["elapsed"],
+        "settings": info["settings"],
+    })
 # ── Audio cleanup ────────────────────────────────────────────────────
 @router.post("/audio-cleanup")
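`dramabox_endpoint` hard-codes its range guards, and each message repeats its bounds by hand. The sketch below is a table-driven version of the same checks. The bounds and the 2000-character prompt limit are copied from the endpoint. `_RANGES` and `validate_dramabox_params()` are hypothetical helpers, not part of tools_api.

```python
from __future__ import annotations

# Inclusive bounds copied from the guards in dramabox_endpoint().
_RANGES = {
    "cfg": (1.0, 10.0),
    "stg": (0.0, 5.0),
    "dur_mult": (0.8, 2.0),
    "gen_dur": (0.0, 60.0),
    "ref_dur": (3.0, 30.0),
}
MAX_PROMPT_CHARS = 2000


def validate_dramabox_params(prompt: str, **params: float) -> list[str]:
    """Return every validation error; an empty list means the request is valid."""
    errors: list[str] = []
    prompt = (prompt or "").strip()
    if not prompt:
        errors.append("prompt is required")
    elif len(prompt) > MAX_PROMPT_CHARS:
        errors.append(f"prompt exceeds {MAX_PROMPT_CHARS} char limit")
    for name, value in params.items():
        if name not in _RANGES:
            errors.append(f"unknown parameter: {name}")
            continue
        lo, hi = _RANGES[name]
        if not (lo <= value <= hi):
            # :g prints 1.0 as "1" and 0.8 as "0.8".
            errors.append(f"{name} must be between {lo:g} and {hi:g}")
    return errors
```

Unlike the endpoint, which stops at the first failing guard, this helper collects every error. In the router, a non-empty result could be raised as a single `HTTPException(400, "; ".join(errors))`; that wiring is an assumption, not how the commit does it.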