Spaces:

ResembleAI
/

Dramabox

Running on Zero

Manmay commited on 1 day ago

Commit

fc8ba6b

1 Parent(s): d621c93

Tighten GPU window: 10s base + 1s/sentence, quote-aware count

Replace the 'sentences × 3 + 12' approximation with a tighter formula
calibrated to observed per-sentence runtime on this Space:

window = 10 + (num_sentences - 1) × 1, capped at 110

Examples:
1 sentence -> 10 s (was 15 s)
5 sentences -> 14 s (was 27 s)
20 sents -> 29 s (was 72 s)
100 sents -> 110 s cap (was 120 s cap)

Shorter windows improve queue priority for visitors (per HF ZeroGPU docs),
and we still get the same 10 s safety margin under the 120 s per-call
ceiling.

Sentence count now uses src/text_chunker.split_sentences_outside_quotes
so terminators inside dialogue quotes ("How are you?") aren't counted —
this matches what the chunker sees and prevents over-budgeting on
dialogue-heavy prompts. Falls back to a punctuation count if the import
ever fails.

Files changed (1) hide show

app.py +45 -19

app.py CHANGED Viewed

@@ -182,6 +182,31 @@ async def homepage():
         return f.read()
 def _gpu_duration(
     prompt: str,
     audio_ref: FileData | None,
@@ -196,26 +221,27 @@ def _gpu_duration(
     target_chunk_duration: float = 37.0,
     crossfade_ms: float = 50.0,
 ) -> int:
-    """Per-call GPU window sizing.
-    ZeroGPU rejects any decorator value over the account's per-call cap (120 s
-    on PRO). It also supports a callable here that's evaluated per request, so
-    we ask only for what each call needs:
-      * short request: 60 s (sufficient for a single ≤30 s generation on
-        warm models — denoise + prompt encode + 30-step euler + decode).
-      * long request:  ceil(target_audio_s × 1.5) + 25 s overhead.
-      * cap:           120 s — the documented ZeroGPU PRO per-call ceiling.
-    Long-form prompts that internally chunk into >1 generate() pass run
-    sequentially inside one GPU window today, so multi-chunk total wall time
-    must still fit under the 120 s cap. Above that, the kernel kills the call
-    — the cleaner long-term fix is to acquire a GPU per chunk (separate
-    @spaces.GPU function) rather than holding one window across the loop.
     """
-    target = float(gen_dur) if gen_dur and gen_dur > 0 else 30.0
-    needed = int(target * 1.5 + 25)
-    return max(60, min(needed, 120))
 @app.api()

         return f.read()
+_GPU_BASE_S = 10           # bare-minimum window even for a single sentence
+_GPU_PER_SENTENCE_S = 1    # add 1 s per additional sentence
+_GPU_CAP_S = 110           # leave 10 s headroom under ZeroGPU's 120 s ceiling
+def _count_sentences(prompt: str) -> int:
+    """Count TTS sentences in ``prompt`` using the same quote-aware splitter
+    the long-form chunker uses (``src/text_chunker``). Terminators inside
+    ``"..."`` dialogue do **not** count, so the GPU window calc agrees with
+    what the chunker sees — and dialogue-heavy prompts don't get over-budgeted.
+    Always returns ≥1 so a single fragment still gets a real window.
+    """
+    if not prompt or not prompt.strip():
+        return 1
+    try:
+        from text_chunker import split_sentences_outside_quotes
+        n = len(split_sentences_outside_quotes(prompt))
+    except Exception:
+        # Fallback: cheap punctuation count if the chunker import fails for any
+        # reason — preserves the ability to size GPU windows even on a broken
+        # import path.
+        n = sum(1 for ch in prompt if ch in ".!?")
+    return max(1, n)
 def _gpu_duration(
     prompt: str,
     audio_ref: FileData | None,
     target_chunk_duration: float = 37.0,
     crossfade_ms: float = 50.0,
 ) -> int:
+    """Per-call ZeroGPU window sizing.
+    ZeroGPU rejects any static decorator value above the account's per-call
+    cap (120 s on PRO), but ``duration=`` also accepts a callable evaluated
+    per request — we ask only for what each call needs:
+        window = _GPU_BASE_S + (num_sentences - 1) × _GPU_PER_SENTENCE_S
+    Defaults: 10 s base + 1 s/extra sentence, capped at 110 s (a 10 s safety
+    margin under the 120 s ZeroGPU ceiling). Numbers tuned to observed
+    runtime on this Space's hardware.
+    Under-allocating is worse than over: if a call exceeds its allocated
+    duration ZeroGPU kills it (the user sees a generation failure) **and**
+    daily quota is still consumed against the time actually spent. Shorter
+    allocations *do* improve queue priority (per HF docs), which is why we
+    don't just pin everything at 110.
     """
+    n = _count_sentences(prompt)
+    needed = _GPU_BASE_S + (n - 1) * _GPU_PER_SENTENCE_S
+    return max(_GPU_BASE_S, min(needed, _GPU_CAP_S))
 @app.api()