Tighten GPU window: 10s base + 1s/sentence, quote-aware count
Replace the 'sentences × 3 + 12' approximation with a tighter formula
calibrated to observed per-sentence runtime on this Space:
window = 10 + (num_sentences - 1) × 1, capped at 110
Examples:
1 sentence -> 10 s (was 15 s)
5 sentences -> 14 s (was 27 s)
20 sentences -> 29 s (was 72 s)
100 sentences -> 109 s (was 120 s cap)
101+ sentences -> 110 s cap (was 120 s cap)
Shorter windows improve queue priority for visitors (per HF ZeroGPU docs),
and we still get the same 10 s safety margin under the 120 s per-call
ceiling.
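The arithmetic above can be checked with a small sketch. The helper names `old_window` and `new_window` are hypothetical, and the assumption that the old formula was simply clamped at 120 s is inferred from the "was 120 s cap" example rather than from the removed code.

```python
# Illustrative only: these helpers are not part of the Space's code.
OLD_CAP_S = 120  # assumed clamp for the old formula ("was 120 s cap")
NEW_CAP_S = 110  # new cap, 10 s under the 120 s per-call ceiling


def old_window(num_sentences: int) -> int:
    """Previous approximation: sentences * 3 + 12, clamped at 120 s."""
    return min(num_sentences * 3 + 12, OLD_CAP_S)


def new_window(num_sentences: int) -> int:
    """New formula: 10 s base + 1 s per additional sentence, clamped at 110 s."""
    n = max(1, num_sentences)  # a single fragment still gets the base window
    return min(10 + (n - 1) * 1, NEW_CAP_S)


for n in (1, 5, 20, 100, 101):
    print(f"{n:>3} sentences: {new_window(n):>3} s (was {old_window(n)} s)")
```

Running the loop prints the old and new window side by side for each sentence count, which makes it easy to re-check the examples if the constants are retuned.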
Sentence count now uses src/text_chunker.split_sentences_outside_quotes
so terminators inside dialogue quotes ("How are you?") aren't counted.
This matches what the chunker sees and prevents over-budgeting on
dialogue-heavy prompts. Falls back to a punctuation count if the import
ever fails.
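To make the difference between the two counting strategies concrete, here is a minimal sketch. The real `split_sentences_outside_quotes` in `src/text_chunker` is not shown in this commit, so `count_sentences_outside_quotes` below is a hypothetical approximation that only handles straight double quotes; `count_punctuation` mirrors the fallback described above.

```python
def count_sentences_outside_quotes(text: str) -> int:
    """Count sentence terminators that fall outside straight double quotes.

    Hypothetical stand-in for the chunker's splitter, for illustration only.
    """
    in_quotes = False
    count = 0
    prev_was_terminator = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            prev_was_terminator = False
        elif ch in ".!?" and not in_quotes:
            # Treat a run such as "..." or "?!" as a single terminator.
            if not prev_was_terminator:
                count += 1
            prev_was_terminator = True
        else:
            prev_was_terminator = False
    return max(1, count)


def count_punctuation(text: str) -> int:
    """Naive fallback: every '.', '!' or '?' counts, quoted or not."""
    return max(1, sum(1 for ch in text if ch in ".!?"))


prompt = 'She asked, "How are you? Fine?" He nodded. Then they left.'
print(count_sentences_outside_quotes(prompt))  # quote-aware count
print(count_punctuation(prompt))               # naive count
```

On this prompt the quote-aware count is lower than the naive one, because the two question marks sit inside the dialogue quotes. With the 1 s/sentence formula, that gap translates directly into a shorter requested window.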
@@ -182,6 +182,31 @@ async def homepage():
         return f.read()
 
 
+_GPU_BASE_S = 10          # bare-minimum window even for a single sentence
+_GPU_PER_SENTENCE_S = 1   # add 1 s per additional sentence
+_GPU_CAP_S = 110          # leave 10 s headroom under ZeroGPU's 120 s ceiling
+
+
+def _count_sentences(prompt: str) -> int:
+    """Count TTS sentences in ``prompt`` using the same quote-aware splitter
+    the long-form chunker uses (``src/text_chunker``). Terminators inside
+    ``"..."`` dialogue do **not** count, so the GPU window calc agrees with
+    what the chunker sees, and dialogue-heavy prompts don't get over-budgeted.
+    Always returns ≥1 so a single fragment still gets a real window.
+    """
+    if not prompt or not prompt.strip():
+        return 1
+    try:
+        from text_chunker import split_sentences_outside_quotes
+        n = len(split_sentences_outside_quotes(prompt))
+    except Exception:
+        # Fallback: cheap punctuation count if the chunker import fails for any
+        # reason; preserves the ability to size GPU windows even on a broken
+        # import path.
+        n = sum(1 for ch in prompt if ch in ".!?")
+    return max(1, n)
+
+
 def _gpu_duration(
     prompt: str,
     audio_ref: FileData | None,
@@ -196,26 +221,27 @@ def _gpu_duration(
     target_chunk_duration: float = 37.0,
     crossfade_ms: float = 50.0,
 ) -> int:
-    """Per-call
-
-    ZeroGPU rejects any decorator value
-    on PRO)
-    we ask only for what each call needs:
-
-
-
-
-
-
-
-
-
-
-
+    """Per-call ZeroGPU window sizing.
+
+    ZeroGPU rejects any static decorator value above the account's per-call
+    cap (120 s on PRO), but ``duration=`` also accepts a callable evaluated
+    per request, so we ask only for what each call needs:
+
+        window = _GPU_BASE_S + (num_sentences - 1) × _GPU_PER_SENTENCE_S
+
+    Defaults: 10 s base + 1 s/extra sentence, capped at 110 s (a 10 s safety
+    margin under the 120 s ZeroGPU ceiling). Numbers tuned to observed
+    runtime on this Space's hardware.
+
+    Under-allocating is worse than over: if a call exceeds its allocated
+    duration ZeroGPU kills it (the user sees a generation failure) **and**
+    daily quota is still consumed against the time actually spent. Shorter
+    allocations *do* improve queue priority (per HF docs), which is why we
+    don't just pin everything at 110.
     """
-
-    needed =
-    return max(
+    n = _count_sentences(prompt)
+    needed = _GPU_BASE_S + (n - 1) * _GPU_PER_SENTENCE_S
+    return max(_GPU_BASE_S, min(needed, _GPU_CAP_S))
 
 
 @app.api()
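The docstring notes that `duration=` can be a callable evaluated per request. The sketch below illustrates that pattern with a toy decorator. `gpu`, `gpu_duration`, `synthesize`, and `last_window_s` are all hypothetical names; `gpu` is a stand-in written for this example and is not the real `spaces.GPU`, whose internals are not shown in this commit.

```python
import functools
from typing import Callable, Union


def gpu(duration: Union[int, Callable[..., int]]):
    """Toy stand-in for a GPU decorator; NOT the real ``spaces.GPU``."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Resolve the window per call: a callable receives the same
            # arguments as the decorated function.
            window = duration(*args, **kwargs) if callable(duration) else duration
            wrapper.last_window_s = window  # recorded only for demonstration
            return fn(*args, **kwargs)
        wrapper.last_window_s = None
        return wrapper
    return decorator


def gpu_duration(prompt: str) -> int:
    # Simplified sizing: punctuation count only, 10 s base, 110 s cap.
    n = max(1, sum(1 for ch in prompt if ch in ".!?"))
    return max(10, min(10 + (n - 1) * 1, 110))


@gpu(duration=gpu_duration)
def synthesize(prompt: str) -> str:
    return f"audio for {len(prompt)} chars"  # placeholder for the real work


synthesize("One. Two. Three.")
print(synthesize.last_window_s)
```

The point of the design is that the sizing function shares the decorated function's signature, so the requested window can depend on the prompt instead of being fixed when the module is imported.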