Manmay committed on
Commit fc8ba6b · 1 Parent(s): d621c93

Tighten GPU window: 10s base + 1s/sentence, quote-aware count

Replace the 'sentences × 3 + 12' approximation with a tighter formula
calibrated to observed per-sentence runtime on this Space:

window = 10 + (num_sentences - 1) × 1, capped at 110

Examples:
1 sentence -> 10 s (was 15 s)
5 sentences -> 14 s (was 27 s)
20 sentences -> 29 s (was 72 s)
100 sentences -> 109 s (was 120 s cap)
101+ sentences -> 110 s cap (was 120 s cap)

Shorter windows improve queue priority for visitors (per HF ZeroGPU docs),
and we still get the same 10 s safety margin under the 120 s per-call
ceiling.
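
The arithmetic behind the examples above can be checked with a small sketch. The function names `new_window` and `old_window` are illustrative and do not appear in app.py; the old formula is taken from the wording of this message (sentences × 3 + 12, capped at 120), not from the removed code.

```python
BASE_S = 10          # window for the first sentence
PER_SENTENCE_S = 1   # extra seconds per additional sentence
CAP_S = 110          # 10 s under the 120 s per-call ceiling


def new_window(num_sentences: int) -> int:
    """Window from this commit: 10 + (n - 1) * 1, capped at 110."""
    n = max(1, num_sentences)
    return min(BASE_S + (n - 1) * PER_SENTENCE_S, CAP_S)


def old_window(num_sentences: int) -> int:
    """Previous approximation as described above: n * 3 + 12, capped at 120."""
    n = max(1, num_sentences)
    return min(n * 3 + 12, 120)


if __name__ == "__main__":
    # Print old and new windows side by side for a few prompt sizes.
    for n in (1, 5, 20, 100, 101):
        print(f"{n:>3} sentences: new {new_window(n):>3} s, old {old_window(n):>3} s")
```

With a 1 s step, the 110 s cap is first reached at 101 sentences.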

Sentence count now uses src/text_chunker.split_sentences_outside_quotes
so terminators inside dialogue quotes ("How are you?") aren't counted;
this matches what the chunker sees and prevents over-budgeting on
dialogue-heavy prompts. Falls back to a punctuation count if the import
ever fails.
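
To illustrate why quote-aware counting matters, here is a minimal sketch. `count_sentences_outside_quotes` is a hypothetical stand-in written for this example; it is not the Space's `split_sentences_outside_quotes`, whose source is not shown here and may behave differently. `naive_count` mirrors the punctuation-count fallback.

```python
TERMINATORS = ".!?"
QUOTES = '"\u201c\u201d'  # straight and curly double quotes


def naive_count(text: str) -> int:
    """Punctuation count: every terminator counts, even inside quotes."""
    return max(1, sum(1 for ch in text if ch in TERMINATORS))


def count_sentences_outside_quotes(text: str) -> int:
    """Hypothetical quote-aware counter (assumption, not the real chunker)."""
    in_quote = False
    pending = False  # unterminated text seen since the last sentence end
    count = 0
    for ch in text:
        if ch in QUOTES:
            in_quote = not in_quote
            pending = True
        elif ch in TERMINATORS and not in_quote:
            if pending:  # collapse runs such as "..." into one sentence end
                count += 1
                pending = False
        elif not ch.isspace():
            pending = True
    if pending:  # trailing fragment without a terminator outside quotes
        count += 1
    return max(1, count)


if __name__ == "__main__":
    sample = 'She asked, "How are you?" and left.'
    print(naive_count(sample), count_sentences_outside_quotes(sample))
```

On the sample line, the naive count sees two terminators while the quote-aware count sees one sentence, so the naive path would request a longer window than the chunker's view of the text warrants.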

Files changed (1)
  1. app.py +45 -19
app.py CHANGED
@@ -182,6 +182,31 @@ async def homepage():
         return f.read()
 
 
+_GPU_BASE_S = 10  # bare-minimum window even for a single sentence
+_GPU_PER_SENTENCE_S = 1  # add 1 s per additional sentence
+_GPU_CAP_S = 110  # leave 10 s headroom under ZeroGPU's 120 s ceiling
+
+
+def _count_sentences(prompt: str) -> int:
+    """Count TTS sentences in ``prompt`` using the same quote-aware splitter
+    the long-form chunker uses (``src/text_chunker``). Terminators inside
+    ``"..."`` dialogue do **not** count, so the GPU window calc agrees with
+    what the chunker sees, and dialogue-heavy prompts don't get over-budgeted.
+    Always returns ≥1 so a single fragment still gets a real window.
+    """
+    if not prompt or not prompt.strip():
+        return 1
+    try:
+        from text_chunker import split_sentences_outside_quotes
+        n = len(split_sentences_outside_quotes(prompt))
+    except Exception:
+        # Fallback: cheap punctuation count if the chunker import fails for any
+        # reason; preserves the ability to size GPU windows even on a broken
+        # import path.
+        n = sum(1 for ch in prompt if ch in ".!?")
+    return max(1, n)
+
+
 def _gpu_duration(
     prompt: str,
     audio_ref: FileData | None,
@@ -196,26 +221,27 @@ def _gpu_duration(
     target_chunk_duration: float = 37.0,
     crossfade_ms: float = 50.0,
 ) -> int:
-    """Per-call GPU window sizing.
-
-    ZeroGPU rejects any decorator value over the account's per-call cap (120 s
-    on PRO). It also supports a callable here that's evaluated per request, so
-    we ask only for what each call needs:
-
-    * short request: 60 s (sufficient for a single ≤30 s generation on
-      warm models: denoise + prompt encode + 30-step euler + decode).
-    * long request: ceil(target_audio_s × 1.5) + 25 s overhead.
-    * cap: 120 s, the documented ZeroGPU PRO per-call ceiling.
-
-    Long-form prompts that internally chunk into >1 generate() pass run
-    sequentially inside one GPU window today, so multi-chunk total wall time
-    must still fit under the 120 s cap. Above that, the kernel kills the call;
-    the cleaner long-term fix is to acquire a GPU per chunk (separate
-    @spaces.GPU function) rather than holding one window across the loop.
+    """Per-call ZeroGPU window sizing.
+
+    ZeroGPU rejects any static decorator value above the account's per-call
+    cap (120 s on PRO), but ``duration=`` also accepts a callable evaluated
+    per request, so we ask only for what each call needs:
+
+        window = _GPU_BASE_S + (num_sentences - 1) × _GPU_PER_SENTENCE_S
+
+    Defaults: 10 s base + 1 s/extra sentence, capped at 110 s (a 10 s safety
+    margin under the 120 s ZeroGPU ceiling). Numbers tuned to observed
+    runtime on this Space's hardware.
+
+    Under-allocating is worse than over: if a call exceeds its allocated
+    duration ZeroGPU kills it (the user sees a generation failure) **and**
+    daily quota is still consumed against the time actually spent. Shorter
+    allocations *do* improve queue priority (per HF docs), which is why we
+    don't just pin everything at 110.
     """
-    target = float(gen_dur) if gen_dur and gen_dur > 0 else 30.0
-    needed = int(target * 1.5 + 25)
-    return max(60, min(needed, 120))
+    n = _count_sentences(prompt)
+    needed = _GPU_BASE_S + (n - 1) * _GPU_PER_SENTENCE_S
+    return max(_GPU_BASE_S, min(needed, _GPU_CAP_S))
 
 
 @app.api()
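
The docstring above relies on the decorator's `duration=` accepting a callable that is evaluated per request. The mechanism can be sketched with a local mock. `fake_gpu`, `window_for`, and `synthesize` are hypothetical names for illustration; `fake_gpu` is not the real `spaces.GPU`, and the sketch assumes the callable receives the same arguments as the decorated function.

```python
import functools
from typing import Callable, Union

Duration = Union[int, Callable[..., int]]


def fake_gpu(duration: Duration):
    """Local mock of a GPU decorator (assumption; not the `spaces` library)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Resolve the window per call: a callable sees the call's arguments,
            # a plain int is used as a static window.
            window = duration(*args, **kwargs) if callable(duration) else duration
            wrapper.last_window = window  # recorded only so the sketch can be inspected
            return fn(*args, **kwargs)
        wrapper.last_window = None
        return wrapper
    return decorator


def window_for(prompt: str) -> int:
    """Simplified sizing: 10 s base + 1 s per extra terminator, capped at 110 s."""
    n = max(1, sum(1 for ch in prompt if ch in ".!?"))
    return max(10, min(10 + (n - 1) * 1, 110))


@fake_gpu(duration=window_for)
def synthesize(prompt: str) -> str:
    # Placeholder for the real generation call.
    return f"audio for {len(prompt)} chars"


if __name__ == "__main__":
    synthesize("One. Two. Three.")
    print(synthesize.last_window)
```

A per-request callable lets short prompts ask for short windows while long prompts still get up to the cap, which is the trade-off the docstring describes.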