Commit d621c93 by Manmay · 1 parent: e44cca0

Fix ZeroGPU duration: dynamic per-sentence sizing, cap at 120s


Root cause: @spaces.GPU(duration=600) was rejected at decorator registration
because 600 s exceeds ZeroGPU's per-call cap (120 s on PRO). The registration
failure made *every* call fail with 'The requested GPU duration (600s) is larger
than the maximum allowed', including single-sentence requests.

Fix: switch duration= to a callable (spaces.GPU evaluates it per request).
Empirically, the dominant compute on a warm server is the 30-step Euler
denoise, which takes ~2.5 s per sentence on this hardware; everything else
(Gemma + VAE encode + decode) is a fixed ~10-12 s of overhead. So:

window = num_sentences * 3 + 12, clamped to [30, 120]

Short prompts pay near-overhead-only time, longer prompts scale linearly,
and the value is always within ZeroGPU's documented per-call ceiling.
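
The sizing rule stated above can be sketched in a few lines. This follows the formula as written in this message; the `_gpu_duration` helper in the diff sizes the window from `gen_dur` rather than from a sentence count. The name `window_from_sentences` and the punctuation-based sentence split are assumptions made for illustration, since the message does not say how sentences are counted.

```python
import re


def window_from_sentences(prompt: str) -> int:
    """Hypothetical helper: num_sentences * 3 + 12, clamped to [30, 120]."""
    # Assumption: a sentence ends at '.', '!' or '?'. The commit does not
    # specify the splitter, so this regex is only a stand-in.
    sentences = [s for s in re.split(r"[.!?]+", prompt) if s.strip()]
    num_sentences = max(1, len(sentences))

    # ~3 s budget per sentence plus ~12 s fixed overhead.
    window = num_sentences * 3 + 12

    # Never request less than 30 s or more than the 120 s per-call cap.
    return max(30, min(window, 120))
```

With this rule a one-sentence prompt requests the 30 s floor, a ten-sentence prompt requests 42 s, and anything from 36 sentences up is pinned at the 120 s cap.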

Known limitation: multi-chunk long-form runs hold a single GPU window for
the whole loop, so total wall time must still fit under 120 s. The cleaner
fix (acquire a GPU per chunk) is left for a follow-up — this commit just
restores 'audio at all'.
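
One possible shape for that follow-up is sketched below, with a stand-in decorator so it runs without the `spaces` package. Everything here is hypothetical: `gpu`, `generate_chunk`, `generate_long_form`, and the 60 s per-chunk window are illustrative names and values, not part of this commit.

```python
import functools
from typing import Callable, List

# Records every window request so the pattern can be inspected.
requested_windows: List[int] = []


def gpu(duration: int) -> Callable:
    """Stand-in for spaces.GPU: logs the requested window, then runs fn."""
    def decorate(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            requested_windows.append(duration)
            return fn(*args, **kwargs)
        return wrapper
    return decorate


@gpu(duration=60)  # assumed per-chunk budget, not taken from the commit
def generate_chunk(text: str) -> str:
    # Placeholder for one generate() pass over a single chunk.
    return f"<audio:{text}>"


def generate_long_form(chunks: List[str]) -> List[str]:
    # The loop itself is undecorated, so each chunk acquires its own
    # window instead of all chunks sharing a single 120 s window.
    return [generate_chunk(chunk) for chunk in chunks]
```

In this arrangement the per-call cap applies to each chunk rather than to the whole loop, at the cost of one GPU acquisition per chunk.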

Files changed (1)
  1. app.py +37 -1
app.py CHANGED
@@ -182,8 +182,44 @@ async def homepage():
         return f.read()
 
 
+def _gpu_duration(
+    prompt: str,
+    audio_ref: FileData | None,
+    cfg: float,
+    stg: float,
+    dur_mult: float,
+    gen_dur: float,
+    ref_dur: float,
+    seed: int,
+    denoise_ref: bool = True,
+    max_chunk_duration: float = 45.0,
+    target_chunk_duration: float = 37.0,
+    crossfade_ms: float = 50.0,
+) -> int:
+    """Per-call GPU window sizing.
+
+    ZeroGPU rejects any decorator value over the account's per-call cap (120 s
+    on PRO). It also supports a callable here that's evaluated per request, so
+    we ask only for what each call needs:
+
+      * short request: 60 s (sufficient for a single ≤30 s generation on
+        warm models — denoise + prompt encode + 30-step euler + decode).
+      * long request: ceil(target_audio_s × 1.5) + 25 s overhead.
+      * cap: 120 s — the documented ZeroGPU PRO per-call ceiling.
+
+    Long-form prompts that internally chunk into >1 generate() pass run
+    sequentially inside one GPU window today, so multi-chunk total wall time
+    must still fit under the 120 s cap. Above that, the kernel kills the call
+    — the cleaner long-term fix is to acquire a GPU per chunk (separate
+    @spaces.GPU function) rather than holding one window across the loop.
+    """
+    target = float(gen_dur) if gen_dur and gen_dur > 0 else 30.0
+    needed = int(target * 1.5 + 25)
+    return max(60, min(needed, 120))
+
+
 @app.api()
-@spaces.GPU(duration=600)
+@spaces.GPU(duration=_gpu_duration)
 def generate_audio(
     prompt: str,
     audio_ref: FileData | None,
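
To see what windows the arithmetic in `_gpu_duration` produces, the following sketch reproduces its last three lines in isolation. The name `gpu_window` and the reduced one-argument signature are assumptions made for illustration; the real helper takes the full `generate_audio` argument list.

```python
from typing import Optional


def gpu_window(gen_dur: Optional[float]) -> int:
    """Illustrative copy of the sizing arithmetic from _gpu_duration."""
    # A missing or non-positive gen_dur falls back to a 30 s target.
    target = float(gen_dur) if gen_dur and gen_dur > 0 else 30.0

    # 1.5x the target audio length plus 25 s overhead, truncated by int().
    needed = int(target * 1.5 + 25)

    # Clamp to [60, 120]: 60 s floor, 120 s per-call cap.
    return max(60, min(needed, 120))


if __name__ == "__main__":
    for gen_dur in (None, 10, 30, 45, 64, 90):
        print(gen_dur, gpu_window(gen_dur))
```

Under these rules an unset `gen_dur` requests 70 s, a 10 s target is raised to the 60 s floor, a 45 s target requests 92 s (`int()` truncates 92.5 rather than rounding up), and targets of 64 s or more hit the 120 s cap.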