Manmay Nakhashi committed on
Commit 96ef84a · 1 Parent(s): ac99a44

Use richer duration estimator in warm server (sentence + non-verbal aware)

The simple len*0.065+1.5 formula in inference_server.py underestimated long
expressive prompts (e.g. 09_villain_sinister_laugh was estimated at 19.2s but
its content needs ~28s, so the output was clipped). Delegate to
inference.estimate_speech_duration, which budgets per-sentence pauses,
laugh repetitions, sighs/gasps, and a 2s base padding.
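
To make the difference concrete, here is a minimal sketch contrasting the two approaches. `old_estimate` reproduces the character-count formula removed in this commit. `sketch_estimate` is a hypothetical stand-in for a sentence- and non-verbal-aware estimator: its constants, regexes, and the bracketed `[sigh]`/`[gasp]` cue syntax are assumptions for illustration and are not taken from `inference.estimate_speech_duration`, whose implementation is not shown here.

```python
import re


def old_estimate(prompt, multiplier=1.1):
    """Character-count formula that this commit removes."""
    quoted = re.findall(r'"([^"]*)"', prompt) or re.findall(r"'([^']*)'", prompt)
    text = " ".join(quoted) if quoted else prompt
    return max(3.0, round((len(text) * 0.065 + 1.5) * multiplier, 1))


# All constants below are illustrative assumptions, except the 2s base
# padding, which the commit message mentions.
CHAR_SECONDS = 0.065      # seconds per character of text
SENTENCE_PAUSE = 0.4      # assumed pause after each sentence
LAUGH_SECONDS = 0.35      # assumed time per laugh syllable ("ha", "he", "ho")
NONVERBAL_SECONDS = 1.0   # assumed time per sigh/gasp cue
BASE_PADDING = 2.0        # base padding


def sketch_estimate(text):
    """Hypothetical estimator that budgets pauses and non-verbal sounds."""
    lowered = text.lower()
    # Count non-empty sentences split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Laugh words such as "haha" or "hehehe"; each pair of letters is one syllable.
    laugh_words = re.findall(r"\b(?:ha|he|ho){2,}\b", lowered)
    laugh_syllables = sum(len(word) // 2 for word in laugh_words)
    # Bracketed non-verbal cues (assumed syntax).
    cues = re.findall(r"\[(?:sigh|gasp)s?\]", lowered)
    total = (
        BASE_PADDING
        + len(text) * CHAR_SECONDS
        + len(sentences) * SENTENCE_PAUSE
        + laugh_syllables * LAUGH_SECONDS
        + len(cues) * NONVERBAL_SECONDS
    )
    return round(total, 1)
```

With these assumed constants, a prompt containing laughs and sighs receives extra time on top of its character count, which is the kind of budget the old formula lacked.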

Files changed (1)
  1. src/inference_server.py +5 -3
src/inference_server.py CHANGED
@@ -53,9 +53,11 @@ DEFAULT_NEG = "worst quality, inconsistent, robotic, distorted, noise, static, m
 
 
 def estimate_duration(prompt, multiplier=1.1):
-    quoted = re.findall(r'"([^"]*)"', prompt) or re.findall(r"'([^']*)'", prompt)
-    text = " ".join(quoted) if quoted else prompt
-    return max(3.0, round((len(text) * 0.065 + 1.5) * multiplier, 1))
+    """Defer to the richer CLI estimator (sentence-aware + non-verbal action
+    budget) so warm-server outputs match the lengths of the per-call CLI runs."""
+    from inference import estimate_speech_duration
+    base = estimate_speech_duration(prompt)
+    return max(3.0, round(base * multiplier, 1))
 
 
 class TTSServer:
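
The new wrapper still applies the `multiplier` and the 3.0s floor on top of whatever the delegated estimator returns. The sketch below exercises that behavior in isolation by registering a stub module named `inference` in `sys.modules`. The stub and its fixed return values are assumptions for illustration only; the real `inference.estimate_speech_duration` is not reproduced here.

```python
import sys
import types

# Stub module standing in for the real `inference` module (assumption).
fake_inference = types.ModuleType("inference")
fake_inference.estimate_speech_duration = lambda prompt: 28.0
sys.modules["inference"] = fake_inference


def estimate_duration(prompt, multiplier=1.1):
    """Wrapper as added in this commit: delegate, scale, then apply the floor."""
    from inference import estimate_speech_duration
    base = estimate_speech_duration(prompt)
    return max(3.0, round(base * multiplier, 1))


# A 28s base estimate is scaled by the default 1.1 multiplier.
long_result = estimate_duration("any long expressive prompt")

# A very short base estimate is clamped to the 3.0s floor.
fake_inference.estimate_speech_duration = lambda prompt: 1.0
short_result = estimate_duration("hi")
```

Because the import happens inside the function body, the name is resolved on each call, so swapping the stub's attribute takes effect immediately.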