himahande45 committed
Commit 0e9e909 · verified · 1 Parent(s): 402a61f

Switch Space to VM-backed frontend

Files changed (3)
  1. README.md +6 -4
  2. frontend_app.py +344 -0
  3. requirements.txt +1 -12
README.md CHANGED
@@ -5,15 +5,15 @@ colorFrom: indigo
  colorTo: green
  sdk: gradio
  sdk_version: 6.12.0
- app_file: app.py
+ app_file: frontend_app.py
  pinned: false
  python_version: "3.10.16"
- suggested_hardware: a10g-small
+ suggested_hardware: cpu-basic
  ---

  # IndicVox

- IndicVox is a GPU-backed research demo for multilingual text-to-speech across Hindi, Tamil, and code-switched prompts. The Space exposes the paper checkpoints through a clean Gradio UI with built-in voice presets and example prompts.
+ IndicVox is a research demo for multilingual text-to-speech across Hindi, Tamil, and code-switched prompts. The Space hosts the frontend UI, while inference runs on an external GPU VM backend.

  ## What it includes

@@ -22,6 +22,7 @@ IndicVox is a GPU-backed research demo for multilingual text-to-speech across Hi
  - `Research Baseline` for direct comparison against the untuned multilingual model
  - Built-in research voice presets for fast demo playback
  - Zero-shot `Text Only` mode if you want to skip reference conditioning
+ - VM-backed inference on a dedicated GPU server

  ## Usage

@@ -32,5 +33,6 @@ IndicVox is a GPU-backed research demo for multilingual text-to-speech across Hi

  ## Notes

- - The base multilingual model stays resident on GPU memory and the paper checkpoints are swapped on demand.
+ - The frontend expects `INDICVOX_API_URL` to point at the VM backend.
+ - If the backend is token-protected, set `INDICVOX_BACKEND_TOKEN` in Space secrets too.
  - The Space is meant for inference/demo usage, not batch evaluation.
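The two notes added to the README describe the Space's runtime configuration: a backend URL and an optional token. A minimal sketch of that contract follows. The helper name `resolve_backend_config` is hypothetical and not part of this commit, and the `x-api-key` header name is taken from `frontend_app.py` in the same commit, which reads both variables at import time rather than through a helper.

```python
import os
from typing import Dict, Mapping, Optional, Tuple


def resolve_backend_config(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (base_url, headers) for calls to the VM backend.

    Hypothetical helper for illustration only; it mirrors how the
    frontend in this commit interprets the two variables.
    """
    env = os.environ if env is None else env
    # Drop a trailing slash so f"{base_url}/health" never contains "//".
    base_url = env.get("INDICVOX_API_URL", "").rstrip("/")
    token = env.get("INDICVOX_BACKEND_TOKEN", "")
    # The key header is only sent when a token is configured.
    headers = {"x-api-key": token} if token else {}
    return base_url, headers
```

An empty `base_url` corresponds to the "Backend Not Configured" state that the frontend reports, so a caller would check for it before issuing any request.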
frontend_app.py ADDED
@@ -0,0 +1,344 @@
+ from __future__ import annotations
+
+ import json
+ import os
+ import tempfile
+ from pathlib import Path
+
+ import gradio as gr
+ import requests
+
+ APP_DIR = Path(__file__).resolve().parent
+ PROMPTS_FILE = APP_DIR / "code_switch_prompts.json"
+ VOICE_DIR = APP_DIR / "assets" / "voices"
+ API_URL = os.getenv("INDICVOX_API_URL", "").rstrip("/")
+ BACKEND_TOKEN = os.getenv("INDICVOX_BACKEND_TOKEN", "")
+ DEFAULT_PROFILE = "Tamil Focus"
+ DEFAULT_VOICE = "Tamil Female Research Voice"
+ DEFAULT_TEXT = "இந்த experimentக்கு clean reference audio use பண்ணணும், இல்லனா output quality drop ஆகும்."
+ TIMEOUT_S = 600
+ SESSION = requests.Session()
+
+ PROFILES = {
+     "Tamil Focus": {
+         "description": "Best for Tamil and Tamil-English code-switched prompts.",
+     },
+     "Hindi Focus": {
+         "description": "Best for Hindi and Hindi-English code-switched prompts.",
+     },
+     "Research Baseline": {
+         "description": "Base multilingual checkpoint without paper fine-tuning.",
+     },
+ }
+
+ VOICE_PRESETS = {
+     "Hindi Research Voice": {
+         "path": VOICE_DIR / "hin_m_ref_00.wav",
+         "transcript": "लेकिन क्या यह हम सभी कार्यक्रमों के साथ कर सकते?",
+         "summary": "Short Hindi reference used for sharper Hindi + English prompting.",
+     },
+     "Tamil Female Research Voice": {
+         "path": VOICE_DIR / "tam_f_ref_00.wav",
+         "transcript": "விக்கற நேரத்தையும் லாபத்தையும் பொறுத்து, இந்த டேக்ஸை ஷார்ட் டேர்ம் இல்ல லாங் டேர்ம்னு பிரிப்பாங்க.",
+         "summary": "Clear Tamil reference with stable conversational prosody.",
+     },
+     "Tamil Male Research Voice": {
+         "path": VOICE_DIR / "tam_m_ref_00.wav",
+         "transcript": "கொரோனா பாதிப்பு காலத்தில் எண்பது கோடி மக்களுக்கு உணவு தானியம் வழங்கப்பட்டதாகவும் அவர் தெரிவித்தார்.",
+         "summary": "Tamil male reference that holds rhythm well on longer prompts.",
+     },
+     "Text Only": {
+         "path": None,
+         "transcript": None,
+         "summary": "Zero-shot generation without a reference voice clip.",
+     },
+ }
+
+ CUSTOM_CSS = """
+ #app-shell {
+     max-width: 1180px;
+     margin: 0 auto;
+ }
+ #hero {
+     padding: 24px 26px 12px 26px;
+     border: 1px solid rgba(255, 255, 255, 0.08);
+     border-radius: 22px;
+     background:
+         radial-gradient(circle at top right, rgba(99, 102, 241, 0.16), transparent 34%),
+         radial-gradient(circle at bottom left, rgba(16, 185, 129, 0.14), transparent 30%),
+         rgba(15, 23, 42, 0.74);
+ }
+ .stat-chip {
+     display: inline-block;
+     margin: 6px 8px 0 0;
+     padding: 8px 12px;
+     border-radius: 999px;
+     background: rgba(255, 255, 255, 0.06);
+     font-size: 0.92rem;
+ }
+ .footnote {
+     opacity: 0.78;
+     font-size: 0.94rem;
+ }
+ footer {
+     visibility: hidden;
+ }
+ """
+
+ THEME = gr.themes.Soft(primary_hue="indigo", secondary_hue="emerald")
+
+
+ def load_examples() -> list[list[str]]:
+     with PROMPTS_FILE.open("r", encoding="utf-8") as f:
+         prompt_bank = json.load(f)
+
+     return [
+         [prompt_bank["hi_en"][0]["text"], "Hindi Focus", "Hindi Research Voice"],
+         [prompt_bank["hi_en"][9]["text"], "Hindi Focus", "Hindi Research Voice"],
+         [prompt_bank["hi_en"][16]["text"], "Hindi Focus", "Hindi Research Voice"],
+         [prompt_bank["ta_en"][0]["text"], "Tamil Focus", "Tamil Female Research Voice"],
+         [prompt_bank["ta_en"][9]["text"], "Tamil Focus", "Tamil Female Research Voice"],
+         [prompt_bank["ta_en"][14]["text"], "Tamil Focus", "Tamil Male Research Voice"],
+     ]
+
+
+ EXAMPLES = load_examples()
+
+
+ def profile_markdown(profile_name: str) -> str:
+     return f"**{profile_name}** \n{PROFILES[profile_name]['description']}"
+
+
+ def voice_markdown(voice_name: str) -> str:
+     voice = VOICE_PRESETS[voice_name]
+     if voice["path"] is None:
+         return f"**{voice_name}** \n{voice['summary']}"
+     return (
+         f"**{voice_name}** \n"
+         f"{voice['summary']} \n"
+         f"Reference transcript: `{voice['transcript']}`"
+     )
+
+
+ def auth_headers() -> dict[str, str]:
+     headers: dict[str, str] = {}
+     if BACKEND_TOKEN:
+         headers["x-api-key"] = BACKEND_TOKEN
+     return headers
+
+
+ def backend_status() -> str:
+     if not API_URL:
+         return "**Backend Not Configured** \nSet `INDICVOX_API_URL` in Space secrets."
+
+     try:
+         response = SESSION.get(f"{API_URL}/health", headers=auth_headers(), timeout=10)
+         response.raise_for_status()
+         payload = response.json()
+     except Exception as exc:
+         return (
+             f"**Backend Unreachable** \n"
+             f"Endpoint: `{API_URL}` \n"
+             f"Error: `{type(exc).__name__}: {exc}`"
+         )
+
+     return (
+         f"**VM Backend Ready** \n"
+         f"Endpoint: `{API_URL}` \n"
+         f"GPU: `{payload.get('gpu', 'unknown')}` \n"
+         f"Warm profile: `{payload.get('active_profile', 'unknown')}` \n"
+         f"Uptime: `{payload.get('uptime_s', 'unknown')}s`"
+     )
+
+
+ def synthesize(text: str, profile_name: str, voice_name: str, cfg_value: float, inference_steps: int):
+     clean_text = text.strip()
+     if not clean_text:
+         raise gr.Error("Enter a prompt first.")
+     if not API_URL:
+         raise gr.Error("`INDICVOX_API_URL` is not configured on the Space.")
+
+     response = SESSION.post(
+         f"{API_URL}/synthesize",
+         headers=auth_headers(),
+         json={
+             "text": clean_text,
+             "profile_name": profile_name,
+             "voice_name": voice_name,
+             "cfg_value": float(cfg_value),
+             "inference_steps": int(inference_steps),
+         },
+         timeout=TIMEOUT_S,
+     )
+
+     if not response.ok:
+         detail = response.text
+         try:
+             detail = response.json().get("detail", detail)
+         except Exception:
+             pass
+         raise gr.Error(f"Backend error {response.status_code}: {detail}")
+
+     with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
+         f.write(response.content)
+         audio_path = f.name
+
+     audio_seconds = response.headers.get("X-IndicVox-Audio-Seconds", "n/a")
+     generation_seconds = response.headers.get("X-IndicVox-Generation-Seconds", "n/a")
+     rtf = response.headers.get("X-IndicVox-RTF", "n/a")
+     gpu = response.headers.get("X-IndicVox-GPU", "unknown")
+     status = (
+         f"**Ready** \n"
+         f"Profile: `{profile_name}` \n"
+         f"Voice: `{voice_name}` \n"
+         f"GPU backend: `{gpu}` \n"
+         f"Audio length: `{audio_seconds}s` \n"
+         f"Generation time: `{generation_seconds}s` \n"
+         f"RTF: `{rtf}`"
+     )
+     return audio_path, status
+
+
+ def voice_preview(voice_name: str):
+     voice = VOICE_PRESETS[voice_name]
+     preview_path = str(voice["path"]) if voice["path"] is not None else None
+     return preview_path, voice_markdown(voice_name)
+
+
+ def clear_prompt() -> str:
+     return ""
+
+
+ with gr.Blocks() as demo:
+     with gr.Column(elem_id="app-shell"):
+         gr.HTML(
+             """
+             <div id="hero">
+               <h1>IndicVox</h1>
+               <p>Research demo for multilingual TTS across Hindi, Tamil, and code-switched prompts.</p>
+               <div>
+                 <span class="stat-chip">HF Space frontend</span>
+                 <span class="stat-chip">VM-hosted H100 backend</span>
+                 <span class="stat-chip">Hindi + Tamil + English prompts</span>
+               </div>
+             </div>
+             """
+         )
+
+         with gr.Row():
+             with gr.Column(scale=5):
+                 prompt = gr.Textbox(
+                     label="Prompt",
+                     value=DEFAULT_TEXT,
+                     lines=5,
+                     max_lines=8,
+                     placeholder="Type Hindi, Tamil, or code-switched text here...",
+                 )
+
+                 with gr.Row():
+                     profile = gr.Dropdown(
+                         choices=list(PROFILES.keys()),
+                         value=DEFAULT_PROFILE,
+                         label="Model Profile",
+                         info="Switch between the Hindi-tuned and Tamil-tuned research profiles.",
+                     )
+                     voice = gr.Dropdown(
+                         choices=list(VOICE_PRESETS.keys()),
+                         value=DEFAULT_VOICE,
+                         label="Voice Preset",
+                         info="Built-in research voices plus a zero-shot option.",
+                     )
+
+                 with gr.Accordion("Advanced Settings", open=False):
+                     with gr.Row():
+                         cfg_value = gr.Slider(
+                             minimum=1.0,
+                             maximum=4.0,
+                             value=2.0,
+                             step=0.1,
+                             label="CFG",
+                         )
+                         inference_steps = gr.Slider(
+                             minimum=6,
+                             maximum=16,
+                             value=10,
+                             step=1,
+                             label="Diffusion Steps",
+                         )
+
+                 with gr.Row():
+                     generate_btn = gr.Button("Generate Speech", variant="primary", size="lg")
+                     clear_btn = gr.Button("Clear Prompt")
+                     refresh_btn = gr.Button("Refresh Backend Status")
+
+                 with gr.Row():
+                     profile_info = gr.Markdown(profile_markdown(DEFAULT_PROFILE))
+                     voice_info = gr.Markdown(voice_markdown(DEFAULT_VOICE))
+
+             with gr.Column(scale=4):
+                 backend_info = gr.Markdown(backend_status())
+                 output_audio = gr.Audio(
+                     label="Synthesized Audio",
+                     autoplay=False,
+                     format="wav",
+                 )
+                 generation_info = gr.Markdown("Generate a sample to see timing details.")
+                 voice_preview_audio = gr.Audio(
+                     label="Voice Preset Preview",
+                     value=str(VOICE_PRESETS[DEFAULT_VOICE]["path"]),
+                     interactive=False,
+                     autoplay=False,
+                     format="wav",
+                 )
+                 gr.Markdown(
+                     "Inference runs on the external VM GPU; the Space only provides the paper demo UI.",
+                     elem_classes=["footnote"],
+                 )
+
+         with gr.Tabs():
+             with gr.Tab("Hindi + English Examples"):
+                 gr.Examples(
+                     examples=[row for row in EXAMPLES if row[1] == "Hindi Focus"],
+                     inputs=[prompt, profile, voice],
+                     cache_examples=False,
+                 )
+             with gr.Tab("Tamil + English Examples"):
+                 gr.Examples(
+                     examples=[row for row in EXAMPLES if row[1] == "Tamil Focus"],
+                     inputs=[prompt, profile, voice],
+                     cache_examples=False,
+                 )
+
+         gr.Markdown(
+             """
+             **Demo notes**
+
+             - `Hindi Focus` maps to the Hindi-strong checkpoint from the paper experiments.
+             - `Tamil Focus` maps to the Tamil + code-switch checkpoint and is the default for the demo.
+             - `Text Only` skips the reference clip and runs zero-shot synthesis.
+             """,
+             elem_classes=["footnote"],
+         )
+
+     demo.load(fn=backend_status, outputs=backend_info, api_name=False)
+     generate_btn.click(
+         fn=synthesize,
+         inputs=[prompt, profile, voice, cfg_value, inference_steps],
+         outputs=[output_audio, generation_info],
+         api_name="synthesize",
+     )
+     prompt.submit(
+         fn=synthesize,
+         inputs=[prompt, profile, voice, cfg_value, inference_steps],
+         outputs=[output_audio, generation_info],
+         api_name=False,
+     )
+     profile.change(fn=profile_markdown, inputs=profile, outputs=profile_info, api_name=False)
+     voice.change(fn=voice_preview, inputs=voice, outputs=[voice_preview_audio, voice_info], api_name=False)
+     clear_btn.click(fn=clear_prompt, outputs=prompt, api_name=False)
+     refresh_btn.click(fn=backend_status, outputs=backend_info, api_name=False)
+
+ demo.queue(default_concurrency_limit=2, max_size=32)
+
+ if __name__ == "__main__":
+     demo.launch(theme=THEME, css=CUSTOM_CSS)
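The backend that `frontend_app.py` talks to is not part of this commit, but the frontend code implies its contract: `GET /health` returns JSON with `gpu`, `active_profile`, and `uptime_s`; `POST /synthesize` accepts a JSON body and returns WAV bytes plus `X-IndicVox-*` timing headers, or a JSON error with a `detail` field. The following is a rough stand-in for local UI testing, not the real backend. It uses only the standard library, returns one second of silence instead of speech, does not check `x-api-key`, and assumes a 16 kHz sample rate and an RTF defined as generation time divided by audio length; all names in it are hypothetical.

```python
import io
import json
import time
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

START = time.monotonic()
SAMPLE_RATE = 16000  # assumption: the real backend's sample rate is not shown
JSON_HEADERS = {"Content-Type": "application/json"}


def silent_wav(seconds: float) -> bytes:
    """Build a mono 16-bit WAV of silence as placeholder audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(b"\x00\x00" * int(SAMPLE_RATE * seconds))
    return buf.getvalue()


def health_payload() -> dict:
    """Fields that the frontend's backend_status() reads from /health."""
    return {
        "gpu": "stub",
        "active_profile": "Tamil Focus",
        "uptime_s": round(time.monotonic() - START, 1),
    }


def synthesize_response(req: object) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a parsed /synthesize request."""
    if not isinstance(req, dict) or not str(req.get("text", "")).strip():
        body = json.dumps({"detail": "text is required"}).encode("utf-8")
        return 422, dict(JSON_HEADERS), body
    started = time.perf_counter()
    audio_s = 1.0
    audio = silent_wav(audio_s)  # a real backend would run the TTS model here
    gen_s = time.perf_counter() - started
    headers = {
        "Content-Type": "audio/wav",
        "X-IndicVox-Audio-Seconds": f"{audio_s:.2f}",
        "X-IndicVox-Generation-Seconds": f"{gen_s:.2f}",
        "X-IndicVox-RTF": f"{gen_s / audio_s:.3f}",  # assumed definition of RTF
        "X-IndicVox-GPU": "stub",
    }
    return 200, headers, audio


class StubHandler(BaseHTTPRequestHandler):
    def _reply(self, status: int, headers: Dict[str, str], body: bytes) -> None:
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        if self.path == "/health":
            body = json.dumps(health_payload()).encode("utf-8")
            self._reply(200, JSON_HEADERS, body)
        else:
            self._reply(404, JSON_HEADERS, b'{"detail": "not found"}')

    def do_POST(self) -> None:
        if self.path != "/synthesize":
            self._reply(404, JSON_HEADERS, b'{"detail": "not found"}')
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            req = json.loads(self.rfile.read(length))
        except ValueError:  # covers malformed JSON and bad encodings
            self._reply(400, JSON_HEADERS, b'{"detail": "invalid JSON"}')
            return
        self._reply(*synthesize_response(req))
```

To try the frontend against this stub, one could run `HTTPServer(("127.0.0.1", 8000), StubHandler).serve_forever()` in a separate process and set `INDICVOX_API_URL=http://127.0.0.1:8000` before launching `frontend_app.py`. Keeping the response logic in plain functions, separate from the HTTP handler, lets it be exercised without opening a socket.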
requirements.txt CHANGED
@@ -1,13 +1,2 @@
  gradio>=6,<7
- huggingface_hub>=1.0
- numpy<3
- torch>=2.5.0
- torchaudio>=2.5.0
- transformers>=4.36.2
- einops>=0.8.0
- inflect>=7.0.0
- wetext
- librosa>=0.10.2
- soundfile>=0.12.1
- pydantic>=2
- safetensors>=0.4.5
+ requests>=2.31.0