Spaces: Tushar9802/sakhi

Tushar9802 committed 11 days ago

Commit

d3595cb

1 Parent(s): 5575d97

feat(audio): demo clips + Translate + Whisper warm + CT2 mirror

- demo_audio/ ships two curated ANC preeclampsia recordings (20s + 52s)
with a manifest. Voice tab gets a "Load voice example" dropdown that
fetches the file, populates the player, and runs extraction. Two
reviewer-facing wins: (a) text-only demo silently undersold voice;
(b) the audio path is the actual product moat.
- /api/audio-examples serves the manifest with synthesized URLs.
/audio/* static-mounts the bundle, ordered before the SPA catch-all.
.dockerignore allowlists demo_audio so the blanket *.ogg excludes
don't silently drop the curated clips at build time.
- /api/translate exposes Hindi/Hinglish -> English via the resident
Gemma. New Translate button below the transcript card, on-demand so
the main extraction path stays fast. ~3-5s on a hot T4.
- Whisper now eager-loads from the FastAPI startup hook. Space marks
READY only when both LLM and ASR are resident; first user audio
request no longer pays a ~60s cold load. WHISPER_MODEL env var
defaults to Tushar9802/whisper-large-v2-hindi-ct2 (CT2 mirror of
collabora/whisper-large-v2-hindi -- faster-whisper requires CT2
format, and the source repo is transformers). Local dev with a
models/whisper-hindi-ct2/ directory takes precedence over the env.
- /api/health returns the LLM and ASR model names; status line surfaces
both.
- Ollama keep_alive at both call sites now reads
OLLAMA_KEEP_ALIVE from env so the 24h container default actually
wins. Previously the hardcoded "10m" silently overrode the env.
- README touch-ups: ASR row notes the CT2 mirror, HF Space cache
section reflects eager-load + persistent caching.
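
The keep_alive fix described above amounts to reading the value from the environment at call time and using the old literal only when the variable is unset. A minimal sketch, assuming a hypothetical helper name `resolve_keep_alive` (the commit inlines the `os.environ.get` call at each call site instead of sharing a helper):

```python
import os


def resolve_keep_alive(env=None, fallback="10m"):
    """Return the Ollama keep_alive value, preferring OLLAMA_KEEP_ALIVE.

    `resolve_keep_alive` is a hypothetical helper for illustration only;
    the commit calls os.environ.get("OLLAMA_KEEP_ALIVE", "10m") inline.
    """
    env = os.environ if env is None else env
    return env.get("OLLAMA_KEEP_ALIVE", fallback)


# With the Dockerfile default in place, the container value wins.
print(resolve_keep_alive({"OLLAMA_KEEP_ALIVE": "24h"}))  # 24h
# With the variable unset, the previous literal is the fallback.
print(resolve_keep_alive({}))  # 10m
```

Passing the mapping explicitly keeps the lookup testable without mutating the real process environment.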

Files changed (11)

.dockerignore +5 -0
.gitattributes +3 -0
Dockerfile +3 -1
README.md +2 -2
api.py +55 -2
app.py +64 -15
demo_audio/anc_preeclampsia_full.ogg +3 -0
demo_audio/anc_preeclampsia_short.ogg +3 -0
demo_audio/manifest.json +20 -0
frontend/src/App.css +12 -0
frontend/src/App.jsx +107 -1

.dockerignore CHANGED

@@ -72,6 +72,11 @@ app.log
 *.mpeg
 *.flac
 !tests/fixtures/*.wav
 # Secrets
 .env

 *.mpeg
 *.flac
 !tests/fixtures/*.wav
+# Curated reviewer-facing demo clips must reach the image — these are tiny
+# (≈100 KB each) and served by the FastAPI static mount at /audio/*.
+!demo_audio/*.ogg
+!demo_audio/*.wav
+!demo_audio/*.mp3
 # Secrets
 .env

.gitattributes ADDED

	@@ -0,0 +1,3 @@

+*.ogg filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.mp3 filter=lfs diff=lfs merge=lfs -text

Dockerfile CHANGED

@@ -63,6 +63,7 @@ COPY app.py api.py ./
 COPY src/ ./src/
 COPY configs/ ./configs/
 COPY scripts/ ./scripts/
 COPY FAILURES.md JUDGE_BRIEF.md README.md ./
 COPY entrypoint.sh ./
 RUN chmod +x entrypoint.sh
@@ -75,7 +76,8 @@ ENV PORT=7860 \
     OLLAMA_MODEL=gemma4:e4b-it-q4_K_M \
     OLLAMA_MODELS=/data/.ollama/models \
     HF_HOME=/data/.cache/huggingface \
-    OLLAMA_KEEP_ALIVE=24h
 EXPOSE 7860

 COPY src/ ./src/
 COPY configs/ ./configs/
 COPY scripts/ ./scripts/
+COPY demo_audio/ ./demo_audio/
 COPY FAILURES.md JUDGE_BRIEF.md README.md ./
 COPY entrypoint.sh ./
 RUN chmod +x entrypoint.sh
     OLLAMA_MODEL=gemma4:e4b-it-q4_K_M \
     OLLAMA_MODELS=/data/.ollama/models \
     HF_HOME=/data/.cache/huggingface \
+    OLLAMA_KEEP_ALIVE=24h \
+    WHISPER_MODEL=Tushar9802/whisper-large-v2-hindi-ct2
 EXPOSE 7860

README.md CHANGED

@@ -65,7 +65,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
 | Component | Model | Size | Role | Deployment |
 |-----------|-------|------|------|------------|
-| ASR (workstation path only) | collabora/whisper-large-v2-hindi | ~1.5 GB | Hindi speech → text via faster-whisper/CTranslate2 | Workstation |
 | Normalization | src/hindi_normalize.py | — | Hindi number words → digits, medical term mapping | Shared (Python server-side; JS port for phone) |
 | Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
 | Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
@@ -289,7 +289,7 @@ git push hf master
 #                ~7 GB and the first request waits 3–5 min)
 ```
-On first boot the container pulls `gemma4:e4b-it-q4_K_M` into the persistent volume (~3 min). Subsequent restarts are instant. Whisper-Large CT2 downloads from HF Hub on the first audio request and stays cached under `$HF_HOME`.
 **Subsequent updates:** `git push hf master` after any code change; HF rebuilds and redeploys.

 | Component | Model | Size | Role | Deployment |
 |-----------|-------|------|------|------------|
+| ASR (workstation path only) | collabora/whisper-large-v2-hindi (served as the CTranslate2 mirror Tushar9802/whisper-large-v2-hindi-ct2 — faster-whisper requires CT2 format) | ~1.5 GB | Hindi speech → text via faster-whisper/CTranslate2 | Workstation |
 | Normalization | src/hindi_normalize.py | — | Hindi number words → digits, medical term mapping | Shared (Python server-side; JS port for phone) |
 | Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
 | Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
 #                ~7 GB and the first request waits 3–5 min)
 ```
+On first boot the container pulls `gemma4:e4b-it-q4_K_M` into the persistent volume (~3 min) and warms it with a one-token generate so the first user request lands hot. The FastAPI startup hook eagerly loads Whisper-Large CT2 from `Tushar9802/whisper-large-v2-hindi-ct2` (~3 GB, cached under `$HF_HOME` on the persistent volume after the first boot). The Space only reports ready when both models are resident. Subsequent restarts read everything from `/data` and are fast.
 **Subsequent updates:** `git push hf master` after any code change; HF rebuilds and redeploys.

api.py CHANGED

@@ -34,6 +34,9 @@ from app import (
     init_schemas,
     validate_form_output,
     postprocess_transcript,
 )
 app = FastAPI(title="Sakhi API", version="1.0.0")
@@ -46,10 +49,17 @@ app.add_middleware(
     allow_headers=["*"],
 )
-# Load schemas on startup — models load lazily on first request (like Gradio)
 @app.on_event("startup")
 def startup():
     init_schemas()
 # ── Models ──
@@ -71,6 +81,10 @@ class TextRequest(BaseModel):
     metadata: Optional[PatientMetadata] = None
 class ExtractionResult(BaseModel):
     visit_type: str
     form: Optional[dict] = None
@@ -94,7 +108,11 @@ def _metadata_dict(meta):
 # ── Endpoints ──
 @app.get("/api/health")
 def health():
-    return {"status": "ok", "model": os.environ.get("OLLAMA_MODEL", "gemma4:e4b-it-q4_K_M")}
 @app.get("/api/examples")
@@ -107,6 +125,34 @@ def examples():
     # index 1 = "ANC Visit — Preeclampsia (DANGER)" — best for demo (has danger signs)
 @app.post("/api/process-text", response_model=ExtractionResult)
 def process_text(req: TextRequest):
     t_total = time.time()
@@ -334,6 +380,13 @@ async def process_audio_stream(
     return StreamingResponse(generate(), media_type="text/event-stream")
 # Serve built React frontend at / when dist exists (unified desktop UI for health centers).
 # Must be mounted AFTER all /api/* routes so they take priority.
 _FRONTEND_DIST = os.path.join(os.path.dirname(os.path.abspath(__file__)), "frontend", "dist")

     init_schemas,
     validate_form_output,
     postprocess_transcript,
+    translate_to_english,
+    warm_whisper,
+    WHISPER_MODEL,
 )
 app = FastAPI(title="Sakhi API", version="1.0.0")
     allow_headers=["*"],
 )
+# Startup: load schemas + pre-warm Whisper so the Space only reports ready
+# when the audio path is hot. Whisper load is wrapped in try/except — if the
+# eager load fails (no GPU, network blip), fall back to lazy loading on
+# first audio request instead of blocking the whole boot.
 @app.on_event("startup")
 def startup():
     init_schemas()
+    try:
+        warm_whisper()
+    except Exception as e:
+        print(f"[startup] WARN: Whisper pre-warm failed ({e!r}); falling back to lazy load")
 # ── Models ──
     metadata: Optional[PatientMetadata] = None
+class TranslateRequest(BaseModel):
+    text: str
 class ExtractionResult(BaseModel):
     visit_type: str
     form: Optional[dict] = None
 # ── Endpoints ──
 @app.get("/api/health")
 def health():
+    return {
+        "status": "ok",
+        "model": os.environ.get("OLLAMA_MODEL", "gemma4:e4b-it-q4_K_M"),
+        "whisper": WHISPER_MODEL,
+    }
 @app.get("/api/examples")
     # index 1 = "ANC Visit — Preeclampsia (DANGER)" — best for demo (has danger signs)
+@app.post("/api/translate")
+def translate(req: TranslateRequest):
+    """Hindi / Hinglish → English. Uses the same Gemma model already in VRAM,
+    so the cost is one extra ~3-5s LLM call. Reviewer-facing convenience;
+    never invoked from the main extraction path."""
+    t0 = time.time()
+    english = translate_to_english(req.text)
+    return {"english": english, "time_s": round(time.time() - t0, 2)}
+_DEMO_AUDIO_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "demo_audio")
+@app.get("/api/audio-examples")
+def audio_examples():
+    """Curated voice clips bundled into the image. Returns playable URLs
+    relative to the Space origin, so the frontend can both <audio src=...>
+    them and re-POST them to /api/process-audio-stream."""
+    manifest_path = os.path.join(_DEMO_AUDIO_DIR, "manifest.json")
+    if not os.path.isfile(manifest_path):
+        return []
+    with open(manifest_path, "r", encoding="utf-8") as f:
+        entries = json.load(f)
+    for e in entries:
+        e["url"] = f"/audio/{e['file']}"
+    return entries
 @app.post("/api/process-text", response_model=ExtractionResult)
 def process_text(req: TextRequest):
     t_total = time.time()
     return StreamingResponse(generate(), media_type="text/event-stream")
+# Serve curated demo audio under /audio/* so the frontend can <audio src=...>
+# them. Must be mounted BEFORE the SPA catch-all below; otherwise the
+# StaticFiles for `/` would swallow these paths.
+if os.path.isdir(_DEMO_AUDIO_DIR):
+    app.mount("/audio", StaticFiles(directory=_DEMO_AUDIO_DIR), name="demo_audio")
 # Serve built React frontend at / when dist exists (unified desktop UI for health centers).
 # Must be mounted AFTER all /api/* routes so they take priority.
 _FRONTEND_DIST = os.path.join(os.path.dirname(os.path.abspath(__file__)), "frontend", "dist")
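
The manifest handling in `/api/audio-examples` above can be exercised without FastAPI. The sketch below is an illustration under stated assumptions, not the commit's code: `load_audio_examples` and its `url_prefix` parameter are hypothetical names, and the `quote()` call is an added precaution for filenames containing spaces (the commit builds the URL with a plain f-string).

```python
import json
import os
from urllib.parse import quote


def load_audio_examples(demo_dir, url_prefix="/audio"):
    """Read manifest.json from demo_dir and attach a playable URL to each entry.

    Returns an empty list when the manifest is missing, matching the
    endpoint's behaviour in the diff.
    """
    manifest_path = os.path.join(demo_dir, "manifest.json")
    if not os.path.isfile(manifest_path):
        return []
    with open(manifest_path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    # Copy each entry so the parsed manifest is not mutated in place.
    return [{**e, "url": f"{url_prefix}/{quote(e['file'])}"} for e in entries]
```

Because the URLs are relative to the Space origin, the same entries work both as `<audio src=...>` targets and as fetch targets for re-upload.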

app.py CHANGED

@@ -27,6 +27,15 @@ OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "gemma4:e4b-it-q4_K_M")
 USE_OLLAMA = os.environ.get("USE_OLLAMA", "1") == "1"
 USE_FUNCTION_CALLING = os.environ.get("USE_FUNCTION_CALLING", "1") == "1"
 # System prompts (same as training)
 FORM_SYSTEM_PROMPT = (
     "You are a clinical data extraction system for India's ASHA health worker program. "
@@ -189,20 +198,28 @@ from src.hindi_normalize import normalize_transcript as postprocess_transcript
 _whisper_model = None
-def transcribe_audio(audio_path):
-    """Transcribe audio using collabora/whisper-large-v2-hindi via faster-whisper (CTranslate2)."""
     global _whisper_model
-    if _whisper_model is None:
-        from faster_whisper import WhisperModel
-        import os
-        ct2_path = os.path.join(os.path.dirname(__file__), "models", "whisper-hindi-ct2")
-        if os.path.exists(ct2_path):
-            print(f"[ASR] Loading CTranslate2 model from {ct2_path}...")
-            _whisper_model = WhisperModel(ct2_path, device="cuda", compute_type="float16")
-        else:
-            print("[ASR] CT2 model not found, loading from HuggingFace (slower)...")
-            _whisper_model = WhisperModel("collabora/whisper-large-v2-hindi", device="cuda", compute_type="float16")
-        print("[ASR] Whisper loaded.")
     print("[ASR] Transcribing...")
     segments, info = _whisper_model.transcribe(
@@ -229,6 +246,38 @@ def run_inference(system_prompt, user_prompt):
     return _run_inference_unsloth(system_prompt, user_prompt)
 def _run_inference_ollama(system_prompt, user_prompt):
     """Run inference via Ollama API — fast GGUF on GPU with JSON mode."""
     import ollama
@@ -242,7 +291,7 @@ def _run_inference_ollama(system_prompt, user_prompt):
         ],
         format="json",
         options={"temperature": 0.1, "num_ctx": 4096, "num_gpu": 999},
-        keep_alive="10m",
     )
     elapsed = time.time() - t0
@@ -378,7 +427,7 @@ def _run_danger_fc(transcript, visit_type):
         ],
         tools=tools,
         options={"temperature": 0.1, "num_ctx": 4096, "num_gpu": 999},
-        keep_alive="10m",
     )
     elapsed = time.time() - t0

 USE_OLLAMA = os.environ.get("USE_OLLAMA", "1") == "1"
 USE_FUNCTION_CALLING = os.environ.get("USE_FUNCTION_CALLING", "1") == "1"
+# Whisper config. Default = the CTranslate2-converted mirror of collabora's
+# Hindi fine-tune of whisper-large-v2 (selected after session 19's real-voice
+# validation pass). faster-whisper requires CT2 format; the original
+# collabora/ repo is transformers format and won't load directly.
+# Override with WHISPER_MODEL for evals against other variants. Local dev
+# with a pre-converted CT2 directory at models/whisper-hindi-ct2/ takes
+# precedence over this env var — see warm_whisper().
+WHISPER_MODEL = os.environ.get("WHISPER_MODEL", "Tushar9802/whisper-large-v2-hindi-ct2")
 # System prompts (same as training)
 FORM_SYSTEM_PROMPT = (
     "You are a clinical data extraction system for India's ASHA health worker program. "
 _whisper_model = None
+def warm_whisper():
+    """Eagerly load the Whisper model into VRAM. Idempotent — safe to call
+    multiple times; subsequent calls return the cached singleton. Called from
+    FastAPI's startup hook so the first user audio request lands hot."""
     global _whisper_model
+    if _whisper_model is not None:
+        return _whisper_model
+    from faster_whisper import WhisperModel
+    ct2_path = os.path.join(os.path.dirname(__file__), "models", "whisper-hindi-ct2")
+    if os.path.exists(ct2_path):
+        print(f"[ASR] Loading CTranslate2 model from {ct2_path}...")
+        _whisper_model = WhisperModel(ct2_path, device="cuda", compute_type="float16")
+    else:
+        print(f"[ASR] Loading {WHISPER_MODEL} from HuggingFace Hub...")
+        _whisper_model = WhisperModel(WHISPER_MODEL, device="cuda", compute_type="float16")
+    print("[ASR] Whisper loaded.")
+    return _whisper_model
+def transcribe_audio(audio_path):
+    """Transcribe audio using the configured Whisper model via faster-whisper (CTranslate2)."""
+    warm_whisper()
     print("[ASR] Transcribing...")
     segments, info = _whisper_model.transcribe(
     return _run_inference_unsloth(system_prompt, user_prompt)
+def translate_to_english(hindi_text):
+    """Translate Hindi / Hinglish home-visit text to English via the same
+    Gemma model already loaded in VRAM. On-demand only — never on the
+    main extraction path. Returns plain English text (not JSON)."""
+    import ollama
+    text = (hindi_text or "").strip()
+    if not text:
+        return ""
+    t0 = time.time()
+    resp = ollama.chat(
+        model=OLLAMA_MODEL,
+        messages=[
+            {"role": "system", "content": (
+                "Translate the following Hindi or Hinglish conversation into clear, natural English. "
+                "Preserve speaker labels (ASHA / Patient / Mother) and clinical numbers exactly. "
+                "Do not add commentary or explanations — output ONLY the translation."
+            )},
+            {"role": "user", "content": text},
+        ],
+        options={"temperature": 0.1, "num_ctx": 4096, "num_gpu": 999},
+        keep_alive=os.environ.get("OLLAMA_KEEP_ALIVE", "10m"),
+    )
+    elapsed = time.time() - t0
+    out = resp.message.content.strip()
+    tok_s = resp.eval_count / (resp.eval_duration / 1e9) if resp.eval_duration else 0
+    print(f"[LLM] Translate: {elapsed:.1f}s ({resp.eval_count} tok, {tok_s:.0f} tok/s)")
+    return out
 def _run_inference_ollama(system_prompt, user_prompt):
     """Run inference via Ollama API — fast GGUF on GPU with JSON mode."""
     import ollama
         ],
         format="json",
         options={"temperature": 0.1, "num_ctx": 4096, "num_gpu": 999},
+        keep_alive=os.environ.get("OLLAMA_KEEP_ALIVE", "10m"),
     )
     elapsed = time.time() - t0
         ],
         tools=tools,
         options={"temperature": 0.1, "num_ctx": 4096, "num_gpu": 999},
+        keep_alive=os.environ.get("OLLAMA_KEEP_ALIVE", "10m"),
     )
     elapsed = time.time() - t0
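
The model-source precedence that `warm_whisper()` applies (local CT2 directory first, then the `WHISPER_MODEL` env var, then the built-in default) can be isolated from the GPU load itself. A minimal sketch, assuming a hypothetical function name `resolve_whisper_source`; the commit performs this check inline before constructing `WhisperModel`:

```python
import os

DEFAULT_WHISPER_MODEL = "Tushar9802/whisper-large-v2-hindi-ct2"


def resolve_whisper_source(base_dir, env=None):
    """Return the path or Hub repo id to hand to the model loader.

    Order: a local models/whisper-hindi-ct2/ directory under base_dir,
    then the WHISPER_MODEL env var, then the default CT2 mirror.
    `resolve_whisper_source` is a hypothetical name used for illustration.
    """
    env = os.environ if env is None else env
    local_ct2 = os.path.join(base_dir, "models", "whisper-hindi-ct2")
    if os.path.exists(local_ct2):
        return local_ct2
    return env.get("WHISPER_MODEL", DEFAULT_WHISPER_MODEL)
```

Keeping the decision in a pure function makes it possible to unit-test the precedence without faster-whisper or a GPU installed.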

demo_audio/anc_preeclampsia_full.ogg ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7572ab4c67cae5373451aae322882931e36fa9f643e33e2743200ce2b01d0e76
+size 118010

demo_audio/anc_preeclampsia_short.ogg ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2f36a1da62e4db68867036d72200fd5813087c6888a645a471e928144700e045
+size 46456

demo_audio/manifest.json ADDED

	@@ -0,0 +1,20 @@

+[
+  {
+    "id": "anc_preeclampsia_short",
+    "label": "ANC — Preeclampsia (short, 20s)",
+    "file": "anc_preeclampsia_short.ogg",
+    "duration_s": 20,
+    "visit_type_hint": "anc_visit",
+    "speaker": "male",
+    "description": "Short ANC clip. ASHA reads BP 155/100 — the danger-sign threshold trigger. Demonstrates the danger pipeline end-to-end on a clip a judge will actually sit through."
+  },
+  {
+    "id": "anc_preeclampsia_full",
+    "label": "ANC — Preeclampsia (full, 52s)",
+    "file": "anc_preeclampsia_full.ogg",
+    "duration_s": 52,
+    "visit_type_hint": "anc_visit",
+    "speaker": "male",
+    "description": "Full ANC home-visit role-play. Headache, blurred vision, facial swelling, BP 155/100, late-pregnancy context (≈8 months) — multiple preeclampsia danger signs plus a PHC-referral decision."
+  }
+]
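
Since the frontend looks entries up by `id`, renders `label`, and fetches `file`, a malformed manifest would only surface at demo time. The check below is a hypothetical sketch, not part of the commit: the function name `manifest_problems` and the choice of required keys are assumptions inferred from how the endpoint and `App.jsx` use each entry.

```python
# Keys the endpoint and frontend in this commit appear to rely on (an assumption).
REQUIRED_KEYS = {"id", "label", "file"}


def manifest_problems(entries, available_files):
    """Return a list of human-readable problems found in manifest entries.

    entries: parsed manifest.json (a list of dicts).
    available_files: set of filenames present in demo_audio/.
    """
    problems = []
    seen_ids = set()
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if entry["id"] in seen_ids:
            problems.append(f"entry {i}: duplicate id {entry['id']!r}")
        seen_ids.add(entry["id"])
        if entry["file"] not in available_files:
            problems.append(f"entry {i}: file {entry['file']!r} not found")
    return problems
```

An empty return value means every entry has the expected keys, a unique id, and a clip that actually exists alongside the manifest.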

frontend/src/App.css CHANGED

@@ -238,6 +238,18 @@ select {
   font-size: 13px;
 }
 .text-input {
   width: 100%;
   min-height: 220px;

   font-size: 13px;
 }
+.translate-row {
+  display: flex;
+  align-items: center;
+  gap: 12px;
+  margin-top: 12px;
+}
+.error-text {
+  color: #991b1b;
+  font-size: 13px;
+}
 .text-input {
   width: 100%;
   min-height: 220px;

frontend/src/App.jsx CHANGED

@@ -204,6 +204,16 @@ function App() {
   const [audioUrl, setAudioUrl] = useState('')
   const [isRecording, setIsRecording] = useState(false)
   const mediaRecorderRef = useRef(null)
   const streamRef = useRef(null)
   const chunksRef = useRef([])
@@ -261,7 +271,11 @@ function App() {
     fetch(`${API_BASE}/api/health`)
       .then((r) => r.json())
       .then((d) => {
-        setHealth(`API: ${d.status} · Model: ${d.model}`)
         setApiReachable(true)
       })
       .catch(() => {
@@ -280,8 +294,52 @@ function App() {
         }
       })
       .catch(() => {})
   }, [])
   function saveServerUrl() {
     const cleaned = (serverUrlInput || '').trim().replace(/\/+$/, '')
     if (!cleaned) {
@@ -964,6 +1022,34 @@ function App() {
             visitType={recordingVisitType}
             setVisitType={setRecordingVisitType}
           />
           <div className="card">
             <div className="audio-tools audio-tools-3">
               <button className={`btn ${isRecording ? 'danger' : ''}`} onClick={isRecording ? stopRecording : startRecording}>
@@ -983,6 +1069,26 @@ function App() {
           <div className="card">
             <h3>Transcript</h3>
             <pre className="transcript">{voiceState.transcript || 'Transcript will appear here after processing audio.'}</pre>
           </div>
         </section>
       )}

   const [audioUrl, setAudioUrl] = useState('')
   const [isRecording, setIsRecording] = useState(false)
+  // Curated audio examples served by the backend at /api/audio-examples.
+  // Lets a judge play a real role-play clip + run extraction without
+  // needing their own Hindi audio.
+  const [audioExamples, setAudioExamples] = useState([])
+  const [selectedAudioExample, setSelectedAudioExample] = useState('')
+  // On-demand English translation of the Hindi transcript. Not in the main
+  // extraction path — fires only when the reviewer clicks Translate.
+  const [translation, setTranslation] = useState({ loading: false, english: '', error: '' })
   const mediaRecorderRef = useRef(null)
   const streamRef = useRef(null)
   const chunksRef = useRef([])
     fetch(`${API_BASE}/api/health`)
       .then((r) => r.json())
       .then((d) => {
+        setHealth(
+          d.whisper
+            ? `API: ${d.status} · LLM: ${d.model} · ASR: ${d.whisper}`
+            : `API: ${d.status} · Model: ${d.model}`
+        )
         setApiReachable(true)
       })
       .catch(() => {
         }
       })
       .catch(() => {})
+    fetch(`${API_BASE}/api/audio-examples`)
+      .then((r) => r.json())
+      .then((data) => setAudioExamples(Array.isArray(data) ? data : []))
+      .catch(() => {})
   }, [])
+  async function loadAudioExample(id) {
+    setSelectedAudioExample(id)
+    if (!id) return
+    const ex = audioExamples.find((e) => e.id === id)
+    if (!ex) return
+    setVoiceState((s) => ({ ...s, error: '' }))
+    setTranslation({ loading: false, english: '', error: '' })
+    try {
+      const res = await fetch(`${API_BASE}${ex.url}`)
+      if (!res.ok) throw new Error(`fetch ${ex.url} → ${res.status}`)
+      const blob = await res.blob()
+      const file = new File([blob], ex.file, { type: blob.type || 'audio/ogg' })
+      if (audioUrl) URL.revokeObjectURL(audioUrl)
+      setAudioFile(file)
+      setAudioUrl(URL.createObjectURL(file))
+      if (ex.visit_type_hint) setRecordingVisitType(ex.visit_type_hint)
+    } catch (err) {
+      setVoiceState((s) => ({ ...s, error: `Could not load example: ${err.message}` }))
+    }
+  }
+  async function translateTranscript() {
+    const text = (voiceState.transcript || '').trim()
+    if (!text) return
+    setTranslation({ loading: true, english: '', error: '' })
+    try {
+      const res = await fetch(`${API_BASE}/api/translate`, {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json' },
+        body: JSON.stringify({ text }),
+      })
+      if (!res.ok) throw new Error(`HTTP ${res.status}`)
+      const data = await res.json()
+      setTranslation({ loading: false, english: data.english || '', error: '' })
+    } catch (err) {
+      setTranslation({ loading: false, english: '', error: err.message })
+    }
+  }
   function saveServerUrl() {
     const cleaned = (serverUrlInput || '').trim().replace(/\/+$/, '')
     if (!cleaned) {
             visitType={recordingVisitType}
             setVisitType={setRecordingVisitType}
           />
+          {audioExamples.length > 0 && (
+            <div className="card">
+              <div className="text-tools">
+                <select
+                  value={selectedAudioExample}
+                  onChange={(e) => loadAudioExample(e.target.value)}
+                >
+                  <option value="">Load voice example...</option>
+                  {audioExamples.map((ex) => (
+                    <option key={ex.id} value={ex.id}>
+                      {ex.label}
+                    </option>
+                  ))}
+                </select>
+                <button
+                  className="btn primary"
+                  onClick={processVoice}
+                  disabled={!audioFile || voiceState.loading}
+                >
+                  {voiceState.loading ? 'Processing...' : 'Extract from example'}
+                </button>
+              </div>
+              {selectedAudioExample && (() => {
+                const ex = audioExamples.find((e) => e.id === selectedAudioExample)
+                return ex ? <p className="field-desc">{ex.description}</p> : null
+              })()}
+            </div>
+          )}
           <div className="card">
             <div className="audio-tools audio-tools-3">
               <button className={`btn ${isRecording ? 'danger' : ''}`} onClick={isRecording ? stopRecording : startRecording}>
           <div className="card">
             <h3>Transcript</h3>
             <pre className="transcript">{voiceState.transcript || 'Transcript will appear here after processing audio.'}</pre>
+            {voiceState.transcript && (
+              <div className="translate-row">
+                <button
+                  className="btn secondary"
+                  onClick={translateTranscript}
+                  disabled={translation.loading}
+                >
+                  {translation.loading ? 'Translating...' : 'Translate to English'}
+                </button>
+                {translation.error && (
+                  <span className="error-text">{translation.error}</span>
+                )}
+              </div>
+            )}
+            {translation.english && (
+              <>
+                <h3 style={{ marginTop: 16 }}>English translation</h3>
+                <pre className="transcript">{translation.english}</pre>
+              </>
+            )}
           </div>
         </section>
       )}