Known Failures
Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis.
E2E audio pipeline: 2 / 15 failing (13 / 15 pass)
Harness: scripts/test_pipeline_e2e.py
Pipeline stages exercised: Google TTS (gTTS, Hindi) → Whisper-Large-V2 Hindi ASR (CTranslate2) → src/hindi_normalize.py → Gemma 4 E4B via Ollama (function calling).
Test data: 15 synthetic Hindi ASHA conversations, manifest at test_audio/synthetic/manifest.json, with ground-truth vitals and danger-sign expectations per case.
Failure pattern: BP value drift through TTS → ASR
gTTS (the synthesizer used for test audio generation, see scripts/generate_test_audio.py) is a Python client for Google Translate's text-to-speech service. It is fast and free, but it does not reproduce the prosody of natural Hindi speech: numeric readings tend to come out staccato, with limited inter-word coarticulation. When a number sequence like "एक सौ साठ बटा एक सौ दस" (160/105 in the BP format ASHA workers read aloud) runs through gTTS, "बटा" (the Hindi separator, equivalent to the English "over" in "160 over 105") can be rendered with a sibilance or softening that Whisper-Large-V2 Hindi mishears.
Observed failure pattern (from development iteration logs, before the current passing-13/15 baseline was pinned):
- gTTS audio renders "एक सौ साठ बटा एक सौ दस" with reduced amplitude on बटा.
- Whisper transcribes it as "एक सौ साठ बाटा एक सौ दस", or drops बटा entirely → "एक सौ साठ एक सौ दस", which reads as a single compound 160105.
- The normalization layer (hindi_normalize.parse_number) handles the first variant through a known-misspelling table that maps बटा variants to division-separator synonyms. The second variant (where the separator word is dropped) is handled by a heuristic that looks for the "100-range + 100-range" pattern and splits. That heuristic does not fire on every pattern (e.g., compound dosage phrases can legitimately be concatenated numbers, and over-eager splitting would introduce false positives on non-BP numeric data).
- Downstream: Gemma 4 sees either a mangled BP or the systolic-only component; the form-extraction check bp_systolic == 160 AND bp_diastolic == 105 fails on one component.
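The two normalization paths described above (separator-variant lookup, then a conservative split when the separator is dropped) can be sketched roughly as follows. This is an illustrative sketch only, not the code in src/hindi_normalize.py: the names split_bp, words_to_int, SEPARATORS and VALUES are hypothetical, the vocabulary is a toy subset, and the split rule is an assumed reading of the "100-range + 100-range" heuristic.

```python
# Hypothetical sketch; NOT the real hindi_normalize.parse_number.
SEPARATORS = {"बटा", "बाटा", "by", "/"}  # assumed misspelling/synonym table
HUNDRED = "सौ"
# Toy vocabulary for illustration; the real normalizer covers 0-999.
VALUES = {"एक": 1, "दो": 2, "दस": 10, "पचास": 50, "साठ": 60, "नब्बे": 90}


def words_to_int(tokens):
    """Convert Hindi number words (hundreds + remainder) to an int.

    Raises KeyError for words outside the toy vocabulary.
    """
    total, current = 0, 0
    for tok in tokens:
        if tok == HUNDRED:
            total += (current or 1) * 100
            current = 0
        else:
            current += VALUES[tok]
    return total + current


def split_bp(text):
    """Return (systolic, diastolic), or None when no safe split exists."""
    tokens = text.split()

    # Path 1: a known separator variant survived transcription.
    for i, tok in enumerate(tokens):
        if tok in SEPARATORS:
            left, right = tokens[:i], tokens[i + 1:]
            if left and right:
                return words_to_int(left), words_to_int(right)
            return None

    # Path 2: separator dropped. Split only on exactly two "hundred"
    # words, cutting before the multiplier of the second one.
    hundreds = [i for i, tok in enumerate(tokens) if tok == HUNDRED]
    if len(hundreds) != 2:
        return None  # stay conservative: do not guess
    cut = hundreds[1] - 1
    if cut <= hundreds[0] or VALUES.get(tokens[cut], 10) >= 10:
        return None  # no single-digit multiplier to anchor the split
    return words_to_int(tokens[:cut]), words_to_int(tokens[cut:])
```

Under these assumptions, split_bp("एक सौ पचास बाटा एक सौ") and split_bp("एक सौ पचास एक सौ") both yield (150, 100), while any input without exactly two "सौ" tokens and no separator is left unsplit. The conservative None return mirrors the false-positive concern about non-BP numeric data.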
Why this is a synthetic-audio artifact, not a pipeline defect
- The test-time TTS pipeline (gTTS → mp3) introduces distortion that real speech from a human ASHA saying the same numbers does not. Human speakers pronounce बटा with consistent prosodic stress because it is the pivot of the BP reading; gTTS flattens that stress.
- When a developer pronounces the same Hindi sentence on a real phone mic and feeds it through the same Whisper + normalization pipeline, the BP values extract correctly. This was verified during pipeline development (not captured in the automated suite, since the test harness is gTTS-driven for reproducibility).
- The production deployment path does not include gTTS. Real-world audio comes from an actual phone mic captured in a visit context.
Reproducing these specific failures
Running python scripts/test_pipeline_e2e.py re-generates audio (if missing), runs the pipeline, and prints per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases: the preeclampsia and severe-anemia cases, where Hb or BP is borderline but dangerous.
Planned mitigation
- Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (ROLE_PLAY_SCRIPTS.md) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, the test_pipeline_e2e.py pass rate should rise, not fall, because real speech is cleaner than gTTS for Whisper.
- Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (बटा, by, /). It is currently conservative to avoid false positives; real-audio data will allow re-tuning the recall/precision tradeoff.
Fine-tune vs base: fine-tune loses 1 / 15 (14 / 15 pass) on single-test harness
Harness: scripts/test_ollama_quality.py
Case: anc_hinglish_codeswitching, with heavy Hindi-English code-mixing (e.g., "patient बहुत weak है, hemoglobin low है"). The fine-tune over-refers: it marks the case as refer_within_24h instead of continue_monitoring.
Root cause
The LoRA fine-tune (1,154 synthetic examples, 981 train / 173 val) was trained on a distribution where Hinglish code-switching appeared predominantly in danger-case examples. The model learned the co-occurrence and over-weights "English word in Hindi sentence" as a mild danger signal. On the single Hinglish case that is actually routine, the fine-tune raises the referral urgency one level — a safer failure mode than under-referring, but a failure nonetheless.
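One way to check for this kind of distribution bias is to measure how often code-switched text appears under each label in the training set. The sketch below is illustrative and assumes a hypothetical record format with "text" and "label" keys; the actual training-data layout is not described here. Code-switching is approximated as "contains both Devanagari and ASCII Latin letters", which is a crude proxy.

```python
from collections import Counter


def is_code_switched(text):
    """Crude proxy: text mixes Devanagari and ASCII Latin letters."""
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    return has_devanagari and has_latin


def code_switch_rate_by_label(examples):
    """Fraction of code-switched examples per label.

    `examples` is an iterable of dicts with hypothetical keys
    "text" and "label".
    """
    totals, switched = Counter(), Counter()
    for ex in examples:
        totals[ex["label"]] += 1
        if is_code_switched(ex["text"]):
            switched[ex["label"]] += 1
    return {label: switched[label] / totals[label] for label in totals}
```

A large gap between the rate for danger labels and the rate for routine labels would be consistent with the co-occurrence described above, and would suggest rebalancing before any retrain.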
Disposition
Documented in RETRAIN_RESULTS.md. The base model is in the live Ollama path. The fine-tune remains available on the Ollama registry as tusharbrisingr9802/sakhi for deployments that prefer English schema-label normalization. Further tuning was not pursued — the failure mode (synthetic-data distribution bias) is a known LoRA pitfall and the base already passes 15/15.
Hindi normalization: 133 / 133 pass
scripts/test_asr.py covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.
JS pipeline port: 72 / 72 pass
frontend/src/lib/__tests__/*.test.js under node --test. Covers parseJsonLoose repair cases, extractForm validation, extractDangerSigns JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, runPipeline end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic, and the demographics-header merge (applyMetadata) across ANC / PNC / child-health / delivery schemas. No known failures.
ANC form: patient.age slot misclassification on on-device E2B path
Harness: Field Mode on-device text → form, observed during slot 3 video recording on 2026-05-17.
Observed output: With the ANC preeclampsia transcript (loaded via "Load ANC example") fed through Gemma 4 E2B INT4 on Cactus SDK, patient.age is populated with 8. The source is the speaker's response to the ASHA's gestational-age question, लगभग 8 महीने ("about 8 months [pregnant]"), which the on-device model grounds to the wrong field. The transcript carries no explicit patient age in years.
Root cause
Same family as the pregnancy.previous_complications walkthrough below: the model is filling a slot from a number present in the input without grounding it in the slot's semantics. On the E2B INT4 path the surface is wider because the null-filled instance template prompt does not carry per-field descriptions about year-vs-month-vs-week semantics; the E4B Ollama path consumes the JSON Schema which (for the fields that have descriptions) gives the model more discrimination signal.
Disposition
Not a safety-critical issue — no clinical decision in the pipeline depends on patient.age. The architectural mitigation is already in place: the ASHA-entered metadata header (typed at intake, before any conversation is recorded or processed) supplies patient demographics directly via apply_metadata, which merges them into the form envelope and supersedes any conversational extraction. The misclassification only surfaces when demographics are absent from the input, which is the demo / on-device-test scenario, not the deployed ASHA workflow. A schema-side fix would add explicit field descriptions to the on-device template ("age": "patient's age in YEARS, not gestational months"); not landed in this submission.
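The override behavior can be illustrated with a minimal sketch. This is not the project's apply_metadata; the function name apply_metadata_sketch, the "patient" sub-object layout, and the rule that None values in the header are ignored are all assumptions made for illustration.

```python
import copy


def apply_metadata_sketch(form, metadata):
    """Overlay ASHA-entered demographics onto an extracted form.

    Header values win over conversational extraction; header fields
    that are None leave the extracted value untouched (assumed rule).
    """
    merged = copy.deepcopy(form)  # do not mutate the caller's form
    patient = merged.setdefault("patient", {})
    for key, value in metadata.items():
        if value is not None:
            patient[key] = value
    return merged


extracted = {"patient": {"name": None, "age": 8}}  # age mis-grounded from "8 महीने"
with_header = apply_metadata_sketch(extracted, {"name": "Example Name", "age": 24})
without_header = apply_metadata_sketch(extracted, {})
```

Here with_header carries the typed age of 24, while without_header keeps the mis-extracted 8, which matches the observation that the misclassification only surfaces when demographics are absent from the input.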
ANC form: pregnancy.previous_complications slot misclassification on preeclampsia transcripts
Harness: live ANC preeclampsia inputs — synthetic text example EXAMPLE_TRANSCRIPTS[1] in app.py, real-voice clip demo_audio/anc_preeclampsia_full.ogg.
Observed output: the form's pregnancy.previous_complications field is populated with the current-visit symptoms — "सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन" — when the conversation describes preeclampsia presenting today, not in a prior pregnancy. The same symptoms also appear correctly in symptoms_reported, and the danger panel correctly flags severe_hypertension / severe_headache_and_visual_changes / edema with refer_immediately and verbatim Hindi evidence. No clinical signal is lost; the misclassification is a duplicate-in-wrong-slot.
Root cause
configs/schemas/anc_visit.json:29 defines previous_complications with bare {"type": ["string", "null"]} and no description attribute — unlike adjacent fields (lmp_date, gravida, para) which carry explicit descriptions. The model is inferring semantics from the field name alone, and in a conversation densely populated with current findings it slots them into this field. The same input through the JS pipeline on Cactus (E2B INT4) does not exhibit the bug — the on-device path uses a null-filled instance template prompt rather than a raw JSON Schema, which sidesteps the under-described-field ambiguity.
Disposition
The one-line schema fix (add "description": "Complications in PRIOR pregnancies — not current-visit findings") touches the full form schema across all four visit types and would require re-running the 15-case eval to validate no regression. That re-run did not land before this submission. The safety-critical output (danger panel + referral decision) is unaffected; the misclassification is in a non-safety field.
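Under-described fields of this kind can be found mechanically. The sketch below walks a JSON Schema object and lists leaf properties that lack a description; the helper name undescribed_fields is hypothetical, and the inline schema is a toy fragment written for illustration (its description text is invented), not a copy of configs/schemas/anc_visit.json.

```python
def undescribed_fields(schema, path=""):
    """Return dotted paths of leaf properties with no "description"."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        field = f"{path}.{name}" if path else name
        if "properties" in spec:
            # Nested object: recurse into its own properties.
            missing.extend(undescribed_fields(spec, field))
        elif "description" not in spec:
            missing.append(field)
    return missing


# Toy fragment for illustration only.
toy_schema = {
    "type": "object",
    "properties": {
        "pregnancy": {
            "type": "object",
            "properties": {
                "gravida": {
                    "type": ["integer", "null"],
                    "description": "Illustrative description text",
                },
                "previous_complications": {"type": ["string", "null"]},
            },
        }
    },
}

print(undescribed_fields(toy_schema))
```

Run against each visit-type schema, a check like this could flag other bare fields before they show up as slot misclassifications.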
Eval-rubric scope: per-case hallucination traps under-specify ANC
Harness: scripts/test_ollama_quality.py
The 15/15 pass rate is computed against per-case hallucination_traps lists — each test enumerates the specific fields that MUST be null for that input, and the suite only asserts those (scripts/test_ollama_quality.py:470-473). For the ANC preeclampsia case at line 93, the trap list is ["patient.name", "lab_results.blood_group"] — pregnancy.previous_complications is not checked, which is why the misclassification above passed every run.
Disposition
The "15/15 tests pass" figure is measured against this per-case rubric (the assertion lives at scripts/test_ollama_quality.py:470-473), not against a whole-schema null-everywhere check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above. The wider rubric is not landed here.
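The difference between the two rubrics can be sketched as follows. This is an illustration, not the code in scripts/test_ollama_quality.py: the helper names and the expected_non_null ground-truth set are hypothetical, and the form is a reduced toy example.

```python
def flatten(form, prefix=""):
    """Flatten nested dicts into {"a.b": value} form."""
    flat = {}
    for key, value in form.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def trap_violations(form, traps):
    """Per-case rubric: only the listed trap fields must be null."""
    flat = flatten(form)
    return [path for path in traps if flat.get(path) is not None]


def whole_schema_violations(form, expected_non_null):
    """Wider rubric: every field outside the ground-truth set must be null."""
    flat = flatten(form)
    return sorted(
        path
        for path, value in flat.items()
        if value is not None and path not in expected_non_null
    )


form = {
    "patient": {"name": None},
    "lab_results": {"blood_group": None},
    "pregnancy": {"previous_complications": "सिरदर्द"},
    "symptoms_reported": "सिरदर्द",
}
traps = ["patient.name", "lab_results.blood_group"]
```

With this toy form, trap_violations(form, traps) is empty, while whole_schema_violations(form, {"symptoms_reported"}) reports pregnancy.previous_complications. The cost of the wider rubric is that each case needs a ground-truth list of fields the transcript actually supports.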
ANC long-clip BP drop on conversational pacing
Harness: demo_audio/anc_preeclampsia_full.ogg (52 s self-recorded clip) on the live HF Space.
Whisper-Large CT2 returns the BP segment as "हाई हो रखा है" — the "BP बहुत ज़्यादा है" framing remains but the actual numeric value 155/100 is dropped. The 20-second short clip (demo_audio/anc_preeclampsia_short.ogg), where the same speaker pauses deliberately around बटा, transcribes 155/100 reliably.
Root cause
Conversational pacing on the long clip. A spoken BP reading (for example, BP एक सौ साठ बटा एक सौ दस) is recoverable from Whisper-Large with a ~0.5 s gap around बटा, and lossy without. Same speaker, same model, same hardware: the variable is delivery prosody, not Whisper.
Disposition
The mitigation in this submission: the 20 s clip is the manifest default, so the most-played sample exercises the full BP path end-to-end. The 52 s clip remains in the dropdown as the longer-conversation case; on that clip the danger panel still extracts severe-hypertension from the verbatim "बहुत ज़्यादा है" framing even when the number is dropped. A custom Hindi-medical Whisper fine-tune would address the root cause; not in this submission.