| # Known Failures | |
| Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis. | |
| --- | |
| ## E2E audio pipeline: 2 / 15 failing (13 / 15 pass) | |
| **Harness:** `scripts/test_pipeline_e2e.py` | |
| **Pipeline stages exercised:** Google TTS (gTTS, Hindi) → Whisper-Large-V2 Hindi ASR (CTranslate2) → `src/hindi_normalize.py` → Gemma 4 E4B via Ollama (function calling). | |
| **Test data:** 15 synthetic Hindi ASHA conversations, manifest at `test_audio/synthetic/manifest.json`, with ground-truth vitals and danger-sign expectations per case. | |
| ### Failure pattern: BP value drift through TTS → ASR | |
| gTTS (the synthesizer used for test audio generation; see `scripts/generate_test_audio.py`) is a Python wrapper around Google Translate's text-to-speech endpoint. It is fast and free, but it does not reproduce the prosody of natural Hindi speech: numeric readings come out staccato, with limited inter-word coarticulation. When a number sequence like `"एक सौ साठ बटा एक सौ दस"` (160/110 in the BP format ASHA workers read aloud) runs through gTTS, `"बटा"` (the Hindi separator, equivalent to "over" in "160 over 110") can be rendered with a sibilance or softening that Whisper-Large-V2 Hindi mishears. | |
| **Observed failure pattern** (from development iteration logs, before the current passing-13/15 baseline was pinned): | |
| - gTTS audio renders `"एक सौ साठ बटा एक सौ दस"` with reduced amplitude on `बटा`. | |
| - Whisper transcribes it as `"एक सौ साठ बाटा एक सौ दस"`, or drops `बटा` entirely and produces `"एक सौ साठ एक सौ दस"`, which reads as a single compound number, 160110. | |
| - The normalization layer (`hindi_normalize.parse_number`) handles the first variant through a known-misspelling table that maps `बटा` variants to the division separator. The second variant (separator word dropped) is handled by a heuristic that looks for a "100-range + 100-range" pattern and splits it. The heuristic is deliberately conservative and does not fire on every pattern: compound dosage phrases can legitimately be concatenated numbers, so over-eager splitting would introduce false positives on non-BP numeric data. | |
| - Downstream: Gemma 4 sees either a mangled BP or only the systolic component, and the form-extraction check `bp_systolic == 160 AND bp_diastolic == 110` fails on one component. | |
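The two normalization paths described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/hindi_normalize.py`: the function name `extract_bp`, the separator table, and the plausibility windows are all assumptions, and the sketch uses value windows in place of the "100-range + 100-range" rule. It expects number words to have been converted to digit strings already.

```python
# Hypothetical separator spellings; the real table in src/hindi_normalize.py is not shown here.
SEPARATOR_VARIANTS = {"बटा", "बाटा", "by", "/"}

# Assumed plausibility windows (mmHg) for the dropped-separator fallback.
SYSTOLIC_RANGE = range(70, 261)
DIASTOLIC_RANGE = range(40, 161)


def extract_bp(tokens):
    """Return (systolic, diastolic) from digit-normalized tokens, or None."""
    # Path 1: number, separator variant, number.
    for i, tok in enumerate(tokens):
        if tok in SEPARATOR_VARIANTS and 0 < i < len(tokens) - 1:
            left, right = tokens[i - 1], tokens[i + 1]
            if left.isdecimal() and right.isdecimal():
                return int(left), int(right)
    # Path 2: separator dropped. Split two adjacent numbers only when both
    # fall inside the BP windows, so dosage-like pairs are left alone.
    for left, right in zip(tokens, tokens[1:]):
        if left.isdecimal() and right.isdecimal():
            sys_val, dia_val = int(left), int(right)
            if (sys_val in SYSTOLIC_RANGE and dia_val in DIASTOLIC_RANGE
                    and sys_val > dia_val):
                return sys_val, dia_val
    return None
```

With these assumed windows, `extract_bp(["160", "बाटा", "110"])` and `extract_bp(["160", "110"])` both yield `(160, 110)`, while a pair such as `["500", "250"]` is not split. Widening the windows raises recall on dropped-separator transcripts at the cost of more false positives on non-BP numbers.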
| ### Why this is a synthetic-audio artifact, not a pipeline defect | |
| - The test-time TTS pipeline (gTTS → mp3) introduces distortion that real speech from a human ASHA saying the same numbers does not introduce. Human speakers pronounce `बटा` with consistent prosodic stress because it is the pivot of the BP reading; gTTS flattens that stress. | |
| - When a developer pronounces the same Hindi sentence on a real phone mic and feeds it through the same Whisper + normalization pipeline, the BP values extract correctly — verified during pipeline development (not captured in the automated suite since the test harness is gTTS-driven for reproducibility). | |
| - The production deployment path does not include gTTS. Real-world audio comes from an actual phone mic captured in a visit context. | |
| ### Reproducing these specific failures | |
| `python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous. | |
| ### Planned mitigation | |
| - Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, the `test_pipeline_e2e.py` pass rate should rise, not fall — real speech is cleaner than gTTS for Whisper. | |
| - Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will allow re-tuning the recall/precision tradeoff. | |
| --- | |
| ## Fine-tune vs base: fine-tune loses 1 / 15 (14 / 15 pass) on single-test harness | |
| **Harness:** `scripts/test_ollama_quality.py` | |
| **Case:** `anc_hinglish_codeswitching`. On heavy Hindi-English code-mixing (e.g., "patient बहुत weak है, hemoglobin low है"), the fine-tune *over-refers*: it marks the case as `refer_within_24h` instead of `continue_monitoring`. | |
| ### Root cause | |
| The LoRA fine-tune (1,154 synthetic examples, 981 train / 173 val) was trained on a distribution where Hinglish code-switching appeared predominantly in danger-case examples. The model learned the co-occurrence and over-weights "English word in Hindi sentence" as a mild danger signal. On the single Hinglish case that is actually routine, the fine-tune raises the referral urgency one level — a safer failure mode than under-referring, but a failure nonetheless. | |
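One way to check for this kind of distribution bias before training is to compare label shares between code-switched and monolingual examples. The sketch below is illustrative only: the `text` and `label` keys are assumed field names, not the actual training-data format, and the detector simply looks for a Latin-script word inside a sentence that also contains Devanagari. Abbreviations such as "BP" count as Latin words under this rule.

```python
import re
from collections import Counter

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


def is_code_switched(text):
    """True when the text mixes Devanagari with at least one Latin-script word."""
    return bool(DEVANAGARI.search(text)) and bool(LATIN_WORD.search(text))


def label_share_by_style(examples):
    """Return {style: {label: fraction}} for code-switched vs monolingual examples."""
    counts = {"code_switched": Counter(), "monolingual": Counter()}
    for ex in examples:
        style = "code_switched" if is_code_switched(ex["text"]) else "monolingual"
        counts[style][ex["label"]] += 1
    return {
        style: {label: n / sum(c.values()) for label, n in c.items()}
        for style, c in counts.items()
        if c  # skip styles with no examples
    }
```

If the referral-label share is much higher in the `code_switched` bucket than in the `monolingual` one, the model has a shortcut available that has nothing to do with clinical content.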
| ### Disposition | |
| Documented in `RETRAIN_RESULTS.md`. The base model is in the live Ollama path. The fine-tune remains available on the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer English schema-label normalization. Further tuning was not pursued — the failure mode (synthetic-data distribution bias) is a known LoRA pitfall and the base already passes 15/15. | |
| --- | |
| ## Hindi normalization: 133 / 133 pass | |
| `scripts/test_asr.py` covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures. | |
| ## JS pipeline port: 72 / 72 pass | |
| `frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic, and the demographics-header merge (`applyMetadata`) across ANC / PNC / child-health / delivery schemas. No known failures. | |
| --- | |
| ## ANC form: `patient.age` slot misclassification on the on-device E2B path | |
| **Harness:** Field Mode on-device text → form, observed during slot 3 video recording on 2026-05-17. | |
| **Observed output:** With the `Load ANC example` ANC preeclampsia transcript fed through Gemma 4 E2B INT4 on Cactus SDK, `patient.age` is populated with `8`. The source is the speaker's response to the ASHA's gestational-age question — `लगभग 8 महीने` ("about 8 months [pregnant]") — which the on-device model is grounding to the wrong field. The transcript carries no explicit patient age in years. | |
| ### Root cause | |
| Same family as the `pregnancy.previous_complications` walkthrough below: the model is filling a slot from a number present in the input without grounding it in the slot's semantics. On the E2B INT4 path the surface is wider because the null-filled instance template prompt does not carry per-field descriptions about year-vs-month-vs-week semantics; the E4B Ollama path consumes the JSON Schema which (for the fields that have descriptions) gives the model more discrimination signal. | |
| ### Disposition | |
| Not a safety-critical issue — no clinical decision in the pipeline depends on `patient.age`. The architectural mitigation is already in place: the ASHA-entered metadata header (typed at intake, before any conversation is recorded or processed) supplies patient demographics directly via `apply_metadata`, which merges them into the form envelope and supersedes any conversational extraction. The misclassification only surfaces when demographics are absent from the input, which is the demo / on-device-test scenario, not the deployed ASHA workflow. A schema-side fix would add explicit field descriptions to the on-device template (`"age": "patient's age in YEARS, not gestational months"`); not landed in this submission. | |
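The override behaviour described above can be sketched as a recursive merge in which non-null metadata values replace whatever the model extracted. The function name `apply_metadata_sketch` and the dictionary shapes are assumptions for illustration; the real `apply_metadata` signature is not shown in this document.

```python
import copy


def apply_metadata_sketch(form, metadata):
    """Return a copy of `form` where non-null metadata values override extracted ones."""
    merged = copy.deepcopy(form)
    for key, value in metadata.items():
        if isinstance(value, dict):
            existing = merged.get(key)
            # Recurse into nested sections such as "patient".
            merged[key] = apply_metadata_sketch(
                existing if isinstance(existing, dict) else {}, value
            )
        elif value is not None:
            # Typed-at-intake metadata supersedes conversational extraction.
            merged[key] = value
    return merged
```

For example, with an extracted form carrying `patient.age = 8` and an intake header carrying `patient.age = 24`, the merged form carries 24, and fields outside the header are left untouched.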
| --- | |
| ## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts | |
| **Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`. | |
| **Observed output:** the form's `pregnancy.previous_complications` field is populated with the current-visit symptoms — "सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन" — when the conversation describes preeclampsia presenting *today*, not in a prior pregnancy. The same symptoms also appear correctly in `symptoms_reported`, and the danger panel correctly flags `severe_hypertension` / `severe_headache_and_visual_changes` / `edema` with `refer_immediately` and verbatim Hindi evidence. No clinical signal is lost; the misclassification is a duplicate-in-wrong-slot. | |
| ### Root cause | |
| `configs/schemas/anc_visit.json:29` defines `previous_complications` with bare `{"type": ["string", "null"]}` and no `description` attribute — unlike adjacent fields (`lmp_date`, `gravida`, `para`) which carry explicit descriptions. The model is inferring semantics from the field name alone, and in a conversation densely populated with current findings it slots them into this field. The same input through the JS pipeline on Cactus (E2B INT4) does not exhibit the bug — the on-device path uses a null-filled instance template prompt rather than a raw JSON Schema, which sidesteps the under-described-field ambiguity. | |
| ### Disposition | |
| The one-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) touches the full form schema across all four visit types and would require re-running the 15-case eval to validate no regression. That re-run did not land before this submission. The safety-critical output (danger panel + referral decision) is unaffected; the misclassification is in a non-safety field. | |
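Under-described fields like this one can be found mechanically by walking a schema and listing leaf properties that lack a `description`. The sketch below is a hypothetical lint, not part of the repository, and the schema fragment is an illustrative stand-in for `configs/schemas/anc_visit.json` (the `gravida` type and description are assumed).

```python
def fields_missing_description(schema, prefix=""):
    """List dotted paths of leaf properties that have no `description`."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            # Nested object: recurse rather than requiring a description here.
            missing.extend(fields_missing_description(spec, prefix=f"{path}."))
        elif "description" not in spec:
            missing.append(path)
    return missing


# Illustrative fragment only; field contents are assumptions.
schema_fragment = {
    "type": "object",
    "properties": {
        "pregnancy": {
            "type": "object",
            "properties": {
                "gravida": {
                    "type": ["integer", "null"],
                    "description": "Number of pregnancies, including the current one",
                },
                "previous_complications": {"type": ["string", "null"]},
            },
        }
    },
}

print(fields_missing_description(schema_fragment))
# prints ['pregnancy.previous_complications']
```

Running a check like this across all four visit-type schemas would show how many other fields rely on the field name alone.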
| --- | |
| ## Eval-rubric scope: per-case hallucination traps under-specify ANC | |
| **Harness:** `scripts/test_ollama_quality.py` | |
| The 15/15 pass rate is computed against per-case `hallucination_traps` lists — each test enumerates the specific fields that MUST be null for that input, and the suite only asserts those (`scripts/test_ollama_quality.py:470-473`). For the ANC preeclampsia case at line 93, the trap list is `["patient.name", "lab_results.blood_group"]` — `pregnancy.previous_complications` is not checked, which is why the misclassification above passed every run. | |
| ### Disposition | |
| "15/15 tests pass" is measured against this per-case rubric, not a whole-schema null-everywhere check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above; it is not landed here. | |
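A whole-schema variant of the check could look roughly like the sketch below: flatten the extracted form to dotted paths and report every non-null field that the test case does not explicitly expect. The helper names and the `expected_non_null` set are hypothetical; the existing suite in `scripts/test_ollama_quality.py` asserts only the per-case `hallucination_traps` list.

```python
def flatten(form, prefix=""):
    """Flatten a nested form dict into {dotted.path: value}."""
    flat = {}
    for key, value in form.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def unexpected_non_null(form, expected_non_null):
    """Return sorted paths that are populated but not expected for this case."""
    return sorted(
        path
        for path, value in flatten(form).items()
        if value is not None and path not in expected_non_null
    )
```

A test would then assert that `unexpected_non_null(form, expected)` is empty. Note that this sketch treats empty lists and empty strings as populated, so a real rubric would need a policy for those.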
| --- | |
| ## ANC long-clip BP drop on conversational pacing | |
| **Harness:** `demo_audio/anc_preeclampsia_full.ogg` (52 s self-recorded clip) on the live HF Space. | |
| Whisper-Large CT2 returns the BP segment as "हाई हो रखा है" — the "BP बहुत ज़्यादा है" framing remains but the actual numeric value `155/100` is dropped. The 20-second short clip (`demo_audio/anc_preeclampsia_short.ogg`), where the same speaker pauses deliberately around `बटा`, transcribes `155/100` reliably. | |
| ### Root cause | |
| Conversational pacing on the long clip. A spoken BP reading (for example `एक सौ साठ बटा एक सौ दस`) is recoverable from Whisper-Large with a ~0.5 s gap around `बटा`, and lossy without. Same speaker, same model, same hardware: the variable is delivery prosody, not Whisper. | |
| ### Disposition | |
| The mitigation in this submission: the 20 s clip is the manifest default, so the most-played sample exercises the full BP path end-to-end. The 52 s clip remains in the dropdown as the longer-conversation case; on that clip the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped. A custom Hindi-medical Whisper fine-tune would address the root cause; not in this submission. | |