Spaces:

Tushar9802
/

sakhi

Sleeping

App Files Files Community

sakhi / FAILURES.md

Tushar9802

docs: tone scrub + YouTube demo link + Ollama-pull reproducer

20f5235 3 days ago

preview code

raw

history blame contribute delete

12.4 kB

	# Known Failures

	Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis.

	---

	## E2E audio pipeline: 2 / 15 failing (13 / 15 pass)

	Harness: `scripts/test_pipeline_e2e.py`
	Pipeline stages exercised: Google TTS (gTTS, Hindi) → Whisper-Large-V2 Hindi ASR (CTranslate2) → `src/hindi_normalize.py` → Gemma 4 E4B via Ollama (function calling).
	Test data: 15 synthetic Hindi ASHA conversations, manifest at `test_audio/synthetic/manifest.json`, with ground-truth vitals and danger-sign expectations per case.

	### Failure pattern: BP value drift through TTS → ASR

	gTTS (Google Text-to-Speech, the synthesizer used for test audio generation — see `scripts/generate_test_audio.py`) is a concatenative TTS engine. It is fast and free, but does not produce the prosody of natural Hindi speech — it tends to produce staccato numeric readings with limited inter-word coarticulation. When a number sequence like `"एक सौ साठ बटा एक सौ दस"` (160/105 in the BP format ASHA workers read aloud) runs through gTTS, the pronunciation of `"बटा"` (the Hindi separator equivalent to the English "over" in "160 over 105") can be produced with a sibilance or softening that Whisper-Large-V2 Hindi mishears.

	Observed failure pattern (from development iteration logs, before the current passing-13/15 baseline was pinned):

	- gTTS audio renders `"एक सौ साठ बटा एक सौ दस"` with reduced amplitude on `बटा`.
	- Whisper transcribes as `"एक सौ साठ बाटा एक सौ दस"` or drops `बटा` entirely → `"एक सौ साठ एक सौ दस"` reading as a single compound 160105.
	- Normalization layer (`hindi_normalize.parse_number`) handles the first variant through a known misspelling table for `बटा` → division-separator synonyms. The second variant (where the separator word is dropped) is handled by a heuristic that looks for the "100-range + 100-range" pattern and splits — but the heuristic does not fire on every pattern (e.g., compound dosage phrases can legitimately be concatenated numbers, and over-eager splitting would introduce false positives on non-BP numeric data).
	- Downstream: Gemma 4 sees either a mangled BP or the systolic-only component; the form-extraction check `bp_systolic == 160 AND bp_diastolic == 105` fails on one component.

	### Why this is a synthetic-audio artifact, not a pipeline defect

	- The test-time TTS pipeline (gTTS → mp3) introduces distortion that real speech from a human ASHA saying the same numbers does not introduce. Human speakers pronounce `बटा` with consistent prosodic stress because it is the pivot of the BP reading; gTTS flattens that stress.
	- When a developer pronounces the same Hindi sentence on a real phone mic and feeds it through the same Whisper + normalization pipeline, the BP values extract correctly — verified during pipeline development (not captured in the automated suite since the test harness is gTTS-driven for reproducibility).
	- The production deployment path does not include gTTS. Real-world audio comes from an actual phone mic captured in a visit context.

	### Reproducing these specific failures

	`python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous.

	### Planned mitigation

	- Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, the `test_pipeline_e2e.py` pass rate should rise, not fall — real speech is cleaner than gTTS for Whisper.
	- Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will allow re-tuning the recall/precision tradeoff.

	---

	## Fine-tune vs base: fine-tune loses 1 / 15 (14 / 15 pass) on single-test harness

	Harness: `scripts/test_ollama_quality.py`
	Case: `anc_hinglish_codeswitching` — heavy Hindi-English code-mixing (e.g., "patient बहुत weak है, hemoglobin low है"), the fine-tune over-refers (marks as `refer_within_24h` instead of `continue_monitoring`).

	### Root cause

	The LoRA fine-tune (1,154 synthetic examples, 981 train / 173 val) was trained on a distribution where Hinglish code-switching appeared predominantly in danger-case examples. The model learned the co-occurrence and over-weights "English word in Hindi sentence" as a mild danger signal. On the single Hinglish case that is actually routine, the fine-tune raises the referral urgency one level — a safer failure mode than under-referring, but a failure nonetheless.

	### Disposition

	Documented in `RETRAIN_RESULTS.md`. The base model is in the live Ollama path. The fine-tune remains available on the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer English schema-label normalization. Further tuning was not pursued — the failure mode (synthetic-data distribution bias) is a known LoRA pitfall and the base already passes 15/15.

	---

	## Hindi normalization: 133 / 133 pass

	`scripts/test_asr.py` covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.

	## JS pipeline port: 72 / 72 pass

	`frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic, and the demographics-header merge (`applyMetadata`) across ANC / PNC / child-health / delivery schemas. No known failures.

	---

	## ANC form: `patient.age` slot misclassification on on-device E2B path

	Harness: Field Mode on-device text → form, observed during slot 3 video recording on 2026-05-17.

	Observed output: With the `Load ANC example` ANC preeclampsia transcript fed through Gemma 4 E2B INT4 on Cactus SDK, `patient.age` is populated with `8`. The source is the speaker's response to the ASHA's gestational-age question — `लगभग 8 महीने` ("about 8 months [pregnant]") — which the on-device model is grounding to the wrong field. The transcript carries no explicit patient age in years.

	### Root cause

	Same family as the `pregnancy.previous_complications` walkthrough below: the model is filling a slot from a number present in the input without grounding it in the slot's semantics. On the E2B INT4 path the surface is wider because the null-filled instance template prompt does not carry per-field descriptions about year-vs-month-vs-week semantics; the E4B Ollama path consumes the JSON Schema which (for the fields that have descriptions) gives the model more discrimination signal.

	### Disposition

	Not a safety-critical issue — no clinical decision in the pipeline depends on `patient.age`. The architectural mitigation is already in place: the ASHA-entered metadata header (typed at intake, before any conversation is recorded or processed) supplies patient demographics directly via `apply_metadata`, which merges them into the form envelope and supersedes any conversational extraction. The misclassification only surfaces when demographics are absent from the input, which is the demo / on-device-test scenario, not the deployed ASHA workflow. A schema-side fix would add explicit field descriptions to the on-device template (`"age": "patient's age in YEARS, not gestational months"`); not landed in this submission.

	---

	## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts

	Harness: live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.

	Observed output: the form's `pregnancy.previous_complications` field is populated with the current-visit symptoms — "सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन" — when the conversation describes preeclampsia presenting today, not in a prior pregnancy. The same symptoms also appear correctly in `symptoms_reported`, and the danger panel correctly flags `severe_hypertension` / `severe_headache_and_visual_changes` / `edema` with `refer_immediately` and verbatim Hindi evidence. No clinical signal is lost; the misclassification is a duplicate-in-wrong-slot.

	### Root cause

	`configs/schemas/anc_visit.json:29` defines `previous_complications` with bare `{"type": ["string", "null"]}` and no `description` attribute — unlike adjacent fields (`lmp_date`, `gravida`, `para`) which carry explicit descriptions. The model is inferring semantics from the field name alone, and in a conversation densely populated with current findings it slots them into this field. The same input through the JS pipeline on Cactus (E2B INT4) does not exhibit the bug — the on-device path uses a null-filled instance template prompt rather than a raw JSON Schema, which sidesteps the under-described-field ambiguity.

	### Disposition

	The one-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) touches the full form schema across all four visit types and would require re-running the 15-case eval to validate no regression. That re-run did not land before this submission. The safety-critical output (danger panel + referral decision) is unaffected; the misclassification is in a non-safety field.

	---

	## Eval-rubric scope: per-case hallucination traps under-specify ANC

	Harness: `scripts/test_ollama_quality.py`

	The 15/15 pass rate is computed against per-case `hallucination_traps` lists — each test enumerates the specific fields that MUST be null for that input, and the suite only asserts those (`scripts/test_ollama_quality.py:470-473`). For the ANC preeclampsia case at line 93, the trap list is `["patient.name", "lab_results.blood_group"]` — `pregnancy.previous_complications` is not checked, which is why the misclassification above passed every run.

	### Disposition

	`hallucination_traps` is the literal list of fields each test asserts null for; the test source is `scripts/test_ollama_quality.py:470-473`. "15/15 tests pass" is against this per-case rubric, not a whole-schema null-everywhere check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above. The wider rubric is not landed here.

	---

	## ANC long-clip BP drop on conversational pacing

	Harness: `demo_audio/anc_preeclampsia_full.ogg` (52 s self-recorded clip) on the live HF Space.

	Whisper-Large CT2 returns the BP segment as "हाई हो रखा है" — the "BP बहुत ज़्यादा है" framing remains but the actual numeric value `155/100` is dropped. The 20-second short clip (`demo_audio/anc_preeclampsia_short.ogg`), where the same speaker pauses deliberately around `बटा`, transcribes `155/100` reliably.

	### Root cause

	Conversational pacing on the long clip. BP `एक सौ साठ बटा एक सौ दस` is recoverable from Whisper-Large with a ~0.5 s gap around `बटा`, and lossy without. Same speaker, same model, same hardware — the variable is delivery prosody, not Whisper.

	### Disposition

	The mitigation in this submission: the 20 s clip is the manifest default, so the most-played sample exercises the full BP path end-to-end. The 52 s clip remains in the dropdown as the longer-conversation case; on that clip the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped. A custom Hindi-medical Whisper fine-tune would address the root cause; not in this submission.