Known Failures

Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis.

E2E audio pipeline: 2 / 15 failing (13 / 15 pass)

Harness: scripts/test_pipeline_e2e.py

Pipeline stages exercised: Google TTS (gTTS, Hindi) → Whisper-Large-V2 Hindi ASR (CTranslate2) → src/hindi_normalize.py → Gemma 4 E4B via Ollama (function calling).

Test data: 15 synthetic Hindi ASHA conversations, manifest at test_audio/synthetic/manifest.json, with ground-truth vitals and danger-sign expectations per case.

Failure pattern: BP value drift through TTS → ASR

gTTS (Google Text-to-Speech, the synthesizer used for test audio generation; see scripts/generate_test_audio.py) is a Python wrapper around Google Translate's text-to-speech endpoint. It is fast and free, but it does not reproduce the prosody of natural Hindi speech: numeric readings come out staccato, with limited inter-word coarticulation. When a number sequence like "एक सौ साठ बटा एक सौ दस" (160/110 in the BP format ASHA workers read aloud) runs through gTTS, "बटा" (the Hindi separator, equivalent to the English "over" in "160 over 110") can come out softened or sibilant, and Whisper-Large-V2 Hindi mishears it.

Observed failure pattern (from development iteration logs, before the current 13/15-passing baseline was pinned):

1. gTTS audio renders "एक सौ साठ बटा एक सौ दस" with reduced amplitude on बटा.
2. Whisper transcribes it as "एक सौ साठ बाटा एक सौ दस" or drops बटा entirely, yielding "एक सौ साठ एक सौ दस", which reads as a single compound number, 160110.
3. The normalization layer (hindi_normalize.parse_number) handles the first variant through a table of known misspellings that maps बटा variants to the division separator. The second variant, where the separator word is dropped, is handled by a heuristic that looks for the "100-range + 100-range" pattern and splits it. The heuristic is deliberately conservative and does not fire on every pattern: compound dosage phrases can legitimately be concatenated numbers, and over-eager splitting would introduce false positives on non-BP numeric data.
4. Downstream, Gemma 4 sees either a mangled BP or only the systolic component, so the form-extraction check bp_systolic == 160 AND bp_diastolic == 110 fails on one component.
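The two normalization behaviours described above (a separator-variant table, plus a conservative split when the separator is missing) can be sketched roughly as follows. This is an illustration only, not the code in src/hindi_normalize.py: it assumes number words have already been converted to digits, and the function name `normalize_bp`, the variant table, the BP cue words, and the plausibility ranges are all hypothetical.

```python
import re

# Hypothetical table: spellings the ASR might emit for the BP separator.
SEPARATOR_VARIANTS = ("बटा", "बाटा", "by", "over")
# Hypothetical cue words; the split heuristic only runs when one is present,
# so concatenated non-BP numbers (e.g. dosages) are left alone.
BP_CUES = ("bp", "बीपी")


def _plausible(systolic, diastolic):
    """Rough physiological sanity check (assumed ranges)."""
    return 70 <= systolic <= 260 and 40 <= diastolic <= 160 and systolic > diastolic


def normalize_bp(text):
    """Return (systolic, diastolic) if a BP reading can be recovered, else None."""
    lowered = text.lower()
    # Step 1: map every known separator spelling to "/".
    for variant in SEPARATOR_VARIANTS:
        lowered = re.sub(rf"\s+{re.escape(variant)}\s+", "/", lowered)
    match = re.search(r"(\d{2,3})\s*/\s*(\d{2,3})", lowered)
    if match:
        pair = (int(match.group(1)), int(match.group(2)))
        return pair if _plausible(*pair) else None

    # Step 2: separator dropped. Only attempt a split next to a BP cue.
    if not any(cue in lowered for cue in BP_CUES):
        return None
    match = re.search(r"\b(\d{4,6})\b", lowered)
    if not match:
        return None
    digits = match.group(1)
    candidates = []
    for cut in (2, 3):
        left, right = digits[:cut], digits[cut:]
        if not 2 <= len(right) <= 3 or right.startswith("0"):
            continue
        pair = (int(left), int(right))
        if _plausible(*pair):
            candidates.append(pair)
    # Stay conservative: accept only an unambiguous split.
    return candidates[0] if len(candidates) == 1 else None
```

Requiring a BP cue and exactly one plausible split trades recall for precision, which matches the conservative behaviour described above: a bare `"tablet 500250"` is never split, at the cost of missing BP readings that arrive without a cue word.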

Why this is a synthetic-audio artifact, not a pipeline defect

The test-time TTS pipeline (gTTS → mp3) introduces distortion that real speech from a human ASHA saying the same numbers does not introduce. Human speakers pronounce बटा with consistent prosodic stress because it is the pivot of the BP reading; gTTS flattens that stress.
When a developer pronounces the same Hindi sentence on a real phone mic and feeds it through the same Whisper + normalization pipeline, the BP values extract correctly — verified during pipeline development (not captured in the automated suite since the test harness is gTTS-driven for reproducibility).
The production deployment path does not include gTTS. Real-world audio comes from an actual phone mic captured in a visit context.

Reproducing these specific failures

Running `python scripts/test_pipeline_e2e.py` re-generates the audio (if missing), runs the pipeline, and prints per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases: the preeclampsia and severe-anemia cases, where Hb or BP is borderline-but-dangerous.

Planned mitigation

Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (ROLE_PLAY_SCRIPTS.md) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, the test_pipeline_e2e.py pass rate is expected to rise rather than fall, because in manual checks during development real speech transcribed more cleanly than gTTS output.
Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (बटा, by, /). Currently conservative to avoid false positives; real-audio data will allow re-tuning the recall/precision tradeoff.

Fine-tune vs base: fine-tune loses 1 / 15 (14 / 15 pass) on single-test harness

Harness: scripts/test_ollama_quality.py

Case: anc_hinglish_codeswitching. On heavy Hindi-English code-mixing (e.g., "patient बहुत weak है, hemoglobin low है", i.e. "the patient is very weak, hemoglobin is low"), the fine-tune over-refers: it marks the case as refer_within_24h instead of continue_monitoring.

Root cause

The LoRA fine-tune (1,154 synthetic examples, 981 train / 173 val) was trained on a distribution where Hinglish code-switching appeared predominantly in danger-case examples. The model learned the co-occurrence and over-weights "English word in Hindi sentence" as a mild danger signal. On the single Hinglish case that is actually routine, the fine-tune raises the referral urgency one level — a safer failure mode than under-referring, but a failure nonetheless.
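A skew of this kind can be surfaced before training by comparing danger-label rates between code-switched and monolingual examples. The sketch below is a hypothetical audit, not part of the Sakhi training scripts: it assumes each example is a `(text, is_danger)` pair and treats any text mixing Devanagari with a Latin-script word as code-switched.

```python
import re
from collections import Counter

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


def is_code_switched(text):
    """True when the text mixes Devanagari with at least one Latin-script word."""
    return bool(DEVANAGARI.search(text)) and bool(LATIN_WORD.search(text))


def danger_rate_by_style(examples):
    """Map each style ("code_switched" / "monolingual") to its danger-label fraction."""
    totals, dangers = Counter(), Counter()
    for text, is_danger in examples:
        style = "code_switched" if is_code_switched(text) else "monolingual"
        totals[style] += 1
        dangers[style] += int(is_danger)
    return {style: dangers[style] / totals[style] for style in totals}
```

A large gap between the two rates would suggest the model can use script mixing as a shortcut for the danger label, which is the co-occurrence described above.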

Disposition

Documented in RETRAIN_RESULTS.md. The base model is in the live Ollama path. The fine-tune remains available on the Ollama registry as tusharbrisingr9802/sakhi for deployments that prefer English schema-label normalization. Further tuning was not pursued — the failure mode (synthetic-data distribution bias) is a known LoRA pitfall and the base already passes 15/15.

Hindi normalization: 133 / 133 pass

scripts/test_asr.py covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.

JS pipeline port: 72 / 72 pass

frontend/src/lib/__tests__/*.test.js under node --test. Covers parseJsonLoose repair cases, extractForm validation, extractDangerSigns JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, runPipeline end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic, and the demographics-header merge (applyMetadata) across ANC / PNC / child-health / delivery schemas. No known failures.

ANC form: `patient.age` slot misclassification on on-device E2B path

Harness: Field Mode on-device text → form, observed during slot 3 video recording on 2026-05-17.

Observed output: With the ANC preeclampsia transcript (the "Load ANC example" input) fed through Gemma 4 E2B INT4 on Cactus SDK, patient.age is populated with 8. The source is the speaker's response to the ASHA's gestational-age question, लगभग 8 महीने ("about 8 months [pregnant]"), which the on-device model grounds to the wrong field. The transcript carries no explicit patient age in years.

Root cause

Same family as the pregnancy.previous_complications case below: the model fills a slot from a number present in the input without grounding it in the slot's semantics. On the E2B INT4 path the failure surface is wider because the null-filled instance template prompt carries no per-field descriptions of year-vs-month-vs-week semantics; the E4B Ollama path consumes the JSON Schema, which (for the fields that have descriptions) gives the model more signal to discriminate between them.

Disposition

Not a safety-critical issue — no clinical decision in the pipeline depends on patient.age. The architectural mitigation is already in place: the ASHA-entered metadata header (typed at intake, before any conversation is recorded or processed) supplies patient demographics directly via apply_metadata, which merges them into the form envelope and supersedes any conversational extraction. The misclassification only surfaces when demographics are absent from the input, which is the demo / on-device-test scenario, not the deployed ASHA workflow. A schema-side fix would add explicit field descriptions to the on-device template ("age": "patient's age in YEARS, not gestational months"); not landed in this submission.
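The override behaviour described above can be illustrated with a small merge function. This is a sketch under assumptions, not the repository's apply_metadata: the name `merge_intake_header` is hypothetical, and it assumes header fields live under a `patient` key and that a header value of None means "not entered".

```python
import copy


def merge_intake_header(form, header):
    """Return a copy of `form` in which ASHA-entered header values win.

    Any non-null header value replaces whatever the model extracted from
    the conversation; null header values leave the extraction untouched.
    """
    merged = copy.deepcopy(form)
    patient = merged.setdefault("patient", {})
    for field, value in header.items():
        if value is not None:
            patient[field] = value
    return merged


# The model grounded "8 months" into patient.age; the typed header corrects it.
extracted = {"patient": {"age": 8, "name": None}}
header = {"age": 24, "name": None}
print(merge_intake_header(extracted, header))
# prints {'patient': {'age': 24, 'name': None}}
```

With an empty header the extracted value passes through unchanged, which is why the misclassification is only visible when demographics are absent from the input.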

ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts

Harness: live ANC preeclampsia inputs — synthetic text example EXAMPLE_TRANSCRIPTS[1] in app.py, real-voice clip demo_audio/anc_preeclampsia_full.ogg.

Observed output: the form's pregnancy.previous_complications field is populated with the current-visit symptoms, "सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन" ("headache, blurred vision, swelling of the face, swelling of the feet"), even though the conversation describes preeclampsia presenting today, not in a prior pregnancy. The same symptoms also appear correctly in symptoms_reported, and the danger panel correctly flags severe_hypertension / severe_headache_and_visual_changes / edema with refer_immediately and verbatim Hindi evidence. No clinical signal is lost; the misclassification is a duplicate-in-wrong-slot.

Root cause

configs/schemas/anc_visit.json:29 defines previous_complications with bare {"type": ["string", "null"]} and no description attribute — unlike adjacent fields (lmp_date, gravida, para) which carry explicit descriptions. The model is inferring semantics from the field name alone, and in a conversation densely populated with current findings it slots them into this field. The same input through the JS pipeline on Cactus (E2B INT4) does not exhibit the bug — the on-device path uses a null-filled instance template prompt rather than a raw JSON Schema, which sidesteps the under-described-field ambiguity.
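Fields in this state can be found mechanically. The following is a hypothetical audit, not an existing script in the repository: it walks a JSON Schema's `properties` tree and lists leaf fields that have no `description`. The schema fragment in the example is made up for illustration and is not copied from configs/schemas/anc_visit.json.

```python
def undescribed_fields(schema, prefix=""):
    """Return dotted paths of leaf properties that lack a "description"."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if "properties" in spec:
            # Nested object: recurse into its own fields.
            missing.extend(undescribed_fields(spec, path))
        elif "description" not in spec:
            missing.append(path)
    return missing


# Illustrative fragment only (description text is invented).
fragment = {
    "properties": {
        "pregnancy": {
            "type": "object",
            "properties": {
                "lmp_date": {"type": ["string", "null"],
                             "description": "Last menstrual period date"},
                "previous_complications": {"type": ["string", "null"]},
            },
        }
    }
}
print(undescribed_fields(fragment))
# prints ['pregnancy.previous_complications']
```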

Disposition

The one-line schema fix (add "description": "Complications in PRIOR pregnancies — not current-visit findings") touches the full form schema across all four visit types and would require re-running the 15-case eval to validate no regression. That re-run did not land before this submission. The safety-critical output (danger panel + referral decision) is unaffected; the misclassification is in a non-safety field.

Eval-rubric scope: per-case hallucination traps under-specify ANC

Harness: scripts/test_ollama_quality.py

The 15/15 pass rate is computed against per-case hallucination_traps lists — each test enumerates the specific fields that MUST be null for that input, and the suite only asserts those (scripts/test_ollama_quality.py:470-473). For the ANC preeclampsia case at line 93, the trap list is ["patient.name", "lab_results.blood_group"] — pregnancy.previous_complications is not checked, which is why the misclassification above passed every run.

Disposition

"15/15 tests pass" is therefore measured against the per-case hallucination_traps rubric (scripts/test_ollama_quality.py:470-473), not a whole-schema null-everywhere check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above. The wider rubric is not landed here.
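The difference between the two rubrics can be made concrete with a short sketch. The helper names below are hypothetical and the form is a toy stand-in, not output from scripts/test_ollama_quality.py; the wider check assumes each case declares the set of fields the transcript actually supports.

```python
def get_path(form, dotted):
    """Look up a dotted path such as "patient.name"; missing paths yield None."""
    node = form
    for part in dotted.split("."):
        if not isinstance(node, dict):
            return None
        node = node.get(part)
    return node


def trap_violations(form, traps):
    """Per-case rubric: only the listed trap fields must be null."""
    return [path for path in traps if get_path(form, path) is not None]


def leaf_items(node, prefix=""):
    """Yield (dotted_path, value) for every non-dict leaf in the form."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_items(value, path)
        else:
            yield path, value


def whole_form_violations(form, expected_non_null):
    """Wider rubric: every field outside `expected_non_null` must be null."""
    return sorted(path for path, value in leaf_items(form)
                  if value is not None and path not in expected_non_null)


form = {
    "patient": {"name": None},
    "lab_results": {"blood_group": None},
    "pregnancy": {"previous_complications": "सिरदर्द"},
    "symptoms_reported": ["सिरदर्द"],
}
print(trap_violations(form, ["patient.name", "lab_results.blood_group"]))
# prints []
print(whole_form_violations(form, {"symptoms_reported"}))
# prints ['pregnancy.previous_complications']
```

The per-case check passes on this form while the whole-form check flags the misplaced field, which is the gap described above.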

ANC long-clip BP drop on conversational pacing

Harness: demo_audio/anc_preeclampsia_full.ogg (52 s self-recorded clip) on the live HF Space.

Whisper-Large CT2 returns the BP segment as "हाई हो रखा है" ("it has been running high"): the "BP बहुत ज़्यादा है" ("BP is very high") framing remains, but the actual numeric value 155/100 is dropped. The 20-second short clip (demo_audio/anc_preeclampsia_short.ogg), where the same speaker pauses deliberately around बटा, transcribes 155/100 reliably.

Root cause

Conversational pacing on the long clip. A spoken BP reading of the form "एक सौ साठ बटा एक सौ दस" is recoverable from Whisper-Large with a ~0.5 s gap around बटा, and lossy without one. Same speaker, same model, same hardware: the variable is delivery prosody, not Whisper.

Disposition

The mitigation in this submission: the 20 s clip is the manifest default, so the most-played sample exercises the full BP path end-to-end. The 52 s clip remains in the dropdown as the longer-conversation case; on that clip the danger panel still extracts severe-hypertension from the verbatim "बहुत ज़्यादा है" framing even when the number is dropped. A custom Hindi-medical Whisper fine-tune would address the root cause; not in this submission.
