Spaces: Tushar9802/sakhi
Tushar9802 committed 3 days ago

Commit

20f5235

1 Parent(s): d630c01

docs: tone scrub + YouTube demo link + Ollama-pull reproducer

- README + JUDGE_BRIEF + FAILURES: replace solo-dev "we" with neutral or
product-subject phrasing across 10 spots. Matches the voiceover's
"my submission" framing in the demo video. Side-benefit: removes a
few self-grading / hedge-language phrases.
- README + JUDGE_BRIEF: add 3-min demo video link
(https://youtu.be/n-u7J1lljUg) in 6 places — top-of-readme callout,
inline mentions of the on-device demo, Public Demo section.
- RETRAIN_RESULTS: document the `ollama pull` + `ollama cp` two-step
needed to reproduce the A/B against tusharbrisingr9802/sakhi locally.
- FIELD_COVERAGE_DIFF: factual restatement of the base-vs-finetune
trade-off (drop "safer, more consistent alternative" value-judgment).

Files changed (5)

FAILURES.md +24 -8
FIELD_COVERAGE_DIFF.md +1 -1
JUDGE_BRIEF.md +19 -11
README.md +59 -23
RETRAIN_RESULTS.md +3 -1

FAILURES.md CHANGED

@@ -12,7 +12,7 @@ Every test failure in Sakhi's eval suite is recorded here with a root-cause diag
 ### Failure pattern: BP value drift through TTS → ASR
-gTTS (Google Text-to-Speech, the synthesizer we use for test audio generation — see `scripts/generate_test_audio.py`) is a concatenative TTS engine. It is fast and free, but does not produce the prosody of natural Hindi speech — it tends to produce staccato numeric readings with limited inter-word coarticulation. When a number sequence like `"एक सौ साठ बटा एक सौ दस"` (160/105 in the BP format ASHA workers read aloud) runs through gTTS, the pronunciation of `"बटा"` (the Hindi separator equivalent to the English "over" in "160 over 105") can be produced with a sibilance or softening that Whisper-Large-V2 Hindi mishears.
 **Observed failure pattern** (from development iteration logs, before the current passing-13/15 baseline was pinned):
@@ -29,12 +29,12 @@ gTTS (Google Text-to-Speech, the synthesizer we use for test audio generation
 ### Reproducing these specific failures
-`python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous. (Re-running the suite on a fresh Ollama + Whisper install on 2026-04-19 will produce the definitive current list — will be pinned in a follow-up commit after the Bareilly recordings, alongside the real-audio-path baseline.)
 ### Planned mitigation
-- Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, we expect `test_pipeline_e2e.py` pass rate to rise, not fall — real speech is cleaner than gTTS for Whisper.
-- Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will let us re-tune the recall/precision tradeoff.
 ---
@@ -49,7 +49,7 @@ The LoRA fine-tune (1,154 synthetic examples, 981 train / 173 val) was trained o
 ### Disposition
-Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama path for its zero-shot pass-rate edge. The fine-tune remains available as `sakhi:latest` in Ollama for deployments that prefer the English-schema-label normalization the fine-tune also produces. We did not further tune — the finding is informative (synthetic-data distribution bias is a known LoRA pitfall), not a ship-blocker.
 ---
@@ -63,6 +63,22 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
 ---
 ## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
 **Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
@@ -75,7 +91,7 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
 ### Disposition
-One-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) is held back close to deadline. The regression surface is the full form schema across all four visit types and we don't have time to re-run the eval suite against a tightened schema with confidence. The safety-critical output (danger panel + referral decision) is unaffected, so the conservative choice is documented disclosure now, schema cleanup post-competition.
 ---
@@ -87,7 +103,7 @@ The 15/15 pass rate is computed against per-case `hallucination_traps` lists —
 ### Disposition
-The rubric is honest about what it tests — `hallucination_traps` is the literal list of fields each test asserts null for, and the test source is reproducible. But "15/15 tests pass" rests on a narrow per-case rubric, not a whole-schema null-everywhere-not-mentioned check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above before deploy. Post-competition the rubric will be widened; the current ratio is reported as-is.
 ---
@@ -103,4 +119,4 @@ Conversational pacing on the long clip. BP `एक सौ साठ बटा
 ### Disposition
-Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so the BP path is exercised end-to-end on the most-played sample. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.

 ### Failure pattern: BP value drift through TTS → ASR
+gTTS (Google Text-to-Speech, the synthesizer used for test audio generation — see `scripts/generate_test_audio.py`) is a concatenative TTS engine. It is fast and free, but does not produce the prosody of natural Hindi speech — it tends to produce staccato numeric readings with limited inter-word coarticulation. When a number sequence like `"एक सौ साठ बटा एक सौ दस"` (160/105 in the BP format ASHA workers read aloud) runs through gTTS, the pronunciation of `"बटा"` (the Hindi separator equivalent to the English "over" in "160 over 105") can be produced with a sibilance or softening that Whisper-Large-V2 Hindi mishears.
 **Observed failure pattern** (from development iteration logs, before the current passing-13/15 baseline was pinned):
 ### Reproducing these specific failures
+`python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous.
 ### Planned mitigation
+- Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, the `test_pipeline_e2e.py` pass rate should rise, not fall — real speech is cleaner than gTTS for Whisper.
+- Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will allow re-tuning the recall/precision tradeoff.
 ---
 ### Disposition
+Documented in `RETRAIN_RESULTS.md`. The base model is in the live Ollama path. The fine-tune remains available on the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer English schema-label normalization. Further tuning was not pursued — the failure mode (synthetic-data distribution bias) is a known LoRA pitfall and the base already passes 15/15.
 ---
 ---
+## ANC form: `patient.age` slot misclassification on on-device E2B path
+**Harness:** Field Mode on-device text → form, observed during slot 3 video recording on 2026-05-17.
+**Observed output:** With the `Load ANC example` ANC preeclampsia transcript fed through Gemma 4 E2B INT4 on Cactus SDK, `patient.age` is populated with `8`. The source is the speaker's response to the ASHA's gestational-age question — `लगभग 8 महीने` ("about 8 months [pregnant]") — which the on-device model is grounding to the wrong field. The transcript carries no explicit patient age in years.
+### Root cause
+Same family as the `pregnancy.previous_complications` walkthrough below: the model is filling a slot from a number present in the input without grounding it in the slot's semantics. On the E2B INT4 path the surface is wider because the null-filled instance template prompt does not carry per-field descriptions about year-vs-month-vs-week semantics; the E4B Ollama path consumes the JSON Schema which (for the fields that have descriptions) gives the model more discrimination signal.
+### Disposition
+Not a safety-critical issue — no clinical decision in the pipeline depends on `patient.age`. The architectural mitigation is already in place: the ASHA-entered metadata header (typed at intake, before any conversation is recorded or processed) supplies patient demographics directly via `apply_metadata`, which merges them into the form envelope and supersedes any conversational extraction. The misclassification only surfaces when demographics are absent from the input, which is the demo / on-device-test scenario, not the deployed ASHA workflow. A schema-side fix would add explicit field descriptions to the on-device template (`"age": "patient's age in YEARS, not gestational months"`); not landed in this submission.
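The header-supersedes-extraction behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the name `apply_metadata_sketch`, its signature, the envelope layout, and the header key names are assumptions. Only the target slots (`patient.{name, age, mobile}` for ANC, `child.{name, age_months, sex}` with year-to-month conversion for child health) come from the README.

```python
from copy import deepcopy

# Hypothetical header keys; the real schema may name them differently.
ANC_KEYS = ("name", "age", "mobile")


def apply_metadata_sketch(form, metadata, visit_type):
    """Return an envelope in which typed header values override extracted ones."""
    merged = deepcopy(form)  # leave the model output untouched
    if visit_type == "anc":
        patient = merged.get("patient") or {}
        for key in ANC_KEYS:
            if metadata.get(key) is not None:
                patient[key] = metadata[key]  # header wins over extraction
        merged["patient"] = patient
    elif visit_type == "child_health":
        child = merged.get("child") or {}
        if metadata.get("name") is not None:
            child["name"] = metadata["name"]
        if metadata.get("sex") is not None:
            child["sex"] = metadata["sex"]
        if metadata.get("age") is not None:
            child["age_months"] = int(metadata["age"] * 12)  # years -> months
        merged["child"] = child
    # PNC and delivery forms have no patient sub-object, so the header
    # travels only in the envelope.
    return {"visit_type": visit_type, "metadata": dict(metadata), "form": merged}


# The mis-grounded extraction from the failure above: age filled with 8.
extracted = {"patient": {"name": None, "age": 8, "mobile": None}}
with_header = apply_metadata_sketch(extracted, {"age": 26}, "anc")
without_header = apply_metadata_sketch(extracted, {}, "anc")
```

With a typed age in the header, `with_header["form"]["patient"]["age"]` carries the header value. With an empty header the extracted `8` survives, which matches the demo / on-device-test scenario where the misclassification surfaces.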
+---
 ## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
 **Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
 ### Disposition
+The one-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) touches the full form schema across all four visit types and would require re-running the 15-case eval to validate no regression. That re-run did not land before this submission. The safety-critical output (danger panel + referral decision) is unaffected; the misclassification is in a non-safety field.
 ---
 ### Disposition
+`hallucination_traps` is the literal list of fields each test asserts null for; the test source is `scripts/test_ollama_quality.py:470-473`. "15/15 tests pass" is against this per-case rubric, not a whole-schema null-everywhere check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above. The wider rubric is not landed here.
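The difference between the per-case rubric and the wider one can be sketched as below. This is an illustration under stated assumptions, not the logic in `scripts/test_ollama_quality.py`: the helper names, the dotted-path convention, the `vitals.*` field names, and the per-case `mentioned_fields` annotation are all hypothetical. Only `patient.age` and `pregnancy.previous_complications` are field names taken from this document.

```python
def flatten(form, prefix=""):
    """Yield (dotted_path, value) for every leaf of a nested form dict."""
    for key, value in form.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def is_empty(value):
    """Treat None, empty string, and empty list as 'null'."""
    return value is None or value == "" or value == []


def narrow_violations(form, hallucination_traps):
    """Per-case rubric: only the listed trap fields must be null."""
    leaves = dict(flatten(form))
    return sorted(p for p in hallucination_traps if not is_empty(leaves.get(p)))


def wide_violations(form, mentioned_fields):
    """Wider rubric: every field the transcript does not mention must be null."""
    return sorted(
        path for path, value in flatten(form)
        if path not in mentioned_fields and not is_empty(value)
    )


form = {
    "patient": {"age": None},
    "vitals": {"bp_systolic": 160, "bp_diastolic": 105},
    "pregnancy": {"previous_complications": "high BP", "gravida": None},
}
traps = ["patient.age"]  # narrow, per-case list
mentioned = {"vitals.bp_systolic", "vitals.bp_diastolic"}  # hypothetical annotation

narrow = narrow_violations(form, traps)
wide = wide_violations(form, mentioned)
```

In this toy case the narrow check reports nothing, while the wide check reports `pregnancy.previous_complications`. The cost of the wider rubric is that each test case needs an explicit list of fields the transcript actually mentions.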
 ---
 ### Disposition
+The mitigation in this submission: the 20 s clip is the manifest default, so the most-played sample exercises the full BP path end-to-end. The 52 s clip remains in the dropdown as the longer-conversation case; on that clip the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped. A custom Hindi-medical Whisper fine-tune would address the root cause; not in this submission.

FIELD_COVERAGE_DIFF.md CHANGED

@@ -2,7 +2,7 @@
 Date: 2026-04-17 09:53
-The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). While the base model extracted more raw fields on average (11 vs 2 unique extractions), the fine-tune produced more consistent schema-normalized values — translating Hindi symptom phrases to English labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") — and recovered two visit-type-specific fields the base model missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base model was kept in production for the single-test accuracy edge; the fine-tune demonstrates the training pipeline can produce a safer, more consistent alternative.
 ## Summary

 Date: 2026-04-17 09:53
+The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). The base model extracted more raw fields on average (11 vs 2 unique extractions). The fine-tune translates Hindi symptom phrases into English schema labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") and recovers two visit-type-specific fields the base model misses (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base ships in the live pipeline for the single-test accuracy edge (15/15 vs 14/15); the fine-tune is registered as a schema-normalization alternative.
 ## Summary

JUDGE_BRIEF.md CHANGED

@@ -1,6 +1,6 @@
 # Sakhi (सखी) — Judge Brief
-*One-page version of the README. Full detail in [README.md](README.md).*
 ## The problem, in two sentences
@@ -22,31 +22,35 @@ Sakhi converts Hindi home-visit conversations (voice on a shared health-center w
 | Workstation pipeline latency (audio → form) | ~15–25 s | RTX 5070 Ti, warm Ollama |
 | On-device pipeline latency (Hindi text → form) | ~5 min | OnePlus 11R / Snapdragon 8+ Gen 1, Gemma 4 E2B INT4 on Cactus |
-The 5-minute on-device figure is tested against the `ms2_0425` ANC preeclampsia training transcript: the model correctly extracts BP 150/95, TT complete, IFA = yes, verbatim Hindi symptoms, and flags `high_bp_with_symptoms` (urgent_care) with the Hindi quote `"आपका BP 150/95 आ रहा है"` and a "Refer Immediately" decision. A 5-minute wait is a net time save against the 15–20 min baseline of hand-filling paper forms plus travel to the PHC.
 ## Why this is submitted to four tracks
 | Track | What Sakhi brings |
 |---|---|
-| **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a real ASHA workflow (health-center mode + field mode with later sync) — not a research demo. |
 | **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
-| **Unsloth** | Honest reproducible LoRA pipeline in `scripts/train_unsloth.py`: data prep → LoRA train → GGUF export → Ollama registration → A/B eval vs base. Published artifacts: `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`. Fine-tune didn't beat base on pass-rate — we shipped the base and documented the fine-tune's specific wins (English schema-label normalization, visit-type-specific field recovery) rather than inflate the narrative. |
-| **Cactus** | Genuine on-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. We investigated on-device voice-in via `cactusTranscribe` + Gemma; documented in the README why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per arXiv 2512.10967 — shipping it would cause clinical harm). |
 ## Reproduce in under 10 minutes
 **Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
 **Health-center mode (workstation only):**
 ```bash
-pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M
 cd frontend && npm install && npm run build && cd ..
 python api.py        # browser: http://localhost:8000
 ```
 **Field mode (phone + Cactus):**
-> **We do not redistribute the Cactus-Compute model** — it is gated under a custom Cactus license. Reviewers verifying the Cactus track follow the documented path below. Most reviewers can verify the engineering claims via the workstation path above without ever installing on-device; the 3-minute demo video shows the full on-device flow on a real phone.
 ```bash
 # Build + install the APK once. After this the model install is in-app, no adb.
@@ -70,13 +74,17 @@ cd frontend && npm run build && npx cap sync android && \
 A sample Hindi transcript ready to paste is at `data/processed/train.jsonl` (line 1 = ANC preeclampsia case) or in the main README.
-## What we'd do with $10K and six more months
-- Partner with an ASHA training institute (Santosh Medical College / IIT Madras Bhashini) to collect 100+ hours of *real* ASHA home-visit audio — the current evaluation is entirely on synthetic TTS audio + LLM-generated conversations.
-- Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path that we deliberately did not ship in this submission.
 - Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
 - Pilot with 10–20 ASHA workers in one block (Muradnagar / Loni-adjacent) with before/after time-and-accuracy measurement.
 ## Contact
-Tushar J — tushar.j@cognavi.com — GitHub: [Tushar-9802/Sakhi](https://github.com/Tushar-9802/Sakhi)

 # Sakhi (सखी) — Judge Brief
+*One-page version of the README. Full detail in [README.md](README.md). 3-min demo video: [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg).*
 ## The problem, in two sentences
 | Workstation pipeline latency (audio → form) | ~15–25 s | RTX 5070 Ti, warm Ollama |
 | On-device pipeline latency (Hindi text → form) | ~5 min | OnePlus 11R / Snapdragon 8+ Gen 1, Gemma 4 E2B INT4 on Cactus |
+The 5-minute on-device figure is reproducible via the **Load ANC example** button in Field Mode (Field Mode tab → On-device text → form card → "Load ANC example"). On OnePlus 11R / Snapdragon 8+ Gen 1, the on-device pipeline extracts BP 155/100, verbatim Hindi symptoms (`सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन`), Counseling `PHC जाने की सलाह`, and flags three danger signs — `high_bp_with_symptoms`, `swelling_face`, `swelling_legs` — all with verbatim Hindi `utterance_evidence` and `category: immediate_referral`. Total 320.7 s end-to-end (Form 231.8 s + Danger 88.9 s + normalize + detect). For comparison: the paper-form baseline is 15–20 min of hand-filling plus travel to the PHC.
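A verbatim-evidence check of the kind implied by `utterance_evidence` can be sketched as follows. This is a hedged illustration of one possible grounding layer, not Sakhi's pipeline code: the function name, the `sign` key, the `severe_bleeding` label, and the sample transcript are assumptions made for the example.

```python
import unicodedata


def _norm(text):
    """NFC-normalize and collapse whitespace so the comparison is stable."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def grounded_danger_signs(transcript, danger_signs):
    """Keep only danger signs whose evidence quote appears verbatim in the transcript."""
    haystack = _norm(transcript)
    kept = []
    for sign in danger_signs:
        evidence = _norm(sign.get("utterance_evidence") or "")
        if evidence and evidence in haystack:
            kept.append(sign)
    return kept


transcript = "आपका BP 155/100 आ रहा है। चेहरे पर सूजन है।"
candidates = [
    {"sign": "swelling_face", "utterance_evidence": "चेहरे पर सूजन"},
    {"sign": "severe_bleeding", "utterance_evidence": "खून बह रहा है"},  # not spoken
    {"sign": "high_bp_with_symptoms", "utterance_evidence": None},  # no quote given
]
kept = grounded_danger_signs(transcript, candidates)
```

Here only `swelling_face` survives: the second quote is absent from the transcript and the third candidate has no quote at all. A substring match is deliberately strict, so a paraphrased quote is dropped rather than trusted.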
 ## Why this is submitted to four tracks
 | Track | What Sakhi brings |
 |---|---|
+| **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a workflow matched to how ASHA workers actually operate (health-center mode + field mode with later sync). |
 | **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
+| **Unsloth** | One-command LoRA pipeline (`scripts/train_unsloth.py`): data prep → train → GGUF export → Ollama register → A/B eval vs base. Includes a Windows GGUF-export workaround (`scripts/export_merge.py`) for Unsloth's Gemma 4 mmap failure — manual delta-merge + `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`, no WSL needed. Fine-tune pass rate 14/15 vs base 15/15 — base is in the live pipeline; fine-tune is published to Ollama as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) (`ollama pull tusharbrisingr9802/sakhi` to verify A/B locally) for deployments preferring English schema-label normalization (`दस्त` → `Diarrhea`) over raw Hindi. Field-coverage diff in `FIELD_COVERAGE_DIFF.md`. |
+| **Cactus** | On-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. On-device voice-in via `cactusTranscribe` + Gemma was investigated; the README documents why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per [Kumar et al. 2025](https://arxiv.org/abs/2512.10967) and the Vistaar / Gramvaani benchmarks, with deletion-dominant errors on numbers — not in this submission). |
 ## Reproduce in under 10 minutes
+**3-min demo video:** [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg) — workstation voice-to-form path, on-device Hindi text-to-form on a phone in airplane mode, four tracks claimed.
 **Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
+**Pull the Unsloth fine-tune:** [`ollama pull tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi). The LoRA-fine-tuned Gemma 4 E4B is on the Ollama registry. Run `python scripts/test_ollama_quality.py` against base + fine-tune to reproduce the 15/15 vs 14/15 A/B locally.
 **Health-center mode (workstation only):**
 ```bash
+pip install -r requirements-runtime.txt && ollama pull gemma4:e4b-it-q4_K_M
 cd frontend && npm install && npm run build && cd ..
 python api.py        # browser: http://localhost:8000
 ```
 **Field mode (phone + Cactus):**
+> **Sakhi does not redistribute the Cactus-Compute model** — it is gated under a custom Cactus license. Reviewers verifying the Cactus track follow the documented path below. Most reviewers can verify the engineering claims via the workstation path above without ever installing on-device; the [3-minute demo video](https://youtu.be/n-u7J1lljUg) shows the full on-device flow on a real phone.
 ```bash
 # Build + install the APK once. After this the model install is in-app, no adb.
 A sample Hindi transcript ready to paste is at `data/processed/train.jsonl` (line 1 = ANC preeclampsia case) or in the main README.
+## Privacy & data handling
+Audio and transcripts never leave the institution that owns them. Workstation mode keeps everything on the PHC's local network (Whisper + Ollama on local GPU; no OpenAI / Anthropic / Google API). Field mode runs on-device via Cactus SDK — airplane mode does not break it. Patient demographics enter as a typed header rather than being extracted from audio, so identifiers are minimised at the boundary. This posture is compatible with India's Digital Personal Data Protection Act, 2023 — data fiduciary stays within the institution, no cross-border transfer, purpose limitation enforced by architecture rather than by policy.
+## What's next with $10K and six more months
+- Partner with an ASHA training institute (Santosh Medical College / IIT Madras Bhashini) to collect 100+ hours of *real* ASHA home-visit audio under field conditions. Current evaluation covers 4 real-voice recordings (2 speakers — 1 female Bareilly reader + 1 male self-record — across 3 of 4 role-play scripts) plus the 15-case synthetic test suite; full-corpus rural-female accent + field-noise validation is the next step.
+- Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path not shipped here.
 - Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
 - Pilot with 10–20 ASHA workers in one block (Muradnagar / Loni-adjacent) with before/after time-and-accuracy measurement.
 ## Contact
+Tushar J — tusharbrisingr9802@gmail.com — GitHub: [Tushar-9802/Sakhi](https://github.com/Tushar-9802/Sakhi)

README.md CHANGED

@@ -17,6 +17,12 @@ Offline-first tool that converts Hindi home visit conversations into structured
 **Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
 **Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
 ![Workstation demo: Hindi audio → form + danger signs (30 s)](workstation_demo.gif)
 ## Problem
@@ -27,10 +33,10 @@ India's ASHA workers conduct 50M+ maternal/child health home visits per year acr
 Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:
-- **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. Fast (~15 s) and accurate. This is the primary voice-to-form path.
 - **Field mode (phone)** has two offline sub-paths:
-  - **Record now, sync later** — ASHA records audio during home visits; chunks persist to IndexedDB every 5 s (crash-safe). When the phone is back on health-center WiFi, the queued recordings post to the workstation for full Whisper + E4B processing. This is the honest voice path — no on-device ASR attempted.
-  - **Type a note for instant on-device extraction** — for when the ASHA wants structured output *right now* without network. A short Hindi note in a textarea runs through the full pipeline (normalize → detect visit type → extract form → detect danger signs) entirely on-device via Gemma 4 E2B INT4 on the Cactus SDK. Same schema, same validation as the workstation path. Pipeline latency is ≈ 5 min on a Snapdragon 8+ Gen 1 phone. This is acceptable against the clinical baseline: the status quo is an ASHA hand-filling the same form from memory (15–20 min), carrying it to the PHC (another walk), then waiting for a clinician to read and act on it (hours to days). A 5-minute wait for on-device structured extraction + flagged danger signs is a net time save, not a UX compromise — and it works with zero network, zero shared infrastructure.
 ```
 Workstation path:
@@ -47,7 +53,7 @@ On-device path (text-in):
 ### Why not voice-to-form on-device too?
-We looked into it — the honest answer is it doesn't work well enough yet for clinical Hindi. Cactus's transcribe API supports Whisper / Moonshine / Parakeet only (Gemma 4's audio conformer is for voice understanding in multimodal chat, not dedicated ASR). Cactus ships multilingual Whisper INT4 weights, but no Hindi-specific checkpoint — and published evidence (arXiv 2512.10967, Vistaar/Gramvaani) shows off-the-shelf Whisper on spontaneous rural Hindi hits 27% WER at best and 70%+ on clinical content, with a deletion-dominant error profile that silently drops numbers and symptoms. For an ASHA decision-support tool where a missed BP reading is a clinical harm, we chose to *not* ship an unreliable on-device voice path. Record-and-sync with Whisper-Large on the workstation keeps voice-in honest; the on-device LLM does what Gemma 4 is actually good at — Hindi text understanding.
 ## Function Calling
@@ -72,7 +78,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
 | Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
 | Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
-**Patient demographics enter as a header, not from the audio.** Every clinical EMR works this way: identifiers typed once at intake, the conversation handled separately. The ASHA fills name / age / sex / mobile / ASHA-ID / visit-date in the header above the record button, and the LLM only extracts what was *said* during the visit — symptoms, vitals, counselling, next-visit date. This avoids a failure mode we hit in real-voice testing: Whisper-Hindi sometimes mishears patient names as different Hindi words, and a downstream LLM has no prior on what the name should be. Same merge logic runs on all three paths — `apply_metadata` in `app.py` for workstation audio and text, mirrored as a pure JS function in `pipeline.js` for on-device Cactus extraction — so server and phone produce identical envelopes for the same input. ANC fills `patient.{name, age, mobile}`; child_health fills `child.{name, age_months, sex}` with year→month conversion; PNC and delivery have no patient sub-object in their form, so the metadata travels in the response envelope only. `asha_id` is sticky across sessions via `localStorage`. For Field-mode recordings, the header is captured at record-start so later edits don't pollute earlier queue entries.
 **Hindi number normalization:** Algorithmic parser covering all 0–999 Hindi number words with Whisper misspelling variants. Handles compound medical values: "एक सौ दस बटा सत्तर" → "110/70", "ग्यारह दशमलव पाँच" → "11.5", "तीन किलो दो सौ ग्राम" → "3.2 kg".
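The compound-value handling can be illustrated with a small sketch. It is not the repository's parser: the vocabulary below is a hand-picked subset rather than the full 0–999 table, `पांच` stands in for one spelling variant, the function names are hypothetical, and unit conversions such as the "3.2 kg" case and multi-digit decimal parts are not handled.

```python
# Illustrative subset only; the real parser covers all 0-999 words.
NUMBER_WORDS = {
    "शून्य": 0, "एक": 1, "दो": 2, "तीन": 3, "चार": 4, "पाँच": 5, "पांच": 5,
    "छह": 6, "सात": 7, "आठ": 8, "नौ": 9, "दस": 10, "ग्यारह": 11,
    "साठ": 60, "सत्तर": 70, "अस्सी": 80, "नब्बे": 90,
}
HUNDRED = "सौ"
SEPARATORS = {"बटा": "/", "दशमलव": "."}


def _collapse_number_runs(tokens):
    """Replace each run of number words with a single digit string."""
    out, current = [], None
    for token in tokens:
        if token in NUMBER_WORDS:
            current = (current or 0) + NUMBER_WORDS[token]
        elif token == HUNDRED:
            current = (current or 1) * 100  # "एक सौ" -> 100, bare "सौ" -> 100
        else:
            if current is not None:
                out.append(str(current))
                current = None
            out.append(token)
    if current is not None:
        out.append(str(current))
    return out


def normalize_hindi_numbers(text):
    """Turn spoken Hindi numbers into digits and glue separators between them."""
    tokens = _collapse_number_runs(text.split())
    out, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        flanked = (
            token in SEPARATORS
            and 0 < i < len(tokens) - 1
            and tokens[i - 1].isdigit()
            and tokens[i + 1].isdigit()
        )
        if flanked:
            # Only convert a separator word when digits sit on both sides.
            out[-1] = out[-1] + SEPARATORS[token] + tokens[i + 1]
            i += 2
        else:
            out.append(token)
            i += 1
    return " ".join(out)


bp = normalize_hindi_numbers("एक सौ दस बटा सत्तर")
hb = normalize_hindi_numbers("ग्यारह दशमलव पाँच")
```

Requiring digits on both sides of `बटा` keeps the sketch conservative: a separator word that is not flanked by numbers is left untouched instead of being rewritten.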
@@ -88,7 +94,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
 Two reproduction paths. Pick by available hardware.
-**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt` — inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
 **Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
 1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
@@ -98,7 +104,7 @@ Two reproduction paths. Pick by available hardware.
 5. Open Sakhi → Field Mode → On-Device Probe → **Import model (.zip)** → pick the zip from the system file picker. Wait ~3-5 minutes for extraction (progress bar + log card show live file count and MB written). Re-imports auto-evict the previous model — no manual cleanup, no risk of 12 GB accumulation.
 6. **Load Model** → **Test Hindi** to confirm inference works.
-**We do not redistribute the Cactus model.** It is gated under a custom Cactus-Compute license; hosting it on a public Drive link would violate that gating. The in-app SAF import flow exists precisely so reviewers who DO want to reproduce on-device can do so without us needing to host the weights ourselves and without needing developer mode or adb on their phone. The 3-minute demo video in the submission shows the full flow on a real phone, so the on-device claim can be verified without anyone needing to install the model themselves.
 ## Safety & Limitations
@@ -108,11 +114,29 @@ Sakhi is a decision-support tool, not a diagnostic system. All outputs require h
 **What it can miss:** Danger signs not discussed in conversation, subtle clinical findings that require physical examination, conditions that present atypically. The system cannot observe — it can only reason about what was spoken.
-**False positive controls:** The 6-layer anti-hallucination pipeline aggressively filters ungrounded danger signs. On the test suite, normal visits produce zero false alarms.
 **Human-in-the-loop:** Every referral decision is presented to the ANM/medical officer at the health center for review before action. The tool accelerates information flow from field to facility — it does not replace clinical judgment.
-**Known gaps:** All current test data is synthetic (TTS-generated Hindi audio, LLM-generated training conversations). Real-world ASHA conversations will be noisier, more fragmented, and contain regional dialect variation not yet tested.
 ## Deployment Model
@@ -134,7 +158,7 @@ Health Center (workstation, RTX GPU)              Field (Android phone)
 **Three access points, same backend schema:**
 1. **Workstation browser** — ANM/medical officer at the health center opens `http://localhost:8000` (or `http://<LAN-IP>:8000` from any workstation on the WiFi). FastAPI serves the built React UI at `/` and the pipeline endpoints at `/api/*`. One command (`python api.py`) starts everything.
-2. **Phone, health-center mode** — APK records and posts to workstation's `:8000` over WiFi. Workstation does Whisper + E4B (fast, accurate). Best extraction quality available.
 3. **Phone, field mode** — APK offers two offline paths. **(a)** Record audio during home visits — chunks stored crash-safely in IndexedDB every 5 s. Queued recordings sync to the health-center workstation when back on WiFi for full Whisper + E4B processing. **(b)** Type a short Hindi note in the "on-device text → form" card; the full extraction + danger-sign pipeline runs on the phone via Gemma 4 E2B on Cactus SDK. No network required. Total on-device pipeline latency ≈ 5 min on Snapdragon 8+ Gen 1 — suited for "tap and wait" use, not real-time.
 **Crash-safe recording (Field Mode):** audio chunks are persisted to IndexedDB every 5 seconds during a recording. If the browser tab closes, the phone locks, or the app is killed mid-visit, the chunks survive — on reopen, an orange recovery banner offers to reassemble the partial recording.
@@ -168,21 +192,33 @@ Health Center (workstation, RTX GPU)              Field (Android phone)
 - Covers 0–999 Hindi number words + Whisper misspelling variants
 - Compound values (BP, weight, Hb), decimal points, fractions
 ## Fine-Tuning (Unsloth Track)
-We fine-tuned Gemma 4 E4B via Unsloth LoRA on 1,154 synthetic ASHA visit examples (981 train / 173 val) covering all 4 visit types and 458 positive danger sign cases. The resulting adapter is exported as a Q4_K_M GGUF and registered in Ollama as `sakhi:latest`.
-**Configuration:** LR 5e-5, 1 epoch, LoRA r=16/alpha=32, dropout 0.05 — conservative hyperparameters to avoid overfitting on a small dataset.
-**A/B comparison vs base** (see `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`):
-- **Pass rate:** base 15/15 vs fine-tune 14/15 (single fail on heavy Hinglish code-switch → over-referral, a safer failure mode)
-- **Latency:** base 18.7s vs fine-tune 19.0s avg — effectively tied
-- **Schema normalization:** the fine-tune consistently translates Hindi symptom phrases into English schema labels ("दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness"), making downstream filtering easier. Base retains raw Hindi.
-- **Unique field extractions:** fine-tune recovered 2 visit-type-specific fields the base missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`); base recovered 11 fields the fine-tune left null.
-**Production choice:** we kept the base model in the live pipeline for its single-test accuracy edge. The fine-tune demonstrates the reproducible training pipeline and ships as an alternative for deployments that prefer consistent English schema values over raw transcription.
-**Export pipeline (Windows):** the training script (`scripts/train_unsloth.py`) handles the full flow — data prep, LoRA training, auto-eval. For GGUF export we use a manual path (`scripts/export_merge.py`) that bypasses Unsloth's Windows mmap issues: load base + adapter via transformers, compute `delta_W = (B @ A) * (alpha/r)` per pair, then `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`.
 ## Frontend
@@ -206,7 +242,7 @@ One React + Vite codebase, shipped as both a browser UI (served by FastAPI at `/
 # Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
 # ── Health-center deployment (workstation, unified UI + API) ──
-pip install -r requirements-hf.txt          # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
 ollama pull gemma4:e4b-it-q4_K_M             # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
 cd frontend && npm install && npm run build && cd ..
 python api.py
@@ -266,7 +302,7 @@ python scripts/compare_field_coverage.py        # Field-level diff base vs sakhi
 **Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
-**Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the 3-minute demo video, or follow Path 1 above to run locally — the live Space exists for convenience, not as the rigorous evaluation path.
 ### How it's deployed
@@ -274,7 +310,7 @@ python scripts/compare_field_coverage.py        # Field-level diff base vs sakhi
 - `Dockerfile` — two-stage build: Node 20 builds `frontend/dist`, CUDA 12.2 + cuDNN 8 runtime installs Ollama + Python deps and copies the dist in.
 - `entrypoint.sh` — starts the Ollama daemon, waits for its API, pulls `gemma4:e4b-it-q4_K_M` if absent, then `exec uvicorn api:app`.
-- `requirements-hf.txt` — slim runtime deps (faster-whisper, fastapi, uvicorn, ollama). No Unsloth / PyTorch / bitsandbytes — they're training-side only.
 - `.dockerignore` — keeps the build context small (no `models/`, no `data/recordings/`, no `frontend/node_modules`, no `cactus-src/`, etc.).
 - README YAML frontmatter — `sdk: docker`, `app_port: 7860`. HF Space picks this up on push.
@@ -313,7 +349,7 @@ src/hindi_normalize.py              # Hindi number/medical term normalization (1
 configs/schemas/                    # 5 JSON schemas (ANC, PNC, delivery, child health, danger signs)
 Dockerfile                          # HF Space build: Node frontend + CUDA runtime + Ollama
 entrypoint.sh                       # HF Space container init: ollama serve → pull model → uvicorn
-requirements-hf.txt                 # Slim runtime deps (no Unsloth/PyTorch — Ollama serves inference)
 frontend/
   src/App.jsx                       # React app — all 5 tabs, on-device text-in card + Cactus probe in Field Mode
   src/offlineQueue.js               # IndexedDB offline queue + crash-safe chunk persistence

 **Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
 **Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
+**▶ Watch the 3-min demo:** [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg) — full submission video: problem framing, workstation voice-to-form path, on-device Hindi text-to-form on a phone in airplane mode, four tracks claimed.
+**▶ Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — the Path 1 workstation stack (FastAPI + Ollama + Whisper) running on an HF Space T4. Same UI, same endpoints; no install needed. ~5 min cold-boot wait after idle — see [Public Demo](#public-demo--huggingface-space) for details.
+**▶ Pull the Unsloth fine-tune:** [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) on the Ollama registry — `ollama pull tusharbrisingr9802/sakhi` fetches the LoRA-fine-tuned Gemma 4 E4B behind the A/B numbers below. The base model (`gemma4:e4b-it-q4_K_M`) is what ships in the live pipeline; this is the side-by-side comparison artifact for the Unsloth track.
 ![Workstation demo: Hindi audio → form + danger signs (30 s)](workstation_demo.gif)
 ## Problem
 Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:
+- **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. End-to-end latency ~15–25 s on an RTX 5070 Ti or T4. This is the primary voice-to-form path.
 - **Field mode (phone)** has two offline sub-paths:
+  - **Record now, sync later** — ASHA records audio during home visits; chunks persist to IndexedDB every 5 s (crash-safe). When the phone is back on health-center WiFi, the queued recordings post to the workstation for full Whisper + E4B processing. On-device ASR is not attempted — see the section below for why.
+  - **Type a note for instant on-device extraction** — for when the ASHA wants structured output *right now* without network. A short Hindi note in a textarea runs through the full pipeline (normalize → detect visit type → extract form → detect danger signs) entirely on-device via Gemma 4 E2B INT4 on the Cactus SDK. Same schema, same validation as the workstation path. Pipeline latency is ≈ 5 min on a Snapdragon 8+ Gen 1 phone. For comparison: the paper-form baseline is 15–20 min of hand-filling from memory, then a walk to the PHC, then clinician review hours-to-days later. The on-device path works with zero network and zero shared infrastructure.
 ```
 Workstation path:
 ### Why not voice-to-form on-device too?
+The on-device voice path does not work well enough yet for clinical Hindi. Cactus's transcribe API supports Whisper / Moonshine / Parakeet only (Gemma 4's audio conformer is for voice understanding in multimodal chat, not dedicated ASR). Cactus ships multilingual Whisper INT4 weights, but no Hindi-specific checkpoint — and published benchmarks ([Kumar et al. 2025, *ASR Under the Stethoscope*](https://arxiv.org/abs/2512.10967); Vistaar / Gramvaani corpus evaluations) show off-the-shelf Whisper on spontaneous rural Hindi hits 27% WER at best and 70%+ on clinical content, with substantial variability tied to speaker role / gender / code-mixing and a deletion-dominant error profile that silently drops numbers and symptoms. For an ASHA decision-support tool where a missed BP reading is a clinical harm, an on-device voice path is not in this submission. Record-and-sync with Whisper-Large on the workstation handles voice-in; the on-device LLM handles Hindi text understanding only.
 ## Function Calling
 | Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
 | Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
+**Patient demographics enter as a header, not from the audio.** Every clinical EMR works this way: identifiers typed once at intake, the conversation handled separately. The ASHA fills name / age / sex / mobile / ASHA-ID / visit-date in the header above the record button, and the LLM only extracts what was *said* during the visit — symptoms, vitals, counselling, next-visit date. This avoids a failure mode surfaced in real-voice testing: Whisper-Hindi sometimes mishears patient names as different Hindi words, and a downstream LLM has no prior on what the name should be. Same merge logic runs on all three paths — `apply_metadata` in `app.py` for workstation audio and text, mirrored as a pure JS function in `pipeline.js` for on-device Cactus extraction — so server and phone produce identical envelopes for the same input. ANC fills `patient.{name, age, mobile}`; child_health fills `child.{name, age_months, sex}` with year→month conversion; PNC and delivery have no patient sub-object in their form, so the metadata travels in the response envelope only. `asha_id` is sticky across sessions via `localStorage`. For Field-mode recordings, the header is captured at record-start so later edits don't pollute earlier queue entries.
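The merge rules in the paragraph above can be sketched roughly as follows. This is a minimal illustration, not the project's `apply_metadata`: the function name `apply_metadata_sketch`, the header key names, the lower-case visit-type strings, the envelope shape, and the choice to let the typed header overwrite any LLM-extracted identifier are all assumptions made for the example.

```python
from copy import deepcopy


def apply_metadata_sketch(visit_type, form, header):
    """Merge ASHA-typed header fields into an extracted form (illustrative only).

    `header` is assumed to hold keys like name / age / sex / mobile / asha_id,
    with `age` given in years. All key names here are hypothetical.
    """
    merged = deepcopy(form)  # never mutate the caller's extraction result

    if visit_type == "anc":
        # ANC forms carry a patient sub-object: name, age, mobile.
        patient = merged.setdefault("patient", {})
        for key in ("name", "age", "mobile"):
            if header.get(key) is not None:
                patient[key] = header[key]
    elif visit_type == "child_health":
        # Child-health forms carry a child sub-object; age is stored in months.
        child = merged.setdefault("child", {})
        for key in ("name", "sex"):
            if header.get(key) is not None:
                child[key] = header[key]
        if header.get("age") is not None:
            child["age_months"] = round(header["age"] * 12)
    # PNC and delivery forms have no patient sub-object, so nothing is merged
    # into the form; the header travels in the envelope only.

    return {"visit_type": visit_type, "form": merged, "metadata": dict(header)}
```

Because the function is pure (input in, new envelope out), the same logic can be mirrored in another language and checked against identical fixtures, which is the property the paragraph relies on for server and phone parity.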
 **Hindi number normalization:** Algorithmic parser covering all 0–999 Hindi number words with Whisper misspelling variants. Handles compound medical values: "एक सौ दस बटा सत्तर" → "110/70", "ग्यारह दशमलव पाँच" → "11.5", "तीन किलो दो सौ ग्राम" → "3.2 kg".
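A toy sketch of how such a parser might handle the BP and decimal examples above. It is not the code in `src/hindi_normalize.py`: the vocabulary is a tiny hand-picked subset, the function names are hypothetical, and Whisper misspelling variants and unit handling (the kg example) are left out.

```python
# Tiny illustrative vocabulary; a real parser would cover every word for 0-999.
SMALL = {"एक": 1, "दो": 2, "तीन": 3, "चार": 4, "पाँच": 5,
         "दस": 10, "ग्यारह": 11, "साठ": 60, "सत्तर": 70}
HUNDRED = "सौ"
SEPARATORS = {"बटा": "/", "दशमलव": "."}


def parse_number(tokens, i):
    """Parse one number (0-999) starting at tokens[i].

    Returns (value, next_index), or (None, i) if no number starts here.
    """
    if i >= len(tokens):
        return None, i
    if tokens[i] in SMALL:
        value = SMALL[tokens[i]]
        i += 1
        # "<digit> सौ [<below 100>]" forms a hundreds value.
        if value < 10 and i < len(tokens) and tokens[i] == HUNDRED:
            value *= 100
            i += 1
            if i < len(tokens) and tokens[i] in SMALL:
                value += SMALL[tokens[i]]
                i += 1
        return value, i
    if tokens[i] == HUNDRED:
        # A bare "सौ" means 100, optionally followed by a remainder.
        value = 100
        i += 1
        if i < len(tokens) and tokens[i] in SMALL:
            value += SMALL[tokens[i]]
            i += 1
        return value, i
    return None, i


def normalize_hindi_numbers(text):
    """Replace Hindi number words with digits; join BP and decimal pairs."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        value, j = parse_number(tokens, i)
        if value is not None:
            out.append(str(value))
            i = j
            continue
        # A separator only joins when it sits between two parsed numbers.
        if tokens[i] in SEPARATORS and out and out[-1].isdigit():
            right, j = parse_number(tokens, i + 1)
            if right is not None:
                out[-1] += SEPARATORS[tokens[i]] + str(right)
                i = j
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)
```

With this vocabulary, `normalize_hindi_numbers("एक सौ दस बटा सत्तर")` returns `"110/70"`. Adjacent small numbers are deliberately kept separate rather than summed, so `"दो तीन"` yields `"2 3"`.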
 Two reproduction paths. Pick by available hardware.
+**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-runtime.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. The slim `requirements-runtime.txt` covers the serving stack (Ollama client + faster-whisper + FastAPI); PyTorch / Unsloth / bitsandbytes from the comprehensive `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify Sakhi's engineering claims (function calling, normalization, 6-layer validation, schema correctness).
 **Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
 1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
 5. Open Sakhi → Field Mode → On-Device Probe → **Import model (.zip)** → pick the zip from the system file picker. Wait ~3-5 minutes for extraction (progress bar + log card show live file count and MB written). Re-imports auto-evict the previous model — no manual cleanup, no risk of 12 GB accumulation.
 6. **Load Model** → **Test Hindi** to confirm inference works.
+**Sakhi does not redistribute the Cactus model.** It is gated under a custom Cactus-Compute license; hosting it on a public Drive link would violate that gating. The in-app SAF import flow exists precisely so reviewers who DO want to reproduce on-device can do so without the project needing to host the weights, and without needing developer mode or adb on their phone. The [3-minute demo video](https://youtu.be/n-u7J1lljUg) shows the full flow on a real phone, so the on-device claim can be verified without anyone needing to install the model themselves.
 ## Safety & Limitations
 **What it can miss:** Danger signs not discussed in conversation, subtle clinical findings that require physical examination, conditions that present atypically. The system cannot observe — it can only reason about what was spoken.
+**False positive controls:** The 6-layer anti-hallucination pipeline filters ungrounded danger signs. On the test suite, normal visits produce zero false alarms.
 **Human-in-the-loop:** Every referral decision is presented to the ANM/medical officer at the health center for review before action. The tool accelerates information flow from field to facility — it does not replace clinical judgment.
+**Known limitations** (full root-cause walkthroughs in [FAILURES.md](FAILURES.md)):
+- **On-device latency.** Field-mode text-in extraction takes ~5 min on a Snapdragon 8+ Gen 1 — versus ~15–25 s on the workstation path. The use case is asynchronous: kick off at the end of a visit, the form is ready by the next stop. Live consultation runs on the workstation path.
+- **Long-clip BP drop.** Whisper-Large CT2 reliably recovers BP `160/110` only when the speaker pauses ~0.5 s around `बटा` (the Hindi "over" separator). At conversational pacing on long clips, the number can drop while the surrounding "बहुत हाई है" framing is preserved; the danger panel still flags severe-hypertension from the qualitative phrase.
+- **Eval-rubric scope.** The 15/15 quality score is asserted against per-case `hallucination_traps` lists — the specific fields that MUST be null for that input — not a whole-schema null-everywhere check. The ANC preeclampsia case has a misclassification not on its trap list: `pregnancy.previous_complications` (a prior-history field) gets populated with current-visit symptoms. The danger panel and referral decision are unaffected. The schema-description fix touches all four visit schemas and would require a full eval re-run; that re-run did not land here.
+- **Synthetic training data + partial real-voice eval.** The 1,154 fine-tune examples and 15-case automated eval suite are LLM-generated Hindi conversations with gTTS audio. Real-voice testing to date covers 4 recordings from 2 speakers (1 female Bareilly reader + 1 male self-record) across 3 of 4 role-play scripts (ANC preeclampsia, PNC Day-7, child diarrhea; see Test Results for details and the fixes that came out of it). Rural female ASHA accents, regional dialects, and field background noise are not yet covered.
+- **Regional dialect coverage.** Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, and code-switched Marwari/Bhili speech are not validated. ASHA workers in those regions would need targeted evaluation before deployment.
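To make the eval-rubric scope in the list above concrete, here is a small sketch of a per-case trap check. It assumes traps are written as dotted field paths and that "null" also covers empty strings and empty containers; the function names and the `vitals.hb` field are hypothetical and not taken from the eval scripts.

```python
def get_path(record, dotted):
    """Follow a dotted path such as 'pregnancy.previous_complications'.

    Returns None if any segment is missing.
    """
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def trap_violations(output, traps):
    """Return the trap paths that were populated but should have stayed null."""
    empty = (None, "", [], {})
    return [path for path in traps if get_path(output, path) not in empty]


# Only fields on the case's trap list are checked. A wrongly populated field
# that is not listed (here: previous_complications) does not fail the case.
output = {
    "pregnancy": {"previous_complications": ["headache", "swelling"]},
    "vitals": {"hb": None},
}
case_traps = ["vitals.hb"]
violations = trap_violations(output, case_traps)
```

In this example `violations` is empty, so the case passes even though `pregnancy.previous_complications` holds current-visit symptoms. Adding that path to the trap list would turn the same output into a failure.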
+## Privacy & Data Handling
+Sakhi is designed so the audio and transcript of a patient visit never cross the boundary of the institution that owns it.
+- **Workstation mode.** ASR + LLM extraction run on the PHC's GPU. Audio uploads from the phone travel over local WiFi LAN to `http://<workstation>:8000`, are processed in memory, and the response goes back to the phone. No third-party API call. No telemetry. No analytics.
+- **Field mode (on-device).** Hindi text → form extraction runs entirely on the phone via Gemma 4 E2B on Cactus SDK; the on-device path is fully offline and airplane mode does not break it. Voice captured in field mode persists to phone-local IndexedDB and is posted only to the configured workstation LAN endpoint at sync time.
+- **No external LLMs.** Gemma 4 weights (E4B on Ollama, E2B INT4 on Cactus) are local. No OpenAI, Anthropic, or Google Cloud API key is required or used anywhere in the pipeline.
+- **Data minimization at the boundary.** Patient demographics enter as a typed header — never extracted from audio — so identifiers do not need to round-trip through ASR + LLM layers.
+- **DPDP Act alignment.** This deployment posture is compatible with India's Digital Personal Data Protection Act, 2023 — data fiduciary stays within the institution, no cross-border transfer, purpose limitation enforced by architecture rather than by policy.
+The public HuggingFace Space referenced below exists for reviewer convenience only; production deployments would run the workstation stack on PHC-owned hardware.
 ## Deployment Model
 **Three access points, same backend schema:**
 1. **Workstation browser** — ANM/medical officer at the health center opens `http://localhost:8000` (or `http://<LAN-IP>:8000` from any workstation on the WiFi). FastAPI serves the built React UI at `/` and the pipeline endpoints at `/api/*`. One command (`python api.py`) starts everything.
+2. **Phone, health-center mode** — APK records and posts to workstation's `:8000` over WiFi. Workstation runs Whisper-Large ASR + E4B Q4_K_M with native function calling. The on-device path (mode 3 below) is text-in only and uses plain-JSON output instead of function calling — workstation mode is the higher-fidelity path of the two.
 3. **Phone, field mode** — APK offers two offline paths. **(a)** Record audio during home visits — chunks stored crash-safely in IndexedDB every 5 s. Queued recordings sync to the health-center workstation when back on WiFi for full Whisper + E4B processing. **(b)** Type a short Hindi note in the "on-device text → form" card; the full extraction + danger-sign pipeline runs on the phone via Gemma 4 E2B on Cactus SDK. No network required. Total on-device pipeline latency ≈ 5 min on Snapdragon 8+ Gen 1 — suited for "tap and wait" use, not real-time.
 **Crash-safe recording (Field Mode):** audio chunks are persisted to IndexedDB every 5 seconds during a recording. If the browser tab closes, the phone locks, or the app is killed mid-visit, the chunks survive — on reopen, an orange recovery banner offers to reassemble the partial recording.
 - Covers 0–999 Hindi number words + Whisper misspelling variants
 - Compound values (BP, weight, Hb), decimal points, fractions
+**Real-voice validation:** 4 recordings, 2 speakers, 3 of 4 role-play scripts
+- Speakers: 1 female (Bareilly reader, WhatsApp audio over phone mic) + 1 male (self-record, OnePlus 11R mic). Scripts covered: ANC preeclampsia, PNC Day-7 normal, child diarrhea. Script #1 ANC normal not yet recorded.
+- Five normalizer/detector bugs surfaced and fixed from this round (commit `d2d987d`):
+  - `बीबी → BP` — Whisper mishears BP as `बीबी` in fast speech; medical-terms normalizer now maps it.
+  - `parse_hindi_number` no longer over-merges adjacent digits — `दो तीन` stays `2 3` (was `5`), `एक सौ सौ` stays `100 100` (was `10000`).
+  - Visit-type detector dropped `बच्चे को` from child-health keywords — was misrouting the ANC preeclampsia warning `तुम्हारा और बच्चे को खतरा हो सकता है` to child_health.
+  - Preeclampsia diagnosis name (`प्रीक्लिम्सिया`) maps to the symptom triad when the LLM emits the diagnosis instead of the underlying symptoms.
+  - `सूज` verb stem added to swelling-face/hands danger keywords.
+- BP extraction confirmed on short clips with deliberate prosody around `बटा`. On long conversational-pacing clips the numeric value can drop while the danger framing (`BP बहुत हाई है`) survives — the danger panel still flags severe-hypertension on the qualitative phrase. Root-cause walkthrough in [FAILURES.md](FAILURES.md).
+- The patient-name misclassification observed on the child-diarrhea recording (LLM grabbed the child's name into the mother field) is sidestepped by the ASHA-entered metadata header — patient identifiers never depend on ASR.
+- Full-corpus real-audio evaluation (all 4 scripts × multiple speakers under field conditions) is the next eval lift.
 ## Fine-Tuning (Unsloth Track)
+The track deliverables are a reproducible LoRA pipeline on RTX 5070 Ti / Blackwell, a Windows GGUF-export workaround for Unsloth's Gemma 4 mmap failure, and an A/B against base. The fine-tuned model did not beat base on pass-rate; base ships in the live pipeline.
+**Pipeline (`scripts/train_unsloth.py`)** — one command, end-to-end: data prep → LoRA training → adapter saved → GGUF export → Ollama register → auto-eval vs base. Training set: 1,154 synthetic ASHA visit examples (981 train / 173 val) covering all 4 visit types and 458 positive danger sign cases. Hyperparameters: LR 5e-5, 1 epoch, LoRA r=16 / alpha=32, dropout 0.05.
+**Windows GGUF-export workaround (`scripts/export_merge.py`)** — Unsloth's bundled GGUF export path hits an mmap failure on Windows for Gemma 4 architectures. The workaround loads base + adapter via `transformers`, computes `delta_W = (B @ A) * (alpha / r)` per LoRA pair, merges, then runs `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`. Reproducible without WSL or a Linux dual-boot.
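The merge formula above can be checked on a toy example. The sketch below uses plain Python lists instead of tensors so it runs without any dependencies; it is not `scripts/export_merge.py`, and the helper names are hypothetical. It assumes the usual LoRA shapes: `W` is out×in, `B` is out×r, and `A` is r×in.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def merge_lora_pair(w, a, b, alpha, r):
    """Return W + (B @ A) * (alpha / r) for one LoRA pair."""
    scale = alpha / r
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy pair with rank 1. alpha / r = 2.0, the same ratio as r=16 / alpha=32.
W = [[1.0, 0.0],
     [0.0, 1.0]]
B = [[1.0],
     [2.0]]
A = [[3.0, 4.0]]
merged = merge_lora_pair(W, A, B, alpha=2, r=1)
```

Here `B @ A` is `[[3, 4], [6, 8]]`, so `merged` is `[[7.0, 8.0], [12.0, 17.0]]`. After every pair is merged this way the adapter is no longer needed, which is what lets the result go through the standard GGUF conversion as an ordinary dense checkpoint.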
+**A/B vs base** (full numbers in `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`):
+- **Pass rate:** base 15/15 vs fine-tune 14/15. The single fine-tune failure is on heavy Hinglish code-switching where the fine-tune over-refers (a safer failure mode, still a failure).
+- **Latency:** base 18.7s vs fine-tune 19.0s avg — effectively tied.
+- **Schema normalization:** fine-tune translates Hindi symptom phrases into English schema labels (`दस्त` → `Diarrhea`, `चक्कर आ रहे हैं` → `dizziness`). Base retains raw Hindi.
+- **Field coverage:** fine-tune recovers 2 visit-type-specific fields the base misses (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`); base recovers 11 fields the fine-tune leaves null.
+**Root cause of the over-referral failure.** The 1,154-example training distribution had Hinglish code-switching disproportionately co-occurring with danger cases, so the LoRA learned `English-in-Hindi-sentence` as a mild danger signal. Documented in [FAILURES.md](FAILURES.md). The base model is in the live Ollama path; the fine-tune is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer English schema-label normalization over raw Hindi. Run `ollama pull tusharbrisingr9802/sakhi` to verify the A/B locally.
 ## Frontend
 # Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
 # ── Health-center deployment (workstation, unified UI + API) ──
+pip install -r requirements-runtime.txt     # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
 ollama pull gemma4:e4b-it-q4_K_M             # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
 cd frontend && npm install && npm run build && cd ..
 python api.py
 **Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
+**Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the [3-minute demo video](https://youtu.be/n-u7J1lljUg), or follow Path 1 above to run locally — the live Space exists for convenience. Local Path 1 (or the test scripts in `scripts/`) is the evaluation path.
 ### How it's deployed
 - `Dockerfile` — two-stage build: Node 20 builds `frontend/dist`, CUDA 12.2 + cuDNN 8 runtime installs Ollama + Python deps and copies the dist in.
 - `entrypoint.sh` — starts the Ollama daemon, waits for its API, pulls `gemma4:e4b-it-q4_K_M` if absent, then `exec uvicorn api:app`.
+- `requirements-runtime.txt` — slim runtime deps (faster-whisper, fastapi, uvicorn, ollama). No Unsloth / PyTorch / bitsandbytes — they're training-side only. Used by both the HF Space Docker build and local Path 1 installs.
 - `.dockerignore` — keeps the build context small (no `models/`, no `data/recordings/`, no `frontend/node_modules`, no `cactus-src/`, etc.).
 - README YAML frontmatter — `sdk: docker`, `app_port: 7860`. HF Space picks this up on push.
 configs/schemas/                    # 5 JSON schemas (ANC, PNC, delivery, child health, danger signs)
 Dockerfile                          # HF Space build: Node frontend + CUDA runtime + Ollama
 entrypoint.sh                       # HF Space container init: ollama serve → pull model → uvicorn
+requirements-runtime.txt            # Slim runtime deps (no Unsloth/PyTorch — Ollama serves inference)
 frontend/
   src/App.jsx                       # React app — all 5 tabs, on-device text-in card + Cactus probe in Field Mode
   src/offlineQueue.js               # IndexedDB offline queue + crash-safe chunk persistence

RETRAIN_RESULTS.md CHANGED Viewed

@@ -11,11 +11,13 @@
 | gemma4:e4b-it-q4_K_M (base) | 15/15 |
 | sakhi:latest (fine-tuned) | 14/15 |
 ## Verdict
 **Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**
-The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is kept available in Ollama as `sakhi:latest` for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
 ## Diagnostics

 | gemma4:e4b-it-q4_K_M (base) | 15/15 |
 | sakhi:latest (fine-tuned) | 14/15 |
+**Reproduce:** `ollama pull tusharbrisingr9802/sakhi` to fetch the fine-tune; `ollama cp tusharbrisingr9802/sakhi:latest sakhi:latest` so the eval script picks it up under the local tag it expects. Then `python scripts/test_ollama_quality.py`.
 ## Verdict
 **Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**
+The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
 ## Diagnostics