Tushar9802 committed on
Commit
20f5235
·
1 Parent(s): d630c01

docs: tone scrub + YouTube demo link + Ollama-pull reproducer

- README + JUDGE_BRIEF + FAILURES: replace solo-dev "we" with neutral or
product-subject phrasing across 10 spots. Matches the voiceover's
"my submission" framing in the demo video. Side-benefit: removes a
few self-grading / hedge-language phrases.
- README + JUDGE_BRIEF: add 3-min demo video link
(https://youtu.be/n-u7J1lljUg) in 6 places — top-of-readme callout,
inline mentions of the on-device demo, Public Demo section.
- RETRAIN_RESULTS: document the `ollama pull` + `ollama cp` two-step
needed to reproduce the A/B against tusharbrisingr9802/sakhi locally.
- FIELD_COVERAGE_DIFF: factual restatement of the base-vs-finetune
trade-off (drop "safer, more consistent alternative" value-judgment).

Files changed (5)
  1. FAILURES.md +24 -8
  2. FIELD_COVERAGE_DIFF.md +1 -1
  3. JUDGE_BRIEF.md +19 -11
  4. README.md +59 -23
  5. RETRAIN_RESULTS.md +3 -1
FAILURES.md CHANGED
@@ -12,7 +12,7 @@ Every test failure in Sakhi's eval suite is recorded here with a root-cause diag
12
 
13
  ### Failure pattern: BP value drift through TTS → ASR
14
 
15
- gTTS (Google Text-to-Speech, the synthesizer we use for test audio generation — see `scripts/generate_test_audio.py`) is a concatenative TTS engine. It is fast and free, but does not produce the prosody of natural Hindi speech — it tends to produce staccato numeric readings with limited inter-word coarticulation. When a number sequence like `"एक सौ साठ बटा एक सौ दस"` (160/105 in the BP format ASHA workers read aloud) runs through gTTS, the pronunciation of `"बटा"` (the Hindi separator equivalent to the English "over" in "160 over 105") can be produced with a sibilance or softening that Whisper-Large-V2 Hindi mishears.
16
 
17
  **Observed failure pattern** (from development iteration logs, before the current passing-13/15 baseline was pinned):
18
 
@@ -29,12 +29,12 @@ gTTS (Google Text-to-Speech, the synthesizer we use for test audio generation
29
 
30
  ### Reproducing these specific failures
31
 
32
- `python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous. (Re-running the suite on a fresh Ollama + Whisper install on 2026-04-19 will produce the definitive current list — will be pinned in a follow-up commit after the Bareilly recordings, alongside the real-audio-path baseline.)
33
 
34
  ### Planned mitigation
35
 
36
- - Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, we expect `test_pipeline_e2e.py` pass rate to rise, not fall — real speech is cleaner than gTTS for Whisper.
37
- - Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will let us re-tune the recall/precision tradeoff.
38
 
39
  ---
40
 
@@ -49,7 +49,7 @@ The LoRA fine-tune (1,154 synthetic examples, 981 train / 173 val) was trained o
49
 
50
  ### Disposition
51
 
52
- Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama path for its zero-shot pass-rate edge. The fine-tune remains available as `sakhi:latest` in Ollama for deployments that prefer the English-schema-label normalization the fine-tune also produces. We did not further tune — the finding is informative (synthetic-data distribution bias is a known LoRA pitfall), not a ship-blocker.
53
 
54
  ---
55
 
@@ -63,6 +63,22 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
63
 
64
  ---
65
 
 
66
  ## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
67
 
68
  **Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
@@ -75,7 +91,7 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
75
 
76
  ### Disposition
77
 
78
- One-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) is held back close to deadline. The regression surface is the full form schema across all four visit types and we don't have time to re-run the eval suite against a tightened schema with confidence. The safety-critical output (danger panel + referral decision) is unaffected, so the conservative choice is documented disclosure now, schema cleanup post-competition.
79
 
80
  ---
81
 
@@ -87,7 +103,7 @@ The 15/15 pass rate is computed against per-case `hallucination_traps` lists —
87
 
88
  ### Disposition
89
 
90
- The rubric is honest about what it tests — `hallucination_traps` is the literal list of fields each test asserts null for, and the test source is reproducible. But "15/15 tests pass" rests on a narrow per-case rubric, not a whole-schema null-everywhere-not-mentioned check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above before deploy. Post-competition the rubric will be widened; the current ratio is reported as-is.
91
 
92
  ---
93
 
@@ -103,4 +119,4 @@ Conversational pacing on the long clip. BP `एक सौ साठ बटा
103
 
104
  ### Disposition
105
 
106
- Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so the BP path is exercised end-to-end on the most-played sample. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.
 
12
 
13
  ### Failure pattern: BP value drift through TTS → ASR
14
 
15
+ gTTS (Google Text-to-Speech, the synthesizer used for test audio generation — see `scripts/generate_test_audio.py`) is a concatenative TTS engine. It is fast and free, but does not produce the prosody of natural Hindi speech — it tends to produce staccato numeric readings with limited inter-word coarticulation. When a number sequence like `"एक सौ साठ बटा एक सौ दस"` (160/105 in the BP format ASHA workers read aloud) runs through gTTS, the pronunciation of `"बटा"` (the Hindi separator equivalent to the English "over" in "160 over 105") can be produced with a sibilance or softening that Whisper-Large-V2 Hindi mishears.
16
 
17
  **Observed failure pattern** (from development iteration logs, before the current passing-13/15 baseline was pinned):
18
 
 
29
 
30
  ### Reproducing these specific failures
31
 
32
+ `python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous.
33
 
34
  ### Planned mitigation
35
 
36
+ - Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, the `test_pipeline_e2e.py` pass rate should rise, not fall — real speech is cleaner than gTTS for Whisper.
37
+ - Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will allow re-tuning the recall/precision tradeoff.
38
 
39
  ---
40
 
 
49
 
50
  ### Disposition
51
 
52
+ Documented in `RETRAIN_RESULTS.md`. The base model is in the live Ollama path. The fine-tune remains available on the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer English schema-label normalization. Further tuning was not pursued — the failure mode (synthetic-data distribution bias) is a known LoRA pitfall and the base already passes 15/15.
53
 
54
  ---
55
 
 
63
 
64
  ---
65
 
66
+ ## ANC form: `patient.age` slot misclassification on on-device E2B path
67
+
68
+ **Harness:** Field Mode on-device text → form, observed during slot 3 video recording on 2026-05-17.
69
+
70
+ **Observed output:** With the `Load ANC example` ANC preeclampsia transcript fed through Gemma 4 E2B INT4 on Cactus SDK, `patient.age` is populated with `8`. The source is the speaker's response to the ASHA's gestational-age question — `लगभग 8 महीने` ("about 8 months [pregnant]") — which the on-device model is grounding to the wrong field. The transcript carries no explicit patient age in years.
71
+
72
+ ### Root cause
73
+
74
+ Same family as the `pregnancy.previous_complications` walkthrough below: the model is filling a slot from a number present in the input without grounding it in the slot's semantics. On the E2B INT4 path the surface is wider because the null-filled instance template prompt does not carry per-field descriptions about year-vs-month-vs-week semantics; the E4B Ollama path consumes the JSON Schema which (for the fields that have descriptions) gives the model more discrimination signal.
75
+
76
+ ### Disposition
77
+
78
+ Not a safety-critical issue — no clinical decision in the pipeline depends on `patient.age`. The architectural mitigation is already in place: the ASHA-entered metadata header (typed at intake, before any conversation is recorded or processed) supplies patient demographics directly via `apply_metadata`, which merges them into the form envelope and supersedes any conversational extraction. The misclassification only surfaces when demographics are absent from the input, which is the demo / on-device-test scenario, not the deployed ASHA workflow. A schema-side fix would add explicit field descriptions to the on-device template (`"age": "patient's age in YEARS, not gestational months"`); not landed in this submission.
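The "header supersedes conversational extraction" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `apply_metadata`: the function name `merge_header_into_anc_form`, the header keys, and the sample values are all assumptions made for the example.

```python
from copy import deepcopy

# Hypothetical sketch: NOT the project's apply_metadata implementation.
# Assumes the ANC form keeps demographics under form["patient"].
HEADER_FIELDS = ("name", "age", "mobile")


def merge_header_into_anc_form(form: dict, header: dict) -> dict:
    """Return a copy of `form` in which typed header values win over extracted ones."""
    merged = deepcopy(form)
    patient = merged.get("patient") or {}
    merged["patient"] = patient
    for field in HEADER_FIELDS:
        value = header.get(field)
        if value is not None:
            # A typed header value replaces whatever the model extracted.
            patient[field] = value
    return merged


# The model grounded "about 8 months" to patient.age; the typed header corrects it.
extracted = {"patient": {"name": None, "age": 8, "mobile": None}}
header = {"name": "Sunita", "age": 24, "mobile": None}  # illustrative values only
merged = merge_header_into_anc_form(extracted, header)
print(merged["patient"])
```

With an empty header the extracted `age` of `8` would pass through unchanged, which matches the demo / on-device-test scenario described above.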
79
+
80
+ ---
81
+
82
  ## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
83
 
84
  **Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
 
91
 
92
  ### Disposition
93
 
94
+ The one-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) touches the full form schema across all four visit types and would require re-running the 15-case eval to validate no regression. That re-run did not land before this submission. The safety-critical output (danger panel + referral decision) is unaffected; the misclassification is in a non-safety field.
95
 
96
  ---
97
 
 
103
 
104
  ### Disposition
105
 
106
+ `hallucination_traps` is the literal list of fields each test asserts null for; the test source is `scripts/test_ollama_quality.py:470-473`. "15/15 tests pass" is against this per-case rubric, not a whole-schema null-everywhere check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above. The wider rubric is not landed here.
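One way the wider rubric could look is sketched below: walk every leaf of the extracted form and report any populated field that the test case did not expect. The helper names (`iter_leaves`, `unexpected_fields`) and the sample form are hypothetical and are not taken from `scripts/test_ollama_quality.py`.

```python
# Hypothetical sketch of a whole-schema "null unless mentioned" check.

def iter_leaves(node: dict, prefix: str = ""):
    """Yield (dotted_path, value) for every non-dict leaf in a nested form."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from iter_leaves(value, path)
        else:
            yield path, value


def unexpected_fields(form: dict, expected_paths: set) -> list:
    """Return sorted paths that are populated but not expected for this transcript."""
    empty_values = (None, "", [], {})
    return sorted(
        path
        for path, value in iter_leaves(form)
        if value not in empty_values and path not in expected_paths
    )


# Illustrative form: only the BP reading was actually spoken in the transcript.
form = {
    "patient": {"name": None, "age": 8},
    "vitals": {"bp": "155/100"},
    "pregnancy": {"previous_complications": []},
}
violations = unexpected_fields(form, expected_paths={"vitals.bp"})
print(violations)
```

A per-case test would then assert that `violations` is empty instead of checking only a hand-picked trap list.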
107
 
108
  ---
109
 
 
119
 
120
  ### Disposition
121
 
122
+ The mitigation in this submission: the 20 s clip is the manifest default, so the most-played sample exercises the full BP path end-to-end. The 52 s clip remains in the dropdown as the longer-conversation case; on that clip the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped. A custom Hindi-medical Whisper fine-tune would address the root cause; not in this submission.
FIELD_COVERAGE_DIFF.md CHANGED
@@ -2,7 +2,7 @@
2
 
3
  Date: 2026-04-17 09:53
4
 
5
- The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). While the base model extracted more raw fields on average (11 vs 2 unique extractions), the fine-tune produced more consistent schema-normalized values — translating Hindi symptom phrases to English labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") and recovered two visit-type-specific fields the base model missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base model was kept in production for the single-test accuracy edge; the fine-tune demonstrates the training pipeline can produce a safer, more consistent alternative.
6
 
7
  ## Summary
8
 
 
2
 
3
  Date: 2026-04-17 09:53
4
 
5
+ The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). The base model extracted more raw fields on average (11 vs 2 unique extractions). The fine-tune translates Hindi symptom phrases into English schema labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") and recovers two visit-type-specific fields the base model misses (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base ships in the live pipeline for the single-test accuracy edge (15/15 vs 14/15); the fine-tune is registered as a schema-normalization alternative.
6
 
7
  ## Summary
8
 
JUDGE_BRIEF.md CHANGED
@@ -1,6 +1,6 @@
1
  # Sakhi (सखी) — Judge Brief
2
 
3
- *One-page version of the README. Full detail in [README.md](README.md).*
4
 
5
  ## The problem, in two sentences
6
 
@@ -22,31 +22,35 @@ Sakhi converts Hindi home-visit conversations (voice on a shared health-center w
22
  | Workstation pipeline latency (audio → form) | ~15–25 s | RTX 5070 Ti, warm Ollama |
23
  | On-device pipeline latency (Hindi text → form) | ~5 min | OnePlus 11R / Snapdragon 8+ Gen 1, Gemma 4 E2B INT4 on Cactus |
24
 
25
- The 5-minute on-device figure is tested against the `ms2_0425` ANC preeclampsia training transcript: the model correctly extracts BP 150/95, TT complete, IFA = yes, verbatim Hindi symptoms, and flags `high_bp_with_symptoms` (urgent_care) with the Hindi quote `"आपका BP 150/95 रहा है"` and a "Refer Immediately" decision. A 5-minute wait is a net time save against the 15–20 min baseline of hand-filling paper forms plus travel to the PHC.
26
 
27
  ## Why this is submitted to four tracks
28
 
29
  | Track | What Sakhi brings |
30
  |---|---|
31
- | **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a real ASHA workflow (health-center mode + field mode with later sync) — not a research demo. |
32
  | **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
33
- | **Unsloth** | Honest reproducible LoRA pipeline in `scripts/train_unsloth.py`: data prep → LoRA train → GGUF export → Ollama registration → A/B eval vs base. Published artifacts: `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`. Fine-tune didn't beat base on pass-rate, so we shipped the base and documented the fine-tune's specific wins (English schema-label normalization, visit-type-specific field recovery) rather than inflate the narrative. |
34
- | **Cactus** | Genuine on-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. We investigated on-device voice-in via `cactusTranscribe` + Gemma; documented in the README why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per arXiv 2512.10967; shipping it would cause clinical harm). |
35
 
36
  ## Reproduce in under 10 minutes
37
 
 
 
38
  **Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
39
 
 
 
40
  **Health-center mode (workstation only):**
41
  ```bash
42
- pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M
43
  cd frontend && npm install && npm run build && cd ..
44
  python api.py # browser: http://localhost:8000
45
  ```
46
 
47
  **Field mode (phone + Cactus):**
48
 
49
- > **We do not redistribute the Cactus-Compute model** — it is gated under a custom Cactus license. Reviewers verifying the Cactus track follow the documented path below. Most reviewers can verify the engineering claims via the workstation path above without ever installing on-device; the 3-minute demo video shows the full on-device flow on a real phone.
50
 
51
  ```bash
52
  # Build + install the APK once. After this the model install is in-app, no adb.
@@ -70,13 +74,17 @@ cd frontend && npm run build && npx cap sync android && \
70
 
71
  A sample Hindi transcript ready to paste is at `data/processed/train.jsonl` (line 1 = ANC preeclampsia case) or in the main README.
72
 
73
- ## What we'd do with $10K and six more months
 
 
 
 
74
 
75
- - Partner with an ASHA training institute (Santosh Medical College / IIT Madras Bhashini) to collect 100+ hours of *real* ASHA home-visit audio; the current evaluation is entirely on synthetic TTS audio + LLM-generated conversations.
76
- - Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path that we deliberately did not ship in this submission.
77
  - Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
78
  - Pilot with 10–20 ASHA workers in one block (Muradnagar / Loni-adjacent) with before/after time-and-accuracy measurement.
79
 
80
  ## Contact
81
 
82
- Tushar J — tushar.j@cognavi.com — GitHub: [Tushar-9802/Sakhi](https://github.com/Tushar-9802/Sakhi)
 
1
  # Sakhi (सखी) — Judge Brief
2
 
3
+ *One-page version of the README. Full detail in [README.md](README.md). 3-min demo video: [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg).*
4
 
5
  ## The problem, in two sentences
6
 
 
22
  | Workstation pipeline latency (audio → form) | ~15–25 s | RTX 5070 Ti, warm Ollama |
23
  | On-device pipeline latency (Hindi text → form) | ~5 min | OnePlus 11R / Snapdragon 8+ Gen 1, Gemma 4 E2B INT4 on Cactus |
24
 
25
+ The 5-minute on-device figure is reproducible via the **Load ANC example** button in Field Mode (Field Mode tab → On-device text → form card → "Load ANC example"). On OnePlus 11R / Snapdragon 8+ Gen 1, the on-device pipeline extracts BP 155/100, verbatim Hindi symptoms (`सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन`), Counseling `PHC जाने की सलाह`, and flags three danger signs — `high_bp_with_symptoms`, `swelling_face`, `swelling_legs` — all with verbatim Hindi `utterance_evidence` and `category: immediate_referral`. Total 320.7 s end-to-end (Form 231.8 s + Danger 88.9 s + normalize + detect). For comparison: the paper-form baseline is 15–20 min of hand-filling plus travel to the PHC.
26
 
27
  ## Why this is submitted to four tracks
28
 
29
  | Track | What Sakhi brings |
30
  |---|---|
31
+ | **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a workflow matched to how ASHA workers actually operate (health-center mode + field mode with later sync). |
32
  | **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
33
+ | **Unsloth** | One-command LoRA pipeline (`scripts/train_unsloth.py`): data prep → train → GGUF export → Ollama register → A/B eval vs base. Includes a Windows GGUF-export workaround (`scripts/export_merge.py`) for Unsloth's Gemma 4 mmap failure — manual delta-merge + `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`, no WSL needed. Fine-tune pass rate 14/15 vs base 15/15; base is in the live pipeline; fine-tune is published to Ollama as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) (`ollama pull tusharbrisingr9802/sakhi` to verify A/B locally) for deployments preferring English schema-label normalization (`दस्त` → `Diarrhea`) over raw Hindi. Field-coverage diff in `FIELD_COVERAGE_DIFF.md`. |
34
+ | **Cactus** | On-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. On-device voice-in via `cactusTranscribe` + Gemma was investigated; the README documents why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per [Kumar et al. 2025](https://arxiv.org/abs/2512.10967) and the Vistaar / Gramvaani benchmarks, with deletion-dominant errors on numbers — not in this submission). |
35
 
36
  ## Reproduce in under 10 minutes
37
 
38
+ **3-min demo video:** [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg) — workstation voice-to-form path, on-device Hindi text-to-form on a phone in airplane mode, four tracks claimed.
39
+
40
  **Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
41
 
42
+ **Pull the Unsloth fine-tune:** [`ollama pull tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi). The LoRA-fine-tuned Gemma 4 E4B is on the Ollama registry. Run `python scripts/test_ollama_quality.py` against base + fine-tune to reproduce the 15/15 vs 14/15 A/B locally.
43
+
44
  **Health-center mode (workstation only):**
45
  ```bash
46
+ pip install -r requirements-runtime.txt && ollama pull gemma4:e4b-it-q4_K_M
47
  cd frontend && npm install && npm run build && cd ..
48
  python api.py # browser: http://localhost:8000
49
  ```
50
 
51
  **Field mode (phone + Cactus):**
52
 
53
+ > **Sakhi does not redistribute the Cactus-Compute model** — it is gated under a custom Cactus license. Reviewers verifying the Cactus track follow the documented path below. Most reviewers can verify the engineering claims via the workstation path above without ever installing on-device; the [3-minute demo video](https://youtu.be/n-u7J1lljUg) shows the full on-device flow on a real phone.
54
 
55
  ```bash
56
  # Build + install the APK once. After this the model install is in-app, no adb.
 
74
 
75
  A sample Hindi transcript ready to paste is at `data/processed/train.jsonl` (line 1 = ANC preeclampsia case) or in the main README.
76
 
77
+ ## Privacy & data handling
78
+
79
+ Audio and transcripts never leave the institution that owns them. Workstation mode keeps everything on the PHC's local network (Whisper + Ollama on local GPU; no OpenAI / Anthropic / Google API). Field mode runs on-device via Cactus SDK — airplane mode does not break it. Patient demographics enter as a typed header rather than being extracted from audio, so identifiers are minimised at the boundary. This posture is compatible with India's Digital Personal Data Protection Act, 2023 — data fiduciary stays within the institution, no cross-border transfer, purpose limitation enforced by architecture rather than by policy.
80
+
81
+ ## What's next with $10K and six more months
82
 
83
+ - Partner with an ASHA training institute (Santosh Medical College / IIT Madras Bhashini) to collect 100+ hours of *real* ASHA home-visit audio under field conditions. Current evaluation covers 4 real-voice recordings (2 speakers — 1 female Bareilly reader + 1 male self-record — across 3 of 4 role-play scripts) plus the 15-case synthetic test suite; full-corpus rural-female accent + field-noise validation is the next step.
84
+ - Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path not shipped here.
85
  - Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
86
  - Pilot with 10–20 ASHA workers in one block (Muradnagar / Loni-adjacent) with before/after time-and-accuracy measurement.
87
 
88
  ## Contact
89
 
90
+ Tushar J — tusharbrisingr9802@gmail.com — GitHub: [Tushar-9802/Sakhi](https://github.com/Tushar-9802/Sakhi)
README.md CHANGED
@@ -17,6 +17,12 @@ Offline-first tool that converts Hindi home visit conversations into structured
17
  **Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
18
  **Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
19
 
 
20
  ![Workstation demo: Hindi audio → form + danger signs (30 s)](workstation_demo.gif)
21
 
22
  ## Problem
@@ -27,10 +33,10 @@ India's ASHA workers conduct 50M+ maternal/child health home visits per year acr
27
 
28
  Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:
29
 
30
- - **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. Fast (~15 s) and accurate. This is the primary voice-to-form path.
31
  - **Field mode (phone)** has two offline sub-paths:
32
- **Record now, sync later** — ASHA records audio during home visits; chunks persist to IndexedDB every 5 s (crash-safe). When the phone is back on health-center WiFi, the queued recordings post to the workstation for full Whisper + E4B processing. This is the honest voice path: no on-device ASR attempted.
33
- - **Type a note for instant on-device extraction** — for when the ASHA wants structured output *right now* without network. A short Hindi note in a textarea runs through the full pipeline (normalize → detect visit type → extract form → detect danger signs) entirely on-device via Gemma 4 E2B INT4 on the Cactus SDK. Same schema, same validation as the workstation path. Pipeline latency is ≈ 5 min on a Snapdragon 8+ Gen 1 phone. This is acceptable against the clinical baseline: the status quo is an ASHA hand-filling the same form from memory (15–20 min), carrying it to the PHC (another walk), then waiting for a clinician to read and act on it (hours to days). A 5-minute wait for on-device structured extraction + flagged danger signs is a net time save, not a UX compromise — and it works with zero network, zero shared infrastructure.
34
 
35
  ```
36
  Workstation path:
@@ -47,7 +53,7 @@ On-device path (text-in):
47
 
48
  ### Why not voice-to-form on-device too?
49
 
50
- We looked into it; the honest answer is it doesn't work well enough yet for clinical Hindi. Cactus's transcribe API supports Whisper / Moonshine / Parakeet only (Gemma 4's audio conformer is for voice understanding in multimodal chat, not dedicated ASR). Cactus ships multilingual Whisper INT4 weights, but no Hindi-specific checkpoint — and published evidence (arXiv 2512.10967, Vistaar/Gramvaani) shows off-the-shelf Whisper on spontaneous rural Hindi hits 27% WER at best and 70%+ on clinical content, with a deletion-dominant error profile that silently drops numbers and symptoms. For an ASHA decision-support tool where a missed BP reading is a clinical harm, we chose to *not* ship an unreliable on-device voice path. Record-and-sync with Whisper-Large on the workstation keeps voice-in honest; the on-device LLM does what Gemma 4 is actually good at — Hindi text understanding.
51
 
52
  ## Function Calling
53
 
@@ -72,7 +78,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
72
  | Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
73
  | Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
74
 
75
- **Patient demographics enter as a header, not from the audio.** Every clinical EMR works this way: identifiers typed once at intake, the conversation handled separately. The ASHA fills name / age / sex / mobile / ASHA-ID / visit-date in the header above the record button, and the LLM only extracts what was *said* during the visit — symptoms, vitals, counselling, next-visit date. This avoids a failure mode we hit in real-voice testing: Whisper-Hindi sometimes mishears patient names as different Hindi words, and a downstream LLM has no prior on what the name should be. Same merge logic runs on all three paths — `apply_metadata` in `app.py` for workstation audio and text, mirrored as a pure JS function in `pipeline.js` for on-device Cactus extraction — so server and phone produce identical envelopes for the same input. ANC fills `patient.{name, age, mobile}`; child_health fills `child.{name, age_months, sex}` with year→month conversion; PNC and delivery have no patient sub-object in their form, so the metadata travels in the response envelope only. `asha_id` is sticky across sessions via `localStorage`. For Field-mode recordings, the header is captured at record-start so later edits don't pollute earlier queue entries.
76
 
77
  **Hindi number normalization:** Algorithmic parser covering all 0–999 Hindi number words with Whisper misspelling variants. Handles compound medical values: "एक सौ दस बटा सत्तर" → "110/70", "ग्यारह दशमलव पाँच" → "11.5", "तीन किलो दो सौ ग्राम" → "3.2 kg".
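To make the compound-value idea concrete, here is a deliberately tiny sketch of the BP case only. It is not the project's parser: the vocabulary table covers just a handful of words, there is no handling of Whisper misspelling variants, and the names `NUMBER_WORDS`, `words_to_int`, and `normalize_bp` are invented for this example.

```python
# Hypothetical, heavily reduced vocabulary; the real parser is described as covering 0-999.
NUMBER_WORDS = {"एक": 1, "दो": 2, "तीन": 3, "दस": 10, "साठ": 60, "सत्तर": 70}
HUNDRED = "सौ"     # multiplier word for 100
SEPARATOR = "बटा"  # spoken separator, like "over" in "110 over 70"


def words_to_int(words: list) -> int:
    """Convert a short run of Hindi number words (below 1000) to an integer."""
    if not words:
        raise ValueError("empty number phrase")
    total, current = 0, 0
    for word in words:
        if word == HUNDRED:
            # "एक सौ" is 1 * 100; a bare "सौ" counts as 100.
            total += (current or 1) * 100
            current = 0
        elif word in NUMBER_WORDS:
            current += NUMBER_WORDS[word]
        else:
            raise ValueError(f"unknown number word: {word}")
    return total + current


def normalize_bp(phrase: str) -> str:
    """Turn '<systolic> बटा <diastolic>' into 'systolic/diastolic'."""
    left, sep, right = phrase.partition(f" {SEPARATOR} ")
    if not sep:
        raise ValueError("no separator found")
    return f"{words_to_int(left.split())}/{words_to_int(right.split())}"


bp = normalize_bp("एक सौ दस बटा सत्तर")
print(bp)
```

Raising on unknown words rather than guessing mirrors the conservative stance the docs take on separator handling; a production parser would need the full vocabulary and the misspelling variants on top of this.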
78
 
@@ -88,7 +94,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
88
 
89
  Two reproduction paths. Pick by available hardware.
90
 
91
- **Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt`: inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
92
 
93
  **Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
94
  1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
@@ -98,7 +104,7 @@ Two reproduction paths. Pick by available hardware.
98
  5. Open Sakhi → Field Mode → On-Device Probe → **Import model (.zip)** → pick the zip from the system file picker. Wait ~3-5 minutes for extraction (progress bar + log card show live file count and MB written). Re-imports auto-evict the previous model — no manual cleanup, no risk of 12 GB accumulation.
99
  6. **Load Model** → **Test Hindi** to confirm inference works.
100
 
101
- **We do not redistribute the Cactus model.** It is gated under a custom Cactus-Compute license; hosting it on a public Drive link would violate that gating. The in-app SAF import flow exists precisely so reviewers who DO want to reproduce on-device can do so without us needing to host the weights ourselves and without needing developer mode or adb on their phone. The 3-minute demo video in the submission shows the full flow on a real phone, so the on-device claim can be verified without anyone needing to install the model themselves.
102
 
103
  ## Safety & Limitations
104
 
@@ -108,11 +114,29 @@ Sakhi is a decision-support tool, not a diagnostic system. All outputs require h
108
 
109
  **What it can miss:** Danger signs not discussed in conversation, subtle clinical findings that require physical examination, conditions that present atypically. The system cannot observe — it can only reason about what was spoken.
110
 
111
- **False positive controls:** The 6-layer anti-hallucination pipeline aggressively filters ungrounded danger signs. On the test suite, normal visits produce zero false alarms.
112
 
113
  **Human-in-the-loop:** Every referral decision is presented to the ANM/medical officer at the health center for review before action. The tool accelerates information flow from field to facility — it does not replace clinical judgment.
114
 
115
- **Known gaps:** All current test data is synthetic (TTS-generated Hindi audio, LLM-generated training conversations). Real-world ASHA conversations will be noisier, more fragmented, and contain regional dialect variation not yet tested.
116
 
117
  ## Deployment Model
118
 
@@ -134,7 +158,7 @@ Health Center (workstation, RTX GPU) Field (Android phone)
134
  **Three access points, same backend schema:**
135
 
136
  1. **Workstation browser** — ANM/medical officer at the health center opens `http://localhost:8000` (or `http://<LAN-IP>:8000` from any workstation on the WiFi). FastAPI serves the built React UI at `/` and the pipeline endpoints at `/api/*`. One command (`python api.py`) starts everything.
137
- 2. **Phone, health-center mode** — APK records and posts to workstation's `:8000` over WiFi. Workstation does Whisper + E4B (fast, accurate). Best extraction quality available.
138
  3. **Phone, field mode** — APK offers two offline paths. **(a)** Record audio during home visits — chunks stored crash-safely in IndexedDB every 5 s. Queued recordings sync to the health-center workstation when back on WiFi for full Whisper + E4B processing. **(b)** Type a short Hindi note in the "on-device text → form" card; the full extraction + danger-sign pipeline runs on the phone via Gemma 4 E2B on Cactus SDK. No network required. Total on-device pipeline latency ≈ 5 min on Snapdragon 8+ Gen 1 — suited for "tap and wait" use, not real-time.
139
 
140
  **Crash-safe recording (Field Mode):** audio chunks are persisted to IndexedDB every 5 seconds during a recording. If the browser tab closes, the phone locks, or the app is killed mid-visit, the chunks survive — on reopen, an orange recovery banner offers to reassemble the partial recording.
@@ -168,21 +192,33 @@ Health Center (workstation, RTX GPU) Field (Android phone)
168
  - Covers 0–999 Hindi number words + Whisper misspelling variants
169
  - Compound values (BP, weight, Hb), decimal points, fractions
170
 
171
  ## Fine-Tuning (Unsloth Track)
172
 
173
- We fine-tuned Gemma 4 E4B via Unsloth LoRA on 1,154 synthetic ASHA visit examples (981 train / 173 val) covering all 4 visit types and 458 positive danger sign cases. The resulting adapter is exported as a Q4_K_M GGUF and registered in Ollama as `sakhi:latest`.
174
 
175
- **Configuration:** LR 5e-5, 1 epoch, LoRA r=16/alpha=32, dropout 0.05 — conservative hyperparameters to avoid overfitting on a small dataset.
176
 
177
- **A/B comparison vs base** (see `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`):
178
- - **Pass rate:** base 15/15 vs fine-tune 14/15 (single fail on heavy Hinglish code-switch → over-referral, a safer failure mode)
179
- - **Latency:** base 18.7s vs fine-tune 19.0s avg — effectively tied
180
- - **Schema normalization:** the fine-tune consistently translates Hindi symptom phrases into English schema labels ("दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness"), making downstream filtering easier. Base retains raw Hindi.
181
- - **Unique field extractions:** fine-tune recovered 2 visit-type-specific fields the base missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`); base recovered 11 fields the fine-tune left null.
182
 
183
- **Production choice:** we kept the base model in the live pipeline for its single-test accuracy edge. The fine-tune demonstrates the reproducible training pipeline and ships as an alternative for deployments that prefer consistent English schema values over raw transcription.
184
 
185
- **Export pipeline (Windows):** the training script (`scripts/train_unsloth.py`) handles the full flow: data prep, LoRA training, auto-eval. For GGUF export we use a manual path (`scripts/export_merge.py`) that bypasses Unsloth's Windows mmap issues: load base + adapter via transformers, compute `delta_W = (B @ A) * (alpha/r)` per pair, then `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`.
186
 
187
  ## Frontend
188
 
@@ -206,7 +242,7 @@ One React + Vite codebase, shipped as both a browser UI (served by FastAPI at `/
206
  # Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
207
 
208
  # ── Health-center deployment (workstation, unified UI + API) ──
209
- pip install -r requirements-hf.txt # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
210
  ollama pull gemma4:e4b-it-q4_K_M # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
211
  cd frontend && npm install && npm run build && cd ..
212
  python api.py
@@ -266,7 +302,7 @@ python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
266
 
267
  **Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
268
 
269
- **Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the 3-minute demo video, or follow Path 1 above to run locally — the live Space exists for convenience, not as the rigorous evaluation path.
270
 
271
  ### How it's deployed
272
 
@@ -274,7 +310,7 @@ python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
274
 
275
  - `Dockerfile` — two-stage build: Node 20 builds `frontend/dist`, CUDA 12.2 + cuDNN 8 runtime installs Ollama + Python deps and copies the dist in.
276
  - `entrypoint.sh` — starts the Ollama daemon, waits for its API, pulls `gemma4:e4b-it-q4_K_M` if absent, then `exec uvicorn api:app`.
277
- - `requirements-hf.txt` — slim runtime deps (faster-whisper, fastapi, uvicorn, ollama). No Unsloth / PyTorch / bitsandbytes — they're training-side only.
278
  - `.dockerignore` — keeps the build context small (no `models/`, no `data/recordings/`, no `frontend/node_modules`, no `cactus-src/`, etc.).
279
  - README YAML frontmatter — `sdk: docker`, `app_port: 7860`. HF Space picks this up on push.
280
 
@@ -313,7 +349,7 @@ src/hindi_normalize.py # Hindi number/medical term normalization (1
313
  configs/schemas/ # 5 JSON schemas (ANC, PNC, delivery, child health, danger signs)
314
  Dockerfile # HF Space build: Node frontend + CUDA runtime + Ollama
315
  entrypoint.sh # HF Space container init: ollama serve → pull model → uvicorn
316
- requirements-hf.txt # Slim runtime deps (no Unsloth/PyTorch — Ollama serves inference)
317
  frontend/
318
  src/App.jsx # React app — all 5 tabs, on-device text-in card + Cactus probe in Field Mode
319
  src/offlineQueue.js # IndexedDB offline queue + crash-safe chunk persistence
 
17
  **Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
18
  **Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
19
 
20
+ **▶ Watch the 3-min demo:** [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg) — full submission video: problem framing, workstation voice-to-form path, on-device Hindi text-to-form on a phone in airplane mode, four tracks claimed.
21
+
22
+ **▶ Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — the Path 1 workstation stack (FastAPI + Ollama + Whisper) running on an HF Space T4. Same UI, same endpoints; no install needed. ~5 min cold-boot wait after idle — see [Public Demo](#public-demo--huggingface-space) for details.
23
+
24
+ **▶ Pull the Unsloth fine-tune:** [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) on the Ollama registry — `ollama pull tusharbrisingr9802/sakhi` fetches the LoRA-fine-tuned Gemma 4 E4B behind the A/B numbers below. The base model (`gemma4:e4b-it-q4_K_M`) is what ships in the live pipeline; this is the side-by-side comparison artifact for the Unsloth track.
25
+
26
  ![Workstation demo: Hindi audio → form + danger signs (30 s)](workstation_demo.gif)
27
 
28
  ## Problem
 
33
 
34
  Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:
35
 
36
+ - **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. End-to-end latency ~15–25 s on an RTX 5070 Ti or T4. This is the primary voice-to-form path.
37
  - **Field mode (phone)** has two offline sub-paths:
38
+ - **Record now, sync later:** ASHA records audio during home visits; chunks persist to IndexedDB every 5 s (crash-safe). When the phone is back on health-center WiFi, the queued recordings post to the workstation for full Whisper + E4B processing. On-device ASR is not attempted; see the section below for why.
39
+ - **Type a note for instant on-device extraction** — for when the ASHA wants structured output *right now* without network. A short Hindi note in a textarea runs through the full pipeline (normalize → detect visit type → extract form → detect danger signs) entirely on-device via Gemma 4 E2B INT4 on the Cactus SDK. Same schema, same validation as the workstation path. Pipeline latency is ≈ 5 min on a Snapdragon 8+ Gen 1 phone. For comparison: the paper-form baseline is 15–20 min of hand-filling from memory, then a walk to the PHC, then clinician review hours-to-days later. The on-device path works with zero network and zero shared infrastructure.
40
 
41
  ```
42
  Workstation path:
 
53
 
54
  ### Why not voice-to-form on-device too?
55
 
56
+ The on-device voice path does not work well enough yet for clinical Hindi. Cactus's transcribe API supports Whisper / Moonshine / Parakeet only (Gemma 4's audio conformer is for voice understanding in multimodal chat, not dedicated ASR). Cactus ships multilingual Whisper INT4 weights, but no Hindi-specific checkpoint — and published benchmarks ([Kumar et al. 2025, *ASR Under the Stethoscope*](https://arxiv.org/abs/2512.10967); Vistaar / Gramvaani corpus evaluations) show off-the-shelf Whisper on spontaneous rural Hindi hits 27% WER at best and 70%+ on clinical content, with substantial variability tied to speaker role / gender / code-mixing and a deletion-dominant error profile that silently drops numbers and symptoms. For an ASHA decision-support tool where a missed BP reading is a clinical harm, an on-device voice path is not in this submission. Record-and-sync with Whisper-Large on the workstation handles voice-in; the on-device LLM handles Hindi text understanding only.
57
 
58
  ## Function Calling
59
 
 
78
  | Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
79
  | Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
80
 
81
+ **Patient demographics enter as a header, not from the audio.** Every clinical EMR works this way: identifiers typed once at intake, the conversation handled separately. The ASHA fills name / age / sex / mobile / ASHA-ID / visit-date in the header above the record button, and the LLM only extracts what was *said* during the visit — symptoms, vitals, counselling, next-visit date. This avoids a failure mode surfaced in real-voice testing: Whisper-Hindi sometimes mishears patient names as different Hindi words, and a downstream LLM has no prior on what the name should be. Same merge logic runs on all three paths — `apply_metadata` in `app.py` for workstation audio and text, mirrored as a pure JS function in `pipeline.js` for on-device Cactus extraction — so server and phone produce identical envelopes for the same input. ANC fills `patient.{name, age, mobile}`; child_health fills `child.{name, age_months, sex}` with year→month conversion; PNC and delivery have no patient sub-object in their form, so the metadata travels in the response envelope only. `asha_id` is sticky across sessions via `localStorage`. For Field-mode recordings, the header is captured at record-start so later edits don't pollute earlier queue entries.
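The header-merge rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual `apply_metadata`: the function signature, the header key names (`name`, `age`, `sex`, `mobile`), the envelope layout, and the assumption that the header age is given in years are all hypothetical.

```python
from copy import deepcopy


def apply_metadata(visit_type, form, header):
    """Merge ASHA-typed header fields into an extracted form.

    Hypothetical signature and key names; only the per-visit-type
    routing follows the description in the text above.
    """
    merged = deepcopy(form)  # never mutate the LLM output in place
    if visit_type == "anc":
        patient = merged.setdefault("patient", {})
        for key in ("name", "age", "mobile"):
            if header.get(key) is not None:
                patient[key] = header[key]
    elif visit_type == "child_health":
        child = merged.setdefault("child", {})
        for key in ("name", "sex"):
            if header.get(key) is not None:
                child[key] = header[key]
        if header.get("age") is not None:
            # Assumption: header age is in years; the form wants months.
            child["age_months"] = round(header["age"] * 12)
    # PNC and delivery forms have no patient sub-object, so the header
    # travels only in the response envelope.
    return {"visit_type": visit_type, "form": merged, "metadata": dict(header)}


envelope = apply_metadata(
    "child_health",
    {"child": {"name": None}},
    {"name": "Ravi", "age": 2, "sex": "M"},
)
print(envelope["form"]["child"])
```

Because the function depends only on its arguments, the same logic can be mirrored in another runtime and compared against shared fixtures.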
82
 
83
  **Hindi number normalization:** Algorithmic parser covering all 0–999 Hindi number words with Whisper misspelling variants. Handles compound medical values: "एक सौ दस बटा सत्तर" → "110/70", "ग्यारह दशमलव पाँच" → "11.5", "तीन किलो दो सौ ग्राम" → "3.2 kg".
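As a rough illustration of the idea, the sketch below parses a handful of Hindi number words and joins compound values around `बटा` and `दशमलव`. It is not the code in `src/hindi_normalize.py`: the tiny vocabulary, the function names `read_number` and `normalize`, and the single-word decimal handling are simplifying assumptions, and unit handling (kilos and grams) is left out.

```python
# Tiny illustrative vocabulary; the real parser covers 0-999 plus
# Whisper misspelling variants.
UNITS = {"एक": 1, "दो": 2, "तीन": 3, "पाँच": 5, "दस": 10,
         "ग्यारह": 11, "साठ": 60, "सत्तर": 70}
HUNDRED = "सौ"
SEPARATOR = "बटा"    # spoken "over" in a BP reading
DECIMAL = "दशमलव"    # spoken decimal point


def read_number(tokens, i):
    """Read one number at tokens[i]; return (value, next_index) or (None, i)."""
    if i >= len(tokens):
        return None, i
    tok = tokens[i]
    if tok == HUNDRED:                      # bare "सौ" means 100
        value, i = 100, i + 1
    elif tok in UNITS:
        value, i = UNITS[tok], i + 1
        if value < 10 and i < len(tokens) and tokens[i] == HUNDRED:
            value, i = value * 100, i + 1   # "एक सौ" -> 100
        else:
            return value, i                 # plain word; never merged with a neighbour
    else:
        return None, i
    if i < len(tokens) and tokens[i] in UNITS:
        value += UNITS[tokens[i]]           # tail after the hundreds, e.g. "दस"
        i += 1
    return value, i


def normalize(text):
    """Replace Hindi number words with digits, joining a/b and a.b compounds."""
    tokens, out, i = text.split(), [], 0
    while i < len(tokens):
        value, j = read_number(tokens, i)
        if value is None:
            out.append(tokens[i])
            i += 1
            continue
        if j < len(tokens) and tokens[j] in (SEPARATOR, DECIMAL):
            right, k = read_number(tokens, j + 1)
            if right is not None:
                joiner = "/" if tokens[j] == SEPARATOR else "."
                out.append(f"{value}{joiner}{right}")
                i = k
                continue
        out.append(str(value))
        i = j
    return " ".join(out)


print(normalize("एक सौ दस बटा सत्तर"))
print(normalize("ग्यारह दशमलव पाँच"))
```

Keeping the grammar strict (an optional hundreds part followed by at most one sub-100 word) is what stops adjacent number words from being summed or multiplied together.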
84
 
 
94
 
95
  Two reproduction paths. Pick by available hardware.
96
 
97
+ **Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-runtime.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. The slim `requirements-runtime.txt` covers the serving stack (Ollama client + faster-whisper + FastAPI); PyTorch / Unsloth / bitsandbytes from the comprehensive `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify Sakhi's engineering claims (function calling, normalization, 6-layer validation, schema correctness).
98
 
99
  **Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
100
  1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
 
104
  5. Open Sakhi → Field Mode → On-Device Probe → **Import model (.zip)** → pick the zip from the system file picker. Wait ~3-5 minutes for extraction (progress bar + log card show live file count and MB written). Re-imports auto-evict the previous model — no manual cleanup, no risk of 12 GB accumulation.
105
  6. **Load Model** → **Test Hindi** to confirm inference works.
106
 
107
+ **Sakhi does not redistribute the Cactus model.** It is gated under a custom Cactus-Compute license; hosting it on a public Drive link would violate that gating. The in-app SAF import flow exists precisely so reviewers who DO want to reproduce on-device can do so without the project needing to host the weights, and without needing developer mode or adb on their phone. The [3-minute demo video](https://youtu.be/n-u7J1lljUg) shows the full flow on a real phone, so the on-device claim can be verified without anyone needing to install the model themselves.
108
 
109
  ## Safety & Limitations
110
 
 
114
 
115
  **What it can miss:** Danger signs not discussed in conversation, subtle clinical findings that require physical examination, conditions that present atypically. The system cannot observe — it can only reason about what was spoken.
116
 
117
+ **False positive controls:** The 6-layer anti-hallucination pipeline filters ungrounded danger signs. On the test suite, normal visits produce zero false alarms.
118
 
119
  **Human-in-the-loop:** Every referral decision is presented to the ANM/medical officer at the health center for review before action. The tool accelerates information flow from field to facility — it does not replace clinical judgment.
120
 
121
+ **Known limitations** (full root-cause walkthroughs in [FAILURES.md](FAILURES.md)):
122
+
123
+ - **On-device latency.** Field-mode text-in extraction takes ~5 min on a Snapdragon 8+ Gen 1 — versus ~15–25 s on the workstation path. The use case is asynchronous: kick off at the end of a visit, the form is ready by the next stop. Live consultation runs on the workstation path.
124
+ - **Long-clip BP drop.** Whisper-Large CT2 reliably recovers BP `160/110` only when the speaker pauses ~0.5 s around `बटा` (the Hindi "over" separator). At conversational pacing on long clips, the number can drop while the surrounding "बहुत हाई है" framing is preserved; the danger panel still flags severe-hypertension from the qualitative phrase.
125
+ - **Eval-rubric scope.** The 15/15 quality score is asserted against per-case `hallucination_traps` lists — the specific fields that MUST be null for that input — not a whole-schema null-everywhere check. The ANC preeclampsia case has a misclassification not on its trap list: `pregnancy.previous_complications` (a prior-history field) gets populated with current-visit symptoms. The danger panel and referral decision are unaffected. The schema-description fix touches all four visit schemas and would require a full eval re-run; that re-run did not land here.
126
+ - **Synthetic training data + partial real-voice eval.** The 1,154 fine-tune examples and 15-case automated eval suite are LLM-generated Hindi conversations with gTTS audio. Real-voice testing to date covers 4 recordings × 2 speakers (1 female Bareilly reader + 1 male self-record) × 3 of 4 role-play scripts (ANC preeclampsia, PNC Day-7, child diarrhea — see Test Results for details and fixes that came out of it). Rural female ASHA accents, regional dialects, and field background noise are not yet covered.
127
+ - **Regional dialect coverage.** Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, and code-switched Marwari/Bhili speech are not validated. ASHA workers in those regions would need targeted evaluation before deployment.
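The eval-rubric scope limitation above is easier to see in code. The sketch below assumes each test case carries a `hallucination_traps` list of dotted field paths that must be empty; the helper names and the sample form are hypothetical, and the real eval script may check more than this.

```python
def get_path(record, dotted):
    """Follow a dotted path like 'vitals.hb' through nested dicts."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def violated_traps(form, traps):
    """Return the trap paths that were populated although they must be empty."""
    return [path for path in traps if get_path(form, path) not in (None, "", [], {})]


# Hypothetical extraction for an ANC case: Hb was never mentioned (correctly
# null), but a prior-history field was filled with current-visit symptoms.
form = {
    "vitals": {"hb": None},
    "pregnancy": {"previous_complications": ["headache", "swelling"]},
}

# The case passes because the populated field is not on its trap list.
print(violated_traps(form, ["vitals.hb"]))
# A whole-schema check would have caught it.
print(violated_traps(form, ["vitals.hb", "pregnancy.previous_complications"]))
```

A per-case trap list keeps the rubric targeted, at the cost of missing misclassifications into fields nobody thought to list.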
128
+
129
+ ## Privacy & Data Handling
130
+
131
+ Sakhi is designed so the audio and transcript of a patient visit never cross the boundary of the institution that owns it.
132
+
133
+ - **Workstation mode.** ASR + LLM extraction run on the PHC's GPU. Audio uploads from the phone travel over local WiFi LAN to `http://<workstation>:8000`, are processed in memory, and the response goes back to the phone. No third-party API call. No telemetry. No analytics.
134
+ - **Field mode (on-device).** Hindi text → form extraction runs entirely on the phone via Gemma 4 E2B on Cactus SDK; the on-device path is fully offline and airplane mode does not break it. Voice captured in field mode persists to phone-local IndexedDB and is posted only to the configured workstation LAN endpoint at sync time.
135
+ - **No external LLMs.** Gemma 4 weights (E4B on Ollama, E2B INT4 on Cactus) are local. No OpenAI, Anthropic, or Google Cloud API key is required or used anywhere in the pipeline.
136
+ - **Data minimization at the boundary.** Patient demographics enter as a typed header — never extracted from audio — so identifiers do not need to round-trip through ASR + LLM layers.
137
+ - **DPDP Act alignment.** This deployment posture is compatible with India's Digital Personal Data Protection Act, 2023 — data fiduciary stays within the institution, no cross-border transfer, purpose limitation enforced by architecture rather than by policy.
138
+
139
+ The public HuggingFace Space referenced below exists for reviewer convenience only; production deployments would run the workstation stack on PHC-owned hardware.
140
 
141
  ## Deployment Model
142
 
 
158
  **Three access points, same backend schema:**
159
 
160
  1. **Workstation browser** — ANM/medical officer at the health center opens `http://localhost:8000` (or `http://<LAN-IP>:8000` from any workstation on the WiFi). FastAPI serves the built React UI at `/` and the pipeline endpoints at `/api/*`. One command (`python api.py`) starts everything.
161
+ 2. **Phone, health-center mode** — APK records and posts to workstation's `:8000` over WiFi. Workstation runs Whisper-Large ASR + E4B Q4_K_M with native function calling. The on-device path (mode 3 below) is text-in only and uses plain-JSON output instead of function calling — workstation mode is the higher-fidelity path of the two.
162
  3. **Phone, field mode** — APK offers two offline paths. **(a)** Record audio during home visits — chunks stored crash-safely in IndexedDB every 5 s. Queued recordings sync to the health-center workstation when back on WiFi for full Whisper + E4B processing. **(b)** Type a short Hindi note in the "on-device text → form" card; the full extraction + danger-sign pipeline runs on the phone via Gemma 4 E2B on Cactus SDK. No network required. Total on-device pipeline latency ≈ 5 min on Snapdragon 8+ Gen 1 — suited for "tap and wait" use, not real-time.
163
 
164
  **Crash-safe recording (Field Mode):** audio chunks are persisted to IndexedDB every 5 seconds during a recording. If the browser tab closes, the phone locks, or the app is killed mid-visit, the chunks survive — on reopen, an orange recovery banner offers to reassemble the partial recording.
 
192
  - Covers 0–999 Hindi number words + Whisper misspelling variants
193
  - Compound values (BP, weight, Hb), decimal points, fractions
194
 
195
+ **Real-voice validation:** 4 recordings, 2 speakers, 3 of 4 role-play scripts
196
+ - Speakers: 1 female (Bareilly reader, WhatsApp audio over phone mic) + 1 male (self-record, OnePlus 11R mic). Scripts covered: ANC preeclampsia, PNC Day-7 normal, child diarrhea. Script #1 ANC normal not yet recorded.
197
+ - Five normalizer/detector bugs surfaced and fixed from this round (commit `d2d987d`):
198
+ - `बीबी → BP` — Whisper mishears BP as `बीबी` in fast speech; medical-terms normalizer now maps it.
199
+ - `parse_hindi_number` no longer over-merges adjacent digits — `दो तीन` stays `2 3` (was `5`), `एक सौ सौ` stays `100 100` (was `10000`).
200
+ - Visit-type detector dropped `बच्चे को` from child-health keywords — was misrouting the ANC preeclampsia warning `तुम्हारा और बच्चे को खतरा हो सकता है` to child_health.
201
+ - Preeclampsia diagnosis name (`प्रीक्लिम्सिया`) maps to the symptom triad when the LLM emits the diagnosis instead of the underlying symptoms.
202
+ - `सूज` verb stem added to swelling-face/hands danger keywords.
203
+ - BP extraction confirmed on short clips with deliberate prosody around `बटा`. On long conversational-pacing clips the numeric value can drop while the danger framing (`BP बहुत हाई है`) survives — the danger panel still flags severe-hypertension on the qualitative phrase. Root-cause walkthrough in [FAILURES.md](FAILURES.md).
204
+ - The patient-name misclassification observed on the child-diarrhea recording (LLM grabbed the child's name into the mother field) is sidestepped by the ASHA-entered metadata header — patient identifiers never depend on ASR.
205
+ - Full-corpus real-audio evaluation (all 4 scripts × multiple speakers under field conditions) is the next eval lift.
206
+
207
  ## Fine-Tuning (Unsloth Track)
208
 
209
+ The track deliverables are a reproducible LoRA pipeline on RTX 5070 Ti / Blackwell, a Windows GGUF-export workaround for Unsloth's Gemma 4 mmap failure, and an A/B against base. The fine-tuned model did not beat base on pass-rate; base ships in the live pipeline.
210
 
211
+ **Pipeline (`scripts/train_unsloth.py`)** — one command, end-to-end: data prep → LoRA training → adapter saved → GGUF export → Ollama register → auto-eval vs base. Training set: 1,154 synthetic ASHA visit examples (981 train / 173 val) covering all 4 visit types and 458 positive danger sign cases. Hyperparameters: LR 5e-5, 1 epoch, LoRA r=16 / alpha=32, dropout 0.05.
212
 
213
+ **Windows GGUF-export workaround (`scripts/export_merge.py`):** Unsloth's bundled GGUF export path hits an mmap failure on Windows for Gemma 4 architectures. The workaround loads base + adapter via `transformers`, computes `delta_W = (B @ A) * (alpha / r)` per LoRA pair, merges, then runs `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`. Reproducible without WSL or a Linux dual-boot.
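The merge arithmetic itself is small. The sketch below applies `delta_W = (B @ A) * (alpha / r)` to one toy weight matrix using plain Python lists so it runs without dependencies; it is not `scripts/export_merge.py`, which would operate on tensors, and the function names `matmul` and `merge_lora` are illustrative.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha, r):
    """Fold one LoRA pair into its base weight: W + (B @ A) * (alpha / r).

    Shapes: w is (out x in), b is (out x r), a is (r x in).
    """
    scale = alpha / r
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy rank-1 example; the README's configuration is r=16, alpha=32 (scale 2.0).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]        # r x in
b = [[1.0], [3.0]]      # out x r
print(merge_lora(w, a, b, alpha=2, r=1))
```

After every adapted layer has been folded this way, the merged checkpoint carries no adapter-specific parameters, which is why it can go through the ordinary conversion and quantization steps.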
214
 
215
+ **A/B vs base** (full numbers in `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`):
216
+ - **Pass rate:** base 15/15 vs fine-tune 14/15. The single fine-tune failure is on heavy Hinglish code-switching where the fine-tune over-refers (a safer failure mode, still a failure).
217
+ - **Latency:** base 18.7s vs fine-tune 19.0s avg — effectively tied.
218
+ - **Schema normalization:** fine-tune translates Hindi symptom phrases into English schema labels (`दस्त` → `Diarrhea`, `चक्कर आ रहे हैं` → `dizziness`). Base retains raw Hindi.
219
+ - **Field coverage:** fine-tune recovers 2 visit-type-specific fields the base misses (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`); base recovers 11 fields the fine-tune leaves null.
220
 
221
+ **Root cause of the over-referral failure.** The 1,154-example training distribution had Hinglish code-switching disproportionately co-occurring with danger cases, so the LoRA learned `English-in-Hindi-sentence` as a mild danger signal. Documented in [FAILURES.md](FAILURES.md). The base model is in the live Ollama path; the fine-tune is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi). Run `ollama pull tusharbrisingr9802/sakhi` to verify the A/B locally. The fine-tune suits deployments that prefer English schema-label normalization over raw Hindi.
222
 
223
  ## Frontend
224
 
 
242
  # Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
243
 
244
  # ── Health-center deployment (workstation, unified UI + API) ──
245
+ pip install -r requirements-runtime.txt # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
246
  ollama pull gemma4:e4b-it-q4_K_M # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
247
  cd frontend && npm install && npm run build && cd ..
248
  python api.py
 
302
 
303
  **Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
304
 
305
+ **Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the [3-minute demo video](https://youtu.be/n-u7J1lljUg), or follow Path 1 above to run locally — the live Space exists for convenience. Local Path 1 (or the test scripts in `scripts/`) is the evaluation path.
306
 
307
  ### How it's deployed
308
 
 
310
 
311
  - `Dockerfile` — two-stage build: Node 20 builds `frontend/dist`, CUDA 12.2 + cuDNN 8 runtime installs Ollama + Python deps and copies the dist in.
312
  - `entrypoint.sh` — starts the Ollama daemon, waits for its API, pulls `gemma4:e4b-it-q4_K_M` if absent, then `exec uvicorn api:app`.
313
+ - `requirements-runtime.txt` — slim runtime deps (faster-whisper, fastapi, uvicorn, ollama). No Unsloth / PyTorch / bitsandbytes — they're training-side only. Used by both the HF Space Docker build and local Path 1 installs.
314
  - `.dockerignore` — keeps the build context small (no `models/`, no `data/recordings/`, no `frontend/node_modules`, no `cactus-src/`, etc.).
315
  - README YAML frontmatter — `sdk: docker`, `app_port: 7860`. HF Space picks this up on push.
316
 
 
349
  configs/schemas/ # 5 JSON schemas (ANC, PNC, delivery, child health, danger signs)
350
  Dockerfile # HF Space build: Node frontend + CUDA runtime + Ollama
351
  entrypoint.sh # HF Space container init: ollama serve → pull model → uvicorn
352
+ requirements-runtime.txt # Slim runtime deps (no Unsloth/PyTorch — Ollama serves inference)
353
  frontend/
354
  src/App.jsx # React app — all 5 tabs, on-device text-in card + Cactus probe in Field Mode
355
  src/offlineQueue.js # IndexedDB offline queue + crash-safe chunk persistence
RETRAIN_RESULTS.md CHANGED
@@ -11,11 +11,13 @@
11
  | gemma4:e4b-it-q4_K_M (base) | 15/15 |
12
  | sakhi:latest (fine-tuned) | 14/15 |
13
 
 
 
14
  ## Verdict
15
 
16
  **Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**
17
 
18
- The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is kept available in Ollama as `sakhi:latest` for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
19
 
20
  ## Diagnostics
21
 
 
11
  | gemma4:e4b-it-q4_K_M (base) | 15/15 |
12
  | sakhi:latest (fine-tuned) | 14/15 |
13
 
14
+ **Reproduce:** `ollama pull tusharbrisingr9802/sakhi` to fetch the fine-tune; `ollama cp tusharbrisingr9802/sakhi:latest sakhi:latest` so the eval script picks it up under the local tag it expects. Then `python scripts/test_ollama_quality.py`.
15
+
16
  ## Verdict
17
 
18
  **Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**
19
 
20
+ The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
21
 
22
  ## Diagnostics
23