Commit 20f5235 · 1 parent: d630c01
docs: tone scrub + YouTube demo link + Ollama-pull reproducer
- README + JUDGE_BRIEF + FAILURES: replace solo-dev "we" with neutral or
product-subject phrasing across 10 spots. Matches the voiceover's
"my submission" framing in the demo video. Side-benefit: removes a
few self-grading / hedge-language phrases.
- README + JUDGE_BRIEF: add 3-min demo video link
(https://youtu.be/n-u7J1lljUg) in 6 places — top-of-readme callout,
inline mentions of the on-device demo, Public Demo section.
- RETRAIN_RESULTS: document the `ollama pull` + `ollama cp` two-step
needed to reproduce the A/B against tusharbrisingr9802/sakhi locally.
- FIELD_COVERAGE_DIFF: factual restatement of the base-vs-finetune
trade-off (drop "safer, more consistent alternative" value-judgment).
- FAILURES.md +24 -8
- FIELD_COVERAGE_DIFF.md +1 -1
- JUDGE_BRIEF.md +19 -11
- README.md +59 -23
- RETRAIN_RESULTS.md +3 -1
FAILURES.md
CHANGED
|
@@ -12,7 +12,7 @@ Every test failure in Sakhi's eval suite is recorded here with a root-cause diag
|
|
| 12 |
|
| 13 |
### Failure pattern: BP value drift through TTS → ASR
|
| 14 |
|
| 15 |
-
gTTS (Google Text-to-Speech, the synthesizer
|
| 16 |
|
| 17 |
**Observed failure pattern** (from development iteration logs, before the current passing-13/15 baseline was pinned):
|
| 18 |
|
|
@@ -29,12 +29,12 @@ gTTS (Google Text-to-Speech, the synthesizer we use for test audio generation
|
|
| 29 |
|
| 30 |
### Reproducing these specific failures
|
| 31 |
|
| 32 |
-
`python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous.
|
| 33 |
|
| 34 |
### Planned mitigation
|
| 35 |
|
| 36 |
-
- Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in,
|
| 37 |
-
- Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will
|
| 38 |
|
| 39 |
---
|
| 40 |
|
|
@@ -49,7 +49,7 @@ The LoRA fine-tune (1,154 synthetic examples, 981 train / 173 val) was trained o
|
|
| 49 |
|
| 50 |
### Disposition
|
| 51 |
|
| 52 |
-
Documented in `RETRAIN_RESULTS.md`.
|
| 53 |
|
| 54 |
---
|
| 55 |
|
|
@@ -63,6 +63,22 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
|
|
| 63 |
|
| 64 |
---
|
| 65 |
|
| 66 |
## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
|
| 67 |
|
| 68 |
**Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
|
|
@@ -75,7 +91,7 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
|
|
| 75 |
|
| 76 |
### Disposition
|
| 77 |
|
| 78 |
-
|
| 79 |
|
| 80 |
---
|
| 81 |
|
|
@@ -87,7 +103,7 @@ The 15/15 pass rate is computed against per-case `hallucination_traps` lists —
|
|
| 87 |
|
| 88 |
### Disposition
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
---
|
| 93 |
|
|
@@ -103,4 +119,4 @@ Conversational pacing on the long clip. BP `एक सौ साठ बटा
|
|
| 103 |
|
| 104 |
### Disposition
|
| 105 |
|
| 106 |
-
|
|
|
|
| 12 |
|
| 13 |
### Failure pattern: BP value drift through TTS → ASR
|
| 14 |
|
| 15 |
+
gTTS (Google Text-to-Speech, the synthesizer used for test audio generation; see `scripts/generate_test_audio.py`) is a concatenative TTS engine. It is fast and free, but does not produce the prosody of natural Hindi speech: it tends to produce staccato numeric readings with limited inter-word coarticulation. When a number sequence like `"एक सौ साठ बटा एक सौ दस"` (160/110 in the BP format ASHA workers read aloud) runs through gTTS, the pronunciation of `"बटा"` (the Hindi separator equivalent to the English "over" in "160 over 110") can be produced with a sibilance or softening that Whisper-Large-V2 Hindi mishears.
|
| 16 |
|
| 17 |
**Observed failure pattern** (from development iteration logs, before the current passing-13/15 baseline was pinned):
|
| 18 |
|
|
|
|
| 29 |
|
| 30 |
### Reproducing these specific failures
|
| 31 |
|
| 32 |
+
`python scripts/test_pipeline_e2e.py` will re-generate audio (if missing), run the pipeline, and print per-case pass/fail. The two currently failing cases in the 15-case suite are the BP-heavy ANC cases — specifically, the preeclampsia and the severe-anemia cases where Hb or BP is borderline-but-dangerous.
|
| 33 |
|
| 34 |
### Planned mitigation
|
| 35 |
|
| 36 |
+
- Replace gTTS with real-voice recordings for the test suite. The 4-script role-play plan (`ROLE_PLAY_SCRIPTS.md`) produces real-phone-mic Hindi audio in noisy conditions and will supplant the synthetic test audio. Once the real-audio baseline is in, the `test_pipeline_e2e.py` pass rate is expected to rise rather than fall, on the working assumption that natural speech is easier for Whisper to transcribe than gTTS output.
|
| 37 |
+
- Widen the Hindi number normalization heuristic for compound-number splitting near common separator positions (`बटा`, `by`, `/`). Currently conservative to avoid false positives; real-audio data will allow re-tuning the recall/precision tradeoff.
|
| 38 |
|
| 39 |
---
|
| 40 |
|
|
|
|
| 49 |
|
| 50 |
### Disposition
|
| 51 |
|
| 52 |
+
Documented in `RETRAIN_RESULTS.md`. The base model is in the live Ollama path. The fine-tune remains available on the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer English schema-label normalization. Further tuning was not pursued — the failure mode (synthetic-data distribution bias) is a known LoRA pitfall and the base already passes 15/15.
|
| 53 |
|
| 54 |
---
|
| 55 |
|
|
|
|
| 63 |
|
| 64 |
---
|
| 65 |
|
| 66 |
+
## ANC form: `patient.age` slot misclassification on on-device E2B path
|
| 67 |
+
|
| 68 |
+
**Harness:** Field Mode on-device text → form, observed during slot 3 video recording on 2026-05-17.
|
| 69 |
+
|
| 70 |
+
**Observed output:** With the `Load ANC example` ANC preeclampsia transcript fed through Gemma 4 E2B INT4 on Cactus SDK, `patient.age` is populated with `8`. The source is the speaker's response to the ASHA's gestational-age question — `लगभग 8 महीने` ("about 8 months [pregnant]") — which the on-device model is grounding to the wrong field. The transcript carries no explicit patient age in years.
|
| 71 |
+
|
| 72 |
+
### Root cause
|
| 73 |
+
|
| 74 |
+
Same family as the `pregnancy.previous_complications` walkthrough below: the model is filling a slot from a number present in the input without grounding it in the slot's semantics. On the E2B INT4 path the failure surface is wider because the null-filled instance template prompt carries no per-field descriptions of year-vs-month-vs-week semantics; the E4B Ollama path consumes the JSON Schema, which (for the fields that have descriptions) gives the model more signal to discriminate between slots.
|
| 75 |
+
|
| 76 |
+
### Disposition
|
| 77 |
+
|
| 78 |
+
Not a safety-critical issue — no clinical decision in the pipeline depends on `patient.age`. The architectural mitigation is already in place: the ASHA-entered metadata header (typed at intake, before any conversation is recorded or processed) supplies patient demographics directly via `apply_metadata`, which merges them into the form envelope and supersedes any conversational extraction. The misclassification only surfaces when demographics are absent from the input, which is the demo / on-device-test scenario, not the deployed ASHA workflow. A schema-side fix would add explicit field descriptions to the on-device template (`"age": "patient's age in YEARS, not gestational months"`); not landed in this submission.
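The override behaviour described above can be pictured with a short Python sketch. This is an illustration under stated assumptions, not the project's `apply_metadata`: the function name `apply_metadata_sketch`, the `patient` envelope key, and the header field names are hypothetical.

```python
import copy


def apply_metadata_sketch(form: dict, metadata: dict) -> dict:
    """Return a copy of `form` whose patient block is overridden by typed metadata.

    Hypothetical helper: the real `apply_metadata` signature is not shown in
    this document, so the envelope layout used here is assumed.
    """
    merged = copy.deepcopy(form)            # leave the extracted form untouched
    patient = merged.get("patient") or {}   # tolerate a null-filled patient block
    for field, value in metadata.items():
        if value is not None:               # blank header fields erase nothing
            patient[field] = value
    merged["patient"] = patient
    return merged


# Extraction wrongly put the gestational "8 months" into patient.age.
extracted = {"patient": {"age": 8, "name": None}, "vitals": {"bp": "155/100"}}
header = {"age": 26, "name": None}          # typed by the ASHA at intake

merged = apply_metadata_sketch(extracted, header)
print(merged["patient"]["age"])             # the typed value wins over the extracted one
```

In this sketch the typed header wins whenever it carries a value, so a mis-grounded `patient.age` can only survive when the header is left blank, which matches the demo-only exposure described above.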
|
| 79 |
+
|
| 80 |
+
---
|
| 81 |
+
|
| 82 |
## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
|
| 83 |
|
| 84 |
**Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
|
|
|
|
| 91 |
|
| 92 |
### Disposition
|
| 93 |
|
| 94 |
+
The one-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) touches the full form schema across all four visit types and would require re-running the 15-case eval to validate no regression. That re-run did not land before this submission. The safety-critical output (danger panel + referral decision) is unaffected; the misclassification is in a non-safety field.
|
| 95 |
|
| 96 |
---
|
| 97 |
|
|
|
|
| 103 |
|
| 104 |
### Disposition
|
| 105 |
|
| 106 |
+
`hallucination_traps` is the literal list of fields each test asserts null for; the test source is `scripts/test_ollama_quality.py:470-473`. "15/15 tests pass" is against this per-case rubric, not a whole-schema null-everywhere check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above. The wider rubric is not landed here.
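The difference between the two rubrics can be sketched in a few lines of Python. This is not the code in `scripts/test_ollama_quality.py`; the helper names, the `gestational_months` field, and the sample output are assumptions chosen to mirror the misclassifications documented in this file.

```python
def get_path(obj, dotted):
    """Follow a dotted path such as "patient.age"; return None if any part is missing."""
    for key in dotted.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def leaf_items(obj, prefix=""):
    """Yield (dotted_path, value) for every non-dict leaf in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_items(value, path)
        else:
            yield path, value


def trap_violations(output, traps):
    """Per-case rubric: only the listed trap fields must be null."""
    return [path for path in traps if get_path(output, path) is not None]


def wide_violations(output, supported_fields):
    """Wider rubric: every populated field must be backed by the transcript."""
    return [path for path, value in leaf_items(output)
            if value is not None and path not in supported_fields]


# Hypothetical model output for a transcript that states only BP and gestational age.
output = {
    "patient": {"age": 8, "name": None},
    "pregnancy": {"gestational_months": 8, "previous_complications": "preeclampsia"},
    "vitals": {"bp": "155/100"},
}
traps = ["patient.name"]                                    # narrow per-case list
supported = {"pregnancy.gestational_months", "vitals.bp"}   # fields the transcript backs

print(trap_violations(output, traps))
print(wide_violations(output, supported))
```

With the narrow list the case passes, while the wider check reports both `patient.age` and `pregnancy.previous_complications`. The cost of the wider rubric is that each test case must enumerate the fields its transcript supports.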
|
| 107 |
|
| 108 |
---
|
| 109 |
|
|
|
|
| 119 |
|
| 120 |
### Disposition
|
| 121 |
|
| 122 |
+
The mitigation in this submission: the 20 s clip is the manifest default, so the most-played sample exercises the full BP path end-to-end. The 52 s clip remains in the dropdown as the longer-conversation case; on that clip the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped. A custom Hindi-medical Whisper fine-tune would address the root cause; not in this submission.
|
FIELD_COVERAGE_DIFF.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
| 2 |
|
| 3 |
Date: 2026-04-17 09:53
|
| 4 |
|
| 5 |
-
The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg).
|
| 6 |
|
| 7 |
## Summary
|
| 8 |
|
|
|
|
| 2 |
|
| 3 |
Date: 2026-04-17 09:53
|
| 4 |
|
| 5 |
+
The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). The base model recovered more fields that the other model left null (11 unique extractions vs 2 for the fine-tune). The fine-tune translates Hindi symptom phrases into English schema labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") and recovers two visit-type-specific fields the base model misses (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base ships in the live pipeline for the single-test accuracy edge (15/15 vs 14/15); the fine-tune is registered as a schema-normalization alternative.
|
| 6 |
|
| 7 |
## Summary
|
| 8 |
|
JUDGE_BRIEF.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# Sakhi (सखी) — Judge Brief
|
| 2 |
|
| 3 |
-
*One-page version of the README. Full detail in [README.md](README.md).*
|
| 4 |
|
| 5 |
## The problem, in two sentences
|
| 6 |
|
|
@@ -22,31 +22,35 @@ Sakhi converts Hindi home-visit conversations (voice on a shared health-center w
|
|
| 22 |
| Workstation pipeline latency (audio → form) | ~15–25 s | RTX 5070 Ti, warm Ollama |
|
| 23 |
| On-device pipeline latency (Hindi text → form) | ~5 min | OnePlus 11R / Snapdragon 8+ Gen 1, Gemma 4 E2B INT4 on Cactus |
|
| 24 |
|
| 25 |
-
The 5-minute on-device figure is
|
| 26 |
|
| 27 |
## Why this is submitted to four tracks
|
| 28 |
|
| 29 |
| Track | What Sakhi brings |
|
| 30 |
|---|---|
|
| 31 |
-
| **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a
|
| 32 |
| **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
|
| 33 |
-
| **Unsloth** |
|
| 34 |
-
| **Cactus** |
|
| 35 |
|
| 36 |
## Reproduce in under 10 minutes
|
| 37 |
|
| 38 |
**Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
|
| 39 |
|
| 40 |
**Health-center mode (workstation only):**
|
| 41 |
```bash
|
| 42 |
-
pip install -r requirements-
|
| 43 |
cd frontend && npm install && npm run build && cd ..
|
| 44 |
python api.py # browser: http://localhost:8000
|
| 45 |
```
|
| 46 |
|
| 47 |
**Field mode (phone + Cactus):**
|
| 48 |
|
| 49 |
-
> **
|
| 50 |
|
| 51 |
```bash
|
| 52 |
# Build + install the APK once. After this the model install is in-app, no adb.
|
|
@@ -70,13 +74,17 @@ cd frontend && npm run build && npx cap sync android && \
|
|
| 70 |
|
| 71 |
A sample Hindi transcript ready to paste is at `data/processed/train.jsonl` (line 1 = ANC preeclampsia case) or in the main README.
|
| 72 |
|
| 73 |
-
##
|
| 74 |
|
| 75 |
-
- Partner with an ASHA training institute (Santosh Medical College / IIT Madras Bhashini) to collect 100+ hours of *real* ASHA home-visit audio
|
| 76 |
-
- Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path
|
| 77 |
- Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
|
| 78 |
- Pilot with 10–20 ASHA workers in one block (Muradnagar / Loni-adjacent) with before/after time-and-accuracy measurement.
|
| 79 |
|
| 80 |
## Contact
|
| 81 |
|
| 82 |
-
Tushar J —
|
|
|
|
| 1 |
# Sakhi (सखी) — Judge Brief
|
| 2 |
|
| 3 |
+
*One-page version of the README. Full detail in [README.md](README.md). 3-min demo video: [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg).*
|
| 4 |
|
| 5 |
## The problem, in two sentences
|
| 6 |
|
|
|
|
| 22 |
| Workstation pipeline latency (audio → form) | ~15–25 s | RTX 5070 Ti, warm Ollama |
|
| 23 |
| On-device pipeline latency (Hindi text → form) | ~5 min | OnePlus 11R / Snapdragon 8+ Gen 1, Gemma 4 E2B INT4 on Cactus |
|
| 24 |
|
| 25 |
+
The 5-minute on-device figure is reproducible via the **Load ANC example** button in Field Mode (Field Mode tab → On-device text → form card → "Load ANC example"). On OnePlus 11R / Snapdragon 8+ Gen 1, the on-device pipeline extracts BP 155/100, verbatim Hindi symptoms (`सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन`), Counseling `PHC जाने की सलाह`, and flags three danger signs (`high_bp_with_symptoms`, `swelling_face`, `swelling_legs`), all with verbatim Hindi `utterance_evidence` and `category: immediate_referral`. Total 320.7 s end-to-end: Form 231.8 s + Danger 88.9 s, with the normalize and detect stages adding nothing measurable at this precision. For comparison: the paper-form baseline is 15–20 min of hand-filling plus travel to the PHC.
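One way to picture the verbatim-evidence requirement is a filter that keeps a danger sign only when its `utterance_evidence` appears word for word in the transcript. The Python sketch below is an assumption about how such a check could look, not the pipeline's code; the function names, the `sign` key, and the `convulsions` entry are invented for illustration.

```python
import unicodedata


def _canon(text: str) -> str:
    """Normalize the Unicode form and collapse whitespace so matching stays literal but stable."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def keep_grounded(danger_signs, transcript):
    """Keep only the danger signs whose evidence string occurs verbatim in the transcript."""
    haystack = _canon(transcript)
    kept = []
    for item in danger_signs:
        evidence = _canon(item.get("utterance_evidence") or "")
        if evidence and evidence in haystack:   # empty evidence never counts as grounded
            kept.append(item)
    return kept


transcript = "सिर में दर्द है  और चेहरे पर सूजन है"
candidates = [
    {"sign": "swelling_face", "utterance_evidence": "चेहरे पर सूजन",
     "category": "immediate_referral"},
    # Hypothetical hallucinated sign: its evidence is not in the transcript.
    {"sign": "convulsions", "utterance_evidence": "दौरे पड़ रहे हैं",
     "category": "immediate_referral"},
]

grounded = keep_grounded(candidates, transcript)
print([item["sign"] for item in grounded])
```

A substring test like this is deliberately strict: a paraphrased evidence string is rejected even if it is clinically accurate, which trades some recall for a lower false-positive rate.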
|
| 26 |
|
| 27 |
## Why this is submitted to four tracks
|
| 28 |
|
| 29 |
| Track | What Sakhi brings |
|
| 30 |
|---|---|
|
| 31 |
+
| **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a workflow matched to how ASHA workers actually operate (health-center mode + field mode with later sync). |
|
| 32 |
| **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
|
| 33 |
+
| **Unsloth** | One-command LoRA pipeline (`scripts/train_unsloth.py`): data prep → train → GGUF export → Ollama register → A/B eval vs base. Includes a Windows GGUF-export workaround (`scripts/export_merge.py`) for Unsloth's Gemma 4 mmap failure — manual delta-merge + `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`, no WSL needed. Fine-tune pass rate 14/15 vs base 15/15 — base is in the live pipeline; fine-tune is published to Ollama as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) (`ollama pull tusharbrisingr9802/sakhi` to verify A/B locally) for deployments preferring English schema-label normalization (`दस्त` → `Diarrhea`) over raw Hindi. Field-coverage diff in `FIELD_COVERAGE_DIFF.md`. |
|
| 34 |
+
| **Cactus** | On-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. On-device voice-in via `cactusTranscribe` + Gemma was investigated; the README documents why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per [Kumar et al. 2025](https://arxiv.org/abs/2512.10967) and the Vistaar / Gramvaani benchmarks, with deletion-dominant errors on numbers — not in this submission). |
|
| 35 |
|
| 36 |
## Reproduce in under 10 minutes
|
| 37 |
|
| 38 |
+
**3-min demo video:** [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg) — workstation voice-to-form path, on-device Hindi text-to-form on a phone in airplane mode, four tracks claimed.
|
| 39 |
+
|
| 40 |
**Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
|
| 41 |
|
| 42 |
+
**Pull the Unsloth fine-tune:** [`ollama pull tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi). The LoRA-fine-tuned Gemma 4 E4B is on the Ollama registry. Run `python scripts/test_ollama_quality.py` against base + fine-tune to reproduce the 15/15 vs 14/15 A/B locally.
|
| 43 |
+
|
| 44 |
**Health-center mode (workstation only):**
|
| 45 |
```bash
|
| 46 |
+
pip install -r requirements-runtime.txt && ollama pull gemma4:e4b-it-q4_K_M
|
| 47 |
cd frontend && npm install && npm run build && cd ..
|
| 48 |
python api.py # browser: http://localhost:8000
|
| 49 |
```
|
| 50 |
|
| 51 |
**Field mode (phone + Cactus):**
|
| 52 |
|
| 53 |
+
> **Sakhi does not redistribute the Cactus-Compute model** — it is gated under a custom Cactus license. Reviewers verifying the Cactus track follow the documented path below. Most reviewers can verify the engineering claims via the workstation path above without ever installing on-device; the [3-minute demo video](https://youtu.be/n-u7J1lljUg) shows the full on-device flow on a real phone.
|
| 54 |
|
| 55 |
```bash
|
| 56 |
# Build + install the APK once. After this the model install is in-app, no adb.
|
|
|
|
| 74 |
|
| 75 |
A sample Hindi transcript ready to paste is at `data/processed/train.jsonl` (line 1 = ANC preeclampsia case) or in the main README.
|
| 76 |
|
| 77 |
+
## Privacy & data handling
|
| 78 |
+
|
| 79 |
+
Audio and transcripts never leave the institution that owns them. Workstation mode keeps everything on the PHC's local network (Whisper + Ollama on local GPU; no OpenAI / Anthropic / Google API). Field mode runs on-device via Cactus SDK — airplane mode does not break it. Patient demographics enter as a typed header rather than being extracted from audio, so identifiers are minimised at the boundary. This posture is compatible with India's Digital Personal Data Protection Act, 2023 — data fiduciary stays within the institution, no cross-border transfer, purpose limitation enforced by architecture rather than by policy.
|
| 80 |
+
|
| 81 |
+
## What's next with $10K and six more months
|
| 82 |
|
| 83 |
+
- Partner with an ASHA training institute (Santosh Medical College / IIT Madras Bhashini) to collect 100+ hours of *real* ASHA home-visit audio under field conditions. Current evaluation covers 4 real-voice recordings (2 speakers — 1 female Bareilly reader + 1 male self-record — across 3 of 4 role-play scripts) plus the 15-case synthetic test suite; full-corpus rural-female accent + field-noise validation is the next step.
|
| 84 |
+
- Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path not shipped here.
|
| 85 |
- Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
|
| 86 |
- Pilot with 10–20 ASHA workers in one block (Muradnagar / Loni-adjacent) with before/after time-and-accuracy measurement.
|
| 87 |
|
| 88 |
## Contact
|
| 89 |
|
| 90 |
+
Tushar J — tusharbrisingr9802@gmail.com — GitHub: [Tushar-9802/Sakhi](https://github.com/Tushar-9802/Sakhi)
|
README.md
CHANGED
|
@@ -17,6 +17,12 @@ Offline-first tool that converts Hindi home visit conversations into structured
|
|
| 17 |
**Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
|
| 18 |
**Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
|
| 19 |
|
| 20 |

|
| 21 |
|
| 22 |
## Problem
|
|
@@ -27,10 +33,10 @@ India's ASHA workers conduct 50M+ maternal/child health home visits per year acr
|
|
| 27 |
|
| 28 |
Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:
|
| 29 |
|
| 30 |
-
- **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone.
|
| 31 |
- **Field mode (phone)** has two offline sub-paths:
|
| 32 |
-
- **Record now, sync later** — ASHA records audio during home visits; chunks persist to IndexedDB every 5 s (crash-safe). When the phone is back on health-center WiFi, the queued recordings post to the workstation for full Whisper + E4B processing.
|
| 33 |
-
- **Type a note for instant on-device extraction** — for when the ASHA wants structured output *right now* without network. A short Hindi note in a textarea runs through the full pipeline (normalize → detect visit type → extract form → detect danger signs) entirely on-device via Gemma 4 E2B INT4 on the Cactus SDK. Same schema, same validation as the workstation path. Pipeline latency is ≈ 5 min on a Snapdragon 8+ Gen 1 phone.
|
| 34 |
|
| 35 |
```
|
| 36 |
Workstation path:
|
|
@@ -47,7 +53,7 @@ On-device path (text-in):
|
|
| 47 |
|
| 48 |
### Why not voice-to-form on-device too?
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
## Function Calling
|
| 53 |
|
|
@@ -72,7 +78,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
|
|
| 72 |
| Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
|
| 73 |
| Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
|
| 74 |
|
| 75 |
-
**Patient demographics enter as a header, not from the audio.** Every clinical EMR works this way: identifiers typed once at intake, the conversation handled separately. The ASHA fills name / age / sex / mobile / ASHA-ID / visit-date in the header above the record button, and the LLM only extracts what was *said* during the visit — symptoms, vitals, counselling, next-visit date. This avoids a failure mode
|
| 76 |
|
| 77 |
**Hindi number normalization:** Algorithmic parser covering all 0–999 Hindi number words with Whisper misspelling variants. Handles compound medical values: "एक सौ दस बटा सत्तर" → "110/70", "ग्यारह दशमलव पाँच" → "11.5", "तीन किलो दो सौ ग्राम" → "3.2 kg".
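The idea behind the compound-value handling can be shown with a deliberately small Python sketch. It is not the parser in `src/hindi_normalize.py`: the vocabulary is a hand-picked subset, Whisper misspelling variants and the weight pattern ("किलो … ग्राम") are left out, and the function name is hypothetical.

```python
# Illustrative subset only; the real parser is described as covering 0-999.
HINDI_NUMBERS = {
    "एक": 1, "दो": 2, "तीन": 3, "चार": 4, "पाँच": 5,
    "छह": 6, "सात": 7, "आठ": 8, "नौ": 9, "दस": 10,
    "ग्यारह": 11, "बीस": 20, "तीस": 30, "चालीस": 40, "पचास": 50,
    "साठ": 60, "सत्तर": 70, "अस्सी": 80, "नब्बे": 90,
}
HUNDRED = "सौ"
SEPARATORS = {"बटा": "/", "दशमलव": "."}   # "over" and "point"


def normalize_hindi_numbers(text: str) -> str:
    """Replace runs of Hindi number words with digits, joining BP and decimal parts."""
    out = []       # finished output tokens
    expr = ""      # digits and separators of the numeric expression in progress
    value = None   # running total of the current number-word run

    def close_number():
        nonlocal expr, value
        if value is not None:
            expr += str(value)
            value = None

    def close_expr():
        nonlocal expr
        close_number()
        if expr:
            out.append(expr)
            expr = ""

    for token in text.split():
        if token in HINDI_NUMBERS:
            value = (value or 0) + HINDI_NUMBERS[token]
        elif token == HUNDRED:
            value = (value or 1) * 100          # "एक सौ" gives 100, a bare "सौ" also 100
        elif token in SEPARATORS and value is not None:
            close_number()                      # a separator only counts right after a number
            expr += SEPARATORS[token]
        else:
            close_expr()
            out.append(token)                   # ordinary words pass through unchanged
    close_expr()
    return " ".join(out)


print(normalize_hindi_numbers("एक सौ दस बटा सत्तर"))
print(normalize_hindi_numbers("ग्यारह दशमलव पाँच"))
```

Requiring a number immediately before a separator is the conservative choice in this sketch: a stray "बटा" with no preceding number word is passed through as text rather than turned into a slash.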
|
| 78 |
|
|
@@ -88,7 +94,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
|
|
| 88 |
|
| 89 |
Two reproduction paths. Pick by available hardware.
|
| 90 |
|
| 91 |
-
**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-
|
| 92 |
|
| 93 |
**Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
|
| 94 |
1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
|
|
@@ -98,7 +104,7 @@ Two reproduction paths. Pick by available hardware.
|
|
| 98 |
5. Open Sakhi → Field Mode → On-Device Probe → **Import model (.zip)** → pick the zip from the system file picker. Wait ~3-5 minutes for extraction (progress bar + log card show live file count and MB written). Re-imports auto-evict the previous model — no manual cleanup, no risk of 12 GB accumulation.
|
| 99 |
6. **Load Model** → **Test Hindi** to confirm inference works.
|
| 100 |
|
| 101 |
-
**
|
| 102 |
|
| 103 |
## Safety & Limitations
|
| 104 |
|
|
@@ -108,11 +114,29 @@ Sakhi is a decision-support tool, not a diagnostic system. All outputs require h
|
|
| 108 |
|
| 109 |
**What it can miss:** Danger signs not discussed in conversation, subtle clinical findings that require physical examination, conditions that present atypically. The system cannot observe — it can only reason about what was spoken.
|
| 110 |
|
| 111 |
-
**False positive controls:** The 6-layer anti-hallucination pipeline
|
| 112 |
|
| 113 |
**Human-in-the-loop:** Every referral decision is presented to the ANM/medical officer at the health center for review before action. The tool accelerates information flow from field to facility — it does not replace clinical judgment.
|
| 114 |
|
| 115 |
-
**Known
|
| 116 |
|
| 117 |
## Deployment Model
|
| 118 |
|
|
@@ -134,7 +158,7 @@ Health Center (workstation, RTX GPU) Field (Android phone)
|
|
| 134 |
**Three access points, same backend schema:**
|
| 135 |
|
| 136 |
1. **Workstation browser** — ANM/medical officer at the health center opens `http://localhost:8000` (or `http://<LAN-IP>:8000` from any workstation on the WiFi). FastAPI serves the built React UI at `/` and the pipeline endpoints at `/api/*`. One command (`python api.py`) starts everything.
|
| 137 |
-
2. **Phone, health-center mode** — APK records and posts to workstation's `:8000` over WiFi. Workstation
|
| 138 |
3. **Phone, field mode** — APK offers two offline paths. **(a)** Record audio during home visits — chunks stored crash-safely in IndexedDB every 5 s. Queued recordings sync to the health-center workstation when back on WiFi for full Whisper + E4B processing. **(b)** Type a short Hindi note in the "on-device text → form" card; the full extraction + danger-sign pipeline runs on the phone via Gemma 4 E2B on Cactus SDK. No network required. Total on-device pipeline latency ≈ 5 min on Snapdragon 8+ Gen 1 — suited for "tap and wait" use, not real-time.
|
| 139 |
|
| 140 |
**Crash-safe recording (Field Mode):** audio chunks are persisted to IndexedDB every 5 seconds during a recording. If the browser tab closes, the phone locks, or the app is killed mid-visit, the chunks survive — on reopen, an orange recovery banner offers to reassemble the partial recording.
|
|
@@ -168,21 +192,33 @@ Health Center (workstation, RTX GPU) Field (Android phone)
|
|
| 168 |
- Covers 0–999 Hindi number words + Whisper misspelling variants
|
| 169 |
- Compound values (BP, weight, Hb), decimal points, fractions
|
| 170 |
|
| 171 |
## Fine-Tuning (Unsloth Track)
|
| 172 |
|
| 173 |
-
|
| 174 |
|
| 175 |
-
**
|
| 176 |
|
| 177 |
-
**
|
| 178 |
-
- **Pass rate:** base 15/15 vs fine-tune 14/15 (single fail on heavy Hinglish code-switch → over-referral, a safer failure mode)
|
| 179 |
-
- **Latency:** base 18.7s vs fine-tune 19.0s avg — effectively tied
|
| 180 |
-
- **Schema normalization:** the fine-tune consistently translates Hindi symptom phrases into English schema labels ("दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness"), making downstream filtering easier. Base retains raw Hindi.
|
| 181 |
-
- **Unique field extractions:** fine-tune recovered 2 visit-type-specific fields the base missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`); base recovered 11 fields the fine-tune left null.
|
| 182 |
|
| 183 |
-
**
|
| 184 |
|
| 185 |
-
**
|
| 186 |
|
| 187 |
## Frontend
|
| 188 |
|
|
@@ -206,7 +242,7 @@ One React + Vite codebase, shipped as both a browser UI (served by FastAPI at `/
|
|
| 206 |
# Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
|
| 207 |
|
| 208 |
# ── Health-center deployment (workstation, unified UI + API) ──
|
| 209 |
-
pip install -r requirements-
|
| 210 |
ollama pull gemma4:e4b-it-q4_K_M # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
|
| 211 |
cd frontend && npm install && npm run build && cd ..
|
| 212 |
python api.py
|
|
@@ -266,7 +302,7 @@ python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
|
|
| 266 |
|
| 267 |
**Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
|
| 268 |
|
| 269 |
-
**Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the 3-minute demo video, or follow Path 1 above to run locally — the live Space exists for convenience
|
| 270 |
|
| 271 |
### How it's deployed
|
| 272 |
|
|
@@ -274,7 +310,7 @@ python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
|
|
| 274 |
|
| 275 |
- `Dockerfile` — two-stage build: Node 20 builds `frontend/dist`, CUDA 12.2 + cuDNN 8 runtime installs Ollama + Python deps and copies the dist in.
|
| 276 |
- `entrypoint.sh` — starts the Ollama daemon, waits for its API, pulls `gemma4:e4b-it-q4_K_M` if absent, then `exec uvicorn api:app`.
|
| 277 |
-
- `requirements-
|
| 278 |
- `.dockerignore` — keeps the build context small (no `models/`, no `data/recordings/`, no `frontend/node_modules`, no `cactus-src/`, etc.).
|
| 279 |
- README YAML frontmatter — `sdk: docker`, `app_port: 7860`. HF Space picks this up on push.
|
| 280 |
|
|
@@ -313,7 +349,7 @@ src/hindi_normalize.py # Hindi number/medical term normalization (1
|
|
| 313 |
configs/schemas/ # 5 JSON schemas (ANC, PNC, delivery, child health, danger signs)
|
| 314 |
Dockerfile # HF Space build: Node frontend + CUDA runtime + Ollama
|
| 315 |
entrypoint.sh # HF Space container init: ollama serve → pull model → uvicorn
|
| 316 |
-
requirements-
|
| 317 |
frontend/
|
| 318 |
src/App.jsx # React app — all 5 tabs, on-device text-in card + Cactus probe in Field Mode
|
| 319 |
src/offlineQueue.js # IndexedDB offline queue + crash-safe chunk persistence
|
|
|
|
| 17 |
**Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
|
| 18 |
**Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
|
| 19 |
|
| 20 |
+
**▶ Watch the 3-min demo:** [youtu.be/n-u7J1lljUg](https://youtu.be/n-u7J1lljUg) — full submission video: problem framing, workstation voice-to-form path, on-device Hindi text-to-form on a phone in airplane mode, four tracks claimed.
|
| 21 |
+
|
| 22 |
+
**▶ Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — the Path 1 workstation stack (FastAPI + Ollama + Whisper) running on an HF Space T4. Same UI, same endpoints; no install needed. ~5 min cold-boot wait after idle — see [Public Demo](#public-demo--huggingface-space) for details.
|
| 23 |
+
|
| 24 |
+
**▶ Pull the Unsloth fine-tune:** [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) on the Ollama registry — `ollama pull tusharbrisingr9802/sakhi` fetches the LoRA-fine-tuned Gemma 4 E4B behind the A/B numbers below. The base model (`gemma4:e4b-it-q4_K_M`) is what ships in the live pipeline; this is the side-by-side comparison artifact for the Unsloth track.
|
| 25 |
+
|
| 26 |

|
| 27 |
|
| 28 |
## Problem
|
|
|
|
| 33 |
|
| 34 |
Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:
|
| 35 |
|
| 36 |
+
- **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. End-to-end latency ~15–25 s on an RTX 5070 Ti or T4. This is the primary voice-to-form path.
|
| 37 |
- **Field mode (phone)** has two offline sub-paths:
|
| 38 |
+
- **Record now, sync later** — ASHA records audio during home visits; chunks persist to IndexedDB every 5 s (crash-safe). When the phone is back on health-center WiFi, the queued recordings post to the workstation for full Whisper + E4B processing. On-device ASR is not attempted — see the section below for why.
|
| 39 |
+
- **Type a note for instant on-device extraction** — for when the ASHA wants structured output *right now* without network. A short Hindi note in a textarea runs through the full pipeline (normalize → detect visit type → extract form → detect danger signs) entirely on-device via Gemma 4 E2B INT4 on the Cactus SDK. Same schema, same validation as the workstation path. Pipeline latency is ≈ 5 min on a Snapdragon 8+ Gen 1 phone. For comparison: the paper-form baseline is 15–20 min of hand-filling from memory, then a walk to the PHC, then clinician review hours-to-days later. The on-device path works with zero network and zero shared infrastructure.
|
| 40 |
|
| 41 |
```
|
| 42 |
Workstation path:
|
|
|
|
| 53 |
|
| 54 |
### Why not voice-to-form on-device too?
|
| 55 |
|
| 56 |
+
The on-device voice path does not work well enough yet for clinical Hindi. Cactus's transcribe API supports Whisper / Moonshine / Parakeet only (Gemma 4's audio conformer is for voice understanding in multimodal chat, not dedicated ASR). Cactus ships multilingual Whisper INT4 weights, but no Hindi-specific checkpoint — and published benchmarks ([Kumar et al. 2025, *ASR Under the Stethoscope*](https://arxiv.org/abs/2512.10967); Vistaar / Gramvaani corpus evaluations) show off-the-shelf Whisper on spontaneous rural Hindi hits 27% WER at best and 70%+ on clinical content, with substantial variability tied to speaker role / gender / code-mixing and a deletion-dominant error profile that silently drops numbers and symptoms. For an ASHA decision-support tool where a missed BP reading is a clinical harm, an on-device voice path is not in this submission. Record-and-sync with Whisper-Large on the workstation handles voice-in; the on-device LLM handles Hindi text understanding only.
|
| 57 |
|
| 58 |
## Function Calling
|
| 59 |
|
|
|
|
| 78 |
| Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
|
| 79 |
| Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style `tool_calls`) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |
|
| 80 |
|
| 81 |
+
**Patient demographics enter as a header, not from the audio.** Every clinical EMR works this way: identifiers typed once at intake, the conversation handled separately. The ASHA fills name / age / sex / mobile / ASHA-ID / visit-date in the header above the record button, and the LLM only extracts what was *said* during the visit — symptoms, vitals, counselling, next-visit date. This avoids a failure mode surfaced in real-voice testing: Whisper-Hindi sometimes mishears patient names as different Hindi words, and a downstream LLM has no prior on what the name should be. Same merge logic runs on all three paths — `apply_metadata` in `app.py` for workstation audio and text, mirrored as a pure JS function in `pipeline.js` for on-device Cactus extraction — so server and phone produce identical envelopes for the same input. ANC fills `patient.{name, age, mobile}`; child_health fills `child.{name, age_months, sex}` with year→month conversion; PNC and delivery have no patient sub-object in their form, so the metadata travels in the response envelope only. `asha_id` is sticky across sessions via `localStorage`. For Field-mode recordings, the header is captured at record-start so later edits don't pollute earlier queue entries.
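The merge rules in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's `apply_metadata`: the function name `apply_metadata_sketch`, the envelope keys (`visit_type`, `metadata`, `form`), the visit-type string `"anc"`, and the assumption that the header carries age in years are all hypothetical.

```python
import copy


def apply_metadata_sketch(visit_type, form, header):
    """Merge the ASHA-typed header into an extracted form (hypothetical layout)."""
    form = copy.deepcopy(form)  # never mutate the caller's extraction result
    envelope = {"visit_type": visit_type, "metadata": dict(header), "form": form}

    if visit_type == "anc":
        # ANC forms carry a patient sub-object: name, age, mobile.
        patient = form.setdefault("patient", {})
        for key in ("name", "age", "mobile"):
            if header.get(key) is not None:
                patient[key] = header[key]
    elif visit_type == "child_health":
        # Child-health forms store age in months, so convert from years.
        child = form.setdefault("child", {})
        for key in ("name", "sex"):
            if header.get(key) is not None:
                child[key] = header[key]
        if header.get("age") is not None:
            child["age_months"] = int(header["age"] * 12)
    # PNC and delivery: no patient sub-object, metadata stays in the envelope only.
    return envelope


result = apply_metadata_sketch(
    "child_health",
    {"child": {"name": None}, "symptoms": ["दस्त"]},
    {"name": "Ravi", "age": 2, "sex": "M"},
)
print(result["form"]["child"])
```

Because the header values overwrite whatever the model put in those fields, a name that ASR misheard never reaches the final record.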
|
| 82 |
|
| 83 |
**Hindi number normalization:** Algorithmic parser covering all 0–999 Hindi number words with Whisper misspelling variants. Handles compound medical values: "एक सौ दस बटा सत्तर" → "110/70", "ग्यारह दशमलव पाँच" → "11.5", "तीन किलो दो सौ ग्राम" → "3.2 kg".
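As a rough illustration of the compound-value handling described above, here is a minimal Python sketch. It is not the code in `src/hindi_normalize.py`: the tiny vocabulary, the function names `parse_number` and `normalize_compound`, and the error handling are assumptions made for the example, and it handles only the `बटा` (over) and `दशमलव` (decimal point) separators.

```python
# Illustrative vocabulary only; the real parser covers all of 0-999 plus variants.
NUMBER_WORDS = {
    "एक": 1, "दो": 2, "तीन": 3, "पाँच": 5,
    "दस": 10, "ग्यारह": 11, "सत्तर": 70,
}
HUNDRED = "सौ"
SEPARATORS = {"बटा": "/", "दशमलव": "."}


def parse_number(words):
    """Parse one number phrase below 1000, e.g. ['एक', 'सौ', 'दस'] -> 110."""
    if not words:
        raise ValueError("empty number phrase")
    total, pending, seen_hundred = 0, None, False
    for word in words:
        if word == HUNDRED:
            if seen_hundred:
                raise ValueError("more than one hundred-word in a single number")
            total += (pending if pending is not None else 1) * 100
            pending, seen_hundred = None, True
        elif word in NUMBER_WORDS:
            if pending is not None:
                raise ValueError("adjacent number words are not merged")
            pending = NUMBER_WORDS[word]
        else:
            raise ValueError(f"unknown number word: {word}")
    return total + (pending or 0)


def normalize_compound(phrase):
    """Turn a spoken compound value (BP, decimal) into its digit form."""
    parts, current = [], []
    for word in phrase.split():
        if word in SEPARATORS:
            parts.append(str(parse_number(current)))
            parts.append(SEPARATORS[word])
            current = []
        else:
            current.append(word)
    parts.append(str(parse_number(current)))
    return "".join(parts)


print(normalize_compound("एक सौ दस बटा सत्तर"))  # 110/70
print(normalize_compound("ग्यारह दशमलव पाँच"))    # 11.5
```

The sketch refuses to merge two adjacent number words rather than guessing, which keeps a misheard phrase from silently turning into a plausible-looking vital sign.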
|
| 84 |
|
|
|
|
| 94 |
|
| 95 |
Two reproduction paths. Pick by available hardware.
|
| 96 |
|
| 97 |
+
**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-runtime.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. The slim `requirements-runtime.txt` covers the serving stack (Ollama client + faster-whisper + FastAPI); PyTorch / Unsloth / bitsandbytes from the comprehensive `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify Sakhi's engineering claims (function calling, normalization, 6-layer validation, schema correctness).
|
| 98 |
|
| 99 |
**Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
|
| 100 |
1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
|
|
|
|
| 104 |
5. Open Sakhi → Field Mode → On-Device Probe → **Import model (.zip)** → pick the zip from the system file picker. Wait ~3-5 minutes for extraction (progress bar + log card show live file count and MB written). Re-imports auto-evict the previous model — no manual cleanup, no risk of 12 GB accumulation.
|
| 105 |
6. **Load Model** → **Test Hindi** to confirm inference works.
|
| 106 |
|
| 107 |
+
**Sakhi does not redistribute the Cactus model.** It is gated under a custom Cactus-Compute license; hosting it on a public Drive link would violate that gating. The in-app SAF import flow exists precisely so reviewers who DO want to reproduce on-device can do so without the project needing to host the weights, and without needing developer mode or adb on their phone. The [3-minute demo video](https://youtu.be/n-u7J1lljUg) shows the full flow on a real phone, so the on-device claim can be verified without anyone needing to install the model themselves.
|
| 108 |
|
| 109 |
## Safety & Limitations
|
| 110 |
|
|
|
|
| 114 |
|
| 115 |
**What it can miss:** Danger signs not discussed in conversation, subtle clinical findings that require physical examination, conditions that present atypically. The system cannot observe — it can only reason about what was spoken.
|
| 116 |
|
| 117 |
+
**False positive controls:** The 6-layer anti-hallucination pipeline filters ungrounded danger signs. On the test suite, normal visits produce zero false alarms.
|
| 118 |
|
| 119 |
**Human-in-the-loop:** Every referral decision is presented to the ANM/medical officer at the health center for review before action. The tool accelerates information flow from field to facility — it does not replace clinical judgment.
|
| 120 |
|
| 121 |
+
**Known limitations** (full root-cause walkthroughs in [FAILURES.md](FAILURES.md)):
|
| 122 |
+
|
| 123 |
+
- **On-device latency.** Field-mode text-in extraction takes ~5 min on a Snapdragon 8+ Gen 1 — versus ~15–25 s on the workstation path. The use case is asynchronous: kick off at the end of a visit, the form is ready by the next stop. Live consultation runs on the workstation path.
|
| 124 |
+
- **Long-clip BP drop.** Whisper-Large CT2 reliably recovers BP `160/110` only when the speaker pauses ~0.5 s around `बटा` (the Hindi "over" separator). At conversational pacing on long clips, the number can drop while the surrounding "बहुत हाई है" framing is preserved; the danger panel still flags severe-hypertension from the qualitative phrase.
|
| 125 |
+
- **Eval-rubric scope.** The 15/15 quality score is asserted against per-case `hallucination_traps` lists — the specific fields that MUST be null for that input — not a whole-schema null-everywhere check. The ANC preeclampsia case has a misclassification not on its trap list: `pregnancy.previous_complications` (a prior-history field) gets populated with current-visit symptoms. The danger panel and referral decision are unaffected. The schema-description fix touches all four visit schemas and would require a full eval re-run; that re-run did not land here.
|
| 126 |
+
- **Synthetic training data + partial real-voice eval.** The 1,154 fine-tune examples and 15-case automated eval suite are LLM-generated Hindi conversations with gTTS audio. Real-voice testing to date covers 4 recordings × 2 speakers (1 female Bareilly reader + 1 male self-record) × 3 of 4 role-play scripts (ANC preeclampsia, PNC Day-7, child diarrhea — see Test Results for details and fixes that came out of it). Rural female ASHA accents, regional dialects, and field background noise are not yet covered.
|
| 127 |
+
- **Regional dialect coverage.** Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, and code-switched Marwari/Bhili speech are not validated. ASHA workers in those regions would need targeted evaluation before deployment.
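The eval-rubric scope limitation listed above is easier to see in code. The following is a minimal sketch of a per-case trap check, assuming dotted field paths and a nested-dict form; the helper names and the example trap list are hypothetical and not taken from the eval scripts.

```python
def get_path(form, dotted_path):
    """Follow a dotted path such as 'vitals.hemoglobin'; None if any key is missing."""
    node = form
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def is_empty(value):
    """Treat None and empty strings/containers as 'not populated'."""
    return value is None or value == "" or value == [] or value == {}


def violated_traps(form, hallucination_traps):
    """Return the trap fields the model populated even though they must be null."""
    return [path for path in hallucination_traps if not is_empty(get_path(form, path))]


extracted = {
    "vitals": {"bp": "160/110", "hemoglobin": None},
    "pregnancy": {"previous_complications": ["headache", "blurred vision"]},
}
traps = ["vitals.hemoglobin"]  # hypothetical trap list for this case
print(violated_traps(extracted, traps))  # []
```

The case passes because `pregnancy.previous_complications` is not on the trap list, even though it is populated with current-visit symptoms. Adding that path to the list would make the same check report it.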
|
| 128 |
+
|
| 129 |
+
## Privacy & Data Handling
|
| 130 |
+
|
| 131 |
+
Sakhi is designed so the audio and transcript of a patient visit never cross the boundary of the institution that owns it.
|
| 132 |
+
|
| 133 |
+
- **Workstation mode.** ASR + LLM extraction run on the PHC's GPU. Audio uploads from the phone travel over local WiFi LAN to `http://<workstation>:8000`, are processed in memory, and the response goes back to the phone. No third-party API call. No telemetry. No analytics.
|
| 134 |
+
- **Field mode (on-device).** Hindi text → form extraction runs entirely on the phone via Gemma 4 E2B on Cactus SDK; the on-device path is fully offline and airplane mode does not break it. Voice captured in field mode persists to phone-local IndexedDB and is posted only to the configured workstation LAN endpoint at sync time.
|
| 135 |
+
- **No external LLMs.** Gemma 4 weights (E4B on Ollama, E2B INT4 on Cactus) are local. No OpenAI, Anthropic, or Google Cloud API key is required or used anywhere in the pipeline.
|
| 136 |
+
- **Data minimization at the boundary.** Patient demographics enter as a typed header — never extracted from audio — so identifiers do not need to round-trip through ASR + LLM layers.
|
| 137 |
+
- **DPDP Act alignment.** This deployment posture is compatible with India's Digital Personal Data Protection Act, 2023 — data fiduciary stays within the institution, no cross-border transfer, purpose limitation enforced by architecture rather than by policy.
|
| 138 |
+
|
| 139 |
+
The public HuggingFace Space referenced below exists for reviewer convenience only; production deployments would run the workstation stack on PHC-owned hardware.
|
| 140 |
|
| 141 |
## Deployment Model
|
| 142 |
|
|
|
|
| 158 |
**Three access points, same backend schema:**
|
| 159 |
|
| 160 |
1. **Workstation browser** — ANM/medical officer at the health center opens `http://localhost:8000` (or `http://<LAN-IP>:8000` from any workstation on the WiFi). FastAPI serves the built React UI at `/` and the pipeline endpoints at `/api/*`. One command (`python api.py`) starts everything.
|
| 161 |
+
2. **Phone, health-center mode** — APK records and posts to workstation's `:8000` over WiFi. Workstation runs Whisper-Large ASR + E4B Q4_K_M with native function calling. The on-device path (mode 3 below) is text-in only and uses plain-JSON output instead of function calling — workstation mode is the higher-fidelity path of the two.
|
| 162 |
3. **Phone, field mode** — APK offers two offline paths. **(a)** Record audio during home visits — chunks stored crash-safely in IndexedDB every 5 s. Queued recordings sync to the health-center workstation when back on WiFi for full Whisper + E4B processing. **(b)** Type a short Hindi note in the "on-device text → form" card; the full extraction + danger-sign pipeline runs on the phone via Gemma 4 E2B on Cactus SDK. No network required. Total on-device pipeline latency ≈ 5 min on Snapdragon 8+ Gen 1 — suited for "tap and wait" use, not real-time.
|
| 163 |
|
| 164 |
**Crash-safe recording (Field Mode):** audio chunks are persisted to IndexedDB every 5 seconds during a recording. If the browser tab closes, the phone locks, or the app is killed mid-visit, the chunks survive — on reopen, an orange recovery banner offers to reassemble the partial recording.
|
|
|
|
| 192 |
- Covers 0–999 Hindi number words + Whisper misspelling variants
|
| 193 |
- Compound values (BP, weight, Hb), decimal points, fractions
|
| 194 |
|
| 195 |
+
**Real-voice validation:** 4 recordings, 2 speakers, 3 of 4 role-play scripts
|
| 196 |
+
- Speakers: 1 female (Bareilly reader, WhatsApp audio over phone mic) + 1 male (self-record, OnePlus 11R mic). Scripts covered: ANC preeclampsia, PNC Day-7 normal, child diarrhea. Script #1 ANC normal not yet recorded.
|
| 197 |
+
- Five normalizer/detector bugs surfaced and fixed from this round (commit `d2d987d`):
|
| 198 |
+
- `बीबी → BP` — Whisper mishears BP as `बीबी` in fast speech; medical-terms normalizer now maps it.
|
| 199 |
+
- `parse_hindi_number` no longer over-merges adjacent digits — `दो तीन` stays `2 3` (was `5`), `एक सौ सौ` stays `100 100` (was `10000`).
|
| 200 |
+
- Visit-type detector dropped `बच्चे को` from child-health keywords — was misrouting the ANC preeclampsia warning `तुम्हारा और बच्चे को खतरा हो सकता है` to child_health.
|
| 201 |
+
- Preeclampsia diagnosis name (`प्रीक्लिम्सिया`) maps to the symptom triad when the LLM emits the diagnosis instead of the underlying symptoms.
|
| 202 |
+
- `सूज` verb stem added to swelling-face/hands danger keywords.
|
| 203 |
+
- BP extraction confirmed on short clips with deliberate prosody around `बटा`. On long conversational-pacing clips the numeric value can drop while the danger framing (`BP बहुत हाई है`) survives — the danger panel still flags severe-hypertension on the qualitative phrase. Root-cause walkthrough in [FAILURES.md](FAILURES.md).
|
| 204 |
+
- The patient-name misclassification observed on the child-diarrhea recording (LLM grabbed the child's name into the mother field) is sidestepped by the ASHA-entered metadata header — patient identifiers never depend on ASR.
|
| 205 |
+
- Full-corpus real-audio evaluation (all 4 scripts × multiple speakers under field conditions) is the next eval lift.
|
| 206 |
+
|
| 207 |
## Fine-Tuning (Unsloth Track)
|
| 208 |
|
| 209 |
+
The track deliverables are a reproducible LoRA pipeline on RTX 5070 Ti / Blackwell, a Windows GGUF-export workaround for Unsloth's Gemma 4 mmap failure, and an A/B against base. The fine-tuned model did not beat base on pass-rate; base ships in the live pipeline.
|
| 210 |
|
| 211 |
+
**Pipeline (`scripts/train_unsloth.py`)** — one command, end-to-end: data prep → LoRA training → adapter saved → GGUF export → Ollama register → auto-eval vs base. Training set: 1,154 synthetic ASHA visit examples (981 train / 173 val) covering all 4 visit types and 458 positive danger sign cases. Hyperparameters: LR 5e-5, 1 epoch, LoRA r=16 / alpha=32, dropout 0.05.
|
| 212 |
|
| 213 |
+
**Windows GGUF-export workaround (`scripts/export_merge.py`)** — Unsloth's bundled GGUF export path hits an mmap failure on Windows for Gemma 4 architectures. The workaround loads base + adapter via `transformers`, computes `delta_W = (B @ A) * (alpha / r)` per LoRA pair, merges, then runs `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize Q4_K_M`. Reproducible without WSL or a Linux dual-boot.
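The merge step in that workaround reduces to one matrix update per LoRA pair. Below is a minimal pure-Python sketch of `W + (B @ A) * (alpha / r)` on toy nested lists. It is not `scripts/export_merge.py`: the function names are hypothetical, the shapes are assumed to be `W: (out, in)`, `B: (out, r)`, `A: (r, in)`, and a real merge would operate on tensors loaded through `transformers`.

```python
def matmul(left, right):
    """Plain nested-list matrix product: (m x k) @ (k x n) -> (m x n)."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (B @ A) * (alpha / r) without modifying the inputs."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with rank 1; alpha / r = 2, the same ratio as r=16, alpha=32.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # (out=2, r=1)
lora_a = [[3.0, 4.0]]     # (r=1, in=2)
print(merge_lora(base, lora_a, lora_b, alpha=2, r=1))
# [[7.0, 8.0], [12.0, 17.0]]
```

After the merge the adapter is no longer needed at inference time, which is what lets the merged checkpoint go straight into the GGUF conversion and quantization steps.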
|
| 214 |
|
| 215 |
+
**A/B vs base** (full numbers in `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`):
|
| 216 |
+
- **Pass rate:** base 15/15 vs fine-tune 14/15. The single fine-tune failure is on heavy Hinglish code-switching where the fine-tune over-refers (a safer failure mode, still a failure).
|
| 217 |
+
- **Latency:** base 18.7 s vs fine-tune 19.0 s on average, effectively tied.
|
| 218 |
+
- **Schema normalization:** fine-tune translates Hindi symptom phrases into English schema labels (`दस्त` → `Diarrhea`, `चक्कर आ रहे हैं` → `dizziness`). Base retains raw Hindi.
|
| 219 |
+
- **Field coverage:** fine-tune recovers 2 visit-type-specific fields the base misses (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`); base recovers 11 fields the fine-tune leaves null.
|
| 220 |
|
| 221 |
+
**Root cause of the over-referral failure.** The 1,154-example training distribution had Hinglish code-switching disproportionately co-occurring with danger cases, so the LoRA learned `English-in-Hindi-sentence` as a mild danger signal. Documented in [FAILURES.md](FAILURES.md). The base model is in the live Ollama path; the fine-tune is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi), and `ollama pull tusharbrisingr9802/sakhi` fetches it to verify the A/B locally. The fine-tune suits deployments that prefer English schema-label normalization over raw Hindi.
|
| 222 |
|
| 223 |
## Frontend
|
| 224 |
|
|
|
|
| 242 |
# Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
|
| 243 |
|
| 244 |
# ── Health-center deployment (workstation, unified UI + API) ──
|
| 245 |
+
pip install -r requirements-runtime.txt # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
|
| 246 |
ollama pull gemma4:e4b-it-q4_K_M # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
|
| 247 |
cd frontend && npm install && npm run build && cd ..
|
| 248 |
python api.py
|
|
|
|
| 302 |
|
| 303 |
**Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
|
| 304 |
|
| 305 |
+
**Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the [3-minute demo video](https://youtu.be/n-u7J1lljUg), or follow Path 1 above to run locally — the live Space exists for convenience. Local Path 1 (or the test scripts in `scripts/`) is the evaluation path.
|
| 306 |
|
| 307 |
### How it's deployed
|
| 308 |
|
|
|
|
| 310 |
|
| 311 |
- `Dockerfile` — two-stage build: Node 20 builds `frontend/dist`, CUDA 12.2 + cuDNN 8 runtime installs Ollama + Python deps and copies the dist in.
|
| 312 |
- `entrypoint.sh` — starts the Ollama daemon, waits for its API, pulls `gemma4:e4b-it-q4_K_M` if absent, then `exec uvicorn api:app`.
|
| 313 |
+
- `requirements-runtime.txt` — slim runtime deps (faster-whisper, fastapi, uvicorn, ollama). No Unsloth / PyTorch / bitsandbytes — they're training-side only. Used by both the HF Space Docker build and local Path 1 installs.
|
| 314 |
- `.dockerignore` — keeps the build context small (no `models/`, no `data/recordings/`, no `frontend/node_modules`, no `cactus-src/`, etc.).
|
| 315 |
- README YAML frontmatter — `sdk: docker`, `app_port: 7860`. HF Space picks this up on push.
|
| 316 |
|
|
|
|
| 349 |
configs/schemas/ # 5 JSON schemas (ANC, PNC, delivery, child health, danger signs)
|
| 350 |
Dockerfile # HF Space build: Node frontend + CUDA runtime + Ollama
|
| 351 |
entrypoint.sh # HF Space container init: ollama serve → pull model → uvicorn
|
| 352 |
+
requirements-runtime.txt # Slim runtime deps (no Unsloth/PyTorch — Ollama serves inference)
|
| 353 |
frontend/
|
| 354 |
src/App.jsx # React app — all 5 tabs, on-device text-in card + Cactus probe in Field Mode
|
| 355 |
src/offlineQueue.js # IndexedDB offline queue + crash-safe chunk persistence
|
RETRAIN_RESULTS.md
CHANGED
|
@@ -11,11 +11,13 @@
|
|
| 11 |
| gemma4:e4b-it-q4_K_M (base) | 15/15 |
|
| 12 |
| sakhi:latest (fine-tuned) | 14/15 |
|
| 13 |
|
| 14 |
## Verdict
|
| 15 |
|
| 16 |
**Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**
|
| 17 |
|
| 18 |
-
The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is
|
| 19 |
|
| 20 |
## Diagnostics
|
| 21 |
|
|
|
|
| 11 |
| gemma4:e4b-it-q4_K_M (base) | 15/15 |
|
| 12 |
| sakhi:latest (fine-tuned) | 14/15 |
|
| 13 |
|
| 14 |
+
**Reproduce:** `ollama pull tusharbrisingr9802/sakhi` to fetch the fine-tune; `ollama cp tusharbrisingr9802/sakhi:latest sakhi:latest` so the eval script picks it up under the local tag it expects. Then `python scripts/test_ollama_quality.py`.
|
| 15 |
+
|
| 16 |
## Verdict
|
| 17 |
|
| 18 |
**Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**
|
| 19 |
|
| 20 |
+
The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
|
| 21 |
|
| 22 |
## Diagnostics
|
| 23 |
|