title: Sakhi
emoji: 🩺
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
short_description: Hindi voice → ASHA government health forms (Gemma 4)

Sakhi (सखी) — Voice-to-Form for ASHA Workers

Offline-first tool that converts Hindi home visit conversations into structured government health forms and real-time referral decisions for India's 1 million+ ASHA health workers.

Competition: Gemma 4 Good Hackathon ($200K prize pool)

Tracks: Health & Sciences | Ollama | Unsloth | Cactus (Android APK)

Partner frameworks: Gemma 4 (E2B + E4B), Cactus SDK (on-device Android), Ollama (workstation GPU), Unsloth (LoRA fine-tune), Whisper (Hindi ASR via CTranslate2)

▶ Watch the 3-min demo: youtu.be/n-u7J1lljUg — full submission video: problem framing, workstation voice-to-form path, on-device Hindi text-to-form on a phone in airplane mode, four tracks claimed.

▶ Try it live: https://huggingface.co/spaces/Tushar9802/sakhi — the Path 1 workstation stack (FastAPI + Ollama + Whisper) running on an HF Space T4. Same UI, same endpoints; no install needed. ~5 min cold-boot wait after idle — see Public Demo for details.

▶ Pull the Unsloth fine-tune: tusharbrisingr9802/sakhi on the Ollama registry — ollama pull tusharbrisingr9802/sakhi fetches the LoRA-fine-tuned Gemma 4 E4B behind the A/B numbers below. The base model (gemma4:e4b-it-q4_K_M) is what ships in the live pipeline; this is the side-by-side comparison artifact for the Unsloth track.

Problem

India's ASHA workers conduct 50M+ maternal/child health home visits per year across rural areas. Every visit ends with paper forms filled from memory, then physically carried to the Primary Health Center. Danger signs observed in the field — preeclampsia, postpartum hemorrhage, neonatal distress — often never reach the system in time for intervention.

Solution

Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:

  • Health-center mode (workstation + E4B via Ollama) — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. End-to-end latency ~15–25 s on an RTX 5070 Ti or T4. This is the primary voice-to-form path.
  • Field mode (phone) has two offline sub-paths:
    • Record now, sync later — ASHA records audio during home visits; chunks persist to IndexedDB every 5 s (crash-safe). When the phone is back on health-center WiFi, the queued recordings post to the workstation for full Whisper + E4B processing. On-device ASR is not attempted — see the section below for why.
    • Type a note for instant on-device extraction — for when the ASHA wants structured output right now without network. A short Hindi note in a textarea runs through the full pipeline (normalize → detect visit type → extract form → detect danger signs) entirely on-device via Gemma 4 E2B INT4 on the Cactus SDK. Same schema, same validation as the workstation path. Pipeline latency is ≈ 5 min on a Snapdragon 8+ Gen 1 phone. For comparison: the paper-form baseline is 15–20 min of hand-filling from memory, then a walk to the PHC, then clinician review hours-to-days later. The on-device path works with zero network and zero shared infrastructure.
Workstation path:
[Hindi Audio] → Whisper ASR → Hindi Normalization → Gemma 4 E4B (function calling)
                                                      ├── extract_form()      → structured MCTS JSON
                                                      ├── flag_danger_sign()  → per-sign with utterance evidence
                                                      └── issue_referral()    → urgency + facility + reasoning

On-device path (text-in):
[Hindi Text] → Hindi Normalization → Visit-type detect → Gemma 4 E2B (plain JSON)
                                                          ├── extract_form     → null-filled template filled in
                                                          └── detect_danger    → danger_signs + referral_decision

Why not voice-to-form on-device too?

The on-device voice path does not work well enough yet for clinical Hindi. Cactus's transcribe API supports Whisper / Moonshine / Parakeet only (Gemma 4's audio conformer is for voice understanding in multimodal chat, not dedicated ASR). Cactus ships multilingual Whisper INT4 weights, but no Hindi-specific checkpoint — and published benchmarks (Kumar et al. 2025, ASR Under the Stethoscope; Vistaar / Gramvaani corpus evaluations) show off-the-shelf Whisper on spontaneous rural Hindi hits 27% WER at best and 70%+ on clinical content, with substantial variability tied to speaker role / gender / code-mixing and a deletion-dominant error profile that silently drops numbers and symptoms. For an ASHA decision-support tool where a missed BP reading is a clinical harm, an on-device voice path is not in this submission. Record-and-sync with Whisper-Large on the workstation handles voice-in; the on-device LLM handles Hindi text understanding only.

Function Calling

The pipeline uses Gemma 4's native function calling through Ollama's tools= parameter. A single LLM call invokes up to three tools:

| Tool | Purpose | When called |
|---|---|---|
| extract_form | Fill visit-specific MCTS/HMIS schema with structured data | Every conversation |
| flag_danger_sign | Flag one NHM-defined danger sign with verbatim utterance evidence | Only when danger signs are present |
| issue_referral | Referral decision with urgency, facility level, and clinical reasoning | Only when danger signs warrant referral |

On a normal visit, only extract_form is called. On a high-risk visit (e.g., preeclampsia), the model calls all three — extract_form + multiple flag_danger_sign calls + issue_referral — in a single inference pass.

The pipeline uses a hybrid design: form extraction via format="json" (proven precision on structured schemas) and danger sign detection via native function calling. The model decides whether to flag danger signs and issue referrals — tool calls surface in the API response as tool_calls metadata.
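A minimal sketch of how the three tools could be declared and how returned tool calls might be routed. The parameter names (`visit_type`, `form`, `sign`, `evidence`, `urgency`, `facility`, `reasoning`) and the `make_tool` / `route_tool_calls` helpers are assumptions for illustration, not the definitions in app.py. The dict layout follows the OpenAI-style function schema that Ollama's tools= parameter accepts, and the router assumes dict-shaped tool calls with already-parsed arguments.

```python
from typing import Any


def make_tool(name: str, description: str,
              properties: dict[str, Any], required: list[str]) -> dict[str, Any]:
    """Build one OpenAI-style function schema entry."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical parameter sets; the real schemas live in the project.
TOOLS = [
    make_tool("extract_form", "Fill the visit-specific form.",
              {"visit_type": {"type": "string"}, "form": {"type": "object"}},
              ["visit_type", "form"]),
    make_tool("flag_danger_sign", "Flag one danger sign with verbatim evidence.",
              {"sign": {"type": "string"}, "evidence": {"type": "string"}},
              ["sign", "evidence"]),
    make_tool("issue_referral", "Issue a referral decision.",
              {"urgency": {"type": "string"}, "facility": {"type": "string"},
               "reasoning": {"type": "string"}},
              ["urgency", "facility", "reasoning"]),
]


def route_tool_calls(tool_calls: list[dict[str, Any]]) -> dict[str, Any]:
    """Collect the calls from one inference pass into a single result.

    flag_danger_sign may appear several times; the other two at most once.
    Unknown tool names are ignored rather than raising.
    """
    result: dict[str, Any] = {"form": None, "danger_signs": [], "referral": None}
    for call in tool_calls:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        if name == "extract_form":
            result["form"] = args
        elif name == "flag_danger_sign":
            result["danger_signs"].append(args)
        elif name == "issue_referral":
            result["referral"] = args
    return result
```

In this sketch `TOOLS` would be passed as the `tools=` argument of the chat request, and the `tool_calls` list from the response message would go to `route_tool_calls`. A normal visit then yields a form with an empty `danger_signs` list and no referral.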

Architecture

| Component | Model | Size | Role | Deployment |
|---|---|---|---|---|
| ASR (workstation path only) | collabora/whisper-large-v2-hindi (served as the CTranslate2 mirror Tushar9802/whisper-large-v2-hindi-ct2; faster-whisper requires CT2 format) | ~1.5 GB | Hindi speech → text via faster-whisper/CTranslate2 | Workstation |
| Normalization | src/hindi_normalize.py | | Hindi number words → digits, medical term mapping | Shared (Python server-side; JS port for phone) |
| Clinical Extraction (health-center mode, audio-in) | Gemma 4 E4B (Q4_K_M via Ollama) | ~5 GB | Function calling: form extraction + danger signs + referral | Workstation (GPU) |
| Clinical Extraction (field mode, text-in) | Gemma 4 E2B (INT4 via Cactus SDK) | ~4.4 GB download / ~6.3 GB on-device extracted (multimodal package includes audio + vision encoders that the text-in path does not use) | Same extraction schema, plain-JSON mode (E2B INT4 does not reliably emit OpenAI-style tool_calls) | Android (ARM, Snapdragon 7+ Gen 1 or newer, 8 GB RAM, ~7 GB free storage for the one-time install) |

Patient demographics enter as a header, not from the audio. Every clinical EMR works this way: identifiers typed once at intake, the conversation handled separately. The ASHA fills name / age / sex / mobile / ASHA-ID / visit-date in the header above the record button, and the LLM only extracts what was said during the visit — symptoms, vitals, counselling, next-visit date. This avoids a failure mode surfaced in real-voice testing: Whisper-Hindi sometimes mishears patient names as different Hindi words, and a downstream LLM has no prior on what the name should be. Same merge logic runs on all three paths — apply_metadata in app.py for workstation audio and text, mirrored as a pure JS function in pipeline.js for on-device Cactus extraction — so server and phone produce identical envelopes for the same input. ANC fills patient.{name, age, mobile}; child_health fills child.{name, age_months, sex} with year→month conversion; PNC and delivery have no patient sub-object in their form, so the metadata travels in the response envelope only. asha_id is sticky across sessions via localStorage. For Field-mode recordings, the header is captured at record-start so later edits don't pollute earlier queue entries.
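The merge rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in app.py: the envelope shape (`visit_type`, `metadata`, `form`), the `"anc"` key, and the header field names are hypothetical, and the header age is assumed to be in years.

```python
import copy
from typing import Any


def apply_metadata(form: dict[str, Any], visit_type: str,
                   header: dict[str, Any]) -> dict[str, Any]:
    """Merge the typed header into the extracted form and wrap both in an envelope.

    The header always travels in the envelope. It is also copied into the form
    for visit types whose schema has a matching sub-object.
    """
    merged = copy.deepcopy(form)  # never mutate the caller's form
    if visit_type == "anc":
        patient = merged.get("patient") or {}
        for key in ("name", "age", "mobile"):
            if header.get(key) is not None:
                patient[key] = header[key]  # header wins over anything the LLM extracted
        merged["patient"] = patient
    elif visit_type == "child_health":
        child = merged.get("child") or {}
        for key in ("name", "sex"):
            if header.get(key) is not None:
                child[key] = header[key]
        if header.get("age") is not None:
            # Year to month conversion for the child schema.
            child["age_months"] = int(round(float(header["age"]) * 12))
        merged["child"] = child
    # PNC and delivery: no patient sub-object, so the form is left untouched.
    return {"visit_type": visit_type, "metadata": dict(header), "form": merged}
```

Because the header overwrites the extracted value, a misheard name from ASR cannot survive into the final form for ANC or child-health visits.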

Hindi number normalization: Algorithmic parser covering all 0–999 Hindi number words with Whisper misspelling variants. Handles compound medical values: "एक सौ दस बटा सत्तर" → "110/70", "ग्यारह दशमलव पाँच" → "11.5", "तीन किलो दो सौ ग्राम" → "3.2 kg".
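A minimal sketch of the idea, assuming a tiny lexicon and whitespace tokenization. It is not the parser in src/hindi_normalize.py: the function names are hypothetical, only a handful of number words are included, and unit handling such as "3.2 kg" and Whisper misspelling variants are left out.

```python
NUMBER_WORDS = {"एक": 1, "दो": 2, "तीन": 3, "पाँच": 5,
                "दस": 10, "ग्यारह": 11, "सत्तर": 70}
HUNDRED = "सौ"
SEPARATORS = {"बटा": "/", "दशमलव": "."}


def words_to_digits(tokens: list[str]) -> list[str]:
    """Replace runs of number words with digit strings; keep other tokens."""
    out: list[str] = []
    current = None    # number currently being accumulated
    can_add = False   # True only right after "सौ": one sub-100 value may be added

    def flush() -> None:
        nonlocal current, can_add
        if current is not None:
            out.append(str(current))
        current, can_add = None, False

    for tok in tokens:
        if tok in NUMBER_WORDS:
            value = NUMBER_WORDS[tok]
            if current is not None and can_add:
                current += value          # e.g. 100 + 10
                can_add = False
            else:
                flush()                   # adjacent numbers stay separate
                current = value
        elif tok == HUNDRED:
            if current is not None and current < 10:
                current *= 100            # e.g. 1 followed by hundred
            else:
                flush()
                current = 100
            can_add = True
        else:
            flush()
            out.append(tok)
    flush()
    return out


def join_separators(tokens: list[str]) -> list[str]:
    """Join digit, separator word, digit into one compound value."""
    result: list[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (tok in SEPARATORS and result and result[-1].isdigit()
                and i + 1 < len(tokens) and tokens[i + 1].isdigit()):
            result[-1] = result[-1] + SEPARATORS[tok] + tokens[i + 1]
            i += 2
        else:
            result.append(tok)
            i += 1
    return result


def normalize_hindi_numbers(text: str) -> str:
    return " ".join(join_separators(words_to_digits(text.split())))
```

The `can_add` flag is what keeps adjacent numbers from merging: a value is only added to the running total immediately after "सौ", so two plain number words in a row are emitted as two separate numbers.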

Anti-hallucination pipeline (6 layers):

  1. Evidence length filter — danger signs with <10 char evidence dropped
  2. Generic ASHA phrase blocklist — "कोई तकलीफ़ हो तो फ़ोन कर दीजिए" etc. filtered
  3. Normal value filter — strips signs citing "110/70", "बिल्कुल ठीक", "सामान्य"
  4. Transcript grounding — evidence must appear verbatim in the transcript
  5. Deduplication across overlapping danger signs
  6. Form validation — strips invented names (दीदी/बहन patterns), default ages, phantom lab results; range checks on BP (60–250/30–150), Hb (3–20), weight (1–200), gestational weeks (1–45)
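The layers above can be sketched as a pair of filters. This is an illustration under assumptions, not the project's validation code: the `sign` / `evidence` keys, the flat vitals field names (`bp`, `hb`, `weight_kg`, `gestational_weeks`), and deduplication by sign name are all hypothetical simplifications, and the name, default-age, and phantom-lab checks of layer 6 are omitted.

```python
import re
from typing import Any

MIN_EVIDENCE_CHARS = 10
GENERIC_PHRASES = ("कोई तकलीफ़ हो तो फ़ोन कर दीजिए",)
NORMAL_MARKERS = ("110/70", "बिल्कुल ठीक", "सामान्य")
# (low, high) inclusive ranges taken from the list above.
RANGES = {"hb": (3, 20), "weight_kg": (1, 200), "gestational_weeks": (1, 45)}


def filter_danger_signs(signs: list[dict[str, Any]], transcript: str) -> list[dict[str, Any]]:
    """Apply layers 1 to 5 and return only grounded, unique danger signs."""
    kept: list[dict[str, Any]] = []
    seen: set[str] = set()
    for sign in signs:
        evidence = (sign.get("evidence") or "").strip()
        if len(evidence) < MIN_EVIDENCE_CHARS:               # 1. too short
            continue
        if any(p in evidence for p in GENERIC_PHRASES):      # 2. generic ASHA phrase
            continue
        if any(m in evidence for m in NORMAL_MARKERS):       # 3. cites a normal value
            continue
        if evidence not in transcript:                       # 4. not verbatim in transcript
            continue
        name = sign.get("sign", "")
        if name in seen:                                     # 5. duplicate sign
            continue
        seen.add(name)
        kept.append(sign)
    return kept


def validate_vitals(form: dict[str, Any]) -> dict[str, Any]:
    """Range-check part of layer 6: out-of-range values are reset to None."""
    cleaned = dict(form)
    for field, (low, high) in RANGES.items():
        value = cleaned.get(field)
        if value is not None and not (low <= value <= high):
            cleaned[field] = None
    bp = cleaned.get("bp")
    if bp is not None:
        match = re.fullmatch(r"(\d+)/(\d+)", str(bp))
        ok = bool(match) and 60 <= int(match[1]) <= 250 and 30 <= int(match[2]) <= 150
        if not ok:
            cleaned["bp"] = None
    return cleaned
```

The ordering is cheap-first: length and blocklist checks run before the transcript substring search, and deduplication runs last so that a rejected candidate never blocks a later, better-grounded one for the same sign.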

Reproducing the demo

Two reproduction paths. Pick by available hardware.

Path 1 — workstation, ~5 minutes (recommended for reviewers). Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are pip install -r requirements-runtime.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py then open http://localhost:8000. The slim requirements-runtime.txt covers the serving stack (Ollama client + faster-whisper + FastAPI); PyTorch / Unsloth / bitsandbytes from the comprehensive requirements.txt are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify Sakhi's engineering claims (function calling, normalization, 6-layer validation, schema correctness).

Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track). Requires accepting the Cactus-Compute model license. Steps:

  1. Accept terms at huggingface.co/Cactus-Compute/gemma-4-E2B-it (1 min, free HF account).
  2. Download gemma-4-e2b-it-int4.zip (~4.4 GB) from that page.
  3. Build + install the APK (./gradlew assembleDebug && adb install -r ...), or take the prebuilt APK from the GitHub Release.
  4. Transfer the zip to the phone's Downloads/ folder via USB MTP or USB-OTG drive. (WhatsApp won't work — 2 GB cap. Drive download to phone is fine if the file lands locally rather than streaming.)
  5. Open Sakhi → Field Mode → On-Device Probe → Import model (.zip) → pick the zip from the system file picker. Wait ~3-5 minutes for extraction (progress bar + log card show live file count and MB written). Re-imports auto-evict the previous model — no manual cleanup, no risk of 12 GB accumulation.
  6. Tap Load Model → Test Hindi to confirm inference works.

Sakhi does not redistribute the Cactus model. It is gated under a custom Cactus-Compute license; hosting it on a public Drive link would violate that gating. The in-app SAF import flow exists precisely so reviewers who DO want to reproduce on-device can do so without the project needing to host the weights, and without needing developer mode or adb on their phone. The 3-minute demo video shows the full flow on a real phone, so the on-device claim can be verified without anyone needing to install the model themselves.

Safety & Limitations

Sakhi is a decision-support tool, not a diagnostic system. All outputs require human review.

What it catches: Danger signs with explicit conversational evidence — elevated BP with symptoms, severe bleeding, neonatal distress indicators. The model only flags what was said in the conversation, grounded by verbatim utterance quotes.

What it can miss: Danger signs not discussed in conversation, subtle clinical findings that require physical examination, conditions that present atypically. The system cannot observe — it can only reason about what was spoken.

False positive controls: The 6-layer anti-hallucination pipeline filters ungrounded danger signs. On the test suite, normal visits produce zero false alarms.

Human-in-the-loop: Every referral decision is presented to the ANM/medical officer at the health center for review before action. The tool accelerates information flow from field to facility — it does not replace clinical judgment.

Known limitations (full root-cause walkthroughs in FAILURES.md):

  • On-device latency. Field-mode text-in extraction takes ~5 min on a Snapdragon 8+ Gen 1 — versus ~15–25 s on the workstation path. The use case is asynchronous: kick off at the end of a visit, the form is ready by the next stop. Live consultation runs on the workstation path.
  • Long-clip BP drop. Whisper-Large CT2 reliably recovers BP 160/110 only when the speaker pauses ~0.5 s around बटा (the Hindi "over" separator). At conversational pacing on long clips, the number can drop while the surrounding "बहुत हाई है" framing is preserved; the danger panel still flags severe-hypertension from the qualitative phrase.
  • Eval-rubric scope. The 15/15 quality score is asserted against per-case hallucination_traps lists — the specific fields that MUST be null for that input — not a whole-schema null-everywhere check. The ANC preeclampsia case has a misclassification not on its trap list: pregnancy.previous_complications (a prior-history field) gets populated with current-visit symptoms. The danger panel and referral decision are unaffected. The schema-description fix touches all four visit schemas and would require a full eval re-run; that re-run did not land here.
  • Synthetic training data + partial real-voice eval. The 1,154 fine-tune examples and 15-case automated eval suite are LLM-generated Hindi conversations with gTTS audio. Real-voice testing to date covers 4 recordings from 2 speakers (1 female Bareilly reader + 1 male self-record) across 3 of 4 role-play scripts (ANC preeclampsia, PNC Day-7, child diarrhea); see Test Results for details and the fixes that came out of it. Rural female ASHA accents, regional dialects, and field background noise are not yet covered.
  • Regional dialect coverage. Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, and code-switched Marwari/Bhili speech are not validated. ASHA workers in those regions would need targeted evaluation before deployment.

Privacy & Data Handling

Sakhi is designed so the audio and transcript of a patient visit never cross the boundary of the institution that owns it.

  • Workstation mode. ASR + LLM extraction run on the PHC's GPU. Audio uploads from the phone travel over local WiFi LAN to http://<workstation>:8000, are processed in memory, and the response goes back to the phone. No third-party API call. No telemetry. No analytics.
  • Field mode (on-device). Hindi text → form extraction runs entirely on the phone via Gemma 4 E2B on Cactus SDK; the on-device path is fully offline and airplane mode does not break it. Voice captured in field mode persists to phone-local IndexedDB and is posted only to the configured workstation LAN endpoint at sync time.
  • No external LLMs. Gemma 4 weights (E4B on Ollama, E2B INT4 on Cactus) are local. No OpenAI, Anthropic, or Google Cloud API key is required or used anywhere in the pipeline.
  • Data minimization at the boundary. Patient demographics enter as a typed header — never extracted from audio — so identifiers do not need to round-trip through ASR + LLM layers.
  • DPDP Act alignment. This deployment posture is compatible with India's Digital Personal Data Protection Act, 2023 — data fiduciary stays within the institution, no cross-border transfer, purpose limitation enforced by architecture rather than by policy.

The public HuggingFace Space referenced below exists for reviewer convenience only; production deployments would run the workstation stack on PHC-owned hardware.

Deployment Model

Health Center (workstation, RTX GPU)              Field (Android phone)
┌────────────────────────────────────┐       ┌──────────────────────────────────┐
│  python api.py  →  :8000           │◄─────►│  Native APK (Capacitor + React)  │
│  ├── /api/*   — pipeline endpoints │  WiFi │  ├── Health-center mode:         │
│  └── /        — React UI (dist/)   │  LAN  │  │   POST audio to workstation :8000  │
│                                    │       │  └── Field mode (offline):       │
│  Whisper ASR (CTranslate2)         │       │      (a) record + IDB queue +    │
│  Gemma 4 E4B (Ollama)              │       │          later sync to :8000     │
│                                    │       │      (b) type Hindi note →       │
│  Desktop browser UI:               │       │          Cactus + Gemma 4 E2B    │
│  http://localhost:8000             │       │          on-device text→form     │
└────────────────────────────────────┘       └──────────────────────────────────┘

Three access points, same backend schema:

  1. Workstation browser — ANM/medical officer at the health center opens http://localhost:8000 (or http://<LAN-IP>:8000 from any workstation on the WiFi). FastAPI serves the built React UI at / and the pipeline endpoints at /api/*. One command (python api.py) starts everything.
  2. Phone, health-center mode — APK records and posts to workstation's :8000 over WiFi. Workstation runs Whisper-Large ASR + E4B Q4_K_M with native function calling. The on-device path (mode 3 below) is text-in only and uses plain-JSON output instead of function calling — workstation mode is the higher-fidelity path of the two.
  3. Phone, field mode — APK offers two offline paths. (a) Record audio during home visits — chunks stored crash-safely in IndexedDB every 5 s. Queued recordings sync to the health-center workstation when back on WiFi for full Whisper + E4B processing. (b) Type a short Hindi note in the "on-device text → form" card; the full extraction + danger-sign pipeline runs on the phone via Gemma 4 E2B on Cactus SDK. No network required. Total on-device pipeline latency ≈ 5 min on Snapdragon 8+ Gen 1 — suited for "tap and wait" use, not real-time.

Crash-safe recording (Field Mode): audio chunks are persisted to IndexedDB every 5 seconds during a recording. If the browser tab closes, the phone locks, or the app is killed mid-visit, the chunks survive — on reopen, an orange recovery banner offers to reassemble the partial recording.

Form Types

5 JSON schemas covering NHM/IMNCI protocol:

  • ANC (Antenatal Care) — pregnancy registration, vitals, TT/IFA, lab results, birth preparedness
  • Delivery — birth outcome, type (normal/C-section), infant details, complications, blood loss
  • PNC / HBNC — postnatal mother + newborn assessment (days 1–42), lactation, cord care
  • Child Health / HBYC — growth monitoring, immunization, developmental milestones, illness screening
  • Danger Signs — 10 maternal + 9 newborn danger sign checklist with mandatory utterance evidence, referral decision

Test Results

Text extraction quality (base Gemma 4 E4B): 15/15 tests pass (test_ollama_quality.py)

  • 4/4 visit types: ANC, PNC, delivery, child health
  • Zero false danger alarms on normal visits
  • Correct referral escalation on danger cases
  • Avg 18.7s per test (form + danger sign extraction)
  • The rubric is per-case: each test asserts a small list of hallucination_traps (fields that MUST be null for that input). It does not assert null-everywhere-not-mentioned across the full schema. See FAILURES.md for one known under-specified trap (pregnancy.previous_complications on ANC preeclampsia).
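A per-case trap check of this kind could look like the sketch below. The helper names and the dotted-path convention are assumptions for illustration; test_ollama_quality.py may be structured differently.

```python
from typing import Any


def get_path(data: Any, dotted: str) -> Any:
    """Walk a dotted path such as 'pregnancy.previous_complications'.

    Returns None when any segment is missing, so an absent field counts as null.
    """
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def failed_traps(form: dict[str, Any], traps: list[str]) -> list[str]:
    """Return the trap paths that the model populated but should have left null."""
    empty_values = (None, "", [], {})
    return [path for path in traps if get_path(form, path) not in empty_values]
```

A case passes its hallucination check when `failed_traps` returns an empty list. Fields that are not on the case's trap list are never inspected, which is how a misfilled field can slip through while the case still passes.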

End-to-end audio pipeline: 13/15 tests pass (87%) — test_pipeline_e2e.py

  • 15 synthetic Hindi audio samples through full pipeline
  • 2 failures are TTS→ASR artifacts on BP values (synthetic audio, not real-world). Root-cause walkthrough in FAILURES.md.
  • All visit types pass, all danger sign tests pass, all edge cases pass
  • Avg pipeline timing: ~15s per conversation (RTX 5070 Ti, warm Ollama, hybrid json+FC)

Hindi normalization: 133 tests pass (test_asr.py)

  • Covers 0–999 Hindi number words + Whisper misspelling variants
  • Compound values (BP, weight, Hb), decimal points, fractions

Real-voice validation: 4 recordings, 2 speakers, 3 of 4 role-play scripts

  • Speakers: 1 female (Bareilly reader, WhatsApp audio over phone mic) + 1 male (self-record, OnePlus 11R mic). Scripts covered: ANC preeclampsia, PNC Day-7 normal, child diarrhea. Script #1 ANC normal not yet recorded.
  • Five normalizer/detector bugs surfaced and fixed from this round (commit d2d987d):
    • बीबी → BP — Whisper mishears BP as बीबी in fast speech; medical-terms normalizer now maps it.
    • parse_hindi_number no longer over-merges adjacent digits — दो तीन stays 2 3 (was 5), एक सौ सौ stays 100 100 (was 10000).
    • Visit-type detector dropped बच्चे को from child-health keywords — was misrouting the ANC preeclampsia warning तुम्हारा और बच्चे को खतरा हो सकता है to child_health.
    • Preeclampsia diagnosis name (प्रीक्लिम्सिया) maps to the symptom triad when the LLM emits the diagnosis instead of the underlying symptoms.
    • सूज verb stem added to swelling-face/hands danger keywords.
  • BP extraction confirmed on short clips with deliberate prosody around बटा. On long conversational-pacing clips the numeric value can drop while the danger framing (BP बहुत हाई है) survives — the danger panel still flags severe-hypertension on the qualitative phrase. Root-cause walkthrough in FAILURES.md.
  • The patient-name misclassification observed on the child-diarrhea recording (LLM grabbed the child's name into the mother field) is sidestepped by the ASHA-entered metadata header — patient identifiers never depend on ASR.
  • Full-corpus real-audio evaluation (all 4 scripts × multiple speakers under field conditions) is the next eval lift.

Fine-Tuning (Unsloth Track)

The track deliverables are a reproducible LoRA pipeline on RTX 5070 Ti / Blackwell, a Windows GGUF-export workaround for Unsloth's Gemma 4 mmap failure, and an A/B against base. The fine-tuned model did not beat base on pass-rate; base ships in the live pipeline.

Pipeline (scripts/train_unsloth.py) — one command, end-to-end: data prep → LoRA training → adapter saved → GGUF export → Ollama register → auto-eval vs base. Training set: 1,154 synthetic ASHA visit examples (981 train / 173 val) covering all 4 visit types and 458 positive danger sign cases. Hyperparameters: LR 5e-5, 1 epoch, LoRA r=16 / alpha=32, dropout 0.05.

Windows GGUF-export workaround (scripts/export_merge.py) — Unsloth's bundled GGUF export path hits an mmap failure on Windows for Gemma 4 architectures. The workaround loads base + adapter via transformers, computes delta_W = (B @ A) * (alpha / r) per LoRA pair, merges, then runs llama.cpp/convert_hf_to_gguf.py + llama-quantize Q4_K_M. Reproducible without WSL or a Linux dual-boot.
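The merge formula can be illustrated on toy matrices. This is a pure-Python sketch of delta_W = (B @ A) * (alpha / r) only; the function names are hypothetical, and the actual script operates on model tensors loaded through transformers rather than nested lists.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (B @ A) * (alpha / r) for one LoRA pair.

    Shapes: W is (out x in), B is (out x r), A is (r x in).
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    # Toy rank-1 example; alpha / r = 2.0, the same scale as r=16, alpha=32.
    w = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0], [2.0]]
    a = [[3.0, 4.0]]
    print(merge_lora(w, a, b, alpha=2, r=1))
```

With the hyperparameters listed above (r=16, alpha=32) the scale factor is 2.0, so each merged weight is the base weight plus twice the corresponding entry of B @ A.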

A/B vs base (full numbers in RETRAIN_RESULTS.md, FIELD_COVERAGE_DIFF.md):

  • Pass rate: base 15/15 vs fine-tune 14/15. The single fine-tune failure is on heavy Hinglish code-switching where the fine-tune over-refers (a safer failure mode, still a failure).
  • Latency: base 18.7s vs fine-tune 19.0s avg — effectively tied.
  • Schema normalization: fine-tune translates Hindi symptom phrases into English schema labels (दस्त → Diarrhea, चक्कर आ रहे हैं → dizziness). Base retains raw Hindi.
  • Field coverage: fine-tune recovers 2 visit-type-specific fields the base misses (anc_details.facility_or_home, visit_info.hbyc_visit_month); base recovers 11 fields the fine-tune leaves null.

Root cause of the over-referral failure. The 1,154-example training distribution had Hinglish code-switching disproportionately co-occurring with danger cases, so the LoRA learned English-in-Hindi-sentence as a mild danger signal. Documented in FAILURES.md. The base model is in the live Ollama path; the fine-tune is published to the Ollama registry as tusharbrisingr9802/sakhi (run ollama pull tusharbrisingr9802/sakhi to verify the A/B locally) and remains an option for deployments that prefer English schema-label normalization over raw Hindi.

Frontend

One React + Vite codebase, shipped as both a browser UI (served by FastAPI at /) and a native Android APK (Capacitor-wrapped, same React bundle inside a WebView + native plugins):

| Tab | Purpose |
|---|---|
| Voice to Form | Record or upload audio, real-time SSE pipeline progress (workstation path). Patient & Visit Info header at the top (name / age / sex / ASHA-ID / visit-date) is posted alongside the audio so demographics don't depend on ASR. |
| Text to Form | Paste transcript, extract structured form with example loader (workstation path) |
| Field Mode | Offline-first: crash-safe audio recording queue (IndexedDB every 5 s) for later sync + on-device text→form card that runs the full pipeline through Gemma 4 E2B on Cactus SDK + On-Device Probe card for loading/health-checking the Cactus model. Same Patient & Visit Info header as the Voice tab; header values are snapshotted at record-start so later edits don't contaminate earlier queue entries. A "Developer view" toggle shows raw per-stage model output for verification. |
| About & Impact | Project context, ASHA program statistics |
| History | Past extractions with JSON/CSV export |

JS pipeline port (frontend/src/lib/) — the Python extraction pipeline (Hindi normalization, visit-type detection, form/danger prompts, 6-layer validation, demographics-header merge) has a full JS port so the phone can run the same logic against the on-device Cactus engine, engine-agnostic by design. 72/72 unit tests pass under node --test.

On-device prompt design note: E4B via Ollama handles a raw JSON Schema in the form-extraction prompt cleanly. E2B INT4 on Cactus doesn't — it echoes schema metadata ($schema, title, description, type) back as output data. The JS port sends a null-filled instance template instead (just the field shape with all values as null), and the model's job is to fill in the slots where the transcript says something. Similarly, danger-sign extraction on-device uses plain JSON (E2B doesn't reliably emit OpenAI-style tool_calls in Cactus's parseable shape). The workstation E4B path keeps native function calling.
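The schema-to-template step can be sketched as a short recursive function. It is written in Python for illustration, whereas the project does this in its JS port; the function name is hypothetical, and treating arrays as null leaves is an assumption.

```python
import json
from typing import Any


def null_template(schema: dict[str, Any]) -> Any:
    """Turn a JSON Schema into a null-filled instance template.

    Objects keep their field shape; every leaf (including arrays in this
    sketch) becomes None. Schema metadata such as $schema, title, and
    description is dropped because only 'properties' is traversed.
    """
    if schema.get("type") == "object":
        return {name: null_template(sub)
                for name, sub in schema.get("properties", {}).items()}
    return None


if __name__ == "__main__":
    # Hypothetical miniature schema, not one of the project's schemas.
    schema = {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": "ANC visit",
        "type": "object",
        "properties": {
            "patient": {"type": "object", "properties": {
                "name": {"type": "string", "description": "Patient name"},
                "age": {"type": "integer"},
            }},
            "bp": {"type": "string"},
        },
    }
    print(json.dumps(null_template(schema), ensure_ascii=False, indent=2))
```

The template contains only field names and nulls, so there is no schema vocabulary left in the prompt for the smaller model to echo back.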

Quick Start

# Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)

# ── Health-center deployment (workstation, unified UI + API) ──
pip install -r requirements-runtime.txt     # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
ollama pull gemma4:e4b-it-q4_K_M             # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
cd frontend && npm install && npm run build && cd ..
python api.py
# Browser: http://localhost:8000  (React UI)
# Phone APK (on same WiFi): posts to http://<workstation-LAN-IP>:8000

# ── Frontend dev mode (hot-reload) ──
cd frontend && npm run dev           # Vite on :5173, proxies /api to :8000

# ── Android APK (Capacitor, field-deployable) ──
# Prerequisites: JDK 21 (Temurin), Android Studio with SDK
cd frontend
VITE_API_BASE_URL="http://<workstation-LAN-IP>:8000" npm run build
npx cap sync android
cd android && ./gradlew assembleDebug
# APK at: frontend/android/app/build/outputs/apk/debug/app-debug.apk

# ── On-device Cactus model (for field mode) ──
# Two install paths. Pick one.
#
# (A) PRIMARY — no developer tooling required:
#   1. Accept the Cactus-Compute terms at huggingface.co/Cactus-Compute/gemma-4-E2B-it
#   2. Download gemma-4-e2b-it-int4.zip (~4.4 GB) to a PC, then transfer to
#      the phone's Downloads folder via USB cable (MTP) or USB-OTG drive.
#      WhatsApp won't work (2 GB cap). Drive download to the phone also works
#      but Drive's content provider streams lazily, so prefer a downloaded copy.
#   3. Open Sakhi → Field Mode → On-Device Probe → Import model (.zip)
#      → pick the zip from the system file picker.
#   4. Wait ~3-5 min for extraction. Progress bar + log card show live
#      file count and MB written.
#   5. Tap Load Model → Test Hindi to confirm.
#   Re-imports automatically wipe the previous model dir — no manual cleanup,
#   no risk of accumulating multiple 6 GB models on the phone.
#
# (B) DEVELOPER — adb-based, scripted, faster on the same WiFi:
export HF_TOKEN=hf_...            # read token, repo must be accepted on HF UI
bash scripts/setup_cactus_model.sh
# Requires: adb on PATH, phone in USB debug mode authorised for this host,
# debuggable Sakhi APK installed (run-as-able). Full prerequisites +
# troubleshooting documented inside the script header.

# Tests
python scripts/test_ollama_quality.py    # Text extraction (base 15/15, sakhi 14/15)
python scripts/test_pipeline_e2e.py      # Full E2E audio (13/15)
python scripts/test_asr.py               # Hindi normalization (133/133)
cd frontend && npm test                  # JS pipeline port (72/72)

# Retrain + A/B eval (requires the FULL requirements.txt: Unsloth + PyTorch + bitsandbytes,
# plus an RTX GPU, cmake, and llama.cpp binaries on PATH for GGUF export)
pip install -r requirements.txt                 # NOTE: training-only deps, Blackwell-pinned PyTorch nightly
python scripts/train_unsloth.py                 # Full pipeline: prep, train, export, register, eval
python scripts/train_unsloth.py --export-only   # Skip training, just export saved adapter
python scripts/compare_field_coverage.py        # Field-level diff base vs sakhi

Public Demo — HuggingFace Space

Try it live: https://huggingface.co/spaces/Tushar9802/sakhi — same python api.py stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.

Heads-up on cold-boot wait. The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the 3-minute demo video, or follow Path 1 above to run locally — the live Space exists for convenience. Local Path 1 (or the test scripts in scripts/) is the evaluation path.

How it's deployed

Files driving the deploy:

  • Dockerfile — two-stage build: Node 20 builds frontend/dist, CUDA 12.2 + cuDNN 8 runtime installs Ollama + Python deps and copies the dist in.
  • entrypoint.sh — starts the Ollama daemon, waits for its API, pulls gemma4:e4b-it-q4_K_M if absent, then exec uvicorn api:app.
  • requirements-runtime.txt — slim runtime deps (faster-whisper, fastapi, uvicorn, ollama). No Unsloth / PyTorch / bitsandbytes — they're training-side only. Used by both the HF Space Docker build and local Path 1 installs.
  • .dockerignore — keeps the build context small (no models/, no data/recordings/, no frontend/node_modules, no cactus-src/, etc.).
  • README YAML frontmatter — sdk: docker, app_port: 7860. HF Space picks this up on push.

Deploy steps (one-time):

pip install huggingface_hub
huggingface-cli login                                    # paste a write token

# Create the Space (sdk=docker; hardware and storage are set in the Space UI, see below)
huggingface-cli repo create <user>/sakhi --type space --space_sdk docker

# Add the Space as a second git remote alongside GitHub
git remote add hf https://huggingface.co/spaces/<user>/sakhi
git push hf master

# In the HF Space UI, set:
#   Hardware       → T4 small (or larger)
#   Storage        → ephemeral disk only. Every cold-boot re-downloads
#                     + loads ~12 GB.
#   Sleep time     → 1 h. Captures intra-hour clustering of reviewer
#                     traffic without keeping the GPU billed 24/7.
#   Visibility     → Public.

On every cold-boot the container pulls gemma4:e4b-it-q4_K_M (9 GB, ~90 s @ 100 MB/s) and the FastAPI startup hook downloads and loads Whisper-Large CT2 from Tushar9802/whisper-large-v2-hindi-ct2 (3 GB). The Space only marks itself ready once both models are resident, so the first request after a sleep pays a ~5 min wait.
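
The download share of that wait is simple arithmetic. A small sketch, assuming decimal units (1 GB = 1000 MB) and a sustained 100 MB/s for both downloads (the text states a rate only for the Gemma pull); the function name is hypothetical:

```python
def download_seconds(size_gb: float, rate_mb_per_s: float) -> float:
    """Transfer time in seconds, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s


gemma_s = download_seconds(9, 100)     # 90.0
whisper_s = download_seconds(3, 100)   # 30.0 (assumes the same link speed)
total_s = gemma_s + whisper_s          # 120.0

print(f"Gemma: {gemma_s:.0f} s, Whisper: {whisper_s:.0f} s, total: {total_s:.0f} s")
```

Under these assumptions, downloads account for roughly 2 of the ~5 minutes; the rest is presumably container start-up and loading the weights into VRAM.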

Subsequent updates: git push hf master after any code change; HF rebuilds and redeploys.

Project Structure

api.py                              # FastAPI backend — SSE streaming + static mount of frontend/dist
app.py                              # Core pipeline — function calling, ASR, extraction, validation
src/hindi_normalize.py              # Hindi number/medical term normalization (160 number words)
configs/schemas/                    # 5 JSON schemas (ANC, PNC, delivery, child health, danger signs)
Dockerfile                          # HF Space build: Node frontend + CUDA runtime + Ollama
entrypoint.sh                       # HF Space container init: ollama serve → pull model → uvicorn
requirements-runtime.txt            # Slim runtime deps (no Unsloth/PyTorch — Ollama serves inference)
frontend/
  src/App.jsx                       # React app — all 5 tabs, on-device text-in card + Cactus probe in Field Mode
  src/offlineQueue.js               # IndexedDB offline queue + crash-safe chunk persistence
  src/lib/                          # JS port of Python pipeline (engine-agnostic)
    hindiNormalize.js               # Full port of src/hindi_normalize.py
    visitTypeDetect.js              # Visit-type keyword heuristic
    validation.js                   # 6-layer anti-hallucination
    prompts.js                      # FORM + DANGER prompts (template-based for on-device E2B)
    pipeline.js                     # Orchestrator (engine.complete({messages, options}) contract)
    cactus.js                       # Capacitor facade for Cactus SDK
    __tests__/                      # 72/72 assertions pass under node --test
  public/sw.js                      # Service worker for PWA offline caching (browser install)
  public/manifest.json              # PWA manifest
  capacitor.config.json             # Capacitor config (appId com.sakhi.app, http scheme for LAN)
  android/                          # Native Android project — Capacitor-generated, produces APK
    app/src/main/java/com/cactus/Cactus.kt             # Cactus SDK Kotlin wrapper (vendored from cactus-src; upstream publishes no Maven artifact)
    app/src/main/java/com/sakhi/app/CactusPlugin.kt    # Capacitor plugin bridging JS ↔ Cactus
    app/src/main/jniLibs/arm64-v8a/libcactus.so        # Cactus native library (66 MB, arm64-v8a). Committed to repo via .gitignore negation because the Cactus project publishes no prebuilt Android .so and no Maven artifact. Build provenance: compiled from github.com/cactus-compute/cactus via its upstream android/build.sh with NDK r27b + CMake 3.22.1 + Ninja on Windows Git Bash. To rebuild: clone cactus, set ANDROID_NDK_HOME + CMAKE_GENERATOR=Ninja, run `bash android/build.sh`. Output .so replaces this file.
scripts/
  test_ollama_quality.py            # A/B quality tests (base 15/15, sakhi 14/15)
  test_pipeline_e2e.py              # End-to-end audio pipeline tests (13/15)
  test_asr.py                       # ASR + Hindi normalization tests (133/133)
  test_function_calling.py          # Gemma 4 function calling validation
  generate_training_data.py         # Synthetic ASHA conversation generation
  prepare_training.py               # Train/val split, schema cleanup, prompt matching
  train_unsloth.py                  # Full pipeline: prep, LoRA train, export, register, eval
  export_merge.py                   # Manual LoRA merge (bypasses Unsloth Windows mmap bug)
  compare_field_coverage.py         # Field-level diff base vs sakhi
data/
  processed/train.jsonl             # 981 training examples
  processed/val.jsonl               # 173 validation examples
  role_play_scripts.md              # Hindi role-play scripts for real-voice validation (4 scenarios)
models/
  checkpoints/final/                # Saved LoRA adapter (85MB)
  exported/sakhi-v2-q4_k_m.gguf     # Quantized fine-tune (5.3GB, registered in Ollama)
  cactus/gemma-4-e2b/               # INT4 on-device model for Cactus (not committed; HF-gated download)
RETRAIN_RESULTS.md                  # A/B score summary
FIELD_COVERAGE_DIFF.md              # Field-level coverage diff
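
To give a flavour of what src/hindi_normalize.py is for, here is a deliberately tiny sketch of number-word normalization. It is not the repo's implementation: the function name and the ten-word table are illustrative assumptions, whereas the real module is described above as covering 160 number words plus medical terms.

```python
# Illustrative subset only: Hindi number words for 1-10.
HINDI_NUMBERS = {
    "एक": "1", "दो": "2", "तीन": "3", "चार": "4", "पाँच": "5",
    "छह": "6", "सात": "7", "आठ": "8", "नौ": "9", "दस": "10",
}


def normalize_numbers(text: str) -> str:
    """Replace standalone Hindi number words with digits, token by token."""
    tokens = text.split()
    return " ".join(HINDI_NUMBERS.get(tok, tok) for tok in tokens)


print(normalize_numbers("तीन महीने"))  # 3 महीने
```

A whitespace-token lookup like this cannot handle compound numbers or spelling variants, which is presumably part of why the real table is much larger.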