docs: fix Path 1 (model tag + slim deps) + log eval-rubric gaps
Quick Start + Path 1 narrative previously pointed reviewers to commands
that would have failed verbatim:
- `ollama pull gemma4:e4b` — but app.py / Dockerfile / entrypoint.sh all
default OLLAMA_MODEL to `gemma4:e4b-it-q4_K_M`. A reviewer pulling the
tagless `e4b` would have hit an Ollama 404 on first inference.
- `pip install -r requirements.txt` — but that file pins PyTorch
nightly cu128 + Unsloth + bitsandbytes, all of which are training-only.
The cu128 wheel is Blackwell-only; reviewers on RTX 30/40 / Linux /
macOS would have either failed the install or waited ~15 min on unused
nightlies. Path 1 inference goes through Ollama + faster-whisper —
requirements-hf.txt is sufficient.
- Prerequisites silently assumed the Ollama daemon was running. Made
that explicit (Windows tray app, Linux/macOS `ollama serve`).
Also corrected Python 3.11+ → 3.10+ (matches Dockerfile) and VRAM guidance
to the actual model footprint (~9 GB resident, not the misleading 16 GB).
The retrain block now explicitly calls for the full requirements.txt with
a NOTE about the Blackwell-pinned nightly so the split is visible.
FAILURES.md gains three sections surfacing failure modes a careful
reviewer would otherwise pose as a question:
1. pregnancy.previous_complications slot misclassification on ANC
preeclampsia (one-line prompt fix held back this close to deadline
due to whole-schema regression surface; root-caused to bare
anc_visit.json field with no `description` attribute).
2. Eval-rubric scope: hallucination_traps are per-case, not
null-everywhere-not-mentioned across the schema — explains why local
15/15 passed for weeks while the misclassification was live.
3. ANC long-clip BP drop on conversational pacing; short clip is the
manifest default to lead with the fast end-to-end demo.
README known-limitations bullet now points at FAILURES.md, and the
JS-pipeline-port test count is corrected 62/62 → 72/72 in both README
and JUDGE_BRIEF.md.
- FAILURES.md +46 -2
- JUDGE_BRIEF.md +1 -1
- README.md +9 -6
|
@@ -57,6 +57,50 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
|
|
| 57 |
|
| 58 |
`scripts/test_asr.py` covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.
|
| 59 |
|
| 60 |
-
## JS pipeline port:
|
| 61 |
|
| 62 |
-
`frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic. No known failures.
|
| 57 |
|
| 58 |
`scripts/test_asr.py` covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.
|
| 59 |
|
| 60 |
+
## JS pipeline port: 72 / 72 pass
|
| 61 |
|
| 62 |
+
`frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic, and the demographics-header merge (`applyMetadata`) across ANC / PNC / child-health / delivery schemas. No known failures.
|
| 63 |
+
|
| 64 |
+
---
|
| 65 |
+
|
| 66 |
+
## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
|
| 67 |
+
|
| 68 |
+
**Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
|
| 69 |
+
|
| 70 |
+
**Observed output:** the form's `pregnancy.previous_complications` field is populated with the current-visit symptoms — "सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन" — when the conversation describes preeclampsia presenting *today*, not in a prior pregnancy. The same symptoms also appear correctly in `symptoms_reported`, and the danger panel correctly flags `severe_hypertension` / `severe_headache_and_visual_changes` / `edema` with `refer_immediately` and verbatim Hindi evidence. No clinical signal is lost; the misclassification is a duplicate-in-wrong-slot.
|
| 71 |
+
|
| 72 |
+
### Root cause
|
| 73 |
+
|
| 74 |
+
`configs/schemas/anc_visit.json:29` defines `previous_complications` with bare `{"type": ["string", "null"]}` and no `description` attribute — unlike adjacent fields (`lmp_date`, `gravida`, `para`) which carry explicit descriptions. The model is inferring semantics from the field name alone, and in a conversation densely populated with current findings it slots them into this field. The same input through the JS pipeline on Cactus (E2B INT4) does not exhibit the bug — the on-device path uses a null-filled instance template prompt rather than a raw JSON Schema, which sidesteps the under-described-field ambiguity.
|
| 75 |
+
|
| 76 |
+
### Disposition
|
| 77 |
+
|
| 78 |
+
The one-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) is held back this close to the deadline. The regression surface is the full form schema across all four visit types, and we don't have time to re-run the eval suite against a tightened schema with confidence. The safety-critical output (danger panel + referral decision) is unaffected, so the conservative choice is documented disclosure now and schema cleanup post-competition.
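The root cause above (a leaf field with no `description`) is the kind of gap a small lint can surface before the schema reaches the model. The following is a minimal sketch, not part of the repository: `fields_missing_description` and `EXAMPLE_SCHEMA` are hypothetical, and the excerpt only imitates the shape of `anc_visit.json` rather than reproducing it.

```python
def fields_missing_description(schema, prefix=""):
    """Return dotted paths of leaf properties that carry no `description`."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            # Nested object: recurse and extend the dotted path.
            missing.extend(fields_missing_description(spec, prefix=f"{path}."))
        elif not spec.get("description"):
            # Leaf field the model must interpret from its name alone.
            missing.append(path)
    return missing


# Hypothetical excerpt for illustration only (NOT the real anc_visit.json).
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "pregnancy": {
            "type": "object",
            "properties": {
                "gravida": {
                    "type": ["integer", "null"],
                    "description": "Assumed wording: total number of pregnancies",
                },
                # Bare field, as described in the root cause above.
                "previous_complications": {"type": ["string", "null"]},
            },
        },
    },
}

if __name__ == "__main__":
    print(fields_missing_description(EXAMPLE_SCHEMA))
```

Run against each visit-type schema, a check like this would list every field that relies on its name for meaning, which is a cheaper first pass than re-running the full eval suite.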
|
| 79 |
+
|
| 80 |
+
---
|
| 81 |
+
|
| 82 |
+
## Eval-rubric scope: per-case hallucination traps under-specify ANC
|
| 83 |
+
|
| 84 |
+
**Harness:** `scripts/test_ollama_quality.py`
|
| 85 |
+
|
| 86 |
+
The 15/15 pass rate is computed against per-case `hallucination_traps` lists — each test enumerates the specific fields that MUST be null for that input, and the suite only asserts those (`scripts/test_ollama_quality.py:470-473`). For the ANC preeclampsia case at line 93, the trap list is `["patient.name", "lab_results.blood_group"]` — `pregnancy.previous_complications` is not checked, which is why the misclassification above passed every run.
|
| 87 |
+
|
| 88 |
+
### Disposition
|
| 89 |
+
|
| 90 |
+
The rubric is honest about what it tests — `hallucination_traps` is the literal list of fields each test asserts null for, and the test source is reproducible. But "15/15 tests pass" rests on a narrow per-case rubric, not a whole-schema null-everywhere-not-mentioned check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above before deploy. Post-competition the rubric will be widened; the current ratio is reported as-is.
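The difference between the two rubrics can be sketched in a few lines. This is an illustration under assumptions, not the code in `scripts/test_ollama_quality.py`: the helper names, the `expected_non_null` set, and the sample form are all hypothetical.

```python
def flatten(form, prefix=""):
    """Flatten a nested form dict into {dotted.path: value}."""
    flat = {}
    for key, value in form.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def per_case_violations(form, hallucination_traps):
    """Narrow rubric: only the listed trap fields must be null."""
    flat = flatten(form)
    return [p for p in hallucination_traps if flat.get(p) is not None]


def whole_schema_violations(form, expected_non_null):
    """Wide rubric: every field outside expected_non_null must be null."""
    flat = flatten(form)
    return sorted(
        p for p, v in flat.items() if v is not None and p not in expected_non_null
    )


# Hypothetical extraction result mirroring the misclassification above.
FORM = {
    "patient": {"name": None},
    "lab_results": {"blood_group": None},
    "pregnancy": {"gravida": 2, "previous_complications": "headache, blurred vision"},
    "symptoms_reported": "headache, blurred vision",
}
TRAPS = ["patient.name", "lab_results.blood_group"]
EXPECTED_NON_NULL = {"pregnancy.gravida", "symptoms_reported"}

if __name__ == "__main__":
    print(per_case_violations(FORM, TRAPS))
    print(whole_schema_violations(FORM, EXPECTED_NON_NULL))
```

The wide check needs a per-case list of fields the transcript does mention, which is more labeling work per test. That cost is the trade-off for catching duplicates in the wrong slot.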
|
| 91 |
+
|
| 92 |
+
---
|
| 93 |
+
|
| 94 |
+
## ANC long-clip BP drop on conversational pacing
|
| 95 |
+
|
| 96 |
+
**Harness:** `demo_audio/anc_preeclampsia_full.ogg` (52 s self-recorded clip) on the live HF Space.
|
| 97 |
+
|
| 98 |
+
Whisper-Large CT2 returns the BP segment as "हाई हो रखा है" — the "BP बहुत ज़्यादा है" framing remains but the actual numeric value `155/100` is dropped. The 20-second short clip (`demo_audio/anc_preeclampsia_short.ogg`), where the same speaker pauses deliberately around `बटा`, transcribes `155/100` reliably.
|
| 99 |
+
|
| 100 |
+
### Root cause
|
| 101 |
+
|
| 102 |
+
Conversational pacing on the long clip. BP `एक सौ साठ बटा एक सौ दस` is recoverable from Whisper-Large with a ~0.5 s gap around `बटा`, and lossy without. Same speaker, same model, same hardware — the variable is delivery prosody, not Whisper.
|
| 103 |
+
|
| 104 |
+
### Disposition
|
| 105 |
+
|
| 106 |
+
Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so a reviewer's first impression preserves the full BP path. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.
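A dropped reading of this kind can at least be detected after transcription. The sketch below is hypothetical and not part of the pipeline: it assumes the transcript has already been normalized so that a spoken reading appears as digits (for example `155/100`), and the BP keyword list is an assumption.

```python
import re

# Assumed ways the transcript may refer to blood pressure.
BP_MENTION = re.compile(r"BP|बीपी|ब्लड\s*प्रेशर")
# A systolic/diastolic pair such as "155/100" after number normalization.
BP_READING = re.compile(r"\d{2,3}\s*/\s*\d{2,3}")


def bp_value_dropped(transcript):
    """True when BP is mentioned but no numeric reading survives."""
    return bool(BP_MENTION.search(transcript)) and not BP_READING.search(transcript)


if __name__ == "__main__":
    print(bp_value_dropped("BP बहुत ज़्यादा है"))
    print(bp_value_dropped("BP 155/100 है"))
```

A flag like this could prompt the health worker to re-enter the value by hand instead of letting the numeric field stay silently empty.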
|
|
@@ -16,7 +16,7 @@ Sakhi converts Hindi home-visit conversations (voice on a shared health-center w
|
|
| 16 |
|
| 17 |
| Measurement | Value | Source |
|
| 18 |
|---|---|---|
|
| 19 |
-
| Text extraction pass rate (base Gemma 4 E4B) | **15 / 15** | `scripts/test_ollama_quality.py` |
|
| 20 |
| End-to-end audio pipeline pass rate | **13 / 15** | `scripts/test_pipeline_e2e.py` (2 TTS→ASR artifacts, documented in FAILURES.md) |
|
| 21 |
| Hindi number / medical-term normalization | **133 / 133** | `scripts/test_asr.py` |
|
| 22 |
| On-device JS pipeline port (engine-agnostic) | **72 / 72** | `cd frontend && node --test src/lib/__tests__/` |
|
|
|
|
| 16 |
|
| 17 |
| Measurement | Value | Source |
|
| 18 |
|---|---|---|
|
| 19 |
+
| Text extraction pass rate (base Gemma 4 E4B) | **15 / 15** | `scripts/test_ollama_quality.py` — per-case rubric; one under-specified trap documented in [FAILURES.md](FAILURES.md) |
|
| 20 |
| End-to-end audio pipeline pass rate | **13 / 15** | `scripts/test_pipeline_e2e.py` (2 TTS→ASR artifacts, documented in FAILURES.md) |
|
| 21 |
| Hindi number / medical-term normalization | **133 / 133** | `scripts/test_asr.py` |
|
| 22 |
| On-device JS pipeline port (engine-agnostic) | **72 / 72** | `cd frontend && node --test src/lib/__tests__/` |
|
|
@@ -86,7 +86,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
|
|
| 86 |
|
| 87 |
Two reproduction paths, calibrated to how much friction the reviewer wants to accept.
|
| 88 |
|
| 89 |
-
**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥
|
| 90 |
|
| 91 |
**Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
|
| 92 |
1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
|
|
@@ -154,6 +154,7 @@ Health Center (workstation, RTX GPU) Field (Android phone)
|
|
| 154 |
- Zero false danger alarms on normal visits
|
| 155 |
- Correct referral escalation on danger cases
|
| 156 |
- Avg 18.7s per test (form + danger sign extraction)
|
|
|
|
| 157 |
|
| 158 |
**End-to-end audio pipeline:** 13/15 tests pass (87%) — test_pipeline_e2e.py
|
| 159 |
- 15 synthetic Hindi audio samples through full pipeline
|
|
@@ -200,11 +201,11 @@ One React + Vite codebase, shipped as both a browser UI (served by FastAPI at `/
|
|
| 200 |
## Quick Start
|
| 201 |
|
| 202 |
```bash
|
| 203 |
-
# Prerequisites: Python 3.
|
| 204 |
|
| 205 |
# ── Health-center deployment (workstation, unified UI + API) ──
|
| 206 |
-
pip install -r requirements.txt
|
| 207 |
-
ollama pull gemma4:e4b
|
| 208 |
cd frontend && npm install && npm run build && cd ..
|
| 209 |
python api.py
|
| 210 |
# Browser: http://localhost:8000 (React UI)
|
|
@@ -251,7 +252,9 @@ python scripts/test_pipeline_e2e.py # Full E2E audio (13/15)
|
|
| 251 |
python scripts/test_asr.py # Hindi normalization (133/133)
|
| 252 |
cd frontend && npm test # JS pipeline port (72/72)
|
| 253 |
|
| 254 |
-
# Retrain + A/B eval (requires
|
|
|
|
|
|
|
| 255 |
python scripts/train_unsloth.py # Full pipeline: prep, train, export, register, eval
|
| 256 |
python scripts/train_unsloth.py --export-only # Skip training, just export saved adapter
|
| 257 |
python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
|
|
@@ -313,7 +316,7 @@ frontend/
|
|
| 313 |
prompts.js # FORM + DANGER prompts (template-based for on-device E2B)
|
| 314 |
pipeline.js # Orchestrator (engine.complete({messages, options}) contract)
|
| 315 |
cactus.js # Capacitor facade for Cactus SDK
|
| 316 |
-
__tests__/ #
|
| 317 |
public/sw.js # Service worker for PWA offline caching (browser install)
|
| 318 |
public/manifest.json # PWA manifest
|
| 319 |
capacitor.config.json # Capacitor config (appId com.sakhi.app, http scheme for LAN)
|
|
|
|
| 86 |
|
| 87 |
Two reproduction paths, calibrated to how much friction the reviewer wants to accept.
|
| 88 |
|
| 89 |
+
**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt` — inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
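Both Path 1 failure modes named in this commit (daemon not running, wrong tag pulled) can be checked before starting `api.py`. The following is a minimal preflight sketch, not a script that ships in the repository. It assumes Ollama's default local address `http://localhost:11434` and its `/api/tags` endpoint for listing pulled models; the function names are hypothetical.

```python
import json
import urllib.request

# Tag the commit message says app.py / Dockerfile / entrypoint.sh default to.
DEFAULT_MODEL = "gemma4:e4b-it-q4_K_M"


def model_is_pulled(tags_payload, model=DEFAULT_MODEL):
    """Check a parsed /api/tags response for an exact model tag."""
    names = {m.get("name") for m in tags_payload.get("models", [])}
    return model in names


def fetch_tags(base_url="http://localhost:11434", timeout=3.0):
    """Fetch the list of locally pulled models from the Ollama daemon."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        payload = fetch_tags()
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print(f"Ollama daemon not reachable: {exc}")
    else:
        if model_is_pulled(payload):
            print(f"{DEFAULT_MODEL} is available")
        else:
            print(f"Missing model; run: ollama pull {DEFAULT_MODEL}")
```

Matching on the exact tag is deliberate: a pulled `gemma4:e4b` would not satisfy the check, which mirrors the mismatch this commit fixes.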
|
| 90 |
|
| 91 |
**Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
|
| 92 |
1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
|
|
|
|
| 154 |
- Zero false danger alarms on normal visits
|
| 155 |
- Correct referral escalation on danger cases
|
| 156 |
- Avg 18.7s per test (form + danger sign extraction)
|
| 157 |
+
- The rubric is per-case: each test asserts a small list of `hallucination_traps` (fields that MUST be null for that input). It does not assert null-everywhere-not-mentioned across the full schema. See [FAILURES.md](FAILURES.md) for one known under-specified trap (`pregnancy.previous_complications` on ANC preeclampsia).
|
| 158 |
|
| 159 |
**End-to-end audio pipeline:** 13/15 tests pass (87%) — test_pipeline_e2e.py
|
| 160 |
- 15 synthetic Hindi audio samples through full pipeline
|
|
|
|
| 201 |
## Quick Start
|
| 202 |
|
| 203 |
```bash
|
| 204 |
+
# Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
|
| 205 |
|
| 206 |
# ── Health-center deployment (workstation, unified UI + API) ──
|
| 207 |
+
pip install -r requirements-hf.txt # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
|
| 208 |
+
ollama pull gemma4:e4b-it-q4_K_M # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
|
| 209 |
cd frontend && npm install && npm run build && cd ..
|
| 210 |
python api.py
|
| 211 |
# Browser: http://localhost:8000 (React UI)
|
|
|
|
| 252 |
python scripts/test_asr.py # Hindi normalization (133/133)
|
| 253 |
cd frontend && npm test # JS pipeline port (72/72)
|
| 254 |
|
| 255 |
+
# Retrain + A/B eval (requires the FULL requirements.txt: Unsloth + PyTorch + bitsandbytes,
|
| 256 |
+
# plus an RTX GPU, cmake, and llama.cpp binaries on PATH for GGUF export)
|
| 257 |
+
pip install -r requirements.txt # NOTE: training-only deps, Blackwell-pinned PyTorch nightly
|
| 258 |
python scripts/train_unsloth.py # Full pipeline: prep, train, export, register, eval
|
| 259 |
python scripts/train_unsloth.py --export-only # Skip training, just export saved adapter
|
| 260 |
python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
|
|
|
|
| 316 |
prompts.js # FORM + DANGER prompts (template-based for on-device E2B)
|
| 317 |
pipeline.js # Orchestrator (engine.complete({messages, options}) contract)
|
| 318 |
cactus.js # Capacitor facade for Cactus SDK
|
| 319 |
+
__tests__/ # 72/72 assertions pass under node --test
|
| 320 |
public/sw.js # Service worker for PWA offline caching (browser install)
|
| 321 |
public/manifest.json # PWA manifest
|
| 322 |
capacitor.config.json # Capacitor config (appId com.sakhi.app, http scheme for LAN)
|