Tushar9802 committed on
Commit
05f829f
·
1 Parent(s): d3595cb

docs: fix Path 1 (model tag + slim deps) + log eval-rubric gaps


Quick Start + Path 1 narrative previously pointed reviewers to commands
that would have failed verbatim:
- `ollama pull gemma4:e4b`, but app.py / Dockerfile / entrypoint.sh all
default OLLAMA_MODEL to `gemma4:e4b-it-q4_K_M`. A reviewer pulling the
shorter `e4b` tag would have hit an Ollama 404 on first inference.
- `pip install -r requirements.txt` — but that file pins PyTorch
nightly cu128 + Unsloth + bitsandbytes, all of which are training-only.
The cu128 wheel is Blackwell-only; reviewers on RTX 30/40 / Linux /
macOS would have either failed the install or waited ~15 min on unused
nightlies. Path 1 inference goes through Ollama + faster-whisper —
requirements-hf.txt is sufficient.
- Prerequisites silently assumed the Ollama daemon was running. Made
that explicit (Windows tray app, Linux/macOS `ollama serve`).
Also corrected Python 3.11+ → 3.10+ (matches Dockerfile) and VRAM guidance
to the actual model footprint (~9 GB resident, not the misleading 16 GB).
The retrain block now explicitly calls for the full requirements.txt with
a NOTE about the Blackwell-pinned nightly so the split is visible.

FAILURES.md gains three sections surfacing failure modes a careful
reviewer would otherwise pose as a question:
1. pregnancy.previous_complications slot misclassification on ANC
preeclampsia (one-line prompt fix held back this close to deadline
due to whole-schema regression surface; root-caused to bare
anc_visit.json field with no `description` attribute).
2. Eval-rubric scope: hallucination_traps are per-case, not
null-everywhere-not-mentioned across the schema — explains why local
15/15 passed for weeks while the misclassification was live.
3. ANC long-clip BP drop on conversational pacing; short clip is the
manifest default to lead with the fast end-to-end demo.

README known-limitations bullet now points at FAILURES.md, and the
JS-pipeline-port test count is corrected 62/62 → 72/72 in both README
and JUDGE_BRIEF.md.

Files changed (3)
  1. FAILURES.md +46 -2
  2. JUDGE_BRIEF.md +1 -1
  3. README.md +9 -6
FAILURES.md CHANGED
@@ -57,6 +57,50 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
57
 
58
  `scripts/test_asr.py` covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.
59
 
60
- ## JS pipeline port: 62 / 62 pass
61
 
62
- `frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic. No known failures.
57
 
58
  `scripts/test_asr.py` covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.
59
 
60
+ ## JS pipeline port: 72 / 72 pass
61
 
62
+ `frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic, and the demographics-header merge (`applyMetadata`) across ANC / PNC / child-health / delivery schemas. No known failures.
63
+
64
+ ---
65
+
66
+ ## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
67
+
68
+ **Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
69
+
70
+ **Observed output:** the form's `pregnancy.previous_complications` field is populated with the current-visit symptoms — "सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन" — when the conversation describes preeclampsia presenting *today*, not in a prior pregnancy. The same symptoms also appear correctly in `symptoms_reported`, and the danger panel correctly flags `severe_hypertension` / `severe_headache_and_visual_changes` / `edema` with `refer_immediately` and verbatim Hindi evidence. No clinical signal is lost; the misclassification is a duplicate-in-wrong-slot.
71
+
72
+ ### Root cause
73
+
74
+ `configs/schemas/anc_visit.json:29` defines `previous_complications` with bare `{"type": ["string", "null"]}` and no `description` attribute — unlike adjacent fields (`lmp_date`, `gravida`, `para`) which carry explicit descriptions. The model is inferring semantics from the field name alone, and in a conversation densely populated with current findings it slots them into this field. The same input through the JS pipeline on Cactus (E2B INT4) does not exhibit the bug — the on-device path uses a null-filled instance template prompt rather than a raw JSON Schema, which sidesteps the under-described-field ambiguity.
75
+
76
+ ### Disposition
77
+
78
+ One-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) is held back close to deadline. The regression surface is the full form schema across all four visit types and we don't have time to re-run the eval suite against a tightened schema with confidence. The safety-critical output (danger panel + referral decision) is unaffected, so the conservative choice is documented disclosure now, schema cleanup post-competition.
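
The root cause above (a leaf field with a `type` but no `description`) is the kind of gap a small lint pass could surface across all four visit schemas. The following is a minimal sketch, not code from this repository: `find_undescribed_fields` and `sample_schema` are hypothetical names, and the sample is only shaped like the field described above rather than copied from `anc_visit.json`. It assumes nested objects are expressed through `properties`.

```python
def find_undescribed_fields(schema, prefix=""):
    """Return dotted paths of leaf properties that have no `description`."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if "properties" in spec:
            # Nested object: recurse and keep building the dotted path.
            missing.extend(find_undescribed_fields(spec, path))
        elif not spec.get("description"):
            # Leaf field whose meaning the model must guess from its name.
            missing.append(path)
    return missing


# Hypothetical excerpt for illustration only (not the real anc_visit.json).
sample_schema = {
    "type": "object",
    "properties": {
        "pregnancy": {
            "type": "object",
            "properties": {
                "gravida": {
                    "type": ["integer", "null"],
                    "description": "Total number of pregnancies",
                },
                "previous_complications": {"type": ["string", "null"]},
            },
        }
    },
}

print(find_undescribed_fields(sample_schema))
# → ['pregnancy.previous_complications']
```

Run against each file under `configs/schemas/`, a check like this would list every field that relies on its name alone for semantics.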
79
+
80
+ ---
81
+
82
+ ## Eval-rubric scope: per-case hallucination traps under-specify ANC
83
+
84
+ **Harness:** `scripts/test_ollama_quality.py`
85
+
86
+ The 15/15 pass rate is computed against per-case `hallucination_traps` lists — each test enumerates the specific fields that MUST be null for that input, and the suite only asserts those (`scripts/test_ollama_quality.py:470-473`). For the ANC preeclampsia case at line 93, the trap list is `["patient.name", "lab_results.blood_group"]` — `pregnancy.previous_complications` is not checked, which is why the misclassification above passed every run.
87
+
88
+ ### Disposition
89
+
90
+ The rubric is honest about what it tests — `hallucination_traps` is the literal list of fields each test asserts null for, and the test source is reproducible. But "15/15 tests pass" rests on a narrow per-case rubric, not a whole-schema null-everywhere-not-mentioned check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above before deploy. Post-competition the rubric will be widened; the current ratio is reported as-is.
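
The wider rubric described above (every schema field absent from the transcript must be null) can be sketched as follows. This is an assumption about how such a check might look, not the logic in `scripts/test_ollama_quality.py`: `flatten`, `unexpected_non_null`, and the `mentioned` set are hypothetical, and each test case would need to declare which fields its transcript actually mentions.

```python
def flatten(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            # Lists and scalars are treated as leaves.
            yield path, value


def unexpected_non_null(extracted, mentioned_paths):
    """Return paths that are non-null although the case never mentioned them."""
    return sorted(
        path
        for path, value in flatten(extracted)
        if value is not None and path not in mentioned_paths
    )


# Hypothetical extraction output shaped like the misclassification above.
extracted = {
    "patient": {"name": None},
    "pregnancy": {"previous_complications": "headache, swelling"},
    "symptoms_reported": "headache, swelling",
}
mentioned = {"symptoms_reported"}

print(unexpected_non_null(extracted, mentioned))
# → ['pregnancy.previous_complications']
```

The trade-off is authoring cost: an allow-list of mentioned fields per case is longer than a short trap list, but anything outside it fails by default.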
91
+
92
+ ---
93
+
94
+ ## ANC long-clip BP drop on conversational pacing
95
+
96
+ **Harness:** `demo_audio/anc_preeclampsia_full.ogg` (52 s self-recorded clip) on the live HF Space.
97
+
98
+ Whisper-Large CT2 returns the BP segment as "हाई हो रखा है" — the "BP बहुत ज़्यादा है" framing remains but the actual numeric value `155/100` is dropped. The 20-second short clip (`demo_audio/anc_preeclampsia_short.ogg`), where the same speaker pauses deliberately around `बटा`, transcribes `155/100` reliably.
99
+
100
+ ### Root cause
101
+
102
+ Conversational pacing on the long clip. BP `एक सौ साठ बटा एक सौ दस` is recoverable from Whisper-Large with a ~0.5 s gap around `बटा`, and lossy without. Same speaker, same model, same hardware — the variable is delivery prosody, not Whisper.
103
+
104
+ ### Disposition
105
+
106
+ Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so a reviewer's first impression preserves the full BP path. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.
JUDGE_BRIEF.md CHANGED
@@ -16,7 +16,7 @@ Sakhi converts Hindi home-visit conversations (voice on a shared health-center w
16
 
17
  | Measurement | Value | Source |
18
  |---|---|---|
19
- | Text extraction pass rate (base Gemma 4 E4B) | **15 / 15** | `scripts/test_ollama_quality.py` |
20
  | End-to-end audio pipeline pass rate | **13 / 15** | `scripts/test_pipeline_e2e.py` (2 TTS→ASR artifacts, documented in FAILURES.md) |
21
  | Hindi number / medical-term normalization | **133 / 133** | `scripts/test_asr.py` |
22
  | On-device JS pipeline port (engine-agnostic) | **72 / 72** | `cd frontend && node --test src/lib/__tests__/` |
 
16
 
17
  | Measurement | Value | Source |
18
  |---|---|---|
19
+ | Text extraction pass rate (base Gemma 4 E4B) | **15 / 15** | `scripts/test_ollama_quality.py` — per-case rubric; one under-specified trap documented in [FAILURES.md](FAILURES.md) |
20
  | End-to-end audio pipeline pass rate | **13 / 15** | `scripts/test_pipeline_e2e.py` (2 TTS→ASR artifacts, documented in FAILURES.md) |
21
  | Hindi number / medical-term normalization | **133 / 133** | `scripts/test_asr.py` |
22
  | On-device JS pipeline port (engine-agnostic) | **72 / 72** | `cd frontend && node --test src/lib/__tests__/` |
README.md CHANGED
@@ -86,7 +86,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
86
 
87
  Two reproduction paths, calibrated to how much friction the reviewer wants to accept.
88
 
89
- **Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥16 GB VRAM. No phone needed; same extraction code, same anti-hallucination validation, same form output. `pip install -r requirements.txt && ollama pull gemma4:e4b && python api.py` then open `http://localhost:8000`. Voice-to-form, text-to-form, and queue-and-sync flows all run here. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
90
 
91
  **Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
92
  1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
@@ -154,6 +154,7 @@ Health Center (workstation, RTX GPU) Field (Android phone)
154
  - Zero false danger alarms on normal visits
155
  - Correct referral escalation on danger cases
156
  - Avg 18.7s per test (form + danger sign extraction)
 
157
 
158
  **End-to-end audio pipeline:** 13/15 tests pass (87%) — test_pipeline_e2e.py
159
  - 15 synthetic Hindi audio samples through full pipeline
@@ -200,11 +201,11 @@ One React + Vite codebase, shipped as both a browser UI (served by FastAPI at `/
200
  ## Quick Start
201
 
202
  ```bash
203
- # Prerequisites: Python 3.11+, Node 18+, Ollama, CUDA GPU (16GB VRAM recommended)
204
 
205
  # ── Health-center deployment (workstation, unified UI + API) ──
206
- pip install -r requirements.txt
207
- ollama pull gemma4:e4b
208
  cd frontend && npm install && npm run build && cd ..
209
  python api.py
210
  # Browser: http://localhost:8000 (React UI)
@@ -251,7 +252,9 @@ python scripts/test_pipeline_e2e.py # Full E2E audio (13/15)
251
  python scripts/test_asr.py # Hindi normalization (133/133)
252
  cd frontend && npm test # JS pipeline port (72/72)
253
 
254
- # Retrain + A/B eval (requires RTX GPU, cmake, llama.cpp binaries)
 
 
255
  python scripts/train_unsloth.py # Full pipeline: prep, train, export, register, eval
256
  python scripts/train_unsloth.py --export-only # Skip training, just export saved adapter
257
  python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
@@ -313,7 +316,7 @@ frontend/
313
  prompts.js # FORM + DANGER prompts (template-based for on-device E2B)
314
  pipeline.js # Orchestrator (engine.complete({messages, options}) contract)
315
  cactus.js # Capacitor facade for Cactus SDK
316
- __tests__/ # 62/62 assertions pass under node --test
317
  public/sw.js # Service worker for PWA offline caching (browser install)
318
  public/manifest.json # PWA manifest
319
  capacitor.config.json # Capacitor config (appId com.sakhi.app, http scheme for LAN)
 
86
 
87
  Two reproduction paths, calibrated to how much friction the reviewer wants to accept.
88
 
89
+ **Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt` — inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
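
Both Path 1 prerequisites (daemon running, exact tag pulled) can be checked before starting `api.py`. The sketch below is not part of the repository; it assumes Ollama is listening on its default address `http://localhost:11434` and that its `/api/tags` endpoint returns a `models` list whose entries carry a `name` field. `has_model` and `preflight` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address
WANTED = "gemma4:e4b-it-q4_K_M"        # tag named in the Quick Start


def has_model(tags_payload, wanted):
    """True if the exact tag appears in an /api/tags response body."""
    names = {m.get("name") for m in tags_payload.get("models", [])}
    return wanted in names


def preflight(base_url=OLLAMA_URL, wanted=WANTED):
    """Return 'ok' or a short message describing what is missing."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            payload = json.load(resp)
    except OSError as exc:  # URLError and timeouts are OSError subclasses
        return f"Ollama daemon not reachable at {base_url}: {exc}"
    if not has_model(payload, wanted):
        return f"Model {wanted} not pulled; run: ollama pull {wanted}"
    return "ok"


# Offline illustration: a payload holding only the shorter tag does not match.
print(has_model({"models": [{"name": "gemma4:e4b"}]}, WANTED))
# → False
```

Calling `preflight()` before `python api.py` would turn a first-inference 404 into an explicit message about the missing tag or daemon.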
90
 
91
  **Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
92
  1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
 
154
  - Zero false danger alarms on normal visits
155
  - Correct referral escalation on danger cases
156
  - Avg 18.7s per test (form + danger sign extraction)
157
+ - The rubric is per-case: each test asserts a small list of `hallucination_traps` (fields that MUST be null for that input). It does not assert null-everywhere-not-mentioned across the full schema. See [FAILURES.md](FAILURES.md) for one known under-specified trap (`pregnancy.previous_complications` on ANC preeclampsia).
158
 
159
  **End-to-end audio pipeline:** 13/15 tests pass (87%) — test_pipeline_e2e.py
160
  - 15 synthetic Hindi audio samples through full pipeline
 
201
  ## Quick Start
202
 
203
  ```bash
204
+ # Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
205
 
206
  # ── Health-center deployment (workstation, unified UI + API) ──
207
+ pip install -r requirements-hf.txt # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
208
+ ollama pull gemma4:e4b-it-q4_K_M # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
209
  cd frontend && npm install && npm run build && cd ..
210
  python api.py
211
  # Browser: http://localhost:8000 (React UI)
 
252
  python scripts/test_asr.py # Hindi normalization (133/133)
253
  cd frontend && npm test # JS pipeline port (72/72)
254
 
255
+ # Retrain + A/B eval (requires the FULL requirements.txt: Unsloth + PyTorch + bitsandbytes,
256
+ # plus an RTX GPU, cmake, and llama.cpp binaries on PATH for GGUF export)
257
+ pip install -r requirements.txt # NOTE: training-only deps, Blackwell-pinned PyTorch nightly
258
  python scripts/train_unsloth.py # Full pipeline: prep, train, export, register, eval
259
  python scripts/train_unsloth.py --export-only # Skip training, just export saved adapter
260
  python scripts/compare_field_coverage.py # Field-level diff base vs sakhi
 
316
  prompts.js # FORM + DANGER prompts (template-based for on-device E2B)
317
  pipeline.js # Orchestrator (engine.complete({messages, options}) contract)
318
  cactus.js # Capacitor facade for Cactus SDK
319
+ __tests__/ # 72/72 assertions pass under node --test
320
  public/sw.js # Service worker for PWA offline caching (browser install)
321
  public/manifest.json # PWA manifest
322
  capacitor.config.json # Capacitor config (appId com.sakhi.app, http scheme for LAN)