Spaces: Tushar9802/sakhi
Tushar9802 committed 7 days ago

Commit 05f829f (1 parent: d3595cb)

docs: fix Path 1 (model tag + slim deps) + log eval-rubric gaps

Quick Start + Path 1 narrative previously pointed reviewers to commands
that would have failed verbatim:
- `ollama pull gemma4:e4b`: app.py / Dockerfile / entrypoint.sh all
default OLLAMA_MODEL to `gemma4:e4b-it-q4_K_M`. A reviewer pulling the
shorter `gemma4:e4b` tag would have hit an Ollama 404 on first inference,
because the tag the app actually requests was never pulled.
- `pip install -r requirements.txt` — but that file pins PyTorch
nightly cu128 + Unsloth + bitsandbytes, all of which are training-only.
The cu128 wheel is Blackwell-only; reviewers on RTX 30/40 / Linux /
macOS would have either failed the install or waited ~15 min on unused
nightlies. Path 1 inference goes through Ollama + faster-whisper —
requirements-hf.txt is sufficient.
- Prerequisites silently assumed the Ollama daemon was running. Made
that explicit (Windows tray app, Linux/macOS `ollama serve`).
Also corrected Python 3.11+ → 3.10+ (matches Dockerfile) and VRAM guidance
to the actual model footprint (~9 GB resident, not the misleading 16 GB).
The retrain block now explicitly calls for the full requirements.txt with
a NOTE about the Blackwell-pinned nightly so the split is visible.
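
The two runtime pitfalls listed above (wrong model tag pulled, Ollama daemon not running) can be caught before `python api.py` with a small preflight check. The sketch below is an illustration, not code from this repository: it assumes Ollama's default address `http://localhost:11434` and its `GET /api/tags` endpoint, and the helper names `model_is_pulled` and `preflight` are hypothetical.

```python
import json
import os
import urllib.error
import urllib.request

# Default tag named in the commit message; override with OLLAMA_MODEL.
DEFAULT_MODEL = "gemma4:e4b-it-q4_K_M"


def model_is_pulled(tags_payload: dict, wanted: str) -> bool:
    """Return True if `wanted` appears in an /api/tags-style payload."""
    names = {m.get("name") for m in tags_payload.get("models", [])}
    return wanted in names


def preflight(base_url: str = "http://localhost:11434") -> str:
    """Hypothetical helper: report daemon and model status as a message."""
    wanted = os.environ.get("OLLAMA_MODEL", DEFAULT_MODEL)
    try:
        # Lists the models that have been pulled locally.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError):
        return "Ollama daemon not reachable; start it before running api.py"
    if not model_is_pulled(payload, wanted):
        return f"model {wanted!r} not pulled; run: ollama pull {wanted}"
    return "ok"
```

Calling `preflight()` returns `"ok"` only when the daemon answers and the exact tag is present, so a machine that pulled only `gemma4:e4b` is reported as missing the model instead of failing on first inference.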

FAILURES.md gains three sections surfacing failure modes a careful
reviewer would otherwise pose as a question:
1. pregnancy.previous_complications slot misclassification on ANC
preeclampsia (one-line schema fix held back this close to deadline
due to whole-schema regression surface; root-caused to bare
anc_visit.json field with no `description` attribute).
2. Eval-rubric scope: hallucination_traps are per-case, not
null-everywhere-not-mentioned across the schema — explains why local
15/15 passed for weeks while the misclassification was live.
3. ANC long-clip BP drop on conversational pacing; short clip is the
manifest default to lead with the fast end-to-end demo.

README known-limitations bullet now points at FAILURES.md, and the
JS-pipeline-port test count is corrected 62/62 → 72/72 in both README
and JUDGE_BRIEF.md.

Files changed (3)

FAILURES.md +46 -2
JUDGE_BRIEF.md +1 -1
README.md +9 -6

FAILURES.md CHANGED

@@ -57,6 +57,50 @@ Documented in `RETRAIN_RESULTS.md`. We ship the base model in the live Ollama pa
 `scripts/test_asr.py` covers all 0–999 Hindi number words + common Whisper misspelling variants + compound medical values (BP, weight, Hb, decimal, fractional). No known failures.
-## JS pipeline port: 62 / 62 pass
-`frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic. No known failures.

+## JS pipeline port: 72 / 72 pass
+`frontend/src/lib/__tests__/*.test.js` under `node --test`. Covers `parseJsonLoose` repair cases, `extractForm` validation, `extractDangerSigns` JSON path including fenced-JSON tolerance and parse-failure graceful-degrade, `runPipeline` end-to-end with a mock engine, Hindi normalizer parity with the Python port, visit-type keyword heuristic, and the demographics-header merge (`applyMetadata`) across ANC / PNC / child-health / delivery schemas. No known failures.
+---
+## ANC form: `pregnancy.previous_complications` slot misclassification on preeclampsia transcripts
+**Harness:** live ANC preeclampsia inputs — synthetic text example `EXAMPLE_TRANSCRIPTS[1]` in `app.py`, real-voice clip `demo_audio/anc_preeclampsia_full.ogg`.
+**Observed output:** the form's `pregnancy.previous_complications` field is populated with the current-visit symptoms, "सिरदर्द, आँखों के सामने धुंधला दिखना, चेहरे पर सूजन, पैरों में सूजन" ("headache, blurred vision, swelling of the face, swelling of the feet"), even though the conversation describes preeclampsia presenting *today*, not in a prior pregnancy. The same symptoms also appear correctly in `symptoms_reported`, and the danger panel correctly flags `severe_hypertension` / `severe_headache_and_visual_changes` / `edema` with `refer_immediately` and verbatim Hindi evidence. No clinical signal is lost; the misclassification is a duplicate in the wrong slot.
+### Root cause
+`configs/schemas/anc_visit.json:29` defines `previous_complications` with bare `{"type": ["string", "null"]}` and no `description` attribute — unlike adjacent fields (`lmp_date`, `gravida`, `para`) which carry explicit descriptions. The model is inferring semantics from the field name alone, and in a conversation densely populated with current findings it slots them into this field. The same input through the JS pipeline on Cactus (E2B INT4) does not exhibit the bug — the on-device path uses a null-filled instance template prompt rather than a raw JSON Schema, which sidesteps the under-described-field ambiguity.
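
To make the difference between the two prompt styles concrete, here is a minimal sketch of deriving a null-filled instance template from a JSON Schema. It is not the on-device implementation: the function name `null_template` is hypothetical, and the schema fragment is illustrative (only the field names and the bare `previous_complications` definition come from the text above; the other types and descriptions are assumptions).

```python
def null_template(schema: dict):
    """Build a null-filled instance from a JSON Schema fragment.

    Objects recurse into their properties; every leaf becomes None.
    """
    if schema.get("type") == "object" or "properties" in schema:
        return {
            name: null_template(sub)
            for name, sub in schema.get("properties", {}).items()
        }
    return None


# Illustrative fragment only, NOT the real anc_visit.json.
schema = {
    "type": "object",
    "properties": {
        "pregnancy": {
            "type": "object",
            "properties": {
                "lmp_date": {"type": ["string", "null"], "description": "assumed"},
                "gravida": {"type": ["integer", "null"], "description": "assumed"},
                "para": {"type": ["integer", "null"], "description": "assumed"},
                # Bare definition, as described in the root cause above.
                "previous_complications": {"type": ["string", "null"]},
            },
        }
    },
}

template = null_template(schema)
```

With a template like this, the model is asked to fill in an instance whose slots all start as `null`, rather than to interpret type definitions, so a missing `description` has less room to mislead it.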
+### Disposition
+One-line schema fix (add `"description": "Complications in PRIOR pregnancies — not current-visit findings"`) is held back close to deadline. The regression surface is the full form schema across all four visit types and we don't have time to re-run the eval suite against a tightened schema with confidence. The safety-critical output (danger panel + referral decision) is unaffected, so the conservative choice is documented disclosure now, schema cleanup post-competition.
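
When the schema cleanup does happen, other under-described fields could be found mechanically rather than one bug at a time. The following is a sketch under assumptions: it expects nested `properties` objects, the helper name `fields_missing_description` is hypothetical, and the fragment is illustrative rather than the real `anc_visit.json`.

```python
def fields_missing_description(schema: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of leaf fields that have no `description`."""
    missing = []
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if "properties" in sub:
            # Nested object: recurse and extend the dotted path.
            missing.extend(fields_missing_description(sub, path + "."))
        elif not sub.get("description"):
            missing.append(path)
    return missing


# Illustrative fragment only; descriptions marked "assumed" are placeholders.
schema = {
    "type": "object",
    "properties": {
        "pregnancy": {
            "type": "object",
            "properties": {
                "lmp_date": {"type": ["string", "null"], "description": "assumed"},
                "gravida": {"type": ["integer", "null"], "description": "assumed"},
                "previous_complications": {"type": ["string", "null"]},
            },
        }
    },
}

undocumented = fields_missing_description(schema)
```

Run against each of the four visit-type schemas, a check like this would list every field whose meaning the model has to guess from its name alone.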
+---
+## Eval-rubric scope: per-case hallucination traps under-specify ANC
+**Harness:** `scripts/test_ollama_quality.py`
+The 15/15 pass rate is computed against per-case `hallucination_traps` lists — each test enumerates the specific fields that MUST be null for that input, and the suite only asserts those (`scripts/test_ollama_quality.py:470-473`). For the ANC preeclampsia case at line 93, the trap list is `["patient.name", "lab_results.blood_group"]` — `pregnancy.previous_complications` is not checked, which is why the misclassification above passed every run.
+### Disposition
+The rubric is honest about what it tests — `hallucination_traps` is the literal list of fields each test asserts null for, and the test source is reproducible. But "15/15 tests pass" rests on a narrow per-case rubric, not a whole-schema null-everywhere-not-mentioned check. A wider rubric (every schema field absent from the transcript MUST be null) would have caught the misclassification above before deploy. Post-competition the rubric will be widened; the current ratio is reported as-is.
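
The gap between the two rubrics can be shown in a few lines. This sketch is not the code in `scripts/test_ollama_quality.py`: the function names, the example form values, and the `mentioned` annotation (the set of fields the transcript actually supports) are hypothetical; only the trap list and the field names come from the text above.

```python
def flatten(form: dict, prefix: str = "") -> dict:
    """Flatten a nested form into {dotted.path: value}."""
    flat = {}
    for key, value in form.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def per_case_violations(form: dict, traps: list[str]) -> list[str]:
    """Narrow rubric: only the listed trap fields must be null."""
    flat = flatten(form)
    return [f for f in traps if flat.get(f) is not None]


def whole_schema_violations(form: dict, mentioned: set[str]) -> list[str]:
    """Wide rubric: every field NOT mentioned in the transcript must be null."""
    flat = flatten(form)
    return sorted(f for f, v in flat.items() if v is not None and f not in mentioned)


# Hypothetical model output for the ANC preeclampsia case.
form = {
    "patient": {"name": None},
    "lab_results": {"blood_group": None},
    "pregnancy": {"previous_complications": "headache, facial swelling"},
    "symptoms_reported": ["headache", "facial swelling"],
}
traps = ["patient.name", "lab_results.blood_group"]
mentioned = {"symptoms_reported"}

narrow = per_case_violations(form, traps)
wide = whole_schema_violations(form, mentioned)
```

Here `narrow` is empty, so the per-case test passes, while `wide` contains `pregnancy.previous_complications`. The cost of the wider rubric is that each test case needs a `mentioned` annotation in addition to its trap list.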
+---
+## ANC long-clip BP drop on conversational pacing
+**Harness:** `demo_audio/anc_preeclampsia_full.ogg` (52 s self-recorded clip) on the live HF Space.
+Whisper-Large CT2 returns the BP segment as "हाई हो रखा है" ("it has been running high"): the "BP बहुत ज़्यादा है" ("BP is very high") framing remains, but the actual numeric value `155/100` is dropped. The 20-second short clip (`demo_audio/anc_preeclampsia_short.ogg`), where the same speaker pauses deliberately around `बटा` ("over", the spoken fraction bar), transcribes `155/100` reliably.
+### Root cause
+Conversational pacing on the long clip. BP `एक सौ साठ बटा एक सौ दस` ("one hundred sixty over one hundred ten") is recoverable from Whisper-Large with a ~0.5 s gap around `बटा` ("over"), and is lost without that gap. Same speaker, same model, same hardware: the variable is delivery prosody, not Whisper.
+### Disposition
+Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so a reviewer's first impression preserves the full BP path. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` ("is very high") framing even when the number is dropped.

JUDGE_BRIEF.md CHANGED

@@ -16,7 +16,7 @@ Sakhi converts Hindi home-visit conversations (voice on a shared health-center w
 | Measurement | Value | Source |
 |---|---|---|
-| Text extraction pass rate (base Gemma 4 E4B) | **15 / 15** | `scripts/test_ollama_quality.py` |
+| Text extraction pass rate (base Gemma 4 E4B) | **15 / 15** | `scripts/test_ollama_quality.py` — per-case rubric; one under-specified trap documented in [FAILURES.md](FAILURES.md) |
 | End-to-end audio pipeline pass rate | **13 / 15** | `scripts/test_pipeline_e2e.py` (2 TTS→ASR artifacts, documented in FAILURES.md) |
 | Hindi number / medical-term normalization | **133 / 133** | `scripts/test_asr.py` |
 | On-device JS pipeline port (engine-agnostic) | **72 / 72** | `cd frontend && node --test src/lib/__tests__/` |

README.md CHANGED

@@ -86,7 +86,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
 Two reproduction paths, calibrated to how much friction the reviewer wants to accept.
-**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥16 GB VRAM. No phone needed; same extraction code, same anti-hallucination validation, same form output. `pip install -r requirements.txt && ollama pull gemma4:e4b && python api.py` then open `http://localhost:8000`. Voice-to-form, text-to-form, and queue-and-sync flows all run here. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
+**Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt` — inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
 **Path 2 — on-device on Android, ~20-25 minutes total (for verifying the Cactus track).** Requires accepting the Cactus-Compute model license. Steps:
 1. Accept terms at [huggingface.co/Cactus-Compute/gemma-4-E2B-it](https://huggingface.co/Cactus-Compute/gemma-4-E2B-it) (1 min, free HF account).
@@ -154,6 +154,7 @@ Health Center (workstation, RTX GPU)              Field (Android phone)
 - Zero false danger alarms on normal visits
 - Correct referral escalation on danger cases
 - Avg 18.7s per test (form + danger sign extraction)
+- The rubric is per-case: each test asserts a small list of `hallucination_traps` (fields that MUST be null for that input). It does not assert null-everywhere-not-mentioned across the full schema. See [FAILURES.md](FAILURES.md) for one known under-specified trap (`pregnancy.previous_complications` on ANC preeclampsia).
 **End-to-end audio pipeline:** 13/15 tests pass (87%) — test_pipeline_e2e.py
 - 15 synthetic Hindi audio samples through full pipeline
@@ -200,11 +201,11 @@ One React + Vite codebase, shipped as both a browser UI (served by FastAPI at `/
 ## Quick Start
 ```bash
-# Prerequisites: Python 3.11+, Node 18+, Ollama, CUDA GPU (16GB VRAM recommended)
+# Prerequisites: Python 3.10+, Node 18+, Ollama (daemon running — Windows: launch the tray app, Linux/macOS: `ollama serve` in another shell), CUDA GPU (~10 GB VRAM for E4B Q4_K_M)
 # ── Health-center deployment (workstation, unified UI + API) ──
-pip install -r requirements.txt
-ollama pull gemma4:e4b
+pip install -r requirements-hf.txt          # slim runtime deps; Ollama + faster-whisper, no PyTorch/Unsloth
+ollama pull gemma4:e4b-it-q4_K_M             # ~9 GB; exact tag app.py defaults to (override with OLLAMA_MODEL=...)
 cd frontend && npm install && npm run build && cd ..
 python api.py
 # Browser: http://localhost:8000  (React UI)
@@ -251,7 +252,9 @@ python scripts/test_pipeline_e2e.py      # Full E2E audio (13/15)
 python scripts/test_asr.py               # Hindi normalization (133/133)
 cd frontend && npm test                  # JS pipeline port (72/72)
-# Retrain + A/B eval (requires RTX GPU, cmake, llama.cpp binaries)
+# Retrain + A/B eval (requires the FULL requirements.txt: Unsloth + PyTorch + bitsandbytes,
+# plus an RTX GPU, cmake, and llama.cpp binaries on PATH for GGUF export)
+pip install -r requirements.txt                 # NOTE: training-only deps, Blackwell-pinned PyTorch nightly
 python scripts/train_unsloth.py                 # Full pipeline: prep, train, export, register, eval
 python scripts/train_unsloth.py --export-only   # Skip training, just export saved adapter
 python scripts/compare_field_coverage.py        # Field-level diff base vs sakhi
@@ -313,7 +316,7 @@ frontend/
     prompts.js                      # FORM + DANGER prompts (template-based for on-device E2B)
     pipeline.js                     # Orchestrator (engine.complete({messages, options}) contract)
     cactus.js                       # Capacitor facade for Cactus SDK
-    __tests__/                      # 62/62 assertions pass under node --test
+    __tests__/                      # 72/72 assertions pass under node --test
   public/sw.js                      # Service worker for PWA offline caching (browser install)
   public/manifest.json              # PWA manifest
   capacitor.config.json             # Capacitor config (appId com.sakhi.app, http scheme for LAN)