shwetangisingh commited on
Commit
5a953f7
Β·
1 Parent(s): b96baea

revert ink vision endpoint from Ollama Cloud back to Gemini

Browse files

gemma4:31b-cloud advertises multimodal support, but image inputs on
Ollama Cloud's free tier were unreliable β€” /ink/recognize calls came
back empty in practice. Reverting INK_VISION_* defaults to Gemini 2.0
Flash via Google AI Studio's OpenAI-compatible endpoint, which we
verified works in production on the deployed Space.

settings.py was already on Gemini defaults; this just realigns
.env.example and the README narrative entry so the local-dev story
matches what's deployed.

Files changed (2) hide show
  1. .env.example +7 -4
  2. README.md +1 -1
.env.example CHANGED
@@ -25,10 +25,13 @@ THINKING_TOKEN_BUDGET=4096
25
  FALLBACK_LATENCY_THRESHOLD=3.5
26
 
27
  # Vision model used by /ink/recognize (needs image_url support).
28
- # Reuses the Ollama Cloud endpoint β€” gemma4:31b-cloud is multimodal.
29
- INK_VISION_MODEL=gemma4:31b-cloud
30
- INK_VISION_BASE_URL=http://localhost:11434/v1
31
- INK_VISION_API_KEY=ollama
 
 
 
32
 
33
  # Frontend flags (VITE_ prefix required for Vite to expose them to the browser).
34
  # Set to "false" to disable air-writing stroke capture and ink recognition.
 
25
  FALLBACK_LATENCY_THRESHOLD=3.5
26
 
27
  # Vision model used by /ink/recognize (needs image_url support).
28
+ # Routed via Google AI Studio's OpenAI-compatible endpoint β€” gemma4:31b-cloud
29
+ # advertises vision support but Ollama Cloud's tier gating made image inputs
30
+ # unreliable, so air-writing recognition uses Gemini 2.0 Flash instead.
31
+ # Get a key at: https://aistudio.google.com/apikey
32
+ INK_VISION_MODEL=gemini-2.0-flash
33
+ INK_VISION_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
34
+ INK_VISION_API_KEY=
35
 
36
  # Frontend flags (VITE_ prefix required for Vite to expose them to the browser).
37
  # Set to "false" to disable air-writing stroke capture and ink recognition.
README.md CHANGED
@@ -459,7 +459,7 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
459
  - Affect is read from MediaPipe FaceLandmarker blendshape scores (`mouthSmileLeft`, `browDownLeft`, `eyeSquintLeft`, `jawOpen`, `browInnerUp`, etc.) rather than hand-rolled landmark math. `classifyAffect` in [frontend/src/lib/sensing.ts](frontend/src/lib/sensing.ts) emits `HAPPY` / `FRUSTRATED` / `SURPRISED` / `NEUTRAL` from those scores.
460
  - **Per-user calibration window.** When the webcam first comes alive, a 5-second overlay records the user's neutral baseline β€” trimmed mean and stddev for each blendshape, plus neutral gaze direction and head pose. Detection then fires when a signal exceeds the user's *own* mean by `SIGMA_K = 2.0` standard deviations, so a face whose resting smile blendshape sits at 0.4 doesn't permanently read as HAPPY. One global tunable (Οƒ multiplier) replaces the wall of magic-number thresholds the old geometric pipeline carried. `Calibrator` in [sensing.ts](frontend/src/lib/sensing.ts), wired through [useSensing.ts](frontend/src/hooks/useSensing.ts), surfaced in [CalibrationOverlay.tsx](frontend/src/components/CalibrationOverlay.tsx). A "Recalibrate" button on the sensing panel re-runs the window any time. Set `VITE_CALIBRATION_ENABLED=false` in `.env` to fall back to fixed thresholds for debugging.
461
  - [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) β€” a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
462
- - [x] **[Core]** Air-writing uses a vision LLM (`gemma4:31b-cloud` via Ollama Cloud, configurable through `INK_VISION_MODEL`) instead of the older in-browser DTW template bank. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) β€” index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200Γ—200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X β€” incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
463
  - [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload β€” `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` β€” which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) β€” only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
464
 
465
  ### Intent decomposition
 
459
  - Affect is read from MediaPipe FaceLandmarker blendshape scores (`mouthSmileLeft`, `browDownLeft`, `eyeSquintLeft`, `jawOpen`, `browInnerUp`, etc.) rather than hand-rolled landmark math. `classifyAffect` in [frontend/src/lib/sensing.ts](frontend/src/lib/sensing.ts) emits `HAPPY` / `FRUSTRATED` / `SURPRISED` / `NEUTRAL` from those scores.
460
  - **Per-user calibration window.** When the webcam first comes alive, a 5-second overlay records the user's neutral baseline β€” trimmed mean and stddev for each blendshape, plus neutral gaze direction and head pose. Detection then fires when a signal exceeds the user's *own* mean by `SIGMA_K = 2.0` standard deviations, so a face whose resting smile blendshape sits at 0.4 doesn't permanently read as HAPPY. One global tunable (Οƒ multiplier) replaces the wall of magic-number thresholds the old geometric pipeline carried. `Calibrator` in [sensing.ts](frontend/src/lib/sensing.ts), wired through [useSensing.ts](frontend/src/hooks/useSensing.ts), surfaced in [CalibrationOverlay.tsx](frontend/src/components/CalibrationOverlay.tsx). A "Recalibrate" button on the sensing panel re-runs the window any time. Set `VITE_CALIBRATION_ENABLED=false` in `.env` to fall back to fixed thresholds for debugging.
461
  - [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) β€” a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
462
+ - [x] **[Core]** Air-writing uses a vision LLM (Gemini 2.0 Flash via Google AI Studio's OpenAI-compatible endpoint, configurable through `INK_VISION_MODEL` / `INK_VISION_BASE_URL` / `INK_VISION_API_KEY`) instead of the older in-browser DTW template bank. We briefly swapped to `gemma4:31b-cloud` on Ollama Cloud since gemma4 is multimodal, but image-input on Ollama Cloud's free tier turned out to be unreliable β€” Gemini Flash is cheaper to obtain (free key from [aistudio.google.com/apikey](https://aistudio.google.com/apikey)) and consistent. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) β€” index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200Γ—200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X β€” incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
463
  - [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload β€” `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` β€” which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) β€” only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
464
 
465
  ### Intent decomposition