Spaces:
Sleeping
Sleeping
Commit Β·
5a953f7
1
Parent(s): b96baea
revert ink vision endpoint from Ollama Cloud back to Gemini
Browse filesgemma4:31b-cloud advertises multimodal support, but image inputs on
Ollama Cloud's free tier were unreliable β /ink/recognize calls came
back empty in practice. Reverting INK_VISION_* defaults to Gemini 2.0
Flash via Google AI Studio's OpenAI-compatible endpoint, which we
verified works in production on the deployed Space.
settings.py was already on Gemini defaults; this just realigns
.env.example and the README narrative entry so the local-dev story
matches what's deployed.
- .env.example +7 -4
- README.md +1 -1
.env.example
CHANGED
|
@@ -25,10 +25,13 @@ THINKING_TOKEN_BUDGET=4096
|
|
| 25 |
FALLBACK_LATENCY_THRESHOLD=3.5
|
| 26 |
|
| 27 |
# Vision model used by /ink/recognize (needs image_url support).
|
| 28 |
-
#
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
|
|
|
|
|
|
|
|
|
| 32 |
|
| 33 |
# Frontend flags (VITE_ prefix required for Vite to expose them to the browser).
|
| 34 |
# Set to "false" to disable air-writing stroke capture and ink recognition.
|
|
|
|
| 25 |
FALLBACK_LATENCY_THRESHOLD=3.5
|
| 26 |
|
| 27 |
# Vision model used by /ink/recognize (needs image_url support).
|
| 28 |
+
# Routed via Google AI Studio's OpenAI-compatible endpoint β gemma4:31b-cloud
|
| 29 |
+
# advertises vision support but Ollama Cloud's tier gating made image inputs
|
| 30 |
+
# unreliable, so air-writing recognition uses Gemini 2.0 Flash instead.
|
| 31 |
+
# Get a key at: https://aistudio.google.com/apikey
|
| 32 |
+
INK_VISION_MODEL=gemini-2.0-flash
|
| 33 |
+
INK_VISION_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
|
| 34 |
+
INK_VISION_API_KEY=
|
| 35 |
|
| 36 |
# Frontend flags (VITE_ prefix required for Vite to expose them to the browser).
|
| 37 |
# Set to "false" to disable air-writing stroke capture and ink recognition.
|
README.md
CHANGED
|
@@ -459,7 +459,7 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
|
|
| 459 |
- Affect is read from MediaPipe FaceLandmarker blendshape scores (`mouthSmileLeft`, `browDownLeft`, `eyeSquintLeft`, `jawOpen`, `browInnerUp`, etc.) rather than hand-rolled landmark math. `classifyAffect` in [frontend/src/lib/sensing.ts](frontend/src/lib/sensing.ts) emits `HAPPY` / `FRUSTRATED` / `SURPRISED` / `NEUTRAL` from those scores.
|
| 460 |
- **Per-user calibration window.** When the webcam first comes alive, a 5-second overlay records the user's neutral baseline β trimmed mean and stddev for each blendshape, plus neutral gaze direction and head pose. Detection then fires when a signal exceeds the user's *own* mean by `SIGMA_K = 2.0` standard deviations, so a face whose resting smile blendshape sits at 0.4 doesn't permanently read as HAPPY. One global tunable (Ο multiplier) replaces the wall of magic-number thresholds the old geometric pipeline carried. `Calibrator` in [sensing.ts](frontend/src/lib/sensing.ts), wired through [useSensing.ts](frontend/src/hooks/useSensing.ts), surfaced in [CalibrationOverlay.tsx](frontend/src/components/CalibrationOverlay.tsx). A "Recalibrate" button on the sensing panel re-runs the window any time. Set `VITE_CALIBRATION_ENABLED=false` in `.env` to fall back to fixed thresholds for debugging.
|
| 461 |
- [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) β a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
|
| 462 |
-
- [x] **[Core]** Air-writing uses a vision LLM (
|
| 463 |
- [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload β `source β voice_only | air_only | agree | conflict_air | conflict_voice` β which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) β only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
|
| 464 |
|
| 465 |
### Intent decomposition
|
|
|
|
| 459 |
- Affect is read from MediaPipe FaceLandmarker blendshape scores (`mouthSmileLeft`, `browDownLeft`, `eyeSquintLeft`, `jawOpen`, `browInnerUp`, etc.) rather than hand-rolled landmark math. `classifyAffect` in [frontend/src/lib/sensing.ts](frontend/src/lib/sensing.ts) emits `HAPPY` / `FRUSTRATED` / `SURPRISED` / `NEUTRAL` from those scores.
|
| 460 |
- **Per-user calibration window.** When the webcam first comes alive, a 5-second overlay records the user's neutral baseline β trimmed mean and stddev for each blendshape, plus neutral gaze direction and head pose. Detection then fires when a signal exceeds the user's *own* mean by `SIGMA_K = 2.0` standard deviations, so a face whose resting smile blendshape sits at 0.4 doesn't permanently read as HAPPY. One global tunable (Ο multiplier) replaces the wall of magic-number thresholds the old geometric pipeline carried. `Calibrator` in [sensing.ts](frontend/src/lib/sensing.ts), wired through [useSensing.ts](frontend/src/hooks/useSensing.ts), surfaced in [CalibrationOverlay.tsx](frontend/src/components/CalibrationOverlay.tsx). A "Recalibrate" button on the sensing panel re-runs the window any time. Set `VITE_CALIBRATION_ENABLED=false` in `.env` to fall back to fixed thresholds for debugging.
|
| 461 |
- [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) β a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
|
| 462 |
+
- [x] **[Core]** Air-writing uses a vision LLM (Gemini 2.0 Flash via Google AI Studio's OpenAI-compatible endpoint, configurable through `INK_VISION_MODEL` / `INK_VISION_BASE_URL` / `INK_VISION_API_KEY`) instead of the older in-browser DTW template bank. We briefly swapped to `gemma4:31b-cloud` on Ollama Cloud since gemma4 is multimodal, but image-input on Ollama Cloud's free tier turned out to be unreliable β Gemini Flash is cheaper to obtain (free key from [aistudio.google.com/apikey](https://aistudio.google.com/apikey)) and consistent. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) β index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200Γ200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X β incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
|
| 463 |
- [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload β `source β voice_only | air_only | agree | conflict_air | conflict_voice` β which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) β only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
|
| 464 |
|
| 465 |
### Intent decomposition
|