Commit 535a98d
Parent(s): c09a7e7
Add voice + air-writing conflict resolution
Push-to-talk Web Speech mic (gated to verbal-access personas) plus
a Jaccard + AAC-priority-token resolver. The resulting
{text, source, voice_text, air_text} drives a supplemental sub-intent
and source-aware planner copy across five branches. Voice/air text is
sanitised before LLM interpolation to close a prompt-injection vector.
Also refreshes CLAUDE.md's persona table to match data/users.json.
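The commit message lists the resolver's ingredients: Jaccard token overlap plus a set of AAC-priority tokens. `resolveIntent.ts` itself is not shown here. The Python sketch below is a hypothetical transliteration of that idea, not the shipped TypeScript. Several details are assumptions: the tokenizer, the `AGREE_THRESHOLD` of 0.5, keeping the spoken text as the `agree` payload, and the `Resolved` dataclass. The priority set `help/stop/water/done/more` and the six `source` values come from the README and `backend/api/main.py`.

```python
import re
from dataclasses import asdict, dataclass
from typing import Optional, Set

# Priority tokens are from the README; the threshold is an assumption.
AAC_PRIORITY_TOKENS = frozenset({"help", "stop", "water", "done", "more"})
AGREE_THRESHOLD = 0.5  # hypothetical Jaccard cut-off for "agree"


@dataclass
class Resolved:
    text: str
    source: str  # voice_only | air_only | agree | conflict_air | conflict_voice | none
    voice_text: Optional[str] = None
    air_text: Optional[str] = None


def tokens(s: str) -> Set[str]:
    """Lower-cased word tokens; punctuation such as '?' is dropped."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, treating two empty sets as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def resolve_intent(voice: Optional[str], air: Optional[str]) -> Resolved:
    v = (voice or "").strip()
    a = (air or "").strip()
    if not v and not a:
        return Resolved("", "none")
    if not a:
        return Resolved(v, "voice_only", voice_text=v)
    if not v:
        return Resolved(a, "air_only", air_text=a)
    if jaccard(tokens(v), tokens(a)) >= AGREE_THRESHOLD:
        # Assumption: keep the (usually longer) spoken form as the payload.
        return Resolved(v, "agree", voice_text=v, air_text=a)
    if tokens(a) & AAC_PRIORITY_TOKENS:
        # A canonical AAC token in the air-written text wins the conflict.
        return Resolved(a, "conflict_air", voice_text=v, air_text=a)
    return Resolved(v, "conflict_voice", voice_text=v, air_text=a)


print(asdict(resolve_intent("I need the remote", "stop")))
# → {'text': 'stop', 'source': 'conflict_air', 'voice_text': 'I need the remote', 'air_text': 'stop'}
```

In this sketch, a canonical `stop` outranks an unrelated spoken sentence, which matches the `conflict_air` copy in the planner. A stroke with no priority token, such as `hi`, falls through to `conflict_voice`.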
- CLAUDE.md +35 -11
- README.md +1 -1
- backend/api/main.py +13 -0
- backend/main.py +2 -0
- backend/pipeline/nodes/feedback.py +3 -0
- backend/pipeline/nodes/intent.py +12 -9
- backend/pipeline/nodes/planner.py +72 -8
- backend/pipeline/state.py +4 -0
- frontend/src/App.css +24 -0
- frontend/src/components/ChatPanel.tsx +67 -0
- frontend/src/hooks/useVoice.ts +170 -0
- frontend/src/lib/resolveIntent.ts +109 -0
- frontend/src/lib/voiceEligibility.ts +18 -0
- frontend/src/types.ts +17 -0
CLAUDE.md
CHANGED
@@ -2,11 +2,12 @@

 ## What This Project Does

-An AI chatbot that **speaks as an AAC user**, not to them. Given
-
-
-
-
+An AI chatbot that **speaks as an AAC user**, not to them. Given one of 14
+personas — nine anchored in real memoirs and five in canonical fiction —
+it fuses real-time multimodal non-verbal signals with personal memory
+retrieval to generate responses in that person's authentic voice. Orchestrated
+as a **plain Python function chain** across five layers, with two conditional
+branches.

 ---

@@ -66,13 +67,31 @@ logs/ Per-turn JSONL logs (gitignored)

 ## Personas

+Fourteen personas shipped. Real-memoir-anchored:
+
+| ID | Name | Condition | Access |
+|----|------|-----------|--------|
+| `stephen_hawking` | Stephen Hawking | ALS (advanced) | Cheek-twitch + ACAT predictive speech |
+| `jean_dominique_bauby` | Jean-Dominique Bauby | Locked-in syndrome | Alphabet-blink with amanuensis |
+| `michael_j_fox` | Michael J. Fox | Parkinson's | Voice + adaptive keyboard + dictation |
+| `gabby_giffords` | Gabby Giffords | Aphasia + right hemiparesis (post-TBI) | Left-hand typing + speech-to-text |
+| `jason_becker` | Jason Becker | ALS (fully locked-in) | Eye-gaze + father's letter-code board |
+| `tito_mukhopadhyay` | Tito Mukhopadhyay | Non-verbal autism | Letterboard + pencil |
+| `christopher_reeve` | Christopher Reeve | C1–C2 spinal cord injury | Dictation to assistants; sip-and-puff |
+| `christy_brown` | Christy Brown | Cerebral palsy (spastic quadriplegia) | Left foot typing / writing |
+| `wendy_mitchell` | Wendy Mitchell | Early-onset dementia | Laptop/phone typing + "brain-book" |
+
+Canonical fiction:
+
 | ID | Name | Condition | Access |
 |----|------|-----------|--------|
-| `
-| `
-| `
+| `abed_nadir` | Abed Nadir (*Community*) | Autism (coded); occasional selective mutism | Mostly verbal; text when overloaded |
+| `allie_calhoun` | Allie Hamilton Calhoun (*The Notebook*) | Late-stage Alzheimer's | Verbal when lucid; yes/no otherwise |
+| `forrest_gump` | Forrest Gump | Intellectual disability (IQ ~75) | Verbal primarily |
+| `raymond_babbitt` | Raymond Babbitt (*Rain Man*) | Savant autism | Verbal when calm + visual schedules |
+| `walter_jr_white` | Walter "Flynn" White Jr. (*Breaking Bad*) | Cerebral palsy | Verbal + smartphone typing |

-25 memory chunks
+~25 bucketed memory chunks per persona (`family` / `medical` / `hobbies` / `daily_routine` / `social`; buckets tuned per-persona). A short-form voice push-to-talk mic surfaces only for personas whose modelled access method is verbal — see `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts).

 ---

@@ -129,8 +148,13 @@ Copy `.env.example` → `.env` and set:
 - **NEVER use local Ollama models** (e.g. `qwen3:8b`, `gemma3:1b`) — this machine
 is not powerful enough and will break. Always use cloud-backed models like
 `gemma4:31b-cloud` via Ollama Cloud.
-- **Adding a persona**: add
-
+- **Adding a persona**: add a memory JSON under `data/memories/<uid>.json` and
+a matching entry in `data/users.json` (or regenerate both via
+`data/generate_users.py` if present), then
+`python -m backend.retrieval.vector_store` to rebuild indexes. If the
+persona's modelled access method includes live speech, also add their `id`
+to `VOICE_CAPABLE_PERSONAS` in `frontend/src/lib/voiceEligibility.ts` so
+the mic button surfaces.
 - **Changing LLM**: set `ACTIVE_LLM_TIER` in `.env` — no code changes needed
 - **Extending sensing**: sensing runs in the React frontend
 (`frontend/src/hooks/useSensing.ts`); to add a new signal, classify it
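`VOICE_CAPABLE_PERSONAS` is a hand-maintained allow-list, so it can drift from `data/users.json` when personas are added or renamed. The Python sketch below shows a minimal consistency check. It uses the seven IDs that the README lists as voice-capable. `is_voice_capable` and `unknown_voice_ids` are hypothetical helpers, and loading IDs from `users.json` is left to the caller because that file's schema isn't shown here.

```python
from typing import Iterable, Optional, Set

# Hypothetical Python mirror of VOICE_CAPABLE_PERSONAS; the real list lives in
# frontend/src/lib/voiceEligibility.ts. IDs match the persona tables above.
VOICE_CAPABLE_PERSONAS = frozenset({
    "abed_nadir",
    "allie_calhoun",
    "forrest_gump",
    "gabby_giffords",
    "michael_j_fox",
    "raymond_babbitt",
    "walter_jr_white",
})


def is_voice_capable(user_id: Optional[str]) -> bool:
    """True only for personas whose modelled access method is verbal."""
    return user_id is not None and user_id in VOICE_CAPABLE_PERSONAS


def unknown_voice_ids(persona_ids: Iterable[str]) -> Set[str]:
    """Allow-list entries that match no shipped persona (e.g. after a rename)."""
    return set(VOICE_CAPABLE_PERSONAS) - set(persona_ids)
```

Running `unknown_voice_ids` over the IDs loaded from `data/users.json` in CI could catch a stale allow-list entry before the mic silently disappears for that persona.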
README.md
CHANGED
@@ -400,7 +400,7 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
 - Calibration is now averaged over the first 30 frames (~1s of neutral face) instead of a single-frame snapshot — a brief smile at startup used to lock in a biased baseline. Affect stays null during calibration; gaze/head/gesture/air-writing still flow.
 - [x] **[Core]** Gestures (`THUMBS_UP` / `THUMBS_DOWN` / `POINTING` / `WAVING`) now carry an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py). A detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
 - [x] **[Core]** Air-writing carries a default template bank ([frontend/src/lib/airTemplates.ts](frontend/src/lib/airTemplates.ts): `yes` / `?` / `hi` / `help` / `done` / `more` / `water` / `stop`) — all single-stroke shapes so DTW can match reliably. On match, the word flows through the pipeline three ways: (1) retrieval picks up the word as an extra `PERSONAL` sub-intent with a bucket hint (see `infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py) — e.g. `help` → medical, `water` → daily_routine), (2) the planner includes an explicit "the user air-wrote X — incorporate verbatim if appropriate" instruction in the user message, and (3) the word appears in `logs/turns.jsonl` for debugging. The recognizer has a `MATCH_THRESHOLD` reject gate and `console.debug`s on empty-bank / no-match so unrecognised strokes never reach the backend. To add more templates, append entries to `DEFAULT_AIR_TEMPLATES` as 32-point normalised single-stroke trajectories.
-- [
+- [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload — `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` — which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) — only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
 - [ ] Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal.

 ### Intent decomposition
backend/api/main.py
CHANGED
@@ -98,6 +98,13 @@ def _reserve_eval_slot(run_id: str) -> None:
 # ── Request / response schemas ─────────────────────────────────────────────────


+class ResolvedIntent(BaseModel):
+    text: str
+    source: str # voice_only | air_only | agree | conflict_air | conflict_voice | none
+    voice_text: str | None = None
+    air_text: str | None = None
+
+
 class ChatRequest(BaseModel):
     user_id: str
     query: str
@@ -106,6 +113,8 @@ class ChatRequest(BaseModel):
     gaze_bucket: str | None = None
     air_written_text: str | None = None
     head_signal: str | None = None # "HEAD_SHAKE"|"HEAD_NOD_DISSATISFIED"
+    voice_text: str | None = None
+    resolved_intent: ResolvedIntent | None = None


 class TurnaroundRequest(BaseModel):
@@ -210,6 +219,10 @@ def _build_initial_state(req: ChatRequest, session: dict) -> PipelineState:
         gaze_bucket=req.gaze_bucket,
         air_written_text=req.air_written_text,
         head_signal=req.head_signal,
+        voice_text=req.voice_text,
+        resolved_intent=(
+            req.resolved_intent.model_dump() if req.resolved_intent else None
+        ),
         turnaround_triggered=False,
         raw_query=req.query,
         intent_route=None,
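With these schema changes, a chat request can carry the raw transcript alongside the resolver output. The sketch below builds the JSON body in Python for illustration. `build_chat_body` is a hypothetical helper that mirrors the `ChatPanel.tsx` rule of sending `null` when the resolver reports `source: "none"`. Only `user_id`, `query` and the fields this commit touches are shown; the other `ChatRequest` fields are left out.

```python
import json
from typing import Any, Dict, Optional


def build_chat_body(
    user_id: str,
    query: str,
    air_text: Optional[str],
    voice_text: Optional[str],
    resolved: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: the ChatRequest fields this commit touches.

    Mirrors ChatPanel.tsx, which sends null instead of a resolver result
    whose source is "none".
    """
    if resolved is not None and resolved.get("source") == "none":
        resolved = None
    return {
        "user_id": user_id,
        "query": query,
        "air_written_text": air_text,
        "voice_text": voice_text,
        "resolved_intent": resolved,  # parsed into ResolvedIntent server-side
    }


body = build_chat_body(
    "gabby_giffords",
    "What do you need?",
    air_text="water",
    voice_text="water please",
    resolved={"text": "water please", "source": "agree",
              "voice_text": "water please", "air_text": "water"},
)
print(json.dumps(body, indent=2))
```

Sending `null` rather than an empty `none` payload leaves `resolved_intent` unset in state. The intent node then falls back to `air_written_text` on its own.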
backend/main.py
CHANGED
@@ -171,6 +171,8 @@ def main() -> None:
         gesture_tag=None,
         gaze_bucket=None,
         air_written_text=None,
+        voice_text=None,
+        resolved_intent=None,
         raw_query=query,
         intent_route=pre_route, # pre-filled → intent node sees it and skips LLM call
         generation_config=pre_gen_config,
backend/pipeline/nodes/feedback.py
CHANGED
@@ -51,6 +51,9 @@ def _log_to_jsonl(
         "retrieval_mode": state.get("retrieval_mode_used", "unknown"),
         "affect": affect,
         "head_signal": state.get("head_signal"),
+        "air_written_text": state.get("air_written_text"),
+        "voice_text": state.get("voice_text"),
+        "resolved_intent": state.get("resolved_intent"),
         "turnaround_triggered": state.get("turnaround_triggered", False),
         "guardrail_passed": state.get("guardrail_passed", True),
         "num_chunks": len(chunks),
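Each log record now includes `resolved_intent`, so conflict rates can be read directly from `logs/turns.jsonl`. The sketch below assumes one JSON object per line with the keys shown in the diff. Records with no `resolved_intent` are counted under `none`; this covers both turns where nothing was sent and log lines written before this field existed.

```python
import json
from collections import Counter
from typing import Iterable


def tally_sources(lines: Iterable[str]) -> Counter:
    """Count resolved_intent.source across per-turn JSONL log records.

    Records without a resolved_intent (none sent, or logs predating this
    field) are counted under "none".
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        resolved = record.get("resolved_intent") or {}
        counts[resolved.get("source") or "none"] += 1
    return counts


# Usage, assuming the log path from the README:
# with open("logs/turns.jsonl", encoding="utf-8") as f:
#     print(tally_sources(f).most_common())
```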
backend/pipeline/nodes/intent.py
CHANGED
@@ -256,18 +256,21 @@ def run(state: PipelineState) -> dict:
         }
     ]

-
-
-
-
+    # Prefer resolved_intent.text when the frontend did voice⇄air reconciliation;
+    # fall back to raw air_written_text when no voice was captured.
+    resolved = state.get("resolved_intent") or {}
+    supplement = (resolved.get("text") or "").strip() or state.get("air_written_text")
+    if supplement:
+        # Classify the supplement the same way as a normal fragment so a
+        # present-tense supplement ("tired") on a present-state question
         # doesn't silently flip the route to PERSONAL and re-enable retrieval.
-
+        sup_cls = _classify(supplement)
         sub_intents.append(
             {
-                "type":
-                "query":
-                "bucket_hint": infer_bucket(
-                if
+                "type": sup_cls,
+                "query": supplement,
+                "bucket_hint": infer_bucket(supplement)
+                if sup_cls == "PERSONAL"
                 else None,
                 "priority": priority,
             }
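The supplement precedence in the intent node is easy to test in isolation. This sketch lifts the committed expression into standalone functions. `classify` and `infer_bucket` are passed in as stand-ins for the repo's `_classify` and `infer_bucket`, so the lambdas and class labels used in tests are hypothetical. `priority` is passed through untouched because its type isn't shown here.

```python
from typing import Any, Callable, Dict, Optional


def pick_supplement(state: Dict[str, Any]) -> Optional[str]:
    # Same expression as the committed intent node: resolved_intent.text wins,
    # a blank or missing text falls back to the raw air-written text.
    resolved = state.get("resolved_intent") or {}
    return (resolved.get("text") or "").strip() or state.get("air_written_text")


def supplemental_sub_intent(
    state: Dict[str, Any],
    classify: Callable[[str], str],
    infer_bucket: Callable[[str], Optional[str]],
    priority: Any,
) -> Optional[Dict[str, Any]]:
    """Build the extra sub-intent dict, or None when there is no supplement."""
    supplement = pick_supplement(state)
    if not supplement:
        return None
    cls = classify(supplement)
    return {
        "type": cls,
        "query": supplement,
        # Only PERSONAL supplements get a memory-bucket hint.
        "bucket_hint": infer_bucket(supplement) if cls == "PERSONAL" else None,
        "priority": priority,
    }
```

The resolved text is stripped before use, but the `air_written_text` fallback is passed through as-is.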
backend/pipeline/nodes/planner.py
CHANGED
@@ -95,6 +95,7 @@ def _run_stream(state: PipelineState, tier: str) -> Iterator[dict]:
     style: StyleDirective = gen_cfg["style"]
     gesture_tag = state.get("gesture_tag")
     air_written_text = state.get("air_written_text")
+    resolved_intent = state.get("resolved_intent")
     turnaround_triggered = state.get("turnaround_triggered", False)
     rejected_response: str | None = None
     if turnaround_triggered:
@@ -195,6 +196,7 @@ def _run_stream(state: PipelineState, tier: str) -> Iterator[dict]:
             gen_cfg,
             gesture_tag=gesture_tag,
             air_written_text=air_written_text,
+            resolved_intent=resolved_intent,
             rejected_response=rejected_response,
             rejected_candidates=rejected_candidates,
             intent_kind=intent_kind,
@@ -367,6 +369,7 @@ def _run(state: PipelineState, tier: str) -> dict:
     style: StyleDirective = gen_cfg["style"]
     gesture_tag = state.get("gesture_tag")
     air_written_text = state.get("air_written_text")
+    resolved_intent = state.get("resolved_intent")
     turnaround_triggered = state.get("turnaround_triggered", False)
     rejected_response: str | None = None
     if turnaround_triggered:
@@ -409,6 +412,7 @@ def _run(state: PipelineState, tier: str) -> dict:
             gen_cfg,
             gesture_tag=gesture_tag,
             air_written_text=air_written_text,
+            resolved_intent=resolved_intent,
             rejected_response=rejected_response,
             rejected_candidates=rejected_candidates,
             intent_kind=intent_kind,
@@ -502,6 +506,7 @@ def _run(state: PipelineState, tier: str) -> dict:
             gen_cfg,
             gesture_tag=gesture_tag,
             air_written_text=air_written_text,
+            resolved_intent=resolved_intent,
             rejected_response=rejected_response,
             intent_kind=intent_kind,
             affect=affect,
@@ -541,6 +546,7 @@ def _build_messages(
     gen_cfg: dict,
     gesture_tag: str | None = None,
     air_written_text: str | None = None,
+    resolved_intent: dict | None = None,
     rejected_response: str | None = None,
     rejected_candidates: list[str] | None = None,
     intent_kind: str = "memory",
@@ -560,6 +566,7 @@ def _build_messages(
         gesture_tag,
         air_written_text,
         profile["name"],
+        resolved_intent=resolved_intent,
         rejected_response=rejected_response,
         rejected_candidates=rejected_candidates,
         intent_kind=intent_kind,
@@ -611,6 +618,69 @@ Answering rules:
 --- end character sheet ---"""


+def _safe_user_text(s: str) -> str:
+    # voice_text / air_text arrive from untrusted channels (Web Speech,
+    # air-writing DTW). They're f-stringed into LLM messages wrapped in
+    # double-quotes — a transcript containing `"` or newlines would break out
+    # of the quoted region and could inject instructions. Strip those and cap
+    # length. Same pattern as `safe_rejected` for `rejected_response`.
+    return s.replace('"', "'").replace("\n", " ").replace("\r", " ")[:200]
+
+
+def _format_multimodal_intent(
+    resolved: dict | None, air_written_text: str | None
+) -> str:
+    # Branch on resolved_intent.source so the model sees voice⇄air-writing
+    # disagreements explicitly instead of getting a single text without context.
+    if resolved:
+        source = resolved.get("source") or "none"
+        voice_t = _safe_user_text((resolved.get("voice_text") or "").strip())
+        air_t = _safe_user_text((resolved.get("air_text") or "").strip())
+        text = _safe_user_text((resolved.get("text") or "").strip())
+
+        if source == "voice_only" and voice_t:
+            return (
+                f'\nThe user spoke aloud: "{voice_t}". '
+                "Treat this as a supplement to the partner's question — "
+                "a hint or clarification about what they want."
+            )
+        if source == "air_only" and air_t:
+            return (
+                f'\nThe user air-wrote: "{air_t}". '
+                "If this looks like a name, noun, or short phrase, "
+                "incorporate it verbatim into your response; "
+                "otherwise use it as a hint about what they're trying to say."
+            )
+        if source == "agree" and text:
+            return (
+                f'\nThe user spoke and air-wrote the same thing: "{text}". '
+                "This is a strong signal — lean into it when shaping your reply."
+            )
+        if source == "conflict_air" and air_t:
+            return (
+                f'\nThe user spoke "{voice_t}" but also air-wrote "{air_t}". '
+                "The air-written token is a canonical AAC signal "
+                "(help/stop/water/done/more) — prioritise it over the spoken "
+                "words, which may have been misheard."
+            )
+        if source == "conflict_voice" and voice_t:
+            return (
+                f'\nThe user spoke "{voice_t}" but air-wrote "{air_t}" — '
+                "these don't match. The spoken form is richer; treat it as "
+                "the real intent and gently acknowledge the air-writing "
+                "may have been a mis-stroke."
+            )
+
+    if air_written_text:
+        return (
+            f'\nThe user air-wrote: "{_safe_user_text(air_written_text)}". '
+            "If this looks like a name, noun, or short phrase, "
+            "incorporate it verbatim into your response; "
+            "otherwise use it as a hint about what they're trying to say."
+        )
+    return ""
+
+
 def _build_user(
     chunks: list[dict],
     history: list[dict],
@@ -621,6 +691,7 @@ def _build_user(
     air_written_text: str | None,
     persona_name: str,
     *,
+    resolved_intent: dict | None = None,
     rejected_response: str | None = None,
     rejected_candidates: list[str] | None = None,
     intent_kind: str = "memory",
@@ -665,14 +736,7 @@ def _build_user(
     # Gesture opener wins over affect opener — a deliberate thumbs-up is a stronger signal than inferred affect.
     merged_opener = directive["opener_hint"]

-    air_writing_block =
-    if air_written_text:
-        air_writing_block = (
-            f'\nThe user air-wrote: "{air_written_text}". '
-            "If this looks like a name, noun, or short phrase, "
-            "incorporate it verbatim into your response; "
-            "otherwise use it as a hint about what they're trying to say."
-        )
+    air_writing_block = _format_multimodal_intent(resolved_intent, air_written_text)

     persona_mod = gen_cfg.get("persona_mod", "baseline")
     persona_instruction_line = (
backend/pipeline/state.py
CHANGED
@@ -95,6 +95,10 @@ class PipelineState(TypedDict):
     gaze_bucket: str | None # bucket hinted by gaze fixation
     air_written_text: str | None # concatenated air-written chars
     head_signal: str | None # "HEAD_SHAKE" | "HEAD_NOD_DISSATISFIED"
+    voice_text: str | None # raw Web Speech transcript, pre-resolution
+    # Resolved voice⇄air-writing intent. Keys: text, source, voice_text, air_text.
+    # source ∈ voice_only | air_only | agree | conflict_air | conflict_voice.
+    resolved_intent: dict[str, Any] | None
     turnaround_triggered: bool # true when re-planned from dissatisfaction signal

     # ── L2: Intent decomposition outputs ─────────────────────────────────────
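`resolved_intent` is typed as `dict[str, Any]`, and the `source` vocabulary exists only in comments. The state comment lists five values, while the `ResolvedIntent` comment in `backend/api/main.py` adds a sixth, `none`. The sketch below is a hypothetical tightening using `typing.Literal` and `TypedDict` that makes all six values checkable. The same `Literal` could also replace `source: str` on the Pydantic model, so unknown values would be rejected at the API boundary.

```python
from typing import Literal, Optional, TypedDict, get_args

# Hypothetical stricter typing; the commit uses dict[str, Any] plus comments.
IntentSource = Literal[
    "voice_only", "air_only", "agree", "conflict_air", "conflict_voice", "none"
]


class ResolvedIntentDict(TypedDict):
    text: str
    source: IntentSource
    voice_text: Optional[str]
    air_text: Optional[str]


def is_valid_source(value: object) -> bool:
    """Runtime check against the Literal's members."""
    return value in get_args(IntentSource)


example = ResolvedIntentDict(
    text="stop", source="conflict_air", voice_text="I need the remote", air_text="stop"
)
```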
frontend/src/App.css
CHANGED
@@ -525,6 +525,30 @@ input[type="text"]:hover {
   color: #ffffff !important;
 }

+.mic-btn {
+  background: transparent !important;
+  color: var(--accent) !important;
+  border: 1px solid var(--accent) !important;
+}
+
+.mic-btn.listening {
+  background: var(--accent) !important;
+  color: #ffffff !important;
+  animation: mic-pulse 1.1s ease-in-out infinite;
+}
+
+@keyframes mic-pulse {
+  0%, 100% { opacity: 1; }
+  50% { opacity: 0.6; }
+}
+
+.voice-status {
+  padding: 4px 12px;
+  font-size: 12px;
+  color: var(--text-muted);
+  font-family: var(--sans);
+}
+
 .eval-panel {
   margin-top: 10px;
   border-top: 1px solid var(--border);
frontend/src/components/ChatPanel.tsx
CHANGED
|
@@ -14,6 +14,9 @@ import {
|
|
| 14 |
streamRegenerate,
|
| 15 |
} from "../lib/api";
|
| 16 |
import { EvalPanel } from "./EvalPanel";
|
|
|
|
|
|
|
|
|
|
| 17 |
|
| 18 |
const STRATEGY_LABELS: Record<string, string> = {
|
| 19 |
broad: "broad — all memories",
|
|
@@ -135,6 +138,10 @@ export function ChatPanel({
|
|
| 135 |
const [turnaroundLoading, setTurnaroundLoading] = useState(false);
|
| 136 |
const [regenerateLoading, setRegenerateLoading] = useState(false);
|
| 137 |
const { queueToken, flushNow } = useTokenBatcher(setMessages);
|
|
|
|
|
|
|
|
|
|
|
|
|
| 138 |
const bottomRef = useRef<HTMLDivElement>(null);
|
| 139 |
const lastResponseTsRef = useRef<number>(0);
|
| 140 |
const lastTurnIdRef = useRef<number | null>(null);
|
|
@@ -156,6 +163,8 @@ export function ChatPanel({
|
|
| 156 |
lastResponseTsRef.current = 0;
|
| 157 |
evalPollAbortsRef.current.forEach((ac) => ac.abort());
|
| 158 |
evalPollAbortsRef.current.clear();
|
|
|
|
|
|
|
| 159 |
}, [userId]);
|
| 160 |
|
| 161 |
useEffect(() => {
|
|
@@ -432,6 +441,8 @@ export function ChatPanel({
|
|
| 432 |
setLoading(true);
|
| 433 |
|
| 434 |
const airText = sensing.airWrittenText || null;
|
|
|
|
|
|
|
| 435 |
|
| 436 |
// Push the partner bubble, and a placeholder AAC message we'll fill in
|
| 437 |
// progressively. We need the placeholder's index to target updates — use
|
|
@@ -470,6 +481,8 @@ export function ChatPanel({
|
|
| 470 |
gaze_bucket: sensing.gazeBucket,
|
| 471 |
air_written_text: airText,
|
| 472 |
head_signal: sensing.headSignal,
|
|
|
|
|
|
|
| 473 |
},
|
| 474 |
(evt) => {
|
| 475 |
if (evt.type === "token") {
|
|
@@ -531,6 +544,10 @@ export function ChatPanel({
|
|
| 531 |
}));
|
| 532 |
} finally {
|
| 533 |
if (airText) onAirTextConsumed();
|
|
|
|
|
|
|
|
|
|
|
|
|
| 534 |
setLoading(false);
|
| 535 |
}
|
| 536 |
}
|
|
@@ -568,6 +585,31 @@ export function ChatPanel({
|
|
| 568 |
[messages, setMessages, userId]
|
| 569 |
);
|
| 570 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 571 |
return (
|
| 572 |
<div className="chat-panel">
|
| 573 |
<div className="chat-header">
|
|
@@ -683,6 +725,11 @@ export function ChatPanel({
|
|
| 683 |
)}
|
| 684 |
<div ref={bottomRef} />
|
| 685 |
</div>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 686 |
<div className="chat-input-row">
|
| 687 |
<input
|
| 688 |
type="text"
|
|
@@ -695,6 +742,26 @@ export function ChatPanel({
|
|
| 695 |
<button onClick={handleSend} disabled={!userId || loading || !backendReady || !input.trim()}>
|
| 696 |
Send
|
| 697 |
</button>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 698 |
</div>
|
| 699 |
</div>
|
| 700 |
);
|
|
|
|
...
streamRegenerate,
} from "../lib/api";
import { EvalPanel } from "./EvalPanel";
+import { useVoice } from "../hooks/useVoice";
+import { isVoiceCapable } from "../lib/voiceEligibility";
+import { resolveIntent } from "../lib/resolveIntent";

const STRATEGY_LABELS: Record<string, string> = {
broad: "broad — all memories",
...
const [turnaroundLoading, setTurnaroundLoading] = useState(false);
const [regenerateLoading, setRegenerateLoading] = useState(false);
const { queueToken, flushNow } = useTokenBatcher(setMessages);
+const [voiceText, setVoiceText] = useState<string | null>(null);
+const [voiceNote, setVoiceNote] = useState<string | null>(null);
+const voice = useVoice();
+const micAvailable = isVoiceCapable(userId) && voice.supported;
const bottomRef = useRef<HTMLDivElement>(null);
const lastResponseTsRef = useRef<number>(0);
const lastTurnIdRef = useRef<number | null>(null);
...
lastResponseTsRef.current = 0;
evalPollAbortsRef.current.forEach((ac) => ac.abort());
evalPollAbortsRef.current.clear();
+setVoiceText(null);
+setVoiceNote(null);
}, [userId]);

useEffect(() => {
...
setLoading(true);

const airText = sensing.airWrittenText || null;
+const vText = voiceText;
+const resolved = resolveIntent(vText, airText);

// Push the partner bubble, and a placeholder AAC message we'll fill in
// progressively. We need the placeholder's index to target updates — use
...
gaze_bucket: sensing.gazeBucket,
air_written_text: airText,
head_signal: sensing.headSignal,
+voice_text: vText,
+resolved_intent: resolved.source === "none" ? null : resolved,
},
(evt) => {
if (evt.type === "token") {
...
}));
} finally {
if (airText) onAirTextConsumed();
+// Clear voice state unconditionally — a failed send shouldn't silently
+// re-attach a stale transcript to the next turn. User can re-tap mic.
+setVoiceText(null);
+setVoiceNote(null);
setLoading(false);
}
}
...
[messages, setMessages, userId]
);

+const handleMic = useCallback(async () => {
+if (!micAvailable || voice.listening) return;
+setVoiceNote("Listening...");
+try {
+const cap = await voice.capture();
+if (cap.transcript) {
+setVoiceText(cap.transcript);
+setVoiceNote(`Heard: "${cap.transcript}"`);
+} else {
+setVoiceNote("No speech detected.");
+}
+} catch (e) {
+setVoiceNote(
+`Mic error: ${e instanceof Error ? e.message : "failed"}`
+);
+}
+}, [micAvailable, voice]);
+
+const canTurnaround =
+!!userId &&
+backendReady &&
+!loading &&
+!turnaroundLoading &&
+lastTurnIdRef.current !== null;
+
return (
<div className="chat-panel">
<div className="chat-header">
...
)}
<div ref={bottomRef} />
</div>
+{micAvailable && voiceNote && (
+<div className="voice-status" aria-live="polite">
+{voiceNote}
+</div>
+)}
<div className="chat-input-row">
<input
type="text"
...
<button onClick={handleSend} disabled={!userId || loading || !backendReady || !input.trim()}>
Send
</button>
+{micAvailable && (
+<button
+type="button"
+className={`mic-btn${voice.listening ? " listening" : ""}`}
+onClick={handleMic}
+disabled={!backendReady || loading || voice.listening}
+title="Capture a short voice utterance — resolved against air-writing before sending"
+>
+{voice.listening ? "🎤 Listening…" : "🎤 Speak"}
+</button>
+)}
+<button
+type="button"
+className="turnaround-btn"
+onClick={() => handleTurnaround("manual")}
+disabled={!canTurnaround}
+title="Re-plan the last response (also triggered by a head shake / sharp nod)"
+>
+↻ Not quite right
+</button>
</div>
</div>
);
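The `finally` block above clears the voice transcript whether the send succeeds or fails. A minimal sketch of the same guarantee as a reusable one-shot holder follows; `OneShotSlot` is a hypothetical helper, not part of this commit, and it is synchronous for brevity, whereas the panel's send path streams tokens asynchronously.

```typescript
// Hypothetical helper, not part of the project: holds a value that is attached
// to exactly one send attempt and is always cleared afterwards.
class OneShotSlot<T> {
  private value: T | null = null;

  set(v: T): void {
    this.value = v;
  }

  peek(): T | null {
    return this.value;
  }

  // Hands the current value (possibly null) to `send`, then clears it in
  // `finally`, so a thrown error cannot leave a stale value for the next call.
  consume<R>(send: (v: T | null) => R): R {
    const v = this.value;
    try {
      return send(v);
    } finally {
      this.value = null;
    }
  }
}
```

Mapped onto ChatPanel, `set` plays the role of `setVoiceText(cap.transcript)` in `handleMic`, and `consume` the send path that snapshots `voiceText` into `vText` and clears it in `finally`.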
frontend/src/hooks/useVoice.ts
ADDED
@@ -0,0 +1,170 @@
+import { useCallback, useEffect, useRef, useState } from "react";
+
+// Thin wrapper around the Web Speech API. Chrome, Edge and Safari expose
+// `webkitSpeechRecognition`; Firefox doesn't, so `supported` stays false there.
+
+type SRCtor = new () => SpeechRecognitionLike;
+
+interface SpeechRecognitionLike {
+  lang: string;
+  continuous: boolean;
+  interimResults: boolean;
+  maxAlternatives: number;
+  onresult: ((e: SpeechRecognitionEventLike) => void) | null;
+  onerror: ((e: { error: string }) => void) | null;
+  onend: (() => void) | null;
+  start: () => void;
+  stop: () => void;
+  abort: () => void;
+}
+
+interface SpeechRecognitionEventLike {
+  results: {
+    length: number;
+    [index: number]: {
+      isFinal: boolean;
+      length: number;
+      [index: number]: { transcript: string; confidence: number };
+    };
+  };
+}
+
+function getRecognitionCtor(): SRCtor | null {
+  if (typeof window === "undefined") return null;
+  const w = window as unknown as {
+    SpeechRecognition?: SRCtor;
+    webkitSpeechRecognition?: SRCtor;
+  };
+  return w.SpeechRecognition ?? w.webkitSpeechRecognition ?? null;
+}
+
+export interface VoiceCapture {
+  transcript: string;
+  confidence: number;
+}
+
+const WINDOW_MS = 3500;
+
+export function useVoice() {
+  const [supported] = useState(() => getRecognitionCtor() !== null);
+  const [listening, setListening] = useState(false);
+  const [error, setError] = useState<string | null>(null);
+  const recRef = useRef<SpeechRecognitionLike | null>(null);
+  const resolveRef = useRef<((v: VoiceCapture) => void) | null>(null);
+  const rejectRef = useRef<((err: Error) => void) | null>(null);
+  const bestRef = useRef<VoiceCapture>({ transcript: "", confidence: 0 });
+  const timerRef = useRef<number | null>(null);
+
+  const cleanup = useCallback(() => {
+    if (timerRef.current !== null) {
+      window.clearTimeout(timerRef.current);
+      timerRef.current = null;
+    }
+    const rec = recRef.current;
+    if (rec) {
+      try {
+        rec.abort();
+      } catch {
+        // ignore — some browsers throw if already stopped
+      }
+      rec.onresult = null;
+      rec.onerror = null;
+      rec.onend = null;
+    }
+    recRef.current = null;
+    resolveRef.current = null;
+    rejectRef.current = null;
+    setListening(false);
+  }, []);
+
+  // Unmount teardown: reject any in-flight promise so `await voice.capture()`
+  // in a parent component doesn't hang forever when the tree unmounts mid-listen.
+  useEffect(
+    () => () => {
+      const rj = rejectRef.current;
+      cleanup();
+      if (rj) rj(new Error("unmounted"));
+    },
+    [cleanup]
+  );
+
+  const capture = useCallback((): Promise<VoiceCapture> => {
+    const Ctor = getRecognitionCtor();
+    if (!Ctor) {
+      return Promise.reject(new Error("Speech recognition not supported"));
+    }
+    if (recRef.current) {
+      return Promise.reject(new Error("Already listening"));
+    }
+
+    return new Promise<VoiceCapture>((resolve, reject) => {
+      const rec = new Ctor();
+      rec.lang = navigator.language || "en-US";
+      rec.continuous = false;
+      rec.interimResults = false;
+      rec.maxAlternatives = 1;
+
+      bestRef.current = { transcript: "", confidence: 0 };
+      resolveRef.current = resolve;
+      rejectRef.current = reject;
+      recRef.current = rec;
+      setError(null);
+
+      rec.onresult = (e) => {
+        for (let i = 0; i < e.results.length; i++) {
+          const res = e.results[i];
+          if (!res.isFinal) continue;
+          const alt = res[0];
+          if (alt && alt.transcript.trim().length > 0) {
+            if (alt.confidence > bestRef.current.confidence) {
+              bestRef.current = {
+                transcript: alt.transcript.trim(),
+                confidence: alt.confidence,
+              };
+            }
+          }
+        }
+      };
+
+      rec.onerror = (e) => {
+        const msg = e.error || "recognition error";
+        setError(msg);
+        const rj = rejectRef.current;
+        cleanup();
+        if (rj) rj(new Error(msg));
+      };
+
+      rec.onend = () => {
+        const rs = resolveRef.current;
+        const best = bestRef.current;
+        cleanup();
+        if (rs) rs(best);
+      };
+
+      try {
+        rec.start();
+        setListening(true);
+        timerRef.current = window.setTimeout(() => {
+          try {
+            rec.stop();
+          } catch {
+            // onend will still fire
+          }
+        }, WINDOW_MS);
+      } catch (err) {
+        const msg = err instanceof Error ? err.message : "failed to start";
+        setError(msg);
+        cleanup();
+        reject(new Error(msg));
+      }
+    });
+  }, [cleanup]);
+
+  const cancel = useCallback(() => {
+    const rj = rejectRef.current;
+    cleanup();
+    if (rj) rj(new Error("cancelled"));
+  }, [cleanup]);
+
+  return { supported, listening, error, capture, cancel };
+}
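The hook's `onresult` handler keeps the highest-confidence non-empty final alternative. Pulling that selection into a pure function makes it testable without a browser. This sketch uses simplified, hypothetical shapes (plain arrays of alternatives rather than the browser's array-like result objects) and accepts the first non-empty final transcript regardless of its score: with only a strict `confidence > best.confidence` comparison against a starting value of 0, a final result reported with confidence 0 would never be stored. Whether a particular engine ever reports 0 is an assumption to check per browser.

```typescript
// Simplified, hypothetical shapes: each result carries a plain array of
// alternatives instead of the browser's array-like SpeechRecognitionResult.
interface Alternative {
  transcript: string;
  confidence: number;
}

interface FinalOrInterim {
  isFinal: boolean;
  alternatives: Alternative[];
}

interface BestCapture {
  transcript: string;
  confidence: number;
}

// Keeps the highest-confidence non-empty final transcript. The first such
// transcript is accepted even at confidence 0, so an engine that reports 0
// still yields text instead of an empty capture.
function pickBestFinal(results: FinalOrInterim[]): BestCapture {
  let best: BestCapture = { transcript: "", confidence: 0 };
  for (const res of results) {
    if (!res.isFinal) continue; // interim hypotheses are ignored
    const alt = res.alternatives[0];
    if (!alt) continue;
    const text = alt.transcript.trim();
    if (text.length === 0) continue; // whitespace-only results are skipped
    if (best.transcript === "" || alt.confidence > best.confidence) {
      best = { transcript: text, confidence: alt.confidence };
    }
  }
  return best;
}
```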
frontend/src/lib/resolveIntent.ts
ADDED
@@ -0,0 +1,109 @@
+import { DEFAULT_AIR_TEMPLATES } from "./airTemplates";
+
+// Canonical AAC tokens that carry high signal when someone air-writes them —
+// short, action-oriented, and hard to confuse for casual chat. When the
+// voice transcript and the air-written text disagree, these tokens win.
+const AAC_PRIORITY_TOKENS: ReadonlySet<string> = new Set(
+  ["help", "stop", "water", "done", "more"].filter((t) =>
+    DEFAULT_AIR_TEMPLATES.has(t)
+  )
+);
+
+export type ResolvedSource =
+  | "voice_only"
+  | "air_only"
+  | "agree"
+  | "conflict_air"
+  | "conflict_voice"
+  | "none";
+
+export interface ResolvedIntent {
+  text: string;
+  source: ResolvedSource;
+  voice_text: string | null;
+  air_text: string | null;
+}
+
+function normalise(s: string | null | undefined): string {
+  return (s ?? "").trim().toLowerCase();
+}
+
+function tokens(s: string): Set<string> {
+  return new Set(
+    s
+      .toLowerCase()
+      .replace(/[^a-z0-9\s]/g, " ")
+      .split(/\s+/)
+      .filter((w) => w.length > 1)
+  );
+}
+
+function jaccard(a: Set<string>, b: Set<string>): number {
+  if (a.size === 0 || b.size === 0) return 0;
+  let inter = 0;
+  for (const tok of a) if (b.has(tok)) inter++;
+  const union = a.size + b.size - inter;
+  return union === 0 ? 0 : inter / union;
+}
+
+export function resolveIntent(
+  voiceRaw: string | null,
+  airRaw: string | null
+): ResolvedIntent {
+  const voice = normalise(voiceRaw);
+  const air = normalise(airRaw);
+
+  if (!voice && !air) {
+    return { text: "", source: "none", voice_text: null, air_text: null };
+  }
+  if (voice && !air) {
+    return {
+      text: voice,
+      source: "voice_only",
+      voice_text: voice,
+      air_text: null,
+    };
+  }
+  if (!voice && air) {
+    return { text: air, source: "air_only", voice_text: null, air_text: air };
+  }
+
+  // Both present.
+  const voiceTokens = tokens(voice);
+  const airTokens = tokens(air);
+  const overlap = jaccard(voiceTokens, airTokens);
+
+  // Air-text appears as a substring of the voice transcript (or vice versa) —
+  // user probably said the word while also writing it. Treat as agreement.
+  const substringHit =
+    voice.includes(air) || air.includes(voice) || overlap >= 0.5;
+
+  if (substringHit) {
+    // Prefer the longer / richer form (usually voice), but mark source as agree.
+    const winner = voice.length >= air.length ? voice : air;
+    return {
+      text: winner,
+      source: "agree",
+      voice_text: voice,
+      air_text: air,
+    };
+  }
+
+  // Genuine conflict. AAC priority tokens (help/stop/water/done/more) dominate.
+  if (AAC_PRIORITY_TOKENS.has(air)) {
+    return {
+      text: air,
+      source: "conflict_air",
+      voice_text: voice,
+      air_text: air,
+    };
+  }
+
+  // Otherwise voice wins — higher information density.
+  return {
+    text: voice,
+    source: "conflict_voice",
+    voice_text: voice,
+    air_text: air,
+  };
+}
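The agreement check in `resolveIntent` calls `String.prototype.includes` on the raw strings, so containment is not limited to whole words: an air-written "help" agrees with a spoken "I'm not helpless" (and the priority token then never gets a chance to win), and "no" agrees with "I know". A hedged sketch of a token-level variant keeps the same 0.5 Jaccard threshold, length tie-break and priority rule, but only counts whole-word containment. All names here are hypothetical, and the priority set is hard-coded rather than filtered against `DEFAULT_AIR_TEMPLATES`.

```typescript
// Assumption: all five priority tokens exist as air-writing templates.
const PRIORITY_TOKENS: ReadonlySet<string> = new Set(["help", "stop", "water", "done", "more"]);

type SketchSource =
  | "voice_only"
  | "air_only"
  | "agree"
  | "conflict_air"
  | "conflict_voice"
  | "none";

interface SketchResult {
  text: string;
  source: SketchSource;
}

// Lowercase, replace punctuation with spaces, drop one-character tokens.
function toTokenSet(s: string): Set<string> {
  return new Set(
    s
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, " ")
      .split(/\s+/)
      .filter((w) => w.length > 1)
  );
}

function jaccardIndex(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 || b.size === 0) return 0;
  let inter = 0;
  a.forEach((t) => {
    if (b.has(t)) inter++;
  });
  return inter / (a.size + b.size - inter);
}

// True when every token of `small` also appears in `big` (whole words only).
function isTokenSubset(small: Set<string>, big: Set<string>): boolean {
  if (small.size === 0) return false;
  let all = true;
  small.forEach((t) => {
    if (!big.has(t)) all = false;
  });
  return all;
}

function resolveByTokens(voiceRaw: string | null, airRaw: string | null): SketchResult {
  const voice = (voiceRaw ?? "").trim().toLowerCase();
  const air = (airRaw ?? "").trim().toLowerCase();
  if (!voice && !air) return { text: "", source: "none" };
  if (!air) return { text: voice, source: "voice_only" };
  if (!voice) return { text: air, source: "air_only" };

  const v = toTokenSet(voice);
  const a = toTokenSet(air);
  const agree = isTokenSubset(a, v) || isTokenSubset(v, a) || jaccardIndex(v, a) >= 0.5;
  if (agree) {
    // Same tie-break as resolveIntent: keep the longer form.
    return { text: voice.length >= air.length ? voice : air, source: "agree" };
  }
  if (PRIORITY_TOKENS.has(air)) return { text: air, source: "conflict_air" };
  return { text: voice, source: "conflict_voice" };
}
```

With this variant "please stop now" plus an air-written "stop" still counts as agreement, while "I'm not helpless" plus "help" becomes a conflict that the priority token wins.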
frontend/src/lib/voiceEligibility.ts
ADDED
@@ -0,0 +1,18 @@
+// Personas for whom a live-mic button makes sense.
+// Gate reflects each persona's real-world speech access, not the in-universe
+// voice of the character: we hide the mic whenever the modelled access method
+// is non-verbal (locked-in, letterboard, dictation-to-assistant, etc.), even
+// if the character can "speak" in their canon.
+export const VOICE_CAPABLE_PERSONAS: ReadonlySet<string> = new Set([
+  "abed_nadir",
+  "allie_calhoun",
+  "forrest_gump",
+  "gabby_giffords",
+  "michael_j_fox",
+  "raymond_babbitt",
+  "walter_jr_white",
+]);
+
+export function isVoiceCapable(userId: string | null): boolean {
+  return !!userId && VOICE_CAPABLE_PERSONAS.has(userId);
+}
frontend/src/types.ts
CHANGED
@@ -28,6 +28,21 @@ export interface Persona {
  style: string;
}

export interface ChatRequest {
  user_id: string;
  query: string;
@@ -36,6 +51,8 @@ export interface ChatRequest {
  gaze_bucket: MemoryBucket | null;
  air_written_text: string | null;
  head_signal?: HeadSignal | null;
}

export interface TurnaroundRequest {
...
  style: string;
}

+export type ResolvedSource =
+  | "voice_only"
+  | "air_only"
+  | "agree"
+  | "conflict_air"
+  | "conflict_voice"
+  | "none";
+
+export interface ResolvedIntent {
+  text: string;
+  source: ResolvedSource;
+  voice_text: string | null;
+  air_text: string | null;
+}
+
export interface ChatRequest {
  user_id: string;
  query: string;
...
  gaze_bucket: MemoryBucket | null;
  air_written_text: string | null;
  head_signal?: HeadSignal | null;
+  voice_text?: string | null;
+  resolved_intent?: ResolvedIntent | null;
}

export interface TurnaroundRequest {