title: Multimodal AAC Chatbot
emoji: πΈ
colorFrom: pink
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: other
Multimodal AAC Chatbot
A chatbot that speaks as an AAC user, not to them. You pick a persona (fourteen are shipped, anchored in real memoirs and canonical fiction) and the partner talks to them. The bot replies in that person's voice, using their memories, and adjusts what it says based on what the webcam sees: facial expression, hand gestures, where they're looking, and letters they trace in the air.
It's a training-free agentic RAG pipeline: a plain Python function chain with two branching points, torch matmul for retrieval, JSONL for logging. The goal was to keep every piece simple enough to read top-to-bottom in an afternoon.
Live demo: huggingface.co/spaces/ub-aac-chatbot/aac-chatbot. Pick a persona, allow webcam access, and chat. First turn after a cold start adds ~3s for the one-time NLI model load; subsequent turns run at normal speed.
Table of Contents
- What is AAC?
- System Architecture
- Prerequisites
- Setup
- Configuration
- Running the Project
- Hosting
- Project Structure
- Personas
- Team
What is AAC?
AAC (Augmentative and Alternative Communication) covers the tools people use when spoken or written communication is hard for them: cerebral palsy, ALS, autism, stroke recovery, and so on. Usually that's a tablet with a symbol grid, or an eye-tracker, or a switch. The slow part isn't the typing; it's that most devices don't know you. Every conversation starts from scratch.
This project is a small attempt at the other direction: give each user a persona their device already knows, and let the device reply in their voice.
System Architecture
The browser does all the camera work. MediaPipe JS runs inside React, classifies what it sees into small labels (affect, gesture_tag, gaze_bucket, air_written_text), and sends those alongside the partner's text to /chat. The backend never touches pixels.
React (browser)                      Backend (Python)
MediaPipe JS ───┐
Chat UI ────────┼── POST /chat ──► FastAPI ──► run_pipeline()
Webcam ─────────┘                                    │
                  Intent ──► Retrieval ──► Planner ──► Feedback
Five layers, each a tiny file:
| Layer | Module | What it does |
|---|---|---|
| L1 | frontend/src/hooks/useSensing.ts | Watches the webcam. Turns faces/hands/gaze into string labels via MediaPipe blendshapes against a per-user calibrated baseline. Air-writing strokes are rendered to a PNG and recognised by a vision LLM via /ink/recognize. |
| L2 | backend/pipeline/nodes/intent.py | Splits the partner's question on conjunctions and punctuation, then classifies each fragment as PERSONAL, CONTEXTUAL, or OPEN_DOMAIN using cosine similarity against a handful of seed sentences. No LLM call. ~30ms per turn. |
| L3 | backend/pipeline/nodes/retrieval.py | Each sub-intent goes to its own pool. Personal → the user's memory vector store, softly reranked by session-level bucket + type priors (see below). Contextual → persona memory + relevant in-session turns layered on top (so "what did I just ask" still sounds like them). Open-domain → a stub chunk telling the LLM to answer from its own knowledge (web search is deliberately out of scope). Wide cosine pool → MMR rerank against a query-fused-with-recent-history vector → top 3. |
| L4 | backend/pipeline/nodes/planner.py | Builds the prompt, calls the LLM, picks a response. Tone and max_tokens are shaped by the detected affect. |
| L5 | backend/pipeline/nodes/feedback.py | Writes one JSONL row per turn and updates the session-level Bayesian priors over which memory buckets were useful. Skips the update when the guardrail blocked the turn. |
Two places the pipeline branches:
- Frustrated affect → use the fast retrieval path (smaller cosine pool of 8, MMR-rerank to k=2). The user wants an answer, not a thesis.
- Cumulative latency past 3.5s → switch to the smaller fallback model for generation.
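The two branches boil down to a pair of small decisions. A minimal sketch, assuming hypothetical helper names (`pick_retrieval_config`, `pick_llm_tier`) that are not taken from the repo; the pool sizes and the 3.5s threshold come from this README, the rest is illustrative:

```python
from dataclasses import dataclass

FALLBACK_LATENCY_THRESHOLD = 3.5  # seconds, as in the configuration table


@dataclass(frozen=True)
class RetrievalConfig:
    pool_size: int  # how many cosine candidates to fetch
    top_k: int      # how many survive the MMR rerank


def pick_retrieval_config(affect: str) -> RetrievalConfig:
    """Branch 1: a frustrated user gets the smaller, faster retrieval path."""
    if affect == "FRUSTRATED":
        return RetrievalConfig(pool_size=8, top_k=2)
    return RetrievalConfig(pool_size=12, top_k=3)


def pick_llm_tier(elapsed_s: float) -> str:
    """Branch 2: if the turn is already slow, generate with the fallback model."""
    return "fallback" if elapsed_s > FALLBACK_LATENCY_THRESHOLD else "primary"
```

Whether the real comparison is strict (`>`) or inclusive is an assumption here.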
Session priors: two axes of topic tracking
The pipeline keeps two per-session Bayesian distributions that track what the current conversation is actually drawing from, and softly biases retrieval toward both:
- P(bucket) over the five memory buckets: family, medical, hobbies, daily_routine, social.
- P(type) over the three chunk types: narrative, social_post, chat_log. A casual "what's up today?" pulls more chat-log style; a factual question pulls more narrative.
Both axes share one axis-generic core in backend/retrieval/priors.py; the BUCKETS and CHUNK_TYPES label vocabularies live there too. Same three pieces each:
- Soft weighting, not hard filter. In vector_store.retrieve() each candidate's cosine score is adjusted by 0.3 · log P(bucket) + 0.2 · log P(type). Uniform priors add the same constant to every candidate, so they have zero ranking effect at session start. As a topic (or style) accumulates evidence, its label gets a boost relative to the others. Gaze fixation (an explicit user signal) still hard-filters the bucket axis.
- Score-weighted evidence. After each turn, the feedback node accumulates per-label mass across all retrieved personal chunks (cosine-shifted into [0, ∞)). A strong medical×narrative match moves both priors more than a weak social×chat_log one, and mixed turns update every contributing label proportionally.
- Topic-drift decay. Before each update the current distributions are pulled 15% toward uniform. Stale mass decays: if the conversation pivots from "medical" to "hobbies", the old medical prior relaxes within a handful of turns instead of dominating the rest of the session. Guardrail-blocked turns skip both updates entirely.
The per-turn JSONL log includes both bucket_priors_after and type_priors_after, so you can trace how either distribution evolved with a one-liner DuckDB query over logs/turns.jsonl.
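The soft weighting, evidence update, and decay can be sketched for the bucket axis as follows. This is an illustration under stated assumptions, not the code in backend/retrieval/priors.py: the function names are hypothetical, and how evidence mass is scaled against the prior before renormalising is a guess. The type axis would work the same way with weight 0.2.

```python
import math

BUCKETS = ["family", "medical", "hobbies", "daily_routine", "social"]
DECAY = 0.15          # pull toward uniform before each update
BUCKET_WEIGHT = 0.3   # weight on log P(bucket) in the score adjustment


def uniform(labels):
    return {label: 1.0 / len(labels) for label in labels}


def decay_toward_uniform(prior, rate=DECAY):
    """Topic-drift decay: mix the prior with the uniform distribution."""
    u = 1.0 / len(prior)
    return {label: (1 - rate) * p + rate * u for label, p in prior.items()}


def update_prior(prior, evidence):
    """Decay, add score-weighted evidence mass, renormalise.

    `evidence` maps label -> non-negative mass from this turn's chunks.
    Adding mass directly to the decayed prior is an assumption of this sketch.
    """
    decayed = decay_toward_uniform(prior)
    raw = {label: decayed[label] + evidence.get(label, 0.0) for label in decayed}
    total = sum(raw.values())
    return {label: value / total for label, value in raw.items()}


def adjusted_score(cosine, bucket, prior, weight=BUCKET_WEIGHT):
    """Soft weighting: cosine score plus a weighted log-prior term."""
    return cosine + weight * math.log(prior[bucket])
```

With a uniform prior every candidate receives the same offset, so the ranking is unchanged; after a turn with strong medical evidence, medical chunks with the same cosine score rank above the rest.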
Per-turn eval pills in the UI
Every AAC-user bubble renders its eval scores inline. Up to seven pills, depending on what the turn produced:
- SLO latency badge: t_total against the configured target (default 6s). Green on pass, red on miss.
- grounded: sentence-level NLI faithfulness against retrieved memories. Renders a placeholder when there's no retrieved evidence to check (e.g. a PRESENT_STATE turn that skipped retrieval).
- relevant: BGE cosine similarity between query and response embeddings. Catches the "perfectly grounded but off-topic" failure mode that groundedness can't see.
- affect / gesture / gaze: multimodal alignment, i.e. sentiment match against the detected affect, opener-pattern match against the detected gesture, fraction of retrieved chunks from the gazed-at bucket.
- diversity: mean pairwise cosine distance across the candidate slate (only shown when ≥2 candidates). Low values flag the "aloha problem": three paraphrases of the same answer.
Pills go green / grey / red on 0.75 / 0.4 thresholds. Hover any pill for a tooltip with the actual math from this turn (e.g. "2/2 sentences had NLI entailment prob ≥ 0.50", "3/3 retrieved chunks were from the family bucket"), powered by the explain block each scorer returns alongside its score. Authenticity stars sit on the right; clicking one fires POST /feedback/rating → logs/ratings.jsonl. All pill values come from eval_scores on the /chat response, computed in a BackgroundTask after the response returns and persisted to logs/evals.jsonl.
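To make the tooltip math concrete, here is a rough sketch of how a grounded score and its pill colour could be derived from per-sentence entailment probabilities. The function names are hypothetical, and treating the score as "fraction of supported sentences" with inclusive thresholds is an assumption based on the tooltip wording, not a copy of faithfulness.py.

```python
FAITHFULNESS_THRESHOLD = 0.5   # per-sentence entailment probability
GREEN_AT, GREY_AT = 0.75, 0.4  # pill colour thresholds


def grounded_score(sentence_probs, threshold=FAITHFULNESS_THRESHOLD):
    """Fraction of sentences whose entailment probability meets the threshold.

    Returns None when there is nothing to score (no evidence to check).
    """
    if not sentence_probs:
        return None
    supported = sum(1 for p in sentence_probs if p >= threshold)
    return supported / len(sentence_probs)


def pill_colour(score):
    """Map a 0..1 score to a pill colour; None means no pill value."""
    if score is None:
        return "none"
    if score >= GREEN_AT:
        return "green"
    if score >= GREY_AT:
        return "grey"
    return "red"
```

For example, a two-sentence reply with entailment probabilities 0.9 and 0.6 scores 2/2 and renders green; one supported sentence out of two scores 0.5 and renders grey.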
End-to-end: from partner speaking to response rendered
One diagram, left to right, every step a turn goes through. Follow the arrows.
flowchart LR
subgraph S1["① Partner side (browser)"]
direction TB
IN1[Partner types or speaks a question]
IN2[Webcam frame]
IN1 --> UI[Chat UI]
IN2 --> MP[MediaPipe JS<br/>face + hands + gaze]
MP --> LAB[Classify into labels<br/>affect, gesture_tag,<br/>gaze_bucket, air_written_text]
UI --> REQ
LAB --> REQ[POST /chat<br/>query + labels]
end
REQ ==> S2
subgraph S2["② Backend pipeline (Python)"]
direction TB
HYD[Hydrate PipelineState<br/>session_history, priors, profile] --> INT
INT[Intent node<br/>split query + classify fragments] --> BR1{FRUSTRATED?}
BR1 -- yes --> RFAST[Fast retrieval<br/>pool=8 → MMR k=2]
BR1 -- no --> RFULL[Full retrieval<br/>pool=12 → MMR k=3]
RFAST --> POOL
RFULL --> POOL[Dispatch per sub-intent]
POOL --> PP[PERSONAL<br/>BGE vector store]
POOL --> PC[CONTEXTUAL<br/>personal + BGE over history]
POOL --> PO[OPEN_DOMAIN<br/>stub chunk]
PP --> MERGE[Merge + dedupe]
PC --> MERGE
PO --> MERGE
MERGE --> RR[MMR rerank<br/>query fused with<br/>last-N user turns]
RR --> PLAN[Planner<br/>build prompt with<br/>retrieval blocks + tone tag]
PLAN --> BR2{Total latency<br/>> 3.5s?}
BR2 -- yes --> LLMF[Fallback LLM<br/>Ollama Cloud, smaller]
BR2 -- no --> LLMP[Primary LLM<br/>Ollama Cloud]
LLMF --> GRD[Guardrail check<br/>persona breaks,<br/>unsupported claims]
LLMP --> GRD
GRD --> FB[Feedback node<br/>log turn to JSONL,<br/>bump bucket priors,<br/>append to session history]
end
FB ==> S3
subgraph S3["③ Back to partner"]
direction TB
RESP[Response in persona's voice<br/>+ latency breakdown<br/>+ eval scores]
RESP --> RENDER[Chat UI renders it]
end
A concrete example. Partner says "how are you, and what's the capital of France?" while the webcam reads a relaxed face:
- Browser sends {query, affect: NEUTRAL, gesture_tag: null, …}.
- Intent node splits on `,` and `and` → two fragments. Classifier tags them PERSONAL and OPEN_DOMAIN.
- Affect isn't FRUSTRATED, so full retrieval runs.
- Dispatcher hits the persona store for fragment one, emits the open-domain stub for fragment two, merges both.
- Planner drops the two chunks into separate prompt blocks and calls the primary LLM.
- Guardrail passes, feedback writes the row, and the response, in Mia's voice, comes back through the same /chat response.
Total wall time is normally under 6 seconds end-to-end; the slow part is the LLM call, not anything you wrote.
What a single turn actually looks like
flowchart TD
A[Partner types or speaks] --> B[React captures query<br/>+ webcam labels]
B --> C[POST /chat]
C --> D[Intent node<br/>split + classify]
D --> E{Any FRUSTRATED<br/>affect signal?}
E -- yes --> F[Fast retrieval<br/>pool=8 → MMR k=2]
E -- no --> G[Full retrieval<br/>pool=12 → MMR k=3]
F --> H{Cumulative<br/>latency > 3.5s?}
G --> H
H -- yes --> I[Fallback LLM<br/>smaller, faster]
H -- no --> J[Primary LLM]
I --> K[Guardrail check]
J --> K
K --> L[Feedback node<br/>JSONL log + priors]
L --> M[Response in persona's voice]
How sub-intents fan out
This is the part that took a few iterations to get right. Each partner query can be multiple questions stitched together with "and" / "but" / punctuation. Each fragment gets classified separately and sent to its own retrieval pool.
flowchart LR
Q["how are you,<br/>and what's the<br/>capital of France?"] --> S[Split on conjunctions<br/>and punctuation]
S --> F1[fragment:<br/>how are you]
S --> F2[fragment:<br/>capital of France]
F1 --> CL[BGE zero-shot<br/>cosine vs exemplars]
F2 --> CL
CL --> P[PERSONAL<br/>→ persona memory vectors]
CL --> CX[CONTEXTUAL<br/>→ persona memory +<br/>relevant session history]
CL --> OD[OPEN_DOMAIN<br/>→ stub, LLM answers<br/>from own knowledge]
P --> MERGE[Merge, dedupe,<br/>hand to planner]
CX --> MERGE
OD --> MERGE
The classifier is just cosine similarity against 5 seed sentences per class: no LLM, ~30ms per turn. The old version called an LLM and retried up to 3× on JSON errors; on a bad day that was 100+ seconds of dead time.
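The split-then-classify step can be sketched in a few lines. This is a toy version under loud assumptions: the seed sentences are invented for illustration (the real exemplars are not listed here), and a bag-of-words vector stands in for the BGE sentence embedding so the example runs without any model download.

```python
import math
import re
from collections import Counter

# Hypothetical seed sentences; the real pipeline uses 5 per class.
SEEDS = {
    "PERSONAL": ["how are you", "tell me about your family", "what do you like to do"],
    "CONTEXTUAL": ["what did i just ask", "what were we talking about"],
    "OPEN_DOMAIN": ["what is the capital of france", "who invented the telephone"],
}


def embed(text):
    """Toy bag-of-words vector standing in for the BGE embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_query(query):
    """Split on punctuation and on the conjunctions 'and' / 'but'."""
    parts = re.split(r"[,;.?!]|\b(?:and|but)\b", query, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def classify(fragment):
    """Label a fragment with the class of its most similar seed sentence."""
    vec = embed(fragment)
    return max(SEEDS, key=lambda label: max(cosine(vec, embed(s)) for s in SEEDS[label]))


def route(query):
    return [(frag, classify(frag)) for frag in split_query(query)]
```

Calling `route("how are you, and what's the capital of France?")` yields two fragments, the first tagged PERSONAL and the second OPEN_DOMAIN, which matches the worked example earlier in this README.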
State that flows between nodes
Every node takes a PipelineState dict and returns a partial update. Nothing is global.
flowchart LR
subgraph "set at turn start"
A[user_id, persona_profile,<br/>session_history, turn_id]
B[affect, gesture_tag,<br/>gaze_bucket, air_written_text]
C[raw_query]
end
subgraph "filled in by the pipeline"
D[intent_route,<br/>generation_config]
E[retrieved_chunks,<br/>retrieval_mode_used]
F[candidates,<br/>selected_response,<br/>llm_tier_used]
G[latency_log,<br/>run_id,<br/>guardrail_passed]
end
A --> D
B --> D
C --> D
D --> E
B --> E
D --> F
E --> F
F --> G
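The "takes a PipelineState, returns a partial update" contract is easy to show in miniature. A sketch only: the field subset is trimmed from the diagram, the node bodies are placeholders, and `run_chain` is a hypothetical stand-in for run_pipeline().

```python
from typing import Callable, List, TypedDict


class PipelineState(TypedDict, total=False):
    # Only a few of the fields from the diagram, for brevity.
    raw_query: str
    affect: str
    intent_route: List[str]
    retrieved_chunks: List[str]
    selected_response: str


Node = Callable[[PipelineState], PipelineState]


def intent_node(state: PipelineState) -> PipelineState:
    # Placeholder logic: the real node splits and classifies the query.
    return {"intent_route": ["PERSONAL"]}


def retrieval_node(state: PipelineState) -> PipelineState:
    # Placeholder logic: one fake chunk per routed sub-intent.
    return {"retrieved_chunks": [f"chunk for {label}" for label in state["intent_route"]]}


def run_chain(state: PipelineState, nodes: List[Node]) -> PipelineState:
    """Apply each node in order, merging its partial update into a copy of the state."""
    current: PipelineState = dict(state)  # type: ignore[assignment]
    for node in nodes:
        current.update(node(current))
    return current
```

Because each node only returns the keys it owns and the chain works on a copy, the caller's initial dict is never mutated, which is what "nothing is global" buys you when debugging.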
Prerequisites
- Python 3.10+ (we use conda; 3.12 is what the env ships with)
- Node.js 22+ and pnpm
- An Ollama Cloud account. Generation hits cloud models, so you don't need a local Ollama daemon running.
- A webcam if you want to play with the full stack. The CLI works without one.
Setup
git clone https://github.com/akashkolte/multimodal_aac_chatbot.git
cd multimodal_aac_chatbot
bash setup.sh
setup.sh takes care of everything on the first run: creates the aac-chatbot conda env, installs Python and frontend deps, copies .env.example → .env for you to fill in, and builds the per-persona vector indexes under data/vector_store/. The first build downloads the BGE-small embedder (~130MB), so expect a short wait.
If you edit a persona later, rebuild the indexes: python -m backend.retrieval.vector_store.
Configuration
Everything is a Pydantic setting in backend/config/settings.py with a .env override. The knobs you'll actually touch:
| Variable | Default | What it does |
|---|---|---|
| ACTIVE_LLM_TIER | primary | Which tier to start on: primary or fallback. The pipeline switches automatically if a turn is slow. |
| PRIMARY_MODEL | gemma4:31b-cloud | Ollama Cloud model for the primary tier. |
| FALLBACK_MODEL | gemma4:31b-cloud | Smaller/faster model for the fallback tier. Point this at whatever smaller cloud model you have access to. |
| PRIMARY_BASE_URL | http://localhost:11434/v1 | OpenAI-compatible endpoint. Defaults to the local Ollama proxy. |
| FALLBACK_LATENCY_THRESHOLD | 3.5 | If intent+retrieval already took this many seconds, skip the primary tier. |
| RERANK_ENABLED | true | Kill-switch for the MMR reranker. When off, retrieval truncates the cosine top-k directly. |
| RERANK_LAMBDA | 0.7 | MMR balance: 1.0 = pure cosine relevance, lower = more diversity. Drop to 0.5 if results look repetitive. |
| RERANK_QUERY_WEIGHT | 0.7 | Weight on the current turn vs the mean of recent user turns when building the rerank query. Lower if follow-ups under-weight prior context. |
| LOGS_DIR | logs | Where the per-turn JSONL goes. |
| SLO_TARGET_S | 6.0 | Latency SLO used by the efficiency eval. |
| EVALS_ENABLED | true | Toggle off to skip background eval scoring. |
| NLI_MODEL | cross-encoder/nli-deberta-v3-small | NLI model used for the groundedness/hallucination scorer. |
| FAITHFULNESS_THRESHOLD | 0.5 | Per-sentence entailment probability needed to count as grounded. |
Running the Project
Full stack
bash run.sh
Starts FastAPI on :8000 and the React dev server on :7550. Open http://localhost:7550. This is the mode you want for the webcam + sensing demo.
Pass any backend.main flag to run.sh and it drops the full stack and runs the CLI with those flags instead, which is handy for fast iteration:
bash run.sh --debug # CLI with per-turn state dumps
bash run.sh --user mia_chen --debug # jump straight to Mia
CLI directly
conda activate aac-chatbot
python -m backend.main --debug
The CLI prints the full PipelineState after each turn, which is useful when you want to see what the classifier did or which chunks came back from which pool.
API directly
conda activate aac-chatbot
uvicorn backend.api.main:app --reload
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"user_id": "stephen_hawking", "query": "What do you like to do on weekends?"}'
Offline eval aggregation
After a few turns have been logged, print a per-persona report:
conda activate aac-chatbot
python -m backend.evals.aggregate
Output covers latency quantiles + SLO pass rate, faithfulness (groundedness / hallucination), multimodal alignment, and the distribution of Likert ratings. Reads logs/turns.jsonl, logs/evals.jsonl, and logs/ratings.jsonl.
Hosting
The project ships with a single Dockerfile that builds the React frontend in stage 1 (Node 22 + pnpm) and runs the FastAPI backend in stage 2 (Python 3.12 + torch + sentence-transformers). The backend serves the built frontend/dist/ as static files, so it's one container, one process, one port.
The same image runs identically in two places.
Locally (for development that mirrors production)
docker build -t aac-chatbot .
docker run --rm -p 8000:8000 -e PORT=8000 --env-file .env aac-chatbot
# → http://localhost:8000
The --env-file .env injects your Ollama Cloud key + endpoints (same .env you use for ./run.sh). Conda + ./run.sh is still the fastest dev loop because it hot-reloads; the docker path is for when you want byte-identical-to-production behaviour.
On Hugging Face Spaces (public URL for graders)
The repo doubles as an HF Space: README.md carries the YAML frontmatter HF needs (sdk: docker, app_port: 7860).
- Create a new Space on huggingface.co (Docker SDK, public).
- Add this repo as a remote:
git remote add space https://huggingface.co/spaces/ub-aac-chatbot/aac-chatbot
git push space main
- In the Space's Settings → Variables and secrets, add the LLM-tier secrets (don't commit them):
  - PRIMARY_API_KEY, PRIMARY_BASE_URL, PRIMARY_MODEL
  - FALLBACK_API_KEY, FALLBACK_BASE_URL, FALLBACK_MODEL
  - INK_VISION_API_KEY, INK_VISION_BASE_URL, INK_VISION_MODEL
- The Space rebuilds the Dockerfile on every push. First build takes ~5-8 min (downloads BGE + builds vector indexes for all personas); subsequent builds reuse Docker layer cache and finish in 2-3 min.
The deployed instance won't persist logs/ or data/pick_index/ across container restarts (HF Spaces filesystem is read-only outside /tmp). For the writeup, your local logs are the source of truth; the Space is just a click-around demo for graders.
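Webcam note. getUserMedia requires a secure context: HTTPS, or localhost, which browsers treat as secure even over plain HTTP. HF Spaces serves over HTTPS, so both work. A plain-HTTP LAN IP address doesn't qualify, so don't try to demo from a LAN IP without a tunnel.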
Project Structure
multimodal_aac_chatbot/
├── frontend/                      React + Vite + TypeScript
│   └── src/
│       ├── components/            Chat UI, webcam, sensing status
│       ├── hooks/                 useWebcam, useSensing (MediaPipe JS)
│       └── lib/                   API client, sensing classification, calibration, ink recognizer
│
├── backend/                       Python (conda env: aac-chatbot)
│   ├── main.py                    CLI entry point
│   ├── api/main.py                FastAPI REST API
│   ├── config/settings.py         Pydantic BaseSettings
│   ├── pipeline/
│   │   ├── graph.py               run_pipeline(): plain function chain
│   │   ├── state.py               PipelineState TypedDict
│   │   └── nodes/                 intent, retrieval, planner, feedback
│   ├── sensing/labels.py          GESTURE_DIRECTIVES (sensing runs in browser)
│   ├── retrieval/                 BGE embeddings (torch tensor) + bucket priors
│   ├── generation/llm_client.py   2-tier Ollama Cloud LLM client (primary/fallback)
│   ├── evals/                     faithfulness (NLI), efficiency, multimodal, aggregate CLI
│   └── guardrails/checks.py       Input + output safety checks
│
├── data/
│   ├── users.json                 Persona index
│   ├── memories/                  Per-persona memory JSONs
│   └── vector_store/              vectors.pt + meta.json (gitignored, rebuilt)
├── logs/                          Per-turn JSONL logs (gitignored)
│
├── setup.sh                       One-time setup script
├── run.sh                         Start backend + frontend
├── requirements.txt               Python dependencies
└── .env.example                   Environment variable template
Personas
Fourteen personas: nine anchored in real memoirs, five in canonical fiction. Together they span ALS, Parkinson's, locked-in syndrome, aphasia, Alzheimer's, cerebral palsy, non-verbal and savant autism, intellectual disability, and spinal cord injury. The point isn't to represent any one person; it's to give the model a wide enough range of voices that "sound like Mia" is a harder target than "sound helpful."
| ID | Source | Condition |
|---|---|---|
| stephen_hawking | Real: My Brief History + interviews | ALS (mid-stage) |
| michael_j_fox | Real: four memoirs | Young-onset Parkinson's |
| wendy_mitchell | Real: Somebody I Used to Know + blog | Early-onset Alzheimer's |
| christopher_reeve | Real: Still Me | C4 spinal cord injury |
| christy_brown | Real: My Left Foot | Cerebral palsy (adult) |
| gabby_giffords | Real: Gabby memoir | Aphasia + TBI |
| jason_becker | Real: Not Dead Yet doc | Late-stage ALS |
| jean_dominique_bauby | Real: The Diving Bell and the Butterfly | Locked-in syndrome |
| tito_mukhopadhyay | Real: three+ books | Non-verbal autism |
| abed_nadir | Fictional: Community | Autism (verbal) |
| allie_calhoun | Fictional: The Notebook | Late-stage Alzheimer's |
| forrest_gump | Fictional: Forrest Gump | Intellectual disability |
| walter_jr_white | Fictional: Breaking Bad | Cerebral palsy (teen) |
| raymond_babbitt | Fictional: Rain Man | Savant autism |
Each persona has ~120-210 memory chunks (canon-driven, no filler) across five buckets (family, medical, hobbies, daily_routine, social) and three chunk types (narrative, social_post, chat_log). Somewhere around 2,300 chunks total across the set.
Data provenance is documented. See references.md for the bibliography (memoirs, films, interviews) and the ethics notes on living-persons treatment.
Adding a new persona: drop a JSON file into data/memories/ following the schema of any existing one, then run python data/generate_users.py and python -m backend.retrieval.vector_store.
TODO
From the spec (pages 10-11). Tags: [Core] = must do, [Bonus] = nice to have, [Eval] = for the grade.
Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend just gets the labels (affect, gesture_tag, gaze_bucket). Only backend/sensing/labels.py (GESTURE_DIRECTIVES) lives on the backend.
Dataset
- [Core] Memories carry three chunk types per persona (narrative, social_post, chat_log), each with a bucket label. Type is preserved through the vector-store metadata and feeds the P(type) session prior.
Sensing (frontend)
- [Core] Head-nod / sharp tilt / head-shake = "I don't like that". Different from frustrated affect.
  - frontend HeadPoseTracker decomposes pitch/yaw/roll from MediaPipe's facial transformation matrix; emits HEAD_SHAKE (yaw oscillation), HEAD_NOD (gentle pitch dip + recovery), HEAD_NOD_DISSATISFIED (sharper pitch dip). All angles are measured as deviation from the per-user calibrated neutral pose (see Affect entry below for how calibration works), so a user with a naturally tilted head doesn't read as permanently nodding. Live p/y/r debug readout in the sidebar.
  - dedicated POST /chat/turnaround endpoint reuses cached last-state: one extra LLM call, no full pipeline re-run
  - intent-aware turnaround: PERSONAL re-retrieves excluding the rejected bucket and exact rejected chunk texts (with a turnaround_min_score floor that falls back to the original chunks rather than degrading); PRESENT_STATE flips emotional read or admits uncertainty
  - UI: rejected bubble gets strikethrough + "rephrased" badge, new bubble appended with "↻ turnaround" badge; both stay visible (you can't unsay something to a partner). Manual "↻ Not quite right" button as fallback
  - guards: turnaroundConsumedTurnRef prevents self-retrigger loops; backend turn_id returned in ChatResponse so frontend doesn't desync on persona switch; stale-turn 409
- [Core] Smile / positive affect actually changes wording now. Affect compiles into a StyleDirective (register + prefer/avoid words + exemplar + opener hint) rendered as explicit instructions in the turn-specific user message; see _AFFECT_CONFIG in backend/pipeline/nodes/intent.py and _build_user in backend/pipeline/nodes/planner.py. The persona's own stylistic_preferences (from the memory JSONs) carry the stable baseline in the cached system message; the affect directive is how that baseline shifts per turn. Measured by compute_multimodal_alignment (positive/negative lexicon).
  - Affect is read from MediaPipe FaceLandmarker blendshape scores (mouthSmileLeft, browDownLeft, eyeSquintLeft, jawOpen, browInnerUp, etc.) rather than hand-rolled landmark math. classifyAffect in frontend/src/lib/sensing.ts emits HAPPY / FRUSTRATED / SURPRISED / NEUTRAL from those scores.
  - Per-user calibration window. When the webcam first comes alive, a 5-second overlay records the user's neutral baseline: trimmed mean and stddev for each blendshape, plus neutral gaze direction and head pose. Detection then fires when a signal exceeds the user's own mean by SIGMA_K = 2.0 standard deviations, so a face whose resting smile blendshape sits at 0.4 doesn't permanently read as HAPPY. One global tunable (σ multiplier) replaces the wall of magic-number thresholds the old geometric pipeline carried. Calibrator in sensing.ts, wired through useSensing.ts, surfaced in CalibrationOverlay.tsx. A "Recalibrate" button on the sensing panel re-runs the window any time. Set VITE_CALIBRATION_ENABLED=false in .env to fall back to fixed thresholds for debugging.
- [Core] Gestures come from MediaPipe's pretrained GestureRecognizer rather than hand-rolled landmark geometry. Mapped labels: THUMBS_UP / THUMBS_DOWN / POINTING_UP / CLOSED_FIST / OPEN_PALM / VICTORY / I_LOVE_YOU (see mapGestureLabel in sensing.ts). Each label carries an opener_hint via GESTURE_DIRECTIVES in backend/sensing/labels.py: a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
- [Core] Air-writing uses a vision LLM (Gemini 2.0 Flash via Google AI Studio's OpenAI-compatible endpoint, configurable through INK_VISION_MODEL / INK_VISION_BASE_URL / INK_VISION_API_KEY) instead of the older in-browser DTW template bank. We briefly swapped to gemma4:31b-cloud on Ollama Cloud since gemma4 is multimodal, but image-input on Ollama Cloud's free tier turned out to be unreliable; Gemini Flash is cheaper to obtain (free key from aistudio.google.com/apikey) and consistent. Stroke segmentation lives in AirWriter in sensing.ts: index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200×200 PNG by inkRecognizer.ts and POSTed to /ink/recognize (backend/api/main.py), which asks the model to return the traced character or short word. The recognized text accumulates in sensing.airWrittenText and flows through the pipeline three ways: (1) retrieval picks the word up as an extra PERSONAL sub-intent with a bucket hint (infer_bucket in backend/sensing/bucket_keywords.py), (2) the planner adds an explicit "the user air-wrote X: incorporate verbatim if appropriate" instruction, and (3) the word appears in logs/turns.jsonl for debugging. Set VITE_AIRWRITING_ENABLED=false to disable stroke capture; if INK_VISION_API_KEY is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
- [Bonus] Voice + air-writing conflict resolution. A push-to-talk mic (frontend/src/hooks/useVoice.ts) captures a short Web Speech utterance; frontend/src/lib/resolveIntent.ts merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (help / stop / water / done / more win ties). The resolver emits a {text, source, voice_text, air_text} payload, where source is one of voice_only | air_only | agree | conflict_air | conflict_voice. The backend uses it in backend/pipeline/nodes/intent.py to pick the supplemental sub-intent, and in backend/pipeline/nodes/planner.py to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via VOICE_CAPABLE_PERSONAS in frontend/src/lib/voiceEligibility.ts: only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
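The voice vs air-writing resolver lives in TypeScript, but its logic is easy to sketch in Python. Everything below is an approximation: the 0.5 agreement threshold, the choice of which text to keep on agreement, and the default winner when neither side carries a priority token are all assumptions, not values read from resolveIntent.ts.

```python
import re

PRIORITY_TOKENS = {"help", "stop", "water", "done", "more"}
AGREE_THRESHOLD = 0.5  # assumed cut-off; the real value is not stated here


def tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(voice_text, air_text):
    """Merge voice and air-written input into one payload with a source tag."""
    voice, air = tokens(voice_text), tokens(air_text)
    payload = {"voice_text": voice_text, "air_text": air_text}
    if not air:
        return {**payload, "text": voice_text, "source": "voice_only"}
    if not voice:
        return {**payload, "text": air_text, "source": "air_only"}
    if jaccard(voice, air) >= AGREE_THRESHOLD:
        return {**payload, "text": voice_text, "source": "agree"}
    # Conflict: a channel carrying a priority token wins; otherwise this
    # sketch defaults to the air-written text (an assumption).
    voice_priority = bool(voice & PRIORITY_TOKENS)
    air_priority = bool(air & PRIORITY_TOKENS)
    if voice_priority and not air_priority:
        return {**payload, "text": voice_text, "source": "conflict_voice"}
    return {**payload, "text": air_text, "source": "conflict_air"}
```

Keeping both raw inputs in the payload is what lets the planner acknowledge a conflict in the prompt instead of silently overwriting one channel.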
Intent decomposition
Current state: regex-splits the partner query on conjunctions/punctuation into fragments, then runs each fragment through a BGE zero-shot classifier (cosine vs. seed exemplars per class). No LLM call, no retries. Runs in ~10-30ms per turn. Bucket hints for PERSONAL fragments come from a shared keyword helper in backend/sensing/bucket_keywords.py. Earlier versions used an LLM with Pydantic validation + 3 retries, which cost ~100s per turn on Ollama Cloud when the model emitted bad JSON.
- [Core] Personal / Contextual / Open-domain dispatch to distinct pools (personal → BGE vector store; contextual → persona memory + relevant in-session turns layered on top; open-domain → stub chunk, LLM answers from its own general knowledge; web search is intentionally out of scope).
  - intent node latency: split + BGE zero-shot classifier replaces the LLM router. Parallelising sub-query retrieval is still open.
- [Core] PRESENT_STATE intent class: questions about right-now state ("how are you feeling?", "are you tired?") used to fabricate confident answers from autobiographical memory (wrong by category, not just by wording). Now they skip retrieval entirely and the planner uses an affect-grounded prompt branch with explicit fallback to "I'm not sure" when the read is ambiguous. Margin guard demotes narrow PRESENT_STATE wins to PERSONAL (better to over-retrieve than to silently drop persona memories). Air-written supplements are classified the same way as a normal fragment, so a present-tense supplement on a PRESENT_STATE query no longer flips the route to PERSONAL.
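The margin guard mentioned above can be pictured like this. A hedged sketch: the function name and the 0.05 margin are placeholders chosen for illustration, since the actual value is not given here.

```python
PRESENT_STATE_MARGIN = 0.05  # assumed value; the real margin is not stated here


def apply_margin_guard(scores, margin=PRESENT_STATE_MARGIN):
    """Pick the top-scoring class, demoting a narrow PRESENT_STATE win to PERSONAL.

    `scores` maps class name -> best cosine similarity against that class's
    seed sentences, and must contain at least two classes.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_label == "PRESENT_STATE" and best_score - runner_up_score < margin:
        return "PERSONAL"
    return best_label
```

The asymmetry is deliberate: only PRESENT_STATE is guarded, because a wrong PRESENT_STATE call skips retrieval entirely, while a wrong PERSONAL call merely retrieves more than needed.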
Retrieval
Current state: BGE-small cosine search over per-user torch tensors. Each personal sub-intent fetches a wider pool (12 candidates, 8 on the FRUSTRATED fast path), then MMR reranks against a query vector that's fused with the last 2 user turns; see build_context_vector and mmr_rerank in backend/retrieval/reranker.py. MMR runs across the merged personal + contextual pool so history-derived chunks compete with persona memories. Knobs in backend/config/settings.py: rerank_lambda (relevance vs diversity, default 0.7), rerank_query_weight (current turn vs history, default 0.7), rerank_enabled as kill-switch. Steady-state t_rerank is ~15ms with no history, ~50ms when history is fused.
- [Core] Reranking: MMR with conversation-context query fusion. Wider cosine pool, then diversity-aware reorder against 0.7·current_query + 0.3·mean(last-2-user-turns). Both fast and full paths rerank; OPEN_DOMAIN stub is pinned outside the rerank.
- [Bonus] Session-level priors on two axes, P(bucket) and P(type), with evidence weighting, topic-drift decay, and soft log-weighted reranking applied inside retrieve() before MMR (see the architecture section). Still in-memory; persisting per user across server restarts is a follow-up.
- [Bonus] Latency fallback only switches LLM tier. Add more steps:
  - flip rerank_enabled=False if retrieval+rerank is slow (cheap kill-switch already in place)
  - return a canned response if we blow the budget entirely
  - threshold is 3.5s, spec says 6s: pick one
- [Bonus] Cache encoded user-turn embeddings across the session: build_context_vector re-encodes the same recent turns every turn (~50ms steady-state cost)
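For readers who haven't met MMR, here is a dependency-free sketch of the query fusion and the greedy rerank described in this section. The function names mirror the ones mentioned above, but the signatures and bodies are assumptions written for illustration; the real versions in backend/retrieval/reranker.py operate on torch tensors.

```python
import math


def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_context_vector(query_vec, history_vecs, query_weight=0.7):
    """Fuse the current query with the mean of recent user-turn vectors."""
    if not history_vecs:
        return normalise(query_vec)
    mean = [sum(col) / len(history_vecs) for col in zip(*history_vecs)]
    fused = [query_weight * q + (1 - query_weight) * h for q, h in zip(query_vec, mean)]
    return normalise(fused)


def mmr_rerank(context_vec, candidates, k=3, lam=0.7):
    """Greedy MMR over unit vectors; returns the chosen indices in pick order.

    Each step picks the candidate maximising
    lam * relevance - (1 - lam) * max similarity to anything already picked.
    """
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = dot(context_vec, candidates[i])
            redundancy = max((dot(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this degenerates to plain cosine top-k; lowering it makes a near-duplicate of an already-picked chunk lose to a less similar but still relevant one.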
Generation
- [Core] API returns 3 candidates (plus an optional side-index hit) on /chat; see candidates in backend/api/main.py ChatResponse. Planner fans out three grounding strategies in parallel threads and dedupes identical outputs: broad (all retrieved personal chunks), focused (top chunk only), and serendipitous (random non-top chunks); see _pick_strategy_chunks in backend/pipeline/nodes/planner.py. Turnaround/present-state retries skip the fan-out and regenerate a single response.
- [Core] Frontend picker shows stacked candidate cards with a strategy label under each; click to commit, which strikes the rest, locks the AAC bubble to the chosen text, and fires POST /chat/pick. One-candidate responses render as a normal bubble. See handlePick + .candidate-list in frontend/src/components/ChatPanel.tsx.
- [Bonus] Side-index at data/pick_index/<uid>/ stores (query embedding → picked text, strategy, picked_buckets) after every pick. Two feedback loops into generation: (1) the retrieval node injects the previously-picked text as a source: "prior_pick" chunk rendered in a "you answered like this before" block, so the three LLM candidates all see it and riff on it; (2) retrieval blends cumulative bucket_pick_counts into this turn's bucket_priors at weight 0.3 (transient: doesn't persist across turns), so users who historically pick family memories bias retrieval toward family without overriding the session prior. The raw picked text is also still surfaced as a standalone side_index candidate. See backend/retrieval/pick_index.py, _blend_pick_history_into_priors + _prepend_prior_pick in backend/pipeline/nodes/retrieval.py, and the prior-pick block in _build_user in backend/pipeline/nodes/planner.py.
- LLM temperature bumped from 0.4 → 0.8, then pulled back to 0.7 once chunk-variation became the primary diversity axis. With three different grounding strategies feeding three parallel calls, sampling noise matters less than which memories are in the context window.
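The three grounding strategies and the dedupe step can be sketched as below. This is not the repo's _pick_strategy_chunks: the names, the sample size of 2 for the serendipitous strategy, and the case-insensitive dedupe key are all assumptions made for the example.

```python
import random


def pick_strategy_chunks(chunks, strategy, rng=None):
    """Choose which retrieved chunks ground one candidate.

    `chunks` is assumed to be sorted best-first. The sample size for the
    serendipitous strategy is an assumption of this sketch.
    """
    rng = rng or random.Random()
    if strategy == "broad":
        return list(chunks)
    if strategy == "focused":
        return chunks[:1]
    if strategy == "serendipitous":
        rest = chunks[1:]
        return rng.sample(rest, k=min(2, len(rest))) if rest else chunks[:1]
    raise ValueError(f"unknown strategy: {strategy}")


def dedupe_candidates(candidates):
    """Drop candidates whose text duplicates an earlier one, keeping order."""
    seen, unique = set(), []
    for strategy, text in candidates:
        key = text.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append((strategy, text))
    return unique
```

Varying the context rather than the temperature is the point: each candidate sees a different slice of memory, so the differences between them are about content, not sampling noise.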
Evals
Scoring runs in a FastAPI BackgroundTask after /chat returns, so it never blocks the response. The frontend polls GET /evals/{run_id} to render pills once they're ready. Each scored turn is appended to logs/evals.jsonl, keyed by run_id, so it joins back to logs/turns.jsonl offline. Likert ratings go to logs/ratings.jsonl. Picks go to logs/picks.jsonl.
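Because the logs share run_id, the offline join can be a plain dictionary lookup. The sketch below assumes one JSON object per line and a run_id field on every record; read_jsonl and join_on_run_id are hypothetical helpers, not the functions in aggregate.py.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def join_on_run_id(turns, evals):
    """Attach each turn's eval record, or None if the turn was never scored."""
    evals_by_run = {e["run_id"]: e for e in evals}
    return [
        {**turn, "eval": evals_by_run.get(turn["run_id"])}
        for turn in turns
    ]
```

Typical use would be join_on_run_id(read_jsonl("logs/turns.jsonl"), read_jsonl("logs/evals.jsonl")). Rows whose eval is None are turns that were logged but have no score yet, for example if the server stopped before the background task ran.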
| Metric | What it answers | Where |
|---|---|---|
| Efficiency | SLO pass/fail on t_total, aggregate p50/p95/p99 | efficiency.py, aggregate.py |
| Faithfulness (grounded) | Did the response stick to retrieved memories, or hallucinate? Sentence-level NLI; no_evidence short-circuit when nothing was retrieved | faithfulness.py |
| Relevance (relevant) | Did the response actually address the partner's query? BGE cosine query↔response | relevance.py |
| Multimodal alignment | affect (sentiment lexicon vs target), gesture (opener regex vs detected tag), gaze (matched/total retrieved chunks vs gazed bucket) | multimodal_alignment.py |
| Candidate diversity | Are the picker's candidates actually different, or paraphrases? Mean pairwise cosine distance over the candidate slate | diversity.py |
| Per-candidate breakdown | Each candidate scored for grounded + relevance (not just the selected one); answers "did the picker beat candidate 0?" offline | candidates_eval block in evals/__init__.py |
| Authenticity | Star rating under every assistant bubble → POST /feedback/rating → logs/ratings.jsonl | EvalPanel.tsx, api/main.py |
Performance note. When the turn produces multiple candidates, scoring is fully batched: a single NLI model.predict over all (candidate × sentence × chunk) pairs and a single BGE embed_texts over [query, c1, c2, c3] (the candidate vectors feed both relevance and diversity). The selected candidate's per-candidate score is reused as the top-level pill values rather than re-scored. End result: 1 NLI pass + 1 BGE pass per turn regardless of candidate count.
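To make the "one BGE pass feeds both metrics" point concrete, here is a small sketch that derives relevance and diversity from a single list of vectors laid out as [query, c1, c2, ...]. It uses plain Python lists in place of real BGE embeddings, and relevance_and_diversity is a hypothetical helper, not the function in relevance.py or diversity.py.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance_and_diversity(vectors):
    """vectors = [query, c1, c2, ...] as returned by one embedding call."""
    query, candidates = vectors[0], vectors[1:]
    # Relevance: cosine between the query and each candidate.
    relevance = [cosine(query, c) for c in candidates]
    # Diversity: mean pairwise cosine distance; undefined below two candidates.
    pairs = list(combinations(candidates, 2))
    diversity = (
        sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs) if pairs else None
    )
    return relevance, diversity
```

Returning None for a single candidate mirrors the UI behaviour of hiding the diversity pill when there are fewer than two candidates.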
First-turn caveat: the NLI model (cross-encoder/nli-deberta-v3-small, ~140MB) is lazy-loaded on the first score after a server restart, so turn 1 pays a one-time ~2-3s warmup. Every turn after that adds ~100-300ms for sentence-level scoring.
Offline analysis. python -m backend.evals.aggregate joins all four log files and prints per-persona reports: latency p50/p95/p99 by tier, mean groundedness/hallucination, multimodal alignment coverage, picker behaviour (pick rate, regenerate rate, strategy win rate, "did picker beat cand 0?", diversity floor), and authenticity Likert distribution.
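As a rough picture of the latency part of that report, the sketch below groups t_total by (user_id, llm_tier) and computes p50/p95/p99. It assumes each turn record carries those three fields and uses the nearest-rank percentile definition; aggregate.py may use a different method (for instance interpolation), and latency_report is a hypothetical name.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(turns):
    """Group t_total by (user_id, llm_tier) and summarise p50/p95/p99."""
    groups = defaultdict(list)
    for turn in turns:
        groups[(turn["user_id"], turn["llm_tier"])].append(turn["t_total"])
    return {
        key: {f"p{q}": percentile(vals, q) for q in (50, 95, 99)}
        for key, vals in groups.items()
    }
```

With only a handful of turns per group, p95 and p99 both land on the slowest turn, so the tail numbers only become informative once a persona has a few dozen logged turns.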
- [Eval] Faithfulness: NLI scorer, sentence split, threshold on entailment prob. no_evidence flagged when nothing retrieved
- [Eval] Efficiency: per-turn SLO + aggregate latency (p50/p95/p99) via aggregate.py, grouped by user_id × llm_tier
- [Eval] Multimodal alignment: affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
- [Eval] Authenticity: per-turn stars under each assistant bubble, POST to /feedback/rating, logged with run_id + rater_id
- [Eval] For the live in-class eval: figure out the actual session: who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
- [Eval] Relevance score: BGE cosine similarity between query and response. Originally specced as an NLI call, but a question rarely entails its answer (the on-topic and off-topic NLI scores both pinned near 10⁻⁴), so the embedder we already load for retrieval is the right tool. Fills the gap where a perfectly grounded but off-topic reply scored 100% grounded. See backend/evals/relevance.py.
- [Eval] Candidate diversity: mean pairwise cosine distance among the 3 candidates in a picker round, computed on BGE embeddings (no extra model). Low diversity = picker showing three paraphrases of the same answer (the "aloha" problem), which is a signal that retrieval or temperature needs tuning for that query. See backend/evals/diversity.py.
- [Eval] Picker-aware metrics: report_picker in backend/evals/aggregate.py joins turns.jsonl + picks.jsonl + evals.jsonl and prints: pick rate (% of multi-candidate turns where the user clicked a card), regenerate rate (% of (user, turn_id) pairs that ran the planner more than once), strategy win rate among committed picks, head-to-head "did picker beat candidate 0 on grounded/relevance" using the per-candidate scoring from L453, and diversity coverage (% of turns with mean pairwise cosine distance < 0.10, the "aloha" floor). Run via python -m backend.evals.aggregate.
- [Eval] Score alternate candidates too, not just the selected one. compute_evals now scores groundedness + relevance for every candidate and stamps which one was selected; full breakdown lands in eval_scores.candidates_eval and logs/evals.jsonl, top-level pills still describe the selected response. Unlocks "did the picker actually beat candidate 0?" offline analysis.
- [Eval] UI coverage: EvalPanel now also renders a relevance pill (BGE cosine query↔response) and a candidate-diversity pill (mean pairwise cosine distance, hidden when fewer than 2 candidates). Hallucination rate is conveyed inside the grounded tooltip rather than as its own pill (it's 1 − groundedness, no extra info). SLO margin is in the latency tooltip. See EvalPanel.tsx.
- [Eval] Tooltip math: every pill's title now shows the actual computation, not just the definition. Each scorer returns its raw inputs in an explain block (sentence count + entailment threshold for groundedness, pos/neg word counts + sentiment for affect, matched/total chunks for gaze, gesture pattern match), and EvalPanel formats them into specific tooltips like "2/2 sentences had NLI entailment prob ≥ 0.50". See groundednessTip/affectTip/gestureTip/gazeTip in EvalPanel.tsx and explain in multimodal_alignment.py.
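The groundedness items above (sentence-level threshold, no_evidence short-circuit, explain block feeding the tooltip) fit together roughly like this. The sketch takes precomputed entailment probabilities instead of calling the NLI model. It assumes a sentence counts as supported when its best-matching chunk clears the threshold, and the key names inside explain are made up for illustration. The real tooltip formatting lives in EvalPanel.tsx; the Python formatter here only shows the shape of the data.

```python
def score_groundedness(sentence_probs, threshold=0.5):
    """sentence_probs[i][j] = entailment prob of sentence i given chunk j."""
    n_sentences = len(sentence_probs)
    n_chunks = len(sentence_probs[0]) if sentence_probs else 0
    if n_chunks == 0:
        # Nothing was retrieved (or nothing to score): skip NLI entirely.
        return {"grounded": None, "no_evidence": True, "explain": {}}
    # Assumption: a sentence is supported if any single chunk entails it.
    supported = sum(1 for probs in sentence_probs if max(probs) >= threshold)
    return {
        "grounded": supported / n_sentences,
        "no_evidence": False,
        "explain": {
            "supported": supported,
            "sentences": n_sentences,
            "threshold": threshold,
        },
    }


def groundedness_tip(result):
    """Format the explain block into a tooltip string."""
    if result["no_evidence"]:
        return "no evidence: nothing was retrieved"
    e = result["explain"]
    return (
        f"{e['supported']}/{e['sentences']} sentences had "
        f"NLI entailment prob ≥ {e['threshold']:.2f}"
    )
```

Hallucination rate then falls out as 1 − grounded, which is why it does not need its own pill.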
Cleanup
- delete backend/sensing/ (dead code, sensing is in frontend): done, only labels.py remains
- per-persona affect overrides (_PERSONA_TONE_OVERRIDES) deleted: redundant with stylistic_preferences in the new persona JSONs
Out of scope
Not in the spec β engineering nice-to-haves we'd pick up if the rest is done. Don't block grading on these.
- Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal. (Sensing: untagged in spec)
- [Scale] past ~100k chunks per user, swap torch matmul for hnswlib; consider a cross-encoder reranker (e.g. bge-reranker-base) if rerank_pool_k grows past ~30 (Retrieval: far beyond current corpus size)
Team
- Akash Kolte: akashjag@buffalo.edu
- Shwetangi: shwetang@buffalo.edu
University at Buffalo, SUNY
License
All rights reserved. See the LICENSE file for details.
