ub-aac-chatbot/aac-chatbot

shwetangisingh committed on Apr 17

Commit 7fd8c8a (1 parent: 09fe9bc)

Route sub-intents to their own pools, rip out the LLM intent router

Partner queries now get split on conjunctions and each fragment goes
to the right place: personal memories, session history, or general
knowledge. The old LLM router was burning ~100s per turn on retry
loops; replaced it with a BGE cosine match against seed sentences
(~30ms).

Also:
- CONTEXTUAL pulls persona memory too, so "what did I just ask" still
sounds like them
- Planner prompt split into system + user so the character sheet gets
cached between turns
- Tightened anti-meta rules after Gemma4 leaked its own character
sheet into the response
- THINKING_MODE=suppress so it stops thinking out loud
- run.sh --debug now forwards to the CLI
- README: three mermaid diagrams, rewritten intro, updated for the 14
personas
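
Reviewer note: the splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the code from `intent.py`; the regex, the name `split_query`, and the choice of "and" / "but" as the only conjunctions are assumptions.

```python
import re

# Hypothetical splitter (not the regex from intent.py): break on punctuation
# and on the conjunctions "and" / "but" when they appear as whole words.
_SPLIT_RE = re.compile(r"\s*(?:[,;?!.]+|\b(?:and|but)\b)\s*", re.IGNORECASE)


def split_query(query: str) -> list[str]:
    """Return the non-empty fragments of a partner query."""
    return [part.strip() for part in _SPLIT_RE.split(query) if part.strip()]


fragments = split_query("how are you, and what's the capital of France?")
print(fragments)  # ['how are you', "what's the capital of France"]
```

A pattern this simple also splits on decimal points and abbreviations, so a production splitter would likely need a few guards.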

Files changed (12)

README.md +194 -66
backend/guardrails/checks.py +8 -0
backend/main.py +2 -15
backend/pipeline/nodes/feedback.py +10 -4
backend/pipeline/nodes/intent.py +148 -120
backend/pipeline/nodes/planner.py +105 -38
backend/pipeline/nodes/retrieval.py +99 -30
backend/pipeline/state.py +3 -2
backend/retrieval/contextual.py +57 -0
backend/retrieval/vector_store.py +5 -0
backend/sensing/bucket_keywords.py +15 -0
run.sh +6 -0

README.md CHANGED

@@ -1,12 +1,8 @@
 # Multimodal AAC Chatbot
-An AI chatbot that **speaks as an AAC user**, not to them. Given a persona (Mia, Gerald, or Arjun),
-it fuses real-time multimodal non-verbal signals — facial expressions, hand gestures, gaze, and
-air writing — with personal memory retrieval to generate responses in that person's authentic voice.
-Built as a training-free, agentic RAG pipeline — a plain-Python function chain
-with two conditional branches (no LangGraph / LangChain), torch-tensor
-retrieval (no FAISS), and JSONL turn logging (no MLflow).
 ---
@@ -26,44 +22,181 @@ retrieval (no FAISS), and JSONL turn logging (no MLflow).
 ## What is AAC?
-**Augmentative and Alternative Communication (AAC)** refers to tools and technologies that help
-people who have difficulty with spoken or written communication — including individuals with
-Cerebral Palsy, ALS, Autism Spectrum Disorder, and other conditions. This project gives AAC users
-a personalized digital twin that communicates on their behalf.
 ---
 ## System Architecture
 ```
-React Frontend (browser)                    Backend (Python)
-  MediaPipe JS sensing ──┐
-  Chat UI ───────────────┼── POST /chat ──► FastAPI ──► run_pipeline()
-  Webcam feed ───────────┘                                │
-                                            L2 Intent ──► L3 Retrieval ──► L4 Generation ──► L5 Feedback
 ```
 | Layer | Module | What it does |
 |-------|--------|-------------|
-| L1 | `frontend/src/hooks/useSensing.ts` | MediaPipe JS — affect, gesture, gaze, air writing (browser-side) |
-| L2 | `backend/pipeline/nodes/intent.py` | Keyword-based intent routing (no LLM) |
-| L3 | `backend/pipeline/nodes/retrieval.py` | BGE-small embeddings + torch tensor cosine search (mps/cuda/cpu) |
-| L4 | `backend/pipeline/nodes/planner.py` | Expression-conditioned response generation (Qwen3) |
-| L5 | `backend/pipeline/nodes/feedback.py` | JSONL turn logging + Bayesian bucket prior update |
-The pipeline is a plain Python function chain with two conditional branches:
-- FRUSTRATED affect → fast retrieval path (k=2)
-- Latency > 3.5s → fallback to smaller Qwen3-8B model
 ---
 ## Prerequisites
-- Python **3.10+** (via conda)
-- Node.js **22+** and **pnpm**
-- An [Ollama Cloud](https://ollama.com) account — both LLM tiers hit
-  cloud-hosted models; no local Ollama daemon required
-- A webcam (for live sensing; optional for CLI mode)
 ---
@@ -75,57 +208,60 @@ cd multimodal_aac_chatbot
 bash setup.sh
 ```
-The setup script handles:
-- Conda environment creation (`aac-chatbot`, Python 3.12)
-- Python dependency installation
-- `.env` file creation from template
-- Vector index building (downloads BGE-small embedder on first run, saves
-  per-user `vectors.pt` under `data/vector_store/`)
-- Frontend dependency installation (pnpm)
 ---
 ## Configuration
-All settings live in [backend/config/settings.py](backend/config/settings.py) and can be overridden via `.env`.
-| Variable | Default | Description |
 |----------|---------|-------------|
-| `ACTIVE_LLM_TIER` | `primary` | `primary` \| `fallback` |
-| `PRIMARY_MODEL` | `gemma4:31b-cloud` | Ollama Cloud model for primary tier |
-| `FALLBACK_MODEL` | `gemma4:31b-cloud` | Ollama Cloud model for fallback tier (smaller/faster) |
-| `PRIMARY_BASE_URL` | `http://localhost:11434/v1` | Ollama-compatible endpoint |
-| `FALLBACK_LATENCY_THRESHOLD` | `3.5` | Seconds before falling back to smaller model |
-| `LOGS_DIR` | `logs` | Where per-turn JSONL logs are written |
 ---
 ## Running the Project
-### Full stack (recommended)
 ```bash
 bash run.sh
 ```
-This starts FastAPI on `:8000` and React on `:7550`.
-Open [http://localhost:7550](http://localhost:7550) in your browser.
-### CLI only
 ```bash
 conda activate aac-chatbot
 python -m backend.main --debug
 ```
-### API only
 ```bash
 conda activate aac-chatbot
 uvicorn backend.api.main:app --reload
 ```
-Example request:
 ```bash
 curl -X POST http://localhost:8000/chat \
   -H "Content-Type: application/json" \
@@ -173,38 +309,30 @@ multimodal_aac_chatbot/
 ## Personas
-14 personas anchored in real memoirs and canonical fictional characters, spanning ALS,
-Parkinson's, locked-in syndrome, aphasia, Alzheimer's, cerebral palsy, non-verbal autism,
-savant autism, intellectual disability, and spinal cord injury.
 | ID | Source | Condition |
 |----|--------|-----------|
 | `stephen_hawking` | Real — *My Brief History* + interviews | ALS (mid-stage) |
-| `michael_j_fox` | Real — 4 memoirs | Young-onset Parkinson's |
 | `wendy_mitchell` | Real — *Somebody I Used to Know* + blog | Early-onset Alzheimer's |
 | `christopher_reeve` | Real — *Still Me* | C4 spinal cord injury |
 | `christy_brown` | Real — *My Left Foot* | Cerebral palsy (adult) |
 | `gabby_giffords` | Real — *Gabby* memoir | Aphasia + TBI |
 | `jason_becker` | Real — *Not Dead Yet* doc | Late-stage ALS |
 | `jean_dominique_bauby` | Real — *The Diving Bell and the Butterfly* | Locked-in syndrome |
-| `tito_mukhopadhyay` | Real — 3+ books | Non-verbal autism |
 | `abed_nadir` | Fictional — *Community* | Autism (verbal) |
 | `allie_calhoun` | Fictional — *The Notebook* | Late-stage Alzheimer's |
 | `forrest_gump` | Fictional — *Forrest Gump* | Intellectual disability |
 | `walter_jr_white` | Fictional — *Breaking Bad* | Cerebral palsy (teen) |
 | `raymond_babbitt` | Fictional — *Rain Man* | Savant autism |
-Each persona has ~120-210 memory chunks (canon-driven, no filler) across 5 buckets
-(`family`, `medical`, `hobbies`, `daily_routine`, `social`) and 3 chunk types
-(`narrative`, `social_post`, `chat_log`). Total: ~2,300 chunks.
-**Data provenance is documented** — see [references.md](references.md) for the full
-bibliography of memoirs, films, interviews, and other canonical sources behind every
-persona, plus ethics notes on living-persons treatment.
-To add a new persona, write a JSON file in `data/memories/` following the schema of any
-existing persona, then run `python data/generate_users.py` and
-`python -m backend.retrieval.vector_store`.
 ---
@@ -235,10 +363,10 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
 ### Intent decomposition
-> Current state: routing is keyword-based, not LLM-based. The original LLM router (Pydantic-validated JSON) kept emitting the wrong shape with `gemma4:31b-cloud` and hitting the `max_tokens` truncation — 3 retries + hard fallback on every turn, ~30s of dead latency before generation. The keyword router (5 buckets matched against word lists in `intent.py`) handles the demo personas and adds ~0ms. Trade-off: stuck with the 5 hardcoded buckets (`family`, `medical`, `hobbies`, `daily_routine`, `social`) and can't tell `OPEN_DOMAIN` from `PERSONAL`. Fine for now since all personas only have personal memories. Revisit when Ollama Cloud ships `response_format=json_schema` or we add a tiny local classifier.
-- [ ] **[Core]** Personal / Contextual / Open-domain all hit the same vector index right now. Make them actually go different places — open-domain → web search (or stub), contextual → session memory
-- [ ] intent node is slow. Cache the prompt, use a tiny model for routing, parallelise the sub-queries
 ### Retrieval

 # Multimodal AAC Chatbot
+A chatbot that **speaks as an AAC user, not to them.** You pick a persona (Mia, Gerald, or Arjun) and the partner talks to them — the bot replies in that person's voice, using their memories, and adjusts what it says based on what the webcam sees: facial expression, hand gestures, where they're looking, and letters they trace in the air.
+It's a training-free agentic RAG pipeline — a plain Python function chain with two branching points, torch matmul for retrieval, JSONL for logging. The goal was to keep every piece simple enough to read top-to-bottom in an afternoon.
 ---
 ## What is AAC?
+AAC (Augmentative and Alternative Communication) covers the tools people use when spoken or written communication is hard for them — cerebral palsy, ALS, autism, stroke recovery, and so on. Usually that's a tablet with a symbol grid, or an eye-tracker, or a switch. The slow part isn't the typing — it's that most devices don't know *you*. Every conversation starts from scratch.
+This project is a small attempt at the other direction: give each user a persona their device already knows, and let the device reply in their voice.
 ---
 ## System Architecture
+The browser does all the camera work. MediaPipe JS runs inside React, classifies what it sees into small labels (`affect`, `gesture_tag`, `gaze_bucket`, `air_written_text`), and sends those alongside the partner's text to `/chat`. The backend never touches pixels.
 ```
+React (browser)                            Backend (Python)
+  MediaPipe JS  ──┐
+  Chat UI ────────┼── POST /chat ──► FastAPI ──► run_pipeline()
+  Webcam ─────────┘                                │
+                                       Intent ──► Retrieval ──► Planner ──► Feedback
 ```
+Five layers, each a tiny file:
 | Layer | Module | What it does |
 |-------|--------|-------------|
+| L1 | `frontend/src/hooks/useSensing.ts` | Watches the webcam. Turns faces/hands/gaze/air-writing into string labels. Purely frontend. |
+| L2 | `backend/pipeline/nodes/intent.py` | Splits the partner's question on conjunctions and punctuation, then classifies each fragment as PERSONAL, CONTEXTUAL, or OPEN_DOMAIN using cosine similarity against a handful of seed sentences. No LLM call. ~30ms per turn. |
+| L3 | `backend/pipeline/nodes/retrieval.py` | Each sub-intent goes to its own pool. Personal → the user's memory vector store. Contextual → persona memory + relevant in-session turns layered on top (so "what did I just ask" still sounds like *them*). Open-domain → a stub chunk telling the LLM to answer from its own knowledge (web search is deliberately out of scope). |
+| L4 | `backend/pipeline/nodes/planner.py` | Builds the prompt, calls the LLM, picks a response. Tone and max_tokens are shaped by the detected affect. |
+| L5 | `backend/pipeline/nodes/feedback.py` | Writes one JSONL row per turn and bumps the Bayesian priors over which memory bucket was useful. |
+Two places the pipeline branches:
+- **Frustrated affect** → use the fast retrieval path (k=2, skip the reranker). The user wants an answer, not a thesis.
+- **Cumulative latency past 3.5s** → switch to the smaller fallback model for generation.
+### End-to-end: from partner speaking to response rendered
+One diagram, left to right, every step a turn goes through. Follow the arrows.
+```mermaid
+flowchart LR
+    subgraph S1["① Partner side (browser)"]
+        direction TB
+        IN1[Partner types or speaks a question]
+        IN2[Webcam frame]
+        IN1 --> UI[Chat UI]
+        IN2 --> MP[MediaPipe JS<br/>face + hands + gaze]
+        MP --> LAB[Classify into labels<br/>affect, gesture_tag,<br/>gaze_bucket, air_written_text]
+        UI --> REQ
+        LAB --> REQ[POST /chat<br/>query + labels]
+    end
+    REQ ==> S2
+    subgraph S2["② Backend pipeline (Python)"]
+        direction TB
+        HYD[Hydrate PipelineState<br/>session_history, priors, profile] --> INT
+        INT[Intent node<br/>split query + classify fragments] --> BR1{FRUSTRATED?}
+        BR1 -- yes --> RFAST[Fast retrieval<br/>k=2]
+        BR1 -- no --> RFULL[Full retrieval<br/>k=5 → rerank 3]
+        RFAST --> POOL
+        RFULL --> POOL[Dispatch per sub-intent]
+        POOL --> PP[PERSONAL<br/>BGE vector store]
+        POOL --> PC[CONTEXTUAL<br/>personal + BGE over history]
+        POOL --> PO[OPEN_DOMAIN<br/>stub chunk]
+        PP --> MERGE[Merge + dedupe chunks]
+        PC --> MERGE
+        PO --> MERGE
+        MERGE --> PLAN[Planner<br/>build prompt with<br/>3 retrieval blocks + tone tag]
+        PLAN --> BR2{Total latency<br/>&gt; 3.5s?}
+        BR2 -- yes --> LLMF[Fallback LLM<br/>Ollama Cloud, smaller]
+        BR2 -- no --> LLMP[Primary LLM<br/>Ollama Cloud]
+        LLMF --> GRD[Guardrail check<br/>persona breaks,<br/>unsupported claims]
+        LLMP --> GRD
+        GRD --> FB[Feedback node<br/>log turn to JSONL,<br/>bump bucket priors,<br/>append to session history]
+    end
+    FB ==> S3
+    subgraph S3["③ Back to partner"]
+        direction TB
+        RESP[Response in persona's voice<br/>+ latency breakdown<br/>+ eval scores]
+        RESP --> RENDER[Chat UI renders it]
+    end
+```
+**A concrete example.** Partner says *"how are you, and what's the capital of France?"* while the webcam reads a relaxed face:
+1. Browser sends `{query, affect: NEUTRAL, gesture_tag: null, …}`.
+2. Intent node splits on `,` and ` and ` → two fragments. Classifier tags them `PERSONAL` and `OPEN_DOMAIN`.
+3. Affect isn't FRUSTRATED, so full retrieval runs.
+4. Dispatcher hits the persona store for fragment one, emits the open-domain stub for fragment two, merges both.
+5. Planner drops the two chunks into separate prompt blocks and calls the primary LLM.
+6. Guardrail passes, feedback writes the row, the response — in Mia's voice — comes back through the same `/chat` response.
+Total wall time is normally under 6 seconds end-to-end; the slow part is the LLM call, not anything you wrote.
+### What a single turn actually looks like
+```mermaid
+flowchart TD
+    A[Partner types or speaks] --> B[React captures query<br/>+ webcam labels]
+    B --> C[POST /chat]
+    C --> D[Intent node<br/>split + classify]
+    D --> E{Any FRUSTRATED<br/>affect signal?}
+    E -- yes --> F[Fast retrieval<br/>k=2, no reranker]
+    E -- no --> G[Full retrieval<br/>k=5 → rerank to 3]
+    F --> H{Cumulative<br/>latency &gt; 3.5s?}
+    G --> H
+    H -- yes --> I[Fallback LLM<br/>smaller, faster]
+    H -- no --> J[Primary LLM]
+    I --> K[Guardrail check]
+    J --> K
+    K --> L[Feedback node<br/>JSONL log + priors]
+    L --> M[Response in persona's voice]
+```
+### How sub-intents fan out
+This is the part that took a few iterations to get right. Each partner query can be *multiple* questions stitched together with "and" / "but" / punctuation. Each fragment gets classified separately and sent to its own retrieval pool.
+```mermaid
+flowchart LR
+    Q["&quot;how are you,<br/>and what's the<br/>capital of France?&quot;"] --> S[Split on conjunctions<br/>and punctuation]
+    S --> F1[fragment:<br/>how are you]
+    S --> F2[fragment:<br/>capital of France]
+    F1 --> CL[BGE zero-shot<br/>cosine vs exemplars]
+    F2 --> CL
+    CL --> P[PERSONAL<br/>→ persona memory vectors]
+    CL --> CX[CONTEXTUAL<br/>→ persona memory +<br/>relevant session history]
+    CL --> OD[OPEN_DOMAIN<br/>→ stub, LLM answers<br/>from own knowledge]
+    P --> MERGE[Merge, dedupe,<br/>hand to planner]
+    CX --> MERGE
+    OD --> MERGE
+```
+The classifier is just cosine similarity against 5 seed sentences per class — no LLM, ~30ms per turn. The old version called an LLM and retried up to 3× on JSON errors; on a bad day that was 100+ seconds of dead time.
+### State that flows between nodes
+Every node takes a `PipelineState` dict and returns a partial update. Nothing is global.
+```mermaid
+flowchart LR
+    subgraph "set at turn start"
+        A[user_id, persona_profile,<br/>session_history, turn_id]
+        B[affect, gesture_tag,<br/>gaze_bucket, air_written_text]
+        C[raw_query]
+    end
+    subgraph "filled in by the pipeline"
+        D[intent_route,<br/>generation_config]
+        E[retrieved_chunks,<br/>retrieval_mode_used]
+        F[candidates,<br/>selected_response,<br/>llm_tier_used]
+        G[latency_log,<br/>run_id,<br/>guardrail_passed]
+    end
+    A --> D
+    B --> D
+    C --> D
+    D --> E
+    B --> E
+    D --> F
+    E --> F
+    F --> G
+```
 ---
 ## Prerequisites
+- Python 3.10+ (we use conda; 3.12 is what the env ships with)
+- Node.js 22+ and pnpm
+- An [Ollama Cloud](https://ollama.com) account. Generation hits cloud models — you don't need a local Ollama daemon running.
+- A webcam if you want to play with the full stack. The CLI works without one.
 ---
 bash setup.sh
 ```
+`setup.sh` takes care of everything on the first run: creates the `aac-chatbot` conda env, installs Python and frontend deps, copies `.env.example` → `.env` for you to fill in, and builds the per-persona vector indexes under `data/vector_store/`. The first build downloads the BGE-small embedder (~130MB), so expect a short wait.
+If you edit a persona later, rebuild the indexes: `python -m backend.retrieval.vector_store`.
 ---
 ## Configuration
+Everything is a Pydantic setting in [backend/config/settings.py](backend/config/settings.py) with a `.env` override. The knobs you'll actually touch:
+| Variable | Default | What it does |
 |----------|---------|-------------|
+| `ACTIVE_LLM_TIER` | `primary` | Which tier to start on — `primary` or `fallback`. The pipeline switches automatically if a turn is slow. |
+| `PRIMARY_MODEL` | `gemma4:31b-cloud` | Ollama Cloud model for the primary tier. |
+| `FALLBACK_MODEL` | `gemma4:31b-cloud` | Smaller/faster model for the fallback tier. Point this at whatever smaller cloud model you have access to. |
+| `PRIMARY_BASE_URL` | `http://localhost:11434/v1` | OpenAI-compatible endpoint. Defaults to the local Ollama proxy. |
+| `FALLBACK_LATENCY_THRESHOLD` | `3.5` | If intent+retrieval already took this many seconds, skip the primary tier. |
+| `LOGS_DIR` | `logs` | Where the per-turn JSONL goes. |
 ---
 ## Running the Project
+### Full stack
 ```bash
 bash run.sh
 ```
+Starts FastAPI on `:8000` and the React dev server on `:7550`. Open [http://localhost:7550](http://localhost:7550). This is the mode you want for the webcam + sensing demo.
+Pass any `backend.main` flag to `run.sh` and it drops the full stack and runs the CLI with those flags instead — handy for fast iteration:
+```bash
+bash run.sh --debug                    # CLI with per-turn state dumps
+bash run.sh --user mia_chen --debug    # jump straight to Mia
+```
+### CLI directly
 ```bash
 conda activate aac-chatbot
 python -m backend.main --debug
 ```
+The CLI prints the full `PipelineState` after each turn — useful when you want to see what the classifier did or which chunks came back from which pool.
+### API directly
 ```bash
 conda activate aac-chatbot
 uvicorn backend.api.main:app --reload
 ```
 ```bash
 curl -X POST http://localhost:8000/chat \
   -H "Content-Type: application/json" \
 ## Personas
+Fourteen personas — nine anchored in real memoirs, five in canonical fiction. Together they span ALS, Parkinson's, locked-in syndrome, aphasia, Alzheimer's, cerebral palsy, non-verbal and savant autism, intellectual disability, and spinal cord injury. The point isn't to represent any one person — it's to give the model a wide enough range of voices that "sound like Mia" is a harder target than "sound helpful."
 | ID | Source | Condition |
 |----|--------|-----------|
 | `stephen_hawking` | Real — *My Brief History* + interviews | ALS (mid-stage) |
+| `michael_j_fox` | Real — four memoirs | Young-onset Parkinson's |
 | `wendy_mitchell` | Real — *Somebody I Used to Know* + blog | Early-onset Alzheimer's |
 | `christopher_reeve` | Real — *Still Me* | C4 spinal cord injury |
 | `christy_brown` | Real — *My Left Foot* | Cerebral palsy (adult) |
 | `gabby_giffords` | Real — *Gabby* memoir | Aphasia + TBI |
 | `jason_becker` | Real — *Not Dead Yet* doc | Late-stage ALS |
 | `jean_dominique_bauby` | Real — *The Diving Bell and the Butterfly* | Locked-in syndrome |
+| `tito_mukhopadhyay` | Real — three+ books | Non-verbal autism |
 | `abed_nadir` | Fictional — *Community* | Autism (verbal) |
 | `allie_calhoun` | Fictional — *The Notebook* | Late-stage Alzheimer's |
 | `forrest_gump` | Fictional — *Forrest Gump* | Intellectual disability |
 | `walter_jr_white` | Fictional — *Breaking Bad* | Cerebral palsy (teen) |
 | `raymond_babbitt` | Fictional — *Rain Man* | Savant autism |
+Each persona has ~120–210 memory chunks (canon-driven, no filler) across five buckets — `family`, `medical`, `hobbies`, `daily_routine`, `social` — and three chunk types: `narrative`, `social_post`, `chat_log`. Somewhere around 2,300 chunks total across the set.
+Data provenance is documented. See [references.md](references.md) for the bibliography — memoirs, films, interviews — and the ethics notes on living-persons treatment.
+Adding a new persona: drop a JSON file into `data/memories/` following the schema of any existing one, then run `python data/generate_users.py` and `python -m backend.retrieval.vector_store`.
 ---
 ### Intent decomposition
+> Current state: regex-splits the partner query on conjunctions/punctuation into fragments, then runs each fragment through a BGE zero-shot classifier (cosine vs. 5 seed exemplars per class). No LLM call, no retries. Runs in ~10–30ms per turn. Bucket hints for `PERSONAL` fragments come from a shared keyword helper in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py). Earlier versions used an LLM with Pydantic validation + 3 retries, which cost ~100s per turn on Ollama Cloud when the model emitted bad JSON.
+- [x] **[Core]** Personal / Contextual / Open-domain dispatch to distinct pools (personal → BGE vector store; contextual → persona memory + relevant in-session turns layered on top; open-domain → stub chunk, LLM answers from its own general knowledge — web search is intentionally out of scope).
+- [x] intent node latency — split + BGE zero-shot classifier replaces the LLM router. Parallelising sub-query retrieval is still open.
 ### Retrieval
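
Reviewer note: the two branch points the new README describes (frustrated affect selects the fast retrieval path, cumulative latency past 3.5s selects the fallback tier) can be summarised in one small function. This is a sketch under assumptions: `plan_turn` and the `Plan` shape are hypothetical, and the real pipeline makes the two decisions at different points in the turn rather than in a single call.

```python
from typing import TypedDict


class Plan(TypedDict):
    retrieval_mode: str
    top_k: int
    rerank: bool
    llm_tier: str


# Default documented in the README's configuration table.
FALLBACK_LATENCY_THRESHOLD = 3.5  # seconds


def plan_turn(emotion: str, elapsed_before_generation: float) -> Plan:
    """Pick the retrieval path and LLM tier for one turn (hypothetical helper)."""
    frustrated = emotion == "FRUSTRATED"
    slow = elapsed_before_generation > FALLBACK_LATENCY_THRESHOLD
    return {
        # Fast path: k=2 and no reranker; full path: k=5, then rerank.
        "retrieval_mode": "fast" if frustrated else "full",
        "top_k": 2 if frustrated else 5,
        "rerank": not frustrated,
        # Skip the primary tier when intent + retrieval were already slow.
        "llm_tier": "fallback" if slow else "primary",
    }
```

For example, `plan_turn("NEUTRAL", 4.0)` keeps full retrieval but routes generation to the fallback tier.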

backend/guardrails/checks.py CHANGED

@@ -14,6 +14,14 @@ PERSONA_BREAK_SIGNALS = [
     "as your assistant",
     "i was trained",
     "my training data",
 ]
 OUT_OF_SCOPE_SIGNALS = [

     "as your assistant",
     "i was trained",
     "my training data",
+    # meta-narration / brief leakage
+    "the user wants me",
+    "the user is asking",
+    "roleplay as",
+    "role-play as",
+    "key characteristics",
+    "character sheet",
+    "reference only",
 ]
 OUT_OF_SCOPE_SIGNALS = [
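
Reviewer note: the hunk only shows the signal list, not how it is consumed. A plausible reading is a case-insensitive substring scan over the generated response, sketched below. The function name `find_persona_breaks` and the matching strategy are assumptions; the list is limited to the entries visible in this hunk.

```python
# Only the entries visible in this hunk; the real list starts earlier in the file.
PERSONA_BREAK_SIGNALS = [
    "as your assistant",
    "i was trained",
    "my training data",
    # meta-narration / brief leakage
    "the user wants me",
    "the user is asking",
    "roleplay as",
    "role-play as",
    "key characteristics",
    "character sheet",
    "reference only",
]


def find_persona_breaks(response: str) -> list[str]:
    """Return every signal phrase found in the response (assumed substring match)."""
    lowered = response.lower()
    return [signal for signal in PERSONA_BREAK_SIGNALS if signal in lowered]


print(find_persona_breaks("The user wants me to roleplay as Mia."))
```

Plain substring matching is cheap but can misfire on innocent text; a persona who really says "for reference only" would trip the last signal.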

backend/main.py CHANGED

@@ -13,6 +13,7 @@ from backend.pipeline.graph import run_pipeline
 from backend.pipeline.state import GenerationConfig, PipelineState
 from backend.retrieval.bucket_priors import uniform_priors
 from backend.retrieval.vector_store import _get_embedder
 def parse_args() -> argparse.Namespace:
@@ -40,21 +41,7 @@ def parse_args() -> argparse.Namespace:
 def _keyword_intent(query: str) -> tuple[dict, GenerationConfig]:
     """Replicate milestone-1 keyword routing as a fast local-dev shortcut."""
     q = query.lower()
-    bucket: str | None = None
-    if any(
-        w in q
-        for w in ["medication", "medicine", "doctor", "health", "allergic", "therapy"]
-    ):
-        bucket = "medical"
-    elif any(w in q for w in ["family", "mom", "dad", "brother", "sister", "parents"]):
-        bucket = "family"
-    elif any(w in q for w in ["hobby", "like to do", "enjoy", "weekend", "fun"]):
-        bucket = "hobbies"
-    elif any(w in q for w in ["routine", "morning", "wake", "sleep", "daily"]):
-        bucket = "daily_routine"
-    elif any(w in q for w in ["friend", "social", "people", "party", "community"]):
-        bucket = "social"
     intent_type = (
         "CONTEXTUAL"

 from backend.pipeline.state import GenerationConfig, PipelineState
 from backend.retrieval.bucket_priors import uniform_priors
 from backend.retrieval.vector_store import _get_embedder
+from backend.sensing.bucket_keywords import infer_bucket
 def parse_args() -> argparse.Namespace:
 def _keyword_intent(query: str) -> tuple[dict, GenerationConfig]:
     """Replicate milestone-1 keyword routing as a fast local-dev shortcut."""
     q = query.lower()
+    bucket = infer_bucket(query)
     intent_type = (
         "CONTEXTUAL"

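Reviewer note: `backend/sensing/bucket_keywords.py` is listed as a new file (+15) but its body is not shown here. Assuming it simply lifts the keyword chain removed from `main.py`, `infer_bucket` might look like the sketch below. The table name `_BUCKET_KEYWORDS` is hypothetical; the keyword lists and their order come from the removed lines.

```python
from __future__ import annotations

# Keyword lists copied from the if/elif chain removed from backend/main.py.
# Dict order matters: the first bucket with a hit wins, as in the old chain.
_BUCKET_KEYWORDS: dict[str, list[str]] = {
    "medical": ["medication", "medicine", "doctor", "health", "allergic", "therapy"],
    "family": ["family", "mom", "dad", "brother", "sister", "parents"],
    "hobbies": ["hobby", "like to do", "enjoy", "weekend", "fun"],
    "daily_routine": ["routine", "morning", "wake", "sleep", "daily"],
    "social": ["friend", "social", "people", "party", "community"],
}


def infer_bucket(query: str) -> str | None:
    """Return the first bucket whose keyword occurs in the query, else None."""
    q = query.lower()
    for bucket, keywords in _BUCKET_KEYWORDS.items():
        if any(word in q for word in keywords):
            return bucket
    return None
```

Because this is substring matching, "fun" also matches "function"; whole-word matching would be a stricter alternative.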
backend/pipeline/nodes/feedback.py CHANGED

@@ -33,6 +33,7 @@ def _log_to_jsonl(state: PipelineState, run_id: str) -> None:
     latency = state.get("latency_log") or {}
     affect = (state.get("affect") or {}).get("emotion", "UNKNOWN")
     entry = {
         "run_id": run_id,
@@ -43,7 +44,12 @@ def _log_to_jsonl(state: PipelineState, run_id: str) -> None:
         "retrieval_mode": state.get("retrieval_mode_used", "unknown"),
         "affect": affect,
         "guardrail_passed": state.get("guardrail_passed", True),
-        "num_chunks": len(state.get("retrieved_chunks") or []),
         "latency": {
             "t_sensing": latency.get("t_sensing", 0.0),
             "t_intent": latency.get("t_intent", 0.0),
@@ -60,11 +66,11 @@ def _log_to_jsonl(state: PipelineState, run_id: str) -> None:
 def _update_bucket_priors(state: PipelineState) -> dict[str, float]:
     chunks = state.get("retrieved_chunks") or []
-    if not chunks:
         return state.get("bucket_priors") or {}
-    # Which bucket sourced the accepted response?
-    top_bucket = chunks[0].get("bucket")
     if not top_bucket:
         return state.get("bucket_priors") or {}

     latency = state.get("latency_log") or {}
     affect = (state.get("affect") or {}).get("emotion", "UNKNOWN")
+    chunks = state.get("retrieved_chunks") or []
     entry = {
         "run_id": run_id,
         "retrieval_mode": state.get("retrieval_mode_used", "unknown"),
         "affect": affect,
         "guardrail_passed": state.get("guardrail_passed", True),
+        "num_chunks": len(chunks),
+        "num_personal": sum(
+            1 for c in chunks if c.get("source", "personal") == "personal"
+        ),
+        "num_contextual": sum(1 for c in chunks if c.get("source") == "contextual"),
+        "num_open_domain": sum(1 for c in chunks if c.get("source") == "open_domain"),
         "latency": {
             "t_sensing": latency.get("t_sensing", 0.0),
             "t_intent": latency.get("t_intent", 0.0),
 def _update_bucket_priors(state: PipelineState) -> dict[str, float]:
     chunks = state.get("retrieved_chunks") or []
+    personal = [c for c in chunks if c.get("source", "personal") == "personal"]
+    if not personal:
         return state.get("bucket_priors") or {}
+    top_bucket = personal[0].get("bucket")
     if not top_bucket:
         return state.get("bucket_priors") or {}
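
Reviewer note: the per-source counting and the "first personal chunk" rule added in this file can be checked in isolation. The sketch below reproduces the same logic with a `Counter`; the helper names and the sample chunks are hypothetical, and the prior update itself is left out because its formula is not part of this diff.

```python
from __future__ import annotations

from collections import Counter


def summarize_chunks(chunks: list[dict]) -> dict[str, int]:
    """Count retrieved chunks per source; untagged chunks count as personal."""
    counts = Counter(c.get("source", "personal") for c in chunks)
    return {
        "num_chunks": len(chunks),
        "num_personal": counts["personal"],
        "num_contextual": counts["contextual"],
        "num_open_domain": counts["open_domain"],
    }


def top_personal_bucket(chunks: list[dict]) -> str | None:
    """Bucket of the first personal chunk, or None if there is none."""
    personal = [c for c in chunks if c.get("source", "personal") == "personal"]
    return personal[0].get("bucket") if personal else None


# Hypothetical retrieval result for a mixed query.
chunks = [
    {"source": "open_domain", "text": "answer from general knowledge"},
    {"bucket": "family", "text": "untagged chunk, treated as personal"},
    {"source": "contextual", "text": "earlier turn"},
]
print(summarize_chunks(chunks), top_personal_bucket(chunks))
```

Filtering before picking the top bucket means stub and session-history chunks can no longer influence the bucket priors, even when they rank first.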

backend/pipeline/nodes/intent.py CHANGED

@@ -1,45 +1,73 @@
-# Intent decomposition node — LLM-based query classification and routing.
 from __future__ import annotations
 import re
 import time
-from typing import Literal
-from pydantic import BaseModel
 from backend.config.settings import settings
-from backend.generation.llm_client import chat_complete
-from backend.pipeline.state import GenerationConfig, IntentRoute, PipelineState
-# ── Pydantic output schemas ────────────────────────────────────────────────────
-BucketType = Literal["family", "medical", "hobbies", "daily_routine", "social"]
-AffectEmotion = Literal["HAPPY", "FRUSTRATED", "NEUTRAL", "SURPRISED"]
-class SubIntentSchema(BaseModel):
-    type: Literal["PERSONAL", "CONTEXTUAL", "OPEN_DOMAIN"]
-    query: str
-    bucket_hint: BucketType | None = None
-    priority: Literal["fast", "normal"] = "normal"
-class StyleConfig(BaseModel):
-    tone_tag: str  # e.g. "[TONE:WITTY_SARCASTIC]"
-    max_tokens: int
-    retrieval_mode: str  # "fast" | "full"
-    persona_mod: (
-        str  # "amplify_quirks" | "suppress_humor" | "baseline" | "add_confirmation"
-    )
-class IntentRouteSchema(BaseModel):
-    sub_intents: list[SubIntentSchema]
-    style_constraints: StyleConfig
-    affect: AffectEmotion
-# ── Affect → generation config mapping ────────────────────────────────────────
 _AFFECT_CONFIG: dict[str, GenerationConfig] = {
     "HAPPY": {
@@ -68,41 +96,61 @@ _AFFECT_CONFIG: dict[str, GenerationConfig] = {
     },
 }
-# ── System prompt ──────────────────────────────────────────────────────────────
-_SYSTEM_PROMPT = """\
-You are the intent decomposition controller for an AAC (Augmentative and \
-Alternative Communication) chatbot. Given a partner's query and the AAC \
-user's current affect state, classify each intent and produce routing \
-instructions in the required JSON format.
-Intent types:
-- PERSONAL: requires autobiographical memory retrieval
-- CONTEXTUAL: answerable from session history
-- OPEN_DOMAIN: answerable from general knowledge (no retrieval needed)
-Bucket hints (only for PERSONAL): family | medical | hobbies | daily_routine | social
-Priority: set "fast" when affect is FRUSTRATED to reduce latency.
-Respond ONLY with valid JSON matching the IntentRoute schema. No extra text.
-"""
-def _build_user_prompt(
-    query: str, affect: str, persona_name: str, air_written_text: str | None = None
-) -> str:
-    air_note = (
-        f'\nAir-written supplement: "{air_written_text}"' if air_written_text else ""
-    )
-    return (
-        f"Persona: {persona_name}\n"
-        f"Affect: {affect}\n"
-        f"Partner query: {query}{air_note}\n\n"
-        "Produce the IntentRoute JSON:"
-    )
-# ── Node entry point ───────────────────────────────────────────────────────────
 def run(state: PipelineState) -> dict:
@@ -110,78 +158,58 @@ def run(state: PipelineState) -> dict:
     # --fast mode: intent_route already resolved by keyword routing in main.py
     if state.get("intent_route") and state.get("generation_config"):
-        return {}  # nothing to update — downstream nodes use the pre-filled values
     affect_state = state.get("affect") or {}
     emotion: str = affect_state.get("emotion", "NEUTRAL")
     query: str = state["raw_query"]
-    persona_name: str = state["persona_profile"].get("name", "unknown")
     gen_config = _AFFECT_CONFIG.get(emotion, _AFFECT_CONFIG["NEUTRAL"])
-    route: IntentRoute | None = None
-    last_error: str = ""
-    for attempt in range(3):  # up to 2 retries on schema validation failure
-        messages = [
-            {"role": "system", "content": _SYSTEM_PROMPT},
             {
-                "role": "user",
-                "content": _build_user_prompt(
-                    query,
-                    emotion,
-                    persona_name,
-                    air_written_text=state.get("air_written_text"),
-                ),
-            },
-        ]
-        if attempt > 0:
-            messages.append(
-                {
-                    "role": "user",
-                    "content": f"Validation error: {last_error}. Fix and retry.",
-                }
-            )
-        raw = chat_complete(
-            messages=messages,
-            max_tokens=512,
-            temperature=0.0,
         )
-        try:
-            # Strip markdown fences (```json ... ```) that many models add
-            cleaned = re.sub(r"^```(?:json)?\s*", "", raw.strip())
-            cleaned = re.sub(r"\s*```$", "", cleaned.strip())
-            parsed = IntentRouteSchema.model_validate_json(cleaned)
-            route = {
-                "sub_intents": [si.model_dump() for si in parsed.sub_intents],
-                "style_constraints": parsed.style_constraints.model_dump(),
-                "affect": parsed.affect,
             }
-            break
-        except Exception as exc:
-            last_error = str(exc)
-    if route is None:
-        # Hard fallback: treat as a single PERSONAL intent, full retrieval
-        route = {
-            "sub_intents": [
-                {
-                    "type": "PERSONAL",
-                    "query": query,
-                    "bucket_hint": None,
-                    "priority": "normal",
-                }
-            ],
-            "style_constraints": gen_config,
-            "affect": emotion,
-        }
-    t_intent = time.perf_counter() - t0
     latency_log = dict(state.get("latency_log") or {})
-    latency_log["t_intent"] = round(t_intent, 4)
     return {
         "intent_route": route,

+# Intent decomposition node — regex-split fragments + BGE zero-shot classifier.
 from __future__ import annotations
 import re
 import time
+from functools import lru_cache
+import torch
 from backend.config.settings import settings
+from backend.pipeline.state import (
+    GenerationConfig,
+    IntentRoute,
+    PipelineState,
+    SubIntent,
+)
+from backend.retrieval.vector_store import get_device, get_embedder
+from backend.sensing.bucket_keywords import infer_bucket
+_CLASS_EXEMPLARS: dict[str, list[str]] = {
+    "PERSONAL": [
+        "how are you today",
+        "what is your favourite food",
+        "tell me about your family",
+        "what do you do in the mornings",
+        "did you enjoy the weekend",
+    ],
+    "CONTEXTUAL": [
+        "what did you just say",
+        "what did I ask earlier",
+        "you mentioned something before",
+        "can you repeat that",
+        "what were we talking about",
+    ],
+    "OPEN_DOMAIN": [
+        "what is the capital of france",
+        "how many planets are there",
+        "who wrote hamlet",
+        "when was world war two",
+        "what does photosynthesis mean",
+    ],
+}
+_CLASSIFIER_THRESHOLD = (
+    0.45  # below this → PERSONAL fallback (safe default for OOV / typos / short input)
+)
+_CONTEXTUAL_MARGIN_MIN = (
+    0.08  # CONTEXTUAL must beat runner-up by at least this — it over-matches without it
+)
+_MIN_FRAGMENT_WORDS = 3
+_MAX_FRAGMENTS = 4
+_CONTEXTUAL_MARKERS = (
+    "earlier",
+    "before",
+    "mentioned",
+    "said",
+    "asked",
+    "just",
+    "repeat",
+)
+_CONTEXTUAL_MARKER_PATTERN = re.compile(
+    r"\b(" + "|".join(_CONTEXTUAL_MARKERS) + r")\b",
+    flags=re.IGNORECASE,
+)
+_SPLIT_PATTERN = re.compile(
+    r"\s+(?:and|but|also|plus)\s+|[;.?!]+\s+|,\s+(?=\w)",
+    flags=re.IGNORECASE,
+)
 _AFFECT_CONFIG: dict[str, GenerationConfig] = {
     "HAPPY": {
     },
 }
+@lru_cache(maxsize=1)
+def _exemplar_matrices() -> dict[str, torch.Tensor]:
+    embedder = get_embedder()
+    device = get_device()
+    return {
+        cls: embedder.encode(
+            exemplars,
+            convert_to_tensor=True,
+            normalize_embeddings=True,
+            device=device,
+        )
+        for cls, exemplars in _CLASS_EXEMPLARS.items()
+    }
+def _split_query(query: str) -> list[str]:
+    raw = [p.strip() for p in _SPLIT_PATTERN.split(query) if p and p.strip()]
+    keep = [p for p in raw if len(p.split()) >= _MIN_FRAGMENT_WORDS]
+    if not keep:
+        keep = [query.strip()] if query.strip() else []
+    return keep[:_MAX_FRAGMENTS]
+def _classify(fragment: str) -> str:
+    embedder = get_embedder()
+    device = get_device()
+    vec = embedder.encode(
+        [fragment],
+        convert_to_tensor=True,
+        normalize_embeddings=True,
+        device=device,
+    )[0]
+    scores: dict[str, float] = {}
+    for cls, mat in _exemplar_matrices().items():
+        scores[cls] = float((mat @ vec).max())
+    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
+    best_cls, best_score = ranked[0]
+    runner_up_score = ranked[1][1]
+    if best_score < _CLASSIFIER_THRESHOLD:
+        return "PERSONAL"  # conservative default: treat as a question about the persona
+    # CONTEXTUAL is the riskiest class — if wrong, we lose all persona grounding.
+    # Require it to clearly beat the runner-up and for the fragment to mention
+    # prior discourse (matched at word boundaries, so "just" doesn't match "unjust").
+    if best_cls == "CONTEXTUAL":
+        margin = best_score - runner_up_score
+        has_discourse_marker = bool(_CONTEXTUAL_MARKER_PATTERN.search(fragment))
+        if margin < _CONTEXTUAL_MARGIN_MIN or not has_discourse_marker:
+            return "PERSONAL"
+    return best_cls
 def run(state: PipelineState) -> dict:
     # --fast mode: intent_route already resolved by keyword routing in main.py
     if state.get("intent_route") and state.get("generation_config"):
+        return {}
     affect_state = state.get("affect") or {}
     emotion: str = affect_state.get("emotion", "NEUTRAL")
     query: str = state["raw_query"]
     gen_config = _AFFECT_CONFIG.get(emotion, _AFFECT_CONFIG["NEUTRAL"])
+    fragments = _split_query(query)
+    priority = "fast" if emotion == "FRUSTRATED" else "normal"
+    sub_intents: list[SubIntent] = []
+    for frag in fragments:
+        cls = _classify(frag)
+        bucket_hint = infer_bucket(frag) if cls == "PERSONAL" else None
+        sub_intents.append(
             {
+                "type": cls,
+                "query": frag,
+                "bucket_hint": bucket_hint,
+                "priority": priority,
+            }
         )
+    if not sub_intents:
+        sub_intents = [
+            {
+                "type": "PERSONAL",
+                "query": query,
+                "bucket_hint": None,
+                "priority": priority,
             }
+        ]
+    air_written = state.get("air_written_text")
+    if air_written:
+        sub_intents.append(
+            {
+                "type": "PERSONAL",
+                "query": air_written,
+                "bucket_hint": infer_bucket(air_written),
+                "priority": priority,
+            }
+        )
+    route: IntentRoute = {
+        "sub_intents": sub_intents,
+        "style_constraints": dict(gen_config),
+        "affect": emotion,
+    }
     latency_log = dict(state.get("latency_log") or {})
+    latency_log["t_intent"] = round(time.perf_counter() - t0, 4)
     return {
         "intent_route": route,
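
The new intent node can be exercised without the embedder. The sketch below copies the split regex, marker list, and thresholds from the diff, and separates the gating rules into a hypothetical `decide` helper that takes precomputed per-class scores (in the real node these come from the max cosine similarity against each class's exemplars). The function names here are illustrative and are not part of the commit.

```python
import re

# Constants copied from the diff above.
SPLIT_PATTERN = re.compile(
    r"\s+(?:and|but|also|plus)\s+|[;.?!]+\s+|,\s+(?=\w)",
    flags=re.IGNORECASE,
)
MARKER_PATTERN = re.compile(
    r"\b(earlier|before|mentioned|said|asked|just|repeat)\b",
    flags=re.IGNORECASE,
)
THRESHOLD = 0.45
CONTEXTUAL_MARGIN_MIN = 0.08
MIN_FRAGMENT_WORDS = 3
MAX_FRAGMENTS = 4


def split_query(query: str) -> list[str]:
    """Split on conjunctions and punctuation, dropping very short fragments."""
    raw = [p.strip() for p in SPLIT_PATTERN.split(query) if p and p.strip()]
    keep = [p for p in raw if len(p.split()) >= MIN_FRAGMENT_WORDS]
    if not keep and query.strip():
        # Nothing survived the length filter: fall back to the whole query.
        keep = [query.strip()]
    return keep[:MAX_FRAGMENTS]


def decide(scores: dict[str, float], fragment: str) -> str:
    """Hypothetical helper: apply the threshold and CONTEXTUAL guards to scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_cls, best), (_, runner_up) = ranked[0], ranked[1]
    if best < THRESHOLD:
        return "PERSONAL"  # low confidence: conservative default
    if best_cls == "CONTEXTUAL":
        weak_margin = best - runner_up < CONTEXTUAL_MARGIN_MIN
        no_marker = MARKER_PATTERN.search(fragment) is None
        if weak_margin or no_marker:
            return "PERSONAL"
    return best_cls


fragments = split_query("Tell me about your family and who wrote Hamlet?")
label = decide(
    {"PERSONAL": 0.50, "CONTEXTUAL": 0.70, "OPEN_DOMAIN": 0.20},
    "what did you just say",
)
```

Note that the punctuation branch of the regex requires trailing whitespace, so a final `?` stays attached to the last fragment, and `\bjust\b` does not fire on words such as "unjust".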

backend/pipeline/nodes/planner.py CHANGED

@@ -53,7 +53,7 @@ def _run(state: PipelineState, tier: str) -> dict:
     )
     gesture_tag = state.get("gesture_tag")
     air_written_text = state.get("air_written_text")
-    prompt = _build_prompt(
         profile,
         chunks,
         history,
@@ -65,7 +65,7 @@ def _run(state: PipelineState, tier: str) -> dict:
     )
     selected = chat_complete(
-        messages=[{"role": "user", "content": prompt}],
         max_tokens=gen_cfg.get("max_tokens", settings.max_tokens_neutral),
         temperature=0.4,
         tier=tier,
@@ -86,8 +86,9 @@ def _run(state: PipelineState, tier: str) -> dict:
         4,
     )
     return {
-        "augmented_prompt": prompt,
         "candidates": [selected],
         "selected_response": selected,
         "llm_tier_used": tier,
@@ -101,7 +102,7 @@ def _resolve_tone_tag(user_id: str, affect: str, default_tag: str) -> str:
     return _PERSONA_TONE_OVERRIDES.get(user_id, {}).get(affect, default_tag)
-def _build_prompt(
     profile: dict,
     chunks: list[dict],
     history: list[dict],
@@ -110,19 +111,29 @@ def _build_prompt(
     gen_cfg: dict,
     gesture_tag: str | None = None,
     air_written_text: str | None = None,
-) -> str:
-    memory_block = (
-        "\n".join(
-            f"  [{c['bucket']}/{c.get('type', 'narrative')}] {c['text']}"
-            for c in chunks
-        )
-        or "  (no memories retrieved)"
-    )
-    history_block = (
-        "\n".join(f"  {h.get('role', '?')}: {h.get('content', '')}" for h in history)
-        or "  (start of session)"
     )
     prefs = profile.get("stylistic_preferences") or {}
     style_bits = []
     if prefs.get("tone"):
@@ -139,9 +150,73 @@ def _build_prompt(
     exemplars = prefs.get("example_phrases") or []
     style_exemplar = "\n  ".join(exemplars) if exemplars else "(no exemplar)"
     access = (profile.get("access_needs") or {}).get("input_method") or "an AAC device"
     gesture_line = ""
     if gesture_tag:
         g_tag = GESTURE_TO_TAG.get(gesture_tag, f"[GESTURE:{gesture_tag}]")
@@ -151,35 +226,27 @@ def _build_prompt(
     if air_written_text:
         air_writing_line = f'\nThe user air-wrote: "{air_written_text}" — treat as supplementary intent.'
-    persona_mod = gen_cfg.get("persona_mod", "baseline")
-    persona_instruction = {
-        "amplify_quirks": "Amplify your characteristic style and personality.",
-        "suppress_humor": "Be direct and supportive. Suppress humor.",
-        "baseline": "Use your natural communication style.",
-        "add_confirmation": "Add a clarifying question or confirmation at the end.",
-    }.get(persona_mod, "Use your natural communication style.")
     return f"""\
-You are {profile["name"]}. You have {profile["condition"]} and communicate through {access}, but your voice and thoughts are fully your own.
-Communication style: {style_summary}
 {tone_tag}{gesture_line}{air_writing_line}
-Style exemplars — match this register:
-  {style_exemplar}
-Personal memories (use ONLY these for personal facts; each tagged [bucket/type] where type is narrative, social_post, or chat_log):
 {memory_block}
 Recent conversation:
 {history_block}
-Partner says: {query}
-Instructions:
-- Speak in first person as {profile["name"]}.
-- {persona_instruction}
-- Keep response to 1-3 sentences.
-- If the answer isn't in your memories, say "I don't know."
-- Do NOT say "As an AI" or break persona.
-Response:"""

     )
     gesture_tag = state.get("gesture_tag")
     air_written_text = state.get("air_written_text")
+    messages = _build_messages(
         profile,
         chunks,
         history,
     )
     selected = chat_complete(
+        messages=messages,
         max_tokens=gen_cfg.get("max_tokens", settings.max_tokens_neutral),
         temperature=0.4,
         tier=tier,
         4,
     )
+    augmented_prompt = "\n\n".join(m["content"] for m in messages)
     return {
+        "augmented_prompt": augmented_prompt,
         "candidates": [selected],
         "selected_response": selected,
         "llm_tier_used": tier,
     return _PERSONA_TONE_OVERRIDES.get(user_id, {}).get(affect, default_tag)
+def _build_messages(
     profile: dict,
     chunks: list[dict],
     history: list[dict],
     gen_cfg: dict,
     gesture_tag: str | None = None,
     air_written_text: str | None = None,
+) -> list[dict]:
+    # Split into a stable system message (same per persona — gets cached by the
+    # provider) and a turn-specific user message. Anything that changes per
+    # turn (retrieval, affect, gesture, partner query) must live in the user
+    # message or the prefix cache invalidates.
+    system_content = _build_system(profile)
+    user_content = _build_user(
+        chunks,
+        history,
+        query,
+        tone_tag,
+        gen_cfg,
+        gesture_tag,
+        air_written_text,
+        profile["name"],
     )
+    return [
+        {"role": "system", "content": system_content},
+        {"role": "user", "content": user_content},
+    ]
+def _build_system(profile: dict) -> str:
     prefs = profile.get("stylistic_preferences") or {}
     style_bits = []
     if prefs.get("tone"):
     exemplars = prefs.get("example_phrases") or []
     style_exemplar = "\n  ".join(exemplars) if exemplars else "(no exemplar)"
     access = (profile.get("access_needs") or {}).get("input_method") or "an AAC device"
+    return f"""\
+You are {profile["name"]}. Reply in first person as them, in 1–3 sentences. \
+Never narrate, analyze, describe, or list traits about your character. \
+Never say "As an AI", "The user wants me to", "Key characteristics", or anything meta. \
+Just speak.
+--- Character sheet (reference only — do NOT quote or paraphrase this block) ---
+Condition: {profile["condition"]}
+Access: {access}
+Voice: {style_summary}
+Style examples (match this register when you speak):
+  {style_exemplar}
+Answering rules:
+- For personal questions: use ONLY the memories in the user message; if they don't cover it, say "I don't know."
+- For general-knowledge questions: answer from what you know, in your voice.
+- Keep it to 1–3 sentences, first person, no meta-commentary.
+--- end character sheet ---"""
+_PERSONA_MOD_INSTRUCTIONS = {
+    "amplify_quirks": "Amplify your characteristic style and personality.",
+    "suppress_humor": "Be direct and supportive. Suppress humor.",
+    "baseline": "Use your natural communication style.",
+    "add_confirmation": "Add a clarifying question or confirmation at the end.",
+}
+def _build_user(
+    chunks: list[dict],
+    history: list[dict],
+    query: str,
+    tone_tag: str,
+    gen_cfg: dict,
+    gesture_tag: str | None,
+    air_written_text: str | None,
+    persona_name: str,
+) -> str:
+    personal_chunks = [c for c in chunks if c.get("source", "personal") == "personal"]
+    contextual_chunks = [c for c in chunks if c.get("source") == "contextual"]
+    open_domain_chunks = [c for c in chunks if c.get("source") == "open_domain"]
+    memory_block = (
+        "\n".join(
+            f"  [{c['bucket']}/{c.get('type', 'narrative')}] {c['text']}"
+            for c in personal_chunks
+        )
+        or "  (no memories retrieved)"
+    )
+    contextual_block = (
+        "\n".join(f"  {c['text']}" for c in contextual_chunks)
+        or "  (nothing relevant from this session)"
+    )
+    open_domain_note = (
+        "  Treat this sub-query as general knowledge; answer from what you know.\n"
+        + "\n".join(f"  {c['text']}" for c in open_domain_chunks)
+        if open_domain_chunks
+        else "  (none)"
+    )
+    history_block = (
+        "\n".join(f"  {h.get('role', '?')}: {h.get('content', '')}" for h in history)
+        or "  (start of session)"
+    )
     gesture_line = ""
     if gesture_tag:
         g_tag = GESTURE_TO_TAG.get(gesture_tag, f"[GESTURE:{gesture_tag}]")
     if air_written_text:
         air_writing_line = f'\nThe user air-wrote: "{air_written_text}" — treat as supplementary intent.'
+    persona_instruction = _PERSONA_MOD_INSTRUCTIONS.get(
+        gen_cfg.get("persona_mod", "baseline"),
+        _PERSONA_MOD_INSTRUCTIONS["baseline"],
+    )
     return f"""\
 {tone_tag}{gesture_line}{air_writing_line}
+{persona_instruction}
+Personal memories:
 {memory_block}
+From earlier in this conversation:
+{contextual_block}
+General knowledge note:
+{open_domain_note}
 Recent conversation:
 {history_block}
+Partner just said: {query}
+Your reply as {persona_name} (1–3 sentences, first person):"""
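
The point of the system/user split is that the system message depends only on the persona, so it is byte-identical from turn to turn. A minimal sketch of that property follows, with heavily simplified, hypothetical builders (`build_system`, `build_user`, `build_messages` are stand-ins, not the functions in the diff). Whether the prefix is actually cached depends on the provider and is an assumption here.

```python
def build_system(profile: dict) -> str:
    # Persona-only content: nothing in here may change between turns.
    return (
        f"You are {profile['name']}. "
        "Reply in first person, in 1-3 sentences."
    )


def build_user(query: str, memories: list[str]) -> str:
    # Everything turn-specific (retrieval results, partner query) lives here.
    memory_block = "\n".join(f"  {m}" for m in memories) or "  (no memories retrieved)"
    return f"Personal memories:\n{memory_block}\nPartner just said: {query}"


def build_messages(profile: dict, query: str, memories: list[str]) -> list[dict]:
    return [
        {"role": "system", "content": build_system(profile)},
        {"role": "user", "content": build_user(query, memories)},
    ]


profile = {"name": "Mia"}
turn_1 = build_messages(profile, "How was your weekend?", ["I went hiking on Saturday."])
turn_2 = build_messages(profile, "Who wrote Hamlet?", [])

# The system prefix is stable across turns; only the user message varies.
same_prefix = turn_1[0]["content"] == turn_2[0]["content"]
```

If any per-turn value (affect tag, gesture, retrieved chunk) leaked into `build_system`, `same_prefix` would become false and a prefix cache would likely miss on every turn.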

backend/pipeline/nodes/retrieval.py CHANGED

@@ -1,59 +1,128 @@
-# Retrieval node — run_fast (FRUSTRATED) and run_full paths.
 from __future__ import annotations
 import time
 from backend.config.settings import settings
-from backend.pipeline.state import PipelineState, RetrievedChunk
 from backend.retrieval.vector_store import retrieve
 def run_fast(state: PipelineState) -> dict:
     """Fast retrieval path for FRUSTRATED affect (k=2, no reranker)."""
     t0 = time.perf_counter()
-    priors = state["bucket_priors"]
-    prior_vals = list(priors.values()) if priors else []
-    priors_uniform = prior_vals and (max(prior_vals) - min(prior_vals)) < 0.05
-    bucket_hint = (
-        state.get("gaze_bucket")
-        if priors_uniform and state.get("gaze_bucket")
-        else _top_prior_bucket(priors)
-    )
-    chunks = retrieve(
-        query=state["raw_query"],
-        user_id=state["user_id"],
-        top_k=settings.retrieval_fast_k,
-        rerank_k=settings.retrieval_fast_k,
-        bucket_filter=bucket_hint,
-    )
     return _build_return(state, chunks, "fast", t0)
 def run_full(state: PipelineState) -> dict:
     """Full retrieval path: top_k cosine matches narrowed to rerank_k."""
     t0 = time.perf_counter()
-    # Prefer gaze hint > intent bucket hint > None
     route = state.get("intent_route") or {}
-    sub_intents = route.get("sub_intents", [])
-    bucket_hint = state.get("gaze_bucket") or next(
-        (si.get("bucket_hint") for si in sub_intents if si.get("bucket_hint")), None
     )
-    chunks = retrieve(
-        query=state["raw_query"],
         user_id=state["user_id"],
-        top_k=settings.retrieval_top_k,
-        rerank_k=settings.retrieval_rerank_k,
         bucket_filter=bucket_hint,
     )
-    return _build_return(state, chunks, "full", t0)
-# ── Helpers ───────────────────────────────────────────────────────────────────
 def _top_prior_bucket(priors: dict[str, float]) -> str | None:

+# Retrieval node — dispatches each sub-intent to its pool, merges results.
 from __future__ import annotations
 import time
 from backend.config.settings import settings
+from backend.pipeline.state import PipelineState, RetrievedChunk, SubIntent
+from backend.retrieval.contextual import retrieve_from_history
 from backend.retrieval.vector_store import retrieve
+_OPEN_DOMAIN_STUB_TEXT = (
+    "(no external knowledge source wired — answer from general knowledge)"
+)
 def run_fast(state: PipelineState) -> dict:
     """Fast retrieval path for FRUSTRATED affect (k=2, no reranker)."""
     t0 = time.perf_counter()
+    chunks = _dispatch_all(state, per_intent_k=settings.retrieval_fast_k)
     return _build_return(state, chunks, "fast", t0)
 def run_full(state: PipelineState) -> dict:
     """Full retrieval path: top_k cosine matches narrowed to rerank_k."""
     t0 = time.perf_counter()
+    chunks = _dispatch_all(state, per_intent_k=settings.retrieval_rerank_k)
+    return _build_return(state, chunks, "full", t0)
+def _dispatch_all(state: PipelineState, per_intent_k: int) -> list[RetrievedChunk]:
     route = state.get("intent_route") or {}
+    sub_intents: list[SubIntent] = route.get("sub_intents") or []
+    if not sub_intents:
+        sub_intents = [
+            {
+                "type": "PERSONAL",
+                "query": state["raw_query"],
+                "bucket_hint": None,
+                "priority": "normal",
+            }
+        ]
+    merged: list[RetrievedChunk] = []
+    for sub in sub_intents:
+        kind = (sub.get("type") or "PERSONAL").upper()
+        if kind == "PERSONAL":
+            merged.extend(_retrieve_personal(sub, state, per_intent_k))
+        elif kind == "CONTEXTUAL":
+            merged.extend(_retrieve_contextual(sub, state, per_intent_k))
+        elif kind == "OPEN_DOMAIN":
+            merged.extend(_retrieve_open_domain(sub))
+        else:
+            merged.extend(_retrieve_personal(sub, state, per_intent_k))
+    return _dedupe(merged)
+def _retrieve_personal(
+    sub: SubIntent, state: PipelineState, k: int
+) -> list[RetrievedChunk]:
+    priors = state["bucket_priors"]
+    prior_vals = list(priors.values()) if priors else []
+    priors_uniform = prior_vals and (max(prior_vals) - min(prior_vals)) < 0.05
+    bucket_hint = (
+        state.get("gaze_bucket")
+        or sub.get("bucket_hint")
+        or (_top_prior_bucket(priors) if not priors_uniform else None)
     )
+    top_k = max(k, settings.retrieval_top_k) if k >= settings.retrieval_rerank_k else k
+    return retrieve(
+        query=sub["query"],
         user_id=state["user_id"],
+        top_k=top_k,
+        rerank_k=k,
         bucket_filter=bucket_hint,
     )
+_CONTEXTUAL_MIN_SCORE = (
+    0.5  # empirical: below this, history matches are usually spurious
+)
+def _retrieve_contextual(
+    sub: SubIntent, state: PipelineState, k: int
+) -> list[RetrievedChunk]:
+    # CONTEXTUAL means "this turn leans on the recent conversation" — but the
+    # persona's memories are still the primary grounding. Always pull personal
+    # chunks; add contextual ones on top when the session history is relevant.
+    personal_chunks = _retrieve_personal(sub, state, k)
+    history = state.get("session_history") or []
+    history_chunks = retrieve_from_history(query=sub["query"], history=history, top_k=k)
+    relevant_history = [
+        c for c in history_chunks if c["score"] >= _CONTEXTUAL_MIN_SCORE
+    ]
+    return personal_chunks + relevant_history
+def _retrieve_open_domain(sub: SubIntent) -> list[RetrievedChunk]:
+    # Intentionally a stub — web search is out of scope. See README "Intent decomposition".
+    return [
+        RetrievedChunk(
+            text=f'{_OPEN_DOMAIN_STUB_TEXT} (sub-query: "{sub["query"]}")',
+            bucket="open_domain",
+            type="narrative",
+            user="",
+            score=0.0,
+            source="open_domain",
+        )
+    ]
+def _dedupe(chunks: list[RetrievedChunk]) -> list[RetrievedChunk]:
+    seen: set[tuple[str, str]] = set()
+    out: list[RetrievedChunk] = []
+    for c in chunks:
+        key = (c["source"], c["text"])
+        if key in seen:
+            continue
+        seen.add(key)
+        out.append(c)
+    return out
 def _top_prior_bucket(priors: dict[str, float]) -> str | None:
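
Because the CONTEXTUAL path always pulls personal chunks as well, a query that yields both a PERSONAL and a CONTEXTUAL fragment can retrieve the same memory twice, which is why the merge ends in a dedupe keyed on `(source, text)`. The sketch below reproduces that dispatch-then-dedupe shape with stub retrievers. The handler table and the stub data are assumptions for illustration, and plain dicts stand in for `RetrievedChunk`.

```python
from typing import Callable

Chunk = dict  # stand-in for RetrievedChunk
Handler = Callable[[dict], list[Chunk]]


def dedupe(chunks: list[Chunk]) -> list[Chunk]:
    """Drop repeats of the same (source, text) pair, keeping first-seen order."""
    seen: set[tuple[str, str]] = set()
    out: list[Chunk] = []
    for c in chunks:
        key = (c["source"], c["text"])
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out


def dispatch_all(sub_intents: list[dict], handlers: dict[str, Handler]) -> list[Chunk]:
    """Route each sub-intent to its pool; unknown types fall back to PERSONAL."""
    merged: list[Chunk] = []
    for sub in sub_intents:
        kind = (sub.get("type") or "PERSONAL").upper()
        handler = handlers.get(kind, handlers["PERSONAL"])
        merged.extend(handler(sub))
    return dedupe(merged)


# Stub retrievers (hypothetical data, no embedder involved).
def personal(sub: dict) -> list[Chunk]:
    return [{"source": "personal", "text": "I love gardening."}]


def contextual(sub: dict) -> list[Chunk]:
    history = [{"source": "contextual", "text": "partner: do you garden?"}]
    return personal(sub) + history  # persona memories first, history on top


handlers = {"PERSONAL": personal, "CONTEXTUAL": contextual}
chunks = dispatch_all(
    [
        {"type": "PERSONAL", "query": "what are your hobbies"},
        {"type": "CONTEXTUAL", "query": "what did I just ask"},
        {"type": "SMALLTALK", "query": "nice weather today"},
    ],
    handlers,
)
```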

backend/pipeline/state.py CHANGED

@@ -23,10 +23,11 @@ class AffectState(TypedDict):
 class RetrievedChunk(TypedDict):
     text: str
-    bucket: str  # family | medical | hobbies | daily_routine | social
-    type: str  # narrative | social_post | chat_log
     user: str
     score: float  # cosine similarity from the embedder
 class SubIntent(TypedDict):

 class RetrievedChunk(TypedDict):
     text: str
+    bucket: str  # family | medical | hobbies | daily_routine | social | contextual | open_domain
+    type: str  # narrative | social_post | chat_log  (personal chunks only)
     user: str
     score: float  # cosine similarity from the embedder
+    source: str  # "personal" | "contextual" | "open_domain"
 class SubIntent(TypedDict):

backend/retrieval/contextual.py ADDED

@@ -0,0 +1,57 @@

+import torch
+from backend.pipeline.state import RetrievedChunk
+from backend.retrieval.vector_store import get_device, get_embedder
+def retrieve_from_history(
+    query: str,
+    history: list[dict],
+    top_k: int = 3,
+    recent_window: int = 20,
+) -> list[RetrievedChunk]:
+    if not history or top_k <= 0:
+        return []
+    window = history[-recent_window:]
+    texts = [_format_turn(h) for h in window]
+    if not any(texts):
+        return []
+    embedder = get_embedder()
+    device = get_device()
+    q_vec = embedder.encode(
+        [query],
+        convert_to_tensor=True,
+        normalize_embeddings=True,
+        device=device,
+    )[0]
+    h_vecs = embedder.encode(
+        texts,
+        convert_to_tensor=True,
+        normalize_embeddings=True,
+        device=device,
+    )
+    scores = h_vecs @ q_vec
+    k = min(top_k, scores.shape[0])
+    top_scores, top_idxs = torch.topk(scores, k)
+    return [
+        RetrievedChunk(
+            text=texts[int(idx)],
+            bucket="contextual",
+            type="chat_log",
+            user="",
+            score=float(score),
+            source="contextual",
+        )
+        for score, idx in zip(top_scores.tolist(), top_idxs.tolist())
+    ]
+def _format_turn(turn: dict) -> str:
+    role = turn.get("role", "?")
+    content = (turn.get("content") or "").strip()
+    return f"{role}: {content}" if content else ""
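
The history lookup is a plain cosine top-k over the most recent turns, followed by the score cutoff applied in `retrieval.py`. A dependency-free sketch of that arithmetic is shown below. It uses Python lists in place of torch tensors and hand-written vectors in place of BGE embeddings, so the function name, the vectors, and the texts are all illustrative assumptions.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_history(
    query_vec: list[float],
    history_vecs: list[list[float]],
    texts: list[str],
    top_k: int = 3,
    min_score: float = 0.5,
) -> list[tuple[str, float]]:
    """Rank history turns by cosine similarity, keep top_k, then apply the cutoff."""
    q = normalize(query_vec)
    scored = []
    for text, vec in zip(texts, history_vecs):
        h = normalize(vec)
        # With unit vectors, the dot product equals the cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, h)), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(text, score) for score, text in scored[:top_k] if score >= min_score]


texts = ["partner: do you garden?", "assistant: yes, tomatoes", "partner: and roses?"]
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_history([2.0, 0.0], vecs, texts, top_k=3, min_score=0.5)
```

Filtering after the top-k cut mirrors the order used in the diff: a turn below the cutoff is dropped rather than replaced by a lower-ranked one, so the result may contain fewer than `top_k` items.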

backend/retrieval/vector_store.py CHANGED

@@ -31,6 +31,10 @@ def _get_embedder():
     return SentenceTransformer(settings.embed_model, device=_DEVICE)
 # Index cache: one (vectors_tensor, meta) per user_id.
 _index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}
@@ -88,6 +92,7 @@ def retrieve(
             type=c.get("type", "narrative"),
             user=c["user"],
             score=float(s),
         )
         for s, c in candidates[:rerank_k]
     ]

     return SentenceTransformer(settings.embed_model, device=_DEVICE)
+def get_embedder():
+    return _get_embedder()
 # Index cache: one (vectors_tensor, meta) per user_id.
 _index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}
             type=c.get("type", "narrative"),
             user=c["user"],
             score=float(s),
+            source="personal",
         )
         for s, c in candidates[:rerank_k]
     ]

backend/sensing/bucket_keywords.py ADDED

@@ -0,0 +1,15 @@

+_BUCKET_KEYWORDS: list[tuple[str, tuple[str, ...]]] = [
+    ("medical", ("medication", "medicine", "doctor", "health", "allergic", "therapy")),
+    ("family", ("family", "mom", "dad", "brother", "sister", "parents")),
+    ("hobbies", ("hobby", "like to do", "enjoy", "weekend", "fun")),
+    ("daily_routine", ("routine", "morning", "wake", "sleep", "daily")),
+    ("social", ("friend", "social", "people", "party", "community")),
+]
+def infer_bucket(query: str) -> str | None:
+    q = query.lower()
+    for bucket, words in _BUCKET_KEYWORDS:
+        if any(w in q for w in words):
+            return bucket
+    return None

run.sh CHANGED

@@ -13,6 +13,12 @@ fi
 eval "$(conda shell.bash hook)"
 conda activate "$CONDA_ENV"
 PIDS=()
 cleanup() {

 eval "$(conda shell.bash hook)"
 conda activate "$CONDA_ENV"
+# If any args were passed (e.g. --debug, --user mia_chen), run the CLI
+# instead of the full stack and forward them verbatim.
+if [ "$#" -gt 0 ]; then
+  exec python -m backend.main "$@"
+fi
 PIDS=()
 cleanup() {