Spaces: ub-aac-chatbot/aac-chatbot
shwetangisingh committed 15 days ago

Commit 69345ca
1 Parent(s): 1799d5b

relevance + diversity evals, batched scoring, math tooltips

Adds two new eval metrics (relevance via BGE cosine, candidate diversity
via mean pairwise cosine distance) plus per-candidate scoring so we can
ask "did the picker beat candidate 0?" offline. compute_evals batches
the scoring path: one NLI model.predict over all (candidate × sentence
× chunk) pairs and one BGE embed_texts over [query, c1, c2, c3] feeding
both relevance and diversity. EvalPanel surfaces the new pills with
custom CSS hover tooltips that show the actual math from each turn
(e.g. "2/2 sentences had NLI entailment ≥ 0.50") rather than generic
definitions. aggregate.py grows a report_picker section that joins
turns + picks + evals and prints pick rate, regenerate rate, strategy
win rate, and the diversity floor.
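
The two metrics named in this commit message can be sketched in a few lines. This is a minimal illustration, not the repository's code: plain Python lists stand in for BGE embeddings, and the helper names (`cosine`, `relevance`, `candidate_diversity`) are hypothetical. Clamping negative similarities to 0 mirrors what `_score_candidates_batched` does in this diff.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def relevance(query_vec: list[float], response_vec: list[float]) -> float:
    """Query/response similarity, clamped so it never goes below 0."""
    return max(0.0, cosine(query_vec, response_vec))


def candidate_diversity(cand_vecs: list[list[float]]) -> float:
    """Mean pairwise cosine distance (1 - similarity) over all candidate pairs."""
    if len(cand_vecs) < 2:
        return 0.0  # a single candidate has no pairs to compare
    dists = [1.0 - cosine(a, b) for a, b in combinations(cand_vecs, 2)]
    return sum(dists) / len(dists)
```

With three candidates there are three pairs, so a slate of near-identical paraphrases drives the mean toward 0, which is the "aloha problem" the commit wants to flag.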

Files changed (12)

README.md +37 -17
backend/api/main.py +22 -2
backend/evals/__init__.py +102 -7
backend/evals/aggregate.py +121 -1
backend/evals/diversity.py +23 -0
backend/evals/faithfulness.py +60 -30
backend/evals/multimodal_alignment.py +32 -4
backend/evals/relevance.py +16 -0
backend/retrieval/vector_store.py +9 -0
frontend/src/App.css +47 -0
frontend/src/components/EvalPanel.tsx +218 -48
frontend/src/types.ts +37 -0

README.md CHANGED

@@ -71,7 +71,15 @@ The per-turn JSONL log includes both `bucket_priors_after` and `type_priors_afte
 ### Per-turn eval pills in the UI
-Every AAC-user bubble renders its eval scores inline: the SLO latency badge, a groundedness pill (faithfulness against retrieved memory), and three multimodal-alignment pills (affect / gesture / gaze). Pills go green / grey / red on 0.75 / 0.4 thresholds, and groundedness renders `—` when the turn had no retrieved evidence to check against. Authenticity stars sit on the right. All values come from `eval_scores` on the `/chat` response, computed synchronously and persisted to `logs/evals.jsonl`.
 ![eval pills rendered inline under an AAC response](docs/images/eval-pills.png)
@@ -401,7 +409,6 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
 - [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) — a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
 - [x] **[Core]** Air-writing uses a vision LLM (`gemma4:31b-cloud` via Ollama Cloud, configurable through `INK_VISION_MODEL`) instead of the older in-browser DTW template bank. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) — index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200×200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X — incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
 - [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload — `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` — which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) — only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
-- [ ] Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal.
 ### Intent decomposition
@@ -422,7 +429,6 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
   - return a canned response if we blow the budget entirely
   - threshold is 3.5s, spec says 6s — pick one
 - [ ] **[Bonus]** Cache encoded user-turn embeddings across the session — `build_context_vector` re-encodes the same recent turns every turn (~50ms steady-state cost)
-- [ ] **[Scale]** past ~100k chunks per user, swap torch matmul for `hnswlib`; consider a cross-encoder reranker (e.g. `bge-reranker-base`) if `rerank_pool_k` grows past ~30
 ### Generation
@@ -433,34 +439,48 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
 ### Evals
-Scoring runs synchronously on the `/chat` response path and the `eval_scores` dict is returned inline so the frontend can render pills on the AAC bubble immediately. Each scored turn is also appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings from the UI go to `logs/ratings.jsonl`.
-| Metric | Status | Where |
-|--------|--------|-------|
-| Efficiency | per-turn SLO on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
-| Faithfulness | sentence-level NLI, `no_evidence` short-circuit | [faithfulness.py](backend/evals/faithfulness.py) |
-| Multimodal alignment | affect (sentiment lexicon), gesture (opener regex), gaze (bucket overlap) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
-| Authenticity | star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
-**First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first score after a server restart, so turn 1 pays a one-time ~2-3s warmup on top of the LLM call. Every turn after that adds ~100-300ms for sentence-level scoring.
 - [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
 - [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
 - [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
 - [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
 - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
-- [ ] **[Eval]** Relevance score — one NLI call per turn asking "does the response address the partner's query?" Fills the biggest current gap: a perfectly grounded but off-topic reply scores 100% grounded today and we'd never catch it
-- [ ] **[Eval]** Candidate diversity — mean pairwise cosine distance among the 3 candidates in a picker round. Low diversity = picker showing three paraphrases of the same answer (the "aloha" problem), which is a signal that retrieval or temperature needs tuning for that query
-- [ ] **[Eval]** Picker-aware metrics from `turns.jsonl` + `picks.jsonl`: which strategy wins most (`broad` vs `focused` vs `serendipitous` vs `side_index`), pick rate (% of turns where user clicked a card), regenerate rate (% of turns where user clicked "try again"). All computable offline, no runtime cost
-- [ ] **[Eval]** Score alternate candidates too, not just the selected one. Right now `compute_evals` only scores `selected_response`; scoring all 3 would let us measure whether the picker actually improves quality over taking candidate 0 blindly
-- [ ] **[Eval]** UI coverage gap: `compute_evals` returns 12 fields but `EvalPanel` renders only 5 pills (latency, grounded, affect, gesture, gaze). Hallucination rate, overall multimodal_alignment, SLO target/margin are computed and logged but never surfaced in the bubble. Decide what belongs as a pill vs a tooltip-only number vs an offline-only log field
-- [ ] **[Eval]** On pill hover, tooltip should explain *how* the number was computed, not just what it means. Today the `title` attributes say "Groundedness: fraction of response sentences supported by retrieved memories" — which is the definition but not the math. Want: "5/8 sentences had NLI entailment prob ≥ 0.5 against the retrieved chunks" for groundedness; "3/4 positive-lexicon words matched HAPPY target" for affect; raw scorer inputs + thresholds exposed inline so the number isn't a black box
 ### Cleanup
 - [x] delete `backend/sensing/` (dead code, sensing is in frontend) — done, only `labels.py` remains
 - [x] per-persona affect overrides (`_PERSONA_TONE_OVERRIDES`) deleted — redundant with `stylistic_preferences` in the new persona JSONs
 ---
 ## Team

 ### Per-turn eval pills in the UI
+Every AAC-user bubble renders its eval scores inline. Up to seven pills, depending on what the turn produced:
+- **SLO latency badge** — `t_total` against the configured target (default 6s). Green ✓ on pass, red ✗ on miss.
+- **`grounded`** — sentence-level NLI faithfulness against retrieved memories. Renders `—` when there's no retrieved evidence to check (e.g. a `PRESENT_STATE` turn that skipped retrieval).
+- **`relevant`** — BGE cosine similarity between query and response embeddings. Catches the "perfectly grounded but off-topic" failure mode that groundedness can't see.
+- **`affect` / `gesture` / `gaze`** — multimodal alignment: sentiment match against the detected affect, opener-pattern match against the detected gesture, fraction of retrieved chunks from the gazed-at bucket.
+- **`diversity`** — mean pairwise cosine distance across the candidate slate (only shown when ≥2 candidates). Low values flag the "aloha problem" — three paraphrases of the same answer.
+Pills go green / grey / red on 0.75 / 0.4 thresholds. **Hover any pill** for a tooltip with the actual math from this turn (e.g. "2/2 sentences had NLI entailment prob ≥ 0.50", "3/3 retrieved chunks were from the family bucket"), powered by the `explain` block each scorer returns alongside its score. Authenticity stars sit on the right; clicking one fires `POST /feedback/rating` → `logs/ratings.jsonl`. All pill values come from `eval_scores`, which is computed in a `BackgroundTask` after the `/chat` response returns, fetched by the frontend via `GET /evals/{run_id}`, and persisted to `logs/evals.jsonl`.
 ![eval pills rendered inline under an AAC response](docs/images/eval-pills.png)
 - [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) — a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
 - [x] **[Core]** Air-writing uses a vision LLM (`gemma4:31b-cloud` via Ollama Cloud, configurable through `INK_VISION_MODEL`) instead of the older in-browser DTW template bank. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) — index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200×200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X — incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
 - [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload — `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` — which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) — only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
 ### Intent decomposition
   - return a canned response if we blow the budget entirely
   - threshold is 3.5s, spec says 6s — pick one
 - [ ] **[Bonus]** Cache encoded user-turn embeddings across the session — `build_context_vector` re-encodes the same recent turns every turn (~50ms steady-state cost)
 ### Generation
 ### Evals
+Scoring runs in a FastAPI `BackgroundTask` after `/chat` returns, so it never blocks the response. The frontend polls `GET /evals/{run_id}` to render pills once they're ready. Each scored turn is appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings go to `logs/ratings.jsonl`. Picks go to `logs/picks.jsonl`.
+| Metric | What it answers | Where |
+|--------|-----------------|-------|
+| Efficiency | SLO pass/fail on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
+| Faithfulness (`grounded`) | Did the response stick to retrieved memories, or hallucinate? Sentence-level NLI; `no_evidence` short-circuit when nothing was retrieved | [faithfulness.py](backend/evals/faithfulness.py) |
+| Relevance (`relevant`) | Did the response actually address the partner's query? BGE cosine query↔response | [relevance.py](backend/evals/relevance.py) |
+| Multimodal alignment | `affect` (sentiment lexicon vs target), `gesture` (opener regex vs detected tag), `gaze` (matched/total retrieved chunks vs gazed bucket) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
+| Candidate diversity | Are the picker's candidates actually different, or paraphrases? Mean pairwise cosine distance over the candidate slate | [diversity.py](backend/evals/diversity.py) |
+| Per-candidate breakdown | Each candidate scored for `grounded` + `relevance` (not just the selected one) — answers "did the picker beat candidate 0?" offline | `candidates_eval` block in [evals/__init__.py](backend/evals/__init__.py) |
+| Authenticity | Star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
+**Performance note.** When the turn produces multiple candidates, scoring is fully batched: a single NLI `model.predict` over all `(candidate × sentence × chunk)` pairs and a single BGE `embed_texts` over `[query, c1, c2, c3]` (the candidate vectors feed both relevance and diversity). The selected candidate's per-candidate score is reused as the top-level pill values rather than re-scored. End result: 1 NLI pass + 1 BGE pass per turn regardless of candidate count.
+**First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first score after a server restart, so turn 1 pays a one-time ~2-3s warmup. Every turn after that adds ~100-300ms for sentence-level scoring.
+**Offline analysis.** `python -m backend.evals.aggregate` joins all four log files and prints per-persona reports: latency p50/p95/p99 by tier, mean groundedness/hallucination, multimodal alignment coverage, picker behaviour (pick rate, regenerate rate, strategy win rate, "did picker beat cand 0?", diversity floor), and authenticity Likert distribution.
 - [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
 - [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
 - [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
 - [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
 - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
+- [x] **[Eval]** Relevance score — BGE cosine similarity between query and response. Originally specced as an NLI call, but a question rarely *entails* its answer (the on-topic and off-topic NLI scores both pinned near 10⁻⁴), so the embedder we already load for retrieval is the right tool. Fills the gap where a perfectly grounded but off-topic reply scored 100% grounded. See [backend/evals/relevance.py](backend/evals/relevance.py).
+- [x] **[Eval]** Candidate diversity — mean pairwise cosine distance among the 3 candidates in a picker round, computed on BGE embeddings (no extra model). Low diversity = picker showing three paraphrases of the same answer (the "aloha" problem), which is a signal that retrieval or temperature needs tuning for that query. See [backend/evals/diversity.py](backend/evals/diversity.py).
+- [x] **[Eval]** Picker-aware metrics — `report_picker` in [backend/evals/aggregate.py](backend/evals/aggregate.py) joins `turns.jsonl` + `picks.jsonl` + `evals.jsonl` and prints: pick rate (% of multi-candidate turns where the user clicked a card), regenerate rate (% of (user, turn_id) pairs that ran the planner more than once), strategy win rate among committed picks, head-to-head "did picker beat candidate 0 on grounded/relevance" using the per-candidate scoring from L453, and diversity coverage (% of turns with mean pairwise cosine distance < 0.10 — the "aloha" floor). Run via `python -m backend.evals.aggregate`.
+- [x] **[Eval]** Score alternate candidates too, not just the selected one. `compute_evals` now scores groundedness + relevance for every candidate and stamps which one was selected; full breakdown lands in `eval_scores.candidates_eval` and `logs/evals.jsonl`, top-level pills still describe the selected response. Unlocks "did the picker actually beat candidate 0?" offline analysis.
+- [x] **[Eval]** UI coverage — `EvalPanel` now also renders a relevance pill (BGE cosine query↔response) and a candidate-diversity pill (mean pairwise cosine distance, hidden when fewer than 2 candidates). Hallucination rate is conveyed inside the grounded tooltip rather than as its own pill (it's `1 − groundedness`, no extra info). SLO margin is in the latency tooltip. See [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx).
+- [x] **[Eval]** Tooltip math — every pill's `title` now shows the actual computation, not just the definition. Each scorer returns its raw inputs in an `explain` block (sentence count + entailment threshold for groundedness, pos/neg word counts + sentiment for affect, matched/total chunks for gaze, gesture pattern match), and `EvalPanel` formats them into specific tooltips like "2/2 sentences had NLI entailment prob ≥ 0.50". See `groundednessTip` / `affectTip` / `gestureTip` / `gazeTip` in [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx) and `explain` in [multimodal_alignment.py](backend/evals/multimodal_alignment.py).
 ### Cleanup
 - [x] delete `backend/sensing/` (dead code, sensing is in frontend) — done, only `labels.py` remains
 - [x] per-persona affect overrides (`_PERSONA_TONE_OVERRIDES`) deleted — redundant with `stylistic_preferences` in the new persona JSONs
+### Out of scope
+Not in the spec — engineering nice-to-haves we'd pick up if the rest is done. Don't block grading on these.
+- [ ] Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal. _(Sensing — untagged in spec)_
+- [ ] **[Scale]** past ~100k chunks per user, swap torch matmul for `hnswlib`; consider a cross-encoder reranker (e.g. `bge-reranker-base`) if `rerank_pool_k` grows past ~30 _(Retrieval — far beyond current corpus size)_
 ---
 ## Team
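
The README text in this diff describes groundedness as the fraction of response sentences whose NLI entailment probability clears a threshold, with a tooltip that shows the raw counts and a pill coloured on 0.75 / 0.4 cut-offs. A rough sketch of that arithmetic follows. Several details are assumptions made for illustration only: the function names, taking the maximum entailment probability over chunks per sentence, treating an empty response like the `no_evidence` case, and the boundary handling (`>= 0.75` green, `< 0.4` red). `toy_entail_prob` is a word-overlap stand-in, not an NLI model.

```python
from typing import Callable, Optional


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: 1.0 if every hypothesis word appears in the premise."""
    premise_words = set(premise.lower().split())
    return 1.0 if all(w in premise_words for w in hypothesis.lower().split()) else 0.0


def groundedness(
    sentences: list[str],
    chunks: list[str],
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> dict:
    """Fraction of sentences supported by at least one retrieved chunk."""
    if not chunks or not sentences:
        # Nothing to check against: short-circuit instead of scoring.
        return {"groundedness": None, "no_evidence": True, "tip": None}
    supported = sum(
        1 for s in sentences
        if max(entail_prob(c, s) for c in chunks) >= threshold
    )
    total = len(sentences)
    score = supported / total
    return {
        "groundedness": score,
        "hallucination_rate": 1.0 - score,
        "no_evidence": False,
        "tip": f"{supported}/{total} sentences had NLI entailment prob ≥ {threshold:.2f}",
    }


def pill_color(score: Optional[float]) -> str:
    """Map a score to a pill colour; None means the pill renders a dash."""
    if score is None:
        return "none"
    if score >= 0.75:
        return "green"
    if score < 0.4:
        return "red"
    return "grey"
```

For example, scoring `["I love hiking", "My dog is Max"]` against the single chunk `"I love hiking with my dog Max"` with the toy scorer supports one of two sentences, so the pill would be grey and the tooltip would read "1/2 sentences had NLI entailment prob ≥ 0.50".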

backend/api/main.py CHANGED

@@ -168,6 +168,10 @@ class RatingRequest(BaseModel):
 # ── Helpers ────────────────────────────────────────────────────────────────────
 def _load_persona_profile(user_id: str) -> dict:
     memories_path = settings.memories_dir / f"{user_id}.json"
     try:
@@ -351,6 +355,8 @@ def _compute_and_persist_evals(
     affect: str,
     gesture_tag: str | None,
     gaze_bucket: str | None,
 ) -> dict | None:
     if not settings.evals_enabled or not run_id:
         return None
@@ -363,6 +369,8 @@ def _compute_and_persist_evals(
             gesture_tag=gesture_tag,
             gaze_bucket=gaze_bucket,
             slo_target=settings.slo_target_s,
         )
     except Exception:
         _log.exception("evals scoring failed for run %s", run_id)
@@ -434,6 +442,8 @@ def chat(req: ChatRequest, background_tasks: BackgroundTasks):
             affect=affect_emotion,
             gesture_tag=req.gesture_tag,
             gaze_bucket=req.gaze_bucket,
         )
     return ChatResponse(
@@ -512,6 +522,7 @@ def chat_stream(req: ChatRequest):
         run_id = state.get("run_id")
         # Evals run off the response path; UI polls GET /evals/{run_id}.
         if run_id and settings.evals_enabled:
             _reserve_eval_slot(run_id)
             threading.Thread(
@@ -526,6 +537,8 @@ def chat_stream(req: ChatRequest):
                     affect=affect_emotion,
                     gesture_tag=req.gesture_tag,
                     gaze_bucket=req.gaze_bucket,
                 ),
                 daemon=True,
             ).start()
@@ -534,7 +547,7 @@ def chat_stream(req: ChatRequest):
             "user_id": req.user_id,
             "query": req.query,
             "response": state["selected_response"] or "",
-            "candidates": [dict(c) for c in state.get("candidates") or []],
             "affect": affect_emotion,
             "llm_tier": state.get("llm_tier_used", "unknown"),
             "llm_model": state.get("llm_model_used", "unknown"),
@@ -660,6 +673,8 @@ def chat_turnaround(req: TurnaroundRequest, background_tasks: BackgroundTasks):
             affect=affect_emotion,
             gesture_tag=replan_state.get("gesture_tag"),
             gaze_bucket=replan_state.get("gaze_bucket"),
         )
     return ChatResponse(
@@ -827,6 +842,7 @@ def chat_regenerate_stream(req: RegenerateRequest):
         affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
         run_id = replan_state.get("run_id")
         eval_scores = _compute_and_persist_evals(
             run_id=run_id,
             user_id=req.user_id,
@@ -837,12 +853,14 @@ def chat_regenerate_stream(req: RegenerateRequest):
             affect=affect_emotion,
             gesture_tag=replan_state.get("gesture_tag"),
             gaze_bucket=replan_state.get("gaze_bucket"),
         )
         final = {
             "user_id": req.user_id,
             "query": replan_state["raw_query"],
             "response": replan_state["selected_response"] or "",
-            "candidates": [dict(c) for c in replan_state.get("candidates") or []],
             "affect": affect_emotion,
             "llm_tier": replan_state.get("llm_tier_used", "unknown"),
             "llm_model": replan_state.get("llm_model_used", "unknown"),
@@ -942,6 +960,8 @@ def chat_regenerate(req: RegenerateRequest):
         affect=affect_emotion,
         gesture_tag=replan_state.get("gesture_tag"),
         gaze_bucket=replan_state.get("gaze_bucket"),
     )
     return ChatResponse(

 # ── Helpers ────────────────────────────────────────────────────────────────────
+def _candidate_dicts(state) -> list[dict]:
+    return [dict(c) for c in (state.get("candidates") or [])]
 def _load_persona_profile(user_id: str) -> dict:
     memories_path = settings.memories_dir / f"{user_id}.json"
     try:
     affect: str,
     gesture_tag: str | None,
     gaze_bucket: str | None,
+    query: str = "",
+    candidates: list[dict] | None = None,
 ) -> dict | None:
     if not settings.evals_enabled or not run_id:
         return None
             gesture_tag=gesture_tag,
             gaze_bucket=gaze_bucket,
             slo_target=settings.slo_target_s,
+            query=query,
+            candidates=candidates,
         )
     except Exception:
         _log.exception("evals scoring failed for run %s", run_id)
             affect=affect_emotion,
             gesture_tag=req.gesture_tag,
             gaze_bucket=req.gaze_bucket,
+            query=req.query,
+            candidates=_candidate_dicts(result),
         )
     return ChatResponse(
         run_id = state.get("run_id")
         # Evals run off the response path; UI polls GET /evals/{run_id}.
+        cand_dicts = _candidate_dicts(state)
         if run_id and settings.evals_enabled:
             _reserve_eval_slot(run_id)
             threading.Thread(
                     affect=affect_emotion,
                     gesture_tag=req.gesture_tag,
                     gaze_bucket=req.gaze_bucket,
+                    query=req.query,
+                    candidates=cand_dicts,
                 ),
                 daemon=True,
             ).start()
             "user_id": req.user_id,
             "query": req.query,
             "response": state["selected_response"] or "",
+            "candidates": cand_dicts,
             "affect": affect_emotion,
             "llm_tier": state.get("llm_tier_used", "unknown"),
             "llm_model": state.get("llm_model_used", "unknown"),
             affect=affect_emotion,
             gesture_tag=replan_state.get("gesture_tag"),
             gaze_bucket=replan_state.get("gaze_bucket"),
+            query=replan_state.get("raw_query") or "",
+            candidates=_candidate_dicts(replan_state),
         )
     return ChatResponse(
         affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
         run_id = replan_state.get("run_id")
+        cand_dicts = _candidate_dicts(replan_state)
         eval_scores = _compute_and_persist_evals(
             run_id=run_id,
             user_id=req.user_id,
             affect=affect_emotion,
             gesture_tag=replan_state.get("gesture_tag"),
             gaze_bucket=replan_state.get("gaze_bucket"),
+            query=replan_state.get("raw_query") or "",
+            candidates=cand_dicts,
         )
         final = {
             "user_id": req.user_id,
             "query": replan_state["raw_query"],
             "response": replan_state["selected_response"] or "",
+            "candidates": cand_dicts,
             "affect": affect_emotion,
             "llm_tier": replan_state.get("llm_tier_used", "unknown"),
             "llm_model": replan_state.get("llm_model_used", "unknown"),
         affect=affect_emotion,
         gesture_tag=replan_state.get("gesture_tag"),
         gaze_bucket=replan_state.get("gaze_bucket"),
+        query=replan_state.get("raw_query") or "",
+        candidates=_candidate_dicts(replan_state),
     )
     return ChatResponse(
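
The `candidates` list threaded through the endpoints above is what makes batched scoring possible: the commit describes a single NLI `model.predict` over all `(candidate × sentence × chunk)` pairs, sliced back per candidate afterwards. The flatten-then-slice pattern might look roughly like the sketch below. The function name `batch_entailment`, the `(premise, hypothesis)` pair order, and the `predict` callable are hypothetical stand-ins; the actual implementation lives in `compute_faithfulness_batch`, which this excerpt does not show.

```python
from typing import Callable

Pair = tuple[str, str]  # (premise chunk, hypothesis sentence)


def batch_entailment(
    candidates_sentences: list[list[str]],
    chunks: list[str],
    predict: Callable[[list[Pair]], list[float]],
) -> list[list[list[float]]]:
    """Score every (candidate, sentence, chunk) triple with one predict call.

    Returns probs[candidate][sentence][chunk].
    """
    # Flatten in a fixed order: candidate, then sentence, then chunk.
    pairs: list[Pair] = [
        (chunk, sentence)
        for sentences in candidates_sentences
        for sentence in sentences
        for chunk in chunks
    ]
    flat = list(predict(pairs)) if pairs else []

    # Slice the flat result back into the nested shape, in the same order.
    out: list[list[list[float]]] = []
    pos = 0
    for sentences in candidates_sentences:
        per_candidate = []
        for _ in sentences:
            per_candidate.append(flat[pos:pos + len(chunks)])
            pos += len(chunks)
        out.append(per_candidate)
    return out
```

Because the model is invoked once regardless of how many candidates the picker produced, per-call overhead stays constant as the slate grows, which matches the "1 NLI pass per turn" claim in the README hunk.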

backend/evals/__init__.py CHANGED

@@ -1,7 +1,51 @@
 # Evaluation metrics — compute after pipeline returns, before API response.
 from backend.evals.efficiency import compute_efficiency
-from backend.evals.faithfulness import compute_faithfulness
 from backend.evals.multimodal_alignment import compute_multimodal_alignment
 def compute_evals(
@@ -12,18 +56,54 @@ def compute_evals(
     gesture_tag: str | None,
     gaze_bucket: str | None,
     slo_target: float = 6.0,
 ) -> dict:
-    """Run all eval scorers and return a unified EvalScores dict."""
-    faith = compute_faithfulness(response, chunks)
     eff = compute_efficiency(latency_log, slo_target)
     align = compute_multimodal_alignment(
         response, affect, gesture_tag, gaze_bucket, chunks
     )
-    return {
-        "groundedness": faith["groundedness"],
-        "hallucination_rate": faith["hallucination_rate"],
-        "no_evidence": faith["no_evidence"],
         "t_total_s": eff["t_total"],
         "slo_target_s": eff["slo_target"],
         "slo_passed": eff["slo_passed"],
@@ -32,4 +112,19 @@ def compute_evals(
         "affect_alignment": align["affect_alignment"],
         "gesture_alignment": align["gesture_alignment"],
         "gaze_alignment": align["gaze_alignment"],
     }

 # Evaluation metrics — compute after pipeline returns, before API response.
+import torch
+from backend.evals.diversity import compute_candidate_diversity
 from backend.evals.efficiency import compute_efficiency
+from backend.evals.faithfulness import compute_faithfulness, compute_faithfulness_batch
 from backend.evals.multimodal_alignment import compute_multimodal_alignment
+from backend.evals.relevance import compute_relevance
+def _score_candidates_batched(
+    candidates: list[dict],
+    chunks: list[dict],
+    query: str,
+) -> tuple[list[dict], "torch.Tensor | None"]:
+    """One BGE pass + one NLI pass across all candidates. Returns per-candidate
+    score dicts and the candidate vector matrix (for diversity reuse), or None
+    when no embedding pass was needed."""
+    texts = [c.get("text", "") for c in candidates]
+    faiths = compute_faithfulness_batch(texts, chunks)
+    cand_vecs: torch.Tensor | None = None
+    if query.strip() and any(t.strip() for t in texts):
+        from backend.retrieval.vector_store import embed_texts
+        vecs = embed_texts([query] + texts)
+        q_vec = vecs[0]
+        cand_vecs = vecs[1:]
+        relevances = [
+            round(max(0.0, float(q_vec @ cand_vecs[i])), 4) for i in range(len(texts))
+        ]
+    else:
+        relevances = [0.0] * len(texts)
+    scores = [{**faiths[i], "relevance": relevances[i]} for i in range(len(candidates))]
+    return scores, cand_vecs
+def _diversity_from_vecs(cand_vecs: "torch.Tensor") -> dict:
+    n = cand_vecs.shape[0]
+    if n < 2:
+        return {"candidate_diversity": 0.0, "n_candidates": n}
+    sims = cand_vecs @ cand_vecs.T
+    iu = torch.triu_indices(n, n, offset=1)
+    return {
+        "candidate_diversity": round(float(1.0 - sims[iu[0], iu[1]].mean().item()), 4),
+        "n_candidates": n,
+    }
 def compute_evals(
     gesture_tag: str | None,
     gaze_bucket: str | None,
     slo_target: float = 6.0,
+    query: str = "",
+    candidates: list[dict] | None = None,
+    selected_idx: int | None = None,
 ) -> dict:
+    """Run all eval scorers and return a unified EvalScores dict.
+    When candidates are provided, scoring is batched: one BGE encode for
+    query + all candidates, one NLI predict across all (sentence, chunk)
+    pairs, then sliced per candidate. The selected candidate's scores are
+    reused as the top-level fields so the existing UI pills keep working.
+    """
     eff = compute_efficiency(latency_log, slo_target)
     align = compute_multimodal_alignment(
         response, affect, gesture_tag, gaze_bucket, chunks
     )
+    per_cand: list[dict] = []
+    cand_vecs = None
+    if candidates:
+        if selected_idx is None and response:
+            for i, c in enumerate(candidates):
+                if c.get("text", "").strip() == response.strip():
+                    selected_idx = i
+                    break
+        scored, cand_vecs = _score_candidates_batched(candidates, chunks, query)
+        per_cand = [
+            {
+                "idx": i,
+                "strategy": c.get("strategy", "unknown"),
+                "selected": (selected_idx is not None and i == selected_idx),
+                **scored[i],
+            }
+            for i, c in enumerate(candidates)
+        ]
+    if per_cand and selected_idx is not None and 0 <= selected_idx < len(per_cand):
+        # Strip per-candidate-only keys before reusing as top-level scores.
+        top = {
+            k: v
+            for k, v in per_cand[selected_idx].items()
+            if k not in ("idx", "strategy", "selected")
+        }
+    else:
+        faith = compute_faithfulness(response, chunks)
+        top = {**faith, "relevance": compute_relevance(response, query)["relevance"]}
+    out = {
+        **top,
         "t_total_s": eff["t_total"],
         "slo_target_s": eff["slo_target"],
         "slo_passed": eff["slo_passed"],
         "affect_alignment": align["affect_alignment"],
         "gesture_alignment": align["gesture_alignment"],
         "gaze_alignment": align["gaze_alignment"],
+        "explain": align.get("explain", {}),
     }
+    if per_cand:
+        out["candidates_eval"] = per_cand
+        # Reuse cand_vecs from the relevance pass when available; falls back to
+        # standalone BGE encode (e.g. when query was empty).
+        if cand_vecs is not None:
+            out.update(_diversity_from_vecs(cand_vecs))
+        else:
+            out.update(compute_candidate_diversity(candidates))
+    else:
+        out["candidate_diversity"] = 0.0
+        out["n_candidates"] = 1 if response else 0
+    return out
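
Review note: the promotion step in `compute_evals` (the selected candidate's scores become the top-level fields) can be sketched in isolation. This is a minimal sketch with toy score dicts standing in for real NLI/BGE output; `promote_selected` is a hypothetical helper name, not something this commit defines, and the strategy labels and numbers are made up.

```python
from typing import Optional


def promote_selected(per_cand: list, selected_idx: Optional[int]) -> Optional[dict]:
    """Return the selected candidate's scores minus per-candidate-only keys.

    Hypothetical helper mirroring the dict comprehension in compute_evals.
    Returns None when the index is missing or out of range, which is the
    case where compute_evals falls back to scoring `response` directly.
    """
    if selected_idx is None or not (0 <= selected_idx < len(per_cand)):
        return None
    drop = {"idx", "strategy", "selected"}
    return {k: v for k, v in per_cand[selected_idx].items() if k not in drop}


# Toy per-candidate rows; values are illustrative only.
per_cand = [
    {"idx": 0, "strategy": "direct", "selected": False, "groundedness": 0.5, "relevance": 0.61},
    {"idx": 1, "strategy": "warm", "selected": True, "groundedness": 1.0, "relevance": 0.72},
]
top = promote_selected(per_cand, 1)
```

Because the top-level dict is built from the same row that appears in `candidates_eval`, the UI pills and the per-candidate table cannot disagree for the selected candidate.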

backend/evals/aggregate.py CHANGED Viewed

@@ -149,6 +149,121 @@ def report_authenticity(ratings: list[dict]) -> None:
         print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
 def main() -> None:
     parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
     parser.add_argument("--logs", type=Path, default=settings.logs_dir)
@@ -157,11 +272,16 @@ def main() -> None:
     turns = _load(args.logs / "turns.jsonl")
     evals = _load(args.logs / "evals.jsonl")
     ratings = _load(args.logs / "ratings.jsonl")
-    print(f"Loaded: {len(turns)} turns, {len(evals)} evals, {len(ratings)} ratings")
     report_latency(turns)
     report_faithfulness(evals)
     report_multimodal(evals)
     report_authenticity(ratings)

         print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
+def report_picker(turns: list[dict], picks: list[dict], evals: list[dict]) -> None:
+    """Picker behaviour: pick rate, regenerate rate, strategy win rate, and
+    whether the user's pick beat candidate 0 on groundedness and relevance.
+    Sources:
+      - turns.jsonl    one row per turn, includes `candidates` and `n_candidates`
+      - picks.jsonl    one row per /chat/pick: strategy, picked_idx, run_id
+      - evals.jsonl    candidates_eval[] with per-candidate groundedness + relevance
+    """
+    print("\n=== Picker Behaviour ===")
+    multi = [t for t in turns if (t.get("n_candidates") or 0) >= 2]
+    if not multi:
+        print(
+            "(no multi-candidate turns logged — older format or single-candidate runs)"
+        )
+        return
+    picks_by_run = {p["run_id"]: p for p in picks if p.get("run_id")}
+    evals_by_run = {e["run_id"]: e for e in evals if e.get("run_id")}
+    n_multi = len(multi)
+    n_picked = sum(1 for t in multi if t.get("run_id") in picks_by_run)
+    # A (user_id, turn_id) seen more than once means the planner re-ran for
+    # the same partner query — that's a regenerate. The denominator is the
+    # number of distinct (user, turn) conversations that had at least one
+    # multi-candidate run, not the raw row count.
+    seen: dict[tuple[str, int], int] = defaultdict(int)
+    for t in multi:
+        seen[(t.get("user_id", "?"), t.get("turn_id", -1))] += 1
+    n_regenerated_turns = sum(1 for c in seen.values() if c > 1)
+    n_distinct_turns = max(1, len(seen))
+    print(
+        f"multi-candidate turns: {n_multi} ({n_distinct_turns} distinct)   "
+        f"pick rate: {n_picked / n_multi:.0%}   "
+        f"regenerate rate: {n_regenerated_turns / n_distinct_turns:.0%} "
+        f"(% of distinct turns that re-ran)"
+    )
+    # Strategy win rate — among multi-candidate picks only, how often does
+    # each strategy win. Picks on single-candidate turns aren't a real "win"
+    # (no alternative to lose to) so we filter them out.
+    multi_run_ids = {t.get("run_id") for t in multi}
+    strategy_count: dict[str, int] = defaultdict(int)
+    for run_id, p in picks_by_run.items():
+        if run_id in multi_run_ids:
+            strategy_count[p.get("strategy", "unknown")] += 1
+    if strategy_count:
+        total = sum(strategy_count.values())
+        print(f"\nStrategy win rate (n={total} picks):")
+        print(f"  {'strategy':<16} {'picks':>6} {'pct':>6}")
+        for s, n in sorted(strategy_count.items(), key=lambda x: -x[1]):
+            print(f"  {s:<16} {n:>6} {n / total:>5.0%}")
+    # Did the picker beat candidate 0? Only meaningful when we have per-candidate
+    # eval scores AND the user picked a non-zero index. A "win" = picked
+    # candidate scored strictly higher on the metric than candidate 0.
+    head_to_head = []
+    for run_id, pick in picks_by_run.items():
+        ev = evals_by_run.get(run_id)
+        if not ev or not ev.get("candidates_eval"):
+            continue
+        cands = ev["candidates_eval"]
+        if len(cands) < 2:
+            continue
+        # Treat a missing/None index as 0 and skip negative or out-of-range values.
+        picked_idx = pick.get("picked_idx") or 0
+        if picked_idx <= 0 or picked_idx >= len(cands):
+            continue
+        head_to_head.append(
+            {
+                "picked_grounded": cands[picked_idx]["groundedness"],
+                "cand0_grounded": cands[0]["groundedness"],
+                "picked_relevance": cands[picked_idx].get("relevance", 0.0),
+                "cand0_relevance": cands[0].get("relevance", 0.0),
+            }
+        )
+    if head_to_head:
+        n = len(head_to_head)
+        beat_grounded = sum(
+            1 for h in head_to_head if h["picked_grounded"] > h["cand0_grounded"]
+        )
+        tied_grounded = sum(
+            1 for h in head_to_head if h["picked_grounded"] == h["cand0_grounded"]
+        )
+        beat_rel = sum(
+            1 for h in head_to_head if h["picked_relevance"] > h["cand0_relevance"]
+        )
+        print(f"\nDid picker beat candidate 0? (n={n} picks where picked_idx > 0)")
+        print(
+            f"  groundedness:  picker > cand0 = {beat_grounded}/{n} ({beat_grounded / n:.0%}), "
+            f"tied = {tied_grounded}/{n}"
+        )
+        print(f"  relevance:     picker > cand0 = {beat_rel}/{n} ({beat_rel / n:.0%})")
+    else:
+        print(
+            "\n(no picks of candidate 1+ with per-candidate eval data — can't measure picker quality yet)"
+        )
+    # Diversity: among multi-candidate turns with eval data, how often is the
+    # picker showing near-paraphrases (the "aloha" problem)?
+    div_scored = [
+        ev
+        for ev in evals_by_run.values()
+        if ev.get("n_candidates", 0) >= 2 and "candidate_diversity" in ev
+    ]
+    if div_scored:
+        diversities = [float(e["candidate_diversity"]) for e in div_scored]
+        low = sum(1 for d in diversities if d < 0.1)
+        print(
+            f"\nCandidate diversity (n={len(div_scored)} turns): "
+            f"mean={statistics.mean(diversities):.2f}  "
+            f"low (<0.10): {low}/{len(div_scored)} ({low / len(div_scored):.0%})"
+        )
 def main() -> None:
     parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
     parser.add_argument("--logs", type=Path, default=settings.logs_dir)
     turns = _load(args.logs / "turns.jsonl")
     evals = _load(args.logs / "evals.jsonl")
     ratings = _load(args.logs / "ratings.jsonl")
+    picks = _load(args.logs / "picks.jsonl")
+    print(
+        f"Loaded: {len(turns)} turns, {len(evals)} evals, "
+        f"{len(picks)} picks, {len(ratings)} ratings"
+    )
     report_latency(turns)
     report_faithfulness(evals)
     report_multimodal(evals)
+    report_picker(turns, picks, evals)
     report_authenticity(ratings)
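
Review note: the pick-rate and regenerate-rate arithmetic in `report_picker` is easy to misread, since the two rates use different denominators (raw multi-candidate rows vs distinct `(user_id, turn_id)` pairs). A minimal sketch on toy rows follows; `picker_rates` is a hypothetical name and the rows are invented for illustration.

```python
from collections import Counter


def picker_rates(turns, picks):
    """Pick rate over multi-candidate rows; regenerate rate over distinct turns."""
    multi = [t for t in turns if (t.get("n_candidates") or 0) >= 2]
    if not multi:
        return None
    picked_runs = {p["run_id"] for p in picks if p.get("run_id")}
    n_picked = sum(1 for t in multi if t.get("run_id") in picked_runs)
    # A (user_id, turn_id) key seen more than once means the turn was re-run.
    seen = Counter((t.get("user_id", "?"), t.get("turn_id", -1)) for t in multi)
    n_regenerated = sum(1 for count in seen.values() if count > 1)
    return {
        "pick_rate": n_picked / len(multi),
        "regenerate_rate": n_regenerated / len(seen),
    }


turns = [
    {"run_id": "r1", "user_id": "u1", "turn_id": 1, "n_candidates": 3},
    {"run_id": "r2", "user_id": "u1", "turn_id": 1, "n_candidates": 3},  # regenerate of r1
    {"run_id": "r3", "user_id": "u1", "turn_id": 2, "n_candidates": 3},
    {"run_id": "r4", "user_id": "u2", "turn_id": 1, "n_candidates": 1},  # single candidate, excluded
]
picks = [{"run_id": "r2"}, {"run_id": "r3"}, {"run_id": "r4"}]
rates = picker_rates(turns, picks)
```

Here three rows are multi-candidate, two of them were picked, and one of the two distinct turns was re-run.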

backend/evals/diversity.py ADDED Viewed

@@ -0,0 +1,23 @@

+import torch
+def compute_candidate_diversity(candidates: list[dict]) -> dict:
+    """Mean pairwise cosine *distance* among candidate texts.
+    0.0 = identical texts; values near 1.0 = unrelated (near-orthogonal)
+    embeddings. For unit-norm vectors the distance is bounded by 2.0.
+    """
+    texts = [c.get("text", "").strip() for c in candidates]
+    texts = [t for t in texts if t]
+    n = len(texts)
+    if n < 2:
+        return {"candidate_diversity": 0.0, "n_candidates": n}
+    from backend.retrieval.vector_store import embed_texts
+    vecs = embed_texts(texts)
+    sims = vecs @ vecs.T
+    iu = torch.triu_indices(n, n, offset=1)
+    return {
+        "candidate_diversity": round(float(1.0 - sims[iu[0], iu[1]].mean().item()), 4),
+        "n_candidates": n,
+    }
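
Review note: the metric itself is small enough to check by hand. The sketch below recomputes mean pairwise cosine distance in pure Python on toy 2-D vectors instead of BGE embeddings; `mean_pairwise_cosine_distance` is a hypothetical name, and the sketch assumes no zero-length vectors.

```python
import math
from itertools import combinations


def mean_pairwise_cosine_distance(vectors):
    """1 - mean cosine similarity over all unordered pairs (upper triangle)."""
    if len(vectors) < 2:
        return 0.0
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))  # assumes norm > 0
        unit.append([x / norm for x in v])
    sims = [sum(a * b for a, b in zip(u, w)) for u, w in combinations(unit, 2)]
    return round(1.0 - sum(sims) / len(sims), 4)


# Two identical directions plus one orthogonal one: pair similarities 1, 0, 0.
mixed = mean_pairwise_cosine_distance([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
same = mean_pairwise_cosine_distance([[1.0, 0.0], [2.0, 0.0]])
opposite = mean_pairwise_cosine_distance([[1.0, 0.0], [-1.0, 0.0]])
```

The `opposite` case shows that the distance can exceed 1.0 for unit vectors, which matters if the value is later rendered as a percentage.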

backend/evals/faithfulness.py CHANGED Viewed

@@ -49,46 +49,76 @@ def _split_sentences(text: str) -> list[str]:
     return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
 def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
     """Sentence-level NLI: each sentence must be entailed by at least one chunk."""
-    if not chunks:
-        return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
-    sentences = _split_sentences(response)
-    # Too short to score meaningfully (one-word replies, fragments). Flagging as
-    # no_evidence is honest: we're not scoring it, so it should be excluded from
-    # groundedness averages downstream.
-    if not sentences:
-        return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
-    chunk_texts = [c.get("text", "") for c in chunks if c.get("text")]
     if not chunk_texts:
-        return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
-    if len(sentences) > _MAX_SENTENCES:
-        sentences = sentences[:_MAX_SENTENCES]
     model = _get_model()
-    # NLI pair order: (premise, hypothesis). Chunks are evidence (premise),
-    # response sentences are the claims being checked (hypothesis).
-    pairs = [(chunk, sent) for sent in sentences for chunk in chunk_texts]
     with _predict_sem:
         logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
     probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
     entail = probs[:, _entail_idx]
     n_chunks = len(chunk_texts)
-    threshold = settings.faithfulness_threshold
-    grounded = 0
-    for i in range(len(sentences)):
-        sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
-        if sent_scores.max() >= threshold:
-            grounded += 1
-    total = len(sentences)
-    groundedness = grounded / total
-    return {
-        "groundedness": round(groundedness, 4),
-        "hallucination_rate": round(1.0 - groundedness, 4),
-        "no_evidence": False,
-    }

     return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
+def _no_evidence_result() -> dict:
+    return {
+        "groundedness": 0.0,
+        "hallucination_rate": 0.0,
+        "no_evidence": True,
+        "sentences_total": 0,
+        "sentences_grounded": 0,
+        "nli_threshold": settings.faithfulness_threshold,
+    }
+def _score_from_entail(entail, sentences: list[str], n_chunks: int) -> dict:
+    threshold = settings.faithfulness_threshold
+    grounded = 0
+    for i in range(len(sentences)):
+        sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
+        if sent_scores.max() >= threshold:
+            grounded += 1
+    total = len(sentences)
+    return {
+        "groundedness": round(grounded / total, 4),
+        "hallucination_rate": round(1.0 - grounded / total, 4),
+        "no_evidence": False,
+        "sentences_total": total,
+        "sentences_grounded": grounded,
+        "nli_threshold": threshold,
+    }
 def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
     """Sentence-level NLI: each sentence must be entailed by at least one chunk."""
+    return compute_faithfulness_batch([response], chunks)[0]
+def compute_faithfulness_batch(responses: list[str], chunks: list[dict]) -> list[dict]:
+    """Score multiple candidate responses against the same chunks in a single
+    model.predict call. Caller passes `responses` in candidate order; we return
+    results in the same order. Falls back to _no_evidence_result when there's
+    nothing to score for a given response."""
+    chunk_texts = [c.get("text", "") for c in chunks if c.get("text")] if chunks else []
     if not chunk_texts:
+        return [_no_evidence_result() for _ in responses]
+    per_resp_sentences = [
+        _split_sentences(r)[:_MAX_SENTENCES] if r else [] for r in responses
+    ]
+    pairs: list[tuple[str, str]] = []
+    for sents in per_resp_sentences:
+        for sent in sents:
+            for chunk in chunk_texts:
+                pairs.append((chunk, sent))
+    if not pairs:
+        return [_no_evidence_result() for _ in responses]
     model = _get_model()
     with _predict_sem:
         logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
     probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
     entail = probs[:, _entail_idx]
+    out: list[dict] = []
+    cursor = 0
     n_chunks = len(chunk_texts)
+    for sentences in per_resp_sentences:
+        if not sentences:
+            out.append(_no_evidence_result())
+            continue
+        n_pairs = len(sentences) * n_chunks
+        out.append(
+            _score_from_entail(entail[cursor : cursor + n_pairs], sentences, n_chunks)
+        )
+        cursor += n_pairs
+    return out
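
Review note: the cursor bookkeeping in `compute_faithfulness_batch` is the part most likely to drift if the pair order ever changes. A minimal sketch of the slicing on a flat list of entailment probabilities follows. `slice_groundedness` is a hypothetical name, the probabilities are invented, and the 0.5 threshold is an assumption standing in for `settings.faithfulness_threshold`.

```python
def slice_groundedness(entail, sentence_counts, n_chunks, threshold=0.5):
    """Split one flat entailment array back into per-response results.

    entail is laid out response by response, sentence by sentence, with
    n_chunks consecutive scores per sentence. Responses with zero sentences
    contributed no pairs, so the cursor does not advance for them.
    """
    results = []
    cursor = 0
    for n_sent in sentence_counts:
        if n_sent == 0:
            results.append(None)  # stands in for _no_evidence_result()
            continue
        n_pairs = n_sent * n_chunks
        block = entail[cursor : cursor + n_pairs]
        grounded = sum(
            1
            for i in range(n_sent)
            if max(block[i * n_chunks : (i + 1) * n_chunks]) >= threshold
        )
        results.append((grounded, n_sent))
        cursor += n_pairs
    return results


# Three responses with 2, 0 and 1 scorable sentences, checked against 2 chunks.
entail = [0.9, 0.1, 0.2, 0.3, 0.4, 0.6]
results = slice_groundedness(entail, [2, 0, 1], n_chunks=2)
```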

backend/evals/multimodal_alignment.py CHANGED Viewed

@@ -77,11 +77,18 @@ def _gesture_alignment(response: str, gesture_tag: str | None) -> float:
     return 1.0 if pattern.search(response) else 0.0
-def _gaze_alignment(chunks: list[dict], gaze_bucket: str | None) -> float:
     if not gaze_bucket or not chunks:
-        return 0.0
     matches = sum(1 for c in chunks if c.get("bucket") == gaze_bucket)
-    return matches / len(chunks)
 def compute_multimodal_alignment(
@@ -92,16 +99,37 @@ def compute_multimodal_alignment(
     chunks: list[dict],
 ) -> dict:
     scores: dict[str, float] = {}
     if affect:
         scores["affect_alignment"] = _affect_alignment(response, affect)
     if gesture_tag:
         scores["gesture_alignment"] = _gesture_alignment(response, gesture_tag)
     if gaze_bucket:
-        scores["gaze_alignment"] = _gaze_alignment(chunks, gaze_bucket)
     overall = sum(scores.values()) / len(scores) if scores else 0.0
     return {
         "overall_score": round(overall, 4),
         "affect_alignment": round(scores.get("affect_alignment", 0.0), 4),
         "gesture_alignment": round(scores.get("gesture_alignment", 0.0), 4),
         "gaze_alignment": round(scores.get("gaze_alignment", 0.0), 4),
     }

     return 1.0 if pattern.search(response) else 0.0
+def _gaze_alignment(
+    chunks: list[dict], gaze_bucket: str | None
+) -> tuple[float, int, int]:
     if not gaze_bucket or not chunks:
+        return 0.0, 0, (len(chunks) if chunks else 0)
     matches = sum(1 for c in chunks if c.get("bucket") == gaze_bucket)
+    return matches / len(chunks), matches, len(chunks)
+def _affect_breakdown(response: str) -> tuple[int, int]:
+    toks = _tokens(response)
+    return len(toks & _POSITIVE), len(toks & _NEGATIVE)
 def compute_multimodal_alignment(
     chunks: list[dict],
 ) -> dict:
     scores: dict[str, float] = {}
+    explain: dict[str, dict] = {}
     if affect:
         scores["affect_alignment"] = _affect_alignment(response, affect)
+        pos, neg = _affect_breakdown(response)
+        explain["affect"] = {
+            "target": affect,
+            "pos_words": pos,
+            "neg_words": neg,
+            "sentiment": round(_sentiment_score(response), 4),
+        }
     if gesture_tag:
         scores["gesture_alignment"] = _gesture_alignment(response, gesture_tag)
+        pattern = _GESTURE_OPENER_PATTERNS.get(gesture_tag)
+        explain["gesture"] = {
+            "tag": gesture_tag,
+            "has_pattern": pattern is not None,
+            "matched": bool(pattern.search(response)) if pattern else None,
+        }
     if gaze_bucket:
+        score, matches, total = _gaze_alignment(chunks, gaze_bucket)
+        scores["gaze_alignment"] = score
+        explain["gaze"] = {
+            "bucket": gaze_bucket,
+            "matched_chunks": matches,
+            "total_chunks": total,
+        }
     overall = sum(scores.values()) / len(scores) if scores else 0.0
     return {
         "overall_score": round(overall, 4),
         "affect_alignment": round(scores.get("affect_alignment", 0.0), 4),
         "gesture_alignment": round(scores.get("gesture_alignment", 0.0), 4),
         "gaze_alignment": round(scores.get("gaze_alignment", 0.0), 4),
+        "explain": explain,
     }
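
Review note: returning the counts alongside the score is what lets the tooltip show the real math for the turn. A minimal sketch of that contract for the gaze metric follows. `gaze_with_explain` and `render_gaze_line` are hypothetical names, the bucket labels are invented, and the rendered line is only a rough Python analogue of the frontend tooltip text.

```python
def gaze_with_explain(chunks, gaze_bucket):
    """Return (score, explain) so the displayed counts always match the score."""
    total = len(chunks) if chunks else 0
    if not gaze_bucket or total == 0:
        return 0.0, {"bucket": gaze_bucket, "matched_chunks": 0, "total_chunks": total}
    matched = sum(1 for c in chunks if c.get("bucket") == gaze_bucket)
    explain = {"bucket": gaze_bucket, "matched_chunks": matched, "total_chunks": total}
    return matched / total, explain


def render_gaze_line(score, explain):
    """Format the per-turn line from the same numbers that produced the score."""
    return (
        f"{explain['matched_chunks']}/{explain['total_chunks']} "
        f"retrieved chunks matched -> {score * 100:.0f}%"
    )


chunks = [{"bucket": "family"}, {"bucket": "food"}, {"bucket": "family"}]
score, explain = gaze_with_explain(chunks, "family")
line = render_gaze_line(score, explain)
```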

backend/evals/relevance.py ADDED Viewed

@@ -0,0 +1,16 @@

+def compute_relevance(response: str, query: str) -> dict:
+    """BGE cosine similarity between query and response embeddings.
+    NLI entailment is the wrong abstraction here (a question rarely entails
+    its answer), so we use the same embedding space the retriever uses.
+    """
+    if not response.strip() or not query.strip():
+        return {"relevance": 0.0, "no_query": not query.strip()}
+    from backend.retrieval.vector_store import embed_texts
+    vecs = embed_texts([query, response])
+    return {
+        "relevance": round(max(0.0, float(vecs[0] @ vecs[1])), 4),
+        "no_query": False,
+    }
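
Review note: the `max(0.0, ...)` clamp is what keeps the relevance pill inside 0 to 100%. A minimal pure-Python sketch of the clamped cosine on toy vectors follows; `clamped_cosine` is a hypothetical name, and the explicit normalization here replaces the `normalize_embeddings=True` step that `embed_texts` performs.

```python
import math


def clamped_cosine(u, v):
    """Cosine similarity floored at 0.0 and rounded to 4 places."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))  # assumes non-zero vectors
    norm_v = math.sqrt(sum(b * b for b in v))
    return round(max(0.0, dot / (norm_u * norm_v)), 4)


aligned = clamped_cosine([1.0, 0.0], [1.0, 0.0])
diagonal = clamped_cosine([1.0, 0.0], [1.0, 1.0])
opposed = clamped_cosine([1.0, 0.0], [-1.0, 0.0])  # raw cosine is -1.0
```

Without the clamp, an opposed pair would render as a negative percentage in the pill.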

backend/retrieval/vector_store.py CHANGED Viewed

@@ -54,6 +54,15 @@ def get_embedder():
     return _get_embedder()
 # Index cache: one (vectors_tensor, meta) per user_id.
 _index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}

     return _get_embedder()
+def embed_texts(texts: list[str]) -> torch.Tensor:
+    return _get_embedder().encode(
+        texts,
+        convert_to_tensor=True,
+        normalize_embeddings=True,
+        device=_DEVICE,
+    )
 # Index cache: one (vectors_tensor, meta) per user_id.
 _index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}
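
Review note: the point of `embed_texts` is that callers can encode `[query] + candidates` once and slice the result. The sketch below illustrates that call pattern with a stand-in embedder. `fake_embed` is hypothetical (unit-length letter-frequency vectors, not BGE) and exists only so the example runs without a model.

```python
import math

call_log = []  # records the batch size of each embed call


def fake_embed(texts):
    """Stand-in for embed_texts: one unit-length letter-frequency vector per text."""
    call_log.append(len(texts))
    vectors = []
    for text in texts:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        norm = math.sqrt(sum(c * c for c in counts)) or 1.0
        vectors.append([c / norm for c in counts])
    return vectors


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


query = "do you like coffee"
candidates = ["yes i like coffee", "coffee is great", "zzz"]

# One batch: row 0 is the query, rows 1.. are the candidates.
vecs = fake_embed([query] + candidates)
q_vec, cand_vecs = vecs[0], vecs[1:]
relevances = [max(0.0, dot(q_vec, c)) for c in cand_vecs]
```

The same `cand_vecs` rows can then feed the diversity computation, so no second encode is needed.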

frontend/src/App.css CHANGED Viewed

@@ -662,6 +662,53 @@ input[type="text"]:hover {
   border-color: var(--border);
 }
 .eval-details {
   display: flex;
   flex-direction: column;

   border-color: var(--border);
 }
+.tip {
+  position: relative;
+  display: inline-flex;
+  cursor: help;
+}
+.tip::after {
+  content: attr(data-tip);
+  position: absolute;
+  bottom: calc(100% + 8px);
+  left: 0;
+  background: #1f2937;
+  color: #f9fafb;
+  padding: 8px 10px;
+  border-radius: 6px;
+  font-size: 11px;
+  font-weight: 400;
+  line-height: 1.45;
+  letter-spacing: 0.1px;
+  white-space: pre-line;
+  width: max-content;
+  max-width: 320px;
+  opacity: 0;
+  pointer-events: none;
+  transition: opacity 0.08s ease-in 0.5s;
+  z-index: 100;
+  box-shadow: 0 4px 12px rgba(0, 0, 0, 0.25);
+}
+.tip::before {
+  content: "";
+  position: absolute;
+  bottom: calc(100% + 2px);
+  left: 12px;
+  border: 6px solid transparent;
+  border-top-color: #1f2937;
+  opacity: 0;
+  pointer-events: none;
+  transition: opacity 0.08s ease-in 0.5s;
+  z-index: 100;
+}
+.tip:hover::after,
+.tip:hover::before {
+  opacity: 1;
+}
 .eval-details {
   display: flex;
   flex-direction: column;

frontend/src/components/EvalPanel.tsx CHANGED Viewed

@@ -1,4 +1,4 @@
-import { useState } from "react";
 import { submitRating } from "../lib/api";
 import type { EvalScores } from "../types";
@@ -20,7 +20,162 @@ function fmt(score: number): string {
   return (score * 100).toFixed(0) + "%";
 }
-export function EvalPanel({
   runId,
   userId,
   latencyTotal,
@@ -34,6 +189,9 @@ export function EvalPanel({
     ? evalScores.slo_passed
     : latencyTotal > 0 && latencyTotal < sloTarget;
   const effectiveLatency = evalScores?.t_total_s ?? latencyTotal;
   async function rate(stars: number) {
     if (submitting || value !== null) return;
@@ -57,64 +215,76 @@ export function EvalPanel({
       <div className="eval-row">
         {effectiveLatency > 0 && (
           <span
-            className={`slo-badge ${sloPassed ? "pass" : "fail"}`}
-            title={`SLO target ${sloTarget.toFixed(1)}s`}
           >
-            {effectiveLatency.toFixed(2)}s {sloPassed ? "✓" : "✗"}
           </span>
         )}
         {evalScores && (
           <>
-            <span
-              className={`eval-pill ${
-                evalScores.no_evidence ? "muted" : gradeClass(evalScores.groundedness)
-              }`}
-              title={
-                evalScores.no_evidence
-                  ? "No retrieved evidence — groundedness not scored"
-                  : `Groundedness: fraction of response sentences supported by retrieved memories (hallucination ${fmt(
-                      evalScores.hallucination_rate
-                    )})`
-              }
-            >
-              grounded {evalScores.no_evidence ? "—" : fmt(evalScores.groundedness)}
             </span>
-            <span
-              className={`eval-pill ${gradeClass(evalScores.affect_alignment)}`}
-              title="Affect alignment: does the response tone match the detected facial affect?"
-            >
-              affect {fmt(evalScores.affect_alignment)}
             </span>
-            <span
-              className={`eval-pill ${gradeClass(evalScores.gesture_alignment)}`}
-              title="Gesture alignment: does the response acknowledge the detected hand gesture?"
-            >
-              gesture {fmt(evalScores.gesture_alignment)}
             </span>
-            <span
-              className={`eval-pill ${gradeClass(evalScores.gaze_alignment)}`}
-              title="Gaze alignment: did retrieved chunks come from the bucket the user was looking at?"
-            >
-              gaze {fmt(evalScores.gaze_alignment)}
             </span>
           </>
         )}
-        <div className="star-rating" title="Rate authenticity (1-5)">
-          {[1, 2, 3, 4, 5].map((star) => (
-            <button
-              key={star}
-              className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
-              onMouseEnter={() => setHover(star)}
-              onMouseLeave={() => setHover(0)}
-              onClick={() => rate(star)}
-              disabled={value !== null || submitting}
-            >
-              ★
-            </button>
-          ))}
-          {value !== null && <span className="star-label">{value}/5</span>}
-        </div>
       </div>
     </div>
   );
 }

+import { memo, useState } from "react";
 import { submitRating } from "../lib/api";
 import type { EvalScores } from "../types";
   return (score * 100).toFixed(0) + "%";
 }
+function buildTip(parts: {
+  title: string;
+  question: string;
+  how?: string;
+  thisTurn?: string;
+  fallback?: string;
+}): string {
+  const header = `${parts.title} — ${parts.question}`;
+  if (parts.fallback) return `${header}\n\n${parts.fallback}`;
+  const sections = [parts.how, parts.thisTurn].filter(Boolean);
+  return sections.length ? `${header}\n\n${sections.join("\n\n")}` : header;
+}
+function groundednessTip(s: EvalScores): string {
+  const title = "GROUNDED";
+  if (s.no_evidence) {
+    return buildTip({
+      title,
+      question: "Did the response stick to the retrieved memories?",
+      fallback:
+        "Not scored: either no memories were retrieved this turn (e.g. a 'how are you feeling?' question that skips retrieval) or the response had no sentences long enough to check.",
+    });
+  }
+  const total = s.sentences_total ?? 0;
+  const grounded = s.sentences_grounded ?? 0;
+  const thr = s.nli_threshold ?? 0.5;
+  return buildTip({
+    title,
+    question: "Did the response stick to the retrieved memories, or hallucinate?",
+    how:
+      `How: each sentence in the response is checked against each retrieved chunk with an NLI model. ` +
+      `A sentence counts as grounded if at least one chunk entails it with probability ≥ ${thr.toFixed(2)}.`,
+    thisTurn:
+      `This turn: ${grounded}/${total} sentences grounded → ${fmt(s.groundedness)}. ` +
+      `Hallucination = ${fmt(s.hallucination_rate)} (${total - grounded} unsupported).`,
+  });
+}
+function relevanceTip(s: EvalScores): string {
+  return buildTip({
+    title: "RELEVANT",
+    question: "Did the response actually address the partner's question?",
+    how:
+      "How: cosine similarity between the BGE embedding of the query and the embedding of the response. " +
+      "Higher = more semantically on-topic.",
+    thisTurn: `This turn: ${(s.relevance ?? 0).toFixed(3)} → ${fmt(s.relevance ?? 0)}.`,
+  });
+}
+function affectTip(s: EvalScores): string {
+  const question = "Does the response tone match the detected facial expression?";
+  const ex = s.explain?.affect;
+  if (!ex) {
+    return buildTip({ title: "AFFECT", question });
+  }
+  return buildTip({
+    title: "AFFECT",
+    question,
+    how:
+      "How: response sentiment is computed from positive vs negative word counts, " +
+      "then compared to the affect target.",
+    thisTurn:
+      `This turn: detected ${ex.target}, response sentiment = ${ex.sentiment.toFixed(2)} ` +
+      `(${ex.pos_words} positive word${ex.pos_words === 1 ? "" : "s"}, ` +
+      `${ex.neg_words} negative) → ${fmt(s.affect_alignment)}.`,
+  });
+}
+function gestureTip(s: EvalScores): string {
+  const title = "GESTURE";
+  const question = "Does the response opener acknowledge the detected hand gesture?";
+  const ex = s.explain?.gesture;
+  if (!ex) {
+    return buildTip({ title, question, fallback: "No gesture detected this turn — defaults to 0." });
+  }
+  if (!ex.has_pattern) {
+    return buildTip({
+      title,
+      question,
+      fallback: `Detected ${ex.tag}, but this gesture has no opener pattern to test — partial credit (50%).`,
+    });
+  }
+  return buildTip({
+    title,
+    question,
+    how: `How: regex check on the first words of the response (e.g. THUMBS_UP expects 'yes/sure/absolutely…').`,
+    thisTurn:
+      `This turn: detected ${ex.tag}, opener ${ex.matched ? "matched" : "did not match"} ` +
+      `→ ${ex.matched ? "100%" : "0%"}.`,
+  });
+}
+function gazeTip(s: EvalScores): string {
+  const title = "GAZE";
+  const question = "Did the retrieved memories come from the topic the user was looking at?";
+  const ex = s.explain?.gaze;
+  if (!ex) {
+    return buildTip({ title, question, fallback: "No gaze bucket detected this turn — defaults to 0." });
+  }
+  if (ex.total_chunks === 0) {
+    return buildTip({
+      title,
+      question,
+      fallback: `User was looking at: ${ex.bucket}. No chunks retrieved this turn — defaults to 0.`,
+    });
+  }
+  return buildTip({
+    title,
+    question,
+    how: `How: fraction of retrieved chunks whose 'bucket' label matches the gaze target.`,
+    thisTurn:
+      `This turn: user looking at ${ex.bucket}, ${ex.matched_chunks}/${ex.total_chunks} ` +
+      `retrieved chunks matched → ${fmt(s.gaze_alignment)}.`,
+  });
+}
+function diversityTip(s: EvalScores): string {
+  const title = "DIVERSITY";
+  const question = "How different are the candidate responses the picker showed?";
+  const n = s.n_candidates ?? 0;
+  const d = s.candidate_diversity ?? 0;
+  if (n < 2) {
+    return buildTip({ title, question, fallback: `Only ${n} candidate${n === 1 ? "" : "s"} this turn, so diversity is not meaningful.` });
+  }
+  return buildTip({
+    title,
+    question,
+    how:
+      "How: average pairwise cosine distance between BGE embeddings of the candidate texts. " +
+      "High = varied alternatives. Low = three paraphrases of the same answer (the 'aloha' problem).",
+    thisTurn: `This turn: ${n} candidates, mean pairwise distance = ${d.toFixed(3)} → ${fmt(d)}.`,
+  });
+}
+function sloTip(
+  s: EvalScores | null | undefined,
+  fallbackLatency: number,
+  fallbackTarget: number,
+  fallbackPassed: boolean,
+): string {
+  const latency = s?.t_total_s ?? fallbackLatency;
+  const target = s?.slo_target_s ?? fallbackTarget;
+  const passed = s?.slo_passed ?? fallbackPassed;
+  const margin = s?.slo_margin_s;
+  const sign = (margin ?? 0) >= 0 ? "+" : "";
+  const m = margin != null ? `${sign}${margin.toFixed(2)}s` : "";
+  return buildTip({
+    title: "LATENCY",
+    question: "Did the response arrive within the SLO budget?",
+    thisTurn:
+      `Target: < ${target.toFixed(1)}s. ` +
+      `This turn: ${latency.toFixed(2)}s${m ? ` (${m} margin)` : ""} — ${passed ? "passed ✓" : "failed ✗"}.`,
+  });
+}
+function EvalPanelImpl({
   runId,
   userId,
   latencyTotal,
     ? evalScores.slo_passed
     : latencyTotal > 0 && latencyTotal < sloTarget;
   const effectiveLatency = evalScores?.t_total_s ?? latencyTotal;
+  const showDiversity =
+    evalScores && (evalScores.n_candidates ?? 0) >= 2;
+  const showRelevance = evalScores && evalScores.relevance !== undefined;
   async function rate(stars: number) {
     if (submitting || value !== null) return;
       <div className="eval-row">
         {effectiveLatency > 0 && (
           <span
+            className="tip"
+            data-tip={sloTip(evalScores, effectiveLatency, sloTarget, sloPassed)}
           >
+            <span className={`slo-badge ${sloPassed ? "pass" : "fail"}`}>
+              {effectiveLatency.toFixed(2)}s {sloPassed ? "✓" : "✗"}
+            </span>
           </span>
         )}
         {evalScores && (
           <>
+            <span className="tip" data-tip={groundednessTip(evalScores)}>
+              <span
+                className={`eval-pill ${
+                  evalScores.no_evidence ? "muted" : gradeClass(evalScores.groundedness)
+                }`}
+              >
+                grounded {evalScores.no_evidence ? "—" : fmt(evalScores.groundedness)}
+              </span>
             </span>
+            {showRelevance && (
+              <span className="tip" data-tip={relevanceTip(evalScores)}>
+                <span className={`eval-pill ${gradeClass(evalScores.relevance ?? 0)}`}>
+                  relevant {fmt(evalScores.relevance ?? 0)}
+                </span>
+              </span>
+            )}
+            <span className="tip" data-tip={affectTip(evalScores)}>
+              <span className={`eval-pill ${gradeClass(evalScores.affect_alignment)}`}>
+                affect {fmt(evalScores.affect_alignment)}
+              </span>
             </span>
+            <span className="tip" data-tip={gestureTip(evalScores)}>
+              <span className={`eval-pill ${gradeClass(evalScores.gesture_alignment)}`}>
+                gesture {fmt(evalScores.gesture_alignment)}
+              </span>
             </span>
+            <span className="tip" data-tip={gazeTip(evalScores)}>
+              <span className={`eval-pill ${gradeClass(evalScores.gaze_alignment)}`}>
+                gaze {fmt(evalScores.gaze_alignment)}
+              </span>
             </span>
+            {showDiversity && (
+              <span className="tip" data-tip={diversityTip(evalScores)}>
+                <span className={`eval-pill ${gradeClass(evalScores.candidate_diversity ?? 0)}`}>
+                  diversity {fmt(evalScores.candidate_diversity ?? 0)}
+                </span>
+              </span>
+            )}
           </>
         )}
+        <span className="tip" data-tip="Rate how authentic this response felt as the persona (1 = off, 5 = spot on). Logged to ratings.jsonl.">
+          <div className="star-rating">
+            {[1, 2, 3, 4, 5].map((star) => (
+              <button
+                key={star}
+                type="button"
+                aria-label={`Rate ${star} out of 5`}
+                className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
+                onMouseEnter={() => setHover(star)}
+                onMouseLeave={() => setHover(0)}
+                onClick={() => rate(star)}
+                disabled={value !== null || submitting}
+              >
+                ★
+              </button>
+            ))}
+            {value !== null && <span className="star-label">{value}/5</span>}
+          </div>
+        </span>
       </div>
     </div>
   );
 }
+export const EvalPanel = memo(EvalPanelImpl);
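
The diversity tooltip above reports a mean pairwise cosine distance across candidates. The real computation lives in `backend/evals/diversity.py`, which this part of the diff does not show, so the TypeScript below is only an illustrative sketch of the metric. The helper names `cosineSimilarity` and `meanPairwiseCosineDistance` are hypothetical, and plain number arrays stand in for the BGE embeddings.

```typescript
// Hypothetical helper (not from the commit): cosine similarity of two vectors.
// Returns 0 when either vector has zero norm, to avoid dividing by zero.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0; // treat a missing component as 0
    dot += x * y;
    normA += x * x;
  });
  b.forEach((y) => {
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical helper (not from the commit): mean of (1 - cosine similarity)
// over all unordered candidate pairs. Fewer than two candidates yields 0,
// which matches the UI hiding the pill when n_candidates < 2.
function meanPairwiseCosineDistance(embeddings: number[][]): number {
  let total = 0;
  let pairs = 0;
  embeddings.forEach((a, i) => {
    embeddings.slice(i + 1).forEach((b) => {
      total += 1 - cosineSimilarity(a, b);
      pairs += 1;
    });
  });
  return pairs === 0 ? 0 : total / pairs;
}
```

Under this sketch, three identical candidates score 0 (the paraphrase case the tooltip calls out), while mutually orthogonal embeddings score 1.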

frontend/src/types.ts CHANGED

@@ -69,10 +69,43 @@ export interface LatencyLog {
   t_total: number;
 }
 export interface EvalScores {
   groundedness: number;
   hallucination_rate: number;
   no_evidence: boolean;
   t_total_s: number;
   slo_target_s: number;
   slo_passed: boolean;
@@ -81,6 +114,10 @@ export interface EvalScores {
   affect_alignment: number;
   gesture_alignment: number;
   gaze_alignment: number;
 }
 export type CandidateStrategy =

   t_total: number;
 }
+export interface CandidateEval {
+  idx: number;
+  strategy: string;
+  selected: boolean;
+  groundedness: number;
+  hallucination_rate: number;
+  no_evidence: boolean;
+  relevance: number;
+}
+export interface EvalExplain {
+  affect?: {
+    target: string;
+    pos_words: number;
+    neg_words: number;
+    sentiment: number;
+  };
+  gesture?: {
+    tag: string;
+    has_pattern: boolean;
+    matched: boolean | null;
+  };
+  gaze?: {
+    bucket: string;
+    matched_chunks: number;
+    total_chunks: number;
+  };
+}
 export interface EvalScores {
   groundedness: number;
   hallucination_rate: number;
   no_evidence: boolean;
+  sentences_total?: number;
+  sentences_grounded?: number;
+  nli_threshold?: number;
+  relevance?: number;
   t_total_s: number;
   slo_target_s: number;
   slo_passed: boolean;
   affect_alignment: number;
   gesture_alignment: number;
   gaze_alignment: number;
+  candidate_diversity?: number;
+  n_candidates?: number;
+  candidates_eval?: CandidateEval[];
+  explain?: EvalExplain;
 }
 export type CandidateStrategy =
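
The commit message says the per-candidate scores exist so we can ask "did the picker beat candidate 0?" offline. That join happens in `backend/evals/aggregate.py`, which is not shown here. As a rough client-side illustration only, the sketch below applies the same question to a `candidates_eval` array. The function name `pickedBeatsBaseline`, the `Metric` type, and the "strictly higher score" definition of a win are assumptions, not taken from the commit.

```typescript
// Local copy of the CandidateEval shape from the diff above, so the sketch is self-contained.
interface CandidateEval {
  idx: number;
  strategy: string;
  selected: boolean;
  groundedness: number;
  hallucination_rate: number;
  no_evidence: boolean;
  relevance: number;
}

// Assumption: a "win" is judged on a single numeric metric.
type Metric = "groundedness" | "relevance";

// Hypothetical helper: true if the selected candidate scores strictly higher
// than candidate 0 on `metric`, false if it does not (including when candidate 0
// itself was selected), and null when the comparison cannot be made.
function pickedBeatsBaseline(
  candidates: CandidateEval[],
  metric: Metric,
): boolean | null {
  const picked = candidates.find((c) => c.selected);
  const baseline = candidates.find((c) => c.idx === 0);
  if (!picked || !baseline) return null;
  // Groundedness is not meaningful when there was no evidence to check against.
  if (metric === "groundedness" && (picked.no_evidence || baseline.no_evidence)) {
    return null;
  }
  if (picked.idx === baseline.idx) return false;
  return picked[metric] > baseline[metric];
}
```

Returning `null` rather than `false` for the uncomparable cases keeps them out of any win-rate denominator.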