shwetangisingh committed on
Commit
69345ca
·
1 Parent(s): 1799d5b

relevance + diversity evals, batched scoring, math tooltips

Adds two new eval metrics (relevance via BGE cosine, candidate diversity
via mean pairwise cosine distance) plus per-candidate scoring so we can
ask "did the picker beat candidate 0?" offline. compute_evals batches
the scoring path: one NLI model.predict over all (candidate × sentence
× chunk) pairs and one BGE embed_texts over [query, c1, c2, c3] feeding
both relevance and diversity. EvalPanel surfaces the new pills with
custom CSS hover tooltips that show the actual math from each turn
(e.g. "2/2 sentences had NLI entailment ≥ 0.50") rather than generic
definitions. aggregate.py grows a report_picker section that joins
turns + picks + evals and prints pick rate, regenerate rate, strategy
win rate, and the diversity floor.
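
A minimal sketch of the shared-embedding idea described above, assuming the embedder returns unit-length vectors so that a dot product equals cosine similarity. The toy 2-D vectors and the helper names `relevance` and `mean_pairwise_distance` are illustrative stand-ins, not the repository's functions.

```python
from itertools import combinations


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def relevance(q_vec: list[float], cand_vec: list[float]) -> float:
    # Cosine similarity clamped at zero (inputs assumed unit-length).
    return max(0.0, dot(q_vec, cand_vec))


def mean_pairwise_distance(cand_vecs: list[list[float]]) -> float:
    # Average of (1 - cosine similarity) over every unordered candidate pair.
    pairs = list(combinations(cand_vecs, 2))
    if not pairs:
        return 0.0  # fewer than two candidates: nothing to compare
    return sum(1.0 - dot(a, b) for a, b in pairs) / len(pairs)


# One embedding batch laid out as [query, c1, c2, c3] (toy values).
vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec, cand_vecs = vecs[0], vecs[1:]

relevances = [relevance(query_vec, c) for c in cand_vecs]
diversity = mean_pairwise_distance(cand_vecs)
```

The same candidate vectors feed both numbers, which is why a single embedding call is enough for relevance and diversity.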

README.md CHANGED
@@ -71,7 +71,15 @@ The per-turn JSONL log includes both `bucket_priors_after` and `type_priors_afte
71
 
72
  ### Per-turn eval pills in the UI
73
 
74
- Every AAC-user bubble renders its eval scores inline: the SLO latency badge, a groundedness pill (faithfulness against retrieved memory), and three multimodal-alignment pills (affect / gesture / gaze). Pills go green / grey / red on 0.75 / 0.4 thresholds, and groundedness renders `—` when the turn had no retrieved evidence to check against. Authenticity stars sit on the right. All values come from `eval_scores` on the `/chat` response, computed synchronously and persisted to `logs/evals.jsonl`.
75
 
76
  ![eval pills rendered inline under an AAC response](docs/images/eval-pills.png)
77
 
@@ -401,7 +409,6 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
401
  - [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) — a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
402
  - [x] **[Core]** Air-writing uses a vision LLM (`gemma4:31b-cloud` via Ollama Cloud, configurable through `INK_VISION_MODEL`) instead of the older in-browser DTW template bank. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) — index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200×200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X — incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
403
  - [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload — `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` — which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) — only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
404
- - [ ] Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal.
405
 
406
  ### Intent decomposition
407
 
@@ -422,7 +429,6 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
422
  - return a canned response if we blow the budget entirely
423
  - threshold is 3.5s, spec says 6s — pick one
424
  - [ ] **[Bonus]** Cache encoded user-turn embeddings across the session — `build_context_vector` re-encodes the same recent turns every turn (~50ms steady-state cost)
425
- - [ ] **[Scale]** past ~100k chunks per user, swap torch matmul for `hnswlib`; consider a cross-encoder reranker (e.g. `bge-reranker-base`) if `rerank_pool_k` grows past ~30
426
 
427
  ### Generation
428
 
@@ -433,34 +439,48 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
433
 
434
  ### Evals
435
 
436
- Scoring runs synchronously on the `/chat` response path and the `eval_scores` dict is returned inline so the frontend can render pills on the AAC bubble immediately. Each scored turn is also appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings from the UI go to `logs/ratings.jsonl`.
437
 
438
- | Metric | Status | Where |
439
- |--------|--------|-------|
440
- | Efficiency | per-turn SLO on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
441
- | Faithfulness | sentence-level NLI, `no_evidence` short-circuit | [faithfulness.py](backend/evals/faithfulness.py) |
442
- | Multimodal alignment | affect (sentiment lexicon), gesture (opener regex), gaze (bucket overlap) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
443
- | Authenticity | star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
444
 
445
- **First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first score after a server restart, so turn 1 pays a one-time ~2-3s warmup on top of the LLM call. Every turn after that adds ~100-300ms for sentence-level scoring.
446
 
447
  - [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
448
  - [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
449
  - [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
450
  - [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
451
  - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
452
- - [ ] **[Eval]** Relevance score — one NLI call per turn asking "does the response address the partner's query?" Fills the biggest current gap: a perfectly grounded but off-topic reply scores 100% grounded today and we'd never catch it
453
- - [ ] **[Eval]** Candidate diversity — mean pairwise cosine distance among the 3 candidates in a picker round. Low diversity = picker showing three paraphrases of the same answer (the "aloha" problem), which is a signal that retrieval or temperature needs tuning for that query
454
- - [ ] **[Eval]** Picker-aware metrics from `turns.jsonl` + `picks.jsonl`: which strategy wins most (`broad` vs `focused` vs `serendipitous` vs `side_index`), pick rate (% of turns where user clicked a card), regenerate rate (% of turns where user clicked "try again"). All computable offline, no runtime cost
455
- - [ ] **[Eval]** Score alternate candidates too, not just the selected one. Right now `compute_evals` only scores `selected_response`; scoring all 3 would let us measure whether the picker actually improves quality over taking candidate 0 blindly
456
- - [ ] **[Eval]** UI coverage gap: `compute_evals` returns 12 fields but `EvalPanel` renders only 5 pills (latency, grounded, affect, gesture, gaze). Hallucination rate, overall multimodal_alignment, SLO target/margin are computed and logged but never surfaced in the bubble. Decide what belongs as a pill vs a tooltip-only number vs an offline-only log field
457
- - [ ] **[Eval]** On pill hover, tooltip should explain *how* the number was computed, not just what it means. Today the `title` attributes say "Groundedness: fraction of response sentences supported by retrieved memories" which is the definition but not the math. Want: "5/8 sentences had NLI entailment prob ≥ 0.5 against the retrieved chunks" for groundedness; "3/4 positive-lexicon words matched HAPPY target" for affect; raw scorer inputs + thresholds exposed inline so the number isn't a black box
458
 
459
  ### Cleanup
460
 
461
  - [x] delete `backend/sensing/` (dead code, sensing is in frontend) — done, only `labels.py` remains
462
  - [x] per-persona affect overrides (`_PERSONA_TONE_OVERRIDES`) deleted — redundant with `stylistic_preferences` in the new persona JSONs
463
 
464
  ---
465
 
466
  ## Team
 
71
 
72
  ### Per-turn eval pills in the UI
73
 
74
+ Every AAC-user bubble renders its eval scores inline. Up to seven pills, depending on what the turn produced:
75
+
76
+ - **SLO latency badge** — `t_total` against the configured target (default 6s). Green ✓ on pass, red ✗ on miss.
77
+ - **`grounded`** — sentence-level NLI faithfulness against retrieved memories. Renders `—` when there's no retrieved evidence to check (e.g. a `PRESENT_STATE` turn that skipped retrieval).
78
+ - **`relevant`** — BGE cosine similarity between query and response embeddings. Catches the "perfectly grounded but off-topic" failure mode that groundedness can't see.
79
+ - **`affect` / `gesture` / `gaze`** — multimodal alignment: sentiment match against the detected affect, opener-pattern match against the detected gesture, fraction of retrieved chunks from the gazed-at bucket.
80
+ - **`diversity`** — mean pairwise cosine distance across the candidate slate (only shown when ≥2 candidates). Low values flag the "aloha problem" — three paraphrases of the same answer.
81
+
82
+ Pills go green / grey / red on 0.75 / 0.4 thresholds. **Hover any pill** for a tooltip with the actual math from this turn (e.g. "2/2 sentences had NLI entailment prob ≥ 0.50", "3/3 retrieved chunks were from the family bucket"), powered by the `explain` block each scorer returns alongside its score. Authenticity stars sit on the right; clicking one fires `POST /feedback/rating` → `logs/ratings.jsonl`. All pill values come from `eval_scores` on the `/chat` response, computed in a `BackgroundTask` after the response returns and persisted to `logs/evals.jsonl`.
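
A rough sketch of the colour rule described above. The function name `pill_colour`, the `"none"` return value for a missing score, and the choice of which side of each threshold is inclusive are all assumptions; the README states only the 0.75 / 0.4 cut-offs.

```python
from typing import Optional


def pill_colour(score: Optional[float], hi: float = 0.75, lo: float = 0.4) -> str:
    """Map a 0..1 eval score to a pill colour (hypothetical helper)."""
    if score is None:
        return "none"  # e.g. groundedness with no retrieved evidence
    if score >= hi:    # assumption: the upper threshold is inclusive
        return "green"
    if score < lo:     # assumption: the lower threshold is exclusive
        return "red"
    return "grey"


colours = [pill_colour(s) for s in (0.9, 0.5, 0.1, None)]
```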
83
 
84
  ![eval pills rendered inline under an AAC response](docs/images/eval-pills.png)
85
 
 
409
  - [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) — a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
410
  - [x] **[Core]** Air-writing uses a vision LLM (`gemma4:31b-cloud` via Ollama Cloud, configurable through `INK_VISION_MODEL`) instead of the older in-browser DTW template bank. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) — index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200×200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X — incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
411
  - [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload — `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` — which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) — only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
 
412
 
413
  ### Intent decomposition
414
 
 
429
  - return a canned response if we blow the budget entirely
430
  - threshold is 3.5s, spec says 6s — pick one
431
  - [ ] **[Bonus]** Cache encoded user-turn embeddings across the session — `build_context_vector` re-encodes the same recent turns every turn (~50ms steady-state cost)
 
432
 
433
  ### Generation
434
 
 
439
 
440
  ### Evals
441
 
442
+ Scoring runs in a FastAPI `BackgroundTask` after `/chat` returns, so it never blocks the response. The frontend polls `GET /evals/{run_id}` to render pills once they're ready. Each scored turn is appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings go to `logs/ratings.jsonl`. Picks go to `logs/picks.jsonl`.
443
+
444
+ | Metric | What it answers | Where |
445
+ |--------|-----------------|-------|
446
+ | Efficiency | SLO pass/fail on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
447
+ | Faithfulness (`grounded`) | Did the response stick to retrieved memories, or hallucinate? Sentence-level NLI; `no_evidence` short-circuit when nothing was retrieved | [faithfulness.py](backend/evals/faithfulness.py) |
448
+ | Relevance (`relevant`) | Did the response actually address the partner's query? BGE cosine query↔response | [relevance.py](backend/evals/relevance.py) |
449
+ | Multimodal alignment | `affect` (sentiment lexicon vs target), `gesture` (opener regex vs detected tag), `gaze` (matched/total retrieved chunks vs gazed bucket) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
450
+ | Candidate diversity | Are the picker's candidates actually different, or paraphrases? Mean pairwise cosine distance over the candidate slate | [diversity.py](backend/evals/diversity.py) |
451
+ | Per-candidate breakdown | Each candidate scored for `grounded` + `relevance` (not just the selected one) — answers "did the picker beat candidate 0?" offline | `candidates_eval` block in [evals/__init__.py](backend/evals/__init__.py) |
452
+ | Authenticity | Star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
453
+
454
+ **Performance note.** When the turn produces multiple candidates, scoring is fully batched: a single NLI `model.predict` over all `(candidate × sentence × chunk)` pairs and a single BGE `embed_texts` over `[query, c1, c2, c3]` (the candidate vectors feed both relevance and diversity). The selected candidate's per-candidate score is reused as the top-level pill values rather than re-scored. End result: 1 NLI pass + 1 BGE pass per turn regardless of candidate count.
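
To make the batching concrete, here is a hedged sketch of flattening every `(candidate × sentence × chunk)` pair into a single predict call and slicing the results back per candidate. `batched_groundedness`, the `predict` callable, the max-over-chunks aggregation, and the 0.5 default threshold are assumptions for illustration; the real scorer lives in `faithfulness.py` and may differ.

```python
from typing import Callable, Optional, Sequence


def batched_groundedness(
    candidates: Sequence[Sequence[str]],   # sentences per candidate
    chunks: Sequence[str],                 # retrieved memory texts
    predict: Callable[[list[tuple[str, str]]], list[float]],
    threshold: float = 0.5,
) -> list[Optional[float]]:
    if not chunks:
        return [None] * len(candidates)    # no evidence to check against

    # Flatten all (premise, hypothesis) pairs and remember who owns each one.
    pairs: list[tuple[str, str]] = []
    owners: list[tuple[int, int]] = []
    for ci, sentences in enumerate(candidates):
        for si, sentence in enumerate(sentences):
            for chunk in chunks:
                pairs.append((chunk, sentence))
                owners.append((ci, si))

    probs = predict(pairs) if pairs else []  # the single model call

    # Keep the best entailment probability per (candidate, sentence).
    best: dict[tuple[int, int], float] = {}
    for owner, prob in zip(owners, probs):
        best[owner] = max(best.get(owner, 0.0), prob)

    scores: list[Optional[float]] = []
    for ci, sentences in enumerate(candidates):
        if not sentences:
            scores.append(None)
            continue
        supported = sum(best[(ci, si)] >= threshold for si in range(len(sentences)))
        scores.append(supported / len(sentences))
    return scores


calls: list[int] = []


def fake_predict(pairs: list[tuple[str, str]]) -> list[float]:
    # Stand-in for an NLI model: exact match counts as entailment.
    calls.append(len(pairs))
    return [1.0 if premise == hypothesis else 0.0 for premise, hypothesis in pairs]


chunks = ["maria is my sister", "i love hiking"]
candidates = [
    ["maria is my sister", "i love hiking"],
    ["i play chess"],
    ["i love hiking", "i own a boat"],
]
scores = batched_groundedness(candidates, chunks, fake_predict)
```

Whatever the number of candidates, `fake_predict` is called once; only the length of the pair list grows.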
455
 
456
+ **First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first score after a server restart, so turn 1 pays a one-time ~2-3s warmup. Every turn after that adds ~100-300ms for sentence-level scoring.
 
 
 
 
 
457
 
458
+ **Offline analysis.** `python -m backend.evals.aggregate` joins all four log files and prints per-persona reports: latency p50/p95/p99 by tier, mean groundedness/hallucination, multimodal alignment coverage, picker behaviour (pick rate, regenerate rate, strategy win rate, "did picker beat cand 0?", diversity floor), and authenticity Likert distribution.
459
 
460
  - [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
461
  - [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
462
  - [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
463
  - [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
464
  - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
465
+ - [x] **[Eval]** Relevance score — BGE cosine similarity between query and response. Originally specced as an NLI call, but a question rarely *entails* its answer (the on-topic and off-topic NLI scores both pinned near 10⁻⁴), so the embedder we already load for retrieval is the right tool. Fills the gap where a perfectly grounded but off-topic reply scored 100% grounded. See [backend/evals/relevance.py](backend/evals/relevance.py).
466
+ - [x] **[Eval]** Candidate diversity — mean pairwise cosine distance among the 3 candidates in a picker round, computed on BGE embeddings (no extra model). Low diversity = picker showing three paraphrases of the same answer (the "aloha" problem), which is a signal that retrieval or temperature needs tuning for that query. See [backend/evals/diversity.py](backend/evals/diversity.py).
467
+ - [x] **[Eval]** Picker-aware metrics. `report_picker` in [backend/evals/aggregate.py](backend/evals/aggregate.py) joins `turns.jsonl` + `picks.jsonl` + `evals.jsonl` and prints: pick rate (% of multi-candidate turns where the user clicked a card), regenerate rate (% of (user, turn_id) pairs that ran the planner more than once), strategy win rate among committed picks, head-to-head "did picker beat candidate 0 on grounded/relevance" using the per-candidate scoring from L453, and diversity coverage (% of turns with mean pairwise cosine distance < 0.10, the "aloha" floor). Run via `python -m backend.evals.aggregate`.
468
+ - [x] **[Eval]** Score alternate candidates too, not just the selected one. `compute_evals` now scores groundedness + relevance for every candidate and stamps which one was selected; full breakdown lands in `eval_scores.candidates_eval` and `logs/evals.jsonl`, top-level pills still describe the selected response. Unlocks "did the picker actually beat candidate 0?" offline analysis.
469
+ - [x] **[Eval]** UI coverage. `EvalPanel` now also renders a relevance pill (BGE cosine query↔response) and a candidate-diversity pill (mean pairwise cosine distance, hidden when fewer than 2 candidates). Hallucination rate is conveyed inside the grounded tooltip rather than as its own pill (it's `1 - groundedness`, no extra info). SLO margin is in the latency tooltip. See [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx).
470
+ - [x] **[Eval]** Tooltip math. Every pill's `title` now shows the actual computation, not just the definition. Each scorer returns its raw inputs in an `explain` block (sentence count + entailment threshold for groundedness, pos/neg word counts + sentiment for affect, matched/total chunks for gaze, gesture pattern match), and `EvalPanel` formats them into specific tooltips like "2/2 sentences had NLI entailment prob ≥ 0.50". See `groundednessTip` / `affectTip` / `gestureTip` / `gazeTip` in [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx) and `explain` in [multimodal_alignment.py](backend/evals/multimodal_alignment.py).
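
The pick-rate and regenerate-rate definitions in the list above can be sketched as follows. The record fields (`run_id`, `user_id`, `turn_id`, `n_candidates`) and both helper names are hypothetical; the actual schema of `turns.jsonl` and `picks.jsonl` is not shown here.

```python
from collections import Counter


def pick_rate(turns: list[dict], picks: list[dict]) -> float:
    # Share of multi-candidate turns where the user clicked a card.
    multi = {t["run_id"] for t in turns if t["n_candidates"] >= 2}
    if not multi:
        return 0.0
    picked = {p["run_id"] for p in picks} & multi
    return len(picked) / len(multi)


def regenerate_rate(turns: list[dict]) -> float:
    # Share of (user, turn_id) pairs that ran the planner more than once.
    runs = Counter((t["user_id"], t["turn_id"]) for t in turns)
    if not runs:
        return 0.0
    return sum(1 for n in runs.values() if n > 1) / len(runs)


# Toy log records (assumed shape).
turns = [
    {"run_id": "r1", "user_id": "u1", "turn_id": 1, "n_candidates": 3},
    {"run_id": "r2", "user_id": "u1", "turn_id": 1, "n_candidates": 3},  # regenerated
    {"run_id": "r3", "user_id": "u1", "turn_id": 2, "n_candidates": 1},
    {"run_id": "r4", "user_id": "u2", "turn_id": 1, "n_candidates": 3},
]
picks = [{"run_id": "r2"}, {"run_id": "r3"}]

pr = pick_rate(turns, picks)
rr = regenerate_rate(turns)
```

Both numbers are computed purely from logged records, so they add no runtime cost to the chat path.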
471
 
472
  ### Cleanup
473
 
474
  - [x] delete `backend/sensing/` (dead code, sensing is in frontend) — done, only `labels.py` remains
475
  - [x] per-persona affect overrides (`_PERSONA_TONE_OVERRIDES`) deleted — redundant with `stylistic_preferences` in the new persona JSONs
476
 
477
+ ### Out of scope
478
+
479
+ Not in the spec — engineering nice-to-haves we'd pick up if the rest is done. Don't block grading on these.
480
+
481
+ - [ ] Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal. _(Sensing — untagged in spec)_
482
+ - [ ] **[Scale]** past ~100k chunks per user, swap torch matmul for `hnswlib`; consider a cross-encoder reranker (e.g. `bge-reranker-base`) if `rerank_pool_k` grows past ~30 _(Retrieval — far beyond current corpus size)_
483
+
484
  ---
485
 
486
  ## Team
backend/api/main.py CHANGED
@@ -168,6 +168,10 @@ class RatingRequest(BaseModel):
168
  # ── Helpers ────────────────────────────────────────────────────────────────────
169
 
170
 
 
 
 
 
171
  def _load_persona_profile(user_id: str) -> dict:
172
  memories_path = settings.memories_dir / f"{user_id}.json"
173
  try:
@@ -351,6 +355,8 @@ def _compute_and_persist_evals(
351
  affect: str,
352
  gesture_tag: str | None,
353
  gaze_bucket: str | None,
 
 
354
  ) -> dict | None:
355
  if not settings.evals_enabled or not run_id:
356
  return None
@@ -363,6 +369,8 @@ def _compute_and_persist_evals(
363
  gesture_tag=gesture_tag,
364
  gaze_bucket=gaze_bucket,
365
  slo_target=settings.slo_target_s,
 
 
366
  )
367
  except Exception:
368
  _log.exception("evals scoring failed for run %s", run_id)
@@ -434,6 +442,8 @@ def chat(req: ChatRequest, background_tasks: BackgroundTasks):
434
  affect=affect_emotion,
435
  gesture_tag=req.gesture_tag,
436
  gaze_bucket=req.gaze_bucket,
 
 
437
  )
438
 
439
  return ChatResponse(
@@ -512,6 +522,7 @@ def chat_stream(req: ChatRequest):
512
  run_id = state.get("run_id")
513
 
514
  # Evals run off the response path; UI polls GET /evals/{run_id}.
 
515
  if run_id and settings.evals_enabled:
516
  _reserve_eval_slot(run_id)
517
  threading.Thread(
@@ -526,6 +537,8 @@ def chat_stream(req: ChatRequest):
526
  affect=affect_emotion,
527
  gesture_tag=req.gesture_tag,
528
  gaze_bucket=req.gaze_bucket,
 
 
529
  ),
530
  daemon=True,
531
  ).start()
@@ -534,7 +547,7 @@ def chat_stream(req: ChatRequest):
534
  "user_id": req.user_id,
535
  "query": req.query,
536
  "response": state["selected_response"] or "",
537
- "candidates": [dict(c) for c in state.get("candidates") or []],
538
  "affect": affect_emotion,
539
  "llm_tier": state.get("llm_tier_used", "unknown"),
540
  "llm_model": state.get("llm_model_used", "unknown"),
@@ -660,6 +673,8 @@ def chat_turnaround(req: TurnaroundRequest, background_tasks: BackgroundTasks):
660
  affect=affect_emotion,
661
  gesture_tag=replan_state.get("gesture_tag"),
662
  gaze_bucket=replan_state.get("gaze_bucket"),
 
 
663
  )
664
 
665
  return ChatResponse(
@@ -827,6 +842,7 @@ def chat_regenerate_stream(req: RegenerateRequest):
827
 
828
  affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
829
  run_id = replan_state.get("run_id")
 
830
  eval_scores = _compute_and_persist_evals(
831
  run_id=run_id,
832
  user_id=req.user_id,
@@ -837,12 +853,14 @@ def chat_regenerate_stream(req: RegenerateRequest):
837
  affect=affect_emotion,
838
  gesture_tag=replan_state.get("gesture_tag"),
839
  gaze_bucket=replan_state.get("gaze_bucket"),
 
 
840
  )
841
  final = {
842
  "user_id": req.user_id,
843
  "query": replan_state["raw_query"],
844
  "response": replan_state["selected_response"] or "",
845
- "candidates": [dict(c) for c in replan_state.get("candidates") or []],
846
  "affect": affect_emotion,
847
  "llm_tier": replan_state.get("llm_tier_used", "unknown"),
848
  "llm_model": replan_state.get("llm_model_used", "unknown"),
@@ -942,6 +960,8 @@ def chat_regenerate(req: RegenerateRequest):
942
  affect=affect_emotion,
943
  gesture_tag=replan_state.get("gesture_tag"),
944
  gaze_bucket=replan_state.get("gaze_bucket"),
 
 
945
  )
946
 
947
  return ChatResponse(
 
168
  # ── Helpers ────────────────────────────────────────────────────────────────────
169
 
170
 
171
+ def _candidate_dicts(state) -> list[dict]:
172
+ return [dict(c) for c in (state.get("candidates") or [])]
173
+
174
+
175
  def _load_persona_profile(user_id: str) -> dict:
176
  memories_path = settings.memories_dir / f"{user_id}.json"
177
  try:
 
355
  affect: str,
356
  gesture_tag: str | None,
357
  gaze_bucket: str | None,
358
+ query: str = "",
359
+ candidates: list[dict] | None = None,
360
  ) -> dict | None:
361
  if not settings.evals_enabled or not run_id:
362
  return None
 
369
  gesture_tag=gesture_tag,
370
  gaze_bucket=gaze_bucket,
371
  slo_target=settings.slo_target_s,
372
+ query=query,
373
+ candidates=candidates,
374
  )
375
  except Exception:
376
  _log.exception("evals scoring failed for run %s", run_id)
 
442
  affect=affect_emotion,
443
  gesture_tag=req.gesture_tag,
444
  gaze_bucket=req.gaze_bucket,
445
+ query=req.query,
446
+ candidates=_candidate_dicts(result),
447
  )
448
 
449
  return ChatResponse(
 
522
  run_id = state.get("run_id")
523
 
524
  # Evals run off the response path; UI polls GET /evals/{run_id}.
525
+ cand_dicts = _candidate_dicts(state)
526
  if run_id and settings.evals_enabled:
527
  _reserve_eval_slot(run_id)
528
  threading.Thread(
 
537
  affect=affect_emotion,
538
  gesture_tag=req.gesture_tag,
539
  gaze_bucket=req.gaze_bucket,
540
+ query=req.query,
541
+ candidates=cand_dicts,
542
  ),
543
  daemon=True,
544
  ).start()
 
547
  "user_id": req.user_id,
548
  "query": req.query,
549
  "response": state["selected_response"] or "",
550
+ "candidates": cand_dicts,
551
  "affect": affect_emotion,
552
  "llm_tier": state.get("llm_tier_used", "unknown"),
553
  "llm_model": state.get("llm_model_used", "unknown"),
 
673
  affect=affect_emotion,
674
  gesture_tag=replan_state.get("gesture_tag"),
675
  gaze_bucket=replan_state.get("gaze_bucket"),
676
+ query=replan_state.get("raw_query") or "",
677
+ candidates=_candidate_dicts(replan_state),
678
  )
679
 
680
  return ChatResponse(
 
842
 
843
  affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
844
  run_id = replan_state.get("run_id")
845
+ cand_dicts = _candidate_dicts(replan_state)
846
  eval_scores = _compute_and_persist_evals(
847
  run_id=run_id,
848
  user_id=req.user_id,
 
853
  affect=affect_emotion,
854
  gesture_tag=replan_state.get("gesture_tag"),
855
  gaze_bucket=replan_state.get("gaze_bucket"),
856
+ query=replan_state.get("raw_query") or "",
857
+ candidates=cand_dicts,
858
  )
859
  final = {
860
  "user_id": req.user_id,
861
  "query": replan_state["raw_query"],
862
  "response": replan_state["selected_response"] or "",
863
+ "candidates": cand_dicts,
864
  "affect": affect_emotion,
865
  "llm_tier": replan_state.get("llm_tier_used", "unknown"),
866
  "llm_model": replan_state.get("llm_model_used", "unknown"),
 
960
  affect=affect_emotion,
961
  gesture_tag=replan_state.get("gesture_tag"),
962
  gaze_bucket=replan_state.get("gaze_bucket"),
963
+ query=replan_state.get("raw_query") or "",
964
+ candidates=_candidate_dicts(replan_state),
965
  )
966
 
967
  return ChatResponse(
backend/evals/__init__.py CHANGED
@@ -1,7 +1,51 @@
1
  # Evaluation metrics — compute after pipeline returns, before API response.
 
 
 
2
  from backend.evals.efficiency import compute_efficiency
3
- from backend.evals.faithfulness import compute_faithfulness
4
  from backend.evals.multimodal_alignment import compute_multimodal_alignment
5
 
6
 
7
  def compute_evals(
@@ -12,18 +56,54 @@ def compute_evals(
12
  gesture_tag: str | None,
13
  gaze_bucket: str | None,
14
  slo_target: float = 6.0,
 
 
 
15
  ) -> dict:
16
- """Run all eval scorers and return a unified EvalScores dict."""
17
- faith = compute_faithfulness(response, chunks)
 
 
 
 
 
18
  eff = compute_efficiency(latency_log, slo_target)
19
  align = compute_multimodal_alignment(
20
  response, affect, gesture_tag, gaze_bucket, chunks
21
  )
22
 
23
- return {
24
- "groundedness": faith["groundedness"],
25
- "hallucination_rate": faith["hallucination_rate"],
26
- "no_evidence": faith["no_evidence"],
27
  "t_total_s": eff["t_total"],
28
  "slo_target_s": eff["slo_target"],
29
  "slo_passed": eff["slo_passed"],
@@ -32,4 +112,19 @@ def compute_evals(
32
  "affect_alignment": align["affect_alignment"],
33
  "gesture_alignment": align["gesture_alignment"],
34
  "gaze_alignment": align["gaze_alignment"],
 
35
  }
1
  # Evaluation metrics — compute after pipeline returns, before API response.
2
+ import torch
3
+
4
+ from backend.evals.diversity import compute_candidate_diversity
5
  from backend.evals.efficiency import compute_efficiency
6
+ from backend.evals.faithfulness import compute_faithfulness, compute_faithfulness_batch
7
  from backend.evals.multimodal_alignment import compute_multimodal_alignment
8
+ from backend.evals.relevance import compute_relevance
9
+
10
+
11
+ def _score_candidates_batched(
12
+ candidates: list[dict],
13
+ chunks: list[dict],
14
+ query: str,
15
+ ) -> tuple[list[dict], "torch.Tensor | None"]:
16
+ """One BGE pass + one NLI pass across all candidates. Returns per-candidate
17
+ score dicts and the candidate vector matrix (for diversity reuse), or None
18
+ when no embedding pass was needed."""
19
+ texts = [c.get("text", "") for c in candidates]
20
+ faiths = compute_faithfulness_batch(texts, chunks)
21
+
22
+ cand_vecs: torch.Tensor | None = None
23
+ if query.strip() and any(t.strip() for t in texts):
24
+ from backend.retrieval.vector_store import embed_texts
25
+
26
+ vecs = embed_texts([query] + texts)
27
+ q_vec = vecs[0]
28
+ cand_vecs = vecs[1:]
29
+ relevances = [
30
+ round(max(0.0, float(q_vec @ cand_vecs[i])), 4) for i in range(len(texts))
31
+ ]
32
+ else:
33
+ relevances = [0.0] * len(texts)
34
+
35
+ scores = [{**faiths[i], "relevance": relevances[i]} for i in range(len(candidates))]
36
+ return scores, cand_vecs
37
+
38
+
39
+ def _diversity_from_vecs(cand_vecs: "torch.Tensor") -> dict:
40
+ n = cand_vecs.shape[0]
41
+ if n < 2:
42
+ return {"candidate_diversity": 0.0, "n_candidates": n}
43
+ sims = cand_vecs @ cand_vecs.T
44
+ iu = torch.triu_indices(n, n, offset=1)
45
+ return {
46
+ "candidate_diversity": round(float(1.0 - sims[iu[0], iu[1]].mean().item()), 4),
47
+ "n_candidates": n,
48
+ }
49
 
50
 
51
  def compute_evals(
 
56
  gesture_tag: str | None,
57
  gaze_bucket: str | None,
58
  slo_target: float = 6.0,
59
+ query: str = "",
60
+ candidates: list[dict] | None = None,
61
+ selected_idx: int | None = None,
62
  ) -> dict:
63
+ """Run all eval scorers and return a unified EvalScores dict.
64
+
65
+ When candidates are provided, scoring is batched: one BGE encode for
66
+ query + all candidates, one NLI predict across all (sentence, chunk)
67
+ pairs, then sliced per candidate. The selected candidate's scores are
68
+ reused as the top-level fields so the existing UI pills keep working.
69
+ """
70
  eff = compute_efficiency(latency_log, slo_target)
71
  align = compute_multimodal_alignment(
72
  response, affect, gesture_tag, gaze_bucket, chunks
73
  )
74
 
75
+ per_cand: list[dict] = []
76
+ cand_vecs = None
77
+ if candidates:
78
+ if selected_idx is None and response:
79
+ for i, c in enumerate(candidates):
80
+ if c.get("text", "").strip() == response.strip():
81
+ selected_idx = i
82
+ break
83
+ scored, cand_vecs = _score_candidates_batched(candidates, chunks, query)
84
+ per_cand = [
85
+ {
86
+ "idx": i,
87
+ "strategy": c.get("strategy", "unknown"),
88
+ "selected": (selected_idx is not None and i == selected_idx),
89
+ **scored[i],
90
+ }
91
+ for i, c in enumerate(candidates)
92
+ ]
93
+
94
+ if per_cand and selected_idx is not None and 0 <= selected_idx < len(per_cand):
95
+ # Strip per-candidate-only keys before reusing as top-level scores.
96
+ top = {
97
+ k: v
98
+ for k, v in per_cand[selected_idx].items()
99
+ if k not in ("idx", "strategy", "selected")
100
+ }
101
+ else:
102
+ faith = compute_faithfulness(response, chunks)
103
+ top = {**faith, "relevance": compute_relevance(response, query)["relevance"]}
104
+
105
+ out = {
106
+ **top,
107
  "t_total_s": eff["t_total"],
108
  "slo_target_s": eff["slo_target"],
109
  "slo_passed": eff["slo_passed"],
 
112
  "affect_alignment": align["affect_alignment"],
113
  "gesture_alignment": align["gesture_alignment"],
114
  "gaze_alignment": align["gaze_alignment"],
115
+ "explain": align.get("explain", {}),
116
  }
117
+
118
+ if per_cand:
119
+ out["candidates_eval"] = per_cand
120
+ # Reuse cand_vecs from the relevance pass when available; falls back to
121
+ # standalone BGE encode (e.g. when query was empty).
122
+ if cand_vecs is not None:
123
+ out.update(_diversity_from_vecs(cand_vecs))
124
+ else:
125
+ out.update(compute_candidate_diversity(candidates))
126
+ else:
127
+ out["candidate_diversity"] = 0.0
128
+ out["n_candidates"] = 1 if response else 0
129
+
130
+ return out
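The selection logic in the new `compute_evals` can be read in isolation as two small steps: resolve which candidate was chosen, then copy that candidate's metrics up to the top level. The sketch below is an illustration only. The helper names `resolve_selected_idx` and `top_level_scores` are hypothetical and do not appear in the commit, and the score values are made up.

```python
# Hypothetical helpers illustrating the selection logic; not part of the commit.
PER_CANDIDATE_ONLY = ("idx", "strategy", "selected")


def resolve_selected_idx(candidates, response, selected_idx=None):
    """Return the explicit index, else the first candidate whose text equals the response."""
    if selected_idx is not None:
        return selected_idx
    if not response:
        return None
    for i, cand in enumerate(candidates):
        if cand.get("text", "").strip() == response.strip():
            return i
    return None


def top_level_scores(per_cand, selected_idx):
    """Copy the selected candidate's metrics, dropping bookkeeping keys.

    Returns None when there is no valid selection; in the diff this is the
    case that falls back to scoring `response` on its own.
    """
    if selected_idx is None or not 0 <= selected_idx < len(per_cand):
        return None
    return {
        k: v for k, v in per_cand[selected_idx].items() if k not in PER_CANDIDATE_ONLY
    }


candidates = [{"text": "Yes, I love hiking."}, {"text": " Maybe later. "}]
per_cand = [
    {"idx": 0, "strategy": "a", "selected": False, "groundedness": 0.5, "relevance": 0.7},
    {"idx": 1, "strategy": "b", "selected": True, "groundedness": 1.0, "relevance": 0.4},
]
idx = resolve_selected_idx(candidates, "Maybe later.")
print(idx, top_level_scores(per_cand, idx))
```

Matching on stripped text is a fallback for callers that do not pass `selected_idx`; if two candidates have identical text, the first one wins.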
backend/evals/aggregate.py CHANGED
@@ -149,6 +149,121 @@ def report_authenticity(ratings: list[dict]) -> None:
149
  print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
150
 
151
152
  def main() -> None:
153
  parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
154
  parser.add_argument("--logs", type=Path, default=settings.logs_dir)
@@ -157,11 +272,16 @@ def main() -> None:
157
  turns = _load(args.logs / "turns.jsonl")
158
  evals = _load(args.logs / "evals.jsonl")
159
  ratings = _load(args.logs / "ratings.jsonl")
 
160
 
161
- print(f"Loaded: {len(turns)} turns, {len(evals)} evals, {len(ratings)} ratings")
 
 
 
162
  report_latency(turns)
163
  report_faithfulness(evals)
164
  report_multimodal(evals)
 
165
  report_authenticity(ratings)
166
 
167
 
 
149
  print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
150
 
151
 
152
+ def report_picker(turns: list[dict], picks: list[dict], evals: list[dict]) -> None:
153
+ """Picker behaviour: pick rate, regenerate rate, strategy win rate, and
154
+ whether the user's pick beat candidate 0 on grounded/relevance.
155
+
156
+ Sources:
157
+ - turns.jsonl one row per turn, includes `candidates` and `n_candidates`
158
+ - picks.jsonl one row per /chat/pick — strategy, picked_idx, run_id
159
+ - evals.jsonl candidates_eval[] with per-candidate grounded + relevance
160
+ """
161
+ print("\n=== Picker Behaviour ===")
162
+ multi = [t for t in turns if (t.get("n_candidates") or 0) >= 2]
163
+ if not multi:
164
+ print(
165
+ "(no multi-candidate turns logged — older format or single-candidate runs)"
166
+ )
167
+ return
168
+
169
+ picks_by_run = {p["run_id"]: p for p in picks if p.get("run_id")}
170
+ evals_by_run = {e["run_id"]: e for e in evals if e.get("run_id")}
171
+
172
+ n_multi = len(multi)
173
+ n_picked = sum(1 for t in multi if t["run_id"] in picks_by_run)
174
+ # A (user_id, turn_id) seen more than once means the planner re-ran for
175
+ # the same partner query — that's a regenerate. The denominator is the
176
+ # number of distinct (user, turn) conversations that had at least one
177
+ # multi-candidate run, not the raw row count.
178
+ seen: dict[tuple[str, int], int] = defaultdict(int)
179
+ for t in multi:
180
+ seen[(t.get("user_id", "?"), t.get("turn_id", -1))] += 1
181
+ n_regenerated_turns = sum(1 for c in seen.values() if c > 1)
182
+ n_distinct_turns = max(1, len(seen))
183
+ print(
184
+ f"multi-candidate turns: {n_multi} ({n_distinct_turns} distinct) "
185
+ f"pick rate: {n_picked / n_multi:.0%} "
186
+ f"regenerate rate: {n_regenerated_turns / n_distinct_turns:.0%} "
187
+ f"(% of distinct turns that re-ran)"
188
+ )
189
+
190
+ # Strategy win rate — among multi-candidate picks only, how often does
191
+ # each strategy win. Picks on single-candidate turns aren't a real "win"
192
+ # (no alternative to lose to) so we filter them out.
193
+ multi_run_ids = {t["run_id"] for t in multi}
194
+ strategy_count: dict[str, int] = defaultdict(int)
195
+ for run_id, p in picks_by_run.items():
196
+ if run_id in multi_run_ids:
197
+ strategy_count[p.get("strategy", "unknown")] += 1
198
+ if strategy_count:
199
+ total = sum(strategy_count.values())
200
+ print(f"\nStrategy win rate (n={total} picks):")
201
+ print(f" {'strategy':<16} {'picks':>6} {'pct':>6}")
202
+ for s, n in sorted(strategy_count.items(), key=lambda x: -x[1]):
203
+ print(f" {s:<16} {n:>6} {n / total:>5.0%}")
204
+
205
+ # Did the picker beat candidate 0? Only meaningful when we have per-candidate
206
+ # eval scores AND the user picked a non-zero index. A "win" = picked
207
+ # candidate scored strictly higher on the metric than candidate 0.
208
+ head_to_head = []
209
+ for run_id, pick in picks_by_run.items():
210
+ ev = evals_by_run.get(run_id)
211
+ if not ev or not ev.get("candidates_eval"):
212
+ continue
213
+ cands = ev["candidates_eval"]
214
+ if len(cands) < 2:
215
+ continue
216
+ picked_idx = pick.get("picked_idx", 0)
217
+ if picked_idx == 0 or picked_idx >= len(cands):
218
+ continue
219
+ head_to_head.append(
220
+ {
221
+ "picked_grounded": cands[picked_idx]["groundedness"],
222
+ "cand0_grounded": cands[0]["groundedness"],
223
+ "picked_relevance": cands[picked_idx].get("relevance", 0.0),
224
+ "cand0_relevance": cands[0].get("relevance", 0.0),
225
+ }
226
+ )
227
+
228
+ if head_to_head:
229
+ n = len(head_to_head)
230
+ beat_grounded = sum(
231
+ 1 for h in head_to_head if h["picked_grounded"] > h["cand0_grounded"]
232
+ )
233
+ tied_grounded = sum(
234
+ 1 for h in head_to_head if h["picked_grounded"] == h["cand0_grounded"]
235
+ )
236
+ beat_rel = sum(
237
+ 1 for h in head_to_head if h["picked_relevance"] > h["cand0_relevance"]
238
+ )
239
+ print(f"\nDid picker beat candidate 0? (n={n} picks where picked_idx > 0)")
240
+ print(
241
+ f" groundedness: picker > cand0 = {beat_grounded}/{n} ({beat_grounded / n:.0%}), "
242
+ f"tied = {tied_grounded}/{n}"
243
+ )
244
+ print(f" relevance: picker > cand0 = {beat_rel}/{n} ({beat_rel / n:.0%})")
245
+ else:
246
+ print(
247
+ "\n(no picks of candidate 1+ with per-candidate eval data — can't measure picker quality yet)"
248
+ )
249
+
250
+ # Diversity: among multi-candidate turns with eval data, how often is the
251
+ # picker showing near-paraphrases (the "aloha" problem)?
252
+ div_scored = [
253
+ ev
254
+ for ev in evals_by_run.values()
255
+ if ev.get("n_candidates", 0) >= 2 and "candidate_diversity" in ev
256
+ ]
257
+ if div_scored:
258
+ diversities = [float(e["candidate_diversity"]) for e in div_scored]
259
+ low = sum(1 for d in diversities if d < 0.1)
260
+ print(
261
+ f"\nCandidate diversity (n={len(div_scored)} turns): "
262
+ f"mean={statistics.mean(diversities):.2f} "
263
+ f"low (<0.10): {low}/{len(div_scored)} ({low / len(div_scored):.0%})"
264
+ )
265
+
266
+
267
  def main() -> None:
268
  parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
269
  parser.add_argument("--logs", type=Path, default=settings.logs_dir)
 
272
  turns = _load(args.logs / "turns.jsonl")
273
  evals = _load(args.logs / "evals.jsonl")
274
  ratings = _load(args.logs / "ratings.jsonl")
275
+ picks = _load(args.logs / "picks.jsonl")
276
 
277
+ print(
278
+ f"Loaded: {len(turns)} turns, {len(evals)} evals, "
279
+ f"{len(picks)} picks, {len(ratings)} ratings"
280
+ )
281
  report_latency(turns)
282
  report_faithfulness(evals)
283
  report_multimodal(evals)
284
+ report_picker(turns, picks, evals)
285
  report_authenticity(ratings)
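The regenerate rate reported by `report_picker` counts distinct `(user_id, turn_id)` pairs that appear more than once among multi-candidate turns. A minimal standalone sketch of that counting, assuming rows shaped like the `turns.jsonl` entries described in the docstring (the function name `regenerate_rate` and the sample rows are hypothetical):

```python
from collections import Counter


def regenerate_rate(turns):
    """Fraction of distinct (user, turn) pairs that had more than one multi-candidate run."""
    multi = [t for t in turns if (t.get("n_candidates") or 0) >= 2]
    seen = Counter((t.get("user_id", "?"), t.get("turn_id", -1)) for t in multi)
    if not seen:
        return 0.0
    regenerated = sum(1 for count in seen.values() if count > 1)
    return regenerated / len(seen)


sample_turns = [
    {"user_id": "u1", "turn_id": 1, "n_candidates": 3},
    {"user_id": "u1", "turn_id": 1, "n_candidates": 3},  # same turn re-ran
    {"user_id": "u1", "turn_id": 2, "n_candidates": 3},
    {"user_id": "u2", "turn_id": 1, "n_candidates": 1},  # single candidate, ignored
]
print(regenerate_rate(sample_turns))
```

Using distinct pairs as the denominator keeps a turn that was regenerated five times from counting five times.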
286
 
287
 
backend/evals/diversity.py ADDED
@@ -0,0 +1,23 @@
1
+ import torch
2
+
3
+
4
+ def compute_candidate_diversity(candidates: list[dict]) -> dict:
5
+ """Mean pairwise cosine *distance* among candidate texts.
6
+
7
+ 0.0 = identical texts, 1.0 = orthogonal embeddings; the value can exceed 1.0 (up to 2.0) if similarities are negative.
8
+ """
9
+ texts = [c.get("text", "").strip() for c in candidates]
10
+ texts = [t for t in texts if t]
11
+ n = len(texts)
12
+ if n < 2:
13
+ return {"candidate_diversity": 0.0, "n_candidates": n}
14
+
15
+ from backend.retrieval.vector_store import embed_texts
16
+
17
+ vecs = embed_texts(texts)
18
+ sims = vecs @ vecs.T
19
+ iu = torch.triu_indices(n, n, offset=1)
20
+ return {
21
+ "candidate_diversity": round(float(1.0 - sims[iu[0], iu[1]].mean().item()), 4),
22
+ "n_candidates": n,
23
+ }
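The metric above is one minus the mean of the upper-triangle cosine similarities. The following sketch reproduces the arithmetic in plain Python so it can be checked by hand. It is an assumption-laden stand-in: it uses small made-up vectors instead of BGE embeddings and `itertools.combinations` instead of `torch.triu_indices`, and the function names are hypothetical.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def mean_pairwise_cosine_distance(vectors):
    """1 - mean cosine similarity over all unordered pairs; 0.0 for fewer than two vectors."""
    unit = [normalize(v) for v in vectors]
    if len(unit) < 2:
        return 0.0
    sims = [sum(a * b for a, b in zip(u, w)) for u, w in combinations(unit, 2)]
    return round(1.0 - sum(sims) / len(sims), 4)


print(mean_pairwise_cosine_distance([[1, 0], [1, 0]]))          # identical directions
print(mean_pairwise_cosine_distance([[1, 0], [0, 1]]))          # orthogonal
print(mean_pairwise_cosine_distance([[1, 0], [0, 1], [1, 0]]))  # two of three identical
```

With three candidates there are three pairs, so a single pair of near-paraphrases pulls the mean down by at most a third.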
backend/evals/faithfulness.py CHANGED
@@ -49,46 +49,76 @@ def _split_sentences(text: str) -> list[str]:
49
  return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
50
 
51
52
  def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
53
  """Sentence-level NLI: each sentence must be entailed by at least one chunk."""
54
- if not chunks:
55
- return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
56
 
57
- sentences = _split_sentences(response)
58
- # Too short to score meaningfully (one-word replies, fragments). Flagging as
59
- # no_evidence is honest: we're not scoring it, so it should be excluded from
60
- # groundedness averages downstream.
61
- if not sentences:
62
- return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
63
 
64
- chunk_texts = [c.get("text", "") for c in chunks if c.get("text")]
 
 
 
 
 
65
  if not chunk_texts:
66
- return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
67
-
68
- if len(sentences) > _MAX_SENTENCES:
69
- sentences = sentences[:_MAX_SENTENCES]
 
 
 
 
 
 
 
 
70
 
71
  model = _get_model()
72
- # NLI pair order: (premise, hypothesis). Chunks are evidence (premise),
73
- # response sentences are the claims being checked (hypothesis).
74
- pairs = [(chunk, sent) for sent in sentences for chunk in chunk_texts]
75
  with _predict_sem:
76
  logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
77
  probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
78
  entail = probs[:, _entail_idx]
79
 
 
 
80
  n_chunks = len(chunk_texts)
81
- threshold = settings.faithfulness_threshold
82
- grounded = 0
83
- for i in range(len(sentences)):
84
- sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
85
- if sent_scores.max() >= threshold:
86
- grounded += 1
87
-
88
- total = len(sentences)
89
- groundedness = grounded / total
90
- return {
91
- "groundedness": round(groundedness, 4),
92
- "hallucination_rate": round(1.0 - groundedness, 4),
93
- "no_evidence": False,
94
- }
 
49
  return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
50
 
51
 
52
+ def _no_evidence_result() -> dict:
53
+ return {
54
+ "groundedness": 0.0,
55
+ "hallucination_rate": 0.0,
56
+ "no_evidence": True,
57
+ "sentences_total": 0,
58
+ "sentences_grounded": 0,
59
+ "nli_threshold": settings.faithfulness_threshold,
60
+ }
61
+
62
+
63
+ def _score_from_entail(entail, sentences: list[str], n_chunks: int) -> dict:
64
+ threshold = settings.faithfulness_threshold
65
+ grounded = 0
66
+ for i in range(len(sentences)):
67
+ sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
68
+ if sent_scores.max() >= threshold:
69
+ grounded += 1
70
+ total = len(sentences)
71
+ return {
72
+ "groundedness": round(grounded / total, 4),
73
+ "hallucination_rate": round(1.0 - grounded / total, 4),
74
+ "no_evidence": False,
75
+ "sentences_total": total,
76
+ "sentences_grounded": grounded,
77
+ "nli_threshold": threshold,
78
+ }
79
+
80
+
81
  def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
82
  """Sentence-level NLI: each sentence must be entailed by at least one chunk."""
83
+ return compute_faithfulness_batch([response], chunks)[0]
 
84
 
 
 
 
 
 
 
85
 
86
+ def compute_faithfulness_batch(responses: list[str], chunks: list[dict]) -> list[dict]:
87
+ """Score multiple candidate responses against the same chunks in a single
88
+ model.predict call. Caller passes `responses` in candidate order; we return
89
+ results in the same order. Falls back to _no_evidence_result when there's
90
+ nothing to score for a given response."""
91
+ chunk_texts = [c.get("text", "") for c in chunks if c.get("text")] if chunks else []
92
  if not chunk_texts:
93
+ return [_no_evidence_result() for _ in responses]
94
+
95
+ per_resp_sentences = [
96
+ _split_sentences(r)[:_MAX_SENTENCES] if r else [] for r in responses
97
+ ]
98
+ pairs: list[tuple[str, str]] = []
99
+ for sents in per_resp_sentences:
100
+ for sent in sents:
101
+ for chunk in chunk_texts:
102
+ pairs.append((chunk, sent))
103
+ if not pairs:
104
+ return [_no_evidence_result() for _ in responses]
105
 
106
  model = _get_model()
 
 
 
107
  with _predict_sem:
108
  logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
109
  probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
110
  entail = probs[:, _entail_idx]
111
 
112
+ out: list[dict] = []
113
+ cursor = 0
114
  n_chunks = len(chunk_texts)
115
+ for sentences in per_resp_sentences:
116
+ if not sentences:
117
+ out.append(_no_evidence_result())
118
+ continue
119
+ n_pairs = len(sentences) * n_chunks
120
+ out.append(
121
+ _score_from_entail(entail[cursor : cursor + n_pairs], sentences, n_chunks)
122
+ )
123
+ cursor += n_pairs
124
+ return out
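The batching in `compute_faithfulness_batch` relies on a flat pair list laid out candidate by candidate, sentence by sentence, chunk by chunk, and on a cursor that walks the resulting score array. The sketch below shows that layout with hand-written entailment probabilities in place of the NLI model. The names `build_pairs` and `slice_scores` are hypothetical, and responses with no scorable sentences are returned as `None` here rather than the commit's no-evidence dict.

```python
def build_pairs(per_resp_sentences, chunk_texts):
    """Flatten to (premise=chunk, hypothesis=sentence) pairs in candidate order."""
    return [
        (chunk, sent)
        for sents in per_resp_sentences
        for sent in sents
        for chunk in chunk_texts
    ]


def slice_scores(entail, per_resp_sentences, n_chunks, threshold=0.5):
    """Walk the flat score list with a cursor and score each response's block."""
    out = []
    cursor = 0
    for sents in per_resp_sentences:
        if not sents:
            out.append(None)  # nothing to score; consumes no pairs
            continue
        n_pairs = len(sents) * n_chunks
        block = entail[cursor : cursor + n_pairs]
        grounded = sum(
            1
            for i in range(len(sents))
            if max(block[i * n_chunks : (i + 1) * n_chunks]) >= threshold
        )
        out.append(
            {
                "sentences_total": len(sents),
                "sentences_grounded": grounded,
                "groundedness": round(grounded / len(sents), 4),
            }
        )
        cursor += n_pairs
    return out


sentences = [["a", "b"], [], ["c"]]
chunks = ["x", "y"]
pairs = build_pairs(sentences, chunks)
fake_entail = [0.9, 0.1, 0.2, 0.3, 0.1, 0.6]  # one made-up score per pair
print(len(pairs), slice_scores(fake_entail, sentences, len(chunks)))
```

Because empty responses contribute no pairs, the cursor must skip them without advancing, which is what the `continue` branch does.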
 
 
 
 
backend/evals/multimodal_alignment.py CHANGED
@@ -77,11 +77,18 @@ def _gesture_alignment(response: str, gesture_tag: str | None) -> float:
77
  return 1.0 if pattern.search(response) else 0.0
78
 
79
 
80
- def _gaze_alignment(chunks: list[dict], gaze_bucket: str | None) -> float:
 
 
81
  if not gaze_bucket or not chunks:
82
- return 0.0
83
  matches = sum(1 for c in chunks if c.get("bucket") == gaze_bucket)
84
- return matches / len(chunks)
 
 
 
 
 
85
 
86
 
87
  def compute_multimodal_alignment(
@@ -92,16 +99,37 @@ def compute_multimodal_alignment(
92
  chunks: list[dict],
93
  ) -> dict:
94
  scores: dict[str, float] = {}
 
95
  if affect:
96
  scores["affect_alignment"] = _affect_alignment(response, affect)
 
 
 
 
 
 
 
97
  if gesture_tag:
98
  scores["gesture_alignment"] = _gesture_alignment(response, gesture_tag)
 
 
 
 
 
 
99
  if gaze_bucket:
100
- scores["gaze_alignment"] = _gaze_alignment(chunks, gaze_bucket)
 
 
 
 
 
 
101
  overall = sum(scores.values()) / len(scores) if scores else 0.0
102
  return {
103
  "overall_score": round(overall, 4),
104
  "affect_alignment": round(scores.get("affect_alignment", 0.0), 4),
105
  "gesture_alignment": round(scores.get("gesture_alignment", 0.0), 4),
106
  "gaze_alignment": round(scores.get("gaze_alignment", 0.0), 4),
 
107
  }
 
77
  return 1.0 if pattern.search(response) else 0.0
78
 
79
 
80
+ def _gaze_alignment(
81
+ chunks: list[dict], gaze_bucket: str | None
82
+ ) -> tuple[float, int, int]:
83
  if not gaze_bucket or not chunks:
84
+ return 0.0, 0, len(chunks) if chunks else 0
85
  matches = sum(1 for c in chunks if c.get("bucket") == gaze_bucket)
86
+ return matches / len(chunks), matches, len(chunks)
87
+
88
+
89
+ def _affect_breakdown(response: str) -> tuple[int, int]:
90
+ toks = _tokens(response)
91
+ return len(toks & _POSITIVE), len(toks & _NEGATIVE)
92
 
93
 
94
  def compute_multimodal_alignment(
 
99
  chunks: list[dict],
100
  ) -> dict:
101
  scores: dict[str, float] = {}
102
+ explain: dict[str, dict] = {}
103
  if affect:
104
  scores["affect_alignment"] = _affect_alignment(response, affect)
105
+ pos, neg = _affect_breakdown(response)
106
+ explain["affect"] = {
107
+ "target": affect,
108
+ "pos_words": pos,
109
+ "neg_words": neg,
110
+ "sentiment": round(_sentiment_score(response), 4),
111
+ }
112
  if gesture_tag:
113
  scores["gesture_alignment"] = _gesture_alignment(response, gesture_tag)
114
+ pattern = _GESTURE_OPENER_PATTERNS.get(gesture_tag)
115
+ explain["gesture"] = {
116
+ "tag": gesture_tag,
117
+ "has_pattern": pattern is not None,
118
+ "matched": bool(pattern.search(response)) if pattern else None,
119
+ }
120
  if gaze_bucket:
121
+ score, matches, total = _gaze_alignment(chunks, gaze_bucket)
122
+ scores["gaze_alignment"] = score
123
+ explain["gaze"] = {
124
+ "bucket": gaze_bucket,
125
+ "matched_chunks": matches,
126
+ "total_chunks": total,
127
+ }
128
  overall = sum(scores.values()) / len(scores) if scores else 0.0
129
  return {
130
  "overall_score": round(overall, 4),
131
  "affect_alignment": round(scores.get("affect_alignment", 0.0), 4),
132
  "gesture_alignment": round(scores.get("gesture_alignment", 0.0), 4),
133
  "gaze_alignment": round(scores.get("gaze_alignment", 0.0), 4),
134
+ "explain": explain,
135
  }
backend/evals/relevance.py ADDED
@@ -0,0 +1,16 @@
1
+ def compute_relevance(response: str, query: str) -> dict:
2
+ """BGE cosine similarity between query and response embeddings.
3
+
4
+ NLI entailment is the wrong abstraction here (a question rarely entails
5
+ its answer), so we use the same embedding space the retriever uses.
6
+ """
7
+ if not response.strip() or not query.strip():
8
+ return {"relevance": 0.0, "no_query": not query.strip()}
9
+
10
+ from backend.retrieval.vector_store import embed_texts
11
+
12
+ vecs = embed_texts([query, response])
13
+ return {
14
+ "relevance": round(max(0.0, float(vecs[0] @ vecs[1])), 4),
15
+ "no_query": False,
16
+ }
backend/retrieval/vector_store.py CHANGED
@@ -54,6 +54,15 @@ def get_embedder():
54
  return _get_embedder()
55
 
56
 
 
 
 
 
 
 
 
 
 
57
  # Index cache: one (vectors_tensor, meta) per user_id.
58
  _index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}
59
 
 
54
  return _get_embedder()
55
 
56
 
57
+ def embed_texts(texts: list[str]) -> torch.Tensor:
58
+ return _get_embedder().encode(
59
+ texts,
60
+ convert_to_tensor=True,
61
+ normalize_embeddings=True,
62
+ device=_DEVICE,
63
+ )
64
+
65
+
66
  # Index cache: one (vectors_tensor, meta) per user_id.
67
  _index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}
68
 
frontend/src/App.css CHANGED
@@ -662,6 +662,53 @@ input[type="text"]:hover {
662
  border-color: var(--border);
663
  }
664
665
  .eval-details {
666
  display: flex;
667
  flex-direction: column;
 
662
  border-color: var(--border);
663
  }
664
 
665
+ .tip {
666
+ position: relative;
667
+ display: inline-flex;
668
+ cursor: help;
669
+ }
670
+
671
+ .tip::after {
672
+ content: attr(data-tip);
673
+ position: absolute;
674
+ bottom: calc(100% + 8px);
675
+ left: 0;
676
+ background: #1f2937;
677
+ color: #f9fafb;
678
+ padding: 8px 10px;
679
+ border-radius: 6px;
680
+ font-size: 11px;
681
+ font-weight: 400;
682
+ line-height: 1.45;
683
+ letter-spacing: 0.1px;
684
+ white-space: pre-line;
685
+ width: max-content;
686
+ max-width: 320px;
687
+ opacity: 0;
688
+ pointer-events: none;
689
+ transition: opacity 0.08s ease-in 0.5s;
690
+ z-index: 100;
691
+ box-shadow: 0 4px 12px rgba(0, 0, 0, 0.25);
692
+ }
693
+
694
+ .tip::before {
695
+ content: "";
696
+ position: absolute;
697
+ bottom: calc(100% + 2px);
698
+ left: 12px;
699
+ border: 6px solid transparent;
700
+ border-top-color: #1f2937;
701
+ opacity: 0;
702
+ pointer-events: none;
703
+ transition: opacity 0.08s ease-in 0.5s;
704
+ z-index: 100;
705
+ }
706
+
707
+ .tip:hover::after,
708
+ .tip:hover::before {
709
+ opacity: 1;
710
+ }
711
+
712
  .eval-details {
713
  display: flex;
714
  flex-direction: column;
frontend/src/components/EvalPanel.tsx CHANGED
@@ -1,4 +1,4 @@
1
- import { useState } from "react";
2
  import { submitRating } from "../lib/api";
3
  import type { EvalScores } from "../types";
4
 
@@ -20,7 +20,162 @@ function fmt(score: number): string {
20
  return (score * 100).toFixed(0) + "%";
21
  }
22
 
23
- export function EvalPanel({
24
  runId,
25
  userId,
26
  latencyTotal,
@@ -34,6 +189,9 @@ export function EvalPanel({
34
  ? evalScores.slo_passed
35
  : latencyTotal > 0 && latencyTotal < sloTarget;
36
  const effectiveLatency = evalScores?.t_total_s ?? latencyTotal;
 
 
 
37
 
38
  async function rate(stars: number) {
39
  if (submitting || value !== null) return;
@@ -57,64 +215,76 @@ export function EvalPanel({
57
  <div className="eval-row">
58
  {effectiveLatency > 0 && (
59
  <span
60
- className={`slo-badge ${sloPassed ? "pass" : "fail"}`}
61
- title={`SLO target ${sloTarget.toFixed(1)}s`}
62
  >
63
- {effectiveLatency.toFixed(2)}s {sloPassed ? "✓" : "✗"}
 
 
64
  </span>
65
  )}
66
  {evalScores && (
67
  <>
68
- <span
69
- className={`eval-pill ${
70
- evalScores.no_evidence ? "muted" : gradeClass(evalScores.groundedness)
71
- }`}
72
- title={
73
- evalScores.no_evidence
74
- ? "No retrieved evidence — groundedness not scored"
75
- : `Groundedness: fraction of response sentences supported by retrieved memories (hallucination ${fmt(
76
- evalScores.hallucination_rate
77
- )})`
78
- }
79
- >
80
- grounded {evalScores.no_evidence ? "—" : fmt(evalScores.groundedness)}
81
  </span>
82
- <span
83
- className={`eval-pill ${gradeClass(evalScores.affect_alignment)}`}
84
- title="Affect alignment: does the response tone match the detected facial affect?"
85
- >
86
- affect {fmt(evalScores.affect_alignment)}
 
 
 
 
 
 
87
  </span>
88
- <span
89
- className={`eval-pill ${gradeClass(evalScores.gesture_alignment)}`}
90
- title="Gesture alignment: does the response acknowledge the detected hand gesture?"
91
- >
92
- gesture {fmt(evalScores.gesture_alignment)}
93
  </span>
94
- <span
95
- className={`eval-pill ${gradeClass(evalScores.gaze_alignment)}`}
96
- title="Gaze alignment: did retrieved chunks come from the bucket the user was looking at?"
97
- >
98
- gaze {fmt(evalScores.gaze_alignment)}
99
  </span>
 
 
 
 
 
 
 
100
  </>
101
  )}
102
- <div className="star-rating" title="Rate authenticity (1-5)">
103
- {[1, 2, 3, 4, 5].map((star) => (
104
- <button
105
- key={star}
106
- className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
107
- onMouseEnter={() => setHover(star)}
108
- onMouseLeave={() => setHover(0)}
109
- onClick={() => rate(star)}
110
- disabled={value !== null || submitting}
111
- >
112
-
113
- </button>
114
- ))}
115
- {value !== null && <span className="star-label">{value}/5</span>}
116
- </div>
 
 
117
  </div>
118
  </div>
119
  );
120
  }
 
 
 
1
+ import { memo, useState } from "react";
2
  import { submitRating } from "../lib/api";
3
  import type { EvalScores } from "../types";
4
 
 
20
  return (score * 100).toFixed(0) + "%";
21
  }
22
 
23
+ function buildTip(parts: {
24
+ title: string;
25
+ question: string;
26
+ how?: string;
27
+ thisTurn?: string;
28
+ fallback?: string;
29
+ }): string {
30
+ const header = `${parts.title} — ${parts.question}`;
31
+ if (parts.fallback) return `${header}\n\n${parts.fallback}`;
32
+ const sections = [parts.how, parts.thisTurn].filter(Boolean);
33
+ return sections.length ? `${header}\n\n${sections.join("\n\n")}` : header;
34
+ }
35
+
36
+ function groundednessTip(s: EvalScores): string {
37
+ const title = "GROUNDED";
38
+ if (s.no_evidence) {
39
+ return buildTip({
40
+ title,
41
+ question: "Did the response stick to the retrieved memories?",
42
+ fallback:
43
+ "Not scored: no memories were retrieved this turn (e.g. a 'how are you feeling?' question that skips retrieval).",
44
+ });
45
+ }
46
+ const total = s.sentences_total ?? 0;
47
+ const grounded = s.sentences_grounded ?? 0;
48
+ const thr = s.nli_threshold ?? 0.5;
49
+ return buildTip({
50
+ title,
51
+ question: "Did the response stick to the retrieved memories, or hallucinate?",
52
+ how:
53
+ `How: each sentence in the response is checked against each retrieved chunk with an NLI model. ` +
54
+ `A sentence counts as grounded if at least one chunk entails it with probability ≥ ${thr.toFixed(2)}.`,
55
+ thisTurn:
56
+ `This turn: ${grounded}/${total} sentences grounded → ${fmt(s.groundedness)}. ` +
57
+ `Hallucination = ${fmt(s.hallucination_rate)} (${total - grounded} unsupported).`,
58
+ });
59
+ }
60
+
61
+ function relevanceTip(s: EvalScores): string {
62
+ return buildTip({
63
+ title: "RELEVANT",
64
+ question: "Did the response actually address the partner's question?",
65
+ how:
66
+ "How: cosine similarity between the BGE embedding of the query and the embedding of the response. " +
67
+ "Higher = more semantically on-topic.",
68
+ thisTurn: `This turn: ${(s.relevance ?? 0).toFixed(3)} → ${fmt(s.relevance ?? 0)}.`,
69
+ });
70
+ }
71
+
72
+ function affectTip(s: EvalScores): string {
73
+ const question = "Does the response tone match the detected facial expression?";
74
+ const ex = s.explain?.affect;
75
+ if (!ex) {
76
+ return buildTip({ title: "AFFECT", question });
77
+ }
78
+ return buildTip({
79
+ title: "AFFECT",
80
+ question,
81
+ how:
82
+ "How: response sentiment is computed from positive vs negative word counts, " +
83
+ "then compared to the affect target.",
84
+ thisTurn:
85
+ `This turn: detected ${ex.target}, response sentiment = ${ex.sentiment.toFixed(2)} ` +
86
+ `(${ex.pos_words} positive word${ex.pos_words === 1 ? "" : "s"}, ` +
87
+ `${ex.neg_words} negative) → ${fmt(s.affect_alignment)}.`,
88
+ });
89
+ }
90
+
91
+ function gestureTip(s: EvalScores): string {
92
+ const title = "GESTURE";
93
+ const question = "Does the response opener acknowledge the detected hand gesture?";
94
+ const ex = s.explain?.gesture;
95
+ if (!ex) {
96
+ return buildTip({ title, question, fallback: "No gesture detected this turn — defaults to 0." });
97
+ }
98
+ if (!ex.has_pattern) {
99
+ return buildTip({
100
+ title,
101
+ question,
102
+ fallback: `Detected ${ex.tag}, but this gesture has no opener pattern to test — partial credit (50%).`,
103
+ });
104
+ }
105
+ return buildTip({
106
+ title,
107
+ question,
108
+ how: `How: regex check on the first words of the response (e.g. THUMBS_UP expects 'yes/sure/absolutely…').`,
109
+ thisTurn:
110
+ `This turn: detected ${ex.tag}, opener ${ex.matched ? "matched" : "did not match"} ` +
111
+ `→ ${ex.matched ? "100%" : "0%"}.`,
112
+ });
113
+ }
114
+
115
+ function gazeTip(s: EvalScores): string {
116
+ const title = "GAZE";
117
+ const question = "Did the retrieved memories come from the topic the user was looking at?";
118
+ const ex = s.explain?.gaze;
119
+ if (!ex) {
120
+ return buildTip({ title, question, fallback: "No gaze bucket detected this turn — defaults to 0." });
121
+ }
122
+ if (ex.total_chunks === 0) {
123
+ return buildTip({
124
+ title,
125
+ question,
126
+ fallback: `User was looking at: ${ex.bucket}. No chunks retrieved this turn — defaults to 0.`,
127
+ });
128
+ }
129
+ return buildTip({
130
+ title,
131
+ question,
132
+ how: `How: fraction of retrieved chunks whose 'bucket' label matches the gaze target.`,
133
+ thisTurn:
134
+ `This turn: user looking at ${ex.bucket}, ${ex.matched_chunks}/${ex.total_chunks} ` +
135
+ `retrieved chunks matched → ${fmt(s.gaze_alignment)}.`,
136
+ });
137
+ }
138
+
139
+ function diversityTip(s: EvalScores): string {
140
+ const title = "DIVERSITY";
141
+ const question = "How different are the candidate responses the picker showed?";
142
+ const n = s.n_candidates ?? 0;
143
+ const d = s.candidate_diversity ?? 0;
144
+ if (n < 2) {
145
+ return buildTip({ title, question, fallback: `Only ${n} candidate this turn — not meaningful.` });
146
+ }
147
+ return buildTip({
148
+ title,
149
+ question,
150
+ how:
151
+ "How: average pairwise cosine distance between BGE embeddings of the candidate texts. " +
152
+ "High = varied alternatives. Low = three paraphrases of the same answer (the 'aloha' problem).",
153
+ thisTurn: `This turn: ${n} candidates, mean pairwise distance = ${d.toFixed(3)} → ${fmt(d)}.`,
154
+ });
155
+ }
156
+
157
+ function sloTip(
158
+ s: EvalScores | null | undefined,
159
+ fallbackLatency: number,
160
+ fallbackTarget: number,
161
+ fallbackPassed: boolean,
162
+ ): string {
163
+ const latency = s?.t_total_s ?? fallbackLatency;
164
+ const target = s?.slo_target_s ?? fallbackTarget;
165
+ const passed = s?.slo_passed ?? fallbackPassed;
166
+ const margin = s?.slo_margin_s;
167
+ const sign = (margin ?? 0) >= 0 ? "+" : "";
168
+ const m = margin !== undefined ? `${sign}${margin.toFixed(2)}s` : "";
169
+ return buildTip({
170
+ title: "LATENCY",
171
+ question: "Did the response arrive within the SLO budget?",
172
+ thisTurn:
173
+ `Target: < ${target.toFixed(1)}s. ` +
174
+ `This turn: ${latency.toFixed(2)}s${m ? ` (${m} margin)` : ""} — ${passed ? "passed ✓" : "failed ✗"}.`,
175
+ });
176
+ }
177
+
178
+ function EvalPanelImpl({
179
  runId,
180
  userId,
181
  latencyTotal,
 
189
  ? evalScores.slo_passed
190
  : latencyTotal > 0 && latencyTotal < sloTarget;
191
  const effectiveLatency = evalScores?.t_total_s ?? latencyTotal;
192
+ const showDiversity =
193
+ evalScores && (evalScores.n_candidates ?? 0) >= 2;
194
+ const showRelevance = evalScores && evalScores.relevance !== undefined;
195
 
196
  async function rate(stars: number) {
197
  if (submitting || value !== null) return;
 
        <div className="eval-row">
          {effectiveLatency > 0 && (
            <span
+             className="tip"
+             data-tip={sloTip(evalScores, effectiveLatency, sloTarget, sloPassed)}
            >
+             <span className={`slo-badge ${sloPassed ? "pass" : "fail"}`}>
+               {effectiveLatency.toFixed(2)}s {sloPassed ? "✓" : "✗"}
+             </span>
            </span>
          )}
          {evalScores && (
            <>
+             <span className="tip" data-tip={groundednessTip(evalScores)}>
+               <span
+                 className={`eval-pill ${
+                   evalScores.no_evidence ? "muted" : gradeClass(evalScores.groundedness)
+                 }`}
+               >
+                 grounded {evalScores.no_evidence ? "" : fmt(evalScores.groundedness)}
+               </span>
              </span>
+             {showRelevance && (
+               <span className="tip" data-tip={relevanceTip(evalScores)}>
+                 <span className={`eval-pill ${gradeClass(evalScores.relevance ?? 0)}`}>
+                   relevant {fmt(evalScores.relevance ?? 0)}
+                 </span>
+               </span>
+             )}
+             <span className="tip" data-tip={affectTip(evalScores)}>
+               <span className={`eval-pill ${gradeClass(evalScores.affect_alignment)}`}>
+                 affect {fmt(evalScores.affect_alignment)}
+               </span>
              </span>
+             <span className="tip" data-tip={gestureTip(evalScores)}>
+               <span className={`eval-pill ${gradeClass(evalScores.gesture_alignment)}`}>
+                 gesture {fmt(evalScores.gesture_alignment)}
+               </span>
              </span>
+             <span className="tip" data-tip={gazeTip(evalScores)}>
+               <span className={`eval-pill ${gradeClass(evalScores.gaze_alignment)}`}>
+                 gaze {fmt(evalScores.gaze_alignment)}
+               </span>
              </span>
+             {showDiversity && (
+               <span className="tip" data-tip={diversityTip(evalScores)}>
+                 <span className={`eval-pill ${gradeClass(evalScores.candidate_diversity ?? 0)}`}>
+                   diversity {fmt(evalScores.candidate_diversity ?? 0)}
+                 </span>
+               </span>
+             )}
            </>
          )}
+         <span className="tip" data-tip="Rate how authentic this response felt as the persona (1 = off, 5 = spot on). Logged to ratings.jsonl.">
+           <div className="star-rating">
+             {[1, 2, 3, 4, 5].map((star) => (
+               <button
+                 key={star}
+                 className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
+                 onMouseEnter={() => setHover(star)}
+                 onMouseLeave={() => setHover(0)}
+                 onClick={() => rate(star)}
+                 disabled={value !== null || submitting}
+               >
+
+               </button>
+             ))}
+             {value !== null && <span className="star-label">{value}/5</span>}
+           </div>
+         </span>
        </div>
      </div>
    );
  }
+
+ export const EvalPanel = memo(EvalPanelImpl);
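The pills above all pass through `gradeClass`, whose body is not part of this hunk. The earlier README text describes pills going green / grey / red on 0.75 / 0.4 thresholds, so a plausible shape is sketched below. The class names `good` / `neutral` / `bad`, the inclusive boundaries, and the helper names are assumptions; only `eval-pill` and `muted` appear in the diff.

```typescript
// Hypothetical stand-in for gradeClass; the real implementation is not shown.
type Grade = "good" | "neutral" | "bad"; // assumed CSS class names

// Assumption: thresholds are inclusive at the lower edge of each band.
function gradeClassSketch(score: number, hi = 0.75, lo = 0.4): Grade {
  if (score >= hi) return "good"; // rendered green
  if (score >= lo) return "neutral"; // rendered grey
  return "bad"; // rendered red
}

// Mirrors the groundedness pill: a turn with no retrieved evidence is
// shown as "muted" instead of being graded.
function pillClass(score: number, noEvidence: boolean): string {
  return `eval-pill ${noEvidence ? "muted" : gradeClassSketch(score)}`;
}
```

One consequence worth checking in review: the relevance and diversity pills fall back to `?? 0`, which under this sketch would render red, so the `showRelevance` / `showDiversity` guards are what keep a missing score from looking like a failing one.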
frontend/src/types.ts CHANGED
@@ -69,10 +69,43 @@ export interface LatencyLog {
    t_total: number;
  }

+ export interface CandidateEval {
+   idx: number;
+   strategy: string;
+   selected: boolean;
+   groundedness: number;
+   hallucination_rate: number;
+   no_evidence: boolean;
+   relevance: number;
+ }
+
+ export interface EvalExplain {
+   affect?: {
+     target: string;
+     pos_words: number;
+     neg_words: number;
+     sentiment: number;
+   };
+   gesture?: {
+     tag: string;
+     has_pattern: boolean;
+     matched: boolean | null;
+   };
+   gaze?: {
+     bucket: string;
+     matched_chunks: number;
+     total_chunks: number;
+   };
+ }
+
  export interface EvalScores {
    groundedness: number;
    hallucination_rate: number;
    no_evidence: boolean;
+   sentences_total?: number;
+   sentences_grounded?: number;
+   nli_threshold?: number;
+   relevance?: number;
    t_total_s: number;
    slo_target_s: number;
    slo_passed: boolean;
@@ -81,6 +114,10 @@ export interface EvalScores {
    affect_alignment: number;
    gesture_alignment: number;
    gaze_alignment: number;
+   candidate_diversity?: number;
+   n_candidates?: number;
+   candidates_eval?: CandidateEval[];
+   explain?: EvalExplain;
  }
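The commit message says the per-candidate scores exist so we can ask "did the picker beat candidate 0?" offline. A minimal sketch of that check over `candidates_eval` might look like the following. The helper name `pickBeatCandidateZero`, the strict-greater comparison, and the sample values are assumptions; the actual join in `aggregate.py` is not shown in this diff.

```typescript
// Local copy of the CandidateEval shape added to frontend/src/types.ts.
interface CandidateEval {
  idx: number;
  strategy: string;
  selected: boolean;
  groundedness: number;
  hallucination_rate: number;
  no_evidence: boolean;
  relevance: number;
}

type Metric = "groundedness" | "relevance";

// Hypothetical helper: true if the selected candidate scores strictly higher
// than candidate 0 on the given metric, null if either one is missing.
// Selecting candidate 0 itself counts as "did not beat".
function pickBeatCandidateZero(
  cands: CandidateEval[],
  metric: Metric,
): boolean | null {
  const baseline = cands.find((c) => c.idx === 0);
  const picked = cands.find((c) => c.selected);
  if (!baseline || !picked) return null;
  return picked[metric] > baseline[metric];
}

// Made-up sample values, for illustration only.
const sample: CandidateEval[] = [
  { idx: 0, strategy: "strategy_a", selected: false, groundedness: 0.5,
    hallucination_rate: 0.5, no_evidence: false, relevance: 0.7 },
  { idx: 1, strategy: "strategy_b", selected: true, groundedness: 1.0,
    hallucination_rate: 0.0, no_evidence: false, relevance: 0.6 },
];
```

On this sample the pick beats candidate 0 on groundedness but not on relevance, which is the kind of split the offline report would need to surface per metric.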

  export type CandidateStrategy =