shwetangisingh committed on
Commit
ed5dd6f
·
1 Parent(s): c511e09

Add faithfulness + authenticity evals; run scoring off the response path

Adds NLI-based groundedness scoring (cross-encoder/nli-deberta-v3-small) and
per-turn Likert capture via POST /feedback/rating. compute_evals now runs in
FastAPI BackgroundTasks and appends to logs/evals.jsonl, keyed by run_id. The
new aggregate.py prints per-persona latency/faithfulness/alignment/rating rollups.
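
Because each eval record is keyed by `run_id`, scores can be joined back to their turns offline. The following is a minimal sketch of that join, assuming both logs are JSONL and that every turn record carries a `run_id` field. The `read_jsonl` and `join_by_run_id` helpers and the sample records are hypothetical and are not part of this commit.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line; a missing file yields []."""
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def join_by_run_id(turns: list[dict], evals: list[dict]) -> list[dict]:
    """Attach the eval record (or None if not scored yet) to each turn."""
    evals_by_run = {e["run_id"]: e for e in evals if "run_id" in e}
    return [{"turn": t, "eval": evals_by_run.get(t.get("run_id"))} for t in turns]


def demo() -> list[dict]:
    """Write two made-up turns and one made-up eval, then join them."""
    with tempfile.TemporaryDirectory() as tmp:
        logs = Path(tmp)
        turns = [
            {"run_id": "a" * 32, "user_id": "stephen_hawking"},
            {"run_id": "b" * 32, "user_id": "stephen_hawking"},
        ]
        evals = [{"run_id": "a" * 32, "groundedness": 0.75}]
        (logs / "turns.jsonl").write_text(
            "".join(json.dumps(t) + "\n" for t in turns), encoding="utf-8"
        )
        (logs / "evals.jsonl").write_text(
            "".join(json.dumps(e) + "\n" for e in evals), encoding="utf-8"
        )
        return join_by_run_id(
            read_jsonl(logs / "turns.jsonl"), read_jsonl(logs / "evals.jsonl")
        )
```

A turn whose background score has not landed yet (or whose scoring failed) simply joins to `None`, which is worth handling explicitly in any downstream analysis.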

README.md CHANGED
@@ -230,6 +230,10 @@ Everything is a Pydantic setting in [backend/config/settings.py](backend/config/
230
  | `RERANK_LAMBDA` | `0.7` | MMR balance: `1.0` = pure cosine relevance, lower = more diversity. Drop to `0.5` if results look repetitive. |
231
  | `RERANK_QUERY_WEIGHT` | `0.7` | Weight on the current turn vs the mean of recent user turns when building the rerank query. Lower if follow-ups under-weight prior context. |
232
  | `LOGS_DIR` | `logs` | Where the per-turn JSONL goes. |
233
 
234
  ---
235
 
@@ -272,6 +276,17 @@ curl -X POST http://localhost:8000/chat \
272
  -d '{"user_id": "stephen_hawking", "query": "What do you like to do on weekends?"}'
273
  ```
274

275
  ---
276
 
277
  ## Project Structure
@@ -295,6 +310,7 @@ multimodal_aac_chatbot/
295
  │ ├── sensing/labels.py GESTURE_DIRECTIVES (sensing runs in browser)
296
  │ ├── retrieval/ BGE embeddings (torch tensor) + bucket priors
297
  │ ├── generation/llm_client.py 2-tier Ollama Cloud LLM client (primary/fallback)
 
298
  │ └── guardrails/checks.py Input + output safety checks
299
 
300
  ├── data/
@@ -401,21 +417,22 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
401
 
402
  ### Evals
403
 
404
- Live per-turn scores show up in the `EvalPanel`. State:
405
 
406
- | Metric | Status |
407
- |--------|--------|
408
- | Efficiency | works (SLO check on `t_total`) |
409
- | Faithfulness | stub, returns 0 |
410
- | Multimodal alignment | works — affect (sentiment lexicon), gesture (opener regex), gaze (bucket match) |
411
- | Authenticity | star rating in UI but not saved |
412
 
413
- - [ ] **[Eval]** Faithfulness — actually check if the response is grounded in what we retrieved. NLI model, sentence-level. If we didn't retrieve anything, flag `no_evidence` instead of pretending we scored it
414
- - [ ] **[Eval]** Efficiency — per-turn SLO check is done, but for the writeup we need aggregate latency: p50/p95 across a fixed query set, broken out by LLM tier. Spec target is < 6s
415
- - [x] **[Eval]** Multimodal alignment — implemented in `backend/evals/multimodal_alignment.py`. Affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket. Returned on every turn as `multimodal_alignment` / `affect_alignment` / `gesture_alignment` / `gaze_alignment`
416
- - [ ] **[Eval]** Authenticity — the Likert stars are wired up in the UI but go nowhere. Save them, log them with the turn so we can actually look at them later
417
  - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
418
- - [ ] **[Eval]** Need an offline version of all three model-driven evals (faithfulness / alignment / efficiency). Aggregate numbers across a fixed query set per persona for the writeup
419
 
420
  ### Cleanup
421
 
 
230
  | `RERANK_LAMBDA` | `0.7` | MMR balance: `1.0` = pure cosine relevance, lower = more diversity. Drop to `0.5` if results look repetitive. |
231
  | `RERANK_QUERY_WEIGHT` | `0.7` | Weight on the current turn vs the mean of recent user turns when building the rerank query. Lower if follow-ups under-weight prior context. |
232
  | `LOGS_DIR` | `logs` | Where the per-turn JSONL goes. |
233
+ | `SLO_TARGET_S` | `6.0` | Latency SLO used by the efficiency eval. |
234
+ | `EVALS_ENABLED` | `true` | Toggle off to skip background eval scoring. |
235
+ | `NLI_MODEL` | `cross-encoder/nli-deberta-v3-small` | NLI model used for the groundedness/hallucination scorer. |
236
+ | `FAITHFULNESS_THRESHOLD` | `0.5` | Per-sentence entailment probability needed to count as grounded. |
237
 
238
  ---
239
 
 
276
  -d '{"user_id": "stephen_hawking", "query": "What do you like to do on weekends?"}'
277
  ```
278
 
279
+ ### Offline eval aggregation
280
+
281
+ After a few turns have been logged, print a per-persona report:
282
+
283
+ ```bash
284
+ conda activate aac-chatbot
285
+ python -m backend.evals.aggregate
286
+ ```
287
+
288
+ Output covers latency quantiles + SLO pass rate, faithfulness (groundedness / hallucination), multimodal alignment, and the distribution of Likert ratings. Reads `logs/turns.jsonl`, `logs/evals.jsonl`, and `logs/ratings.jsonl`.
289
+
290
  ---
291
 
292
  ## Project Structure
 
310
  │ ├── sensing/labels.py GESTURE_DIRECTIVES (sensing runs in browser)
311
  │ ├── retrieval/ BGE embeddings (torch tensor) + bucket priors
312
  │ ├── generation/llm_client.py 2-tier Ollama Cloud LLM client (primary/fallback)
313
+ │ ├── evals/ faithfulness (NLI), efficiency, multimodal, aggregate CLI
314
  │ └── guardrails/checks.py Input + output safety checks
315
 
316
  ├── data/
 
417
 
418
  ### Evals
419
 
420
+ Scoring runs out-of-band via FastAPI `BackgroundTasks` after `/chat` returns — the response path stays clean. Each scored turn is appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings from the UI go to `logs/ratings.jsonl`.
421
+
422
+ | Metric | Status | Where |
423
+ |--------|--------|-------|
424
+ | Efficiency | per-turn SLO on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
425
+ | Faithfulness | sentence-level NLI, `no_evidence` short-circuit | [faithfulness.py](backend/evals/faithfulness.py) |
426
+ | Multimodal alignment | affect (sentiment lexicon), gesture (opener regex), gaze (bucket overlap) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
427
+ | Authenticity | star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
428
 
429
+ **First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first background score after a server restart. Turn 1's score lands a few seconds after the response; every turn after that is fast.
430
 
431
+ - [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
432
+ - [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
433
+ - [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
434
+ - [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
435
  - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
 
436
 
437
  ### Cleanup
438
 
backend/api/main.py CHANGED
@@ -3,10 +3,13 @@ from __future__ import annotations
3
 
4
  import json
5
  import logging
6
 
7
- from fastapi import FastAPI, HTTPException
8
  from fastapi.middleware.cors import CORSMiddleware
9
- from pydantic import BaseModel
10
 
11
  from backend.config.settings import settings
12
  from backend.evals import compute_evals
@@ -38,6 +41,8 @@ app.add_middleware(
38
 
39
  _log = logging.getLogger(__name__)
40
  _models_ready = False
 
 
41
 
42
 
43
  @app.on_event("startup")
@@ -78,20 +83,6 @@ class TurnaroundRequest(BaseModel):
78
  head_signal: str | None = None
79
 
80
 
81
- class EvalScoresResponse(BaseModel):
82
- groundedness: float
83
- hallucination_rate: float
84
- no_evidence: bool
85
- t_total_s: float
86
- slo_target_s: float
87
- slo_passed: bool
88
- slo_margin_s: float
89
- multimodal_alignment: float
90
- affect_alignment: float
91
- gesture_alignment: float
92
- gaze_alignment: float
93
-
94
-
95
  class ChatResponse(BaseModel):
96
  user_id: str
97
  query: str
@@ -102,10 +93,18 @@ class ChatResponse(BaseModel):
102
  retrieval_mode: str
103
  latency: dict
104
  guardrail_passed: bool
105
- eval_scores: EvalScoresResponse | None = None
106
  turn_id: int
107
 
108

109
  # ── Helpers ────────────────────────────────────────────────────────────────────
110
 
111
 
@@ -275,8 +274,44 @@ def reset_session(user_id: str):
275
  return {"status": "reset", "user_id": user_id}
276
 
277

278
  @app.post("/chat", response_model=ChatResponse)
279
- def chat(req: ChatRequest):
280
  guard = check_input(req.query)
281
  if not guard["allowed"]:
282
  return ChatResponse(
@@ -297,22 +332,26 @@ def chat(req: ChatRequest):
297
 
298
  result: PipelineState = run_pipeline(initial_state)
299
 
300
- # Persist updated session state
301
  session["session_history"] = result["session_history"]
302
  session["bucket_priors"] = result["bucket_priors"]
303
  session["last_state"] = result
304
 
305
- # Compute evaluation metrics
306
  affect_emotion = (result.get("affect") or {}).get("emotion", "NEUTRAL")
307
- eval_scores = compute_evals(
308
- response=result["selected_response"] or "",
309
- chunks=result.get("retrieved_chunks") or [],
310
- latency_log=result.get("latency_log") or {},
311
- affect=affect_emotion,
312
- gesture_tag=req.gesture_tag,
313
- gaze_bucket=req.gaze_bucket,
314
- slo_target=settings.slo_target_s,
315
- )
316
 
317
  return ChatResponse(
318
  user_id=req.user_id,
@@ -324,13 +363,13 @@ def chat(req: ChatRequest):
324
  retrieval_mode=result.get("retrieval_mode_used", "unknown"),
325
  latency=result.get("latency_log") or {},
326
  guardrail_passed=result.get("guardrail_passed", True),
327
- eval_scores=eval_scores,
328
  turn_id=result["turn_id"],
329
  )
330
 
331
 
332
  @app.post("/chat/turnaround", response_model=ChatResponse)
333
- def chat_turnaround(req: TurnaroundRequest):
334
  if req.user_id not in _sessions:
335
  raise HTTPException(status_code=404, detail="no active session")
336
 
@@ -400,15 +439,21 @@ def chat_turnaround(req: TurnaroundRequest):
400
  session["last_state"] = replan_state
401
 
402
  affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
403
- eval_scores = compute_evals(
404
- response=replan_state["selected_response"] or "",
405
- chunks=replan_state.get("retrieved_chunks") or [],
406
- latency_log=replan_state.get("latency_log") or {},
407
- affect=affect_emotion,
408
- gesture_tag=replan_state.get("gesture_tag"),
409
- gaze_bucket=replan_state.get("gaze_bucket"),
410
- slo_target=settings.slo_target_s,
411
- )
412
 
413
  return ChatResponse(
414
  user_id=req.user_id,
@@ -420,6 +465,25 @@ def chat_turnaround(req: TurnaroundRequest):
420
  retrieval_mode=replan_state.get("retrieval_mode_used", "unknown"),
421
  latency=replan_state.get("latency_log") or {},
422
  guardrail_passed=replan_state.get("guardrail_passed", True),
423
- eval_scores=eval_scores,
424
  turn_id=replan_state["turn_id"],
425
  )
 
3
 
4
  import json
5
  import logging
6
+ import re
7
+ import time
8
+ from pathlib import Path
9
 
10
+ from fastapi import BackgroundTasks, FastAPI, HTTPException
11
  from fastapi.middleware.cors import CORSMiddleware
12
+ from pydantic import BaseModel, Field
13
 
14
  from backend.config.settings import settings
15
  from backend.evals import compute_evals
 
41
 
42
  _log = logging.getLogger(__name__)
43
  _models_ready = False
44
+ _RUN_ID_RE = re.compile(r"^[0-9a-f]{32}$")
45
+ _ID_PATTERN = r"^[a-zA-Z0-9_-]+$"
46
 
47
 
48
  @app.on_event("startup")
 
83
  head_signal: str | None = None
84
 
85

86
  class ChatResponse(BaseModel):
87
  user_id: str
88
  query: str
 
93
  retrieval_mode: str
94
  latency: dict
95
  guardrail_passed: bool
96
+ run_id: str | None = None
97
  turn_id: int
98
 
99
 
100
+ class RatingRequest(BaseModel):
101
+ run_id: str = Field(min_length=1, max_length=64, pattern=_ID_PATTERN)
102
+ user_id: str = Field(min_length=1, max_length=64, pattern=_ID_PATTERN)
103
+ authenticity: int = Field(ge=1, le=5)
104
+ rater_id: str = Field(default="anonymous", max_length=64, pattern=_ID_PATTERN)
105
+ notes: str | None = Field(default=None, max_length=500)
106
+
107
+
108
  # ── Helpers ────────────────────────────────────────────────────────────────────
109
 
110
 
 
274
  return {"status": "reset", "user_id": user_id}
275
 
276
 
277
+ def _score_and_persist(
278
+ run_id: str,
279
+ user_id: str,
280
+ turn_id: int,
281
+ response: str,
282
+ chunks: list[dict],
283
+ latency_log: dict,
284
+ affect: str,
285
+ gesture_tag: str | None,
286
+ gaze_bucket: str | None,
287
+ ) -> None:
288
+ try:
289
+ scores = compute_evals(
290
+ response=response,
291
+ chunks=chunks,
292
+ latency_log=latency_log,
293
+ affect=affect,
294
+ gesture_tag=gesture_tag,
295
+ gaze_bucket=gaze_bucket,
296
+ slo_target=settings.slo_target_s,
297
+ )
298
+ entry = {
299
+ "run_id": run_id,
300
+ "ts": time.time(),
301
+ "user_id": user_id,
302
+ "turn_id": turn_id,
303
+ **scores,
304
+ }
305
+ logs_dir = Path(settings.logs_dir)
306
+ logs_dir.mkdir(parents=True, exist_ok=True)
307
+ with open(logs_dir / "evals.jsonl", "a", encoding="utf-8") as f:
308
+ f.write(json.dumps(entry, ensure_ascii=False) + "\n")
309
+ except Exception:
310
+ _log.exception("evals background scoring failed for run %s", run_id)
311
+
312
+
313
  @app.post("/chat", response_model=ChatResponse)
314
+ def chat(req: ChatRequest, background_tasks: BackgroundTasks):
315
  guard = check_input(req.query)
316
  if not guard["allowed"]:
317
  return ChatResponse(
 
332
 
333
  result: PipelineState = run_pipeline(initial_state)
334
 
 
335
  session["session_history"] = result["session_history"]
336
  session["bucket_priors"] = result["bucket_priors"]
337
  session["last_state"] = result
338
 
 
339
  affect_emotion = (result.get("affect") or {}).get("emotion", "NEUTRAL")
340
+ run_id = result.get("run_id")
341
+
342
+ if settings.evals_enabled and run_id:
343
+ background_tasks.add_task(
344
+ _score_and_persist,
345
+ run_id=run_id,
346
+ user_id=req.user_id,
347
+ turn_id=result["turn_id"],
348
+ response=result["selected_response"] or "",
349
+ chunks=list(result.get("retrieved_chunks") or []),
350
+ latency_log=dict(result.get("latency_log") or {}),
351
+ affect=affect_emotion,
352
+ gesture_tag=req.gesture_tag,
353
+ gaze_bucket=req.gaze_bucket,
354
+ )
355
 
356
  return ChatResponse(
357
  user_id=req.user_id,
 
363
  retrieval_mode=result.get("retrieval_mode_used", "unknown"),
364
  latency=result.get("latency_log") or {},
365
  guardrail_passed=result.get("guardrail_passed", True),
366
+ run_id=run_id,
367
  turn_id=result["turn_id"],
368
  )
369
 
370
 
371
  @app.post("/chat/turnaround", response_model=ChatResponse)
372
+ def chat_turnaround(req: TurnaroundRequest, background_tasks: BackgroundTasks):
373
  if req.user_id not in _sessions:
374
  raise HTTPException(status_code=404, detail="no active session")
375
 
 
439
  session["last_state"] = replan_state
440
 
441
  affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
442
+ run_id = replan_state.get("run_id")
443
+
444
+ if settings.evals_enabled and run_id:
445
+ background_tasks.add_task(
446
+ _score_and_persist,
447
+ run_id=run_id,
448
+ user_id=req.user_id,
449
+ turn_id=replan_state["turn_id"],
450
+ response=replan_state["selected_response"] or "",
451
+ chunks=list(replan_state.get("retrieved_chunks") or []),
452
+ latency_log=dict(replan_state.get("latency_log") or {}),
453
+ affect=affect_emotion,
454
+ gesture_tag=replan_state.get("gesture_tag"),
455
+ gaze_bucket=replan_state.get("gaze_bucket"),
456
+ )
457
 
458
  return ChatResponse(
459
  user_id=req.user_id,
 
465
  retrieval_mode=replan_state.get("retrieval_mode_used", "unknown"),
466
  latency=replan_state.get("latency_log") or {},
467
  guardrail_passed=replan_state.get("guardrail_passed", True),
468
+ run_id=run_id,
469
  turn_id=replan_state["turn_id"],
470
  )
471
+
472
+
473
+ @app.post("/feedback/rating")
474
+ def submit_rating(req: RatingRequest):
475
+ if not _RUN_ID_RE.match(req.run_id):
476
+ raise HTTPException(status_code=400, detail="invalid run_id")
477
+ logs_dir = Path(settings.logs_dir)
478
+ logs_dir.mkdir(parents=True, exist_ok=True)
479
+ entry = {
480
+ "ts": time.time(),
481
+ "run_id": req.run_id,
482
+ "user_id": req.user_id,
483
+ "authenticity": req.authenticity,
484
+ "rater_id": req.rater_id,
485
+ "notes": req.notes,
486
+ }
487
+ with open(logs_dir / "ratings.jsonl", "a", encoding="utf-8") as f:
488
+ f.write(json.dumps(entry, ensure_ascii=False) + "\n")
489
+ return {"status": "ok"}
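
The rating endpoint accepts a payload only if it passes both the `RatingRequest` field constraints and the stricter 32-hex `run_id` check in the handler. Below is a rough, stdlib-only sketch of those rules, which could be useful for pre-validating a payload before sending it. `validate_rating` is a hypothetical helper written for illustration. The real validation is done by Pydantic and the handler shown in the diff, and Pydantic's type coercion may be more lenient than this sketch.

```python
import re

RUN_ID_RE = re.compile(r"[0-9a-f]{32}")  # mirrors _RUN_ID_RE (used with fullmatch)
ID_RE = re.compile(r"[a-zA-Z0-9_-]+")    # mirrors _ID_PATTERN (used with fullmatch)


def _is_id(value: object, min_len: int = 1, max_len: int = 64) -> bool:
    """True for a string of allowed characters within the length bounds."""
    return (
        isinstance(value, str)
        and min_len <= len(value) <= max_len
        and ID_RE.fullmatch(value) is not None
    )


def validate_rating(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    run_id = payload.get("run_id")
    if not isinstance(run_id, str) or RUN_ID_RE.fullmatch(run_id) is None:
        problems.append("run_id must be 32 lowercase hex characters")
    if not _is_id(payload.get("user_id")):
        problems.append("user_id must be 1-64 chars of [a-zA-Z0-9_-]")
    score = payload.get("authenticity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        problems.append("authenticity must be an integer from 1 to 5")
    if not _is_id(payload.get("rater_id", "anonymous")):
        problems.append("rater_id must be 1-64 chars of [a-zA-Z0-9_-]")
    notes = payload.get("notes")
    if notes is not None and (not isinstance(notes, str) or len(notes) > 500):
        problems.append("notes must be a string of at most 500 characters")
    return problems
```

Since the handler rejects any `run_id` that is not exactly 32 hex characters, the looser `max_length=64` bound on that field never becomes the deciding constraint in practice.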
backend/config/settings.py CHANGED
@@ -67,6 +67,11 @@ class Settings(BaseSettings):
67
 
68
  # ── Evaluation ────────────────────────────────────────────────────────────
69
  slo_target_s: float = 6.0 # max acceptable response latency (seconds)
70
 
71
 
72
  settings = Settings()
 
67
 
68
  # ── Evaluation ────────────────────────────────────────────────────────────
69
  slo_target_s: float = 6.0 # max acceptable response latency (seconds)
70
+ evals_enabled: bool = True
71
+ nli_model: str = "cross-encoder/nli-deberta-v3-small"
72
+ faithfulness_threshold: float = (
73
+ 0.5 # entailment prob for a sentence to count as grounded
74
+ )
75
 
76
 
77
  settings = Settings()
backend/evals/aggregate.py ADDED
@@ -0,0 +1,169 @@
 
1
+ """Offline aggregator: reads turns.jsonl, evals.jsonl, ratings.jsonl and prints
2
+ per-persona metrics. Run: python -m backend.evals.aggregate
3
+ """
4
+
5
+ import argparse
6
+ import json
7
+ import statistics
8
+ import sys
9
+ from collections import defaultdict
10
+ from pathlib import Path
11
+
12
+ from backend.config.settings import settings
13
+
14
+
15
+ def _load(path: Path) -> list[dict]:
16
+ if not path.exists():
17
+ return []
18
+ out = []
19
+ skipped = 0
20
+ with open(path, encoding="utf-8") as f:
21
+ for line in f:
22
+ line = line.strip()
23
+ if not line:
24
+ continue
25
+ try:
26
+ out.append(json.loads(line))
27
+ except json.JSONDecodeError:
28
+ skipped += 1
29
+ if skipped:
30
+ print(
31
+ f"[aggregate] skipped {skipped} malformed lines in {path}",
32
+ file=sys.stderr,
33
+ )
34
+ return out
35
+
36
+
37
+ def _quantile(values: list[float], q: float) -> float:
38
+ if not values:
39
+ return 0.0
40
+ if len(values) == 1:
41
+ return values[0]
42
+ idx = max(0, min(len(values) - 1, int(round(q * (len(values) - 1)))))
43
+ return sorted(values)[idx]
44
+
45
+
46
+ def _fmt_ms(s: float) -> str:
47
+ return f"{s * 1000:.0f}ms"
48
+
49
+
50
+ def report_latency(turns: list[dict]) -> None:
51
+ print("\n=== Communication Efficiency (latency) ===")
52
+ by_group: dict[tuple[str, str], list[float]] = defaultdict(list)
53
+ for t in turns:
54
+ key = (t.get("user_id", "?"), t.get("llm_tier", "?"))
55
+ by_group[key].append(t.get("latency", {}).get("t_total", 0.0))
56
+
57
+ slo = settings.slo_target_s
58
+ print(f"SLO target: < {slo}s")
59
+ print(
60
+ f"{'user_id':<18} {'tier':<10} {'n':>5} {'p50':>8} {'p95':>8} {'p99':>8} {'pass%':>7}"
61
+ )
62
+ for (uid, tier), lats in sorted(by_group.items()):
63
+ if not lats:
64
+ continue
65
+ p50 = _quantile(lats, 0.5)
66
+ p95 = _quantile(lats, 0.95)
67
+ p99 = _quantile(lats, 0.99)
68
+ passed = sum(1 for x in lats if x < slo) / len(lats) * 100
69
+ print(
70
+ f"{uid:<18} {tier:<10} {len(lats):>5} "
71
+ f"{_fmt_ms(p50):>8} {_fmt_ms(p95):>8} {_fmt_ms(p99):>8} {passed:>6.1f}%"
72
+ )
73
+
74
+
75
+ def report_faithfulness(evals: list[dict]) -> None:
76
+ print("\n=== Factual Faithfulness ===")
77
+ scored = [e for e in evals if not e.get("no_evidence")]
78
+ if not scored:
79
+ print("(no turns with retrieved evidence)")
80
+ return
81
+ by_user: dict[str, list[dict]] = defaultdict(list)
82
+ for e in scored:
83
+ by_user[e.get("user_id", "?")].append(e)
84
+
85
+ print(f"{'user_id':<18} {'n':>5} {'groundedness':>14} {'hallucination':>14}")
86
+ for uid, rows in sorted(by_user.items()):
87
+ g = statistics.mean(r["groundedness"] for r in rows)
88
+ h = statistics.mean(r["hallucination_rate"] for r in rows)
89
+ print(f"{uid:<18} {len(rows):>5} {g:>13.2%} {h:>13.2%}")
90
+
91
+
92
+ def _mean_nonzero(rows: list[dict], key: str) -> tuple[float, float]:
93
+ # Coverage % undercounts real zeros (a genuinely 0.0-aligned response looks
94
+ # identical to one where the signal was absent). Fixable by serializing
95
+ # null for absent signals in compute_multimodal_alignment.
96
+ vals = [float(r.get(key, 0.0)) for r in rows]
97
+ nonzero = [v for v in vals if v > 0]
98
+ if not nonzero:
99
+ return 0.0, 0.0
100
+ return statistics.mean(nonzero), len(nonzero) / len(vals)
101
+
102
+
103
+ def _fmt_mean_cov(rows: list[dict], key: str) -> str:
104
+ mean, cov = _mean_nonzero(rows, key)
105
+ return f"{mean:>5.0%}|{cov:>5.0%}"
106
+
107
+
108
+ def report_multimodal(evals: list[dict]) -> None:
109
+ print("\n=== Multimodal Alignment (mean among non-zero | coverage) ===")
110
+ if not evals:
111
+ print("(no evals logged)")
112
+ return
113
+ by_user: dict[str, list[dict]] = defaultdict(list)
114
+ for e in evals:
115
+ by_user[e.get("user_id", "?")].append(e)
116
+
117
+ print(f"{'user_id':<18} {'n':>5} {'affect':>16} {'gesture':>16} {'gaze':>16}")
118
+ for uid, rows in sorted(by_user.items()):
119
+ print(
120
+ f"{uid:<18} {len(rows):>5} "
121
+ f"{_fmt_mean_cov(rows, 'affect_alignment'):>16} "
122
+ f"{_fmt_mean_cov(rows, 'gesture_alignment'):>16} "
123
+ f"{_fmt_mean_cov(rows, 'gaze_alignment'):>16}"
124
+ )
125
+
126
+
127
+ def report_authenticity(ratings: list[dict]) -> None:
128
+ print("\n=== Perceived Authenticity (Likert 1-5) ===")
129
+ by_user: dict[str, list[int]] = defaultdict(list)
130
+ for r in ratings:
131
+ raw = r.get("authenticity")
132
+ try:
133
+ score = int(raw)
134
+ except (TypeError, ValueError):
135
+ continue
136
+ if not 1 <= score <= 5:
137
+ continue
138
+ by_user[r.get("user_id", "?")].append(score)
139
+
140
+ if not by_user:
141
+ print("(no valid ratings logged yet)")
142
+ return
143
+
144
+ print(f"{'user_id':<18} {'n':>5} {'mean':>6} {'dist (1..5)':>22}")
145
+ for uid, scores in sorted(by_user.items()):
146
+ mean = statistics.mean(scores)
147
+ dist = [scores.count(i) for i in range(1, 6)]
148
+ dist_str = "/".join(str(x) for x in dist)
149
+ print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
150
+
151
+
152
+ def main() -> None:
153
+ parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
154
+ parser.add_argument("--logs", type=Path, default=settings.logs_dir)
155
+ args = parser.parse_args()
156
+
157
+ turns = _load(args.logs / "turns.jsonl")
158
+ evals = _load(args.logs / "evals.jsonl")
159
+ ratings = _load(args.logs / "ratings.jsonl")
160
+
161
+ print(f"Loaded: {len(turns)} turns, {len(evals)} evals, {len(ratings)} ratings")
162
+ report_latency(turns)
163
+ report_faithfulness(evals)
164
+ report_multimodal(evals)
165
+ report_authenticity(ratings)
166
+
167
+
168
+ if __name__ == "__main__":
169
+ main()
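
`_quantile` picks a nearest-rank element instead of interpolating, and it relies on Python's built-in `round`, which rounds exact halves to the nearest even integer. The standalone copy below (renamed `nearest_rank_quantile` purely for illustration, with made-up latencies) shows what that means for a small sample.

```python
def nearest_rank_quantile(values: list[float], q: float) -> float:
    """Same index arithmetic as _quantile in aggregate.py, copied for illustration."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    idx = max(0, min(len(values) - 1, int(round(q * (len(values) - 1)))))
    return sorted(values)[idx]


# Made-up data: ten latencies of 1.0 .. 10.0 seconds.
latencies = [float(x) for x in range(1, 11)]

p50 = nearest_rank_quantile(latencies, 0.50)  # index round(4.5) = 4, value 5.0
p95 = nearest_rank_quantile(latencies, 0.95)  # index round(8.55) = 9, value 10.0
p99 = nearest_rank_quantile(latencies, 0.99)  # index round(8.91) = 9, value 10.0
```

With only ten samples, p95 and p99 both collapse to the maximum, so the tail columns in the latency report are only informative once a `user_id × llm_tier` group has many logged turns.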
backend/evals/faithfulness.py CHANGED
@@ -1,12 +1,94 @@
1
- # NLI-based faithfulness scoring.
2
- from __future__ import annotations
3
 
4
 
5
  def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
6
- """Compute groundedness and hallucination rate via NLI."""
7
- no_evidence = len(chunks) == 0
8
  return {
9
- "groundedness": 0.0,
10
- "hallucination_rate": 0.0,
11
- "no_evidence": no_evidence,
12
  }
 
1
+ import re
2
+ from threading import Lock, Semaphore
3
+
4
+ import torch
5
+
6
+ from backend.config.settings import settings
7
+
8
+ _model = None
9
+ _entail_idx: int | None = None
10
+ _model_lock = Lock()
11
+ _predict_sem = Semaphore(1)
12
+ _SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")
13
+ _MAX_SENTENCES = 20
14
+ _MIN_SENTENCE_WORDS = 2
15
+
16
+
17
+ def _get_model():
18
+ global _model, _entail_idx
19
+ if _model is not None:
20
+ return _model
21
+ with _model_lock:
22
+ if _model is not None:
23
+ return _model
24
+ from sentence_transformers import CrossEncoder
25
+
26
+ device = (
27
+ "mps"
28
+ if torch.backends.mps.is_available()
29
+ else "cuda"
30
+ if torch.cuda.is_available()
31
+ else "cpu"
32
+ )
33
+ model = CrossEncoder(settings.nli_model, device=device)
34
+ label2id = getattr(model.config, "label2id", None) or {}
35
+ for label, idx in label2id.items():
36
+ if label.lower() == "entailment":
37
+ _entail_idx = int(idx)
38
+ break
39
+ if _entail_idx is None:
40
+ raise RuntimeError(
41
+ f"NLI model {settings.nli_model!r} has no 'entailment' label"
42
+ )
43
+ _model = model
44
+ return _model
45
+
46
+
47
+ def _split_sentences(text: str) -> list[str]:
48
+ parts = [s.strip() for s in _SENT_SPLIT.split(text.strip()) if s.strip()]
49
+ return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
50
 
51
 
52
  def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
53
+ """Sentence-level NLI: each sentence must be entailed by at least one chunk."""
54
+ if not chunks:
55
+ return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
56
+
57
+ sentences = _split_sentences(response)
58
+ # Too short to score meaningfully (one-word replies, fragments). Flagging as
59
+ # no_evidence is honest: we're not scoring it, so it should be excluded from
60
+ # groundedness averages downstream.
61
+ if not sentences:
62
+ return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
63
+
64
+ chunk_texts = [c.get("text", "") for c in chunks if c.get("text")]
65
+ if not chunk_texts:
66
+ return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
67
+
68
+ if len(sentences) > _MAX_SENTENCES:
69
+ sentences = sentences[:_MAX_SENTENCES]
70
+
71
+ model = _get_model()
72
+ # NLI pair order: (premise, hypothesis). Chunks are evidence (premise),
73
+ # response sentences are the claims being checked (hypothesis).
74
+ pairs = [(chunk, sent) for sent in sentences for chunk in chunk_texts]
75
+ with _predict_sem:
76
+ logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
77
+ probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
78
+ entail = probs[:, _entail_idx]
79
+
80
+ n_chunks = len(chunk_texts)
81
+ threshold = settings.faithfulness_threshold
82
+ grounded = 0
83
+ for i in range(len(sentences)):
84
+ sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
85
+ if sent_scores.max() >= threshold:
86
+ grounded += 1
87
+
88
+ total = len(sentences)
89
+ groundedness = grounded / total
90
  return {
91
+ "groundedness": round(groundedness, 4),
92
+ "hallucination_rate": round(1.0 - groundedness, 4),
93
+ "no_evidence": False,
94
  }
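
The scoring rule in `compute_faithfulness` can be followed without loading the NLI model. In the sketch below, the model call is replaced by hand-written entailment probabilities, so the numbers are hypothetical. The splitting regex and the max-over-chunks threshold rule follow the diff, while the function names are illustrative only.

```python
import re

SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")  # same pattern as _SENT_SPLIT


def split_sentences(text: str, min_words: int = 2) -> list[str]:
    """Split on whitespace after ., ! or ? and drop fragments under min_words."""
    parts = [s.strip() for s in SENT_SPLIT.split(text.strip()) if s.strip()]
    return [s for s in parts if len(s.split()) >= min_words]


def groundedness(
    entail: list[float], n_sentences: int, n_chunks: int, threshold: float = 0.5
) -> float:
    """Fraction of sentences entailed by at least one chunk.

    `entail` is flat and sentence-major: entail[i * n_chunks + j] is the
    probability that chunk j entails sentence i.
    """
    grounded = 0
    for i in range(n_sentences):
        row = entail[i * n_chunks : (i + 1) * n_chunks]
        if max(row) >= threshold:
            grounded += 1
    return grounded / n_sentences


response = "I studied black holes. Yes. I also won three Olympic medals."
sentences = split_sentences(response)  # "Yes." is dropped (fewer than 2 words)

# Hypothetical entailment probabilities for 2 sentences x 2 chunks.
entail = [
    0.91, 0.10,  # sentence 0: chunk 0 supports it
    0.05, 0.20,  # sentence 1: no chunk reaches the threshold
]
score = groundedness(entail, len(sentences), n_chunks=2)
```

Here one of the two scored sentences is grounded, so groundedness is 0.5 and the hallucination rate is the complement, 0.5. A sentence counts as grounded as soon as any single chunk clears the threshold, which means the metric does not credit claims that are only supported by combining several chunks.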
frontend/src/components/ChatPanel.tsx CHANGED
@@ -85,7 +85,7 @@ export function ChatPanel({
85
  content: res.response,
86
  latency: res.latency,
87
  affect: res.affect,
88
- evalScores: res.eval_scores,
89
  turnId: res.turn_id,
90
  isTurnaround: true,
91
  });
@@ -174,7 +174,7 @@ export function ChatPanel({
174
  content: res.response,
175
  latency: res.latency,
176
  affect: res.affect,
177
- evalScores: res.eval_scores,
178
  turnId: res.turn_id,
179
  },
180
  ]);
@@ -224,8 +224,12 @@ export function ChatPanel({
224
  )}
225
  </span>
226
  <p>{msg.content}</p>
227
- {msg.role === "aac_user" && msg.evalScores && (
228
- <EvalPanel evalScores={msg.evalScores} />
229
  )}
230
  </div>
231
  ))}
 
85
  content: res.response,
86
  latency: res.latency,
87
  affect: res.affect,
88
+ runId: res.run_id,
89
  turnId: res.turn_id,
90
  isTurnaround: true,
91
  });
 
174
  content: res.response,
175
  latency: res.latency,
176
  affect: res.affect,
177
+ runId: res.run_id,
178
  turnId: res.turn_id,
179
  },
180
  ]);
 
224
  )}
225
  </span>
226
  <p>{msg.content}</p>
227
+ {msg.role === "aac_user" && msg.runId && userId && (
228
+ <EvalPanel
229
+ runId={msg.runId}
230
+ userId={userId}
231
+ latencyTotal={msg.latency?.t_total ?? 0}
232
+ />
233
  )}
234
  </div>
235
  ))}
frontend/src/components/EvalPanel.tsx CHANGED
@@ -1,137 +1,60 @@
1
  import { useState } from "react";
2
- import type { EvalScores } from "../types";
3
 
4
  interface Props {
5
- evalScores: EvalScores;
6
  }
7
 
8
- function ScoreBar({ value }: { value: number }) {
9
- const pct = Math.min(value * 100, 100);
10
- const color = pct > 70 ? "#4caf50" : pct > 40 ? "#ff9800" : "#f44336";
11
- return (
12
- <div className="score-bar">
13
- <div className="score-bar-fill" style={{ width: `${pct}%`, background: color }} />
14
- </div>
15
- );
16
- }
17
-
18
- function StarRating({
19
- value,
20
- onChange,
21
- }: {
22
- value: number | null;
23
- onChange: (v: number) => void;
24
- }) {
25
  const [hover, setHover] = useState(0);
26
- return (
27
-     <div className="star-rating">
-       {[1, 2, 3, 4, 5].map((star) => (
-         <button
-           key={star}
-           className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
-           onMouseEnter={() => setHover(star)}
-           onMouseLeave={() => setHover(0)}
-           onClick={() => onChange(star)}
-         >
-
-         </button>
-       ))}
-       {value !== null && <span className="star-label">{value}/5</span>}
-     </div>
-   );
- }
-
- export function EvalPanel({ evalScores }: Props) {
-   const [expanded, setExpanded] = useState(false);
-   const [likert, setLikert] = useState<number | null>(null);

    return (
      <div className="eval-panel">
-       <button
-         className="eval-toggle"
-         onClick={() => setExpanded(!expanded)}
-       >
-         {expanded ? "▾" : "▸"} Eval Metrics
-         {evalScores.slo_passed ? (
-           <span className="slo-badge pass">SLO ✓</span>
-         ) : (
-           <span className="slo-badge fail">SLO ✗</span>
          )}
-         {likert !== null && (
-           <span className="slo-badge">{likert}/5 ★</span>
-         )}
-       </button>
-
-       {expanded && (
-         <div className="eval-details">
-           <div className="eval-section">
-             <div className="section-title">Factual Faithfulness</div>
-             {evalScores.no_evidence ? (
-               <div className="eval-na">N/A — no evidence retrieved</div>
-             ) : (
-               <>
-                 <div className="metric-row">
-                   <span>Groundedness</span>
-                   <span className="metric-value">{(evalScores.groundedness * 100).toFixed(0)}%</span>
-                 </div>
-                 <ScoreBar value={evalScores.groundedness} />
-                 <div className="metric-row">
-                   <span>Hallucination Rate</span>
-                   <span className="metric-value">
-                     {(evalScores.hallucination_rate * 100).toFixed(0)}%
-                   </span>
-                 </div>
-               </>
-             )}
-           </div>
-
-           <div className="eval-section">
-             <div className="section-title">Communication Efficiency</div>
-             <div className="metric-row">
-               <span>Response Time</span>
-               <span className={`metric-value ${evalScores.slo_passed ? "pass" : "fail"}`}>
-                 {evalScores.t_total_s.toFixed(2)}s
-                 {evalScores.slo_passed ? " ✓" : " ✗"}
-               </span>
-             </div>
-             <div className="metric-row sub">
-               <span>SLO Target</span>
-               <span className="metric-value">
-                 &lt; {evalScores.slo_target_s.toFixed(1)}s (margin: {evalScores.slo_margin_s.toFixed(2)}s)
-               </span>
-             </div>
-           </div>
-
-           <div className="eval-section">
-             <div className="section-title">Multimodal Alignment</div>
-             <div className="metric-row">
-               <span>Overall</span>
-               <span className="metric-value">{(evalScores.multimodal_alignment * 100).toFixed(0)}%</span>
-             </div>
-             <ScoreBar value={evalScores.multimodal_alignment} />
-             <div className="metric-row sub">
-               <span>Affect</span>
-               <span className="metric-value">{(evalScores.affect_alignment * 100).toFixed(0)}%</span>
-             </div>
-             <div className="metric-row sub">
-               <span>Gesture</span>
-               <span className="metric-value">{(evalScores.gesture_alignment * 100).toFixed(0)}%</span>
-             </div>
-             <div className="metric-row sub">
-               <span>Gaze</span>
-               <span className="metric-value">{(evalScores.gaze_alignment * 100).toFixed(0)}%</span>
-             </div>
-           </div>
-
-           <div className="eval-section">
-             <div className="section-title">Perceived Authenticity</div>
-             <div className="metric-row">
-               <span>Rate this response</span>
-             </div>
-             <StarRating value={likert} onChange={setLikert} />
-           </div>
          </div>
-       )}
      </div>
    );
  }
 
  import { useState } from "react";
+ import { submitRating } from "../lib/api";

  interface Props {
+   runId: string;
+   userId: string;
+   latencyTotal: number;
+   sloTarget?: number;
  }

+ export function EvalPanel({ runId, userId, latencyTotal, sloTarget = 6.0 }: Props) {
+   const [value, setValue] = useState<number | null>(null);
    const [hover, setHover] = useState(0);
+   const [submitting, setSubmitting] = useState(false);
+   const sloPassed = latencyTotal > 0 && latencyTotal < sloTarget;
+
+   async function rate(stars: number) {
+     if (submitting || value !== null) return;
+     setSubmitting(true);
+     try {
+       await submitRating({
+         run_id: runId,
+         user_id: userId,
+         authenticity: stars,
+       });
+       setValue(stars);
+     } catch (e) {
+       console.error("rating submit failed", e);
+     } finally {
+       setSubmitting(false);
+     }
+   }

    return (
      <div className="eval-panel">
+       <div className="eval-row">
+         {latencyTotal > 0 && (
+           <span className={`slo-badge ${sloPassed ? "pass" : "fail"}`}>
+             {latencyTotal.toFixed(2)}s {sloPassed ? "✓" : "✗"}
+           </span>
          )}
+         <div className="star-rating" title="Rate authenticity (1-5)">
+           {[1, 2, 3, 4, 5].map((star) => (
+             <button
+               key={star}
+               className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
+               onMouseEnter={() => setHover(star)}
+               onMouseLeave={() => setHover(0)}
+               onClick={() => rate(star)}
+               disabled={value !== null || submitting}
+             >
+
+             </button>
+           ))}
+           {value !== null && <span className="star-label">{value}/5</span>}
          </div>
+       </div>
      </div>
    );
  }
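The latency badge in the new `EvalPanel` reduces to a small pure rule: no badge when `latencyTotal` is not positive, otherwise pass only when it is strictly below `sloTarget`. The sketch below restates that rule outside React so it can be checked in isolation. `sloBadge` and `SloBadge` are hypothetical names that do not appear in the commit; the default of `6.0` mirrors the component's `sloTarget` default.

```typescript
// Hypothetical helper, not part of the commit: mirrors the badge logic in EvalPanel.
interface SloBadge {
  passed: boolean;
  label: string;
}

function sloBadge(latencyTotal: number, sloTarget: number = 6.0): SloBadge | null {
  // The component renders no badge unless a positive latency was recorded.
  if (!(latencyTotal > 0)) return null;
  // Strict comparison: a latency exactly equal to the target counts as a miss.
  const passed = latencyTotal < sloTarget;
  return { passed, label: `${latencyTotal.toFixed(2)}s ${passed ? "✓" : "✗"}` };
}
```

Keeping the comparison in one place like this would make the boundary case (latency equal to the target) explicit, which the inline JSX leaves implicit.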
frontend/src/lib/api.ts CHANGED
@@ -44,6 +44,21 @@ export async function resetSession(userId: string): Promise<void> {
    if (!res.ok) throw new Error(`API error: ${res.status}`);
  }

+ export async function submitRating(args: {
+   run_id: string;
+   user_id: string;
+   authenticity: number;
+   rater_id?: string;
+   notes?: string;
+ }): Promise<void> {
+   const res = await fetch(`${API_BASE}/feedback/rating`, {
+     method: "POST",
+     headers: { "Content-Type": "application/json" },
+     body: JSON.stringify(args),
+   });
+   if (!res.ok) throw new Error(`API error: ${res.status}`);
+ }
+
  export async function checkHealth(): Promise<boolean> {
    try {
      const res = await fetch(`${API_BASE}/health`);
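`submitRating` forwards its argument object unchanged as the JSON body of a `POST` to `/feedback/rating`. As a rough sketch of that payload shape, the helper below builds the same request options without touching the network. `buildRatingRequest` and `RatingArgs` are hypothetical names, and the 1-5 integer check is an assumption drawn from the star widget's range; the diff does not show the backend enforcing it.

```typescript
// Hypothetical helper, not part of the commit: builds the request submitRating would send.
interface RatingArgs {
  run_id: string;
  user_id: string;
  authenticity: number;
  rater_id?: string;
  notes?: string;
}

interface RatingRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

function buildRatingRequest(args: RatingArgs): RatingRequest {
  // Assumption: authenticity is an integer Likert score from 1 to 5, as the star widget implies.
  if (!Number.isInteger(args.authenticity) || args.authenticity < 1 || args.authenticity > 5) {
    throw new RangeError(`authenticity must be an integer from 1 to 5, got ${args.authenticity}`);
  }
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(args),
  };
}
```

The returned object can be passed as the second argument to `fetch`; optional fields such as `rater_id` and `notes` are simply omitted from the JSON when undefined.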
frontend/src/types.ts CHANGED
@@ -52,20 +52,6 @@ export interface LatencyLog {
    t_total: number;
  }

- export interface EvalScores {
-   groundedness: number;
-   hallucination_rate: number;
-   no_evidence: boolean;
-   t_total_s: number;
-   slo_target_s: number;
-   slo_passed: boolean;
-   slo_margin_s: number;
-   multimodal_alignment: number;
-   affect_alignment: number;
-   gesture_alignment: number;
-   gaze_alignment: number;
- }
-
  export interface ChatResponse {
    user_id: string;
    query: string;
@@ -75,7 +61,7 @@ export interface ChatResponse {
    retrieval_mode: string;
    latency: LatencyLog;
    guardrail_passed: boolean;
-   eval_scores: EvalScores | null;
+   run_id: string | null;
    turn_id: number;
  }

@@ -84,7 +70,7 @@ export interface ChatMessage {
    content: string;
    latency?: LatencyLog;
    affect?: string;
-   evalScores?: EvalScores | null;
+   runId?: string | null;
    turnId?: number;
    rephrased?: boolean;
    isTurnaround?: boolean;