Space: ub-aac-chatbot/aac-chatbot

shwetangisingh committed on Apr 17

Commit ed5dd6f

1 Parent(s): c511e09

Add faithfulness + authenticity evals; run scoring off the response path

NLI-based groundedness (cross-encoder/nli-deberta-v3-small) and per-turn
Likert capture via /feedback/rating. compute_evals runs in BackgroundTasks
→ logs/evals.jsonl keyed by run_id. New aggregate.py prints per-persona
latency/faithfulness/alignment/rating rollups.

Files changed (9)

README.md +29 -12
backend/api/main.py +105 -41
backend/config/settings.py +5 -0
backend/evals/aggregate.py +169 -0
backend/evals/faithfulness.py +89 -7
frontend/src/components/ChatPanel.tsx +8 -4
frontend/src/components/EvalPanel.tsx +46 -123
frontend/src/lib/api.ts +15 -0
frontend/src/types.ts +2 -16

README.md CHANGED

@@ -230,6 +230,10 @@ Everything is a Pydantic setting in [backend/config/settings.py](backend/config/
 | `RERANK_LAMBDA` | `0.7` | MMR balance: `1.0` = pure cosine relevance, lower = more diversity. Drop to `0.5` if results look repetitive. |
 | `RERANK_QUERY_WEIGHT` | `0.7` | Weight on the current turn vs the mean of recent user turns when building the rerank query. Lower if follow-ups under-weight prior context. |
 | `LOGS_DIR` | `logs` | Where the per-turn JSONL goes. |
 ---
@@ -272,6 +276,17 @@ curl -X POST http://localhost:8000/chat \
   -d '{"user_id": "stephen_hawking", "query": "What do you like to do on weekends?"}'
 ```
 ---
 ## Project Structure
@@ -295,6 +310,7 @@ multimodal_aac_chatbot/
 │   ├── sensing/labels.py          GESTURE_DIRECTIVES (sensing runs in browser)
 │   ├── retrieval/                 BGE embeddings (torch tensor) + bucket priors
 │   ├── generation/llm_client.py   2-tier Ollama Cloud LLM client (primary/fallback)
 │   └── guardrails/checks.py      Input + output safety checks
 │
 ├── data/
@@ -401,21 +417,22 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
 ### Evals
-Live per-turn scores show up in the `EvalPanel`. State:
-| Metric | Status |
-|--------|--------|
-| Efficiency | works (SLO check on `t_total`) |
-| Faithfulness | stub, returns 0 |
-| Multimodal alignment | works — affect (sentiment lexicon), gesture (opener regex), gaze (bucket match) |
-| Authenticity | star rating in UI but not saved |
-- [ ] **[Eval]** Faithfulness — actually check if the response is grounded in what we retrieved. NLI model, sentence-level. If we didn't retrieve anything, flag `no_evidence` instead of pretending we scored it
-- [ ] **[Eval]** Efficiency — per-turn SLO check is done, but for the writeup we need aggregate latency: p50/p95 across a fixed query set, broken out by LLM tier. Spec target is < 6s
-- [x] **[Eval]** Multimodal alignment — implemented in `backend/evals/multimodal_alignment.py`. Affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket. Returned on every turn as `multimodal_alignment` / `affect_alignment` / `gesture_alignment` / `gaze_alignment`
-- [ ] **[Eval]** Authenticity — the Likert stars are wired up in the UI but go nowhere. Save them, log them with the turn so we can actually look at them later
 - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
-- [ ] **[Eval]** Need an offline version of all three model-driven evals (faithfulness / alignment / efficiency). Aggregate numbers across a fixed query set per persona for the writeup
 ### Cleanup

 | `RERANK_LAMBDA` | `0.7` | MMR balance: `1.0` = pure cosine relevance, lower = more diversity. Drop to `0.5` if results look repetitive. |
 | `RERANK_QUERY_WEIGHT` | `0.7` | Weight on the current turn vs the mean of recent user turns when building the rerank query. Lower if follow-ups under-weight prior context. |
 | `LOGS_DIR` | `logs` | Where the per-turn JSONL goes. |
+| `SLO_TARGET_S` | `6.0` | Latency SLO used by the efficiency eval. |
+| `EVALS_ENABLED` | `true` | Toggle off to skip background eval scoring. |
+| `NLI_MODEL` | `cross-encoder/nli-deberta-v3-small` | NLI model used for the groundedness/hallucination scorer. |
+| `FAITHFULNESS_THRESHOLD` | `0.5` | Per-sentence entailment probability needed to count as grounded. |
 ---
   -d '{"user_id": "stephen_hawking", "query": "What do you like to do on weekends?"}'
 ```
+### Offline eval aggregation
+After a few turns have been logged, print a per-persona report:
+```bash
+conda activate aac-chatbot
+python -m backend.evals.aggregate
+```
+Output covers latency quantiles + SLO pass rate, faithfulness (groundedness / hallucination), multimodal alignment, and the distribution of Likert ratings. Reads `logs/turns.jsonl`, `logs/evals.jsonl`, and `logs/ratings.jsonl`.
 ---
 ## Project Structure
 │   ├── sensing/labels.py          GESTURE_DIRECTIVES (sensing runs in browser)
 │   ├── retrieval/                 BGE embeddings (torch tensor) + bucket priors
 │   ├── generation/llm_client.py   2-tier Ollama Cloud LLM client (primary/fallback)
+│   ├── evals/                     faithfulness (NLI), efficiency, multimodal, aggregate CLI
 │   └── guardrails/checks.py      Input + output safety checks
 │
 ├── data/
 ### Evals
+Scoring runs out-of-band via FastAPI `BackgroundTasks` after `/chat` returns — the response path stays clean. Each scored turn is appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings from the UI go to `logs/ratings.jsonl`.
+| Metric | Status | Where |
+|--------|--------|-------|
+| Efficiency | per-turn SLO on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
+| Faithfulness | sentence-level NLI, `no_evidence` short-circuit | [faithfulness.py](backend/evals/faithfulness.py) |
+| Multimodal alignment | affect (sentiment lexicon), gesture (opener regex), gaze (bucket overlap) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
+| Authenticity | star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
+**First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first background score after a server restart. Turn 1's score lands a few seconds after the response; every turn after that is fast.
+- [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
+- [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
+- [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
+- [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
 - [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
 ### Cleanup
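
The `run_id` join that the README's Evals section describes can be sketched as follows. This is a minimal illustration, assuming each line of `logs/turns.jsonl` carries a `run_id` field (the diff only shows that field being written to `evals.jsonl` and `ratings.jsonl`). `load_jsonl` and `join_by_run_id` are hypothetical helpers, not part of the repo, and the rows are made up.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line; a missing file yields []."""
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def join_by_run_id(turns: list[dict], evals: list[dict]) -> list[dict]:
    """Hypothetical helper: attach each turn's eval row, or None if unscored."""
    evals_by_run = {e["run_id"]: e for e in evals if "run_id" in e}
    return [{**t, "evals": evals_by_run.get(t.get("run_id"))} for t in turns]


with TemporaryDirectory() as tmp:
    logs = Path(tmp)
    # Illustrative rows only; real log lines carry more fields.
    (logs / "turns.jsonl").write_text(
        json.dumps({"run_id": "a" * 32, "user_id": "stephen_hawking"}) + "\n"
        + json.dumps({"run_id": "b" * 32, "user_id": "stephen_hawking"}) + "\n",
        encoding="utf-8",
    )
    (logs / "evals.jsonl").write_text(
        json.dumps({"run_id": "a" * 32, "groundedness": 0.75}) + "\n",
        encoding="utf-8",
    )
    joined = join_by_run_id(
        load_jsonl(logs / "turns.jsonl"), load_jsonl(logs / "evals.jsonl")
    )
```

Because scoring runs in the background, a turn can appear in `turns.jsonl` before its eval row exists; the sketch leaves `evals` as `None` in that case rather than dropping the turn.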

backend/api/main.py CHANGED

@@ -3,10 +3,13 @@ from __future__ import annotations
 import json
 import logging
-from fastapi import FastAPI, HTTPException
 from fastapi.middleware.cors import CORSMiddleware
-from pydantic import BaseModel
 from backend.config.settings import settings
 from backend.evals import compute_evals
@@ -38,6 +41,8 @@ app.add_middleware(
 _log = logging.getLogger(__name__)
 _models_ready = False
 @app.on_event("startup")
@@ -78,20 +83,6 @@ class TurnaroundRequest(BaseModel):
     head_signal: str | None = None
-class EvalScoresResponse(BaseModel):
-    groundedness: float
-    hallucination_rate: float
-    no_evidence: bool
-    t_total_s: float
-    slo_target_s: float
-    slo_passed: bool
-    slo_margin_s: float
-    multimodal_alignment: float
-    affect_alignment: float
-    gesture_alignment: float
-    gaze_alignment: float
 class ChatResponse(BaseModel):
     user_id: str
     query: str
@@ -102,10 +93,18 @@ class ChatResponse(BaseModel):
     retrieval_mode: str
     latency: dict
     guardrail_passed: bool
-    eval_scores: EvalScoresResponse | None = None
     turn_id: int
 # ── Helpers ────────────────────────────────────────────────────────────────────
@@ -275,8 +274,44 @@ def reset_session(user_id: str):
     return {"status": "reset", "user_id": user_id}
 @app.post("/chat", response_model=ChatResponse)
-def chat(req: ChatRequest):
     guard = check_input(req.query)
     if not guard["allowed"]:
         return ChatResponse(
@@ -297,22 +332,26 @@ def chat(req: ChatRequest):
     result: PipelineState = run_pipeline(initial_state)
-    # Persist updated session state
     session["session_history"] = result["session_history"]
     session["bucket_priors"] = result["bucket_priors"]
     session["last_state"] = result
-    # Compute evaluation metrics
     affect_emotion = (result.get("affect") or {}).get("emotion", "NEUTRAL")
-    eval_scores = compute_evals(
-        response=result["selected_response"] or "",
-        chunks=result.get("retrieved_chunks") or [],
-        latency_log=result.get("latency_log") or {},
-        affect=affect_emotion,
-        gesture_tag=req.gesture_tag,
-        gaze_bucket=req.gaze_bucket,
-        slo_target=settings.slo_target_s,
-    )
     return ChatResponse(
         user_id=req.user_id,
@@ -324,13 +363,13 @@ def chat(req: ChatRequest):
         retrieval_mode=result.get("retrieval_mode_used", "unknown"),
         latency=result.get("latency_log") or {},
         guardrail_passed=result.get("guardrail_passed", True),
-        eval_scores=eval_scores,
         turn_id=result["turn_id"],
     )
 @app.post("/chat/turnaround", response_model=ChatResponse)
-def chat_turnaround(req: TurnaroundRequest):
     if req.user_id not in _sessions:
         raise HTTPException(status_code=404, detail="no active session")
@@ -400,15 +439,21 @@ def chat_turnaround(req: TurnaroundRequest):
     session["last_state"] = replan_state
     affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
-    eval_scores = compute_evals(
-        response=replan_state["selected_response"] or "",
-        chunks=replan_state.get("retrieved_chunks") or [],
-        latency_log=replan_state.get("latency_log") or {},
-        affect=affect_emotion,
-        gesture_tag=replan_state.get("gesture_tag"),
-        gaze_bucket=replan_state.get("gaze_bucket"),
-        slo_target=settings.slo_target_s,
-    )
     return ChatResponse(
         user_id=req.user_id,
@@ -420,6 +465,25 @@ def chat_turnaround(req: TurnaroundRequest):
         retrieval_mode=replan_state.get("retrieval_mode_used", "unknown"),
         latency=replan_state.get("latency_log") or {},
         guardrail_passed=replan_state.get("guardrail_passed", True),
-        eval_scores=eval_scores,
         turn_id=replan_state["turn_id"],
     )

 import json
 import logging
+import re
+import time
+from pathlib import Path
+from fastapi import BackgroundTasks, FastAPI, HTTPException
 from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel, Field
 from backend.config.settings import settings
 from backend.evals import compute_evals
 _log = logging.getLogger(__name__)
 _models_ready = False
+_RUN_ID_RE = re.compile(r"^[0-9a-f]{32}$")
+_ID_PATTERN = r"^[a-zA-Z0-9_-]+$"
 @app.on_event("startup")
     head_signal: str | None = None
 class ChatResponse(BaseModel):
     user_id: str
     query: str
     retrieval_mode: str
     latency: dict
     guardrail_passed: bool
+    run_id: str | None = None
     turn_id: int
+class RatingRequest(BaseModel):
+    run_id: str = Field(min_length=1, max_length=64, pattern=_ID_PATTERN)
+    user_id: str = Field(min_length=1, max_length=64, pattern=_ID_PATTERN)
+    authenticity: int = Field(ge=1, le=5)
+    rater_id: str = Field(default="anonymous", max_length=64, pattern=_ID_PATTERN)
+    notes: str | None = Field(default=None, max_length=500)
 # ── Helpers ────────────────────────────────────────────────────────────────────
     return {"status": "reset", "user_id": user_id}
+def _score_and_persist(
+    run_id: str,
+    user_id: str,
+    turn_id: int,
+    response: str,
+    chunks: list[dict],
+    latency_log: dict,
+    affect: str,
+    gesture_tag: str | None,
+    gaze_bucket: str | None,
+) -> None:
+    try:
+        scores = compute_evals(
+            response=response,
+            chunks=chunks,
+            latency_log=latency_log,
+            affect=affect,
+            gesture_tag=gesture_tag,
+            gaze_bucket=gaze_bucket,
+            slo_target=settings.slo_target_s,
+        )
+        entry = {
+            "run_id": run_id,
+            "ts": time.time(),
+            "user_id": user_id,
+            "turn_id": turn_id,
+            **scores,
+        }
+        logs_dir = Path(settings.logs_dir)
+        logs_dir.mkdir(parents=True, exist_ok=True)
+        with open(logs_dir / "evals.jsonl", "a", encoding="utf-8") as f:
+            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+    except Exception:
+        _log.exception("evals background scoring failed for run %s", run_id)
 @app.post("/chat", response_model=ChatResponse)
+def chat(req: ChatRequest, background_tasks: BackgroundTasks):
     guard = check_input(req.query)
     if not guard["allowed"]:
         return ChatResponse(
     result: PipelineState = run_pipeline(initial_state)
     session["session_history"] = result["session_history"]
     session["bucket_priors"] = result["bucket_priors"]
     session["last_state"] = result
     affect_emotion = (result.get("affect") or {}).get("emotion", "NEUTRAL")
+    run_id = result.get("run_id")
+    if settings.evals_enabled and run_id:
+        background_tasks.add_task(
+            _score_and_persist,
+            run_id=run_id,
+            user_id=req.user_id,
+            turn_id=result["turn_id"],
+            response=result["selected_response"] or "",
+            chunks=list(result.get("retrieved_chunks") or []),
+            latency_log=dict(result.get("latency_log") or {}),
+            affect=affect_emotion,
+            gesture_tag=req.gesture_tag,
+            gaze_bucket=req.gaze_bucket,
+        )
     return ChatResponse(
         user_id=req.user_id,
         retrieval_mode=result.get("retrieval_mode_used", "unknown"),
         latency=result.get("latency_log") or {},
         guardrail_passed=result.get("guardrail_passed", True),
+        run_id=run_id,
         turn_id=result["turn_id"],
     )
 @app.post("/chat/turnaround", response_model=ChatResponse)
+def chat_turnaround(req: TurnaroundRequest, background_tasks: BackgroundTasks):
     if req.user_id not in _sessions:
         raise HTTPException(status_code=404, detail="no active session")
     session["last_state"] = replan_state
     affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
+    run_id = replan_state.get("run_id")
+    if settings.evals_enabled and run_id:
+        background_tasks.add_task(
+            _score_and_persist,
+            run_id=run_id,
+            user_id=req.user_id,
+            turn_id=replan_state["turn_id"],
+            response=replan_state["selected_response"] or "",
+            chunks=list(replan_state.get("retrieved_chunks") or []),
+            latency_log=dict(replan_state.get("latency_log") or {}),
+            affect=affect_emotion,
+            gesture_tag=replan_state.get("gesture_tag"),
+            gaze_bucket=replan_state.get("gaze_bucket"),
+        )
     return ChatResponse(
         user_id=req.user_id,
         retrieval_mode=replan_state.get("retrieval_mode_used", "unknown"),
         latency=replan_state.get("latency_log") or {},
         guardrail_passed=replan_state.get("guardrail_passed", True),
+        run_id=run_id,
         turn_id=replan_state["turn_id"],
     )
+@app.post("/feedback/rating")
+def submit_rating(req: RatingRequest):
+    if not _RUN_ID_RE.match(req.run_id):
+        raise HTTPException(status_code=400, detail="invalid run_id")
+    logs_dir = Path(settings.logs_dir)
+    logs_dir.mkdir(parents=True, exist_ok=True)
+    entry = {
+        "ts": time.time(),
+        "run_id": req.run_id,
+        "user_id": req.user_id,
+        "authenticity": req.authenticity,
+        "rater_id": req.rater_id,
+        "notes": req.notes,
+    }
+    with open(logs_dir / "ratings.jsonl", "a", encoding="utf-8") as f:
+        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+    return {"status": "ok"}
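
`RatingRequest` accepts any `run_id` of up to 64 characters that matches `_ID_PATTERN`, but `submit_rating` then rejects anything that is not exactly 32 lowercase hex characters. The sketch below replays both checks with the standard library only. It assumes, without confirmation from this diff, that the pipeline generates `run_id` in the same shape as `uuid.uuid4().hex`. `passes_both_checks` is a hypothetical name, and the `re`-based field check only approximates what Pydantic does.

```python
import re
import uuid

# Both patterns are copied from the diff above.
_RUN_ID_RE = re.compile(r"^[0-9a-f]{32}$")
_ID_PATTERN = r"^[a-zA-Z0-9_-]+$"


def passes_both_checks(run_id: str) -> bool:
    """Hypothetical helper: approximate the field check, then the endpoint check."""
    field_ok = 1 <= len(run_id) <= 64 and re.match(_ID_PATTERN, run_id) is not None
    return field_ok and _RUN_ID_RE.match(run_id) is not None


# A fixed UUID keeps the example deterministic; uuid4().hex has the same shape.
fixed = uuid.UUID("0123456789abcdef0123456789abcdef")

hex_id = fixed.hex             # 32 lowercase hex chars: passes both checks
dashed_id = str(fixed)         # 36 chars with dashes: passes the field, fails the endpoint
upper_id = fixed.hex.upper()   # uppercase hex: passes the field, fails the endpoint
```

A client that sends the dashed form gets a 400 from the endpoint rather than a 422 from model validation, which may matter when debugging the frontend.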

backend/config/settings.py CHANGED

@@ -67,6 +67,11 @@ class Settings(BaseSettings):
     # ── Evaluation ────────────────────────────────────────────────────────────
     slo_target_s: float = 6.0  # max acceptable response latency (seconds)
 settings = Settings()

     # ── Evaluation ────────────────────────────────────────────────────────────
     slo_target_s: float = 6.0  # max acceptable response latency (seconds)
+    evals_enabled: bool = True
+    nli_model: str = "cross-encoder/nli-deberta-v3-small"
+    faithfulness_threshold: float = (
+        0.5  # entailment prob for a sentence to count as grounded
+    )
 settings = Settings()

backend/evals/aggregate.py ADDED

@@ -0,0 +1,169 @@

+"""Offline aggregator: reads turns.jsonl, evals.jsonl, ratings.jsonl and prints
+per-persona metrics. Run:  python -m backend.evals.aggregate
+"""
+import argparse
+import json
+import statistics
+import sys
+from collections import defaultdict
+from pathlib import Path
+from backend.config.settings import settings
+def _load(path: Path) -> list[dict]:
+    if not path.exists():
+        return []
+    out = []
+    skipped = 0
+    with open(path, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                out.append(json.loads(line))
+            except json.JSONDecodeError:
+                skipped += 1
+    if skipped:
+        print(
+            f"[aggregate] skipped {skipped} malformed lines in {path}",
+            file=sys.stderr,
+        )
+    return out
+def _quantile(values: list[float], q: float) -> float:
+    if not values:
+        return 0.0
+    if len(values) == 1:
+        return values[0]
+    idx = max(0, min(len(values) - 1, int(round(q * (len(values) - 1)))))
+    return sorted(values)[idx]
+def _fmt_ms(s: float) -> str:
+    return f"{s * 1000:.0f}ms"
+def report_latency(turns: list[dict]) -> None:
+    print("\n=== Communication Efficiency (latency) ===")
+    by_group: dict[tuple[str, str], list[float]] = defaultdict(list)
+    for t in turns:
+        key = (t.get("user_id", "?"), t.get("llm_tier", "?"))
+        by_group[key].append(t.get("latency", {}).get("t_total", 0.0))
+    slo = settings.slo_target_s
+    print(f"SLO target: < {slo}s")
+    print(
+        f"{'user_id':<18} {'tier':<10} {'n':>5} {'p50':>8} {'p95':>8} {'p99':>8} {'pass%':>7}"
+    )
+    for (uid, tier), lats in sorted(by_group.items()):
+        if not lats:
+            continue
+        p50 = _quantile(lats, 0.5)
+        p95 = _quantile(lats, 0.95)
+        p99 = _quantile(lats, 0.99)
+        passed = sum(1 for x in lats if x < slo) / len(lats) * 100
+        print(
+            f"{uid:<18} {tier:<10} {len(lats):>5} "
+            f"{_fmt_ms(p50):>8} {_fmt_ms(p95):>8} {_fmt_ms(p99):>8} {passed:>6.1f}%"
+        )
+def report_faithfulness(evals: list[dict]) -> None:
+    print("\n=== Factual Faithfulness ===")
+    scored = [e for e in evals if not e.get("no_evidence")]
+    if not scored:
+        print("(no turns with retrieved evidence)")
+        return
+    by_user: dict[str, list[dict]] = defaultdict(list)
+    for e in scored:
+        by_user[e.get("user_id", "?")].append(e)
+    print(f"{'user_id':<18} {'n':>5} {'groundedness':>14} {'hallucination':>14}")
+    for uid, rows in sorted(by_user.items()):
+        g = statistics.mean(r["groundedness"] for r in rows)
+        h = statistics.mean(r["hallucination_rate"] for r in rows)
+        print(f"{uid:<18} {len(rows):>5} {g:>13.2%} {h:>13.2%}")
+def _mean_nonzero(rows: list[dict], key: str) -> tuple[float, float]:
+    # Coverage % undercounts real zeros (a genuinely 0.0-aligned response looks
+    # identical to one where the signal was absent). Fixable by serializing
+    # null for absent signals in compute_multimodal_alignment.
+    vals = [float(r.get(key, 0.0)) for r in rows]
+    nonzero = [v for v in vals if v > 0]
+    if not nonzero:
+        return 0.0, 0.0
+    return statistics.mean(nonzero), len(nonzero) / len(vals)
+def _fmt_mean_cov(rows: list[dict], key: str) -> str:
+    mean, cov = _mean_nonzero(rows, key)
+    return f"{mean:>5.0%}|{cov:>5.0%}"
+def report_multimodal(evals: list[dict]) -> None:
+    print("\n=== Multimodal Alignment (mean among non-zero | coverage) ===")
+    if not evals:
+        print("(no evals logged)")
+        return
+    by_user: dict[str, list[dict]] = defaultdict(list)
+    for e in evals:
+        by_user[e.get("user_id", "?")].append(e)
+    print(f"{'user_id':<18} {'n':>5} {'affect':>16} {'gesture':>16} {'gaze':>16}")
+    for uid, rows in sorted(by_user.items()):
+        print(
+            f"{uid:<18} {len(rows):>5} "
+            f"{_fmt_mean_cov(rows, 'affect_alignment'):>16} "
+            f"{_fmt_mean_cov(rows, 'gesture_alignment'):>16} "
+            f"{_fmt_mean_cov(rows, 'gaze_alignment'):>16}"
+        )
+def report_authenticity(ratings: list[dict]) -> None:
+    print("\n=== Perceived Authenticity (Likert 1-5) ===")
+    by_user: dict[str, list[int]] = defaultdict(list)
+    for r in ratings:
+        raw = r.get("authenticity")
+        try:
+            score = int(raw)
+        except (TypeError, ValueError):
+            continue
+        if not 1 <= score <= 5:
+            continue
+        by_user[r.get("user_id", "?")].append(score)
+    if not by_user:
+        print("(no valid ratings logged yet)")
+        return
+    print(f"{'user_id':<18} {'n':>5} {'mean':>6} {'dist (1..5)':>22}")
+    for uid, scores in sorted(by_user.items()):
+        mean = statistics.mean(scores)
+        dist = [scores.count(i) for i in range(1, 6)]
+        dist_str = "/".join(str(x) for x in dist)
+        print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
+    parser.add_argument("--logs", type=Path, default=settings.logs_dir)
+    args = parser.parse_args()
+    turns = _load(args.logs / "turns.jsonl")
+    evals = _load(args.logs / "evals.jsonl")
+    ratings = _load(args.logs / "ratings.jsonl")
+    print(f"Loaded: {len(turns)} turns, {len(evals)} evals, {len(ratings)} ratings")
+    report_latency(turns)
+    report_faithfulness(evals)
+    report_multimodal(evals)
+    report_authenticity(ratings)
+if __name__ == "__main__":
+    main()
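
A worked example may help when reading the latency table. `_quantile` picks the nearest rank by rounding the index, so on small samples p95 and p99 collapse onto the maximum. Python's `round` also rounds halves to the even integer, which makes the p50 of ten values the 5th smallest rather than the 6th. The function body below is copied from the diff; the latencies are made up, not measured.

```python
def quantile(values: list[float], q: float) -> float:
    """Same nearest-rank rule as `_quantile` in the diff above."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    idx = max(0, min(len(values) - 1, int(round(q * (len(values) - 1)))))
    return sorted(values)[idx]


# Ten made-up latencies in seconds, with one slow outlier.
lats = [1.2, 0.9, 1.5, 2.0, 1.1, 1.3, 7.4, 1.0, 1.4, 1.6]
slo = 6.0

p50 = quantile(lats, 0.50)  # index round(4.5) == 4 -> 1.3
p95 = quantile(lats, 0.95)  # index round(8.55) == 9 -> 7.4 (the maximum)
p99 = quantile(lats, 0.99)  # index round(8.91) == 9 -> 7.4 (the maximum)

# Same pass-rate expression as report_latency: strict "<" against the SLO.
pass_rate = sum(1 for x in lats if x < slo) / len(lats) * 100
```

With only a handful of turns per `user_id × llm_tier` group, the tail columns mostly report the single worst turn, so the `n` column is worth checking before quoting p95 or p99.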

backend/evals/faithfulness.py CHANGED

@@ -1,12 +1,94 @@
-# NLI-based faithfulness scoring.
-from __future__ import annotations
 def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
-    """Compute groundedness and hallucination rate via NLI."""
-    no_evidence = len(chunks) == 0
     return {
-        "groundedness": 0.0,
-        "hallucination_rate": 0.0,
-        "no_evidence": no_evidence,
     }

+import re
+from threading import Lock, Semaphore
+import torch
+from backend.config.settings import settings
+_model = None
+_entail_idx: int | None = None
+_model_lock = Lock()
+_predict_sem = Semaphore(1)
+_SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")
+_MAX_SENTENCES = 20
+_MIN_SENTENCE_WORDS = 2
+def _get_model():
+    global _model, _entail_idx
+    if _model is not None:
+        return _model
+    with _model_lock:
+        if _model is not None:
+            return _model
+        from sentence_transformers import CrossEncoder
+        device = (
+            "mps"
+            if torch.backends.mps.is_available()
+            else "cuda"
+            if torch.cuda.is_available()
+            else "cpu"
+        )
+        model = CrossEncoder(settings.nli_model, device=device)
+        label2id = getattr(model.config, "label2id", None) or {}
+        for label, idx in label2id.items():
+            if label.lower() == "entailment":
+                _entail_idx = int(idx)
+                break
+        if _entail_idx is None:
+            raise RuntimeError(
+                f"NLI model {settings.nli_model!r} has no 'entailment' label"
+            )
+        _model = model
+        return _model
+def _split_sentences(text: str) -> list[str]:
+    parts = [s.strip() for s in _SENT_SPLIT.split(text.strip()) if s.strip()]
+    return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
 def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
+    """Sentence-level NLI: each sentence must be entailed by at least one chunk."""
+    if not chunks:
+        return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
+    sentences = _split_sentences(response)
+    # Too short to score meaningfully (one-word replies, fragments). Flagging as
+    # no_evidence is honest: we're not scoring it, so it should be excluded from
+    # groundedness averages downstream.
+    if not sentences:
+        return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
+    chunk_texts = [c.get("text", "") for c in chunks if c.get("text")]
+    if not chunk_texts:
+        return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
+    if len(sentences) > _MAX_SENTENCES:
+        sentences = sentences[:_MAX_SENTENCES]
+    model = _get_model()
+    # NLI pair order: (premise, hypothesis). Chunks are evidence (premise),
+    # response sentences are the claims being checked (hypothesis).
+    pairs = [(chunk, sent) for sent in sentences for chunk in chunk_texts]
+    with _predict_sem:
+        logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
+    probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
+    entail = probs[:, _entail_idx]
+    n_chunks = len(chunk_texts)
+    threshold = settings.faithfulness_threshold
+    grounded = 0
+    for i in range(len(sentences)):
+        sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
+        if sent_scores.max() >= threshold:
+            grounded += 1
+    total = len(sentences)
+    groundedness = grounded / total
     return {
+        "groundedness": round(groundedness, 4),
+        "hallucination_rate": round(1.0 - groundedness, 4),
+        "no_evidence": False,
     }
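
The scoring rule in `compute_faithfulness` can be traced without loading the NLI model. The sketch below reuses the sentence splitter from the diff and replaces the model output with made-up entailment probabilities. `groundedness` is a hypothetical standalone function that works on a nested list instead of the flat array the real code slices, and the `0.5` default mirrors `FAITHFULNESS_THRESHOLD`.

```python
import re

# Copied from the diff above.
_SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")
_MIN_SENTENCE_WORDS = 2


def split_sentences(text: str) -> list[str]:
    """Same splitting rule as `_split_sentences` in the diff above."""
    parts = [s.strip() for s in _SENT_SPLIT.split(text.strip()) if s.strip()]
    return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]


def groundedness(entail: list[list[float]], threshold: float = 0.5) -> float:
    """Fraction of sentences whose best chunk reaches the entailment threshold.

    entail[i][j] is the entailment probability of sentence i given chunk j.
    """
    grounded = sum(1 for row in entail if max(row) >= threshold)
    return grounded / len(entail)


sentences = split_sentences("I study black holes. Yes. I also enjoy opera!")
# "Yes." is dropped (one word), leaving two sentences to score.

# Made-up probabilities standing in for the model: 2 sentences x 3 chunks.
entail = [
    [0.10, 0.92, 0.05],  # sentence 0: chunk 1 entails it -> grounded
    [0.20, 0.31, 0.49],  # sentence 1: no chunk reaches 0.5 -> ungrounded
]
score = groundedness(entail)
hallucination_rate = 1.0 - score
```

Taking the maximum over chunks means a sentence needs support from only one retrieved chunk; a claim that is spread across two chunks can still score as ungrounded.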

frontend/src/components/ChatPanel.tsx CHANGED

@@ -85,7 +85,7 @@ export function ChatPanel({
             content: res.response,
             latency: res.latency,
             affect: res.affect,
-            evalScores: res.eval_scores,
             turnId: res.turn_id,
             isTurnaround: true,
           });
@@ -174,7 +174,7 @@ export function ChatPanel({
           content: res.response,
           latency: res.latency,
           affect: res.affect,
-          evalScores: res.eval_scores,
           turnId: res.turn_id,
         },
       ]);
@@ -224,8 +224,12 @@ export function ChatPanel({
               )}
             </span>
             <p>{msg.content}</p>
-            {msg.role === "aac_user" && msg.evalScores && (
-              <EvalPanel evalScores={msg.evalScores} />
             )}
           </div>
         ))}

             content: res.response,
             latency: res.latency,
             affect: res.affect,
+            runId: res.run_id,
             turnId: res.turn_id,
             isTurnaround: true,
           });
           content: res.response,
           latency: res.latency,
           affect: res.affect,
+          runId: res.run_id,
           turnId: res.turn_id,
         },
       ]);
               )}
             </span>
             <p>{msg.content}</p>
+            {msg.role === "aac_user" && msg.runId && userId && (
+              <EvalPanel
+                runId={msg.runId}
+                userId={userId}
+                latencyTotal={msg.latency?.t_total ?? 0}
+              />
             )}
           </div>
         ))}

frontend/src/components/EvalPanel.tsx CHANGED

@@ -1,137 +1,60 @@
 import { useState } from "react";
-import type { EvalScores } from "../types";
 interface Props {
-  evalScores: EvalScores;
 }
-function ScoreBar({ value }: { value: number }) {
-  const pct = Math.min(value * 100, 100);
-  const color = pct > 70 ? "#4caf50" : pct > 40 ? "#ff9800" : "#f44336";
-  return (
-    <div className="score-bar">
-      <div className="score-bar-fill" style={{ width: `${pct}%`, background: color }} />
-    </div>
-  );
-}
-function StarRating({
-  value,
-  onChange,
-}: {
-  value: number | null;
-  onChange: (v: number) => void;
-}) {
   const [hover, setHover] = useState(0);
-  return (
-    <div className="star-rating">
-      {[1, 2, 3, 4, 5].map((star) => (
-        <button
-          key={star}
-          className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
-          onMouseEnter={() => setHover(star)}
-          onMouseLeave={() => setHover(0)}
-          onClick={() => onChange(star)}
-        >
-          ★
-        </button>
-      ))}
-      {value !== null && <span className="star-label">{value}/5</span>}
-    </div>
-  );
-}
-export function EvalPanel({ evalScores }: Props) {
-  const [expanded, setExpanded] = useState(false);
-  const [likert, setLikert] = useState<number | null>(null);
   return (
     <div className="eval-panel">
-      <button
-        className="eval-toggle"
-        onClick={() => setExpanded(!expanded)}
-      >
-        {expanded ? "▾" : "▸"} Eval Metrics
-        {evalScores.slo_passed ? (
-          <span className="slo-badge pass">SLO ✓</span>
-        ) : (
-          <span className="slo-badge fail">SLO ✗</span>
         )}
-        {likert !== null && (
-          <span className="slo-badge">{likert}/5 ★</span>
-        )}
-      </button>
-      {expanded && (
-        <div className="eval-details">
-          <div className="eval-section">
-            <div className="section-title">Factual Faithfulness</div>
-            {evalScores.no_evidence ? (
-              <div className="eval-na">N/A — no evidence retrieved</div>
-            ) : (
-              <>
-                <div className="metric-row">
-                  <span>Groundedness</span>
-                  <span className="metric-value">{(evalScores.groundedness * 100).toFixed(0)}%</span>
-                </div>
-                <ScoreBar value={evalScores.groundedness} />
-                <div className="metric-row">
-                  <span>Hallucination Rate</span>
-                  <span className="metric-value">
-                    {(evalScores.hallucination_rate * 100).toFixed(0)}%
-                  </span>
-                </div>
-              </>
-            )}
-          </div>
-          <div className="eval-section">
-            <div className="section-title">Communication Efficiency</div>
-            <div className="metric-row">
-              <span>Response Time</span>
-              <span className={`metric-value ${evalScores.slo_passed ? "pass" : "fail"}`}>
-                {evalScores.t_total_s.toFixed(2)}s
-                {evalScores.slo_passed ? " ✓" : " ✗"}
-              </span>
-            </div>
-            <div className="metric-row sub">
-              <span>SLO Target</span>
-              <span className="metric-value">
-                &lt; {evalScores.slo_target_s.toFixed(1)}s (margin: {evalScores.slo_margin_s.toFixed(2)}s)
-              </span>
-            </div>
-          </div>
-          <div className="eval-section">
-            <div className="section-title">Multimodal Alignment</div>
-            <div className="metric-row">
-              <span>Overall</span>
-              <span className="metric-value">{(evalScores.multimodal_alignment * 100).toFixed(0)}%</span>
-            </div>
-            <ScoreBar value={evalScores.multimodal_alignment} />
-            <div className="metric-row sub">
-              <span>Affect</span>
-              <span className="metric-value">{(evalScores.affect_alignment * 100).toFixed(0)}%</span>
-            </div>
-            <div className="metric-row sub">
-              <span>Gesture</span>
-              <span className="metric-value">{(evalScores.gesture_alignment * 100).toFixed(0)}%</span>
-            </div>
-            <div className="metric-row sub">
-              <span>Gaze</span>
-              <span className="metric-value">{(evalScores.gaze_alignment * 100).toFixed(0)}%</span>
-            </div>
-          </div>
-          <div className="eval-section">
-            <div className="section-title">Perceived Authenticity</div>
-            <div className="metric-row">
-              <span>Rate this response</span>
-            </div>
-            <StarRating value={likert} onChange={setLikert} />
-          </div>
         </div>
-      )}
     </div>
   );
 }

 import { useState } from "react";
+import { submitRating } from "../lib/api";
 interface Props {
+  runId: string;
+  userId: string;
+  latencyTotal: number;
+  sloTarget?: number;
 }
+export function EvalPanel({ runId, userId, latencyTotal, sloTarget = 6.0 }: Props) {
+  const [value, setValue] = useState<number | null>(null);
   const [hover, setHover] = useState(0);
+  const [submitting, setSubmitting] = useState(false);
+  const sloPassed = latencyTotal > 0 && latencyTotal < sloTarget;
+  async function rate(stars: number) {
+    if (submitting || value !== null) return;
+    setSubmitting(true);
+    try {
+      await submitRating({
+        run_id: runId,
+        user_id: userId,
+        authenticity: stars,
+      });
+      setValue(stars);
+    } catch (e) {
+      console.error("rating submit failed", e);
+    } finally {
+      setSubmitting(false);
+    }
+  }
   return (
     <div className="eval-panel">
+      <div className="eval-row">
+        {latencyTotal > 0 && (
+          <span className={`slo-badge ${sloPassed ? "pass" : "fail"}`}>
+            {latencyTotal.toFixed(2)}s {sloPassed ? "✓" : "✗"}
+          </span>
         )}
+        <div className="star-rating" title="Rate authenticity (1-5)">
+          {[1, 2, 3, 4, 5].map((star) => (
+            <button
+              key={star}
+              className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
+              onMouseEnter={() => setHover(star)}
+              onMouseLeave={() => setHover(0)}
+              onClick={() => rate(star)}
+              disabled={value !== null || submitting}
+            >
+              ★
+            </button>
+          ))}
+          {value !== null && <span className="star-label">{value}/5</span>}
         </div>
+      </div>
     </div>
   );
 }
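
Review note: the new `EvalPanel` computes the latency badge inline, so it can only be checked by rendering the component. A minimal sketch of the same rule as a pure function follows. `sloBadge` and `SloBadge` are hypothetical names that are not part of this commit; the thresholds and class names mirror the diff above.

```typescript
// Hypothetical helper (not in the commit): the badge rule from EvalPanel,
// extracted so it can be tested without React.
type SloBadge = { label: string; className: string } | null;

function sloBadge(latencyTotal: number, sloTarget: number = 6.0): SloBadge {
  // EvalPanel renders no badge when latency is missing or zero.
  if (!(latencyTotal > 0)) return null;
  // Same strict comparison as the component: exactly hitting the target fails.
  const passed = latencyTotal < sloTarget;
  return {
    label: `${latencyTotal.toFixed(2)}s ${passed ? "✓" : "✗"}`,
    className: `slo-badge ${passed ? "pass" : "fail"}`,
  };
}
```

Keeping the comparison strict matches `sloPassed` in the component, where a latency equal to `sloTarget` counts as a miss.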

frontend/src/lib/api.ts CHANGED

@@ -44,6 +44,21 @@ export async function resetSession(userId: string): Promise<void> {
   if (!res.ok) throw new Error(`API error: ${res.status}`);
 }
+export async function submitRating(args: {
+  run_id: string;
+  user_id: string;
+  authenticity: number;
+  rater_id?: string;
+  notes?: string;
+}): Promise<void> {
+  const res = await fetch(`${API_BASE}/feedback/rating`, {
+    method: "POST",
+    headers: { "Content-Type": "application/json" },
+    body: JSON.stringify(args),
+  });
+  if (!res.ok) throw new Error(`API error: ${res.status}`);
+}
 export async function checkHealth(): Promise<boolean> {
   try {
     const res = await fetch(`${API_BASE}/health`);
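
Review note: `submitRating` forwards `authenticity` to `/feedback/rating` without checking it on the client. The sketch below shows one way the request could be built and validated before `fetch` is called. `buildRatingRequest`, the `RatingArgs` name, the 1 to 5 integer check, and the `API_BASE` value are all assumptions made for illustration; the commit may rely on the backend for validation instead.

```typescript
// Sketch only: none of these names exist in the commit.
// API_BASE is a placeholder; the real value lives in frontend/src/lib/api.ts.
const API_BASE = "http://localhost:8000";

interface RatingArgs {
  run_id: string;
  user_id: string;
  authenticity: number;
  rater_id?: string;
  notes?: string;
}

interface RatingRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildRatingRequest(args: RatingArgs): RatingRequest {
  // Assumed rule: the Likert score is a whole number of stars from 1 to 5.
  const a = args.authenticity;
  if (!Number.isInteger(a) || a < 1 || a > 5) {
    throw new RangeError(`authenticity must be an integer from 1 to 5, got ${a}`);
  }
  return {
    url: `${API_BASE}/feedback/rating`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(args),
    },
  };
}
```

A caller would pass the result straight through, for example `fetch(req.url, req.init)`, and keep the existing `res.ok` check afterwards.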

frontend/src/types.ts CHANGED

@@ -52,20 +52,6 @@ export interface LatencyLog {
   t_total: number;
 }
-export interface EvalScores {
-  groundedness: number;
-  hallucination_rate: number;
-  no_evidence: boolean;
-  t_total_s: number;
-  slo_target_s: number;
-  slo_passed: boolean;
-  slo_margin_s: number;
-  multimodal_alignment: number;
-  affect_alignment: number;
-  gesture_alignment: number;
-  gaze_alignment: number;
-}
 export interface ChatResponse {
   user_id: string;
   query: string;
@@ -75,7 +61,7 @@ export interface ChatResponse {
   retrieval_mode: string;
   latency: LatencyLog;
   guardrail_passed: boolean;
-  eval_scores: EvalScores | null;
   turn_id: number;
 }
@@ -84,7 +70,7 @@ export interface ChatMessage {
   content: string;
   latency?: LatencyLog;
   affect?: string;
-  evalScores?: EvalScores | null;
   turnId?: number;
   rephrased?: boolean;
   isTurnaround?: boolean;

   t_total: number;
 }
 export interface ChatResponse {
   user_id: string;
   query: string;
   retrieval_mode: string;
   latency: LatencyLog;
   guardrail_passed: boolean;
+  run_id: string | null;
   turn_id: number;
 }
   content: string;
   latency?: LatencyLog;
   affect?: string;
+  runId?: string | null;
   turnId?: number;
   rephrased?: boolean;
   isTurnaround?: boolean;
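
Review note: `run_id` is typed `string | null` on `ChatResponse` and `runId` is optional on `ChatMessage`, so any code that renders the rating widget has to handle a missing id. A minimal sketch of that hand-off, limited to the fields visible in this diff, could look like the following. `ChatResponseMeta`, `ChatMessageMeta`, `toMessageMeta`, and `canRate` are hypothetical names; the actual mapping in `ChatPanel.tsx` is not shown here.

```typescript
// Hypothetical subset types: only fields that appear in the diff above.
interface LatencyLog {
  t_total: number;
}

interface ChatResponseMeta {
  latency: LatencyLog;
  run_id: string | null;
  turn_id: number;
}

interface ChatMessageMeta {
  latency?: LatencyLog;
  runId?: string | null;
  turnId?: number;
}

// Copy the snake_case response fields onto the camelCase message fields.
function toMessageMeta(res: ChatResponseMeta): ChatMessageMeta {
  return { latency: res.latency, runId: res.run_id, turnId: res.turn_id };
}

// Type guard: a message can be rated only if it carries a non-empty run id,
// since ratings are keyed by run_id.
function canRate(msg: ChatMessageMeta): msg is ChatMessageMeta & { runId: string } {
  return typeof msg.runId === "string" && msg.runId.length > 0;
}
```

With a guard like this, a parent component can skip rendering `EvalPanel` for messages without a run id instead of passing an empty string.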