Commit ed5dd6f · Parent(s): c511e09
Add faithfulness + authenticity evals; run scoring off the response path
NLI-based groundedness (cross-encoder/nli-deberta-v3-small) and per-turn
Likert capture via /feedback/rating. compute_evals runs in BackgroundTasks
→ logs/evals.jsonl keyed by run_id. New aggregate.py prints per-persona
latency/faithfulness/alignment/rating rollups.
- README.md +29 -12
- backend/api/main.py +105 -41
- backend/config/settings.py +5 -0
- backend/evals/aggregate.py +169 -0
- backend/evals/faithfulness.py +89 -7
- frontend/src/components/ChatPanel.tsx +8 -4
- frontend/src/components/EvalPanel.tsx +46 -123
- frontend/src/lib/api.ts +15 -0
- frontend/src/types.ts +2 -16
README.md
CHANGED
|
@@ -230,6 +230,10 @@ Everything is a Pydantic setting in [backend/config/settings.py](backend/config/
|
|
| 230 |
| `RERANK_LAMBDA` | `0.7` | MMR balance: `1.0` = pure cosine relevance, lower = more diversity. Drop to `0.5` if results look repetitive. |
|
| 231 |
| `RERANK_QUERY_WEIGHT` | `0.7` | Weight on the current turn vs the mean of recent user turns when building the rerank query. Lower if follow-ups under-weight prior context. |
|
| 232 |
| `LOGS_DIR` | `logs` | Where the per-turn JSONL goes. |
|
| 233 |
|
| 234 |
---
|
| 235 |
|
|
@@ -272,6 +276,17 @@ curl -X POST http://localhost:8000/chat \
|
|
| 272 |
-d '{"user_id": "stephen_hawking", "query": "What do you like to do on weekends?"}'
|
| 273 |
```
|
| 274 |
|
| 275 |
---
|
| 276 |
|
| 277 |
## Project Structure
|
|
@@ -295,6 +310,7 @@ multimodal_aac_chatbot/
|
|
| 295 |
│ ├── sensing/labels.py GESTURE_DIRECTIVES (sensing runs in browser)
|
| 296 |
│ ├── retrieval/ BGE embeddings (torch tensor) + bucket priors
|
| 297 |
│ ├── generation/llm_client.py 2-tier Ollama Cloud LLM client (primary/fallback)
|
|
|
|
| 298 |
│ └── guardrails/checks.py Input + output safety checks
|
| 299 |
│
|
| 300 |
├── data/
|
|
@@ -401,21 +417,22 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
|
|
| 401 |
|
| 402 |
### Evals
|
| 403 |
|
| 404 |
-
|
| 405 |
|
| 406 |
-
|
| 407 |
-
|--------|--------|
|
| 408 |
-
| Efficiency | works (SLO check on `t_total`) |
|
| 409 |
-
| Faithfulness | stub, returns 0 |
|
| 410 |
-
| Multimodal alignment | works — affect (sentiment lexicon), gesture (opener regex), gaze (bucket match) |
|
| 411 |
-
| Authenticity | star rating in UI but not saved |
|
| 412 |
|
| 413 |
-
- [
|
| 414 |
-
- [
|
| 415 |
-
- [x] **[Eval]** Multimodal alignment —
|
| 416 |
-
- [
|
| 417 |
- [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
|
| 418 |
-
- [ ] **[Eval]** Need an offline version of all three model-driven evals (faithfulness / alignment / efficiency). Aggregate numbers across a fixed query set per persona for the writeup
|
| 419 |
|
| 420 |
### Cleanup
|
| 421 |
|
|
|
|
| 230 |
| `RERANK_LAMBDA` | `0.7` | MMR balance: `1.0` = pure cosine relevance, lower = more diversity. Drop to `0.5` if results look repetitive. |
|
| 231 |
| `RERANK_QUERY_WEIGHT` | `0.7` | Weight on the current turn vs the mean of recent user turns when building the rerank query. Lower if follow-ups under-weight prior context. |
|
| 232 |
| `LOGS_DIR` | `logs` | Where the per-turn JSONL goes. |
|
| 233 |
+
| `SLO_TARGET_S` | `6.0` | Latency SLO used by the efficiency eval. |
|
| 234 |
+
| `EVALS_ENABLED` | `true` | Toggle off to skip background eval scoring. |
|
| 235 |
+
| `NLI_MODEL` | `cross-encoder/nli-deberta-v3-small` | NLI model used for the groundedness/hallucination scorer. |
|
| 236 |
+
| `FAITHFULNESS_THRESHOLD` | `0.5` | Per-sentence entailment probability needed to count as grounded. |
|
| 237 |
|
| 238 |
---
|
| 239 |
|
|
|
|
| 276 |
-d '{"user_id": "stephen_hawking", "query": "What do you like to do on weekends?"}'
|
| 277 |
```
|
| 278 |
|
| 279 |
+
### Offline eval aggregation
|
| 280 |
+
|
| 281 |
+
After a few turns have been logged, print a per-persona report:
|
| 282 |
+
|
| 283 |
+
```bash
|
| 284 |
+
conda activate aac-chatbot
|
| 285 |
+
python -m backend.evals.aggregate
|
| 286 |
+
```
|
| 287 |
+
|
| 288 |
+
Output covers latency quantiles + SLO pass rate, faithfulness (groundedness / hallucination), multimodal alignment, and the distribution of Likert ratings. Reads `logs/turns.jsonl`, `logs/evals.jsonl`, and `logs/ratings.jsonl`.
|
| 289 |
+
|
| 290 |
---
|
| 291 |
|
| 292 |
## Project Structure
|
|
|
|
| 310 |
│ ├── sensing/labels.py GESTURE_DIRECTIVES (sensing runs in browser)
|
| 311 |
│ ├── retrieval/ BGE embeddings (torch tensor) + bucket priors
|
| 312 |
│ ├── generation/llm_client.py 2-tier Ollama Cloud LLM client (primary/fallback)
|
| 313 |
+
│ ├── evals/ faithfulness (NLI), efficiency, multimodal, aggregate CLI
|
| 314 |
│ └── guardrails/checks.py Input + output safety checks
|
| 315 |
│
|
| 316 |
├── data/
|
|
|
|
| 417 |
|
| 418 |
### Evals
|
| 419 |
|
| 420 |
+
Scoring runs out-of-band via FastAPI `BackgroundTasks` after `/chat` returns — the response path stays clean. Each scored turn is appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings from the UI go to `logs/ratings.jsonl`.
|
| 421 |
+
|
| 422 |
+
| Metric | Status | Where |
|
| 423 |
+
|--------|--------|-------|
|
| 424 |
+
| Efficiency | per-turn SLO on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
|
| 425 |
+
| Faithfulness | sentence-level NLI, `no_evidence` short-circuit | [faithfulness.py](backend/evals/faithfulness.py) |
|
| 426 |
+
| Multimodal alignment | affect (sentiment lexicon), gesture (opener regex), gaze (bucket overlap) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
|
| 427 |
+
| Authenticity | star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
|
| 428 |
|
| 429 |
+
**First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first background score after a server restart. Turn 1's score lands a few seconds after the response; every turn after that is fast.
|
| 430 |
|
| 431 |
+
- [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
|
| 432 |
+
- [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
|
| 433 |
+
- [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
|
| 434 |
+
- [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
|
| 435 |
- [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
|
|
|
|
| 436 |
|
| 437 |
### Cleanup
|
| 438 |
|
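The README text above says each scored turn in `logs/evals.jsonl` is keyed by `run_id` so it can be joined back to `logs/turns.jsonl` offline. A minimal sketch of that join follows. It assumes that every record in `turns.jsonl` also carries a `run_id` field (the turn-log schema is not shown in this commit), and the helper names `load_jsonl` and `join_on_run_id` are hypothetical, not part of the repository.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def join_on_run_id(turns: list[dict], evals: list[dict]) -> list[dict]:
    """Left-join evals onto turns; a turn with no score yet gets eval=None."""
    by_run = {e["run_id"]: e for e in evals}
    return [{**t, "eval": by_run.get(t.get("run_id"))} for t in turns]


# Hypothetical records written to a temp dir; real logs live under LOGS_DIR.
logs = Path(tempfile.mkdtemp())
(logs / "turns.jsonl").write_text(
    json.dumps({"run_id": "a" * 32, "user_id": "stephen_hawking"}) + "\n"
    + json.dumps({"run_id": "b" * 32, "user_id": "stephen_hawking"}) + "\n",
    encoding="utf-8",
)
(logs / "evals.jsonl").write_text(
    json.dumps({"run_id": "a" * 32, "groundedness": 0.75, "no_evidence": False})
    + "\n",
    encoding="utf-8",
)

joined = join_on_run_id(
    load_jsonl(logs / "turns.jsonl"), load_jsonl(logs / "evals.jsonl")
)
for row in joined:
    print(row["run_id"][:4], row["eval"])
```

A left join is used here because scoring happens in the background: a turn can be logged before its eval entry exists, so a missing score is expected and should not drop the turn.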
backend/api/main.py
CHANGED
|
@@ -3,10 +3,13 @@ from __future__ import annotations
|
|
| 3 |
|
| 4 |
import json
|
| 5 |
import logging
|
| 6 |
|
| 7 |
-
from fastapi import FastAPI, HTTPException
|
| 8 |
from fastapi.middleware.cors import CORSMiddleware
|
| 9 |
-
from pydantic import BaseModel
|
| 10 |
|
| 11 |
from backend.config.settings import settings
|
| 12 |
from backend.evals import compute_evals
|
|
@@ -38,6 +41,8 @@ app.add_middleware(
|
|
| 38 |
|
| 39 |
_log = logging.getLogger(__name__)
|
| 40 |
_models_ready = False
|
|
|
|
|
|
|
| 41 |
|
| 42 |
|
| 43 |
@app.on_event("startup")
|
|
@@ -78,20 +83,6 @@ class TurnaroundRequest(BaseModel):
|
|
| 78 |
head_signal: str | None = None
|
| 79 |
|
| 80 |
|
| 81 |
-
class EvalScoresResponse(BaseModel):
|
| 82 |
-
groundedness: float
|
| 83 |
-
hallucination_rate: float
|
| 84 |
-
no_evidence: bool
|
| 85 |
-
t_total_s: float
|
| 86 |
-
slo_target_s: float
|
| 87 |
-
slo_passed: bool
|
| 88 |
-
slo_margin_s: float
|
| 89 |
-
multimodal_alignment: float
|
| 90 |
-
affect_alignment: float
|
| 91 |
-
gesture_alignment: float
|
| 92 |
-
gaze_alignment: float
|
| 93 |
-
|
| 94 |
-
|
| 95 |
class ChatResponse(BaseModel):
|
| 96 |
user_id: str
|
| 97 |
query: str
|
|
@@ -102,10 +93,18 @@ class ChatResponse(BaseModel):
|
|
| 102 |
retrieval_mode: str
|
| 103 |
latency: dict
|
| 104 |
guardrail_passed: bool
|
| 105 |
-
|
| 106 |
turn_id: int
|
| 107 |
|
| 108 |
|
| 109 |
# ── Helpers ────────────────────────────────────────────────────────────────────
|
| 110 |
|
| 111 |
|
|
@@ -275,8 +274,44 @@ def reset_session(user_id: str):
|
|
| 275 |
return {"status": "reset", "user_id": user_id}
|
| 276 |
|
| 277 |
|
| 278 |
@app.post("/chat", response_model=ChatResponse)
|
| 279 |
-
def chat(req: ChatRequest):
|
| 280 |
guard = check_input(req.query)
|
| 281 |
if not guard["allowed"]:
|
| 282 |
return ChatResponse(
|
|
@@ -297,22 +332,26 @@ def chat(req: ChatRequest):
|
|
| 297 |
|
| 298 |
result: PipelineState = run_pipeline(initial_state)
|
| 299 |
|
| 300 |
-
# Persist updated session state
|
| 301 |
session["session_history"] = result["session_history"]
|
| 302 |
session["bucket_priors"] = result["bucket_priors"]
|
| 303 |
session["last_state"] = result
|
| 304 |
|
| 305 |
-
# Compute evaluation metrics
|
| 306 |
affect_emotion = (result.get("affect") or {}).get("emotion", "NEUTRAL")
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
|
| 315 |
-
|
| 316 |
|
| 317 |
return ChatResponse(
|
| 318 |
user_id=req.user_id,
|
|
@@ -324,13 +363,13 @@ def chat(req: ChatRequest):
|
|
| 324 |
retrieval_mode=result.get("retrieval_mode_used", "unknown"),
|
| 325 |
latency=result.get("latency_log") or {},
|
| 326 |
guardrail_passed=result.get("guardrail_passed", True),
|
| 327 |
-
|
| 328 |
turn_id=result["turn_id"],
|
| 329 |
)
|
| 330 |
|
| 331 |
|
| 332 |
@app.post("/chat/turnaround", response_model=ChatResponse)
|
| 333 |
-
def chat_turnaround(req: TurnaroundRequest):
|
| 334 |
if req.user_id not in _sessions:
|
| 335 |
raise HTTPException(status_code=404, detail="no active session")
|
| 336 |
|
|
@@ -400,15 +439,21 @@ def chat_turnaround(req: TurnaroundRequest):
|
|
| 400 |
session["last_state"] = replan_state
|
| 401 |
|
| 402 |
affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
|
| 403 |
-
|
| 404 |
-
|
| 405 |
-
|
| 406 |
-
|
| 407 |
-
|
| 408 |
-
|
| 409 |
-
|
| 410 |
-
|
| 411 |
-
|
| 412 |
|
| 413 |
return ChatResponse(
|
| 414 |
user_id=req.user_id,
|
|
@@ -420,6 +465,25 @@ def chat_turnaround(req: TurnaroundRequest):
|
|
| 420 |
retrieval_mode=replan_state.get("retrieval_mode_used", "unknown"),
|
| 421 |
latency=replan_state.get("latency_log") or {},
|
| 422 |
guardrail_passed=replan_state.get("guardrail_passed", True),
|
| 423 |
-
|
| 424 |
turn_id=replan_state["turn_id"],
|
| 425 |
)
|
| 3 |
|
| 4 |
import json
|
| 5 |
import logging
|
| 6 |
+
import re
|
| 7 |
+
import time
|
| 8 |
+
from pathlib import Path
|
| 9 |
|
| 10 |
+
from fastapi import BackgroundTasks, FastAPI, HTTPException
|
| 11 |
from fastapi.middleware.cors import CORSMiddleware
|
| 12 |
+
from pydantic import BaseModel, Field
|
| 13 |
|
| 14 |
from backend.config.settings import settings
|
| 15 |
from backend.evals import compute_evals
|
|
|
|
| 41 |
|
| 42 |
_log = logging.getLogger(__name__)
|
| 43 |
_models_ready = False
|
| 44 |
+
_RUN_ID_RE = re.compile(r"^[0-9a-f]{32}$")
|
| 45 |
+
_ID_PATTERN = r"^[a-zA-Z0-9_-]+$"
|
| 46 |
|
| 47 |
|
| 48 |
@app.on_event("startup")
|
|
|
|
| 83 |
head_signal: str | None = None
|
| 84 |
|
| 85 |
|
| 86 |
class ChatResponse(BaseModel):
|
| 87 |
user_id: str
|
| 88 |
query: str
|
|
|
|
| 93 |
retrieval_mode: str
|
| 94 |
latency: dict
|
| 95 |
guardrail_passed: bool
|
| 96 |
+
run_id: str | None = None
|
| 97 |
turn_id: int
|
| 98 |
|
| 99 |
|
| 100 |
+
class RatingRequest(BaseModel):
|
| 101 |
+
run_id: str = Field(min_length=1, max_length=64, pattern=_ID_PATTERN)
|
| 102 |
+
user_id: str = Field(min_length=1, max_length=64, pattern=_ID_PATTERN)
|
| 103 |
+
authenticity: int = Field(ge=1, le=5)
|
| 104 |
+
rater_id: str = Field(default="anonymous", max_length=64, pattern=_ID_PATTERN)
|
| 105 |
+
notes: str | None = Field(default=None, max_length=500)
|
| 106 |
+
|
| 107 |
+
|
| 108 |
# ── Helpers ────────────────────────────────────────────────────────────────────
|
| 109 |
|
| 110 |
|
|
|
|
| 274 |
return {"status": "reset", "user_id": user_id}
|
| 275 |
|
| 276 |
|
| 277 |
+
def _score_and_persist(
|
| 278 |
+
run_id: str,
|
| 279 |
+
user_id: str,
|
| 280 |
+
turn_id: int,
|
| 281 |
+
response: str,
|
| 282 |
+
chunks: list[dict],
|
| 283 |
+
latency_log: dict,
|
| 284 |
+
affect: str,
|
| 285 |
+
gesture_tag: str | None,
|
| 286 |
+
gaze_bucket: str | None,
|
| 287 |
+
) -> None:
|
| 288 |
+
try:
|
| 289 |
+
scores = compute_evals(
|
| 290 |
+
response=response,
|
| 291 |
+
chunks=chunks,
|
| 292 |
+
latency_log=latency_log,
|
| 293 |
+
affect=affect,
|
| 294 |
+
gesture_tag=gesture_tag,
|
| 295 |
+
gaze_bucket=gaze_bucket,
|
| 296 |
+
slo_target=settings.slo_target_s,
|
| 297 |
+
)
|
| 298 |
+
entry = {
|
| 299 |
+
"run_id": run_id,
|
| 300 |
+
"ts": time.time(),
|
| 301 |
+
"user_id": user_id,
|
| 302 |
+
"turn_id": turn_id,
|
| 303 |
+
**scores,
|
| 304 |
+
}
|
| 305 |
+
logs_dir = Path(settings.logs_dir)
|
| 306 |
+
logs_dir.mkdir(parents=True, exist_ok=True)
|
| 307 |
+
with open(logs_dir / "evals.jsonl", "a", encoding="utf-8") as f:
|
| 308 |
+
f.write(json.dumps(entry, ensure_ascii=False) + "\n")
|
| 309 |
+
except Exception:
|
| 310 |
+
_log.exception("evals background scoring failed for run %s", run_id)
|
| 311 |
+
|
| 312 |
+
|
| 313 |
@app.post("/chat", response_model=ChatResponse)
|
| 314 |
+
def chat(req: ChatRequest, background_tasks: BackgroundTasks):
|
| 315 |
guard = check_input(req.query)
|
| 316 |
if not guard["allowed"]:
|
| 317 |
return ChatResponse(
|
|
|
|
| 332 |
|
| 333 |
result: PipelineState = run_pipeline(initial_state)
|
| 334 |
|
|
|
|
| 335 |
session["session_history"] = result["session_history"]
|
| 336 |
session["bucket_priors"] = result["bucket_priors"]
|
| 337 |
session["last_state"] = result
|
| 338 |
|
|
|
|
| 339 |
affect_emotion = (result.get("affect") or {}).get("emotion", "NEUTRAL")
|
| 340 |
+
run_id = result.get("run_id")
|
| 341 |
+
|
| 342 |
+
if settings.evals_enabled and run_id:
|
| 343 |
+
background_tasks.add_task(
|
| 344 |
+
_score_and_persist,
|
| 345 |
+
run_id=run_id,
|
| 346 |
+
user_id=req.user_id,
|
| 347 |
+
turn_id=result["turn_id"],
|
| 348 |
+
response=result["selected_response"] or "",
|
| 349 |
+
chunks=list(result.get("retrieved_chunks") or []),
|
| 350 |
+
latency_log=dict(result.get("latency_log") or {}),
|
| 351 |
+
affect=affect_emotion,
|
| 352 |
+
gesture_tag=req.gesture_tag,
|
| 353 |
+
gaze_bucket=req.gaze_bucket,
|
| 354 |
+
)
|
| 355 |
|
| 356 |
return ChatResponse(
|
| 357 |
user_id=req.user_id,
|
|
|
|
| 363 |
retrieval_mode=result.get("retrieval_mode_used", "unknown"),
|
| 364 |
latency=result.get("latency_log") or {},
|
| 365 |
guardrail_passed=result.get("guardrail_passed", True),
|
| 366 |
+
run_id=run_id,
|
| 367 |
turn_id=result["turn_id"],
|
| 368 |
)
|
| 369 |
|
| 370 |
|
| 371 |
@app.post("/chat/turnaround", response_model=ChatResponse)
|
| 372 |
+
def chat_turnaround(req: TurnaroundRequest, background_tasks: BackgroundTasks):
|
| 373 |
if req.user_id not in _sessions:
|
| 374 |
raise HTTPException(status_code=404, detail="no active session")
|
| 375 |
|
|
|
|
| 439 |
session["last_state"] = replan_state
|
| 440 |
|
| 441 |
affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
|
| 442 |
+
run_id = replan_state.get("run_id")
|
| 443 |
+
|
| 444 |
+
if settings.evals_enabled and run_id:
|
| 445 |
+
background_tasks.add_task(
|
| 446 |
+
_score_and_persist,
|
| 447 |
+
run_id=run_id,
|
| 448 |
+
user_id=req.user_id,
|
| 449 |
+
turn_id=replan_state["turn_id"],
|
| 450 |
+
response=replan_state["selected_response"] or "",
|
| 451 |
+
chunks=list(replan_state.get("retrieved_chunks") or []),
|
| 452 |
+
latency_log=dict(replan_state.get("latency_log") or {}),
|
| 453 |
+
affect=affect_emotion,
|
| 454 |
+
gesture_tag=replan_state.get("gesture_tag"),
|
| 455 |
+
gaze_bucket=replan_state.get("gaze_bucket"),
|
| 456 |
+
)
|
| 457 |
|
| 458 |
return ChatResponse(
|
| 459 |
user_id=req.user_id,
|
|
|
|
| 465 |
retrieval_mode=replan_state.get("retrieval_mode_used", "unknown"),
|
| 466 |
latency=replan_state.get("latency_log") or {},
|
| 467 |
guardrail_passed=replan_state.get("guardrail_passed", True),
|
| 468 |
+
run_id=run_id,
|
| 469 |
turn_id=replan_state["turn_id"],
|
| 470 |
)
|
| 471 |
+
|
| 472 |
+
|
| 473 |
+
@app.post("/feedback/rating")
|
| 474 |
+
def submit_rating(req: RatingRequest):
|
| 475 |
+
if not _RUN_ID_RE.match(req.run_id):
|
| 476 |
+
raise HTTPException(status_code=400, detail="invalid run_id")
|
| 477 |
+
logs_dir = Path(settings.logs_dir)
|
| 478 |
+
logs_dir.mkdir(parents=True, exist_ok=True)
|
| 479 |
+
entry = {
|
| 480 |
+
"ts": time.time(),
|
| 481 |
+
"run_id": req.run_id,
|
| 482 |
+
"user_id": req.user_id,
|
| 483 |
+
"authenticity": req.authenticity,
|
| 484 |
+
"rater_id": req.rater_id,
|
| 485 |
+
"notes": req.notes,
|
| 486 |
+
}
|
| 487 |
+
with open(logs_dir / "ratings.jsonl", "a", encoding="utf-8") as f:
|
| 488 |
+
f.write(json.dumps(entry, ensure_ascii=False) + "\n")
|
| 489 |
+
return {"status": "ok"}
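The `/feedback/rating` handler above applies two checks to `run_id`: the general `_ID_PATTERN` on the Pydantic field and the stricter `_RUN_ID_RE` inside the handler. The sketch below shows what the stricter pattern accepts. It assumes the pipeline mints `run_id` as `uuid.uuid4().hex`, which this diff does not confirm; `is_valid_run_id` is a hypothetical helper that only mirrors the two patterns from the commit.

```python
import re
import uuid

# Both patterns are copied from backend/api/main.py in this commit.
_RUN_ID_RE = re.compile(r"^[0-9a-f]{32}$")
_ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def is_valid_run_id(run_id: str) -> bool:
    """Hypothetical helper: apply the field pattern, then the strict pattern."""
    return bool(_ID_PATTERN.match(run_id)) and bool(_RUN_ID_RE.match(run_id))


# ASSUMPTION: run_id is produced as a 32-character lowercase hex UUID.
hex_id = uuid.uuid4().hex
dashed_id = str(uuid.uuid4())  # 36 chars with dashes

print(is_valid_run_id(hex_id))     # True: 32 lowercase hex characters
print(is_valid_run_id(dashed_id))  # False: passes _ID_PATTERN, fails _RUN_ID_RE
print(is_valid_run_id("A" * 32))   # False: uppercase is outside [0-9a-f]
```

If the assumption holds, the handler's 400 response only fires for IDs that pass the looser field pattern but are not 32-character lowercase hex, such as a dashed UUID.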
|
backend/config/settings.py
CHANGED
|
@@ -67,6 +67,11 @@ class Settings(BaseSettings):
|
|
| 67 |
|
| 68 |
# ── Evaluation ────────────────────────────────────────────────────────────
|
| 69 |
slo_target_s: float = 6.0 # max acceptable response latency (seconds)
|
| 70 |
|
| 71 |
|
| 72 |
settings = Settings()
|
|
|
|
| 67 |
|
| 68 |
# ── Evaluation ────────────────────────────────────────────────────────────
|
| 69 |
slo_target_s: float = 6.0 # max acceptable response latency (seconds)
|
| 70 |
+
evals_enabled: bool = True
|
| 71 |
+
nli_model: str = "cross-encoder/nli-deberta-v3-small"
|
| 72 |
+
faithfulness_threshold: float = (
|
| 73 |
+
0.5 # entailment prob for a sentence to count as grounded
|
| 74 |
+
)
|
| 75 |
|
| 76 |
|
| 77 |
settings = Settings()
|
backend/evals/aggregate.py
ADDED
|
@@ -0,0 +1,169 @@
|
| 1 |
+
"""Offline aggregator: reads turns.jsonl, evals.jsonl, ratings.jsonl and prints
|
| 2 |
+
per-persona metrics. Run: python -m backend.evals.aggregate
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
import argparse
|
| 6 |
+
import json
|
| 7 |
+
import statistics
|
| 8 |
+
import sys
|
| 9 |
+
from collections import defaultdict
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
|
| 12 |
+
from backend.config.settings import settings
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
def _load(path: Path) -> list[dict]:
|
| 16 |
+
if not path.exists():
|
| 17 |
+
return []
|
| 18 |
+
out = []
|
| 19 |
+
skipped = 0
|
| 20 |
+
with open(path, encoding="utf-8") as f:
|
| 21 |
+
for line in f:
|
| 22 |
+
line = line.strip()
|
| 23 |
+
if not line:
|
| 24 |
+
continue
|
| 25 |
+
try:
|
| 26 |
+
out.append(json.loads(line))
|
| 27 |
+
except json.JSONDecodeError:
|
| 28 |
+
skipped += 1
|
| 29 |
+
if skipped:
|
| 30 |
+
print(
|
| 31 |
+
f"[aggregate] skipped {skipped} malformed lines in {path}",
|
| 32 |
+
file=sys.stderr,
|
| 33 |
+
)
|
| 34 |
+
return out
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def _quantile(values: list[float], q: float) -> float:
|
| 38 |
+
if not values:
|
| 39 |
+
return 0.0
|
| 40 |
+
if len(values) == 1:
|
| 41 |
+
return values[0]
|
| 42 |
+
idx = max(0, min(len(values) - 1, int(round(q * (len(values) - 1)))))
|
| 43 |
+
return sorted(values)[idx]
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def _fmt_ms(s: float) -> str:
|
| 47 |
+
return f"{s * 1000:.0f}ms"
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def report_latency(turns: list[dict]) -> None:
|
| 51 |
+
print("\n=== Communication Efficiency (latency) ===")
|
| 52 |
+
by_group: dict[tuple[str, str], list[float]] = defaultdict(list)
|
| 53 |
+
for t in turns:
|
| 54 |
+
key = (t.get("user_id", "?"), t.get("llm_tier", "?"))
|
| 55 |
+
by_group[key].append(t.get("latency", {}).get("t_total", 0.0))
|
| 56 |
+
|
| 57 |
+
slo = settings.slo_target_s
|
| 58 |
+
print(f"SLO target: < {slo}s")
|
| 59 |
+
print(
|
| 60 |
+
f"{'user_id':<18} {'tier':<10} {'n':>5} {'p50':>8} {'p95':>8} {'p99':>8} {'pass%':>7}"
|
| 61 |
+
)
|
| 62 |
+
for (uid, tier), lats in sorted(by_group.items()):
|
| 63 |
+
if not lats:
|
| 64 |
+
continue
|
| 65 |
+
p50 = _quantile(lats, 0.5)
|
| 66 |
+
p95 = _quantile(lats, 0.95)
|
| 67 |
+
p99 = _quantile(lats, 0.99)
|
| 68 |
+
passed = sum(1 for x in lats if x < slo) / len(lats) * 100
|
| 69 |
+
print(
|
| 70 |
+
f"{uid:<18} {tier:<10} {len(lats):>5} "
|
| 71 |
+
f"{_fmt_ms(p50):>8} {_fmt_ms(p95):>8} {_fmt_ms(p99):>8} {passed:>6.1f}%"
|
| 72 |
+
)
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def report_faithfulness(evals: list[dict]) -> None:
|
| 76 |
+
print("\n=== Factual Faithfulness ===")
|
| 77 |
+
scored = [e for e in evals if not e.get("no_evidence")]
|
| 78 |
+
if not scored:
|
| 79 |
+
print("(no turns with retrieved evidence)")
|
| 80 |
+
return
|
| 81 |
+
by_user: dict[str, list[dict]] = defaultdict(list)
|
| 82 |
+
for e in scored:
|
| 83 |
+
by_user[e.get("user_id", "?")].append(e)
|
| 84 |
+
|
| 85 |
+
print(f"{'user_id':<18} {'n':>5} {'groundedness':>14} {'hallucination':>14}")
|
| 86 |
+
for uid, rows in sorted(by_user.items()):
|
| 87 |
+
g = statistics.mean(r["groundedness"] for r in rows)
|
| 88 |
+
h = statistics.mean(r["hallucination_rate"] for r in rows)
|
| 89 |
+
print(f"{uid:<18} {len(rows):>5} {g:>13.2%} {h:>13.2%}")
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
def _mean_nonzero(rows: list[dict], key: str) -> tuple[float, float]:
|
| 93 |
+
# Coverage % undercounts real zeros (a genuinely 0.0-aligned response looks
|
| 94 |
+
# identical to one where the signal was absent). Fixable by serializing
|
| 95 |
+
# null for absent signals in compute_multimodal_alignment.
|
| 96 |
+
vals = [float(r.get(key, 0.0)) for r in rows]
|
| 97 |
+
nonzero = [v for v in vals if v > 0]
|
| 98 |
+
if not nonzero:
|
| 99 |
+
return 0.0, 0.0
|
| 100 |
+
return statistics.mean(nonzero), len(nonzero) / len(vals)
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
def _fmt_mean_cov(rows: list[dict], key: str) -> str:
|
| 104 |
+
mean, cov = _mean_nonzero(rows, key)
|
| 105 |
+
return f"{mean:>5.0%}|{cov:>5.0%}"
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
def report_multimodal(evals: list[dict]) -> None:
|
| 109 |
+
print("\n=== Multimodal Alignment (mean among non-zero | coverage) ===")
|
| 110 |
+
if not evals:
|
| 111 |
+
print("(no evals logged)")
|
| 112 |
+
return
|
| 113 |
+
by_user: dict[str, list[dict]] = defaultdict(list)
|
| 114 |
+
for e in evals:
|
| 115 |
+
by_user[e.get("user_id", "?")].append(e)
|
| 116 |
+
|
| 117 |
+
print(f"{'user_id':<18} {'n':>5} {'affect':>16} {'gesture':>16} {'gaze':>16}")
|
| 118 |
+
for uid, rows in sorted(by_user.items()):
|
| 119 |
+
print(
|
| 120 |
+
f"{uid:<18} {len(rows):>5} "
|
| 121 |
+
f"{_fmt_mean_cov(rows, 'affect_alignment'):>16} "
|
| 122 |
+
f"{_fmt_mean_cov(rows, 'gesture_alignment'):>16} "
|
| 123 |
+
f"{_fmt_mean_cov(rows, 'gaze_alignment'):>16}"
|
| 124 |
+
)
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def report_authenticity(ratings: list[dict]) -> None:
|
| 128 |
+
print("\n=== Perceived Authenticity (Likert 1-5) ===")
|
| 129 |
+
by_user: dict[str, list[int]] = defaultdict(list)
|
| 130 |
+
for r in ratings:
|
| 131 |
+
raw = r.get("authenticity")
|
| 132 |
+
try:
|
| 133 |
+
score = int(raw)
|
| 134 |
+
except (TypeError, ValueError):
|
| 135 |
+
continue
|
| 136 |
+
if not 1 <= score <= 5:
|
| 137 |
+
continue
|
| 138 |
+
by_user[r.get("user_id", "?")].append(score)
|
| 139 |
+
|
| 140 |
+
if not by_user:
|
| 141 |
+
print("(no valid ratings logged yet)")
|
| 142 |
+
return
|
| 143 |
+
|
| 144 |
+
print(f"{'user_id':<18} {'n':>5} {'mean':>6} {'dist (1..5)':>22}")
|
| 145 |
+
for uid, scores in sorted(by_user.items()):
|
| 146 |
+
mean = statistics.mean(scores)
|
| 147 |
+
dist = [scores.count(i) for i in range(1, 6)]
|
| 148 |
+
dist_str = "/".join(str(x) for x in dist)
|
| 149 |
+
print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
|
| 150 |
+
|
| 151 |
+
|
| 152 |
+
def main() -> None:
|
| 153 |
+
parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
|
| 154 |
+
parser.add_argument("--logs", type=Path, default=settings.logs_dir)
|
| 155 |
+
args = parser.parse_args()
|
| 156 |
+
|
| 157 |
+
turns = _load(args.logs / "turns.jsonl")
|
| 158 |
+
evals = _load(args.logs / "evals.jsonl")
|
| 159 |
+
ratings = _load(args.logs / "ratings.jsonl")
|
| 160 |
+
|
| 161 |
+
print(f"Loaded: {len(turns)} turns, {len(evals)} evals, {len(ratings)} ratings")
|
| 162 |
+
report_latency(turns)
|
| 163 |
+
report_faithfulness(evals)
|
| 164 |
+
report_multimodal(evals)
|
| 165 |
+
report_authenticity(ratings)
|
| 166 |
+
|
| 167 |
+
|
| 168 |
+
if __name__ == "__main__":
|
| 169 |
+
main()
|
backend/evals/faithfulness.py
CHANGED
|
@@ -1,12 +1,94 @@
|
|
| 1 |
-
|
| 2 |
-
from
|
| 3 |
|
| 4 |
|
| 5 |
def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
|
| 6 |
-
"""
|
| 7 |
-
|
| 8 |
return {
|
| 9 |
-
"groundedness":
|
| 10 |
-
"hallucination_rate":
|
| 11 |
-
"no_evidence":
|
| 12 |
}
|
|
|
|
| 1 |
+
import re
|
| 2 |
+
from threading import Lock, Semaphore
|
| 3 |
+
|
| 4 |
+
import torch
|
| 5 |
+
|
| 6 |
+
from backend.config.settings import settings
|
| 7 |
+
|
| 8 |
+
_model = None
|
| 9 |
+
_entail_idx: int | None = None
|
| 10 |
+
_model_lock = Lock()
|
| 11 |
+
_predict_sem = Semaphore(1)
|
| 12 |
+
_SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")
|
| 13 |
+
_MAX_SENTENCES = 20
|
| 14 |
+
_MIN_SENTENCE_WORDS = 2
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
def _get_model():
|
| 18 |
+
global _model, _entail_idx
|
| 19 |
+
if _model is not None:
|
| 20 |
+
return _model
|
| 21 |
+
with _model_lock:
|
| 22 |
+
if _model is not None:
|
| 23 |
+
return _model
|
| 24 |
+
from sentence_transformers import CrossEncoder
|
| 25 |
+
|
| 26 |
+
device = (
|
| 27 |
+
"mps"
|
| 28 |
+
if torch.backends.mps.is_available()
|
| 29 |
+
else "cuda"
|
| 30 |
+
if torch.cuda.is_available()
|
| 31 |
+
else "cpu"
|
| 32 |
+
)
|
| 33 |
+
model = CrossEncoder(settings.nli_model, device=device)
|
| 34 |
+
label2id = getattr(model.config, "label2id", None) or {}
|
| 35 |
+
for label, idx in label2id.items():
|
| 36 |
+
if label.lower() == "entailment":
|
| 37 |
+
_entail_idx = int(idx)
|
| 38 |
+
break
|
| 39 |
+
if _entail_idx is None:
|
| 40 |
+
raise RuntimeError(
|
| 41 |
+
f"NLI model {settings.nli_model!r} has no 'entailment' label"
|
| 42 |
+
)
|
| 43 |
+
_model = model
|
| 44 |
+
return _model
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def _split_sentences(text: str) -> list[str]:
|
| 48 |
+
parts = [s.strip() for s in _SENT_SPLIT.split(text.strip()) if s.strip()]
|
| 49 |
+
return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
|
| 50 |
|
| 51 |
|
| 52 |
def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
|
| 53 |
+
"""Sentence-level NLI: each sentence must be entailed by at least one chunk."""
|
| 54 |
+
if not chunks:
|
| 55 |
+
return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
|
| 56 |
+
|
| 57 |
+
sentences = _split_sentences(response)
|
| 58 |
+
# Too short to score meaningfully (one-word replies, fragments). Flagging as
|
| 59 |
+
# no_evidence is honest: we're not scoring it, so it should be excluded from
|
| 60 |
+
# groundedness averages downstream.
|
| 61 |
+
if not sentences:
|
| 62 |
+
return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
|
| 63 |
+
|
| 64 |
+
chunk_texts = [c.get("text", "") for c in chunks if c.get("text")]
|
| 65 |
+
if not chunk_texts:
|
| 66 |
+
return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
|
| 67 |
+
|
| 68 |
+
if len(sentences) > _MAX_SENTENCES:
|
| 69 |
+
sentences = sentences[:_MAX_SENTENCES]
|
| 70 |
+
|
| 71 |
+
model = _get_model()
|
| 72 |
+
# NLI pair order: (premise, hypothesis). Chunks are evidence (premise),
|
| 73 |
+
# response sentences are the claims being checked (hypothesis).
|
| 74 |
+
pairs = [(chunk, sent) for sent in sentences for chunk in chunk_texts]
|
| 75 |
+
with _predict_sem:
|
| 76 |
+
logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
|
| 77 |
+
probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
|
| 78 |
+
entail = probs[:, _entail_idx]
|
| 79 |
+
|
| 80 |
+
n_chunks = len(chunk_texts)
|
| 81 |
+
threshold = settings.faithfulness_threshold
|
| 82 |
+
grounded = 0
|
| 83 |
+
for i in range(len(sentences)):
|
| 84 |
+
sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
|
| 85 |
+
if sent_scores.max() >= threshold:
|
| 86 |
+
grounded += 1
|
| 87 |
+
|
| 88 |
+
total = len(sentences)
|
| 89 |
+
groundedness = grounded / total
|
| 90 |
return {
|
| 91 |
+
"groundedness": round(groundedness, 4),
|
| 92 |
+
"hallucination_rate": round(1.0 - groundedness, 4),
|
| 93 |
+
"no_evidence": False,
|
| 94 |
}
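The scoring loop in `compute_faithfulness` relies on the pair ordering: pairs are built sentence-major, so the entailment scores for sentence `i` occupy the slice `[i * n_chunks, (i + 1) * n_chunks)`. The sketch below reproduces the sentence split and the per-sentence max-over-chunks rule in plain Python, without loading the NLI model. The entailment probabilities are invented numbers, and `split_sentences` and `groundedness` are simplified stand-ins for the functions in the commit, not the commit's own code.

```python
import re

_SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")  # same pattern as the commit


def split_sentences(text: str, min_words: int = 2) -> list[str]:
    """Split on sentence-final punctuation and drop very short fragments."""
    parts = [s.strip() for s in _SENT_SPLIT.split(text.strip()) if s.strip()]
    return [s for s in parts if len(s.split()) >= min_words]


def groundedness(entail: list[float], n_sentences: int, n_chunks: int,
                 threshold: float = 0.5) -> float:
    """A sentence counts as grounded if any chunk entails it at >= threshold."""
    grounded = 0
    for i in range(n_sentences):
        sent_scores = entail[i * n_chunks:(i + 1) * n_chunks]
        if max(sent_scores) >= threshold:
            grounded += 1
    return grounded / n_sentences


sentences = split_sentences("I love physics. Yes! I was born in Oxford.")
# "Yes!" has one word, so it is filtered out before scoring.

# Made-up entailment probabilities for 2 sentences x 2 chunks, sentence-major:
# sentence 0 vs chunk 0, sentence 0 vs chunk 1, sentence 1 vs chunk 0, ...
entail = [0.91, 0.10, 0.20, 0.30]

score = groundedness(entail, n_sentences=len(sentences), n_chunks=2)
print(sentences, score)
```

Taking the max over chunks means one strongly supporting chunk is enough; a sentence is never penalised because other retrieved chunks are unrelated to it.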
|
frontend/src/components/ChatPanel.tsx
CHANGED
|
@@ -85,7 +85,7 @@ export function ChatPanel({
   content: res.response,
   latency: res.latency,
   affect: res.affect,
-
+  runId: res.run_id,
   turnId: res.turn_id,
   isTurnaround: true,
 });
@@ -174,7 +174,7 @@ export function ChatPanel({
     content: res.response,
     latency: res.latency,
     affect: res.affect,
-
+    runId: res.run_id,
     turnId: res.turn_id,
   },
 ]);
@@ -224,8 +224,12 @@ export function ChatPanel({
       )}
     </span>
     <p>{msg.content}</p>
-    {msg.role === "aac_user" && msg.
-      <EvalPanel
+    {msg.role === "aac_user" && msg.runId && userId && (
+      <EvalPanel
+        runId={msg.runId}
+        userId={userId}
+        latencyTotal={msg.latency?.t_total ?? 0}
+      />
     )}
   </div>
 ))}
|
frontend/src/components/EvalPanel.tsx
CHANGED
|
@@ -1,137 +1,60 @@
 import { useState } from "react";
-import
+import { submitRating } from "../lib/api";

 interface Props {
-
+  runId: string;
+  userId: string;
+  latencyTotal: number;
+  sloTarget?: number;
 }

-function
-  const
-  const color = pct > 70 ? "#4caf50" : pct > 40 ? "#ff9800" : "#f44336";
-  return (
-    <div className="score-bar">
-      <div className="score-bar-fill" style={{ width: `${pct}%`, background: color }} />
-    </div>
-  );
-}
-
-function StarRating({
-  value,
-  onChange,
-}: {
-  value: number | null;
-  onChange: (v: number) => void;
-}) {
+export function EvalPanel({ runId, userId, latencyTotal, sloTarget = 6.0 }: Props) {
+  const [value, setValue] = useState<number | null>(null);
   const [hover, setHover] = useState(0);
-
-
-
-
-
-
-
-
-
-
-
-
-)
-
-
-
-
-
-
-  const [expanded, setExpanded] = useState(false);
-  const [likert, setLikert] = useState<number | null>(null);
+  const [submitting, setSubmitting] = useState(false);
+  const sloPassed = latencyTotal > 0 && latencyTotal < sloTarget;
+
+  async function rate(stars: number) {
+    if (submitting || value !== null) return;
+    setSubmitting(true);
+    try {
+      await submitRating({
+        run_id: runId,
+        user_id: userId,
+        authenticity: stars,
+      });
+      setValue(stars);
+    } catch (e) {
+      console.error("rating submit failed", e);
+    } finally {
+      setSubmitting(false);
+    }
+  }

   return (
     <div className="eval-panel">
-      <
-
-
-
-
-        {evalScores.slo_passed ? (
-          <span className="slo-badge pass">SLO ✓</span>
-        ) : (
-          <span className="slo-badge fail">SLO ✗</span>
+      <div className="eval-row">
+        {latencyTotal > 0 && (
+          <span className={`slo-badge ${sloPassed ? "pass" : "fail"}`}>
+            {latencyTotal.toFixed(2)}s {sloPassed ? "✓" : "✗"}
+          </span>
         )}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-                  <span>Groundedness</span>
-                  <span className="metric-value">{(evalScores.groundedness * 100).toFixed(0)}%</span>
-                </div>
-                <ScoreBar value={evalScores.groundedness} />
-                <div className="metric-row">
-                  <span>Hallucination Rate</span>
-                  <span className="metric-value">
-                    {(evalScores.hallucination_rate * 100).toFixed(0)}%
-                  </span>
-                </div>
-              </>
-            )}
-          </div>
-
-          <div className="eval-section">
-            <div className="section-title">Communication Efficiency</div>
-            <div className="metric-row">
-              <span>Response Time</span>
-              <span className={`metric-value ${evalScores.slo_passed ? "pass" : "fail"}`}>
-                {evalScores.t_total_s.toFixed(2)}s
-                {evalScores.slo_passed ? " ✓" : " ✗"}
-              </span>
-            </div>
-            <div className="metric-row sub">
-              <span>SLO Target</span>
-              <span className="metric-value">
-                < {evalScores.slo_target_s.toFixed(1)}s (margin: {evalScores.slo_margin_s.toFixed(2)}s)
-              </span>
-            </div>
-          </div>
-
-          <div className="eval-section">
-            <div className="section-title">Multimodal Alignment</div>
-            <div className="metric-row">
-              <span>Overall</span>
-              <span className="metric-value">{(evalScores.multimodal_alignment * 100).toFixed(0)}%</span>
-            </div>
-            <ScoreBar value={evalScores.multimodal_alignment} />
-            <div className="metric-row sub">
-              <span>Affect</span>
-              <span className="metric-value">{(evalScores.affect_alignment * 100).toFixed(0)}%</span>
-            </div>
-            <div className="metric-row sub">
-              <span>Gesture</span>
-              <span className="metric-value">{(evalScores.gesture_alignment * 100).toFixed(0)}%</span>
-            </div>
-            <div className="metric-row sub">
-              <span>Gaze</span>
-              <span className="metric-value">{(evalScores.gaze_alignment * 100).toFixed(0)}%</span>
-            </div>
-          </div>
-
-          <div className="eval-section">
-            <div className="section-title">Perceived Authenticity</div>
-            <div className="metric-row">
-              <span>Rate this response</span>
-            </div>
-            <StarRating value={likert} onChange={setLikert} />
-          </div>
+        <div className="star-rating" title="Rate authenticity (1-5)">
+          {[1, 2, 3, 4, 5].map((star) => (
+            <button
+              key={star}
+              className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
+              onMouseEnter={() => setHover(star)}
+              onMouseLeave={() => setHover(0)}
+              onClick={() => rate(star)}
+              disabled={value !== null || submitting}
+            >
+              ★
+            </button>
+          ))}
+          {value !== null && <span className="star-label">{value}/5</span>}
         </div>
-
+      </div>
     </div>
   );
 }
|
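The new badge logic in `EvalPanel` treats a latency of `0` as "not recorded" (the caller passes `msg.latency?.t_total ?? 0`) and only counts a turn as passing when the total is strictly below the target. A small sketch of that rule, written in Python for script-side checks; the name `slo_badge` is hypothetical, and the `6.0` default simply mirrors the component's `sloTarget` default:

```python
def slo_badge(latency_total, slo_target=6.0):
    """Return the badge text for a turn, or None when no latency was recorded."""
    if latency_total <= 0:
        # Mirrors the component: no badge is rendered without a positive latency.
        return None
    passed = latency_total < slo_target  # strict: hitting the target exactly fails
    return f"{latency_total:.2f}s {'✓' if passed else '✗'}"
```

For example, `slo_badge(2.5)` yields a passing badge, `slo_badge(6.0)` a failing one, and `slo_badge(0)` yields `None`.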
frontend/src/lib/api.ts
CHANGED
|
@@ -44,6 +44,21 @@ export async function resetSession(userId: string): Promise<void> {
   if (!res.ok) throw new Error(`API error: ${res.status}`);
 }

+export async function submitRating(args: {
+  run_id: string;
+  user_id: string;
+  authenticity: number;
+  rater_id?: string;
+  notes?: string;
+}): Promise<void> {
+  const res = await fetch(`${API_BASE}/feedback/rating`, {
+    method: "POST",
+    headers: { "Content-Type": "application/json" },
+    body: JSON.stringify(args),
+  });
+  if (!res.ok) throw new Error(`API error: ${res.status}`);
+}
+
 export async function checkHealth(): Promise<boolean> {
   try {
     const res = await fetch(`${API_BASE}/health`);
|
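For scripted raters or batch feedback, the same `/feedback/rating` request can be built outside the browser. The sketch below is illustrative only: `build_rating_request` is a hypothetical helper, the base URL assumes a local dev server, and the client-side 1-5 check is an assumption taken from the star widget's range (the diff does not show what the server validates).

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumption: local dev server


def build_rating_request(run_id, user_id, authenticity,
                         rater_id=None, notes=None, api_base=API_BASE):
    """Build (but do not send) a POST request mirroring submitRating's payload."""
    # Reject bools explicitly: in Python, True would otherwise pass as the integer 1.
    if isinstance(authenticity, bool) or not isinstance(authenticity, int):
        raise ValueError("authenticity must be an integer from 1 to 5")
    if not 1 <= authenticity <= 5:
        raise ValueError("authenticity must be an integer from 1 to 5")

    payload = {"run_id": run_id, "user_id": user_id, "authenticity": authenticity}
    # Optional fields are left out when unset, as JSON.stringify does for undefined.
    if rater_id is not None:
        payload["rater_id"] = rater_id
    if notes is not None:
        payload["notes"] = notes

    return urllib.request.Request(
        f"{api_base}/feedback/rating",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect or test without a running backend.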
frontend/src/types.ts
CHANGED
|
@@ -52,20 +52,6 @@ export interface LatencyLog {
   t_total: number;
 }

-export interface EvalScores {
-  groundedness: number;
-  hallucination_rate: number;
-  no_evidence: boolean;
-  t_total_s: number;
-  slo_target_s: number;
-  slo_passed: boolean;
-  slo_margin_s: number;
-  multimodal_alignment: number;
-  affect_alignment: number;
-  gesture_alignment: number;
-  gaze_alignment: number;
-}
-
 export interface ChatResponse {
   user_id: string;
   query: string;
@@ -75,7 +61,7 @@ export interface ChatResponse {
   retrieval_mode: string;
   latency: LatencyLog;
   guardrail_passed: boolean;
-
+  run_id: string | null;
   turn_id: number;
 }

@@ -84,7 +70,7 @@ export interface ChatMessage {
   content: string;
   latency?: LatencyLog;
   affect?: string;
-
+  runId?: string | null;
   turnId?: number;
   rephrased?: boolean;
   isTurnaround?: boolean;
|