Commit · 69345ca
1 Parent(s): 1799d5b
relevance + diversity evals, batched scoring, math tooltips
Adds two new eval metrics (relevance via BGE cosine, candidate diversity
via mean pairwise cosine distance) plus per-candidate scoring so we can
ask "did the picker beat candidate 0?" offline. compute_evals batches
the scoring path: one NLI model.predict over all (candidate × sentence
× chunk) pairs and one BGE embed_texts over [query, c1, c2, c3] feeding
both relevance and diversity. EvalPanel surfaces the new pills with
custom CSS hover tooltips that show the actual math from each turn
(e.g. "2/2 sentences had NLI entailment ≥ 0.50") rather than generic
definitions. aggregate.py grows a report_picker section that joins
turns + picks + evals and prints pick rate, regenerate rate, strategy
win rate, and the diversity floor.
- README.md +37 -17
- backend/api/main.py +22 -2
- backend/evals/__init__.py +102 -7
- backend/evals/aggregate.py +121 -1
- backend/evals/diversity.py +23 -0
- backend/evals/faithfulness.py +60 -30
- backend/evals/multimodal_alignment.py +32 -4
- backend/evals/relevance.py +16 -0
- backend/retrieval/vector_store.py +9 -0
- frontend/src/App.css +47 -0
- frontend/src/components/EvalPanel.tsx +218 -48
- frontend/src/types.ts +37 -0
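The commit message describes relevance and diversity as both derived from a single embedding pass over `[query, c1, c2, c3]`. A minimal sketch of that arithmetic in plain Python follows. It is illustrative only: `normalize`, `dot`, and `relevance_and_diversity` are hypothetical names, the toy 2-D vectors stand in for real BGE embeddings, and the explicit normalization step is an assumption (the code in this commit appears to rely on the embedder returning unit-length vectors).

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def relevance_and_diversity(vectors):
    """vectors[0] is the query embedding; vectors[1:] are the candidates.

    Returns (relevances, diversity):
      relevances: per-candidate cosine similarity to the query, clamped at 0
      diversity:  mean pairwise cosine distance across candidates
                  (0.0 when there are fewer than two candidates)
    """
    unit = [normalize(v) for v in vectors]
    query, cands = unit[0], unit[1:]
    relevances = [round(max(0.0, dot(query, c)), 4) for c in cands]
    if len(cands) < 2:
        return relevances, 0.0
    sims = [dot(a, b) for a, b in combinations(cands, 2)]
    diversity = round(1.0 - sum(sims) / len(sims), 4)
    return relevances, diversity


# Toy data: two candidates identical to the query, one orthogonal to it.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
relevances, diversity = relevance_and_diversity(toy)
```

With the toy data, the two identical candidates contribute a similarity of 1 and the other two pairs contribute 0, so the mean pairwise distance is 1 − 1/3. A slate of three paraphrases would push that value toward 0.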
README.md
CHANGED
|
@@ -71,7 +71,15 @@ The per-turn JSONL log includes both `bucket_priors_after` and `type_priors_afte
|
|
| 71 |
|
| 72 |
### Per-turn eval pills in the UI
|
| 73 |
|
| 74 |
-
Every AAC-user bubble renders its eval scores inline
|
| 75 |
|
| 76 |

|
| 77 |
|
|
@@ -401,7 +409,6 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
|
|
| 401 |
- [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) — a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
|
| 402 |
- [x] **[Core]** Air-writing uses a vision LLM (`gemma4:31b-cloud` via Ollama Cloud, configurable through `INK_VISION_MODEL`) instead of the older in-browser DTW template bank. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) — index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200×200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X — incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
|
| 403 |
- [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload — `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` — which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) — only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
|
| 404 |
-
- [ ] Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal.
|
| 405 |
|
| 406 |
### Intent decomposition
|
| 407 |
|
|
@@ -422,7 +429,6 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
|
|
| 422 |
- return a canned response if we blow the budget entirely
|
| 423 |
- threshold is 3.5s, spec says 6s — pick one
|
| 424 |
- [ ] **[Bonus]** Cache encoded user-turn embeddings across the session — `build_context_vector` re-encodes the same recent turns every turn (~50ms steady-state cost)
|
| 425 |
-
- [ ] **[Scale]** past ~100k chunks per user, swap torch matmul for `hnswlib`; consider a cross-encoder reranker (e.g. `bge-reranker-base`) if `rerank_pool_k` grows past ~30
|
| 426 |
|
| 427 |
### Generation
|
| 428 |
|
|
@@ -433,34 +439,48 @@ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend ju
|
|
| 433 |
|
| 434 |
### Evals
|
| 435 |
|
| 436 |
-
Scoring runs
|
| 437 |
|
| 438 |
-
|
| 439 |
-
|--------|--------|-------|
|
| 440 |
-
| Efficiency | per-turn SLO on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
|
| 441 |
-
| Faithfulness | sentence-level NLI, `no_evidence` short-circuit | [faithfulness.py](backend/evals/faithfulness.py) |
|
| 442 |
-
| Multimodal alignment | affect (sentiment lexicon), gesture (opener regex), gaze (bucket overlap) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
|
| 443 |
-
| Authenticity | star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
|
| 444 |
|
| 445 |
-
**
|
| 446 |
|
| 447 |
- [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
|
| 448 |
- [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
|
| 449 |
- [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
|
| 450 |
- [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
|
| 451 |
- [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
|
| 452 |
-
- [
|
| 453 |
-
- [
|
| 454 |
-
- [
|
| 455 |
-
- [
|
| 456 |
-
- [
|
| 457 |
-
- [
|
| 458 |
|
| 459 |
### Cleanup
|
| 460 |
|
| 461 |
- [x] delete `backend/sensing/` (dead code, sensing is in frontend) — done, only `labels.py` remains
|
| 462 |
- [x] per-persona affect overrides (`_PERSONA_TONE_OVERRIDES`) deleted — redundant with `stylistic_preferences` in the new persona JSONs
|
| 463 |
|
| 464 |
---
|
| 465 |
|
| 466 |
## Team
|
|
|
|
| 71 |
|
| 72 |
### Per-turn eval pills in the UI
|
| 73 |
|
| 74 |
+
Every AAC-user bubble renders its eval scores inline. Up to seven pills, depending on what the turn produced:
|
| 75 |
+
|
| 76 |
+
- **SLO latency badge** — `t_total` against the configured target (default 6s). Green ✓ on pass, red ✗ on miss.
|
| 77 |
+
- **`grounded`** — sentence-level NLI faithfulness against retrieved memories. Renders `—` when there's no retrieved evidence to check (e.g. a `PRESENT_STATE` turn that skipped retrieval).
|
| 78 |
+
- **`relevant`** — BGE cosine similarity between query and response embeddings. Catches the "perfectly grounded but off-topic" failure mode that groundedness can't see.
|
| 79 |
+
- **`affect` / `gesture` / `gaze`** — multimodal alignment: sentiment match against the detected affect, opener-pattern match against the detected gesture, fraction of retrieved chunks from the gazed-at bucket.
|
| 80 |
+
- **`diversity`** — mean pairwise cosine distance across the candidate slate (only shown when ≥2 candidates). Low values flag the "aloha problem" — three paraphrases of the same answer.
|
| 81 |
+
|
| 82 |
+
Pills go green / grey / red on 0.75 / 0.4 thresholds. **Hover any pill** for a tooltip with the actual math from this turn (e.g. "2/2 sentences had NLI entailment prob ≥ 0.50", "3/3 retrieved chunks were from the family bucket"), powered by the `explain` block each scorer returns alongside its score. Authenticity stars sit on the right; clicking one fires `POST /feedback/rating` → `logs/ratings.jsonl`. All pill values come from `eval_scores` on the `/chat` response, computed in a `BackgroundTask` after the response returns and persisted to `logs/evals.jsonl`.
|
| 83 |
|
| 84 |

|
| 85 |
|
|
|
|
| 409 |
- [x] **[Core]** Gestures come from MediaPipe's pretrained `GestureRecognizer` rather than hand-rolled landmark geometry. Mapped labels: `THUMBS_UP` / `THUMBS_DOWN` / `POINTING_UP` / `CLOSED_FIST` / `OPEN_PALM` / `VICTORY` / `I_LOVE_YOU` (see `mapGestureLabel` in [sensing.ts](frontend/src/lib/sensing.ts)). Each label carries an `opener_hint` via `GESTURE_DIRECTIVES` in [backend/sensing/labels.py](backend/sensing/labels.py) — a detected thumbs-up overrides the affect opener and tells the LLM to lead with an affirmation.
|
| 410 |
- [x] **[Core]** Air-writing uses a vision LLM (`gemma4:31b-cloud` via Ollama Cloud, configurable through `INK_VISION_MODEL`) instead of the older in-browser DTW template bank. Stroke segmentation lives in `AirWriter` in [sensing.ts](frontend/src/lib/sensing.ts) — index-fingertip velocity gates open/close strokes; finished strokes get rendered to a 200×200 PNG by [inkRecognizer.ts](frontend/src/lib/inkRecognizer.ts) and POSTed to `/ink/recognize` ([backend/api/main.py](backend/api/main.py)), which asks the model to return the traced character or short word. The recognized text accumulates in `sensing.airWrittenText` and flows through the pipeline three ways: (1) retrieval picks the word up as an extra `PERSONAL` sub-intent with a bucket hint (`infer_bucket` in [backend/sensing/bucket_keywords.py](backend/sensing/bucket_keywords.py)), (2) the planner adds an explicit "the user air-wrote X — incorporate verbatim if appropriate" instruction, and (3) the word appears in `logs/turns.jsonl` for debugging. Set `VITE_AIRWRITING_ENABLED=false` to disable stroke capture; if `INK_VISION_API_KEY` is unset the endpoint returns 503 and the frontend silently keeps tracing without recognition.
|
| 411 |
- [x] **[Bonus]** Voice + air-writing conflict resolution. A push-to-talk mic ([frontend/src/hooks/useVoice.ts](frontend/src/hooks/useVoice.ts)) captures a short Web Speech utterance; [frontend/src/lib/resolveIntent.ts](frontend/src/lib/resolveIntent.ts) merges it against the air-written text using Jaccard token overlap + AAC-priority tokens (`help/stop/water/done/more` win ties). The resolver emits a `{text, source, voice_text, air_text}` payload — `source ∈ voice_only | air_only | agree | conflict_air | conflict_voice` — which the backend uses in [backend/pipeline/nodes/intent.py](backend/pipeline/nodes/intent.py) to pick the supplemental sub-intent, and in [backend/pipeline/nodes/planner.py](backend/pipeline/nodes/planner.py) to render source-aware prompt copy (conflicts are acknowledged instead of silently overwritten). The mic is gated by persona via `VOICE_CAPABLE_PERSONAS` in [frontend/src/lib/voiceEligibility.ts](frontend/src/lib/voiceEligibility.ts) — only personas whose modelled access method is verbal (Abed, Allie, Forrest, Gabby, Michael J. Fox, Raymond, Walter Jr.) see the button; non-verbal / locked-in / letterboard personas don't.
|
|
|
|
| 412 |
|
| 413 |
### Intent decomposition
|
| 414 |
|
|
|
|
| 429 |
- return a canned response if we blow the budget entirely
|
| 430 |
- threshold is 3.5s, spec says 6s — pick one
|
| 431 |
- [ ] **[Bonus]** Cache encoded user-turn embeddings across the session — `build_context_vector` re-encodes the same recent turns every turn (~50ms steady-state cost)
|
|
|
|
| 432 |
|
| 433 |
### Generation
|
| 434 |
|
|
|
|
| 439 |
|
| 440 |
### Evals
|
| 441 |
|
| 442 |
+
Scoring runs in a FastAPI `BackgroundTask` after `/chat` returns, so it never blocks the response. The frontend polls `GET /evals/{run_id}` to render pills once they're ready. Each scored turn is appended to `logs/evals.jsonl`, keyed by `run_id`, so it joins back to `logs/turns.jsonl` offline. Likert ratings go to `logs/ratings.jsonl`. Picks go to `logs/picks.jsonl`.
|
| 443 |
+
|
| 444 |
+
| Metric | What it answers | Where |
|
| 445 |
+
|--------|-----------------|-------|
|
| 446 |
+
| Efficiency | SLO pass/fail on `t_total`, aggregate p50/p95/p99 | [efficiency.py](backend/evals/efficiency.py), [aggregate.py](backend/evals/aggregate.py) |
|
| 447 |
+
| Faithfulness (`grounded`) | Did the response stick to retrieved memories, or hallucinate? Sentence-level NLI; `no_evidence` short-circuit when nothing was retrieved | [faithfulness.py](backend/evals/faithfulness.py) |
|
| 448 |
+
| Relevance (`relevant`) | Did the response actually address the partner's query? BGE cosine query↔response | [relevance.py](backend/evals/relevance.py) |
|
| 449 |
+
| Multimodal alignment | `affect` (sentiment lexicon vs target), `gesture` (opener regex vs detected tag), `gaze` (matched/total retrieved chunks vs gazed bucket) | [multimodal_alignment.py](backend/evals/multimodal_alignment.py) |
|
| 450 |
+
| Candidate diversity | Are the picker's candidates actually different, or paraphrases? Mean pairwise cosine distance over the candidate slate | [diversity.py](backend/evals/diversity.py) |
|
| 451 |
+
| Per-candidate breakdown | Each candidate scored for `grounded` + `relevance` (not just the selected one) — answers "did the picker beat candidate 0?" offline | `candidates_eval` block in [evals/__init__.py](backend/evals/__init__.py) |
|
| 452 |
+
| Authenticity | Star rating under every assistant bubble → `POST /feedback/rating` → `logs/ratings.jsonl` | [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx), [api/main.py](backend/api/main.py) |
|
| 453 |
+
|
| 454 |
+
**Performance note.** When the turn produces multiple candidates, scoring is fully batched: a single NLI `model.predict` over all `(candidate × sentence × chunk)` pairs and a single BGE `embed_texts` over `[query, c1, c2, c3]` (the candidate vectors feed both relevance and diversity). The selected candidate's per-candidate score is reused as the top-level pill values rather than re-scored. End result: 1 NLI pass + 1 BGE pass per turn regardless of candidate count.
|
| 455 |
|
| 456 |
+
**First-turn caveat:** the NLI model (`cross-encoder/nli-deberta-v3-small`, ~140MB) is lazy-loaded on the first score after a server restart, so turn 1 pays a one-time ~2-3s warmup. Every turn after that adds ~100-300ms for sentence-level scoring.
|
| 457 |
|
| 458 |
+
**Offline analysis.** `python -m backend.evals.aggregate` joins all four log files and prints per-persona reports: latency p50/p95/p99 by tier, mean groundedness/hallucination, multimodal alignment coverage, picker behaviour (pick rate, regenerate rate, strategy win rate, "did picker beat cand 0?", diversity floor), and authenticity Likert distribution.
|
| 459 |
|
| 460 |
- [x] **[Eval]** Faithfulness — NLI scorer, sentence split, threshold on entailment prob. `no_evidence` flagged when nothing retrieved
|
| 461 |
- [x] **[Eval]** Efficiency — per-turn SLO + aggregate latency (p50/p95/p99) via `aggregate.py`, grouped by `user_id × llm_tier`
|
| 462 |
- [x] **[Eval]** Multimodal alignment — affect scored by positive/negative lexicon overlap vs. target sentiment, gesture by opener-phrase regex (THUMBS_UP/THUMBS_DOWN/WAVING), gaze by fraction of retrieved chunks matching the looked-at bucket
|
| 463 |
- [x] **[Eval]** Authenticity — per-turn stars under each assistant bubble, POST to `/feedback/rating`, logged with `run_id + rater_id`
|
| 464 |
- [ ] **[Eval]** For the live in-class eval: figure out the actual session — who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
|
| 465 |
+
- [x] **[Eval]** Relevance score — BGE cosine similarity between query and response. Originally specced as an NLI call, but a question rarely *entails* its answer (the on-topic and off-topic NLI scores both pinned near 10⁻⁴), so the embedder we already load for retrieval is the right tool. Fills the gap where a perfectly grounded but off-topic reply scored 100% grounded. See [backend/evals/relevance.py](backend/evals/relevance.py).
|
| 466 |
+
- [x] **[Eval]** Candidate diversity — mean pairwise cosine distance among the 3 candidates in a picker round, computed on BGE embeddings (no extra model). Low diversity = picker showing three paraphrases of the same answer (the "aloha" problem), which is a signal that retrieval or temperature needs tuning for that query. See [backend/evals/diversity.py](backend/evals/diversity.py).
|
| 467 |
+
- [x] **[Eval]** Picker-aware metrics — `report_picker` in [backend/evals/aggregate.py](backend/evals/aggregate.py) joins `turns.jsonl` + `picks.jsonl` + `evals.jsonl` and prints: pick rate (% of multi-candidate turns where the user clicked a card), regenerate rate (% of (user, turn_id) pairs that ran the planner more than once), strategy win rate among committed picks, head-to-head "did picker beat candidate 0 on grounded/relevance" using the per-candidate scoring from L453, and diversity coverage (% of turns with mean pairwise cosine distance < 0.10 — the "aloha" floor). Run via `python -m backend.evals.aggregate`.
|
| 468 |
+
- [x] **[Eval]** Score alternate candidates too, not just the selected one. `compute_evals` now scores groundedness + relevance for every candidate and stamps which one was selected; full breakdown lands in `eval_scores.candidates_eval` and `logs/evals.jsonl`, top-level pills still describe the selected response. Unlocks "did the picker actually beat candidate 0?" offline analysis.
|
| 469 |
+
- [x] **[Eval]** UI coverage — `EvalPanel` now also renders a relevance pill (BGE cosine query↔response) and a candidate-diversity pill (mean pairwise cosine distance, hidden when fewer than 2 candidates). Hallucination rate is conveyed inside the grounded tooltip rather than as its own pill (it's `1 − groundedness`, no extra info). SLO margin is in the latency tooltip. See [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx).
|
| 470 |
+
- [x] **[Eval]** Tooltip math — every pill's `title` now shows the actual computation, not just the definition. Each scorer returns its raw inputs in an `explain` block (sentence count + entailment threshold for groundedness, pos/neg word counts + sentiment for affect, matched/total chunks for gaze, gesture pattern match), and `EvalPanel` formats them into specific tooltips like "2/2 sentences had NLI entailment prob ≥ 0.50". See `groundednessTip` / `affectTip` / `gestureTip` / `gazeTip` in [EvalPanel.tsx](frontend/src/components/EvalPanel.tsx) and `explain` in [multimodal_alignment.py](backend/evals/multimodal_alignment.py).
|
| 471 |
|
| 472 |
### Cleanup
|
| 473 |
|
| 474 |
- [x] delete `backend/sensing/` (dead code, sensing is in frontend) — done, only `labels.py` remains
|
| 475 |
- [x] per-persona affect overrides (`_PERSONA_TONE_OVERRIDES`) deleted — redundant with `stylistic_preferences` in the new persona JSONs
|
| 476 |
|
| 477 |
+
### Out of scope
|
| 478 |
+
|
| 479 |
+
Not in the spec — engineering nice-to-haves we'd pick up if the rest is done. Don't block grading on these.
|
| 480 |
+
|
| 481 |
+
- [ ] Thumbs-up currently biases the opener via the prompt. Once generation emits N candidates, move this to candidate reranking for a stronger signal. _(Sensing — untagged in spec)_
|
| 482 |
+
- [ ] **[Scale]** past ~100k chunks per user, swap torch matmul for `hnswlib`; consider a cross-encoder reranker (e.g. `bge-reranker-base`) if `rerank_pool_k` grows past ~30 _(Retrieval — far beyond current corpus size)_
|
| 483 |
+
|
| 484 |
---
|
| 485 |
|
| 486 |
## Team
|
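The README's "did the picker beat candidate 0?" question can be answered offline from the per-candidate breakdown. The sketch below is one hedged way to tally it, not the `report_picker` implementation: `picker_vs_candidate_zero` is a hypothetical helper, it assumes each record exposes `candidates_eval` at the top level (the real log may nest it under `eval_scores`), and it treats turns where candidate 0 itself was selected as skipped rather than as ties. Only the `idx`, `selected`, and `relevance` keys are taken from the diff.

```python
def picker_vs_candidate_zero(eval_records, metric="relevance"):
    """Tally how the selected candidate compares with candidate 0 on one metric."""
    tally = {"win": 0, "tie": 0, "loss": 0, "skipped": 0}
    for rec in eval_records:
        cands = rec.get("candidates_eval") or []
        selected = next((c for c in cands if c.get("selected")), None)
        baseline = next((c for c in cands if c.get("idx") == 0), None)
        # Skip turns with no pick, no candidate 0, or where candidate 0 was picked.
        if selected is None or baseline is None or selected is baseline:
            tally["skipped"] += 1
            continue
        # Skip turns where either side is missing the metric.
        if metric not in selected or metric not in baseline:
            tally["skipped"] += 1
            continue
        if selected[metric] > baseline[metric]:
            tally["win"] += 1
        elif selected[metric] < baseline[metric]:
            tally["loss"] += 1
        else:
            tally["tie"] += 1
    return tally


# Made-up records for illustration; values are not from any real log.
sample_records = [
    {"candidates_eval": [
        {"idx": 0, "selected": False, "relevance": 0.41},
        {"idx": 1, "selected": True, "relevance": 0.63},
    ]},
    {"candidates_eval": [
        {"idx": 0, "selected": True, "relevance": 0.70},
        {"idx": 1, "selected": False, "relevance": 0.55},
    ]},
    {"candidates_eval": [
        {"idx": 0, "selected": False, "relevance": 0.80},
        {"idx": 2, "selected": True, "relevance": 0.52},
    ]},
]
tally = picker_vs_candidate_zero(sample_records)
```

Passing `metric="..."` with whatever key the faithfulness scorer emits would give the same head-to-head for groundedness; that key name is not visible in this part of the diff, so it is left to the caller.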
backend/api/main.py
CHANGED
|
@@ -168,6 +168,10 @@ class RatingRequest(BaseModel):
|
|
| 168 |
# ── Helpers ────────────────────────────────────────────────────────────────────
|
| 169 |
|
| 170 |
|
| 171 |
def _load_persona_profile(user_id: str) -> dict:
|
| 172 |
memories_path = settings.memories_dir / f"{user_id}.json"
|
| 173 |
try:
|
|
@@ -351,6 +355,8 @@ def _compute_and_persist_evals(
|
|
| 351 |
affect: str,
|
| 352 |
gesture_tag: str | None,
|
| 353 |
gaze_bucket: str | None,
|
|
|
|
|
|
|
| 354 |
) -> dict | None:
|
| 355 |
if not settings.evals_enabled or not run_id:
|
| 356 |
return None
|
|
@@ -363,6 +369,8 @@ def _compute_and_persist_evals(
|
|
| 363 |
gesture_tag=gesture_tag,
|
| 364 |
gaze_bucket=gaze_bucket,
|
| 365 |
slo_target=settings.slo_target_s,
|
|
|
|
|
|
|
| 366 |
)
|
| 367 |
except Exception:
|
| 368 |
_log.exception("evals scoring failed for run %s", run_id)
|
|
@@ -434,6 +442,8 @@ def chat(req: ChatRequest, background_tasks: BackgroundTasks):
|
|
| 434 |
affect=affect_emotion,
|
| 435 |
gesture_tag=req.gesture_tag,
|
| 436 |
gaze_bucket=req.gaze_bucket,
|
|
|
|
|
|
|
| 437 |
)
|
| 438 |
|
| 439 |
return ChatResponse(
|
|
@@ -512,6 +522,7 @@ def chat_stream(req: ChatRequest):
|
|
| 512 |
run_id = state.get("run_id")
|
| 513 |
|
| 514 |
# Evals run off the response path; UI polls GET /evals/{run_id}.
|
|
|
|
| 515 |
if run_id and settings.evals_enabled:
|
| 516 |
_reserve_eval_slot(run_id)
|
| 517 |
threading.Thread(
|
|
@@ -526,6 +537,8 @@ def chat_stream(req: ChatRequest):
|
|
| 526 |
affect=affect_emotion,
|
| 527 |
gesture_tag=req.gesture_tag,
|
| 528 |
gaze_bucket=req.gaze_bucket,
|
|
|
|
|
|
|
| 529 |
),
|
| 530 |
daemon=True,
|
| 531 |
).start()
|
|
@@ -534,7 +547,7 @@ def chat_stream(req: ChatRequest):
|
|
| 534 |
"user_id": req.user_id,
|
| 535 |
"query": req.query,
|
| 536 |
"response": state["selected_response"] or "",
|
| 537 |
-
"candidates":
|
| 538 |
"affect": affect_emotion,
|
| 539 |
"llm_tier": state.get("llm_tier_used", "unknown"),
|
| 540 |
"llm_model": state.get("llm_model_used", "unknown"),
|
|
@@ -660,6 +673,8 @@ def chat_turnaround(req: TurnaroundRequest, background_tasks: BackgroundTasks):
|
|
| 660 |
affect=affect_emotion,
|
| 661 |
gesture_tag=replan_state.get("gesture_tag"),
|
| 662 |
gaze_bucket=replan_state.get("gaze_bucket"),
|
|
|
|
|
|
|
| 663 |
)
|
| 664 |
|
| 665 |
return ChatResponse(
|
|
@@ -827,6 +842,7 @@ def chat_regenerate_stream(req: RegenerateRequest):
|
|
| 827 |
|
| 828 |
affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
|
| 829 |
run_id = replan_state.get("run_id")
|
|
|
|
| 830 |
eval_scores = _compute_and_persist_evals(
|
| 831 |
run_id=run_id,
|
| 832 |
user_id=req.user_id,
|
|
@@ -837,12 +853,14 @@ def chat_regenerate_stream(req: RegenerateRequest):
|
|
| 837 |
affect=affect_emotion,
|
| 838 |
gesture_tag=replan_state.get("gesture_tag"),
|
| 839 |
gaze_bucket=replan_state.get("gaze_bucket"),
|
|
|
|
|
|
|
| 840 |
)
|
| 841 |
final = {
|
| 842 |
"user_id": req.user_id,
|
| 843 |
"query": replan_state["raw_query"],
|
| 844 |
"response": replan_state["selected_response"] or "",
|
| 845 |
-
"candidates":
|
| 846 |
"affect": affect_emotion,
|
| 847 |
"llm_tier": replan_state.get("llm_tier_used", "unknown"),
|
| 848 |
"llm_model": replan_state.get("llm_model_used", "unknown"),
|
|
@@ -942,6 +960,8 @@ def chat_regenerate(req: RegenerateRequest):
|
|
| 942 |
affect=affect_emotion,
|
| 943 |
gesture_tag=replan_state.get("gesture_tag"),
|
| 944 |
gaze_bucket=replan_state.get("gaze_bucket"),
|
|
|
|
|
|
|
| 945 |
)
|
| 946 |
|
| 947 |
return ChatResponse(
|
|
|
|
| 168 |
# ── Helpers ────────────────────────────────────────────────────────────────────
|
| 169 |
|
| 170 |
|
| 171 |
+
def _candidate_dicts(state) -> list[dict]:
|
| 172 |
+
return [dict(c) for c in (state.get("candidates") or [])]
|
| 173 |
+
|
| 174 |
+
|
| 175 |
def _load_persona_profile(user_id: str) -> dict:
|
| 176 |
memories_path = settings.memories_dir / f"{user_id}.json"
|
| 177 |
try:
|
|
|
|
| 355 |
affect: str,
|
| 356 |
gesture_tag: str | None,
|
| 357 |
gaze_bucket: str | None,
|
| 358 |
+
query: str = "",
|
| 359 |
+
candidates: list[dict] | None = None,
|
| 360 |
) -> dict | None:
|
| 361 |
if not settings.evals_enabled or not run_id:
|
| 362 |
return None
|
|
|
|
| 369 |
gesture_tag=gesture_tag,
|
| 370 |
gaze_bucket=gaze_bucket,
|
| 371 |
slo_target=settings.slo_target_s,
|
| 372 |
+
query=query,
|
| 373 |
+
candidates=candidates,
|
| 374 |
)
|
| 375 |
except Exception:
|
| 376 |
_log.exception("evals scoring failed for run %s", run_id)
|
|
|
|
| 442 |
affect=affect_emotion,
|
| 443 |
gesture_tag=req.gesture_tag,
|
| 444 |
gaze_bucket=req.gaze_bucket,
|
| 445 |
+
query=req.query,
|
| 446 |
+
candidates=_candidate_dicts(result),
|
| 447 |
)
|
| 448 |
|
| 449 |
return ChatResponse(
|
|
|
|
| 522 |
run_id = state.get("run_id")
|
| 523 |
|
| 524 |
# Evals run off the response path; UI polls GET /evals/{run_id}.
|
| 525 |
+
cand_dicts = _candidate_dicts(state)
|
| 526 |
if run_id and settings.evals_enabled:
|
| 527 |
_reserve_eval_slot(run_id)
|
| 528 |
threading.Thread(
|
|
|
|
| 537 |
affect=affect_emotion,
|
| 538 |
gesture_tag=req.gesture_tag,
|
| 539 |
gaze_bucket=req.gaze_bucket,
|
| 540 |
+
query=req.query,
|
| 541 |
+
candidates=cand_dicts,
|
| 542 |
),
|
| 543 |
daemon=True,
|
| 544 |
).start()
|
|
|
|
| 547 |
"user_id": req.user_id,
|
| 548 |
"query": req.query,
|
| 549 |
"response": state["selected_response"] or "",
|
| 550 |
+
"candidates": cand_dicts,
|
| 551 |
"affect": affect_emotion,
|
| 552 |
"llm_tier": state.get("llm_tier_used", "unknown"),
|
| 553 |
"llm_model": state.get("llm_model_used", "unknown"),
|
|
|
|
| 673 |
affect=affect_emotion,
|
| 674 |
gesture_tag=replan_state.get("gesture_tag"),
|
| 675 |
gaze_bucket=replan_state.get("gaze_bucket"),
|
| 676 |
+
query=replan_state.get("raw_query") or "",
|
| 677 |
+
candidates=_candidate_dicts(replan_state),
|
| 678 |
)
|
| 679 |
|
| 680 |
return ChatResponse(
|
|
|
|
| 842 |
|
| 843 |
affect_emotion = (replan_state.get("affect") or {}).get("emotion", "NEUTRAL")
|
| 844 |
run_id = replan_state.get("run_id")
|
| 845 |
+
cand_dicts = _candidate_dicts(replan_state)
|
| 846 |
eval_scores = _compute_and_persist_evals(
|
| 847 |
run_id=run_id,
|
| 848 |
user_id=req.user_id,
|
|
|
|
| 853 |
affect=affect_emotion,
|
| 854 |
gesture_tag=replan_state.get("gesture_tag"),
|
| 855 |
gaze_bucket=replan_state.get("gaze_bucket"),
|
| 856 |
+
query=replan_state.get("raw_query") or "",
|
| 857 |
+
candidates=cand_dicts,
|
| 858 |
)
|
| 859 |
final = {
|
| 860 |
"user_id": req.user_id,
|
| 861 |
"query": replan_state["raw_query"],
|
| 862 |
"response": replan_state["selected_response"] or "",
|
| 863 |
+
"candidates": cand_dicts,
|
| 864 |
"affect": affect_emotion,
|
| 865 |
"llm_tier": replan_state.get("llm_tier_used", "unknown"),
|
| 866 |
"llm_model": replan_state.get("llm_model_used", "unknown"),
|
|
|
|
| 960 |
affect=affect_emotion,
|
| 961 |
gesture_tag=replan_state.get("gesture_tag"),
|
| 962 |
gaze_bucket=replan_state.get("gaze_bucket"),
|
| 963 |
+
query=replan_state.get("raw_query") or "",
|
| 964 |
+
candidates=_candidate_dicts(replan_state),
|
| 965 |
)
|
| 966 |
|
| 967 |
return ChatResponse(
|
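The batched scoring path this commit describes (one NLI call over all candidate × sentence × chunk pairs, then slicing per candidate) can be sketched as below. This is an illustration under stated assumptions, not the contents of `compute_faithfulness_batch`: `score_candidates_batched` and `toy_predict` are hypothetical, the per-sentence score is taken as the maximum over chunks (an assumption), and a substring check stands in for the real NLI model so the example runs without any model download.

```python
def score_candidates_batched(candidate_sentences, chunks, predict):
    """Score every (chunk, sentence) pair with a single call to `predict`.

    candidate_sentences: one list of sentences per candidate
    chunks:              retrieved evidence texts
    predict:             callable mapping a list of (premise, hypothesis)
                         pairs to one score per pair (stand-in for an NLI model)
    Returns, per candidate, the best score each sentence achieved on any chunk.
    """
    pairs = []
    spans = []  # (candidate index, sentence index, start, end) into `pairs`
    for ci, sentences in enumerate(candidate_sentences):
        for si, sentence in enumerate(sentences):
            start = len(pairs)
            pairs.extend((chunk, sentence) for chunk in chunks)
            spans.append((ci, si, start, len(pairs)))

    scores = predict(pairs) if pairs else []  # the single batched pass

    best = [[0.0] * len(sentences) for sentences in candidate_sentences]
    for ci, si, start, end in spans:
        if end > start:
            best[ci][si] = max(scores[start:end])
    return best


calls = []  # records the batch size of every predict call


def toy_predict(pairs):
    """Hypothetical scorer: 1.0 if the hypothesis text appears in the premise."""
    calls.append(len(pairs))
    return [1.0 if hyp.lower() in prem.lower() else 0.0 for prem, hyp in pairs]


chunks = ["She loves hiking in Maui.", "Her dog is named Koa."]
candidate_sentences = [["loves hiking", "named Koa"], ["plays chess"]]
best = score_candidates_batched(candidate_sentences, chunks, toy_predict)

# Fraction of sentences at or above the 0.50 threshold quoted in the README tooltip.
grounded = [sum(s >= 0.5 for s in row) / len(row) if row else 0.0 for row in best]
```

The point of the `spans` bookkeeping is that the model is invoked once regardless of candidate count; `calls` ends up holding a single entry for the whole slate.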
backend/evals/__init__.py
CHANGED
|
@@ -1,7 +1,51 @@
|
|
| 1 |
# Evaluation metrics — compute after pipeline returns, before API response.
|
| 2 |
from backend.evals.efficiency import compute_efficiency
|
| 3 |
-
from backend.evals.faithfulness import compute_faithfulness
|
| 4 |
from backend.evals.multimodal_alignment import compute_multimodal_alignment
|
| 5 |
|
| 6 |
|
| 7 |
def compute_evals(
|
|
@@ -12,18 +56,54 @@ def compute_evals(
|
|
| 12 |
gesture_tag: str | None,
|
| 13 |
gaze_bucket: str | None,
|
| 14 |
slo_target: float = 6.0,
|
| 15 |
) -> dict:
|
| 16 |
-
"""Run all eval scorers and return a unified EvalScores dict.
|
| 17 |
-
|
| 18 |
eff = compute_efficiency(latency_log, slo_target)
|
| 19 |
align = compute_multimodal_alignment(
|
| 20 |
response, affect, gesture_tag, gaze_bucket, chunks
|
| 21 |
)
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
"t_total_s": eff["t_total"],
|
| 28 |
"slo_target_s": eff["slo_target"],
|
| 29 |
"slo_passed": eff["slo_passed"],
|
|
@@ -32,4 +112,19 @@ def compute_evals(
|
|
| 32 |
"affect_alignment": align["affect_alignment"],
|
| 33 |
"gesture_alignment": align["gesture_alignment"],
|
| 34 |
"gaze_alignment": align["gaze_alignment"],
|
|
|
|
| 35 |
}
|
|
|
|
| 1 |
# Evaluation metrics — compute after pipeline returns, before API response.
|
| 2 |
+
import torch
|
| 3 |
+
|
| 4 |
+
from backend.evals.diversity import compute_candidate_diversity
|
| 5 |
from backend.evals.efficiency import compute_efficiency
|
| 6 |
+
from backend.evals.faithfulness import compute_faithfulness, compute_faithfulness_batch
|
| 7 |
from backend.evals.multimodal_alignment import compute_multimodal_alignment
|
| 8 |
+
from backend.evals.relevance import compute_relevance
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
def _score_candidates_batched(
|
| 12 |
+
candidates: list[dict],
|
| 13 |
+
chunks: list[dict],
|
| 14 |
+
query: str,
|
| 15 |
+
) -> tuple[list[dict], "torch.Tensor | None"]:
|
| 16 |
+
"""One BGE pass + one NLI pass across all candidates. Returns per-candidate
|
| 17 |
+
score dicts and the candidate vector matrix (for diversity reuse), or None
|
| 18 |
+
when no embedding pass was needed."""
|
| 19 |
+
texts = [c.get("text", "") for c in candidates]
|
| 20 |
+
faiths = compute_faithfulness_batch(texts, chunks)
|
| 21 |
+
|
| 22 |
+
cand_vecs: torch.Tensor | None = None
|
| 23 |
+
if query.strip() and any(t.strip() for t in texts):
|
| 24 |
+
from backend.retrieval.vector_store import embed_texts
|
| 25 |
+
|
| 26 |
+
vecs = embed_texts([query] + texts)
|
| 27 |
+
q_vec = vecs[0]
|
| 28 |
+
cand_vecs = vecs[1:]
|
| 29 |
+
relevances = [
|
| 30 |
+
round(max(0.0, float(q_vec @ cand_vecs[i])), 4) for i in range(len(texts))
|
| 31 |
+
]
|
| 32 |
+
else:
|
| 33 |
+
relevances = [0.0] * len(texts)
|
| 34 |
+
|
| 35 |
+
scores = [{**faiths[i], "relevance": relevances[i]} for i in range(len(candidates))]
|
| 36 |
+
return scores, cand_vecs
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def _diversity_from_vecs(cand_vecs: "torch.Tensor") -> dict:
|
| 40 |
+
n = cand_vecs.shape[0]
|
| 41 |
+
if n < 2:
|
| 42 |
+
return {"candidate_diversity": 0.0, "n_candidates": n}
|
| 43 |
+
sims = cand_vecs @ cand_vecs.T
|
| 44 |
+
iu = torch.triu_indices(n, n, offset=1)
|
| 45 |
+
return {
|
| 46 |
+
"candidate_diversity": round(float(1.0 - sims[iu[0], iu[1]].mean().item()), 4),
|
| 47 |
+
"n_candidates": n,
|
| 48 |
+
}
|
| 49 |
|
| 50 |
|
| 51 |
def compute_evals(
|
|
|
|
| 56 |
gesture_tag: str | None,
|
| 57 |
gaze_bucket: str | None,
|
| 58 |
slo_target: float = 6.0,
|
| 59 |
+
query: str = "",
|
| 60 |
+
candidates: list[dict] | None = None,
|
| 61 |
+
selected_idx: int | None = None,
|
| 62 |
) -> dict:
|
| 63 |
+
"""Run all eval scorers and return a unified EvalScores dict.
|
| 64 |
+
|
| 65 |
+
When candidates are provided, scoring is batched: one BGE encode for
|
| 66 |
+
query + all candidates, one NLI predict across all (sentence, chunk)
|
| 67 |
+
pairs, then sliced per candidate. The selected candidate's scores are
|
| 68 |
+
reused as the top-level fields so the existing UI pills keep working.
|
| 69 |
+
"""
|
| 70 |
eff = compute_efficiency(latency_log, slo_target)
|
| 71 |
align = compute_multimodal_alignment(
|
| 72 |
response, affect, gesture_tag, gaze_bucket, chunks
|
| 73 |
)
|
| 74 |
|
| 75 |
+
per_cand: list[dict] = []
|
| 76 |
+
cand_vecs = None
|
| 77 |
+
if candidates:
|
| 78 |
+
if selected_idx is None and response:
|
| 79 |
+
for i, c in enumerate(candidates):
|
| 80 |
+
if c.get("text", "").strip() == response.strip():
|
| 81 |
+
selected_idx = i
|
| 82 |
+
break
|
| 83 |
+
scored, cand_vecs = _score_candidates_batched(candidates, chunks, query)
|
| 84 |
+
per_cand = [
|
| 85 |
+
{
|
| 86 |
+
"idx": i,
|
| 87 |
+
"strategy": c.get("strategy", "unknown"),
|
| 88 |
+
"selected": (selected_idx is not None and i == selected_idx),
|
| 89 |
+
**scored[i],
|
| 90 |
+
}
|
| 91 |
+
for i, c in enumerate(candidates)
|
| 92 |
+
]
|
| 93 |
+
|
| 94 |
+
if per_cand and selected_idx is not None and 0 <= selected_idx < len(per_cand):
|
| 95 |
+
# Strip per-candidate-only keys before reusing as top-level scores.
|
| 96 |
+
top = {
|
| 97 |
+
k: v
|
| 98 |
+
for k, v in per_cand[selected_idx].items()
|
| 99 |
+
if k not in ("idx", "strategy", "selected")
|
| 100 |
+
}
|
| 101 |
+
else:
|
| 102 |
+
faith = compute_faithfulness(response, chunks)
|
| 103 |
+
top = {**faith, "relevance": compute_relevance(response, query)["relevance"]}
|
| 104 |
+
|
| 105 |
+
out = {
|
| 106 |
+
**top,
|
| 107 |
"t_total_s": eff["t_total"],
|
| 108 |
"slo_target_s": eff["slo_target"],
|
| 109 |
"slo_passed": eff["slo_passed"],
|
|
|
|
| 112 |
"affect_alignment": align["affect_alignment"],
|
| 113 |
"gesture_alignment": align["gesture_alignment"],
|
| 114 |
"gaze_alignment": align["gaze_alignment"],
|
| 115 |
+
"explain": align.get("explain", {}),
|
| 116 |
}
|
| 117 |
+
|
| 118 |
+
if per_cand:
|
| 119 |
+
out["candidates_eval"] = per_cand
|
| 120 |
+
# Reuse cand_vecs from the relevance pass when available; falls back to
|
| 121 |
+
# standalone BGE encode (e.g. when query was empty).
|
| 122 |
+
if cand_vecs is not None:
|
| 123 |
+
out.update(_diversity_from_vecs(cand_vecs))
|
| 124 |
+
else:
|
| 125 |
+
out.update(compute_candidate_diversity(candidates))
|
| 126 |
+
else:
|
| 127 |
+
out["candidate_diversity"] = 0.0
|
| 128 |
+
out["n_candidates"] = 1 if response else 0
|
| 129 |
+
|
| 130 |
+
return out
|
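The `compute_evals` change above reuses the selected candidate's scores as the top-level fields. A minimal standalone sketch of that step follows, assuming the score dicts look like the ones in the diff; the helper names `resolve_selected` and `top_level_scores` are hypothetical and do not appear in the commit.

```python
# Hypothetical helpers illustrating the "reuse the selected candidate's scores"
# step; names and return conventions are assumptions, not the commit's API.
PER_CANDIDATE_ONLY = ("idx", "strategy", "selected")


def resolve_selected(candidates, response, selected_idx=None):
    """Return selected_idx if given, else the first candidate whose text matches."""
    if selected_idx is not None:
        return selected_idx
    target = response.strip()
    if not target:
        return None
    for i, cand in enumerate(candidates):
        if cand.get("text", "").strip() == target:
            return i
    return None


def top_level_scores(per_cand, selected_idx):
    """Copy the selected candidate's scores without the bookkeeping keys.

    Returns None when the index is missing or out of range, in which case the
    caller would score the response directly instead.
    """
    if selected_idx is None or not 0 <= selected_idx < len(per_cand):
        return None
    return {
        k: v
        for k, v in per_cand[selected_idx].items()
        if k not in PER_CANDIDATE_ONLY
    }
```

Matching on stripped text means a response that was edited after picking will not resolve to any candidate, so the direct-scoring fallback still matters.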
backend/evals/aggregate.py
CHANGED
|
@@ -149,6 +149,121 @@ def report_authenticity(ratings: list[dict]) -> None:
|
|
| 149 |
print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
|
| 150 |
|
| 151 |
|
| 152 |
def main() -> None:
|
| 153 |
parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
|
| 154 |
parser.add_argument("--logs", type=Path, default=settings.logs_dir)
|
|
@@ -157,11 +272,16 @@ def main() -> None:
|
|
| 157 |
turns = _load(args.logs / "turns.jsonl")
|
| 158 |
evals = _load(args.logs / "evals.jsonl")
|
| 159 |
ratings = _load(args.logs / "ratings.jsonl")
|
|
|
|
| 160 |
|
| 161 |
-
print(
|
|
|
|
|
|
|
|
|
|
| 162 |
report_latency(turns)
|
| 163 |
report_faithfulness(evals)
|
| 164 |
report_multimodal(evals)
|
|
|
|
| 165 |
report_authenticity(ratings)
|
| 166 |
|
| 167 |
|
|
|
|
| 149 |
print(f"{uid:<18} {len(scores):>5} {mean:>6.2f} {dist_str:>22}")
|
| 150 |
|
| 151 |
|
| 152 |
+
def report_picker(turns: list[dict], picks: list[dict], evals: list[dict]) -> None:
|
| 153 |
+
"""Picker behaviour: pick rate, regenerate rate, strategy win rate, and
|
| 154 |
+
whether the user's pick beat candidate 0 on grounded/relevance.
|
| 155 |
+
|
| 156 |
+
Sources:
|
| 157 |
+
- turns.jsonl one row per turn, includes `candidates` and `n_candidates`
|
| 158 |
+
- picks.jsonl one row per /chat/pick — strategy, picked_idx, run_id
|
| 159 |
+
- evals.jsonl candidates_eval[] with per-candidate grounded + relevance
|
| 160 |
+
"""
|
| 161 |
+
print("\n=== Picker Behaviour ===")
|
| 162 |
+
multi = [t for t in turns if (t.get("n_candidates") or 0) >= 2]
|
| 163 |
+
if not multi:
|
| 164 |
+
print(
|
| 165 |
+
"(no multi-candidate turns logged — older format or single-candidate runs)"
|
| 166 |
+
)
|
| 167 |
+
return
|
| 168 |
+
|
| 169 |
+
picks_by_run = {p["run_id"]: p for p in picks if p.get("run_id")}
|
| 170 |
+
evals_by_run = {e["run_id"]: e for e in evals if e.get("run_id")}
|
| 171 |
+
|
| 172 |
+
n_multi = len(multi)
|
| 173 |
+
n_picked = sum(1 for t in multi if t["run_id"] in picks_by_run)
|
| 174 |
+
# A (user_id, turn_id) seen more than once means the planner re-ran for
|
| 175 |
+
# the same partner query — that's a regenerate. The denominator is the
|
| 176 |
+
# number of distinct (user, turn) conversations that had at least one
|
| 177 |
+
# multi-candidate run, not the raw row count.
|
| 178 |
+
seen: dict[tuple[str, int], int] = defaultdict(int)
|
| 179 |
+
for t in multi:
|
| 180 |
+
seen[(t.get("user_id", "?"), t.get("turn_id", -1))] += 1
|
| 181 |
+
n_regenerated_turns = sum(1 for c in seen.values() if c > 1)
|
| 182 |
+
n_distinct_turns = max(1, len(seen))
|
| 183 |
+
print(
|
| 184 |
+
f"multi-candidate turns: {n_multi} ({n_distinct_turns} distinct) "
|
| 185 |
+
f"pick rate: {n_picked / n_multi:.0%} "
|
| 186 |
+
f"regenerate rate: {n_regenerated_turns / n_distinct_turns:.0%} "
|
| 187 |
+
f"(% of distinct turns that re-ran)"
|
| 188 |
+
)
|
| 189 |
+
|
| 190 |
+
# Strategy win rate — among multi-candidate picks only, how often does
|
| 191 |
+
# each strategy win. Picks on single-candidate turns aren't a real "win"
|
| 192 |
+
# (no alternative to lose to) so we filter them out.
|
| 193 |
+
multi_run_ids = {t["run_id"] for t in multi}
|
| 194 |
+
strategy_count: dict[str, int] = defaultdict(int)
|
| 195 |
+
for run_id, p in picks_by_run.items():
|
| 196 |
+
if run_id in multi_run_ids:
|
| 197 |
+
strategy_count[p.get("strategy", "unknown")] += 1
|
| 198 |
+
if strategy_count:
|
| 199 |
+
total = sum(strategy_count.values())
|
| 200 |
+
print(f"\nStrategy win rate (n={total} picks):")
|
| 201 |
+
print(f" {'strategy':<16} {'picks':>6} {'pct':>6}")
|
| 202 |
+
for s, n in sorted(strategy_count.items(), key=lambda x: -x[1]):
|
| 203 |
+
print(f" {s:<16} {n:>6} {n / total:>5.0%}")
|
| 204 |
+
|
| 205 |
+
# Did the picker beat candidate 0? Only meaningful when we have per-candidate
|
| 206 |
+
# eval scores AND the user picked a non-zero index. A "win" = picked
|
| 207 |
+
# candidate scored strictly higher on the metric than candidate 0.
|
| 208 |
+
head_to_head = []
|
| 209 |
+
for run_id, pick in picks_by_run.items():
|
| 210 |
+
ev = evals_by_run.get(run_id)
|
| 211 |
+
if not ev or not ev.get("candidates_eval"):
|
| 212 |
+
continue
|
| 213 |
+
cands = ev["candidates_eval"]
|
| 214 |
+
if len(cands) < 2:
|
| 215 |
+
continue
|
| 216 |
+
picked_idx = pick.get("picked_idx", 0)
|
| 217 |
+
if picked_idx == 0 or picked_idx >= len(cands):
|
| 218 |
+
continue
|
| 219 |
+
head_to_head.append(
|
| 220 |
+
{
|
| 221 |
+
"picked_grounded": cands[picked_idx]["groundedness"],
|
| 222 |
+
"cand0_grounded": cands[0]["groundedness"],
|
| 223 |
+
"picked_relevance": cands[picked_idx].get("relevance", 0.0),
|
| 224 |
+
"cand0_relevance": cands[0].get("relevance", 0.0),
|
| 225 |
+
}
|
| 226 |
+
)
|
| 227 |
+
|
| 228 |
+
if head_to_head:
|
| 229 |
+
n = len(head_to_head)
|
| 230 |
+
beat_grounded = sum(
|
| 231 |
+
1 for h in head_to_head if h["picked_grounded"] > h["cand0_grounded"]
|
| 232 |
+
)
|
| 233 |
+
tied_grounded = sum(
|
| 234 |
+
1 for h in head_to_head if h["picked_grounded"] == h["cand0_grounded"]
|
| 235 |
+
)
|
| 236 |
+
beat_rel = sum(
|
| 237 |
+
1 for h in head_to_head if h["picked_relevance"] > h["cand0_relevance"]
|
| 238 |
+
)
|
| 239 |
+
print(f"\nDid picker beat candidate 0? (n={n} picks where picked_idx > 0)")
|
| 240 |
+
print(
|
| 241 |
+
f" groundedness: picker > cand0 = {beat_grounded}/{n} ({beat_grounded / n:.0%}), "
|
| 242 |
+
f"tied = {tied_grounded}/{n}"
|
| 243 |
+
)
|
| 244 |
+
print(f" relevance: picker > cand0 = {beat_rel}/{n} ({beat_rel / n:.0%})")
|
| 245 |
+
else:
|
| 246 |
+
print(
|
| 247 |
+
"\n(no picks of candidate 1+ with per-candidate eval data — can't measure picker quality yet)"
|
| 248 |
+
)
|
| 249 |
+
|
| 250 |
+
# Diversity: among multi-candidate turns with eval data, how often is the
|
| 251 |
+
# picker showing near-paraphrases (the "aloha" problem)?
|
| 252 |
+
div_scored = [
|
| 253 |
+
ev
|
| 254 |
+
for ev in evals_by_run.values()
|
| 255 |
+
if ev.get("n_candidates", 0) >= 2 and "candidate_diversity" in ev
|
| 256 |
+
]
|
| 257 |
+
if div_scored:
|
| 258 |
+
diversities = [float(e["candidate_diversity"]) for e in div_scored]
|
| 259 |
+
low = sum(1 for d in diversities if d < 0.1)
|
| 260 |
+
print(
|
| 261 |
+
f"\nCandidate diversity (n={len(div_scored)} turns): "
|
| 262 |
+
f"mean={statistics.mean(diversities):.2f} "
|
| 263 |
+
f"low (<0.10): {low}/{len(div_scored)} ({low / len(div_scored):.0%})"
|
| 264 |
+
)
|
| 265 |
+
|
| 266 |
+
|
| 267 |
def main() -> None:
|
| 268 |
parser = argparse.ArgumentParser(description="Aggregate AAC eval metrics")
|
| 269 |
parser.add_argument("--logs", type=Path, default=settings.logs_dir)
|
|
|
|
| 272 |
turns = _load(args.logs / "turns.jsonl")
|
| 273 |
evals = _load(args.logs / "evals.jsonl")
|
| 274 |
ratings = _load(args.logs / "ratings.jsonl")
|
| 275 |
+
picks = _load(args.logs / "picks.jsonl")
|
| 276 |
|
| 277 |
+
print(
|
| 278 |
+
f"Loaded: {len(turns)} turns, {len(evals)} evals, "
|
| 279 |
+
f"{len(picks)} picks, {len(ratings)} ratings"
|
| 280 |
+
)
|
| 281 |
report_latency(turns)
|
| 282 |
report_faithfulness(evals)
|
| 283 |
report_multimodal(evals)
|
| 284 |
+
report_picker(turns, picks, evals)
|
| 285 |
report_authenticity(ratings)
|
| 286 |
|
| 287 |
|
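The regenerate rate in `report_picker` counts distinct `(user_id, turn_id)` pairs rather than raw rows. The sketch below isolates that calculation under the assumption that turn rows carry `user_id`, `turn_id` and `n_candidates` as in the diff; the function name `regenerate_rate` is hypothetical, and unlike the original it returns 0.0 on empty input instead of clamping the denominator to 1.

```python
from collections import defaultdict


def regenerate_rate(turns):
    """Fraction of distinct (user, turn) pairs that ran more than once.

    Only multi-candidate rows (n_candidates >= 2) are counted, mirroring the
    filter used by the picker report.
    """
    seen = defaultdict(int)
    for t in turns:
        if (t.get("n_candidates") or 0) >= 2:
            seen[(t.get("user_id", "?"), t.get("turn_id", -1))] += 1
    if not seen:
        return 0.0
    regenerated = sum(1 for count in seen.values() if count > 1)
    return regenerated / len(seen)
```

Dividing by distinct turns keeps a single turn that was regenerated many times from inflating the rate.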
backend/evals/diversity.py
ADDED
|
@@ -0,0 +1,23 @@
|
| 1 |
+
import torch
|
| 2 |
+
|
| 3 |
+
|
| 4 |
+
def compute_candidate_diversity(candidates: list[dict]) -> dict:
|
| 5 |
+
"""Mean pairwise cosine *distance* among candidate texts.
|
| 6 |
+
|
| 7 |
+
1.0 = maximally different, 0.0 = identical paraphrases.
|
| 8 |
+
"""
|
| 9 |
+
texts = [c.get("text", "").strip() for c in candidates]
|
| 10 |
+
texts = [t for t in texts if t]
|
| 11 |
+
n = len(texts)
|
| 12 |
+
if n < 2:
|
| 13 |
+
return {"candidate_diversity": 0.0, "n_candidates": n}
|
| 14 |
+
|
| 15 |
+
from backend.retrieval.vector_store import embed_texts
|
| 16 |
+
|
| 17 |
+
vecs = embed_texts(texts)
|
| 18 |
+
sims = vecs @ vecs.T
|
| 19 |
+
iu = torch.triu_indices(n, n, offset=1)
|
| 20 |
+
return {
|
| 21 |
+
"candidate_diversity": round(float(1.0 - sims[iu[0], iu[1]].mean().item()), 4),
|
| 22 |
+
"n_candidates": n,
|
| 23 |
+
}
|
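To make the diversity metric concrete, here is a dependency-free sketch of mean pairwise cosine distance. It uses plain Python lists instead of the torch tensors in the diff, and the function names are hypothetical; it is meant only to show the arithmetic, not to replace `compute_candidate_diversity`.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def mean_pairwise_cosine_distance(vectors):
    """1 minus the mean cosine similarity over all unordered candidate pairs."""
    unit = [normalize(v) for v in vectors]
    if len(unit) < 2:
        return 0.0  # a single candidate has nothing to differ from
    sims = [
        sum(a * b for a, b in zip(u, w))
        for u, w in combinations(unit, 2)
    ]
    return round(1.0 - sum(sims) / len(sims), 4)
```

Iterating over `combinations(unit, 2)` visits the same pairs that the upper-triangle indices select in the tensor version.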
backend/evals/faithfulness.py
CHANGED
|
@@ -49,46 +49,76 @@ def _split_sentences(text: str) -> list[str]:
|
|
| 49 |
return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
|
| 50 |
|
| 51 |
|
|
| 52 |
def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
|
| 53 |
"""Sentence-level NLI: each sentence must be entailed by at least one chunk."""
|
| 54 |
-
|
| 55 |
-
return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
|
| 56 |
|
| 57 |
-
sentences = _split_sentences(response)
|
| 58 |
-
# Too short to score meaningfully (one-word replies, fragments). Flagging as
|
| 59 |
-
# no_evidence is honest: we're not scoring it, so it should be excluded from
|
| 60 |
-
# groundedness averages downstream.
|
| 61 |
-
if not sentences:
|
| 62 |
-
return {"groundedness": 0.0, "hallucination_rate": 0.0, "no_evidence": True}
|
| 63 |
|
| 64 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
if not chunk_texts:
|
| 66 |
-
return
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
|
| 71 |
model = _get_model()
|
| 72 |
-
# NLI pair order: (premise, hypothesis). Chunks are evidence (premise),
|
| 73 |
-
# response sentences are the claims being checked (hypothesis).
|
| 74 |
-
pairs = [(chunk, sent) for sent in sentences for chunk in chunk_texts]
|
| 75 |
with _predict_sem:
|
| 76 |
logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
|
| 77 |
probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
|
| 78 |
entail = probs[:, _entail_idx]
|
| 79 |
|
|
|
|
|
|
|
| 80 |
n_chunks = len(chunk_texts)
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
return
|
| 91 |
-
"groundedness": round(groundedness, 4),
|
| 92 |
-
"hallucination_rate": round(1.0 - groundedness, 4),
|
| 93 |
-
"no_evidence": False,
|
| 94 |
-
}
|
|
|
|
| 49 |
return [s for s in parts if len(s.split()) >= _MIN_SENTENCE_WORDS]
|
| 50 |
|
| 51 |
|
| 52 |
+
def _no_evidence_result() -> dict:
|
| 53 |
+
return {
|
| 54 |
+
"groundedness": 0.0,
|
| 55 |
+
"hallucination_rate": 0.0,
|
| 56 |
+
"no_evidence": True,
|
| 57 |
+
"sentences_total": 0,
|
| 58 |
+
"sentences_grounded": 0,
|
| 59 |
+
"nli_threshold": settings.faithfulness_threshold,
|
| 60 |
+
}
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def _score_from_entail(entail, sentences: list[str], n_chunks: int) -> dict:
|
| 64 |
+
threshold = settings.faithfulness_threshold
|
| 65 |
+
grounded = 0
|
| 66 |
+
for i in range(len(sentences)):
|
| 67 |
+
sent_scores = entail[i * n_chunks : (i + 1) * n_chunks]
|
| 68 |
+
if sent_scores.max() >= threshold:
|
| 69 |
+
grounded += 1
|
| 70 |
+
total = len(sentences)
|
| 71 |
+
return {
|
| 72 |
+
"groundedness": round(grounded / total, 4),
|
| 73 |
+
"hallucination_rate": round(1.0 - grounded / total, 4),
|
| 74 |
+
"no_evidence": False,
|
| 75 |
+
"sentences_total": total,
|
| 76 |
+
"sentences_grounded": grounded,
|
| 77 |
+
"nli_threshold": threshold,
|
| 78 |
+
}
|
| 79 |
+
|
| 80 |
+
|
| 81 |
def compute_faithfulness(response: str, chunks: list[dict]) -> dict:
|
| 82 |
"""Sentence-level NLI: each sentence must be entailed by at least one chunk."""
|
| 83 |
+
return compute_faithfulness_batch([response], chunks)[0]
|
|
|
|
| 84 |
|
| 85 |
|
| 86 |
+
def compute_faithfulness_batch(responses: list[str], chunks: list[dict]) -> list[dict]:
|
| 87 |
+
"""Score multiple candidate responses against the same chunks in a single
|
| 88 |
+
model.predict call. Caller passes `responses` in candidate order; we return
|
| 89 |
+
results in the same order. Falls back to _no_evidence_result when there's
|
| 90 |
+
nothing to score for a given response."""
|
| 91 |
+
chunk_texts = [c.get("text", "") for c in chunks if c.get("text")] if chunks else []
|
| 92 |
if not chunk_texts:
|
| 93 |
+
return [_no_evidence_result() for _ in responses]
|
| 94 |
+
|
| 95 |
+
per_resp_sentences = [
|
| 96 |
+
_split_sentences(r)[:_MAX_SENTENCES] if r else [] for r in responses
|
| 97 |
+
]
|
| 98 |
+
pairs: list[tuple[str, str]] = []
|
| 99 |
+
for sents in per_resp_sentences:
|
| 100 |
+
for sent in sents:
|
| 101 |
+
for chunk in chunk_texts:
|
| 102 |
+
pairs.append((chunk, sent))
|
| 103 |
+
if not pairs:
|
| 104 |
+
return [_no_evidence_result() for _ in responses]
|
| 105 |
|
| 106 |
model = _get_model()
|
|
|
|
|
|
|
|
|
|
| 107 |
with _predict_sem:
|
| 108 |
logits = model.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
|
| 109 |
probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
|
| 110 |
entail = probs[:, _entail_idx]
|
| 111 |
|
| 112 |
+
out: list[dict] = []
|
| 113 |
+
cursor = 0
|
| 114 |
n_chunks = len(chunk_texts)
|
| 115 |
+
for sentences in per_resp_sentences:
|
| 116 |
+
if not sentences:
|
| 117 |
+
out.append(_no_evidence_result())
|
| 118 |
+
continue
|
| 119 |
+
n_pairs = len(sentences) * n_chunks
|
| 120 |
+
out.append(
|
| 121 |
+
_score_from_entail(entail[cursor : cursor + n_pairs], sentences, n_chunks)
|
| 122 |
+
)
|
| 123 |
+
cursor += n_pairs
|
| 124 |
+
return out
|
|
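The batched faithfulness path flattens every (sentence, chunk) pair into one list and then walks it with a cursor. The sketch below reproduces only that slicing logic on plain floats, with no NLI model involved; `slice_grounded_counts` is a hypothetical name, and the 0.5 default threshold is an assumption standing in for `settings.faithfulness_threshold`.

```python
def slice_grounded_counts(entail, sentence_counts, n_chunks, threshold=0.5):
    """Split a flat list of entailment scores back into per-response results.

    entail          flat scores ordered response -> sentence -> chunk
    sentence_counts number of scored sentences for each response
    Returns (grounded, total) per response, or None when nothing was scored.
    """
    results = []
    cursor = 0
    for n_sent in sentence_counts:
        if n_sent == 0:
            results.append(None)  # no pairs were emitted for this response
            continue
        n_pairs = n_sent * n_chunks
        block = entail[cursor : cursor + n_pairs]
        grounded = sum(
            1
            for i in range(n_sent)
            if max(block[i * n_chunks : (i + 1) * n_chunks]) >= threshold
        )
        results.append((grounded, n_sent))
        cursor += n_pairs
    return results
```

The cursor advances only for responses that contributed pairs, which is what keeps empty responses from shifting the slices of the ones that follow.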
backend/evals/multimodal_alignment.py
CHANGED
|
@@ -77,11 +77,18 @@ def _gesture_alignment(response: str, gesture_tag: str | None) -> float:
|
|
| 77 |
return 1.0 if pattern.search(response) else 0.0
|
| 78 |
|
| 79 |
|
| 80 |
-
def _gaze_alignment(
|
|
|
|
|
|
|
| 81 |
if not gaze_bucket or not chunks:
|
| 82 |
-
return 0.0
|
| 83 |
matches = sum(1 for c in chunks if c.get("bucket") == gaze_bucket)
|
| 84 |
-
return matches / len(chunks)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 85 |
|
| 86 |
|
| 87 |
def compute_multimodal_alignment(
|
|
@@ -92,16 +99,37 @@ def compute_multimodal_alignment(
|
|
| 92 |
chunks: list[dict],
|
| 93 |
) -> dict:
|
| 94 |
scores: dict[str, float] = {}
|
|
|
|
| 95 |
if affect:
|
| 96 |
scores["affect_alignment"] = _affect_alignment(response, affect)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 97 |
if gesture_tag:
|
| 98 |
scores["gesture_alignment"] = _gesture_alignment(response, gesture_tag)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
if gaze_bucket:
|
| 100 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 101 |
overall = sum(scores.values()) / len(scores) if scores else 0.0
|
| 102 |
return {
|
| 103 |
"overall_score": round(overall, 4),
|
| 104 |
"affect_alignment": round(scores.get("affect_alignment", 0.0), 4),
|
| 105 |
"gesture_alignment": round(scores.get("gesture_alignment", 0.0), 4),
|
| 106 |
"gaze_alignment": round(scores.get("gaze_alignment", 0.0), 4),
|
|
|
|
| 107 |
}
|
|
|
|
| 77 |
return 1.0 if pattern.search(response) else 0.0
|
| 78 |
|
| 79 |
|
| 80 |
+
def _gaze_alignment(
|
| 81 |
+
chunks: list[dict], gaze_bucket: str | None
|
| 82 |
+
) -> tuple[float, int, int]:
|
| 83 |
if not gaze_bucket or not chunks:
|
| 84 |
+
return 0.0, 0, len(chunks) if chunks else 0
|
| 85 |
matches = sum(1 for c in chunks if c.get("bucket") == gaze_bucket)
|
| 86 |
+
return matches / len(chunks), matches, len(chunks)
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
def _affect_breakdown(response: str) -> tuple[int, int]:
|
| 90 |
+
toks = _tokens(response)
|
| 91 |
+
return len(toks & _POSITIVE), len(toks & _NEGATIVE)
|
| 92 |
|
| 93 |
|
| 94 |
def compute_multimodal_alignment(
|
|
|
|
| 99 |
chunks: list[dict],
|
| 100 |
) -> dict:
|
| 101 |
scores: dict[str, float] = {}
|
| 102 |
+
explain: dict[str, dict] = {}
|
| 103 |
if affect:
|
| 104 |
scores["affect_alignment"] = _affect_alignment(response, affect)
|
| 105 |
+
pos, neg = _affect_breakdown(response)
|
| 106 |
+
explain["affect"] = {
|
| 107 |
+
"target": affect,
|
| 108 |
+
"pos_words": pos,
|
| 109 |
+
"neg_words": neg,
|
| 110 |
+
"sentiment": round(_sentiment_score(response), 4),
|
| 111 |
+
}
|
| 112 |
if gesture_tag:
|
| 113 |
scores["gesture_alignment"] = _gesture_alignment(response, gesture_tag)
|
| 114 |
+
pattern = _GESTURE_OPENER_PATTERNS.get(gesture_tag)
|
| 115 |
+
explain["gesture"] = {
|
| 116 |
+
"tag": gesture_tag,
|
| 117 |
+
"has_pattern": pattern is not None,
|
| 118 |
+
"matched": bool(pattern.search(response)) if pattern else None,
|
| 119 |
+
}
|
| 120 |
if gaze_bucket:
|
| 121 |
+
score, matches, total = _gaze_alignment(chunks, gaze_bucket)
|
| 122 |
+
scores["gaze_alignment"] = score
|
| 123 |
+
explain["gaze"] = {
|
| 124 |
+
"bucket": gaze_bucket,
|
| 125 |
+
"matched_chunks": matches,
|
| 126 |
+
"total_chunks": total,
|
| 127 |
+
}
|
| 128 |
overall = sum(scores.values()) / len(scores) if scores else 0.0
|
| 129 |
return {
|
| 130 |
"overall_score": round(overall, 4),
|
| 131 |
"affect_alignment": round(scores.get("affect_alignment", 0.0), 4),
|
| 132 |
"gesture_alignment": round(scores.get("gesture_alignment", 0.0), 4),
|
| 133 |
"gaze_alignment": round(scores.get("gaze_alignment", 0.0), 4),
|
| 134 |
+
"explain": explain,
|
| 135 |
}
|
backend/evals/relevance.py
ADDED
|
@@ -0,0 +1,16 @@
|
| 1 |
+
def compute_relevance(response: str, query: str) -> dict:
|
| 2 |
+
"""BGE cosine similarity between query and response embeddings.
|
| 3 |
+
|
| 4 |
+
NLI entailment is the wrong abstraction here (a question rarely entails
|
| 5 |
+
its answer), so we use the same embedding space the retriever uses.
|
| 6 |
+
"""
|
| 7 |
+
if not response.strip() or not query.strip():
|
| 8 |
+
return {"relevance": 0.0, "no_query": not query.strip()}
|
| 9 |
+
|
| 10 |
+
from backend.retrieval.vector_store import embed_texts
|
| 11 |
+
|
| 12 |
+
vecs = embed_texts([query, response])
|
| 13 |
+
return {
|
| 14 |
+
"relevance": round(max(0.0, float(vecs[0] @ vecs[1])), 4),
|
| 15 |
+
"no_query": False,
|
| 16 |
+
}
|
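The relevance score is a dot product clamped at zero. That equals cosine similarity only if both vectors are unit length, which the `normalize_embeddings=True` argument in `embed_texts` suggests. A small sketch with plain lists, where `clamped_relevance` is a hypothetical name and the inputs are assumed to be pre-normalized:

```python
def clamped_relevance(q_vec, r_vec):
    """Dot product of two unit vectors, floored at 0 and rounded to 4 places.

    Assumes both inputs are already L2-normalized, so the dot product is the
    cosine similarity; negative similarities are reported as 0.0.
    """
    dot = sum(a * b for a, b in zip(q_vec, r_vec))
    return round(max(0.0, dot), 4)
```

Clamping keeps the value in the 0 to 1 range that the percentage formatting in the UI expects.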
backend/retrieval/vector_store.py
CHANGED
|
@@ -54,6 +54,15 @@ def get_embedder():
|
|
| 54 |
return _get_embedder()
|
| 55 |
|
| 56 |
|
| 57 |
# Index cache: one (vectors_tensor, meta) per user_id.
|
| 58 |
_index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}
|
| 59 |
|
|
|
|
| 54 |
return _get_embedder()
|
| 55 |
|
| 56 |
|
| 57 |
+
def embed_texts(texts: list[str]) -> torch.Tensor:
|
| 58 |
+
return _get_embedder().encode(
|
| 59 |
+
texts,
|
| 60 |
+
convert_to_tensor=True,
|
| 61 |
+
normalize_embeddings=True,
|
| 62 |
+
device=_DEVICE,
|
| 63 |
+
)
|
| 64 |
+
|
| 65 |
+
|
| 66 |
# Index cache: one (vectors_tensor, meta) per user_id.
|
| 67 |
_index_cache: dict[str, tuple[torch.Tensor, list[dict]]] = {}
|
| 68 |
|
frontend/src/App.css
CHANGED
|
@@ -662,6 +662,53 @@ input[type="text"]:hover {
|
|
| 662 |
border-color: var(--border);
|
| 663 |
}
|
| 664 |
|
| 665 |
.eval-details {
|
| 666 |
display: flex;
|
| 667 |
flex-direction: column;
|
|
|
|
| 662 |
border-color: var(--border);
|
| 663 |
}
|
| 664 |
|
| 665 |
+
.tip {
|
| 666 |
+
position: relative;
|
| 667 |
+
display: inline-flex;
|
| 668 |
+
cursor: help;
|
| 669 |
+
}
|
| 670 |
+
|
| 671 |
+
.tip::after {
|
| 672 |
+
content: attr(data-tip);
|
| 673 |
+
position: absolute;
|
| 674 |
+
bottom: calc(100% + 8px);
|
| 675 |
+
left: 0;
|
| 676 |
+
background: #1f2937;
|
| 677 |
+
color: #f9fafb;
|
| 678 |
+
padding: 8px 10px;
|
| 679 |
+
border-radius: 6px;
|
| 680 |
+
font-size: 11px;
|
| 681 |
+
font-weight: 400;
|
| 682 |
+
line-height: 1.45;
|
| 683 |
+
letter-spacing: 0.1px;
|
| 684 |
+
white-space: pre-line;
|
| 685 |
+
width: max-content;
|
| 686 |
+
max-width: 320px;
|
| 687 |
+
opacity: 0;
|
| 688 |
+
pointer-events: none;
|
| 689 |
+
transition: opacity 0.08s ease-in 0.5s;
|
| 690 |
+
z-index: 100;
|
| 691 |
+
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.25);
|
| 692 |
+
}
|
| 693 |
+
|
| 694 |
+
.tip::before {
|
| 695 |
+
content: "";
|
| 696 |
+
position: absolute;
|
| 697 |
+
bottom: calc(100% + 2px);
|
| 698 |
+
left: 12px;
|
| 699 |
+
border: 6px solid transparent;
|
| 700 |
+
border-top-color: #1f2937;
|
| 701 |
+
opacity: 0;
|
| 702 |
+
pointer-events: none;
|
| 703 |
+
transition: opacity 0.08s ease-in 0.5s;
|
| 704 |
+
z-index: 100;
|
| 705 |
+
}
|
| 706 |
+
|
| 707 |
+
.tip:hover::after,
|
| 708 |
+
.tip:hover::before {
|
| 709 |
+
opacity: 1;
|
| 710 |
+
}
|
| 711 |
+
|
| 712 |
.eval-details {
|
| 713 |
display: flex;
|
| 714 |
flex-direction: column;
|
frontend/src/components/EvalPanel.tsx
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
import { useState } from "react";
|
| 2 |
import { submitRating } from "../lib/api";
|
| 3 |
import type { EvalScores } from "../types";
|
| 4 |
|
|
@@ -20,7 +20,162 @@ function fmt(score: number): string {
|
|
| 20 |
return (score * 100).toFixed(0) + "%";
|
| 21 |
}
|
| 22 |
|
| 23 |
-
|
| 24 |
runId,
|
| 25 |
userId,
|
| 26 |
latencyTotal,
|
|
@@ -34,6 +189,9 @@ export function EvalPanel({
|
|
| 34 |
? evalScores.slo_passed
|
| 35 |
: latencyTotal > 0 && latencyTotal < sloTarget;
|
| 36 |
const effectiveLatency = evalScores?.t_total_s ?? latencyTotal;
|
|
|
|
|
|
|
|
|
|
| 37 |
|
| 38 |
async function rate(stars: number) {
|
| 39 |
if (submitting || value !== null) return;
|
|
@@ -57,64 +215,76 @@ export function EvalPanel({
|
|
| 57 |
<div className="eval-row">
|
| 58 |
{effectiveLatency > 0 && (
|
| 59 |
<span
|
| 60 |
-
className=
|
| 61 |
-
|
| 62 |
>
|
| 63 |
-
{
|
|
|
|
|
|
|
| 64 |
</span>
|
| 65 |
)}
|
| 66 |
{evalScores && (
|
| 67 |
<>
|
| 68 |
-
<span
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
evalScores.hallucination_rate
|
| 77 |
-
)})`
|
| 78 |
-
}
|
| 79 |
-
>
|
| 80 |
-
grounded {evalScores.no_evidence ? "—" : fmt(evalScores.groundedness)}
|
| 81 |
</span>
|
| 82 |
-
|
| 83 |
-
className=
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
</span>
|
| 88 |
-
<span
|
| 89 |
-
className={`eval-pill ${gradeClass(evalScores.gesture_alignment)}`}
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
gesture {fmt(evalScores.gesture_alignment)}
|
| 93 |
</span>
|
| 94 |
-
<span
|
| 95 |
-
className={`eval-pill ${gradeClass(evalScores.gaze_alignment)}`}
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
gaze {fmt(evalScores.gaze_alignment)}
|
| 99 |
</span>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 100 |
</>
|
| 101 |
)}
|
| 102 |
-
<
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
|
|
|
|
|
|
| 117 |
</div>
|
| 118 |
</div>
|
| 119 |
);
|
| 120 |
}
|
|
|
|
|
|
|
|
|
| 1 |
+
import { memo, useState } from "react";
|
| 2 |
import { submitRating } from "../lib/api";
|
| 3 |
import type { EvalScores } from "../types";
|
| 4 |
|
|
|
|
| 20 |
return (score * 100).toFixed(0) + "%";
|
| 21 |
}
|
| 22 |
|
| 23 |
+
function buildTip(parts: {
|
| 24 |
+
title: string;
|
| 25 |
+
question: string;
|
| 26 |
+
how?: string;
|
| 27 |
+
thisTurn?: string;
|
| 28 |
+
fallback?: string;
|
| 29 |
+
}): string {
|
| 30 |
+
const header = `${parts.title} — ${parts.question}`;
|
| 31 |
+
if (parts.fallback) return `${header}\n\n${parts.fallback}`;
|
| 32 |
+
const sections = [parts.how, parts.thisTurn].filter(Boolean);
|
| 33 |
+
return sections.length ? `${header}\n\n${sections.join("\n\n")}` : header;
|
| 34 |
+
}
|
| 35 |
+
|
| 36 |
+
function groundednessTip(s: EvalScores): string {
|
| 37 |
+
const title = "GROUNDED";
|
| 38 |
+
if (s.no_evidence) {
|
| 39 |
+
return buildTip({
|
| 40 |
+
title,
|
| 41 |
+
question: "Did the response stick to the retrieved memories?",
|
| 42 |
+
fallback:
|
| 43 |
+
"Not scored: no memories were retrieved this turn (e.g. a 'how are you feeling?' question that skips retrieval).",
|
| 44 |
+
});
|
| 45 |
+
}
|
| 46 |
+
const total = s.sentences_total ?? 0;
|
| 47 |
+
const grounded = s.sentences_grounded ?? 0;
|
| 48 |
+
const thr = s.nli_threshold ?? 0.5;
|
| 49 |
+
return buildTip({
|
| 50 |
+
title,
|
| 51 |
+
question: "Did the response stick to the retrieved memories, or hallucinate?",
|
| 52 |
+
how:
|
| 53 |
+
`How: each sentence in the response is checked against each retrieved chunk with an NLI model. ` +
|
| 54 |
+
`A sentence counts as grounded if at least one chunk entails it with probability ≥ ${thr.toFixed(2)}.`,
|
| 55 |
+
thisTurn:
|
| 56 |
+
`This turn: ${grounded}/${total} sentences grounded → ${fmt(s.groundedness)}. ` +
|
| 57 |
+
`Hallucination = ${fmt(s.hallucination_rate)} (${total - grounded} unsupported).`,
|
| 58 |
+
});
|
| 59 |
+
}
|
| 60 |
+
|
| 61 |
+
function relevanceTip(s: EvalScores): string {
|
| 62 |
+
return buildTip({
|
| 63 |
+
title: "RELEVANT",
|
| 64 |
+
question: "Did the response actually address the partner's question?",
|
| 65 |
+
how:
|
| 66 |
+
"How: cosine similarity between the BGE embedding of the query and the embedding of the response. " +
|
| 67 |
+
"Higher = more semantically on-topic.",
|
| 68 |
+
thisTurn: `This turn: ${(s.relevance ?? 0).toFixed(3)} → ${fmt(s.relevance ?? 0)}.`,
|
| 69 |
+
});
|
| 70 |
+
}
|
| 71 |
+
|
| 72 |
+
function affectTip(s: EvalScores): string {
|
| 73 |
+
const question = "Does the response tone match the detected facial expression?";
|
| 74 |
+
const ex = s.explain?.affect;
|
| 75 |
+
if (!ex) {
|
| 76 |
+
return buildTip({ title: "AFFECT", question });
|
| 77 |
+
}
|
| 78 |
+
return buildTip({
|
| 79 |
+
title: "AFFECT",
|
| 80 |
+
question,
|
| 81 |
+
how:
|
| 82 |
+
"How: response sentiment is computed from positive vs negative word counts, " +
|
| 83 |
+
"then compared to the affect target.",
|
| 84 |
+
thisTurn:
|
| 85 |
+
`This turn: detected ${ex.target}, response sentiment = ${ex.sentiment.toFixed(2)} ` +
|
| 86 |
+
`(${ex.pos_words} positive word${ex.pos_words === 1 ? "" : "s"}, ` +
|
| 87 |
+
`${ex.neg_words} negative) → ${fmt(s.affect_alignment)}.`,
|
| 88 |
+
});
|
| 89 |
+
}
|
| 90 |
+
|
| 91 |
+
function gestureTip(s: EvalScores): string {
|
| 92 |
+
const title = "GESTURE";
|
| 93 |
+
const question = "Does the response opener acknowledge the detected hand gesture?";
|
| 94 |
+
const ex = s.explain?.gesture;
|
| 95 |
+
if (!ex) {
|
| 96 |
+
return buildTip({ title, question, fallback: "No gesture detected this turn — defaults to 0." });
|
| 97 |
+
}
|
| 98 |
+
if (!ex.has_pattern) {
|
| 99 |
+
return buildTip({
|
| 100 |
+
title,
|
| 101 |
+
question,
|
| 102 |
+
fallback: `Detected ${ex.tag}, but this gesture has no opener pattern to test — partial credit (50%).`,
|
| 103 |
+
});
|
| 104 |
+
}
|
| 105 |
+
return buildTip({
|
| 106 |
+
title,
|
| 107 |
+
question,
|
| 108 |
+
how: `How: regex check on the first words of the response (e.g. THUMBS_UP expects 'yes/sure/absolutely…').`,
|
| 109 |
+
thisTurn:
|
| 110 |
+
`This turn: detected ${ex.tag}, opener ${ex.matched ? "matched" : "did not match"} ` +
|
| 111 |
+
`→ ${ex.matched ? "100%" : "0%"}.`,
|
| 112 |
+
});
|
| 113 |
+
}
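
The three branches of `gestureTip` imply a three-state score: 0 when no gesture is detected, 50% when the gesture has no opener pattern, and 100% or 0% depending on the regex match. A rough sketch of that logic follows. The pattern table `OPENER_PATTERNS` and both function names are hypothetical; the real patterns live in the backend and may differ.

```typescript
// Shape mirrors the `gesture` entry of EvalExplain in this commit.
interface GestureExplain {
  tag: string;
  has_pattern: boolean;
  matched: boolean | null;
}

// Hypothetical pattern table, based only on the tooltip's THUMBS_UP example.
const OPENER_PATTERNS: Record<string, RegExp> = {
  THUMBS_UP: /^\s*(yes|sure|absolutely)\b/i,
};

// Build the explain record for one turn; undefined means no gesture detected.
function explainGesture(tag: string | null, response: string): GestureExplain | undefined {
  if (!tag) return undefined;
  const pattern = OPENER_PATTERNS[tag];
  if (!pattern) return { tag, has_pattern: false, matched: null };
  return { tag, has_pattern: true, matched: pattern.test(response) };
}

// Map the explain record to the score the tooltip reports.
function gestureScore(ex: GestureExplain | undefined): number {
  if (!ex) return 0; // no gesture this turn
  if (!ex.has_pattern) return 0.5; // nothing to test, partial credit
  return ex.matched ? 1 : 0;
}
```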
|
| 114 |
+
|
| 115 |
+
function gazeTip(s: EvalScores): string {
|
| 116 |
+
const title = "GAZE";
|
| 117 |
+
const question = "Did the retrieved memories come from the topic the user was looking at?";
|
| 118 |
+
const ex = s.explain?.gaze;
|
| 119 |
+
if (!ex) {
|
| 120 |
+
return buildTip({ title, question, fallback: "No gaze bucket detected this turn — defaults to 0." });
|
| 121 |
+
}
|
| 122 |
+
if (ex.total_chunks === 0) {
|
| 123 |
+
return buildTip({
|
| 124 |
+
title,
|
| 125 |
+
question,
|
| 126 |
+
fallback: `User was looking at: ${ex.bucket}. No chunks retrieved this turn — defaults to 0.`,
|
| 127 |
+
});
|
| 128 |
+
}
|
| 129 |
+
return buildTip({
|
| 130 |
+
title,
|
| 131 |
+
question,
|
| 132 |
+
how: `How: fraction of retrieved chunks whose 'bucket' label matches the gaze target.`,
|
| 133 |
+
thisTurn:
|
| 134 |
+
`This turn: user looking at ${ex.bucket}, ${ex.matched_chunks}/${ex.total_chunks} ` +
|
| 135 |
+
`retrieved chunks matched → ${fmt(s.gaze_alignment)}.`,
|
| 136 |
+
});
|
| 137 |
+
}
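
The gaze score described above is a plain fraction: matched chunks over total chunks, with 0 as the default when there is no gaze bucket or nothing was retrieved. A small sketch under those assumptions. The `RetrievedChunk` shape and the `gazeAlignment` name are illustrative, not the backend's actual types.

```typescript
// Hypothetical minimal chunk shape: only the topic label matters here.
interface RetrievedChunk {
  bucket: string;
}

// Field names mirror the `gaze` entry of EvalExplain; `score` is added for illustration.
interface GazeResult {
  matched_chunks: number;
  total_chunks: number;
  score: number;
}

function gazeAlignment(gazeBucket: string | null, chunks: RetrievedChunk[]): GazeResult {
  const total = chunks.length;
  // Both fallback cases in the tooltip default to 0.
  if (!gazeBucket || total === 0) {
    return { matched_chunks: 0, total_chunks: total, score: 0 };
  }
  const matched = chunks.filter((c) => c.bucket === gazeBucket).length;
  return { matched_chunks: matched, total_chunks: total, score: matched / total };
}
```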
|
| 138 |
+
|
| 139 |
+
function diversityTip(s: EvalScores): string {
|
| 140 |
+
const title = "DIVERSITY";
|
| 141 |
+
const question = "How different are the candidate responses the picker showed?";
|
| 142 |
+
const n = s.n_candidates ?? 0;
|
| 143 |
+
const d = s.candidate_diversity ?? 0;
|
| 144 |
+
if (n < 2) {
|
| 145 |
+
return buildTip({ title, question, fallback: `Only ${n} candidate${n === 1 ? "" : "s"} this turn — not meaningful.` });
|
| 146 |
+
}
|
| 147 |
+
return buildTip({
|
| 148 |
+
title,
|
| 149 |
+
question,
|
| 150 |
+
how:
|
| 151 |
+
"How: average pairwise cosine distance between BGE embeddings of the candidate texts. " +
|
| 152 |
+
"High = varied alternatives. Low = three paraphrases of the same answer (the 'aloha' problem).",
|
| 153 |
+
thisTurn: `This turn: ${n} candidates, mean pairwise distance = ${d.toFixed(3)} → ${fmt(d)}.`,
|
| 154 |
+
});
|
| 155 |
+
}
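
The diversity metric named in the tooltip, mean pairwise cosine distance, can be sketched as follows. This assumes the candidate embeddings arrive as plain number arrays and that cosine distance is defined as 1 minus cosine similarity. The function names are hypothetical and the backend implementation may differ in details such as normalization.

```typescript
// Cosine distance = 1 - cosine similarity (assumed definition).
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: treat a zero vector as maximally dissimilar from everything.
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / Math.sqrt(normA * normB);
}

// Average the distance over every unordered pair (i < j) of candidates.
function meanPairwiseDistance(embeddings: number[][]): number {
  const n = embeddings.length;
  if (n < 2) return 0; // a single candidate has no pairs to compare
  let sum = 0;
  let pairs = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      sum += cosineDistance(embeddings[i], embeddings[j]);
      pairs++;
    }
  }
  return sum / pairs;
}
```

With three candidates there are three pairs, so three near-identical paraphrases yield a value close to 0, which is the failure case the tooltip calls out.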
|
| 156 |
+
|
| 157 |
+
function sloTip(
|
| 158 |
+
s: EvalScores | null | undefined,
|
| 159 |
+
fallbackLatency: number,
|
| 160 |
+
fallbackTarget: number,
|
| 161 |
+
fallbackPassed: boolean,
|
| 162 |
+
): string {
|
| 163 |
+
const latency = s?.t_total_s ?? fallbackLatency;
|
| 164 |
+
const target = s?.slo_target_s ?? fallbackTarget;
|
| 165 |
+
const passed = s?.slo_passed ?? fallbackPassed;
|
| 166 |
+
const margin = s?.slo_margin_s;
|
| 167 |
+
const sign = (margin ?? 0) >= 0 ? "+" : "";
|
| 168 |
+
const m = margin !== undefined ? `${sign}${margin.toFixed(2)}s` : "";
|
| 169 |
+
return buildTip({
|
| 170 |
+
title: "LATENCY",
|
| 171 |
+
question: "Did the response arrive within the SLO budget?",
|
| 172 |
+
thisTurn:
|
| 173 |
+
`Target: < ${target.toFixed(1)}s. ` +
|
| 174 |
+
`This turn: ${latency.toFixed(2)}s${m ? ` (${m} margin)` : ""} — ${passed ? "passed ✓" : "failed ✗"}.`,
|
| 175 |
+
});
|
| 176 |
+
}
|
| 177 |
+
|
| 178 |
+
function EvalPanelImpl({
|
| 179 |
runId,
|
| 180 |
userId,
|
| 181 |
latencyTotal,
|
|
|
|
| 189 |
? evalScores.slo_passed
|
| 190 |
: latencyTotal > 0 && latencyTotal < sloTarget;
|
| 191 |
const effectiveLatency = evalScores?.t_total_s ?? latencyTotal;
|
| 192 |
+
const showDiversity =
|
| 193 |
+
evalScores && (evalScores.n_candidates ?? 0) >= 2;
|
| 194 |
+
const showRelevance = evalScores && evalScores.relevance !== undefined;
|
| 195 |
|
| 196 |
async function rate(stars: number) {
|
| 197 |
if (submitting || value !== null) return;
|
|
|
|
| 215 |
<div className="eval-row">
|
| 216 |
{effectiveLatency > 0 && (
|
| 217 |
<span
|
| 218 |
+
className="tip"
|
| 219 |
+
data-tip={sloTip(evalScores, effectiveLatency, sloTarget, sloPassed)}
|
| 220 |
>
|
| 221 |
+
<span className={`slo-badge ${sloPassed ? "pass" : "fail"}`}>
|
| 222 |
+
{effectiveLatency.toFixed(2)}s {sloPassed ? "✓" : "✗"}
|
| 223 |
+
</span>
|
| 224 |
</span>
|
| 225 |
)}
|
| 226 |
{evalScores && (
|
| 227 |
<>
|
| 228 |
+
<span className="tip" data-tip={groundednessTip(evalScores)}>
|
| 229 |
+
<span
|
| 230 |
+
className={`eval-pill ${
|
| 231 |
+
evalScores.no_evidence ? "muted" : gradeClass(evalScores.groundedness)
|
| 232 |
+
}`}
|
| 233 |
+
>
|
| 234 |
+
grounded {evalScores.no_evidence ? "—" : fmt(evalScores.groundedness)}
|
| 235 |
+
</span>
|
| 236 |
</span>
|
| 237 |
+
{showRelevance && (
|
| 238 |
+
<span className="tip" data-tip={relevanceTip(evalScores)}>
|
| 239 |
+
<span className={`eval-pill ${gradeClass(evalScores.relevance ?? 0)}`}>
|
| 240 |
+
relevant {fmt(evalScores.relevance ?? 0)}
|
| 241 |
+
</span>
|
| 242 |
+
</span>
|
| 243 |
+
)}
|
| 244 |
+
<span className="tip" data-tip={affectTip(evalScores)}>
|
| 245 |
+
<span className={`eval-pill ${gradeClass(evalScores.affect_alignment)}`}>
|
| 246 |
+
affect {fmt(evalScores.affect_alignment)}
|
| 247 |
+
</span>
|
| 248 |
</span>
|
| 249 |
+
<span className="tip" data-tip={gestureTip(evalScores)}>
|
| 250 |
+
<span className={`eval-pill ${gradeClass(evalScores.gesture_alignment)}`}>
|
| 251 |
+
gesture {fmt(evalScores.gesture_alignment)}
|
| 252 |
+
</span>
|
| 253 |
</span>
|
| 254 |
+
<span className="tip" data-tip={gazeTip(evalScores)}>
|
| 255 |
+
<span className={`eval-pill ${gradeClass(evalScores.gaze_alignment)}`}>
|
| 256 |
+
gaze {fmt(evalScores.gaze_alignment)}
|
| 257 |
+
</span>
|
| 258 |
</span>
|
| 259 |
+
{showDiversity && (
|
| 260 |
+
<span className="tip" data-tip={diversityTip(evalScores)}>
|
| 261 |
+
<span className={`eval-pill ${gradeClass(evalScores.candidate_diversity ?? 0)}`}>
|
| 262 |
+
diversity {fmt(evalScores.candidate_diversity ?? 0)}
|
| 263 |
+
</span>
|
| 264 |
+
</span>
|
| 265 |
+
)}
|
| 266 |
</>
|
| 267 |
)}
|
| 268 |
+
<span className="tip" data-tip="Rate how authentic this response felt as the persona (1 = off, 5 = spot on). Logged to ratings.jsonl.">
|
| 269 |
+
<div className="star-rating">
|
| 270 |
+
{[1, 2, 3, 4, 5].map((star) => (
|
| 271 |
+
<button
|
| 272 |
+
key={star}
|
| 273 |
+
className={`star ${star <= (hover || (value ?? 0)) ? "active" : ""}`}
|
| 274 |
+
onMouseEnter={() => setHover(star)}
|
| 275 |
+
onMouseLeave={() => setHover(0)}
|
| 276 |
+
onClick={() => rate(star)}
|
| 277 |
+
disabled={value !== null || submitting}
|
| 278 |
+
>
|
| 279 |
+
★
|
| 280 |
+
</button>
|
| 281 |
+
))}
|
| 282 |
+
{value !== null && <span className="star-label">{value}/5</span>}
|
| 283 |
+
</div>
|
| 284 |
+
</span>
|
| 285 |
</div>
|
| 286 |
</div>
|
| 287 |
);
|
| 288 |
}
|
| 289 |
+
|
| 290 |
+
export const EvalPanel = memo(EvalPanelImpl);
|
frontend/src/types.ts
CHANGED
|
@@ -69,10 +69,43 @@ export interface LatencyLog {
|
|
| 69 |
t_total: number;
|
| 70 |
}
|
| 71 |
|
| 72 |
export interface EvalScores {
|
| 73 |
groundedness: number;
|
| 74 |
hallucination_rate: number;
|
| 75 |
no_evidence: boolean;
|
| 76 |
t_total_s: number;
|
| 77 |
slo_target_s: number;
|
| 78 |
slo_passed: boolean;
|
|
@@ -81,6 +114,10 @@ export interface EvalScores {
|
|
| 81 |
affect_alignment: number;
|
| 82 |
gesture_alignment: number;
|
| 83 |
gaze_alignment: number;
|
| 84 |
}
|
| 85 |
|
| 86 |
export type CandidateStrategy =
|
|
|
|
| 69 |
t_total: number;
|
| 70 |
}
|
| 71 |
|
| 72 |
+
export interface CandidateEval {
|
| 73 |
+
idx: number;
|
| 74 |
+
strategy: string;
|
| 75 |
+
selected: boolean;
|
| 76 |
+
groundedness: number;
|
| 77 |
+
hallucination_rate: number;
|
| 78 |
+
no_evidence: boolean;
|
| 79 |
+
relevance: number;
|
| 80 |
+
}
|
| 81 |
+
|
| 82 |
+
export interface EvalExplain {
|
| 83 |
+
affect?: {
|
| 84 |
+
target: string;
|
| 85 |
+
pos_words: number;
|
| 86 |
+
neg_words: number;
|
| 87 |
+
sentiment: number;
|
| 88 |
+
};
|
| 89 |
+
gesture?: {
|
| 90 |
+
tag: string;
|
| 91 |
+
has_pattern: boolean;
|
| 92 |
+
matched: boolean | null;
|
| 93 |
+
};
|
| 94 |
+
gaze?: {
|
| 95 |
+
bucket: string;
|
| 96 |
+
matched_chunks: number;
|
| 97 |
+
total_chunks: number;
|
| 98 |
+
};
|
| 99 |
+
}
|
| 100 |
+
|
| 101 |
export interface EvalScores {
|
| 102 |
groundedness: number;
|
| 103 |
hallucination_rate: number;
|
| 104 |
no_evidence: boolean;
|
| 105 |
+
sentences_total?: number;
|
| 106 |
+
sentences_grounded?: number;
|
| 107 |
+
nli_threshold?: number;
|
| 108 |
+
relevance?: number;
|
| 109 |
t_total_s: number;
|
| 110 |
slo_target_s: number;
|
| 111 |
slo_passed: boolean;
|
|
|
|
| 114 |
affect_alignment: number;
|
| 115 |
gesture_alignment: number;
|
| 116 |
gaze_alignment: number;
|
| 117 |
+
candidate_diversity?: number;
|
| 118 |
+
n_candidates?: number;
|
| 119 |
+
candidates_eval?: CandidateEval[];
|
| 120 |
+
explain?: EvalExplain;
|
| 121 |
}
|
| 122 |
|
| 123 |
export type CandidateStrategy =
|