shwetangisingh committed on
Commit
2d47b97
·
1 Parent(s): 9ad188a

adding detailed TODOs to ReadME

Files changed (1)
  1. README.md +64 -25
README.md CHANGED
@@ -184,31 +184,70 @@ To add a new persona, edit `data/generate_users.py` and re-run `python -m backen
184
 
185
  ## TODO
186
 
187
- - [ ] Add more dataset
188
- - [ ] Reduce latency in intention
189
- - [ ] Add more detailed todos
190
-
191
- ### Evals (`backend/evals/`)
192
-
193
- Per-turn metrics returned in `ChatResponse.eval_scores` and rendered in the React debug panel.
194
-
195
- | Metric | File | Status |
196
- |--------|------|--------|
197
- | Communication Efficiency | `efficiency.py` | Done: SLO check on `t_total` |
198
- | Factual Faithfulness | `faithfulness.py` | Stub |
199
- | Multimodal Alignment | `multimodal_alignment.py` | Stub |
200
- | Perceived Authenticity | (frontend) | UI star rating; not persisted yet |
201
-
202
- - [ ] **Faithfulness**: Load cross-encoder NLI model (e.g. `cross-encoder/nli-deberta-v3-small`),
203
- split response into sentences, check entailment against evidence chunks. Groundedness =
204
- fraction with max entailment > 0.5; hallucination rate = fraction with contradiction > 0.5
205
- and entailment < 0.3. Empty `chunks` → `no_evidence=True`.
206
- - [ ] **Multimodal Alignment** (rule-based, no model):
207
- - Affect → sentiment-word overlap (reuse `affect_positive_map` from planner)
208
- - Gesture → expected-word overlap (reuse `gesture_word_map` from planner)
209
- - Gaze → check whether retrieved chunks came from `gaze_bucket` and response references them
210
- - Overall = mean of non-None sub-scores
211
- - [ ] **Authenticity**: Persist Likert ratings (currently client-side only). Add `POST /chat/rate`.
212
 
213
  ---
214
 
 
184
 
185
  ## TODO
186
 
187
+ From the spec (pages 10–11). Tags: **[Core]** = must do, **[Bonus]** = nice to have, **[Eval]** = for the grade.
188
+
189
+ Heads up: all camera/sensing stuff is in the frontend (MediaPipe JS). Backend just gets the labels (`affect`, `gesture_tag`, `gaze_bucket`). The `backend/sensing/` Python modules are dead code.
190
+
191
+ ### Dataset
192
+
193
+ - [ ] **[Core]** Memories are only autobiographical narratives right now. Need more variety:
194
+ - [ ] social media posts (voice-matched, synth with LLM)
195
+ - [ ] past chat logs (synth with LLM)
196
+ - [ ] update the generator script + rebuild the FAISS index
197
+ - [ ] tag chunks by type so retriever knows what it pulled
198
+ - [ ] **[Core]** Write down the data schema somewhere so evals can reuse it
199
+
200
+ ### Sensing (frontend)
201
+
202
+ - [ ] **[Core]** Head-nod / sharp tilt = "I don't like that". Different from frustrated affect.
203
+ - [ ] send a `dissatisfaction_signal` flag with the chat request
204
+ - [ ] when set, planner returns a "did you mean X or Y?" instead of an answer (the spec's "Turnaround Option")
205
+ - [ ] **[Core]** Smile / positive affect should actually change the wording (more positive lexicon), not just be metadata. Right now it's annotated in the prompt, but we never checked whether the LLM does anything with it. Probably needs a stronger constraint or an example in the prompt
206
+ - [ ] **[Core]** Air-writing is treated as raw text appended to the query. Spec wants it as a stylistic constraint too: should it bias tone, or stay query-only? Decide and document
207
+ - [ ] **[Bonus]** Voice + air-writing conflict resolution. Capture short voice (Web Speech API), compare to air-written intent, send a `resolved_intent`
208
+ - [ ] thumbs-up only changes the prompt today; it should also boost affirmative candidates in the reranker
209
+
210
+ ### Intent decomposition
211
+
212
+ - [ ] **[Core]** Personal / Contextual / Open-domain all hit the same FAISS index right now. Make them actually go different places: open-domain → web search (or stub), contextual → session memory
213
+ - [ ] intent node is slow. Cache the prompt, use a tiny model for routing, parallelise the sub-queries
214
+
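A minimal sketch of the routing and parallel sub-query idea from the two items above. Everything here is illustrative: the three retriever functions, the `ROUTES` table, and the intent labels are hypothetical names, and the stubs stand in for backends the README does not show.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


# Hypothetical retriever stubs; each stands in for a real backend.
def search_persona_index(query: str) -> List[str]:
    return [f"persona-memory hit for: {query}"]


def search_session_memory(query: str) -> List[str]:
    return [f"session-memory hit for: {query}"]


def search_web_stub(query: str) -> List[str]:
    return [f"web-search stub hit for: {query}"]


# Assumed intent labels mapped to the place each intent should go.
ROUTES: Dict[str, Callable[[str], List[str]]] = {
    "personal": search_persona_index,
    "contextual": search_session_memory,
    "open_domain": search_web_stub,
}


def route(intent: str, query: str) -> List[str]:
    """Send one sub-query to its backend; unknown intents fall back to persona memory."""
    handler = ROUTES.get(intent, search_persona_index)
    return handler(query)


def route_all(sub_queries: List[Tuple[str, str]]) -> List[List[str]]:
    """Run all (intent, query) pairs concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        return list(pool.map(lambda pair: route(*pair), sub_queries))
```

Threads suit this case because each retriever call would mostly wait on I/O (index lookup, HTTP); `pool.map` keeps results in the same order as the sub-queries.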
215
+ ### Retrieval
216
+
217
+ - [ ] **[Bonus]** Bucket priors only live for the session. Persist them per user
218
+ - [ ] **[Bonus]** Latency fallback only switches LLM tier. Add more steps:
219
+ - drop reranker if retrieval is slow
220
+ - return a canned response if we blow the budget entirely
221
+ - threshold is 3.5s, spec says 6s; pick one
222
+
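One way the extra fallback steps could look, as a sketch under stated assumptions. `answer_with_fallback`, the callables it takes, `RETRIEVAL_SLOW_S`, and the canned text are all hypothetical; only the 6 s and 3.5 s figures come from the list above.

```python
import time
from typing import Callable, List

BUDGET_S = 6.0          # spec target; the current code threshold is 3.5 s
RETRIEVAL_SLOW_S = 1.5  # assumed cut-off for "retrieval is slow", not from the README
CANNED = "Sorry, I need a moment. Could you ask that again?"  # placeholder text


def answer_with_fallback(
    query: str,
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    budget_s: float = BUDGET_S,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Degrade step by step instead of only switching the LLM tier."""
    start = clock()
    chunks = retrieve(query)
    retrieval_s = clock() - start

    # Step 1: skip the reranker when retrieval already took too long.
    if retrieval_s <= RETRIEVAL_SLOW_S:
        chunks = rerank(query, chunks)

    # Step 2: if the whole budget is gone before generation, return a canned reply.
    if clock() - start >= budget_s:
        return CANNED

    return generate(query, chunks)
```

Passing the clock in as a parameter makes the ladder testable without real delays.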
223
+ ### Generation
224
+
225
+ - [ ] **[Core]** API returns one response. Should return multiple candidates so the user can pick (and so the next item works)
226
+ - [ ] **[Core]** Frontend needs a candidate picker: show all the options, let the user click one, send the selection back
227
+ - [ ] **[Bonus]** When the user picks a candidate, save the `(query, picked)` pair to a side FAISS index and check it first next turn
228
+
229
+ ### Evals
230
+
231
+ Live per-turn scores show up in the `EvalPanel`. State:
232
+
233
+ | Metric | Status |
234
+ |--------|--------|
235
+ | Efficiency | works (SLO check on `t_total`) |
236
+ | Faithfulness | stub, returns 0 |
237
+ | Multimodal alignment | stub, returns 0 |
238
+ | Authenticity | star rating in UI but not saved |
239
+
240
+ - [ ] **[Eval]** Faithfulness: actually check if the response is grounded in what we retrieved. NLI model, sentence-level. If we didn't retrieve anything, flag `no_evidence` instead of pretending we scored it
241
+ - [ ] **[Eval]** Efficiency: the per-turn SLO check is done, but the writeup needs aggregate latency (p50/p95 across a fixed query set, broken out by LLM tier). Spec target is < 6s
242
+ - [ ] **[Eval]** Multimodal alignment: does the response actually reflect the gesture/affect/gaze? Don't need a model for this, just reuse the word maps the planner already has. The gaze one is trickier: check whether the chunks we ended up using came from the bucket the user was looking at
243
+ - [ ] **[Eval]** Authenticity: the Likert stars are wired up in the UI but go nowhere. Save them and log them with the turn so we can actually look at them later
244
+ - [ ] **[Eval]** For the live in-class eval, figure out the actual session: who rates (partners + experts per spec), how many turns each, what gets shown to them. The Likert form is the easy part; the protocol isn't written down anywhere
245
+ - [ ] **[Eval]** Need an offline version of all three automated evals (faithfulness / alignment / efficiency). Aggregate numbers across a fixed query set per persona for the writeup
246
+
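A sketch of the aggregation step for the faithfulness item, assuming the NLI model has already produced one (entailment, contradiction) score pair per sentence per evidence chunk. The thresholds are the ones from the previous version of this TODO (removed in this commit); `FaithfulnessScore`, `score_faithfulness`, and taking the maximum over chunks for both scores are assumptions, not the repo's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ENTAIL_GROUNDED = 0.5      # grounded if max entailment > 0.5
CONTRA_HALLUCINATED = 0.5  # hallucinated if max contradiction > 0.5 ...
ENTAIL_WEAK = 0.3          # ... and max entailment < 0.3


@dataclass
class FaithfulnessScore:
    groundedness: Optional[float]
    hallucination_rate: Optional[float]
    no_evidence: bool


def score_faithfulness(
    per_sentence: List[List[Tuple[float, float]]],
) -> FaithfulnessScore:
    """per_sentence[i][j] = (entailment, contradiction) of sentence i against chunk j."""
    # No sentences or no chunks: flag it rather than report a made-up score.
    if not per_sentence or any(not pairs for pairs in per_sentence):
        return FaithfulnessScore(None, None, True)

    grounded = 0
    hallucinated = 0
    for pairs in per_sentence:
        max_ent = max(e for e, _ in pairs)
        max_con = max(c for _, c in pairs)  # assumption: max over chunks
        if max_ent > ENTAIL_GROUNDED:
            grounded += 1
        if max_con > CONTRA_HALLUCINATED and max_ent < ENTAIL_WEAK:
            hallucinated += 1

    n = len(per_sentence)
    return FaithfulnessScore(grounded / n, hallucinated / n, False)
```

Keeping the model call separate from the aggregation means the same function can serve the live per-turn score and the offline run over a fixed query set.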
247
+ ### Cleanup
248
+
249
+ - [ ] move the affect→tone / persona override dicts out of code into a yaml
250
+ - [ ] delete `backend/sensing/` (dead code, sensing is in frontend)
251
 
252
  ---
253