shwetangisingh commited on
Commit
84b82bd
Β·
1 Parent(s): 7fd8c8a

adding lessons

Browse files
Files changed (1) hide show
  1. lessons.md +194 -0
lessons.md ADDED
@@ -0,0 +1,194 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Lessons
2
+
3
+ Stuff that didn't work and what we did about it. Roughly in the order
4
+ it happened, because most of the bad decisions came from not
5
+ understanding the last one.
6
+
7
+ ## The LLM intent router that ate 114 seconds
8
+
9
+ First version of intent decomposition was an LLM call β€” Gemma4:31b-cloud,
10
+ Pydantic-validated JSON, 3 retries on schema failure with the validation
11
+ error prepended. Looked clean on paper.
12
+
13
+ In practice, Gemma4 would sometimes emit JSON that hit `max_tokens=512`
14
+ mid-object. Truncated. Validator failed. Retry. Same truncation. Retry
15
+ again. By the time the hard fallback kicked in, ~114 seconds had gone by
16
+ with nothing on screen except a spinner.
17
+
18
+ The retries were the problem. We'd treated them like a cheap safety net,
19
+ but each one was a 30+ second round-trip. And when the first attempt fails
20
+ with a specific failure mode (truncation), the retry almost always hits
21
+ the same wall. One clean try with a sensible fallback beats three slow
22
+ tries that all die the same way.
23
+
24
+ We briefly eyed `response_format=json_schema` as a fix but Ollama Cloud
25
+ doesn't expose it yet.
26
+
27
+ ## So we ripped out the LLM. Then we broke multi-intent.
28
+
29
+ Second pass: kill the LLM, use keyword matching on the whole query. Fast,
30
+ deterministic, done.
31
+
32
+ Except "how are you and what is the capital of France?" now became a
33
+ single PERSONAL sub-intent. The whole point of decomposition β€” routing
34
+ the two halves to different pools β€” was gone. We'd deleted the feature
35
+ to make it fast.
36
+
37
+ Speed isn't the only axis. "Agentic" here means splitting the query
38
+ into typed sub-queries *and* routing them; not just "an LLM is involved
39
+ somewhere." A faster solution that doesn't do what the spec asks is
40
+ not a solution.
41
+
42
+ ## What actually worked: split + zero-shot BGE
43
+
44
+ Regex-split the query on `and` / `but` / punctuation, classify each
45
+ fragment via cosine similarity against 5 seed sentences per class using
46
+ the BGE embedder we already had loaded for retrieval.
47
+
48
+ No LLM, no retries, no new dependencies. Median latency went from 114s
49
+ to ~30ms on the same input. All three intents routed correctly.
50
+
51
+ Moral of the story: the cheapest classifier that works is almost always
52
+ the right one for a prototype. We were already paying for BGE; using it
53
+ for classification too cost us nothing.
54
+
55
+ ## The classifier over-matched CONTEXTUAL
56
+
57
+ Live test, turn 11 of a Forrest Gump session. Partner: "give me a
58
+ detailed introduction." Classifier: `CONTEXTUAL`. Retrieval searched
59
+ session history, found three weak matches, grounded the LLM prompt in
60
+ basically nothing about Forrest. The LLM flailed, guardrail caught
61
+ something, user got "I don't know."
62
+
63
+ Problem was that CONTEXTUAL exemplars like *"what were we talking
64
+ about"* cast too wide a net β€” any meta-shaped question slid into that
65
+ bucket. A single threshold (`> 0.35`) didn't guard against it.
66
+
67
+ Fix used two extra signals: CONTEXTUAL has to beat the runner-up by a
68
+ margin (`0.08`), *and* the fragment has to contain an actual discourse
69
+ word (earlier, mentioned, just, repeat, said) matched at word
70
+ boundaries. Low-confidence goes to PERSONAL, not OPEN_DOMAIN β€” safer
71
+ fallback for a persona bot.
72
+
73
+ One-dimensional thresholds are a weak guard. Adding a margin signal
74
+ and a structural word cue made wrong classifications much harder
75
+ without changing what the happy path does.
76
+
77
+ ## CONTEXTUAL was fighting personal grounding
78
+
79
+ Originally we had CONTEXTUAL as a *replacement* for PERSONAL β€” "this
80
+ turn is about what we just said, so search history instead of
81
+ memories." Wrong. Even when the user asks about prior conversation,
82
+ the response still needs to sound like the persona. Session history
83
+ is extra context, not a source of truth.
84
+
85
+ Now CONTEXTUAL always pulls persona memory first, then layers on
86
+ relevant history (score β‰₯ 0.5). Never an empty personal prompt.
87
+
88
+ Think about where the LLM's source of truth is. For a persona bot,
89
+ that's the persona's memories, every time. Other signals go on top.
90
+
91
+ ## Gemma4 started writing the character brief as output
92
+
93
+ Early testing, `THINKING_MODE=off`. Partner: "hi". LLM:
94
+
95
+ ```
96
+ The user wants me to roleplay as Abed Nadir from Community.
97
+ Key characteristics:
98
+ - Autism spectrum (canonically coded, not explicitly diagnosed)
99
+ - Occasional selective mutism during sensory overload
100
+ ...
101
+ ```
102
+
103
+ It was writing the brief, not being the character. Our prompt
104
+ front-loaded Abed's condition and voice, then asked for a response at
105
+ the bottom. Gemma4 treated the whole thing as a writing assignment β€”
106
+ summarize first, respond second. We'd accidentally written a creative
107
+ writing prompt.
108
+
109
+ Two fixes stacked:
110
+
111
+ 1. Anti-meta rules at the top *and* bottom of the prompt: "never
112
+ narrate, analyze, describe, or list traits. Never say 'As an AI',
113
+ 'The user wants me to', 'Key characteristics'..." Models weight
114
+ the start and end of a prompt more than the middle; saying the
115
+ rule in both spots is cheap.
116
+ 2. `THINKING_MODE=suppress` in `.env`. Ollama supports a `/no_think`
117
+ prefix on the user message; this turns it on. Gemma4 stopped
118
+ emitting the scratchpad entirely.
119
+
120
+ Instruction-tuned models will follow *whatever instruction looks most
121
+ like the task*. If your prompt looks like a character brief, the
122
+ model may complete the brief. State "do not narrate" explicitly, and
123
+ use the no-think flag when the model supports it.
124
+
125
+ ## The guardrail saved us once
126
+
127
+ After the prompt and `/no_think` fixes, a test run *still* leaked β€”
128
+ `"The user wants me to roleplay as Raymond..."` came back from the LLM.
129
+ But the user never saw it. The output guardrail caught the phrase and
130
+ swapped in the safe fallback.
131
+
132
+ Belt-and-braces output checks are worth the effort. When the prompt
133
+ was wrong, the guardrail was still right.
134
+
135
+ ## Open-domain tempted us to build a web search
136
+
137
+ First instinct when we added OPEN_DOMAIN was "great, now we need a
138
+ web search adapter." But the product isn't a search engine β€” it's an
139
+ AAC user's voice. If someone asks Mia for the capital of France, the
140
+ answer is "Paris" in her voice. The LLM already knows basic facts;
141
+ Mia's persona is the scarce thing. Piping in a Wikipedia snippet
142
+ would dilute her voice, not enrich it.
143
+
144
+ So OPEN_DOMAIN just emits a stub chunk that tells the LLM to answer
145
+ from its own knowledge. Cheap, aligned with the product, one less
146
+ thing to break.
147
+
148
+ When you see a retrieval-shaped problem, don't assume a retriever
149
+ is the right answer.
150
+
151
+ ## Caching contextual embeddings was a waste of thought
152
+
153
+ At one point we worried about `retrieve_from_history` re-encoding the
154
+ session window every turn. Measured it: 43ms even with 80 turns of
155
+ history. The LLM call was taking 1.5–95 seconds. Shaving 30ms off a
156
+ 20-second turn is 0.1%, invisible.
157
+
158
+ Measure before you optimize, even when the waste seems obvious.
159
+
160
+ ## Monolithic prompts don't cache
161
+
162
+ Our planner built one giant user message with the character sheet
163
+ (stable per persona) and the retrieved chunks (different every turn)
164
+ mashed together. Prompt caches match prefixes exactly β€” one byte
165
+ change in the retrieval block invalidates the whole prompt, including
166
+ ~300 tokens of character sheet that hadn't changed.
167
+
168
+ Split it: system message holds the stable character sheet and
169
+ answering rules, user message holds the per-turn retrieval and query.
170
+ Provider caches the system prefix β†’ every turn after the first skips
171
+ prefill on ~300 tokens.
172
+
173
+ Structure matters as much as content once you care about latency.
174
+ Stable stuff goes in the system message, per-turn stuff goes in the
175
+ user message. Costs nothing in capability, compounds across turns.
176
+
177
+ ---
178
+
179
+ ## Principles we kept circling back to
180
+
181
+ **Measure first.** Every good decision here was triggered by a number
182
+ β€” 114s, 43ms, 30ms. Every bad decision was triggered by a hunch.
183
+
184
+ **Three pools because we have three sources.** Not four, not five. A
185
+ category without a real retriever behind it just confuses the
186
+ classifier.
187
+
188
+ **Short prompts behave better.** Every time we trimmed something,
189
+ the model was more consistent.
190
+
191
+ **Every path produces at least one chunk.** Empty retrieval blocks
192
+ were the fastest route to a hallucination. CONTEXTUAL with no
193
+ history, OPEN_DOMAIN with nothing wired up, a classifier returning
194
+ an empty sub-intent list β€” all have fallbacks now.