aac-chatbot / lessons.md
shwetangisingh's picture
adding lessons
84b82bd

Lessons

Stuff that didn't work and what we did about it. Roughly in the order it happened, because most of the bad decisions came from not understanding the last one.

The LLM intent router that ate 114 seconds

First version of intent decomposition was an LLM call β€” Gemma4:31b-cloud, Pydantic-validated JSON, 3 retries on schema failure with the validation error prepended. Looked clean on paper.

In practice, Gemma4 would sometimes emit JSON that hit max_tokens=512 mid-object. Truncated. Validator failed. Retry. Same truncation. Retry again. By the time the hard fallback kicked in, ~114 seconds had gone by with nothing on screen except a spinner.

The retries were the problem. We'd treated them like a cheap safety net, but each one was a 30+ second round-trip. And when the first attempt fails with a specific failure mode (truncation), the retry almost always hits the same wall. One clean try with a sensible fallback beats three slow tries that all die the same way.

We briefly eyed response_format=json_schema as a fix but Ollama Cloud doesn't expose it yet.

So we ripped out the LLM. Then we broke multi-intent.

Second pass: kill the LLM, use keyword matching on the whole query. Fast, deterministic, done.

Except "how are you and what is the capital of France?" now became a single PERSONAL sub-intent. The whole point of decomposition β€” routing the two halves to different pools β€” was gone. We'd deleted the feature to make it fast.

Speed isn't the only axis. "Agentic" here means splitting the query into typed sub-queries and routing them; not just "an LLM is involved somewhere." A faster solution that doesn't do what the spec asks is not a solution.

What actually worked: split + zero-shot BGE

Regex-split the query on and / but / punctuation, classify each fragment via cosine similarity against 5 seed sentences per class using the BGE embedder we already had loaded for retrieval.

No LLM, no retries, no new dependencies. Median latency went from 114s to ~30ms on the same input. All three intents routed correctly.

Moral of the story: the cheapest classifier that works is almost always the right one for a prototype. We were already paying for BGE; using it for classification too cost us nothing.

The classifier over-matched CONTEXTUAL

Live test, turn 11 of a Forrest Gump session. Partner: "give me a detailed introduction." Classifier: CONTEXTUAL. Retrieval searched session history, found three weak matches, grounded the LLM prompt in basically nothing about Forrest. The LLM flailed, guardrail caught something, user got "I don't know."

Problem was that CONTEXTUAL exemplars like "what were we talking about" cast too wide a net β€” any meta-shaped question slid into that bucket. A single threshold (> 0.35) didn't guard against it.

Fix used two extra signals: CONTEXTUAL has to beat the runner-up by a margin (0.08), and the fragment has to contain an actual discourse word (earlier, mentioned, just, repeat, said) matched at word boundaries. Low-confidence goes to PERSONAL, not OPEN_DOMAIN β€” safer fallback for a persona bot.

One-dimensional thresholds are a weak guard. Adding a margin signal and a structural word cue made wrong classifications much harder without changing what the happy path does.

CONTEXTUAL was fighting personal grounding

Originally we had CONTEXTUAL as a replacement for PERSONAL β€” "this turn is about what we just said, so search history instead of memories." Wrong. Even when the user asks about prior conversation, the response still needs to sound like the persona. Session history is extra context, not a source of truth.

Now CONTEXTUAL always pulls persona memory first, then layers on relevant history (score β‰₯ 0.5). Never an empty personal prompt.

Think about where the LLM's source of truth is. For a persona bot, that's the persona's memories, every time. Other signals go on top.

Gemma4 started writing the character brief as output

Early testing, THINKING_MODE=off. Partner: "hi". LLM:

The user wants me to roleplay as Abed Nadir from Community.
Key characteristics:
- Autism spectrum (canonically coded, not explicitly diagnosed)
- Occasional selective mutism during sensory overload
...

It was writing the brief, not being the character. Our prompt front-loaded Abed's condition and voice, then asked for a response at the bottom. Gemma4 treated the whole thing as a writing assignment β€” summarize first, respond second. We'd accidentally written a creative writing prompt.

Two fixes stacked:

  1. Anti-meta rules at the top and bottom of the prompt: "never narrate, analyze, describe, or list traits. Never say 'As an AI', 'The user wants me to', 'Key characteristics'..." Models weight the start and end of a prompt more than the middle; saying the rule in both spots is cheap.
  2. THINKING_MODE=suppress in .env. Ollama supports a /no_think prefix on the user message; this turns it on. Gemma4 stopped emitting the scratchpad entirely.

Instruction-tuned models will follow whatever instruction looks most like the task. If your prompt looks like a character brief, the model may complete the brief. State "do not narrate" explicitly, and use the no-think flag when the model supports it.

The guardrail saved us once

After the prompt and /no_think fixes, a test run still leaked β€” "The user wants me to roleplay as Raymond..." came back from the LLM. But the user never saw it. The output guardrail caught the phrase and swapped in the safe fallback.

Belt-and-braces output checks are worth the effort. When the prompt was wrong, the guardrail was still right.

Open-domain tempted us to build a web search

First instinct when we added OPEN_DOMAIN was "great, now we need a web search adapter." But the product isn't a search engine β€” it's an AAC user's voice. If someone asks Mia for the capital of France, the answer is "Paris" in her voice. The LLM already knows basic facts; Mia's persona is the scarce thing. Piping in a Wikipedia snippet would dilute her voice, not enrich it.

So OPEN_DOMAIN just emits a stub chunk that tells the LLM to answer from its own knowledge. Cheap, aligned with the product, one less thing to break.

When you see a retrieval-shaped problem, don't assume a retriever is the right answer.

Caching contextual embeddings was a waste of thought

At one point we worried about retrieve_from_history re-encoding the session window every turn. Measured it: 43ms even with 80 turns of history. The LLM call was taking 1.5–95 seconds. Shaving 30ms off a 20-second turn is 0.1%, invisible.

Measure before you optimize, even when the waste seems obvious.

Monolithic prompts don't cache

Our planner built one giant user message with the character sheet (stable per persona) and the retrieved chunks (different every turn) mashed together. Prompt caches match prefixes exactly β€” one byte change in the retrieval block invalidates the whole prompt, including ~300 tokens of character sheet that hadn't changed.

Split it: system message holds the stable character sheet and answering rules, user message holds the per-turn retrieval and query. Provider caches the system prefix β†’ every turn after the first skips prefill on ~300 tokens.

Structure matters as much as content once you care about latency. Stable stuff goes in the system message, per-turn stuff goes in the user message. Costs nothing in capability, compounds across turns.


Principles we kept circling back to

Measure first. Every good decision here was triggered by a number β€” 114s, 43ms, 30ms. Every bad decision was triggered by a hunch.

Three pools because we have three sources. Not four, not five. A category without a real retriever behind it just confuses the classifier.

Short prompts behave better. Every time we trimmed something, the model was more consistent.

Every path produces at least one chunk. Empty retrieval blocks were the fastest route to a hallucination. CONTEXTUAL with no history, OPEN_DOMAIN with nothing wired up, a classifier returning an empty sub-intent list β€” all have fallbacks now.