Question-aware briefing framing
Diagnosis + recommendation for WS3 of the 2026-05-06 overnight pass.
The four-section briefing structure (Status / Empirical / Modeled / Policy) is non-negotiable β it's what the four Mellea grounding checks score, and rewriting it risks the 4/4 pass rate. What we want to change is the opening sentence of the Status section, so it engages the question shape the user actually asked. Today every briefing leads with a generic "this address is exposed to flood risk" no matter whether the user asked "should I worry?" (resident), "is disclosure required?" (attorney), or "where should we prioritize hardening?" (planner).
Where the system_prompt is set today
| Call site | Path | EXTRA_SYSTEM_PROMPT source |
|---|---|---|
app/fsm.py:983 step_reconcile |
single_address (strict) | app/reconcile.py:53 |
app/intents/neighborhood.py:377 |
neighborhood (strict) | local @ app/intents/neighborhood.py:35 |
app/intents/development_check.py:218 |
development_check (strict) | local @ app/intents/development_check.py:32 |
app/intents/live_now.py:212 |
live_now (non-strict) | local @ app/intents/live_now.py:38 |
app/reconcile.py:1089 reconcile() |
legacy non-strict | app/reconcile.py:53 |
All four strict paths funnel into mellea_validator.reconcile_strict_streaming(doc_msgs, system_prompt, ...). The system_prompt is currently a constant per call site.
Three options Adam outlined
(a) Planner sub-classifier
Add a fifth question_type field to the planner's JSON schema. Granite
4.1:3b classifies it alongside intent. Capstone reads it and conditions
the opening.
- β Reuses an LLM that already understands the query
- β Re-validates the planner contract β the
_validate()parser, the schema doc, the fallback logic, andscripts/probe_addresses.pyall need to grow a new field - β Costs another planner call iteration to converge if the model mis-emits the new field
- β The planner is the warm-cache path the demo lives or dies on β changing its output schema five days before pitch is high-risk
(b) Capstone prompt-conditional
Detect question_type from the raw query string with a deterministic
regex-based heuristic, augment the system_prompt with a per-type
"opening directive," pass through to reconcile_strict_streaming. No
planner change.
- β Lowest blast radius β only touches the Capstone call sites
- β Deterministic, testable, zero added latency (no LLM call)
- β
Easy to roll back β remove the
augment_system_prompt(...)call - β The four Mellea grounding checks stay byte-identical
- β οΈ Question-shape detection is heuristic, not learned. Edge cases (weird phrasings, code-switching) will fall back to a generic directive. Acceptable for the demo personas β they're known up front.
(c) Both
Planner emits a hint, Capstone uses it as a tiebreaker over the heuristic.
- Same risks as (a). Pre-demo, the marginal accuracy isn't worth the schema change.
Recommendation: option (b)
Implementation lives in a new module app/framing.py:
detect(query, intent) -> question_typeβ regex-based detector that returns one of 11 question types (the same eleven as the suite's framing rubric).opening_instruction(question_type) -> str | Noneβ returns the directive sentence to inject, or None forgeneric_exposure(the default β current behavior unchanged).augment_system_prompt(base, query, intent) -> strβ wraps the base prompt with aQUESTION-AWARE OPENINGblock.
Wiring:
app/fsm.pyβ addset_query(q)/_current_query()threadlocals alongside the existingset_strict_mode.step_reconcile()reads the query + intent to augment the system prompt before callingreconcile_strict_streaming.app/intents/single_address.py:run()β callset_query(query)beforeiter_steps, reset infinally(matches the existing threadlocal pattern).app/intents/neighborhood.py:run()β augment the localEXTRA_SYSTEM_PROMPTdirectly before passing toreconcile_strict_streaming.app/intents/development_check.py:run()β same as neighborhood.app/intents/live_now.py:run()β same; non-strict path so it just prepends to the system message content.app/reconcile.py:reconcile()(legacy) β out of scope; it's not on the demo path and the strict path covers all current intents.
Stop conditions
Per Adam's instruction: if the framing rubric scores below 3 on more than five queries after the change lands, document what option (a) / (c) would require and stop. Do not silently expand scope.
The "below 3 on more than five" test is the trigger to move to heavier interventions β typically that the regex detector misclassified the question or the Granite model is ignoring the directive under the existing system prompt's strong four-section discipline.
Outcome of the 2026-05-06 framed run
tests/integration/results/2026-05-06/FRAMING-DELTA.md is the full
report. Headline:
- Mean framing 2.25 β 2.80 (+0.55).
- Queries reaching 5/5: 0 β 3 β q01 resident habitability ("Yes, this address is exposed..."), q02 attorney disclosure ("Disclosure is warranted..."), q13 grant evidence ("Vulnerability assessment: ...").
- Queries reaching β₯ 4/5: 2 β 5.
- Mellea grounding: 4 queries improved (3/4 β 4/4); 2 regressed (q01 4/4 β 3/4, q06 3/4 β 2/4); 14 unchanged. Net +2.
Stop condition fired. 12 / 20 framed queries scored below 3. Triage of the 12:
- Rubric-vs-directive vocabulary mismatch (4 queries). q03, q08,
q10, q12 are bare neighborhood names that the suite labels
capital_planning. The detector returnsjournalism(the bare-neighborhood fallback). Both are valid persona framings; the journalism directive is applied (the openings change), but the capital-planning rubric scores against verdict words like "prioritize" / "merits prioritization" that the journalism directive doesn't request. Not a framing failure β a measurement asymmetry. - Short-prose floor (4 queries). q07, q14, q15, q19 returned
β€ 200 chars of prose because the geocoder failed (q07, q14, q18 β
long conversational queries) or the planner / NTA resolver
short-circuited (q15 ranking query, q19 BBMCR project name).
Documented in
OVERNIGHT-2026-05-06-OUT-OF-SCOPE.md. No framing change can salvage these β they need geocoder + intent-router work first. - Granite ignored the directive (4 queries). q04 (bare address, underwriting label), q05 (bare borough, journalism label), q11 (PS 188 ambiguous), q17 (compare intent), q20 (Astoria control). In each case the framing prompt was injected but the opening stayed generic. Granite 4.1's existing four-section discipline appears to overpower a soft "QUESTION-AWARE OPENING" directive for some question types; the verdict-style types (Yes/No, Disclosure, Vulnerability assessment) succeed because they have explicit token shapes the model can latch onto.
What option (a) would require
Adam's instruction: if the stop condition fires, document option (a) or (c) and stop β do not silently expand scope. NOT IMPLEMENTED. Sketch:
- Planner schema gains a
question_typefield. Add toapp/planner.py:PLAN_SCHEMA_DESC,Plandataclass, and_validate()so the model emits an 11-value enum alongsideintent. - Few-shot the planner on question_type. Add 6-10 worked
examples to
SYSTEM_PROMPT(one per persona from RESEARCH.md) so granite4.1:3b reliably emits the right enum value. The planner is already running withformat=jsonconstrained decoding, so this is a pure prompt-engineering change. - Capstone consumes the planner's question_type instead of the
detector's.
app.framing.augment_system_promptalready takesintent; add a thirdquestion_typeparameter that overridesdetect()when present. Capstone callers (fsm.step_reconcile, the three intents) read it fromplan.question_typeand pass through. - Fall back to the regex detector when the planner emits an unknown / missing value. Belt-and-suspenders against planner regression.
- Re-validate with the same 20-query suite. If mean framing moves from 2.80 β β₯ 3.5 (target: β₯ half the queries scoring 4+), option (a) was the right call. If not, the issue is downstream (Granite ignoring the directive); option (c) won't help.
Cost estimate. ~2-3 hr of work, plus re-validation against the address probe + the 20-query suite. The risk is the planner regressing on intent classification when prompted to also emit a new field β Granite 4.1:3b at temperature 0 with constrained decoding is robust but not infallible. Validate against the full address probe before merging.
What option (c) would add
Layer (a) on top of (b). When the planner emits a question_type that matches the detector's, both agree β use the directive. When they disagree β log the disagreement (telemetry), use the planner's. Marginal value over (a) alone is small; defer unless (a) shows misclassification on the 20-query suite.