File size: 9,315 Bytes
b9a10ad | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 | # Question-aware briefing framing
Diagnosis + recommendation for WS3 of the 2026-05-06 overnight pass.
The four-section briefing structure (Status / Empirical / Modeled /
Policy) is non-negotiable β it's what the four Mellea grounding checks
score, and rewriting it risks the 4/4 pass rate. What we want to change
is the **opening sentence of the Status section**, so it engages the
question shape the user actually asked. Today every briefing leads
with a generic "this address is exposed to flood risk" no matter
whether the user asked "should I worry?" (resident), "is disclosure
required?" (attorney), or "where should we prioritize hardening?"
(planner).
## Where the system_prompt is set today
| Call site | Path | `EXTRA_SYSTEM_PROMPT` source |
|-----------|------|------------------------------|
| `app/fsm.py:983 step_reconcile` | single_address (strict) | `app/reconcile.py:53` |
| `app/intents/neighborhood.py:377` | neighborhood (strict) | local @ `app/intents/neighborhood.py:35` |
| `app/intents/development_check.py:218` | development_check (strict) | local @ `app/intents/development_check.py:32` |
| `app/intents/live_now.py:212` | live_now (non-strict) | local @ `app/intents/live_now.py:38` |
| `app/reconcile.py:1089 reconcile()` | legacy non-strict | `app/reconcile.py:53` |
All four strict paths funnel into `mellea_validator.reconcile_strict_streaming(doc_msgs, system_prompt, ...)`. The system_prompt is currently a constant per call site.
## Three options Adam outlined
### (a) Planner sub-classifier
Add a fifth `question_type` field to the planner's JSON schema. Granite
4.1:3b classifies it alongside `intent`. Capstone reads it and conditions
the opening.
- β
Reuses an LLM that already understands the query
- β Re-validates the planner contract β the `_validate()` parser, the
schema doc, the fallback logic, and `scripts/probe_addresses.py`
all need to grow a new field
- β Costs another planner call iteration to converge if the model
mis-emits the new field
- β The planner is the warm-cache path the demo lives or dies on β
changing its output schema five days before pitch is high-risk
### (b) Capstone prompt-conditional
Detect `question_type` from the raw query string with a deterministic
regex-based heuristic, augment the system_prompt with a per-type
"opening directive," pass through to `reconcile_strict_streaming`. No
planner change.
- β
Lowest blast radius β only touches the Capstone call sites
- β
Deterministic, testable, zero added latency (no LLM call)
- β
Easy to roll back β remove the `augment_system_prompt(...)` call
- β
The four Mellea grounding checks stay byte-identical
- β οΈ Question-shape detection is heuristic, not learned. Edge cases
(weird phrasings, code-switching) will fall back to a generic
directive. Acceptable for the demo personas β they're known up
front.
### (c) Both
Planner emits a hint, Capstone uses it as a tiebreaker over the
heuristic.
- Same risks as (a). Pre-demo, the marginal accuracy isn't worth the
schema change.
## Recommendation: **option (b)**
Implementation lives in a new module `app/framing.py`:
- `detect(query, intent) -> question_type` β regex-based detector that
returns one of 11 question types (the same eleven as the suite's
framing rubric).
- `opening_instruction(question_type) -> str | None` β returns the
directive sentence to inject, or None for `generic_exposure` (the
default β current behavior unchanged).
- `augment_system_prompt(base, query, intent) -> str` β wraps the base
prompt with a `QUESTION-AWARE OPENING` block.
Wiring:
1. `app/fsm.py` β add `set_query(q)` / `_current_query()` threadlocals
alongside the existing `set_strict_mode`. `step_reconcile()` reads
the query + intent to augment the system prompt before calling
`reconcile_strict_streaming`.
2. `app/intents/single_address.py:run()` β call `set_query(query)`
before `iter_steps`, reset in `finally` (matches the existing
threadlocal pattern).
3. `app/intents/neighborhood.py:run()` β augment the local
`EXTRA_SYSTEM_PROMPT` directly before passing to
`reconcile_strict_streaming`.
4. `app/intents/development_check.py:run()` β same as neighborhood.
5. `app/intents/live_now.py:run()` β same; non-strict path so it just
prepends to the system message content.
6. `app/reconcile.py:reconcile()` (legacy) β out of scope; it's not on
the demo path and the strict path covers all current intents.
## Stop conditions
Per Adam's instruction: if the framing rubric scores below 3 on more
than five queries after the change lands, document what option (a) /
(c) would require and stop. **Do not silently expand scope.**
The "below 3 on more than five" test is the trigger to move to
heavier interventions β typically that the regex detector misclassified
the question or the Granite model is ignoring the directive under the
existing system prompt's strong four-section discipline.
---
## Outcome of the 2026-05-06 framed run
`tests/integration/results/2026-05-06/FRAMING-DELTA.md` is the full
report. Headline:
- Mean framing **2.25 β 2.80** (+0.55).
- Queries reaching 5/5: **0 β 3** β q01 resident habitability
("Yes, this address is exposed..."), q02 attorney disclosure
("Disclosure is warranted..."), q13 grant evidence
("Vulnerability assessment: ...").
- Queries reaching β₯ 4/5: **2 β 5**.
- Mellea grounding: 4 queries improved (3/4 β 4/4); 2 regressed
(q01 4/4 β 3/4, q06 3/4 β 2/4); 14 unchanged. Net +2.
**Stop condition fired.** 12 / 20 framed queries scored below 3.
Triage of the 12:
1. **Rubric-vs-directive vocabulary mismatch (4 queries).** q03, q08,
q10, q12 are bare neighborhood names that the suite labels
`capital_planning`. The detector returns `journalism` (the
bare-neighborhood fallback). Both are valid persona framings; the
journalism directive *is* applied (the openings change), but the
capital-planning rubric scores against verdict words like
"prioritize" / "merits prioritization" that the journalism
directive doesn't request. **Not a framing failure β a
measurement asymmetry.**
2. **Short-prose floor (4 queries).** q07, q14, q15, q19 returned
β€ 200 chars of prose because the geocoder failed (q07, q14, q18 β
long conversational queries) or the planner / NTA resolver
short-circuited (q15 ranking query, q19 BBMCR project name).
Documented in `OVERNIGHT-2026-05-06-OUT-OF-SCOPE.md`. No framing
change can salvage these β they need geocoder + intent-router
work first.
3. **Granite ignored the directive (4 queries).** q04 (bare address,
underwriting label), q05 (bare borough, journalism label), q11
(PS 188 ambiguous), q17 (compare intent), q20 (Astoria control).
In each case the framing prompt was injected but the opening
stayed generic. Granite 4.1's existing four-section discipline
appears to overpower a soft "QUESTION-AWARE OPENING" directive
for some question types; the verdict-style types (Yes/No,
Disclosure, Vulnerability assessment) succeed because they have
explicit token shapes the model can latch onto.
## What option (a) would require
Adam's instruction: if the stop condition fires, document option (a)
or (c) and stop β do not silently expand scope. **NOT IMPLEMENTED.**
Sketch:
1. **Planner schema gains a `question_type` field.** Add to
`app/planner.py:PLAN_SCHEMA_DESC`, `Plan` dataclass, and
`_validate()` so the model emits an 11-value enum alongside
`intent`.
2. **Few-shot the planner on question_type.** Add 6-10 worked
examples to `SYSTEM_PROMPT` (one per persona from RESEARCH.md)
so granite4.1:3b reliably emits the right enum value. The
planner is already running with `format=json` constrained
decoding, so this is a pure prompt-engineering change.
3. **Capstone consumes the planner's question_type instead of the
detector's.** `app.framing.augment_system_prompt` already takes
`intent`; add a third `question_type` parameter that overrides
`detect()` when present. Capstone callers (fsm.step_reconcile,
the three intents) read it from `plan.question_type` and pass
through.
4. **Fall back to the regex detector when the planner emits an
unknown / missing value.** Belt-and-suspenders against planner
regression.
5. **Re-validate** with the same 20-query suite. If mean framing
moves from 2.80 β β₯ 3.5 (target: β₯ half the queries scoring 4+),
option (a) was the right call. If not, the issue is downstream
(Granite ignoring the directive); option (c) won't help.
**Cost estimate.** ~2-3 hr of work, plus re-validation against the
address probe + the 20-query suite. The risk is the planner
regressing on intent classification when prompted to also emit a
new field β Granite 4.1:3b at temperature 0 with constrained
decoding is robust but not infallible. Validate against the full
address probe before merging.
## What option (c) would add
Layer (a) on top of (b). When the planner emits a question_type that
matches the detector's, both agree β use the directive. When they
disagree β log the disagreement (telemetry), use the planner's.
Marginal value over (a) alone is small; defer unless (a) shows
misclassification on the 20-query suite.
|