File size: 9,315 Bytes
b9a10ad
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
# Question-aware briefing framing

Diagnosis + recommendation for WS3 of the 2026-05-06 overnight pass.

The four-section briefing structure (Status / Empirical / Modeled /
Policy) is non-negotiable β€” it's what the four Mellea grounding checks
score, and rewriting it risks the 4/4 pass rate. What we want to change
is the **opening sentence of the Status section**, so it engages the
question shape the user actually asked. Today every briefing leads
with a generic "this address is exposed to flood risk" no matter
whether the user asked "should I worry?" (resident), "is disclosure
required?" (attorney), or "where should we prioritize hardening?"
(planner).

## Where the system_prompt is set today

| Call site | Path | `EXTRA_SYSTEM_PROMPT` source |
|-----------|------|------------------------------|
| `app/fsm.py:983 step_reconcile` | single_address (strict) | `app/reconcile.py:53` |
| `app/intents/neighborhood.py:377` | neighborhood (strict) | local @ `app/intents/neighborhood.py:35` |
| `app/intents/development_check.py:218` | development_check (strict) | local @ `app/intents/development_check.py:32` |
| `app/intents/live_now.py:212` | live_now (non-strict) | local @ `app/intents/live_now.py:38` |
| `app/reconcile.py:1089 reconcile()` | legacy non-strict | `app/reconcile.py:53` |

All four strict paths funnel into `mellea_validator.reconcile_strict_streaming(doc_msgs, system_prompt, ...)`. The system_prompt is currently a constant per call site.

## Three options Adam outlined

### (a) Planner sub-classifier
Add a fifth `question_type` field to the planner's JSON schema. Granite
4.1:3b classifies it alongside `intent`. Capstone reads it and conditions
the opening.

- βœ… Reuses an LLM that already understands the query
- ❌ Re-validates the planner contract β€” the `_validate()` parser, the
      schema doc, the fallback logic, and `scripts/probe_addresses.py`
      all need to grow a new field
- ❌ Costs another planner call iteration to converge if the model
      mis-emits the new field
- ❌ The planner is the warm-cache path the demo lives or dies on β€”
      changing its output schema five days before pitch is high-risk

### (b) Capstone prompt-conditional
Detect `question_type` from the raw query string with a deterministic
regex-based heuristic, augment the system_prompt with a per-type
"opening directive," pass through to `reconcile_strict_streaming`. No
planner change.

- βœ… Lowest blast radius β€” only touches the Capstone call sites
- βœ… Deterministic, testable, zero added latency (no LLM call)
- βœ… Easy to roll back β€” remove the `augment_system_prompt(...)` call
- βœ… The four Mellea grounding checks stay byte-identical
- ⚠️ Question-shape detection is heuristic, not learned. Edge cases
     (weird phrasings, code-switching) will fall back to a generic
     directive. Acceptable for the demo personas β€” they're known up
     front.

### (c) Both
Planner emits a hint, Capstone uses it as a tiebreaker over the
heuristic.

- Same risks as (a). Pre-demo, the marginal accuracy isn't worth the
  schema change.

## Recommendation: **option (b)**

Implementation lives in a new module `app/framing.py`:

- `detect(query, intent) -> question_type` β€” regex-based detector that
  returns one of 11 question types (the same eleven as the suite's
  framing rubric).
- `opening_instruction(question_type) -> str | None` β€” returns the
  directive sentence to inject, or None for `generic_exposure` (the
  default β€” current behavior unchanged).
- `augment_system_prompt(base, query, intent) -> str` β€” wraps the base
  prompt with a `QUESTION-AWARE OPENING` block.

Wiring:

1. `app/fsm.py` β€” add `set_query(q)` / `_current_query()` threadlocals
   alongside the existing `set_strict_mode`. `step_reconcile()` reads
   the query + intent to augment the system prompt before calling
   `reconcile_strict_streaming`.
2. `app/intents/single_address.py:run()` β€” call `set_query(query)`
   before `iter_steps`, reset in `finally` (matches the existing
   threadlocal pattern).
3. `app/intents/neighborhood.py:run()` β€” augment the local
   `EXTRA_SYSTEM_PROMPT` directly before passing to
   `reconcile_strict_streaming`.
4. `app/intents/development_check.py:run()` β€” same as neighborhood.
5. `app/intents/live_now.py:run()` β€” same; non-strict path so it just
   prepends to the system message content.
6. `app/reconcile.py:reconcile()` (legacy) β€” out of scope; it's not on
   the demo path and the strict path covers all current intents.

## Stop conditions

Per Adam's instruction: if the framing rubric scores below 3 on more
than five queries after the change lands, document what option (a) /
(c) would require and stop. **Do not silently expand scope.**

The "below 3 on more than five" test is the trigger to move to
heavier interventions β€” typically that the regex detector misclassified
the question or the Granite model is ignoring the directive under the
existing system prompt's strong four-section discipline.

---

## Outcome of the 2026-05-06 framed run

`tests/integration/results/2026-05-06/FRAMING-DELTA.md` is the full
report. Headline:

- Mean framing **2.25 β†’ 2.80** (+0.55).
- Queries reaching 5/5: **0 β†’ 3** β€” q01 resident habitability
  ("Yes, this address is exposed..."), q02 attorney disclosure
  ("Disclosure is warranted..."), q13 grant evidence
  ("Vulnerability assessment: ...").
- Queries reaching β‰₯ 4/5: **2 β†’ 5**.
- Mellea grounding: 4 queries improved (3/4 β†’ 4/4); 2 regressed
  (q01 4/4 β†’ 3/4, q06 3/4 β†’ 2/4); 14 unchanged. Net +2.

**Stop condition fired.** 12 / 20 framed queries scored below 3.
Triage of the 12:

1. **Rubric-vs-directive vocabulary mismatch (4 queries).** q03, q08,
   q10, q12 are bare neighborhood names that the suite labels
   `capital_planning`. The detector returns `journalism` (the
   bare-neighborhood fallback). Both are valid persona framings; the
   journalism directive *is* applied (the openings change), but the
   capital-planning rubric scores against verdict words like
   "prioritize" / "merits prioritization" that the journalism
   directive doesn't request. **Not a framing failure β€” a
   measurement asymmetry.**
2. **Short-prose floor (4 queries).** q07, q14, q15, q19 returned
   ≀ 200 chars of prose because the geocoder failed (q07, q14, q18 β€”
   long conversational queries) or the planner / NTA resolver
   short-circuited (q15 ranking query, q19 BBMCR project name).
   Documented in `OVERNIGHT-2026-05-06-OUT-OF-SCOPE.md`. No framing
   change can salvage these β€” they need geocoder + intent-router
   work first.
3. **Granite ignored the directive (4 queries).** q04 (bare address,
   underwriting label), q05 (bare borough, journalism label), q11
   (PS 188 ambiguous), q17 (compare intent), q20 (Astoria control).
   In each case the framing prompt was injected but the opening
   stayed generic. Granite 4.1's existing four-section discipline
   appears to overpower a soft "QUESTION-AWARE OPENING" directive
   for some question types; the verdict-style types (Yes/No,
   Disclosure, Vulnerability assessment) succeed because they have
   explicit token shapes the model can latch onto.

## What option (a) would require

Adam's instruction: if the stop condition fires, document option (a)
or (c) and stop β€” do not silently expand scope. **NOT IMPLEMENTED.**
Sketch:

1. **Planner schema gains a `question_type` field.** Add to
   `app/planner.py:PLAN_SCHEMA_DESC`, `Plan` dataclass, and
   `_validate()` so the model emits an 11-value enum alongside
   `intent`.
2. **Few-shot the planner on question_type.** Add 6-10 worked
   examples to `SYSTEM_PROMPT` (one per persona from RESEARCH.md)
   so granite4.1:3b reliably emits the right enum value. The
   planner is already running with `format=json` constrained
   decoding, so this is a pure prompt-engineering change.
3. **Capstone consumes the planner's question_type instead of the
   detector's.** `app.framing.augment_system_prompt` already takes
   `intent`; add a third `question_type` parameter that overrides
   `detect()` when present. Capstone callers (fsm.step_reconcile,
   the three intents) read it from `plan.question_type` and pass
   through.
4. **Fall back to the regex detector when the planner emits an
   unknown / missing value.** Belt-and-suspenders against planner
   regression.
5. **Re-validate** with the same 20-query suite. If mean framing
   moves from 2.80 β†’ β‰₯ 3.5 (target: β‰₯ half the queries scoring 4+),
   option (a) was the right call. If not, the issue is downstream
   (Granite ignoring the directive); option (c) won't help.

**Cost estimate.** ~2-3 hr of work, plus re-validation against the
address probe + the 20-query suite. The risk is the planner
regressing on intent classification when prompted to also emit a
new field β€” Granite 4.1:3b at temperature 0 with constrained
decoding is robust but not infallible. Validate against the full
address probe before merging.

## What option (c) would add

Layer (a) on top of (b). When the planner emits a question_type that
matches the detector's, both agree β†’ use the directive. When they
disagree β†’ log the disagreement (telemetry), use the planner's.
Marginal value over (a) alone is small; defer unless (a) shows
misclassification on the 20-query suite.