| # Question-aware briefing framing |
|
|
| Diagnosis + recommendation for WS3 of the 2026-05-06 overnight pass. |
|
|
| The four-section briefing structure (Status / Empirical / Modeled / |
| Policy) is non-negotiable β it's what the four Mellea grounding checks |
| score, and rewriting it risks the 4/4 pass rate. What we want to change |
| is the **opening sentence of the Status section**, so it engages the |
| question shape the user actually asked. Today every briefing leads |
| with a generic "this address is exposed to flood risk" no matter |
| whether the user asked "should I worry?" (resident), "is disclosure |
| required?" (attorney), or "where should we prioritize hardening?" |
| (planner). |
|
|
| ## Where the system_prompt is set today |
| |
| | Call site | Path | `EXTRA_SYSTEM_PROMPT` source | |
| |-----------|------|------------------------------| |
| | `app/fsm.py:983 step_reconcile` | single_address (strict) | `app/reconcile.py:53` | |
| | `app/intents/neighborhood.py:377` | neighborhood (strict) | local @ `app/intents/neighborhood.py:35` | |
| | `app/intents/development_check.py:218` | development_check (strict) | local @ `app/intents/development_check.py:32` | |
| | `app/intents/live_now.py:212` | live_now (non-strict) | local @ `app/intents/live_now.py:38` | |
| | `app/reconcile.py:1089 reconcile()` | legacy non-strict | `app/reconcile.py:53` | |
|
|
| All four strict paths funnel into `mellea_validator.reconcile_strict_streaming(doc_msgs, system_prompt, ...)`. The system_prompt is currently a constant per call site. |
| |
| ## Three options Adam outlined |
| |
| ### (a) Planner sub-classifier |
| Add a fifth `question_type` field to the planner's JSON schema. Granite |
| 4.1:3b classifies it alongside `intent`. Capstone reads it and conditions |
| the opening. |
|
|
| - β
Reuses an LLM that already understands the query |
| - β Re-validates the planner contract β the `_validate()` parser, the |
| schema doc, the fallback logic, and `scripts/probe_addresses.py` |
| all need to grow a new field |
| - β Costs another planner call iteration to converge if the model |
| mis-emits the new field |
| - β The planner is the warm-cache path the demo lives or dies on β |
| changing its output schema five days before pitch is high-risk |
| |
| ### (b) Capstone prompt-conditional |
| Detect `question_type` from the raw query string with a deterministic |
| regex-based heuristic, augment the system_prompt with a per-type |
| "opening directive," pass through to `reconcile_strict_streaming`. No |
| planner change. |
| |
| - β
Lowest blast radius β only touches the Capstone call sites |
| - β
Deterministic, testable, zero added latency (no LLM call) |
| - β
Easy to roll back β remove the `augment_system_prompt(...)` call |
| - β
The four Mellea grounding checks stay byte-identical |
| - β οΈ Question-shape detection is heuristic, not learned. Edge cases |
| (weird phrasings, code-switching) will fall back to a generic |
| directive. Acceptable for the demo personas β they're known up |
| front. |
| |
| ### (c) Both |
| Planner emits a hint, Capstone uses it as a tiebreaker over the |
| heuristic. |
| |
| - Same risks as (a). Pre-demo, the marginal accuracy isn't worth the |
| schema change. |
| |
| ## Recommendation: **option (b)** |
| |
| Implementation lives in a new module `app/framing.py`: |
| |
| - `detect(query, intent) -> question_type` β regex-based detector that |
| returns one of 11 question types (the same eleven as the suite's |
| framing rubric). |
| - `opening_instruction(question_type) -> str | None` β returns the |
| directive sentence to inject, or None for `generic_exposure` (the |
| default β current behavior unchanged). |
| - `augment_system_prompt(base, query, intent) -> str` β wraps the base |
| prompt with a `QUESTION-AWARE OPENING` block. |
|
|
| Wiring: |
|
|
| 1. `app/fsm.py` β add `set_query(q)` / `_current_query()` threadlocals |
| alongside the existing `set_strict_mode`. `step_reconcile()` reads |
| the query + intent to augment the system prompt before calling |
| `reconcile_strict_streaming`. |
| 2. `app/intents/single_address.py:run()` β call `set_query(query)` |
| before `iter_steps`, reset in `finally` (matches the existing |
| threadlocal pattern). |
| 3. `app/intents/neighborhood.py:run()` β augment the local |
| `EXTRA_SYSTEM_PROMPT` directly before passing to |
| `reconcile_strict_streaming`. |
| 4. `app/intents/development_check.py:run()` β same as neighborhood. |
| 5. `app/intents/live_now.py:run()` β same; non-strict path so it just |
| prepends to the system message content. |
| 6. `app/reconcile.py:reconcile()` (legacy) β out of scope; it's not on |
| the demo path and the strict path covers all current intents. |
|
|
| ## Stop conditions |
|
|
| Per Adam's instruction: if the framing rubric scores below 3 on more |
| than five queries after the change lands, document what option (a) / |
| (c) would require and stop. **Do not silently expand scope.** |
|
|
| The "below 3 on more than five" test is the trigger to move to |
| heavier interventions β typically that the regex detector misclassified |
| the question or the Granite model is ignoring the directive under the |
| existing system prompt's strong four-section discipline. |
|
|
| --- |
|
|
| ## Outcome of the 2026-05-06 framed run |
|
|
| `tests/integration/results/2026-05-06/FRAMING-DELTA.md` is the full |
| report. Headline: |
|
|
| - Mean framing **2.25 β 2.80** (+0.55). |
| - Queries reaching 5/5: **0 β 3** β q01 resident habitability |
| ("Yes, this address is exposed..."), q02 attorney disclosure |
| ("Disclosure is warranted..."), q13 grant evidence |
| ("Vulnerability assessment: ..."). |
| - Queries reaching β₯ 4/5: **2 β 5**. |
| - Mellea grounding: 4 queries improved (3/4 β 4/4); 2 regressed |
| (q01 4/4 β 3/4, q06 3/4 β 2/4); 14 unchanged. Net +2. |
|
|
| **Stop condition fired.** 12 / 20 framed queries scored below 3. |
| Triage of the 12: |
|
|
| 1. **Rubric-vs-directive vocabulary mismatch (4 queries).** q03, q08, |
| q10, q12 are bare neighborhood names that the suite labels |
| `capital_planning`. The detector returns `journalism` (the |
| bare-neighborhood fallback). Both are valid persona framings; the |
| journalism directive *is* applied (the openings change), but the |
| capital-planning rubric scores against verdict words like |
| "prioritize" / "merits prioritization" that the journalism |
| directive doesn't request. **Not a framing failure β a |
| measurement asymmetry.** |
| 2. **Short-prose floor (4 queries).** q07, q14, q15, q19 returned |
| β€ 200 chars of prose because the geocoder failed (q07, q14, q18 β |
| long conversational queries) or the planner / NTA resolver |
| short-circuited (q15 ranking query, q19 BBMCR project name). |
| Documented in `OVERNIGHT-2026-05-06-OUT-OF-SCOPE.md`. No framing |
| change can salvage these β they need geocoder + intent-router |
| work first. |
| 3. **Granite ignored the directive (4 queries).** q04 (bare address, |
| underwriting label), q05 (bare borough, journalism label), q11 |
| (PS 188 ambiguous), q17 (compare intent), q20 (Astoria control). |
| In each case the framing prompt was injected but the opening |
| stayed generic. Granite 4.1's existing four-section discipline |
| appears to overpower a soft "QUESTION-AWARE OPENING" directive |
| for some question types; the verdict-style types (Yes/No, |
| Disclosure, Vulnerability assessment) succeed because they have |
| explicit token shapes the model can latch onto. |
|
|
| ## What option (a) would require |
|
|
| Adam's instruction: if the stop condition fires, document option (a) |
| or (c) and stop β do not silently expand scope. **NOT IMPLEMENTED.** |
| Sketch: |
|
|
| 1. **Planner schema gains a `question_type` field.** Add to |
| `app/planner.py:PLAN_SCHEMA_DESC`, `Plan` dataclass, and |
| `_validate()` so the model emits an 11-value enum alongside |
| `intent`. |
| 2. **Few-shot the planner on question_type.** Add 6-10 worked |
| examples to `SYSTEM_PROMPT` (one per persona from RESEARCH.md) |
| so granite4.1:3b reliably emits the right enum value. The |
| planner is already running with `format=json` constrained |
| decoding, so this is a pure prompt-engineering change. |
| 3. **Capstone consumes the planner's question_type instead of the |
| detector's.** `app.framing.augment_system_prompt` already takes |
| `intent`; add a third `question_type` parameter that overrides |
| `detect()` when present. Capstone callers (fsm.step_reconcile, |
| the three intents) read it from `plan.question_type` and pass |
| through. |
| 4. **Fall back to the regex detector when the planner emits an |
| unknown / missing value.** Belt-and-suspenders against planner |
| regression. |
| 5. **Re-validate** with the same 20-query suite. If mean framing |
| moves from 2.80 β β₯ 3.5 (target: β₯ half the queries scoring 4+), |
| option (a) was the right call. If not, the issue is downstream |
| (Granite ignoring the directive); option (c) won't help. |
|
|
| **Cost estimate.** ~2-3 hr of work, plus re-validation against the |
| address probe + the 20-query suite. The risk is the planner |
| regressing on intent classification when prompted to also emit a |
| new field β Granite 4.1:3b at temperature 0 with constrained |
| decoding is robust but not infallible. Validate against the full |
| address probe before merging. |
|
|
| ## What option (c) would add |
|
|
| Layer (a) on top of (b). When the planner emits a question_type that |
| matches the detector's, both agree β use the directive. When they |
| disagree β log the disagreement (telemetry), use the planner's. |
| Marginal value over (a) alone is small; defer unless (a) shows |
| misclassification on the 20-query suite. |
| |