riprap-nyc / docs /QUESTION-AWARE-FRAMING.md
seriffic's picture
deploy: sync all changes from main at 6904684
b9a10ad

Question-aware briefing framing

Diagnosis + recommendation for WS3 of the 2026-05-06 overnight pass.

The four-section briefing structure (Status / Empirical / Modeled / Policy) is non-negotiable β€” it's what the four Mellea grounding checks score, and rewriting it risks the 4/4 pass rate. What we want to change is the opening sentence of the Status section, so it engages the question shape the user actually asked. Today every briefing leads with a generic "this address is exposed to flood risk" no matter whether the user asked "should I worry?" (resident), "is disclosure required?" (attorney), or "where should we prioritize hardening?" (planner).

Where the system_prompt is set today

Call site Path EXTRA_SYSTEM_PROMPT source
app/fsm.py:983 step_reconcile single_address (strict) app/reconcile.py:53
app/intents/neighborhood.py:377 neighborhood (strict) local @ app/intents/neighborhood.py:35
app/intents/development_check.py:218 development_check (strict) local @ app/intents/development_check.py:32
app/intents/live_now.py:212 live_now (non-strict) local @ app/intents/live_now.py:38
app/reconcile.py:1089 reconcile() legacy non-strict app/reconcile.py:53

All four strict paths funnel into mellea_validator.reconcile_strict_streaming(doc_msgs, system_prompt, ...). The system_prompt is currently a constant per call site.

Three options Adam outlined

(a) Planner sub-classifier

Add a fifth question_type field to the planner's JSON schema. Granite 4.1:3b classifies it alongside intent. Capstone reads it and conditions the opening.

  • βœ… Reuses an LLM that already understands the query
  • ❌ Re-validates the planner contract β€” the _validate() parser, the schema doc, the fallback logic, and scripts/probe_addresses.py all need to grow a new field
  • ❌ Costs another planner call iteration to converge if the model mis-emits the new field
  • ❌ The planner is the warm-cache path the demo lives or dies on β€” changing its output schema five days before pitch is high-risk

(b) Capstone prompt-conditional

Detect question_type from the raw query string with a deterministic regex-based heuristic, augment the system_prompt with a per-type "opening directive," pass through to reconcile_strict_streaming. No planner change.

  • βœ… Lowest blast radius β€” only touches the Capstone call sites
  • βœ… Deterministic, testable, zero added latency (no LLM call)
  • βœ… Easy to roll back β€” remove the augment_system_prompt(...) call
  • βœ… The four Mellea grounding checks stay byte-identical
  • ⚠️ Question-shape detection is heuristic, not learned. Edge cases (weird phrasings, code-switching) will fall back to a generic directive. Acceptable for the demo personas β€” they're known up front.

(c) Both

Planner emits a hint, Capstone uses it as a tiebreaker over the heuristic.

  • Same risks as (a). Pre-demo, the marginal accuracy isn't worth the schema change.

Recommendation: option (b)

Implementation lives in a new module app/framing.py:

  • detect(query, intent) -> question_type β€” regex-based detector that returns one of 11 question types (the same eleven as the suite's framing rubric).
  • opening_instruction(question_type) -> str | None β€” returns the directive sentence to inject, or None for generic_exposure (the default β€” current behavior unchanged).
  • augment_system_prompt(base, query, intent) -> str β€” wraps the base prompt with a QUESTION-AWARE OPENING block.

Wiring:

  1. app/fsm.py β€” add set_query(q) / _current_query() threadlocals alongside the existing set_strict_mode. step_reconcile() reads the query + intent to augment the system prompt before calling reconcile_strict_streaming.
  2. app/intents/single_address.py:run() β€” call set_query(query) before iter_steps, reset in finally (matches the existing threadlocal pattern).
  3. app/intents/neighborhood.py:run() β€” augment the local EXTRA_SYSTEM_PROMPT directly before passing to reconcile_strict_streaming.
  4. app/intents/development_check.py:run() β€” same as neighborhood.
  5. app/intents/live_now.py:run() β€” same; non-strict path so it just prepends to the system message content.
  6. app/reconcile.py:reconcile() (legacy) β€” out of scope; it's not on the demo path and the strict path covers all current intents.

Stop conditions

Per Adam's instruction: if the framing rubric scores below 3 on more than five queries after the change lands, document what option (a) / (c) would require and stop. Do not silently expand scope.

The "below 3 on more than five" test is the trigger to move to heavier interventions β€” typically that the regex detector misclassified the question or the Granite model is ignoring the directive under the existing system prompt's strong four-section discipline.


Outcome of the 2026-05-06 framed run

tests/integration/results/2026-05-06/FRAMING-DELTA.md is the full report. Headline:

  • Mean framing 2.25 β†’ 2.80 (+0.55).
  • Queries reaching 5/5: 0 β†’ 3 β€” q01 resident habitability ("Yes, this address is exposed..."), q02 attorney disclosure ("Disclosure is warranted..."), q13 grant evidence ("Vulnerability assessment: ...").
  • Queries reaching β‰₯ 4/5: 2 β†’ 5.
  • Mellea grounding: 4 queries improved (3/4 β†’ 4/4); 2 regressed (q01 4/4 β†’ 3/4, q06 3/4 β†’ 2/4); 14 unchanged. Net +2.

Stop condition fired. 12 / 20 framed queries scored below 3. Triage of the 12:

  1. Rubric-vs-directive vocabulary mismatch (4 queries). q03, q08, q10, q12 are bare neighborhood names that the suite labels capital_planning. The detector returns journalism (the bare-neighborhood fallback). Both are valid persona framings; the journalism directive is applied (the openings change), but the capital-planning rubric scores against verdict words like "prioritize" / "merits prioritization" that the journalism directive doesn't request. Not a framing failure β€” a measurement asymmetry.
  2. Short-prose floor (4 queries). q07, q14, q15, q19 returned ≀ 200 chars of prose because the geocoder failed (q07, q14, q18 β€” long conversational queries) or the planner / NTA resolver short-circuited (q15 ranking query, q19 BBMCR project name). Documented in OVERNIGHT-2026-05-06-OUT-OF-SCOPE.md. No framing change can salvage these β€” they need geocoder + intent-router work first.
  3. Granite ignored the directive (4 queries). q04 (bare address, underwriting label), q05 (bare borough, journalism label), q11 (PS 188 ambiguous), q17 (compare intent), q20 (Astoria control). In each case the framing prompt was injected but the opening stayed generic. Granite 4.1's existing four-section discipline appears to overpower a soft "QUESTION-AWARE OPENING" directive for some question types; the verdict-style types (Yes/No, Disclosure, Vulnerability assessment) succeed because they have explicit token shapes the model can latch onto.

What option (a) would require

Adam's instruction: if the stop condition fires, document option (a) or (c) and stop β€” do not silently expand scope. NOT IMPLEMENTED. Sketch:

  1. Planner schema gains a question_type field. Add to app/planner.py:PLAN_SCHEMA_DESC, Plan dataclass, and _validate() so the model emits an 11-value enum alongside intent.
  2. Few-shot the planner on question_type. Add 6-10 worked examples to SYSTEM_PROMPT (one per persona from RESEARCH.md) so granite4.1:3b reliably emits the right enum value. The planner is already running with format=json constrained decoding, so this is a pure prompt-engineering change.
  3. Capstone consumes the planner's question_type instead of the detector's. app.framing.augment_system_prompt already takes intent; add a third question_type parameter that overrides detect() when present. Capstone callers (fsm.step_reconcile, the three intents) read it from plan.question_type and pass through.
  4. Fall back to the regex detector when the planner emits an unknown / missing value. Belt-and-suspenders against planner regression.
  5. Re-validate with the same 20-query suite. If mean framing moves from 2.80 β†’ β‰₯ 3.5 (target: β‰₯ half the queries scoring 4+), option (a) was the right call. If not, the issue is downstream (Granite ignoring the directive); option (c) won't help.

Cost estimate. ~2-3 hr of work, plus re-validation against the address probe + the 20-query suite. The risk is the planner regressing on intent classification when prompted to also emit a new field β€” Granite 4.1:3b at temperature 0 with constrained decoding is robust but not infallible. Validate against the full address probe before merging.

What option (c) would add

Layer (a) on top of (b). When the planner emits a question_type that matches the detector's, both agree β†’ use the directive. When they disagree β†’ log the disagreement (telemetry), use the planner's. Marginal value over (a) alone is small; defer unless (a) shows misclassification on the 20-query suite.