# Research Summary

**Project:** SQLEnv
**Change:** F002 — Answer Verification (multi-type comparison)
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

Implement `verify_answer()` in `server/verifier.py` to replace the naive string comparison in `_handle_answer()`. The verifier handles four answer types: integer (exact), float (1% tolerance), string (case-insensitive normalized), and list (order-insensitive set comparison).

### Why We're Changing It

The current `_handle_answer()` does `submitted.strip().lower() == expected.strip().lower()`, which fails on type mismatches (the agent says `"42"`, the gold answer is integer `42`), float rounding (`95000.1` vs `95000`), and list ordering (`['A','B']` vs `['B','A']`).

### Success Criteria

- Float comparison with tolerance: `95000.1` matches `95000` (within 1%)
- List comparison ignores order: `['A','B']` matches `['B','A']`
- Type coercion works: `"42"` matches integer `42`
- Clear pass/fail with no ambiguity

---

## 2. System Context

### Current Behavior

`sql_environment.py:410-419` — `_handle_answer()` does a naive string comparison:

```python
submitted = value.strip().lower()
expected = (self._episode.gold_answer or "").strip().lower()
is_correct = submitted == expected
```

It returns a binary `(is_correct, reward)`. The gold answer is stored as a formatted string via `_format_gold_answer()`, which joins rows with ` | ` separators.
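The failure modes above can be reproduced with a standalone copy of the current logic (a sketch; `naive_compare` is a hypothetical stand-in for the inline comparison in `_handle_answer()`):

```python
def naive_compare(submitted: str, expected: str) -> bool:
    # Mirrors the current _handle_answer() logic: lowercased string equality.
    return submitted.strip().lower() == expected.strip().lower()

# Case-insensitive string answers happen to work:
assert naive_compare("Engineering", "engineering")

# But float rounding and list ordering both fail, even though F002 should
# accept them:
assert not naive_compare("95000.1", "95000")  # within 1% tolerance, rejected
assert not naive_compare("A | B", "B | A")    # same set, different order, rejected
```

This is why the fix has to be type-aware rather than a smarter string normalizer: the float and list cases cannot be rescued by any amount of strip/lowercase.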
### Architecture Context

```
Agent → ANSWER action → step() → _handle_answer() → verifier.verify_answer()
                                                              ↓
                                                       bool (correct/not)
```

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `_handle_answer()` | Agent sends ANSWER action | Naive string compare → bool + reward |
| `verify_answer()` | Called by `_handle_answer()` | **To be created** — type-aware comparison |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| `predicted` | Agent's ANSWER argument | `str` | `verify_answer()` |
| `gold_answer` | `EpisodeContext.gold_answer` | `str` (formatted by `_format_gold_answer`) | `verify_answer()` |
| `answer_type` | `QuestionRecord.answer_type` | `str` ("integer", "float", "string", "list") | `verify_answer()` |

**Critical note:** `_format_gold_answer()` converts raw SQL rows to a string. For single scalar values, it returns `str(rows[0][0])`. For multi-row results, it joins with ` | ` and newlines. The verifier needs to handle this format or receive raw data.

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:QuestionRecord` | `answer_type` field | Need type metadata per question |
| `sql_environment.py:_format_gold_answer()` | Produces gold answer string | Format determines how verifier parses |
| `data/questions/*.json` | Question records | Must include answer_type field |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|----------------------|
| `sql_environment.py:_handle_answer()` | Will call `verify_answer()` | Signature: `verify_answer(predicted, gold, answer_type) -> bool` |
| F003 (Dense Reward) | Layer 3 terminal reward uses correctness | Binary output unchanged |
| F005 (Green Agent) | Evaluation correctness metric | Uses same bool |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Gold answer format mismatch | Medium | Correct answers rejected | Normalize both sides before comparing |
| Float precision edge cases | Medium | Near-boundary answers wrong | Use relative tolerance (1%), not absolute |
| List parsing from string | Medium | Can't reconstruct list from formatted string | Parse ` \| ` and newline separators |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| `"42"` vs integer `42` | String mismatch | Match via type coercion |
| `"95000.1"` vs `95000` | String mismatch | Match via 1% float tolerance |
| `"Engineering"` vs `"engineering"` | Matches (both lowercased) | Continue to match |
| `"A, B"` vs `"B, A"` | String mismatch | Match via set comparison |
| `None` or empty answer | Crashes or false match | Return False |
| Multi-row gold answer | String compare of formatted rows | Parse and compare as list/set |

### Invariants to Preserve

- [ ] Binary correctness output (bool) — no partial credit at this layer
- [ ] ANSWER action still terminates the episode
- [ ] Existing test assertions on reward values remain valid

---

## 4b. Code Shape & Design Target

### Existing Vocabulary

| Concept | Existing Name | Location |
|---------|---------------|----------|
| Answer types | `answer_type: str` | `models.py:QuestionRecord` |
| Gold answer | `gold_answer: str` | `models.py:EpisodeContext` |
| Episode context | `EpisodeContext` dataclass | `models.py:135` |

### Language/Framework Idioms

- Flat functions, no service classes
- Dataclasses for state, Pydantic for wire types
- Type hints throughout

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `verify_answer(predicted, gold, answer_type)` | Main entry — dispatches by type | Single public function |
| `_normalize_value(value)` | Strip, lowercase, coerce | Shared across comparers |
| `_compare_integer(pred, gold)` | Exact match after int coercion | Type-specific |
| `_compare_float(pred, gold, tol=0.01)` | Relative tolerance comparison | Type-specific |
| `_compare_string(pred, gold)` | Case-insensitive normalized | Type-specific |
| `_compare_list(pred, gold)` | Order-insensitive set comparison | Type-specific |

### Abstraction Level

- **Current level:** Flat — plain functions in server modules
- **Recommendation:** Match the flat style. One module with a public `verify_answer()` and private helpers.

### Anti-Patterns to Avoid

- Don't create a class hierarchy for answer types — use match/case dispatch
- Don't add table comparison yet (post-MVP per user interview)
- Don't import heavy dependencies (no numpy/scipy)

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| No external deps | Pure Python only | No numpy, scipy |
| Performance | < 1ms per call | Called once per episode |

### Testing Constraints

| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | 25 passing tests | Some test ANSWER — may need update |

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Should the verifier receive raw `list[tuple]` gold_rows in addition to the formatted string? | Raw rows enable more accurate list comparison | Design decision — recommend passing answer_type + gold string |
| Default when answer_type is missing/unknown? | Some questions may lack type metadata | Recommend fallback to string comparison |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `server/verifier.py` | Code (stub) | Docstring lists all answer types |
| `server/sql_environment.py:410-419` | Code | Current naive `_handle_answer()` |
| `models.py:120-147` | Code | QuestionRecord and EpisodeContext |
| `docs_draft/SQLEnv_Concept_v1.md` Section 4.2 | Doc | `verify_answer()` reference implementation |
| `docs_draft/reward_design.md` | Doc | Answer type comparison strategies |