Research Summary
Project: SQLEnv
Change: F002 – Answer Verification (multi-type comparison)
Date: 2026-03-27
Status: Draft
1. Change Overview
What We're Changing
Implement verify_answer() in server/verifier.py to replace the naive string comparison in _handle_answer(). The verifier handles four answer types: integer (exact), float (1% relative tolerance), string (case-insensitive, normalized), and list (order-insensitive set comparison).
Why We're Changing It
The current _handle_answer() does submitted.strip().lower() == expected.strip().lower(), which fails on type mismatches (agent says "42", gold is integer 42), float rounding (95000.1 vs 95000), and list ordering (['A','B'] vs ['B','A']).
Success Criteria
- Float comparison with tolerance: 95000.1 matches 95000 (within 1%)
- List comparison ignores order: ['A','B'] matches ['B','A']
- Type coercion works: "42" matches integer 42
- Clear pass/fail with no ambiguity
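The first three criteria can be sketched with the standard library alone; no verifier code exists yet, so these are just the raw comparison primitives the implementation would build on:

```python
import math

# Float: 1% relative tolerance, per the success criteria.
assert math.isclose(95000.1, 95000, rel_tol=0.01)

# Type coercion: the agent's "42" should compare equal to the gold integer 42.
assert int("42") == 42

# List: order-insensitive comparison via sets.
assert set(['A', 'B']) == set(['B', 'A'])
```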
2. System Context
Current Behavior
sql_environment.py:410-419 – _handle_answer() does a naive string comparison:

```python
submitted = value.strip().lower()
expected = (self._episode.gold_answer or "").strip().lower()
is_correct = submitted == expected
```

Returns a binary (is_correct, reward) pair. The gold answer is stored as a formatted string via _format_gold_answer(), which joins rows with | separators.
Architecture Context
```
Agent → ANSWER action → step() → _handle_answer() → verifier.verify_answer()
                                                              ↓
                                                    bool (correct/not)
```
Entry Points
| Entry Point | Trigger | Current Flow |
| --- | --- | --- |
| _handle_answer() | Agent sends ANSWER action | Naive string compare → bool + reward |
| verify_answer() | Called by _handle_answer() | To be created; type-aware comparison |
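Once verify_answer() exists, the call site keeps its current (is_correct, reward) contract and only swaps out the comparison. A minimal sketch, written as a free function with an injected verifier for illustration (the real code is a method on the environment class):

```python
def handle_answer(value, gold_answer, answer_type, verify):
    """Sketch of the revised _handle_answer body: delegate the
    comparison and preserve the binary (is_correct, reward) contract."""
    is_correct = verify(value, gold_answer or "", answer_type)
    reward = 1.0 if is_correct else 0.0
    return is_correct, reward

# With a trivial string-equality verifier, the contract holds:
assert handle_answer("42", "42", "string", lambda p, g, t: p == g) == (True, 1.0)
```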
Data Flow
| Data | Source | Shape/Type | Destination |
| --- | --- | --- | --- |
| predicted | Agent's ANSWER argument | str | verify_answer() |
| gold_answer | EpisodeContext.gold_answer | str (formatted by _format_gold_answer) | verify_answer() |
| answer_type | QuestionRecord.answer_type | str ("integer", "float", "string", "list") | verify_answer() |
Critical note: _format_gold_answer() converts raw SQL rows to a string. For single scalar values, it returns str(rows[0][0]). For multi-row results, it joins with | and newlines. The verifier needs to handle this format or receive raw data.
3. Dependencies
Code We Depend On
| Dependency | What We Use | Risk if Changed |
| --- | --- | --- |
| models.py:QuestionRecord | answer_type field | Need type metadata per question |
| sql_environment.py:_format_gold_answer() | Produces gold answer string | Format determines how verifier parses |
| data/questions/*.json | Question records | Must include answer_type field |
Code That Depends On Us
| Dependent | How They Use Us | Impact of Our Change |
| --- | --- | --- |
| sql_environment.py:_handle_answer() | Will call verify_answer() | Signature: verify_answer(predicted, gold, answer_type) -> bool |
| F003 (Dense Reward) | Layer 3 terminal reward uses correctness | Binary output unchanged |
| F005 (Green Agent) | Evaluation correctness metric | Uses same bool |
4. Risks & Edge Cases
Identified Risks
| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Gold answer format mismatch | Medium | Correct answers rejected | Normalize both sides before comparing |
| Float precision edge cases | Medium | Near-boundary answers wrong | Use relative tolerance (1%), not absolute |
| List parsing from string | Medium | Can't reconstruct list from formatted string | Parse \| and newline separators |
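One float-precision case worth noting: a purely relative tolerance can never match when the gold value is 0, since 1% of 0 is 0. math.isclose supports an absolute-tolerance fallback for exactly this situation (the abs_tol value below is an illustrative choice, not from the source):

```python
import math

# Relative tolerance alone rejects any nonzero answer when gold is 0:
assert not math.isclose(0.0001, 0.0, rel_tol=0.01)

# A small absolute fallback accepts near-zero answers:
assert math.isclose(0.0001, 0.0, rel_tol=0.01, abs_tol=1e-3)
```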
Edge Cases to Handle
| Edge Case | Current Behavior | Required Behavior |
| --- | --- | --- |
| "42" vs integer 42 | String mismatch | Match via type coercion |
| "95000.1" vs 95000 | String mismatch | Match via 1% float tolerance |
| "Engineering" vs "engineering" | Matches (both lowercased) | Continue to match |
| "A, B" vs "B, A" | String mismatch | Match via set comparison |
| None or empty answer | Crashes or false match | Return False |
| Multi-row gold answer | String compare of formatted rows | Parse and compare as list/set |
Invariants to Preserve
4b. Code Shape & Design Target
Existing Vocabulary
| Concept | Existing Name | Location |
| --- | --- | --- |
| Answer types | answer_type: str | models.py:QuestionRecord |
| Gold answer | gold_answer: str | models.py:EpisodeContext |
| Episode context | EpisodeContext dataclass | models.py:135 |
Language/Framework Idioms
- Flat functions, no service classes
- Dataclasses for state, Pydantic for wire types
- Type hints throughout
Target Shape
| Component | Purpose | Why This Boundary |
| --- | --- | --- |
| verify_answer(predicted, gold, answer_type) | Main entry; dispatches by type | Single public function |
| _normalize_value(value) | Strip, lowercase, coerce | Shared across comparers |
| _compare_integer(pred, gold) | Exact match after int coercion | Type-specific |
| _compare_float(pred, gold, tol=0.01) | Relative tolerance comparison | Type-specific |
| _compare_string(pred, gold) | Case-insensitive normalized | Type-specific |
| _compare_list(pred, gold) | Order-insensitive set comparison | Type-specific |
Abstraction Level
- Current level: Flat – plain functions in server modules
- Recommendation: Match flat style. One module with public verify_answer() and private helpers.
Anti-Patterns to Avoid
- Don't create a class hierarchy for answer types; use match/case dispatch
- Don't add table comparison yet (post-MVP per user interview)
- Don't import heavy dependencies (no numpy/scipy)
5. Constraints
Technical Constraints
| Constraint | Requirement | Notes |
| --- | --- | --- |
| No external deps | Pure Python only | No numpy, scipy |
| Performance | < 1ms per call | Called once per episode |
Testing Constraints
| Test Suite | Coverage Area | Notes |
| --- | --- | --- |
| tests/test_smoke.py | 25 passing tests | Some tests exercise ANSWER; may need updates |
6. Open Questions
| Question | Why It Matters | Who Can Answer |
| --- | --- | --- |
| Should verifier receive raw list[tuple] gold_rows in addition to formatted string? | Raw rows enable more accurate list comparison | Design decision; recommend passing answer_type + gold string |
| Default when answer_type is missing/unknown? | Some questions may lack type metadata | Recommend fallback to string comparison |
7. Context Sources
| Source | Type | Notes |
| --- | --- | --- |
| server/verifier.py | Code (stub) | Docstring lists all answer types |
| server/sql_environment.py:410-419 | Code | Current naive _handle_answer() |
| models.py:120-147 | Code | QuestionRecord and EpisodeContext |
| docs_draft/SQLEnv_Concept_v1.md Section 4.2 | Doc | verify_answer() reference implementation |
| docs_draft/reward_design.md | Doc | Answer type comparison strategies |