
Research Summary

Project: SQLEnv
Change: F002 - Answer Verification (multi-type comparison)
Date: 2026-03-27
Status: Draft


1. Change Overview

What We're Changing

Implement verify_answer() in server/verifier.py to replace the naive string comparison in _handle_answer(). The verifier handles 4 answer types: integer (exact), float (1% tolerance), string (case-insensitive normalized), and list (order-insensitive set comparison).

Why We're Changing It

The current _handle_answer() does submitted.strip().lower() == expected.strip().lower(), which fails on type mismatches (agent says "42", gold is integer 42), float rounding (95000.1 vs 95000), and list ordering (['A','B'] vs ['B','A']).

Success Criteria

  • Float comparison with tolerance: 95000.1 matches 95000 (within 1%)
  • List comparison ignores order: ['A','B'] matches ['B','A']
  • Type coercion works: "42" matches integer 42
  • Clear pass/fail with no ambiguity
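
Three of these criteria map directly onto standard-library comparisons; a minimal illustration (values taken from the bullets above):

```python
import math

# Float: 1% relative tolerance, so 95000.1 matches 95000
print(math.isclose(95000.1, 95000, rel_tol=0.01))  # True

# List: order-insensitive comparison via sets
print({"a", "b"} == set(["b", "a"]))               # True

# Integer: coerce the submitted string before comparing
print(int("42") == 42)                             # True
```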

2. System Context

Current Behavior

In sql_environment.py:410-419, _handle_answer() does a naive string comparison:

submitted = value.strip().lower()
expected = (self._episode.gold_answer or "").strip().lower()
is_correct = submitted == expected

The method returns an (is_correct, reward) pair. The gold answer is stored as a formatted string via _format_gold_answer(), which joins rows with | separators.
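
The failure modes above are easy to reproduce; a minimal repro of the naive comparison (naive_compare is a stand-in name for the logic inside _handle_answer()):

```python
def naive_compare(submitted: str, expected: str) -> bool:
    # Current behavior in _handle_answer(): plain normalized string equality
    return submitted.strip().lower() == expected.strip().lower()

print(naive_compare("95000.1", "95000"))  # False: near-correct float rejected
print(naive_compare("A, B", "B, A"))      # False: reordered list rejected
```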

Architecture Context

Agent → ANSWER action → step() → _handle_answer() → verifier.verify_answer()
                                                          ↓
                                                   bool (correct/not)
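
A hedged sketch of this flow, assuming the verify_answer(predicted, gold, answer_type) -> bool signature proposed in section 3; handle_answer and the reward values are simplified stand-ins for the real _handle_answer(), and verify_answer is stubbed:

```python
def verify_answer(predicted: str, gold: str, answer_type: str) -> bool:
    # Stub only; the real verifier dispatches on answer_type (see section 4b)
    return predicted.strip().lower() == gold.strip().lower()

def handle_answer(value: str, gold_answer: str, answer_type: str) -> tuple[bool, float]:
    # Correctness is delegated to the verifier; reward stays binary here
    is_correct = verify_answer(value, gold_answer, answer_type)
    reward = 1.0 if is_correct else 0.0  # placeholder reward values
    return is_correct, reward

print(handle_answer("Engineering", "engineering", "string"))  # (True, 1.0)
```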

Entry Points

| Entry Point | Trigger | Current Flow |
|---|---|---|
| _handle_answer() | Agent sends ANSWER action | Naive string compare → bool + reward |
| verify_answer() | Called by _handle_answer() | To be created: type-aware comparison |

Data Flow

| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| predicted | Agent's ANSWER argument | str | verify_answer() |
| gold_answer | EpisodeContext.gold_answer | str (formatted by _format_gold_answer) | verify_answer() |
| answer_type | QuestionRecord.answer_type | str ("integer", "float", "string", "list") | verify_answer() |

Critical note: _format_gold_answer() converts raw SQL rows to a string. For single scalar values, it returns str(rows[0][0]). For multi-row results, it joins with | and newlines. The verifier needs to handle this format or receive raw data.
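
If the verifier keeps the formatted string, it will need to undo that formatting. A sketch, assuming the delimiters described above (parse_gold_string is a hypothetical helper, not part of the codebase):

```python
def parse_gold_string(gold: str) -> list[list[str]]:
    # Reverse the described _format_gold_answer() output: rows separated by
    # newlines, columns separated by "|". The exact delimiters are an
    # assumption based on the description above.
    return [[cell.strip() for cell in line.split("|")]
            for line in gold.strip().splitlines() if line.strip()]

print(parse_gold_string("Alice | 42\nBob | 17"))
# [['Alice', '42'], ['Bob', '17']]
```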


3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
|---|---|---|
| models.py:QuestionRecord | answer_type field | Need type metadata per question |
| sql_environment.py:_format_gold_answer() | Produces gold answer string | Format determines how verifier parses |
| data/questions/*.json | Question records | Must include answer_type field |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| sql_environment.py:_handle_answer() | Will call verify_answer() | Signature: verify_answer(predicted, gold, answer_type) -> bool |
| F003 (Dense Reward) | Layer 3 terminal reward uses correctness | Binary output unchanged |
| F005 (Green Agent) | Evaluation correctness metric | Uses same bool |

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gold answer format mismatch | Medium | Correct answers rejected | Normalize both sides before comparing |
| Float precision edge cases | Medium | Near-boundary answers wrong | Use relative tolerance (1%), not absolute |
| List parsing from string | Medium | Can't reconstruct list from formatted string | Parse \| and newline separators |

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| "42" vs integer 42 | String mismatch | Match via type coercion |
| "95000.1" vs 95000 | String mismatch | Match via 1% float tolerance |
| "Engineering" vs "engineering" | Matches (both lowercased) | Continue to match |
| "A, B" vs "B, A" | String mismatch | Match via set comparison |
| None or empty answer | Crashes or false match | Return False |
| Multi-row gold answer | String compare of formatted rows | Parse and compare as list/set |
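
For the None/empty and coercion cases, one option is coercion helpers that fail closed by returning None instead of raising; a sketch (coerce_float is an illustrative name, and the comma stripping is an extra assumption):

```python
from typing import Optional

def coerce_float(value: object) -> Optional[float]:
    # Return None instead of raising, so the verifier can fail closed
    # (None/empty/unparsable input -> verdict False, never a crash)
    if value is None:
        return None
    try:
        return float(str(value).strip().replace(",", ""))
    except ValueError:
        return None

print(coerce_float("95,000.1"))  # 95000.1
print(coerce_float(""))          # None -> verifier returns False
print(coerce_float(None))        # None -> verifier returns False
```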

Invariants to Preserve

  • Binary correctness output (bool): no partial credit at this layer
  • ANSWER action still terminates the episode
  • Existing test assertions on reward values remain valid

4b. Code Shape & Design Target

Existing Vocabulary

| Concept | Existing Name | Location |
|---|---|---|
| Answer types | answer_type: str | models.py:QuestionRecord |
| Gold answer | gold_answer: str | models.py:EpisodeContext |
| Episode context | EpisodeContext dataclass | models.py:135 |

Language/Framework Idioms

  • Flat functions, no service classes
  • Dataclasses for state, Pydantic for wire types
  • Type hints throughout

Target Shape

| Component | Purpose | Why This Boundary |
|---|---|---|
| verify_answer(predicted, gold, answer_type) | Main entry: dispatches by type | Single public function |
| _normalize_value(value) | Strip, lowercase, coerce | Shared across comparers |
| _compare_integer(pred, gold) | Exact match after int coercion | Type-specific |
| _compare_float(pred, gold, tol=0.01) | Relative tolerance comparison | Type-specific |
| _compare_string(pred, gold) | Case-insensitive normalized | Type-specific |
| _compare_list(pred, gold) | Order-insensitive set comparison | Type-specific |

Abstraction Level

  • Current level: Flat; plain functions in server modules
  • Recommendation: Match the flat style. One module with public verify_answer() and private helpers.

Anti-Patterns to Avoid

  • Don't create a class hierarchy for answer types; use match/case dispatch
  • Don't add table comparison yet (post-MVP per user interview)
  • Don't import heavy dependencies (no numpy/scipy)

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
|---|---|---|
| No external deps | Pure Python only | No numpy, scipy |
| Performance | < 1 ms per call | Called once per episode |

Testing Constraints

| Test Suite | Coverage Area | Notes |
|---|---|---|
| tests/test_smoke.py | 25 passing tests | Some test ANSWER; may need update |

6. Open Questions

| Question | Why It Matters | Who Can Answer |
|---|---|---|
| Should verifier receive raw list[tuple] gold_rows in addition to formatted string? | Raw rows enable more accurate list comparison | Design decision; recommend passing answer_type + gold string |
| Default when answer_type is missing/unknown? | Some questions may lack type metadata | Recommend fallback to string comparison |

7. Context Sources

| Source | Type | Notes |
|---|---|---|
| server/verifier.py | Code (stub) | Docstring lists all answer types |
| server/sql_environment.py:410-419 | Code | Current naive _handle_answer() |
| models.py:120-147 | Code | QuestionRecord and EpisodeContext |
| docs_draft/SQLEnv_Concept_v1.md Section 4.2 | Doc | verify_answer() reference implementation |
| docs_draft/reward_design.md | Doc | Answer type comparison strategies |