Research Summary
Project: SQLEnv
Change: F002 – Answer Verification (multi-type comparison)
Date: 2026-03-27
Status: Draft
1. Change Overview
What We're Changing
Implement verify_answer() in server/verifier.py to replace the naive string comparison in _handle_answer(). The verifier handles four answer types: integer (exact), float (1% relative tolerance), string (case-insensitive, normalized), and list (order-insensitive set comparison).
Why We're Changing It
The current _handle_answer() does submitted.strip().lower() == expected.strip().lower(), which fails on type mismatches (agent says "42", gold is integer 42), float rounding (95000.1 vs 95000), and list ordering (['A','B'] vs ['B','A']).
Success Criteria
- Float comparison with tolerance: 95000.1 matches 95000 (within 1%)
- List comparison ignores order: ['A','B'] matches ['B','A']
- Type coercion works: "42" matches integer 42
- Clear pass/fail with no ambiguity
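The first three criteria can be sketched with the standard library alone; no verifier code exists yet, so these are just the raw comparison primitives the implementation would build on:

```python
import math

# Float: 1% relative tolerance, per the success criteria.
assert math.isclose(95000.1, 95000, rel_tol=0.01)

# Type coercion: the agent's "42" should compare equal to the gold integer 42.
assert int("42") == 42

# List: order-insensitive comparison via sets.
assert set(['A', 'B']) == set(['B', 'A'])
```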
2. System Context
Current Behavior
sql_environment.py:410-419 – _handle_answer() does a naive string comparison:

```python
submitted = value.strip().lower()
expected = (self._episode.gold_answer or "").strip().lower()
is_correct = submitted == expected
```

Returns a binary (is_correct, reward) pair. The gold answer is stored as a formatted string via _format_gold_answer(), which joins rows with | separators.
Architecture Context
```
Agent → ANSWER action → step() → _handle_answer() → verifier.verify_answer()
                                                              ↓
                                                    bool (correct/not)
```
Entry Points
| Entry Point | Trigger | Current Flow |
| --- | --- | --- |
| _handle_answer() | Agent sends ANSWER action | Naive string compare → bool + reward |
| verify_answer() | Called by _handle_answer() | To be created; type-aware comparison |
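Once verify_answer() exists, the call site keeps its current (is_correct, reward) contract and only swaps out the comparison. A minimal sketch, written as a free function with an injected verifier for illustration (the real code is a method on the environment class):

```python
def handle_answer(value, gold_answer, answer_type, verify):
    """Sketch of the revised _handle_answer body: delegate the
    comparison and preserve the binary (is_correct, reward) contract."""
    is_correct = verify(value, gold_answer or "", answer_type)
    reward = 1.0 if is_correct else 0.0
    return is_correct, reward

# With a trivial string-equality verifier, the contract holds:
assert handle_answer("42", "42", "string", lambda p, g, t: p == g) == (True, 1.0)
```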
Data Flow
| Data | Source | Shape/Type | Destination |
| --- | --- | --- | --- |
| predicted | Agent's ANSWER argument | str | verify_answer() |
| gold_answer | EpisodeContext.gold_answer | str (formatted by _format_gold_answer) | verify_answer() |
| answer_type | QuestionRecord.answer_type | str ("integer", "float", "string", "list") | verify_answer() |
Critical note: _format_gold_answer() converts raw SQL rows to a string. For single scalar values, it returns str(rows[0][0]). For multi-row results, it joins with | and newlines. The verifier needs to handle this format or receive raw data.
3. Dependencies
Code We Depend On
| Dependency | What We Use | Risk if Changed |
| --- | --- | --- |
| models.py:QuestionRecord | answer_type field | Need type metadata per question |
| sql_environment.py:_format_gold_answer() | Produces gold answer string | Format determines how verifier parses |
| data/questions/*.json | Question records | Must include answer_type field |
Code That Depends On Us
| Dependent | How They Use Us | Impact of Our Change |
| --- | --- | --- |
| sql_environment.py:_handle_answer() | Will call verify_answer() | Signature: verify_answer(predicted, gold, answer_type) -> bool |
| F003 (Dense Reward) | Layer 3 terminal reward uses correctness | Binary output unchanged |
| F005 (Green Agent) | Evaluation correctness metric | Uses same bool |
4. Risks & Edge Cases
Identified Risks
| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Gold answer format mismatch | Medium | Correct answers rejected | Normalize both sides before comparing |
| Float precision edge cases | Medium | Near-boundary answers wrong | Use relative tolerance (1%), not absolute |
| List parsing from string | Medium | Can't reconstruct list from formatted string | Parse \| and newline separators |
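One float-precision case worth noting: a purely relative tolerance can never match when the gold value is 0, since 1% of 0 is 0. math.isclose supports an absolute-tolerance fallback for exactly this situation (the abs_tol value below is an illustrative choice, not from the source):

```python
import math

# Relative tolerance alone rejects any nonzero answer when gold is 0:
assert not math.isclose(0.0001, 0.0, rel_tol=0.01)

# A small absolute fallback accepts near-zero answers:
assert math.isclose(0.0001, 0.0, rel_tol=0.01, abs_tol=1e-3)
```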
Edge Cases to Handle
| Edge Case | Current Behavior | Required Behavior |
| --- | --- | --- |
| "42" vs integer 42 | String mismatch | Match via type coercion |
| "95000.1" vs 95000 | String mismatch | Match via 1% float tolerance |
| "Engineering" vs "engineering" | Matches (both lowercased) | Continue to match |
| "A, B" vs "B, A" | String mismatch | Match via set comparison |
| None or empty answer | Crashes or false match | Return False |
| Multi-row gold answer | String compare of formatted rows | Parse and compare as list/set |
Invariants to Preserve
4b. Code Shape & Design Target
Existing Vocabulary
| Concept | Existing Name | Location |
| --- | --- | --- |
| Answer types | answer_type: str | models.py:QuestionRecord |
| Gold answer | gold_answer: str | models.py:EpisodeContext |
| Episode context | EpisodeContext dataclass | models.py:135 |
Language/Framework Idioms
- Flat functions, no service classes
- Dataclasses for state, Pydantic for wire types
- Type hints throughout
Target Shape
| Component | Purpose | Why This Boundary |
| --- | --- | --- |
| verify_answer(predicted, gold, answer_type) | Main entry; dispatches by type | Single public function |
| _normalize_value(value) | Strip, lowercase, coerce | Shared across comparers |
| _compare_integer(pred, gold) | Exact match after int coercion | Type-specific |
| _compare_float(pred, gold, tol=0.01) | Relative tolerance comparison | Type-specific |
| _compare_string(pred, gold) | Case-insensitive normalized | Type-specific |
| _compare_list(pred, gold) | Order-insensitive set comparison | Type-specific |
Abstraction Level
- Current level: Flat – plain functions in server modules
- Recommendation: Match flat style. One module with public verify_answer() and private helpers.
Anti-Patterns to Avoid
- Don't create a class hierarchy for answer types; use match/case dispatch
- Don't add table comparison yet (post-MVP per user interview)
- Don't import heavy dependencies (no numpy/scipy)
5. Constraints
Technical Constraints
| Constraint | Requirement | Notes |
| --- | --- | --- |
| No external deps | Pure Python only | No numpy, scipy |
| Performance | < 1ms per call | Called once per episode |
Testing Constraints
| Test Suite | Coverage Area | Notes |
| --- | --- | --- |
| tests/test_smoke.py | 25 passing tests | Some tests exercise ANSWER; may need updates |
6. Open Questions
| Question | Why It Matters | Who Can Answer |
| --- | --- | --- |
| Should verifier receive raw list[tuple] gold_rows in addition to formatted string? | Raw rows enable more accurate list comparison | Design decision; recommend passing answer_type + gold string |
| Default when answer_type is missing/unknown? | Some questions may lack type metadata | Recommend fallback to string comparison |
7. Context Sources
| Source | Type | Notes |
| --- | --- | --- |
| server/verifier.py | Code (stub) | Docstring lists all answer types |
| server/sql_environment.py:410-419 | Code | Current naive _handle_answer() |
| models.py:120-147 | Code | QuestionRecord and EpisodeContext |
| docs_draft/SQLEnv_Concept_v1.md Section 4.2 | Doc | verify_answer() reference implementation |
| docs_draft/reward_design.md | Doc | Answer type comparison strategies |