# Research Summary

**Project:** SQLEnv
**Change:** F002 — Answer Verification (multi-type comparison)
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

Implement `verify_answer()` in `server/verifier.py` to replace the naive string comparison in `_handle_answer()`. The verifier handles four answer types: integer (exact), float (1% tolerance), string (case-insensitive normalized), and list (order-insensitive set comparison).

### Why We're Changing It

The current `_handle_answer()` does `submitted.strip().lower() == expected.strip().lower()`, which fails on type mismatches (the agent says `"42"`, the gold answer is integer `42`), float rounding (`95000.1` vs `95000`), and list ordering (`['A','B']` vs `['B','A']`).

### Success Criteria

- Float comparison with tolerance: `95000.1` matches `95000` (within 1%)
- List comparison ignores order: `['A','B']` matches `['B','A']`
- Type coercion works: `"42"` matches integer `42`
- Clear pass/fail with no ambiguity

---

## 2. System Context

### Current Behavior

`sql_environment.py:410-419` — `_handle_answer()` does a naive string comparison:

```python
submitted = value.strip().lower()
expected = (self._episode.gold_answer or "").strip().lower()
is_correct = submitted == expected
```

It returns a binary `(is_correct, reward)`. The gold answer is stored as a formatted string via `_format_gold_answer()`, which joins rows with ` | ` separators.
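The failure modes above can be reproduced with a standalone copy of the current logic (a sketch; `naive_compare` is a hypothetical stand-in for the inline comparison in `_handle_answer()`):

```python
def naive_compare(submitted: str, expected: str) -> bool:
    # Mirrors the current _handle_answer() logic: lowercased string equality.
    return submitted.strip().lower() == expected.strip().lower()

# Case-insensitive string answers happen to work:
assert naive_compare("Engineering", "engineering")

# But float rounding and list ordering both fail, even though F002 should
# accept them:
assert not naive_compare("95000.1", "95000")  # within 1% tolerance, rejected
assert not naive_compare("A | B", "B | A")    # same set, different order, rejected
```

This is why the fix has to be type-aware rather than a smarter string normalizer: the float and list cases cannot be rescued by any amount of strip/lowercase.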
### Architecture Context

```
Agent → ANSWER action → step() → _handle_answer() → verifier.verify_answer()
                                                              ↓
                                                       bool (correct/not)
```

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `_handle_answer()` | Agent sends ANSWER action | Naive string compare → bool + reward |
| `verify_answer()` | Called by `_handle_answer()` | **To be created** — type-aware comparison |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| `predicted` | Agent's ANSWER argument | `str` | `verify_answer()` |
| `gold_answer` | `EpisodeContext.gold_answer` | `str` (formatted by `_format_gold_answer`) | `verify_answer()` |
| `answer_type` | `QuestionRecord.answer_type` | `str` ("integer", "float", "string", "list") | `verify_answer()` |

**Critical note:** `_format_gold_answer()` converts raw SQL rows to a string. For single scalar values, it returns `str(rows[0][0])`. For multi-row results, it joins with ` | ` and newlines. The verifier needs to handle this format or receive raw data.

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:QuestionRecord` | `answer_type` field | Need type metadata per question |
| `sql_environment.py:_format_gold_answer()` | Produces gold answer string | Format determines how verifier parses |
| `data/questions/*.json` | Question records | Must include answer_type field |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|----------------------|
| `sql_environment.py:_handle_answer()` | Will call `verify_answer()` | Signature: `verify_answer(predicted, gold, answer_type) -> bool` |
| F003 (Dense Reward) | Layer 3 terminal reward uses correctness | Binary output unchanged |
| F005 (Green Agent) | Evaluation correctness metric | Uses same bool |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Gold answer format mismatch | Medium | Correct answers rejected | Normalize both sides before comparing |
| Float precision edge cases | Medium | Near-boundary answers wrong | Use relative tolerance (1%), not absolute |
| List parsing from string | Medium | Can't reconstruct list from formatted string | Parse ` \| ` and newline separators |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| `"42"` vs integer `42` | String mismatch | Match via type coercion |
| `"95000.1"` vs `95000` | String mismatch | Match via 1% float tolerance |
| `"Engineering"` vs `"engineering"` | Matches (both lowercased) | Continue to match |
| `"A, B"` vs `"B, A"` | String mismatch | Match via set comparison |
| `None` or empty answer | Crashes or false match | Return False |
| Multi-row gold answer | String compare of formatted rows | Parse and compare as list/set |

### Invariants to Preserve

- [ ] Binary correctness output (bool) — no partial credit at this layer
- [ ] ANSWER action still terminates the episode
- [ ] Existing test assertions on reward values remain valid

---

## 4b. Code Shape & Design Target

### Existing Vocabulary

| Concept | Existing Name | Location |
|---------|---------------|----------|
| Answer types | `answer_type: str` | `models.py:QuestionRecord` |
| Gold answer | `gold_answer: str` | `models.py:EpisodeContext` |
| Episode context | `EpisodeContext` dataclass | `models.py:135` |

### Language/Framework Idioms

- Flat functions, no service classes
- Dataclasses for state, Pydantic for wire types
- Type hints throughout

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `verify_answer(predicted, gold, answer_type)` | Main entry — dispatches by type | Single public function |
| `_normalize_value(value)` | Strip, lowercase, coerce | Shared across comparers |
| `_compare_integer(pred, gold)` | Exact match after int coercion | Type-specific |
| `_compare_float(pred, gold, tol=0.01)` | Relative tolerance comparison | Type-specific |
| `_compare_string(pred, gold)` | Case-insensitive normalized | Type-specific |
| `_compare_list(pred, gold)` | Order-insensitive set comparison | Type-specific |

### Abstraction Level

- **Current level:** Flat — plain functions in server modules
- **Recommendation:** Match the flat style. One module with a public `verify_answer()` and private helpers.

### Anti-Patterns to Avoid

- Don't create a class hierarchy for answer types — use match/case dispatch
- Don't add table comparison yet (post-MVP per user interview)
- Don't import heavy dependencies (no numpy/scipy)

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| No external deps | Pure Python only | No numpy, scipy |
| Performance | < 1ms per call | Called once per episode |

### Testing Constraints

| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | 25 passing tests | Some test ANSWER — may need update |

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Should the verifier receive raw `list[tuple]` gold_rows in addition to the formatted string? | Raw rows enable more accurate list comparison | Design decision — recommend passing answer_type + gold string |
| Default when answer_type is missing/unknown? | Some questions may lack type metadata | Recommend fallback to string comparison |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `server/verifier.py` | Code (stub) | Docstring lists all answer types |
| `server/sql_environment.py:410-419` | Code | Current naive `_handle_answer()` |
| `models.py:120-147` | Code | QuestionRecord and EpisodeContext |
| `docs_draft/SQLEnv_Concept_v1.md` Section 4.2 | Doc | `verify_answer()` reference implementation |
| `docs_draft/reward_design.md` | Doc | Answer type comparison strategies |