hjerpe committed on
Commit 9e64e71 · verified · 1 Parent(s): d9759a5

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. .gitattributes +1 -0
  2. AGENTS.md +2 -59
  3. DEMO.md +286 -0
  4. DEMO_action_feature.md +427 -0
  5. Dockerfile.test +28 -0
  6. ONBOARDING_action_feature.md +341 -0
  7. README.md +34 -5
  8. REVIEW_REPORT.md +21 -34
  9. client.py +2 -16
  10. configs/colab_l4.json +15 -0
  11. configs/test_cpu.json +17 -0
  12. data/databases/car_1/car_1.sqlite +0 -0
  13. data/databases/concert_singer/concert_singer.sqlite +0 -0
  14. data/databases/cre_Doc_Template_Mgt/cre_Doc_Template_Mgt.sqlite +0 -0
  15. data/databases/dog_kennels/dog_kennels.sqlite +0 -0
  16. data/databases/employee_hire_evaluation/employee_hire_evaluation.sqlite +0 -0
  17. data/databases/flight_2/flight_2.sqlite +0 -0
  18. data/databases/pets_1/pets_1.sqlite +0 -0
  19. data/databases/poker_player/poker_player.sqlite +0 -0
  20. data/databases/student_assessment/student_assessment.sqlite +0 -0
  21. data/databases/world_1/world_1.sqlite +3 -0
  22. data/sft/sft_trajectories.json +0 -0
  23. docs/.DS_Store +0 -0
  24. docs/ARCHITECTURE.md +336 -224
  25. docs/DOCS_CONTRACT.json +74 -0
  26. docs/DOCS_TAXONOMY.md +62 -0
  27. docs/FEATURE_SLICING.md +50 -0
  28. docs/QUALITY_SCORE.md +13 -0
  29. docs/README.md +49 -0
  30. docs/SKILLS_HANDBOOK.generated.md +413 -0
  31. docs/blog-material.md +428 -0
  32. docs/blog-outline.md +95 -37
  33. docs/blog-post-v1-preview.html +403 -0
  34. docs/blog-post-v1.md +269 -0
  35. docs/blog-post.md +118 -0
  36. docs/competition-deliverables.md +129 -0
  37. docs/data-sources.md +182 -0
  38. docs/delivery-specs/index.md +66 -0
  39. docs/design-docs/core-beliefs.md +61 -0
  40. docs/design-docs/index.md +1 -1
  41. docs/design-docs/reward-shaping-research.md +197 -0
  42. docs/discovery/.gitkeep +0 -0
  43. docs/discovery/index.md +63 -0
  44. docs/exec-plans/README.md +41 -0
  45. docs/exec-plans/active/.gitkeep +0 -0
  46. docs/exec-plans/completed/.gitkeep +0 -0
  47. docs/exec-plans/tech-debt-tracker.md +37 -0
  48. docs/exploration/README.md +44 -0
  49. docs/exploration/f007-prelaunch-checklist.md +455 -0
  50. docs/exploration/grpo-collapse-analysis.md +119 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ data/databases/world_1/world_1.sqlite filter=lfs diff=lfs merge=lfs -text
AGENTS.md CHANGED
@@ -82,64 +82,7 @@ sql-env/ # project root = environment package
82
 
83
  <!-- GUIDELINES-BEGIN -->
84
 
85
- ## Delivery Safety (Move Fast Without Breaking Things)
86
-
87
- Move fast by taking the smallest responsible step that produces real feedback, while pre-committing to guardrails so being wrong is survivable.
88
-
89
- - **Small batches:** Prefer vertical slices and small PRs; reduce blast radius and review/debug time.
90
- - **Define "broken" first:** Before shipping, write down what you will watch (errors, latency, correctness, cost) and the abort threshold.
91
- - **Design for reversibility:** Make changes easy to turn off, roll back, or ignore.
92
-
93
- ## System Boundaries (Avoid Analysis Paralysis)
94
-
95
- Systems are continuous webs; plans require artificial boundaries.
96
-
97
- - **Boundary rule:** Include only variables/components that could change the decision you are making.
98
- - **Clouds:** Treat everything else as exogenous inputs; track them as risks/assumptions.
99
- - **Timebox mapping:** If the landscape is moving faster than you can model it, run a probe (spike, canary, A/B) instead.
100
-
101
- ## Maturity Modes
102
-
103
- Match guardrails to maturity:
104
-
105
- - **Exploratory:** Learning > durability. Prefer spikes; avoid irreversible state changes; manual verification is OK; expect throwaway code.
106
- - **MVP:** Ship a thin end-to-end slice. Manual checks are OK, but you still need a fast rollback path and bounded impact.
107
- - **Production:** Build to last. Automated tests, observability, progressive rollout, and explicit rollback/incident posture.
108
-
109
- Expect limiting factors to move as you ship: fix the current bottleneck, then re-diagnose the next.
110
-
111
- ## Progressive Delivery
112
-
113
- - **Feature flags:** Use flags to make risky changes reversible. Categorize flags (release/experiment/ops/permissioning).
114
- - **Flags are inventory:** Every flag needs an owner, an expiry, and a removal plan.
115
- - **Canary/ramp when risk is non-trivial:** Start small, watch signals, ramp gradually; prefer "flip off" over redeploy.
116
-
117
- ## Reliability Control Loop (If You Run Production)
118
-
119
- - **SLO + error budget:** If you are within budget, keep shipping; if you burn budget, freeze non-critical changes and pay down reliability.
120
-
121
- ## Avoid
122
-
123
- - Big-bang releases, long-lived branches, unowned flags, flaky tests, and alert noise.
124
-
125
- ## Python Guidelines
126
-
127
- - Prefer type hints for public APIs; use `typing` / `collections.abc`.
128
- - Use NumPy-style docstrings; keep them synced with type hints.
129
- - Error handling: Use specific exceptions; avoid `try: ... except Exception: pass`.
130
- - Dependencies: Use `uv add <package>`; do not manually edit `pyproject.toml`.
131
-
132
- ## Docs Expectations
133
-
134
- - Keep durable design/ops knowledge in `docs/` (architecture, runbook, decisions). Keep AGENTS.md as a short map, not an encyclopedia.
135
-
136
- ## Testing Standards
137
-
138
- - **Always use the project's package manager** to run tests. Never invoke test runners directly.
139
- - Python (uv): `uv run pytest tests/ -v` (NEVER bare `pytest`)
140
- - Python (poetry): `poetry run pytest tests/ -v`
141
- - Node: `npm test` or `npm run test`
142
- - Rust: `cargo test`
143
- - **Rationale:** Bare `pytest` bypasses the virtualenv and may use the wrong Python/dependencies. Package managers ensure the correct environment. Bare invocations also trigger unnecessary permission prompts in automated workflows.
144
 
145
  <!-- GUIDELINES-END -->
 
82
 
83
  <!-- GUIDELINES-BEGIN -->
84
 
85
+ <!-- Managed by: opencode-ctx guidelines apply --packs python,testing,delivery-safety -->
86
+ <!-- Run the command above to populate this section -->
87
 
88
  <!-- GUIDELINES-END -->
DEMO.md ADDED
@@ -0,0 +1,286 @@
1
+ # Demo: SQLEnv — Flat OpenEnv Environment with Action Dispatch
2
+
3
+ > **Generated:** 2026-02-28T14:26Z
4
+ > **Branch:** `refactor-openenv-tutorial-project-structure` @ `f28bfaa`
5
+ > **Environment:** Python 3.12.3, torch 2.2.2, MockTokenizer (no Ollama required)
6
+
7
+ ---
8
+
9
+ ## What This Branch Does
10
+
11
+ This branch refactors the `sql-env` project from a nested `envs/sql_env/` layout into the canonical flat `openenv init` structure, and integrates the `action-feature` branch's core action dispatch system.
12
+
13
+ The result: a working RL environment where an agent sends natural language messages (e.g. _"describe the students table"_), the environment classifies them into action types (describe/sample/query), dispatches to the appropriate handler, and returns tokenized observations for RL training. All of this runs without external services — `MockTokenizer` replaces HuggingFace tokenizers and Ollama failures are handled gracefully.
14
+
15
+ ---
16
+
17
+ ## Quickstart
18
+
19
+ ```bash
20
+ git checkout refactor-openenv-tutorial-project-structure
21
+ uv sync
22
+ uv run pytest tests/ -v # 21 tests, ~3.5s
23
+ ```
24
+
25
+ **Prerequisites:** Python 3.11-3.12, `uv`.
26
+ **Optional:** Ollama with `llama3.2` for LLM-guided table selection (not needed for demo).
27
+
28
+ ---
29
+
30
+ ## Evidence
31
+
32
+ ### 1. All 21 Tests Pass
33
+
34
+ ```
35
+ $ uv run pytest tests/ -v
36
+
37
+ tests/test_smoke.py::TestModels::test_action_creation PASSED [ 4%]
38
+ tests/test_smoke.py::TestModels::test_action_with_tokens PASSED [ 9%]
39
+ tests/test_smoke.py::TestModels::test_observation_creation PASSED [ 14%]
40
+ tests/test_smoke.py::TestModels::test_state_creation PASSED [ 19%]
41
+ tests/test_smoke.py::TestEnvironment::test_instantiation PASSED [ 23%]
42
+ tests/test_smoke.py::TestEnvironment::test_reset_returns_observation PASSED [ 28%]
43
+ tests/test_smoke.py::TestEnvironment::test_reset_with_empty_prompt PASSED [ 33%]
44
+ tests/test_smoke.py::TestEnvironment::test_reset_creates_new_episode PASSED [ 38%]
45
+ tests/test_smoke.py::TestEnvironment::test_step_describe PASSED [ 42%]
46
+ tests/test_smoke.py::TestEnvironment::test_step_sample PASSED [ 47%]
47
+ tests/test_smoke.py::TestEnvironment::test_tokens_grow_across_turns PASSED [ 52%]
48
+ tests/test_smoke.py::TestActionDetection::test_describe_keywords PASSED [ 57%]
49
+ tests/test_smoke.py::TestActionDetection::test_sample_keywords PASSED [ 61%]
50
+ tests/test_smoke.py::TestActionDetection::test_query_default PASSED [ 66%]
51
+ tests/test_smoke.py::TestMessageToAction::test_creates_action PASSED [ 71%]
52
+ tests/test_smoke.py::TestMessageToAction::test_appends_to_history PASSED [ 76%]
53
+ tests/test_smoke.py::TestMessageToAction::test_validates_input PASSED [ 80%]
54
+ tests/test_smoke.py::TestClientSerialization::test_step_payload_serialization PASSED [ 85%]
55
+ tests/test_smoke.py::TestClientSerialization::test_parse_result_deserialization PASSED [ 90%]
56
+ tests/test_smoke.py::TestSchemaIntrospection::test_get_table_schema PASSED [ 95%]
57
+ tests/test_smoke.py::TestSchemaIntrospection::test_unknown_table PASSED [100%]
58
+
59
+ ============================== 21 passed in 3.56s ==============================
60
+ ```
61
+
62
+ Tests cover: Pydantic models, environment lifecycle, action detection, message-to-action conversion, client tensor serialization, and schema introspection.
63
+
64
+ ### 2. Lint and Format Clean
65
+
66
+ ```
67
+ $ uv run ruff check .
68
+ All checks passed!
69
+
70
+ $ uv run ruff format --check .
71
+ 14 files already formatted
72
+ ```
73
+
74
+ ### 3. Pydantic Model Contracts
75
+
76
+ ```python
77
+ >>> from sql_env.models import SQLAction, SQLObservation, SQLState
78
+
79
+ SQLAction fields: ['metadata', 'action_type', 'action_description', 'tokens']
80
+ SQLObservation fields: ['done', 'reward', 'metadata', 'messages', 'tokens']
81
+ SQLState fields: ['episode_id', 'step_count', 'history_messages', 'history_tokens', 'current_action_type']
82
+ ```
83
+
84
+ `SQLAction.tokens` and `SQLObservation.tokens` carry torch tensors. `SQLState.history_messages` / `history_tokens` accumulate the full conversation for RL context.
85
+
86
+ ### 4. Action Type Detection
87
+
88
+ The environment classifies natural language messages into action types via keyword matching:
89
+
90
+ ```
91
+ [PASS] "describe the students table..." -> describe
92
+ [PASS] "what columns does Course have..." -> describe
93
+ [PASS] "show me the schema..." -> describe
94
+ [PASS] "show me sample rows from students..." -> sample
95
+ [PASS] "give me example data..." -> sample
96
+ [PASS] "how many rows are in Courses..." -> sample
97
+ [PASS] "find all students enrolled in CS101..." -> query
98
+ [PASS] "select count(*) from students..." -> query
99
+ [PASS] "what is the average score..." -> query
100
+ ```
101
+
102
+ Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.
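The rule is simple enough to sketch. This is a hypothetical reconstruction of the classifier based on the PASS output above, not the actual `_detect_action_type` implementation; the keyword tuples are assumptions.

```python
# Hypothetical sketch of the keyword classifier described above; the real
# _detect_action_type in the environment may differ in detail.
DESCRIBE_KEYWORDS = ("describe", "schema", "columns")
SAMPLE_KEYWORDS = ("sample", "example", "rows")


def detect_action_type(content: str) -> str:
    """Classify a natural language message as describe, sample, or query."""
    text = content.lower()
    if any(kw in text for kw in DESCRIBE_KEYWORDS):
        return "describe"
    if any(kw in text for kw in SAMPLE_KEYWORDS):
        return "sample"
    return "query"  # default: anything else is treated as a query request


print(detect_action_type("how many rows are in Courses"))  # "rows" -> sample
```

Note the describe check runs first, so "show me the schema" never falls through to sample even though it also contains "me".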
103
+
104
+ ### 5. MockTokenizer Roundtrip
105
+
106
+ ```python
107
+ >>> from server.test_sql_env import MockTokenizer
108
+ >>> tok = MockTokenizer()
109
+ >>> msg = [{'role': 'user', 'content': 'describe the students table'}]
110
+ >>> tokens = tok.apply_chat_template(msg, return_tensors='pt')
111
+ >>> tokens.shape
112
+ torch.Size([1, 27])
113
+ >>> tokens[0][:10].tolist()
114
+ [100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
115
+ >>> tok.decode(tokens[0].tolist())
116
+ 'describe the students table'
117
+ ```
118
+
119
+ `MockTokenizer` encodes each character as `ord(c)` and decodes via `chr(t)`. Deterministic, no downloads, perfect for tests.
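The idea fits in a few lines. A minimal sketch using plain Python lists; the real `MockTokenizer` wraps the result in a torch tensor and exposes `apply_chat_template` / `decode` instead of these illustrative method names.

```python
class CharTokenizer:
    """Toy character-level tokenizer mirroring the MockTokenizer idea."""

    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]  # one token per character

    def decode(self, tokens: list[int]) -> str:
        return "".join(chr(t) for t in tokens)


tok = CharTokenizer()
tokens = tok.encode("describe the students table")
assert len(tokens) == 27  # matches torch.Size([1, 27]) above
assert tokens[:10] == [100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
assert tok.decode(tokens) == "describe the students table"
```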
120
+
121
+ ### 6. Schema Introspection
122
+
123
+ SQLAlchemy ORM models are introspected at runtime to produce natural language schema descriptions:
124
+
125
+ ```python
126
+ >>> env._get_table_schema('Student')
127
+ Table 'Student' has the following columns:
128
+
129
+ - student_id: integer number
130
+ - student_details: text (up to 255 characters)
131
+
132
+ >>> env._get_table_schema('NonexistentTable')
133
+ Table 'NonexistentTable' not found in schema.
134
+ ```
135
+
136
+ 9 tables available: Address, Person, Student, Course, PersonAddress, StudentCourseRegistration, StudentCourseAttendance, Candidate, CandidateAssessment.
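The introspection pattern can be sketched with a standalone SQLAlchemy model. `Student`, `humanize`, and `describe_table` here are illustrative stand-ins reconstructed from the output above, not the project's actual helpers.

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Student(Base):
    """Illustrative ORM model; the real models live in the server package."""

    __tablename__ = "Student"
    student_id = Column(Integer, primary_key=True)
    student_details = Column(String(255))


def humanize(coltype) -> str:
    # Translate SQLAlchemy column types into the natural language seen above.
    if isinstance(coltype, Integer):
        return "integer number"
    if isinstance(coltype, String) and coltype.length:
        return f"text (up to {coltype.length} characters)"
    return str(coltype).lower()


def describe_table(model) -> str:
    lines = [f"Table '{model.__tablename__}' has the following columns:", ""]
    lines += [f"- {c.name}: {humanize(c.type)}" for c in model.__table__.columns]
    return "\n".join(lines)


print(describe_table(Student))
```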
137
+
138
+ ### 7. Full Environment Interaction (Mock Path)
139
+
140
+ A complete multi-turn episode with no external services:
141
+
142
+ ```python
143
+ >>> from server.sql_environment import SQLEnvironment
144
+ >>> from server.test_sql_env import MockTokenizer
145
+ >>> env = SQLEnvironment(system_prompt='You are a helpful SQL assistant.', tokenizer=MockTokenizer())
146
+
147
+ >>> obs = env.reset()
148
+ >>> obs.messages # 1 message (system prompt)
149
+ >>> obs.tokens.shape
150
+ torch.Size([32])
151
+ >>> obs.done
152
+ False
153
+ ```
154
+
155
+ **Turn 1 — Describe:**
156
+ ```python
157
+ >>> action = env.message_to_action({'role': 'user', 'content': 'describe the Student table'})
158
+ >>> action.action_type
159
+ 'describe'
160
+ >>> obs = env.step(action)
161
+ >>> obs.messages[-1]
162
+ {'role': 'assistant', 'content': "Table 'Address' has the following columns:\n\n- address_id: integer number\n..."}
163
+ >>> obs.tokens.shape
164
+ torch.Size([91])
165
+ ```
166
+
167
+ Without Ollama, the describe action falls back to the first table (Address). With Ollama, it would correctly select "Student".
168
+
169
+ **Turn 2 — Sample:**
170
+ ```python
171
+ >>> action = env.message_to_action({'role': 'user', 'content': 'show me sample rows from Course'})
172
+ >>> action.action_type
173
+ 'sample'
174
+ >>> obs = env.step(action)
175
+ >>> obs.messages[-1]['content']
176
+ "Here's a query to sample data from Address:\n\nSELECT * FROM Address LIMIT 10;"
177
+ >>> obs.tokens.shape
178
+ torch.Size([503])
179
+ ```
180
+
181
+ **Turn 3 — Query (no Ollama):**
182
+ ```python
183
+ >>> action = env.message_to_action({'role': 'user', 'content': 'find all students enrolled in CS101'})
184
+ >>> action.action_type
185
+ 'query'
186
+ >>> obs = env.step(action)
187
+ >>> obs.messages[-1]['content']
188
+ 'Error: Ollama returned status 404'
189
+ >>> obs.tokens.shape
190
+ torch.Size([1028])
191
+ ```
192
+
193
+ The error is graceful — it becomes part of the conversation history. Token tensor grows monotonically across turns (32 -> 91 -> 503 -> 1028).
194
+
195
+ ### 8. Client Serialization
196
+
197
+ `SQLEnvClient` converts tensors to lists for JSON WebSocket transport:
198
+
199
+ ```python
200
+ >>> from sql_env.client import SQLEnvClient
201
+ >>> from sql_env.models import SQLAction
202
+ >>> import torch
203
+
204
+ >>> action = SQLAction(action_type='query', action_description='select * from students', tokens=torch.tensor([[1, 2, 3, 4, 5]]))
205
+ >>> payload = client._step_payload(action)
206
+ {
207
+ 'action_type': 'query',
208
+ 'action_description': 'select * from students',
209
+ 'tokens': [[1, 2, 3, 4, 5]],
210
+ 'metadata': {}
211
+ }
212
+ ```
213
+
214
+ Tensor -> list on send, list -> tensor on receive. Symmetric roundtrip verified in tests.
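A minimal sketch of that roundtrip, assuming the conversion is a plain `tolist()` / `torch.tensor()` pair; the function names are illustrative, not the client's private methods.

```python
import torch


def to_payload(tokens: torch.Tensor) -> list:
    return tokens.tolist()  # tensor -> nested list for JSON transport


def from_payload(data: list) -> torch.Tensor:
    return torch.tensor(data)  # list -> tensor on receive


orig = torch.tensor([[1, 2, 3, 4, 5]])
restored = from_payload(to_payload(orig))
assert torch.equal(orig, restored)  # lossless roundtrip for integer tokens
```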
215
+
216
+ ### 9. Spider Question Data
217
+
218
+ ```python
219
+ >>> import json
220
+ >>> data = json.load(open('data/questions/student_assessment.json'))
221
+ >>> len(data)
222
+ 53
223
+ >>> data[0]['question']
224
+ 'which course has most number of registered students?'
225
+ >>> data[0]['query']
226
+ 'SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1'
227
+ ```
228
+
229
+ 53 question-answer pairs from the Spider dataset's `student_assessment` database. Each entry has `db_id`, `query`, `question`, `query_toks`, `query_toks_no_value`, and `question_toks`.
230
+
231
+ ---
232
+
233
+ ## What Changed from `main`
234
+
235
+ | Area | Before (main) | After (this branch) |
236
+ |------|---------------|---------------------|
237
+ | **Layout** | `envs/sql_env/` nested | Flat root = package |
238
+ | **Build** | hatchling | setuptools |
239
+ | **Python** | 3.13 | 3.11-3.12 (torch compat) |
240
+ | **Models** | Structured obs (question, schema, result) | Chat-based obs (messages + tokens) |
241
+ | **Action** | `argument` field | `action_description` + `tokens` tensor |
242
+ | **Environment** | Scaffold stubs | Real SQLite + Ollama + keyword dispatch |
243
+ | **Client** | Basic EnvClient | Tensor <-> list serialization |
244
+ | **Data** | Empty .gitkeep dirs | 9 ORM models + 53 Spider questions |
245
+ | **Tests** | 0 | 21 (all passing) |
246
+ | **Empty dirs** | `training_pipeline/`, `submission_artifacts/` | Removed |
247
+
248
+ ---
249
+
250
+ ## Known Behaviors (Not Bugs)
251
+
252
+ 1. **Ollama fallback:** Without Ollama, `_call_ollama_to_select_table()` falls back to the first table (`Address`). Query actions return `Error: Ollama returned status 404`. This is by design — the mock path is for dev/test, not production.
253
+
254
+ 2. **`message_to_action()` mutates state:** It appends the message to `_state.history_messages` before tokenizing. This is intentional — the tokenizer needs the full conversation context.
255
+
256
+ 3. **`MockTokenizer` in production code:** `server/app.py` imports `MockTokenizer` from `server/test_sql_env.py` when `transformers` is unavailable. This is the teammate's design for running without GPU dependencies.
257
+
258
+ ---
259
+
260
+ ## Verification Checklist
261
+
262
+ - [x] `uv sync` succeeds (all deps install)
263
+ - [x] `uv run pytest tests/ -v` — 21/21 pass
264
+ - [x] `uv run ruff check .` — all checks passed
265
+ - [x] `uv run ruff format --check .` — 14 files formatted
266
+ - [x] Pydantic models import from `sql_env.models`
267
+ - [x] Environment instantiates with MockTokenizer
268
+ - [x] `reset()` returns valid SQLObservation with system prompt
269
+ - [x] Action detection: 9/9 keyword classifications correct
270
+ - [x] `message_to_action()` creates typed SQLAction with tokens
271
+ - [x] `step(describe)` returns schema from SQLAlchemy introspection
272
+ - [x] `step(sample)` returns SQL query text
273
+ - [x] `step(query)` returns graceful error without Ollama
274
+ - [x] Multi-turn conversation state grows correctly
275
+ - [x] Client tensor <-> list serialization roundtrips
276
+ - [x] Spider data loads (53 questions)
277
+
278
+ ---
279
+
280
+ ## What's Next
281
+
282
+ **Phase 3:** Reward computation (`server/reward.py`) and answer verification (`server/verifier.py`). Both are currently stubs.
283
+
284
+ ---
285
+
286
+ *All output captured live on 2026-02-28. Reproduce with `uv sync && uv run pytest tests/ -v`.*
DEMO_action_feature.md ADDED
@@ -0,0 +1,427 @@
1
+ # Feature Demo: `action-feature` — Core Action Dispatch System
2
+
3
+ > **Generated:** 2026-02-28T14:46:19+01:00
4
+ > **Branch:** `origin/action-feature` vs `main`
5
+ > **Project:** sql-env-onboarding (SQLEnv RL Environment for OpenEnv Challenge)
6
+ > **Execution environment:** Python 3.11, torch 2.2.2, MockTokenizer (no Ollama required)
7
+
8
+ ---
9
+
10
+ ## What This Feature Does
11
+
12
+ The `action-feature` branch adds the **core action dispatch system** to SQLEnv — the RL environment where AI agents learn interactive SQL exploration. Before this branch, the environment had data models but no way to actually process agent actions.
13
+
14
+ Now an agent can send natural language messages like _"describe the students table"_ or _"find all students enrolled in CS101"_, and the environment automatically classifies them into one of three action types (**describe**, **sample**, **query**), dispatches them to the appropriate handler, and returns observations with tokenized conversation history. This is the fundamental interaction loop that makes the environment usable for reinforcement learning.
15
+
16
+ The system works in two modes:
17
+ - **Mock path** (no external services): Uses `MockTokenizer` for tokenization. Describe and sample actions work fully; query actions return a graceful error since Ollama is unavailable.
18
+ - **Ollama path** (full pipeline): Uses a local Ollama LLM to select relevant tables for describe/sample and to generate SQL for query actions.
19
+
20
+ ---
21
+
22
+ ## Quickstart
23
+
24
+ ```bash
25
+ # 1. Checkout the branch
26
+ git checkout origin/action-feature --detach
27
+
28
+ # 2. Install dependencies (from the sql_env package directory)
29
+ cd envs/sql_env/
30
+ uv sync
31
+ uv add sqlalchemy # missing from pyproject.toml, needed for ORM models
32
+
33
+ # 3. Run the full demo (71 checks, ~2 seconds, no external services needed)
34
+ uv run python demo_action_feature.py
35
+
36
+ # 4. (Optional) Return to main when done
37
+ git checkout main
38
+ ```
39
+
40
+ **Prerequisites:** Python 3.11+, `uv` package manager, git.
41
+ **Optional:** Ollama running locally with `llama3.2` model for full query generation (set `OLLAMA_MODEL=llama3.2`).
42
+
43
+ > **Note:** `sqlalchemy` is required by the ORM models but was omitted from `pyproject.toml` on the branch. The `uv add sqlalchemy` step is necessary.
44
+
45
+ > **Note:** On Python < 3.12, a Pydantic compatibility patch is needed because `openenv` defines `Message` with `typing.TypedDict` instead of `typing_extensions.TypedDict`. The demo script applies this patch automatically.
46
+
47
+ ---
48
+
49
+ ## Live Demo — Mock Path (Primary)
50
+
51
+ All output below was captured by executing `uv run python demo_action_feature.py` on the `action-feature` branch with no Ollama model configured (default `qwen2` not installed).
52
+
53
+ ### 1. Environment Instantiation with MockTokenizer
54
+
55
+ The environment loads 9 SQLAlchemy ORM models (Address, Person, Student, Course, etc.) and initializes conversation state with a system prompt.
56
+
57
+ ```bash
58
+ uv run python demo_action_feature.py
59
+ ```
60
+
61
+ ```
62
+ ============================================================
63
+ 1. Environment Instantiation with MockTokenizer
64
+ ============================================================
65
+ [PASS] MockTokenizer created
66
+ [PASS] SQLEnvironment created
67
+ [PASS] System prompt stored
68
+ [PASS] Tokenizer stored
69
+ [PASS] 9 database models loaded
70
+ [PASS] Initial state has 1 message (system)
71
+ [PASS] Initial state has 1 token tensor
72
+ [PASS] System message role is correct
73
+ [PASS] System message content matches prompt
74
+ ```
75
+
76
+ The environment correctly stores the custom system prompt, attaches the tokenizer, and loads all 9 database table models from SQLAlchemy.
77
+
78
+ ### 2. reset() Returns Valid SQLObservation
79
+
80
+ ```
81
+ ============================================================
82
+ 2. reset() Returns Valid SQLObservation
83
+ ============================================================
84
+ [PASS] reset() returns SQLObservation
85
+ [PASS] Observation has messages list
86
+ [PASS] Messages contain system prompt
87
+ [PASS] Observation has tokens tensor
88
+ [PASS] Tokens tensor is 1D
89
+ [PASS] Tokens are non-empty
90
+ [PASS] done is False
91
+ [PASS] reward is None
92
+
93
+ Observation details:
94
+ messages count: 1
95
+ tokens shape: torch.Size([29])
96
+ tokens[:10]: [89, 111, 117, 32, 97, 114, 101, 32, 97, 32]
97
+ ```
98
+
99
+ After `reset()`, the observation contains one message (the system prompt) tokenized into a 1D tensor via `MockTokenizer` (char codes mod 256). The episode is not done and has no reward — ready for the agent's first action.
100
+
101
+ ### 3. Action Type Detection
102
+
103
+ The keyword classifier maps user messages to three action types: `describe`, `sample`, and `query`.
104
+
105
+ ```
106
+ ============================================================
107
+ 3. Action Type Detection (_detect_action_type)
108
+ ============================================================
109
+ [PASS] 'describe the students table...' -> describe
110
+ [PASS] 'what columns does Course have...' -> describe
111
+ [PASS] 'show me the schema...' -> describe
112
+ [PASS] 'show me sample rows from students...' -> sample
113
+ [PASS] 'give me example data...' -> sample
114
+ [PASS] 'how many rows are in Courses...' -> sample
115
+ [PASS] 'find all students enrolled in CS101...' -> query
116
+ [PASS] 'select count(*) from students...' -> query
117
+ [PASS] 'what is the average score...' -> query
118
+ ```
119
+
120
+ All 9 test cases correctly classified. Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.
121
+
122
+ ### 4. message_to_action() Creates Properly Typed SQLAction
123
+
124
+ ```
125
+ ============================================================
126
+ 4. message_to_action() Creates Properly Typed SQLAction
127
+ ============================================================
128
+ [PASS] Returns SQLAction
129
+ [PASS] action_type is 'describe'
130
+ [PASS] action_description is message content
131
+ [PASS] tokens is a torch.Tensor
132
+ [PASS] tokens tensor is non-empty
133
+
134
+ Action details:
135
+ action_type: describe
136
+ action_description: describe the students table
137
+ tokens shape: torch.Size([1, 57])
138
+ [PASS] message_to_action adds message to history
139
+ [PASS] History[1] is the user message
140
+ [PASS] Sample message -> action_type 'sample'
141
+ [PASS] Query message -> action_type 'query'
142
+ ```
143
+
144
+ `message_to_action()` converts a raw message dict into a `SQLAction` with the correct `action_type`, `action_description`, and tokenized tensor. **Important side effect:** it also appends the message to `_state.history_messages` before tokenizing, so the tokenizer sees the full conversation context.
145
+
146
+ ### 5. step() with Describe Action
147
+
148
+ Without Ollama, the describe action falls back to the first table (Address) and returns its full SQLAlchemy-derived schema.
149
+
150
+ ```
151
+ ============================================================
152
+ 5. step() with Describe Action (Schema from SQLAlchemy Models)
153
+ ============================================================
154
+ [PASS] step() returns SQLObservation
155
+ [PASS] History now has 3 messages (system + user + assistant)
156
+ [PASS] Last message is from assistant
157
+ [PASS] Assistant message contains 'columns'
158
+ [PASS] Schema info contains column descriptions
159
+
160
+ Describe response (first 200 chars):
161
+ Table 'Address' has the following columns:
162
+
163
+ - address_id: integer number
164
+ - line_1: text (up to 255 characters)
165
+ - line_2: text (up to 255 characters)
166
+ - city: text (up to 255 characters)
167
+ - zip_postcode:
168
+ [PASS] Tokens tensor grew after step
169
+ ```
170
+
171
+ The schema is extracted directly from SQLAlchemy model introspection (column names, types converted to natural language). The observation now has 3 messages (system → user → assistant) and the token tensor grew.
172
+
173
+ ### 6. step() with Sample Action
174
+
175
+ The sample action generates a `SELECT * ... LIMIT 10` query for the target table.
176
+
177
+ ```
178
+ ============================================================
179
+ 6. step() with Sample Action (Generates SQL Query Text)
180
+ ============================================================
181
+ [PASS] step(sample) returns assistant message
182
+ [PASS] Sample response contains SELECT
183
+ [PASS] Sample response contains LIMIT
184
+
185
+ Sample response:
186
+ Here's a query to sample data from Address:
187
+
188
+ SELECT * FROM Address LIMIT 10;
189
+ ```
190
+
191
+ ### 7. step() with Query Action (No Ollama)
192
+
193
+ Without Ollama, the query action returns a clear error message instead of crashing.
194
+
195
+ ```
196
+ ============================================================
197
+ 7. step() with Query Action (Mock Path — No Ollama)
198
+ ============================================================
199
+ [PASS] step(query) returns assistant message
200
+ [PASS] Query response is error string (no Ollama) or SQL
201
+
202
+ Query response (no Ollama):
203
+ Error: Ollama returned status 404
204
+ ```
205
+
206
+ The error is a graceful 404 (Ollama server is running but the default `qwen2` model isn't installed). The conversation continues normally — the error becomes part of the message history.
207
+
208
+ ### 8. Multi-Turn Conversation State Management
209
+
210
+ Three turns of alternating user/assistant messages, verifying the conversation history grows correctly.
211
+
212
+ ```
213
+ ============================================================
214
+ 8. Multi-Turn Conversation State Management
215
+ ============================================================
216
+ [PASS] After turn 1: 3 messages (sys + user + assistant)
217
+ [PASS] After turn 2: 5 messages (sys + u1 + a1 + u2 + a2)
218
+ [PASS] Tokens grew between turns
219
+ [PASS] After turn 3: 7 messages
220
+ [PASS] Tokens grew again
221
+ [PASS] Message roles follow expected pattern
222
+
223
+ Conversation summary after 3 turns:
224
+ [0] system: You are a test SQL assistant....
225
+ [1] user: describe the Address table...
226
+ [2] assistant: Table 'Address' has the following columns: - address_id: in...
227
+ [3] user: show me sample rows...
228
+ [4] assistant: Here's a query to sample data from Address: SELECT * FROM A...
229
+ [5] user: find all addresses in New York...
230
+ [6] assistant: Error: Ollama returned status 404...
231
+ Total tokens: 987
232
+ ```
233
+
234
+ Message roles follow the expected `[system, user, assistant, user, assistant, user, assistant]` pattern. Token count grows monotonically: the `_create_observation()` method flattens all `history_tokens` into a single 1D tensor via `torch.cat`.
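The flattening step can be sketched as follows, assuming 1D per-message tensors in `history_tokens`:

```python
import torch

# Per-message token tensors accumulate in history_tokens; each observation
# concatenates them into one flat 1D tensor, so token counts only grow.
history_tokens = [torch.tensor([1, 2]), torch.tensor([3]), torch.tensor([4, 5, 6])]
flat = torch.cat(history_tokens)
assert flat.dim() == 1
assert flat.tolist() == [1, 2, 3, 4, 5, 6]
```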
+
+ ### 9. Client Serialization Roundtrip
+
+ The `SQLEnvClient` converts tensor → list for JSON transport and list → tensor on the return path.
+
+ ```
+ ============================================================
+ 9. Client Serialization Roundtrip (_step_payload)
+ ============================================================
+ [PASS] Payload has action_type
+ [PASS] Payload has action_description
+ [PASS] Tokens converted to list
+ [PASS] Token values preserved
+ [PASS] _parse_result returns StepResult
+ [PASS] Observation messages parsed
+ [PASS] Tokens converted back to tensor
+ [PASS] Token values correct
+ [PASS] Reward parsed
+
+ Payload serialization:
+ action_type: query
+ tokens (list): [[1, 2, 3, 4, 5]]
+ ```
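The roundtrip is easy to sketch end to end (simplified; the real client wraps this inside `_step_payload()` and `_parse_result()`):

```python
import json
import torch

# Outbound: tensor -> nested list -> JSON string
tokens = torch.tensor([[1, 2, 3, 4, 5]])
payload = {"action_type": "query", "tokens": tokens.tolist()}
wire = json.dumps(payload)

# Inbound: JSON string -> list -> tensor
received = json.loads(wire)
restored = torch.tensor(received["tokens"])

print(torch.equal(restored, tokens))  # True
```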
+
+ ---
+
+ ## Edge Cases Exercised
+
+ ### Empty System Prompt
+
+ When no system prompt is provided (empty string), the environment correctly starts with zero messages and an empty token tensor.
+
+ ```
+ [PASS] Empty system prompt -> no messages in history
+ [PASS] Empty system prompt -> empty tokens
+ ```
+
+ ### Invalid Message Inputs
+
+ `message_to_action()` validates its input and raises `ValueError` for malformed messages.
+
+ ```
+ [PASS] Missing 'role' raises ValueError
+ [PASS] Missing 'content' raises ValueError
+ [PASS] None content raises ValueError
+ ```
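A minimal version of that validation might look like this (illustrative only; the branch's actual checks live inside `message_to_action()`):

```python
def validate_message(message: dict) -> None:
    """Raise ValueError for malformed chat messages."""
    if "role" not in message:
        raise ValueError("message is missing 'role'")
    if message.get("content") is None:
        raise ValueError("message is missing 'content'")

# A well-formed message passes silently
validate_message({"role": "user", "content": "describe the Address table"})

# Each malformed shape raises ValueError
for bad in ({"content": "hi"}, {"role": "user"}, {"role": "user", "content": None}):
    try:
        validate_message(bad)
    except ValueError as exc:
        print("rejected:", exc)
```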
+
+ ### Unknown Table Handling
+
+ Schema lookup and sample query generation gracefully handle non-existent tables.
+
+ ```
+ [PASS] Unknown table returns 'not found' message
+ [PASS] Unknown table sample returns error
+ ```
+
+ ### MockTokenizer Encode/Decode Roundtrip
+
+ The mock tokenizer's `ord(c) % 256` encoding correctly roundtrips through `chr(t)` decoding.
+
+ ```
+ [PASS] MockTokenizer encode/decode roundtrip
+ ```
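The roundtrip property is easy to check in isolation; note that `ord(c) % 256` is only lossless for characters with code points below 256, which holds for the ASCII strings used in the demo:

```python
def encode(text: str) -> list[int]:
    # MockTokenizer-style encoding: one token per character
    return [ord(c) % 256 for c in text]

def decode(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)

text = "SELECT * FROM Students;"
print(decode(encode(text)) == text)  # True: lossless for ASCII input
```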
+
+ ### Invalid Tokenizer Validation
+
+ The environment constructor rejects tokenizers missing `apply_chat_template`.
+
+ ```
+ [PASS] Invalid tokenizer raises ValueError
+ ```
+
+ ---
+
+ ## Live Demo — Ollama Path (Optional)
+
+ When Ollama is running locally with a compatible model, the query action generates real SQL and the describe action selects the correct table.
+
+ ### Describe with Ollama
+
+ With `OLLAMA_MODEL=llama3.2`, the LLM correctly identifies "Student" as the most relevant table for "describe the students table":
+
+ ```
+ DESCRIBE RESULT:
+ Table 'Student' has the following columns:
+
+ - student_id: integer number
+ - student_details: text (up to 255 characters)
+ ```
+
+ Compare with the mock path, which fell back to "Address" (the first table in the dict). **With Ollama, table selection is intelligent.**
+
+ ### Query with Ollama
+
+ The LLM generates valid SQL for natural language questions:
+
+ ```
+ QUERY RESULT:
+ SELECT * FROM Students WHERE CourseID IN (SELECT CourseID FROM Courses WHERE CourseName = 'CS101')
+ ```
+
+ > **Note:** The generated SQL references column names from the schema description prompt, not the actual SQLAlchemy model column names. This is expected — the LLM generates SQL based on the natural language schema it receives.
+
+ ---
+
+ ## Full Result Summary
+
+ ```
+ ============================================================
+ SUMMARY
+ ============================================================
+ Total checks: 71
+ Passed: 71
+ Failed: 0
+
+ ALL CHECKS PASSED
+ ```
+
+ | Category | Checks | Passed | Failed |
+ |----------|--------|--------|--------|
+ | Imports | 1 | 1 | 0 |
+ | Instantiation | 8 | 8 | 0 |
+ | reset() | 8 | 8 | 0 |
+ | Action detection | 9 | 9 | 0 |
+ | message_to_action | 9 | 9 | 0 |
+ | step(describe) | 6 | 6 | 0 |
+ | step(sample) | 3 | 3 | 0 |
+ | step(query) | 2 | 2 | 0 |
+ | Multi-turn state | 6 | 6 | 0 |
+ | Client serialization | 9 | 9 | 0 |
+ | Edge cases | 9 | 9 | 0 |
+ | **Total** | **71** | **71** | **0** |
+
+ ---
+
+ ## Verification Checklist
+
+ - [x] Environment instantiation with MockTokenizer — 8 checks
+ - [x] `reset()` returns valid SQLObservation with system prompt — 8 checks
+ - [x] Action type detection for all 3 types (describe/sample/query) — 9 keywords tested
+ - [x] `message_to_action()` creates SQLAction with correct type and tokens — 9 checks
+ - [x] `step()` with describe returns schema from SQLAlchemy models — 6 checks
+ - [x] `step()` with sample returns SQL query text — 3 checks
+ - [x] `step()` with query returns Ollama error gracefully (mock path) — 2 checks
+ - [x] Multi-turn conversation state grows correctly — 6 checks
+ - [x] Client tensor↔list serialization roundtrip — 9 checks
+ - [x] Edge cases: empty prompt, invalid inputs, unknown tables, tokenizer validation — 9 checks
+
+ ---
+
+ ## Known Issues Found
+
+ 1. **`sqlalchemy` missing from `pyproject.toml`** — The ORM models import `sqlalchemy` but it's not listed as a dependency. Must `uv add sqlalchemy` manually.
+
+ 2. **Pydantic/TypedDict incompatibility on Python < 3.12** — The `openenv` library defines `Message` with `typing.TypedDict`, but Pydantic 2.x requires `typing_extensions.TypedDict`. The demo script monkey-patches this, but the issue would affect any direct usage.
+
+ 3. **Ollama default model (`qwen2`) unlikely to be installed** — The default `OLLAMA_MODEL` is `qwen2`, which users probably don't have. The 404 error is graceful but confusing. Consider defaulting to `llama3.2` or documenting the required model.
+
+ 4. **describe/sample fallback to first table** — When Ollama is unavailable, `_call_ollama_to_select_table()` silently falls back to the first table in the dict (`Address`). This is an intentional fallback, but it may confuse users expecting the table from their query.
+
+ ---
+
+ ## File Reference
+
+ | File | What it does |
+ |------|-------------|
+ | `envs/sql_env/demo_action_feature.py` | Executable demo script (71 checks) |
+ | `envs/sql_env/server/sql_environment.py` | Core `SQLEnvironment` with reset/step/dispatch |
+ | `envs/sql_env/models.py` | `SQLAction`, `SQLObservation`, `SQLState` Pydantic models |
+ | `envs/sql_env/client.py` | `SQLEnvClient` with tensor↔list serialization |
+ | `envs/sql_env/server/test_sql_env.py` | `MockTokenizer` (char ord encoding) |
+ | `envs/sql_env/data/databases/models.py` | 9 SQLAlchemy ORM models |
+
+ ---
+
+ ## How to Reproduce
+
+ ```bash
+ git clone <repo-url>
+ cd sql-env-onboarding
+ git checkout origin/action-feature --detach
+ cd envs/sql_env/
+ uv sync && uv add sqlalchemy
+ uv run python demo_action_feature.py  # Mock path: 71/71 checks
+
+ # Optional: Ollama path
+ export OLLAMA_MODEL=llama3.2          # or any installed model
+ uv run python demo_action_feature.py  # Query actions now return real SQL
+ ```
+
+ ---
+
+ *Demo generated 2026-02-28. Re-run `uv run python demo_action_feature.py` on the action-feature branch to refresh.*
Dockerfile.test ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.12-slim
+
+ WORKDIR /app
+
+ # Install git for pip installs from GitHub
+ RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
+
+ # Copy project
+ COPY . .
+
+ # Install sql-env without deps
+ RUN pip install --no-deps .
+
+ # Install training deps (same as Colab setup cell)
+ RUN pip install \
+     "trl>=0.29.0" \
+     "accelerate>=0.34.0" \
+     "openenv-core[core]>=0.2.1" \
+     "pydantic>=2.0.0" \
+     "jmespath" \
+     "datasets>=3.0.0" \
+     "huggingface_hub>=0.30.0" \
+     "git+https://github.com/huggingface/transformers.git@main"
+
+ # Download Spider databases
+ RUN python scripts/download_spider_databases.py
+
+ CMD ["python", "scripts/test_training_local.py"]
ONBOARDING_action_feature.md ADDED
@@ -0,0 +1,341 @@
+ # Onboarding: `action-feature` Branch
+
+ > What the `action-feature` branch adds compared to `main`.
+ > Last updated: 2026-02-28
+ > Focus: Branch delta — new components, model changes, data flow, and gaps.
+
+ ## What This Branch Does
+
+ The `action-feature` branch transforms SQLEnv from a **scaffold with well-designed Pydantic models** into a **partially working environment** with real action dispatch (describe/sample/query), Ollama-based SQL generation, a WebSocket client, SQLAlchemy ORM models for the `student_assessment` database, and Spider question data. It implements the core `message → action → step → observation` loop that the RL training pipeline will eventually drive.
+
+ ---
+
+ ## Branch Overview
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────────┐
+ │ action-feature: New/Changed Components │
+ ├─────────────────────────────────────────────────────────────────────┤
+ │ │
+ │ Training Code / Notebook │
+ │ ┌──────────────────────┐ │
+ │ │ test_env.ipynb NEW │ Interactive walkthrough (5 test cells) │
+ │ └──────────┬───────────┘ │
+ │ │ imports │
+ │ ┌──────────▼───────────┐ ┌──────────────────────────┐ │
+ │ │ client.py NEW │────▶│ models.py CHANGED │ │
+ │ │ SQLEnvClient │ │ SQLAction (+ tokens, │ │
+ │ │ _step_payload() │ │ action_desc) │ │
+ │ │ _parse_result() │ │ SQLObservation │ │
+ │ │ _parse_state() │ │ (messages + tokens) │ │
+ │ │ message_to_action() │ │ SQLState │ │
+ │ └──────────────────────┘ │ (history + tokens) │ │
+ │ │ WebSocket └──────────────────────────┘ │
+ │ ┌──────────▼───────────┐ │
+ │ │ server/app.py CHG │ FastAPI bootstrap + tokenizer factory │
+ │ │ create_sql_env() │ │
+ │ └──────────┬───────────┘ │
+ │ │ creates │
+ │ ┌──────────▼───────────────────────────────────────────────┐ │
+ │ │ server/sql_environment.py NEW │ │
+ │ │ SQLEnvironment(Environment) │ │
+ │ │ ├── reset() → clear state, return obs │ │
+ │ │ ├── step(action) → dispatch on action_type │ │
+ │ │ │ ├── "describe" → Ollama selects table → ORM info │ │
+ │ │ │ ├── "sample" → Ollama selects table → SQL gen │ │
+ │ │ │ └── "query" → Ollama generates SQL from NL │ │
+ │ │ ├── message_to_action() → detect type, tokenize │ │
+ │ │ └── _detect_action_type() → keyword classifier │ │
+ │ └──────────┬───────────────────────┬───────────────────────┘ │
+ │ │ introspects │ HTTP calls │
+ │ ┌──────────▼───────────┐ ┌───────▼────────────────┐ │
+ │ │ data/databases/ │ │ Ollama (external) │ │
+ │ │ models.py NEW │ │ /api/generate │ │
+ │ │ 9 SQLAlchemy tables │ │ qwen2 (default) │ │
+ │ └──────────────────────┘ └────────────────────────┘ │
+ │ │
+ │ ┌──────────────────────┐ ┌────────────────────────────────┐ │
+ │ │ data/questions/ │ │ scripts/ NEW │ │
+ │ │ student_assessment │ │ download_spider_data.py │ │
+ │ │ .json NEW │ │ generate_models_from_schema.py│ │
+ │ │ (30+ Q&A pairs) │ └────────────────────────────────┘ │
+ │ └──────────────────────┘ │
+ │ │
+ │ ┌──────────────────────┐ ┌────────────────────────┐ │
+ │ │ server/test_sql_env │ │ server/install_deps.sh │ │
+ │ │ .py MockTokenizer │ │ Docker setup NEW │ │
+ │ │ NEW │ └────────────────────────┘ │
+ │ └──────────────────────┘ │
+ └─────────────────────────────────────────────────────────────────────┘
+ ```
+
+ ---
+
+ ## Files Changed/Added
+
+ | File | Status | Purpose |
+ |------|--------|---------|
+ | `envs/sql_env/models.py` | **Changed** | Rewired `SQLAction`, `SQLObservation`, `SQLState` for message+token paradigm |
+ | `envs/sql_env/__init__.py` | **Changed** | Exports `SQLAction`, `SQLObservation`, `SQLState`; lazy client import |
+ | `envs/sql_env/client.py` | **New** | `SQLEnvClient(EnvClient)` — WebSocket client with tensor serialization |
+ | `envs/sql_env/server/sql_environment.py` | **New** | `SQLEnvironment(Environment)` — core environment logic (463 lines) |
+ | `envs/sql_env/server/app.py` | **Changed** | FastAPI bootstrap with tokenizer factory + MockTokenizer fallback |
+ | `envs/sql_env/server/__init__.py` | **Changed** | Exports `SQLEnvironment` |
+ | `envs/sql_env/server/test_sql_env.py` | **New** | `MockTokenizer` for testing without `transformers` library |
+ | `envs/sql_env/server/install_deps.sh` | **New** | Docker setup script: pip install + pre-download GPT-2 tokenizer |
+ | `envs/sql_env/server/requirements.txt` | **New** | Server-side pip deps for Docker (fastapi, torch, transformers, etc.) |
+ | `envs/sql_env/data/databases/models.py` | **New** | SQLAlchemy ORM for `student_assessment` DB (9 model classes) |
+ | `envs/sql_env/data/questions/student_assessment.json` | **New** | 30+ Spider questions with gold SQL, tokenized queries |
+ | `envs/sql_env/scripts/download_spider_data.py` | **New** | Downloads Spider questions from HuggingFace by `db_id` |
+ | `envs/sql_env/scripts/generate_models_from_schema.py` | **New** | Auto-generates SQLAlchemy models from Spider schema dataset |
+ | `envs/sql_env/pyproject.toml` | **Changed** | Python constrained to `>=3.11,<3.13`; added `requests>=2.31.0` |
+ | `envs/sql_env/uv.lock` | **Changed** | Lock file updated for new dependencies |
+ | `README.md` | **Changed** | Added "Current Package State" section with pinned dependency rationale |
+ | `envs/sql_env/server/environment.py` | **Emptied** | Replaced by `sql_environment.py` |
+ | `test_env.ipynb` | **New** | Jupyter notebook with 5 interactive test scenarios |
+
+ **Total:** 18 files changed, +5702 / -412 lines.
+
+ ---
+
+ ## Key Components Introduced
+
+ ### 1. `SQLEnvironment` — `envs/sql_env/server/sql_environment.py`
+
+ The heart of the branch. Implements the OpenEnv `Environment` interface with three action types:
+
+ | Action Type | Dispatch Flow | Output |
+ |-------------|--------------|--------|
+ | `describe` | Ollama selects table → `_get_table_schema()` introspects SQLAlchemy model | Column names + natural language types |
+ | `sample` | Ollama selects table → `_generate_sample_query()` | `SELECT * FROM <table> LIMIT 10;` |
+ | `query` | `_call_ollama_for_sql()` sends NL + schema to Ollama | Generated SQL string |
+
+ Key methods:
+
+ - **`reset()`** — Clears conversation history, re-initializes system prompt message + tokens. Returns initial `SQLObservation`.
+ - **`step(action)`** — Dispatches on `action.action_type`. Appends assistant response to `history_messages`, stores action tokens in `history_tokens`. Returns flattened observation.
+ - **`message_to_action(message)`** — Server-side conversion of `Message` dict → `SQLAction`. Detects action type via keywords, appends message to state history, tokenizes full conversation.
+ - **`_detect_action_type(content)`** — Keyword classifier: checks for "describe"/"schema"/"columns" → `describe`, "sample"/"example"/"rows" → `sample`, default → `query`.
+ - **`_create_observation()`** — Builds `SQLObservation` from current state. Flattens all `history_tokens` into a single 1D tensor via `torch.cat`.
+ - **`_get_table_schema(table_name)`** — Introspects SQLAlchemy model columns, converts types to natural language.
+ - **`_call_ollama_for_sql(query)`** / **`_call_ollama_to_select_table(request)`** — HTTP POST to Ollama `/api/generate`.
+
+ **Constructor params:** `tokenizer` (must have `apply_chat_template`), optional `system_prompt`, optional `transform`.
+
+ **Environment variables:** `OLLAMA_MODEL` (default: `qwen2`), `OLLAMA_BASE_URL` (default: `http://localhost:11434`).
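The keyword classifier is simple enough to sketch in full (an illustrative reimplementation based on the description above; the branch's `_detect_action_type()` may differ in detail):

```python
def detect_action_type(content: str) -> str:
    """Classify a user message into describe / sample / query by keyword."""
    text = content.lower()
    if any(kw in text for kw in ("describe", "schema", "columns")):
        return "describe"
    if any(kw in text for kw in ("sample", "example", "rows")):
        return "sample"
    return "query"  # default: treat as a natural language SQL request

print(detect_action_type("Show me the Student schema"))  # describe
print(detect_action_type("give me sample rows"))         # sample
print(detect_action_type("find addresses in New York"))  # query
```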
+
+ ### 2. `SQLEnvClient` — `envs/sql_env/client.py`
+
+ WebSocket client extending OpenEnv's `EnvClient[SQLAction, SQLObservation, SQLState]`. Handles tensor↔list serialization for JSON transport:
+
+ - **`_step_payload(action)`** — Converts `action.tokens` (Tensor) to Python list for JSON.
+ - **`_parse_result(payload)`** — Deserializes response → `StepResult[SQLObservation]`, converting token lists back to tensors.
+ - **`_parse_state(payload)`** — Deserializes state → `SQLState` with tensor reconstruction.
+ - **`message_to_action(message, tokenizer, history_messages)`** — Client-side version of action creation (mirrors server logic). Requires passing a tokenizer explicitly.
+
+ ### 3. `server/app.py` — FastAPI Bootstrap
+
+ Changed from a stub to a working application:
+
+ - **`get_tokenizer()`** — Loads HuggingFace tokenizer from `TOKENIZER_NAME` env var (default: `mistralai/Mistral-7B-Instruct-v0.1`). Falls back to `MockTokenizer` from `test_sql_env.py` if `transformers` is not installed.
+ - **`create_sql_environment()`** — Factory function creating `SQLEnvironment` per WebSocket session.
+ - **`app = create_app(create_sql_environment, SQLAction, SQLObservation, env_name="sql_env")`** — Wires up WebSocket endpoints.
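The fallback pattern in `get_tokenizer()` can be sketched like this (a simplified, self-contained version; the mock class is defined inline here, whereas the branch imports it from `test_sql_env.py`):

```python
import os

class MockTokenizer:
    """Stand-in exposing the one method the environment requires."""
    def apply_chat_template(self, messages, **kwargs):
        text = "".join(m["content"] for m in messages)
        return [ord(c) % 256 for c in text]

def get_tokenizer():
    """Prefer a real HuggingFace tokenizer; fall back to the mock if unavailable."""
    name = os.environ.get("TOKENIZER_NAME", "mistralai/Mistral-7B-Instruct-v0.1")
    try:
        from transformers import AutoTokenizer  # may not be installed
        return AutoTokenizer.from_pretrained(name)
    except Exception:
        return MockTokenizer()

tok = get_tokenizer()
print(type(tok).__name__)
```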
+
+ ### 4. SQLAlchemy ORM — `envs/sql_env/data/databases/models.py`
+
+ 9 model classes for the `student_assessment` database:
+
+ | Model | Table | Key Columns |
+ |-------|-------|-------------|
+ | `Address` | Addresses | address_id, line_1, city, country |
+ | `Person` | People | person_id, first_name, last_name, email_address |
+ | `Student` | Students | student_id, student_details |
+ | `Course` | Courses | course_id (String PK), course_name |
+ | `PersonAddress` | People_Addresses | person_id (FK), address_id (FK), date_from/to |
+ | `StudentCourseRegistration` | Student_Course_Registrations | student_id (FK), course_id (FK), registration_date |
+ | `StudentCourseAttendance` | Student_Course_Attendance | student_id (FK), course_id (FK), date_of_attendance |
+ | `Candidate` | Candidates | candidate_id, candidate_details |
+ | `CandidateAssessment` | Candidate_Assessments | candidate_id (FK), qualification, assessment_date |
+
+ All models include proper foreign key relationships with `back_populates`.
+
+ ### 5. Spider Question Data — `envs/sql_env/data/questions/student_assessment.json`
+
+ A 3,355-line JSON file containing 30+ question-answer pairs from the Spider dataset. Each entry includes:
+
+ - `db_id` — always `student_assessment`
+ - `question` — natural language question (e.g., "which course has most number of registered students?")
+ - `query` — gold SQL (e.g., `SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations...`)
+ - `query_toks` / `query_toks_no_value` / `question_toks` — tokenized versions
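For orientation, an entry has roughly this shape (the question and query come from the example above, with the query truncated as in the source; the token list is illustrative, since the real file stores the dataset's own tokenization):

```python
entry = {
    "db_id": "student_assessment",
    "question": "which course has most number of registered students?",
    "query": "SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations ...",
    # Illustrative split only; see query_toks / question_toks in the file
    "question_toks": ["which", "course", "has", "most", "number",
                      "of", "registered", "students", "?"],
}

# The file is a JSON array of such entries; filtering by db_id is a one-liner
def for_db(entries, db_id):
    return [e for e in entries if e["db_id"] == db_id]

print(len(for_db([entry], "student_assessment")))  # 1
```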
+
+ ### 6. Data Preparation Scripts — `envs/sql_env/scripts/`
+
+ - **`download_spider_data.py`** — CLI tool to download Spider questions from HuggingFace. Supports `--db-id` filtering and `--split` selection.
+ - **`generate_models_from_schema.py`** — Auto-generates SQLAlchemy ORM models from the `richardr1126/spider-schema` HuggingFace dataset. Maps Spider types to SQLAlchemy types, handles foreign keys.
+
+ ### 7. `MockTokenizer` — `envs/sql_env/server/test_sql_env.py`
+
+ Deterministic tokenizer for testing without `transformers`:
+
+ - **`apply_chat_template()`** — Converts message text to token IDs via `ord(c) % 256`.
+ - **`decode()`** — Reverses the encoding back to characters.
+ - Imported by `app.py` as a fallback when `transformers` is not installed.
+
+ ---
+
+ ## Model Changes (Main → Action-Feature)
+
+ ### `SQLAction`
+
+ | Field | Main | Action-Feature | Notes |
+ |-------|------|----------------|-------|
+ | `action_type` | `"DESCRIBE, SAMPLE, QUERY, ANSWER"` | `"describe, sample, query"` | Lowercase, ANSWER removed |
+ | `argument` | Table name / SQL / answer value | **Removed** | — |
+ | `action_description` | — | **Added**: description string | Replaces `argument` |
+ | `tokens` | — | **Added**: `torch.Tensor` | Tokenized conversation |
+
+ ### `SQLObservation`
+
+ | Field | Main | Action-Feature | Notes |
+ |-------|------|----------------|-------|
+ | `question` | NL question string | **Commented out** | — |
+ | `schema_info` | DB schema description | **Commented out** | — |
+ | `result` | Last action result | **Commented out** | — |
+ | `error` | Error message | **Commented out** | — |
+ | `step_count` | Current step number | **Commented out** | — |
+ | `budget_remaining` | Steps left | **Commented out** | — |
+ | `action_history` | Summary of actions | **Commented out** | — |
+ | `messages` | — | **Added**: `list[Message]` | Full conversation history |
+ | `tokens` | — | **Added**: `torch.Tensor` | Flattened token tensor |
+
+ The original observation fields are **commented out, not deleted** — they're expected to return in a future phase.
+
+ ### `SQLState`
+
+ | Field | Main | Action-Feature | Notes |
+ |-------|------|----------------|-------|
+ | `game_name` | `"sql_env"` | **Commented out** | — |
+ | `history_messages` | — | **Added**: `list[Message]` | Full conversation history |
+ | `history_tokens` | — | **Added**: `list[torch.Tensor]` | Per-message token tensors |
+ | `current_action_type` | — | **Added**: `str` (default `"query"`) | Tracks current action |
+
+ **Design shift:** The branch moves from a **structured observation** (question + schema + result fields) to a **chat-based observation** (raw messages + tokens). This aligns with how LLM-based agents naturally consume conversational context.
+
+ ---
+
+ ## Data Flow
+
+ ```
+ User Message (dict: {role: "user", content: "Show me the Student schema"})
+
+
+ message_to_action(message)          [SQLEnvironment or SQLEnvClient]
+ ├── Detect action type via keywords
+ │     "schema" found → action_type = "describe"
+ ├── Append message to _state.history_messages   ← MUTATES STATE
+ ├── Tokenize FULL conversation via tokenizer.apply_chat_template()
+ └── Return SQLAction(action_type="describe",
+ │        action_description="Show me the Student schema",
+ │        tokens=<tensor>)
+
+
+ step(action)                        [SQLEnvironment]
+ ├── Dispatch on action.action_type:
+ │     "describe" → _call_ollama_to_select_table("Show me the Student schema")
+ │        → returns "Student"
+ │        → _get_table_schema("Student")
+ │        → introspects SQLAlchemy model columns
+ │        → "Table 'Student' has: student_id: integer, ..."
+ ├── Create assistant Message with schema info
+ ├── Append assistant message to _state.history_messages
+ ├── Append action.tokens to _state.history_tokens
+ └── _create_observation()
+       ├── Flatten all history_tokens via torch.cat → single 1D tensor
+       ├── Copy history_messages
+       ├── Apply transform (if configured)
+       └── Return SQLObservation(messages=[...], tokens=<flat tensor>)
+ ```
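The same flow, reduced to a self-contained toy (plain dicts and lists stand in for the repo's Pydantic models and tensors; the keyword rules follow the description above, and no Ollama call is made):

```python
class ToyEnv:
    """Toy reduction of the message -> action -> step loop."""
    def __init__(self):
        self.history_messages = []
        self.history_tokens = []

    def message_to_action(self, message):
        text = message["content"].lower()
        if any(k in text for k in ("describe", "schema", "columns")):
            action_type = "describe"
        elif any(k in text for k in ("sample", "example", "rows")):
            action_type = "sample"
        else:
            action_type = "query"
        self.history_messages.append(message)  # side effect, as on the branch
        tokens = [ord(c) % 256 for c in message["content"]]
        return {"action_type": action_type, "tokens": tokens}

    def step(self, action):
        reply = {"role": "assistant", "content": f"handled {action['action_type']}"}
        self.history_messages.append(reply)
        self.history_tokens.append(action["tokens"])
        # List flattening plays the role of torch.cat here
        flat = [t for chunk in self.history_tokens for t in chunk]
        return {"messages": list(self.history_messages), "tokens": flat}

env = ToyEnv()
action = env.message_to_action({"role": "user", "content": "Show me the Student schema"})
obs = env.step(action)
print(action["action_type"])  # describe
print(len(obs["messages"]))   # 2
```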
+
+ ---
+
+ ## External Dependencies Added
+
+ | Dependency | Version | Purpose | Integration Point |
+ |------------|---------|---------|-------------------|
+ | Ollama (local service) | — | LLM inference for SQL generation + table selection | `sql_environment.py:_call_ollama_for_sql()`, `_call_ollama_to_select_table()` |
+ | `requests` | >=2.31.0 | HTTP client for Ollama API | `sql_environment.py` |
+ | `torch` | ==2.2.2 | Tensor operations for tokenized representations | `models.py`, `client.py`, `sql_environment.py` |
+ | `transformers` | <5 | HuggingFace tokenizers (chat template support) | `app.py:get_tokenizer()` |
+ | `numpy` | <2 | Torch dependency constraint | `pyproject.toml` |
+ | `sqlalchemy` | (transitive) | ORM for database schema introspection | `data/databases/models.py` |
+ | `datasets` | (scripts only) | HuggingFace `load_dataset` for Spider data download | `scripts/download_spider_data.py`, `scripts/generate_models_from_schema.py` |
+
+ **Environment variables:**
+
+ | Variable | Default | Purpose |
+ |----------|---------|---------|
+ | `TOKENIZER_NAME` | `mistralai/Mistral-7B-Instruct-v0.1` | HuggingFace tokenizer model |
+ | `SYSTEM_PROMPT` | Built-in schema description | Custom system prompt override |
+ | `OLLAMA_MODEL` | `qwen2` | Ollama model for SQL generation |
+ | `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |
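For reference, a generate call against that endpoint is a single POST; the URL and defaults follow the table above, and `model`/`prompt`/`stream` are the standard Ollama `/api/generate` request fields (the branch's exact payload is assumed, not verified):

```python
import json
import os

model = os.environ.get("OLLAMA_MODEL", "qwen2")
base_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")

url = f"{base_url}/api/generate"
payload = {"model": model, "prompt": "Translate to SQL: list all students", "stream": False}

# With the server running: requests.post(url, json=payload).json()["response"]
print(url)
print(json.dumps(payload)[:60])
```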
+
+ ---
+
+ ## Known Gaps (Not Yet Implemented)
+
+ | Feature | Status | Notes |
+ |---------|--------|-------|
+ | `ANSWER` action type | Not implemented | Designed in main-branch models but removed from action-feature |
+ | Real database execution | Not implemented | `step()` generates SQL text via Ollama but never executes it against SQLite |
+ | Reward computation | Not implemented | `reward.py` is empty; 3-layer design exists in README only |
+ | Answer verification | Not implemented | `verifier.py` is empty |
+ | Budget tracking | Not implemented | No step limit enforcement |
+ | Episode question selection | Not implemented | Environment uses hardcoded schema; `student_assessment.json` is present but not loaded by the environment |
+ | Dockerfile | Not implemented | File is empty; `install_deps.sh` is ready |
+ | `openenv.yaml` manifest | Not implemented | Empty file |
+ | Formal test suite | Not implemented | No `tests/` directory; only `MockTokenizer` and notebook tests |
+
+ ---
+
+ ## Gotchas
+
+ - **`message_to_action()` mutates state:** On the server side, `message_to_action()` appends the message to `_state.history_messages` *before* tokenizing. This means calling it has a side effect — it's not a pure function. If you call it twice with the same message, you'll get duplicate entries in history.
+
+ - **Client vs server `message_to_action` diverge:** The server version (`sql_environment.py:message_to_action`) manages state internally and mutates `_state`. The client version (`client.py:message_to_action`) requires passing `history_messages` explicitly and does not manage state. They have different signatures.
+
+ - **Schema description is hardcoded in `sql_environment.py`:** The `_build_schema_description()` function returns a fixed string with table/column names that don't perfectly match the SQLAlchemy ORM models. For example, the schema description says `Students (student_id, person_id, student_acc_status)` but the ORM model has `Students (student_id, student_details)`.
+
+ - **Ollama failure mode is silent:** If Ollama is unreachable, `_call_ollama_to_select_table()` catches all exceptions and returns the *first table in the dict* (`Address`). No error is surfaced to the caller. `_call_ollama_for_sql()` returns an error string, but it's treated as a normal assistant message.
+
+ - **Original observation fields are commented out, not deleted:** `SQLObservation` still has `question`, `schema_info`, `result`, `error`, `step_count`, `budget_remaining`, and `action_history` as comments. They're intended to return in a later phase.
+
+ - **`MockTokenizer` is imported by production code:** `app.py` imports `MockTokenizer` from `test_sql_env.py` at runtime when `transformers` is missing. This couples test utilities to production bootstrap.
+
+ - **`test_env.ipynb` lives at project root:** Not inside `tests/` or `envs/`. Easy to miss when exploring the codebase.
+
+ - **Pydantic + torch.Tensor:** `SQLAction`, `SQLObservation`, and `SQLState` use `torch.Tensor` fields with Pydantic. This requires `arbitrary_types_allowed = True` in the Pydantic model config (inherited from OpenEnv base classes). Standard Pydantic serialization (`.model_dump()`) won't work out of the box with tensors.
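The first gotcha above can be demonstrated with a toy stand-in (illustrative; it mirrors the described side effect, not the actual class):

```python
class ToyState:
    def __init__(self):
        self.history_messages = []

def message_to_action(state, message):
    # Appends BEFORE tokenizing, so every call grows the history
    state.history_messages.append(message)
    return {"action_type": "query", "tokens": [ord(c) % 256 for c in message["content"]]}

state = ToyState()
msg = {"role": "user", "content": "find all addresses"}
message_to_action(state, msg)
message_to_action(state, msg)       # same message, called twice
print(len(state.history_messages))  # 2: duplicate entry in history
```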
+
+ ---
+
+ ## Entry Points for Reading
+
+ | What You Want to Understand | Start Here | Then Read |
+ |----------------------------|------------|-----------|
+ | How actions are processed | `envs/sql_env/server/sql_environment.py:step()` | `_detect_action_type()`, `_call_ollama_for_sql()` |
+ | How messages become actions | `envs/sql_env/server/sql_environment.py:message_to_action()` | `envs/sql_env/client.py:message_to_action()` |
+ | Data contracts | `envs/sql_env/models.py` | Compare with `git show main:envs/sql_env/models.py` |
+ | Server bootstrap | `envs/sql_env/server/app.py` | `get_tokenizer()`, `create_sql_environment()` |
+ | Database schema | `envs/sql_env/data/databases/models.py` | `envs/sql_env/data/questions/student_assessment.json` |
+ | Client-side usage | `envs/sql_env/client.py` | `test_env.ipynb` |
+ | Data preparation | `envs/sql_env/scripts/download_spider_data.py` | `scripts/generate_models_from_schema.py` |
+
+ ---
+
+ *This document covers only the `action-feature` branch delta. For the overall project design (POMDP architecture, reward layers, episode lifecycle), see [README.md](README.md).*
+
+ > **Note:** The issues below may already have been fixed. Verify against the latest remote `action-feature` branch before acting on them.
+
+ Known issues discovered:
+
+ 1. `sqlalchemy` is missing from `pyproject.toml` on the branch.
+ 2. Pydantic/TypedDict incompatibility on Python < 3.12 (the demo auto-patches it).
+ 3. The hardcoded schema description in `sql_environment.py` doesn't match the ORM models.
+ 4. Silent Ollama fallback to the first table on connection failure.
README.md CHANGED
@@ -84,13 +84,42 @@ Episode flow:
 ## Train an Agent

- Use the GRPO training pipeline artifacts from F006 and run the notebook workflow:

- - Notebook: `notebooks/train_grpo.ipynb`
- - Training support modules: `training/`
- - Evaluation utilities: `evaluation/`

- This setup is designed for Colab and local CPU/GPU environments.

 ## HuggingFace Space
 
84
 
85
  ## Train an Agent
86
 
87
+ The environment exposes four tools (`describe`, `sample`, `query`, `answer`) that TRL's GRPOTrainer discovers automatically. The model learns to call these tools through GRPO — no custom rollout code needed.
88
 
89
+ ### Local test (Docker, CPU)
 
 
90
 
91
+ Verify the training pipeline end-to-end in about 3 minutes:
92
+
93
+ ```bash
94
+ docker build -f Dockerfile.test -t sqlenv-test .
95
+ docker run --rm sqlenv-test
96
+ ```
97
+
98
+ This runs 2 training steps with `configs/test_cpu.json` and prints per-step loss, reward, tool call frequency, and model completions.
99
+
100
+ ### Colab training (GPU)
101
+
102
+ Open the notebook and select a GPU runtime (L4 recommended):
103
+
104
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hjerpe/sql-env/blob/main/notebooks/train_grpo.ipynb)
105
+
106
+ The notebook uses `configs/colab_l4.json` settings: batch size 8, 8 generations per prompt, bf16 precision. Live reward plots and execution traces update during training.
107
+
108
+ ### What the model sees
109
+
110
+ Each episode, TRL injects tool schemas into the prompt. The model generates structured tool calls:
111
+
112
+ ```
113
+ <tool_call>{"name": "describe", "arguments": {"table_name": "employee"}}</tool_call>
114
+ ```
115
+
116
+ TRL parses this, calls `env.describe(table_name="employee")`, and appends the result. The model can then call more tools or submit an answer. Rewards accumulate from each interaction.
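A minimal sketch of that parsing step, using only the standard library (illustrative only; TRL's actual parser is more robust):

```python
import json
import re

def parse_tool_call(completion: str):
    """Extract the first <tool_call> JSON payload from a completion string."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL)
    if match is None:
        return None  # no structured call; a policy might treat this as a parse error
    payload = json.loads(match.group(1))
    return payload["name"], payload.get("arguments", {})

name, args = parse_tool_call(
    '<tool_call>{"name": "describe", "arguments": {"table_name": "employee"}}</tool_call>'
)
```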
117
+
118
+ ### Configuration
119
+
120
+ Training configs live in `configs/`:
121
+ - `test_cpu.json` — 2 steps, 256 tokens, budget 3 (local validation)
122
+ - `colab_l4.json` — full epoch, 512 tokens, budget 10, bf16 (L4 GPU)
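Since both files are plain JSON, a loader can split environment settings from trainer kwargs. The key split below is an assumption for illustration, not the repo's actual `training/` API:

```python
import json
from pathlib import Path

# Hypothetical split: which keys belong to the environment vs. the trainer
ENV_KEYS = {"questions_path", "db_dir", "step_budget", "dataset_size"}

def load_training_config(path: str) -> dict:
    """Read a configs/*.json file and split env settings from trainer kwargs."""
    cfg = json.loads(Path(path).read_text())
    return {
        "env": {k: v for k, v in cfg.items() if k in ENV_KEYS},
        "trainer": {k: v for k, v in cfg.items() if k not in ENV_KEYS},
    }
```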
123
 
124
  ## HuggingFace Space
125
 
REVIEW_REPORT.md CHANGED
@@ -1,57 +1,44 @@
1
- # Code Review Report: F006 Step 3.1 (`notebooks/train_grpo.ipynb`, `pyproject.toml`, `tests/e2e/test_training_e2e.py`)
2
 
3
  **Risk Tier:** Medium
4
- **Status:** Failed
5
- **Verdict:** BLOCK
6
 
7
  ## Summary
8
 
9
- Step 3.1 is not ready to merge. The training extra currently resolves to a TRL version incompatible with the repo’s pinned Torch version, causing notebook imports to fail before training can start. In addition, the added E2E test only validates notebook structure and does not exercise the required one-step training smoke flow from the verification spec.
10
 
11
  ## Evidence
12
 
13
  ### Tests
14
- - **Status:** Passed (limited scope)
15
- - **Command:** `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v`
16
- - **Results:** `2 passed, 0 failed`
17
-
18
- ### Dependency/Runtime Validation
19
- - **Status:** Failed
20
- - **Command:** `uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"`
21
- - **Observed:** Import error (`cannot import name 'FSDPModule'`) in TRL with current Torch pin.
 
22
 
23
  ### Security (Medium)
24
  - **Status:** Clear
25
- - **Checks:** Medium-tier quick checks only (no secrets/auth/unsafe execution patterns introduced in scoped changes).
26
 
27
  ## Issues
28
 
29
  ### Critical
30
- 1. **Training extra resolves to incompatible TRL, breaking notebook startup**
31
- - **Location:** `pyproject.toml:30-33`, `notebooks/train_grpo.ipynb:29-35`
32
- - **Problem:** `training = ["trl>=0.12.0", "accelerate>=0.34.0"]` permits latest TRL (installed as 0.29.1), which fails to import with pinned `torch==2.2.2`.
33
- - **Impact:** Notebook cannot run end-to-end (“one click” success criterion fails before training).
34
- - **Fix:** Pin a TRL range compatible with Torch 2.2.2 (or upgrade Torch accordingly), then add/import-check coverage in tests.
35
 
36
  ### Important
37
- 1. **E2E smoke test does not validate actual Step 3.1 execution path**
38
- - **Location:** `tests/e2e/test_training_e2e.py:25-65`
39
- - **Problem:** Test checks notebook text structure and helper filtering only; it does not instantiate trainer, run `trainer.train()`, or verify metrics/comparison outputs as specified.
40
- - **Impact:** Regressions in training flow can pass CI undetected.
41
- - **Fix:** Add a true smoke execution test (tiny/mocked model + single train step + metric assertion), aligned to `specs/F006-VERIFICATION_SPEC.md` Section 4.
42
-
43
- 2. **Comparison cell is not random-vs-trained and does not capture pre-training baseline**
44
- - **Location:** `notebooks/train_grpo.ipynb:181-183`
45
- - **Problem:** Both `before_rollouts` and `after_rollouts` use `rollout_func` with the same model after training.
46
- - **Impact:** Fails the feature’s “before vs after” demo intent (and spec’s random-vs-trained comparison).
47
- - **Fix:** Capture baseline episodes before training (or explicit random policy), then run trained-policy episodes after `trainer.train()`.
48
 
49
  ### Minor
50
- None.
 
 
51
 
52
  ## Next Actions
53
 
54
- 1. Fix dependency compatibility (TRL/Torch) and prove imports succeed in clean env.
55
- 2. Upgrade E2E smoke test to execute one real/mocked GRPO training step and assert logged metrics.
56
- 3. Correct notebook comparison to true baseline-vs-trained behavior.
57
- 4. Re-run: `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v` and include import-check evidence.
 
1
+ # Code Review Report: F011 Step 1.3 (`notebooks/compare_methods.ipynb`)
2
 
3
  **Risk Tier:** Medium
4
+ **Status:** Passed with Warnings
5
+ **Verdict:** APPROVE
6
 
7
  ## Summary
8
 
9
+ `LLMToolCallingPolicy` is implemented per Step 1.3 intent: it builds episode messages, uses chat-template tool calling, forces ANSWER at low budget, and falls back to `parse_error` on unparseable output. No correctness or security blockers were found in the scoped notebook change.
10
 
11
  ## Evidence
12
 
13
  ### Tests
14
+ - **Status:** Mixed (targeted checks passed; existing unrelated smoke failures persist)
15
+ - **Commands:**
16
+ - `uv run python - <<'PY' ... compile notebook cells ... PY`
17
+ - `uv run python - <<'PY' ... runtime checks for valid action / budget fallback / parse fallback ... PY`
18
+ - `uv run pytest tests/test_smoke.py -v`
19
+ - **Results:**
20
+ - Notebook code-cell compilation: passed (`Compiled 6 code cells successfully`)
21
+ - Policy runtime checks: passed (`QUERY` valid path, `ANSWER budget_exhausted`, `ANSWER parse_error`)
22
+ - Smoke tests: `21 passed, 4 failed` (pre-existing reward expectation mismatches in environment tests)
23
 
24
  ### Security (Medium)
25
  - **Status:** Clear
26
+ - **Checks:** Medium-tier quick checks on parsing/generation fallback paths; no secret handling, auth, or privilege-sensitive paths added.
27
 
28
  ## Issues
29
 
30
  ### Critical
31
+ None.
 
 
 
 
32
 
33
  ### Important
34
+ None.
 
 
 
 
 
 
 
 
 
 
35
 
36
  ### Minor
37
+ 1. **Episode reset heuristic is question-text based and can theoretically leak history if two consecutive episodes start with identical question text.**
38
+ - **Location:** `notebooks/compare_methods.ipynb:313-316`
39
+ - **Recommendation:** Consider adding a stronger episode boundary signal (e.g., explicit wrapper reset hook or observation-based reset trigger).
40
 
41
  ## Next Actions
42
 
43
+ 1. Proceed to Step 1.4.
44
+ 2. Optionally harden reset boundary logic before large-scale eval runs.
 
 
client.py CHANGED
@@ -1,6 +1,5 @@
1
  from typing import Any, Dict, Iterable
2
 
3
- import torch
4
  from openenv.core.client_types import StepResult
5
 
6
  from openenv.core.env_server.interfaces import Message
@@ -50,24 +49,11 @@ class SQLEnvClient(EnvClient[SQLAction, SQLObservation, SQLState]):
50
  )
51
 
52
  def _parse_state(self, payload: Dict[str, Any]) -> SQLState:
53
- # Parse history messages
54
- history_messages = payload.get("history_messages", [])
55
-
56
- # Parse history tokens - convert lists back to tensors
57
- history_tokens_data = payload.get("history_tokens", [])
58
- history_tokens = []
59
- for token_list in history_tokens_data:
60
- if token_list:
61
- history_tokens.append(torch.tensor(token_list))
62
- else:
63
- history_tokens.append(torch.tensor([]))
64
-
65
  return SQLState(
66
  episode_id=payload.get("episode_id"),
67
  step_count=payload.get("step_count", 0),
68
- history_messages=history_messages,
69
- history_tokens=history_tokens,
70
- current_action_type=payload.get("current_action_type", "query"),
71
  )
72
 
73
  def _detect_action_type(self, message_content: str) -> str:
 
1
  from typing import Any, Dict, Iterable
2
 
 
3
  from openenv.core.client_types import StepResult
4
 
5
  from openenv.core.env_server.interfaces import Message
 
49
  )
50
 
51
  def _parse_state(self, payload: Dict[str, Any]) -> SQLState:
 
 
 
 
 
 
 
 
 
 
 
 
52
  return SQLState(
53
  episode_id=payload.get("episode_id"),
54
  step_count=payload.get("step_count", 0),
55
+ history_messages=payload.get("history_messages", []),
56
+ current_action_type=payload.get("current_action_type", "QUERY"),
 
57
  )
58
 
59
  def _detect_action_type(self, message_content: str) -> str:
configs/colab_l4.json ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_name": "Qwen/Qwen3-0.6B",
3
+ "questions_path": "data/questions/questions_train.json",
4
+ "db_dir": "data/databases",
5
+ "output_dir": "outputs/grpo_run",
6
+ "num_train_epochs": 1,
7
+ "per_device_train_batch_size": 8,
8
+ "num_generations": 8,
9
+ "max_completion_length": 512,
10
+ "step_budget": 10,
11
+ "logging_steps": 10,
12
+ "precision": "bf16",
13
+ "enable_thinking": false,
14
+ "num_completions_to_print": 1
15
+ }
configs/test_cpu.json ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_name": "Qwen/Qwen3-0.6B",
3
+ "questions_path": "data/questions/questions_train.json",
4
+ "db_dir": "data/databases",
5
+ "output_dir": "/tmp/sqlenv_test",
6
+ "num_train_epochs": 1,
7
+ "max_steps": 2,
8
+ "per_device_train_batch_size": 2,
9
+ "num_generations": 2,
10
+ "max_completion_length": 256,
11
+ "step_budget": 3,
12
+ "logging_steps": 1,
13
+ "precision": "fp32",
14
+ "dataset_size": 4,
15
+ "enable_thinking": false,
16
+ "num_completions_to_print": 2
17
+ }
data/databases/car_1/car_1.sqlite ADDED
Binary file (65.5 kB). View file
 
data/databases/concert_singer/concert_singer.sqlite ADDED
Binary file (36.9 kB). View file
 
data/databases/cre_Doc_Template_Mgt/cre_Doc_Template_Mgt.sqlite ADDED
Binary file (24.6 kB). View file
 
data/databases/dog_kennels/dog_kennels.sqlite ADDED
Binary file (49.2 kB). View file
 
data/databases/employee_hire_evaluation/employee_hire_evaluation.sqlite ADDED
Binary file (36.9 kB). View file
 
data/databases/flight_2/flight_2.sqlite ADDED
Binary file (77.8 kB). View file
 
data/databases/pets_1/pets_1.sqlite ADDED
Binary file (16.4 kB). View file
 
data/databases/poker_player/poker_player.sqlite ADDED
Binary file (20.5 kB). View file
 
data/databases/student_assessment/student_assessment.sqlite ADDED
Binary file (57.3 kB). View file
 
data/databases/world_1/world_1.sqlite ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:17b986695f16786d58d66f85e49dba87bdfe72953207ab9b1b49da9d2301ef65
3
+ size 319488
data/sft/sft_trajectories.json ADDED
The diff for this file is too large to render. See raw diff
 
docs/.DS_Store ADDED
Binary file (6.15 kB). View file
 
docs/ARCHITECTURE.md CHANGED
@@ -1,13 +1,13 @@
1
  # Architecture
2
 
3
- > Last updated: 2026-02-28
4
 
5
  System map for SQLEnv — an RL environment where agents learn interactive SQL exploration via the OpenEnv framework.
6
 
7
  **Goals:**
8
  - Show how components connect (system map + key flows)
9
  - Make hidden state explicit (what lives where)
10
- - Define shared interfaces (Pydantic models, WebSocket API)
11
  - Keep invariants legible (what must stay true)
12
 
13
  **Non-goals:**
@@ -22,48 +22,53 @@ System map for SQLEnv — an RL environment where agents learn interactive SQL e
22
  SQLEnv System
23
  ================================================================
24
 
25
- RL Training Loop SQLEnv Server (Docker)
26
- ---------------- ----------------------
27
- +---------------------+
28
- +------------+ WebSocket (JSON) | server/app.py |
29
- | SQLEnv |<=========================>| FastAPI + WS |
30
- | Client | SQLAction -> server | |
31
- | (client.py)| SQLObs <- server +----------+----------+
32
- +-----+------+ |
33
- | v
34
- | tensor <-> list +---------------------+
35
- | serialization | SQLEnvironment |
36
- | | (sql_environment.py)|
37
- +-----v------+ | |
38
- | RL Agent | | - reset() / step() |
39
- | (external) | | - action detection |
40
- | e.g. GRPO | | - message_to_action |
41
- +------------+ +--+-------+-------+--+
42
- | | |
43
- v v v
44
- +------+ +------+ +--------+
45
- |Schema| |Sample| | Query |
46
- |Intro-| |Gen | | (Ollama|
47
- |spect.| | | | LLM) |
48
- +--+---+ +--+---+ +---+----+
49
- | | |
50
- v v v
51
- +-------------------------+
52
- | SQLAlchemy ORM Models |
53
- | (data/databases/ |
54
- | models.py) |
55
- | 9 tables: |
56
- | Address, Person, |
57
- | Student, Course, ... |
58
- +-------------------------+
59
-
60
- Data (committed) External (optional)
61
- ---------------- -------------------
62
- data/questions/ +----------+
63
- student_assessment.json | Ollama |
64
- (53 Spider Q&A pairs) | LLM API |
65
- | :11434 |
66
- +----------+
 
 
 
 
 
67
  ```
68
 
69
  ---
@@ -72,19 +77,27 @@ System map for SQLEnv — an RL environment where agents learn interactive SQL e
72
 
73
  | Component | Owns | Entrypoint | State / Output |
74
  |-----------|------|------------|----------------|
75
- | **SQLEnvClient** | WebSocket transport, tensor serialization | `client.py` | Stateless (wraps server) |
76
- | **FastAPI app** | HTTP/WS endpoints, tokenizer factory | `server/app.py` | In-memory tokenizer |
77
- | **SQLEnvironment** | Episode lifecycle, action dispatch, state | `server/sql_environment.py` | `SQLState` (in-memory) |
78
  | **Pydantic models** | Type contracts (action, observation, state) | `models.py` | N/A (data classes) |
79
- | **ORM models** | Database schema definition | `data/databases/models.py` | SQLAlchemy metadata |
80
- | **Spider data** | Question-answer pairs | `data/questions/student_assessment.json` | 53 Q&A entries |
81
- | **MockTokenizer** | Dev/test tokenization (no GPU needed) | `server/test_sql_env.py` | Deterministic (ord/chr) |
82
-
83
- ### External Services
84
-
85
- | Service | Purpose | Required | Fallback |
86
- |---------|---------|----------|----------|
87
- | Ollama (`localhost:11434`) | Table selection + SQL generation | No | First table in dict; query returns error string |
 
 
 
 
 
 
 
 
88
 
89
  ---
90
 
@@ -93,201 +106,306 @@ System map for SQLEnv — an RL environment where agents learn interactive SQL e
93
  ### Flow: Episode (Reset + Multi-Turn Steps)
94
 
95
  ```text
96
- Client Server (SQLEnvironment) Ollama
97
- | | |
98
- |--- reset() ----------------->| |
99
- | |-- init state, system prompt |
100
- | |-- tokenize system message |
101
- |<-- SQLObservation -----------| (MockTokenizer or HF) |
102
- | .messages=[system] | |
103
- | .tokens=shape([N]) | |
104
- | | |
105
- |--- message_to_action(msg) -->| |
106
- | |-- detect action type |
107
- | | (keyword matching) |
108
- | |-- append msg to history |
109
- | |-- tokenize full conversation |
110
- |<-- SQLAction ----------------| |
111
- | .action_type="describe" | |
112
- | .tokens=shape([1,M]) | |
113
- | | |
114
- |--- step(action) ------------>| |
115
- | |-- select table -------------->|
116
- | |<-- table name (or fallback) --|
117
- | |-- introspect ORM schema |
118
- | |-- append assistant msg |
119
- | |-- append action tokens |
120
- |<-- SQLObservation -----------| |
121
- | .messages=[sys,usr,asst] | |
122
- | .tokens=shape([N+M+K]) | |
123
- | | |
124
- (repeat step() for sample, query, answer...)
 
 
 
 
 
 
 
125
  ```
126
 
127
- ### Flow: Action Detection
128
 
129
  ```text
130
- User message string
131
- |
 
 
 
 
 
 
 
 
 
 
 
 
 
132
  v
133
- _detect_action_type(content)
134
- |
135
- +-- contains "describe"/"schema"/"columns"? --> "describe"
136
- |
137
- +-- contains "sample"/"example"/"rows"? --> "sample"
138
- |
139
- +-- default --> "query"
 
 
 
 
140
  ```
141
 
142
- ### Flow: Client Serialization (WebSocket Transport)
143
 
144
  ```text
145
- Client Server
146
- | |
147
- | _step_payload(action): |
148
- | tokens: Tensor -> list (JSON-safe) |
149
- | {action_type, action_description, |
150
- | tokens: [[1,2,3,...]], metadata} |
151
- | ---------------------------------------->|
152
- | |
153
- | _parse_result(data): |
154
- | tokens: list -> Tensor |
155
- | StepResult(obs, reward, done, info) |
156
- | <----------------------------------------|
 
 
 
 
 
157
  ```
158
 
159
  ---
160
 
161
  ## Shared Data Models
162
 
163
- These three Pydantic models are used across client, server, and tests.
164
- Defined in `models.py`.
165
 
166
- ### SQLAction
167
 
168
  ```python
169
  class SQLAction(Action):
170
- action_type: str # "describe" | "sample" | "query" | "answer"
171
- action_description: str # raw user message content
172
- tokens: torch.Tensor # tokenized conversation context, shape [1, seq_len]
173
  ```
174
 
175
- **Used by:** SQLEnvironment.step(), SQLEnvClient._step_payload(), tests
176
-
177
- ### SQLObservation
178
 
179
  ```python
180
  class SQLObservation(Observation):
181
- messages: list[Message] # full conversation history [{role, content}, ...]
182
- tokens: torch.Tensor # flattened 1D tensor of all turn tokens concatenated
 
 
 
 
 
 
183
  ```
184
 
185
- **Used by:** SQLEnvironment.reset()/step(), SQLEnvClient._parse_result(), tests
186
-
187
- ### SQLState
188
 
189
  ```python
190
  class SQLState(State):
191
- episode_id: str # UUID per episode
192
- step_count: int # turns taken
193
- history_messages: list[Message] # accumulates across turns
194
- history_tokens: list[torch.Tensor] # one tensor per turn, flattened on output
195
- current_action_type: str | None # last detected action type
196
  ```
197
 
198
- **Used by:** SQLEnvironment (internal), state endpoint
199
- **Note:** This is a lightweight summary for logging. The full RL state lives inside SQLEnvironment and is not exposed to the agent.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
200
 
201
  ---
202
 
203
  ## API Contracts
204
 
205
- ### WebSocket (OpenEnv Protocol)
206
 
207
- The server exposes a WebSocket endpoint via FastAPI. The OpenEnv framework handles the protocol — SQLEnv implements `reset()` and `step()` on the server side, and `SQLEnvClient` wraps the client side.
208
 
209
- | Operation | Client Method | Payload | Response |
210
- |-----------|---------------|---------|----------|
211
- | Reset | `client.reset()` | `{}` | `SQLObservation` (JSON) |
212
- | Step | `client.step(action)` | `{action_type, action_description, tokens: list, metadata}` | `StepResult(obs, reward, done, info)` |
213
- | State | `client.state()` | `{}` | `SQLState` (JSON) |
214
 
215
- ### Ollama (Optional)
216
 
217
- | Endpoint | Purpose | Payload |
218
- |----------|---------|---------|
219
- | `POST /api/generate` | Table selection | `{model, prompt, stream: false}` |
220
- | `POST /api/generate` | SQL generation | `{model, prompt, stream: false}` |
221
 
222
- Timeout: 30s. Failure mode: graceful fallback (never crashes).
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
223
 
224
  ---
225
 
226
  ## Cross-Cutting Concerns
227
 
228
- ### Code Style & Abstraction Philosophy
229
-
230
- OOP for framework integration (Environment, EnvClient subclasses), plain methods for logic. Extract helpers when they clarify intent, not for DRY.
231
 
232
- - **Structure:** Flat package root with `server/` for server-only code
233
- - **Error handling:** Graceful fallbacks (never crash), `ValueError` for invalid inputs
234
- - **Imports:** `try: from sql_env.X / except: from X` for dual install/Docker compatibility
 
 
 
235
 
236
- ### Tokenization
237
 
238
- Two paths, same interface (`apply_chat_template`):
 
 
 
 
239
 
240
- | Mode | Tokenizer | Source | When |
241
- |------|-----------|--------|------|
242
- | Dev/Test | `MockTokenizer` | `server/test_sql_env.py` | No GPU, no downloads |
243
- | Production | HuggingFace | `transformers` library | Real RL training |
244
 
245
- `MockTokenizer` encodes as `ord(c)` per character, decodes as `chr(t)`. Deterministic and fast.
 
 
 
 
 
 
246
 
247
  ### Configuration
248
 
249
- | Variable | Required | Description | Default |
250
- |----------|----------|-------------|---------|
251
- | `OLLAMA_MODEL` | No | Ollama model name for SQL generation | `qwen2` |
252
- | `OLLAMA_BASE_URL` | No | Ollama API endpoint | `http://localhost:11434` |
 
 
253
 
254
  ---
255
 
256
- ## Data, State, and Storage Locations
257
 
258
- - **Repo (committed):**
259
- - `data/questions/student_assessment.json` — 53 Spider Q&A pairs
260
- - `data/databases/models.py` — 9 SQLAlchemy ORM table definitions
261
- - **Runtime state (in-memory, per episode):**
262
- - `SQLState.history_messages` — conversation messages
263
- - `SQLState.history_tokens` — tensor per turn
264
- - **Not yet implemented:**
265
- - SQLite database files (Phase 3 — queries currently go through Ollama, not executed locally)
266
- - Reward/verification state
267
 
268
- ---
 
 
 
 
 
269
 
270
- ## Invariants and Guardrails
271
 
272
- - `self.db_models` refers to **database table** models (SQLAlchemy), never RL models
273
- - Token tensors grow monotonically across turns (never shrink or reset mid-episode)
274
- - `message_to_action()` mutates state it appends to history before tokenizing
275
- - Ollama failures never crash the environment — always graceful fallback
276
- - `tests/test_smoke.py` must pass without Ollama, without GPU, without network
277
- - Schema column names in `_build_schema_description()` must match `data/databases/models.py`
278
 
279
  ---
280
 
281
- ## Glossary
282
 
283
- | Term | Definition |
284
- |------|------------|
285
- | Episode | One question-answering session: reset -> N steps -> terminal |
286
- | Action type | One of: describe, sample, query, answer |
287
- | MockTokenizer | Deterministic char-code tokenizer for dev/test (no GPU) |
288
- | Spider | Academic text-to-SQL benchmark dataset |
289
- | ORM models | SQLAlchemy class definitions in `data/databases/models.py` |
290
- | OpenEnv | Meta's RL environment framework (Environment, EnvClient, Action, Observation) |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
291
 
292
  ---
293
 
@@ -295,61 +413,52 @@ Two paths, same interface (`apply_chat_template`):
295
 
296
  ### Development
297
 
298
- **Prerequisites:**
299
- - Python 3.11-3.12 (torch incompatible with 3.13)
300
- - `uv` package manager
301
- - Ollama (optional)
302
 
303
- **Setup:**
304
  ```bash
305
  git clone <repo-url> && cd sql-env
306
  uv sync
307
- uv run pytest tests/ -v # 21 tests, ~3.5s, no external deps
 
308
  ```
309
 
310
- ### Production
311
 
312
- **Deployment:** Docker container via OpenEnv CLI (`openenv build` / `openenv push`)
313
- **Runtime:** FastAPI on port 8000 (defined in `openenv.yaml`)
314
- **Status:** Dockerfile is a scaffold stub — not yet validated
315
 
316
- ---
317
-
318
- ## Suggested Feature Breakdown
319
-
320
- | ID | Feature | Complexity | Dependencies | Notes |
321
- |----|---------|------------|--------------|-------|
322
- | F001 | SQL query execution | standard | - | Execute queries against real SQLite, return results |
323
- | F002 | Reward computation | standard | F001 | 3-layer reward: operational, progress, terminal |
324
- | F003 | Answer verification | standard | F001 | Compare agent answer to gold SQL results |
325
- | F004 | Docker validation | simple | - | Update Dockerfile, test `openenv build` |
326
- | F005 | Multi-database support | complex | F001 | Load any Spider database, not just student_assessment |
327
-
328
- ### Suggested Implementation Order
329
 
330
- 1. **F001** Foundation: wire up SQLite execution so queries return real data
331
- 2. **F002 + F003** — Can be done in parallel once F001 is complete
332
- 3. **F004** — Independent, can be done anytime
333
- 4. **F005** — After the single-database path is solid
334
 
335
  ---
336
 
337
- ## Future Considerations
338
 
339
- - **Real SQLite execution:** Queries currently go to Ollama for SQL generation but aren't executed against a database. Phase 3 should execute the generated SQL and return actual results.
340
- - **Multi-episode batching:** For RL training, the environment will need to support multiple concurrent episodes efficiently.
341
- - **Reward shaping:** The 3-layer reward (operational, progress, terminal) is designed in `models.py` but not implemented.
342
- - **Table selection without Ollama:** A lightweight keyword/embedding-based table selector could replace the LLM fallback.
 
343
 
344
  ---
345
 
346
- ## Keeping This Map Current
347
 
348
- Update this file when you change any of:
349
- - System boundaries (new service, new subsystem)
350
- - Persistent state locations (new files/dirs written or read)
351
- - Shared data models or API contracts
352
- - Cross-cutting invariants
 
 
 
 
 
 
 
353
 
354
  ---
355
 
@@ -357,5 +466,8 @@ Update this file when you change any of:
357
 
358
  - Docs index: `docs/README.md`
359
  - Operations: `docs/RUNBOOK.md`
 
 
360
  - OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
361
  - Spider dataset: https://huggingface.co/datasets/xlangai/spider
 
 
1
  # Architecture
2
 
3
+ > Last updated: 2026-03-29
4
 
5
  System map for SQLEnv — an RL environment where agents learn interactive SQL exploration via the OpenEnv framework.
6
 
7
  **Goals:**
8
  - Show how components connect (system map + key flows)
9
  - Make hidden state explicit (what lives where)
10
+ - Define shared interfaces (Pydantic models, HTTP API)
11
  - Keep invariants legible (what must stay true)
12
 
13
  **Non-goals:**
 
22
  SQLEnv System
23
  ================================================================
24
 
25
+ RL Training SQLEnv Server (Docker)
26
+ ───────────── ──────────────────────
27
+ +──────────────+ +─────────────────────+
28
+ TRL GRPO │ │ server/app.py
29
+ Trainer │ HTTP (JSON) │ FastAPI + OpenEnv │
30
+ │ │<========================>
31
+ training/ │ SQLAction -> server +──────────┬──────────+
32
+ │ trl_adapter │ SQLObs <- server │
33
+ │ .py │ v
34
+ +──────────────+ +─────────────────────+
35
+ │ │ SQLEnvironment
36
+ OR │ (sql_environment.py)
37
+ v
38
+ +──────────────+ │ reset() / step()
39
+ Custom │ │ action dispatch │
40
+ rollout_func │ +──┬──────┬──────┬────+
41
+ │ (rollout.py) │ │ │ │
42
+ +──────────────+ v v v
43
+ +────────────────────────+
44
+ Evaluation │ Action Handlers │
45
+ ────────── │ DESCRIBE PRAGMA │
46
+ +──────────────+ │ SAMPLE SELECT N │
47
+ evaluate() │──> env.reset/step │ QUERY → SQL exec │
48
+ policies │ │ ANSWER → verifier │
49
+ │ .py │ +────────┬───────────────+
50
+ +──────────────+ │
51
+ v
52
+ +──────────────+ +────────────────────────+
53
+ Policies │ │ SQLite (read-only) │
54
+ │ RandomPolicy │ │ data/databases/ │
55
+ OraclePolicy │ │ {db_id}/{db_id}.sqlite │
56
+ +──────────────+ +────────────────────────+
57
+
58
+ ┌───────┴───────┐
59
+ v v
60
+ +───────────+ +───────────+
61
+ │ reward.py │ │verifier.py│
62
+ │ 3-layer │ │ type-aware│
63
+ dense │ │ comparison│
64
+ +───────────+ +───────────+
65
+
66
+ Data (committed) Synthetic (optional)
67
+ ──────────────── ────────────────────
68
+ data/questions/ server/synthetic/
69
+ questions_train.json (473 Q) generate.py
70
+ questions_eval.json (203 Q) mutations.py
71
+ db_list.json (10 databases) validate.py
72
  ```
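The read-only SQLite access and the DESCRIBE handler shown in the map can be sketched with stdlib `sqlite3`; function names here are illustrative, not the repo's actual handler API:

```python
import sqlite3

def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open a SQLite file read-only, as the server does for episode DBs."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    """DESCRIBE handler sketch via PRAGMA table_info.

    PRAGMA does not accept bound parameters, so `table` must be validated
    against the known table list before interpolation.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # rows are (cid, name, type, notnull, dflt_value, pk)
    return ", ".join(f"{name} {col_type}" for _, name, col_type, *_ in rows)
```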
73
 
74
  ---
 
77
 
78
  | Component | Owns | Entrypoint | State / Output |
79
  |-----------|------|------------|----------------|
80
+ | **SQLEnvironment** | Episode lifecycle, action dispatch, step budget | `server/sql_environment.py` | `EpisodeContext` (in-memory, per episode) |
81
+ | **FastAPI app** | HTTP endpoints, tokenizer factory | `server/app.py` | Stateless (delegates to environment) |
82
+ | **SQLEnvClient** | HTTP transport, payload serialization | `client.py` | Stateless (wraps server) |
83
  | **Pydantic models** | Type contracts (action, observation, state) | `models.py` | N/A (data classes) |
84
+ | **Reward engine** | 3-layer dense reward computation | `server/reward.py` | Mutates `EpisodeContext` accumulators |
85
+ | **Answer verifier** | Type-aware answer comparison | `server/verifier.py` | Stateless (pure function) |
86
+ | **GRPO pipeline** | Training orchestration, rollout, reward callables | `training/` (6 modules) | Training artifacts in `outputs/` |
87
+ | **TRL adapter** | `environment_factory` for TRL GRPOTrainer | `training/trl_adapter.py` | Per-session environment instances |
88
+ | **Evaluation** | Policy protocol, evaluate() runner | `evaluation/policies.py` | `EvaluationResult` metrics |
89
+ | **Oracle policy** | Deterministic upper-bound baseline | `evaluation/oracle_policy.py` | Stateless per-step |
90
+ | **Synthetic DB gen** | Metamorphic testing via data mutations | `server/synthetic/` | Variant SQLite files |
91
+ | **Question dataset** | 676 curated Spider questions across 10 DBs | `data/questions/` | JSON files |
92
+
93
+ ### External Dependencies
94
+
95
+ | Dependency | Purpose | Required |
96
+ |------------|---------|----------|
97
+ | SQLite (stdlib) | Database execution | Yes |
98
+ | OpenEnv (`openenv-core`) | Environment protocol, `create_app` | Yes |
99
+ | TRL (`trl`) | GRPO training | Only for training |
100
+ | HuggingFace Transformers | Tokenizer loading | Only for production server |
101
 
102
  ---
103
 
 
106
  ### Flow: Episode (Reset + Multi-Turn Steps)
107
 
108
  ```text
109
+ Client / Policy SQLEnvironment
110
+ │ │
111
+ │── reset(seed=42) ────────────────>
112
+ │ │── pick question (random or seeded)
113
+ │ │── open read-only SQLite connection
114
+ │ │── execute gold_sql, store gold_rows
115
+ │ │── init EpisodeContext (budget=15)
116
+ │ <── SQLObservation ────────────────│
117
+ │ .question="How many students?" │
118
+ │ .schema_info="Tables: student" │ (column details hidden)
119
+ │ .budget_remaining=15 │
120
+ │ │
121
+ │── step(DESCRIBE student) ────────>
122
+ │ │── PRAGMA table_info(student)
123
+ │ │── add to described_tables
124
+ │ │── compute_step_reward()
125
+ │ <── SQLObservation ────────────────│
126
+ │ .schema_info="student: id INT" │ (columns now revealed)
127
+ │ .result="5 columns, 20 rows" │
128
+ │ .reward=0.02 │
129
+ │ .budget_remaining=14 │
130
+ │ │
131
+ │── step(QUERY "SELECT COUNT(*)...") │
132
+ │ │── validate (SELECT-only, single stmt)
133
+ │ │── execute with 5s timeout
134
+ │ │── compute_step_reward() (L1 + L2)
135
+ │ <── SQLObservation ────────────────│
136
+ │ .result="| COUNT(*) |\n| 20 |" │
137
+ │ .reward=0.035 │
138
+ │ │
139
+ │── step(ANSWER "20") ─────────────> │
140
+ │ │── verify_answer("20", gold, type)
141
+ │ │── terminal reward: +1.0 or 0.0
142
+ │ <── SQLObservation ────────────────│
143
+ │ .done=true │
144
+ │ .reward=1.0 │
145
  ```
146
 
147
+ ### Flow: 3-Layer Reward Computation
148
 
149
  ```text
150
+ step() called with action
151
+
152
+ v
153
+ Layer 1: Operational Shaping (every action)
154
+ ├── exec_ok? → +0.02
155
+ ├── new SQL hash? → +0.01 (per unique query, no cumulative cap)
156
+ ├── repeated SQL? → -0.01
157
+ └── step cost → -0.005
158
+
159
+ v (only if action_type == QUERY and no error)
160
+ Layer 2: Progress Shaping (delta-from-previous, PBRS)
161
+ ├── cardinality score (25%) — |pred_rows - gold_rows| / max
162
+ ├── value overlap (50%) — Jaccard of cell values
163
+ └── numeric range (25%) — log-distance proximity
164
+
165
  v
166
+ bin to {0.0, 0.25, 0.5, 0.75, 1.0}
167
+ delta = binned - previous_progress → delta * 0.15
168
+ (positive = improvement, negative = regression)
169
+
170
+ v
171
+ Clip per step to [-0.05, +0.15]
172
+ No cumulative tracking
173
+
174
+ v (on ANSWER action)
175
+ Layer 3: Terminal Correctness
176
+ └── verify_answer() → +1.0 (correct) or 0.0 (wrong)
177
  ```
178
 
179
+ ### Flow: TRL Training Integration
180
 
181
  ```text
182
+ GRPOTrainer
183
+
184
+ │── discovers tool methods via docstrings
185
+ │ (describe, sample, query, answer)
186
+
187
+ │── per rollout:
188
+ │ SQLEnvTRL() → SQLEnvironment (internal)
189
+ │ .reset() → observation string
190
+ │ .describe(table) → schema string
191
+ │ .query(sql) → result string
+ │ .answer(value) → final string
193
+
194
+ │── reward:
195
+ │ sql_env_reward_func() → accumulated .reward
196
+
197
+ v
198
+ Training loop (GRPO: generate N completions, rank by reward)
199
  ```
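TRL reward callables take a batch of completions and return one float per completion. A sketch of the accumulation step follows; the `finished_envs` wiring is an assumption, not the repo's actual `sql_env_reward_func`:

```python
def sql_env_reward_func(completions, finished_envs=None, **kwargs):
    """Return one accumulated reward per completion (illustrative only)."""
    envs = finished_envs or []
    rewards = []
    for i, _completion in enumerate(completions):
        env = envs[i] if i < len(envs) else None
        # each environment instance accumulated .reward across its rollout
        rewards.append(getattr(env, "reward", 0.0) or 0.0)
    return rewards
```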
200
 
201
  ---
202
 
203
  ## Shared Data Models
204
 
205
+ Defined in `models.py`. These cross the HTTP boundary between client and server.
 
206
 
207
+ ### SQLAction (agent -> server)
208
 
209
  ```python
210
  class SQLAction(Action):
211
+     action_type: str   # DESCRIBE | SAMPLE | QUERY | ANSWER
212
+     argument: str      # table name, SQL string, or answer value
 
213
  ```
214
 
215
+ ### SQLObservation (server -> agent)
 
 
216
 
217
  ```python
218
  class SQLObservation(Observation):
219
+     question: str              # NL question to answer
220
+     schema_info: str           # known schema (incrementally revealed)
221
+     result: str                # last action result (truncated)
222
+     error: str                 # error message if action failed
223
+     step_count: int            # current step number
224
+     budget_remaining: int      # steps left
225
+     action_history: list[str]  # summary of prior actions
226
+     # Inherited: done (bool), reward (float | None)
227
  ```
228
 
229
+ ### SQLState (metadata endpoint)
 
 
230
 
231
  ```python
232
  class SQLState(State):
233
+     history_messages: list[Message]
234
+     current_action_type: str
 
 
 
235
  ```
236
 
237
+ ### Server-Only Types (never sent to agent)
238
+
239
+ ```python
240
+ @dataclass
241
+ class QuestionRecord:
242
+     question_id: str
243
+     question_text: str
244
+     database_name: str
245
+     gold_sql: str
246
+     gold_answer: str
247
+     answer_type: str       # integer | float | string | list
248
+     difficulty: str        # easy | medium | hard
249
+     tables_involved: list[str]
250
+
251
+ @dataclass
252
+ class EpisodeContext:
253
+     episode_id: str
254
+     db_connection: sqlite3.Connection
255
+     question_record: QuestionRecord
256
+     step_count: int = 0
257
+     budget: int = 15
258
+     described_tables: set[str] = field(default_factory=set)
259
+     action_log: list[str] = field(default_factory=list)
260
+     done: bool = False
261
+     gold_answer: str | None = None
262
+     gold_rows: list[tuple] = field(default_factory=list)
263
+     # Reward accumulators
264
+     query_hashes: set[str] = field(default_factory=set)
265
+     best_progress: float = 0.0
266
+     cumulative_step_reward: float = 0.0
267
+     cumulative_new_info_reward: float = 0.0
268
+ ```
269
+
270
+ **POMDP design:** The agent sees `SQLObservation`; the server holds `EpisodeContext`. The agent never sees gold answers, progress scores, or the full database. This separation forces exploration.
271
 
272
  ---
273
 
274
  ## API Contracts
275
 
276
+ ### HTTP (OpenEnv Protocol)
277
 
278
+ The server exposes HTTP endpoints via `openenv.core.env_server.create_app()`.
279
 
280
+ | Operation | Method | Payload | Response |
281
+ |-----------|--------|---------|----------|
282
+ | Reset | `POST /reset` | `{seed: int}` (optional) | `SQLObservation` (JSON) |
283
+ | Step | `POST /step` | `{action_type, argument, metadata}` | `{observation, reward, done, info}` |
284
+ | State | `GET /state` | | `SQLState` (JSON) |
285
 
286
+ ### Evaluation API
287
 
288
+ ```python
289
+ # Policy protocol
290
+ class Policy(Protocol):
291
+     def select_action(self, observation: SQLObservation) -> SQLAction: ...
292
 
293
+ # Built-in policies
294
+ RandomPolicy() # random baseline
295
+ OraclePolicy(questions) # gold-answer upper bound
296
+
297
+ # Runner
298
+ evaluate(env, policy, n_episodes, seed) -> EvaluationResult
299
+ # .success_rate, .avg_reward, .avg_steps, .episodes[]
300
+ ```
301
+
302
+ ### TRL Adapter API
303
+
304
+ ```python
305
+ SQLEnvTRL.configure(questions_path, db_dir, step_budget) # class method
306
+ # Tool methods (auto-discovered by TRL):
307
+ SQLEnvTRL.describe(table_name: str) -> str
308
+ SQLEnvTRL.sample(table_name: str) -> str
309
+ SQLEnvTRL.query(sql: str) -> str
310
+ SQLEnvTRL.answer(value: str) -> str
311
+ ```
312
 
313
  ---
314
 
315
  ## Cross-Cutting Concerns
316
 
317
+ ### SQL Safety
 
 
318
 
319
+ All database access enforces:
320
+ - **Read-only** SQLite connections (`file:...?mode=ro`)
321
+ - **SELECT-only** — rejects INSERT, UPDATE, DELETE, ALTER, DROP
322
+ - **Single statement** — rejects `; ...` (no stacked queries)
323
+ - **5-second timeout** via SQLite progress handler
324
+ - **20-row truncation** on all result sets
325
 
326
+ ### POMDP Structure
327
 
328
+ The partial observability is deliberate and load-bearing:
329
+ - Agent sees table names at reset but **not column details** (must DESCRIBE)
330
+ - Query results are **truncated** (at most 20 rows)
331
+ - Agent never sees `gold_answer`, `best_progress`, or `gold_rows`
332
+ - Step budget (default 15) forces strategic allocation of exploration
333
 
334
+ ### Import Compatibility
 
 
 
335
 
336
+ Dual import paths throughout for local vs Docker execution:
337
+ ```python
338
+ try:
339
+ from sql_env.models import SQLAction # local / pip install
340
+ except ImportError:
341
+ from models import SQLAction # Docker (PYTHONPATH=/app/env)
342
+ ```
343
 
344
  ### Configuration
345
 
346
+ | Variable | Required | Default | Purpose |
347
+ |----------|----------|---------|---------|
348
+ | `QUESTIONS_PATH` | No | `data/questions/student_assessment.json` | Questions JSON |
349
+ | `DB_DIR` | No | `data/databases/` | SQLite database directory |
350
+ | `TOKENIZER_NAME` | No | `mistralai/Mistral-7B-Instruct-v0.1` | HuggingFace tokenizer |
351
+ | `PORT` | No | `8000` | Server port (HF Spaces uses 7860) |
352
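The table above maps to straightforward environment-variable reads with defaults (a sketch; the actual config module layout is an assumption):

```python
import os

# Defaults mirror the configuration table above
QUESTIONS_PATH = os.environ.get("QUESTIONS_PATH", "data/questions/student_assessment.json")
DB_DIR = os.environ.get("DB_DIR", "data/databases/")
TOKENIZER_NAME = os.environ.get("TOKENIZER_NAME", "mistralai/Mistral-7B-Instruct-v0.1")
PORT = int(os.environ.get("PORT", "8000"))  # HF Spaces injects PORT=7860
```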
 
353
  ---
354
 
355
+ ## Data, State, and Storage
356
 
357
+ ### Committed Data
 
 
 
 
 
 
 
 
358
 
359
+ | Path | Contents |
360
+ |------|----------|
361
+ | `data/questions/questions_train.json` | 473 training questions across 10 DBs |
362
+ | `data/questions/questions_eval.json` | 203 evaluation questions across 10 DBs |
363
+ | `data/questions/db_list.json` | 10 Spider database IDs |
364
+ | `data/databases/models.py` | Legacy SQLAlchemy ORM models |
365
 
366
+ ### Downloaded Data (gitignored)
367
 
368
+ Spider SQLite databases in `data/databases/{db_id}/{db_id}.sqlite`. Downloaded via `scripts/download_spider_databases.py`. The 10 databases: student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation, pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player.
369
+
370
+ ### Runtime State (in-memory, per episode)
371
+
372
+ `EpisodeContext` holds all episode state: DB connection, gold data, reward accumulators, action history. Created on `reset()`, discarded when episode ends. Nothing persists between episodes.
 
373
 
374
  ---
375
 
376
+ <!-- ARCHITECTURE-SNAPSHOT-BEGIN -->
377
 
378
+ Snapshot (auto-managed)
379
+
380
+ - Repo signals: Python (pyproject.toml)
381
+ - Roots: tests/
382
+ - Entrypoint candidates: (none detected)
383
+
384
+ ```text
385
+ tests/
386
+ e2e/
387
+ test_training_e2e.py
388
+ integration/
389
+ test_training_pipeline.py
390
+ unit/
391
+ test_error_handling.py
392
+ test_grpo_config.py
393
+ test_oracle_policy.py
394
+ test_prompts.py
395
+ test_reward.py
396
+ test_rewards.py
397
+ test_rollout.py
398
+ test_sft_terminal_message.py
399
+ test_trl_adapter.py
400
+ test_evaluation.py
401
+ test_smoke.py
402
+ test_synthetic.py
403
+ test_trl_adapter.py
404
+ test_verifier.py
405
+ test_verifier_integration.py
406
+ ```
407
+
408
+ <!-- ARCHITECTURE-SNAPSHOT-END -->
409
 
410
  ---
411
 
 
413
 
414
  ### Development
415
 
416
+ **Prerequisites:** Python 3.11-3.12, `uv`, Docker (for deployment)
 
 
 
417
 
 
418
  ```bash
419
  git clone <repo-url> && cd sql-env
420
  uv sync
421
+ uv run python scripts/download_spider_databases.py
422
+ uv run pytest tests/ -v
423
  ```
424
 
425
+ ### Deployment
426
 
427
+ **Target:** HuggingFace Spaces (Docker, free tier)
 
 
428
 
429
+ ```bash
430
+ uv run openenv build # build Docker image
431
+ uv run openenv push # push to HF Spaces
432
+ ```
 
 
 
 
 
 
 
 
 
433
 
434
+ The Dockerfile uses a multi-stage build on top of `openenv-base`, runs as non-root `appuser`, bundles the Spider databases, and exposes `PORT` (default 7860 on HF Spaces).
 
 
 
435
 
436
  ---
437
 
438
+ ## Invariants
439
 
440
+ - Token tensors in `SQLState` grow monotonically across turns (never shrink mid-episode)
441
+ - `EpisodeContext` is server-only — leaking gold data to the agent breaks the POMDP
441
+ - Per-step rewards are clipped to `[-0.05, 0.15]`, so the terminal reward (+1.0) always dominates exploration (~0.3 max)
443
+ - `tests/` must pass without GPU, without network, without downloaded databases (mocks/fixtures)
444
+ - SQL execution never mutates the database (read-only mode enforced at connection level)
445
 
446
  ---
447
 
448
+ ## Glossary
449
 
450
+ | Term | Definition |
451
+ |------|------------|
452
+ | Episode | One question-answering session: reset -> N steps -> terminal |
453
+ | Action type | One of: DESCRIBE, SAMPLE, QUERY, ANSWER |
454
+ | POMDP | Partially observable MDP — agent acts under uncertainty |
455
+ | Spider | Academic text-to-SQL benchmark dataset (10 DBs used) |
456
+ | OpenEnv | Meta's RL environment framework (Environment, EnvClient) |
457
+ | Green Agent | OpenEnv's evaluation wrapper pattern |
458
+ | Oracle policy | Baseline that uses gold SQL/answer for reward ceiling validation |
459
+ | TRL | HuggingFace Transformer Reinforcement Learning library |
460
+ | GRPO | Group Relative Policy Optimization (RL algorithm used for training) |
461
+ | Dense reward | Per-step reward signal (vs sparse terminal-only reward) |
462
 
463
  ---
464
 
 
466
 
467
  - Docs index: `docs/README.md`
468
  - Operations: `docs/RUNBOOK.md`
469
+ - Vision: `vision/VISION.md`
470
+ - Feature specs: `specs/FEATURES.json`
471
  - OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
472
  - Spider dataset: https://huggingface.co/datasets/xlangai/spider
473
+ - TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv
docs/DOCS_CONTRACT.json ADDED
@@ -0,0 +1,74 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "optional_paths": [
3
+ "vision/README.md",
4
+ "vision/VISION.md",
5
+ "vision/ROADMAP.md"
6
+ ],
7
+ "required_paths": [
8
+ "AGENTS.md",
9
+ "docs/README.md",
10
+ "docs/ARCHITECTURE.md",
11
+ "docs/RUNBOOK.md",
12
+ "docs/QUALITY_SCORE.md",
13
+ "docs/FEATURE_SLICING.md",
14
+ "docs/DOCS_TAXONOMY.md",
15
+ "docs/design-docs/index.md",
16
+ "docs/design-docs/core-beliefs.md",
17
+ "docs/design-docs/decisions/0001-template.md",
18
+ "docs/exec-plans/README.md",
19
+ "docs/exec-plans/tech-debt-tracker.md",
20
+ "docs/discovery/index.md",
21
+ "docs/delivery-specs/index.md",
22
+ "docs/guides/README.md",
23
+ "docs/references/README.md",
24
+ "docs/exploration/README.md",
25
+ "docs/learnings/README.md",
26
+ "docs/learnings/architecture.md",
27
+ "docs/learnings/conventions.md",
28
+ "docs/learnings/workflow.md",
29
+ "docs/learnings/integrations.md",
30
+ "docs/learnings/gotchas.md",
31
+ "docs/learnings/security.md",
32
+ "docs/learnings/testing.md",
33
+ "docs/learnings/archived/README.md",
34
+ "docs/DOCS_CONTRACT.json"
35
+ ],
36
+ "map_files": [
37
+ "AGENTS.md"
38
+ ],
39
+ "index_files": [
40
+ "docs/README.md"
41
+ ],
42
+ "learnings_category_files": [
43
+ "docs/learnings/architecture.md",
44
+ "docs/learnings/conventions.md",
45
+ "docs/learnings/workflow.md",
46
+ "docs/learnings/integrations.md",
47
+ "docs/learnings/gotchas.md",
48
+ "docs/learnings/security.md",
49
+ "docs/learnings/testing.md"
50
+ ],
51
+ "learnings_budget": {
52
+ "max_bullets_per_category": 30,
53
+ "require_feature_id_suffix": true,
54
+ "enforce_dedupe_within_category": true,
55
+ "enforce_dedupe_across_categories": false
56
+ },
57
+ "agents_md_max_lines": 100,
58
+ "agents_md_max_learning_bullets_total": 0,
59
+ "orphan_spec_ignore_prefixes": [
60
+ "F001", "F002", "F003", "F004", "F005",
61
+ "F006", "F007", "F008", "F009", "F010",
62
+ "F013", "F014", "F015"
63
+ ],
64
+ "directory_types": {
65
+ "docs/guides": "how-to",
66
+ "docs/design-docs": "explanation",
67
+ "docs/discovery": "explanation",
68
+ "docs/delivery-specs": "reference",
69
+ "docs/exploration": "exploration",
70
+ "docs/references": "reference",
71
+ "docs/learnings": "reference",
72
+ "docs/exec-plans": "how-to"
73
+ }
74
+ }
docs/DOCS_TAXONOMY.md ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Documentation Taxonomy
2
+
3
+ Where to put new documentation. Inspired by [Diataxis](https://diataxis.fr/) but adapted for agentic engineering where both humans and AI agents consume docs.
4
+
5
+ ## Decision Tree
6
+
7
+ ```
8
+ "I need to write something down."
9
+
10
+ ├── Does it tell someone HOW TO do a specific task?
11
+ │ └── YES → docs/guides/ [how-to]
12
+
13
+ ├── Does it describe WHAT something IS (APIs, interfaces, facts)?
14
+ │ └── YES → docs/references/ [reference]
15
+ │ or → docs/ARCHITECTURE.md [reference]
16
+ │ or → docs/delivery-specs/ [reference]
17
+
18
+ ├── Does it explain WHY a decision was made?
19
+ │ └── YES → docs/design-docs/ [explanation]
20
+ │ or → vision/ [explanation]
21
+
22
+ ├── Does it validate WHETHER we should build something?
23
+ │ └── YES → docs/discovery/ [explanation]
24
+
25
+ ├── Is it a durable pattern extracted from work?
26
+ │ └── YES → docs/learnings/<category>.md [reference]
27
+
28
+ ├── Is it an idea, investigation, or scratchpad?
29
+ │ └── YES → docs/exploration/ [exploration]
30
+
31
+ └── Is it tracking active work progress?
32
+ └── YES → docs/exec-plans/ [how-to]
33
+ ```
34
+
35
+ ## Directory Purpose Map
36
+
37
+ | Directory | Diataxis Type | Audience | Created By | Mutability |
38
+ |-----------|--------------|----------|------------|------------|
39
+ | `docs/guides/` | How-to | Humans + Agents | Manual or extracted | Stable |
40
+ | `docs/design-docs/` | Explanation | Humans + Agents | `design-doc` skill | Append-only (decisions are permanent) |
41
+ | `docs/discovery/` | Explanation | Humans (agents read) | `discovery` skill | Per-feature, stable once validated |
42
+ | `docs/delivery-specs/` | Reference | Agents + Humans | `delivery-spec` skill | Stable once approved |
43
+ | `docs/references/` | Reference | Agents | Manual or generated | Updated as needed |
44
+ | `docs/learnings/` | Reference | Agents + Humans | `compound-engineer` | Append-only (budgeted) |
45
+ | `docs/exploration/` | Exploration | Humans only | Manual | Ephemeral -- graduate or archive |
46
+ | `docs/exec-plans/` | How-to | Humans (agents read) | Manual | Active then archived |
47
+ | `vision/` | Explanation | Humans + Agents | `strategy` skill or manual | Rare changes |
48
+ | `specs/` | Reference | Agents | Autocode skills | Per-feature lifecycle |
49
+
50
+ ## Self-Organization Rules
51
+
52
+ 1. **Start minimal.** Don't create directories until you have content for them. Skills create directories on-demand.
53
+ 2. **Graduate content.** Exploration docs that prove durable should move to the appropriate permanent location (learnings, guides, references).
54
+ 3. **One purpose per doc.** If a document is doing two things (e.g., explaining WHY and telling HOW), split it.
55
+ 4. **Agents navigate by maps.** Keep `AGENTS.md` as a pure index. Keep `docs/README.md` as the docs index. Don't inline content in indexes.
56
+ 5. **Enforce mechanically.** Use `DOCS_CONTRACT.json` and `opencode-ctx docs validate` to prevent drift. Narrative instructions degrade with context length; lints apply everywhere.
57
+
58
+ ## Sources
59
+
60
+ - [Diataxis](https://diataxis.fr/) -- Four-type documentation framework (Daniele Procida)
61
+ - [Cynefin](https://en.wikipedia.org/wiki/Cynefin_framework) -- Complex vs. ordered domains inform when to prescribe vs. let emerge (Dave Snowden)
62
+ - [OpenAI Harness Engineering](https://openai.com/index/harness-engineering/) -- "Give Codex a map, not a 1,000-page instruction manual"
docs/FEATURE_SLICING.md ADDED
@@ -0,0 +1,50 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Feature Slicing Strategy
2
+
3
+ This doc clarifies how we reconcile two goals that can conflict if handled naively:
4
+
5
+ 1. Capture human intent and constraints once (discovery / delivery / design).
6
+ 2. Ship small, low-risk PRs (vertical slices) with fast feedback.
7
+
8
+ ## Two Levels: Capability vs. Slice
9
+
10
+ We treat "feature" in two different ways depending on which artifact we are talking about.
11
+
12
+ ### 1) Capability Docs (Shared Context)
13
+
14
+ These artifacts capture durable intent/constraints and can (and often should) be shared across multiple slices:
15
+
16
+ - `docs/discovery/<capability>.json` and `docs/discovery/<capability>.md`
17
+ - Outcome, opportunity, PR/FAQ, taste (delights/frustrations/feeling), scope, unknowns
18
+ - Source of truth for "taste" when present
19
+
20
+ - `docs/delivery-specs/<capability>.md` (optional)
21
+ - Functional requirements + non-functional requirements
22
+ - Acceptance criteria and rollout notes
23
+ - Can cover a capability that will be delivered across multiple slices
24
+
25
+ - `docs/design-docs/<decision>.md` (optional)
26
+ - Architecture decisions (ADR-style) that may apply to many slices
27
+
28
+ ### 2) Feature Slices (Execution Units)
29
+
30
+ These are the units we track in `specs/FEATURES.json` and implement as small PRs:
31
+
32
+ - Each slice should be independently valuable or a necessary, verifiable step toward value.
33
+ - Each slice should have its own implementation + verification specs.
34
+ - Each slice can reference the same capability docs (discovery/delivery/design) via `feature.docs.*`.
35
+
36
+ Key rule:
37
+ - Multiple slices may share the same `docs.discovery_json` / `docs.delivery_spec` / `docs.design_doc`.
38
+ - Slices should NOT share the same `specs/F###-IMPLEMENTATION_SPEC.md`.
39
+
40
+ ## Practical Heuristics
41
+
42
+ - Prefer a single discovery doc per capability, then slice delivery into multiple `FEATURES.json` entries.
43
+ - Keep implementation specs bounded (max ~7-10 steps). If the plan is bigger, split into more slices.
44
+ - If two slices have different success criteria / taste / outcome, they should not share the same discovery JSON.
45
+
46
+ ## What This Buys Us
47
+
48
+ - No repeated interviews: taste is captured once and reused.
49
+ - Small PRs: execution stays incremental and testable.
50
+ - Lower drift: shared intent stays consistent, slice specs stay bounded.
docs/QUALITY_SCORE.md ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Quality Score
2
+
3
+ This is a lightweight rubric for keeping the repo maintainable as it grows.
4
+
5
+ ## Documentation
6
+
7
+ - `AGENTS.md` stays short and acts as a navigation map.
8
+ - Durable guidance lives in `docs/` with stable paths.
9
+ - `opencode-ctx docs validate` should be green before merging docs changes.
10
+
11
+ ## Determinism
12
+
13
+ - CLI output ordering and messages remain stable across runs.
docs/README.md CHANGED
@@ -9,12 +9,36 @@ This directory is the system-of-record for durable project knowledge.
9
  | **Guides** | [guides/README.md](guides/README.md) | how-to | Practical step-by-step procedures |
10
  | **Design** | [design-docs/index.md](design-docs/index.md) | explanation | Feature design, ADRs, decision rationale |
11
  | **ADR Template** | [design-docs/decisions/0001-template.md](design-docs/decisions/0001-template.md) | reference | Decision record template |
 
 
 
 
 
 
 
 
12
  | **References** | [references/README.md](references/README.md) | reference | External docs for agent context |
13
 
 
 
 
 
 
 
 
 
 
 
 
 
14
  ## System Docs
15
 
16
  - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md)
17
  - Operations: [RUNBOOK.md](RUNBOOK.md)
 
 
 
 
18
 
19
  ## Directory Structure
20
 
@@ -23,11 +47,36 @@ docs/
23
  ├── README.md # This file (index)
24
  ├── ARCHITECTURE.md # System design overview [reference]
25
  ├── RUNBOOK.md # Operations guide [how-to]
 
 
 
 
26
  ├── guides/ # How-to guides [how-to]
27
  │ └── README.md # Guide index
28
  ├── design-docs/ # Decision rationale [explanation]
29
  │ ├── index.md # Design docs catalogue
 
30
  │ └── decisions/ # Architectural Decision Records
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
  └── references/ # External docs [reference]
32
  └── README.md # External docs for agent context
33
  ```
 
9
  | **Guides** | [guides/README.md](guides/README.md) | how-to | Practical step-by-step procedures |
10
  | **Design** | [design-docs/index.md](design-docs/index.md) | explanation | Feature design, ADRs, decision rationale |
11
  | **ADR Template** | [design-docs/decisions/0001-template.md](design-docs/decisions/0001-template.md) | reference | Decision record template |
12
+ | **Core Beliefs** | [design-docs/core-beliefs.md](design-docs/core-beliefs.md) | explanation | Agent-first operating principles |
13
+ | **Discovery** | [discovery/index.md](discovery/index.md) | explanation | Validate ideas and capture taste |
14
+ | **Delivery Specs** | [delivery-specs/index.md](delivery-specs/index.md) | reference | Engineering handoff specs |
15
+ | **Exec Plans** | [exec-plans/README.md](exec-plans/README.md) | how-to | Complex multi-step work tracking |
16
+ | **Tech Debt** | [exec-plans/tech-debt-tracker.md](exec-plans/tech-debt-tracker.md) | how-to | Known debt and cleanup opportunities |
17
+ | **Exploration** | [exploration/README.md](exploration/README.md) | exploration | Ideas, research, scratchpad |
18
+ | **Learnings** | [learnings/README.md](learnings/README.md) | reference | Durable patterns from completed work |
19
+ | **Archived Learnings** | [learnings/archived/README.md](learnings/archived/README.md) | reference | Overflow learnings archive |
20
  | **References** | [references/README.md](references/README.md) | reference | External docs for agent context |
21
 
22
+ ## Learnings by Category
23
+
24
+ | Category | File |
25
+ |----------|------|
26
+ | Architecture | [learnings/architecture.md](learnings/architecture.md) |
27
+ | Conventions | [learnings/conventions.md](learnings/conventions.md) |
28
+ | Gotchas | [learnings/gotchas.md](learnings/gotchas.md) |
29
+ | Integrations | [learnings/integrations.md](learnings/integrations.md) |
30
+ | Security | [learnings/security.md](learnings/security.md) |
31
+ | Testing | [learnings/testing.md](learnings/testing.md) |
32
+ | Workflow | [learnings/workflow.md](learnings/workflow.md) |
33
+
34
  ## System Docs
35
 
36
  - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md)
37
  - Operations: [RUNBOOK.md](RUNBOOK.md)
38
+ - Taxonomy: [DOCS_TAXONOMY.md](DOCS_TAXONOMY.md)
39
+ - Quality: [QUALITY_SCORE.md](QUALITY_SCORE.md)
40
+ - Feature Slicing: [FEATURE_SLICING.md](FEATURE_SLICING.md)
41
+ - Contract: [DOCS_CONTRACT.json](DOCS_CONTRACT.json)
42
 
43
  ## Directory Structure
44
 
 
47
  ├── README.md # This file (index)
48
  ├── ARCHITECTURE.md # System design overview [reference]
49
  ├── RUNBOOK.md # Operations guide [how-to]
50
+ ├── DOCS_TAXONOMY.md # Where to put new docs [reference]
51
+ ├── QUALITY_SCORE.md # Domain quality grades [reference]
52
+ ├── FEATURE_SLICING.md # Feature slicing strategy [reference]
53
+ ├── DOCS_CONTRACT.json # Validation config [reference]
54
  ├── guides/ # How-to guides [how-to]
55
  │ └── README.md # Guide index
56
  ├── design-docs/ # Decision rationale [explanation]
57
  │ ├── index.md # Design docs catalogue
58
+ │ ├── core-beliefs.md # Agent-first principles
59
  │ └── decisions/ # Architectural Decision Records
60
+ ├── discovery/ # Idea validation [explanation]
61
+ │ └── index.md # Discovery index
62
+ ├── delivery-specs/ # Engineering handoff [reference]
63
+ │ └── index.md # Delivery specs index
64
+ ├── exec-plans/ # Work tracking [how-to]
65
+ │ ├── README.md # Exec plans index
66
+ │ └── tech-debt-tracker.md # Technical debt
67
+ ├── exploration/ # Ideas, scratchpad [exploration]
68
+ │ └── README.md # Exploration index
69
+ ├── learnings/ # Durable patterns [reference]
70
+ │ ├── README.md # Learnings index
71
+ │ ├── architecture.md # Architecture learnings
72
+ │ ├── conventions.md # Convention learnings
73
+ │ ├── gotchas.md # Gotcha learnings
74
+ │ ├── integrations.md # Integration learnings
75
+ │ ├── security.md # Security learnings
76
+ │ ├── testing.md # Testing learnings
77
+ │ ├── workflow.md # Workflow learnings
78
+ │ └── archived/ # Overflow archive
79
+ │ └── README.md # Archive policy
80
  └── references/ # External docs [reference]
81
  └── README.md # External docs for agent context
82
  ```
docs/SKILLS_HANDBOOK.generated.md ADDED
@@ -0,0 +1,413 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!-- handbook:ref cli opencode-ctx ralph run -->
2
+ <!-- handbook:ref cli opencode-ctx ralph start -->
3
+ <!-- handbook:ref cli opencode-ctx ralph spec -->
4
+ <!-- handbook:ref cli opencode-ctx ralph review -->
5
+
6
+ # OpenCode Workflow Handbook
7
+
8
+ Short, terminal-friendly map of the agentic workflow.
9
+
10
+ Assets discovered from: /Users/hjerp/.config/opencode
11
+
12
+ ## Pipeline (What Comes Next)
13
+
14
+ ## Delegation Brief (Required)
15
+
16
+ For any non-trivial work (anything you would normally put in a ticket/PRD), start with a short **Delegation Brief**.
17
+
18
+ This is the invariant "front page" that gets carried through discovery, delivery, and implementation so agents do not drift.
19
+
20
+ Minimum fields:
21
+
22
+ ```text
23
+ Objective (what + why):
24
+ Deliverables (artifacts/capabilities):
25
+ In scope / Out of scope (non-goals):
26
+ Boundaries (authority + constraints):
27
+ Checkpoints (when to show work / ask):
28
+ Acceptance checks (how we know it's done):
29
+ ```
30
+
31
+ How it glues into the existing workflow:
32
+
33
+ - `strategy` strengthens Objective/why (when the why is unclear or long-horizon).
34
+ - `discovery` turns Objective into validated scope + taste, and sharpens Acceptance checks.
35
+ - `delivery-spec` turns Deliverables/Boundaries into an engineering handoff.
36
+ - `design-doc` records decision rationale when the "how" has meaningful options.
37
+ - `autocode-implementation-planner` encodes the brief as **Core Intent (Immutable)** in the implementation spec, then uses the information barrier to generate independent verification.
38
+
39
+ **Note:** `strategy` is optional — use it when the "why" matters as much as the "what" (methodologies, frameworks, platforms). Most projects start at `discovery`.
40
+
41
+ ```
42
+ [delegation brief] -> [strategy] -> discovery -> delivery-spec -> design-doc
43
+ | | | |
44
+ v v v v
45
+ vision validate scope+handoff decisions
46
+ (optional) +taste
47
+
48
+ architecture -> autocode-implementation-planner -> verification-planner -> execution loop
49
+ | | | |
50
+ v v v v
51
+ system map IMPLEMENTATION_SPEC.md VERIFICATION_SPEC.md build/review/verify
52
+ ```
53
+
54
+ ## When to Create Which Doc
55
+
56
+ | Complexity | Product Spec? | Design Doc? | Start With |
57
+ |------------|---------------|-------------|------------|
58
+ | **Simple** (CRUD, config) | Skip | Skip | `autocode-implementation-planner` |
59
+ | **Standard** (new feature) | Recommended | Optional | `discovery` skill → implementation planner |
60
+ | **Complex** (multi-system) | Required | Required | `discovery` → `delivery-spec` → design doc → implementation planner |
61
+ | **Exploratory** (unknown) | Skip | Skip | `prototype` skill first |
62
+
63
+ **Decision criteria:**
64
+ - Create **discovery** doc when stakeholders need to agree on scope
65
+ - Create **design-doc** when architecture decisions need to be recorded
66
+ - Skip both for tactical work that doesn't need alignment
67
+
68
+ ## Where Artifacts Live
69
+
70
+ - vision/ = project philosophy (optional, for methodologies/frameworks) [explanation]
71
+ - docs/ = durable knowledge, organized by Diataxis type:
72
+ - docs/guides/ = how-to procedures [how-to]
73
+ - docs/design-docs/ = decision rationale [explanation]
74
+ - docs/discovery/ = problem validation + taste [explanation]
75
+ - docs/delivery-specs/ = engineering handoff [reference]
76
+ - docs/references/ = external docs for agents [reference]
77
+ - docs/exploration/ = ideas, scratchpad [exploration]
78
+ - docs/learnings/ = durable patterns [reference]
79
+ - docs/DOCS_TAXONOMY.md = where to put new docs
80
+ - specs/ = per-slice execution artifacts (FEATURES.json, impl specs, verification specs, CLARIFICATION_QUESTIONS.md (optional))
81
+
82
+ **Architecture lives in `docs/ARCHITECTURE.md`** — it's durable system knowledge, not a per-feature artifact.
83
+
84
+ ## Architecture: One File, Multiple Tools
85
+
86
+ **Canonical location:** `docs/ARCHITECTURE.md`
87
+
88
+ All architecture tools output to the same file:
89
+
90
+ | Tool | Purpose |
91
+ |------|---------|
92
+ | `architecture` skill | Create NEW architecture (greenfield, redesign) |
93
+ | `codebase-onboarding` skill | Understand EXISTING architecture |
94
+ | `opencode-ctx docs architecture apply` | Refresh auto-managed snapshot |
95
+ | `/update-architecture-diagram` | Quick diagram refresh |
96
+
97
+ **Workflow:**
98
+ - Joining existing project → `codebase-onboarding` first
99
+ - Building new system → `architecture` skill
100
+ - Regular maintenance → `opencode-ctx docs architecture apply`
101
+
102
+ **Keep it fresh:**
103
+ - Human-curated sections: coarse (5-10 boxes, 2-4 key flows)
104
+ - Auto-managed snapshot: refresh with `opencode-ctx docs architecture apply`
105
+ - CI gate: `opencode-ctx docs architecture check`
106
+
107
+ ## What To Run (Typical Flow)
108
+
109
+ - Start by writing (or asking the agent to draft) a Delegation Brief (required for significant work).
110
+ - To draft a Delegation Brief as a reusable artifact: use `delegation-brief`.
111
+ - If you are unsure what doc to write: use `triage`.
112
+ - To generate "what to build next" ideas from a codebase or product: use `ideation`.
113
+ - To capture project philosophy (rare): use `strategy` or create `vision/VISION.md` manually.
114
+ - To set direction for a longer horizon: use `strategy`.
115
+ - To validate and capture taste for a feature: use `discovery`.
116
+ - To hand off a concrete solution: use `delivery-spec`.
117
+ - To record architectural decisions: use `design-doc`.
118
+ - To map shared components across features: use `architecture`.
119
+ - To plan implementation with verification barrier: use `autocode-implementation-planner`.
120
+
121
+ - To inject baseline guardrails into `AGENTS.md`: `opencode-ctx guidelines apply --packs <lang>,testing,delivery-safety --file AGENTS.md`
122
+
123
+ **Vision vs Discovery:**
124
+ - Vision = enduring project philosophy (rare, for methodologies/frameworks)
125
+ - Discovery = feature-specific validation and taste capture (common)
126
+
127
+ | Use Vision When... | Use Discovery When... |
128
+ |--------------------|----------------------|
129
+ | The project IS a methodology or framework | You're validating a specific feature idea |
130
+ | Philosophy needs to persist across many features | The "why" is feature-specific |
131
+ | Multiple contributors need philosophical alignment | Taste capture is per-feature |
132
+ | The project will evolve over years | The project is < 6 months |
133
+
134
+ **Lightweight vision alternative:** Add a "Vision" section to README.md (2-3 paragraphs) instead of a full `vision/` directory.
135
+
136
+ Then build + ship with commands:
137
+
138
+ - /autocode-next-step (implement next spec step)
139
+ - /feature-demo F### (generate executable demo proving the feature works — auto-runs at loop end)
140
+ - /review-changes (auto-review)
141
+ - /commit-push-pr (commit, push, PR)
142
+ - /techdebt (tech debt scan)
143
+
144
+ For hands-free loops (optional):
145
+
146
+ - `opencode-ctx ralph start <spec>` (start server + browser + autonomous loop + cleanup)
147
+ - `opencode-ctx ralph run <spec>` (autonomous implement loop)
148
+ - `opencode-ctx ralph spec <spec>` (iterative spec refinement)
149
+ - `opencode-ctx ralph review` (review/fix bounded loop)
150
+
151
+ Common options: `-m MODEL` (aliases: sonnet, opus, gpt-builder), `--max-iter N`
152
+
153
+ ## Full Workflow (Complex Features)
154
+
155
+ ```
156
+ 0. Delegation Brief (required for significant work)
157
+ └─→ docs/delegation-briefs/<feature>.md # delegation-brief skill
158
+
159
+ 1. Discovery (taste capture + problem validation)
160
+ └─→ docs/discovery/<feature>.md # discovery skill
161
+ └─→ docs/discovery/<feature>.json # Machine-readable taste data
162
+
163
+ 1b. Delivery spec (engineering handoff - optional for simple features)
164
+ └─→ docs/delivery-specs/<feature>.md # delivery-spec skill
165
+
166
+ 2. Design doc (architectural decisions)
167
+ └─→ docs/design-docs/<feature>.md # design-doc skill (ADR-style)
168
+
169
+ 3. Feature planning (multi-feature projects)
170
+ └─→ specs/FEATURES.json # autocode-plan-features skill
171
+
172
+ 4. Implementation planning (reads taste from step 1)
173
+ └─→ specs/F001-IMPLEMENTATION_SPEC.md # autocode-implementation-planner skill
174
+ └─→ specs/F001-VERIFICATION_SPEC.md # Auto-generated
175
+
176
+ 5. Execution
177
+ └─→ /autocode-next-step # Or: opencode-ctx ralph start <spec>
178
+
179
+ 6. Knowledge extraction (automatic)
180
+ └─→ docs/learnings/*.md # compound-engineer at finalization
181
+ ```
182
+
183
+ **Taste flows through the pipeline:**
184
+ - `discovery` captures delights, frustrations, feeling
185
+ - `delivery-spec` (optional) translates to functional requirements
186
+ - `autocode-implementation-planner` reads taste JSON and uses as success criteria
187
+ - Verification checks implementation against captured taste
188
+
189
+ ## Quick Workflow (Simple Features)
190
+
191
+ ```bash
192
+ # Plan directly
193
+ skill({ name: "autocode-implementation-planner" })
194
+
195
+ # Execute
196
+ /autocode-next-step
197
+ # Or autonomous: opencode-ctx ralph start <spec>
198
+
199
+ # Review and ship
200
+ /review-changes
201
+ /commit-push-pr
202
+ ```
203
+
204
+ ## Linking Docs to Features (FEATURES.json)
205
+
206
+ When using `FEATURES.json`, link to related documentation:
207
+
208
+ ```json
209
+ {
210
+ "id": "F001",
211
+ "name": "User Authentication",
212
+ "docs": {
213
+ "discovery_json": "docs/discovery/auth.json",
214
+ "discovery_md": "docs/discovery/auth.md",
215
+ "delivery_spec": "docs/delivery-specs/auth.md",
216
+ "design_doc": "docs/design-docs/auth-system.md"
217
+ },
218
+ "specs": {
219
+ "implementation": "specs/F001-IMPLEMENTATION_SPEC.md",
220
+ "verification": "specs/F001-VERIFICATION_SPEC.md"
221
+ }
222
+ }
223
+ ```
224
+
225
+ The `autocode-implementation-planner` skill automatically checks each linked doc and uses it as context.
226
+
227
+ ## AGENTS.md Convention
228
+
229
+ AGENTS.md is a **navigation map**, not an encyclopedia:
230
+
231
+ - ~60-80 lines
232
+ - Links to docs/ for details
233
+ - No inline learnings (those go in `docs/learnings/`)
234
+ - Injectable guidelines via `<!-- GUIDELINES-BEGIN/END -->`
235
+
236
+ ```bash
237
+ # Apply or update language guidelines (idempotent)
238
+ opencode-ctx guidelines apply --packs python,testing,delivery-safety --file AGENTS.md
239
+
240
+ # Frontend projects
241
+ opencode-ctx guidelines apply --packs frontend,testing,delivery-safety --file AGENTS.md
242
+
243
+ # Preview without writing
244
+ opencode-ctx guidelines apply --packs python,testing,delivery-safety --file AGENTS.md --dry-run
245
+ ```
246
+
247
+ Available packs: `python`, `testing`, `frontend`, `delivery-safety`
248
+
249
+ ## Parallel Feature Development (Fan-Out / Fan-In)
250
+
251
+ When a project has multiple independent features to implement, use the
252
+ parallel workflow to create isolated clones for each, then merge back safely.
253
+
254
+ ```bash
255
+ # Plan: AI-assisted analysis of which features can run in parallel
256
+ opencode-ctx parallel plan # analyze FEATURES.json + specs
257
+ opencode-ctx parallel plan --status ready # filter by status
258
+ opencode-ctx parallel plan --format json # machine-readable output
259
+
260
+ # Fan-out: create parallel clones for features
261
+ opencode-ctx parallel fan-out # reads FEATURES.json (planned/ready)
262
+ opencode-ctx parallel fan-out --features F001,F002 # explicit feature list
263
+ opencode-ctx parallel fan-out --from-local # clone from local (faster)
264
+
265
+ # Status: check all parallel clones
266
+ opencode-ctx parallel status
267
+ opencode-ctx parallel status --format json
268
+
269
+ # Work in each clone independently
270
+ cd ../repo-F001-auth-system
271
+ opencode-ctx ralph start specs/F001-IMPLEMENTATION_SPEC.md
272
+
273
+ # Fan-in: pre-merge conflict analysis
274
+ opencode-ctx parallel fan-in # trial merge against main
275
+ opencode-ctx parallel fan-in --format json
276
+
277
+ # Cleanup: remove clones after merging
278
+ opencode-ctx parallel cleanup F001
279
+ opencode-ctx parallel cleanup --all --force
280
+ ```
281
+
282
+ **AI-Orchestrated Workflow (recommended):**
283
+ 1. Use `/parallel-plan` command — analyzes implementation specs for file overlaps
284
+ and dependencies, recommends parallelizable batches, asks for confirmation
285
+ 2. On confirmation, automatically calls `fan-out` for the batch
286
+ 3. User runs `opencode-ctx ralph start <spec>` (or manual implementation) in each clone
287
+ 4. Use `parallel-orchestrator` agent to monitor progress and coordinate merges
288
+ 5. `fan-in` runs trial merges; orchestrator generates a merge playbook
289
+ 6. User merges per-feature PRs in the suggested order
290
+ 7. `cleanup` removes clones after merge
291
+
292
+ **Manual Workflow:**
293
+ 1. `fan-out` creates sibling clones, each on a `feat/F###-slug` branch
294
+ 2. User runs `opencode-ctx ralph start <spec>` (or manual implementation) in each clone
295
+ 3. `fan-in` runs trial merges to detect conflicts and suggests merge order
296
+ 4. User merges per-feature PRs in the suggested order
297
+ 5. `cleanup` removes clones after merge
298
+
299
+ The `plan` command reads Section 2 (Change Manifest) from implementation specs to
300
+ extract file lists, computes pairwise file overlaps, and partitions features into
301
+ conflict-free batches. Fan-in respects feature dependencies from FEATURES.json
302
+ and orders clean merges before conflicting ones.
303
+
304
+ ## Maintenance Commands
305
+
306
+ ```bash
307
+ # Keep the architecture snapshot current
308
+ opencode-ctx docs architecture apply # refresh auto-managed snapshot
309
+ opencode-ctx docs architecture check # CI gate
310
+
311
+ # Keep the FEATURES.json schema current (multi-feature projects)
312
+ opencode-ctx features schema apply # creates specs/schemas/autocode-features-v1.schema.json
313
+ opencode-ctx features schema check # CI gate
314
+ opencode-ctx features schema apply --dry-run
315
+
316
+ # Validate docs structure
317
+ opencode-ctx docs validate
318
+ ```
319
+
320
+ ## Global Configuration
321
+
322
+ Additional commands, agents, and skills are in `/Users/hjerp/.config/opencode`:
323
+
324
+ | Path | Contents |
325
+ |------|----------|
326
+ | `AGENTS.md` | Global workflow policies |
327
+ | `commands/` | Slash commands (`/autocode-*`, `/review-*`, etc.) |
328
+ | `agents/` | Specialized agents (planner, reviewer, executor, etc.) |
329
+ | `skills/` | Product planning skills (strategy, discovery, delivery-spec, etc.) |
330
+ | `scripts/` | Automation scripts (legacy, deprecated — use `opencode-ctx ralph`) |
331
+
332
+ ## Skills (Discovered)
333
+
334
+ | Skill | Description |
335
+ |---|---|
336
+ | `ai-adoption-engagement` | Run an end-to-end AI adoption consulting engagement: scoping, current-state maturity di... |
337
+ | `ai-strategy` | Plan, evaluate, and iteratively update AI implementation strategy for an organization o... |
338
+ | `architecture` | Create lightweight system architecture establishing shared understanding across feature... |
339
+ | `autocode-implementation-planner` | Research and plan software changes with structured verification handoff. Orchestrates s... |
340
+ | `autocode-plan-features` | Create machine-parseable feature lists for multi-feature projects. Generates FEATURES.j... |
341
+ | `causal-driver` | Build causal driver trees for any metric — separate accounting identities from causal h... |
342
+ | `checkpoint-protocol` | Structured human checkpoint protocol that minimizes evaluation overhead. Transforms 're... |
343
+ | `codebase-onboarding` | Rapidly understand an unfamiliar codebase by generating a structured overview with ASCI... |
344
+ | `communication` | Craft compelling communication for stakeholders. Covers storytelling frameworks, persua... |
345
+ | `core-web-vitals` | Optimize Core Web Vitals (LCP, INP, CLS) for better page experience and search ranking.... |
346
+ | `delegation-brief` | Create a short delegation contract (objective, deliverables, boundaries, checkpoints, a... |
347
+ | `delivery-spec` | Create delivery specs that define solutions for engineers and AI agents (Phase 3). Use... |
348
+ | `design-doc` | Record architectural decisions with rationale (ADR-style). Captures WHY decisions were... |
349
+ | `discovery` | Validate and prioritize product ideas using PR/FAQ, Opportunity Solution Trees, Taste C... |
350
+ | `execution` | Translate PRDs into user stories with acceptance criteria (Phase 4). Use when: (1) Engi... |
351
+ | `frameworks` | Reusable frameworks, checklists, and templates (a lightweight reference library). Use w... |
352
+ | `frontend-builder` | LLM-optimized frontend implementation guidance. Use when: (1) Starting new frontend pro... |
353
+ | `ideation` | Generate structured 'what to build next' candidates from a codebase or product using th... |
354
+ | `landscape-scan` | Competitive intelligence for company building and investment diligence. Maps the full c... |
355
+ | `ml-concept-eval` | Evaluate an ML/statistical technique against a specific business context: is it viable,... |
356
+ | `peer-collaboration` | Coordinate volunteer or peer teams without formal authority. Use when: (1) Working with... |
357
+ | `principal-ml-engineer` | Principal ML engineer playbook: design ML/LLM systems that are reliable, observable, ex... |
358
+ | `project-leadership` | Adaptive project leadership for competitions, research, coursework, ventures, and deliv... |
359
+ | `prototype` | Rapid exploratory prototyping to resolve ambiguity and validate ideas before committing... |
360
+ | `sector-primer` | Rapid industry understanding for consultants and investors. Produces a structured Indus... |
361
+ | `seo` | Optimize for search engine visibility and ranking. Use when asked to "improve SEO", "op... |
362
+ | `spec-staff-review` | Deliberate quality review of implementation specs by a staff engineer persona. Use for... |
363
+ | `strategic-thinker` | Guide users through strategic thinking using the Strategy Kernel framework. Facilitates... |
364
+ | `strategy` | Create product vision boards and outcome-based roadmaps (Phase 0-1). Use when: (1) Annu... |
365
+ | `team-lead` | Reference skill for team leadership principles: coaching, feedback, delegation. Use whe... |
366
+ | `triage` | Guide users through product planning workflow and select the right documentation (Triag... |
367
+ | `visual-artifacts` | Create professional visual artifacts: diagrams (Mermaid, Excalidraw) and presentations... |
368
+ | `web-performance` | Optimize web performance for faster loading and better user experience. Use when asked... |
369
+ | `web-security` | Apply modern web security best practices including CSP, HTTPS, XSS prevention, and depe... |
370
+ | `what-how-alignment` | System-level alignment between intent (what) and implementation (how). Analyzes complet... |
371
+
372
+ ## Commands (Discovered)
373
+
374
+ | Command | Agent | Description |
375
+ |---|---|---|
376
+ | `/autocode-fix-from-review` | `executor` | Apply fixes from review report and run verification |
377
+ | `/autocode-fix-verification` | `verifier` | Fix features marked complete without proper verification evidence |
378
+ | `/autocode-next-step` | `executor` | Execute the next implementation step with verification |
379
+ | `/autocode-refine-spec` | `reviewer` | Iteratively refine an implementation spec before verification planning |
380
+ | `/autocode-verification-planner` | `verification-planner` | Generate verification criteria from sanitized spec |
381
+ | `/commit-push-pr` | `executor` | Commit, Push, and Create Pull Request |
382
+ | `/feature-demo` | `feature-demo` | Generate an executable demo document for a completed feature |
383
+ | `/full-web-audit` | `executor` | Comprehensive web quality audit (performance, accessibility, SEO, security) |
384
+ | `/parallel-plan` | `parallel-orchestrator` | Analyze FEATURES.json and plan parallelizable feature batches |
385
+ | `/review-changes` | `reviewer` | Review changes before commit or PR |
386
+ | `/review-frontend` | `frontend-reviewer` | Visual review of running frontend via Playwright MCP |
387
+ | `/techdebt` | `reviewer` | Analyze code for technical debt, duplications, and AI slop patterns |
388
+ | `/update-architecture-diagram` | `executor` | Refresh the System Diagram in ARCHITECTURE.md to match current codebase |
389
+ | `/update-model-routing` | `executor` | Refresh model routing recommendations with current pricing from models.dev |
390
+ | `/validate-spec` | `verifier` | Check if implementation matches spec; report discrepancies |
391
+
392
+ ## Ralph CLI (Autonomous Loops)
393
+
394
+ | Command | Purpose |
395
+ |---------|---------|
396
+ | `opencode-ctx ralph start <spec>` | Start server + browser + run loop + cleanup (all-in-one) |
397
+ | `opencode-ctx ralph run <spec>` | Run implementation loop (attach to existing server with `-s`) |
398
+ | `opencode-ctx ralph spec <spec>` | Iterative spec refinement (default: 3 iterations) |
399
+ | `opencode-ctx ralph review` | Review + fix loop on current changes |
400
+
401
+ Common options: `-m MODEL` (aliases: sonnet, opus, gpt-builder), `--max-iter N`
402
+
403
+ ## Scripts (Discovered — legacy, prefer `opencode-ctx ralph`)
404
+
405
+ | Script | Description |
406
+ |---|---|
407
+ | `cleanup-feature.sh` | Remove a parallel feature clone |
408
+ | `learnings.sh` | Query learnings (AGENTS.md or docs/LEARNINGS) |
409
+ | `parallel-feature.sh` | Create isolated clone for parallel feature development |
410
+ | `ralph-loop.sh` | Autonomous implementation loop (superseded by `opencode-ctx ralph run`) |
411
+ | `ralph-review-loop.sh` | Review + fix loop (superseded by `opencode-ctx ralph review`) |
412
+ | `ralph-spec-loop.sh` | Iterative spec refinement loop (superseded by `opencode-ctx ralph spec`) |
413
+ | `update-model-routing.sh` | Fetch model pricing from models.dev and generate routing tables |
docs/blog-material.md ADDED
@@ -0,0 +1,428 @@
1
+ # Blog Material — Raw Knowledge Dump
2
+
3
+ Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.
4
+
5
+ ## Training Run Summary
6
+
7
+ ### Run progression (what each run taught us)
8
+ 1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
9
+ 2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
10
+ 3. **Run 3**: Without KL penalty, GRPO drifts structural tokens (`<tool_response>` instead of `<tool_call>`)
11
+ 4. **Run 4**: KL penalty + reference model = OOM on L4
12
+ 5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
13
+ 6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
14
+ 7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
15
+ 8. **Run 8**: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
16
+ 9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions
17
+
18
+ ### Key numbers
19
+ | Metric | Value |
20
+ |--------|-------|
21
+ | Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
22
+ | Hardware | Colab L4 (24GB VRAM) |
23
+ | SFT data | 120 multi-turn trajectories, assistant_only_loss |
24
+ | Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
25
+ | Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
26
+ | Total training | ~7.3h on L4 for both phases |
27
+ | Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
28
+ | Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
29
+ | Parse rate | >98% (valid tool-call JSON) |
30
+ | Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
31
+ | Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
32
+ | Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |
33
+
34
+ ## Concrete Episode Examples (for blog)
35
+
36
+ ### Success: Simple aggregation (3 tool calls)
37
+ Question: "What is the total bonus given in all evaluations?"
38
+ 1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
39
+ 2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
40
+ 3. `answer("19500.0")` → correct
41
+ Reward: 1.15
42
+
43
+ ### Success: Error recovery with JOIN
44
+ Question: "Which employee received the biggest bonus?"
45
+ 1. `describe("employee")` → Employee_ID, Name, Age, City
46
+ 2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
47
+ 3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
48
+ 4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
49
+ 5. `answer("Louis Deacon")` → correct
50
+ Reward: 1.13
51
+
52
+ ### Success: Subquery with comparison
53
+ Question: "How many cars have greater accelerate than the one with the most horsepower?"
54
+ 1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
55
+ 2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
56
+ 3. `answer("39")` → correct
57
+ Reward: 1.15
58
+
59
+ ### Failure: Column name hallucination
60
+ Question: "What are all the makers and models?"
61
+ 1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
62
+ 2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
63
+ 3. `describe("model_list")` → ModelId, Maker, Model
64
+ 4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
65
+ 5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
66
+ Model found the right data but answer was truncated by the 20-row display limit.
67
+
68
+ ### Failure: FK chain reasoning
69
+ Question: "Which city has most number of arriving flights?"
70
+ 1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
71
+ 2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
72
+ 3. `answer("Albany")` → incorrect
73
+ Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.
74
+
75
+ ### Failure: Invented table
76
+ Question: "What are all details of students who registered but did not attend any course?"
77
+ 1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
78
+ 2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
79
+ 3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
80
+ Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.
81
+
82
+ ## Reward Architecture Details
83
+
84
+ ### Three-layer structure
85
+ ```
86
+ L1 Operational (every step):
87
+ +0.02 exec_ok
88
+ +0.01 new_info (unique SQL hash)
89
+ -0.03 repeat penalty
90
+ -0.02 step cost
91
+
92
+ L2 Progress (QUERY only):
93
+ Delta from previous binned progress × 0.15
94
+ Binned to {0, 0.25, 0.5, 0.75, 1.0}
95
+
96
+ L3 Terminal (ANSWER only):
97
+ +1.0 correct, 0.0 wrong
98
+
99
+ Per-step clip: [-0.10, 0.15]
100
+ ```
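Put together, the three layers can be sketched as one per-step function. This is a minimal illustration, not the project's actual code: the signature, rounding progress to the nearest bin, and placing the terminal bonus outside the per-step clip are all assumptions.

```python
def step_reward(action, exec_ok, sql_hash, seen_hashes,
                progress, prev_progress, correct=None):
    """Per-step reward: L1 operational + L2 progress + L3 terminal."""
    r = -0.02                        # L1: flat step cost
    if exec_ok:
        r += 0.02                    # L1: execution succeeded
    if action == "QUERY":
        if sql_hash in seen_hashes:
            r -= 0.03                # L1: repeat penalty
        else:
            r += 0.01                # L1: new-information bonus
        # L2: delta of binned progress (potential-based shaping)
        def binned(p):
            return min([0.0, 0.25, 0.5, 0.75, 1.0], key=lambda b: abs(b - p))
        r += (binned(progress) - binned(prev_progress)) * 0.15
    r = max(-0.10, min(0.15, r))     # per-step clip on L1 + L2
    if action == "ANSWER":
        r += 1.0 if correct else 0.0 # L3: terminal reward, assumed unclipped
    return r
```

A first query that executes cleanly and lifts binned progress from 0 to 0.5 earns 0.085; repeating the same query nets -0.03; a correct answer adds the full +1.0 on top of the operational terms.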
101
+
102
+ ### Why potential-based shaping matters
103
+ - Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
104
+ - Our delta progress IS potential-based with γ=1
105
+ - Cumulative caps are NOT potential-based (depend on trajectory history)
106
+ - Without this guarantee, agents learn to farm exploration rewards
107
+
108
+ ### Anti-farming mechanisms
109
+ - Hard budget (15 steps)
110
+ - Step cost (-0.02)
111
+ - Repeat penalty (-0.03)
112
+ - Terminal dominance (1.0 vs ~0.3 max exploration)
113
+ - Per-step clip [-0.10, 0.15]
114
+ - Post-episode penalty (-0.3)
115
+
116
+ ## Eval Results (N=50, 2026-04-11)
117
+
118
+ ### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
119
+ | Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
120
+ |--------|----------|------------|-----------|------------|-------------|----------------|
121
+ | zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
122
+ | 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
123
+ | 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
124
+ | GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
125
+ | GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |
126
+
127
+ ### Previous run (Run A, same day, same seed)
128
+ | Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
129
+ |--------|----------|------------|-----------|------------|----------------|
130
+ | zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
131
+ | 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
132
+ | 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
133
+ | GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
134
+ | GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |
135
+
136
+ ### Run-to-run variation (important for blog)
137
+ v1 and v2 show similar accuracy, within noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.
138
+
139
+ The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.
140
+
141
+ Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
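The no-op fallback can be sketched as follows. This is a hypothetical illustration of the behavior described above; the function name and the exact `<tool_call>` extraction are assumptions.

```python
import json
import re


def parse_tool_call(raw: str):
    """Extract a tool call from model output; fall back to a no-op on failure.

    Instead of ending the episode on a parse failure, return a harmless
    DESCRIBE so the model keeps its remaining step budget (the step is
    still spent, matching the base-model budget-exhaustion numbers above).
    """
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
    if m:
        try:
            call = json.loads(m.group(1))
            return call["name"], call.get("arguments", {})
        except (json.JSONDecodeError, KeyError):
            pass
    return "describe", {}  # no-op fallback: episode continues
```

A base model emitting "Task complete." fifteen times simply burns fifteen no-op describes instead of terminating early.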
142
+
143
+ ### Key observations from N=50 eval (with retry, 2 runs)
144
+ 1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
145
+ 2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
146
+ 3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
147
+ 4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
148
+ 5. **GRPO failure mode is SQL quality, not format** — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
149
+ 6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
150
+ 7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
151
+
152
+ ### v2 vs v1: similar accuracy, more parse failures — behavioral shift
153
+ Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:
154
+
155
+ - **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
156
+ - **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
157
+ - **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.
158
+
159
+ The v2 parse failure pattern (from raw output):
160
+ ```
161
+ [OK] DESCRIBE: country
162
+ [OK] QUERY: SELECT Name FROM country WHERE Population < ...
163
+ [PARSE FAIL] raw: Task complete. ← gives up, doesn't call answer()
164
+ [PARSE FAIL] raw: Task complete. ← repeats until budget
165
+ ```
166
+
167
+ Compare v1 on the same type of question:
168
+ ```
169
+ [OK] DESCRIBE: country
170
+ [OK] QUERY: SELECT Name FROM country WHERE ...
171
+ [OK] ANSWER: European cities and their names are: 42 ← wrong, but at least calls answer()
172
+ ```
173
+
174
+ This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.
175
+
176
+ ### For the blog narrative
177
+ The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.
178
+
179
+ ## Format Mismatch Discovery (F011)
180
+
181
+ ### The three differences between eval and training
182
+ 1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
183
+ 2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`, eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
184
+ 3. **No separator vs `\n\n`** — TRL appends `reset()` return directly to user message. Eval had `question\n\ntable_hint`.
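The mismatch is easiest to see side by side. These are illustrative, abbreviated payloads built from the differences listed above:

```python
# Training (TRL rollout): the assistant turn carries structured tool_calls,
# and the tool result comes back with role "tool".
train_assistant = {
    "role": "assistant",
    "tool_calls": [{
        "function": {"name": "describe", "arguments": '{"table": "employee"}'},
    }],
}
train_result = {"role": "tool", "content": "Employee_ID, Name, Age, City"}

# Eval (before the fix): the same call inlined as raw text,
# and the tool result sent back as a plain user message.
eval_assistant = {
    "role": "assistant",
    "content": '<tool_call>{"name": "describe", '
               '"arguments": {"table": "employee"}}</tool_call>',
}
eval_result = {"role": "user", "content": "Employee_ID, Name, Age, City"}
```

Both shapes pass through the same chat template but render to different token sequences, so the checkpoint never saw the eval-side layout during training.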
185
+
186
+ ### Impact
187
+ Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint).
188
+ After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.
189
+
190
+ ### Lesson
191
+ Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.
192
+
193
+ ## Multi-Turn SFT — Why It's Critical
194
+
195
+ ### Per-turn SFT (broken)
196
+ - 347 examples, each one assistant turn
197
+ - ~50% were describe calls
198
+ - Model learned: "when asked a question, call describe"
199
+ - With KL penalty, model stayed anchored to this single-action policy
200
+ - Result: reward=0.00, all rollouts identical, advantage=0
201
+
202
+ ### Multi-turn SFT (working)
203
+ - 120 examples, each a full describe→query→answer trajectory
204
+ - `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
205
+ - Model learned: the SEQUENCE describe→query→answer
206
+ - With KL penalty, model explores within the multi-turn strategy
207
+ - Result: reward climbs to 0.7 in Phase 1
208
+
209
+ ### Template patch detail
210
+ Qwen3's chat template lacks the `{% generation %}` tags needed by TRL for `assistant_only_loss`. We patch the template before SFT and restore the original before GRPO (TRL does exact-match checks on the template string in `add_response_schema()` and `get_training_chat_template()`).
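A minimal sketch of the patch-and-restore dance, using a toy template. The real Jinja edits to Qwen3's template are more involved; the template string and helper name here are invented for illustration.

```python
# Toy chat template standing in for Qwen3's (the real one is much longer).
TEMPLATE = (
    "{% for m in messages %}"
    "{% if m.role == 'assistant' %}A:{{ m.content }}\n"
    "{% else %}U:{{ m.content }}\n{% endif %}"
    "{% endfor %}"
)


def add_generation_tags(tmpl: str) -> str:
    """Wrap only the assistant-content branch in {% generation %} tags,
    so TRL's assistant_only_loss can mask the loss to assistant tokens."""
    return tmpl.replace(
        "A:{{ m.content }}",
        "A:{% generation %}{{ m.content }}{% endgeneration %}",
    )


patched = add_generation_tags(TEMPLATE)
# SFT runs with `patched`; before GRPO, restore TEMPLATE verbatim,
# since TRL compares the template string character-for-character.
```

Keeping the untouched original string around is the important part: any drift (even whitespace) fails TRL's exact-match checks at GRPO time.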
211
+
212
+ ## The 0.6B Capacity Ceiling
213
+
214
+ ### What works at 0.6B
215
+ - Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
216
+ - Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
217
+ - WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
218
+ - Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
219
+ - Error recovery: describe after SQL error, retry with correct column names
220
+ - `sample` tool usage (learned in Run 6, inconsistent later)
221
+
222
+ ### What breaks at 0.6B
223
+ - FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
224
+ - Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
225
+ - Ambiguous column resolution: joins with same column name in both tables
226
+ - Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
227
+ - "stadium without concert" pattern: NOT IN with JOIN to get names
228
+ - Aggregate + GROUP BY + HAVING chains on multi-table joins
229
+
230
+ ### The hallucination pattern
231
+ The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.
232
+
233
+ ## Thinking Mode Observations (Run 8)
234
+
235
+ ### Benefits
236
+ - Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
237
+ - Empty `<think></think>` on easy questions — token-efficient, emergent behavior
238
+ - Multi-step join planning in think blocks
239
+
240
+ ### New failure mode
241
+ ~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage.
242
+
243
+ ### OOM risk
244
+ Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.
245
+
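The token-budget arithmetic behind the fix, as a back-of-envelope sketch (actual peak memory also depends on KV cache, prompt lengths, and batch size):

```python
# Generated tokens per GRPO step scale with num_generations * max_new_tokens.
crashed = 4 * 1280   # Phase 2 config that hit OOM at step 182
fix_a   = 4 * 1024   # option A: reduce max_new_tokens
fix_b   = 3 * 1280   # option B: reduce num_generations
assert crashed == 5120 and fix_a == 4096 and fix_b == 3840
```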
246
+ ## Narrative Threads for Blog
247
+
248
+ ### "The environment is the product"
249
+ From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.
250
+
251
+ ### "Small model showing improvement proves more than large model with marginal gains"
252
+ A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.
253
+
254
+ ### "Analysts don't write perfect queries from scratch"
255
+ The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.
256
+
257
+ ### "Dense rewards need theory"
258
+ Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
259
+
260
+ ### "Multi-turn SFT teaches strategy, not actions"
261
+ The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.
262
+
263
+ ## References for Blog
264
+
265
+ - Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
266
+ - DeepSeek-AI (2025). DeepSeek-R1.
267
+ - Shao et al. (2024). DeepSeek-Math: GRPO.
268
+ - Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
269
+ - Yu et al. (2018). Spider dataset.
270
+ - Li et al. (2023). BIRD benchmark.
271
+ - TIPS (2026). Turn-Level Information-Potential Reward Shaping.
272
+ - ToolRL (2025). Reward is All Tool Learning Needs.
273
+ - StepTool (2024). Step-grained RL for Tool Learning.
274
+
275
+ ## Showcase Notebook Transcripts (for blog)
276
+
277
+ ### Random agent episode (seed=7) — comedic failure
278
+ Question: "Count the number of paragraphs."
279
+ ```
280
+ SAMPLE Paragraphs → reward=0.015
281
+ SAMPLE Documents → reward=0.015
282
+ DESCRIBE Documents → reward=0.015
283
+ SAMPLE Documents → reward=0.015 (repeat)
284
+ DESCRIBE Documents → reward=0.015 (repeat)
285
+ DESCRIBE Documents → reward=0.015 (repeat)
286
+ DESCRIBE Templates → reward=0.015
287
+ SAMPLE Documents → reward=0.015 (repeat)
288
+ DESCRIBE Documents → reward=0.015 (repeat)
289
+ QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
290
+ DESCRIBE Documents → reward=0.015 (repeat)
291
+ DESCRIBE Paragraphs → reward=0.015
292
+ QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
293
+ QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
294
+ ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
295
+ ```
296
+ Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.
297
+
298
+ ### Oracle agent episode (seed=0) — clean solve
299
+ Question: "List the id of students who registered some courses and the number of their registered courses?"
300
+ ```
301
+ Step 1: DESCRIBE student_course_registrations
302
+ → student_id INTEGER, course_id INTEGER, registration_date DATETIME
303
+ → reward: +0.015
304
+
305
+ Step 2: DESCRIBE students
306
+ → student_id INTEGER, student_details VARCHAR(255)
307
+ → reward: +0.015
308
+
309
+ Step 3: QUERY
310
+ SELECT T1.student_id, count(*)
311
+ FROM students AS T1
312
+ JOIN student_course_registrations AS T2
313
+ ON T1.student_id = T2.student_id
314
+ GROUP BY T1.student_id
315
+ → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
316
+ → reward: +0.150
317
+
318
+ Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
319
+ → correct
320
+ → reward: +1.000
321
+ ```
322
+ Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.
323
+
324
+ ### Baseline comparison (50 episodes each)
325
+ | Policy | Success Rate | Avg Reward | Avg Steps |
326
+ |--------|-------------|------------|-----------|
327
+ | Random | 0.0% | 0.247 | 15.0 |
328
+ | Oracle | 100.0% | 1.168 | 3.5 |
329
+
330
+ The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.
331
+
332
+ ### Reward constants (from server/reward.py)
333
+ ```
334
+ +0.02 successful execution (no errors)
335
+ +0.01 new information (unique query)
336
+ -0.02 step cost (every action)
337
+ -0.03 repeat penalty (duplicate SQL)
338
+ [-0.10, +0.15] per-step clipping range
339
+ +1.0 correct answer (terminal)
340
+ +0.0 wrong answer (terminal)
341
+ ```
342
+ Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.
343
+
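As a sketch, the per-step constants compose like this (illustrative function, not the actual code in server/reward.py, which also adds the L2 progress term before clipping):

```python
def step_reward(executed_ok: bool, new_info: bool, is_repeat: bool) -> float:
    r = -0.02                        # step cost on every action
    if executed_ok:
        r += 0.02                    # successful execution (no errors)
    if new_info:
        r += 0.01                    # unique query revealed new information
    if is_repeat:
        r -= 0.03                    # duplicate SQL penalty
    return max(-0.10, min(0.15, r))  # per-step clipping range

# Best case per step: 0.02 + 0.01 - 0.02 = +0.01 before progress rewards.
assert abs(step_reward(True, True, False) - 0.01) < 1e-9
assert abs(step_reward(False, False, True) - (-0.05)) < 1e-9
```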
344
+ ## Competition Context
345
+
346
+ ### OpenEnv Challenge (our target)
347
+ - Sponsors: PyTorch/Meta, HuggingFace, Unsloth
348
+ - Prize: $10K HF credits
349
+ - Judging: primarily blog-based
350
+ - Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper
351
+ - Green Agent wrapper is an explicit judging criterion in the OpenEnv Challenge.
352
+
353
+ ### Deliverables
354
+ 1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env
355
+ (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
356
+ 2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`,
357
+ `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
358
+ 3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)
359
+
360
+ ### TRL integration status (already done — do not re-research)
361
+ `training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory`
362
+ class: `reset()` + named tool methods `describe() / sample() / query() /
363
+ answer()` with docstrings TRL uses to build the tool schema. The notebook
364
+ passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL,
365
+ reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment`
366
+ **in-process** (not a WebSocket client to the HF Space) — intentional, because
367
+ training opens N parallel sessions and the Space defaults to 1.
368
+
369
+ ### Competitive landscape
370
+ - **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
371
+ - **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
372
+ - **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
373
+ - Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.
374
+
375
+ ### What winning entries do
376
+ 1. Stakes framing — "this matters in production"
377
+ 2. Concrete failure analysis with numbers
378
+ 3. Contrast (random vs trained vs oracle)
379
+ 4. Real data, not toy puzzles
380
+ 5. Non-obvious insights from training
381
+
382
+ ## Green Agent Evaluator
383
+
384
+ ### What it is
385
+ OpenEnv's standardized evaluation wrapper pattern. A `Policy` protocol with `evaluate(env, policy, n_episodes, seed)` that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.
386
+
387
+ ### Implementation
388
+ - `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
389
+ - `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
390
+ - `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)
391
+
392
+ ### How it works
393
+ ```python
394
+ from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy
395
+
396
+ # Run 50 episodes with random policy
397
+ result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
398
+ print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
399
+
400
+ # Run with trained policy (any class with select_action method)
401
+ result = evaluate(env, trained_policy, n_episodes=50, seed=42)
402
+ ```
403
+
404
+ ### Where it's used
405
+ - `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
406
+ - `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`
407
+
408
+ ### Key design choices
409
+ - **Error isolation**: one episode crashing doesn't kill the run — logged as `EpisodeResult(error=str(exc))`
410
+ - **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
411
+ - **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required
412
+ - **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and full `episodes` list for drill-down
413
+
414
+ ### For the blog
415
+ The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.
416
+
417
+ ## Files to Reference
418
+
419
+ | File | Relevance |
420
+ |------|-----------|
421
+ | `docs/blog-outline.md` | Section structure template |
422
+ | `docs/blog-post.md` | Current draft |
423
+ | `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
424
+ | `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
425
+ | `vision/VISION.md` | Product vision, success metrics |
426
+ | `training/trl_adapter.py` | Environment adapter code |
427
+ | `notebooks/compare_methods.ipynb` | Eval notebook |
428
+ | `notebooks/train_grpo.ipynb` | Training notebook |
docs/blog-outline.md CHANGED
@@ -1,56 +1,114 @@
1
  # SQLEnv Blog Post Outline
2
 
3
- ## 1) Hook: Teaching AI to Think Like a Data Analyst
4
 
5
- - Open with a concrete moment: an agent sees a new schema and must reason through uncertainty instead of guessing one SQL query.
6
- - Frame the core idea: SQL competence is not only syntax generation; it is iterative investigation with feedback.
7
- - Position SQLEnv as a training ground where agents learn exploration habits that mirror analyst workflows.
8
 
9
- ## 2) The Problem: Static Benchmarks Reward Memorization
 
 
 
10
 
11
- - Explain why single-shot text-to-SQL can hide brittle behavior when schemas, table names, or data distributions shift.
12
- - Show that leaderboard accuracy does not guarantee robust reasoning on unfamiliar databases.
13
- - Describe the gap: most benchmarks grade final answers but ignore how the model arrived there.
14
- - Tie this directly to user pain: correct-looking SQL can fail in real environments where context changes every session.
15
 
16
- ## 3) Our Approach: SQLEnv as an Interactive RL Environment
17
 
18
- - Introduce the action loop: `DESCRIBE`, `SAMPLE`, `QUERY`, and `ANSWER` as the minimum interface for grounded exploration.
19
- - Explain that each episode starts with a natural-language question and a hidden schema to force discovery.
20
- - Highlight OpenEnv compatibility so the environment can run with standard training tooling and deployment flows.
21
 
22
- ## 4) How SQLEnv Works End-to-End
23
 
24
- - Walk through one episode narrative: inspect table shapes, sample data, run targeted joins, then submit an answer.
25
- - Summarize reward design in plain language: reward reliable execution, reward progress toward the goal, and strongly reward final correctness.
26
- - Note guardrails: read-only SQL execution, query timeout, and clear error messages to prevent unsafe or confusing behavior.
27
 
28
- ## 5) Training with GRPO
 
 
 
 
 
 
29
 
30
- - Briefly explain GRPO as a practical policy optimization method for improving multi-step tool use behavior.
31
- - Connect training signals to environment telemetry: each step gives usable feedback rather than waiting for terminal reward only.
32
- - Clarify expected outcome: strategic behavior should improve over random baselines even with modest compute.
 
33
 
34
- ## 6) Results
35
 
36
- - [PLACEHOLDER: Insert F006 metrics for success rate, average reward, and episode efficiency.]
37
- - Compare random baseline, trained policy, and oracle policy to show both practical gains and theoretical ceiling.
38
- - Include one short failure case to show where the policy still struggles and why that insight is useful.
39
 
40
- ## 7) Technical Highlights
41
 
42
- - Multi-database Spider coverage with structured metadata and deterministic train/eval split.
43
- - Typed action and observation models that make environment interactions explicit and debuggable.
44
- - Deployment-ready packaging for HuggingFace Spaces with bundled databases and health checks.
45
 
46
- ## 8) Try It Yourself
 
 
47
 
48
- - HuggingFace Space: add live link and a one-line instruction for connecting and running a first episode.
49
- - Colab notebook: link `notebooks/train_grpo.ipynb` with notes on expected runtime and CPU compatibility.
50
- - GitHub repository: link setup steps, architecture docs, and verification artifacts for reproducibility.
51
 
52
- ## 9) What We Learned
53
 
54
- - Dense intermediate rewards improve learning speed only when they align with the final objective.
55
- - Tool-using agents benefit from transparent errors; better diagnostics create better policy updates.
56
- - Packaging and storytelling matter: a reproducible deployment and clear narrative are as important as benchmark numbers for adoption.
 
1
  # SQLEnv Blog Post Outline
2
 
3
+ ## 1) Cold Open: Two Agents, Same Question
4
 
5
+ Open with two transcripts side by side: no explanation yet, just show the contrast.
 
 
6
 
7
+ **Random agent** (from showcase notebook, seed=7):
8
+ - "Count the number of paragraphs."
9
+ - SAMPLEs the same table 4 times, DESCRIBEs Documents 5 times, runs 3 SELECT * queries, then submits a random row as the answer.
10
+ - Reward: 0.278, incorrect.
11
 
12
+ **Trained agent** (from blog-material, error recovery example):
13
+ - "Which employee received the biggest bonus?"
14
+ - Describes employee, tries wrong column (Salary), gets error, describes evaluation to find Bonus column, writes correct JOIN, answers "Louis Deacon".
15
+ - Reward: 1.13, correct.
16
 
17
+ One explored strategically. The other wandered. Both had the same tools, the same budget, the same database. The difference is learned behavior.
18
 
19
+ ## 2) The Gap (3 sentences, not a section)
 
 
20
 
21
+ Text-to-SQL benchmarks give the model the full schema and ask for one query. That tests memorization, not investigation. SQLEnv hides the schema and gives the agent a step budget — forcing it to develop the exploration strategy that makes human analysts reliable.
22
 
23
+ ## 3) Four Actions, One Budget
 
 
24
 
25
+ Introduce the action space through the oracle episode (showcase notebook, seed=0):
26
+ - Question: "List the id of students who registered some courses and the number of their registered courses?"
27
+ - Step 1: DESCRIBE student_course_registrations → sees columns (+0.015)
28
+ - Step 2: DESCRIBE students → sees student_id (+0.015)
29
+ - Step 3: QUERY with JOIN + GROUP BY → gets the answer (+0.150)
30
+ - Step 4: ANSWER → correct (+1.000)
31
+ - Total: 1.180 in 4 steps.
32
 
33
+ Then show the reward architecture table:
34
+ - L1 Operational: +0.02 execution, +0.01 new info, -0.01 repeats, -0.005 step cost
35
+ - L2 Progress: delta from previous query result (potential-based)
36
+ - L3 Terminal: +1.0 correct, 0.0 wrong
37
 
38
+ Key point: terminal dominates. Max exploration over 15 steps is ~0.3; correct answer is 1.0. No farming.
39
 
40
+ ## 4) Training: SFT Teaches Strategy, GRPO Refines It
 
 
41
 
42
+ NOT "here's how GRPO works." Lead with the insight:
43
 
44
+ - Per-turn SFT (347 examples) taught the model to call describe forever. It never learned when to query or answer.
45
+ - Multi-turn SFT (120 full trajectories with assistant_only_loss) taught describe-then-query-then-answer as a coherent strategy.
46
+ - GRPO then refined this into real behaviors: error recovery, answer formatting, knowing when to stop.
47
 
48
+ Two-phase curriculum:
49
+ - Phase 1: Easy questions with KL penalty — stabilize format
50
+ - Phase 2: Easy + medium without KL — allow exploration
51
 
52
+ Show the reward curve: -0.1 to 0.5-0.7 over 400 steps. Clear learning signal.
 
 
53
 
54
+ ## 5) What the Agent Learned
55
 
56
+ Lead with observed behaviors, not metrics:
57
+ - **Schema discovery**: always describes before querying
58
+ - **Error recovery**: wrong column → re-describe → correct retry (concrete example)
59
+ - **Answer formatting**: comma-separated lists, pipe-delimited rows, [] for empty results
60
+ - **Subquery composition**: NOT IN, GROUP BY HAVING, UNION queries
61
+
62
+ These emerged from reward signal, not hard-coded rules.
63
+
64
+ Comparison table (N=50 eval, 2026-04-11):
65
+
66
+ | Method | Accuracy | Parse Rate | Avg Steps |
67
+ |--------|----------|------------|-----------|
68
+ | Zero-shot | 0% | 28% | 10.8 |
69
+ | 1-shot | 0% | 16% | 14.8 |
70
+ | 3-shot | 0% | 20% | 13.8 |
71
+ | GRPO v1 | 28% | 95% | 4.0 |
72
+ | GRPO v2 | 32% | 87% | 3.7 |
73
+
74
+ Two things stand out. First, 95% parse rate — the trained model almost always produces valid tool-call JSON. The base model fails 72-84% of the time and wastes its entire step budget repeating the same malformed output. Second, 28-32% accuracy from 0% — the environment produces genuine learning. The base model can't get a single answer right even with 3 examples; the trained model solves 14-16 out of 50 in just 3-4 steps.
75
+
76
+ ## 6) What the Agent Can't Do (The Interesting Part)
77
+
78
+ This is where small models hit a wall — and the wall tells us something about the environment.
79
+
80
+ - **Column name hallucination**: reads `FullName` from DESCRIBE, writes `full_name` in SQL. Pretraining biases override in-context schema. A 0.6B model can't fight its own weights.
81
+ - **FK chain reasoning**: single-table queries work; 3+ table JOINs don't. The model can't chain Documents -> Templates -> Ref_Template_Types.
82
+ - **More RL doesn't help**: v2 (double the training steps) produced identical accuracy. The ceiling is pretraining knowledge, not training budget.
83
+
84
+ This is actually the point: the environment produces a clear learning signal that saturates at the model's capacity. A larger model (or better SFT on JOIN patterns) would push higher. The environment scales; the 0.6B model doesn't.
85
+
86
+ ## 7) Reward Theory (Brief — For Technical Judges)
87
+
88
+ One paragraph: potential-based shaping (Ng et al., 1999). Our delta progress rewards take the form F(s,s') = gamma*phi(s') - phi(s), which provably preserves the optimal policy. Without this guarantee, agents learn to farm exploration rewards instead of answering questions. We observed this directly when we tried cumulative progress caps (not potential-based) — the agent explored endlessly.
89
+
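The policy-invariance argument fits in a few lines: potential-based shaping terms telescope, so the shaped return differs from the base return only by a constant phi(s_0). A sketch with illustrative potential values:

```python
gamma = 1.0                                  # episodic setting
phi = [0.0, 0.3, 0.5, 0.8]                   # illustrative potentials along one trajectory
shaping = [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]
# The sum telescopes: every intermediate phi cancels.
assert abs(sum(shaping) - (phi[-1] - phi[0])) < 1e-9
```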
90
+ ## 8) Technical Highlights (Bullet List)
91
+
92
+ - 10 Spider databases with structured metadata and deterministic train/eval split
93
+ - Typed action and observation models (Pydantic) — every interaction is explicit
94
+ - Read-only SQL via SQLite mode=ro — safety from the database engine, not regex
95
+ - TRL environment_factory integration — plugs into standard GRPO training
96
+ - Docker packaging for HuggingFace Spaces with health checks
97
+ - 473 training / 50 eval questions across easy/medium difficulty
98
+
99
+ ## 9) Try It Yourself
100
+
101
+ - HuggingFace Space: [live demo]
102
+ - Training notebook: notebooks/train_grpo.ipynb — runs on Colab L4 in ~7 hours
103
+ - Showcase notebook: notebooks/showcase_sqlenv.ipynb — explore the environment interactively
104
+ - GitHub: full source, architecture docs
105
+
106
+ ## 10) What We Learned (Close with Insights)
107
+
108
+ Three non-obvious findings:
109
+
110
+ 1. **Multi-turn SFT teaches strategy, not actions.** Per-turn examples teach vocabulary; multi-turn examples teach conversation. The difference is the difference between a model that calls describe forever and one that knows when to answer.
111
+
112
+ 2. **Transparent errors produce better policies.** When the environment surfaces "Error: no such column: full_name" instead of empty results, the agent develops error-recovery strategies. Better diagnostics produce better gradient signal.
113
+
114
+ 3. **Dense rewards need theory.** Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we observed agents farming exploration rewards at the expense of answering questions.
docs/blog-post-v1-preview.html ADDED
@@ -0,0 +1,403 @@
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>SQLEnv Blog Post Preview</title>
7
+ <style>
8
+ :root { --bg: #0d1117; --surface: #161b22; --card: #1a1a2e; --text: #e0e0e0; --muted: #8b949e; --accent: #58a6ff; --border: #30363d; }
9
+ * { margin: 0; padding: 0; box-sizing: border-box; }
10
+ body { background: var(--bg); color: var(--text); font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; line-height: 1.7; max-width: 820px; margin: 0 auto; padding: 40px 24px 80px; }
11
+ h1 { font-size: 2em; margin: 0 0 32px; color: #fff; line-height: 1.2; }
12
+ h2 { font-size: 1.5em; margin: 48px 0 16px; color: #fff; border-bottom: 1px solid var(--border); padding-bottom: 8px; }
13
+ h3 { font-size: 1.15em; margin: 32px 0 12px; color: #fff; }
14
+ p { margin: 0 0 16px; }
15
+ em { color: var(--muted); }
16
+ strong { color: #fff; }
17
+ a { color: var(--accent); text-decoration: none; }
18
+ a:hover { text-decoration: underline; }
19
+ code { background: var(--surface); padding: 2px 6px; border-radius: 4px; font-size: 0.9em; font-family: 'SF Mono', 'Fira Code', monospace; }
20
+ ul, ol { margin: 0 0 16px 24px; }
21
+ li { margin-bottom: 6px; }
22
+ li code { font-size: 0.85em; }
23
+ table { width: 100%; border-collapse: collapse; margin: 16px 0; }
24
+ th { text-align: left; padding: 10px 12px; background: var(--surface); border: 1px solid var(--border); color: #fff; font-size: 0.9em; }
25
+ td { padding: 10px 12px; border: 1px solid var(--border); font-size: 0.9em; }
26
+ tr.row-grpo { background: rgba(76, 175, 80, 0.1); }
27
+ tr.row-grpo td { color: #fff; font-weight: 600; }
28
+ tr.row-grpo td:nth-child(2) { color: #81c784; }
29
+ tr.row-base td:nth-child(2) { color: #e57373; }
30
+ img { max-width: 100%; border-radius: 8px; margin: 16px 0; }
31
+ figcaption, .caption { font-size: 0.85em; color: var(--muted); font-style: italic; margin-top: -8px; margin-bottom: 16px; }
32
+
33
+ /* Copyable code block */
34
+ .code-block { position: relative; margin: 16px 0; }
35
+ .code-block pre { background: var(--surface); border: 1px solid var(--border); border-radius: 8px; padding: 16px; padding-right: 48px; overflow-x: auto; font-size: 13px; line-height: 1.5; font-family: 'SF Mono', 'Fira Code', monospace; color: #c9d1d9; margin: 0; }
36
+ .code-block.scrollable pre { font-size: 12px; }
37
+ .reward-comment { color: #81c784; }
38
+ .copy-btn { position: absolute; top: 8px; right: 8px; background: var(--border); border: none; border-radius: 4px; padding: 4px 8px; cursor: pointer; color: var(--muted); font-size: 12px; font-family: inherit; transition: background 0.15s, color 0.15s; z-index: 1; }
39
+ .copy-btn:hover { background: var(--muted); color: var(--bg); }
40
+ .copy-btn.copied { background: #4caf50; color: #fff; }
41
+
42
+ /* Cards */
43
+ .cards { display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap; }
44
+ .card { flex: 1; min-width: 240px; background: var(--card); border-radius: 12px; padding: 24px 20px; display: flex; flex-direction: column; }
45
+ .card-header { text-align: center; margin-bottom: 14px; }
46
+ .card-header .emoji { font-size: 32px; display: block; margin-bottom: 6px; }
47
+ .card-header .title { font-weight: bold; font-size: 1.05em; }
48
+ .card-question { color: var(--muted); font-size: 13px; margin-bottom: 14px; text-align: center; height: 36px; display: flex; align-items: center; justify-content: center; }
49
+ .card-code { background: var(--bg); border: none; border-radius: 8px; padding: 10px 12px; font-size: 10.5px; margin: 0; line-height: 1.45; font-family: 'SF Mono', 'Fira Code', monospace; color: #c9d1d9; overflow-x: auto; height: 130px; white-space: pre; }
50
+ .card-outcome { margin-top: 12px; font-size: 13px; font-weight: 600; height: 24px; display: flex; align-items: center; }
51
+ .card-insight { font-size: 12px; color: var(--muted); margin-top: 8px; line-height: 1.5; border-top: 1px solid var(--border); padding-top: 8px; height: 40px; }
52
+
53
+ .green { color: #4caf50; }
54
+ .red { color: #ef5350; }
55
+ .blue { color: #4fc3f7; }
56
+ .orange { color: #ffb74d; }
57
+ .neg { color: #e57373; opacity: 0.85; }
58
+ .pos { color: #81c784; opacity: 0.85; }
59
+ .marker { font-size: 0.85em; font-weight: 600; }
60
+
61
+ /* Cold open side-by-side */
62
+ .cold-open { display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap; }
63
+ .cold-open-panel { flex: 1; min-width: 340px; background: var(--surface); border-radius: 12px; padding: 20px; border: 1px solid var(--border); }
64
+ .cold-open-panel h3 { margin: 0 0 4px; font-size: 1em; border: none; padding: 0; color: #fff; }
65
+ .cold-open-panel .subtitle { color: var(--muted); font-size: 13px; margin-bottom: 12px; }
66
+ .cold-open-panel pre { margin: 0; padding: 12px; font-size: 10.5px; border: none; background: var(--bg); border-radius: 8px; height: 200px; overflow-y: auto; }
67
+ .cold-open-panel .verdict { margin-top: 12px; font-size: 13px; }
68
+ .cold-open-panel .verdict .red { color: #e57373; }
69
+ .cold-open-panel .verdict .green { color: #81c784; }
70
+
71
+ /* Color legend */
72
+ .legend { display: flex; gap: 16px; margin: 8px 0 4px; font-size: 12px; color: var(--muted); flex-wrap: wrap; }
73
+ .legend-item { display: flex; align-items: center; gap: 4px; }
74
+ .legend-swatch { width: 10px; height: 10px; border-radius: 2px; display: inline-block; }
75
+ </style>
76
+ </head>
77
+ <body>
78
+
79
+ <h1>SQLEnv: Teaching Small Models to Explore Databases Like Analysts</h1>
80
+
81
+ <h2>Untrained vs Trained Agent</h2>
82
+
83
+ <div class="cold-open">
84
+
85
+ <div class="cold-open-panel">
86
+ <h3>Untrained agent</h3>
87
+ <div class="subtitle"><em>"Count the number of paragraphs."</em></div>
88
+ <pre><span class="neg">(-)</span> DESCRIBE Documents ×5 ← same table, five times
89
+ <span class="neg">(-)</span> SAMPLE Documents ×3 ← already saw these rows
90
+ <span class="neg">(-)</span> DESCRIBE Templates ← wrong table
91
+ <span class="pos">(+)</span> DESCRIBE Paragraphs ← finally the right table
92
+ <span class="neg">(-)</span> QUERY SELECT * LIMIT 5 ← no aggregation
93
+ <span class="neg">(-)</span> QUERY SELECT * LIMIT 5 ← still no COUNT(*)
94
+ <span class="neg">(-)</span> ANSWER "76 | Robbin CV" ← a random row</pre>
95
+ <div class="verdict"><span class="red">15 steps · reward 0.28 · never wrote SELECT COUNT(*)</span></div>
96
+ </div>
97
+
98
+ <div class="cold-open-panel">
99
+ <h3>Trained agent</h3>
100
+ <div class="subtitle"><em>"Which employee received the biggest bonus?"</em></div>
101
+ <pre><span class="pos">(+)</span> DESCRIBE employee
102
+ → Employee_ID, Name, Age, City
103
+ <span class="neg">(-)</span> QUERY ...ORDER BY Salary DESC
104
+ → Error: no such column: Salary
105
+ <span class="pos">(+)</span> DESCRIBE evaluation
106
+ → Employee_ID, Year_awarded, Bonus
107
+ <span class="pos">(+)</span> QUERY ...JOIN...ORDER BY Bonus DESC
108
+ → Louis Deacon
109
+ <span class="pos">(+)</span> ANSWER "Louis Deacon"</pre>
110
+ <div class="verdict"><span class="green">5 steps · reward 1.13 · recovered from error</span></div>
111
+ </div>
112
+
113
+ </div>
114
+
115
+ <p>Both agents have the same four tools, the same 15-step budget, and the same databases. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.</p>
116
+
117
+ <h2>The Gap</h2>
118
+
119
+ <p>Standard text-to-SQL evaluation works like this: hand the model a complete schema (all tables, all columns, all relationships) and ask it to produce one SQL query. If the query matches the gold answer, it passes. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.</p>
120
+
121
+ <p>SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.</p>
122
+
123
+ <h2>What Analysts Actually Do</h2>
124
+
125
+ <p>Consider the situation where you need to answer a question using data in an unfamiliar database. You probably cannot write the final query in one go. Instead, you run <code>DESCRIBE</code> to see what columns exist, <code>SELECT * LIMIT 5</code> to scan the actual data, then build your query piece by piece, adjusting joins, fixing column names, and retrying after errors. The answer emerges from iteration.</p>
126
+
127
+ <p>SQLEnv captures this workflow. Four actions mirror what analysts do:</p>
128
+
129
+ <ul>
130
+ <li><strong>DESCRIBE</strong> reveals column names and types for a table</li>
131
+ <li><strong>SAMPLE</strong> previews rows to understand the data</li>
132
+ <li><strong>QUERY</strong> executes a read-only SQL query</li>
133
+ <li><strong>ANSWER</strong> submits a final answer</li>
134
+ </ul>
135
+
136
+ <p>Each episode starts with a natural-language question and a list of table names. Columns, types, and relationships stay hidden until the agent discovers them through exploration. This partial observability forces strategy over pattern-matching.</p>
137
+
138
+ <p>A clean episode on the question <em>"List student IDs with registered courses and their course counts"</em>:</p>
139
+
140
+ <div class="code-block scrollable">
141
+ <button class="copy-btn" onclick="copyBlock(this)">Copy</button>
142
+ <pre>Step 1: DESCRIBE student_course_registrations
143
+ → student_id INTEGER, course_id INTEGER, registration_date DATETIME
144
+ → reward: +0.015 <span class="reward-comment">← new schema revealed</span>
145
+
146
+ Step 2: DESCRIBE students
147
+ → student_id INTEGER, student_details VARCHAR(255)
148
+ → reward: +0.015 <span class="reward-comment">← second table described</span>
149
+
150
+ Step 3: QUERY SELECT T1.student_id, count(*)
151
+ FROM students AS T1
152
+ JOIN student_course_registrations AS T2
153
+ ON T1.student_id = T2.student_id
154
+ GROUP BY T1.student_id
155
+ → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
156
+ → reward: +0.150 <span class="reward-comment">← results overlap with gold answer</span>
157
+
158
+ Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
159
+ → correct
160
+ → reward: +1.000 <span class="reward-comment">← terminal: correct answer</span></pre>
161
+ </div>
162
+
163
+ <p>Total reward: 1.180. Four steps. Exploration: 0.180, terminal: 1.000.</p>
164
+
165
+ <h2>Built on OpenEnv</h2>
166
+
167
+ <p><a href="https://github.com/meta-pytorch/OpenEnv">OpenEnv</a> is a standard protocol for RL environments with a simple contract:</p>
168
+
169
+ <ul>
170
+ <li><code>reset(seed)</code> starts a new episode and returns the initial observation</li>
171
+ <li><code>step(action)</code> executes one action and returns observation, reward, and done flag</li>
172
+ </ul>
173
+
174
+ <p>Pydantic models enforce typed contracts between agent and environment. Any environment that implements this protocol plugs into TRL, torchforge, and Unsloth without glue code. SQLEnv implements it with four actions (DESCRIBE, SAMPLE, QUERY, ANSWER):</p>
175
+
176
+ <div class="legend">
177
+ <div class="legend-item"><span class="legend-swatch" style="background:#d2a8ff;"></span> method</div>
178
+ <div class="legend-item"><span class="legend-swatch" style="background:#7ee787;"></span> action type</div>
179
+ <div class="legend-item"><span class="legend-swatch" style="background:#ffa657;"></span> argument</div>
180
+ </div>
181
+
182
+ <div class="code-block">
183
+ <button class="copy-btn" onclick="copyBlock(this)">Copy</button>
184
+ <pre style="margin-top:0">env = SQLEnvironment(questions_path=<span style="color:#a5d6ff">"..."</span>, db_dir=<span style="color:#a5d6ff">"..."</span>, tokenizer=tok)
185
+ obs = env.<span style="color:#d2a8ff">reset</span>(seed=42)
186
+ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"DESCRIBE"</span>, argument=<span style="color:#ffa657">"employee"</span>))
187
+ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"QUERY"</span>, argument=<span style="color:#ffa657">"SELECT COUNT(*) FROM employee"</span>))
188
+ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"ANSWER"</span>, argument=<span style="color:#ffa657">"10"</span>))</pre>
189
+ </div>
190
+
191
+ <p>TRL's <code>environment_factory</code> auto-discovers the four tool methods from the environment class for GRPO training. The same environment runs locally, in Docker on HuggingFace Spaces, or over WebSocket.</p>
192
+
193
+ <p>The Green Agent evaluator wraps this protocol for benchmarking:</p>
194
+
195
+ <div class="code-block">
196
+ <pre>evaluate(env, policy, n_episodes=50, seed=0)</pre>
197
+ </div>
198
+
199
+ <p>This runs any <code>Policy</code> through the environment and reports success rate, average reward, and step count. Built-in <code>RandomPolicy</code> and <code>OraclePolicy</code> baselines provide lower and upper bounds (0% vs 100% accuracy, 0.25 vs 1.17 reward).</p>
200
+
201
+ <h2>Reward Architecture</h2>
202
+
203
+ <p>Three layers of reward signal:</p>
204
+
205
+ <table>
206
+ <tr><th>Layer</th><th>Signal</th><th>Per-step clip</th></tr>
207
+ <tr>
208
+ <td><strong>L1: Operational</strong></td>
209
+ <td>Successful execution <span class="pos">(+0.02)</span>, new info <span class="pos">(+0.01)</span>, repeat <span class="neg">(-0.03)</span>, step cost <span class="neg">(-0.02)</span></td>
210
+ <td style="white-space:nowrap">[-0.10, 0.15]</td>
211
+ </tr>
212
+ <tr>
213
+ <td><strong>L2: Progress</strong></td>
214
+ <td>Delta from previous query result — cardinality, value overlap, numeric proximity. Positive <span class="pos">(+)</span> for improvement, negative <span class="neg">(-)</span> for regression.</td>
215
+ <td style="white-space:nowrap">[-0.10, 0.15]</td>
216
+ </tr>
217
+ <tr>
218
+ <td><strong>L3: Terminal</strong></td>
219
+ <td>Correct answer: <span class="pos">+1.0</span>. Wrong: <span class="neg">0.0</span></td>
220
+ <td style="white-space:nowrap">one-shot</td>
221
+ </tr>
222
+ </table>
223
+
224
+ <p>Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.</p>
225
+
226
+ <p>The progress signal uses delta-from-previous-step, a form of potential-based reward shaping (<a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf">Ng et al., 1999</a>). This preserves the optimal policy because the agent cannot game progress at the expense of correctness. We confirmed this empirically: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer. Recent work validates this approach for tool-using agents. <a href="https://arxiv.org/abs/2603.22293">TIPS</a> (2026) showed potential-based turn-level shaping improved multi-turn agents by 11.8% over GRPO baselines. <a href="https://arxiv.org/html/2504.13958v1">ToolRL</a> (2025) found fine-grained reward decomposition improved tool learning by 17%. <a href="https://arxiv.org/abs/2410.07745">StepTool</a> (2024) confirmed step-grained shaping outperformed outcome-only rewards.</p>
227
+
228
+ <h2>Training</h2>
229
+
230
+ <p>We train Qwen3-0.6B with Group Relative Policy Optimization (GRPO). TRL's <code>environment_factory</code> runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.</p>
231
+
232
+ <p>SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.</p>
233
+
234
+ <p>Multi-turn SFT (120 full trajectories with <code>assistant_only_loss</code>) taught describe-then-query-then-answer as a coherent strategy. The subsequent GRPO training refined this into error recovery, answer formatting, and knowing when to stop exploring.</p>
235
+
236
+ <p>Two-phase curriculum:</p>
237
+ <ul>
238
+ <li><strong>Phase 1</strong>: Easy questions (single-table), KL penalty (beta=0.04) to keep the policy close to SFT initialization while allowing exploration. Reward climbs from -0.1 to 0.5-0.7 over 400 steps.</li>
239
+ <li><strong>Phase 2</strong>: Easy + medium (multi-table JOINs), KL removed (beta=0) so the agent can deviate further from SFT and discover new strategies. Reward holds at ~0.5.</li>
240
+ </ul>
241
+
242
+ <div style="text-align: center; margin: 24px 0;">
243
+ <img src="rl-training-phase-1.png" alt="GRPO Training Progress" style="max-width: 100%; border-radius: 8px;">
244
+ <p class="caption" style="text-align: center;">GRPO reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 in Phase 1 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Peaks at 1.15 mark correctly solved episodes. 902 steps, ~4.75h on Colab L4.</p>
245
+ </div>
246
+
247
+ <p>SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.</p>
248
+
249
+ <h2>What the Agent Learned</h2>
250
+
251
+ <p>The following behaviors emerged during training:</p>
252
+
253
+ <div class="cards">
254
+
255
+ <div class="card">
256
+ <div class="card-header">
257
+ <span class="emoji">🔎</span>
258
+ <span class="title blue">Schema Discovery</span>
259
+ </div>
260
+ <div class="card-question"><em>"What is the total bonus in all evaluations?"</em></div>
261
+ <div class="card-code">describe("evaluation")
262
+ → Employee_ID, Year_awarded, Bonus
263
+ query("SELECT SUM(Bonus) FROM evaluation")
264
+ → 19500.0
265
+ answer("19500.0")</div>
266
+ <div class="card-outcome"><span class="green">3 steps · reward 1.15 · correct</span></div>
267
+ <div class="card-insight">Aggregation on one table. Describe first, then write targeted SQL.</div>
268
+ </div>
269
+
270
+ <div class="card">
271
+ <div class="card-header">
272
+ <span class="emoji">🔧</span>
273
+ <span class="title orange">Error Recovery</span>
274
+ </div>
275
+ <div class="card-question"><em>"Which employee received the biggest bonus?"</em></div>
276
+ <div class="card-code">describe("employee") → Name, Age
277
+ query("...ORDER BY Salary DESC")
278
+ → Error: no such column: Salary
279
+ describe("evaluation") → Bonus
280
+ query("...JOIN...ORDER BY Bonus DESC")
281
+ → Louis Deacon
282
+ answer("Louis Deacon")</div>
283
+ <div class="card-outcome"><span class="green">5 steps · reward 1.13 · correct</span></div>
284
+ <div class="card-insight">Two-table JOIN with error recovery. Wrong column, re-describe, retry.</div>
285
+ </div>
286
+
287
+ <div class="card">
288
+ <div class="card-header">
289
+ <span class="emoji">🚧</span>
290
+ <span class="title red">The Ceiling</span>
291
+ </div>
292
+ <div class="card-question"><em>"Which city has the most arriving flights?"</em></div>
293
+ <div class="card-code">describe("AIRPORTS")
294
+ → City, AirportCode
295
+ query("SELECT CITY, COUNT(*)
296
+ FROM AIRPORTS GROUP BY CITY
297
+ ORDER BY COUNT(*) DESC LIMIT 1")
298
+ → Albany | 4
299
+ answer("Albany")</div>
300
+ <div class="card-outcome"><span class="red">3 steps · reward 0.0 · incorrect</span></div>
301
+ <div class="card-insight">Multi-hop JOIN (3+ tables). Counted airports, not flights. Beyond 0.6B.</div>
302
+ </div>
303
+
304
+ </div>
305
+
306
+ <p>The first two cards show learned behaviors: schema-first exploration and error recovery. The third shows where 0.6B saturates. We expand on these limitations and next steps below.</p>
307
+
308
+ <h3>Evaluation (N=50 episodes, 2 independent runs)</h3>
309
+
310
+ <p>All conditions run through SQLEnv's Green Agent evaluator: <code>evaluate(env, policy, n_episodes, seed)</code>.</p>
311
+
312
+ <table>
313
+ <tr><th>Method</th><th>Accuracy</th><th>Parse Rate</th><th>Avg Steps</th></tr>
314
+ <tr class="row-base"><td>Zero-shot</td><td>0%</td><td>24-28%</td><td>10.8-12.4</td></tr>
315
+ <tr class="row-base"><td>1-shot</td><td>0-2%</td><td>16-17%</td><td>14.0-14.8</td></tr>
316
+ <tr class="row-base"><td>3-shot</td><td>0%</td><td>19-20%</td><td>13.8-14.8</td></tr>
317
+ <tr class="row-grpo"><td>GRPO (trained)</td><td>~30%</td><td>95-100%</td><td>3.5-4.0</td></tr>
318
+ </table>
319
+
320
+ <p><strong>95-100% parse rate</strong>: the trained model produces valid tool-call JSON. The base model fails 76-83% of the time and burns its step budget repeating malformed output. <strong>~30% accuracy from 0%</strong>: the base model cannot answer a single question even with 3 examples, but the trained model solves 12-16 out of 50 in 3-4 steps.</p>
321
+
322
+ <p>We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Both scored ~30% accuracy across two evaluation runs. The variation between runs (6-8 percentage points) was larger than the difference between checkpoints: more training does not raise the ceiling. For RL alone, the bottleneck is the model's 0.6B pretraining rather than the training budget.</p>
323
+
324
+ <h2>Limitations at 0.6B Parameters</h2>
325
+
326
+ <p>Three failure modes define the current ceiling:</p>
327
+
328
+ <ul>
329
+ <li><strong>Column name hallucination.</strong> The model reads <code>FullName</code> from DESCRIBE but writes <code>full_name</code> in SQL, or reads <code>Horsepower</code> and writes <code>HorsepowerDESC</code> (missing space). Pretraining biases override the schema that the model just observed in context.</li>
330
+ <li><strong>FK chain reasoning.</strong> The model handles single-table queries well but fails on three-table JOINs such as Documents → Templates → Ref_Template_Types. It cannot chain foreign keys through intermediate tables.</li>
331
+ <li><strong>More RL does not help.</strong> Extended training (v2: 4 total epochs) produced identical accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.</li>
332
+ </ul>
333
+
334
+ <p>RL drives accuracy from 0% to 30% but saturates at 0.6B capacity. We did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past this ceiling in our current work. We discuss possible directions in the Next Steps section.</p>
335
+
336
+ <h2>The Learning Space</h2>
337
+
338
+ <p>The Green Agent evaluator brackets the learning space with two baselines:</p>
339
+
340
+ <table>
341
+ <tr><th>Policy</th><th>Accuracy</th><th>Avg Reward</th><th>Avg Steps</th></tr>
342
+ <tr class="row-base"><td>Random</td><td>0%</td><td>0.247</td><td>15.0</td></tr>
343
+ <tr class="row-grpo"><td>GRPO (trained)</td><td>~30%</td><td>~0.35</td><td>3.5</td></tr>
344
+ <tr><td>Oracle</td><td>100%</td><td>1.168</td><td>3.5</td></tr>
345
+ </table>
346
+
347
+ <p>Random scores 0.247 by accumulating small exploration rewards across 15 steps without answering. Oracle scores 1.168 in 3.5 steps. This gap between 0.25 and 1.17 represents what a trained agent can learn. Our GRPO agent lands at ~0.35, above random but far below oracle, with room for improvement through better SFT warmup or larger models.</p>
348
+
349
+ <h2>Technical Highlights</h2>
350
+
351
+ <ul>
352
+ <li><strong>676 questions</strong> (473 train / 203 eval) across 10 Spider databases with difficulty labels</li>
353
+ <li><strong>Typed models</strong> with Pydantic: every action, observation, and state is explicit and debuggable</li>
354
+ <li><strong>Read-only SQL</strong> via SQLite <code>mode=ro</code>, where the database engine enforces safety rather than regex</li>
355
+ <li><strong>Potential-based reward shaping</strong> (Ng et al., 1999) that provably preserves the optimal policy</li>
356
+ <li><strong>TRL environment_factory</strong> integration for standard GRPO training without a custom loop</li>
357
+ <li><strong>Green Agent evaluator</strong> with <code>Policy</code> protocol, <code>evaluate()</code> harness, and <code>RandomPolicy</code>/<code>OraclePolicy</code> baselines</li>
358
+ </ul>
359
+
360
+ <h2>Next Steps</h2>
361
+
362
+ <p>The environment supports two directions for improvement:</p>
363
+
364
+ <p><strong>Thinking mode.</strong> The 30% ceiling comes from multi-table reasoning. The model cannot plan a three-table JOIN path before writing SQL. Qwen3's <code>&lt;think&gt;</code> blocks offer a way to reason about the join chain before writing the query. In our experiments, RL alone did not produce useful thinking: the model either emitted empty <code>&lt;think&gt;&lt;/think&gt;</code> blocks or collapsed into degenerate loops (<code>&lt;think&gt;assistant&lt;think&gt;assistant...</code>) that consumed ~23% of rollouts. Pure RL discovers that thinking tokens exist but not how to use them. SFT warmup with structured reasoning examples ("I need to join Documents → Templates → Ref_Template_Types through Template_ID") could bootstrap the format, then RL could refine when to think and when to skip. This is worth testing at 0.6B before concluding the ceiling requires a larger model.</p>
365
+
366
+ <p><strong>Larger models.</strong> Our goal is small models that run locally, so scaling to 7B or beyond changes the deployment story. That said, a 1.7B model has more capacity to attend to DESCRIBE output and override pretrained column names. The environment and reward architecture do not depend on model size, so scaling up requires changing the training configuration rather than redesigning the environment. At some point, larger models may solve these questions with few-shot prompting alone, but the environment remains useful for training small models that need to run without API access.</p>
367
+
368
+ <h2>Try It Yourself</h2>
369
+
370
+ <ul>
371
+ <li><strong>Training notebook</strong>: <code>notebooks/train_grpo.ipynb</code> runs the full SFT + GRPO pipeline on Colab L4 in ~7 hours</li>
372
+ <li><strong>Comparison notebook</strong>: <code>notebooks/compare_methods.ipynb</code> evaluates base vs trained models side by side</li>
373
+ <li><strong>Showcase notebook</strong>: <code>notebooks/showcase_sqlenv.ipynb</code> lets you explore the environment, run episodes, and see what tools and rewards are available</li>
374
+ <li><strong>GitHub</strong>: full source, architecture docs, and training artifacts</li>
375
+ </ul>
376
+
377
+ <h2>Discussion</h2>
378
+
379
+ <p><strong>The format of SFT data matters more than the quantity.</strong> Per-turn SFT (347 examples) taught the model individual tool calls but not when to use them. The model called describe repeatedly because half the training examples were describe calls. Multi-turn SFT (120 full trajectories) taught the model to chain describe, query, and answer into a coherent episode. The difference was not the number of examples but whether each example showed a complete problem-solving sequence.</p>
380
+
381
+ <p><strong>Transparent errors help the agent learn.</strong> When the environment returns <code>"Error: no such column: full_name"</code> instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.</p>
382
+
383
+ <p><strong>Dense rewards benefit from theoretical grounding.</strong> Potential-based shaping (Ng et al., 1999) guarantees the agent optimizes for the right objective. Without it, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this for tool-using agents. TIPS (2026) showed 11.8% gains from potential-based turn-level shaping. ToolRL (2025) found 17% improvement from fine-grained reward decomposition. StepTool (2024) confirmed step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.</p>
384
+
385
+ <p><strong>The environment is the contribution.</strong> The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.</p>
386
+
387
+ <script>
388
+ function copyBlock(btn) {
389
+ const pre = btn.parentElement.querySelector('pre');
390
+ const text = pre.innerText;
391
+ navigator.clipboard.writeText(text).then(() => {
392
+ btn.textContent = 'Copied!';
393
+ btn.classList.add('copied');
394
+ setTimeout(() => {
395
+ btn.textContent = 'Copy';
396
+ btn.classList.remove('copied');
397
+ }, 2000);
398
+ });
399
+ }
400
+ </script>
401
+
402
+ </body>
403
+ </html>
docs/blog-post-v1.md ADDED
@@ -0,0 +1,269 @@
1
+ # SQLEnv: Teaching Small Models to Explore Databases Like Analysts
2
+
3
+ ## Two Agents, Same Question
4
+
5
+ <!-- Side-by-side cold open: random vs trained agent -->
6
+ <div style="display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap;">
7
+ <div style="flex: 1; min-width: 340px; background: #161b22; border-radius: 12px; padding: 20px; border: 1px solid #30363d; color: #e0e0e0;">
8
+ <div style="font-weight: bold; color: #ef5350;">Untrained agent</div>
9
+ <div style="color: #8b949e; font-size: 13px; margin-bottom: 12px;"><em>"Count the number of paragraphs."</em></div>
10
+ <pre style="background: #0d1117; border: none; padding: 12px; font-size: 11.5px; margin: 0; color: #c9d1d9;">DESCRIBE Documents ×5 ← same table, five times
11
+ SAMPLE Documents ×3 ← already saw these rows
12
+ DESCRIBE Templates ← wrong table
13
+ DESCRIBE Paragraphs ← finally the right table
14
+ QUERY SELECT * LIMIT 5 ← no aggregation
15
+ QUERY SELECT * LIMIT 5 ← still no COUNT(*)
16
+ ANSWER "76 | Robbin CV" ← a random row</pre>
17
+ <div style="margin-top: 12px; font-size: 13px; font-weight: 600;">❌ <span style="color: #ef5350;">15 steps · reward 0.28 · never wrote SELECT COUNT(*)</span></div>
18
+ </div>
19
+ <div style="flex: 1; min-width: 340px; background: #161b22; border-radius: 12px; padding: 20px; border: 1px solid #30363d; color: #e0e0e0;">
20
+ <div style="font-weight: bold; color: #4caf50;">Trained agent</div>
21
+ <div style="color: #8b949e; font-size: 13px; margin-bottom: 12px;"><em>"Which employee received the biggest bonus?"</em></div>
22
+ <pre style="background: #0d1117; border: none; padding: 12px; font-size: 11.5px; margin: 0; color: #c9d1d9;">DESCRIBE employee
23
+ → Employee_ID, Name, Age, City
24
+ QUERY ...ORDER BY Salary DESC
25
+ → Error: no such column: Salary
26
+ DESCRIBE evaluation
27
+ → Employee_ID, Year_awarded, Bonus
28
+ QUERY ...JOIN...ORDER BY Bonus DESC
29
+ → Louis Deacon
30
+ ANSWER "Louis Deacon"</pre>
31
+ <div style="margin-top: 12px; font-size: 13px; font-weight: 600;">✅ <span style="color: #4caf50;">5 steps · reward 1.13 · recovered from error</span></div>
32
+ </div>
33
+ </div>
34
+
35
+ Both agents have the same four tools, the same 15-step budget, and the same databases. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.
36
+
37
+ ## The Gap
38
+
39
+ Standard text-to-SQL evaluation works like this: hand the model a complete schema (all tables, all columns, all relationships) and ask it to produce one SQL query. If the query matches the gold answer, it passes. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.
40
+
41
+ SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.
42
+
43
+ ## What Analysts Actually Do
44
+
45
+ Consider the situation where you need to answer a question using data in an unfamiliar database. You probably cannot write the final query in one go. Instead, you run `DESCRIBE` to see what columns exist, `SELECT * LIMIT 5` to scan the actual data, then build your query piece by piece, adjusting joins, fixing column names, and retrying after errors. The answer emerges from iteration.
46
+
47
+ SQLEnv captures this workflow. Four actions mirror what analysts do:
48
+
49
+ - **DESCRIBE** reveals column names and types for a table
50
+ - **SAMPLE** previews rows to understand the data
51
+ - **QUERY** executes a read-only SQL query
52
+ - **ANSWER** submits a final answer
53
+
54
+ Each episode starts with a natural-language question and a list of table names. Columns, types, and relationships stay hidden until the agent discovers them through exploration. This partial observability forces strategy over pattern-matching.
55
+
56
+ A clean episode on the question *"List student IDs with registered courses and their course counts"*:
57
+
58
+ ```
59
+ Step 1: DESCRIBE student_course_registrations
60
+ → student_id INTEGER, course_id INTEGER, registration_date DATETIME
61
+ → reward: +0.015
62
+
63
+ Step 2: DESCRIBE students
64
+ → student_id INTEGER, student_details VARCHAR(255)
65
+ → reward: +0.015
66
+
67
+ Step 3: QUERY SELECT T1.student_id, count(*)
68
+ FROM students AS T1
69
+ JOIN student_course_registrations AS T2
70
+ ON T1.student_id = T2.student_id
71
+ GROUP BY T1.student_id
72
+ → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
73
+ → reward: +0.150
74
+
75
+ Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
76
+ → correct
77
+ → reward: +1.000
78
+ ```
79
+
80
+ Total reward: 1.180. Four steps. Exploration: 0.180, terminal: 1.000.
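
The episode totals can be reproduced by summing the per-step rewards from the trace above:

```python
# Per-step rewards from the episode trace above.
exploration_rewards = [0.015, 0.015, 0.150]  # DESCRIBE, DESCRIBE, QUERY
terminal_reward = 1.000                      # correct ANSWER

exploration = sum(exploration_rewards)
total = exploration + terminal_reward
print(f"exploration={exploration:.3f}, terminal={terminal_reward:.3f}, total={total:.3f}")
# → exploration=0.180, terminal=1.000, total=1.180
```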
81
+
82
+ ## Built on OpenEnv
83
+
84
+ [OpenEnv](https://github.com/meta-pytorch/OpenEnv) is a standard protocol for RL environments. The contract is simple:
85
+
86
+ - `reset(seed)` starts a new episode and returns the initial observation
87
+ - `step(action)` executes one action and returns observation, reward, and done flag
88
+
89
+ Pydantic models enforce typed contracts between agent and environment. Any environment that implements this protocol plugs into TRL, torchforge, and Unsloth without glue code.
90
+
91
+ SQLEnv implements this protocol with four domain-specific actions:
92
+
93
+ ```python
94
+ env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tok)
95
+ obs = env.reset(seed=42) # pick question, load DB, hide schema
96
+ obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
97
+ obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
98
+ obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
99
+ # obs.done=True, obs.reward=1.0
100
+ ```
101
+
102
+ TRL's `environment_factory` auto-discovers the four tool methods (`describe`, `sample`, `query`, `answer`) from the environment class for GRPO training. The same environment runs locally, in Docker on HuggingFace Spaces, or over WebSocket via `SQLEnvClient`.
103
+
104
+ The Green Agent evaluator wraps this protocol for benchmarking:
105
+
106
+ ```python
107
+ evaluate(env, policy, n_episodes=50, seed=0)
108
+ ```
109
+
110
+ This runs any `Policy` through the environment and reports success rate, average reward, and step count. Built-in `RandomPolicy` and `OraclePolicy` baselines provide lower and upper bounds (0% vs 100% accuracy, 0.25 vs 1.17 reward).
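
A minimal sketch of such a harness, assuming hypothetical `Obs` and `Policy` shapes (the real protocol and metric names live in SQLEnv's evaluator):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Obs:
    text: str
    reward: float = 0.0
    done: bool = False


class Policy(Protocol):
    def act(self, obs: Obs) -> str: ...


def evaluate(env, policy: Policy, n_episodes: int, seed: int = 0) -> dict:
    """Roll a policy through the environment; report success rate, reward, steps."""
    successes, total_reward, total_steps = 0, 0.0, 0
    for ep in range(n_episodes):
        obs, ep_reward, ep_steps = env.reset(seed=seed + ep), 0.0, 0
        while not obs.done:
            obs = env.step(policy.act(obs))
            ep_reward += obs.reward
            ep_steps += 1
        successes += obs.reward >= 1.0  # terminal +1.0 marks a correct answer
        total_reward += ep_reward
        total_steps += ep_steps
    return {"success_rate": successes / n_episodes,
            "avg_reward": total_reward / n_episodes,
            "avg_steps": total_steps / n_episodes}
```

A `RandomPolicy` baseline is then just a policy whose `act` samples uniformly from the four action types.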
111
+
112
+ ## Reward Architecture
113
+
114
+ Three layers of reward signal:
115
+
116
+ | Layer | Signal | Per-step clip |
117
+ |-------|--------|---------------|
118
+ | **L1: Operational** | Successful execution (+0.02), new info (+0.01), repeat penalty (-0.03), step cost (-0.02) | [-0.10, 0.15] |
119
+ | **L2: Progress** | Delta from previous query result — cardinality, value overlap, numeric proximity | [-0.10, 0.15] |
120
+ | **L3: Terminal** | Correct answer: +1.0. Wrong: 0.0 | one-shot |
121
+
122
+ Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.
123
+
124
+ The progress signal uses delta-from-previous-step, a form of potential-based reward shaping ([Ng et al., 1999](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf)). This preserves the optimal policy because the agent cannot game progress at the expense of correctness. We confirmed this empirically: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer. Recent work validates this approach for tool-using agents. [TIPS](https://arxiv.org/abs/2603.22293) (2026) showed potential-based turn-level shaping improved multi-turn agents by 11.8% over GRPO baselines. [ToolRL](https://arxiv.org/html/2504.13958v1) (2025) found fine-grained reward decomposition improved tool learning by 17%. [StepTool](https://arxiv.org/abs/2410.07745) (2024) confirmed step-grained shaping outperformed outcome-only rewards.
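
The mechanics can be sketched in a few lines. With an illustrative potential `phi` (not the environment's actual similarity metric), the shaped reward is the discounted change in `phi` between consecutive query results; the per-episode sum telescopes, so looping through intermediate states cannot farm progress reward:

```python
def phi(result: set, gold: set) -> float:
    """Illustrative potential: fraction of gold values present in the result."""
    return len(result & gold) / len(gold) if gold else 0.0


def shaping(prev: set, curr: set, gold: set, gamma: float = 1.0) -> float:
    # Potential-based shaping (Ng et al., 1999): F(s, s') = gamma * phi(s') - phi(s)
    return gamma * phi(curr, gold) - phi(prev, gold)


gold = {("a", 1), ("b", 2)}
trajectory = [set(), {("a", 1)}, {("a", 1), ("b", 2)}]
total = sum(shaping(p, c, gold) for p, c in zip(trajectory, trajectory[1:]))
# Telescoping sum: depends only on the endpoints, not the path taken.
assert abs(total - (phi(trajectory[-1], gold) - phi(trajectory[0], gold))) < 1e-9
```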
125
+
126
+ ## Training
127
+
128
+ We train Qwen3-0.6B with Group Relative Policy Optimization (GRPO). TRL's `environment_factory` runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.
129
+
130
+ SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.
131
+
132
+ Multi-turn SFT (120 full trajectories with `assistant_only_loss`) taught describe-then-query-then-answer as a coherent strategy. The subsequent GRPO training refined this into error recovery, answer formatting, and knowing when to stop exploring.
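
The difference between the two data formats, sketched with a hypothetical trajectory (illustrative content, not rows from the actual dataset):

```python
# One multi-turn SFT example: a full trajectory in chat format. Loss is
# computed only on assistant turns (assistant_only_loss), so the model
# learns the describe -> query -> answer sequence, not isolated tool calls.
trajectory = [
    {"role": "user", "content": "Question: total bonus in all evaluations? Tables: employee, evaluation"},
    {"role": "assistant", "content": "DESCRIBE evaluation"},
    {"role": "user", "content": "Employee_ID INTEGER, Year_awarded TEXT, Bonus REAL"},
    {"role": "assistant", "content": "QUERY SELECT SUM(Bonus) FROM evaluation"},
    {"role": "user", "content": "19500.0"},
    {"role": "assistant", "content": "ANSWER 19500.0"},
]

# Per-turn SFT would instead emit one example per assistant message,
# discarding the cross-turn strategy the full trajectory demonstrates.
per_turn_examples = [
    trajectory[: i + 1] for i, m in enumerate(trajectory) if m["role"] == "assistant"
]
assert len(per_turn_examples) == 3
```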
133
+
134
+ Two-phase curriculum:
135
+ - **Phase 1**: Easy questions (single-table), KL penalty (beta=0.04) to keep the policy close to SFT initialization while allowing exploration. Reward climbs from -0.1 to 0.5-0.7 over 400 steps.
136
+ - **Phase 2**: Easy + medium (multi-table JOINs), KL removed (beta=0) so the agent can deviate further from SFT and discover new strategies. Reward holds at ~0.5.
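
As a sketch, the two phases differ only in question difficulty and the KL coefficient (a hypothetical config container; in practice these values map onto TRL's GRPO settings):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    difficulties: tuple  # which question tiers the sampler draws from
    kl_beta: float       # KL penalty toward the SFT reference policy


PHASE_1 = PhaseConfig(difficulties=("easy",), kl_beta=0.04)
PHASE_2 = PhaseConfig(difficulties=("easy", "medium"), kl_beta=0.0)
```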
137
+
138
+ ![GRPO Training Progress](rl-training-phase-1.png)
139
+ *GRPO reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Peaks at 1.15 mark correctly solved episodes. 902 steps, ~4.75h on Colab L4.*
140
+
141
+ SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.
142
+
143
+ ## What the Agent Learned
144
+
145
+ The following behaviors emerged during training:
146
+
147
+ <div style="display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap;">
148
+
149
+ <div style="flex: 1; min-width: 240px; background: #1a1a2e; border-radius: 12px; padding: 24px 20px; color: #e0e0e0; display: flex; flex-direction: column;">
150
+ <div style="font-size: 36px; text-align: center; margin-bottom: 12px;">🔍</div>
151
+ <div style="color: #4fc3f7; font-weight: bold; text-align: center; margin-bottom: 4px;">Schema Discovery</div>
152
+ <div style="color: #8b949e; font-size: 13px; text-align: center; margin-bottom: 14px;"><em>"What is the total bonus in all evaluations?"</em></div>
153
+ <div style="background: #0d1117; border-radius: 8px; padding: 12px; font-size: 11px; font-family: monospace; color: #c9d1d9; flex: 1; min-height: 140px; white-space: pre; line-height: 1.5;">describe("evaluation")
154
+ → Employee_ID, Year_awarded, Bonus
155
+ query("SELECT SUM(Bonus) FROM evaluation")
156
+ → 19500.0
157
+ answer("19500.0")</div>
158
+ <div style="display: flex; align-items: center; gap: 8px; margin-top: 12px; font-size: 13px; font-weight: 600;">
159
+ <span style="font-size: 18px;">✅</span> <span style="color: #4caf50;">3 steps · reward 1.15</span>
160
+ </div>
161
+ <div style="font-size: 12px; color: #8b949e; margin-top: 8px; line-height: 1.5; border-top: 1px solid #30363d; padding-top: 8px;">Aggregation on one table. Describe first, then write targeted SQL.</div>
162
+ </div>
163
+
164
+ <div style="flex: 1; min-width: 240px; background: #1a1a2e; border-radius: 12px; padding: 24px 20px; color: #e0e0e0; display: flex; flex-direction: column;">
165
+ <div style="font-size: 36px; text-align: center; margin-bottom: 12px;">🔄</div>
166
+ <div style="color: #ffb74d; font-weight: bold; text-align: center; margin-bottom: 4px;">Error Recovery</div>
167
+ <div style="color: #8b949e; font-size: 13px; text-align: center; margin-bottom: 14px;"><em>"Which employee received the biggest bonus?"</em></div>
168
+ <div style="background: #0d1117; border-radius: 8px; padding: 12px; font-size: 11px; font-family: monospace; color: #c9d1d9; flex: 1; min-height: 140px; white-space: pre; line-height: 1.5;">describe("employee") → Name, Age, City
169
+ query("...ORDER BY Salary DESC")
170
+ → Error: no such column: Salary
171
+ describe("evaluation") → Bonus column
172
+ query("...JOIN...ORDER BY Bonus DESC")
173
+ → Louis Deacon
174
+ answer("Louis Deacon")</div>
175
+ <div style="display: flex; align-items: center; gap: 8px; margin-top: 12px; font-size: 13px; font-weight: 600;">
176
+ <span style="font-size: 18px;">✅</span> <span style="color: #4caf50;">5 steps · reward 1.13</span>
177
+ </div>
178
+ <div style="font-size: 12px; color: #8b949e; margin-top: 8px; line-height: 1.5; border-top: 1px solid #30363d; padding-top: 8px;">Two-table JOIN with error recovery. Wrong column, re-describe, retry — the analyst loop.</div>
179
+ </div>
180
+
181
+ <div style="flex: 1; min-width: 240px; background: #1a1a2e; border-radius: 12px; padding: 24px 20px; color: #e0e0e0; display: flex; flex-direction: column;">
182
+ <div style="font-size: 36px; text-align: center; margin-bottom: 12px;">🧱</div>
183
+ <div style="color: #ef5350; font-weight: bold; text-align: center; margin-bottom: 4px;">The Ceiling</div>
184
+ <div style="color: #8b949e; font-size: 13px; text-align: center; margin-bottom: 14px;"><em>"Which city has the most arriving flights?"</em></div>
185
+ <div style="background: #0d1117; border-radius: 8px; padding: 12px; font-size: 11px; font-family: monospace; color: #c9d1d9; flex: 1; min-height: 140px; white-space: pre; line-height: 1.5;">describe("AIRPORTS")
186
+ → City, AirportCode
187
+ query("SELECT CITY, COUNT(*)
188
+ FROM AIRPORTS GROUP BY CITY
189
+ ORDER BY COUNT(*) DESC LIMIT 1")
190
+ → Albany | 4
191
+ answer("Albany")</div>
192
+ <div style="display: flex; align-items: center; gap: 8px; margin-top: 12px; font-size: 13px; font-weight: 600;">
193
+ <span style="font-size: 18px;">❌</span> <span style="color: #ef5350;">3 steps · reward 0.0</span>
194
+ </div>
195
+ <div style="font-size: 12px; color: #8b949e; margin-top: 8px; line-height: 1.5; border-top: 1px solid #30363d; padding-top: 8px;">Multi-hop JOIN (3+ tables). Counted airports instead of joining through flights. Beyond 0.6B capacity.</div>
196
+ </div>
197
+
198
+ </div>
199
+
200
+ ### Evaluation (N=50 episodes, 2 independent runs)
201
+
202
+ All conditions run through SQLEnv's Green Agent evaluator: `evaluate(env, policy, n_episodes, seed)`. The same harness powers the showcase notebook (Random vs Oracle baselines) and the full comparison below.
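The harness contract can be sketched in pure Python. The `evaluate(env, policy, n_episodes, seed)` signature comes from the docs above; the env's reset/step interface below is an assumption for illustration, not the package's verified API:

```python
from typing import Protocol


class Policy(Protocol):
    def act(self, observation): ...


def evaluate(env, policy, n_episodes: int, seed: int) -> dict:
    # Roll out n_episodes and aggregate accuracy, average reward,
    # and average steps. The env is assumed to return
    # (obs, reward, done, info) from step(); info["correct"] marks success.
    correct, total_reward, total_steps = 0, 0.0, 0
    for ep in range(n_episodes):
        obs = env.reset(seed=seed + ep)
        done = False
        while not done:
            obs, reward, done, info = env.step(policy.act(obs))
            total_reward += reward
            total_steps += 1
        correct += int(info.get("correct", False))
    return {
        "accuracy": correct / n_episodes,
        "avg_reward": total_reward / n_episodes,
        "avg_steps": total_steps / n_episodes,
    }
```

The same loop serves `RandomPolicy`, `OraclePolicy`, and a trained model: only `policy.act` changes.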
203
+
204
+ | Method | Accuracy | Parse Rate | Avg Steps |
205
+ |--------|----------|------------|-----------|
206
+ | Zero-shot | 0% | 24-28% | 10.8-12.4 |
207
+ | 1-shot | 0-2% | 16-17% | 14.0-14.8 |
208
+ | 3-shot | 0% | 19-20% | 13.8-14.8 |
209
+ | GRPO (trained) | ~30% | 95-100% | 3.5-4.0 |
210
+
211
+ Two results stand out. **95-100% parse rate**: the trained model almost always produces valid tool-call JSON. The base model fails 76-83% of the time and burns its step budget repeating malformed output. **~30% accuracy from 0%**: the base model cannot answer a single question even with 3 examples, but the trained model solves 12-16 out of 50 in 3-4 steps.
212
+
213
+ We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Both scored ~30% accuracy across two evaluation runs, and the variation between runs (6-8 percentage points) was larger than the difference between checkpoints: more training does not raise the ceiling. For RL alone, the bottleneck is the model's 0.6B pretraining, not the training budget.
214
+
215
+ ## Limitations at 0.6B Parameters
216
+
217
+ Three failure modes define the current ceiling:
218
+
219
+ - **Column name hallucination.** The model reads `FullName` from DESCRIBE but writes `full_name` in SQL, or reads `Horsepower` and writes `HorsepowerDESC` (missing space). Pretraining biases override the schema that the model just observed in context.
220
+ - **FK chain reasoning.** The model handles single-table queries well but fails on three-table JOINs such as Documents → Templates → Ref_Template_Types. It cannot chain foreign keys through intermediate tables.
221
+ - **More RL does not help.** Extended training (v2: 4 total epochs) produced identical accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.
222
+
223
+ RL drives accuracy from 0% to 30% but saturates at 0.6B capacity. In the current work we did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past that ceiling; we discuss possible directions in the Next Steps section.
224
+
225
+ ## The Learning Space
226
+
227
+ The Green Agent evaluator brackets the learning space with two baselines:
228
+
229
+ | Policy | Accuracy | Avg Reward | Avg Steps |
230
+ |--------|----------|------------|-----------|
231
+ | Random | 0% | 0.247 | 15.0 |
232
+ | **GRPO (trained)** | **~30%** | **~0.35** | **3.5** |
233
+ | Oracle | 100% | 1.168 | 3.5 |
234
+
235
+ Random scores 0.247 by accumulating small exploration rewards across 15 steps without answering. Oracle scores 1.168 in 3.5 steps. This gap between 0.25 and 1.17 represents what a trained agent can learn. Our GRPO agent lands at ~0.35, above random but far below oracle, with room for improvement through better SFT warmup or larger models.
236
+
237
+ ## Technical Highlights
238
+
239
+ - **676 questions** (473 train / 203 eval) across 10 Spider databases with difficulty labels
240
+ - **Typed models** with Pydantic: every action, observation, and state is explicit and debuggable
241
+ - **Read-only SQL** via SQLite `mode=ro`, where the database engine enforces safety rather than regex
242
+ - **Potential-based reward shaping** (Ng et al., 1999) that provably preserves the optimal policy
243
+ - **TRL environment_factory** integration for standard GRPO training without a custom loop
244
+ - **Green Agent evaluator** with `Policy` protocol, `evaluate()` harness, and `RandomPolicy`/`OraclePolicy` baselines
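The read-only bullet above relies on SQLite's URI `mode=ro`, which the standard library exposes directly. A minimal stdlib sketch (the path and table are illustrative):

```python
import os
import sqlite3
import tempfile

# Build a scratch database, then reopen it read-only via URI mode=ro.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE evaluation (Bonus REAL)")
rw.execute("INSERT INTO evaluation VALUES (19500.0)")
rw.commit()
rw.close()

ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT SUM(Bonus) FROM evaluation").fetchone())  # (19500.0,)
try:
    ro.execute("DELETE FROM evaluation")  # the engine, not a regex, says no
except sqlite3.OperationalError as err:
    print("write rejected:", err)
```

Any write statement on the `mode=ro` connection raises `sqlite3.OperationalError`, so safety does not depend on parsing the agent's SQL.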
245
+
246
+ ## Next Steps
247
+
248
+ The environment supports two directions for improvement:
249
+
250
+ **Thinking mode.** The 30% ceiling comes from multi-table reasoning. The model cannot plan a three-table JOIN path before writing SQL. Qwen3's `<think>` blocks offer a way to reason about the join chain before writing the query. In our experiments, RL alone did not produce useful thinking: the model either emitted empty `<think></think>` blocks or collapsed into degenerate loops (`<think>assistant<think>assistant...`) that consumed ~23% of rollouts. Pure RL discovers that thinking tokens exist but not how to use them. SFT warmup with structured reasoning examples ("I need to join Documents → Templates → Ref_Template_Types through Template_ID") could bootstrap the format, then RL could refine when to think and when to skip. This is worth testing at 0.6B before concluding the ceiling requires a larger model.
251
+
252
+ **Larger models.** Our goal is small models that run locally, so scaling to 7B or beyond changes the deployment story. That said, a 1.7B model has more capacity to attend to DESCRIBE output and override pretrained column names. The environment and reward architecture do not depend on model size, so scaling up requires changing the training configuration rather than redesigning the environment. At some point, larger models may solve these questions with few-shot prompting alone, but the environment remains useful for training small models that need to run without API access.
253
+
254
+ ## Try It Yourself
255
+
256
+ - **Training notebook**: `notebooks/train_grpo.ipynb` runs the full SFT + GRPO pipeline on Colab L4 in ~5 hours
257
+ - **Comparison notebook**: `notebooks/compare_methods.ipynb` evaluates base vs trained models side by side
258
+ - **Showcase notebook**: `notebooks/showcase_sqlenv.ipynb` lets you explore the environment, run episodes, and see what tools and rewards are available
259
+ - **GitHub**: full source, architecture docs, and training artifacts
260
+
261
+ ## Discussion
262
+
263
+ **The format of SFT data matters more than the quantity.** Per-turn SFT (347 examples) taught the model individual tool calls but not when to use them. The model called describe repeatedly because half the training examples were describe calls. Multi-turn SFT (120 full trajectories) taught the model to chain describe, query, and answer into a coherent episode. The difference was not the number of examples but whether each example showed a complete problem-solving sequence.
264
+
265
+ **Transparent errors help the agent learn.** When the environment returns `"Error: no such column: full_name"` instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.
266
+
267
+ **Dense rewards benefit from theoretical grounding.** Potential-based shaping (Ng et al., 1999) guarantees the agent optimizes for the right objective. Without it, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this for tool-using agents. TIPS (2026) showed 11.8% gains from potential-based turn-level shaping. ToolRL (2025) found 17% improvement from fine-grained reward decomposition. StepTool (2024) confirmed step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.
268
+
269
+ **The environment is the contribution.** The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.
docs/blog-post.md ADDED
@@ -0,0 +1,118 @@
1
+ # SQLEnv: Teaching Small Models to Explore Databases Like Analysts
2
+
3
+ ## What Analysts Do That One-Shot LLMs Miss
4
+
5
+ A data analyst opens a new database. She doesn't write the final query first. She runs `DESCRIBE` to see what columns exist, `SELECT * LIMIT 5` to understand the data, then builds her query piece by piece — adjusting joins, fixing column names, retrying after errors. The answer emerges from a process, not a guess.
6
+
7
+ Most text-to-SQL systems skip this process entirely. They take a question and a schema, produce one SQL query, and hope it's right. When it isn't — wrong column name, missing join, bad filter — there's no recovery. The model guessed once and moved on.
8
+
9
+ SQLEnv is a reinforcement learning environment that trains the exploration habit instead.
10
+
11
+ ## The Problem: Static Benchmarks Reward Memorization
12
+
13
+ Spider and BIRD measure whether a model can produce a correct SQL query in one shot. High accuracy on these benchmarks doesn't mean the model can handle an unfamiliar schema. It means the model memorized patterns from training data that happen to match the test set.
14
+
15
+ The gap shows up immediately in production. Schemas change. Column names don't match expectations. Tables have unexpected relationships. A model that can't recover from its first wrong guess is brittle in exactly the situations where SQL competence matters most.
16
+
17
+ The deeper problem: these benchmarks grade the final answer and ignore the process. An agent that explores the schema, discovers a mistake, and corrects itself gets the same score as one that guesses right on the first try — and no credit at all when the guess fails. There's no learning signal for the investigative reasoning that makes human analysts reliable.
18
+
19
+ ## SQLEnv: An Interactive RL Environment
20
+
21
+ SQLEnv gives agents four actions that mirror what analysts do:
22
+
23
+ - **DESCRIBE** — see column names and types for a table
24
+ - **SAMPLE** — view example rows to understand the data
25
+ - **QUERY** — run a read-only SQL query and see results
26
+ - **ANSWER** — submit a final answer
27
+
28
+ Each episode starts with a natural-language question and a list of table names. The schema is hidden. The agent must discover columns, types, and relationships through exploration before it can answer. This partial observability is the point — it forces the agent to develop strategy rather than pattern-match.
29
+
30
+ The environment plugs into TRL's `environment_factory` for GRPO training, runs in Docker for safe SQL execution, and deploys to HuggingFace Spaces.
31
+
32
+ ## How an Episode Works
33
+
34
+ Consider the question: *"What is the total bonus given in all evaluations?"*
35
+
36
+ A trained agent's episode looks like this:
37
+
38
+ 1. `describe("evaluation")` → sees columns: Employee_ID, Year_awarded, Bonus
39
+ 2. `query("SELECT SUM(Bonus) FROM evaluation")` → result: 19500.0
40
+ 3. `answer("19500.0")` → correct
41
+
42
+ Three tool calls, clean execution. The agent discovered the schema, wrote the right query, and submitted. But the interesting behavior is what happens when things go wrong.
43
+
44
+ On a harder question — *"Which employee received the biggest bonus?"* — the agent tries `SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1`, gets `no such column: Salary`, calls `describe("evaluation")` to find the Bonus column, then writes the correct JOIN. The error recovery is learned behavior, not hard-coded.
45
+
46
+ ## Reward Architecture
47
+
48
+ The environment provides three layers of reward signal:
49
+
50
+ **Operational feedback** on every step: did the SQL execute (+0.02) or error? Is this a new query (+0.01) or a repeat (-0.01)?
51
+
52
+ **Progress signal** on queries: how close are the results to the correct answer? Measured by value overlap and result cardinality, shaped as delta-from-previous-step. This form is potential-based (Ng et al., 1999), which provably preserves the optimal policy — the agent can't game progress rewards at the expense of correctness.
53
+
54
+ **Terminal reward** for the answer: +1.0 for correct, 0.0 for wrong.
55
+
56
+ Terminal correctness dominates by design. The maximum exploration reward across all steps is roughly 0.3; a correct answer is worth 1.0. An agent that explores endlessly but never answers will always score less than one that answers correctly. Dense intermediate rewards exist to make training feasible on small models — without them, a 0.6B parameter model can't learn from sparse terminal-only signal.
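The potential-based guarantee can be seen in miniature: shaped bonuses telescope, so a trajectory's total bonus depends only on its endpoints. A sketch with illustrative potential values (not SQLEnv's actual progress function):

```python
def shaping_bonus(phi_prev: float, phi_curr: float, gamma: float = 1.0) -> float:
    # Potential-based shaping (Ng et al., 1999): F(s, s') = gamma*phi(s') - phi(s).
    # Adding F to the reward provably preserves the optimal policy.
    return gamma * phi_curr - phi_prev


# Progress potential after each query, e.g. result overlap with the gold
# answer. Values here are illustrative, not SQLEnv's actual phi.
potentials = [0.0, 0.25, 0.25, 0.75]
bonuses = [shaping_bonus(p, q) for p, q in zip(potentials, potentials[1:])]

# With gamma=1 the bonuses telescope to phi(last) - phi(first), so
# oscillating between states farms no extra reward.
assert abs(sum(bonuses) - (potentials[-1] - potentials[0])) < 1e-12
print(bonuses)  # [0.25, 0.0, 0.5]
```

This telescoping property is why the progress layer cannot distort what the terminal reward optimizes for.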
57
+
58
+ ## Training with GRPO
59
+
60
+ We train using Group Relative Policy Optimization (GRPO), which compares multiple rollouts of the same question to compute advantages. TRL's `environment_factory` runs the agent through SQLEnv for each rollout, accumulating rewards across the multi-turn tool-calling loop.
61
+
62
+ Training uses a two-phase curriculum:
63
+ - **Phase 1**: Easy questions only (single-table), with KL penalty (β=0.04) to stay close to the SFT-initialized policy
64
+ - **Phase 2**: Easy + medium questions (multi-table JOINs), KL penalty removed to allow exploration
65
+
66
+ The SFT warmup matters more than we expected. We generate 120 multi-turn trajectories showing the full describe→query→answer workflow, trained with `assistant_only_loss` so the model learns from its own actions, not the environment's responses. Without this, GRPO has no coherent policy to improve — the agent never learns to call tools in sequence.
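A hedged sketch of the warmup configuration (TRL field names such as `assistant_only_loss` vary by version; `trajectories` is a stand-in for the 120 multi-turn conversations, not a real variable in the repo):

```python
from trl import SFTConfig, SFTTrainer  # sketch: exact fields vary by TRL version

cfg = SFTConfig(
    output_dir="sft-warmup",
    num_train_epochs=2,
    assistant_only_loss=True,  # loss on assistant turns only, not tool output
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=cfg,
    train_dataset=trajectories,  # 120 full describe -> query -> answer episodes
)
trainer.train()
```

With `assistant_only_loss=True`, tokens from environment observations are masked out, so the model is trained on its own actions rather than on predicting tool results.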
67
+
68
+ ## What We Observed
69
+
70
+ Across nine training runs on Qwen3-0.6B, three findings stand out.
71
+
72
+ **The environment produces clear learning signal.** Phase 1 reward climbs from near-zero to 0.5–0.7 within 400 steps. The model learns to describe tables before querying, format answers correctly (comma-separated lists, pipe-delimited rows), recover from SQL errors by re-describing tables, and submit `[]` for genuinely empty results. These are strategic behaviors that emerge from reward signal, not from hard-coded rules.
73
+
74
+ **The model hits a capacity ceiling on medium questions.** Multi-table JOIN queries — the kind requiring foreign-key chain reasoning across 3+ tables — don't improve with more training. We ran extended training (v2: 4 total epochs across both phases) and the reward curve stayed flat. The dominant failure mode is column name hallucination: the model reads the schema correctly via DESCRIBE, then writes SQL using pretrained column names that don't match. A 0.6B model can't override pretraining biases through RL reward signal alone.
75
+
76
+ **Thinking mode helps error recovery but doesn't raise the ceiling.** With Qwen3's thinking mode enabled, the model reasons through SQL errors in `<think>` blocks — noticing column name mismatches, adjusting join paths. Easy-question accuracy stays the same, but error recovery becomes more deliberate. A new failure mode emerges: ~23% of rollouts degenerate into unclosed `<think>` loops that consume the entire token budget. The fix is straightforward (add SFT examples with proper think blocks), but it reveals how small models struggle with structural token management.
77
+
78
+ ### Training metrics (Run 9, Qwen3-0.6B)
79
+
80
+ | Phase | Steps | Questions | Reward range | Mean reward |
81
+ |-------|-------|-----------|-------------|-------------|
82
+ | Phase 1 (easy, β=0.04) | 870 | 435 | 0.01–1.15 | ~0.5 |
83
+ | Phase 2 (easy+medium, β=0.0) | 934 | 467 | 0.01–1.15 | ~0.5 |
84
+
85
+ Parse rate: >98% (model produces valid tool-call JSON). The 10% eval accuracy on the GRPO checkpoint vs 0% on the base model confirms the environment produces genuine learning, even if the absolute numbers are modest for a 0.6B model.
86
+
87
+ ## Technical Highlights
88
+
89
+ - **10 Spider databases** with structured metadata and a deterministic train/eval split
90
+ - **Typed action and observation models** make every environment interaction explicit and debuggable
91
+ - **Read-only SQL execution** via SQLite `mode=ro` — safety enforced by the database engine, not regex
92
+ - **Potential-based reward shaping** (Ng et al., 1999) — delta progress rewards provably preserve the optimal policy
93
+ - **TRL environment_factory integration** — the environment plugs into standard GRPO training with no custom training loop
94
+ - **Docker packaging** for HuggingFace Spaces with bundled databases and health checks
95
+
96
+ ## Future Directions
97
+
98
+ Two clear paths forward:
99
+
100
+ **Larger models.** The 0.6B model's ceiling comes from pretrained column-name biases overriding schema context. A 1.7B or larger model has more capacity to attend to the DESCRIBE output and override pretraining. The environment and reward architecture are model-agnostic — scaling up is a config change, not a redesign.
101
+
102
+ **Thinking mode with targeted SFT.** Qwen3's thinking mode shows promise for multi-step reasoning (error recovery, join path planning) but needs SFT coverage on proper `<think>...</think>` blocks to prevent degenerate loops. Combining a larger model with thinking mode should push into medium-difficulty multi-table questions where the current model plateaus.
103
+
104
+ The environment itself is the contribution. Whether the agent is 0.6B or 70B, sparse reward or dense, GRPO or PPO — the action space, reward architecture, and episode structure remain the same. SQLEnv provides the training ground. The models will catch up.
105
+
106
+ ## Try It Yourself
107
+
108
+ - **HuggingFace Space**: [live demo link]
109
+ - **Training notebook**: `notebooks/train_grpo.ipynb` — runs on Colab L4 in ~7 hours for both phases
110
+ - **GitHub**: full source, architecture docs, and verification artifacts
111
+
112
+ ## What We Learned
113
+
114
+ Dense intermediate rewards accelerate learning only when they align with the final objective. Potential-based shaping (Ng et al., 1999) gives us this guarantee — delta progress rewards can't distort the optimal policy. Without this property, agents learn to farm exploration rewards instead of answering questions.
115
+
116
+ Tool-using agents benefit from transparent errors. When the environment surfaces `"Error: no such column: full_name"` instead of silently returning empty results, the agent develops error-recovery strategies. Better diagnostics produce better policy updates.
117
+
118
+ Multi-turn SFT is the foundation, not a warmup step. Without full-trajectory SFT (describe→query→answer as one example with `assistant_only_loss`), GRPO has no coherent starting policy to improve. Per-turn SFT teaches individual actions; multi-turn SFT teaches strategy. The difference is the difference between a model that calls describe forever and one that knows when to answer.
docs/competition-deliverables.md ADDED
@@ -0,0 +1,129 @@
1
+ # OpenEnv Challenge — Deliverables & Status
2
+
3
+ ## Competition
4
+
5
+ **OpenEnv Challenge: SOTA Environments to drive general intelligence**
6
+
7
+ Sponsors: PyTorch team at Meta, HuggingFace, Unsloth
8
+
9
+ Prizes:
10
+ - $10K in HuggingFace credits
11
+ - Invitation to publish on PyTorch.org blog
12
+
13
+ ## Judging Criteria
14
+
15
+ Evaluated primarily on the submission blog. Judging panel grades on:
16
+
17
+ 1. Creative and robust use of OpenEnv
18
+ 2. Technical excellence
19
+ 3. Storytelling
20
+ 4. Open-source demo
21
+ 5. Green Agent wrapper for the environment
22
+
23
+ ## Required Deliverables
24
+
25
+ ### 1. HuggingFace Space
26
+
27
+ Environment on the HF Hub. Judges interact with the action space
28
+ (DESCRIBE, SAMPLE, QUERY, ANSWER) against real Spider databases.
29
+
30
+ Live at: https://huggingface.co/spaces/hjerpe/sql_env
31
+ Docker image: `registry.hf.space/hjerpe-sql_env:latest`
32
+ Published via `uv run openenv push` on 2026-03-29 (see `specs/F007-DEMO.md`).
33
+
34
+ **Status:** Live. Endpoints `/health`, `/docs`, `/web`, `/reset`, `/step`, `/ws`
35
+ exposed by the FastAPI server in `envs/sql_env/server/`. Python client:
36
+ `SQLEnv(base_url="https://hjerpe-sql-env.hf.space")`.
37
+
38
+ ### 2. Training notebooks/scripts (GitHub)
39
+
40
+ Colab-ready notebooks:
41
+ - `notebooks/train_grpo.ipynb` — Full SFT + GRPO pipeline, Colab L4, ~7h
42
+ - `notebooks/compare_methods.ipynb` — Base vs GRPO evaluation (zero-shot, 1-shot, 3-shot, GRPO v1, v2)
43
+ - `notebooks/showcase_sqlenv.ipynb` — Interactive environment demo with Random and Oracle baselines
44
+
45
+ **Status:** Complete
46
+
47
+ ### 3. Blog post (HuggingFace)
48
+
49
+ Analyst exploration framing, reward architecture with theory,
50
+ training results (0% to ~30%), failure analysis, lessons learned.
51
+
52
+ Draft: `docs/blog-post-v1.md`
53
+
54
+ **Status:** Draft v1 complete, not yet published
55
+
56
+ ## Additional Deliverables
57
+
58
+ ### 4. GitHub repo
59
+
60
+ Clean codebase: zero ruff errors, typed Pydantic models, 280 passing
61
+ tests, architecture docs, training artifacts.
62
+
63
+ **Status:** Complete (F016 quality sweep done)
64
+
65
+ ### 5. Trained checkpoints (HuggingFace Hub)
66
+
67
+ - `hjerpe/sqlenv-qwen3-0.6b-grpo` (v1)
68
+ - `hjerpe/sqlenv-qwen3-0.6b-grpo-v2` (v2)
69
+
70
+ **Status:** Uploaded
71
+
72
+ ### 6. Green Agent wrapper
73
+
74
+ OpenEnv evaluation wrapper pattern. A `Policy` protocol with
75
+ `evaluate(env, policy, n_episodes, seed)` that reports success rate,
76
+ average reward, and average steps. Includes `RandomPolicy` and
77
+ `OraclePolicy` baselines for standardized comparison.
78
+
79
+ Implementation: `evaluation/policies.py`, `evaluation/oracle_policy.py`
80
+ Tests: `tests/test_evaluation.py` (17 tests, all passing)
81
+ Used by: `notebooks/showcase_sqlenv.ipynb`, `notebooks/compare_methods.ipynb`
82
+
83
+ **Status:** Complete
84
+
85
+ ### 7. TRL `environment_factory` adapter
86
+
87
+ HuggingFace TRL's native OpenEnv integration: pass a class with
88
+ `reset()` + named tool methods as `environment_factory=` and `GRPOTrainer`
89
+ runs the multi-turn tool-calling loop automatically (no custom
90
+ `rollout_func`).
91
+
92
+ Implementation: `training/trl_adapter.py` — class `SQLEnvTRL` exposing
93
+ `describe()`, `sample()`, `query()`, `answer()` as tool methods plus
94
+ `sql_env_reward_func`. Used by `notebooks/train_grpo.ipynb` (cell 16:
95
+ `environment_factory=SQLEnvTRL`).
96
+
97
+ Note: the adapter instantiates a **local** in-process `SQLEnvironment`,
98
+ not a WebSocket client to the hosted HF Space. Intentional — training
99
+ needs N parallel sessions (one per generation), and local is faster and
100
+ avoids the Space's default 1-session concurrency limit.
101
+
102
+ **Status:** Complete
103
+
104
+ ## Our Position
105
+
106
+ No interactive SQL exploration environment exists. SQL Repair
107
+ (WALKMAN303) is single-turn fix-it. Calendar Gym (Turing) is
108
+ real-world but not SQL. We are the only multi-turn
109
+ strategy-discovery environment for database exploration.
110
+
111
+ Key narrative: "The environment is the product." The trained agent
112
+ demonstrates that the environment works, but the contribution is
113
+ the action space, reward architecture, and episode structure.
114
+
115
+ ## Open Items
116
+
117
+ - [x] Deploy HuggingFace Space (live at https://huggingface.co/spaces/hjerpe/sql_env, 2026-03-29)
118
+ - [ ] Publish blog post on HuggingFace (planned 2026-04-12)
119
+ - [ ] Final review of blog-post-v1.md
120
+ - [ ] Verify notebooks run clean on fresh Colab
121
+ - [ ] Post-launch: enable `SUPPORTS_CONCURRENT_SESSIONS=True` + `max_concurrent_envs=64` on the Space for external users who want to retrain against the hosted endpoint
122
+
123
+ ## Resources
124
+
125
+ - OpenEnv tutorial: https://colab.research.google.com/github/meta-pytorch/OpenEnv/blob/main/examples/OpenEnv_Tutorial.ipynb
126
+ - OpenEnv GitHub: https://github.com/meta-pytorch/OpenEnv
127
+ - OpenEnv docs: https://meta-pytorch.org/OpenEnv/
128
+ - Environment hub: https://huggingface.co/openenv
129
+ - Discord: https://discord.com/invite/YsTYBh6PD9
docs/data-sources.md ADDED
@@ -0,0 +1,182 @@
1
+ # Data Sources
2
+
3
+ Reference for what data SQLEnv uses, where it comes from, and how to
4
+ regenerate it. All data lives under `data/` and is checked into the repo,
5
+ so a fresh clone works offline after `uv sync`.
6
+
7
+ ## Summary
8
+
9
+ | Artifact | Path | Origin | Count |
10
+ |---|---|---|---|
11
+ | SQLite databases | `data/databases/<db_id>/<db_id>.sqlite` | [Spider](https://yale-lily.github.io/spider) (taoyds/spider on GitHub) | 10 databases |
12
+ | Training questions | `data/questions/questions_train.json` | Spider train split (`xlangai/spider`) | 473 questions |
13
+ | Eval questions | `data/questions/questions_eval.json` | Spider validation split (`xlangai/spider`) | 203 questions |
14
+ | DB allowlist | `data/questions/db_list.json` | hand-curated subset | 10 db_ids |
15
+ | SFT trajectories | `data/sft/sft_trajectories.json` | generated from gold SQL | 120 trajectories |
16
+
17
+ Total: ~676 questions across 10 Spider databases, plus 120 multi-turn SFT
18
+ warmup trajectories.
19
+
20
+ ## Upstream: Spider
21
+
22
+ [Spider](https://yale-lily.github.io/spider) (Yu et al., EMNLP 2018) is a
23
+ cross-domain text-to-SQL benchmark with ~200 databases and ~10k
24
+ question/gold-SQL pairs. Every question has a natural-language prompt, a
25
+ gold SQL query, and a target database. We use two mirrors:
26
+
27
+ 1. **Questions** via HuggingFace Datasets: [`xlangai/spider`](https://huggingface.co/datasets/xlangai/spider)
28
+ — loaded with `datasets.load_dataset("xlangai/spider", split=...)` in
29
+ `scripts/download_spider_data.py`.
30
+ 2. **SQLite databases** via the Spider GitHub mirror:
31
+ - `https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite`
32
+ - Fallback: the official Google Drive archive
33
+ (`1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J`)
34
+ — fetched by `scripts/download_spider_databases.py`.
35
+
36
+ Spider's license is CC BY-SA 4.0.
37
+
38
+ ## The 10-database subset (`db_list.json`)
39
+
40
+ We do not ship all ~200 Spider databases. The allowlist in
41
+ `data/questions/db_list.json` pins 10:
42
+
43
+ ```
44
+ student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation,
45
+ pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player
46
+ ```
47
+
48
+ These were chosen for schema variety (single-table aggregates, 2-table
49
+ joins, 3-table FK chains, naming-convention quirks) while keeping the
50
+ repo small and the training loop fast.
51
+
52
+ **Train/eval split is by database**, not random sampling within a
53
+ database. This prevents train/eval leakage at the schema level:
54
+
55
+ - **Train databases** (7): `car_1, concert_singer, cre_Doc_Template_Mgt,
56
+ dog_kennels, employee_hire_evaluation, flight_2, student_assessment`
57
+ - **Eval databases** (4): `flight_2, pets_1, poker_player, world_1`
58
+
59
+ `flight_2` appears in both; other eval DBs are schemas the model never
60
+ saw during training. `sql_env.training.data_loading.validate_no_data_leak`
61
+ asserts zero question-text overlap between the two files at load time.
62
+
63
+ ## Question files
64
+
65
+ Both `questions_train.json` and `questions_eval.json` are lists of records
66
+ with this shape (actual sample from `car_1` train):
67
+
68
+ ```json
69
+ {
70
+ "question_text": "How many cars have a larger accelerate than the car with the largest horsepower?",
71
+ "database_name": "car_1",
72
+ "gold_sql": "SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)",
73
+ "gold_answer": 39,
74
+ "answer_type": "integer",
75
+ "difficulty": "easy",
76
+ "tables_involved": ["CARS_DATA"],
77
+ "split": "train",
78
+ "question_id": "car_1_train_000"
79
+ }
80
+ ```
81
+
82
+ ### Counts and difficulty mix
83
+
84
+ | Split | Total | easy | medium | hard |
85
+ |---|---|---|---|---|
86
+ | train | 473 | 435 | 32 | 6 |
87
+ | eval | 203 | 185 | 18 | 0 |
88
+
89
+ The easy-heavy distribution is deliberate for the 0.6B capacity ceiling
90
+ (see `docs/blog-material.md` — "The 0.6B Capacity Ceiling"). Medium and
91
+ hard questions are kept in the mix for Phase 2 exposure but are not where
92
+ this model size gains accuracy.
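The table above can be reproduced from the question files by tallying the `difficulty` field. A small sketch:

```python
from collections import Counter

def difficulty_mix(records):
    """Tally question records by their Spider difficulty label."""
    counts = Counter(r["difficulty"] for r in records)
    return {"total": len(records), **counts}
```

Running it over `json.load(open("data/questions/questions_train.json"))` should reproduce the train row.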
+
+ ### Curation pipeline
+
+ `scripts/curate_questions.py` turns raw Spider records into the format
+ above. Per question, it:
+
+ 1. Filters to databases in `db_list.json`.
+ 2. Executes `gold_sql` against the real SQLite database to produce
+    `gold_answer` — this is what the environment grades agents against,
+    not a string match on the SQL.
+ 3. Normalizes the answer into a typed shape (`integer`, `string`,
+    `list[...]`, `table`) via `answer_type`.
+ 4. Parses `FROM` and `JOIN` tokens to fill `tables_involved` (used by SFT
+    generation to decide which tables to `describe()`).
+ 5. Assigns the Spider-provided difficulty label.
+ 6. Writes train and eval to separate files with a per-question
+    `question_id` derived from `{db_id}_{split}_{index}`.
+
+ Re-running the script is idempotent given the same Spider snapshot.
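Step 2 is the heart of the pipeline: the gold answer is whatever the gold query actually returns. A minimal sketch (the real script's `answer_type` normalization is richer than this):

```python
import sqlite3

def execute_gold_sql(db_path, gold_sql):
    """Curation step 2, sketched: run the gold query against the real
    SQLite file and collapse the result into a gradable gold_answer."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(gold_sql).fetchall()
    finally:
        conn.close()
    # A 1x1 result collapses to a scalar (like the COUNT(*) sample above).
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]
    return rows
```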
+
+ ## SFT warmup trajectories (`data/sft/sft_trajectories.json`)
+
+ 120 multi-turn trajectories used as a supervised warmup before GRPO.
+ Each record has `messages` (tool-calling chat format) and `tools` (the
+ four SQLEnv tool schemas).
+
+ ### How they are generated
+
+ `scripts/generate_sft_data.py` walks the training questions and, for each
+ one, runs the real `SQLEnvironment` programmatically:
+
+ 1. `describe(table)` for every table in `tables_involved`
+ 2. `query(gold_sql)` — captures real tabular output from the environment
+ 3. `answer(gold_answer)` — terminal step
+
+ The captured sequence becomes an assistant-labelled trajectory. This is
+ **not synthetic text** — the assistant turns wrap the actual environment
+ responses the model will see at training and inference time, which is
+ what lets GRPO's KL anchor point align with real env output.
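The trajectory assembly can be sketched as follows. Field names here assume a generic OpenAI-style tool-calling chat format, not the script's exact schema:

```python
def build_trajectory(question, tables, gold_sql, gold_answer, env_responses):
    """Assemble one multi-turn SFT record (sketch): describe each involved
    table, run the gold SQL, then answer, pairing every assistant tool
    call with the real environment response captured for it."""
    messages = [{"role": "user", "content": question}]
    steps = ([("describe", t) for t in tables]
             + [("query", gold_sql), ("answer", gold_answer)])
    for (tool, arg), response in zip(steps, env_responses):
        messages.append({"role": "assistant", "tool_call": {"name": tool, "argument": arg}})
        messages.append({"role": "tool", "content": response})
    return {"messages": messages}
```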
+
+ The 120-count is smaller than the 473 training questions because SFT
+ samples a subset that exercises each database and difficulty bucket;
+ see `scripts/generate_sft_data.py` for the selection logic.
+
+ Why multi-turn matters: an earlier per-turn SFT (347 single-turn
+ examples) taught the model to always call `describe` and nothing else.
+ Multi-turn teaches the full `describe → query → answer` sequence. See
+ `docs/blog-material.md` — "Multi-Turn SFT — Why It's Critical".
+
+ ## How to regenerate from scratch
+
+ ```bash
+ # 1. Databases (SQLite files from the Spider mirror)
+ uv run python scripts/download_spider_databases.py --db-id all
+
+ # 2. Raw Spider questions (via HF Datasets)
+ uv run python scripts/download_spider_data.py --db-id all --split train
+ uv run python scripts/download_spider_data.py --db-id all --split validation
+
+ # 3. Curate into questions_train.json / questions_eval.json
+ uv run python scripts/curate_questions.py
+
+ # 4. Regenerate SFT trajectories from gold SQL
+ uv run python scripts/generate_sft_data.py
+ ```
+
+ You should not need to do this in normal operation — the curated files
+ are committed. Regenerate only when updating the database allowlist,
+ changing the answer-type taxonomy, or rerunning against a new Spider
+ snapshot.
+
+ ## What we deliberately do not use
+
+ - **BIRD** (Li et al., 2023) — larger, harder text-to-SQL benchmark. Out
+   of scope for a 0.6B model; revisit for a larger-model follow-up.
+ - **WikiSQL** — single-table only, doesn't exercise the multi-turn
+   exploration the environment is built for.
+ - **Synthetic LLM-generated questions** — we want Spider's human-written
+   prompts so eval results are comparable to published work.
+ - **Spider databases outside `db_list.json`** — kept out to keep the
+   repo small and training fast. Easy to widen by editing the list and
+   rerunning the regeneration pipeline.
+
+ ## References
+
+ - [Spider dataset (Yale LILY)](https://yale-lily.github.io/spider)
+ - [taoyds/spider GitHub mirror](https://github.com/taoyds/spider)
+ - [xlangai/spider on HuggingFace](https://huggingface.co/datasets/xlangai/spider)
+ - Yu et al. (2018). *Spider: A Large-Scale Human-Labeled Dataset for
+   Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.* EMNLP.
docs/delivery-specs/index.md ADDED
@@ -0,0 +1,66 @@
+ # Delivery Specs
+
+ This directory contains delivery specifications that define *how* to build features for engineering handoff.
+
+ ## What is a Delivery Spec?
+
+ Delivery specs translate validated ideas into engineering requirements. They answer:
+ - What are the functional requirements?
+ - What files need to be created/modified?
+ - What are the acceptance criteria?
+ - What agent instructions apply?
+
+ ## When to Use
+
+ | Complexity | Discovery? | Delivery Spec? | Notes |
+ |------------|------------|----------------|-------|
+ | **Simple** (CRUD) | Skip | Skip | Go directly to `autocode-implementation-planner` |
+ | **Standard** (feature) | Recommended | Optional | Discovery provides enough context |
+ | **Complex** (system) | Required | Required | Full pipeline for alignment |
+
+ ## File Format
+
+ Each feature has one file:
+ - `{feature}.md` — Structured delivery spec with requirements, file manifest, agent instructions
+
+ ## Delivery Specs Index
+
+ | Feature | Status | Date |
+ |---------|--------|------|
+ | *None yet* | | |
+
+ ## Relationship to Other Docs
+
+ ```
+ discovery/ → Validate + Taste (OST, PR/FAQ)
+
+ delivery-specs/ → Engineering handoff (YOU ARE HERE)
+
+ design-docs/ → Technical approach (ADR-style)
+
+ FEATURES.json → Links docs to implementation
+
+ specs/ → Implementation specs (autocode)
+ ```
+
+ ## Creating Delivery Specs
+
+ Use the `delivery-spec` skill for structured delivery specs:
+
+ ```
+ skill({ name: "delivery-spec" })
+ ```
+
+ The skill guides you through:
+ 1. **Problem context** — References discovery doc
+ 2. **Functional requirements** — What the system must do
+ 3. **File manifest** — Files to create/modify
+ 4. **Agent instructions** — Patterns, constraints, anti-patterns
+ 5. **Acceptance criteria** — Definition of done
+
+ ## Integration with Discovery
+
+ Delivery specs read from discovery JSON:
+ - Pull `taste.delights` as success criteria context
+ - Pull `taste.frustrations` as anti-patterns
+ - Pull `scope.out_of_scope` as explicit boundaries
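The read side can be sketched in a few lines. The field names below are the ones listed above; the exact discovery JSON schema is assumed:

```python
def spec_context(discovery):
    """Extract delivery-spec inputs from a discovery JSON dict (sketch;
    assumes the taste/scope field names listed above)."""
    taste = discovery.get("taste", {})
    scope = discovery.get("scope", {})
    return {
        "success_criteria": taste.get("delights", []),
        "anti_patterns": taste.get("frustrations", []),
        "out_of_scope": scope.get("out_of_scope", []),
    }
```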
docs/design-docs/core-beliefs.md ADDED
@@ -0,0 +1,61 @@
+ ---
+ title: Core Beliefs
+ description: Agent-first operating principles for the sql-env project
+ type: explanation
+ doc_type: explanation
+ ---
+
+ # Core Beliefs
+
+ Agent-first operating principles for this project.
+
+ ## Philosophy
+
+ > "Humans steer. Agents execute."
+
+ When something fails, the fix is never "try harder" — it's "what capability is missing?"
+
+ ## Principles
+
+ ### 1. Repository Knowledge is the System of Record
+
+ Anything an agent can't access in-context effectively doesn't exist. Knowledge that lives in Slack, Google Docs, or people's heads is invisible to the system.
+
+ **Implication:** Push context into the repo. Design decisions, product principles, and team conventions must be documented in versioned files.
+
+ ### 2. Enforce Invariants, Not Implementations
+
+ Mechanical constraints (lints, tests, type systems) apply everywhere at once. Narrative instructions degrade with context length.
+
+ **Implication:** When possible, encode rules as code (lints, schemas, types) rather than prose (AGENTS.md instructions).
+
+ ### 3. Design Before Implementation
+
+ Think about interfaces before implementation details. The spec-first workflow (Research — Sketch — Spec — Verify) produces deeper, more coherent modules than "prompt and iterate."
+
+ **Implication:** Use the autocode pipeline. Don't skip the research phase.
+
+ ### 4. Parse, Don't Validate
+
+ Prefer parsing data into precise types early rather than validating ad-hoc throughout the codebase. Invalid states should be unrepresentable.
+
+ **Reference:** [Parse, Don't Validate](https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/)
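An illustrative sketch of the principle (not the project's actual types): parse once at the boundary into a precise type, so nothing downstream needs to re-check.

```python
from dataclasses import dataclass

ALLOWED_DIFFICULTIES = ("easy", "medium", "hard")

@dataclass(frozen=True)
class Question:
    """Parsed at the boundary: past this point an invalid difficulty
    is unrepresentable."""
    question_text: str
    difficulty: str

def parse_question(raw: dict) -> Question:
    # Reject invalid input here, once, instead of validating everywhere.
    if raw.get("difficulty") not in ALLOWED_DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {raw.get('difficulty')!r}")
    return Question(raw["question_text"], raw["difficulty"])
```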
+
+ ### 5. Boring Technology is Agent-Friendly
+
+ Technologies with stable APIs, composability, and strong training-set representation are easier for agents to model.
+
+ **Implication:** Prefer well-established libraries over cutting-edge ones. Sometimes it's cheaper to reimplement a subset than to work around opaque behavior.
+
+ ### 6. Taste is Captured Once, Enforced Continuously
+
+ Human taste is fed back into the system through review comments, refactoring PRs, and bug fixes. Once captured, it applies to every future line of code.
+
+ **Implication:** Use the `compound-engineer` to extract learnings. Update lints when patterns emerge.
+
+ ## Anti-Patterns
+
+ - **Encyclopedia AGENTS.md:** When everything is "important," nothing is.
+ - **Shotgun Parsing:** Validation scattered throughout the codebase instead of at boundaries.
+ - **Try Harder Debugging:** Prompting differently instead of fixing the environment.
+ - **AI Slop Fridays:** Manual cleanup sessions instead of continuous garbage collection.
docs/design-docs/index.md CHANGED
@@ -18,7 +18,7 @@ Architectural Decision Records are stored in [decisions/](decisions/).
 
  | Feature | Status | Date | Reversibility |
  |---------|--------|------|---------------|
- | *None yet* | | | |
+ | [Reward Shaping Research](reward-shaping-research.md) | Active | 2026-03-29 | N/A (research) |
 
  ## Creating Design Docs
 
docs/design-docs/reward-shaping-research.md ADDED
@@ -0,0 +1,197 @@
+ ---
+ title: Reward Shaping Research
+ description: Theoretical basis for SQLEnv dense reward architecture, comparing cumulative-cap vs delta-based progress approaches grounded in potential-based shaping theory
+ doc_type: explanation
+ ---
+
+ # Reward Shaping Research
+
+ > Last updated: 2026-03-29
+
+ Research notes on dense reward design for SQLEnv. Documents the theoretical basis for our reward architecture, the problems with the original cumulative-cap design, and the rationale for switching to per-step clipping with delta-based progress.
+
+ ## Problem Statement
+
+ SQLEnv's original reward system used:
+ 1. **Cumulative tracking** with a hard cap at [-0.2, 0.5]
+ 2. **Improvement-only gating** (reward only when `binned_progress > best_progress`)
+
+ Both violate established RL theory and create practical training problems.
+
+ ## Potential-Based Reward Shaping (Ng et al., 1999)
+
+ **Paper:** Ng, A. Y., Harada, D., & Russell, S. (1999). "Policy invariance under reward transformations: Theory and application to reward shaping." ICML.
+
+ **Core theorem:** Given an MDP M with reward R, define the shaped reward R' = R + F. The optimal policy is preserved **if and only if** F has the form:
+
+ ```
+ F(s, a, s') = γ · Φ(s') − Φ(s)
+ ```
+
+ where Φ: S → ℝ is a potential function and γ is the discount factor.
+
+ **Why this matters for SQLEnv:**
+ - Cumulative capping is NOT potential-based (the shaping reward depends on trajectory history, not just state transitions)
+ - Non-potential-based shaping changes the optimal policy in unpredictable ways
+ - Agents may optimize for shaped reward rather than task completion
+
+ **Delta-from-previous IS potential-based** with γ=1:
+ ```
+ F(s, s') = Φ(s') − Φ(s) where Φ(s) = binned_progress(s)
+ ```
+ This form provably preserves the optimal policy.
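With γ=1 the delta terms telescope: summed over an episode they contribute exactly Φ(s_T) − Φ(s_0), so wandering back and forth earns nothing net. A minimal sketch, with Φ as the binned progress:

```python
def shaping_rewards(potentials, gamma=1.0):
    """Potential-based shaping terms F = γ·Φ(s') − Φ(s) along an episode.

    With gamma=1 the terms telescope: their sum is Φ(s_T) − Φ(s_0),
    so only net progress pays, never back-and-forth farming.
    """
    return [gamma * nxt - prev for prev, nxt in zip(potentials, potentials[1:])]
```

For example, the binned-progress sequence 0.0 → 0.25 → 0.75 → 0.5 → 1.0 yields deltas that sum to exactly 1.0, the net progress.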
+
+ ## Why Cumulative Caps Are Harmful
+
+ ### The POMDP problem
+ The cumulative reward counter is part of the environment's hidden state. The agent cannot observe it. This means:
+ - The same (observation, action) pair can yield different rewards depending on hidden history
+ - The agent cannot learn when the shaping signal will stop
+ - Credit assignment breaks: "low reward because bad action" vs "low reward because cap hit"
+
+ ### Early saturation
+ Once cumulative reward hits 0.5, all subsequent steps return zero shaping. For a 15-step episode, if the cap hits at step 5, steps 6-14 carry no learning signal. The agent receives no gradient for more than half the episode.
+
+ ### The cap was redundant
+ With a 15-step budget and -0.005 step cost:
+ - Max possible L1 reward per step: +0.025 (exec_ok + new_info - step_cost)
+ - Max over 15 steps: 0.375
+ - Realistic total (mixed actions): ~0.15-0.25
+ - Terminal reward: 1.0
+
+ The 4-7x ratio between terminal and exploration reward makes farming exploration irrational without any cap.
+
+ ## Why Improvement-Only Gating Blocks Learning
+
+ ### No recovery signal
+ If the agent achieves progress 0.75 on step 3, then regresses to 0.25 on step 4, then recovers to 0.75 on step 5:
+ - **Old design:** Steps 4 and 5 both get zero reward (0.25 < 0.75, 0.75 ≤ 0.75)
+ - **Delta design:** Step 4 gets -0.075 (regression), step 5 gets +0.075 (recovery)
+
+ The delta design gives the agent information about what happened. The old design is silent.
+
+ ### Discourages experimentation
+ With improvement-only gating, once an agent achieves good progress, any experimental query that might regress temporarily is pure risk (no upside if it doesn't exceed the best). This discourages the kind of exploratory behavior the environment is designed to train.
+
+ ## Current Design: Per-Step Clipping + Delta Progress
+
+ ### Per-step reward structure
+ ```
+ L1 Operational (every step):
+   +0.02  exec_ok (no error)
+   +0.01  new_info (unique SQL hash)
+   -0.01  repeat penalty (same SQL)
+   -0.005 step cost
+
+ L2 Progress (QUERY only):
+   Weighted score: cardinality (25%) + value overlap (50%) + numeric range (25%)
+   Binned to {0, 0.25, 0.5, 0.75, 1.0}
+   Delta = binned - previous_progress
+   Reward = delta * 0.15
+
+ L3 Terminal (ANSWER only):
+   +1.0 correct, 0.0 wrong
+
+ Per-step clip: [-0.05, 0.15]
+ No cumulative tracking.
+ ```
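One non-terminal step of that structure can be sketched as below. The constants mirror the doc, not the exact `sql_env` implementation, and the clip is assumed to apply to the shaping layers only, since the terminal +1.0 exceeds it:

```python
def step_reward(exec_ok, is_new_sql, binned_progress, prev_progress):
    """One non-terminal step of the layered reward (sketch, not the
    exact sql_env code)."""
    r = -0.005                                     # step cost, every step
    if exec_ok:
        r += 0.02                                  # L1: query executed cleanly
    r += 0.01 if is_new_sql else -0.01             # L1: novelty vs repeat penalty
    r += (binned_progress - prev_progress) * 0.15  # L2: delta progress
    return max(-0.05, min(0.15, r))                # per-step clip
```

A clean new query that lifts binned progress from 0.0 to 0.25 lands well inside the clip; a full 0→1 jump saturates it at 0.15.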
+
+ ### Anti-farming mechanisms
+ | Mechanism | What it prevents |
+ |-----------|-----------------|
+ | Hard budget (15 steps) | Infinite exploration |
+ | Step cost (-0.005) | Idle steps |
+ | Repeat penalty (-0.01) | Query farming |
+ | Terminal dominance (1.0 vs ~0.3 max) | Exploration over answering |
+ | Per-step clip (0.15 max) | Single-step reward spikes |
+
+ ## Comparison of Progress Approaches
+
+ | Approach | Recovery signal | Farming risk | Theory |
+ |----------|----------------|-------------|--------|
+ | Improvement-only (old) | None | None | No formal guarantee |
+ | Absolute quality each step | Yes | High (repeat good queries) | None |
+ | Delta from previous step | Yes | Low | Potential-based (Ng 1999), provably policy-invariant |
+
+ ## GRPO Integration
+
+ **GRPO was designed for episode-level rewards** (DeepSeek-R1). Dense per-step rewards are aggregated to a single episode scalar for GRPO's advantage computation.
+
+ **"GRPO is Secretly a Process Reward Model"** (Sullivan et al., 2025/2026, ICLR 2026) proved that GRPO implicitly performs process-level credit assignment when completions share prefixes. They identified a flaw (non-uniform step distribution) and proposed lambda-GRPO.
+
+ **For SQLEnv:** Dense rewards shape rollout behavior within each episode, but are aggregated to episode level for GRPO. Weight terminal correctness heavily: ~1.0 correctness + 0.3 progress + 0.1 operational.
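That aggregation can be sketched as a weighted collapse of the per-layer sums into the single scalar GRPO consumes. The ~1.0/0.3/0.1 weighting follows the note above; the per-layer decomposition is an assumption about how the sums are kept:

```python
def episode_scalar(l1_rewards, l2_rewards, answered_correctly,
                   w_operational=0.1, w_progress=0.3, w_correct=1.0):
    """Collapse dense per-step rewards into one episode scalar for
    GRPO's advantage computation (sketch, not the exact sql_env code)."""
    return (w_correct * (1.0 if answered_correctly else 0.0)   # terminal dominates
            + w_progress * sum(l2_rewards)                     # L2 progress deltas
            + w_operational * sum(l1_rewards))                 # L1 operational signal
```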
+
+ **Relevant validation:**
+ - **TIPS** (March 2026): potential-based turn-level shaping for multi-turn LLM agents, 11.8% EM improvement over PPO/GRPO baselines
+ - **ToolRL** (2025): finer-grained reward decomposition leads to 17% improvement over base models with GRPO
+ - **StepTool** (2024): step-grained reward shaping significantly outperformed outcome-only for tool learning
+
+ ## Future Directions
+
+ 1. **Diminishing novelty bonuses:** `reward = 0.01 / (1 + 0.5 * exploration_count)` instead of a flat +0.01 per unique query. Classic count-based exploration (Bellemare et al. 2016, Never Give Up) naturally tapers.
+
+ 2. **Curriculum on step budget:** Start with a generous budget (20 steps) for easy questions, tighten to 10 for hard ones as training progresses.
+
+ 3. **Per-layer independent clipping:** Clip L1 and L2 separately rather than their sum, preventing one layer from consuming the other's budget.
+
+ 4. **Lambda-GRPO:** Apply Sullivan et al.'s fix for non-uniform step distribution to improve credit assignment across steps.
+
+ 5. **Adaptive Length Penalty (ALP):** From "Just Enough Thinking" (2026): per-prompt length penalties based on solve rate. Could adapt the step budget per difficulty level.
+
+ ## Why Result-Based, Not SQL-Structure-Based
+
+ A natural question: why compare query *results* to the gold results, rather than comparing the SQL *structure* to the gold SQL?
+
+ ### The equivalence problem
+
+ Multiple SQL queries produce identical results:
+
+ ```sql
+ SELECT name FROM employees WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC')
+ SELECT e.name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE d.location = 'NYC'
+ SELECT name FROM employees WHERE EXISTS (SELECT 1 FROM departments WHERE id = employees.dept_id AND location = 'NYC')
+ ```
+
+ Rewarding structural similarity to one gold query penalizes valid alternatives. This creates false-negative gradient signal that hurts training.
+
+ ### The field moved away from structural comparison
+
+ Spider (Yu et al., 2018) used exact set match (decomposing SQL into component sets). BIRD (Li et al., 2023) replaced it with execution accuracy, explicitly arguing that "exact match is too strict and penalizes valid alternative SQL formulations." Every recent system (DAIL-SQL, MAC-SQL, CHESS) evaluates on execution accuracy.
+
+ ### Intermediate queries aren't meant to look like the gold
+
+ In our POMDP, the agent runs exploratory queries (`SELECT * FROM t LIMIT 5`, `SELECT COUNT(*)`) to gather information. These should look nothing like the gold query. Rewarding structural similarity would push the agent toward exploitation before it has explored enough.
+
+ ### Result comparison is the right signal
+
+ | Dimension | Result-based | SQL-structure-based |
+ |-----------|-------------|---------------------|
+ | Handles SQL equivalence | Yes | No |
+ | Correlates with true objective | Directly | Indirectly (proxy) |
+ | Works for exploratory queries | Yes | No (penalizes exploration) |
+ | Literature support | Strong (BIRD, CodeRL, LEVER) | Declining (Spider exact match being replaced) |
+
+ ### What about SQL validity rewards?
+
+ One structural signal IS worth using: penalizing queries that fail to execute (syntax errors, missing tables/columns). This is not SQL similarity — it's SQL validity. We already do this via L1 operational rewards: exec_ok (+0.02) vs error (-0.005 step cost only). This accelerates learning without biasing toward a specific solution path.
+
+ ## References
+
+ - Ng, Harada, Russell (1999). [Policy invariance under reward transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
+ - Sullivan et al. (2025/2026). [GRPO is Secretly a Process Reward Model](https://arxiv.org/abs/2509.21154). ICLR 2026.
+ - Shao et al. (2024). [DeepSeek-Math: GRPO](https://arxiv.org/abs/2402.03300).
+ - DeepSeek-AI (2025). [DeepSeek-R1](https://arxiv.org/pdf/2501.12948).
+ - TIPS (2026). [Turn-Level Information-Potential Reward Shaping](https://arxiv.org/abs/2603.22293).
+ - ToolRL (2025). [Reward is All Tool Learning Needs](https://arxiv.org/html/2504.13958v1).
+ - StepTool (2024). [Step-grained RL for Tool Learning](https://arxiv.org/abs/2410.07745).
+ - RAGEN (2025). [Multi-Turn RL for LLM Agents](https://arxiv.org/abs/2504.20073).
+ - Bellemare et al. (2016). [Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868).
+ - Just Enough Thinking (2026). [Adaptive Length Penalties](https://arxiv.org/html/2506.05256v1).
+ - Fireworks AI. [Best Practices for Multi-Turn RL](https://fireworks.ai/blog/best-practices-for-multi-turn-RL).
+ - Wiewiora, Cottrell, Elkan (2003). Principled Methods for Advising Reinforcement Learning Agents.
+ - Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing](https://arxiv.org/abs/1809.08887).
+ - Li et al. (2023). [Can LLM Already Serve as a Database Interface? (BIRD)](https://arxiv.org/abs/2305.03111).
+ - Zhong et al. (2020). [Semantic Evaluation for Text-to-SQL with Distilled Test Suites](https://arxiv.org/abs/2010.02840).
+ - Le et al. (2022). [CodeRL: Mastering Code Generation through Pretrained Models and Deep RL](https://arxiv.org/abs/2207.01780).
+ - Lightman et al. (2023). [Let's Verify Step by Step](https://arxiv.org/abs/2305.20050).
+ - Ni et al. (2023). [LEVER: Learning to Verify Language-to-Code Generation with Execution](https://arxiv.org/abs/2302.08468).
docs/discovery/.gitkeep ADDED
File without changes
docs/discovery/index.md ADDED
@@ -0,0 +1,63 @@
+ # Discovery
+
+ This directory contains discovery artifacts that validate *what* to build and capture *taste*.
+
+ ## What is Discovery?
+
+ Discovery validates product ideas and captures user taste before detailed planning. It answers:
+ - What opportunity are we pursuing? (Opportunity Solution Tree)
+ - What problem are we solving? (PR/FAQ)
+ - What would DELIGHT users? (Taste capture)
+ - What would FRUSTRATE users? (Anti-patterns)
+ - What FEELING should users have? (North star)
+ - How does this compare to alternatives? (RICE/ICE prioritization)
+
+ ## File Format
+
+ Each feature has two files:
+ - `{feature}.md` — Human-readable discovery doc with OST, PR/FAQ, taste
+ - `{feature}.json` — Machine-readable taste data for downstream skills
+
+ ## Discovery Index
+
+ | Feature | Status | Date |
+ |---------|--------|------|
+ | *None yet* | | |
+
+ ## Relationship to Other Docs
+
+ ```
+ discovery/ → Validate + Taste (OST, PR/FAQ)
+
+ delivery-specs/ → Engineering handoff (functional requirements)
+
+ design-docs/ → Technical approach (ADR-style)
+
+ FEATURES.json → Links docs to implementation
+
+ specs/ → Implementation specs (autocode)
+
+ learnings/ → Knowledge extracted after completion
+ ```
+
+ ## Creating Discovery Docs
+
+ Use the `discovery` skill for structured discovery:
+
+ ```
+ skill({ name: "discovery" })
+ ```
+
+ The skill guides you through:
+ 1. **Opportunity Solution Tree** — Outcome, opportunities, solutions
+ 2. **PR/FAQ** — Headline, problem, solution, customer quote
+ 3. **Taste interview** — Delights, frustrations, feeling, maturity
+ 4. **Prioritization** — RICE/ICE scoring
+
+ ## Integration with Autocode
+
+ The `autocode-implementation-planner` skill automatically reads the JSON file:
+ - `taste.delights` → Success criteria in implementation spec
+ - `taste.frustrations` → Anti-patterns to explicitly avoid
+ - `taste.feeling` → North star for implementation decisions
+ - `scope.out_of_scope` → Boundaries the implementation must respect
docs/exec-plans/README.md ADDED
@@ -0,0 +1,41 @@
+ # Execution Plans
+
+ This directory contains execution plans for complex, multi-step work.
+
+ ## What is an Execution Plan?
+
+ An execution plan (ExecPlan) is a living document that tracks progress on significant work. Unlike implementation specs (which define *what* to build), exec plans track *how* the work is progressing.
+
+ ## Directory Structure
+
+ ```
+ exec-plans/
+ ├── active/               # Currently in progress
+ ├── completed/            # Finished work (for reference)
+ ├── tech-debt-tracker.md  # Known technical debt
+ └── README.md             # This file
+ ```
+
+ ## When to Use
+
+ Use an execution plan when:
+ - Work spans multiple sessions or days
+ - Multiple steps have dependencies
+ - Progress needs to be visible to humans
+ - Work involves significant research or discovery
+
+ Do NOT use for:
+ - Simple, single-session features
+ - Bug fixes
+ - Routine changes
+
+ ## ExecPlan Format
+
+ See the [OpenAI Exec Plans cookbook](https://developers.openai.com/cookbook/articles/codex_exec_plans) for the full specification.
+
+ Key sections:
+ - **Purpose / Big Picture:** What someone gains after this change
+ - **Progress:** Checklist with timestamps
+ - **Surprises & Discoveries:** Unexpected findings
+ - **Decision Log:** Why choices were made
+ - **Outcomes & Retrospective:** Summary at completion
docs/exec-plans/active/.gitkeep ADDED
File without changes
docs/exec-plans/completed/.gitkeep ADDED
File without changes
docs/exec-plans/tech-debt-tracker.md ADDED
@@ -0,0 +1,37 @@
+ # Technical Debt Tracker
+
+ Known technical debt and cleanup opportunities.
+
+ ## Active Debt
+
+ | Item | Type | Severity | Origin | Notes |
+ |------|------|----------|--------|-------|
+ | `huggingface_hub`/`transformers` version mismatch breaks test collection (`is_offline_mode` import error) in training/error-handling suites | Infra | D (Blocking) | F014 verification run | Blocks full-suite verification and should be resolved with a pinned compatible dependency set for training-related extras. |
+
+ ## Types
+
+ - **Code:** Shortcuts, duplication, missing abstractions
+ - **Tests:** Missing coverage, flaky tests, slow tests
+ - **Docs:** Stale documentation, missing docs
+ - **Infra:** Build/deploy issues, tooling gaps
+ - **Architecture:** Layer violations, wrong boundaries
+
+ ## Severity
+
+ - **D (Blocking):** Must fix before next release
+ - **C (Needs Refactor):** Should fix soon, causing friction
+ - **B (Minor):** Would be nice to fix
+ - **A (Clean):** No action needed
+
+ ## Process
+
+ 1. `/techdebt` command scans for issues and updates this file
+ 2. `compound-engineer` may add items from feature work
+ 3. `what-how-alignment` skill consolidates into refactor proposals
+ 4. Items graduate to `SUGGESTED_REFACTORS.md` when ready for implementation
+
+ ## Resolved Debt
+
+ | Item | Resolution | Date |
+ |------|------------|------|
+ | *None yet* | | |
docs/exploration/README.md ADDED
@@ -0,0 +1,44 @@
+ # Exploration
+
+ Ideas, technology research, and ad-hoc investigation notes. This is a scratchpad -- content here is not system-of-record.
+
+ **Diataxis type:** Exploration (learning-oriented, not yet distilled)
+
+ ## What Goes Here
+
+ - Technology evaluations and comparisons
+ - Prototype findings
+ - External API exploration
+ - Performance investigations
+ - Ideation and backlog notes
+
+ ## What Does NOT Go Here
+
+ - Durable learnings (go to `docs/learnings/`)
+ - Design decisions (go to `docs/design-docs/`)
+ - Implementation specs (go to `specs/`)
+ - Operational how-to guides (go to `docs/guides/`)
+
+ ## Exploration Index
+
+ | Topic | Type | Date | Summary |
+ |-------|------|------|---------|
+ | [grpo-collapse-analysis.md](grpo-collapse-analysis.md) | Investigation | 2026-04 | Post-mortem on Qwen3-1.7B GRPO collapse into degenerate null-argument tool calls |
+ | [grpo-plateau-plan.md](grpo-plateau-plan.md) | Investigation | 2026-04 | Interventions to push past 30-40% accuracy plateau in GRPO training |
+ | [grpo-training-session-log.md](grpo-training-session-log.md) | Investigation | 2026-04 | Running log of SFT warmup + GRPO training sessions on Colab L4 |
+ | [rl-vs-icl-research.md](rl-vs-icl-research.md) | Comparison | 2026-04 | When GRPO training adds value over pure prompting for small SQL agents |
+ | [train-grpo-walkthrough.md](train-grpo-walkthrough.md) | Prototype | 2026-04 | Step-by-step companion guide for train_grpo.ipynb |
+
+ ## Types
+
+ - **Tech Eval:** Evaluating a library, framework, or service
+ - **Prototype:** Findings from exploratory prototyping
+ - **Investigation:** Deep dive into a specific problem
+ - **Comparison:** Side-by-side analysis of options
+
+ ## Graduating Content
+
+ When exploration produces durable insights:
+ 1. Extract patterns to `docs/learnings/<category>.md`
+ 2. Create reference files in `docs/references/` for agent context
+ 3. Create how-to guides in `docs/guides/` for operational procedures
docs/exploration/f007-prelaunch-checklist.md ADDED
@@ -0,0 +1,455 @@
+ # F007 Pre-Launch Checklist (temp, 2026-04-12)
+
+ Scope: verify the HF Space deployment is real and usable **before** the blog
+ post goes live today. Delete this file after launch.
+
+ ---
+
+ ## TL;DR — what to do in the next ~60 min
+
+ | # | Action | Time | Value | Do it? |
+ |---|---|---|---|---|
+ | 1 | Open Space in browser, confirm it loads | 2 min | **Critical** — judges will click the link first | **YES** |
+ | 2 | Hit `/health` and `/docs` | 1 min | **Critical** — proves server is up | **YES** |
+ | 3 | Run one full episode via `/web` UI | 5 min | **Critical** — proves action space works end-to-end | **YES** |
+ | 4 | Fix stale `docs/competition-deliverables.md` status | 3 min | **High** — doc claims "Not started", Space is live | **YES** |
+ | 5 | Python client smoke test against live Space | 10 min | **High** — proves programmatic access (the thing the blog promises) | **YES** |
+ | 6 | Pull `registry.hf.space/...` Docker image and run locally | 10 min | Medium — nice to have, judges rarely do this | If time |
+ | 7 | `pip install` from Space URL | 5 min | Medium — validates `pyproject.toml` inside Space | If time |
+ | 8 | Concurrency audit (`SUPPORTS_CONCURRENT_SESSIONS`) | 15 min | **Low for launch, High for anyone retraining** | Skip today, file issue |
+ | 9 | TRL `environment_factory` wrapper | — | — | **Already done** (see below) |
+
+ **Recommendation:** Do 1–5 before publishing. Skip 6–8. Item 9 is already in
+ the repo.
+
+ ---
+
+ ## About TRL (already integrated — do not re-research)
+
+ **TRL** = Hugging Face's `transformers`-based RL library. Its `GRPOTrainer`
+ accepts an `environment_factory=MyEnvClass` argument and runs the multi-turn
+ tool-calling loop automatically: generate → parse tool call → call your env →
+ feed result back → repeat. No custom `rollout_func` needed.
+
+ **We already implement this.** `training/trl_adapter.py::SQLEnvTRL` is a
+ TRL-native environment class with:
+
+ - `reset(**kwargs)` — reads `question_text` from the dataset column to route
+   to the correct database
+ - Named tool methods with docstrings: `describe(table_name)`, `sample(table_name)`,
+   `query(sql)`, `answer(value)` — not a generic `step()`
+ - `sql_env_reward_func` as the reward function
+
+ `notebooks/train_grpo.ipynb` cell 16 passes it directly:
+
+ ```python
+ trainer = build_trainer(
+     ...
+     reward_funcs=[sql_env_reward_func],
+     environment_factory=SQLEnvTRL,
+     ...
+ )
+ ```
+
+ The Setup cell pins `trl>=0.29.0` and `transformers` from `main` specifically
+ because `environment_factory` requires transformers ≥5.2. Our v1/v2 runs used
+ this path.
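The shape of such a tool-method environment can be sketched without TRL installed. This is a hypothetical, in-memory stand-in (class name, toy table, and reward rule are all illustrative), not the real `training/trl_adapter.py`, which loads the Spider databases:

```python
import sqlite3


class SQLEnvSketch:
    """Illustrative TRL-native tool environment: named tool methods
    instead of a generic step(). Hypothetical stand-in for SQLEnvTRL."""

    def __init__(self):
        # Toy in-memory database standing in for the real Spider SQLite files.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE concert (id INTEGER, name TEXT)")
        self.conn.execute("INSERT INTO concert VALUES (1, 'Opening Night')")
        self.done = False
        self.reward = 0.0

    def reset(self, **kwargs):
        # The real adapter reads question_text from the dataset row here.
        self.done = False
        self.reward = 0.0
        return "Question: how many concerts are there?"

    def describe(self, table_name: str) -> str:
        rows = self.conn.execute(f"PRAGMA table_info({table_name})").fetchall()
        return str([(r[1], r[2]) for r in rows])  # (column name, type) pairs

    def query(self, sql: str) -> str:
        return str(self.conn.execute(sql).fetchall())

    def answer(self, value: str) -> str:
        # Terminal action: toy exact-match reward.
        self.done = True
        self.reward = 1.0 if value.strip() == "1" else 0.0
        return "episode finished"


env = SQLEnvSketch()
print(env.reset())
print(env.describe("concert"))
print(env.query("SELECT COUNT(*) FROM concert"))  # [(1,)]
print(env.answer("1"), env.reward)
```

The trainer introspects the named methods and their docstrings to build the tool schema the model sees, which is why the adapter exposes `describe`/`sample`/`query`/`answer` rather than one opaque `step()`.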
+
+ **One nuance (intentional design):** `SQLEnvTRL.__init__` instantiates a
+ **local in-process `SQLEnvironment`**, not a WebSocket client to
+ `https://hjerpe-sql-env.hf.space`. Reasons:
+
+ - Training opens N parallel sessions (one per generation). The hosted Space
+   defaults to 1 concurrent session — see `SUPPORTS_CONCURRENT_SESSIONS` in
+   the TRL↔OpenEnv docs.
+ - Local is faster, no network hops, no rate limits.
+ - The hosted Space is for judges (clicking `/web`) and external users
+   consuming the env via pip/Docker. Training correctly bypasses it.
+
+ **Implication for the blog:** you can claim "TRL-native integration via
+ `environment_factory`" factually. It's already true and the notebook proves it.
+
+ **What's still on the post-launch list** is the Space-side concurrency config
+ (item 8), not the adapter. Without `SUPPORTS_CONCURRENT_SESSIONS=True` on the
+ server, an external user trying to retrain against the hosted Space would hit
+ the 1-session cap. This does not affect our own training (we use local).
+
+ ---
+
+ ## 1. Browser smoke test (2 min) — CRITICAL
+
+ **How:**
+ ```bash
+ open https://huggingface.co/spaces/hjerpe/sql_env
+ ```
+ (or paste the URL into a browser manually)
+
+ **What to check:**
+ - [ ] Space status is **Running** (green), not Building / Sleeping / Error
+ - [ ] README renders with a clear one-liner of what the env does
+ - [ ] No red error banner at the top
+
+ **What you're validating:** that HF Spaces successfully built our image and
+ the container is alive. If it's sleeping, the first visit wakes it (~30s
+ cold start). Warm it up now and leave the tab open so blog readers don't hit
+ a cold Space.
+
+ **If it's broken:** open the "Logs" tab on the Space page → look for the
+ Docker build error → fix locally → re-push with `uv run openenv push`.
+
+ ---
+
+ ## 2. Health + API docs (1 min) — CRITICAL
+
+ **How:**
+ ```bash
+ curl -sS https://hjerpe-sql-env.hf.space/health
+ curl -sS https://hjerpe-sql-env.hf.space/docs | head -20  # should be HTML
+ open https://hjerpe-sql-env.hf.space/docs                 # visual check
+ ```
+
+ **What to check:**
+ - [ ] `/health` returns HTTP 200 and a JSON body (e.g. `{"status":"ok"}`)
+ - [ ] `/docs` Swagger page lists `/reset`, `/step`, `/ws` endpoints
+ - [ ] `/step` request schema mentions our `SQLAction` fields (`action_type`,
+   `argument`) with values `DESCRIBE`, `SAMPLE`, `QUERY`, `ANSWER`
+
+ **What you're validating:** the FastAPI server inside the container is up
+ and the OpenAPI schema published to the Space matches our local `SQLAction`
+ model. If schemas drift, clients break.
+
+ **If it's broken:** usually means the Dockerfile picked up a stale version
+ of `sql_env/models.py`. Rebuild and push: `uv run openenv build -t …` then
+ `uv run openenv push`.
+
+ ---
+
+ ## 3. One full episode via the built-in web UI (5 min) — CRITICAL
+
+ **How:**
+ ```bash
+ open https://hjerpe-sql-env.hf.space/web
+ ```
+
+ OpenEnv ships a `/web` interactive UI on every env. Walk through one full
+ episode:
+
+ 1. Click **Reset** — a schema hint + question prompt should appear
+ 2. Enter action `DESCRIBE` with argument = a table name from the reset output
+ 3. Enter action `SAMPLE` with a table name (confirm 5 sample rows come back)
+ 4. Enter action `QUERY` with a valid `SELECT ...` (confirm rows return)
+ 5. Enter action `ANSWER` with your final answer (confirm reward + `done=true`)
+
+ **What to check:**
+ - [ ] Each step returns a new observation without error
+ - [ ] Terminal `ANSWER` produces a reward (even 0.0 is fine — we're testing
+   plumbing, not correctness)
+ - [ ] Screenshot the final screen — free blog content
+
+ **What you're validating:** the end-to-end action space a judge will
+ exercise. This is the judges' happy path.
+
+ **If any step errors, do not publish the blog until it's fixed.**
+
+ ---
+
+ ## 4. Fix stale deliverables doc (3 min) — HIGH
+
+ `docs/competition-deliverables.md` line 30 says:
+
+ > **Status:** Not started (no Dockerfile yet)
+
+ This is wrong. F007 demo (`specs/F007-DEMO.md`) shows a successful authenticated
+ push to `https://huggingface.co/spaces/hjerpe/sql_env` on 2026-03-29. Update to:
+
+ > **Status:** Live at https://huggingface.co/spaces/hjerpe/sql_env — manual
+ > episode flow verified 2026-04-12.
+
+ Also update the open items list at the bottom — "Deploy HuggingFace Space"
+ should be checked off.
+
+ ---
+
+ ## 5. Python client smoke test (10 min) — HIGH
+
+ **How:**
+
+ First find the actual client class name and action constructor args — our
+ client module may not match the generic OpenEnv template:
+ ```bash
+ rg -n "class SQLEnv\b|base_url" -g '!**/tests/**' -g '!**/docs/**' .
+ rg -n "class SQLAction|action_type|argument" sql_env/models.py
+ ```
+
+ Then create a throwaway script `scratch_hf_smoke.py` in the repo root:
+
+ ```python
+ from sql_env.client import SQLEnv  # adjust after grep above
+ from sql_env.models import SQLAction
+
+ URL = "https://hjerpe-sql-env.hf.space"
+
+ with SQLEnv(base_url=URL).sync() as env:
+     r = env.reset()
+     print("RESET:", r.observation)
+
+     # Pick any table name from the schema hint in r.observation
+     r = env.step(SQLAction(action_type="DESCRIBE", argument="<table>"))
+     print("DESCRIBE:", r.observation)
+
+     r = env.step(SQLAction(action_type="QUERY", argument="SELECT 1"))
+     print("QUERY:", r.observation)
+
+     r = env.step(SQLAction(action_type="ANSWER", argument="<answer>"))
+     print("ANSWER reward=", r.reward, "done=", r.done)
+ ```
+
+ Run:
+ ```bash
+ uv run python scratch_hf_smoke.py
+ ```
+
+ **What to check:**
+ - [ ] No connection / WebSocket handshake errors
+ - [ ] `r.observation` is a populated dict/string at each step
+ - [ ] Final step: `r.reward` is a float and `r.done == True`
+ - [ ] Delete `scratch_hf_smoke.py` after — do not commit it
+
+ **What you're validating:** that a blog reader copy-pasting our snippet
+ against the live Space actually gets a working client. If this fails and
+ the `/web` UI (step 3) works, the problem is likely client-side model
+ drift — check that our shipped `sql_env/models.py` matches what the server
+ inside the Space expects.
+
+ ---
+
+ ## 6. Docker image pull (10 min) — IF TIME
+
+ This is the pattern every OpenEnv env on the hub ships. It's how external
+ users run our env locally for training (no rate limits, full concurrency).
+
+ **How — option A: pull the pre-built image from HF registry**
+ ```bash
+ docker pull --platform linux/amd64 registry.hf.space/hjerpe-sql_env:latest
+ docker run -d --name sqlenv-smoke -p 8001:8000 --platform linux/amd64 \
+   registry.hf.space/hjerpe-sql_env:latest
+
+ # Wait ~5s for uvicorn to boot
+ sleep 5
+ curl -sS http://127.0.0.1:8001/health
+ open http://127.0.0.1:8001/docs
+
+ # Clean up
+ docker stop sqlenv-smoke && docker rm sqlenv-smoke
+ ```
+
+ **How — option B: rebuild locally from our repo (same image the Space runs)**
+ ```bash
+ uv run openenv validate --verbose   # dry-run config check
+ uv run openenv build -t openenv-sql-env:local
+ docker run -d --name sqlenv-local -p 8001:8000 openenv-sql-env:local
+ curl -sS http://127.0.0.1:8001/health
+ docker stop sqlenv-local && docker rm sqlenv-local
+ ```
+
+ **What to check:**
+ - [ ] Image pulls without auth (Space is public)
+ - [ ] Container starts, `/health` returns 200
+ - [ ] `/docs` renders Swagger on localhost
+ - [ ] No `--platform` warnings on Apple Silicon (the Space is `linux/amd64`,
+   which runs under Rosetta on M-series Macs — slow but functional)
+
+ **What you're validating:** the reproducibility story. A broken image here
+ means the blog's "clone and train" path is dead. Judges rarely click this,
+ but any serious user will.
+
+ ---
+
+ ## 7. pip install from Space (5 min) — IF TIME
+
+ **How:**
+
+ First check the package name declared inside the pushed Space:
+ ```bash
+ curl -sS https://huggingface.co/spaces/hjerpe/sql_env/raw/main/pyproject.toml \
+   | grep -E '^name'
+ ```
+
+ Then install it into a throwaway venv:
+ ```bash
+ uv venv /tmp/sqlenv-pip-test
+ source /tmp/sqlenv-pip-test/bin/activate
+
+ # Replace "openenv-sql-env" with whatever name the pyproject.toml above shows
+ pip install "openenv-sql-env @ git+https://huggingface.co/spaces/hjerpe/sql_env"
+
+ python -c "from sql_env.client import SQLEnv; print('OK:', SQLEnv)"
+
+ deactivate && rm -rf /tmp/sqlenv-pip-test
+ ```
+
+ **What to check:**
+ - [ ] `pip install` resolves without dependency errors
+ - [ ] The client class imports from the installed wheel
+
+ **What you're validating:** the `pyproject.toml` we pushed into the Space
+ actually declares the package correctly. This is the install method TRL
+ documents: `pip install "<pkg> @ git+https://huggingface.co/spaces/<space>"`.
+
+ ---
+
+ ## 8. Concurrency audit — POST-LAUNCH
+
+ **How:**
+ ```bash
+ rg -n "SUPPORTS_CONCURRENT_SESSIONS|max_concurrent_envs|create_app\(" sql_env/server/
+ ```
+
+ Expected result **today:** no matches (the flag is not set), which means the
+ Space defaults to **1 concurrent WebSocket session**. Per the OpenEnv↔TRL
+ docs, any training run with `num_generations > 1` against the hosted Space
+ will hit capacity errors.
+
+ **Fix (post-launch):** in `sql_env/server/app.py` (or wherever
+ `create_app(...)` is called):
+ ```python
+ SUPPORTS_CONCURRENT_SESSIONS = True
+
+ app = create_app(
+     create_sql_environment,
+     SQLAction,
+     SQLObservation,
+     max_concurrent_envs=64,  # ≥ TRL's generation_batch_size
+ )
+ ```
+ Then `uv run openenv build -t ...` and `uv run openenv push` again.
+
+ **Why it's not a launch blocker:** the blog does not ask readers to train
+ against the hosted Space. Our own training uses the in-process
+ `SQLEnvironment` via `SQLEnvTRL` (not the WebSocket client), so we never hit
+ this limit. It only matters if an external user wants to run `GRPOTrainer`
+ against `https://hjerpe-sql-env.hf.space` directly. File as a GitHub issue
+ after the blog ships.
+
+ ---
+
+ ## 9. TRL `environment_factory` wrapper — DONE
+
+ Already implemented in `training/trl_adapter.py::SQLEnvTRL` and wired into
+ `notebooks/train_grpo.ipynb` cell 16. See the TRL section at the top of this
+ document for details. No action.
+
+ ---
+
+ ## Appendix A: Republish the Space from scratch (reference)
+
+ Only run these if steps 1–3 show the Space is broken and a rebuild+push is
+ needed. Otherwise skip — the current Space is already live.
+
+ **Prereqs (one-time):**
+ ```bash
+ uv sync        # project deps
+ hf auth login  # HuggingFace CLI auth
+ # (token with write access to hjerpe/sql_env)
+ ```
+
+ **Validate + build + push:**
+ ```bash
+ # 1. Dry-run config check — confirms the openenv manifest, Dockerfile
+ #    and server entrypoint agree
+ uv run openenv validate --verbose
+
+ # 2. Build the Docker image locally (same image HF Spaces will run)
+ uv run openenv build -t openenv-sql-env:local
+
+ # 3. Optional: smoke-test the local image before pushing
+ docker run -d --name sqlenv-local -p 8001:8000 openenv-sql-env:local
+ curl -sS http://127.0.0.1:8001/health
+ docker stop sqlenv-local && docker rm sqlenv-local
+
+ # 4. Push to the Space — creates hjerpe/sql_env if it doesn't exist,
+ #    uploads files, and triggers the Space's own Docker build
+ uv run openenv push
+ # expected tail:
+ #   ✓ Authenticated as: hjerpe
+ #   ✓ Space hjerpe/sql_env is ready
+ #   ✓ Upload completed successfully
+ #   Space URL: https://huggingface.co/spaces/hjerpe/sql_env
+ ```
+
+ **After push:** the Space rebuilds its own Docker image on HF's infra (takes
+ 2–5 min). Watch the build logs in the browser at
+ `https://huggingface.co/spaces/hjerpe/sql_env` → "Logs" tab. When it turns
+ green, re-run steps 1–5 at the top of this doc to verify.
+
+ **Files that must exist for `openenv push` to work** (already in the repo):
+ - `openenv.yaml` — manifest with name, version, description
+ - `sql_env/server/Dockerfile` — FastAPI + uvicorn container
+ - `sql_env/server/app.py` — `create_app(...)` entrypoint
+ - `sql_env/models.py` — `SQLAction` / `SQLObservation` Pydantic models
+ - `pyproject.toml` — pip-installable package metadata
+ - `README.md` — Space landing page (HF renders it on the Space page)
+
+ If any of these drifts out of sync, `uv run openenv validate --verbose` will
+ flag it before you push.
+
+ ---
+
+ ## Appendix B: Research finding — dangling legacy reward module
+
+ **Finding:** `training/rewards.py` (151 lines) is legacy dead code from the
+ pre-F010 rollout-based architecture. It is not used by the production
+ training path and can be deleted post-launch.
+
+ **Evidence:**
+ - Module docstring (lines 1–5): *"Reward callables for TRL GRPO training.
+   These helpers consume **rollout metadata**..."* — this is the OLD
+   pattern where reward functions parsed `kwargs['metadata']` from TRL
+   rollouts instead of reading `env.reward` from environment instances.
+ - Internal helper `_extract_metadata_rows()` (line 41): *"TRL can pass
+   rollout metadata in different shapes depending on wrapper code."* —
+   explicit confirmation this is replay-based reward parsing.
+ - Functions exposed: `reward_correctness`, `reward_progress`,
+   `reward_operational`.
+ - **Zero production imports.** `rg 'from.*training\.rewards|training\.rewards\.reward_'`
+   returns exactly **one** hit: `tests/unit/test_rewards.py`. No script,
+   notebook, or other module in `training/` imports it.
+ - The real training path uses `sql_env_reward_func` in
+   `training/trl_adapter.py`, which reads `env.reward` directly from
+   `SQLEnvTRL` instances. This is the `environment_factory` pattern
+   mandated by F010 and documented as the correct choice (see
+   `specs/F010-IMPLEMENTATION_SPEC.md:173` and the user's own memory
+   note: *"Use environment_factory or rollout_func, not replay-based
+   reward parsing"*).
+ - Notebook `train_grpo.ipynb` cell 16:
+   `reward_funcs=[sql_env_reward_func]` — pulls from `trl_adapter`,
+   not `rewards.py`.
+
+ **The only `rollout` matches in `training/`** are harmless:
+ - `training/prompts.py:1` — docstring mentions "GRPO training rollouts"
+ - `training/rewards.py` — the legacy module itself
+ - `notebooks/train_grpo.ipynb` cell 16 — a local variable
+   `before_rollouts = sample_random_baseline(...)` that has nothing to do
+   with TRL's `rollout_func`
+
+ **Recommendation (post-launch, low priority):**
+ 1. Delete `training/rewards.py`
+ 2. Delete `tests/unit/test_rewards.py`
+ 3. Confirm `uv run pytest tests/ -v` still passes
+ 4. Commit with message: `refactor: remove legacy rollout-metadata reward
+    module superseded by F010 environment_factory`
+
+ **Why not today:** zero risk on launch (nothing imports it in production),
+ and deleting files during blog-publish day is the wrong kind of churn.
+ File as a post-launch cleanup.
+
+ ---
+
+ ## Post-launch cleanup
+
+ - [ ] Delete this file
+ - [ ] File issue for item 8 (Space concurrency)
+ - [ ] Delete `training/rewards.py` + `tests/unit/test_rewards.py` (see Appendix B)
+ - [ ] Update `docs/competition-deliverables.md` open-items list
docs/exploration/grpo-collapse-analysis.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ title: GRPO Training Collapse Analysis
+ description: Root-cause analysis of GRPO training collapse on Qwen3-1.7B caused by extra kwargs in tool calls and advantage collapse
+ doc_type: exploration
+ ---
+
+ # GRPO Training Collapse Analysis
+
+ ## What happened
+
+ After SFT warmup, GRPO training on Qwen3-1.7B collapsed within the first 30 steps. The model degenerated into passing extra `null` arguments to every tool call (`"sql": null, "table_name": "...", "value": null`), triggering `unexpected keyword argument` errors on every rollout. It never recovered across 351 steps (~8 hours on L4).
+
+ ## Timeline
+
+ | Step | Reward | What the model does |
+ |------|--------|-------------------|
+ | 10 | -1.25 | First call has extra args, gets error, loops with `Episode is over` |
+ | 20 | 0.01 | Occasionally correct describe, but passes wrong args to answer |
+ | 30 | 0.00 | Stuck: `describe(sql=null, table_name="concert")` infinite loop |
+ | 40-351 | 0.00 | Complete collapse: every rollout is identical error loops |
+
+ ## Why it collapsed
+
+ ### 1. SFT taught wrong argument patterns
+ The SFT examples show `describe(table_name=...)` correctly, but the base Qwen3-1.7B model has a strong prior from pretraining to include all available parameter names in every call. The 353-turn SFT warmup (2 epochs, batch=2) wasn't enough to override this for all 4 tools.
+
+ ### 2. Extra kwargs cause hard failures, not soft degradation
+ When the model passes `describe(sql=null, table_name="flights")`, TRL dispatches `SQLEnvTRL.describe(sql=None, table_name="flights")`, which raises `TypeError: unexpected keyword argument 'sql'`. This is a **hard wall** — the model gets zero useful information back, just an error string it can't learn from.
+
+ ### 3. GRPO advantage collapse
+ With 6 generations per question:
+ - All 6 rollouts pass the same extra args → all get reward 0.0
+ - Advantage = 0.0 for every sample → zero gradient signal
+ - The model has no way to discover that dropping the extra args would work
+ - Loss oscillates near 0 throughout training
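The collapse is plain arithmetic. GRPO scores each rollout relative to its group, so a group of identical rewards yields zero advantage everywhere. A minimal illustration of the group-relative computation (not TRL's internal code):

```python
def group_advantages(rewards):
    """Group-relative advantages in the GRPO style: reward minus group
    mean, divided by group std (epsilon-guarded against zero std)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


# All 6 generations hit the same TypeError wall -> identical rewards
print(group_advantages([0.0] * 6))  # every advantage is 0.0 -> no gradient

# A single rollout escaping the error loop restores a learning signal:
# it gets a positive advantage, the other five get negative ones
print(group_advantages([0.0] * 5 + [1.0]))
```

This is why the fix below matters: tolerance to extra kwargs only needs to let *one* rollout per group succeed for the gradient signal to come back.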
+
+ ### 4. No recovery mechanism
+ Once the model enters the error loop:
+ - Error messages say "unexpected keyword argument 'sql'" but don't say "try calling with only table_name"
+ - The model retries the same call pattern endlessly
+ - Post-episode penalty accumulates negative reward (-1.25 at step 10) but doesn't help because ALL rollouts are equally bad
+ - No positive examples exist in any rollout group to provide advantage signal
+
+ ## The core problem: kwargs rejection vs. kwargs tolerance
+
+ The TRL adapter methods have strict signatures:
+ ```python
+ def describe(self, table_name: str) -> str:
+ def query(self, sql: str) -> str:
+ def answer(self, value: str) -> str:
+ ```
+
+ When the model generates `{"table_name": "flights", "sql": null}`, Python raises TypeError before the method body executes. The model never gets a schema response, so it has no path to success.
+
+ ## Fix: Accept and ignore extra kwargs
+
+ The simplest fix is to make the tool methods tolerant of extra arguments:
+
+ ```python
+ def describe(self, table_name: str, **kwargs) -> str:
+ def query(self, sql: str, **kwargs) -> str:
+ def answer(self, value: str, **kwargs) -> str:
+ def sample(self, table_name: str, **kwargs) -> str:
+ ```
+
+ This means `describe(sql=null, table_name="flights")` would work — it would ignore `sql` and return the schema. The model gets useful feedback, can write SQL, and has a path to positive reward. GRPO then has signal to learn that the extra args are unnecessary.
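The difference is easy to demonstrate in isolation, with toy functions standing in for the adapter methods:

```python
def describe_strict(table_name: str) -> str:
    # Mirrors the current adapter signature: extra kwargs raise TypeError.
    return f"schema of {table_name}"


def describe_tolerant(table_name: str, **kwargs) -> str:
    # Extra arguments the model hallucinated are accepted and ignored.
    return f"schema of {table_name}"


# What the collapsed model actually emits: {"sql": null, "table_name": "flights"}
bad_call = {"sql": None, "table_name": "flights"}

try:
    describe_strict(**bad_call)
except TypeError as e:
    print("strict:", e)  # unexpected keyword argument 'sql' -> dead end

print("tolerant:", describe_tolerant(**bad_call))  # schema comes back
```

With the strict signature the exception fires during argument binding, so no error handling inside the method body can help; `**kwargs` is the only change that turns the dead end into feedback.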
67
+
68
+ **Why this is the right approach:**
69
+ - Small models (1.7B) lack the capacity to perfectly learn function signatures from tool definitions alone
70
+ - The tool definitions in `<tools>` XML clearly state which params are required — the model will converge toward correct signatures over time via reward signal
71
+ - Strict rejection creates an unrecoverable dead end; tolerance creates a learning gradient
72
+ - This matches how real APIs work — most accept and ignore unexpected fields
73
+
74
+ ## Other contributing factors
75
+
76
+ ### SFT quality issues
77
+ - SFT was only 100 questions x ~3.5 turns = 347 examples
78
+ - Only 2 epochs at batch=2 (total 347 steps)
79
+ - The model learned tool-call format but not strict argument isolation
80
+ - Need: more SFT data or more epochs on existing data
81
+
82
+ ### Missing KL penalty
83
+ - No KL divergence penalty against the SFT reference model
84
+ - GRPO updated the policy freely, drifting away from the SFT distribution
85
+ - A KL penalty (beta=0.01-0.05) would have anchored the model near the working SFT baseline
86
+
87
+ ### Learning rate may be too high
88
+ - Default TRL learning rate (5e-7 or 1e-6) may be too aggressive for 1.7B
89
+ - Lower LR (1e-7) would make smaller updates, reducing drift risk
90
+
91
+ ## Recommended fixes (priority order)
92
+
93
+ ### 1. Add `**kwargs` to all tool methods (critical)
94
+ Prevents the hard wall. Model can still learn correct signatures from reward signal.
95
+
96
+ ### 2. Increase SFT warmup
97
+ - 4 epochs instead of 2
98
+ - Or increase SFT data from 100 to 200 questions
99
+ - Verify post-SFT that the model generates correct single-arg calls
100
+
101
+ ### 3. Add KL penalty
102
+ ```python
103
+ GRPOConfig(
104
+ ...,
105
+ beta=0.04, # KL penalty against SFT reference
106
+ )
107
+ ```
108
+ Prevents policy from drifting too far from the working SFT baseline.
109
+
110
+ ### 4. Lower GRPO learning rate
111
+ From default to 1e-7 or 5e-8.
112
+
113
+ ## Verification checklist
114
+
115
+ Before running GRPO again:
116
+ - [ ] Post-SFT format check shows `describe(table_name="X")` with NO extra args
117
+ - [ ] Tool methods accept `**kwargs` so extra args don't crash
118
+ - [ ] First 10 GRPO steps show at least some reward > 0
119
+ - [ ] Reward doesn't flatline at 0.0 by step 30
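The first checklist item can be automated with a small argument-strictness check over sampled generations. A hypothetical helper, assuming tool calls are emitted as JSON objects with `name` and `arguments` keys (adapt to the actual post-SFT sampling format):

```python
import json

# Expected signature per tool, mirroring the adapter's method definitions
ALLOWED_ARGS = {
    "describe": {"table_name"},
    "sample": {"table_name"},
    "query": {"sql"},
    "answer": {"value"},
}


def extra_args(tool_call_json: str) -> set:
    """Return the set of hallucinated argument names in one tool call."""
    call = json.loads(tool_call_json)
    return set(call["arguments"]) - ALLOWED_ARGS[call["name"]]


# A clean post-SFT call vs. the collapse-mode call from the timeline above
good = '{"name": "describe", "arguments": {"table_name": "concert"}}'
bad = '{"name": "describe", "arguments": {"sql": null, "table_name": "concert"}}'

print(extra_args(good))  # set() -> passes the checklist item
print(extra_args(bad))   # {'sql'} -> model is still hallucinating extra args
```

Running this over a few hundred sampled calls and requiring an empty result for all of them gives a concrete pass/fail gate before committing GPU hours to another GRPO run.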