hjerpe committed on
Commit 9e64e71 · verified · 1 Parent(s): d9759a5

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. .gitattributes +1 -0
  2. AGENTS.md +2 -59
  3. DEMO.md +286 -0
  4. DEMO_action_feature.md +427 -0
  5. Dockerfile.test +28 -0
  6. ONBOARDING_action_feature.md +341 -0
  7. README.md +34 -5
  8. REVIEW_REPORT.md +21 -34
  9. client.py +2 -16
  10. configs/colab_l4.json +15 -0
  11. configs/test_cpu.json +17 -0
  12. data/databases/car_1/car_1.sqlite +0 -0
  13. data/databases/concert_singer/concert_singer.sqlite +0 -0
  14. data/databases/cre_Doc_Template_Mgt/cre_Doc_Template_Mgt.sqlite +0 -0
  15. data/databases/dog_kennels/dog_kennels.sqlite +0 -0
  16. data/databases/employee_hire_evaluation/employee_hire_evaluation.sqlite +0 -0
  17. data/databases/flight_2/flight_2.sqlite +0 -0
  18. data/databases/pets_1/pets_1.sqlite +0 -0
  19. data/databases/poker_player/poker_player.sqlite +0 -0
  20. data/databases/student_assessment/student_assessment.sqlite +0 -0
  21. data/databases/world_1/world_1.sqlite +3 -0
  22. data/sft/sft_trajectories.json +0 -0
  23. docs/.DS_Store +0 -0
  24. docs/ARCHITECTURE.md +336 -224
  25. docs/DOCS_CONTRACT.json +74 -0
  26. docs/DOCS_TAXONOMY.md +62 -0
  27. docs/FEATURE_SLICING.md +50 -0
  28. docs/QUALITY_SCORE.md +13 -0
  29. docs/README.md +49 -0
  30. docs/SKILLS_HANDBOOK.generated.md +413 -0
  31. docs/blog-material.md +428 -0
  32. docs/blog-outline.md +95 -37
  33. docs/blog-post-v1-preview.html +403 -0
  34. docs/blog-post-v1.md +269 -0
  35. docs/blog-post.md +118 -0
  36. docs/competition-deliverables.md +129 -0
  37. docs/data-sources.md +182 -0
  38. docs/delivery-specs/index.md +66 -0
  39. docs/design-docs/core-beliefs.md +61 -0
  40. docs/design-docs/index.md +1 -1
  41. docs/design-docs/reward-shaping-research.md +197 -0
  42. docs/discovery/.gitkeep +0 -0
  43. docs/discovery/index.md +63 -0
  44. docs/exec-plans/README.md +41 -0
  45. docs/exec-plans/active/.gitkeep +0 -0
  46. docs/exec-plans/completed/.gitkeep +0 -0
  47. docs/exec-plans/tech-debt-tracker.md +37 -0
  48. docs/exploration/README.md +44 -0
  49. docs/exploration/f007-prelaunch-checklist.md +455 -0
  50. docs/exploration/grpo-collapse-analysis.md +119 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ data/databases/world_1/world_1.sqlite filter=lfs diff=lfs merge=lfs -text
AGENTS.md CHANGED
@@ -82,64 +82,7 @@ sql-env/ # project root = environment package
82
 
83
  <!-- GUIDELINES-BEGIN -->
84
 
85
- ## Delivery Safety (Move Fast Without Breaking Things)
86
-
87
- Move fast by taking the smallest responsible step that produces real feedback, while pre-committing to guardrails so being wrong is survivable.
88
-
89
- - **Small batches:** Prefer vertical slices and small PRs; reduce blast radius and review/debug time.
90
- - **Define "broken" first:** Before shipping, write down what you will watch (errors, latency, correctness, cost) and the abort threshold.
91
- - **Design for reversibility:** Make changes easy to turn off, roll back, or ignore.
92
-
93
- ## System Boundaries (Avoid Analysis Paralysis)
94
-
95
- Systems are continuous webs; plans require artificial boundaries.
96
-
97
- - **Boundary rule:** Include only variables/components that could change the decision you are making.
98
- - **Clouds:** Treat everything else as exogenous inputs; track them as risks/assumptions.
99
- - **Timebox mapping:** If the landscape is moving faster than you can model it, run a probe (spike, canary, A/B) instead.
100
-
101
- ## Maturity Modes
102
-
103
- Match guardrails to maturity:
104
-
105
- - **Exploratory:** Learning > durability. Prefer spikes; avoid irreversible state changes; manual verification is OK; expect throwaway code.
106
- - **MVP:** Ship a thin end-to-end slice. Manual checks are OK, but you still need a fast rollback path and bounded impact.
107
- - **Production:** Build to last. Automated tests, observability, progressive rollout, and explicit rollback/incident posture.
108
-
109
- Expect limiting factors to move as you ship: fix the current bottleneck, then re-diagnose the next.
110
-
111
- ## Progressive Delivery
112
-
113
- - **Feature flags:** Use flags to make risky changes reversible. Categorize flags (release/experiment/ops/permissioning).
114
- - **Flags are inventory:** Every flag needs an owner, an expiry, and a removal plan.
115
- - **Canary/ramp when risk is non-trivial:** Start small, watch signals, ramp gradually; prefer "flip off" over redeploy.
116
-
117
- ## Reliability Control Loop (If You Run Production)
118
-
119
- - **SLO + error budget:** If you are within budget, keep shipping; if you burn budget, freeze non-critical changes and pay down reliability.
120
-
121
- ## Avoid
122
-
123
- - Big-bang releases, long-lived branches, unowned flags, flaky tests, and alert noise.
124
-
125
- ## Python Guidelines
126
-
127
- - Prefer type hints for public APIs; use `typing` / `collections.abc`.
128
- - Use NumPy-style docstrings; keep them synced with type hints.
129
- - Error handling: Use specific exceptions; avoid `try: ... except Exception: pass`.
130
- - Dependencies: Use `uv add <package>`; do not manually edit `pyproject.toml`.
131
-
132
- ## Docs Expectations
133
-
134
- - Keep durable design/ops knowledge in `docs/` (architecture, runbook, decisions). Keep AGENTS.md as a short map, not an encyclopedia.
135
-
136
- ## Testing Standards
137
-
138
- - **Always use the project's package manager** to run tests. Never invoke test runners directly.
139
- - Python (uv): `uv run pytest tests/ -v` (NEVER bare `pytest`)
140
- - Python (poetry): `poetry run pytest tests/ -v`
141
- - Node: `npm test` or `npm run test`
142
- - Rust: `cargo test`
143
- - **Rationale:** Bare `pytest` bypasses the virtualenv and may use the wrong Python/dependencies. Package managers ensure the correct environment. Bare invocations also trigger unnecessary permission prompts in automated workflows.
144
 
145
  <!-- GUIDELINES-END -->
 
82
 
83
  <!-- GUIDELINES-BEGIN -->
84
 
85
+ <!-- Managed by: opencode-ctx guidelines apply --packs python,testing,delivery-safety -->
86
+ <!-- Run the command above to populate this section -->
87
 
88
  <!-- GUIDELINES-END -->
DEMO.md ADDED
@@ -0,0 +1,286 @@
1
+ # Demo: SQLEnv — Flat OpenEnv Environment with Action Dispatch
2
+
3
+ > **Generated:** 2026-02-28T14:26Z
4
+ > **Branch:** `refactor-openenv-tutorial-project-structure` @ `f28bfaa`
5
+ > **Environment:** Python 3.12.3, torch 2.2.2, MockTokenizer (no Ollama required)
6
+
7
+ ---
8
+
9
+ ## What This Branch Does
10
+
11
+ This branch refactors the `sql-env` project from a nested `envs/sql_env/` layout into the canonical flat `openenv init` structure, and integrates the `action-feature` branch's core action dispatch system.
12
+
13
+ The result: a working RL environment where an agent sends natural language messages (e.g. _"describe the students table"_), the environment classifies them into action types (describe/sample/query), dispatches to the appropriate handler, and returns tokenized observations for RL training. All of this runs without external services — `MockTokenizer` replaces HuggingFace tokenizers and Ollama failures are handled gracefully.
14
+
15
+ ---
16
+
17
+ ## Quickstart
18
+
19
+ ```bash
20
+ git checkout refactor-openenv-tutorial-project-structure
21
+ uv sync
22
+ uv run pytest tests/ -v # 21 tests, ~3.5s
23
+ ```
24
+
25
+ **Prerequisites:** Python 3.11-3.12, `uv`.
26
+ **Optional:** Ollama with `llama3.2` for LLM-guided table selection (not needed for demo).
27
+
28
+ ---
29
+
30
+ ## Evidence
31
+
32
+ ### 1. All 21 Tests Pass
33
+
34
+ ```
35
+ $ uv run pytest tests/ -v
36
+
37
+ tests/test_smoke.py::TestModels::test_action_creation PASSED [ 4%]
38
+ tests/test_smoke.py::TestModels::test_action_with_tokens PASSED [ 9%]
39
+ tests/test_smoke.py::TestModels::test_observation_creation PASSED [ 14%]
40
+ tests/test_smoke.py::TestModels::test_state_creation PASSED [ 19%]
41
+ tests/test_smoke.py::TestEnvironment::test_instantiation PASSED [ 23%]
42
+ tests/test_smoke.py::TestEnvironment::test_reset_returns_observation PASSED [ 28%]
43
+ tests/test_smoke.py::TestEnvironment::test_reset_with_empty_prompt PASSED [ 33%]
44
+ tests/test_smoke.py::TestEnvironment::test_reset_creates_new_episode PASSED [ 38%]
45
+ tests/test_smoke.py::TestEnvironment::test_step_describe PASSED [ 42%]
46
+ tests/test_smoke.py::TestEnvironment::test_step_sample PASSED [ 47%]
47
+ tests/test_smoke.py::TestEnvironment::test_tokens_grow_across_turns PASSED [ 52%]
48
+ tests/test_smoke.py::TestActionDetection::test_describe_keywords PASSED [ 57%]
49
+ tests/test_smoke.py::TestActionDetection::test_sample_keywords PASSED [ 61%]
50
+ tests/test_smoke.py::TestActionDetection::test_query_default PASSED [ 66%]
51
+ tests/test_smoke.py::TestMessageToAction::test_creates_action PASSED [ 71%]
52
+ tests/test_smoke.py::TestMessageToAction::test_appends_to_history PASSED [ 76%]
53
+ tests/test_smoke.py::TestMessageToAction::test_validates_input PASSED [ 80%]
54
+ tests/test_smoke.py::TestClientSerialization::test_step_payload_serialization PASSED [ 85%]
55
+ tests/test_smoke.py::TestClientSerialization::test_parse_result_deserialization PASSED [ 90%]
56
+ tests/test_smoke.py::TestSchemaIntrospection::test_get_table_schema PASSED [ 95%]
57
+ tests/test_smoke.py::TestSchemaIntrospection::test_unknown_table PASSED [100%]
58
+
59
+ ============================== 21 passed in 3.56s ==============================
60
+ ```
61
+
62
+ Tests cover: Pydantic models, environment lifecycle, action detection, message-to-action conversion, client tensor serialization, and schema introspection.
63
+
64
+ ### 2. Lint and Format Clean
65
+
66
+ ```
67
+ $ uv run ruff check .
68
+ All checks passed!
69
+
70
+ $ uv run ruff format --check .
71
+ 14 files already formatted
72
+ ```
73
+
74
+ ### 3. Pydantic Model Contracts
75
+
76
+ ```python
77
+ >>> from sql_env.models import SQLAction, SQLObservation, SQLState
78
+
79
+ SQLAction fields: ['metadata', 'action_type', 'action_description', 'tokens']
80
+ SQLObservation fields: ['done', 'reward', 'metadata', 'messages', 'tokens']
81
+ SQLState fields: ['episode_id', 'step_count', 'history_messages', 'history_tokens', 'current_action_type']
82
+ ```
83
+
84
+ `SQLAction.tokens` and `SQLObservation.tokens` carry torch tensors. `SQLState.history_messages` / `history_tokens` accumulate the full conversation for RL context.
85
+
86
+ ### 4. Action Type Detection
87
+
88
+ The environment classifies natural language messages into action types via keyword matching:
89
+
90
+ ```
91
+ [PASS] "describe the students table..." -> describe
92
+ [PASS] "what columns does Course have..." -> describe
93
+ [PASS] "show me the schema..." -> describe
94
+ [PASS] "show me sample rows from students..." -> sample
95
+ [PASS] "give me example data..." -> sample
96
+ [PASS] "how many rows are in Courses..." -> sample
97
+ [PASS] "find all students enrolled in CS101..." -> query
98
+ [PASS] "select count(*) from students..." -> query
99
+ [PASS] "what is the average score..." -> query
100
+ ```
101
+
102
+ Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.
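The rule is simple enough to sketch. This is a hypothetical reconstruction of the classifier based on the PASS output above, not the actual `_detect_action_type` implementation; the keyword tuples are assumptions.

```python
# Hypothetical sketch of the keyword classifier described above; the real
# _detect_action_type in the environment may differ in detail.
DESCRIBE_KEYWORDS = ("describe", "schema", "columns")
SAMPLE_KEYWORDS = ("sample", "example", "rows")


def detect_action_type(content: str) -> str:
    """Classify a natural language message as describe, sample, or query."""
    text = content.lower()
    if any(kw in text for kw in DESCRIBE_KEYWORDS):
        return "describe"
    if any(kw in text for kw in SAMPLE_KEYWORDS):
        return "sample"
    return "query"  # default: anything else is treated as a query request


print(detect_action_type("how many rows are in Courses"))  # "rows" -> sample
```

Note the describe check runs first, so "show me the schema" never falls through to sample even though it also contains "me".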
103
+
104
+ ### 5. MockTokenizer Roundtrip
105
+
106
+ ```python
107
+ >>> from server.test_sql_env import MockTokenizer
108
+ >>> tok = MockTokenizer()
109
+ >>> msg = [{'role': 'user', 'content': 'describe the students table'}]
110
+ >>> tokens = tok.apply_chat_template(msg, return_tensors='pt')
111
+ >>> tokens.shape
112
+ torch.Size([1, 27])
113
+ >>> tokens[0][:10].tolist()
114
+ [100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
115
+ >>> tok.decode(tokens[0].tolist())
116
+ 'describe the students table'
117
+ ```
118
+
119
+ `MockTokenizer` encodes each character as `ord(c)` and decodes via `chr(t)`. Deterministic, no downloads, perfect for tests.
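The idea fits in a few lines. A minimal sketch using plain Python lists; the real `MockTokenizer` wraps the result in a torch tensor and exposes `apply_chat_template` / `decode` instead of these illustrative method names.

```python
class CharTokenizer:
    """Toy character-level tokenizer mirroring the MockTokenizer idea."""

    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]  # one token per character

    def decode(self, tokens: list[int]) -> str:
        return "".join(chr(t) for t in tokens)


tok = CharTokenizer()
tokens = tok.encode("describe the students table")
assert len(tokens) == 27  # matches torch.Size([1, 27]) above
assert tokens[:10] == [100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
assert tok.decode(tokens) == "describe the students table"
```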
120
+
121
+ ### 6. Schema Introspection
122
+
123
+ SQLAlchemy ORM models are introspected at runtime to produce natural language schema descriptions:
124
+
125
+ ```python
126
+ >>> env._get_table_schema('Student')
127
+ Table 'Student' has the following columns:
128
+
129
+ - student_id: integer number
130
+ - student_details: text (up to 255 characters)
131
+
132
+ >>> env._get_table_schema('NonexistentTable')
133
+ Table 'NonexistentTable' not found in schema.
134
+ ```
135
+
136
+ 9 tables available: Address, Person, Student, Course, PersonAddress, StudentCourseRegistration, StudentCourseAttendance, Candidate, CandidateAssessment.
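The introspection pattern can be sketched with a standalone SQLAlchemy model. `Student`, `humanize`, and `describe_table` here are illustrative stand-ins reconstructed from the output above, not the project's actual helpers.

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Student(Base):
    """Illustrative ORM model; the real models live in the server package."""

    __tablename__ = "Student"
    student_id = Column(Integer, primary_key=True)
    student_details = Column(String(255))


def humanize(coltype) -> str:
    # Translate SQLAlchemy column types into the natural language seen above.
    if isinstance(coltype, Integer):
        return "integer number"
    if isinstance(coltype, String) and coltype.length:
        return f"text (up to {coltype.length} characters)"
    return str(coltype).lower()


def describe_table(model) -> str:
    lines = [f"Table '{model.__tablename__}' has the following columns:", ""]
    lines += [f"- {c.name}: {humanize(c.type)}" for c in model.__table__.columns]
    return "\n".join(lines)


print(describe_table(Student))
```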
137
+
138
+ ### 7. Full Environment Interaction (Mock Path)
139
+
140
+ A complete multi-turn episode with no external services:
141
+
142
+ ```python
143
+ >>> from server.sql_environment import SQLEnvironment
144
+ >>> from server.test_sql_env import MockTokenizer
145
+ >>> env = SQLEnvironment(system_prompt='You are a helpful SQL assistant.', tokenizer=MockTokenizer())
146
+
147
+ >>> obs = env.reset()
148
+ >>> obs.messages # 1 message (system prompt)
149
+ >>> obs.tokens.shape
150
+ torch.Size([32])
151
+ >>> obs.done
152
+ False
153
+ ```
154
+
155
+ **Turn 1 — Describe:**
156
+ ```python
157
+ >>> action = env.message_to_action({'role': 'user', 'content': 'describe the Student table'})
158
+ >>> action.action_type
159
+ 'describe'
160
+ >>> obs = env.step(action)
161
+ >>> obs.messages[-1]
162
+ {'role': 'assistant', 'content': "Table 'Address' has the following columns:\n\n- address_id: integer number\n..."}
163
+ >>> obs.tokens.shape
164
+ torch.Size([91])
165
+ ```
166
+
167
+ Without Ollama, the describe action falls back to the first table (Address). With Ollama, it would correctly select "Student".
168
+
169
+ **Turn 2 — Sample:**
170
+ ```python
171
+ >>> action = env.message_to_action({'role': 'user', 'content': 'show me sample rows from Course'})
172
+ >>> action.action_type
173
+ 'sample'
174
+ >>> obs = env.step(action)
175
+ >>> obs.messages[-1]['content']
176
+ "Here's a query to sample data from Address:\n\nSELECT * FROM Address LIMIT 10;"
177
+ >>> obs.tokens.shape
178
+ torch.Size([503])
179
+ ```
180
+
181
+ **Turn 3 — Query (no Ollama):**
182
+ ```python
183
+ >>> action = env.message_to_action({'role': 'user', 'content': 'find all students enrolled in CS101'})
184
+ >>> action.action_type
185
+ 'query'
186
+ >>> obs = env.step(action)
187
+ >>> obs.messages[-1]['content']
188
+ 'Error: Ollama returned status 404'
189
+ >>> obs.tokens.shape
190
+ torch.Size([1028])
191
+ ```
192
+
193
+ The error is graceful — it becomes part of the conversation history. Token tensor grows monotonically across turns (32 -> 91 -> 503 -> 1028).
194
+
195
+ ### 8. Client Serialization
196
+
197
+ `SQLEnvClient` converts tensors to lists for JSON WebSocket transport:
198
+
199
+ ```python
200
+ >>> from sql_env.client import SQLEnvClient
201
+ >>> from sql_env.models import SQLAction
202
+ >>> import torch
203
+
204
+ >>> action = SQLAction(action_type='query', action_description='select * from students', tokens=torch.tensor([[1, 2, 3, 4, 5]]))
205
+ >>> payload = client._step_payload(action)
206
+ {
207
+ 'action_type': 'query',
208
+ 'action_description': 'select * from students',
209
+ 'tokens': [[1, 2, 3, 4, 5]],
210
+ 'metadata': {}
211
+ }
212
+ ```
213
+
214
+ Tensor -> list on send, list -> tensor on receive. Symmetric roundtrip verified in tests.
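A minimal sketch of that roundtrip, assuming the conversion is a plain `tolist()` / `torch.tensor()` pair; the function names are illustrative, not the client's private methods.

```python
import torch


def to_payload(tokens: torch.Tensor) -> list:
    return tokens.tolist()  # tensor -> nested list for JSON transport


def from_payload(data: list) -> torch.Tensor:
    return torch.tensor(data)  # list -> tensor on receive


orig = torch.tensor([[1, 2, 3, 4, 5]])
restored = from_payload(to_payload(orig))
assert torch.equal(orig, restored)  # lossless roundtrip for integer tokens
```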
215
+
216
+ ### 9. Spider Question Data
217
+
218
+ ```python
219
+ >>> import json
220
+ >>> data = json.load(open('data/questions/student_assessment.json'))
221
+ >>> len(data)
222
+ 53
223
+ >>> data[0]['question']
224
+ 'which course has most number of registered students?'
225
+ >>> data[0]['query']
226
+ 'SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1'
227
+ ```
228
+
229
+ 53 question-answer pairs from the Spider dataset's `student_assessment` database. Each entry has `db_id`, `query`, `question`, `query_toks`, `query_toks_no_value`, and `question_toks`.
230
+
231
+ ---
232
+
233
+ ## What Changed from `main`
234
+
235
+ | Area | Before (main) | After (this branch) |
236
+ |------|---------------|---------------------|
237
+ | **Layout** | `envs/sql_env/` nested | Flat root = package |
238
+ | **Build** | hatchling | setuptools |
239
+ | **Python** | 3.13 | 3.11-3.12 (torch compat) |
240
+ | **Models** | Structured obs (question, schema, result) | Chat-based obs (messages + tokens) |
241
+ | **Action** | `argument` field | `action_description` + `tokens` tensor |
242
+ | **Environment** | Scaffold stubs | Real SQLite + Ollama + keyword dispatch |
243
+ | **Client** | Basic EnvClient | Tensor <-> list serialization |
244
+ | **Data** | Empty .gitkeep dirs | 9 ORM models + 53 Spider questions |
245
+ | **Tests** | 0 | 21 (all passing) |
246
+ | **Empty dirs** | `training_pipeline/`, `submission_artifacts/` | Removed |
247
+
248
+ ---
249
+
250
+ ## Known Behaviors (Not Bugs)
251
+
252
+ 1. **Ollama fallback:** Without Ollama, `_call_ollama_to_select_table()` falls back to the first table (`Address`). Query actions return `Error: Ollama returned status 404`. This is by design — the mock path is for dev/test, not production.
253
+
254
+ 2. **`message_to_action()` mutates state:** It appends the message to `_state.history_messages` before tokenizing. This is intentional — the tokenizer needs the full conversation context.
255
+
256
+ 3. **`MockTokenizer` in production code:** `server/app.py` imports `MockTokenizer` from `server/test_sql_env.py` when `transformers` is unavailable. This is the teammate's design for running without GPU dependencies.
257
+
258
+ ---
259
+
260
+ ## Verification Checklist
261
+
262
+ - [x] `uv sync` succeeds (all deps install)
263
+ - [x] `uv run pytest tests/ -v` — 21/21 pass
264
+ - [x] `uv run ruff check .` — all checks passed
265
+ - [x] `uv run ruff format --check .` — 14 files formatted
266
+ - [x] Pydantic models import from `sql_env.models`
267
+ - [x] Environment instantiates with MockTokenizer
268
+ - [x] `reset()` returns valid SQLObservation with system prompt
269
+ - [x] Action detection: 9/9 keyword classifications correct
270
+ - [x] `message_to_action()` creates typed SQLAction with tokens
271
+ - [x] `step(describe)` returns schema from SQLAlchemy introspection
272
+ - [x] `step(sample)` returns SQL query text
273
+ - [x] `step(query)` returns graceful error without Ollama
274
+ - [x] Multi-turn conversation state grows correctly
275
+ - [x] Client tensor <-> list serialization roundtrips
276
+ - [x] Spider data loads (53 questions)
277
+
278
+ ---
279
+
280
+ ## What's Next
281
+
282
+ **Phase 3:** Reward computation (`server/reward.py`) and answer verification (`server/verifier.py`). Both are currently stubs.
283
+
284
+ ---
285
+
286
+ *All output captured live on 2026-02-28. Reproduce with `uv sync && uv run pytest tests/ -v`.*
DEMO_action_feature.md ADDED
@@ -0,0 +1,427 @@
1
+ # Feature Demo: `action-feature` — Core Action Dispatch System
2
+
3
+ > **Generated:** 2026-02-28T14:46:19+01:00
4
+ > **Branch:** `origin/action-feature` vs `main`
5
+ > **Project:** sql-env-onboarding (SQLEnv RL Environment for OpenEnv Challenge)
6
+ > **Execution environment:** Python 3.11, torch 2.2.2, MockTokenizer (no Ollama required)
7
+
8
+ ---
9
+
10
+ ## What This Feature Does
11
+
12
+ The `action-feature` branch adds the **core action dispatch system** to SQLEnv — the RL environment where AI agents learn interactive SQL exploration. Before this branch, the environment had data models but no way to actually process agent actions.
13
+
14
+ Now an agent can send natural language messages like _"describe the students table"_ or _"find all students enrolled in CS101"_, and the environment automatically classifies them into one of three action types (**describe**, **sample**, **query**), dispatches them to the appropriate handler, and returns observations with tokenized conversation history. This is the fundamental interaction loop that makes the environment usable for reinforcement learning.
15
+
16
+ The system works in two modes:
17
+ - **Mock path** (no external services): Uses `MockTokenizer` for tokenization. Describe and sample actions work fully; query actions return a graceful error since Ollama is unavailable.
18
+ - **Ollama path** (full pipeline): Uses a local Ollama LLM to select relevant tables for describe/sample and to generate SQL for query actions.
19
+
20
+ ---
21
+
22
+ ## Quickstart
23
+
24
+ ```bash
25
+ # 1. Checkout the branch
26
+ git checkout origin/action-feature --detach
27
+
28
+ # 2. Install dependencies (from the sql_env package directory)
29
+ cd envs/sql_env/
30
+ uv sync
31
+ uv add sqlalchemy # missing from pyproject.toml, needed for ORM models
32
+
33
+ # 3. Run the full demo (71 checks, ~2 seconds, no external services needed)
34
+ uv run python demo_action_feature.py
35
+
36
+ # 4. (Optional) Return to main when done
37
+ git checkout main
38
+ ```
39
+
40
+ **Prerequisites:** Python 3.11+, `uv` package manager, git.
41
+ **Optional:** Ollama running locally with `llama3.2` model for full query generation (set `OLLAMA_MODEL=llama3.2`).
42
+
43
+ > **Note:** `sqlalchemy` is required by the ORM models but was omitted from `pyproject.toml` on the branch. The `uv add sqlalchemy` step is necessary.
44
+
45
+ > **Note:** On Python < 3.12, a Pydantic compatibility patch is needed because `openenv` defines `Message` with `typing.TypedDict` instead of `typing_extensions.TypedDict`. The demo script applies this patch automatically.
46
+
47
+ ---
48
+
49
+ ## Live Demo — Mock Path (Primary)
50
+
51
+ All output below was captured by executing `uv run python demo_action_feature.py` on the `action-feature` branch with no Ollama model configured (default `qwen2` not installed).
52
+
53
+ ### 1. Environment Instantiation with MockTokenizer
54
+
55
+ The environment loads 9 SQLAlchemy ORM models (Address, Person, Student, Course, etc.) and initializes conversation state with a system prompt.
56
+
57
+ ```bash
58
+ uv run python demo_action_feature.py
59
+ ```
60
+
61
+ ```
62
+ ============================================================
63
+ 1. Environment Instantiation with MockTokenizer
64
+ ============================================================
65
+ [PASS] MockTokenizer created
66
+ [PASS] SQLEnvironment created
67
+ [PASS] System prompt stored
68
+ [PASS] Tokenizer stored
69
+ [PASS] 9 database models loaded
70
+ [PASS] Initial state has 1 message (system)
71
+ [PASS] Initial state has 1 token tensor
72
+ [PASS] System message role is correct
73
+ [PASS] System message content matches prompt
74
+ ```
75
+
76
+ The environment correctly stores the custom system prompt, attaches the tokenizer, and loads all 9 database table models from SQLAlchemy.
77
+
78
+ ### 2. reset() Returns Valid SQLObservation
79
+
80
+ ```
81
+ ============================================================
82
+ 2. reset() Returns Valid SQLObservation
83
+ ============================================================
84
+ [PASS] reset() returns SQLObservation
85
+ [PASS] Observation has messages list
86
+ [PASS] Messages contain system prompt
87
+ [PASS] Observation has tokens tensor
88
+ [PASS] Tokens tensor is 1D
89
+ [PASS] Tokens are non-empty
90
+ [PASS] done is False
91
+ [PASS] reward is None
92
+
93
+ Observation details:
94
+ messages count: 1
95
+ tokens shape: torch.Size([29])
96
+ tokens[:10]: [89, 111, 117, 32, 97, 114, 101, 32, 97, 32]
97
+ ```
98
+
99
+ After `reset()`, the observation contains one message (the system prompt) tokenized into a 1D tensor via `MockTokenizer` (char codes mod 256). The episode is not done and has no reward — ready for the agent's first action.
100
+
101
+ ### 3. Action Type Detection
102
+
103
+ The keyword classifier maps user messages to three action types: `describe`, `sample`, and `query`.
104
+
105
+ ```
106
+ ============================================================
107
+ 3. Action Type Detection (_detect_action_type)
108
+ ============================================================
109
+ [PASS] 'describe the students table...' -> describe
110
+ [PASS] 'what columns does Course have...' -> describe
111
+ [PASS] 'show me the schema...' -> describe
112
+ [PASS] 'show me sample rows from students...' -> sample
113
+ [PASS] 'give me example data...' -> sample
114
+ [PASS] 'how many rows are in Courses...' -> sample
115
+ [PASS] 'find all students enrolled in CS101...' -> query
116
+ [PASS] 'select count(*) from students...' -> query
117
+ [PASS] 'what is the average score...' -> query
118
+ ```
119
+
120
+ All 9 test cases correctly classified. Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.
121
+
122
+ ### 4. message_to_action() Creates Properly Typed SQLAction
123
+
124
+ ```
125
+ ============================================================
126
+ 4. message_to_action() Creates Properly Typed SQLAction
127
+ ============================================================
128
+ [PASS] Returns SQLAction
129
+ [PASS] action_type is 'describe'
130
+ [PASS] action_description is message content
131
+ [PASS] tokens is a torch.Tensor
132
+ [PASS] tokens tensor is non-empty
133
+
134
+ Action details:
135
+ action_type: describe
136
+ action_description: describe the students table
137
+ tokens shape: torch.Size([1, 57])
138
+ [PASS] message_to_action adds message to history
139
+ [PASS] History[1] is the user message
140
+ [PASS] Sample message -> action_type 'sample'
141
+ [PASS] Query message -> action_type 'query'
142
+ ```
143
+
144
+ `message_to_action()` converts a raw message dict into a `SQLAction` with the correct `action_type`, `action_description`, and tokenized tensor. **Important side effect:** it also appends the message to `_state.history_messages` before tokenizing, so the tokenizer sees the full conversation context.
145
+
146
+ ### 5. step() with Describe Action
147
+
148
+ Without Ollama, the describe action falls back to the first table (Address) and returns its full SQLAlchemy-derived schema.
149
+
150
+ ```
151
+ ============================================================
152
+ 5. step() with Describe Action (Schema from SQLAlchemy Models)
153
+ ============================================================
154
+ [PASS] step() returns SQLObservation
155
+ [PASS] History now has 3 messages (system + user + assistant)
156
+ [PASS] Last message is from assistant
157
+ [PASS] Assistant message contains 'columns'
158
+ [PASS] Schema info contains column descriptions
159
+
160
+ Describe response (first 200 chars):
161
+ Table 'Address' has the following columns:
162
+
163
+ - address_id: integer number
164
+ - line_1: text (up to 255 characters)
165
+ - line_2: text (up to 255 characters)
166
+ - city: text (up to 255 characters)
167
+ - zip_postcode:
168
+ [PASS] Tokens tensor grew after step
169
+ ```
170
+
171
+ The schema is extracted directly from SQLAlchemy model introspection (column names, types converted to natural language). The observation now has 3 messages (system → user → assistant) and the token tensor grew.
172
+
173
+ ### 6. step() with Sample Action
174
+
175
+ The sample action generates a `SELECT * ... LIMIT 10` query for the target table.
176
+
177
+ ```
178
+ ============================================================
179
+ 6. step() with Sample Action (Generates SQL Query Text)
180
+ ============================================================
181
+ [PASS] step(sample) returns assistant message
182
+ [PASS] Sample response contains SELECT
183
+ [PASS] Sample response contains LIMIT
184
+
185
+ Sample response:
186
+ Here's a query to sample data from Address:
187
+
188
+ SELECT * FROM Address LIMIT 10;
189
+ ```
190
+
191
+ ### 7. step() with Query Action (No Ollama)
192
+
193
+ Without Ollama, the query action returns a clear error message instead of crashing.
194
+
195
+ ```
196
+ ============================================================
197
+ 7. step() with Query Action (Mock Path — No Ollama)
198
+ ============================================================
199
+ [PASS] step(query) returns assistant message
200
+ [PASS] Query response is error string (no Ollama) or SQL
201
+
202
+ Query response (no Ollama):
203
+ Error: Ollama returned status 404
204
+ ```
205
+
206
+ The error is a graceful 404 (Ollama server is running but the default `qwen2` model isn't installed). The conversation continues normally — the error becomes part of the message history.
207
+
208
+ ### 8. Multi-Turn Conversation State Management
209
+
210
+ Three turns of alternating user/assistant messages, verifying the conversation history grows correctly.
211
+
212
+ ```
213
+ ============================================================
214
+ 8. Multi-Turn Conversation State Management
215
+ ============================================================
216
+ [PASS] After turn 1: 3 messages (sys + user + assistant)
217
+ [PASS] After turn 2: 5 messages (sys + u1 + a1 + u2 + a2)
218
+ [PASS] Tokens grew between turns
219
+ [PASS] After turn 3: 7 messages
220
+ [PASS] Tokens grew again
221
+ [PASS] Message roles follow expected pattern
222
+
223
+ Conversation summary after 3 turns:
224
+ [0] system: You are a test SQL assistant....
225
+ [1] user: describe the Address table...
226
+ [2] assistant: Table 'Address' has the following columns: - address_id: in...
227
+ [3] user: show me sample rows...
228
+ [4] assistant: Here's a query to sample data from Address: SELECT * FROM A...
229
+ [5] user: find all addresses in New York...
230
+ [6] assistant: Error: Ollama returned status 404...
231
+ Total tokens: 987
232
+ ```
233
+
234
+ Message roles follow the expected `[system, user, assistant, user, assistant, user, assistant]` pattern. Token count grows monotonically: the `_create_observation()` method flattens all `history_tokens` into a single 1D tensor via `torch.cat`.
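The flattening step can be sketched as follows, assuming 1D per-message tensors in `history_tokens`:

```python
import torch

# Per-message token tensors accumulate in history_tokens; each observation
# concatenates them into one flat 1D tensor, so token counts only grow.
history_tokens = [torch.tensor([1, 2]), torch.tensor([3]), torch.tensor([4, 5, 6])]
flat = torch.cat(history_tokens)
assert flat.dim() == 1
assert flat.tolist() == [1, 2, 3, 4, 5, 6]
```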
+
+ ### 9. Client Serialization Roundtrip
+
+ The `SQLEnvClient` converts tensor → list for JSON transport and list → tensor on the return path.
+
+ ```
+ ============================================================
+ 9. Client Serialization Roundtrip (_step_payload)
+ ============================================================
+ [PASS] Payload has action_type
+ [PASS] Payload has action_description
+ [PASS] Tokens converted to list
+ [PASS] Token values preserved
+ [PASS] _parse_result returns StepResult
+ [PASS] Observation messages parsed
+ [PASS] Tokens converted back to tensor
+ [PASS] Token values correct
+ [PASS] Reward parsed
+
+ Payload serialization:
+ action_type: query
+ tokens (list): [[1, 2, 3, 4, 5]]
+ ```
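The roundtrip is easy to sketch end to end (simplified; the real client wraps this inside `_step_payload()` and `_parse_result()`):

```python
import json
import torch

# Outbound: tensor -> nested list -> JSON string
tokens = torch.tensor([[1, 2, 3, 4, 5]])
payload = {"action_type": "query", "tokens": tokens.tolist()}
wire = json.dumps(payload)

# Inbound: JSON string -> list -> tensor
received = json.loads(wire)
restored = torch.tensor(received["tokens"])

print(torch.equal(restored, tokens))  # True
```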
+
+ ---
+
+ ## Edge Cases Exercised
+
+ ### Empty System Prompt
+
+ When no system prompt is provided (empty string), the environment correctly starts with zero messages and an empty token tensor.
+
+ ```
+ [PASS] Empty system prompt -> no messages in history
+ [PASS] Empty system prompt -> empty tokens
+ ```
+
+ ### Invalid Message Inputs
+
+ `message_to_action()` validates its input and raises `ValueError` for malformed messages.
+
+ ```
+ [PASS] Missing 'role' raises ValueError
+ [PASS] Missing 'content' raises ValueError
+ [PASS] None content raises ValueError
+ ```
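A minimal version of that validation might look like this (illustrative only; the branch's actual checks live inside `message_to_action()`):

```python
def validate_message(message: dict) -> None:
    """Raise ValueError for malformed chat messages."""
    if "role" not in message:
        raise ValueError("message is missing 'role'")
    if message.get("content") is None:
        raise ValueError("message is missing 'content'")

# A well-formed message passes silently
validate_message({"role": "user", "content": "describe the Address table"})

# Each malformed shape raises ValueError
for bad in ({"content": "hi"}, {"role": "user"}, {"role": "user", "content": None}):
    try:
        validate_message(bad)
    except ValueError as exc:
        print("rejected:", exc)
```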
+
+ ### Unknown Table Handling
+
+ Schema lookup and sample query generation gracefully handle non-existent tables.
+
+ ```
+ [PASS] Unknown table returns 'not found' message
+ [PASS] Unknown table sample returns error
+ ```
+
+ ### MockTokenizer Encode/Decode Roundtrip
+
+ The mock tokenizer's `ord(c) % 256` encoding correctly roundtrips through `chr(t)` decoding.
+
+ ```
+ [PASS] MockTokenizer encode/decode roundtrip
+ ```
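The roundtrip property is easy to check in isolation; note that `ord(c) % 256` is only lossless for characters with code points below 256, which holds for the ASCII strings used in the demo:

```python
def encode(text: str) -> list[int]:
    # MockTokenizer-style encoding: one token per character
    return [ord(c) % 256 for c in text]

def decode(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)

text = "SELECT * FROM Students;"
print(decode(encode(text)) == text)  # True: lossless for ASCII input
```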
+
+ ### Invalid Tokenizer Validation
+
+ The environment constructor rejects tokenizers missing `apply_chat_template`.
+
+ ```
+ [PASS] Invalid tokenizer raises ValueError
+ ```
+
+ ---
+
+ ## Live Demo — Ollama Path (Optional)
+
+ When Ollama is running locally with a compatible model, the query action generates real SQL and the describe action selects the correct table.
+
+ ### Describe with Ollama
+
+ With `OLLAMA_MODEL=llama3.2`, the LLM correctly identifies "Student" as the most relevant table for "describe the students table":
+
+ ```
+ DESCRIBE RESULT:
+ Table 'Student' has the following columns:
+
+ - student_id: integer number
+ - student_details: text (up to 255 characters)
+ ```
+
+ Compare with the mock path, which fell back to "Address" (the first table in the dict). **With Ollama, table selection is intelligent.**
+
+ ### Query with Ollama
+
+ The LLM generates valid SQL for natural language questions:
+
+ ```
+ QUERY RESULT:
+ SELECT * FROM Students WHERE CourseID IN (SELECT CourseID FROM Courses WHERE CourseName = 'CS101')
+ ```
+
+ > **Note:** The generated SQL references column names from the schema description prompt, not the actual SQLAlchemy model column names. This is expected — the LLM generates SQL based on the natural language schema it receives.
+
+ ---
+
+ ## Full Result Summary
+
+ ```
+ ============================================================
+ SUMMARY
+ ============================================================
+ Total checks: 71
+ Passed: 71
+ Failed: 0
+
+ ALL CHECKS PASSED
+ ```
+
+ | Category | Checks | Passed | Failed |
+ |----------|--------|--------|--------|
+ | Imports | 1 | 1 | 0 |
+ | Instantiation | 8 | 8 | 0 |
+ | reset() | 8 | 8 | 0 |
+ | Action detection | 9 | 9 | 0 |
+ | message_to_action | 9 | 9 | 0 |
+ | step(describe) | 6 | 6 | 0 |
+ | step(sample) | 3 | 3 | 0 |
+ | step(query) | 2 | 2 | 0 |
+ | Multi-turn state | 6 | 6 | 0 |
+ | Client serialization | 9 | 9 | 0 |
+ | Edge cases | 9 | 9 | 0 |
+ | **Total** | **71** | **71** | **0** |
+
+ ---
+
+ ## Verification Checklist
+
+ - [x] Environment instantiation with MockTokenizer — 8 checks
+ - [x] `reset()` returns valid SQLObservation with system prompt — 8 checks
+ - [x] Action type detection for all 3 types (describe/sample/query) — 9 keywords tested
+ - [x] `message_to_action()` creates SQLAction with correct type and tokens — 9 checks
+ - [x] `step()` with describe returns schema from SQLAlchemy models — 6 checks
+ - [x] `step()` with sample returns SQL query text — 3 checks
+ - [x] `step()` with query returns Ollama error gracefully (mock path) — 2 checks
+ - [x] Multi-turn conversation state grows correctly — 6 checks
+ - [x] Client tensor↔list serialization roundtrip — 9 checks
+ - [x] Edge cases: empty prompt, invalid inputs, unknown tables, tokenizer validation — 9 checks
+
+ ---
+
+ ## Known Issues Found
+
+ 1. **`sqlalchemy` missing from `pyproject.toml`** — The ORM models import `sqlalchemy` but it's not listed as a dependency. Must `uv add sqlalchemy` manually.
+
+ 2. **Pydantic/TypedDict incompatibility on Python < 3.12** — The `openenv` library defines `Message` with `typing.TypedDict`, but Pydantic 2.x requires `typing_extensions.TypedDict`. The demo script monkey-patches this, but the issue would affect any direct usage.
+
+ 3. **Ollama default model (`qwen2`) unlikely to be installed** — The default `OLLAMA_MODEL` is `qwen2`, which users probably don't have. The 404 error is graceful but confusing. Consider defaulting to `llama3.2` or documenting the required model.
+
+ 4. **describe/sample fallback to first table** — When Ollama is unavailable, `_call_ollama_to_select_table()` silently falls back to the first table in the dict (`Address`). This is an intentional fallback, but it may confuse users expecting the table from their query.
+
+ ---
+
+ ## File Reference
+
+ | File | What it does |
+ |------|-------------|
+ | `envs/sql_env/demo_action_feature.py` | Executable demo script (71 checks) |
+ | `envs/sql_env/server/sql_environment.py` | Core `SQLEnvironment` with reset/step/dispatch |
+ | `envs/sql_env/models.py` | `SQLAction`, `SQLObservation`, `SQLState` Pydantic models |
+ | `envs/sql_env/client.py` | `SQLEnvClient` with tensor↔list serialization |
+ | `envs/sql_env/server/test_sql_env.py` | `MockTokenizer` (char ord encoding) |
+ | `envs/sql_env/data/databases/models.py` | 9 SQLAlchemy ORM models |
+
+ ---
+
+ ## How to Reproduce
+
+ ```bash
+ git clone <repo-url>
+ cd sql-env-onboarding
+ git checkout origin/action-feature --detach
+ cd envs/sql_env/
+ uv sync && uv add sqlalchemy
+ uv run python demo_action_feature.py  # Mock path: 71/71 checks
+
+ # Optional: Ollama path
+ export OLLAMA_MODEL=llama3.2          # or any installed model
+ uv run python demo_action_feature.py  # Query actions now return real SQL
+ ```
+
+ ---
+
+ *Demo generated 2026-02-28. Re-run `uv run python demo_action_feature.py` on the action-feature branch to refresh.*
Dockerfile.test ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.12-slim
+
+ WORKDIR /app
+
+ # Install git for pip installs from GitHub
+ RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
+
+ # Copy project
+ COPY . .
+
+ # Install sql-env without deps
+ RUN pip install --no-deps .
+
+ # Install training deps (same as Colab setup cell)
+ RUN pip install \
+     "trl>=0.29.0" \
+     "accelerate>=0.34.0" \
+     "openenv-core[core]>=0.2.1" \
+     "pydantic>=2.0.0" \
+     "jmespath" \
+     "datasets>=3.0.0" \
+     "huggingface_hub>=0.30.0" \
+     "git+https://github.com/huggingface/transformers.git@main"
+
+ # Download Spider databases
+ RUN python scripts/download_spider_databases.py
+
+ CMD ["python", "scripts/test_training_local.py"]
ONBOARDING_action_feature.md ADDED
@@ -0,0 +1,341 @@
+ # Onboarding: `action-feature` Branch
+
+ > What the `action-feature` branch adds compared to `main`.
+ > Last updated: 2026-02-28
+ > Focus: Branch delta — new components, model changes, data flow, and gaps.
+
+ ## What This Branch Does
+
+ The `action-feature` branch transforms SQLEnv from a **scaffold with well-designed Pydantic models** into a **partially working environment** with real action dispatch (describe/sample/query), Ollama-based SQL generation, a WebSocket client, SQLAlchemy ORM models for the `student_assessment` database, and Spider question data. It implements the core `message → action → step → observation` loop that the RL training pipeline will eventually drive.
+
+ ---
+
+ ## Branch Overview
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────────┐
+ │ action-feature: New/Changed Components │
+ ├─────────────────────────────────────────────────────────────────────┤
+ │ │
+ │ Training Code / Notebook │
+ │ ┌──────────────────────┐ │
+ │ │ test_env.ipynb NEW │ Interactive walkthrough (5 test cells) │
+ │ └──────────┬───────────┘ │
+ │ │ imports │
+ │ ┌──────────▼───────────┐ ┌──────────────────────────┐ │
+ │ │ client.py NEW │────▶│ models.py CHANGED │ │
+ │ │ SQLEnvClient │ │ SQLAction (+ tokens, │ │
+ │ │ _step_payload() │ │ action_desc) │ │
+ │ │ _parse_result() │ │ SQLObservation │ │
+ │ │ _parse_state() │ │ (messages + tokens) │ │
+ │ │ message_to_action() │ │ SQLState │ │
+ │ └──────────────────────┘ │ (history + tokens) │ │
+ │ │ WebSocket └──────────────────────────┘ │
+ │ ┌──────────▼───────────┐ │
+ │ │ server/app.py CHG │ FastAPI bootstrap + tokenizer factory │
+ │ │ create_sql_env() │ │
+ │ └──────────┬───────────┘ │
+ │ │ creates │
+ │ ┌──────────▼───────────────────────────────────────────────┐ │
+ │ │ server/sql_environment.py NEW │ │
+ │ │ SQLEnvironment(Environment) │ │
+ │ │ ├── reset() → clear state, return obs │ │
+ │ │ ├── step(action) → dispatch on action_type │ │
+ │ │ │ ├── "describe" → Ollama selects table → ORM info │ │
+ │ │ │ ├── "sample" → Ollama selects table → SQL gen │ │
+ │ │ │ └── "query" → Ollama generates SQL from NL │ │
+ │ │ ├── message_to_action() → detect type, tokenize │ │
+ │ │ └── _detect_action_type() → keyword classifier │ │
+ │ └──────────┬───────────────────────┬───────────────────────┘ │
+ │ │ introspects │ HTTP calls │
+ │ ┌──────────▼───────────┐ ┌───────▼────────────────┐ │
+ │ │ data/databases/ │ │ Ollama (external) │ │
+ │ │ models.py NEW │ │ /api/generate │ │
+ │ │ 9 SQLAlchemy tables │ │ qwen2 (default) │ │
+ │ └──────────────────────┘ └────────────────────────┘ │
+ │ │
+ │ ┌──────────────────────┐ ┌────────────────────────────────┐ │
+ │ │ data/questions/ │ │ scripts/ NEW │ │
+ │ │ student_assessment │ │ download_spider_data.py │ │
+ │ │ .json NEW │ │ generate_models_from_schema.py│ │
+ │ │ (30+ Q&A pairs) │ └────────────────────────────────┘ │
+ │ └──────────────────────┘ │
+ │ │
+ │ ┌──────────────────────┐ ┌────────────────────────┐ │
+ │ │ server/test_sql_env │ │ server/install_deps.sh │ │
+ │ │ .py MockTokenizer │ │ Docker setup NEW │ │
+ │ │ NEW │ └────────────────────────┘ │
+ │ └──────────────────────┘ │
+ └─────────────────────────────────────────────────────────────────────┘
+ ```
+
+ ---
+
+ ## Files Changed/Added
+
+ | File | Status | Purpose |
+ |------|--------|---------|
+ | `envs/sql_env/models.py` | **Changed** | Rewired `SQLAction`, `SQLObservation`, `SQLState` for message+token paradigm |
+ | `envs/sql_env/__init__.py` | **Changed** | Exports `SQLAction`, `SQLObservation`, `SQLState`; lazy client import |
+ | `envs/sql_env/client.py` | **New** | `SQLEnvClient(EnvClient)` — WebSocket client with tensor serialization |
+ | `envs/sql_env/server/sql_environment.py` | **New** | `SQLEnvironment(Environment)` — core environment logic (463 lines) |
+ | `envs/sql_env/server/app.py` | **Changed** | FastAPI bootstrap with tokenizer factory + MockTokenizer fallback |
+ | `envs/sql_env/server/__init__.py` | **Changed** | Exports `SQLEnvironment` |
+ | `envs/sql_env/server/test_sql_env.py` | **New** | `MockTokenizer` for testing without `transformers` library |
+ | `envs/sql_env/server/install_deps.sh` | **New** | Docker setup script: pip install + pre-download GPT-2 tokenizer |
+ | `envs/sql_env/server/requirements.txt` | **New** | Server-side pip deps for Docker (fastapi, torch, transformers, etc.) |
+ | `envs/sql_env/data/databases/models.py` | **New** | SQLAlchemy ORM for `student_assessment` DB (9 model classes) |
+ | `envs/sql_env/data/questions/student_assessment.json` | **New** | 30+ Spider questions with gold SQL, tokenized queries |
+ | `envs/sql_env/scripts/download_spider_data.py` | **New** | Downloads Spider questions from HuggingFace by `db_id` |
+ | `envs/sql_env/scripts/generate_models_from_schema.py` | **New** | Auto-generates SQLAlchemy models from Spider schema dataset |
+ | `envs/sql_env/pyproject.toml` | **Changed** | Python constrained to `>=3.11,<3.13`; added `requests>=2.31.0` |
+ | `envs/sql_env/uv.lock` | **Changed** | Lock file updated for new dependencies |
+ | `README.md` | **Changed** | Added "Current Package State" section with pinned dependency rationale |
+ | `envs/sql_env/server/environment.py` | **Emptied** | Replaced by `sql_environment.py` |
+ | `test_env.ipynb` | **New** | Jupyter notebook with 5 interactive test scenarios |
+
+ **Total:** 18 files changed, +5702 / -412 lines.
+
+ ---
+
+ ## Key Components Introduced
+
+ ### 1. `SQLEnvironment` — `envs/sql_env/server/sql_environment.py`
+
+ The heart of the branch. Implements the OpenEnv `Environment` interface with three action types:
+
+ | Action Type | Dispatch Flow | Output |
+ |-------------|--------------|--------|
+ | `describe` | Ollama selects table → `_get_table_schema()` introspects SQLAlchemy model | Column names + natural language types |
+ | `sample` | Ollama selects table → `_generate_sample_query()` | `SELECT * FROM <table> LIMIT 10;` |
+ | `query` | `_call_ollama_for_sql()` sends NL + schema to Ollama | Generated SQL string |
+
+ Key methods:
+
+ - **`reset()`** — Clears conversation history, re-initializes system prompt message + tokens. Returns initial `SQLObservation`.
+ - **`step(action)`** — Dispatches on `action.action_type`. Appends assistant response to `history_messages`, stores action tokens in `history_tokens`. Returns flattened observation.
+ - **`message_to_action(message)`** — Server-side conversion of `Message` dict → `SQLAction`. Detects action type via keywords, appends message to state history, tokenizes full conversation.
+ - **`_detect_action_type(content)`** — Keyword classifier: checks for "describe"/"schema"/"columns" → `describe`, "sample"/"example"/"rows" → `sample`, default → `query`.
+ - **`_create_observation()`** — Builds `SQLObservation` from current state. Flattens all `history_tokens` into a single 1D tensor via `torch.cat`.
+ - **`_get_table_schema(table_name)`** — Introspects SQLAlchemy model columns, converts types to natural language.
+ - **`_call_ollama_for_sql(query)`** / **`_call_ollama_to_select_table(request)`** — HTTP POST to Ollama `/api/generate`.
+
+ **Constructor params:** `tokenizer` (must have `apply_chat_template`), optional `system_prompt`, optional `transform`.
+
+ **Environment variables:** `OLLAMA_MODEL` (default: `qwen2`), `OLLAMA_BASE_URL` (default: `http://localhost:11434`).
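The keyword classifier is simple enough to sketch in full (an illustrative reimplementation based on the description above; the branch's `_detect_action_type()` may differ in detail):

```python
def detect_action_type(content: str) -> str:
    """Classify a user message into describe / sample / query by keyword."""
    text = content.lower()
    if any(kw in text for kw in ("describe", "schema", "columns")):
        return "describe"
    if any(kw in text for kw in ("sample", "example", "rows")):
        return "sample"
    return "query"  # default: treat as a natural language SQL request

print(detect_action_type("Show me the Student schema"))  # describe
print(detect_action_type("give me sample rows"))         # sample
print(detect_action_type("find addresses in New York"))  # query
```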
+
+ ### 2. `SQLEnvClient` — `envs/sql_env/client.py`
+
+ WebSocket client extending OpenEnv's `EnvClient[SQLAction, SQLObservation, SQLState]`. Handles tensor↔list serialization for JSON transport:
+
+ - **`_step_payload(action)`** — Converts `action.tokens` (Tensor) to Python list for JSON.
+ - **`_parse_result(payload)`** — Deserializes response → `StepResult[SQLObservation]`, converting token lists back to tensors.
+ - **`_parse_state(payload)`** — Deserializes state → `SQLState` with tensor reconstruction.
+ - **`message_to_action(message, tokenizer, history_messages)`** — Client-side version of action creation (mirrors server logic). Requires passing a tokenizer explicitly.
+
+ ### 3. `server/app.py` — FastAPI Bootstrap
+
+ Changed from a stub to a working application:
+
+ - **`get_tokenizer()`** — Loads HuggingFace tokenizer from `TOKENIZER_NAME` env var (default: `mistralai/Mistral-7B-Instruct-v0.1`). Falls back to `MockTokenizer` from `test_sql_env.py` if `transformers` is not installed.
+ - **`create_sql_environment()`** — Factory function creating `SQLEnvironment` per WebSocket session.
+ - **`app = create_app(create_sql_environment, SQLAction, SQLObservation, env_name="sql_env")`** — Wires up WebSocket endpoints.
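The fallback pattern in `get_tokenizer()` can be sketched like this (a simplified, self-contained version; the mock class is defined inline here, whereas the branch imports it from `test_sql_env.py`):

```python
import os

class MockTokenizer:
    """Stand-in exposing the one method the environment requires."""
    def apply_chat_template(self, messages, **kwargs):
        text = "".join(m["content"] for m in messages)
        return [ord(c) % 256 for c in text]

def get_tokenizer():
    """Prefer a real HuggingFace tokenizer; fall back to the mock if unavailable."""
    name = os.environ.get("TOKENIZER_NAME", "mistralai/Mistral-7B-Instruct-v0.1")
    try:
        from transformers import AutoTokenizer  # may not be installed
        return AutoTokenizer.from_pretrained(name)
    except Exception:
        return MockTokenizer()

tok = get_tokenizer()
print(type(tok).__name__)
```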
+
+ ### 4. SQLAlchemy ORM — `envs/sql_env/data/databases/models.py`
+
+ 9 model classes for the `student_assessment` database:
+
+ | Model | Table | Key Columns |
+ |-------|-------|-------------|
+ | `Address` | Addresses | address_id, line_1, city, country |
+ | `Person` | People | person_id, first_name, last_name, email_address |
+ | `Student` | Students | student_id, student_details |
+ | `Course` | Courses | course_id (String PK), course_name |
+ | `PersonAddress` | People_Addresses | person_id (FK), address_id (FK), date_from/to |
+ | `StudentCourseRegistration` | Student_Course_Registrations | student_id (FK), course_id (FK), registration_date |
+ | `StudentCourseAttendance` | Student_Course_Attendance | student_id (FK), course_id (FK), date_of_attendance |
+ | `Candidate` | Candidates | candidate_id, candidate_details |
+ | `CandidateAssessment` | Candidate_Assessments | candidate_id (FK), qualification, assessment_date |
+
+ All models include proper foreign key relationships with `back_populates`.
+
+ ### 5. Spider Question Data — `envs/sql_env/data/questions/student_assessment.json`
+
+ A 3,355-line JSON file containing 30+ question-answer pairs from the Spider dataset. Each entry includes:
+
+ - `db_id` — always `student_assessment`
+ - `question` — natural language question (e.g., "which course has most number of registered students?")
+ - `query` — gold SQL (e.g., `SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations...`)
+ - `query_toks` / `query_toks_no_value` / `question_toks` — tokenized versions
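For orientation, an entry has roughly this shape (the question and query come from the example above, with the query truncated as in the source; the token list is illustrative, since the real file stores the dataset's own tokenization):

```python
entry = {
    "db_id": "student_assessment",
    "question": "which course has most number of registered students?",
    "query": "SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations ...",
    # Illustrative split only; see query_toks / question_toks in the file
    "question_toks": ["which", "course", "has", "most", "number",
                      "of", "registered", "students", "?"],
}

# The file is a JSON array of such entries; filtering by db_id is a one-liner
def for_db(entries, db_id):
    return [e for e in entries if e["db_id"] == db_id]

print(len(for_db([entry], "student_assessment")))  # 1
```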
+
+ ### 6. Data Preparation Scripts — `envs/sql_env/scripts/`
+
+ - **`download_spider_data.py`** — CLI tool to download Spider questions from HuggingFace. Supports `--db-id` filtering and `--split` selection.
+ - **`generate_models_from_schema.py`** — Auto-generates SQLAlchemy ORM models from the `richardr1126/spider-schema` HuggingFace dataset. Maps Spider types to SQLAlchemy types, handles foreign keys.
+
+ ### 7. `MockTokenizer` — `envs/sql_env/server/test_sql_env.py`
+
+ Deterministic tokenizer for testing without `transformers`:
+
+ - **`apply_chat_template()`** — Converts message text to token IDs via `ord(c) % 256`.
+ - **`decode()`** — Reverses the encoding back to characters.
+ - Imported by `app.py` as a fallback when `transformers` is not installed.
+
+ ---
+
+ ## Model Changes (Main → Action-Feature)
+
+ ### `SQLAction`
+
+ | Field | Main | Action-Feature | Notes |
+ |-------|------|----------------|-------|
+ | `action_type` | `"DESCRIBE, SAMPLE, QUERY, ANSWER"` | `"describe, sample, query"` | Lowercase, ANSWER removed |
+ | `argument` | Table name / SQL / answer value | **Removed** | — |
+ | `action_description` | — | **Added**: description string | Replaces `argument` |
+ | `tokens` | — | **Added**: `torch.Tensor` | Tokenized conversation |
+
+ ### `SQLObservation`
+
+ | Field | Main | Action-Feature | Notes |
+ |-------|------|----------------|-------|
+ | `question` | NL question string | **Commented out** | — |
+ | `schema_info` | DB schema description | **Commented out** | — |
+ | `result` | Last action result | **Commented out** | — |
+ | `error` | Error message | **Commented out** | — |
+ | `step_count` | Current step number | **Commented out** | — |
+ | `budget_remaining` | Steps left | **Commented out** | — |
+ | `action_history` | Summary of actions | **Commented out** | — |
+ | `messages` | — | **Added**: `list[Message]` | Full conversation history |
+ | `tokens` | — | **Added**: `torch.Tensor` | Flattened token tensor |
+
+ The original observation fields are **commented out, not deleted** — they're expected to return in a future phase.
+
+ ### `SQLState`
+
+ | Field | Main | Action-Feature | Notes |
+ |-------|------|----------------|-------|
+ | `game_name` | `"sql_env"` | **Commented out** | — |
+ | `history_messages` | — | **Added**: `list[Message]` | Full conversation history |
+ | `history_tokens` | — | **Added**: `list[torch.Tensor]` | Per-message token tensors |
+ | `current_action_type` | — | **Added**: `str` (default `"query"`) | Tracks current action |
+
+ **Design shift:** The branch moves from a **structured observation** (question + schema + result fields) to a **chat-based observation** (raw messages + tokens). This aligns with how LLM-based agents naturally consume conversational context.
+
+ ---
+
+ ## Data Flow
+
+ ```
+ User Message (dict: {role: "user", content: "Show me the Student schema"})
+
+
+ message_to_action(message)          [SQLEnvironment or SQLEnvClient]
+ ├── Detect action type via keywords
+ │     "schema" found → action_type = "describe"
+ ├── Append message to _state.history_messages   ← MUTATES STATE
+ ├── Tokenize FULL conversation via tokenizer.apply_chat_template()
+ └── Return SQLAction(action_type="describe",
+ │        action_description="Show me the Student schema",
+ │        tokens=<tensor>)
+
+
+ step(action)                        [SQLEnvironment]
+ ├── Dispatch on action.action_type:
+ │     "describe" → _call_ollama_to_select_table("Show me the Student schema")
+ │        → returns "Student"
+ │        → _get_table_schema("Student")
+ │        → introspects SQLAlchemy model columns
+ │        → "Table 'Student' has: student_id: integer, ..."
+ ├── Create assistant Message with schema info
+ ├── Append assistant message to _state.history_messages
+ ├── Append action.tokens to _state.history_tokens
+ └── _create_observation()
+       ├── Flatten all history_tokens via torch.cat → single 1D tensor
+       ├── Copy history_messages
+       ├── Apply transform (if configured)
+       └── Return SQLObservation(messages=[...], tokens=<flat tensor>)
+ ```
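The same flow, reduced to a self-contained toy (plain dicts and lists stand in for the repo's Pydantic models and tensors; the keyword rules follow the description above, and no Ollama call is made):

```python
class ToyEnv:
    """Toy reduction of the message -> action -> step loop."""
    def __init__(self):
        self.history_messages = []
        self.history_tokens = []

    def message_to_action(self, message):
        text = message["content"].lower()
        if any(k in text for k in ("describe", "schema", "columns")):
            action_type = "describe"
        elif any(k in text for k in ("sample", "example", "rows")):
            action_type = "sample"
        else:
            action_type = "query"
        self.history_messages.append(message)  # side effect, as on the branch
        tokens = [ord(c) % 256 for c in message["content"]]
        return {"action_type": action_type, "tokens": tokens}

    def step(self, action):
        reply = {"role": "assistant", "content": f"handled {action['action_type']}"}
        self.history_messages.append(reply)
        self.history_tokens.append(action["tokens"])
        # List flattening plays the role of torch.cat here
        flat = [t for chunk in self.history_tokens for t in chunk]
        return {"messages": list(self.history_messages), "tokens": flat}

env = ToyEnv()
action = env.message_to_action({"role": "user", "content": "Show me the Student schema"})
obs = env.step(action)
print(action["action_type"])  # describe
print(len(obs["messages"]))   # 2
```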
+
+ ---
+
+ ## External Dependencies Added
+
+ | Dependency | Version | Purpose | Integration Point |
+ |------------|---------|---------|-------------------|
+ | Ollama (local service) | — | LLM inference for SQL generation + table selection | `sql_environment.py:_call_ollama_for_sql()`, `_call_ollama_to_select_table()` |
+ | `requests` | >=2.31.0 | HTTP client for Ollama API | `sql_environment.py` |
+ | `torch` | ==2.2.2 | Tensor operations for tokenized representations | `models.py`, `client.py`, `sql_environment.py` |
+ | `transformers` | <5 | HuggingFace tokenizers (chat template support) | `app.py:get_tokenizer()` |
+ | `numpy` | <2 | Torch dependency constraint | `pyproject.toml` |
+ | `sqlalchemy` | (transitive) | ORM for database schema introspection | `data/databases/models.py` |
+ | `datasets` | (scripts only) | HuggingFace `load_dataset` for Spider data download | `scripts/download_spider_data.py`, `scripts/generate_models_from_schema.py` |
+
+ **Environment variables:**
+
+ | Variable | Default | Purpose |
+ |----------|---------|---------|
+ | `TOKENIZER_NAME` | `mistralai/Mistral-7B-Instruct-v0.1` | HuggingFace tokenizer model |
+ | `SYSTEM_PROMPT` | Built-in schema description | Custom system prompt override |
+ | `OLLAMA_MODEL` | `qwen2` | Ollama model for SQL generation |
+ | `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |
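For reference, a generate call against that endpoint is a single POST; the URL and defaults follow the table above, and `model`/`prompt`/`stream` are the standard Ollama `/api/generate` request fields (the branch's exact payload is assumed, not verified):

```python
import json
import os

model = os.environ.get("OLLAMA_MODEL", "qwen2")
base_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")

url = f"{base_url}/api/generate"
payload = {"model": model, "prompt": "Translate to SQL: list all students", "stream": False}

# With the server running: requests.post(url, json=payload).json()["response"]
print(url)
print(json.dumps(payload)[:60])
```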
+
+ ---
+
+ ## Known Gaps (Not Yet Implemented)
+
+ | Feature | Status | Notes |
+ |---------|--------|-------|
+ | `ANSWER` action type | Not implemented | Designed in main-branch models but removed from action-feature |
+ | Real database execution | Not implemented | `step()` generates SQL text via Ollama but never executes it against SQLite |
+ | Reward computation | Not implemented | `reward.py` is empty; 3-layer design exists in README only |
+ | Answer verification | Not implemented | `verifier.py` is empty |
+ | Budget tracking | Not implemented | No step limit enforcement |
+ | Episode question selection | Not implemented | Environment uses hardcoded schema; `student_assessment.json` is present but not loaded by the environment |
+ | Dockerfile | Not implemented | File is empty; `install_deps.sh` is ready |
+ | `openenv.yaml` manifest | Not implemented | Empty file |
+ | Formal test suite | Not implemented | No `tests/` directory; only `MockTokenizer` and notebook tests |
+
+ ---
+
+ ## Gotchas
+
+ - **`message_to_action()` mutates state:** On the server side, `message_to_action()` appends the message to `_state.history_messages` *before* tokenizing. This means calling it has a side effect — it's not a pure function. If you call it twice with the same message, you'll get duplicate entries in history.
+
+ - **Client vs server `message_to_action` diverge:** The server version (`sql_environment.py:message_to_action`) manages state internally and mutates `_state`. The client version (`client.py:message_to_action`) requires passing `history_messages` explicitly and does not manage state. They have different signatures.
+
+ - **Schema description is hardcoded in `sql_environment.py`:** The `_build_schema_description()` function returns a fixed string with table/column names that don't perfectly match the SQLAlchemy ORM models. For example, the schema description says `Students (student_id, person_id, student_acc_status)` but the ORM model has `Students (student_id, student_details)`.
+
+ - **Ollama failure mode is silent:** If Ollama is unreachable, `_call_ollama_to_select_table()` catches all exceptions and returns the *first table in the dict* (`Address`). No error is surfaced to the caller. `_call_ollama_for_sql()` returns an error string, but it's treated as a normal assistant message.
+
+ - **Original observation fields are commented out, not deleted:** `SQLObservation` still has `question`, `schema_info`, `result`, `error`, `step_count`, `budget_remaining`, and `action_history` as comments. They're intended to return in a later phase.
+
+ - **`MockTokenizer` is imported by production code:** `app.py` imports `MockTokenizer` from `test_sql_env.py` at runtime when `transformers` is missing. This couples test utilities to production bootstrap.
+
+ - **`test_env.ipynb` lives at project root:** Not inside `tests/` or `envs/`. Easy to miss when exploring the codebase.
+
+ - **Pydantic + torch.Tensor:** `SQLAction`, `SQLObservation`, and `SQLState` use `torch.Tensor` fields with Pydantic. This requires `arbitrary_types_allowed = True` in the Pydantic model config (inherited from OpenEnv base classes). Standard Pydantic serialization (`.model_dump()`) won't work out of the box with tensors.
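The first gotcha above can be demonstrated with a toy stand-in (illustrative; it mirrors the described side effect, not the actual class):

```python
class ToyState:
    def __init__(self):
        self.history_messages = []

def message_to_action(state, message):
    # Appends BEFORE tokenizing, so every call grows the history
    state.history_messages.append(message)
    return {"action_type": "query", "tokens": [ord(c) % 256 for c in message["content"]]}

state = ToyState()
msg = {"role": "user", "content": "find all addresses"}
message_to_action(state, msg)
message_to_action(state, msg)       # same message, called twice
print(len(state.history_messages))  # 2: duplicate entry in history
```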
+
+ ---
+
+ ## Entry Points for Reading
+
+ | What You Want to Understand | Start Here | Then Read |
+ |----------------------------|------------|-----------|
+ | How actions are processed | `envs/sql_env/server/sql_environment.py:step()` | `_detect_action_type()`, `_call_ollama_for_sql()` |
+ | How messages become actions | `envs/sql_env/server/sql_environment.py:message_to_action()` | `envs/sql_env/client.py:message_to_action()` |
+ | Data contracts | `envs/sql_env/models.py` | Compare with `git show main:envs/sql_env/models.py` |
+ | Server bootstrap | `envs/sql_env/server/app.py` | `get_tokenizer()`, `create_sql_environment()` |
+ | Database schema | `envs/sql_env/data/databases/models.py` | `envs/sql_env/data/questions/student_assessment.json` |
+ | Client-side usage | `envs/sql_env/client.py` | `test_env.ipynb` |
+ | Data preparation | `envs/sql_env/scripts/download_spider_data.py` | `scripts/generate_models_from_schema.py` |
+
+ ---
+
+ *This document covers only the `action-feature` branch delta. For the overall project design (POMDP architecture, reward layers, episode lifecycle), see [README.md](README.md).*
+
+ > **Note:** The issues below may already have been fixed. Verify against the latest remote `action-feature` branch before acting on them.
+
+ Known issues discovered:
+
+ 1. `sqlalchemy` is missing from `pyproject.toml` on the branch.
+ 2. Pydantic/TypedDict incompatibility on Python < 3.12 (the demo auto-patches it).
+ 3. The hardcoded schema description in `sql_environment.py` doesn't match the ORM models.
+ 4. Silent Ollama fallback to the first table on connection failure.
README.md CHANGED
@@ -84,13 +84,42 @@ Episode flow:
 ## Train an Agent

- Use the GRPO training pipeline artifacts from F006 and run the notebook workflow:

- - Notebook: `notebooks/train_grpo.ipynb`
- - Training support modules: `training/`
- - Evaluation utilities: `evaluation/`

- This setup is designed for Colab and local CPU/GPU environments.

 ## HuggingFace Space
 
84
 
85
  ## Train an Agent
86
 
87
+ The environment exposes four tools (`describe`, `sample`, `query`, `answer`) that TRL's GRPOTrainer discovers automatically. The model learns to call these tools through GRPO — no custom rollout code needed.
88
 
89
+ ### Local test (Docker, CPU)
 
 
90
 
91
+ Verify the training pipeline end-to-end in about 3 minutes:
92
+
93
+ ```bash
94
+ docker build -f Dockerfile.test -t sqlenv-test .
95
+ docker run --rm sqlenv-test
96
+ ```
97
+
98
+ This runs 2 training steps with `configs/test_cpu.json` and prints per-step loss, reward, tool call frequency, and model completions.
99
+
100
+ ### Colab training (GPU)
101
+
102
+ Open the notebook and select a GPU runtime (L4 recommended):
103
+
104
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hjerpe/sql-env/blob/main/notebooks/train_grpo.ipynb)
105
+
106
+ The notebook uses `configs/colab_l4.json` settings: batch size 8, 8 generations per prompt, bf16 precision. Live reward plots and execution traces update during training.
107
+
108
+ ### What the model sees
109
+
110
+ Each episode, TRL injects tool schemas into the prompt. The model generates structured tool calls:
111
+
112
+ ```
113
+ <tool_call>{"name": "describe", "arguments": {"table_name": "employee"}}</tool_call>
114
+ ```
115
+
116
+ TRL parses this, calls `env.describe(table_name="employee")`, and appends the result. The model can then call more tools or submit an answer. Rewards accumulate from each interaction.
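A minimal sketch of that parsing step, using only the standard library (illustrative only; TRL's actual parser is more robust):

```python
import json
import re

def parse_tool_call(completion: str):
    """Extract the first <tool_call> JSON payload from a completion string."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL)
    if match is None:
        return None  # no structured call; a policy might treat this as a parse error
    payload = json.loads(match.group(1))
    return payload["name"], payload.get("arguments", {})

name, args = parse_tool_call(
    '<tool_call>{"name": "describe", "arguments": {"table_name": "employee"}}</tool_call>'
)
```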
117
+
118
+ ### Configuration
119
+
120
+ Training configs live in `configs/`:
121
+ - `test_cpu.json` — 2 steps, 256 tokens, budget 3 (local validation)
122
+ - `colab_l4.json` — full epoch, 512 tokens, budget 10, bf16 (L4 GPU)
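Since both files are plain JSON, a loader can split environment settings from trainer kwargs. The key split below is an assumption for illustration, not the repo's actual `training/` API:

```python
import json
from pathlib import Path

# Hypothetical split: which keys belong to the environment vs. the trainer
ENV_KEYS = {"questions_path", "db_dir", "step_budget", "dataset_size"}

def load_training_config(path: str) -> dict:
    """Read a configs/*.json file and split env settings from trainer kwargs."""
    cfg = json.loads(Path(path).read_text())
    return {
        "env": {k: v for k, v in cfg.items() if k in ENV_KEYS},
        "trainer": {k: v for k, v in cfg.items() if k not in ENV_KEYS},
    }
```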
123
 
124
  ## HuggingFace Space
125
 
REVIEW_REPORT.md CHANGED
@@ -1,57 +1,44 @@
1
- # Code Review Report: F006 Step 3.1 (`notebooks/train_grpo.ipynb`, `pyproject.toml`, `tests/e2e/test_training_e2e.py`)
2
 
3
  **Risk Tier:** Medium
4
- **Status:** Failed
5
- **Verdict:** BLOCK
6
 
7
  ## Summary
8
 
9
- Step 3.1 is not ready to merge. The training extra currently resolves to a TRL version incompatible with the repo’s pinned Torch version, causing notebook imports to fail before training can start. In addition, the added E2E test only validates notebook structure and does not exercise the required one-step training smoke flow from the verification spec.
10
 
11
  ## Evidence
12
 
13
  ### Tests
14
- - **Status:** Passed (limited scope)
15
- - **Command:** `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v`
16
- - **Results:** `2 passed, 0 failed`
17
-
18
- ### Dependency/Runtime Validation
19
- - **Status:** Failed
20
- - **Command:** `uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"`
21
- - **Observed:** Import error (`cannot import name 'FSDPModule'`) in TRL with current Torch pin.
 
22
 
23
  ### Security (Medium)
24
  - **Status:** Clear
25
- - **Checks:** Medium-tier quick checks only (no secrets/auth/unsafe execution patterns introduced in scoped changes).
26
 
27
  ## Issues
28
 
29
  ### Critical
30
- 1. **Training extra resolves to incompatible TRL, breaking notebook startup**
31
- - **Location:** `pyproject.toml:30-33`, `notebooks/train_grpo.ipynb:29-35`
32
- - **Problem:** `training = ["trl>=0.12.0", "accelerate>=0.34.0"]` permits latest TRL (installed as 0.29.1), which fails to import with pinned `torch==2.2.2`.
33
- - **Impact:** Notebook cannot run end-to-end (“one click” success criterion fails before training).
34
- - **Fix:** Pin a TRL range compatible with Torch 2.2.2 (or upgrade Torch accordingly), then add/import-check coverage in tests.
35
 
36
  ### Important
37
- 1. **E2E smoke test does not validate actual Step 3.1 execution path**
38
- - **Location:** `tests/e2e/test_training_e2e.py:25-65`
39
- - **Problem:** Test checks notebook text structure and helper filtering only; it does not instantiate trainer, run `trainer.train()`, or verify metrics/comparison outputs as specified.
40
- - **Impact:** Regressions in training flow can pass CI undetected.
41
- - **Fix:** Add a true smoke execution test (tiny/mocked model + single train step + metric assertion), aligned to `specs/F006-VERIFICATION_SPEC.md` Section 4.
42
-
43
- 2. **Comparison cell is not random-vs-trained and does not capture pre-training baseline**
44
- - **Location:** `notebooks/train_grpo.ipynb:181-183`
45
- - **Problem:** Both `before_rollouts` and `after_rollouts` use `rollout_func` with the same model after training.
46
- - **Impact:** Fails the feature’s “before vs after” demo intent (and spec’s random-vs-trained comparison).
47
- - **Fix:** Capture baseline episodes before training (or explicit random policy), then run trained-policy episodes after `trainer.train()`.
48
 
49
  ### Minor
50
- None.
 
 
51
 
52
  ## Next Actions
53
 
54
- 1. Fix dependency compatibility (TRL/Torch) and prove imports succeed in clean env.
55
- 2. Upgrade E2E smoke test to execute one real/mocked GRPO training step and assert logged metrics.
56
- 3. Correct notebook comparison to true baseline-vs-trained behavior.
57
- 4. Re-run: `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v` and include import-check evidence.
 
1
+ # Code Review Report: F011 Step 1.3 (`notebooks/compare_methods.ipynb`)
2
 
3
  **Risk Tier:** Medium
4
+ **Status:** Passed with Warnings
5
+ **Verdict:** APPROVE
6
 
7
  ## Summary
8
 
9
+ `LLMToolCallingPolicy` is implemented per Step 1.3 intent: it builds episode messages, uses chat-template tool calling, forces ANSWER at low budget, and falls back to `parse_error` on unparseable output. No correctness or security blockers were found in the scoped notebook change.
10
 
11
  ## Evidence
12
 
13
  ### Tests
14
+ - **Status:** Mixed (targeted checks passed; existing unrelated smoke failures persist)
15
+ - **Commands:**
16
+ - `uv run python - <<'PY' ... compile notebook cells ... PY`
17
+ - `uv run python - <<'PY' ... runtime checks for valid action / budget fallback / parse fallback ... PY`
18
+ - `uv run pytest tests/test_smoke.py -v`
19
+ - **Results:**
20
+ - Notebook code-cell compilation: passed (`Compiled 6 code cells successfully`)
21
+ - Policy runtime checks: passed (`QUERY` valid path, `ANSWER budget_exhausted`, `ANSWER parse_error`)
22
+ - Smoke tests: `21 passed, 4 failed` (pre-existing reward expectation mismatches in environment tests)
23
 
24
  ### Security (Medium)
25
  - **Status:** Clear
26
+ - **Checks:** Medium-tier quick checks on parsing/generation fallback paths; no secret handling, auth, or privilege-sensitive paths added.
27
 
28
  ## Issues
29
 
30
  ### Critical
31
+ None.
 
 
 
 
32
 
33
  ### Important
34
+ None.
 
 
 
 
 
 
 
 
 
 
35
 
36
  ### Minor
37
+ 1. **Episode reset heuristic is question-text based and can theoretically leak history if two consecutive episodes start with identical question text.**
38
+ - **Location:** `notebooks/compare_methods.ipynb:313-316`
39
+ - **Recommendation:** Consider adding a stronger episode boundary signal (e.g., explicit wrapper reset hook or observation-based reset trigger).
40
 
41
  ## Next Actions
42
 
43
+ 1. Proceed to Step 1.4.
44
+ 2. Optionally harden reset boundary logic before large-scale eval runs.
 
 
client.py CHANGED
@@ -1,6 +1,5 @@
1
  from typing import Any, Dict, Iterable
2
 
3
- import torch
4
  from openenv.core.client_types import StepResult
5
 
6
  from openenv.core.env_server.interfaces import Message
@@ -50,24 +49,11 @@ class SQLEnvClient(EnvClient[SQLAction, SQLObservation, SQLState]):
50
  )
51
 
52
  def _parse_state(self, payload: Dict[str, Any]) -> SQLState:
53
- # Parse history messages
54
- history_messages = payload.get("history_messages", [])
55
-
56
- # Parse history tokens - convert lists back to tensors
57
- history_tokens_data = payload.get("history_tokens", [])
58
- history_tokens = []
59
- for token_list in history_tokens_data:
60
- if token_list:
61
- history_tokens.append(torch.tensor(token_list))
62
- else:
63
- history_tokens.append(torch.tensor([]))
64
-
65
  return SQLState(
66
  episode_id=payload.get("episode_id"),
67
  step_count=payload.get("step_count", 0),
68
- history_messages=history_messages,
69
- history_tokens=history_tokens,
70
- current_action_type=payload.get("current_action_type", "query"),
71
  )
72
 
73
  def _detect_action_type(self, message_content: str) -> str:
 
1
  from typing import Any, Dict, Iterable
2
 
 
3
  from openenv.core.client_types import StepResult
4
 
5
  from openenv.core.env_server.interfaces import Message
 
49
  )
50
 
51
  def _parse_state(self, payload: Dict[str, Any]) -> SQLState:
 
 
 
 
 
 
 
 
 
 
 
 
52
  return SQLState(
53
  episode_id=payload.get("episode_id"),
54
  step_count=payload.get("step_count", 0),
55
+ history_messages=payload.get("history_messages", []),
56
+ current_action_type=payload.get("current_action_type", "QUERY"),
 
57
  )
58
 
59
  def _detect_action_type(self, message_content: str) -> str:
configs/colab_l4.json ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_name": "Qwen/Qwen3-0.6B",
3
+ "questions_path": "data/questions/questions_train.json",
4
+ "db_dir": "data/databases",
5
+ "output_dir": "outputs/grpo_run",
6
+ "num_train_epochs": 1,
7
+ "per_device_train_batch_size": 8,
8
+ "num_generations": 8,
9
+ "max_completion_length": 512,
10
+ "step_budget": 10,
11
+ "logging_steps": 10,
12
+ "precision": "bf16",
13
+ "enable_thinking": false,
14
+ "num_completions_to_print": 1
15
+ }
configs/test_cpu.json ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_name": "Qwen/Qwen3-0.6B",
3
+ "questions_path": "data/questions/questions_train.json",
4
+ "db_dir": "data/databases",
5
+ "output_dir": "/tmp/sqlenv_test",
6
+ "num_train_epochs": 1,
7
+ "max_steps": 2,
8
+ "per_device_train_batch_size": 2,
9
+ "num_generations": 2,
10
+ "max_completion_length": 256,
11
+ "step_budget": 3,
12
+ "logging_steps": 1,
13
+ "precision": "fp32",
14
+ "dataset_size": 4,
15
+ "enable_thinking": false,
16
+ "num_completions_to_print": 2
17
+ }
data/databases/car_1/car_1.sqlite ADDED
Binary file (65.5 kB). View file
 
data/databases/concert_singer/concert_singer.sqlite ADDED
Binary file (36.9 kB). View file
 
data/databases/cre_Doc_Template_Mgt/cre_Doc_Template_Mgt.sqlite ADDED
Binary file (24.6 kB). View file
 
data/databases/dog_kennels/dog_kennels.sqlite ADDED
Binary file (49.2 kB). View file
 
data/databases/employee_hire_evaluation/employee_hire_evaluation.sqlite ADDED
Binary file (36.9 kB). View file
 
data/databases/flight_2/flight_2.sqlite ADDED
Binary file (77.8 kB). View file
 
data/databases/pets_1/pets_1.sqlite ADDED
Binary file (16.4 kB). View file
 
data/databases/poker_player/poker_player.sqlite ADDED
Binary file (20.5 kB). View file
 
data/databases/student_assessment/student_assessment.sqlite ADDED
Binary file (57.3 kB). View file
 
data/databases/world_1/world_1.sqlite ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:17b986695f16786d58d66f85e49dba87bdfe72953207ab9b1b49da9d2301ef65
3
+ size 319488
data/sft/sft_trajectories.json ADDED
The diff for this file is too large to render. See raw diff
 
docs/.DS_Store ADDED
Binary file (6.15 kB). View file
 
docs/ARCHITECTURE.md CHANGED
@@ -1,13 +1,13 @@
1
  # Architecture
2
 
3
- > Last updated: 2026-02-28
4
 
5
  System map for SQLEnv — an RL environment where agents learn interactive SQL exploration via the OpenEnv framework.
6
 
7
  **Goals:**
8
  - Show how components connect (system map + key flows)
9
  - Make hidden state explicit (what lives where)
10
- - Define shared interfaces (Pydantic models, WebSocket API)
11
  - Keep invariants legible (what must stay true)
12
 
13
  **Non-goals:**
@@ -22,48 +22,53 @@ System map for SQLEnv — an RL environment where agents learn interactive SQL e
22
  SQLEnv System
23
  ================================================================
24
 
25
- RL Training Loop SQLEnv Server (Docker)
26
- ---------------- ----------------------
27
- +---------------------+
28
- +------------+ WebSocket (JSON) | server/app.py |
29
- | SQLEnv |<=========================>| FastAPI + WS |
30
- | Client | SQLAction -> server | |
31
- | (client.py)| SQLObs <- server +----------+----------+
32
- +-----+------+ |
33
- | v
34
- | tensor <-> list +---------------------+
35
- | serialization | SQLEnvironment |
36
- | | (sql_environment.py)|
37
- +-----v------+ | |
38
- | RL Agent | | - reset() / step() |
39
- | (external) | | - action detection |
40
- | e.g. GRPO | | - message_to_action |
41
- +------------+ +--+-------+-------+--+
42
- | | |
43
- v v v
44
- +------+ +------+ +--------+
45
- |Schema| |Sample| | Query |
46
- |Intro-| |Gen | | (Ollama|
47
- |spect.| | | | LLM) |
48
- +--+---+ +--+---+ +---+----+
49
- | | |
50
- v v v
51
- +-------------------------+
52
- | SQLAlchemy ORM Models |
53
- | (data/databases/ |
54
- | models.py) |
55
- | 9 tables: |
56
- | Address, Person, |
57
- | Student, Course, ... |
58
- +-------------------------+
59
-
60
- Data (committed) External (optional)
61
- ---------------- -------------------
62
- data/questions/ +----------+
63
- student_assessment.json | Ollama |
64
- (53 Spider Q&A pairs) | LLM API |
65
- | :11434 |
66
- +----------+
 
 
 
 
 
67
  ```
68
 
69
  ---
@@ -72,19 +77,27 @@ System map for SQLEnv — an RL environment where agents learn interactive SQL e
72
 
73
  | Component | Owns | Entrypoint | State / Output |
74
  |-----------|------|------------|----------------|
75
- | **SQLEnvClient** | WebSocket transport, tensor serialization | `client.py` | Stateless (wraps server) |
76
- | **FastAPI app** | HTTP/WS endpoints, tokenizer factory | `server/app.py` | In-memory tokenizer |
77
- | **SQLEnvironment** | Episode lifecycle, action dispatch, state | `server/sql_environment.py` | `SQLState` (in-memory) |
78
  | **Pydantic models** | Type contracts (action, observation, state) | `models.py` | N/A (data classes) |
79
- | **ORM models** | Database schema definition | `data/databases/models.py` | SQLAlchemy metadata |
80
- | **Spider data** | Question-answer pairs | `data/questions/student_assessment.json` | 53 Q&A entries |
81
- | **MockTokenizer** | Dev/test tokenization (no GPU needed) | `server/test_sql_env.py` | Deterministic (ord/chr) |
82
-
83
- ### External Services
84
-
85
- | Service | Purpose | Required | Fallback |
86
- |---------|---------|----------|----------|
87
- | Ollama (`localhost:11434`) | Table selection + SQL generation | No | First table in dict; query returns error string |
 
 
 
 
 
 
 
 
88
 
89
  ---
90
 
@@ -93,201 +106,306 @@ System map for SQLEnv — an RL environment where agents learn interactive SQL e
93
  ### Flow: Episode (Reset + Multi-Turn Steps)
94
 
95
  ```text
96
- Client Server (SQLEnvironment) Ollama
97
- | | |
98
- |--- reset() ----------------->| |
99
- | |-- init state, system prompt |
100
- | |-- tokenize system message |
101
- |<-- SQLObservation -----------| (MockTokenizer or HF) |
102
- | .messages=[system] | |
103
- | .tokens=shape([N]) | |
104
- | | |
105
- |--- message_to_action(msg) -->| |
106
- | |-- detect action type |
107
- | | (keyword matching) |
108
- | |-- append msg to history |
109
- | |-- tokenize full conversation |
110
- |<-- SQLAction ----------------| |
111
- | .action_type="describe" | |
112
- | .tokens=shape([1,M]) | |
113
- | | |
114
- |--- step(action) ------------>| |
115
- | |-- select table -------------->|
116
- | |<-- table name (or fallback) --|
117
- | |-- introspect ORM schema |
118
- | |-- append assistant msg |
119
- | |-- append action tokens |
120
- |<-- SQLObservation -----------| |
121
- | .messages=[sys,usr,asst] | |
122
- | .tokens=shape([N+M+K]) | |
123
- | | |
124
- (repeat step() for sample, query, answer...)
 
 
 
 
 
 
 
125
  ```
126
 
127
- ### Flow: Action Detection
128
 
129
  ```text
130
- User message string
131
- |
 
 
 
 
 
 
 
 
 
 
 
 
 
132
  v
133
- _detect_action_type(content)
134
- |
135
- +-- contains "describe"/"schema"/"columns"? --> "describe"
136
- |
137
- +-- contains "sample"/"example"/"rows"? --> "sample"
138
- |
139
- +-- default --> "query"
 
 
 
 
140
  ```
141
 
142
- ### Flow: Client Serialization (WebSocket Transport)
143
 
144
  ```text
145
- Client Server
146
- | |
147
- | _step_payload(action): |
148
- | tokens: Tensor -> list (JSON-safe) |
149
- | {action_type, action_description, |
150
- | tokens: [[1,2,3,...]], metadata} |
151
- | ---------------------------------------->|
152
- | |
153
- | _parse_result(data): |
154
- | tokens: list -> Tensor |
155
- | StepResult(obs, reward, done, info) |
156
- | <----------------------------------------|
 
 
 
 
 
157
  ```
158
 
159
  ---
160
 
161
  ## Shared Data Models
162
 
163
- These three Pydantic models are used across client, server, and tests.
164
- Defined in `models.py`.
165
 
166
- ### SQLAction
167
 
168
  ```python
169
  class SQLAction(Action):
170
- action_type: str # "describe" | "sample" | "query" | "answer"
171
- action_description: str # raw user message content
172
- tokens: torch.Tensor # tokenized conversation context, shape [1, seq_len]
173
  ```
174
 
175
- **Used by:** SQLEnvironment.step(), SQLEnvClient._step_payload(), tests
176
-
177
- ### SQLObservation
178
 
179
  ```python
180
  class SQLObservation(Observation):
181
- messages: list[Message] # full conversation history [{role, content}, ...]
182
- tokens: torch.Tensor # flattened 1D tensor of all turn tokens concatenated
 
 
 
 
 
 
183
  ```
184
 
185
- **Used by:** SQLEnvironment.reset()/step(), SQLEnvClient._parse_result(), tests
186
-
187
- ### SQLState
188
 
189
  ```python
190
  class SQLState(State):
191
- episode_id: str # UUID per episode
192
- step_count: int # turns taken
193
- history_messages: list[Message] # accumulates across turns
194
- history_tokens: list[torch.Tensor] # one tensor per turn, flattened on output
195
- current_action_type: str | None # last detected action type
196
  ```
197
 
198
- **Used by:** SQLEnvironment (internal), state endpoint
199
- **Note:** This is a lightweight summary for logging. The full RL state lives inside SQLEnvironment and is not exposed to the agent.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
200
 
201
  ---
202
 
203
  ## API Contracts
204
 
205
- ### WebSocket (OpenEnv Protocol)
206
 
207
- The server exposes a WebSocket endpoint via FastAPI. The OpenEnv framework handles the protocol — SQLEnv implements `reset()` and `step()` on the server side, and `SQLEnvClient` wraps the client side.
208
 
209
- | Operation | Client Method | Payload | Response |
210
- |-----------|---------------|---------|----------|
211
- | Reset | `client.reset()` | `{}` | `SQLObservation` (JSON) |
212
- | Step | `client.step(action)` | `{action_type, action_description, tokens: list, metadata}` | `StepResult(obs, reward, done, info)` |
213
- | State | `client.state()` | `{}` | `SQLState` (JSON) |
214
 
215
- ### Ollama (Optional)
216
 
217
- | Endpoint | Purpose | Payload |
218
- |----------|---------|---------|
219
- | `POST /api/generate` | Table selection | `{model, prompt, stream: false}` |
220
- | `POST /api/generate` | SQL generation | `{model, prompt, stream: false}` |
221
 
222
- Timeout: 30s. Failure mode: graceful fallback (never crashes).
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
223
 
224
  ---
225
 
226
  ## Cross-Cutting Concerns
227
 
228
- ### Code Style & Abstraction Philosophy
229
-
230
- OOP for framework integration (Environment, EnvClient subclasses), plain methods for logic. Extract helpers when they clarify intent, not for DRY.
231
 
232
- - **Structure:** Flat package root with `server/` for server-only code
233
- - **Error handling:** Graceful fallbacks (never crash), `ValueError` for invalid inputs
234
- - **Imports:** `try: from sql_env.X / except: from X` for dual install/Docker compatibility
 
 
 
235
 
236
- ### Tokenization
237
 
238
- Two paths, same interface (`apply_chat_template`):
 
 
 
 
239
 
240
- | Mode | Tokenizer | Source | When |
241
- |------|-----------|--------|------|
242
- | Dev/Test | `MockTokenizer` | `server/test_sql_env.py` | No GPU, no downloads |
243
- | Production | HuggingFace | `transformers` library | Real RL training |
244
 
245
- `MockTokenizer` encodes as `ord(c)` per character, decodes as `chr(t)`. Deterministic and fast.
 
 
 
 
 
 
246
 
247
  ### Configuration
248
 
249
- | Variable | Required | Description | Default |
250
- |----------|----------|-------------|---------|
251
- | `OLLAMA_MODEL` | No | Ollama model name for SQL generation | `qwen2` |
252
- | `OLLAMA_BASE_URL` | No | Ollama API endpoint | `http://localhost:11434` |
 
 
253
 
254
  ---
255
 
256
- ## Data, State, and Storage Locations
257
 
258
- - **Repo (committed):**
259
- - `data/questions/student_assessment.json` — 53 Spider Q&A pairs
260
- - `data/databases/models.py` — 9 SQLAlchemy ORM table definitions
261
- - **Runtime state (in-memory, per episode):**
262
- - `SQLState.history_messages` — conversation messages
263
- - `SQLState.history_tokens` — tensor per turn
264
- - **Not yet implemented:**
265
- - SQLite database files (Phase 3 — queries currently go through Ollama, not executed locally)
266
- - Reward/verification state
267
 
268
- ---
 
 
 
 
 
269
 
270
- ## Invariants and Guardrails
271
 
272
- - `self.db_models` refers to **database table** models (SQLAlchemy), never RL models
273
- - Token tensors grow monotonically across turns (never shrink or reset mid-episode)
274
- - `message_to_action()` mutates state it appends to history before tokenizing
275
- - Ollama failures never crash the environment — always graceful fallback
276
- - `tests/test_smoke.py` must pass without Ollama, without GPU, without network
277
- - Schema column names in `_build_schema_description()` must match `data/databases/models.py`
278
 
279
  ---
280
 
281
- ## Glossary
282
 
283
- | Term | Definition |
284
- |------|------------|
285
- | Episode | One question-answering session: reset -> N steps -> terminal |
286
- | Action type | One of: describe, sample, query, answer |
287
- | MockTokenizer | Deterministic char-code tokenizer for dev/test (no GPU) |
288
- | Spider | Academic text-to-SQL benchmark dataset |
289
- | ORM models | SQLAlchemy class definitions in `data/databases/models.py` |
290
- | OpenEnv | Meta's RL environment framework (Environment, EnvClient, Action, Observation) |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
291
 
292
  ---
293
 
@@ -295,61 +413,52 @@ Two paths, same interface (`apply_chat_template`):
295
 
296
  ### Development
297
 
298
- **Prerequisites:**
299
- - Python 3.11-3.12 (torch incompatible with 3.13)
300
- - `uv` package manager
301
- - Ollama (optional)
302
 
303
- **Setup:**
304
  ```bash
305
  git clone <repo-url> && cd sql-env
306
  uv sync
307
- uv run pytest tests/ -v # 21 tests, ~3.5s, no external deps
 
308
  ```
309
 
310
- ### Production
311
 
312
- **Deployment:** Docker container via OpenEnv CLI (`openenv build` / `openenv push`)
313
- **Runtime:** FastAPI on port 8000 (defined in `openenv.yaml`)
314
- **Status:** Dockerfile is a scaffold stub — not yet validated
315
 
316
- ---
317
-
318
- ## Suggested Feature Breakdown
319
-
320
- | ID | Feature | Complexity | Dependencies | Notes |
321
- |----|---------|------------|--------------|-------|
322
- | F001 | SQL query execution | standard | - | Execute queries against real SQLite, return results |
323
- | F002 | Reward computation | standard | F001 | 3-layer reward: operational, progress, terminal |
324
- | F003 | Answer verification | standard | F001 | Compare agent answer to gold SQL results |
325
- | F004 | Docker validation | simple | - | Update Dockerfile, test `openenv build` |
326
- | F005 | Multi-database support | complex | F001 | Load any Spider database, not just student_assessment |
327
-
328
- ### Suggested Implementation Order
329
 
330
- 1. **F001** Foundation: wire up SQLite execution so queries return real data
331
- 2. **F002 + F003** — Can be done in parallel once F001 is complete
332
- 3. **F004** — Independent, can be done anytime
333
- 4. **F005** — After the single-database path is solid
334
 
335
  ---
336
 
337
- ## Future Considerations
338
 
339
- - **Real SQLite execution:** Queries currently go to Ollama for SQL generation but aren't executed against a database. Phase 3 should execute the generated SQL and return actual results.
340
- - **Multi-episode batching:** For RL training, the environment will need to support multiple concurrent episodes efficiently.
341
- - **Reward shaping:** The 3-layer reward (operational, progress, terminal) is designed in `models.py` but not implemented.
342
- - **Table selection without Ollama:** A lightweight keyword/embedding-based table selector could replace the LLM fallback.
 
343
 
344
  ---
345
 
346
- ## Keeping This Map Current
347
 
348
- Update this file when you change any of:
349
- - System boundaries (new service, new subsystem)
350
- - Persistent state locations (new files/dirs written or read)
351
- - Shared data models or API contracts
352
- - Cross-cutting invariants
 
 
 
 
 
 
 
353
 
354
  ---
355
 
@@ -357,5 +466,8 @@ Update this file when you change any of:
357
 
358
  - Docs index: `docs/README.md`
359
  - Operations: `docs/RUNBOOK.md`
 
 
360
  - OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
361
  - Spider dataset: https://huggingface.co/datasets/xlangai/spider
 
 
1
  # Architecture
2
 
3
+ > Last updated: 2026-03-29
4
 
5
  System map for SQLEnv — an RL environment where agents learn interactive SQL exploration via the OpenEnv framework.
6
 
7
  **Goals:**
8
  - Show how components connect (system map + key flows)
9
  - Make hidden state explicit (what lives where)
10
+ - Define shared interfaces (Pydantic models, HTTP API)
11
  - Keep invariants legible (what must stay true)
12
 
13
  **Non-goals:**
 
22
  SQLEnv System
23
  ================================================================
24
 
25
+ RL Training SQLEnv Server (Docker)
26
+ ───────────── ──────────────────────
27
+ +──────────────+ +─────────────────────+
28
+ TRL GRPO │ │ server/app.py
29
+ Trainer │ HTTP (JSON) │ FastAPI + OpenEnv │
30
+ │ │<========================>
31
+ training/ │ SQLAction -> server +──────────┬──────────+
32
+ │ trl_adapter │ SQLObs <- server │
33
+ │ .py │ v
34
+ +──────────────+ +─────────────────────+
35
+ │ │ SQLEnvironment
36
+ OR │ (sql_environment.py)
37
+ v
38
+ +──────────────+ │ reset() / step()
39
+ Custom │ │ action dispatch │
40
+ rollout_func │ +──┬──────┬──────┬────+
41
+ │ (rollout.py) │ │ │ │
42
+ +──────────────+ v v v
43
+ +────────────────────────+
44
+ Evaluation │ Action Handlers │
45
+ ────────── │ DESCRIBE PRAGMA │
46
+ +──────────────+ │ SAMPLE SELECT N │
47
+ evaluate() │──> env.reset/step │ QUERY → SQL exec │
48
+ policies │ │ ANSWER → verifier │
49
+ │ .py │ +────────┬───────────────+
50
+ +──────────────+ │
51
+ v
52
+ +──────────────+ +────────────────────────+
53
+ Policies │ │ SQLite (read-only) │
54
+ │ RandomPolicy │ │ data/databases/ │
55
+ OraclePolicy │ │ {db_id}/{db_id}.sqlite │
56
+ +──────────────+ +────────────────────────+
57
+
58
+ ┌───────┴───────┐
59
+ v v
60
+ +───────────+ +───────────+
61
+ │ reward.py │ │verifier.py│
62
+ │ 3-layer │ │ type-aware│
63
+ dense │ │ comparison│
64
+ +───────────+ +───────────+
65
+
66
+ Data (committed) Synthetic (optional)
67
+ ──────────────── ────────────────────
68
+ data/questions/ server/synthetic/
69
+ questions_train.json (473 Q) generate.py
70
+ questions_eval.json (203 Q) mutations.py
71
+ db_list.json (10 databases) validate.py
72
  ```
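The read-only SQLite access and the DESCRIBE handler shown in the map can be sketched with stdlib `sqlite3`; function names here are illustrative, not the repo's actual handler API:

```python
import sqlite3

def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open a SQLite file read-only, as the server does for episode DBs."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    """DESCRIBE handler sketch via PRAGMA table_info.

    PRAGMA does not accept bound parameters, so `table` must be validated
    against the known table list before interpolation.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # rows are (cid, name, type, notnull, dflt_value, pk)
    return ", ".join(f"{name} {col_type}" for _, name, col_type, *_ in rows)
```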
73
 
74
  ---
 
77
 
78
  | Component | Owns | Entrypoint | State / Output |
79
  |-----------|------|------------|----------------|
80
+ | **SQLEnvironment** | Episode lifecycle, action dispatch, step budget | `server/sql_environment.py` | `EpisodeContext` (in-memory, per episode) |
81
+ | **FastAPI app** | HTTP endpoints, tokenizer factory | `server/app.py` | Stateless (delegates to environment) |
82
+ | **SQLEnvClient** | HTTP transport, payload serialization | `client.py` | Stateless (wraps server) |
83
  | **Pydantic models** | Type contracts (action, observation, state) | `models.py` | N/A (data classes) |
84
+ | **Reward engine** | 3-layer dense reward computation | `server/reward.py` | Mutates `EpisodeContext` accumulators |
85
+ | **Answer verifier** | Type-aware answer comparison | `server/verifier.py` | Stateless (pure function) |
86
+ | **GRPO pipeline** | Training orchestration, rollout, reward callables | `training/` (6 modules) | Training artifacts in `outputs/` |
87
+ | **TRL adapter** | `environment_factory` for TRL GRPOTrainer | `training/trl_adapter.py` | Per-session environment instances |
88
+ | **Evaluation** | Policy protocol, evaluate() runner | `evaluation/policies.py` | `EvaluationResult` metrics |
89
+ | **Oracle policy** | Deterministic upper-bound baseline | `evaluation/oracle_policy.py` | Stateless per-step |
90
+ | **Synthetic DB gen** | Metamorphic testing via data mutations | `server/synthetic/` | Variant SQLite files |
91
+ | **Question dataset** | 676 curated Spider questions across 10 DBs | `data/questions/` | JSON files |
92
+
93
+ ### External Dependencies
94
+
95
+ | Dependency | Purpose | Required |
96
+ |------------|---------|----------|
97
+ | SQLite (stdlib) | Database execution | Yes |
98
+ | OpenEnv (`openenv-core`) | Environment protocol, `create_app` | Yes |
99
+ | TRL (`trl`) | GRPO training | Only for training |
100
+ | HuggingFace Transformers | Tokenizer loading | Only for production server |
101
 
102
  ---
103
 
 
106
  ### Flow: Episode (Reset + Multi-Turn Steps)
107
 
108
  ```text
109
+ Client / Policy SQLEnvironment
110
+ │ │
111
+ │── reset(seed=42) ────────────────>
112
+ │ │── pick question (random or seeded)
113
+ │ │── open read-only SQLite connection
114
+ │ │── execute gold_sql, store gold_rows
115
+ │ │── init EpisodeContext (budget=15)
116
+ │ <── SQLObservation ────────────────│
117
+ │ .question="How many students?" │
118
+ │ .schema_info="Tables: student" │ (column details hidden)
119
+ │ .budget_remaining=15 │
120
+ │ │
121
+ │── step(DESCRIBE student) ────────>
122
+ │ │── PRAGMA table_info(student)
123
+ │ │── add to described_tables
124
+ │ │── compute_step_reward()
125
+ │ <── SQLObservation ────────────────│
126
+ │ .schema_info="student: id INT" │ (columns now revealed)
127
+ │ .result="5 columns, 20 rows" │
128
+ │ .reward=0.02 │
129
+ │ .budget_remaining=14 │
130
+ │ │
131
+ │── step(QUERY "SELECT COUNT(*)...") │
132
+ │ │── validate (SELECT-only, single stmt)
133
+ │ │── execute with 5s timeout
134
+ │ │── compute_step_reward() (L1 + L2)
135
+ │ <── SQLObservation ────────────────│
136
+ │ .result="| COUNT(*) |\n| 20 |" │
137
+ │ .reward=0.035 │
138
+ │ │
139
+ │── step(ANSWER "20") ─────────────> │
140
+ │ │── verify_answer("20", gold, type)
141
+ │ │── terminal reward: +1.0 or 0.0
142
+ │ <── SQLObservation ────────────────│
143
+ │ .done=true │
144
+ │ .reward=1.0 │
145
  ```
146
 
147
+ ### Flow: 3-Layer Reward Computation
148
 
149
  ```text
150
+ step() called with action
151
+
152
+ v
153
+ Layer 1: Operational Shaping (every action)
154
+ ├── exec_ok? → +0.02
155
+ ├── new SQL hash? → +0.01 (per unique query, no cumulative cap)
156
+ ├── repeated SQL? → -0.01
157
+ └── step cost → -0.005
158
+
159
+ v (only if action_type == QUERY and no error)
160
+ Layer 2: Progress Shaping (delta-from-previous, PBRS)
161
+ ├── cardinality score (25%) — |pred_rows - gold_rows| / max
162
+ ├── value overlap (50%) — Jaccard of cell values
163
+ └── numeric range (25%) — log-distance proximity
164
+
165
  v
166
+ bin to {0.0, 0.25, 0.5, 0.75, 1.0}
167
+ delta = binned - previous_progress → delta * 0.15
168
+ (positive = improvement, negative = regression)
169
+
170
+ v
171
+ Clip per step to [-0.05, +0.15]
172
+ No cumulative tracking
173
+
174
+ v (on ANSWER action)
175
+ Layer 3: Terminal Correctness
176
+ └── verify_answer() → +1.0 (correct) or 0.0 (wrong)
177
  ```
178
 
179
+ ### Flow: TRL Training Integration
180
 
181
  ```text
182
+ GRPOTrainer
183
+
184
+ │── discovers tool methods via docstrings
185
+ │ (describe, sample, query, answer)
186
+
187
+ │── per rollout:
188
+ │ SQLEnvTRL() → SQLEnvironment (internal)
189
+ │ .reset() → observation string
190
+ │ .describe(table) → schema string
191
+ │ .query(sql) → result string
+ │ .answer(value) → final string
193
+
194
+ │── reward:
195
+ │ sql_env_reward_func() → accumulated .reward
196
+
197
+ v
198
+ Training loop (GRPO: generate N completions, rank by reward)
199
  ```
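TRL reward callables take a batch of completions and return one float per completion. A sketch of the accumulation step follows; the `finished_envs` wiring is an assumption, not the repo's actual `sql_env_reward_func`:

```python
def sql_env_reward_func(completions, finished_envs=None, **kwargs):
    """Return one accumulated reward per completion (illustrative only)."""
    envs = finished_envs or []
    rewards = []
    for i, _completion in enumerate(completions):
        env = envs[i] if i < len(envs) else None
        # each environment instance accumulated .reward across its rollout
        rewards.append(getattr(env, "reward", 0.0) or 0.0)
    return rewards
```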
200
 
201
  ---
202
 
203
  ## Shared Data Models
204
 
205
+ Defined in `models.py`. These cross the HTTP boundary between client and server.
 
206
 
207
+ ### SQLAction (agent -> server)
208
 
209
  ```python
210
  class SQLAction(Action):
211
+     action_type: str   # DESCRIBE | SAMPLE | QUERY | ANSWER
212
+     argument: str      # table name, SQL string, or answer value
 
213
  ```
214
 
215
+ ### SQLObservation (server -> agent)
 
 
216
 
217
  ```python
218
  class SQLObservation(Observation):
219
+     question: str              # NL question to answer
220
+     schema_info: str           # known schema (incrementally revealed)
221
+     result: str                # last action result (truncated)
222
+     error: str                 # error message if action failed
223
+     step_count: int            # current step number
224
+     budget_remaining: int      # steps left
225
+     action_history: list[str]  # summary of prior actions
226
+     # Inherited: done (bool), reward (float | None)
227
  ```
228
 
229
+ ### SQLState (metadata endpoint)
 
 
230
 
231
  ```python
232
  class SQLState(State):
233
+     history_messages: list[Message]
234
+     current_action_type: str
 
 
 
235
  ```
236
 
237
+ ### Server-Only Types (never sent to agent)
238
+
239
+ ```python
240
+ @dataclass
241
+ class QuestionRecord:
242
+     question_id: str
243
+     question_text: str
244
+     database_name: str
245
+     gold_sql: str
246
+     gold_answer: str
247
+     answer_type: str       # integer | float | string | list
248
+     difficulty: str        # easy | medium | hard
249
+     tables_involved: list[str]
250
+
251
+ @dataclass
252
+ class EpisodeContext:
253
+     episode_id: str
254
+     db_connection: sqlite3.Connection
255
+     question_record: QuestionRecord
256
+     step_count: int = 0
257
+     budget: int = 15
258
+     described_tables: set[str] = field(default_factory=set)
259
+     action_log: list[str] = field(default_factory=list)
260
+     done: bool = False
261
+     gold_answer: str | None = None
262
+     gold_rows: list[tuple] = field(default_factory=list)
263
+     # Reward accumulators
264
+     query_hashes: set[str] = field(default_factory=set)
265
+     best_progress: float = 0.0
266
+     cumulative_step_reward: float = 0.0
267
+     cumulative_new_info_reward: float = 0.0
268
+ ```
269
+
270
+ **POMDP design:** The agent sees `SQLObservation`; the server holds `EpisodeContext`. The agent never sees gold answers, progress scores, or the full database. This separation forces exploration.
271
 
272
  ---
273
 
274
  ## API Contracts
275
 
276
+ ### HTTP (OpenEnv Protocol)
277
 
278
+ The server exposes HTTP endpoints via `openenv.core.env_server.create_app()`.
279
 
280
+ | Operation | Method | Payload | Response |
281
+ |-----------|--------|---------|----------|
282
+ | Reset | `POST /reset` | `{seed: int}` (optional) | `SQLObservation` (JSON) |
283
+ | Step | `POST /step` | `{action_type, argument, metadata}` | `{observation, reward, done, info}` |
284
+ | State | `GET /state` | | `SQLState` (JSON) |
285
 
286
+ ### Evaluation API
287
 
288
+ ```python
289
+ # Policy protocol
290
+ class Policy(Protocol):
291
+     def select_action(self, observation: SQLObservation) -> SQLAction: ...
292
 
293
+ # Built-in policies
294
+ RandomPolicy() # random baseline
295
+ OraclePolicy(questions) # gold-answer upper bound
296
+
297
+ # Runner
298
+ evaluate(env, policy, n_episodes, seed) -> EvaluationResult
299
+ # .success_rate, .avg_reward, .avg_steps, .episodes[]
300
+ ```
301
+
302
+ ### TRL Adapter API
303
+
304
+ ```python
305
+ SQLEnvTRL.configure(questions_path, db_dir, step_budget) # class method
306
+ # Tool methods (auto-discovered by TRL):
307
+ SQLEnvTRL.describe(table_name: str) -> str
308
+ SQLEnvTRL.sample(table_name: str) -> str
309
+ SQLEnvTRL.query(sql: str) -> str
310
+ SQLEnvTRL.answer(value: str) -> str
311
+ ```
312
 
313
  ---
314
 
315
  ## Cross-Cutting Concerns
316
 
317
+ ### SQL Safety
 
 
318
 
319
+ All database access enforces:
320
+ - **Read-only** SQLite connections (`file:...?mode=ro`)
321
+ - **SELECT-only** — rejects INSERT, UPDATE, DELETE, ALTER, DROP
322
+ - **Single statement** — rejects `; ...` (no stacked queries)
323
+ - **5-second timeout** via SQLite progress handler
324
+ - **20-row truncation** on all result sets
325
 
326
+ ### POMDP Structure
327
 
328
+ The partial observability is deliberate and load-bearing:
329
+ - Agent sees table names at reset but **not column details** (must DESCRIBE)
330
+ - Query results are **truncated** (at most 20 rows)
331
+ - Agent never sees `gold_answer`, `best_progress`, or `gold_rows`
332
+ - Step budget (default 15) forces strategic allocation of exploration
333
 
334
+ ### Import Compatibility
 
 
 
335
 
336
+ Dual import paths throughout for local vs Docker execution:
337
+ ```python
338
+ try:
339
+ from sql_env.models import SQLAction # local / pip install
340
+ except ImportError:
341
+ from models import SQLAction # Docker (PYTHONPATH=/app/env)
342
+ ```
343
 
344
  ### Configuration
345
 
346
+ | Variable | Required | Default | Purpose |
347
+ |----------|----------|---------|---------|
348
+ | `QUESTIONS_PATH` | No | `data/questions/student_assessment.json` | Questions JSON |
349
+ | `DB_DIR` | No | `data/databases/` | SQLite database directory |
350
+ | `TOKENIZER_NAME` | No | `mistralai/Mistral-7B-Instruct-v0.1` | HuggingFace tokenizer |
351
+ | `PORT` | No | `8000` | Server port (HF Spaces uses 7860) |
352
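The table above maps to straightforward environment-variable reads with defaults (a sketch; the actual config module layout is an assumption):

```python
import os

# Defaults mirror the configuration table above
QUESTIONS_PATH = os.environ.get("QUESTIONS_PATH", "data/questions/student_assessment.json")
DB_DIR = os.environ.get("DB_DIR", "data/databases/")
TOKENIZER_NAME = os.environ.get("TOKENIZER_NAME", "mistralai/Mistral-7B-Instruct-v0.1")
PORT = int(os.environ.get("PORT", "8000"))  # HF Spaces injects PORT=7860
```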
 
353
  ---
354
 
355
+ ## Data, State, and Storage
356
 
357
+ ### Committed Data
 
 
 
 
 
 
 
 
358
 
359
+ | Path | Contents |
360
+ |------|----------|
361
+ | `data/questions/questions_train.json` | 473 training questions across 10 DBs |
362
+ | `data/questions/questions_eval.json` | 203 evaluation questions across 10 DBs |
363
+ | `data/questions/db_list.json` | 10 Spider database IDs |
364
+ | `data/databases/models.py` | Legacy SQLAlchemy ORM models |
365
 
366
+ ### Downloaded Data (gitignored)
367
 
368
+ Spider SQLite databases in `data/databases/{db_id}/{db_id}.sqlite`. Downloaded via `scripts/download_spider_databases.py`. The 10 databases: student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation, pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player.
369
+
370
+ ### Runtime State (in-memory, per episode)
371
+
372
+ `EpisodeContext` holds all episode state: DB connection, gold data, reward accumulators, action history. Created on `reset()`, discarded when episode ends. Nothing persists between episodes.
 
373
 
374
  ---
375
 
376
+ <!-- ARCHITECTURE-SNAPSHOT-BEGIN -->
377
 
378
+ Snapshot (auto-managed)
379
+
380
+ - Repo signals: Python (pyproject.toml)
381
+ - Roots: tests/
382
+ - Entrypoint candidates: (none detected)
383
+
384
+ ```text
385
+ tests/
386
+ e2e/
387
+ test_training_e2e.py
388
+ integration/
389
+ test_training_pipeline.py
390
+ unit/
391
+ test_error_handling.py
392
+ test_grpo_config.py
393
+ test_oracle_policy.py
394
+ test_prompts.py
395
+ test_reward.py
396
+ test_rewards.py
397
+ test_rollout.py
398
+ test_sft_terminal_message.py
399
+ test_trl_adapter.py
400
+ test_evaluation.py
401
+ test_smoke.py
402
+ test_synthetic.py
403
+ test_trl_adapter.py
404
+ test_verifier.py
405
+ test_verifier_integration.py
406
+ ```
407
+
408
+ <!-- ARCHITECTURE-SNAPSHOT-END -->
409
 
410
  ---
411
 
 
413
 
414
  ### Development
415
 
416
+ **Prerequisites:** Python 3.11-3.12, `uv`, Docker (for deployment)
 
 
 
417
 
 
418
  ```bash
419
  git clone <repo-url> && cd sql-env
420
  uv sync
421
+ uv run python scripts/download_spider_databases.py
422
+ uv run pytest tests/ -v
423
  ```
424
 
425
+ ### Deployment
426
 
427
+ **Target:** HuggingFace Spaces (Docker, free tier)
 
 
428
 
429
+ ```bash
430
+ uv run openenv build # build Docker image
431
+ uv run openenv push # push to HF Spaces
432
+ ```
 
 
 
 
 
 
 
 
 
433
 
434
+ The Dockerfile uses a multi-stage build on top of `openenv-base`, runs as non-root `appuser`, bundles the Spider databases, and exposes `PORT` (default 7860 on HF Spaces).
 
 
 
435
 
436
  ---
437
 
438
+ ## Invariants
439
 
440
+ - Token tensors in `SQLState` grow monotonically across turns (never shrink mid-episode)
441
+ - `EpisodeContext` is server-only — leaking gold data to the agent breaks the POMDP
441
+ - Per-step rewards are clipped to `[-0.05, 0.15]`, so the terminal reward (+1.0) always dominates exploration (~0.3 max)
443
+ - `tests/` must pass without GPU, without network, without downloaded databases (mocks/fixtures)
444
+ - SQL execution never mutates the database (read-only mode enforced at connection level)
445
 
446
  ---
447
 
448
+ ## Glossary
449
 
450
+ | Term | Definition |
451
+ |------|------------|
452
+ | Episode | One question-answering session: reset -> N steps -> terminal |
453
+ | Action type | One of: DESCRIBE, SAMPLE, QUERY, ANSWER |
454
+ | POMDP | Partially observable MDP — agent acts under uncertainty |
455
+ | Spider | Academic text-to-SQL benchmark dataset (10 DBs used) |
456
+ | OpenEnv | Meta's RL environment framework (Environment, EnvClient) |
457
+ | Green Agent | OpenEnv's evaluation wrapper pattern |
458
+ | Oracle policy | Baseline that uses gold SQL/answer for reward ceiling validation |
459
+ | TRL | HuggingFace Transformer Reinforcement Learning library |
460
+ | GRPO | Group Relative Policy Optimization (RL algorithm used for training) |
461
+ | Dense reward | Per-step reward signal (vs sparse terminal-only reward) |
462
 
463
  ---
464
 
 
466
 
467
  - Docs index: `docs/README.md`
468
  - Operations: `docs/RUNBOOK.md`
469
+ - Vision: `vision/VISION.md`
470
+ - Feature specs: `specs/FEATURES.json`
471
  - OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
472
  - Spider dataset: https://huggingface.co/datasets/xlangai/spider
473
+ - TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv
docs/DOCS_CONTRACT.json ADDED
@@ -0,0 +1,74 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "optional_paths": [
3
+ "vision/README.md",
4
+ "vision/VISION.md",
5
+ "vision/ROADMAP.md"
6
+ ],
7
+ "required_paths": [
8
+ "AGENTS.md",
9
+ "docs/README.md",
10
+ "docs/ARCHITECTURE.md",
11
+ "docs/RUNBOOK.md",
12
+ "docs/QUALITY_SCORE.md",
13
+ "docs/FEATURE_SLICING.md",
14
+ "docs/DOCS_TAXONOMY.md",
15
+ "docs/design-docs/index.md",
16
+ "docs/design-docs/core-beliefs.md",
17
+ "docs/design-docs/decisions/0001-template.md",
18
+ "docs/exec-plans/README.md",
19
+ "docs/exec-plans/tech-debt-tracker.md",
20
+ "docs/discovery/index.md",
21
+ "docs/delivery-specs/index.md",
22
+ "docs/guides/README.md",
23
+ "docs/references/README.md",
24
+ "docs/exploration/README.md",
25
+ "docs/learnings/README.md",
26
+ "docs/learnings/architecture.md",
27
+ "docs/learnings/conventions.md",
28
+ "docs/learnings/workflow.md",
29
+ "docs/learnings/integrations.md",
30
+ "docs/learnings/gotchas.md",
31
+ "docs/learnings/security.md",
32
+ "docs/learnings/testing.md",
33
+ "docs/learnings/archived/README.md",
34
+ "docs/DOCS_CONTRACT.json"
35
+ ],
36
+ "map_files": [
37
+ "AGENTS.md"
38
+ ],
39
+ "index_files": [
40
+ "docs/README.md"
41
+ ],
42
+ "learnings_category_files": [
43
+ "docs/learnings/architecture.md",
44
+ "docs/learnings/conventions.md",
45
+ "docs/learnings/workflow.md",
46
+ "docs/learnings/integrations.md",
47
+ "docs/learnings/gotchas.md",
48
+ "docs/learnings/security.md",
49
+ "docs/learnings/testing.md"
50
+ ],
51
+ "learnings_budget": {
52
+ "max_bullets_per_category": 30,
53
+ "require_feature_id_suffix": true,
54
+ "enforce_dedupe_within_category": true,
55
+ "enforce_dedupe_across_categories": false
56
+ },
57
+ "agents_md_max_lines": 100,
58
+ "agents_md_max_learning_bullets_total": 0,
59
+ "orphan_spec_ignore_prefixes": [
60
+ "F001", "F002", "F003", "F004", "F005",
61
+ "F006", "F007", "F008", "F009", "F010",
62
+ "F013", "F014", "F015"
63
+ ],
64
+ "directory_types": {
65
+ "docs/guides": "how-to",
66
+ "docs/design-docs": "explanation",
67
+ "docs/discovery": "explanation",
68
+ "docs/delivery-specs": "reference",
69
+ "docs/exploration": "exploration",
70
+ "docs/references": "reference",
71
+ "docs/learnings": "reference",
72
+ "docs/exec-plans": "how-to"
73
+ }
74
+ }
docs/DOCS_TAXONOMY.md ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Documentation Taxonomy
2
+
3
+ Where to put new documentation. Inspired by [Diataxis](https://diataxis.fr/) but adapted for agentic engineering where both humans and AI agents consume docs.
4
+
5
+ ## Decision Tree
6
+
7
+ ```
8
+ "I need to write something down."
9
+
10
+ ├── Does it tell someone HOW TO do a specific task?
11
+ │ └── YES → docs/guides/ [how-to]
12
+
13
+ ├── Does it describe WHAT something IS (APIs, interfaces, facts)?
14
+ │ └── YES → docs/references/ [reference]
15
+ │ or → docs/ARCHITECTURE.md [reference]
16
+ │ or → docs/delivery-specs/ [reference]
17
+
18
+ ├── Does it explain WHY a decision was made?
19
+ │ └── YES → docs/design-docs/ [explanation]
20
+ │ or → vision/ [explanation]
21
+
22
+ ├── Does it validate WHETHER we should build something?
23
+ │ └── YES → docs/discovery/ [explanation]
24
+
25
+ ├── Is it a durable pattern extracted from work?
26
+ │ └── YES → docs/learnings/<category>.md [reference]
27
+
28
+ ├── Is it an idea, investigation, or scratchpad?
29
+ │ └── YES → docs/exploration/ [exploration]
30
+
31
+ └── Is it tracking active work progress?
32
+ └── YES → docs/exec-plans/ [how-to]
33
+ ```
34
+
35
+ ## Directory Purpose Map
36
+
37
+ | Directory | Diataxis Type | Audience | Created By | Mutability |
38
+ |-----------|--------------|----------|------------|------------|
39
+ | `docs/guides/` | How-to | Humans + Agents | Manual or extracted | Stable |
40
+ | `docs/design-docs/` | Explanation | Humans + Agents | `design-doc` skill | Append-only (decisions are permanent) |
41
+ | `docs/discovery/` | Explanation | Humans (agents read) | `discovery` skill | Per-feature, stable once validated |
42
+ | `docs/delivery-specs/` | Reference | Agents + Humans | `delivery-spec` skill | Stable once approved |
43
+ | `docs/references/` | Reference | Agents | Manual or generated | Updated as needed |
44
+ | `docs/learnings/` | Reference | Agents + Humans | `compound-engineer` | Append-only (budgeted) |
45
+ | `docs/exploration/` | Exploration | Humans only | Manual | Ephemeral -- graduate or archive |
46
+ | `docs/exec-plans/` | How-to | Humans (agents read) | Manual | Active then archived |
47
+ | `vision/` | Explanation | Humans + Agents | `strategy` skill or manual | Rare changes |
48
+ | `specs/` | Reference | Agents | Autocode skills | Per-feature lifecycle |
49
+
50
+ ## Self-Organization Rules
51
+
52
+ 1. **Start minimal.** Don't create directories until you have content for them. Skills create directories on-demand.
53
+ 2. **Graduate content.** Exploration docs that prove durable should move to the appropriate permanent location (learnings, guides, references).
54
+ 3. **One purpose per doc.** If a document is doing two things (e.g., explaining WHY and telling HOW), split it.
55
+ 4. **Agents navigate by maps.** Keep `AGENTS.md` as a pure index. Keep `docs/README.md` as the docs index. Don't inline content in indexes.
56
+ 5. **Enforce mechanically.** Use `DOCS_CONTRACT.json` and `opencode-ctx docs validate` to prevent drift. Narrative instructions degrade with context length; lints apply everywhere.
57
+
58
+ ## Sources
59
+
60
+ - [Diataxis](https://diataxis.fr/) -- Four-type documentation framework (Daniele Procida)
61
+ - [Cynefin](https://en.wikipedia.org/wiki/Cynefin_framework) -- Complex vs. ordered domains inform when to prescribe vs. let emerge (Dave Snowden)
62
+ - [OpenAI Harness Engineering](https://openai.com/index/harness-engineering/) -- "Give Codex a map, not a 1,000-page instruction manual"
docs/FEATURE_SLICING.md ADDED
@@ -0,0 +1,50 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Feature Slicing Strategy
2
+
3
+ This doc clarifies how we reconcile two goals that can conflict if handled naively:
4
+
5
+ 1. Capture human intent and constraints once (discovery / delivery / design).
6
+ 2. Ship small, low-risk PRs (vertical slices) with fast feedback.
7
+
8
+ ## Two Levels: Capability vs. Slice
9
+
10
+ We treat "feature" in two different ways depending on which artifact we are talking about.
11
+
12
+ ### 1) Capability Docs (Shared Context)
13
+
14
+ These artifacts capture durable intent/constraints and can (and often should) be shared across multiple slices:
15
+
16
+ - `docs/discovery/<capability>.json` and `docs/discovery/<capability>.md`
17
+ - Outcome, opportunity, PR/FAQ, taste (delights/frustrations/feeling), scope, unknowns
18
+ - Source of truth for "taste" when present
19
+
20
+ - `docs/delivery-specs/<capability>.md` (optional)
21
+ - Functional requirements + non-functional requirements
22
+ - Acceptance criteria and rollout notes
23
+ - Can cover a capability that will be delivered across multiple slices
24
+
25
+ - `docs/design-docs/<decision>.md` (optional)
26
+ - Architecture decisions (ADR-style) that may apply to many slices
27
+
28
+ ### 2) Feature Slices (Execution Units)
29
+
30
+ These are the units we track in `specs/FEATURES.json` and implement as small PRs:
31
+
32
+ - Each slice should be independently valuable or a necessary, verifiable step toward value.
33
+ - Each slice should have its own implementation + verification specs.
34
+ - Each slice can reference the same capability docs (discovery/delivery/design) via `feature.docs.*`.
35
+
36
+ Key rule:
37
+ - Multiple slices may share the same `docs.discovery_json` / `docs.delivery_spec` / `docs.design_doc`.
38
+ - Slices should NOT share the same `specs/F###-IMPLEMENTATION_SPEC.md`.
39
+
40
+ ## Practical Heuristics
41
+
42
+ - Prefer a single discovery doc per capability, then slice delivery into multiple `FEATURES.json` entries.
43
+ - Keep implementation specs bounded (max ~7-10 steps). If the plan is bigger, split into more slices.
44
+ - If two slices have different success criteria / taste / outcome, they should not share the same discovery JSON.
45
+
46
+ ## What This Buys Us
47
+
48
+ - No repeated interviews: taste is captured once and reused.
49
+ - Small PRs: execution stays incremental and testable.
50
+ - Lower drift: shared intent stays consistent, slice specs stay bounded.
docs/QUALITY_SCORE.md ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Quality Score
2
+
3
+ This is a lightweight rubric for keeping the repo maintainable as it grows.
4
+
5
+ ## Documentation
6
+
7
+ - `AGENTS.md` stays short and acts as a navigation map.
8
+ - Durable guidance lives in `docs/` with stable paths.
9
+ - `opencode-ctx docs validate` should be green before merging docs changes.
10
+
11
+ ## Determinism
12
+
13
+ - CLI output ordering and messages remain stable across runs.
docs/README.md CHANGED
@@ -9,12 +9,36 @@ This directory is the system-of-record for durable project knowledge.
9
  | **Guides** | [guides/README.md](guides/README.md) | how-to | Practical step-by-step procedures |
10
  | **Design** | [design-docs/index.md](design-docs/index.md) | explanation | Feature design, ADRs, decision rationale |
11
  | **ADR Template** | [design-docs/decisions/0001-template.md](design-docs/decisions/0001-template.md) | reference | Decision record template |
 
 
 
 
 
 
 
 
12
  | **References** | [references/README.md](references/README.md) | reference | External docs for agent context |
13
 
 
 
 
 
 
 
 
 
 
 
 
 
14
  ## System Docs
15
 
16
  - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md)
17
  - Operations: [RUNBOOK.md](RUNBOOK.md)
 
 
 
 
18
 
19
  ## Directory Structure
20
 
@@ -23,11 +47,36 @@ docs/
23
  ├── README.md # This file (index)
24
  ├── ARCHITECTURE.md # System design overview [reference]
25
  ├── RUNBOOK.md # Operations guide [how-to]
 
 
 
 
26
  ├── guides/ # How-to guides [how-to]
27
  │ └── README.md # Guide index
28
  ├── design-docs/ # Decision rationale [explanation]
29
  │ ├── index.md # Design docs catalogue
 
30
  │ └── decisions/ # Architectural Decision Records
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
  └── references/ # External docs [reference]
32
  └── README.md # External docs for agent context
33
  ```
 
9
  | **Guides** | [guides/README.md](guides/README.md) | how-to | Practical step-by-step procedures |
10
  | **Design** | [design-docs/index.md](design-docs/index.md) | explanation | Feature design, ADRs, decision rationale |
11
  | **ADR Template** | [design-docs/decisions/0001-template.md](design-docs/decisions/0001-template.md) | reference | Decision record template |
12
+ | **Core Beliefs** | [design-docs/core-beliefs.md](design-docs/core-beliefs.md) | explanation | Agent-first operating principles |
13
+ | **Discovery** | [discovery/index.md](discovery/index.md) | explanation | Validate ideas and capture taste |
14
+ | **Delivery Specs** | [delivery-specs/index.md](delivery-specs/index.md) | reference | Engineering handoff specs |
15
+ | **Exec Plans** | [exec-plans/README.md](exec-plans/README.md) | how-to | Complex multi-step work tracking |
16
+ | **Tech Debt** | [exec-plans/tech-debt-tracker.md](exec-plans/tech-debt-tracker.md) | how-to | Known debt and cleanup opportunities |
17
+ | **Exploration** | [exploration/README.md](exploration/README.md) | exploration | Ideas, research, scratchpad |
18
+ | **Learnings** | [learnings/README.md](learnings/README.md) | reference | Durable patterns from completed work |
19
+ | **Archived Learnings** | [learnings/archived/README.md](learnings/archived/README.md) | reference | Overflow learnings archive |
20
  | **References** | [references/README.md](references/README.md) | reference | External docs for agent context |
21
 
22
+ ## Learnings by Category
23
+
24
+ | Category | File |
25
+ |----------|------|
26
+ | Architecture | [learnings/architecture.md](learnings/architecture.md) |
27
+ | Conventions | [learnings/conventions.md](learnings/conventions.md) |
28
+ | Gotchas | [learnings/gotchas.md](learnings/gotchas.md) |
29
+ | Integrations | [learnings/integrations.md](learnings/integrations.md) |
30
+ | Security | [learnings/security.md](learnings/security.md) |
31
+ | Testing | [learnings/testing.md](learnings/testing.md) |
32
+ | Workflow | [learnings/workflow.md](learnings/workflow.md) |
33
+
34
  ## System Docs
35
 
36
  - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md)
37
  - Operations: [RUNBOOK.md](RUNBOOK.md)
38
+ - Taxonomy: [DOCS_TAXONOMY.md](DOCS_TAXONOMY.md)
39
+ - Quality: [QUALITY_SCORE.md](QUALITY_SCORE.md)
40
+ - Feature Slicing: [FEATURE_SLICING.md](FEATURE_SLICING.md)
41
+ - Contract: [DOCS_CONTRACT.json](DOCS_CONTRACT.json)
42
 
43
  ## Directory Structure
44
 
 
47
  ├── README.md # This file (index)
48
  ├── ARCHITECTURE.md # System design overview [reference]
49
  ├── RUNBOOK.md # Operations guide [how-to]
50
+ ├── DOCS_TAXONOMY.md # Where to put new docs [reference]
51
+ ├── QUALITY_SCORE.md # Domain quality grades [reference]
52
+ ├── FEATURE_SLICING.md # Feature slicing strategy [reference]
53
+ ├── DOCS_CONTRACT.json # Validation config [reference]
54
  ├── guides/ # How-to guides [how-to]
55
  │ └── README.md # Guide index
56
  ├── design-docs/ # Decision rationale [explanation]
57
  │ ├── index.md # Design docs catalogue
58
+ │ ├── core-beliefs.md # Agent-first principles
59
  │ └── decisions/ # Architectural Decision Records
60
+ ├── discovery/ # Idea validation [explanation]
61
+ │ └── index.md # Discovery index
62
+ ├── delivery-specs/ # Engineering handoff [reference]
63
+ │ └── index.md # Delivery specs index
64
+ ├── exec-plans/ # Work tracking [how-to]
65
+ │ ├── README.md # Exec plans index
66
+ │ └── tech-debt-tracker.md # Technical debt
67
+ ├── exploration/ # Ideas, scratchpad [exploration]
68
+ │ └── README.md # Exploration index
69
+ ├── learnings/ # Durable patterns [reference]
70
+ │ ├── README.md # Learnings index
71
+ │ ├── architecture.md # Architecture learnings
72
+ │ ├── conventions.md # Convention learnings
73
+ │ ├── gotchas.md # Gotcha learnings
74
+ │ ├── integrations.md # Integration learnings
75
+ │ ├── security.md # Security learnings
76
+ │ ├── testing.md # Testing learnings
77
+ │ ├── workflow.md # Workflow learnings
78
+ │ └── archived/ # Overflow archive
79
+ │ └── README.md # Archive policy
80
  └── references/ # External docs [reference]
81
  └── README.md # External docs for agent context
82
  ```
docs/SKILLS_HANDBOOK.generated.md ADDED
@@ -0,0 +1,413 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!-- handbook:ref cli opencode-ctx ralph run -->
2
+ <!-- handbook:ref cli opencode-ctx ralph start -->
3
+ <!-- handbook:ref cli opencode-ctx ralph spec -->
4
+ <!-- handbook:ref cli opencode-ctx ralph review -->
5
+
6
+ # OpenCode Workflow Handbook
7
+
8
+ Short, terminal-friendly map of the agentic workflow.
9
+
10
+ Assets discovered from: /Users/hjerp/.config/opencode
11
+
12
+ ## Pipeline (What Comes Next)
13
+
14
+ ## Delegation Brief (Required)
15
+
16
+ For any non-trivial work (anything you would normally put in a ticket/PRD), start with a short **Delegation Brief**.
17
+
18
+ This is the invariant "front page" that gets carried through discovery, delivery, and implementation so agents do not drift.
19
+
20
+ Minimum fields:
21
+
22
+ ```text
23
+ Objective (what + why):
24
+ Deliverables (artifacts/capabilities):
25
+ In scope / Out of scope (non-goals):
26
+ Boundaries (authority + constraints):
27
+ Checkpoints (when to show work / ask):
28
+ Acceptance checks (how we know it's done):
29
+ ```
30
+
31
+ How it glues into the existing workflow:
32
+
33
+ - `strategy` strengthens Objective/why (when the why is unclear or long-horizon).
34
+ - `discovery` turns Objective into validated scope + taste, and sharpens Acceptance checks.
35
+ - `delivery-spec` turns Deliverables/Boundaries into an engineering handoff.
36
+ - `design-doc` records decision rationale when the "how" has meaningful options.
37
+ - `autocode-implementation-planner` encodes the brief as **Core Intent (Immutable)** in the implementation spec, then uses the information barrier to generate independent verification.
38
+
39
+ **Note:** `strategy` is optional — use it when the "why" matters as much as the "what" (methodologies, frameworks, platforms). Most projects start at `discovery`.
40
+
41
+ ```
42
+ [delegation brief] -> [strategy] -> discovery -> delivery-spec -> design-doc
43
+ | | | |
44
+ v v v v
45
+ vision validate scope+handoff decisions
46
+ (optional) +taste
47
+
48
+ architecture -> autocode-implementation-planner -> verification-planner -> execution loop
49
+ | | | |
50
+ v v v v
51
+ system map IMPLEMENTATION_SPEC.md VERIFICATION_SPEC.md build/review/verify
52
+ ```
53
+
54
+ ## When to Create Which Doc
55
+
56
+ | Complexity | Product Spec? | Design Doc? | Start With |
57
+ |------------|---------------|-------------|------------|
58
+ | **Simple** (CRUD, config) | Skip | Skip | `autocode-implementation-planner` |
59
+ | **Standard** (new feature) | Recommended | Optional | `discovery` skill → implementation planner |
60
+ | **Complex** (multi-system) | Required | Required | `discovery` → `delivery-spec` → design doc → implementation planner |
61
+ | **Exploratory** (unknown) | Skip | Skip | `prototype` skill first |
62
+
63
+ **Decision criteria:**
64
+ - Create **discovery** doc when stakeholders need to agree on scope
65
+ - Create **design-doc** when architecture decisions need to be recorded
66
+ - Skip both for tactical work that doesn't need alignment
67
+
68
+ ## Where Artifacts Live
69
+
70
+ - vision/ = project philosophy (optional, for methodologies/frameworks) [explanation]
71
+ - docs/ = durable knowledge, organized by Diataxis type:
72
+ - docs/guides/ = how-to procedures [how-to]
73
+ - docs/design-docs/ = decision rationale [explanation]
74
+ - docs/discovery/ = problem validation + taste [explanation]
75
+ - docs/delivery-specs/ = engineering handoff [reference]
76
+ - docs/references/ = external docs for agents [reference]
77
+ - docs/exploration/ = ideas, scratchpad [exploration]
78
+ - docs/learnings/ = durable patterns [reference]
79
+ - docs/DOCS_TAXONOMY.md = where to put new docs
80
+ - specs/ = per-slice execution artifacts (FEATURES.json, impl specs, verification specs, CLARIFICATION_QUESTIONS.md (optional))
81
+
82
+ **Architecture lives in `docs/ARCHITECTURE.md`** — it's durable system knowledge, not a per-feature artifact.
83
+
84
+ ## Architecture: One File, Multiple Tools
85
+
86
+ **Canonical location:** `docs/ARCHITECTURE.md`
87
+
88
+ All architecture tools output to the same file:
89
+
90
+ | Tool | Purpose |
91
+ |------|---------|
92
+ | `architecture` skill | Create NEW architecture (greenfield, redesign) |
93
+ | `codebase-onboarding` skill | Understand EXISTING architecture |
94
+ | `opencode-ctx docs architecture apply` | Refresh auto-managed snapshot |
95
+ | `/update-architecture-diagram` | Quick diagram refresh |
96
+
97
+ **Workflow:**
98
+ - Joining existing project → `codebase-onboarding` first
99
+ - Building new system → `architecture` skill
100
+ - Regular maintenance → `opencode-ctx docs architecture apply`
101
+
102
+ **Keep it fresh:**
103
+ - Human-curated sections: coarse (5-10 boxes, 2-4 key flows)
104
+ - Auto-managed snapshot: refresh with `opencode-ctx docs architecture apply`
105
+ - CI gate: `opencode-ctx docs architecture check`
106
+
107
+ ## What To Run (Typical Flow)
108
+
109
+ - Start by writing (or asking the agent to draft) a Delegation Brief (required for significant work).
110
+ - To draft a Delegation Brief as a reusable artifact: use `delegation-brief`.
111
+ - If you are unsure what doc to write: use `triage`.
112
+ - To generate "what to build next" ideas from a codebase or product: use `ideation`.
113
+ - To capture project philosophy (rare): use `strategy` or create `vision/VISION.md` manually.
114
+ - To set direction for a longer horizon: use `strategy`.
115
+ - To validate and capture taste for a feature: use `discovery`.
116
+ - To hand off a concrete solution: use `delivery-spec`.
117
+ - To record architectural decisions: use `design-doc`.
118
+ - To map shared components across features: use `architecture`.
119
+ - To plan implementation with verification barrier: use `autocode-implementation-planner`.
120
+
121
+ - To inject baseline guardrails into `AGENTS.md`: `opencode-ctx guidelines apply --packs <lang>,testing,delivery-safety --file AGENTS.md`
122
+
123
+ **Vision vs Discovery:**
124
+ - Vision = enduring project philosophy (rare, for methodologies/frameworks)
125
+ - Discovery = feature-specific validation and taste capture (common)
126
+
127
+ | Use Vision When... | Use Discovery When... |
128
+ |--------------------|----------------------|
129
+ | The project IS a methodology or framework | You're validating a specific feature idea |
130
+ | Philosophy needs to persist across many features | The "why" is feature-specific |
131
+ | Multiple contributors need philosophical alignment | Taste capture is per-feature |
132
+ | The project will evolve over years | The project is < 6 months |
133
+
134
+ **Lightweight vision alternative:** Add a "Vision" section to README.md (2-3 paragraphs) instead of a full `vision/` directory.
135
+
136
+ Then build + ship with commands:
137
+
138
+ - /autocode-next-step (implement next spec step)
139
+ - /feature-demo F### (generate executable demo proving the feature works — auto-runs at loop end)
140
+ - /review-changes (auto-review)
141
+ - /commit-push-pr (commit, push, PR)
142
+ - /techdebt (tech debt scan)
143
+
144
+ For hands-free loops (optional):
145
+
146
+ - `opencode-ctx ralph start <spec>` (start server + browser + autonomous loop + cleanup)
147
+ - `opencode-ctx ralph run <spec>` (autonomous implement loop)
148
+ - `opencode-ctx ralph spec <spec>` (iterative spec refinement)
149
+ - `opencode-ctx ralph review` (review/fix bounded loop)
150
+
151
+ Common options: `-m MODEL` (aliases: sonnet, opus, gpt-builder), `--max-iter N`
152
+
153
+ ## Full Workflow (Complex Features)
154
+
155
+ ```
156
+ 0. Delegation Brief (required for significant work)
157
+ └─→ docs/delegation-briefs/<feature>.md # delegation-brief skill
158
+
159
+ 1. Discovery (taste capture + problem validation)
160
+ └─→ docs/discovery/<feature>.md # discovery skill
161
+ └─→ docs/discovery/<feature>.json # Machine-readable taste data
162
+
163
+ 1b. Delivery spec (engineering handoff - optional for simple features)
164
+ └─→ docs/delivery-specs/<feature>.md # delivery-spec skill
165
+
166
+ 2. Design doc (architectural decisions)
167
+ └─→ docs/design-docs/<feature>.md # design-doc skill (ADR-style)
168
+
169
+ 3. Feature planning (multi-feature projects)
170
+ └─→ specs/FEATURES.json # autocode-plan-features skill
171
+
172
+ 4. Implementation planning (reads taste from step 1)
173
+ └─→ specs/F001-IMPLEMENTATION_SPEC.md # autocode-implementation-planner skill
174
+ └─→ specs/F001-VERIFICATION_SPEC.md # Auto-generated
175
+
176
+ 5. Execution
177
+ └─→ /autocode-next-step # Or: opencode-ctx ralph start <spec>
178
+
179
+ 6. Knowledge extraction (automatic)
180
+ └─→ docs/learnings/*.md # compound-engineer at finalization
181
+ ```
182
+
183
+ **Taste flows through the pipeline:**
184
+ - `discovery` captures delights, frustrations, feeling
185
+ - `delivery-spec` (optional) translates to functional requirements
186
+ - `autocode-implementation-planner` reads taste JSON and uses as success criteria
187
+ - Verification checks implementation against captured taste
188
+
189
+ ## Quick Workflow (Simple Features)
190
+
191
+ ```bash
192
+ # Plan directly
193
+ skill({ name: "autocode-implementation-planner" })
194
+
195
+ # Execute
196
+ /autocode-next-step
197
+ # Or autonomous: opencode-ctx ralph start <spec>
198
+
199
+ # Review and ship
200
+ /review-changes
201
+ /commit-push-pr
202
+ ```
203
+
204
+ ## Linking Docs to Features (FEATURES.json)
205
+
206
+ When using `FEATURES.json`, link to related documentation:
207
+
208
+ ```json
209
+ {
210
+ "id": "F001",
211
+ "name": "User Authentication",
212
+ "docs": {
213
+ "discovery_json": "docs/discovery/auth.json",
214
+ "discovery_md": "docs/discovery/auth.md",
215
+ "delivery_spec": "docs/delivery-specs/auth.md",
216
+ "design_doc": "docs/design-docs/auth-system.md"
217
+ },
218
+ "specs": {
219
+ "implementation": "specs/F001-IMPLEMENTATION_SPEC.md",
220
+ "verification": "specs/F001-VERIFICATION_SPEC.md"
221
+ }
222
+ }
223
+ ```
224
+
225
+ The `autocode-implementation-planner` skill automatically checks each linked doc and uses it as context.
226
+
227
+ ## AGENTS.md Convention
228
+
229
+ AGENTS.md is a **navigation map**, not an encyclopedia:
230
+
231
+ - ~60-80 lines
232
+ - Links to docs/ for details
233
+ - No inline learnings (those go in `docs/learnings/`)
234
+ - Injectable guidelines via `<!-- GUIDELINES-BEGIN/END -->`
235
+
236
+ ```bash
237
+ # Apply or update language guidelines (idempotent)
238
+ opencode-ctx guidelines apply --packs python,testing,delivery-safety --file AGENTS.md
239
+
240
+ # Frontend projects
241
+ opencode-ctx guidelines apply --packs frontend,testing,delivery-safety --file AGENTS.md
242
+
243
+ # Preview without writing
244
+ opencode-ctx guidelines apply --packs python,testing,delivery-safety --file AGENTS.md --dry-run
245
+ ```
246
+
247
+ Available packs: `python`, `testing`, `frontend`, `delivery-safety`
248
+
249
+ ## Parallel Feature Development (Fan-Out / Fan-In)
250
+
251
+ When a project has multiple independent features to implement, use the
252
+ parallel workflow to create isolated clones for each, then merge back safely.
253
+
254
+ ```bash
255
+ # Plan: AI-assisted analysis of which features can run in parallel
256
+ opencode-ctx parallel plan # analyze FEATURES.json + specs
257
+ opencode-ctx parallel plan --status ready # filter by status
258
+ opencode-ctx parallel plan --format json # machine-readable output
259
+
260
+ # Fan-out: create parallel clones for features
261
+ opencode-ctx parallel fan-out # reads FEATURES.json (planned/ready)
262
+ opencode-ctx parallel fan-out --features F001,F002 # explicit feature list
263
+ opencode-ctx parallel fan-out --from-local # clone from local (faster)
264
+
265
+ # Status: check all parallel clones
266
+ opencode-ctx parallel status
267
+ opencode-ctx parallel status --format json
268
+
269
+ # Work in each clone independently
270
+ cd ../repo-F001-auth-system
271
+ opencode-ctx ralph start specs/F001-IMPLEMENTATION_SPEC.md
272
+
273
+ # Fan-in: pre-merge conflict analysis
274
+ opencode-ctx parallel fan-in # trial merge against main
275
+ opencode-ctx parallel fan-in --format json
276
+
277
+ # Cleanup: remove clones after merging
278
+ opencode-ctx parallel cleanup F001
279
+ opencode-ctx parallel cleanup --all --force
280
+ ```
281
+
282
+ **AI-Orchestrated Workflow (recommended):**
283
+ 1. Use `/parallel-plan` command — analyzes implementation specs for file overlaps
284
+ and dependencies, recommends parallelizable batches, asks for confirmation
285
+ 2. On confirmation, automatically calls `fan-out` for the batch
286
+ 3. User runs `opencode-ctx ralph start <spec>` (or manual implementation) in each clone
287
+ 4. Use `parallel-orchestrator` agent to monitor progress and coordinate merges
288
+ 5. `fan-in` runs trial merges; orchestrator generates a merge playbook
289
+ 6. User merges per-feature PRs in the suggested order
290
+ 7. `cleanup` removes clones after merge
291
+
292
+ **Manual Workflow:**
293
+ 1. `fan-out` creates sibling clones, each on a `feat/F###-slug` branch
294
+ 2. User runs `opencode-ctx ralph start <spec>` (or manual implementation) in each clone
295
+ 3. `fan-in` runs trial merges to detect conflicts and suggests merge order
296
+ 4. User merges per-feature PRs in the suggested order
297
+ 5. `cleanup` removes clones after merge
298
+
299
+ The `plan` command reads Section 2 (Change Manifest) from implementation specs to
300
+ extract file lists, computes pairwise file overlaps, and partitions features into
301
+ conflict-free batches. Fan-in respects feature dependencies from FEATURES.json
302
+ and orders clean merges before conflicting ones.
303
+
304
+ ## Maintenance Commands
305
+
306
+ ```bash
307
+ # Keep the architecture snapshot current
308
+ opencode-ctx docs architecture apply # refresh auto-managed snapshot
309
+ opencode-ctx docs architecture check # CI gate
310
+
311
+ # Keep the FEATURES.json schema current (multi-feature projects)
312
+ opencode-ctx features schema apply # creates specs/schemas/autocode-features-v1.schema.json
313
+ opencode-ctx features schema check # CI gate
314
+ opencode-ctx features schema apply --dry-run
315
+
316
+ # Validate docs structure
317
+ opencode-ctx docs validate
318
+ ```
319
+
320
+ ## Global Configuration
321
+
322
+ Additional commands, agents, and skills are in `/Users/hjerp/.config/opencode`:
323
+
324
+ | Path | Contents |
325
+ |------|----------|
326
+ | `AGENTS.md` | Global workflow policies |
327
+ | `commands/` | Slash commands (`/autocode-*`, `/review-*`, etc.) |
328
+ | `agents/` | Specialized agents (planner, reviewer, executor, etc.) |
329
+ | `skills/` | Product planning skills (strategy, discovery, delivery-spec, etc.) |
330
+ | `scripts/` | Automation scripts (legacy, deprecated — use `opencode-ctx ralph`) |
331
+
332
+ ## Skills (Discovered)
333
+
334
+ | Skill | Description |
335
+ |---|---|
336
+ | `ai-adoption-engagement` | Run an end-to-end AI adoption consulting engagement: scoping, current-state maturity di... |
337
+ | `ai-strategy` | Plan, evaluate, and iteratively update AI implementation strategy for an organization o... |
338
+ | `architecture` | Create lightweight system architecture establishing shared understanding across feature... |
339
+ | `autocode-implementation-planner` | Research and plan software changes with structured verification handoff. Orchestrates s... |
340
+ | `autocode-plan-features` | Create machine-parseable feature lists for multi-feature projects. Generates FEATURES.j... |
341
+ | `causal-driver` | Build causal driver trees for any metric — separate accounting identities from causal h... |
342
+ | `checkpoint-protocol` | Structured human checkpoint protocol that minimizes evaluation overhead. Transforms 're... |
343
+ | `codebase-onboarding` | Rapidly understand an unfamiliar codebase by generating a structured overview with ASCI... |
344
+ | `communication` | Craft compelling communication for stakeholders. Covers storytelling frameworks, persua... |
345
+ | `core-web-vitals` | Optimize Core Web Vitals (LCP, INP, CLS) for better page experience and search ranking.... |
346
+ | `delegation-brief` | Create a short delegation contract (objective, deliverables, boundaries, checkpoints, a... |
347
+ | `delivery-spec` | Create delivery specs that define solutions for engineers and AI agents (Phase 3). Use... |
348
+ | `design-doc` | Record architectural decisions with rationale (ADR-style). Captures WHY decisions were... |
349
+ | `discovery` | Validate and prioritize product ideas using PR/FAQ, Opportunity Solution Trees, Taste C... |
350
+ | `execution` | Translate PRDs into user stories with acceptance criteria (Phase 4). Use when: (1) Engi... |
351
+ | `frameworks` | Reusable frameworks, checklists, and templates (a lightweight reference library). Use w... |
352
+ | `frontend-builder` | LLM-optimized frontend implementation guidance. Use when: (1) Starting new frontend pro... |
353
+ | `ideation` | Generate structured 'what to build next' candidates from a codebase or product using th... |
354
+ | `landscape-scan` | Competitive intelligence for company building and investment diligence. Maps the full c... |
355
+ | `ml-concept-eval` | Evaluate an ML/statistical technique against a specific business context: is it viable,... |
356
+ | `peer-collaboration` | Coordinate volunteer or peer teams without formal authority. Use when: (1) Working with... |
357
+ | `principal-ml-engineer` | Principal ML engineer playbook: design ML/LLM systems that are reliable, observable, ex... |
358
+ | `project-leadership` | Adaptive project leadership for competitions, research, coursework, ventures, and deliv... |
359
+ | `prototype` | Rapid exploratory prototyping to resolve ambiguity and validate ideas before committing... |
360
+ | `sector-primer` | Rapid industry understanding for consultants and investors. Produces a structured Indus... |
361
+ | `seo` | Optimize for search engine visibility and ranking. Use when asked to "improve SEO", "op... |
362
+ | `spec-staff-review` | Deliberate quality review of implementation specs by a staff engineer persona. Use for... |
363
+ | `strategic-thinker` | Guide users through strategic thinking using the Strategy Kernel framework. Facilitates... |
364
+ | `strategy` | Create product vision boards and outcome-based roadmaps (Phase 0-1). Use when: (1) Annu... |
365
+ | `team-lead` | Reference skill for team leadership principles: coaching, feedback, delegation. Use whe... |
366
+ | `triage` | Guide users through product planning workflow and select the right documentation (Triag... |
367
+ | `visual-artifacts` | Create professional visual artifacts: diagrams (Mermaid, Excalidraw) and presentations... |
368
+ | `web-performance` | Optimize web performance for faster loading and better user experience. Use when asked... |
369
+ | `web-security` | Apply modern web security best practices including CSP, HTTPS, XSS prevention, and depe... |
370
+ | `what-how-alignment` | System-level alignment between intent (what) and implementation (how). Analyzes complet... |
371
+
372
+ ## Commands (Discovered)
373
+
374
+ | Command | Agent | Description |
375
+ |---|---|---|
376
+ | `/autocode-fix-from-review` | `executor` | Apply fixes from review report and run verification |
377
+ | `/autocode-fix-verification` | `verifier` | Fix features marked complete without proper verification evidence |
378
+ | `/autocode-next-step` | `executor` | Execute the next implementation step with verification |
379
+ | `/autocode-refine-spec` | `reviewer` | Iteratively refine an implementation spec before verification planning |
380
+ | `/autocode-verification-planner` | `verification-planner` | Generate verification criteria from sanitized spec |
381
+ | `/commit-push-pr` | `executor` | Commit, Push, and Create Pull Request |
382
+ | `/feature-demo` | `feature-demo` | Generate an executable demo document for a completed feature |
383
+ | `/full-web-audit` | `executor` | Comprehensive web quality audit (performance, accessibility, SEO, security) |
384
+ | `/parallel-plan` | `parallel-orchestrator` | Analyze FEATURES.json and plan parallelizable feature batches |
385
+ | `/review-changes` | `reviewer` | Review changes before commit or PR |
386
+ | `/review-frontend` | `frontend-reviewer` | Visual review of running frontend via Playwright MCP |
387
+ | `/techdebt` | `reviewer` | Analyze code for technical debt, duplications, and AI slop patterns |
388
+ | `/update-architecture-diagram` | `executor` | Refresh the System Diagram in ARCHITECTURE.md to match current codebase |
389
+ | `/update-model-routing` | `executor` | Refresh model routing recommendations with current pricing from models.dev |
390
+ | `/validate-spec` | `verifier` | Check if implementation matches spec; report discrepancies |
391
+
392
+ ## Ralph CLI (Autonomous Loops)
393
+
394
+ | Command | Purpose |
395
+ |---------|---------|
396
+ | `opencode-ctx ralph start <spec>` | Start server + browser + run loop + cleanup (all-in-one) |
397
+ | `opencode-ctx ralph run <spec>` | Run implementation loop (attach to existing server with `-s`) |
398
+ | `opencode-ctx ralph spec <spec>` | Iterative spec refinement (default: 3 iterations) |
399
+ | `opencode-ctx ralph review` | Review + fix loop on current changes |
400
+
401
+ Common options: `-m MODEL` (aliases: sonnet, opus, gpt-builder), `--max-iter N`
402
+
403
+ ## Scripts (Discovered — legacy, prefer `opencode-ctx ralph`)
404
+
405
+ | Script | Description |
406
+ |---|---|
407
+ | `cleanup-feature.sh` | Remove a parallel feature clone |
408
+ | `learnings.sh` | Query learnings (AGENTS.md or docs/LEARNINGS) |
409
+ | `parallel-feature.sh` | Create isolated clone for parallel feature development |
410
+ | `ralph-loop.sh` | Autonomous implementation loop (superseded by `opencode-ctx ralph run`) |
411
+ | `ralph-review-loop.sh` | Review + fix loop (superseded by `opencode-ctx ralph review`) |
412
+ | `ralph-spec-loop.sh` | Iterative spec refinement loop (superseded by `opencode-ctx ralph spec`) |
413
+ | `update-model-routing.sh` | Fetch model pricing from models.dev and generate routing tables |
docs/blog-material.md ADDED
@@ -0,0 +1,428 @@
1
+ # Blog Material — Raw Knowledge Dump
2
+
3
+ Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.
4
+
5
+ ## Training Run Summary
6
+
7
+ ### Run progression (what each run taught us)
8
+ 1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
9
+ 2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
10
+ 3. **Run 3**: Without KL penalty, GRPO drifts structural tokens (`<tool_response>` instead of `<tool_call>`)
11
+ 4. **Run 4**: KL penalty + reference model = OOM on L4
12
+ 5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
13
+ 6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
14
+ 7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
15
+ 8. **Run 8**: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
16
+ 9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions
17
+
18
+ ### Key numbers
19
+ | Metric | Value |
20
+ |--------|-------|
21
+ | Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
22
+ | Hardware | Colab L4 (24GB VRAM) |
23
+ | SFT data | 120 multi-turn trajectories, assistant_only_loss |
24
+ | Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
25
+ | Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
26
+ | Total training | ~7.3h on L4 for both phases |
27
+ | Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
28
+ | Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
29
+ | Parse rate | >98% (valid tool-call JSON) |
30
+ | Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
31
+ | Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
32
+ | Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |
33
+
34
+ ## Concrete Episode Examples (for blog)
35
+
36
+ ### Success: Simple aggregation (3 tool calls)
37
+ Question: "What is the total bonus given in all evaluations?"
38
+ 1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
39
+ 2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
40
+ 3. `answer("19500.0")` → correct
41
+ Reward: 1.15
42
+
43
+ ### Success: Error recovery with JOIN
44
+ Question: "Which employee received the biggest bonus?"
45
+ 1. `describe("employee")` → Employee_ID, Name, Age, City
46
+ 2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
47
+ 3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
48
+ 4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
49
+ 5. `answer("Louis Deacon")` → correct
50
+ Reward: 1.13
51
+
52
+ ### Success: Subquery with comparison
53
+ Question: "How many cars have greater accelerate than the one with the most horsepower?"
54
+ 1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
55
+ 2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
56
+ 3. `answer("39")` → correct
57
+ Reward: 1.15
58
+
59
+ ### Failure: Column name hallucination
60
+ Question: "What are all the makers and models?"
61
+ 1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
62
+ 2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
63
+ 3. `describe("model_list")` → ModelId, Maker, Model
64
+ 4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
65
+ 5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
66
+ Model found the right data but answer was truncated by the 20-row display limit.
67
+
68
+ ### Failure: FK chain reasoning
69
+ Question: "Which city has most number of arriving flights?"
70
+ 1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
71
+ 2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
72
+ 3. `answer("Albany")` → incorrect
73
+ Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.
74
+
75
+ ### Failure: Invented table
76
+ Question: "What are all details of students who registered but did not attend any course?"
77
+ 1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
78
+ 2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
79
+ 3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
80
+ Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.
81
+
82
+ ## Reward Architecture Details
83
+
84
+ ### Three-layer structure
85
+ ```
86
+ L1 Operational (every step):
87
+ +0.02 exec_ok
88
+ +0.01 new_info (unique SQL hash)
89
+ -0.03 repeat penalty
90
+ -0.02 step cost
91
+
92
+ L2 Progress (QUERY only):
93
+ Delta from previous binned progress × 0.15
94
+ Binned to {0, 0.25, 0.5, 0.75, 1.0}
95
+
96
+ L3 Terminal (ANSWER only):
97
+ +1.0 correct, 0.0 wrong
98
+
99
+ Per-step clip: [-0.10, 0.15]
100
+ ```
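Put together, the three layers can be sketched as one per-step function. This is a minimal illustration, not the project's actual code: the signature, rounding progress to the nearest bin, and placing the terminal bonus outside the per-step clip are all assumptions.

```python
def step_reward(action, exec_ok, sql_hash, seen_hashes,
                progress, prev_progress, correct=None):
    """Per-step reward: L1 operational + L2 progress + L3 terminal."""
    r = -0.02                        # L1: flat step cost
    if exec_ok:
        r += 0.02                    # L1: execution succeeded
    if action == "QUERY":
        if sql_hash in seen_hashes:
            r -= 0.03                # L1: repeat penalty
        else:
            r += 0.01                # L1: new-information bonus
        # L2: delta of binned progress (potential-based shaping)
        def binned(p):
            return min([0.0, 0.25, 0.5, 0.75, 1.0], key=lambda b: abs(b - p))
        r += (binned(progress) - binned(prev_progress)) * 0.15
    r = max(-0.10, min(0.15, r))     # per-step clip on L1 + L2
    if action == "ANSWER":
        r += 1.0 if correct else 0.0 # L3: terminal reward, assumed unclipped
    return r
```

A first query that executes cleanly and lifts binned progress from 0 to 0.5 earns 0.085; repeating the same query nets -0.03; a correct answer adds the full +1.0 on top of the operational terms.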
101
+
102
+ ### Why potential-based shaping matters
103
+ - Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
104
+ - Our delta progress IS potential-based with γ=1
105
+ - Cumulative caps are NOT potential-based (depend on trajectory history)
106
+ - Without this guarantee, agents learn to farm exploration rewards
107
+
108
+ ### Anti-farming mechanisms
109
+ - Hard budget (15 steps)
110
+ - Step cost (-0.02)
111
+ - Repeat penalty (-0.03)
112
+ - Terminal dominance (1.0 vs ~0.3 max exploration)
113
+ - Per-step clip [-0.10, 0.15]
114
+ - Post-episode penalty (-0.3)
115
+
116
+ ## Eval Results (N=50, 2026-04-11)
117
+
118
+ ### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
119
+ | Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
120
+ |--------|----------|------------|-----------|------------|-------------|----------------|
121
+ | zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
122
+ | 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
123
+ | 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
124
+ | GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
125
+ | GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |
126
+
127
+ ### Previous run (Run A, same day, same seed)
128
+ | Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
129
+ |--------|----------|------------|-----------|------------|----------------|
130
+ | zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
131
+ | 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
132
+ | 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
133
+ | GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
134
+ | GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |
135
+
136
+ ### Run-to-run variation (important for blog)
137
+ v1 and v2 show similar accuracy, within noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.
138
+
139
+ The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.
140
+
141
+ Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
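The no-op fallback can be sketched as follows. This is a hypothetical illustration of the behavior described above; the function name and the exact `<tool_call>` extraction are assumptions.

```python
import json
import re


def parse_tool_call(raw: str):
    """Extract a tool call from model output; fall back to a no-op on failure.

    Instead of ending the episode on a parse failure, return a harmless
    DESCRIBE so the model keeps its remaining step budget (the step is
    still spent, matching the base-model budget-exhaustion numbers above).
    """
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
    if m:
        try:
            call = json.loads(m.group(1))
            return call["name"], call.get("arguments", {})
        except (json.JSONDecodeError, KeyError):
            pass
    return "describe", {}  # no-op fallback: episode continues
```

A base model emitting "Task complete." fifteen times simply burns fifteen no-op describes instead of terminating early.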
142
+
143
+ ### Key observations from N=50 eval (with retry, 2 runs)
144
+ 1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
145
+ 2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
146
+ 3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
147
+ 4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
148
+ 5. **GRPO failure mode is SQL quality, not format** — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
149
+ 6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
150
+ 7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
151
+
152
+ ### v2 vs v1: similar accuracy, more parse failures — behavioral shift
153
+ Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:
154
+
155
+ - **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
156
+ - **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
157
+ - **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.
158
+
159
+ The v2 parse failure pattern (from raw output):
160
+ ```
161
+ [OK] DESCRIBE: country
162
+ [OK] QUERY: SELECT Name FROM country WHERE Population < ...
163
+ [PARSE FAIL] raw: Task complete. ← gives up, doesn't call answer()
164
+ [PARSE FAIL] raw: Task complete. ← repeats until budget
165
+ ```
166
+
167
+ Compare v1 on the same type of question:
168
+ ```
169
+ [OK] DESCRIBE: country
170
+ [OK] QUERY: SELECT Name FROM country WHERE ...
171
+ [OK] ANSWER: European cities and their names are: 42 ← wrong, but at least calls answer()
172
+ ```
173
+
174
+ This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.
175
+
176
+ ### For the blog narrative
177
+ The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.
178
+
179
+ ## Format Mismatch Discovery (F011)
180
+
181
+ ### The three differences between eval and training
182
+ 1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
183
+ 2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`, eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
184
+ 3. **No separator vs `\n\n`** — TRL appends `reset()` return directly to user message. Eval had `question\n\ntable_hint`.
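The mismatch is easiest to see side by side. These are illustrative, abbreviated payloads built from the differences listed above:

```python
# Training (TRL rollout): the assistant turn carries structured tool_calls,
# and the tool result comes back with role "tool".
train_assistant = {
    "role": "assistant",
    "tool_calls": [{
        "function": {"name": "describe", "arguments": '{"table": "employee"}'},
    }],
}
train_result = {"role": "tool", "content": "Employee_ID, Name, Age, City"}

# Eval (before the fix): the same call inlined as raw text,
# and the tool result sent back as a plain user message.
eval_assistant = {
    "role": "assistant",
    "content": '<tool_call>{"name": "describe", '
               '"arguments": {"table": "employee"}}</tool_call>',
}
eval_result = {"role": "user", "content": "Employee_ID, Name, Age, City"}
```

Both shapes pass through the same chat template but render to different token sequences, so the checkpoint never saw the eval-side layout during training.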
185
+
186
+ ### Impact
187
+ Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint).
188
+ After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.
189
+
190
+ ### Lesson
191
+ Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.
192
+
193
+ ## Multi-Turn SFT — Why It's Critical
194
+
195
+ ### Per-turn SFT (broken)
196
+ - 347 examples, each one assistant turn
197
+ - ~50% were describe calls
198
+ - Model learned: "when asked a question, call describe"
199
+ - With KL penalty, model stayed anchored to this single-action policy
200
+ - Result: reward=0.00, all rollouts identical, advantage=0
201
+
202
+ ### Multi-turn SFT (working)
203
+ - 120 examples, each a full describe→query→answer trajectory
204
+ - `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
205
+ - Model learned: the SEQUENCE describe→query→answer
206
+ - With KL penalty, model explores within the multi-turn strategy
207
+ - Result: reward climbs to 0.7 in Phase 1
208
+
209
+ ### Template patch detail
210
+ Qwen3's chat template lacks the `{% generation %}` tags needed by TRL for `assistant_only_loss`. We patch the template before SFT and restore the original before GRPO (TRL does exact-match checks on the template string in `add_response_schema()` and `get_training_chat_template()`).
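A minimal sketch of the patch-and-restore dance, using a toy template. The real Jinja edits to Qwen3's template are more involved; the template string and helper name here are invented for illustration.

```python
# Toy chat template standing in for Qwen3's (the real one is much longer).
TEMPLATE = (
    "{% for m in messages %}"
    "{% if m.role == 'assistant' %}A:{{ m.content }}\n"
    "{% else %}U:{{ m.content }}\n{% endif %}"
    "{% endfor %}"
)


def add_generation_tags(tmpl: str) -> str:
    """Wrap only the assistant-content branch in {% generation %} tags,
    so TRL's assistant_only_loss can mask the loss to assistant tokens."""
    return tmpl.replace(
        "A:{{ m.content }}",
        "A:{% generation %}{{ m.content }}{% endgeneration %}",
    )


patched = add_generation_tags(TEMPLATE)
# SFT runs with `patched`; before GRPO, restore TEMPLATE verbatim,
# since TRL compares the template string character-for-character.
```

Keeping the untouched original string around is the important part: any drift (even whitespace) fails TRL's exact-match checks at GRPO time.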
211
+
212
+ ## The 0.6B Capacity Ceiling
213
+
214
+ ### What works at 0.6B
215
+ - Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
216
+ - Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
217
+ - WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
218
+ - Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
219
+ - Error recovery: describe after SQL error, retry with correct column names
220
+ - `sample` tool usage (learned in Run 6, inconsistent later)
221
+
222
+ ### What breaks at 0.6B
223
+ - FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
224
+ - Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
225
+ - Ambiguous column resolution: joins with same column name in both tables
226
+ - Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
227
+ - "stadium without concert" pattern: NOT IN with JOIN to get names
228
+ - Aggregate + GROUP BY + HAVING chains on multi-table joins
229
+
230
+ ### The hallucination pattern
231
+ The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.
232
+
233
+ ## Thinking Mode Observations (Run 8)
234
+
235
+ ### Benefits
236
+ - Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
237
+ - Empty `<think></think>` on easy questions — token-efficient, emergent behavior
238
+ - Multi-step join planning in think blocks
239
+
240
+ ### New failure mode
241
+ ~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage.
242
+
243
+ ### OOM risk
244
+ Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.
245
+
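The token-budget arithmetic behind the fix, as a back-of-envelope sketch (actual peak memory also depends on KV cache, prompt lengths, and batch size):

```python
# Generated tokens per GRPO step scale with num_generations * max_new_tokens.
crashed = 4 * 1280   # Phase 2 config that hit OOM at step 182
fix_a   = 4 * 1024   # option A: reduce max_new_tokens
fix_b   = 3 * 1280   # option B: reduce num_generations
assert crashed == 5120 and fix_a == 4096 and fix_b == 3840
```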
246
+ ## Narrative Threads for Blog
247
+
248
+ ### "The environment is the product"
249
+ From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.
250
+
251
+ ### "Small model showing improvement proves more than large model with marginal gains"
252
+ A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.
253
+
254
+ ### "Analysts don't write perfect queries from scratch"
255
+ The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.
256
+
257
+ ### "Dense rewards need theory"
258
+ Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
259
+
260
+ ### "Multi-turn SFT teaches strategy, not actions"
261
+ The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.
262
+
263
+ ## References for Blog
264
+
265
+ - Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
266
+ - DeepSeek-AI (2025). DeepSeek-R1.
267
+ - Shao et al. (2024). DeepSeek-Math: GRPO.
268
+ - Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
269
+ - Yu et al. (2018). Spider dataset.
270
+ - Li et al. (2023). BIRD benchmark.
271
+ - TIPS (2026). Turn-Level Information-Potential Reward Shaping.
272
+ - ToolRL (2025). Reward is All Tool Learning Needs.
273
+ - StepTool (2024). Step-grained RL for Tool Learning.
274
+
275
+ ## Showcase Notebook Transcripts (for blog)
276
+
277
+ ### Random agent episode (seed=7) — comedic failure
278
+ Question: "Count the number of paragraphs."
279
+ ```
280
+ SAMPLE Paragraphs → reward=0.015
281
+ SAMPLE Documents → reward=0.015
282
+ DESCRIBE Documents → reward=0.015
283
+ SAMPLE Documents → reward=0.015 (repeat)
284
+ DESCRIBE Documents → reward=0.015 (repeat)
285
+ DESCRIBE Documents → reward=0.015 (repeat)
286
+ DESCRIBE Templates → reward=0.015
287
+ SAMPLE Documents → reward=0.015 (repeat)
288
+ DESCRIBE Documents → reward=0.015 (repeat)
289
+ QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
290
+ DESCRIBE Documents → reward=0.015 (repeat)
291
+ DESCRIBE Paragraphs → reward=0.015
292
+ QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
293
+ QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
294
+ ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
295
+ ```
296
+ Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.
297
+
298
+ ### Oracle agent episode (seed=0) — clean solve
299
+ Question: "List the id of students who registered some courses and the number of their registered courses?"
300
+ ```
301
+ Step 1: DESCRIBE student_course_registrations
302
+ → student_id INTEGER, course_id INTEGER, registration_date DATETIME
303
+ → reward: +0.015
304
+
305
+ Step 2: DESCRIBE students
306
+ → student_id INTEGER, student_details VARCHAR(255)
307
+ → reward: +0.015
308
+
309
+ Step 3: QUERY
310
+ SELECT T1.student_id, count(*)
311
+ FROM students AS T1
312
+ JOIN student_course_registrations AS T2
313
+ ON T1.student_id = T2.student_id
314
+ GROUP BY T1.student_id
315
+ → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
316
+ → reward: +0.150
317
+
318
+ Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
319
+ → correct
320
+ → reward: +1.000
321
+ ```
322
+ Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.
323
+
324
+ ### Baseline comparison (50 episodes each)
325
+ | Policy | Success Rate | Avg Reward | Avg Steps |
326
+ |--------|-------------|------------|-----------|
327
+ | Random | 0.0% | 0.247 | 15.0 |
328
+ | Oracle | 100.0% | 1.168 | 3.5 |
329
+
330
+ The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.
331
+
332
+ ### Reward constants (from server/reward.py)
333
+ ```
334
+ +0.02 successful execution (no errors)
335
+ +0.01 new information (unique query)
336
+ -0.02 step cost (every action)
337
+ -0.03 repeat penalty (duplicate SQL)
338
+ [-0.10, +0.15] per-step clipping range
339
+ +1.0 correct answer (terminal)
340
+ +0.0 wrong answer (terminal)
341
+ ```
342
+ Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.
343
+
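As a sketch, the per-step constants compose like this (illustrative function, not the actual code in server/reward.py, which also adds the L2 progress term before clipping):

```python
def step_reward(executed_ok: bool, new_info: bool, is_repeat: bool) -> float:
    r = -0.02                        # step cost on every action
    if executed_ok:
        r += 0.02                    # successful execution (no errors)
    if new_info:
        r += 0.01                    # unique query revealed new information
    if is_repeat:
        r -= 0.03                    # duplicate SQL penalty
    return max(-0.10, min(0.15, r))  # per-step clipping range

# Best case per step: 0.02 + 0.01 - 0.02 = +0.01 before progress rewards.
assert abs(step_reward(True, True, False) - 0.01) < 1e-9
assert abs(step_reward(False, False, True) - (-0.05)) < 1e-9
```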
344
+ ## Competition Context
345
+
346
+ ### OpenEnv Challenge (our target)
347
+ - Sponsors: PyTorch/Meta, HuggingFace, Unsloth
348
+ - Prize: $10K HF credits
349
+ - Judging: primarily blog-based
350
+ - Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper
351
+ - Green Agent wrapper is an explicit judging criterion in the OpenEnv Challenge.
352
+
353
+ ### Deliverables
354
+ 1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env
355
+ (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
356
+ 2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`,
357
+ `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
358
+ 3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)
359
+
360
+ ### TRL integration status (already done — do not re-research)
361
+ `training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory`
362
+ class: `reset()` + named tool methods `describe() / sample() / query() /
363
+ answer()` with docstrings TRL uses to build the tool schema. The notebook
364
+ passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL,
365
+ reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment`
366
+ **in-process** (not a WebSocket client to the HF Space) — intentional, because
367
+ training opens N parallel sessions and the Space defaults to 1.
368
+
369
+ ### Competitive landscape
370
+ - **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
371
+ - **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
372
+ - **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
373
+ - Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.
374
+
375
+ ### What winning entries do
376
+ 1. Stakes framing — "this matters in production"
377
+ 2. Concrete failure analysis with numbers
378
+ 3. Contrast (random vs trained vs oracle)
379
+ 4. Real data, not toy puzzles
380
+ 5. Non-obvious insights from training
381
+
382
+ ## Green Agent Evaluator
383
+
384
+ ### What it is
385
+ OpenEnv's standardized evaluation wrapper pattern. A `Policy` protocol with `evaluate(env, policy, n_episodes, seed)` that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.
386
+
387
+ ### Implementation
388
+ - `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
389
+ - `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
390
+ - `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)
391
+
392
+ ### How it works
393
+ ```python
394
+ from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy
395
+
396
+ # Run 50 episodes with random policy
397
+ result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
398
+ print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
399
+
400
+ # Run with trained policy (any class with select_action method)
401
+ result = evaluate(env, trained_policy, n_episodes=50, seed=42)
402
+ ```
403
+
404
+ ### Where it's used
405
+ - `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
406
+ - `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`
407
+
408
+ ### Key design choices
409
+ - **Error isolation**: one episode crashing doesn't kill the run — logged as `EpisodeResult(error=str(exc))`
410
+ - **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
411
+ - **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required
412
+ - **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and full `episodes` list for drill-down
413
+
414
+ ### For the blog
415
+ The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.
416
+
417
+ ## Files to Reference
418
+
419
+ | File | Relevance |
420
+ |------|-----------|
421
+ | `docs/blog-outline.md` | Section structure template |
422
+ | `docs/blog-post.md` | Current draft |
423
+ | `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
424
+ | `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
425
+ | `vision/VISION.md` | Product vision, success metrics |
426
+ | `training/trl_adapter.py` | Environment adapter code |
427
+ | `notebooks/compare_methods.ipynb` | Eval notebook |
428
+ | `notebooks/train_grpo.ipynb` | Training notebook |
docs/blog-outline.md CHANGED
@@ -1,56 +1,114 @@
1
  # SQLEnv Blog Post Outline
2
 
3
- ## 1) Hook: Teaching AI to Think Like a Data Analyst
4
 
5
- - Open with a concrete moment: an agent sees a new schema and must reason through uncertainty instead of guessing one SQL query.
6
- - Frame the core idea: SQL competence is not only syntax generation; it is iterative investigation with feedback.
7
- - Position SQLEnv as a training ground where agents learn exploration habits that mirror analyst workflows.
8
 
9
- ## 2) The Problem: Static Benchmarks Reward Memorization
 
 
 
10
 
11
- - Explain why single-shot text-to-SQL can hide brittle behavior when schemas, table names, or data distributions shift.
12
- - Show that leaderboard accuracy does not guarantee robust reasoning on unfamiliar databases.
13
- - Describe the gap: most benchmarks grade final answers but ignore how the model arrived there.
14
- - Tie this directly to user pain: correct-looking SQL can fail in real environments where context changes every session.
15
 
16
- ## 3) Our Approach: SQLEnv as an Interactive RL Environment
17
 
18
- - Introduce the action loop: `DESCRIBE`, `SAMPLE`, `QUERY`, and `ANSWER` as the minimum interface for grounded exploration.
19
- - Explain that each episode starts with a natural-language question and a hidden schema to force discovery.
20
- - Highlight OpenEnv compatibility so the environment can run with standard training tooling and deployment flows.
21
 
22
- ## 4) How SQLEnv Works End-to-End
23
 
24
- - Walk through one episode narrative: inspect table shapes, sample data, run targeted joins, then submit an answer.
25
- - Summarize reward design in plain language: reward reliable execution, reward progress toward the goal, and strongly reward final correctness.
26
- - Note guardrails: read-only SQL execution, query timeout, and clear error messages to prevent unsafe or confusing behavior.
27
 
28
- ## 5) Training with GRPO
 
 
 
 
 
 
29
 
30
- - Briefly explain GRPO as a practical policy optimization method for improving multi-step tool use behavior.
31
- - Connect training signals to environment telemetry: each step gives usable feedback rather than waiting for terminal reward only.
32
- - Clarify expected outcome: strategic behavior should improve over random baselines even with modest compute.
 
33
 
34
- ## 6) Results
35
 
36
- - [PLACEHOLDER: Insert F006 metrics for success rate, average reward, and episode efficiency.]
37
- - Compare random baseline, trained policy, and oracle policy to show both practical gains and theoretical ceiling.
38
- - Include one short failure case to show where the policy still struggles and why that insight is useful.
39
 
40
- ## 7) Technical Highlights
41
 
42
- - Multi-database Spider coverage with structured metadata and deterministic train/eval split.
43
- - Typed action and observation models that make environment interactions explicit and debuggable.
44
- - Deployment-ready packaging for HuggingFace Spaces with bundled databases and health checks.
45
 
46
- ## 8) Try It Yourself
 
 
47
 
48
- - HuggingFace Space: add live link and a one-line instruction for connecting and running a first episode.
49
- - Colab notebook: link `notebooks/train_grpo.ipynb` with notes on expected runtime and CPU compatibility.
50
- - GitHub repository: link setup steps, architecture docs, and verification artifacts for reproducibility.
51
 
52
- ## 9) What We Learned
53
 
54
- - Dense intermediate rewards improve learning speed only when they align with the final objective.
55
- - Tool-using agents benefit from transparent errors; better diagnostics create better policy updates.
56
- - Packaging and storytelling matter: a reproducible deployment and clear narrative are as important as benchmark numbers for adoption.
 
1
  # SQLEnv Blog Post Outline
2
 
3
+ ## 1) Cold Open: Two Agents, Same Question
4
 
5
+ Open with two transcripts side by side: no explanation yet, just show the contrast.
 
 
6
 
7
+ **Random agent** (from showcase notebook, seed=7):
8
+ - "Count the number of paragraphs."
9
+ - SAMPLEs the same table 4 times, DESCRIBEs Documents 5 times, runs 3 SELECT * queries, then submits a random row as the answer.
10
+ - Reward: 0.278, incorrect.
11
 
12
+ **Trained agent** (from blog-material, error recovery example):
13
+ - "Which employee received the biggest bonus?"
14
+ - Describes employee, tries wrong column (Salary), gets error, describes evaluation to find Bonus column, writes correct JOIN, answers "Louis Deacon".
15
+ - Reward: 1.13, correct.
16
 
17
+ One explored strategically. The other wandered. Both had the same tools, the same budget, the same database. The difference is learned behavior.
18
 
19
+ ## 2) The Gap (3 sentences, not a section)
 
 
20
 
21
+ Text-to-SQL benchmarks give the model the full schema and ask for one query. That tests memorization, not investigation. SQLEnv hides the schema and gives the agent a step budget — forcing it to develop the exploration strategy that makes human analysts reliable.
22
 
23
+ ## 3) Four Actions, One Budget
 
 
24
 
25
+ Introduce the action space through the oracle episode (showcase notebook, seed=0):
26
+ - Question: "List the id of students who registered some courses and the number of their registered courses?"
27
+ - Step 1: DESCRIBE student_course_registrations → sees columns (+0.015)
28
+ - Step 2: DESCRIBE students → sees student_id (+0.015)
29
+ - Step 3: QUERY with JOIN + GROUP BY → gets the answer (+0.150)
30
+ - Step 4: ANSWER → correct (+1.000)
31
+ - Total: 1.180 in 4 steps.
32
 
33
+ Then show the reward architecture table:
34
+ - L1 Operational: +0.02 execution, +0.01 new info, -0.01 repeats, -0.005 step cost
35
+ - L2 Progress: delta from previous query result (potential-based)
36
+ - L3 Terminal: +1.0 correct, 0.0 wrong
37
 
38
+ Key point: terminal dominates. Max exploration over 15 steps is ~0.3; correct answer is 1.0. No farming.
39
 
40
+ ## 4) Training: SFT Teaches Strategy, GRPO Refines It
 
 
41
 
42
+ NOT "here's how GRPO works." Lead with the insight:
43
 
44
+ - Per-turn SFT (347 examples) taught the model to call describe forever. It never learned when to query or answer.
45
+ - Multi-turn SFT (120 full trajectories with assistant_only_loss) taught describe-then-query-then-answer as a coherent strategy.
46
+ - GRPO then refined this into real behaviors: error recovery, answer formatting, knowing when to stop.
47
 
48
+ Two-phase curriculum:
49
+ - Phase 1: Easy questions with KL penalty — stabilize format
50
+ - Phase 2: Easy + medium without KL — allow exploration
51
 
52
+ Show the reward curve: -0.1 to 0.5-0.7 over 400 steps. Clear learning signal.
 
 
53
 
54
+ ## 5) What the Agent Learned
55
 
56
+ Lead with observed behaviors, not metrics:
57
+ - **Schema discovery**: always describes before querying
58
+ - **Error recovery**: wrong column → re-describe → correct retry (concrete example)
59
+ - **Answer formatting**: comma-separated lists, pipe-delimited rows, [] for empty results
60
+ - **Subquery composition**: NOT IN, GROUP BY HAVING, UNION queries
61
+
62
+ These emerged from reward signal, not hard-coded rules.
63
+
64
+ Comparison table (N=50 eval, 2026-04-11):
65
+
66
+ | Method | Accuracy | Parse Rate | Avg Steps |
67
+ |--------|----------|------------|-----------|
68
+ | Zero-shot | 0% | 28% | 10.8 |
69
+ | 1-shot | 0% | 16% | 14.8 |
70
+ | 3-shot | 0% | 20% | 13.8 |
71
+ | GRPO v1 | 28% | 95% | 4.0 |
72
+ | GRPO v2 | 32% | 87% | 3.7 |
73
+
74
+ Two things stand out. First, 95% parse rate — the trained model almost always produces valid tool-call JSON. The base model fails 72-84% of the time and wastes its entire step budget repeating the same malformed output. Second, 28-32% accuracy from 0% — the environment produces genuine learning. The base model can't get a single answer right even with 3 examples; the trained model solves 14-16 out of 50 in just 3-4 steps.
75
+
76
+ ## 6) What the Agent Can't Do (The Interesting Part)
77
+
78
+ This is where small models hit a wall — and the wall tells us something about the environment.
79
+
80
+ - **Column name hallucination**: reads `FullName` from DESCRIBE, writes `full_name` in SQL. Pretraining biases override in-context schema. A 0.6B model can't fight its own weights.
81
+ - **FK chain reasoning**: single-table queries work; 3+ table JOINs don't. The model can't chain Documents -> Templates -> Ref_Template_Types.
82
+ - **More RL doesn't help**: v2 (double the training steps) produced identical accuracy. The ceiling is pretraining knowledge, not training budget.
83
+
84
+ This is actually the point: the environment produces a clear learning signal that saturates at the model's capacity. A larger model (or better SFT on JOIN patterns) would push higher. The environment scales; the 0.6B model doesn't.
85
+
86
+ ## 7) Reward Theory (Brief — For Technical Judges)
87
+
88
+ One paragraph: potential-based shaping (Ng et al., 1999). Our delta progress rewards take the form F(s,s') = gamma*phi(s') - phi(s), which provably preserves the optimal policy. Without this guarantee, agents learn to farm exploration rewards instead of answering questions. We observed this directly when we tried cumulative progress caps (not potential-based) — the agent explored endlessly.
89
+
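The policy-invariance argument fits in a few lines: potential-based shaping terms telescope, so the shaped return differs from the base return only by a constant phi(s_0). A sketch with illustrative potential values:

```python
gamma = 1.0                                  # episodic setting
phi = [0.0, 0.3, 0.5, 0.8]                   # illustrative potentials along one trajectory
shaping = [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]
# The sum telescopes: every intermediate phi cancels.
assert abs(sum(shaping) - (phi[-1] - phi[0])) < 1e-9
```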
90
+ ## 8) Technical Highlights (Bullet List)
91
+
92
+ - 10 Spider databases with structured metadata and deterministic train/eval split
93
+ - Typed action and observation models (Pydantic) — every interaction is explicit
94
+ - Read-only SQL via SQLite mode=ro — safety from the database engine, not regex
95
+ - TRL environment_factory integration — plugs into standard GRPO training
96
+ - Docker packaging for HuggingFace Spaces with health checks
97
+ - 473 training / 50 eval questions across easy/medium difficulty
98
+
99
+ ## 9) Try It Yourself
100
+
101
+ - HuggingFace Space: [live demo]
102
+ - Training notebook: notebooks/train_grpo.ipynb — runs on Colab L4 in ~7 hours
103
+ - Showcase notebook: notebooks/showcase_sqlenv.ipynb — explore the environment interactively
104
+ - GitHub: full source, architecture docs
105
+
106
+ ## 10) What We Learned (Close with Insights)
107
+
108
+ Three non-obvious findings:
109
+
110
+ 1. **Multi-turn SFT teaches strategy, not actions.** Per-turn examples teach vocabulary; multi-turn examples teach conversation. The difference is the difference between a model that calls describe forever and one that knows when to answer.
111
+
112
+ 2. **Transparent errors produce better policies.** When the environment surfaces "Error: no such column: full_name" instead of empty results, the agent develops error-recovery strategies. Better diagnostics produce better gradient signal.
113
+
114
+ 3. **Dense rewards need theory.** Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we observed agents farming exploration rewards at the expense of answering questions.
docs/blog-post-v1-preview.html ADDED
@@ -0,0 +1,403 @@
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>SQLEnv Blog Post Preview</title>
7
+ <style>
8
+ :root { --bg: #0d1117; --surface: #161b22; --card: #1a1a2e; --text: #e0e0e0; --muted: #8b949e; --accent: #58a6ff; --border: #30363d; }
9
+ * { margin: 0; padding: 0; box-sizing: border-box; }
10
+ body { background: var(--bg); color: var(--text); font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; line-height: 1.7; max-width: 820px; margin: 0 auto; padding: 40px 24px 80px; }
11
+ h1 { font-size: 2em; margin: 0 0 32px; color: #fff; line-height: 1.2; }
12
+ h2 { font-size: 1.5em; margin: 48px 0 16px; color: #fff; border-bottom: 1px solid var(--border); padding-bottom: 8px; }
13
+ h3 { font-size: 1.15em; margin: 32px 0 12px; color: #fff; }
14
+ p { margin: 0 0 16px; }
15
+ em { color: var(--muted); }
16
+ strong { color: #fff; }
17
+ a { color: var(--accent); text-decoration: none; }
18
+ a:hover { text-decoration: underline; }
19
+ code { background: var(--surface); padding: 2px 6px; border-radius: 4px; font-size: 0.9em; font-family: 'SF Mono', 'Fira Code', monospace; }
20
+ ul, ol { margin: 0 0 16px 24px; }
21
+ li { margin-bottom: 6px; }
22
+ li code { font-size: 0.85em; }
23
+ table { width: 100%; border-collapse: collapse; margin: 16px 0; }
24
+ th { text-align: left; padding: 10px 12px; background: var(--surface); border: 1px solid var(--border); color: #fff; font-size: 0.9em; }
25
+ td { padding: 10px 12px; border: 1px solid var(--border); font-size: 0.9em; }
26
+ tr.row-grpo { background: rgba(76, 175, 80, 0.1); }
27
+ tr.row-grpo td { color: #fff; font-weight: 600; }
28
+ tr.row-grpo td:nth-child(2) { color: #81c784; }
29
+ tr.row-base td:nth-child(2) { color: #e57373; }
30
+ img { max-width: 100%; border-radius: 8px; margin: 16px 0; }
31
+ figcaption, .caption { font-size: 0.85em; color: var(--muted); font-style: italic; margin-top: -8px; margin-bottom: 16px; }
32
+
33
+ /* Copyable code block */
34
+ .code-block { position: relative; margin: 16px 0; }
35
+ .code-block pre { background: var(--surface); border: 1px solid var(--border); border-radius: 8px; padding: 16px; padding-right: 48px; overflow-x: auto; font-size: 13px; line-height: 1.5; font-family: 'SF Mono', 'Fira Code', monospace; color: #c9d1d9; margin: 0; }
36
+ .code-block.scrollable pre { font-size: 12px; }
37
+ .reward-comment { color: #81c784; }
38
+ .copy-btn { position: absolute; top: 8px; right: 8px; background: var(--border); border: none; border-radius: 4px; padding: 4px 8px; cursor: pointer; color: var(--muted); font-size: 12px; font-family: inherit; transition: background 0.15s, color 0.15s; z-index: 1; }
39
+ .copy-btn:hover { background: var(--muted); color: var(--bg); }
40
+ .copy-btn.copied { background: #4caf50; color: #fff; }
41
+
42
+ /* Cards */
43
+ .cards { display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap; }
44
+ .card { flex: 1; min-width: 240px; background: var(--card); border-radius: 12px; padding: 24px 20px; display: flex; flex-direction: column; }
45
+ .card-header { text-align: center; margin-bottom: 14px; }
46
+ .card-header .emoji { font-size: 32px; display: block; margin-bottom: 6px; }
47
+ .card-header .title { font-weight: bold; font-size: 1.05em; }
48
+ .card-question { color: var(--muted); font-size: 13px; margin-bottom: 14px; text-align: center; height: 36px; display: flex; align-items: center; justify-content: center; }
49
+ .card-code { background: var(--bg); border: none; border-radius: 8px; padding: 10px 12px; font-size: 10.5px; margin: 0; line-height: 1.45; font-family: 'SF Mono', 'Fira Code', monospace; color: #c9d1d9; overflow-x: auto; height: 130px; white-space: pre; }
50
+ .card-outcome { margin-top: 12px; font-size: 13px; font-weight: 600; height: 24px; display: flex; align-items: center; }
51
+ .card-insight { font-size: 12px; color: var(--muted); margin-top: 8px; line-height: 1.5; border-top: 1px solid var(--border); padding-top: 8px; height: 40px; }
52
+
53
+ .green { color: #4caf50; }
54
+ .red { color: #ef5350; }
55
+ .blue { color: #4fc3f7; }
56
+ .orange { color: #ffb74d; }
57
+ .neg { color: #e57373; opacity: 0.85; }
58
+ .pos { color: #81c784; opacity: 0.85; }
59
+ .marker { font-size: 0.85em; font-weight: 600; }
60
+
61
+ /* Cold open side-by-side */
62
+ .cold-open { display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap; }
63
+ .cold-open-panel { flex: 1; min-width: 340px; background: var(--surface); border-radius: 12px; padding: 20px; border: 1px solid var(--border); }
64
+ .cold-open-panel h3 { margin: 0 0 4px; font-size: 1em; border: none; padding: 0; color: #fff; }
65
+ .cold-open-panel .subtitle { color: var(--muted); font-size: 13px; margin-bottom: 12px; }
66
+ .cold-open-panel pre { margin: 0; padding: 12px; font-size: 10.5px; border: none; background: var(--bg); border-radius: 8px; height: 200px; overflow-y: auto; }
67
+ .cold-open-panel .verdict { margin-top: 12px; font-size: 13px; }
68
+ .cold-open-panel .verdict .red { color: #e57373; }
69
+ .cold-open-panel .verdict .green { color: #81c784; }
70
+
71
+ /* Color legend */
72
+ .legend { display: flex; gap: 16px; margin: 8px 0 4px; font-size: 12px; color: var(--muted); flex-wrap: wrap; }
73
+ .legend-item { display: flex; align-items: center; gap: 4px; }
74
+ .legend-swatch { width: 10px; height: 10px; border-radius: 2px; display: inline-block; }
75
+ </style>
76
+ </head>
77
+ <body>
78
+
79
+ <h1>SQLEnv: Teaching Small Models to Explore Databases Like Analysts</h1>
80
+
81
+ <h2>Untrained vs Trained Agent</h2>
82
+
83
+ <div class="cold-open">
84
+
85
+ <div class="cold-open-panel">
86
+ <h3>Untrained agent</h3>
87
+ <div class="subtitle"><em>"Count the number of paragraphs."</em></div>
88
+ <pre><span class="neg">(-)</span> DESCRIBE Documents ×5 ← same table, five times
89
+ <span class="neg">(-)</span> SAMPLE Documents ×3 ← already saw these rows
90
+ <span class="neg">(-)</span> DESCRIBE Templates ← wrong table
91
+ <span class="pos">(+)</span> DESCRIBE Paragraphs ← finally the right table
92
+ <span class="neg">(-)</span> QUERY SELECT * LIMIT 5 ← no aggregation
93
+ <span class="neg">(-)</span> QUERY SELECT * LIMIT 5 ← still no COUNT(*)
94
+ <span class="neg">(-)</span> ANSWER "76 | Robbin CV" ← a random row</pre>
95
+ <div class="verdict"><span class="red">15 steps · reward 0.28 · never wrote SELECT COUNT(*)</span></div>
96
+ </div>
97
+
98
+ <div class="cold-open-panel">
99
+ <h3>Trained agent</h3>
100
+ <div class="subtitle"><em>"Which employee received the biggest bonus?"</em></div>
101
+ <pre><span class="pos">(+)</span> DESCRIBE employee
102
+ → Employee_ID, Name, Age, City
103
+ <span class="neg">(-)</span> QUERY ...ORDER BY Salary DESC
104
+ → Error: no such column: Salary
105
+ <span class="pos">(+)</span> DESCRIBE evaluation
106
+ → Employee_ID, Year_awarded, Bonus
107
+ <span class="pos">(+)</span> QUERY ...JOIN...ORDER BY Bonus DESC
108
+ → Louis Deacon
109
+ <span class="pos">(+)</span> ANSWER "Louis Deacon"</pre>
110
+ <div class="verdict"><span class="green">5 steps · reward 1.13 · recovered from error</span></div>
111
+ </div>
112
+
113
+ </div>
114
+
115
+ <p>Both agents have the same four tools, the same 15-step budget, and the same databases. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.</p>
116
+
117
+ <h2>The Gap</h2>
118
+
119
+ <p>Standard text-to-SQL evaluation works like this: hand the model a complete schema (all tables, all columns, all relationships) and ask it to produce one SQL query. If the query matches the gold answer, it passes. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.</p>
120
+
121
+ <p>SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.</p>
122
+
123
+ <h2>What Analysts Actually Do</h2>
124
+
125
+ <p>Consider the situation where you need to answer a question using data in an unfamiliar database. You probably cannot write the final query in one go. Instead, you run <code>DESCRIBE</code> to see what columns exist, <code>SELECT * LIMIT 5</code> to scan the actual data, then build your query piece by piece, adjusting joins, fixing column names, and retrying after errors. The answer emerges from iteration.</p>
126
+
127
+ <p>SQLEnv captures this workflow. Four actions mirror what analysts do:</p>
128
+
129
+ <ul>
130
+ <li><strong>DESCRIBE</strong> reveals column names and types for a table</li>
131
+ <li><strong>SAMPLE</strong> previews rows to understand the data</li>
132
+ <li><strong>QUERY</strong> executes a read-only SQL query</li>
133
+ <li><strong>ANSWER</strong> submits a final answer</li>
134
+ </ul>
135
+
136
+ <p>Each episode starts with a natural-language question and a list of table names. Columns, types, and relationships stay hidden until the agent discovers them through exploration. This partial observability forces strategy over pattern-matching.</p>
137
+
138
+ <p>A clean episode on the question <em>"List student IDs with registered courses and their course counts"</em>:</p>
139
+
140
+ <div class="code-block scrollable">
141
+ <button class="copy-btn" onclick="copyBlock(this)">Copy</button>
142
+ <pre>Step 1: DESCRIBE student_course_registrations
143
+ → student_id INTEGER, course_id INTEGER, registration_date DATETIME
144
+ → reward: +0.015 <span class="reward-comment">← new schema revealed</span>
145
+
146
+ Step 2: DESCRIBE students
147
+ → student_id INTEGER, student_details VARCHAR(255)
148
+ → reward: +0.015 <span class="reward-comment">← second table described</span>
149
+
150
+ Step 3: QUERY SELECT T1.student_id, count(*)
151
+ FROM students AS T1
152
+ JOIN student_course_registrations AS T2
153
+ ON T1.student_id = T2.student_id
154
+ GROUP BY T1.student_id
155
+ → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
156
+ → reward: +0.150 <span class="reward-comment">← results overlap with gold answer</span>
157
+
158
+ Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
159
+ → correct
160
+ → reward: +1.000 <span class="reward-comment">← terminal: correct answer</span></pre>
161
+ </div>
162
+
163
+ <p>Total reward: 1.180. Four steps. Exploration: 0.180, terminal: 1.000.</p>
164
+
165
+ <h2>Built on OpenEnv</h2>
166
+
167
+ <p><a href="https://github.com/meta-pytorch/OpenEnv">OpenEnv</a> is a standard protocol for RL environments with a simple contract:</p>
168
+
169
+ <ul>
170
+ <li><code>reset(seed)</code> starts a new episode and returns the initial observation</li>
171
+ <li><code>step(action)</code> executes one action and returns observation, reward, and done flag</li>
172
+ </ul>
173
+
174
+ <p>Pydantic models enforce typed contracts between agent and environment. Any environment that implements this protocol plugs into TRL, torchforge, and Unsloth without glue code. SQLEnv implements it with four actions (DESCRIBE, SAMPLE, QUERY, ANSWER):</p>
175
+
176
+ <div class="legend">
177
+ <div class="legend-item"><span class="legend-swatch" style="background:#d2a8ff;"></span> method</div>
178
+ <div class="legend-item"><span class="legend-swatch" style="background:#7ee787;"></span> action type</div>
179
+ <div class="legend-item"><span class="legend-swatch" style="background:#ffa657;"></span> argument</div>
180
+ </div>
181
+
182
+ <div class="code-block">
183
+ <button class="copy-btn" onclick="copyBlock(this)">Copy</button>
184
+ <pre style="margin-top:0">env = SQLEnvironment(questions_path=<span style="color:#a5d6ff">"..."</span>, db_dir=<span style="color:#a5d6ff">"..."</span>, tokenizer=tok)
185
+ obs = env.<span style="color:#d2a8ff">reset</span>(seed=42)
186
+ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"DESCRIBE"</span>, argument=<span style="color:#ffa657">"employee"</span>))
187
+ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"QUERY"</span>, argument=<span style="color:#ffa657">"SELECT COUNT(*) FROM employee"</span>))
188
+ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"ANSWER"</span>, argument=<span style="color:#ffa657">"10"</span>))</pre>
189
+ </div>
190
+
191
+ <p>TRL's <code>environment_factory</code> auto-discovers the four tool methods from the environment class for GRPO training. The same environment runs locally, in Docker on HuggingFace Spaces, or over WebSocket.</p>
192
+
193
+ <p>The Green Agent evaluator wraps this protocol for benchmarking:</p>
194
+
195
+ <div class="code-block">
196
+ <pre>evaluate(env, policy, n_episodes=50, seed=0)</pre>
197
+ </div>
198
+
199
+ <p>This runs any <code>Policy</code> through the environment and reports success rate, average reward, and step count. Built-in <code>RandomPolicy</code> and <code>OraclePolicy</code> baselines provide lower and upper bounds (0% vs 100% accuracy, 0.25 vs 1.17 reward).</p>
200
+
201
+ <h2>Reward Architecture</h2>
202
+
203
+ <p>Three layers of reward signal:</p>
204
+
205
+ <table>
206
+ <tr><th>Layer</th><th>Signal</th><th>Per-step clip</th></tr>
207
+ <tr>
208
+ <td><strong>L1: Operational</strong></td>
209
+ <td>Successful execution <span class="pos">(+0.02)</span>, new info <span class="pos">(+0.01)</span>, repeat <span class="neg">(-0.03)</span>, step cost <span class="neg">(-0.02)</span></td>
210
+ <td style="white-space:nowrap">[-0.10, 0.15]</td>
211
+ </tr>
212
+ <tr>
213
+ <td><strong>L2: Progress</strong></td>
214
+ <td>Delta from previous query result — cardinality, value overlap, numeric proximity. Positive <span class="pos">(+)</span> for improvement, negative <span class="neg">(-)</span> for regression.</td>
215
+ <td style="white-space:nowrap">[-0.10, 0.15]</td>
216
+ </tr>
217
+ <tr>
218
+ <td><strong>L3: Terminal</strong></td>
219
+ <td>Correct answer: <span class="pos">+1.0</span>. Wrong: <span class="neg">0.0</span></td>
220
+ <td style="white-space:nowrap">one-shot</td>
221
+ </tr>
222
+ </table>
223
+
224
+ <p>Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.</p>
225
+
226
+ <p>The progress signal uses delta-from-previous-step, a form of potential-based reward shaping (<a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf">Ng et al., 1999</a>). This preserves the optimal policy because the agent cannot game progress at the expense of correctness. We confirmed this empirically: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer. Recent work validates this approach for tool-using agents. <a href="https://arxiv.org/abs/2603.22293">TIPS</a> (2026) showed potential-based turn-level shaping improved multi-turn agents by 11.8% over GRPO baselines. <a href="https://arxiv.org/html/2504.13958v1">ToolRL</a> (2025) found fine-grained reward decomposition improved tool learning by 17%. <a href="https://arxiv.org/abs/2410.07745">StepTool</a> (2024) confirmed step-grained shaping outperformed outcome-only rewards.</p>
227
+
228
+ <h2>Training</h2>
229
+
230
+ <p>We train Qwen3-0.6B with Group Relative Policy Optimization (GRPO). TRL's <code>environment_factory</code> runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.</p>
231
+
232
+ <p>SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.</p>
233
+
234
+ <p>Multi-turn SFT (120 full trajectories with <code>assistant_only_loss</code>) taught describe-then-query-then-answer as a coherent strategy. The subsequent GRPO training refined this into error recovery, answer formatting, and knowing when to stop exploring.</p>
235
+
236
+ <p>Two-phase curriculum:</p>
237
+ <ul>
238
+ <li><strong>Phase 1</strong>: Easy questions (single-table), KL penalty (beta=0.04) to keep the policy close to SFT initialization while allowing exploration. Reward climbs from -0.1 to 0.5-0.7 over 400 steps.</li>
239
+ <li><strong>Phase 2</strong>: Easy + medium (multi-table JOINs), KL removed (beta=0) so the agent can deviate further from SFT and discover new strategies. Reward holds at ~0.5.</li>
240
+ </ul>
241
+
242
+ <div style="text-align: center; margin: 24px 0;">
243
+ <img src="rl-training-phase-1.png" alt="GRPO Training Progress" style="max-width: 100%; border-radius: 8px;">
244
+ <p class="caption" style="text-align: center;">GRPO reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 in Phase 1 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Peaks at 1.15 mark correctly solved episodes. 902 steps, ~4.75h on Colab L4.</p>
245
+ </div>
246
+
247
+ <p>SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.</p>
248
+
249
+ <h2>What the Agent Learned</h2>
250
+
251
+ <p>The following behaviors emerged during training:</p>
252
+
253
+ <div class="cards">
254
+
255
+ <div class="card">
256
+ <div class="card-header">
257
+ <span class="emoji">🔎</span>
258
+ <span class="title blue">Schema Discovery</span>
259
+ </div>
260
+ <div class="card-question"><em>"What is the total bonus in all evaluations?"</em></div>
261
+ <div class="card-code">describe("evaluation")
262
+ → Employee_ID, Year_awarded, Bonus
263
+ query("SELECT SUM(Bonus) FROM evaluation")
264
+ → 19500.0
265
+ answer("19500.0")</div>
266
+ <div class="card-outcome"><span class="green">3 steps · reward 1.15 · correct</span></div>
267
+ <div class="card-insight">Aggregation on one table. Describe first, then write targeted SQL.</div>
268
+ </div>
269
+
270
+ <div class="card">
271
+ <div class="card-header">
272
+ <span class="emoji">🔧</span>
273
+ <span class="title orange">Error Recovery</span>
274
+ </div>
275
+ <div class="card-question"><em>"Which employee received the biggest bonus?"</em></div>
276
+ <div class="card-code">describe("employee") → Name, Age
277
+ query("...ORDER BY Salary DESC")
278
+ → Error: no such column: Salary
279
+ describe("evaluation") → Bonus
280
+ query("...JOIN...ORDER BY Bonus DESC")
281
+ → Louis Deacon
282
+ answer("Louis Deacon")</div>
283
+ <div class="card-outcome"><span class="green">5 steps · reward 1.13 · correct</span></div>
284
+ <div class="card-insight">Two-table JOIN with error recovery. Wrong column, re-describe, retry.</div>
285
+ </div>
286
+
287
+ <div class="card">
288
+ <div class="card-header">
289
+ <span class="emoji">🚧</span>
290
+ <span class="title red">The Ceiling</span>
291
+ </div>
292
+ <div class="card-question"><em>"Which city has the most arriving flights?"</em></div>
293
+ <div class="card-code">describe("AIRPORTS")
294
+ → City, AirportCode
295
+ query("SELECT CITY, COUNT(*)
296
+ FROM AIRPORTS GROUP BY CITY
297
+ ORDER BY COUNT(*) DESC LIMIT 1")
298
+ → Albany | 4
299
+ answer("Albany")</div>
300
+ <div class="card-outcome"><span class="red">3 steps · reward 0.0 · incorrect</span></div>
301
+ <div class="card-insight">Multi-hop JOIN (3+ tables). Counted airports, not flights. Beyond 0.6B.</div>
302
+ </div>
303
+
304
+ </div>
305
+
306
+ <p>The first two cards show learned behaviors: schema-first exploration and error recovery. The third shows where 0.6B saturates. We expand on these limitations and next steps below.</p>
307
+
308
+ <h3>Evaluation (N=50 episodes, 2 independent runs)</h3>
309
+
310
+ <p>All conditions run through SQLEnv's Green Agent evaluator: <code>evaluate(env, policy, n_episodes, seed)</code>.</p>
311
+
312
+ <table>
313
+ <tr><th>Method</th><th>Accuracy</th><th>Parse Rate</th><th>Avg Steps</th></tr>
314
+ <tr class="row-base"><td>Zero-shot</td><td>0%</td><td>24-28%</td><td>10.8-12.4</td></tr>
315
+ <tr class="row-base"><td>1-shot</td><td>0-2%</td><td>16-17%</td><td>14.0-14.8</td></tr>
316
+ <tr class="row-base"><td>3-shot</td><td>0%</td><td>19-20%</td><td>13.8-14.8</td></tr>
317
+ <tr class="row-grpo"><td>GRPO (trained)</td><td>~30%</td><td>95-100%</td><td>3.5-4.0</td></tr>
318
+ </table>
319
+
320
+ <p><strong>95-100% parse rate</strong>: the trained model produces valid tool-call JSON. The base model fails 76-83% of the time and burns its step budget repeating malformed output. <strong>~30% accuracy from 0%</strong>: the base model cannot answer a single question even with 3 examples, but the trained model solves 12-16 out of 50 in 3-4 steps.</p>
321
+
322
+ <p>We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Both scored ~30% accuracy across two evaluation runs. The variation between runs (6-8 percentage points) was larger than the difference between checkpoints: more training does not raise the ceiling. For RL alone, the bottleneck is the model's 0.6B pretraining rather than the training budget.</p>
323
+
324
+ <h2>Limitations at 0.6B Parameters</h2>
325
+
326
+ <p>Three failure modes define the current ceiling:</p>
327
+
328
+ <ul>
329
+ <li><strong>Column name hallucination.</strong> The model reads <code>FullName</code> from DESCRIBE but writes <code>full_name</code> in SQL, or reads <code>Horsepower</code> and writes <code>HorsepowerDESC</code> (missing space). Pretraining biases override the schema that the model just observed in context.</li>
330
+ <li><strong>FK chain reasoning.</strong> The model handles single-table queries well but fails on three-table JOINs such as Documents → Templates → Ref_Template_Types. It cannot chain foreign keys through intermediate tables.</li>
331
+ <li><strong>More RL does not help.</strong> Extended training (v2: 4 total epochs) produced identical accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.</li>
332
+ </ul>
333
+
334
+ <p>RL drives accuracy from 0% to 30% but saturates at 0.6B capacity. We did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past this ceiling in our current work. We discuss possible directions in the Next Steps section.</p>
335
+
336
+ <h2>The Learning Space</h2>
337
+
338
+ <p>The Green Agent evaluator brackets the learning space with two baselines:</p>
339
+
340
+ <table>
341
+ <tr><th>Policy</th><th>Accuracy</th><th>Avg Reward</th><th>Avg Steps</th></tr>
342
+ <tr class="row-base"><td>Random</td><td>0%</td><td>0.247</td><td>15.0</td></tr>
343
+ <tr class="row-grpo"><td>GRPO (trained)</td><td>~30%</td><td>~0.35</td><td>3.5</td></tr>
344
+ <tr><td>Oracle</td><td>100%</td><td>1.168</td><td>3.5</td></tr>
345
+ </table>
346
+
347
+ <p>Random scores 0.247 by accumulating small exploration rewards across 15 steps without answering. Oracle scores 1.168 in 3.5 steps. This gap between 0.25 and 1.17 represents what a trained agent can learn. Our GRPO agent lands at ~0.35, above random but far below oracle, with room for improvement through better SFT warmup or larger models.</p>
348
+
349
+ <h2>Technical Highlights</h2>
350
+
351
+ <ul>
352
+ <li><strong>676 questions</strong> (473 train / 203 eval) across 10 Spider databases with difficulty labels</li>
353
+ <li><strong>Typed models</strong> with Pydantic: every action, observation, and state is explicit and debuggable</li>
354
+ <li><strong>Read-only SQL</strong> via SQLite <code>mode=ro</code>, where the database engine enforces safety rather than regex</li>
355
+ <li><strong>Potential-based reward shaping</strong> (Ng et al., 1999) that provably preserves the optimal policy</li>
356
+ <li><strong>TRL environment_factory</strong> integration for standard GRPO training without a custom loop</li>
357
+ <li><strong>Green Agent evaluator</strong> with <code>Policy</code> protocol, <code>evaluate()</code> harness, and <code>RandomPolicy</code>/<code>OraclePolicy</code> baselines</li>
358
+ </ul>
359
+
360
+ <h2>Next Steps</h2>
361
+
362
+ <p>The environment supports two directions for improvement:</p>
363
+
364
+ <p><strong>Thinking mode.</strong> The 30% ceiling comes from multi-table reasoning. The model cannot plan a three-table JOIN path before writing SQL. Qwen3's <code>&lt;think&gt;</code> blocks offer a way to reason about the join chain before writing the query. In our experiments, RL alone did not produce useful thinking: the model either emitted empty <code>&lt;think&gt;&lt;/think&gt;</code> blocks or collapsed into degenerate loops (<code>&lt;think&gt;assistant&lt;think&gt;assistant...</code>) that consumed ~23% of rollouts. Pure RL discovers that thinking tokens exist but not how to use them. SFT warmup with structured reasoning examples ("I need to join Documents → Templates → Ref_Template_Types through Template_ID") could bootstrap the format, then RL could refine when to think and when to skip. This is worth testing at 0.6B before concluding the ceiling requires a larger model.</p>
365
+
366
+ <p><strong>Larger models.</strong> Our goal is small models that run locally, so scaling to 7B or beyond changes the deployment story. That said, a 1.7B model has more capacity to attend to DESCRIBE output and override pretrained column names. The environment and reward architecture do not depend on model size, so scaling up requires changing the training configuration rather than redesigning the environment. At some point, larger models may solve these questions with few-shot prompting alone, but the environment remains useful for training small models that need to run without API access.</p>
367
+
368
+ <h2>Try It Yourself</h2>
369
+
370
+ <ul>
371
+ <li><strong>Training notebook</strong>: <code>notebooks/train_grpo.ipynb</code> runs the full SFT + GRPO pipeline on Colab L4 in ~7 hours</li>
372
+ <li><strong>Comparison notebook</strong>: <code>notebooks/compare_methods.ipynb</code> evaluates base vs trained models side by side</li>
373
+ <li><strong>Showcase notebook</strong>: <code>notebooks/showcase_sqlenv.ipynb</code> lets you explore the environment, run episodes, and see what tools and rewards are available</li>
374
+ <li><strong>GitHub</strong>: full source, architecture docs, and training artifacts</li>
375
+ </ul>
376
+
377
+ <h2>Discussion</h2>
378
+
379
+ <p><strong>The format of SFT data matters more than the quantity.</strong> Per-turn SFT (347 examples) taught the model individual tool calls but not when to use them. The model called describe repeatedly because half the training examples were describe calls. Multi-turn SFT (120 full trajectories) taught the model to chain describe, query, and answer into a coherent episode. The difference was not the number of examples but whether each example showed a complete problem-solving sequence.</p>
380
+
381
+ <p><strong>Transparent errors help the agent learn.</strong> When the environment returns <code>"Error: no such column: full_name"</code> instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.</p>
382
+
383
+ <p><strong>Dense rewards benefit from theoretical grounding.</strong> Potential-based shaping (Ng et al., 1999) guarantees the agent optimizes for the right objective. Without it, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this for tool-using agents. TIPS (2026) showed 11.8% gains from potential-based turn-level shaping. ToolRL (2025) found 17% improvement from fine-grained reward decomposition. StepTool (2024) confirmed step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.</p>
384
+
385
+ <p><strong>The environment is the contribution.</strong> The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.</p>
386
+
387
+ <script>
388
+ function copyBlock(btn) {
389
+ const pre = btn.parentElement.querySelector('pre');
390
+ const text = pre.innerText;
391
+ navigator.clipboard.writeText(text).then(() => {
392
+ btn.textContent = 'Copied!';
393
+ btn.classList.add('copied');
394
+ setTimeout(() => {
395
+ btn.textContent = 'Copy';
396
+ btn.classList.remove('copied');
397
+ }, 2000);
398
+ });
399
+ }
400
+ </script>
401
+
402
+ </body>
403
+ </html>
docs/blog-post-v1.md ADDED
@@ -0,0 +1,269 @@
1
+ # SQLEnv: Teaching Small Models to Explore Databases Like Analysts
2
+
3
+ ## Two Agents, Same Question
4
+
5
+ <!-- Side-by-side cold open: random vs trained agent -->
6
+ <div style="display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap;">
7
+ <div style="flex: 1; min-width: 340px; background: #161b22; border-radius: 12px; padding: 20px; border: 1px solid #30363d; color: #e0e0e0;">
8
+ <div style="font-weight: bold; color: #ef5350;">Untrained agent</div>
9
+ <div style="color: #8b949e; font-size: 13px; margin-bottom: 12px;"><em>"Count the number of paragraphs."</em></div>
10
+ <pre style="background: #0d1117; border: none; padding: 12px; font-size: 11.5px; margin: 0; color: #c9d1d9;">DESCRIBE Documents ×5 ← same table, five times
11
+ SAMPLE Documents ×3 ← already saw these rows
12
+ DESCRIBE Templates ← wrong table
13
+ DESCRIBE Paragraphs ← finally the right table
14
+ QUERY SELECT * LIMIT 5 ← no aggregation
15
+ QUERY SELECT * LIMIT 5 ← still no COUNT(*)
16
+ ANSWER "76 | Robbin CV" ← a random row</pre>
17
+ <div style="margin-top: 12px; font-size: 13px; font-weight: 600;">❌ <span style="color: #ef5350;">15 steps · reward 0.28 · never wrote SELECT COUNT(*)</span></div>
18
+ </div>
19
+ <div style="flex: 1; min-width: 340px; background: #161b22; border-radius: 12px; padding: 20px; border: 1px solid #30363d; color: #e0e0e0;">
20
+ <div style="font-weight: bold; color: #4caf50;">Trained agent</div>
21
+ <div style="color: #8b949e; font-size: 13px; margin-bottom: 12px;"><em>"Which employee received the biggest bonus?"</em></div>
22
+ <pre style="background: #0d1117; border: none; padding: 12px; font-size: 11.5px; margin: 0; color: #c9d1d9;">DESCRIBE employee
23
+ → Employee_ID, Name, Age, City
24
+ QUERY ...ORDER BY Salary DESC
25
+ → Error: no such column: Salary
26
+ DESCRIBE evaluation
27
+ → Employee_ID, Year_awarded, Bonus
28
+ QUERY ...JOIN...ORDER BY Bonus DESC
29
+ → Louis Deacon
30
+ ANSWER "Louis Deacon"</pre>
31
+ <div style="margin-top: 12px; font-size: 13px; font-weight: 600;">✅ <span style="color: #4caf50;">5 steps · reward 1.13 · recovered from error</span></div>
32
+ </div>
33
+ </div>
34
+
35
+ Both agents have the same four tools, the same 15-step budget, and the same databases. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.
36
+
37
+ ## The Gap
38
+
39
+ Standard text-to-SQL evaluation works like this: hand the model a complete schema (all tables, all columns, all relationships) and ask it to produce one SQL query. If the query matches the gold answer, it passes. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.
40
+
41
+ SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.
42
+
43
+ ## What Analysts Actually Do
44
+
45
+ Consider the situation where you need to answer a question using data in an unfamiliar database. You probably cannot write the final query in one go. Instead, you run `DESCRIBE` to see what columns exist, `SELECT * LIMIT 5` to scan the actual data, then build your query piece by piece, adjusting joins, fixing column names, and retrying after errors. The answer emerges from iteration.
46
+
47
+ SQLEnv captures this workflow. Four actions mirror what analysts do:
48
+
49
+ - **DESCRIBE** reveals column names and types for a table
50
+ - **SAMPLE** previews rows to understand the data
51
+ - **QUERY** executes a read-only SQL query
52
+ - **ANSWER** submits a final answer
53
+
54
+ Each episode starts with a natural-language question and a list of table names. Columns, types, and relationships stay hidden until the agent discovers them through exploration. This partial observability forces strategy over pattern-matching.
55
+
56
+ A clean episode on the question *"List student IDs with registered courses and their course counts"*:
57
+
58
+ ```
59
+ Step 1: DESCRIBE student_course_registrations
60
+ → student_id INTEGER, course_id INTEGER, registration_date DATETIME
61
+ → reward: +0.015
62
+
63
+ Step 2: DESCRIBE students
64
+ → student_id INTEGER, student_details VARCHAR(255)
65
+ → reward: +0.015
66
+
67
+ Step 3: QUERY SELECT T1.student_id, count(*)
68
+ FROM students AS T1
69
+ JOIN student_course_registrations AS T2
70
+ ON T1.student_id = T2.student_id
71
+ GROUP BY T1.student_id
72
+ → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
73
+ → reward: +0.150
74
+
75
+ Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
76
+ → correct
77
+ → reward: +1.000
78
+ ```
79
+
80
+ Total reward: 1.180. Four steps. Exploration: 0.180, terminal: 1.000.
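
The episode totals can be reproduced by summing the per-step rewards from the trace above:

```python
# Per-step rewards from the episode trace above.
exploration_rewards = [0.015, 0.015, 0.150]  # DESCRIBE, DESCRIBE, QUERY
terminal_reward = 1.000                      # correct ANSWER

exploration = sum(exploration_rewards)
total = exploration + terminal_reward
print(f"exploration={exploration:.3f}, terminal={terminal_reward:.3f}, total={total:.3f}")
# → exploration=0.180, terminal=1.000, total=1.180
```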
81
+
82
+ ## Built on OpenEnv
83
+
84
+ [OpenEnv](https://github.com/meta-pytorch/OpenEnv) is a standard protocol for RL environments. The contract is simple:
85
+
86
+ - `reset(seed)` starts a new episode and returns the initial observation
87
+ - `step(action)` executes one action and returns observation, reward, and done flag
88
+
89
+ Pydantic models enforce typed contracts between agent and environment. Any environment that implements this protocol plugs into TRL, torchforge, and Unsloth without glue code.
90
+
91
+ SQLEnv implements this protocol with four domain-specific actions:
92
+
93
+ ```python
94
+ env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tok)
95
+ obs = env.reset(seed=42) # pick question, load DB, hide schema
96
+ obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
97
+ obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
98
+ obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
99
+ # obs.done=True, obs.reward=1.0
100
+ ```
101
+
102
+ TRL's `environment_factory` auto-discovers the four tool methods (`describe`, `sample`, `query`, `answer`) from the environment class for GRPO training. The same environment runs locally, in Docker on HuggingFace Spaces, or over WebSocket via `SQLEnvClient`.
103
+
104
+ The Green Agent evaluator wraps this protocol for benchmarking:
105
+
106
+ ```python
107
+ evaluate(env, policy, n_episodes=50, seed=0)
108
+ ```
109
+
110
+ This runs any `Policy` through the environment and reports success rate, average reward, and step count. Built-in `RandomPolicy` and `OraclePolicy` baselines provide lower and upper bounds (0% vs 100% accuracy, 0.25 vs 1.17 reward).
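
A minimal sketch of such a harness, assuming hypothetical `Obs` and `Policy` shapes (the real protocol and metric names live in SQLEnv's evaluator):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Obs:
    text: str
    reward: float = 0.0
    done: bool = False


class Policy(Protocol):
    def act(self, obs: Obs) -> str: ...


def evaluate(env, policy: Policy, n_episodes: int, seed: int = 0) -> dict:
    """Roll a policy through the environment; report success rate, reward, steps."""
    successes, total_reward, total_steps = 0, 0.0, 0
    for ep in range(n_episodes):
        obs, ep_reward, ep_steps = env.reset(seed=seed + ep), 0.0, 0
        while not obs.done:
            obs = env.step(policy.act(obs))
            ep_reward += obs.reward
            ep_steps += 1
        successes += obs.reward >= 1.0  # terminal +1.0 marks a correct answer
        total_reward += ep_reward
        total_steps += ep_steps
    return {"success_rate": successes / n_episodes,
            "avg_reward": total_reward / n_episodes,
            "avg_steps": total_steps / n_episodes}
```

A `RandomPolicy` baseline is then just a policy whose `act` samples uniformly from the four action types.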
111
+
112
+ ## Reward Architecture
113
+
114
+ Three layers of reward signal:
115
+
116
+ | Layer | Signal | Per-step clip |
117
+ |-------|--------|---------------|
118
+ | **L1: Operational** | Successful execution (+0.02), new info (+0.01), repeat penalty (-0.03), step cost (-0.02) | [-0.10, 0.15] |
119
+ | **L2: Progress** | Delta from previous query result — cardinality, value overlap, numeric proximity | [-0.10, 0.15] |
120
+ | **L3: Terminal** | Correct answer: +1.0. Wrong: 0.0 | one-shot |
121
+
122
+ Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.
123
+
124
+ The progress signal uses delta-from-previous-step, a form of potential-based reward shaping ([Ng et al., 1999](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf)). This preserves the optimal policy because the agent cannot game progress at the expense of correctness. We confirmed this empirically: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer. Recent work validates this approach for tool-using agents. [TIPS](https://arxiv.org/abs/2603.22293) (2026) showed potential-based turn-level shaping improved multi-turn agents by 11.8% over GRPO baselines. [ToolRL](https://arxiv.org/html/2504.13958v1) (2025) found fine-grained reward decomposition improved tool learning by 17%. [StepTool](https://arxiv.org/abs/2410.07745) (2024) confirmed step-grained shaping outperformed outcome-only rewards.
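
The mechanics can be sketched in a few lines. With an illustrative potential `phi` (not the environment's actual similarity metric), the shaped reward is the discounted change in `phi` between consecutive query results; the per-episode sum telescopes, so looping through intermediate states cannot farm progress reward:

```python
def phi(result: set, gold: set) -> float:
    """Illustrative potential: fraction of gold values present in the result."""
    return len(result & gold) / len(gold) if gold else 0.0


def shaping(prev: set, curr: set, gold: set, gamma: float = 1.0) -> float:
    # Potential-based shaping (Ng et al., 1999): F(s, s') = gamma * phi(s') - phi(s)
    return gamma * phi(curr, gold) - phi(prev, gold)


gold = {("a", 1), ("b", 2)}
trajectory = [set(), {("a", 1)}, {("a", 1), ("b", 2)}]
total = sum(shaping(p, c, gold) for p, c in zip(trajectory, trajectory[1:]))
# Telescoping sum: depends only on the endpoints, not the path taken.
assert abs(total - (phi(trajectory[-1], gold) - phi(trajectory[0], gold))) < 1e-9
```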
125
+
126
+ ## Training
127
+
128
+ We train Qwen3-0.6B with Group Relative Policy Optimization (GRPO). TRL's `environment_factory` runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.
129
+
130
+ SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.
131
+
132
+ Multi-turn SFT (120 full trajectories with `assistant_only_loss`) taught describe-then-query-then-answer as a coherent strategy. The subsequent GRPO training refined this into error recovery, answer formatting, and knowing when to stop exploring.
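
The difference between the two data formats, sketched with a hypothetical trajectory (illustrative content, not rows from the actual dataset):

```python
# One multi-turn SFT example: a full trajectory in chat format. Loss is
# computed only on assistant turns (assistant_only_loss), so the model
# learns the describe -> query -> answer sequence, not isolated tool calls.
trajectory = [
    {"role": "user", "content": "Question: total bonus in all evaluations? Tables: employee, evaluation"},
    {"role": "assistant", "content": "DESCRIBE evaluation"},
    {"role": "user", "content": "Employee_ID INTEGER, Year_awarded TEXT, Bonus REAL"},
    {"role": "assistant", "content": "QUERY SELECT SUM(Bonus) FROM evaluation"},
    {"role": "user", "content": "19500.0"},
    {"role": "assistant", "content": "ANSWER 19500.0"},
]

# Per-turn SFT would instead emit one example per assistant message,
# discarding the cross-turn strategy the full trajectory demonstrates.
per_turn_examples = [
    trajectory[: i + 1] for i, m in enumerate(trajectory) if m["role"] == "assistant"
]
assert len(per_turn_examples) == 3
```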
133
+
134
+ Two-phase curriculum:
135
+ - **Phase 1**: Easy questions (single-table), KL penalty (beta=0.04) to keep the policy close to SFT initialization while allowing exploration. Reward climbs from -0.1 to 0.5-0.7 over 400 steps.
136
+ - **Phase 2**: Easy + medium (multi-table JOINs), KL removed (beta=0) so the agent can deviate further from SFT and discover new strategies. Reward holds at ~0.5.
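
As a sketch, the two phases differ only in question difficulty and the KL coefficient (a hypothetical config container; in practice these values map onto TRL's GRPO settings):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    difficulties: tuple  # which question tiers the sampler draws from
    kl_beta: float       # KL penalty toward the SFT reference policy


PHASE_1 = PhaseConfig(difficulties=("easy",), kl_beta=0.04)
PHASE_2 = PhaseConfig(difficulties=("easy", "medium"), kl_beta=0.0)
```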
137
+
138
+ ![GRPO Training Progress](rl-training-phase-1.png)
139
+ *GRPO reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Peaks at 1.15 mark correctly solved episodes. 902 steps, ~4.75h on Colab L4.*
140
+
141
+ SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.
142
+
143
+ ## What the Agent Learned
144
+
145
+ The following behaviors emerged during training:
146
+
147
+ <div style="display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap;">
148
+
149
+ <div style="flex: 1; min-width: 240px; background: #1a1a2e; border-radius: 12px; padding: 24px 20px; color: #e0e0e0; display: flex; flex-direction: column;">
150
+ <div style="font-size: 36px; text-align: center; margin-bottom: 12px;">🔍</div>
151
+ <div style="color: #4fc3f7; font-weight: bold; text-align: center; margin-bottom: 4px;">Schema Discovery</div>
152
+ <div style="color: #8b949e; font-size: 13px; text-align: center; margin-bottom: 14px;"><em>"What is the total bonus in all evaluations?"</em></div>
153
+ <div style="background: #0d1117; border-radius: 8px; padding: 12px; font-size: 11px; font-family: monospace; color: #c9d1d9; flex: 1; min-height: 140px; white-space: pre; line-height: 1.5;">describe("evaluation")
154
+ → Employee_ID, Year_awarded, Bonus
155
+ query("SELECT SUM(Bonus) FROM evaluation")
156
+ → 19500.0
157
+ answer("19500.0")</div>
158
+ <div style="display: flex; align-items: center; gap: 8px; margin-top: 12px; font-size: 13px; font-weight: 600;">
159
+ <span style="font-size: 18px;">✅</span> <span style="color: #4caf50;">3 steps · reward 1.15</span>
160
+ </div>
161
+ <div style="font-size: 12px; color: #8b949e; margin-top: 8px; line-height: 1.5; border-top: 1px solid #30363d; padding-top: 8px;">Aggregation on one table. Describe first, then write targeted SQL.</div>
162
+ </div>
163
+
164
+ <div style="flex: 1; min-width: 240px; background: #1a1a2e; border-radius: 12px; padding: 24px 20px; color: #e0e0e0; display: flex; flex-direction: column;">
165
+ <div style="font-size: 36px; text-align: center; margin-bottom: 12px;">🔄</div>
166
+ <div style="color: #ffb74d; font-weight: bold; text-align: center; margin-bottom: 4px;">Error Recovery</div>
167
+ <div style="color: #8b949e; font-size: 13px; text-align: center; margin-bottom: 14px;"><em>"Which employee received the biggest bonus?"</em></div>
168
+ <div style="background: #0d1117; border-radius: 8px; padding: 12px; font-size: 11px; font-family: monospace; color: #c9d1d9; flex: 1; min-height: 140px; white-space: pre; line-height: 1.5;">describe("employee") → Name, Age, City
169
+ query("...ORDER BY Salary DESC")
170
+ → Error: no such column: Salary
171
+ describe("evaluation") → Bonus column
172
+ query("...JOIN...ORDER BY Bonus DESC")
173
+ → Louis Deacon
174
+ answer("Louis Deacon")</div>
175
+ <div style="display: flex; align-items: center; gap: 8px; margin-top: 12px; font-size: 13px; font-weight: 600;">
176
+ <span style="font-size: 18px;">✅</span> <span style="color: #4caf50;">5 steps · reward 1.13</span>
177
+ </div>
178
+ <div style="font-size: 12px; color: #8b949e; margin-top: 8px; line-height: 1.5; border-top: 1px solid #30363d; padding-top: 8px;">Two-table JOIN with error recovery. Wrong column, re-describe, retry — the analyst loop.</div>
179
+ </div>
180
+
181
+ <div style="flex: 1; min-width: 240px; background: #1a1a2e; border-radius: 12px; padding: 24px 20px; color: #e0e0e0; display: flex; flex-direction: column;">
182
+ <div style="font-size: 36px; text-align: center; margin-bottom: 12px;">🧱</div>
183
+ <div style="color: #ef5350; font-weight: bold; text-align: center; margin-bottom: 4px;">The Ceiling</div>
184
+ <div style="color: #8b949e; font-size: 13px; text-align: center; margin-bottom: 14px;"><em>"Which city has the most arriving flights?"</em></div>
185
+ <div style="background: #0d1117; border-radius: 8px; padding: 12px; font-size: 11px; font-family: monospace; color: #c9d1d9; flex: 1; min-height: 140px; white-space: pre; line-height: 1.5;">describe("AIRPORTS")
186
+ → City, AirportCode
187
+ query("SELECT CITY, COUNT(*)
188
+ FROM AIRPORTS GROUP BY CITY
189
+ ORDER BY COUNT(*) DESC LIMIT 1")
190
+ → Albany | 4
191
+ answer("Albany")</div>
192
+ <div style="display: flex; align-items: center; gap: 8px; margin-top: 12px; font-size: 13px; font-weight: 600;">
193
+ <span style="font-size: 18px;">❌</span> <span style="color: #ef5350;">3 steps · reward 0.0</span>
194
+ </div>
195
+ <div style="font-size: 12px; color: #8b949e; margin-top: 8px; line-height: 1.5; border-top: 1px solid #30363d; padding-top: 8px;">Multi-hop JOIN (3+ tables). Counted airports instead of joining through flights. Beyond 0.6B capacity.</div>
196
+ </div>
197
+
198
+ </div>
199
+
200
+ ### Evaluation (N=50 episodes, 2 independent runs)
201
+
202
+ All conditions run through SQLEnv's Green Agent evaluator: `evaluate(env, policy, n_episodes, seed)`. The same harness powers the showcase notebook (Random vs Oracle baselines) and the full comparison below.
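The harness contract can be sketched in pure Python. The `evaluate(env, policy, n_episodes, seed)` signature comes from the docs above; the env's reset/step interface below is an assumption for illustration, not the package's verified API:

```python
from typing import Protocol


class Policy(Protocol):
    def act(self, observation): ...


def evaluate(env, policy, n_episodes: int, seed: int) -> dict:
    # Roll out n_episodes and aggregate accuracy, average reward,
    # and average steps. The env is assumed to return
    # (obs, reward, done, info) from step(); info["correct"] marks success.
    correct, total_reward, total_steps = 0, 0.0, 0
    for ep in range(n_episodes):
        obs = env.reset(seed=seed + ep)
        done = False
        while not done:
            obs, reward, done, info = env.step(policy.act(obs))
            total_reward += reward
            total_steps += 1
        correct += int(info.get("correct", False))
    return {
        "accuracy": correct / n_episodes,
        "avg_reward": total_reward / n_episodes,
        "avg_steps": total_steps / n_episodes,
    }
```

The same loop serves `RandomPolicy`, `OraclePolicy`, and a trained model: only `policy.act` changes.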
203
+
204
+ | Method | Accuracy | Parse Rate | Avg Steps |
205
+ |--------|----------|------------|-----------|
206
+ | Zero-shot | 0% | 24-28% | 10.8-12.4 |
207
+ | 1-shot | 0-2% | 16-17% | 14.0-14.8 |
208
+ | 3-shot | 0% | 19-20% | 13.8-14.8 |
209
+ | GRPO (trained) | ~30% | 95-100% | 3.5-4.0 |
210
+
211
+ Two results stand out. **95-100% parse rate**: the trained model almost always produces valid tool-call JSON. The base model fails 76-83% of the time and burns its step budget repeating malformed output. **~30% accuracy from 0%**: the base model cannot answer a single question even with 3 examples, but the trained model solves 12-16 out of 50 in 3-4 steps.
212
+
213
+ We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Both scored ~30% accuracy across two evaluation runs, and the variation between runs (6-8 percentage points) was larger than the difference between checkpoints: more training does not raise the ceiling. For RL alone, the bottleneck is the model's 0.6B pretraining, not the training budget.
214
+
215
+ ## Limitations at 0.6B Parameters
216
+
217
+ Three failure modes define the current ceiling:
218
+
219
+ - **Column name hallucination.** The model reads `FullName` from DESCRIBE but writes `full_name` in SQL, or reads `Horsepower` and writes `HorsepowerDESC` (missing space). Pretraining biases override the schema that the model just observed in context.
220
+ - **FK chain reasoning.** The model handles single-table queries well but fails on three-table JOINs such as Documents → Templates → Ref_Template_Types. It cannot chain foreign keys through intermediate tables.
221
+ - **More RL does not help.** Extended training (v2: 4 total epochs) produced identical accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.
222
+
223
+ RL drives accuracy from 0% to 30% but saturates at 0.6B capacity. In the current work we did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past that ceiling; we discuss possible directions in the Next Steps section.
224
+
225
+ ## The Learning Space
226
+
227
+ The Green Agent evaluator brackets the learning space with two baselines:
228
+
229
+ | Policy | Accuracy | Avg Reward | Avg Steps |
230
+ |--------|----------|------------|-----------|
231
+ | Random | 0% | 0.247 | 15.0 |
232
+ | **GRPO (trained)** | **~30%** | **~0.35** | **3.5** |
233
+ | Oracle | 100% | 1.168 | 3.5 |
234
+
235
+ Random scores 0.247 by accumulating small exploration rewards across 15 steps without answering. Oracle scores 1.168 in 3.5 steps. This gap between 0.25 and 1.17 represents what a trained agent can learn. Our GRPO agent lands at ~0.35, above random but far below oracle, with room for improvement through better SFT warmup or larger models.
236
+
237
+ ## Technical Highlights
238
+
239
+ - **676 questions** (473 train / 203 eval) across 10 Spider databases with difficulty labels
240
+ - **Typed models** with Pydantic: every action, observation, and state is explicit and debuggable
241
+ - **Read-only SQL** via SQLite `mode=ro`, where the database engine enforces safety rather than regex
242
+ - **Potential-based reward shaping** (Ng et al., 1999) that provably preserves the optimal policy
243
+ - **TRL environment_factory** integration for standard GRPO training without a custom loop
244
+ - **Green Agent evaluator** with `Policy` protocol, `evaluate()` harness, and `RandomPolicy`/`OraclePolicy` baselines
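The read-only bullet above relies on SQLite's URI `mode=ro`, which the standard library exposes directly. A minimal stdlib sketch (the path and table are illustrative):

```python
import os
import sqlite3
import tempfile

# Build a scratch database, then reopen it read-only via URI mode=ro.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE evaluation (Bonus REAL)")
rw.execute("INSERT INTO evaluation VALUES (19500.0)")
rw.commit()
rw.close()

ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT SUM(Bonus) FROM evaluation").fetchone())  # (19500.0,)
try:
    ro.execute("DELETE FROM evaluation")  # the engine, not a regex, says no
except sqlite3.OperationalError as err:
    print("write rejected:", err)
```

Any write statement on the `mode=ro` connection raises `sqlite3.OperationalError`, so safety does not depend on parsing the agent's SQL.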
245
+
246
+ ## Next Steps
247
+
248
+ The environment supports two directions for improvement:
249
+
250
+ **Thinking mode.** The 30% ceiling comes from multi-table reasoning. The model cannot plan a three-table JOIN path before writing SQL. Qwen3's `<think>` blocks offer a way to reason about the join chain before writing the query. In our experiments, RL alone did not produce useful thinking: the model either emitted empty `<think></think>` blocks or collapsed into degenerate loops (`<think>assistant<think>assistant...`) that consumed ~23% of rollouts. Pure RL discovers that thinking tokens exist but not how to use them. SFT warmup with structured reasoning examples ("I need to join Documents → Templates → Ref_Template_Types through Template_ID") could bootstrap the format, then RL could refine when to think and when to skip. This is worth testing at 0.6B before concluding the ceiling requires a larger model.
251
+
252
+ **Larger models.** Our goal is small models that run locally, so scaling to 7B or beyond changes the deployment story. That said, a 1.7B model has more capacity to attend to DESCRIBE output and override pretrained column names. The environment and reward architecture do not depend on model size, so scaling up requires changing the training configuration rather than redesigning the environment. At some point, larger models may solve these questions with few-shot prompting alone, but the environment remains useful for training small models that need to run without API access.
253
+
254
+ ## Try It Yourself
255
+
256
+ - **Training notebook**: `notebooks/train_grpo.ipynb` runs the full SFT + GRPO pipeline on Colab L4 in ~5 hours
257
+ - **Comparison notebook**: `notebooks/compare_methods.ipynb` evaluates base vs trained models side by side
258
+ - **Showcase notebook**: `notebooks/showcase_sqlenv.ipynb` lets you explore the environment, run episodes, and see what tools and rewards are available
259
+ - **GitHub**: full source, architecture docs, and training artifacts
260
+
261
+ ## Discussion
262
+
263
+ **The format of SFT data matters more than the quantity.** Per-turn SFT (347 examples) taught the model individual tool calls but not when to use them. The model called describe repeatedly because half the training examples were describe calls. Multi-turn SFT (120 full trajectories) taught the model to chain describe, query, and answer into a coherent episode. The difference was not the number of examples but whether each example showed a complete problem-solving sequence.
264
+
265
+ **Transparent errors help the agent learn.** When the environment returns `"Error: no such column: full_name"` instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.
266
+
267
+ **Dense rewards benefit from theoretical grounding.** Potential-based shaping (Ng et al., 1999) guarantees the agent optimizes for the right objective. Without it, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this for tool-using agents. TIPS (2026) showed 11.8% gains from potential-based turn-level shaping. ToolRL (2025) found 17% improvement from fine-grained reward decomposition. StepTool (2024) confirmed step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.
268
+
269
+ **The environment is the contribution.** The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.
docs/blog-post.md ADDED
@@ -0,0 +1,118 @@
1
+ # SQLEnv: Teaching Small Models to Explore Databases Like Analysts
2
+
3
+ ## What Analysts Do That One-Shot LLMs Miss
4
+
5
+ A data analyst opens a new database. She doesn't write the final query first. She runs `DESCRIBE` to see what columns exist, `SELECT * LIMIT 5` to understand the data, then builds her query piece by piece — adjusting joins, fixing column names, retrying after errors. The answer emerges from a process, not a guess.
6
+
7
+ Most text-to-SQL systems skip this process entirely. They take a question and a schema, produce one SQL query, and hope it's right. When it isn't — wrong column name, missing join, bad filter — there's no recovery. The model guessed once and moved on.
8
+
9
+ SQLEnv is a reinforcement learning environment that trains the exploration habit instead.
10
+
11
+ ## The Problem: Static Benchmarks Reward Memorization
12
+
13
+ Spider and BIRD measure whether a model can produce a correct SQL query in one shot. High accuracy on these benchmarks doesn't mean the model can handle an unfamiliar schema. It means the model memorized patterns from training data that happen to match the test set.
14
+
15
+ The gap shows up immediately in production. Schemas change. Column names don't match expectations. Tables have unexpected relationships. A model that can't recover from its first wrong guess is brittle in exactly the situations where SQL competence matters most.
16
+
17
+ The deeper problem: these benchmarks grade the final answer and ignore the process. An agent that explores the schema, discovers a mistake, and corrects itself gets the same score as one that guesses right on the first try — and no credit at all when the guess fails. There's no learning signal for the investigative reasoning that makes human analysts reliable.
18
+
19
+ ## SQLEnv: An Interactive RL Environment
20
+
21
+ SQLEnv gives agents four actions that mirror what analysts do:
22
+
23
+ - **DESCRIBE** — see column names and types for a table
24
+ - **SAMPLE** — view example rows to understand the data
25
+ - **QUERY** — run a read-only SQL query and see results
26
+ - **ANSWER** — submit a final answer
27
+
28
+ Each episode starts with a natural-language question and a list of table names. The schema is hidden. The agent must discover columns, types, and relationships through exploration before it can answer. This partial observability is the point — it forces the agent to develop strategy rather than pattern-match.
29
+
30
+ The environment plugs into TRL's `environment_factory` for GRPO training, runs in Docker for safe SQL execution, and deploys to HuggingFace Spaces.
31
+
32
+ ## How an Episode Works
33
+
34
+ Consider the question: *"What is the total bonus given in all evaluations?"*
35
+
36
+ A trained agent's episode looks like this:
37
+
38
+ 1. `describe("evaluation")` → sees columns: Employee_ID, Year_awarded, Bonus
39
+ 2. `query("SELECT SUM(Bonus) FROM evaluation")` → result: 19500.0
40
+ 3. `answer("19500.0")` → correct
41
+
42
+ Three tool calls, clean execution. The agent discovered the schema, wrote the right query, and submitted. But the interesting behavior is what happens when things go wrong.
43
+
44
+ On a harder question — *"Which employee received the biggest bonus?"* — the agent tries `SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1`, gets `no such column: Salary`, calls `describe("evaluation")` to find the Bonus column, then writes the correct JOIN. The error recovery is learned behavior, not hard-coded.
45
+
46
+ ## Reward Architecture
47
+
48
+ The environment provides three layers of reward signal:
49
+
50
+ **Operational feedback** on every step: did the SQL execute (+0.02) or error? Is this a new query (+0.01) or a repeat (-0.01)?
51
+
52
+ **Progress signal** on queries: how close are the results to the correct answer? Measured by value overlap and result cardinality, shaped as delta-from-previous-step. This form is potential-based (Ng et al., 1999), which provably preserves the optimal policy — the agent can't game progress rewards at the expense of correctness.
53
+
54
+ **Terminal reward** for the answer: +1.0 for correct, 0.0 for wrong.
55
+
56
+ Terminal correctness dominates by design. The maximum exploration reward across all steps is roughly 0.3; a correct answer is worth 1.0. An agent that explores endlessly but never answers will always score less than one that answers correctly. Dense intermediate rewards exist to make training feasible on small models — without them, a 0.6B parameter model can't learn from sparse terminal-only signal.
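The potential-based guarantee can be seen in miniature: shaped bonuses telescope, so a trajectory's total bonus depends only on its endpoints. A sketch with illustrative potential values (not SQLEnv's actual progress function):

```python
def shaping_bonus(phi_prev: float, phi_curr: float, gamma: float = 1.0) -> float:
    # Potential-based shaping (Ng et al., 1999): F(s, s') = gamma*phi(s') - phi(s).
    # Adding F to the reward provably preserves the optimal policy.
    return gamma * phi_curr - phi_prev


# Progress potential after each query, e.g. result overlap with the gold
# answer. Values here are illustrative, not SQLEnv's actual phi.
potentials = [0.0, 0.25, 0.25, 0.75]
bonuses = [shaping_bonus(p, q) for p, q in zip(potentials, potentials[1:])]

# With gamma=1 the bonuses telescope to phi(last) - phi(first), so
# oscillating between states farms no extra reward.
assert abs(sum(bonuses) - (potentials[-1] - potentials[0])) < 1e-12
print(bonuses)  # [0.25, 0.0, 0.5]
```

This telescoping property is why the progress layer cannot distort what the terminal reward optimizes for.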
57
+
58
+ ## Training with GRPO
59
+
60
+ We train using Group Relative Policy Optimization (GRPO), which compares multiple rollouts of the same question to compute advantages. TRL's `environment_factory` runs the agent through SQLEnv for each rollout, accumulating rewards across the multi-turn tool-calling loop.
61
+
62
+ Training uses a two-phase curriculum:
63
+ - **Phase 1**: Easy questions only (single-table), with KL penalty (β=0.04) to stay close to the SFT-initialized policy
64
+ - **Phase 2**: Easy + medium questions (multi-table JOINs), KL penalty removed to allow exploration
65
+
66
+ The SFT warmup matters more than we expected. We generate 120 multi-turn trajectories showing the full describe→query→answer workflow, trained with `assistant_only_loss` so the model learns from its own actions, not the environment's responses. Without this, GRPO has no coherent policy to improve — the agent never learns to call tools in sequence.
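A hedged sketch of the warmup configuration (TRL field names such as `assistant_only_loss` vary by version; `trajectories` is a stand-in for the 120 multi-turn conversations, not a real variable in the repo):

```python
from trl import SFTConfig, SFTTrainer  # sketch: exact fields vary by TRL version

cfg = SFTConfig(
    output_dir="sft-warmup",
    num_train_epochs=2,
    assistant_only_loss=True,  # loss on assistant turns only, not tool output
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=cfg,
    train_dataset=trajectories,  # 120 full describe -> query -> answer episodes
)
trainer.train()
```

With `assistant_only_loss=True`, tokens from environment observations are masked out, so the model is trained on its own actions rather than on predicting tool results.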
67
+
68
+ ## What We Observed
69
+
70
+ Across nine training runs on Qwen3-0.6B, three findings stand out.
71
+
72
+ **The environment produces clear learning signal.** Phase 1 reward climbs from near-zero to 0.5–0.7 within 400 steps. The model learns to describe tables before querying, format answers correctly (comma-separated lists, pipe-delimited rows), recover from SQL errors by re-describing tables, and submit `[]` for genuinely empty results. These are strategic behaviors that emerge from reward signal, not from hard-coded rules.
73
+
74
+ **The model hits a capacity ceiling on medium questions.** Multi-table JOIN queries — the kind requiring foreign-key chain reasoning across 3+ tables — don't improve with more training. We ran extended training (v2: 4 total epochs across both phases) and the reward curve stayed flat. The dominant failure mode is column name hallucination: the model reads the schema correctly via DESCRIBE, then writes SQL using pretrained column names that don't match. A 0.6B model can't override pretraining biases through RL reward signal alone.
75
+
76
+ **Thinking mode helps error recovery but doesn't raise the ceiling.** With Qwen3's thinking mode enabled, the model reasons through SQL errors in `<think>` blocks — noticing column name mismatches, adjusting join paths. Easy-question accuracy stays the same, but error recovery becomes more deliberate. A new failure mode emerges: ~23% of rollouts degenerate into unclosed `<think>` loops that consume the entire token budget. The fix is straightforward (add SFT examples with proper think blocks), but it reveals how small models struggle with structural token management.
77
+
78
+ ### Training metrics (Run 9, Qwen3-0.6B)
79
+
80
+ | Phase | Steps | Questions | Reward range | Mean reward |
81
+ |-------|-------|-----------|-------------|-------------|
82
+ | Phase 1 (easy, β=0.04) | 870 | 435 | 0.01–1.15 | ~0.5 |
83
+ | Phase 2 (easy+medium, β=0.0) | 934 | 467 | 0.01–1.15 | ~0.5 |
84
+
85
+ Parse rate: >98% (model produces valid tool-call JSON). The 10% eval accuracy on the GRPO checkpoint vs 0% on the base model confirms the environment produces genuine learning, even if the absolute numbers are modest for a 0.6B model.
86
+
87
+ ## Technical Highlights
88
+
89
+ - **10 Spider databases** with structured metadata and a deterministic train/eval split
90
+ - **Typed action and observation models** make every environment interaction explicit and debuggable
91
+ - **Read-only SQL execution** via SQLite `mode=ro` — safety enforced by the database engine, not regex
92
+ - **Potential-based reward shaping** (Ng et al., 1999) — delta progress rewards provably preserve the optimal policy
93
+ - **TRL environment_factory integration** — the environment plugs into standard GRPO training with no custom training loop
94
+ - **Docker packaging** for HuggingFace Spaces with bundled databases and health checks
95
+
96
+ ## Future Directions
97
+
98
+ Two clear paths forward:
99
+
100
+ **Larger models.** The 0.6B model's ceiling comes from pretrained column-name biases overriding schema context. A 1.7B or larger model has more capacity to attend to the DESCRIBE output and override pretraining. The environment and reward architecture are model-agnostic — scaling up is a config change, not a redesign.
101
+
102
+ **Thinking mode with targeted SFT.** Qwen3's thinking mode shows promise for multi-step reasoning (error recovery, join path planning) but needs SFT coverage on proper `<think>...</think>` blocks to prevent degenerate loops. Combining a larger model with thinking mode should push into medium-difficulty multi-table questions where the current model plateaus.
103
+
104
+ The environment itself is the contribution. Whether the agent is 0.6B or 70B, sparse reward or dense, GRPO or PPO — the action space, reward architecture, and episode structure remain the same. SQLEnv provides the training ground. The models will catch up.
105
+
106
+ ## Try It Yourself
107
+
108
+ - **HuggingFace Space**: [live demo link]
109
+ - **Training notebook**: `notebooks/train_grpo.ipynb` — runs on Colab L4 in ~7 hours for both phases
110
+ - **GitHub**: full source, architecture docs, and verification artifacts
111
+
112
+ ## What We Learned
113
+
114
+ Dense intermediate rewards accelerate learning only when they align with the final objective. Potential-based shaping (Ng et al., 1999) gives us this guarantee — delta progress rewards can't distort the optimal policy. Without this property, agents learn to farm exploration rewards instead of answering questions.
115
+
116
+ Tool-using agents benefit from transparent errors. When the environment surfaces `"Error: no such column: full_name"` instead of silently returning empty results, the agent develops error-recovery strategies. Better diagnostics produce better policy updates.
117
+
118
+ Multi-turn SFT is the foundation, not a warmup step. Without full-trajectory SFT (describe→query→answer as one example with `assistant_only_loss`), GRPO has no coherent starting policy to improve. Per-turn SFT teaches individual actions; multi-turn SFT teaches strategy. The difference is the difference between a model that calls describe forever and one that knows when to answer.
docs/competition-deliverables.md ADDED
@@ -0,0 +1,129 @@
1
+ # OpenEnv Challenge — Deliverables & Status
2
+
3
+ ## Competition
4
+
5
+ **OpenEnv Challenge: SOTA Environments to drive general intelligence**
6
+
7
+ Sponsors: PyTorch team at Meta, HuggingFace, Unsloth
8
+
9
+ Prizes:
10
+ - $10K in HuggingFace credits
11
+ - Invitation to publish on PyTorch.org blog
12
+
13
+ ## Judging Criteria
14
+
15
+ Evaluated primarily on the submission blog. Judging panel grades on:
16
+
17
+ 1. Creative and robust use of OpenEnv
18
+ 2. Technical excellence
19
+ 3. Storytelling
20
+ 4. Open-source demo
21
+ 5. Green Agent wrapper for the environment
22
+
23
+ ## Required Deliverables
24
+
25
+ ### 1. HuggingFace Space
26
+
27
+ Environment on the HF Hub. Judges interact with the action space
28
+ (DESCRIBE, SAMPLE, QUERY, ANSWER) against real Spider databases.
29
+
30
+ Live at: https://huggingface.co/spaces/hjerpe/sql_env
31
+ Docker image: `registry.hf.space/hjerpe-sql_env:latest`
32
+ Published via `uv run openenv push` on 2026-03-29 (see `specs/F007-DEMO.md`).
33
+
34
+ **Status:** Live. Endpoints `/health`, `/docs`, `/web`, `/reset`, `/step`, `/ws`
35
+ exposed by the FastAPI server in `envs/sql_env/server/`. Python client:
36
+ `SQLEnv(base_url="https://hjerpe-sql-env.hf.space")`.
37
+
38
+ ### 2. Training notebooks/scripts (GitHub)
39
+
40
+ Colab-ready notebooks:
41
+ - `notebooks/train_grpo.ipynb` — Full SFT + GRPO pipeline, Colab L4, ~7h
42
+ - `notebooks/compare_methods.ipynb` — Base vs GRPO evaluation (zero-shot, 1-shot, 3-shot, GRPO v1, v2)
43
+ - `notebooks/showcase_sqlenv.ipynb` — Interactive environment demo with Random and Oracle baselines
44
+
45
+ **Status:** Complete
46
+
47
+ ### 3. Blog post (HuggingFace)
48
+
49
+ Analyst exploration framing, reward architecture with theory,
50
+ training results (0% to ~30%), failure analysis, lessons learned.
51
+
52
+ Draft: `docs/blog-post-v1.md`
53
+
54
+ **Status:** Draft v1 complete, not yet published
55
+
56
+ ## Additional Deliverables
57
+
58
+ ### 4. GitHub repo
59
+
60
+ Clean codebase: zero ruff errors, typed Pydantic models, 280 passing
61
+ tests, architecture docs, training artifacts.
62
+
63
+ **Status:** Complete (F016 quality sweep done)
64
+
65
+ ### 5. Trained checkpoints (HuggingFace Hub)
66
+
67
+ - `hjerpe/sqlenv-qwen3-0.6b-grpo` (v1)
68
+ - `hjerpe/sqlenv-qwen3-0.6b-grpo-v2` (v2)
69
+
70
+ **Status:** Uploaded
71
+
72
+ ### 6. Green Agent wrapper
73
+
74
+ OpenEnv evaluation wrapper pattern. A `Policy` protocol with
75
+ `evaluate(env, policy, n_episodes, seed)` that reports success rate,
76
+ average reward, and average steps. Includes `RandomPolicy` and
77
+ `OraclePolicy` baselines for standardized comparison.
78
+
79
+ Implementation: `evaluation/policies.py`, `evaluation/oracle_policy.py`
80
+ Tests: `tests/test_evaluation.py` (17 tests, all passing)
81
+ Used by: `notebooks/showcase_sqlenv.ipynb`, `notebooks/compare_methods.ipynb`
82
+
83
+ **Status:** Complete
84
+
85
+ ### 7. TRL `environment_factory` adapter
86
+
87
+ HuggingFace TRL's native OpenEnv integration: pass a class with
88
+ `reset()` + named tool methods as `environment_factory=` and `GRPOTrainer`
89
+ runs the multi-turn tool-calling loop automatically (no custom
90
+ `rollout_func`).
91
+
92
+ Implementation: `training/trl_adapter.py` — class `SQLEnvTRL` exposing
93
+ `describe()`, `sample()`, `query()`, `answer()` as tool methods plus
94
+ `sql_env_reward_func`. Used by `notebooks/train_grpo.ipynb` (cell 16:
95
+ `environment_factory=SQLEnvTRL`).
96
+
97
+ Note: the adapter instantiates a **local** in-process `SQLEnvironment`,
98
+ not a WebSocket client to the hosted HF Space. Intentional — training
99
+ needs N parallel sessions (one per generation), and local is faster and
100
+ avoids the Space's default 1-session concurrency limit.
101
+
102
+ **Status:** Complete
103
+
104
+ ## Our Position
105
+
106
+ No interactive SQL exploration environment exists. SQL Repair
107
+ (WALKMAN303) is single-turn fix-it. Calendar Gym (Turing) is
108
+ real-world but not SQL. We are the only multi-turn
109
+ strategy-discovery environment for database exploration.
110
+
111
+ Key narrative: "The environment is the product." The trained agent
112
+ demonstrates that the environment works, but the contribution is
113
+ the action space, reward architecture, and episode structure.
114
+
115
+ ## Open Items
116
+
117
+ - [x] Deploy HuggingFace Space (live at https://huggingface.co/spaces/hjerpe/sql_env, 2026-03-29)
118
+ - [ ] Publish blog post on HuggingFace (planned 2026-04-12)
119
+ - [ ] Final review of blog-post-v1.md
120
+ - [ ] Verify notebooks run clean on fresh Colab
121
+ - [ ] Post-launch: enable `SUPPORTS_CONCURRENT_SESSIONS=True` + `max_concurrent_envs=64` on the Space for external users who want to retrain against the hosted endpoint
122
+
123
+ ## Resources
124
+
125
+ - OpenEnv tutorial: https://colab.research.google.com/github/meta-pytorch/OpenEnv/blob/main/examples/OpenEnv_Tutorial.ipynb
126
+ - OpenEnv GitHub: https://github.com/meta-pytorch/OpenEnv
127
+ - OpenEnv docs: https://meta-pytorch.org/OpenEnv/
128
+ - Environment hub: https://huggingface.co/openenv
129
+ - Discord: https://discord.com/invite/YsTYBh6PD9
docs/data-sources.md ADDED
@@ -0,0 +1,182 @@
1
+ # Data Sources
2
+
3
+ Reference for what data SQLEnv uses, where it comes from, and how to
4
+ regenerate it. All data lives under `data/` and is checked into the repo,
5
+ so a fresh clone works offline after `uv sync`.
6
+
7
+ ## Summary
8
+
9
+ | Artifact | Path | Origin | Count |
10
+ |---|---|---|---|
11
+ | SQLite databases | `data/databases/<db_id>/<db_id>.sqlite` | [Spider](https://yale-lily.github.io/spider) (taoyds/spider on GitHub) | 10 databases |
12
+ | Training questions | `data/questions/questions_train.json` | Spider train split (`xlangai/spider`) | 473 questions |
13
+ | Eval questions | `data/questions/questions_eval.json` | Spider validation split (`xlangai/spider`) | 203 questions |
14
+ | DB allowlist | `data/questions/db_list.json` | hand-curated subset | 10 db_ids |
15
+ | SFT trajectories | `data/sft/sft_trajectories.json` | generated from gold SQL | 120 trajectories |
16
+
17
+ Total: ~676 questions across 10 Spider databases, plus 120 multi-turn SFT
18
+ warmup trajectories.
19
+
20
+ ## Upstream: Spider
21
+
22
+ [Spider](https://yale-lily.github.io/spider) (Yu et al., EMNLP 2018) is a
23
+ cross-domain text-to-SQL benchmark with ~200 databases and ~10k
24
+ question/gold-SQL pairs. Every question has a natural-language prompt, a
25
+ gold SQL query, and a target database. We use two mirrors:
26
+
27
+ 1. **Questions** via HuggingFace Datasets: [`xlangai/spider`](https://huggingface.co/datasets/xlangai/spider)
28
+ — loaded with `datasets.load_dataset("xlangai/spider", split=...)` in
29
+ `scripts/download_spider_data.py`.
30
+ 2. **SQLite databases** via the Spider GitHub mirror:
31
+ - `https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite`
32
+ - Fallback: the official Google Drive archive
33
+ (`1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J`)
34
+ — fetched by `scripts/download_spider_databases.py`.
35
+
36
+ Spider's license is CC BY-SA 4.0.
37
+
38
+ ## The 10-database subset (`db_list.json`)
39
+
40
+ We do not ship all ~200 Spider databases. The allowlist in
41
+ `data/questions/db_list.json` pins 10:
42
+
43
+ ```
44
+ student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation,
45
+ pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player
46
+ ```
47
+
48
+ These were chosen for schema variety (single-table aggregates, 2-table
49
+ joins, 3-table FK chains, naming-convention quirks) while keeping the
50
+ repo small and the training loop fast.
51
+
52
+ **Train/eval split is by database**, not random sampling within a
53
+ database. This prevents train/eval leakage at the schema level:
54
+
55
+ - **Train databases** (7): `car_1, concert_singer, cre_Doc_Template_Mgt,
56
+ dog_kennels, employee_hire_evaluation, flight_2, student_assessment`
57
+ - **Eval databases** (4): `flight_2, pets_1, poker_player, world_1`
58
+
59
+ `flight_2` appears in both; other eval DBs are schemas the model never
60
+ saw during training. `sql_env.training.data_loading.validate_no_data_leak`
61
+ asserts zero question-text overlap between the two files at load time.
62
+
63
+ ## Question files
64
+
65
+ Both `questions_train.json` and `questions_eval.json` are lists of records
66
+ with this shape (actual sample from `car_1` train):
67
+
68
+ ```json
69
+ {
70
+ "question_text": "How many cars have a larger accelerate than the car with the largest horsepower?",
71
+ "database_name": "car_1",
72
+ "gold_sql": "SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1)",
73
+ "gold_answer": 39,
74
+ "answer_type": "integer",
75
+ "difficulty": "easy",
76
+ "tables_involved": ["CARS_DATA"],
77
+ "split": "train",
78
+ "question_id": "car_1_train_000"
79
+ }
80
+ ```
81
+
82
+ ### Counts and difficulty mix
83
+
84
+ | Split | Total | easy | medium | hard |
85
+ |---|---|---|---|---|
86
+ | train | 473 | 435 | 32 | 6 |
87
+ | eval | 203 | 185 | 18 | 0 |
88
+
89
+ The easy-heavy distribution is deliberate for the 0.6B capacity ceiling
90
+ (see `docs/blog-material.md` — "The 0.6B Capacity Ceiling"). Medium and
91
+ hard questions are kept in the mix for Phase 2 exposure but are not where
92
+ this model size gains accuracy.
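The table above can be reproduced from the question files by tallying the `difficulty` field. A small sketch:

```python
from collections import Counter

def difficulty_mix(records):
    """Tally question records by their Spider difficulty label."""
    counts = Counter(r["difficulty"] for r in records)
    return {"total": len(records), **counts}
```

Running it over `json.load(open("data/questions/questions_train.json"))` should reproduce the train row.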
+
+ ### Curation pipeline
+
+ `scripts/curate_questions.py` turns raw Spider records into the format
+ above. Per question, it:
+
+ 1. Filters to databases in `db_list.json`.
+ 2. Executes `gold_sql` against the real SQLite database to produce
+    `gold_answer` — this is what the environment grades agents against,
+    not a string match on the SQL.
+ 3. Normalizes the answer into a typed shape (`integer`, `string`,
+    `list[...]`, `table`) via `answer_type`.
+ 4. Parses `FROM` and `JOIN` tokens to fill `tables_involved` (used by SFT
+    generation to decide which tables to `describe()`).
+ 5. Assigns the Spider-provided difficulty label.
+ 6. Writes train and eval to separate files with a per-question
+    `question_id` derived from `{db_id}_{split}_{index}`.
+
+ Re-running the script is idempotent given the same Spider snapshot.
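Step 2 is the heart of the pipeline: the gold answer is whatever the gold query actually returns. A minimal sketch (the real script's `answer_type` normalization is richer than this):

```python
import sqlite3

def execute_gold_sql(db_path, gold_sql):
    """Curation step 2, sketched: run the gold query against the real
    SQLite file and collapse the result into a gradable gold_answer."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(gold_sql).fetchall()
    finally:
        conn.close()
    # A 1x1 result collapses to a scalar (like the COUNT(*) sample above).
    if len(rows) == 1 and len(rows[0]) == 1:
        return rows[0][0]
    return rows
```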
+
+ ## SFT warmup trajectories (`data/sft/sft_trajectories.json`)
+
+ 120 multi-turn trajectories used as a supervised warmup before GRPO.
+ Each record has `messages` (tool-calling chat format) and `tools` (the
+ four SQLEnv tool schemas).
+
+ ### How they are generated
+
+ `scripts/generate_sft_data.py` walks the training questions and, for each
+ one, runs the real `SQLEnvironment` programmatically:
+
+ 1. `describe(table)` for every table in `tables_involved`
+ 2. `query(gold_sql)` — captures real tabular output from the environment
+ 3. `answer(gold_answer)` — terminal step
+
+ The captured sequence becomes an assistant-labelled trajectory. This is
+ **not synthetic text** — the assistant turns wrap the actual environment
+ responses the model will see at training and inference time, which is
+ what lets GRPO's KL anchor point align with real env output.
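The trajectory assembly can be sketched as follows. Field names here assume a generic OpenAI-style tool-calling chat format, not the script's exact schema:

```python
def build_trajectory(question, tables, gold_sql, gold_answer, env_responses):
    """Assemble one multi-turn SFT record (sketch): describe each involved
    table, run the gold SQL, then answer, pairing every assistant tool
    call with the real environment response captured for it."""
    messages = [{"role": "user", "content": question}]
    steps = ([("describe", t) for t in tables]
             + [("query", gold_sql), ("answer", gold_answer)])
    for (tool, arg), response in zip(steps, env_responses):
        messages.append({"role": "assistant", "tool_call": {"name": tool, "argument": arg}})
        messages.append({"role": "tool", "content": response})
    return {"messages": messages}
```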
+
+ The 120-count is smaller than the 473 training questions because SFT
+ samples a subset that exercises each database and difficulty bucket;
+ see `scripts/generate_sft_data.py` for the selection logic.
+
+ Why multi-turn matters: an earlier per-turn SFT (347 single-turn
+ examples) taught the model to always call `describe` and nothing else.
+ Multi-turn teaches the full `describe → query → answer` sequence. See
+ `docs/blog-material.md` — "Multi-Turn SFT — Why It's Critical".
+
+ ## How to regenerate from scratch
+
+ ```bash
+ # 1. Databases (SQLite files from the Spider mirror)
+ uv run python scripts/download_spider_databases.py --db-id all
+
+ # 2. Raw Spider questions (via HF Datasets)
+ uv run python scripts/download_spider_data.py --db-id all --split train
+ uv run python scripts/download_spider_data.py --db-id all --split validation
+
+ # 3. Curate into questions_train.json / questions_eval.json
+ uv run python scripts/curate_questions.py
+
+ # 4. Regenerate SFT trajectories from gold SQL
+ uv run python scripts/generate_sft_data.py
+ ```
+
+ You should not need to do this in normal operation — the curated files
+ are committed. Regenerate only when updating the database allowlist,
+ changing the answer-type taxonomy, or rerunning against a new Spider
+ snapshot.
+
+ ## What we deliberately do not use
+
+ - **BIRD** (Li et al., 2023) — larger, harder text-to-SQL benchmark. Out
+   of scope for a 0.6B model; revisit for a larger-model follow-up.
+ - **WikiSQL** — single-table only, doesn't exercise the multi-turn
+   exploration the environment is built for.
+ - **Synthetic LLM-generated questions** — we want Spider's human-written
+   prompts so eval results are comparable to published work.
+ - **Spider databases outside `db_list.json`** — kept out to keep the
+   repo small and training fast. Easy to widen by editing the list and
+   rerunning the regeneration pipeline.
+
+ ## References
+
+ - [Spider dataset (Yale LILY)](https://yale-lily.github.io/spider)
+ - [taoyds/spider GitHub mirror](https://github.com/taoyds/spider)
+ - [xlangai/spider on HuggingFace](https://huggingface.co/datasets/xlangai/spider)
+ - Yu et al. (2018). *Spider: A Large-Scale Human-Labeled Dataset for
+   Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.* EMNLP.
docs/delivery-specs/index.md ADDED
@@ -0,0 +1,66 @@
+ # Delivery Specs
+
+ This directory contains delivery specifications that define *how* to build features for engineering handoff.
+
+ ## What is a Delivery Spec?
+
+ Delivery specs translate validated ideas into engineering requirements. They answer:
+ - What are the functional requirements?
+ - What files need to be created/modified?
+ - What are the acceptance criteria?
+ - What agent instructions apply?
+
+ ## When to Use
+
+ | Complexity | Discovery? | Delivery Spec? | Notes |
+ |------------|------------|----------------|-------|
+ | **Simple** (CRUD) | Skip | Skip | Go directly to `autocode-implementation-planner` |
+ | **Standard** (feature) | Recommended | Optional | Discovery provides enough context |
+ | **Complex** (system) | Required | Required | Full pipeline for alignment |
+
+ ## File Format
+
+ Each feature has one file:
+ - `{feature}.md` — Structured delivery spec with requirements, file manifest, agent instructions
+
+ ## Delivery Specs Index
+
+ | Feature | Status | Date |
+ |---------|--------|------|
+ | *None yet* | | |
+
+ ## Relationship to Other Docs
+
+ ```
+ discovery/ → Validate + Taste (OST, PR/FAQ)
+
+ delivery-specs/ → Engineering handoff (YOU ARE HERE)
+
+ design-docs/ → Technical approach (ADR-style)
+
+ FEATURES.json → Links docs to implementation
+
+ specs/ → Implementation specs (autocode)
+ ```
+
+ ## Creating Delivery Specs
+
+ Use the `delivery-spec` skill for structured delivery specs:
+
+ ```
+ skill({ name: "delivery-spec" })
+ ```
+
+ The skill guides you through:
+ 1. **Problem context** — References discovery doc
+ 2. **Functional requirements** — What the system must do
+ 3. **File manifest** — Files to create/modify
+ 4. **Agent instructions** — Patterns, constraints, anti-patterns
+ 5. **Acceptance criteria** — Definition of done
+
+ ## Integration with Discovery
+
+ Delivery specs read from discovery JSON:
+ - Pull `taste.delights` as success criteria context
+ - Pull `taste.frustrations` as anti-patterns
+ - Pull `scope.out_of_scope` as explicit boundaries
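The read side can be sketched in a few lines. The field names below are the ones listed above; the exact discovery JSON schema is assumed:

```python
def spec_context(discovery):
    """Extract delivery-spec inputs from a discovery JSON dict (sketch;
    assumes the taste/scope field names listed above)."""
    taste = discovery.get("taste", {})
    scope = discovery.get("scope", {})
    return {
        "success_criteria": taste.get("delights", []),
        "anti_patterns": taste.get("frustrations", []),
        "out_of_scope": scope.get("out_of_scope", []),
    }
```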
docs/design-docs/core-beliefs.md ADDED
@@ -0,0 +1,61 @@
+ ---
+ title: Core Beliefs
+ description: Agent-first operating principles for the sql-env project
+ type: explanation
+ doc_type: explanation
+ ---
+
+ # Core Beliefs
+
+ Agent-first operating principles for this project.
+
+ ## Philosophy
+
+ > "Humans steer. Agents execute."
+
+ When something fails, the fix is never "try harder" — it's "what capability is missing?"
+
+ ## Principles
+
+ ### 1. Repository Knowledge is the System of Record
+
+ Anything an agent can't access in-context effectively doesn't exist. Knowledge that lives in Slack, Google Docs, or people's heads is invisible to the system.
+
+ **Implication:** Push context into the repo. Design decisions, product principles, and team conventions must be documented in versioned files.
+
+ ### 2. Enforce Invariants, Not Implementations
+
+ Mechanical constraints (lints, tests, type systems) apply everywhere at once. Narrative instructions degrade with context length.
+
+ **Implication:** When possible, encode rules as code (lints, schemas, types) rather than prose (AGENTS.md instructions).
+
+ ### 3. Design Before Implementation
+
+ Think about interfaces before implementation details. The spec-first workflow (Research — Sketch — Spec — Verify) produces deeper, more coherent modules than "prompt and iterate."
+
+ **Implication:** Use the autocode pipeline. Don't skip the research phase.
+
+ ### 4. Parse, Don't Validate
+
+ Prefer parsing data into precise types early rather than validating ad-hoc throughout the codebase. Invalid states should be unrepresentable.
+
+ **Reference:** [Parse, Don't Validate](https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/)
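An illustrative sketch of the principle (not the project's actual types): parse once at the boundary into a precise type, so nothing downstream needs to re-check.

```python
from dataclasses import dataclass

ALLOWED_DIFFICULTIES = ("easy", "medium", "hard")

@dataclass(frozen=True)
class Question:
    """Parsed at the boundary: past this point an invalid difficulty
    is unrepresentable."""
    question_text: str
    difficulty: str

def parse_question(raw: dict) -> Question:
    # Reject invalid input here, once, instead of validating everywhere.
    if raw.get("difficulty") not in ALLOWED_DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {raw.get('difficulty')!r}")
    return Question(raw["question_text"], raw["difficulty"])
```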
+
+ ### 5. Boring Technology is Agent-Friendly
+
+ Technologies with stable APIs, composability, and strong training-set representation are easier for agents to model.
+
+ **Implication:** Prefer well-established libraries over cutting-edge ones. Sometimes it's cheaper to reimplement a subset than to work around opaque behavior.
+
+ ### 6. Taste is Captured Once, Enforced Continuously
+
+ Human taste is fed back into the system through review comments, refactoring PRs, and bug fixes. Once captured, it applies to every future line of code.
+
+ **Implication:** Use the `compound-engineer` to extract learnings. Update lints when patterns emerge.
+
+ ## Anti-Patterns
+
+ - **Encyclopedia AGENTS.md:** When everything is "important," nothing is.
+ - **Shotgun Parsing:** Validation scattered throughout the codebase instead of at boundaries.
+ - **Try Harder Debugging:** Prompting differently instead of fixing the environment.
+ - **AI Slop Fridays:** Manual cleanup sessions instead of continuous garbage collection.
docs/design-docs/index.md CHANGED
@@ -18,7 +18,7 @@ Architectural Decision Records are stored in [decisions/](decisions/).
 
  | Feature | Status | Date | Reversibility |
  |---------|--------|------|---------------|
- | *None yet* | | | |
+ | [Reward Shaping Research](reward-shaping-research.md) | Active | 2026-03-29 | N/A (research) |
 
  ## Creating Design Docs
 
docs/design-docs/reward-shaping-research.md ADDED
@@ -0,0 +1,197 @@
+ ---
+ title: Reward Shaping Research
+ description: Theoretical basis for SQLEnv dense reward architecture, comparing cumulative-cap vs delta-based progress approaches grounded in potential-based shaping theory
+ doc_type: explanation
+ ---
+
+ # Reward Shaping Research
+
+ > Last updated: 2026-03-29
+
+ Research notes on dense reward design for SQLEnv. Documents the theoretical basis for our reward architecture, the problems with the original cumulative-cap design, and the rationale for switching to per-step clipping with delta-based progress.
+
+ ## Problem Statement
+
+ SQLEnv's original reward system used:
+ 1. **Cumulative tracking** with a hard cap at [-0.2, 0.5]
+ 2. **Improvement-only gating** (reward only when `binned_progress > best_progress`)
+
+ Both violate established RL theory and create practical training problems.
+
+ ## Potential-Based Reward Shaping (Ng et al., 1999)
+
+ **Paper:** Ng, A. Y., Harada, D., & Russell, S. (1999). "Policy invariance under reward transformations: Theory and application to reward shaping." ICML.
+
+ **Core theorem:** Given an MDP M with reward R, define the shaped reward R' = R + F. The optimal policy is preserved **if and only if** F has the form:
+
+ ```
+ F(s, a, s') = γ · Φ(s') − Φ(s)
+ ```
+
+ where Φ: S → ℝ is a potential function and γ is the discount factor.
+
+ **Why this matters for SQLEnv:**
+ - Cumulative capping is NOT potential-based (the shaping reward depends on trajectory history, not just state transitions)
+ - Non-potential-based shaping changes the optimal policy in unpredictable ways
+ - Agents may optimize for shaped reward rather than task completion
+
+ **Delta-from-previous IS potential-based** with γ=1:
+ ```
+ F(s, s') = Φ(s') − Φ(s) where Φ(s) = binned_progress(s)
+ ```
+ This form provably preserves the optimal policy.
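With γ=1 the delta terms telescope: summed over an episode they contribute exactly Φ(s_T) − Φ(s_0), so wandering back and forth earns nothing net. A minimal sketch, with Φ as the binned progress:

```python
def shaping_rewards(potentials, gamma=1.0):
    """Potential-based shaping terms F = γ·Φ(s') − Φ(s) along an episode.

    With gamma=1 the terms telescope: their sum is Φ(s_T) − Φ(s_0),
    so only net progress pays, never back-and-forth farming.
    """
    return [gamma * nxt - prev for prev, nxt in zip(potentials, potentials[1:])]
```

For example, the binned-progress sequence 0.0 → 0.25 → 0.75 → 0.5 → 1.0 yields deltas that sum to exactly 1.0, the net progress.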
+
+ ## Why Cumulative Caps Are Harmful
+
+ ### The POMDP problem
+ The cumulative reward counter is part of the environment's hidden state. The agent cannot observe it. This means:
+ - The same (observation, action) pair can yield different rewards depending on hidden history
+ - The agent cannot learn when the shaping signal will stop
+ - Credit assignment breaks: "low reward because bad action" vs "low reward because cap hit"
+
+ ### Early saturation
+ Once cumulative reward hits 0.5, all subsequent steps return zero shaping. For a 15-step episode, if the cap hits at step 5, steps 6-14 carry no learning signal. The agent receives no gradient for more than half the episode.
+
+ ### The cap was redundant
+ With a 15-step budget and -0.005 step cost:
+ - Max possible L1 reward per step: +0.025 (exec_ok + new_info - step_cost)
+ - Max over 15 steps: 0.375
+ - Realistic total (mixed actions): ~0.15-0.25
+ - Terminal reward: 1.0
+
+ The 4-7x ratio between terminal and exploration reward makes farming exploration irrational without any cap.
+
+ ## Why Improvement-Only Gating Blocks Learning
+
+ ### No recovery signal
+ If the agent achieves progress 0.75 on step 3, then regresses to 0.25 on step 4, then recovers to 0.75 on step 5:
+ - **Old design:** Steps 4 and 5 both get zero reward (0.25 < 0.75, 0.75 ≤ 0.75)
+ - **Delta design:** Step 4 gets -0.075 (regression), step 5 gets +0.075 (recovery)
+
+ The delta design gives the agent information about what happened. The old design is silent.
+
+ ### Discourages experimentation
+ With improvement-only gating, once an agent achieves good progress, any experimental query that might regress temporarily is pure risk (no upside if it doesn't exceed the best). This discourages the kind of exploratory behavior the environment is designed to train.
+
+ ## Current Design: Per-Step Clipping + Delta Progress
+
+ ### Per-step reward structure
+ ```
+ L1 Operational (every step):
+   +0.02  exec_ok (no error)
+   +0.01  new_info (unique SQL hash)
+   -0.01  repeat penalty (same SQL)
+   -0.005 step cost
+
+ L2 Progress (QUERY only):
+   Weighted score: cardinality (25%) + value overlap (50%) + numeric range (25%)
+   Binned to {0, 0.25, 0.5, 0.75, 1.0}
+   Delta = binned - previous_progress
+   Reward = delta * 0.15
+
+ L3 Terminal (ANSWER only):
+   +1.0 correct, 0.0 wrong
+
+ Per-step clip: [-0.05, 0.15]
+ No cumulative tracking.
+ ```
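One non-terminal step of that structure can be sketched as below. The constants mirror the doc, not the exact `sql_env` implementation, and the clip is assumed to apply to the shaping layers only, since the terminal +1.0 exceeds it:

```python
def step_reward(exec_ok, is_new_sql, binned_progress, prev_progress):
    """One non-terminal step of the layered reward (sketch, not the
    exact sql_env code)."""
    r = -0.005                                     # step cost, every step
    if exec_ok:
        r += 0.02                                  # L1: query executed cleanly
    r += 0.01 if is_new_sql else -0.01             # L1: novelty vs repeat penalty
    r += (binned_progress - prev_progress) * 0.15  # L2: delta progress
    return max(-0.05, min(0.15, r))                # per-step clip
```

A clean new query that lifts binned progress from 0.0 to 0.25 lands well inside the clip; a full 0→1 jump saturates it at 0.15.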
+
+ ### Anti-farming mechanisms
+ | Mechanism | What it prevents |
+ |-----------|-----------------|
+ | Hard budget (15 steps) | Infinite exploration |
+ | Step cost (-0.005) | Idle steps |
+ | Repeat penalty (-0.01) | Query farming |
+ | Terminal dominance (1.0 vs ~0.3 max) | Exploration over answering |
+ | Per-step clip (0.15 max) | Single-step reward spikes |
+
+ ## Comparison of Progress Approaches
+
+ | Approach | Recovery signal | Farming risk | Theory |
+ |----------|----------------|-------------|--------|
+ | Improvement-only (old) | None | None | No formal guarantee |
+ | Absolute quality each step | Yes | High (repeat good queries) | None |
+ | Delta from previous step | Yes | Low | Potential-based (Ng 1999), provably policy-invariant |
+
+ ## GRPO Integration
+
+ **GRPO was designed for episode-level rewards** (DeepSeek-R1). Dense per-step rewards are aggregated to a single episode scalar for GRPO's advantage computation.
+
+ **"GRPO is Secretly a Process Reward Model"** (Sullivan et al., 2025/2026, ICLR 2026) proved that GRPO implicitly performs process-level credit assignment when completions share prefixes. They identified a flaw (non-uniform step distribution) and proposed lambda-GRPO.
+
+ **For SQLEnv:** Dense rewards shape rollout behavior within each episode, but are aggregated to episode level for GRPO. Weight terminal correctness heavily: ~1.0 correctness + 0.3 progress + 0.1 operational.
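That aggregation can be sketched as a weighted collapse of the per-layer sums into the single scalar GRPO consumes. The ~1.0/0.3/0.1 weighting follows the note above; the per-layer decomposition is an assumption about how the sums are kept:

```python
def episode_scalar(l1_rewards, l2_rewards, answered_correctly,
                   w_operational=0.1, w_progress=0.3, w_correct=1.0):
    """Collapse dense per-step rewards into one episode scalar for
    GRPO's advantage computation (sketch, not the exact sql_env code)."""
    return (w_correct * (1.0 if answered_correctly else 0.0)   # terminal dominates
            + w_progress * sum(l2_rewards)                     # L2 progress deltas
            + w_operational * sum(l1_rewards))                 # L1 operational signal
```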
+
+ **Relevant validation:**
+ - **TIPS** (March 2026): potential-based turn-level shaping for multi-turn LLM agents, 11.8% EM improvement over PPO/GRPO baselines
+ - **ToolRL** (2025): finer-grained reward decomposition leads to 17% improvement over base models with GRPO
+ - **StepTool** (2024): step-grained reward shaping significantly outperformed outcome-only for tool learning
+
+ ## Future Directions
+
+ 1. **Diminishing novelty bonuses:** `reward = 0.01 / (1 + 0.5 * exploration_count)` instead of a flat +0.01 per unique query. Classic count-based exploration (Bellemare et al. 2016, Never Give Up) naturally tapers.
+
+ 2. **Curriculum on step budget:** Start with a generous budget (20 steps) for easy questions, tighten to 10 for hard ones as training progresses.
+
+ 3. **Per-layer independent clipping:** Clip L1 and L2 separately rather than their sum, preventing one layer from consuming the other's budget.
+
+ 4. **Lambda-GRPO:** Apply Sullivan et al.'s fix for non-uniform step distribution to improve credit assignment across steps.
+
+ 5. **Adaptive Length Penalty (ALP):** From "Just Enough Thinking" (2026): per-prompt length penalties based on solve rate. Could adapt the step budget per difficulty level.
+
+ ## Why Result-Based, Not SQL-Structure-Based
+
+ A natural question: why compare query *results* to the gold results, rather than comparing the SQL *structure* to the gold SQL?
+
+ ### The equivalence problem
+
+ Multiple SQL queries produce identical results:
+
+ ```sql
+ SELECT name FROM employees WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC')
+ SELECT e.name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE d.location = 'NYC'
+ SELECT name FROM employees WHERE EXISTS (SELECT 1 FROM departments WHERE id = employees.dept_id AND location = 'NYC')
+ ```
+
+ Rewarding structural similarity to one gold query penalizes valid alternatives. This creates false-negative gradient signal that hurts training.
+
+ ### The field moved away from structural comparison
+
+ Spider (Yu et al., 2018) used exact set match (decomposing SQL into component sets). BIRD (Li et al., 2023) replaced it with execution accuracy, explicitly arguing that "exact match is too strict and penalizes valid alternative SQL formulations." Every recent system (DAIL-SQL, MAC-SQL, CHESS) evaluates on execution accuracy.
+
+ ### Intermediate queries aren't meant to look like the gold
+
+ In our POMDP, the agent runs exploratory queries (`SELECT * FROM t LIMIT 5`, `SELECT COUNT(*)`) to gather information. These should look nothing like the gold query. Rewarding structural similarity would push the agent toward exploitation before it has explored enough.
+
+ ### Result comparison is the right signal
+
+ | Dimension | Result-based | SQL-structure-based |
+ |-----------|-------------|---------------------|
+ | Handles SQL equivalence | Yes | No |
+ | Correlates with true objective | Directly | Indirectly (proxy) |
+ | Works for exploratory queries | Yes | No (penalizes exploration) |
+ | Literature support | Strong (BIRD, CodeRL, LEVER) | Declining (Spider exact match being replaced) |
+
+ ### What about SQL validity rewards?
+
+ One structural signal IS worth using: penalizing queries that fail to execute (syntax errors, missing tables/columns). This is not SQL similarity — it's SQL validity. We already do this via L1 operational rewards: exec_ok (+0.02) vs error (-0.005 step cost only). This accelerates learning without biasing toward a specific solution path.
+
+ ## References
+
+ - Ng, Harada, Russell (1999). [Policy invariance under reward transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
+ - Sullivan et al. (2025/2026). [GRPO is Secretly a Process Reward Model](https://arxiv.org/abs/2509.21154). ICLR 2026.
+ - Shao et al. (2024). [DeepSeek-Math: GRPO](https://arxiv.org/abs/2402.03300).
+ - DeepSeek-AI (2025). [DeepSeek-R1](https://arxiv.org/pdf/2501.12948).
+ - TIPS (2026). [Turn-Level Information-Potential Reward Shaping](https://arxiv.org/abs/2603.22293).
+ - ToolRL (2025). [Reward is All Tool Learning Needs](https://arxiv.org/html/2504.13958v1).
+ - StepTool (2024). [Step-grained RL for Tool Learning](https://arxiv.org/abs/2410.07745).
+ - RAGEN (2025). [Multi-Turn RL for LLM Agents](https://arxiv.org/abs/2504.20073).
+ - Bellemare et al. (2016). [Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868).
+ - Just Enough Thinking (2026). [Adaptive Length Penalties](https://arxiv.org/html/2506.05256v1).
+ - Fireworks AI. [Best Practices for Multi-Turn RL](https://fireworks.ai/blog/best-practices-for-multi-turn-RL).
+ - Wiewiora, Cottrell, Elkan (2003). Principled Methods for Advising Reinforcement Learning Agents.
+ - Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing](https://arxiv.org/abs/1809.08887).
+ - Li et al. (2023). [Can LLM Already Serve as a Database Interface? (BIRD)](https://arxiv.org/abs/2305.03111).
+ - Zhong et al. (2020). [Semantic Evaluation for Text-to-SQL with Distilled Test Suites](https://arxiv.org/abs/2010.02840).
+ - Le et al. (2022). [CodeRL: Mastering Code Generation through Pretrained Models and Deep RL](https://arxiv.org/abs/2207.01780).
+ - Lightman et al. (2023). [Let's Verify Step by Step](https://arxiv.org/abs/2305.20050).
+ - Ni et al. (2023). [LEVER: Learning to Verify Language-to-Code Generation with Execution](https://arxiv.org/abs/2302.08468).
docs/discovery/.gitkeep ADDED
File without changes
docs/discovery/index.md ADDED
@@ -0,0 +1,63 @@
+ # Discovery
+
+ This directory contains discovery artifacts that validate *what* to build and capture *taste*.
+
+ ## What is Discovery?
+
+ Discovery validates product ideas and captures user taste before detailed planning. It answers:
+ - What opportunity are we pursuing? (Opportunity Solution Tree)
+ - What problem are we solving? (PR/FAQ)
+ - What would DELIGHT users? (Taste capture)
+ - What would FRUSTRATE users? (Anti-patterns)
+ - What FEELING should users have? (North star)
+ - How does this compare to alternatives? (RICE/ICE prioritization)
+
+ ## File Format
+
+ Each feature has two files:
+ - `{feature}.md` — Human-readable discovery doc with OST, PR/FAQ, taste
+ - `{feature}.json` — Machine-readable taste data for downstream skills
+
+ ## Discovery Index
+
+ | Feature | Status | Date |
+ |---------|--------|------|
+ | *None yet* | | |
+
+ ## Relationship to Other Docs
+
+ ```
+ discovery/ → Validate + Taste (OST, PR/FAQ)
+
+ delivery-specs/ → Engineering handoff (functional requirements)
+
+ design-docs/ → Technical approach (ADR-style)
+
+ FEATURES.json → Links docs to implementation
+
+ specs/ → Implementation specs (autocode)
+
+ learnings/ → Knowledge extracted after completion
+ ```
+
+ ## Creating Discovery Docs
+
+ Use the `discovery` skill for structured discovery:
+
+ ```
+ skill({ name: "discovery" })
+ ```
+
+ The skill guides you through:
+ 1. **Opportunity Solution Tree** — Outcome, opportunities, solutions
+ 2. **PR/FAQ** — Headline, problem, solution, customer quote
+ 3. **Taste interview** — Delights, frustrations, feeling, maturity
+ 4. **Prioritization** — RICE/ICE scoring
+
+ ## Integration with Autocode
+
+ The `autocode-implementation-planner` skill automatically reads the JSON file:
+ - `taste.delights` → Success criteria in implementation spec
+ - `taste.frustrations` → Anti-patterns to explicitly avoid
+ - `taste.feeling` → North star for implementation decisions
+ - `scope.out_of_scope` → Boundaries the implementation must respect
docs/exec-plans/README.md ADDED
@@ -0,0 +1,41 @@
+ # Execution Plans
+
+ This directory contains execution plans for complex, multi-step work.
+
+ ## What is an Execution Plan?
+
+ An execution plan (ExecPlan) is a living document that tracks progress on significant work. Unlike implementation specs (which define *what* to build), exec plans track *how* the work is progressing.
+
+ ## Directory Structure
+
+ ```
+ exec-plans/
+ ├── active/               # Currently in progress
+ ├── completed/            # Finished work (for reference)
+ ├── tech-debt-tracker.md  # Known technical debt
+ └── README.md             # This file
+ ```
+
+ ## When to Use
+
+ Use an execution plan when:
+ - Work spans multiple sessions or days
+ - Multiple steps have dependencies
+ - Progress needs to be visible to humans
+ - Work involves significant research or discovery
+
+ Do NOT use for:
+ - Simple, single-session features
+ - Bug fixes
+ - Routine changes
+
+ ## ExecPlan Format
+
+ See the [OpenAI Exec Plans cookbook](https://developers.openai.com/cookbook/articles/codex_exec_plans) for the full specification.
+
+ Key sections:
+ - **Purpose / Big Picture:** What someone gains after this change
+ - **Progress:** Checklist with timestamps
+ - **Surprises & Discoveries:** Unexpected findings
+ - **Decision Log:** Why choices were made
+ - **Outcomes & Retrospective:** Summary at completion
docs/exec-plans/active/.gitkeep ADDED
File without changes
docs/exec-plans/completed/.gitkeep ADDED
File without changes
docs/exec-plans/tech-debt-tracker.md ADDED
@@ -0,0 +1,37 @@
+ # Technical Debt Tracker
+
+ Known technical debt and cleanup opportunities.
+
+ ## Active Debt
+
+ | Item | Type | Severity | Origin | Notes |
+ |------|------|----------|--------|-------|
+ | `huggingface_hub`/`transformers` version mismatch breaks test collection (`is_offline_mode` import error) in training/error-handling suites | Infra | D (Blocking) | F014 verification run | Blocks full-suite verification and should be resolved with a pinned compatible dependency set for training-related extras. |
+
+ ## Types
+
+ - **Code:** Shortcuts, duplication, missing abstractions
+ - **Tests:** Missing coverage, flaky tests, slow tests
+ - **Docs:** Stale documentation, missing docs
+ - **Infra:** Build/deploy issues, tooling gaps
+ - **Architecture:** Layer violations, wrong boundaries
+
+ ## Severity
+
+ - **D (Blocking):** Must fix before next release
+ - **C (Needs Refactor):** Should fix soon, causing friction
+ - **B (Minor):** Would be nice to fix
+ - **A (Clean):** No action needed
+
+ ## Process
+
+ 1. `/techdebt` command scans for issues and updates this file
+ 2. `compound-engineer` may add items from feature work
+ 3. `what-how-alignment` skill consolidates into refactor proposals
+ 4. Items graduate to `SUGGESTED_REFACTORS.md` when ready for implementation
+
+ ## Resolved Debt
+
+ | Item | Resolution | Date |
+ |------|------------|------|
+ | *None yet* | | |
docs/exploration/README.md ADDED
@@ -0,0 +1,44 @@
+ # Exploration
+
+ Ideas, technology research, and ad-hoc investigation notes. This is a scratchpad -- content here is not system-of-record.
+
+ **Diataxis type:** Exploration (learning-oriented, not yet distilled)
+
+ ## What Goes Here
+
+ - Technology evaluations and comparisons
+ - Prototype findings
+ - External API exploration
+ - Performance investigations
+ - Ideation and backlog notes
+
+ ## What Does NOT Go Here
+
+ - Durable learnings (go to `docs/learnings/`)
+ - Design decisions (go to `docs/design-docs/`)
+ - Implementation specs (go to `specs/`)
+ - Operational how-to guides (go to `docs/guides/`)
+
+ ## Exploration Index
+
+ | Topic | Type | Date | Summary |
+ |-------|------|------|---------|
+ | [grpo-collapse-analysis.md](grpo-collapse-analysis.md) | Investigation | 2026-04 | Post-mortem on Qwen3-1.7B GRPO collapse into degenerate null-argument tool calls |
+ | [grpo-plateau-plan.md](grpo-plateau-plan.md) | Investigation | 2026-04 | Interventions to push past 30-40% accuracy plateau in GRPO training |
+ | [grpo-training-session-log.md](grpo-training-session-log.md) | Investigation | 2026-04 | Running log of SFT warmup + GRPO training sessions on Colab L4 |
+ | [rl-vs-icl-research.md](rl-vs-icl-research.md) | Comparison | 2026-04 | When GRPO training adds value over pure prompting for small SQL agents |
+ | [train-grpo-walkthrough.md](train-grpo-walkthrough.md) | Prototype | 2026-04 | Step-by-step companion guide for train_grpo.ipynb |
+
+ ## Types
+
+ - **Tech Eval:** Evaluating a library, framework, or service
+ - **Prototype:** Findings from exploratory prototyping
+ - **Investigation:** Deep dive into a specific problem
+ - **Comparison:** Side-by-side analysis of options
+
+ ## Graduating Content
+
+ When exploration produces durable insights:
+ 1. Extract patterns to `docs/learnings/<category>.md`
+ 2. Create reference files in `docs/references/` for agent context
+ 3. Create how-to guides in `docs/guides/` for operational procedures
docs/exploration/f007-prelaunch-checklist.md ADDED
@@ -0,0 +1,455 @@
+ # F007 Pre-Launch Checklist (temp, 2026-04-12)
+
+ Scope: verify the HF Space deployment is real and usable **before** the blog
+ post goes live today. Delete this file after launch.
+
+ ---
+
+ ## TL;DR — what to do in the next ~60 min
+
+ | # | Action | Time | Value | Do it? |
+ |---|---|---|---|---|
+ | 1 | Open Space in browser, confirm it loads | 2 min | **Critical** — judges will click the link first | **YES** |
+ | 2 | Hit `/health` and `/docs` | 1 min | **Critical** — proves server is up | **YES** |
+ | 3 | Run one full episode via `/web` UI | 5 min | **Critical** — proves action space works end-to-end | **YES** |
+ | 4 | Fix stale `docs/competition-deliverables.md` status | 3 min | **High** — doc claims "Not started", Space is live | **YES** |
+ | 5 | Python client smoke test against live Space | 10 min | **High** — proves programmatic access (the thing the blog promises) | **YES** |
+ | 6 | Pull `registry.hf.space/...` Docker image and run locally | 10 min | Medium — nice to have, judges rarely do this | If time |
+ | 7 | `pip install` from Space URL | 5 min | Medium — validates `pyproject.toml` inside Space | If time |
+ | 8 | Concurrency audit (`SUPPORTS_CONCURRENT_SESSIONS`) | 15 min | **Low for launch, High for anyone retraining** | Skip today, file issue |
+ | 9 | TRL `environment_factory` wrapper | — | — | **Already done** (see below) |
+
+ **Recommendation:** Do 1–5 before publishing. Skip 6–8. Item 9 is already in
+ the repo.
+
+ ---
+
+ ## About TRL (already integrated — do not re-research)
+
+ **TRL** = Hugging Face's `transformers`-based RL library. Its `GRPOTrainer`
+ accepts an `environment_factory=MyEnvClass` argument and runs the multi-turn
+ tool-calling loop automatically: generate → parse tool call → call your env →
+ feed result back → repeat. No custom `rollout_func` needed.
+
+ **We already implement this.** `training/trl_adapter.py::SQLEnvTRL` is a
+ TRL-native environment class with:
+
+ - `reset(**kwargs)` — reads `question_text` from the dataset column to route
+   to the correct database
+ - Named tool methods with docstrings: `describe(table_name)`, `sample(table_name)`,
+   `query(sql)`, `answer(value)` — not a generic `step()`
+ - `sql_env_reward_func` as the reward function
+
+ `notebooks/train_grpo.ipynb` cell 16 passes it directly:
+
+ ```python
+ trainer = build_trainer(
+     ...
+     reward_funcs=[sql_env_reward_func],
+     environment_factory=SQLEnvTRL,
+     ...
+ )
+ ```
+
+ The Setup cell pins `trl>=0.29.0` and `transformers` from `main` specifically
+ because `environment_factory` requires transformers ≥5.2. Our v1/v2 runs used
+ this path.
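The shape of such a tool-method environment can be sketched without TRL installed. This is a hypothetical, in-memory stand-in (class name, toy table, and reward rule are all illustrative), not the real `training/trl_adapter.py`, which loads the Spider databases:

```python
import sqlite3


class SQLEnvSketch:
    """Illustrative TRL-native tool environment: named tool methods
    instead of a generic step(). Hypothetical stand-in for SQLEnvTRL."""

    def __init__(self):
        # Toy in-memory database standing in for the real Spider SQLite files.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE concert (id INTEGER, name TEXT)")
        self.conn.execute("INSERT INTO concert VALUES (1, 'Opening Night')")
        self.done = False
        self.reward = 0.0

    def reset(self, **kwargs):
        # The real adapter reads question_text from the dataset row here.
        self.done = False
        self.reward = 0.0
        return "Question: how many concerts are there?"

    def describe(self, table_name: str) -> str:
        rows = self.conn.execute(f"PRAGMA table_info({table_name})").fetchall()
        return str([(r[1], r[2]) for r in rows])  # (column name, type) pairs

    def query(self, sql: str) -> str:
        return str(self.conn.execute(sql).fetchall())

    def answer(self, value: str) -> str:
        # Terminal action: toy exact-match reward.
        self.done = True
        self.reward = 1.0 if value.strip() == "1" else 0.0
        return "episode finished"


env = SQLEnvSketch()
print(env.reset())
print(env.describe("concert"))
print(env.query("SELECT COUNT(*) FROM concert"))  # [(1,)]
print(env.answer("1"), env.reward)
```

The trainer introspects the named methods and their docstrings to build the tool schema the model sees, which is why the adapter exposes `describe`/`sample`/`query`/`answer` rather than one opaque `step()`.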
+
+ **One nuance (intentional design):** `SQLEnvTRL.__init__` instantiates a
+ **local in-process `SQLEnvironment`**, not a WebSocket client to
+ `https://hjerpe-sql-env.hf.space`. Reasons:
+
+ - Training opens N parallel sessions (one per generation). The hosted Space
+   defaults to 1 concurrent session — see `SUPPORTS_CONCURRENT_SESSIONS` in
+   the TRL↔OpenEnv docs.
+ - Local is faster, no network hops, no rate limits.
+ - The hosted Space is for judges (clicking `/web`) and external users
+   consuming the env via pip/Docker. Training correctly bypasses it.
+
+ **Implication for the blog:** you can claim "TRL-native integration via
+ `environment_factory`" factually. It's already true and the notebook proves it.
+
+ **What's still on the post-launch list** is the Space-side concurrency config
+ (item 8), not the adapter. Without `SUPPORTS_CONCURRENT_SESSIONS=True` on the
+ server, an external user trying to retrain against the hosted Space would hit
+ the 1-session cap. This does not affect our own training (we use local).
+
+ ---
+
+ ## 1. Browser smoke test (2 min) — CRITICAL
+
+ **How:**
+ ```bash
+ open https://huggingface.co/spaces/hjerpe/sql_env
+ ```
+ (or paste the URL into a browser manually)
+
+ **What to check:**
+ - [ ] Space status is **Running** (green), not Building / Sleeping / Error
+ - [ ] README renders with a clear one-liner of what the env does
+ - [ ] No red error banner at the top
+
+ **What you're validating:** that HF Spaces successfully built our image and
+ the container is alive. If it's sleeping, the first visit wakes it (~30s
+ cold start). Warm it up now and leave the tab open so blog readers don't hit
+ a cold Space.
+
+ **If it's broken:** open the "Logs" tab on the Space page → look for the
+ Docker build error → fix locally → re-push with `uv run openenv push`.
+
+ ---
+
+ ## 2. Health + API docs (1 min) — CRITICAL
+
+ **How:**
+ ```bash
+ curl -sS https://hjerpe-sql-env.hf.space/health
+ curl -sS https://hjerpe-sql-env.hf.space/docs | head -20  # should be HTML
+ open https://hjerpe-sql-env.hf.space/docs                 # visual check
+ ```
+
+ **What to check:**
+ - [ ] `/health` returns HTTP 200 and a JSON body (e.g. `{"status":"ok"}`)
+ - [ ] `/docs` Swagger page lists `/reset`, `/step`, `/ws` endpoints
+ - [ ] `/step` request schema mentions our `SQLAction` fields (`action_type`,
+   `argument`) with values `DESCRIBE`, `SAMPLE`, `QUERY`, `ANSWER`
+
+ **What you're validating:** the FastAPI server inside the container is up
+ and the OpenAPI schema published to the Space matches our local `SQLAction`
+ model. If schemas drift, clients break.
+
+ **If it's broken:** usually means the Dockerfile picked up a stale version
+ of `sql_env/models.py`. Rebuild and push: `uv run openenv build -t …` then
+ `uv run openenv push`.
+
+ ---
+
+ ## 3. One full episode via the built-in web UI (5 min) — CRITICAL
+
+ **How:**
+ ```bash
+ open https://hjerpe-sql-env.hf.space/web
+ ```
+
+ OpenEnv ships a `/web` interactive UI on every env. Walk through one full
+ episode:
+
+ 1. Click **Reset** — a schema hint + question prompt should appear
+ 2. Enter action `DESCRIBE` with argument = a table name from the reset output
+ 3. Enter action `SAMPLE` with a table name (confirm 5 sample rows come back)
+ 4. Enter action `QUERY` with a valid `SELECT ...` (confirm rows return)
+ 5. Enter action `ANSWER` with your final answer (confirm reward + `done=true`)
+
+ **What to check:**
+ - [ ] Each step returns a new observation without error
+ - [ ] Terminal `ANSWER` produces a reward (even 0.0 is fine — we're testing
+   plumbing, not correctness)
+ - [ ] Screenshot the final screen — free blog content
+
+ **What you're validating:** the end-to-end action space a judge will
+ exercise. This is the judges' happy path.
+
+ **If any step errors, do not publish the blog until it's fixed.**
+
+ ---
+
+ ## 4. Fix stale deliverables doc (3 min) — HIGH
+
+ `docs/competition-deliverables.md` line 30 says:
+
+ > **Status:** Not started (no Dockerfile yet)
+
+ This is wrong. F007 demo (`specs/F007-DEMO.md`) shows a successful authenticated
+ push to `https://huggingface.co/spaces/hjerpe/sql_env` on 2026-03-29. Update to:
+
+ > **Status:** Live at https://huggingface.co/spaces/hjerpe/sql_env — manual
+ > episode flow verified 2026-04-12.
+
+ Also update the open items list at the bottom — "Deploy HuggingFace Space"
+ should be checked off.
+
+ ---
+
+ ## 5. Python client smoke test (10 min) — HIGH
+
+ **How:**
+
+ First find the actual client class name and action constructor args — our
+ client module may not match the generic OpenEnv template:
+ ```bash
+ rg -n "class SQLEnv\b|base_url" -g '!**/tests/**' -g '!**/docs/**' .
+ rg -n "class SQLAction|action_type|argument" sql_env/models.py
+ ```
+
+ Then create a throwaway script `scratch_hf_smoke.py` in the repo root:
+
+ ```python
+ from sql_env.client import SQLEnv  # adjust after grep above
+ from sql_env.models import SQLAction
+
+ URL = "https://hjerpe-sql-env.hf.space"
+
+ with SQLEnv(base_url=URL).sync() as env:
+     r = env.reset()
+     print("RESET:", r.observation)
+
+     # Pick any table name from the schema hint in r.observation
+     r = env.step(SQLAction(action_type="DESCRIBE", argument="<table>"))
+     print("DESCRIBE:", r.observation)
+
+     r = env.step(SQLAction(action_type="QUERY", argument="SELECT 1"))
+     print("QUERY:", r.observation)
+
+     r = env.step(SQLAction(action_type="ANSWER", argument="<answer>"))
+     print("ANSWER reward=", r.reward, "done=", r.done)
+ ```
+
+ Run:
+ ```bash
+ uv run python scratch_hf_smoke.py
+ ```
+
+ **What to check:**
+ - [ ] No connection / WebSocket handshake errors
+ - [ ] `r.observation` is a populated dict/string at each step
+ - [ ] Final step: `r.reward` is a float and `r.done == True`
+ - [ ] Delete `scratch_hf_smoke.py` after — do not commit it
+
+ **What you're validating:** that a blog reader copy-pasting our snippet
+ against the live Space actually gets a working client. If this fails and
+ the `/web` UI (step 3) works, the problem is likely client-side model
+ drift — check that our shipped `sql_env/models.py` matches what the server
+ inside the Space expects.
+
+ ---
+
+ ## 6. Docker image pull (10 min) — IF TIME
+
+ This is the pattern every OpenEnv env on the hub ships. It's how external
+ users run our env locally for training (no rate limits, full concurrency).
+
+ **How — option A: pull the pre-built image from HF registry**
+ ```bash
+ docker pull --platform linux/amd64 registry.hf.space/hjerpe-sql_env:latest
+ docker run -d --name sqlenv-smoke -p 8001:8000 --platform linux/amd64 \
+   registry.hf.space/hjerpe-sql_env:latest
+
+ # Wait ~5s for uvicorn to boot
+ sleep 5
+ curl -sS http://127.0.0.1:8001/health
+ open http://127.0.0.1:8001/docs
+
+ # Clean up
+ docker stop sqlenv-smoke && docker rm sqlenv-smoke
+ ```
+
+ **How — option B: rebuild locally from our repo (same image the Space runs)**
+ ```bash
+ uv run openenv validate --verbose   # dry-run config check
+ uv run openenv build -t openenv-sql-env:local
+ docker run -d --name sqlenv-local -p 8001:8000 openenv-sql-env:local
+ curl -sS http://127.0.0.1:8001/health
+ docker stop sqlenv-local && docker rm sqlenv-local
+ ```
+
+ **What to check:**
+ - [ ] Image pulls without auth (Space is public)
+ - [ ] Container starts, `/health` returns 200
+ - [ ] `/docs` renders Swagger on localhost
+ - [ ] No `--platform` warnings on Apple Silicon (the Space is `linux/amd64`,
+   which runs under Rosetta on M-series Macs — slow but functional)
+
+ **What you're validating:** the reproducibility story. A broken image here
+ means the blog's "clone and train" path is dead. Judges rarely click this,
+ but any serious user will.
+
+ ---
+
+ ## 7. pip install from Space (5 min) — IF TIME
+
+ **How:**
+
+ First check the package name declared inside the pushed Space:
+ ```bash
+ curl -sS https://huggingface.co/spaces/hjerpe/sql_env/raw/main/pyproject.toml \
+   | grep -E '^name'
+ ```
+
+ Then install it into a throwaway venv:
+ ```bash
+ uv venv /tmp/sqlenv-pip-test
+ source /tmp/sqlenv-pip-test/bin/activate
+
+ # Replace "openenv-sql-env" with whatever name the pyproject.toml above shows
+ pip install "openenv-sql-env @ git+https://huggingface.co/spaces/hjerpe/sql_env"
+
+ python -c "from sql_env.client import SQLEnv; print('OK:', SQLEnv)"
+
+ deactivate && rm -rf /tmp/sqlenv-pip-test
+ ```
+
+ **What to check:**
+ - [ ] `pip install` resolves without dependency errors
+ - [ ] The client class imports from the installed wheel
+
+ **What you're validating:** the `pyproject.toml` we pushed into the Space
+ actually declares the package correctly. This is the install method TRL
+ documents: `pip install "<pkg> @ git+https://huggingface.co/spaces/<space>"`.
+
+ ---
+
+ ## 8. Concurrency audit — POST-LAUNCH
+
+ **How:**
+ ```bash
+ rg -n "SUPPORTS_CONCURRENT_SESSIONS|max_concurrent_envs|create_app\(" sql_env/server/
+ ```
+
+ Expected result **today:** no matches (the flag is not set), which means the
+ Space defaults to **1 concurrent WebSocket session**. Per the OpenEnv↔TRL
+ docs, any training run with `num_generations > 1` against the hosted Space
+ will hit capacity errors.
+
+ **Fix (post-launch):** in `sql_env/server/app.py` (or wherever
+ `create_app(...)` is called):
+ ```python
+ SUPPORTS_CONCURRENT_SESSIONS = True
+
+ app = create_app(
+     create_sql_environment,
+     SQLAction,
+     SQLObservation,
+     max_concurrent_envs=64,  # ≥ TRL's generation_batch_size
+ )
+ ```
+ Then `uv run openenv build -t ...` and `uv run openenv push` again.
+
+ **Why it's not a launch blocker:** the blog does not ask readers to train
+ against the hosted Space. Our own training uses the in-process
+ `SQLEnvironment` via `SQLEnvTRL` (not the WebSocket client), so we never hit
+ this limit. It only matters if an external user wants to run `GRPOTrainer`
+ against `https://hjerpe-sql-env.hf.space` directly. File as a GitHub issue
+ after the blog ships.
+
+ ---
+
+ ## 9. TRL `environment_factory` wrapper — DONE
+
+ Already implemented in `training/trl_adapter.py::SQLEnvTRL` and wired into
+ `notebooks/train_grpo.ipynb` cell 16. See the TRL section at the top of this
+ document for details. No action.
+
+ ---
+
+ ## Appendix A: Republish the Space from scratch (reference)
+
+ Only run these if steps 1–3 show the Space is broken and a rebuild+push is
+ needed. Otherwise skip — the current Space is already live.
+
+ **Prereqs (one-time):**
+ ```bash
+ uv sync        # project deps
+ hf auth login  # HuggingFace CLI auth
+ # (token with write access to hjerpe/sql_env)
+ ```
+
+ **Validate + build + push:**
+ ```bash
+ # 1. Dry-run config check — confirms the openenv manifest, Dockerfile
+ #    and server entrypoint agree
+ uv run openenv validate --verbose
+
+ # 2. Build the Docker image locally (same image HF Spaces will run)
+ uv run openenv build -t openenv-sql-env:local
+
+ # 3. Optional: smoke-test the local image before pushing
+ docker run -d --name sqlenv-local -p 8001:8000 openenv-sql-env:local
+ curl -sS http://127.0.0.1:8001/health
+ docker stop sqlenv-local && docker rm sqlenv-local
+
+ # 4. Push to the Space — creates hjerpe/sql_env if it doesn't exist,
+ #    uploads files, and triggers the Space's own Docker build
+ uv run openenv push
+ # expected tail:
+ #   ✓ Authenticated as: hjerpe
+ #   ✓ Space hjerpe/sql_env is ready
+ #   ✓ Upload completed successfully
+ #   Space URL: https://huggingface.co/spaces/hjerpe/sql_env
+ ```
+
+ **After push:** the Space rebuilds its own Docker image on HF's infra (takes
+ 2–5 min). Watch the build logs in the browser at
+ `https://huggingface.co/spaces/hjerpe/sql_env` → "Logs" tab. When it turns
+ green, re-run steps 1–5 at the top of this doc to verify.
+
+ **Files that must exist for `openenv push` to work** (already in the repo):
+ - `openenv.yaml` — manifest with name, version, description
+ - `sql_env/server/Dockerfile` — FastAPI + uvicorn container
+ - `sql_env/server/app.py` — `create_app(...)` entrypoint
+ - `sql_env/models.py` — `SQLAction` / `SQLObservation` Pydantic models
+ - `pyproject.toml` — pip-installable package metadata
+ - `README.md` — Space landing page (HF renders it on the Space page)
+
+ If any of these drifts out of sync, `uv run openenv validate --verbose` will
+ flag it before you push.
+
+ ---
+
+ ## Appendix B: Research finding — dangling legacy reward module
+
+ **Finding:** `training/rewards.py` (151 lines) is legacy dead code from the
+ pre-F010 rollout-based architecture. It is not used by the production
+ training path and can be deleted post-launch.
+
+ **Evidence:**
+ - Module docstring (lines 1–5): *"Reward callables for TRL GRPO training.
+   These helpers consume **rollout metadata**..."* — this is the OLD
+   pattern where reward functions parsed `kwargs['metadata']` from TRL
+   rollouts instead of reading `env.reward` from environment instances.
+ - Internal helper `_extract_metadata_rows()` (line 41): *"TRL can pass
+   rollout metadata in different shapes depending on wrapper code."* —
+   explicit confirmation this is replay-based reward parsing.
+ - Functions exposed: `reward_correctness`, `reward_progress`,
+   `reward_operational`.
+ - **Zero production imports.** `rg 'from.*training\.rewards|training\.rewards\.reward_'`
+   returns exactly **one** hit: `tests/unit/test_rewards.py`. No script,
+   notebook, or other module in `training/` imports it.
+ - The real training path uses `sql_env_reward_func` in
+   `training/trl_adapter.py`, which reads `env.reward` directly from
+   `SQLEnvTRL` instances. This is the `environment_factory` pattern
+   mandated by F010 and documented as the correct choice (see
+   `specs/F010-IMPLEMENTATION_SPEC.md:173` and the user's own memory
+   note: *"Use environment_factory or rollout_func, not replay-based
+   reward parsing"*).
+ - Notebook `train_grpo.ipynb` cell 16:
+   `reward_funcs=[sql_env_reward_func]` — pulls from `trl_adapter`,
+   not `rewards.py`.
+
+ **The only `rollout` matches in `training/`** are harmless:
+ - `training/prompts.py:1` — docstring mentions "GRPO training rollouts"
+ - `training/rewards.py` — the legacy module itself
+ - `notebooks/train_grpo.ipynb` cell 16 — a local variable
+   `before_rollouts = sample_random_baseline(...)` that has nothing to do
+   with TRL's `rollout_func`
+
+ **Recommendation (post-launch, low priority):**
+ 1. Delete `training/rewards.py`
+ 2. Delete `tests/unit/test_rewards.py`
+ 3. Confirm `uv run pytest tests/ -v` still passes
+ 4. Commit with message: `refactor: remove legacy rollout-metadata reward
+    module superseded by F010 environment_factory`
+
+ **Why not today:** zero risk on launch (nothing imports it in production),
+ and deleting files during blog-publish day is the wrong kind of churn.
+ File as a post-launch cleanup.
+
+ ---
+
+ ## Post-launch cleanup
+
+ - [ ] Delete this file
+ - [ ] File issue for item 8 (Space concurrency)
+ - [ ] Delete `training/rewards.py` + `tests/unit/test_rewards.py` (see Appendix B)
+ - [ ] Update `docs/competition-deliverables.md` open-items list
docs/exploration/grpo-collapse-analysis.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ title: GRPO Training Collapse Analysis
+ description: Root-cause analysis of GRPO training collapse on Qwen3-1.7B caused by extra kwargs in tool calls and advantage collapse
+ doc_type: exploration
+ ---
+
+ # GRPO Training Collapse Analysis
+
+ ## What happened
+
+ After SFT warmup, GRPO training on Qwen3-1.7B collapsed within the first 30 steps. The model degenerated into passing extra `null` arguments to every tool call (`"sql": null, "table_name": "...", "value": null`), triggering `unexpected keyword argument` errors on every rollout. It never recovered across 351 steps (~8 hours on L4).
+
+ ## Timeline
+
+ | Step | Reward | What the model does |
+ |------|--------|-------------------|
+ | 10 | -1.25 | First call has extra args, gets error, loops with `Episode is over` |
+ | 20 | 0.01 | Occasionally correct describe, but passes wrong args to answer |
+ | 30 | 0.00 | Stuck: `describe(sql=null, table_name="concert")` infinite loop |
+ | 40-351 | 0.00 | Complete collapse: every rollout is identical error loops |
+
+ ## Why it collapsed
+
+ ### 1. SFT taught wrong argument patterns
+ The SFT examples show `describe(table_name=...)` correctly, but the base Qwen3-1.7B model has a strong prior from pretraining to include all available parameter names in every call. The 353-turn SFT warmup (2 epochs, batch=2) wasn't enough to override this for all 4 tools.
+
+ ### 2. Extra kwargs cause hard failures, not soft degradation
+ When the model passes `describe(sql=null, table_name="flights")`, TRL dispatches `SQLEnvTRL.describe(sql=None, table_name="flights")`, which raises `TypeError: unexpected keyword argument 'sql'`. This is a **hard wall** — the model gets zero useful information back, just an error string it can't learn from.
+
+ ### 3. GRPO advantage collapse
+ With 6 generations per question:
+ - All 6 rollouts pass the same extra args → all get reward 0.0
+ - Advantage = 0.0 for every sample → zero gradient signal
+ - The model has no way to discover that dropping the extra args would work
+ - Loss oscillates near 0 throughout training
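The collapse is plain arithmetic. GRPO scores each rollout relative to its group, so a group of identical rewards yields zero advantage everywhere. A minimal illustration of the group-relative computation (not TRL's internal code):

```python
def group_advantages(rewards):
    """Group-relative advantages in the GRPO style: reward minus group
    mean, divided by group std (epsilon-guarded against zero std)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


# All 6 generations hit the same TypeError wall -> identical rewards
print(group_advantages([0.0] * 6))  # every advantage is 0.0 -> no gradient

# A single rollout escaping the error loop restores a learning signal:
# it gets a positive advantage, the other five get negative ones
print(group_advantages([0.0] * 5 + [1.0]))
```

This is why the fix below matters: tolerance to extra kwargs only needs to let *one* rollout per group succeed for the gradient signal to come back.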
+
+ ### 4. No recovery mechanism
+ Once the model enters the error loop:
+ - Error messages say "unexpected keyword argument 'sql'" but don't say "try calling with only table_name"
+ - The model retries the same call pattern endlessly
+ - Post-episode penalty accumulates negative reward (-1.25 at step 10) but doesn't help because ALL rollouts are equally bad
+ - No positive examples exist in any rollout group to provide advantage signal
+
+ ## The core problem: kwargs rejection vs. kwargs tolerance
+
+ The TRL adapter methods have strict signatures:
+ ```python
+ def describe(self, table_name: str) -> str:
+ def query(self, sql: str) -> str:
+ def answer(self, value: str) -> str:
+ ```
+
+ When the model generates `{"table_name": "flights", "sql": null}`, Python raises TypeError before the method body executes. The model never gets a schema response, so it has no path to success.
+
+ ## Fix: Accept and ignore extra kwargs
+
+ The simplest fix is to make the tool methods tolerant of extra arguments:
+
+ ```python
+ def describe(self, table_name: str, **kwargs) -> str:
+ def query(self, sql: str, **kwargs) -> str:
+ def answer(self, value: str, **kwargs) -> str:
+ def sample(self, table_name: str, **kwargs) -> str:
+ ```
+
+ This means `describe(sql=null, table_name="flights")` would work — it would ignore `sql` and return the schema. The model gets useful feedback, can write SQL, and has a path to positive reward. GRPO then has signal to learn that the extra args are unnecessary.
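The difference is easy to demonstrate in isolation, with toy functions standing in for the adapter methods:

```python
def describe_strict(table_name: str) -> str:
    # Mirrors the current adapter signature: extra kwargs raise TypeError.
    return f"schema of {table_name}"


def describe_tolerant(table_name: str, **kwargs) -> str:
    # Extra arguments the model hallucinated are accepted and ignored.
    return f"schema of {table_name}"


# What the collapsed model actually emits: {"sql": null, "table_name": "flights"}
bad_call = {"sql": None, "table_name": "flights"}

try:
    describe_strict(**bad_call)
except TypeError as e:
    print("strict:", e)  # unexpected keyword argument 'sql' -> dead end

print("tolerant:", describe_tolerant(**bad_call))  # schema comes back
```

With the strict signature the exception fires during argument binding, so no error handling inside the method body can help; `**kwargs` is the only change that turns the dead end into feedback.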
67
+
68
+ **Why this is the right approach:**
69
+ - Small models (1.7B) lack the capacity to perfectly learn function signatures from tool definitions alone
70
+ - The tool definitions in `<tools>` XML clearly state which params are required — the model will converge toward correct signatures over time via reward signal
71
+ - Strict rejection creates an unrecoverable dead end; tolerance creates a learning gradient
72
+ - This matches how real APIs work — most accept and ignore unexpected fields
73
+
74
+ ## Other contributing factors
75
+
76
+ ### SFT quality issues
77
+ - SFT was only 100 questions x ~3.5 turns = 347 examples
78
+ - Only 2 epochs at batch=2 (total 347 steps)
79
+ - The model learned tool-call format but not strict argument isolation
80
+ - Need: more SFT data or more epochs on existing data
81
+
82
+ ### Missing KL penalty
83
+ - No KL divergence penalty against the SFT reference model
84
+ - GRPO updated the policy freely, drifting away from the SFT distribution
85
+ - A KL penalty (beta=0.01-0.05) would have anchored the model near the working SFT baseline
86
+
87
+ ### Learning rate may be too high
88
+ - Default TRL learning rate (5e-7 or 1e-6) may be too aggressive for 1.7B
89
+ - Lower LR (1e-7) would make smaller updates, reducing drift risk
90
+
91
+ ## Recommended fixes (priority order)
92
+
93
+ ### 1. Add `**kwargs` to all tool methods (critical)
94
+ Prevents the hard wall. Model can still learn correct signatures from reward signal.
95
+
96
+ ### 2. Increase SFT warmup
97
+ - 4 epochs instead of 2
98
+ - Or increase SFT data from 100 to 200 questions
99
+ - Verify post-SFT that the model generates correct single-arg calls
100
+
101
+ ### 3. Add KL penalty
102
+ ```python
103
+ GRPOConfig(
104
+ ...,
105
+ beta=0.04, # KL penalty against SFT reference
106
+ )
107
+ ```
108
+ Prevents policy from drifting too far from the working SFT baseline.
109
+
110
+ ### 4. Lower GRPO learning rate
111
+ From default to 1e-7 or 5e-8.
112
+
113
+ ## Verification checklist
114
+
115
+ Before running GRPO again:
116
+ - [ ] Post-SFT format check shows `describe(table_name="X")` with NO extra args
117
+ - [ ] Tool methods accept `**kwargs` so extra args don't crash
118
+ - [ ] First 10 GRPO steps show at least some reward > 0
119
+ - [ ] Reward doesn't flatline at 0.0 by step 30
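The first checklist item can be automated with a small argument-strictness check over sampled generations. A hypothetical helper, assuming tool calls are emitted as JSON objects with `name` and `arguments` keys (adapt to the actual post-SFT sampling format):

```python
import json

# Expected signature per tool, mirroring the adapter's method definitions
ALLOWED_ARGS = {
    "describe": {"table_name"},
    "sample": {"table_name"},
    "query": {"sql"},
    "answer": {"value"},
}


def extra_args(tool_call_json: str) -> set:
    """Return the set of hallucinated argument names in one tool call."""
    call = json.loads(tool_call_json)
    return set(call["arguments"]) - ALLOWED_ARGS[call["name"]]


# A clean post-SFT call vs. the collapse-mode call from the timeline above
good = '{"name": "describe", "arguments": {"table_name": "concert"}}'
bad = '{"name": "describe", "arguments": {"sql": null, "table_name": "concert"}}'

print(extra_args(good))  # set() -> passes the checklist item
print(extra_args(bad))   # {'sql'} -> model is still hallucinating extra args
```

Running this over a few hundred sampled calls and requiring an empty result for all of them gives a concrete pass/fail gate before committing GPU hours to another GRPO run.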