# SQLEnv: Interactive Database Query Environment - Project Brief
## Executive Summary
**What we're building**: An RL environment where AI agents learn to answer natural language questions about databases through iterative exploration - not by generating SQL in one shot, but by actively probing schemas, running exploratory queries, observing results, and refining their approach.
**Why it matters**: Current text-to-SQL benchmarks (Spider, BIRD) test one-shot SQL generation. Real data analysts don't work this way; they explore databases iteratively. We're creating the first RL environment that captures this multi-turn exploration process.
**The challenge**: Make a system where agents get meaningful feedback during exploration (not just at the end), learn query strategies through trial and error, and develop the kind of investigative reasoning humans use with unfamiliar databases.
---
## Team Onboarding: What You Need to Know
### 1. Project Structure (5 Core Pieces)
```
📦 SQLEnv Project
│
├── 🏗️ Environment (OpenEnv integration)
│   ├── Action/Observation types (what agents can do/see)
│   ├── Episode lifecycle (reset → explore → answer → done)
│   └── Database sandboxing (safe SQL execution)
│
├── 🎯 Reward System (3-layer architecture)
│   ├── Layer 1: Operational signals (query executed, new info discovered)
│   ├── Layer 2: Progress signals (getting closer to correct answer)
│   └── Layer 3: Terminal reward (correct answer = win)
│
├── 📊 Dataset (Spider multi-hop questions)
│   ├── Question curation (50-100 questions, easy → hard)
│   ├── Database files (SQLite with real schemas)
│   └── Answer verification (multi-type: int, float, string, list, table)
│
├── 🤖 Training Pipeline (GRPO via TRL)
│   ├── System prompts (teach agent SQL exploration strategy)
│   ├── Rollout function (agent plays episodes against environment)
│   └── Model training (small LLM learns to query databases)
│
└── 📤 Submission Artifacts
    ├── HuggingFace Space (live environment people can use)
    ├── GitHub repo (code + training notebooks + results)
    └── Blog post (story + before/after demo + learning curves)
```
### 2. The Multi-Hop Insight (Why This Works for RL)
Spider questions are **already multi-hop** - not just because they require SQL JOINs, but because an agent starting with zero knowledge must:
**Example: "How many departments are managed by someone born in Alabama?"**
| Hop | Agent Action | What It Learns |
|-----|--------------|----------------|
| 1 | `DESCRIBE head` | Discovers `born_state` column exists in `head` table |
| 2 | `QUERY: SELECT head_id FROM head WHERE born_state = 'Alabama'` | Finds specific head IDs: [5, 12, 23] |
| 3 | `DESCRIBE management` | Discovers `head_id` and `department_id` columns |
| 4 | `QUERY: SELECT department_id FROM management WHERE head_id IN (5,12,23)` | Gets department IDs: [2, 7] |
| 5 | `QUERY: SELECT COUNT(DISTINCT department_id) FROM management JOIN head ...` | Refines into proper JOIN query |
| 6 | `ANSWER: 2` | Submits final answer |
**Key point**: The schema is hidden initially. Agent only sees table names at start. Must actively probe to discover:
- What columns exist (can't guess `born_state` is in `head` table)
- How tables relate (must discover `management` links `department` ↔ `head`)
- What data looks like (needs to sample to understand values)
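The hidden-schema mechanics above can be sketched at the SQLite level. This is a minimal illustration with hypothetical helper names: at reset the agent would receive only table names (via `sqlite_master`), and a `DESCRIBE` action could be served by `PRAGMA table_info`:

```python
import sqlite3

def initial_observation(conn):
    # At reset the agent sees table names only -- no columns, no relationships
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return sorted(r[0] for r in rows)

def describe(conn, table):
    # A DESCRIBE action can be backed by PRAGMA table_info, which yields
    # one row per column: (cid, name, type, notnull, dflt_value, pk)
    return [(c[1], c[2]) for c in conn.execute(f"PRAGMA table_info({table})")]
```

So the agent literally cannot know `born_state` lives in `head` until it spends a step on `DESCRIBE head`.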
**Why Spider specifically**:
- 71% of queries require JOINs across multiple tables
- 17% require nested subqueries (decomposition needed)
- Initially hidden schema forces exploration
- Multiple difficulty levels (easy: 1-2 tables → hard: 5+ tables with subqueries)
- Gold SQL available for reward computation
- 200 databases across 138 domains = generalization testing
---
## 3. Critical Design Decisions (For Team Discussion)
### Decision 1: Reward Architecture
**The Challenge**: If we only reward the final answer (terminal reward), the agent has no signal during exploration; it's like playing chess but only learning "win/lose" without feedback on good/bad moves.
**Our Solution**: Three reward layers that sum together:
```
Episode Reward = Terminal Correctness (0 or 1.0) + Exploration Progress (max 0.5)
```
**Layer 1 - Operational** (teaches "how to use the tool"):
- Small positive: query runs without error (+0.02)
- Small positive: discovered new table/column info (+0.01, capped)
- Small negative: repeated exact same query (-0.01)
- Tiny cost: each step (-0.005, keeps episodes short)
**Layer 2 - Progress** (teaches "am I getting closer?"):
- After each query, compute: how close is result to gold answer?
- Only reward *improvement* over best-so-far
- Coarsen to 5 levels {0, 0.25, 0.5, 0.75, 1.0} to prevent "reward hacking"
**Layer 3 - Terminal** (teaches "what's the goal?"):
- Correct answer: +1.0
- Incorrect or timeout: 0.0
**Why terminal must dominate**: With exploration capped at 0.5, a correct answer is always worth 2x the maximum exploration reward. This prevents agents from "gaming" the system by just exploring without ever answering.
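The three layers might combine as follows. This is a sketch using the brief's current constants; the step-record shape and function names are assumptions, not a fixed API:

```python
def coarsen(p):
    # Snap progress to {0, 0.25, 0.5, 0.75, 1.0} to blunt reward hacking
    return round(p * 4) / 4.0

def episode_reward(steps, progress_scores, correct):
    """Combine the three layers; constants are the brief's current values."""
    op, seen, discovery = 0.0, set(), 0.0
    for s in steps:                        # Layer 1: operational signals
        op -= 0.005                        # per-step cost keeps episodes short
        if s["executed_ok"]:
            op += 0.02                     # query ran without error
        if s["sql"] in seen:
            op -= 0.01                     # exact-repeat penalty
        seen.add(s["sql"])
        if s.get("new_info") and discovery < 0.1:
            discovery += 0.01              # capped discovery bonus
    best, prog = 0.0, 0.0
    for p in (coarsen(x) for x in progress_scores):
        if p > best:                       # Layer 2: reward improvement only
            prog += p - best
            best = p
    # Cap exploration at 0.5 so a correct answer always dominates 2:1
    exploration = min(0.5, op + discovery + prog)
    return (1.0 if correct else 0.0) + exploration   # Layer 3: terminal
```

Rewarding only improvement over best-so-far means re-running a good query earns nothing, which closes the most obvious farming loop.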
**Open Question for Team**: Should we expose these as separate reward components to the trainer (TRL supports multi-reward), or sum them into a single scalar? Trade-off: separate = more control for research, single = simpler to tune.
### Decision 2: Action Space (What Can Agents Do?)
| Action | Purpose | Example |
|--------|---------|---------|
| `DESCRIBE table_name` | Learn schema | `DESCRIBE employees` → see column names, types |
| `SAMPLE table_name` | See example data | `SAMPLE orders` → get 5 random rows |
| `QUERY sql_string` | Test hypothesis | `QUERY: SELECT COUNT(*) FROM orders WHERE status='shipped'` |
| `ANSWER value` | Submit final answer | `ANSWER: 42` |
**Key constraints**:
- Budget: 15 steps per episode (configurable)
- Sandboxing: Read-only SQLite, 5 second timeout, SELECT-only
- Truncation: Query results limited to 20 rows (agent sees summary, not full data)
**Design Philosophy**: This is intentionally a POMDP (Partially Observable). The agent doesn't see the full database, just as a real analyst wouldn't. This forces strategic exploration.
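Parsing the four actions out of raw agent text could look like the sketch below. The `QUERY:`/`ANSWER:` prefix format follows the examples in the table; the regex and function name are assumptions:

```python
import re

# DESCRIBE/SAMPLE take a bare table name; QUERY/ANSWER take a free-form payload
ACTION_RE = re.compile(
    r"^(DESCRIBE|SAMPLE)\s+(\w+)$|^(QUERY|ANSWER):\s*(.+)$", re.DOTALL)

def parse_action(text):
    m = ACTION_RE.match(text.strip())
    if not m:
        return ("INVALID", text)          # unparseable -> small penalty, no-op
    if m.group(1):
        return (m.group(1), m.group(2))   # DESCRIBE/SAMPLE table_name
    return (m.group(3), m.group(4))       # QUERY/ANSWER payload
```

Treating unparseable output as an explicit `INVALID` action (rather than crashing) keeps episodes alive while the policy is still learning the format.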
**Open Question for Team**: Should we add a "HINT" action that reveals join paths after N failed queries? Helps curriculum learning but might be a crutch.
### Decision 3: Dataset Curation
**MVP Scope**: 50-100 questions from Spider dev set
**Selection Criteria**:
- **Easy** (40%): Single table, basic WHERE/COUNT
- *"How many orders were placed in January?"*
- **Medium** (40%): 2-3 tables, JOINs, GROUP BY
- *"What's the total revenue per customer?"*
- **Hard** (20%): 4+ tables, subqueries, aggregates
- *"Find customers who ordered from every product category"*
**Answer Type Support**:
- Integer (counts, IDs) → exact match
- Float (averages, sums) → 1% tolerance
- String (names, statuses) → case-insensitive normalized
- List (top-k results) → order-insensitive set comparison
- Table (multi-column results) → column + row overlap metrics
**Open Question for Team**: Should we hand-pick questions or sample randomly by difficulty? Hand-picking ensures diversity, random ensures unbiased evaluation.
---
## 4. Development Roadmap (Phased Approach)
### Phase 1: Scaffold (Days 1-2)
**Goal**: Get OpenEnv environment running in Docker
**Key Milestone**: `openenv validate` passes; the environment can reset, step, and return observations
**What Team Members Build**:
- Pydantic models for actions/observations
- Stub environment (hardcoded responses)
- Client wrapper for WebSocket communication
**Done When**: Someone on another machine can connect and run a test episode
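The Pydantic models for Phase 1 might look like the sketch below; the class and field names are assumptions for illustration:

```python
from typing import Literal
from pydantic import BaseModel

class SQLAction(BaseModel):
    # One of the four verbs from the action-space table
    kind: Literal["DESCRIBE", "SAMPLE", "QUERY", "ANSWER"]
    payload: str                  # table name, SQL string, or answer value

class SQLObservation(BaseModel):
    text: str                     # rendered result shown to the agent
    step: int
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```

Pydantic validation means a malformed action from the client fails loudly at the boundary instead of deep inside an action handler.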
---
### Phase 2: Core Loop (Days 3-5)
**Goal**: Wire up real databases and terminal reward
**Key Milestone**: Can play a full episode manually: describe tables, run queries, submit answer, get correct/incorrect
**What Team Members Build**:
- SQLite database loader
- Action handlers (DESCRIBE/SAMPLE/QUERY/ANSWER)
- Answer verification logic
- Budget tracking and episode termination
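For the sandboxing constraints (read-only, SELECT-only, 5-second timeout, 20-row truncation), one possible SQLite-level implementation is sketched below; `open_readonly` and `run_sandboxed` are hypothetical names:

```python
import sqlite3
import time

def open_readonly(db_path):
    # URI mode=ro enforces read-only at the SQLite level, not just in our code
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

def run_sandboxed(conn, sql, timeout_s=5.0, max_rows=20):
    if not sql.lstrip().lower().startswith("select"):
        return {"error": "only SELECT statements are allowed"}
    deadline = time.monotonic() + timeout_s
    # Progress handler fires every N VM instructions; a nonzero return aborts
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 10_000)
    try:
        cur = conn.execute(sql)
        rows = cur.fetchmany(max_rows)        # truncate to 20 rows
        return {"columns": [d[0] for d in cur.description],
                "rows": rows,
                "maybe_truncated": len(rows) == max_rows}
    except sqlite3.OperationalError as e:
        return {"error": str(e)}              # includes timeout interrupts
    finally:
        conn.set_progress_handler(None, 0)
```

Returning errors as observations (rather than raising) matters for RL: a failed query is a learning signal, not a crash.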
**Done When**: A hardcoded "cheat" policy that knows the answer can achieve 100% success rate
---
### Phase 3: Dense Reward (Days 6-8)
**Goal**: Add exploration signals so agents learn before final answer
**Key Milestone**: Reward varies meaningfully between random exploration vs. targeted queries
**What Team Members Build**:
- Progress computation (compare query results to gold)
- Operational signals (new info, repeats, execution success)
- Reward combination logic (3 layers → single episode reward)
- Anti-gaming measures (caps, coarsening, repeat detection)
**Done When**: Can observe reward curves showing agents making progress during episodes (not just 0/1 at the end)
---
### Phase 4: Training Pipeline (Days 9-13)
**Goal**: Train a small model and show it improves over baseline
**Key Milestone**: Trained model beats random policy on success rate
**What Team Members Build**:
- System prompt (teach agent SQL exploration strategy)
- Rollout function (episode player for training)
- TRL/GRPO integration (reward functions, training config)
- Evaluation wrapper (Green Agent)
- Comparison notebook (before/after metrics)
**Done When**: Have a chart showing learning curve + success rate comparison + example episodes
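A terminal-correctness reward function in the shape TRL's `GRPOTrainer` expects (`f(completions, **kwargs) -> list[float]`, with extra dataset columns forwarded as keyword arguments) might look like this; the `gold_answer` column name is an assumption:

```python
def correctness_reward(completions, gold_answer=None, **kwargs):
    """Terminal reward: 1.0 if the episode's submitted ANSWER matches gold."""
    rewards = []
    for text, gold in zip(completions, gold_answer):
        submitted = None
        for line in text.splitlines():
            if line.strip().startswith("ANSWER:"):
                # Keep the last ANSWER in the transcript
                submitted = line.split(":", 1)[1].strip()
        rewards.append(1.0 if submitted == str(gold) else 0.0)
    return rewards
```

Hypothetical wiring: `GRPOTrainer(model=..., reward_funcs=[correctness_reward], train_dataset=...)`. TRL accepts a list of reward functions, which is one way to keep the three layers as separate components per the open question in Decision 1.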
---
### Phase 5: Polish & Submit (Days 14-16)
**Goal**: Package everything for submission
**Key Milestone**: All artifacts live and accessible to judges
**What Team Members Build**:
- HuggingFace Space deployment
- GitHub repo cleanup (README, setup instructions)
- Blog post (narrative + results + demos)
- Side-by-side video/screenshots (untrained vs trained agent)
**Done When**: Someone outside the team can read the blog, visit the HF Space, run the training notebook, and reproduce results
---
## 5. Ownership Areas (Team Can Claim)
These are natural pieces where team members can take full ownership:
### 🔧 Environment Engineering
- **Owner**: ???
- **Scope**: OpenEnv integration, WebSocket server, action handlers, episode lifecycle
- **Key Decisions**: Error handling, timeout strategy, result truncation format
- **Success Metric**: Environment runs stably for 1000+ episodes without crashes
### 🎯 Reward Designer
- **Owner**: ???
- **Scope**: 3-layer reward computation, progress metrics, anti-gaming measures
- **Key Decisions**: Reward weights, coarsening granularity, when to give partial credit
- **Success Metric**: Dense reward enables training convergence (vs. terminal-only baseline)
### 📊 Dataset Curator
- **Owner**: ???
- **Scope**: Question selection, answer verification, database preparation
- **Key Decisions**: Difficulty distribution, answer type coverage, edge case handling
- **Success Metric**: Curated question set has clear difficulty progression and diverse answer types
### 🤖 Training Engineer
- **Owner**: ???
- **Scope**: GRPO setup, system prompts, hyperparameter tuning, evaluation
- **Key Decisions**: Model size, rollout strategy, prompt engineering, training duration
- **Success Metric**: Trained model shows measurable improvement on held-out test set
### 📝 Storyteller
- **Owner**: ???
- **Scope**: Blog post, documentation, demo videos, results visualization
- **Key Decisions**: Narrative framing, which results to highlight, demo selection
- **Success Metric**: Blog post tells compelling story even if results are modest
---
## 6. Open Research Questions (Team Input Needed)
### Question 1: Reward Weights
What's the right balance between exploration and exploitation?
**Current Plan**: Terminal = 1.0, Exploration max = 0.5 (2:1 ratio)
**Alternatives**:
- More aggressive: Terminal = 1.0, Exploration max = 0.2 (5:1) → forces focus on correctness
- More exploratory: Terminal = 1.0, Exploration max = 0.8 (1.25:1) → encourages discovery
**How to Decide**: Run ablation study during Phase 4 training
### Question 2: Schema Revelation Strategy
How much schema info should the agent see initially?
**Current Plan**: Only table names at reset, must DESCRIBE to learn columns
**Alternatives**:
- Give column names but not types/constraints
- Give full schema but hide foreign key relationships
- Progressive revelation: unlock new info after N steps
**How to Decide**: Pilot study with human participants; see what feels natural
### Question 3: Episode Budget
How many steps should agents get?
**Current Plan**: 15 steps (enough for 2-3 explores + 3-5 queries + answer)
**Trade-offs**:
- Lower (10 steps): Forces efficiency, faster training, but may not allow enough exploration
- Higher (25 steps): More thorough exploration, but longer episodes = slower training
**How to Decide**: Analyze human trajectories on similar tasks (if data available) or start conservative (15) and increase if needed
### Question 4: Multi-Database Verification (Post-MVP)
Should we verify answers across database variants?
**The Problem**: Agent might return correct answer for wrong reason (e.g., hard-codes an ID that happens to be right in the test DB)
**Current Plan**: Single DB in MVP, add variants in Phase 6+ if time allows
**Open Question**: Is this worth the complexity? Or should we rely on question diversity to catch brittleness?
---
## 7. Success Criteria (How We Know We're Done)
### Minimum Viable Submission
- ✅ Environment runs stably on HuggingFace Space
- ✅ Can complete full episodes via WebSocket
- ✅ Has dense reward (not just terminal)
- ✅ Trained model shows *some* improvement over random baseline
- ✅ Blog post + GitHub repo + demo exist
### Stretch Goals
- 🎯 Trained model achieves >30% success rate on medium-difficulty questions
- 🎯 Clear learning curves showing reward improvement over training
- 🎯 Ablation study showing dense reward > terminal-only
- 🎯 Multi-database verification for top-performing models
- 🎯 Published as a reusable benchmark for future RL research
---
## 8. Risk Mitigation
| Risk | Mitigation Strategy |
|------|---------------------|
| **Training doesn't converge** | Start with easy questions only; use very small model (1.7B); aggressive reward shaping |
| **Reward hacking/gaming** | Coarsen progress signals; cap exploration rewards; monitor training for degenerate behaviors |
| **Technical debt in environment** | Extensive logging; unit tests for action handlers; validation suite before training starts |
| **Timeline pressure** | Each phase has a "minimum viable" deliverable; can submit after Phase 4 if needed |
| **Dataset too easy/hard** | Pilot with manual testing; curate incrementally; be ready to swap questions mid-training |
---
## 9. What Makes This Compelling (For Blog/Submission)
### The Hook
*"AI can generate SQL in one shot. But can it learn to explore databases like a data analyst?"*
### The Technical Innovation
- First RL environment for interactive SQL (not just static benchmarks)
- 3-layer reward architecture (operational + progress + terminal)
- Multi-hop exploration required (schema hidden initially)
- OpenEnv integration (reusable, WebSocket-based, Docker-isolated)
### The Story
Show side-by-side comparison:
- **Untrained agent**: Random flailing, rarely succeeds
- **Trained agent**: Strategic exploration, builds queries incrementally, much higher success rate
Even modest improvement is a win; the *trajectory* of exploration is as interesting as the final success rate.
### The Impact
- Makes Spider (10K questions) usable for RL training
- Creates methodology for future interactive SQL agents
- Shows how to shape rewards for complex reasoning tasks
- Demonstrates OpenEnv's flexibility beyond games
---
## Appendix: Key References
- **Spider Dataset**: https://huggingface.co/datasets/xlangai/spider (questions + SQL)
- **Spider Schema**: https://huggingface.co/datasets/richardr1126/spider-schema (table structures)
- **OpenEnv Framework**: https://github.com/openenv-org/openenv
- **TRL GRPO Tutorial**: Wordle environment example
- **Research**: SQL-TRAIL (2026), PaVeRL-SQL (2025), QueryGym (2025)
---
## Next Steps
1. **Team Meeting**: Review this brief, claim ownership areas, discuss open questions
2. **Kickoff**: Start Phase 1 (whoever owns Environment Engineering)
3. **Daily Standups**: 15-minute sync: what's done, what's blocked, what's next
4. **Phase Gates**: Don't move to next phase until "Done When" criteria met
5. **Mid-Project Check**: After Phase 3, assess if we're on track for submission
**Questions? Concerns? Better ideas?** → Discuss in the team channel. This is a living document; we'll update it as we learn.