# SQLEnv: Interactive Database Query Environment — Project Brief

## Executive Summary

**What we're building**: An RL environment where AI agents learn to answer natural language questions about databases through iterative exploration—not by generating SQL in one shot, but by actively probing schemas, running exploratory queries, observing results, and refining their approach.

**Why it matters**: Current text-to-SQL benchmarks (Spider, BIRD) test one-shot SQL generation. Real data analysts don't work this way—they explore databases iteratively. We're creating the first RL environment that captures this multi-turn exploration process.

**The challenge**: Make a system where agents get meaningful feedback during exploration (not just at the end), learn query strategies through trial and error, and develop the kind of investigative reasoning humans use with unfamiliar databases.

---

## Team Onboarding: What You Need to Know

### 1. Project Structure (5 Core Pieces)

```
πŸ“¦ SQLEnv Project
β”‚
β”œβ”€β”€ πŸ—οΈ Environment (OpenEnv integration)
β”‚   β”œβ”€β”€ Action/Observation types (what agents can do/see)
β”‚   β”œβ”€β”€ Episode lifecycle (reset β†’ explore β†’ answer β†’ done)
β”‚   └── Database sandboxing (safe SQL execution)
β”‚
β”œβ”€β”€ 🎯 Reward System (3-layer architecture)
β”‚   β”œβ”€β”€ Layer 1: Operational signals (query executed, new info discovered)
β”‚   β”œβ”€β”€ Layer 2: Progress signals (getting closer to correct answer)
β”‚   └── Layer 3: Terminal reward (correct answer = win)
β”‚
β”œβ”€β”€ πŸ“Š Dataset (Spider multi-hop questions)
β”‚   β”œβ”€β”€ Question curation (50-100 questions, easy β†’ hard)
β”‚   β”œβ”€β”€ Database files (SQLite with real schemas)
β”‚   └── Answer verification (multi-type: int, float, string, list, table)
β”‚
β”œβ”€β”€ πŸ€– Training Pipeline (GRPO via TRL)
β”‚   β”œβ”€β”€ System prompts (teach agent SQL exploration strategy)
β”‚   β”œβ”€β”€ Rollout function (agent plays episodes against environment)
β”‚   └── Model training (small LLM learns to query databases)
β”‚
└── πŸ“ Submission Artifacts
    β”œβ”€β”€ HuggingFace Space (live environment people can use)
    β”œβ”€β”€ GitHub repo (code + training notebooks + results)
    └── Blog post (story + before/after demo + learning curves)
```

### 2. The Multi-Hop Insight (Why This Works for RL)

Spider questions are **already multi-hop**—not just because they require SQL JOINs, but because an agent starting with zero knowledge must build up its answer across several steps:

**Example: "How many departments are managed by someone born in Alabama?"**

| Hop | Agent Action | What It Learns |
|-----|--------------|----------------|
| 1 | `DESCRIBE head` | Discovers `born_state` column exists in `head` table |
| 2 | `QUERY: SELECT head_id FROM head WHERE born_state = 'Alabama'` | Finds specific head IDs: [5, 12, 23] |
| 3 | `DESCRIBE management` | Discovers `head_id` and `department_id` columns |
| 4 | `QUERY: SELECT department_id FROM management WHERE head_id IN (5,12,23)` | Gets department IDs: [2, 7] |
| 5 | `QUERY: SELECT COUNT(DISTINCT department_id) FROM management JOIN head ...` | Refines into proper JOIN query |
| 6 | `ANSWER: 2` | Submits final answer |

**Key point**: The schema is hidden initially. The agent sees only table names at the start and must actively probe to discover:

- What columns exist (it can't guess that `born_state` lives in the `head` table)
- How tables relate (it must discover that `management` links `department` ↔ `head`)
- What the data looks like (it needs to sample rows to understand values)

**Why Spider specifically**:

- 71% of queries require JOINs across multiple tables
- 17% require nested subqueries (decomposition needed)
- The initially hidden schema forces exploration
- Multiple difficulty levels (easy: 1-2 tables β†’ hard: 5+ tables with subqueries)
- Gold SQL available for reward computation
- 200 databases across 138 domains = generalization testing

---

## 3. Critical Design Decisions (For Team Discussion)

### Decision 1: Reward Architecture

**The Challenge**: If we only reward the final answer (terminal reward), the agent has no signal during exploration—it's like playing chess but only learning "win/lose" without feedback on good/bad moves.

**Our Solution**: Three reward layers that sum together:

```
Episode Reward = Terminal Correctness (0 or 1.0) + Exploration Progress (max 0.5)
```

**Layer 1 - Operational** (teaches "how to use the tool"):

- Small positive: query runs without error (+0.02)
- Small positive: discovered new table/column info (+0.01, capped)
- Small negative: repeated exact same query (-0.01)
- Tiny cost: each step (-0.005, keeps episodes short)

**Layer 2 - Progress** (teaches "am I getting closer?"):

- After each query, compute how close the result is to the gold answer
- Only reward *improvement* over the best-so-far
- Coarsen to 5 levels {0, 0.25, 0.5, 0.75, 1.0} to prevent reward hacking

**Layer 3 - Terminal** (teaches "what's the goal?"):

- Correct answer: +1.0
- Incorrect or timeout: 0.0

**Why terminal must dominate**: With exploration capped at 0.5, a correct answer is always worth 2x the maximum exploration reward. This prevents agents from gaming the system by exploring endlessly without ever answering.

**Open Question for Team**: Should we expose these as separate reward components to the trainer (TRL supports multiple reward functions), or sum them into a single scalar? Trade-off: separate = more control for research, single = simpler to tune.

### Decision 2: Action Space (What Can Agents Do?)
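Phase 1 calls for Pydantic models for actions, but the shape of the action space can be sketched without any dependencies. Here is a stdlib-only dataclass sketch of the four actions—the class name and field names are illustrative, not decided:

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class SQLAction:
    """One agent action per step; `kind` selects which handler runs."""
    kind: Literal["DESCRIBE", "SAMPLE", "QUERY", "ANSWER"]
    table: Optional[str] = None   # target table for DESCRIBE / SAMPLE
    sql: Optional[str] = None     # SQL string for QUERY
    value: Optional[str] = None   # final answer for ANSWER

# Example: the agent probes a table, then submits an answer
probe = SQLAction(kind="DESCRIBE", table="head")
final = SQLAction(kind="ANSWER", value="2")
```

A single tagged type like this keeps the environment's step handler a simple dispatch on `kind`; the real implementation would add validation (e.g., `QUERY` requires `sql`).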
| Action | Purpose | Example |
|--------|---------|---------|
| `DESCRIBE table_name` | Learn schema | `DESCRIBE employees` β†’ see column names, types |
| `SAMPLE table_name` | See example data | `SAMPLE orders` β†’ get 5 random rows |
| `QUERY sql_string` | Test hypothesis | `QUERY: SELECT COUNT(*) FROM orders WHERE status='shipped'` |
| `ANSWER value` | Submit final answer | `ANSWER: 42` |

**Key constraints**:

- Budget: 15 steps per episode (configurable)
- Sandboxing: read-only SQLite, 5-second timeout, SELECT-only
- Truncation: query results limited to 20 rows (the agent sees a summary, not the full data)

**Design Philosophy**: This is intentionally a POMDP (partially observable Markov decision process). The agent doesn't see the full database—just as a real analyst wouldn't. This forces strategic exploration.

**Open Question for Team**: Should we add a "HINT" action that reveals join paths after N failed queries? It would help curriculum learning but might become a crutch.

### Decision 3: Dataset Curation

**MVP Scope**: 50-100 questions from the Spider dev set

**Selection Criteria**:

- **Easy** (40%): Single table, basic WHERE/COUNT
  - *"How many orders were placed in January?"*
- **Medium** (40%): 2-3 tables, JOINs, GROUP BY
  - *"What's the total revenue per customer?"*
- **Hard** (20%): 4+ tables, subqueries, aggregates
  - *"Find customers who ordered from every product category"*

**Answer Type Support**:

- Integer (counts, IDs) β†’ exact match
- Float (averages, sums) β†’ 1% tolerance
- String (names, statuses) β†’ case-insensitive, normalized
- List (top-k results) β†’ order-insensitive set comparison
- Table (multi-column results) β†’ column + row overlap metrics

**Open Question for Team**: Should we hand-pick questions or sample randomly by difficulty? Hand-picking ensures diversity; random sampling ensures unbiased evaluation.

---

## 4. Development Roadmap (Phased Approach)

### Phase 1: Scaffold (Days 1-2)

**Goal**: Get the OpenEnv environment running in Docker

**Key Milestone**: `openenv validate` passes—can reset, step, and receive observations

**What Team Members Build**:

- Pydantic models for actions/observations
- Stub environment (hardcoded responses)
- Client wrapper for WebSocket communication

**Done When**: Someone on another machine can connect and run a test episode

---

### Phase 2: Core Loop (Days 3-5)

**Goal**: Wire up real databases and the terminal reward

**Key Milestone**: Can play a full episode manually—describe tables, run queries, submit an answer, get correct/incorrect

**What Team Members Build**:

- SQLite database loader
- Action handlers (DESCRIBE/SAMPLE/QUERY/ANSWER)
- Answer verification logic
- Budget tracking and episode termination

**Done When**: A hardcoded "cheat" policy that knows the answer can achieve a 100% success rate

---

### Phase 3: Dense Reward (Days 6-8)

**Goal**: Add exploration signals so agents learn before the final answer

**Key Milestone**: Reward varies meaningfully between random exploration and targeted queries

**What Team Members Build**:

- Progress computation (compare query results to gold)
- Operational signals (new info, repeats, execution success)
- Reward combination logic (3 layers β†’ single episode reward)
- Anti-gaming measures (caps, coarsening, repeat detection)

**Done When**: Can observe reward curves showing agents making progress during episodes (not just 0/1 at the end)

---

### Phase 4: Training Pipeline (Days 9-13)

**Goal**: Train a small model and show it improves over baseline

**Key Milestone**: The trained model beats a random policy on success rate

**What Team Members Build**:

- System prompt (teach the agent an SQL exploration strategy)
- Rollout function (episode player for training)
- TRL/GRPO integration (reward functions, training config)
- Evaluation wrapper (Green Agent)
- Comparison notebook (before/after metrics)

**Done When**: Have a chart showing the learning curve + success rate comparison + example episodes

---

### Phase 5: Polish & Submit (Days 14-16)

**Goal**: Package everything for submission

**Key Milestone**: All artifacts live and accessible to judges

**What Team Members Build**:

- HuggingFace Space deployment
- GitHub repo cleanup (README, setup instructions)
- Blog post (narrative + results + demos)
- Side-by-side video/screenshots (untrained vs. trained agent)

**Done When**: Someone outside the team can read the blog, visit the HF Space, run the training notebook, and reproduce the results

---

## 5. Ownership Areas (Team Can Claim)

These are natural pieces where team members can take full ownership:

### πŸ”§ Environment Engineering

- **Owner**: ???
- **Scope**: OpenEnv integration, WebSocket server, action handlers, episode lifecycle
- **Key Decisions**: Error handling, timeout strategy, result truncation format
- **Success Metric**: Environment runs stably for 1000+ episodes without crashes

### 🎯 Reward Designer

- **Owner**: ???
- **Scope**: 3-layer reward computation, progress metrics, anti-gaming measures
- **Key Decisions**: Reward weights, coarsening granularity, when to give partial credit
- **Success Metric**: Dense reward enables training convergence (vs. terminal-only baseline)

### πŸ“Š Dataset Curator

- **Owner**: ???
- **Scope**: Question selection, answer verification, database preparation
- **Key Decisions**: Difficulty distribution, answer type coverage, edge case handling
- **Success Metric**: Curated question set has clear difficulty progression and diverse answer types

### πŸ€– Training Engineer

- **Owner**: ???
- **Scope**: GRPO setup, system prompts, hyperparameter tuning, evaluation
- **Key Decisions**: Model size, rollout strategy, prompt engineering, training duration
- **Success Metric**: Trained model shows measurable improvement on a held-out test set

### πŸ“ Storyteller

- **Owner**: ???
- **Scope**: Blog post, documentation, demo videos, results visualization
- **Key Decisions**: Narrative framing, which results to highlight, demo selection
- **Success Metric**: Blog post tells a compelling story even if results are modest

---

## 6. Open Research Questions (Team Input Needed)

### Question 1: Reward Weights

What's the right balance between exploration and exploitation?

**Current Plan**: Terminal = 1.0, Exploration max = 0.5 (2:1 ratio)

**Alternatives**:

- More aggressive: Terminal = 1.0, Exploration max = 0.2 (5:1) β†’ forces focus on correctness
- More exploratory: Terminal = 1.0, Exploration max = 0.8 (1.25:1) β†’ encourages discovery

**How to Decide**: Run an ablation study during Phase 4 training

### Question 2: Schema Revelation Strategy

How much schema info should the agent see initially?
**Current Plan**: Only table names at reset; the agent must DESCRIBE to learn columns

**Alternatives**:

- Give column names but not types/constraints
- Give the full schema but hide foreign key relationships
- Progressive revelation: unlock new info after N steps

**How to Decide**: Pilot study with human participants—what feels natural?

### Question 3: Episode Budget

How many steps should agents get?

**Current Plan**: 15 steps (enough for 2-3 explores + 3-5 queries + an answer)

**Trade-offs**:

- Lower (10 steps): Forces efficiency and speeds up training, but may not allow enough exploration
- Higher (25 steps): Allows more thorough exploration, but longer episodes = slower training

**How to Decide**: Analyze human trajectories on similar tasks (if data is available), or start conservative (15) and increase if needed

### Question 4: Multi-Database Verification (Post-MVP)

Should we verify answers across database variants?

**The Problem**: The agent might return the correct answer for the wrong reason (e.g., hard-coding an ID that happens to be right in the test DB)

**Current Plan**: Single DB in the MVP; add variants in Phase 6+ if time allows

**Open Question**: Is this worth the complexity? Or should we rely on question diversity to catch brittleness?

---

## 7. Success Criteria (How We Know We're Done)

### Minimum Viable Submission

- βœ… Environment runs stably on a HuggingFace Space
- βœ… Can complete full episodes via WebSocket
- βœ… Has dense reward (not just terminal)
- βœ… Trained model shows *some* improvement over the random baseline
- βœ… Blog post + GitHub repo + demo exist

### Stretch Goals

- 🎯 Trained model achieves >30% success rate on medium-difficulty questions
- 🎯 Clear learning curves showing reward improvement over training
- 🎯 Ablation study showing dense reward > terminal-only
- 🎯 Multi-database verification for top-performing models
- 🎯 Published as a reusable benchmark for future RL research

---

## 8. Risk Mitigation

| Risk | Mitigation Strategy |
|------|---------------------|
| **Training doesn't converge** | Start with easy questions only; use a very small model (1.7B); aggressive reward shaping |
| **Reward hacking/gaming** | Coarsen progress signals; cap exploration rewards; monitor training for degenerate behaviors |
| **Technical debt in environment** | Extensive logging; unit tests for action handlers; validation suite before training starts |
| **Timeline pressure** | Each phase has a "minimum viable" deliverable; can submit after Phase 4 if needed |
| **Dataset too easy/hard** | Pilot with manual testing; curate incrementally; be ready to swap questions mid-training |

---

## 9. What Makes This Compelling (For Blog/Submission)

### The Hook

*"AI can generate SQL in one shot. But can it learn to explore databases like a data analyst?"*

### The Technical Innovation

- First RL environment for interactive SQL (not just static benchmarks)
- 3-layer reward architecture (operational + progress + terminal)
- Multi-hop exploration required (schema hidden initially)
- OpenEnv integration (reusable, WebSocket-based, Docker-isolated)

### The Story

Show a side-by-side comparison:

- **Untrained agent**: Random flailing, rarely succeeds
- **Trained agent**: Strategic exploration, builds queries incrementally, much higher success rate

Even modest improvement is a win—the *trajectory* of exploration is as interesting as the final success rate.
### The Impact

- Makes Spider (10K questions) usable for RL training
- Creates a methodology for future interactive SQL agents
- Shows how to shape rewards for complex reasoning tasks
- Demonstrates OpenEnv's flexibility beyond games

---

## Appendix: Key References

- **Spider Dataset**: https://huggingface.co/datasets/xlangai/spider (questions + SQL)
- **Spider Schema**: https://huggingface.co/datasets/richardr1126/spider-schema (table structures)
- **OpenEnv Framework**: https://github.com/openenv-org/openenv
- **TRL GRPO Tutorial**: Wordle environment example
- **Research**: SQL-TRAIL (2026), PaVeRL-SQL (2025), QueryGym (2025)

---

## Next Steps

1. **Team Meeting**: Review this brief, claim ownership areas, discuss open questions
2. **Kickoff**: Start Phase 1 (whoever owns Environment Engineering)
3. **Daily Standups**: 15-minute sync—what's done, what's blocked, what's next
4. **Phase Gates**: Don't move to the next phase until "Done When" criteria are met
5. **Mid-Project Check**: After Phase 3, assess whether we're on track for submission

**Questions? Concerns? Better ideas?** β†’ Discuss in the team channel. This is a living document—we'll update it as we learn.
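---

## Appendix: Reward Combination Sketch

As a concrete starting point for the Phase 3 / Decision 1 discussion, here is a minimal sketch of the 3-layer reward. The numeric weights, the 0.5 exploration cap, and the 5-level coarsening come from Decision 1 in this brief; the function names and decomposition are illustrative, not the implementation:

```python
def coarsen(p: float) -> float:
    """Snap progress in [0, 1] to 5 levels {0, 0.25, 0.5, 0.75, 1.0}."""
    return min(round(p * 4) / 4, 1.0)

def step_reward(executed_ok: bool, new_info: bool, repeated: bool) -> float:
    """Layer 1 (operational): small tool-use signals plus a per-step cost."""
    r = -0.005                     # tiny step cost keeps episodes short
    if executed_ok:
        r += 0.02                  # query ran without error
    if new_info:
        r += 0.01                  # discovered new table/column info (capped elsewhere)
    if repeated:
        r -= 0.01                  # exact repeat of an earlier query
    return r

def progress_reward(best_so_far: float, current: float) -> float:
    """Layer 2 (progress): reward only improvement over the best coarsened level."""
    return max(coarsen(current) - coarsen(best_so_far), 0.0)

def episode_reward(operational_total: float, progress_total: float, correct: bool) -> float:
    """Layer 3 (terminal) dominates: exploration capped at 0.5, success adds 1.0."""
    exploration = min(operational_total + progress_total, 0.5)
    return exploration + (1.0 if correct else 0.0)
```

The cap in `episode_reward` is what guarantees the 2:1 terminal-to-exploration ratio regardless of how the per-step signals accumulate; whether to report the layers separately to TRL or sum them like this is the open question from Decision 1.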