# Research Summary

**Project:** SQLEnv
**Change:** F006 — GRPO Training Pipeline
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

TRL/GRPO integration for training a small LLM (Qwen3-1.7B or similar) to play SQLEnv. Includes:

1. System prompt design for SQL exploration strategy
2. `rollout_func` that plays episodes via the environment
3. `reward_funcs` (correctness, progress, operational) for GRPOTrainer
4. Training notebook with hyperparameter config
5. Baseline vs trained comparison output

### Why We're Changing It

The "before vs after" comparison is the competition's headline demo. Without training, there is nothing that demonstrates the environment's utility for RL.

### Success Criteria

- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results

---

## 2. System Context

### Current Behavior

No training pipeline exists. The environment (F001) is functional with its `reset()`/`step()` API. There is no GRPO integration yet.
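The `rollout_func` listed above will drive episodes through this `reset()`/`step()` API. As a shape-check, a minimal sketch of that loop; note that `SQLAction`, `SQLObservation`, and the stub environment here are illustrative stand-ins assumed for this sketch, not the real `models.py`/F001 types:

```python
from dataclasses import dataclass

# Stand-in types: names mirror this document's vocabulary, but the fields
# are assumptions, not the real models.py definitions.
@dataclass
class SQLAction:
    kind: str   # e.g. "QUERY" or "ANSWER"
    text: str

@dataclass
class SQLObservation:
    result: str
    done: bool
    reward: float

class StubSQLEnvironment:
    """Toy environment showing only the reset()/step() shape the pipeline relies on."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> SQLObservation:
        self.steps = 0
        return SQLObservation(result="schema: users(id, name)", done=False, reward=0.0)

    def step(self, action: SQLAction) -> SQLObservation:
        self.steps += 1
        done = action.kind == "ANSWER" or self.steps >= self.max_steps
        return SQLObservation(result="ok", done=done, reward=1.0 if done else 0.0)

def play_episode(env, policy) -> float:
    """Drive one episode; `policy` maps observation text to an SQLAction."""
    obs = env.reset()
    total = 0.0
    while not obs.done:
        obs = env.step(policy(obs.result))
        total += obs.reward
    return total
```

The real `rollout_func` replaces `policy` with `model.generate()` plus action parsing, but the control flow is the same.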
### Architecture Context

```
Training Notebook / Script
├── GRPOTrainer (TRL)
│   ├── model: Qwen3-1.7B (or similar small LLM)
│   ├── rollout_func: plays SQLEnv episodes
│   │   ├── env.reset() → initial obs
│   │   ├── model.generate() → action text
│   │   ├── parse action → SQLAction
│   │   ├── env.step(action) → obs
│   │   └── repeat until done
│   ├── reward_funcs:
│   │   ├── reward_correctness → 0.0 or 1.0
│   │   ├── reward_progress → cumulative progress
│   │   └── reward_operational → sum of L1 signals
│   └── train_dataset: questions as prompts
└── Evaluation (F005 Green Agent)
    ├── Random baseline metrics
    └── Trained model metrics
```

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| Training notebook | User runs notebook | **To be created** |
| `rollout_func` | Called by GRPOTrainer | **To be created** — plays episodes |
| `reward_funcs` | Called by GRPOTrainer per completion | **To be created** — computes per-component rewards |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Questions | `data/questions/questions_train.json` | JSON | Training dataset |
| System prompt | Training config | `str` | Model context |
| Episode observations | SQLEnvironment | `SQLObservation` | Model input |
| Model output | LLM generation | `str` (parsed to SQLAction) | Environment step |
| Rewards | `reward_funcs` | `list[float]` per completion | GRPOTrainer |
| Trained model | GRPOTrainer output | Model weights | Evaluation |

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `trl` (external) | `GRPOTrainer` | Must match TRL API version |
| `transformers` (external) | Model loading, tokenizer | Standard HF interface |
| `vllm` (external, optional) | Fast inference during rollout | Optional — can use HF generate |
| F001 (SQLEnvironment) | `reset()`, `step()` | Complete, stable |
| F002 (verifier) | Terminal correctness | Being built in parallel |
| F003 (reward) | Dense reward signals | Being built in parallel |
| F005 (Green Agent) | Evaluation wrapper | Being built in parallel |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|----------------------|
| F007 (HF Submission) | Training results for blog | Provides learning curves + comparison |

### External Systems

| System | Integration Point | Considerations |
|--------|-------------------|----------------|
| GPU (CUDA) | Training compute | Qwen3-1.7B needs ~8GB VRAM |
| HuggingFace Hub | Model download | Qwen3-1.7B weights |
| SQLEnv server | Episode execution | Can be local instance or WebSocket |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Training doesn't converge | Medium | No demo results | Start very small (easy questions, short episodes), tune rewards |
| GPU requirements too high | Medium | Can't train locally | Use small model (1.7B), short episodes, few steps |
| TRL API breaking changes | Low | Script breaks | Pin TRL version in requirements |
| Notebook has hidden dependencies | Medium | Users can't reproduce | Explicit requirements, Colab-compatible |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| Model generates unparseable action | N/A | Default to QUERY with raw text |
| Episode exceeds token budget | N/A | Truncate context, keep recent actions |
| Training OOM | N/A | Reduce batch size, gradient accumulation |

### Invariants to Preserve

- [ ] Training is deterministic given seed
- [ ] Reward functions match environment reward signals
- [ ] Evaluation uses same environment as training

---

## 4b. Code Shape & Design Target

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `training/config.py` | Hyperparameters, model name, paths | Centralized config |
| `training/rollout.py` | `rollout_func` — plays episodes | TRL integration point |
| `training/rewards.py` | `reward_funcs` — per-component rewards | TRL integration point |
| `training/prompts.py` | System prompt design | Separates prompt engineering |
| `notebooks/train_grpo.ipynb` | End-to-end training notebook | User-facing entry point |

### Abstraction Level

- **Recommendation:** `training/` subpackage with focused modules. Notebook imports from package. Keep notebook cells linear and self-explanatory.
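One concrete piece of `training/rollout.py` is the action parser with the "default to QUERY with raw text" fallback required by the edge cases above. A hedged sketch, where `SQLAction` is a stand-in for the real `models.py` type and the `QUERY:`/`ANSWER:` tag convention is an illustrative assumption, not an established format:

```python
import re
from dataclasses import dataclass

# Stand-in for the real models.py type; fields are assumptions.
@dataclass
class SQLAction:
    kind: str  # "QUERY" or "ANSWER" (assumed action kinds)
    text: str

def parse_action(generation: str) -> SQLAction:
    """Map model text to an action; unparseable output becomes a raw QUERY."""
    text = generation.strip()
    # Assumed convention: the model tags its final answer or query explicitly.
    m = re.match(r"(?is)^ANSWER:\s*(.+)$", text)
    if m:
        return SQLAction(kind="ANSWER", text=m.group(1).strip())
    m = re.match(r"(?is)^QUERY:\s*(.+)$", text)
    if m:
        return SQLAction(kind="QUERY", text=m.group(1).strip())
    # Fallback: treat the whole generation as a query so the episode
    # continues instead of crashing on malformed output.
    return SQLAction(kind="QUERY", text=text)
```

Keeping this fallback total (never raising) is what lets training proceed even when an untrained model emits free-form text.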
### Anti-Patterns to Avoid

- Don't couple rollout to WebSocket — use local env for training
- Don't over-engineer prompt templates — single system prompt is enough for MVP
- Don't add wandb/tensorboard integration (MVP: just print metrics)
- Don't require specific GPU — should work on Colab free tier with small model

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| Model size | ≤ 3B parameters | Must train on consumer GPU / Colab |
| Training time | < 2 hours for demo | Short enough for competition |
| Dependencies | TRL, transformers, torch | Must be pip-installable |

### Pattern Constraints

- Follow TRL GRPOTrainer pattern (Wordle tutorial as reference)
- `reward_funcs` must be separate callables (not combined)
- `rollout_func` signature must match TRL expectations

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Which model? Qwen3-1.7B vs others | Affects VRAM, training time, quality | Recommend Qwen3-1.7B (good instruction following, small) |
| vLLM for inference or HF generate? | Speed vs. simplicity | Recommend HF generate for MVP (simpler, Colab-compatible) |
| Train on all questions or easy-only? | Convergence speed | Recommend start with easy+medium, add hard later |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `docs_draft/SQLEnv_Concept_v1.md` Section 3.5 | Doc | TRL mapping, GRPOTrainer pattern |
| `docs_draft/sql_env_project_brief.md` Phase 4 | Doc | Training pipeline requirements |
| `server/sql_environment.py` | Code | Environment API |
| `models.py` | Code | Action/observation types |
| OpenEnv Wordle GRPO tutorial | Reference | TRL integration pattern |
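To make the "separate callables" pattern constraint concrete: in the TRL pattern this document follows, each reward function receives the batch of completions plus keyword extras and returns one float per completion. A minimal sketch, assuming the rollout threads episode-level signals through as keyword arguments; the names `is_correct`, `progress`, and `op_signals` are illustrative, not an established interface:

```python
# Three independent reward callables, kept separate (not combined) so
# GRPOTrainer can weight and log them individually.

def reward_correctness(completions, is_correct=None, **kwargs):
    # Terminal 0/1 reward from the verifier (F002).
    return [1.0 if ok else 0.0 for ok in is_correct]

def reward_progress(completions, progress=None, **kwargs):
    # Cumulative dense progress signal (F003).
    return [float(p) for p in progress]

def reward_operational(completions, op_signals=None, **kwargs):
    # Sum of per-step operational (L1) signals over the episode.
    return [float(sum(signals)) for signals in op_signals]
```

Each function returns `list[float]` with one entry per completion, which is the shape the Data Flow table above specifies for rewards.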