# Research Summary

**Project:** SQLEnv
**Change:** F006 — GRPO Training Pipeline
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

TRL/GRPO integration for training a small LLM (Qwen3-1.7B or similar) to play SQLEnv. Includes:

1. System prompt design for SQL exploration strategy
2. `rollout_func` that plays episodes via the environment
3. `reward_funcs` (correctness, progress, operational) for GRPOTrainer
4. Training notebook with hyperparameter config
5. Baseline vs trained comparison output

### Why We're Changing It

The "before vs after" comparison is the competition's headline demo. Without training, there is nothing that demonstrates the environment's utility for RL.

### Success Criteria

- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results

---

## 2. System Context

### Current Behavior

No training pipeline exists. The environment (F001) is functional with its `reset()`/`step()` API. There is no GRPO integration yet.
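The `rollout_func` listed above will drive episodes through this `reset()`/`step()` API. As a shape-check, a minimal sketch of that loop; note that `SQLAction`, `SQLObservation`, and the stub environment here are illustrative stand-ins assumed for this sketch, not the real `models.py`/F001 types:

```python
from dataclasses import dataclass

# Stand-in types: names mirror this document's vocabulary, but the fields
# are assumptions, not the real models.py definitions.
@dataclass
class SQLAction:
    kind: str   # e.g. "QUERY" or "ANSWER"
    text: str

@dataclass
class SQLObservation:
    result: str
    done: bool
    reward: float

class StubSQLEnvironment:
    """Toy environment showing only the reset()/step() shape the pipeline relies on."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> SQLObservation:
        self.steps = 0
        return SQLObservation(result="schema: users(id, name)", done=False, reward=0.0)

    def step(self, action: SQLAction) -> SQLObservation:
        self.steps += 1
        done = action.kind == "ANSWER" or self.steps >= self.max_steps
        return SQLObservation(result="ok", done=done, reward=1.0 if done else 0.0)

def play_episode(env, policy) -> float:
    """Drive one episode; `policy` maps observation text to an SQLAction."""
    obs = env.reset()
    total = 0.0
    while not obs.done:
        obs = env.step(policy(obs.result))
        total += obs.reward
    return total
```

The real `rollout_func` replaces `policy` with `model.generate()` plus action parsing, but the control flow is the same.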
### Architecture Context

```
Training Notebook / Script
├── GRPOTrainer (TRL)
│   ├── model: Qwen3-1.7B (or similar small LLM)
│   ├── rollout_func: plays SQLEnv episodes
│   │   ├── env.reset() → initial obs
│   │   ├── model.generate() → action text
│   │   ├── parse action → SQLAction
│   │   ├── env.step(action) → obs
│   │   └── repeat until done
│   ├── reward_funcs:
│   │   ├── reward_correctness → 0.0 or 1.0
│   │   ├── reward_progress → cumulative progress
│   │   └── reward_operational → sum of L1 signals
│   └── train_dataset: questions as prompts
└── Evaluation (F005 Green Agent)
    ├── Random baseline metrics
    └── Trained model metrics
```

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| Training notebook | User runs notebook | **To be created** |
| `rollout_func` | Called by GRPOTrainer | **To be created** — plays episodes |
| `reward_funcs` | Called by GRPOTrainer per completion | **To be created** — computes per-component rewards |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Questions | `data/questions/questions_train.json` | JSON | Training dataset |
| System prompt | Training config | `str` | Model context |
| Episode observations | SQLEnvironment | `SQLObservation` | Model input |
| Model output | LLM generation | `str` (parsed to SQLAction) | Environment step |
| Rewards | `reward_funcs` | `list[float]` per completion | GRPOTrainer |
| Trained model | GRPOTrainer output | Model weights | Evaluation |

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `trl` (external) | `GRPOTrainer` | Must match TRL API version |
| `transformers` (external) | Model loading, tokenizer | Standard HF interface |
| `vllm` (external, optional) | Fast inference during rollout | Optional — can use HF generate |
| F001 (SQLEnvironment) | `reset()`, `step()` | Complete, stable |
| F002 (verifier) | Terminal correctness | Being built in parallel |
| F003 (reward) | Dense reward signals | Being built in parallel |
| F005 (Green Agent) | Evaluation wrapper | Being built in parallel |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|----------------------|
| F007 (HF Submission) | Training results for blog | Provides learning curves + comparison |

### External Systems

| System | Integration Point | Considerations |
|--------|-------------------|----------------|
| GPU (CUDA) | Training compute | Qwen3-1.7B needs ~8GB VRAM |
| HuggingFace Hub | Model download | Qwen3-1.7B weights |
| SQLEnv server | Episode execution | Can be local instance or WebSocket |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Training doesn't converge | Medium | No demo results | Start very small (easy questions, short episodes), tune rewards |
| GPU requirements too high | Medium | Can't train locally | Use small model (1.7B), short episodes, few steps |
| TRL API breaking changes | Low | Script breaks | Pin TRL version in requirements |
| Notebook has hidden dependencies | Medium | Users can't reproduce | Explicit requirements, Colab-compatible |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| Model generates unparseable action | N/A | Default to QUERY with raw text |
| Episode exceeds token budget | N/A | Truncate context, keep recent actions |
| Training OOM | N/A | Reduce batch size, gradient accumulation |

### Invariants to Preserve

- [ ] Training is deterministic given seed
- [ ] Reward functions match environment reward signals
- [ ] Evaluation uses same environment as training

---

## 4b. Code Shape & Design Target

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `training/config.py` | Hyperparameters, model name, paths | Centralized config |
| `training/rollout.py` | `rollout_func` — plays episodes | TRL integration point |
| `training/rewards.py` | `reward_funcs` — per-component rewards | TRL integration point |
| `training/prompts.py` | System prompt design | Separates prompt engineering |
| `notebooks/train_grpo.ipynb` | End-to-end training notebook | User-facing entry point |

### Abstraction Level

- **Recommendation:** `training/` subpackage with focused modules. Notebook imports from package. Keep notebook cells linear and self-explanatory.
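One concrete piece of `training/rollout.py` is the action parser with the "default to QUERY with raw text" fallback required by the edge cases above. A hedged sketch, where `SQLAction` is a stand-in for the real `models.py` type and the `QUERY:`/`ANSWER:` tag convention is an illustrative assumption, not an established format:

```python
import re
from dataclasses import dataclass

# Stand-in for the real models.py type; fields are assumptions.
@dataclass
class SQLAction:
    kind: str  # "QUERY" or "ANSWER" (assumed action kinds)
    text: str

def parse_action(generation: str) -> SQLAction:
    """Map model text to an action; unparseable output becomes a raw QUERY."""
    text = generation.strip()
    # Assumed convention: the model tags its final answer or query explicitly.
    m = re.match(r"(?is)^ANSWER:\s*(.+)$", text)
    if m:
        return SQLAction(kind="ANSWER", text=m.group(1).strip())
    m = re.match(r"(?is)^QUERY:\s*(.+)$", text)
    if m:
        return SQLAction(kind="QUERY", text=m.group(1).strip())
    # Fallback: treat the whole generation as a query so the episode
    # continues instead of crashing on malformed output.
    return SQLAction(kind="QUERY", text=text)
```

Keeping this fallback total (never raising) is what lets training proceed even when an untrained model emits free-form text.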
### Anti-Patterns to Avoid

- Don't couple rollout to WebSocket — use local env for training
- Don't over-engineer prompt templates — single system prompt is enough for MVP
- Don't add wandb/tensorboard integration (MVP: just print metrics)
- Don't require specific GPU — should work on Colab free tier with small model

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| Model size | ≤ 3B parameters | Must train on consumer GPU / Colab |
| Training time | < 2 hours for demo | Short enough for competition |
| Dependencies | TRL, transformers, torch | Must be pip-installable |

### Pattern Constraints

- Follow TRL GRPOTrainer pattern (Wordle tutorial as reference)
- `reward_funcs` must be separate callables (not combined)
- `rollout_func` signature must match TRL expectations

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Which model? Qwen3-1.7B vs others | Affects VRAM, training time, quality | Recommend Qwen3-1.7B (good instruction following, small) |
| vLLM for inference or HF generate? | Speed vs. simplicity | Recommend HF generate for MVP (simpler, Colab-compatible) |
| Train on all questions or easy-only? | Convergence speed | Recommend start with easy+medium, add hard later |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `docs_draft/SQLEnv_Concept_v1.md` Section 3.5 | Doc | TRL mapping, GRPOTrainer pattern |
| `docs_draft/sql_env_project_brief.md` Phase 4 | Doc | Training pipeline requirements |
| `server/sql_environment.py` | Code | Environment API |
| `models.py` | Code | Action/observation types |
| OpenEnv Wordle GRPO tutorial | Reference | TRL integration pattern |
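To make the "separate callables" pattern constraint concrete: in the TRL pattern this document follows, each reward function receives the batch of completions plus keyword extras and returns one float per completion. A minimal sketch, assuming the rollout threads episode-level signals through as keyword arguments; the names `is_correct`, `progress`, and `op_signals` are illustrative, not an established interface:

```python
# Three independent reward callables, kept separate (not combined) so
# GRPOTrainer can weight and log them individually.

def reward_correctness(completions, is_correct=None, **kwargs):
    # Terminal 0/1 reward from the verifier (F002).
    return [1.0 if ok else 0.0 for ok in is_correct]

def reward_progress(completions, progress=None, **kwargs):
    # Cumulative dense progress signal (F003).
    return [float(p) for p in progress]

def reward_operational(completions, op_signals=None, **kwargs):
    # Sum of per-step operational (L1) signals over the episode.
    return [float(sum(signals)) for signals in op_signals]
```

Each function returns `list[float]` with one entry per completion, which is the shape the Data Flow table above specifies for rewards.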