Research Summary
Project: SQLEnv
Change: F006 (GRPO Training Pipeline)
Date: 2026-03-27
Status: Draft
1. Change Overview
What We're Changing
TRL/GRPO integration for training a small LLM (Qwen3-1.7B or similar) to play SQLEnv. Includes:
- System prompt design for the SQL exploration strategy
- `rollout_func` that plays episodes via the environment
- `reward_funcs` (correctness, progress, operational) for GRPOTrainer
- Training notebook with hyperparameter config
- Baseline vs. trained comparison output
Why We're Changing It
The "before vs. after" comparison is the centerpiece of the competition entry. Without training, there is no demonstration of the environment's utility for RL.
Success Criteria
- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results
2. System Context
Current Behavior
No training pipeline exists. The environment (F001) is functional with reset()/step() API. No GRPO integration.
Architecture Context
```
Training Notebook / Script
├── GRPOTrainer (TRL)
│   ├── model: Qwen3-1.7B (or similar small LLM)
│   ├── rollout_func: plays SQLEnv episodes
│   │   ├── env.reset() → initial obs
│   │   ├── model.generate() → action text
│   │   ├── parse action → SQLAction
│   │   ├── env.step(action) → obs
│   │   └── repeat until done
│   ├── reward_funcs:
│   │   ├── reward_correctness → 0.0 or 1.0
│   │   ├── reward_progress → cumulative progress
│   │   └── reward_operational → sum of L1 signals
│   └── train_dataset: questions as prompts
└── Evaluation (F005 Green Agent)
    ├── Random baseline metrics
    └── Trained model metrics
```
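The rollout loop in the tree above can be sketched as a plain function. This is a sketch only: `env` is assumed to follow the F001 `reset()`/`step()` API, `generate_fn` stands in for `model.generate()`, and `parse_action` is a hypothetical helper that turns raw model text into a `SQLAction`.

```python
# Sketch of the rollout loop: reset, generate, parse, step, repeat until done.
# `env`, `generate_fn`, and `parse_action` are assumptions, not the real F006 code.

def play_episode(env, generate_fn, parse_action, max_steps=10):
    """Play one SQLEnv episode; return (transcript, done)."""
    obs = env.reset()                       # env.reset() -> initial obs
    transcript = []
    for _ in range(max_steps):
        action_text = generate_fn(obs)      # model.generate() -> action text
        action = parse_action(action_text)  # parse action -> SQLAction
        obs = env.step(action)              # env.step(action) -> obs
        transcript.append((action, obs))
        if getattr(obs, "done", False):     # repeat until done
            break
    return transcript, getattr(obs, "done", False)
```

Keeping the loop free of TRL-specific types makes it easy to test against a stub environment before wiring it into the trainer.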
Entry Points
| Entry Point | Trigger | Current Flow |
|---|---|---|
| Training notebook | User runs notebook | To be created |
| `rollout_func` | Called by GRPOTrainer | To be created; plays episodes |
| `reward_funcs` | Called by GRPOTrainer per completion | To be created; computes per-component rewards |
Data Flow
| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| Questions | `data/questions/questions_train.json` | JSON | Training dataset |
| System prompt | Training config | `str` | Model context |
| Episode observations | SQLEnvironment | `SQLObservation` | Model input |
| Model output | LLM generation | `str` (parsed to `SQLAction`) | Environment step |
| Rewards | `reward_funcs` | `list[float]` per completion | GRPOTrainer |
| Trained model | GRPOTrainer output | Model weights | Evaluation |
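The three reward callables from the Rewards row can be sketched as follows, using the TRL convention that each reward function receives the batch of completions plus keyword arguments and returns one float per completion. How episode results reach the functions (here a hypothetical `episode_results` kwarg) depends on the rollout integration and is an assumption.

```python
# Minimal sketch of the three separate reward callables. The `episode_results`
# kwarg is a hypothetical plumbing choice, not a TRL-defined argument.

def reward_correctness(completions, episode_results=None, **kwargs):
    """1.0 if the episode's final answer verified correct (F002), else 0.0."""
    results = episode_results or [{} for _ in completions]
    return [1.0 if r.get("correct") else 0.0 for r in results]

def reward_progress(completions, episode_results=None, **kwargs):
    """Cumulative progress signal accumulated over the episode (F003)."""
    results = episode_results or [{} for _ in completions]
    return [float(r.get("progress", 0.0)) for r in results]

def reward_operational(completions, episode_results=None, **kwargs):
    """Sum of the dense L1 operational signals from F003."""
    results = episode_results or [{} for _ in completions]
    return [float(sum(r.get("l1_signals", []))) for r in results]
```

Keeping them as separate callables (rather than one combined score) lets GRPOTrainer log each component independently, which matches the pattern constraint in section 5.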
3. Dependencies
Code We Depend On
| Dependency | What We Use | Risk if Changed |
|---|---|---|
| `trl` (external) | GRPOTrainer | Must match TRL API version |
| `transformers` (external) | Model loading, tokenizer | Standard HF interface |
| `vllm` (external, optional) | Fast inference during rollout | Optional; can use HF generate |
| F001 (SQLEnvironment) | `reset()`, `step()` | Complete, stable |
| F002 (verifier) | Terminal correctness | Being built in parallel |
| F003 (reward) | Dense reward signals | Being built in parallel |
| F005 (Green Agent) | Evaluation wrapper | Being built in parallel |
Code That Depends On Us
| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| F007 (HF Submission) | Training results for blog | Provides learning curves + comparison |
External Systems
| System | Integration Point | Considerations |
|---|---|---|
| GPU (CUDA) | Training compute | Qwen3-1.7B needs ~8 GB VRAM |
| HuggingFace Hub | Model download | Qwen3-1.7B weights |
| SQLEnv server | Episode execution | Can be a local instance or WebSocket |
4. Risks & Edge Cases
Identified Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Training doesn't converge | Medium | No demo results | Start very small (easy questions, short episodes), tune rewards |
| GPU requirements too high | Medium | Can't train locally | Use a small model (1.7B), short episodes, few steps |
| TRL API breaking changes | Low | Script breaks | Pin the TRL version in requirements |
| Notebook has hidden dependencies | Medium | Users can't reproduce | Explicit requirements, Colab-compatible |
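For the GPU-requirements and OOM rows, the usual lever is trading per-device batch size against gradient accumulation: the effective batch size is their product (times the device count), so halving one and doubling the other keeps each optimization step identical while roughly halving peak activation memory. A back-of-envelope helper, purely illustrative:

```python
# Effective batch size arithmetic behind the OOM mitigation: same optimization
# step, lower peak memory, as long as the product stays constant.

def effective_batch(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device * grad_accum * num_devices
```

For example, `effective_batch(8, 1)`, `effective_batch(4, 2)`, and `effective_batch(2, 4)` all produce the same optimizer step; only the memory footprint differs.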
Edge Cases to Handle
| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| Model generates unparseable action | N/A | Default to QUERY with raw text |
| Episode exceeds token budget | N/A | Truncate context, keep recent actions |
| Training OOM | N/A | Reduce batch size, use gradient accumulation |
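The unparseable-action fallback can be sketched as below. The `SQLAction` dataclass here is a stand-in for the real type in `models.py`, and the action vocabulary (`QUERY`, `SUBMIT`) is an illustrative assumption, not the confirmed action set.

```python
# Sketch of the fallback: if the model output has no recognizable action tag,
# treat the whole text as a QUERY rather than failing the step.
# SQLAction and the action names are stand-ins, not the real models.py types.

from dataclasses import dataclass

@dataclass
class SQLAction:
    kind: str
    payload: str

def parse_action(text: str) -> SQLAction:
    """Parse 'ACTION: payload' lines; default to QUERY with the raw text."""
    head, sep, tail = text.strip().partition(":")
    if sep and head.strip().upper() in {"QUERY", "SUBMIT"}:
        return SQLAction(kind=head.strip().upper(), payload=tail.strip())
    return SQLAction(kind="QUERY", payload=text.strip())  # fallback path
```

Defaulting to QUERY keeps the episode alive and lets the environment's own error feedback (e.g. a SQL syntax error) act as a learning signal instead of silently dropping the turn.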
Invariants to Preserve
4b. Code Shape & Design Target
Target Shape
| Component | Purpose | Why This Boundary |
|---|---|---|
| `training/config.py` | Hyperparameters, model name, paths | Centralized config |
| `training/rollout.py` | `rollout_func`; plays episodes | TRL integration point |
| `training/rewards.py` | `reward_funcs`; per-component rewards | TRL integration point |
| `training/prompts.py` | System prompt design | Separates prompt engineering |
| `notebooks/train_grpo.ipynb` | End-to-end training notebook | User-facing entry point |
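One possible shape for `training/config.py` is a single dataclass that holds every knob the notebook needs. All field names and default values below are illustrative placeholders, not decided hyperparameters.

```python
# Sketch of a centralized training config. Defaults are illustrative only.

from dataclasses import dataclass

@dataclass
class TrainConfig:
    model_name: str = "Qwen/Qwen3-1.7B"                        # HF Hub model id
    questions_path: str = "data/questions/questions_train.json"
    max_episode_steps: int = 10        # keep episodes short for the MVP
    num_generations: int = 4           # GRPO group size per prompt
    learning_rate: float = 1e-5
    max_train_steps: int = 200         # fits the < 2 h demo budget
```

A dataclass keeps the notebook cells linear: one cell constructs `TrainConfig()`, possibly overriding a field or two, and everything downstream reads from it.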
Abstraction Level
- Recommendation: a `training/` subpackage with focused modules. The notebook imports from the package. Keep notebook cells linear and self-explanatory.
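Following the single-system-prompt recommendation, `training/prompts.py` could be as small as one module-level constant. The wording and action format below are hypothetical, matching whatever action grammar the rollout parser expects.

```python
# A plausible shape for training/prompts.py: one system prompt stating the
# exploration strategy and the action format. Wording is illustrative.

SYSTEM_PROMPT = """\
You are an agent exploring a SQL database to answer a question.
On each turn, respond with exactly one action:
  QUERY: <sql>   -- run exploratory SQL and observe the result
  SUBMIT: <sql>  -- submit your final answer query
Explore the schema first, then narrow down to the answer.
Keep queries small; results are truncated."""
```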
Anti-Patterns to Avoid
- Don't couple the rollout to WebSocket; use a local env for training
- Don't over-engineer prompt templates; a single system prompt is enough for the MVP
- Don't add wandb/tensorboard integration (MVP: just print metrics)
- Don't require a specific GPU; training should work on the Colab free tier with a small model
5. Constraints
Technical Constraints
| Constraint | Requirement | Notes |
|---|---|---|
| Model size | ≤ 3B parameters | Must train on a consumer GPU / Colab |
| Training time | < 2 hours for demo | Short enough for the competition |
| Dependencies | TRL, transformers, torch | Must be pip-installable |
Pattern Constraints
- Follow the TRL GRPOTrainer pattern (Wordle tutorial as reference)
- `reward_funcs` must be separate callables (not combined)
- The `rollout_func` signature must match TRL expectations
6. Open Questions
| Question | Why It Matters | Who Can Answer |
|---|---|---|
| Which model? Qwen3-1.7B vs. others | Affects VRAM, training time, quality | Recommend Qwen3-1.7B (good instruction following, small) |
| vLLM for inference or HF generate? | Speed vs. simplicity | Recommend HF generate for MVP (simpler, Colab-compatible) |
| Train on all questions or easy-only? | Convergence speed | Recommend starting with easy + medium, adding hard later |
7. Context Sources
| Source | Type | Notes |
|---|---|---|
| `docs_draft/SQLEnv_Concept_v1.md` Section 3.5 | Doc | TRL mapping, GRPOTrainer pattern |
| `docs_draft/sql_env_project_brief.md` Phase 4 | Doc | Training pipeline requirements |
| `server/sql_environment.py` | Code | Environment API |
| `models.py` | Code | Action/observation types |
| OpenEnv Wordle GRPO tutorial | Reference | TRL integration pattern |