
Research Summary

Project: SQLEnv
Change: F006 – GRPO Training Pipeline
Date: 2026-03-27
Status: Draft


1. Change Overview

What We're Changing

TRL/GRPO integration for training a small LLM (Qwen3-1.7B or similar) to play SQLEnv. Includes:

  1. System prompt design for SQL exploration strategy
  2. rollout_func that plays episodes via the environment
  3. reward_funcs (correctness, progress, operational) for GRPOTrainer
  4. Training notebook with hyperparameter config
  5. Baseline vs trained comparison output

Why We're Changing It

The "before vs after" comparison is the centerpiece of the competition entry. Without training, there is no demonstration of the environment's utility for RL.

Success Criteria

  • Training notebook runs end-to-end in one click
  • Learning curve clearly shows improvement over episodes
  • Side-by-side episode transcripts: random vs trained
  • Reproducible results

2. System Context

Current Behavior

No training pipeline exists. The environment (F001) is functional with reset()/step() API. No GRPO integration.

Architecture Context

Training Notebook / Script
├── GRPOTrainer (TRL)
│   ├── model: Qwen3-1.7B (or similar small LLM)
│   ├── rollout_func: plays SQLEnv episodes
│   │   ├── env.reset() → initial obs
│   │   ├── model.generate() → action text
│   │   ├── parse action → SQLAction
│   │   ├── env.step(action) → obs
│   │   └── repeat until done
│   ├── reward_funcs:
│   │   ├── reward_correctness → 0.0 or 1.0
│   │   ├── reward_progress → cumulative progress
│   │   └── reward_operational → sum of L1 signals
│   └── train_dataset: questions as prompts
└── Evaluation (F005 Green Agent)
    ├── Random baseline metrics
    └── Trained model metrics
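The rollout loop in the tree above can be sketched in plain Python. Everything here is illustrative: `Obs`, `play_episode`, and the `generate` callable are hypothetical stand-ins for the real F001 environment API and model inference, not the actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Obs:
    """Hypothetical stand-in for SQLObservation (F001's real type differs)."""
    text: str
    done: bool = False

def play_episode(env, generate, system_prompt, max_steps=10):
    """Play one episode: reset, then generate/step until done or budget hit."""
    obs = env.reset()                       # env.reset() -> initial obs
    transcript = []
    for _ in range(max_steps):
        prompt = f"{system_prompt}\n\nObservation: {obs.text}"
        action_text = generate(prompt)      # model.generate() -> action text
        transcript.append(action_text)
        obs = env.step(action_text)         # parse + env.step(action) -> obs
        if obs.done:                        # repeat until done
            break
    return transcript, obs
```

The real rollout_func additionally has to match TRL's expected signature and return token ids plus logged reward components; this sketch only shows the episode loop itself.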

Entry Points

| Entry Point | Trigger | Current Flow |
|---|---|---|
| Training notebook | User runs notebook | To be created |
| rollout_func | Called by GRPOTrainer | To be created – plays episodes |
| reward_funcs | Called by GRPOTrainer per completion | To be created – computes per-component rewards |

Data Flow

| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| Questions | data/questions/questions_train.json | JSON | Training dataset |
| System prompt | Training config | str | Model context |
| Episode observations | SQLEnvironment | SQLObservation | Model input |
| Model output | LLM generation | str (parsed to SQLAction) | Environment step |
| Rewards | reward_funcs | list[float] per completion | GRPOTrainer |
| Trained model | GRPOTrainer output | Model weights | Evaluation |
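The "Rewards" row can be made concrete: TRL's GRPOTrainer takes a list of reward callables, each receiving the batch of completions (plus extra columns as kwargs) and returning one float per completion. A hedged sketch of two such callables; the `answer`/`progress` kwargs and the answer-extraction rule are assumptions, not the real F002/F003 interfaces.

```python
# Sketch of separate per-component reward callables in the shape GRPOTrainer
# expects: batch of completions in, one float per completion out.
# The `answer`/`progress` kwargs and extraction logic are assumptions.

def extract_answer(completion: str) -> str:
    """Take the last non-empty line as the final answer (an assumption)."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def reward_correctness(completions, answer=None, **kwargs):
    """1.0 if the extracted answer matches the gold answer, else 0.0."""
    gold = answer or [""] * len(completions)
    return [1.0 if extract_answer(c) == g else 0.0
            for c, g in zip(completions, gold)]

def reward_progress(completions, progress=None, **kwargs):
    """Pass through cumulative progress logged during the rollout."""
    return list(progress) if progress is not None else [0.0] * len(completions)
```

Keeping the components as separate callables (rather than one summed function) lets TRL log each reward stream individually, which is what the per-component comparison in F007 needs.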

3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
|---|---|---|
| trl (external) | GRPOTrainer | Must match TRL API version |
| transformers (external) | Model loading, tokenizer | Standard HF interface |
| vllm (external, optional) | Fast inference during rollout | Optional – can use HF generate |
| F001 (SQLEnvironment) | reset(), step() | Complete, stable |
| F002 (verifier) | Terminal correctness | Being built in parallel |
| F003 (reward) | Dense reward signals | Being built in parallel |
| F005 (Green Agent) | Evaluation wrapper | Being built in parallel |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| F007 (HF Submission) | Training results for blog | Provides learning curves + comparison |

External Systems

| System | Integration Point | Considerations |
|---|---|---|
| GPU (CUDA) | Training compute | Qwen3-1.7B needs ~8GB VRAM |
| HuggingFace Hub | Model download | Qwen3-1.7B weights |
| SQLEnv server | Episode execution | Can be local instance or WebSocket |

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Training doesn't converge | Medium | No demo results | Start very small (easy questions, short episodes), tune rewards |
| GPU requirements too high | Medium | Can't train locally | Use small model (1.7B), short episodes, few steps |
| TRL API breaking changes | Low | Script breaks | Pin TRL version in requirements |
| Notebook has hidden dependencies | Medium | Users can't reproduce | Explicit requirements, Colab-compatible |

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| Model generates unparseable action | N/A | Default to QUERY with raw text |
| Episode exceeds token budget | N/A | Truncate context, keep recent actions |
| Training OOM | N/A | Reduce batch size, gradient accumulation |
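The unparseable-action case can be handled with a defensive parser that falls back to treating the raw text as a QUERY. The `"ACTION: payload"` format and the action names below are illustrative assumptions about the F001 action model, not its actual schema.

```python
# Defensive action parsing: anything that doesn't match a known
# "ACTION: payload" line falls back to a raw QUERY.
# Action names and format are illustrative assumptions.

KNOWN_ACTIONS = {"QUERY", "INSPECT", "ANSWER"}

def parse_action(text: str) -> tuple[str, str]:
    head, sep, payload = text.strip().partition(":")
    if sep and head.strip().upper() in KNOWN_ACTIONS:
        return head.strip().upper(), payload.strip()
    return "QUERY", text.strip()   # fallback: raw text becomes the query
```

A fallback like this keeps episodes alive early in training, when the untrained policy rarely emits well-formed actions.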

Invariants to Preserve

  • Training is deterministic given seed
  • Reward functions match environment reward signals
  • Evaluation uses same environment as training
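The first invariant (determinism given a seed) implies seeding every RNG in play before rollouts start. A minimal helper, with `torch` seeding guarded since it is only present in the training environment; note that full GPU determinism needs additional torch settings not shown here.

```python
import random

def set_seed(seed: int) -> None:
    """Seed the RNGs used by rollouts and training.

    torch is imported lazily and skipped if absent; GPU-side determinism
    would need further torch flags beyond this sketch.
    """
    random.seed(seed)
    try:
        import torch  # optional dependency in this sketch
        torch.manual_seed(seed)
    except ImportError:
        pass
```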

4b. Code Shape & Design Target

Target Shape

| Component | Purpose | Why This Boundary |
|---|---|---|
| training/config.py | Hyperparameters, model name, paths | Centralized config |
| training/rollout.py | rollout_func – plays episodes | TRL integration point |
| training/rewards.py | reward_funcs – per-component rewards | TRL integration point |
| training/prompts.py | System prompt design | Separates prompt engineering |
| notebooks/train_grpo.ipynb | End-to-end training notebook | User-facing entry point |
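training/config.py could be as small as a single dataclass the notebook imports. Every value below is a placeholder for illustration, not a decided hyperparameter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # All defaults are placeholders, not decided hyperparameters.
    model_name: str = "Qwen/Qwen3-1.7B"
    questions_path: str = "data/questions/questions_train.json"
    max_steps_per_episode: int = 8
    learning_rate: float = 1e-5
    num_generations: int = 4        # GRPO group size
    seed: int = 0
```

A frozen dataclass keeps the notebook honest: cells can read config values but cannot silently mutate them mid-run.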

Abstraction Level

  • Recommendation: training/ subpackage with focused modules. Notebook imports from package. Keep notebook cells linear and self-explanatory.

Anti-Patterns to Avoid

  • Don't couple rollout to WebSocket – use local env for training
  • Don't over-engineer prompt templates – single system prompt is enough for MVP
  • Don't add wandb/tensorboard integration (MVP: just print metrics)
  • Don't require specific GPU – should work on Colab free tier with small model
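Following the "single system prompt" guideline, training/prompts.py could hold one constant and nothing else. The wording below is a first-draft assumption about the exploration strategy, purely illustrative.

```python
# training/prompts.py (sketch): one system prompt constant, no templating.
# The wording is a first-draft assumption, not the final prompt.

SYSTEM_PROMPT = """\
You are an agent exploring a SQL database to answer a question.
Strategy:
1. Inspect the schema before querying.
2. Run small, targeted queries and read the results.
3. When confident, give the final answer.
Respond with exactly one action per turn, e.g. `QUERY: SELECT ...`"""
```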

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
|---|---|---|
| Model size | ≤ 3B parameters | Must train on consumer GPU / Colab |
| Training time | < 2 hours for demo | Short enough for competition |
| Dependencies | TRL, transformers, torch | Must be pip-installable |
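The "pin TRL version" mitigation plus the pip-installable constraint translate into an exact-pinned requirements file. The version placeholders below are deliberate: they should be filled in with whatever versions the notebook is actually tested against.

```
# requirements.txt (sketch) – replace placeholders with the tested versions
trl==<tested-version>
transformers==<tested-version>
torch==<tested-version>
```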

Pattern Constraints

  • Follow TRL GRPOTrainer pattern (Wordle tutorial as reference)
  • reward_funcs must be separate callables (not combined)
  • rollout_func signature must match TRL expectations

6. Open Questions

| Question | Why It Matters | Current Recommendation |
|---|---|---|
| Which model? Qwen3-1.7B vs others | Affects VRAM, training time, quality | Qwen3-1.7B (good instruction following, small) |
| vLLM for inference or HF generate? | Speed vs. simplicity | HF generate for MVP (simpler, Colab-compatible) |
| Train on all questions or easy-only? | Convergence speed | Start with easy+medium, add hard later |

7. Context Sources

| Source | Type | Notes |
|---|---|---|
| docs_draft/SQLEnv_Concept_v1.md Section 3.5 | Doc | TRL mapping, GRPOTrainer pattern |
| docs_draft/sql_env_project_brief.md Phase 4 | Doc | Training pipeline requirements |
| server/sql_environment.py | Code | Environment API |
| models.py | Code | Action/observation types |
| OpenEnv Wordle GRPO tutorial | Reference | TRL integration pattern |