
Research Summary

Project: SQLEnv
Change: F006 – GRPO Training Pipeline
Date: 2026-03-27
Status: Draft


1. Change Overview

What We're Changing

TRL/GRPO integration for training a small LLM (Qwen3-1.7B or similar) to play SQLEnv. Includes:

  1. System prompt design for SQL exploration strategy
  2. rollout_func that plays episodes via the environment
  3. reward_funcs (correctness, progress, operational) for GRPOTrainer
  4. Training notebook with hyperparameter config
  5. Baseline vs trained comparison output

Why We're Changing It

The "before vs after" comparison is the centerpiece of the competition entry. Without training, there is no demonstration of the environment's utility for RL.

Success Criteria

  • Training notebook runs end-to-end in one click
  • Learning curve clearly shows improvement over episodes
  • Side-by-side episode transcripts: random vs trained
  • Reproducible results

2. System Context

Current Behavior

No training pipeline exists. The environment (F001) is functional with reset()/step() API. No GRPO integration.

Architecture Context

Training Notebook / Script
├── GRPOTrainer (TRL)
│   ├── model: Qwen3-1.7B (or similar small LLM)
│   ├── rollout_func: plays SQLEnv episodes
│   │   ├── env.reset() → initial obs
│   │   ├── model.generate() → action text
│   │   ├── parse action → SQLAction
│   │   ├── env.step(action) → obs
│   │   └── repeat until done
│   ├── reward_funcs:
│   │   ├── reward_correctness → 0.0 or 1.0
│   │   ├── reward_progress → cumulative progress
│   │   └── reward_operational → sum of L1 signals
│   └── train_dataset: questions as prompts
└── Evaluation (F005 Green Agent)
    ├── Random baseline metrics
    └── Trained model metrics
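The rollout loop in the tree above can be sketched in plain Python. Everything here is illustrative: `Obs`, `play_episode`, and the `generate` callable are hypothetical stand-ins for the real F001 environment API and model inference, not the actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Obs:
    """Hypothetical stand-in for SQLObservation (F001's real type differs)."""
    text: str
    done: bool = False

def play_episode(env, generate, system_prompt, max_steps=10):
    """Play one episode: reset, then generate/step until done or budget hit."""
    obs = env.reset()                       # env.reset() -> initial obs
    transcript = []
    for _ in range(max_steps):
        prompt = f"{system_prompt}\n\nObservation: {obs.text}"
        action_text = generate(prompt)      # model.generate() -> action text
        transcript.append(action_text)
        obs = env.step(action_text)         # parse + env.step(action) -> obs
        if obs.done:                        # repeat until done
            break
    return transcript, obs
```

The real rollout_func additionally has to match TRL's expected signature and return token ids plus logged reward components; this sketch only shows the episode loop itself.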

Entry Points

| Entry Point | Trigger | Current Flow |
|---|---|---|
| Training notebook | User runs notebook | To be created |
| rollout_func | Called by GRPOTrainer | To be created – plays episodes |
| reward_funcs | Called by GRPOTrainer per completion | To be created – computes per-component rewards |

Data Flow

| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| Questions | data/questions/questions_train.json | JSON | Training dataset |
| System prompt | Training config | str | Model context |
| Episode observations | SQLEnvironment | SQLObservation | Model input |
| Model output | LLM generation | str (parsed to SQLAction) | Environment step |
| Rewards | reward_funcs | list[float] per completion | GRPOTrainer |
| Trained model | GRPOTrainer output | Model weights | Evaluation |
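The "Rewards" row can be made concrete: TRL's GRPOTrainer takes a list of reward callables, each receiving the batch of completions (plus extra columns as kwargs) and returning one float per completion. A hedged sketch of two such callables; the `answer`/`progress` kwargs and the answer-extraction rule are assumptions, not the real F002/F003 interfaces.

```python
# Sketch of separate per-component reward callables in the shape GRPOTrainer
# expects: batch of completions in, one float per completion out.
# The `answer`/`progress` kwargs and extraction logic are assumptions.

def extract_answer(completion: str) -> str:
    """Take the last non-empty line as the final answer (an assumption)."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def reward_correctness(completions, answer=None, **kwargs):
    """1.0 if the extracted answer matches the gold answer, else 0.0."""
    gold = answer or [""] * len(completions)
    return [1.0 if extract_answer(c) == g else 0.0
            for c, g in zip(completions, gold)]

def reward_progress(completions, progress=None, **kwargs):
    """Pass through cumulative progress logged during the rollout."""
    return list(progress) if progress is not None else [0.0] * len(completions)
```

Keeping the components as separate callables (rather than one summed function) lets TRL log each reward stream individually, which is what the per-component comparison in F007 needs.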

3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
|---|---|---|
| trl (external) | GRPOTrainer | Must match TRL API version |
| transformers (external) | Model loading, tokenizer | Standard HF interface |
| vllm (external, optional) | Fast inference during rollout | Optional – can use HF generate |
| F001 (SQLEnvironment) | reset(), step() | Complete, stable |
| F002 (verifier) | Terminal correctness | Being built in parallel |
| F003 (reward) | Dense reward signals | Being built in parallel |
| F005 (Green Agent) | Evaluation wrapper | Being built in parallel |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| F007 (HF Submission) | Training results for blog | Provides learning curves + comparison |

External Systems

| System | Integration Point | Considerations |
|---|---|---|
| GPU (CUDA) | Training compute | Qwen3-1.7B needs ~8GB VRAM |
| HuggingFace Hub | Model download | Qwen3-1.7B weights |
| SQLEnv server | Episode execution | Can be local instance or WebSocket |

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Training doesn't converge | Medium | No demo results | Start very small (easy questions, short episodes), tune rewards |
| GPU requirements too high | Medium | Can't train locally | Use small model (1.7B), short episodes, few steps |
| TRL API breaking changes | Low | Script breaks | Pin TRL version in requirements |
| Notebook has hidden dependencies | Medium | Users can't reproduce | Explicit requirements, Colab-compatible |

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| Model generates unparseable action | N/A | Default to QUERY with raw text |
| Episode exceeds token budget | N/A | Truncate context, keep recent actions |
| Training OOM | N/A | Reduce batch size, gradient accumulation |
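The unparseable-action case can be handled with a defensive parser that falls back to treating the raw text as a QUERY. The `"ACTION: payload"` format and the action names below are illustrative assumptions about the F001 action model, not its actual schema.

```python
# Defensive action parsing: anything that doesn't match a known
# "ACTION: payload" line falls back to a raw QUERY.
# Action names and format are illustrative assumptions.

KNOWN_ACTIONS = {"QUERY", "INSPECT", "ANSWER"}

def parse_action(text: str) -> tuple[str, str]:
    head, sep, payload = text.strip().partition(":")
    if sep and head.strip().upper() in KNOWN_ACTIONS:
        return head.strip().upper(), payload.strip()
    return "QUERY", text.strip()   # fallback: raw text becomes the query
```

A fallback like this keeps episodes alive early in training, when the untrained policy rarely emits well-formed actions.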

Invariants to Preserve

  • Training is deterministic given seed
  • Reward functions match environment reward signals
  • Evaluation uses same environment as training
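The first invariant (determinism given a seed) implies seeding every RNG in play before rollouts start. A minimal helper, with `torch` seeding guarded since it is only present in the training environment; note that full GPU determinism needs additional torch settings not shown here.

```python
import random

def set_seed(seed: int) -> None:
    """Seed the RNGs used by rollouts and training.

    torch is imported lazily and skipped if absent; GPU-side determinism
    would need further torch flags beyond this sketch.
    """
    random.seed(seed)
    try:
        import torch  # optional dependency in this sketch
        torch.manual_seed(seed)
    except ImportError:
        pass
```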

4b. Code Shape & Design Target

Target Shape

| Component | Purpose | Why This Boundary |
|---|---|---|
| training/config.py | Hyperparameters, model name, paths | Centralized config |
| training/rollout.py | rollout_func – plays episodes | TRL integration point |
| training/rewards.py | reward_funcs – per-component rewards | TRL integration point |
| training/prompts.py | System prompt design | Separates prompt engineering |
| notebooks/train_grpo.ipynb | End-to-end training notebook | User-facing entry point |
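training/config.py could be as small as a single dataclass the notebook imports. Every value below is a placeholder for illustration, not a decided hyperparameter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # All defaults are placeholders, not decided hyperparameters.
    model_name: str = "Qwen/Qwen3-1.7B"
    questions_path: str = "data/questions/questions_train.json"
    max_steps_per_episode: int = 8
    learning_rate: float = 1e-5
    num_generations: int = 4        # GRPO group size
    seed: int = 0
```

A frozen dataclass keeps the notebook honest: cells can read config values but cannot silently mutate them mid-run.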

Abstraction Level

  • Recommendation: training/ subpackage with focused modules. Notebook imports from package. Keep notebook cells linear and self-explanatory.

Anti-Patterns to Avoid

  • Don't couple rollout to WebSocket – use local env for training
  • Don't over-engineer prompt templates – single system prompt is enough for MVP
  • Don't add wandb/tensorboard integration (MVP: just print metrics)
  • Don't require specific GPU – should work on Colab free tier with small model
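Following the "single system prompt" guideline, training/prompts.py could hold one constant and nothing else. The wording below is a first-draft assumption about the exploration strategy, purely illustrative.

```python
# training/prompts.py (sketch): one system prompt constant, no templating.
# The wording is a first-draft assumption, not the final prompt.

SYSTEM_PROMPT = """\
You are an agent exploring a SQL database to answer a question.
Strategy:
1. Inspect the schema before querying.
2. Run small, targeted queries and read the results.
3. When confident, give the final answer.
Respond with exactly one action per turn, e.g. `QUERY: SELECT ...`"""
```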

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
|---|---|---|
| Model size | ≤ 3B parameters | Must train on consumer GPU / Colab |
| Training time | < 2 hours for demo | Short enough for competition |
| Dependencies | TRL, transformers, torch | Must be pip-installable |
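The "pin TRL version" mitigation plus the pip-installable constraint translate into an exact-pinned requirements file. The version placeholders below are deliberate: they should be filled in with whatever versions the notebook is actually tested against.

```
# requirements.txt (sketch) – replace placeholders with the tested versions
trl==<tested-version>
transformers==<tested-version>
torch==<tested-version>
```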

Pattern Constraints

  • Follow TRL GRPOTrainer pattern (Wordle tutorial as reference)
  • reward_funcs must be separate callables (not combined)
  • rollout_func signature must match TRL expectations

6. Open Questions

| Question | Why It Matters | Current Recommendation |
|---|---|---|
| Which model? Qwen3-1.7B vs others | Affects VRAM, training time, quality | Qwen3-1.7B (good instruction following, small) |
| vLLM for inference or HF generate? | Speed vs. simplicity | HF generate for MVP (simpler, Colab-compatible) |
| Train on all questions or easy-only? | Convergence speed | Start with easy+medium, add hard later |

7. Context Sources

| Source | Type | Notes |
|---|---|---|
| docs_draft/SQLEnv_Concept_v1.md Section 3.5 | Doc | TRL mapping, GRPOTrainer pattern |
| docs_draft/sql_env_project_brief.md Phase 4 | Doc | Training pipeline requirements |
| server/sql_environment.py | Code | Environment API |
| models.py | Code | Action/observation types |
| OpenEnv Wordle GRPO tutorial | Reference | TRL integration pattern |