Spaces:
Sleeping
Sleeping
RL Training for SQL Debug β GRPO with Qwen2.5-Coder-7B-Instruct
Full training script:
train_grpo.pyHF Space deployment:deploy_hf_space.md
Why GRPO, Not DDPG
| DDPG | GRPO | |
|---|---|---|
| Action space | Continuous R^n | Discrete tokens β |
| Value network | Required | Not needed β |
| Gradient signal | Bellman + actor-critic | Group relative ranking β |
| Works for SQL? | β | β |
DDPG is for robot control / trading. SQL token generation is discrete β always use GRPO or PPO.
What train_grpo.py Contains
| Section | Description |
|---|---|
| 1. Environment | Local DuckDB env or HTTP client pointing at HF Space |
| 2. Model & Tokenizer | Qwen/Qwen2.5-Coder-7B-Instruct, left-padding |
| 3. System Prompt | Rules, Response Format (<think> + sql), Strategy, Goal |
| 4. Helpers | build_user_message(), extract_sql_from_response(), format_messages() |
| 5. Rollout | rollout_func() β plays multi-turn episode, returns padded tensors |
| 6. Reward Fns | reward_correctness, reward_format, reward_no_repetition |
| 7. Dataset | 3 tasks Γ N repeats β HF Dataset with prompt + task_id columns |
| 8. GRPOConfig | A100-tuned: num_generations=4, bf16=True, max_completion_length=1024 |
| 9. Trainer | GRPOTrainer with reward_weights=[0.7, 0.2, 0.1] |
| 10. Save & Push | trainer.save_model() + push_to_hub() when PUSH_TO_HUB=true |
| 11. Evaluation | Greedy decode, 10 episodes/task, JSON report in outputs/evals/ |
Quick Start
# Install
pip install trl>=0.12.0 transformers>=4.45.0 torch>=2.3.0 duckdb pandas pydantic
# Start local server (terminal 1)
uvicorn server.app:app --host 0.0.0.0 --port 7860
# Train (terminal 2)
python train_grpo.py --mode train --n-repeats 50
# Evaluate trained model
python train_grpo.py --mode eval --output-dir ./sql-debug-qwen-grpo
# Train + eval in one command
python train_grpo.py --mode both
System Prompt Structure
RULES β what the agent must/must not do
RESPONSE FORMAT β <think>...</think> then ```sql...```
STRATEGY β task-specific hints (syntax / JOIN type / timezone)
GOAL β produce output matching the ground truth exactly
The <think> block is critical β it teaches chain-of-thought diagnosis before emitting the fix.
Reward Weights
reward_weights = [0.7, 0.2, 0.1]
# 0.7 Γ reward_correctness (dense 0.0β1.0 from grader)
# 0.2 Γ reward_format (<think> block + ```sql``` present)
# 0.1 Γ reward_no_repetition (penalty for trivial empty output)
Expected Outcomes After Training
| Task | Before (GPT-4o-mini baseline) | After GRPO (estimated) |
|---|---|---|
| task1_syntax_fix | ~0.85 | ~0.95 |
| task2_join_aggregation | ~0.55 | ~0.75 |
| task3_etl_timezone | ~0.25 | ~0.50 |
Use curriculum (train on Task 1+2 first, then add Task 3) for better Hard task improvement.