SQL_debug_env_v1 / train_rl.md
sai1912's picture
Upload folder using huggingface_hub
3411777 verified

RL Training for SQL Debug β€” GRPO with Qwen2.5-Coder-7B-Instruct

Full training script: train_grpo.py HF Space deployment: deploy_hf_space.md


Why GRPO, Not DDPG

DDPG GRPO
Action space Continuous R^n Discrete tokens βœ…
Value network Required Not needed βœ…
Gradient signal Bellman + actor-critic Group relative ranking βœ…
Works for SQL? ❌ βœ…

DDPG is for robot control / trading. SQL token generation is discrete β€” always use GRPO or PPO.


What train_grpo.py Contains

Section Description
1. Environment Local DuckDB env or HTTP client pointing at HF Space
2. Model & Tokenizer Qwen/Qwen2.5-Coder-7B-Instruct, left-padding
3. System Prompt Rules, Response Format (<think> + sql), Strategy, Goal
4. Helpers build_user_message(), extract_sql_from_response(), format_messages()
5. Rollout rollout_func() β€” plays multi-turn episode, returns padded tensors
6. Reward Fns reward_correctness, reward_format, reward_no_repetition
7. Dataset 3 tasks Γ— N repeats β†’ HF Dataset with prompt + task_id columns
8. GRPOConfig A100-tuned: num_generations=4, bf16=True, max_completion_length=1024
9. Trainer GRPOTrainer with reward_weights=[0.7, 0.2, 0.1]
10. Save & Push trainer.save_model() + push_to_hub() when PUSH_TO_HUB=true
11. Evaluation Greedy decode, 10 episodes/task, JSON report in outputs/evals/

Quick Start

# Install
pip install trl>=0.12.0 transformers>=4.45.0 torch>=2.3.0 duckdb pandas pydantic

# Start local server (terminal 1)
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Train (terminal 2)
python train_grpo.py --mode train --n-repeats 50

# Evaluate trained model
python train_grpo.py --mode eval --output-dir ./sql-debug-qwen-grpo

# Train + eval in one command
python train_grpo.py --mode both

System Prompt Structure

RULES         β€” what the agent must/must not do
RESPONSE FORMAT β€” <think>...</think> then ```sql...```
STRATEGY      β€” task-specific hints (syntax / JOIN type / timezone)
GOAL          β€” produce output matching the ground truth exactly

The <think> block is critical β€” it teaches chain-of-thought diagnosis before emitting the fix.


Reward Weights

reward_weights = [0.7, 0.2, 0.1]
# 0.7 Γ— reward_correctness  (dense 0.0–1.0 from grader)
# 0.2 Γ— reward_format       (<think> block + ```sql``` present)
# 0.1 Γ— reward_no_repetition (penalty for trivial empty output)

Expected Outcomes After Training

Task Before (GPT-4o-mini baseline) After GRPO (estimated)
task1_syntax_fix ~0.85 ~0.95
task2_join_aggregation ~0.55 ~0.75
task3_etl_timezone ~0.25 ~0.50

Use curriculum (train on Task 1+2 first, then add Task 3) for better Hard task improvement.