# RL Training for SQL Debug — GRPO with Qwen2.5-Coder-7B-Instruct > **Full training script:** [`train_grpo.py`](train_grpo.py) > **HF Space deployment:** [`deploy_hf_space.md`](deploy_hf_space.md) --- ## Why GRPO, Not DDPG | | DDPG | GRPO | |---|---|---| | Action space | Continuous R^n | Discrete tokens ✅ | | Value network | Required | Not needed ✅ | | Gradient signal | Bellman + actor-critic | Group relative ranking ✅ | | Works for SQL? | ❌ | ✅ | DDPG is for robot control / trading. SQL token generation is discrete — **always use GRPO or PPO**. --- ## What `train_grpo.py` Contains | Section | Description | |---|---| | 1. Environment | Local DuckDB env or HTTP client pointing at HF Space | | 2. Model & Tokenizer | `Qwen/Qwen2.5-Coder-7B-Instruct`, left-padding | | 3. System Prompt | Rules, Response Format (`` + ```sql```), Strategy, Goal | | 4. Helpers | `build_user_message()`, `extract_sql_from_response()`, `format_messages()` | | 5. Rollout | `rollout_func()` — plays multi-turn episode, returns padded tensors | | 6. Reward Fns | `reward_correctness`, `reward_format`, `reward_no_repetition` | | 7. Dataset | 3 tasks × N repeats → HF `Dataset` with `prompt` + `task_id` columns | | 8. GRPOConfig | A100-tuned: `num_generations=4`, `bf16=True`, `max_completion_length=1024` | | 9. Trainer | `GRPOTrainer` with `reward_weights=[0.7, 0.2, 0.1]` | | 10. Save & Push | `trainer.save_model()` + `push_to_hub()` when `PUSH_TO_HUB=true` | | 11. Evaluation | Greedy decode, 10 episodes/task, JSON report in `outputs/evals/` | --- ## Quick Start ```powershell # Install pip install trl>=0.12.0 transformers>=4.45.0 torch>=2.3.0 duckdb pandas pydantic # Start local server (terminal 1) uvicorn server.app:app --host 0.0.0.0 --port 7860 # Train (terminal 2) python train_grpo.py --mode train --n-repeats 50 # Evaluate trained model python train_grpo.py --mode eval --output-dir ./sql-debug-qwen-grpo # Train + eval in one command python train_grpo.py --mode both ``` --- ## System Prompt Structure ``` RULES — what the agent must/must not do RESPONSE FORMAT — ... then ```sql...``` STRATEGY — task-specific hints (syntax / JOIN type / timezone) GOAL — produce output matching the ground truth exactly ``` The `` block is critical — it teaches chain-of-thought diagnosis before emitting the fix. --- ## Reward Weights ```python reward_weights = [0.7, 0.2, 0.1] # 0.7 × reward_correctness (dense 0.0–1.0 from grader) # 0.2 × reward_format ( block + ```sql``` present) # 0.1 × reward_no_repetition (penalty for trivial empty output) ``` --- ## Expected Outcomes After Training | Task | Before (GPT-4o-mini baseline) | After GRPO (estimated) | |---|---|---| | task1_syntax_fix | ~0.85 | ~0.95 | | task2_join_aggregation | ~0.55 | ~0.75 | | task3_etl_timezone | ~0.25 | ~0.50 | Use curriculum (train on Task 1+2 first, then add Task 3) for better Hard task improvement.