RL Training for SQL Debug — GRPO with Qwen2.5-Coder-7B-Instruct

Full training script: `train_grpo.py`. HF Space deployment: `deploy_hf_space.md`.

Why GRPO, Not DDPG

| Property | DDPG | GRPO |
|---|---|---|
| Action space | Continuous R^n | Discrete tokens ✅ |
| Value network | Required | Not needed ✅ |
| Gradient signal | Bellman + actor-critic | Group-relative advantage ✅ |
| Works for SQL? | ❌ | ✅ |

DDPG targets continuous action spaces such as robot control or trading. SQL generation emits discrete tokens, so use a policy-gradient method designed for discrete actions, such as GRPO or PPO.
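To make the "group-relative" idea concrete, here is a minimal sketch of how several completions sampled for the same prompt can be scored against each other without a value network. The function name `group_relative_advantages` and the `eps` value are illustrative assumptions; the actual computation inside TRL's `GRPOTrainer` may differ in details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Hypothetical helper for illustration only, not taken from train_grpo.py.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt (the same group size as num_generations=4).
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group mean get a positive advantage and are reinforced; those below the mean are discouraged. A group with identical rewards yields zero advantages and therefore no learning signal.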

What `train_grpo.py` Contains

| Section | Description |
|---|---|
| 1. Environment | Local DuckDB env or HTTP client pointing at HF Space |
| 2. Model & Tokenizer | `Qwen/Qwen2.5-Coder-7B-Instruct`, left-padding |
| 3. System Prompt | Rules, Response Format (`<think>` + `sql`), Strategy, Goal |
| 4. Helpers | `build_user_message()`, `extract_sql_from_response()`, `format_messages()` |
| 5. Rollout | `rollout_func()`: plays a multi-turn episode, returns padded tensors |
| 6. Reward Fns | `reward_correctness`, `reward_format`, `reward_no_repetition` |
| 7. Dataset | 3 tasks × N repeats → HF `Dataset` with `prompt` + `task_id` columns |
| 8. GRPOConfig | A100-tuned: `num_generations=4`, `bf16=True`, `max_completion_length=1024` |
| 9. Trainer | `GRPOTrainer` with `reward_weights=[0.7, 0.2, 0.1]` |
| 10. Save & Push | `trainer.save_model()` + `push_to_hub()` when `PUSH_TO_HUB=true` |
| 11. Evaluation | Greedy decode, 10 episodes/task, JSON report in `outputs/evals/` |

Quick Start

# Install
# Quote the version specifiers so the shell does not treat ">" as output redirection
pip install "trl>=0.12.0" "transformers>=4.45.0" "torch>=2.3.0" duckdb pandas pydantic

# Start local server (terminal 1)
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Train (terminal 2)
python train_grpo.py --mode train --n-repeats 50

# Evaluate trained model
python train_grpo.py --mode eval --output-dir ./sql-debug-qwen-grpo

# Train + eval in one command
python train_grpo.py --mode both

System Prompt Structure

RULES         — what the agent must/must not do
RESPONSE FORMAT — <think>...</think> then ```sql...```
STRATEGY      — task-specific hints (syntax / JOIN type / timezone)
GOAL          — produce output matching the ground truth exactly

The <think> block is critical: it makes the model diagnose the bug step by step before emitting the fix.
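As an illustration of the response format, the sketch below parses a reply into its reasoning and SQL parts and checks that both are present. The names `parse_response` and `format_score` are hypothetical and are not the `extract_sql_from_response()` or `reward_format` implementations in `train_grpo.py`, whose exact behavior is not shown here.

```python
import re

# Built programmatically so this example never contains a literal code fence.
FENCE = "`" * 3

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
SQL_RE = re.compile(FENCE + r"sql\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def parse_response(text):
    """Return (reasoning, sql); either element is None when its block is missing."""
    think = THINK_RE.search(text)
    sql = SQL_RE.search(text)
    reasoning = think.group(1).strip() if think else None
    query = sql.group(1).strip() if sql else None
    return reasoning, query


def format_score(text):
    """Assumed scoring rule: 1.0 when both blocks are present and non-empty, else 0.0."""
    reasoning, query = parse_response(text)
    return 1.0 if reasoning and query else 0.0


sample = (
    "<think>Missing comma after id.</think>\n"
    + FENCE + "sql\nSELECT id, name FROM users;\n" + FENCE
)
reasoning, query = parse_response(sample)
```

A reply that skips the reasoning block or omits the fenced SQL scores 0.0 under this rule, which is one simple way to push the model toward the prescribed structure.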

Reward Weights

reward_weights = [0.7, 0.2, 0.1]
# 0.7 × reward_correctness  (dense 0.0–1.0 from grader)
# 0.2 × reward_format       (<think> block + ```sql``` present)
# 0.1 × reward_no_repetition (penalty for trivial empty output)
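The effect of these weights can be checked with a few lines of arithmetic. This sketch assumes every component reward lies in the range 0.0 to 1.0 (the source states this only for `reward_correctness`) and that the trainer combines them as a plain weighted sum; `combine_rewards` is a hypothetical helper, not part of TRL.

```python
REWARD_WEIGHTS = [0.7, 0.2, 0.1]


def combine_rewards(correctness, fmt, no_repetition, weights=REWARD_WEIGHTS):
    """Weighted sum of the three component rewards (illustrative only)."""
    scores = [correctness, fmt, no_repetition]
    return sum(w * s for w, s in zip(weights, scores))


# A fully correct, well-formatted, non-trivial answer.
perfect = combine_rewards(1.0, 1.0, 1.0)

# Well-formatted and non-trivial, but the SQL is wrong.
well_formatted_but_wrong = combine_rewards(0.0, 1.0, 1.0)
```

Under these assumptions a perfect answer scores about 1.0, while a nicely formatted but wrong answer tops out near 0.3, so correctness dominates the signal.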

Expected Outcomes After Training

| Task | Before (GPT-4o-mini baseline) | After GRPO (estimated) |
|---|---|---|
| task1_syntax_fix | ~0.85 | ~0.95 |
| task2_join_aggregation | ~0.55 | ~0.75 |
| task3_etl_timezone | ~0.25 | ~0.50 |

Use a curriculum (train on Tasks 1 and 2 first, then add Task 3) to get a larger improvement on the hard task.
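One way to set up such a curriculum is to build the training rows in two phases. The sketch below is an assumption about how this could be organized, not the logic in `train_grpo.py`: the helper names and placeholder prompt strings are hypothetical, and only the `prompt` and `task_id` column names and the task IDs come from this document.

```python
TASKS = ["task1_syntax_fix", "task2_join_aggregation", "task3_etl_timezone"]


def build_rows(task_ids, n_repeats):
    """Repeat each task n_repeats times as rows with prompt and task_id fields."""
    return [
        {"prompt": f"<prompt for {task_id}>", "task_id": task_id}  # placeholder prompt
        for _ in range(n_repeats)
        for task_id in task_ids
    ]


def curriculum_phases(n_repeats):
    """Phase 1 covers the two easier tasks; phase 2 adds the timezone task."""
    phase1 = build_rows(TASKS[:2], n_repeats)
    phase2 = build_rows(TASKS, n_repeats)
    return [phase1, phase2]


phases = curriculum_phases(n_repeats=2)
```

Each phase is a plain list of dictionaries, so it could be turned into a Hugging Face dataset with `datasets.Dataset.from_list(phase)` and trained on in order, resuming phase 2 from the phase 1 checkpoint.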