# qwen3-4b-sql-grpo_202602221146
This repository provides a merged model (base weights + LoRA adapter) fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using GRPO (Group Relative Policy Optimization) with Unsloth. The weights are fully merged; no separate adapter loading is required.
## Training Objective
This model is trained via GRPO to improve text-to-SQL generation accuracy. Reward signals are derived from:
- SQL Execution Reward — whether the generated SQL executes successfully
- SQL Correctness Reward — whether the execution result matches the expected answer
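The two reward signals above can be sketched as executable checks against a SQLite database. This is an illustrative reconstruction, not the actual reward code used in training (which is not published with this card); the function names and the `db_path` parameter are hypothetical.

```python
import sqlite3


def execution_reward(sql: str, db_path: str = ":memory:") -> float:
    """Execution reward: 1.0 if the generated SQL runs without error, else 0.0."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql).fetchall()
        return 1.0
    except sqlite3.Error:
        return 0.0
    finally:
        conn.close()


def correctness_reward(sql: str, expected_rows, db_path: str = ":memory:") -> float:
    """Correctness reward: 1.0 if the execution result matches the expected answer."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(sql).fetchall()
        return 1.0 if rows == expected_rows else 0.0
    except sqlite3.Error:
        return 0.0
    finally:
        conn.close()
```

In GRPO, rewards like these are computed for each of the completions sampled per prompt, and the group mean serves as the baseline for the advantage estimate.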
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Method | GRPO (LoRA, merged) |
| Loss type | dr_grpo |
| Max sequence length | 1536 |
| Epochs | 1 |
| Learning rate | 1e-06 |
| LoRA r | 64 |
| LoRA alpha | 128 |
| Num generations (per prompt) | 4 |
| Sampling temperature | 0.7 |
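The table above maps roughly onto TRL's `GRPOConfig`, which Unsloth's GRPO path builds on. The snippet below is a hypothetical reconstruction under that assumption; the actual training script is not included in this repository.

```python
from trl import GRPOConfig  # Unsloth's GRPO training path builds on TRL

# Hypothetical reconstruction of the run's arguments from the table above.
training_args = GRPOConfig(
    learning_rate=1e-6,
    num_train_epochs=1,
    num_generations=4,    # completions sampled per prompt for the group baseline
    temperature=0.7,      # sampling temperature for those completions
    loss_type="dr_grpo",  # Dr. GRPO loss variant
)
```

The LoRA settings (r=64, alpha=128) and the 1536-token sequence limit would be applied when loading the model, e.g. via Unsloth's `FastLanguageModel.from_pretrained(max_seq_length=...)` and `get_peft_model(r=..., lora_alpha=...)`.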
## Dataset
deyucao/dbbench-description-rewrites
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your_id/qwen3-4b-sql-grpo_202602221146"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate SQL for a natural-language question (illustrative prompt):
messages = [{"role": "user", "content": "Count the rows in the `orders` table."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Sources & Terms
- Dataset: deyucao/dbbench-description-rewrites
- Users must comply with the base model's original terms of use.