# qwen3-4b-sql-grpo_202602221146
This repository provides a merged model (base weights + LoRA adapter) fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using GRPO (Group Relative Policy Optimization) with Unsloth. The weights are fully merged; no separate adapter loading is required.
## Training Objective
This model is trained via GRPO to improve text-to-SQL generation accuracy. Reward signals are derived from:
- SQL Execution Reward — whether the generated SQL executes successfully
- SQL Correctness Reward — whether the execution result matches the expected answer
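The two reward signals above can be sketched as executable checks against a SQLite database. This is an illustrative reconstruction, not the actual reward code used in training (which is not published with this card); the function names and the `db_path` parameter are hypothetical.

```python
import sqlite3


def execution_reward(sql: str, db_path: str = ":memory:") -> float:
    """Execution reward: 1.0 if the generated SQL runs without error, else 0.0."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql).fetchall()
        return 1.0
    except sqlite3.Error:
        return 0.0
    finally:
        conn.close()


def correctness_reward(sql: str, expected_rows, db_path: str = ":memory:") -> float:
    """Correctness reward: 1.0 if the execution result matches the expected answer."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(sql).fetchall()
        return 1.0 if rows == expected_rows else 0.0
    except sqlite3.Error:
        return 0.0
    finally:
        conn.close()
```

In GRPO, rewards like these are computed for each of the completions sampled per prompt, and the group mean serves as the baseline for the advantage estimate.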
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Method | GRPO (LoRA, merged) |
| Loss type | dr_grpo |
| Max sequence length | 1536 |
| Epochs | 1 |
| Learning rate | 1e-06 |
| LoRA r | 64 |
| LoRA alpha | 128 |
| Num generations (per prompt) | 4 |
| Sampling temperature | 0.7 |
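The table above maps roughly onto TRL's `GRPOConfig`, which Unsloth's GRPO path builds on. The snippet below is a hypothetical reconstruction under that assumption; the actual training script is not included in this repository.

```python
from trl import GRPOConfig  # Unsloth's GRPO training path builds on TRL

# Hypothetical reconstruction of the run's arguments from the table above.
training_args = GRPOConfig(
    learning_rate=1e-6,
    num_train_epochs=1,
    num_generations=4,    # completions sampled per prompt for the group baseline
    temperature=0.7,      # sampling temperature for those completions
    loss_type="dr_grpo",  # Dr. GRPO loss variant
)
```

The LoRA settings (r=64, alpha=128) and the 1536-token sequence limit would be applied when loading the model, e.g. via Unsloth's `FastLanguageModel.from_pretrained(max_seq_length=...)` and `get_peft_model(r=..., lora_alpha=...)`.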
## Dataset
deyucao/dbbench-description-rewrites
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your_id/qwen3-4b-sql-grpo_202602221146"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate SQL for a natural-language question (illustrative prompt):
messages = [{"role": "user", "content": "Count the rows in the `orders` table."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Sources & Terms
- Dataset: deyucao/dbbench-description-rewrites
- Users must comply with the base model's original terms of use.