qwen3-4b-sql-grpo_202602221146

This repository provides a merged model (base + LoRA) fine-tuned from Qwen/Qwen3-4B-Instruct-2507 with GRPO (Group Relative Policy Optimization), trained using Unsloth.

This is the fully merged model — no separate adapter loading is required.

Training Objective

This model is trained via GRPO to improve text-to-SQL generation accuracy. Reward signals are derived from:

  1. SQL Execution Reward — whether the generated SQL executes successfully
  2. SQL Correctness Reward — whether the execution result matches the expected answer
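The two reward signals above can be sketched against an in-memory SQLite database. The function names and the binary 0/1 reward scale below are illustrative assumptions, not the actual training code:

```python
# Hypothetical sketch of the two GRPO reward signals described above,
# using an in-memory SQLite database. The 0/1 reward scale and function
# names are assumptions for illustration, not the training implementation.
import sqlite3


def execution_reward(sql: str, conn: sqlite3.Connection) -> float:
    """1.0 if the generated SQL executes without error, else 0.0."""
    try:
        conn.execute(sql).fetchall()
        return 1.0
    except sqlite3.Error:
        return 0.0


def correctness_reward(sql: str, expected, conn: sqlite3.Connection) -> float:
    """1.0 if the query result matches the expected answer, else 0.0."""
    try:
        result = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return 0.0
    return 1.0 if result == expected else 0.0


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1), (2)")
    print(execution_reward("SELECT x FROM t", conn))            # valid SQL
    print(correctness_reward("SELECT SUM(x) FROM t", [(3,)], conn))
```

In practice the correctness check would also need result normalization (column order, row order, type coercion); the exact comparison used during training is not documented here.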

Training Configuration

| Parameter               | Value                        |
| ----------------------- | ---------------------------- |
| Base model              | Qwen/Qwen3-4B-Instruct-2507  |
| Method                  | GRPO (LoRA, merged)          |
| Loss type               | dr_grpo                      |
| Max sequence length     | 1536                         |
| Epochs                  | 1                            |
| Learning rate           | 1e-06                        |
| LoRA r                  | 64                           |
| LoRA alpha              | 128                          |
| Generations per prompt  | 4                            |
| Sampling temperature    | 0.7                          |
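For orientation, the hyperparameters above map roughly onto TRL's `GRPOConfig` as sketched below. This is an illustrative reconstruction, not the actual training script: the run used Unsloth, the reward functions are omitted, and how the 1536-token budget was split between prompt and completion is an assumption.

```python
# Illustrative mapping of the hyperparameter table onto TRL's GRPOConfig.
# This is a hedged sketch, NOT the actual Unsloth training configuration.
from trl import GRPOConfig

config = GRPOConfig(
    loss_type="dr_grpo",         # Dr. GRPO loss variant
    num_train_epochs=1,
    learning_rate=1e-6,
    num_generations=4,           # completions sampled per prompt
    temperature=0.7,             # sampling temperature
    max_completion_length=1536,  # assumption: full budget given to completion
)
```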

Dataset

  • deyucao/dbbench-description-rewrites

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your_id/qwen3-4b-sql-grpo_202602221146"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
)

# Example: generate SQL for a natural-language question
messages = [{"role": "user", "content": "List the names of all employees hired after 2020."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
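Instruction-tuned models often wrap generated queries in a Markdown code fence, so downstream execution usually needs a small extraction step. The helper below is a hypothetical post-processing sketch, not part of this repository:

```python
# Hypothetical helper for pulling the SQL statement out of a model
# completion; assumes the model may wrap queries in a ```sql fence.
import re


def extract_sql(completion: str) -> str:
    """Return the SQL inside a ```sql fence, or the stripped raw text."""
    match = re.search(r"```sql\s*(.*?)```", completion, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return completion.strip()
```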

Sources & Terms

  • Dataset: deyucao/dbbench-description-rewrites
  • Users must comply with the base model's original terms of use.
Model size: 4B parameters (Safetensors, BF16)
