# Qwen3-8B Code Reasoning (SFT + GRPO)
A fine-tuned version of Qwen/Qwen3-8B trained to reason step-by-step about code problems using a two-stage SFT + GRPO pipeline. The model reasons inside `<think>` tags before producing a final solution, similar to DeepSeek-R1-style chain-of-thought reasoning.
## Benchmark Results
| Benchmark | This Model | Base Qwen3-8B | GPT-4 |
|---|---|---|---|
| HumanEval pass@1 | 86.0% | ~65–70% | ~87% |
| Custom Code Reasoning | 30/35 (86%) | N/A | N/A |
| vs SFT baseline | +30% improvement | N/A | N/A |
## What This Model Does

Given a coding problem, the model:

- Reasons step-by-step inside `<think>` tags
- Produces a clean, commented solution after `</think>`
Example output:

Input: Fix this fibonacci that times out:

```python
def fib(n):
    if n <= 1: return n
    return fib(n - 1) + fib(n - 2)
```

Output:

```
<think>
This uses a recursive approach with O(2^n) time complexity.
Each call branches into two more calls, causing exponential growth.
Fix: use an iterative approach to get O(n) time, O(1) space.
Store previously computed values to avoid recomputation.
</think>

def fib(n):
    if n <= 1:
        return n
    # Use iterative approach: O(n) time, O(1) space
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model with 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("Madhu2133/qwen3-8b-code-grpo-v10")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Attach the LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "Madhu2133/qwen3-8b-code-grpo-v10")
model.eval()

def ask(question):
    prompt = (
        "<|im_start|>system\n"
        "You are an expert software engineer. "
        "Always reason step-by-step inside <think> tags first, "
        "then provide your final solution after </think>.<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

print(ask("Fix this binary search that fails on duplicates:\ndef bs(arr,x):\n l,r=0,len(arr)\n while l<r:\n m=(l+r)//2\n if arr[m]==x: return m\n elif arr[m]<x: l=m\n else: r=m\n return -1"))
```
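The returned completion contains both the `<think>` reasoning block and the final answer. A small helper (hypothetical, not part of the model's API) can split the two apart for display or downstream use:

```python
import re

def split_reasoning(text):
    """Split a completion into (reasoning, answer) parts.

    Assumes the <think>...</think> convention described above;
    falls back to treating the whole text as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = text[match.end():].strip()
    else:
        reasoning, answer = "", text.strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>\nUse iteration instead of recursion.\n</think>\ndef fib(n): ..."
)
print(answer)  # → def fib(n): ...
```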
## Training Details

### Training Pipeline

```
Qwen3-8B (Base)
      ↓
Stage 1: SFT (Supervised Fine-Tuning)
    Dataset: garage-bAInd/Open-Platypus (3,000 code samples)
    GPU: NVIDIA L4 (24GB) | Duration: ~25 minutes
    Loss: 1.58 → 0.52
      ↓
Stage 2: GRPO (Reinforcement Learning)
    Dataset: 35 coding prompts × 10 = 350 samples
    GPU: NVIDIA A100 (40GB) | Duration: ~5 hours
    Reward: 3.63 → 5.28 (+45%)
      ↓
Final Model
```
### Training Data

SFT: garage-bAInd/Open-Platypus, 3,000 code-related samples filtered from 24,926 total. Each example was preprocessed to inject `<think>` tags around the reasoning portion of the answer.

GRPO: 35 handcrafted Python coding prompts covering bug fixes, algorithm implementations, refactoring, and design patterns. Each prompt was repeated 10 times, giving 350 training samples.
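The `<think>`-tag injection for SFT can be sketched as below. The field names follow the Open-Platypus schema (`instruction`, `output`), but the split heuristic (prose before the first code fence becomes reasoning) is an assumption; the card does not specify the exact preprocessing:

```python
def inject_think_tags(instruction, output):
    """Wrap the explanatory prose of an answer in <think> tags,
    keeping any fenced code block as the final solution.

    NOTE: the split-on-code-fence heuristic is an assumption,
    not the documented preprocessing script.
    """
    if "```" in output:
        reasoning, _, rest = output.partition("```")
        solution = "```" + rest
    else:
        # No code block: treat the whole answer as reasoning
        reasoning, solution = output, ""
    return (
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n{reasoning.strip()}\n</think>\n{solution.strip()}<|im_end|>"
    )

sample = inject_think_tags("Reverse a list", "Use slicing.\n```python\nx[::-1]\n```")
print(sample)
```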
### Training Hyperparameters

SFT:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Learning rate | 2e-4 (cosine decay) |
| Batch size | 2 × 4 grad accum = 8 |
| Optimizer | AdamW 8-bit |
| Precision | BF16 |
GRPO:
| Parameter | Value |
|---|---|
| Steps | 300 |
| Learning rate | 2e-6 (cosine decay) |
| Beta (KL penalty) | 0.1 |
| Generations per prompt | 4 |
| Max completion length | 512 tokens |
| Optimizer | Paged AdamW 8-bit |
| Precision | BF16 |
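Mapped onto TRL's `GRPOTrainer`, the table above looks roughly as follows. Argument names follow TRL 0.15.x's `GRPOConfig`; treat this as an approximate sketch, not the exact training script:

```python
from trl import GRPOConfig

# Sketch of a GRPO configuration matching the hyperparameter table.
config = GRPOConfig(
    output_dir="qwen3-8b-code-grpo",
    max_steps=300,
    learning_rate=2e-6,
    lr_scheduler_type="cosine",
    beta=0.1,                   # KL penalty coefficient
    num_generations=4,          # completions sampled per prompt
    max_completion_length=512,
    optim="paged_adamw_8bit",
    bf16=True,
)
```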
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 43.6M / 8.2B (0.53%) |
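The adapter settings above map directly onto PEFT's `LoraConfig`; a sketch, with dropout and bias left at library defaults since the card does not state them:

```python
from peft import LoraConfig

# Adapter configuration matching the table above (sketch only)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```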
## Reward Functions (GRPO)

| Reward Function | Max Score | Description |
|---|---|---|
| reward_reasoning | 2.0 | Words in `<think>` block (tiered: 5/20/50+ words) |
| reward_code_quality | 2.5 | Presence of def, return, comments, docstring |
| reward_format | 1.0 | Correct `<think>…</think>` pair + non-empty answer |
| reward_no_stubs | 0 / -1.5 | Penalizes TODO/NotImplemented/pass |
| reward_length | 1.0 | Answer length 20–300 words |
| combined_reward | 7.3 | Sum of all above |
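Three of these rewards can be sketched in plain Python. The tier thresholds (5/20/50+ words) come from the table, but the per-tier scores (0.5/1.0/2.0) and the simple substring scan for stubs are assumptions:

```python
import re

def reward_reasoning(completion):
    """Tiered reward for words inside the <think> block.
    Per-tier scores are assumed, not taken from the card."""
    m = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    words = len(m.group(1).split()) if m else 0
    if words >= 50:
        return 2.0
    if words >= 20:
        return 1.0
    if words >= 5:
        return 0.5
    return 0.0

def reward_format(completion):
    """1.0 if a <think>...</think> pair exists and a non-empty answer follows."""
    m = re.search(r"<think>.*?</think>(.*)", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() else 0.0

def reward_no_stubs(completion):
    """-1.5 if the answer contains placeholder stubs, else 0.
    A naive substring scan; would also flag e.g. 'passed'."""
    answer = completion.split("</think>")[-1]
    stubs = ("TODO", "NotImplemented", "pass")
    return -1.5 if any(s in answer for s in stubs) else 0.0
```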
## GRPO Training Progression
| Step | Reward | KL Divergence | Completion Length |
|---|---|---|---|
| 10 | 3.63 | 0.575 | 328 tokens |
| 100 | 4.65 | 0.638 | 292 tokens |
| 200 | 5.07 | 0.567 | 333 tokens |
| 300 | 5.28 | 0.541 | 332 tokens |
## Evaluation

### Testing Data
- Custom benchmark: 7 Python coding problems (3 seen during training, 4 unseen)
- HumanEval: All 164 problems from openai/openai_humaneval
### Metrics

- pass@1: percentage of problems solved correctly on the first attempt
- Think words: average words inside `<think>` reasoning blocks
- Code Reasoning Score: 5-criteria scoring (think quality, code structure, return, comments, no stubs)
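With one sample per problem, pass@1 is simply (problems solved) / (total problems). For sampled evaluation, the standard unbiased pass@k estimator (from the Codex paper, not this model card) generalizes it:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    passes, given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the raw success rate:
print(pass_at_k(10, 5, 1))  # → 0.5
```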
### Results
Custom Code Reasoning (7 tests, 5 criteria each = 35 max):
| Model | Score | Think Words |
|---|---|---|
| SFT only | 23/35 | 119 avg |
| GRPO (this model) | 30/35 | 129 avg |
HumanEval (164 problems):
| Model | pass@1 |
|---|---|
| Base Qwen3-8B | ~65–70% |
| This model | 86.0% |
| GPT-4 | ~87% |
## Bias, Risks, and Limitations
- Fine-tuned primarily on Python code; may underperform on other languages
- Occasionally verbose reasoning for simple problems
- ~14% of HumanEval failures were due to code formatting issues (prose before the code block), not reasoning errors
- Not evaluated on real-world software engineering tasks (e.g. SWE-bench)
- May hallucinate imports or APIs that don't exist
### Recommendations
Test outputs before using in production. The model works best on bug-fixing and algorithm implementation tasks in Python.
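Since a share of HumanEval failures stemmed from prose preceding the code block, stripping the completion down to its fenced code before execution can help. A minimal sketch (the helper name is hypothetical):

```python
import re

def extract_code(answer):
    """Pull the first fenced Python block out of a completion,
    falling back to the raw text. A simple guard against the
    'prose before code block' failure mode noted above."""
    m = re.search(r"```(?:python)?\n(.*?)```", answer, flags=re.DOTALL)
    return m.group(1) if m else answer

out = "Here is the fix:\n```python\ndef f(x):\n    return x + 1\n```"
print(extract_code(out))  # prints only the function definition
```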
## Environmental Impact
- Hardware: NVIDIA L4 (SFT) + NVIDIA A100 (GRPO)
- Cloud Provider: Modal
- Hours used: ~5.5 hours total GPU time
- Estimated CO2: ~1.2 kg CO2eq (A100 @ 400W × 5 hrs)
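The estimate works out as follows; the grid carbon intensity (~0.6 kg CO2eq/kWh) is an assumption chosen to match the stated figure, and real values vary by region:

```python
power_kw = 0.400    # A100 draw, from the card
hours = 5.0         # GRPO stage duration
intensity = 0.6     # kg CO2eq per kWh (assumed grid average)

energy_kwh = power_kw * hours      # 2.0 kWh
co2_kg = energy_kwh * intensity    # 1.2 kg CO2eq
print(round(co2_kg, 2))  # → 1.2
```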
## Technical Specifications

### Model Architecture
- Type: Causal Language Model with LoRA adapters
- Base: Qwen3-8B (8.2B parameters)
- Adapter: LoRA r=16, alpha=32 (43.6M trainable params)
- Quantization: 4-bit NF4 (BitsAndBytes)
### Software
| Package | Version |
|---|---|
| transformers | 4.51.3 |
| peft | 0.18.1 |
| trl | 0.15.2 |
| unsloth | 2026.4.1 |
| torch | 2.5.1+cu124 |
## Citation

```bibtex
@misc{madhu2026qwen3grpo,
  title={Qwen3-8B Code Reasoning with SFT and GRPO},
  author={Madhukumar},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Madhu2133/qwen3-8b-code-grpo-v10}
}
```
## Framework Versions
- PEFT 0.18.1
- Transformers 4.51.3
- TRL 0.15.2
- Unsloth 2026.4.1