# Qwen3-8B Code Reasoning (SFT + GRPO)
A fine-tuned version of Qwen/Qwen3-8B trained to reason step-by-step about code problems using a two-stage SFT + GRPO pipeline. The model reasons inside `<think>` tags before producing a final solution, similar to DeepSeek-R1-style chain-of-thought reasoning.
## Benchmark Results
| Benchmark | This Model | Base Qwen3-8B | GPT-4 |
|---|---|---|---|
| HumanEval pass@1 | 86.0% | ~65–70% | ~87% |
| Custom Code Reasoning | 30/35 (86%) | N/A | N/A |
| vs SFT baseline | +30% improvement | N/A | N/A |
## What This Model Does

Given a coding problem, the model:

- Reasons step-by-step inside `<think>` tags
- Produces a clean, commented solution after `</think>`
Example output:

Input: Fix this fibonacci that times out:

```python
def fib(n):
    if n <= 1: return n
    return fib(n - 1) + fib(n - 2)
```

Output:

```
<think>
This uses a recursive approach with O(2^n) time complexity.
Each call branches into two more calls, causing exponential growth.
Fix: use an iterative approach to get O(n) time, O(1) space.
Store previously computed values to avoid recomputation.
</think>

def fib(n):
    if n <= 1:
        return n
    # Use iterative approach: O(n) time, O(1) space
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model with 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("Madhu2133/qwen3-8b-code-grpo-v10")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Attach the LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "Madhu2133/qwen3-8b-code-grpo-v10")
model.eval()

def ask(question):
    prompt = (
        "<|im_start|>system\n"
        "You are an expert software engineer. "
        "Always reason step-by-step inside <think> tags first, "
        "then provide your final solution after </think>.<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

print(ask("Fix this binary search that fails on duplicates:\ndef bs(arr,x):\n l,r=0,len(arr)\n while l<r:\n m=(l+r)//2\n if arr[m]==x: return m\n elif arr[m]<x: l=m\n else: r=m\n return -1"))
```
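The returned completion contains both the `<think>` reasoning block and the final answer. A small helper (hypothetical, not part of the model's API) can split the two apart for display or downstream use:

```python
import re

def split_reasoning(text):
    """Split a completion into (reasoning, answer) parts.

    Assumes the <think>...</think> convention described above;
    falls back to treating the whole text as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = text[match.end():].strip()
    else:
        reasoning, answer = "", text.strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>\nUse iteration instead of recursion.\n</think>\ndef fib(n): ..."
)
print(answer)  # → def fib(n): ...
```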
## Training Details

### Training Pipeline

```
Qwen3-8B (Base)
      ↓
Stage 1: SFT (Supervised Fine-Tuning)
    Dataset: garage-bAInd/Open-Platypus (3,000 code samples)
    GPU: NVIDIA L4 (24GB) | Duration: ~25 minutes
    Loss: 1.58 → 0.52
      ↓
Stage 2: GRPO (Reinforcement Learning)
    Dataset: 35 coding prompts × 10 = 350 samples
    GPU: NVIDIA A100 (40GB) | Duration: ~5 hours
    Reward: 3.63 → 5.28 (+45%)
      ↓
Final Model
```
### Training Data

SFT: garage-bAInd/Open-Platypus, 3,000 code-related samples filtered from 24,926 total. Each example was preprocessed to inject `<think>` tags around the reasoning portion of the answer.

GRPO: 35 handcrafted Python coding prompts covering bug fixes, algorithm implementations, refactoring, and design patterns. Each prompt was repeated 10 times, giving 350 training samples.
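The `<think>`-tag injection for SFT can be sketched as below. The field names follow the Open-Platypus schema (`instruction`, `output`), but the split heuristic (prose before the first code fence becomes reasoning) is an assumption; the card does not specify the exact preprocessing:

```python
def inject_think_tags(instruction, output):
    """Wrap the explanatory prose of an answer in <think> tags,
    keeping any fenced code block as the final solution.

    NOTE: the split-on-code-fence heuristic is an assumption,
    not the documented preprocessing script.
    """
    if "```" in output:
        reasoning, _, rest = output.partition("```")
        solution = "```" + rest
    else:
        # No code block: treat the whole answer as reasoning
        reasoning, solution = output, ""
    return (
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n{reasoning.strip()}\n</think>\n{solution.strip()}<|im_end|>"
    )

sample = inject_think_tags("Reverse a list", "Use slicing.\n```python\nx[::-1]\n```")
print(sample)
```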
### Training Hyperparameters

SFT:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Learning rate | 2e-4 (cosine decay) |
| Batch size | 2 × 4 grad accum = 8 |
| Optimizer | AdamW 8-bit |
| Precision | BF16 |
GRPO:
| Parameter | Value |
|---|---|
| Steps | 300 |
| Learning rate | 2e-6 (cosine decay) |
| Beta (KL penalty) | 0.1 |
| Generations per prompt | 4 |
| Max completion length | 512 tokens |
| Optimizer | Paged AdamW 8-bit |
| Precision | BF16 |
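Mapped onto TRL's `GRPOTrainer`, the table above looks roughly as follows. Argument names follow TRL 0.15.x's `GRPOConfig`; treat this as an approximate sketch, not the exact training script:

```python
from trl import GRPOConfig

# Sketch of a GRPO configuration matching the hyperparameter table.
config = GRPOConfig(
    output_dir="qwen3-8b-code-grpo",
    max_steps=300,
    learning_rate=2e-6,
    lr_scheduler_type="cosine",
    beta=0.1,                   # KL penalty coefficient
    num_generations=4,          # completions sampled per prompt
    max_completion_length=512,
    optim="paged_adamw_8bit",
    bf16=True,
)
```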
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 43.6M / 8.2B (0.53%) |
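The adapter settings above map directly onto PEFT's `LoraConfig`; a sketch, with dropout and bias left at library defaults since the card does not state them:

```python
from peft import LoraConfig

# Adapter configuration matching the table above (sketch only)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```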
## Reward Functions (GRPO)

| Reward Function | Max Score | Description |
|---|---|---|
| reward_reasoning | 2.0 | Words in `<think>` block (tiered: 5/20/50+ words) |
| reward_code_quality | 2.5 | Presence of def, return, comments, docstring |
| reward_format | 1.0 | Correct `<think>…</think>` pair + non-empty answer |
| reward_no_stubs | 0 / -1.5 | Penalizes TODO/NotImplemented/pass |
| reward_length | 1.0 | Answer length 20–300 words |
| combined_reward | 7.3 | Sum of all above |
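Three of these rewards can be sketched in plain Python. The tier thresholds (5/20/50+ words) come from the table, but the per-tier scores (0.5/1.0/2.0) and the simple substring scan for stubs are assumptions:

```python
import re

def reward_reasoning(completion):
    """Tiered reward for words inside the <think> block.
    Per-tier scores are assumed, not taken from the card."""
    m = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    words = len(m.group(1).split()) if m else 0
    if words >= 50:
        return 2.0
    if words >= 20:
        return 1.0
    if words >= 5:
        return 0.5
    return 0.0

def reward_format(completion):
    """1.0 if a <think>...</think> pair exists and a non-empty answer follows."""
    m = re.search(r"<think>.*?</think>(.*)", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() else 0.0

def reward_no_stubs(completion):
    """-1.5 if the answer contains placeholder stubs, else 0.
    A naive substring scan; would also flag e.g. 'passed'."""
    answer = completion.split("</think>")[-1]
    stubs = ("TODO", "NotImplemented", "pass")
    return -1.5 if any(s in answer for s in stubs) else 0.0
```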
## GRPO Training Progression
| Step | Reward | KL Divergence | Completion Length |
|---|---|---|---|
| 10 | 3.63 | 0.575 | 328 tokens |
| 100 | 4.65 | 0.638 | 292 tokens |
| 200 | 5.07 | 0.567 | 333 tokens |
| 300 | 5.28 | 0.541 | 332 tokens |
## Evaluation

### Testing Data
- Custom benchmark: 7 Python coding problems (3 seen during training, 4 unseen)
- HumanEval: All 164 problems from openai/openai_humaneval
### Metrics

- pass@1: percentage of problems solved correctly on the first attempt
- Think words: average words inside `<think>` reasoning blocks
- Code Reasoning Score: 5-criteria scoring (think quality, code structure, return, comments, no stubs)
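With one sample per problem, pass@1 is simply (problems solved) / (total problems). For sampled evaluation, the standard unbiased pass@k estimator (from the Codex paper, not this model card) generalizes it:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    passes, given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the raw success rate:
print(pass_at_k(10, 5, 1))  # → 0.5
```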
### Results
Custom Code Reasoning (7 tests, 5 criteria each = 35 max):
| Model | Score | Think Words |
|---|---|---|
| SFT only | 23/35 | 119 avg |
| GRPO (this model) | 30/35 | 129 avg |
HumanEval (164 problems):
| Model | pass@1 |
|---|---|
| Base Qwen3-8B | ~65–70% |
| This model | 86.0% |
| GPT-4 | ~87% |
## Bias, Risks, and Limitations
- Fine-tuned primarily on Python code; may underperform on other languages
- Occasionally verbose reasoning for simple problems
- ~14% of HumanEval failures were due to code formatting issues (prose before the code block), not reasoning errors
- Not evaluated on real-world software engineering tasks (e.g. SWE-bench)
- May hallucinate imports or APIs that don't exist
### Recommendations
Test outputs before using in production. The model works best on bug-fixing and algorithm implementation tasks in Python.
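Since a share of HumanEval failures stemmed from prose preceding the code block, stripping the completion down to its fenced code before execution can help. A minimal sketch (the helper name is hypothetical):

```python
import re

def extract_code(answer):
    """Pull the first fenced Python block out of a completion,
    falling back to the raw text. A simple guard against the
    'prose before code block' failure mode noted above."""
    m = re.search(r"```(?:python)?\n(.*?)```", answer, flags=re.DOTALL)
    return m.group(1) if m else answer

out = "Here is the fix:\n```python\ndef f(x):\n    return x + 1\n```"
print(extract_code(out))  # prints only the function definition
```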
## Environmental Impact
- Hardware: NVIDIA L4 (SFT) + NVIDIA A100 (GRPO)
- Cloud Provider: Modal
- Hours used: ~5.5 hours total GPU time
- Estimated CO2: ~1.2 kg CO2eq (A100 @ 400W × 5 hrs)
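The estimate works out as follows; the grid carbon intensity (~0.6 kg CO2eq/kWh) is an assumption chosen to match the stated figure, and real values vary by region:

```python
power_kw = 0.400    # A100 draw, from the card
hours = 5.0         # GRPO stage duration
intensity = 0.6     # kg CO2eq per kWh (assumed grid average)

energy_kwh = power_kw * hours      # 2.0 kWh
co2_kg = energy_kwh * intensity    # 1.2 kg CO2eq
print(round(co2_kg, 2))  # → 1.2
```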
## Technical Specifications

### Model Architecture
- Type: Causal Language Model with LoRA adapters
- Base: Qwen3-8B (8.2B parameters)
- Adapter: LoRA r=16, alpha=32 (43.6M trainable params)
- Quantization: 4-bit NF4 (BitsAndBytes)
### Software
| Package | Version |
|---|---|
| transformers | 4.51.3 |
| peft | 0.18.1 |
| trl | 0.15.2 |
| unsloth | 2026.4.1 |
| torch | 2.5.1+cu124 |
## Citation

```bibtex
@misc{madhu2026qwen3grpo,
  title={Qwen3-8B Code Reasoning with SFT and GRPO},
  author={Madhukumar},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Madhu2133/qwen3-8b-code-grpo-v10}
}
```
## Framework Versions
- PEFT 0.18.1
- Transformers 4.51.3
- TRL 0.15.2
- Unsloth 2026.4.1