# Qwen3-8B Code Reasoning (SFT + GRPO)

A fine-tuned version of Qwen/Qwen3-8B trained to reason step-by-step about code problems using a two-stage SFT + GRPO pipeline. The model reasons inside `<think>` tags before producing a final solution, similar to DeepSeek-R1-style chain-of-thought reasoning.

๐Ÿ† Benchmark Results

Benchmark This Model Base Qwen3-8B GPT-4
HumanEval pass@1 86.0% ๐Ÿ”ฅ ~65-70% ~87%
Custom Code Reasoning 30/35 (86%) โ€” โ€”
vs SFT baseline +30% improvement โ€” โ€”

## 💡 What This Model Does

Given a coding problem, the model:

1. Reasons step-by-step inside `<think>` tags
2. Produces a clean, commented solution after `</think>`

Example:

**Input:** Fix this fibonacci that times out:

```python
def fib(n):
    if n<=1: return n
    return fib(n-1)+fib(n-2)
```

**Output:**

```
<think>
This uses a recursive approach with O(2^n) time complexity.
Each call branches into two more calls, causing exponential growth.
Fix: use an iterative approach to get O(n) time, O(1) space.
Store previously computed values to avoid recomputation.
</think>

def fib(n):
    if n <= 1:
        return n
    # Use iterative approach -- O(n) time, O(1) space
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
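The two-part output format above is easy to post-process. A minimal sketch of a helper that splits a completion into its reasoning and final answer (`split_reasoning` is illustrative, not part of the model card):

```python
import re

def split_reasoning(text):
    """Split a model completion into its <think> reasoning and final answer."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if match is None:
        return "", text.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after </think>
    return reasoning, answer
```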

## 🚀 How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("Madhu2133/qwen3-8b-code-grpo-v10")

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

model = PeftModel.from_pretrained(base_model, "Madhu2133/qwen3-8b-code-grpo-v10")
model.eval()

def ask(question):
    prompt = (
        "<|im_start|>system\n"
        "You are an expert software engineer. "
        "Always reason step-by-step inside <think> tags first, "
        "then provide your final solution after </think>.<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

print(ask(
    "Fix this binary search that fails on duplicates:\n"
    "def bs(arr,x):\n"
    "    l,r=0,len(arr)\n"
    "    while l<r:\n"
    "        m=(l+r)//2\n"
    "        if arr[m]==x: return m\n"
    "        elif arr[m]<x: l=m\n"
    "        else: r=m\n"
    "    return -1"
))
```

๐Ÿ—๏ธ Training Details

Training Pipeline

Qwen3-8B (Base)
      โ†“
Stage 1: SFT โ€” Supervised Fine-Tuning
  Dataset:  garage-bAInd/Open-Platypus (3,000 code samples)
  GPU:      NVIDIA L4 (24GB) | Duration: ~25 minutes
  Loss:     1.58 โ†’ 0.52
      โ†“
Stage 2: GRPO โ€” Reinforcement Learning
  Dataset:  35 coding prompts ร— 10 = 350 samples
  GPU:      NVIDIA A100 (40GB) | Duration: ~5 hours
  Reward:   3.63 โ†’ 5.28 (+45%)
      โ†“
Final Model

### Training Data

**SFT:** garage-bAInd/Open-Platypus -- 3,000 code-related samples filtered from 24,926 total. Each example was preprocessed to inject `<think>` tags around the reasoning portion of the answer.
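The preprocessing step might look roughly like the sketch below. The field name `output` and the "last paragraph is the solution" heuristic are assumptions for illustration; the card does not spell out its exact rule.

```python
def inject_think_tags(example):
    """Wrap the reasoning portion of an answer in <think> tags.

    Assumes the answer's final paragraph is the solution and everything
    before it is reasoning -- a heuristic, not the card's exact rule.
    """
    parts = example["output"].rsplit("\n\n", 1)
    if len(parts) == 2:
        reasoning, solution = parts
        example["output"] = f"<think>\n{reasoning}\n</think>\n\n{solution}"
    return example
```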

**GRPO:** 35 handcrafted Python coding prompts covering bug fixes, algorithm implementations, refactoring, and design patterns. Each prompt was repeated 10 times, giving 350 training samples.
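The dataset construction is just prompt repetition. A hypothetical sketch, with placeholder strings standing in for the handcrafted prompts (which are not published in the card):

```python
# Expand 35 prompts into 350 GRPO samples by repeating each one 10 times.
base_prompts = [f"coding prompt {i}" for i in range(35)]  # placeholders
train_samples = [{"prompt": p} for p in base_prompts for _ in range(10)]
print(len(train_samples))  # 350
```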

### Training Hyperparameters

**SFT:**

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Learning rate | 2e-4 (cosine decay) |
| Batch size | 2 × 4 grad accum = 8 |
| Optimizer | AdamW 8-bit |
| Precision | BF16 |

**GRPO:**

| Parameter | Value |
|---|---|
| Steps | 300 |
| Learning rate | 2e-6 (cosine decay) |
| Beta (KL penalty) | 0.1 |
| Generations per prompt | 4 |
| Max completion length | 512 tokens |
| Optimizer | Paged AdamW 8-bit |
| Precision | BF16 |
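In TRL (the version listed under Software), these GRPO hyperparameters would map onto `GRPOConfig` roughly as sketched below. The values come from the table above; the argument names follow TRL's API but the exact call is an assumption, not the card's training script.

```python
from trl import GRPOConfig

# Hedged sketch of the GRPO stage configuration; values from the table above.
grpo_config = GRPOConfig(
    max_steps=300,
    learning_rate=2e-6,
    lr_scheduler_type="cosine",
    beta=0.1,                  # KL penalty coefficient
    num_generations=4,         # completions sampled per prompt
    max_completion_length=512,
    optim="paged_adamw_8bit",
    bf16=True,
)
```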

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 43.6M / 8.2B (0.53%) |
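With peft (listed under Software), the adapter setup would look roughly like this. Rank, alpha, and target modules come from the table; `task_type` and the omission of dropout are assumptions not stated in the card.

```python
from peft import LoraConfig

# Hedged sketch of the adapter configuration, matching the table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",  # assumption: causal-LM fine-tuning
)
```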

## 📊 Reward Functions (GRPO)

| Reward Function | Max Score | Description |
|---|---|---|
| `reward_reasoning` | 2.0 | Words in `<think>` block (tiered: 5/20/50+ words) |
| `reward_code_quality` | 2.5 | Presence of `def`, `return`, comments, docstring |
| `reward_format` | 1.0 | Correct `<think></think>` + non-empty answer |
| `reward_no_stubs` | 0 / -1.5 | Penalizes TODO/NotImplemented/`pass` |
| `reward_length` | 1.0 | Answer length 20–300 words |
| `combined_reward` | 7.3 | Sum of all above |
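Two of these scorers can be sketched in pure Python. The word thresholds and the -1.5 stub penalty follow the table; the tier scores below 50 words and the exact regexes are assumptions, since the card does not publish the implementations:

```python
import re

def reward_reasoning(completion):
    """Tiered score for reasoning effort: 5/20/50+ words inside <think>."""
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    words = len(m.group(1).split()) if m else 0
    if words >= 50:
        return 2.0
    if words >= 20:
        return 1.0   # assumed mid-tier score
    if words >= 5:
        return 0.5   # assumed low-tier score
    return 0.0

def reward_no_stubs(completion):
    """Penalize placeholder answers instead of real code."""
    stub = re.search(r"TODO|NotImplemented|^\s*pass\s*$", completion, re.MULTILINE)
    return -1.5 if stub else 0.0
```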

## 📈 GRPO Training Progression

| Step | Reward | KL Divergence | Completion Length |
|---|---|---|---|
| 10 | 3.63 | 0.575 | 328 tokens |
| 100 | 4.65 | 0.638 | 292 tokens |
| 200 | 5.07 | 0.567 | 333 tokens |
| 300 | 5.28 | 0.541 | 332 tokens |

## 🧪 Evaluation

### Testing Data

- Custom benchmark: 7 Python coding problems (3 seen during training, 4 unseen)
- HumanEval: all 164 problems from openai/openai_humaneval

### Metrics

- pass@1 -- percentage of problems solved correctly on the first attempt
- Think words -- average word count inside `<think>` reasoning blocks
- Code Reasoning Score -- 5-criteria scoring (think quality, code structure, return statement, comments, no stubs)
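With a single generated sample per problem, pass@1 reduces to plain accuracy, as in this small sketch (the 141/164 split is back-derived from the reported 86.0%, not stated in the card):

```python
def pass_at_1(solved):
    """pass@1 with one completion per problem: the fraction solved."""
    return sum(solved) / len(solved)

# 141 of 164 HumanEval problems solved rounds to the reported 86.0%.
score = pass_at_1([True] * 141 + [False] * 23)
```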

### Results

**Custom Code Reasoning** (7 tests, 5 criteria each = 35 max):

| Model | Score | Think Words |
|---|---|---|
| SFT only | 23/35 | 119 avg |
| GRPO (this model) | 30/35 | 129 avg |

**HumanEval** (164 problems):

| Model | pass@1 |
|---|---|
| Base Qwen3-8B | ~65-70% |
| This model | 86.0% |
| GPT-4 | ~87% |

โš ๏ธ Bias, Risks, and Limitations

  • Fine-tuned primarily on Python code โ€” may underperform on other languages
  • Occasional verbose reasoning for simple problems
  • ~14% of HumanEval failures were due to code formatting issues (prose before code block), not reasoning errors
  • Not evaluated on real-world software engineering tasks (SWE-bench)
  • May hallucinate imports or APIs that don't exist

Recommendations

Test outputs before using in production. The model works best on bug-fixing and algorithm implementation tasks in Python.

๐ŸŒ Environmental Impact

  • Hardware: NVIDIA L4 (SFT) + NVIDIA A100 (GRPO)
  • Cloud Provider: Modal
  • Hours used: ~5.5 hours total GPU time
  • Estimated CO2: ~1.2 kg CO2eq (A100 @ 400W ร— 5hrs)
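The estimate follows directly from the stated figures; the ~0.6 kg CO2eq/kWh grid intensity is an assumption implied by the card's numbers, not stated in it:

```python
power_kw = 0.4            # A100 draw, 400 W
hours = 5.0               # GRPO stage
grid_kg_per_kwh = 0.6     # assumed grid carbon intensity
co2_kg = power_kw * hours * grid_kg_per_kwh  # 2.0 kWh * 0.6 = 1.2 kg CO2eq
```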

## Technical Specifications

### Model Architecture

- Type: Causal language model with LoRA adapters
- Base: Qwen3-8B (8.2B parameters)
- Adapter: LoRA r=16, alpha=32 (43.6M trainable params)
- Quantization: 4-bit NF4 (BitsAndBytes)

### Software

| Package | Version |
|---|---|
| transformers | 4.51.3 |
| peft | 0.18.1 |
| trl | 0.15.2 |
| unsloth | 2026.4.1 |
| torch | 2.5.1+cu124 |

## Citation

```bibtex
@misc{madhu2026qwen3grpo,
  title={Qwen3-8B Code Reasoning with SFT and GRPO},
  author={Madhukumar},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Madhu2133/qwen3-8b-code-grpo-v10}
}
```
