Harsha901/Qwen3_4B_GRPO_GGUF
Model Summary
Harsha901/Qwen3_4B_GRPO_GGUF is a GGUF-format, quantized inference model derived from Qwen/Qwen3-4B-Base, fine-tuned with Unsloth using GRPO (Group Relative Policy Optimization) to improve mathematical reasoning performance.
The model is optimized for local inference using llama.cpp / llama-cpp-python and shows a significant accuracy improvement on GSM8K compared to the base model.
Training Details
Base Model:
- Qwen/Qwen3-4B-Base
Fine-tuning Framework:
- Unsloth
Training Method:
- GRPO (Group Relative Policy Optimization)
Training Dataset:
- unsloth/OpenMathReasoning-mini
The training dataset contains structured mathematical reasoning problems designed to improve step-by-step reasoning and final-answer accuracy.
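As background on the training method: GRPO samples a group of completions per prompt, scores each with a reward (for math reasoning, typically final-answer correctness), and normalizes rewards within the group to form advantages. A minimal illustrative sketch of that group-relative normalization (not the actual training code; the function name is ours):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each completion's reward
    against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # all completions scored equally; no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Completions that beat their group average get positive advantages and are reinforced; below-average completions are penalized.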
Intended Use
- Mathematical reasoning and word-problem solving
- GSM8K-style numerical question answering
- Local / edge inference using GGUF-compatible runtimes
Not intended for: safety-critical or high-stakes decision-making tasks.
Evaluation
Evaluation Dataset:
- GSM8K
Number of Samples:
- 1000
Metric:
- Exact match accuracy on extracted final numeric answers
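Exact-match accuracy depends on how final numeric answers are extracted from completions. A minimal sketch of one plausible extraction-and-scoring routine (the actual evaluation script may differ; all names here are illustrative):

```python
import re

def extract_final_number(text: str):
    """Return the last number in the text (commas stripped), or None."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def exact_match(prediction: str, gold: str) -> bool:
    """True when both strings yield the same final numeric answer."""
    p, g = extract_final_number(prediction), extract_final_number(gold)
    return p is not None and g is not None and p == g

def accuracy(pairs):
    """Fraction of (completion, reference) pairs that exact-match."""
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```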
Results
| Model | Correct | Total | Accuracy |
|---|---|---|---|
| Qwen/Qwen3-4B-Base | 701 | 1000 | 70.10% |
| Harsha901/Qwen3_4B_GRPO_GGUF | 850 | 1000 | 85.00% |
Absolute Improvement: +14.90 percentage points
Runtime Notes
- GGUF context length during evaluation: 4096
- Training context length: 32768 (full capacity not utilized during inference)
- Left padding is recommended for decoder-only models
- Deterministic decoding (`temperature=0.0`) used during evaluation
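To illustrate the left-padding note: with left padding, every sequence in a batch ends at the same position with real tokens, so decoder-only generation can append new tokens directly after each prompt. A toy sketch on raw token-id lists (the pad id of 0 is an assumption):

```python
def left_pad(ids, length, pad_id=0):
    """Pad a token-id list on the left so all prompts end at the same position."""
    return [pad_id] * (length - len(ids)) + ids

batch = [[11, 12], [21, 22, 23]]
longest = max(len(x) for x in batch)
padded = [left_pad(x, longest) for x in batch]
# every padded sequence now ends with real tokens, not padding
```

Right padding would instead leave pad tokens between the prompt and the generated continuation, which degrades decoder-only generation.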
Usage
llama.cpp
```bash
./main \
  -m Qwen3-4B-Base.Q8_0.gguf \
  -p "Solve: If a car travels 180 miles in 3 hours, what is its speed?" \
  -n 128 \
  --temp 0.0
```
llama-cpp-python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-4B-Base.Q8_0.gguf",
    n_ctx=4096,
    n_threads=8,
)

prompt = "Q: A train travels 120 miles in 2 hours. What is its speed?\nA:"
out = llm(prompt, max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])
```
Limitations
- Performance depends on prompt format, decoding settings, and answer extraction logic
- Quantization may slightly reduce reasoning depth compared to full-precision models
- Evaluation was performed on a single 1000-sample GSM8K run
Ethical Considerations
This model may generate incorrect or misleading answers and should be used with validation. No guarantees are provided regarding bias mitigation or robustness outside mathematical reasoning tasks.
Citation
If you use this model, please cite:
* Base model: Qwen/Qwen3-4B-Base
* Training dataset: unsloth/OpenMathReasoning-mini
* Fine-tuning framework: Unsloth
* Model: Harsha901/Qwen3_4B_GRPO_GGUF
Acknowledgements
- Qwen team for the base model
- Unsloth for efficient fine-tuning and GRPO support
- GSM8K dataset authors
- llama.cpp community for GGUF inference tooling