Harsha901/Qwen3_4B_GRPO_GGUF

Model Summary

Harsha901/Qwen3_4B_GRPO_GGUF is a quantized, GGUF-format inference model derived from Qwen/Qwen3-4B-Base and fine-tuned with Unsloth using GRPO (Group Relative Policy Optimization) to improve mathematical reasoning performance.

The model is optimized for local inference using llama.cpp / llama-cpp-python and shows a significant accuracy improvement on GSM8K compared to the base model.

Training Details

  • Base Model: Qwen/Qwen3-4B-Base
  • Fine-tuning Framework: Unsloth
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Training Dataset: unsloth/OpenMathReasoning-mini

The training dataset contains structured mathematical reasoning problems designed to improve step-by-step reasoning and final-answer accuracy.
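GRPO scores each sampled completion with a scalar reward and pushes the policy toward higher-reward samples within a group. A minimal correctness reward for math problems might look like the sketch below; the function names and regex are illustrative, not taken from the actual training code:

```python
import re

def extract_final_number(text: str):
    """Return the last number appearing in a completion, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def correctness_reward(completion: str, reference_answer: str) -> float:
    """Reward 1.0 when the completion's final number matches the reference, else 0.0."""
    predicted = extract_final_number(completion)
    expected = extract_final_number(reference_answer)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) < 1e-6 else 0.0
```

In practice such a reward is often combined with a small format reward (e.g. for emitting reasoning before the answer) so the policy is rewarded for step-by-step structure as well as the final result.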

Intended Use

  • Mathematical reasoning and word-problem solving
  • GSM8K-style numerical question answering
  • Local / edge inference using GGUF-compatible runtimes

Not intended for: safety-critical or high-stakes decision-making tasks.

Evaluation

  • Evaluation Dataset: GSM8K
  • Number of Samples: 1000
  • Metric: Exact-match accuracy on extracted final numeric answers
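GSM8K reference solutions mark the final answer after a "####" delimiter. An exact-match scorer along these lines (an illustrative sketch, not the actual evaluation harness) could be:

```python
import re

def gsm8k_reference_answer(reference: str) -> str:
    """GSM8K solutions end with '#### <answer>'; pull out that final answer."""
    return reference.split("####")[-1].strip().replace(",", "")

def model_final_number(generation: str):
    """Take the last number appearing in the model's generation, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match(generation: str, reference: str) -> bool:
    """True when the model's extracted final number equals the reference answer."""
    predicted = model_final_number(generation)
    if predicted is None:
        return False
    return float(predicted) == float(gsm8k_reference_answer(reference))
```

As the Limitations section notes, scores measured this way are sensitive to the extraction regex: a model that answers correctly but appends extra numbers afterward would be scored wrong.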

Results

Model                          Correct  Total  Accuracy
Qwen/Qwen3-4B-Base             701      1000   70.10%
Harsha901/Qwen3_4B_GRPO_GGUF   850      1000   85.00%

Absolute Improvement: +14.90 percentage points
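The headline numbers reduce to simple ratios over the 1000-sample run; as a sanity check:

```python
base_correct, tuned_correct, total = 701, 850, 1000

base_acc = 100 * base_correct / total    # 70.10%
tuned_acc = 100 * tuned_correct / total  # 85.00%
improvement = tuned_acc - base_acc       # +14.90 percentage points
```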

Runtime Notes

  • GGUF context length during evaluation: 4096
  • Training context length: 32768 (full capacity not utilized during inference)
  • Left padding is recommended for decoder-only models
  • Deterministic decoding (temperature=0.0) used during evaluation

Usage

llama.cpp

./main \
  -m Qwen3-4B-Base.Q8_0.gguf \
  -p "Solve: If a car travels 180 miles in 3 hours, what is its speed?" \
  -n 128 \
  --temp 0.0

llama-cpp-python

from llama_cpp import Llama

# Load the quantized GGUF model with the same 4096-token context used in evaluation
llm = Llama(
    model_path="Qwen3-4B-Base.Q8_0.gguf",
    n_ctx=4096,
    n_threads=8,
)

prompt = "Q: A train travels 120 miles in 2 hours. What is its speed?\nA:"
# temperature=0.0 gives deterministic (greedy) decoding, matching the evaluation setup
out = llm(prompt, max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])

Limitations

  • Performance depends on prompt format, decoding settings, and answer extraction logic
  • Quantization may slightly reduce reasoning depth compared to full-precision models
  • Evaluation was performed on a single 1000-sample GSM8K run

Ethical Considerations

This model may generate incorrect or misleading answers and should be used with validation. No guarantees are provided regarding bias mitigation or robustness outside mathematical reasoning tasks.

Citation

If you use this model, please cite:

  • Base model: Qwen/Qwen3-4B-Base
  • Training dataset: unsloth/OpenMathReasoning-mini
  • Fine-tuning framework: Unsloth
  • Model: Harsha901/Qwen3_4B_GRPO_GGUF

Acknowledgements

  • Qwen team for the base model
  • Unsloth for efficient fine-tuning and GRPO support
  • GSM8K dataset authors
  • llama.cpp community for GGUF inference tooling
