Harsha901/Qwen3_4B_GRPO_GGUF
Model Summary
Harsha901/Qwen3_4B_GRPO_GGUF is a GGUF-format, quantized inference model derived from Qwen/Qwen3-4B-Base, fine-tuned with Unsloth using GRPO (Group Relative Policy Optimization) to improve mathematical reasoning performance.
The model is optimized for local inference using llama.cpp / llama-cpp-python and shows a significant accuracy improvement on GSM8K compared to the base model.
Training Details
Base Model:
- Qwen/Qwen3-4B-Base
Fine-tuning Framework:
- Unsloth
Training Method:
- GRPO (Group Relative Policy Optimization)
Training Dataset:
- unsloth/OpenMathReasoning-mini
The training dataset contains structured mathematical reasoning problems designed to improve step-by-step reasoning and final-answer accuracy.
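As background on the training method: GRPO samples a group of completions per prompt, scores each with a reward (for math reasoning, typically final-answer correctness), and normalizes rewards within the group to form advantages. A minimal illustrative sketch of that group-relative normalization (not the actual training code; the function name is ours):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each completion's reward
    against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # all completions scored equally; no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Completions that beat their group average get positive advantages and are reinforced; below-average completions are penalized.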
Intended Use
- Mathematical reasoning and word-problem solving
- GSM8K-style numerical question answering
- Local / edge inference using GGUF-compatible runtimes
Not intended for: safety-critical or high-stakes decision-making tasks.
Evaluation
Evaluation Dataset:
- GSM8K
Number of Samples:
- 1000
Metric:
- Exact match accuracy on extracted final numeric answers
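Exact-match accuracy depends on how final numeric answers are extracted from completions. A minimal sketch of one plausible extraction-and-scoring routine (the actual evaluation script may differ; all names here are illustrative):

```python
import re

def extract_final_number(text: str):
    """Return the last number in the text (commas stripped), or None."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def exact_match(prediction: str, gold: str) -> bool:
    """True when both strings yield the same final numeric answer."""
    p, g = extract_final_number(prediction), extract_final_number(gold)
    return p is not None and g is not None and p == g

def accuracy(pairs):
    """Fraction of (completion, reference) pairs that exact-match."""
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```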
Results
| Model | Correct | Total | Accuracy |
|---|---|---|---|
| Qwen/Qwen3-4B-Base | 701 | 1000 | 70.10% |
| Harsha901/Qwen3_4B_GRPO_GGUF | 850 | 1000 | 85.00% |
Absolute Improvement: +14.90 percentage points
Runtime Notes
- GGUF context length during evaluation: 4096
- Training context length: 32768 (full capacity not utilized during inference)
- Left padding is recommended for decoder-only models
- Deterministic decoding (`temperature=0.0`) used during evaluation
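To illustrate the left-padding note: with left padding, every sequence in a batch ends at the same position with real tokens, so decoder-only generation can append new tokens directly after each prompt. A toy sketch on raw token-id lists (the pad id of 0 is an assumption):

```python
def left_pad(ids, length, pad_id=0):
    """Pad a token-id list on the left so all prompts end at the same position."""
    return [pad_id] * (length - len(ids)) + ids

batch = [[11, 12], [21, 22, 23]]
longest = max(len(x) for x in batch)
padded = [left_pad(x, longest) for x in batch]
# every padded sequence now ends with real tokens, not padding
```

Right padding would instead leave pad tokens between the prompt and the generated continuation, which degrades decoder-only generation.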
Usage
llama.cpp
```bash
./main \
  -m Qwen3-4B-Base.Q8_0.gguf \
  -p "Solve: If a car travels 180 miles in 3 hours, what is its speed?" \
  -n 128 \
  --temp 0.0
```
llama-cpp-python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-4B-Base.Q8_0.gguf",
    n_ctx=4096,
    n_threads=8,
)

prompt = "Q: A train travels 120 miles in 2 hours. What is its speed?\nA:"
out = llm(prompt, max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])
```
Limitations
- Performance depends on prompt format, decoding settings, and answer extraction logic
- Quantization may slightly reduce reasoning depth compared to full-precision models
- Evaluation was performed on a single 1000-sample GSM8K run
Ethical Considerations
This model may generate incorrect or misleading answers and should be used with validation. No guarantees are provided regarding bias mitigation or robustness outside mathematical reasoning tasks.
Citation
If you use this model, please cite:
* Base model: Qwen/Qwen3-4B-Base
* Training dataset: unsloth/OpenMathReasoning-mini
* Fine-tuning framework: Unsloth
* Model: Harsha901/Qwen3_4B_GRPO_GGUF
Acknowledgements
- Qwen team for the base model
- Unsloth for efficient fine-tuning and GRPO support
- GSM8K dataset authors
- llama.cpp community for GGUF inference tooling