gemma3-4b-qwen3.5-reasoning-gpt-oss-120b

- **Developed by:** Glyph Software LLP
- **License:** Apache 2.0
- **Model type:** Fine-tuned / quantized (4-bit, BitsAndBytes)
- **Architecture:** Gemma 3 (4B parameters)
- **Parent model:** glyphsoftware/gemma-3-4b-qwen3.5-distilled
- **Training framework:** Unsloth + Hugging Face TRL
Model Summary
gemma3-4b-qwen3.5-reasoning-gpt-oss-120b is a compact, quantized reasoning model built on the Gemma 3 4B backbone. It is a fine-tuned derivative of glyphsoftware/gemma-3-4b-qwen3.5-distilled — itself a distilled model that inherits reasoning capabilities from Qwen 3.5 and targets the performance profile of the GPT-OSS 120B teacher model.
The name encodes the full lineage:
- `gemma3-4b` → student architecture (Google Gemma 3, 4B params)
- `qwen3.5` → intermediate distillation source (Alibaba Qwen 3.5)
- `reasoning` → capability emphasis (chain-of-thought / structured thinking)
- `gpt-oss-120b` → ultimate teacher model (OpenAI GPT-OSS 120B)
Trained 2× faster using Unsloth optimizations, this model delivers frontier-class reasoning behavior in a 4B footprint that is runnable on consumer hardware.
Intended Use
Primary Use Cases
- Agentic reasoning tasks — structured step-by-step problem decomposition inside `<think>` tags before producing a final answer.
- STEM and coding assistance — multi-step math, logic, and code generation with explicit reasoning traces.
- Edge / local deployment — running sophisticated reasoning on resource-constrained hardware (single GPU, laptop inference via llama.cpp or Ollama).
- Fine-tuning base — can be further specialized for domain-specific reasoning via LoRA or QLoRA.
- Distillation student — suitable as a student model for additional knowledge distillation pipelines.
Out-of-Scope Uses
- Tasks requiring factual encyclopedic recall (this model prioritises reasoning over memorisation, consistent with the GPT-OSS design philosophy).
- Real-time or latency-critical production APIs without further optimisation.
- Multilingual-heavy workloads; training data is primarily English.
- Applications requiring formal safety or compliance certification.
Model Lineage & Training Pipeline
```
GPT-OSS 120B (OpenAI)                        ← Ultimate teacher model
        │  knowledge distillation
        ▼
Qwen 3.5 (Alibaba)                           ← Intermediate distillation target
        │  cross-architecture distillation
        ▼
glyphsoftware/gemma-3-4b-qwen3.5-distilled
        │  supervised fine-tuning (SFT) via Unsloth + TRL
        ▼
gemma3-4b-qwen3.5-reasoning-gpt-oss-120b     ← This model (4-bit quantized)
```
The distillation chain transfers structured chain-of-thought (CoT) reasoning patterns from the 120B-parameter GPT-OSS teacher down into the 4B Gemma 3 student. GPT-OSS 120B is a Mixture-of-Experts model released by OpenAI under Apache 2.0; despite activating only ~5.1B parameters per token, it achieves frontier-level scores on MMLU-Pro (90.0%), AIME 2024 (96.6% with tools), and SWE-bench Verified (62.4%). These reasoning trajectories are compressed into the Gemma 3 architecture through intermediate Qwen 3.5 distillation.
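The card does not publish the exact training objective, but the classic distillation term this kind of pipeline typically uses is a temperature-scaled KL divergence between the teacher's and student's token distributions. A minimal illustrative sketch (not Glyph Software's actual training code; function names and the temperature value are assumptions):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student), the standard distillation term.

    Scaled by T^2 so gradient magnitudes stay comparable across temperatures
    (Hinton et al., 2015).
    """
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Identical distributions give zero loss; divergent ones give a positive loss.
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 0)  # True
```

In practice this term is mixed with a standard cross-entropy loss on the ground-truth tokens; the SFT stage described above operates on reasoning traces rather than raw logits.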
Technical Specifications
| Property | Value |
|---|---|
| Architecture | Gemma 3 (Transformer) |
| Total Parameters | 4B |
| Quantization | 4-bit (BitsAndBytes) |
| Tensor Types | F32 · BF16 · U8 (INT4 via bitsandbytes) |
| Language | English (primary) |
| Pipeline | Image-Text-to-Text (multimodal-capable base) |
| Context Window | Up to 128K tokens (Gemma 3 native) |
| Chat Template | Included |
| Training Framework | Unsloth + Hugging Face TRL |
| License | Apache 2.0 |
Quantization Notes
This model is distributed in 4-bit BitsAndBytes format (U8 tensor type with BF16/F32 compute buffers). The quantization enables the model to run on significantly less GPU VRAM than a full-precision equivalent:
| Precision | Approx. VRAM Required |
|---|---|
| BF16 (full) | ~8 GB |
| INT4 / 4-bit (this model) | ~3–4 GB |
This makes it suitable for a single consumer GPU (e.g. RTX 3060 12 GB, RTX 4070, Apple M-series with Metal).
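The VRAM figures above can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameter count × bits per parameter, with activations, KV cache, and dequantization buffers added on top at runtime. A rough sketch (weights only; the runtime overhead is not modeled here):

```python
def approx_weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate GPU memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Gemma 3 4B at the two precisions from the table above.
for label, bits in [("BF16", 16), ("INT4", 4)]:
    print(f"{label}: ~{approx_weight_gib(4e9, bits):.1f} GiB weights")
```

This yields roughly 7.5 GiB of weights at BF16 and under 2 GiB at INT4; the table's ~8 GB and ~3–4 GB figures include the runtime overhead.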
How to Use
Basic Inference (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve step by step: if a train travels 120 km in 1.5 hours, what is its average speed?"}
]

# return_dict=True yields a dict with input_ids/attention_mask, so **inputs
# unpacks correctly below and the prompt length can be sliced off the output.
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Unsloth Inference (Faster)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b",
    max_seq_length=8192,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

messages = [{"role": "user", "content": "Explain the halting problem with a concrete example."}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).cuda()

outputs = model.generate(input_ids=inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
Recommended Inference Parameters
| Parameter | Recommended Value | Notes |
|---|---|---|
| `temperature` | 0.6 | Balances creativity and coherence |
| `top_p` | 0.95 | Nucleus sampling |
| `top_k` | 20 | Optional additional filter |
| `max_new_tokens` | 1024–4096 | Higher for complex reasoning traces |
| `do_sample` | `True` | Required when temperature > 0 |
| `repetition_penalty` | 1.1–1.5 | Prevents looping in long traces |
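These recommendations map directly onto `model.generate` keyword arguments; collecting them in one dict keeps calls consistent across the two inference examples above:

```python
# Recommended defaults from the table above, as a reusable kwargs dict.
# Values are the card's recommendations, not tuned constants.
GEN_KWARGS = dict(
    max_new_tokens=1024,     # raise toward 4096 for long reasoning traces
    do_sample=True,          # required whenever temperature > 0
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.1,  # 1.1–1.5 to prevent looping
)

# Usage: outputs = model.generate(**inputs, **GEN_KWARGS)
print(sorted(GEN_KWARGS))
```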
Reasoning Format
This model emits structured chain-of-thought reasoning inside <think> tags before producing a final answer, inheriting this pattern from the GPT-OSS and Qwen 3.5 distillation chain:
```
<think>
Let me break this problem down carefully.
1. Identify what is being asked...
2. Gather relevant constraints...
3. Apply the appropriate method...
4. Verify the result...
</think>
The answer is: [final response here]
```
This makes intermediate reasoning transparent and auditable — useful for debugging agent pipelines and educational applications.
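When consuming output programmatically, the reasoning trace usually needs to be separated from the final answer. A minimal parser, assuming the single-`<think>`-block format shown above (this helper is illustrative, not part of the model's tooling):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer): the <think> contents and everything after."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()  # no trace emitted; the whole output is the answer
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

sample = "<think>\n120 km / 1.5 h = 80 km/h\n</think>\nThe answer is: 80 km/h"
reasoning, answer = split_reasoning(sample)
print(answer)  # The answer is: 80 km/h
```

The fallback branch matters in practice: for trivial prompts the model may skip the trace entirely, and a parser that assumes the tags are always present would drop the answer.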
Known Limitations
- English-centric: Training data is primarily English. Performance on Indic languages, Mandarin, and other non-Latin scripts will be significantly lower than dedicated multilingual models (e.g. Gemma 3 for Indic, Qwen for Mandarin).
- Hallucination risk: As an autoregressive LLM, the model can confabulate facts, especially when asked about events post-training cutoff or obscure factual details. External retrieval (RAG) is recommended for fact-sensitive applications.
- Encyclopedic recall trade-off: Consistent with the GPT-OSS design philosophy, this model prioritises reasoning depth over broad factual memorisation.
- Quantization degradation: 4-bit quantization introduces small but non-zero accuracy loss vs. BF16. For highest-stakes applications, consider running the parent model at higher precision.
- Context window in practice: While the Gemma 3 architecture supports up to 128K tokens, usable context at 4-bit may be practically limited by available VRAM.
- No formal safety evaluation: This model has not been independently red-teamed or certified for production safety. Standard content filtering and guardrails are recommended for user-facing deployments.
Evaluation
No independent benchmarks have been published for this specific checkpoint. The following benchmarks characterise the GPT-OSS 120B teacher model, which sets the upper-bound performance ceiling this distillation targets:
| Benchmark | GPT-OSS 120B Score |
|---|---|
| MMLU-Pro | 90.0% |
| AIME 2024 (with tools) | 96.6% |
| AIME 2025 (with tools) | 97.9% |
| GPQA Diamond (with tools) | 80.9% |
| SWE-bench Verified | 62.4% |
Expect the distilled 4B student to deliver a fraction of teacher performance on these benchmarks, with the trade-off being dramatically reduced inference cost and hardware requirements.
Community evaluation contributions are welcome via the Discussions tab.
Environmental Impact
Training was performed using Unsloth, which reports up to 2× faster training with reduced GPU memory usage compared to standard Hugging Face training loops. This reduces energy consumption during fine-tuning.
Inference energy cost is further reduced by 4-bit quantization, enabling use on lower-TDP consumer hardware.
Citation
If you use this model in research or production, please cite as follows:
```bibtex
@misc{glyphsoftware2026gemma3reasoning,
  title        = {gemma3-4b-qwen3.5-reasoning-gpt-oss-120b},
  author       = {Glyph Software LLP},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b}}
}
```
Acknowledgements
- Unsloth AI — for the training acceleration framework.
- Hugging Face TRL — for the SFT training library.
- Google DeepMind — for the Gemma 3 base architecture.
- Alibaba Qwen Team — for the Qwen 3.5 intermediate distillation target.
- OpenAI — for releasing GPT-OSS under Apache 2.0, enabling the teacher-student distillation pipeline.
Model Card Metadata
| Field | Value |
|---|---|
| Card authored by | Community (based on available HF metadata) |
| Card version | 1.0 |
| Last updated | March 2026 |
| HF repo | glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b |