gemma3-4b-qwen3.5-reasoning-gpt-oss-120b

- **Developed by:** Glyph Software LLP
- **License:** Apache 2.0
- **Model type:** Fine-tuned / quantized (4-bit, BitsAndBytes)
- **Architecture:** Gemma 3 (4B parameters)
- **Parent model:** glyphsoftware/gemma-3-4b-qwen3.5-distilled
- **Training framework:** Unsloth + Hugging Face TRL
Model Summary
gemma3-4b-qwen3.5-reasoning-gpt-oss-120b is a compact, quantized reasoning model built on the Gemma 3 4B backbone. It is a fine-tuned derivative of glyphsoftware/gemma-3-4b-qwen3.5-distilled — itself a distilled model that inherits reasoning capabilities from Qwen 3.5 and targets the performance profile of the GPT-OSS 120B teacher model.
The name encodes the full lineage:
- `gemma3-4b` → student architecture (Google Gemma 3, 4B params)
- `qwen3.5` → intermediate distillation source (Alibaba Qwen 3.5)
- `reasoning` → capability emphasis (chain-of-thought / structured thinking)
- `gpt-oss-120b` → ultimate teacher model (OpenAI GPT-OSS 120B)
Trained 2× faster using Unsloth optimizations, this model delivers frontier-class reasoning behavior in a 4B footprint that is runnable on consumer hardware.
Intended Use
Primary Use Cases
- Agentic reasoning tasks — structured step-by-step problem decomposition inside `<think>` tags before producing a final answer.
- STEM and coding assistance — multi-step math, logic, and code generation with explicit reasoning traces.
- Edge / local deployment — running sophisticated reasoning on resource-constrained hardware (single GPU, laptop inference via llama.cpp or Ollama).
- Fine-tuning base — can be further specialized for domain-specific reasoning via LoRA or QLoRA.
- Distillation student — suitable as a student model for additional knowledge distillation pipelines.
Out-of-Scope Uses
- Tasks requiring factual encyclopedic recall (this model prioritises reasoning over memorisation, consistent with the GPT-OSS design philosophy).
- Real-time or latency-critical production APIs without further optimisation.
- Multilingual-heavy workloads; training data is primarily English.
- Applications requiring formal safety or compliance certification.
Model Lineage & Training Pipeline
```
GPT-OSS 120B (OpenAI)                        ← Ultimate teacher model
        │  knowledge distillation
        ▼
Qwen 3.5 (Alibaba)                           ← Intermediate distillation target
        │  cross-architecture distillation
        ▼
glyphsoftware/gemma-3-4b-qwen3.5-distilled
        │  supervised fine-tuning (SFT) via Unsloth + TRL
        ▼
gemma3-4b-qwen3.5-reasoning-gpt-oss-120b     ← This model (4-bit quantized)
```
The distillation chain transfers structured chain-of-thought (CoT) reasoning patterns from the 120B-parameter GPT-OSS teacher down into the 4B Gemma 3 student. GPT-OSS 120B is a Mixture-of-Experts model released by OpenAI under Apache 2.0; despite activating only ~5.1B parameters per token, it achieves frontier-level scores on MMLU-Pro (90.0%), AIME 2024 (96.6% with tools), and SWE-bench Verified (62.4%). These reasoning trajectories are compressed into the Gemma 3 architecture through intermediate Qwen 3.5 distillation.
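The card does not publish the exact training objective, but the classic distillation term this kind of pipeline typically uses is a temperature-scaled KL divergence between the teacher's and student's token distributions. A minimal illustrative sketch (not Glyph Software's actual training code; function names and the temperature value are assumptions):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student), the standard distillation term.

    Scaled by T^2 so gradient magnitudes stay comparable across temperatures
    (Hinton et al., 2015).
    """
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Identical distributions give zero loss; divergent ones give a positive loss.
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 0)  # True
```

In practice this term is mixed with a standard cross-entropy loss on the ground-truth tokens; the SFT stage described above operates on reasoning traces rather than raw logits.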
Technical Specifications
| Property | Value |
|---|---|
| Architecture | Gemma 3 (Transformer) |
| Total Parameters | 4B |
| Quantization | 4-bit (BitsAndBytes) |
| Tensor Types | F32 · BF16 · U8 (INT4 via bitsandbytes) |
| Language | English (primary) |
| Pipeline | Image-Text-to-Text (multimodal-capable base) |
| Context Window | Up to 128K tokens (Gemma 3 native) |
| Chat Template | Included |
| Training Framework | Unsloth + Hugging Face TRL |
| License | Apache 2.0 |
Quantization Notes
This model is distributed in 4-bit BitsAndBytes format (U8 tensor type with BF16/F32 compute buffers). The quantization enables the model to run on significantly less GPU VRAM than a full-precision equivalent:
| Precision | Approx. VRAM Required |
|---|---|
| BF16 (full) | ~8 GB |
| INT4 / 4-bit (this model) | ~3–4 GB |
This makes it suitable for a single consumer GPU (e.g. RTX 3060 12 GB, RTX 4070, Apple M-series with Metal).
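The VRAM figures above can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameter count × bits per parameter, with activations, KV cache, and dequantization buffers added on top at runtime. A rough sketch (weights only; the runtime overhead is not modeled here):

```python
def approx_weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate GPU memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Gemma 3 4B at the two precisions from the table above.
for label, bits in [("BF16", 16), ("INT4", 4)]:
    print(f"{label}: ~{approx_weight_gib(4e9, bits):.1f} GiB weights")
```

This yields roughly 7.5 GiB of weights at BF16 and under 2 GiB at INT4; the table's ~8 GB and ~3–4 GB figures include the runtime overhead.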
How to Use
Basic Inference (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve step by step: if a train travels 120 km in 1.5 hours, what is its average speed?"}
]

# return_dict=True yields a dict with input_ids/attention_mask, so **inputs
# unpacks correctly below and the prompt length can be sliced off the output.
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Unsloth Inference (Faster)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b",
    max_seq_length=8192,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

messages = [{"role": "user", "content": "Explain the halting problem with a concrete example."}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).cuda()

outputs = model.generate(input_ids=inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
Recommended Inference Parameters
| Parameter | Recommended Value | Notes |
|---|---|---|
| `temperature` | 0.6 | Balances creativity and coherence |
| `top_p` | 0.95 | Nucleus sampling |
| `top_k` | 20 | Optional additional filter |
| `max_new_tokens` | 1024–4096 | Higher for complex reasoning traces |
| `do_sample` | `True` | Required when temperature > 0 |
| `repetition_penalty` | 1.1–1.5 | Prevents looping in long traces |
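These recommendations map directly onto `model.generate` keyword arguments; collecting them in one dict keeps calls consistent across the two inference examples above:

```python
# Recommended defaults from the table above, as a reusable kwargs dict.
# Values are the card's recommendations, not tuned constants.
GEN_KWARGS = dict(
    max_new_tokens=1024,     # raise toward 4096 for long reasoning traces
    do_sample=True,          # required whenever temperature > 0
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.1,  # 1.1–1.5 to prevent looping
)

# Usage: outputs = model.generate(**inputs, **GEN_KWARGS)
print(sorted(GEN_KWARGS))
```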
Reasoning Format
This model emits structured chain-of-thought reasoning inside <think> tags before producing a final answer, inheriting this pattern from the GPT-OSS and Qwen 3.5 distillation chain:
```
<think>
Let me break this problem down carefully.
1. Identify what is being asked...
2. Gather relevant constraints...
3. Apply the appropriate method...
4. Verify the result...
</think>
The answer is: [final response here]
```
This makes intermediate reasoning transparent and auditable — useful for debugging agent pipelines and educational applications.
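When consuming output programmatically, the reasoning trace usually needs to be separated from the final answer. A minimal parser, assuming the single-`<think>`-block format shown above (this helper is illustrative, not part of the model's tooling):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer): the <think> contents and everything after."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()  # no trace emitted; the whole output is the answer
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

sample = "<think>\n120 km / 1.5 h = 80 km/h\n</think>\nThe answer is: 80 km/h"
reasoning, answer = split_reasoning(sample)
print(answer)  # The answer is: 80 km/h
```

The fallback branch matters in practice: for trivial prompts the model may skip the trace entirely, and a parser that assumes the tags are always present would drop the answer.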
Known Limitations
- English-centric: Training data is primarily English. Performance on Indic languages, Mandarin, and other non-Latin scripts will be significantly lower than dedicated multilingual models (e.g. Gemma 3 for Indic, Qwen for Mandarin).
- Hallucination risk: As an autoregressive LLM, the model can confabulate facts, especially when asked about events post-training cutoff or obscure factual details. External retrieval (RAG) is recommended for fact-sensitive applications.
- Encyclopedic recall trade-off: Consistent with the GPT-OSS design philosophy, this model prioritises reasoning depth over broad factual memorisation.
- Quantization degradation: 4-bit quantization introduces small but non-zero accuracy loss vs. BF16. For highest-stakes applications, consider running the parent model at higher precision.
- Context window in practice: While the Gemma 3 architecture supports up to 128K tokens, usable context at 4-bit may be practically limited by available VRAM.
- No formal safety evaluation: This model has not been independently red-teamed or certified for production safety. Standard content filtering and guardrails are recommended for user-facing deployments.
Evaluation
No independent benchmarks have been published for this specific checkpoint. The following benchmarks characterise the GPT-OSS 120B teacher model, which sets the upper-bound performance ceiling this distillation targets:
| Benchmark | GPT-OSS 120B Score |
|---|---|
| MMLU-Pro | 90.0% |
| AIME 2024 (with tools) | 96.6% |
| AIME 2025 (with tools) | 97.9% |
| GPQA Diamond (with tools) | 80.9% |
| SWE-bench Verified | 62.4% |
Expect the distilled 4B student to deliver a fraction of teacher performance on these benchmarks, with the trade-off being dramatically reduced inference cost and hardware requirements.
Community evaluation contributions are welcome via the Discussions tab.
Environmental Impact
Training was performed using Unsloth, which reports up to 2× faster training with reduced GPU memory usage compared to standard Hugging Face training loops. This reduces energy consumption during fine-tuning.
Inference energy cost is further reduced by 4-bit quantization, enabling use on lower-TDP consumer hardware.
Citation
If you use this model in research or production, please cite as follows:
```bibtex
@misc{glyphsoftware2026gemma3reasoning,
  title        = {gemma3-4b-qwen3.5-reasoning-gpt-oss-120b},
  author       = {Glyph Software LLP},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b}}
}
```
Acknowledgements
- Unsloth AI — for the training acceleration framework.
- Hugging Face TRL — for the SFT training library.
- Google DeepMind — for the Gemma 3 base architecture.
- Alibaba Qwen Team — for the Qwen 3.5 intermediate distillation target.
- OpenAI — for releasing GPT-OSS under Apache 2.0, enabling the teacher-student distillation pipeline.
Model Card Metadata
| Field | Value |
|---|---|
| Card authored by | Community (based on available HF metadata) |
| Card version | 1.0 |
| Last updated | March 2026 |
| HF repo | glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b |