
gemma3-4b-qwen3.5-reasoning-gpt-oss-120b

Developed by: Glyph Software LLP
License: Apache 2.0
Model type: Fine-tuned / Quantized (4-bit, BitsAndBytes)
Architecture: Gemma 3 (4B parameters)
Parent model: glyphsoftware/gemma-3-4b-qwen3.5-distilled
Training framework: Unsloth + Hugging Face TRL


Model Summary

gemma3-4b-qwen3.5-reasoning-gpt-oss-120b is a compact, quantized reasoning model built on the Gemma 3 4B backbone. It is a fine-tuned derivative of glyphsoftware/gemma-3-4b-qwen3.5-distilled — itself a distilled model that inherits reasoning capabilities from Qwen 3.5 and targets the performance profile of the GPT-OSS 120B teacher model.

The name encodes the full lineage:

gemma3-4b        →  student architecture (Google Gemma 3, 4B params)
qwen3.5          →  intermediate distillation source (Alibaba Qwen 3.5)
reasoning        →  capability emphasis (chain-of-thought / structured thinking)
gpt-oss-120b     →  ultimate teacher model (OpenAI GPT-OSS 120B)

Fine-tuning was accelerated roughly 2× using Unsloth optimizations. The result is a model that delivers structured, teacher-style reasoning behaviour in a 4B footprint small enough to run on consumer hardware.


Intended Use

Primary Use Cases

  • Agentic reasoning tasks — structured step-by-step problem decomposition inside <think> tags before producing a final answer.
  • STEM and coding assistance — multi-step math, logic, and code generation with explicit reasoning traces.
  • Edge / local deployment — running sophisticated reasoning on resource-constrained hardware (single GPU, laptop inference via llama.cpp or Ollama).
  • Fine-tuning base — can be further specialized for domain-specific reasoning via LoRA or QLoRA.
  • Distillation student — suitable as a student model for additional knowledge distillation pipelines.

Out-of-Scope Uses

  • Tasks requiring factual encyclopedic recall (this model prioritises reasoning over memorisation, consistent with the GPT-OSS design philosophy).
  • Real-time or latency-critical production APIs without further optimisation.
  • Multilingual-heavy workloads; training data is primarily English.
  • Applications requiring formal safety or compliance certification.

Model Lineage & Training Pipeline

GPT-OSS 120B (OpenAI)          ← Ultimate teacher model
        │  knowledge distillation
        ▼
Qwen 3.5 (Alibaba)             ← Intermediate distillation target
        │  cross-architecture distillation
        ▼
glyphsoftware/gemma-3-4b-qwen3.5-distilled
        │  supervised fine-tuning (SFT) via Unsloth + TRL
        ▼
gemma3-4b-qwen3.5-reasoning-gpt-oss-120b  ← This model (4-bit quantized)

The distillation chain transfers structured chain-of-thought (CoT) reasoning patterns from the 120B-parameter GPT-OSS teacher down into the 4B Gemma 3 student. GPT-OSS 120B is a Mixture-of-Experts model released by OpenAI under Apache 2.0; despite activating only ~5.1B parameters per token, it achieves frontier-level scores on MMLU-Pro (90.0%), AIME 2024 (96.6% with tools), and SWE-bench Verified (62.4%). These reasoning trajectories are compressed into the Gemma 3 architecture through intermediate Qwen 3.5 distillation.


Technical Specifications

| Property | Value |
|---|---|
| Architecture | Gemma 3 (Transformer) |
| Total parameters | 4B |
| Quantization | 4-bit (BitsAndBytes) |
| Tensor types | F32 · BF16 · U8 (INT4 via bitsandbytes) |
| Language | English (primary) |
| Pipeline | Image-Text-to-Text (multimodal-capable base) |
| Context window | Up to 128K tokens (Gemma 3 native) |
| Chat template | Included |
| Training framework | Unsloth + Hugging Face TRL |
| License | Apache 2.0 |

Quantization Notes

This model is distributed in 4-bit BitsAndBytes format (U8 tensor type with BF16/F32 compute buffers). The quantization enables the model to run on significantly less GPU VRAM than a full-precision equivalent:

| Precision | Approx. VRAM required |
|---|---|
| BF16 (full precision) | ~8 GB |
| INT4 / 4-bit (this model) | ~3–4 GB |

This makes it suitable for a single consumer GPU (e.g. RTX 3060 12 GB, RTX 4070, Apple M-series with Metal).
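The VRAM figures above can be sanity-checked with simple arithmetic. The sketch below estimates weight memory only; real usage adds KV cache, activations, and framework overhead (the 1.2× fudge factor here is an assumption, not a measured value), so treat these as floor estimates:

```python
def approx_weight_vram_gb(n_params: float, bits_per_param: float,
                          overhead: float = 1.2) -> float:
    """Rough weight-memory estimate: params x bytes/param x fudge factor."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total * overhead / 1e9

# 4B parameters at BF16 (16 bits) vs. 4-bit NF4
print(round(approx_weight_vram_gb(4e9, 16), 1))  # -> 9.6 (GB, weights only)
print(round(approx_weight_vram_gb(4e9, 4), 1))   # -> 2.4 (GB, before KV cache)
```

The 4-bit figure lands below the ~3–4 GB in the table above because the table also accounts for runtime buffers and KV cache.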


How to Use

Basic Inference (Transformers)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve step by step: if a train travels 120 km in 1.5 hours, what is its average speed?"}
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"  # return_dict=True yields a dict, so **inputs works below
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Unsloth Inference (Faster)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b",
    max_seq_length=8192,
    dtype=None,        # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Explain the halting problem with a concrete example."}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").cuda()

outputs = model.generate(input_ids=inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

Recommended Inference Parameters

| Parameter | Recommended value | Notes |
|---|---|---|
| temperature | 0.6 | Balances creativity and coherence |
| top_p | 0.95 | Nucleus sampling |
| top_k | 20 | Optional additional filter |
| max_new_tokens | 1024–4096 | Higher for complex reasoning traces |
| do_sample | True | Required when temperature > 0 |
| repetition_penalty | 1.1–1.5 | Helps prevent looping in long traces |
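The recommended parameters can be bundled into a single kwargs dict for reuse across `generate()` calls. This is a minimal sketch (the helper name is our own, not part of any library); the `do_sample` guard mirrors the note in the table:

```python
def reasoning_generation_kwargs(max_new_tokens: int = 2048) -> dict:
    """Generation settings matching the recommended-parameters table."""
    kwargs = {
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "repetition_penalty": 1.1,
        "max_new_tokens": max_new_tokens,
    }
    # Sampling must be enabled whenever temperature > 0.
    kwargs["do_sample"] = kwargs["temperature"] > 0
    return kwargs
```

Usage: `model.generate(**inputs, **reasoning_generation_kwargs(1024))`.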

Reasoning Format

This model emits structured chain-of-thought reasoning inside <think> tags before producing a final answer, inheriting this pattern from the GPT-OSS and Qwen 3.5 distillation chain:

<think>
Let me break this problem down carefully.

1. Identify what is being asked...
2. Gather relevant constraints...
3. Apply the appropriate method...
4. Verify the result...
</think>

The answer is: [final response here]

This makes intermediate reasoning transparent and auditable — useful for debugging agent pipelines and educational applications.
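For agent pipelines, the reasoning trace can be split from the final answer with a small parser. The sketch below assumes at most one well-formed `<think>…</think>` block per response, as in the format shown above:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    # Everything after the closing tag is treated as the final answer.
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>\n120 / 1.5 = 80\n</think>\nThe answer is: 80 km/h"
)
print(answer)  # -> The answer is: 80 km/h
```

Responses without a `<think>` block fall through cleanly, so the parser is safe to apply to every generation.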


Known Limitations

  • English-centric: Training data is primarily English. Performance on Indic languages, Mandarin, and other non-Latin scripts will be significantly lower than dedicated multilingual models (e.g. Gemma 3 for Indic, Qwen for Mandarin).
  • Hallucination risk: As an autoregressive LLM, the model can confabulate facts, especially when asked about events post-training cutoff or obscure factual details. External retrieval (RAG) is recommended for fact-sensitive applications.
  • Encyclopedic recall trade-off: Consistent with the GPT-OSS design philosophy, this model prioritises reasoning depth over broad factual memorisation.
  • Quantization degradation: 4-bit quantization introduces small but non-zero accuracy loss vs. BF16. For highest-stakes applications, consider running the parent model at higher precision.
  • Context window in practice: While the Gemma 3 architecture supports up to 128K tokens, usable context at 4-bit may be practically limited by available VRAM.
  • No formal safety evaluation: This model has not been independently red-teamed or certified for production safety. Standard content filtering and guardrails are recommended for user-facing deployments.

Evaluation

No independent benchmarks have been published for this specific checkpoint. The following benchmarks characterise the GPT-OSS 120B teacher model, which sets the upper-bound performance ceiling this distillation targets:

| Benchmark | GPT-OSS 120B score |
|---|---|
| MMLU-Pro | 90.0% |
| AIME 2024 (with tools) | 96.6% |
| AIME 2025 (with tools) | 97.9% |
| GPQA Diamond (with tools) | 80.9% |
| SWE-bench Verified | 62.4% |

Expect the distilled 4B student to score well below the teacher on these benchmarks; the trade-off is dramatically reduced inference cost and hardware requirements.

Community evaluation contributions are welcome via the Discussions tab.


Environmental Impact

Training was performed using Unsloth, which reports up to 2× faster training with reduced GPU memory usage compared to standard Hugging Face training loops. This reduces energy consumption during fine-tuning.

Inference energy cost is further reduced by 4-bit quantization, enabling use on lower-TDP consumer hardware.


Citation

If you use this model in research or production, please cite as follows:

@misc{glyphsoftware2026gemma3reasoning,
  title  = {gemma3-4b-qwen3.5-reasoning-gpt-oss-120b},
  author = {Glyph Software LLP},
  year   = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b}}
}

Acknowledgements

  • Unsloth AI — for the training acceleration framework.
  • Hugging Face TRL — for the SFT training library.
  • Google DeepMind — for the Gemma 3 base architecture.
  • Alibaba Qwen Team — for the Qwen 3.5 intermediate distillation target.
  • OpenAI — for releasing GPT-OSS under Apache 2.0, enabling the teacher-student distillation pipeline.

Model Card Metadata

| Field | Value |
|---|---|
| Card authored by | Community (based on available HF metadata) |
| Card version | 1.0 |
| Last updated | March 2026 |
| HF repo | glyphsoftware/gemma3-4b-qwen3.5-reasoning-gpt-oss-120b |