# Model Card for gpt-oss-10b-quants
This repository contains multiple quantized versions of the GPT-OSS-10B model in GGUF format.
It is intended for efficient inference on consumer hardware, making large model deployment more accessible.
## Model Details

### Model Description
- Developed by: leeminwaan
- Funded by: Independent project
- Shared by: leeminwaan
- Model type: Decoder-only transformer language model
- Language(s) (NLP): English (primary), multilingual capabilities not benchmarked
- License: Apache-2.0
- Finetuned from model: openai/gpt-oss-20b (via pruning and expert selection)
### Model Sources

- Repository: https://huggingface.co/leeminwaan/gpt-oss-10b-quants
- Paper: Not available
- Demo: To be released
## Uses

### Direct Use
- Text generation
- Experimentation with quantization formats
- Running benchmarks on low-resource hardware
### Downstream Use
- Fine-tuning for chatbots, classification, summarization, or research projects
- Integration into lightweight inference pipelines
### Out-of-Scope Use
- High-stakes decision making (medical, legal, financial)
- Content moderation without further fine-tuning
- Applications requiring guaranteed factual accuracy
## Bias, Risks, and Limitations
- May reproduce societal biases from training data
- Limited evaluation on multilingual or domain-specific text
- Quantization may degrade accuracy slightly compared to full precision
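To see why lower bit-widths cost accuracy, here is an illustrative round-trip using simple symmetric round-to-nearest quantization. This is a teaching sketch, not the k-quant scheme llama.cpp actually uses; the weight values are made up.

```python
# Illustrative only: symmetric round-to-nearest quantization, to show why
# precision drops as bit-width shrinks. Not the llama.cpp k-quant scheme.
def quantize_roundtrip(weights, bits):
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    quantized = [round(w / scale) for w in weights]
    return [q * scale for q in quantized]   # dequantize back to floats

weights = [0.013, -0.402, 0.257, 0.991, -0.776]  # hypothetical weight values
for bits in (8, 4, 2):
    restored = quantize_roundtrip(weights, bits)
    err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit max round-trip error: {err:.4f}")
```

The maximum reconstruction error grows as bits decrease, which is the accuracy/size trade-off the quant levels below expose.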
### Recommendations
- Run evaluations before production deployment
- Do not use outputs as factual truth without human verification
## How to Get Started with the Model

```python
from huggingface_hub import hf_hub_download

# Download a single quantized GGUF file from the Hub
model_path = hf_hub_download("leeminwaan/gpt-oss-10.6B-GGUF", "gpt-oss-10b-q4_k_m.gguf")
print("Downloaded:", model_path)
```
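The downloaded file can then be run with llama.cpp. A minimal sketch that assembles a `llama-cli` invocation (binary name and flags as in current llama.cpp builds; adjust for your build):

```python
# Sketch: build a llama.cpp `llama-cli` command for a local GGUF file.
# Assumes a llama.cpp build whose CLI is named `llama-cli`.
def build_llama_cmd(model_path: str, prompt: str, n_predict: int = 64) -> list[str]:
    """Assemble an argument list: -m model, -p prompt, -n tokens to generate."""
    return ["llama-cli", "-m", model_path, "-p", prompt, "-n", str(n_predict)]

cmd = build_llama_cmd("gpt-oss-10b-q4_k_m.gguf", "Explain quantization in one sentence.")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)  (requires llama-cli on PATH)
```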
Quantized versions available:
- Q2_K, Q3_K_S, Q3_K_M, Q3_K_L
- Q4_0, Q4_1, Q4_K_S, Q4_K_M
- Q5_0, Q5_1, Q5_K_S, Q5_K_M
- Q6_K, Q8_0
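The nominal bit-width in each quant name gives a rough lower bound on file size. The sketch below is illustrative arithmetic, not a measurement of this repository's files: real GGUF files are larger because k-quants mix precisions and store per-block scales.

```python
def estimate_gguf_size_gb(n_params: float, quant: str) -> float:
    """Rough size estimate from the nominal bit-width in a quant name
    (e.g. 'Q4_K_M' -> 4 bits). Real files run larger: k-quants keep some
    tensors at higher precision and store quantization scales."""
    bits = int(quant[1])                    # leading digit after 'Q'
    return n_params * bits / 8 / 1e9        # bytes -> GB (decimal)

for q in ("Q2_K", "Q4_K_M", "Q8_0"):
    print(f"{q}: ~{estimate_gguf_size_gb(10.6e9, q):.1f} GB (lower bound)")
```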
## Training Details

### Training Data
- Based on GPT-OSS-20B pretraining corpus (public large-scale web text, open datasets).
- No additional fine-tuning was performed for this release.
### Training Procedure

- Original GPT-OSS-20B → pruned to ~10B parameters via expert selection → quantized to GGUF formats.
#### Preprocessing
- Standard tokenization, no special preprocessing for quantization.
#### Training Hyperparameters
- Quantization only; no gradient updates performed.
- Storage optimized for GGUF inference.
#### Speeds, Sizes, Times

- Full FP16 checkpoint (~20B parameters) → pruned to ~10B parameters → GGUF quantizations ranging from ~3 GB to ~7 GB.
## Evaluation

### Testing Data
- No dedicated evaluation dataset; informal testing on open prompts.
### Factors
- Quantization level strongly affects perplexity and memory footprint.
### Metrics
- Perplexity (approximate, not benchmarked formally).
- Memory usage on consumer GPUs/CPUs.
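For reference, perplexity is the exponentiated mean negative log-likelihood per token. A minimal sketch over hypothetical per-token log-probabilities:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs: a model uniform over a 50k vocab has perplexity 50000.
print(perplexity([math.log(1 / 50000)] * 4))
```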
### Results

- Q8_0 maintains near-full-precision quality.
- Q4_K_M and Q5_K_M offer a good trade-off between memory footprint and output quality.
#### Summary
Quantized models are suitable for lightweight inference and experimentation.
## Model Examination
- No interpretability analysis yet.
## Technical Specifications

### Model Architecture and Objective
- Decoder-only Transformer
- Optimized for text generation
### Compute Infrastructure

#### Hardware
- Single RTX 3090 (24GB VRAM) for quantization tasks
#### Software
- llama.cpp for quantization
- Python 3.10, huggingface_hub
## Citation

**BibTeX:**

```bibtex
@misc{gptoss10bquants,
  title={GPT-OSS-10B Quantized Models},
  author={leeminwaan},
  year={2025},
  howpublished={\url{https://huggingface.co/leeminwaan/gpt-oss-10b-quants}}
}
```
**APA:**
leeminwaan. (2025). GPT-OSS-10B Quantized Models [Computer software]. Hugging Face. https://huggingface.co/leeminwaan/gpt-oss-10b-quants
## Glossary
- Quantization: Reducing precision of weights to lower memory usage.
- GGUF: Optimized format for llama.cpp inference.
## More Information
- This project is experimental.
- Expect further updates and quantization benchmarks.
## Model Card Authors

leeminwaan

## Model Card Contact

Please open a discussion on the Hugging Face repository.