v0.1.6: Real-time Metrics & Blackwell-Optimized Docker (Recommended)
This model is fully compatible with the DGX-Spark-llama.cpp-Bench. Experience the state-of-the-art inference engine optimized for NVIDIA Blackwell (DGX Spark) hardware.
Key Features (v0.1.6)
- Real-time Performance Metrics: Now visualizes Input TPS and Output TPS during streaming.
- Improved Reasoning UI: Seamlessly renders and stabilizes the model's Chain-of-Thought (CoT).
- Blackwell Optimization: Native support for ARM64/SM121 and CUDA 13.0 FP4.
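The Input/Output TPS metrics above are plain rate arithmetic over the prefill and decode phases. A minimal sketch of the idea (the function name is illustrative, not taken from the benchmark's codebase):

```python
def compute_tps(token_count: int, elapsed_seconds: float) -> float:
    """Tokens per second over an interval; guards against a zero-length interval."""
    if elapsed_seconds <= 0:
        return 0.0
    return token_count / elapsed_seconds

# Input TPS: prompt tokens processed during prefill.
input_tps = compute_tps(1024, 0.5)   # -> 2048.0
# Output TPS: tokens generated during the streaming/decode phase.
output_tps = compute_tps(256, 4.0)   # -> 64.0
```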
Quick Start
# Pull the latest optimized image
docker pull ghcr.io/sowilow/dgx-spark-llama.cpp-bench:v0.1.6
For more details, visit our GitHub Repository.
v0.1.5: Real-time Metrics & Blackwell-Optimized Docker
This model is fully compatible with the DGX-Spark-llama.cpp-Bench. Experience the state-of-the-art inference engine optimized for NVIDIA Blackwell (DGX Spark) hardware.
Key Features (v0.1.5)
- Real-time Performance Metrics: Now visualizes Input TPS and Output TPS during streaming.
- Improved Reasoning UI: Seamlessly renders and stabilizes the model's Chain-of-Thought (CoT).
- Blackwell Optimization: Native support for ARM64/SM121 and CUDA 13.0 FP4.
Quick Start
# Pull the latest optimized image
docker pull ghcr.io/sowilow/dgx-spark-llama.cpp-bench:v0.1.5
For more details, visit our GitHub Repository.
v0.1.4: Quick Start with Blackwell-Optimized Docker
This model is fully compatible with the DGX-Spark-llama.cpp-Bench. Experience the best performance on NVIDIA Blackwell (DGX Spark) hardware with our optimized inference engine.
Key Features (v0.1.4)
- Blackwell Optimized: Native support for ARM64/SM121 and CUDA 13.0 FP4.
- Intelligent Reasoning UI: Automatic extraction and visualization of reasoning processes (CoT).
- One-Click Deployment: Standardized environment via GHCR Docker image.
How to Run
# Pull the latest optimized image
docker pull ghcr.io/sowilow/dgx-spark-llama.cpp-bench:v0.1.4
# Follow the instructions in our repo to serve this model
# GitHub: https://github.com/sowilow/DGX-Spark-llama.cpp-Bench
Quick Start with Docker (Recommended)
You can easily run this model using the DGX-Spark-llama.cpp-Bench inference engine. It's pre-configured for high-performance inference on NVIDIA hardware (especially Blackwell/DGX Spark).
1. Pull the Docker Image
docker pull ghcr.io/sowilow/dgx-spark-llama.cpp-bench:latest
2. Run the Inference Server
For detailed configuration and usage, visit the GitHub Repository.
gpt-oss-20b-DGX-Spark-GGUF
This repository provides GGUF quantized versions of OpenAI's gpt-oss-20b, optimized specifically for NVIDIA Blackwell (DGX Spark) architectures.
These models were converted and quantized using llama.cpp with support for the gpt_oss architecture.
Model Highlights
- Optimized for Blackwell: Specifically tuned for high-performance inference on NVIDIA DGX Spark (SM120/SM121).
- Flexible Quantization:
  - Q4_MXFP4: 4-bit medium quantization (recommended for efficiency).
  - Q8_0: 8-bit quantization (recommended for maximum precision).
- MoE Architecture: 21B total parameters with 3.6B active parameters, leveraging Mixture-of-Experts for high efficiency.
- Long Context: Supports up to 131k context length.
Quantization Details
| File | Quant Method | Bitrate | Size | Description |
|---|---|---|---|---|
| gpt-oss-20b-q4_mxfp4.gguf | Q4_MXFP4 | 4.5 bpw | ~12 GB | Balanced performance and quality. |
| gpt-oss-20b-q8_0.gguf | Q8_0 | 8.5 bpw | ~22 GB | Standard 8-bit quantization. |
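The sizes above follow from bits-per-weight arithmetic: size ≈ parameters × bpw / 8 bytes. A quick sanity check against the 21B total parameter count from the model highlights:

```python
def quantized_size_gb(n_params: float, bpw: float) -> float:
    """Approximate on-disk size of a quantized model in GB (1 GB = 1e9 bytes)."""
    return n_params * bpw / 8 / 1e9

# 21B total parameters, per the model highlights above.
print(round(quantized_size_gb(21e9, 4.5), 1))  # Q4_MXFP4 -> 11.8 GB (~12 GB in the table)
print(round(quantized_size_gb(21e9, 8.5), 1))  # Q8_0     -> 22.3 GB (~22 GB in the table)
```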
Quick Start (llama.cpp)
To run these models on a DGX Spark system:
Pull the optimized Docker image:
docker pull ghcr.io/sowilow/dgx-spark-llama.cpp-bench:latest
Run with llama-server:
docker run --gpus all -v $(pwd)/models:/model \
  ghcr.io/sowilow/dgx-spark-llama.cpp-bench:latest \
  llama-server -m /model/gpt-oss-20b-q4_mxfp4.gguf -ngl 99 -c 8192
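Once the container is running, llama-server exposes an OpenAI-compatible HTTP API. A minimal Python client sketch, assuming llama-server's default port 8080 (adjust the base URL if you remap the port in your `docker run` command):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style payload for llama-server's /v1/chat/completions."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

def query(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send the request to a running llama-server and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```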
Original Model Information
This is a quantized version of openai/gpt-oss-20b. Please refer to the original model card for details on training, safety, and benchmarks.
Citation
@misc{openai2025gptoss120bgptoss20bmodel,
title={gpt-oss-120b & gpt-oss-20b Model Card},
author={OpenAI},
year={2025},
eprint={2508.10925},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.10925},
}