Mixtral-8x7B-Instruct-v0.1 GGUF Model Card

Quantized versions of mistralai/Mixtral-8x7B-Instruct-v0.1 in GGUF format for use with llama.cpp.

Note: All operations β€” conversion, quantization, benchmarking, and inference β€” were performed on an NVIDIA DGX Spark running NVIDIA DGX Spark OS Version 7.4.0 (GPU: NVIDIA GB10, 124 GB VRAM, compute capability 12.1).


Available Quantizations

| File | Type | Description |
|------|------|-------------|
| Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf | Q8_0 | Near-lossless quality, large size ✅ recommended |
| Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf | Q4_K_M | Gold standard: best quality/size trade-off |

Quick Start

1. Download the model

# Recommended β€” Q8_0
wget https://huggingface.co/kostakoff/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf

# Other quantizations:
# wget https://huggingface.co/kostakoff/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
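A downloaded file can be sanity-checked before use: every valid GGUF file begins with the 4-byte ASCII magic `GGUF`. A minimal check (the helper name `check_gguf` is ours, not part of llama.cpp):

```shell
# Every GGUF file starts with the ASCII magic "GGUF"; anything else is
# likely a truncated download or an HTML error page saved by wget.
check_gguf() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "GGUF magic OK"
  else
    echo "not a GGUF file"
  fi
}
# Usage: check_gguf ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf
```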

2. Build llama.cpp

Requirements: CUDA-capable GPU, CMake β‰₯ 3.18, CUDA Toolkit

mkdir -p ~/llamacpp && cd ~/llamacpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build build --config Release

Version used for conversion and testing:

version: 8334 (463b6a963)

3. Start the server

./llama.cpp/build/bin/llama-server \
  -m ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  --port 8080 \
  --host 0.0.0.0

4. Generate a completion

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mixtral",
    "messages": [
      { "role": "user", "content": "Explain the MoE architecture in simple terms." }
    ]
  }' | jq -r '.choices[0].message.content'
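For prompts that contain quotes or newlines, the inline JSON above is fragile. `jq -n` can build the request body with proper escaping; a small sketch (nothing here is specific to llama.cpp):

```shell
# Build the chat-completions payload with jq so the prompt is JSON-escaped.
prompt='Explain the "MoE" architecture in simple terms.'
payload=$(jq -n --arg p "$prompt" \
  '{model: "mixtral", messages: [{role: "user", content: $p}]}')
echo "$payload"
# Then POST it:
# curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```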

Benchmark

Tested with llama-bench on NVIDIA GB10 (124 GB VRAM):

./llama.cpp/build/bin/llama-bench \
  -m ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  -ngl 9999
| model | size | params | backend | ngl | test | t/s |
|-------|------|--------|---------|-----|------|-----|
| llama 8x7B Q8_0 | 46.22 GiB | 46.70 B | CUDA | 9999 | pp512 | 811.58 ± 6.23 |
| llama 8x7B Q8_0 | 46.22 GiB | 46.70 B | CUDA | 9999 | tg128 | 15.75 ± 0.03 |

pp = prompt processing (prefill) tokens/s Β· tg = text generation tokens/s
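These numbers translate directly into latency: tg is the steady-state decode speed, so a reply of N tokens takes roughly N / 15.75 seconds after prefill. A back-of-envelope example:

```shell
# Estimated decode time for a 512-token reply at the measured 15.75 t/s.
awk 'BEGIN { printf "%.1f s\n", 512 / 15.75 }'
# -> 32.5 s
```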


How the weights were created

Step 1 β€” Prepare the Python environment

cd ~/llamacpp
python3 -m venv .venv_llamacpp
source ./.venv_llamacpp/bin/activate
python -m pip install --upgrade pip
pip3 install "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.10.0" \
  --index-url https://download.pytorch.org/whl/cu130
pip install -r ./llama.cpp/requirements/requirements-convert_legacy_llama.txt \
  --extra-index-url https://download.pytorch.org/whl/cu130
pip install mistral-common[image,audio] \
  --extra-index-url https://download.pytorch.org/whl/cu130

Step 2 β€” Convert to BF16 GGUF

cd ~/llamacpp
source ./.venv_llamacpp/bin/activate
python ./llama.cpp/convert_hf_to_gguf.py \
  ~/llm/models/mixtral \
  --verbose \
  --outfile ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  --outtype bf16
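Note that the intermediate BF16 file is the full-precision footprint: at 2 bytes per weight, the 46.70 B parameters reported by llama-bench come to roughly 87 GiB of scratch disk space. A quick back-of-envelope check:

```shell
# BF16 stores 2 bytes per weight: expected size of the intermediate GGUF.
awk 'BEGIN { printf "%.1f GiB\n", 46.7e9 * 2 / (1024 ^ 3) }'
# -> 87.0 GiB
```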

Step 3 β€” Quantize

# Q8_0 β€” near-lossless
./llama.cpp/build/bin/llama-quantize \
  ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  Q8_0

# Q4_K_M β€” gold standard
./llama.cpp/build/bin/llama-quantize \
  ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  ./Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf \
  Q4_K_M
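The Q8_0 file size can be sanity-checked from first principles: Q8_0 packs weights in blocks of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights ≈ 8.5 bits/weight, which for 46.70 B parameters predicts the 46.22 GiB that llama-bench reports:

```shell
# Q8_0: 34 bytes per 32-weight block = 8.5 bits/weight.
awk 'BEGIN { printf "%.1f GiB\n", 46.7e9 * 8.5 / 8 / (1024 ^ 3) }'
# -> 46.2 GiB
```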

License

This model inherits the license of the original model, mistralai/Mixtral-8x7B-Instruct-v0.1: Apache 2.0.
