Mixtral-8x7B-Instruct-v0.1 GGUF Model Card

Quantized versions of mistralai/Mixtral-8x7B-Instruct-v0.1 in GGUF format for use with llama.cpp.

Note: All operations β€” conversion, quantization, benchmarking, and inference β€” were performed on an NVIDIA DGX Spark running NVIDIA DGX Spark OS Version 7.4.0 (GPU: NVIDIA GB10, 124 GB VRAM, compute capability 12.1).


Available Quantizations

| File | Type | Description |
|------|------|-------------|
| Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf | Q8_0 | Near-lossless quality, large size ✅ recommended |
| Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf | Q4_K_M | Gold standard: best quality/size trade-off |

Quick Start

1. Download the model

# Recommended β€” Q8_0
wget https://huggingface.co/kostakoff/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf

# Other quantizations:
# wget https://huggingface.co/kostakoff/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
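A downloaded file can be sanity-checked before use: every valid GGUF file begins with the 4-byte ASCII magic `GGUF`. A minimal check (the helper name `check_gguf` is ours, not part of llama.cpp):

```shell
# Every GGUF file starts with the ASCII magic "GGUF"; anything else is
# likely a truncated download or an HTML error page saved by wget.
check_gguf() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "GGUF magic OK"
  else
    echo "not a GGUF file"
  fi
}
# Usage: check_gguf ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf
```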

2. Build llama.cpp

Requirements: CUDA-capable GPU, CMake β‰₯ 3.18, CUDA Toolkit

mkdir -p ~/llamacpp && cd ~/llamacpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build build --config Release

Version used for conversion and testing:

version: 8334 (463b6a963)

3. Start the server

./llama.cpp/build/bin/llama-server \
  -m ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  --port 8080 \
  --host 0.0.0.0

4. Generate a completion

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mixtral",
    "messages": [
      { "role": "user", "content": "Explain the MoE architecture in simple terms." }
    ]
  }' | jq -r '.choices[0].message.content'
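For prompts that contain quotes or newlines, the inline JSON above is fragile. `jq -n` can build the request body with proper escaping; a small sketch (nothing here is specific to llama.cpp):

```shell
# Build the chat-completions payload with jq so the prompt is JSON-escaped.
prompt='Explain the "MoE" architecture in simple terms.'
payload=$(jq -n --arg p "$prompt" \
  '{model: "mixtral", messages: [{role: "user", content: $p}]}')
echo "$payload"
# Then POST it:
# curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```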

Benchmark

Tested with llama-bench on NVIDIA GB10 (124 GB VRAM):

./llama.cpp/build/bin/llama-bench \
  -m ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  -ngl 9999
| model | size | params | backend | ngl | test | t/s |
|-------|------|--------|---------|-----|------|-----|
| llama 8x7B Q8_0 | 46.22 GiB | 46.70 B | CUDA | 9999 | pp512 | 811.58 ± 6.23 |
| llama 8x7B Q8_0 | 46.22 GiB | 46.70 B | CUDA | 9999 | tg128 | 15.75 ± 0.03 |

pp = prompt processing (prefill) tokens/s Β· tg = text generation tokens/s
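These numbers translate directly into latency: tg is the steady-state decode speed, so a reply of N tokens takes roughly N / 15.75 seconds after prefill. A back-of-envelope example:

```shell
# Estimated decode time for a 512-token reply at the measured 15.75 t/s.
awk 'BEGIN { printf "%.1f s\n", 512 / 15.75 }'
# -> 32.5 s
```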


How the weights were created

Step 1 β€” Prepare the Python environment

cd ~/llamacpp
python3 -m venv .venv_llamacpp
source ./.venv_llamacpp/bin/activate
python -m pip install --upgrade pip
pip3 install "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.10.0" \
  --index-url https://download.pytorch.org/whl/cu130
pip install -r ./llama.cpp/requirements/requirements-convert_legacy_llama.txt \
  --extra-index-url https://download.pytorch.org/whl/cu130
pip install mistral-common[image,audio] \
  --extra-index-url https://download.pytorch.org/whl/cu130

Step 2 β€” Convert to BF16 GGUF

cd ~/llamacpp
source ./.venv_llamacpp/bin/activate
python ./llama.cpp/convert_hf_to_gguf.py \
  ~/llm/models/mixtral \
  --verbose \
  --outfile ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  --outtype bf16
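Note that the intermediate BF16 file is the full-precision footprint: at 2 bytes per weight, the 46.70 B parameters reported by llama-bench come to roughly 87 GiB of scratch disk space. A quick back-of-envelope check:

```shell
# BF16 stores 2 bytes per weight: expected size of the intermediate GGUF.
awk 'BEGIN { printf "%.1f GiB\n", 46.7e9 * 2 / (1024 ^ 3) }'
# -> 87.0 GiB
```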

Step 3 β€” Quantize

# Q8_0 β€” near-lossless
./llama.cpp/build/bin/llama-quantize \
  ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  Q8_0

# Q4_K_M β€” gold standard
./llama.cpp/build/bin/llama-quantize \
  ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  ./Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf \
  Q4_K_M
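The Q8_0 file size can be sanity-checked from first principles: Q8_0 packs weights in blocks of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights ≈ 8.5 bits/weight, which for 46.70 B parameters predicts the 46.22 GiB that llama-bench reports:

```shell
# Q8_0: 34 bytes per 32-weight block = 8.5 bits/weight.
awk 'BEGIN { printf "%.1f GiB\n", 46.7e9 * 8.5 / 8 / (1024 ^ 3) }'
# -> 46.2 GiB
```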

License

This model inherits the license of the original model, mistralai/Mixtral-8x7B-Instruct-v0.1: Apache 2.0.
