Quantized versions of mistralai/Mixtral-8x7B-Instruct-v0.1 in GGUF format for use with llama.cpp.
Note: All operations (conversion, quantization, benchmarking, and inference) were performed on an NVIDIA DGX Spark running NVIDIA DGX Spark OS Version 7.4.0 (GPU: NVIDIA GB10, 124 GB VRAM, compute capability 12.1).
| File | Type | Description |
|---|---|---|
| Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf | Q8_0 | Near-lossless quality, large size; recommended |
| Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf | Q4_K_M | Gold standard; best quality/size trade-off |
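As a rough sanity check on the sizes above, file size is approximately parameter count times average bits per weight. The bits-per-weight figures below are approximations, not exact format constants (Q8_0 stores a per-block fp16 scale, landing near 8.5 bpw; Q4_K_M mixes tensor types and averages roughly 4.8 bpw):

```python
PARAMS = 46.70e9  # Mixtral-8x7B total parameters, from the benchmark table below

def estimate_gib(params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GiB for a given quantization."""
    return params * bits_per_weight / 8 / 2**30

print(f"Q8_0   ~ {estimate_gib(PARAMS, 8.5):.1f} GiB")  # close to the actual 46.22 GiB
print(f"Q4_K_M ~ {estimate_gib(PARAMS, 4.8):.1f} GiB")
```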
```bash
# Recommended: Q8_0
wget https://huggingface.co/kostakoff/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf

# Other quantizations:
# wget https://huggingface.co/kostakoff/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
```
Requirements: CUDA-capable GPU, CMake ≥ 3.18, CUDA Toolkit
```bash
mkdir -p ~/llamacpp && cd ~/llamacpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build build --config Release
```
Version used for conversion and testing:

```
version: 8334 (463b6a963)
```
```bash
./llama.cpp/build/bin/llama-server \
  -m ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  --port 8080 \
  --host 0.0.0.0
```
```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mixtral",
    "messages": [
      { "role": "user", "content": "Explain the MoE architecture in simple terms." }
    ]
  }' | jq -r '.choices[0].message.content'
```
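The same request works from Python against the server's OpenAI-compatible endpoint. A minimal stdlib-only sketch; the helper names (`build_chat_request`, `extract_reply`, `chat`) are illustrative, not part of llama.cpp:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "mixtral") -> dict:
    # Same payload shape as the curl example above.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def extract_reply(response: dict) -> str:
    # Mirrors jq -r '.choices[0].message.content'
    return response["choices"][0]["message"]["content"]

def chat(prompt: str, url: str = "http://127.0.0.1:8080/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```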
Tested with llama-bench on NVIDIA GB10 (124 GB VRAM):
```bash
./llama.cpp/build/bin/llama-bench \
  -m ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  -ngl 9999
```
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8x7B Q8_0 | 46.22 GiB | 46.70 B | CUDA | 9999 | pp512 | 811.58 ± 6.23 |
| llama 8x7B Q8_0 | 46.22 GiB | 46.70 B | CUDA | 9999 | tg128 | 15.75 ± 0.03 |

pp = prompt processing (prefill) tokens/s · tg = text generation tokens/s
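These two rates give a quick end-to-end latency estimate: prefill runs at the pp rate, generation at the tg rate, so total time is roughly the sum of the two phases. A sketch using the Q8_0 numbers from the table:

```python
# Measured rates from the llama-bench run above (Q8_0 on NVIDIA GB10).
PP_TPS = 811.58  # pp512: prompt processing tokens/s
TG_TPS = 15.75   # tg128: text generation tokens/s

def latency_seconds(prompt_tokens: int, gen_tokens: int) -> float:
    """Rough wall-clock estimate: prefill time plus generation time."""
    return prompt_tokens / PP_TPS + gen_tokens / TG_TPS

# A 512-token prompt followed by 128 generated tokens:
print(f"{latency_seconds(512, 128):.1f} s")  # ~8.8 s, dominated by generation
```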
```bash
cd ~/llamacpp
python3 -m venv .venv_llamacpp
source ./.venv_llamacpp/bin/activate
python -m pip install --upgrade pip
pip install "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.10.0" \
  --index-url https://download.pytorch.org/whl/cu130
pip install -r ./llama.cpp/requirements/requirements-convert_legacy_llama.txt \
  --extra-index-url https://download.pytorch.org/whl/cu130
pip install "mistral-common[image,audio]" \
  --extra-index-url https://download.pytorch.org/whl/cu130
```
```bash
cd ~/llamacpp
source ./.venv_llamacpp/bin/activate
python ./llama.cpp/convert_hf_to_gguf.py \
  ~/llm/models/mixtral \
  --verbose \
  --outfile ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  --outtype bf16
```
```bash
# Q8_0: near-lossless
./llama.cpp/build/bin/llama-quantize \
  ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  ./Mixtral-8x7B-Instruct-v0.1-Q8_0.gguf \
  Q8_0

# Q4_K_M: gold standard
./llama.cpp/build/bin/llama-quantize \
  ./Mixtral-8x7B-Instruct-v0.1-bf16.gguf \
  ./Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf \
  Q4_K_M
```
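After quantizing, a quick way to confirm an output file is a GGUF container is to check its magic bytes: every GGUF file begins with the 4-byte magic `GGUF`. A minimal sketch (the helper name `is_gguf` is illustrative):

```python
def is_gguf(path: str) -> bool:
    """True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```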
This model inherits the license of the original: Apache 2.0
Base model: mistralai/Mixtral-8x7B-v0.1