# DeepSeek V4 Flash — Q8xQ5 GGUF
Mixed-precision quantization: Q8_0 (attention, shared expert) + Q5_K (routed experts). Quality equivalent to standard Q5_K_M.
| Parameter | Value |
|---|---|
| Model | DeepSeek V4 Flash |
| Architecture | 284B total, 13B active (MoE 256 experts, top-6) |
| Format | GGUF Q8xQ5 (11 parts) |
| Size | 184 GB |
| Context | up to 1M tokens |
| Hardware target | Apple M3 Ultra 256 GB |
## Features
- Mixed-precision: attention/shared expert/router in Q8_0, experts in Q5_K
- Per-layer MoE offload: `--moe-hot-count auto` keeps only the needed experts in RAM
- Typical: 9.7 GB for hot experts (vs 37 GB uniform)
- Automatically adapts to your workload
- Full 1M context with MLA compressed KV cache (~7 GB)
## Requirements
Stock llama.cpp cannot load this model. You need our custom build:

```bash
git clone https://github.com/setar/llama.cpp.git
cd llama.cpp
git checkout feat/moe-expert-persistence

cmake -S . -B build \
  -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_CCACHE=OFF
cmake --build build --config Release --clean-first -j"$(sysctl -n hw.ncpu)"

# For CUDA (NVIDIA) builds, swap the backend flags:
#   -DGGML_METAL=OFF -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
# and replace -j"$(sysctl -n hw.ncpu)" with -j"$(nproc)"
```
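A quick sanity check that the binary built against the right branch (the exact output format may vary between builds):

```bash
# Should print build info with the commit from feat/moe-expert-persistence
./build/bin/llama-server --version
```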
## How to run
```bash
./build/bin/llama-server \
  --model DeepSeek-V4-Flash-Instruct-Q8xQ5.gguf-00001-of-00011.gguf \
  --host 0.0.0.0 --port 8082 \
  --ctx-size 1048576 \
  --flash-attn on \
  --batch-size 2048 --ubatch-size 1024 \
  -t 20 --n-gpu-layers all \
  --mlock \
  --moe-hot-count auto \
  --jinja \
  --chat-template-file deepseek-ai-DeepSeek-V4.jinja
```
The model is split into 11 parts; llama.cpp auto-detects the rest, so just point `--model` at part 1.
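Once it's up, the server speaks the standard llama-server OpenAI-compatible API. A minimal smoke test (adjust host/port to your launch flags):

```bash
# The model field is optional: llama-server serves whatever it loaded
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```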
## Per-layer MoE offload (`--moe-hot-count auto`)
Instead of keeping the same number of experts hot in every layer, auto mode tracks which experts are actually used and derives per-layer optimal counts from the accumulated statistics. The data persists across restarts in `~/.llama/expert_<model>.bin`.
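Since the counts are learned from your own traffic, a major workload change can leave them stale. Our assumption (check the branch for an official reset mechanism) is that the fork simply re-learns from scratch when the file is missing:

```bash
# Assumption: with no persisted statistics, the next run re-accumulates
# per-layer expert counts from scratch.
rm -f ~/.llama/expert_*.bin
```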
### vs uniform hot-count
| | Uniform hot=64 | Per-layer auto | Saving |
|---|---|---|---|
| RAM (experts) | 37 GB | 9.7 GB | 27 GB |
| Decode speed | ~22 t/s | ~22 t/s | 0% |
| Layers with hot=4 | — | 24 (deterministic) | |
| Layers with hot=128 | — | 3 (H2, S25, S37) | |
| Average hot count | 64 | 17 | |
### Per-layer distribution (from ~466k expert activations)

```text
H0: 4     H1: 4    H2: 128    (hash layers)
S25: 128  S29: 7   S32: 20    (mid scored)
S37: 128  S38: 83  S39: 82    (late scored)
Others: 4                     (deterministic)
```
## Performance (M3 Ultra)
| Metric | Value |
|---|---|
| Prefill | ~126 t/s |
| Decode | ~20 t/s |
| Hot experts RAM | 9.7 GB |
| Context | 1M tokens |
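To reproduce these numbers on your own hardware, `llama-bench` from the same build is the easiest route (a sketch; `llama-bench` may not expose the fork's `--moe-hot-count` flag, in which case it measures the default offload behavior):

```bash
# 512-token prefill and 128-token decode, flash attention enabled
./build/bin/llama-bench \
  -m DeepSeek-V4-Flash-Instruct-Q8xQ5.gguf-00001-of-00011.gguf \
  -p 512 -n 128 -fa 1 -t 20
```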
## Quantization details
Tensor types:

```text
f32:  535 tensors (norms, biases, embeddings)
q8_0: 661 tensors (attention, shared expert, router)
q5_K: 129 tensors (expert FFN: gate/up/down)
i32:    3 tensors (expert mapping)
```

BPW: ~5.2
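The headline figure is easy to sanity-check from file size and parameter count (assuming the 184 GB is decimal gigabytes):

```bash
# 184e9 bytes * 8 bits / 284e9 weights ≈ 5.18 bits per weight
python3 -c 'print(f"{184e9 * 8 / 284e9:.2f} bpw")'
```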
## Files
| File | Size | Description |
|---|---|---|
| `*.gguf-00001-of-00011` ~ `*-00011` | 184 GB total | Model weights (GGUF split) |
| `deepseek-ai-DeepSeek-V4.jinja` | 2.3 KB | Required chat template |
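Everything can be fetched in one go with `huggingface-cli` (a sketch using this model card's repo id):

```bash
# Downloads all 11 GGUF shards plus the chat template (~184 GB)
huggingface-cli download setar007/DeepSeek-V4-Flash-Q8xQ5-GGUF \
  --local-dir ./DeepSeek-V4-Flash-Q8xQ5
```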
## Links
- Original model: deepseek-ai/DeepSeek-V4-Flash
- Custom llama.cpp build: setar/llama.cpp, branch `feat/moe-expert-persistence`
- Per-layer MoE offload PR: #22694 (closed without review, AI co-author detected)
- llama.cpp upstream: ggml-org/llama.cpp
Shout-out to the llama.cpp maintainers — your automated PR triage is impressively fast, even if it didn't catch that the code was reviewed, tested, and benchmarked before submission. 🤖 → ❤️