DeepSeek V4 Flash · GGUF

GGUF quantizations of deepseek-ai/DeepSeek-V4-Flash for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.

⚠️ These quants will not load on upstream ggml-org/llama.cpp. V4 architecture, V4-specific Metal and CUDA kernels, and the f16 KV pin are only present in the fork above. Upstreaming is gated on the V3.2/DSA PR (#21149) landing first.

🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (ggml_dsv4_rope_tail, ggml_dsv4_hc_split_sinkhorn, ggml_dsv4_hc_weighted_sum, ggml_dsv4_hc_expand, ggml_dsv4_fp8_kv_quantize) have Metal kernels AND CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind __CUDA_ARCH__ >= 890; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under -DCMAKE_CUDA_ARCHITECTURES=70 but hasn't been runtime-validated yet. CUDA testers wanted — file issues at the fork if you hit problems. ROCm / Vulkan / Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.
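
Which CUDA path you get is determined by compute capability: 8.9 (Ada) and up hits the native FP8 kernels, anything older falls back to the software-emulated path. A quick pre-build check (assumes a reasonably recent nvidia-smi that supports the compute_cap query field):

# Print the compute capability of each visible GPU.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# e.g. "NVIDIA GeForce RTX 5090, 12.0"  -> SM_120, native FP8 path
# e.g. "NVIDIA GeForce RTX 3090, 8.6"   -> SM_86, software-emulated FP8 path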

Available quants

| Quant | Size | BPW | Decode (M3 Ultra) | gate-tools | Notes |
|---|---|---|---|---|---|
| Q8_0 | ~282 GiB (7 shards) | 8.50 | 16 t/s | ✓ pass | Reference. Full-fidelity baseline. |
| Q4_K_M-XL | ~163 GiB (4 shards) | 4.92 | 23.28 t/s | ✓ pass | Recommended. Fastest decode, K-quant body, non-expert tensors and embedding/output pinned at Q8_0. Matches Q8 on tool calling at half the size. |
| Q2_K-XL | ~100 GiB (3 shards) | 3.01 | 17.45 t/s | ✓ pass | Smaller-footprint K-quant alternative to Q4_K_M-XL with the same XL pin recipe. |
| IQ2_XS-XL | ~81 GiB (2 shards) | 2.45 | 23.76 t/s | ✓ pass † | IQ2 body with XL pins. Fastest IQ-class. |
| IQ2_XXS-XL | ~73 GiB (2 shards) | 2.21 | 18.83 t/s | ✓ pass † | IQ2 body with XL pins. |
| IQ1_M-XL | ~63 GiB (2 shards) | 1.91 | 18.42 t/s | ✓ pass † | IQ1_M body with XL pins. |
| IQ1_M | ~60 GiB (2 shards) | 1.81 | 13.56 t/s | ✓ pass † | IQ1_M without XL pins. Below the 16 t/s decode floor on M3 Ultra; use the -XL variant unless disk is tight. |
| IQ1_S-XL | ~57 GiB (2 shards) | 1.73 | 18.30 t/s | ✓ pass † | IQ1_S body with XL pins. Smallest variant clearing the decode floor. |
| imatrix/imatrix-v4-flash.dat | ~449 MiB | — | — | — | wikitext-103 1000-chunk imatrix calibration produced by v4-port-I-imatrix. Reproducibility seed for downstream IQ-class builds. |
| imatrix/dsml.jinja | ~5 KiB | — | — | — | DSML chat template, also baked into every GGUF in this repo. Published here for reference and downstream tooling. |

† All quants in this repo ship with the DSML chat template baked into the GGUF metadata, so llama-server --jinja does the right thing without any extra flags. The imatrix/dsml.jinja file is also published in this repo for reference and downstream tooling.

-XL suffix means non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. Without that pinning, IQ-class quants fall below the 16 t/s decode floor on M3 Ultra.
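
The fork's build scripts apply those pins for you, but for reference, reproducing an -XL build by hand looks roughly like the sketch below. This is an illustration only: the --tensor-type patterns shown are placeholders rather than the fork's authoritative pin list, and they must match the actual V4 tensor names in your converted GGUF.

# Sketch: quantize an -XL variant from the bf16-experts-Q8 staging GGUF.
# Tensor-name patterns are illustrative; see the fork's pin recipe for the
# real list (output_hc, attn_compressor, hc_attn, hc_ffn, indexer, nextn, ...).
./build/bin/llama-quantize \
  --imatrix imatrix/imatrix-v4-flash.dat \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  --tensor-type attn=q8_0 \
  --tensor-type indexer=q8_0 \
  staging-bf16-experts-q8.gguf DeepSeek-V4-Flash-Q4_K_M-XL.gguf Q4_K_M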

Recommended use by quant

| Use case | Recommended | Notes |
|---|---|---|
| General agent / Claude Code workloads | Q4_K_M-XL | Fastest decode, full tool-calling support, half the disk of Q8 |
| Reference / "is this a quant artifact?" debugging | Q8_0 | Full-fidelity baseline |
| Smaller VRAM / disk budget | Q2_K-XL | Same XL recipe at lower BPW |
| Maximum throughput, tighter VRAM | IQ2_XS-XL | Fastest IQ-class quant |

All quants in this repo ship with V4's DSML chat template baked in, so llama-server --jinja does the right thing without any extra flags — no --chat-template-file needed. Tool calls return as proper tool_calls JSON in the response object.
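
For example, a minimal tool-call round trip against the running server looks like this (assumes the default port 8080; get_weather is a made-up tool, and jq is only used to pull out the structured call):

# Send an OpenAI-style tool definition; the model answers with tool_calls.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Lisbon right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | jq '.choices[0].message.tool_calls'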

Loading

# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp

# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j

# OR build for NVIDIA CUDA (Ada/Blackwell; CUDA toolkit >= 12.8 for native SM_120)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89;120" && cmake --build build -j
# Older CUDA toolkits: drop SM_120 (PTX JIT handles Blackwell from SM_89). FP8
# native path needs toolkit >= 11.8; older arches use software-emulated FP8.
#
# Multi-GPU CUDA (2+ devices, scheduler will split graph across them):
# add `-DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128` to the cmake
# line. V4's dense per-layer inputs (hyperconnection + indexer + multiple
# KV caches) exceed the upstream default of 30 at multi-device split
# boundaries. Cost: ~200 MB extra scheduler memory; only needed on
# multi-GPU. Single-GPU runs do not need this flag.

# Download the recommended Q4_K_M-XL shards
hf download teamblobfish/DeepSeek-V4-Flash-GGUF \
  --include "Q4_K_M-XL/*" \
  --local-dir ~/models/DeepSeek-V4-Flash-GGUF

# Run the server (point at the first shard; llama.cpp auto-loads the rest)
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Flash-GGUF/Q4_K_M-XL/DeepSeek-V4-Flash-Q4_K_M-XL-00001-of-00004.gguf \
  --jinja \
  --reasoning off \
  --ctx-size 393216 \
  --n-gpu-layers 999 \
  --flash-attn on \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); --reasoning off is the cleanest baseline for agent workloads.
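
A quick way to verify the server came up and actually picked up the baked-in template (assumes the default host/port; recent llama-server builds expose the active chat template via /props):

# Health check, then confirm the DSML template was read from GGUF metadata.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/props | jq -r '.chat_template' | head -n 5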

Why the -XL recipe (and why no vanilla Q4_K_M)

V4 decode is compute-bound on the indexer / sinkhorn / expert-routing kernels — not on memory bandwidth. That makes the choice of dequant codepath matter as much as the bit-count: Q8_0's int8 × per-block-scale unpack is dramatically simpler than Q4_K_M's super-block path, so on this hardware Q8_0 actually decoded faster than vanilla Q4_K_M in our earlier benchmarks (write-up).

The -XL recipe published here threads that needle: leave the discrimination-critical non-expert tensors at Q8_0 (so attention, embedding, output, etc. all use the fast dequant path) and only compress the routed and shared experts. The result is the best of both — Q4_K_M-XL is half the disk of Q8_0 and decodes ~45% faster (23.28 vs 16 t/s) because the experts barely touch the hot decode path while the bandwidth-heavy non-expert tensors stay on the fast codepath. Same trick applies to all the IQ-class -XL variants below.

We don't publish vanilla Q4_K_M (no XL pins) — it would be both larger and slower than Q4_K_M-XL on this hardware.
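
To reproduce the decode numbers on your own hardware, llama-bench gives a clean apples-to-apples comparison. A sketch (the shard path is an example; -fa/-ngl mirror the server flags above, and -p 0 -n 128 measures pure decode):

# Measure text-generation throughput for one quant; point -m at the first
# shard and the rest load automatically. Re-run with the Q8_0 shard to compare.
./build/bin/llama-bench \
  -m ~/models/DeepSeek-V4-Flash-GGUF/Q4_K_M-XL/DeepSeek-V4-Flash-Q4_K_M-XL-00001-of-00004.gguf \
  -p 0 -n 128 -ngl 999 -fa 1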

Quirks worth knowing

  • --cache-type-k|v q8_0 is overridden to f16 on V4 (the fork emits a LLAMA_LOG_WARN on the first override). V4's K is already FP8-quantized at write time, so q8_0's per-block stationarity assumption breaks.
  • llama-imatrix originally segfaulted on V4 during activation collection. Fixed in v4-port-I-imatrix; the calibration data published alongside these quants (imatrix/imatrix-v4-flash.dat) was produced by the patched binary.
  • --no-repack is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. The repack codepath in ggml/src/ggml-cpu/repack.cpp doesn't release the source mmap, so V4's 282-GiB Q8 source needs ~575 GiB peak RAM at load without the flag. The fork's gates pass --no-repack by default.
  • Validation gates: tests/v4-port/run-all-gates.sh in the fork (a re-run sketch follows this list). Each row in the table above documents the result of that gate suite at the listed BPW.
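
To re-run the gate suite against a local quant, the runner lives in the fork's test tree. A sketch only — we're assuming the script takes the first shard path as its argument; check tests/v4-port/ in the fork for the actual interface:

# Interface assumed; see the fork's tests/v4-port/ directory for real usage.
cd llama.cpp   # the cchuter fork checked out in the Loading section
./tests/v4-port/run-all-gates.sh \
  ~/models/DeepSeek-V4-Flash-GGUF/Q4_K_M-XL/DeepSeek-V4-Flash-Q4_K_M-XL-00001-of-00004.gguf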

Provenance

  • Source: deepseek-ai/DeepSeek-V4-Flash HF safetensors (FP8 e4m3 weights, FP4 routed experts).
  • Q8_0: built via convert_hf_to_gguf.py --outtype q8_0 --deepseek4-expert-outtypes q8_0 (M3 Ultra, ~30–60 min wall), split into 50 GiB shards with llama-gguf-split.
  • bf16-experts-Q8 staging GGUF (not published): built via convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes q8_0. Used as the source for IQ1/IQ2/Q2_K-XL/Q4_K_M-XL builds below to preserve embed.weight and output.weight BF16 source precision (other discrimination-critical tensors are FP8-native in the source so Q8 staging is essentially lossless for them).
  • IQ1/IQ2/Q2_K-XL/Q4_K_M-XL builds: produced via llama-quantize --imatrix imatrix-v4-flash.dat with the v4-port fork's V4-tensor pin recipe (output_hc, attn_compressor, attn_q_a/b, attn_kv, attn_output_a/b, hc_attn, hc_ffn, indexer, nextn all at Q8_0 in -XL variants).
  • imatrix: wikitext-103 test split, 1000 chunks, ~1M tokens (a reproduction sketch follows this list). Per-class layer coverage verified by tests/v4-port/gate-imatrix.sh.
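
Regenerating the calibration data looks roughly like this (a sketch: the wikitext path and model shard are examples, and the patched llama-imatrix from the fork is required since the upstream binary segfaults on V4 as noted above):

# Rebuild the imatrix from the wikitext-103 test split with the patched binary.
./build/bin/llama-imatrix \
  -m ~/models/DeepSeek-V4-Flash-GGUF/Q8_0/DeepSeek-V4-Flash-Q8_0-00001-of-00007.gguf \
  -f wikitext-103-raw/wiki.test.raw \
  -o imatrix-v4-flash.dat \
  --chunks 1000 -ngl 999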

License

MIT, matching the upstream DeepSeek V4 Flash license.
