How to use with llama.cpp
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF
# Run inference directly in the terminal:
llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF
# Run inference directly in the terminal:
llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF
# Run inference directly in the terminal:
./llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF
Use Docker
docker model run hf.co/Preyazz/DeepSeek-V4-Flash-GGUF
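Whichever install method you use, llama-server exposes the same OpenAI-compatible HTTP API, and the web UI is served from the same address. Below is a minimal sketch of a chat request, assuming the server is running on its default 127.0.0.1:8080 with no API key configured; adjust the address if you passed --host or --port.

```sh
# Query the local OpenAI-compatible endpoint started by llama-server.
# The "model" field is only a label here; the server answers with whatever model it loaded.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v4-flash",
        "messages": [
          {"role": "user", "content": "Explain in one sentence what a GGUF file is."}
        ]
      }'
```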
DeepSeek-V4-Flash GGUF (community quants)
Quantized variants of deepseek-ai/DeepSeek-V4-Flash (284B params, 13B active per token).
Quants
| Quant | Approx size | Recommended? |
|---|---|---|
| (removed) | ~~~54 GB~~ | |
| (removed) | ~~~87 GB~~ | |
| (removed) | ~~~109 GB~~ | |
| Q2_K | ~96 GB | |
| Q3_K_M | ~125 GB | |
| Q4_K_M | ~161 GB | |
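The -hf invocations above download this repo's default quant; to pick one of the quants from the table explicitly, append its name after a colon. A sketch using Q4_K_M (recent llama.cpp builds accept this :quant suffix; any quant listed above works the same way):

```sh
# Fetch and serve the Q4_K_M files from this repo specifically
llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M

# Or run it once in the terminal
llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M
```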
Provenance
All variants are derived from Preyazz/DeepSeek-V4-Flash-Q8_0-GGUF, which is itself a lossless conversion of the original FP8 safetensors.
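If you want to reproduce a variant (or build one not listed above) from that Q8_0 parent, llama.cpp ships a llama-quantize tool. A minimal sketch; the file names below are placeholders rather than the exact names in this repo, and a model this large may be split across several GGUF shards:

```sh
# Re-quantize the Q8_0 source GGUF down to Q4_K_M
# (paths are hypothetical examples; point them at the downloaded Q8_0 file)
./build/bin/llama-quantize DeepSeek-V4-Flash-Q8_0.gguf DeepSeek-V4-Flash-Q4_K_M.gguf Q4_K_M
```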
Compatibility
Requires llama.cpp built from PR #22378 (nisparks's wip/deepseek-v4-support branch) or later. The deepseek4 architecture is not yet in stable llama.cpp releases.
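Until that support reaches a tagged release, one way to get it is to build directly from the PR's head ref. A sketch, assuming the PR is filed against ggerganov/llama.cpp (GitHub exposes every pull request as pull/<number>/head):

```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Fetch PR #22378 (wip/deepseek-v4-support) into a local branch and check it out
git fetch origin pull/22378/head:deepseek-v4-support
git checkout deepseek-v4-support
cmake -B build
cmake --build build -j --target llama-server llama-cli
```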
For Strix Halo / consumer ROCm: build with GGML_HIP_NO_VMM=ON (VMM=ON currently crashes on gfx1151 — see ROCm Issue #6146).
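For reference, a HIP build with that setting might look like the sketch below. GGML_HIP_NO_VMM is the flag named above; the remaining options (GGML_HIP, AMDGPU_TARGETS) follow llama.cpp's general ROCm build instructions and may need adjusting for your ROCm version:

```sh
# HIP/ROCm build targeting gfx1151 (Strix Halo) with the VMM allocator disabled
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_NO_VMM=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target llama-server llama-cli
```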
Model tree for Preyazz/DeepSeek-V4-Flash-GGUF
Base model: deepseek-ai/DeepSeek-V4-Flash
Quantized from: Preyazz/DeepSeek-V4-Flash-Q8_0-GGUF