
DeepSeek V4-Flash GGUF — BatiAI Early Access

⚠️ Early Access — requires bati.cpp to run. DeepSeek V4-Flash was released on 2026-04-24 and is not yet supported by ggml-org/llama.cpp master. This GGUF was converted with batiai/bati.cpp — BatiAI's own inference library — and running it requires the same library. Ollama is not yet compatible (it will become compatible automatically once mainline merges V4 support).

Why ship now?

V4-Flash is a frontier-class model (284B-A13B, top-tier on SWE-Bench Pro), and mainline support will likely take 1-2 more weeks. We provide this interim release so power users can evaluate the model immediately. When mainline merges support, this repo will gain imatrix-calibrated IQ3_XXS / IQ4_XS / Q5_K_M quants plus an Ollama push.

Available Quants

Quant    Size    Recommended hardware            Notes
Q3_K_M   127 GB  128 GB unified memory (M4 Max)  3-bit, smallest
Q4_K_M   161 GB  192 GB+ Mac                     balanced (recommended)
Q5_K_M   188 GB  256 GB+ Mac                     higher fidelity
Q6_K     218 GB  384 GB+ Mac                     near-lossless
Q8_0     282 GB  M3 Ultra 512 GB                 original FP4 → Q8_0 dequant
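
As a sanity check, file size follows roughly from n_params × bits-per-weight / 8. A minimal sketch, assuming nominal bits-per-weight of 8.5 for Q8_0 and roughly 4.85 for Q4_K_M (approximations; real GGUF files also carry metadata and some non-quantized tensors):

# Rough size estimate: params * bpw / 8 bytes, shown in GiB
awk 'BEGIN { printf "Q8_0   ~ %.0f GiB\n", 284e9 * 8.50 / 8 / 2^30 }'   # ~281, matches the 282 GB row
awk 'BEGIN { printf "Q4_K_M ~ %.0f GiB\n", 284e9 * 4.85 / 8 / 2^30 }'   # ~160, close to the 161 GB row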

All quants are signed by BatiAI (general.author = BatiAI, general.url = https://flow.bati.ai).
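
To check these fields locally, the gguf Python package (pip install gguf) ships a gguf-dump utility that prints the header key/value pairs. A sketch, assuming the flag names from the upstream gguf-py scripts (this card does not document them):

# pip install gguf
gguf-dump --no-tensors ./v4-flash/merged-Q8_0.gguf | grep -E 'general\.(author|url)'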

Note: IQ-quants (IQ3_XXS / IQ4_XS) are tracked in bati.cpp v0.2.0. They require imatrix calibration, and llama-imatrix currently segfaults during V4-Flash model context init in the fork; they will be added once that path is fixed (or once mainline llama.cpp merges V4 support). The K-quants above use bati.cpp v0.1.2's integer-tensor pass-through patch.

How to run inference (build bati.cpp)

# 1. Clone + build BatiAI's inference library
git clone https://github.com/batiai/bati.cpp.git
cd bati.cpp
cmake -B build -DGGML_CUDA=ON      # Linux
# or: cmake -B build -DGGML_METAL=ON  # macOS (Mac Studio M3 Ultra 512GB recommended)
cmake --build build -j 16 --target llama-cli llama-gguf-split

# 2. Download shards + merge into a single GGUF
hf download batiai/DeepSeek-V4-Flash-GGUF --include "*Q8_0*" --local-dir ./v4-flash   # fetch only the Q8_0 shards
build/bin/llama-gguf-split --merge \
    ./v4-flash/deepseek-ai-DeepSeek-V4-Flash-Q8_0-00001-of-00007.gguf \
    ./v4-flash/merged-Q8_0.gguf

# 3. Inference
build/bin/llama-cli -m ./v4-flash/merged-Q8_0.gguf -cnv -ngl 99 -c 4096
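
The commands above cover interactive use with llama-cli. Upstream llama.cpp also ships llama-server with an OpenAI-compatible HTTP API; assuming bati.cpp retains that target (an assumption, not something this card confirms), serving would look like:

# 4. Optional: HTTP serving (assumes bati.cpp keeps upstream's llama-server target)
cmake --build build -j 16 --target llama-server
build/bin/llama-server -m ./v4-flash/merged-Q8_0.gguf -ngl 99 -c 4096 --port 8080

# Query the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":128}'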

Model details

  • Source: deepseek-ai/DeepSeek-V4-Flash
  • Architecture: 284B total / 13B active MoE, 1M context window
  • Hybrid attention: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA)
  • Original precision: FP4 + FP8 mixed (FP4 expert weights, FP8 attention)
  • This GGUF: Q8_0 dequantization (FP4 → Q8_0 directly; BF16 intermediate skipped)
  • License: MIT

What happens after mainline merges V4

When ggml-org/llama.cpp master merges DeepSeek V4:

  1. We rebuild with mainline and run imatrix calibration (wikitext-2, 200 chunks; see the sketch after this list)
  2. Add IQ3_XXS, IQ4_XS, Q5_K_M quants to this repo (BatiAI-signed)
  3. Push to Ollama: batiai/deepseek-v4-flash:iq3 / :iq4 / :q5
  4. Run real-hardware Mac benchmarks (M3 Ultra 512GB)
  5. bati.cpp's V4 support transitions to a read-only archive (users migrate to mainline)
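
For reference, the step-1 calibration would look roughly like this with mainline's llama-imatrix (a sketch only; the wikitext-2 path and output filename are illustrative placeholders):

# Sketch of step 1 (requires mainline V4 support; paths are illustrative)
llama-imatrix -m ./v4-flash/merged-Q8_0.gguf \
    -f wikitext-2-raw/wiki.train.raw --chunks 200 -o v4-flash.imatrix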

Watch this repo for the update.

BatiAI signing

All GGUFs in this repo carry:

  • general.author = BatiAI
  • general.url = https://flow.bati.ai

About bati.cpp

batiai/bati.cpp is BatiAI's own inference library — a llama.cpp-based fork focused on Apple Silicon, frontier-model early access, and BatiAI's quantization standard. Built on top of ggml-org/llama.cpp and antirez/llama.cpp-deepseek-v4-flash (all MIT). See bati.cpp's ATTRIBUTION.md for full credits.

License

Inherits the source model license: MIT.

About BatiFlow

BatiFlow — free on-device AI automation for Mac.
