---
license: mit
base_model:
  - deepseek-ai/DeepSeek-V4-Flash
library_name: llama.cpp
pipeline_tag: text-generation
tags:
  - gguf
  - deepseek-v4
  - deepseek-v4-flash
  - flash-moe
  - slot-bank
  - ssd
  - fp8
  - fp4
  - mxfp4
  - metal
---

DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package

This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.

Quantization

  • Dense/shared tensors: native DeepSeek FP8, represented as F8_E4M3_B128 in GGUF.
  • Routed MoE expert tensors: native DeepSeek FP4, represented as MXFP4 in the sidecar manifest.
  • Embeddings, the output head, norms, routing metadata, and ID tensors remain BF16, F32, or I32 as appropriate.
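For orientation, the two element formats above can be decoded as follows. This is a generic sketch of OCP FP8 E4M3 and FP4 E2M1 element decoding, not code from this package; the block scales these formats are paired with (128-element blocks for F8_E4M3_B128, 32-element MX blocks with a shared E8M0 scale for MXFP4) are applied separately and omitted here.

```python
# Sketch: per-element decode of FP8 E4M3 and FP4 E2M1 values.
# Illustrative only -- real dequantization also multiplies in the
# per-block scales, which this sketch omits.

def decode_e4m3(byte: int) -> float:
    """Decode one OCP FP8 E4M3 value (1 sign, 4 exponent, 3 mantissa, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return float("nan")                      # E4M3 has NaN but no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6    # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def decode_e2m1(nibble: int) -> float:
    """Decode one FP4 E2M1 value (1 sign, 2 exponent, 1 mantissa, bias 1)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:
        return sign * (man / 2.0)                # subnormal: 0 or 0.5
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)

print(decode_e4m3(0x38))  # 1.0
print(decode_e2m1(0x7))   # 6.0 (largest E2M1 magnitude)
```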

The routed expert tensors are not stored in the dense GGUF; they are stored in the sidecar as layer-major binary banks, one layer_NNN.bin bank per block.
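The manifest schema is not documented here. As a sketch, assuming manifest.json maps each layer bank to per-expert byte offsets and lengths, a loader could slice one expert out of a bank as below. All field names (layers, experts, offset, nbytes) are hypothetical, not the package's actual schema.

```python
import os

# Hypothetical manifest layout -- field names are illustrative only.
# Each layer bank concatenates that block's routed-expert tensors; the
# manifest records where each expert's bytes begin and how long they are.
manifest = {
    "layers": [
        {"file": "layer_000.bin",
         "experts": [{"id": 0, "offset": 0, "nbytes": 64},
                     {"id": 1, "offset": 64, "nbytes": 64}]},
    ]
}

def load_expert_bytes(sidecar_dir, layer_idx, expert_id, manifest):
    """Read one routed expert's raw bytes out of its layer-major bank."""
    layer = manifest["layers"][layer_idx]
    entry = next(e for e in layer["experts"] if e["id"] == expert_id)
    with open(os.path.join(sidecar_dir, layer["file"]), "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["nbytes"])
```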

Files

dense/
  model-dense.gguf
  flashmoe-package.json

sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin

Model Details

  • Architecture: deepseek4
  • Blocks: 43
  • Experts: 256
  • Active experts per token: 6
  • Context length: 1048576
  • Dense GGUF tensors: 1199
  • Routed expert sidecar entries: 129
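The counts above are self-consistent under one plausible reading, which is an assumption rather than a documented layout: three routed-expert weight banks per block (gate/up/down projections) across 43 blocks gives the 129 sidecar entries, and top-6 routing over 256 experts means only about 2.3% of routed experts are active per token.

```python
# Sanity arithmetic over the model-card numbers. The "3 banks per block"
# reading (gate/up/down projections) is an assumption, not documented.
blocks = 43
experts = 256
active = 6
sidecar_entries = 129

assert blocks * 3 == sidecar_entries   # 43 blocks x 3 projection banks
print(f"active expert fraction: {active / experts:.2%}")  # 2.34%
```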

Example

./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
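The --moe-slot-bank 96 flag caps how many routed experts stay resident at once; the rest are paged in from the SSD sidecar on demand. A minimal sketch of that idea follows; it is a generic LRU slot cache illustrating the concept, not this runtime's actual implementation.

```python
from collections import OrderedDict

class SlotBank:
    """LRU cache of resident expert weights, sized like --moe-slot-bank.

    Generic illustration of slot-bank paging, not llama.cpp's code: on a
    miss the least-recently-used slot is evicted and the expert is
    (notionally) read from the SSD sidecar via the fetch callback.
    """
    def __init__(self, slots, fetch):
        self.slots = slots          # max resident experts (e.g. 96)
        self.fetch = fetch          # callback: (layer, expert) -> weights
        self.bank = OrderedDict()   # (layer, expert) -> weights, LRU order
        self.hits = self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.bank:
            self.bank.move_to_end(key)            # mark most-recently-used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.bank) >= self.slots:
                self.bank.popitem(last=False)     # evict LRU slot
            self.bank[key] = self.fetch(layer, expert)
        return self.bank[key]

# Toy usage: 4 slots, a dummy fetch standing in for an SSD read.
bank = SlotBank(slots=4, fetch=lambda l, e: f"weights[{l}:{e}]")
for expert in [0, 1, 2, 3, 0, 1, 4, 0]:
    bank.get(0, expert)
print(bank.hits, bank.misses)  # 3 5
```

Because routing reuses hot experts across consecutive tokens, a slot bank much smaller than the full 256-expert set can keep the miss (SSD read) rate low.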

This package is not a standalone dense-only GGUF. Use a Flash-MoE-aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.