Instructions to use Preyazz/DeepSeek-V4-Flash-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Preyazz/DeepSeek-V4-Flash-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Preyazz/DeepSeek-V4-Flash-GGUF",
	filename="DeepSeek-V4-Flash-Q2_K.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use Preyazz/DeepSeek-V4-Flash-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use Preyazz/DeepSeek-V4-Flash-GGUF with Ollama:
```
ollama run hf.co/Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M
```

Unsloth Studio new

How to use Preyazz/DeepSeek-V4-Flash-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Preyazz/DeepSeek-V4-Flash-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Preyazz/DeepSeek-V4-Flash-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Preyazz/DeepSeek-V4-Flash-GGUF to start chatting

Docker Model Runner
How to use Preyazz/DeepSeek-V4-Flash-GGUF with Docker Model Runner:
```
docker model run hf.co/Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M
```

Lemonade

How to use Preyazz/DeepSeek-V4-Flash-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Preyazz/DeepSeek-V4-Flash-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.DeepSeek-V4-Flash-GGUF-Q4_K_M

List all available models

lemonade list

GFX1151

by X-AI-GT - opened about 23 hours ago

Discussion

X-AI-GT

about 23 hours ago

Dear expert, do you have any methods to enable GFX1151 to use all shared memory under Windows 11 using the ROCm backend? Currently, I'm trying to use Vulkan to load models with weights exceeding 100G and achieve GPU and VRAM speeds, but ROCm fails! I wonder if you have any relevant experience to share. Thank you!

Preyazz

Owner about 19 hours ago

Hey man, definitely not an expert haha. For context I'm running a GMKtec EVO-X2 (Strix Halo, 128 GB) on Fedora with ROCm 7.2 in a toolbox container. The way I get gfx1151 to use all of system RAM is mostly Linux-specific, but at least one piece is hardware-level and worth trying on Windows too:

BIOS UMA Frame Buffer Size set to the minimum, 1GB on my GMKtec EVO-X2. This is firmware-level and applies on either OS. Counter-intuitive, but the trick is shrinking the BAR-mapped “dedicated VRAM” so almost everything goes through shared memory instead. Worth trying first since you can flip it independently.
amdgpu.no_system_mem_limit=1 kernel parameter, and /sys/module/amdgpu/parameters/no_system_mem_limit set to Y at runtime. This one is Linux-specific.
llama.cpp built from source with -DGGML_HIP_NO_VMM=ON. VMM=ON crashes on gfx1151, see ROCm issue #6146. Build flag applies on either OS.

With those three together on my Linux box, a single hipMalloc of ~`120GB` succeeds and the model lives in system RAM addressable by the iGPU.

Whether the BIOS and VMM-off bits alone are enough on Windows ROCm I genuinely don’t know - let me know if you try it!

When I get a gap I’ll have another go at the IQ quants - my last batch failed miserably.

X-AI-GT

about 18 hours ago

I have tried setting the minimum video memory to 0.5G on Windows 11, but this setting didn't yield any miraculous results. It even caused Vulkan to lose its ability to break through the 96G limit, and ROCm fared even worse, only recognizing about half of the video memory, which is frustrating! I also tried setting it to manual mode and maximizing the allocatable video memory to 124G, but there was still no breakthrough. Finally, when setting it to 96G, I decided to give it a try. Surprisingly, Vulkan loaded directly without displaying an OOM or any other error message, loading a 103G weight model directly into the video memory. This is a MiniMax-M2.7 model, and the peak speed during the first round of dialog can reach 30 tokens/s. So I am sure that some of the unified memory allocated to system memory was mistakenly treated as video memory, otherwise it wouldn't have achieved such a speed! If the expert has any new findings, please let me know. Of course, if I have any new discoveries under Windows, I would also be very happy to share them with you! Thank you for your reply! I would like to express my gratitude again。

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment