kylesayrs committed 9 days ago

Commit bf7ad50 (verified) · 1 parent: 5be6632

Upload folder using huggingface_hub

Files changed (11)

.gitattributes +1 -0
config.json +238 -0
generation_config.json +9 -0
model-00001-of-00004.safetensors +3 -0
model-00002-of-00004.safetensors +3 -0
model-00003-of-00004.safetensors +3 -0
model-00004-of-00004.safetensors +3 -0
model.safetensors.index.json +3 -0
recipe.yaml +67 -0
tokenizer.json +0 -0
tokenizer_config.json +14 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text

config.json ADDED

@@ -0,0 +1,238 @@

+{
+  "architectures": [
+    "DeepseekV4ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 0,
+  "compress_ratios": [
+    0,
+    0,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4,
+    128,
+    4
+  ],
+  "compress_rope_parameters": {
+    "beta_fast": 32,
+    "beta_slow": 1,
+    "factor": 16,
+    "original_max_position_embeddings": 65536,
+    "partial_rotary_factor": 0.125,
+    "rope_theta": 160000,
+    "rope_type": "yarn",
+    "type": "yarn"
+  },
+  "compress_rope_theta": 160000,
+  "dtype": "bfloat16",
+  "eos_token_id": 1,
+  "expert_dtype": "fp4",
+  "first_k_dense_replace": null,
+  "hc_eps": 1e-06,
+  "hc_mult": 4,
+  "hc_sinkhorn_iters": 20,
+  "head_dim": 512,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "index_head_dim": 128,
+  "index_n_heads": 64,
+  "index_topk": 512,
+  "initializer_range": 0.02,
+  "intermediate_size": 18432,
+  "kv_lora_rank": null,
+  "max_position_embeddings": 1048576,
+  "model_type": "deepseek_v4",
+  "moe_intermediate_size": 2048,
+  "n_group": null,
+  "n_routed_experts": 256,
+  "n_shared_experts": 1,
+  "norm_topk_prob": true,
+  "num_attention_heads": 64,
+  "num_experts_per_tok": 6,
+  "num_hash_layers": 3,
+  "num_hidden_layers": 43,
+  "num_key_value_heads": 1,
+  "num_nextn_predict_layers": 1,
+  "o_groups": 8,
+  "o_lora_rank": 1024,
+  "output_router_logits": false,
+  "pad_token_id": null,
+  "partial_rotary_factor": 0.125,
+  "pretraining_tp": 1,
+  "q_lora_rank": 1024,
+  "qk_nope_head_dim": 448,
+  "qk_rope_head_dim": 64,
+  "quantization_config": {
+    "config_groups": {
+      "group_0": {
+        "format": "float-quantized",
+        "input_activations": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": true,
+          "group_size": 128,
+          "num_bits": 8,
+          "observer": null,
+          "observer_kwargs": {},
+          "scale_dtype": null,
+          "strategy": "group",
+          "symmetric": true,
+          "type": "float",
+          "zp_dtype": null
+        },
+        "output_activations": null,
+        "targets": [
+          "re:model.*attn.*(wkv|wo_a|wo_b|wq_a|wq_b)$",
+          "re:model.*attn\\.compressor.*(wgate|wkv)$"
+        ],
+        "weights": {
+          "actorder": null,
+          "block_structure": [
+            128,
+            128
+          ],
+          "dynamic": false,
+          "group_size": null,
+          "num_bits": 8,
+          "observer": "memoryless_minmax",
+          "observer_kwargs": {},
+          "scale_dtype": null,
+          "strategy": "block",
+          "symmetric": true,
+          "type": "float",
+          "zp_dtype": null
+        }
+      },
+      "group_1": {
+        "format": "nvfp4-pack-quantized",
+        "input_activations": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": "local",
+          "group_size": 16,
+          "num_bits": 4,
+          "observer": "static_minmax",
+          "observer_kwargs": {},
+          "scale_dtype": "torch.float8_e4m3fn",
+          "strategy": "tensor_group",
+          "symmetric": true,
+          "type": "float",
+          "zp_dtype": null
+        },
+        "output_activations": null,
+        "targets": [
+          "re:model.*mlp.*(gate|up|down)_proj$"
+        ],
+        "weights": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": false,
+          "group_size": 16,
+          "num_bits": 4,
+          "observer": "memoryless_minmax",
+          "observer_kwargs": {},
+          "scale_dtype": "torch.float8_e4m3fn",
+          "strategy": "tensor_group",
+          "symmetric": true,
+          "type": "float",
+          "zp_dtype": null
+        }
+      }
+    },
+    "format": "mixed-precision",
+    "global_compression_ratio": null,
+    "ignore": [
+      "model.layers.2.self_attn.compressor.indexer.weights_proj",
+      "model.layers.4.self_attn.compressor.indexer.weights_proj",
+      "model.layers.6.self_attn.compressor.indexer.weights_proj",
+      "model.layers.8.self_attn.compressor.indexer.weights_proj",
+      "model.layers.10.self_attn.compressor.indexer.weights_proj",
+      "model.layers.12.self_attn.compressor.indexer.weights_proj",
+      "model.layers.14.self_attn.compressor.indexer.weights_proj",
+      "model.layers.16.self_attn.compressor.indexer.weights_proj",
+      "model.layers.18.self_attn.compressor.indexer.weights_proj",
+      "model.layers.20.self_attn.compressor.indexer.weights_proj",
+      "model.layers.22.self_attn.compressor.indexer.weights_proj",
+      "model.layers.24.self_attn.compressor.indexer.weights_proj",
+      "model.layers.26.self_attn.compressor.indexer.weights_proj",
+      "model.layers.28.self_attn.compressor.indexer.weights_proj",
+      "model.layers.30.self_attn.compressor.indexer.weights_proj",
+      "model.layers.32.self_attn.compressor.indexer.weights_proj",
+      "model.layers.34.self_attn.compressor.indexer.weights_proj",
+      "model.layers.36.self_attn.compressor.indexer.weights_proj",
+      "model.layers.38.self_attn.compressor.indexer.weights_proj",
+      "model.layers.40.self_attn.compressor.indexer.weights_proj",
+      "model.layers.42.self_attn.compressor.indexer.weights_proj",
+      "lm_head"
+    ],
+    "kv_cache_scheme": null,
+    "quant_method": "compressed-tensors",
+    "quantization_status": "compressed",
+    "sparsity_config": {},
+    "transform_config": {},
+    "version": "0.15.1.dev15+g11daf97.d20260427"
+  },
+  "rms_norm_eps": 1e-06,
+  "rope_interleave": true,
+  "rope_parameters": {
+    "beta_fast": 32,
+    "beta_slow": 1,
+    "factor": 16,
+    "original_max_position_embeddings": 65536,
+    "partial_rotary_factor": 0.125,
+    "rope_theta": 10000,
+    "rope_type": "yarn",
+    "type": "yarn"
+  },
+  "rope_theta": 10000,
+  "routed_scaling_factor": 1.5,
+  "router_aux_loss_coef": 0.001,
+  "router_jitter_noise": 0.0,
+  "scoring_func": "sqrtsoftplus",
+  "sliding_window": 128,
+  "swiglu_limit": 10.0,
+  "tie_word_embeddings": false,
+  "topk_group": null,
+  "topk_method": "noaux_tc",
+  "transformers_version": "5.7.0.dev0",
+  "use_cache": true,
+  "v_head_dim": null,
+  "vocab_size": 129280
+}
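
Several numeric fields in this config are related by simple arithmetic, and checking them is a quick way to catch a mangled upload. The sketch below copies the relevant values from the diff above and verifies four relations: the head dimension splits into non-rotary and rotary parts, the rotary fraction matches `partial_rotary_factor`, the YaRN factor times the original context length gives `max_position_embeddings`, and `compress_ratios` has one entry per hidden layer. The helper name `check_config` and the flattened key `rope_factor` are illustrative, not part of any library. The last lines only observe that the layers with a compress ratio of 4 are the same layer indices that appear in the `ignore` list; they make no claim about why.

```python
# Values copied from the config.json diff above.
config = {
    "head_dim": 512,
    "qk_nope_head_dim": 448,
    "qk_rope_head_dim": 64,
    "partial_rotary_factor": 0.125,
    "max_position_embeddings": 1048576,
    "num_hidden_layers": 43,
    "rope_factor": 16,  # rope_parameters.factor (flattened here for brevity)
    "original_max_position_embeddings": 65536,
}

# compress_ratios as listed: two zeros, then 4 and 128 alternating, ending on 4.
compress_ratios = [0, 0] + [4 if i % 2 == 0 else 128 for i in range(41)]


def check_config(cfg, ratios):
    """Return named boolean checks (illustrative helper, not a library API)."""
    return {
        "head_dim_split": cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"]
        == cfg["head_dim"],
        "rotary_fraction": cfg["partial_rotary_factor"] * cfg["head_dim"]
        == cfg["qk_rope_head_dim"],
        "yarn_context": cfg["rope_factor"] * cfg["original_max_position_embeddings"]
        == cfg["max_position_embeddings"],
        "one_ratio_per_layer": len(ratios) == cfg["num_hidden_layers"],
    }


checks = check_config(config, compress_ratios)

# Layer indices whose compress ratio is 4.
ratio4_layers = [i for i, r in enumerate(compress_ratios) if r == 4]

# Layer indices named in quantization_config.ignore (layers 2, 4, ..., 42).
ignored_indexer_layers = list(range(2, 43, 2))

for name, ok in checks.items():
    print(f"{name}: {ok}")
print("ratio-4 layers match ignore list:", ratio4_layers == ignored_indexer_layers)
```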

generation_config.json ADDED

@@ -0,0 +1,9 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 0,
+  "do_sample": true,
+  "eos_token_id": 1,
+  "temperature": 1.0,
+  "top_p": 1.0,
+  "transformers_version": "5.7.0.dev0"
+}
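
With `do_sample` set to true, `temperature` at 1.0, and `top_p` at 1.0, sampling draws from the model's unmodified softmax distribution: the temperature does not rescale the logits and the nucleus filter keeps every token. The following is a generic, standard-library illustration of those two parameters, not the Transformers implementation. The function name `sample_token` and the example logits are made up for the sketch.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature plus nucleus (top-p) sampling over a list of logits.

    Returns the sampled index and the list of indices that survived the filter.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    weights = [probs[i] for i in kept]
    choice = rng.choices(kept, weights=weights, k=1)[0]
    return choice, kept


example_logits = [2.0, 1.0, 0.0]  # hypothetical three-token vocabulary

_, kept_default = sample_token(example_logits, temperature=1.0, top_p=1.0)
_, kept_narrow = sample_token(example_logits, temperature=1.0, top_p=0.5)
print("kept with top_p=1.0:", sorted(kept_default))
print("kept with top_p=0.5:", sorted(kept_narrow))
```

Lowering `top_p` below 1.0 or moving `temperature` away from 1.0 at request time changes this behavior; the values in the file are only defaults.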

model-00001-of-00004.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7a0db4ccc0dbf59d4efa7827a03c1419e936a4acaa209a159ea21cca2de8056d
+size 50004001220

model-00002-of-00004.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:218785a9faf4aa9caffec98b8397179cb4d23e52c26a123155c4880342f320d1
+size 50004594224

model-00003-of-00004.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4c23c539c92ca38c459729c727ea0341ae965efd2ff2639da62411d9f4569d74
+size 50001290592

model-00004-of-00004.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:136c70a9a390971d7422cbc38ac6944386a8b1892d5c80162846dec7649e9038
+size 13900266712

model.safetensors.index.json ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2976aeaf60de0cb11ba7bf7330839e58c915a43fcc3f2873f47d88a51e01dda0
+size 13404816
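
The safetensors entries above are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object ID, and the size in bytes of the real file. A minimal sketch of reading one pointer and totalling the four shard sizes follows. The function name `parse_lfs_pointer` is illustrative; the pointer text and sizes are copied from the diffs above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is the byte count of the real file
    return fields


# Pointer for model-00001-of-00004.safetensors, copied from the diff.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:7a0db4ccc0dbf59d4efa7827a03c1419e936a4acaa209a159ea21cca2de8056d
size 50004001220
"""

pointer = parse_lfs_pointer(pointer_text)

# Sizes of the four weight shards, in bytes, copied from the diffs.
shard_sizes = [50004001220, 50004594224, 50001290592, 13900266712]
total_bytes = sum(shard_sizes)

print(pointer["oid"])
print(f"total weight size: {total_bytes} bytes ({total_bytes / 1e9:.1f} GB)")
```

The four shards add up to 163,910,152,748 bytes, roughly 163.9 GB, which is the amount of disk space a full download of the weights needs.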

recipe.yaml ADDED

@@ -0,0 +1,67 @@

+default_stage:
+  default_modifiers:
+    QuantizationModifier:
+      config_groups:
+        attention:
+          targets: ['re:model.*attn.*(wkv|wo_a|wo_b|wq_a|wq_b)$', 're:model.*attn\.compressor.*(wgate|wkv)$']
+          weights:
+            num_bits: 8
+            type: float
+            symmetric: true
+            group_size: null
+            strategy: block
+            block_structure: [128, 128]
+            dynamic: false
+            actorder: null
+            scale_dtype: null
+            zp_dtype: null
+            observer: memoryless_minmax
+            observer_kwargs: {}
+          input_activations:
+            num_bits: 8
+            type: float
+            symmetric: true
+            group_size: 128
+            strategy: group
+            block_structure: null
+            dynamic: true
+            actorder: null
+            scale_dtype: null
+            zp_dtype: null
+            observer: null
+            observer_kwargs: {}
+          output_activations: null
+          format: null
+        experts:
+          targets: ['re:model.*mlp.*(gate|up|down)_proj$']
+          weights:
+            num_bits: 4
+            type: float
+            symmetric: true
+            group_size: 16
+            strategy: tensor_group
+            block_structure: null
+            dynamic: false
+            actorder: null
+            scale_dtype: torch.float8_e4m3fn
+            zp_dtype: null
+            observer: memoryless_minmax
+            observer_kwargs: {}
+          input_activations:
+            num_bits: 4
+            type: float
+            symmetric: true
+            group_size: 16
+            strategy: tensor_group
+            block_structure: null
+            dynamic: local
+            actorder: null
+            scale_dtype: torch.float8_e4m3fn
+            zp_dtype: null
+            observer: static_minmax
+            observer_kwargs: {}
+          output_activations: null
+          format: null
+      targets: [Linear]
+      ignore: [lm_head]
+      bypass_divisibility_checks: false
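
The recipe assigns modules to the `attention` (FP8) or `experts` (NVFP4) group through `targets` entries prefixed with `re:`, which mark them as regular expressions. The sketch below shows how those patterns sort a few module names. The module names are hypothetical examples chosen to resemble the names in the `ignore` list, and the use of `re.match` (anchored at the start of the name) is an assumption about the matching semantics, not a statement of how compressed-tensors implements it.

```python
import re

# Patterns copied from recipe.yaml.
TARGETS = {
    "attention": [
        r"re:model.*attn.*(wkv|wo_a|wo_b|wq_a|wq_b)$",
        r"re:model.*attn\.compressor.*(wgate|wkv)$",
    ],
    "experts": [r"re:model.*mlp.*(gate|up|down)_proj$"],
}
IGNORE = ["lm_head"]


def match_group(module_name):
    """Return the first config group whose pattern matches, or None."""
    if module_name in IGNORE:
        return None
    for group, patterns in TARGETS.items():
        for pattern in patterns:
            # Strip the "re:" marker before compiling the expression.
            regex = pattern[3:] if pattern.startswith("re:") else re.escape(pattern)
            if re.match(regex, module_name):
                return group
    return None


# Hypothetical module names, for illustration only.
examples = [
    "model.layers.2.self_attn.wq_a",
    "model.layers.2.self_attn.compressor.wgate",
    "model.layers.5.mlp.experts.7.up_proj",
    "model.layers.5.mlp.gate",
    "lm_head",
]

for name in examples:
    print(f"{name} -> {match_group(name)}")
```

Under these assumptions a bare `mlp.gate` module falls through both groups because the experts pattern requires the `_proj` suffix, and `lm_head` is excluded by the `ignore` entry.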

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

@@ -0,0 +1,14 @@

+{
+  "backend": "tokenizers",
+  "bos_token": "<｜begin▁of▁sentence｜>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<｜end▁of▁sentence｜>",
+  "is_local": true,
+  "legacy": true,
+  "local_files_only": false,
+  "model_max_length": 1048576,
+  "pad_token": "<｜end▁of▁sentence｜>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": null
+}