---
license: mit
base_model:
- deepseek-ai/DeepSeek-V4-Flash
library_name: transformers
tags:
- compressed-tensors
- nvfp4
- vllm
---

# DeepSeek-V4-Flash-NVFP4-FP8

## Model Optimizations

This model is a quantized version of [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash), with weights compressed to NVFP4 and served with an FP8 KV cache. It was produced using the following branch of LLM Compressor: https://github.com/vllm-project/llm-compressor/pull/2647. A sketch of what such a quantization run looks like is included at the end of this card.

## Deployment

This model was deployed using the following branch of vLLM: https://github.com/vllm-project/vllm/pull/41276

```bash
vllm serve RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 \
    --tensor-parallel-size 4 \
    --port 8089 \
    --kv-cache-dtype fp8
```

The server exposes an OpenAI-compatible API; an example request is sketched at the end of this card.

## Evaluation

Accuracy was measured on GSM8K using the evaluation script from the vLLM repository:

```bash
python tests/evals/gsm8k/gsm8k_eval.py
```

```
Results:
Accuracy: 0.910
Invalid responses: 0.000
Total latency: 173.006 s
Questions per second: 7.624
Total output tokens: 116217
Output tokens per second: 671.752
```

For more details on how this model was created and run in LLM Compressor, please contact Kyle Sayers on the [vLLM Slack](https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack).
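For orientation only, the sketch below shows the general shape of an NVFP4 `oneshot` run in LLM Compressor. It is not the recipe used for this model (that lives in the PR branch linked above); the calibration dataset, sample count, and `ignore` list are placeholder assumptions.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Placeholder recipe: quantize Linear layers to NVFP4, keeping lm_head
# in higher precision. A DeepSeek-style MoE model likely needs a more
# careful ignore list; the actual recipe is in the linked PR branch.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

oneshot(
    model="deepseek-ai/DeepSeek-V4-Flash",
    dataset="open_platypus",        # assumed calibration set
    recipe=recipe,
    max_seq_length=2048,            # assumed
    num_calibration_samples=512,    # assumed
    output_dir="DeepSeek-V4-Flash-NVFP4-FP8",
)
```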
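Once the `vllm serve` command above is running, the endpoint can be queried with the standard OpenAI Python client. The prompt and generation settings below are illustrative only:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8089/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8",
    messages=[{"role": "user", "content": "Briefly explain what NVFP4 quantization is."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```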