---
base_model:
- nvidia/Cosmos-Reason2-32B
tags:
- nvidia
- cosmos
- cosmos-reason2
- multimodal
- vlm
- quantized
- flashhead
- qwen3_vl
pipeline_tag: image-text-to-text
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
extra_gated_prompt: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Embedl Privacy Policy](https://www.embedl.com/privacy-policy).
extra_gated_fields:
  Company: text
---
# Cosmos-Reason2-32B-W4A16-FlashHead

[](https://github.com/embedl/flash-head)

**Optimized version of [nvidia/Cosmos-Reason2-32B](https://huggingface.co/nvidia/Cosmos-Reason2-32B) using quantization and FlashHead, Embedl's efficient replacement for the language model head.**

Designed for **low-latency inference** on **NVIDIA GPUs**, leveraging:

- FlashHead
- Quantization (W4A16)
- vLLM plugin via [`flash-head`](https://github.com/embedl/flash-head)

---

## Model Details

| **Field** | **Value** |
|---|---|
| **Base Model** | [nvidia/Cosmos-Reason2-32B](https://huggingface.co/nvidia/Cosmos-Reason2-32B) |
| **Input / Output** | Text + Image / Video -> Text |
| **Optimizations** | FlashHead LM Head + Quantization (W4A16) |
| **Developers** | Embedl |
| **Licenses** | Upstream: [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). Optimized components: Embedl Models Community Licence v1.0 |
---
## Installation
```bash
pip install flash-head
```
The [`flash-head`](https://github.com/embedl/flash-head) vLLM plugin is required; vLLM discovers and activates it automatically at startup, so no additional configuration is needed.
---
## Usage Examples
### vLLM Serve
```bash
vllm serve embedl/Cosmos-Reason2-32B-W4A16-FlashHead \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
```
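Once the server is up, you can query it through vLLM's OpenAI-compatible REST API. Below is a minimal stdlib-only client sketch; it assumes the default host and port (`localhost:8000`), and the image URL and question are placeholders.

```python
# Minimal client sketch for the OpenAI-compatible endpoint exposed by
# `vllm serve`. Assumes the default host/port (localhost:8000); the
# image URL below is a placeholder.
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "embedl/Cosmos-Reason2-32B-W4A16-FlashHead"


def build_request(image_url: str, question: str) -> dict:
    """Build a single-image chat completion request body."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "temperature": 0.0,
        "max_tokens": 256,
    }


if __name__ == "__main__":
    body = build_request("https://example.com/frame.jpg", "Describe this image.")
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])
```

The same request shape works with the official `openai` Python client pointed at `base_url="http://localhost:8000/v1"`.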
### vLLM Video Inference
```python
from vllm import LLM, SamplingParams
if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-32B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
```
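The model also accepts single images. A sketch of how the message list above could be adapted for image input, assuming the same `LLM` setup but with `limit_mm_per_prompt` allowing one image instead of one video (the helper name and image URL are illustrative, not part of the vLLM API):

```python
# Sketch: adapting the chat messages above for single-image input.
# Assumes the same LLM(...) construction, but with
# limit_mm_per_prompt={"image": 1, "video": 0, "audio": 0}.
def build_image_messages(image_url: str, question: str) -> list:
    """Build a chat message list with one image and a text question."""
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        },
    ]

# Pass the result to llm.chat(...) exactly as in the video example:
# output = llm.chat(build_image_messages(url, "Describe this image."),
#                   sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
```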
---
## License
- **Upstream:** [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
---
## Contact
- Enterprise and Commercial Inquiries: `models@embedl.com`
- Technical Issues and Early Access: [`https://github.com/embedl/flash-head`](https://github.com/embedl/flash-head)
- More Information and Model Releases: `https://embedl.com`