---
base_model:
- nvidia/Cosmos-Reason2-32B
tags:
- nvidia
- cosmos
- cosmos-reason2
- multimodal
- vlm
- quantized
- flashhead
- qwen3_vl
pipeline_tag: image-text-to-text
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
extra_gated_prompt: >-
  The information you provide will be collected, stored, processed and shared
  in accordance with the [Embedl Privacy Policy](https://www.embedl.com/privacy-policy).
extra_gated_fields:
  Company: text
---
# Cosmos-Reason2-32B-W4A16-FlashHead

[![GitHub](https://img.shields.io/badge/GitHub-flash--head-black?logo=github)](https://github.com/embedl/flash-head)

**Optimized version of [nvidia/Cosmos-Reason2-32B](https://huggingface.co/nvidia/Cosmos-Reason2-32B) using quantization and FlashHead, Embedl's efficient replacement for the language model head.**

Designed for **low-latency inference** on **NVIDIA GPUs**, leveraging:

- FlashHead
- Quantization (W4A16)
- vLLM plugin via [`flash-head`](https://github.com/embedl/flash-head)

---

## Model Details

| **Field** | **Value** |
|---|---|
| **Base Model** | [nvidia/Cosmos-Reason2-32B](https://huggingface.co/nvidia/Cosmos-Reason2-32B) |
| **Input / Output** | Text + Image / Video -> Text |
| **Optimizations** | FlashHead LM Head + Quantization (W4A16) |
| **Developers** | Embedl |
| **Licenses** | Upstream: [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |

---

## Benchmarks

Accuracy and on-device latency benchmarks can be explored on [embedl/Edge-Inference-Benchmarks](https://huggingface.co/spaces/embedl/Edge-Inference-Benchmarks).

*(Screenshot: Edge Inference Benchmarks)*

---

## Installation

```bash
pip install flash-head
```

The [`flash-head`](https://github.com/embedl/flash-head) vLLM plugin is required. It activates automatically at startup.

---

## Usage Examples

### vLLM Serve

```bash
vllm serve embedl/Cosmos-Reason2-32B-W4A16-FlashHead \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```
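Once the server is running, it exposes an OpenAI-compatible API (on `localhost:8000` by default). The snippet below is a minimal client sketch, not from the upstream card: the endpoint, the `openai` client usage, and the image URL (a placeholder) are assumptions layered on top of the serve command above.

```python
# Minimal client sketch for the OpenAI-compatible endpoint started above.
# Assumptions: default host/port; "https://example.com/scene.jpg" is a placeholder URL.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="embedl/Cosmos-Reason2-32B-W4A16-FlashHead",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)
```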
### vLLM Video Inference

```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-32B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                # Sample the clip at 4 frames per second.
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        # Allow at most one video (and no images or audio) per prompt.
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        # -1: do not cap the number of decoded video frames.
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    output = llm.chat(
        messages,
        sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
    )
    print(output[0].outputs[0].text)
```
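The same offline engine also handles single-image prompts. Below is a minimal sketch, not from the upstream card: it carries the engine flags over from the video example, swaps the per-prompt multimodal limits, and uses a placeholder image URL.

```python
# Minimal offline image-inference sketch, adapted from the video example above.
# Assumptions: engine flags carried over; "https://example.com/scene.jpg" is a placeholder URL.
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-32B-W4A16-FlashHead"
    image_url = "https://example.com/scene.jpg"  # placeholder

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "What is happening in this image?"},
            ],
        },
    ]

    llm = LLM(
        model=model,
        # Allow at most one image (and no videos or audio) per prompt.
        limit_mm_per_prompt={"image": 1, "video": 0, "audio": 0},
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
```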
---

## License

- **Upstream:** [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

- Enterprise and Commercial Inquiries: `models@embedl.com`
- Technical Issues and Early Access: [`https://github.com/embedl/flash-head`](https://github.com/embedl/flash-head)
- More Information and Model Releases: `https://embedl.com`

---

## Community & Support

Need help with this model? Chat with the Embedl team and other engineers on the Embedl Discord. Bring your quantization gotchas, hardware questions, and fine-tuning tips.