
Cosmos-Reason2-32B-W4A16-FlashHead


An optimized version of nvidia/Cosmos-Reason2-32B using W4A16 quantization and FlashHead, Embedl's efficient replacement for the language model head.

Designed for low-latency inference on NVIDIA GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • vLLM integration via the flash-head plugin

Model Details

Field            Value
Base Model       nvidia/Cosmos-Reason2-32B
Input / Output   Text + Image / Video → Text
Optimizations    FlashHead LM Head + Quantization (W4A16)
Developers       Embedl
Licenses         Upstream: NVIDIA Open Model License; optimized components: Embedl Models Community Licence v1.0 (no redistribution)

Benchmarks

Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

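For a rough local throughput check on your own hardware (not a substitute for the published benchmarks), a minimal text-only timing sketch with vLLM could look as follows; the prompt and token budget are arbitrary choices:

import time

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    llm = LLM(
        model="embedl/Cosmos-Reason2-32B-W4A16-FlashHead",
        max_model_len=8192,
        trust_remote_code=True,
    )
    params = SamplingParams(temperature=0.0, max_tokens=128)

    # Time a single greedy generation and report decode throughput.
    start = time.perf_counter()
    out = llm.generate(["Explain what a language model head does."], params)
    elapsed = time.perf_counter() - start

    n_tokens = len(out[0].outputs[0].token_ids)
    print(f"{n_tokens} tokens in {elapsed:.2f} s → {n_tokens / elapsed:.1f} tok/s")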

Installation

pip install flash-head

The flash-head vLLM plugin is required. It activates automatically at startup.
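To confirm the plugin is present before launching vLLM, the installed distribution can be checked via pip metadata; the only assumption here is the flash-head distribution name from the install command above:

python -c "from importlib.metadata import version; print(version('flash-head'))"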


Usage Examples

vLLM Serve

vllm serve embedl/Cosmos-Reason2-32B-W4A16-FlashHead \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
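Once running, the server exposes an OpenAI-compatible API, by default at http://localhost:8000/v1. A minimal client sketch using the openai Python package; the placeholder API key and the video_url payload shape follow vLLM's conventions for multimodal chat and should be treated as assumptions:

from openai import OpenAI

# vLLM's OpenAI-compatible server; default host/port assumed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="embedl/Cosmos-Reason2-32B-W4A16-FlashHead",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                # Same sample clip as the offline example below.
                {
                    "type": "video_url",
                    "video_url": {"url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"},
                },
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)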

vLLM Video Inference

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-32B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    # Chat-style request: a system prompt plus a user turn pairing a video
    # (sampled at 4 fps) with a text instruction.
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        # Allow at most one video per prompt (no images or audio) and bound
        # its size so vLLM can plan multimodal memory usage.
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        # num_frames=-1: keep all frames produced by the video loader.
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        # Do not truncate multimodal inputs during preprocessing.
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    # Greedy decoding for a deterministic description.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
    output = llm.chat(messages, sampling_params=sampling_params)
    print(output[0].outputs[0].text)
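
vLLM Image Inference

Since the model also takes image input (see Model Details), the same chat API handles single images; a minimal sketch with a placeholder image URL, using the simpler per-modality integer form of limit_mm_per_prompt:

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    llm = LLM(
        model="embedl/Cosmos-Reason2-32B-W4A16-FlashHead",
        # Integer limits: one image per prompt, no video or audio.
        limit_mm_per_prompt={"image": 1, "video": 0, "audio": 0},
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    messages = [
        {
            "role": "user",
            "content": [
                # Placeholder URL; any reachable image works.
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
                {"type": "text", "text": "What is happening in this image?"},
            ],
        }
    ]

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=128))
    print(output[0].outputs[0].text)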

License

Upstream model: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0, which does not permit redistribution.

Contact

  • Enterprise and Commercial Inquiries: models@embedl.com
  • Technical Issues and Early Access: https://github.com/embedl/flash-head
  • More Information and Model Releases: https://embedl.com