
Cosmos-Reason2-32B-W4A16-FlashHead


An optimized version of nvidia/Cosmos-Reason2-32B using W4A16 quantization and FlashHead, Embedl's efficient replacement for the language model head.

Designed for low-latency inference on NVIDIA GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • vLLM integration via the flash-head plugin

Model Details

Field            Value
Base Model       nvidia/Cosmos-Reason2-32B
Input / Output   Text + Image / Video → Text
Optimizations    FlashHead LM Head + Quantization (W4A16)
Developers       Embedl
Licenses         Upstream: NVIDIA Open Model License; optimized components: Embedl Models Community Licence v1.0 (no redistribution)

Benchmarks

Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

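For a rough local throughput check on your own hardware (not a substitute for the published benchmarks), a minimal text-only timing sketch with vLLM could look as follows; the prompt and token budget are arbitrary choices:

import time

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    llm = LLM(
        model="embedl/Cosmos-Reason2-32B-W4A16-FlashHead",
        max_model_len=8192,
        trust_remote_code=True,
    )
    params = SamplingParams(temperature=0.0, max_tokens=128)

    # Time a single greedy generation and report decode throughput.
    start = time.perf_counter()
    out = llm.generate(["Explain what a language model head does."], params)
    elapsed = time.perf_counter() - start

    n_tokens = len(out[0].outputs[0].token_ids)
    print(f"{n_tokens} tokens in {elapsed:.2f} s → {n_tokens / elapsed:.1f} tok/s")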

Installation

pip install flash-head

The flash-head vLLM plugin is required. It activates automatically at startup.
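To confirm the plugin is present before launching vLLM, the installed distribution can be checked via pip metadata; the only assumption here is the flash-head distribution name from the install command above:

python -c "from importlib.metadata import version; print(version('flash-head'))"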


Usage Examples

vLLM Serve

vllm serve embedl/Cosmos-Reason2-32B-W4A16-FlashHead \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
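Once running, the server exposes an OpenAI-compatible API, by default at http://localhost:8000/v1. A minimal client sketch using the openai Python package; the placeholder API key and the video_url payload shape follow vLLM's conventions for multimodal chat and should be treated as assumptions:

from openai import OpenAI

# vLLM's OpenAI-compatible server; default host/port assumed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="embedl/Cosmos-Reason2-32B-W4A16-FlashHead",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                # Same sample clip as the offline example below.
                {
                    "type": "video_url",
                    "video_url": {"url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"},
                },
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)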

vLLM Video Inference

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-32B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    # Chat-style request: a system prompt plus a user turn pairing a video
    # (sampled at 4 fps) with a text instruction.
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        # Allow at most one video per prompt (no images or audio) and bound
        # its size so vLLM can plan multimodal memory usage.
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        # num_frames=-1: keep all frames produced by the video loader.
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        # Do not truncate multimodal inputs during preprocessing.
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    # Greedy decoding for a deterministic description.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
    output = llm.chat(messages, sampling_params=sampling_params)
    print(output[0].outputs[0].text)
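
vLLM Image Inference

Since the model also takes image input (see Model Details), the same chat API handles single images; a minimal sketch with a placeholder image URL, using the simpler per-modality integer form of limit_mm_per_prompt:

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    llm = LLM(
        model="embedl/Cosmos-Reason2-32B-W4A16-FlashHead",
        # Integer limits: one image per prompt, no video or audio.
        limit_mm_per_prompt={"image": 1, "video": 0, "audio": 0},
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    messages = [
        {
            "role": "user",
            "content": [
                # Placeholder URL; any reachable image works.
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
                {"type": "text", "text": "What is happening in this image?"},
            ],
        }
    ]

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=128))
    print(output[0].outputs[0].text)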

License

Upstream model: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0, which does not permit redistribution.

Contact

  • Enterprise and Commercial Inquiries: models@embedl.com
  • Technical Issues and Early Access: https://github.com/embedl/flash-head
  • More Information and Model Releases: https://embedl.com