# Cosmos-Reason2-8B-W4A16-FlashHead

Part of the nvidia/Cosmos-Reason2 collection of multi-modal reasoning models optimized by Embedl.
Optimized version of nvidia/Cosmos-Reason2-8B using W4A16 quantization and FlashHead, Embedl's efficient replacement for the language model head. Designed for low-latency inference on NVIDIA GPUs.
| Field | Value |
|---|---|
| Base Model | nvidia/Cosmos-Reason2-8B |
| Input / Output | Text + Image / Video -> Text |
| Optimizations | FlashHead LM Head + Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
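
W4A16 means 4-bit weights with 16-bit activations. As a rough, illustrative sketch of the weight-memory saving (assuming ~8e9 parameters from the model name; real footprints also include the KV cache, activations, and quantization scales):

```python
# Back-of-the-envelope weight-memory estimate: W4A16 vs. BF16
# for an ~8B-parameter model (assumption taken from the model name).
params = 8e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per weight
w4_gb = params * 0.5 / 1e9   # W4: 4 bits = 0.5 bytes per weight
print(f"BF16 weights ~{bf16_gb:.0f} GB, W4A16 weights ~{w4_gb:.0f} GB")
```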
Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.
The flash-head vLLM plugin is required; it activates automatically when vLLM starts.

```bash
pip install flash-head
```
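A quick way to confirm the package is installed before serving (a standard-library check only; vLLM discovers installed plugins on its own at startup):

```python
# Sanity check: raises PackageNotFoundError if flash-head is missing.
from importlib.metadata import version

print("flash-head", version("flash-head"))
```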
```bash
vllm serve embedl/Cosmos-Reason2-8B-W4A16-FlashHead \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.75
```
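
Once the server is up, it exposes vLLM's OpenAI-compatible API (default `http://localhost:8000/v1`). A minimal client sketch, assuming the stock `openai` Python package is installed; note that `video_url` is a vLLM extension to the OpenAI content schema:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="embedl/Cosmos-Reason2-8B-W4A16-FlashHead",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"}},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }],
    max_tokens=256,
    temperature=0.0,
)
print(response.choices[0].message.content)
```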
For offline inference without a server, the same model can be used through vLLM's Python API:

```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                # Sample the video at 4 frames per second.
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        # Allow exactly one video per prompt (no images or audio);
        # the size hints bound memory profiling for multi-modal inputs.
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        # -1 keeps all frames produced by the fps-based sampling above.
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,
        trust_remote_code=True,
    )

    # Greedy decoding for a deterministic description.
    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
```
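
The card also lists image input (Text + Image / Video -> Text). A hypothetical image variant of the same offline API, with a placeholder image URL and the image limit raised; the integer form of `limit_mm_per_prompt` simply caps the number of items per prompt:

```python
from vllm import LLM, SamplingParams

# Hypothetical image-input variant; the URL below is a placeholder.
llm = LLM(
    model="embedl/Cosmos-Reason2-8B-W4A16-FlashHead",
    limit_mm_per_prompt={"image": 1, "video": 0, "audio": 0},
    max_model_len=8192,
    gpu_memory_utilization=0.75,
    trust_remote_code=True,
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
out = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
print(out[0].outputs[0].text)
```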
- Contact: models@embedl.com
- GitHub: https://github.com/embedl/flash-head
- Website: https://embedl.com