Use with the Transformers library
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="vrfai/Cosmos-Reason2-8B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```
```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("vrfai/Cosmos-Reason2-8B-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("vrfai/Cosmos-Reason2-8B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
Cosmos-Reason2-8B-NVFP4

NVFP4 quantized version of nvidia/Cosmos-Reason2-8B by vrfai using llm-compressor.

License: This model inherits the NVIDIA Open Model License from the base model. Commercial use and derivative models are permitted under its terms.

NVFP4 Quantization Details

| Property | Value |
|---|---|
| Base model | nvidia/Cosmos-Reason2-8B |
| Quantization | NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8 |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor |
| Model size | 17 GB → 7.1 GB (~58% reduction) |
| Requires | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
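As a rough sanity check on the reported footprint, the sketch below estimates the checkpoint size from the stated precisions. The 17 GB / 7.1 GB figures come from the table above; the split between NVFP4 and BF16 parameters is an assumption, not read from the checkpoint.

```python
# Back-of-envelope footprint estimate for an ~8B-parameter model in NVFP4.
# Assumed split: vision encoder, mergers, and lm_head (~1.3B params) stay
# in BF16 (2 bytes/param); the rest is NVFP4 (4 bits = 0.5 bytes/param).
total_params = 8e9
bf16_params = 1.3e9                      # assumption for this sketch
nvfp4_params = total_params - bf16_params

approx_gb = (nvfp4_params * 0.5 + bf16_params * 2) / 1e9
reduction = 1 - 7.1 / 17                 # figures from the table above

print(f"~{approx_gb:.1f} GB estimated (before FP8 scale overhead), "
      f"{reduction:.0%} reduction")
```

The estimate lands slightly under the reported 7.1 GB because it ignores the per-block FP8 scales, but it confirms the ~58% reduction is consistent with 4-bit weights plus a BF16 vision tower.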

What's Quantized / What's Not

| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (36 layers) | NVFP4 | Standard transformer, stable under 4-bit |
| Vision encoder — all 27 blocks + merger | BF16 | Preserved for visual perception quality |
| DeepStack merger list (3×) | BF16 | Multi-scale visual fusion, sensitive to precision |
| lm_head | BF16 | Output logits preserved for generation stability |

Quantization Config (llm-compressor)

```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 27 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```
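The `re:`-prefixed entries in the ignore list are regular expressions matched against module names, while plain entries match exactly. A minimal sketch of that matching logic (the sample module names are illustrative, in Qwen3-VL naming style, not read from the checkpoint):

```python
import re

# Ignore patterns from the recipe: plain names match exactly,
# "re:"-prefixed entries are treated as regular expressions.
ignore = [
    "lm_head",
    r"re:model\.visual\.blocks\.\d+\..*",
    "model.visual.merger.linear_fc1",
    "model.visual.merger.linear_fc2",
    r"re:model\.visual\.deepstack_merger_list\.\d+\..*",
]

def is_ignored(name: str) -> bool:
    """Return True if a module name is excluded from quantization."""
    for pat in ignore:
        if pat.startswith("re:"):
            if re.fullmatch(pat[3:], name):
                return True
        elif pat == name:
            return True
    return False

print(is_ignored("model.visual.blocks.26.attn.qkv"))            # vision block -> kept BF16
print(is_ignored("model.language_model.layers.0.mlp.up_proj"))  # LLM layer -> quantized
```

This mirrors how the recipe partitions the model: every `Linear` under `model.visual` stays in BF16, everything else is quantized to NVFP4.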

Quick Start (vLLM)

```shell
vllm serve vrfai/Cosmos-Reason2-8B-NVFP4 \
  --max-model-len 8192
```

Python (Transformers)

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_name = "vrfai/Cosmos-Reason2-8B-NVFP4"
# AutoProcessor bundles the tokenizer with the image/video preprocessor,
# which a plain tokenizer cannot provide for a vision-language model
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```

OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."}
            ]
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

Tested Environment

| Component | Version |
|---|---|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| llm-compressor / compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |

Model Overview

Cosmos-Reason2-8B is a vision-language model developed by NVIDIA for Physical AI reasoning — understanding physical common sense and embodied interactions from video and image inputs.

| Property | Value |
|---|---|
| Architecture | Qwen3VLForConditionalGeneration |
| Parameters | ~8B |
| Hidden size | 4096 |
| Layers | 36 (standard GQA transformer) |
| Attention heads | 32 Q / 8 KV |
| Vision encoder depth | 27 blocks (DeepStack-enhanced) |
| Context length | 262,144 tokens |
| Input modalities | Text, image, video |
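The GQA figures above imply a relatively compact KV cache, which still dominates memory at long contexts. A back-of-envelope calculation (head_dim is inferred as hidden_size / num_q_heads, and a BF16 cache is assumed):

```python
# KV-cache size per token for the GQA config in the table above.
layers, q_heads, kv_heads, hidden = 36, 32, 8, 4096
head_dim = hidden // q_heads          # inferred: 4096 / 32 = 128
bytes_per_elem = 2                    # assumed BF16 cache precision

# K and V each store kv_heads * head_dim values per layer
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
full_context = 262_144
total_gib = kv_bytes_per_token * full_context / 2**30

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"{total_gib:.0f} GiB at full context")
```

At the full 262,144-token context the cache alone needs about 36 GiB, while at the 8,192-token limit used in the vLLM serve example it is roughly 1.1 GiB — which is why capping `--max-model-len` matters on a single consumer GPU.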

Quality Benchmarks

For benchmark results see the Physical AI Bench Leaderboard and the base model card.


Ethical Considerations & Safety

This section is reproduced from the base model card and applies equally to this quantized derivative.

This model is intended for Physical AI developers working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

Safety note: Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns here.

