---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-2B
tags:
  - nvfp4
  - quantized
  - compressed-tensors
  - blackwell
  - physical-ai
  - embodied-reasoning
  - cosmos
  - nvidia
  - vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the
  [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---

# Cosmos-Reason2-2B-NVFP4

NVFP4 quantized version of [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.

## NVFP4 Quantization Details

| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) |
| **Quantization** | NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8 |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 4.6 GB → **2.7 GB** (~41% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
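
Why ~41% rather than the ~4× you might expect from 16-bit to 4-bit? NVFP4 stores 4-bit weights in groups of 16 with one FP8 scale per group (roughly 4.5 bits per quantized weight), and several components stay in BF16 (see the next section). A back-of-envelope check, with the bit math as a stated assumption:

```python
# Rough bits-per-weight for NVFP4: 4-bit values in groups of 16, one
# FP8 (8-bit) scale per group. Per-tensor global scales are negligible
# and ignored here.
bits_nvfp4 = 4 + 8 / 16          # ~4.5 bits per quantized weight
bits_bf16 = 16

print(bits_nvfp4 / bits_bf16)    # ~0.28: ratio for quantized layers only
print(2.7 / 4.6)                 # ~0.59: observed whole-model ratio, pulled
                                 # up by the BF16 vision tower and lm_head
```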

### What's Quantized / What's Not

Unlike hybrid-attention models (e.g. Qwen3.6), Cosmos-Reason2-2B uses a standard transformer backbone, so all language-model linear layers are quantized. Only the visual components and the output head are kept in BF16:

| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (28 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 24 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |
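
You can sanity-check this split on the released checkpoint by reading the `quantization_config` that compressed-tensors writes into `config.json` (a sketch; exact key values depend on the compressed-tensors version, so treat the expected outputs as assumptions):

```python
# Inspect the quantization_config baked into config.json; the "ignore"
# list should match the recipe below.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("vrfai/Cosmos-Reason2-2B-NVFP4", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg.get("format"))  # expected: an NVFP4 compressed-tensors format
print(qcfg.get("ignore"))  # lm_head + vision modules from the recipe
```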

### Quantization Config (llm-compressor)

```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 24 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```
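
For reference, a recipe like this is applied through llm-compressor's `oneshot` entry point. The sketch below is illustrative, not the exact command used to produce this checkpoint; depending on the scheme, a small calibration dataset may also be required:

```python
# Minimal sketch: apply recipe.yaml to the base model and save the
# compressed checkpoint. The output path is illustrative.
from transformers import Qwen3VLForConditionalGeneration
from llmcompressor import oneshot

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason2-2B", dtype="auto", device_map="auto"
)

oneshot(
    model=model,
    recipe="recipe.yaml",                  # the recipe shown above
    output_dir="Cosmos-Reason2-2B-NVFP4",
)
```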

---

## Quick Start (vLLM)

```bash
vllm serve vrfai/Cosmos-Reason2-2B-NVFP4 \
  --max-model-len 8192
```

The model fits comfortably on a single RTX 5090 (32 GB). No `--tensor-parallel-size` needed.

### Python (Transformers)

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_name = "vrfai/Cosmos-Reason2-2B-NVFP4"
# Use AutoProcessor rather than a bare tokenizer so that image and video
# inputs are preprocessed alongside the text.
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```
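
A hypothetical single-image generation call using the processor's chat template (the image URL and prompt are placeholders):

```python
# Build a multimodal chat turn and generate. The content schema follows
# the Qwen-VL style chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://..."},  # URL or local path
            {"type": "text", "text": "Describe the physical interaction in this scene."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:],  # strip the prompt tokens
    skip_special_tokens=True,
))
```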

### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-2B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."}
            ]
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
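
Since the base model also accepts video, a video turn works the same way through vLLM's OpenAI-compatible `video_url` content type (support depends on your vLLM build; the URL is a placeholder):

```python
# Same endpoint and client as above, but with a video input.
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-2B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://..."}},
                {"type": "text", "text": "What happens next in this scene?"},
            ]
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```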

---

## Tested Environment

| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |

---

## Model Overview

Cosmos-Reason2-2B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs. It is designed for use as a planner or reasoning backbone in robotics and Vision-Language-Action (VLA) pipelines.

| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~2B |
| **Hidden size** | 2048 |
| **Layers** | 28 (standard GQA transformer) |
| **Attention heads** | 16 Q / 8 KV |
| **Vision encoder depth** | 24 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |
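
The GQA figures above also give a quick KV-cache estimate (assuming a BF16 cache and head_dim = hidden size / query heads = 128), which is why the 8k-context serving command earlier fits easily alongside the 2.7 GB weights on a 32 GB card:

```python
# KV-cache back-of-envelope from the table above (BF16 cache assumed).
layers, q_heads, kv_heads, hidden = 28, 16, 8, 2048
head_dim = hidden // q_heads                              # 128 (assumption)

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, 2 B each
print(kv_bytes_per_token / 1024)                  # ~224 KiB per token
print(kv_bytes_per_token * 8192 / 2**30)          # ~1.75 GiB at 8k max-model-len
```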

### Quality Benchmarks

For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B).

---

## Ethical Considerations & Safety

> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B) and applies equally to this quantized derivative.

This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

**Safety note:** Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

---

## Credits

- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)