---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-2B
tags:
  - nvfp4
  - quantized
  - compressed-tensors
  - blackwell
  - physical-ai
  - embodied-reasoning
  - cosmos
  - nvidia
  - vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---

# Cosmos-Reason2-2B-NVFP4

NVFP4 quantized version of [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.

## NVFP4 Quantization Details

| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) |
| **Quantization** | NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8 |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 4.6 GB → **2.7 GB** (~41% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |

### What's Quantized / What's Not

Unlike hybrid-attention models (e.g. Qwen3.6), Cosmos-Reason2-2B uses a standard transformer backbone — all language model linear layers are quantized. Only the visual components and output head are preserved in BF16:

| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (28 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 24 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |

### Quantization Config (llm-compressor)

```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 24 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```

---

## Quick Start (vLLM)

```bash
vllm serve vrfai/Cosmos-Reason2-2B-NVFP4 \
  --max-model-len 8192
```

The model fits comfortably on a single RTX 5090 (32 GB). No `--tensor-parallel-size` needed.
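### Offline Inference (vLLM Python API)

If you don't need a server, the same checkpoint can be used through vLLM's offline `LLM` API. This is a minimal sketch assuming the same single-GPU setup and context length as the serve command above; the prompt is purely illustrative.

```python
from vllm import LLM, SamplingParams

# Load the NVFP4 checkpoint directly (compressed-tensors format is detected automatically)
llm = LLM(model="vrfai/Cosmos-Reason2-2B-NVFP4", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=256)

# llm.chat() applies the model's chat template; image/video inputs can be passed
# as OpenAI-style content parts, as in the API example further below.
messages = [
    {
        "role": "user",
        "content": "Is it physically plausible to stack a bowling ball on top of an egg? Explain.",
    }
]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```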
### Python (Transformers)

```python
from transformers import AutoProcessor, AutoTokenizer, Qwen3VLForConditionalGeneration

model_name = "vrfai/Cosmos-Reason2-2B-NVFP4"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# For image/video prompts, AutoProcessor bundles the tokenizer with the visual preprocessor
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```

### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-2B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."},
            ],
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

---

## Tested Environment

| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |

---

## Model Overview

Cosmos-Reason2-2B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs. It is designed for use as a planner or reasoning backbone in robotics and Vision-Language-Action (VLA) pipelines.

| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~2B |
| **Hidden size** | 2048 |
| **Layers** | 28 (standard GQA transformer) |
| **Attention heads** | 16 Q / 8 KV |
| **Vision encoder depth** | 24 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |

### Quality Benchmarks

For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B).

---

## Ethical Considerations & Safety

> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B) and applies equally to this quantized derivative.

This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

**Safety note:** Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

---

## Credits

- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
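---

## Appendix: Reproducing the Quantization (Sketch)

For reference, the recipe shown earlier can be applied with llm-compressor's `oneshot` entrypoint. The script below is a hypothetical reconstruction, not the exact script used to produce this checkpoint: the calibration dataset, sample count, and sequence length are illustrative assumptions, and the precise `oneshot` arguments may vary between llm-compressor versions.

```python
# Hypothetical reproduction sketch (assumes an llm-compressor release with NVFP4
# support and enough memory to hold the BF16 base model during calibration).
from datasets import load_dataset
from transformers import AutoProcessor, AutoTokenizer, Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "nvidia/Cosmos-Reason2-2B"
SAVE_DIR = "Cosmos-Reason2-2B-NVFP4"

model = Qwen3VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Small text-only calibration set for the FP4 activation scales (assumed choice);
# only the LLM linear layers are quantized, so image/video samples are not required here.
NUM_SAMPLES = 64
MAX_LEN = 2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")

def tokenize(sample):
    text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_LEN, truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Mirror of recipe.yaml above: NVFP4 on every Linear layer except lm_head and
# the vision-side modules, which stay in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        r"re:model\.visual\.blocks\.\d+\..*",
        "model.visual.merger.linear_fc1",
        "model.visual.merger.linear_fc2",
        r"re:model\.visual\.deepstack_merger_list\.\d+\..*",
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

# Export in compressed-tensors format so vLLM can load the checkpoint directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```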