---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-2B
tags:
  - nvfp4
  - quantized
  - compressed-tensors
  - blackwell
  - physical-ai
  - embodied-reasoning
  - cosmos
  - nvidia
  - vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the
  [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---

# Cosmos-Reason2-2B-NVFP4

NVFP4 quantized version of [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.

## NVFP4 Quantization Details

| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) |
| **Quantization** | NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8 |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 4.6 GB → **2.7 GB** (~41% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
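
Why ~41% rather than the ~4× you might expect from 16-bit to 4-bit? NVFP4 stores 4-bit weights in groups of 16 with one FP8 scale per group (roughly 4.5 bits per quantized weight), and several components stay in BF16 (see the next section). A back-of-envelope check, with the bit math as a stated assumption:

```python
# Rough bits-per-weight for NVFP4: 4-bit values in groups of 16, one
# FP8 (8-bit) scale per group. Per-tensor global scales are negligible
# and ignored here.
bits_nvfp4 = 4 + 8 / 16          # ~4.5 bits per quantized weight
bits_bf16 = 16

print(bits_nvfp4 / bits_bf16)    # ~0.28: ratio for quantized layers only
print(2.7 / 4.6)                 # ~0.59: observed whole-model ratio, pulled
                                 # up by the BF16 vision tower and lm_head
```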

### What's Quantized / What's Not

Unlike hybrid-attention models (e.g. Qwen3.6), Cosmos-Reason2-2B uses a standard transformer backbone, so all language-model linear layers are quantized. Only the visual components and the output head are kept in BF16:

| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (28 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 24 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |
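
You can sanity-check this split on the released checkpoint by reading the `quantization_config` that compressed-tensors writes into `config.json` (a sketch; exact key values depend on the compressed-tensors version, so treat the expected outputs as assumptions):

```python
# Inspect the quantization_config baked into config.json; the "ignore"
# list should match the recipe below.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("vrfai/Cosmos-Reason2-2B-NVFP4", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg.get("format"))  # expected: an NVFP4 compressed-tensors format
print(qcfg.get("ignore"))  # lm_head + vision modules from the recipe
```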

### Quantization Config (llm-compressor)

```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 24 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```
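
For reference, a recipe like this is applied through llm-compressor's `oneshot` entry point. The sketch below is illustrative, not the exact command used to produce this checkpoint; depending on the scheme, a small calibration dataset may also be required:

```python
# Minimal sketch: apply recipe.yaml to the base model and save the
# compressed checkpoint. The output path is illustrative.
from transformers import Qwen3VLForConditionalGeneration
from llmcompressor import oneshot

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason2-2B", dtype="auto", device_map="auto"
)

oneshot(
    model=model,
    recipe="recipe.yaml",                  # the recipe shown above
    output_dir="Cosmos-Reason2-2B-NVFP4",
)
```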

---

## Quick Start (vLLM)

```bash
vllm serve vrfai/Cosmos-Reason2-2B-NVFP4 \
  --max-model-len 8192
```

The model fits comfortably on a single RTX 5090 (32 GB). No `--tensor-parallel-size` needed.

### Python (Transformers)

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_name = "vrfai/Cosmos-Reason2-2B-NVFP4"
# Use AutoProcessor rather than a bare tokenizer so that image and video
# inputs are preprocessed alongside the text.
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```
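
A hypothetical single-image generation call using the processor's chat template (the image URL and prompt are placeholders):

```python
# Build a multimodal chat turn and generate. The content schema follows
# the Qwen-VL style chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://..."},  # URL or local path
            {"type": "text", "text": "Describe the physical interaction in this scene."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:],  # strip the prompt tokens
    skip_special_tokens=True,
))
```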

### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-2B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."}
            ]
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
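
Since the base model also accepts video, a video turn works the same way through vLLM's OpenAI-compatible `video_url` content type (support depends on your vLLM build; the URL is a placeholder):

```python
# Same endpoint and client as above, but with a video input.
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-2B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://..."}},
                {"type": "text", "text": "What happens next in this scene?"},
            ]
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```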

---

## Tested Environment

| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |

---

## Model Overview

Cosmos-Reason2-2B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs. It is designed for use as a planner or reasoning backbone in robotics and Vision-Language-Action (VLA) pipelines.

| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~2B |
| **Hidden size** | 2048 |
| **Layers** | 28 (standard GQA transformer) |
| **Attention heads** | 16 Q / 8 KV |
| **Vision encoder depth** | 24 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |
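
The GQA figures above also give a quick KV-cache estimate (assuming a BF16 cache and head_dim = hidden size / query heads = 128), which is why the 8k-context serving command earlier fits easily alongside the 2.7 GB weights on a 32 GB card:

```python
# KV-cache back-of-envelope from the table above (BF16 cache assumed).
layers, q_heads, kv_heads, hidden = 28, 16, 8, 2048
head_dim = hidden // q_heads                              # 128 (assumption)

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, 2 B each
print(kv_bytes_per_token / 1024)                  # ~224 KiB per token
print(kv_bytes_per_token * 8192 / 2**30)          # ~1.75 GiB at 8k max-model-len
```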

### Quality Benchmarks

For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B).

---

## Ethical Considerations & Safety

> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B) and applies equally to this quantized derivative.

This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

**Safety note:** Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

---

## Credits

- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)