---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-8B
tags:
  - nvfp4
  - quantized
  - compressed-tensors
  - blackwell
  - physical-ai
  - embodied-reasoning
  - cosmos
  - nvidia
  - vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the
  [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---

# Cosmos-Reason2-8B-NVFP4

NVFP4 quantized version of [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.

## NVFP4 Quantization Details

| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) |
| **Quantization** | NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8 |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | ~17 GB → **~7.1 GB** (~58% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
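
The ~7.1 GB figure is consistent with NVFP4's storage layout. A back-of-envelope check, assuming NVFP4's 16-element micro-blocks with one shared FP8 (E4M3) scale each:

```python
# NVFP4 packs 4-bit weight values into 16-element micro-blocks,
# each micro-block sharing one FP8 (E4M3) scale.
bits_per_weight = 4 + 8 / 16                    # amortized: 4.5 bits/weight
llm_weight_gb = 8e9 * bits_per_weight / 8 / 1e9 # ~4.5 GB if all ~8B params were NVFP4
print(llm_weight_gb)
```

The components kept in BF16 (vision encoder, mergers, `lm_head`, embeddings) account for the remainder of the 7.1 GB checkpoint.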

### What's Quantized / What's Not

| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (36 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 27 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |

### Quantization Config (llm-compressor)

```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 27 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```
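
The `re:` entries in the ignore list are Python regular expressions matched against fully qualified module names. A quick sanity check of what they cover (the module names below are illustrative guesses based on the Qwen3VL layout, not verified against the checkpoint):

```python
import re

# The two regex patterns from the recipe's ignore list
vision_blocks = re.compile(r"model\.visual\.blocks\.\d+\..*")
deepstack = re.compile(r"model\.visual\.deepstack_merger_list\.\d+\..*")

# Hypothetical module names following the Qwen3VL naming scheme
print(bool(vision_blocks.fullmatch("model.visual.blocks.0.attn.qkv")))               # True
print(bool(deepstack.fullmatch("model.visual.deepstack_merger_list.2.linear_fc1")))  # True
# LLM layers match no ignore pattern, so they are quantized
print(bool(vision_blocks.fullmatch("model.layers.0.mlp.gate_proj")))                 # False
```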

---

## Quick Start (vLLM)

```bash
vllm serve vrfai/Cosmos-Reason2-8B-NVFP4 \
  --max-model-len 8192
```

### Python (Transformers)

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_name = "vrfai/Cosmos-Reason2-8B-NVFP4"
# AutoProcessor bundles the tokenizer with the image/video preprocessor,
# which a vision-language model needs for multimodal inputs
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```

### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."}
            ]
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

---

## Tested Environment

| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |

---

## Model Overview

Cosmos-Reason2-8B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs.

| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~8B |
| **Hidden size** | 4096 |
| **Layers** | 36 (standard GQA transformer) |
| **Attention heads** | 32 Q / 8 KV |
| **Vision encoder depth** | 27 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |

### Quality Benchmarks

For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B).

---

## Ethical Considerations & Safety

> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B) and applies equally to this quantized derivative.

This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

**Safety note:** Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

---

## Credits

- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)