---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-2B
tags:
- nvfp4
- quantized
- compressed-tensors
- blackwell
- physical-ai
- embodied-reasoning
- cosmos
- nvidia
- vllm
quantized_by: vrfai
extra_gated_prompt: >-
By downloading this model, you agree to the
[NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---
# Cosmos-Reason2-2B-NVFP4
NVFP4 quantized version of [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).
> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.
## NVFP4 Quantization Details
| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) |
| **Quantization** | NVFP4 — FP4 (E2M1) weights and activations, dynamic local activation scales, FP8 (E4M3) block scales |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 4.6 GB → **2.7 GB** (~41% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
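The NVFP4 scheme above can be illustrated with a toy block quantizer. FP4 (E2M1) can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and each 16-element block shares one scale. This is a simplified sketch, not the compressed-tensors kernel: the block scale here stays in full precision, whereas real NVFP4 stores block scales in FP8 (E4M3) under a global FP32 scale.

```python
# Toy NVFP4-style block quantization (illustrative only; real NVFP4 stores
# block scales in FP8 E4M3 plus a global FP32 scale).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block):
    """Quantize a 16-element block to signed E2M1 indices plus one scale."""
    assert len(block) == 16
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value to the grid max
    idx = []
    for x in block:
        mag = abs(x) / scale
        i = min(range(len(E2M1_GRID)), key=lambda k: abs(E2M1_GRID[k] - mag))
        idx.append((i, x < 0))
    return idx, scale

def dequantize_block(idx, scale):
    return [(-1.0 if neg else 1.0) * E2M1_GRID[i] * scale for i, neg in idx]

block = [0.01 * k for k in range(16)]  # toy weights 0.00 .. 0.15
q, s = quantize_block(block)
deq = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(block, deq))
print(f"max abs round-trip error: {err:.4f}")
```

Values near the top of the grid land on the coarse {4, 6} steps, which is why the per-block scale (set by the block's absolute maximum) matters so much for 4-bit quality.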
### What's Quantized / What's Not
Unlike hybrid-attention models (e.g. Qwen3.6), Cosmos-Reason2-2B uses a standard transformer backbone — all language model linear layers are quantized. Only the visual components and output head are preserved in BF16:
| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (28 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 24 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |
### Quantization Config (llm-compressor)
```yaml
# recipe.yaml
QuantizationModifier:
targets: [Linear]
scheme: NVFP4
ignore:
- lm_head
# Vision encoder — 24 blocks (attn + mlp) + merger
- re:model\.visual\.blocks\.\d+\..*
- model.visual.merger.linear_fc1
- model.visual.merger.linear_fc2
# DeepStack multi-scale merger
- re:model\.visual\.deepstack_merger_list\.\d+\..*
```
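The `re:` ignore patterns in the recipe can be sanity-checked against module names before running a quantization job. The module paths below are assumptions modeled on the Qwen3-VL naming scheme, not dumped from the actual checkpoint:

```python
import re

# Regex ignore patterns from the recipe above (without the `re:` prefix).
IGNORE_PATTERNS = [
    r"model\.visual\.blocks\.\d+\..*",
    r"model\.visual\.deepstack_merger_list\.\d+\..*",
]

def is_ignored(module_name: str) -> bool:
    """True if the recipe's regex ignores would skip this module."""
    return any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)

names = [
    "model.visual.blocks.23.mlp.linear_fc1",           # vision block: keep BF16
    "model.visual.deepstack_merger_list.2.linear_fc2",  # DeepStack merger: keep BF16
    "model.language_model.layers.0.self_attn.q_proj",   # LLM layer: quantize
]
for n in names:
    print(n, "->", "BF16 (ignored)" if is_ignored(n) else "NVFP4")
```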
---
## Quick Start (vLLM)
```bash
vllm serve vrfai/Cosmos-Reason2-2B-NVFP4 \
--max-model-len 8192
```
The model fits comfortably on a single RTX 5090 (32 GB). No `--tensor-parallel-size` needed.
### Python (Transformers)
```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_name = "vrfai/Cosmos-Reason2-2B-NVFP4"

# Use AutoProcessor (not just a tokenizer): it handles image/video
# preprocessing alongside text for this vision-language model.
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```
### OpenAI-compatible API
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="vrfai/Cosmos-Reason2-2B-NVFP4",
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://..."}},
{"type": "text", "text": "Describe the physical interaction in this scene."}
]
}
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
```
---
## Tested Environment
| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |
---
## Model Overview
Cosmos-Reason2-2B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs. It is designed for use as a planner or reasoning backbone in robotics and Vision-Language-Action (VLA) pipelines.
| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~2B |
| **Hidden size** | 2048 |
| **Layers** | 28 (standard GQA transformer) |
| **Attention heads** | 16 Q / 8 KV |
| **Vision encoder depth** | 24 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |
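The table's attention figures give a rough KV-cache budget. Assuming a BF16 cache and head_dim = hidden_size / num_q_heads = 2048 / 16 = 128 (an assumption consistent with the table, not read from the config), a back-of-the-envelope sizing for the 8192-token `--max-model-len` from the Quick Start looks like:

```python
# Rough KV-cache sizing from the architecture table (assumes BF16 cache,
# head_dim = 2048 // 16 = 128; illustrative arithmetic only).
layers, kv_heads = 28, 8
head_dim = 2048 // 16
bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K + V, 2 bytes each
ctx = 8192  # --max-model-len from the Quick Start
total_gib = bytes_per_token * ctx / 1024**3
print(f"{bytes_per_token} bytes/token, ~{total_gib:.2f} GiB at {ctx} tokens")
```

Under these assumptions the cache stays well under 1 GiB at 8k context, which is why the ~2.7 GB quantized weights plus cache fit easily on a 32 GB RTX 5090.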
### Quality Benchmarks
For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B).
---
## Ethical Considerations & Safety
> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B) and applies equally to this quantized derivative.
This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.
**Safety note:** Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.
Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
---
## Credits
- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)