---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-8B
tags:
- nvfp4
- quantized
- compressed-tensors
- blackwell
- physical-ai
- embodied-reasoning
- cosmos
- nvidia
- vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the
  [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---
# Cosmos-Reason2-8B-NVFP4
NVFP4-quantized version of [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B), produced by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).
> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.
## NVFP4 Quantization Details
| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) |
| **Quantization** | NVFP4 — FP4 (E2M1) weights and activations with dynamic local scales; FP8 (E4M3) group scales |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 17 GB → **7.1 GB** (~58% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
### What's Quantized / What's Not
| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (36 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 27 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |
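To double-check this split on the published checkpoint, the `compressed-tensors` metadata in `config.json` records the ignore list. A minimal inspection sketch (exact key names depend on the compressed-tensors version):

```python
# Sketch: inspect which modules the checkpoint keeps in BF16.
# Assumes huggingface_hub is installed; quantization_config keys
# may vary across compressed-tensors versions.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("vrfai/Cosmos-Reason2-8B-NVFP4", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f).get("quantization_config", {})

print(qcfg.get("format"))      # compressed-tensors NVFP4 pack format
print(qcfg.get("ignore", []))  # lm_head, vision blocks, mergers
```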
### Quantization Config (llm-compressor)
```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 27 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```
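For reference, the same recipe can be driven from Python. This is a hedged reproduction sketch following llm-compressor's `oneshot` flow; exact argument names may shift between releases, and a calibration dataset can be passed to `oneshot` if the activation scales need calibrating:

```python
# Hedged reproduction sketch of the one-shot NVFP4 quantization.
# API details may differ across llm-compressor releases.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason2-8B", torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        r"re:model\.visual\.blocks\.\d+\..*",
        "model.visual.merger.linear_fc1",
        "model.visual.merger.linear_fc2",
        r"re:model\.visual\.deepstack_merger_list\.\d+\..*",
    ],
)

# A calibration dataset can be supplied via oneshot(..., dataset=...)
# if activation scales require it.
oneshot(model=model, recipe=recipe)
model.save_pretrained("Cosmos-Reason2-8B-NVFP4", save_compressed=True)
```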
---
## Quick Start (vLLM)
```bash
vllm serve vrfai/Cosmos-Reason2-8B-NVFP4 \
  --max-model-len 8192
```
### Python (Transformers)
```python
from transformers import Qwen3VLForConditionalGeneration, AutoTokenizer
model_name = "vrfai/Cosmos-Reason2-8B-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```
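The snippet above only loads the model; for image-text inference you also need the processor. A sketch continuing from it, using the standard Qwen-VL chat template (the image URL is a placeholder):

```python
# Continues from the snippet above; the image URL is illustrative.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/scene.jpg"},
        {"type": "text", "text": "Describe the physical interaction in this scene."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```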
### OpenAI-compatible API
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."},
            ],
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
---
## Tested Environment
| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors (via llm-compressor) | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |
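
Before serving, it can be worth confirming the GPU actually reports the Blackwell compute capability the NVFP4 kernels require. A small check, assuming a CUDA-enabled PyTorch install:

```python
# Sanity check: NVFP4 kernels here require Blackwell (SM 120+).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"SM {major}{minor}")  # RTX 5090 reports SM 120 (compute capability 12.0)
assert (major, minor) >= (12, 0), "NVFP4 requires an SM 120+ (Blackwell) GPU"
```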
---
## Model Overview
Cosmos-Reason2-8B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs.
| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~8B |
| **Hidden size** | 4096 |
| **Layers** | 36 (standard GQA transformer) |
| **Attention heads** | 32 Q / 8 KV |
| **Vision encoder depth** | 27 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |
### Quality Benchmarks
For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B).
---
## Ethical Considerations & Safety
> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B) and applies equally to this quantized derivative.
This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.
**Safety note:** Because this model is designed for robot planning and can serve as a vision-language-action (VLA) backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.
Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
---
## Credits
- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)