---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-8B
tags:
- nvfp4
- quantized
- compressed-tensors
- blackwell
- physical-ai
- embodied-reasoning
- cosmos
- nvidia
- vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the
  [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---

# Cosmos-Reason2-8B-NVFP4

NVFP4 quantized version of [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.

## NVFP4 Quantization Details

| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) |
| **Quantization** | NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8 |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 17 GB → **7.1 GB** (~58% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
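
A quick way to sanity-check the published size is to sum the weight shards in the repo. A minimal sketch, assuming `huggingface_hub` is installed:

```python
from huggingface_hub import HfApi

# Sum the sizes of all safetensors shards in the quantized repo.
info = HfApi().model_info("vrfai/Cosmos-Reason2-8B-NVFP4", files_metadata=True)
total = sum(f.size or 0 for f in info.siblings if f.rfilename.endswith(".safetensors"))
print(f"Total weight size: {total / 1e9:.1f} GB")
```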

### What's Quantized / What's Not

| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (36 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 27 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |
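
To verify this split against the shipped checkpoint, you can read the `quantization_config` block from its `config.json`. A minimal sketch, assuming the excluded modules are stored under the `ignore` key as llm-compressor normally writes them:

```python
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("vrfai/Cosmos-Reason2-8B-NVFP4", "config.json")
qcfg = json.load(open(path))["quantization_config"]
print(qcfg.get("format"))  # compressed-tensors serialization format
print(qcfg.get("ignore"))  # modules kept in BF16 (lm_head, vision tower, mergers)
```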

### Quantization Config (llm-compressor)

```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 27 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```

---

## Quick Start (vLLM)

```bash
vllm serve vrfai/Cosmos-Reason2-8B-NVFP4 \
  --max-model-len 8192
```

### Python (Transformers)

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_name = "vrfai/Cosmos-Reason2-8B-NVFP4"
# AutoProcessor bundles the tokenizer with the image/video processor
# needed for multimodal inputs.
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```

### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."},
            ],
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

---

## Tested Environment

| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |

---

## Model Overview

Cosmos-Reason2-8B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs.

| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~8B |
| **Hidden size** | 4096 |
| **Layers** | 36 (standard GQA transformer) |
| **Attention heads** | 32 Q / 8 KV |
| **Vision encoder depth** | 27 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |

### Quality Benchmarks

For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B).

---

## Ethical Considerations & Safety

> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B) and applies equally to this quantized derivative.

This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

**Safety note:** Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

---

## Credits

- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
|