---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-2B
tags:
  - nvfp4
  - quantized
  - compressed-tensors
  - blackwell
  - physical-ai
  - embodied-reasoning
  - cosmos
  - nvidia
  - vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---

# Cosmos-Reason2-2B-NVFP4

NVFP4 quantized version of [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.

## NVFP4 Quantization Details

| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B) |
| **Quantization** | NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8 |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 4.6 GB → **2.7 GB** (~41% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |

### What's Quantized / What's Not

Unlike hybrid-attention models (e.g. Qwen3.6), Cosmos-Reason2-2B uses a standard transformer backbone — all language model linear layers are quantized. Only the visual components and output head are preserved in BF16:

| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (28 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 24 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |

### Quantization Config (llm-compressor)

```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 24 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```

---

## Quick Start (vLLM)

```bash
vllm serve vrfai/Cosmos-Reason2-2B-NVFP4 \
  --max-model-len 8192
```

The model fits comfortably on a single RTX 5090 (32 GB). No `--tensor-parallel-size` needed.
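### Offline Inference (vLLM Python API)

If you don't need a server, the same checkpoint can be used through vLLM's offline `LLM` API. This is a minimal sketch assuming the same single-GPU setup and context length as the serve command above; the prompt is purely illustrative.

```python
from vllm import LLM, SamplingParams

# Load the NVFP4 checkpoint directly (compressed-tensors format is detected automatically)
llm = LLM(model="vrfai/Cosmos-Reason2-2B-NVFP4", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=256)

# llm.chat() applies the model's chat template; image/video inputs can be passed
# as OpenAI-style content parts, as in the API example further below.
messages = [
    {
        "role": "user",
        "content": "Is it physically plausible to stack a bowling ball on top of an egg? Explain.",
    }
]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```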
### Python (Transformers)

```python
from transformers import AutoProcessor, AutoTokenizer, Qwen3VLForConditionalGeneration

model_name = "vrfai/Cosmos-Reason2-2B-NVFP4"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# For image/video prompts, AutoProcessor bundles the tokenizer with the visual preprocessor
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```

### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-2B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."},
            ],
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

---

## Tested Environment

| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |

---

## Model Overview

Cosmos-Reason2-2B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs. It is designed for use as a planner or reasoning backbone in robotics and Vision-Language-Action (VLA) pipelines.

| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~2B |
| **Hidden size** | 2048 |
| **Layers** | 28 (standard GQA transformer) |
| **Attention heads** | 16 Q / 8 KV |
| **Vision encoder depth** | 24 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |

### Quality Benchmarks

For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B).

---

## Ethical Considerations & Safety

> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-2B) and applies equally to this quantized derivative.

This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.

**Safety note:** Because this model is designed for robot planning and can serve as a VLA backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

---

## Credits

- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
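---

## Appendix: Reproducing the Quantization (Sketch)

For reference, the recipe shown earlier can be applied with llm-compressor's `oneshot` entrypoint. The script below is a hypothetical reconstruction, not the exact script used to produce this checkpoint: the calibration dataset, sample count, and sequence length are illustrative assumptions, and the precise `oneshot` arguments may vary between llm-compressor versions.

```python
# Hypothetical reproduction sketch (assumes an llm-compressor release with NVFP4
# support and enough memory to hold the BF16 base model during calibration).
from datasets import load_dataset
from transformers import AutoProcessor, AutoTokenizer, Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "nvidia/Cosmos-Reason2-2B"
SAVE_DIR = "Cosmos-Reason2-2B-NVFP4"

model = Qwen3VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Small text-only calibration set for the FP4 activation scales (assumed choice);
# only the LLM linear layers are quantized, so image/video samples are not required here.
NUM_SAMPLES = 64
MAX_LEN = 2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")

def tokenize(sample):
    text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_LEN, truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Mirror of recipe.yaml above: NVFP4 on every Linear layer except lm_head and
# the vision-side modules, which stay in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        r"re:model\.visual\.blocks\.\d+\..*",
        "model.visual.merger.linear_fc1",
        "model.visual.merger.linear_fc2",
        r"re:model\.visual\.deepstack_merger_list\.\d+\..*",
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

# Export in compressed-tensors format so vLLM can load the checkpoint directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```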