---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: transformers
pipeline_tag: image-text-to-text
base_model: nvidia/Cosmos-Reason2-8B
tags:
- nvfp4
- quantized
- compressed-tensors
- blackwell
- physical-ai
- embodied-reasoning
- cosmos
- nvidia
- vllm
quantized_by: vrfai
extra_gated_prompt: >-
  By downloading this model, you agree to the
  [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
---
# Cosmos-Reason2-8B-NVFP4
NVFP4-quantized version of [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B), produced by [vrfai](https://huggingface.co/vrfai) using [llm-compressor](https://github.com/vllm-project/llm-compressor).
> **License:** This model inherits the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) from the base model. Commercial use and derivative models are permitted under its terms.
## NVFP4 Quantization Details
| | |
|---|---|
| **Base model** | [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) |
| **Quantization** | NVFP4 — FP4 (E2M1) weights and activations with dynamic local scales; FP8 (E4M3) group scales |
| **Format** | `compressed-tensors` (native vLLM support) |
| **Tool** | [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Model size** | 17 GB → **7.1 GB** (~58% reduction) |
| **Requires** | NVIDIA Blackwell GPU (SM 120+), vLLM ≥ 0.19 |
### What's Quantized / What's Not
| Component | Precision | Reason |
|---|---|---|
| All LLM layers — FFN + attention projections (36 layers) | **NVFP4** | Standard transformer, stable under 4-bit |
| Vision encoder — all 27 blocks + merger | **BF16** | Preserved for visual perception quality |
| DeepStack merger list (3×) | **BF16** | Multi-scale visual fusion, sensitive to precision |
| `lm_head` | **BF16** | Output logits preserved for generation stability |
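To double-check this split on the published checkpoint, the `compressed-tensors` metadata in `config.json` records the ignore list. A minimal inspection sketch (exact key names depend on the compressed-tensors version):

```python
# Sketch: inspect which modules the checkpoint keeps in BF16.
# Assumes huggingface_hub is installed; quantization_config keys
# may vary across compressed-tensors versions.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("vrfai/Cosmos-Reason2-8B-NVFP4", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f).get("quantization_config", {})

print(qcfg.get("format"))      # compressed-tensors NVFP4 pack format
print(qcfg.get("ignore", []))  # lm_head, vision blocks, mergers
```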
### Quantization Config (llm-compressor)
```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — 27 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeepStack multi-scale merger
    - re:model\.visual\.deepstack_merger_list\.\d+\..*
```
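For reference, the same recipe can be driven from Python. This is a hedged reproduction sketch following llm-compressor's `oneshot` flow; exact argument names may shift between releases, and a calibration dataset can be passed to `oneshot` if the activation scales need calibrating:

```python
# Hedged reproduction sketch of the one-shot NVFP4 quantization.
# API details may differ across llm-compressor releases.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason2-8B", torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        r"re:model\.visual\.blocks\.\d+\..*",
        "model.visual.merger.linear_fc1",
        "model.visual.merger.linear_fc2",
        r"re:model\.visual\.deepstack_merger_list\.\d+\..*",
    ],
)

# A calibration dataset can be supplied via oneshot(..., dataset=...)
# if activation scales require it.
oneshot(model=model, recipe=recipe)
model.save_pretrained("Cosmos-Reason2-8B-NVFP4", save_compressed=True)
```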
---
## Quick Start (vLLM)
```bash
vllm serve vrfai/Cosmos-Reason2-8B-NVFP4 \
  --max-model-len 8192
```
### Python (Transformers)
```python
from transformers import Qwen3VLForConditionalGeneration, AutoTokenizer
model_name = "vrfai/Cosmos-Reason2-8B-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```
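The snippet above only loads the model; for image-text inference you also need the processor. A sketch continuing from it, using the standard Qwen-VL chat template (the image URL is a placeholder):

```python
# Continues from the snippet above; the image URL is illustrative.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/scene.jpg"},
        {"type": "text", "text": "Describe the physical interaction in this scene."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```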
### OpenAI-compatible API
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Cosmos-Reason2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://..."}},
                {"type": "text", "text": "Describe the physical interaction in this scene."},
            ],
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
---
## Tested Environment
| Component | Version |
|-----------|---------|
| vLLM | 0.19.1 |
| Transformers | 5.6.0 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors (via llm-compressor) | 0.14.0.1 |
| GPU | 1× NVIDIA RTX 5090 |
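
Before serving, it can be worth confirming the GPU actually reports the Blackwell compute capability the NVFP4 kernels require. A small check, assuming a CUDA-enabled PyTorch install:

```python
# Sanity check: NVFP4 kernels here require Blackwell (SM 120+).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"SM {major}{minor}")  # RTX 5090 reports SM 120 (compute capability 12.0)
assert (major, minor) >= (12, 0), "NVFP4 requires an SM 120+ (Blackwell) GPU"
```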
---
## Model Overview
Cosmos-Reason2-8B is a vision-language model developed by NVIDIA for **Physical AI reasoning** — understanding physical common sense and embodied interactions from video and image inputs.
| | |
|---|---|
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Parameters** | ~8B |
| **Hidden size** | 4096 |
| **Layers** | 36 (standard GQA transformer) |
| **Attention heads** | 32 Q / 8 KV |
| **Vision encoder depth** | 27 blocks (DeepStack-enhanced) |
| **Context length** | 262,144 tokens |
| **Input modalities** | Text, image, video |
### Quality Benchmarks
For benchmark results see the [Physical AI Bench Leaderboard](https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard) and the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B).
---
## Ethical Considerations & Safety
> This section is reproduced from the [base model card](https://huggingface.co/nvidia/Cosmos-Reason2-8B) and applies equally to this quantized derivative.
This model is intended for **Physical AI developers** working on embodied reasoning tasks. Users are responsible for model inputs and outputs, including implementing appropriate guardrails prior to deployment.
**Safety note:** Because this model is designed for robot planning and can serve as a vision-language-action (VLA) backbone, its outputs may directly influence physical actuation. Planning errors or misinterpretations carry inherent life-safety risks, including physical collisions or unsafe object manipulation.
Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
---
## Credits
- **Original model:** [NVIDIA](https://huggingface.co/nvidia) — [Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B)
- **NVFP4 quantization:** [vrfai](https://huggingface.co/vrfai)
- **Quantization framework:** [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)