How to use with llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI
# (the :Q4_K_M suffix selects a quant from the Files list below;
#  :Q5_K_M, :Q8_0, or :BF16 work the same way):
llama-server -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
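Once the server is up (llama-server listens on http://localhost:8080 by default), any OpenAI-compatible client can talk to it. A minimal sketch using curl against the standard chat completions endpoint, assuming the default port:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Explain what this model is good at in two sentences."}
        ],
        "max_tokens": 128
      }'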
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M
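The default CMake build is CPU-only. For NVIDIA GPUs, llama.cpp exposes a CUDA backend behind a CMake flag; a sketch assuming the CUDA toolkit is installed (on Apple Silicon the Metal backend is enabled by default, so no extra flag is needed):

# Optional: rebuild with CUDA support so layers can be offloaded to the GPU
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --target llama-server llama-cli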
Use Docker
docker model run hf.co/robertzty/Cosmos-Reason2-32B-GGUF:Q4_K_M

Cosmos-Reason2-32B GGUF

Pure GGUF conversion of nvidia/Cosmos-Reason2-32B.

Built on NVIDIA Cosmos.

Files

  • Cosmos-Reason2-32B-BF16.gguf: BF16 text backbone GGUF.
  • Cosmos-Reason2-32B-Q4_K_M.gguf: smaller 4-bit text backbone GGUF for lower memory use.
  • Cosmos-Reason2-32B-Q5_K_M.gguf: balanced 5-bit text backbone GGUF with better quality than Q4.
  • Cosmos-Reason2-32B-Q8_0.gguf: larger 8-bit text backbone GGUF for higher quality.
  • mmproj-Cosmos-Reason2-32B-F16.gguf: F16 multimodal projector / vision GGUF.

Use one text backbone file together with the mmproj file for multimodal inference.
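The -hf flag above downloads files automatically, but the llama-server invocation shown under Usage assumes the files are already on disk. A sketch using the Hugging Face CLI to fetch the recommended Q4_K_M pairing into the current directory:

huggingface-cli download robertzty/Cosmos-Reason2-32B-GGUF \
  Cosmos-Reason2-32B-Q4_K_M.gguf \
  mmproj-Cosmos-Reason2-32B-F16.gguf \
  --local-dir .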

Hardware estimates

These are rough inference estimates for llama.cpp with batch size 1. Actual memory use depends on context length, image/video inputs, backend, and how many layers are offloaded to GPU.

Text backbone | File size | Text + mmproj | Suggested system RAM | Suggested VRAM (mostly/full GPU offload) | Notes
Q4_K_M        | 19.8 GB   | 21.0 GB       | 32 GB minimum, 48 GB comfortable | 24 GB tight, 32 GB comfortable | Best first choice for local use.
Q5_K_M        | 23.2 GB   | 24.4 GB       | 48 GB comfortable                | 32 GB comfortable              | Better quality than Q4 with moderate extra memory.
Q8_0          | 34.8 GB   | 36.0 GB       | 64 GB comfortable                | 48 GB+ recommended             | Higher quality, much larger.
BF16          | 65.5 GB   | 66.7 GB       | 96 GB+ recommended               | 80 GB+ or multi-GPU            | Original-precision GGUF; not a practical default for most local machines.

KV cache adds roughly 2 GiB per 8k text tokens at fp16 cache precision, before additional image/video token overhead. Reduce --ctx-size or use partial CPU/GPU offload if memory is tight.
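As a sanity check on that 2 GiB figure: an fp16 cache stores one key and one value vector per layer per token. With illustrative dimensions for a model of this size (say 64 layers, 8 KV heads, head dimension 128 — hypothetical values, not read from the actual config), the arithmetic works out to:

# 2 (K+V) x 64 layers x 8 KV heads x 128 head_dim x 2 bytes (fp16) x 8192 tokens
echo $(( 2 * 64 * 8 * 128 * 2 * 8192 ))   # 2147483648 bytes = 2 GiB per 8k tokens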

Source

Original model: https://huggingface.co/nvidia/Cosmos-Reason2-32B

This GGUF conversion was produced with llama.cpp's convert_hf_to_gguf.py from the original Hugging Face safetensors. The converted model uses the qwen3vl architecture, with roughly 33B parameters in total.

Usage

Use one text backbone file together with the multimodal projector in llama.cpp:

llama-server \
  -m Cosmos-Reason2-32B-Q4_K_M.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B-F16.gguf
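For one-off multimodal runs in the terminal, llama.cpp also ships llama-mtmd-cli, which takes the same backbone/projector pair plus an image. A sketch assuming a local test.jpg:

llama-mtmd-cli \
  -m Cosmos-Reason2-32B-Q4_K_M.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B-F16.gguf \
  --image test.jpg \
  -p "Describe what is happening in this scene."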

BF16 and Q8_0 are large and may require CPU offload or a multi-GPU setup.
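If memory is tight, the same server invocation can be tuned with the offload and context flags mentioned under Hardware estimates; a sketch with full GPU offload and a capped context:

# -ngl sets how many layers go to the GPU; lower it for partial CPU offload
llama-server \
  -m Cosmos-Reason2-32B-Q4_K_M.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B-F16.gguf \
  -ngl 99 \
  --ctx-size 8192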

License

Licensed by NVIDIA Corporation under the NVIDIA Open Model License.

See NOTICE and the original model card for license terms and usage requirements.
