---
base_model: nvidia/Cosmos-Reason2-32B
library_name: llama.cpp
pipeline_tag: image-text-to-text
tags:
- gguf
- qwen3-vl
- cosmos
- nvidia
- multimodal
- image-text-to-text
- bf16
- q4_k_m
- q5_k_m
- q8_0
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
---

# Cosmos-Reason2-32B GGUF

Pure GGUF conversion of `nvidia/Cosmos-Reason2-32B`. Built on NVIDIA Cosmos.

## Files

- `Cosmos-Reason2-32B-BF16.gguf`: BF16 text backbone GGUF.
- `Cosmos-Reason2-32B-Q4_K_M.gguf`: smaller 4-bit text backbone GGUF for lower memory use.
- `Cosmos-Reason2-32B-Q5_K_M.gguf`: balanced 5-bit text backbone GGUF with better quality than Q4.
- `Cosmos-Reason2-32B-Q8_0.gguf`: larger 8-bit text backbone GGUF for higher quality.
- `mmproj-Cosmos-Reason2-32B-F16.gguf`: F16 multimodal projector / vision GGUF.

Use one text backbone file together with the `mmproj` file for multimodal inference.

## Hardware estimates

These are rough inference estimates for `llama.cpp` with batch size 1. Actual memory use depends on context length, image/video inputs, backend, and how many layers are offloaded to the GPU.

| Text backbone | File size | Text + mmproj | Suggested system RAM | Suggested VRAM for mostly/full GPU offload | Notes |
| --- | ---: | ---: | ---: | ---: | --- |
| `Q4_K_M` | 19.8 GB | 21.0 GB | 32 GB minimum, 48 GB comfortable | 24 GB tight, 32 GB comfortable | Best first choice for local use. |
| `Q5_K_M` | 23.2 GB | 24.4 GB | 48 GB comfortable | 32 GB comfortable | Better quality than Q4 with moderate extra memory. |
| `Q8_0` | 34.8 GB | 36.0 GB | 64 GB comfortable | 48 GB+ recommended | Higher quality, much larger. |
| `BF16` | 65.5 GB | 66.7 GB | 96 GB+ recommended | 80 GB+ or multi-GPU | Original-precision GGUF; not a practical default for most local machines. |

KV cache adds roughly 2 GiB per 8k text tokens at fp16 cache precision, before additional image/video token overhead (a worked version of this estimate is in the Examples section below). Reduce `--ctx-size` or use partial CPU/GPU offload if memory is tight.

## Source

Original model: https://huggingface.co/nvidia/Cosmos-Reason2-32B

This GGUF conversion was produced with `llama.cpp` `convert_hf_to_gguf.py` from the original Hugging Face safetensors.

## Usage

Use one text backbone file together with the multimodal projector in `llama.cpp`:

```bash
llama-server \
  -m Cosmos-Reason2-32B-Q4_K_M.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B-F16.gguf
```

BF16 and Q8_0 are large and may require CPU offload or a multi-GPU setup. An example request against the running server is shown in the Examples section below.

## License

Licensed by NVIDIA Corporation under the NVIDIA Open Model License. See `NOTICE` and the original model card for license terms and usage requirements.
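
## Examples

A worked version of the KV cache estimate from the hardware table. The hyperparameters here are assumptions based on the Qwen3 32B-class text backbone (64 layers, 8 KV heads via grouped-query attention, head dimension 128); `llama.cpp` prints the actual values from the GGUF metadata at load time, so treat this as a sketch rather than a guarantee.

```bash
# Back-of-the-envelope KV cache size at fp16 cache precision.
# Assumed hyperparameters -- verify against the GGUF metadata at load time:
layers=64 kv_heads=8 head_dim=128 bytes_per_val=2
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_val ))  # x2 for K and V
echo "$per_token bytes per token"                                  # 262144 (256 KiB)
echo "$(( per_token * 8192 / 1024 / 1024 )) MiB per 8192 tokens"   # 2048 MiB (~2 GiB)
```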
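Once `llama-server` is running as shown in Usage (default port 8080), it exposes an OpenAI-compatible API. Below is a minimal sketch of a multimodal request; the prompt and image URL are placeholders, and depending on your `llama.cpp` build the image may need to be supplied as a base64 `data:` URI instead of a remote URL:

```bash
# Hypothetical multimodal request against a local llama-server instance.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe what is happening in this image." },
          { "type": "image_url", "image_url": { "url": "https://example.com/scene.jpg" } }
        ]
      }
    ]
  }'
```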