Use one text backbone file together with the `mmproj` file for multimodal inference.
## Hardware estimates
These are rough inference estimates for `llama.cpp` at batch size 1. Actual memory use depends on context length, image/video inputs, the backend, and how many layers are offloaded to the GPU.
| Text backbone quant | File size | Text + mmproj size | Suggested system RAM | Suggested VRAM for mostly/full GPU offload | Notes |
| --- | ---: | ---: | ---: | ---: | --- |
| `Q4_K_M` | 19.8 GB | 21.0 GB | 32 GB minimum, 48 GB comfortable | 24 GB tight, 32 GB comfortable | Best first choice for local use. |
| `Q5_K_M` | 23.2 GB | 24.4 GB | 48 GB comfortable | 32 GB comfortable | Better quality than Q4 with moderate extra memory. |
| `BF16` | 65.5 GB | 66.7 GB | 96 GB+ recommended | 80 GB+ or multi-GPU | Original precision GGUF; not a practical default for most local machines. |
The KV cache adds roughly 2 GiB per 8k text tokens at fp16 cache precision, before additional image/video token overhead. Reduce `--ctx-size` or use partial CPU/GPU offload if memory is tight.
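As a sanity check, the fp16 KV-cache footprint can be estimated from the attention shapes: 2 (K and V) * n_layers * n_kv_heads * head_dim * n_ctx * 2 bytes. The shape values below are assumptions for illustration, not read from this model; with these numbers the arithmetic reproduces the ~2 GiB per 8k tokens figure:

```bash
# Rough fp16 KV-cache estimate. The shape values are assumptions for
# illustration; read the real ones from the GGUF metadata of your file.
n_layers=64; n_kv_heads=8; head_dim=128; n_ctx=8192; bytes_per_elem=2
echo "$(( 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3 )) GiB"  # -> 2 GiB
```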
## Source
Original model: https://huggingface.co/nvidia/Cosmos-Reason2-32B
This GGUF conversion was produced with `llama.cpp`'s `convert_hf_to_gguf.py` from the original Hugging Face safetensors.
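A minimal conversion sketch, assuming a `llama.cpp` checkout and the safetensors downloaded to `/path/to/Cosmos-Reason2-32B`; the path and flags are illustrative, so check `convert_hf_to_gguf.py --help` in your checkout:

```bash
# Illustrative path/flags; consult --help in your llama.cpp checkout for the
# authoritative options, including how to emit the multimodal projector file.
python convert_hf_to_gguf.py /path/to/Cosmos-Reason2-32B \
  --outtype bf16 --outfile Cosmos-Reason2-32B-BF16.gguf
```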
## Usage
Use one text backbone file together with the multimodal projector in `llama.cpp`:
```bash
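# Start an OpenAI-compatible HTTP server with the text backbone plus the projector.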
llama-server \
  -m Cosmos-Reason2-32B-Q4_K_M.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B-F16.gguf
```
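Once the server is up, one way to exercise the multimodal path is the OpenAI-compatible chat endpoint. The host, port, prompt, and image path below are placeholders:

```bash
# Placeholder host/port/image. base64 -w0 is GNU coreutils; use base64 -i on macOS.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$(base64 -w0 image.jpg)"'"}}
      ]
    }]
  }'
```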
BF16 and Q8_0 are large and may require CPU offload or a multi-GPU setup.
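If the weights do not fit in VRAM, `llama.cpp` can keep some layers on the CPU via `-ngl`. The layer count and context size below are illustrative starting points, not tuned recommendations:

```bash
# -ngl sets how many layers are offloaded to the GPU; the rest stay on the CPU.
llama-server \
  -m Cosmos-Reason2-32B-BF16.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B-F16.gguf \
  -ngl 40 --ctx-size 8192
```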
## License
Licensed by NVIDIA Corporation under the NVIDIA Open Model License.
See `NOTICE` and the original model card for license terms and usage requirements.