Add hardware estimates (README.md)
Use one text backbone file together with the `mmproj` file for multimodal inference.
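As a concrete sketch, a multimodal run with `llama.cpp`'s `llama-mtmd-cli` might look like the following. The filenames are placeholders for whichever quant you downloaded, and flag names can vary between `llama.cpp` builds, so check `--help` on yours:

```shell
# Placeholder filenames -- substitute the backbone quant and mmproj file you downloaded.
# llama-mtmd-cli is llama.cpp's multimodal CLI; it loads the text backbone with -m
# and the vision projector with --mmproj.
llama-mtmd-cli \
  -m Cosmos-Reason2-32B-Q4_K_M.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B.gguf \
  --image input.jpg \
  -p "Describe what is happening in this image."
```

Without `--mmproj`, the backbone still works as a text-only model.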

## Hardware estimates

These are rough inference estimates for `llama.cpp` with batch size 1. Actual memory use depends on context length, image/video inputs, backend, and how many layers are offloaded to GPU.

| Text backbone | File size | Text + mmproj | Suggested system RAM | Suggested VRAM for mostly/full GPU offload | Notes |
| --- | ---: | ---: | ---: | ---: | --- |
| `Q4_K_M` | 19.8 GB | 21.0 GB | 32 GB minimum, 48 GB comfortable | 24 GB tight, 32 GB comfortable | Best first choice for local use. |
| `Q5_K_M` | 23.2 GB | 24.4 GB | 48 GB comfortable | 32 GB comfortable | Better quality than Q4 with moderate extra memory. |
| `Q8_0` | 34.8 GB | 36.0 GB | 64 GB comfortable | 48 GB+ recommended | Higher quality, much larger. |
| `BF16` | 65.5 GB | 66.7 GB | 96 GB+ recommended | 80 GB+ or multi-GPU | Original precision GGUF; not a practical default for most local machines. |

KV cache adds roughly 2 GiB per 8k text tokens at fp16 cache precision, before additional image/video token overhead. Reduce `--ctx-size` or use partial CPU/GPU offload if memory is tight.
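The "2 GiB per 8k tokens" figure can be reproduced with back-of-the-envelope arithmetic. The layer count, KV-head count, and head dimension below are illustrative assumptions, not confirmed values for this model:

```shell
# Rough fp16 KV-cache size: K and V tensors per layer, per KV head.
# Assumed dims (illustrative only): 64 layers, 8 KV heads (GQA),
# head dim 128, 2 bytes per value (fp16 cache).
tokens=8192
layers=64
kv_heads=8
head_dim=128
bytes=$((2 * layers * kv_heads * head_dim * 2 * tokens))  # 2 = K and V
echo "$bytes"  # 2147483648 bytes = 2 GiB for 8k tokens
```

Halving the cache precision (e.g. a q8 KV cache, where supported) or the context length scales this linearly.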
## Source
Original model: https://huggingface.co/nvidia/Cosmos-Reason2-32B