robertzty committed
Commit ef4533f · verified · 1 Parent(s): f5d019b

Add hardware estimates

Files changed (1)
  1. README.md +13 -0
README.md CHANGED
@@ -34,6 +34,19 @@ Built on NVIDIA Cosmos.

  Use one text backbone file together with the `mmproj` file for multimodal inference.

+ ## Hardware estimates
+
+ These are rough inference estimates for `llama.cpp` with batch size 1. Actual memory use depends on context length, image/video inputs, backend, and how many layers are offloaded to GPU.
+
+ | Text backbone | File size | Text + mmproj | Suggested system RAM | Suggested VRAM for mostly/full GPU offload | Notes |
+ | --- | ---: | ---: | ---: | ---: | --- |
+ | `Q4_K_M` | 19.8 GB | 21.0 GB | 32 GB minimum, 48 GB comfortable | 24 GB tight, 32 GB comfortable | Best first choice for local use. |
+ | `Q5_K_M` | 23.2 GB | 24.4 GB | 48 GB comfortable | 32 GB comfortable | Better quality than Q4 with moderate extra memory. |
+ | `Q8_0` | 34.8 GB | 36.0 GB | 64 GB comfortable | 48 GB+ recommended | Higher quality, much larger. |
+ | `BF16` | 65.5 GB | 66.7 GB | 96 GB+ recommended | 80 GB+ or multi-GPU | Original precision GGUF; not a practical default for most local machines. |
+
+ KV cache adds roughly 2 GiB per 8k text tokens at fp16 cache precision, before additional image/video token overhead. Reduce `--ctx-size` or use partial CPU/GPU offload if memory is tight.
+
  ## Source

  Original model: https://huggingface.co/nvidia/Cosmos-Reason2-32B
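
As a usage sketch for the backbone-plus-`mmproj` pairing described in the diff: assuming a recent `llama.cpp` build where multimodal inference goes through `llama-mtmd-cli`, and with placeholder GGUF file names (check this repo's file list for the real ones), a `Q4_K_M` run with full GPU offload might look like this.

```sh
# Sketch only: the .gguf file names below are placeholders, not this
# repo's actual file names. -ngl 99 offloads all layers to the GPU
# (see the VRAM column above); lower it for partial offload, or
# shrink -c if memory is tight.
./llama-mtmd-cli \
  -m Cosmos-Reason2-32B-Q4_K_M.gguf \
  --mmproj mmproj-Cosmos-Reason2-32B-f16.gguf \
  --image input.png \
  -p "Describe what is happening in this image." \
  -ngl 99 -c 8192
```

The `-ngl` and `-c` flags are the two knobs behind the table's RAM/VRAM split: fewer offloaded layers shift weight memory from VRAM to system RAM, and a smaller context shrinks the KV cache.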
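
The committed 2 GiB-per-8k-tokens rule of thumb follows from the standard KV-cache formula: 2 (K and V) × layers × KV heads × head dim × bytes per element × tokens. A minimal check with shell arithmetic, using placeholder architecture values that reproduce the README's figure but are not confirmed for Cosmos-Reason2-32B (`llama.cpp` prints the real ones in its model-load log):

```sh
# KV cache bytes = 2 (K+V) * n_layers * n_kv_heads * head_dim * bytes/elem * n_tokens
# Placeholder values, NOT confirmed for Cosmos-Reason2-32B:
N_LAYERS=64; N_KV_HEADS=8; HEAD_DIM=128
BYTES_PER_ELEM=2  # fp16 cache
N_TOKENS=8192
echo $(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * N_TOKENS ))
# 2147483648 bytes = 2 GiB, matching the estimate in the diff
```

At these values the cache costs 256 KiB per token, so halving `--ctx-size` to 4k saves about 1 GiB.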