---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---

# llamita.cpp

> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that enables [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) to compile and run with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).

## Results

| Model | Size on disk | RAM used | Prompt (tok/s) | Generation (tok/s) | Board |
|-------|--------------|----------|----------------|--------------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 | 1.1 | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 | 1.6 | Jetson Nano 4GB |

An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.

## What Was Changed

27 files modified, ~3,200 lines of patches across 7 categories:

1. **C++17 to C++14** — `if constexpr`, `std::is_same_v`, structured bindings, fold expressions
2. **CUDA 10.2 API stubs** — `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF`
3. **SM 5.3 Maxwell** — Warp size macros, MMQ params, flash attention disabled with stubs
4. **ARM NEON GCC 8** — Custom struct types for broken `vld1q_*_x*` intrinsics
5. **Linker** — `-lstdc++fs` for `std::filesystem`
6. **Critical correctness fix** — `binbcast.cu` fold expression silently computing nothing
7. **Build system** — `CUDA_STANDARD 14`, flash attention template exclusion

## The Bug That Broke Everything

During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke ALL binary operations (add, multiply, subtract, divide). The model loaded, allocated memory, ran inference — and produced complete garbage.
The fix was one line.

## Links

- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)

## Credits

- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) — Original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) — Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) — 1-bit LLMs (Apache 2.0)