---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---

# llamita.cpp

> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that enables [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) to compile and run with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).

## Results

| Model | Size on disk | RAM used | Prompt (tok/s) | Generation (tok/s) | Board |
|-------|--------------|----------|----------------|--------------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 | 1.1 | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 | 1.6 | Jetson Nano 4GB |

An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.

## What Was Changed

27 files modified, ~3,200 lines of patches across 7 categories:

1. **C++17 to C++14** — `if constexpr`, `std::is_same_v`, structured bindings, fold expressions
2. **CUDA 10.2 API stubs** — `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF`
3. **SM 5.3 Maxwell** — Warp size macros, MMQ params, flash attention disabled with stubs
4. **ARM NEON GCC 8** — Custom struct types for broken `vld1q_*_x*` intrinsics
5. **Linker** — `-lstdc++fs` for `std::filesystem`
6. **Critical correctness fix** — `binbcast.cu` fold expression silently computing nothing
7. **Build system** — `CUDA_STANDARD 14`, flash attention template exclusion

## The Bug That Broke Everything

During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke ALL binary operations (add, multiply, subtract, divide). The model loaded, allocated memory, ran inference — and produced complete garbage.
The fix was one line.

## Links

- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)

## Credits

- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) — Original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) — Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) — 1-bit LLMs (Apache 2.0)