title: "🧬 BitNet b1.58 2B4T – CPU Inference (bitnet.cpp)"
emoji: ⚡
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
- microsoft/bitnet-b1.58-2B-4T
- microsoft/bitnet-b1.58-2B-4T-gguf
tags:
- bitnet
- 1-bit
- cpu-inference
- ternary-weights
- bitnet-cpp
- llama-server
short_description: "1-bit LLM on CPU with bitnet.cpp – 4-10x faster"
---
# ⚡ BitNet b1.58 2B4T – CPU Inference with bitnet.cpp
The **fast** version. This Space compiles Microsoft's [bitnet.cpp](https://github.com/microsoft/BitNet) from source and runs the official GGUF model with the **I2_S lossless kernel**, achieving a **4-10× speedup** over the transformers-based version.
## Architecture
```
┌──────────────────────────────────────────────┐
│  Gradio UI (Python)                          │
│    ↕  OpenAI-compatible streaming API        │
│  llama-server (bitnet.cpp, port 8080)        │
│    ↕  I2_S ternary kernel (additions only)   │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)     │
└──────────────────────────────────────────────┘
```
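Because the UI layer only talks to llama-server over HTTP, it has to wait for the server to finish loading the model before accepting chat requests. A minimal readiness poll, assuming llama.cpp's standard `/health` endpoint (the function name and defaults here are illustrative, not the Space's actual code):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://127.0.0.1:8080/health",
                    timeout=120.0, interval=1.0):
    """Poll llama-server's /health endpoint until it answers 200 or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:   # model loaded, server ready
                    return True
        except (urllib.error.URLError, OSError):
            pass                          # server not up yet; keep retrying
        time.sleep(interval)
    return False
```

On a Space, a loop like this typically runs once at container startup, before Gradio begins serving.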
## Why bitnet.cpp?
| Engine | Speed | Lossless? | Notes |
|---|---|---|---|
| **bitnet.cpp I2_S** | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ 1.4% | Quality degraded |
## Features
- 💬 **Streaming Chat** – Real-time conversation with live tok/s stats
- 📊 **Benchmark** – Greedy generation with detailed performance metrics
- 📄 **Paper Results** – Published comparison table from the technical report
- 🏗️ **Architecture** – How ternary weights enable multiplication-free inference
- ⚙️ **System Info** – Live CPU/memory stats
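The "multiplication-free" claim can be made concrete with a toy example. With ternary weights in {-1, 0, +1}, a matrix-vector product needs only additions and subtractions of activations, which is the core idea the I2_S kernel exploits (this pure-Python sketch is an illustration, not the actual kernel):

```python
# Toy illustration of ternary (1.58-bit) inference: each weight is -1, 0, or +1,
# so the dot product never performs a multiplication.

def ternary_matvec(weights, x):
    """weights: rows of ternary ints in {-1, 0, 1}; x: activation vector."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:        # +1 -> add the activation
                acc += xi
            elif w == -1:     # -1 -> subtract it
                acc -= xi
            # 0 -> skip the term entirely
        out.append(acc)
    return out

W = [[1, -1, 0], [0, 1, 1]]
x = [0.5, 2.0, -1.0]
print(ternary_matvec(W, x))  # [-1.5, 1.0]
```

The real kernel packs several ternary values per byte and vectorizes the adds, but the arithmetic is the same.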
## How it works
1. **Docker build** compiles bitnet.cpp with I2_S kernel optimizations
2. **Container startup** launches `llama-server` on localhost:8080
3. **Gradio app** connects via OpenAI-compatible streaming API
4. **All inference** happens on CPU with addition-only matrix operations
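Step 3 can be sketched as a small streaming client. The `/v1/chat/completions` path and the `data: ...` / `data: [DONE]` server-sent-events framing follow llama.cpp's OpenAI-compatible server API; the helper names below are hypothetical, not the Space's actual code:

```python
import json
import urllib.request

def parse_sse_line(line: str):
    """Decode one SSE 'data:' line into a dict; None for [DONE] or non-data lines."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":        # end-of-stream sentinel
        return None
    return json.loads(payload)

def stream_chat(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    """Yield text deltas from llama-server's OpenAI-compatible streaming endpoint."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            chunk = parse_sse_line(raw.decode("utf-8"))
            if chunk:
                delta = chunk["choices"][0]["delta"].get("content", "")
                if delta:
                    yield delta

if __name__ == "__main__":
    for piece in stream_chat("Hello!"):
        print(piece, end="", flush=True)
```

Timing the yielded deltas against a token count is also how a live tok/s readout can be derived on the UI side.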
## References
- 📄 [Technical Report](https://arxiv.org/abs/2504.12285)
- 📄 [bitnet.cpp Paper](https://arxiv.org/abs/2502.11880)
- 🤗 [Model Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T)
- 🤗 [GGUF Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)
- 💻 [bitnet.cpp](https://github.com/microsoft/BitNet) (38K+ ⭐)