---
title: "🧬 BitNet b1.58 2B4T β€” CPU Inference (bitnet.cpp)"
emoji: ⚑
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
  - microsoft/bitnet-b1.58-2B-4T
  - microsoft/bitnet-b1.58-2B-4T-gguf
tags:
  - bitnet
  - 1-bit
  - cpu-inference
  - ternary-weights
  - bitnet-cpp
  - llama-server
short_description: "1-bit LLM on CPU with bitnet.cpp – 4-10x faster"
---

# ⚡ BitNet b1.58 2B4T – CPU Inference with bitnet.cpp

The **fast** version. This Space compiles Microsoft's [bitnet.cpp](https://github.com/microsoft/BitNet) from source and runs the official GGUF model with the **lossless I2_S kernel**, achieving a **4-10× speedup** over the transformers-based version.

## Architecture

```
┌─────────────────────────────────────────────────┐
│  Gradio UI (Python)                             │
│    ↕ OpenAI-compatible streaming API            │
│  llama-server (bitnet.cpp, port 8080)           │
│    ↕ I2_S ternary kernel (additions only)       │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)        │
└─────────────────────────────────────────────────┘
```
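
The two layers talk over plain HTTP. Here is a minimal sketch of that round-trip, assuming the default `localhost:8080` address from the diagram. Because llama-server speaks the OpenAI chat-completions protocol, the stock `openai` client works once its `base_url` points at the server; the model name below is a placeholder (the server serves whichever GGUF it was started with), and the tok/s estimate treats each streamed delta as roughly one token.

```python
# Minimal streaming client for the llama-server layer shown above.
# Assumes the default localhost:8080 address; the model name is a
# placeholder, as the server serves whichever GGUF it was started with.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="bitnet-b1.58-2B-4T",
    messages=[{"role": "user", "content": "Explain ternary weights in one sentence."}],
    stream=True,
)

t0, n_tokens = time.perf_counter(), 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_tokens += 1  # each streamed delta is roughly one token
        print(chunk.choices[0].delta.content, end="", flush=True)
print(f"\n~{n_tokens / (time.perf_counter() - t0):.1f} tok/s")
```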

## Why bitnet.cpp?

| Engine | Speed | Lossless? | Notes |
|---|---|---|---|
| **bitnet.cpp I2_S** | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ (1.4% loss) | Quality degraded |

## Features

- 💬 **Streaming Chat** – Real-time conversation with live tok/s stats
- 📊 **Benchmark** – Greedy generation with detailed performance metrics
- 📈 **Paper Results** – Published comparison table from the technical report
- 🏗️ **Architecture** – How ternary weights enable multiplication-free inference (see the sketch after this list)
- ⚙️ **System Info** – Live CPU/memory stats
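
The identity behind that Architecture tab fits in a few lines of NumPy: when every weight is -1, 0, or +1, a matrix-vector product needs no multiplications at all, only additions and subtractions of activations. The snippet below is illustrative only; the real I2_S kernel operates on packed 2-bit weights with SIMD, but the arithmetic is the same.

```python
# Multiplication-free matmul with ternary weights: since W ∈ {-1, 0, +1},
# each output element is just a signed sum of selected activations.
# Toy illustration; the real I2_S kernel uses packed 2-bit weights.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix, values in {-1, 0, 1}
x = rng.standard_normal(8)             # float activations

# Add where the weight is +1, subtract where it is -1, skip zeros.
y = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert np.allclose(y, W @ x)           # identical to the ordinary matmul
```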

## How it works

1. **Docker build** compiles bitnet.cpp with I2_S kernel optimizations
2. **Container startup** launches `llama-server` on localhost:8080 (startup sketch after this list)
3. **Gradio app** connects via OpenAI-compatible streaming API
4. **All inference** happens on CPU with addition-only matrix operations
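
A condensed sketch of steps 2 and 3, with illustrative paths and flags (the Space's actual build location and server arguments may differ; the `/health` route is the one llama.cpp's server provides):

```python
# Startup sketch for steps 2-3: spawn llama-server, wait until it is
# healthy, then let the Gradio app connect. Binary path, model path, and
# flags are illustrative, not the Space's exact configuration.
import subprocess
import time
import urllib.request

SERVER = "http://127.0.0.1:8080"

proc = subprocess.Popen([
    "./build/bin/llama-server",            # compiled from bitnet.cpp in the Docker build
    "-m", "models/ggml-model-i2_s.gguf",   # lossless I2_S GGUF weights
    "--port", "8080",
])

# Poll /health (provided by llama.cpp's server) until it answers.
for _ in range(120):
    try:
        urllib.request.urlopen(f"{SERVER}/health", timeout=1)
        break
    except OSError:
        time.sleep(1)
else:
    proc.terminate()
    raise RuntimeError("llama-server did not become ready in time")
```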

## References

- 📄 [Technical Report](https://arxiv.org/abs/2504.12285)
- 📄 [bitnet.cpp Paper](https://arxiv.org/abs/2502.11880)
- 🤗 [Model Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T)
- 🤗 [GGUF Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)
- 💻 [bitnet.cpp](https://github.com/microsoft/BitNet) (38K+ ⭐)