---
title: "🧬 BitNet b1.58 2B4T – CPU Inference (bitnet.cpp)"
emoji: ⚡
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
  - microsoft/bitnet-b1.58-2B-4T
  - microsoft/bitnet-b1.58-2B-4T-gguf
tags:
  - bitnet
  - 1-bit
  - cpu-inference
  - ternary-weights
  - bitnet-cpp
  - llama-server
short_description: "1-bit LLM on CPU with bitnet.cpp – 4-10x faster"
---

# ⚡ BitNet b1.58 2B4T – CPU Inference with bitnet.cpp

The **fast** version. This Space compiles Microsoft's [bitnet.cpp](https://github.com/microsoft/BitNet) from source and runs the official GGUF model with the **I2_S lossless kernel**, achieving a **4-10× speedup** over the transformers-based version.

## Architecture

```
┌─────────────────────────────────────────────────┐
│  Gradio UI (Python)                             │
│    ↓ OpenAI-compatible streaming API            │
│  llama-server (bitnet.cpp, port 8080)           │
│    ↓ I2_S ternary kernel (additions only)       │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)        │
└─────────────────────────────────────────────────┘
```

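To illustrate the "additions only" idea behind the I2_S kernel, here is a minimal Python sketch. It is not the real kernel (which is hand-optimized, bit-packed C++ in bitnet.cpp); it only shows why weights restricted to {-1, 0, +1} need no multiplier in a matrix-vector product:

```python
# Illustrative sketch only: the real I2_S kernel is bit-packed, vectorized C++.
# With ternary weights, each dot-product term is an add (+1), a subtract (-1),
# or a skip (0) -- no multiplications anywhere.

def ternary_matvec(W, x):
    """Multiplication-free matrix-vector product for ternary W."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1: add the activation
            elif w == -1:
                acc -= xi   # -1: subtract the activation
            # 0: contributes nothing, skip
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, 1.0]
print(ternary_matvec(W, x))  # [-0.5, 2.5]
```

The production kernel additionally packs the ternary values compactly and processes many add/subtract lanes per instruction, which is where the speedup over generic bf16 matmuls comes from.
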
## Why bitnet.cpp?

| Engine | Speed | Lossless? | Notes |
|---|---|---|---|
| **bitnet.cpp I2_S** | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ 1.4% | Quality degraded |

+
## Features
|
| 46 |
+
|
| 47 |
+
- π¬ **Streaming Chat** β Real-time conversation with live tok/s stats
|
| 48 |
+
- π **Benchmark** β Greedy generation with detailed performance metrics
|
| 49 |
+
- π **Paper Results** β Published comparison table from the technical report
|
| 50 |
+
- ποΈ **Architecture** β How ternary weights enable multiplication-free inference
|
| 51 |
+
- βοΈ **System Info** β Live CPU/memory stats
|
| 52 |
+
|
| 53 |
+
## How it works
|
| 54 |
+
|
| 55 |
+
1. **Docker build** compiles bitnet.cpp with I2_S kernel optimizations
|
| 56 |
+
2. **Container startup** launches `llama-server` on localhost:8080
|
| 57 |
+
3. **Gradio app** connects via OpenAI-compatible streaming API
|
| 58 |
+
4. **All inference** happens on CPU with addition-only matrix operations
|
| 59 |
+
|
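Step 3 can be sketched as a minimal streaming client. This is a hedged example, not the Space's actual app code: it assumes the standard OpenAI-style `/v1/chat/completions` route that llama-server exposes, SSE `data:` framing, and an arbitrary `model` name (the server hosts a single model):

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # llama-server launched at container startup

def build_request(prompt: str, stream: bool = True) -> dict:
    """OpenAI-compatible chat-completions payload."""
    return {
        "model": "bitnet-b1.58-2B-4T",  # informational; server serves one model
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def stream_chat(prompt: str):
    """Yield text deltas from the server's SSE stream."""
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode().strip()
            # SSE framing: payload lines start with "data: "; "[DONE]" ends the stream
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            delta = chunk["choices"][0]["delta"].get("content")
            if delta:
                yield delta

if __name__ == "__main__":
    for piece in stream_chat("What is a ternary weight?"):
        print(piece, end="", flush=True)
```

The Gradio UI does essentially this in a loop, accumulating deltas into the chat window and timing them to report live tok/s.
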
## References

- 📄 [Technical Report](https://arxiv.org/abs/2504.12285)
- 📄 [bitnet.cpp Paper](https://arxiv.org/abs/2502.11880)
- 🤗 [Model Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T)
- 🤗 [GGUF Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)
- 💻 [bitnet.cpp](https://github.com/microsoft/BitNet) (38K+ ⭐)