knoxel committed · verified
Commit 622c714 · 1 Parent(s): e17101b

Upload README.md

Files changed (1):
  1. README.md +60 -4

README.md CHANGED
@@ -1,10 +1,66 @@
  ---
- title: Bitnet Cpp Explorer
- emoji: 🦀
- colorFrom: yellow
+ title: "🧬 BitNet b1.58 2B4T — CPU Inference (bitnet.cpp)"
+ emoji: ⚡
+ colorFrom: indigo
  colorTo: blue
  sdk: docker
  pinned: false
+ models:
+ - microsoft/bitnet-b1.58-2B-4T
+ - microsoft/bitnet-b1.58-2B-4T-gguf
+ tags:
+ - bitnet
+ - 1-bit
+ - cpu-inference
+ - ternary-weights
+ - bitnet-cpp
+ - llama-server
+ short_description: "1-bit LLM on CPU with bitnet.cpp — 4-10x faster"
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # ⚡ BitNet b1.58 2B4T — CPU Inference with bitnet.cpp
+
+ The **fast** version. This Space compiles Microsoft's [bitnet.cpp](https://github.com/microsoft/BitNet) from source and runs the official GGUF model with the **I2_S lossless kernel**, achieving a **4-10×** speedup over the transformers-based version.
+
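+ For reference, the I2_S GGUF named in the diagram below can be fetched with `huggingface_hub`. A sketch, assuming the filename shown in the architecture diagram exists in the official GGUF repo (the Space's Dockerfile may fetch it differently):
+
+ ```python
+ # Sketch: download the official I2_S GGUF weights (~1.1 GB).
+ from huggingface_hub import hf_hub_download
+
+ gguf_path = hf_hub_download(
+     repo_id="microsoft/bitnet-b1.58-2B-4T-gguf",
+     filename="ggml-model-i2_s.gguf",  # filename as shown in the diagram below
+ )
+ print(gguf_path)  # local cache path that would be handed to llama-server
+ ```
+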
+ ## Architecture
+
+ ```
+ ┌──────────────────────────────────────────────┐
+ │              Gradio UI (Python)              │
+ │      ↕ OpenAI-compatible streaming API       │
+ │     llama-server (bitnet.cpp, port 8080)     │
+ │    ↕ I2_S ternary kernel (additions only)    │
+ │   ggml-model-i2_s.gguf (1.1 GB, lossless)    │
+ └──────────────────────────────────────────────┘
+ ```
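+
+ The top hop in the diagram is plain HTTP: `llama-server` exposes an OpenAI-compatible `/v1/chat/completions` route, so any OpenAI-style client can stream tokens from it. A minimal sketch of that connection (illustrative, not the Space's actual app code; assumes the `openai` Python package and the server already listening on port 8080):
+
+ ```python
+ # Sketch: stream a chat reply from the local llama-server instance.
+ from openai import OpenAI
+
+ # llama-server ignores the API key, but the client requires a value.
+ client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
+
+ stream = client.chat.completions.create(
+     model="bitnet-b1.58-2B-4T",  # informational for a single-model server
+     messages=[{"role": "user", "content": "What are ternary weights?"}],
+     stream=True,
+ )
+ for chunk in stream:
+     delta = chunk.choices[0].delta.content
+     if delta:
+         print(delta, end="", flush=True)
+ ```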
+
+ ## Why bitnet.cpp?
+
+ | Engine | Speed | Lossless? | Notes |
+ |---|---|---|---|
+ | **bitnet.cpp I2_S** | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
+ | transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
+ | llama.cpp TQ1_0 | ~2 tok/s | ❌ ~1.4% quality loss | Quality degraded |
+
+ ## Features
+
+ - 💬 **Streaming Chat** — Real-time conversation with live tok/s stats (see the timing sketch after this list)
+ - 📊 **Benchmark** — Greedy generation with detailed performance metrics
+ - 📈 **Paper Results** — Published comparison table from the technical report
+ - 🏗️ **Architecture** — How ternary weights enable multiplication-free inference
+ - ⚙️ **System Info** — Live CPU/memory stats
+
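+ Those live tok/s numbers can be estimated client-side from the stream itself. A rough sketch of the idea (hypothetical helper, not the Space's code; counts streamed chunks as a proxy for tokens):
+
+ ```python
+ # Sketch: estimate generation speed from a streamed chat completion.
+ import time
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
+
+ start, n_chunks = time.perf_counter(), 0
+ stream = client.chat.completions.create(
+     model="bitnet-b1.58-2B-4T",
+     messages=[{"role": "user", "content": "Count to ten."}],
+     stream=True,
+ )
+ for chunk in stream:
+     if chunk.choices[0].delta.content:
+         n_chunks += 1  # llama-server emits roughly one token per chunk
+ print(f"~{n_chunks / (time.perf_counter() - start):.1f} tok/s")
+ ```
+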
+ ## How it works
+
+ 1. **Docker build** compiles bitnet.cpp with the I2_S kernel optimizations
+ 2. **Container startup** launches `llama-server` on `localhost:8080`
+ 3. **Gradio app** connects via the OpenAI-compatible streaming API
+ 4. **All inference** happens on CPU with addition-only matrix operations (toy illustration below)
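+
+ Why "addition-only"? With weights constrained to {-1, 0, +1}, every multiply in a matrix-vector product collapses into an add, a subtract, or a skip. A toy NumPy illustration of the principle (not the real I2_S kernel, which packs the ternary values into a compact 2-bit layout):
+
+ ```python
+ # Toy example: a ternary matrix-vector product without multiplications.
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+ W = rng.integers(-1, 2, size=(4, 8))           # ternary weights in {-1, 0, +1}
+ x = rng.standard_normal(8).astype(np.float32)  # float activations
+
+ # Add x[j] where W[i, j] == +1, subtract it where W[i, j] == -1.
+ y = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])
+
+ assert np.allclose(y, W @ x)  # agrees with the ordinary matmul
+ ```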
+
+ ## References
+
+ - 📄 [Technical Report](https://arxiv.org/abs/2504.12285)
+ - 📄 [bitnet.cpp Paper](https://arxiv.org/abs/2502.11880)
+ - 🤗 [Model Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T)
+ - 🤗 [GGUF Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)
+ - 💻 [bitnet.cpp](https://github.com/microsoft/BitNet) (38K+ ⭐)