---
title: "🧬 BitNet b1.58 2B4T – CPU Inference (bitnet.cpp)"
emoji: ⚡
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
  - microsoft/bitnet-b1.58-2B-4T
  - microsoft/bitnet-b1.58-2B-4T-gguf
tags:
  - bitnet
  - 1-bit
  - cpu-inference
  - ternary-weights
  - bitnet-cpp
  - llama-server
short_description: "1-bit LLM on CPU with bitnet.cpp – 4-10x faster"
---
# ⚡ BitNet b1.58 2B4T – CPU Inference with bitnet.cpp

The **fast** version. This Space compiles Microsoft's [bitnet.cpp](https://github.com/microsoft/BitNet) from source and runs the official GGUF model with the **I2_S lossless kernel**, achieving a **4-10× speedup** over the transformers-based version.
## Architecture

```
┌───────────────────────────────────────────────┐
│  Gradio UI (Python)                           │
│    ↓ OpenAI-compatible streaming API          │
│  llama-server (bitnet.cpp, port 8080)         │
│    ↓ I2_S ternary kernel (additions only)     │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)      │
└───────────────────────────────────────────────┘
```
## Why bitnet.cpp?

| Engine | Speed | Lossless? | Notes |
|---|---|---|---|
| **bitnet.cpp I2_S** | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ ~1.4% loss | Quality degraded |
## Features

- 💬 **Streaming Chat** – Real-time conversation with live tok/s stats
- 📊 **Benchmark** – Greedy generation with detailed performance metrics
- 📄 **Paper Results** – Published comparison table from the technical report
- 🏗️ **Architecture** – How ternary weights enable multiplication-free inference
- ⚙️ **System Info** – Live CPU/memory stats
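The "multiplication-free" point is easy to see in plain Python. This toy sketch (not the actual I2_S kernel, which operates on packed 2-bit weights with SIMD) shows why ternary weights in {-1, 0, +1} reduce a matrix-vector product to additions and subtractions:

```python
# Toy illustration of ternary matmul -- NOT the real I2_S kernel.
# With weights restricted to {-1, 0, +1}, each weight either adds,
# subtracts, or skips an activation; no multiplications are needed.
def ternary_matvec(weights, x):
    """weights: rows over {-1, 0, +1}; x: activation vector."""
    out = []
    for row in weights:
        acc = 0.0
        for w, a in zip(row, x):
            if w == 1:
                acc += a      # addition instead of multiply
            elif w == -1:
                acc -= a      # subtraction instead of multiply
            # w == 0: contributes nothing, skip entirely
        out.append(acc)
    return out

W = [[1, -1, 0], [0, 1, 1]]
x = [0.5, 2.0, -1.0]
print(ternary_matvec(W, x))  # prints [-1.5, 1.0]
```

The real kernel packs four ternary weights into each byte (hence "I2_S") and unpacks them in vectorized blocks, but the arithmetic identity is the same.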
## How it works

1. **Docker build** compiles bitnet.cpp with I2_S kernel optimizations
2. **Container startup** launches `llama-server` on localhost:8080
3. **Gradio app** connects via the OpenAI-compatible streaming API
4. **Inference** runs entirely on CPU with addition-only matrix operations
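Step 3 can be sketched with nothing but the standard library. The endpoint path and SSE chunk shape follow the OpenAI-compatible API that llama-server exposes; the URL, defaults, and helper names below are illustrative, not taken from this Space's source:

```python
# Minimal sketch of a client for llama-server's OpenAI-compatible
# streaming endpoint (URL and defaults are assumptions for illustration).
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_payload(messages, temperature=0.7, max_tokens=256):
    """Assemble an OpenAI-style streaming request body."""
    return {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": True,  # ask the server for incremental SSE chunks
    }

def stream_chat(messages):
    """Yield content deltas as the server streams them back."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(build_payload(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode("utf-8").strip()
            # SSE frames look like: data: {...json chunk...}
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            delta = chunk["choices"][0]["delta"].get("content", "")
            if delta:
                yield delta
```

A Gradio callback would simply iterate over `stream_chat(...)` and append each delta to the chat box, timing the loop to report live tok/s.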
## References

- 📄 [Technical Report](https://arxiv.org/abs/2504.12285)
- 📄 [bitnet.cpp Paper](https://arxiv.org/abs/2502.11880)
- 🤗 [Model Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T)
- 🤗 [GGUF Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)
- 💻 [bitnet.cpp](https://github.com/microsoft/BitNet) (38K+ ⭐)