---
title: "🧬 BitNet b1.58 2B4T - CPU-Only Inference Explorer"
emoji: 🧬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
models:
- microsoft/bitnet-b1.58-2B-4T
tags:
- bitnet
- 1-bit
- cpu-inference
- ternary-weights
- efficient-inference
short_description: "Chat with Microsoft's 1-bit LLM on CPU - no GPU needed"
---

# 🧬 BitNet b1.58 2B4T - CPU-Only Inference Explorer

An interactive demo of **Microsoft Research's first open-source native 1-bit Large Language Model**.

> ⚡ **Want the fast version?** See [knoxel/bitnet-cpp-explorer](https://huggingface.co/spaces/knoxel/bitnet-cpp-explorer): the same model, powered by bitnet.cpp's optimized ternary kernels (4-10× faster).

## What makes this special?

| Feature | Detail |
|---|---|
| **Weights** | Ternary {-1, 0, +1}: just 1.58 bits per weight |
| **Memory** | 0.4 GB (non-embedding), **5-13× less** than comparable models |
| **Energy** | 0.028 J per token, **6-9× less** than FP16 models |
| **Quality** | 54.2% average benchmark score, competitive with Qwen2.5 1.5B (55.2%) |
| **Training** | Trained from scratch on 4T tokens (not post-training quantized) |
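
The "1.58 bits per weight" figure is the information content of a three-valued weight, log2(3) ≈ 1.585. The sketch below illustrates that arithmetic by packing five ternary weights into one byte (3^5 = 243 fits in 256), which costs 1.6 bits per weight. The function names `pack_trits` and `unpack_trits` are hypothetical, and this base-3 layout is an assumption made for illustration; it is not claimed to be the storage format used by the model files or by bitnet.cpp.

```python
import math

# Information content of one ternary weight, in bits (about 1.585).
BITS_PER_TERNARY = math.log2(3)

def pack_trits(weights):
    """Pack weights from {-1, 0, +1} into bytes, five weights per byte (base 3)."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        chunk = list(weights[i:i + 5])
        chunk += [0] * (5 - len(chunk))  # pad the last group with zero weights
        value = 0
        for w in chunk:
            value = value * 3 + (w + 1)  # digits 0, 1, 2 stand for -1, 0, +1
        out.append(value)  # at most 3**5 - 1 = 242, so it fits in one byte
    return bytes(out)

def unpack_trits(data, count):
    """Inverse of pack_trits; count is the original number of weights."""
    weights = []
    for byte in data:
        digits = []
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            digits.append(digit - 1)
        weights.extend(reversed(digits))  # digits were extracted least significant first
    return weights[:count]

weights = [1, -1, 0, 0, 1, -1, 1]
packed = pack_trits(weights)
assert unpack_trits(packed, len(weights)) == weights
```

At 1.6 bits per weight, this packing is 16 / 1.6 = 10 times smaller than 16-bit weights. That is simple arithmetic rather than a measurement, but it sits inside the 5-13× range quoted in the table.
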
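
## Key insight

Because every weight is -1, 0, or +1, the multiplications inside a matrix multiply reduce to **additions and subtractions**: a +1 weight adds the input, a -1 weight subtracts it, and a 0 weight skips it. No floating-point multiplies by weights are needed, which is why CPUs can run BitNet efficiently.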
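
To make the idea concrete, here is a minimal pure-Python sketch of a matrix-vector product with ternary weights that uses only additions and subtractions, checked against an ordinary multiply-based version. The names `ternary_matvec` and `dense_matvec` are hypothetical. The sketch leaves out scaling factors and every kernel-level optimization, so it shows the principle rather than how bitnet.cpp or `transformers` implement it.

```python
def ternary_matvec(W, x):
    """Compute W @ x for W with entries in {-1, 0, +1} without multiplying."""
    y = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            # 0 weight: contributes nothing, so it is skipped
        y.append(acc)
    return y

def dense_matvec(W, x):
    """Reference implementation using ordinary multiplication."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[1, 0, -1],
     [0, 1, 1]]
x = [2.0, 3.0, 5.0]
assert ternary_matvec(W, x) == dense_matvec(W, x) == [-3.0, 8.0]
```
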

## Demo features

- 💬 **Chat**: streaming conversation with live tokens/sec stats
- 📊 **Benchmark**: single-shot generation with memory and speed metrics
- 📄 **Paper Results**: published benchmark comparison table
- 🏗️ **Architecture**: visual explainer of how BitNet b1.58 differs from standard Transformers
- ⚙️ **System**: live hardware and memory stats
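
A live tokens/sec readout like the one in the Chat tab can be computed by timing a token stream as it is consumed. The sketch below is one way to do that. `stream_with_stats` is a hypothetical helper, not code from this Space's `app.py`, and it accepts any iterator of generated tokens or text chunks.

```python
import time

def stream_with_stats(token_iter, clock=time.perf_counter):
    """Yield (token, tokens_per_second) for each item produced by token_iter.

    The rate is the running average since the first item was requested.
    The clock argument exists so the timing source can be swapped out in tests.
    """
    start = clock()
    count = 0
    for token in token_iter:
        count += 1
        elapsed = clock() - start
        rate = count / elapsed if elapsed > 0 else 0.0
        yield token, rate

# Usage with a stand-in generator; a real app would pass its model's token stream.
def fake_tokens():
    for word in ["Hello", ",", " world"]:
        time.sleep(0.01)  # simulate per-token generation latency
        yield word

for token, rate in stream_with_stats(fake_tokens()):
    print(f"{token!r}: {rate:.1f} tok/s")
```

Because the rate is a running average, it also includes the time spent before the first token arrives, so early readings are lower than the steady-state speed.
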

## Performance note

This demo uses the `transformers` library (~1.4 tok/s), which does **not** include the specialized bitnet.cpp kernels. For the paper's reported CPU latency (29 ms/token, about 34 tok/s), see:

- ⚡ [Fast version with bitnet.cpp](https://huggingface.co/spaces/knoxel/bitnet-cpp-explorer)
- 💻 [bitnet.cpp repo](https://github.com/microsoft/BitNet) with the [GGUF weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)
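
The two speed figures above use different units: latency in ms/token and throughput in tok/s. The conversion is a reciprocal, sketched here with hypothetical helper names. The two numbers come from different software stacks and possibly different hardware, so the output is a unit conversion and not a benchmark comparison.

```python
def latency_to_throughput(ms_per_token):
    """Convert per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token

def throughput_to_latency(tokens_per_second):
    """Convert tokens per second to per-token latency in milliseconds."""
    return 1000.0 / tokens_per_second

print(round(latency_to_throughput(29), 1))  # 34.5 tok/s for the paper's 29 ms/token
print(round(throughput_to_latency(1.4)))    # 714 ms/token for this demo's ~1.4 tok/s
```
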

## References

- 📄 [Technical Report](https://arxiv.org/abs/2504.12285): BitNet b1.58 2B4T
- 📄 [bitnet.cpp Paper](https://arxiv.org/abs/2502.11880): optimized inference kernels
- 🤗 [Model Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T)
- 💻 [bitnet.cpp](https://github.com/microsoft/BitNet) (38K+ ⭐)