---
title: 🧬 BitNet b1.58 2B4T – CPU-Only Inference Explorer
emoji: 🧬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
models:
- microsoft/bitnet-b1.58-2B-4T
tags:
- bitnet
- 1-bit
- cpu-inference
- ternary-weights
- efficient-inference
short_description: Chat with Microsoft's 1-bit LLM on CPU – no GPU needed
---
# 🧬 BitNet b1.58 2B4T – CPU-Only Inference Explorer
An interactive demo of Microsoft Research's first open-source native 1-bit Large Language Model.
⚡ Want the fast version? See knoxel/bitnet-cpp-explorer – the same model, powered by bitnet.cpp's optimized ternary kernels (4-10× faster).
## What makes this special?
| Feature | Detail |
|---|---|
| Weights | Ternary {-1, 0, +1} – just 1.58 bits per weight |
| Memory | 0.4 GB (non-embedding) – 5-13× less than comparable models |
| Energy | 0.028 J per token – 6-9× less than FP16 models |
| Quality | 54.2% average benchmark score – competitive with Qwen2.5 1.5B (55.2%) |
| Training | Trained from scratch on 4T tokens (not post-training quantized) |
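
Where does the 1.58 come from? A weight restricted to three values carries log₂ 3 bits of information. A quick check of the arithmetic:

```python
import math

# Each ternary weight takes one of 3 values {-1, 0, +1},
# so it carries log2(3) bits of information -- the "1.58" in b1.58.
print(f"{math.log2(3):.3f} bits per weight")  # 1.585
```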
## Key insight
Since weights are only -1, 0, or +1, matrix multiplication reduces to pure addition and subtraction. No floating-point multiplies are needed – this is why CPUs can run BitNet efficiently.
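
A minimal NumPy sketch of that idea (illustrative only – the actual kernels in transformers and bitnet.cpp are far more optimized, and `ternary_matvec` below is a hypothetical helper, not part of either library):

```python
import numpy as np

def ternary_matvec(W, x):
    """Compute W @ x using only additions and subtractions.

    W has entries in {-1, 0, +1}: each output element is the sum of
    the inputs where the weight is +1, minus the sum where it is -1.
    Zero weights are skipped entirely -- no multiplies anywhere.
    """
    out = np.empty(W.shape[0], dtype=x.dtype)
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))             # ternary weight matrix
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(W, x), W @ x)  # matches a real matmul
```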
## Demo features
- 💬 **Chat** – Streaming conversation with live tokens/sec stats
- 📊 **Benchmark** – Single-shot generation with memory & speed metrics
- 📈 **Paper Results** – Published benchmark comparison table
- 🏗️ **Architecture** – Visual explainer of how BitNet b1.58 differs from standard Transformers
- ⚙️ **System** – Live hardware & memory stats (see the sketch after this list)
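
As a rough idea of what the System tab involves, a sketch assuming psutil, a common choice for such stats (the actual app.py may gather them differently):

```python
import psutil

def system_snapshot():
    """Gather the kind of live hardware/memory stats a System tab might show."""
    mem = psutil.virtual_memory()
    return {
        "cpu_cores": psutil.cpu_count(logical=True),
        "cpu_percent": psutil.cpu_percent(interval=0.5),  # sampled over 0.5 s
        "ram_total_gb": round(mem.total / 1e9, 1),
        "ram_used_gb": round(mem.used / 1e9, 1),
    }

print(system_snapshot())
```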
## Performance note
This demo uses the transformers library (~1.4 tok/s), which does not include the specialized bitnet.cpp kernels. For the paper's reported CPU latency (29 ms/token ≈ 34 tok/s), see:
- ⚡ Fast version with bitnet.cpp
- 💻 bitnet.cpp repo with the GGUF weights
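
For reference, a minimal transformers-based generation loop with a tok/s measurement might look like the sketch below. This assumes a transformers build that supports the BitNet architecture; it is not the exact code in app.py, and the prompt is made up.

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"  # model id from the Space metadata
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # runs on CPU by default

messages = [{"role": "user", "content": "Why can BitNet run well on a CPU?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

start = time.perf_counter()
output_ids = model.generate(input_ids, max_new_tokens=64)
elapsed = time.perf_counter() - start

new_tokens = output_ids.shape[-1] - input_ids.shape[-1]
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
print(f"{new_tokens / elapsed:.2f} tok/s")
```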
## References
- 📄 Technical Report – BitNet b1.58 2B4T
- 📄 bitnet.cpp Paper – Optimized inference kernels
- 🤗 Model Weights
- 💻 bitnet.cpp (38K+ ⭐)