---
title: BitNet b1.58 2B4T – CPU Inference (bitnet.cpp)
emoji: ⚡
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
- microsoft/bitnet-b1.58-2B-4T
- microsoft/bitnet-b1.58-2B-4T-gguf
tags:
- bitnet
- 1-bit
- cpu-inference
- ternary-weights
- bitnet-cpp
- llama-server
short_description: 1-bit LLM on CPU with bitnet.cpp – 4-10x faster
---

# ⚡ BitNet b1.58 2B4T – CPU Inference with bitnet.cpp

The fast version. This Space compiles Microsoft's bitnet.cpp from source and runs the official GGUF model with the lossless I2_S kernel, achieving a 4-10× speedup over the transformers-based version.

## Architecture

```
┌─────────────────────────────────────────────┐
│  Gradio UI (Python)                         │
│    ↓ OpenAI-compatible streaming API        │
│  llama-server (bitnet.cpp, port 8080)       │
│    ↓ I2_S ternary kernel (additions only)   │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)    │
└─────────────────────────────────────────────┘
```
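
The UI treats the server as a standard OpenAI endpoint. Below is a minimal sketch of that client side, assuming the `openai` Python package; the function name, placeholder model name, and prompt are illustrative, not the Space's actual code.

```python
# Sketch: stream tokens from llama-server's OpenAI-compatible endpoint
# and report a rough tokens-per-second figure.
import time
from openai import OpenAI

# llama-server exposes /v1/chat/completions; no real API key is needed.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

def stream_chat(messages):
    start, n_chunks = time.time(), 0
    stream = client.chat.completions.create(
        model="bitnet-b1.58-2b-4t",  # placeholder: the server serves one fixed GGUF
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            n_chunks += 1  # roughly one token per streamed chunk
            yield delta
    print(f"\n~{n_chunks / (time.time() - start):.1f} tok/s")

for piece in stream_chat([{"role": "user", "content": "Hello!"}]):
    print(piece, end="", flush=True)
```

Since the server is launched with a single model, the `model` field is effectively a placeholder, and tok/s counted this way is approximate because one chunk is not always exactly one token.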

## Why bitnet.cpp?

| Engine | Speed | Lossless? | Notes |
|---|---|---|---|
| bitnet.cpp I2_S | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ 1.4% loss | Quality degraded |
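
For intuition on why the GGUF in the diagram above is only ~1.1 GB: I2_S stores each ternary weight in 2 bits, so four weights pack into one byte. A back-of-the-envelope sketch; the ~2.4B parameter count and the assumption that embeddings, norms, and scales stay in higher precision are mine, not from this README:

```python
# Back-of-the-envelope: packed size of ~2.4B ternary weights at 2 bits each.
ternary_params = 2.4e9                      # approximate parameter count (assumption)
packed_gb = ternary_params * 2 / 8 / 1e9    # 2 bits per weight, 8 bits per byte
print(f"packed ternary weights: {packed_gb:.2f} GB")  # ~0.60 GB
# Higher-precision tensors (embeddings, norms, scales) plausibly account
# for the remainder of the ~1.1 GB file.
```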

## Features

- 💬 Streaming Chat – Real-time conversation with live tok/s stats
- 📊 Benchmark – Greedy generation with detailed performance metrics
- 📄 Paper Results – Published comparison table from the technical report
- 🏗️ Architecture – How ternary weights enable multiplication-free inference (see the sketch after this list)
- ⚙️ System Info – Live CPU/memory stats
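
The multiplication-free point is easy to see in miniature: with weights restricted to {-1, 0, +1}, a matrix-vector product only ever adds, subtracts, or skips activations. A toy NumPy sketch, purely illustrative (the real I2_S kernel packs weights into 2-bit blocks and vectorizes this loop):

```python
# Toy demo: a ternary weight matrix needs no multiplications for W @ x.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))           # ternary weights in {-1, 0, +1}
x = rng.standard_normal(8).astype(np.float32)  # input activations

def ternary_matvec(W, x):
    out = np.zeros(W.shape[0], dtype=np.float32)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                out[i] += x[j]   # addition instead of multiply
            elif W[i, j] == -1:
                out[i] -= x[j]   # subtraction instead of multiply
            # W[i, j] == 0: contributes nothing, skip
    return out

# Matches an ordinary matmul up to float32 rounding.
assert np.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)
```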

## How it works

- Docker build compiles bitnet.cpp with I2_S kernel optimizations
- Container startup launches `llama-server` on localhost:8080 (sketched after this list)
- Gradio app connects via the OpenAI-compatible streaming API
- All inference happens on CPU with addition-only matrix operations

## References

- 📄 Technical Report
- 📄 bitnet.cpp Paper
- 🤗 Model Weights
- 🤗 GGUF Weights
- 💻 bitnet.cpp (38K+ ⭐)