---
title: "🧬 BitNet b1.58 2B4T – CPU Inference (bitnet.cpp)"
emoji: ⚡
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
  - microsoft/bitnet-b1.58-2B-4T
  - microsoft/bitnet-b1.58-2B-4T-gguf
tags:
  - bitnet
  - 1-bit
  - cpu-inference
  - ternary-weights
  - bitnet-cpp
  - llama-server
short_description: "1-bit LLM on CPU with bitnet.cpp – 4-10x faster"
---
# ⚡ BitNet b1.58 2B4T – CPU Inference with bitnet.cpp

The **fast** version. This Space compiles Microsoft's [bitnet.cpp](https://github.com/microsoft/BitNet) from source and runs the official GGUF model with the **I2_S lossless kernel**, achieving a **4-10× speedup** over the transformers-based version.
## Architecture

```
┌───────────────────────────────────────────────┐
│  Gradio UI (Python)                           │
│    ↓ OpenAI-compatible streaming API          │
│  llama-server (bitnet.cpp, port 8080)         │
│    ↓ I2_S ternary kernel (additions only)     │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)      │
└───────────────────────────────────────────────┘
```
## Why bitnet.cpp?

| Engine | Speed | Lossless? | Notes |
|---|---|---|---|
| **bitnet.cpp I2_S** | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ ~1.4% loss | Quality degraded |
## Features

- 💬 **Streaming Chat** – Real-time conversation with live tok/s stats
- 📊 **Benchmark** – Greedy generation with detailed performance metrics
- 📄 **Paper Results** – Published comparison table from the technical report
- 🏗️ **Architecture** – How ternary weights enable multiplication-free inference
- ⚙️ **System Info** – Live CPU/memory stats
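The "multiplication-free" point is easy to see in plain Python. This toy sketch (not the actual I2_S kernel, which operates on packed 2-bit weights with SIMD) shows why ternary weights in {-1, 0, +1} reduce a matrix-vector product to additions and subtractions:

```python
# Toy illustration of ternary matmul -- NOT the real I2_S kernel.
# With weights restricted to {-1, 0, +1}, each weight either adds,
# subtracts, or skips an activation; no multiplications are needed.
def ternary_matvec(weights, x):
    """weights: rows over {-1, 0, +1}; x: activation vector."""
    out = []
    for row in weights:
        acc = 0.0
        for w, a in zip(row, x):
            if w == 1:
                acc += a      # addition instead of multiply
            elif w == -1:
                acc -= a      # subtraction instead of multiply
            # w == 0: contributes nothing, skip entirely
        out.append(acc)
    return out

W = [[1, -1, 0], [0, 1, 1]]
x = [0.5, 2.0, -1.0]
print(ternary_matvec(W, x))  # prints [-1.5, 1.0]
```

The real kernel packs four ternary weights into each byte (hence "I2_S") and unpacks them in vectorized blocks, but the arithmetic identity is the same.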
## How it works

1. **Docker build** compiles bitnet.cpp with I2_S kernel optimizations
2. **Container startup** launches `llama-server` on localhost:8080
3. **Gradio app** connects via the OpenAI-compatible streaming API
4. **Inference** runs entirely on CPU with addition-only matrix operations
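Step 3 can be sketched with nothing but the standard library. The endpoint path and SSE chunk shape follow the OpenAI-compatible API that llama-server exposes; the URL, defaults, and helper names below are illustrative, not taken from this Space's source:

```python
# Minimal sketch of a client for llama-server's OpenAI-compatible
# streaming endpoint (URL and defaults are assumptions for illustration).
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_payload(messages, temperature=0.7, max_tokens=256):
    """Assemble an OpenAI-style streaming request body."""
    return {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": True,  # ask the server for incremental SSE chunks
    }

def stream_chat(messages):
    """Yield content deltas as the server streams them back."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(build_payload(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode("utf-8").strip()
            # SSE frames look like: data: {...json chunk...}
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            delta = chunk["choices"][0]["delta"].get("content", "")
            if delta:
                yield delta
```

A Gradio callback would simply iterate over `stream_chat(...)` and append each delta to the chat box, timing the loop to report live tok/s.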
## References

- 📄 [Technical Report](https://arxiv.org/abs/2504.12285)
- 📄 [bitnet.cpp Paper](https://arxiv.org/abs/2502.11880)
- 🤗 [Model Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T)
- 🤗 [GGUF Weights](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)
- 💻 [bitnet.cpp](https://github.com/microsoft/BitNet) (38K+ ⭐)