---
title: 🧬 BitNet b1.58 2B4T — CPU Inference (bitnet.cpp)
emoji: ⚡
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
  - microsoft/bitnet-b1.58-2B-4T
  - microsoft/bitnet-b1.58-2B-4T-gguf
tags:
  - bitnet
  - 1-bit
  - cpu-inference
  - ternary-weights
  - bitnet-cpp
  - llama-server
short_description: 1-bit LLM on CPU with bitnet.cpp — 4-10x faster
---

# ⚡ BitNet b1.58 2B4T — CPU Inference with bitnet.cpp

The fast version. This Space compiles Microsoft's bitnet.cpp from source and runs the official GGUF model with the lossless I2_S kernel — achieving a 4-10× speedup over the transformers-based version of this demo.

## Architecture

```
┌─────────────────────────────────────────────────┐
│  Gradio UI (Python)                             │
│    ↕ OpenAI-compatible streaming API            │
│  llama-server (bitnet.cpp, port 8080)           │
│    ↕ I2_S ternary kernel (additions only)       │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)        │
└─────────────────────────────────────────────────┘
```
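
The Gradio ↔ llama-server link is plain OpenAI-compatible SSE streaming. A minimal sketch of it is below (illustrative only — the Space's actual client code may differ; `/v1/chat/completions` is llama-server's standard OpenAI-compatible route, and the prompt text is made up):

```python
import json

import requests

# Stream a chat completion from the local llama-server instance.
# Each SSE line looks like: data: {"choices":[{"delta":{"content":"..."}}]}
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain ternary weights."}],
        "stream": True,  # ask the server to send tokens as they are generated
    },
    stream=True,
    timeout=300,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
```

Streaming is what lets the chat tab show live tok/s while generation is still in progress.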

## Why bitnet.cpp?

| Engine | Speed | Lossless? | Notes |
|--------|-------|-----------|-------|
| bitnet.cpp I2_S | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ (1.4% loss) | Quality degraded |
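
The bitnet.cpp numbers come from the Space's Benchmark tab; a rough client-side measurement can be taken like this (a sketch, assuming the server is up on port 8080 and that the OpenAI-compatible response carries a `usage` field, as in upstream llama-server):

```python
import time

import requests

# Rough end-to-end tok/s measurement against the local llama-server.
# temperature 0.0 requests greedy decoding, matching the Benchmark tab.
t0 = time.perf_counter()
reply = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Count from 1 to 50."}],
        "temperature": 0.0,
        "max_tokens": 128,
    },
    timeout=300,
).json()
elapsed = time.perf_counter() - t0

# Note: elapsed includes prompt processing, so this slightly
# understates pure generation speed on long prompts.
generated = reply["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tok/s")
```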

## Features

- 💬 Streaming Chat — Real-time conversation with live tok/s stats
- 📊 Benchmark — Greedy generation with detailed performance metrics
- 📈 Paper Results — Published comparison table from the technical report
- 🏗️ Architecture — How ternary weights enable multiplication-free inference (see the sketch after this list)
- ⚙️ System Info — Live CPU/memory stats
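
The core idea behind the Architecture tab fits in a few lines. This toy NumPy sketch (illustrative only — the real I2_S kernel packs ternary values into 2-bit lanes and uses SIMD) computes a matrix-vector product with weights restricted to {-1, 0, +1} using only additions and subtractions:

```python
import numpy as np

def ternary_matvec(W, x):
    """Toy matvec for ternary weights W in {-1, 0, +1}.

    Each output element is a sum of the inputs where the weight is +1,
    minus a sum of the inputs where it is -1 -- no multiplications.
    The real I2_S kernel does the same with 2-bit packed weights,
    which is where the CPU speedup comes from.
    """
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))           # ternary weight matrix
x = rng.standard_normal(8).astype(np.float32)  # activation vector
assert np.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)
```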

## How it works

1. Docker build compiles bitnet.cpp with I2_S kernel optimizations
2. Container startup launches llama-server on localhost:8080 (a launch sketch follows this list)
3. The Gradio app connects via the OpenAI-compatible streaming API
4. All inference happens on CPU with addition-only matrix operations
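
Steps 2-3 might look roughly like the sketch below. The binary and model paths are assumptions, not verbatim from this Space's entrypoint; `-m`, `--host`, `--port`, and the `/health` route are standard llama-server options:

```python
import subprocess
import time

import requests

# Launch the llama-server binary built by bitnet.cpp, then wait until it
# answers before starting the Gradio UI. Paths here are hypothetical.
server = subprocess.Popen([
    "./build/bin/llama-server",
    "-m", "models/ggml-model-i2_s.gguf",  # official I2_S GGUF from step 1
    "--host", "127.0.0.1",
    "--port", "8080",
])

# Poll the health endpoint until the model has finished loading.
for _ in range(120):
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=1).ok:
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)
else:
    server.terminate()
    raise RuntimeError("llama-server did not become ready in time")
```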

## References