---
title: 🧬 BitNet b1.58 2B4T — CPU Inference (bitnet.cpp)
emoji: ⚡
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
models:
  - microsoft/bitnet-b1.58-2B-4T
  - microsoft/bitnet-b1.58-2B-4T-gguf
tags:
  - bitnet
  - 1-bit
  - cpu-inference
  - ternary-weights
  - bitnet-cpp
  - llama-server
short_description: 1-bit LLM on CPU with bitnet.cpp — 4-10x faster
---

# ⚡ BitNet b1.58 2B4T — CPU Inference with bitnet.cpp

The fast version. This Space compiles Microsoft's bitnet.cpp from source and runs the official GGUF model with the lossless I2_S kernel — achieving a 4-10× speedup over the transformers-based version of this demo.

## Architecture

```
┌─────────────────────────────────────────────────┐
│  Gradio UI (Python)                             │
│    ↕ OpenAI-compatible streaming API            │
│  llama-server (bitnet.cpp, port 8080)           │
│    ↕ I2_S ternary kernel (additions only)       │
│  ggml-model-i2_s.gguf (1.1 GB, lossless)        │
└─────────────────────────────────────────────────┘
```
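
The Gradio ↔ llama-server link is plain OpenAI-compatible SSE streaming. A minimal sketch of it is below (illustrative only — the Space's actual client code may differ; `/v1/chat/completions` is llama-server's standard OpenAI-compatible route, and the prompt text is made up):

```python
import json

import requests

# Stream a chat completion from the local llama-server instance.
# Each SSE line looks like: data: {"choices":[{"delta":{"content":"..."}}]}
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain ternary weights."}],
        "stream": True,  # ask the server to send tokens as they are generated
    },
    stream=True,
    timeout=300,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
```

Streaming is what lets the chat tab show live tok/s while generation is still in progress.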

## Why bitnet.cpp?

| Engine | Speed | Lossless? | Notes |
|--------|-------|-----------|-------|
| bitnet.cpp I2_S | ~10-15 tok/s | ✅ 100% | Optimized ternary kernel |
| transformers (bf16) | ~1.4 tok/s | ✅ | No ternary optimization |
| llama.cpp TQ1_0 | ~2 tok/s | ❌ (1.4% loss) | Quality degraded |
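
The bitnet.cpp numbers come from the Space's Benchmark tab; a rough client-side measurement can be taken like this (a sketch, assuming the server is up on port 8080 and that the OpenAI-compatible response carries a `usage` field, as in upstream llama-server):

```python
import time

import requests

# Rough end-to-end tok/s measurement against the local llama-server.
# temperature 0.0 requests greedy decoding, matching the Benchmark tab.
t0 = time.perf_counter()
reply = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Count from 1 to 50."}],
        "temperature": 0.0,
        "max_tokens": 128,
    },
    timeout=300,
).json()
elapsed = time.perf_counter() - t0

# Note: elapsed includes prompt processing, so this slightly
# understates pure generation speed on long prompts.
generated = reply["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tok/s")
```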

## Features

- 💬 Streaming Chat — Real-time conversation with live tok/s stats
- 📊 Benchmark — Greedy generation with detailed performance metrics
- 📈 Paper Results — Published comparison table from the technical report
- 🏗️ Architecture — How ternary weights enable multiplication-free inference (see the sketch after this list)
- ⚙️ System Info — Live CPU/memory stats
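
The core idea behind the Architecture tab fits in a few lines. This toy NumPy sketch (illustrative only — the real I2_S kernel packs ternary values into 2-bit lanes and uses SIMD) computes a matrix-vector product with weights restricted to {-1, 0, +1} using only additions and subtractions:

```python
import numpy as np

def ternary_matvec(W, x):
    """Toy matvec for ternary weights W in {-1, 0, +1}.

    Each output element is a sum of the inputs where the weight is +1,
    minus a sum of the inputs where it is -1 -- no multiplications.
    The real I2_S kernel does the same with 2-bit packed weights,
    which is where the CPU speedup comes from.
    """
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))           # ternary weight matrix
x = rng.standard_normal(8).astype(np.float32)  # activation vector
assert np.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)
```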

## How it works

1. Docker build compiles bitnet.cpp with I2_S kernel optimizations
2. Container startup launches llama-server on localhost:8080 (a launch sketch follows this list)
3. The Gradio app connects via the OpenAI-compatible streaming API
4. All inference happens on CPU with addition-only matrix operations
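
Steps 2-3 might look roughly like the sketch below. The binary and model paths are assumptions, not verbatim from this Space's entrypoint; `-m`, `--host`, `--port`, and the `/health` route are standard llama-server options:

```python
import subprocess
import time

import requests

# Launch the llama-server binary built by bitnet.cpp, then wait until it
# answers before starting the Gradio UI. Paths here are hypothetical.
server = subprocess.Popen([
    "./build/bin/llama-server",
    "-m", "models/ggml-model-i2_s.gguf",  # official I2_S GGUF from step 1
    "--host", "127.0.0.1",
    "--port", "8080",
])

# Poll the health endpoint until the model has finished loading.
for _ in range(120):
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=1).ok:
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)
else:
    server.terminate()
    raise RuntimeError("llama-server did not become ready in time")
```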

## References