---
title: 🧬 BitNet b1.58 2B4T — CPU-Only Inference Explorer
emoji: 🧬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
models:
  - microsoft/bitnet-b1.58-2B-4T
tags:
  - bitnet
  - 1-bit
  - cpu-inference
  - ternary-weights
  - efficient-inference
short_description: Chat with Microsoft's 1-bit LLM on CPU — no GPU needed
---

# 🧬 BitNet b1.58 2B4T — CPU-Only Inference Explorer

An interactive demo of Microsoft Research's first open-source native 1-bit Large Language Model.

⚡ **Want the fast version?** See knoxel/bitnet-cpp-explorer — the same model, but powered by bitnet.cpp's optimized ternary kernels (4-10× faster).

## What makes this special?

| Feature | Detail |
| --- | --- |
| Weights | Ternary {-1, 0, +1} — just 1.58 bits per weight |
| Memory | 0.4 GB (non-embedding) — 5-13× less than comparable models |
| Energy | 0.028 J per token — 6-9× less than FP16 models |
| Quality | 54.2% avg benchmark — competitive with Qwen2.5 1.5B (55.2%) |
| Training | Trained from scratch on 4T tokens (not post-training quantized) |
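
Where the "1.58" comes from: a weight that can take exactly three values carries

$$
\log_2 3 \approx 1.585 \ \text{bits of information per weight.}
$$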

## Key insight

Since weights are only -1, 0, or +1, matrix multiplication becomes pure addition/subtraction. No floating-point multiplies are needed — this is why CPUs can run BitNet efficiently.
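
A minimal NumPy sketch of the idea (illustrative only, not the actual bitnet.cpp kernel): with ternary weights, a matrix-vector product can be computed with additions and subtractions alone.

```python
import numpy as np

# Ternary weight matrix: every entry is -1, 0, or +1.
W = np.array([[ 1, -1,  0],
              [ 0,  1,  1]])
x = np.array([0.3, -0.7, 2.0])   # full-precision activations

# Standard matrix-vector product (uses multiplies).
y_matmul = W @ x

# Multiply-free equivalent: add activations where the weight is +1,
# subtract them where the weight is -1, ignore zeros.
y_addsub = np.array([
    x[W[i] == 1].sum() - x[W[i] == -1].sum()
    for i in range(W.shape[0])
])

assert np.allclose(y_matmul, y_addsub)
```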

## Demo features

  • πŸ’¬ Chat β€” Streaming conversation with live tokens/sec stats
  • πŸ“Š Benchmark β€” Single-shot generation with memory & speed metrics
  • πŸ“ˆ Paper Results β€” Published benchmark comparison table
  • πŸ—οΈ Architecture β€” Visual explainer of how BitNet b1.58 differs from standard Transformers
  • βš™οΈ System β€” Live hardware & memory stats

## Performance note

This demo uses the plain `transformers` library (~1.4 tok/s), which does not include the specialized bitnet.cpp kernels. For the paper's reported CPU latency (29 ms/token, roughly 34 tok/s), see the bitnet.cpp-powered Space linked above.
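
For reference, a minimal sketch of CPU-only loading and generation with `transformers` (an assumed setup: the exact code and generation settings in `app.py` may differ, and the model requires a transformers version with BitNet support).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# bf16 weights; CPU inference is slow without the bitnet.cpp kernels.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "Explain 1-bit LLMs in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```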

## References