End-to-end 1-bit language model for llama.cpp (CUDA, Metal, CPU)

- **1.15 GB** parameter memory (down from 16.38 GB FP16) — fits on virtually any device with a GPU
- **End-to-end 1-bit weights** across embeddings, attention projections, MLP projections, and LM head
- **GGUF Q1_0 (g128)** format with inline dequantization kernels — no FP16 materialization
- **Cross-platform**: CUDA (RTX/datacenter), Metal (Mac), Android, CPU
- **Competitive benchmarks**: 70.5 avg score across 6 categories, matching full-precision 8B models at 1/14th the size
- **MLX companion**: also available as [MLX 1-bit g128](https://huggingface.co/prism-ml/Bonsai-8B-mlx-1bit) for native Apple Silicon inference
| Layers | 36 Transformer decoder blocks |
| Context length | 65,536 tokens |
| Vocab size | 151,936 |
| Weight format | GGUF Q1_0 |
| Deployed size | **1.15 GB** (14.2x smaller than FP16) |
| 1-bit coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |

## Quantization Format: Q1_0

Each weight is a single bit: `0` maps to `−scale`, `1` maps to `+scale`. Every group of 128 weights shares one FP16 scale factor.
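As a concrete illustration, the group scheme can be sketched in NumPy. The bit-packing order (MSB-first) and the scale rule (mean absolute weight per group, a common 1-bit choice) are assumptions for illustration only; the fork's actual Q1_0 byte layout may differ:

```python
import numpy as np

GROUP = 128  # weights sharing one FP16 scale

def quantize_group(weights: np.ndarray) -> tuple[bytes, float]:
    """1-bit quantize one group: keep only signs, share one scale."""
    scale = float(np.abs(weights).mean())       # assumed scale rule
    bits = (weights >= 0).astype(np.uint8)      # 1 -> +scale, 0 -> -scale
    return np.packbits(bits).tobytes(), scale   # 128 bits -> 16 bytes

def dequantize_group(packed: bytes, scale: float) -> np.ndarray:
    """Expand 16 packed bytes (128 sign bits) back to +/-scale values."""
    bits = np.unpackbits(np.frombuffer(packed, dtype=np.uint8))
    return np.where(bits == 1, scale, -scale).astype(np.float16)

w = np.random.default_rng(0).standard_normal(GROUP).astype(np.float32)
packed, s = quantize_group(w)
w_hat = dequantize_group(packed, s)
# Signs survive the round trip; magnitudes collapse to the shared scale
assert np.array_equal(np.sign(w_hat), np.where(w >= 0, 1.0, -1.0))
```

Storage per group is 16 bytes of signs plus 2 bytes of FP16 scale, i.e. 18 bytes for 128 weights.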
Parameter memory only (weights and scales loaded into memory):

| Format | Size | Reduction | Ratio |
| :----------------- | ----------: | --------: | --------: |
| FP16 | 16.38 GB | — | 1.0x |
| **GGUF Q1_0** | **1.15 GB** | **93.0%** | **14.2x** |
| MLX 1-bit g128 | 1.28 GB | 92.2% | 12.8x |

The GGUF file on disk is 1.16 GB (~6.6 MB larger) because the format embeds the tokenizer, chat template, and model metadata alongside the weights.
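The 14.2x figure follows directly from the bits-per-weight arithmetic: one sign bit per weight plus one FP16 scale per 128 weights, versus 16 bits per weight for FP16. A quick check:

```python
# Bytes per weight in each format
fp16 = 2.0                       # 16 bits
q1_0 = 1 / 8 + 2 / 128           # 1 sign bit + one FP16 scale shared by 128 weights

ratio = fp16 / q1_0              # ≈ 14.22x
params = 16.38e9 / fp16          # ~8.19B parameters implied by the 16.38 GB figure
deployed = params * q1_0 / 1e9   # ≈ 1.15 GB
```

The per-group scale overhead adds only 2/128 = 0.015625 bytes per weight, which is why the ratio lands just below the ideal 16x of pure 1-bit storage.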
### llama.cpp (CUDA)

```bash
# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build with CUDA enabled
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
```
### llama.cpp (Metal / macOS)

```bash
# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build (Metal is enabled by default on macOS)
cmake -B build && cmake --build build -j
```
```bash
# Run inference
./build/bin/llama-cli \
  -m Bonsai-8B.gguf \
  -p "Explain quantum computing in simple terms." \
  -n 256 \
  --temp 0.5
```
```bash
./build/bin/llama-server \
  -m Bonsai-8B.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99
```
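`llama-server` speaks llama.cpp's OpenAI-compatible HTTP API. Assuming the host and port flags above, a minimal standard-library client might look like this (the network call itself is left commented, since it needs the server running):

```python
import json
import urllib.request

# Chat-completions request for the llama-server OpenAI-compatible endpoint
payload = {
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "max_tokens": 256,
    "temperature": 0.5,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server from the snippet above running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```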