End-to-end 1-bit language model for llama.cpp (CUDA, Metal, CPU)

- **1.15 GB** parameter memory (down from 16.38 GB FP16) — fits on virtually any device with a GPU
- **End-to-end 1-bit weights** across embeddings, attention projections, MLP projections, and LM head
- **GGUF Q1_0 (g128)** format with inline dequantization kernels — no FP16 materialization
- **Cross-platform**: CUDA (RTX/datacenter), Metal (Mac), Android, CPU
- **Competitive benchmarks**: 70.5 avg score across 6 categories, matching full-precision 8B models at 1/14th the size
- **MLX companion**: also available as [MLX 1-bit g128](https://huggingface.co/prism-ml/Bonsai-8B-mlx-1bit) for native Apple Silicon inference
| Layers | 36 Transformer decoder blocks |
| Context length | 65,536 tokens |
| Vocab size | 151,936 |
| Weight format | GGUF Q1_0 |
| Deployed size | **1.15 GB** (14.2x smaller than FP16) |
| 1-bit coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |

## Quantization Format: Q1_0

Each weight is a single bit: `0` maps to `−scale`, `1` maps to `+scale`. Every group of 128 weights shares one FP16 scale factor.
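As a concrete illustration, the group scheme can be sketched in NumPy. The bit-packing order (MSB-first) and the scale rule (mean absolute weight per group, a common 1-bit choice) are assumptions for illustration only; the fork's actual Q1_0 byte layout may differ:

```python
import numpy as np

GROUP = 128  # weights sharing one FP16 scale

def quantize_group(weights: np.ndarray) -> tuple[bytes, float]:
    """1-bit quantize one group: keep only signs, share one scale."""
    scale = float(np.abs(weights).mean())       # assumed scale rule
    bits = (weights >= 0).astype(np.uint8)      # 1 -> +scale, 0 -> -scale
    return np.packbits(bits).tobytes(), scale   # 128 bits -> 16 bytes

def dequantize_group(packed: bytes, scale: float) -> np.ndarray:
    """Expand 16 packed bytes (128 sign bits) back to +/-scale values."""
    bits = np.unpackbits(np.frombuffer(packed, dtype=np.uint8))
    return np.where(bits == 1, scale, -scale).astype(np.float16)

w = np.random.default_rng(0).standard_normal(GROUP).astype(np.float32)
packed, s = quantize_group(w)
w_hat = dequantize_group(packed, s)
# Signs survive the round trip; magnitudes collapse to the shared scale
assert np.array_equal(np.sign(w_hat), np.where(w >= 0, 1.0, -1.0))
```

Storage per group is 16 bytes of signs plus 2 bytes of FP16 scale, i.e. 18 bytes for 128 weights.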
Parameter memory only (weights and scales loaded into memory):

| Format | Size | Reduction | Ratio |
| :----------------- | ----------: | --------: | --------: |
| FP16 | 16.38 GB | — | 1.0x |
| **GGUF Q1_0** | **1.15 GB** | **93.0%** | **14.2x** |
| MLX 1-bit g128 | 1.28 GB | 92.2% | 12.8x |

The GGUF file on disk is 1.16 GB (~6.6 MB larger) because the format embeds the tokenizer, chat template, and model metadata alongside the weights.
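The 14.2x figure follows directly from the bits-per-weight arithmetic: one sign bit per weight plus one FP16 scale per 128 weights, versus 16 bits per weight for FP16. A quick check:

```python
# Bytes per weight in each format
fp16 = 2.0                       # 16 bits
q1_0 = 1 / 8 + 2 / 128           # 1 sign bit + one FP16 scale shared by 128 weights

ratio = fp16 / q1_0              # ≈ 14.22x
params = 16.38e9 / fp16          # ~8.19B parameters implied by the 16.38 GB figure
deployed = params * q1_0 / 1e9   # ≈ 1.15 GB
```

The per-group scale overhead adds only 2/128 = 0.015625 bytes per weight, which is why the ratio lands just below the ideal 16x of pure 1-bit storage.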
### llama.cpp (CUDA)

```bash
# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build with CUDA enabled
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
```
### llama.cpp (Metal / macOS)

```bash
# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build (Metal is enabled by default on macOS)
cmake -B build && cmake --build build -j
```
```bash
# Run inference
./build/bin/llama-cli \
  -m Bonsai-8B.gguf \
  -p "Explain quantum computing in simple terms." \
  -n 256 \
  --temp 0.5
```
```bash
./build/bin/llama-server \
  -m Bonsai-8B.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99
```
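`llama-server` speaks llama.cpp's OpenAI-compatible HTTP API. Assuming the host and port flags above, a minimal standard-library client might look like this (the network call itself is left commented, since it needs the server running):

```python
import json
import urllib.request

# Chat-completions request for the llama-server OpenAI-compatible endpoint
payload = {
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "max_tokens": 256,
    "temperature": 0.5,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server from the snippet above running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```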