pashak committed on
Commit d21c85a · verified · 1 Parent(s): 686fe46

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -35,7 +35,7 @@ End-to-end 1-bit language model for llama.cpp (CUDA, Metal, CPU)
 
 - **1.15 GB** parameter memory (down from 16.38 GB FP16) — fits on virtually any device with a GPU
 - **End-to-end 1-bit weights** across embeddings, attention projections, MLP projections, and LM head
-- **GGUF Q1_0_g128** format with inline dequantization kernels — no FP16 materialization
+- **GGUF Q1_0 (g128)** format with inline dequantization kernels — no FP16 materialization
 - **Cross-platform**: CUDA (RTX/datacenter), Metal (Mac), Android, CPU
 - **Competitive benchmarks**: 70.5 avg score across 6 categories, matching full-precision 8B models at 1/14th the size
 - **MLX companion**: also available as [MLX 1-bit g128](https://huggingface.co/prism-ml/Bonsai-8B-mlx-1bit) for native Apple Silicon inference
@@ -63,13 +63,13 @@ End-to-end 1-bit language model for llama.cpp (CUDA, Metal, CPU)
 | Layers | 36 Transformer decoder blocks |
 | Context length | 65,536 tokens |
 | Vocab size | 151,936 |
-| Weight format | GGUF Q1_0_g128 |
+| Weight format | GGUF Q1_0 |
 | Deployed size | **1.15 GB** (14.2x smaller than FP16) |
 | 1-bit coverage | Embeddings, attention projections, MLP projections, LM head |
 | License | Apache 2.0 |
 
 
-## Quantization Format: Q1_0_g128
+## Quantization Format: Q1_0
 
 Each weight is a single bit: `0` maps to `−scale`, `1` maps to `+scale`. Every group of 128 weights shares one FP16 scale factor.
 
@@ -82,7 +82,7 @@ Parameter memory only (weights and scales loaded into memory):
 | Format | Size | Reduction | Ratio |
 | :----------------- | ----------: | --------: | --------: |
 | FP16 | 16.38 GB | — | 1.0x |
-| **GGUF Q1_0_g128** | **1.15 GB** | **93.0%** | **14.2x** |
+| **GGUF Q1_0** | **1.15 GB** | **93.0%** | **14.2x** |
 | MLX 1-bit g128 | 1.28 GB | 92.2% | 12.8x |
 
 The GGUF file on disk is 1.16 GB (~6.6 MB larger) because the format embeds the tokenizer, chat template, and model metadata alongside the weights.
@@ -115,7 +115,7 @@ You are a helpful assistant
 ### llama.cpp (CUDA)
 
 ```bash
-# Clone the PrismML fork of llama.cpp (includes Q1_0_g128 kernels)
+# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
 git clone https://github.com/PrismML-Eng/llama.cpp
 cd llama.cpp
 
@@ -136,7 +136,7 @@ cmake -B build -DGGML_CUDA=ON && cmake --build build -j
 ### llama.cpp (Metal / macOS)
 
 ```bash
-# Clone the PrismML fork of llama.cpp (includes Q1_0_g128 kernels)
+# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
 git clone https://github.com/PrismML-Eng/llama.cpp
 cd llama.cpp
 
@@ -145,7 +145,7 @@ cmake -B build && cmake --build build -j
 
 # Run inference
 ./build/bin/llama-cli \
--m Bonsai-8B-Q1_0_g128.gguf \
+-m Bonsai-8B.gguf \
 -p "Explain quantum computing in simple terms." \
 -n 256 \
 --temp 0.5 \
@@ -158,7 +158,7 @@ cmake -B build && cmake --build build -j
 
 ```bash
 ./build/bin/llama-server \
--m Bonsai-8B-Q1_0_g128.gguf \
+-m Bonsai-8B.gguf \
 --host 0.0.0.0 \
 --port 8080 \
 -ngl 99
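
The Q1_0 mapping the README describes (one sign bit per weight, one shared FP16 scale per 128-weight group, `0` → `−scale`, `1` → `+scale`) can be sketched in a few lines. This is a minimal illustration, not the actual GGUF kernel code: the README does not say how the scale is chosen, so the mean-absolute-value scale below is an assumption, and the bit ordering within each byte is likewise hypothetical.

```python
# Hypothetical sketch of Q1_0 (g128) packing/unpacking.
# Assumptions (not specified in the README): scale = mean(|w|) per group,
# and bits are packed LSB-first within each byte.

GROUP = 128  # weights sharing one FP16 scale

def quantize_q1_0(weights):
    """Pack floats (len divisible by GROUP) into (packed bits, per-group scales)."""
    bits, scales = bytearray(), []
    for g in range(0, len(weights), GROUP):
        group = weights[g:g + GROUP]
        scale = sum(abs(w) for w in group) / GROUP  # assumed scale choice
        scales.append(scale)
        for start in range(0, GROUP, 8):
            b = 0
            for i, w in enumerate(group[start:start + 8]):
                if w >= 0:          # bit 1 -> +scale, bit 0 -> -scale
                    b |= 1 << i
            bits.append(b)
    return bytes(bits), scales

def dequantize_q1_0(bits, scales):
    """Expand packed bits back to +/-scale floats."""
    out = []
    step = GROUP // 8  # bytes per group
    for g, scale in enumerate(scales):
        for byte in bits[g * step:(g + 1) * step]:
            for i in range(8):
                out.append(scale if (byte >> i) & 1 else -scale)
    return out

# Round-trip a toy 128-weight group whose values are exactly +/-0.5.
w = [0.5 if (i * 7) % 3 else -0.5 for i in range(128)]
bits, scales = quantize_q1_0(w)
deq = dequantize_q1_0(bits, scales)
```

Per group, storage is 16 bytes of sign bits plus one 2-byte scale, i.e. 1 + 16/128 = 1.125 bits per weight — which is why the dequantization can happen inline in the matmul kernel with no FP16 copy of the tensor.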
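
The numbers in the size-comparison table follow directly from that 1.125 bits/weight cost. A back-of-the-envelope check, assuming the parameter count implied by the 16.38 GB FP16 figure (≈8.19B weights at 2 bytes each):

```python
# Sanity-check the README's table: 1 sign bit per weight plus one
# FP16 (2-byte) scale per 128-weight group.
FP16_BYTES = 16.38e9
params = FP16_BYTES / 2            # ~8.19e9 weights (assumed from the table)
bit_bytes = params / 8             # 1 bit per weight
scale_bytes = params / 128 * 2     # one FP16 scale per group of 128
total = bit_bytes + scale_bytes    # ~1.15e9 bytes
ratio = FP16_BYTES / total         # ~14.2x
reduction = (1 - total / FP16_BYTES) * 100  # ~93.0%
```

This reproduces the table's 1.15 GB, 14.2x, and 93.0% figures; the remaining ~6.6 MB of on-disk size is the tokenizer and metadata the GGUF container embeds.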