# Zamba2-1.2B-HXQ
1.7x smaller from BF16. HellaSwag 71.1%. Fits in 1.35 GB.
Zamba2-1.2B (hybrid Mamba2 + Transformer) compressed from 2.3 GB (BF16) to 1.35 GB. Downstream task scores match the dense model after 1.7x compression. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.
## Install and Run

```shell
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-1.2b-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
That's it. `import helix_substrate` registers the quantizer; `from_pretrained()` handles the rest automatically.
## Downstream Benchmarks
Evaluated with `lm-evaluation-harness` on an NVIDIA RTX 4090:
| Benchmark | HXQ (1.7x) | Dense (Zyphra reported) |
|---|---|---|
| HellaSwag (acc_norm) | 71.12% | ~69-72% |
| ARC-Easy (acc_norm) | 74.45% | – |
| ARC-Challenge (acc_norm) | 48.21% | – |
Task performance is preserved after 1.7x compression. These are real downstream scores, not PPL proxies.
## Compression Benchmark

| | Dense (BF16) | HXQ |
|---|---|---|
| Size | 2.3 GB | 1.35 GB |
| Perplexity (WikiText-2) | 5.458 | 5.617 (+2.90%) |
| Compression ratio | – | 1.7x |
| Compressed modules | – | 136 `HelixLinear` layers |
| Architecture | Zamba2 (Mamba2 + shared Transformer) | unchanged |
Eval: WikiText-2 test split, 2048 tokens, stride 512.
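The perplexity figures above come from strided 2048-token windows. A short sketch of the window arithmetic this implies (the harness's exact windowing may differ; function and variable names here are illustrative):

```python
# Sliding-window evaluation sketch: 2048-token context, stride 512.
# Strided windows tile a token stream so every token is scored exactly
# once, while most tokens keep a long left context.

def stride_windows(n_tokens, max_len=2048, stride=512):
    """Return (begin, end, n_scored) spans covering n_tokens."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        windows.append((begin, end, end - prev_end))  # score only new tokens
        prev_end = end
        if end == n_tokens:
            break
    return windows

wins = stride_windows(3000)
print(sum(n for _, _, n in wins))  # 3000 -- each token scored exactly once
```

Per-token losses from each window's newly scored span are averaged, then exponentiated to get the reported perplexity.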
## Good to Know

- **GPU and CPU supported** – runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- **Fine-tunable via LoRA** – compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- **Requires `helix-substrate`** – the quantizer is not built into `transformers`. You need `pip install "helix-substrate[hf]"`.
- **+2.90% PPL delta** – measurable but small. Whether this matters depends on your use case.
- **`mamba-ssm` recommended** – without it, the model falls back to a slower sequential code path.
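The frozen-base-plus-adapter setup behind the LoRA point can be sketched in plain numpy. This is a toy, not `HelixLinearSTE`'s actual implementation; all names and shapes are invented for the example:

```python
import numpy as np

# LoRA sketch: the compressed weight stays frozen and only a low-rank
# A @ B correction is trainable.

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2

W_frozen = rng.standard_normal((d_out, d_in))  # stand-in for the compressed layer
A = np.zeros((d_out, r))                       # zero init: adapter starts as a no-op
B = rng.standard_normal((r, d_in))

x = rng.standard_normal(d_in)
y = W_frozen @ x + A @ B @ x                   # base path + trainable low-rank path
print(np.allclose(y, W_frozen @ x))            # True while the adapter is zero
```

Gradients flow only into `A` and `B`, so training never needs to rewrite the compressed codebook or indices.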
## What is HelixCode?
HelixCode is a universal weight compression codec based on vector quantization:
- Each weight matrix is replaced by a 256-entry codebook (float32) + a uint8 index matrix + optional sidecar corrections for outlier values
- The compressed form is the executable – `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- No calibration data required – unlike GPTQ/AWQ, codebooks are fit from the weights alone
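The storage layout above can be illustrated with a numpy toy, treating the codebook as 256 scalar entries so that `codebook[indices] @ x` works directly. Shapes and variable names are invented for the sketch; this is not the library's API:

```python
import numpy as np

# HelixCode-style layout: 256-entry float32 codebook + uint8 index matrix.
rng = np.random.default_rng(0)
out_f, in_f = 64, 32

codebook = rng.standard_normal(256).astype(np.float32)              # 256 entries
indices = rng.integers(0, 256, size=(out_f, in_f), dtype=np.uint8)  # 1 byte/weight
x = rng.standard_normal(in_f).astype(np.float32)

# The compressed form is the executable: gather, then matmul -- no
# dense decompressed copy is kept around.
y = codebook[indices] @ x
print(y.shape)  # (64,)

# Rough storage accounting: an 8-bit index per weight vs. 16-bit bf16,
# before sidecars and uncompressed tensors pull the whole-model ratio
# down to the reported 1.7x.
dense_bytes = out_f * in_f * 2
hxq_bytes = out_f * in_f * 1 + codebook.nbytes
print(dense_bytes / hxq_bytes)
```

The codebook overhead is fixed per matrix, so the per-weight cost approaches 8 bits as matrices grow.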
## How It Works
1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 136 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form – no decompression needed
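Step 3 above can be sketched as a recursive module-tree walk. These classes are stand-ins, not `helix-substrate`'s actual `HelixLinear` implementation:

```python
# Toy version of the pre-load swap: replace Linear leaves with empty
# compressed shells *before* weights load, so the checkpoint's codebook
# and index tensors land directly in the shells' buffers.

class Linear:               # stand-in for nn.Linear
    pass

class HelixLinearShell:     # stand-in exposing empty compressed buffers
    def __init__(self):
        self.codebook, self.indices = None, None

def replace_linears(tree):
    """Recursively replace Linear leaves; return how many were swapped."""
    swapped = 0
    for name, child in tree.items():
        if isinstance(child, Linear):
            tree[name] = HelixLinearShell()
            swapped += 1
        elif isinstance(child, dict):
            swapped += replace_linears(child)
    return swapped

model = {"in_proj": Linear(), "out_proj": Linear(),
         "block": {"q": Linear(), "norm": object()}}
print(replace_linears(model))  # 3
```

Non-linear leaves (the `norm` above, or embeddings and Mamba state parameters in the real model) are left untouched and load at full precision.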
## Architecture Details
Zamba2-1.2B is a hybrid architecture with:
- 32 Mamba2 layers (SSM blocks with `in_proj` + `out_proj` linear layers)
- 6 hybrid layers (Mamba2 + shared Transformer decoder with attention + MLP)
- 1 shared Transformer block (reused at layers 5, 11, 17, 23, 29, 35)
- 38 total layers, `hidden_size=2048`

All 136 linear layers (Mamba projections, attention Q/K/V/O, MLP gate/up/down, adapter layers) are compressed. Normalization layers, embeddings, and Mamba-specific parameters (`A_log`, `D`, `dt_bias`, `conv1d`) are stored at full precision.
## Compression Receipt

```text
Compressed tensors:  136
Exact tensors:       156 (norms, embeddings, biases)
From original model: 114 (A_log, D, dt_bias, conv1d)
Total keys:          814
Output size:         1,350 MB
Compression ratio:   1.7x
PPL delta:           +2.90% (5.617 vs 5.458 dense)
Eval:                WikiText-2 test, 2048 tokens, stride=512
```
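The key counts in the receipt are internally consistent if each compressed module contributes four safetensors keys (codebook, indices, and two sidecar tensors is one plausible split; that per-module layout is inferred from the totals, not from the format spec):

```python
# Cross-check on the receipt: 136 compressed modules at 4 keys each,
# plus the 156 exact tensors and 114 tensors copied from the original.
compressed_keys = 136 * 4            # 544
total_keys = compressed_keys + 156 + 114
print(total_keys)  # 814, matching "Total keys"
```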
## Companion Models

Same codec, same `pip install`, multiple architectures:
| Model | Architecture | Ratio | PPL Delta |
|---|---|---|---|
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% |
## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```
## License
Apache 2.0 (inherited from Zyphra/Zamba2-1.2B).