# Mamba-130M-HXQ

3.8x smaller. Pure SSM. Architecture proof.
Mamba-130M compressed from 489 MB (FP32) to 128 MB – a pure state-space model proving the codec works beyond transformers. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.
## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/mamba-130m-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/mamba-130m-helix")

inputs = tokenizer("The future of artificial intelligence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
That's it. `import helix_substrate` registers the quantizer; `from_pretrained()` handles the rest automatically.
## Benchmark

| | Dense (FP32) | HXQ |
|---|---|---|
| Size | 489 MB | 128 MB |
| Perplexity (WikiText-2) | 20.77 | 24.60 (+18.4%) |
| Compression ratio | – | 3.8x |
| Compressed modules | – | 96 HelixLinear + 1 nn.Linear (embedding) |
| Architecture | Mamba (24 layers, pure SSM) | unchanged |

Eval: WikiText-2 test split, 2048 tokens, stride 512.
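The eval line describes a sliding-window perplexity protocol. A minimal sketch of the window arithmetic (the function name `stride_windows` is illustrative, not part of helix-substrate): each 2048-token window scores only the tokens not covered by the previous window, so every token is scored exactly once with up to 1536 tokens of context.

```python
def stride_windows(n_tokens, max_len=2048, stride=512):
    """Yield (begin, end, n_targets) evaluation windows.

    Each window feeds up to max_len tokens to the model but scores only
    the n_targets tokens not covered by the previous window, so every
    token is scored exactly once.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

windows = list(stride_windows(5000))
total_scored = sum(t for _, _, t in windows)  # every token scored once
```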
## Good to Know

- +18.4% PPL delta – higher than the transformer models. Expected: Mamba-130M is tiny (24 layers, 768 hidden), so each weight carries more information per parameter. The companion Zamba2-1.2B (which includes Mamba2 layers) compresses at +2.90% – SSM architectures compress well at scale.
- GPU and CPU supported – runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- Fine-tunable via LoRA – compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- Requires `helix-substrate` – the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- `mamba-ssm` recommended – without it, generation falls back to a slower sequential code path.
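The frozen-weights-plus-LoRA setup can be pictured with a toy module. This is an illustrative stand-in, not the actual `HelixLinearSTE` API: the compressed weight lives in buffers (frozen by construction), and only the low-rank factors receive gradients.

```python
import torch
import torch.nn as nn

class FrozenVQLinearWithLoRA(nn.Module):
    """Toy stand-in for a VQ-compressed linear layer with a LoRA adapter.

    The compressed weight (codebook + uint8 indices) lives in buffers, so
    it is frozen by construction; only the low-rank A/B factors train.
    """
    def __init__(self, in_f, out_f, rank=8, n_codes=256):
        super().__init__()
        self.register_buffer("codebook", torch.randn(n_codes))
        self.register_buffer(
            "indices", torch.randint(0, n_codes, (out_f, in_f), dtype=torch.uint8)
        )
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))  # adapter starts at zero

    def forward(self, x):
        w = self.codebook[self.indices.long()]  # decode compressed weight
        return x @ w.T + (x @ self.lora_A.T) @ self.lora_B.T

layer = FrozenVQLinearWithLoRA(16, 32)
out = layer(torch.randn(4, 16))
out.sum().backward()  # gradients flow to LoRA factors only
```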
## Why This Model Exists

This is the architecture proof, not the fidelity champion. HelixCode compresses any `nn.Linear` – including the `in_proj`, `out_proj`, `x_proj`, and `dt_proj` layers inside Mamba's selective scan blocks. No architecture-specific tuning was needed.
## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a 256-entry codebook (float32) + a uint8 index matrix + optional sidecar corrections for outlier values
- The compressed form is the executable – `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- No calibration data required – unlike GPTQ/AWQ, codebooks are fit from the weights alone
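The "fit from the weights alone" point can be made concrete with a sketch of calibration-free codebook fitting: 1-D Lloyd (k-means) iterations on the weight values themselves. HelixCode's actual fitting procedure and its outlier sidecar are not documented here and will differ; this only shows why no activation data is needed, since the codebook depends solely on the weight matrix.

```python
import numpy as np

def fit_scalar_codebook(w, n_codes=256, iters=10):
    """Fit a 256-entry scalar codebook to a weight matrix via 1-D Lloyd
    iterations; returns (float32 codebook, uint8 index matrix)."""
    flat = w.ravel()
    # Initialise centroids at evenly spaced quantiles of the weight values.
    codebook = np.quantile(flat, np.linspace(0, 1, n_codes))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(1)
        for k in range(n_codes):
            sel = flat[idx == k]
            if sel.size:
                codebook[k] = sel.mean()  # Lloyd update: centroid = cluster mean
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(1)
    return codebook.astype(np.float32), idx.astype(np.uint8).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
codebook, indices = fit_scalar_codebook(w)
w_hat = codebook[indices]            # reconstruction used at inference time
err = np.abs(w - w_hat).max()
```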
## How It Works

1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 96 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form – no decompression needed
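Step 2 keys off a small block in the model's `config.json`. The `quant_method` field is the one named above; any further fields the real config carries are not shown here.

```json
{
  "quantization_config": {
    "quant_method": "hxq"
  }
}
```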
## Mamba-Specific Details

Mamba's architecture includes non-weight parameters that are stored at full precision:

- `A_log` – log-space diagonal state matrix (24 layers)
- `D` – skip connection parameter (24 layers)
- `dt_bias` – timestep bias (24 layers)
- `conv1d` – causal convolution (24 layers)

These are not `nn.Linear` and are not compressed. Only the projection matrices (`in_proj`, `out_proj`, `x_proj`, `dt_proj`) are VQ-compressed.
## Compression Receipt

```
Compressed tensors:  97
From original model: 145 (A_log, D, dt_bias, conv1d, norms)
Total keys:          573
Output size:         128 MB
HXQ ratio:           3.92x (weight bytes)
HelixLinear ratio:   5.61x (in-memory, includes format overhead reduction)
PPL delta:           +18.4% (24.60 vs 20.77 dense)
Eval:                WikiText-2 test, 2048 tokens, stride=512
```
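The 3.92x weight-byte ratio sits just under the 4x ceiling the format implies: 4 bytes per FP32 weight become 1 index byte, plus a small fixed codebook per matrix (the optional sidecar corrections likely account for most of the remaining gap). A quick sketch of the arithmetic, with an illustrative projection shape:

```python
def hxq_weight_bytes_ratio(rows, cols, n_codes=256):
    """FP32 bytes vs HXQ bytes for one weight matrix.

    HXQ stores one uint8 index per weight plus a 256-entry float32
    codebook; the optional outlier sidecar is ignored, so this is an
    upper bound on the per-matrix ratio.
    """
    fp32_bytes = rows * cols * 4                 # 4 bytes per FP32 weight
    hxq_bytes = n_codes * 4 + rows * cols * 1    # codebook + 1 index byte/weight
    return fp32_bytes / hxq_bytes

ratio = hxq_weight_bytes_ratio(768, 2048)  # one illustrative projection shape
```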
## Companion Models

Same codec, same `pip install`, multiple architectures:
| Model | Architecture | Ratio | PPL Delta |
|---|---|---|---|
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% |
## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```
## License

Apache 2.0 (inherited from state-spaces/mamba-130m-hf).