Mamba-130M-HXQ

3.8x smaller. Pure SSM. Architecture proof.

Mamba-130M compressed from 489 MB (FP32) to 128 MB — a pure state-space model proving the codec works beyond transformers. No calibration data. No architecture-specific tuning. Just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/mamba-130m-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/mamba-130m-helix")

inputs = tokenizer("The future of artificial intelligence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. import helix_substrate registers the quantizer. from_pretrained() handles the rest automatically.

Benchmark

                         Dense (FP32)                  HXQ
Size                     489 MB                        128 MB
Perplexity (WikiText-2)  20.77                         24.60 (+18.4%)
Compression ratio        —                             3.8x
Compressed modules       —                             96 HelixLinear + 1 nn.Linear (embedding)
Architecture             Mamba (24 layers, pure SSM)   unchanged

Eval: WikiText-2 test split, 2048-token window, stride 512.
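The strided protocol can be sketched as window arithmetic (a minimal illustration of the window/stride bookkeeping; ppl_windows is a hypothetical helper, not the actual eval script):

```python
def ppl_windows(n_tokens, window=2048, stride=512):
    """Yield (start, end, n_scored) spans for strided perplexity eval.

    Each window covers up to `window` tokens of context; only the tokens
    past the previous window's end are scored, so every token in the
    corpus is scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = ppl_windows(5000)  # e.g. a 5000-token corpus
```

Each span's loss would be computed with the first `end - begin - n_scored` labels masked out, then aggregated into one perplexity.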

Good to Know

  • +18.4% PPL delta — higher than the transformer models. Expected: Mamba-130M is tiny (24 layers, 768 hidden), so each weight carries more information per parameter. The companion Zamba2-1.2B (which includes Mamba2 layers) compresses at +2.90% — SSM architectures compress well at scale.
  • GPU and CPU supported — runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
  • Fine-tunable via LoRA — compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
  • Requires helix-substrate — the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • mamba-ssm recommended — without it, inference falls back to a slower sequential code path.

Why This Model Exists

This is the architecture proof, not the fidelity champion. HelixCode compresses any nn.Linear — including the in_proj, out_proj, x_proj, and dt_proj layers inside Mamba's selective-scan blocks. No architecture-specific tuning was needed.

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
  • The compressed form is the executable — HelixLinear performs codebook[indices] @ x directly, with no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP, CNN)
  • No calibration data required — unlike GPTQ/AWQ, codebooks are fit from the weights alone
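A minimal NumPy sketch of the lookup-then-matmul idea described above (names and shapes are illustrative, and the codebook here holds scalars for simplicity; this is not the helix-substrate API):

```python
import numpy as np

# Illustrative only: a scalar-per-entry codebook and a small weight shape.
rng = np.random.default_rng(0)
codebook = rng.standard_normal(256).astype(np.float32)         # 256 float32 entries
indices = rng.integers(0, 256, size=(8, 16), dtype=np.uint8)   # (out, in) uint8

def vq_linear(x):
    # codebook[indices] reconstructs the weight on the fly; the FP32
    # matrix is never stored, only looked up at matmul time.
    w = codebook[indices]                                      # (8, 16)
    return x @ w.T

x = rng.standard_normal((1, 16)).astype(np.float32)
y = vq_linear(x)   # same result as a dense layer whose weight is codebook[indices]
```

Storage drops from 4 bytes per weight (FP32) to 1 byte per weight (uint8 index) plus one shared codebook; in the real codec, sidecar corrections additionally patch outlier values.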

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 96 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form — no decompression needed
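Step 3 can be sketched as a recursive module swap (Shell and its buffers are stand-in names under assumed shapes, not the actual HelixLinear implementation):

```python
import torch
import torch.nn as nn

# Illustrative shell: right-shaped empty buffers that a checkpoint
# loader (e.g. safetensors) would populate afterwards.
class Shell(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.register_buffer(
            "indices",
            torch.zeros(linear.out_features, linear.in_features, dtype=torch.uint8),
        )
        self.register_buffer("codebook", torch.zeros(256))

    def forward(self, x):
        # Lookup-then-matmul, as in the compressed form described above.
        return x @ self.codebook[self.indices.long()].t()

def replace_linears(module: nn.Module):
    # Walk the module tree; swap every nn.Linear for a Shell in place.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, Shell(child))
        else:
            replace_linears(child)

model = nn.Sequential(nn.Linear(4, 4), nn.Sequential(nn.Linear(4, 2)))
replace_linears(model)  # both Linears are now Shells awaiting weight loading
```

Because the swap happens before weight loading, the FP32 matrices are never materialized in memory.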

Mamba-Specific Details

Mamba's architecture includes parameters that live outside nn.Linear modules; these are stored at full precision:

  • A_log β€” log-space diagonal state matrix (24 layers)
  • D β€” skip connection parameter (24 layers)
  • dt_bias β€” timestep bias (24 layers)
  • conv1d β€” causal convolution (24 layers)

These are not nn.Linear and are not compressed. Only the projection matrices (in_proj, out_proj, x_proj, dt_proj) are VQ-compressed.
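This split implies the 96-module count in the benchmark table: 24 layers × 4 projections. A sketch of the filter, with parameter-name patterns assumed to follow the HF Mamba layout (illustrative, not the exact state-dict keys):

```python
# Hypothetical name patterns; the actual checkpoint keys may differ.
PROJECTIONS = ("in_proj", "out_proj", "x_proj", "dt_proj")
FULL_PRECISION = ("A_log", "D", "dt_bias", "conv1d")

compressed = [
    f"backbone.layers.{layer}.mixer.{proj}.weight"
    for layer in range(24)          # 24 Mamba layers
    for proj in PROJECTIONS         # 4 projection matrices per layer
]

def is_compressed(name: str) -> bool:
    # Only projection weights are VQ-compressed; SSM state parameters
    # (A_log, D, dt_bias) and the conv1d stay at full precision.
    return any(f".{p}.weight" in name for p in PROJECTIONS)
```

24 × 4 = 96 names, matching the 96 HelixLinear modules reported above.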

Compression Receipt

Compressed tensors:  97
Kept from original:  145  (A_log, D, dt_bias, conv1d, norms)
Total keys:          573
Output size:         128 MB
HXQ ratio:           3.92x (weight bytes)
HelixLinear ratio:   5.61x (in-memory, includes format overhead reduction)
PPL delta:           +18.4% (24.60 vs 20.77 dense)
Eval:                WikiText-2 test, 2048 tokens, stride=512
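A back-of-envelope for the weight-bytes line, under assumed shapes (sidecar corrections are not modeled here, which is roughly where the measured 3.92x falls below the uint8 ideal of 4x):

```python
# Illustrative shapes; per-matrix overhead is one 256-entry float32 codebook.
out_f, in_f = 1536, 768                  # an in_proj-sized matrix (assumed)
dense_bytes = out_f * in_f * 4           # FP32: 4 bytes per weight
hxq_bytes = out_f * in_f * 1 + 256 * 4   # uint8 index per weight + codebook
ratio = dense_bytes / hxq_bytes          # just under the ideal 4x
```

The codebook overhead is negligible at these sizes; sidecar corrections for outliers account for most of the remaining gap.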

Companion Models

Same codec, same pip install, multiple architectures:

Model                               Architecture                  Ratio   PPL Delta
qwen2.5-14b-instruct-helix          Transformer                   3.4x    pending
qwen2.5-7b-instruct-helix           Transformer                   2.2x    +6.34%
qwen2.5-3b-instruct-helix           Transformer                   1.6x    +0.69%
qwen2.5-coder-3b-helix              Transformer (code)            1.6x    +1.92%
qwen2.5-coder-1.5b-instruct-helix   Transformer (code)            2.4x    +1.63%
tinyllama-1.1b-helix                Transformer                   4.0x    +0.78%
zamba2-2.7b-instruct-helix          Hybrid (Mamba2+Transformer)   1.8x    +6.59%
zamba2-1.2b-helix                   Hybrid (Mamba2+Transformer)   1.7x    +2.90%
mamba2-1.3b-helix                   Pure SSM (Mamba2)             2.1x    +8.0%

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from state-spaces/mamba-130m-hf).
