Zamba2-1.2B-HXQ

1.7x smaller than BF16. HellaSwag 71.1%. Fits in 1.35 GB.

Zamba2-1.2B (hybrid Mamba2 + Transformer) compressed from 2.3 GB (BF16) to 1.35 GB. Downstream task scores match the dense model after 1.7x compression. No calibration data. No architecture-specific tuning. Just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-1.2b-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. import helix_substrate registers the quantizer. from_pretrained() handles the rest automatically.

Downstream Benchmarks

Evaluated with lm-evaluation-harness on an NVIDIA RTX 4090:

| Benchmark | HXQ (1.7x) | Dense (Zyphra reported) |
|---|---|---|
| HellaSwag (acc_norm) | 71.12% | ~69-72% |
| ARC-Easy (acc_norm) | 74.45% | — |
| ARC-Challenge (acc_norm) | 48.21% | — |

Task performance is preserved after 1.7x compression. These are real downstream scores, not PPL proxies.

Compression Benchmark

| | Dense (BF16) | HXQ |
|---|---|---|
| Size | 2.3 GB | 1.35 GB |
| Perplexity (WikiText-2) | 5.458 | 5.617 (+2.90%) |
| Compression ratio | — | 1.7x |
| Compressed modules | — | 136 HelixLinear layers |
| Architecture | Zamba2 (Mamba2 + shared Transformer) | unchanged |

Eval: WikiText-2 test split, 2048 tokens, stride 512.
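The strided evaluation above follows the standard sliding-window perplexity setup: each 2048-token window advances by 512 tokens, and only tokens not covered by a previous window are scored, so every token is scored exactly once. A minimal sketch of the window bookkeeping (the function name is illustrative, not part of helix-substrate):

```python
def stride_windows(n_tokens, window=2048, stride=512):
    """Yield (start, end, n_scored) spans for sliding-window perplexity.

    Each window covers up to `window` tokens and advances by `stride`;
    only the tokens not already scored by an earlier window count toward
    the loss, so the remainder of each window serves purely as context.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = stride_windows(5000)
```

Averaging the negative log-likelihood over only the scored tokens of each window gives the reported perplexity; with stride < window, each scored token sees up to window − stride tokens of preceding context.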

Good to Know

  • GPU and CPU supported: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
  • Fine-tunable via LoRA: compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
  • Requires helix-substrate: the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • +2.90% PPL delta: measurable but small. Whether this matters depends on your use case.
  • mamba-ssm recommended: without it, inference falls back to a slower sequential code path.

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32), a uint8 index matrix, and optional sidecar corrections for outlier values
  • The compressed form is the executable: HelixLinear performs codebook[indices] @ x directly, with no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP, CNN)
  • No calibration data required: unlike GPTQ/AWQ, codebooks are fit from the weights alone
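A minimal numpy sketch of the codebook-plus-indices representation described above. The codebook-fitting rule (quantiles of the weight distribution) and the shapes here are illustrative simplifications, not HelixCode's actual fitting procedure or layout, and the sidecar corrections are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense weight to compress: (out_features, in_features)
W = rng.standard_normal((64, 32)).astype(np.float32)

# Fit a 256-entry codebook from the weights alone (no calibration data):
# here, simply 256 quantiles of the flattened weight distribution.
codebook = np.quantile(W, np.linspace(0, 1, 256)).astype(np.float32)

# uint8 index matrix: each weight maps to its nearest codebook entry.
indices = np.abs(W[..., None] - codebook).argmin(axis=-1).astype(np.uint8)

# The compressed form is executable: the matmul reconstructs rows from
# the codebook on the fly; no dense weight tensor is ever stored.
def helix_linear(x, codebook, indices):
    return x @ codebook[indices].T

x = rng.standard_normal((4, 32)).astype(np.float32)
y = helix_linear(x, codebook, indices)
```

Storage-wise, the indices cost 1 byte per weight versus 2 bytes for BF16, plus a negligible shared codebook, which is why ratios land near 2x before sidecar corrections and uncompressed tensors pull the whole-model figure down to 1.7x.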

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 136 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form; no decompression step is needed
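Schematically, steps 1-3 amount to a registry lookup keyed on quant_method. This toy sketch shows the shape of that dispatch; all names and the dict-based "model" are invented for illustration and are not the actual transformers or helix-substrate internals:

```python
# Toy quantizer registry keyed on quant_method, mimicking steps 1-3.
QUANTIZERS = {}

def register_quantizer(name):
    def wrap(cls):
        QUANTIZERS[name] = cls
        return cls
    return wrap

@register_quantizer("hxq")  # what `import helix_substrate` does, conceptually
class HxqQuantizer:
    def replace_linears(self, model):
        # Swap each eligible Linear for an empty HelixLinear "shell";
        # safetensors then fills codebook/index buffers in place (step 4).
        model["layers"] = ["HelixLinear(shell)" for _ in model["layers"]]
        return model

def load_pretrained(config, model):
    # Step 2: read quant_method from the quantization_config, if present.
    method = config.get("quantization_config", {}).get("quant_method")
    if method is not None:
        model = QUANTIZERS[method]().replace_linears(model)
    return model

config = {"quantization_config": {"quant_method": "hxq"}}
model = load_pretrained(config, {"layers": ["Linear", "Linear"]})
```

The key property of this flow is that module replacement happens before weight loading, so the checkpoint's compressed buffers stream directly into the shells and the dense weights never need to exist in memory.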

Architecture Details

Zamba2-1.2B is a hybrid architecture with:

  • 32 Mamba2 layers (SSM blocks with in_proj + out_proj linear layers)
  • 6 hybrid layers (Mamba2 + shared Transformer decoder with attention + MLP)
  • 1 shared Transformer block (reused at layers 5, 11, 17, 23, 29, 35)
  • 38 total layers, hidden_size=2048

All 136 linear layers (Mamba projections, attention Q/K/V/O, MLP gate/up/down, adapter layers) are compressed. Normalization layers, embeddings, and Mamba-specific parameters (A_log, D, dt_bias, conv1d) are stored at full precision.

Compression Receipt

Compressed tensors:  136
Exact tensors:       156  (norms, embeddings, biases)
From original model: 114  (A_log, D, dt_bias, conv1d)
Total keys:          814
Output size:         1,350 MB
Compression ratio:   1.7x
PPL delta:           +2.90% (5.617 vs 5.458 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
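The receipt's headline numbers follow directly from the measurements above; a quick check of the arithmetic:

```python
# Compression ratio: BF16 checkpoint size vs HXQ output size.
dense_mb, hxq_mb = 2300, 1350
ratio = dense_mb / hxq_mb  # ~1.70x

# Perplexity delta on WikiText-2 (the card rounds this to +2.90%).
ppl_dense, ppl_hxq = 5.458, 5.617
delta_pct = 100 * (ppl_hxq - ppl_dense) / ppl_dense  # ~+2.9%
```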

Companion Models

Same codec, same pip install, multiple architectures:

| Model | Architecture | Ratio | PPL Delta |
|---|---|---|---|
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% |

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from Zyphra/Zamba2-1.2B).
