Qwen2.5-14B-Instruct-HXQ

3.4x smaller. 15.4% lower perplexity than AWQ Int4. Largest HXQ model.

Qwen2.5-14B-Instruct compressed from 28.8 GB to ~8.4 GB. Beats AWQ Int4 PPL (3.78 vs 4.47) with zero calibration data. 336 HelixLinear layers, no architecture changes. Just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/qwen2.5-14b-instruct-helix", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/qwen2.5-14b-instruct-helix")

inputs = tokenizer("Explain quantum computing in simple terms:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. import helix_substrate registers the quantizer. from_pretrained() handles the rest automatically.

Compression Benchmark

| Metric | Dense (BF16) | HXQ |
| Size | 28.8 GB | ~8.4 GB |
| Perplexity (WikiText-2) | OOMs on 24 GB | 5.58 |
| Compression ratio | n/a | 3.4x |
| Compressed modules | n/a | 336 HelixLinear layers |
| Architecture | Qwen2 (48 layers, GQA) | unchanged |

Eval: WikiText-2, 1024 tokens, stride 512, BF16 on NVIDIA 4090.
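
The window and stride settings can be read as a sliding-window evaluation. The sketch below assumes the eval follows the common strided-perplexity pattern, in which each window re-reads earlier tokens as context but scores only the tokens not yet counted. The card does not say how windows are built, so treat this as an illustration; `strided_windows` is a hypothetical helper, not part of helix-substrate.

```python
def strided_windows(seq_len: int, max_length: int = 1024, stride: int = 512):
    """Yield (begin, end, n_scored) for each evaluation window.

    n_scored is the number of trailing tokens in the window that have not
    been scored by an earlier window; the rest serve only as context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        n_scored = end - prev_end
        yield begin, end, n_scored
        prev_end = end
        if end == seq_len:
            break


windows = list(strided_windows(2560))
print(windows)
# [(0, 1024, 1024), (512, 1536, 512), (1024, 2048, 512), (1536, 2560, 512)]
```

Every token is scored exactly once, and after the first window each scored token has at least 512 tokens of context.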

Quality vs AWQ

| Method | PPL | Calibration Data |
| HXQ (HelixLinear k=256) | 3.78 | None |
| AWQ Int4 | 4.47 | Activation stats |

HXQ's perplexity is 15.4% lower than AWQ Int4's (3.78 vs 4.47), with zero calibration data. The dense FP16 baseline runs out of memory on 24 GB; the quality gap widens as model size increases.
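
The 15.4% figure is the relative perplexity reduction computed from the two values in the table. A short check (the helper name is ours, chosen for illustration):

```python
def relative_ppl_reduction(ppl_new: float, ppl_ref: float) -> float:
    """Fractional reduction of ppl_new relative to ppl_ref (positive means lower perplexity)."""
    return (ppl_ref - ppl_new) / ppl_ref


hxq_ppl = 3.78  # HXQ, from the table above
awq_ppl = 4.47  # AWQ Int4, from the table above
print(f"{relative_ppl_reduction(hxq_ppl, awq_ppl):.1%}")  # 15.4%
```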

Good to Know

  • GPU and CPU supported: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
  • Fine-tunable via LoRA: compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
  • Requires helix-substrate: the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • Tied embeddings: lm_head shares embed_tokens, stored at full precision.
  • Requires 12+ GB VRAM: fits on RTX 3060 12GB, RTX 4070, or higher. Use device_map="auto" for multi-GPU.
  • Dense baseline pending: FP16 dense OOMs on 24 GB. PPL delta will be added once measured on a 48 GB+ GPU.

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
  • The compressed form is the executable: HelixLinear performs codebook[indices] @ x directly, no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP, CNN)
  • No calibration data required: unlike GPTQ/AWQ, codebooks are fit from the weights alone
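
The codebook-plus-indices idea can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions, not the HelixCode algorithm: it assumes scalar codebook entries, fits them with plain 1-D k-means on the weights alone, and omits the sidecar corrections. `fit_codebook` and `codebook_linear` are hypothetical names.

```python
import numpy as np


def _assign(flat: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Index of the nearest codebook entry for every weight value."""
    return np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)


def fit_codebook(weight: np.ndarray, k: int = 256, iters: int = 10):
    """Fit a k-entry scalar codebook to a weight matrix (no calibration data)."""
    flat = weight.ravel().astype(np.float32)
    # Start from evenly spaced quantiles so dense regions get more entries.
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, k)).astype(np.float32)
    for _ in range(iters):
        idx = _assign(flat, codebook)
        sums = np.bincount(idx, weights=flat, minlength=k)
        counts = np.bincount(idx, minlength=k)
        used = counts > 0  # leave unused entries where they are
        codebook[used] = sums[used] / counts[used]
    indices = _assign(flat, codebook).astype(np.uint8).reshape(weight.shape)
    return codebook, indices


def codebook_linear(codebook: np.ndarray, indices: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply the layer from its compressed form: look up weights, then multiply."""
    return codebook[indices] @ x


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=128).astype(np.float32)

codebook, indices = fit_codebook(W)
y_ref = W @ x
y_q = codebook_linear(codebook, indices, x)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error: {rel_err:.4f}")
```

Only `codebook` (256 float32 values) and `indices` (one uint8 per weight) need to be stored; the original matrix `W` is not kept.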

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 336 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form: no decompression needed
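
Steps 3 and 4 can be sketched in plain PyTorch. This is a toy illustration under stated assumptions, not the helix-substrate implementation: `CodebookLinearShell` and `swap_linears` are hypothetical names, the codebook is assumed to hold scalar entries, and bias and sidecar corrections are left out.

```python
import torch
from torch import nn


class CodebookLinearShell(nn.Module):
    """Hypothetical stand-in for HelixLinear: empty buffers, filled at load time."""

    def __init__(self, in_features: int, out_features: int, k: int = 256):
        super().__init__()
        self.register_buffer("codebook", torch.zeros(k, dtype=torch.float32))
        self.register_buffer(
            "indices", torch.zeros(out_features, in_features, dtype=torch.uint8)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # uint8 tensors are treated as masks when indexing, so cast to long first.
        weight = self.codebook[self.indices.long()]
        return x @ weight.T


def swap_linears(module: nn.Module) -> int:
    """Recursively replace every nn.Linear with a shell; return the number replaced."""
    replaced = 0
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            shell = CodebookLinearShell(child.in_features, child.out_features)
            setattr(module, name, shell)
            replaced += 1
        else:
            replaced += swap_linears(child)
    return replaced


# Step 3: build the model, then swap the linear layers before any weights load.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
n_swapped = swap_linears(model)

# Step 4: the checkpoint holds only codebooks and indices, which land in the buffers.
state = {
    "0.codebook": torch.linspace(-1.0, 1.0, 256),
    "0.indices": torch.randint(0, 256, (4, 8), dtype=torch.uint8),
    "2.codebook": torch.linspace(-1.0, 1.0, 256),
    "2.indices": torch.randint(0, 256, (2, 4), dtype=torch.uint8),
}
model.load_state_dict(state)

out = model(torch.randn(3, 8))
print(n_swapped, tuple(out.shape))  # 2 (3, 2)
```

Because the shells exist before loading, the checkpoint keys match the buffer names and no dense weight tensor is ever allocated.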

Compression Receipt

Compressed tensors:  336
Exact tensors:       243  (norms, embeddings)
Dense size:          28.8 GB (BF16)
Compressed size:     ~8.4 GB
Compression ratio:   3.4x
Helix PPL:           5.58 (dense baseline pending; OOMs on 24 GB)
AWQ PPL:             4.47 (published)
Eval: WikiText-2, 1024 tokens, stride=512, BF16, NVIDIA 4090

Companion Models

Same codec, same pip install, multiple architectures:

| Model | Architecture | Ratio | PPL Delta |
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% |

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from Qwen/Qwen2.5-14B-Instruct).
