# OLMoE-1B-7B-Instruct-HXQ
1.9x smaller than BF16. HellaSwag 78.8%. The first MoE compressed with HXQ.

OLMoE-1B-7B-Instruct (a 64-expert Mixture-of-Experts model, 1B active / 6.9B total parameters) compressed from 13 GB (BF16) to 6.7 GB. All three downstream benchmarks are within noise of the dense baseline. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.
## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EchoLabs33/olmoe-1b-7b-instruct-helix",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/olmoe-1b-7b-instruct-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
That's it. `import helix_substrate` registers the quantizer; `from_pretrained()` handles the rest automatically.
## Downstream Benchmarks
Evaluated with lm-evaluation-harness v0.4.11 on an NVIDIA RTX 3090 (batch=4, dtype=bfloat16):
| Benchmark | Dense (acc_norm) | HXQ (acc_norm) | Delta |
|---|---|---|---|
| HellaSwag | 78.92% | 78.76% | -0.16% |
| ARC-Challenge | 52.13% | 52.05% | -0.08% |
| ARC-Easy | 75.72% | 76.85% | +1.13% |
All deltas within standard error. Task performance is preserved after 1.9x compression. These are real downstream scores from paired dense/HXQ evaluations, not PPL proxies.
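As a quick sanity check on "within standard error": for an accuracy p measured over N examples, the binomial standard error is sqrt(p(1-p)/N). A minimal sketch using the HellaSwag numbers from the table (N = 10042 is the HellaSwag validation-split size, assumed here to match the lm-eval default):

```python
# Binomial standard error for an accuracy score: se = sqrt(p * (1 - p) / N).
# N = 10042 is the HellaSwag validation-split size (assumed lm-eval default).
p_dense, p_hxq, n = 0.7892, 0.7876, 10042

se = (p_hxq * (1 - p_hxq) / n) ** 0.5
delta = p_dense - p_hxq

print(f"standard error: {se:.4f}")    # ~0.0041 (0.41 pp)
print(f"delta:          {delta:.4f}")  # 0.0016 (0.16 pp), well inside one SE
```

The 0.16 pp HellaSwag drop is well under one standard error of roughly 0.41 pp, so the dense and HXQ scores are statistically indistinguishable on this benchmark.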
## Compression Benchmark

| | Dense (BF16) | HXQ |
|---|---|---|
| Size | 13 GB | 6.7 GB |
| Compression ratio | — | 1.9x |
| VRAM (eval) | 13,886 MB | 7,540 MB |
| Compressed modules | — | 3,152 HelixLinear layers |
| Architecture | OLMoE (64-expert MoE) | unchanged |
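The ratio follows from the storage format: BF16 spends 2 bytes per weight, while HXQ spends roughly 1 byte per weight (a uint8 index) plus a small per-matrix codebook and sidecar corrections. A back-of-envelope sketch (the ~5% sidecar overhead is an assumption, not a number from the codec):

```python
# Back-of-envelope size estimate (approximate; real overhead varies per layer).
total_params = 6.9e9           # OLMoE total parameter count
bf16_bytes = total_params * 2  # 2 bytes per BF16 weight

codebook_bytes = 3152 * 256 * 4     # 256 float32 entries per compressed matrix
index_bytes = total_params * 1      # 1 uint8 index per weight (approx.)
sidecar_bytes = 0.05 * index_bytes  # assumed ~5% outlier-correction overhead

hxq_bytes = codebook_bytes + index_bytes + sidecar_bytes
print(f"BF16:  {bf16_bytes / 1e9:.1f} GB")      # ~13.8 GB
print(f"HXQ:   {hqx if False else hxq_bytes / 1e9:.1f} GB")
print(f"ratio: {bf16_bytes / hxq_bytes:.1f}x")  # ~1.9x
```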
## Verification Status

- Compression receipt: PASS — 3,152 compressed, 67 exact, 12,675 total keys
- Conversion receipt: PASS — SHA256 `a9f74982b746853077d13dc11c8bc863dc91219c81e22577de1de2b195c7b836`
- Downstream eval: PASS — paired dense/HXQ on HellaSwag, ARC-Easy, ARC-Challenge
## Good to Know

- GPU and CPU supported — runs on any CUDA GPU or CPU via standard PyTorch.
- `trust_remote_code=True` required — OLMoE uses custom modeling code.
- Fine-tunable via LoRA — compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- Requires `helix-substrate` — the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- 64 experts = slow eval — lm-evaluation-harness takes ~5.5 hours on a 3090 due to MoE routing overhead. Inference speed is normal for interactive use.
## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a 256-entry codebook (float32), a uint8 index matrix, and optional sidecar corrections for outlier values.
- The compressed form is the executable — `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step.
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MoE, CNN).
- No calibration data required — unlike GPTQ/AWQ, codebooks are fit from the weights alone.
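A minimal illustration of the idea (not the HelixCode fitting algorithm itself, which is not specified here): quantize a weight matrix to a 256-entry scalar codebook fit with a few Lloyd (k-means) iterations, store uint8 indices, and reconstruct.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64)).astype(np.float32)  # stand-in weight matrix

# Initialize a 256-entry scalar codebook from the weight quantiles, then
# refine it with a few Lloyd (k-means) iterations.
codebook = np.quantile(W, np.linspace(0, 1, 256)).astype(np.float32)
for _ in range(5):
    idx = np.abs(W[..., None] - codebook).argmin(-1)  # nearest codebook entry
    for k in range(256):                              # move entries to cluster means
        if (idx == k).any():
            codebook[k] = W[idx == k].mean()

indices = np.abs(W[..., None] - codebook).argmin(-1).astype(np.uint8)  # 1 byte/weight
W_hat = codebook[indices]  # reconstruction used at inference time

print("mean abs error:", np.abs(W - W_hat).mean())
```

Storing `codebook` (1 KB) plus `indices` (one byte per weight) is what drops the footprint to roughly half of BF16's two bytes per weight.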
## How It Works

- `import helix_substrate` registers the `hxq` quantizer with HuggingFace.
- `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`.
- The quantizer replaces 3,152 `nn.Linear` modules with `HelixLinear` shells before weight loading.
- Safetensors populates the codebook, indices, and sidecar buffers directly.
- The model runs in compressed form — no decompression needed.
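The forward pass of a compressed layer can be sketched as follows (a simplified stand-in for `HelixLinear`; the real module also applies sidecar outlier corrections, omitted here):

```python
import numpy as np

def helix_linear_forward(x, codebook, indices):
    """y = W_hat @ x, where W_hat = codebook[indices] is gathered on the fly."""
    W_hat = codebook[indices]  # (out_features, in_features); no dense W is stored
    return W_hat @ x

rng = np.random.default_rng(0)
codebook = rng.standard_normal(256).astype(np.float32)                 # 256 entries
indices = rng.integers(0, 256, size=(32, 16), dtype=np.uint8)          # uint8 matrix
x = rng.standard_normal(16).astype(np.float32)                         # input vector

y = helix_linear_forward(x, codebook, indices)
print(y.shape)  # (32,)
```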
## Architecture Details
OLMoE-1B-7B-Instruct is a Mixture-of-Experts architecture with:
- 16 transformer layers, each with attention + MoE MLP
- 64 experts per layer, top-8 routing (1B active / 6.9B total parameters)
- hidden_size=2048, intermediate_size=1024 per expert
- 16 attention heads, no GQA (num_kv_heads=16)
All 3,152 linear layers are compressed:
- 3,072 expert projections (64 experts x 3 projections x 16 layers)
- 64 attention projections (Q/K/V/O across 16 layers)
- 16 router gates (expert routing per layer)
Normalization layers (33), embeddings (1), and lm_head (1) are stored at full precision.
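The module count above can be double-checked with simple arithmetic (assuming the standard three projections per expert MLP: gate, up, down):

```python
# Tally compressed nn.Linear modules per the architecture description above.
layers = 16
expert_proj = 64 * 3 * layers  # 64 experts x 3 projections (gate/up/down assumed)
attn_proj = 4 * layers         # Q, K, V, O
router = 1 * layers            # one routing gate per layer

total = expert_proj + attn_proj + router
print(total)  # 3152
```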
## Why This Matters
OLMoE is the first Mixture-of-Experts model compressed with HXQ. Combined with existing Transformer, SSM, and Hybrid results, this demonstrates that the same codec โ same codebook size, same algorithm, same pip install โ works across four distinct architecture families without modification.
## Companion Models
Same codec, same pip install, multiple architectures:
| Model | Architecture | Ratio | Eval Delta |
|---|---|---|---|
| olmoe-1b-7b-instruct-helix | MoE (64 experts) | 1.9x | -0.16% HellaSwag |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% PPL |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% PPL |
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% PPL |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% PPL |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% PPL |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% PPL |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% PPL |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% PPL |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% PPL |
## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```
## License
Apache 2.0 (inherited from allenai/OLMoE-1B-7B-0924-Instruct).