OLMoE-1B-7B-Instruct-HXQ

1.9x smaller than BF16. HellaSwag 78.8%. The first MoE model compressed with HXQ.

OLMoE-1B-7B-Instruct (64-expert Mixture-of-Experts, 1B active / 6.9B total parameters) compressed from 13 GB (BF16) to 6.7 GB. All three downstream benchmarks score within noise of the dense baseline. No calibration data. No architecture-specific tuning. Just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EchoLabs33/olmoe-1b-7b-instruct-helix",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/olmoe-1b-7b-instruct-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. import helix_substrate registers the quantizer. from_pretrained() handles the rest automatically.

Downstream Benchmarks

Evaluated with lm-evaluation-harness v0.4.11 on an NVIDIA RTX 3090 (batch=4, dtype=bfloat16):

Benchmark        Dense (acc_norm)   HXQ (acc_norm)   Delta
HellaSwag        78.92%             78.76%           -0.16%
ARC-Challenge    52.13%             52.05%           -0.08%
ARC-Easy         75.72%             76.85%           +1.13%

All deltas within standard error. Task performance is preserved after 1.9x compression. These are real downstream scores from paired dense/HXQ evaluations, not PPL proxies.
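The "within noise" claim can be sanity-checked with a binomial standard-error estimate. This is only a sketch: the split sizes below (roughly 10,042 HellaSwag validation examples, 2,376 ARC-Easy and 1,172 ARC-Challenge test examples, the usual lm-evaluation-harness defaults) are an assumption, not something this card states.

```python
import math

# (dense acc_norm, HXQ acc_norm, approximate eval-set size).
# Split sizes are assumed lm-eval-harness defaults, not from the card.
results = {
    "hellaswag":     (0.7892, 0.7876, 10042),
    "arc_challenge": (0.5213, 0.5205, 1172),
    "arc_easy":      (0.7572, 0.7685, 2376),
}

for task, (dense, hxq, n) in results.items():
    delta = hxq - dense
    # Binomial standard error of the dense accuracy estimate
    se = math.sqrt(dense * (1 - dense) / n)
    print(f"{task}: delta={delta:+.4f}, 2*SE={2 * se:.4f}, "
          f"within noise: {abs(delta) <= 2 * se}")
```

Under these assumed sizes, every delta lands inside two standard errors, consistent with the table.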

Compression Benchmark

                     Dense (BF16)            HXQ
Size                 13 GB                   6.7 GB
Compression ratio    –                       1.9x
VRAM (eval)          13,886 MB               7,540 MB
Compressed modules   –                       3,152 HelixLinear layers
Architecture         OLMoE (64-expert MoE)   unchanged
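A back-of-envelope check on the table above. The byte-level layout here is an assumption (one uint8 index per weight, which the near-2x ratio suggests), not a published spec:

```python
# Measured ratio from the reported sizes
dense_gb, hxq_gb = 13.0, 6.7
ratio = dense_gb / hxq_gb
print(f"measured ratio: {ratio:.2f}x")   # 1.94x, reported as 1.9x

# BF16 stores 2 bytes per weight. Assuming HXQ stores one uint8 index
# per weight, the ideal ratio is 2x; per-matrix 256-entry float32
# codebooks, sidecar corrections, and full-precision layers (embeddings,
# lm_head, norms) pull the realized ratio slightly below that.
ideal = 2.0 / 1.0
print(f"ideal ratio:    {ideal:.1f}x")
```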

Verification Status

  • Compression receipt: PASS – 3,152 compressed, 67 exact, 12,675 total keys
  • Conversion receipt: PASS – SHA256 a9f74982b746853077d13dc11c8bc863dc91219c81e22577de1de2b195c7b836
  • Downstream eval: PASS – paired dense/HXQ on HellaSwag, ARC-Easy, ARC-Challenge

Good to Know

  • GPU and CPU supported – runs on any CUDA GPU or CPU via standard PyTorch.
  • trust_remote_code=True required – OLMoE uses custom modeling code.
  • Fine-tunable via LoRA – compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
  • Requires helix-substrate – the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • 64 experts = slow eval – lm-eval-harness takes ~5.5 hours on a 3090 due to MoE routing overhead. Inference speed is normal for interactive use.

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
  • The compressed form is the executable – HelixLinear performs codebook[indices] @ x directly, no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MoE, CNN)
  • No calibration data required – unlike GPTQ/AWQ, codebooks are fit from the weights alone
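The "fit from the weights alone" step can be illustrated with plain 1-D k-means (Lloyd's algorithm) over a weight matrix. This is only a sketch: the actual HelixCode codec, its vector dimensionality, and its sidecar-correction scheme are not specified here.

```python
import numpy as np

def fit_codebook(w, k=256, iters=10):
    """Fit a k-entry scalar codebook with 1-D k-means, using only the
    weights themselves -- no calibration data. Illustrative sketch only."""
    flat = w.reshape(-1).astype(np.float32)
    # Initialize centroids at evenly spaced quantiles of the weight values
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, k)).astype(np.float32)
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                codebook[j] = members.mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    # 256 entries fit exactly in a uint8 index per weight
    return codebook, idx.astype(np.uint8).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in weight matrix
codebook, indices = fit_codebook(w)
recon = codebook[indices]                          # lossy reconstruction
print("max abs error:", float(np.abs(w - recon).max()))
```

The stored artifact is then just the 256-entry float32 codebook plus the uint8 index matrix, matching the layout described above.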

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 3,152 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form โ€” no decompression needed
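Step 5 is the key property: the stored buffers are the runtime representation. A minimal NumPy illustration of the codebook[indices] @ x forward pass (variable names are illustrative, not helix-substrate's API):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=256).astype(np.float32)               # 256 entries
indices = rng.integers(0, 256, size=(32, 16)).astype(np.uint8)   # (out, in)
x = rng.normal(size=(16,)).astype(np.float32)                    # activations

# Gather weight rows from the codebook, then matmul -- there is no
# checkpoint-wide decompression pass before inference.
y = codebook[indices] @ x
print(y.shape)  # (32,)
```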

Architecture Details

OLMoE-1B-7B-Instruct is a Mixture-of-Experts architecture with:

  • 16 transformer layers, each with attention + MoE MLP
  • 64 experts per layer, top-8 routing (1B active / 6.9B total parameters)
  • hidden_size=2048, intermediate_size=1024 per expert
  • 16 attention heads, no GQA (num_kv_heads=16)

All 3,152 linear layers are compressed:

  • 3,072 expert projections (64 experts x 3 projections x 16 layers)
  • 64 attention projections (Q/K/V/O across 16 layers)
  • 16 router gates (expert routing per layer)

Normalization layers (33), embeddings (1), and lm_head (1) are stored at full precision.
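The 3,152 figure follows directly from the bullet arithmetic above (pure arithmetic on the stated architecture, no model download needed):

```python
layers, experts = 16, 64
expert_proj = layers * experts * 3   # 3 MLP projections per expert
attn_proj = layers * 4               # Q, K, V, O per layer
routers = layers                     # one router gate per layer
total = expert_proj + attn_proj + routers
print(expert_proj, attn_proj, routers, total)  # 3072 64 16 3152
```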

Why This Matters

OLMoE is the first Mixture-of-Experts model compressed with HXQ. Combined with existing Transformer, SSM, and Hybrid results, this demonstrates that the same codec – same codebook size, same algorithm, same pip install – works across four distinct architecture families without modification.

Companion Models

Same codec, same pip install, multiple architectures:

Model                               Architecture                   Ratio   Eval Delta
olmoe-1b-7b-instruct-helix          MoE (64 experts)               1.9x    -0.16% HellaSwag
zamba2-2.7b-instruct-helix          Hybrid (Mamba2+Transformer)    1.8x    +6.59% PPL
zamba2-1.2b-helix                   Hybrid (Mamba2+Transformer)    1.7x    +2.90% PPL
qwen2.5-14b-instruct-helix          Transformer                    3.4x    pending
qwen2.5-7b-instruct-helix           Transformer                    2.2x    +6.34% PPL
qwen2.5-3b-instruct-helix           Transformer                    1.6x    +0.69% PPL
qwen2.5-coder-3b-helix              Transformer (code)             1.6x    +1.92% PPL
qwen2.5-coder-1.5b-instruct-helix   Transformer (code)             2.4x    +1.63% PPL
tinyllama-1.1b-helix                Transformer                    4.0x    +0.78% PPL
mamba2-1.3b-helix                   Pure SSM (Mamba2)              2.1x    +8.0% PPL
mamba-130m-helix                    Pure SSM                       3.8x    +18.4% PPL

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from allenai/OLMoE-1B-7B-0924-Instruct).
