OLMoE-1B-7B-Instruct-HXQ

1.9x smaller than BF16. HellaSwag 78.8%. The first MoE model compressed with HXQ.

OLMoE-1B-7B-Instruct (64-expert Mixture-of-Experts, 1B active / 6.9B total parameters) compressed from 13 GB (BF16) to 6.7 GB. All three downstream benchmarks score within noise of the dense baseline. No calibration data. No architecture-specific tuning. Just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EchoLabs33/olmoe-1b-7b-instruct-helix",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/olmoe-1b-7b-instruct-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. import helix_substrate registers the quantizer. from_pretrained() handles the rest automatically.

Downstream Benchmarks

Evaluated with lm-evaluation-harness v0.4.11 on an NVIDIA RTX 3090 (batch=4, dtype=bfloat16):

Benchmark        Dense (acc_norm)   HXQ (acc_norm)   Delta
HellaSwag        78.92%             78.76%           -0.16%
ARC-Challenge    52.13%             52.05%           -0.08%
ARC-Easy         75.72%             76.85%           +1.13%

All deltas within standard error. Task performance is preserved after 1.9x compression. These are real downstream scores from paired dense/HXQ evaluations, not PPL proxies.
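The "within noise" claim can be sanity-checked with a binomial standard-error estimate. This is only a sketch: the split sizes below (roughly 10,042 HellaSwag validation examples, 2,376 ARC-Easy and 1,172 ARC-Challenge test examples, the usual lm-evaluation-harness defaults) are an assumption, not something this card states.

```python
import math

# (dense acc_norm, HXQ acc_norm, approximate eval-set size).
# Split sizes are assumed lm-eval-harness defaults, not from the card.
results = {
    "hellaswag":     (0.7892, 0.7876, 10042),
    "arc_challenge": (0.5213, 0.5205, 1172),
    "arc_easy":      (0.7572, 0.7685, 2376),
}

for task, (dense, hxq, n) in results.items():
    delta = hxq - dense
    # Binomial standard error of the dense accuracy estimate
    se = math.sqrt(dense * (1 - dense) / n)
    print(f"{task}: delta={delta:+.4f}, 2*SE={2 * se:.4f}, "
          f"within noise: {abs(delta) <= 2 * se}")
```

Under these assumed sizes, every delta lands inside two standard errors, consistent with the table.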

Compression Benchmark

                     Dense (BF16)            HXQ
Size                 13 GB                   6.7 GB
Compression ratio    –                       1.9x
VRAM (eval)          13,886 MB               7,540 MB
Compressed modules   –                       3,152 HelixLinear layers
Architecture         OLMoE (64-expert MoE)   unchanged
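A back-of-envelope check on the table above. The byte-level layout here is an assumption (one uint8 index per weight, which the near-2x ratio suggests), not a published spec:

```python
# Measured ratio from the reported sizes
dense_gb, hxq_gb = 13.0, 6.7
ratio = dense_gb / hxq_gb
print(f"measured ratio: {ratio:.2f}x")   # 1.94x, reported as 1.9x

# BF16 stores 2 bytes per weight. Assuming HXQ stores one uint8 index
# per weight, the ideal ratio is 2x; per-matrix 256-entry float32
# codebooks, sidecar corrections, and full-precision layers (embeddings,
# lm_head, norms) pull the realized ratio slightly below that.
ideal = 2.0 / 1.0
print(f"ideal ratio:    {ideal:.1f}x")
```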

Verification Status

  • Compression receipt: PASS – 3,152 compressed, 67 exact, 12,675 total keys
  • Conversion receipt: PASS – SHA256 a9f74982b746853077d13dc11c8bc863dc91219c81e22577de1de2b195c7b836
  • Downstream eval: PASS – paired dense/HXQ on HellaSwag, ARC-Easy, ARC-Challenge

Good to Know

  • GPU and CPU supported – runs on any CUDA GPU or CPU via standard PyTorch.
  • trust_remote_code=True required – OLMoE uses custom modeling code.
  • Fine-tunable via LoRA – compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
  • Requires helix-substrate – the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • 64 experts = slow eval – lm-eval-harness takes ~5.5 hours on a 3090 due to MoE routing overhead. Inference speed is normal for interactive use.

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
  • The compressed form is the executable – HelixLinear performs codebook[indices] @ x directly, no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MoE, CNN)
  • No calibration data required – unlike GPTQ/AWQ, codebooks are fit from the weights alone
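The "fit from the weights alone" step can be illustrated with plain 1-D k-means (Lloyd's algorithm) over a weight matrix. This is only a sketch: the actual HelixCode codec, its vector dimensionality, and its sidecar-correction scheme are not specified here.

```python
import numpy as np

def fit_codebook(w, k=256, iters=10):
    """Fit a k-entry scalar codebook with 1-D k-means, using only the
    weights themselves -- no calibration data. Illustrative sketch only."""
    flat = w.reshape(-1).astype(np.float32)
    # Initialize centroids at evenly spaced quantiles of the weight values
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, k)).astype(np.float32)
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                codebook[j] = members.mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    # 256 entries fit exactly in a uint8 index per weight
    return codebook, idx.astype(np.uint8).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in weight matrix
codebook, indices = fit_codebook(w)
recon = codebook[indices]                          # lossy reconstruction
print("max abs error:", float(np.abs(w - recon).max()))
```

The stored artifact is then just the 256-entry float32 codebook plus the uint8 index matrix, matching the layout described above.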

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 3,152 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form โ€” no decompression needed
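Step 5 is the key property: the stored buffers are the runtime representation. A minimal NumPy illustration of the codebook[indices] @ x forward pass (variable names are illustrative, not helix-substrate's API):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=256).astype(np.float32)               # 256 entries
indices = rng.integers(0, 256, size=(32, 16)).astype(np.uint8)   # (out, in)
x = rng.normal(size=(16,)).astype(np.float32)                    # activations

# Gather weight rows from the codebook, then matmul -- there is no
# checkpoint-wide decompression pass before inference.
y = codebook[indices] @ x
print(y.shape)  # (32,)
```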

Architecture Details

OLMoE-1B-7B-Instruct is a Mixture-of-Experts architecture with:

  • 16 transformer layers, each with attention + MoE MLP
  • 64 experts per layer, top-8 routing (1B active / 6.9B total parameters)
  • hidden_size=2048, intermediate_size=1024 per expert
  • 16 attention heads, no GQA (num_kv_heads=16)

All 3,152 linear layers are compressed:

  • 3,072 expert projections (64 experts x 3 projections x 16 layers)
  • 64 attention projections (Q/K/V/O across 16 layers)
  • 16 router gates (expert routing per layer)

Normalization layers (33), embeddings (1), and lm_head (1) are stored at full precision.
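The 3,152 figure follows directly from the bullet arithmetic above (pure arithmetic on the stated architecture, no model download needed):

```python
layers, experts = 16, 64
expert_proj = layers * experts * 3   # 3 MLP projections per expert
attn_proj = layers * 4               # Q, K, V, O per layer
routers = layers                     # one router gate per layer
total = expert_proj + attn_proj + routers
print(expert_proj, attn_proj, routers, total)  # 3072 64 16 3152
```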

Why This Matters

OLMoE is the first Mixture-of-Experts model compressed with HXQ. Combined with existing Transformer, SSM, and Hybrid results, this demonstrates that the same codec – same codebook size, same algorithm, same pip install – works across four distinct architecture families without modification.

Companion Models

Same codec, same pip install, multiple architectures:

Model                               Architecture                   Ratio   Eval Delta
olmoe-1b-7b-instruct-helix          MoE (64 experts)               1.9x    -0.16% HellaSwag
zamba2-2.7b-instruct-helix          Hybrid (Mamba2+Transformer)    1.8x    +6.59% PPL
zamba2-1.2b-helix                   Hybrid (Mamba2+Transformer)    1.7x    +2.90% PPL
qwen2.5-14b-instruct-helix          Transformer                    3.4x    pending
qwen2.5-7b-instruct-helix           Transformer                    2.2x    +6.34% PPL
qwen2.5-3b-instruct-helix           Transformer                    1.6x    +0.69% PPL
qwen2.5-coder-3b-helix              Transformer (code)             1.6x    +1.92% PPL
qwen2.5-coder-1.5b-instruct-helix   Transformer (code)             2.4x    +1.63% PPL
tinyllama-1.1b-helix                Transformer                    4.0x    +0.78% PPL
mamba2-1.3b-helix                   Pure SSM (Mamba2)              2.1x    +8.0% PPL
mamba-130m-helix                    Pure SSM                       3.8x    +18.4% PPL

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from allenai/OLMoE-1B-7B-0924-Instruct).
