
# ROLV Primitive©

Sparse matrix operator for Mixture-of-Experts AI inference. 5–103× faster than cuBLAS/MKL. Up to 99% energy reduction. Bit-identical outputs. Test any HuggingFace MoE model — no upload required.

## What is this

Modern frontier AI models — DeepSeek-V3, Llama-4, Kimi-K2, Qwen3, Mixtral — use a Mixture-of-Experts (MoE) architecture. Each token activates only a small fraction of the model's experts (typically 8 of 256 in DeepSeek-V3). The inactive experts produce zero outputs. Standard libraries — NVIDIA cuBLAS, Intel MKL, cuSPARSE — multiply those zeros anyway. ROLV Primitive© skips them. Outputs are mathematically identical. The speedup is real.

### The INT_MAX finding

cuSPARSE cannot benchmark the full DeepSeek-V3 or Kimi-K2 stacked expert matrix. The matrix is 256 experts × 2048 × 7168 = 3,758,096,384 elements, which exceeds INT_MAX (2,147,483,647). cuSPARSE overflows silently and returns a submatrix result. Every published cuSPARSE benchmark on these models reports only a fraction of the full computation. ROLV Primitive© handles the full matrix natively.

Published finding: doi.org/10.5281/zenodo.19221455
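The overflow arithmetic is easy to verify for yourself. Below is a minimal sketch in plain Python (independent of any rolvprimitive or cuSPARSE API) showing that the stacked expert element count exceeds signed 32-bit range, and what a signed 32-bit counter would silently wrap to:

```python
# DeepSeek-V3 stacked expert matrix vs. signed 32-bit indexing.
INT_MAX = 2**31 - 1                   # 2,147,483,647

experts, rows, cols = 256, 2048, 7168  # expert count and per-expert shape
total = experts * rows * cols
print(total)                           # 3758096384
print(total > INT_MAX)                 # True: does not fit in a signed int32

# What a signed 32-bit counter would silently hold (two's-complement wrap):
m = total % 2**32
wrapped = m - 2**32 if m > INT_MAX else m
print(wrapped)                         # -536870912
```

Any library that stores element counts or offsets in a 32-bit `int` wraps to a negative value here, which is consistent with the silent-submatrix behaviour described above.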

## Verified results

482 SHA-256 verified cases on real downloaded model weights, across 7 hardware platforms. Independent validation by the University of Miami Frost Institute for Data Science and Computing is currently underway.

| Model | Layer | Sparsity | vs dense | vs cuSPARSE |
|---|---|---|---|---|
| DeepSeek-V3 | gate_proj | 87.5% | 5.58× | overflow* |
| Llama-4-Scout | gate_proj | 92.2% | 9.54× | 103× |
| Kimi-K2 | gate_proj | 93.8% | 8.97× | overflow* |
| Mixtral-8×22B | gate_proj | 87.5% | 5.39× | 109× |
| Mixtral-8×7B | gate_proj | 87.5% | 5.21× | 76× |
| OLMoE-1B-7B | gate_proj | 87.5% | 5.58× | confirmed |
| OLMoE-1B-7B | up_proj | 87.5% | 5.74× | confirmed |
| OLMoE-1B-7B | down_proj | 87.5% | 5.67× | confirmed |
| Peak (99% sparsity) | down_proj | 99% | 46× | — |

- CPU (Intel i7, Windows, 8 threads): 28/28 PASS, ATOL=0.0000, all real weights.
- ARM (Google Axion): 5.12× vs MKL confirmed.
- *cuSPARSE INT_MAX overflow — see finding above.
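For readers who want to reproduce the verification idea, hashing a tensor's raw bytes is a straightforward way to prove that two runs operated on identical data. The sketch below is illustrative only; the harness's exact hashing convention (byte order, layout) is an assumption, not published here:

```python
import hashlib
import numpy as np

def tensor_sha256(t: np.ndarray) -> str:
    """Digest of a tensor's raw bytes; identical data gives identical hashes."""
    return hashlib.sha256(np.ascontiguousarray(t).tobytes()).hexdigest()

x = np.arange(6, dtype=np.float32).reshape(2, 3)
assert tensor_sha256(x) == tensor_sha256(x.copy())  # same data, same digest
print(tensor_sha256(x)[:16])                        # short prefix for logs
```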

## Quick start

### Step 1 — Download the wheel for your platform

From Releases:

| Platform | Python | Wheel filename |
|---|---|---|
| Windows 64-bit | 3.13 | rolvprimitive-1.0.0-cp313-none-win_amd64.whl |
| Windows 64-bit | 3.11 | rolvprimitive-1.0.0-cp311-none-win_amd64.whl |
| Linux x86_64 | 3.12 | rolvprimitive-1.0.0-cp312-cp312-linux_x86_64.whl |
| Any / Anaconda | any | rolvprimitive-1.0.0-py3-none-any.whl |

### Step 2 — Install

```shell
pip install rolvprimitive-1.0.0-cp313-none-win_amd64.whl    # Windows py3.13
pip install rolvprimitive-1.0.0-cp312-cp312-linux_x86_64.whl  # Linux py3.12
pip install rolvprimitive-1.0.0-py3-none-any.whl            # Anaconda / any
```

### Step 3 — Run the benchmark

The script downloads model weights directly from HuggingFace to your own machine. Nothing is uploaded. You benchmark on your own hardware.

```shell
pip install torch scipy psutil transformers accelerate huggingface_hub einops tqdm
```

```shell
# DeepSeek-V3 shapes — no download, uses real dimensions (fastest start):
python scripts/benchmark.py --model deepseek-shapes

# OLMoE real weights — ~7 GB download, CPU or GPU:
python scripts/benchmark.py --model olmoe

# Mixtral-8x7B — ~26 GB, GPU recommended:
python scripts/benchmark.py --model mixtral-8x7b

# Any HuggingFace MoE model by ID:
python scripts/benchmark.py --model mistralai/Mixtral-8x22B-v0.1

# CPU only:
python scripts/benchmark.py --model olmoe --device cpu

# Custom iterations and batch size:
python scripts/benchmark.py --model olmoe --iterations 2000 --batch 2000

# Multiple models:
python scripts/benchmark.py --model deepseek-shapes,olmoe
```

### Available model shortcuts

| Shortcut | Model | Download |
|---|---|---|
| deepseek-shapes | DeepSeek-V3 real dimensions (synthetic weights) | none |
| olmoe | allenai/OLMoE-1B-7B-0924 | ~7 GB |
| mixtral-8x7b | mistralai/Mixtral-8x7B-v0.1 | ~26 GB |
| mixtral-8x22b | mistralai/Mixtral-8x22B-v0.1 | ~87 GB |
| phi35moe | microsoft/Phi-3.5-MoE-instruct | ~16 GB |
| deepseek-moe | deepseek-ai/deepseek-moe-16b-base | ~32 GB |
| qwen2moe | Qwen/Qwen1.5-MoE-A2.7B | ~6 GB |
| jamba | ai21labs/Jamba-1.5-Mini | ~24 GB |
| auto | DeepSeek shapes + OLMoE | ~7 GB |
| any HF model ID | e.g. mistralai/Mistral-7B-v0.1 | varies |

## What the benchmark measures

Per ROLV Benchmark Harness Prerequisites & Standards v2.0:

- Hardware detection banner — CPU, GPU, RAM, VRAM, backend, energy source
- 4 SHA-256 hashes per case — W, X, baseline output, ROLV output
- 4 error metrics — max/mean absolute and relative error (raw FP32)
- ATOL correctness check — column-normalised, threshold 0.05
- Perturbation test — proves live computation, not a cached result
- Speed — build_ms, ms/iter, speedup× and %, vs both dense and sparse
- Energy — joules and watts via pynvml (NVIDIA), pyrsmi (AMD), or proxy
- FLOPs reduction, tok/s gain, TTFT reduction — all vs vendor baseline
- ROLVswitch™ strategy printed per case
- RSMT™ threshold printed per case
- CSV output — all results saved to rolv_results.csv
- Disk cleanup after each model — no disk exhaustion
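One plausible reading of the column-normalised ATOL check: scale each output column's absolute error by that column's magnitude, then compare the worst case to the 0.05 threshold. The function name and exact normalisation below are assumptions for illustration, not the harness's actual code:

```python
import numpy as np

def column_normalised_atol(baseline, candidate, atol=0.05):
    """Per-column absolute error divided by that column's peak magnitude
    (hypothetical sketch of a column-normalised tolerance check)."""
    scale = np.maximum(np.abs(baseline).max(axis=0), 1e-12)  # avoid div by 0
    err = np.abs(candidate - baseline).max(axis=0) / scale
    return float(err.max()), bool((err <= atol).all())

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 8)).astype(np.float32)
worst, ok = column_normalised_atol(a, a)   # identical inputs: zero error
print(worst, ok)                           # 0.0 True
```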

## Use in your own code

```python
import torch
from rolvprimitive import ROLVHybrid

# Your MoE expert weight matrix — any source
W = your_expert_weight   # shape: (out_features, in_features)

# Build once at model load time — ROLVswitch™ auto-selects strategy
op = ROLVHybrid(W, batch=1000)
print(op._strategy)      # see which path was selected

# Use at inference time
out = op.apply(X)        # X: (batch, in_features) — identical to W @ X.T
```
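The core idea — skipping inputs that are structurally zero — can be sketched in plain NumPy. This is an illustration of sparsity skipping under assumed shapes, not the rolvprimitive implementation (real MoE sparsity is expert-structured, and ROLVswitch™ strategy selection is not shown here):

```python
import numpy as np

def sparse_skip_matmul(W, X):
    """Compute X @ W.T while skipping input features that are zero in
    every row of X. Result is identical to the dense product because
    the skipped features contribute exactly zero (illustration only)."""
    active = np.abs(X).any(axis=0)        # which columns of X are nonzero
    return X[:, active] @ W[:, active].T  # multiply only the active slice

W = np.arange(12, dtype=np.float32).reshape(3, 4)  # (out=3, in=4)
X = np.zeros((2, 4), dtype=np.float32)
X[:, 1] = 1.0                                      # only feature 1 active

dense = X @ W.T
fast = sparse_skip_matmul(W, X)
print(np.array_equal(dense, fast))  # True: outputs bit-identical
```

The dense path does all 2×4×3 multiply-adds; the skipping path does 2×1×3, yet both return the same matrix — the same invariant the benchmark's hash and ATOL checks verify at model scale.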

## Post your results

Run the benchmark and share your output. Include your hardware, model, and numbers.

- Reddit: r/LocalLLaMA · r/MachineLearning
- HuggingFace: huggingface.co/rolv-ai
- GitHub Discussions: open a thread here

## Citation

```bibtex
@misc{heggenhougen2026rolv,
  title  = {ROLV Primitive: A Sparse Matrix Operator for Mixture-of-Experts Inference},
  author = {Heggenhougen, Rolv Eitrem},
  year   = {2026},
  doi    = {10.5281/zenodo.19221455},
  url    = {https://doi.org/10.5281/zenodo.19221455}
}
```

## License

Free for personal and research use. Commercial use requires a license. Commercial use includes inference APIs, cloud services, enterprise software, and any business deployment.

Commercial licensing: rolv@rolv.ai | rolv.ai

ROLV LLC · 445 NE 12th Ave · Fort Lauderdale FL 33301
ROLV Primitive© · RSMT™ · ROLVswitch™ · 3 Patents Pending
Copyright © 2025-2026 ROLV LLC · All rights reserved
rolv@rolv.ai · rolv.ai
