# Leanstral-RotorQuant
KV-cache quantized Leanstral-2603 using RotorQuant for high-throughput Lean 4 formal proof generation.
Leanstral is the first open-source AI agent purpose-built for Lean 4 formal proofs -- generating both executable code and machine-checkable mathematical proofs. This variant applies RotorQuant KV-cache quantization, delivering 5.3x faster prefill and 28% faster decode compared to TurboQuant while preserving full BF16 model weights.
## Overview
This repository provides the RotorQuant KV-cache-only configuration of Leanstral-2603. The model weights remain at full precision; only the KV cache is quantized during inference using RotorQuant's rotation-aware quantization scheme.
| Spec | Value |
|---|---|
| Base model | mistralai/Leanstral-2603 |
| Architecture | Mistral MoE (~119B parameters, 7 consolidated shards) |
| Compression | RotorQuant KV-cache quantization |
| Weight precision | BF16 (unmodified) |
| KV-cache precision | Mixed-precision quantized |
| Prefill speedup | 5.3x vs TurboQuant |
| Decode speedup | 28% vs TurboQuant |
| License | Apache 2.0 |
| Use case | Lean 4 formal verification, theorem proving, mathematical proofs |
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "majentik/Leanstral-RotorQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Enable RotorQuant KV-cache quantization
cache = IsoQuantCache(model)

prompt = "Prove that for all natural numbers n, n + 0 = n in Lean 4:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## What is RotorQuant?
RotorQuant is a KV-cache quantization method that uses rotation-aware quantization to achieve higher throughput than standard KV-cache compression. By exploiting the structure of rotary positional embeddings (RoPE), RotorQuant achieves:
- 5.3x faster prefill -- critical for long Lean 4 proof contexts
- 28% faster decode -- faster token-by-token proof generation
- Equivalent memory savings to TurboQuant with better computational efficiency
This makes RotorQuant the preferred choice for interactive theorem proving sessions where latency matters.
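To make the idea of rotation-aware KV quantization concrete, here is a minimal, self-contained sketch. It is illustrative only: it uses a generic random orthogonal rotation and per-row int8 quantization as stand-ins, not RotorQuant's actual rotation or bit-width, and the function names are hypothetical.

```python
import numpy as np

def random_rotation(d: int, seed: int = 0) -> np.ndarray:
    # Orthogonal rotation via QR decomposition: an illustrative stand-in
    # for a structured, RoPE-aware rotation. Rotating first spreads
    # per-channel outliers, which reduces quantization error.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize_int8(x: np.ndarray):
    # Per-row symmetric int8 quantization with a scale per token
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy KV block: 4 cached tokens, one 64-dim attention head
kv = np.random.default_rng(1).standard_normal((4, 64)).astype(np.float32)
R = random_rotation(64)

rotated = kv @ R                    # rotate before quantizing
q, s = quantize_int8(rotated)       # store q (int8) + s instead of FP16
recovered = dequantize(q, s) @ R.T  # undo the rotation on dequantization

err = np.abs(recovered - kv).max()  # small round-trip error
```

Because the rotation is orthogonal, it preserves inner products exactly; the only error introduced is the int8 rounding, which the rotation keeps small and evenly distributed.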
## Memory Estimates
| Component | Estimate |
|---|---|
| Model weights (BF16) | ~238 GB |
| KV-cache savings | 2-4x reduction vs FP16 KV cache |
| Recommended VRAM | 4x A100 80GB or equivalent |
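The KV-cache savings in the table can be reproduced with simple arithmetic. The sketch below uses hypothetical layer/head shapes chosen for illustration, not Leanstral-2603's actual configuration; the 4x ratio corresponds to an average of ~4 bits per cached element versus FP16.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # Factor of 2 accounts for storing both keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical config for illustration only (not Leanstral's real shapes)
fp16 = kv_cache_bytes(60, 8, 128, 32768, 1, 2)    # FP16: 2 bytes/elem
q4   = kv_cache_bytes(60, 8, 128, 32768, 1, 0.5)  # ~4-bit avg: 0.5 bytes/elem

print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB, quantized: {q4 / 2**30:.1f} GiB")
# → FP16 KV cache: 7.5 GiB, quantized: 1.9 GiB
```

At long Lean 4 proof contexts (tens of thousands of tokens), this difference is often what determines whether a batch fits on the recommended 4x A100 setup.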
## Lean 4 Use Case
Leanstral excels at:
- Formal verification -- generating machine-checkable proofs of mathematical theorems
- Theorem proving -- interactive and automated proof search in Lean 4
- Code generation -- writing verified Lean 4 programs with correctness guarantees
- Proof repair -- fixing incomplete or broken proof scripts
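As a concrete target, the quickstart prompt asks the model to prove that `n + 0 = n`. In Lean 4 this statement closes definitionally, since `Nat` addition recurses on its second argument (the theorem name here is chosen to avoid clashing with the built-in `Nat.add_zero`):

```lean
-- n + 0 reduces to n by definition of Nat addition,
-- so reflexivity suffices
theorem my_add_zero (n : Nat) : n + 0 = n := rfl
```

This is the kind of machine-checkable output Leanstral is expected to generate: the Lean kernel, not the model, is the final arbiter of correctness.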
## See Also
- mistralai/Leanstral-2603 -- Base model
- majentik/Leanstral-TurboQuant -- TurboQuant KV-cache variant
- majentik/Leanstral-RotorQuant-MLX-4bit -- MLX 4-bit + RotorQuant
- majentik/Leanstral-RotorQuant-MLX-2bit -- MLX 2-bit + RotorQuant
- majentik/Leanstral-RotorQuant-MLX-1bit -- MLX 1-bit + RotorQuant
- RotorQuant repository