Instructions to use m-a-p/OProver-8B-Base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use m-a-p/OProver-8B-Base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="m-a-p/OProver-8B-Base")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("m-a-p/OProver-8B-Base")
model = AutoModelForCausalLM.from_pretrained("m-a-p/OProver-8B-Base")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use m-a-p/OProver-8B-Base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "m-a-p/OProver-8B-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-a-p/OProver-8B-Base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/m-a-p/OProver-8B-Base

SGLang

How to use m-a-p/OProver-8B-Base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "m-a-p/OProver-8B-Base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-a-p/OProver-8B-Base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "m-a-p/OProver-8B-Base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-a-p/OProver-8B-Base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use m-a-p/OProver-8B-Base with Docker Model Runner:
```
docker model run hf.co/m-a-p/OProver-8B-Base
```

OProver-8B-Base / README.md

dma2077

Upload README.md with huggingface_hub

2758888 verified 1 day ago

preview code

raw

history blame contribute delete

6.64 kB

metadata

license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - lean4
  - theorem-proving
  - formal-mathematics
  - agentic
  - retrieval-augmented
datasets:
  - m-a-p/OProofs

OProver

A unified framework for agentic formal theorem proving in Lean 4.

📄 Paper: OProver: A Unified Framework for Agentic Formal Theorem Proving (arXiv)
📚 Collection: m-a-p/OProver
📦 Dataset: m-a-p/OProofs

OProver treats theorem proving as a multi-round refinement loop. Given a target theorem, the prover retrieves top-k compiler-verified proofs from a memory of prior proofs, generates a proof attempt, runs the Lean 4 compiler, and—on failure—revises the attempt using the compiler feedback in the next round. The same retrieval and feedback signals are baked into training, so the training-time interface matches the proving-time interaction.

Highlights

State-of-the-art among open-weight whole-proof provers. OProver-32B attains the best Pass@32 on MiniF2F (93.3), ProverBench (58.2), and PutnamBench (11.3), and the second-best on MathOlympiad (22.8) and ProofNet (33.2). Even OProver-8B beats Goedel-Prover-V2-32B on all five benchmarks despite having 4× fewer parameters.
Agentic proving is in the policy, not bolted on. Retrieval, multi-turn compiler feedback, and iterative repair are all part of the trained policy. Ablations show feedback is the dominant driver: removing it costs OProver-32B 4.9–7.4 Pass@32 points across benchmarks; removing retrieval on top adds 0.5–1.7 more.
Co-evolving prover and corpus. Each post-training iteration runs agentic rollouts; verified proofs are added to OProofs and re-indexed for retrieval, repair trajectories become SFT data, and unresolved hard cases provide RL signal. MiniF2F-Test Pass@32 grows monotonically: OProver-8B 79.5 → 86.2 → 87.0 → 91.8 across rounds; OProver-32B 84.7 → 88.1 → 93.3.

What's in this release

The OProver collection (m-a-p/OProver) bundles the paper, the corpus, and seven model checkpoints covering both training stages and both model sizes:

Repo	Stage	Size	Note
OProver-8B-Base	Continued pretraining only	8B	Domain-adapted base; before SFT/RL
OProver-32B-Base	Continued pretraining only	32B	Domain-adapted base; before SFT/RL
OProver-8B-Round1	Post-training Round 1	8B	After first SFT+RL iteration
OProver-8B-Round2	Post-training Round 2	8B	After second SFT+RL iteration
OProver-32B-Round1	Post-training Round 1	32B	After first SFT+RL iteration
OProver-8B	Final	8B	The 8B prover reported in the paper (Round 3)
OProver-32B	Final	32B	The 32B prover reported in the paper (Round 2)
OProofs	Dataset	6.86M proofs	Lean 4 corpus used for CPT, SFT, and the retrieval memory

Use OProver-8B or OProver-32B for proving. The Base / Round-N checkpoints are released for reproducibility and ablation studies.

Dataset: OProofs

OProofs is a large-scale Lean 4 corpus that doubles as the retrieval memory at proving time. It is built from three sources: public Lean resources (NuminaMath-LEAN, Lean-Workbook, Leanabell-FormalStmt, Goedel-Pset, …), large-scale autoformalization + agentic proof synthesis from informal math mined on Common Crawl and GitHub, and traces from OProver's own agentic proving runs.

Training

Two stages, both anchored on OProofs.

1. Continued pretraining (one-time). A 65B-token mixture of formal Lean (≈30%, from OProofs), code (≈20%, OpenCoder), mathematics (≈40%, Nemotron-Math-4-Plus), and long-CoT (≈10%, ProLong-64K). AdamW, peak LR 5e-5, cosine with 3% warmup, batch 512, sequence length 8192. Output: OProver-{8B,32B}-Base.

2. Iterative post-training. Each round runs:

Agentic proving with the current prover on a theorem pool, producing multi-round rollouts conditioned on retrieved proofs and Lean feedback.
SFT on round-level repair examples (s, R, p_{t-1}, f_{t-1}) → p_t, with cross-entropy loss only on the new attempt.
GSPO RL on hard cases (groups with non-trivial pass-rate). Per-round reward r = 0.8 + 0.2·1[format ok] if Lean-verified, else 0; advantages are pooled across the n×R rounds for the same theorem.
Newly verified proofs and repair trajectories are folded back into OProofs and re-indexed into the retrieval memory for the next round.

Results

Pass@32 (n=64) across five Lean 4 benchmarks. Bold is best, underlined is second-best.

OProver-32B reaches three best and two second-best across five benchmarks — the most top placements of any model in the comparison, despite being a 32B dense model versus a 560B MoE or a 671B dense competitor.

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick any of: OProver-8B, OProver-32B, *-Base, *-Round1, *-Round2
name = "m-a-p/OProver-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="bfloat16", device_map="auto")

from datasets import load_dataset
ds = load_dataset("m-a-p/OProofs", split="train")

OProver is trained against a multi-round agentic interface: at each round the input includes the target Lean statement, top-k retrieved verified proofs, the prior proof attempt, and the Lean compiler feedback. See the paper §2.1 for the prompt template, and Appendix B for serialization details.

Citation

@article{ma2026oprover,
  title   = {OProver: A Unified Framework for Agentic Formal Theorem Proving},
  author  = {David Ma and Kaijing Ma and Shawn Guo and Yunfeng Shi and Enduo Zhao and Jiajun Shi and Zhaoxiang Zhang and Gavin Cheung and Jiaheng Liu and Zili Wang},
  journal = {arXiv preprint arXiv:2605.17283},
  year    = {2026}
}