Instructions to use m-a-p/OProver-32B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use m-a-p/OProver-32B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="m-a-p/OProver-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("m-a-p/OProver-32B")
model = AutoModelForCausalLM.from_pretrained("m-a-p/OProver-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use m-a-p/OProver-32B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "m-a-p/OProver-32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-a-p/OProver-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/m-a-p/OProver-32B

SGLang

How to use m-a-p/OProver-32B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "m-a-p/OProver-32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-a-p/OProver-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "m-a-p/OProver-32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "m-a-p/OProver-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use m-a-p/OProver-32B with Docker Model Runner:
```
docker model run hf.co/m-a-p/OProver-32B
```

OProver-32B / README.md

dma2077

Upload README.md with huggingface_hub

0033418 verified 1 day ago

preview code

raw

history blame contribute delete

6.64 kB

	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- lean4
	- theorem-proving
	- formal-mathematics
	- agentic
	- retrieval-augmented
	datasets:
	- m-a-p/OProofs
	---

	# OProver

	> A unified framework for agentic formal theorem proving in Lean 4.

	- 📄 Paper: [OProver: A Unified Framework for Agentic Formal Theorem Proving](https://huggingface.co/papers/2605.17283) ([arXiv](https://arxiv.org/abs/2605.17283))
	- 📚 Collection: [m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)
	- 📦 Dataset: [m-a-p/OProofs](https://huggingface.co/datasets/m-a-p/OProofs)

	OProver treats theorem proving as a multi-round refinement loop. Given a target
	theorem, the prover retrieves top-k compiler-verified proofs from a memory of
	prior proofs, generates a proof attempt, runs the Lean 4 compiler, and—on
	failure—revises the attempt using the compiler feedback in the next round.
	The same retrieval and feedback signals are baked into training, so the
	training-time interface matches the proving-time interaction.

	## Highlights

	- State-of-the-art among open-weight whole-proof provers. OProver-32B
	attains the best Pass@32 on MiniF2F (93.3), ProverBench (58.2), and
	PutnamBench (11.3), and the second-best on MathOlympiad (22.8) and
	ProofNet (33.2). Even OProver-8B beats Goedel-Prover-V2-32B on all five
	benchmarks despite having 4× fewer parameters.
	- Agentic proving is in the policy, not bolted on. Retrieval, multi-turn
	compiler feedback, and iterative repair are all part of the trained policy.
	Ablations show feedback is the dominant driver: removing it costs OProver-32B
	4.9–7.4 Pass@32 points across benchmarks; removing retrieval on top adds
	0.5–1.7 more.
	- Co-evolving prover and corpus. Each post-training iteration runs agentic
	rollouts; verified proofs are added to OProofs and re-indexed for retrieval,
	repair trajectories become SFT data, and unresolved hard cases provide RL
	signal. MiniF2F-Test Pass@32 grows monotonically: OProver-8B 79.5 → 86.2 →
	87.0 → 91.8 across rounds; OProver-32B 84.7 → 88.1 → 93.3.

	## What's in this release

	The OProver collection ([m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)) bundles the paper,
	the corpus, and seven model checkpoints covering both training stages and
	both model sizes:

	\| Repo \| Stage \| Size \| Note \|
	\|---\|---\|---\|---\|
	\| [OProver-8B-Base](https://huggingface.co/m-a-p/OProver-8B-Base) \| Continued pretraining only \| 8B \| Domain-adapted base; before SFT/RL \|
	\| [OProver-32B-Base](https://huggingface.co/m-a-p/OProver-32B-Base) \| Continued pretraining only \| 32B \| Domain-adapted base; before SFT/RL \|
	\| [OProver-8B-Round1](https://huggingface.co/m-a-p/OProver-8B-Round1) \| Post-training Round 1 \| 8B \| After first SFT+RL iteration \|
	\| [OProver-8B-Round2](https://huggingface.co/m-a-p/OProver-8B-Round2) \| Post-training Round 2 \| 8B \| After second SFT+RL iteration \|
	\| [OProver-32B-Round1](https://huggingface.co/m-a-p/OProver-32B-Round1) \| Post-training Round 1 \| 32B \| After first SFT+RL iteration \|
	\| [OProver-8B](https://huggingface.co/m-a-p/OProver-8B) \| Final \| 8B \| The 8B prover reported in the paper (Round 3) \|
	\| [OProver-32B](https://huggingface.co/m-a-p/OProver-32B) \| Final \| 32B \| The 32B prover reported in the paper (Round 2) \|
	\| [OProofs](https://huggingface.co/datasets/m-a-p/OProofs) \| Dataset \| 6.86M proofs \| Lean 4 corpus used for CPT, SFT, and the retrieval memory \|

	Use OProver-8B or OProver-32B for proving. The Base / Round-N
	checkpoints are released for reproducibility and ablation studies.

	## Dataset: OProofs

	OProofs is a large-scale Lean 4 corpus that doubles as the retrieval memory at
	proving time. It is built from three sources: public Lean resources
	(NuminaMath-LEAN, Lean-Workbook, Leanabell-FormalStmt, Goedel-Pset, …),
	large-scale autoformalization + agentic proof synthesis from informal math
	mined on Common Crawl and GitHub, and traces from OProver's own agentic
	proving runs.

	![OProofs Lean 4 corpus statistics](./assets/oproofs.png)

	## Training

	Two stages, both anchored on OProofs.

	1. Continued pretraining (one-time). A 65B-token mixture of formal
	Lean (≈30%, from OProofs), code (≈20%, OpenCoder), mathematics (≈40%,
	Nemotron-Math-4-Plus), and long-CoT (≈10%, ProLong-64K). AdamW, peak LR
	5e-5, cosine with 3% warmup, batch 512, sequence length 8192. Output:
	`OProver-{8B,32B}-Base`.

	2. Iterative post-training. Each round runs:

	1. Agentic proving with the current prover on a theorem pool, producing
	multi-round rollouts conditioned on retrieved proofs and Lean feedback.
	2. SFT on round-level repair examples `(s, R, p_{t-1}, f_{t-1}) → p_t`,
	with cross-entropy loss only on the new attempt.
	3. GSPO RL on hard cases (groups with non-trivial pass-rate). Per-round
	reward `r = 0.8 + 0.2·1[format ok]` if Lean-verified, else 0; advantages
	are pooled across the n×R rounds for the same theorem.
	4. Newly verified proofs and repair trajectories are folded back into OProofs
	and re-indexed into the retrieval memory for the next round.

	## Results

	Pass@32 (n=64) across five Lean 4 benchmarks. Bold is best, _underlined_
	is second-best.

	![Main results: Pass@32 across five Lean 4 benchmarks](./assets/main_results.png)

	OProver-32B reaches three best and two second-best across five benchmarks —
	the most top placements of any model in the comparison, despite being a
	32B dense model versus a 560B MoE or a 671B dense competitor.

	## Loading

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Pick any of: OProver-8B, OProver-32B, -Base, -Round1, *-Round2
	name = "m-a-p/OProver-8B"
	tok = AutoTokenizer.from_pretrained(name)
	model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="bfloat16", device_map="auto")
	```

	```python
	from datasets import load_dataset
	ds = load_dataset("m-a-p/OProofs", split="train")
	```

	OProver is trained against a multi-round agentic interface: at each round the
	input includes the target Lean statement, top-k retrieved verified proofs, the
	prior proof attempt, and the Lean compiler feedback. See the paper §2.1 for
	the prompt template, and Appendix B for serialization details.

	## Citation

	```bibtex
	@article{ma2026oprover,
	title = {OProver: A Unified Framework for Agentic Formal Theorem Proving},
	author = {David Ma and Kaijing Ma and Shawn Guo and Yunfeng Shi and Enduo Zhao and Jiajun Shi and Zhaoxiang Zhang and Gavin Cheung and Jiaheng Liu and Zili Wang},
	journal = {arXiv preprint arXiv:2605.17283},
	year = {2026}
	}
	```