# VPR-Minesweeper
🌐 Project Page | 📝 Paper | 💻 Code
## Model Description
This is the Minesweeper-trained checkpoint from VPR: Verifiable Process Rewards for Agentic Reasoning, initialized from Qwen3-4B.
VPR turns verifiable oracles into dense turn-level rewards for long-horizon agentic reasoning. This checkpoint is trained with posterior-based VPR on Minesweeper, where posterior mine probabilities provide step-level feedback for safe reveals and certain mine flags under partial observability.
## Overview
Reinforcement learning from verifiable rewards typically supervises only final success. In long-horizon agentic tasks, this creates a credit-assignment problem: a trajectory may fail after many correct steps, or succeed despite flawed intermediate decisions.
VPR studies densely-verifiable agentic reasoning problems, where each intermediate action can be checked by a task-specific oracle. Instead of learning a noisy process reward model or estimating step values through extra rollouts, VPR uses task structure itself to provide reliable turn-level supervision.
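The contrast between outcome-only and turn-level credit can be sketched in a few lines. This is an illustrative toy, not the paper's training code: under a sparse outcome reward every turn inherits the same terminal signal, whereas dense process rewards credit each turn with its own verifier score.

```python
def outcome_only_credit(num_turns: int, success: bool) -> list[float]:
    """Sparse outcome reward: every turn receives the same terminal signal,
    so a trajectory that fails on its last move penalizes all earlier turns."""
    return [1.0 if success else 0.0] * num_turns


def dense_process_credit(verifier_scores: list[float]) -> list[float]:
    """Dense process reward: each turn is credited by its own verifier score."""
    return list(verifier_scores)


# A hypothetical 4-turn episode: three verifiably good moves, one fatal mistake.
scores = [1.0, 1.0, 1.0, -1.0]
print(outcome_only_credit(len(scores), success=False))  # [0.0, 0.0, 0.0, 0.0]
print(dense_process_credit(scores))                     # [1.0, 1.0, 1.0, -1.0]
```

Under outcome-only credit, the three correct moves are indistinguishable from the mistake; the dense scheme keeps that information.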
## Method
VPR converts sparse trajectory-level feedback into dense process rewards:
$$r_t^{\mathrm{VPR}} = V(s_t, a_t)$$

where $V$ is the task-specific verifier's score for action $a_t$ taken in state $s_t$.
For Minesweeper, VPR uses a posterior-based verifier. The oracle computes posterior mine probabilities and rewards actions that reveal provably safe cells or flag certain mines.
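As a concrete illustration of such a posterior-based verifier, a turn reward could be derived from posterior mine probabilities as below. The function name and the exact reward shaping are assumptions for illustration only, not the released implementation:

```python
def turn_reward(posterior: dict, action: tuple) -> float:
    """Illustrative turn-level reward from posterior mine probabilities.

    posterior: maps cell -> posterior probability that the cell is a mine,
               as computed by some solver over the visible board (assumed given).
    action:    ("reveal", cell) or ("flag", cell).
    """
    kind, cell = action
    p = posterior[cell]
    if kind == "reveal":
        # Revealing a provably safe cell (p == 0) is rewarded;
        # risky reveals are penalized in proportion to the mine probability.
        return 1.0 if p == 0.0 else -p
    if kind == "flag":
        # Only flagging a certain mine (p == 1) earns a reward.
        return 1.0 if p == 1.0 else 0.0
    return 0.0


posterior = {(0, 0): 0.0, (0, 1): 1.0, (1, 1): 0.4}
print(turn_reward(posterior, ("reveal", (0, 0))))  # 1.0  (provably safe)
print(turn_reward(posterior, ("flag", (0, 1))))    # 1.0  (certain mine)
print(turn_reward(posterior, ("reveal", (1, 1))))  # -0.4 (risky reveal)
```

The key property is that every intermediate action gets a verifiable score from the oracle, without training a separate process reward model.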
## Results

### In-Domain Minesweeper
| Method | Success Rate | Completion Rate |
|---|---|---|
| Base | 0.78 | 73.71 |
| OR | 3.91 | 77.26 |
| MC-PR | 2.34 | 78.67 |
| VPR | 10.39 | 80.27 |
### Zero-Shot Transfer
| Benchmark group | Metric | VPR-Minesweeper |
|---|---|---|
| General reasoning | Average pass@1 | 62.59 |
| General reasoning | GSM8K | 94.82 |
| General reasoning | MATH-500 | 85.00 |
| General reasoning | MMLU-Pro | 67.98 |
| Agentic reasoning | ALFWorld success rate | 28.61 |
| Agentic reasoning | WebShop score | 30.38 |
| Agentic reasoning | WebShop success rate | 1.93 |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nics-efc/VPR-Minesweeper"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Solve this step by step: If a train travels 180 miles in 3 hours, what is its average speed?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Intended Use
This checkpoint is intended for research on verifiable rewards, process supervision, reinforcement learning for LLM agents, and transfer from game-like agentic training environments to broader reasoning tasks.
The released checkpoint contains the trained language model. Environment simulators, verifiers, and training code are provided in the project repository.
## Citation

If you find this model helpful, please cite:

```bibtex
@misc{yuan2026verifiable,
  title={Verifiable Process Rewards for Agentic Reasoning},
  author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
  year={2026},
  eprint={2605.10325},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.10325}
}
```