# VPR-Minesweeper

🌐 Project Page | 📝 Paper | 💻 Code

## Model Description

This is the Minesweeper-trained checkpoint from *VPR: Verifiable Process Rewards for Agentic Reasoning*, initialized from Qwen3-4B.

VPR turns verifiable oracles into dense turn-level rewards for long-horizon agentic reasoning. This checkpoint is trained with posterior-based VPR on Minesweeper, where posterior mine probabilities provide step-level feedback under partial observability, rewarding reveals of provably safe cells and flags on certain mines.

## Overview

Reinforcement learning from verifiable rewards typically supervises only final success. In long-horizon agentic tasks, this creates a credit-assignment problem: a trajectory may fail after many correct steps, or succeed despite flawed intermediate decisions.

VPR studies densely-verifiable agentic reasoning problems, where each intermediate action can be checked by a task-specific oracle. Instead of learning a noisy process reward model or estimating step values through extra rollouts, VPR uses task structure itself to provide reliable turn-level supervision.

## Method

*Figure: overview of the VPR method.*

VPR converts sparse trajectory-level feedback into dense process rewards by scoring every turn with the task's verifier:

$$
r_t^{\mathrm{VPR}} = V(s_t, a_t)
$$

where $V$ is the task-specific oracle, $s_t$ the state at turn $t$, and $a_t$ the agent's action.

For Minesweeper, VPR uses a posterior-based verifier. The oracle computes posterior mine probabilities and rewards actions that reveal provably safe cells or flag certain mines.
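To make the posterior-based verifier concrete, here is a minimal, self-contained sketch. It is an illustrative reconstruction, not the released training code: names such as `posterior_mine_probs` and `vpr_turn_reward` are ours, and the posterior is computed by brute-force enumeration, which is only feasible on small boards.

```python
from itertools import combinations

def neighbors(cell):
    """All 8 cells adjacent to `cell` (no boundary clipping needed here,
    since out-of-range neighbors never appear in a mine set)."""
    r, c = cell
    return {(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)}

def posterior_mine_probs(clues, hidden, n_mines):
    """Exact posterior P(cell is a mine | clues) for each hidden cell,
    by enumerating every placement of `n_mines` mines over the hidden
    cells that is consistent with all revealed number clues."""
    hidden = list(hidden)
    counts = dict.fromkeys(hidden, 0)
    consistent = 0
    for mines in combinations(hidden, n_mines):
        mine_set = set(mines)
        if all(len(neighbors(cell) & mine_set) == clue
               for cell, clue in clues.items()):
            consistent += 1
            for cell in mine_set:
                counts[cell] += 1
    return {cell: counts[cell] / consistent for cell in hidden}

def vpr_turn_reward(clues, hidden, n_mines, kind, cell):
    """Dense turn-level reward from the posterior: +1 for revealing a
    provably safe cell or flagging a certain mine, -1 for revealing a
    certain mine, 0 when the posterior is ambiguous."""
    p = posterior_mine_probs(clues, hidden, n_mines)[cell]
    if kind == "reveal":
        return 1.0 if p == 0.0 else -1.0 if p == 1.0 else 0.0
    if kind == "flag":
        return 1.0 if p == 1.0 else 0.0
    return 0.0

# Tiny example: a 1x3 row whose middle cell shows "1"; one mine among the ends.
clues = {(0, 1): 1}
hidden = [(0, 0), (0, 2)]
print(posterior_mine_probs(clues, hidden, n_mines=1))       # both ends: p = 0.5
print(vpr_turn_reward(clues, hidden, 1, "reveal", (0, 0)))  # ambiguous -> 0.0
```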

## Results

### In-Domain Minesweeper

| Method | Success Rate (%) | Completion Rate (%) |
|--------|------------------|---------------------|
| Base   | 0.78             | 73.71               |
| OR     | 3.91             | 77.26               |
| MC-PR  | 2.34             | 78.67               |
| VPR    | **10.39**        | **80.27**           |

### Zero-Shot Transfer

| Benchmark group   | Metric                | VPR-Minesweeper |
|-------------------|-----------------------|-----------------|
| General reasoning | Average pass@1        | 62.59           |
| General reasoning | GSM8K                 | 94.82           |
| General reasoning | MATH-500              | 85.00           |
| General reasoning | MMLU-Pro              | 67.98           |
| Agentic reasoning | ALFWorld success rate | 28.61           |
| Agentic reasoning | WebShop score         | 30.38           |
| Agentic reasoning | WebShop success rate  | 1.93            |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nics-efc/VPR-Minesweeper"

# Load the checkpoint (Qwen3-4B architecture).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Solve this step by step: If a train travels 180 miles in 3 hours, what is its average speed?"}
]
# Apply the Qwen3 chat template with thinking mode enabled.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
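Since the checkpoint is initialized from Qwen3-4B, generations with `enable_thinking=True` place the reasoning inside a `<think>...</think>` span before the final answer. A minimal sketch for separating the two, assuming the Qwen3-style `</think>` delimiter survives decoding (in Qwen3 it is a regular vocabulary token, not a skipped special token):

```python
# Split Qwen3-style output into thinking content and final answer
# (assumes the literal </think> delimiter used by Qwen3 chat templates).
generated = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True
)
thinking, _, answer = generated.partition("</think>")
print(answer.strip())
```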

## Intended Use

This checkpoint is intended for research on verifiable rewards, process supervision, reinforcement learning for LLM agents, and transfer from game-like agentic training environments to broader reasoning tasks.

The released checkpoint contains the trained language model. Environment simulators, verifiers, and training code are provided in the project repository.

## Citation

If you find this model helpful, please cite:

```bibtex
@misc{yuan2026verifiable,
  title={Verifiable Process Rewards for Agentic Reasoning},
  author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
  year={2026},
  eprint={2605.10325},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.10325}
}
```