WebArbiter-7B
A principle-guided reasoning Process Reward Model for web agents
Published at ICLR 2026
Introduction
WebArbiter-7B is a 7B reasoning Process Reward Model (PRM) for web agents, built on Qwen2.5-7B-Instruct. Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.
On WebPRMBench, WebArbiter-7B achieves an average Best-of-N (BoN) accuracy of 74.60%, outperforming GPT-5 by 9.1 points and the previous state-of-the-art WebPRM, WebShepherd-8B, by 31 points. In reward-guided trajectory search on WebArena-Lite, it surpasses WebShepherd-8B by up to 6.4 points in average success rate.
Highlights
- Reasoning as reward: Generates structured <State>, <Criteria>, <Analysis>, and <Answer> outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
- Principle-inducing evaluation: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
- Two-stage training: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
- Robust generalization: SOTA performance across all four WebPRMBench environments, including out-of-domain enterprise workflows (WorkArena) and open-world websites (AssistantBench).
Results on WebPRMBench
Models marked with ⋆ are ours. Bold = best overall.
| Model | Mind2Web Pair | Mind2Web BoN | WebArena Pair | WebArena BoN | AssistantBench Pair | AssistantBench BoN | WorkArena Pair | WorkArena BoN | Avg. Pair | Avg. BoN |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary LLM-as-judge | ||||||||||
| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| Claude-3.7-Sonnet | 80.20 | 57.90 | 82.80 | 64.10 | 81.50 | 61.30 | 82.10 | 60.60 | 81.65 | 60.98 |
| Gemini-2.5-Flash | 81.30 | 57.01 | 82.71 | 62.19 | 80.00 | 63.33 | 83.30 | 56.13 | 81.83 | 59.67 |
| DeepSeek-R1 | 81.62 | 57.37 | 82.04 | 60.21 | 78.49 | 56.18 | 84.12 | 63.89 | 81.57 | 59.41 |
| Open-source LLM-as-judge | ||||||||||
| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
| WebPRMs | ||||||||||
| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
| ⋆ WebArbiter-7B | 97.07 | 89.53 | 88.43 | 68.66 | 89.17 | 70.00 | 82.09 | 70.19 | 89.19 | 74.60 |
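The gap between the Pair and BoN columns is easier to read with the two metrics side by side. The sketch below is a minimal illustration, assuming that Pair accuracy counts every chosen-vs-rejected comparison separately, while BoN accuracy counts an instance as correct only when the chosen action beats all of its rejected candidates. These definitions and the `outcomes` data are assumptions made for illustration; the benchmark's own evaluation script is the authority.

```python
from typing import List


def pairwise_accuracy(outcomes: List[List[bool]]) -> float:
    """Fraction of individual chosen-vs-rejected comparisons judged correctly."""
    flat = [ok for instance in outcomes for ok in instance]
    return sum(flat) / len(flat)


def bon_accuracy(outcomes: List[List[bool]]) -> float:
    """Fraction of instances in which every comparison was judged correctly."""
    return sum(all(instance) for instance in outcomes) / len(outcomes)


# Hypothetical results: 3 instances, each with 4 rejected candidates.
# True means the model preferred the chosen action in that comparison.
outcomes = [
    [True, True, True, True],
    [True, False, True, True],
    [True, True, True, False],
]

print(f"Pair: {pairwise_accuracy(outcomes):.2%}")  # Pair: 83.33%
print(f"BoN:  {bon_accuracy(outcomes):.2%}")       # BoN:  33.33%
```

Under these assumed definitions a single wrong comparison fails the whole instance, which is why BoN is the stricter of the two numbers.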
Reward-Guided Trajectory Search (WebArena-Lite)
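WebArbiter also excels as a practical reward signal for trajectory search. The table below reports success rates (%) on WebArena-Lite using Best-of-5 sampling with a Knockout Tournament mechanism; Δ is the gain in average success rate over the same policy without search.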
| Policy | WebPRM | Shopping | CMS | Reddit | GitLab | MAP | Avg. | Δ |
|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | w/o Search | 21.74 | 22.86 | 19.05 | 34.38 | 19.35 | 23.48 | — |
| GPT-4o-mini | GPT-4o-mini (as WebPRM) | 24.44 | 22.86 | 26.32 | 33.33 | 15.38 | 24.47 | +0.99 |
| GPT-4o-mini | WebShepherd-8B | 26.09 | 45.71 | 23.81 | 40.62 | 35.48 | 34.34 | +10.86 |
| GPT-4o-mini | WebArbiter-7B | 37.78 | 42.86 | 36.84 | 46.67 | 38.46 | 40.52 | +17.04 |
| GPT-4o | w/o Search | 23.91 | 31.43 | 28.57 | 56.25 | 19.35 | 31.90 | — |
| GPT-4o | GPT-4o-mini (as WebPRM) | 26.67 | 37.14 | 42.11 | 40.00 | 19.23 | 33.03 | +1.13 |
| GPT-4o | WebShepherd-8B | 30.43 | 42.86 | 47.62 | 46.88 | 35.48 | 40.65 | +8.75 |
| GPT-4o | WebArbiter-7B | 44.44 | 42.86 | 52.63 | 56.67 | 38.46 | 47.01 | +15.11 |
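A knockout tournament turns a pairwise judge into a selector over N candidates. The following is a minimal sketch of one possible bracket, not the paper's implementation: `knockout_tournament` and `prefer` are hypothetical names, neighbours are paired in list order, and an odd candidate out receives a bye. In practice `prefer` would wrap a WebArbiter call and return 1 or 2 from the `<Answer>` verdict; here a toy rule stands in for it.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def knockout_tournament(candidates: List[T], prefer: Callable[[T, T], int]) -> T:
    """Pick one winner through single-elimination pairwise comparisons.

    prefer(a, b) must return 1 if a is preferred and 2 if b is preferred.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Compare neighbours; the winner of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) == 1 else b)
        # With an odd pool, the last candidate advances without a comparison.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy stand-in for the reward model: prefer the longer string.
def toy_prefer(a: str, b: str) -> int:
    return 1 if len(a) >= len(b) else 2


best = knockout_tournament(["a", "bbb", "cc", "dddd", "e"], toy_prefer)
print(best)  # dddd
```

With five candidates this bracket needs four judge calls, one fewer than the number of candidates, instead of the ten a full round-robin would take.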
Quick Start
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses
messages = [{"role": "user", "content": user_prompt}]

# Apply the chat template and move the token ids to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)

# Greedy decoding (do_sample=False) keeps the verdict deterministic.
with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
Example output:
<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
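Downstream code usually needs the verdict as a value rather than as text. The sketch below shows one way to split a response into its four tagged sections and read the preference, assuming the output follows the tag layout of the example above. `parse_webarbiter_output` and `extract_verdict` are hypothetical helper names, not part of the released code.

```python
import re

# Section names taken from the example output above.
SECTIONS = ("State", "Criteria", "Analysis", "Answer")


def parse_webarbiter_output(text: str) -> dict:
    """Split a response into its tagged sections; missing sections map to None."""
    parsed = {}
    for name in SECTIONS:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        parsed[name] = match.group(1).strip() if match else None
    return parsed


def extract_verdict(text: str):
    """Return 1 or 2 for 'Response 1' / 'Response 2', or None if unparseable."""
    answer = parse_webarbiter_output(text)["Answer"]
    if answer is None:
        return None
    match = re.fullmatch(r"Response\s*([12])", answer)
    return int(match.group(1)) if match else None


example = (
    "<State>Search box visible.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the query.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(extract_verdict(example))  # 1
```

Returning `None` for a malformed response lets the caller decide whether to retry, skip the comparison, or fall back to a default.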
Training Details
| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
|---|---|---|
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | — |
| Base Model | Qwen2.5-7B-Instruct | Stage 1 checkpoint |
| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 7e-6) |
| Framework | LLaMA-Factory | veRL |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
| Source Data | WebPRM Collection (~30k step-level preference pairs from Mind2Web) |
Key training insights (from ablation studies in the paper):
- Explicit principles are essential — removing them notably degrades performance, especially on out-of-domain environments.
- Cold-start RL without reasoning distillation is unstable across environments.
- Reasoning distillation provides stable discrimination, while RL acts as an amplifier that widens the margin between correct and incorrect judgments.
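To make the Stage 2 signal concrete, the sketch below pairs a binary verifiable reward with the group-relative advantage that GRPO is built on. It is an illustration under stated assumptions rather than the training code: `binary_reward` and `group_relative_advantages` are hypothetical names, the reward is assumed to be 1 when the parsed `<Answer>` matches the ground-truth preference and 0 otherwise, and the population standard deviation is one of several normalization choices. The actual veRL configuration may differ.

```python
import re
from statistics import mean, pstdev
from typing import List


def binary_reward(response: str, correct: int) -> float:
    """1.0 if the final verdict names the ground-truth response, else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == correct else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one preference pair whose correct answer is Response 1.
rollouts = [
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 2</Answer>",
    "<Analysis>...</Analysis>",  # no verdict, so no reward
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
]
rewards = [binary_reward(r, correct=1) for r in rollouts]
print(rewards)  # [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient, so the signal comes only from prompts where the rollouts disagree.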
Intended Uses
WebArbiter-7B is designed to:
- Evaluate web agent actions: Given a web state and two candidate actions, determine which better advances the user's task.
- Guide trajectory search: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
- Provide interpretable feedback: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.
Limitations
- Text-only observations: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
- English-only: Training and evaluation are conducted exclusively in English-language web environments.
- Safe-action bias: The model may sometimes overvalue cautious actions (e.g., preferring a hover to a click) because the accessibility tree does not encode interaction effects.
- Element reference hallucination: When a candidate action's reasoning is strongly task-aligned, the model may trust the semantic signal over low-level verification of element IDs (bids), potentially missing incorrect element references.
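The last limitation concerns a check that is cheap to run outside the model. The sketch below illustrates what verifying an element reference could look like before a pair is sent for judging. It assumes AXTree lines of the form `[1] textbox 'Search'`, as in the example output above, and actions written as calls whose first quoted argument is the element id, such as `fill('1', 'weather')`. Both formats and all function names are assumptions for illustration.

```python
import re


def axtree_bids(axtree: str) -> set:
    """Collect element ids written as '[id]' in the AXTree text."""
    return set(re.findall(r"\[(\w+)\]", axtree))


def referenced_bids(action: str) -> set:
    """Collect ids passed as the first quoted argument of an action call.

    Assumption: any action whose first argument is a quoted word targets an
    element; actions with free-text first arguments would need special-casing.
    """
    return set(re.findall(r"\w+\(\s*['\"](\w+)['\"]", action))


def has_valid_references(action: str, axtree: str) -> bool:
    """True if every id the action references exists in the AXTree."""
    return referenced_bids(action) <= axtree_bids(axtree)


axtree = "[1] textbox 'Search'\n[2] button 'Search'"
print(has_valid_references("fill('1', 'weather')", axtree))  # True
print(has_valid_references("click('7')", axtree))            # False
```

A candidate that fails such a check could be filtered out or flagged before comparison, so the judgment does not rest on the model noticing the bad reference.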
License
This model is released under Apache 2.0, following the base model Qwen2.5-7B-Instruct.
Related Resources
| Resource | Link |
|---|---|
| WebArbiter-8B-Qwen3 (strongest) | ZYao720/WebArbiter-8B-Qwen3 |
| WebArbiter-4B-Qwen3 | ZYao720/WebArbiter-4B-Qwen3 |
| WebArbiter-3B | ZYao720/WebArbiter-3B |
| WEBPRMBENCH (benchmark) | ZYao720/WEBPRMBENCH |
| Training Data | ZYao720/WebArbiter-Data |
| Search Trajectories | ZYao720/WebArbiter-Trajectories |
Citation
@misc{zhang2026ZYao720principleguidedreasoningprocess,
title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
year={2026},
eprint={2601.21872},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.21872},
}