WebArbiter-7B

A principle-guided reasoning Process Reward Model for web agents

Published at ICLR 2026

Introduction

WebArbiter-7B is a 7B reasoning Process Reward Model (PRM) for web agents, built on Qwen2.5-7B-Instruct. Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

On WebPRMBench, WebArbiter-7B achieves an average BoN accuracy of 74.60%, outperforming GPT-5 (65.50%) by 9.1 points and the previous SOTA WebPRM, WebShepherd-8B (43.28%), by 31.3 points. In reward-guided trajectory search on WebArena-Lite, it surpasses WebShepherd-8B by up to 6.4 points in average success rate.

Highlights

Reasoning as reward: Generates structured <State>, <Criteria>, <Analysis>, and <Answer> outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
Principle-inducing evaluation: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
Two-stage training: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
Robust generalization: Best average Pair and BoN accuracy on WebPRMBench, and ahead of the previous SOTA WebPRM (WebShepherd-8B) in all four environments, including out-of-domain enterprise workflows (WorkArena) and open-world websites (AssistantBench).

Results on WebPRMBench

Models marked with ⋆ are ours. Bold = best overall.

Model	Mind2Web Pair	Mind2Web BoN	WebArena Pair	WebArena BoN	AssistantBench Pair	AssistantBench BoN	WorkArena Pair	WorkArena BoN	Avg. Pair	Avg. BoN
Proprietary LLM-as-judge
GPT-4o-mini	81.74	50.92	78.23	56.72	89.17	73.33	81.43	46.70	82.64	56.92
GPT-4o	79.99	52.62	84.58	66.67	85.83	66.67	84.33	55.19	83.68	60.29
GPT-5	80.86	62.39	84.83	71.64	81.67	63.33	81.14	64.62	82.13	65.50
Claude-3.7-Sonnet	80.20	57.90	82.80	64.10	81.50	61.30	82.10	60.60	81.65	60.98
Gemini-2.5-Flash	81.30	57.01	82.71	62.19	80.00	63.33	83.30	56.13	81.83	59.67
DeepSeek-R1	81.62	57.37	82.04	60.21	78.49	56.18	84.12	63.89	81.57	59.41
Open-source LLM-as-judge
Qwen2.5-7B-Instruct	77.79	39.18	74.88	42.79	84.17	53.33	77.58	35.85	77.61	42.78
Llama-3-70B-Instruct	80.55	49.36	77.36	50.75	85.83	70.00	79.08	40.09	80.71	52.55
WebPRMs
WebShepherd-8B	86.66	73.69	68.33	43.88	55.92	30.00	54.56	25.53	64.34	43.28
⋆ WebArbiter-7B	97.07	89.53	88.43	68.66	89.17	70.00	82.09	70.19	89.19	74.60
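
The gap between the Pair and BoN columns can be illustrated with a small sketch. It assumes each benchmark instance has one correct action and several rejected ones, that Pair accuracy scores every (correct, rejected) comparison separately, and that BoN accuracy credits an instance only when the correct action wins all of its comparisons. This reading of the two metrics is an assumption made for illustration; the paper has the exact definitions. The `outcomes` data below is made up.

```python
from typing import List

def pairwise_accuracy(outcomes: List[List[bool]]) -> float:
    """Fraction of individual (correct vs. rejected) comparisons judged correctly."""
    flat = [won for instance in outcomes for won in instance]
    return sum(flat) / len(flat)

def bon_accuracy(outcomes: List[List[bool]]) -> float:
    """Fraction of instances where the correct action won every comparison."""
    return sum(all(instance) for instance in outcomes) / len(outcomes)

# Toy data (illustrative only): 3 instances, 4 rejected candidates each.
# outcomes[i][j] is True when the judge preferred the correct action
# over the j-th rejected action of instance i.
outcomes = [
    [True, True, True, True],
    [True, True, False, True],
    [True, True, True, True],
]
print(pairwise_accuracy(outcomes))  # 11 of 12 comparisons
print(bon_accuracy(outcomes))       # 2 of 3 instances
```

Under these assumptions a single wrong comparison costs one pair but a whole instance, which is consistent with BoN being the lower number in every row of the table.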

Reward-Guided Trajectory Search (WebArena-Lite)

WebArbiter also excels as a practical reward signal for trajectory search. Using Best-of-5 sampling with a Knockout Tournament mechanism on WebArena-Lite:

Policy	WebPRM	Shopping	CMS	Reddit	GitLab	MAP	Avg.	Δ
GPT-4o-mini	w/o Search	21.74	22.86	19.05	34.38	19.35	23.48	—
GPT-4o-mini	GPT-4o-mini (as WebPRM)	24.44	22.86	26.32	33.33	15.38	24.47	+0.99
GPT-4o-mini	WebShepherd-8B	26.09	45.71	23.81	40.62	35.48	34.34	+10.86
GPT-4o-mini	WebArbiter-7B	37.78	42.86	36.84	46.67	38.46	40.52	+17.04
GPT-4o	w/o Search	23.91	31.43	28.57	56.25	19.35	31.90	—
GPT-4o	GPT-4o-mini (as WebPRM)	26.67	37.14	42.11	40.00	19.23	33.03	+1.13
GPT-4o	WebShepherd-8B	30.43	42.86	47.62	46.88	35.48	40.65	+8.75
GPT-4o	WebArbiter-7B	44.44	42.86	52.63	56.67	38.46	47.01	+15.11
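
A knockout tournament over sampled candidates can be sketched as a single-elimination bracket driven by a pairwise judge. The sketch below is one plausible reading, not the authors' implementation: the pairing order, the handling of byes, and the `prefer` callable (a hypothetical wrapper that would call WebArbiter and return 1 or 2) are all assumptions.

```python
from typing import Callable, List, Sequence

def knockout_select(candidates: Sequence[str], prefer: Callable[[str, str], int]) -> str:
    """Return the candidate that survives a single-elimination tournament.

    `prefer(a, b)` is a hypothetical judge wrapper: it returns 1 if the first
    action is preferred and 2 if the second is.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool: List[str] = list(candidates)
    while len(pool) > 1:
        next_round: List[str] = []
        # Pair up neighbours; with an odd pool the last candidate gets a bye.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) == 1 else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

# Toy judge for illustration only: prefers the shorter action string.
def toy_prefer(a: str, b: str) -> int:
    return 1 if len(a) <= len(b) else 2

actions = ["click('12')", "fill('3', 'laptop')", "scroll(0, 200)", "go_back()", "click('7')"]
best = knockout_select(actions, toy_prefer)
print(best)
```

A bracket like this needs N - 1 judge calls for N candidates (four calls for Best-of-5), compared with N(N - 1)/2 for a full round-robin.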

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

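messages = [{"role": "user", "content": user_prompt}]
# return_dict=True yields input_ids plus attention_mask, so generate() receives both.
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    # Greedy decoding keeps the verdict deterministic for a given prompt.
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Drop the prompt tokens and decode only the newly generated judgment.
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)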
print(response)

Example output:

<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
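
To use the verdict programmatically, the tagged sections can be pulled out with a regular expression. This is a minimal sketch that assumes the output follows the tag layout of the example above and that the answer is written as "Response 1" or "Response 2"; the helper names `parse_sections` and `parse_verdict` are illustrative and not part of any released API.

```python
import re
from typing import Dict, Optional

SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")

def parse_sections(response: str) -> Dict[str, str]:
    """Extract the text inside each <Tag>...</Tag> block that is present."""
    sections: Dict[str, str] = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            sections[tag] = match.group(1).strip()
    return sections

def parse_verdict(response: str) -> Optional[int]:
    """Return 1 or 2 for 'Response 1' / 'Response 2', or None if unparseable."""
    answer = parse_sections(response).get("Answer", "")
    match = re.fullmatch(r"Response\s*([12])", answer)
    return int(match.group(1)) if match else None

example = (
    "<State>Search box visible.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the query.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(parse_verdict(example))  # 1
```

Returning `None` for a missing or malformed `<Answer>` lets the caller decide whether to retry, skip the comparison, or fall back to a default.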

Training Details

	Stage 1: Reasoning Distillation	Stage 2: RLVR
Method	Supervised fine-tuning (SFT)	GRPO with binary verifiable rewards
Data	9,642 teacher-distilled examples	18,921 preference pairs
Teacher	o3	—
Base Model	Qwen2.5-7B-Instruct	Stage 1 checkpoint
Fine-tuning	LoRA (rank 128, lr 8e-4)	FSDP + LoRA (lr 7e-6)
Framework	LLaMA-Factory	veRL
Hardware	8 × NVIDIA A100-80GB	8 × NVIDIA A100-80GB
Source Data	WebPRM Collection (~30k step-level preference pairs from Mind2Web)
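
The Stage 2 signal can be sketched as follows, under stated assumptions. The binary reward is taken to be 1 when the parsed `<Answer>` verdict matches the ground-truth preference and 0 otherwise, and the advantage is the usual GRPO group normalization (reward minus group mean, divided by group standard deviation). The function names, the `eps` constant, and the use of the population standard deviation are illustrative choices; the actual veRL configuration may differ.

```python
import re
from statistics import mean, pstdev
from typing import List

def binary_reward(response: str, gold: int) -> float:
    """1.0 if the <Answer> verdict names the gold response, else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == gold else 0.0

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled rollouts for one prompt whose gold verdict is Response 1.
rollouts = [
    "<Answer>Response 1</Answer>",
    "<Answer>Response 2</Answer>",
    "<Answer>Response 1</Answer>",
    "no verdict emitted",
]
rewards = [binary_reward(r, gold=1) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

When every rollout in a group receives the same reward, the advantages are all zero and that prompt contributes no gradient, which is one reason a reasonably accurate Stage 1 checkpoint is a useful starting point.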

Key training insights (from ablation studies in the paper):

Explicit principles are essential — removing them notably degrades performance, especially on out-of-domain environments.
Cold-start RL without reasoning distillation is unstable across environments.
Reasoning distillation provides stable discrimination, while RL acts as an amplifier that widens the margin between correct and incorrect judgments.

Intended Uses

WebArbiter-7B is designed to:

Evaluate web agent actions: Given a web state and two candidate actions, determine which better advances the user's task.
Guide trajectory search: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
Provide interpretable feedback: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.

Limitations

Text-only observations: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
English-only: Training and evaluation are conducted exclusively in English-language web environments.
Safe-action bias: The model may sometimes overvalue cautious actions (e.g., preferring a hover to a click) because the accessibility tree does not encode interaction effects.
Element reference hallucination: When a candidate action's reasoning is strongly task-aligned, the model may trust the semantic signal over low-level bid verification, potentially missing incorrect element references.
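
One way to guard against the element reference issue is a cheap pre-check that the element id named in a candidate action actually appears in the accessibility tree. The sketch below assumes AXTree lines start with an id in square brackets (as in the example output above) and that actions pass the id as their first quoted argument, e.g. `click('12')`. Both formats are assumptions; actions with other argument conventions would need their own handling.

```python
import re
from typing import Optional, Set

def axtree_bids(axtree: str) -> Set[str]:
    """Collect element ids written as [id] at the start of each AXTree line."""
    return set(re.findall(r"^\s*\[(\w+)\]", axtree, flags=re.MULTILINE))

def action_bid(action: str) -> Optional[str]:
    """Return the first quoted argument of an action such as click('12')."""
    match = re.match(r"\s*\w+\(\s*['\"](\w+)['\"]", action)
    return match.group(1) if match else None

def references_valid_element(action: str, axtree: str) -> bool:
    """True if the action targets no element or targets one present in the tree."""
    bid = action_bid(action)
    return bid is None or bid in axtree_bids(axtree)

axtree = "[1] textbox 'Search'\n[2] button 'Search'"
print(references_valid_element("fill('1', 'weather')", axtree))  # True
print(references_valid_element("click('9')", axtree))            # False
```

Candidates that fail the check could be dropped before they reach the reward model, so the judgment is only made between actions that reference real elements.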

License

This model is released under Apache 2.0, following the base model Qwen2.5-7B-Instruct.

Related Resources

Resource	Link
WebArbiter-8B-Qwen3 (strongest)	ZYao720/WebArbiter-8B-Qwen3
WebArbiter-4B-Qwen3	ZYao720/WebArbiter-4B-Qwen3
WebArbiter-3B	ZYao720/WebArbiter-3B
WEBPRMBENCH (benchmark)	ZYao720/WEBPRMBENCH
Training Data	ZYao720/WebArbiter-Data
Search Trajectories	ZYao720/WebArbiter-Trajectories

Citation

@misc{zhang2026ZYao720principleguidedreasoningprocess,
      title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents}, 
      author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
      year={2026},
      eprint={2601.21872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21872}, 
}