WebArbiter-7B

A principle-guided reasoning Process Reward Model for web agents

Published at ICLR 2026

Paper | Code | Website | Collection | Demo

Introduction

WebArbiter-7B is a 7B reasoning Process Reward Model (PRM) for web agents, built on Qwen2.5-7B-Instruct. Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

On WebPRMBench, WebArbiter-7B achieves an average Best-of-N (BoN) accuracy of 74.60%, outperforming GPT-5 by 9.1 points and the previous state-of-the-art WebPRM (WebShepherd-8B) by 31.3 points. In reward-guided trajectory search on WebArena-Lite, it surpasses WebShepherd-8B by up to 6.4 points in average success rate.

Highlights

  • Reasoning as reward: Generates structured <State>, <Criteria>, <Analysis>, and <Answer> outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
  • Principle-inducing evaluation: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
  • Two-stage training: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
  • Robust generalization: Best average Pair and BoN accuracy on WebPRMBench, with strong results in all four environments, including out-of-domain enterprise workflows (WorkArena) and open-world websites (AssistantBench).

Results on WebPRMBench

Models marked with ⋆ are ours. Bold = best overall.

| Model | Mind2Web Pair | Mind2Web BoN | WebArena Pair | WebArena BoN | AssistantBench Pair | AssistantBench BoN | WorkArena Pair | WorkArena BoN | Avg. Pair | Avg. BoN |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary LLM-as-judge | | | | | | | | | | |
| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| Claude-3.7-Sonnet | 80.20 | 57.90 | 82.80 | 64.10 | 81.50 | 61.30 | 82.10 | 60.60 | 81.65 | 60.98 |
| Gemini-2.5-Flash | 81.30 | 57.01 | 82.71 | 62.19 | 80.00 | 63.33 | 83.30 | 56.13 | 81.83 | 59.67 |
| DeepSeek-R1 | 81.62 | 57.37 | 82.04 | 60.21 | 78.49 | 56.18 | 84.12 | 63.89 | 81.57 | 59.41 |
| Open-source LLM-as-judge | | | | | | | | | | |
| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
| WebPRMs | | | | | | | | | | |
| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
| WebArbiter-7B ⋆ | 97.07 | 89.53 | 88.43 | 68.66 | 89.17 | 70.00 | 82.09 | 70.19 | 89.19 | 74.60 |
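
The card does not define the Pair and BoN columns. The sketch below assumes a common convention: each benchmark instance holds one correct action and several rejected ones, Pair accuracy is scored per comparison, and BoN accuracy counts an instance only when the correct action wins every comparison. The function names and the `outcomes` data are illustrative, not taken from the benchmark code. Under this assumption, BoN is the stricter metric, which would explain why it sits below Pair in every row.

```python
from typing import List

def pairwise_accuracy(pair_outcomes: List[List[bool]]) -> float:
    """Fraction of individual (correct vs. rejected) comparisons judged correctly."""
    flat = [ok for instance in pair_outcomes for ok in instance]
    return sum(flat) / len(flat)

def bon_accuracy(pair_outcomes: List[List[bool]]) -> float:
    """Fraction of instances where the correct action wins ALL of its comparisons."""
    return sum(all(instance) for instance in pair_outcomes) / len(pair_outcomes)

# Hypothetical outcomes: 3 instances, each with 4 comparisons against rejected actions.
outcomes = [
    [True, True, True, True],
    [True, True, False, True],
    [True, False, True, True],
]
print(f"Pair: {pairwise_accuracy(outcomes):.3f}")
print(f"BoN:  {bon_accuracy(outcomes):.3f}")
```

A single wrong comparison costs the whole instance under BoN but only one of four comparisons under Pair.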

Reward-Guided Trajectory Search (WebArena-Lite)

WebArbiter also excels as a practical reward signal for trajectory search. Using Best-of-5 sampling with a Knockout Tournament mechanism on WebArena-Lite:

| Policy | WebPRM | Shopping | CMS | Reddit | GitLab | MAP | Avg. | Δ |
|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | w/o Search | 21.74 | 22.86 | 19.05 | 34.38 | 19.35 | 23.48 | |
| GPT-4o-mini | GPT-4o-mini (as WebPRM) | 24.44 | 22.86 | 26.32 | 33.33 | 15.38 | 24.47 | +0.99 |
| GPT-4o-mini | WebShepherd-8B | 26.09 | 45.71 | 23.81 | 40.62 | 35.48 | 34.34 | +10.86 |
| GPT-4o-mini | WebArbiter-7B | 37.78 | 42.86 | 36.84 | 46.67 | 38.46 | 40.52 | +17.04 |
| GPT-4o | w/o Search | 23.91 | 31.43 | 28.57 | 56.25 | 19.35 | 31.90 | |
| GPT-4o | GPT-4o-mini (as WebPRM) | 26.67 | 37.14 | 42.11 | 40.00 | 19.23 | 33.03 | +1.13 |
| GPT-4o | WebShepherd-8B | 30.43 | 42.86 | 47.62 | 46.88 | 35.48 | 40.65 | +8.75 |
| GPT-4o | WebArbiter-7B | 44.44 | 42.86 | 52.63 | 56.67 | 38.46 | 47.01 | +15.11 |
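
The card names a Knockout Tournament over five sampled actions but does not describe the bracket. One plausible reading is single elimination driven by the model's pairwise verdict, sketched below. `knockout` and `prefer_first` are hypothetical names; in practice `prefer_first` would wrap a WebArbiter call and return True when the verdict is "Response 1". Here it is stubbed with a numeric comparison so the example runs without a GPU.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def knockout(candidates: List[T], prefer_first: Callable[[T, T], bool]) -> T:
    """Single-elimination tournament: pair up candidates, advance each winner,
    and give an unpaired last candidate a bye into the next round."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer_first(a, b) else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate advances without a match
        pool = next_round
    return pool[0]

# Stub judge (assumption): a higher number stands in for "better action".
calls = 0
def stub_judge(a: int, b: int) -> bool:
    global calls
    calls += 1
    return a >= b

best = knockout([3, 1, 4, 1, 5], stub_judge)
print(best, calls)
```

With N candidates this bracket needs N - 1 pairwise judgments (four for Best-of-5), compared with N(N - 1)/2 for a full round-robin.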

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

messages = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)

Example output:

<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
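
To use the verdict programmatically, the tagged sections have to be pulled out of the generated text. The following is a minimal sketch, assuming the output always follows the four-tag layout shown above and that the verdict reads "Response 1" or "Response 2"; `parse_verdict` and `preferred_index` are illustrative helpers, not part of the released code.

```python
import re
from typing import Dict, Optional

TAGS = ("State", "Criteria", "Analysis", "Answer")

def parse_verdict(response: str) -> Dict[str, Optional[str]]:
    """Extract each tagged section; a missing tag maps to None."""
    sections: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        sections[tag.lower()] = match.group(1).strip() if match else None
    return sections

def preferred_index(response: str) -> Optional[int]:
    """Return 1 or 2 for the preferred response, or None if the verdict is unparseable."""
    answer = parse_verdict(response)["answer"]
    if answer is None:
        return None
    match = re.fullmatch(r"Response\s*([12])", answer, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

example = (
    "<State>Search box visible.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the query.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(preferred_index(example))
```

Returning None instead of raising lets a caller treat a malformed generation as "no verdict" and retry or fall back.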

Training Details

| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
|---|---|---|
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | |
| Base Model | Qwen2.5-7B-Instruct | Stage 1 checkpoint |
| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 7e-6) |
| Framework | LLaMA-Factory | veRL |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
| Source Data | WebPRM Collection (~30k step-level preference pairs from Mind2Web) | WebPRM Collection (~30k step-level preference pairs from Mind2Web) |
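
Stage 2 is listed as GRPO with binary verifiable rewards. The sketch below shows what such a reward and the group-relative advantage could look like, assuming the reward is 1 when the `<Answer>` verdict matches the ground-truth preference and 0 otherwise, and that advantages are standardized within each sampled group as in the usual GRPO formulation. `binary_reward` and `group_advantages` are illustrative names; the actual veRL reward function and any format-based reward terms are not described in this card.

```python
import re
from statistics import mean, pstdev
from typing import List

def binary_reward(response: str, gold: int) -> float:
    """1.0 if the verdict names the ground-truth response (1 or 2), else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == gold else 0.0

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group: (r - mean) / (std + eps).
    Population std is used here; implementations differ on this detail."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts for one prompt whose gold verdict is Response 1.
rollouts = [
    "<Answer>Response 1</Answer>",
    "<Answer>Response 2</Answer>",
    "no verdict emitted",
    "<Answer>Response 1</Answer>",
]
rewards = [binary_reward(r, gold=1) for r in rollouts]
print(rewards)
print(group_advantages(rewards))
```

When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient, so only prompts the model gets right some of the time drive learning.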

Key training insights (from ablation studies in the paper):

  • Explicit principles are essential — removing them notably degrades performance, especially on out-of-domain environments.
  • Cold-start RL without reasoning distillation is unstable across environments.
  • Reasoning distillation provides stable discrimination, while RL acts as an amplifier that widens the margin between correct and incorrect judgments.

Intended Uses

WebArbiter-7B is designed to:

  • Evaluate web agent actions: Given a web state and two candidate actions, determine which better advances the user's task.
  • Guide trajectory search: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
  • Provide interpretable feedback: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.

Limitations

  • Text-only observations: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
  • English-only: Training and evaluation are conducted exclusively in English-language web environments.
  • Safe-action bias: The model may sometimes overvalue cautious actions (e.g., preferring a hover to a click) because the accessibility tree does not encode the effects of an interaction.
  • Element reference hallucination: When a candidate action's reasoning is strongly task-aligned, the model may trust that semantic signal over low-level verification of the element ID (bid), and so miss an incorrect element reference.

License

This model is released under Apache 2.0, following the base model Qwen2.5-7B-Instruct.

Related Resources

| Resource | Link |
|---|---|
| WebArbiter-8B-Qwen3 (strongest) | ZYao720/WebArbiter-8B-Qwen3 |
| WebArbiter-4B-Qwen3 | ZYao720/WebArbiter-4B-Qwen3 |
| WebArbiter-3B | ZYao720/WebArbiter-3B |
| WEBPRMBENCH (benchmark) | ZYao720/WEBPRMBENCH |
| Training Data | ZYao720/WebArbiter-Data |
| Search Trajectories | ZYao720/WebArbiter-Trajectories |

Citation

@misc{zhang2026ZYao720principleguidedreasoningprocess,
      title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents}, 
      author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
      year={2026},
      eprint={2601.21872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21872}, 
}