Let's Think in Two Steps:
Mitigating Agreement Bias in MLLMs with
Self-Grounded Verification

Anonymous Authors
Comparison between the baseline CoT verifier and our SVG method.

Our Self-Grounded Verification (SGV) method uses a two-step process to reduce agreement bias in MLLM verifiers.

Abstract

Verifiers — functions assigning rewards to agent behavior — have been key for AI progress in domains such as math and board games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is non-trivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior. This bias is pervasive across models, resilient to test-time scaling, and can impact several methods using MLLMs as evaluators. Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior. To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs' knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation. SGV operates in two steps: first, the MLLM is elicited to retrieve broad priors about task completion, independent of the data under evaluation. Then, conditioned on the self-generated priors, it reasons over and evaluates a candidate trajectory. Enhanced with SGV, MLLM verifiers show gains of up to 20 points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena — setting a new state of the art on the benchmark, surpassing the previous best by 48%.

Summary of findings.
Top: example of a task in VisualWebArena and a corresponding agent trajectory. Middle: an MLLM verifier validates clearly flawed agent behavior, generating reasoning to rationalize its incorrect judgment and reinforcing brittle behavior. Bottom: Self-Grounded Verification leads to more accurate verification and enables the agent to backtrack.

Problem Setting

Several breakthroughs in artificial intelligence can be viewed from the lens of search guided by verifiers — functions assigning rewards to agent behavior aligned with desired criteria. However, while domains such as mathematics, programming, and board games benefit from relatively well-defined criteria to evaluate agent behavior, such clarity diminishes in open-ended settings. Evaluation in such scenarios often requires nuanced criteria and reasoning over possibly long sequences of multimodal inputs. Multimodal large language models (MLLMs) emerge as a promising solution to bridge this gap.

We consider two digital agent environments — VisualWebArena and OSWorld — and the long-horizon robot manipulation suite robomimic. Together, they span over 1,200 tasks across varied domains, where evaluating agent behavior demands nuanced criteria and multimodal reasoning.

In the offline setting, we evaluate how often MLLM verifiers match oracle verifiers in evaluation of agent trajectories. In the online setting, we assess the ability of MLLM verifiers to evaluate and provide real-time feedback to guide agents towards task completion.

Task Examples

Agreement Bias

We identify a critical limitation that undermines the reliability of MLLMs as verifiers: a strong tendency to favor information presented in their context window, a phenomenon we call agreement bias. MLLM verifiers often validate flawed agent behavior and even produce chains-of-thought (CoT) to rationalize their incorrect judgments. This bias is pervasive across model families, and persists despite the use of established test-time scaling techniques, training for reasoning, and instructions with task-specific evaluation criteria.

We observe that agreement bias occurs despite models having relatively strong, and human-aligned priors on desired characteristics for agent behavior. This suggests a bottleneck in retrieval; while models possess the relevant knowledge for a downstream task, the way it is stored can hinder retrieval depending on the form the information is presented.

Self-Grounded Verification

To address agreement bias, we introduce Self-Grounded Verification (SGV). SGV operates in two steps: first, the model is elicited to retrieve a broad set of priors associated with successful task completion, conditioned only on minimal context to frame the task without access to the data under evaluation. In the second step, conditioned on its self-generated priors, the model then evaluates a candidate trajectory. SGV leads to more balanced and accurate verification across environments and policies, while introducing minimal token overhead and being easy to integrate into pipelines relying on MLLM verifiers.

Intuitively, by modulating unconditional and conditional generation, SGV harnesses MLLMs' own sampling mechanisms to enable more effective use of their world knowledge, alignment, and reasoning capabilities. By conditioning only on essential information in the first step, the model is induced to fully explore its high-dimensional probability distribution to retrieve information relevant to the task at hand and independent of the data under evaluation. In the second step, the model performs more focused generation, reasoning about the trajectory while grounded in its own trajectory-independent beliefs about what success looks like — effectively sampling from a conditional distribution induced by its own priors. We hypothesize that MLLMs, given their vast world knowledge and generalization abilities, can generate priors that are well-aligned with human notions of success, and that by decoupling inference in this manner, these priors can serve as an impartial baseline for grounding the verification.

Results

To demonstrate SGV's effectiveness and versatility, we evaluate its performance in two key applications of MLLM verifiers: automatic evaluation of agent trajectories, and online supervision to guide agents toward task completion.

Performance of MLLM verifiers on evaluation of web agent trajectories. (a) Agreement bias: models struggle to identify flawed behavior. (b) SGV improves performance across models.
Model (a) CoT-Based Verification (b) Self-Grounded Verification ∆ Acc.
TPR (%) TNR (%) Acc. (%) TPR (%) TNR (%) Acc. (%)
GPT-4.1 90 50 65 80 65 72 +7
GPT-4o 85 54 62 74 73 73 +11
Gemini 2.5 89 54 69 75 77 76 +8
Qwen2.5-VL-72B 82 54 62 74 68 70 +8
Llama 4 Maverick 82 54 62 74 64 67 +6
OpenAI o1 (high)* 77 66 68 74 74 74 +6
Gemini 2.5 (T-H)† 84 65 73 73 83 79 +6

* VisualWebArena shopping domain only. † Thinking enabled with max thinking budget (24,576 tokens).

On over 1,200 agent trajectories from VisualWebArena and OSWorld, SGV improves verification performance of several MLLMs and large reasoning models (LRMs), leading to gains of up to 17 percentage points (pp) in accuracy and up to 20 pp in true-negative identification, outperforming baselines leveraging established test-time scaling techniques and instructions with access to privileged information.

Task success rate (SR) and token usage across VisualWebArena domains and OSWorld. Token usage (for VisualWebArena) is relative to the ReAct agent with no verifier.
Method All VWA Classifieds Reddit Shopping OSWorld
SR Tokens SR Tokens SR Tokens SR Tokens SR
Search Agent* 29 4.9x 34 4.2x 22 5.4x 30 5.1x
R-MCTS* 34 9.3x 41 7.4x 29 9.7x 32 10.1x
Base Agent 46 1x 49 1x 29 1x 52 1x 21.4
+ Verifier, no SGV 46 1.5x 48 1.5x 29 1x 53 1x
+ Verifier, SGV 50 1.9x 52 1.9x 33 1.7x 57 1.9x 24.5

*Results from prior work.

Paired with SGV, our ReAct agent achieves a 5 pp gain (≈ 10% relative) on VisualWebArena, establishing a new state of the art by outperforming the previous best by 16 pp (≈ 48% relative), with minimal increase in token usage. Similarly, our verifiers guide the GUI-specialist UI-TARS-1.5 to improve by 3 pp (≈ 15% relative) in OSWorld, and a diffusion policy to outperform oracle supervision by 8 pp (≈ 33% relative) on robomimic's tool-hang task.

Example Trajectories

1. Example of the agent receiving valid feedback from the verifier and modifying its approach to correctly complete the task:

"Find me the cheapest controller from the classifieds site that is meant for the console in the image on the Reddit tab."
(VisualWebArena)

2. Example of the verifier guiding the agent toward a more robust strategy to complete the task:

"How many hours are on the engine of the most recently listed red boat?"
(VisualWebArena)