Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

Abstract

Verifiers — functions assigning rewards to agent behavior — have been key for AI progress in domains such as math and board games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is non-trivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior. This bias is pervasive across models, resilient to test-time scaling, and can impact several methods using MLLMs as evaluators. Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior. To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs' knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation. SGV operates in two steps: first, the MLLM is elicited to retrieve broad priors about task completion, independent of the data under evaluation. Then, conditioned on the self-generated priors, it reasons over and evaluates a candidate trajectory. Enhanced with SGV, MLLM verifiers show gains of up to 20 points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena — setting a new state of the art on the benchmark, surpassing the previous best by 48%.

Top: example of a task in VisualWebArena and a corresponding agent trajectory. Middle: an MLLM verifier validates clearly flawed agent behavior, generating reasoning to rationalize its incorrect judgment and reinforcing brittle behavior. Bottom: Self-Grounded Verification leads to more accurate verification and enables the agent to backtrack.

Problem Setting

Several breakthroughs in artificial intelligence can be viewed from the lens of search guided by verifiers — functions assigning rewards to agent behavior aligned with desired criteria. However, while domains such as mathematics, programming, and board games benefit from relatively well-defined criteria to evaluate agent behavior, such clarity diminishes in open-ended settings. Evaluation in such scenarios often requires nuanced criteria and reasoning over possibly long sequences of multimodal inputs. Multimodal large language models (MLLMs) emerge as a promising solution to bridge this gap.

We consider two digital agent environments — VisualWebArena and OSWorld — and the long-horizon robot manipulation suite robomimic. Together, they span over 1,200 tasks across varied domains, where evaluating agent behavior demands nuanced criteria and multimodal reasoning.

In the offline setting, we evaluate how often MLLM verifiers match oracle verifiers in evaluation of agent trajectories. In the online setting, we assess the ability of MLLM verifiers to evaluate and provide real-time feedback to guide agents towards task completion.

Task Examples

"Show me the most expensive collectible historical USA coin that is under $2000." (VisualWebArena)

"I really like these shoes, but I am not satisfied with the color. Could you find me an item on this page for these shoes with a darker shade or color?" (VisualWebArena)

"Please set my slides upright instead of sideways." (OSWorld)

"My glasses are broken, and I'm having trouble seeing small things clearly. Could you help me enlarge the text on my screen so it's easier to read?" (OSWorld)

Agreement Bias

We identify a critical limitation that undermines the reliability of MLLMs as verifiers: a strong tendency to favor information presented in their context window, a phenomenon we call agreement bias. MLLM verifiers often validate flawed agent behavior and even produce chains-of-thought (CoT) to rationalize their incorrect judgments. This bias is pervasive across model families, and persists despite the use of established test-time scaling techniques, training for reasoning, and instructions with task-specific evaluation criteria.

We observe that agreement bias occurs despite models having relatively strong, and human-aligned priors on desired characteristics for agent behavior. This suggests a bottleneck in retrieval; while models possess the relevant knowledge for a downstream task, the way it is stored can hinder retrieval depending on the form the information is presented.

Self-Grounded Verification

To address agreement bias, we introduce Self-Grounded Verification (SGV). SGV operates in two steps: first, the model is elicited to retrieve a broad set of priors associated with successful task completion, conditioned only on minimal context to frame the task without access to the data under evaluation. In the second step, conditioned on its self-generated priors, the model then evaluates a candidate trajectory. SGV leads to more balanced and accurate verification across environments and policies, while introducing minimal token overhead and being easy to integrate into pipelines relying on MLLM verifiers.

Intuitively, by modulating unconditional and conditional generation, SGV harnesses MLLMs' own sampling mechanisms to enable more effective use of their world knowledge, alignment, and reasoning capabilities. By conditioning only on essential information in the first step, the model is induced to fully explore its high-dimensional probability distribution to retrieve information relevant to the task at hand and independent of the data under evaluation. In the second step, the model performs more focused generation, reasoning about the trajectory while grounded in its own trajectory-independent beliefs about what success looks like — effectively sampling from a conditional distribution induced by its own priors. We hypothesize that MLLMs, given their vast world knowledge and generalization abilities, can generate priors that are well-aligned with human notions of success, and that by decoupling inference in this manner, these priors can serve as an impartial baseline for grounding the verification.

Results

To demonstrate SGV's effectiveness and versatility, we evaluate its performance in two key applications of MLLM verifiers: automatic evaluation of agent trajectories, and online supervision to guide agents toward task completion.

Performance of MLLM verifiers on evaluation of web agent trajectories. (a) Agreement bias: models struggle to identify flawed behavior. (b) SGV improves performance across models.

Model	(a) CoT-Based Verification			(b) Self-Grounded Verification			∆ Acc.
Model	TPR (%)	TNR (%)	Acc. (%)	TPR (%)	TNR (%)	Acc. (%)	∆ Acc.
GPT-4.1	90	50	65	80	65	72	+7
GPT-4o	85	54	62	74	73	73	+11
Gemini 2.5	89	54	69	75	77	76	+8
Qwen2.5-VL-72B	82	54	62	74	68	70	+8
Llama 4 Maverick	82	54	62	74	64	67	+6
OpenAI o1 (high)*	77	66	68	74	74	74	+6
Gemini 2.5 (T-H)†	84	65	73	73	83	79	+6

* VisualWebArena shopping domain only. † Thinking enabled with max thinking budget (24,576 tokens).

On over 1,200 agent trajectories from VisualWebArena and OSWorld, SGV improves verification performance of several MLLMs and large reasoning models (LRMs), leading to gains of up to 17 percentage points (pp) in accuracy and up to 20 pp in true-negative identification, outperforming baselines leveraging established test-time scaling techniques and instructions with access to privileged information.

Task success rate (SR) and token usage across VisualWebArena domains and OSWorld. Token usage (for VisualWebArena) is relative to the ReAct agent with no verifier.

Method	All VWA		Classifieds		Reddit		Shopping		OSWorld
Method	SR	Tokens	SR	Tokens	SR	Tokens	SR	Tokens	SR
Search Agent*	29	4.9x	34	4.2x	22	5.4x	30	5.1x	–
R-MCTS*	34	9.3x	41	7.4x	29	9.7x	32	10.1x	–
Base Agent	46	1x	49	1x	29	1x	52	1x	21.4
+ Verifier, no SGV	46	1.5x	48	1.5x	29	1x	53	1x	–
+ Verifier, SGV	50	1.9x	52	1.9x	33	1.7x	57	1.9x	24.5

*Results from prior work.

Paired with SGV, our ReAct agent achieves a 5 pp gain (≈ 10% relative) on VisualWebArena, establishing a new state of the art by outperforming the previous best by 16 pp (≈ 48% relative), with minimal increase in token usage. Similarly, our verifiers guide the GUI-specialist UI-TARS-1.5 to improve by 3 pp (≈ 15% relative) in OSWorld, and a diffusion policy to outperform oracle supervision by 8 pp (≈ 33% relative) on robomimic's tool-hang task.

Example Trajectories

1. Example of the agent receiving valid feedback from the verifier and modifying its approach to correctly complete the task:

"Find me the cheapest controller from the classifieds site that is meant for the console in the image on the Reddit tab."
(VisualWebArena)

Agent:
tab_focus [1]

Agent:
type [5] [Nintendo 64 controller]

Agent:
click [32]

Agent:
stop

Verifier feedback:

EVALUATION: PARTIAL SUCCESS
FEEDBACK: You correctly identified the console and searched for the controller. However, you did not sort the search results by price to find the cheapest controller. Please go back to the search results page, sort the results by price (lowest first), and then click on the listing for the cheapest Nintendo 64 controller.

Agent:
go_back

Agent:
click [18]

Agent:
stop

Verifier feedback:

EVALUATION: SUCCESS
FEEDBACK: The assistant successfully completed the task by identifying the console, searching for the controller, sorting by price, and navigating to the page of the cheapest item.

2. Example of the verifier guiding the agent toward a more robust strategy to complete the task:

"How many hours are on the engine of the most recently listed red boat?"
(VisualWebArena)

Agent:
click [38]

Agent:
scroll [down]

Agent:
stop [80]

Verifier feedback:

EVALUATION: FAILURE
FEEDBACK: The assistant did not attempt to search for "red boat" as required by the objective. ... The assistant should go back to the home page and use the search bar to search for "red boat". After searching, the assistant should look for the most recently listed red boat and click on the listing to find the engine hours.

Agent:
go_back

Agent:
type [5] [red boat] [1]

Agent:
stop [80]

Verifier feedback:

EVALUATION: PARTIAL SUCCESS
FEEDBACK: The assistant successfully searched for "red boat". However, it stopped before clicking on any of the listings to find the engine hours. The assistant needs to click on the most recently listed red boat and then find the engine hours on that page. The most recently listed red boat is the one at the top of the list.

Let's Think in Two Steps:
Mitigating Agreement Bias in MLLMs with
Self-Grounded Verification

Our Self-Grounded Verification (SGV) method uses a two-step process to reduce agreement bias in MLLM verifiers.

Abstract

Problem Setting

Task Examples

"Show me the most expensive collectible historical USA coin that is under $2000." (VisualWebArena)

"I really like these shoes, but I am not satisfied with the color. Could you find me an item on this page for these shoes with a darker shade or color?" (VisualWebArena)

"Please set my slides upright instead of sideways." (OSWorld)

"My glasses are broken, and I'm having trouble seeing small things clearly. Could you help me enlarge the text on my screen so it's easier to read?" (OSWorld)

Agreement Bias

Self-Grounded Verification

Results

Example Trajectories

"Find me the cheapest controller from the classifieds site that is meant for the console in the image on the Reddit tab."
(VisualWebArena)

Agent:
tab_focus [1]

Agent:
type [5] [Nintendo 64 controller]

Agent:
click [32]

Agent:
stop

Verifier feedback:

Agent:
go_back

Agent:
click [18]

Agent:
stop

Verifier feedback:

EVALUATION: SUCCESS
FEEDBACK: The assistant successfully completed the task by identifying the console, searching for the controller, sorting by price, and navigating to the page of the cheapest item.

"How many hours are on the engine of the most recently listed red boat?"
(VisualWebArena)

Agent:
click [38]

Agent:
scroll [down]

Agent:
stop [80]

Verifier feedback:

Agent:
go_back

Agent:
type [5] [red boat] [1]

Agent:
stop [80]

Verifier feedback:

Agent:
click [38]

Agent:
stop [80]

Verifier feedback:

EVALUATION: SUCCESS
FEEDBACK: The assistant successfully navigated to the listing for the red boat and found the engine hours.

Let's Think in Two Steps:Mitigating Agreement Bias in MLLMs withSelf-Grounded Verification

Our Self-Grounded Verification (SGV) method uses a two-step process to reduce agreement bias in MLLM verifiers.

Abstract

Problem Setting

Task Examples

"Show me the most expensive collectible historical USA coin that is under $2000." (VisualWebArena)

"I really like these shoes, but I am not satisfied with the color. Could you find me an item on this page for these shoes with a darker shade or color?" (VisualWebArena)

"Please set my slides upright instead of sideways." (OSWorld)

"My glasses are broken, and I'm having trouble seeing small things clearly. Could you help me enlarge the text on my screen so it's easier to read?" (OSWorld)

Agreement Bias

Self-Grounded Verification

Results

Example Trajectories

"Find me the cheapest controller from the classifieds site that is meant for the console in the image on the Reddit tab." (VisualWebArena)

Agent:tab_focus [1]

Agent:type [5] [Nintendo 64 controller]

Agent:click [32]

Agent:stop

Verifier feedback:

Agent:go_back

Agent:click [18]

Agent:stop

Verifier feedback:

EVALUATION: SUCCESS FEEDBACK: The assistant successfully completed the task by identifying the console, searching for the controller, sorting by price, and navigating to the page of the cheapest item.

"How many hours are on the engine of the most recently listed red boat?" (VisualWebArena)

Agent:click [38]

Agent:scroll [down]

Agent:stop [80]

Verifier feedback:

Agent:go_back

Agent:type [5] [red boat] [1]

Agent:stop [80]

Verifier feedback:

Agent:click [38]

Agent:stop [80]

Verifier feedback:

EVALUATION: SUCCESS FEEDBACK: The assistant successfully navigated to the listing for the red boat and found the engine hours.

Let's Think in Two Steps:
Mitigating Agreement Bias in MLLMs with
Self-Grounded Verification

"Find me the cheapest controller from the classifieds site that is meant for the console in the image on the Reddit tab."
(VisualWebArena)

Agent:
tab_focus [1]

Agent:
type [5] [Nintendo 64 controller]

Agent:
click [32]

Agent:
stop

Agent:
go_back

Agent:
click [18]

Agent:
stop

EVALUATION: SUCCESS
FEEDBACK: The assistant successfully completed the task by identifying the console, searching for the controller, sorting by price, and navigating to the page of the cheapest item.

"How many hours are on the engine of the most recently listed red boat?"
(VisualWebArena)

Agent:
click [38]

Agent:
scroll [down]

Agent:
stop [80]

Agent:
go_back

Agent:
type [5] [red boat] [1]

Agent:
stop [80]

Agent:
click [38]

Agent:
stop [80]

EVALUATION: SUCCESS
FEEDBACK: The assistant successfully navigated to the listing for the red boat and found the engine hours.