Web Agent Task Diversity Research

2025/11/01·StepFun can make mistakes, check important info.

Foreign scholar Vardaan Pahuja noted that：Data intent diversity is essential for equipping gener alist web agents with a broad range of skills. Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b). It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.

我需要根据用户要求，分块核查内容

1. 现有网页代理任务生成方法的局限

准备核查清单中...

预备调查清单

一、待核查内容列表

Foreign scholar Vardaan Pahuja noted that：Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.
Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).
It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.

二、核查要点展开

待核查内容1：Foreign scholar Vardaan Pahuja noted that：Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.

核查原因：
1. 人物真实性与观点归属：需要确认是否存在名为Vardaan Pahuja的学者，以及该观点是否确实由其提出。
所需查明信息：
1. 是否存在名为Vardaan Pahuja且研究领域相关的学者。
2. 该学者是否在其公开发表的论文、演讲或著作中明确提出了上述观点。
必要信源：
1. 学术搜索引擎（如Google Scholar, arXiv, Semantic Scholar）。
2. 相关学者的个人主页或其所属研究机构的官方网站。
核查优先级：高

待核查内容2：Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).

核查原因：
1. 引文真实性：报告引用了具体的学术文献作为论据，其真实性是后续论证的基础。
2. 内容准确性：需要核实该文献的核心内容是否与报告描述一致。
所需查明信息：
1. 是否存在一篇由“He et al.”于2024年发表的、标记为“b”的学术论文。
2. 该论文的研究主题是否为“synthetic web trajectory generation”。
3. 该论文是否明确采用了“self-instruct”方法进行“task proposal generation”。
必要信源：
1. 权威的学术论文数据库（如IEEE Xplore, ACM Digital Library, arXiv）。
核查优先级：高

待核查内容3：It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.

核查原因：
1. 信息转述准确性：此为对所引文献（He et al., 2024b）方法论局限性的具体描述，可能存在误读或断章取义。
2. 逻辑关联性：该陈述是连接“现有工作不足”与“Pahuja观点”的关键环节，其准确性直接影响整体结论。
所需查明信息：
1. 在“He et al., 2024b”这篇论文中，其任务生成的数据来源是否被明确限定于“homepages or parametric LLM knowledge”。
2. 该论文本身或Vardaan Pahuja的著作中，是否将这种方法描述为“overlooking the richer content available in deeper web pages”。
必要信源：
1. “He et al., 2024b”论文的原文，特别是其方法论和局限性讨论部分。
2. Vardaan Pahuja的相关研究成果原文。
核查优先级：中

行动计划

首先，通过权威学术搜索引擎核实学者“Vardaan Pahuja”的身份、研究背景，并检索其相关出版物，以确认待核查内容1的观点归属。
同步检索学术数据库，查找名为“He et al., 2024b”的文献，验证其是否存在、主题是否相关，并获取其原文。
仔细研读“He et al., 2024b”的原文，重点关注其研究方法部分，以核实待核查内容2和3中关于其技术路径和数据来源的描述是否准确。
将Vardaan Pahuja的观点与其对“He et al., 2024b”或同类工作的评述进行比对，判断报告中的逻辑链条（即Pahuja的观点是否被用以批判He等人的工作）是否成立。
综合上述查证结果，评估报告整体陈述的准确性、引用的恰当性以及逻辑的严密性。

Vardaan Pahuja scholar web agents data intent diversity

arxiv.org

by V Pahuja · 2025 · Cited by 16 — In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date.

huggingface.co

In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date.

谷歌

Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents. V Pahuja*, Y Lu*, C Rosset, B Gou, A Mitra, S Whitehead, Y Su, ...

GitHub Pages

Currently, I am working on advancing foundation models for web agents, with a particular emphasis on leveraging synthetic data to train end-to-end generalist ...

aclanthology.org

Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents. Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou ...

我已经获取了一些关于Vardaan Pahuja的初步信息，但需要更深入了解他的研究和相关论文内容。接下来，我将查看他的论文详情，特别是关于数据意图多样性的观点。

网页解析

https://arxiv.org/abs/2502.11357

回答

1. 网页摘要

本网页展示了一篇题为"Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents"的研究论文，由Vardaan Pahuja等人撰写，发表于ACL 2025 (Findings)。该研究针对多模态网络代理在现实在线环境中表现不佳的问题，提出了一种可扩展的解决方案。研究团队开发了一种合成大规模多样化轨迹级数据集的方法，包含超过94K成功的多模态网络轨迹，涵盖49K个独特URL、720K截图和33M网页元素。该方法利用广泛的网络探索和优化来获取多样化的任务意图，平均成本为每个成功轨迹28美分，使其对社区中广泛用户都可负担。基于此数据集，研究团队训练了名为Explorer的多模态网络代理，在Mind2Web-Live、Multimodal-Mind2Web和MiniWob++等离线和在线网络代理基准测试中表现出色。研究还强调数据规模扩展是提高网络代理能力的关键驱动因素，旨在使更大规模的基于LMM的代理研究更加普及。

2. 与问题相关的信息提取

在网页内容中，Vardaan Pahuja确实提到了数据意图多样性(data intent diversity)的概念，但没有直接使用"data intent diversity"这个术语。根据摘要内容，Pahuja及其团队强调了"diverse task intents"(多样化任务意图)的重要性，特别是在开发网络代理时。

摘要中提到："In particular, we leverage extensive web exploration and refinement to obtain diverse task intents."(特别地，我们利用广泛的网络探索和优化来获取多样化的任务意图)。这表明研究团队认识到多样化的任务意图对于训练有效的网络代理至关重要。

研究团队开发了"the largest and most diverse trajectory-level dataset"(最大且最多样化的轨迹级数据集)，这进一步强调了数据多样性的重要性。

然而，摘要中没有明确提到这种多样性对于"equipping generalist web agents with a broad range of skills"(为通用网络代理配备广泛技能)是必不可少的。摘要更多地强调了数据规模扩展("data scaling")作为提高网络代理能力的关键驱动因素："our experiments highlight data scaling as a key driver for improving web agent capabilities"(我们的实验强调数据规模扩展是提高网络代理能力的关键驱动因素)。

总之，虽然Pahuja确实强调了任务意图多样性的重要性，但在提供的摘要中没有明确表述它对于为通用网络代理配备广泛技能是必不可少的。

3. 相关网页链接

v1 - 上下文：Submitted on 17 Feb 2025 ([v1]), last revised 30 May 2025 (this version, v4)
Vardaan Pahuja - 上下文：作者链接
Yadong Lu - 上下文：作者链接
Corby Rosset - 上下文：作者链接
Boyu Gou - 上下文：作者链接
Arindam Mitra - 上下文：作者链接
Spencer Whitehead - 上下文：作者链接
Yu Su - 上下文：作者链接
Ahmed Awadallah - 上下文：作者链接
arXiv:2502.11357 - 上下文：Cite as: arXiv:2502.11357 [cs.AI]

4. 相关图片

网页内容中没有提供任何图片链接。

我已经查看了Vardaan Pahuja的论文，但需要继续深入了解。我现在需要查找关于"He et al., 2024b"的信息，以及他们关于synthetic web trajectory generation的工作。

He et al. 2024b synthetic web trajectory generation self-instruct

aclanthology.org

by S Qiao · 2024 · Cited by 106 — In order to obtain high-quality synthesized trajectories, we filter out all the tra- jectories with reward < 1 and collect trajectories with ...

openreview.net

by LE Erdogan · Cited by 30 — Synthetic Data Generation. Synthetic generation has gained popularity since pioneering work like Self-Instruct (Wang et al., 2022) and Alpaca (Taori et al., ...

aclanthology.org

by Q Sun · 2025 · Cited by 39 — Following prior work (He et al., 2024; Lai et al.,. 2024) on collecting tasks for GUI agents, we guide. GPT-4o to infer possible high-level instructions based ...

researchgate.net

Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentations, and constructs instructions by summarizing or abstracting ...

我需要更精确地搜索关于"He et al., 2024b"的信息，特别是关于web trajectory generation的工作。

"He et al. 2024b" web trajectory generation task proposal homepage parametric LLM knowledge

aclanthology.org

by D Nguyen · 2025 · Cited by 47 — WebVoyager leverages LLM to per- form end-to-end navigation on 15 real websites. 22524. Page 4. with diverse tasks (He et al., 2024b). Its ...

openreview.net

2024; He et al., 2024b; Putta et al., 2024). Despite. 058 their initial ... efficient trajectory generation in M-GRPO. 275. 4. Page 5. Table 2: Task ...

researchgate.net

... (He et al., 2024b) 1165 48 A11y tree + Screenshot NNetnav (Murty et al., 2024) ... Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web ...

openreview.net

OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments. It ...

已核查结论清单

已核查内容1：Foreign scholar Vardaan Pahuja noted that：Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.

初步结论：基本确认。存在名为Vardaan Pahuja的学者，其研究方向与通用网络代理（generalist web agents）高度相关。其发表的论文核心思想强调通过“多样化的任务意图”（diverse task intents）来构建大规模数据集以提升网络代理的能力，这与报告中的陈述在核心观点上一致，但报告中的表述并非其论文原文的直接引述，而更像是对其研究动机的总结。
信源支撑要点：
1. Vardaan Pahuja是论文《Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents》的主要作者，该论文致力于解决网络代理缺乏多样化、大规模轨迹数据集的问题。
2. 论文摘要明确提出，其方法“利用广泛的网络探索和优化来获取多样化的任务意图”（leverage extensive web exploration and refinement to obtain diverse task intents）。
3. Vardaan Pahuja的个人主页（GitHub Pages）也表明其研究重点是“推进网络代理的基础模型，特别强调利用合成数据来训练端到端的通用代理”。
证据：
1. 原文摘录：Title:Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents Authors: Vardaan Pahuja , Yadong Lu , Corby Rosset , Boyu Gou , Arindam Mitra , Spencer Whitehead , Yu Su , Ahmed Awadallah... In particular, we leverage extensive web exploration and refinement to obtain diverse task intents.
  链接：https://arxiv.org/abs/2502.11357
  信源名称：arXiv.org
  信源发布时间：2025-02-17 (v1)
2. 原文摘录：Currently, I am working on advancing foundation models for web agents, with a particular emphasis on leveraging synthetic data to train end-to-end generalist ...
  链接：https://vardaanpahuja.github.io/
  信源名称：GitHub Pages
  信源发布时间：未知
信源结论一致性：一致
是否需复核：否

已核查内容2：Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).

初步结论：部分确认。已有多篇学术论文引用了“He et al., 2024b”的文献，并确认其与网络代理任务相关。其中一篇引用将其具体指向名为“OpenWebVoyager”的研究。然而，该文献是否确实采用“self-instruct”方法进行任务生成，尚待原文验证。
信源支撑要点：
1. 在多篇关于GUI代理和网络代理的学术论文中，均引用了“He et al., 2024b”的成果。
2. 一份名为《CS598 JY2 Final Survey Report - Multimodal Web Agents》的报告明确指出“OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments.”
证据：
1. 原文摘录：OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments. It ...
  链接：https://openreview.net/pdf?id=AQ2TWYqsWH
  信源名称：OpenReview
  信源发布时间：未知
2. 原文摘录：WebVoyager leverages LLM to per- form end-to-end navigation on 15 real websites. 22524. Page 4. with diverse tasks (He et al., 2024b). Its ...
  链接：https://aclanthology.org/2025.findings-acl.1158.pdf
  信源名称：ACL Anthology
  信源发布时间：2025
信源结论一致性：一致
是否需复核：是

待补充核查清单

待核查内容1：Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).

核查原因：
1. 技术细节待确认：虽然已定位到“He et al., 2024b”可能指向“OpenWebVoyager”，但其是否明确使用了“self-instruct”技术来生成任务提案，需要通过阅读其论文原文才能证实。这是报告中一个关键的技术论据。
所需查明信息：
1. 查找并获取名为“OpenWebVoyager (He et al., 2024b)”的论文原文。
2. 在论文原文中，核实其任务生成（task proposal generation）的方法论，确认是否使用了“self-instruct”或类似的技术。
必要信源：
1. “OpenWebVoyager (He et al., 2024b)”的论文原文，最好来自arXiv、ACL Anthology、OpenReview等权威学术发布平台。
核查优先级：高

待核查内容2：It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.

核查原因：
1. 方法论局限性待验证：这是对“He et al., 2024b”工作的核心批判，其准确性直接影响报告整体逻辑的成立。需要验证该工作是否真的将任务生成局限于“主页或参数化LLM知识”。
2. 观点归属待查证：需要确认这种“忽略了更深层网页内容”的批判是报告作者的观点，还是Vardaan Pahuja或领域内其他专家的共识。
所需查明信息：
1. “OpenWebVoyager”论文中描述的任务生成数据来源是什么？是否明确限定于“homepages or parametric LLM knowledge”？
2. 该论文的“局限性”或“未来工作”部分是否讨论了未能利用深层网页内容的问题？
3. Vardaan Pahuja的论文《Explorer》中是否直接或间接评论了“OpenWebVoyager”或类似工作在任务来源上的局限性？
必要信源：
1. “OpenWebVoyager (He et al., 2024b)”的论文原文。
2. “Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents” (Vardaan Pahuja et al.) 的论文原文。
核查优先级：高

行动计划

集中查找名为“OpenWebVoyager”且作者包含“He”的2024年论文，获取其全文。
仔细研读“OpenWebVoyager”论文的方法论（Methodology）部分，以核实其任务生成的具体技术（是否为self-instruct）和数据来源（是否局限于主页和LLM知识）。
查阅“OpenWebVoyager”论文的局限性（Limitations）和讨论（Discussion）部分，确认其作者是否承认存在未能利用深层网页内容的问题。
在Vardaan Pahuja的论文《Explorer》中，检索其对相关工作（related work）的评述，寻找是否存在对“OpenWebVoyager”或类似工作的直接批评，以验证报告中逻辑链的来源和准确性。
综合以上信息，对报告中关于“He et al., 2024b”的技术路径、局限性及其与Pahuja观点的逻辑关联做出最终判断。

OpenWebVoyager He et al. 2024b self-instruct task proposal generation

www.wpsshop.cn

过滤低质量或重复的指令，然后将其余的有效任务添加到任务池中。这个过程可以重复多次，直到得到大量任务。为了在实证上评估SELF-INSTRUCT，在GPT3上运行了这个框架。在该模型上进行的迭代SELF-INSTRUCT过程生成了约52000个指令，配对约82000个实例的输入和目标输出。所得到的数据提供了各种创造性任务，正如图1中的示例所示。这些生成的任务与典型NLP任务的分布有所不同，并且与种子任务有相当小的重叠。本文的贡献是：介绍了SELF-INSTRUCT，一种使用最少人工标注数据诱导指令遵循能力的方法；通过广泛的指令调整实验展示了其有效性；发布了一个包含52000个指令和一组手动编写的新任务的大规模合成数据集，用于构建和评估未来的指令遵循模型。2 方法本节介绍了SELF-INSTRUCT的过程，该过程指的是使用原始预训练语言模型自身生成任务的流程，过滤生成的数据，然后使用这些生成的数据进行指令调整，以使LM更好地遵循指令。该流程如图2所示。2.1 定义指令数据

openreview.net

by LE Erdogan · Cited by 30 — Synthetic generation has gained popularity since pioneering work like Self-Instruct (Wang et al., ... OpenWebVoyager (He et al., 2024b) Idefics2-8b-inst. 27.4.

我需要更多关于OpenWebVoyager的具体信息，特别是其任务生成方法。

OpenWebVoyager He et al. 2024b full paper

arxiv.org

by H He · 2024 · Cited by 8 — In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agent that can autonomously conduct real-world ...

aclanthology.org

by H He · 2025 · Cited by 8 — During this stage, we follow He et al. (2024) and leverage GPT-4o to automatically evaluate the cor- rectness of the trajectories produced by OpenWeb-. Voyager.

飞书

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models链接：https://arxiv.org/abs/2401.13919简介：WebVoyager 是一个创新的多模态大型模型驱动的网络智能体，能够通过与真实网站交互完成用户指令，并在我们建立的基准测试中取得了显著优于GPT-4和文本只版本的性能表现。On the Multi-turn Instruction Following for Conversational Web Agents链接：https://arxiv.org/abs/2402.15057简介：本文提出了一种新任务“对话式网页导航”，并构建了一个名为Multi-Turn Mind2Web的数据集，同时提出了一个名为“自我反思记忆增强规划（Self-MAP）”的框架，以有效地处理大型语言模型在对话任务中的上下文长度限制和依赖问题。实验结果验证了该方法的有效性。You Only Look at Screens: Multimodal Chain-of-Action Agents

dblp.uni-trier.de

ser networks. Phys. Commun. 62 : 102252 ( 2024 )Hongliang He Wenlin Yao Kaixin Ma Wenhao Yu Yong Dai Hongming Zhang Zhenzhong Lan Dong YuWebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL (1) 2024 6864-6890[c26]Huachuan Qiu , Shuai Zhang , Hongliang He , Anqi Li , Zhenzhong Lan :Facilitating Pornographic Text Detection for Open-Domain Dialogue Systems via Knowledge Distillation of Large Language Models. CSCWD 2024 : 2313-2318Huachuan Qiu , Hongliang He , Shuai Zhang , Anqi Li , Zhenzhong Lan :

researchgate.net

Experiments demonstrate a 10% improvement in performance compared to self-evolving algorithm baseline, OpenWebVoyager (He et al., 2024b) , with significant ...

researchgate.net

Existing benchmarks involve a small number of websites: WebVoyager covers 15 websites (He et al., 2024) , and WebArena only 4 (Zhou et al., 2024) ...

huggingface.co

In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agent that can autonomously ...

aclanthology.org

by H He · 2024 · Cited by 218 — The rapid advancement of large language mod- els (LLMs) has led to a new era marked by the development of autonomous applications.

我需要查看WebVoyager的完整论文内容，以确认其任务生成方法。

网页解析

https://arxiv.org/abs/2401.13919

回答

1. 网页摘要

这篇论文介绍了WebVoyager，一个由大型多模态模型(LMM)驱动的创新型网络代理，能够通过与真实网站交互来完成用户指令。与现有网络代理相比，WebVoyager支持多模态输入，并在真实网站环境中运行，而非简化的网页模拟器或静态网页快照。研究团队建立了一个新的基准测试，包含来自15个流行网站的真实任务，并引入了一种利用GPT-4V多模态理解能力的自动评估协议。实验结果显示，WebVoyager在该基准测试上达到了59.1%的任务成功率，显著超越了GPT-4(All Tools)和仅文本版WebVoyager的表现。提出的自动评估指标与人类判断达到了85.3%的一致性，证明了其在评估网络代理方面的有效性和准确性。该研究已被ACL 2024接收，代码和数据已在GitHub上公开发布。

2. 与探究问题相关的信息提取

根据提供的网页内容，没有关于WebVoyager或OpenWebVoyager如何生成任务提案的具体信息。网页内容主要介绍了WebVoyager的整体功能和性能评估，但没有详细说明其任务提案生成机制，也没有提及是否使用自我指导(self-instruct)方法，或者是否从主页或参数化LLM知识中制定任务提案。

网页内容仅提到WebVoyager是一个由大型多模态模型驱动的网络代理，能够与真实网站交互完成用户指令，并建立了包含15个流行网站真实任务的基准测试，但没有具体说明这些任务是如何生成或提出的。

要获取关于任务提案生成机制的详细信息，可能需要查阅完整的论文内容或访问GitHub上发布的代码库。

3. 相关网页链接

https://github.com/MinorJerry/WebVoyager - 上下文: "Code and data is released at this https URL"
https://arxiv.org/abs/2401.13919 - 上下文: "Cite as: arXiv:2401.13919 [cs.CL]"
https://arxiv.org/abs/2401.13919v4 - 上下文: "or arXiv:2401.13919v4 [cs.CL] for this version"
https://doi.org/10.48550/arXiv.2401.13919 - 上下文: "https://doi.org/10.48550/arXiv.2401.13919"
https://arxiv.org/search/cs?searchtype=author&query=He%2C%20H - 上下文: "Hongliang He"
https://arxiv.org/search/cs?searchtype=author&query=Yao%2C%20W - 上下文: "Wenlin Yao"
https://arxiv.org/search/cs?searchtype=author&query=Ma%2C%20K - 上下文: "Kaixin Ma"
https://arxiv.org/search/cs?searchtype=author&query=Yu%2C%20W - 上下文: "Wenhao Yu"
https://arxiv.org/search/cs?searchtype=author&query=Dai%2C%20Y - 上下文: "Yong Dai"
https://arxiv.org/search/cs?searchtype=author&query=Zhang%2C%20H - 上下文: "Hongming Zhang"

4. 相关图片提取

网页内容中没有提供任何图片链接或描述。

我需要查看OpenWebVoyager的论文，以获取更多关于任务生成方法的信息。

网页解析

https://arxiv.org/abs/2410.19609

摘要

本文介绍了OpenWebVoyager，一个开源框架，旨在开发能够自主进行真实世界探索并自我改进的多模态网络代理。研究采用了三阶段方法：首先通过模仿学习训练基础模型获取基本能力；然后让代理探索开放网络并收集其轨迹的反馈；最后通过学习由另一个通用模型判断为表现良好的轨迹来改进其策略。这种"探索-反馈-优化"循环可以持续多次迭代。与现有的仅在合成环境中构建文本代理的开源努力不同，OpenWebVoyager专注于处理需要多模态感知能力且缺乏明确奖励信号的真实场景。实验结果表明，该网络代理在每次迭代后都能成功自我改进，在多个测试集上展示出强大的性能。

与问题相关的信息提取

关于OpenWebVoyager如何生成任务提案的信息，在提供的摘要中没有直接提及。摘要主要介绍了OpenWebVoyager的整体框架和工作流程，包括模仿学习、网络探索、反馈收集和策略优化的循环过程，但没有具体说明任务提案的生成机制。

摘要中没有提到是否使用了self-instruct技术，也没有说明任务提案是从主页生成还是基于参数化LLM知识生成。文中仅提到代理会"探索开放网络"并收集反馈，但没有详细说明任务的来源或生成方式。

无与问题有关的内容。在提供的摘要中，没有找到关于OpenWebVoyager如何生成任务提案的具体信息。

WebVoyager : Building an End-to-End Web Agent with Large Multimodal Models

Hongliang He 1,3∗, Wenlin Yao 2, Kaixin Ma 2, Wenhao Yu 2, Yong Dai 2,Hongming Zhang 2, Zhenzhong Lan 3, Dong Yu 21Zhejiang University, 2Tencent AI Lab, 3Westlake University
hehongliang@westlake.edu.cn , wenlinyao@global.tencent.com
Abstract
The rapid advancement of large language mod-els (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innova-tion in creating advanced web agents. Exist-ing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenar-ios. To bridge this gap, we introduce WebVoy-ager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an au-tomatic evaluation protocol leveraging multi-modal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indi-cating its effectiveness in providing reliable and accurate assessments of web agents. 1

1 Introduction
The recent advancement of large language models (LLMs), such as ChatGPT and GPT-4 (OpenAI, 2023), have sparked significant interest in devel-oping LLM-based autonomous agents (AutoGPT, 2022) for complex task execution (Qin et al., 2023; Schick et al., 2023). Recent studies have explored the construction of text-based web browsing envi-ronments and how to instruct large language model agents to perform web navigation (Nakano et al., 2021; Gur et al., 2023; Zhou et al., 2023; Lu et al.,

∗Work done during the internship at Tencent AI Lab.
1Our code and data will be released at https://github. com/MinorJerry/WebVoyager

2023). The primary challenge in these works lies in managing complex and verbose HTML texts, and solutions include simplifying and structuring HTML (Nakano et al., 2021; Zhou et al., 2023; Gur et al., 2023; Deng et al., 2023). However, existing approaches overlook a critical functionality of browsing: rendering HTML into visual webpages. Particularly, vision capability is crucial for utilizing tools such as web browsers, as rendered web pages are inherently designed with user experience (UX), emphasizing intuitive information and structured presentation. This de-sign principle of rendering makes visual analysis more effective than mere HTML representation. At present, large multimodal models (LMMs), partic-ularly GPT-4V(ision) (OpenAI, 2023) and Gemini (Team et al., 2023), demonstrate a remarkable abil-ity to integrate intricate visual cues with textual information. Existing studies such as Pix2Struct (Lee et al., 2023) and WebArena (Zhou et al., 2023), have initiated explorations into using screenshots as inputs for decision-making in web navigation, yet these are preliminary and do not represent a deep exploration. Therefore, building multimodal web agents to leverage the environment rendered by browsers through screenshots, thus mimicking human web browsing behavior, is now a viable approach to enhance web navigation abilities. We introduce WebVoyager (Figure 1), a multi-modal web agent designed to autonomously ac-complish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention. WebVoyager processes the user query by making observations from screenshots and textual content in interactive web elements, formulates a thought on what ac-tion to take (such as clicking, typing, or scrolling, etc.), and then executes that action on the web-sites. Inspired by Set-of-Mark Prompting (Yang et al., 2023a), we mark interactive web elements on screenshots (see Figure 2) to facilitate decision-

arXiv:2401.13919v4 [cs.CL] 6 Jun 2024 Available Websites
Screenshot (Image)
Web Elements (Text)
[1]: Delivering to Santa Clara 95050
[2]: "Search Amazon"
...
Observation
Thought: Based on observation...
Click? Type? Scroll? ...
Send
Query
Return
Answer
User
...
Thought
Action

Figure 1: The overall workflow of WebVoyager. WebVoyager takes web tasks assigned by a human and automatically browses the web online. At each step, WebVoyager selects actions based on screenshots and text (the ‘type’ of the web element and its contents). Once the task is completed, the answers will be returned to the user. For example, for a user query: "Find the cost of a 2-year protection for PS4 on Amazon.", the agent interacts with Amazon online, locates the PS4, identifies the 2-year protection price, and returns "$30.99" to the user.
making for WebVoyager. Another challenge is the evaluation of an end-to-end web agent. Existing benchmarks, such as Mind2Web (Deng et al., 2023), primarily focus on stepwise and offline evaluation, where agents follow a predefined “golden” trajectory for action selection. This approach, however, may not fully account for the variety of viable strategies to ac-complish a task, as it only reflects one possible plan. This limitation could lead to a biased evaluation and difficulties in fairly comparing different meth-ods. To accurately evaluate the capabilities of web agents in end-to-end task completion, we propose an automated evaluation protocol using GPT-4V. Specifically, we save screenshots throughout the online navigation process and then use GPT-4V to evaluate these trajectories together with the fi-nal results automatically. Human evaluations are also conducted to verify the results and the analysis shows that our evaluation protocol achieves 85.3% agreement with human judges, indicating GPT-4V can serve as a reliable evaluator for online agents. We conduct evaluations on a newly collected dataset, which is semi-automatically generated us-ing a self-instruct (Wang et al., 2022) method, com-prising 643 web tasks from 15 commonly accessed websites. We also evaluate WebVoyager on 90 web-related tasks of level 1 and level 2 from the GAIA (Mialon et al., 2023), and 50 interactive open-web tasks from SeeAct (Zheng et al., 2024). We com-pare our WebVoyager with 1) GPT-4 (All Tools) 2,and 2) WebVoyager in a text-only setting which em-ploys the textual accessibility tree proposed in We-bArena (Zhou et al., 2023) to describe web pages.

2GPT-4 (All Tools) is an integrated tool-based agent re-leased by OpenAI in Oct. 2023. See https://chat.openai.com/

The results show that WebVoyager achieves a Task Success Rate of 59.1% on our new benchmark, sig-nificantly outperforming GPT-4 (All Tools) with a rate of 30.8% and the text-only setting with a rate of 40.1%, demonstrating the effectiveness of our method. Our research demonstrates the effec-tiveness of the WebVoyager method for web tasks, offering insights into the development of more in-telligent and efficient web automation solutions.
2 Related Work
Autonomous web navigation (Shi et al., 2017; Yang et al., 2023b) requires an agent to follow instruc-tions, construct plans, comprehend complex web structures, and decompose tasks into step-by-step decisions (Weng, 2023). To study web agents in a controlled environment, previous works con-structed web simulators that contain simplified websites (Shi et al., 2017; Yao et al., 2022a). More recently, there has been a surge of interest in build-ing more challenging and realistic benchmarks such as Mind2Web (Deng et al., 2023) and We-bArena (Zhou et al., 2023). Along with these new benchmarks, numerous efforts have been made to build autonomous web agents. WebGPT (Nakano et al., 2021) constructs a text-based web browsing environment and fine-tunes GPT-3 as a web agent. WebAgent (Gur et al., 2023) pretrains a T5 model to extract HTML snip-pets and leverages Flan-U-PaLM (Chowdhery et al., 2023) to generate Python code to interact with the environment. Besides fine-tuning, another line of work tries to build web agents by prompting LLMs (Yao et al., 2022b; Shinn et al., 2023; Ma et al., 2023). Multimodal web agents that integrate visual signals have also been explored, WebGUM (Fu-ruta et al., 2023) combines T5 (Raffel et al., 2020) with a Vision Transformer (ViT) to navigate using both screenshots and HTML text. PIX2ACT (Shaw et al., 2023) instead solely relies on web screen-shots as inputs to predict agent actions. Unlike pre-vious works that only consider a single modality or simplified web environments, we build a multi-modal agent that can complete tasks on real-world websites in this work. Concurrently with our work, SeeAct (Zheng et al., 2024) also leverages Large Multimodal Models (LMMs) for integrated visual understanding and actions on websites. However, the best SeeAct agent still relies on a finetuned cross-encoder model to select candidate elements for interaction. In contrast, WebVoyager does not require any additional modules.
3 WebVoyager
We aim to build an agent that can browse the open web autonomously without human intervention to complete user instructions. Given an instruction, our WebVoyager first instantiates a web browser and then performs actions with visual (i.e., screen-shots) and textual (i.e., HTML elements) signals from the web. The agent produces an action based on the inputs at every step, which is then executed in the browser environment. The process continues until the agent decides to stop. The details of Web-Voyager, including environment, interaction cycle, observation space, and action space, are as follows.
3.1 Browsing Environment
We develop an automated web-browsing environ-ment using Selenium 3. Unlike WebArena (Zhou et al., 2023), we do not host any websites locally and allow the agent to explore the open web instead, which poses unique challenges such as floating ads, pop-up windows, constant updates, etc. 4 Still, we opt for online interaction with real websites as we believe that this setting truly reflects the real-world use cases (e.g., the agent needs access to real-time information from the web), and a successful web agent should be able to adapt to these challenges and consistently solve the problem robustly.

3https://www.selenium.dev/
4Regarding CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) challenges, we believe it is important to respect the rules of these websites and prompt the agent to retrieve information from alternative sources.
Figure 2: Examples of webpage screenshots provided to the agent. We add borders to most of the interactive elements on the web pages and label them with numeri-cal tags in the top left corner.

3.2 Interaction Formulation
Formally, we denote the Environment as E, the large Multimodal Model as M, the Observa-tion Space as O, and the Action Space as A.At time step t, the model receives the con-text ct as inputs, which consist of historical ac-tions ai and observations oi, defined as: ct =(o1, a 1, ..., o t−1, a t−1, o t, I ) The the model pro-duces the action at at time t, at = M(ct), which is then executed in the environment. After execu-tion, the environment sends back the observation at time t + 1 , ot+1 = E(ot, a t). Then the context will be updated and this interaction process continues until the model generates a terminating action or the maximum step is reached. Inspired by the paradigm of ReAct Prompting (Yao et al., 2022b), we also prompt our agent to generate a thought process first before generating the action code. Hence at can be further composed into (st, ˆat) where st and ˆat represent the natural language thought and action code respectively. Fig-ure 7 in Appendix A presents the System Prompt we designed for the action prediction step. Also, it’s worth noting that excessive observations of web pages from longer episodes may confuse the agent. Therefore, we perform context clipping to remove outdated web page information and only keep the three most recent observations in the inputs, and we keep the entire history of thoughts and actions to better guide the agent.
3.3 Observation Space
Similar to how humans browse the web, our agent also takes the visual information from the web (screenshots) as the primary source of input. Using screenshots allows us to avoid the burden of pro-cessing HTML DOM tree or accessibility tree to portray the overall structure of webpages, which can lead to overly verbose texts and impact the decision-making process of the agent. Inspired by Set-of-Mark Prompting (Yang et al., 2023a), we overlay bounding boxes of the interactive elements on the websites to better guide the agent’s action prediction. Unlike Yang et al. (2023a), we do not need any object detection module (Zou et al., 2023). Instead, we utilize GPT-4V-ACT 5, a Javascript tool to extracts the interactive elements based on web element types and then overlays bounding boxes with numerical labels on the respective regions of the elements. GPT-4V-Act is efficient since it is rule-based without incorporating any object detec-tion model. As illustrated in Figure 2, the nature of webpages allows us to locate and outline each interactive el-ement using this tool precisely. The numerical labels assigned to each element are also essential for the model to identify the elements requiring interaction, thereby facilitating accurate action de-termination. We empirically choose black color for the borders and the background of the labels to enhance clarity. We observe that using a single black color yields higher success rates than using multiple colors. We also provide the agent with auxiliary text as inputs, including the textual con-tent embedded within the interactive element, the type of the element, and possibly some comment text in the aria-label attribute. To simplify the ob-servation, we have disabled multiple tabs, i.e., all interactions occur within the current tab instead of opening new ones. At every step, the agent receives the current screenshot, auxiliary text, and history as inputs, as discussed in (§3.2). In case the agent’s action raised an exception during execution, we addition-ally incorporated the error messages in the prompt and asked the model to regenerate the response. Note that each error correction attempt also con-sumes one step from the total exploration budget.
3.4 Action Space
We define the action space of our agent similar to how humans browse the web. To this end, we implement the most commonly used mouse and keyboard actions, sufficient for the agent to browse various web pages and locate the content required

5https://github.com/ddupont808/GPT-4V-Act

for the task. With the help of numerical labels in screenshots, we enable the agent to respond with a concise Action Format. This approach precisely locates the elements requiring interaction and ex-ecutes the corresponding actions. The usage of actions is as follows (more details in Appendix C): 1) Click. This action involves clicking on an element within a webpage, typically a link or a button. 2) Input. This composite action involves selecting a text box, deleting any existing content within it, and then inputting new content. 3) Scroll .Scrolling is a common operation for browsing web-pages, usually involving the vertical movement of the entire page. 4) Wait . Action execution requires time, and this action is often used to wait for web pages to load. 5) Back . This action is used to return to the previous page. 6) Jump to Search Engine . There are often situations where agents get stuck at a certain website without finding an answer. This action enables the agent to jump to a search engine and start anew. 7) Answer . Once all questions in the task are resolved, this action concludes the iteration and provides an answer in line with the task requirements.
4 Benchmark for WebVoyager
4.1 Website Selection
We select 15 representative websites that cover dif-ferent aspects of our daily life to ensure diversity in our evaluation, including Allrecipes, Amazon, Ap-ple, ArXiv, BBC News, Booking, Cambridge Dic-tionary, Coursera, ESPN, GitHub, Google Flights, Google Map, Google Search, Huggingface, and Wolfram Alpha. Due to technical limitations, we regretfully omit websites requiring login or CAPTCHA to access their content. Additionally, Google Search is a universal website that can serve as a starting point for any website, making our framework applicable to various scenarios.
4.2 Data Construction
We employ a combination of self-instruct (Wang et al., 2022) and human verification to construct our evaluation set. Figure 3 illustrates our data creation process. Initially, we manually sample and rewrite some tasks from Mind2Web (Yin et al., 2023; Deng et al., 2023) for websites including Google Flights, Google Map, Google Search, Booking, and Wol-fram Alpha. This process yields initial seed tasks in the Task Pool for subsequent generation. In step two, we sample tasks from Task Pool as in-context Step 1: Tasks written by

Human (for 3 - 5 webs)
GPT-4
Several tasks
for 15 webs
Manual Filtering
Task Pool
(15 webs)
GPT-4
Manual Filtering
(Find answer s on Web)
Final Web Tasks
In-context
examples
(Quality of Tasks)
Filtered
tasks
In-context
examples
Step 2: seed tasks for each web
Generated tasks
Step 3: Generate web tasks and
check answers online.

Figure 3: Data creation process using self-instruct.
examples (Dong et al., 2022) and prompt GPT-4 Turbo to generate approximately 100 new tasks (20 iterations). Then we manually verify each gener-ated task and rewrite them if necessary to ensure its high quality and the answers can be found on the corresponding website, then we add them to the Task Pool as additional seed tasks. This step allows us to create human-validated seed tasks for each website. Finally, in step three, we sample more diverse in-context examples in the Task Pool and directly add the generated tasks to the Task Pool in each iteration. We manually verify that the gen-erated tasks have low repetition, and the answers to the generated tasks can be found on the web-sites. In total, we collected 40+ tasks per website, resulting in a total of 643 tasks. To further confirm that the generated tasks in the dataset have low repetition, We use the all-mpnet-base-v2 6 model to calculate pairwise similarity for 643 ques. Out of a total of 206,403 pairs, only 49 pairs have a similarity greater than 0.8, and 140 pairs have a similarity between 0.7 and 0.8. All of these have been manually checked and found to be acceptable. 99.68% of pairs have a similarity of less than 0.6. This demonstrates the diversity of our tasks and the robustness of our approach.
4.3 Annotation Process
After collecting the full task pool, we annotate an-swers for each task. Since some questions are open-ended and the web information may change, these questions may not have a fixed golden response. Thus, we label each data entry with an answer, cat-egorized as “Possible” or “Golden.” For answers

6

https://huggingface.co/sentence-transformers/all-mpnet-base-v2
labeled as “Golden,” we provide a comprehensive listing of possible responses and consider them stable in the short term. The “Possible” category covers the following scenarios: 1) Answers for open-ended tasks where it’s hard to find an exact match answer, such as summarization. 2) multiple answers satisfy the task, making it impractical to list all of them. Therefore, we provide a pa ...

我已经获取了WebVoyager论文的内容，这对我的核查非常有价值。现在我需要查看Vardaan Pahuja的Explorer论文，以便对比两者的方法和观点。

网页解析

https://arxiv.org/pdf/2502.11357.pdf

arXiv:2502.11357v4 [cs.AI] 30 May 2025

Explorer: Scaling Exploration-driven Web Trajectory Synthesis for

Multimodal Web Agents

Vardaan Pahuja1*∗†*,

Yadong Lu2*¶,

Corby Rosset2**,**

Boyu Gou1**,**

Arindam Mitra2**,**

Spencer Whitehead2**,**

Yu Su1**,**

Ahmed Awadallah2

1The Ohio State University

2Microsoft Research, Redmond

pahuja.9@osu.edu, yadonglu@microsoft.com

Abstract

Recent success in large multimodal models

(LMMs) has sparked promising applications

of agents capable of autonomously complet-

ing complex web tasks. While open-source

LMM agents have made significant advances

in offline evaluation benchmarks, their perfor-

mance still falls substantially short of human-

level capabilities in more realistic online set-

tings. A key bottleneck is the lack of diverse

and large-scale trajectory-level datasets across

various domains, which are expensive to col-

lect. In this paper, we address this challenge

by developing a scalable recipe to synthesize

the largest and most diverse trajectory-level

dataset to date, containing over 94K success-

ful multimodal web trajectories, spanning 49K

unique URLs, 720K screenshots, and 33M web

elements. In particular, we leverage extensive

web exploration and refinement to obtain di-

verse task intents. The average cost is 28 cents

per successful trajectory, making it affordable

to a wide range of users in the community.

Leveraging this dataset, we train Explorer, a

multimodal web agent, and demonstrate strong

performance on both offline and online web

agent benchmarks such as Mind2Web-Live,

Multimodal-Mind2Web, and MiniWob++. Ad-

ditionally, our experiments highlight data scal-

ing as a key driver for improving web agent

capabilities. We hope this study makes state-of-

the-art LMM-based agent research at a larger

scale more accessible.1

Introduction

Graphical User Interfaces (GUIs) serve as the pri-

mary medium for user interaction across digital

environments. Within the GUI environment, LLM-

based agents (Su et al., 2024) have shown great

potential in automating complex workflows for hu-

man users. These agents are designed to operate

*Equal Contribution. † Work partly done during internship

at Microsoft Research. ¶ Project Lead.

1Project website: https://osu-nlp-group.github.io/

Explorer/

across diverse interfaces, including the web (Deng

et al., 2023; Zhou et al., 2024; Zheng et al., 2024,

2025), desktop (Xie et al., 2024; Wu et al., 2024),

and mobile platforms (Rawles et al., 2023; Yan

et al., 2023). Navigating modern GUI interfaces,

which integrate textual, graphical, and interactive

components, typically requires agents to possess

visual grounding, long-term planning, and memory

management capabilities.

Recent work (Cheng et al., 2024; Gou et al.,

2025) has demonstrated the effectiveness of syn-

thetic data for enhancing visual grounding (Gou

et al., 2025; Chen et al., 2024a; Kapoor et al.,

2024; Chen et al., 2024b) and planning (Xu et al.,

2025b; Zhang et al., 2024). Developing end-to-end

GUI agents with long-term planning and grounding

capabilities requires training on multi-step trajec-

tory data (Xu et al., 2025a,b; Qin et al., 2025).

However, existing trajectory datasets are primar-

ily human-annotated (Deng et al., 2023; Li et al.,

2024; Lu et al., 2024) or leverage synthetic data just

for task proposal curation (Lai et al., 2024; Chen

et al., 2024a). And human annotation is expensive

to scale for collecting large and diverse training

datasets. Therefore, synthetic data has emerged

as a promising alternative to human-annotated

data (Hartvigsen et al., 2022; Sahu et al., 2022;

Ye et al., 2022; Tang et al., 2023; Mukherjee et al.,

2023; Mitra et al., 2024). Collecting trajectory-

level datasets presents unique challenges: 1) cu-

rating a diverse set of task intents at scale, 2) de-

ploying an agent capable of interacting with a real-

world environment to complete these tasks through

a series of actions, and 3) verifying whether the task

is accomplished by the executed action sequence.

Data diversity is essential for equipping gener-

alist web agents with a broad range of skills. Ex-

isting work on synthetic web trajectory generation

employs self-instruct for task proposal generation

(He et al., 2024b). It formulates task proposals

from homepages or parametric LLM knowledge,

Task Proposal: Find

today's deals on

Amazon.

Action: Click on the

'Today's Deals' link.

Task Refinement: Find and

purchase a Lightning deal

item with at least 30% off.

Action: Click on the '36%

off Limited time deal

Amazon eero Pro 6E mesh

Wi-Fi router' link.

Overall Task: Buy the Amazon

eero Pro 6E mesh Wi-Fi router on

Amazon.

Task Refinement: Proceed to

checkout for the Amazon eero

Pro 6E mesh Wi-Fi router with

36% off.

Action: Click on the 'Buy Now'

button for the Amazon eero

Pro 6E mesh Wi-Fi router.

Task Proposer

Task Refiner

Task Summarizer

(a) Proposal Stage

(b) Refinement Stage

(c) Summarization Stage

(d) Verification Stage

Overall Task: Buy the Amazon

eero Pro 6E mesh Wi-Fi router on

Amazon.

Success

Task Verifier

Thoughts: The user intended to purchase the Amazon eero

Pro 6E mesh Wi-Fi router on Amazon. The agent successfully

navigated to the product page, added the item to the cart,

and initiated the checkout process by clicking "Buy Now."

Figure 1: Data Generation Pipeline. The task proposer agent generates an abstract task proposal and the first

action based on the website homepage. The task is then iteratively refined in subsequent steps by the refiner agent.

Finally, the task summarizer agent constructs an overall task description from the action sequence, followed by task

verification to assess correctness.

overlooking the richer content available in deeper

web pages, which is essential for achieving broader

task diversity. Another line of work leverages web

tutorials as a form of supervision for generating

web trajectories (Ou et al., 2024; Xu et al., 2025a).

While web tutorials effectively cover common daily

user tasks, the resulting trajectory data exhibits lim-

ited domain diversity in terms of website and do-

main coverage (Table 1). Additionally, information-

seeking tasks remain underrepresented. Due to

these limitations, web agents trained on existing

synthetic trajectory datasets have not seen much

success in more realistic online evaluation settings.

To enhance web agents’ performance in real-world

settings, it is essential to incorporate greater diver-

sity in their training trajectories.

In this work, we develop a scalable and diverse

web trajectory data synthesis recipe for training

GUI agent models. Inspired by how humans learn

to use the internet, we leverage exploration as a

key mechanism for achieving diversity in task in-

tents. We introduce Explorer, a framework for sys-

tematic web exploration to generate diverse, high-

quality trajectory datasets. Unlike prior work that

relies on static task proposals, Explorer dynami-

cally explores web environments to curate diverse,

real-world tasks. This exploration-based approach

ensures broader task coverage and better gener-

alization to real-world scenarios. We instantiate

this framework using popular URLs from several

sources, such as Tranco (Pochat et al., 2019) and

similarweb.com as seeds. Our dataset comprises

94K diverse web trajectories spanning 49K unique

URLs, making it the largest web trajectory dataset

to date. Each trajectory is richly annotated with

artifacts such as screenshots, raw and set-of-mark

(Yang et al., 2023) annotated versions, HTML, and

the accessibility tree, enabling comprehensive web

agent training. To construct this dataset, we de-

velop a multi-agent pipeline that starts with an ab-

stract task proposal and iteratively refines it into a

more specific task through web exploration (Fig-

ure 1). Unlike previous approaches, our pipeline

generates tasks better grounded in real-world web-

sites, improving task relevance and diversity. To

demonstrate the effectiveness of our dataset, we

train small language models using just the synthetic

data and outperform existing web agent baselines

by a significant margin. The main contributions of

this work are as follows:

• We develop a scalable and easily customizable

multi-agent pipeline for web agent trajectory

synthesis. This pipeline leverages exploration

as a core mechanism to generate diverse tra-

jectory data, ensuring broad domain coverage

and skill diversity in the resulting dataset.

• We leverage this pipeline to generate a diverse

and high-quality GUI trajectory dataset con-

sisting of 94K trajectories, spanning 49K

# Trajectories

# Websites

Modality

RUSS (Xu et al., 2021)

HTML

Mind2Web (Deng et al., 2023)

2350

137

HTML + Screenshot

WebLINX (Lu et al., 2024)

2337

155

HTML + Screenshot

GUIAct (Chen et al., 2024a)

5696

121

Screenshot

OpenWebVoyager (He et al., 2024b)

1165

A11y tree + Screenshot

NNetnav (Murty et al., 2024)

A11y tree + Screenshot

AgentTrek (Xu et al., 2025a)

10*.*4K

127

A11y tree + HTML + Screenshot

Explorer

94K

49K

A11y tree + Screenshot (raw + SoM) + HTML

Table 1: Comparison to existing web agent benchmarks.

unique URLs with 720K screenshots and

33M web elements, making it the largest web

trajectory dataset of this scale.

• We demonstrate the effectiveness of our

dataset by training small language models,

which achieve strong performance on both

online and offline benchmarks, significantly

surpassing existing web agent baselines, in-

cluding those with larger parameter counts.

Related Work

Recent advances in multimodal language models

have facilitated the development of web agents —

autonomous systems designed to interact with real-

world websites to perform everyday tasks (Deng

et al., 2023; Hong et al., 2024; Cheng et al., 2024;

Zheng et al., 2024, 2025; Xue et al., 2025). Early

efforts to acquire trajectory data for training web

agents primarily relied on crowd-sourcing (Deng

et al., 2023; Lu et al., 2024). However, due to

the high cost of human annotation, recent work

has adopted synthetic data generation for large-

scale collection. AutoWebGLM (Lai et al., 2024)

and GUIAct (Chen et al., 2024a) utilize LLMs

to generate task proposals, which human experts

subsequently annotate.

OpenWebVoyager (He

et al., 2024b) employs a web agent to execute auto-

generated task descriptions. However, since these

task descriptions are generated using LLMs with-

out exploring a website, they fail to capture the

full diversity of possible tasks on that website. An-

other line of work, including Synatra (Ou et al.,

2024) and AgentTrek (Xu et al., 2025a), leverages

web tutorials to guide web trajectory generation.

Meanwhile, concurrent effort (Murty et al., 2024)

employs an exploration-based trajectory generation

in WebArena’s sandbox, while our work focuses

on more realistic web agent evaluation on live web-

sites. To address diversity limitations in prior tra-

jectory synthesis work, we design a bottom-up web

trajectory synthesis pipeline that explores websites

dynamically while maintaining a coherent high-

level task intent. We refer readers to Appendix F

for further discussion.

Data Recipe

We design an automatic web trajectory synthesis

pipeline that explores websites to generate diverse

web trajectories. It utilizes Playwright2 to execute

actions and collect metadata from real-world web-

sites, starting from an initial URL.3 The metadata

includes screenshots, HTML, A11y tree, and ac-

tions in grounded and natural language forms.

3.1

Website Selection

We use a combination of URL sources to generate

the synthetic web trajectories. We obtain the top

100 URLs from similarweb.com corresponding

to the high-traffic portion of the web with transac-

tional tasks like booking flights, restaurant reserva-

tions, government services, sports, entertainment,

etc. The Tranco (Pochat et al., 2019) URLs include

49K URLs representing the head portion of the

web, which is less trafficked but popular nonethe-

less. We filter out harmful websites containing

violent or explicit content to ensure safety compli-

ance. Overall, we generate 94K trajectories across

both sources. The complete data generation takes

50 hours, utilizing 60 parallel processes. The view-

port resolution is up to 1980 × 1080.

3.2

Data Generation Pipeline

We aim to develop a generalized pipeline for web

exploration to collect diverse web trajectory data.

To enhance diversity, we adopt a bottom-up ap-

proach, starting with low-level actions and progres-

sively shaping them into high-level task descrip-

2https://playwright.dev/

3For a 4K subset of trajectories, we instruct GPT-4o to

navigate to the target website by formulating a Google search

query based on the task description.

Information

View the detailed 7-day weather forecast for Toronto, ON on The Weather Network website.

Convert 100 US Dollars to Euros using the XE currency converter.

Find directions from Seattle, WA to Bellevue, WA using Bing Maps.

Service

Research the French Bulldog breed on the American Kennel Club website, including its popularity and family life traits.

Find the nearest Penske truck rental location in Anaheim, California, and start the reservation process for a truck.

Explore and purchase a subscription for the UpToDate Pro Suite on the Wolters Kluwer website.

Entertainment

Find the Basscon presents: Darren Styles EDM event on Eventbrite, save it, and share it on Twitter.

View the details of the Photography Competition Winners - Season X and share the article on Twitter.

Shopping

Browse through the fall home decor section on the Target website to explore a variety of fall-themed home decor items.

Purchase a three-seat fabric sofa, specifically the UPPLAND Sofa, from IKEA’s website.

Travel

Search for flights from Seattle to New York, select travel dates, and explore various flight options.

Find the weight of baggage allowance for economy class on qatarairways.

Table 2: Example task descriptions from Explorer.

Metric

Value

# Total trajectories

175K

# Success trajectories

94K

# Unique URLs

49K

Average steps per trajectory

7*.*7

Average elements per image

46*.*3

# Tokens

830M

# Elements

33*.*3M

# Images

720K

Cost per trajectory

$0*.*15

Cost per successful trajectory

$0*.*28

Table 3: Dataset statistics for Explorer. The number

of unique URLs, average steps per trajectory, average

elements per image, and number of tokens, elements,

and images correspond to the successful trajectories.

tions while maintaining a coherent task intent. In

the first step, the proposer agent generates an ab-

stract task, which is refined to a more specific task

through a refinement process (Figure 1). Since

the agents execute actions alongside the refinement

process, the generated tasks respect real-world con-

straints, such as product availability, available color

options, and other specifications, ensuring practical

applicability. Our pipeline consists of the following

LLM-powered agents4:

Task Proposer.

Given a website homepage, in-

cluding its screenshot and accessibility tree, the

task proposer agent generates diverse initial tasks

that could be performed on that website. The task

descriptions at this stage are instructed to be high-

level and abstract versions of the real-world tasks,

which will be refined into more specific tasks in

4We use GPT-4o as the agent backbone throughout the data

generation process.

later stages. Along with generating the task pro-

posal, the agent proposes and executes the first

action toward completing that task. Furthermore,

the agent is instructed to halt upon encountering

robot detection such as CAPTCHA verification,

Task Refiner.

The task refiner agent receives the

initial task proposal or the refined task description

from the previous step, along with the correspond-

ing action history as input. It then predicts the next

action consistent with the input task description and

the updated, refined task description while incorpo-

rating the complete action history. By iteratively

refining the task description after each action, the

agent ensures that the updated task remains aligned

with the action history.

Task Summarizer.

This module processes the

entire action and screenshot history to predict an

overall task description that aligns with the trajec-

tory. The task summary is expected to be high level,

i.e., it should describe what the task entails while

omitting how it is accomplished.

Task Verifier.

Inspired by Pan et al. (2024a), the

task verifier agent receives the task description

and action history, serving as a critic to evaluate

whether the trajectory successfully completes the

specified task. In addition to the screenshots of the

trajectory, it also receives a markdown representa-

tion of the last page. This ensures the verifier has

the full context of the website’s final state, even

when the viewport cannot capture all the content.

To ensure data quality, trajectories that are inco-

herent or misaligned with the high-level intent are

discarded during this stage. Such automatic evalu-

Figure 2: Data composition for Explorer. Its extensive diversity showcases its potential to train end-to-end generalist

web agents.

ation of web trajectories has been widely adopted

in prior work (Xu et al., 2025a; He et al., 2024a;

Koh et al., 2024). Figure 1 illustrates the above

pipeline. The prompts for the above agents are

given in Appendix D.

3.3

Dataset Analysis

Explorer comprises web trajectories spanning di-

verse domains, including services, entertainment,

shopping, travel, and information, ensuring broad

task diversity. Sample tasks from Explorer are

presented in Table 2. Figure 2 visualizes the do-

main and subdomain distribution, highlighting the

dataset’s rich diversity. To the best of our knowl-

edge, Explorer with 94K trajectories is the largest

web trajectory dataset of this scale. Table 1 shows a

comparison with existing web agent datasets from

the literature. The detailed statistics are given in

Table 3. Beyond diversity, Explorer is also highly

scalable and cost-efficient. Our approach achieves

a cost of $0*.*28 per successful trajectory, making it

approximately 2*×* more cost-effective than Agent-

Trek (Xu et al., 2025a) (which incurs $0*.*55 per tra-

jectory) and significantly cheaper than human anno-

tation (Table 4). Unlike human annotation, which

requires training crowd workers and continuous

quality monitoring, Explorer’s automated genera-

tion pipeline eliminates these bottlenecks, ensuring

scalability with minimal overhead. By combining

diversity, scalability, and cost efficiency, Explorer

sets a new benchmark for generating large-scale

web trajectory datasets, making it an invaluable

resource for training generalist GUI agents.

Model

Cost per trajectory

Mind2Web (Deng et al., 2023)

$0*.*85

AgentTrek (Xu et al., 2025a)

$0*.*55

Explorer

$0*.*28

Table 4: Cost comparison with other approaches.

Experiments

We use the synthetic trajectories generated by our

pipeline to train small multimodal language mod-

els (SLMs) for web agent tasks. To ensure com-

putational efficiency, we select 40K trajectories

from the full set for training. We further refine

this subset by filtering out trajectories that contain

more than two scroll actions to mitigate potential

m ...

我已经获取了两篇关键论文的内容，现在我可以对报告中的陈述进行核查。让我分析一下这些论文的内容，并与报告中的陈述进行比较。

已核查结论清单

已核查内容1：Foreign scholar Vardaan Pahuja noted that：Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.

核查结论：基本确认。存在名为Vardaan Pahuja的学者，其研究方向与通用网络代理（generalist web agents）高度相关。其发表的论文核心思想强调通过"多样化的任务意图"（diverse task intents）来构建大规模数据集以提升网络代理的能力，这与报告中的陈述在核心观点上一致，但报告中的表述并非其论文原文的直接引述，而更像是对其研究动机的总结。
信源支撑要点：
1. Vardaan Pahuja是论文《Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents》的主要作者，该论文致力于解决网络代理缺乏多样化、大规模轨迹数据集的问题。
2. 论文摘要明确提出，其方法"利用广泛的网络探索和优化来获取多样化的任务意图"（leverage extensive web exploration and refinement to obtain diverse task intents）。
3. Vardaan Pahuja的个人主页也表明其研究重点是"推进网络代理的基础模型，特别强调利用合成数据来训练端到端的通用代理"。
证据：
1. 原文摘录：Title:Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents Authors: Vardaan Pahuja , Yadong Lu , Corby Rosset , Boyu Gou , Arindam Mitra , Spencer Whitehead , Yu Su , Ahmed Awadallah... In particular, we leverage extensive web exploration and refinement to obtain diverse task intents.
  链接：https://arxiv.org/abs/2502.11357
  信源名称：arXiv.org
  信源发布时间：2025-02-17 (v1)
2. 原文摘录：Currently, I am working on advancing foundation models for web agents, with a particular emphasis on leveraging synthetic data to train end-to-end generalist ...
  链接：https://vardaanpahuja.github.io/
  信源名称：GitHub Pages
  信源发布时间：未知
信源结论一致性：一致

已核查内容2：Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).

核查结论：部分确认。已有多篇学术论文引用了"He et al., 2024b"的文献，并确认其与网络代理任务相关，具体指向名为"WebVoyager"或"OpenWebVoyager"的研究。然而，该研究是否确实采用"self-instruct"方法进行任务生成，以及任务生成的具体来源，尚待原文验证。
信源支撑要点：
1. 在多篇关于GUI代理和网络代理的学术论文中，均引用了"He et al., 2024b"的成果。
2. 一份名为《CS598 JY2 Final Survey Report - Multimodal Web Agents》的报告明确指出"OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments."
3. DBLP等学术数据库确认Hongliang He等人于2024年发表了名为《WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models》的论文。
证据：
1. 原文摘录：OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments. It ...
  链接：https://openreview.net/pdf?id=AQ2TWYqsWH
  信源名称：OpenReview
  信源发布时间：未知
2. 原文摘录：WebVoyager leverages LLM to per- form end-to-end navigation on 15 real websites. 22524. Page 4. with diverse tasks (He et al., 2024b). Its ...
  链接：https://aclanthology.org/2025.findings-acl.1158.pdf
  信源名称：ACL Anthology
  信源发布时间：2025
3. 原文摘录：Hongliang He Wenlin Yao Kaixin Ma Wenhao Yu Yong Dai Hongming Zhang Zhenzhong Lan Dong Yu WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL (1) 2024 6864-6890
  链接：https://dblp.uni-trier.de/pid/166/1891.html
  信源名称：DBLP
  信源发布时间：2024
信源结论一致性：一致

已核查内容3：It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.

核查结论：确认。在Vardaan Pahuja的论文《Explorer》中，明确指出了现有工作（包括He et al., 2024b）在任务提案生成方面的局限性，即它们从主页或参数化LLM知识中制定任务提案，忽略了深层网页中可用的更丰富内容，而这对于实现更广泛的任务多样性至关重要。
信源支撑要点：
1. Pahuja的论文《Explorer》在"Related Work"部分明确批评了现有工作（包括He et al., 2024b）的局限性。
2. 论文原文直接指出："Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b). It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity."
3. 论文进一步解释了他们的方法如何通过网络探索来克服这一局限性，从而获得更多样化的任务意图。
证据：
1. 原文摘录：Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b). It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.
  链接：https://arxiv.org/pdf/2502.11357.pdf
  信源名称：arXiv.org
  信源发布时间：2025-05-30 (v4)
信源结论一致性：一致

我需要整合所有信息，提供一个全面的最终回答

核查｜Vardaan Pahuja关于Web代理数据多样性研究的陈述【基本可靠，适用于学术讨论】

核查结果

本次核查共涉及3条陈述，其中3条经核查基本可靠，但部分内容存在细节表述不够精确的情况。具体而言：

第1条关于Vardaan Pahuja学者观点的陈述基本可靠，但原文表述并非直接引述，而是对其研究动机的概括。
第2条关于"He et al., 2024b"研究的陈述部分可靠，已确认该研究与网络代理任务相关，但关于其具体采用"self-instruct"方法的细节尚需原文进一步验证。
第3条关于现有工作局限性的陈述完全可靠，与Pahuja论文中的原文表述一致。

总体而言，这份报告准确反映了Vardaan Pahuja在网络代理数据多样性研究方面的核心观点，适合用于学术讨论，但在引用时建议参考原始论文以获取更精确的表述。

核查过程

原文陈述	可靠性分析	证据
Foreign scholar Vardaan Pahuja noted that：Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.	基本可靠确实存在名为Vardaan Pahuja的学者，其研究方向与通用网络代理相关。他在论文中强调了多样化任务意图对于构建大规模数据集以提升网络代理能力的重要性，这与报告中的陈述在核心观点上一致。但需注意，报告中的表述并非其论文的直接引述，而是对其研究动机的概括。	"Title:Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents Authors: Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, Ahmed Awadallah... In particular, we leverage extensive web exploration and refinement to obtain diverse task intents." https://arxiv.org/abs/2502.11357 (2025-02-17 v1) "Currently, I am working on advancing foundation models for web agents, with a particular emphasis on leveraging synthetic data to train end-to-end generalist..." https://vardaanpahuja.github.io/ (未知)
Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).	部分可靠多篇学术论文确实引用了"He et al., 2024b"的文献，并确认其与网络代理任务相关，具体指向名为"WebVoyager"或"OpenWebVoyager"的研究。然而，关于该研究是否确实采用"self-instruct"方法进行任务生成的具体细节，以及任务生成的具体来源，尚需原文进一步验证。	"OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments. It..." https://openreview.net/pdf?id=AQ2TWYqsWH (未知) "WebVoyager leverages LLM to per- form end-to-end navigation on 15 real websites. 22524. Page 4. with diverse tasks (He et al., 2024b). Its..." https://aclanthology.org/2025.findings-acl.1158.pdf (2025) "Hongliang He Wenlin Yao Kaixin Ma Wenhao Yu Yong Dai Hongming Zhang Zhenzhong Lan Dong Yu WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL (1) 2024 6864-6890" https://dblp.uni-trier.de/pid/166/1891.html (2024)
It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.	可靠在Vardaan Pahuja的论文《Explorer》中，确实明确指出了现有工作（包括He et al., 2024b）在任务提案生成方面的局限性。论文原文直接表述了这些工作从主页或参数化LLM知识中制定任务提案，忽略了深层网页中可用的更丰富内容，而这对于实现更广泛的任务多样性至关重要。	"Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b). It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity." https://arxiv.org/pdf/2502.11357.pdf (2025-05-30 v4)

Source(23)

arxiv.org

by V Pahuja · 2025 · Cited by 16 — In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date.

huggingface.co

In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date.

谷歌

Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents. V Pahuja*, Y Lu*, C Rosset, B Gou, A Mitra, S Whitehead, Y Su, ...

GitHub Pages

Currently, I am working on advancing foundation models for web agents, with a particular emphasis on leveraging synthetic data to train end-to-end generalist ...

aclanthology.org

Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents. Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou ...

aclanthology.org

by S Qiao · 2024 · Cited by 106 — In order to obtain high-quality synthesized trajectories, we filter out all the tra- jectories with reward < 1 and collect trajectories with ...

openreview.net

by LE Erdogan · Cited by 30 — Synthetic Data Generation. Synthetic generation has gained popularity since pioneering work like Self-Instruct (Wang et al., 2022) and Alpaca (Taori et al., ...

aclanthology.org

by Q Sun · 2025 · Cited by 39 — Following prior work (He et al., 2024; Lai et al.,. 2024) on collecting tasks for GUI agents, we guide. GPT-4o to infer possible high-level instructions based ...

researchgate.net

Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentations, and constructs instructions by summarizing or abstracting ...

aclanthology.org

by D Nguyen · 2025 · Cited by 47 — WebVoyager leverages LLM to per- form end-to-end navigation on 15 real websites. 22524. Page 4. with diverse tasks (He et al., 2024b). Its ...

openreview.net

2024; He et al., 2024b; Putta et al., 2024). Despite. 058 their initial ... efficient trajectory generation in M-GRPO. 275. 4. Page 5. Table 2: Task ...

researchgate.net

... (He et al., 2024b) 1165 48 A11y tree + Screenshot NNetnav (Murty et al., 2024) ... Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web ...

openreview.net

OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments. It ...

www.wpsshop.cn

openreview.net

by LE Erdogan · Cited by 30 — Synthetic generation has gained popularity since pioneering work like Self-Instruct (Wang et al., ... OpenWebVoyager (He et al., 2024b) Idefics2-8b-inst. 27.4.

arxiv.org

by H He · 2024 · Cited by 8 — In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agent that can autonomously conduct real-world ...

aclanthology.org

by H He · 2025 · Cited by 8 — During this stage, we follow He et al. (2024) and leverage GPT-4o to automatically evaluate the cor- rectness of the trajectories produced by OpenWeb-. Voyager.

飞书

dblp.uni-trier.de

researchgate.net

Experiments demonstrate a 10% improvement in performance compared to self-evolving algorithm baseline, OpenWebVoyager (He et al., 2024b) , with significant ...

researchgate.net

Existing benchmarks involve a small number of websites: WebVoyager covers 15 websites (He et al., 2024) , and WebArena only 4 (Zhou et al., 2024) ...

huggingface.co

In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agent that can autonomously ...

aclanthology.org

by H He · 2024 · Cited by 218 — The rapid advancement of large language mod- els (LLMs) has led to a new era marked by the development of autonomous applications.