
我需要根据用户要求,分块核查内容
准备核查清单中...
预备调查清单
待核查内容1:Foreign scholar Vardaan Pahuja noted that:Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.
待核查内容2:Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).
待核查内容3:It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.
行动计划
Search
Vardaan Pahuja scholar web agents data intent diversity我已经获取了一些关于Vardaan Pahuja的初步信息,但需要更深入了解他的研究和相关论文内容。接下来,我将查看他的论文详情,特别是关于数据意图多样性的观点。
本网页展示了一篇题为"Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents"的研究论文,由Vardaan Pahuja等人撰写,发表于ACL 2025 (Findings)。该研究针对多模态网络代理在现实在线环境中表现不佳的问题,提出了一种可扩展的解决方案。研究团队开发了一种合成大规模多样化轨迹级数据集的方法,包含超过94K成功的多模态网络轨迹,涵盖49K个独特URL、720K截图和33M网页元素。该方法利用广泛的网络探索和优化来获取多样化的任务意图,平均成本为每个成功轨迹28美分,使其对社区中广泛用户都可负担。基于此数据集,研究团队训练了名为Explorer的多模态网络代理,在Mind2Web-Live、Multimodal-Mind2Web和MiniWob++等离线和在线网络代理基准测试中表现出色。研究还强调数据规模扩展是提高网络代理能力的关键驱动因素,旨在使更大规模的基于LMM的代理研究更加普及。
在网页内容中,Vardaan Pahuja确实提到了数据意图多样性(data intent diversity)的概念,但没有直接使用"data intent diversity"这个术语。根据摘要内容,Pahuja及其团队强调了"diverse task intents"(多样化任务意图)的重要性,特别是在开发网络代理时。
摘要中提到:"In particular, we leverage extensive web exploration and refinement to obtain diverse task intents."(特别地,我们利用广泛的网络探索和优化来获取多样化的任务意图)。这表明研究团队认识到多样化的任务意图对于训练有效的网络代理至关重要。
研究团队开发了"the largest and most diverse trajectory-level dataset"(最大且最多样化的轨迹级数据集),这进一步强调了数据多样性的重要性。
然而,摘要中没有明确提到这种多样性对于"equipping generalist web agents with a broad range of skills"(为通用网络代理配备广泛技能)是必不可少的。摘要更多地强调了数据规模扩展("data scaling")作为提高网络代理能力的关键驱动因素:"our experiments highlight data scaling as a key driver for improving web agent capabilities"(我们的实验强调数据规模扩展是提高网络代理能力的关键驱动因素)。
总之,虽然Pahuja确实强调了任务意图多样性的重要性,但在提供的摘要中没有明确表述它对于为通用网络代理配备广泛技能是必不可少的。
网页内容中没有提供任何图片链接。
我已经查看了Vardaan Pahuja的论文,但需要继续深入了解。我现在需要查找关于"He et al., 2024b"的信息,以及他们关于synthetic web trajectory generation的工作。
Search
He et al. 2024b synthetic web trajectory generation self-instruct我需要更精确地搜索关于"He et al., 2024b"的信息,特别是关于web trajectory generation的工作。
Search
"He et al. 2024b" web trajectory generation task proposal homepage parametric LLM knowledge已核查结论清单
已核查内容1:Foreign scholar Vardaan Pahuja noted that:Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.
已核查内容2:Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).
待补充核查清单
待核查内容1:Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).
待核查内容2:It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.
行动计划
Search
OpenWebVoyager He et al. 2024b self-instruct task proposal generation我需要更多关于OpenWebVoyager的具体信息,特别是其任务生成方法。
Search
OpenWebVoyager He et al. 2024b full paper我需要查看WebVoyager的完整论文内容,以确认其任务生成方法。
这篇论文介绍了WebVoyager,一个由大型多模态模型(LMM)驱动的创新型网络代理,能够通过与真实网站交互来完成用户指令。与现有网络代理相比,WebVoyager支持多模态输入,并在真实网站环境中运行,而非简化的网页模拟器或静态网页快照。研究团队建立了一个新的基准测试,包含来自15个流行网站的真实任务,并引入了一种利用GPT-4V多模态理解能力的自动评估协议。实验结果显示,WebVoyager在该基准测试上达到了59.1%的任务成功率,显著超越了GPT-4(All Tools)和仅文本版WebVoyager的表现。提出的自动评估指标与人类判断达到了85.3%的一致性,证明了其在评估网络代理方面的有效性和准确性。该研究已被ACL 2024接收,代码和数据已在GitHub上公开发布。
根据提供的网页内容,没有关于WebVoyager或OpenWebVoyager如何生成任务提案的具体信息。网页内容主要介绍了WebVoyager的整体功能和性能评估,但没有详细说明其任务提案生成机制,也没有提及是否使用自我指导(self-instruct)方法,或者是否从主页或参数化LLM知识中制定任务提案。
网页内容仅提到WebVoyager是一个由大型多模态模型驱动的网络代理,能够与真实网站交互完成用户指令,并建立了包含15个流行网站真实任务的基准测试,但没有具体说明这些任务是如何生成或提出的。
要获取关于任务提案生成机制的详细信息,可能需要查阅完整的论文内容或访问GitHub上发布的代码库。
网页内容中没有提供任何图片链接或描述。
我需要查看OpenWebVoyager的论文,以获取更多关于任务生成方法的信息。
本文介绍了OpenWebVoyager,一个开源框架,旨在开发能够自主进行真实世界探索并自我改进的多模态网络代理。研究采用了三阶段方法:首先通过模仿学习训练基础模型获取基本能力;然后让代理探索开放网络并收集其轨迹的反馈;最后通过学习由另一个通用模型判断为表现良好的轨迹来改进其策略。这种"探索-反馈-优化"循环可以持续多次迭代。与现有的仅在合成环境中构建文本代理的开源努力不同,OpenWebVoyager专注于处理需要多模态感知能力且缺乏明确奖励信号的真实场景。实验结果表明,该网络代理在每次迭代后都能成功自我改进,在多个测试集上展示出强大的性能。
关于OpenWebVoyager如何生成任务提案的信息,在提供的摘要中没有直接提及。摘要主要介绍了OpenWebVoyager的整体框架和工作流程,包括模仿学习、网络探索、反馈收集和策略优化的循环过程,但没有具体说明任务提案的生成机制。
摘要中没有提到是否使用了self-instruct技术,也没有说明任务提案是从主页生成还是基于参数化LLM知识生成。文中仅提到代理会"探索开放网络"并收集反馈,但没有详细说明任务的来源或生成方式。
无与问题有关的内容。在提供的摘要中,没有找到关于OpenWebVoyager如何生成任务提案的具体信息。
无图片内容在提供的网页中。
已核查结论清单
已核查内容1:Foreign scholar Vardaan Pahuja noted that:Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.
已核查内容2:Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).
待补充核查清单
待核查内容1:Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).
待核查内容2:It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.
行动计划
Title:
URL Source: https://arxiv.org/pdf/2401.13919.pdf
Published Time: Mon, 10 Jun 2024 01:00:56 GMT
Markdown Content:
Hongliang He 1,3∗, Wenlin Yao 2, Kaixin Ma 2, Wenhao Yu 2, Yong Dai 2,Hongming Zhang 2, Zhenzhong Lan 3, Dong Yu 21Zhejiang University, 2Tencent AI Lab, 3Westlake University
hehongliang@westlake.edu.cn , wenlinyao@global.tencent.com
Abstract
The rapid advancement of large language mod-els (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innova-tion in creating advanced web agents. Exist-ing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenar-ios. To bridge this gap, we introduce WebVoy-ager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an au-tomatic evaluation protocol leveraging multi-modal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indi-cating its effectiveness in providing reliable and accurate assessments of web agents. 1
1 Introduction
The recent advancement of large language models (LLMs), such as ChatGPT and GPT-4 (OpenAI, 2023), have sparked significant interest in devel-oping LLM-based autonomous agents (AutoGPT, 2022) for complex task execution (Qin et al., 2023; Schick et al., 2023). Recent studies have explored the construction of text-based web browsing envi-ronments and how to instruct large language model agents to perform web navigation (Nakano et al., 2021; Gur et al., 2023; Zhou et al., 2023; Lu et al.,
∗Work done during the internship at Tencent AI Lab.
1Our code and data will be released at https://github. com/MinorJerry/WebVoyager
2023). The primary challenge in these works lies in managing complex and verbose HTML texts, and solutions include simplifying and structuring HTML (Nakano et al., 2021; Zhou et al., 2023; Gur et al., 2023; Deng et al., 2023). However, existing approaches overlook a critical functionality of browsing: rendering HTML into visual webpages. Particularly, vision capability is crucial for utilizing tools such as web browsers, as rendered web pages are inherently designed with user experience (UX), emphasizing intuitive information and structured presentation. This de-sign principle of rendering makes visual analysis more effective than mere HTML representation. At present, large multimodal models (LMMs), partic-ularly GPT-4V(ision) (OpenAI, 2023) and Gemini (Team et al., 2023), demonstrate a remarkable abil-ity to integrate intricate visual cues with textual information. Existing studies such as Pix2Struct (Lee et al., 2023) and WebArena (Zhou et al., 2023), have initiated explorations into using screenshots as inputs for decision-making in web navigation, yet these are preliminary and do not represent a deep exploration. Therefore, building multimodal web agents to leverage the environment rendered by browsers through screenshots, thus mimicking human web browsing behavior, is now a viable approach to enhance web navigation abilities. We introduce WebVoyager (Figure 1), a multi-modal web agent designed to autonomously ac-complish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention. WebVoyager processes the user query by making observations from screenshots and textual content in interactive web elements, formulates a thought on what ac-tion to take (such as clicking, typing, or scrolling, etc.), and then executes that action on the web-sites. Inspired by Set-of-Mark Prompting (Yang et al., 2023a), we mark interactive web elements on screenshots (see Figure 2) to facilitate decision-
arXiv:2401.13919v4 [cs.CL] 6 Jun 2024 Available Websites
Screenshot (Image)
Web Elements (Text)
[1]: Delivering to Santa Clara 95050
[2]: "Search Amazon"
...
Observation
Thought: Based on observation...
Click? Type? Scroll? ...
Send
Query
Return
Answer
User
...
Thought
Action
Figure 1: The overall workflow of WebVoyager. WebVoyager takes web tasks assigned by a human and automatically browses the web online. At each step, WebVoyager selects actions based on screenshots and text (the ‘type’ of the web element and its contents). Once the task is completed, the answers will be returned to the user. For example, for a user query: "Find the cost of a 2-year protection for PS4 on Amazon.", the agent interacts with Amazon online, locates the PS4, identifies the 2-year protection price, and returns "$30.99" to the user.
making for WebVoyager. Another challenge is the evaluation of an end-to-end web agent. Existing benchmarks, such as Mind2Web (Deng et al., 2023), primarily focus on stepwise and offline evaluation, where agents follow a predefined “golden” trajectory for action selection. This approach, however, may not fully account for the variety of viable strategies to ac-complish a task, as it only reflects one possible plan. This limitation could lead to a biased evaluation and difficulties in fairly comparing different meth-ods. To accurately evaluate the capabilities of web agents in end-to-end task completion, we propose an automated evaluation protocol using GPT-4V. Specifically, we save screenshots throughout the online navigation process and then use GPT-4V to evaluate these trajectories together with the fi-nal results automatically. Human evaluations are also conducted to verify the results and the analysis shows that our evaluation protocol achieves 85.3% agreement with human judges, indicating GPT-4V can serve as a reliable evaluator for online agents. We conduct evaluations on a newly collected dataset, which is semi-automatically generated us-ing a self-instruct (Wang et al., 2022) method, com-prising 643 web tasks from 15 commonly accessed websites. We also evaluate WebVoyager on 90 web-related tasks of level 1 and level 2 from the GAIA (Mialon et al., 2023), and 50 interactive open-web tasks from SeeAct (Zheng et al., 2024). We com-pare our WebVoyager with 1) GPT-4 (All Tools) 2,and 2) WebVoyager in a text-only setting which em-ploys the textual accessibility tree proposed in We-bArena (Zhou et al., 2023) to describe web pages.
2GPT-4 (All Tools) is an integrated tool-based agent re-leased by OpenAI in Oct. 2023. See https://chat.openai.com/
The results show that WebVoyager achieves a Task Success Rate of 59.1% on our new benchmark, sig-nificantly outperforming GPT-4 (All Tools) with a rate of 30.8% and the text-only setting with a rate of 40.1%, demonstrating the effectiveness of our method. Our research demonstrates the effec-tiveness of the WebVoyager method for web tasks, offering insights into the development of more in-telligent and efficient web automation solutions.
2 Related Work
Autonomous web navigation (Shi et al., 2017; Yang et al., 2023b) requires an agent to follow instruc-tions, construct plans, comprehend complex web structures, and decompose tasks into step-by-step decisions (Weng, 2023). To study web agents in a controlled environment, previous works con-structed web simulators that contain simplified websites (Shi et al., 2017; Yao et al., 2022a). More recently, there has been a surge of interest in build-ing more challenging and realistic benchmarks such as Mind2Web (Deng et al., 2023) and We-bArena (Zhou et al., 2023). Along with these new benchmarks, numerous efforts have been made to build autonomous web agents. WebGPT (Nakano et al., 2021) constructs a text-based web browsing environment and fine-tunes GPT-3 as a web agent. WebAgent (Gur et al., 2023) pretrains a T5 model to extract HTML snip-pets and leverages Flan-U-PaLM (Chowdhery et al., 2023) to generate Python code to interact with the environment. Besides fine-tuning, another line of work tries to build web agents by prompting LLMs (Yao et al., 2022b; Shinn et al., 2023; Ma et al., 2023). Multimodal web agents that integrate visual signals have also been explored, WebGUM (Fu-ruta et al., 2023) combines T5 (Raffel et al., 2020) with a Vision Transformer (ViT) to navigate using both screenshots and HTML text. PIX2ACT (Shaw et al., 2023) instead solely relies on web screen-shots as inputs to predict agent actions. Unlike pre-vious works that only consider a single modality or simplified web environments, we build a multi-modal agent that can complete tasks on real-world websites in this work. Concurrently with our work, SeeAct (Zheng et al., 2024) also leverages Large Multimodal Models (LMMs) for integrated visual understanding and actions on websites. However, the best SeeAct agent still relies on a finetuned cross-encoder model to select candidate elements for interaction. In contrast, WebVoyager does not require any additional modules.
3 WebVoyager
We aim to build an agent that can browse the open web autonomously without human intervention to complete user instructions. Given an instruction, our WebVoyager first instantiates a web browser and then performs actions with visual (i.e., screen-shots) and textual (i.e., HTML elements) signals from the web. The agent produces an action based on the inputs at every step, which is then executed in the browser environment. The process continues until the agent decides to stop. The details of Web-Voyager, including environment, interaction cycle, observation space, and action space, are as follows.
3.1 Browsing Environment
We develop an automated web-browsing environ-ment using Selenium 3. Unlike WebArena (Zhou et al., 2023), we do not host any websites locally and allow the agent to explore the open web instead, which poses unique challenges such as floating ads, pop-up windows, constant updates, etc. 4 Still, we opt for online interaction with real websites as we believe that this setting truly reflects the real-world use cases (e.g., the agent needs access to real-time information from the web), and a successful web agent should be able to adapt to these challenges and consistently solve the problem robustly.
3https://www.selenium.dev/
4Regarding CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) challenges, we believe it is important to respect the rules of these websites and prompt the agent to retrieve information from alternative sources.
Figure 2: Examples of webpage screenshots provided to the agent. We add borders to most of the interactive elements on the web pages and label them with numeri-cal tags in the top left corner.
3.2 Interaction Formulation
Formally, we denote the Environment as E, the large Multimodal Model as M, the Observa-tion Space as O, and the Action Space as A.At time step t, the model receives the con-text ct as inputs, which consist of historical ac-tions ai and observations oi, defined as: ct =(o1, a 1, ..., o t−1, a t−1, o t, I ) The the model pro-duces the action at at time t, at = M(ct), which is then executed in the environment. After execu-tion, the environment sends back the observation at time t + 1 , ot+1 = E(ot, a t). Then the context will be updated and this interaction process continues until the model generates a terminating action or the maximum step is reached. Inspired by the paradigm of ReAct Prompting (Yao et al., 2022b), we also prompt our agent to generate a thought process first before generating the action code. Hence at can be further composed into (st, ˆat) where st and ˆat represent the natural language thought and action code respectively. Fig-ure 7 in Appendix A presents the System Prompt we designed for the action prediction step. Also, it’s worth noting that excessive observations of web pages from longer episodes may confuse the agent. Therefore, we perform context clipping to remove outdated web page information and only keep the three most recent observations in the inputs, and we keep the entire history of thoughts and actions to better guide the agent.
3.3 Observation Space
Similar to how humans browse the web, our agent also takes the visual information from the web (screenshots) as the primary source of input. Using screenshots allows us to avoid the burden of pro-cessing HTML DOM tree or accessibility tree to portray the overall structure of webpages, which can lead to overly verbose texts and impact the decision-making process of the agent. Inspired by Set-of-Mark Prompting (Yang et al., 2023a), we overlay bounding boxes of the interactive elements on the websites to better guide the agent’s action prediction. Unlike Yang et al. (2023a), we do not need any object detection module (Zou et al., 2023). Instead, we utilize GPT-4V-ACT 5, a Javascript tool to extracts the interactive elements based on web element types and then overlays bounding boxes with numerical labels on the respective regions of the elements. GPT-4V-Act is efficient since it is rule-based without incorporating any object detec-tion model. As illustrated in Figure 2, the nature of webpages allows us to locate and outline each interactive el-ement using this tool precisely. The numerical labels assigned to each element are also essential for the model to identify the elements requiring interaction, thereby facilitating accurate action de-termination. We empirically choose black color for the borders and the background of the labels to enhance clarity. We observe that using a single black color yields higher success rates than using multiple colors. We also provide the agent with auxiliary text as inputs, including the textual con-tent embedded within the interactive element, the type of the element, and possibly some comment text in the aria-label attribute. To simplify the ob-servation, we have disabled multiple tabs, i.e., all interactions occur within the current tab instead of opening new ones. At every step, the agent receives the current screenshot, auxiliary text, and history as inputs, as discussed in (§3.2). In case the agent’s action raised an exception during execution, we addition-ally incorporated the error messages in the prompt and asked the model to regenerate the response. Note that each error correction attempt also con-sumes one step from the total exploration budget.
3.4 Action Space
We define the action space of our agent similar to how humans browse the web. To this end, we implement the most commonly used mouse and keyboard actions, sufficient for the agent to browse various web pages and locate the content required
for the task. With the help of numerical labels in screenshots, we enable the agent to respond with a concise Action Format. This approach precisely locates the elements requiring interaction and ex-ecutes the corresponding actions. The usage of actions is as follows (more details in Appendix C): 1) Click. This action involves clicking on an element within a webpage, typically a link or a button. 2) Input. This composite action involves selecting a text box, deleting any existing content within it, and then inputting new content. 3) Scroll .Scrolling is a common operation for browsing web-pages, usually involving the vertical movement of the entire page. 4) Wait . Action execution requires time, and this action is often used to wait for web pages to load. 5) Back . This action is used to return to the previous page. 6) Jump to Search Engine . There are often situations where agents get stuck at a certain website without finding an answer. This action enables the agent to jump to a search engine and start anew. 7) Answer . Once all questions in the task are resolved, this action concludes the iteration and provides an answer in line with the task requirements.
4 Benchmark for WebVoyager
4.1 Website Selection
We select 15 representative websites that cover dif-ferent aspects of our daily life to ensure diversity in our evaluation, including Allrecipes, Amazon, Ap-ple, ArXiv, BBC News, Booking, Cambridge Dic-tionary, Coursera, ESPN, GitHub, Google Flights, Google Map, Google Search, Huggingface, and Wolfram Alpha. Due to technical limitations, we regretfully omit websites requiring login or CAPTCHA to access their content. Additionally, Google Search is a universal website that can serve as a starting point for any website, making our framework applicable to various scenarios.
4.2 Data Construction
We employ a combination of self-instruct (Wang et al., 2022) and human verification to construct our evaluation set. Figure 3 illustrates our data creation process. Initially, we manually sample and rewrite some tasks from Mind2Web (Yin et al., 2023; Deng et al., 2023) for websites including Google Flights, Google Map, Google Search, Booking, and Wol-fram Alpha. This process yields initial seed tasks in the Task Pool for subsequent generation. In step two, we sample tasks from Task Pool as in-context Step 1: Tasks written by
Human (for 3 - 5 webs)
GPT-4
Several tasks
for 15 webs
Manual Filtering
Task Pool
(15 webs)
GPT-4
Manual Filtering
(Find answer s on Web)
Final Web Tasks
In-context
examples
(Quality of Tasks)
Filtered
tasks
In-context
examples
Step 2: seed tasks for each web
Generated tasks
Step 3: Generate web tasks and
check answers online.
Figure 3: Data creation process using self-instruct.
examples (Dong et al., 2022) and prompt GPT-4 Turbo to generate approximately 100 new tasks (20 iterations). Then we manually verify each gener-ated task and rewrite them if necessary to ensure its high quality and the answers can be found on the corresponding website, then we add them to the Task Pool as additional seed tasks. This step allows us to create human-validated seed tasks for each website. Finally, in step three, we sample more diverse in-context examples in the Task Pool and directly add the generated tasks to the Task Pool in each iteration. We manually verify that the gen-erated tasks have low repetition, and the answers to the generated tasks can be found on the web-sites. In total, we collected 40+ tasks per website, resulting in a total of 643 tasks. To further confirm that the generated tasks in the dataset have low repetition, We use the all-mpnet-base-v2 6 model to calculate pairwise similarity for 643 ques. Out of a total of 206,403 pairs, only 49 pairs have a similarity greater than 0.8, and 140 pairs have a similarity between 0.7 and 0.8. All of these have been manually checked and found to be acceptable. 99.68% of pairs have a similarity of less than 0.6. This demonstrates the diversity of our tasks and the robustness of our approach.
4.3 Annotation Process
After collecting the full task pool, we annotate an-swers for each task. Since some questions are open-ended and the web information may change, these questions may not have a fixed golden response. Thus, we label each data entry with an answer, cat-egorized as “Possible” or “Golden.” For answers
6
https://huggingface.co/sentence-transformers/all-mpnet-base-v2
labeled as “Golden,” we provide a comprehensive listing of possible responses and consider them stable in the short term. The “Possible” category covers the following scenarios: 1) Answers for open-ended tasks where it’s hard to find an exact match answer, such as summarization. 2) multiple answers satisfy the task, making it impractical to list all of them. Therefore, we provide a pa ...
我已经获取了WebVoyager论文的内容,这对我的核查非常有价值。现在我需要查看Vardaan Pahuja的Explorer论文,以便对比两者的方法和观点。
arXiv:2502.11357v4 [cs.AI] 30 May 2025
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for
Multimodal Web Agents
Vardaan Pahuja1*∗†*,
Yadong Lu2*¶,
Corby Rosset2**,**
Boyu Gou1**,**
Arindam Mitra2**,**
Spencer Whitehead2**,**
Yu Su1**,**
Ahmed Awadallah2
1The Ohio State University
2Microsoft Research, Redmond
pahuja.9@osu.edu, yadonglu@microsoft.com
Abstract
Recent success in large multimodal models
(LMMs) has sparked promising applications
of agents capable of autonomously complet-
ing complex web tasks. While open-source
LMM agents have made significant advances
in offline evaluation benchmarks, their perfor-
mance still falls substantially short of human-
level capabilities in more realistic online set-
tings. A key bottleneck is the lack of diverse
and large-scale trajectory-level datasets across
various domains, which are expensive to col-
lect. In this paper, we address this challenge
by developing a scalable recipe to synthesize
the largest and most diverse trajectory-level
dataset to date, containing over 94K success-
ful multimodal web trajectories, spanning 49K
unique URLs, 720K screenshots, and 33M web
elements. In particular, we leverage extensive
web exploration and refinement to obtain di-
verse task intents. The average cost is 28 cents
per successful trajectory, making it affordable
to a wide range of users in the community.
Leveraging this dataset, we train Explorer, a
multimodal web agent, and demonstrate strong
performance on both offline and online web
agent benchmarks such as Mind2Web-Live,
Multimodal-Mind2Web, and MiniWob++. Ad-
ditionally, our experiments highlight data scal-
ing as a key driver for improving web agent
capabilities. We hope this study makes state-of-
the-art LMM-based agent research at a larger
scale more accessible.1
1
Introduction
Graphical User Interfaces (GUIs) serve as the pri-
mary medium for user interaction across digital
environments. Within the GUI environment, LLM-
based agents (Su et al., 2024) have shown great
potential in automating complex workflows for hu-
man users. These agents are designed to operate
*Equal Contribution. † Work partly done during internship
at Microsoft Research. ¶ Project Lead.
1Project website: https://osu-nlp-group.github.io/
Explorer/
across diverse interfaces, including the web (Deng
et al., 2023; Zhou et al., 2024; Zheng et al., 2024,
2025), desktop (Xie et al., 2024; Wu et al., 2024),
and mobile platforms (Rawles et al., 2023; Yan
et al., 2023). Navigating modern GUI interfaces,
which integrate textual, graphical, and interactive
components, typically requires agents to possess
visual grounding, long-term planning, and memory
management capabilities.
Recent work (Cheng et al., 2024; Gou et al.,
2025) has demonstrated the effectiveness of syn-
thetic data for enhancing visual grounding (Gou
et al., 2025; Chen et al., 2024a; Kapoor et al.,
2024; Chen et al., 2024b) and planning (Xu et al.,
2025b; Zhang et al., 2024). Developing end-to-end
GUI agents with long-term planning and grounding
capabilities requires training on multi-step trajec-
tory data (Xu et al., 2025a,b; Qin et al., 2025).
However, existing trajectory datasets are primar-
ily human-annotated (Deng et al., 2023; Li et al.,
2024; Lu et al., 2024) or leverage synthetic data just
for task proposal curation (Lai et al., 2024; Chen
et al., 2024a). And human annotation is expensive
to scale for collecting large and diverse training
datasets. Therefore, synthetic data has emerged
as a promising alternative to human-annotated
data (Hartvigsen et al., 2022; Sahu et al., 2022;
Ye et al., 2022; Tang et al., 2023; Mukherjee et al.,
2023; Mitra et al., 2024). Collecting trajectory-
level datasets presents unique challenges: 1) cu-
rating a diverse set of task intents at scale, 2) de-
ploying an agent capable of interacting with a real-
world environment to complete these tasks through
a series of actions, and 3) verifying whether the task
is accomplished by the executed action sequence.
Data diversity is essential for equipping gener-
alist web agents with a broad range of skills. Ex-
isting work on synthetic web trajectory generation
employs self-instruct for task proposal generation
(He et al., 2024b). It formulates task proposals
from homepages or parametric LLM knowledge,
Task Proposal: Find
today's deals on
Amazon.
Action: Click on the
'Today's Deals' link.
Task Refinement: Find and
purchase a Lightning deal
item with at least 30% off.
Action: Click on the '36%
off Limited time deal
Amazon eero Pro 6E mesh
Wi-Fi router' link.
Overall Task: Buy the Amazon
eero Pro 6E mesh Wi-Fi router on
Amazon.
Task Refinement: Proceed to
checkout for the Amazon eero
Pro 6E mesh Wi-Fi router with
36% off.
Action: Click on the 'Buy Now'
button for the Amazon eero
Pro 6E mesh Wi-Fi router.
Task Proposer
Task Refiner
Task Refiner
Task Summarizer
(a) Proposal Stage
(b) Refinement Stage
(c) Summarization Stage
(d) Verification Stage
Overall Task: Buy the Amazon
eero Pro 6E mesh Wi-Fi router on
Amazon.
Success
Task Verifier
Thoughts: The user intended to purchase the Amazon eero
Pro 6E mesh Wi-Fi router on Amazon. The agent successfully
navigated to the product page, added the item to the cart,
and initiated the checkout process by clicking "Buy Now."
Figure 1: Data Generation Pipeline. The task proposer agent generates an abstract task proposal and the first
action based on the website homepage. The task is then iteratively refined in subsequent steps by the refiner agent.
Finally, the task summarizer agent constructs an overall task description from the action sequence, followed by task
verification to assess correctness.
overlooking the richer content available in deeper
web pages, which is essential for achieving broader
task diversity. Another line of work leverages web
tutorials as a form of supervision for generating
web trajectories (Ou et al., 2024; Xu et al., 2025a).
While web tutorials effectively cover common daily
user tasks, the resulting trajectory data exhibits lim-
ited domain diversity in terms of website and do-
main coverage (Table 1). Additionally, information-
seeking tasks remain underrepresented. Due to
these limitations, web agents trained on existing
synthetic trajectory datasets have not seen much
success in more realistic online evaluation settings.
To enhance web agents’ performance in real-world
settings, it is essential to incorporate greater diver-
sity in their training trajectories.
In this work, we develop a scalable and diverse
web trajectory data synthesis recipe for training
GUI agent models. Inspired by how humans learn
to use the internet, we leverage exploration as a
key mechanism for achieving diversity in task in-
tents. We introduce Explorer, a framework for sys-
tematic web exploration to generate diverse, high-
quality trajectory datasets. Unlike prior work that
relies on static task proposals, Explorer dynami-
cally explores web environments to curate diverse,
real-world tasks. This exploration-based approach
ensures broader task coverage and better gener-
alization to real-world scenarios. We instantiate
this framework using popular URLs from several
sources, such as Tranco (Pochat et al., 2019) and
similarweb.com as seeds. Our dataset comprises
94K diverse web trajectories spanning 49K unique
URLs, making it the largest web trajectory dataset
to date. Each trajectory is richly annotated with
artifacts such as screenshots, raw and set-of-mark
(Yang et al., 2023) annotated versions, HTML, and
the accessibility tree, enabling comprehensive web
agent training. To construct this dataset, we de-
velop a multi-agent pipeline that starts with an ab-
stract task proposal and iteratively refines it into a
more specific task through web exploration (Fig-
ure 1). Unlike previous approaches, our pipeline
generates tasks better grounded in real-world web-
sites, improving task relevance and diversity. To
demonstrate the effectiveness of our dataset, we
train small language models using just the synthetic
data and outperform existing web agent baselines
by a significant margin. The main contributions of
this work are as follows:
• We develop a scalable and easily customizable
multi-agent pipeline for web agent trajectory
synthesis. This pipeline leverages exploration
as a core mechanism to generate diverse tra-
jectory data, ensuring broad domain coverage
and skill diversity in the resulting dataset.
• We leverage this pipeline to generate a diverse
and high-quality GUI trajectory dataset con-
sisting of 94K trajectories, spanning 49K
# Trajectories
# Websites
Modality
RUSS (Xu et al., 2021)
80
22
HTML
Mind2Web (Deng et al., 2023)
2350
137
HTML + Screenshot
WebLINX (Lu et al., 2024)
2337
155
HTML + Screenshot
GUIAct (Chen et al., 2024a)
5696
121
Screenshot
OpenWebVoyager (He et al., 2024b)
1165
48
A11y tree + Screenshot
NNetnav (Murty et al., 2024)
6K
4
A11y tree + Screenshot
AgentTrek (Xu et al., 2025a)
10*.*4K
127
A11y tree + HTML + Screenshot
Explorer
94K
49K
A11y tree + Screenshot (raw + SoM) + HTML
Table 1: Comparison to existing web agent benchmarks.
unique URLs with 720K screenshots and
33M web elements, making it the largest web
trajectory dataset of this scale.
• We demonstrate the effectiveness of our
dataset by training small language models,
which achieve strong performance on both
online and offline benchmarks, significantly
surpassing existing web agent baselines, in-
cluding those with larger parameter counts.
2
Related Work
Recent advances in multimodal language models
have facilitated the development of web agents —
autonomous systems designed to interact with real-
world websites to perform everyday tasks (Deng
et al., 2023; Hong et al., 2024; Cheng et al., 2024;
Zheng et al., 2024, 2025; Xue et al., 2025). Early
efforts to acquire trajectory data for training web
agents primarily relied on crowd-sourcing (Deng
et al., 2023; Lu et al., 2024). However, due to
the high cost of human annotation, recent work
has adopted synthetic data generation for large-
scale collection. AutoWebGLM (Lai et al., 2024)
and GUIAct (Chen et al., 2024a) utilize LLMs
to generate task proposals, which human experts
subsequently annotate.
OpenWebVoyager (He
et al., 2024b) employs a web agent to execute auto-
generated task descriptions. However, since these
task descriptions are generated using LLMs with-
out exploring a website, they fail to capture the
full diversity of possible tasks on that website. An-
other line of work, including Synatra (Ou et al.,
2024) and AgentTrek (Xu et al., 2025a), leverages
web tutorials to guide web trajectory generation.
Meanwhile, concurrent effort (Murty et al., 2024)
employs an exploration-based trajectory generation
in WebArena’s sandbox, while our work focuses
on more realistic web agent evaluation on live web-
sites. To address diversity limitations in prior tra-
jectory synthesis work, we design a bottom-up web
trajectory synthesis pipeline that explores websites
dynamically while maintaining a coherent high-
level task intent. We refer readers to Appendix F
for further discussion.
3
Data Recipe
We design an automatic web trajectory synthesis
pipeline that explores websites to generate diverse
web trajectories. It utilizes Playwright2 to execute
actions and collect metadata from real-world web-
sites, starting from an initial URL.3 The metadata
includes screenshots, HTML, A11y tree, and ac-
tions in grounded and natural language forms.
3.1
Website Selection
We use a combination of URL sources to generate
the synthetic web trajectories. We obtain the top
100 URLs from similarweb.com corresponding
to the high-traffic portion of the web with transac-
tional tasks like booking flights, restaurant reserva-
tions, government services, sports, entertainment,
etc. The Tranco (Pochat et al., 2019) URLs include
49K URLs representing the head portion of the
web, which is less trafficked but popular nonethe-
less. We filter out harmful websites containing
violent or explicit content to ensure safety compli-
ance. Overall, we generate 94K trajectories across
both sources. The complete data generation takes
50 hours, utilizing 60 parallel processes. The view-
port resolution is up to 1980 × 1080.
3.2
Data Generation Pipeline
We aim to develop a generalized pipeline for web
exploration to collect diverse web trajectory data.
To enhance diversity, we adopt a bottom-up ap-
proach, starting with low-level actions and progres-
sively shaping them into high-level task descrip-
3For a 4K subset of trajectories, we instruct GPT-4o to
navigate to the target website by formulating a Google search
query based on the task description.
Information
View the detailed 7-day weather forecast for Toronto, ON on The Weather Network website.
Convert 100 US Dollars to Euros using the XE currency converter.
Find directions from Seattle, WA to Bellevue, WA using Bing Maps.
Service
Research the French Bulldog breed on the American Kennel Club website, including its popularity and family life traits.
Find the nearest Penske truck rental location in Anaheim, California, and start the reservation process for a truck.
Explore and purchase a subscription for the UpToDate Pro Suite on the Wolters Kluwer website.
Entertainment
Find the Basscon presents: Darren Styles EDM event on Eventbrite, save it, and share it on Twitter.
View the details of the Photography Competition Winners - Season X and share the article on Twitter.
Shopping
Browse through the fall home decor section on the Target website to explore a variety of fall-themed home decor items.
Purchase a three-seat fabric sofa, specifically the UPPLAND Sofa, from IKEA’s website.
Travel
Search for flights from Seattle to New York, select travel dates, and explore various flight options.
Find the weight of baggage allowance for economy class on qatarairways.
Table 2: Example task descriptions from Explorer.
Metric
Value
# Total trajectories
175K
# Success trajectories
94K
# Unique URLs
49K
Average steps per trajectory
7*.*7
Average elements per image
46*.*3
# Tokens
830M
# Elements
33*.*3M
# Images
720K
Cost per trajectory
$0*.*15
Cost per successful trajectory
$0*.*28
Table 3: Dataset statistics for Explorer. The number
of unique URLs, average steps per trajectory, average
elements per image, and number of tokens, elements,
and images correspond to the successful trajectories.
tions while maintaining a coherent task intent. In
the first step, the proposer agent generates an ab-
stract task, which is refined to a more specific task
through a refinement process (Figure 1). Since
the agents execute actions alongside the refinement
process, the generated tasks respect real-world con-
straints, such as product availability, available color
options, and other specifications, ensuring practical
applicability. Our pipeline consists of the following
LLM-powered agents4:
Task Proposer.
Given a website homepage, in-
cluding its screenshot and accessibility tree, the
task proposer agent generates diverse initial tasks
that could be performed on that website. The task
descriptions at this stage are instructed to be high-
level and abstract versions of the real-world tasks,
which will be refined into more specific tasks in
4We use GPT-4o as the agent backbone throughout the data
generation process.
later stages. Along with generating the task pro-
posal, the agent proposes and executes the first
action toward completing that task. Furthermore,
the agent is instructed to halt upon encountering
robot detection such as CAPTCHA verification,
login prompts, or payment requests.
Task Refiner.
The task refiner agent receives the
initial task proposal or the refined task description
from the previous step, along with the correspond-
ing action history as input. It then predicts the next
action consistent with the input task description and
the updated, refined task description while incorpo-
rating the complete action history. By iteratively
refining the task description after each action, the
agent ensures that the updated task remains aligned
with the action history.
Task Summarizer.
This module processes the
entire action and screenshot history to predict an
overall task description that aligns with the trajec-
tory. The task summary is expected to be high level,
i.e., it should describe what the task entails while
omitting how it is accomplished.
Task Verifier.
Inspired by Pan et al. (2024a), the
task verifier agent receives the task description
and action history, serving as a critic to evaluate
whether the trajectory successfully completes the
specified task. In addition to the screenshots of the
trajectory, it also receives a markdown representa-
tion of the last page. This ensures the verifier has
the full context of the website’s final state, even
when the viewport cannot capture all the content.
To ensure data quality, trajectories that are inco-
herent or misaligned with the high-level intent are
discarded during this stage. Such automatic evalu-
Figure 2: Data composition for Explorer. Its extensive diversity showcases its potential to train end-to-end generalist
web agents.
ation of web trajectories has been widely adopted
in prior work (Xu et al., 2025a; He et al., 2024a;
Koh et al., 2024). Figure 1 illustrates the above
pipeline. The prompts for the above agents are
given in Appendix D.
3.3
Dataset Analysis
Explorer comprises web trajectories spanning di-
verse domains, including services, entertainment,
shopping, travel, and information, ensuring broad
task diversity. Sample tasks from Explorer are
presented in Table 2. Figure 2 visualizes the do-
main and subdomain distribution, highlighting the
dataset’s rich diversity. To the best of our knowl-
edge, Explorer with 94K trajectories is the largest
web trajectory dataset of this scale. Table 1 shows a
comparison with existing web agent datasets from
the literature. The detailed statistics are given in
Table 3. Beyond diversity, Explorer is also highly
scalable and cost-efficient. Our approach achieves
a cost of $0*.*28 per successful trajectory, making it
approximately 2*×* more cost-effective than Agent-
Trek (Xu et al., 2025a) (which incurs $0*.*55 per tra-
jectory) and significantly cheaper than human anno-
tation (Table 4). Unlike human annotation, which
requires training crowd workers and continuous
quality monitoring, Explorer’s automated genera-
tion pipeline eliminates these bottlenecks, ensuring
scalability with minimal overhead. By combining
diversity, scalability, and cost efficiency, Explorer
sets a new benchmark for generating large-scale
web trajectory datasets, making it an invaluable
resource for training generalist GUI agents.
Model
Cost per trajectory
Mind2Web (Deng et al., 2023)
$0*.*85
AgentTrek (Xu et al., 2025a)
$0*.*55
Explorer
$0*.*28
Table 4: Cost comparison with other approaches.
4
Experiments
We use the synthetic trajectories generated by our
pipeline to train small multimodal language mod-
els (SLMs) for web agent tasks. To ensure com-
putational efficiency, we select 40K trajectories
from the full set for training. We further refine
this subset by filtering out trajectories that contain
more than two scroll actions to mitigate potential
m ...
我已经获取了两篇关键论文的内容,现在我可以对报告中的陈述进行核查。让我分析一下这些论文的内容,并与报告中的陈述进行比较。
已核查结论清单
已核查内容1:Foreign scholar Vardaan Pahuja noted that:Data intent diversity is essential for equipping gener alist web agents with a broad range of skills.
已核查内容2:Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b).
已核查内容3:It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity.
我需要整合所有信息,提供一个全面的最终回答
本次核查共涉及3条陈述,其中3条经核查基本可靠,但部分内容存在细节表述不够精确的情况。具体而言:
总体而言,这份报告准确反映了Vardaan Pahuja在网络代理数据多样性研究方面的核心观点,适合用于学术讨论,但在引用时建议参考原始论文以获取更精确的表述。
| 原文陈述 | 可靠性分析 | 证据 |
|---|---|---|
| Foreign scholar Vardaan Pahuja noted that:Data intent diversity is essential for equipping gener alist web agents with a broad range of skills. | 基本可靠 确实存在名为Vardaan Pahuja的学者,其研究方向与通用网络代理相关。他在论文中强调了多样化任务意图对于构建大规模数据集以提升网络代理能力的重要性,这与报告中的陈述在核心观点上一致。但需注意,报告中的表述并非其论文的直接引述,而是对其研究动机的概括。 | "Title:Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents Authors: Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, Ahmed Awadallah... In particular, we leverage extensive web exploration and refinement to obtain diverse task intents." https://arxiv.org/abs/2502.11357 (2025-02-17 v1) "Currently, I am working on advancing foundation models for web agents, with a particular emphasis on leveraging synthetic data to train end-to-end generalist..." https://vardaanpahuja.github.io/ (未知) |
| Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b). | 部分可靠 多篇学术论文确实引用了"He et al., 2024b"的文献,并确认其与网络代理任务相关,具体指向名为"WebVoyager"或"OpenWebVoyager"的研究。然而,关于该研究是否确实采用"self-instruct"方法进行任务生成的具体细节,以及任务生成的具体来源,尚需原文进一步验证。 | "OpenWebVoyager (He et al., 2024b) introduces a framework designed for continual learning and broad generalization across diverse web environments. It..." https://openreview.net/pdf?id=AQ2TWYqsWH (未知) "WebVoyager leverages LLM to per- form end-to-end navigation on 15 real websites. 22524. Page 4. with diverse tasks (He et al., 2024b). Its..." https://aclanthology.org/2025.findings-acl.1158.pdf (2025) "Hongliang He Wenlin Yao Kaixin Ma Wenhao Yu Yong Dai Hongming Zhang Zhenzhong Lan Dong Yu WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL (1) 2024 6864-6890" https://dblp.uni-trier.de/pid/166/1891.html (2024) |
| It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity. | 可靠 在Vardaan Pahuja的论文《Explorer》中,确实明确指出了现有工作(包括He et al., 2024b)在任务提案生成方面的局限性。论文原文直接表述了这些工作从主页或参数化LLM知识中制定任务提案,忽略了深层网页中可用的更丰富内容,而这对于实现更广泛的任务多样性至关重要。 | "Existing work on synthetic web trajectory generation employs self-instruct for task proposal generation (He et al., 2024b). It formulates task proposals from homepages or parametric LLM knowledge, overlooking the richer content available in deeper web pages, which is essential for achieving broader task diversity." https://arxiv.org/pdf/2502.11357.pdf (2025-05-30 v4) |