Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

1KAIST, 2University of Oxford


[Exemplary web page contextualization for enhanced decision making by an LLM agent.]

Abstract

Recent advances in large language models (LLMs) have led to growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into a comprehensible format, which is then utilized by the decision-making agent. We demonstrate that our contextualization module integrates effectively with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and yields a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts.

LCoW


LCoW is a framework for training a module that contextualizes complicated web pages, thereby enhancing the decision-making capabilities of LLM agents in web automation. The contextualization module transforms complex web pages into a comprehensible and informative format, enabling LLM agents to make more accurate decisions during web navigation. The training algorithm for the contextualization module consists of three phases: (i) trajectory collection, (ii) sampling contextualized observations, and (iii) updating the contextualization module. For each observation in the collected trajectories, we generate multiple contextualized observations using the current contextualization module. Each candidate is then assigned a reward based on whether a set of LLM agents can predict the correct action given that contextualized observation. Finally, we select the candidate with the highest reward as the target and train the contextualization module to maximize the likelihood of the target given the original raw observation.
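The three phases above can be sketched in pseudocode. This is a minimal illustration, not the authors' implementation: the `contextualize`, `predict_action`, and `finetune` interfaces are hypothetical stand-ins for the underlying LM calls, and the action-matching reward is simplified to the fraction of agents that reproduce the demonstrated action.

```python
# Hypothetical sketch of one LCoW training iteration.
# `contextualizer` and `agents` expose assumed interfaces
# (contextualize / predict_action / finetune), not a real API.

def action_matching_reward(agents, contextualized_obs, gold_action):
    """Fraction of LLM agents that predict the demonstrated action
    when given the contextualized observation."""
    hits = sum(
        agent.predict_action(contextualized_obs) == gold_action
        for agent in agents
    )
    return hits / len(agents)

def lcow_iteration(contextualizer, agents, trajectories, num_samples=8):
    targets = []
    # Phase (i): iterate over (observation, action) pairs from
    # collected demonstration trajectories.
    for raw_obs, gold_action in trajectories:
        # Phase (ii): sample multiple contextualized observations
        # from the current contextualization module.
        candidates = [
            contextualizer.contextualize(raw_obs)
            for _ in range(num_samples)
        ]
        # Score each candidate with the action-matching reward and
        # keep the highest-reward one as the training target.
        best = max(
            candidates,
            key=lambda c: action_matching_reward(agents, c, gold_action),
        )
        targets.append((raw_obs, best))
    # Phase (iii): fine-tune the module to maximize the likelihood
    # of the selected target given the original raw observation.
    contextualizer.finetune(targets)
    return contextualizer
```

In this sketch, ties in the reward are broken arbitrarily by `max`; in practice one would also filter out observations where no candidate elicits the correct action.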

Experiments


We investigate the efficacy of LCoW across multiple LLM agents on WebShop. For all LLM agents, LCoW consistently improves both success rate and reward over iterations, surpassing self-contextualization (self-ctx) and even the human expert-level success rate (59.6%) by the third iteration. LCoW is also effective when combined with Llama-3.1-70B, which was not used for computing the action-matching reward (i.e., unseen) while training the contextualization module.


We evaluate the success rate of five LLM agents of varying scales on 165 tasks in the WorkArena benchmark, which features a more realistic web environment. A single iteration of LCoW improves the success rate of all LLMs, and even generalizes to Llama-3.1-70B and Llama-3.1-8B, which were not used for computing the action-matching reward. In particular, the relatively small Llama-3.1-8B struggles to accomplish tasks given complicated web pages, but improves by 36% when given observations contextualized by LCoW.


We analyze (1) whether training the contextualization module on seed demonstrations is more effective than directly training the LLM agent through behavior cloning, and (2) whether a contextualization module trained on certain task types generalizes to unseen task types. As shown in the figure (left), directly training Llama-3.1-8B-Instruct on seed demonstrations achieves a 23.6% success rate on the 165 WorkArena tasks, while Llama-3.1-8B-Instruct combined with the LCoW-trained contextualization module (trained on the same number of seed demonstrations) achieves a 37.0% success rate. Additionally, as shown in the figure (right), the contextualization module improves the task success rate by approximately 6% on both seen and unseen task types.
