autogen_ext.agents.web_surfer

class MultimodalWebSurfer(name: str, model_client: ChatCompletionClient, downloads_folder: str | None = None, description: str = '\n    A helpful assistant with access to a web browser.\n    Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, etc., filling in form fields, etc.).\n    It can also summarize the entire page, or answer questions based on the content of the page.\n    It can also be asked to sleep and wait for pages to load, in cases where the pages seem to be taking a while to load.\n    ', debug_dir: str | None = None, headless: bool = True, start_page: str | None = 'https://www.bing.com/', animate_actions: bool = False, to_save_screenshots: bool = False, use_ocr: bool = True, browser_channel: str | None = None, browser_data_dir: str | None = None, to_resize_viewport: bool = True, playwright: Playwright | None = None, context: BrowserContext | None = None)

Bases: BaseChatAgent

MultimodalWebSurfer is a multimodal agent that acts as a web surfer: it can search the web and visit web pages.

It launches a Chromium browser and uses Playwright to interact with it and perform a variety of actions. The browser is launched on the first call to the agent and is reused for subsequent calls.

It must be used with a multimodal model client that supports function/tool calling. GPT-4o is currently the recommended model.

When on_messages() or on_messages_stream() is called, the following occurs:
  1. If this is the first call, the browser is initialized and the page is loaded. This is done in _lazy_init(). The browser is only closed when close() is called.

  2. The method _generate_reply() is called, which creates the final response as described in the following steps.

  3. The agent takes a screenshot of the page, extracts the interactive elements, and prepares a set-of-mark (SOM) screenshot with bounding boxes around the interactive elements.

  4. The agent makes a call to the model_client with the SOM screenshot, the message history, and the list of available tools.
    • If the model returns a string, the agent returns the string as the final response.

    • If the model returns a list of tool calls, the agent executes the tool calls with _execute_tool() using _playwright_controller.

    • The agent returns a final response which includes a screenshot of the page, page metadata, description of the action taken and the inner text of the webpage.

  5. If at any point the agent encounters an error, it returns the error message as the final response.
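
The branching in steps 4 and 5 can be illustrated with a minimal, self-contained sketch. It does not use the real agent: `ToolCall`, `generate_reply`, `fake_execute`, the tool name `visit_url`, and the error prefix are all hypothetical stand-ins chosen for illustration, and the real `_generate_reply()` may differ in detail.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ToolCall:
    """Hypothetical stand-in for a tool call returned by the model."""

    name: str
    argument: str


def generate_reply(
    model_output: Union[str, List[ToolCall]],
    execute_tool: Callable[[ToolCall], str],
) -> str:
    """Mirror the branching of step 4 and the error handling of step 5."""
    try:
        if isinstance(model_output, str):
            # The model answered in plain text: that text is the final response.
            return model_output
        # Otherwise the model asked for tool calls: run each and describe the result.
        return "; ".join(execute_tool(call) for call in model_output)
    except Exception as exc:
        # Step 5: an error is reported as the final response instead of being raised.
        return f"Web surfing error: {exc}"


def fake_execute(call: ToolCall) -> str:
    """Hypothetical executor; the tool named 'crash' simulates a failure."""
    if call.name == "crash":
        raise RuntimeError("boom")
    return f"{call.name}({call.argument})"


direct = generate_reply("Done.", fake_execute)
acted = generate_reply([ToolCall("visit_url", "https://example.com")], fake_execute)
failed = generate_reply([ToolCall("crash", "")], fake_execute)
```

In this sketch the caller always receives a string, whether the model answered directly, requested an action, or the action failed.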

Note

Please note that using the MultimodalWebSurfer involves interacting with a digital world designed for humans, which carries inherent risks. Be aware that agents may occasionally attempt risky actions, such as recruiting humans for help or accepting cookie agreements without human involvement. Always ensure agents are monitored and operate within a controlled environment to prevent unintended consequences. Moreover, be cautious that MultimodalWebSurfer may be susceptible to prompt injection attacks from webpages.

Parameters:
  • name (str) – The name of the agent.

  • model_client (ChatCompletionClient) – The model client used by the agent. Must be multimodal and support function calling.

  • downloads_folder (str, optional) – The folder where downloads are saved. Defaults to None, in which case no downloads are saved.

  • description (str, optional) – The description of the agent. Defaults to MultimodalWebSurfer.DEFAULT_DESCRIPTION.

  • debug_dir (str, optional) – The directory where debug information is saved. Defaults to None.

  • headless (bool, optional) – Whether the browser should be headless. Defaults to True.

  • start_page (str, optional) – The start page for the browser. Defaults to MultimodalWebSurfer.DEFAULT_START_PAGE.

  • animate_actions (bool, optional) – Whether to animate actions. Defaults to False.

  • to_save_screenshots (bool, optional) – Whether to save screenshots. Defaults to False.

  • use_ocr (bool, optional) – Whether to use OCR. Defaults to True.

  • browser_channel (str, optional) – The browser channel. Defaults to None.

  • browser_data_dir (str, optional) – The browser data directory. Defaults to None.

  • to_resize_viewport (bool, optional) – Whether to resize the viewport. Defaults to True.

  • playwright (Playwright, optional) – The playwright instance. Defaults to None.

  • context (BrowserContext, optional) – The browser context. Defaults to None.

Example usage:

The following example demonstrates how to create a web surfing agent with a model client and run it for multiple turns.

import asyncio
from autogen_agentchat.ui import Console
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.web_surfer import MultimodalWebSurfer


async def main() -> None:
    # Define an agent
    web_surfer_agent = MultimodalWebSurfer(
        name="MultimodalWebSurfer",
        model_client=OpenAIChatCompletionClient(model="gpt-4o-2024-08-06"),
    )

    # Define a team
    agent_team = RoundRobinGroupChat([web_surfer_agent], max_turns=3)

    # Run the team and stream messages to the console
    stream = agent_team.run_stream(task="Navigate to the AutoGen readme on GitHub.")
    await Console(stream)
    # Close the browser controlled by the agent
    await web_surfer_agent.close()


asyncio.run(main())
DEFAULT_DESCRIPTION = '\n    A helpful assistant with access to a web browser.\n    Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, etc., filling in form fields, etc.).\n    It can also summarize the entire page, or answer questions based on the content of the page.\n    It can also be asked to sleep and wait for pages to load, in cases where the pages seem to be taking a while to load.\n    '
DEFAULT_START_PAGE = 'https://www.bing.com/'
MLM_HEIGHT = 765
MLM_WIDTH = 1224
SCREENSHOT_TOKENS = 1105
VIEWPORT_HEIGHT = 900
VIEWPORT_WIDTH = 1440
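
The MLM and viewport constants above share the same aspect ratio, which the arithmetic below confirms. The helper `to_viewport` is hypothetical: it assumes, without confirmation from this page, that the MLM dimensions describe a downscaled screenshot whose coordinates may need mapping back to the viewport.

```python
from fractions import Fraction
from typing import Tuple

# Values copied from the class constants listed above.
VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1440, 900
MLM_WIDTH, MLM_HEIGHT = 1224, 765

# Exact ratios between the two sizes; both reduce to 17/20 (0.85).
width_scale = Fraction(MLM_WIDTH, VIEWPORT_WIDTH)
height_scale = Fraction(MLM_HEIGHT, VIEWPORT_HEIGHT)


def to_viewport(x: int, y: int) -> Tuple[float, float]:
    """Map a point in an MLM-sized image to viewport coordinates (assumed usage)."""
    return float(x / width_scale), float(y / height_scale)


corner = to_viewport(MLM_WIDTH, MLM_HEIGHT)
```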
async close() → None

Close the browser and the page. Should be called when the agent is no longer needed.

async on_messages(messages: Sequence[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')]], cancellation_token: CancellationToken) → Response

Handles incoming messages and returns a response.

Note

Agents are stateful and the messages passed to this method should be the new messages since the last call to this method. The agent should maintain its state between calls to this method. For example, if the agent needs to remember the previous messages to respond to the current message, it should store the previous messages in the agent state.
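
The calling convention in the note can be sketched with a toy agent. `StatefulEchoAgent` is a hypothetical class for illustration, not part of AutoGen: it uses plain strings instead of message objects and a synchronous method instead of a coroutine, so that the state handling is easy to see.

```python
from typing import List


class StatefulEchoAgent:
    """Hypothetical toy agent that keeps its own message history."""

    def __init__(self) -> None:
        self._history: List[str] = []

    def on_messages(self, messages: List[str]) -> str:
        # The caller passes only the new messages; the agent appends them
        # to the history it already holds from earlier calls.
        self._history.extend(messages)
        return f"seen {len(self._history)} message(s)"


agent = StatefulEchoAgent()
first = agent.on_messages(["Open the AutoGen repository."])
# The second call sends only the new message, not the full conversation.
second = agent.on_messages(["Now scroll down."])
```

Resending the whole conversation on the second call would store the first message twice in the agent's history.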

async on_messages_stream(messages: Sequence[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')]], cancellation_token: CancellationToken) → AsyncGenerator[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage | ToolCallMessage | ToolCallResultMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')] | Response, None]

Handles incoming messages and returns a stream of messages; the final item is the response. The base implementation in BaseChatAgent simply calls on_messages() and yields the messages in the response.

Note

Agents are stateful and the messages passed to this method should be the new messages since the last call to this method. The agent should maintain its state between calls to this method. For example, if the agent needs to remember the previous messages to respond to the current message, it should store the previous messages in the agent state.

async on_reset(cancellation_token: CancellationToken) → None

Resets the agent to its initialization state.

property produced_message_types: List[type[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')]]]

The types of messages that the agent produces.

class PlaywrightController(downloads_folder: str | None = None, animate_actions: bool = False, viewport_width: int = 1440, viewport_height: int = 900, _download_handler: Callable[[Download], None] | None = None, to_resize_viewport: bool = True)

Bases: object

A helper class to allow Playwright to interact with web pages to perform actions such as clicking, filling, and scrolling.

Parameters:
  • downloads_folder (str | None) – The folder to save downloads to. If None, downloads are not saved.

  • animate_actions (bool) – Whether to animate the actions (creates a fake cursor that moves to the click target).

  • viewport_width (int) – The width of the viewport.

  • viewport_height (int) – The height of the viewport.

  • _download_handler (Optional[Callable[[Download], None]]) – A function to handle downloads.

  • to_resize_viewport (bool) – Whether to resize the viewport.

async add_cursor_box(page: Page, identifier: str) → None

Add a red cursor box around the element with the given identifier.

Parameters:
  • page (Page) – The Playwright page object.

  • identifier (str) – The element identifier.

async back(page: Page) → None

Navigate back to the previous page.

Parameters:

page (Page) – The Playwright page object.

async click_id(page: Page, identifier: str) → Page | None

Click the element with the given identifier.

Parameters:
  • page (Page) – The Playwright page object.

  • identifier (str) – The element identifier.

Returns:

Page | None – The new page if a new page is opened, otherwise None.
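
A caller typically has to decide which page to keep using after a click. The sketch below shows one way to handle the `Page | None` return value. `FakePage` and `FakeController` are hypothetical stand-ins so the example runs without a browser, and the rule that identifier "42" opens a new page is invented for the demonstration.

```python
import asyncio
from typing import Optional


class FakePage:
    """Hypothetical stand-in for a Playwright Page."""

    def __init__(self, url: str) -> None:
        self.url = url


class FakeController:
    """Hypothetical stand-in exposing the same click_id return contract."""

    async def click_id(self, page: FakePage, identifier: str) -> Optional[FakePage]:
        # Pretend that only element "42" opens a link in a new page.
        if identifier == "42":
            return FakePage("https://example.com/new-tab")
        return None


async def click_and_follow(
    controller: FakeController, page: FakePage, identifier: str
) -> FakePage:
    """Click an element and return whichever page should be used next."""
    new_page = await controller.click_id(page, identifier)
    # None means no new page was opened, so keep working with the current one.
    return new_page if new_page is not None else page


start = FakePage("https://example.com/")
same = asyncio.run(click_and_follow(FakeController(), start, "7"))
followed = asyncio.run(click_and_follow(FakeController(), start, "42"))
```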

async fill_id(page: Page, identifier: str, value: str, press_enter: bool = True) → None

Fill the element with the given identifier with the specified value.

Parameters:
  • page (Page) – The Playwright page object.

  • identifier (str) – The element identifier.

  • value (str) – The value to fill.

  • press_enter (bool) – Whether to press Enter after filling the value. Defaults to True.

async get_focused_rect_id(page: Page) → str

Retrieve the ID of the currently focused element.

Parameters:

page (Page) – The Playwright page object.

Returns:

str – The ID of the focused element.

async get_interactive_rects(page: Page) → Dict[str, InteractiveRegion]

Retrieve interactive regions from the web page.

Parameters:

page (Page) – The Playwright page object.

Returns:

Dict[str, InteractiveRegion] – A dictionary of interactive regions.

async get_page_markdown(page: Page) → str

Retrieve the markdown content of the web page. Currently not implemented.

Parameters:

page (Page) – The Playwright page object.

Returns:

str – The markdown content of the page.

async get_page_metadata(page: Page) → Dict[str, Any]

Retrieve metadata from the web page.

Parameters:

page (Page) – The Playwright page object.

Returns:

Dict[str, Any] – A dictionary of page metadata.

async get_visual_viewport(page: Page) → VisualViewport

Retrieve the visual viewport of the web page.

Parameters:

page (Page) – The Playwright page object.

Returns:

VisualViewport – The visual viewport of the page.

async get_webpage_text(page: Page, n_lines: int = 50) → str

Retrieve the text content of the web page.

Parameters:
  • page (Page) – The Playwright page object.

  • n_lines (int) – The number of lines to return from the page inner text.

Returns:

str – The text content of the page.
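
The effect of `n_lines` can be sketched on a plain string. The helper `first_n_lines` is hypothetical, and it assumes that the method keeps the first `n_lines` lines of the page's inner text. The page does not state how the real method splits or trims lines.

```python
def first_n_lines(text: str, n_lines: int = 50) -> str:
    """Keep only the first n_lines lines of text (assumed truncation rule)."""
    return "\n".join(text.split("\n")[:n_lines])


inner_text = "Title\nFirst paragraph\nSecond paragraph\nFooter"
preview = first_n_lines(inner_text, n_lines=2)
```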

async gradual_cursor_animation(page: Page, start_x: float, start_y: float, end_x: float, end_y: float) → None

Animate the cursor movement gradually from start to end coordinates.

Parameters:
  • page (Page) – The Playwright page object.

  • start_x (float) – The starting x-coordinate.

  • start_y (float) – The starting y-coordinate.

  • end_x (float) – The ending x-coordinate.

  • end_y (float) – The ending y-coordinate.

async hover_id(page: Page, identifier: str) → None

Hover the mouse over the element with the given identifier.

Parameters:
  • page (Page) – The Playwright page object.

  • identifier (str) – The element identifier.

async on_new_page(page: Page) → None

Handle actions to perform on a new page.

Parameters:

page (Page) – The Playwright page object.

async page_down(page: Page) → None

Scroll the page down by one viewport height minus 50 pixels.

Parameters:

page (Page) – The Playwright page object.

async page_up(page: Page) → None

Scroll the page up by one viewport height minus 50 pixels.

Parameters:

page (Page) – The Playwright page object.
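
The scroll distance used by page_down and page_up follows directly from the viewport height. The sketch below only does the arithmetic: `scroll_delta` is a hypothetical helper, and the sign convention (positive for down, negative for up) is an assumption made for illustration.

```python
def scroll_delta(viewport_height: int, direction: str, overlap: int = 50) -> int:
    """Return the vertical scroll distance in pixels for one page step."""
    # One viewport height minus 50 pixels, so consecutive views overlap slightly.
    step = viewport_height - overlap
    if direction == "down":
        return step
    if direction == "up":
        return -step
    raise ValueError(f"unknown direction: {direction!r}")


# With the default viewport height of 900, each step moves 850 pixels.
down = scroll_delta(900, "down")
up = scroll_delta(900, "up")
```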

async remove_cursor_box(page: Page, identifier: str) → None

Remove the red cursor box around the element with the given identifier.

Parameters:
  • page (Page) – The Playwright page object.

  • identifier (str) – The element identifier.

async scroll_id(page: Page, identifier: str, direction: str) → None

Scroll the element with the given identifier in the specified direction.

Parameters:
  • page (Page) – The Playwright page object.

  • identifier (str) – The element identifier.

  • direction (str) – The direction to scroll (“up” or “down”).

async sleep(page: Page, duration: int | float) → None

Pause the execution for a specified duration.

Parameters:
  • page (Page) – The Playwright page object.

  • duration (Union[int, float]) – The duration to sleep in milliseconds.

async visit_page(page: Page, url: str) → Tuple[bool, bool]

Visit a specified URL.

Parameters:
  • page (Page) – The Playwright page object.

  • url (str) – The URL to visit.

Returns:

Tuple[bool, bool] – A tuple indicating whether to reset prior metadata hash and last download.