autogen_ext.agents.web_surfer
- class MultimodalWebSurfer(name: str, model_client: ChatCompletionClient, downloads_folder: str | None = None, description: str = '\n A helpful assistant with access to a web browser.\n Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, etc., filling in form fields, etc.).\n It can also summarize the entire page, or answer questions based on the content of the page.\n It can also be asked to sleep and wait for pages to load, in cases where the pages seem to be taking a while to load.\n ', debug_dir: str | None = None, headless: bool = True, start_page: str | None = 'https://www.bing.com/', animate_actions: bool = False, to_save_screenshots: bool = False, use_ocr: bool = True, browser_channel: str | None = None, browser_data_dir: str | None = None, to_resize_viewport: bool = True, playwright: Playwright | None = None, context: BrowserContext | None = None)
Bases: BaseChatAgent
MultimodalWebSurfer is a multimodal agent that acts as a web surfer: it can search the web and visit web pages.
It launches a Chromium browser and uses Playwright to interact with it, which lets the agent perform a variety of actions. The browser is launched on the first call to the agent and is reused for subsequent calls.
It must be used with a multimodal model client that supports function/tool calling; GPT-4o is currently the recommended choice.
When on_messages() or on_messages_stream() is called, the following occurs:
- If this is the first call, the browser is initialized and the page is loaded. This is done in _lazy_init(). The browser is only closed when close() is called.
- The method _generate_reply() is called, which then creates the final response as below.
- The agent takes a screenshot of the page, extracts the interactive elements, and prepares a set-of-mark screenshot with bounding boxes around the interactive elements.
- The agent makes a call to the model_client with the SOM screenshot, history of messages, and the list of available tools.
  - If the model returns a string, the agent returns the string as the final response.
  - If the model returns a list of tool calls, the agent executes the tool calls with _execute_tool() using _playwright_controller.
  - The agent returns a final response which includes a screenshot of the page, page metadata, description of the action taken and the inner text of the webpage.
- If at any point the agent encounters an error, it returns the error message as the final response.
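The call flow described above can be sketched without a browser or a real model. The following is a minimal illustration only: `SketchSurfer`, `ToolCall`, and `fake_model` are hypothetical stand-ins, not part of the library, and the screenshot step is omitted.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class ToolCall:
    """Hypothetical stand-in for a tool call returned by a model."""
    name: str
    argument: str


@dataclass
class SketchSurfer:
    """Illustrative only: mirrors the documented call flow, not the real class."""
    model: Callable[[str], Union[str, List[ToolCall]]]
    browser_started: bool = False
    log: List[str] = field(default_factory=list)

    async def _lazy_init(self) -> None:
        # The browser is started on the first call and reused afterwards.
        if not self.browser_started:
            self.browser_started = True
            self.log.append("browser launched")

    async def on_messages(self, task: str) -> str:
        await self._lazy_init()
        try:
            reply = self.model(task)
            if isinstance(reply, str):
                # A plain string from the model is the final response.
                return reply
            # Otherwise, execute each requested tool call in order.
            for call in reply:
                self.log.append(f"executed {call.name}({call.argument})")
            return f"Performed {len(reply)} action(s)."
        except Exception as exc:
            # Any error is returned as the final response.
            return f"Error: {exc}"


def fake_model(task: str) -> Union[str, List[ToolCall]]:
    """Hypothetical model: requests a click only when the task mentions one."""
    if "click" in task:
        return [ToolCall("click", "12")]
    return "No action needed."


async def demo() -> List[str]:
    surfer = SketchSurfer(model=fake_model)
    first = await surfer.on_messages("click the first link")
    second = await surfer.on_messages("summarize")
    return [first, second] + surfer.log


results = asyncio.run(demo())
```

Note that "browser launched" is logged once even though the agent is called twice, matching the lazy initialization described above.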
Note
Please note that using the MultimodalWebSurfer involves interacting with a digital world designed for humans, which carries inherent risks. Be aware that agents may occasionally attempt risky actions, such as recruiting humans for help or accepting cookie agreements without human involvement. Always ensure agents are monitored and operate within a controlled environment to prevent unintended consequences. Moreover, be cautious that MultimodalWebSurfer may be susceptible to prompt injection attacks from webpages.
- Parameters:
name (str) – The name of the agent.
model_client (ChatCompletionClient) – The model client used by the agent. Must be multimodal and support function calling.
downloads_folder (str, optional) – The folder where downloads are saved. Defaults to None, in which case no downloads are saved.
description (str, optional) – The description of the agent. Defaults to MultimodalWebSurfer.DEFAULT_DESCRIPTION.
debug_dir (str, optional) – The directory where debug information is saved. Defaults to None.
headless (bool, optional) – Whether the browser should be headless. Defaults to True.
start_page (str, optional) – The start page for the browser. Defaults to MultimodalWebSurfer.DEFAULT_START_PAGE.
animate_actions (bool, optional) – Whether to animate actions. Defaults to False.
to_save_screenshots (bool, optional) – Whether to save screenshots. Defaults to False.
use_ocr (bool, optional) – Whether to use OCR. Defaults to True.
browser_channel (str, optional) – The browser channel. Defaults to None.
browser_data_dir (str, optional) – The browser data directory. Defaults to None.
to_resize_viewport (bool, optional) – Whether to resize the viewport. Defaults to True.
playwright (Playwright, optional) – The playwright instance. Defaults to None.
context (BrowserContext, optional) – The browser context. Defaults to None.
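As a hedged sketch of how these parameters might be combined, the snippet below collects a few of them into a dictionary of keyword arguments. The parameter names come from the list above; the values and both directory paths are illustrative assumptions, and `build_surfer` is a hypothetical helper that defers the import so the snippet loads even when the package is not installed.

```python
from typing import Any, Dict

# Keyword arguments named in the parameter list above.
# The values are illustrative choices, not recommendations.
surfer_options: Dict[str, Any] = {
    "name": "web_surfer",
    "headless": False,                   # show the browser window
    "start_page": "https://www.bing.com/",
    "animate_actions": True,
    "to_save_screenshots": True,
    "debug_dir": "./debug",              # hypothetical path
    "downloads_folder": "./downloads",   # hypothetical path
}


def build_surfer(model_client: Any) -> Any:
    """Hypothetical helper: create the agent from the options above."""
    # Imported here so that this file can be loaded without the package.
    from autogen_ext.agents.web_surfer import MultimodalWebSurfer

    return MultimodalWebSurfer(model_client=model_client, **surfer_options)
```

The `model_client` passed to `build_surfer` must still satisfy the requirement stated above: multimodal, with function calling support.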
Example usage:
The following example demonstrates how to create a web surfing agent with a model client and run it for multiple turns.
    import asyncio

    from autogen_agentchat.ui import Console
    from autogen_agentchat.teams import RoundRobinGroupChat
    from autogen_ext.models.openai import OpenAIChatCompletionClient
    from autogen_ext.agents.web_surfer import MultimodalWebSurfer


    async def main() -> None:
        # Define an agent
        web_surfer_agent = MultimodalWebSurfer(
            name="MultimodalWebSurfer",
            model_client=OpenAIChatCompletionClient(model="gpt-4o-2024-08-06"),
        )

        # Define a team
        agent_team = RoundRobinGroupChat([web_surfer_agent], max_turns=3)

        # Run the team and stream messages to the console
        stream = agent_team.run_stream(task="Navigate to the AutoGen readme on GitHub.")
        await Console(stream)
        # Close the browser controlled by the agent
        await web_surfer_agent.close()


    asyncio.run(main())
- DEFAULT_DESCRIPTION = '\n A helpful assistant with access to a web browser.\n Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, etc., filling in form fields, etc.).\n It can also summarize the entire page, or answer questions based on the content of the page.\n It can also be asked to sleep and wait for pages to load, in cases where the pages seem to be taking a while to load.\n '
- DEFAULT_START_PAGE = 'https://www.bing.com/'
- MLM_HEIGHT = 765
- MLM_WIDTH = 1224
- SCREENSHOT_TOKENS = 1105
- VIEWPORT_HEIGHT = 900
- VIEWPORT_WIDTH = 1440
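One relationship follows directly from the constant values: the MLM dimensions are the viewport dimensions scaled by the same factor in both directions. The sketch below checks that arithmetic. The reference does not say what the MLM dimensions are used for, so reading them as a scaled-down image size, along with the `viewport_to_mlm` helper itself, is an assumption made for illustration.

```python
import math
from typing import Tuple

# Values copied from the class constants listed above.
VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1440, 900
MLM_WIDTH, MLM_HEIGHT = 1224, 765

width_ratio = MLM_WIDTH / VIEWPORT_WIDTH
height_ratio = MLM_HEIGHT / VIEWPORT_HEIGHT


def viewport_to_mlm(x: float, y: float) -> Tuple[float, float]:
    """Hypothetical helper: scale a viewport coordinate into MLM-sized space."""
    return (x * width_ratio, y * height_ratio)


# Both ratios come out at 0.85, so the aspect ratio is preserved.
same_scale = math.isclose(width_ratio, height_ratio)
```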
- async close() -> None
Close the browser and the page. Should be called when the agent is no longer needed.
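Because the browser stays open until close() is called, it can help to guarantee the call even when a task fails. The following is a minimal sketch of one way to do that. `closing_agent` is a hypothetical helper, not a library function, and `DummyAgent` is a stand-in that only mimics the async close() shape documented above.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator


@asynccontextmanager
async def closing_agent(agent: Any) -> AsyncIterator[Any]:
    """Hypothetical helper: await agent.close() on exit, even after an error."""
    try:
        yield agent
    finally:
        await agent.close()


class DummyAgent:
    """Stand-in with the same async close() shape as the documented method."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def demo() -> bool:
    agent = DummyAgent()
    try:
        async with closing_agent(agent):
            raise RuntimeError("simulated failure mid-task")
    except RuntimeError:
        pass
    # close() ran despite the exception raised inside the block.
    return agent.closed


closed_after_error = asyncio.run(demo())
```

A plain try/finally around the task works equally well; the context manager only packages the pattern for reuse.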
- async on_messages(messages: Sequence[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')]], cancellation_token: CancellationToken) -> Response
Handles incoming messages and returns a response.
Note
Agents are stateful and the messages passed to this method should be the new messages since the last call to this method. The agent should maintain its state between calls to this method. For example, if the agent needs to remember the previous messages to respond to the current message, it should store the previous messages in the agent state.
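The calling convention in this note can be illustrated with a toy agent. `EchoAgent` is a hypothetical stand-in that uses plain strings instead of the library's message types; it shows only that the caller passes new messages while the agent keeps the history itself.

```python
import asyncio
from typing import List


class EchoAgent:
    """Hypothetical stateful agent: keeps its own history across calls."""

    def __init__(self) -> None:
        self._history: List[str] = []

    async def on_messages(self, messages: List[str]) -> str:
        # Only new messages arrive; the agent appends them to its stored state.
        self._history.extend(messages)
        return f"seen {len(self._history)} message(s) so far"


async def demo() -> List[str]:
    agent = EchoAgent()
    first = await agent.on_messages(["open example.com"])
    # The second call passes only the new message, not the earlier one again.
    second = await agent.on_messages(["scroll down"])
    return [first, second]


replies = asyncio.run(demo())
```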
- async on_messages_stream(messages: Sequence[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')]], cancellation_token: CancellationToken) -> AsyncGenerator[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage | ToolCallMessage | ToolCallResultMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')] | Response, None]
Handles incoming messages and returns a stream of messages; the final item is the response. The base implementation in BaseChatAgent simply calls on_messages() and yields the messages in the response.
Note
Agents are stateful and the messages passed to this method should be the new messages since the last call to this method. The agent should maintain its state between calls to this method. For example, if the agent needs to remember the previous messages to respond to the current message, it should store the previous messages in the agent state.
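A consumer of such a stream needs to tell intermediate messages apart from the final response. The sketch below shows that pattern with a fake async generator. `Event`, `FinalResponse`, and `fake_stream` are hypothetical stand-ins for the library's message and Response types.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional, Tuple, Union


@dataclass
class Event:
    """Hypothetical intermediate message."""
    text: str


@dataclass
class FinalResponse:
    """Hypothetical final item, standing in for Response."""
    text: str


async def fake_stream() -> AsyncGenerator[Union[Event, FinalResponse], None]:
    yield Event("taking screenshot")
    yield Event("clicking link")
    # The last item yielded is the response.
    yield FinalResponse("done")


async def consume() -> Tuple[List[str], Optional[FinalResponse]]:
    events: List[str] = []
    final: Optional[FinalResponse] = None
    async for item in fake_stream():
        if isinstance(item, FinalResponse):
            final = item
        else:
            events.append(item.text)
    return events, final


events, final = asyncio.run(consume())
```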
- async on_reset(cancellation_token: CancellationToken) -> None
Resets the agent to its initialization state.
- property produced_message_types: List[type[Annotated[TextMessage | MultiModalMessage | StopMessage | HandoffMessage, FieldInfo(annotation=NoneType, required=True, discriminator='type')]]]
The types of messages that the agent produces.
- class PlaywrightController(downloads_folder: str | None = None, animate_actions: bool = False, viewport_width: int = 1440, viewport_height: int = 900, _download_handler: Callable[[Download], None] | None = None, to_resize_viewport: bool = True)
Bases: object
A helper class to allow Playwright to interact with web pages to perform actions such as clicking, filling, and scrolling.
- Parameters:
downloads_folder (str | None) – The folder to save downloads to. If None, downloads are not saved.
animate_actions (bool) – Whether to animate the actions (create fake cursor to click).
viewport_width (int) – The width of the viewport.
viewport_height (int) – The height of the viewport.
_download_handler (Optional[Callable[[Download], None]]) – A function to handle downloads.
to_resize_viewport (bool) – Whether to resize the viewport.
- async add_cursor_box(page: Page, identifier: str) -> None
Add a red cursor box around the element with the given identifier.
- Parameters:
page (Page) – The Playwright page object.
identifier (str) – The element identifier.
- async back(page: Page) -> None
Navigate back to the previous page.
- Parameters:
page (Page) – The Playwright page object.
- async click_id(page: Page, identifier: str) -> Page | None
Click the element with the given identifier.
- Parameters:
page (Page) – The Playwright page object.
identifier (str) – The element identifier.
- Returns:
Page | None – The new page if a new page is opened, otherwise None.
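Since click_id() returns a new page only when the click opens one, a caller has to decide which page to keep working with. The following is a minimal sketch of that decision. `FakePage`, `FakeController`, and `click_and_track` are hypothetical names used so the example runs without a browser.

```python
import asyncio
from typing import Optional


class FakePage:
    """Hypothetical stand-in for a Playwright Page."""

    def __init__(self, url: str) -> None:
        self.url = url


class FakeController:
    """Hypothetical controller with the documented click_id() return shape."""

    async def click_id(self, page: FakePage, identifier: str) -> Optional[FakePage]:
        # Pretend that element "7" opens a new page and others do not.
        if identifier == "7":
            return FakePage("https://example.com/new")
        return None


async def click_and_track(
    controller: FakeController, page: FakePage, identifier: str
) -> FakePage:
    """Return the page to continue with after the click."""
    new_page = await controller.click_id(page, identifier)
    # Switch to the new page if one was opened; otherwise stay where we are.
    return new_page if new_page is not None else page


async def demo() -> list:
    controller = FakeController()
    start = FakePage("https://example.com/")
    same = await click_and_track(controller, start, "3")
    moved = await click_and_track(controller, start, "7")
    return [same.url, moved.url]


urls = asyncio.run(demo())
```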
- async fill_id(page: Page, identifier: str, value: str, press_enter: bool = True) -> None
Fill the element with the given identifier with the specified value.
- async get_focused_rect_id(page: Page) -> str
Retrieve the ID of the currently focused element.
- Parameters:
page (Page) – The Playwright page object.
- Returns:
str – The ID of the focused element.
- async get_interactive_rects(page: Page) -> Dict[str, InteractiveRegion]
Retrieve interactive regions from the web page.
- Parameters:
page (Page) – The Playwright page object.
- Returns:
Dict[str, InteractiveRegion] – A dictionary of interactive regions.
- async get_page_markdown(page: Page) -> str
Retrieve the markdown content of the web page. Currently not implemented.
- Parameters:
page (Page) – The Playwright page object.
- Returns:
str – The markdown content of the page.
- async get_page_metadata(page: Page) -> Dict[str, Any]
Retrieve metadata from the web page.
- Parameters:
page (Page) – The Playwright page object.
- Returns:
Dict[str, Any] – A dictionary of page metadata.
- async get_visual_viewport(page: Page) -> VisualViewport
Retrieve the visual viewport of the web page.
- Parameters:
page (Page) – The Playwright page object.
- Returns:
VisualViewport – The visual viewport of the page.
- async get_webpage_text(page: Page, n_lines: int = 50) -> str
Retrieve the text content of the web page.
- Parameters:
page (Page) – The Playwright page object.
n_lines (int) – The number of lines to return from the page inner text.
- Returns:
str – The text content of the page.
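The effect of n_lines can be pictured as keeping only the first lines of the page's inner text. The sketch below is one plausible reading, not the library's implementation: `first_n_lines` is a hypothetical helper, and details such as how blank lines are treated are assumptions.

```python
def first_n_lines(inner_text: str, n_lines: int = 50) -> str:
    """Hypothetical helper: keep at most the first n_lines lines of the text."""
    return "\n".join(inner_text.splitlines()[:n_lines])


sample = "\n".join(f"line {i}" for i in range(1, 101))
truncated = first_n_lines(sample, n_lines=3)
```

Text shorter than n_lines is returned unchanged by this sketch.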
- async gradual_cursor_animation(page: Page, start_x: float, start_y: float, end_x: float, end_y: float) -> None
Animate the cursor movement gradually from start to end coordinates.
- async hover_id(page: Page, identifier: str) -> None
Hover the mouse over the element with the given identifier.
- Parameters:
page (Page) – The Playwright page object.
identifier (str) – The element identifier.
- async on_new_page(page: Page) -> None
Handle actions to perform on a new page.
- Parameters:
page (Page) – The Playwright page object.
- async page_down(page: Page) -> None
Scroll the page down by one viewport height minus 50 pixels.
- Parameters:
page (Page) – The Playwright page object.
- async page_up(page: Page) -> None
Scroll the page up by one viewport height minus 50 pixels.
- Parameters:
page (Page) – The Playwright page object.
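The scroll distance used by page_down() and page_up() is simple arithmetic: with the default viewport height of 900, one step is 850 pixels. The sketch below works through it. `scroll_step` and `next_scroll_y` are hypothetical helpers, and clamping at the top of the page is an assumption about how the browser behaves rather than something stated in this reference.

```python
def scroll_step(viewport_height: int = 900) -> int:
    """Distance of one page_down/page_up: viewport height minus 50 pixels."""
    return viewport_height - 50


def next_scroll_y(scroll_y: int, direction: str, viewport_height: int = 900) -> int:
    """Hypothetical helper: vertical offset after one step in a direction."""
    step = scroll_step(viewport_height)
    if direction == "down":
        return scroll_y + step
    if direction == "up":
        # Assumption: the offset cannot go above the top of the page.
        return max(0, scroll_y - step)
    raise ValueError(f"unknown direction: {direction!r}")


after_down = next_scroll_y(0, "down")
after_up = next_scroll_y(after_down, "up")
```

Presumably the 50-pixel difference leaves a strip of the previous view visible after each step, which helps keep context between screenshots.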
- async remove_cursor_box(page: Page, identifier: str) -> None
Remove the red cursor box around the element with the given identifier.
- Parameters:
page (Page) – The Playwright page object.
identifier (str) – The element identifier.
- async scroll_id(page: Page, identifier: str, direction: str) -> None
Scroll the element with the given identifier in the specified direction.