ai-agents-for-beginners

Building Computer Use Agents (CUA)

Computer use agents can interact with websites the same way a person would: by opening a browser, inspecting the page, and taking the next best action from what they see. In this lesson, you’ll build a browser automation agent that searches Airbnb, extracts structured listing data, and identifies the cheapest stay in Stockholm.

The lesson combines Browser-Use for AI-driven navigation, Playwright and Chrome DevTools Protocol (CDP) for browser control, Azure OpenAI for vision-enabled reasoning, and Pydantic for structured extraction.

Introduction

This lesson will cover:

Learning Goals

After completing this lesson, you will know how to:

Code Sample

This lesson includes one notebook tutorial:

Prerequisites

Setup

Install the packages used in the notebook:

pip install browser_use playwright python-dotenv
playwright install chromium

Set the Azure OpenAI environment variables used by the notebook:

AZURE_OPENAI_ENDPOINT=...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=...
# Optional: defaults to the latest API version when omitted
AZURE_OPENAI_API_VERSION=...
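
Before launching the notebook, it can help to confirm that the required variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_settings` is hypothetical and not part of the lesson's code. It treats the three variables without the "Optional" comment as required.

```python
import os

# Variables the setup section lists without an "Optional" note.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; pass a dict to check
    other sources, such as values parsed from a .env file.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

Calling `missing_settings()` after loading your `.env` file returns an empty list when everything required is present.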

Architecture Overview

The notebook demonstrates a hybrid browser automation workflow:

  1. Chrome starts with CDP enabled so both Playwright and Browser-Use can share the same browser session.
  2. A Browser-Use agent handles open-ended navigation tasks such as opening Airbnb, dismissing pop-ups, and searching for Stockholm.
  3. The active page is inspected with a structured Pydantic schema to extract listing titles, nightly prices, ratings, and URLs.
  4. Python logic compares the extracted listings and highlights the cheapest result.
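
Step 1 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: it builds a Chrome command line with the `--remote-debugging-port` flag and reads the WebSocket endpoint from Chrome's `/json/version` page. The helper names, port 9222, and the profile directory are assumptions chosen for the example.

```python
import json
import urllib.request


def chrome_cdp_command(chrome_path: str, port: int = 9222,
                       user_data_dir: str = "/tmp/cua-profile") -> list[str]:
    """Build the argument list that starts Chrome with CDP enabled."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",   # exposes the CDP endpoint
        f"--user-data-dir={user_data_dir}",  # separate profile for automation
        "--no-first-run",                    # skip first-run dialogs
    ]


def cdp_websocket_url(port: int = 9222, timeout: float = 2.0) -> str:
    """Ask an already running Chrome for its browser-level CDP WebSocket URL."""
    endpoint = f"http://127.0.0.1:{port}/json/version"
    with urllib.request.urlopen(endpoint, timeout=timeout) as response:
        return json.load(response)["webSocketDebuggerUrl"]
```

Pass the list to `subprocess.Popen` to start the browser, then give the same endpoint to both tools. Playwright offers `chromium.connect_over_cdp(...)` for this; how Browser-Use accepts a CDP URL depends on the installed version, so check its documentation.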

This approach keeps the flexible, vision-based reasoning that Browser-Use is good at while still giving you deterministic browser control when you need it.
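
Steps 3 and 4 of the workflow reduce to a typed record plus a comparison. The notebook uses a Pydantic schema for extraction; this sketch substitutes a standard-library dataclass so it runs without extra packages. The field names and the sample listings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Listing:
    """One extracted listing (stand-in for the notebook's Pydantic model)."""
    title: str
    price_per_night: float
    url: str
    rating: Optional[float] = None  # some listings have no rating yet


def cheapest(listings: list[Listing]) -> Optional[Listing]:
    """Return the listing with the lowest nightly price, or None if empty."""
    return min(listings, key=lambda item: item.price_per_night, default=None)


# Made-up sample data, for illustration only.
sample = [
    Listing("Listing A", 120.0, "https://example.com/a", 4.8),
    Listing("Listing B", 95.0, "https://example.com/b"),
    Listing("Listing C", 140.0, "https://example.com/c", 4.9),
]
best = cheapest(sample)
```

Keeping the comparison in plain Python means the result does not depend on the model's reasoning once the data has been extracted.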

Key Takeaways and Best Practices

When to Use Agent vs Actor

| Scenario | Use Agent | Use Actor |
| --- | --- | --- |
| Dynamic layouts | Yes, AI can adapt to page changes | No, brittle selectors can break |
| Known structure | No, an agent is slower than direct control | Yes, fast and precise |
| Finding elements | Yes, natural language works well | No, exact selectors are required |
| Timing control | No, less predictable | Yes, full control over waits and retries |
| Complex workflows | Yes, handles unexpected UI states | No, requires explicit branching |

Browser-Use Best Practices

  1. Start with an agent for exploration and dynamic navigation.
  2. Switch to direct page control when the interaction becomes predictable.
  3. Use structured output models so extracted data is validated and type-safe.
  4. Add short waits after actions that trigger visible UI changes so the next step acts on the updated page.
  5. Capture screenshots while iterating so failures are easier to debug.
  6. Expect websites to change and design fallback strategies for pop-ups and layout shifts.
  7. Blend agent and actor patterns to get both flexibility and precision.
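
One way to combine practices 2, 4, 6, and 7 is to try the deterministic step first and hand over to the agent only when it fails. The helper below is a hedged sketch, not a Browser-Use API: `with_fallback`, `direct_step`, and `agent_step` are hypothetical names, and the two callables stand for whatever direct page action and agent task you define.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_fallback(direct_step: Callable[[], Awaitable[T]],
                        agent_step: Callable[[], Awaitable[T]],
                        settle_seconds: float = 0.5) -> T:
    """Run the fast deterministic step; fall back to the agent on failure."""
    try:
        result = await direct_step()
    except Exception as error:
        # For example, a selector that no longer matches after a layout change.
        print(f"Direct step failed ({error!r}); falling back to the agent.")
        result = await agent_step()
    # Short pause so UI changes triggered by the step can settle.
    await asyncio.sleep(settle_seconds)
    return result
```

In a notebook you would `await with_fallback(click_search_button, ask_agent_to_search)`, where both arguments are your own coroutine functions.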

Real-World Applications

Additional Resources