UFO² — Windows AgentOS


UFO² is a Windows AgentOS that reimagines desktop automation as a first-class operating system abstraction. Unlike traditional Computer-Using Agents (CUAs) that rely on screenshots and simulated inputs, UFO² integrates deeply with the Windows OS through UI Automation APIs, application-specific introspection, and hybrid GUI–API execution. This enables robust, efficient, and non-disruptive automation across 20+ real-world applications.


What is UFO²?

UFO² addresses fundamental limitations of existing desktop automation solutions:

Traditional RPA (UiPath, Power Automate):
❌ Fragile scripts that break with UI changes
❌ Requires extensive manual maintenance
❌ Limited adaptability to dynamic environments

Current CUAs (Claude, Operator):
❌ Visual-only inputs with high cognitive overhead
❌ Miss native OS APIs and application internals
❌ Lock users out during automation (poor UX)

UFO² AgentOS:
✅ Deep OS Integration: Windows UIA, Win32, WinCOM APIs
✅ Hybrid GUI–API Actions: native APIs with fallback GUI automation
✅ Continuous Knowledge Learning: RAG-enhanced from docs and execution history
✅ Picture-in-Picture Desktop: parallel automation without user disruption
✅ 10%+ higher success rate than state-of-the-art CUAs (Claude, Operator) on WindowsAgentArena


Figure 1: Comparison between (a) traditional CUAs that rely on screenshots and simulated inputs, and (b) UFO² AgentOS that deeply integrates with OS APIs, application internals, and hybrid GUI–API execution.

Core Architecture

UFO² implements a hierarchical multi-agent system optimized for Windows desktop automation:


Figure 2: UFO² system architecture featuring the two-tier agent hierarchy (HostAgent + AppAgents), hybrid control detection pipeline, continuous knowledge substrate integration, and unified GUI–API action layer coordinated through MCP servers.

Two-Tier Agent Hierarchy

| Agent Type | Role | Key Capabilities |
|------------|------|------------------|
| HostAgent | Desktop Orchestrator | Task decomposition • Application selection • Cross-app coordination • AppAgent lifecycle management |
| AppAgent | Application Executor | UI element interaction • Hybrid GUI–API execution • Application-specific automation • Result reporting |

Design Philosophy:
- HostAgent handles WHAT (which application) and WHEN (task sequencing)
- AppAgent handles HOW (UI/API interaction) and WHERE (control targeting)
- Blackboard facilitates inter-agent communication without tight coupling
- State Machines ensure deterministic execution flow and error recovery
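
The division of labor above can be illustrated with a minimal sketch. The class and method names below mirror the roles in the table but are hypothetical, not the framework's actual classes: a `HostAgent` orders `(application, subtask)` pairs, each `AppAgent` carries out one subtask, and both communicate only through a shared `Blackboard`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared store that agents read and write instead of calling each other."""

    entries: dict[str, list[str]] = field(default_factory=dict)

    def post(self, topic: str, message: str) -> None:
        self.entries.setdefault(topic, []).append(message)

    def read(self, topic: str) -> list[str]:
        return list(self.entries.get(topic, []))


@dataclass
class AppAgent:
    """Decides HOW and WHERE: acts inside a single application."""

    app_name: str
    blackboard: Blackboard

    def run(self, subtask: str) -> str:
        # A real agent would inspect the UI and issue GUI/API actions here.
        result = f"{self.app_name}: done '{subtask}'"
        self.blackboard.post("results", result)
        return result


@dataclass
class HostAgent:
    """Decides WHAT and WHEN: picks applications and orders subtasks."""

    blackboard: Blackboard

    def run(self, plan: list[tuple[str, str]]) -> list[str]:
        # The plan is an ordered list of (application, subtask) pairs.
        for app_name, subtask in plan:
            AppAgent(app_name, self.blackboard).run(subtask)
        return self.blackboard.read("results")
```

Because the agents only share the blackboard, an AppAgent can be swapped or added without changing the HostAgent's code, which is the loose coupling the design philosophy describes.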



Key Innovations

1. Deep OS Integration 🔧

UFO² embeds directly into Windows OS infrastructure:

  • UI Automation (UIA): Introspects accessibility trees for standard controls
  • Win32 APIs: Low-level window management and process control
  • WinCOM: Interacts with Office applications (Excel, Word, Outlook)
  • Hybrid Detection: Fuses UIA metadata + visual grounding for non-standard UI elements

Hybrid Control Detection

Combines Windows UIA APIs with vision models (OmniParser) to detect both standard and custom UI controls—bridging structured accessibility trees and pixel-level perception.
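
One plausible way to fuse the two sources is to keep every UIA control and add only those vision detections that do not overlap an already-known control. The sketch below assumes that approach; the `Control` type, the intersection-over-union test, and the 0.5 threshold are illustrative assumptions, not the project's actual pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Control:
    name: str
    box: Box
    source: str  # "uia" or "vision"


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    width = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def merge_controls(
    uia: list[Control], vision: list[Control], threshold: float = 0.5
) -> list[Control]:
    """Keep all UIA controls; add vision controls that UIA did not cover."""
    merged = list(uia)
    for candidate in vision:
        if all(iou(candidate.box, known.box) < threshold for known in uia):
            merged.append(candidate)
    return merged
```

UIA entries win on overlap in this sketch because they carry structured metadata (names, control types) that a pixel-level detector cannot supply.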

📖 Control Detection Guide

2. Unified GUI–API Action Layer ⚡

Traditional CUAs simulate mouse/keyboard only. UFO² chooses the best execution method:

GUI Actions (fallback):
click, type, select, scroll → Reliable for any application

Native APIs (preferred):
- Excel: xlwings for direct cell/chart manipulation
- Outlook: win32com for email operations
- PowerPoint: python-pptx for slide editing
51% fewer LLM calls via speculative multi-action execution

Model Context Protocol (MCP) Servers:
Extensible framework for adding application-specific APIs without modifying agent code.
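
The "native API preferred, GUI as fallback" rule can be sketched as a small dispatcher. Everything here is a hypothetical illustration: `ApiUnavailable`, `execute_hybrid`, and the handler signatures are assumed names, and the real action layer is routed through MCP servers rather than plain callables.

```python
from __future__ import annotations

from typing import Callable


class ApiUnavailable(Exception):
    """Raised by a native-API handler that cannot serve the request."""


def execute_hybrid(
    action: str,
    api_handlers: dict[str, Callable[[], str]],
    gui_handler: Callable[[str], str],
) -> tuple[str, str]:
    """Try the native API for an action first, then fall back to the GUI.

    Returns (path, result), where path is "api" or "gui".
    """
    handler = api_handlers.get(action)
    if handler is not None:
        try:
            return "api", handler()
        except ApiUnavailable:
            pass  # fall through to GUI automation
    return "gui", gui_handler(action)
```

Returning the path taken alongside the result makes it easy to log how often the fallback is actually used.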

3. Continuous Knowledge Substrate 📚

UFO² learns from three knowledge sources without model retraining:

| Source | Content | Integration Method |
|--------|---------|--------------------|
| Help Documents | Official app documentation, API references | Vectorized retrieval (RAG) |
| Bing Search | Real-time web knowledge for latest features | Dynamic query expansion |
| Execution History | Past successful/failed action sequences | Experience replay & pattern mining |

Result: Agents improve autonomously by retrieving relevant context at execution time.
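
To make "retrieving relevant context at execution time" concrete, here is a toy retrieval sketch. It substitutes bag-of-words vectors and cosine similarity for a real embedding model and vector store, so it only illustrates the retrieval step, not the system's actual RAG stack; all function names are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True
    )
    return ranked[:top_k]
```

The retrieved snippets would then be added to the agent's prompt, which is how behavior can improve without retraining the model.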

4. Speculative Multi-Action Execution 🚀

Reduce LLM latency by predicting and validating action sequences:

Traditional Approach:
1 LLM call → 1 action → observe → repeat → High latency

UFO² Speculative Execution:
1 LLM call → predict N actions → validate with UI state → execute all → 51% fewer queries

Validation Mechanism:
Lightweight control-state checks ensure predicted actions remain valid before execution.
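
A minimal sketch of the validate-then-execute loop follows, under two assumptions that the text does not spell out: the UI state is modeled as a mapping from control id to an "enabled" flag, and execution stops at the first action whose control is no longer valid so the caller can ask the LLM again. The names `PredictedAction` and `run_speculative` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PredictedAction:
    control_id: str
    command: str
    argument: str = ""


def run_speculative(
    actions: list[PredictedAction],
    snapshot: Callable[[], dict[str, bool]],
    perform: Callable[[PredictedAction], None],
) -> list[PredictedAction]:
    """Execute a predicted batch, re-checking the UI before each action.

    Returns the actions that were actually executed.
    """
    executed: list[PredictedAction] = []
    for action in actions:
        controls = snapshot()  # cheap control-state check, no LLM call
        if not controls.get(action.control_id, False):
            break  # prediction is stale; leave the rest to a fresh LLM call
        perform(action)
        executed.append(action)
    return executed
```

The saving comes from the fact that the per-action check is a local lookup, while the expensive LLM call happens once per batch instead of once per action.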

Efficiency Gain

Task: "Fill form fields A1–A10 with sequential numbers"

  • Traditional CUA: 10 LLM calls (1 per field) → ~30 seconds
  • UFO² Speculative: 1 LLM call predicts all 10 actions → ~8 seconds

📖 Multi-Action Execution Guide

5. Picture-in-Picture Desktop 🖼️

Problem: Existing CUAs lock users out during automation (poor UX).

UFO² Solution: Nested virtual desktop via Windows Remote Desktop loopback:

  • User Desktop: Continue working normally
  • Agent Desktop (PiP): Automation runs in parallel sandboxed environment
  • Zero Interference: User and agent don't compete for mouse/keyboard

Implementation:
Built on Windows native remote desktop infrastructure—secure, isolated, non-disruptive.

User Experience

Users can continue email, browsing, or coding while UFO² automates Excel reports in the background PiP desktop.


System Components

Processing Pipeline

Both HostAgent and AppAgent execute a 4-phase processing cycle:

| Phase | Purpose | HostAgent Strategy | AppAgent Strategy |
|-------|---------|--------------------|-------------------|
| 1. Data Collection | Gather environment state | Desktop screenshot, app list | App screenshot, UI tree, control annotations |
| 2. LLM Interaction | Decide next action | Select application, plan subtask | Select control, plan action sequence |
| 3. Action Execution | Execute commands | Launch app, create AppAgent | Execute GUI/API actions |
| 4. Memory Update | Record execution | Save orchestration step | Save interaction step, update blackboard |
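
The four phases can be read as one loop iteration with pluggable strategies. The sketch below is an assumed shape, not the framework's real strategy layer: `ProcessingCycle`, `Step`, and the three callables are hypothetical names standing in for the per-agent strategies in the table.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Step:
    observation: str
    decision: str
    result: str


@dataclass
class ProcessingCycle:
    collect: Callable[[], str]      # phase 1 strategy
    decide: Callable[[str], str]    # phase 2 strategy (the LLM call)
    execute: Callable[[str], str]   # phase 3 strategy
    memory: list[Step] = field(default_factory=list)

    def run_once(self) -> Step:
        observation = self.collect()          # 1. Data Collection
        decision = self.decide(observation)   # 2. LLM Interaction
        result = self.execute(decision)       # 3. Action Execution
        step = Step(observation, decision, result)
        self.memory.append(step)              # 4. Memory Update
        return step
```

A HostAgent and an AppAgent would share this skeleton and differ only in the callables they plug in, which matches the two strategy columns of the table.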

Processing Details

📖 Strategy Layer — Processing framework and dependency chain
📖 State Layer — FSM design principles

Command System

Commands are dispatched through MCP (Model Context Protocol) servers:

HostAgent Commands:

  • Desktop Capture: capture_desktop_screenshot
  • Window Management: get_desktop_app_info, get_app_window
  • Process Control: launch_application, close_application

AppAgent Commands:

  • Screenshot: capture_screenshot, annotate_screenshot
  • UI Inspection: get_control_info, get_ui_tree
  • UI Interaction: click, set_edit_text, wheel_mouse_input
  • Control Selection: select_control_by_index, select_control_by_name
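
Dispatching commands by name can be sketched with a small registry. The command names `click` and `set_edit_text` come from the list above, but the `CommandRegistry` class, the handler signatures, and the return strings are hypothetical; the real system routes these commands through MCP servers.

```python
from __future__ import annotations

from typing import Any, Callable


class CommandRegistry:
    """Maps command names to handler functions."""

    def __init__(self) -> None:
        self._handlers: dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._handlers[name] = func
            return func

        return decorator

    def dispatch(self, name: str, **arguments: Any) -> Any:
        if name not in self._handlers:
            raise KeyError(f"unknown command: {name}")
        return self._handlers[name](**arguments)


registry = CommandRegistry()


@registry.register("click")
def click(control: str) -> str:
    # Placeholder body; a real handler would drive the UI.
    return f"clicked {control}"


@registry.register("set_edit_text")
def set_edit_text(control: str, text: str) -> str:
    return f"typed {text!r} into {control}"
```

New commands are added by registering another handler, so the dispatching code itself never changes.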

Command Architecture

📖 Command Layer — MCP integration and command dispatch
📖 MCP Servers — Server architecture and custom server creation


Configuration

UFO² integrates with a centralized YAML-based configuration system:

# config/ufo/host_agent_config.yaml
host_agent:
  visual_mode: true                  # Enable screenshot-based reasoning
  max_subtasks: 10                   # Maximum subtasks per session
  llm_config:
    model: "gpt-4o"
    temperature: 0.0

# config/ufo/app_agent_config.yaml
app_agent:
  visual_mode: true                  # Enable UI screenshot analysis
  control_backend: "uia"             # UI Automation (uia) or Win32 (win32)
  max_steps: 20                      # Maximum steps per subtask
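
As an illustration of how such settings might be validated once parsed, the sketch below loads the `app_agent` section from an already-parsed mapping (for example, the result of a YAML loader) into a typed object. The `AppAgentConfig` class and `load_app_agent_config` function are hypothetical; only the keys, defaults, and the two backend values come from the YAML above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AppAgentConfig:
    visual_mode: bool = True
    control_backend: str = "uia"
    max_steps: int = 20

    def __post_init__(self) -> None:
        if self.control_backend not in ("uia", "win32"):
            raise ValueError(f"unsupported control_backend: {self.control_backend!r}")
        if self.max_steps < 1:
            raise ValueError("max_steps must be at least 1")


def load_app_agent_config(raw: dict) -> AppAgentConfig:
    """Build a config from a parsed mapping; missing keys keep their defaults."""
    return AppAgentConfig(**raw.get("app_agent", {}))
```

Failing fast on an unknown backend or a non-positive step limit surfaces configuration mistakes before a session starts.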

Complete Configuration Guide

For detailed configuration options, model setup, and advanced customization:

📖 Configuration & Setup — Complete system configuration reference
📖 Model Setup — LLM provider configuration (OpenAI, Azure, Gemini, Claude, etc.)
📖 MCP Configuration — MCP server and extension configuration


Quick Start

Basic Usage

UFO² is designed to be run from the command line:

Interactive Mode:

# Start UFO² in interactive mode
python -m ufo --task <your_task_name>

Example:

python -m ufo --task excel_demo

This will prompt you to enter your request interactively:

Welcome to use UFO🛸, A UI-focused Agent for Windows OS Interaction.
Please enter your request to be completed🛸: Create a chart from Sheet1 data in Excel

Direct Request Mode:

# Execute with a specific request directly
python -m ufo --task <your_task_name> -r "<your_request>"

Example:

python -m ufo --task excel_demo -r "Open Excel and create a chart from Sheet1 data"
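
The difference between the two modes comes down to whether `-r` is present. The following sketch reproduces that behavior with `argparse`; it is not the project's actual entry point, and it assumes `--task` is required and that the real CLI may accept further options not shown here.

```python
from __future__ import annotations

import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ufo")
    # Assumption: --task is mandatory, as in both examples above.
    parser.add_argument("--task", required=True, help="name for this task run")
    parser.add_argument("-r", dest="request", default=None, help="request text")
    return parser


def resolve_request(
    args: argparse.Namespace, prompt: Callable[[str], str] = input
) -> str:
    """Direct mode uses -r; interactive mode asks the user."""
    if args.request is not None:
        return args.request
    return prompt("Please enter your request to be completed: ")
```

Passing `prompt` as a parameter keeps the interactive path testable without reading from a real terminal.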

Complete Setup Guide

For detailed installation, configuration, and advanced usage options, see the Quick Start Guide.

What Happens Under the Hood

  1. Session creates HostAgent with user request
  2. HostAgent captures desktop, selects "Microsoft Excel", launches app
  3. HostAgent creates AppAgent for Excel, delegates subtask
  4. AppAgent captures Excel UI, identifies chart insertion control
  5. AppAgent executes hybrid action (API if available, GUI fallback)
  6. AppAgent reports completion to HostAgent
  7. HostAgent verifies task, returns success to Session
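
The seven steps can be viewed as a small finite-state machine. The sketch below is one possible encoding; the state names, event names, and transition table are hypothetical and chosen only to mirror the walkthrough, not taken from the framework's state layer.

```python
from __future__ import annotations

from enum import Enum, auto


class State(Enum):
    HOST_PLANNING = auto()   # steps 1-3: plan, pick app, delegate
    APP_EXECUTING = auto()   # steps 4-5: inspect UI, act
    HOST_VERIFYING = auto()  # steps 6-7: receive report, verify
    DONE = auto()
    FAILED = auto()


TRANSITIONS: dict[tuple[State, str], State] = {
    (State.HOST_PLANNING, "delegated"): State.APP_EXECUTING,
    (State.APP_EXECUTING, "reported"): State.HOST_VERIFYING,
    (State.APP_EXECUTING, "error"): State.FAILED,
    (State.HOST_VERIFYING, "verified"): State.DONE,
    (State.HOST_VERIFYING, "incomplete"): State.HOST_PLANNING,
}


def advance(state: State, event: str) -> State:
    """Return the next state, rejecting events that are illegal in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in {state.name}") from None


def run(events: list[str], start: State = State.HOST_PLANNING) -> State:
    state = start
    for event in events:
        state = advance(state, event)
    return state
```

Because every legal move is listed in one table, an unexpected event fails loudly instead of leaving the session in an undefined state.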



Research Impact

UFO² demonstrates that system-level integration and architectural design matter more than model size alone:

Key Findings

  • 10%+ improvement over Claude/Operator on WindowsAgentArena
  • 51% fewer LLM calls via speculative multi-action execution
  • Robust to UI changes through hybrid UIA + visual detection
  • Continuous learning without model retraining via RAG
  • Non-disruptive UX via Picture-in-Picture desktop

Research Paper:
📄 UFO²: The Desktop AgentOS


Get Started

Ready to explore UFO²? Choose your path:

Learning Paths

🚀 New Users: Start with Quick Start Guide
🔧 Developers: Read Creating AppAgent
🏗️ System Architects: Study Device Agent Architecture
📊 Researchers: Check Benchmark Results

Next: HostAgent Deep Dive → Understand desktop orchestration


🌐 Media Coverage

Check out our official deep dive into UFO in this YouTube video.

UFO sightings have garnered attention from various media outlets.


📚 Citation

If you build on this work, please cite the AgentOS framework:

UFO² – The Desktop AgentOS (2025)
https://arxiv.org/abs/2504.14603

@article{zhang2025ufo2,
  title   = {{UFO2: The Desktop AgentOS}},
  author  = {Zhang, Chaoyun and Huang, He and Ni, Chiming and Mu, Jian and Qin, Si and He, Shilin and Wang, Lu and Yang, Fangkai and Zhao, Pu and Du, Chao and Li, Liqun and Kang, Yu and Jiang, Zhao and Zheng, Suzhen and Wang, Rujia and Qian, Jiaxu and Ma, Minghua and Lou, Jian-Guang and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei},
  journal = {arXiv preprint arXiv:2504.14603},
  year    = {2025}
}

UFO – A UI‑Focused Agent for Windows OS Interaction (2024)
https://arxiv.org/abs/2402.07939

@article{zhang2024ufo,
  title   = {{UFO: A UI-Focused Agent for Windows OS Interaction}},
  author  = {Zhang, Chaoyun and Li, Liqun and He, Shilin and Zhang, Xu and Qiao, Bo and Qin, Si and Ma, Minghua and Kang, Yu and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and Zhang, Qi},
  journal = {arXiv preprint arXiv:2402.07939},
  year    = {2024}
}


❓Get Help