Welcome to UFO's Document!

arxivPython VersionLicense: MITgithubYouTube

Introduction

UFO is a UI-Focused multi-agent framework to fulfill user requests on Windows OS by seamlessly navigating and operating within individual or spanning multiple applications.

🕌 Framework

UFO UFO Image operates as a multi-agent framework, encompassing:

  • HostAgent 🤖, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.

  • AppAgent 👾, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.

  • Application Automator 🎮, is tasked with translating actions from HostAgent and AppAgent into interactions with the application and through UI controls, native APIs or AI tools. Check out more details here.

Both agents leverage the multi-modal capabilities of Visual Language Model (VLM) to comprehend the application UI and fulfill the user's request. For more details, please consult our technical report.

🚀 Quick Start

Please follow the Quick Start Guide to get started with UFO.

💥 Highlights

  • First Windows Agent - UFO is the pioneering agent framework capable of translating user requests in natural language into actionable operations on Windows OS.
  • Agent as an Expert - UFO is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including offline help documents, online search engines, and human demonstrations, making the agent an application "expert".
  • Rich Skill Set - UFO is equipped with a diverse set of skills to support comprehensive automation, such as mouse, keyboard, native API, and "Copilot".
  • Interactive Mode - UFO facilitates multiple sub-requests from users within the same session, enabling the seamless completion of complex tasks.
  • Agent Customization - UFO allows users to customize their own agents by providing additional information. The agent will proactively query users for details when necessary to better tailor its behavior.
  • Scalable AppAgent Creation - UFO offers extensibility, allowing users and app developers to create their own AppAgents in an easy and scalable way.

🌐 Media Coverage

Check out our official deep dive of UFO on this Youtube Video.

UFO sightings have garnered attention from various media outlets, including:

❓Get help


🎬 Demo Examples

We present two demo videos that complete user request on Windows OS using UFO. For more case study, please consult our technical report.

1️⃣🗑️ Example 1: Deleting all notes on a PowerPoint presentation.

In this example, we will demonstrate how to efficiently use UFO to delete all notes on a PowerPoint presentation with just a few simple steps. Explore this functionality to enhance your productivity and work smarter, not harder!

 

2️⃣📧 Example 2: Composing an email using text from multiple sources.

In this example, we will demonstrate how to utilize UFO to extract text from Word documents, describe an image, compose an email, and send it seamlessly. Enjoy the versatility and efficiency of cross-application experiences with UFO!

 

📚 Citation

Our technical report paper can be found here. Note that previous AppAgent and ActAgent in the paper are renamed to HostAgent and AppAgent in the code base to better reflect their functions. If you use UFO in your research, please cite our paper:

@article{ufo,
  title={{UFO: A UI-Focused Agent for Windows OS Interaction}},
  author={Zhang, Chaoyun and Li, Liqun and He, Shilin and  Zhang, Xu and Qiao, Bo and  Qin, Si and Ma, Minghua and Kang, Yu and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and  Zhang, Qi},
  journal={arXiv preprint arXiv:2402.07939},
  year={2024}
}

You may also find TaskWeaver useful, a code-first LLM agent framework for seamlessly planning and executing data analytics tasks.