Creating a New Device Agent - Complete Tutorial

This tutorial shows how to create a new device agent (such as MobileAgent, AndroidAgent, or iOSAgent) and integrate it with UFO³'s multi-device orchestration system. LinuxAgent serves as the primary reference implementation throughout.

Introduction

What is a Device Agent?

A Device Agent is a specialized AI agent that controls and automates tasks on a specific type of device or platform. Unlike third-party agents, which extend specific functionality, a device agent represents an entire computing platform with its own:

Execution Environment: Device-specific OS, runtime, and APIs
Control Mechanism: UI automation, CLI commands, or platform APIs
Communication Protocol: Client-server architecture via WebSocket
MCP Integration: Device-specific MCP servers for command execution

Device Agent vs Third-Party Agent

Aspect	Device Agent	Third-Party Agent
Scope	Full platform control (Windows, Linux, Mobile)	Specific functionality (Hardware, Web)
Architecture	Client-Server separation	Runs on orchestrator server
Communication	WebSocket + AIP Protocol	Direct method calls
MCP Servers	Platform-specific MCP servers	Shares MCP servers
Examples	WindowsAgent, LinuxAgent, MobileAgent	HardwareAgent, WebAgent
Deployment	Separate client process on device	Part of orchestrator

When to Create a Device Agent

Create a Device Agent when you need to:

Control an entirely new platform (mobile, IoT, embedded)
Execute tasks on remote or distributed devices
Integrate with Galaxy multi-device orchestration
Isolate execution for security or scalability

Create a Third-Party Agent when you need to:

Extend existing platform with new capabilities
Add specialized tools or APIs
Run alongside existing agents

Prerequisites

Before starting this tutorial, ensure you have:

Knowledge Requirements

✅ Python 3.10+: Intermediate Python programming skills
✅ Async Programming: Understanding of async/await patterns
✅ UFO³ Basics: Familiarity with Agent Architecture
✅ MCP Protocol: Understanding of Model Context Protocol
✅ WebSocket: Basic knowledge of WebSocket communication

Development Environment

# Clone UFO³ repository
git clone https://github.com/microsoft/UFO.git
cd UFO

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import ufo; print('UFO³ installed successfully')"

Understanding Device Agents

Three-Layer Architecture

All device agents in UFO³ follow a unified three-layer architecture:

graph TB
    subgraph "Device Agent Architecture"
        subgraph "Level-1: State Layer (FSM)"
            S1[AgentState]
            S2[State Machine]
            S3[State Transitions]
            S1 --> S2 --> S3
        end
        subgraph "Level-2: Strategy Layer (Execution Logic)"
            P1[ProcessorTemplate]
            P2[DATA_COLLECTION]
            P3[LLM_INTERACTION]
            P4[ACTION_EXECUTION]
            P5[MEMORY_UPDATE]
            P1 --> P2 --> P3 --> P4 --> P5
        end
        subgraph "Level-3: Command Layer (System Interface)"
            C1[CommandDispatcher]
            C2[MCP Tools]
            C3[Device Commands]
            C1 --> C2 --> C3
        end
        S3 -->|delegates to| P1
        P5 -->|executes via| C1
    end
    style S1 fill:#e1f5ff
    style P1 fill:#fff3e0
    style C1 fill:#f3e5f5

Key Layers:

State Layer (Level-1): Finite State Machine controlling agent lifecycle
Strategy Layer (Level-2): Processing pipeline with modular strategies
Command Layer (Level-3): Atomic system operations via MCP

For detailed architecture, see Agent Architecture Documentation.
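
The way the three layers hand work to one another can be pictured with a toy pipeline. The sketch below is not UFO³ code: `ToyState`, `ToyProcessor`, and `ToyDispatcher` are hypothetical stand-ins invented for illustration, and only the four phase names are taken from the diagram above.

```python
from enum import Enum


class Phase(Enum):
    # Phase names mirror the Strategy Layer described above.
    DATA_COLLECTION = 1
    LLM_INTERACTION = 2
    ACTION_EXECUTION = 3
    MEMORY_UPDATE = 4


class ToyDispatcher:
    """Command layer stand-in: records commands instead of calling MCP tools."""

    def __init__(self):
        self.sent = []

    def execute(self, command):
        self.sent.append(command)
        return {"command": command, "status": "ok"}


class ToyProcessor:
    """Strategy layer stand-in: runs the four phases in a fixed order."""

    def __init__(self, dispatcher):
        self.dispatcher = dispatcher
        self.trace = []

    def process(self, request):
        context = {"request": request}
        for phase in Phase:  # Enum iteration follows definition order
            self.trace.append(phase.name)
            if phase is Phase.LLM_INTERACTION:
                # A real agent would call an LLM here; this is a fixed placeholder.
                context["action"] = f"echo {request}"
            elif phase is Phase.ACTION_EXECUTION:
                context["result"] = self.dispatcher.execute(context["action"])
        return context


class ToyState:
    """State layer stand-in: delegates one round to the processor, then finishes."""

    def __init__(self, processor):
        self.processor = processor
        self.name = "CONTINUE"

    def handle(self, request):
        context = self.processor.process(request)
        self.name = "FINISH"
        return context


dispatcher = ToyDispatcher()
state = ToyState(ToyProcessor(dispatcher))
outcome = state.handle("hello")
print(state.name, outcome["result"]["status"])
```

The point of the split is that each layer only knows about the one beneath it: the state decides when to run a round, the processor decides what happens in a round, and the dispatcher is the only piece that touches the device.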

Server-Client Separation

Device agents use a server-client architecture for security and scalability:

graph LR
    subgraph "Server Side (Orchestrator)"
        Server[Device Agent Server]
        State[State Machine]
        Processor[Strategy Processor]
        LLM[LLM Service]
        Server --> State
        Server --> Processor
        Processor -.-> LLM
    end
    subgraph "Communication"
        AIP[AIP Protocol WebSocket]
    end
    subgraph "Client Side (Device)"
        Client[Device Client]
        MCP[MCP Server Manager]
        Tools[Platform Tools]
        OS[Device OS]
        Client --> MCP
        MCP --> Tools
        Tools --> OS
    end
    Server <-->|Commands/Results| AIP
    AIP <-->|Commands/Results| Client
    style Server fill:#e1f5ff
    style Client fill:#c8e6c9
    style AIP fill:#fff3e0

Component Responsibilities:

Component	Location	Responsibilities	Security
Agent Server	Orchestrator	Reasoning, planning, state management	Untrusted (LLM-driven)
Device Client	Target Device	Command execution, resource access	Trusted (validated operations)
AIP Protocol	Network	Message transport, serialization	Encrypted channel (requires wss://; plain ws:// is unencrypted)

Separation Benefits:

Security: Isolates LLM reasoning from system-level execution
Scalability: Single orchestrator manages multiple devices
Flexibility: Clients run on resource-constrained devices (mobile, IoT)
Safety: Client validates all commands before execution
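
To make the "client validates all commands" point concrete, here is a minimal sketch of an allowlist check that a client could run before executing anything. It is an illustration only: `ALLOWED_PROGRAMS` and `validate_command` are hypothetical names, and the actual UFO³ client may validate commands differently.

```python
import shlex

# Hypothetical allowlist; a real client would define its own policy.
ALLOWED_PROGRAMS = {"ls", "cat", "echo", "pwd"}


def validate_command(command_line: str) -> list[str]:
    """Return the parsed argv if the program is allowed, else raise ValueError."""
    argv = shlex.split(command_line)
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_PROGRAMS:
        raise ValueError(f"program not allowed: {argv[0]}")
    return argv


print(validate_command("ls -la /tmp"))
try:
    validate_command("rm -rf /")
except ValueError as exc:
    print("rejected:", exc)
```

A check like this is only meaningful if the returned argv is executed without a shell (for example `subprocess.run(argv)` rather than `shell=True`), so that characters such as `;` or `&&` are treated as literal arguments.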

LinuxAgent: Reference Implementation

Why LinuxAgent as Reference?

LinuxAgent is the ideal reference for creating new device agents because:

✅ Simple Architecture: Single-tier agent (no HostAgent delegation)
✅ Clear Separation: Clean server-client boundary
✅ Well-Documented: Comprehensive code and documentation
✅ Production-Ready: Battle-tested in real deployments
✅ Minimal Complexity: Focuses on core device agent patterns

LinuxAgent Components

graph TB
    subgraph "Server Side (ufo/agents/)"
        LA[LinuxAgent Class customized_agent.py]
        LAP[LinuxAgentProcessor customized_agent_processor.py]
        LAS[LinuxAgent Strategies linux_agent_strategy.py]
        LAST[LinuxAgent States linux_agent_state.py]
        LA --> LAP
        LAP --> LAS
        LA --> LAST
    end
    subgraph "Client Side (ufo/client/)"
        Client[UFO Client client.py]
        MCP[MCP Server Manager mcp_server_manager.py]
        LinuxMCP[Linux MCP Server linux_mcp_server.py]
        Client --> MCP
        MCP --> LinuxMCP
    end
    subgraph "Configuration"
        Config[third_party.yaml]
        Devices[devices.yaml]
        Prompts[Prompt Templates]
    end
    LA -.reads.-> Config
    Client -.reads.-> Devices
    LA -.uses.-> Prompts
    style LA fill:#c8e6c9
    style LAP fill:#c8e6c9
    style LAS fill:#c8e6c9
    style LAST fill:#c8e6c9
    style Client fill:#e1f5ff
    style MCP fill:#e1f5ff
    style LinuxMCP fill:#e1f5ff

File Locations:

Component	File Path	Purpose
Agent Class	`ufo/agents/agent/customized_agent.py`	LinuxAgent definition
Processor	`ufo/agents/processors/customized/customized_agent_processor.py`	LinuxAgentProcessor
Strategies	`ufo/agents/processors/strategies/linux_agent_strategy.py`	LLM & Action strategies
States	`ufo/agents/states/linux_agent_state.py`	State machine states
Prompter	`ufo/prompter/customized/linux_agent_prompter.py`	Prompt construction
Client	`ufo/client/client.py`	Device client entry point
MCP Server	`ufo/client/mcp/http_servers/linux_mcp_server.py`	Command execution

LinuxAgent Architecture Diagram

sequenceDiagram
    participant User
    participant Server as LinuxAgent Server
    participant AIP as AIP Protocol
    participant Client as Linux Client
    participant MCP as Linux MCP Server
    participant Shell as Bash Shell
    User->>Server: User Request: "List files in /tmp"
    Server->>Server: State: ContinueLinuxAgentState
    Server->>Server: Processor: LinuxAgentProcessor
    Server->>Server: Strategy: LLM_INTERACTION
    Note over Server: Construct prompt, call LLM
    Server->>Server: LLM Response: execute_command("ls -la /tmp")
    Server->>Server: Strategy: ACTION_EXECUTION
    Server->>AIP: COMMAND: execute_command
    AIP->>Client: WebSocket: COMMAND
    Client->>MCP: Call MCP Tool: execute_command
    MCP->>Shell: Execute: ls -la /tmp
    Shell-->>MCP: stdout, stderr, exit_code
    MCP-->>Client: Result
    Client->>AIP: WebSocket: RESULT
    AIP->>Server: RESULT
    Server->>Server: Strategy: MEMORY_UPDATE
    Server->>Server: Update memory & blackboard
    Server->>Server: State Transition: FINISH
    Server->>User: Task Complete

Key Execution Flow:

User Request → LinuxAgent Server receives request
State Machine → Activates ContinueLinuxAgentState
Processor → Executes LinuxAgentProcessor strategies
LLM Interaction → Generates shell command
Action Execution → Sends command via AIP to client
MCP Execution → Client executes via Linux MCP Server
Result Handling → Server receives result, updates memory
State Transition → Moves to FINISH state
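
The round trip in this flow can be simulated in a few lines. The sketch below uses invented JSON message shapes (`type`, `tool`, `arguments`, `output`) purely for illustration; they are not the real AIP schema, and `fake_execute_command` is a stub that stands in for the Linux MCP Server without running anything. The `FAIL` state name is likewise an assumption.

```python
import json


def server_build_command(tool_name, arguments):
    """Server side: serialize a command for the wire (hypothetical shape)."""
    return json.dumps({"type": "COMMAND", "tool": tool_name, "arguments": arguments})


def client_handle(raw_message, tools):
    """Client side: decode the command, call the named tool, serialize the result."""
    message = json.loads(raw_message)
    tool = tools[message["tool"]]
    output = tool(**message["arguments"])
    return json.dumps({"type": "RESULT", "tool": message["tool"], "output": output})


def fake_execute_command(command):
    # Stub standing in for the Linux MCP Server; nothing is executed.
    return {"stdout": f"(pretend output of: {command})", "stderr": "", "exit_code": 0}


memory = []  # stands in for the agent's memory and blackboard
wire_command = server_build_command("execute_command", {"command": "ls -la /tmp"})
wire_result = client_handle(wire_command, {"execute_command": fake_execute_command})
result = json.loads(wire_result)
memory.append(result)  # MEMORY_UPDATE step
next_state = "FINISH" if result["output"]["exit_code"] == 0 else "FAIL"
print(next_state)
```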

Architecture Overview

Complete Device Agent Architecture

When creating a new device agent (e.g., MobileAgent), you'll implement these components:

graph TB
    subgraph "1. Agent Definition"
        A1[Agent Class MobileAgent]
        A2[Processor MobileAgentProcessor]
        A3[State Manager MobileAgentStateManager]
    end
    subgraph "2. Processing Strategies"
        S1[DATA_COLLECTION Screenshot, UI Tree]
        S2[LLM_INTERACTION Prompt Construction]
        S3[ACTION_EXECUTION Command Dispatch]
        S4[MEMORY_UPDATE Context Update]
    end
    subgraph "3. MCP Server"
        M1[MCP Server mobile_mcp_server.py]
        M2[MCP Tools tap, swipe, type, etc.]
    end
    subgraph "4. Configuration"
        C1[third_party.yaml Agent Config]
        C2[devices.yaml Device Registry]
        C3[Prompt Templates LLM Prompts]
    end
    subgraph "5. Client"
        CL1[Device Client client.py]
        CL2[MCP Manager mcp_server_manager.py]
    end
    A1 --> A2
    A2 --> S1 & S2 & S3 & S4
    S3 --> M1
    M1 --> M2
    A1 -.reads.-> C1
    CL1 --> CL2
    CL2 --> M1
    CL1 -.reads.-> C2
    A2 -.uses.-> C3
    style A1 fill:#c8e6c9
    style A2 fill:#c8e6c9
    style A3 fill:#c8e6c9
    style M1 fill:#e1f5ff
    style CL1 fill:#e1f5ff

Implementation Checklist:

[ ] Agent Class: Define MobileAgent inheriting from CustomizedAgent
[ ] Processor: Create MobileAgentProcessor with custom strategies
[ ] State Manager: Implement MobileAgentStateManager and states
[ ] Strategies: Build platform-specific LLM and action strategies
[ ] MCP Server: Develop MCP server with platform tools
[ ] Prompter: Create custom prompter for mobile context
[ ] Client Setup: Configure client to run on mobile device
[ ] Configuration: Add agent config to third_party.yaml
[ ] Device Registry: Register device in devices.yaml
[ ] Prompt Templates: Write LLM prompt templates

Tutorial Roadmap

This tutorial is split into 6 detailed guides:

📘 Part 1: Core Components

Learn to implement the server-side components:

Agent Class definition
Processor and strategies
State Manager and states
Prompter for LLM interaction

Time: 45 minutes
Difficulty: ⭐⭐⭐

📘 Part 2: MCP Server Development

Create a platform-specific MCP server:

MCP server architecture
Defining MCP tools
Command execution logic
Error handling and validation

Time: 30 minutes
Difficulty: ⭐⭐

📘 Part 3: Client Configuration

Set up the device client:

Client initialization
MCP server manager integration
WebSocket connection setup
Platform detection

Time: 20 minutes
Difficulty: ⭐⭐

📘 Part 4: Configuration & Deployment

Configure and deploy your agent:

third_party.yaml configuration
devices.yaml device registration
Prompt template creation
Galaxy integration

Time: 25 minutes
Difficulty: ⭐⭐

📘 Part 5: Testing & Debugging

Test and debug your implementation:

Unit testing strategies
Integration testing
Debugging techniques
Common issues and solutions

Time: 30 minutes
Difficulty: ⭐⭐⭐

📘 Part 6: Complete Example: MobileAgent

Hands-on walkthrough creating MobileAgent:

Step-by-step implementation
Android/iOS platform specifics
UI Automator integration
Complete working example

Time: 60 minutes
Difficulty: ⭐⭐⭐⭐

Quick Start Guide

For experienced developers, here's a minimal implementation checklist:

1️⃣ Create Agent Class

# ufo/agents/agent/customized_agent.py

@AgentRegistry.register(
    agent_name="MobileAgent",
    third_party=True,
    processor_cls=MobileAgentProcessor
)
class MobileAgent(CustomizedAgent):
    def __init__(self, name, main_prompt, example_prompt):
        super().__init__(name, main_prompt, example_prompt,
                         process_name=None, app_root_name=None, is_visual=None)
        self._blackboard = Blackboard()
        self.set_state(self.default_state)
        self._context_provision_executed = False

    @property
    def default_state(self):
        return ContinueMobileAgentState()
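
The decorator in this step suggests a name-to-class registry. The following toy shows the general pattern such a decorator can follow; `ToyAgentRegistry` and `DemoAgent` are hypothetical, and the real `AgentRegistry` in UFO³ may store or resolve agents differently.

```python
class ToyAgentRegistry:
    """Illustrative registry: maps an agent name to its class."""

    _agents = {}

    @classmethod
    def register(cls, agent_name, third_party=False, processor_cls=None):
        def decorator(agent_cls):
            # Record the class and attach the metadata passed to the decorator.
            cls._agents[agent_name] = agent_cls
            agent_cls.registered_third_party = third_party
            agent_cls.registered_processor_cls = processor_cls
            return agent_cls

        return decorator

    @classmethod
    def get(cls, agent_name):
        return cls._agents[agent_name]


class DemoProcessor:
    pass


@ToyAgentRegistry.register("DemoAgent", third_party=True, processor_cls=DemoProcessor)
class DemoAgent:
    pass


print(ToyAgentRegistry.get("DemoAgent").__name__)
```

Registering by name is what lets a configuration file refer to an agent with a plain string while the framework looks up the class at runtime.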

2️⃣ Create Processor

# ufo/agents/processors/customized/customized_agent_processor.py

class MobileAgentProcessor(CustomizedProcessor):
    def _setup_strategies(self):
        # Compose multiple data collection strategies
        self.strategies[ProcessingPhase.DATA_COLLECTION] = ComposedStrategy(
            strategies=[
                MobileScreenshotCaptureStrategy(fail_fast=True),
                MobileAppsCollectionStrategy(fail_fast=False),
                MobileControlsCollectionStrategy(fail_fast=False),
            ],
            name="MobileDataCollectionStrategy",
            fail_fast=True,
        )

        self.strategies[ProcessingPhase.LLM_INTERACTION] = (
            MobileLLMInteractionStrategy(fail_fast=True)
        )
        self.strategies[ProcessingPhase.ACTION_EXECUTION] = (
            MobileActionExecutionStrategy(fail_fast=False)
        )
        self.strategies[ProcessingPhase.MEMORY_UPDATE] = (
            AppMemoryUpdateStrategy(fail_fast=False)
        )
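
The `fail_fast` flags above mix required and optional steps. A rough sketch of what that could mean at runtime follows, under the assumption that a failing step with `fail_fast=True` aborts the phase while a failing step with `fail_fast=False` is recorded and skipped. `run_composed` is a hypothetical helper, not the `ComposedStrategy` implementation.

```python
def run_composed(steps):
    """Run (name, callable, fail_fast) steps in order.

    Returns (results, errors). Raises RuntimeError as soon as a step
    marked fail_fast=True fails.
    """
    results, errors = {}, []
    for name, func, fail_fast in steps:
        try:
            results[name] = func()
        except Exception as exc:
            errors.append((name, str(exc)))
            if fail_fast:
                raise RuntimeError(f"{name} failed: {exc}") from exc
    return results, errors


def capture_screenshot():
    return "screenshot.png"


def collect_apps():
    raise OSError("package list unavailable")


def collect_controls():
    return ["button:OK"]


results, errors = run_composed(
    [
        ("screenshot", capture_screenshot, True),   # required
        ("apps", collect_apps, False),              # optional, fails here
        ("controls", collect_controls, False),      # optional
    ]
)
print(sorted(results), errors)
```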

3️⃣ Create MCP Server

# ufo/client/mcp/http_servers/mobile_mcp_server.py

def create_mobile_mcp_server(host="localhost", port=8020):
    mcp = FastMCP("Mobile MCP Server", stateless_http=False, 
                  json_response=True, host=host, port=port)

    @mcp.tool()
    async def tap_element(x: int, y: int) -> dict:
        # Execute tap via ADB or platform API
        pass

    mcp.run(transport="streamable-http")

4️⃣ Configure Agent

# config/ufo/third_party.yaml

ENABLED_THIRD_PARTY_AGENTS: ["MobileAgent"]

THIRD_PARTY_AGENT_CONFIG:
  MobileAgent:
    VISUAL_MODE: True
    AGENT_NAME: "MobileAgent"
    APPAGENT_PROMPT: "ufo/prompts/third_party/mobile_agent.yaml"
    APPAGENT_EXAMPLE_PROMPT: "ufo/prompts/third_party/mobile_agent_example.yaml"
    INTRODUCTION: "MobileAgent controls Android/iOS devices..."

5️⃣ Register Device

# config/galaxy/devices.yaml

devices:
  - device_id: "mobile_agent_1"
    server_url: "ws://localhost:5010/ws"
    os: "android"
    capabilities: ["ui_automation", "app_testing"]
    metadata:
      device_model: "Pixel 6"
      android_version: "13"
    max_retries: 5
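
Before starting anything, it can help to validate a device entry. The sketch below models one entry from the registry above as a dataclass and checks the URL scheme and retry count. `DeviceEntry` is a hypothetical type written for this example; the field names come from the YAML, but the validation rules are assumptions, not Galaxy's own checks.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class DeviceEntry:
    device_id: str
    server_url: str
    os: str
    capabilities: list
    metadata: dict
    max_retries: int

    def __post_init__(self):
        # The client connects over WebSocket, so only ws:// or wss:// make sense.
        scheme = urlparse(self.server_url).scheme
        if scheme not in ("ws", "wss"):
            raise ValueError(f"server_url must use ws:// or wss://, got {scheme!r}")
        if self.max_retries < 0:
            raise ValueError("max_retries must be non-negative")


entry = {
    "device_id": "mobile_agent_1",
    "server_url": "ws://localhost:5010/ws",
    "os": "android",
    "capabilities": ["ui_automation", "app_testing"],
    "metadata": {"device_model": "Pixel 6", "android_version": "13"},
    "max_retries": 5,
}
device = DeviceEntry(**entry)
print(device.device_id, device.os)
```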

6️⃣ Start Server & Client

# Terminal 1: Start Agent Server
python -m ufo.server.app --port 5010

# Terminal 2: Start Device Client
python -m ufo.client.client \
  --ws --ws-server ws://localhost:5010/ws \
  --client-id mobile_agent_1 \
  --platform android

# Terminal 3: Start MCP Server (on device or accessible endpoint)
python -m ufo.client.mcp.http_servers.mobile_mcp_server --port 8020
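
After starting the three processes, a small reachability check can confirm that the ports used above are accepting TCP connections. This is a generic sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful connection shows only that something is listening, not that it is the expected service.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the commands above.
for label, port in [("agent server", 5010), ("MCP server", 8020)]:
    status = "up" if is_port_open("localhost", port) else "down"
    print(f"{label} on port {port}: {status}")
```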

Next Steps

Ready to Build Your Device Agent?

Start with Part 1: Core Components →

Or jump to a specific topic:

Agent Architecture - Three-layer architecture deep dive
Linux Agent Quick Start - LinuxAgent deployment guide
Server Overview - Server-side orchestration
Client Overview - Client-side execution
MCP Overview - Model Context Protocol
AIP Protocol - Agent Interaction Protocol
Creating Third-Party Agents - Third-party agent tutorial

Summary

Key Takeaways:

Device Agents control entire platforms (Windows, Linux, Mobile)
Server-Client Architecture separates reasoning from execution
Three-Layer Design provides modular, extensible framework
LinuxAgent is the best reference implementation
6-Part Tutorial covers all aspects of device agent creation
MCP Integration enables platform-specific command execution
Galaxy Integration supports multi-device orchestration

Ready to build your first device agent? Let's get started! 🚀

Recommended Reading:

Priority	Topic	Link	Time
🥇	Agent Architecture Overview	Infrastructure/Agents	20 min
🥇	LinuxAgent Quick Start	Quick Start: Linux	15 min
🥈	Server-Client Architecture	Server Overview, Client Overview	30 min
🥈	MCP Integration	MCP Overview	20 min
🥉	AIP Protocol	AIP Protocol	15 min