Platform-Specific Agent Implementations

This document describes how the unified three-layer Device Agent architecture is implemented across different platforms. While the core framework (State, Processor, Command layers) remains consistent, each platform implements specialized agents optimized for their native control mechanisms and hierarchies. Understanding these implementations is essential for extending UFO3 to new platforms or customizing existing agents.

Overview

UFO3's Device Agent architecture achieves cross-platform compatibility through platform-specific agent implementations that inherit from a common abstract framework. Each platform's agents implement the same BasicAgent interface while adapting the three-layer architecture to their unique execution environments:

graph TB subgraph "Unified Framework" Framework[Three-Layer Architecture State + Processor + Command] end subgraph "Windows Platform" HostAgent[HostAgent Application Selection] AppAgent[AppAgent Application Control] HostAgent -->|Delegates| AppAgent end subgraph "Linux Platform" LinuxAgent[LinuxAgent Shell Commands] end subgraph "Future Platforms" MacAgent[macOS Agent] AndroidAgent[Android Agent] IOSAgent[iOS Agent] end Framework -.Implements.-> HostAgent Framework -.Implements.-> AppAgent Framework -.Implements.-> LinuxAgent Framework -.Extends to.-> MacAgent Framework -.Extends to.-> AndroidAgent Framework -.Extends to.-> IOSAgent style Framework fill:#fff4e1 style HostAgent fill:#e1f5ff style AppAgent fill:#e1f5ff style LinuxAgent fill:#c8e6c9 style MacAgent fill:#f0f0f0 style AndroidAgent fill:#f0f0f0 style IOSAgent fill:#f0f0f0

Unified Framework Benefits:

Code Reuse: State management, strategy orchestration, and command dispatch logic shared across platforms
Consistent Interface: All agents implement BasicAgent interface with same lifecycle (handle, next_state, next_agent)
Extensibility: New platforms inherit three-layer architecture, only implementing platform-specific strategies and commands
Multi-Platform Coordination: HostAgent on Windows can coordinate with LinuxAgent on Linux device via Blackboard

Platform Comparison

Feature	Windows (Two-Tier)	Linux (Single-Tier)	Future (macOS, Mobile)
Agent Hierarchy	HostAgent ?AppAgent delegation	LinuxAgent (flat)	Platform-specific (TBD)
Observation Method	UI Automation API (COM)	Shell output, accessibility tree	Platform APIs (Accessibility, Screen)
Action Mechanism	UI element manipulation (click, type)	Shell command execution	Platform-specific controls
Application Model	Windowed applications	Command-line tools, X11 apps	Application frameworks
State Complexity	7 states (CONTINUE, FINISH, CONFIRM, etc.)	Simplified state set	Platform-dependent
Multi-Agent Coordination	HostAgent ?AppAgent via Blackboard	N/A (single agent per device)	Cross-device via Blackboard
Primary Use Cases	Office automation, GUI apps	Server management, DevOps	Mobile apps, embedded systems

Platform Selection Strategy

Windows: Use HostAgent + AppAgent for GUI-based applications requiring multi-step workflows (e.g., Excel data analysis, Word document editing)
Linux: Use LinuxAgent for command-line tasks, server administration, scripting workflows
Cross-Platform: Coordinate Windows and Linux agents via Blackboard for hybrid tasks (e.g., Windows collects data, Linux processes on server)

Windows Platform: Two-Tier Agent Hierarchy

Windows implements a two-tier hierarchy where HostAgent manages application selection and task decomposition, delegating execution to AppAgent instances for specific applications.

Architecture Overview

graph TB subgraph "Windows Two-Tier Hierarchy" User[User Request: 'Create chart in Excel from Word data'] HostAgent[HostAgent Task Orchestrator] AppAgent1[AppAgent Microsoft Word] AppAgent2[AppAgent Microsoft Excel] User --> HostAgent HostAgent -->|1. Extract data| AppAgent1 HostAgent -->|2. Create chart| AppAgent2 AppAgent1 -.Result via Blackboard.-> HostAgent AppAgent2 -.Result via Blackboard.-> HostAgent end subgraph "HostAgent Responsibilities" TaskDecomp[Task Decomposition] AppSelect[Application Selection] SubtaskDist[Subtask Distribution] ResultAgg[Result Aggregation] end subgraph "AppAgent Responsibilities" UIObserve[UI Observation] ActionExec[Action Execution] AppControl[Application Control] end HostAgent --> TaskDecomp HostAgent --> AppSelect HostAgent --> SubtaskDist HostAgent --> ResultAgg AppAgent1 --> UIObserve AppAgent1 --> ActionExec AppAgent1 --> AppControl style HostAgent fill:#fff4e1 style AppAgent1 fill:#e1f5ff style AppAgent2 fill:#e1f5ff

Two-Tier Execution Flow Example:

User Request: "Extract data from sales.docx and create a bar chart in Excel"

HostAgent: 1. Analyzes request → Identifies need for Word + Excel 2. Creates subtask 1: "Extract sales data from Word document" 3. Delegates to AppAgent (Word) via next_agent()

AppAgent (Word): 1. Observes Word UI, locates sales data table 2. Executes select_text + copy_to_clipboard actions 3. Writes result to Blackboard: blackboard.add_data(data, blackboard.trajectories) 4. Returns to HostAgent via next_agent(HostAgent)

HostAgent: 1. Reads result from Blackboard 2. Creates subtask 2: "Create bar chart in Excel from extracted data" 3. Delegates to AppAgent (Excel) via next_agent()

AppAgent (Excel): 1. Reads data from Blackboard trajectories 2. Executes actions: paste_data → select_data_range → insert_chart 3. Returns to HostAgent with AgentStatus.FINISH

HostAgent: Application Selection and Task Orchestration

The HostAgent is the top-level coordinator in the Windows two-tier hierarchy, responsible for application selection, task decomposition, and subtask distribution.

HostAgent Architecture

@AgentRegistry.register(agent_name="hostagent")
class HostAgent(BasicAgent):
    """
    The HostAgent class is the manager of AppAgents.
    Coordinates multi-application workflows on Windows.
    """

    def __init__(
        self,
        name: str,
        is_visual: bool,
        main_prompt: str,
        example_prompt: str,
        api_prompt: str,
    ) -> None:
        super().__init__(name=name)
        self.prompter = HostAgentPrompter(is_visual, main_prompt, example_prompt, api_prompt)
        self.agent_factory = AgentFactory()
        self.appagent_dict = {}  # Cache of created AppAgent instances
        self._active_appagent = None
        self._blackboard = Blackboard()  # Shared coordination space
        self.set_state(self.default_state)

Key Responsibilities

Responsibility	Implementation	Example
Task Decomposition	LLM analyzes user request, breaks into subtasks	"Create report" ?["Extract data", "Generate chart", "Format document"]
Application Selection	Identifies required applications for each subtask	Subtask "Extract data" ?Microsoft Word
AppAgent Creation	Factory pattern creates AppAgent instances on-demand	`agent_factory.create_agent("app", process="WINWORD.EXE")`
Subtask Delegation	Routes subtasks to appropriate AppAgent	`next_agent() ?AppAgent(Word)`
Result Aggregation	Collects results from AppAgents via Blackboard	`blackboard.get_value("appagent/word/result")`
Multi-App Coordination	Sequences actions across multiple applications	Word → Excel → PowerPoint workflow

HostAgent Processor

class HostAgentProcessor(ProcessorTemplate):
    """
    Processor for HostAgent with specialized strategies.
    """

    def __init__(self, agent, context):
        super().__init__(agent, context)

        # DATA_COLLECTION: Get list of running applications
        self.register_strategy(
            ProcessingPhase.DATA_COLLECTION,
            HostDataCollectionStrategy(agent, context)
        )

        # LLM_INTERACTION: Application selection and task planning
        self.register_strategy(
            ProcessingPhase.LLM_INTERACTION,
            HostLLMInteractionStrategy(agent, context)
        )

        # ACTION_EXECUTION: Create AppAgent, delegate subtask
        self.register_strategy(
            ProcessingPhase.ACTION_EXECUTION,
            HostActionExecutionStrategy(agent, context)
        )

HostAgent Strategy Specializations:

DATA_COLLECTION: Uses MCP tools to observe available Windows apps
LLM_INTERACTION: Specialized prompt template for application selection:
- Input: User request + list of running apps
- Output: Selected application + decomposed subtask
ACTION_EXECUTION: Instead of executing UI commands, creates/retrieves AppAgent instance and delegates via next_agent()

HostAgent State Transitions

stateDiagram-v2 [*] --> CONTINUE: User request received CONTINUE --> CONTINUE: Select app, delegate to AppAgent CONTINUE --> FINISH: All subtasks completed CONTINUE --> CONFIRM: Need user confirmation CONTINUE --> ERROR: Application selection failed CONFIRM --> CONTINUE: User confirms CONFIRM --> FINISH: User rejects ERROR --> FINISH: Unrecoverable error FINISH --> [*] note right of CONTINUE HostAgent delegates to AppAgent via next_agent() method end note note right of FINISH HostAgent aggregates results from all AppAgents end note

HostAgent Delegation Pattern Example:

class HostAgent(BasicAgent):
    def handle(self, context: Context) -> Tuple[AgentStatus, Optional[BasicAgent]]:
        """
        Handle HostAgent state: select application and delegate.
        """
        # Execute processor strategies
        processor = HostAgentProcessor(self, context)
        result = processor.process()

        # Get selected application from LLM response
        selected_app = result.parsed_response.get("application")
        subtask = result.parsed_response.get("subtask")

        # Create or retrieve AppAgent for selected application
        appagent = self.get_or_create_appagent(selected_app)

        # Write subtask to Blackboard for AppAgent to read
        self._blackboard.add_data(
            {"subtask": subtask, "app": selected_app},
            self._blackboard.requests
        )

        # Delegate to AppAgent
        return AgentStatus.CONTINUE, appagent

    def get_or_create_appagent(self, app_name: str) -> AppAgent:
        """
        Factory method: Create AppAgent if not exists, otherwise return cached instance.
        """
        if app_name not in self.appagent_dict:
            self.appagent_dict[app_name] = self.agent_factory.create_agent(
                agent_type="app",
                name=f"AppAgent/{app_name}",
                process_name=app_name,
                app_root_name=app_name
            )
        return self.appagent_dict[app_name]

AppAgent: Application-Specific Control

The AppAgent is responsible for direct control of a specific Windows application, executing UI-based actions through Windows UI Automation APIs.

AppAgent Architecture

@AgentRegistry.register(agent_name="appagent", processor_cls=AppAgentProcessor)
class AppAgent(BasicAgent):
    """
    The AppAgent class manages interaction with a specific Windows application.
    """

    def __init__(
        self,
        name: str,
        process_name: str,
        app_root_name: str,
        is_visual: bool,
        main_prompt: str,
        example_prompt: str,
        mode: str = "normal",
    ) -> None:
        super().__init__(name=name)
        self.prompter = AppAgentPrompter(is_visual, main_prompt, example_prompt)
        self._process_name = process_name  # e.g., "WINWORD.EXE"
        self._app_root_name = app_root_name  # e.g., "Microsoft Word"
        self._mode = mode
        self.set_state(self.default_state)

Key Responsibilities

Responsibility	Implementation	Example
UI Observation	Screenshot + UI Automation tree capture	`get_ui_tree` returns hierarchical element structure
Element Identification	Parse UI tree to locate target elements	Find "Save" button by name, control type, bounding box
Action Execution	Execute UI commands via MCP tools	`click_element(element_id="save_button")`
Application Context	Maintain application-specific state	Current document, active window, focus element
Error Handling	Detect and recover from UI failures	Retry on stale element, fallback to keyboard shortcuts
Result Reporting	Write results to Blackboard for HostAgent	`blackboard.add_key_value("result", "Document saved")`

AppAgent Processor

class AppAgentProcessor(ProcessorTemplate):
    """
    Processor for AppAgent with UI-focused strategies.
    """

    def __init__(self, agent, context):
        super().__init__(agent, context)

        # DATA_COLLECTION: Screenshot + UI tree
        self.register_strategy(
            ProcessingPhase.DATA_COLLECTION,
            ComposedStrategy([
                ScreenshotStrategy(agent, context),
                UITreeStrategy(agent, context)
            ])
        )

        # LLM_INTERACTION: UI element selection
        self.register_strategy(
            ProcessingPhase.LLM_INTERACTION,
            AppAgentLLMStrategy(agent, context)
        )

        # ACTION_EXECUTION: Execute UI commands
        self.register_strategy(
            ProcessingPhase.ACTION_EXECUTION,
            UIActionExecutionStrategy(agent, context)
        )

Windows UI Automation Integration

AppAgent leverages Windows UI Automation (UIA) for robust UI control:

graph LR subgraph "AppAgent Observation" AppAgent[AppAgent] Screenshot[Screenshot Visual Context] UITree[UI Automation Tree Element Hierarchy] AppAgent --> Screenshot AppAgent --> UITree end subgraph "Windows UI Automation" UIA[UI Automation API] Elements[UI Elements Button, TextBox, etc.] Properties[Element Properties Name, Type, BoundingBox] Patterns[Control Patterns Invoke, Value, Selection] UIA --> Elements UIA --> Properties UIA --> Patterns end UITree -.Query.-> UIA subgraph "Action Execution" Commands[MCP Commands] Click[click_element] Type[type_text] Select[select_item] Commands --> Click Commands --> Type Commands --> Select end AppAgent --> Commands Commands -.Invoke.-> Patterns style AppAgent fill:#e1f5ff style UIA fill:#fff4e1 style Commands fill:#ffe1e1

UI Automation Capabilities:

Element Discovery: Traverse UI tree to find controls by name, type, automation ID
Property Access: Read element properties (text, state, position, visibility)
Pattern Invocation: Execute control-specific actions:
- InvokePattern: Click buttons, menu items
- ValuePattern: Set text in textboxes
- SelectionPattern: Select items in lists, dropdowns
- TogglePattern: Toggle checkboxes, radio buttons

AppAgent Commands

Command Category	Commands	Description
Observation	`screenshot`, `get_ui_tree`, `get_accessibility_tree`	Capture visual and structural UI information
Navigation	`click_element`, `double_click`, `right_click`	Navigate UI through mouse interactions
Text Input	`type_text`, `set_value`, `clear_text`	Input and modify text in UI controls
Selection	`select_item`, `select_dropdown`, `toggle_checkbox`	Manipulate selection controls
Scrolling	`scroll`, `scroll_to_element`	Navigate large UI areas
Window Management	`activate_window`, `close_window`, `maximize_window`	Control window state
File Operations	`open_file`, `save_file`, `save_as`	Application-specific file actions

AppAgent UI Control Pattern Example:

class AppAgent(BasicAgent):
    def handle(self, context: Context) -> Tuple[AgentStatus, Optional[BasicAgent]]:
        """
        Handle AppAgent state: Control application UI.
        """
        # Read subtask from Blackboard (written by HostAgent)
        subtask_memory = self._blackboard.requests.to_list_of_dicts()
        if subtask_memory:
            subtask = subtask_memory[-1].get("subtask")

        # Execute processor strategies
        processor = AppAgentProcessor(self, context)
        context.set(ContextNames.REQUEST, subtask)
        result = processor.process()

        # Check if subtask completed
        if result.status == AgentStatus.FINISH:
            # Write result to Blackboard
            self._blackboard.add_data(
                {"result": result.parsed_response.get("result")},
                self._blackboard.trajectories
            )

            # Return to HostAgent
            return AgentStatus.FINISH, self.parent_agent

        return result.status, None

Linux Platform: Single-Tier Agent System

Linux implements a single-tier architecture where LinuxAgent directly executes shell commands without hierarchical delegation.

LinuxAgent Architecture

graph TB subgraph "Linux Single-Tier System" User[User Request: 'Check server logs and restart service'] LinuxAgent[LinuxAgent Shell Command Executor] Shell[Linux Shell bash, zsh, etc.] User --> LinuxAgent LinuxAgent -->|1. Execute: tail /var/log/app.log| Shell Shell -.Output.-> LinuxAgent LinuxAgent -->|2. Execute: systemctl restart app| Shell Shell -.Status.-> LinuxAgent end subgraph "LinuxAgent Capabilities" ShellExec[Shell Command Execution] OutputParse[Output Parsing] ChainCmd[Command Chaining] ErrorHandle[Error Detection] end LinuxAgent --> ShellExec LinuxAgent --> OutputParse LinuxAgent --> ChainCmd LinuxAgent --> ErrorHandle style LinuxAgent fill:#c8e6c9 style Shell fill:#fff4e1

@AgentRegistry.register(
    agent_name="LinuxAgent",
    third_party=True,
    processor_cls=LinuxAgentProcessor
)
class LinuxAgent(CustomizedAgent):
    """
    LinuxAgent is a specialized agent that interacts with Linux systems.
    Executes shell commands and parses output.
    """

    def __init__(
        self,
        name: str,
        main_prompt: str,
        example_prompt: str,
    ) -> None:
        super().__init__(
            name=name,
            main_prompt=main_prompt,
            example_prompt=example_prompt,
            process_name=None,
            app_root_name=None,
            is_visual=None  # LinuxAgent typically operates without visual mode
        )
        self._blackboard = Blackboard()
        self.set_state(ContinueLinuxAgentState())

Key Differences from Windows Agents

Aspect	Windows (HostAgent + AppAgent)	Linux (LinuxAgent)
Hierarchy	Two-tier (delegation pattern)	Single-tier (direct execution)
Observation	Screenshot + UI Automation tree	Shell command output (stdout/stderr)
Action Mechanism	UI element manipulation	Shell command execution
Context Tracking	Application windows, UI state	Command history, working directory
Error Detection	UI element not found, timeout	Exit code ?0, stderr output
Coordination	Via Blackboard between HostAgent and AppAgent	Via Blackboard with other devices (cross-device)

LinuxAgent Processor

class LinuxAgentProcessor(ProcessorTemplate):
    """
    Processor for LinuxAgent with shell-focused strategies.
    """

    def __init__(self, agent, context):
        super().__init__(agent, context)

        # DATA_COLLECTION: No visual observation, use command output from previous step
        self.register_strategy(
            ProcessingPhase.DATA_COLLECTION,
            LinuxDataCollectionStrategy(agent, context)  # Collects shell output
        )

        # LLM_INTERACTION: Command generation from request
        self.register_strategy(
            ProcessingPhase.LLM_INTERACTION,
            LinuxLLMStrategy(agent, context)  # Generates shell commands
        )

        # ACTION_EXECUTION: Execute shell commands
        self.register_strategy(
            ProcessingPhase.ACTION_EXECUTION,
            ShellExecutionStrategy(agent, context)  # Executes via shell_execute
        )

LinuxAgent Commands

Command	Function	Example
`shell_execute`	Execute shell command (non-blocking)	`shell_execute(command="ls -la /home/user")`
`shell_execute_read`	Execute command and capture output	`shell_execute_read(command="cat /var/log/app.log")`
`get_accessibility_tree`	Get GUI app accessibility tree (X11)	`get_accessibility_tree()` for GUI apps
`screenshot`	Capture screen (optional, for GUI)	`screenshot()`

LinuxAgent Best Practices:

Command Chaining: Use && and || for robust workflows:

cd /app && ./deploy.sh || echo "Deployment failed"

Output Parsing: Parse stdout for structured data:

output = shell_execute_read("systemctl status app")
if "active (running)" in output:
    # Service is running

Error Handling: Check exit codes and stderr:

result = shell_execute("restart_service.sh")
if result.status == ResultStatus.FAILURE:
    # Handle error from stderr

Idempotency: Design commands to be safely re-runnable:

# Good: Check before creating
[ -d /app/backup ] || mkdir -p /app/backup

# Bad: Fails if directory exists
mkdir /app/backup

LinuxAgent Cross-Device Coordination Example:

# Windows HostAgent prepares data for Linux processing
windows_blackboard.add_data(
    {"data_file": "C:/export/data.csv", "ready": True},
    windows_blackboard.requests
)

# LinuxAgent polls Blackboard for task availability
requests = linux_blackboard.requests.to_list_of_dicts()
if requests and requests[-1].get("ready"):
    # Download data from Windows device (via network share or AIP)
    await linux_agent.execute_command(
        "scp user@windows-pc:/c/export/data.csv /tmp/data.csv"
    )

    # Process data
    await linux_agent.execute_command(
        "python3 /app/process.py /tmp/data.csv"
    )

    # Report completion
    linux_blackboard.add_data(
        {"status": "completed"},
        linux_blackboard.trajectories
    )

Multi-Agent Coordination Patterns

The three-layer architecture enables seamless coordination across different agent types through Blackboard-based communication.

Pattern 1: Windows Multi-App Workflow

sequenceDiagram participant User participant HostAgent participant AppAgentWord participant AppAgentExcel participant Blackboard User->>HostAgent: "Extract data from Word, create chart in Excel" HostAgent->>HostAgent: Decompose task HostAgent->>Blackboard: Write subtask_1: "Extract data" HostAgent->>AppAgentWord: Delegate (next_agent) AppAgentWord->>AppAgentWord: Observe Word UI AppAgentWord->>AppAgentWord: Execute: select_text, copy AppAgentWord->>Blackboard: Write result: extracted_data AppAgentWord->>HostAgent: Return (next_agent) HostAgent->>Blackboard: Read extracted_data HostAgent->>Blackboard: Write subtask_2: "Create chart" HostAgent->>AppAgentExcel: Delegate (next_agent) AppAgentExcel->>Blackboard: Read extracted_data AppAgentExcel->>AppAgentExcel: Execute: paste, select_range, insert_chart AppAgentExcel->>Blackboard: Write result: chart_created AppAgentExcel->>HostAgent: Return (next_agent) HostAgent->>User: Task completed

Pattern 2: Cross-Device Linux-Windows Coordination

sequenceDiagram participant WindowsHost participant WindowsApp participant Blackboard participant LinuxAgent WindowsHost->>WindowsApp: "Export sales data to CSV" WindowsApp->>WindowsApp: Execute: export to C:/data/sales.csv WindowsApp->>Blackboard: Write: data_ready=true, path=C:/data/sales.csv LinuxAgent->>Blackboard: Poll: data_ready? Blackboard-->>LinuxAgent: data_ready=true LinuxAgent->>LinuxAgent: Execute: scp windows-pc:/c/data/sales.csv /tmp/ LinuxAgent->>LinuxAgent: Execute: python3 analyze.py /tmp/sales.csv LinuxAgent->>Blackboard: Write: analysis_complete=true, results=/tmp/report.pdf WindowsHost->>Blackboard: Read: analysis_complete? Blackboard-->>WindowsHost: analysis_complete=true, results=/tmp/report.pdf WindowsHost->>LinuxAgent: Request: Download /tmp/report.pdf

Pattern 3: Parallel Multi-Device Tasks

graph TB subgraph "Orchestrator (HostAgent)" Orchestrator[HostAgent Task Coordinator] end subgraph "Device 1: Windows Desktop" AppAgent1[AppAgent PowerPoint] end subgraph "Device 2: Windows Laptop" AppAgent2[AppAgent Excel] end subgraph "Device 3: Linux Server" LinuxAgent1[LinuxAgent Data Processing] end subgraph "Shared Blackboard" BB[Blackboard Coordination Space] end Orchestrator -->|Subtask 1: Create slides| AppAgent1 Orchestrator -->|Subtask 2: Generate charts| AppAgent2 Orchestrator -->|Subtask 3: Process data| LinuxAgent1 AppAgent1 -.Write result.-> BB AppAgent2 -.Write result.-> BB LinuxAgent1 -.Write result.-> BB BB -.Aggregate results.-> Orchestrator style Orchestrator fill:#fff4e1 style AppAgent1 fill:#e1f5ff style AppAgent2 fill:#e1f5ff style LinuxAgent1 fill:#c8e6c9 style BB fill:#ffe1e1

Platform Extensibility: Adding New Platforms

The three-layer architecture provides a clear path for extending UFO3 to new platforms:

Extension Checklist

Steps to Add a New Platform:

Define Agent Class

@AgentRegistry.register(
    agent_name="MacOSAgent",
    processor_cls=MacOSAgentProcessor
)
class MacOSAgent(BasicAgent):
    # Implement platform-specific initialization

Implement Platform-Specific Strategies
- DATA_COLLECTION: How to observe system state (screenshots, accessibility tree, shell output)
- LLM_INTERACTION: Adapt prompt template for platform capabilities
- ACTION_EXECUTION: Map actions to platform APIs (AppKit, Accessibility API, etc.)
- MEMORY_UPDATE: Standard implementation (usually no changes needed)

Define Platform Commands (MCP Tools)

# macOS-specific commands
commands = [
    "applescript_execute",  # Execute AppleScript
    "accessibility_tree",   # macOS Accessibility API
    "click_element",        # macOS UI control
    "type_text"             # Text input
]

Implement AgentState Subclasses (if needed)

class ContinueMacOSAgentState(AgentState):
    def handle(self, agent, context):
        # macOS-specific state handling

Create Platform-Specific Processor

class MacOSAgentProcessor(ProcessorTemplate):
    def __init__(self, agent, context):
        super().__init__(agent, context)
        self.register_strategy(
            ProcessingPhase.DATA_COLLECTION,
            MacOSDataCollectionStrategy(agent, context)
        )
        # Register other strategies...

Configure MCP Server (on device client)
- Implement MCP tools for platform-specific operations
- Register tools with MCP server manager
- Ensure AIP client routes commands correctly

Platform-Specific Considerations

Platform	Key Considerations	Suggested Implementation
macOS	Accessibility API, AppleScript, window management	MacOSAgent (single-tier), AppleScript execution strategy
Android	Activity lifecycle, UI Automator, touch gestures	AndroidAgent (single-tier), UI Automator integration
iOS	Accessibility, XCTest, limited automation	iOSAgent (single-tier), XCTest framework
Embedded	Limited resources, no GUI, command-line only	EmbeddedAgent (minimal strategies, shell-based)
Web	Browser automation, DOM manipulation	WebAgent (Selenium/Playwright integration)

Example: Adding macOS Support

# 1. Define macOS Agent
@AgentRegistry.register(
    agent_name="MacOSAgent",
    processor_cls=MacOSAgentProcessor
)
class MacOSAgent(BasicAgent):
    def __init__(self, name: str, main_prompt: str, example_prompt: str):
        super().__init__(name=name)
        self.prompter = MacOSAgentPrompter(main_prompt, example_prompt)
        self.set_state(ContinueMacOSAgentState())

# 2. Implement macOS-specific DATA_COLLECTION strategy
class MacOSDataCollectionStrategy(ProcessingStrategy):
    def execute(self, context: ProcessingContext):
        # Use macOS Accessibility API
        commands = [
            Command(tool_name="get_accessibility_tree", parameters={}, tool_type="data_collection"),
            Command(tool_name="screenshot", parameters={}, tool_type="data_collection")
        ]
        results = self.dispatcher.execute_commands(commands)

        context.set_local("accessibility_tree", results[0].result)
        context.set_local("screenshot", results[1].result)

# 3. Implement macOS-specific ACTION_EXECUTION strategy
class MacOSActionExecutionStrategy(ProcessingStrategy):
    def execute(self, context: ProcessingContext):
        action = context.get_global("action")

        if action == "click_element":
            # Use macOS Accessibility API via MCP tool
            command = Command(
                tool_name="macos_click_element",
                parameters={"element_id": context.get_global("element_id")},
                tool_type="action"
            )
        elif action == "applescript_execute":
            # Execute AppleScript via MCP tool
            command = Command(
                tool_name="applescript_execute",
                parameters={"script": context.get_global("applescript")},
                tool_type="action"
            )

        results = self.dispatcher.execute_commands([command])
        context.set_local("execution_results", results)

# 4. Configure MCP tools on macOS device client
# In device client code:
mcp_server_manager.register_tool(
    MCPToolInfo(
        tool_name="macos_click_element",
        description="Click element via macOS Accessibility API",
        input_schema={
            "element_id": {"type": "string", "description": "Accessibility element ID"}
        },
        # ... other fields
    ),
    handler=macos_accessibility_click_handler
)

Agent Lifecycle Comparison

Windows HostAgent Lifecycle

sequenceDiagram participant Session participant HostAgent participant AppAgent participant Blackboard Session->>HostAgent: Initialize (user request) HostAgent->>HostAgent: Set state = ContinueHostAgentState loop Until Task Complete HostAgent->>HostAgent: Execute HostAgentProcessor HostAgent->>HostAgent: LLM selects application HostAgent->>HostAgent: Create/retrieve AppAgent HostAgent->>Blackboard: Write subtask HostAgent->>AppAgent: Delegate (next_agent) AppAgent->>Blackboard: Read subtask AppAgent->>AppAgent: Execute AppAgentProcessor AppAgent->>Blackboard: Write result AppAgent->>HostAgent: Return (next_agent) HostAgent->>Blackboard: Read result HostAgent->>HostAgent: Update task status end HostAgent->>Session: Return AgentStatus.FINISH

Linux LinuxAgent Lifecycle

sequenceDiagram participant Session participant LinuxAgent participant Shell Session->>LinuxAgent: Initialize (user request) LinuxAgent->>LinuxAgent: Set state = ContinueLinuxAgentState loop Until Task Complete LinuxAgent->>LinuxAgent: Execute LinuxAgentProcessor LinuxAgent->>LinuxAgent: LLM generates shell command LinuxAgent->>Shell: Execute command Shell-->>LinuxAgent: Return output (stdout/stderr) LinuxAgent->>LinuxAgent: Parse output LinuxAgent->>LinuxAgent: Update context with result LinuxAgent->>LinuxAgent: Check task completion end LinuxAgent->>Session: Return AgentStatus.FINISH

Performance and Scalability

Metric	Windows (Two-Tier)	Linux (Single-Tier)	Notes
Agent Initialization	~500ms (HostAgent) + ~300ms per AppAgent	~200ms (LinuxAgent)	AppAgent creation overhead for each application
Observation Latency	~1-2s (screenshot + UI tree)	~100-500ms (shell output)	UI Automation API slower than shell
Action Execution	~200-500ms per UI action	~50-200ms per shell command	UI actions require element discovery
Memory Footprint	~50MB (HostAgent) + ~30MB per AppAgent	~20MB (LinuxAgent)	UI Automation increases memory usage
Scalability	Limited by number of AppAgents	Handles many parallel commands	HostAgent manages AppAgent pool
Coordination Overhead	Blackboard read/write per delegation	Minimal (only cross-device)	Two-tier hierarchy increases communication

Performance Optimization:

Windows: Reuse AppAgent instances across subtasks (cached in appagent_dict)
Linux: Batch multiple shell commands with && to reduce round trips
Cross-Platform: Minimize Blackboard writes; use hierarchical keys for efficient reads

Best Practices

Windows Agent Best Practices

HostAgent:

AppAgent Caching: Reuse AppAgent instances for same application to avoid recreation overhead
Task Decomposition: Break complex tasks into independent subtasks for parallel execution
Blackboard Namespacing: Use clear keys within appropriate memory sections
Error Propagation: Detect AppAgent failures and retry with different strategy

AppAgent:

Element Stability: Wait for UI elements to stabilize before interaction (use wait_for_element)
Fallback Actions: If UI Automation fails, fallback to keyboard shortcuts (e.g., Ctrl+S instead of clicking Save button)
Context Awareness: Track active window and focus to ensure actions target correct application
Idempotent Actions: Design actions to be safely retryable (e.g., check if file exists before creating)

Linux Agent Best Practices

LinuxAgent:

Command Validation: Validate commands before execution to prevent injection attacks
Output Parsing: Use structured output formats (JSON, CSV) instead of parsing raw text
Error Detection: Check exit codes ($?) and stderr for failure detection
Idempotency: Use conditional commands ([ -f file ] || create_file) to safely re-run workflows
Resource Cleanup: Always clean up temporary files and processes after task completion

Cross-Platform Best Practices

Multi-Agent Coordination:

Blackboard Keys: Use appropriate memory sections to separate agent-specific data:

# Good - using structured memory sections
blackboard.add_data({"status": "ready"}, blackboard.requests)
blackboard.add_data({"status": "processing"}, blackboard.trajectories)

# Bad - unclear categorization
blackboard.add_data({"status": "ready"}, blackboard.questions)

Synchronization: Use polling or event-based patterns for cross-device synchronization:

# Polling pattern
while not any(r.get("task_complete") for r in blackboard.requests.to_list_of_dicts()):
    await asyncio.sleep(1)

# Event-based (via AIP custom messages)
# Linux device sends completion event
aip_client.send_event("task_complete", {...})

Data Transfer: For large data, use shared storage (network drive, S3) instead of Blackboard:

# Bad: Store large data in Blackboard
blackboard.add_data({"dataset": [1000000 rows]}, blackboard.trajectories)

# Good: Store reference to shared storage
blackboard.add_data({"dataset_path": "s3://bucket/data.csv"}, blackboard.requests)

Device Agent Overview - Three-layer architecture and design principles
Server-Client Architecture - Server and client separation
State Layer - AgentState interface and state machine
Processor and Strategy Layer - ProcessorTemplate and strategy implementations
Command Layer - CommandDispatcher and MCP integration
Memory System - Memory and Blackboard for agent coordination
Server Architecture - Server-side orchestration
Client Architecture - Device client MCP execution
AIP Protocol - Agent Interaction Protocol for communication

Summary

Key Takeaways:

Windows Two-Tier Hierarchy: HostAgent (orchestration) + AppAgent (application control) for GUI workflows
Linux Single-Tier System: LinuxAgent executes shell commands directly for command-line tasks
Unified Framework: Both platforms leverage same three-layer architecture (State, Processor, Command)
Multi-Agent Coordination: Blackboard enables seamless coordination across HostAgent → AppAgent and cross-device communication
Platform Extensibility: Clear extension path for macOS, Android, iOS, embedded systems
HostAgent Responsibilities: Task decomposition, application selection, AppAgent creation, subtask delegation
AppAgent Capabilities: UI observation (screenshot + UI Automation), element identification, UI action execution
LinuxAgent Characteristics: Shell command execution, output parsing, idempotent workflows
Best Practices: AppAgent caching, appropriate Blackboard usage, idempotent commands, structured output parsing
Performance: Windows UI Automation slower but more robust; Linux shell commands faster but less structured

UFO3's platform-specific agent implementations demonstrate the flexibility and extensibility of the three-layer architecture, enabling cross-platform and cross-device task automation while maintaining consistent design principles and coordination mechanisms.

Platform-Specific Agent Implementations

Overview

Platform Comparison

Windows Platform: Two-Tier Agent Hierarchy

Architecture Overview

HostAgent: Application Selection and Task Orchestration

HostAgent Architecture

Key Responsibilities

HostAgent Processor

HostAgent State Transitions

AppAgent: Application-Specific Control

AppAgent Architecture

Key Responsibilities

AppAgent Processor

Windows UI Automation Integration

AppAgent Commands

Linux Platform: Single-Tier Agent System

LinuxAgent Architecture

Key Differences from Windows Agents

LinuxAgent Processor

LinuxAgent Commands

Multi-Agent Coordination Patterns

Pattern 1: Windows Multi-App Workflow

Pattern 2: Cross-Device Linux-Windows Coordination

Pattern 3: Parallel Multi-Device Tasks

Platform Extensibility: Adding New Platforms

Extension Checklist

Platform-Specific Considerations

Agent Lifecycle Comparison

Windows HostAgent Lifecycle

Linux LinuxAgent Lifecycle

Performance and Scalability

Best Practices

Windows Agent Best Practices

Linux Agent Best Practices

Cross-Platform Best Practices

Related Documentation

Summary