Agent Interaction Protocol (AIP)

The orchestration model requires a communication substrate that remains correct under continuous DAG evolution, dynamic agent participation, and fine-grained event propagation. Legacy HTTP-based coordination approaches (e.g., A2A, ACP) assume short-lived, stateless interactions, incurring handshake overhead, stale capability views, and fragile recovery when partial failures occur mid-task. These assumptions make them unsuitable for the continuously evolving workflows and long-running reasoning loops characteristic of UFO².

Design Overview

AIP serves as the nervous system of UFO², connecting the ConstellationClient, device agent services, and device clients under a unified, event-driven control plane. It is designed as a lightweight yet evolution-tolerant protocol to satisfy six goals:

Design Goals:

(G1) Maintain persistent bidirectional sessions to eliminate per-request overhead
(G2) Unify heterogeneous capability discovery via multi-source profiling
(G3) Ensure fine-grained reliability through heartbeats and timeout managers for disconnection and failure detection
(G4) Preserve deterministic command ordering within sessions
(G5) Support composable extensibility for new message types and resilience strategies
(G6) Provide transparent reconnection and task continuity under transient failures

Legacy HTTP Coordination	AIP WebSocket-Based Design
❌ Short-lived requests	✅ Persistent sessions (G1)
❌ Stateless interactions	✅ Session-aware task management
❌ High latency overhead	✅ Low-latency event streaming
❌ Poor reconnection support	✅ Seamless recovery from disconnections (G6)
❌ Manual state synchronization	✅ Automatic DAG state propagation
❌ Fragile partial failures	✅ Fine-grained reliability (G3)

Five-Layer Architecture

To meet these requirements, AIP adopts a persistent, bidirectional WebSocket transport and decomposes the orchestration substrate into five logical strata, each responsible for a distinct aspect of reliability and adaptability. The architecture establishes a complete substrate where L1 defines semantic contracts, L2 provides transport flexibility, L3 implements protocol logic, L4 ensures operational resilience, and L5 delivers deployment-ready orchestration primitives.

Architecture Diagram:

The following diagram illustrates the five-layer architecture and the roles of each component:

[Figure: AIP Architecture]

Layer 1: Message Schema Layer

Defines strongly-typed, Pydantic-validated contracts (ClientMessage, ServerMessage) for message direction, purpose, and task transitions. Every message is validated at the schema level, so malformed messages never enter the protocol pipeline. This enables early error detection and simplifies debugging.

Responsibility	Implementation	Supports
Message contracts	Pydantic models with validation	Human-readable + machine-verifiable
Structured metadata	System info, capabilities	Unified capability discovery (G2)
ID correlation	Explicit request/response linking	Deterministic ordering (G4)
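
As a rough illustration of schema-level rejection, the sketch below uses standard-library dataclasses as a stand-in for the Pydantic models. The class and function names (`ClientMessageSketch`, `parse_client_message`) are hypothetical, and only the message type names come from this document.

```python
from dataclasses import dataclass
from enum import Enum


class ClientMessageType(str, Enum):
    # Client-originated message types listed in this document.
    REGISTER = "REGISTER"
    COMMAND_RESULTS = "COMMAND_RESULTS"
    TASK_END = "TASK_END"
    HEARTBEAT = "HEARTBEAT"
    DEVICE_INFO_RESPONSE = "DEVICE_INFO_RESPONSE"
    ERROR = "ERROR"


@dataclass(frozen=True)
class ClientMessageSketch:
    """Hypothetical, reduced message contract (not the real ClientMessage)."""

    type: ClientMessageType
    session_id: str

    def __post_init__(self) -> None:
        # Coerce raw strings to the enum; unknown types raise ValueError here,
        # before the message can reach any handler.
        object.__setattr__(self, "type", ClientMessageType(self.type))
        if not self.session_id:
            raise ValueError("session_id must be non-empty")


def parse_client_message(raw: dict) -> ClientMessageSketch:
    """Validate an already-decoded JSON object and return a typed message."""
    return ClientMessageSketch(
        type=raw.get("type"),
        session_id=raw.get("session_id", ""),
    )
```

A well-formed `HEARTBEAT` dictionary parses into a typed object, while an unknown `type` or a missing `session_id` raises `ValueError` at the boundary.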

Layer 2: Transport Abstraction Layer

Provides a protocol-agnostic Transport interface with a production-grade WebSocket implementation. The abstraction layer allows transports to be swapped without changing protocol logic, which supports future protocol evolution.

Feature	Benefit	Goals
Configurable pings/timeouts	Connection health monitoring	G3
Large payload support	Handles complex task definitions	G1
Decoupled transport logic	Future extensibility (HTTP/3, gRPC)	G5
Low-latency persistent sessions	Eliminates per-request overhead	G1
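
To make the decoupling concrete, here is a minimal sketch of what a transport interface could look like. The method names (`send`, `receive`, `close`) and the `InMemoryTransport` class are assumptions for illustration, and the actual AIP `Transport` interface may differ.

```python
import asyncio
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical transport contract; protocol code depends only on this."""

    @abstractmethod
    async def send(self, data: str) -> None: ...

    @abstractmethod
    async def receive(self) -> str: ...

    @abstractmethod
    async def close(self) -> None: ...


class InMemoryTransport(Transport):
    """Loopback transport for tests: whatever is sent can be received."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._closed = False

    async def send(self, data: str) -> None:
        if self._closed:
            raise ConnectionError("transport is closed")
        await self._queue.put(data)

    async def receive(self) -> str:
        return await self._queue.get()

    async def close(self) -> None:
        self._closed = True


async def echo_once(transport: Transport, payload: str) -> str:
    """Protocol-level logic that never names a concrete transport."""
    await transport.send(payload)
    return await transport.receive()


async def demo() -> str:
    # Create the transport inside the running event loop.
    return await echo_once(InMemoryTransport(), "ping")
```

Because `echo_once` only sees the abstract interface, a WebSocket-backed class could replace the loopback one without touching protocol code.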

Layer 3: Protocol Orchestration Layer

Implements modular handlers for registration, task execution, heartbeat, and command dispatch. Each handler is independently testable and replaceable, supporting composable extensibility (G5) while maintaining ordered state transitions (G4).

Component	Purpose	Design
`AIPProtocol` base	Common handler infrastructure	Extensible base class
Handler modules	Registration, tasks, heartbeat, commands	Pluggable handlers
Middleware hooks	Logging, metrics, authentication	Composable extensions (G5)
State transitions	Ordered message processing	Deterministic ordering (G4)
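
The handler and middleware arrangement can be pictured with the small dispatcher below. It is a simplified sketch and not the `AIPProtocol` class: `MiniProtocol`, `add_middleware`, and `dispatch` are hypothetical names, and only `register_handler` mirrors a call shown in this document.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Message = Dict[str, Any]
Handler = Callable[[Message], Awaitable[Message]]
Middleware = Callable[[Message], Awaitable[Message]]


class MiniProtocol:
    """Hypothetical dispatcher: middleware first, then a per-type handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._middleware: List[Middleware] = []

    def register_handler(self, msg_type: str, handler: Handler) -> None:
        self._handlers[msg_type] = handler

    def add_middleware(self, middleware: Middleware) -> None:
        self._middleware.append(middleware)

    async def dispatch(self, msg: Message) -> Message:
        # Middleware runs in registration order before the handler sees the message.
        for mw in self._middleware:
            msg = await mw(msg)
        handler = self._handlers.get(str(msg.get("type")))
        if handler is None:
            return {"type": "ERROR", "error": f"no handler for {msg.get('type')!r}"}
        return await handler(msg)


async def demo() -> Dict[str, Any]:
    seen: List[str] = []

    async def audit(msg: Message) -> Message:
        seen.append(str(msg["type"]))
        return msg

    async def on_heartbeat(msg: Message) -> Message:
        return {"type": "HEARTBEAT", "ack": True}

    protocol = MiniProtocol()
    protocol.add_middleware(audit)
    protocol.register_handler("HEARTBEAT", on_heartbeat)
    reply = await protocol.dispatch({"type": "HEARTBEAT"})
    return {"reply": reply, "seen": seen}
```

Each handler is a plain coroutine, so it can be unit-tested in isolation and replaced without touching the dispatcher.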

Related Documentation:

Complete message reference
Protocol implementation details

Layer 4: Resilience and Health Management Layer

Fault Tolerance

This layer guarantees fine-grained reliability (G3) and seamless task continuity under transient disconnections (G6), preventing cascade failures.

Encapsulates reliability mechanisms ensuring operational continuity under failures:

Component	Mechanism	Goals
`HeartbeatManager`	Periodic keepalive signals	G3
`TimeoutManager`	Configurable timeout policies	G3
`ReconnectionStrategy`	Exponential backoff with jitter	G6
Session recovery	Automatic state restoration	G6
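
Exponential backoff with jitter is a well-known retry pattern. The generator below is one common variant (capped backoff with "full jitter"), shown only as a sketch. The parameter names and default values are assumptions and do not reflect the actual `ReconnectionStrategy` configuration.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 30.0,
    max_attempts: int = 5,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one sleep duration (seconds) per reconnection attempt."""
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        # The ceiling grows geometrically but never exceeds max_delay.
        ceiling = min(max_delay, base * factor ** attempt)
        # Full jitter: pick uniformly in [0, ceiling] so that many clients
        # dropped at the same moment do not all retry in lockstep.
        yield rng.uniform(0.0, ceiling)
```

A caller would sleep for each yielded delay between reconnection attempts and give up, or escalate to device removal, once the generator is exhausted.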

→ Resilience implementation details

Layer 5: Endpoint Orchestration Layer

Provides role-specific facades integrating lower layers into deployable components. These endpoints unify connection lifecycle, task routing, and health monitoring across roles, reinforcing G1–G6 through consistent implementation of lower-layer capabilities.

Endpoint	Role	Responsibilities
`ConstellationEndpoint`	Orchestrator	Global agent registry, task assignment, DAG coordination
`DeviceServerEndpoint`	Server	WebSocket connection management, task dispatch, result aggregation
`DeviceClientEndpoint`	Executor	Local task execution, MCP tool invocation, telemetry reporting

Endpoint Integration Benefits:

✅ Connection lifecycle management (G1, G6)
✅ Role-specific protocol variants (G5)
✅ Health monitoring integration (G3)
✅ Task routing and session management (G4)

→ Endpoint setup guide

Architecture Benefits

Together, these layers form a vertically integrated stack that enables UFO² to maintain correctness and availability under challenging conditions:

Challenge	How AIP Addresses It	Layers Involved
DAG Evolution	Deterministic ordering, extensible message types	L1, L3, L4, L5 (G4, G5)
Agent Churn	Heartbeats, reconnection, session recovery	L4, L5 (G3, G6)
Heterogeneous Environments	Persistent sessions, multi-source profiling	L1, L2, L5 (G1, G2)
Transient Failures	Timeout management, automatic recovery	L4 (G3, G6)
Protocol Evolution	Transport abstraction, middleware hooks	L2, L3 (G5)

AIP transforms distributed workflow execution into a coherent, safe, and adaptive system where reasoning and execution converge seamlessly across diverse agents and environments.

Core Capabilities

Agent Registration & Profiling

Each agent is represented by an AgentProfile combining data from three sources for comprehensive capability discovery, supporting heterogeneous capability unification (G2):

Source	Provider	Information
User Config	ConstellationClient	Endpoint URLs, user preferences, device identity
Service Manifest	Device Agent Service	Supported tools, capabilities, operational metadata
Client Telemetry	Device Agent Client	OS, hardware specs, GPU status, runtime metrics

Benefits of Multi-Source Profiling:

✅ Accurate task allocation based on real-time capabilities (G2)
✅ Transparent adaptation to environmental changes (e.g., GPU availability)
✅ No manual updates needed when device state changes
✅ Informed scheduling decisions at scale

Dynamic Profile Updates

Client telemetry continuously refreshes, so the orchestrator always sees current device state—critical for GPU-aware scheduling or cross-device load balancing (G2).
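
The three-source merge and the telemetry refresh can be sketched as follows. The field names (`device_id`, `endpoint_url`, `capabilities`, `telemetry`) and both helper functions are illustrative assumptions and do not describe the real `AgentProfile` schema.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical, reduced profile combining the three sources."""

    device_id: str
    endpoint_url: str
    capabilities: List[str] = field(default_factory=list)
    telemetry: Dict[str, Any] = field(default_factory=dict)


def build_profile(
    user_config: Dict[str, Any],
    service_manifest: Dict[str, Any],
    client_telemetry: Dict[str, Any],
) -> AgentProfile:
    """Combine user config, service manifest, and client telemetry."""
    return AgentProfile(
        device_id=user_config["device_id"],
        endpoint_url=user_config["endpoint_url"],
        capabilities=sorted(set(service_manifest.get("capabilities", []))),
        telemetry=dict(client_telemetry),
    )


def refresh_telemetry(profile: AgentProfile, update: Dict[str, Any]) -> AgentProfile:
    """Return a new profile with newer telemetry values overriding older ones."""
    return replace(profile, telemetry={**profile.telemetry, **update})
```

For example, a device first reported with `{"gpu_available": False}` becomes eligible for GPU tasks after `refresh_telemetry(profile, {"gpu_available": True})`, with no change to the user config or the manifest.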

→ See detailed registration flow

Task Dispatch & Result Delivery

AIP uses long-lived WebSocket sessions that span multiple task executions, eliminating per-request connection overhead and preserving context (G1).

Task Execution Sequence:

The following sequence diagram shows the complete lifecycle of a task from assignment to completion, including intermediate execution steps and state updates:

sequenceDiagram
    participant CC as ConstellationClient
    participant DAS as Device Service
    participant DAC as Device Client
    CC->>DAS: TASK message (TaskStar)
    DAS->>DAC: Stream task payload
    DAC->>DAC: Execute using MCP tools
    DAC->>DAS: Stream execution logs
    DAS->>CC: TASK_END (status, logs, results)
    CC->>CC: Update TaskConstellation
    CC->>CC: Notify ConstellationAgent

Each arrow represents a message exchange, with vertical lifelines showing the temporal ordering of events. Note how logs stream back during execution, enabling real-time monitoring.

Stage	Message Type	Content
Assignment	`TASK`	TaskStar definition, target device, commands
Execution	(internal)	MCP tool invocations, local computation
Reporting	`TASK_END`	Status, logs, evaluator outputs, results

Asynchronous Execution

Tasks execute asynchronously. The orchestrator may assign multiple tasks to different devices simultaneously, with results arriving in non-deterministic order.
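
A small `asyncio` sketch shows why results arrive in completion order rather than assignment order. `run_on_device` is a stand-in that sleeps instead of exchanging real `TASK` / `TASK_END` messages, and all names and durations here are hypothetical.

```python
import asyncio
from typing import Dict, List, Tuple


async def run_on_device(device_id: str, task_id: str, seconds: float) -> Tuple[str, str]:
    # Stand-in for sending TASK and awaiting TASK_END; the sleep simulates work.
    await asyncio.sleep(seconds)
    return task_id, f"completed on {device_id}"


async def dispatch_all(assignments: Dict[str, Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Start every task at once and collect results as each one finishes."""
    pending = [
        asyncio.create_task(run_on_device(device, task_id, seconds))
        for task_id, (device, seconds) in assignments.items()
    ]
    results: List[Tuple[str, str]] = []
    for finished in asyncio.as_completed(pending):
        results.append(await finished)
    return results
```

Calling `asyncio.run(dispatch_all({"t1": ("laptop", 0.05), "t2": ("phone", 0.01)}))` returns both results, with the shorter task typically listed first even though it was assigned second.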

Command Execution

Within each task, AIP executes individual commands deterministically with preserved ordering, enabling precise control and error handling (G4).

Command Structure:

Field	Purpose	Example
`tool_name`	Tool/action name	`"click_input"`
`parameters`	Typed arguments	`{"target": "Save Button", "button": "left"}`
`tool_type`	Category	`"action"` or `"data_collection"`
`call_id`	Unique identifier	`"cmd_001"`

Execution Guarantees:

✅ Sequential execution within a session (deterministic order) (G4)
✅ Command batching supported (reduces network overhead)
✅ Structured results with status codes and error details
✅ Timeout propagation for precise recovery strategies (G3)

Command Batching Example:

{
  "actions": [
    {"tool_name": "click", "parameters": {"target": "File"}, "call_id": "1"},
    {"tool_name": "click", "parameters": {"target": "Save As"}, "call_id": "2"},
    {"tool_name": "type", "parameters": {"text": "document.pdf"}, "call_id": "3"}
  ]
}

All three commands are sent in one message and executed sequentially.
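
On the receiving side, a batch like the one above could be processed by a loop of the following shape. This is a sketch under assumptions: the `status` values, the result dictionary layout, and the choice to continue after a failed command are illustrative and not taken from the AIP implementation.

```python
from typing import Any, Callable, Dict, List

Tool = Callable[..., Any]


def run_batch(actions: List[Dict[str, Any]], tools: Dict[str, Tool]) -> List[Dict[str, Any]]:
    """Execute commands in list order and return one result per call_id."""
    results: List[Dict[str, Any]] = []
    for action in actions:
        call_id = action["call_id"]
        tool = tools.get(action["tool_name"])
        if tool is None:
            results.append({
                "call_id": call_id,
                "status": "failure",
                "error": f"unknown tool: {action['tool_name']}",
            })
            continue
        try:
            output = tool(**action.get("parameters", {}))
            results.append({"call_id": call_id, "status": "success", "result": output})
        except Exception as exc:
            # Report the failure as a structured result instead of aborting the batch.
            results.append({"call_id": call_id, "status": "failure", "error": str(exc)})
    return results
```

Because each result carries the originating `call_id`, the sender can match outcomes to commands even when some commands in the batch fail.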

→ See command execution protocol

Message Protocol Overview

All AIP messages use Pydantic models for automatic validation, serialization, and type safety.

Bidirectional Message Types

Direction	Message Type	Purpose
Client → Server	`REGISTER`	Initial capability advertisement
	`COMMAND_RESULTS`	Return command execution results
	`TASK_END`	Notify task completion
	`HEARTBEAT`	Keepalive signal
	`DEVICE_INFO_RESPONSE`	Device telemetry update
Server → Client	`TASK`	Task assignment
	`COMMAND`	Command execution request
	`DEVICE_INFO_REQUEST`	Request telemetry refresh
	`HEARTBEAT`	Keepalive acknowledgment
Bidirectional	`ERROR`	Error condition reporting

Message Correlation:

Every message includes:

timestamp: ISO 8601 formatted
request_id / response_id: Unique identifier
prev_response_id: Links responses to requests
session_id: Session context
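
The correlation fields can be illustrated with two reduced message classes. This sketch assumes that server messages carry `response_id` while client messages carry `request_id` and `prev_response_id`. The class names (`ServerMsg`, `ClientMsg`) and the `reply_to` helper are hypothetical, and the real models contain more fields.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now_iso() -> str:
    # ISO 8601 timestamp in UTC.
    return datetime.now(timezone.utc).isoformat()


def _new_id() -> str:
    return uuid.uuid4().hex


@dataclass
class ServerMsg:
    type: str
    session_id: str
    response_id: str = field(default_factory=_new_id)
    timestamp: str = field(default_factory=_now_iso)


@dataclass
class ClientMsg:
    type: str
    session_id: str
    prev_response_id: Optional[str] = None
    request_id: str = field(default_factory=_new_id)
    timestamp: str = field(default_factory=_now_iso)


def reply_to(server_msg: ServerMsg, msg_type: str) -> ClientMsg:
    """Build a client message that points back at the server message it answers."""
    return ClientMsg(
        type=msg_type,
        session_id=server_msg.session_id,
        prev_response_id=server_msg.response_id,
    )
```

With this linkage, a `COMMAND_RESULTS` message can be matched to the `COMMAND` that triggered it by comparing `prev_response_id` with the stored `response_id`.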

→ Complete message reference

Resilient Connection Protocol

Network Instability Handling (G3, G6)

AIP ensures continuous orchestration even under transient network failures or device disconnections through fine-grained reliability mechanisms and transparent reconnection.

Device Disconnection Flow

Connection State Transitions:

This state diagram illustrates how devices transition between connection states and the actions triggered at each transition:

stateDiagram-v2
    [*] --> CONNECTED
    CONNECTED --> DISCONNECTED: Connection lost
    DISCONNECTED --> CONNECTED: Reconnection succeeds
    DISCONNECTED --> [*]: Timeout / Manual removal
    note right of DISCONNECTED
        • Excluded from scheduling
        • Tasks marked FAILED
        • Auto-reconnect triggered
    end note

The DISCONNECTED state acts as a quarantine zone: the device is temporarily removed from the scheduling pool while auto-reconnection attempts are made. If reconnection has not succeeded when the timeout expires, the device is permanently removed.

Event	Orchestrator Action	Device Action
Device disconnects	Mark as `DISCONNECTED`; exclude from scheduling; trigger auto-reconnect (G6)	N/A
Reconnection succeeds	Mark as `CONNECTED`; resume scheduling	Session restored (G6)
Disconnect during task	Mark tasks as `FAILED`; propagate to ConstellationAgent; trigger DAG edit	N/A
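
The transitions in the table can be expressed as a small state machine. The sketch below is an illustration only: `DeviceRecord`, its method names, the `REMOVED` state (standing in for the terminal state of the diagram), and the `"RUNNING"` task status are assumptions.

```python
from enum import Enum
from typing import Dict, List


class DeviceState(Enum):
    CONNECTED = "CONNECTED"
    DISCONNECTED = "DISCONNECTED"
    REMOVED = "REMOVED"  # hypothetical name for the terminal state


class DeviceRecord:
    """Hypothetical orchestrator-side view of one device."""

    def __init__(self, device_id: str) -> None:
        self.device_id = device_id
        self.state = DeviceState.CONNECTED
        self.tasks: Dict[str, str] = {}  # task_id -> status

    @property
    def schedulable(self) -> bool:
        return self.state is DeviceState.CONNECTED

    def on_connection_lost(self) -> List[str]:
        """Quarantine the device and fail in-flight tasks; return their ids."""
        self.state = DeviceState.DISCONNECTED
        failed = [tid for tid, status in self.tasks.items() if status == "RUNNING"]
        for tid in failed:
            self.tasks[tid] = "FAILED"
        return failed  # the caller would propagate these for a DAG edit

    def on_reconnected(self) -> None:
        if self.state is DeviceState.DISCONNECTED:
            self.state = DeviceState.CONNECTED

    def on_reconnect_timeout(self) -> None:
        if self.state is DeviceState.DISCONNECTED:
            self.state = DeviceState.REMOVED
```

Guarding each transition on the current state keeps a late reconnection event from reviving a device that has already been removed.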

ConstellationClient Disconnection

Bidirectional Fault Handling

When the ConstellationClient disconnects, all Device Agent Services:

Receive a termination signal
Abort all ongoing tasks tied to that client
Prevent resource leakage and zombie processes
Maintain end-to-end consistency

Guarantees:

✅ No orphaned tasks
✅ Synchronized state across client-server boundary
✅ Rapid recovery when the connection is restored (G6)
✅ Consistent TaskConstellation state (G4)
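
One way to avoid orphaned tasks is to track running work per client and cancel it on disconnect. The `asyncio` sketch below illustrates that idea. `TaskRegistry` and its methods are hypothetical and are not part of the AIP codebase.

```python
import asyncio
from collections import defaultdict
from typing import DefaultDict, Set


class TaskRegistry:
    """Hypothetical per-client task tracking on the device service side."""

    def __init__(self) -> None:
        self._by_client: DefaultDict[str, Set[asyncio.Task]] = defaultdict(set)

    def track(self, client_id: str, task: asyncio.Task) -> None:
        owned = self._by_client[client_id]
        owned.add(task)
        # Forget finished tasks automatically so the registry does not grow.
        task.add_done_callback(owned.discard)

    async def abort_client(self, client_id: str) -> int:
        """Cancel all unfinished tasks of a disconnected client; return the count."""
        tasks = list(self._by_client.pop(client_id, set()))
        # Task.cancel() returns False for tasks that have already finished.
        cancelled = sum(1 for task in tasks if task.cancel())
        # Wait until every cancellation has been processed so that
        # no task outlives the client's session.
        await asyncio.gather(*tasks, return_exceptions=True)
        return cancelled


async def demo() -> int:
    registry = TaskRegistry()
    for _ in range(3):
        registry.track("client-a", asyncio.create_task(asyncio.sleep(60)))
    return await registry.abort_client("client-a")
```

Awaiting the cancelled tasks before returning is what turns "cancel requested" into "no orphaned tasks".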

→ See resilience implementation

Extensibility Mechanisms

AIP provides multiple extension points for domain-specific needs without modifying the core protocol, supporting composable extensibility (G5).

1. Protocol Middleware

Add custom processing to the message pipeline:

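import logging

from aip.protocol.base import ProtocolMiddleware

def log_to_audit_trail(msg):
    # Placeholder audit sink (an assumption); replace with your own persistence logic.
    logging.getLogger("aip.audit").info("AIP message: %r", msg)

class AuditMiddleware(ProtocolMiddleware):
    # Records every message that passes through the pipeline, in both directions.
    async def process_outgoing(self, msg):
        log_to_audit_trail(msg)
        return msg

    async def process_incoming(self, msg):
        log_to_audit_trail(msg)
        return msg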

2. Custom Message Handlers

Register handlers for new message types:

protocol.register_handler("custom_type", handle_custom_message)

3. Transport Layer

Pluggable transport (default: WebSocket) (G5):

from aip.transport import CustomTransport
protocol.transport = CustomTransport(config)

→ See extensibility guide

Integration with UFO² Ecosystem

Component	Integration Point	Benefit
MCP Servers	Command execution model aligns with MCP message formats	Unified interface for system actions and LLM tool calls
TaskConstellation	Real-time state synchronization via AIP messages	Planning DAG always reflects distributed execution state
Configuration System	Agent endpoints, capabilities managed via UFO² config	Centralized management, type-safe validation
Logging & Monitoring	Comprehensive logging at all protocol layers	Debugging, performance monitoring, audit trails

AIP abstracts network/device heterogeneity, allowing the orchestrator to treat all agents as first-class citizens in a single event-driven control plane.

Related Documentation:

TaskConstellation (DAG orchestrator)
ConstellationAgent (orchestration agent)
MCP Integration Guide
Configuration System

Next Steps:

📖 Message Reference - Complete message type documentation
🔧 Protocol Guide - Implementation details and best practices
🌐 Transport Layer - WebSocket configuration and optimization
🔌 Endpoints - Endpoint setup and usage patterns
🛡️ Resilience - Connection management and fault tolerance

Summary

AIP transforms distributed workflow execution into a coherent, safe, and adaptive system where reasoning and execution converge seamlessly across diverse agents and environments.

Key Takeaways:

Aspect	Impact	Goals
Persistence	Long-lived connections reduce overhead, maintain context	G1
Low Latency	WebSocket enables real-time event propagation	G1
Capability Discovery	Multi-source profiling unifies heterogeneous agents	G2
Reliability	Heartbeats, timeouts, auto-reconnection ensure graceful degradation	G3, G6
Determinism	Sequential command execution, explicit ID correlation	G4
Extensibility	Middleware hooks, pluggable transports, custom handlers	G5
Developer UX	Strongly-typed messages, clear errors reduce integration effort	G5

By decomposing orchestration into five logical layers—each addressing specific reliability and adaptability concerns—AIP enables UFO² to maintain correctness and availability under DAG evolution (G4, G5), agent churn (G3, G6), and heterogeneous execution environments (G1, G2).