Skip to main content

Physical AI Toolchain — Tiered Architecture Proposal

  • Date: June 9th, 2026
  • Author: David White
  • Status: Draft for socialization and feedback
  • Scope: Architecture and on-ramp framing for microsoft/physical-ai-toolchain
  • Type: Proposal — not yet adopted

Executive Summary (TL;DR)

The Physical AI Toolchain ships a powerful end-to-end stack, but it presents that stack as all-or-nothing. A newcomer reading the architecture documentation concludes they must stand up Azure Arc, AKS, FluxCD, ACSA, IoT Operations, and a cloud training plane before they can do anything useful. In reality, the repository's working code can close a complete capture → train → evaluate → run loop on a single laptop and one robot with zero cloud and zero Kubernetes. The gap between what the code supports and what the documentation implies is the single largest barrier to entry.

This proposal reframes the architecture around graduated adoption tiers. Each tier states the minimum edge and cloud infrastructure required to reach a concrete goal, and each tier is a legitimate stopping point. Users adopt only the infrastructure their scale actually demands, and the heaviest components become opt-in additions rather than baseline prerequisites.

The proposal also corrects a framing problem: the repository bundles two distinct concerns under the word "fleet." One is fleet delivery — a delivery and connectivity control plane that gets a validated policy onto robots across sites you cannot directly reach. The other is fleet intelligence — the cognition layer (drift detection, automated retraining, aggregate telemetry analytics). The first is legitimate and necessary at multi-site scale. The second is largely unimplemented and over-advertised. Separating them lets a multi-site operator adopt fleet delivery without inheriting the speculative intelligence layer.

[!IMPORTANT] In this repository, "fleet" refers exclusively to a fleet of robots. It never refers to Kubernetes clusters, nor to Azure Kubernetes Fleet Manager — a distinct Azure product for managing a fleet of clusters, not robots. Cluster-level concerns are always described as "clusters" or "sites," never "fleets." This proposal deliberately avoids the bare phrase "fleet management" because the Kubernetes ecosystem (Azure Kubernetes Fleet Manager, FluxCD, Argo, Rancher Fleet) uses it for the cluster delivery/placement concern; qualifying every use as fleet delivery (T4) or fleet intelligence (T5) keeps the meaning anchored to robots and avoids the collision.

What This Proposal Argues

  1. The barrier to entry is a documentation and sequencing problem, not a fundamental property of the architecture. The code already supports a low-friction path; the docs do not describe it.
  2. Adoption should be modeled as six tiers (T0–T5) keyed to operator reach, GPU availability, collaboration, and site count — not robot count.
  3. The word "fleet" conflates a necessary fleet delivery control plane with a speculative fleet intelligence layer. These should be separate tiers.
  4. The maturity evidence supports drawing the line where we propose: the implemented core is infrastructure, training, data management, and evaluation. The fleet intelligence cognition is mostly placeholder.

1. The Problem: On-Ramp Friction

The current architecture documentation foregrounds the maximal configuration. It describes a closed-loop, multi-site, GitOps-managed robot fleet with telemetry-driven retraining as the headline identity of the project. Everything else reads as a component of that endgame.

The practical consequence: a user evaluating the toolkit perceives the cost of entry as the sum of every moving part, on both the edge and the cloud:

SurfaceComponents the docs imply are required
EdgeArc-enabled Kubernetes (K3s or AKS Edge Essentials), ACSA extension, cert-manager, IngestSubvolume CRD, FluxCD, ROS 2 recording service, gating service, IoT Operations MQTT broker
CloudAKS, AzureML workspace, Azure Storage, Container Registry, Key Vault, managed identity, Event Hubs, Microsoft Fabric Real-Time Intelligence, Grafana, point-to-site VPN

Even with the repository's deploy scripts, the perceived setup is a long, ordered sequence of installations and credential wiring across two environments before the first useful result. The "private AKS by default" network model compounds this: a VPN must exist before any kubectl command runs.

[!NOTE] The friction is perceived cost, and perception is set by documentation. The same components, presented as optional additions layered onto a working local baseline, would not read as a barrier. This is why the fix is primarily a re-framing, not a rewrite.

The Core Claim

The edge architecture is premature and over-advertised relative to its implementation, and there is no documented low or middle tier, so a newcomer mistakes the aspirational ceiling for the required floor.


2. Evidence: What Is Actually Built

A file-level survey of the eight domains separates shipped code from aspirational specification. The signal is unambiguous.

DomainPythonShellTerraformPlaceholder specs
infrastructure018800
data-management49200
training23800
evaluation15700
data-pipeline3100
synthetic-data0002
fleet-deployment1105
fleet-intelligence0404

The implemented center of gravity is infrastructure provisioning plus training, data management, and evaluation tooling. The two domains that carry the "fleet" identity — fleet-deployment and fleet-intelligence — hold one Python file between them and nine placeholder specifications. The headline identity of the project is its least-built region.

Two consequences follow:

  1. The heavy edge stack is largely documented intent, not shipped complexity. The barrier is mostly prerequisite weight and framing, not a large body of code a user must understand.
  2. Because the fleet domains are unbuilt, formalizing a lighter default on-ramp costs almost nothing — there is little working code to reorganize.

The Local Path Already Exists

Three of the four core domains run with no cloud dependency today:

  • Trainingtraining/il/scripts/lerobot/train.py detects available CUDA devices at runtime and trains wherever it is invoked. The OSMO and AzureML submission scripts are wrappers around this, not a requirement of it.
  • Evaluationevaluation/sil/scripts/run-local-lerobot-eval.py and play.py provide an explicit local evaluation path.
  • Data management — the dataviewer defaults to STORAGE_BACKEND=local and serves entirely from a laptop.

The capability to start small is present in the code. It is simply undocumented as a sanctioned path.


3. Reframing "Fleet": Fleet Delivery vs. Fleet Intelligence

The strongest objection to the current architecture is not "Arc and AKS are unnecessary." At multi-site scale they are genuinely needed. The objection is that the repository welds two different concerns together under one word.

ConcernWhat it isRepository domainMaturity
Fleet delivery (control plane)Multi-site connectivity, declarative delivery, safe-swap gatingfleet-deployment1 Python file, 5 placeholders
Fleet intelligenceDrift detection, retraining triggers, aggregate telemetry analyticsfleet-intelligence0 Python files, 4 placeholders

The fleet delivery control plane is transport. Across sites you do not control, you need a reachability and identity broker (Arc), a runtime (AKS), and a delivery mechanism (FluxCD). This is, in effect, a git push that survives bad networks and locked-down sites, with a safety gate before a policy swaps on a physical arm. It is legitimate and necessary once an operator spans multiple sites, and aligns with how the Kubernetes ecosystem treats "fleet" — a delivery and reconciliation concern.

The fleet intelligence layer is cognition. Observing many robots, detecting aggregate degradation, and closing the retraining loop is the data-science layer over the robot fleet — and it is the unbuilt, over-advertised part.

[!IMPORTANT] Separating these is what lets a multi-site operator adopt the necessary fleet delivery plane without inheriting the speculative fleet intelligence layer. The repository's single word "fleet" makes the necessary inseparable from the speculative. Unbundling the word — and qualifying it as delivery vs. intelligence — is the core architectural move of this proposal.

One honest caveat: the split is real but porous. The gating service is a decision (approve a safe deployment window), so a small amount of cognition already lives in the delivery plane, and a multi-site delivery plane inevitably records per-site desired state, which is the seed of intelligence. The tiers below treat gating as part of delivery and draw the intelligence boundary at drift/retraining/analytics.


4. The Tier Model

Tiers are keyed to the variables that actually drive infrastructure cost:

  • Operator reach — can one network and one kubectl context reach every robot?
  • GPU availability — is there a local GPU, or must compute be rented?
  • Collaboration — one person, or a team that needs shared catalogs, tracking, and registries?
  • Site count — one location, or many?

Robot count correlates with these but does not cause them. Three identical robots in one lab you control are trivial; three robots across three customer sites you cannot reach are not.

Reference Goal

Each tier states the minimum infrastructure to reach a concrete goal:

Goal G: Capture demonstrations on a robot, train an imitation policy, validate it, and run that policy back on the robot — the full loop, one task. Fleet intelligence is explicitly out of scope of Goal G.

Goal G is fully achievable at Tiers 0 through 2 without any Kubernetes, Arc, or fleet infrastructure. Data augmentation and fleet-intelligence autonomy are modeled as separate optional axes (see Cross-Cutting Axis: Tiered Data Augmentation and The Autonomy Ladder), not steps in the Goal G loop.

Tier Ladder Overview

Each tier carries a stable ID (T0T5) and a stage name. The IDs are canonical and are used for every boundary and graduation reference; the names are the user-facing labels.

T0 Dev Laptop + 1 robot edge: none cloud: none
T1 Lab Shared GPU + few robots edge: shared disk cloud: Blob (+ optional AzureML/MLflow)
T2 Pilot Single site, at scale edge: none (manual) cloud: AzureML + registry + MLflow
T3 Production Single-site declarative edge: local k3s + Flux cloud: same as T2 (no Arc)
─────────────────────────────────────────────────────────────────── multi-site boundary (Arc)
T4 Scale Multi-site delivery edge: Arc + AKS + Flux cloud: + connectivity/identity
───────────────────────────────────────────────────────────── intelligence boundary
T5 Operate Fleet intelligence edge: + IoT Operations cloud: + Fabric RTI + drift/retraining

T0–T2 satisfy Goal G with manual deployment. T3 adds single-site declarative deployment (local k3s + Flux) without Arc — proving GitOps does not require a cloud fleet control plane. T4 is the multi-site fleet delivery terminus, where Arc becomes necessary as the cross-site reachability and identity broker. T5 is the fleet intelligence layer, marked deferred and over-advertised relative to its implementation.


5. Tier Detail

T0 — Dev. One robot, one laptop. No cloud, no Kubernetes

The honest floor for Goal G.

ConcernImplementation
CaptureROS 2 bag recording to local disk on the robot or laptop. No Arc, no ACSA, no PVC.
Move datacp or rsync from robot to laptop.
CurateDataviewer in local mode on the laptop.
Traintrain.py on a laptop or workstation GPU.
TrackFile-backed MLflow (file:./mlruns, no server) or trackio — a local process, no service to stand up.
Validaterun-local-lerobot-eval.py / play.py locally.
Run on robotThe ACT inference node as a plain process or container. No Flux, no gating, no GitOps.

Edge infra: ROS 2 and Docker only. Cloud infra: none.

[!NOTE] Experiment tracking belongs at T0, not as a collaboration perk. Goal G includes validate, and a success number you cannot reproduce, compare across runs, or use to attribute a regression is an anecdote, not a result — doubly so in the high-variance world of RL and IL. File-backed MLflow and trackio run as local processes with no server, so tracking stays inside the zero-cloud, zero-Kubernetes floor. Tracking as a hosted server plus a model registry is a separate, later concern (T2).

This tier exists in the code today and is undocumented. Surfacing it is the highest-leverage change in this proposal.

Graduate when: no local GPU; the task needs many training iterations as conditions vary; or a second person needs the data.

T1 — Lab. One site, a few robots, a shared GPU box. Storage-only cloud

The small-lab and integrator tier. The first cloud resource added is a single storage account.

ConcernImplementationDelta from T0
CaptureROS 2 recording to shared NFS/SMB, or each robot rsyncs up.shared disk
Move dataazcopy or az storage blob upload-batch to one Blob container.+ Blob storage
CurateDataviewer in azure mode against that container (managed identity or SAS).viewer → cloud
TrainLocal shared GPU box, or first optional reach to AzureML on saturation.optional cloud GPU
TrackLocal file-backed tracking (carried from T0) optionally promoted to a shared MLflow server once a team needs shared run history.optional hosted tracking
Run on robotPlain container per robot; hand-update 2–3 robots via docker pull.unchanged

Edge infra: shared disk. Cloud infra: one storage account, optionally AzureML and MLflow. No Kubernetes, no Arc, no Flux.

Graduate when: training scale or team size outgrows one GPU box; dataset governance and catalogs become necessary.

T2 — Pilot. One site, several robots, real training scale and collaboration

The tier where cloud training genuinely becomes the default rather than an option.

ConcernImplementationDelta from T1
TrainAzureML or OSMO as default: multi-GPU, queued jobs, multiple people, VLA scale.cloud GPU standard
RegistryModel registry and versioning become load-bearing.+ registry
CurateDataviewer deployed as a shared web app rather than localhost.hosted viewer
CaptureACSA optional if disk pressure or unattended recording warrants it.optional ACSA
Run on robotManual docker pull per robot. Hand-updating a handful of reachable robots is still tractable.unchanged

Edge infra: none beyond Docker. Cloud infra: AzureML workspace, storage, registry, MLflow. Still no Kubernetes, no Arc, no fleet plane.

Graduate when: the number of robots or the update cadence makes hand-updating each robot error-prone, and version skew across robots becomes a real problem — but all robots are still at one reachable site.

T3 — Production. Local k3s + Flux, no Arc

The tier that proves declarative, GitOps-style deployment does not require Azure Arc. Several robots at one site you control, updated often enough that manual docker pull causes version skew, but all reachable from a single operator network. A single local k3s node running FluxCD reconciles every robot to a Git-declared desired state. Arc is unnecessary precisely because there is only one site and you can reach it directly.

ConcernImplementation
RuntimeA single local k3s node (a ~60 MB binary) at the site.
DeliveryFluxCD reconciles robots to Git-declared desired state; rollback is a Git revert.
Version controlGit becomes the single source of truth for which robot runs which policy.
Safe swapOptional gating before a policy swaps on hardware.
Train and curateSame as T2 (AzureML, registry, MLflow, hosted viewer).

Why this is a full tier, not a footnote: it isolates the single most misattributed cost in the whole architecture. The expensive part of the "fleet" stack was never running Kubernetes at the edge — single-node k3s idles near zero. The expensive part is the cloud-side control plane (Arc enrollment, identity, policy). T3 delivers GitOps deployment automation while paying none of that. It is the decoupling that makes the entire middle of the ladder honest.

Edge infra: one local k3s node + FluxCD. Cloud infra: same as T2. No Arc, no IoT Operations.

Graduate when: robots span multiple sites, or sites become unreachable from a single operator network — the point at which a cross-site reachability and identity broker becomes genuinely necessary.

T4 — Scale. Multi-site fleet delivery

The legitimate top of the necessary ladder. This is the fleet delivery control plane: getting validated policies onto robots across sites you cannot directly reach, safely. The defining difference from T3 is multiple sites — which is exactly what makes Arc necessary, as the cross-site reachability and identity broker that single-site k3s did not need.

ConcernImplementation
Connectivity and identityAzure Arc as the reachability and identity broker across sites.
RuntimeAKS or Arc-enabled Kubernetes at each site.
DeliveryFluxCD GitOps; per-site desired state recorded in Git.
Safe swapGating service approves deployment windows before a policy swaps on hardware.

Explicitly excluded at T4: drift detection, automated retraining, aggregate telemetry analytics. This tier delivers and gates; it does not run fleet intelligence.

Edge infra: Arc + AKS/Flux + gating. Cloud infra: connectivity, identity, registry.

Graduate when: the operator explicitly wants production signals to drive retraining and fleet-wide health analytics — a deliberate decision, not an automatic consequence of scale.

T5 — Operate. Fleet intelligence. Deferred

The aspirational layer. Drift detection, automated retraining triggers, aggregate telemetry, and health analytics.

ConcernImplementationStatus
Edge telemetryAzure IoT Operations MQTT aggregationplaceholder
AnalyticsMicrosoft Fabric Real-Time Intelligence, Grafanaplaceholder
Drift and retrainingDrift detection, retraining triggers, closed loopplaceholder

Status: mostly unimplemented (0 Python files, 4 placeholders in fleet-intelligence). This tier should be documented as a roadmap direction, with explicit human-in-the-loop gating recommended over fully autonomous retraining. It decomposes into an ordered autonomy ladder rather than a single leap.

The Autonomy Ladder (T5.0–T5.3)

Fleet intelligence is not a single leap. Treating it as one repeats the all-or-nothing framing this proposal argues against everywhere else. The closed loop decomposes into four ordered stages of increasing decision authority. Each is a legitimate stopping point, and the deferral is honest only when the intermediate stages are named: three of the four are unbuilt today (modulo an ad-hoc experiment by the team on Hex).

StageNameDecision authorityStatus
T5.0Gated retrainingHumans trigger retraining; the system surfaces signals only.not built
T5.1Human-in-the-loop / active learningThe system proposes what to retrain on and when; a human approves each cycle.ad-hoc (Hex)
T5.2Continual learningThe system retrains on a schedule or trigger; a human reviews before deployment.not built
T5.3Autonomous closed-loopThe system detects drift, retrains, gates, and deploys without human approval.not built

[!IMPORTANT] Autonomy is a different axis from T0–T4. T0–T4 scale on infrastructure reach (sites, GPU, collaboration); T5.0–T5.3 scale on decision authority / loop closure. They are orthogonal: a single-site T3 operator can sit at T5.0, and a multi-site T4 operator can remain fully manual. The autonomy stages are not "more infrastructure to buy" \u2014 they are how much of the retraining decision a human delegates.

The foot-gun warning below applies with increasing force up the ladder; T5.3 should remain a roadmap direction, not a near-term target.

[!WARNING] Fully autonomous retraining on production data is a foot-gun: a legitimate distribution change can cause the loop to bake current degraded behavior into the next dataset. Drift detection also needs statistical power that only exists at fleet scale. T5 should default to human-supervised (T5.0–T5.1), not closed-loop.

Cross-Cutting Axis: Tiered Data Augmentation

Data scarcity is unavoidable in physical AI, so the toolchain needs an augmentation story \u2014 but today the only documented path is the full Cosmos/SDG pipeline, which is the aspirational ceiling that T0\u2013T2 users cannot realistically operate. This is the same all-or-nothing trap as the rest of the architecture: a heavyweight ceiling with no documented low or middle rung. Augmentation should therefore be a tiered, optional axis, recommended when data is scarce \u2014 not a baseline step and explicitly not part of Goal G (folding it into the anchor goal would undermine the "this is the honest minimal floor" argument).

StageApproachWhere it runs
A0Classical CV augmentation (crops, jitter, blur, photometric/geometric transforms)Local, CPU, no model
A1Local small-VLM generation (e.g. via vLLM locally at T0, or Azure AI Foundry at T1)Local GPU or hosted endpoint
A2Full Cosmos / SDG world-foundation-model pipelineCloud, GPU cluster

[!NOTE] Unlike the documentation-and-packaging changes elsewhere in this proposal, the A0\u2013A1 rungs are net-new code: synthetic-data ships 0 Python files and 2 placeholders today. Classical augmentation, local-VLM generation, and the accompanying experiments and guidance are a near-term build item, not a re-framing of existing artifacts. Naming them here is what gives the augmentation axis the credibility the toolchain currently lacks; delivering A0\u2013A1 is a separate, scoped workstream.


6. What Changes for the Repository

This proposal is primarily documentation and packaging, not a code rewrite.

ChangeEffortImpact
Document T0 as the sanctioned starting pathLow — code already supports itRemoves the perceived barrier directly
Restructure the architecture doc around T0–T5Low–mediumReframes the on-ramp; sets correct expectations
Split "fleet" into fleet delivery (T4) and fleet intelligence (T5) language, reserving "fleet" for robotsLowLets multi-site users adopt delivery without intelligence; ends the Azure Kubernetes Fleet Manager naming collision
Mark T5 components explicitly as roadmapLowStops the unbuilt ceiling from reading as a floor
Name the T5.0–T5.3 autonomy ladderLow — documentation onlyTurns the deferral into a roadmap instead of a cliff
Add per-tier quick-startsMediumGives each audience a concrete entry point
Implement A0–A1 data augmentation (classical CV, local-VLM)Medium — net-new codeCloses the credibility gap; the only build item in this list

The deploy scripts, Terraform modules, and training code do not need to change to support the tier model itself. The tiers describe which subset of existing infrastructure a user adopts, and in what order. The one exception is the A0–A1 augmentation rungs, which are an explicit new workstream rather than a re-framing of shipped artifacts.


7. Open Questions for Socialization

  1. Default tier in docs. Should the README and Quick Start default to T0 (Dev), with T2 (Pilot) as the "recommended production" path and T3–T5 (Production, Scale, Operate) clearly marked advanced? Do the stage names — Dev, Lab, Pilot, Production, Scale, Operate — read correctly, and is pairing each name with its T# ID the right convention (stable IDs for boundary references, names for user-facing labels)?
  2. Fleet vocabulary. Does the repository adopt fleet delivery (T4 control plane) and fleet intelligence (T5 cognition) as distinct named concepts, reserve the word "fleet" exclusively for a fleet of robots, and retire the bare phrase "fleet management" (which collides with Azure Kubernetes Fleet Manager and the Kubernetes ecosystem's delivery-oriented use of "fleet")?
  3. Roadmap honesty. How explicitly should placeholder domains be labeled in user-facing docs versus contributor docs?
  4. Scope of Goal G. The proposal keeps the single-task capture → train → validate → run loop as the anchor goal and models data augmentation as a separate optional axis (A0–A2) rather than folding it into Goal G. Is keeping the floor minimal — with augmentation recommended only when data is scarce — the right call, or should augmentation be part of the reference loop?

8. Summary

The toolchain is strong where it is built and over-advertised where it is not. The barrier to entry is the result of presenting an aspirational, multi-site, fleet-intelligence ceiling as the required floor. The fix is to:

  1. Surface the already-working local path (T0) as the sanctioned starting point.
  2. Model adoption as graduated tiers keyed to reach, GPU, collaboration, and site count.
  3. Unbundle "fleet" into a necessary fleet delivery control plane (T4) and a deferred fleet intelligence layer (T5).

This lowers the on-ramp without removing any capability, and it aligns the documented architecture with what the code actually delivers today.