Physical AI Toolchain — Tiered Architecture Proposal

Date: June 9th, 2026
Author: David White
Status: Adopted, decisions resolved (see Resolved Decisions)
Scope: Architecture and on-ramp framing for microsoft/physical-ai-toolchain
Type: Adopted architecture decision

Executive Summary (TL;DR)

The Physical AI Toolchain ships a powerful end-to-end stack, but it presents that stack as all-or-nothing. A newcomer reading the architecture documentation concludes they must stand up Azure Arc, AKS, FluxCD, ACSA, IoT Operations, and a cloud training plane before they can do anything useful. In reality, the repository's working code can close a complete capture → train → evaluate → run loop on a single laptop and one robot with zero cloud and zero Kubernetes. The gap between what the code supports and what the documentation implies is the single largest barrier to entry.

This proposal reframes the architecture around graduated adoption tiers. Each tier states the minimum edge and cloud infrastructure required to reach a concrete goal, and each tier is a legitimate stopping point. Users adopt only the infrastructure their scale actually demands, and the heaviest components become opt-in additions rather than baseline prerequisites.

The proposal also corrects a framing problem: the repository bundles two distinct concerns under the word "fleet." One is fleet delivery: a delivery and connectivity control plane that gets a validated policy onto robots across sites you cannot directly reach. The other is fleet intelligence , the cognition layer (drift detection, automated retraining, aggregate telemetry analytics). The first is legitimate and necessary at multi-site scale. The second is largely unimplemented and over-advertised. Separating them lets a multi-site operator adopt fleet delivery without inheriting the speculative intelligence layer.

[!IMPORTANT] In this repository, "fleet" refers exclusively to a fleet of robots. It never refers to Kubernetes clusters, nor to Azure Kubernetes Fleet Manager, a distinct Azure product for managing a fleet of clusters, not robots. Cluster-level concerns are always described as "clusters" or "sites," never "fleets." This proposal deliberately avoids the bare phrase "fleet management" because the Kubernetes ecosystem (Azure Kubernetes Fleet Manager, FluxCD, Argo, Rancher Fleet) uses it for the cluster delivery/placement concern; qualifying every use as fleet delivery (T4) or fleet intelligence (T5) keeps the meaning anchored to robots and avoids the collision.

What This Proposal Argues

The barrier to entry is a documentation and sequencing problem, not a fundamental property of the architecture. The code already supports a low-friction path; the docs do not describe it.
Adoption should be modeled as six tiers (T0–T5) keyed to operator reach, GPU availability, collaboration, and site count, not robot count.
The word "fleet" conflates a necessary fleet delivery control plane with a speculative fleet intelligence layer. These should be separate tiers.
The maturity evidence supports drawing the line where we propose: the implemented core is infrastructure, training, data management, and evaluation. The fleet intelligence cognition is mostly placeholder.

1. The Problem: On-Ramp Friction

The current architecture documentation foregrounds the maximal configuration. It describes a closed-loop, multi-site, GitOps-managed robot fleet with telemetry-driven retraining as the headline identity of the project. Everything else reads as a component of that endgame.

The practical consequence: a user evaluating the toolkit perceives the cost of entry as the sum of every moving part, on both the edge and the cloud:

Surface	Components the docs imply are required
Edge	Arc-enabled Kubernetes (K3s or AKS Edge Essentials), ACSA extension, cert-manager, IngestSubvolume CRD, FluxCD, ROS 2 recording service, gating service, IoT Operations MQTT broker
Cloud	AKS, AzureML workspace, Azure Storage, Container Registry, Key Vault, managed identity, Event Hubs, Microsoft Fabric Real-Time Intelligence, Grafana, point-to-site VPN

Even with the repository's deploy scripts, the perceived setup is a long, ordered sequence of installations and credential wiring across two environments before the first useful result. The "private AKS by default" network model compounds this: a VPN must exist before any kubectl command runs.

[!NOTE] The friction is perceived cost, and perception is set by documentation. The same components, presented as optional additions layered onto a working local baseline, would not read as a barrier. This is why the fix is primarily a re-framing, not a rewrite.

The Core Claim

The edge architecture is premature and over-advertised relative to its implementation, and there is no documented low or middle tier, so a newcomer mistakes the aspirational ceiling for the required floor.

2. Evidence: What Is Actually Built

A file-level survey of the eight domains separates shipped code from aspirational specification. The signal is unambiguous.

Domain	Python	Shell	Terraform	Placeholder specs
`infrastructure`	0	18	80	0
`data-management`	49	2	0	0
`training`	23	8	0	0
`evaluation`	15	7	0	0
`data-pipeline`	3	1	0	0
`synthetic-data`	0	0	0	2
`fleet-deployment`	1	1	0	5
`fleet-intelligence`	0	4	0	4

The implemented center of gravity is infrastructure provisioning plus training, data management, and evaluation tooling. The two domains that carry the "fleet" identity, fleet-deployment and fleet-intelligence, hold one Python file between them and nine placeholder specifications. The headline identity of the project is its least-built region.

Two consequences follow:

The heavy edge stack is largely documented intent, not shipped complexity. The barrier is mostly prerequisite weight and framing, not a large body of code a user must understand.
Because the fleet domains are unbuilt, formalizing a lighter default on-ramp costs almost nothing. There is little working code to reorganize.

The Local Path Already Exists

Three of the four core domains run with no cloud dependency today:

Training: training/il/scripts/lerobot/train.py detects available CUDA devices at runtime and trains wherever it is invoked. The OSMO and AzureML submission scripts are wrappers around this, not a requirement of it.
Evaluation: evaluation/sil/scripts/run-local-lerobot-eval.py and play.py provide an explicit local evaluation path.
Data management: the dataviewer defaults to STORAGE_BACKEND=local and serves entirely from a laptop.

The capability to start small is present in the code. It is simply undocumented as a sanctioned path.

3. Reframing "Fleet": Fleet Delivery vs. Fleet Intelligence

The strongest objection to the current architecture is not "Arc and AKS are unnecessary." At multi-site scale they are genuinely needed. The objection is that the repository welds two different concerns together under one word.

Concern	What it is	Repository domain	Maturity
Fleet delivery (control plane)	Multi-site connectivity, declarative delivery, safe-swap gating	`fleet-deployment`	1 Python file, 5 placeholders
Fleet intelligence	Drift detection, retraining triggers, aggregate telemetry analytics	`fleet-intelligence`	0 Python files, 4 placeholders

The fleet delivery control plane is transport. Across sites you do not control, you need a reachability and identity broker (Arc), a runtime (AKS), and a delivery mechanism (FluxCD). This is, in effect, a git push that survives bad networks and locked-down sites, with a safety gate before a policy swaps on a physical arm. It is legitimate and necessary once an operator spans multiple sites, and aligns with how the Kubernetes ecosystem treats "fleet": a delivery and reconciliation concern.

The fleet intelligence layer is cognition. Observing many robots, detecting aggregate degradation, and closing the retraining loop is the data-science layer over the robot fleet, and it is the unbuilt, over-advertised part.

[!IMPORTANT] Separating these is what lets a multi-site operator adopt the necessary fleet delivery plane without inheriting the speculative fleet intelligence layer. The repository's single word "fleet" makes the necessary inseparable from the speculative. Unbundling the word, and qualifying it as delivery vs. intelligence, is the core architectural move of this proposal.

One honest caveat: the split is real but porous. The gating service is a decision (approve a safe deployment window), so a small amount of cognition already lives in the delivery plane, and a multi-site delivery plane inevitably records per-site desired state, which is the seed of intelligence. The tiers below treat gating as part of delivery and draw the intelligence boundary at drift/retraining/analytics.

4. The Tier Model

Tiers are keyed to the variables that actually drive infrastructure cost:

Operator reach: can one network and one kubectl context reach every robot?
GPU availability: is there a local GPU, or must compute be rented?
Collaboration: one person, or a team that needs shared catalogs, tracking, and registries?
Site count: one location, or many?

Robot count correlates with these but does not cause them. Three identical robots in one lab you control are trivial; three robots across three customer sites you cannot reach are not.

Reference Goal

Each tier states the minimum infrastructure to reach a concrete goal:

Goal: Full Training Lifecycle: Capture demonstrations on a robot, train an imitation policy, validate it, and run that policy back on the robot, the full loop for one task. Fleet intelligence is explicitly out of scope of Goal: Full Training Lifecycle.

Goal: Full Training Lifecycle is fully achievable at Tiers 0 through 2 without any Kubernetes, Arc, or fleet infrastructure. Data augmentation and fleet-intelligence autonomy are modeled as separate optional axes (see Cross-Cutting Axis: Tiered Data Augmentation and The Autonomy Ladder), not steps in the Goal: Full Training Lifecycle loop.

Tier Ladder Overview

Each tier carries a stable ID (T0–T5) and a stage name. The IDs are canonical and are used for every boundary and graduation reference; the names are the user-facing labels.

T0  Dev         Laptop + 1 robot          edge: none               cloud: none
T1  Lab         Shared GPU + few robots   edge: shared disk        cloud: Blob (+ optional AzureML/MLflow)
T2  Pilot       Single site, at scale     edge: none (manual)      cloud: AzureML + registry + MLflow
T3  Production  Single-site declarative   edge: local k3s + Flux   cloud: same as T2 (no Arc)
─────────────────────────────────────────────────────────────────── multi-site boundary (Arc)
T4  Scale       Multi-site delivery       edge: Arc + AKS + Flux   cloud: + connectivity/identity
───────────────────────────────────────────────────────────── intelligence boundary
T5  Operate     Fleet intelligence        edge: + IoT Operations   cloud: + Fabric RTI + drift/retraining

T0–T2 satisfy Goal: Full Training Lifecycle with manual deployment. T3 adds single-site declarative deployment (local k3s + Flux) without Arc, proving GitOps does not require a cloud fleet control plane. T4 is the multi-site fleet delivery terminus, where Arc becomes necessary as the cross-site reachability and identity broker. T5 is the fleet intelligence layer, marked deferred and over-advertised relative to its implementation.

5. Tier Detail

T0 — Dev. One robot, one laptop. No cloud, no Kubernetes

The honest floor for Goal: Full Training Lifecycle.

Concern	Implementation
Capture	ROS 2 bag recording to local disk on the robot or laptop. No Arc, no ACSA, no PVC.
Move data	`cp` or `rsync` from robot to laptop.
Curate	Dataviewer in `local` mode on the laptop.
Train	`train.py` on a laptop or workstation GPU.
Track	File-backed MLflow (`file:./mlruns`, no server) or trackio, a local process with no service to stand up.
Validate	`run-local-lerobot-eval.py` / `play.py` locally.
Run on robot	The ACT inference node as a plain process or container. No Flux, no gating, no GitOps.

Edge infra: ROS 2 and Docker only. Cloud infra: none.

[!NOTE] Experiment tracking belongs at T0, not as a collaboration perk. Goal: Full Training Lifecycle includes validate, and a success number you cannot reproduce, compare across runs, or use to attribute a regression is an anecdote, not a result, doubly so in the high-variance world of RL and IL. File-backed MLflow and trackio run as local processes with no server, so tracking stays inside the zero-cloud, zero-Kubernetes floor. Tracking as a hosted server plus a model registry is a separate, later concern (T2).

This tier exists in the code today and is undocumented. Surfacing it is the highest-leverage change in this proposal.

Graduate when: no local GPU; the task needs many training iterations as conditions vary; or a second person needs the data.

T1 — Lab. One site, a few robots, a shared GPU box. Storage-only cloud

The small-lab and integrator tier. The first cloud resource added is a single storage account.

Concern	Implementation	Delta from T0
Capture	ROS 2 recording to shared NFS/SMB, or each robot `rsync`s up.	shared disk
Move data	`azcopy` or `az storage blob upload-batch` to one Blob container.	+ Blob storage
Curate	Dataviewer in `azure` mode against that container (managed identity or SAS).	viewer → cloud
Train	Local shared GPU box, or first optional reach to AzureML on saturation.	optional cloud GPU
Track	Local file-backed tracking (carried from T0) optionally promoted to a shared MLflow server once a team needs shared run history.	optional hosted tracking
Run on robot	Plain container per robot; hand-update 2–3 robots via `docker pull`.	unchanged

Edge infra: shared disk. Cloud infra: one storage account, optionally AzureML and MLflow. No Kubernetes, no Arc, no Flux.

Graduate when: training scale or team size outgrows one GPU box; dataset governance and catalogs become necessary.

T2 — Pilot. One site, several robots, real training scale and collaboration

The tier where cloud training genuinely becomes the default rather than an option.

Concern	Implementation	Delta from T1
Train	AzureML or OSMO as default: multi-GPU, queued jobs, multiple people, VLA scale.	cloud GPU standard
Registry	Model registry and versioning become load-bearing.	+ registry
Curate	Dataviewer deployed as a shared web app rather than localhost.	hosted viewer
Capture	ACSA optional if disk pressure or unattended recording warrants it.	optional ACSA
Run on robot	Manual `docker pull` per robot. Hand-updating a handful of reachable robots is still tractable.	unchanged

Edge infra: none beyond Docker. Cloud infra: AzureML workspace, storage, registry, MLflow. Still no Kubernetes, no Arc, no fleet plane.

Graduate when: the number of robots or the update cadence makes hand-updating each robot error-prone, and version skew across robots becomes a real problem, while all robots are still at one reachable site.

T3 — Production. Local k3s + Flux, no Arc

The tier that proves declarative, GitOps-style deployment does not require Azure Arc. Several robots at one site you control, updated often enough that manual docker pull causes version skew, but all reachable from a single operator network. A single local k3s node running FluxCD reconciles every robot to a Git-declared desired state. Arc is unnecessary precisely because there is only one site and you can reach it directly.

Concern	Implementation
Runtime	A single local k3s node (a ~60 MB binary) at the site.
Delivery	FluxCD reconciles robots to Git-declared desired state; rollback is a Git revert.
Version control	Git becomes the single source of truth for which robot runs which policy.
Safe swap	Optional gating before a policy swaps on hardware.
Train and curate	Same as T2 (AzureML, registry, MLflow, hosted viewer).

Why this is a full tier, not a footnote: it isolates the single most misattributed cost in the whole architecture. The expensive part of the "fleet" stack was never running Kubernetes at the edge; single-node k3s idles near zero. The expensive part is the cloud-side control plane (Arc enrollment, identity, policy). T3 delivers GitOps deployment automation while paying none of that. It is the decoupling that makes the entire middle of the ladder honest.

Edge infra: one local k3s node + FluxCD. Cloud infra: same as T2. No Arc, no IoT Operations.

Graduate when: robots span multiple sites, or sites become unreachable from a single operator network. That is the point at which a cross-site reachability and identity broker becomes genuinely necessary.

T4 — Scale. Multi-site fleet delivery

The legitimate top of the necessary ladder. This is the fleet delivery control plane: getting validated policies onto robots across sites you cannot directly reach, safely. The defining difference from T3 is multiple sites, which is exactly what makes Arc necessary as the cross-site reachability and identity broker that single-site k3s did not need.

Concern	Implementation
Connectivity and identity	Azure Arc as the reachability and identity broker across sites.
Runtime	AKS or Arc-enabled Kubernetes at each site.
Delivery	FluxCD GitOps; per-site desired state recorded in Git.
Safe swap	Gating service approves deployment windows before a policy swaps on hardware.

Explicitly excluded at T4: drift detection, automated retraining, aggregate telemetry analytics. This tier delivers and gates; it does not run fleet intelligence.

Edge infra: Arc + AKS/Flux + gating. Cloud infra: connectivity, identity, registry.

Graduate when: the operator explicitly wants production signals to drive retraining and fleet-wide health analytics. This is a deliberate decision, not an automatic consequence of scale.

T5 — Operate. Fleet intelligence. Deferred

The aspirational layer. Drift detection, automated retraining triggers, aggregate telemetry, and health analytics.

Concern	Implementation	Status
Edge telemetry	Azure IoT Operations MQTT aggregation	placeholder
Analytics	Microsoft Fabric Real-Time Intelligence, Grafana	placeholder
Drift and retraining	Drift detection, retraining triggers, closed loop	placeholder

Status: mostly unimplemented (0 Python files, 4 placeholders in fleet-intelligence). This tier should be documented as a roadmap direction, with explicit human-in-the-loop gating recommended over fully autonomous retraining. It decomposes into an ordered autonomy ladder rather than a single leap.

The Autonomy Ladder (T5.0–T5.3)

Fleet intelligence is not a single leap. Treating it as one repeats the all-or-nothing framing this proposal argues against everywhere else. The closed loop decomposes into four ordered stages of increasing decision authority. Each is a legitimate stopping point, and the deferral is honest only when the intermediate stages are named: three of the four are unbuilt today (modulo an ad-hoc experiment by the team on Hex).

Stage	Name	Decision authority	Status
T5.0	Gated retraining	Humans trigger retraining; the system surfaces signals only.	not built
T5.1	Human-in-the-loop / active learning	The system proposes what to retrain on and when; a human approves each cycle.	ad-hoc (Hex)
T5.2	Continual learning	The system retrains on a schedule or trigger; a human reviews before deployment.	not built
T5.3	Autonomous closed-loop	The system detects drift, retrains, gates, and deploys without human approval.	not built

[!IMPORTANT] Autonomy is a different axis from T0–T4. T0–T4 scale on infrastructure reach (sites, GPU, collaboration); T5.0–T5.3 scale on decision authority / loop closure. They are orthogonal: a single-site T3 operator can sit at T5.0, and a multi-site T4 operator can remain fully manual. The autonomy stages are not "more infrastructure to buy"; they are how much of the retraining decision a human delegates.

The foot-gun warning below applies with increasing force up the ladder; T5.3 should remain a roadmap direction, not a near-term target.

[!WARNING] Fully autonomous retraining on production data is a foot-gun: a legitimate distribution change can cause the loop to bake current degraded behavior into the next dataset. Drift detection also needs statistical power that only exists at fleet scale. T5 should default to human-supervised (T5.0–T5.1), not closed-loop.

Cross-Cutting Axis: Tiered Data Augmentation

Data scarcity is unavoidable in physical AI, so the toolchain needs an augmentation story, but today the only documented path is the full Cosmos/SDG pipeline, which is the aspirational ceiling that T0–T2 users cannot realistically operate. This is the same all-or-nothing trap as the rest of the architecture: a heavyweight ceiling with no documented low or middle rung. Augmentation should therefore be a tiered, optional axis, recommended when data is scarce, not a baseline step and explicitly not part of Goal: Full Training Lifecycle (folding it into the anchor goal would undermine the "this is the honest minimal floor" argument).

Stage	Approach	Where it runs
A0	Classical CV augmentation (crops, jitter, blur, photometric/geometric transforms)	Local, CPU, no model
A1	Local small-VLM generation (e.g. via vLLM locally at T0, or Azure AI Foundry at T1)	Local GPU or hosted endpoint
A2	Full Cosmos / SDG world-foundation-model pipeline	Cloud, GPU cluster

[!NOTE] Unlike the documentation-and-packaging changes elsewhere in this proposal, the A0–A1 rungs are net-new code: synthetic-data ships 0 Python files and 2 placeholders today. Classical augmentation, local-VLM generation, and the accompanying experiments and guidance are a near-term build item, not a re-framing of existing artifacts. Naming them here is what gives the augmentation axis the credibility the toolchain currently lacks; delivering A0–A1 is a separate, scoped workstream.

6. What Changes for the Repository

This proposal is primarily documentation and packaging, not a code rewrite.

Change	Effort	Impact
Document T0 as the sanctioned starting path	Low: code already supports it	Removes the perceived barrier directly
Restructure the architecture doc around T0–T5	Low–medium	Reframes the on-ramp; sets correct expectations
Split "fleet" into fleet delivery (T4) and fleet intelligence (T5) language, reserving "fleet" for robots	Low	Lets multi-site users adopt delivery without intelligence; ends the Azure Kubernetes Fleet Manager naming collision
Mark T5 components explicitly as roadmap	Low	Stops the unbuilt ceiling from reading as a floor
Name the T5.0–T5.3 autonomy ladder	Low: documentation only	Turns the deferral into a roadmap instead of a cliff
Add per-tier quick-starts	Medium	Gives each audience a concrete entry point
Implement A0–A1 data augmentation (classical CV, local-VLM)	Medium: net-new code	Closes the credibility gap; the only build item in this list

The deploy scripts, Terraform modules, and training code do not need to change to support the tier model itself. The tiers describe which subset of existing infrastructure a user adopts, and in what order. The one exception is the A0–A1 augmentation rungs, which are an explicit new workstream rather than a re-framing of shipped artifacts.

7. Open Questions for Socialization

Default tier in docs. Should the README and Quick Start default to T0 (Dev), with T2 (Pilot) as the "recommended production" path and T3–T5 (Production, Scale, Operate) clearly marked advanced? Do the stage names, Dev, Lab, Pilot, Production, Scale, and Operate, read correctly, and is pairing each name with its T# ID the right convention (stable IDs for boundary references, names for user-facing labels)?
Fleet vocabulary. Does the repository adopt fleet delivery (T4 control plane) and fleet intelligence (T5 cognition) as distinct named concepts, reserve the word "fleet" exclusively for a fleet of robots, and retire the bare phrase "fleet management" (which collides with Azure Kubernetes Fleet Manager and the Kubernetes ecosystem's delivery-oriented use of "fleet")?
Roadmap honesty. How explicitly should placeholder domains be labeled in user-facing docs versus contributor docs?
Scope of Goal: Full Training Lifecycle. The proposal keeps the single-task capture → train → validate → run loop as the anchor goal and models data augmentation as a separate optional axis (A0–A2) rather than folding it into Goal: Full Training Lifecycle. Is keeping the floor minimal, with augmentation recommended only when data is scarce, the right call, or should augmentation be part of the reference loop?

Resolved Decisions

The four questions above were socialized and resolved as follows. The canonical definitions live in tier-model.md; downstream documentation cites that file rather than this section.

#	Decision	Rationale
1	Adopted. T0 (Dev) is the documented default; T2 (Pilot) is the recommended-production path; T3–T5 (Production, Scale, Operate) are advanced. Stage names pair with `T#` IDs: IDs for boundary references, names for user-facing labels.	Surfaces the already-working local path as the sanctioned floor and gives every tier a stable, linkable identifier.
2	Adopted. "Fleet delivery" (T4) and "fleet intelligence" (T5) become distinct named concepts; "fleet" refers only to robots; the bare phrase "fleet management" is retired.	Removes the Azure Kubernetes Fleet Manager collision and lets a multi-site operator adopt delivery without inheriting intelligence.
3	Adopted. Placeholder/roadmap status is labeled explicitly in both contributor and user-facing docs, most prominently in user-facing docs.	Aligns the documented capability with what the code actually ships and prevents over-advertising the T5 layer.
4	Adopted. Goal: Full Training Lifecycle stays the minimal single-task loop; data augmentation remains a separate optional axis (A0–A2) recommended only when data is scarce. The A0–A1 build is deferred.	Keeps the on-ramp floor minimal and prevents augmentation work from gating the T0 path.

8. Summary

The toolchain is strong where it is built and over-advertised where it is not. The barrier to entry is the result of presenting an aspirational, multi-site, fleet-intelligence ceiling as the required floor. The fix is to:

Surface the already-working local path (T0) as the sanctioned starting point.
Model adoption as graduated tiers keyed to reach, GPU, collaboration, and site count.
Unbundle "fleet" into a necessary fleet delivery control plane (T4) and a deferred fleet intelligence layer (T5).

This lowers the on-ramp without removing any capability, and it aligns the documented architecture with what the code actually delivers today.

Executive Summary (TL;DR)​

What This Proposal Argues​

1. The Problem: On-Ramp Friction​

The Core Claim​

2. Evidence: What Is Actually Built​

The Local Path Already Exists​

3. Reframing "Fleet": Fleet Delivery vs. Fleet Intelligence​

4. The Tier Model​

Reference Goal​

Tier Ladder Overview​

5. Tier Detail​

T0 — Dev. One robot, one laptop. No cloud, no Kubernetes​

T1 — Lab. One site, a few robots, a shared GPU box. Storage-only cloud​

T2 — Pilot. One site, several robots, real training scale and collaboration​

T3 — Production. Local k3s + Flux, no Arc​

T4 — Scale. Multi-site fleet delivery​

T5 — Operate. Fleet intelligence. Deferred​

The Autonomy Ladder (T5.0–T5.3)​

Cross-Cutting Axis: Tiered Data Augmentation​

6. What Changes for the Repository​

7. Open Questions for Socialization​

Resolved Decisions​

8. Summary​