Physical AI Toolchain — Tiered Architecture Proposal
- Date: June 9th, 2026
- Author: David White
- Status: Draft for socialization and feedback
- Scope: Architecture and on-ramp framing for microsoft/physical-ai-toolchain
- Type: Proposal — not yet adopted
Executive Summary (TL;DR)
The Physical AI Toolchain ships a powerful end-to-end stack, but it presents that stack as all-or-nothing. A newcomer reading the architecture documentation concludes they must stand up Azure Arc, AKS, FluxCD, ACSA, IoT Operations, and a cloud training plane before they can do anything useful. In reality, the repository's working code can close a complete capture → train → evaluate → run loop on a single laptop and one robot with zero cloud and zero Kubernetes. The gap between what the code supports and what the documentation implies is the single largest barrier to entry.
This proposal reframes the architecture around graduated adoption tiers. Each tier states the minimum edge and cloud infrastructure required to reach a concrete goal, and each tier is a legitimate stopping point. Users adopt only the infrastructure their scale actually demands, and the heaviest components become opt-in additions rather than baseline prerequisites.
The proposal also corrects a framing problem: the repository bundles two distinct concerns under the word "fleet." One is fleet delivery — a delivery and connectivity control plane that gets a validated policy onto robots across sites you cannot directly reach. The other is fleet intelligence — the cognition layer (drift detection, automated retraining, aggregate telemetry analytics). The first is legitimate and necessary at multi-site scale. The second is largely unimplemented and over-advertised. Separating them lets a multi-site operator adopt fleet delivery without inheriting the speculative intelligence layer.
[!IMPORTANT] In this repository, "fleet" refers exclusively to a fleet of robots. It never refers to Kubernetes clusters, nor to Azure Kubernetes Fleet Manager — a distinct Azure product for managing a fleet of clusters, not robots. Cluster-level concerns are always described as "clusters" or "sites," never "fleets." This proposal deliberately avoids the bare phrase "fleet management" because the Kubernetes ecosystem (Azure Kubernetes Fleet Manager, FluxCD, Argo, Rancher Fleet) uses it for the cluster delivery/placement concern; qualifying every use as fleet delivery (T4) or fleet intelligence (T5) keeps the meaning anchored to robots and avoids the collision.
What This Proposal Argues
- The barrier to entry is a documentation and sequencing problem, not a fundamental property of the architecture. The code already supports a low-friction path; the docs do not describe it.
- Adoption should be modeled as six tiers (T0–T5) keyed to operator reach, GPU availability, collaboration, and site count — not robot count.
- The word "fleet" conflates a necessary fleet delivery control plane with a speculative fleet intelligence layer. These should be separate tiers.
- The maturity evidence supports drawing the line where we propose: the implemented core is infrastructure, training, data management, and evaluation. The fleet intelligence cognition is mostly placeholder.
1. The Problem: On-Ramp Friction
The current architecture documentation foregrounds the maximal configuration. It describes a closed-loop, multi-site, GitOps-managed robot fleet with telemetry-driven retraining as the headline identity of the project. Everything else reads as a component of that endgame.
The practical consequence: a user evaluating the toolkit perceives the cost of entry as the sum of every moving part, on both the edge and the cloud:
| Surface | Components the docs imply are required |
|---|---|
| Edge | Arc-enabled Kubernetes (K3s or AKS Edge Essentials), ACSA extension, cert-manager, IngestSubvolume CRD, FluxCD, ROS 2 recording service, gating service, IoT Operations MQTT broker |
| Cloud | AKS, AzureML workspace, Azure Storage, Container Registry, Key Vault, managed identity, Event Hubs, Microsoft Fabric Real-Time Intelligence, Grafana, point-to-site VPN |
Even with the repository's deploy scripts, the perceived setup is a long, ordered sequence of installations and credential wiring across two environments before the first useful result. The "private AKS by default" network model compounds this: a VPN must exist before any kubectl command runs.
[!NOTE] The friction is perceived cost, and perception is set by documentation. The same components, presented as optional additions layered onto a working local baseline, would not read as a barrier. This is why the fix is primarily a re-framing, not a rewrite.
The Core Claim
The edge architecture is premature and over-advertised relative to its implementation, and there is no documented low or middle tier, so a newcomer mistakes the aspirational ceiling for the required floor.
2. Evidence: What Is Actually Built
A file-level survey of the eight domains separates shipped code from aspirational specification. The signal is unambiguous.
| Domain | Python | Shell | Terraform | Placeholder specs |
|---|---|---|---|---|
infrastructure | 0 | 18 | 80 | 0 |
data-management | 49 | 2 | 0 | 0 |
training | 23 | 8 | 0 | 0 |
evaluation | 15 | 7 | 0 | 0 |
data-pipeline | 3 | 1 | 0 | 0 |
synthetic-data | 0 | 0 | 0 | 2 |
fleet-deployment | 1 | 1 | 0 | 5 |
fleet-intelligence | 0 | 4 | 0 | 4 |
The implemented center of gravity is infrastructure provisioning plus training, data management, and evaluation tooling. The two domains that carry the "fleet" identity — fleet-deployment and fleet-intelligence — hold one Python file between them and nine placeholder specifications. The headline identity of the project is its least-built region.
Two consequences follow:
- The heavy edge stack is largely documented intent, not shipped complexity. The barrier is mostly prerequisite weight and framing, not a large body of code a user must understand.
- Because the fleet domains are unbuilt, formalizing a lighter default on-ramp costs almost nothing — there is little working code to reorganize.
The Local Path Already Exists
Three of the four core domains run with no cloud dependency today:
- Training —
training/il/scripts/lerobot/train.pydetects available CUDA devices at runtime and trains wherever it is invoked. The OSMO and AzureML submission scripts are wrappers around this, not a requirement of it. - Evaluation —
evaluation/sil/scripts/run-local-lerobot-eval.pyandplay.pyprovide an explicit local evaluation path. - Data management — the dataviewer defaults to
STORAGE_BACKEND=localand serves entirely from a laptop.
The capability to start small is present in the code. It is simply undocumented as a sanctioned path.
3. Reframing "Fleet": Fleet Delivery vs. Fleet Intelligence
The strongest objection to the current architecture is not "Arc and AKS are unnecessary." At multi-site scale they are genuinely needed. The objection is that the repository welds two different concerns together under one word.
| Concern | What it is | Repository domain | Maturity |
|---|---|---|---|
| Fleet delivery (control plane) | Multi-site connectivity, declarative delivery, safe-swap gating | fleet-deployment | 1 Python file, 5 placeholders |
| Fleet intelligence | Drift detection, retraining triggers, aggregate telemetry analytics | fleet-intelligence | 0 Python files, 4 placeholders |
The fleet delivery control plane is transport. Across sites you do not control, you need a reachability and identity broker (Arc), a runtime (AKS), and a delivery mechanism (FluxCD). This is, in effect, a git push that survives bad networks and locked-down sites, with a safety gate before a policy swaps on a physical arm. It is legitimate and necessary once an operator spans multiple sites, and aligns with how the Kubernetes ecosystem treats "fleet" — a delivery and reconciliation concern.
The fleet intelligence layer is cognition. Observing many robots, detecting aggregate degradation, and closing the retraining loop is the data-science layer over the robot fleet — and it is the unbuilt, over-advertised part.
[!IMPORTANT] Separating these is what lets a multi-site operator adopt the necessary fleet delivery plane without inheriting the speculative fleet intelligence layer. The repository's single word "fleet" makes the necessary inseparable from the speculative. Unbundling the word — and qualifying it as delivery vs. intelligence — is the core architectural move of this proposal.
One honest caveat: the split is real but porous. The gating service is a decision (approve a safe deployment window), so a small amount of cognition already lives in the delivery plane, and a multi-site delivery plane inevitably records per-site desired state, which is the seed of intelligence. The tiers below treat gating as part of delivery and draw the intelligence boundary at drift/retraining/analytics.
4. The Tier Model
Tiers are keyed to the variables that actually drive infrastructure cost:
- Operator reach — can one network and one
kubectlcontext reach every robot? - GPU availability — is there a local GPU, or must compute be rented?
- Collaboration — one person, or a team that needs shared catalogs, tracking, and registries?
- Site count — one location, or many?
Robot count correlates with these but does not cause them. Three identical robots in one lab you control are trivial; three robots across three customer sites you cannot reach are not.
Reference Goal
Each tier states the minimum infrastructure to reach a concrete goal:
Goal G: Capture demonstrations on a robot, train an imitation policy, validate it, and run that policy back on the robot — the full loop, one task. Fleet intelligence is explicitly out of scope of Goal G.
Goal G is fully achievable at Tiers 0 through 2 without any Kubernetes, Arc, or fleet infrastructure. Data augmentation and fleet-intelligence autonomy are modeled as separate optional axes (see Cross-Cutting Axis: Tiered Data Augmentation and The Autonomy Ladder), not steps in the Goal G loop.
Tier Ladder Overview
Each tier carries a stable ID (T0–T5) and a stage name. The IDs are canonical and are used for every boundary and graduation reference; the names are the user-facing labels.
T0 Dev Laptop + 1 robot edge: none cloud: none
T1 Lab Shared GPU + few robots edge: shared disk cloud: Blob (+ optional AzureML/MLflow)
T2 Pilot Single site, at scale edge: none (manual) cloud: AzureML + registry + MLflow
T3 Production Single-site declarative edge: local k3s + Flux cloud: same as T2 (no Arc)
─────────────────────────────────────────────────────────────────── multi-site boundary (Arc)
T4 Scale Multi-site delivery edge: Arc + AKS + Flux cloud: + connectivity/identity
───────────────────────────────────────────────────────────── intelligence boundary
T5 Operate Fleet intelligence edge: + IoT Operations cloud: + Fabric RTI + drift/retraining
T0–T2 satisfy Goal G with manual deployment. T3 adds single-site declarative deployment (local k3s + Flux) without Arc — proving GitOps does not require a cloud fleet control plane. T4 is the multi-site fleet delivery terminus, where Arc becomes necessary as the cross-site reachability and identity broker. T5 is the fleet intelligence layer, marked deferred and over-advertised relative to its implementation.
5. Tier Detail
T0 — Dev. One robot, one laptop. No cloud, no Kubernetes
The honest floor for Goal G.
| Concern | Implementation |
|---|---|
| Capture | ROS 2 bag recording to local disk on the robot or laptop. No Arc, no ACSA, no PVC. |
| Move data | cp or rsync from robot to laptop. |
| Curate | Dataviewer in local mode on the laptop. |
| Train | train.py on a laptop or workstation GPU. |
| Track | File-backed MLflow (file:./mlruns, no server) or trackio — a local process, no service to stand up. |
| Validate | run-local-lerobot-eval.py / play.py locally. |
| Run on robot | The ACT inference node as a plain process or container. No Flux, no gating, no GitOps. |
Edge infra: ROS 2 and Docker only. Cloud infra: none.
[!NOTE] Experiment tracking belongs at T0, not as a collaboration perk. Goal G includes validate, and a success number you cannot reproduce, compare across runs, or use to attribute a regression is an anecdote, not a result — doubly so in the high-variance world of RL and IL. File-backed MLflow and trackio run as local processes with no server, so tracking stays inside the zero-cloud, zero-Kubernetes floor. Tracking as a hosted server plus a model registry is a separate, later concern (T2).
This tier exists in the code today and is undocumented. Surfacing it is the highest-leverage change in this proposal.
Graduate when: no local GPU; the task needs many training iterations as conditions vary; or a second person needs the data.
T1 — Lab. One site, a few robots, a shared GPU box. Storage-only cloud
The small-lab and integrator tier. The first cloud resource added is a single storage account.
| Concern | Implementation | Delta from T0 |
|---|---|---|
| Capture | ROS 2 recording to shared NFS/SMB, or each robot rsyncs up. | shared disk |
| Move data | azcopy or az storage blob upload-batch to one Blob container. | + Blob storage |
| Curate | Dataviewer in azure mode against that container (managed identity or SAS). | viewer → cloud |
| Train | Local shared GPU box, or first optional reach to AzureML on saturation. | optional cloud GPU |
| Track | Local file-backed tracking (carried from T0) optionally promoted to a shared MLflow server once a team needs shared run history. | optional hosted tracking |
| Run on robot | Plain container per robot; hand-update 2–3 robots via docker pull. | unchanged |
Edge infra: shared disk. Cloud infra: one storage account, optionally AzureML and MLflow. No Kubernetes, no Arc, no Flux.
Graduate when: training scale or team size outgrows one GPU box; dataset governance and catalogs become necessary.
T2 — Pilot. One site, several robots, real training scale and collaboration
The tier where cloud training genuinely becomes the default rather than an option.
| Concern | Implementation | Delta from T1 |
|---|---|---|
| Train | AzureML or OSMO as default: multi-GPU, queued jobs, multiple people, VLA scale. | cloud GPU standard |
| Registry | Model registry and versioning become load-bearing. | + registry |
| Curate | Dataviewer deployed as a shared web app rather than localhost. | hosted viewer |
| Capture | ACSA optional if disk pressure or unattended recording warrants it. | optional ACSA |
| Run on robot | Manual docker pull per robot. Hand-updating a handful of reachable robots is still tractable. | unchanged |
Edge infra: none beyond Docker. Cloud infra: AzureML workspace, storage, registry, MLflow. Still no Kubernetes, no Arc, no fleet plane.
Graduate when: the number of robots or the update cadence makes hand-updating each robot error-prone, and version skew across robots becomes a real problem — but all robots are still at one reachable site.
T3 — Production. Local k3s + Flux, no Arc
The tier that proves declarative, GitOps-style deployment does not require
Azure Arc.
Several robots at one site you control, updated often enough that manual docker pull causes
version skew, but all reachable from a single operator network. A single local k3s node running
FluxCD reconciles every robot to a Git-declared desired state. Arc is unnecessary precisely
because there is only one site and you can reach it directly.
| Concern | Implementation |
|---|---|
| Runtime | A single local k3s node (a ~60 MB binary) at the site. |
| Delivery | FluxCD reconciles robots to Git-declared desired state; rollback is a Git revert. |
| Version control | Git becomes the single source of truth for which robot runs which policy. |
| Safe swap | Optional gating before a policy swaps on hardware. |
| Train and curate | Same as T2 (AzureML, registry, MLflow, hosted viewer). |
Why this is a full tier, not a footnote: it isolates the single most misattributed cost in the whole architecture. The expensive part of the "fleet" stack was never running Kubernetes at the edge — single-node k3s idles near zero. The expensive part is the cloud-side control plane (Arc enrollment, identity, policy). T3 delivers GitOps deployment automation while paying none of that. It is the decoupling that makes the entire middle of the ladder honest.
Edge infra: one local k3s node + FluxCD. Cloud infra: same as T2. No Arc, no IoT Operations.
Graduate when: robots span multiple sites, or sites become unreachable from a single operator network — the point at which a cross-site reachability and identity broker becomes genuinely necessary.
T4 — Scale. Multi-site fleet delivery
The legitimate top of the necessary ladder. This is the fleet delivery control plane: getting validated policies onto robots across sites you cannot directly reach, safely. The defining difference from T3 is multiple sites — which is exactly what makes Arc necessary, as the cross-site reachability and identity broker that single-site k3s did not need.
| Concern | Implementation |
|---|---|
| Connectivity and identity | Azure Arc as the reachability and identity broker across sites. |
| Runtime | AKS or Arc-enabled Kubernetes at each site. |
| Delivery | FluxCD GitOps; per-site desired state recorded in Git. |
| Safe swap | Gating service approves deployment windows before a policy swaps on hardware. |
Explicitly excluded at T4: drift detection, automated retraining, aggregate telemetry analytics. This tier delivers and gates; it does not run fleet intelligence.
Edge infra: Arc + AKS/Flux + gating. Cloud infra: connectivity, identity, registry.
Graduate when: the operator explicitly wants production signals to drive retraining and fleet-wide health analytics — a deliberate decision, not an automatic consequence of scale.
T5 — Operate. Fleet intelligence. Deferred
The aspirational layer. Drift detection, automated retraining triggers, aggregate telemetry, and health analytics.
| Concern | Implementation | Status |
|---|---|---|
| Edge telemetry | Azure IoT Operations MQTT aggregation | placeholder |
| Analytics | Microsoft Fabric Real-Time Intelligence, Grafana | placeholder |
| Drift and retraining | Drift detection, retraining triggers, closed loop | placeholder |
Status: mostly unimplemented (0 Python files, 4 placeholders in fleet-intelligence). This tier should be documented as a roadmap direction, with explicit human-in-the-loop gating recommended over fully autonomous retraining. It decomposes into an ordered autonomy ladder rather than a single leap.
The Autonomy Ladder (T5.0–T5.3)
Fleet intelligence is not a single leap. Treating it as one repeats the all-or-nothing framing this proposal argues against everywhere else. The closed loop decomposes into four ordered stages of increasing decision authority. Each is a legitimate stopping point, and the deferral is honest only when the intermediate stages are named: three of the four are unbuilt today (modulo an ad-hoc experiment by the team on Hex).
| Stage | Name | Decision authority | Status |
|---|---|---|---|
| T5.0 | Gated retraining | Humans trigger retraining; the system surfaces signals only. | not built |
| T5.1 | Human-in-the-loop / active learning | The system proposes what to retrain on and when; a human approves each cycle. | ad-hoc (Hex) |
| T5.2 | Continual learning | The system retrains on a schedule or trigger; a human reviews before deployment. | not built |
| T5.3 | Autonomous closed-loop | The system detects drift, retrains, gates, and deploys without human approval. | not built |
[!IMPORTANT] Autonomy is a different axis from T0–T4. T0–T4 scale on infrastructure reach (sites, GPU, collaboration); T5.0–T5.3 scale on decision authority / loop closure. They are orthogonal: a single-site T3 operator can sit at T5.0, and a multi-site T4 operator can remain fully manual. The autonomy stages are not "more infrastructure to buy" \u2014 they are how much of the retraining decision a human delegates.
The foot-gun warning below applies with increasing force up the ladder; T5.3 should remain a roadmap direction, not a near-term target.
[!WARNING] Fully autonomous retraining on production data is a foot-gun: a legitimate distribution change can cause the loop to bake current degraded behavior into the next dataset. Drift detection also needs statistical power that only exists at fleet scale. T5 should default to human-supervised (T5.0–T5.1), not closed-loop.
Cross-Cutting Axis: Tiered Data Augmentation
Data scarcity is unavoidable in physical AI, so the toolchain needs an augmentation story \u2014 but today the only documented path is the full Cosmos/SDG pipeline, which is the aspirational ceiling that T0\u2013T2 users cannot realistically operate. This is the same all-or-nothing trap as the rest of the architecture: a heavyweight ceiling with no documented low or middle rung. Augmentation should therefore be a tiered, optional axis, recommended when data is scarce \u2014 not a baseline step and explicitly not part of Goal G (folding it into the anchor goal would undermine the "this is the honest minimal floor" argument).
| Stage | Approach | Where it runs |
|---|---|---|
| A0 | Classical CV augmentation (crops, jitter, blur, photometric/geometric transforms) | Local, CPU, no model |
| A1 | Local small-VLM generation (e.g. via vLLM locally at T0, or Azure AI Foundry at T1) | Local GPU or hosted endpoint |
| A2 | Full Cosmos / SDG world-foundation-model pipeline | Cloud, GPU cluster |
[!NOTE] Unlike the documentation-and-packaging changes elsewhere in this proposal, the A0\u2013A1 rungs are net-new code:
synthetic-dataships 0 Python files and 2 placeholders today. Classical augmentation, local-VLM generation, and the accompanying experiments and guidance are a near-term build item, not a re-framing of existing artifacts. Naming them here is what gives the augmentation axis the credibility the toolchain currently lacks; delivering A0\u2013A1 is a separate, scoped workstream.
6. What Changes for the Repository
This proposal is primarily documentation and packaging, not a code rewrite.
| Change | Effort | Impact |
|---|---|---|
| Document T0 as the sanctioned starting path | Low — code already supports it | Removes the perceived barrier directly |
| Restructure the architecture doc around T0–T5 | Low–medium | Reframes the on-ramp; sets correct expectations |
| Split "fleet" into fleet delivery (T4) and fleet intelligence (T5) language, reserving "fleet" for robots | Low | Lets multi-site users adopt delivery without intelligence; ends the Azure Kubernetes Fleet Manager naming collision |
| Mark T5 components explicitly as roadmap | Low | Stops the unbuilt ceiling from reading as a floor |
| Name the T5.0–T5.3 autonomy ladder | Low — documentation only | Turns the deferral into a roadmap instead of a cliff |
| Add per-tier quick-starts | Medium | Gives each audience a concrete entry point |
| Implement A0–A1 data augmentation (classical CV, local-VLM) | Medium — net-new code | Closes the credibility gap; the only build item in this list |
The deploy scripts, Terraform modules, and training code do not need to change to support the tier model itself. The tiers describe which subset of existing infrastructure a user adopts, and in what order. The one exception is the A0–A1 augmentation rungs, which are an explicit new workstream rather than a re-framing of shipped artifacts.
7. Open Questions for Socialization
- Default tier in docs. Should the README and Quick Start default to T0 (Dev), with T2 (Pilot) as the "recommended production" path and T3–T5 (Production, Scale, Operate) clearly marked advanced? Do the stage names — Dev, Lab, Pilot, Production, Scale, Operate — read correctly, and is pairing each name with its
T#ID the right convention (stable IDs for boundary references, names for user-facing labels)? - Fleet vocabulary. Does the repository adopt fleet delivery (T4 control plane) and fleet intelligence (T5 cognition) as distinct named concepts, reserve the word "fleet" exclusively for a fleet of robots, and retire the bare phrase "fleet management" (which collides with Azure Kubernetes Fleet Manager and the Kubernetes ecosystem's delivery-oriented use of "fleet")?
- Roadmap honesty. How explicitly should placeholder domains be labeled in user-facing docs versus contributor docs?
- Scope of Goal G. The proposal keeps the single-task capture → train → validate → run loop as the anchor goal and models data augmentation as a separate optional axis (A0–A2) rather than folding it into Goal G. Is keeping the floor minimal — with augmentation recommended only when data is scarce — the right call, or should augmentation be part of the reference loop?
8. Summary
The toolchain is strong where it is built and over-advertised where it is not. The barrier to entry is the result of presenting an aspirational, multi-site, fleet-intelligence ceiling as the required floor. The fix is to:
- Surface the already-working local path (T0) as the sanctioned starting point.
- Model adoption as graduated tiers keyed to reach, GPU, collaboration, and site count.
- Unbundle "fleet" into a necessary fleet delivery control plane (T4) and a deferred fleet intelligence layer (T5).
This lowers the on-ramp without removing any capability, and it aligns the documented architecture with what the code actually delivers today.