This repository provides Azure-integrated infrastructure and tooling for NVIDIA Isaac Lab-based robotics training, inference, and orchestration through NVIDIA OSMO and Azure Machine Learning.
| Directory | Purpose |
|---|---|
| `deploy/` | Ordered Terraform IaC and shell scripts for Azure infrastructure provisioning and Kubernetes cluster setup. |
| `src/` | Python code for training and inference, acting as a shim between NVIDIA OSS libraries and Azure services. |
| `workflows/` | OSMO workflow and AzureML job YAML definitions for training, inference, and validation. |
| `scripts/` | Convenience scripts for submitting OSMO workflows and AzureML jobs, plus CI linting and security tooling. |
| `external/` | Cloned NVIDIA Isaac Lab repository for local development IntelliSense and reference. |
| `docs/` | Documentation covering GPU configuration, MLflow integration, deployment validation, and contributing guides. |
The `src/` directory contains three packages:

| Package | Contents |
|---|---|
| `common/` | Shared CLI argument parsing utilities |
| `training/` | Isaac Lab training scripts with skrl, RSL-RL, and LeRobot integration plus Azure MLflow metric tracking |
| `inference/` | Policy export, playback, and inference node scripts |
Training scripts and Python code act as integration shims between NVIDIA’s open-source repositories and Azure connectivity features: MLflow from AzureML for tracking training metrics, logs, and checkpoints; the AzureML model registry for checkpoint versioning; and Azure Blob Storage for dataset access.
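To picture the shim pattern, here is a minimal, hypothetical helper of the kind such a layer might contain: it sanitizes framework-native metric names before they are forwarded to AzureML's MLflow tracking, since MLflow metric keys only allow alphanumerics, underscores, dashes, periods, spaces, and slashes. The function name and key conventions are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of the shim pattern: translate skrl/RSL-RL metric names
# into MLflow-legal keys before logging them to AzureML. Illustrative only.
import re

def sanitize_metrics(raw: dict[str, float]) -> dict[str, float]:
    """Replace characters MLflow rejects in metric keys with underscores."""
    return {re.sub(r"[^0-9A-Za-z_\-./ ]", "_", k): float(v) for k, v in raw.items()}

print(sanitize_metrics({"Loss: policy": 0.03, "Reward / mean": 1.25}))
```

A real shim would then pass the sanitized dictionary to `mlflow.log_metrics` inside the training loop.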
The root `pyproject.toml` manages dependencies for local development:

| Context | Usage |
|---|---|
| Local development | Provides module availability for IntelliSense and verification |
This setup is not intended for building publishable Python packages; the `pyproject.toml` build target only packages `training/rl` into a wheel for in-container use.
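A minimal sketch of what such a build target could look like, assuming a setuptools backend; the project name and include pattern below are assumptions, and the actual file in this repository may differ:

```toml
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "training-rl"   # hypothetical package name
version = "0.1.0"

[tool.setuptools.packages.find]
where = ["src"]
include = ["training.rl*"]   # package only the RL training code for in-container use
```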
This codebase will be reorganized around eight lifecycle domains for robotics and physical AI, each built on current Azure services and NVIDIA's Physical AI Stack. Each domain represents a distinct functional concern in the physical AI lifecycle.
Each domain maps to a root-level directory in this repository. Domains that require Azure infrastructure beyond what `infrastructure/` provides maintain their own IaC subdirectories.
| Domain | Directory | Scope |
|---|---|---|
| Infrastructure | `infrastructure/` | Shared Azure services: AKS, AzureML, networking, storage, observability |
| Data Pipeline | `data-pipeline/` | Robot-to-cloud data capture via Azure Arc and ROS 2 episodic recording |
| Data Management | `data-management/` | Episodic data viewer, labeling, dataset curation, and job orchestration |
| Synthetic Data | `synthetic-data/` | SDG pipelines leveraging NVIDIA Cosmos world foundation models |
| Training | `training/` | Policy training, packaging to TensorRT/ONNX, and model registration |
| Evaluation | `evaluation/` | Software-in-the-loop and hardware-in-the-loop validation pipelines |
| Fleet Deployment | `fleet-deployment/` | Edge deployment via FluxCD GitOps on Azure Arc-enabled Kubernetes |
| Fleet Intelligence | `fleet-intelligence/` | Production telemetry, on-robot policy analytics, and fleet health reporting |
Shared Azure services required across all domains. Terraform modules provision AKS clusters with GPU node pools, AzureML workspaces, Azure Container Registry, Key Vault, managed identities, networking (VNet, subnets, NAT Gateway), and observability (Azure Monitor, DCGM metrics). Domain-specific infrastructure that stands alone (VPN, automation, DNS) deploys from subdirectories within each domain rather than the shared module.
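As an illustration of the shared-infrastructure layer, a GPU node pool might be attached to the AKS cluster roughly as follows. The resource names, VM size, and taint are assumptions for the sketch, not this repository's actual Terraform configuration:

```hcl
# Hypothetical sketch: add a GPU node pool to an existing AKS cluster.
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpupool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_NC24ads_A100_v4" # assumed GPU SKU
  node_count            = 1

  # Keep non-GPU workloads off these nodes; pods must tolerate the taint.
  node_taints = ["nvidia.com/gpu=present:NoSchedule"]
  node_labels = {
    accelerator = "nvidia"
  }
}
```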
Tooling and infrastructure for capturing real-world robot data and transmitting it to Azure. This domain covers:
Episodic data follows the LeRobot dataset format to maintain compatibility with the broader robotics ML ecosystem.
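For orientation, the on-disk layout of a LeRobot-format dataset (v2.x) looks roughly like the sketch below; exact file and camera-key names vary between library versions, so treat this as an approximation rather than the authoritative schema:

```text
meta/info.json                            # schema, fps, feature shapes
meta/episodes.jsonl                       # per-episode index and lengths
meta/tasks.jsonl                          # task descriptions
data/chunk-000/episode_000000.parquet     # per-step observations and actions
videos/chunk-000/observation.images.<cam>/episode_000000.mp4
```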
An episodic data viewer and curator built on top of LeRobot’s visualization tooling. The viewer runs locally for development and can optionally be deployed to an Azure-hosted web app through the included setup scripts. Capabilities include:
Pipelines for synthetic data generation (SDG) through OSMO workflows and AzureML jobs. This domain will incorporate NVIDIA Cosmos world foundation models for generating photorealistic training data:
The Cosmos Cookbook provides post-training scripts and recipes that this domain’s workflows will reference for model customization.
End-to-end training pipeline from raw data to packaged, deployable model artifacts. Training code is organized by learning approach, with each approach containing its own source, workflows, and configuration:
| Approach | Directory | Scope |
|---|---|---|
| RL | `training/rl/` | Reinforcement learning with Isaac Lab (skrl, RSL-RL) |
| IL | `training/il/` | Imitation learning with LeRobot and demonstration datasets |
| VLA | `training/vla/` | Vision-language-action model training for generalist robot policies |
Cross-cutting concerns shared across approaches:
A complete example pipeline demonstrates the full path from trained checkpoint to containerized inference image registered in AzureML.
Software-in-the-loop (SiL) and hardware-in-the-loop (HiL) validation pipelines for trained policies. Both approaches use Isaac Sim to emulate the target robot, with the trained policy controlling the simulation.
| Approach | Infrastructure | Policy Host |
|---|---|---|
| SiL | Any available compute that can serve the policy as an inference endpoint | AzureML managed endpoint or AKS |
| HiL | Target deployment hardware (typically NVIDIA Jetson) running the containerized TensorRT or ONNX policy | Edge device matching production |
Evaluation metrics capture to:
Setup scripts deploy evaluation pipelines to OSMO and AzureML compute targets. Full end-to-end evaluation pipelines orchestrate policy loading, simulation execution, metric collection, and result publishing as a single workflow.
Isaac Sim connects to the deployed policy endpoint, streaming observations to the policy and receiving control signals back, to produce evaluation episodes that the Data Management domain can review.
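The SiL rollout can be pictured as the following sketch, with the policy endpoint abstracted behind a callable. The function names and the episode contract (reward, done flag, return value) are illustrative assumptions, not this repository's actual interfaces:

```python
# Hypothetical SiL evaluation rollout. `policy` stands in for a call to the
# deployed inference endpoint; `env_step` stands in for one Isaac Sim step.
def run_episode(policy, env_step, initial_obs, max_steps=500):
    """Roll out one episode; return (completed, episode_return)."""
    obs, total_reward = initial_obs, 0.0
    for _ in range(max_steps):
        action = policy(obs)                   # observation out, control signal back
        obs, reward, done = env_step(action)   # advance the simulation
        total_reward += reward
        if done:
            return True, total_reward
    return False, total_reward                 # episode hit the step budget
```

A real pipeline would wrap `policy` around an HTTP or gRPC client for the AzureML endpoint (SiL) or the edge device (HiL) and publish the collected metrics afterward.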
Edge deployment of packaged policy containers to robots through GitOps. This domain includes:
The deployment flow: AzureML model registry publishes a new container image, FluxCD detects the manifest update, the hot-loader pulls and stages the image on the Arc cluster, and the gating service approves deployment to the robot on the operator’s schedule.
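The "FluxCD detects the manifest update" step could be configured with Flux's image automation resources along these lines; the registry, image path, namespace, and semver range below are assumptions for the sketch:

```yaml
# Hypothetical sketch: FluxCD watches ACR for new policy container images.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: policy-image
  namespace: flux-system
spec:
  image: myregistry.azurecr.io/policies/pick-and-place  # assumed image path
  interval: 5m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: policy-image
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: policy-image
  policy:
    semver:
      range: ">=1.0.0"   # only promote tagged releases
```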
Production monitoring, robotics telemetry, and on-robot policy performance analytics that close the data flywheel. The deployment lifecycle does not end when a policy reaches a robot. This domain captures what happens after deployment and feeds insights back into Data Pipeline and Training.
Capabilities include:
This domain distinguishes itself from Evaluation (which validates policies in simulation before deployment) by focusing on real-world, production-time signals from physical robots operating in uncontrolled environments.
Simulation environment authoring, including robot asset import (USD, URDF, MJCF), scene configuration, domain randomization, and Isaac Lab task design, is a prerequisite for training and evaluation. NVIDIA provides comprehensive tooling and documentation for these workflows through Isaac Sim and the Isaac Lab Reference Architecture.
This repository will not maintain a separate codebase domain for simulation. Instead, the docs/ directory will provide guidance on:
This project uses GitHub Copilot agents, instructions, prompts, and skills to automate development workflows. Tooling comes from two sources: the HVE-Core extension (shared across Microsoft HVE projects) and project-specific artifacts defined in .github/.
The hve-core-all VS Code extension provides shared agentic tooling:
| Artifact Type | Count | Examples |
|---|---|---|
| Agents | 33 | RPI workflow, backlog management, PR creation |
| Instructions | 24 | Coding standards (Bash, C#, Python, Terraform, Bicep), commit messages, markdown |
| Prompts | 27 | ADO work items, GitHub issues, security planning, PR descriptions |
| Skills | 2 | PR reference generation, video-to-GIF conversion |
HVE-Core artifacts are registered via the extension's `package.json` `contributes` section and loaded when the extension activates.
This repository defines project-specific artifacts in .github/ that extend HVE-Core with domain knowledge:
| Artifact Type | Count | Purpose |
|---|---|---|
| Agents | 2 | OSMO training manager, dataviewer developer |
| Instructions | 4 | Copilot instructions, dataviewer conventions, documentation style, shell scripts |
| Prompts | 4 | OSMO training submission, LeRobot pipeline, dataviewer workflows |
| Skills | 2 | Dataviewer interaction, OSMO LeRobot training |
Project artifacts are auto-discovered by VS Code from the .github/ directory without explicit registration.
Two workflow chains compose these artifacts:
1. `osmo-training-manager` agent → `osmo-lerobot-training` skill → training submission prompts
2. `dataviewer-developer` agent → `dataviewer` skill → `dataviewer` instruction conventions

Each artifact type uses YAML frontmatter to declare behavior:
| Artifact | File Pattern | Key Frontmatter | Loading |
|---|---|---|---|
| Agents | `*.agent.md` | `mode`, `tools`, `description` | Auto-discovered from `.github/agents/` |
| Instructions | `*.instructions.md` | `applyTo`, `description` | Auto-discovered from `.github/instructions/` |
| Prompts | `*.prompt.md` | `mode`, `description`, `tools` | Auto-discovered from `.github/prompts/` |
| Skills | `SKILL.md` | N/A (referenced by agents) | Referenced via `copilot-skill:` URI |
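For illustration, the frontmatter of a hypothetical `*.agent.md` could look like the sketch below; the field values are invented for the example and do not describe the actual artifacts in this repository:

```yaml
---
description: Submit OSMO training workflows and monitor their progress
mode: agent
tools: ["codebase", "terminal"]
---
```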
HVE-Core artifacts follow the same patterns but load through extension contribution points rather than workspace auto-discovery.
For the detailed per-artifact inventory and workflow chain diagrams, see Copilot Artifacts.
Each domain will contain specification documents alongside working examples. These specifications serve as structured inputs for GitHub Copilot Agent Skills, enabling customers to adapt this reference architecture to their own codebase and infrastructure.
Every domain directory will include:
| Artifact | Purpose |
|---|---|
| `README.md` | Domain overview, quick start, and usage guide |
| `examples/` | Complete, runnable examples with code, scripts, and configurations |
| `specifications/` | Domain specifications describing capabilities, inputs, outputs, and contracts |
| `specifications/*.specification.md` | Individual specifications that Agent Skills consume to generate customized implementations |
| `.github/skills/` | Agent Skill definitions referencing the domain's specifications |
Domain documentation lives under the root docs/ directory rather than inside each domain folder. Each domain has a corresponding subdirectory at docs/<domain>/ containing detailed guidance, architecture decisions, and tutorials.
Specifications define the contracts and patterns for each domain so that Agent Skills can generate customized implementations:
Each customer may have different hardware configurations, Azure subscription topologies, network constraints, and compliance requirements. Specifications capture the variability points so Agent Skills can produce implementations that fit, rather than requiring manual adaptation of generic examples.
A training domain specification might define:
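For illustration only, a fragment of a hypothetical training specification could be structured like this; every field name and value below is an assumption invented for the sketch:

```yaml
# Hypothetical training specification fragment (not an actual contract).
capability: policy-training
inputs:
  dataset_uri: <customer datastore path>        # variability point
  gpu_sku: Standard_NC24ads_A100_v4             # variability point
outputs:
  model_formats: [onnx, tensorrt]
  registry: azureml-model-registry
contract:
  metrics_backend: mlflow
```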
```text
physical-ai-toolchain/
├── infrastructure/          # Shared Azure IaC and cluster setup
│   ├── terraform/           # Terraform modules and root configurations
│   ├── setup/               # Post-IaC Kubernetes and OSMO setup scripts
│   └── specifications/      # Infrastructure specifications for Agent Skills
├── data-pipeline/           # Robot-to-cloud data capture
│   ├── setup/               # Deploy Arc, edge agents, and transfer components
│   ├── arc/                 # Azure Arc configuration and scripts
│   ├── capture/             # ROS 2 episodic data recording
│   ├── examples/            # End-to-end data pipeline examples
│   └── specifications/      # Data pipeline specifications for Agent Skills
├── data-management/         # Episodic data viewer and curation
│   ├── setup/               # Deploy viewer to Azure web app
│   ├── viewer/              # Data viewer application (runs locally or hosted)
│   ├── tools/               # CLI tools for dataset operations
│   ├── examples/            # Data management workflow examples
│   └── specifications/      # Data management specifications for Agent Skills
├── synthetic-data/          # Synthetic data generation pipelines
│   ├── workflows/           # OSMO and AzureML SDG job definitions
│   ├── cosmos/              # Cosmos model integration and configs
│   ├── examples/            # SDG pipeline examples
│   └── specifications/      # SDG specifications for Agent Skills
├── training/                # Policy training and model packaging
│   ├── setup/               # Deploy training pipelines to OSMO and AzureML
│   ├── rl/                  # Reinforcement learning (skrl, RSL-RL)
│   ├── il/                  # Imitation learning (LeRobot)
│   ├── vla/                 # Vision-language-action model training
│   ├── pipelines/           # End-to-end train, export, package, register
│   ├── packaging/           # TensorRT/ONNX export and containerization
│   ├── examples/            # Training pipeline examples
│   └── specifications/      # Training specifications for Agent Skills
├── evaluation/              # SiL and HiL validation
│   ├── setup/               # Deploy evaluation pipelines to OSMO and AzureML
│   ├── sil/                 # Software-in-the-loop pipelines
│   ├── hil/                 # Hardware-in-the-loop pipelines
│   ├── pipelines/           # End-to-end evaluation workflows
│   ├── metrics/             # Metric collection and reporting
│   ├── examples/            # Evaluation pipeline examples
│   └── specifications/      # Evaluation specifications for Agent Skills
├── fleet-deployment/        # Edge deployment via GitOps
│   ├── gitops/              # FluxCD manifests and bootstrap scripts
│   ├── gating/              # Policy approval and scheduling service
│   ├── examples/            # Fleet deployment workflow examples
│   └── specifications/      # Fleet deployment specifications for Agent Skills
├── fleet-intelligence/      # Production telemetry and policy analytics
│   ├── setup/               # Deploy IoT Operations, telemetry, and dashboards
│   ├── telemetry/           # On-robot telemetry capture and routing
│   ├── dashboards/          # Grafana and Azure Monitor configurations
│   ├── drift/               # Policy drift detection and alerting
│   ├── examples/            # Fleet intelligence pipeline examples
│   └── specifications/      # Fleet intelligence specifications for Agent Skills
├── scripts/                 # Cross-domain CI, linting, and security tooling
├── external/                # Cloned external repositories for reference
├── docs/                    # All domain and cross-domain documentation
│   ├── infrastructure/      # Infrastructure architecture and guides
│   ├── data-pipeline/       # Data pipeline guides
│   ├── data-management/     # Data management guides
│   ├── synthetic-data/      # SDG guides
│   ├── training/            # Training guides
│   ├── evaluation/          # Evaluation guides
│   ├── fleet-deployment/    # Fleet deployment guides
│   ├── fleet-intelligence/  # Fleet intelligence guides
│   ├── simulation/          # Simulation setup and guidance
│   └── contributing/        # Repository contributing and architecture design decisions
└── .github/                 # Agent Skills, instructions, and CI workflows
    └── skills/              # Domain-linked Agent Skill definitions
```
| Resource | Relevance |
|---|---|
| NVIDIA Isaac Lab | Robot learning framework for simulation-based RL and IL training |
| NVIDIA Isaac Sim | Physics simulation platform underlying Isaac Lab |
| Isaac Lab Reference Architecture | End-to-end robot learning workflow from asset import to deployment |
| NVIDIA Cosmos Platform | World foundation models for physical AI (Predict, Transfer, Reason) |
| Cosmos-Transfer2.5 | Sim-to-real photorealistic video generation from simulation |
| Cosmos-Predict2.5 | Future state prediction and world simulation |
| Cosmos Cookbook | Post-training recipes for Cosmos model customization |
| NVIDIA OSMO | Cloud-native orchestration for AI simulation and training |
| LeRobot | Hugging Face robotics ML library for imitation learning |
| Azure Machine Learning | ML model training, registry, and deployment on Azure |
| Azure AI Foundry | Centralized model management and deployment platform |
| Azure Arc-enabled Kubernetes | Kubernetes management for edge clusters connected to Azure |
| Azure IoT Operations | Edge telemetry aggregation and device management for robotics |
| FluxCD | GitOps toolkit for Kubernetes continuous delivery |
| Azure Monitor | Observability and metrics for Azure and hybrid workloads |
| Microsoft Fabric RTI | Streaming telemetry analysis for fleet intelligence |