Deploy the full Azure NVIDIA Robotics stack and submit a training job in ~1.5-2 hours. This guide uses full-public networking and Access Keys authentication for the simplest path.
> [!NOTE]
> This guide expands on the Getting Started hub.
| Requirement | Details |
|---|---|
| Azure subscription | Contributor + User Access Administrator roles |
| GPU quota | Standard_NC24ads_A100_v4 in target region |
| NVIDIA NGC account | Sign up at https://ngc.nvidia.com/ for API key |
| Development environment | Devcontainer (recommended) or local tools |
See Prerequisites for installation commands and version requirements.
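Before deploying, you can confirm that your subscription has quota for the required GPU family in the target region. This is a sketch: the family-name filter is an assumption, so verify the exact value against the full usage listing for your subscription.

```bash
# Show current vCPU usage and limits for NC A100 v4-family sizes in eastus.
# The 'NCADS_A100' substring is an assumption about the family name --
# run `az vm list-usage --location eastus -o table` to see the exact names.
az vm list-usage --location eastus \
  --query "[?contains(name.value, 'NCADS_A100')]" -o table
```

If the limit column shows 0, request a quota increase for the family before running Terraform.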
Clone the repository and initialize the development environment.
```bash
git clone https://github.com/microsoft/physical-ai-toolchain.git
cd physical-ai-toolchain
```
Use the devcontainer (recommended) or run local setup:
```bash
./setup-dev.sh
```
Authenticate with Azure and register required resource providers.
```bash
source infrastructure/terraform/prerequisites/az-sub-init.sh
bash infrastructure/terraform/prerequisites/register-azure-providers.sh
```
Verify your subscription:
```bash
az account show --query "{name:name, id:id}" -o table
```
Create a Terraform variables file for the full-public deployment path. From the repository root:
```bash
cd infrastructure/terraform
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars` with these values:
```hcl
project_name       = "robotics"
environment        = "dev"
location           = "eastus"
gpu_vm_size        = "Standard_NC24ads_A100_v4"
enable_azure_ml    = true
enable_osmo        = true
enable_vpn_gateway = false
enable_private_dns = false
```
> [!TIP]
> For private networking, set `enable_vpn_gateway = true` and `enable_private_dns = true`. See the Infrastructure Guide for details.
Initialize and apply the Terraform configuration. This step takes ~30-40 minutes.
```bash
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```
Verify deployment:
```bash
terraform output
```
Connect to the AKS cluster:
```bash
az aks get-credentials \
  --resource-group "$(terraform output -raw resource_group_name)" \
  --name "$(terraform output -raw aks_cluster_name)"
```
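Once the credentials are merged, a quick sanity check that `kubectl` is pointed at the new cluster and its nodes are healthy:

```bash
# Confirm the current context targets the AKS cluster and all nodes are Ready.
kubectl get nodes -o wide
```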
Deploy GPU Operator, KAI Scheduler, and the AzureML extension. From the repository root:
```bash
cd infrastructure/setup
bash 01-deploy-robotics-charts.sh
bash 02-deploy-azureml-extension.sh
```
Verify GPU operator pods:
```bash
kubectl get pods -n gpu-operator
```
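Beyond the pods being Running, the driver and device plugin should advertise the A100s to the kubelet as allocatable resources. One way to check (a sketch):

```bash
# List allocatable nvidia.com/gpu per node; a non-zero count means the
# GPU Operator's device plugin has registered the GPUs successfully.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```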
Deploy the OSMO control plane and backend using Access Keys authentication.
```bash
bash 03-deploy-osmo-control-plane.sh
bash 04-deploy-osmo-backend.sh --use-access-keys
```
Verify OSMO pods:
```bash
kubectl get pods -n osmo-control-plane
```
Navigate to the scripts directory and submit a training job. From the repository root:
```bash
cd scripts
bash submit-osmo-training.sh
```
Scripts auto-detect configuration from Terraform outputs. Override values with CLI arguments or environment variables as needed. See Scripts Reference for all submission options.
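As a sketch of the override pattern, you can export environment variables before invoking the script. The variable names below are hypothetical placeholders, not confirmed options; the Scripts Reference lists the actual names.

```bash
# Hypothetical example: NUM_GPUS and OSMO_NAMESPACE are illustrative
# variable names only -- check Scripts Reference for the real ones.
NUM_GPUS=2 OSMO_NAMESPACE=osmo-control-plane bash submit-osmo-training.sh
```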
Confirm the training job is running:
```bash
kubectl get pods -n osmo-control-plane --watch
```
Check OSMO training status through the OSMO web UI or query pod logs:
```bash
kubectl logs -n osmo-control-plane -l app=osmo-training --tail=50
```
Destroy all infrastructure when finished to stop incurring costs. From the repository root:
```bash
cd infrastructure/terraform
terraform destroy
```
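After the destroy completes, you can double-check that the deployment's resource group is gone (a sketch; the group name shown is a placeholder, since Terraform outputs are no longer available after destroy):

```bash
# Returns "false" once the resource group has been deleted.
# Replace the name with the resource_group_name output you noted earlier.
az group exists --name "robotics-dev-rg"
```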
See Cost Considerations for detailed pricing.
| Resource | Description |
|---|---|
| LeRobot Inference | Run inference with trained LeRobot models |
| MLflow Integration | Track experiments with MLflow |
| Deployment Guide | Full deployment reference and options |
| Contributing Guide | Development workflow and code standards |