
Quickstart: Clone to First Training Job

Deploy the full Azure NVIDIA Robotics stack and submit a training job in roughly 1.5-2 hours. This guide uses the full-public networking path and Access Keys authentication for the simplest setup.

[!NOTE] This guide expands on the Getting Started hub.

Prerequisites

| Requirement | Details |
| --- | --- |
| Azure subscription | Contributor + User Access Administrator roles |
| GPU quota | Standard_NC24ads_A100_v4 in target region |
| NVIDIA NGC account | Sign up at https://ngc.nvidia.com/ for an API key |
| Development environment | Devcontainer (recommended) or local tools |

See Prerequisites for installation commands and version requirements.
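Before starting, it can save time to confirm the required CLIs are on your PATH. The sketch below only checks presence, not versions, and the exact tool list is an assumption — see Prerequisites for the authoritative list and version requirements.

```shell
#!/usr/bin/env bash
# Preflight sketch: confirm the CLIs this guide invokes are installed.
# (Tool list is an assumption; Prerequisites is authoritative.)
set -u

require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

for tool in az terraform kubectl git; do
  require_cmd "$tool" || true
done
```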

Step 1: Clone and Set Up Environment

Clone the repository and initialize the development environment.

git clone https://github.com/microsoft/physical-ai-toolchain.git
cd physical-ai-toolchain

Use the devcontainer (recommended) or run local setup:

./setup-dev.sh

Step 2: Configure Azure Subscription

Authenticate with Azure and register required resource providers.

source infrastructure/terraform/prerequisites/az-sub-init.sh
bash infrastructure/terraform/prerequisites/register-azure-providers.sh

Verify your subscription:

az account show --query "{name:name, id:id}" -o table

Step 3: Configure Terraform Variables

Create a Terraform variables file for the full-public deployment path. From the repository root:

cd infrastructure/terraform
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars with these values:

project_name = "robotics"
environment = "dev"
location = "eastus"
gpu_vm_size = "Standard_NC24ads_A100_v4"

enable_azure_ml = true
enable_osmo = true
enable_vpn_gateway = false
enable_private_dns = false

[!TIP] For private networking, set enable_vpn_gateway = true and enable_private_dns = true. See the Infrastructure Guide for details.

Step 4: Deploy Infrastructure

Initialize and apply the Terraform configuration. This step takes ~30-40 minutes.

terraform init
terraform plan -out=tfplan
terraform apply tfplan

Verify deployment:

terraform output

Connect to the AKS cluster:

az aks get-credentials \
  --resource-group "$(terraform output -raw resource_group_name)" \
  --name "$(terraform output -raw aks_cluster_name)"

Step 5: Configure AKS Cluster

Deploy GPU Operator, KAI Scheduler, and the AzureML extension. From the repository root:

cd infrastructure/setup
bash 01-deploy-robotics-charts.sh
bash 02-deploy-azureml-extension.sh

Verify GPU operator pods:

kubectl get pods -n gpu-operator
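GPU Operator pods can take several minutes to become healthy, so rather than re-running `kubectl get pods` by hand you can poll until they are Ready. The helper below is a generic sketch (it assumes the gpu-operator pods report a standard Ready condition; the 600-second budget is an arbitrary choice):

```shell
#!/usr/bin/env bash
# Sketch: poll a command until it succeeds or a deadline passes.
# Example use (assumes pods expose the standard Ready condition):
#   poll_until 600 kubectl wait --for=condition=Ready pods --all \
#     -n gpu-operator --timeout=10s
set -u

poll_until() {
  # $1: timeout in seconds; remaining args: command to retry.
  local deadline=$(( $(date +%s) + $1 )); shift
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 5
  done
}
```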

Step 6: Deploy OSMO Components

Deploy the OSMO control plane and backend using Access Keys authentication.

bash 03-deploy-osmo-control-plane.sh
bash 04-deploy-osmo-backend.sh --use-access-keys

Verify OSMO pods:

kubectl get pods -n osmo-control-plane

Step 7: Submit First Training Job

Navigate to the scripts directory and submit a training job. From the repository root:

cd scripts
bash submit-osmo-training.sh

Scripts auto-detect configuration from Terraform outputs. Override values with CLI arguments or environment variables as needed. See Scripts Reference for all submission options.
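The auto-detect-with-override behavior can be pictured as the pattern below. This is an illustrative sketch, not the scripts' actual internals — the function and variable names here (resolve, AKS_CLUSTER_NAME) are hypothetical; see Scripts Reference for the real options.

```shell
#!/usr/bin/env bash
# Sketch of the auto-detect pattern: prefer an explicit environment
# override, otherwise fall back to the matching Terraform output.
# (Names are illustrative only; see Scripts Reference for real options.)
set -euo pipefail

resolve() {
  # $1: override value (may be empty), $2: Terraform output name
  local override="$1" tf_output_name="$2"
  if [ -n "$override" ]; then
    echo "$override"
  else
    terraform -chdir=infrastructure/terraform output -raw "$tf_output_name"
  fi
}

# Hypothetical usage:
#   CLUSTER_NAME="$(resolve "${AKS_CLUSTER_NAME:-}" aks_cluster_name)"
```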

Step 8: Verify Results

Confirm the training job is running:

kubectl get pods -n osmo-control-plane --watch

Check OSMO training status through the OSMO web UI or query pod logs:

kubectl logs -n osmo-control-plane -l app=osmo-training --tail=50

Cleanup

Destroy all infrastructure when finished to stop incurring costs. From the repository root:

cd infrastructure/terraform
terraform destroy

See Cost Considerations for detailed pricing.

Next Steps

| Resource | Description |
| --- | --- |
| MLflow Integration | Track experiments with MLflow |
| Deployment Guide | Full deployment reference and options |
| Contributing Guide | Development workflow and code standards |