Quickstart: Clone to First Training Job

Deploy the full Azure NVIDIA Robotics stack and submit a training job in ~1.5-2 hours. This guide uses full-public networking and Access Keys authentication for the simplest path.

[!NOTE] This guide expands on the Getting Started hub.

Prerequisites

| Requirement | Details |
| --- | --- |
| Azure subscription | Contributor + User Access Administrator roles |
| GPU quota | Standard_NV36ads_A10_v5 (A10 Spot, default) or Standard_NC40ads_H100_v5 (H100) in target region |
| NVIDIA NGC account | Sign up at https://ngc.nvidia.com/ for API key |
| Development environment | Devcontainer (recommended) or local tools |

See Prerequisites for installation commands and version requirements.

Step 1: Clone and Set Up Environment

Clone the repository and initialize the development environment.

git clone https://github.com/microsoft/physical-ai-toolchain.git
cd physical-ai-toolchain

Use the devcontainer (recommended) or run local setup:

./setup-dev.sh
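
If you run the local setup instead of the devcontainer, a quick sanity check that the tools used later in this guide are on your PATH (version requirements are listed in Prerequisites); a minimal sketch:

for tool in az terraform kubectl helm jq; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done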

Step 2: Configure Azure Subscription

Authenticate with Azure and register required resource providers.

source infrastructure/terraform/prerequisites/az-sub-init.sh
bash infrastructure/terraform/prerequisites/register-azure-providers.sh

Verify your subscription:

az account show --query "{name:name, id:id}" -o table
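
You can also confirm that key resource providers finished registering before deploying; a minimal check (Microsoft.ContainerService and Microsoft.MachineLearningServices are shown as examples; the full list is in the registration script):

for ns in Microsoft.ContainerService Microsoft.MachineLearningServices; do
  echo -n "$ns: "; az provider show --namespace "$ns" --query registrationState -o tsv
done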

Step 3: Configure Terraform Variables

Create a Terraform variables file for the full-public deployment path. From the repository root:

cd infrastructure/terraform
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars with your values:

environment = "dev"
location = "westus3"
resource_prefix = "yourprefix"
instance = "001"

// Full-public networking (simplest path for inner developer loop)
should_enable_private_endpoint = false
should_enable_private_aks_cluster = false
should_enable_public_network_access = true

// Single GPU pool (Spot A10)
node_pools = {
  gpu = {
    vm_size                    = "Standard_NV36ads_A10_v5"
    subnet_address_prefixes    = ["10.0.7.0/24"]
    node_taints                = ["nvidia.com/gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
    gpu_driver                 = "Install"
    priority                   = "Spot"
    should_enable_auto_scaling = true
    min_count                  = 1
    max_count                  = 1
    zones                      = []
    eviction_policy            = "Delete"
  }
}

// System node pool — enable autoscaling for OSMO workloads
should_enable_system_node_pool_auto_scaling = true
system_node_pool_min_count = 1
system_node_pool_max_count = 3

// OSMO Backend Services
should_deploy_postgresql = true
should_deploy_redis = true

[!WARNING] resource_prefix must be lowercase, alphanumeric, and short (6-8 characters recommended). It feeds into Key Vault (kv{prefix}{env}{instance}) and Storage Account names that have 24-character limits and must be globally unique.

[!TIP] For private networking, set should_enable_private_endpoint = true and should_enable_private_aks_cluster = true, then deploy the VPN from infrastructure/terraform/vpn/ before running any kubectl commands. See the Infrastructure Guide for details.
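
A quick way to catch a resource_prefix that will overflow the 24-character limit before running Terraform; a minimal sketch of the Key Vault name check described in the warning above (Storage Account names follow the same constraint):

prefix="yourprefix"; env="dev"; instance="001"
kv_name="kv${prefix}${env}${instance}"
echo "$kv_name (${#kv_name} characters)"
[ "${#kv_name}" -le 24 ] || echo "WARNING: exceeds the 24-character Key Vault name limit"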

Step 4: Deploy Infrastructure

Initialize and apply the Terraform configuration. This step takes ~30-40 minutes.

terraform init
terraform plan -out=tfplan
terraform apply tfplan

Verify deployment:

terraform output

Connect to the AKS cluster:

az aks get-credentials \
  --resource-group "$(terraform output -json resource_group | jq -r '.name')" \
  --name "$(terraform output -json aks_cluster | jq -r '.name')"

Step 5: Set NGC API Key

Export your NVIDIA NGC API key for OSMO backend deployment. Obtain a key from https://ngc.nvidia.com/.

export NGC_API_KEY="<your-ngc-api-key>"
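
The backend deployment in Step 7 reads this variable, so a quick guard against an empty value can save a failed run; a minimal sketch:

if [ -z "${NGC_API_KEY:-}" ]; then echo "NGC_API_KEY is not set - export it before continuing"; fi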

Step 6: Configure AKS Cluster

Deploy GPU Operator, KAI Scheduler, and the AzureML extension. From the repository root:

cd infrastructure/setup
bash 01-deploy-robotics-charts.sh --config-preview
bash 01-deploy-robotics-charts.sh
bash 02-deploy-azureml-extension.sh --config-preview
bash 02-deploy-azureml-extension.sh

[!TIP] All setup scripts support --config-preview to print configuration and exit without changes. Run it before each real deployment to verify values.

Verify GPU operator pods:

kubectl get pods -n gpu-operator
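
Once the GPU Operator pods are ready, the GPU should appear as an allocatable resource on the node; a minimal check, assuming the default nvidia.com/gpu resource name:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'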

Step 7: Deploy OSMO Components

Deploy the OSMO control plane and backend using Access Keys authentication.

bash 03-deploy-osmo-control-plane.sh --config-preview
bash 03-deploy-osmo-control-plane.sh

[!IMPORTANT] When running from a devcontainer or codespace, the OSMO service URL is an internal load balancer IP only reachable from within the AKS VNet. Set up a port-forward before running the backend script:

kubectl port-forward svc/osmo-service -n osmo-control-plane 8080:80 &

Then pass --service-url http://localhost:8080 to both scripts 03 and 04. The scripts detect unreachable URLs and print this guidance automatically.
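
Before invoking script 04, you can confirm the forwarded port actually responds; a minimal reachability check (any HTTP status code printed means the tunnel is up):

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/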

bash 04-deploy-osmo-backend.sh --use-access-keys --service-url http://localhost:8080 --config-preview
bash 04-deploy-osmo-backend.sh --use-access-keys --service-url http://localhost:8080

Verify OSMO pods:

kubectl get pods -n osmo-control-plane
kubectl get pods -n osmo-operator

Step 8: Submit First Training Job

Submit a training job from the repository root:

bash training/rl/scripts/submit-osmo-training.sh

Scripts auto-detect configuration from Terraform outputs. Override values with CLI arguments or environment variables as needed. See Scripts Reference for all submission options.
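
As an illustration of an environment-variable override, a hedged sketch (the variable name AZURE_RESOURCE_GROUP below is a hypothetical placeholder; the names the script actually reads are listed in the Scripts Reference):

# Hypothetical override: replace AZURE_RESOURCE_GROUP with the variable name
# documented in the Scripts Reference for your submission script.
export AZURE_RESOURCE_GROUP="$(terraform -chdir=infrastructure/terraform output -json resource_group | jq -r '.name')"
bash training/rl/scripts/submit-osmo-training.sh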

Step 9: Verify Results

Confirm the training job is running:

kubectl get pods -n osmo-workflows --watch

Check OSMO training status through the OSMO web UI or query pod logs:

kubectl logs -n osmo-workflows -l app=osmo-training --tail=50
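
To block until the training pod is ready instead of watching, a minimal sketch using the same label selector as the log command above:

kubectl wait --for=condition=Ready pod -l app=osmo-training -n osmo-workflows --timeout=15m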

Cleanup

Remove OSMO Helm releases before destroying infrastructure to avoid orphaned resources:

cd infrastructure/setup
helm uninstall osmo-operator -n osmo-operator --ignore-not-found
helm uninstall service router ui -n osmo-control-plane --ignore-not-found
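
Before destroying infrastructure, a quick check that no OSMO releases remain:

helm list -A | grep -i osmo || echo "no OSMO releases remaining"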

Destroy all infrastructure when finished to stop incurring costs. From the repository root:

cd infrastructure/terraform
terraform destroy

See Cost Considerations for detailed pricing.

Next Steps

| Resource | Description |
| --- | --- |
| MLflow Integration | Track experiments with MLflow |
| Infrastructure Guide | Full deployment reference and options |
| Contributing Guide | Development workflow and code standards |