# Cluster Setup

AKS cluster configuration for robotics workloads with AzureML and NVIDIA OSMO.

> [!NOTE]
> This page is part of the deployment guide. Return there for the full deployment sequence.

## 📋 Prerequisites

- Terraform infrastructure deployed (`cd infrastructure/terraform && terraform apply`)
- VPN connected (if using the default private AKS cluster)
- Azure CLI authenticated (`az login`)
- `kubectl`, Helm 3.x, and `jq` installed
- OSMO CLI (`osmo`) for backend deployment

> [!NOTE]
> Scripts automatically install required Azure CLI extensions (`k8s-extension`, `ml`) if missing.

> [!IMPORTANT]
> The default infrastructure deploys a private AKS cluster. You must deploy the VPN Gateway and connect before running these scripts. See VPN Gateway for setup instructions. Without VPN, `kubectl` commands fail with `no such host` errors.

To skip VPN, set `should_enable_private_aks_cluster = false` in your Terraform configuration. See Network Configuration Modes.
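
For reference, the corresponding entry in `terraform.tfvars` would look like this (the variable name is taken from the text above; place it wherever your module expects top-level variables):

```hcl
# terraform.tfvars — expose the AKS API server publicly; no VPN required
should_enable_private_aks_cluster = false
```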

## Azure RBAC Permissions

| Role | Scope | Purpose |
|------|-------|---------|
| Azure Kubernetes Service Cluster User Role | AKS cluster | Get cluster credentials |
| Contributor | Resource group | Extension and FIC creation |
| Key Vault Secrets User | Key Vault | Read PostgreSQL/Redis credentials |
| Storage Blob Data Contributor | Storage account | Create workflow containers |
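
The role assignments above can be sketched as `az role assignment create` calls. Everything below is a placeholder-driven sketch (the principal object ID and the scope resource names must be replaced with your own), and the commands are printed rather than executed so they can be reviewed first:

```shell
# Placeholder: object ID of the user or service principal doing the deployment
PRINCIPAL="00000000-0000-0000-0000-000000000000"

# Placeholder scopes — substitute your subscription, group, and resource names
RG_SCOPE="/subscriptions/SUB_ID/resourceGroups/RG_NAME"
AKS_SCOPE="$RG_SCOPE/providers/Microsoft.ContainerService/managedClusters/AKS_NAME"
KV_SCOPE="$RG_SCOPE/providers/Microsoft.KeyVault/vaults/KV_NAME"
ST_SCOPE="$RG_SCOPE/providers/Microsoft.Storage/storageAccounts/ST_NAME"

# Print each assignment command for review instead of running it
assign() { echo "az role assignment create --assignee $PRINCIPAL --role '$1' --scope $2"; }
assign "Azure Kubernetes Service Cluster User Role" "$AKS_SCOPE"
assign "Contributor"                                "$RG_SCOPE"
assign "Key Vault Secrets User"                     "$KV_SCOPE"
assign "Storage Blob Data Contributor"              "$ST_SCOPE"
```

Pipe the output through `bash` once the scopes are filled in, or run each line by hand.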

## 🚀 Quick Start

```bash
# Connect to the cluster (values from terraform output)
az aks get-credentials --resource-group <rg> --name <aks>

# Verify connectivity (requires VPN for private clusters)
kubectl cluster-info
# Expected: Kubernetes control plane is running at https://...
# If you see "no such host" errors, connect to the VPN first

# Deploy GPU infrastructure (required for all paths)
./01-deploy-robotics-charts.sh

# Choose your path:
# - AzureML: ./02-deploy-azureml-extension.sh
# - OSMO:    ./03-deploy-osmo-control-plane.sh && ./04-deploy-osmo-backend.sh
```

## 🔐 Deployment Scenarios

Three authentication and registry configurations are supported. Choose based on your security requirements.

### Scenario 1: Access Keys

The simplest setup, using storage account keys and the public NVIDIA registry.

```hcl
# terraform.tfvars
osmo_config = {
  should_enable_identity   = false
  should_federate_identity = false
  control_plane_namespace  = "osmo-control-plane"
  operator_namespace       = "osmo-operator"
  workflows_namespace      = "osmo-workflows"
}
```

```bash
./01-deploy-robotics-charts.sh
./02-deploy-azureml-extension.sh
./03-deploy-osmo-control-plane.sh
./04-deploy-osmo-backend.sh --use-access-keys
```

### Scenario 2: Workload Identity

Secure, keyless authentication via Azure Workload Identity.

```hcl
# terraform.tfvars
osmo_config = {
  should_enable_identity   = true
  should_federate_identity = true
  control_plane_namespace  = "osmo-control-plane"
  operator_namespace       = "osmo-operator"
  workflows_namespace      = "osmo-workflows"
}
```

```bash
./01-deploy-robotics-charts.sh
./02-deploy-azureml-extension.sh
./03-deploy-osmo-control-plane.sh
./04-deploy-osmo-backend.sh
```

Scripts auto-detect the OSMO managed identity from Terraform outputs and configure ServiceAccount annotations.
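
For reference, a ServiceAccount annotated this way follows the standard Azure Workload Identity convention. The name and namespace below assume the defaults from the tfvars above; the client ID is a placeholder:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: osmo-control-plane              # assumed ServiceAccount name
  namespace: osmo-control-plane
  annotations:
    # Client ID of the OSMO managed identity (placeholder value)
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
```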

### Scenario 3: Workload Identity + Private ACR (Air-Gapped)

Enterprise deployment using a private Azure Container Registry.

Prerequisite: import the OSMO images and Helm charts into ACR before deployment.

```bash
# Get the ACR name and log in
cd ../001-iac
ACR_NAME=$(terraform output -json container_registry | jq -r '.value.name')
az acr login --name "$ACR_NAME"

# Set versions
OSMO_VERSION="${OSMO_VERSION:-6.0.0}"
CHART_VERSION="${CHART_VERSION:-1.0.0}"

# Import container images
OSMO_IMAGES=(
  service router web-ui worker logger agent
  backend-listener backend-worker client
  delayed-job-monitor init-container
)
for img in "${OSMO_IMAGES[@]}"; do
  az acr import --name "$ACR_NAME" \
    --source "nvcr.io/nvidia/osmo/${img}:${OSMO_VERSION}" \
    --image "osmo/${img}:${OSMO_VERSION}"
done

# Import Helm charts
for chart in osmo router ui backend-operator; do
  helm pull "oci://nvcr.io/nvidia/osmo/${chart}" --version "$CHART_VERSION"
  helm push "${chart}-${CHART_VERSION}.tgz" "oci://${ACR_NAME}.azurecr.io/helm"
  rm "${chart}-${CHART_VERSION}.tgz"
done

# Deploy
cd ../002-setup
./01-deploy-robotics-charts.sh
./02-deploy-azureml-extension.sh
./03-deploy-osmo-control-plane.sh --use-acr
./04-deploy-osmo-backend.sh --use-acr
```
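
After the import loops finish, it is worth spot-checking that the repositories actually landed in the registry. This is a hedged sketch: it assumes `$ACR_NAME` is still set from the import step, and falls back to a notice when the `az` CLI is not on `PATH`:

```shell
# Spot-check the imported osmo/* repositories in the registry
ACR_NAME="${ACR_NAME:-myregistry}"   # placeholder default if run standalone
if command -v az >/dev/null 2>&1; then
  az acr repository list --name "$ACR_NAME" --output tsv | grep '^osmo/'
else
  echo "az CLI not found; skipping repository check for $ACR_NAME"
fi
```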

### Scenario Comparison

| | Access Keys | Workload Identity | Workload Identity + ACR |
|---|---|---|---|
| Storage auth | Access keys | Workload identity | Workload identity |
| Registry | nvcr.io | nvcr.io | Private ACR |
| Air-gap | ❌ | ❌ | ✅ |

## 🔒 Security Considerations

When deploying with `should_enable_private_endpoint = false`, cluster endpoints are publicly accessible. Secure the following components:

### AzureML Extension

The AzureML inference router (`azureml-fe`) handles incoming requests. For public deployments:

- Enable HTTPS with TLS certificates (`allowInsecureConnections=False`)
- Configure `sslSecret` or provide certificate files
- Consider using `internalLoadBalancerProvider=azure` for internal-only access

See Secure Kubernetes online endpoints and Inference routing configuration.
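
As a hedged sketch of the first two points (parameter names follow the AzureML extension documentation; `<rg>`/`<aks>`, the CNAME, and the certificate paths are placeholders), TLS can be supplied when updating the extension:

```bash
az k8s-extension update --name azureml \
  --cluster-type managedClusters \
  --cluster-name <aks> --resource-group <rg> \
  --config allowInsecureConnections=False sslCname=ml.example.com \
  --config-protected sslCertPemFile=./cert.pem sslKeyPemFile=./key.pem
```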

### OSMO UI

The OSMO web interface requires authentication for public access:

- Enable Keycloak for user authentication and authorization
- Configure OIDC integration with Azure AD or another identity provider

See OSMO Keycloak configuration.

## 📜 Scripts

| Script | Purpose |
|---|---|
| `01-deploy-robotics-charts.sh` | GPU Operator, KAI Scheduler |
| `02-deploy-azureml-extension.sh` | AzureML K8s extension, compute attach |
| `03-deploy-osmo-control-plane.sh` | OSMO service, router, web UI |
| `04-deploy-osmo-backend.sh` | Backend operator, workflow storage |

### Script Flags

| Flag | Scripts | Description |
|---|---|---|
| `--use-access-keys` | `04-deploy-osmo-backend.sh` | Use storage account keys instead of workload identity |
| `--use-acr` | `03-deploy-osmo-control-plane.sh`, `04-deploy-osmo-backend.sh` | Pull from the Terraform-deployed ACR |
| `--acr-name NAME` | `03-deploy-osmo-control-plane.sh`, `04-deploy-osmo-backend.sh` | Specify an alternate ACR |
| `--config-preview` | All | Print the resolved configuration and exit |

## ⚙️ Configuration

Scripts read from Terraform outputs in `infrastructure/terraform/`. Override with environment variables:

| Variable | Description |
|---|---|
| `AZURE_SUBSCRIPTION_ID` | Azure subscription ID |
| `AZURE_RESOURCE_GROUP` | Resource group name |
| `AKS_CLUSTER_NAME` | AKS cluster name |
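
For example, to target an explicit cluster for a single run (all values below are placeholders):

```shell
# Point the scripts at a specific cluster instead of Terraform-derived defaults
export AZURE_SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
export AZURE_RESOURCE_GROUP="rg-robotics-dev"
export AKS_CLUSTER_NAME="aks-robotics-dev"
# then run, for example: ./01-deploy-robotics-charts.sh --config-preview
```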

## ✅ Verification

```bash
# Check pods
kubectl get pods -n gpu-operator
kubectl get pods -n azureml
kubectl get pods -n osmo-control-plane
kubectl get pods -n osmo-operator

# Workload identity (if enabled)
kubectl get sa -n osmo-control-plane osmo-control-plane -o yaml | grep azure.workload.identity
```
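
Beyond listing pods, a hedged readiness check can block until everything is up. The namespace names come from the checks above; the `command -v` guard just keeps the snippet runnable where `kubectl` is absent:

```shell
# Wait for every pod in each deployed namespace to become Ready (or time out)
for ns in gpu-operator azureml osmo-control-plane osmo-operator; do
  if command -v kubectl >/dev/null 2>&1; then
    kubectl wait --for=condition=Ready pod --all -n "$ns" --timeout=300s
  else
    echo "kubectl not found; skipping readiness check for $ns"
  fi
done
```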
