Find the symptom you are experiencing, then follow the resolution steps. Start with the quick diagnostics checklist to narrow down the failure category.
| Check | Command | Expected |
|---|---|---|
| Cluster reachable | `kubectl get nodes` | Node list returned |
| GPU available | `kubectl describe node \| grep nvidia.com/gpu` | GPU count > 0 |
| AzureML extension | `kubectl get pods -n azureml` | All pods Running |
| OSMO control plane | `kubectl get pods -n osmo-control-plane` | All pods Running |
| VPN connected | `ping <private-endpoint-ip>` | Response received |
Cause: The default deployment creates a private AKS cluster. The API server is not reachable without an active VPN connection.
Resolution:
Connect to the VPN, then verify connectivity with `kubectl get nodes`.

Cause: Private DNS zones are not linked to the VPN virtual network, or the client DNS resolver is not forwarding to Azure DNS.
Resolution:
Verify the zone link with `az network private-dns link vnet list --zone-name <zone> -g <rg>`, and create the link if it is missing.

Cause: Azure RBAC role assignment is missing or the kubeconfig token has expired.
Resolution:
Refresh credentials with `az aks get-credentials --resource-group <rg> --name <cluster> --overwrite-existing`, and confirm your identity has the Azure Kubernetes Service Cluster User Role on the cluster resource.

Cause: The DNS zone for the OSMO service URL is not configured, or the ingress controller internal load balancer has no IP assigned.
Resolution:
Check that the internal load balancer has an IP with `kubectl get svc -n osmo-control-plane`, and confirm a DNS record exists for the OSMO service URL.

Cause: The MIG strategy is set to `none` instead of `single`. Azure vGPU hosts enable MIG, and `strategy: none` causes `NVIDIA_VISIBLE_DEVICES` to receive bare GPU UUIDs instead of MIG device UUIDs.
Resolution:
Set `mig.strategy: single` in the GPU Operator Helm values for RTX PRO 6000 node pools. See GPU Configuration for node-specific settings.
> [!WARNING]
> RTX PRO 6000 nodes require `mig.strategy: single`. Using `none` causes all GPU workloads on these nodes to fail with `CUDA_ERROR_NO_DEVICE`.
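A minimal sketch of the corresponding Helm values fragment. The key path follows the GPU Operator chart's `mig` block; only the `mig.strategy: single` setting is prescribed by this guide:

```yaml
# GPU Operator Helm values for RTX PRO 6000 node pools
mig:
  strategy: single   # "none" breaks MIG device enumeration on Azure vGPU hosts
```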
Cause: Nodes with pre-installed Azure GRID drivers (`580.105.08-grid-azure`) do not need the GPU Operator datacenter driver. Installing both causes conflicts.
Resolution:
Label GRID driver nodes with `kubectl label node <node> nvidia.com/gpu.deploy.driver=false` to prevent the GPU Operator from deploying its own driver DaemonSet on them.
Cause: The `NVIDIA_DRIVER_CAPABILITIES` environment variable is not set to `all`. Isaac Sim requires the Vulkan capability for rendering.
Resolution:
Set `NVIDIA_DRIVER_CAPABILITIES=all` in the job environment variables. This is required for all Isaac Sim workloads regardless of GPU type.
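A sketch of the setting in the job definition, using the same `environment_variables` mapping shown elsewhere in this guide:

```yaml
environment_variables:
  NVIDIA_DRIVER_CAPABILITIES: all   # enables Vulkan, required by Isaac Sim rendering
```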
Cause: The container runtime is not configured with the NVIDIA runtime class, or GPU resource requests are missing from the pod spec.
Resolution:
1. Add a GPU request to the pod spec: `resources.limits: nvidia.com/gpu: 1`.
2. Verify the GPU Operator pods are healthy: `kubectl get pods -n gpu-operator`.
3. Confirm the node advertises GPU capacity: `kubectl describe node <node> | grep nvidia.com/gpu`.

Cause: The GPU Operator installed a driver version incompatible with the CUDA toolkit version in the container image.
Resolution:
Check the installed driver version with `nvidia-smi` on the node, and use a container image whose CUDA toolkit version that driver supports.

Cause: Required Azure resource providers are not registered on the subscription.
Resolution:
Run `source infrastructure/terraform/prerequisites/az-sub-init.sh` to register all required providers. The script reads from `infrastructure/terraform/prerequisites/robotics-azure-resource-providers.txt`.
Cause: The `ARM_SUBSCRIPTION_ID` environment variable is not set.
Resolution:
Run `source infrastructure/terraform/prerequisites/az-sub-init.sh` before any `terraform` commands. This script exports `ARM_SUBSCRIPTION_ID` and validates Azure CLI authentication.
Cause: The VPN is not connected, or the deploy scripts are running before the VPN Gateway deployment completes.
Resolution:
Complete the VPN Gateway deployment in `infrastructure/terraform/vpn/` and connect to the VPN before running `01-deploy-robotics-charts.sh` through `04-deploy-osmo-backend.sh`.

Cause: Identity or RBAC misconfiguration for the AzureML managed identity, or resource quota exceeded.
Resolution:
1. Inspect the failing pod logs: `kubectl logs <pod> -n azureml`.
2. Verify role assignments for the `azureml:default` and `azureml:training` service accounts.
3. Check regional quota: `az vm list-usage --location <region> -o table`.

Cause: `oauth2Proxy.enabled` is set to `true` but no OIDC provider is configured.
Resolution:
Set `oauth2Proxy.enabled: false` in the OSMO Helm values when no OIDC provider is available. See `infrastructure/setup/04-deploy-osmo-backend.sh` for the configuration.
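The Helm values override is a two-line fragment; only the `oauth2Proxy.enabled` key is named in this guide:

```yaml
oauth2Proxy:
  enabled: false   # no OIDC provider configured
```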
Cause: Subscription-level resource group limit or regional capacity constraints.
Resolution:
Check available regions and quota with `az account list-locations` and `az vm list-usage --location <region>`, then deploy to a region with capacity or request a quota increase.

Cause: The environment variables `ACCEPT_EULA` and `PRIVACY_CONSENT` are not set to `Y`.
Resolution:
Add both variables to the job definition:

```yaml
environment_variables:
  ACCEPT_EULA: "Y"
  PRIVACY_CONSENT: "Y"
```
Cause: Workload identity auth failure in the data-capability sidecar when using `ro_mount` mode.
Resolution:
Switch model validation mode from `ro_mount` to `download` in the AzureML job YAML. This is a known workaround for workload identity compatibility.
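A sketch of the relevant input stanza in the AzureML job YAML. The input name `model` and the placeholder path are illustrative; only the `mode` value is prescribed here:

```yaml
inputs:
  model:
    type: uri_folder
    path: <model-asset-uri>
    mode: download   # was: ro_mount
```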
Cause: numpy 2.x is installed but Isaac Sim 4.x requires numpy < 2.0.0 for ABI compatibility with its bundled libraries.
Resolution:
The `train.sh` script pins numpy to `>=1.26.0,<2.0.0`. Verify this pin is present. If using a custom entrypoint, add:

```sh
uv pip install "numpy>=1.26.0,<2.0.0"
```
Cause: Isaac Sim 4.x hangs after `env.close()` on vGPU nodes due to a shutdown bug.
Resolution:
Use `simulation_shutdown.py`, which stops the simulation timeline and applies a SIGKILL watchdog to force process termination.
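The watchdog half of that pattern can be sketched as follows. This is a minimal illustration of the approach, not the repository's implementation; the function name and timeout value are assumptions, and `simulation_shutdown.py` remains the authoritative version:

```python
import os
import signal
import threading

def install_shutdown_watchdog(timeout_s: float = 30.0) -> threading.Timer:
    """Arm a timer that SIGKILLs this process if clean shutdown stalls.

    SIGKILL cannot be caught or ignored, so even a renderer thread hung
    inside env.close() cannot keep the process alive past the deadline.
    """
    def _force_exit() -> None:
        os.kill(os.getpid(), signal.SIGKILL)

    timer = threading.Timer(timeout_s, _force_exit)
    timer.daemon = True  # the watchdog itself never keeps the process alive
    timer.start()
    return timer

# Usage sketch:
#   watchdog = install_shutdown_watchdog(30.0)
#   env.close()        # may hang on vGPU nodes
#   watchdog.cancel()  # reached only if close() returned in time
```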
Cause: The `TRAINING_CHECKPOINT_OUTPUT` environment variable is not set or points to a non-existent directory.
Resolution:
Set `TRAINING_CHECKPOINT_OUTPUT` to an existing, writable directory in the job environment variables.
Cause: Base64-encoded archive exceeds the ~1 MB payload limit.
Resolution:
Switch from inline payload to dataset folder injection. Upload files as an OSMO dataset and reference the dataset folder name in the workflow environment variables.
Cause: OSMO uses Jinja templates (`{{ variable }}`). Helm Go template syntax (e.g. `{{ .Values.name }}`) causes parse errors.
Resolution:
Convert all template expressions to Jinja syntax. For variable substitution, use `{{ variable }}` patterns.
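For example, in a workflow YAML (the key and variable names are illustrative):

```yaml
# Helm Go template syntax -- fails OSMO's Jinja parser:
image: "{{ .Values.image }}"
# Jinja syntax -- parsed by OSMO:
image: "{{ image }}"
```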
Cause: Coscheduling (gang-scheduling) requirements are not met. Either insufficient GPU resources or the PodGroup configuration is missing.
Resolution:
Verify available GPU capacity with `kubectl describe nodes | grep nvidia.com/gpu`, and confirm the PodGroup does not request more GPUs than the cluster can supply at once.

Cause: The dataset folder name in the workflow YAML does not match the registered dataset name, or the dataset version is not published.
Resolution:
List registered datasets with `osmo config list DATASET`, and confirm the name and published version match the workflow YAML.

Cause: The `osmo-workflows` namespace lacks resource quota or node affinity rules prevent scheduling.
Resolution:
1. Inspect scheduling events: `kubectl describe pod <pod> -n osmo-workflows`.
2. Check quota: `kubectl get resourcequota -n osmo-workflows`.