Troubleshooting Guide
Find the symptom you are experiencing, then follow the resolution steps. Start with the quick diagnostics checklist to narrow down the failure category.
Quick Diagnostics Checklist
| Check | Command | Expected |
|---|---|---|
| Cluster reachable | kubectl get nodes | Node list returned |
| GPU available | kubectl describe node \| grep nvidia.com/gpu | GPU count > 0 |
| AzureML extension | kubectl get pods -n azureml | All pods Running |
| OSMO control plane | kubectl get pods -n osmo-control-plane | All pods Running |
| VPN connected | ping <private-endpoint-ip> | Response received |
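The two checks that need interpretation (GPU count and pod status) can be scripted. The following is a minimal sketch, assuming the standard text layout of kubectl describe node and kubectl get pods; the function names are hypothetical and not part of this repository.

```python
import re


def gpu_counts(describe_output: str) -> list[int]:
    """Return every nvidia.com/gpu count listed under Capacity/Allocatable."""
    return [int(n) for n in re.findall(r"nvidia\.com/gpu:\s*(\d+)", describe_output)]


def pods_not_running(get_pods_output: str) -> list[str]:
    """Return names of pods whose STATUS column (third field) is not Running."""
    bad = []
    for line in get_pods_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[2] != "Running":
            bad.append(fields[0])
    return bad
```

Feed each function the captured command output, for example from subprocess.run(..., capture_output=True, text=True).stdout. Pods that finished as Completed are also reported, so review the list rather than treating any entry as a failure.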
Connection Errors
kubectl commands hang or return "Unable to connect to the server"
Cause: The default deployment creates a private AKS cluster. The API server is not reachable without an active VPN connection.
Resolution:
- Verify VPN connection status in the Azure portal under the VPN Gateway resource.
- Reconnect using the VPN client profile downloaded during VPN setup.
- Confirm connectivity with kubectl get nodes.
DNS resolution fails for private endpoints
Cause: Private DNS zones are not linked to the VPN virtual network, or the client DNS resolver is not forwarding to Azure DNS.
Resolution:
- Verify the private DNS zone link exists: az network private-dns link vnet list --zone-name <zone> -g <rg>.
- On the client machine, flush the DNS cache and retry.
- For persistent failures, add manual host entries from the private endpoint IP addresses.
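To confirm that a private endpoint name resolves to a private address from the client machine, a small resolver check can help. This is a sketch using only the Python standard library; the function names are hypothetical, and the hostname you pass in is whatever private endpoint FQDN your deployment uses.

```python
import ipaddress
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the distinct IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def resolves_privately(addresses: list[str]) -> bool:
    """True when at least one address was returned and none of them is public.

    ipaddress.is_private covers the RFC 1918 ranges as well as loopback and
    link-local addresses.
    """
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )
```

If resolves_privately(resolve_ipv4("<private-endpoint-fqdn>")) is False, the client is most likely receiving the public record, which points at the missing zone link or resolver forwarding described above.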
kubectl returns "Unauthorized" or "Forbidden"
Cause: Azure RBAC role assignment is missing or the kubeconfig token has expired.
Resolution:
- Refresh credentials: az aks get-credentials --resource-group <rg> --name <cluster> --overwrite-existing.
- Verify your Azure AD identity has the Azure Kubernetes Service Cluster User Role on the cluster resource.
OSMO UI not reachable at expected URL
Cause: DNS zone for the OSMO service URL is not configured, or the ingress controller internal load balancer has no IP assigned.
Resolution:
- Check the ingress service IP: kubectl get svc -n osmo-control-plane.
- Verify DNS records in the private DNS zone match the load balancer IP.
- See Private DNS for DNS zone deployment.
GPU and CUDA Errors
CUDA_ERROR_NO_DEVICE on RTX PRO 6000 nodes
Cause: MIG strategy is set to none instead of single. Azure vGPU hosts enable MIG, and strategy: none causes NVIDIA_VISIBLE_DEVICES to receive bare GPU UUIDs instead of MIG device UUIDs.
Resolution:
Set mig.strategy: single in the GPU Operator Helm values for RTX PRO 6000 node pools. See GPU Configuration for node-specific settings.
[!WARNING] RTX PRO 6000 nodes require mig.strategy: single. Using none causes all GPU workloads on these nodes to fail with CUDA_ERROR_NO_DEVICE.
GPU Operator attempts to install drivers on GRID driver nodes
Cause: Nodes with pre-installed Azure GRID drivers (580.105.08-grid-azure) do not need the GPU Operator datacenter driver. Installing both causes conflicts.
Resolution:
Label GRID driver nodes with nvidia.com/gpu.deploy.driver=false to prevent the GPU Operator from deploying its own driver DaemonSet.
Vulkan initialization fails in Isaac Sim containers
Cause: The NVIDIA_DRIVER_CAPABILITIES environment variable is not set to all. Isaac Sim requires Vulkan capability for rendering.
Resolution:
Set NVIDIA_DRIVER_CAPABILITIES=all in the job environment variables. This is required for all Isaac Sim workloads regardless of GPU type.
nvidia-smi shows no GPUs inside the container
Cause: The container runtime is not configured with the NVIDIA runtime class, or GPU resource requests are missing from the pod spec.
Resolution:
- Verify the pod spec includes resources.limits: nvidia.com/gpu: 1.
- Confirm the NVIDIA device plugin is running: kubectl get pods -n gpu-operator.
- Check node allocatable GPU count: kubectl describe node <node> | grep nvidia.com/gpu.
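The first check can be automated against the JSON form of a pod (kubectl get pod <pod> -o json). A minimal sketch, assuming the standard pod schema; gpu_limits is a hypothetical helper name.

```python
def gpu_limits(pod: dict) -> dict[str, int]:
    """Map each container name to its nvidia.com/gpu limit (0 when unset)."""
    result = {}
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        # The API may return the quantity as a string ("1") or an integer.
        result[container["name"]] = int(limits.get("nvidia.com/gpu", 0))
    return result
```

A training container that maps to 0 has no GPU request, so the device plugin never exposes a device to it.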
Driver version mismatch between host and container
Cause: The GPU Operator installed a driver version incompatible with the CUDA toolkit version in the container image.
Resolution:
- Check the host driver version: nvidia-smi on the node.
- Verify compatibility with the CUDA compatibility matrix.
- Pin the GPU Operator driver version to match container requirements in the Helm values.
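Driver strings such as 580.105.08-grid-azure do not compare correctly as plain text, so a numeric comparison is safer when scripting the compatibility check. The sketch below uses hypothetical helper names. The minimum version is an input you look up in the compatibility matrix; the 525.60.13 value mentioned afterwards is an assumption for CUDA 12.x on Linux that should be confirmed there.

```python
def parse_driver_version(version: str) -> tuple[int, ...]:
    """Turn '580.105.08-grid-azure' into (580, 105, 8) for ordered comparison."""
    numeric = version.split("-", 1)[0]  # drop vendor suffixes such as -grid-azure
    return tuple(int(part) for part in numeric.split("."))


def driver_satisfies(host_version: str, minimum: str) -> bool:
    """True when the host driver is at least the required minimum."""
    return parse_driver_version(host_version) >= parse_driver_version(minimum)
```

For example, driver_satisfies("580.105.08-grid-azure", "525.60.13") evaluates to True under that assumed minimum.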
Deployment Failures
Terraform provider registration fails
Cause: Required Azure resource providers are not registered on the subscription.
Resolution:
Run source infrastructure/terraform/prerequisites/az-sub-init.sh to register all required providers. The script reads from infrastructure/terraform/prerequisites/robotics-azure-resource-providers.txt.
Terraform plan fails with "subscription not configured"
Cause: The ARM_SUBSCRIPTION_ID environment variable is not set.
Resolution:
Run source infrastructure/terraform/prerequisites/az-sub-init.sh before any terraform commands. This script exports ARM_SUBSCRIPTION_ID and validates Azure CLI authentication.
Helm chart installation fails with connection refused
Cause: The VPN is not connected, or the deploy scripts are running before the VPN Gateway deployment completes.
Resolution:
- Complete VPN deployment: infrastructure/terraform/vpn/.
- Connect the VPN client.
- Re-run deploy scripts in order: 01-deploy-robotics-charts.sh through 04-deploy-osmo-backend.sh.
AzureML extension pods stuck in CrashLoopBackOff
Cause: Identity or RBAC misconfiguration for the AzureML managed identity, or resource quota exceeded.
Resolution:
- Check pod logs: kubectl logs <pod> -n azureml.
- Verify the managed identity has federated credentials for the azureml:default and azureml:training service accounts.
- Check subscription quota: az vm list-usage --location <region> -o table.
OSMO backend deployment returns oauth2Proxy errors
Cause: oauth2Proxy.enabled is set to true but no OIDC provider is configured.
Resolution:
Set oauth2Proxy.enabled: false in the OSMO Helm values when no OIDC provider is available. See infrastructure/setup/04-deploy-osmo-backend.sh for the configuration.
Resource group creation fails with quota errors
Cause: Subscription-level resource group limit or regional capacity constraints.
Resolution:
- Check current limits: az account list-locations and az vm list-usage --location <region>.
- Request quota increases through the Azure portal for the target region.
Training and Inference Errors
Isaac Sim job fails with EULA not accepted
Cause: The environment variables ACCEPT_EULA and PRIVACY_CONSENT are not set to Y.
Resolution:
Add both variables to the job definition:
environment_variables:
ACCEPT_EULA: "Y"
PRIVACY_CONSENT: "Y"
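A pre-submission check can catch a missing variable before the job reaches the cluster. This is a sketch that assumes the job definition has already been loaded into a dictionary with an environment_variables mapping, as in the snippet above; missing_env is a hypothetical helper.

```python
REQUIRED_ENV = {"ACCEPT_EULA": "Y", "PRIVACY_CONSENT": "Y"}


def missing_env(job: dict, required: dict[str, str] = REQUIRED_ENV) -> list[str]:
    """Return the required variables that are absent or set to another value."""
    env = job.get("environment_variables") or {}
    return [name for name, value in required.items() if str(env.get(name)) != value]
```

An empty list means both variables are present with the value Y.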
AzureML model download fails with authentication error
Cause: Workload identity auth failure in the data-capability sidecar when using ro_mount mode.
Resolution:
Switch model validation mode from ro_mount to download in the AzureML job YAML. This is a known workaround for workload identity compatibility.
numpy ImportError or ABI mismatch in Isaac Sim container
Cause: numpy 2.x is installed but Isaac Sim 4.x requires numpy < 2.0.0 for ABI compatibility with its bundled libraries.
Resolution:
The train.sh script pins numpy to >=1.26.0,<2.0.0. Verify this pin is present. If using a custom entrypoint, add:
uv pip install "numpy>=1.26.0,<2.0.0"
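A custom entrypoint can also fail fast when the installed numpy falls outside the pin, instead of surfacing an ABI error later. A minimal sketch; numpy_pin_ok is a hypothetical helper that looks only at the major and minor components and ignores pre-release suffixes.

```python
def numpy_pin_ok(version: str) -> bool:
    """True when version falls inside >=1.26.0,<2.0.0 (major.minor only)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 26) and major < 2
```

At startup, pass it the installed version, for example numpy_pin_ok(importlib.metadata.version("numpy")), and abort with a clear message when it returns False.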
Isaac Sim process hangs after training completes
Cause: Isaac Sim 4.x hangs after env.close() on vGPU nodes due to a shutdown bug.
Resolution:
Use simulation_shutdown.py which stops the simulation timeline and applies a SIGKILL watchdog to force process termination.
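The watchdog idea can be illustrated in a few lines. This is not the content of simulation_shutdown.py; it is a sketch of the general technique under the assumption of a Unix host, with arm_kill_watchdog as a hypothetical name.

```python
import os
import signal
import threading


def arm_kill_watchdog(timeout_s: float) -> threading.Timer:
    """Start a daemon timer that sends SIGKILL to this process after timeout_s."""
    timer = threading.Timer(
        timeout_s, os.kill, args=(os.getpid(), signal.SIGKILL)
    )
    timer.daemon = True  # the timer alone never keeps the interpreter alive
    timer.start()
    return timer
```

Arm the timer immediately before the shutdown call that may hang, and call cancel() on the returned timer if shutdown returns normally. A thread-based timer can only fire while the interpreter is able to switch threads, so treat it as a last resort rather than a guarantee.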
Checkpoint upload fails silently
Cause: The AzureML named output is wired through ${{outputs.checkpoints}} in environment_variables:, which AzureML does NOT substitute (substitution only happens in command:). The container receives the literal template string and the sync helper writes to a relative directory of that name, leaving cap/data-capability/wd/checkpoints/ empty.
Resolution:
- Read AZURE_ML_OUTPUT_CHECKPOINTS (set by the data-capability runtime) directly in the training entry point; do not indirect through a custom env var.
- Confirm the named outputs.checkpoints uri_folder is declared in the job YAML.
- Check job logs (system_logs/data_capability/data-capability.log) for uploaded N files rather than is empty. Skip uploading.
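In the training entry point, the output directory can be resolved defensively so that a literal template string fails loudly instead of creating a stray relative directory. A sketch under the assumptions stated above; resolve_output_dir is a hypothetical helper.

```python
import os


def resolve_output_dir(name: str = "AZURE_ML_OUTPUT_CHECKPOINTS", env=None) -> str:
    """Return the output path from env, rejecting unset or unsubstituted values."""
    env = os.environ if env is None else env
    path = env.get(name, "")
    if not path:
        raise RuntimeError(f"{name} is not set; is outputs.checkpoints declared?")
    if "${{" in path:
        # AzureML left the template untouched, so this is not a real path.
        raise RuntimeError(f"{name} holds an unsubstituted template: {path!r}")
    return path
```

The template check matters mainly when the function is pointed at a custom variable that was populated from environment_variables.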
Workflow Runtime Errors
OSMO workflow submission fails with payload too large
Cause: Base64-encoded archive exceeds the ~1 MB payload limit.
Resolution:
Switch from inline payload to dataset folder injection. Upload files as an OSMO dataset and reference the dataset folder name in the workflow environment variables.
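Base64 inflates data by roughly one third, so an archive well under 1 MB on disk can still exceed the limit once encoded. The sketch below estimates the encoded size before submission. The 1,000,000-byte default is an assumption standing in for the approximate limit stated above, and both function names are hypothetical.

```python
import math


def base64_size(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_inline(raw_bytes: int, limit_bytes: int = 1_000_000) -> bool:
    """True when the encoded archive stays within the assumed payload limit."""
    return base64_size(raw_bytes) <= limit_bytes
```

Under that assumed limit, archives larger than about 750 KB on disk are candidates for dataset folder injection.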
OSMO workflow YAML template rendering fails
Cause: OSMO uses Jinja templates ({{ }}). Helm Go template syntax ({{ .Values }}) causes parse errors.
Resolution:
Convert all template expressions to Jinja syntax. For variable substitution, use {{ env_var }} patterns.
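Leftover Helm expressions can be located before submission with a simple scan. This is a heuristic sketch: the pattern flags expressions that start with a dot, a dollar sign, or a few common Helm functions, and it may miss other Go template constructs. helm_style_lines is a hypothetical helper.

```python
import re

# Expressions opening with .Name, $var, or common Helm functions are Go templates.
HELM_EXPR = re.compile(r"\{\{-?\s*(\.[A-Za-z_]|\$|include\b|toYaml\b|template\b)")


def helm_style_lines(text: str) -> list[int]:
    """Return 1-based line numbers that contain a Helm-style expression."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if HELM_EXPR.search(line)
    ]
```

Lines reported by the scan are the ones to rewrite in the {{ env_var }} form.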
OSMO workflow fails during CreateGroup with Exit Code 3002
Cause: OSMO asks Kubernetes to create workflow pods during the CreateGroup phase. Kubernetes calls the KAI Scheduler binder admission webhooks before admitting those pods. If the API server cannot verify the binder webhook TLS certificate, pod admission fails before the training container starts, and OSMO reports Exit Code: 3002.
The characteristic Kubernetes error includes failed calling webhook "binder.run.ai", binder.kai-scheduler.svc, and an x509 message such as certificate signed by unknown authority or parent certificate cannot sign this kind of certificate. The unknown field "spec.env" warning can appear in the same response but is not the admission blocker.
The KAI Scheduler chart (v0.5.5) refreshes binder-webhook-tls-secret on every install but does not reliably keep the MutatingWebhookConfiguration / ValidatingWebhookConfiguration caBundle in sync with the freshly minted leaf. On upgrade the API server can be left trusting a stale CA while the binder service serves a different leaf. In a degenerate case, caBundle references a self-signed cert (binder.kai-scheduler.svc-ca) that does not sign the served leaf at all.
A secondary failure mode is binder pods caching an old mounted certificate after a chart upgrade. Tracked as #794.
Diagnostics:
Compare the leaf certificate served by the binder Secret against the caBundle advertised on the webhook configs — they must match.
kubectl -n kai-scheduler get secret binder-webhook-tls-secret \
-o jsonpath='{.data.tls\.crt}' | base64 --decode |
openssl x509 -noout -subject -fingerprint -sha256
kubectl get mutatingwebhookconfiguration kai-binder \
-o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 --decode |
openssl x509 -noout -subject -fingerprint -sha256
kubectl run kai-binder-webhook-tls-check --image=registry.k8s.io/pause:3.10 \
--restart=Never --dry-run=server -o yaml
The dry-run admission probe is the definitive test: it exercises the binder webhook path without creating a pod and surfaces the exact x509 failure the API server sees. If the two fingerprints differ, the cluster is in the drift state described above; the probe will report x509: certificate signed by unknown authority.
Resolution:
Repair both failure modes manually with the following sequence. Each step is idempotent and safe to re-run on a live cluster.
First, force binder pods to re-read the mounted certificate (sufficient when only the cached cert is stale):
kubectl -n kai-scheduler rollout restart deployment/binder
kubectl -n kai-scheduler rollout status deployment/binder --timeout=180s
Re-run the admission probe above. If the x509 error persists, the Secret and the webhook caBundle are out of sync. Sync the caBundle on both webhook configs to the leaf certificate currently in the Secret, then restart binder:
leaf_b64=$(kubectl -n kai-scheduler get secret binder-webhook-tls-secret \
-o jsonpath='{.data.tls\.crt}')
[[ -n "$leaf_b64" ]] || { echo "binder-webhook-tls-secret has no tls.crt" >&2; exit 1; }
for kind in mutatingwebhookconfiguration validatingwebhookconfiguration; do
kubectl get "$kind" kai-binder -o json |
jq --arg ca "$leaf_b64" '.webhooks |= map(.clientConfig.caBundle = $ca)' |
kubectl replace -f -
done
kubectl -n kai-scheduler rollout restart deployment/binder
kubectl -n kai-scheduler rollout status deployment/binder --timeout=180s
If the webhook configs themselves are corrupted (for example, caBundle references a self-signed cert that does not match any leaf), delete them and re-run infrastructure/setup/01-deploy-robotics-charts.sh to let the chart recreate them. The post-install sync block above then aligns the freshly minted Secret with the recreated configs:
for kind in mutatingwebhookconfiguration validatingwebhookconfiguration; do
kubectl delete "$kind" kai-binder --ignore-not-found
done
infrastructure/setup/01-deploy-robotics-charts.sh
KAI scheduler rejects multi-GPU job
Cause: Coscheduling (gang-scheduling) requirements are not met. Either insufficient GPU resources or the PodGroup configuration is missing.
Resolution:
- Verify available GPU capacity across nodes: kubectl describe nodes | grep nvidia.com/gpu.
- Confirm the KAI scheduler is installed and configured for coscheduling in the OSMO backend.
- Reduce GPU request count or wait for node autoscaling to provide capacity.
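Gang scheduling places all pods of a job at once or none of them, so total free GPUs is not sufficient: each pod must also fit on a single node. The sketch below illustrates that reasoning with a first-fit-decreasing check. It is not KAI's algorithm, gang_fits is a hypothetical name, and because first-fit is a heuristic a False result is conservative.

```python
def gang_fits(free_gpus_per_node: list[int], pod_requests: list[int]) -> bool:
    """First-fit-decreasing check that every pod can be placed simultaneously."""
    free = sorted(free_gpus_per_node, reverse=True)
    for request in sorted(pod_requests, reverse=True):
        for index, available in enumerate(free):
            if available >= request:
                free[index] -= request
                break
        else:
            return False  # one pod has no node, so the whole gang is rejected
    return True
```

For example, two nodes with 4 free GPUs each can host four 2-GPU pods, but not a single 8-GPU pod, even though the totals match.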
OSMO dataset injection fails
Cause: The dataset folder name in the workflow YAML does not match the registered dataset name, or the dataset version is not published.
Resolution:
- List available datasets: osmo config list DATASET.
- Verify the dataset name and version in the workflow environment variables match a published dataset.
LeRobot training fails with PyTorch shared memory allocation error
Cause: PyTorch DataLoader workers use /dev/shm to collate batches across worker processes. Kubernetes pods inherit the container runtime default shared-memory mount unless the pod spec overrides it, and that default is too small for image-heavy LeRobot batches. The failure occurs after OSMO creates the pod and after training starts, so it is not an OSMO CreateGroup, KAI, or webhook TLS issue.
Characteristic log line:
RuntimeError: unable to allocate shared memory(shm) for file </torch_...>: Success (0)
Resolution:
OSMO mounts /dev/shm according to the POD_TEMPLATE config registered on the control plane at deploy time, with USER_SHM_SIZE rendered from the pool config (default 8Gi). The mount applies only to pods created after the config is registered — restarting the training pod is not enough; a stuck workflow must be cancelled and resubmitted.
Verify the live cluster matches the deploy script's intent:
osmo config get POD_TEMPLATE --output yaml | grep -A 3 dshm
kubectl get pod <pod> -n osmo-workflows -o jsonpath='{.spec.volumes[?(@.name=="dshm")].emptyDir.sizeLimit}'
kubectl get pod <pod> -n osmo-workflows -o jsonpath='{.spec.containers[?(@.name!="osmo-ctrl")].volumeMounts[?(@.mountPath=="/dev/shm")].name}'
If the POD_TEMPLATE is missing the dshm mount, the registered config drifted from infrastructure/setup/config/pod-template-config.template.json; re-run infrastructure/setup/04-deploy-osmo-backend.sh to republish it. If the config is correct but a running pod lacks the mount, that pod predates the config update — cancel the workflow and submit a new one.
If new pods still exhaust /dev/shm with USER_SHM_SIZE=8Gi, raise USER_SHM_SIZE in infrastructure/setup/config/pool-config.template.json (16Gi is a safe ceiling for image-heavy ACT/Diffusion datasets), re-run 04-deploy-osmo-backend.sh, and resubmit. Reducing --batch-size or dataloader.num_workers is the workaround when raising the mount is not an option.
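A rough estimate can indicate whether a given batch configuration is likely to fit in the mount. The sketch assumes each DataLoader worker keeps prefetch_factor collated batches in shared memory (PyTorch defaults prefetch_factor to 2 when num_workers > 0). Real usage varies with the dataset and collation, so treat the result as an order-of-magnitude guide. Both function names are hypothetical.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffixed Kubernetes quantity such as '8Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def estimate_shm_bytes(
    batch_size: int, bytes_per_sample: int, num_workers: int, prefetch_factor: int = 2
) -> int:
    """Rough bound: every worker holds prefetch_factor batches in /dev/shm."""
    return batch_size * bytes_per_sample * num_workers * prefetch_factor
```

As an illustration, two 480x640 RGB float32 camera frames per sample is 7,372,800 bytes. With batch size 64 and 8 workers the estimate stays below 8Gi, while 16 workers exceeds it.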
OSMO workflow pods stuck in Pending
Cause: The osmo-workflows namespace has insufficient resource quota, or node taints or affinity rules prevent scheduling.
Resolution:
- Check pod events: kubectl describe pod <pod> -n osmo-workflows.
- Verify node taints and tolerations match the pod spec.
- Check namespace resource quotas: kubectl get resourcequota -n osmo-workflows.
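The taint and toleration comparison follows a fixed Kubernetes rule, so it can be checked mechanically from the node's spec.taints and the pod's spec.tolerations. A sketch with hypothetical helper names; only the NoSchedule and NoExecute effects are treated as blocking, since PreferNoSchedule is a soft preference.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Kubernetes matching rule for one toleration against one taint."""
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        return operator == "Exists"  # empty key with Exists tolerates every taint
    if key != taint.get("key"):
        return False
    return operator == "Exists" or toleration.get("value", "") == taint.get("value", "")


def untolerated(taints: list[dict], tolerations: list[dict]) -> list[dict]:
    """Return the scheduling-blocking taints that no toleration covers."""
    blocking = ("NoSchedule", "NoExecute")
    return [
        taint
        for taint in taints
        if taint.get("effect") in blocking
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]
```

Any taint returned by untolerated explains a Pending pod on that node independently of quota.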
OSMO workflow completes task but workflow status stays RUNNING or fails
Cause: The service_base_url field in the OSMO SERVICE config is empty or points to a service that does not route /api/logger paths. The osmo-ctrl sidecar in workflow pods uses this URL to connect via WebSocket to the logger service and to refresh auth tokens. When misconfigured, the sidecar logs websocket: bad handshake or connection refused errors, preventing the workflow from reporting completion.
Resolution:
- Inspect the osmo-ctrl sidecar arguments on a workflow pod:
kubectl get pod <pod> -n osmo-workflows -o json | \
python3 -c "import sys,json; [print(a) for c in json.load(sys.stdin)['spec']['containers'] if c['name']=='osmo-ctrl' for a in c.get('args',[])]"
- If -host is empty (""), the service_base_url in the SERVICE config is not set:
osmo config show SERVICE | grep service_base_url
- Set service_base_url to the AzureML ingress controller ClusterIP FQDN:
osmo config show SERVICE | python3 -c "
import sys, json
c = json.load(sys.stdin)
c['service_base_url'] = 'http://azureml-ingress-nginx-controller.azureml.svc.cluster.local'
json.dump(c, sys.stdout, indent=2)" > /tmp/service-config.json
osmo config update SERVICE --file /tmp/service-config.json --description "Set service base URL"
- Cancel and resubmit the workflow. New pods pick up the updated service_base_url.
[!WARNING] Do not set service_base_url to osmo-router. The router only handles /api/router/ paths. The ingress controller routes all API paths (/api/logger, /api/agent, /api/auth, /api) to the correct backend services.
OSMO UI shows "server IP address could not be found" for workflow logs
Cause: The service_base_url is set to an in-cluster FQDN (e.g., azureml-ingress-nginx-controller.azureml.svc.cluster.local) that the browser cannot resolve. The OSMO UI constructs workflow overview URLs from this value.
Resolution:
Connect via VPN and set service_base_url to the internal load balancer IP, which is reachable from both in-cluster pods and the VPN-connected browser:
# Find the internal LB IP
kubectl get svc azureml-ingress-nginx-internal-lb -n azureml \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# Update service_base_url to the internal LB IP
osmo config show SERVICE | python3 -c "
import sys, json
c = json.load(sys.stdin)
c['service_base_url'] = 'http://10.0.5.6' # replace with actual IP
json.dump(c, sys.stdout, indent=2)" > /tmp/service-config.json
osmo config update SERVICE --file /tmp/service-config.json --description "Set service base URL to internal LB"
Without VPN, use the CLI to view workflow logs: osmo workflow logs <workflow-id>.
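The service_base_url pitfalls described in this guide (empty value, osmo-router target, in-cluster FQDN that a browser cannot resolve) can be checked before updating the SERVICE config. A sketch using only the standard library; check_service_base_url is a hypothetical helper and the messages are illustrative.

```python
import ipaddress
from urllib.parse import urlparse


def check_service_base_url(url: str) -> list[str]:
    """Return a list of known problems with a candidate service_base_url."""
    host = urlparse(url).hostname or ""
    if not host:
        return ["empty or unparsable: osmo-ctrl would receive an empty -host"]
    problems = []
    if "osmo-router" in host:
        problems.append("points at osmo-router, which only serves /api/router/")
    try:
        ipaddress.ip_address(host)  # a literal IP needs no DNS resolution
    except ValueError:
        if host.endswith(".svc.cluster.local"):
            problems.append("in-cluster FQDN: reachable from pods, not from a browser")
    return problems
```

An empty result for a value such as the internal load balancer IP means none of the known misconfigurations apply. It does not prove that the address is reachable.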
Additional Resources
- GPU Configuration
- Security Guide
- Deployment Validation
- NVIDIA CUDA Compatibility
- Azure AKS Troubleshooting