Troubleshooting Guide

Find the symptom you are experiencing, then follow the resolution steps. Start with the quick diagnostics checklist to narrow down the failure category.

Quick Diagnostics Checklist

Check	Command	Expected
Cluster reachable	`kubectl get nodes`	Node list returned
GPU available	`kubectl describe node \| grep nvidia.com/gpu`	GPU count > 0
AzureML extension	`kubectl get pods -n azureml`	All pods Running
OSMO control plane	`kubectl get pods -n osmo-control-plane`	All pods Running
VPN connected	`ping <private-endpoint-ip>`	Response received

Connection Errors

kubectl commands hang or return "Unable to connect to the server"

Cause: The default deployment creates a private AKS cluster. The API server is not reachable without an active VPN connection.

Resolution:

Verify VPN connection status in the Azure portal under the VPN Gateway resource.
Reconnect using the VPN client profile downloaded during VPN setup.
Confirm connectivity with kubectl get nodes.

DNS resolution fails for private endpoints

Cause: Private DNS zones are not linked to the VPN virtual network, or the client DNS resolver is not forwarding to Azure DNS.

Resolution:

Verify the private DNS zone link exists: az network private-dns link vnet list --zone-name <zone> -g <rg>.
On the client machine, flush DNS cache and retry.
For persistent failures, add manual host entries from the private endpoint IP addresses.

kubectl returns "Unauthorized" or "Forbidden"

Cause: Azure RBAC role assignment is missing or the kubeconfig token has expired.

Resolution:

Refresh credentials: az aks get-credentials --resource-group <rg> --name <cluster> --overwrite-existing.
Verify your Azure AD identity has Azure Kubernetes Service Cluster User Role on the cluster resource.

OSMO UI not reachable at expected URL

Cause: DNS zone for the OSMO service URL is not configured, or the ingress controller internal load balancer has no IP assigned.

Resolution:

Check the ingress service IP: kubectl get svc -n osmo-control-plane.
Verify DNS records in the private DNS zone match the load balancer IP.
See Private DNS for DNS zone deployment.

GPU and CUDA Errors

CUDA_ERROR_NO_DEVICE on RTX PRO 6000 nodes

Cause: MIG strategy is set to none instead of single. Azure vGPU hosts enable MIG, and strategy: none causes NVIDIA_VISIBLE_DEVICES to receive bare GPU UUIDs instead of MIG device UUIDs.

Resolution:

Set mig.strategy: single in the GPU Operator Helm values for RTX PRO 6000 node pools. See GPU Configuration for node-specific settings.

[!WARNING] RTX PRO 6000 nodes require mig.strategy: single. Using none causes all GPU workloads on these nodes to fail with CUDA_ERROR_NO_DEVICE.

GPU Operator attempts to install drivers on GRID driver nodes

Cause: Nodes with pre-installed Azure GRID drivers (580.105.08-grid-azure) do not need the GPU Operator datacenter driver. Installing both causes conflicts.

Resolution:

Label GRID driver nodes with nvidia.com/gpu.deploy.driver=false to prevent the GPU Operator from deploying its own driver DaemonSet.

Vulkan initialization fails in Isaac Sim containers

Cause: The NVIDIA_DRIVER_CAPABILITIES environment variable is not set to all. Isaac Sim requires Vulkan capability for rendering.

Resolution:

Set NVIDIA_DRIVER_CAPABILITIES=all in the job environment variables. This is required for all Isaac Sim workloads regardless of GPU type.

nvidia-smi shows no GPUs inside the container

Cause: The container runtime is not configured with the NVIDIA runtime class, or GPU resource requests are missing from the pod spec.

Resolution:

Verify the pod spec includes resources.limits: nvidia.com/gpu: 1.
Confirm the NVIDIA device plugin is running: kubectl get pods -n gpu-operator.
Check node allocatable GPU count: kubectl describe node <node> | grep nvidia.com/gpu.

Driver version mismatch between host and container

Cause: The GPU Operator installed a driver version incompatible with the CUDA toolkit version in the container image.

Resolution:

Check the host driver version: nvidia-smi on the node.
Verify compatibility with the CUDA compatibility matrix.
Pin the GPU Operator driver version to match container requirements in the Helm values.

Deployment Failures

Terraform provider registration fails

Cause: Required Azure resource providers are not registered on the subscription.

Resolution:

Run source infrastructure/terraform/prerequisites/az-sub-init.sh to register all required providers. The script reads from infrastructure/terraform/prerequisites/robotics-azure-resource-providers.txt.

Terraform plan fails with "subscription not configured"

Cause: The ARM_SUBSCRIPTION_ID environment variable is not set.

Resolution:

Run source infrastructure/terraform/prerequisites/az-sub-init.sh before any terraform commands. This script exports ARM_SUBSCRIPTION_ID and validates Azure CLI authentication.

Helm chart installation fails with connection refused

Cause: The VPN is not connected, or the deploy scripts are running before the VPN Gateway deployment completes.

Resolution:

Complete VPN deployment: infrastructure/terraform/vpn/.
Connect the VPN client.
Re-run deploy scripts in order: 01-deploy-robotics-charts.sh through 03-deploy-osmo.sh.

AzureML extension pods stuck in CrashLoopBackOff

Cause: Identity or RBAC misconfiguration for the AzureML managed identity, or resource quota exceeded.

Resolution:

Check pod logs: kubectl logs <pod> -n azureml.
Verify the managed identity has federated credentials for the azureml:default and azureml:training service accounts.
Check subscription quota: az vm list-usage --location <region> -o table.

OSMO deployment returns oauth2Proxy errors

Cause: oauth2Proxy.enabled is set to true but no OIDC provider is configured.

Resolution:

Set oauth2Proxy.enabled: false in the OSMO Helm values when no OIDC provider is available. See infrastructure/setup/03-deploy-osmo.sh for the configuration.

Resource group creation fails with quota errors

Cause: Subscription-level resource group limit or regional capacity constraints.

Resolution:

Check current limits: az account list-locations and az vm list-usage --location <region>.
Request quota increases through the Azure portal for the target region.

Training and Inference Errors

Isaac Sim job fails with EULA not accepted

Cause: The environment variables ACCEPT_EULA and PRIVACY_CONSENT are not set to Y.

Resolution:

Add both variables to the job definition:

environment_variables:
  ACCEPT_EULA: "Y"
  PRIVACY_CONSENT: "Y"

AzureML model download fails with authentication error

Cause: Workload identity auth failure in the data-capability sidecar when using ro_mount mode.

Resolution:

Switch model validation mode from ro_mount to download in the AzureML job YAML. This is a known workaround for workload identity compatibility.

numpy ImportError or ABI mismatch in Isaac Sim container

Cause: numpy 2.x is installed but Isaac Sim 4.x requires numpy < 2.0.0 for ABI compatibility with its bundled libraries.

Resolution:

The train.sh script pins numpy to >=1.26.0,<2.0.0. Verify this pin is present. If using a custom entrypoint, add:

uv pip install "numpy>=1.26.0,<2.0.0"

Isaac Sim process hangs after training completes

Cause: Isaac Sim 4.x hangs after env.close() on vGPU nodes due to a shutdown bug.

Resolution:

Use simulation_shutdown.py which stops the simulation timeline and applies a SIGKILL watchdog to force process termination.

Checkpoint upload fails silently

Cause: The AzureML named output is wired through ${{outputs.checkpoints}} in environment_variables:, which AzureML does NOT substitute (substitution only happens in command:). The container receives the literal template string and the sync helper writes to a relative directory of that name, leaving cap/data-capability/wd/checkpoints/ empty.

Resolution:

Read AZURE_ML_OUTPUT_CHECKPOINTS (set by the data-capability runtime) directly in the training entry point — do not indirect through a custom env var.
Confirm the named outputs.checkpoints uri_folder is declared in the job YAML.
Check job logs (system_logs/data_capability/data-capability.log) for uploaded N files rather than is empty. Skip uploading.

Workflow Runtime Errors

OSMO workflow code upload or download fails

Cause: The submit host or the workflow pod cannot reach the OSMO object-storage container that carries the packaged code, which the submission uploads with osmo data upload and the pod retrieves through a url: task input.

Resolution:

Ensure the submit host is authenticated (az login) and holds Storage Blob Data Contributor on the OSMO storage account. Set AZURE_STORAGE_ACCOUNT_NAME (and OSMO_WORKFLOW_BUCKET, default osmo) if they are not resolved from Terraform outputs.
Confirm access before submitting: osmo data check azure://<account>/<container> returns {"status": "pass"}.
If the pod fails to fetch the code, verify the workflow pod's workload identity has read access to the same container.

OSMO code-upload objects accumulate under `osmo-code/`

Cause: Each submission content-addresses the packaged code by a hash over its files and uploads it to osmo-code/<hash> only when that object is absent. Byte-identical code reuses the existing object, but every distinct code revision creates a new one, so the prefix grows as the code evolves and is not pruned automatically.

Resolution:

List the staged archives and delete stale ones, or apply an Azure Storage lifecycle management policy that expires blobs under the prefix:

osmo data list azure://<account>/<container>/osmo-code
osmo data delete azure://<account>/<container>/osmo-code/<hash>

OSMO workflow YAML template rendering fails

Cause: OSMO uses Jinja templates ({{ }}). Helm Go template syntax ({{ .Values }}) causes parse errors.

Resolution:

Convert all template expressions to Jinja syntax. For variable substitution, use {{ env_var }} patterns.

OSMO workflow fails during CreateGroup with Exit Code 3002

Cause: OSMO asks Kubernetes to create workflow pods during the CreateGroup phase. Kubernetes calls the KAI Scheduler binder admission webhooks before admitting those pods. If the API server cannot verify the binder webhook TLS certificate, pod admission fails before the training container starts, and OSMO reports Exit Code: 3002.

The characteristic Kubernetes error includes failed calling webhook "binder.run.ai", binder.kai-scheduler.svc, and an x509 message such as certificate signed by unknown authority or parent certificate cannot sign this kind of certificate. The unknown field "spec.env" warning can appear in the same response but is not the admission blocker.

The KAI Scheduler chart (v0.5.5) refreshes binder-webhook-tls-secret on every install but does not reliably keep the MutatingWebhookConfiguration / ValidatingWebhookConfiguration caBundle in sync with the freshly minted leaf. On upgrade the API server can be left trusting a stale CA while the binder service serves a different leaf. In a degenerate case, caBundle references a self-signed cert (binder.kai-scheduler.svc-ca) that does not sign the served leaf at all.

A secondary failure mode is binder pods caching an old mounted certificate after a chart upgrade. Tracked as #794.

Diagnostics:

Compare the leaf certificate served by the binder Secret against the caBundle advertised on the webhook configs — they must match.

kubectl -n kai-scheduler get secret binder-webhook-tls-secret \
  -o jsonpath='{.data.tls\.crt}' | base64 --decode |
  openssl x509 -noout -subject -fingerprint -sha256

kubectl get mutatingwebhookconfiguration kai-binder \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 --decode |
  openssl x509 -noout -subject -fingerprint -sha256

kubectl run kai-binder-webhook-tls-check --image=registry.k8s.io/pause:3.10 \
  --restart=Never --dry-run=server -o yaml

The dry-run admission probe is the definitive test: it exercises the binder webhook path without creating a pod and surfaces the exact x509 failure the API server sees. If the two fingerprints differ, the cluster is in the drift state described above; the probe will report x509: certificate signed by unknown authority.

Resolution:

Repair both failure modes manually with the following sequence. Each step is idempotent and safe to re-run on a live cluster.

First, force binder pods to re-read the mounted certificate (sufficient when only the cached cert is stale):

kubectl -n kai-scheduler rollout restart deployment/binder
kubectl -n kai-scheduler rollout status deployment/binder --timeout=180s

Re-run the admission probe above. If the x509 error persists, the Secret and the webhook caBundle are out of sync. Sync the caBundle on both webhook configs to the leaf certificate currently in the Secret, then restart binder:

leaf_b64=$(kubectl -n kai-scheduler get secret binder-webhook-tls-secret \
  -o jsonpath='{.data.tls\.crt}')
[[ -n "$leaf_b64" ]] || { echo "binder-webhook-tls-secret has no tls.crt" >&2; exit 1; }

for kind in mutatingwebhookconfiguration validatingwebhookconfiguration; do
  kubectl get "$kind" kai-binder -o json |
    jq --arg ca "$leaf_b64" '.webhooks |= map(.clientConfig.caBundle = $ca)' |
    kubectl replace -f -
done

kubectl -n kai-scheduler rollout restart deployment/binder
kubectl -n kai-scheduler rollout status deployment/binder --timeout=180s

If the webhook configs themselves are corrupted (for example, caBundle references a self-signed cert that does not match any leaf), delete them and re-run infrastructure/setup/01-deploy-robotics-charts.sh to let the chart recreate them. The post-install sync block above then aligns the freshly minted Secret with the recreated configs:

for kind in mutatingwebhookconfiguration validatingwebhookconfiguration; do
  kubectl delete "$kind" kai-binder --ignore-not-found
done

infrastructure/setup/01-deploy-robotics-charts.sh

KAI scheduler rejects multi-GPU job

Cause: Coscheduling (gang-scheduling) requirements are not met. Either insufficient GPU resources or the PodGroup configuration is missing.

Resolution:

Verify available GPU capacity across nodes: kubectl describe nodes | grep nvidia.com/gpu.
Confirm the KAI scheduler is installed and configured for coscheduling in the OSMO backend.
Reduce GPU request count or wait for node autoscaling to provide capacity.

OSMO dataset injection fails

Cause: The dataset folder name in the workflow YAML does not match the registered dataset name, or the dataset version is not published.

Resolution:

List available datasets: osmo dataset list.
Verify the dataset name and version in the workflow environment variables match a published dataset.

LeRobot training fails with PyTorch shared memory allocation error

Cause: PyTorch DataLoader workers use /dev/shm to collate batches across worker processes. Kubernetes pods inherit the container runtime default shared-memory mount unless the pod spec overrides it, and that default is too small for image-heavy LeRobot batches. The failure occurs after OSMO creates the pod and after training starts, so it is not an OSMO CreateGroup, KAI, or webhook TLS issue.

Characteristic log line:

RuntimeError: unable to allocate shared memory(shm) for file </torch_...>: Success (0)

Resolution:

OSMO mounts /dev/shm according to the pod template in Helm values, with USER_SHM_SIZE rendered from the platform config (default 16Gi for GPU platforms). The mount applies only to pods created after the config is deployed — restarting the training pod is not enough; a stuck workflow must be cancelled and resubmitted.

In ConfigMap mode the pod template — including USER_SHM_SIZE — is the source of truth in infrastructure/setup/values/osmo-platforms.yaml and is rendered into the deployed ConfigMap on every run of the deploy script. Inspect the live pod to verify it received the mount, and check the values file for the size it should have:

# Live pod (the actual verification)
kubectl get pod <pod> -n osmo-workflows -o jsonpath='{.spec.volumes[?(@.name=="dshm")].emptyDir.sizeLimit}'
kubectl get pod <pod> -n osmo-workflows -o jsonpath='{.spec.containers[?(@.name!="osmo-ctrl")].volumeMounts[?(@.mountPath=="/dev/shm")].name}'

# Expected size in the source values (for comparison)
grep -A 3 dshm infrastructure/setup/values/osmo-platforms.yaml

If a running pod lacks the dshm mount, either the values file was edited without rerunning infrastructure/setup/03-deploy-osmo.sh (rerun it to re-render the ConfigMap), or the pod predates the last config update — cancel the workflow and submit a new one so it picks up the current template.

If new pods still exhaust /dev/shm, raise USER_SHM_SIZE in infrastructure/setup/values/osmo-platforms.yaml (current default is 16Gi for GPU platforms), rerun 03-deploy-osmo.sh, and resubmit. Reducing --batch-size is the workaround when raising the mount is not an option.

OSMO workflow pods stuck in Pending

Cause: The osmo-workflows namespace lacks resource quota or node affinity rules prevent scheduling.

Resolution:

Check pod events: kubectl describe pod <pod> -n osmo-workflows.
Verify node taints and tolerations match the pod spec.
Check namespace resource quotas: kubectl get resourcequota -n osmo-workflows.

OSMO workflow completes task but workflow status stays RUNNING or fails

Cause: The service_base_url field in the OSMO SERVICE config is empty or points to a service that does not route /api/logger paths. The osmo-ctrl sidecar in workflow pods uses this URL to connect via WebSocket to the logger service and to refresh auth tokens. When misconfigured, the sidecar logs websocket: bad handshake or connection refused errors, preventing the workflow from reporting completion.

Resolution:

Inspect the osmo-ctrl sidecar arguments on a workflow pod:

kubectl get pod <pod> -n osmo-workflows -o json | \
  python3 -c "import sys,json; [print(a) for c in json.load(sys.stdin)['spec']['containers'] if c['name']=='osmo-ctrl' for a in c.get('args',[])]"

If -host is empty (""), service_base_url in the SERVICE config is not set. In ConfigMap mode this value is rendered from the Helm values:
```
grep service_base_url infrastructure/setup/values/osmo-control-plane.yaml
```

Set service_base_url to the AzureML ingress controller ClusterIP FQDN by editing infrastructure/setup/values/osmo-control-plane.yaml (services.configs.service.service_base_url) and rerunning the deploy script:

# infrastructure/setup/values/osmo-control-plane.yaml
services:
  configs:
    service:
      service_base_url: "http://azureml-ingress-nginx-controller.azureml.svc.cluster.local"

./infrastructure/setup/03-deploy-osmo.sh

Cancel and resubmit the workflow. New pods pick up the updated service_base_url.

[!NOTE] service_base_url points to osmo-gateway, which routes /api/logger, /api/agent, /api/auth, and related paths to the correct backend services.

OSMO UI shows "server IP address could not be found" for workflow logs

Cause: The service_base_url is set to an in-cluster FQDN (e.g., azureml-ingress-nginx-controller.azureml.svc.cluster.local) that the browser cannot resolve. The OSMO UI constructs workflow overview URLs from this value.

Resolution:

Connect via VPN and set service_base_url to the internal load balancer IP, which is reachable from both in-cluster pods and the VPN-connected browser:

# Find the internal LB IP
kubectl get svc azureml-ingress-nginx-internal-lb -n azureml \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Set that IP as services.configs.service.service_base_url in infrastructure/setup/values/osmo-control-plane.yaml, then rerun the deploy script:

# infrastructure/setup/values/osmo-control-plane.yaml
services:
  configs:
    service:
      service_base_url: "http://10.0.5.6"  # replace with actual internal LB IP

./infrastructure/setup/03-deploy-osmo.sh

Without VPN, use the CLI to view workflow logs: osmo workflow logs <workflow-id>.

Additional Resources

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.

Quick Diagnostics Checklist​

Connection Errors​

kubectl commands hang or return "Unable to connect to the server"​

DNS resolution fails for private endpoints​

kubectl returns "Unauthorized" or "Forbidden"​

OSMO UI not reachable at expected URL​

GPU and CUDA Errors​

CUDA_ERROR_NO_DEVICE on RTX PRO 6000 nodes​

GPU Operator attempts to install drivers on GRID driver nodes​

Vulkan initialization fails in Isaac Sim containers​

nvidia-smi shows no GPUs inside the container​

Driver version mismatch between host and container​

Deployment Failures​

Terraform provider registration fails​

Terraform plan fails with "subscription not configured"​

Helm chart installation fails with connection refused​

AzureML extension pods stuck in CrashLoopBackOff​

OSMO deployment returns oauth2Proxy errors​

Resource group creation fails with quota errors​

Training and Inference Errors​

Isaac Sim job fails with EULA not accepted​

AzureML model download fails with authentication error​

numpy ImportError or ABI mismatch in Isaac Sim container​

Isaac Sim process hangs after training completes​

Checkpoint upload fails silently​

Workflow Runtime Errors​

OSMO workflow code upload or download fails​

OSMO code-upload objects accumulate under osmo-code/​

OSMO workflow YAML template rendering fails​

OSMO workflow fails during CreateGroup with Exit Code 3002​

KAI scheduler rejects multi-GPU job​

OSMO dataset injection fails​

LeRobot training fails with PyTorch shared memory allocation error​

OSMO workflow pods stuck in Pending​

OSMO workflow completes task but workflow status stays RUNNING or fails​

OSMO UI shows "server IP address could not be found" for workflow logs​

Additional Resources​

Quick Diagnostics Checklist

Connection Errors

kubectl commands hang or return "Unable to connect to the server"

DNS resolution fails for private endpoints

kubectl returns "Unauthorized" or "Forbidden"

OSMO UI not reachable at expected URL

GPU and CUDA Errors

CUDA_ERROR_NO_DEVICE on RTX PRO 6000 nodes

GPU Operator attempts to install drivers on GRID driver nodes

Vulkan initialization fails in Isaac Sim containers

nvidia-smi shows no GPUs inside the container

Driver version mismatch between host and container

Deployment Failures

Terraform provider registration fails

Terraform plan fails with "subscription not configured"

Helm chart installation fails with connection refused

AzureML extension pods stuck in CrashLoopBackOff

OSMO deployment returns oauth2Proxy errors

Resource group creation fails with quota errors

Training and Inference Errors

Isaac Sim job fails with EULA not accepted

AzureML model download fails with authentication error

numpy ImportError or ABI mismatch in Isaac Sim container

Isaac Sim process hangs after training completes

Checkpoint upload fails silently

Workflow Runtime Errors

OSMO workflow code upload or download fails

OSMO code-upload objects accumulate under `osmo-code/`

OSMO workflow YAML template rendering fails

OSMO workflow fails during CreateGroup with Exit Code 3002

KAI scheduler rejects multi-GPU job

OSMO dataset injection fails

LeRobot training fails with PyTorch shared memory allocation error

OSMO workflow pods stuck in Pending

OSMO workflow completes task but workflow status stays RUNNING or fails

OSMO UI shows "server IP address could not be found" for workflow logs

Additional Resources