Skip to main content

Troubleshooting Guide

Find the symptom you are experiencing, then follow the resolution steps. Start with the quick diagnostics checklist to narrow down the failure category.

Quick Diagnostics Checklist

CheckCommandExpected
Cluster reachablekubectl get nodesNode list returned
GPU availablekubectl describe node | grep nvidia.com/gpuGPU count > 0
AzureML extensionkubectl get pods -n azuremlAll pods Running
OSMO control planekubectl get pods -n osmo-control-planeAll pods Running
VPN connectedping <private-endpoint-ip>Response received

Connection Errors

kubectl commands hang or return "Unable to connect to the server"

Cause: The default deployment creates a private AKS cluster. The API server is not reachable without an active VPN connection.

Resolution:

  1. Verify VPN connection status in the Azure portal under the VPN Gateway resource.
  2. Reconnect using the VPN client profile downloaded during VPN setup.
  3. Confirm connectivity with kubectl get nodes.

DNS resolution fails for private endpoints

Cause: Private DNS zones are not linked to the VPN virtual network, or the client DNS resolver is not forwarding to Azure DNS.

Resolution:

  1. Verify the private DNS zone link exists: az network private-dns link vnet list --zone-name <zone> -g <rg>.
  2. On the client machine, flush DNS cache and retry.
  3. For persistent failures, add manual host entries from the private endpoint IP addresses.

kubectl returns "Unauthorized" or "Forbidden"

Cause: Azure RBAC role assignment is missing or the kubeconfig token has expired.

Resolution:

  1. Refresh credentials: az aks get-credentials --resource-group <rg> --name <cluster> --overwrite-existing.
  2. Verify your Azure AD identity has Azure Kubernetes Service Cluster User Role on the cluster resource.

OSMO UI not reachable at expected URL

Cause: DNS zone for the OSMO service URL is not configured, or the ingress controller internal load balancer has no IP assigned.

Resolution:

  1. Check the ingress service IP: kubectl get svc -n osmo-control-plane.
  2. Verify DNS records in the private DNS zone match the load balancer IP.
  3. See Private DNS for DNS zone deployment.

GPU and CUDA Errors

CUDA_ERROR_NO_DEVICE on RTX PRO 6000 nodes

Cause: MIG strategy is set to none instead of single. Azure vGPU hosts enable MIG, and strategy: none causes NVIDIA_VISIBLE_DEVICES to receive bare GPU UUIDs instead of MIG device UUIDs.

Resolution:

Set mig.strategy: single in the GPU Operator Helm values for RTX PRO 6000 node pools. See GPU Configuration for node-specific settings.

[!WARNING] RTX PRO 6000 nodes require mig.strategy: single. Using none causes all GPU workloads on these nodes to fail with CUDA_ERROR_NO_DEVICE.

GPU Operator attempts to install drivers on GRID driver nodes

Cause: Nodes with pre-installed Azure GRID drivers (580.105.08-grid-azure) do not need the GPU Operator datacenter driver. Installing both causes conflicts.

Resolution:

Label GRID driver nodes with nvidia.com/gpu.deploy.driver=false to prevent the GPU Operator from deploying its own driver DaemonSet.

Vulkan initialization fails in Isaac Sim containers

Cause: The NVIDIA_DRIVER_CAPABILITIES environment variable is not set to all. Isaac Sim requires Vulkan capability for rendering.

Resolution:

Set NVIDIA_DRIVER_CAPABILITIES=all in the job environment variables. This is required for all Isaac Sim workloads regardless of GPU type.

nvidia-smi shows no GPUs inside the container

Cause: The container runtime is not configured with the NVIDIA runtime class, or GPU resource requests are missing from the pod spec.

Resolution:

  1. Verify the pod spec includes resources.limits: nvidia.com/gpu: 1.
  2. Confirm the NVIDIA device plugin is running: kubectl get pods -n gpu-operator.
  3. Check node allocatable GPU count: kubectl describe node <node> | grep nvidia.com/gpu.

Driver version mismatch between host and container

Cause: The GPU Operator installed a driver version incompatible with the CUDA toolkit version in the container image.

Resolution:

  1. Check the host driver version: nvidia-smi on the node.
  2. Verify compatibility with the CUDA compatibility matrix.
  3. Pin the GPU Operator driver version to match container requirements in the Helm values.

Deployment Failures

Terraform provider registration fails

Cause: Required Azure resource providers are not registered on the subscription.

Resolution:

Run source infrastructure/terraform/prerequisites/az-sub-init.sh to register all required providers. The script reads from infrastructure/terraform/prerequisites/robotics-azure-resource-providers.txt.

Terraform plan fails with "subscription not configured"

Cause: The ARM_SUBSCRIPTION_ID environment variable is not set.

Resolution:

Run source infrastructure/terraform/prerequisites/az-sub-init.sh before any terraform commands. This script exports ARM_SUBSCRIPTION_ID and validates Azure CLI authentication.

Helm chart installation fails with connection refused

Cause: The VPN is not connected, or the deploy scripts are running before the VPN Gateway deployment completes.

Resolution:

  1. Complete VPN deployment: infrastructure/terraform/vpn/.
  2. Connect the VPN client.
  3. Re-run deploy scripts in order: 01-deploy-robotics-charts.sh through 04-deploy-osmo-backend.sh.

AzureML extension pods stuck in CrashLoopBackOff

Cause: Identity or RBAC misconfiguration for the AzureML managed identity, or resource quota exceeded.

Resolution:

  1. Check pod logs: kubectl logs <pod> -n azureml.
  2. Verify the managed identity has federated credentials for the azureml:default and azureml:training service accounts.
  3. Check subscription quota: az vm list-usage --location <region> -o table.

OSMO backend deployment returns oauth2Proxy errors

Cause: oauth2Proxy.enabled is set to true but no OIDC provider is configured.

Resolution:

Set oauth2Proxy.enabled: false in the OSMO Helm values when no OIDC provider is available. See infrastructure/setup/04-deploy-osmo-backend.sh for the configuration.

Resource group creation fails with quota errors

Cause: Subscription-level resource group limit or regional capacity constraints.

Resolution:

  1. Check current limits: az account list-locations and az vm list-usage --location <region>.
  2. Request quota increases through the Azure portal for the target region.

Training and Inference Errors

Isaac Sim job fails with EULA not accepted

Cause: The environment variables ACCEPT_EULA and PRIVACY_CONSENT are not set to Y.

Resolution:

Add both variables to the job definition:

environment_variables:
ACCEPT_EULA: "Y"
PRIVACY_CONSENT: "Y"

AzureML model download fails with authentication error

Cause: Workload identity auth failure in the data-capability sidecar when using ro_mount mode.

Resolution:

Switch model validation mode from ro_mount to download in the AzureML job YAML. This is a known workaround for workload identity compatibility.

numpy ImportError or ABI mismatch in Isaac Sim container

Cause: numpy 2.x is installed but Isaac Sim 4.x requires numpy < 2.0.0 for ABI compatibility with its bundled libraries.

Resolution:

The train.sh script pins numpy to >=1.26.0,<2.0.0. Verify this pin is present. If using a custom entrypoint, add:

uv pip install "numpy>=1.26.0,<2.0.0"

Isaac Sim process hangs after training completes

Cause: Isaac Sim 4.x hangs after env.close() on vGPU nodes due to a shutdown bug.

Resolution:

Use simulation_shutdown.py which stops the simulation timeline and applies a SIGKILL watchdog to force process termination.

Checkpoint upload fails silently

Cause: The AzureML named output is wired through ${{outputs.checkpoints}} in environment_variables:, which AzureML does NOT substitute (substitution only happens in command:). The container receives the literal template string and the sync helper writes to a relative directory of that name, leaving cap/data-capability/wd/checkpoints/ empty.

Resolution:

  1. Read AZURE_ML_OUTPUT_CHECKPOINTS (set by the data-capability runtime) directly in the training entry point — do not indirect through a custom env var.
  2. Confirm the named outputs.checkpoints uri_folder is declared in the job YAML.
  3. Check job logs (system_logs/data_capability/data-capability.log) for uploaded N files rather than is empty. Skip uploading.

Workflow Runtime Errors

OSMO workflow submission fails with payload too large

Cause: Base64-encoded archive exceeds the ~1 MB payload limit.

Resolution:

Switch from inline payload to dataset folder injection. Upload files as an OSMO dataset and reference the dataset folder name in the workflow environment variables.

OSMO workflow YAML template rendering fails

Cause: OSMO uses Jinja templates ({{ }}). Helm Go template syntax ({{ .Values }}) causes parse errors.

Resolution:

Convert all template expressions to Jinja syntax. For variable substitution, use {{ env_var }} patterns.

OSMO workflow fails during CreateGroup with Exit Code 3002

Cause: OSMO asks Kubernetes to create workflow pods during the CreateGroup phase. Kubernetes calls the KAI Scheduler binder admission webhooks before admitting those pods. If the API server cannot verify the binder webhook TLS certificate, pod admission fails before the training container starts, and OSMO reports Exit Code: 3002.

The characteristic Kubernetes error includes failed calling webhook "binder.run.ai", binder.kai-scheduler.svc, and an x509 message such as certificate signed by unknown authority or parent certificate cannot sign this kind of certificate. The unknown field "spec.env" warning can appear in the same response but is not the admission blocker.

The KAI Scheduler chart (v0.5.5) refreshes binder-webhook-tls-secret on every install but does not reliably keep the MutatingWebhookConfiguration / ValidatingWebhookConfiguration caBundle in sync with the freshly minted leaf. On upgrade the API server can be left trusting a stale CA while the binder service serves a different leaf. In a degenerate case, caBundle references a self-signed cert (binder.kai-scheduler.svc-ca) that does not sign the served leaf at all.

A secondary failure mode is binder pods caching an old mounted certificate after a chart upgrade. Tracked as #794.

Diagnostics:

Compare the leaf certificate served by the binder Secret against the caBundle advertised on the webhook configs — they must match.

kubectl -n kai-scheduler get secret binder-webhook-tls-secret \
-o jsonpath='{.data.tls\.crt}' | base64 --decode |
openssl x509 -noout -subject -fingerprint -sha256

kubectl get mutatingwebhookconfiguration kai-binder \
-o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 --decode |
openssl x509 -noout -subject -fingerprint -sha256

kubectl run kai-binder-webhook-tls-check --image=registry.k8s.io/pause:3.10 \
--restart=Never --dry-run=server -o yaml

The dry-run admission probe is the definitive test: it exercises the binder webhook path without creating a pod and surfaces the exact x509 failure the API server sees. If the two fingerprints differ, the cluster is in the drift state described above; the probe will report x509: certificate signed by unknown authority.

Resolution:

Repair both failure modes manually with the following sequence. Each step is idempotent and safe to re-run on a live cluster.

First, force binder pods to re-read the mounted certificate (sufficient when only the cached cert is stale):

kubectl -n kai-scheduler rollout restart deployment/binder
kubectl -n kai-scheduler rollout status deployment/binder --timeout=180s

Re-run the admission probe above. If the x509 error persists, the Secret and the webhook caBundle are out of sync. Sync the caBundle on both webhook configs to the leaf certificate currently in the Secret, then restart binder:

leaf_b64=$(kubectl -n kai-scheduler get secret binder-webhook-tls-secret \
-o jsonpath='{.data.tls\.crt}')
[[ -n "$leaf_b64" ]] || { echo "binder-webhook-tls-secret has no tls.crt" >&2; exit 1; }

for kind in mutatingwebhookconfiguration validatingwebhookconfiguration; do
kubectl get "$kind" kai-binder -o json |
jq --arg ca "$leaf_b64" '.webhooks |= map(.clientConfig.caBundle = $ca)' |
kubectl replace -f -
done

kubectl -n kai-scheduler rollout restart deployment/binder
kubectl -n kai-scheduler rollout status deployment/binder --timeout=180s

If the webhook configs themselves are corrupted (for example, caBundle references a self-signed cert that does not match any leaf), delete them and re-run infrastructure/setup/01-deploy-robotics-charts.sh to let the chart recreate them. The post-install sync block above then aligns the freshly minted Secret with the recreated configs:

for kind in mutatingwebhookconfiguration validatingwebhookconfiguration; do
kubectl delete "$kind" kai-binder --ignore-not-found
done

infrastructure/setup/01-deploy-robotics-charts.sh

KAI scheduler rejects multi-GPU job

Cause: Coscheduling (gang-scheduling) requirements are not met. Either insufficient GPU resources or the PodGroup configuration is missing.

Resolution:

  1. Verify available GPU capacity across nodes: kubectl describe nodes | grep nvidia.com/gpu.
  2. Confirm the KAI scheduler is installed and configured for coscheduling in the OSMO backend.
  3. Reduce GPU request count or wait for node autoscaling to provide capacity.

OSMO dataset injection fails

Cause: The dataset folder name in the workflow YAML does not match the registered dataset name, or the dataset version is not published.

Resolution:

  1. List available datasets: osmo config list DATASET.
  2. Verify the dataset name and version in the workflow environment variables match a published dataset.

LeRobot training fails with PyTorch shared memory allocation error

Cause: PyTorch DataLoader workers use /dev/shm to collate batches across worker processes. Kubernetes pods inherit the container runtime default shared-memory mount unless the pod spec overrides it, and that default is too small for image-heavy LeRobot batches. The failure occurs after OSMO creates the pod and after training starts, so it is not an OSMO CreateGroup, KAI, or webhook TLS issue.

Characteristic log line:

RuntimeError: unable to allocate shared memory(shm) for file </torch_...>: Success (0)

Resolution:

OSMO mounts /dev/shm according to the POD_TEMPLATE config registered on the control plane at deploy time, with USER_SHM_SIZE rendered from the pool config (default 8Gi). The mount applies only to pods created after the config is registered — restarting the training pod is not enough; a stuck workflow must be cancelled and resubmitted.

Verify the live cluster matches the deploy script's intent:

osmo config get POD_TEMPLATE --output yaml | grep -A 3 dshm
kubectl get pod <pod> -n osmo-workflows -o jsonpath='{.spec.volumes[?(@.name=="dshm")].emptyDir.sizeLimit}'
kubectl get pod <pod> -n osmo-workflows -o jsonpath='{.spec.containers[?(@.name!="osmo-ctrl")].volumeMounts[?(@.mountPath=="/dev/shm")].name}'

If the POD_TEMPLATE is missing the dshm mount, the registered config drifted from infrastructure/setup/config/pod-template-config.template.json; re-run infrastructure/setup/04-deploy-osmo-backend.sh to republish it. If the config is correct but a running pod lacks the mount, that pod predates the config update — cancel the workflow and submit a new one.

If new pods still exhaust /dev/shm with USER_SHM_SIZE=8Gi, raise USER_SHM_SIZE in infrastructure/setup/config/pool-config.template.json (16Gi is a safe ceiling for image-heavy ACT/Diffusion datasets), re-run 04-deploy-osmo-backend.sh, and resubmit. Reducing --batch-size or dataloader.num_workers is the workaround when raising the mount is not an option.

OSMO workflow pods stuck in Pending

Cause: The osmo-workflows namespace lacks resource quota or node affinity rules prevent scheduling.

Resolution:

  1. Check pod events: kubectl describe pod <pod> -n osmo-workflows.
  2. Verify node taints and tolerations match the pod spec.
  3. Check namespace resource quotas: kubectl get resourcequota -n osmo-workflows.

OSMO workflow completes task but workflow status stays RUNNING or fails

Cause: The service_base_url field in the OSMO SERVICE config is empty or points to a service that does not route /api/logger paths. The osmo-ctrl sidecar in workflow pods uses this URL to connect via WebSocket to the logger service and to refresh auth tokens. When misconfigured, the sidecar logs websocket: bad handshake or connection refused errors, preventing the workflow from reporting completion.

Resolution:

  1. Inspect the osmo-ctrl sidecar arguments on a workflow pod:

    kubectl get pod <pod> -n osmo-workflows -o json | \
    python3 -c "import sys,json; [print(a) for c in json.load(sys.stdin)['spec']['containers'] if c['name']=='osmo-ctrl' for a in c.get('args',[])]"
  2. If -host is empty (""), the service_base_url in the SERVICE config is not set:

    osmo config show SERVICE | grep service_base_url
  3. Set service_base_url to the AzureML ingress controller ClusterIP FQDN:

    osmo config show SERVICE | python3 -c "
    import sys, json
    c = json.load(sys.stdin)
    c['service_base_url'] = 'http://azureml-ingress-nginx-controller.azureml.svc.cluster.local'
    json.dump(c, sys.stdout, indent=2)" > /tmp/service-config.json
    osmo config update SERVICE --file /tmp/service-config.json --description "Set service base URL"
  4. Cancel and resubmit the workflow. New pods pick up the updated service_base_url.

[!WARNING] Do not set service_base_url to osmo-router. The router only handles /api/router/ paths. The ingress controller routes all API paths (/api/logger, /api/agent, /api/auth, /api) to the correct backend services.

OSMO UI shows "server IP address could not be found" for workflow logs

Cause: The service_base_url is set to an in-cluster FQDN (e.g., azureml-ingress-nginx-controller.azureml.svc.cluster.local) that the browser cannot resolve. The OSMO UI constructs workflow overview URLs from this value.

Resolution:

Connect via VPN and set service_base_url to the internal load balancer IP, which is reachable from both in-cluster pods and the VPN-connected browser:

# Find the internal LB IP
kubectl get svc azureml-ingress-nginx-internal-lb -n azureml \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Update service_base_url to the internal LB IP
osmo config show SERVICE | python3 -c "
import sys, json
c = json.load(sys.stdin)
c['service_base_url'] = 'http://10.0.5.6' # replace with actual IP
json.dump(c, sys.stdout, indent=2)" > /tmp/service-config.json
osmo config update SERVICE --file /tmp/service-config.json --description "Set service base URL to internal LB"

Without VPN, use the CLI to view workflow logs: osmo workflow logs <workflow-id>.

Additional Resources

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.