Skip to main content

Operations Hub

Centralized hub for operational documentation covering monitoring, troubleshooting, security configuration, and GPU management for Azure-NVIDIA robotics deployments.

📖 Operations Guides​

GuideDescription
TroubleshootingSymptom-based resolution for common deployment, GPU, and workflow errors
Security GuideSecurity configuration inventory and deployment checklist
GPU ConfigurationDriver selection, MIG strategy, and GPU Operator configuration
Deployment ValidationPost-deployment verification steps
Cost ConsiderationsAzure resource cost guidance

📋 Operational Overview​

The reference architecture deploys configurable monitoring components through Terraform feature flags.

ComponentPurposeFeature Flag
Log Analytics workspaceCentral log aggregationAlways deployed
Application InsightsApplication performance monitoringAlways deployed
Azure Monitor workspacePrometheus metrics backendshould_deploy_monitor_workspace
Managed GrafanaVisualization dashboardsshould_deploy_grafana
Container InsightsAKS container telemetryshould_deploy_dce
Prometheus data collection rulesMetric scraping configurationshould_deploy_dce and should_deploy_monitor_workspace
Azure Monitor Private Link ScopePrivate network monitoringshould_deploy_ampls
Data collection endpointPrivate ingestion endpointshould_deploy_dce

[!IMPORTANT] The default configuration deploys a private AKS cluster. Connect through the VPN Gateway before running any kubectl or Helm commands. See VPN Gateway for setup instructions.

Container Insights data collection rules are created when should_deploy_dce is enabled. Prometheus metrics collection rules require both should_deploy_dce and should_deploy_monitor_workspace.

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.