physical-ai-toolchain

Terraform configuration for the robotics reference architecture. Deploys Azure resources including AKS with GPU node pools, Azure ML workspace, storage, and OSMO backend services.

[!NOTE] This page is part of the deployment guide. Return there for the full deployment sequence.

📋 Prerequisites

Tool Version Setup or Check
Azure CLI Latest az login
Terraform 1.5+ terraform version
GPU VM quota Region-specific e.g., Standard_NV36ads_A10_v5

Azure RBAC Permissions

Role Scope
Contributor Subscription (new RG) or Resource Group (existing RG)
Role Based Access Control Administrator Subscription (new RG) or Resource Group (existing RG)

Terraform creates role assignments for managed identities, requiring Microsoft.Authorization/roleAssignments/write permission. The Contributor role explicitly blocks this action; the RBAC Administrator role provides it.

[!NOTE] Use subscription scope if creating a new resource group (should_create_resource_group = true). Use resource group scope if the resource group already exists.

Alternative: Owner role (grants more permissions than required).

🚀 Quick Start

cd infrastructure/terraform
source prerequisites/az-sub-init.sh
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply -var-file=terraform.tfvars

[!IMPORTANT] The default configuration creates a private AKS cluster (should_enable_private_aks_cluster = true). After deploying infrastructure, you must deploy the VPN Gateway and connect before running kubectl commands or cluster setup scripts.

⚙️ Configuration

Core Variables

Variable Description Required
environment Deployment environment (dev, test, prod) Yes
resource_prefix Resource naming prefix Yes
location Azure region Yes
instance Instance identifier No (default: “001”)
tags Resource group tags No (default: {})

AKS System Node Pool

Variable Description Default
system_node_pool_vm_size VM size for AKS system node pool Standard_D8ds_v5
system_node_pool_node_count Number of nodes for AKS system node pool 1
system_node_pool_zones Availability zones for system node pool null
should_enable_system_node_pool_auto_scaling Enable auto-scaling for system node pool false
system_node_pool_min_count Minimum nodes when auto-scaling enabled null
system_node_pool_max_count Maximum nodes when auto-scaling enabled null

Feature Flags

Variable Description Default
should_enable_nat_gateway Deploy NAT Gateway for outbound connectivity true
should_enable_private_endpoint Deploy private endpoints and DNS zones for Azure services true
should_enable_private_aks_cluster Make AKS API endpoint private (requires VPN for kubectl) true
should_enable_public_network_access Allow public access to resources true
should_deploy_postgresql Deploy PostgreSQL Flexible Server for OSMO true
should_deploy_redis Deploy Azure Managed Redis for OSMO true
should_deploy_grafana Deploy Azure Managed Grafana dashboard true
should_deploy_monitor_workspace Deploy Azure Monitor Workspace for Prometheus metrics true
should_deploy_ampls Deploy Azure Monitor Private Link Scope and endpoint true
should_deploy_dce Deploy Data Collection Endpoint for observability true
should_deploy_aml_compute Deploy AzureML managed GPU compute cluster false
should_include_aks_dns_zone Include AKS private DNS zone in core DNS zones true

Network Configuration Modes

Three deployment modes are supported based on security requirements:

Full Private (Default)

All Azure services use private endpoints and AKS has a private control plane. Requires VPN for all access.

# terraform.tfvars (default values)
should_enable_private_endpoint    = true
should_enable_private_aks_cluster = true

Deploy VPN Gateway after infrastructure: cd vpn && terraform apply

Hybrid: Private Services, Public AKS

Azure services (Storage, Key Vault, ACR, PostgreSQL, Redis) use private endpoints, but AKS control plane is publicly accessible. No VPN required for kubectl access.

# terraform.tfvars
should_enable_private_endpoint    = true
should_enable_private_aks_cluster = false

This mode provides security for Azure resources while allowing cluster management without VPN.

Full Public

All endpoints are publicly accessible. Not recommended for production without additional hardening.

# terraform.tfvars
should_enable_private_endpoint    = false
should_enable_private_aks_cluster = false

[!WARNING] Public endpoints expose services to the internet. When using this configuration, you must secure cluster workloads:

AzureML Extension: Configure HTTPS and restrict access via inference router settings. See Secure online endpoints and Inference routing.

OSMO UI: Enable Keycloak authentication to protect the web interface. See OSMO Keycloak configuration.

OSMO Workload Identity

Enable managed identity for OSMO services (recommended for production):

osmo_config = {
  should_enable_identity   = true
  should_federate_identity = true
  control_plane_namespace  = "osmo-control-plane"
  operator_namespace       = "osmo-operator"
  workflows_namespace      = "osmo-workflows"
}

See variables.tf for all configuration options.

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.