physical-ai-toolchain

STRIDE-based threat analysis of the Physical AI Toolchain covering infrastructure-as-code components, trust boundaries, and a prioritized remediation roadmap.

Executive Summary

This threat model applies the STRIDE framework to the Physical AI Toolchain. The architecture deploys AKS clusters with GPU node pools, Azure Machine Learning, and NVIDIA OSMO for robotics training and inference workloads. All components are infrastructure-as-code artifacts; no hosted service or user-facing application exists.

| Area | Status | Evidence |
|------|--------|----------|
| Authentication | Managed identities + workload identity | No password-based auth; DefaultAzureCredential |
| Secret Management | Azure Key Vault (RBAC) + CSI driver | Secrets synced to K8s pods at mount time |
| Network Isolation | Private endpoints + VPN-only access | All Azure services behind VNet; no public IPs |
| Encryption | TLS 1.2+ enforced by Azure | Platform-managed keys for data at rest |
| Supply Chain | 95% SHA-pinned GitHub Actions | Dependency review blocks moderate+ vulnerabilities |

Risk summary: 19 threats identified — 1 Critical, 6 High, 7 Medium, 5 Low. Key open risks: T-2 (Critical), S-1 (High), T-1 (High).
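
The totals in the risk summary can be cross-checked against the registry with a short tally. This is a minimal sketch: the tuples are transcribed by hand from the Risk Rating and Status fields of each threat in the STRIDE Threat Registry, and the variable names are illustrative.

```python
from collections import Counter

# (threat ID, risk rating, status), transcribed from the STRIDE Threat Registry.
THREATS = [
    ("S-1", "High", "Open"),
    ("S-2", "Medium", "Accepted"),
    ("T-1", "High", "Open"),
    ("T-2", "Critical", "Open"),
    ("T-3", "Medium", "Accepted"),
    ("R-1", "Medium", "Open"),
    ("I-1", "Medium", "Accepted"),
    ("I-2", "Medium", "Open"),
    ("I-3", "Low", "Accepted"),
    ("I-4", "Low", "Accepted"),
    ("I-5", "Medium", "Open"),
    ("D-1", "High", "Open"),
    ("D-2", "Low", "Accepted"),
    ("D-3", "High", "Open"),
    ("D-4", "Medium", "Open"),
    ("E-1", "Low", "Open"),
    ("E-2", "High", "Open"),
    ("E-3", "High", "Open"),
    ("E-4", "Low", "Accepted"),
]

by_rating = Counter(rating for _, rating, _ in THREATS)
by_status = Counter(status for _, _, status in THREATS)

# Open threats rated High or Critical are the ones the roadmap has to retire first.
open_high_or_worse = [
    tid for tid, rating, status in THREATS
    if status == "Open" and rating in ("Critical", "High")
]

print(dict(by_rating))
print(dict(by_status))
print(open_high_or_worse)
```

Keeping the registry in a machine-readable form like this makes it harder for the summary counts to drift when a threat is added or re-rated.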

System Description

Architecture Components

| Category | Component | Details |
|----------|-----------|---------|
| Compute | AKS Cluster | Private cluster, CNI networking, GPU node pools (Standard_NC-series) |
| Data & Storage | Azure Storage Account | Blob containers for datasets, checkpoints; private endpoint access |
| Data & Storage | Azure Database for PostgreSQL | Flexible server for OSMO metadata; VNet-integrated |
| Data & Storage | Azure Cache for Redis | Enterprise tier; OSMO session state; private endpoint |
| ML & AI | Azure Machine Learning | Workspace with managed endpoints; K8s compute attach |
| Identity | Entra ID + Managed Identities | System-assigned for AKS, user-assigned for workloads |
| Networking | VNet + NSG + NAT Gateway + VPN | Hub-spoke implied; P2S VPN for operator access |
| Observability | Azure Monitor + Log Analytics | Container Insights, Prometheus metrics, AMPLS for private ingestion |
| Security | Azure Key Vault | RBAC-mode; CSI Secret Store driver syncs secrets to pods |
| NVIDIA/OSMO | OSMO Control Plane + Backend | Orchestrates distributed training; Envoy proxy optional |

Data Flows

Training data flows from Azure Blob Storage through AKS pods to GPU compute. Checkpoints and metrics flow back to storage and MLflow tracking. OSMO coordinates multi-node training via its control plane and PostgreSQL metadata store. All traffic to Azure services uses private endpoints, so service data does not traverse the public internet. General node egress through the NAT Gateway is a separate path and is currently unfiltered (see D-3).

Operator access traverses a P2S VPN gateway to the AKS API server private endpoint. CI/CD pipelines authenticate via GitHub OIDC federation to Entra ID managed identities. Terraform state resides locally on the operator workstation (see T-2 for associated risk).

Security Inheritance

| Control | Provider | Configuration Surface |
|---------|----------|-----------------------|
| TLS termination | Azure platform | Enforced by default |
| Disk encryption at rest | Azure platform | Platform-managed keys (PMK) |
| Identity federation | Entra ID | Workload identity via OIDC |
| Network segmentation | Azure VNet + NSG | Subnets, private endpoints |
| Secret rotation | Azure Key Vault | Deployer responsibility |
| Cluster patch management | AKS managed upgrades | Deployer selects upgrade policy |

Trust Boundaries

| ID | Boundary | Description |
|----|----------|-------------|
| TB-1 | Azure Control Plane ↔ Data Plane | ARM API calls cross into subscription data plane |
| TB-2 | VNet Perimeter ↔ Internet | NAT Gateway egress; VPN ingress; no public endpoints |
| TB-3 | AKS ↔ Azure Services | Pod-to-service traffic via private endpoints and managed identity |
| TB-4 | K8s Namespace Isolation | OSMO, training, inference workloads in separate namespaces |
| TB-5 | Operator Workstation ↔ Cluster | P2S VPN tunnel; kubectl via private API server |
| TB-6 | CI/CD ↔ Repository | GitHub Actions with OIDC federation; SHA-pinned actions |
| TB-7 | OSMO Control Plane ↔ Backend | gRPC between control plane and backend pods |
| TB-8 | Training Code ↔ Azure Services | Python SDK calls via DefaultAzureCredential |

Credential Delegation Model

Entra ID issues tokens to managed identities. AKS workload identity federation projects service account tokens to pods, and pods exchange those projected tokens for Azure resource access. Key Vault stores secrets and syncs them to Kubernetes Secrets via the CSI driver. The resulting delegation chain is: Entra ID → Managed Identities → Workload Identity Federation → Key Vault → K8s Secrets.

STRIDE Threat Registry

Spoofing

S-1: OSMO API Authentication Disabled

Field Value
Threat OSMO API server deploys with auth.enabled: false, allowing unauthenticated gRPC calls
Affected Assets OSMO control plane, backend pods
Trust Boundary TB-7
Likelihood High
Impact High
Risk Rating High
Current Controls Cluster-internal networking only; namespace isolation
Evidence infrastructure/setup/values/osmo-control-plane-values.yaml sets auth.enabled: false
Status Open
Remediation Enable OSMO auth when vendor provides production-ready auth configuration

S-2: PostgreSQL Shared Admin Identity

Field Value
Threat PostgreSQL uses a single psqladmin identity for all OSMO database operations
Affected Assets Azure Database for PostgreSQL, OSMO metadata
Trust Boundary TB-3
Likelihood Medium
Impact Medium
Risk Rating Medium
Current Controls VNet integration; private endpoint; Key Vault–stored credentials
Evidence infrastructure/terraform/modules/sil/postgresql.tf configures single admin login
Status Accepted
Rationale Single-purpose database serving only OSMO; network isolation limits exposure

Tampering

T-1: MEK Stored as ConfigMap

Field Value
Threat Model Encryption Key (MEK) stored in a Kubernetes ConfigMap, bypassing etcd encryption at rest
Affected Assets K8s ConfigMap, trained model artifacts
Trust Boundary TB-4
Likelihood Medium
Impact High
Risk Rating High
Current Controls RBAC-restricted namespace; cluster-internal access only
Evidence OSMO deployment stores MEK in ConfigMap rather than K8s Secret
Status Open
Remediation Migrate MEK to Kubernetes Secret synced from Key Vault via CSI driver

T-2: Terraform State Local Storage

Field Value
Threat Terraform state file stored locally with plaintext secrets including storage keys and passwords
Affected Assets terraform.tfstate on operator workstation
Trust Boundary TB-5
Likelihood High
Impact High
Risk Rating Critical
Current Controls .gitignore excludes *.tfstate; VPN-only access to workstation
Evidence infrastructure/terraform/versions.tf has no remote backend configuration
Status Open
Remediation Configure Azure Storage remote backend with state encryption and locking
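
To gauge how much secret material a local state file holds before it is migrated, the state JSON can be scanned for sensitive-looking attributes. The sketch below assumes the version 4 state layout (a top-level resources list whose instances carry an attributes map). The name pattern, the function name, and the sample document are illustrative, not taken from the repository.

```python
import json
import re

# Attribute names that commonly hold secret material (illustrative, not exhaustive).
SENSITIVE_NAME = re.compile(
    r"(password|secret|shared_key|access_key|connection_string)", re.IGNORECASE
)


def find_plaintext_secrets(state: dict) -> list[str]:
    """Return 'type.name.attribute' for each non-empty string attribute
    whose name looks sensitive."""
    findings = []
    for resource in state.get("resources", []):
        rtype = resource.get("type", "?")
        rname = resource.get("name", "?")
        for instance in resource.get("instances", []):
            attributes = instance.get("attributes") or {}
            for key, value in attributes.items():
                if isinstance(value, str) and value and SENSITIVE_NAME.search(key):
                    findings.append(f"{rtype}.{rname}.{key}")
    return findings


# Hypothetical, heavily trimmed state document used only to exercise the function.
sample_state = json.loads("""
{
  "version": 4,
  "resources": [
    {
      "type": "azurerm_storage_account",
      "name": "example",
      "instances": [
        {"attributes": {"name": "stexample", "primary_access_key": "EXAMPLE-ONLY"}}
      ]
    }
  ]
}
""")

print(find_plaintext_secrets(sample_state))
# ['azurerm_storage_account.example.primary_access_key']
```

The function reports attribute paths only and never prints the values, so its output is safe to paste into a ticket.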

T-3: Inference Endpoint Allows Insecure Connections

Field Value
Threat AzureML online endpoint configured with allowInsecureConnections: true
Affected Assets AzureML managed online endpoint
Trust Boundary TB-3
Likelihood Low
Impact Medium
Risk Rating Medium
Current Controls Private endpoint restricts access to VNet; cluster-internal traffic only
Evidence Inference deployment YAML allows insecure connections for internal scoring
Status Accepted
Rationale Traffic stays within private VNet; TLS adds latency to inference hot path

Repudiation

R-1: Training Debug Logging Captures Credentials

Field Value
Threat Training scripts log AZURE_* environment variables at debug verbosity, exposing tokens
Affected Assets Training pod logs, Log Analytics workspace
Trust Boundary TB-8
Likelihood Medium
Impact Medium
Risk Rating Medium
Current Controls Debug logging disabled by default; Log Analytics RBAC
Evidence training/rl/utils/ modules include debug-level credential logging
Status Open
Remediation Sanitize or redact AZURE_* values before logging; enforce structured logging
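
One way to implement the redaction is a logging filter that masks the values of AZURE_* variables before a record reaches any handler. This is a minimal sketch using only the standard library. The class name and placeholder string are assumptions, and the filter snapshots the environment when it is constructed, so variables set later are not covered.

```python
import logging
import os

REDACTED = "***REDACTED***"


class AzureEnvRedactionFilter(logging.Filter):
    """Mask the values of AZURE_* environment variables in log messages."""

    def __init__(self, environ=None):
        super().__init__()
        env = os.environ if environ is None else environ
        # Longest values first, so a value containing a shorter one is masked whole.
        self._secrets = sorted(
            (value for key, value in env.items() if key.startswith("AZURE_") and value),
            key=len,
            reverse=True,
        )

    def filter(self, record):
        message = record.getMessage()  # applies %-style arguments first
        for secret in self._secrets:
            message = message.replace(secret, REDACTED)
        # Store the already-formatted text so handlers cannot re-expand the args.
        record.msg, record.args = message, None
        return True  # rewrite the record, never drop it


if __name__ == "__main__":
    handler = logging.StreamHandler()
    # Attaching to the handler covers every logger that writes through it.
    handler.addFilter(AzureEnvRedactionFilter({"AZURE_CLIENT_SECRET": "s3cr3t"}))
    log = logging.getLogger("training.example")
    log.addHandler(handler)
    log.setLevel(logging.DEBUG)
    log.debug("client secret is %s", "s3cr3t")
```

Attaching the filter to handlers rather than to individual loggers matters because logger-level filters do not apply to records propagated from child loggers.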

Information Disclosure

I-1: Storage Access Key Fallback

Field Value
Threat Storage account access keys used as fallback when managed identity auth fails
Affected Assets Azure Storage Account, training datasets
Trust Boundary TB-3
Likelihood Medium
Impact Medium
Risk Rating Medium
Current Controls Keys stored in Key Vault; private endpoint restricts network access
Evidence Training scripts fall back to AZURE_STORAGE_KEY when DefaultAzureCredential is unavailable
Status Accepted
Rationale Fallback provides operational resilience; keys are Key Vault–managed with rotation capability

I-2: VPN Shared Keys in Local Terraform State

Field Value
Threat VPN gateway shared secret stored in plaintext within local terraform.tfstate
Affected Assets VPN gateway, operator VPN credentials
Trust Boundary TB-5
Likelihood Medium
Impact Medium
Risk Rating Medium
Current Controls .gitignore excludes state files; workstation access controls
Evidence infrastructure/terraform/vpn/ stores VPN shared key as Terraform-managed resource
Status Open
Remediation Resolved by T-2 remediation (remote backend with state encryption)

I-3: Redis Access Keys Alongside Private Endpoint

Field Value
Threat Redis Enterprise exposes access keys even though connectivity uses private endpoints
Affected Assets Azure Cache for Redis, OSMO session data
Trust Boundary TB-3
Likelihood Low
Impact Low
Risk Rating Low
Current Controls Private endpoint; Key Vault–stored keys; namespace-scoped RBAC
Evidence Redis module outputs access keys to Terraform state
Status Accepted
Rationale Private endpoint eliminates network-level exposure; key rotation available via Key Vault

I-4: MLflow Temp Config World-Readable

Field Value
Threat MLflow writes tracking configuration to /tmp with world-readable permissions
Affected Assets MLflow tracking URI, experiment metadata
Trust Boundary TB-8
Likelihood Low
Impact Low
Risk Rating Low
Current Controls Pod-level isolation; no credentials in tracking config
Evidence MLflow integration code writes to /tmp/mlflow-config
Status Accepted
Rationale No secrets in config file; pod filesystem isolation limits access to same-pod processes
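
Although this risk is accepted, a deployer who wants to tighten it could create the file with owner-only permissions. The following is a sketch under that assumption. The helper name and its parameters are hypothetical, and it does not reflect how the MLflow integration code currently writes the file.

```python
import os
import stat
import tempfile


def write_private_config(directory: str, name: str, content: str) -> str:
    """Create `name` inside `directory`, readable and writable by the owner only."""
    path = os.path.join(directory, name)
    # O_EXCL refuses to reuse a file that another process may have pre-created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        created = write_private_config(workdir, "mlflow-config", "tracking_uri=example\n")
        print(oct(stat.S_IMODE(os.stat(created).st_mode)))
```

Passing the mode to os.open sets the permissions at creation time, which avoids the window that exists when a file is created with default permissions and restricted afterwards.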

I-5: Training Environment Variable Debug Logging

Field Value
Threat Training utility modules log environment variables containing Azure credentials at debug level
Affected Assets Pod logs, Log Analytics workspace
Trust Boundary TB-8
Likelihood Medium
Impact Medium
Risk Rating Medium
Current Controls Debug logging off by default; RBAC on Log Analytics
Evidence training/rl/utils/env.py logs AZURE_* values at debug verbosity
Status Open
Remediation Same as R-1; sanitize credential values before logging

Denial of Service

D-1: Zero NetworkPolicy Manifests

Field Value
Threat No Kubernetes NetworkPolicy resources deployed; all pod-to-pod traffic unrestricted
Affected Assets AKS cluster, all workload namespaces
Trust Boundary TB-4
Likelihood High
Impact High
Risk Rating High
Current Controls Azure CNI network plugin supports NetworkPolicy; namespaces provide logical separation
Evidence No NetworkPolicy resources in infrastructure/setup/manifests/
Status Open
Remediation Define deny-all default policies per namespace; allow-list required traffic flows
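
The deny-all and allow-list policies can be generated per namespace rather than written by hand. The sketch below builds NetworkPolicy objects as plain dictionaries and serializes them to JSON, which kubectl accepts alongside YAML. The namespace names, policy names, and port are placeholders, not values from the repository.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in `namespace`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches every pod in the namespace
            # Both types are listed with no rules, so both directions are denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_ingress_from_namespace(namespace: str, source_namespace: str, port: int) -> dict:
    """Allow TCP ingress on `port` from any pod in `source_namespace`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-from-{source_namespace}", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{
                    "namespaceSelector": {
                        # Label that Kubernetes sets automatically on each namespace.
                        "matchLabels": {"kubernetes.io/metadata.name": source_namespace}
                    }
                }],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace names and port, for illustration only.
    policies = [default_deny_policy(ns) for ns in ("osmo", "training", "inference")]
    policies.append(allow_ingress_from_namespace("osmo", "training", 8080))
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": policies}, indent=2))
```

A deny-all egress policy also blocks DNS lookups, so an explicit egress rule for the cluster DNS service is normally needed before workloads can resolve names.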

D-2: Single Shared NSG With Zero Custom Rules

Field Value
Threat One NSG applied to all subnets with no custom inbound/outbound rules beyond Azure defaults
Affected Assets VNet subnets, all networked resources
Trust Boundary TB-2
Likelihood Low
Impact Medium
Risk Rating Low
Current Controls Private endpoints eliminate public attack surface; VPN-only ingress
Evidence infrastructure/terraform/modules/platform/networking.tf defines NSG with no custom rules
Status Accepted
Rationale Private endpoints and VPN remove public exposure; custom rules add value after traffic audit

D-3: NAT Gateway No Egress Filtering

Field Value
Threat NAT Gateway allows unrestricted egress from AKS nodes to the internet
Affected Assets AKS nodes, container images, external APIs
Trust Boundary TB-2
Likelihood Medium
Impact High
Risk Rating High
Current Controls NSG default rules; container image pull from ACR via private endpoint
Evidence NAT Gateway configured without Azure Firewall or FQDN filtering
Status Open
Remediation Add Azure Firewall or NSG egress rules restricting outbound to required FQDNs

D-4: OSMO API Rate Limiting and Proxy Disabled

Field Value
Threat OSMO API deploys with rateLimit.enabled: false and envoy.enabled: false
Affected Assets OSMO control plane API
Trust Boundary TB-7
Likelihood Medium
Impact Medium
Risk Rating Medium
Current Controls Cluster-internal access only; namespace isolation
Evidence OSMO Helm values disable rate limiting and Envoy sidecar proxy
Status Open
Remediation Enable Envoy proxy and rate limiting when OSMO vendor provides stable configuration

Elevation of Privilege

E-1: Automation Account Contributor Role

Field Value
Threat Automation Account assigned Contributor role at resource group scope
Affected Assets Azure Automation Account, resource group resources
Trust Boundary TB-1
Likelihood Low
Impact Medium
Risk Rating Low
Current Controls Automation runs scheduled maintenance tasks only; no external triggers
Evidence infrastructure/terraform/modules/platform/automation.tf assigns Contributor role
Status Open
Remediation Define custom RBAC role scoped to specific maintenance operations

E-2: OSMO Service Token One-Year Expiry Without Rotation

Field Value
Threat OSMO service token issued with one-year expiry and no automated rotation mechanism
Affected Assets OSMO service authentication, cluster workloads
Trust Boundary TB-7
Likelihood Medium
Impact High
Risk Rating High
Current Controls Token stored in Key Vault; cluster-internal access
Evidence OSMO deployment scripts create long-lived service tokens
Status Open
Remediation Implement token rotation via Key Vault rotation policy or OSMO vendor short-lived token support
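
Until automated rotation exists, a periodic check can at least flag tokens whose lifetime or remaining validity falls outside policy. This is a rough sketch: the 90-day maximum lifetime and 14-day rotation window are assumed policy values, and how the issue and expiry timestamps are obtained from OSMO or Key Vault is left out.

```python
from datetime import datetime, timedelta, timezone

MAX_LIFETIME = timedelta(days=90)   # assumed policy ceiling, not an OSMO default
ROTATE_BEFORE = timedelta(days=14)  # assumed lead time for rotation


def token_findings(issued_at: datetime, expires_at: datetime, now: datetime) -> list[str]:
    """Return policy findings for one token; an empty list means compliant."""
    findings = []
    if expires_at - issued_at > MAX_LIFETIME:
        findings.append("lifetime exceeds policy")
    if expires_at <= now:
        findings.append("expired")
    elif expires_at - now <= ROTATE_BEFORE:
        findings.append("rotation due")
    return findings


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    one_year = issued + timedelta(days=365)
    print(token_findings(issued, one_year, issued + timedelta(days=10)))
    # ['lifetime exceeds policy']
```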

E-3: User Provisioning Grants Excessive Admin Roles

Field Value
Threat add-user-to-platform.sh assigns 9+ admin-level RBAC roles to each onboarded user
Affected Assets Azure RBAC, onboarded user identities
Trust Boundary TB-1
Likelihood Medium
Impact High
Risk Rating High
Current Controls Script requires manual execution by a privileged operator
Evidence infrastructure/setup/optional/add-user-to-platform.sh assigns broad role set
Status Open
Remediation Define tiered role profiles (reader, contributor, admin); assign minimum required roles
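
The tiered profiles could be expressed as data that the onboarding script consumes, with each tier inheriting the roles below it. In this sketch the tier names follow the remediation text. The role names are Azure built-in roles chosen only for illustration, and are not the set that add-user-to-platform.sh assigns today.

```python
# Illustrative tier contents; adjust to the roles each persona actually needs.
ROLE_TIERS = {
    "reader": ["Reader", "Storage Blob Data Reader"],
    "contributor": [
        "Storage Blob Data Contributor",
        "AzureML Data Scientist",
        "Azure Kubernetes Service Cluster User Role",
    ],
    "admin": ["Key Vault Administrator", "Azure Kubernetes Service RBAC Admin"],
}
TIER_ORDER = ["reader", "contributor", "admin"]


def roles_for(tier: str) -> list[str]:
    """Return the roles for `tier`, including those of every lower tier."""
    if tier not in TIER_ORDER:
        raise ValueError(f"unknown tier: {tier!r}")
    roles = []
    for name in TIER_ORDER[: TIER_ORDER.index(tier) + 1]:
        roles.extend(ROLE_TIERS[name])
    return roles


if __name__ == "__main__":
    for tier in TIER_ORDER:
        print(tier, len(roles_for(tier)))
```

Requiring the operator to name a tier explicitly makes the least-privileged profile the natural default and turns admin access into a deliberate choice.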

E-4: GitHub App Token Elevated Repository Permissions

Field Value
Threat GitHub App token used in workflows has elevated repository permissions beyond immediate need
Affected Assets GitHub Actions workflows, repository contents
Trust Boundary TB-6
Likelihood Low
Impact Low
Risk Rating Low
Current Controls SHA-pinned actions; OIDC federation; branch protection rules
Evidence Workflow files request contents: write and other elevated permissions
Status Accepted
Rationale Permissions required for release-please and dependency review workflows; scoped to repository

Assurance Argument

Goal Structuring Notation (GSN) elements supporting the security posture claim.

| Element | Statement |
|---------|-----------|
| G0 | The architecture provides adequate security controls for an IaC reference architecture |
| G1 | Authentication uses managed identities and workload federation, eliminating password-based access |
| G2 | Secrets are stored in Azure Key Vault with RBAC authorization and synced via CSI driver |
| G3 | Network access is restricted to private endpoints, VPN, and NSG-controlled subnets |
| G4 | Supply chain integrity is maintained through SHA-pinned actions and dependency review |
| E1 | 19 STRIDE threats identified; 7 Accepted with compensating controls, 12 Open with remediation roadmap |
| E2 | OpenSSF Passing ~85%; 25 Silver criteria assessed (5 Met, 5 Delegated, 13 N/A, 1 Gap) |
| A1 | Deployer follows docs/operations/security-guide.md hardening checklist |
| A2 | OSMO vendor provides auth/rate-limiting enablement path in future releases |

Remediation Roadmap

| Priority | Item | Threats Addressed | Effort | Key Dependency |
|----------|------|-------------------|--------|----------------|
| 1 | Terraform Remote Backend | T-2, I-2 | Low-Medium | Storage account |
| 2 | Automation Least Privilege | E-1 | Low | Custom role definition |
| 3 | MEK Migration (ConfigMap→Secret) | T-1 | Medium | OSMO vendor verification |
| 4 | NetworkPolicy Manifests | D-1 | Medium | Traffic audit |
| 5 | NSG Rules | D-2 | Medium-High | NSG Flow Logs observation |

Security Metrics

| Metric | Current | Target |
|--------|---------|--------|
| OpenSSF Passing badge | ~85% | 100% |
| OpenSSF Silver badge | ~30% | 80% |
| SHA-pinned actions | 95% | 100% |
| STRIDE threats mitigated | 7/19 (37%) | 15/19 (79%) |
| Critical threats open | 1 | 0 |
| High threats open | 6 | 2 |
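
The SHA-pinned actions metric can be measured directly from the workflow files. The sketch below counts remote uses: references and reports how many are pinned to a full 40-character commit SHA. The regular expressions are a simplification of workflow syntax, and the sample workflow (including its SHA) is made up for the example.

```python
import re

# Matches "uses: owner/repo@ref", with or without a leading list dash.
USES = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)", re.MULTILINE)
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def pin_ratio(workflow_text: str) -> tuple[int, int]:
    """Return (pinned, total) for remote action references in one workflow file."""
    refs = [ref.strip("'\"") for ref in USES.findall(workflow_text)]
    # Local actions (./path) and docker:// images are not pinned by commit SHA.
    remote = [ref for ref in refs if not ref.startswith(("./", "docker://"))]
    pinned = [ref for ref in remote if FULL_SHA.search(ref)]
    return len(pinned), len(remote)


# Hypothetical workflow fragment; the SHA is a placeholder, not a real commit.
SAMPLE = """
steps:
  - uses: actions/checkout@0123456789abcdef0123456789abcdef01234567 # pinned
  - name: Set up Python
    uses: actions/setup-python@v5
  - uses: ./.github/actions/local-step
"""

pinned, total = pin_ratio(SAMPLE)
print(f"{pinned}/{total} pinned")  # 1/2 pinned
```

Summing the two counts across every file under .github/workflows gives the repository-wide percentage tracked in the metrics table.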
