Threat Model — Physical AI Toolchain

STRIDE-based threat analysis of the Physical AI Toolchain covering infrastructure-as-code components, trust boundaries, and a prioritized remediation roadmap.

Executive Summary

This threat model applies the STRIDE framework to the Physical AI Toolchain. The architecture deploys AKS clusters with GPU node pools, Azure Machine Learning, and NVIDIA OSMO for robotics training and inference workloads. All components are infrastructure-as-code artifacts; no hosted service or user-facing application exists.

| Area | Status | Evidence |
| --- | --- | --- |
| Authentication | Managed identities + workload identity | No password-based auth; DefaultAzureCredential |
| Secret Management | Azure Key Vault (RBAC) + CSI driver | Secrets synced to K8s pods at mount time |
| Network Isolation | Private endpoints + VPN-only access | All Azure services behind VNet; no public IPs |
| Encryption | TLS 1.2+ enforced by Azure | Platform-managed keys for data at rest |
| Supply Chain | 95% SHA-pinned GitHub Actions | Dependency review blocks moderate+ vulnerabilities |

Risk summary: 19 threats identified — 1 Critical, 6 High, 7 Medium, 5 Low. Key open risks: T-2 (Critical), S-1 (High), T-1 (High).
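The counts in the risk summary can be cross-checked against the registry. The sketch below transcribes the Risk Rating of each threat from the STRIDE Threat Registry in this document and tallies them; the `RATINGS` dictionary and `tally` helper are illustrative names, not part of the toolchain.

```python
from collections import Counter

# Risk ratings transcribed from the STRIDE Threat Registry in this document.
RATINGS = {
    "S-1": "High", "S-2": "Medium",
    "T-1": "High", "T-2": "Critical", "T-3": "Medium",
    "R-1": "Medium",
    "I-1": "Medium", "I-2": "Medium", "I-3": "Low", "I-4": "Low", "I-5": "Medium",
    "D-1": "High", "D-2": "Low", "D-3": "High", "D-4": "Medium",
    "E-1": "Low", "E-2": "High", "E-3": "High", "E-4": "Low",
}


def tally(ratings):
    """Count threats per risk rating."""
    return Counter(ratings.values())


if __name__ == "__main__":
    counts = tally(RATINGS)
    for level in ("Critical", "High", "Medium", "Low"):
        print(f"{level}: {counts[level]}")
    print(f"Total: {len(RATINGS)}")
```

Keeping a tally like this next to the registry makes it easy to notice when a rating changes but the summary line does not.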

System Description

Architecture Components

| Category | Component | Details |
| --- | --- | --- |
| Compute | AKS Cluster | Private cluster, CNI networking, GPU node pools (Standard_NC-series) |
| Data & Storage | Azure Storage Account | Blob containers for datasets, checkpoints; private endpoint access |
| Data & Storage | Azure Database for PostgreSQL | Flexible server for OSMO metadata; VNet-integrated |
| Data & Storage | Azure Cache for Redis | Enterprise tier; OSMO session state; private endpoint |
| ML & AI | Azure Machine Learning | Workspace with managed endpoints; K8s compute attach |
| Identity | Entra ID + Managed Identities | System-assigned for AKS, user-assigned for workloads |
| Networking | VNet + NSG + NAT Gateway + VPN | Hub-spoke implied; P2S VPN for operator access |
| Observability | Azure Monitor + Log Analytics | Container Insights, Prometheus metrics, AMPLS for private ingestion |
| Security | Azure Key Vault | RBAC-mode; CSI Secret Store driver syncs secrets to pods |
| NVIDIA/OSMO | OSMO Control Plane + Backend | Orchestrates distributed training; Envoy proxy optional |

Data Flows

Training data flows from Azure Blob Storage through AKS pods to GPU compute. Checkpoints and metrics flow back to storage and MLflow tracking. OSMO coordinates multi-node training via its control plane and PostgreSQL metadata store. All Azure service traffic uses private endpoints, so training data, checkpoints, and metadata do not traverse the public internet.

Operator access traverses a P2S VPN gateway to the AKS API server private endpoint. CI/CD pipelines authenticate via GitHub OIDC federation to Entra ID managed identities. Terraform state resides locally on the operator workstation (see T-2 for associated risk).

Security Inheritance

| Control | Provider | Configuration Surface |
| --- | --- | --- |
| TLS termination | Azure platform | Enforced by default |
| Disk encryption at rest | Azure platform | Platform-managed keys (PMK) |
| Identity federation | Entra ID | Workload identity via OIDC |
| Network segmentation | Azure VNet + NSG | Subnets, private endpoints |
| Secret rotation | Azure Key Vault | Deployer responsibility |
| Cluster patch management | AKS managed upgrades | Deployer selects upgrade policy |

Trust Boundaries

| ID | Boundary | Description |
| --- | --- | --- |
| TB-1 | Azure Control Plane ↔ Data Plane | ARM API calls cross into subscription data plane |
| TB-2 | VNet Perimeter ↔ Internet | NAT Gateway egress; VPN ingress; no public endpoints |
| TB-3 | AKS ↔ Azure Services | Pod-to-service traffic via private endpoints and managed identity |
| TB-4 | K8s Namespace Isolation | OSMO, training, inference workloads in separate namespaces |
| TB-5 | Operator Workstation ↔ Cluster | P2S VPN tunnel; kubectl via private API server |
| TB-6 | CI/CD ↔ Repository | GitHub Actions with OIDC federation; SHA-pinned actions |
| TB-7 | OSMO Control Plane ↔ Backend | gRPC between control plane and backend pods |
| TB-8 | Training Code ↔ Azure Services | Python SDK calls via DefaultAzureCredential |

Credential Delegation Model

Entra ID issues tokens to managed identities. AKS workload identity federation projects service account tokens into pods, and pods exchange those projected tokens with Entra ID for access to Azure resources. Key Vault stores secrets, and the CSI driver syncs them to Kubernetes Secrets. The chain: Entra ID → Managed Identities → Workload Identity Federation → Key Vault → K8s Secrets.
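At the end of this chain, a workload typically sees a Key Vault secret as a file inside the CSI volume. The following is a minimal sketch of reading such a file, assuming the secret is mounted as one file per secret name. The function name `read_mounted_secret`, the mount directory, and the secret name are hypothetical; the real mount path is whatever the pod spec declares.

```python
from pathlib import Path


def read_mounted_secret(mount_dir, name):
    """Return the contents of a secret file projected into the pod.

    mount_dir and name are hypothetical; the real path comes from the
    volumeMount in the pod spec.
    """
    # Reject names with path separators so callers cannot escape mount_dir.
    if Path(name).name != name:
        raise ValueError(f"invalid secret name: {name!r}")
    path = Path(mount_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"secret {name!r} not mounted under {mount_dir}")
    # Strip the trailing newline that often ends secret files.
    return path.read_text(encoding="utf-8").rstrip("\n")
```

Reading from the mounted file rather than an environment variable keeps the value out of process listings and out of any logging that dumps the environment.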

STRIDE Threat Registry

Spoofing

S-1: OSMO API Authentication Disabled

| Field | Value |
| --- | --- |
| Threat | OSMO API server deploys with auth.enabled: false, allowing unauthenticated gRPC calls |
| Affected Assets | OSMO control plane, backend pods |
| Trust Boundary | TB-7 |
| Likelihood | High |
| Impact | High |
| Risk Rating | High |
| Current Controls | Cluster-internal networking only; namespace isolation |
| Evidence | infrastructure/setup/values/osmo-control-plane-values.yaml sets auth.enabled: false |
| Status | Open |
| Remediation | Enable OSMO auth when vendor provides production-ready auth configuration |

S-2: PostgreSQL Shared Admin Identity

| Field | Value |
| --- | --- |
| Threat | PostgreSQL uses a single psqladmin identity for all OSMO database operations |
| Affected Assets | Azure Database for PostgreSQL, OSMO metadata |
| Trust Boundary | TB-3 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | VNet integration; private endpoint; Key Vault–stored credentials |
| Evidence | infrastructure/terraform/modules/sil/postgresql.tf configures single admin login |
| Status | Accepted |
| Rationale | Single-purpose database serving only OSMO; network isolation limits exposure |

Tampering

T-1: MEK Stored as ConfigMap

| Field | Value |
| --- | --- |
| Threat | Model Encryption Key (MEK) stored in a Kubernetes ConfigMap, bypassing etcd encryption at rest |
| Affected Assets | K8s ConfigMap, trained model artifacts |
| Trust Boundary | TB-4 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | RBAC-restricted namespace; cluster-internal access only |
| Evidence | OSMO deployment stores MEK in ConfigMap rather than K8s Secret |
| Status | Open |
| Remediation | Migrate MEK to Kubernetes Secret synced from Key Vault via CSI driver |

T-2: Terraform State Local Storage

| Field | Value |
| --- | --- |
| Threat | Terraform state file stored locally with plaintext secrets including storage keys and passwords |
| Affected Assets | terraform.tfstate on operator workstation |
| Trust Boundary | TB-5 |
| Likelihood | High |
| Impact | High |
| Risk Rating | Critical |
| Current Controls | .gitignore excludes *.tfstate; VPN-only access to workstation |
| Evidence | infrastructure/terraform/versions.tf has no remote backend configuration |
| Status | Open |
| Remediation | Configure Azure Storage remote backend with state encryption and locking |
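A CI check could flag the missing backend before it reaches an operator workstation. The sketch below is a heuristic, not a Terraform parser: it looks for a `backend "<type>" {` declaration in the text of a Terraform file after dropping line comments. The function name `find_backend` is hypothetical, and the comment stripping is deliberately crude.

```python
import re

# Matches a Terraform `backend "<type>" {` declaration at the start of a line.
BACKEND_RE = re.compile(r'^\s*backend\s+"(\w+)"\s*\{', re.MULTILINE)


def find_backend(terraform_source):
    """Return the backend type declared in the source, or None if absent."""
    # Drop # and // line comments so a commented-out block is not counted.
    stripped = re.sub(r"(#|//).*$", "", terraform_source, flags=re.MULTILINE)
    match = BACKEND_RE.search(stripped)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = 'terraform {\n  required_version = ">= 1.5"\n}\n'
    # A result of None means state falls back to the local backend.
    print(find_backend(sample))
```

A pipeline step could fail when `find_backend` returns `None` for the file that holds the `terraform` block, which turns the T-2 remediation into an enforced invariant rather than a convention.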

T-3: Inference Endpoint Allows Insecure Connections

| Field | Value |
| --- | --- |
| Threat | AzureML online endpoint configured with allowInsecureConnections: true |
| Affected Assets | AzureML managed online endpoint |
| Trust Boundary | TB-3 |
| Likelihood | Low |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Private endpoint restricts access to VNet; cluster-internal traffic only |
| Evidence | Inference deployment YAML allows insecure connections for internal scoring |
| Status | Accepted |
| Rationale | Traffic stays within private VNet; TLS adds latency to inference hot path |

Repudiation

R-1: Training Debug Logging Captures Credentials

| Field | Value |
| --- | --- |
| Threat | Training scripts log AZURE_* environment variables at debug verbosity, exposing tokens |
| Affected Assets | Training pod logs, Log Analytics workspace |
| Trust Boundary | TB-8 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Debug logging disabled by default; Log Analytics RBAC |
| Evidence | training/rl/utils/ modules include debug-level credential logging |
| Status | Open |
| Remediation | Sanitize or redact AZURE_* values before logging; enforce structured logging |
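One way to implement the redaction is to mask values before they reach the logger. This is a minimal sketch, not the code in training/rl/utils/; the names `redact_env`, `log_environment`, and the `REDACTED` placeholder are assumptions made for illustration.

```python
import logging

SENSITIVE_PREFIX = "AZURE_"
REDACTED = "***REDACTED***"


def redact_env(env):
    """Return a copy of env with every AZURE_* value masked."""
    return {
        key: (REDACTED if key.upper().startswith(SENSITIVE_PREFIX) else value)
        for key, value in env.items()
    }


def log_environment(logger, env):
    """Log environment variables at debug level with AZURE_* values masked."""
    for key, value in sorted(redact_env(env).items()):
        logger.debug("env %s=%s", key, value)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    demo = {"AZURE_CLIENT_SECRET": "example-secret", "PATH": "/usr/bin"}
    log_environment(logging.getLogger("training"), demo)
```

Masking by prefix is coarse: it also hides non-secret values such as a subscription ID. That trade-off favors safety, and an explicit allow-list of loggable keys would be a stricter alternative.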

Information Disclosure

I-1: Storage Access Key Fallback

| Field | Value |
| --- | --- |
| Threat | Storage account access keys used as fallback when managed identity auth fails |
| Affected Assets | Azure Storage Account, training datasets |
| Trust Boundary | TB-3 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Keys stored in Key Vault; private endpoint restricts network access |
| Evidence | Training scripts fall back to AZURE_STORAGE_KEY when DefaultAzureCredential is unavailable |
| Status | Accepted |
| Rationale | Fallback provides operational resilience; keys are Key Vault–managed with rotation capability |
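Because the fallback is an accepted risk, it helps to make each use of it visible. The sketch below shows one way to structure the decision so that every fallback emits a warning. It is not the training scripts' actual logic: `choose_storage_auth` is a hypothetical helper, and the caller is assumed to have already determined whether managed identity is available.

```python
import logging

logger = logging.getLogger("storage_auth")


def choose_storage_auth(identity_available, env):
    """Pick an auth mode, preferring managed identity over the access key."""
    if identity_available:
        return "managed_identity"
    if env.get("AZURE_STORAGE_KEY"):
        # Record the fallback without logging the key itself.
        logger.warning("managed identity unavailable; using access key fallback")
        return "access_key"
    raise RuntimeError("no storage credential available")
```

Alerting on this warning in Log Analytics would show how often the fallback actually fires, which is the evidence needed to revisit the Accepted status later.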

I-2: VPN Shared Keys in Local Terraform State

| Field | Value |
| --- | --- |
| Threat | VPN gateway shared secret stored in plaintext within local terraform.tfstate |
| Affected Assets | VPN gateway, operator VPN credentials |
| Trust Boundary | TB-5 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | .gitignore excludes state files; workstation access controls |
| Evidence | infrastructure/terraform/vpn/ stores VPN shared key as Terraform-managed resource |
| Status | Open |
| Remediation | Resolved by T-2 remediation (remote backend with state encryption) |

I-3: Redis Access Keys Alongside Private Endpoint

| Field | Value |
| --- | --- |
| Threat | Redis Enterprise exposes access keys even though connectivity uses private endpoints |
| Affected Assets | Azure Cache for Redis, OSMO session data |
| Trust Boundary | TB-3 |
| Likelihood | Low |
| Impact | Low |
| Risk Rating | Low |
| Current Controls | Private endpoint; Key Vault–stored keys; namespace-scoped RBAC |
| Evidence | Redis module outputs access keys to Terraform state |
| Status | Accepted |
| Rationale | Private endpoint eliminates network-level exposure; key rotation available via Key Vault |

I-4: MLflow Temp Config World-Readable

| Field | Value |
| --- | --- |
| Threat | MLflow writes tracking configuration to /tmp with world-readable permissions |
| Affected Assets | MLflow tracking URI, experiment metadata |
| Trust Boundary | TB-8 |
| Likelihood | Low |
| Impact | Low |
| Risk Rating | Low |
| Current Controls | Pod-level isolation; no credentials in tracking config |
| Evidence | MLflow integration code writes to /tmp/mlflow-config |
| Status | Accepted |
| Rationale | No secrets in config file; pod filesystem isolation limits access to same-pod processes |
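Although this risk is accepted, tightening the file mode is cheap if the config ever starts carrying sensitive values. The following sketch creates a file with owner-only permissions on a POSIX system; `write_private_file` is a hypothetical helper and not part of the MLflow integration code.

```python
import os


def write_private_file(path, content):
    """Create path with owner-only permissions (0600) and write content."""
    # O_EXCL makes creation fail if the path already exists, which avoids
    # writing through a file or symlink pre-created by another process.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    # The umask can only clear bits, so the result is never more
    # permissive than 0600.
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
```

The function raises `FileExistsError` when the path is already present, so a caller that needs to overwrite must remove the old file first and decide explicitly whether that is safe.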

I-5: Training Environment Variable Debug Logging

| Field | Value |
| --- | --- |
| Threat | Training utility modules log environment variables containing Azure credentials at debug level |
| Affected Assets | Pod logs, Log Analytics workspace |
| Trust Boundary | TB-8 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Debug logging off by default; RBAC on Log Analytics |
| Evidence | training/rl/utils/env.py logs AZURE_* values at debug verbosity |
| Status | Open |
| Remediation | Same as R-1; sanitize credential values before logging |

Denial of Service

D-1: Zero NetworkPolicy Manifests

| Field | Value |
| --- | --- |
| Threat | No Kubernetes NetworkPolicy resources deployed; all pod-to-pod traffic unrestricted |
| Affected Assets | AKS cluster, all workload namespaces |
| Trust Boundary | TB-4 |
| Likelihood | High |
| Impact | High |
| Risk Rating | High |
| Current Controls | Azure CNI network plugin supports NetworkPolicy; namespaces provide logical separation |
| Evidence | No NetworkPolicy resources in infrastructure/setup/manifests/ |
| Status | Open |
| Remediation | Define deny-all default policies per namespace; allow-list required traffic flows |
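A deny-all default is a NetworkPolicy with an empty pod selector and no ingress or egress rules. The sketch below generates one per namespace as JSON, which kubectl accepts alongside YAML. The namespace names are assumptions based on the workload split described under TB-4, and `default_deny_policy` is an illustrative helper.

```python
import json

# Hypothetical namespace names; substitute the ones the cluster actually uses.
NAMESPACES = ("osmo", "training", "inference")


def default_deny_policy(namespace):
    """Build a NetworkPolicy that denies all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing both types with no rules blocks traffic in both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    manifest = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [default_deny_policy(ns) for ns in NAMESPACES],
    }
    print(json.dumps(manifest, indent=2))
```

Denying egress also blocks DNS lookups, so the allow-list that follows usually needs a rule for cluster DNS before any workload-specific flows are added.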

D-2: Single Shared NSG With Zero Custom Rules

| Field | Value |
| --- | --- |
| Threat | One NSG applied to all subnets with no custom inbound/outbound rules beyond Azure defaults |
| Affected Assets | VNet subnets, all networked resources |
| Trust Boundary | TB-2 |
| Likelihood | Low |
| Impact | Medium |
| Risk Rating | Low |
| Current Controls | Private endpoints eliminate public attack surface; VPN-only ingress |
| Evidence | infrastructure/terraform/modules/platform/networking.tf defines NSG with no custom rules |
| Status | Accepted |
| Rationale | Private endpoints and VPN remove public exposure; custom rules add value after traffic audit |

D-3: NAT Gateway No Egress Filtering

| Field | Value |
| --- | --- |
| Threat | NAT Gateway allows unrestricted egress from AKS nodes to the internet |
| Affected Assets | AKS nodes, container images, external APIs |
| Trust Boundary | TB-2 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | NSG default rules; container image pull from ACR via private endpoint |
| Evidence | NAT Gateway configured without Azure Firewall or FQDN filtering |
| Status | Open |
| Remediation | Add Azure Firewall or NSG egress rules restricting outbound to required FQDNs |
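Before writing firewall rules, observed egress destinations can be compared against a candidate allow-list. The sketch below shows the matching logic only, with exact names and `*.` wildcard suffixes. The `ALLOWED` entries are illustrative assumptions rather than the toolchain's audited list, and `is_allowed` is a hypothetical helper.

```python
# Illustrative allow-list; the real list should come from a traffic audit.
ALLOWED = (
    "mcr.microsoft.com",
    "login.microsoftonline.com",
    "management.azure.com",
    "*.azurecr.io",
)


def is_allowed(host, allowed=ALLOWED):
    """Return True if host matches an exact entry or a '*.' wildcard suffix."""
    host = host.lower().rstrip(".")
    for pattern in allowed:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # "*.example.com" matches subdomains but not "example.com" itself.
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False
```

Running flow-log hostnames through a function like this gives a list of destinations that a strict rule set would block, which is a safer starting point than enabling filtering and waiting for failures.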

D-4: OSMO API Rate Limiting and Proxy Disabled

| Field | Value |
| --- | --- |
| Threat | OSMO API deploys with rateLimit.enabled: false and envoy.enabled: false |
| Affected Assets | OSMO control plane API |
| Trust Boundary | TB-7 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Cluster-internal access only; namespace isolation |
| Evidence | OSMO Helm values disable rate limiting and Envoy sidecar proxy |
| Status | Open |
| Remediation | Enable Envoy proxy and rate limiting when OSMO vendor provides stable configuration |
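Until the vendor configuration is available, a lint step can at least keep these settings from going unnoticed. The sketch below checks an already-parsed Helm values mapping for the three flags named in S-1 and D-4. It assumes the flags sit at the top level of the values file exactly as written (`auth.enabled`, `rateLimit.enabled`, `envoy.enabled`); the real nesting in the OSMO chart may differ, and both helper names are hypothetical.

```python
# Dotted paths taken from S-1 and D-4; their nesting is an assumption.
REQUIRED_TRUE = ("auth.enabled", "rateLimit.enabled", "envoy.enabled")


def get_path(values, dotted):
    """Walk a nested dict by dotted path; return None if any part is missing."""
    node = values
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def insecure_settings(values):
    """List the required flags that are missing or not set to True."""
    return [path for path in REQUIRED_TRUE if get_path(values, path) is not True]


if __name__ == "__main__":
    current = {
        "auth": {"enabled": False},
        "rateLimit": {"enabled": False},
        "envoy": {"enabled": False},
    }
    print(insecure_settings(current))
```

Treating a missing key the same as `false` is intentional: a chart default that silently disables a control should surface in the same report.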

Elevation of Privilege

E-1: Automation Account Contributor Role

| Field | Value |
| --- | --- |
| Threat | Automation Account assigned Contributor role at resource group scope |
| Affected Assets | Azure Automation Account, resource group resources |
| Trust Boundary | TB-1 |
| Likelihood | Low |
| Impact | Medium |
| Risk Rating | Low |
| Current Controls | Automation runs scheduled maintenance tasks only; no external triggers |
| Evidence | infrastructure/terraform/modules/platform/automation.tf assigns Contributor role |
| Status | Open |
| Remediation | Define custom RBAC role scoped to specific maintenance operations |

E-2: OSMO Service Token One-Year Expiry Without Rotation

| Field | Value |
| --- | --- |
| Threat | OSMO service token issued with one-year expiry and no automated rotation mechanism |
| Affected Assets | OSMO service authentication, cluster workloads |
| Trust Boundary | TB-7 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | Token stored in Key Vault; cluster-internal access |
| Evidence | OSMO deployment scripts create long-lived service tokens |
| Status | Open |
| Remediation | Implement token rotation via Key Vault rotation policy or OSMO vendor short-lived token support |
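Whatever mechanism performs the rotation, it needs a rule for when a token is due. The sketch below shows that check in isolation. The 90-day maximum age is an assumption chosen for illustration, not a value from the toolchain, and `rotation_due` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumed rotation interval; pick a value that fits the deployment's policy.
MAX_TOKEN_AGE = timedelta(days=90)


def rotation_due(issued_at, now, max_age=MAX_TOKEN_AGE):
    """Return True once the token has reached max_age."""
    return now - issued_at >= max_age


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(rotation_due(issued, datetime.now(timezone.utc)))
```

Passing `now` explicitly keeps the function deterministic and easy to test, and using timezone-aware datetimes avoids comparing local and UTC timestamps by mistake.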

E-3: User Provisioning Grants Excessive Admin Roles

| Field | Value |
| --- | --- |
| Threat | add-user-to-platform.sh assigns 9+ admin-level RBAC roles to each onboarded user |
| Affected Assets | Azure RBAC, onboarded user identities |
| Trust Boundary | TB-1 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | Script requires manual execution by a privileged operator |
| Evidence | infrastructure/setup/optional/add-user-to-platform.sh assigns broad role set |
| Status | Open |
| Remediation | Define tiered role profiles (reader, contributor, admin); assign minimum required roles |
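Tiered profiles can be expressed as data, with the provisioning step reduced to a lookup. The sketch below maps each tier to a set of Azure built-in role names and builds the corresponding `az role assignment create` argument lists without running them. The role selection per tier is an illustrative assumption and does not reflect what add-user-to-platform.sh assigns today; `role_assignment_commands` is a hypothetical helper.

```python
# Illustrative tier-to-role mapping using Azure built-in role names.
ROLE_TIERS = {
    "reader": ("Reader",),
    "contributor": (
        "Reader",
        "Azure Kubernetes Service Cluster User Role",
        "AzureML Data Scientist",
        "Storage Blob Data Contributor",
    ),
    "admin": (
        "Reader",
        "Azure Kubernetes Service RBAC Cluster Admin",
        "AzureML Data Scientist",
        "Storage Blob Data Contributor",
        "Key Vault Secrets Officer",
    ),
}


def role_assignment_commands(tier, assignee, scope):
    """Build one az CLI argument list per role in the requested tier."""
    if tier not in ROLE_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return [
        ["az", "role", "assignment", "create",
         "--assignee", assignee, "--role", role, "--scope", scope]
        for role in ROLE_TIERS[tier]
    ]
```

Returning argument lists instead of executing them lets an operator review the exact assignments before anything changes, and keeps the tier definitions reviewable in version control.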

E-4: GitHub App Token Elevated Repository Permissions

| Field | Value |
| --- | --- |
| Threat | GitHub App token used in workflows has elevated repository permissions beyond immediate need |
| Affected Assets | GitHub Actions workflows, repository contents |
| Trust Boundary | TB-6 |
| Likelihood | Low |
| Impact | Low |
| Risk Rating | Low |
| Current Controls | SHA-pinned actions; OIDC federation; branch protection rules |
| Evidence | Workflow files request contents: write and other elevated permissions |
| Status | Accepted |
| Rationale | Permissions required for release-please and dependency review workflows; scoped to repository |

Assurance Argument

Goal Structuring Notation (GSN) elements supporting the security posture claim.

| Element | Statement |
| --- | --- |
| G0 | The architecture provides adequate security controls for an IaC reference architecture |
| G1 | Authentication uses managed identities and workload federation, eliminating password-based access |
| G2 | Secrets are stored in Azure Key Vault with RBAC authorization and synced via CSI driver |
| G3 | Network access is restricted to private endpoints, VPN, and NSG-controlled subnets |
| G4 | Supply chain integrity is maintained through SHA-pinned actions and dependency review |
| E1 | 19 STRIDE threats identified; 7 Accepted with compensating controls, 12 Open with remediation roadmap |
| E2 | OpenSSF Passing ~85%; 25 Silver criteria assessed (5 Met, 5 Delegated, 13 N/A, 1 Gap) |
| A1 | Deployer follows docs/operations/security-guide.md hardening checklist |
| A2 | OSMO vendor provides auth/rate-limiting enablement path in future releases |

Remediation Roadmap

| Priority | Item | Threats Addressed | Effort | Key Dependency |
| --- | --- | --- | --- | --- |
| 1 | Terraform Remote Backend | T-2, I-2 | Low-Medium | Storage account |
| 2 | Automation Least Privilege | E-1 | Low | Custom role definition |
| 3 | MEK Migration (ConfigMap→Secret) | T-1 | Medium | OSMO vendor verification |
| 4 | NetworkPolicy Manifests | D-1 | Medium | Traffic audit |
| 5 | NSG Rules | D-2 | Medium-High | NSG Flow Logs observation |

Security Metrics

| Metric | Current | Target |
| --- | --- | --- |
| OpenSSF Passing badge | ~85% | 100% |
| OpenSSF Silver badge | ~30% | 80% |
| SHA-pinned actions | 95% | 100% |
| STRIDE threats mitigated | 7/19 (37%) | 15/19 (79%) |
| Critical threats open | 1 | 0 |
| High threats open | 6 | 2 |
