STRIDE-based threat analysis of the Physical AI Toolchain covering infrastructure-as-code components, trust boundaries, and a prioritized remediation roadmap.
Executive Summary
This threat model applies the STRIDE framework to the Physical AI Toolchain. The architecture deploys AKS clusters with GPU node pools, Azure Machine Learning, and NVIDIA OSMO for robotics training and inference workloads. All components are infrastructure-as-code artifacts; no hosted service or user-facing application exists.
| Area | Status | Evidence |
| --- | --- | --- |
| Authentication | Managed identities + workload identity | No password-based auth; DefaultAzureCredential |
| Secret Management | Azure Key Vault (RBAC) + CSI driver | Secrets synced to K8s pods at mount time |
| Network Isolation | Private endpoints + VPN-only access | All Azure services behind VNet; no public IPs |
| Encryption | TLS 1.2+ enforced by Azure | Platform-managed keys for data at rest |
| Supply Chain | 95% SHA-pinned GitHub Actions | Dependency review blocks moderate+ vulnerabilities |
Risk summary: 19 threats identified — 1 Critical, 6 High, 7 Medium, 5 Low. Key open risks: T-2 (Critical), S-1 (High), T-1 (High).
System Description
Architecture Components
| Category | Component | Details |
| --- | --- | --- |
| Compute | AKS Cluster | Private cluster, CNI networking, GPU node pools (Standard_NC-series) |
| Data & Storage | Azure Storage Account | Blob containers for datasets, checkpoints; private endpoint access |
| Data & Storage | Azure Database for PostgreSQL | Flexible server for OSMO metadata; VNet-integrated |
| Data & Storage | Azure Cache for Redis | Enterprise tier; OSMO session state; private endpoint |
| ML & AI | Azure Machine Learning | Workspace with managed endpoints; K8s compute attach |
| Identity | Entra ID + Managed Identities | System-assigned for AKS, user-assigned for workloads |
| Networking | VNet + NSG + NAT Gateway + VPN | Hub-spoke implied; P2S VPN for operator access |
| Observability | Azure Monitor + Log Analytics | Container Insights, Prometheus metrics, AMPLS for private ingestion |
| Security | Azure Key Vault | RBAC-mode; CSI Secret Store driver syncs secrets to pods |
| NVIDIA/OSMO | OSMO Control Plane + Backend | Orchestrates distributed training; Envoy proxy optional |
Data Flows
Training data flows from Azure Blob Storage through AKS pods to GPU compute. Checkpoints and metrics flow back to storage and MLflow tracking. OSMO coordinates multi-node training via its control plane and PostgreSQL metadata store. All Azure service traffic uses private endpoints, so this data does not traverse the public internet; general outbound traffic from AKS nodes still egresses through the NAT Gateway (see D-3).
Operator access traverses a P2S VPN gateway to the AKS API server private endpoint. CI/CD pipelines authenticate via GitHub OIDC federation to Entra ID managed identities. Terraform state resides locally on the operator workstation (see T-2 for associated risk).
Security Inheritance
| Control | Provider | Configuration Surface |
| --- | --- | --- |
| TLS termination | Azure platform | Enforced by default |
| Disk encryption at rest | Azure platform | Platform-managed keys (PMK) |
| Identity federation | Entra ID | Workload identity via OIDC |
| Network segmentation | Azure VNet + NSG | Subnets, private endpoints |
| Secret rotation | Azure Key Vault | Deployer responsibility |
| Cluster patch management | AKS managed upgrades | Deployer selects upgrade policy |
Trust Boundaries
| ID | Boundary | Description |
| --- | --- | --- |
| TB-1 | Azure Control Plane ↔ Data Plane | ARM API calls cross into subscription data plane |
| TB-2 | VNet Perimeter ↔ Internet | NAT Gateway egress; VPN ingress; no public endpoints |
| TB-3 | AKS ↔ Azure Services | Pod-to-service traffic via private endpoints and managed identity |
| TB-4 | K8s Namespace Isolation | OSMO, training, inference workloads in separate namespaces |
| TB-5 | Operator Workstation ↔ Cluster | P2S VPN tunnel; kubectl via private API server |
| TB-6 | CI/CD ↔ Repository | GitHub Actions with OIDC federation; SHA-pinned actions |
| TB-7 | OSMO Control Plane ↔ Backend | gRPC between control plane and backend pods |
| TB-8 | Training Code ↔ Azure Services | Python SDK calls via DefaultAzureCredential |
Credential Delegation Model
Entra ID issues tokens to managed identities. AKS workload identity federation projects service account tokens to pods. Pods exchange projected tokens for Azure resource access. Key Vault stores secrets and syncs them to Kubernetes Secrets via the CSI driver. The chain: Entra ID → Managed Identities → Workload Identity Federation → Key Vault → K8s Secrets.
STRIDE Threat Registry
Spoofing
S-1: OSMO API Authentication Disabled
| Field | Value |
| --- | --- |
| Threat | OSMO API server deploys with auth.enabled: false, allowing unauthenticated gRPC calls |
| Affected Assets | OSMO control plane, backend pods |
| Trust Boundary | TB-7 |
| Likelihood | High |
| Impact | High |
| Risk Rating | High |
| Current Controls | Cluster-internal networking only; namespace isolation |
| Evidence | infrastructure/setup/values/osmo-control-plane-values.yaml sets auth.enabled: false |
| Status | Open |
| Remediation | Enable OSMO auth when vendor provides production-ready auth configuration |
S-2: PostgreSQL Shared Admin Identity
| Field | Value |
| --- | --- |
| Threat | PostgreSQL uses a single psqladmin identity for all OSMO database operations |
| Affected Assets | Azure Database for PostgreSQL, OSMO metadata |
| Trust Boundary | TB-3 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | VNet integration; private endpoint; Key Vault–stored credentials |
| Evidence | infrastructure/terraform/modules/sil/postgresql.tf configures single admin login |
| Status | Accepted |
| Rationale | Single-purpose database serving only OSMO; network isolation limits exposure |
Tampering
T-1: MEK Stored as ConfigMap
| Field | Value |
| --- | --- |
| Threat | Model Encryption Key (MEK) stored in a Kubernetes ConfigMap, bypassing etcd encryption at rest |
| Affected Assets | K8s ConfigMap, trained model artifacts |
| Trust Boundary | TB-4 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | RBAC-restricted namespace; cluster-internal access only |
| Evidence | OSMO deployment stores MEK in ConfigMap rather than K8s Secret |
| Status | Open |
| Remediation | Migrate MEK to Kubernetes Secret synced from Key Vault via CSI driver |
T-2: Terraform State Stored Locally With Plaintext Secrets
| Field | Value |
| --- | --- |
| Threat | Terraform state file stored locally with plaintext secrets including storage keys and passwords |
| Affected Assets | terraform.tfstate on operator workstation |
| Trust Boundary | TB-5 |
| Likelihood | High |
| Impact | High |
| Risk Rating | Critical |
| Current Controls | .gitignore excludes *.tfstate; VPN-only access to workstation |
| Evidence | infrastructure/terraform/versions.tf has no remote backend configuration |
| Status | Open |
| Remediation | Configure Azure Storage remote backend with state encryption and locking |
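As an interim check before the remote backend is configured, a script can list which attributes in the local state look like secrets. The sketch below assumes the version 4 state JSON layout (`resources[].instances[].attributes`); the hint list and the function names are illustrative assumptions and not part of the toolchain.

```python
from __future__ import annotations

import json
from typing import Any, Iterator

# Substrings that often mark secret-bearing attributes (assumed list; tune per project).
SENSITIVE_HINTS = ("password", "secret", "access_key", "shared_key", "connection_string")


def find_sensitive_attributes(state: dict[str, Any]) -> Iterator[tuple[str, str]]:
    """Yield (resource address, attribute name) for non-empty, sensitive-looking attributes."""
    for resource in state.get("resources", []):
        address = f"{resource.get('type', '?')}.{resource.get('name', '?')}"
        for instance in resource.get("instances", []):
            for attr, value in (instance.get("attributes") or {}).items():
                if value and any(hint in attr.lower() for hint in SENSITIVE_HINTS):
                    yield address, attr


def scan_state_file(path: str) -> list[tuple[str, str]]:
    """Load a state file and list suspicious attributes without returning their values."""
    with open(path, encoding="utf-8") as handle:
        return list(find_sensitive_attributes(json.load(handle)))
```

Because only resource addresses and attribute names are returned, the result of `scan_state_file("terraform.tfstate")` can be shared in a ticket without leaking the secrets themselves.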
T-3: Inference Endpoint Allows Insecure Connections
| Field | Value |
| --- | --- |
| Threat | AzureML online endpoint configured with allowInsecureConnections: true |
| Affected Assets | AzureML managed online endpoint |
| Trust Boundary | TB-3 |
| Likelihood | Low |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Private endpoint restricts access to VNet; cluster-internal traffic only |
| Evidence | Inference deployment YAML allows insecure connections for internal scoring |
| Status | Accepted |
| Rationale | Traffic stays within private VNet; TLS adds latency to inference hot path |
Repudiation
R-1: Training Debug Logging Captures Credentials
| Field | Value |
| --- | --- |
| Threat | Training scripts log AZURE_* environment variables at debug verbosity, exposing tokens |
| Affected Assets | Training pod logs, Log Analytics workspace |
| Trust Boundary | TB-8 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Debug logging disabled by default; Log Analytics RBAC |
| Evidence | training/rl/utils/ modules include debug-level credential logging |
| Status | Open |
| Remediation | Sanitize or redact AZURE_* values before logging; enforce structured logging |
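The redaction step named in the remediation could look roughly like the following. This is a sketch, not the code in training/rl/utils/; the `redact_env` and `log_environment` names, the logger name, and the mask string are assumptions made for illustration.

```python
import logging
import os
from typing import Dict, Mapping

REDACTED = "***REDACTED***"  # assumed mask string

logger = logging.getLogger("training.env")  # hypothetical logger name


def redact_env(env: Mapping[str, str], prefix: str = "AZURE_") -> Dict[str, str]:
    """Return a copy of env with the values of prefix-matching variables masked."""
    return {name: (REDACTED if name.startswith(prefix) else value)
            for name, value in env.items()}


def log_environment(env: Mapping[str, str] = os.environ) -> None:
    """Log variable names with masked values, never raw AZURE_* contents."""
    for name, value in sorted(redact_env(env).items()):
        logger.debug("env %s=%s", name, value)
```

Masking by prefix keeps the variable names visible for troubleshooting while ensuring that token and secret values never reach pod logs or the Log Analytics workspace.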
Information Disclosure
I-1: Storage Access Key Fallback
| Field | Value |
| --- | --- |
| Threat | Storage account access keys used as fallback when managed identity auth fails |
| Affected Assets | Azure Storage Account, training datasets |
| Trust Boundary | TB-3 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Keys stored in Key Vault; private endpoint restricts network access |
| Evidence | Training scripts fall back to AZURE_STORAGE_KEY when DefaultAzureCredential is unavailable |
| Status | Accepted |
| Rationale | Fallback provides operational resilience; keys are Key Vault–managed with rotation capability |
I-2: VPN Shared Secret in Local Terraform State
| Field | Value |
| --- | --- |
| Threat | VPN gateway shared secret stored in plaintext within local terraform.tfstate |
| Affected Assets | VPN gateway, operator VPN credentials |
| Trust Boundary | TB-5 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | .gitignore excludes state files; workstation access controls |
| Evidence | infrastructure/terraform/vpn/ stores VPN shared key as Terraform-managed resource |
| Status | Open |
| Remediation | Resolved by T-2 remediation (remote backend with state encryption) |
I-3: Redis Access Keys Alongside Private Endpoint
| Field | Value |
| --- | --- |
| Threat | Redis Enterprise exposes access keys even though connectivity uses private endpoints |
| Affected Assets | Azure Cache for Redis, OSMO session data |
| Trust Boundary | TB-3 |
| Likelihood | Low |
| Impact | Low |
| Risk Rating | Low |
| Current Controls | Private endpoint; Key Vault–stored keys; namespace-scoped RBAC |
| Evidence | Redis module outputs access keys to Terraform state |
| Status | Accepted |
| Rationale | Private endpoint eliminates network-level exposure; key rotation available via Key Vault |
I-4: MLflow Temp Config World-Readable
| Field | Value |
| --- | --- |
| Threat | MLflow writes tracking configuration to /tmp with world-readable permissions |
| Affected Assets | MLflow tracking URI, experiment metadata |
| Trust Boundary | TB-8 |
| Likelihood | Low |
| Impact | Low |
| Risk Rating | Low |
| Current Controls | Pod-level isolation; no credentials in tracking config |
| Evidence | MLflow integration code writes to /tmp/mlflow-config |
| Status | Accepted |
| Rationale | No secrets in config file; pod filesystem isolation limits access to same-pod processes |
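Should this accepted risk be revisited, one low-effort hardening is to create the file with owner-only permissions and an unpredictable name. The sketch below uses the standard library's `tempfile.mkstemp`; the `write_private_config` helper and the file-name prefix are hypothetical and do not reflect the existing MLflow integration code.

```python
import json
import os
import tempfile
from typing import Optional


def write_private_config(config: dict, directory: Optional[str] = None) -> str:
    """Write config as JSON to a new temp file readable only by its owner; return the path."""
    # mkstemp creates the file with mode 0o600 and a unique name, which avoids
    # both world-readable permissions and predictable-path races in /tmp.
    fd, path = tempfile.mkstemp(prefix="mlflow-config-", suffix=".json", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(config, handle)
    return path
```

The caller is responsible for removing the file once the training run no longer needs it.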
I-5: Training Environment Variable Debug Logging
| Field | Value |
| --- | --- |
| Threat | Training utility modules log environment variables containing Azure credentials at debug level |
| Affected Assets | Pod logs, Log Analytics workspace |
| Trust Boundary | TB-8 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Debug logging off by default; RBAC on Log Analytics |
| Evidence | training/rl/utils/env.py logs AZURE_* values at debug verbosity |
| Status | Open |
| Remediation | Same as R-1; sanitize credential values before logging |
Denial of Service
D-1: Zero NetworkPolicy Manifests
| Field | Value |
| --- | --- |
| Threat | No Kubernetes NetworkPolicy resources deployed; all pod-to-pod traffic unrestricted |
| Affected Assets | AKS cluster, all workload namespaces |
| Trust Boundary | TB-4 |
| Likelihood | High |
| Impact | High |
| Risk Rating | High |
| Current Controls | Azure CNI network plugin supports NetworkPolicy; namespaces provide logical separation |
| Evidence | No NetworkPolicy resources in infrastructure/setup/manifests/ |
| Status | Open |
| Remediation | Define deny-all default policies per namespace; allow-list required traffic flows |
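A default-deny policy per namespace can be generated rather than hand-written. The following sketch builds standard `networking.k8s.io/v1` NetworkPolicy objects as JSON, which kubectl accepts alongside YAML; the namespace names and the `default_deny_policy` helper are assumptions, since the actual namespace names are not listed here.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # an empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so both directions are denied
        },
    }


# Namespace names are assumptions; substitute the ones the cluster actually uses.
NAMESPACES = ["osmo", "training", "inference"]

if __name__ == "__main__":
    manifest = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [default_deny_policy(name) for name in NAMESPACES],
    }
    print(json.dumps(manifest, indent=2))  # can be piped to `kubectl apply -f -`
```

Denying egress also blocks DNS lookups, so allow rules for cluster DNS and for each required flow need to be applied together with the deny-all policy.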
D-2: Single Shared NSG With Zero Custom Rules
| Field | Value |
| --- | --- |
| Threat | One NSG applied to all subnets with no custom inbound/outbound rules beyond Azure defaults |
| Affected Assets | VNet subnets, all networked resources |
| Trust Boundary | TB-2 |
| Likelihood | Low |
| Impact | Medium |
| Risk Rating | Low |
| Current Controls | Private endpoints eliminate public attack surface; VPN-only ingress |
| Evidence | infrastructure/terraform/modules/platform/networking.tf defines NSG with no custom rules |
| Status | Accepted |
| Rationale | Private endpoints and VPN remove public exposure; custom rules add value after traffic audit |
D-3: NAT Gateway No Egress Filtering
| Field | Value |
| --- | --- |
| Threat | NAT Gateway allows unrestricted egress from AKS nodes to the internet |
| Affected Assets | AKS nodes, container images, external APIs |
| Trust Boundary | TB-2 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | NSG default rules; container image pull from ACR via private endpoint |
| Evidence | NAT Gateway configured without Azure Firewall or FQDN filtering |
| Status | Open |
| Remediation | Add Azure Firewall or NSG egress rules restricting outbound to required FQDNs |
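Before enforcing FQDN filtering, observed outbound destinations can be checked against a candidate allow-list to see what would break. The sketch below is one way to do that offline; the allow-list entries and the `egress_allowed` helper are hypothetical, and the real list should come from an egress traffic audit.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical allow-list for illustration only.
ALLOWED_FQDNS = ["mcr.microsoft.com", "*.data.mcr.microsoft.com", "nvcr.io", "pypi.org"]


def egress_allowed(host: str, allowed: Iterable[str] = ALLOWED_FQDNS) -> bool:
    """Return True if host matches an allow-list entry (shell-style wildcards permitted)."""
    normalized = host.lower().rstrip(".")  # DNS names are case-insensitive; drop a trailing root dot
    return any(fnmatchcase(normalized, pattern) for pattern in allowed)


def blocked_hosts(observed: Iterable[str]) -> List[str]:
    """List observed destinations that the candidate allow-list would block."""
    return sorted({host for host in observed if not egress_allowed(host)})
```

Running `blocked_hosts` over destinations exported from flow or DNS logs gives a review list of endpoints to either add to the allow-list or investigate.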
D-4: OSMO API Rate Limiting and Proxy Disabled
| Field | Value |
| --- | --- |
| Threat | OSMO API deploys with rateLimit.enabled: false and envoy.enabled: false |
| Affected Assets | OSMO control plane API |
| Trust Boundary | TB-7 |
| Likelihood | Medium |
| Impact | Medium |
| Risk Rating | Medium |
| Current Controls | Cluster-internal access only; namespace isolation |
| Evidence | OSMO Helm values disable rate limiting and Envoy sidecar proxy |
| Status | Open |
| Remediation | Enable Envoy proxy and rate limiting when OSMO vendor provides stable configuration |
Elevation of Privilege
E-1: Automation Account Contributor Role
| Field | Value |
| --- | --- |
| Threat | Automation Account assigned Contributor role at resource group scope |
| Affected Assets | Azure Automation Account, resource group resources |
| Trust Boundary | TB-1 |
| Likelihood | Low |
| Impact | Medium |
| Risk Rating | Low |
| Current Controls | Automation runs scheduled maintenance tasks only; no external triggers |
| Evidence | infrastructure/terraform/modules/platform/automation.tf assigns Contributor role |
| Status | Open |
| Remediation | Define custom RBAC role scoped to specific maintenance operations |
E-2: OSMO Service Token One-Year Expiry Without Rotation
| Field | Value |
| --- | --- |
| Threat | OSMO service token issued with one-year expiry and no automated rotation mechanism |
| Affected Assets | OSMO service authentication, cluster workloads |
| Trust Boundary | TB-7 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | Token stored in Key Vault; cluster-internal access |
| Evidence | OSMO deployment scripts create long-lived service tokens |
| Status | Open |
| Remediation | Implement token rotation via Key Vault rotation policy or OSMO vendor short-lived token support |
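Until automated rotation exists, a scheduled job could flag the token for rotation well before it expires. The sketch below is a generic check that assumes the issue time and lifetime are known (for example, recorded when the token is created); `rotation_due` and the 80% threshold are hypothetical choices, not OSMO or Key Vault behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rotation_due(
    issued_at: datetime,
    lifetime: timedelta,
    now: Optional[datetime] = None,
    rotate_at_fraction: float = 0.8,
) -> bool:
    """Return True once the token has consumed the given fraction of its lifetime."""
    current = now or datetime.now(timezone.utc)
    return current >= issued_at + lifetime * rotate_at_fraction


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)  # example issue date
    print(rotation_due(issued, timedelta(days=365)))
```

With a one-year lifetime, the default threshold flags the token after 292 days, leaving roughly ten weeks to rotate it before expiry.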
E-3: User Provisioning Grants Excessive Admin Roles
| Field | Value |
| --- | --- |
| Threat | add-user-to-platform.sh assigns 9+ admin-level RBAC roles to each onboarded user |
| Affected Assets | Azure RBAC, onboarded user identities |
| Trust Boundary | TB-1 |
| Likelihood | Medium |
| Impact | High |
| Risk Rating | High |
| Current Controls | Script requires manual execution by a privileged operator |
| Evidence | infrastructure/setup/optional/add-user-to-platform.sh assigns broad role set |
| Status | Open |
| Remediation | Define tiered role profiles (reader, contributor, admin); assign minimum required roles |
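The tiered profiles in the remediation can be expressed as a simple lookup that drives role assignment. In the sketch below the role names are Azure built-in roles, but which roles belong in each tier is an assumption made for illustration, as are the helper names; the real tiers should be derived from what each persona needs.

```python
from typing import Dict, List

# Tier contents are illustrative assumptions, not the script's current role set.
ROLE_PROFILES: Dict[str, List[str]] = {
    "reader": ["Reader", "Storage Blob Data Reader"],
    "contributor": [
        "Reader",
        "Storage Blob Data Contributor",
        "AzureML Data Scientist",
        "Azure Kubernetes Service Cluster User Role",
    ],
    "admin": [
        "Contributor",
        "Key Vault Secrets Officer",
        "Azure Kubernetes Service RBAC Cluster Admin",
    ],
}


def roles_for(tier: str) -> List[str]:
    """Return the roles for a tier, rejecting unknown tiers instead of defaulting to broad access."""
    if tier not in ROLE_PROFILES:
        raise ValueError(f"unknown tier: {tier!r}")
    return list(ROLE_PROFILES[tier])


def az_commands(assignee: str, tier: str, scope: str) -> List[List[str]]:
    """Build one `az role assignment create` argument list per role in the tier."""
    return [
        ["az", "role", "assignment", "create",
         "--assignee", assignee, "--role", role, "--scope", scope]
        for role in roles_for(tier)
    ]
```

Returning argument lists rather than executing them lets an operator review the exact assignments before running them, which preserves the manual-execution control already in place.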
E-4: GitHub App Token Elevated Repository Permissions
| Field | Value |
| --- | --- |
| Threat | GitHub App token used in workflows has elevated repository permissions beyond immediate need |
| Affected Assets | GitHub Actions workflows, repository contents |
| Trust Boundary | TB-6 |
| Likelihood | Low |
| Impact | Low |
| Risk Rating | Low |
| Current Controls | SHA-pinned actions; OIDC federation; branch protection rules |
| Evidence | Workflow files request contents: write and other elevated permissions |
| Status | Accepted |
| Rationale | Permissions required for release-please and dependency review workflows; scoped to repository |
Assurance Argument
Goal Structuring Notation (GSN) elements supporting the security posture claim.
| Element | Statement |
| --- | --- |
| G0 | The architecture provides adequate security controls for an IaC reference architecture |
| G1 | Authentication uses managed identities and workload federation, eliminating password-based access |
| G2 | Secrets are stored in Azure Key Vault with RBAC authorization and synced via CSI driver |
| G3 | Network access is restricted to private endpoints, VPN, and NSG-controlled subnets |
| G4 | Supply chain integrity is maintained through SHA-pinned actions and dependency review |
| E1 | 19 STRIDE threats identified; 7 Accepted with compensating controls, 12 Open with remediation roadmap |
| E2 | OpenSSF Passing ~85%; 25 Silver criteria assessed (5 Met, 5 Delegated, 13 N/A, 1 Gap) |
| A1 | Deployer follows docs/operations/security-guide.md hardening checklist |
| A2 | OSMO vendor provides auth/rate-limiting enablement path in future releases |
Remediation Roadmap
| Priority | Item | Threats Addressed | Effort | Key Dependency |
| --- | --- | --- | --- | --- |
| 1 | Terraform Remote Backend | T-2, I-2 | Low-Medium | Storage account |
| 2 | Automation Least Privilege | E-1 | Low | Custom role definition |
| 3 | MEK Migration (ConfigMap→Secret) | T-1 | Medium | OSMO vendor verification |
| 4 | NetworkPolicy Manifests | D-1 | Medium | Traffic audit |
| 5 | NSG Rules | D-2 | Medium-High | NSG Flow Logs observation |
Security Metrics
| Metric | Current | Target |
| --- | --- | --- |
| OpenSSF Passing badge | ~85% | 100% |
| OpenSSF Silver badge | ~30% | 80% |
| SHA-pinned actions | 95% | 100% |
| STRIDE threats mitigated | 7/19 (37%) | 15/19 (79%) |
| Critical threats open | 1 | 0 |
| High threats open | 6 | 2 |
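The STRIDE-derived figures above can be recomputed from the threat registry, which helps keep the summary counts in step as statuses change. The sketch below transcribes each threat's risk rating and status from this document and treats Accepted threats as mitigated, matching the 7/19 figure; the variable names are illustrative.

```python
from collections import Counter

# (risk rating, status) per threat ID, transcribed from the STRIDE Threat Registry.
REGISTRY = {
    "S-1": ("High", "Open"),     "S-2": ("Medium", "Accepted"),
    "T-1": ("High", "Open"),     "T-2": ("Critical", "Open"),
    "T-3": ("Medium", "Accepted"), "R-1": ("Medium", "Open"),
    "I-1": ("Medium", "Accepted"), "I-2": ("Medium", "Open"),
    "I-3": ("Low", "Accepted"),  "I-4": ("Low", "Accepted"),
    "I-5": ("Medium", "Open"),   "D-1": ("High", "Open"),
    "D-2": ("Low", "Accepted"),  "D-3": ("High", "Open"),
    "D-4": ("Medium", "Open"),   "E-1": ("Low", "Open"),
    "E-2": ("High", "Open"),     "E-3": ("High", "Open"),
    "E-4": ("Low", "Accepted"),
}

ratings = Counter(rating for rating, _ in REGISTRY.values())
statuses = Counter(status for _, status in REGISTRY.values())
open_by_rating = Counter(rating for rating, status in REGISTRY.values() if status == "Open")
mitigated_pct = round(100 * statuses["Accepted"] / len(REGISTRY))

if __name__ == "__main__":
    print(dict(ratings), dict(statuses), dict(open_by_rating), mitigated_pct)
```

The tallies reproduce the summary figures: 19 threats (1 Critical, 6 High, 7 Medium, 5 Low), 7 Accepted and 12 Open, with the single Critical and all 6 High threats still open.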