Are your pods deployed to virtual nodes experiencing issues or high latency with outbound network calls?
cg) configured with a NAT Gateway?
As with any Kubernetes container, a useful first diagnostic step is to check the pod’s events (for example, via kubectl describe). For confidential containers, however, the errors are often not immediately understandable.
For example, you might see an event like this:
failed to create containerd task: failed to create shim task: failed to create container 18197025abfacf6365ef65d083687e1d9f03b9792779e15796247a7281043065: guest RPC failure: container creation denied due to policy: policyDecision< eyJkZWNpc2lvbiI6ImRlbnkiLCJyZWFzb24iOnsiZXJyb3JzIjpbImludmFsaWQgY29tbWFuZCJdfSwidHJ1bmNhdGVkIjpbImlucHV0Il19 >policyDecision: unknown
Although it is not obvious, the text between policyDecision< and >policyDecision is a base64-encoded JSON string that contains the underlying error.
Decoding the above example (any base64 decoder will do; PowerShell's built-in .NET conversion is used here for portability):
[System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String("eyJkZWNpc2lvbiI6ImRlbnkiLCJyZWFzb24iOnsiZXJyb3JzIjpbImludmFsaWQgY29tbWFuZCJdfSwidHJ1bmNhdGVkIjpbImlucHV0Il19"))
{"decision":"deny","reason":{"errors":["invalid command"]},"truncated":["input"]}
So, in this example, the issue was that the command specified for the container in its CCE Policy did not match the command actually being run.
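The extraction and decoding steps described above can also be scripted. The following is a minimal Python sketch, not an official tool: the function name decode_policy_decision is hypothetical, and the regular expression assumes the payload is always wrapped in the policyDecision< ... >policyDecision markers seen in the example event.

```python
import base64
import json
import re

# Assumption: the payload sits between "policyDecision<" and ">policyDecision",
# as in the example event above. Other event formats are not handled.
_POLICY_DECISION = re.compile(
    r"policyDecision<\s*([A-Za-z0-9+/=]+)\s*>policyDecision"
)


def decode_policy_decision(event_message: str) -> dict:
    """Return the decoded policy decision embedded in an event message."""
    match = _POLICY_DECISION.search(event_message)
    if match is None:
        raise ValueError("no policyDecision payload found in message")
    payload = match.group(1)
    # Re-pad in case the payload was emitted without trailing '=' characters.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.b64decode(payload).decode("utf-8"))


if __name__ == "__main__":
    message = (
        "guest RPC failure: container creation denied due to policy: "
        "policyDecision< eyJkZWNpc2lvbiI6ImRlbnkiLCJyZWFzb24iOnsiZXJyb3JzIjpb"
        "ImludmFsaWQgY29tbWFuZCJdfSwidHJ1bmNhdGVkIjpbImlucHV0Il19 "
        ">policyDecision: unknown"
    )
    decision = decode_policy_decision(message)
    # Show the decision and the list of errors that caused it.
    print(decision["decision"], decision["reason"]["errors"])
```

Passing the full event text copied from kubectl describe should work just as well as passing only the trailing portion, since the function searches for the markers anywhere in the message.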
deviceHash not found Error
You might decode a confidential policy decision error that looks something like this:
{"decision":"deny","input":{"deviceHash":"deab9495e4a3c245e3be675a350d0e7a9fe6dcdc95a73582f8586dd759ca7a0b","rule":"mount_device","target":"/run/mounts/m9"},"reason":{"errors":["deviceHash not found"]}}
This type of error most commonly occurs when the image layers for the container provided in the CCE Policy do not align with the actual pulled image layers for the container.
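As a rough illustration, the two decoded errors discussed so far can be mapped to their likely causes in a small helper. This is a Python sketch under stated assumptions: the names LIKELY_CAUSES and explain_decision are hypothetical, the mapping covers only the two errors described in this article, and the field names (reason, errors, input, rule, target, deviceHash) are taken from the decoded examples rather than from a documented schema.

```python
import json
from typing import List

# Likely causes, taken only from the two examples in this article.
# This is illustrative and not a complete list of policy errors.
LIKELY_CAUSES = {
    "invalid command": (
        "the command in the CCE Policy differs from the command actually run"
    ),
    "deviceHash not found": (
        "the image layers in the CCE Policy differ from the layers actually pulled"
    ),
}


def explain_decision(decision: dict) -> List[str]:
    """Summarize a decoded policy decision as human-readable lines."""
    lines = []
    for error in decision.get("reason", {}).get("errors", []):
        cause = LIKELY_CAUSES.get(error, "no likely cause recorded for this error")
        lines.append(f"{error}: {cause}")
    # The "input" section is optional; report the fields that help locate the issue.
    details = decision.get("input", {})
    for key in ("rule", "target", "deviceHash"):
        if key in details:
            lines.append(f"{key} = {details[key]}")
    return lines


if __name__ == "__main__":
    decoded = json.loads(
        '{"decision":"deny","input":{"deviceHash":'
        '"deab9495e4a3c245e3be675a350d0e7a9fe6dcdc95a73582f8586dd759ca7a0b",'
        '"rule":"mount_device","target":"/run/mounts/m9"},'
        '"reason":{"errors":["deviceHash not found"]}}'
    )
    for line in explain_decision(decoded):
        print(line)
```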
Such a mismatch can happen when an image is updated without regenerating the CCE Policy, which can occur unexpectedly when using public images or images whose tags are overwritten (a common example being latest).
Confidential containers are behaving as designed here: they explicitly protect your workload from unauthorized updates (those that do not align with the CCE Policy provided), but in doing so they prevent the pods from reaching a running state.
The recommendation is to use images from an image registry you control and to avoid overwriting tags. Both practices make it easier to control what you are deploying and to avoid unexpected updates.
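One way to catch the most common case before deploying is to flag image references that rely on the latest tag, either explicitly or implicitly (a reference with no tag defaults to latest). The sketch below is a simple Python illustration with a hypothetical function name, relies_on_latest, and made-up registry names. It treats digest-pinned references (those containing @) as safe, which goes slightly beyond what this article covers, and it cannot detect other tags that someone overwrites.

```python
def relies_on_latest(image: str) -> bool:
    """Return True if an image reference resolves to the 'latest' tag."""
    if "@" in image:
        # Digest-pinned reference: the content is fixed regardless of any tag.
        return False
    # Only inspect the last path segment so a registry port such as
    # "myregistry.example.com:5000" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        # No tag given: container runtimes default to "latest".
        return True
    tag = last_segment.rsplit(":", 1)[1]
    return tag == "latest"


if __name__ == "__main__":
    # Hypothetical image references, for illustration only.
    for ref in (
        "nginx",
        "nginx:latest",
        "myregistry.example.com:5000/team/app:1.4.2",
    ):
        print(ref, relies_on_latest(ref))
```

A check like this could run over the image fields of your pod specs before you generate the CCE Policy, so that floating references are noticed early rather than at pod start-up.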