Health Checks

Health checks let customers verify that a servicing operation leaves the target OS in a healthy state. They are optionally run during trident commit (the last step of a clean install or an A/B update). Health checks can include user-defined scripts and/or checks that verify systemd services are running.

If any health check fails:

  • for A/B update: a rollback will be initiated by trident commit, updating the Host Status state to AbUpdateHealthCheckFailed and triggering a reboot into the previous OS. Within the previous OS, trident commit will validate the boot partition and update the Host Status state to Provisioned (reflecting that the machine is now Provisioned to the previous OS).
  • for clean install: a rollback will NOT be initiated as there is no previous OS. Instead, the Host Status state will be set to NotProvisioned.
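The failure outcomes above can be summarized as a small lookup. The shell sketch below is illustrative only: host_status_after_failure is a hypothetical helper, not a Trident command. The servicing type names are borrowed from the runOn values used in the configuration examples, and the state names are the ones listed above.

```shell
#!/bin/sh
# host_status_after_failure SERVICING_TYPE
# Hypothetical helper (not part of Trident): prints the Host Status state(s)
# that the documentation describes for a failed health check.
host_status_after_failure() {
  case "$1" in
    ab-update)
      # Rollback path: the failure state is recorded first, then the state
      # becomes Provisioned once the previous OS has booted and
      # trident commit has validated the boot partition.
      echo "AbUpdateHealthCheckFailed -> Provisioned"
      ;;
    clean-install)
      # No previous OS exists, so there is nothing to roll back to.
      echo "NotProvisioned"
      ;;
    *)
      echo "unknown servicing type: $1" >&2
      return 1
      ;;
  esac
}
```

For example, host_status_after_failure ab-update prints the two states an A/B update passes through after a failed check.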

Configuring Health Checks

Health checks can be configured in the Host Configuration file under the health.checks section. Any number of scripts and/or systemd checks can be defined.

Scripts here work like the other scripts in Trident (e.g. preServicing). For example, an inline script can be defined in health.checks to query the network or some Kubernetes state:

health:
  checks:
    - name: sample-commit-script
      runOn:
        - ab-update
        - clean-install
      content: |
        if ! ping -c 1 8.8.8.8; then
          echo "Network is down"
          exit 1
        fi
        if ! kubectl get nodes; then
          echo "Kubernetes nodes not reachable"
          exit 1
        fi

Systemd checks can also be defined to ensure that critical systemd services are running after servicing. For example, to ensure that kubelet.service and docker.service are running within 15 seconds of trident commit being called for both clean install and A/B update servicing types:

health:
  checks:
    - name: sample-systemd-check
      runOn:
        - ab-update
        - clean-install
      systemdServices:
        - kubelet.service
        - docker.service
      timeoutSeconds: 15
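To make the semantics of a systemd check concrete, the following shell sketch polls a set of units until all are active or a timeout expires. It is a rough approximation for illustration, not Trident's implementation: wait_for_units and unit_is_active are hypothetical helper names, and the one-second polling interval is an assumption. The only real command used is systemctl is-active --quiet, which exits 0 only when the unit is active.

```shell
#!/bin/sh
# Probe a single unit. `systemctl is-active --quiet` exits 0 only when the
# unit is in the active state.
unit_is_active() {
  systemctl is-active --quiet "$1"
}

# wait_for_units TIMEOUT_SECONDS UNIT [UNIT...]
# Hypothetical helper: re-checks all units roughly once per second until
# every unit is active or the timeout is reached.
# Returns 0 if all units became active, 1 otherwise.
wait_for_units() {
  timeout=$1
  shift
  elapsed=0
  while :; do
    pending=""
    for unit in "$@"; do
      unit_is_active "$unit" || pending="$pending $unit"
    done
    if [ -z "$pending" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "not active after ${timeout}s:$pending" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

Under these assumptions, wait_for_units 15 kubelet.service docker.service mirrors the example configuration above and can be run by hand on the target OS to see which unit is slow to start.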

Behavior

Health checks are run during trident commit, after a trident install or trident update has staged and finalized. You can see how health checks fit into the overall servicing flow in these diagrams:

Health Check Failures

If a health check fails, the output from each failed check will be captured in a log file located at /var/lib/trident/trident-health-check-failure-<timestamp>.log on the target OS. This log file can be used to help diagnose the reason for the health check failure.
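Because each failure produces a separately timestamped file, it can be handy to locate the most recent one. The shell sketch below does this by modification time. latest_failure_log is a hypothetical helper, not a Trident command; it assumes only the path pattern given above and makes no assumption about the timestamp format.

```shell
#!/bin/sh
# latest_failure_log [DIR]
# Hypothetical helper: prints the path of the most recently modified health
# check failure log in DIR (default: /var/lib/trident), or returns 1 if
# no such log exists.
latest_failure_log() {
  dir=${1:-/var/lib/trident}
  # ls -t sorts by modification time, newest first; errors from an
  # unmatched glob are discarded so that "no logs" yields empty output.
  newest=$(ls -t "$dir"/trident-health-check-failure-*.log 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

On the target OS, the result can be passed to a pager or to cat, for example cat "$(latest_failure_log)". Reading the file may require elevated privileges depending on how the directory is set up.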

The failures will also be reported in the Trident Host Status lastError field.