Health Checks
Health checks let customers verify that a servicing operation leaves the
target OS in a healthy state. They are optionally run during trident commit
(the last step of a clean install or an A/B update). A health check can be a
user-defined script and/or a configuration that verifies that systemd services
are running.
If any health check fails:
- for A/B update: a rollback will be initiated by trident commit, updating the
  Host Status state to AbUpdateHealthCheckFailed and triggering a reboot into
  the previous OS. Within the previous OS, trident commit will validate the
  boot partition and update the Host Status state to Provisioned (reflecting
  that the machine is now provisioned to the previous OS).
- for clean install: a rollback will NOT be initiated as there is no previous
  OS. Instead, the Host Status state will be set to NotProvisioned.
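
The failure handling above can be summarized as a small lookup. The Python sketch below is illustrative only: the function name and the shape of the returned dictionary are hypothetical and not part of Trident. Only the servicing type names and the Host Status state names come from this page.

```python
def outcome_of_failed_health_check(servicing_type: str) -> dict:
    """Describe the documented outcome of a failed health check.

    Hypothetical helper for illustration; this is not a Trident API.
    """
    if servicing_type == "ab-update":
        return {
            "rollback": True,
            # First state is set before rebooting into the previous OS,
            # second state is set by trident commit within the previous OS.
            "host_status_states": ["AbUpdateHealthCheckFailed", "Provisioned"],
        }
    if servicing_type == "clean-install":
        # There is no previous OS, so no rollback is possible.
        return {"rollback": False, "host_status_states": ["NotProvisioned"]}
    raise ValueError(f"unknown servicing type: {servicing_type!r}")
```

Calling the function with "ab-update" reports a rollback and two successive states, while "clean-install" reports no rollback and a single final state.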
Configuring Health Checks
Health checks can be configured in the Host Configuration file under the
health.checks section. Any number of scripts and/or systemd checks can be
defined.
Scripts here work like the other scripts in Trident (e.g. preServicing). For
example, an inline script can be defined in health.checks to query the network
or some Kubernetes state:
health:
  checks:
    - name: sample-commit-script
      runOn:
        - ab-update
        - clean-install
      content: |
        if ! ping -c 1 8.8.8.8; then
          echo "Network is down"
          exit 1
        fi
        if ! kubectl get nodes; then
          echo "Kubernetes nodes not reachable"
          exit 1
        fi
Systemd checks can also be defined to ensure that critical systemd services
are running after servicing. For example, to ensure that kubelet.service and
docker.service are running within 15 seconds of trident commit being called
for both clean install and A/B update servicing types:
health:
  checks:
    - name: sample-systemd-check
      runOn:
        - ab-update
        - clean-install
      systemdServices:
        - kubelet.service
        - docker.service
      timeoutSeconds: 15
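
To make the role of runOn concrete, the following Python sketch lists which checks would apply to a given servicing type. It is not Trident code. It assumes the Host Configuration has already been parsed into a dictionary, that a check applies only to the servicing types listed in its runOn field, and that a check without runOn is skipped (Trident's actual default may differ). The function name checks_for is hypothetical.

```python
def checks_for(host_config: dict, servicing_type: str) -> list:
    """Return the names of health checks whose runOn lists servicing_type.

    Hypothetical helper; assumes a check without runOn applies to nothing.
    """
    checks = host_config.get("health", {}).get("checks", [])
    return [
        check["name"]
        for check in checks
        if servicing_type in check.get("runOn", [])
    ]


# Dictionary form of the two samples above, with the script check
# restricted to A/B updates purely for illustration.
example_config = {
    "health": {
        "checks": [
            {"name": "sample-commit-script", "runOn": ["ab-update"]},
            {
                "name": "sample-systemd-check",
                "runOn": ["ab-update", "clean-install"],
                "systemdServices": ["kubelet.service", "docker.service"],
                "timeoutSeconds": 15,
            },
        ]
    }
}
```

With this example, checks_for(example_config, "clean-install") selects only sample-systemd-check, while "ab-update" selects both checks.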
Behavior
Health checks are run during trident commit, after a trident install or
trident update has staged and finalized. You can see how health checks
fit into the overall servicing flow in these diagrams:
Health Check Failures
If a health check fails, the output from each failed check will be captured in
a log file located at
/var/lib/trident/trident-health-check-failure-<timestamp>.log on the target
OS. This log file can be used to help diagnose the reason for the health check
failure.
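
Because the file name contains a timestamp, several failure logs may accumulate over time. The Python sketch below is one way to locate the most recent one. The helper name is hypothetical, and picking the newest file by modification time is an assumption made here, since the timestamp format is not described on this page.

```python
from pathlib import Path
from typing import Optional


def latest_failure_log(log_dir: str = "/var/lib/trident") -> Optional[Path]:
    """Return the most recently modified health check failure log, if any.

    Hypothetical helper; only the directory and file name pattern come
    from the documentation above.
    """
    logs = list(Path(log_dir).glob("trident-health-check-failure-*.log"))
    if not logs:
        return None
    # Use modification time rather than parsing the timestamp in the name.
    return max(logs, key=lambda path: path.stat().st_mtime)
```

On the target OS, the result of latest_failure_log() can be read with read_text() to inspect the captured output. Reading files under /var/lib/trident may require elevated privileges.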
The failures will also be reported in the Trident Host Status lastError field.