Logging Standards
Revision Summary
Revision | Date | Comment |
---|---|---|
1.0 | 2024-10-31 | Initial version. |
Background
Besides the Host Status, logging is the fundamental way for Trident to communicate the state of the host and the progress of the ongoing servicing to the customer.
For that reason, it is important to maintain a set of clear and coherent standards for logging in Trident. This document summarizes the role of different log levels, outlines how logs should be structured and what information should be logged, and finally, provides examples for developers to use while adding new logs.
Goals
This document outlines the standards for logging in Trident, to accomplish the following goals:
- Communicating the state of the host and servicing with clarity and sufficient context.
- Whenever the customers are the target audience of any log, speaking in terms that they will be able to understand.
- Ensuring consistency in the meaning and usage of different log levels.
- Using consistent and uniform structure and language across all the logs produced by Trident.
Overview
Background Log
In addition to the output, Trident logs all its updates to a log file, which can be used both by developers and customers for debugging and troubleshooting. This log file has similar contents to the output, with the exception that this log is NOT filtered, meaning that all log messages, regardless of their log levels, will be present in this file. Currently, Trident saves these logs to the following location:
/var/log/trident-full.log
Refer to Viewing Trident's Background Log for more detailed guidelines on how to use the background log file.
Log Levels
For each command that can be run with Trident, the customer can adjust the log
level, i.e. the verbosity level, to one of the following six values: OFF
,
ERROR
, the default WARN
, INFO
, DEBUG
, and TRACE
.
The last five correspond to the actual functions in the log
crate that the
developers can use to write logs in Trident:
error!("...");
warn!("...");
info!("...");
debug!("...");
trace!("...");
Each log level has its specific purpose and must be used in a particular context. Below, the document outlines the role of each log level and provides example scenarios in which it can be used.
ERROR
As the name suggests, the log level ERROR
is used when an error has occurred
in the system. When something doesn't go as expected by Trident, it could
either be:
- A fatal error that forces Trident to shut down the ongoing servicing,
- A non-fatal error that Trident can recover from, but that the customer should still be aware of.
ERROR
Guidelines
The primary audience of the ERROR
logs is the customers. This means that
every ERROR
log should try to address (1) what went wrong, (2) why it might
have occurred, and (3) how the customer might potentially solve it. The message
should focus on what the customer did, rather than detail how the internal,
Trident-driven steps have failed.
However, in case of some errors, such as servicing errors or the Example 1 below, Trident is not able to state the cause of the failure. Then, the goal of the log is to accurately expose and propagate the original error in the system, so that the customer can address it themselves.
If possible, an ERROR
log should follow the following structure:
error!("Failed to <PERFORM SOME ACTION> due to <POTENTIAL CAUSE THAT CAN BE ADDRESSED>");
Fatal vs Non-Fatal Errors
ERROR
logs can be used in case of both fatal and non-fatal errors. In
the former scenario, Trident has encountered a serious failure that interrups
the ongoing servicing; otherwise, if we can recover from the failure, then the
error is non-fatal. Another type of non-fatal errors is when the failure
would be fatal but another failure error has already occurred, so we're
reporting any subsequent errors as non-fatal.
In case of a fatal error, Trident must return a TridentError
object, which
is automatically printed as error!(...)
in the runtime and also included in
the Host Status under the lastError
section. Because TridentError
already
carries the error context, the callstack, and the error type, an ERROR
log
should only be printed if there is any additional context that must be
shared with the customer. Refer to
Structured Error in Trident for more details on
TridentError
.
In case of a non-fatal error, no TridentError
is returned, hence, an
ERROR
log must be printed.
ERROR
Examples
Example 1.
The snippet below is taken from the logic in src/lib.rs
, where Trident
validates whether the firmware correctly booted from the updated target OS after
a clean install. Here, Trident detected that the firmware failed to boot from
the expected device. Because the contents of the CleanInstallRebootCheck
error
already summarize the failure, an ERROR
log is not needed.
datastore.with_host_status(|host_status| {
host_status.servicing_type = ServicingType::NoActiveServicing;
host_status.servicing_state = ServicingState::NotProvisioned;
})?;
return Err(TridentError::new(ServicingError::CleanInstallRebootCheck {
root_device_path: root_device_path.to_string_lossy().to_string(),
expected_device_path: expected_root_device_path.to_string_lossy().to_string(),
}));
#[error("Clean install failed as host booted from '{root_device_path}' instead of the expected device '{expected_device_path}")]
CleanInstallRebootCheck {
root_device_path: String,
expected_device_path: String,
},
Example 2.
The snippet below comes from the Prepare
step, where Trident attempts to
restart SSHD, to enable SSH for a MOS user. If Trident fails to restart SSHD,
it can still continue the servicing, so it produces an ERROR
log to inform
the customer and then moves on to the next step, without returning a
TridentError
. We are not able to state why the restart has failed, but we
still want to surface the original error to the customer, so that they can
address it themselves.
if let Err(err) =
osutils::systemd::restart_unit("sshd").context("Failed to restart sshd in MOS")
{
error!("{err:?}");
}
WARN
WARN
is used to warn the customer about a potential issue that might break
the servicing in the future. In other words, WARN
is reserved for scenarios
where some non-critical expectation is not being met. In this case, no error is
returned, and so Trident will continue the servicing.
WARN
Guidelines
Similarly to ERROR
, WARN
logs are primarily addressed at the customer.
The goal of WARN
is to warn the user about flows that:
- Technically work but may negatively affect security.
- E.g. a Host Configuration requests to ignore the hash of an image, meaning that the hash validation is skipped.
- Request an unusual action that will probably work but goes against our
expectations.
- E.g. a Host Configuration requests to mount partitions of a certain type in
locations where they are not expected, such as mounting the ESP partition
at
/mnt/random-place
.
- E.g. a Host Configuration requests to mount partitions of a certain type in
locations where they are not expected, such as mounting the ESP partition
at
The difference between a warning and a non-fatal error might be confusing at
times. In general, we issue a WARN
when the requested action might be
intentional, although we find it unsafe or unusual. On the other hand, when we
know that the problematic flow is most likely not what the customer intended
and should probably address, we issue a non-fatal error with an ERROR
log.
WARN
Examples
Example 1.
The code snippet below refers to a scenario where the customer is requesting to
ignore the hash of an image in the Host Configuration. Before deploying the
image onto the block device, Trident produces a WARN
log because streaming an
unverified payload onto a block device is a potential security concern.
match image.sha256 {
ImageSha256::Ignored => {
warn!("Ignoring SHA256 for image from '{}'", image_url);
}
...
}
Example 2.
Taken from the logic in src/lib.rs
, the code below addresses the scenario
where the customer has updated the Host Configuration but did not include
stage
under the allowed operations section. This means that the servicing
requested in the Host Configuration cannot be executed. While this does not
cause an error, Trident suspects that the customer might have forgotten to
include stage
in the --allowed-operations
option and hence, it issues a
WARN
log. This is not a failure but a no-action scenario, so we do not
return an ERROR
log.
if cmd.allowed_operations.has_stage() {
engine::update(cmd, datastore).message("Failed to run update")
} else {
warn!("Host config has been updated but allowed operations do not include 'stage'. Add 'stage' and re-run");
Ok(())
}
INFO
INFO
is reserved for the most high-level updates, such as announcing the
kickoff or successful completion of the next step in the servicing process.
The INFO
logs are written primarily for the developers and some
savvy customers who want to understand how the servicing is progressing.
INFO
Examples
Example 1.
Creating a RAID array is a major operation within the process of initializing
block devices. Thus, we use an INFO
log to inform the user that a RAID array
requested in the Host Configuration is being created.
info!("Initializing '{}': creating RAID array", config.id);
mdadm::create(&config.device_path(), &config.level, device_paths)
.context("Failed to create RAID array")?;
Note: Whenever a device is being initialized, Trident formats the log message in the same way, to maintain consistency:
"Initializing '<BLOCK DEVICE ID>': ..."
Example 2.
The example below comes from the logic in src/lib.rs
, where Trident has
validated that the host had correctly booted from the updated target OS. Here,
Trident uses two different levels of logging:
- First, an
INFO
log announces that the higher-level process, i.e. a clean install or an A/B update, succeeded. - Second, a
DEBUG
log communicates to the user that the servicing state of the host has changed, which is a more granular update and thus, relevant to the developer rather than the customer. Note: More information aboutDEBUG
logs will be shared in the next section.
if datastore.host_status().servicing_type == ServicingType::CleanInstall {
info!("Clean install of target OS succeeded");
tracing::info!(metric_name = "clean_install_success", value = true);
} else {
info!("A/B update succeeded");
tracing::info!(metric_name = "ab_update_success", value = true);
}
debug!(
"Updating host's servicing state to '{:?}'",
ServicingState::Provisioned
);
DEBUG
DEBUG
logs are the primary logs, meaning that they report the individual
actions that Trident is completing during the servicing. Unlike the ERROR
and
WARN
logs, DEBUG
logs are aimed at the developers, who can use them to
debug the code.
DEBUG
Guidelines
Conceptually, DEBUG
logs are the "normal" logs, which provide more details
than INFO
but fewer details than TRACE
.
Compared to INFO
logs, DEBUG
statements go one level lower, reporting
specific paths, device names, ids, and other granular pieces of info that are
important to the developers or some very savvy customers. On the other hand,
the TRACE
logs provide even more detailed info, such as the exact contents of
a file/config. Note: More information about TRACE
logs will be shared in
the next section.
If a major step in the servicing consists of multiple operations that might
potentially fail, we first issue an INFO
log before Trident begins the
operation and then, we use DEBUG
and TRACE
logs to report the more granular
details. Finally, if the operation has completed successfully, we can issue
another INFO
log, to confirm that the operation succeeded. (The confirmation
is optional since Trident logging the next step is a confrmation in itself.)
DEBUG
Examples
Example 1.
The snippet below comes from the logic in osutils/src/grub_mkconfig.rs
where
a new grub-mkconfig script is being written on the host. Here, we're using a
DEBUG
log to report the path at which the script is being written, since it's
too low-level to be relevant to the overwhelming majority of the customers. At
the same time, it's definitely relevant to the developer. On the other hand,
seeing the contents of the script is rarely needed even by the developer, and
so we're using a TRACE
log in this case.
debug!("Writing grub-mkconfig script to '{}'", path.display());
let content = self.render();
trace!(
"Grub-mkconfig script content:\n{}",
content.to_string_lossy()
);
Example 2.
The below function illustrates the difference between INFO
and DEBUG
log
levels. First, we're using an INFO
log to report that Trident is starting the
Prepare
step, which is a high-level stage of the servicing. However, knowing
what subsystem is currently being prepared is more granular and would be
helpful to the developers rather than to the customers. Thus, this info is
recorded via a DEBUG
log.
fn prepare(subsystems: &mut [Box<dyn Subsystem>], ctx: &EngineContext) -> Result<(), TridentError> {
info!("Starting step 'Prepare'");
for subsystem in subsystems {
debug!(
"Starting step 'Prepare' for subsystem '{}'",
subsystem.name()
);
subsystem.prepare(ctx).message(format!(
"Step 'Prepare' failed for subsystem '{}'",
subsystem.name()
))?;
}
debug!("Finished step 'Prepare'");
Ok(())
}
TRACE
TRACE
is used for logging the most granular info, which could only be useful
to the developers as they are going through the source code. Whatever actions
seem too granular for the DEBUG
logs should be logged at the TRACE
level.
The section below attempts to provide a guide on how to decide whether a log is
TRACE
or DEBUG
.
TRACE
Guidelines
When trying to decide whether a piece of information should be logged at the
TRACE
vs DEBUG
levels, consider the following questions:
-
Are you logging an operation executed by Trident OR a generic step/"a side effect" of an operation? If yes, it's a
DEBUG
; if no, it's aTRACE
.- E.g. announcing that Trident is creating a new mdadm config file at a path
is a
DEBUG
log because it describes a particular operation done by Trident. On the other hand, printing out the contents of the config is showing an intermediate outcome of this operation. So,TRACE
makes more sense here.
- E.g. announcing that Trident is creating a new mdadm config file at a path
is a
-
Does the code have any concept of the bigger picture? If yes, it's a
DEBUG
; if no, it's aTRACE
.- E.g. the most common use case of
TRACE
is logging every sub-command that Trident executes. In this case, the low-level logic has no idea what exactly Trident is trying to achieve; it just logs that a certain command, with some arguments, is being executed and then prints out its output. But it does not have an understanding of why it is being run.
- E.g. the most common use case of
-
Is the log produced in osutils? If yes, it should most likely be a
TRACE
, unless it is a specific warning/error.
TRACE
Examples
Example 1.
Similarly to one of the examples above, this snippet highlights the difference
between the DEBUG
and TRACE
log levels. While the path of the mdadm config
is a pretty fundamental piece of info for the developers, the content of the
file is a less relevant outcome of that operation, so it should be reported
with a TRACE
log.
debug!("Creating mdadm config file '{}'", mdadm_config_file_path);
trace!("Contents:\n{}", output);
Example 2.
This example illustrates the most common use of TRACE
: for logging the
lowest-level sub-commands that Trident executes and reporting their output.
These logs allow the developers to see what command is being run at the lowest
level, while other logs, such as DEBUG
will provide more context on why this
command was executed in the first place.
fn run_and_check(&mut self) -> Result<(), Error> {
let rendered_command = self.render_command();
trace!("Executing '{rendered_command}'");
let result = self.output();
trace!(
"Executed '{rendered_command}': {}. Report:\n{}",
result.explain_exit(),
result.output_report(),
);
result
.check()
.with_context(|| format!("Error when running: {}", self.render_command()))
}
Log Structure and Contents
- The log message should concisely summarize the update or error that needs to be communicated to the user. If possible, summarize the info in a single phrase or a sentence.
- The logs must provide all the relevant information.
- If a log is aimed at the customers, and in the case of
ERROR
andWARN
logs, in particular, it must contain the block device id if applicable. This will allow the customer to quickly identify which device is causing the issue.
- If a log is aimed at the customers, and in the case of
- On the other hand, the logs should not duplicate each other or provide
irrelevant info that is not helpful to the target audience. Logging too much
is as bad as logging too little.
- E.g.
ERROR
logs should not simply duplicate the info already provided in the payload of the returned error. - E.g. the lowest-level, potentially confusing info, such as the full paths of the block devices in the system, should be omitted in the logs aimed at the customers.
- E.g.
- Do not end the log with a period, an exclamation mark, an ellipsis, etc. to use uniform formatting and maintain a neutral tone.
- Refer to the API definitions and other documentation to use correct language and apply Trident-specific terms consistently in the logs.
- If a series of logs informs about a similar change or step, try to maintain a
uniform structure in the message.
- For example, we currently use a unique statement structure to report which step of the servicing is being executed or which storage component is being initialized.