
Monitor

SuperBench provides a Monitor module that collects system metrics and detects failures during benchmarking. The monitor currently supports only the CUDA platform. Users can enable it in the configuration file.

Configuration

superbench:
  monitor:
    enable: bool
    sample_duration: int
    sample_interval: int

enable

Whether to enable the monitor module.

sample_duration

The length of the window, in seconds, over which averaged metrics such as CPU usage and NIC bandwidth are calculated.
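To make the averaging concrete, here is a minimal sketch of how an average rate over a window can be derived from two readings of a cumulative byte counter. It illustrates the general technique only and is not taken from SuperBench; the function name `average_bandwidth` and the counter values are hypothetical.

```python
def average_bandwidth(bytes_start, bytes_end, sample_duration):
    """Return the average rate in bytes/s over a window of sample_duration seconds.

    bytes_start and bytes_end are readings of a cumulative byte counter taken
    at the start and end of the window (hypothetical inputs for illustration).
    """
    if sample_duration <= 0:
        raise ValueError("sample_duration must be positive")
    return (bytes_end - bytes_start) / sample_duration


# Hypothetical readings: 5,000,000 bytes received during a 2-second window.
rx_bw = average_bandwidth(1_000_000, 6_000_000, sample_duration=2)
print(rx_bw)  # 2500000.0
```

A longer window smooths out short bursts, while a shorter one reacts faster to changes.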

sample_interval

The interval, in seconds, between two consecutive samples.
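As a quick sanity check before a run, the three fields can be validated once the config file has been loaded into a dictionary. The sketch below is an illustration, not part of SuperBench: `validate_monitor_config` is a hypothetical helper, the requirement that both integers be positive is an assumption, and the example values are not documented defaults.

```python
def validate_monitor_config(config):
    """Check the types of the three monitor fields and return the monitor section."""
    monitor = config["superbench"]["monitor"]
    if not isinstance(monitor["enable"], bool):
        raise TypeError("enable must be a bool")
    for key in ("sample_duration", "sample_interval"):
        value = monitor[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{key} must be an int")
        # Assumption: a zero or negative number of seconds is not meaningful.
        if value <= 0:
            raise ValueError(f"{key} must be positive")
    return monitor


# Illustrative values only, not SuperBench defaults.
example = {
    "superbench": {
        "monitor": {"enable": True, "sample_duration": 1, "sample_interval": 10}
    }
}
print(validate_monitor_config(example))
```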

Metrics

The Monitor module writes its output in JSON Lines format. Each line is a JSON object that includes the following metrics:

| Name | Unit | Description |
| --- | --- | --- |
| time | datetime | The timestamp at which the system metrics were collected. |
| cpu_usage | percentage | The average CPU utilization. |
| gpu_usage | percentage | The GPU utilization. |
| gpu_temperature | celsius | The GPU temperature. |
| gpu_power_limit | watt | The GPU power limit. |
| gpu_mem_used | MB | The used GPU memory. |
| gpu_mem_total | MB | The total GPU memory. |
| gpu_corrected_ecc | count | Number of corrected (single-bit) ECC errors. |
| gpu_uncorrected_ecc | count | Number of uncorrected (double-bit) ECC errors. |
| gpu_remap_correctable_error | count | Number of rows remapped due to correctable errors. |
| gpu_remap_uncorrectable_error | count | Number of rows remapped due to uncorrectable errors. |
| gpu_remap_max | count | Number of banks with 8 available remapping resources. |
| gpu_remap_high | count | Number of banks with 7 available remapping resources. |
| gpu_remap_partial | count | Number of banks with 2 to 6 available remapping resources. |
| gpu_remap_low | count | Number of banks with 1 available remapping resource. |
| gpu_remap_none | count | Number of banks with 0 available remapping resources. |
| {device}_receive_bw | bytes/s | Network receive bandwidth. |
| {device}_transmit_bw | bytes/s | Network transmit bandwidth. |
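Because the output is JSON Lines, it can be scanned one record at a time with a standard JSON parser. The following is a minimal sketch that flags records with uncorrected ECC errors or a high GPU temperature. It assumes each record is a flat JSON object keyed by the metric names in the table above with one scalar value per key; the real output may be shaped differently. `find_alerts`, the `max_temperature` threshold, and the sample records are hypothetical.

```python
import json


def find_alerts(lines, max_temperature=85):
    """Yield (time, reason) pairs for records that look unhealthy.

    Assumes flat records with scalar values keyed by the documented metric
    names; max_temperature is a hypothetical threshold in celsius.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("gpu_uncorrected_ecc", 0) > 0:
            yield record.get("time"), "uncorrected ECC errors"
        if record.get("gpu_temperature", 0) > max_temperature:
            yield record.get("time"), "high GPU temperature"


# Made-up sample records for illustration.
sample = [
    '{"time": "2024-01-01 00:00:00", "gpu_temperature": 60, "gpu_uncorrected_ecc": 0}',
    '{"time": "2024-01-01 00:00:10", "gpu_temperature": 90, "gpu_uncorrected_ecc": 2}',
]
for when, reason in find_alerts(sample):
    print(when, reason)
```

The same function accepts an open file object in place of the `sample` list, since iterating over a text file yields one line at a time.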