MLPerf Developer Guide
This guide details the developer process for the MLPerf workload, with a focus on MLPerf Inference.
In machine learning, inference involves using an already trained model to make predictions on unseen data. The MLPerf workload runs multiple benchmarks on a GPU-based system and measures how well the model makes those predictions, using throughput and latency as metrics.
Benchmarks
The benchmark suite for NVIDIA GPU-based systems in MLPerf Inference is detailed in the Inference results repository.
The following are currently supported in Virtual Client:
- bert: Used for natural language processing tasks. This benchmark does not require any supplemental data to test.
- 3d-unet: Used for 3D volumetric data for medical imaging applications. This benchmark does not require any supplemental data to test.
Scenarios
MLPerf evaluates the performance of a system under different scenarios. For a given benchmark, the configuration for each scenario is available under that benchmark's directory (a layout sketch follows the list below).
- Offline: All queries are aggregated into a batch and sent to the tested system. The maximum throughput without a latency constraint is measured.
- Server: Queries arrive at the tested system individually, following a Poisson distribution. The maximum throughput subject to a latency constraint is measured.
- SingleStream: Queries are sent one-by-one to the tested system. The latency of processing individual queries is measured.
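As a point of reference, the per-scenario configs in the NVIDIA submission code are typically laid out as below. The paths are illustrative; verify them against the repository version in use:

# Illustrative layout; verify against the repository version in use.
$ ls closed/NVIDIA/configs/bert
Offline/  Server/  SingleStream/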
Config Versions
- default: Uses lower precision to achieve faster inference times. (See the flag example after this list.)
- high accuracy: Uses higher precision and prioritizes accuracy over performance. (not supported yet)
- triton: Uses the Triton Inference Server to manage and serve models. (not supported yet)
- triton high accuracy: Uses the Triton Inference Server and higher precision. (not supported yet)
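Each version corresponds to a config_ver value passed through RUN_ARGS. A minimal sketch follows; the value default is confirmed by the run commands later in this guide, while high_accuracy is an assumption to verify against the harness in use:

# config_ver selects the config version; only 'default' is currently
# supported by Virtual Client. 'high_accuracy' is an assumed value name.
make run RUN_ARGS='--benchmarks=bert --scenarios=SingleStream --config_ver=default --test_mode=PerformanceOnly'
make run RUN_ARGS='--benchmarks=bert --scenarios=SingleStream --config_ver=high_accuracy --test_mode=PerformanceOnly'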
Hardware for MLPerf
- A100_SXM4_40GBx8: Azure VM SKU Standard_ND96asr_v4. This represents a system with 8 NVIDIA A100 GPUs; the NVIDIA A100 GPU is designed for high-performance computing. The GPU count can be verified as shown below.
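One quick check that the VM exposes all eight GPUs is nvidia-smi's list mode (output abridged):

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)
...
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-...)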
Adding a config
The config information for a given benchmark, scenario, config version, and system under test is in the __init__.py file under the benchmark folder. By default, the repository does not support all systems (e.g. A100_SXM4_40GBx8). To add support, Virtual Client replaces the file at runtime. For example, for bert in the SingleStream scenario, a file with the following section is used:
@ConfigRegistry.register(HarnessType.Custom, AccuracyTarget.k_99, PowerSetting.MaxP)
class A100_SXM4_40GBx8(SingleStreamGPUBaseConfig):
    system = KnownSystem.A100_SXM4_40GBx8
    # Expected per-query latency in nanoseconds (1.7 ms) used by LoadGen.
    single_stream_expected_latency_ns = 1700000
These replacement files are stored with the MLPerf workload's script files, under GPUConfigFiles.
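Conceptually, the replacement amounts to copying the stored config over the repository's version. A minimal sketch, assuming illustrative paths (Virtual Client performs this step itself at runtime):

# Illustrative paths only; Virtual Client performs this replacement at runtime.
cp GPUConfigFiles/bert/SingleStream/__init__.py \
   closed/NVIDIA/configs/bert/SingleStream/__init__.py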
Dependencies
- make, gcc: Necessary dependencies for the workload. Installed in CUDAAndNvidiaGPUDriverInstallation:
commands.Add("apt install build-essential -yq");
- CUDA: A parallel computing platform and API created by NVIDIA which enables general-purpose computing on GPUs. To install CUDA, a .run file is used (an example invocation follows the parameters block below).
- NVIDIA Linux Driver: Software component that enables communication between GPUs and the operating system. The Linux driver handles the low-level interaction between the GPU and the OS and is required for CUDA to function.
The versions for CUDA and the Linux driver are called out in the profile's Parameters section:
"Parameters": {
...
"LinuxCudaVersion": "12.4",
"LinuxDriverVersion": "550",
"LinuxLocalRunFile": "https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run",
...
}
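The .run file is a self-extracting installer. A minimal sketch of a non-interactive invocation, assuming the standard CUDA runfile flags for a toolkit-only install (the driver itself is installed separately via apt, as shown below):

# Download and run the CUDA installer non-interactively (toolkit only;
# the driver is installed separately via apt).
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit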
- NVIDIA Fabric Manager: A software stack that manages the NVLink/NVSwitch fabric connecting multiple GPUs for high-performance computing tasks.
CUDA, the NVIDIA Linux driver, and the NVIDIA Fabric Manager are all installed using NvidiaCudaInstallation.
{
"Type": "NvidiaCudaInstallation",
"Parameters": {
"Scenario": "InstallNvidiaCuda",
"LinuxCudaVersion": "$.Parameters.LinuxCudaVersion",
"LinuxDriverVersion": "$.Parameters.LinuxDriverVersion",
"Username": "$.Parameters.Username",
"LinuxLocalRunFile": "$.Parameters.LinuxLocalRunFile"
}
}
The versions of the Linux driver and the Fabric Manager must match exactly; otherwise, the Fabric Manager will not start and the benchmark cannot run.
commands.Add($"apt install nvidia-driver-{this.LinuxDriverVersion}-server nvidia-dkms-{this.LinuxDriverVersion}-server -y");
commands.Add($"apt install cuda-drivers-fabricmanager-{this.LinuxDriverVersion} -y");
The installed versions can be checked in the terminal:
azureuser@mlperf-vm-1:~$ nv-fabricmanager --version
Fabric Manager version is : 550.127.05
azureuser@mlperf-vm-1:~$ nvidia-smi
Fri Nov 15 03:56:48 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000001:00:00.0 Off | 0 |
| N/A 31C P0 53W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB Off | 00000002:00:00.0 Off | 0 |
| N/A 30C P0 52W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB Off | 00000003:00:00.0 Off | 0 |
| N/A 30C P0 51W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB Off | 00000004:00:00.0 Off | 0 |
| N/A 31C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB Off | 0000000B:00:00.0 Off | 0 |
| N/A 31C P0 57W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-40GB Off | 0000000C:00:00.0 Off | 0 |
| N/A 30C P0 52W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-40GB Off | 0000000D:00:00.0 Off | 0 |
| N/A 31C P0 52W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-40GB Off | 0000000E:00:00.0 Off | 0 |
| N/A 30C P0 53W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
- Docker: A platform for running containers. MLPerf Inference runs benchmarks within a Docker container.
Docker is installed with DockerInstallation (a quick sanity check follows the snippet below).
{
"Type": "DockerInstallation",
"Parameters": {
"Scenario": "InstallDocker"
}
}
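Once installed, Docker can be sanity-checked from the terminal:

docker --version
sudo docker run --rm hello-world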
- NVIDIA Container Toolkit: A set of tools that enables the use of NVIDIA GPUs within Docker containers.
The NVIDIA Container Toolkit is installed with NvidiaContainerToolkitInstallation (a smoke test follows the snippet below).
{
"Type": "NvidiaContainerToolkitInstallation",
"Parameters": {
"Scenario": "InstallNvidiaContainerToolkit"
}
}
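A common smoke test for the toolkit is running nvidia-smi inside a CUDA base container. The image tag here is an assumption; substitute one matching the installed CUDA version:

# Assumed image tag; substitute one matching the installed CUDA version.
sudo docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi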
Running a Benchmark
There are a few setup steps before running the benchmark; a complete example sequence follows this list:
- make prebuild: Downloads the Docker container image and launches the container. The remaining commands run within the container. To avoid launching the Docker container shell, Virtual Client replaces the file at runtime; the replacement Makefile.docker does not launch the container shell.
- make download_data BENCHMARKS="bert": Downloads the datasets necessary to run the benchmark.
- make download_model BENCHMARKS="bert": Downloads the pre-trained model to be tested.
- make preprocess_data BENCHMARKS="bert": Preprocesses the data into the format used by the benchmark.
- make build: Compiles and builds the executables that run the benchmark.
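Putting the steps from this list together for the bert benchmark, the full setup sequence is:

make prebuild
make download_data BENCHMARKS="bert"
make download_model BENCHMARKS="bert"
make preprocess_data BENCHMARKS="bert"
make build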
To actually run the benchmark:
- make run RUN_ARGS='--benchmarks=bert --scenarios=Offline,Server,SingleStream --config_ver=default --test_mode=PerformanceOnly --fast': Runs performance mode, which focuses on the efficiency of the model in making predictions. In this example, the command runs the bert benchmark with the Offline, Server, and SingleStream scenarios, using the default config version, in performance-only mode, and with fewer iterations for a faster turnaround time.
The JSON output will include a validity result (VALID/INVALID) and either the latency or the throughput. For example, this is the JSON output for the SingleStream scenario:
{
"benchmark_full": "bert-99",
"benchmark_short": "bert",
"config_name": "DGX-A100_A100-SXM4-40GBx8_TRT-custom_k_99_MaxP-SingleStream",
"detected_system": "SystemConfiguration(host_cpu_conf=CPUConfiguration(layout={CPU(name=\"AMD EPYC 7V12 64-Core Processor\", architecture=CPUArchitecture.x86_64, core_count=48, threads_per_core=1): 2}), host_mem_conf=MemoryConfiguration(host_memory_capacity=Memory(quantity=928.7656999999999, byte_suffix=ByteSuffix.GB), comparison_tolerance=0.05), accelerator_conf=AcceleratorConfiguration(layout={GPU(name=\"NVIDIA A100-SXM4-40GB\", accelerator_type=AcceleratorType.Discrete, vram=Memory(quantity=40.0, byte_suffix=ByteSuffix.GiB), max_power_limit=400.0, pci_id=\"0x20B010DE\", compute_sm=80): 8}), numa_conf=NUMAConfiguration(numa_nodes={}, num_numa_nodes=4), system_id=\"DGX-A100_A100-SXM4-40GBx8\")",
"early_stopping_met": true,
"effective_min_duration_ms": 600000,
"effective_min_query_count": 100,
"result_90.00_percentile_latency_ns": 1924537,
"result_validity": "INVALID",
"satisfies_query_constraint": false,
"scenario": "SingleStream",
"scenario_key": "result_90.00_percentile_latency_ns",
"summary_string": "result_90.00_percentile_latency_ns: 1924537, Result is INVALID, 10-min runtime requirement met: True",
"system_name": "DGX-A100_A100-SXM4-40GBx8_TRT",
"tensorrt_version": "10.2.0",
"test_mode": "PerformanceOnly"
}
- make run RUN_ARGS='--benchmarks=bert --scenarios=Offline,Server,SingleStream --config_ver=default --test_mode=AccuracyOnly --fast': Runs accuracy mode, which focuses on the accuracy of the model's predictions. In this example, the command runs the bert benchmark with the Offline, Server, and SingleStream scenarios, using the default config version, in accuracy-only mode, and with fewer iterations for a faster turnaround time.
The JSON output will include a pass/fail result and the accuracy score. For example, this is the JSON output for the Offline scenario:
{
"accuracy": [
{
"name": "F1",
"pass": true,
"threshold": 89.96526,
"value": 90.2147015680108
}
],
"accuracy_pass": true,
"benchmark_full": "bert-99",
"benchmark_short": "bert",
"config_name": "DGX-A100_A100-SXM4-40GBx8_TRT-custom_k_99_MaxP-Offline",
"detected_system": "SystemConfiguration(host_cpu_conf=CPUConfiguration(layout={CPU(name=\"AMD EPYC 7V12 64-Core Processor\", architecture=CPUArchitecture.x86_64, core_count=48, threads_per_core=1): 2}), host_mem_conf=MemoryConfiguration(host_memory_capacity=Memory(quantity=928.7656999999999, byte_suffix=ByteSuffix.GB), comparison_tolerance=0.05), accelerator_conf=AcceleratorConfiguration(layout={GPU(name=\"NVIDIA A100-SXM4-40GB\", accelerator_type=AcceleratorType.Discrete, vram=Memory(quantity=40.0, byte_suffix=ByteSuffix.GiB), max_power_limit=400.0, pci_id=\"0x20B010DE\", compute_sm=80): 8}), numa_conf=NUMAConfiguration(numa_nodes={}, num_numa_nodes=4), system_id=\"DGX-A100_A100-SXM4-40GBx8\")",
"effective_min_duration_ms": 600000,
"effective_samples_per_query": 19800000,
"satisfies_query_constraint": true,
"scenario": "Offline",
"scenario_key": "result_samples_per_second",
"summary_string": "[PASSED] F1: 90.215 (Threshold=89.965)",
"system_name": "DGX-A100_A100-SXM4-40GBx8_TRT",
"tensorrt_version": "10.2.0",
"test_mode": "AccuracyOnly"
}