System Config Info
This tool automatically collects system information from the tested GPU nodes. It covers the following categories: System, Memory, CPU, Disk, Networking, Accelerator, and PCIe.
Usage
Usage on local machine
1. Install SuperBench on the local machine using root privilege.
2. Start collecting the system info by running the following command with root privilege:
    sb node info --output-dir ${output_dir}
3. After the command finishes, the system info of the local node is written to the JSON file sys-info.json under ${output_dir}.
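As a minimal sketch of how the resulting file might be inspected, the snippet below loads sys-info.json from the output directory and lists its top-level keys. The helper names `load_sys_info` and `top_level_keys` are hypothetical and not part of SuperBench. The key layout of the file is an assumption, so the code only relies on the file being JSON and being named sys-info.json.

```python
import json
from pathlib import Path


def load_sys_info(output_dir):
    """Load sys-info.json from output_dir (hypothetical helper, not a SuperBench API)."""
    path = Path(output_dir) / "sys-info.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(sys_info):
    """Return the sorted top-level keys; the actual names depend on the tool's output."""
    return sorted(sys_info) if isinstance(sys_info, dict) else []
```

For example, `top_level_keys(load_sys_info("/path/to/output_dir"))` gives a quick overview of which categories were collected on the node.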
Usage on multiple remote machines
1. Install SuperBench on the local machine.
2. Deploy SuperBench onto the remote machines.
3. Prepare the host file of the tested GPU nodes using Ansible Inventory on the local machine.
4. Once SuperBench is installed and the host file is ready, collect the system info automatically with the sb run --get-info command. The full command reference can be found in SuperBench CLI.
    sb run --get-info -f host.ini --output-dir ${output_dir} -C superbench.enable=none
5. After the command finishes, the system info of each node is written to the JSON file sys-info.json under ${output_dir}/nodes/${node_name}.
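The per-node layout described above can be walked with a few lines of Python. This is a sketch under stated assumptions: `collect_node_info` is a hypothetical helper, not a SuperBench API, and it assumes that every subdirectory of `nodes/` is named after one node and contains a sys-info.json file.

```python
import json
from pathlib import Path


def collect_node_info(output_dir):
    """Map node name -> parsed sys-info.json found under <output_dir>/nodes/<node_name>/."""
    results = {}
    nodes_dir = Path(output_dir) / "nodes"
    # Each subdirectory of nodes/ is assumed to be named after one node.
    for info_file in sorted(nodes_dir.glob("*/sys-info.json")):
        with info_file.open(encoding="utf-8") as f:
            results[info_file.parent.name] = json.load(f)
    return results
```

The returned dictionary makes it straightforward to compare one field across nodes, for instance to spot a node whose driver version differs from the rest.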
Parameter and Details
System

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| OS | system-manufacturer | dmidecode -s system-manufacturer | manufacturer of the system | Microsoft Corporation |
| OS | system-product name(virtual machine) | dmidecode -s system-product-name | product name or virtual machine | Virtual Machine |
| OS | operating_system | cat /proc/version | version of the currently running OS | Ubuntu 9.3.0-17ubuntu1~20.04 |
| OS | uname | uname | short system information | Linux sb-test-wu-000000 5.8.0-1039-azure #42~20.04.1-Ubuntu |
| Docker | docker_server_version | docker version | server version of docker engine | 20.10.3 |
| Docker | docker_client_version | docker version | client version of docker engine | 20.10.3 |
| VM | vmbus | lsvmbus | devices attached to the Hyper-V VMBus | "VMBUS ID 1": "[Dynamic Memory]", "VMBUS ID 2": "Synthetic mouse", ... |
| Kernel | kernel_modules | lsmod | list of active kernel modules | "Module": "binfmt_misc", "Size": "24576", "Used": "1" ... |
| Kernel | kernel_parameters | sysctl | kernel parameters | "abi.vsyscall32": "1", "debug.exception-trace": "1", ... |
| DMI | dmidecode | dmidecode | DMI table dump (info on hardware components) | "dmidecode": "# dmidecode 3.2\nGetting SMBIOS data from sysfs..." |
Memory

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| General | model | dmidecode -t memory | distinct model name of the memory | Samsung M393A4K40DB3-CWE |
| General | type | dmidecode -t memory | distinct type of memory | DDR4-3200 |
| General | clock frequency | dmidecode -t memory | distinct clock frequency of memory | 3200 MT/s |
| General | channels | dmidecode -t memory | the number of memory chips | 16 |
| General | capacity | lsmem | the total capacity of memory | 511.9G |
| General | block_size | lsmem | the block size of memory | 128M |
CPU

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| General | architecture | lscpu | architecture of cpu | x86_64 |
| General | model name | lscpu | model name of cpu | AMD EPYC 7662 64-Core Processor |
| General | cpu op-mode | lscpu | cpu mode: 32bit/64bit | 32-bit, 64-bit |
| General | byte order | lscpu | byte order | Little Endian |
| General | address size | lscpu | size of address | 48 bits physical, 48 bits virtual |
| General | cpus | lscpu | logical cpu cores count | 256 |
| General | On-line CPU(s) list | lscpu | on-line logical cpu cores | 0-255 |
| General | Thread(s) per core | lscpu | threads per core | 2 |
| General | Core(s) per socket | lscpu | cores per socket | 64 |
| General | Socket(s) | lscpu | socket count | 2 |
| General | NUMA node(s) | lscpu | numa node count | 4 |
| General | L<x> caches | lscpu | cache size | "L1d cache": "4 MiB", "L1i cache": "4 MiB", "L2 cache": "64 MiB", "L3 cache": "512 MiB" |
| General | NUMA node<x> CPU(s) | lscpu | cpu core list of the numa node | "NUMA node0 CPU(s)": "0-31,128-159", "NUMA node1 CPU(s)": "32-63,160-191", "NUMA node2 CPU(s)": "64-95,192-223", "NUMA node3 CPU(s)": "96-127,224-255" |
| General | Flags | lscpu | cpu flags | fpu vme de pse tsc msr pae mce cx8 apic ... |
| General | max_speed | sudo dmidecode -t processor \| grep "Speed" | distinct cpu max frequency | 3700 MHz |
| General | current_speed | sudo dmidecode -t processor \| grep "Speed" | distinct cpu current frequency | 2000 MHz |
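Most of the CPU keys above come from `lscpu`, which prints one "Key: value" pair per line. The following sketch shows one way such output could be turned into a dictionary. It is an illustration only, not the parser SuperBench uses, and `parse_lscpu` is a hypothetical name.

```python
def parse_lscpu(text):
    """Parse lscpu-style "Key: value" lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info
```

Values are kept as strings (for example "0-31,128-159" for a NUMA node CPU list), which matches how the examples in the table are shown.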
Disk

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| FileSystem | filesystem | df -Th | the name/path of the filesystem | /dev/nvme0n1p2 |
| FileSystem | avail | df -Th | available size of the filesystem | 1.4T |
| FileSystem | size | df -Th | total size of the filesystem | 1.8T |
| FileSystem | type | df -Th | the type of the filesystem | ext4 |
| FileSystem | block_size | blockdev --getbsz /dev/<device> | the block size of the filesystem | 4096 |
| FileSystem | 4k_alignment | parted $DEVICE align-check opt 1 (e.g. DEVICE=/dev/sdb1) | whether the filesystem is 4k-aligned | 1 aligned |
| BlockDevice | name | lsblk -e 7 -o NAME,ROTA,SIZE,MODEL | the name of the block device | nvme0n1 |
| BlockDevice | model | lsblk -e 7 -o NAME,ROTA,SIZE,MODEL | the model name of the block device | VO001920KXAVP |
| BlockDevice | rotational | lsblk -e 7 -o NAME,ROTA,SIZE,MODEL | whether the device is rotational (1 for HDD, 0 for SSD) | 0 |
| BlockDevice | size | lsblk -e 7 -o NAME,ROTA,SIZE,MODEL | the total size of the block device | 1.8T |
| BlockDevice | block_size | fdisk -l -u /dev/ \| grep "Sector size" | the sector size of the block device | Sector size (logical/physical): 512 bytes / 512 bytes |
| General | mapping | mount | mount relationship between filesystem and block device | |
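The FileSystem keys above map onto the columns printed by `df -Th` (Filesystem, Type, Size, Used, Avail, Use%, Mounted on). The sketch below shows how those columns could be picked out. `parse_df` is a hypothetical helper rather than SuperBench's implementation, and it assumes filesystem names contain no spaces.

```python
def parse_df(text):
    """Parse `df -Th` output into a list of dicts, one per filesystem."""
    rows = []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        # Split into at most 7 fields so a mount point containing spaces stays whole.
        parts = line.split(None, 6)
        if len(parts) < 7:
            continue
        filesystem, fs_type, size, _used, avail, _use_pct, mounted_on = parts
        rows.append({
            "filesystem": filesystem,
            "type": fs_type,
            "size": size,
            "avail": avail,
            "mounted_on": mounted_on,
        })
    return rows
```

The `mounted_on` field is not one of the keys listed in the table. It is included here only because it is the natural link to the `mapping` entry.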
Networking

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| NIC | nic_logical_name | lshw -c network | logical name of the nic | ib1 |
| NIC | nic_model | lshw -c network | model name of the nic | Mellanox Technologies MT28908 Family [ConnectX-6] |
| NIC | nic_firmware | lshw -c network | firmware version | 20.30.1004 (MT_0000000594) |
| NIC | nic_driver | lshw -c network | driver version | mlx5_core[ib_ipoib] 5.3-1.0.0 |
| NIC | nic_speed | lshw -c network | speed spec of the nic | 200 Gbit/s |
| NIC | nic_disabled | lshw -c network | whether the nic is disabled | false |
| IB | device_info | ibv_devinfo -v | list of device information for each ib device | "hca_id:\tmlx5_0": ... |
| IB | device_status | ibstat | list of device status for each ib device | "CA 'mlx5_0'": ... |
| General | ofed_version | ofed_info -s | the version of ofed | MLNX_OFED_LINUX-5.3-1.0.5.0: |
Accelerator

| SubCategory | Key | Command | Description | Example(NVIDIA) | Example(AMD) |
| --- | --- | --- | --- | --- | --- |
| General | driver_version | nvidia-smi -q -x/rocm-smi -a | driver version | 460.27.04 | 5.9.25 |
| General | topology | nvidia-smi topo -m/rocm-smi --showtopo | gpu connection topology (nvidia only) | / | / |
| General | nvidia-container-runtime_version | nvidia-container-runtime -v | version of nvidia-container-runtime (nvidia only) | 1.0.0-rc92 | N/A |
| General | nvidia-fabricmanager_version | nv-fabricmanager --version | version of nvidia-fabricmanager (nvidia only) | 460.27.04 | N/A |
| General | nv_peer_mem_version | dpkg -l \| grep 'nvidia-peer-memory' | version of nv_peer_mem (nvidia only) | 1.1-0 | N/A |
| GPUCard | rocm_info | rocm-smi -a & rocm-smi --showmeminfo vram | amd gpu info of each gpu<index>, including firmware, frequency, memory, etc. (amd only) | N/A | "card0": ... "card1": ... |
| GPUCard | nvidia_info | nvidia-smi -q | nvidia gpu info list of each gpu, including firmware, frequency, memory, etc. (nvidia only) | "timestamp": "Fri Aug 20 05:36:24 2021", "driver_version": "460.27.04", "cuda_version": "11.2", "attached_gpus": "8", "gpu": [...] ... | N/A |
PCIe

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| General | topology | lspci -t -vvv | topology of installed PCI devices | / |
| General | device_info | lspci -vvv | device info on installed PCI devices | 00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex... |