utils.distributed

distributed utils

ranks_already_set#

def ranks_already_set(args) -> bool

Return True if both the local and global ranks have been set.
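
A minimal sketch of what this check might look like. The attribute names `local_rank` and `global_rank` are assumptions, not confirmed field names from the library:

```python
import argparse

def ranks_already_set(args) -> bool:
    """Return True if both local and global ranks have been set.

    Hypothetical sketch: assumes args carries local_rank and global_rank
    attributes (names are illustrative).
    """
    return (
        getattr(args, "local_rank", None) is not None
        and getattr(args, "global_rank", None) is not None
    )

# Example usage with an argparse-style namespace:
print(ranks_already_set(argparse.Namespace(local_rank=0, global_rank=3)))   # True
print(ranks_already_set(argparse.Namespace(local_rank=None, global_rank=1)))  # False
```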

fetch_ranks_from_azureml_preprocess#

def fetch_ranks_from_azureml_preprocess()

Look up distributed arguments from Azure ML environment variables.

Assumes OpenMPI image.

Notes:

Sets up NCCL environment variables used by Azure ML:

  • NCCL_SOCKET_IFNAME
  • NCCL_IB_DISABLE
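
The NCCL setup noted above can be sketched as follows. The specific values are assumptions for illustration; the real utility may choose different interface filters or InfiniBand settings:

```python
import os

def _set_nccl_env_for_azureml():
    """Illustrative sketch of the NCCL environment setup described above.

    The values here are assumptions, not Azure ML's actual defaults.
    """
    # Keep NCCL off the docker bridge and loopback interfaces.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    # "0" leaves InfiniBand enabled on IB-capable VM SKUs.
    os.environ["NCCL_IB_DISABLE"] = "0"

_set_nccl_env_for_azureml()
print(os.environ["NCCL_SOCKET_IFNAME"])  # ^docker0,lo
```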

fetch_ranks_from_azureml#

def fetch_ranks_from_azureml()

Look up distributed arguments from Azure ML environment variables.

Assumes OpenMPI image.

Notes:

Sets up NCCL environment variables used by Azure ML:

  • NCCL_SOCKET_IFNAME
  • NCCL_IB_DISABLE
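
Since an OpenMPI image is assumed, the ranks are presumably read from OpenMPI's `OMPI_COMM_WORLD_*` environment variables. A sketch under that assumption (the return shape is illustrative, not the documented one):

```python
import os

def fetch_ranks_from_azureml():
    """Sketch: read distributed ranks from OpenMPI environment variables.

    OMPI_COMM_WORLD_RANK / _LOCAL_RANK / _SIZE are set by OpenMPI launchers;
    the (global_rank, local_rank, world_size) return shape is an assumption.
    """
    global_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    return global_rank, local_rank, world_size

# Simulate the environment an Azure ML OpenMPI run would provide:
os.environ.update({
    "OMPI_COMM_WORLD_RANK": "2",
    "OMPI_COMM_WORLD_LOCAL_RANK": "0",
    "OMPI_COMM_WORLD_SIZE": "4",
})
print(fetch_ranks_from_azureml())  # (2, 0, 4)
```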

fetch_ranks_from_torch_distributed_launch#

def fetch_ranks_from_torch_distributed_launch()

Read distributed arguments set by torch.distributed.launch via environment variables.
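`torch.distributed.launch` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` into each worker's environment, so the lookup likely resembles the sketch below (the return shape is an assumption):

```python
import os

def fetch_ranks_from_torch_distributed_launch():
    """Sketch: read the env vars torch.distributed.launch sets per worker."""
    global_rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    return global_rank, local_rank, world_size

# Simulate a launcher-provided environment:
os.environ.update({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
print(fetch_ranks_from_torch_distributed_launch())  # (1, 1, 8)
```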

set_environment_variables_for_nccl_backend#

def set_environment_variables_for_nccl_backend()

Set distributed training environment variables for Azure ML OpenMPI runs with the NCCL backend.
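PyTorch's NCCL backend expects `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`, so a function like this plausibly maps OpenMPI's variables onto them. This is a sketch only: the master-address source and default port are assumptions, not the utility's actual behavior:

```python
import os

def set_environment_variables_for_nccl_backend(master_port: str = "6105"):
    """Sketch: translate OpenMPI env vars into those the NCCL backend reads.

    Assumptions: the real utility may derive MASTER_ADDR from an Azure ML
    variable (e.g. a "host:port" master-node string) and use another port.
    """
    os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
    os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumed fallback
    os.environ.setdefault("MASTER_PORT", master_port)

# Simulate an OpenMPI-provided environment:
os.environ.update({"OMPI_COMM_WORLD_RANK": "0", "OMPI_COMM_WORLD_SIZE": "2"})
set_environment_variables_for_nccl_backend()
print(os.environ["RANK"], os.environ["WORLD_SIZE"])  # 0 2
```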

rank_zero_only#

def rank_zero_only(fn)

Decorate a function so it executes only on global rank 0; other ranks wait via torch.distributed.
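
The decorator pattern can be sketched as below. The env-var rank check is an assumption, and the `torch.distributed.barrier()` synchronization is shown as a comment so the sketch runs without an initialized process group:

```python
import functools
import os

def rank_zero_only(fn):
    """Sketch: run fn only on global rank 0.

    In the real utility, non-zero ranks wait via torch.distributed; here
    that would be a torch.distributed.barrier() call at the marked point.
    """
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        result = None
        if int(os.environ.get("RANK", "0")) == 0:  # assumed rank source
            result = fn(*args, **kwargs)
        # torch.distributed.barrier()  # all ranks would synchronize here
        return result
    return wrapped

@rank_zero_only
def save_checkpoint():
    return "saved"

os.environ["RANK"] = "0"
print(save_checkpoint())  # saved
os.environ["RANK"] = "1"
print(save_checkpoint())  # None
```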