MLPerf Training BERT Data Preprocessing

This document describes the steps run to download, preprocess, and package the training data used for BERT training.

VM Configuration used

  • VM SKU: Standard ND96asr v4 (96 vCPUs, 900 GiB memory)
  • Operating system: Ubuntu 20.04
  • OS disk size: 256 GB
  • Data disk size: 8 TB (mounted on /data/mlperf/bert)
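
Before starting, it can help to confirm that the data disk is mounted and has room for the dataset. The following is a minimal sketch and not part of the original procedure: the function name check_data_dir and any free-space threshold passed to it are assumptions chosen for illustration; only the path /data/mlperf/bert comes from the configuration above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of the MLPerf scripts).
# Usage: check_data_dir <directory> <minimum free space in GiB>
check_data_dir() {
    local dir="$1" min_gib="$2" avail_kib avail_gib

    if [ ! -d "$dir" ]; then
        echo "ERROR: $dir does not exist or is not a directory" >&2
        return 1
    fi

    # df -P prints POSIX-format output on one line per filesystem; with -k,
    # column 4 of the second line is the available space in 1024-byte blocks.
    avail_kib=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    avail_gib=$(( avail_kib / 1024 / 1024 ))

    if [ "$avail_gib" -lt "$min_gib" ]; then
        echo "ERROR: only ${avail_gib} GiB free in $dir, need ${min_gib} GiB" >&2
        return 1
    fi
    echo "OK: ${avail_gib} GiB free in $dir"
}
```

For example, `check_data_dir /data/mlperf/bert 1000` would fail if the disk is not mounted or has less than 1000 GiB free. The 1000 GiB figure is an assumed threshold, not a measured requirement.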

Steps followed to preprocess data

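  • Build the Docker image (run from the directory that contains the Dockerfile):
docker build --pull -t mlperf-training:language_model .

  • Push the image:
docker push mlperf-training:language_model

  • Start an interactive container with the data disk mounted at /workspace/bert_data:
docker run -it --runtime=nvidia --ipc=host -v /data/mlperf/bert:/workspace/bert_data mlperf-training:language_model

  • Inside the Docker container, run the following commands:
cd /workspace/bert
./input_preprocessing/prepare_data.sh --outputdir /workspace/bert_data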
  • Inside the container, follow the steps to package the data (link).

  • Exit the container and zip /data/mlperf/bert.
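
The final packaging step can be sketched as follows. The source says only to zip /data/mlperf/bert; this sketch swaps in tar with gzip compression (an assumption, chosen because tar is present on Ubuntu by default) and also writes a SHA-256 checksum. The function name package_dir and the output file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical packaging helper; tar + gzip stands in for "zip" here.
# Usage: package_dir <source directory> <output archive (.tar.gz)>
package_dir() {
    local src="$1" archive="$2"

    if [ ! -d "$src" ]; then
        echo "ERROR: $src is not a directory" >&2
        return 1
    fi

    # -C changes into the parent directory so the archive stores paths
    # relative to it (e.g. "bert/...") instead of absolute paths.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1

    # Record a checksum so the copy can be verified after transfer.
    sha256sum "$archive" > "$archive.sha256"
    echo "Wrote $archive and $archive.sha256"
}
```

A call such as `package_dir /data/mlperf/bert <output>.tar.gz` would produce the archive and its checksum file. The output location is left open here: it should be outside the source directory and on a volume with enough free space for the compressed data.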