# Data interface single and multi process
This example explains how to leverage the built-in multiprocessing capability of DataInterface for large amounts of data. For illustration we use 27 files of raw Wikipedia text. Two scenarios are covered:

1. Azure virtual machine: single-node multi-process, plus selective processing on a single machine
2. AML: single-node vs. multi-node, plus selective processing on a single machine
## Configs - YAML and parsing

For ease of use, configs are passed in as YAML files. This example uses the config file `config_prod.yaml` included with the example code.
A snippet of the config is shown below (modify the file paths according to your folder structure):
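The sketch below is illustrative only: the key names and paths are assumptions, not necessarily the exact keys in the shipped `config_prod.yaml`.

```yaml
# Illustrative snippet -- check config_prod.yaml in the example code for the real keys.
data_processor:
  input_dir: /home/user/wiki_data/raw       # directory holding the 27 raw text files
  output_dir: /home/user/wiki_data/tokens   # where tokenized output files are written
  process_count: 8                          # number of worker processes to spin up
```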
The config can be read in as follows:
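A minimal way to do this is with plain PyYAML, as sketched below; the shipped example may instead use pymarlin's own config-parsing utilities.

```python
import yaml  # PyYAML

# Load the YAML config into a plain dict.
with open("config_prod.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Assumes the illustrative keys from the snippet above.
print(config["data_processor"]["input_dir"])
```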
Our data processor is a simple token splitter: given raw text, it splits the text into tokens and stores the results back in a file. The processor runs on one file at a time.
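A sketch of such a processor is shown below. It assumes pymarlin's `DataProcessor` base class exposes a `process` method to override; the exact constructor and method signatures should be checked against the pymarlin version you have installed.

```python
import os
from pymarlin.core import data_interface

class WikiTokenSplitter(data_interface.DataProcessor):
    """Toy processor: splits raw text into whitespace tokens, one file at a time."""

    def __init__(self, input_dir, output_dir):
        super().__init__()
        self.input_dir = input_dir
        self.output_dir = output_dir

    def process(self, filename):
        # Read one raw text file and split it into tokens.
        with open(os.path.join(self.input_dir, filename), encoding="utf-8") as f:
            tokens = f.read().split()
        # Store the result back as a file, one token per line.
        out_path = os.path.join(self.output_dir, filename)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(tokens))
        return out_path
```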
## Virtual machine
### Single virtual machine with multiple processes

Here we create a list of files in the input directory and initialize the processor with the input and output directories. We call the processor's `multi_process_data` function, passing the list of files along with the process count. The processor then spins up that many processes to create the corresponding output.
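A sketch of that call, continuing the `WikiTokenSplitter` example above (paths are illustrative, and the exact `multi_process_data` signature may differ across pymarlin versions):

```python
import os

input_dir = "/home/user/wiki_data/raw"      # illustrative paths
output_dir = "/home/user/wiki_data/tokens"

files = os.listdir(input_dir)               # the 27 raw wiki text files
processor = WikiTokenSplitter(input_dir, output_dir)

# Spins up `process_count` worker processes, each calling
# processor.process(...) on a subset of the files.
processor.multi_process_data(files, process_count=8)
```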
### Selective node preprocessing

Sometimes we have a single node but want to process the data in batches, with the processor running on a different subset of files depending on the rank we assign. This emulates multi-node behaviour on a single node by controlling the node rank parameter.
For instance, if we have 30 files to process over 5 separate runs, we add the following to the config and initialize the DataProcessor accordingly, as sketched below.
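An illustrative config addition (again, the key names are assumptions, not the shipped config):

```yaml
distributed:
  node_rank: 3    # which of the 5 runs this is (0-indexed)
  node_count: 5   # total number of runs/nodes the files are divided across
```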
Remember to initialize the base DataProcessor class with the distributed arguments as shown below; the default of `None` would treat it like a regular multi-node processing job.
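Extending the earlier `WikiTokenSplitter` sketch; the `dist_args` parameter name is an assumption, so check the `DataProcessor` constructor in your pymarlin version for the real argument.

```python
from pymarlin.core import data_interface

class WikiTokenSplitter(data_interface.DataProcessor):
    def __init__(self, input_dir, output_dir, dist_args=None):
        # Forward the distributed settings from the config to the base class.
        # Default None => regular (non-selective) processing.
        super().__init__(dist_args)
        self.input_dir = input_dir
        self.output_dir = output_dir
```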
With the above settings we would process files 18-23 (0-indexed) out of 30: `node_count` gives the total number of nodes available (here 5), so each rank handles 30 / 5 = 6 files, and `node_rank` 3 (0-indexed, with a maximum of 4) selects the fourth chunk. This gives flexibility for processing large data with limited compute.
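The arithmetic behind that sharding, as a standalone sketch (not pymarlin's internal implementation):

```python
# 30 files divided evenly across 5 ranks; rank 3 takes the fourth chunk.
files = [f"wiki_{i:02d}.txt" for i in range(30)]
node_count, node_rank = 5, 3

files_per_node = len(files) // node_count         # 30 // 5 = 6
start = node_rank * files_per_node                # 3 * 6 = 18
my_files = files[start:start + files_per_node]    # indices 18..23
print(my_files[0], my_files[-1])                  # wiki_18.txt wiki_23.txt
```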
To run on a virtual machine:

1. Copy the example files over to the virtual machine using SCP.
2. Install pymarlin and the example requirements.
3. Run the example.
## AML

We can do both single-node and multi-node processing with AML. The datamodule handles AML ranking internally for both single and multiple nodes to appropriately divide the files across nodes. You will find a notebook along with the example for submitting an AML job, with placeholders for your storage and compute accounts.
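For orientation, a minimal submission sketch using the v1 `azureml-core` SDK is shown below. The workspace, compute target, and entry-script names are placeholders, not values from the shipped notebook.

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

# Reads config.json downloaded from the Azure ML portal.
ws = Workspace.from_config()

# node_count=1 gives single-node processing; raise it for multi-node.
distributed_config = MpiConfiguration(process_count_per_node=1, node_count=5)

run_config = ScriptRunConfig(
    source_directory=".",
    script="data_processing.py",          # hypothetical entry-script name
    arguments=["--config_path", "config_prod.yaml"],
    compute_target="my-cpu-cluster",      # placeholder compute target
    environment=Environment.from_pip_requirements(
        name="pymarlin-env", file_path="requirements.txt"
    ),
    distributed_job_config=distributed_config,
)

Experiment(ws, "wiki-data-processing").submit(run_config)
```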