core.data_interface
Data Interface module:
This module contains the abstract classes DataProcessor and DataInterface for data processing prior to model training. Users should implement DataProcessors such that each instance fully encapsulates a data processing task and exposes a process() method for running that task from a DataInterface instance. The DataInterface acts as orchestrator and implements an interface through which other modules request the train and validation datasets. Consider using the DataInterface.setup_datasets() method to store these datasets within the DataInterface instance, so that the get_[train,val]_dataset() methods are as quick and computationally inexpensive as possible, since they are called at every epoch.
DataProcessor Objects
Processes and optionally analyzes data.
Designed to be used in conjunction with a DataInterface. Must be extended to implement the process() method.
__init__
Accepts DistributedPreprocessArguments for custom multiprocess rank handling.
process
Processes data with operations such as loading from a source, parsing, formatting, filtering, or any other step required before Dataset creation.
analyze
Optional method for analyzing data.
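The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: the DataProcessor base class here is a stand-in with an assumed shape (it omits the DistributedPreprocessArguments constructor argument), and LineCleaner is a hypothetical processor invented for the example.

```python
from abc import ABC, abstractmethod
from collections import Counter


class DataProcessor(ABC):
    """Stand-in for the library's abstract class (assumed shape, not the real one)."""

    @abstractmethod
    def process(self, *args):
        ...

    def analyze(self, *args):
        # Optional hook: subclasses may override it, the default does nothing.
        return None


class LineCleaner(DataProcessor):
    """Hypothetical processor: strips whitespace and drops empty lines."""

    def process(self, lines):
        stripped = (line.strip() for line in lines)
        return [line for line in stripped if line]

    def analyze(self, lines):
        # Report simple statistics for lines that were already processed.
        words_per_line = Counter(len(line.split()) for line in lines)
        return {"count": len(lines), "words_per_line": dict(words_per_line)}


raw = ["  hello world ", "", "foo", "   "]
processor = LineCleaner()
cleaned = processor.process(raw)
stats = processor.analyze(cleaned)
```

Keeping all of the task's logic inside process() means a caller only needs the instance and the method name, which is what lets a DataInterface drive any processor in the same way.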
process_data
Processes data via a DataProcessor's process() method.
Arguments:
data_processor (DataProcessor) - DataProcessor to call.
Returns:
Result of process() call.
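As a rough sketch of the delegation described here, assuming process_data is a thin wrapper that returns whatever process() returns (the real function may do more, for example rank handling), with UppercaseProcessor as a hypothetical processor:

```python
class UppercaseProcessor:
    """Hypothetical processor; only its process() method matters here."""

    def process(self):
        return [name.upper() for name in ("train", "val")]


def process_data(data_processor):
    # Assumed behaviour: call process() and return its result unchanged.
    return data_processor.process()


result = process_data(UppercaseProcessor())
```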
multi_process_data
Processes data with naive multiprocessing, using Python's multiprocessing module.
Calls the DataProcessor's process() method with any *args provided. All lists within args are split across processes as specified by process_count and executed either synchronously or asynchronously. Any non-list args are passed unchanged to every process() call. Users are encouraged to pass as args only the objects they wish to divide among processes.
Arguments:
data_processor (DataProcessor) - DataProcessor to call.
*args - Arguments to be passed on to the DataProcessor's process() method.
process_count (int, optional) - Number of worker processes to create in pool.
Returns:
List - list of results of the process() call, one per worker process.
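The argument-splitting rule described above can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the source does not say how lists are divided, so the contiguous chunking in split_list is a guess, the helper names split_list and build_worker_args are hypothetical, and only the synchronous path is shown.

```python
from multiprocessing import Pool


def split_list(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def build_worker_args(args, process_count):
    """Give each worker one chunk of every list arg and the same non-list args."""
    per_arg = [
        split_list(arg, process_count) if isinstance(arg, list) else [arg] * process_count
        for arg in args
    ]
    # Transpose: one tuple of arguments per worker.
    return [tuple(column) for column in zip(*per_arg)]


def multi_process_data(data_processor, *args, process_count=2):
    worker_args = build_worker_args(args, process_count)
    with Pool(process_count) as pool:
        # starmap blocks until every worker has finished (the synchronous variant).
        return pool.starmap(data_processor.process, worker_args)


# The splitting step on its own, without starting any worker processes.
worker_args = build_worker_args((["a", "b", "c", "d", "e"], "suffix"), process_count=2)
```

In this sketch the processor instance must be picklable, and on platforms that start workers by spawning a fresh interpreter the call to multi_process_data should sit under an `if __name__ == "__main__":` guard. Because every list argument is split, passing a list that was meant to be shared whole would silently hand each worker only a part of it, which is presumably why the documentation advises passing only the objects to be divided.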
DataInterface Objects
Organizer and orchestrator for loading and processing data.
Designed to be used in conjunction with DataProcessors. Abstract methods get_train_dataset() and get_val_dataset() must be implemented to return datasets.
setup_datasets
Sets up the datasets before training.
get_train_dataset
Returns the Dataset for training data.
get_val_dataset
Returns the Dataset for validation data.
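Since the getters are called at every epoch, one way to keep them cheap is to build both datasets once in setup_datasets() and have the getters return the stored objects. The sketch below assumes a simplified stand-in for the DataInterface base class and uses plain lists in place of real Dataset objects; CachedInterface, val_size, and setup_calls are hypothetical names introduced for the example.

```python
from abc import ABC, abstractmethod


class DataInterface(ABC):
    """Stand-in for the library's abstract class (assumed shape, not the real one)."""

    def setup_datasets(self):
        # Optional hook in this sketch; intended to run once before training.
        pass

    @abstractmethod
    def get_train_dataset(self):
        ...

    @abstractmethod
    def get_val_dataset(self):
        ...


class CachedInterface(DataInterface):
    """Hypothetical interface that builds both datasets once and reuses them."""

    def __init__(self, records, val_size=1):
        self.records = records
        self.val_size = val_size
        self.train_dataset = None
        self.val_dataset = None
        self.setup_calls = 0

    def setup_datasets(self):
        # Do the expensive work here, once, rather than inside the getters.
        self.setup_calls += 1
        cleaned = [r.strip().lower() for r in self.records if r.strip()]
        split = len(cleaned) - self.val_size
        self.train_dataset = cleaned[:split]
        self.val_dataset = cleaned[split:]

    def get_train_dataset(self):
        return self.train_dataset

    def get_val_dataset(self):
        return self.val_dataset


interface = CachedInterface([" A ", "b", "", "C", "d ", "E"])
interface.setup_datasets()
for _epoch in range(3):
    # Each epoch only reads the stored datasets; nothing is recomputed.
    train = interface.get_train_dataset()
    val = interface.get_val_dataset()
```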