core.data_interface
Data Interface module:
This module contains the abstract classes DataProcessor and DataInterface for data processing prior to model training. Users should implement DataProcessors such that each instance fully encapsulates a data processing task and exposes the process() method, which a DataInterface instance calls to run the task. The DataInterface acts as orchestrator and implements an interface for other modules to request train and validation datasets. Consider using the DataInterface.setup_datasets() method to store these datasets within the DataInterface instance, so that the get_[train,val]_dataset() methods are as quick and computationally inexpensive as possible, since they are called at every epoch.
DataProcessor Objects#
Processes and optionally analyzes data.
Designed to be used in conjunction with a DataInterface. Must be extended to implement the process() method.
__init__#
Accepts DistributedPreprocessArguments for custom multiprocess rank handling.
process#
Process data with operations such as loading from a source, parsing, formatting, filtering, or any other step required before Dataset creation.
analyze#
Optional method for analyzing data.
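As an illustration of the contract described above, here is a minimal sketch of a DataProcessor subclass. The base class below is a stand-in written for this example, not the module's actual code: it omits the DistributedPreprocessArguments constructor argument, and the LowercaseFilterProcessor name, its record format, and the statistics returned by analyze() are all hypothetical.

```python
from abc import ABC, abstractmethod


class DataProcessor(ABC):
    """Stand-in for core.data_interface.DataProcessor (assumed shape only)."""

    @abstractmethod
    def process(self, *args, **kwargs):
        """Subclasses must implement the processing task here."""

    def analyze(self, *args, **kwargs):
        # Optional hook; this stand-in default does nothing.
        return None


class LowercaseFilterProcessor(DataProcessor):
    """Hypothetical processor: normalizes strings and drops empty ones."""

    def process(self, records):
        cleaned = [r.strip().lower() for r in records]
        return [r for r in cleaned if r]

    def analyze(self, records):
        # Report simple statistics about already-processed records.
        return {
            "count": len(records),
            "max_len": max((len(r) for r in records), default=0),
        }


processor = LowercaseFilterProcessor()
processed = processor.process(["  Hello ", "", "WORLD"])
stats = processor.analyze(processed)
```

Keeping the whole task inside process() means a DataInterface only needs a reference to the processor instance and the arguments to pass along.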
process_data#
Process data via a DataProcessor's process() method.
Arguments:
data_processor (DataProcessor) - DataProcessor to call.
Returns:
Result of process() call.
multi_process_data#
Process data with naive parallelism, using Python's multiprocessing module.
Calls the DataProcessor's process() method with any *args provided. Every list within args is split across processes as specified by process_count, and the calls are executed either synchronously or asynchronously. Any non-list args are sent unchanged to each process() call. Users are encouraged to pass as list args only the objects they wish to divide among processes.
Arguments:
data_processor (DataProcessor) - DataProcessor to call.
*args - Arguments to be passed on to the DataProcessor's process() method.
process_count (int, optional) - Number of worker processes to create in the pool.
Returns:
List - list of results of the process() call, one per worker process.
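The argument handling described above (lists are divided, everything else is passed through) can be sketched as follows. This is not the module's implementation: the contiguous, near-equal chunking strategy and the helper names split_list and build_worker_args are assumptions made for illustration, and the stand-in process() function is hypothetical.

```python
def split_list(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def build_worker_args(args, process_count):
    """Return one argument tuple per worker: lists are split, other values repeated."""
    per_arg = [
        split_list(a, process_count) if isinstance(a, list) else [a] * process_count
        for a in args
    ]
    return [tuple(arg[i] for arg in per_arg) for i in range(process_count)]


def process(paths, suffix):
    # Stand-in for a DataProcessor.process() implementation.
    return [p + suffix for p in paths]


worker_args = build_worker_args((["a", "b", "c", "d", "e"], ".txt"), process_count=2)

# Each tuple could be handed to a worker process, for example with
# multiprocessing.Pool.starmap; the calls run sequentially here to keep
# the sketch simple and portable.
results = [process(*wa) for wa in worker_args]
```

Because every list argument is divided, a list that each worker needs in full (a vocabulary, say) would have to be wrapped in a non-list container such as a tuple under this scheme.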
DataInterface Objects#
Organizer and orchestrator for loading and processing data.
Designed to be used in conjunction with DataProcessors. Abstract methods get_train_dataset() and get_val_dataset() must be implemented to return datasets.
setup_datasets#
Set up the datasets before training.
get_train_dataset#
Returns Dataset for train data.
get_val_dataset#
Returns Dataset for val data.
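A minimal sketch of the caching pattern recommended in the module overview: do the expensive work once in setup_datasets() and have the getters return the stored objects. The base class is a stand-in written for this example, not the module's actual code; InMemoryDataInterface, its val_size parameter, and the use of plain lists in place of real Dataset objects are all assumptions.

```python
from abc import ABC, abstractmethod


class DataInterface(ABC):
    """Stand-in for core.data_interface.DataInterface (assumed shape only)."""

    def setup_datasets(self):
        # Optional hook, called once before training.
        pass

    @abstractmethod
    def get_train_dataset(self):
        """Return the Dataset for train data."""

    @abstractmethod
    def get_val_dataset(self):
        """Return the Dataset for val data."""


class InMemoryDataInterface(DataInterface):
    """Hypothetical interface that builds both splits once and caches them."""

    def __init__(self, records, val_size=1):
        self._records = records
        self._val_size = val_size
        self._train = None
        self._val = None

    def setup_datasets(self):
        # Expensive work happens here, once, rather than at every epoch.
        cleaned = [r.strip().lower() for r in self._records if r.strip()]
        self._train = cleaned[:-self._val_size]
        self._val = cleaned[-self._val_size:]

    def get_train_dataset(self):
        # Cheap: returns the stored object without reprocessing.
        return self._train

    def get_val_dataset(self):
        return self._val


data = InMemoryDataInterface(["A", "b ", "", "C", "d", "E"], val_size=1)
data.setup_datasets()
train = data.get_train_dataset()
val = data.get_val_dataset()
```

In a real implementation the cleaning step would typically be delegated to one or more DataProcessor instances, with the getters left as simple attribute lookups.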