
core.data_interface

Data Interface module:

This module contains the abstract classes DataProcessor and DataInterface for data processing prior to model training. Users should implement DataProcessors such that each instance fully encapsulates one data processing task and exposes the process() method for running it. A DataInterface instance calls process() to run the task; it acts as the orchestrator and implements the interface through which other modules request the train and validation datasets. Consider using the DataInterface.setup_datasets() method to store these datasets within the DataInterface instance, so that the get_train_dataset() and get_val_dataset() methods are as quick and computationally inexpensive as possible, since they are called at every epoch.

DataProcessor Objects#

class DataProcessor(ABC)

Processes and optionally analyzes data.

Designed to be used in conjunction with a DataInterface. Must be extended to implement the process() method.

__init__#

def __init__(distrib_args: DistributedPreprocessArguments = None)

Accepts DistributedPreprocessArguments for custom multiprocess rank handling.

process#

@abstractmethod
def process(*args) -> Any

Process data with operations such as loading from a source, parsing, formatting, filtering, or any other step required before Dataset creation.
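As a rough illustration of extending DataProcessor, the sketch below defines a stand-in base class that only mirrors the signatures documented here (it is not the library's code) and a hypothetical LineCleaner subclass. The names LineCleaner and cleaned, and the choice to pass a single list of strings, are assumptions made for this example.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class DataProcessor(ABC):
    """Stand-in mirroring the documented signatures (not the library's code)."""

    def __init__(self, distrib_args=None):
        self.distrib_args = distrib_args

    @abstractmethod
    def process(self, *args) -> Any:
        ...

    def analyze(self) -> Any:
        return None


class LineCleaner(DataProcessor):
    """Hypothetical processor: strips whitespace and drops empty lines."""

    def process(self, *args) -> List[str]:
        (lines,) = args  # this sketch expects exactly one list of strings
        self.cleaned = [line.strip() for line in lines if line.strip()]
        return self.cleaned

    def analyze(self) -> dict:
        # Optional summary of the most recent process() result.
        return {"count": len(self.cleaned)}


cleaner = LineCleaner()
result = cleaner.process(["  hello ", "", "world\n"])
stats = cleaner.analyze()
```

Keeping all task state inside the instance, as here, is one way to satisfy the "fully encapsulate a data processing task" guidance from the module description.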

analyze#

def analyze() -> Any

Optional method for analyzing data.

process_data#

def process_data(*args) -> Any

Process data via a DataProcessor's process() method.

Arguments:

  • data_processor DataProcessor - DataProcessor to call.

Returns:

Result of process() call.

multi_process_data#

def multi_process_data(*args, process_count=1) -> List

Process data with naive multiprocessing, using Python's multiprocessing module.

Calls upon DataProcessor's process() method with any *args provided. All lists within args are split across processes as specified by process_count and executed either synchronously or asynchronously. Any non-list args are sent directly to the process() call. Users are encouraged to pass as args only the objects they wish to divide among processes.

Arguments:

  • data_processor DataProcessor - DataProcessor to call.
  • *args - Arguments to be passed on to DataProcessor's process() method.
  • process_count int, optional - Number of worker processes to create in pool.

Returns:

  • List - Results of the process() call, one per worker process.
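The exact splitting strategy is not specified above. A minimal sketch of one plausible approach, assuming list arguments are cut into contiguous, near-equal chunks and each chunk is handed to one worker, might look like this. The helper names split_into_chunks and naive_multi_process are hypothetical and not part of this module.

```python
from multiprocessing import Pool
from typing import Callable, List


def split_into_chunks(items: List, process_count: int) -> List[List]:
    """Split items into process_count contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), process_count)
    chunks, start = [], 0
    for i in range(process_count):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def naive_multi_process(process_fn: Callable, items: List,
                        process_count: int = 1) -> List:
    """Run process_fn once per chunk in a worker pool; one result per chunk."""
    chunks = split_into_chunks(items, process_count)
    with Pool(processes=process_count) as pool:
        # map() blocks until every worker has returned (the synchronous case).
        return pool.map(process_fn, chunks)
```

In this sketch, process_fn must be picklable (for example, a module-level function), and on platforms that spawn worker processes the call to naive_multi_process should sit under an `if __name__ == "__main__":` guard.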

DataInterface Objects#

class DataInterface(ABC)

Organizer and orchestrator for loading and processing data.

Designed to be used in conjunction with DataProcessors. Abstract methods get_train_dataset() and get_val_dataset() must be implemented to return datasets.

setup_datasets#

def setup_datasets() -> None

Set up the datasets before training.

get_train_dataset#

@abstractmethod
def get_train_dataset(*args, **kwargs) -> Dataset

Returns Dataset for train data.

get_val_dataset#

@abstractmethod
def get_val_dataset(*args, **kwargs) -> Dataset

Returns Dataset for validation data.
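The caching pattern recommended in the module description could be sketched as follows. The DataInterface class here is a stand-in that only mirrors the documented signatures, and ListDataset, CachedInterface, and val_fraction are hypothetical names introduced for illustration; the real Dataset type is whatever the training code expects.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class ListDataset:
    """Hypothetical minimal dataset; stands in for the Dataset return type."""

    def __init__(self, rows: Sequence):
        self.rows = list(rows)

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        return self.rows[idx]


class DataInterface(ABC):
    """Stand-in mirroring the documented signatures (not the library's code)."""

    def setup_datasets(self) -> None:
        pass

    @abstractmethod
    def get_train_dataset(self, *args, **kwargs):
        ...

    @abstractmethod
    def get_val_dataset(self, *args, **kwargs):
        ...


class CachedInterface(DataInterface):
    """Builds both datasets once and returns the stored objects afterwards."""

    def __init__(self, rows: Sequence, val_fraction: float = 0.2):
        self.rows = list(rows)
        self.val_fraction = val_fraction
        self.train_dataset = None
        self.val_dataset = None

    def setup_datasets(self) -> None:
        # Do the expensive work (here only a split) a single time.
        n_val = int(len(self.rows) * self.val_fraction)
        split = len(self.rows) - n_val
        self.train_dataset = ListDataset(self.rows[:split])
        self.val_dataset = ListDataset(self.rows[split:])

    def get_train_dataset(self, *args, **kwargs):
        return self.train_dataset  # cheap: no recomputation per epoch

    def get_val_dataset(self, *args, **kwargs):
        return self.val_dataset


interface = CachedInterface(range(10))
interface.setup_datasets()
```

Because the getters only return stored references, calling them at every epoch costs almost nothing in this sketch.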