DataProcessing

This module contains several transformer classes that aim to change or mitigate certain aspects of a dataset. The goal of this module is to provide a unified interface for different mitigation methods scattered around multiple machine learning libraries, such as scikit-learn, mlxtend, sdv, among others. The advantage of using this module is that it allows users to create their data processing pipelines using a single library, without worrying about making sure that the output from one mitigation (found in one library) is compatible with the output of a second mitigation (found in a separate library). As this library grows, we aim to offer an even larger amount of mitigation, and also offer complex solutions in a simple and easy-to-use interface.

Some of the classes available here are meant to be a wrapper for some other class found in one of these libraries, with the goal of making simplifications to the usage of these approaches, while also automating certain steps. Other classes offer unique solutions to a given aspect found in the area of data preprocessing for tabular data. All classes offer a simple interface, with pre-determined parameter values, making it easier to use when the user doesn’t know (or doesn’t want) to configure all of the available parameters. At the same time, the classes found here provide several different customization options, making the module ideal for those with more experience.

Structure

The dataprocessing module uses a base class (DataProcessing) that implements several tasks shared across the different transformation classes. Whenever possible, the classes implemented in this module follow the .fit() and .transform() behavior from scikit-learn, making these classes compatible with scikit-learn. Each class aims to mitigate a certain aspect of tabular datasets. For each of these aspects, there might be one or more different solutions for it. This way, we create a hierarchy, where each problem category has its own abstract class (that inherits from the main abstract DataProcessing class), where each of these classes has a set of concrete child classes that implements a different solution to the problem at hand. For example, we have an abstract encoding class (DataEncoding), which has concrete child classes that implement different encoding approaches (EncoderOHE and EncoderOrdinal).

Highlights include:

  • A simple interface for mitigation steps that follows the .fit() and .transform() convention.

  • transformer classes that can be combined together in end-to-end mitigation pipelines.

  • Function calls adapted for responsible AI by extending existing calls either with target features or cohorts.

  • Predetermined parameter values, eliminating the need to know or to configure all available parameters.

  • Unique solutions for tabular data.

  • Automation of various mitigation steps, with some transformer classes acting as a wrapper to others in the library.

  • Customization options, helpful for the more experienced AI practitioner.

API

Examples

All notebooks listed here can be found in the folder notebooks/dataprocessing/.

scikit-learn’s Pipeline Support