
automl.data

load_openml_dataset

def load_openml_dataset(dataset_id,
                        data_dir=None,
                        random_state=0,
                        dataset_format="dataframe")

Load a dataset from OpenML.

If the file is not cached locally, download it from OpenML.

Arguments:

  • dataset_id - An integer of the dataset ID on OpenML.
  • data_dir - A string of the path to store and load the data.
  • random_state - An integer of the random seed for splitting data.
  • dataset_format - A string specifying the format of the returned dataset, either 'dataframe' (default) or 'array'. If 'dataframe', the returned dataset is a pandas DataFrame. If 'array', it is a NumPy array or a SciPy sparse matrix.

Returns:

  • X_train - Training data.
  • X_test - Test data.
  • y_train - A series or array of labels for training data.
  • y_test - A series or array of labels for test data.
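
Example (a minimal sketch; it assumes the module is importable as flaml.automl.data, and the dataset ID is an arbitrary illustration):

from flaml.automl.data import load_openml_dataset

# Download OpenML dataset 1169 (or load it from the local cache under
# ./data) and split it into train and test sets.
X_train, X_test, y_train, y_test = load_openml_dataset(
    dataset_id=1169, data_dir="./data"
)
print(X_train.shape, X_test.shape)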

load_openml_task

def load_openml_task(task_id, data_dir)

Load a task from OpenML.

Use the first fold of the task. If the file is not cached locally, download it from OpenML.

Arguments:

  • task_id - An integer of the task ID on OpenML.
  • data_dir - A string of the path to store and load the data.

Returns:

  • X_train - A dataframe of training data.
  • X_test - A dataframe of test data.
  • y_train - A series of labels for training data.
  • y_test - A series of labels for test data.
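
Example (a minimal sketch under the same import-path assumption; the task ID is an arbitrary illustration):

from flaml.automl.data import load_openml_task

# Load the first fold of OpenML task 7592, downloading the files
# if they are not cached under ./data.
X_train, X_test, y_train, y_test = load_openml_task(task_id=7592, data_dir="./data")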

get_output_from_log

def get_output_from_log(filename, time_budget)

Get output from a log file.

Arguments:

  • filename - A string of the log file name.
  • time_budget - A float of the time budget in seconds.

Returns:

  • search_time_list - A list of the finish time of each logged iteration.
  • best_error_list - A list of the best validation error after each logged iteration.
  • error_list - A list of the validation error of each logged iteration.
  • config_list - A list of the estimator, sample size and config of each logged iteration.
  • logged_metric_list - A list of the logged metric of each logged iteration.
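
Example (a hedged sketch of a common use: plotting the best validation error over time; the log file name and time budget are placeholders):

import matplotlib.pyplot as plt
from flaml.automl.data import get_output_from_log

search_time_list, best_error_list, error_list, config_list, metric_list = get_output_from_log(
    filename="automl.log", time_budget=120
)
# Step plot: the best validation error is piecewise constant over time.
plt.step(search_time_list, best_error_list, where="post")
plt.xlabel("wall-clock time (s)")
plt.ylabel("best validation error")
plt.show()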

concat

def concat(X1, X2)

Concatenate two matrices vertically.
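
Example (an illustrative sketch with made-up inputs; it assumes NumPy arrays are accepted alongside pandas DataFrames):

import numpy as np
from flaml.automl.data import concat

X1 = np.ones((2, 3))
X2 = np.zeros((2, 3))
# Row-wise (vertical) concatenation: the result has 4 rows and 3 columns.
X = concat(X1, X2)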

DataTransformer Objects

class DataTransformer()

Transform input training data.

fit_transform

def fit_transform(X: Union[DataFrame, np.ndarray], y, task: Union[str, "Task"])

Fit the transformer and process the input training data according to the task type.

Arguments:

  • X - A numpy array or a pandas dataframe of training data.
  • y - A numpy array or a pandas series of labels.
  • task - An instance of type Task, or a str such as 'classification', 'regression'.

Returns:

  • X - Processed numpy array or pandas dataframe of training data.
  • y - Processed numpy array or pandas series of labels.

transform

def transform(X: Union[DataFrame, np.ndarray])

Process data using the fitted transformer.

Arguments:

  • X - A numpy array or a pandas dataframe of data to process.

Returns:

  • X - Processed numpy array or pandas dataframe of training data.
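
Example (a hedged end-to-end sketch with a made-up toy DataFrame): fit the transformer on the training split, then apply the already-fitted transformations to new data.

import pandas as pd
from flaml.automl.data import DataTransformer

X_train = pd.DataFrame({"age": [25, 32, None], "city": ["NY", "SF", "NY"]})
y_train = pd.Series([0, 1, 0])

dt = DataTransformer()
# Fit on the training data and transform it according to the task type.
X_train_t, y_train_t = dt.fit_transform(X_train, y_train, task="classification")
# Reuse the fitted transformer on unseen data.
X_test = pd.DataFrame({"age": [41], "city": ["SF"]})
X_test_t = dt.transform(X_test)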

get_random_dataframe

def get_random_dataframe(n_rows: int = 200,
                         ratio_none: float = 0.1,
                         seed: int = 42) -> DataFrame

Generate a random pandas DataFrame with various data types for testing. This function creates a DataFrame with multiple column types, including:

  • Timestamps
  • Integers
  • Floats
  • Categorical values
  • Booleans
  • Lists (tags)
  • Decimal strings
  • UUIDs
  • Binary data (as hex strings)
  • JSON blobs
  • Nullable text fields

Arguments:

  • n_rows - An integer of the number of rows in the generated DataFrame. Default is 200.
  • ratio_none - A float of the probability of generating None values in applicable columns. Default is 0.1.
  • seed - An integer of the random seed for reproducibility. Default is 42.

Returns:

  • DataFrame - A pandas DataFrame with 14 columns of various data types.

Examples:

>>> df = get_random_dataframe(100, 0.05, 123)
>>> df.shape
(100, 14)
>>> df.dtypes
timestamp        datetime64[ns]
id                        int64
score                   float64
status                   object
flag                     object
count                    object
value                    object
tags                     object
rating                   object
uuid                     object
binary                   object
json_blob                object
category               category
nullable_text            object
dtype: object

auto_convert_dtypes_spark

def auto_convert_dtypes_spark(
        df: psDataFrame,
        na_values: list = None,
        category_threshold: float = 0.3,
        convert_threshold: float = 0.6,
        sample_ratio: float = 0.1) -> tuple[psDataFrame, dict]

Automatically convert data types in a PySpark DataFrame using heuristics.

This function analyzes a sample of the DataFrame to infer appropriate data types and applies the conversions. It handles timestamps, numeric values, booleans, and categorical fields.

Arguments:

  • df - A PySpark DataFrame to convert.
  • na_values - List of strings to be considered as NA/NaN. Defaults to ['NA', 'na', 'NULL', 'null', ''].
  • category_threshold - Maximum ratio of unique values to total values to consider a column categorical. Defaults to 0.3.
  • convert_threshold - Minimum ratio of successfully converted values required to apply a type conversion. Defaults to 0.6.
  • sample_ratio - Fraction of data to sample for type inference. Defaults to 0.1.

Returns:

  • tuple - (the DataFrame with converted types, a dictionary mapping column names to their inferred types as strings)

Notes:

  • 'category' in the schema dict is conceptual, as PySpark doesn't have a true category type like pandas does
  • The function uses sampling for efficiency with large datasets
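
Example (a hedged sketch; it assumes psDataFrame refers to a pandas-on-Spark DataFrame, i.e. pyspark.pandas.DataFrame, and that a Spark session is available):

import pyspark.pandas as ps
from flaml.automl.data import auto_convert_dtypes_spark

# A toy DataFrame where every column arrives as strings.
df = ps.DataFrame({
    "when": ["2021-01-01", "2021-01-02", "NA"],
    "n": ["1", "2", "3"],
    "flag": ["true", "false", "true"],
})
converted, schema = auto_convert_dtypes_spark(df, sample_ratio=1.0)
print(schema)  # mapping of column name -> inferred type string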

auto_convert_dtypes_pandas

def auto_convert_dtypes_pandas(
        df: DataFrame,
        na_values: list = None,
        category_threshold: float = 0.3,
        convert_threshold: float = 0.6,
        sample_ratio: float = 1.0) -> tuple[DataFrame, dict]

Automatically convert data types in a pandas DataFrame using heuristics.

This function analyzes the DataFrame to infer appropriate data types and applies the conversions. It handles timestamps, timedeltas, numeric values, and categorical fields.

Arguments:

  • df - A pandas DataFrame to convert.
  • na_values - List of strings to be considered as NA/NaN. Defaults to ['NA', 'na', 'NULL', 'null', ''].
  • category_threshold - Maximum ratio of unique values to total values to consider a column categorical. Defaults to 0.3.
  • convert_threshold - Minimum ratio of successfully converted values required to apply a type conversion. Defaults to 0.6.
  • sample_ratio - Fraction of data to sample for type inference. Not used in pandas version but included for API compatibility. Defaults to 1.0.

Returns:

  • tuple - (the DataFrame with converted types, a dictionary mapping column names to their inferred types as strings)
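
Example (a hedged sketch with made-up values; the raw columns arrive as strings and are converted by the heuristics described above):

import pandas as pd
from flaml.automl.data import auto_convert_dtypes_pandas

raw = pd.DataFrame({
    "when": ["2021-01-01", "2021-01-02", "NULL"],
    "n": ["1", "2", "3"],
    "group": ["a", "b", "a"],
})
converted, schema = auto_convert_dtypes_pandas(raw)
print(converted.dtypes)  # e.g. datetime, numeric and category dtypes
print(schema)            # mapping of column name -> inferred type string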