
automl.data

load_openml_dataset

def load_openml_dataset(dataset_id,
                        data_dir=None,
                        random_state=0,
                        dataset_format="dataframe")

Load a dataset from OpenML.

If the file is not cached locally, download it from OpenML.

Arguments:

  • dataset_id - An integer of the dataset ID on OpenML.
  • data_dir - A string of the path to store and load the data.
  • random_state - An integer of the random seed for splitting data.
  • dataset_format - A string specifying the format of the returned dataset, either 'dataframe' (default) or 'array'. If 'dataframe', the returned dataset is a pandas DataFrame. If 'array', it is a NumPy array or a SciPy sparse matrix.

Returns:

  • X_train - Training data.
  • X_test - Test data.
  • y_train - A series or array of labels for training data.
  • y_test - A series or array of labels for test data.
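
Example (a minimal sketch; it assumes the module is importable as flaml.automl.data, and the dataset ID is an arbitrary illustration):

from flaml.automl.data import load_openml_dataset

# Download OpenML dataset 1169 (or load it from the local cache under
# ./data) and split it into train and test sets.
X_train, X_test, y_train, y_test = load_openml_dataset(
    dataset_id=1169, data_dir="./data"
)
print(X_train.shape, X_test.shape)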

load_openml_task

def load_openml_task(task_id, data_dir)

Load a task from OpenML.

Use the first fold of the task. If the file is not cached locally, download it from OpenML.

Arguments:

  • task_id - An integer of the task ID on OpenML.
  • data_dir - A string of the path to store and load the data.

Returns:

  • X_train - A dataframe of training data.
  • X_test - A dataframe of test data.
  • y_train - A series of labels for training data.
  • y_test - A series of labels for test data.
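
Example (a minimal sketch under the same import-path assumption; the task ID is an arbitrary illustration):

from flaml.automl.data import load_openml_task

# Load the first fold of OpenML task 7592, downloading the files
# if they are not cached under ./data.
X_train, X_test, y_train, y_test = load_openml_task(task_id=7592, data_dir="./data")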

get_output_from_log

def get_output_from_log(filename, time_budget)

Get output from a log file.

Arguments:

  • filename - A string of the log file name.
  • time_budget - A float of the time budget in seconds.

Returns:

  • search_time_list - A list of the finish time of each logged iteration.
  • best_error_list - A list of the best validation error after each logged iteration.
  • error_list - A list of the validation error of each logged iteration.
  • config_list - A list of the estimator, sample size and config of each logged iteration.
  • logged_metric_list - A list of the logged metric of each logged iteration.
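
Example (a hedged sketch of a common use: plotting the best validation error over time; the log file name and time budget are placeholders):

import matplotlib.pyplot as plt
from flaml.automl.data import get_output_from_log

search_time_list, best_error_list, error_list, config_list, metric_list = get_output_from_log(
    filename="automl.log", time_budget=120
)
# Step plot: the best validation error is piecewise constant over time.
plt.step(search_time_list, best_error_list, where="post")
plt.xlabel("wall-clock time (s)")
plt.ylabel("best validation error")
plt.show()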

concat

def concat(X1, X2)

Concatenate two matrices vertically.
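
Example (an illustrative sketch with made-up inputs; it assumes NumPy arrays are accepted alongside pandas DataFrames):

import numpy as np
from flaml.automl.data import concat

X1 = np.ones((2, 3))
X2 = np.zeros((2, 3))
# Row-wise (vertical) concatenation: the result has 4 rows and 3 columns.
X = concat(X1, X2)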

DataTransformer Objects

class DataTransformer()

Transform input training data.

fit_transform

def fit_transform(X: Union[DataFrame, np.ndarray], y, task: Union[str, "Task"])

Fit the transformer and process the input training data according to the task type.

Arguments:

  • X - A numpy array or a pandas dataframe of training data.
  • y - A numpy array or a pandas series of labels.
  • task - An instance of type Task, or a str such as 'classification', 'regression'.

Returns:

  • X - Processed numpy array or pandas dataframe of training data.
  • y - Processed numpy array or pandas series of labels.

transform

def transform(X: Union[DataFrame, np.ndarray])

Process data using the fitted transformer.

Arguments:

  • X - A numpy array or a pandas dataframe of data to process.

Returns:

  • X - Processed numpy array or pandas dataframe of training data.
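
Example (a hedged end-to-end sketch with a made-up toy DataFrame): fit the transformer on the training split, then apply the already-fitted transformations to new data.

import pandas as pd
from flaml.automl.data import DataTransformer

X_train = pd.DataFrame({"age": [25, 32, None], "city": ["NY", "SF", "NY"]})
y_train = pd.Series([0, 1, 0])

dt = DataTransformer()
# Fit on the training data and transform it according to the task type.
X_train_t, y_train_t = dt.fit_transform(X_train, y_train, task="classification")
# Reuse the fitted transformer on unseen data.
X_test = pd.DataFrame({"age": [41], "city": ["SF"]})
X_test_t = dt.transform(X_test)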

get_random_dataframe

def get_random_dataframe(n_rows: int = 200,
                         ratio_none: float = 0.1,
                         seed: int = 42) -> DataFrame

Generate a random pandas DataFrame with various data types for testing. This function creates a DataFrame with multiple column types, including:

  • Timestamps
  • Integers
  • Floats
  • Categorical values
  • Booleans
  • Lists (tags)
  • Decimal strings
  • UUIDs
  • Binary data (as hex strings)
  • JSON blobs
  • Nullable text fields

Arguments:

  • n_rows - An integer of the number of rows in the generated DataFrame. Default is 200.
  • ratio_none - A float of the probability of generating None values in applicable columns. Default is 0.1.
  • seed - An integer of the random seed for reproducibility. Default is 42.

Returns:

  • DataFrame - A pandas DataFrame with 14 columns of various data types.

Examples:

>>> df = get_random_dataframe(100, 0.05, 123)
>>> df.shape
(100, 14)
>>> df.dtypes
timestamp        datetime64[ns]
id                        int64
score                   float64
status                   object
flag                     object
count                    object
value                    object
tags                     object
rating                   object
uuid                     object
binary                   object
json_blob                object
category               category
nullable_text            object
dtype: object

auto_convert_dtypes_spark

def auto_convert_dtypes_spark(
        df: psDataFrame,
        na_values: list = None,
        category_threshold: float = 0.3,
        convert_threshold: float = 0.6,
        sample_ratio: float = 0.1) -> tuple[psDataFrame, dict]

Automatically convert data types in a PySpark DataFrame using heuristics.

This function analyzes a sample of the DataFrame to infer appropriate data types and applies the conversions. It handles timestamps, numeric values, booleans, and categorical fields.

Arguments:

  • df - A PySpark DataFrame to convert.
  • na_values - List of strings to be considered as NA/NaN. Defaults to ['NA', 'na', 'NULL', 'null', ''].
  • category_threshold - Maximum ratio of unique values to total values to consider a column categorical. Defaults to 0.3.
  • convert_threshold - Minimum ratio of successfully converted values required to apply a type conversion. Defaults to 0.6.
  • sample_ratio - Fraction of data to sample for type inference. Defaults to 0.1.

Returns:

  • tuple - (the DataFrame with converted types, a dictionary mapping column names to their inferred types as strings)

Notes:

  • 'category' in the schema dict is conceptual, as PySpark doesn't have a true category type like pandas does
  • The function uses sampling for efficiency with large datasets
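
Example (a hedged sketch; it assumes psDataFrame refers to a pandas-on-Spark DataFrame, i.e. pyspark.pandas.DataFrame, and that a Spark session is available):

import pyspark.pandas as ps
from flaml.automl.data import auto_convert_dtypes_spark

# A toy DataFrame where every column arrives as strings.
df = ps.DataFrame({
    "when": ["2021-01-01", "2021-01-02", "NA"],
    "n": ["1", "2", "3"],
    "flag": ["true", "false", "true"],
})
converted, schema = auto_convert_dtypes_spark(df, sample_ratio=1.0)
print(schema)  # mapping of column name -> inferred type string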

auto_convert_dtypes_pandas

def auto_convert_dtypes_pandas(
        df: DataFrame,
        na_values: list = None,
        category_threshold: float = 0.3,
        convert_threshold: float = 0.6,
        sample_ratio: float = 1.0) -> tuple[DataFrame, dict]

Automatically convert data types in a pandas DataFrame using heuristics.

This function analyzes the DataFrame to infer appropriate data types and applies the conversions. It handles timestamps, timedeltas, numeric values, and categorical fields.

Arguments:

  • df - A pandas DataFrame to convert.
  • na_values - List of strings to be considered as NA/NaN. Defaults to ['NA', 'na', 'NULL', 'null', ''].
  • category_threshold - Maximum ratio of unique values to total values to consider a column categorical. Defaults to 0.3.
  • convert_threshold - Minimum ratio of successfully converted values required to apply a type conversion. Defaults to 0.6.
  • sample_ratio - Fraction of data to sample for type inference. Not used in pandas version but included for API compatibility. Defaults to 1.0.

Returns:

  • tuple - (the DataFrame with converted types, a dictionary mapping column names to their inferred types as strings)
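
Example (a hedged sketch with made-up values; the raw columns arrive as strings and are converted by the heuristics described above):

import pandas as pd
from flaml.automl.data import auto_convert_dtypes_pandas

raw = pd.DataFrame({
    "when": ["2021-01-01", "2021-01-02", "NULL"],
    "n": ["1", "2", "3"],
    "group": ["a", "b", "a"],
})
converted, schema = auto_convert_dtypes_pandas(raw)
print(converted.dtypes)  # e.g. datetime, numeric and category dtypes
print(schema)            # mapping of column name -> inferred type string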