automl.data
load_openml_dataset
def load_openml_dataset(dataset_id,
                        data_dir=None,
                        random_state=0,
                        dataset_format="dataframe")
Load dataset from OpenML.
If the file is not cached locally, download it from OpenML.
Arguments:
dataset_id
- An integer of the dataset id in OpenML.
data_dir
- A string of the path to store and load the data.
random_state
- An integer of the random seed for splitting data.
dataset_format
- A string specifying the format of the returned dataset. Default is 'dataframe'. Can choose from ['dataframe', 'array']. If 'dataframe', the returned dataset will be a pandas DataFrame. If 'array', the returned dataset will be a NumPy array or a SciPy sparse matrix.
Returns:
X_train
- Training data.
X_test
- Test data.
y_train
- A series or array of labels for training data.
y_test
- A series or array of labels for test data.
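Example: a minimal usage sketch, assuming the module is importable as flaml.automl.data; the dataset id and cache directory are illustrative.

from flaml.automl.data import load_openml_dataset

# Download dataset 1169 from OpenML (or load it from the local cache under
# ./openml_cache) and split it into train/test sets as pandas objects.
X_train, X_test, y_train, y_test = load_openml_dataset(
    dataset_id=1169,
    data_dir="./openml_cache",
    random_state=0,
    dataset_format="dataframe",
)
print(X_train.shape, X_test.shape)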
load_openml_task
def load_openml_task(task_id, data_dir)
Load task from OpenML.
Use the first fold of the task. If the file is not cached locally, download it from OpenML.
Arguments:
task_id
- An integer of the task id in OpenML.
data_dir
- A string of the path to store and load the data.
Returns:
X_train
- A dataframe of training data.
X_test
- A dataframe of test data.
y_train
- A series of labels for training data.
y_test
- A series of labels for test data.
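Example: a minimal sketch, assuming the same flaml.automl.data import path; the task id is illustrative and can be any OpenML task id.

from flaml.automl.data import load_openml_task

# Load the first fold of an OpenML task, caching the files under ./openml_cache.
X_train, X_test, y_train, y_test = load_openml_task(task_id=7592, data_dir="./openml_cache")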
get_output_from_log
def get_output_from_log(filename, time_budget)
Get output from log file.
Arguments:
filename
- A string of the log file name.
time_budget
- A float of the time budget in seconds.
Returns:
search_time_list
- A list of the finished time of each logged iteration.
best_error_list
- A list of the best validation error after each logged iteration.
error_list
- A list of the validation error of each logged iteration.
config_list
- A list of the estimator, sample size and config of each logged iteration.
logged_metric_list
- A list of the logged metric of each logged iteration.
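Example: a minimal sketch for inspecting the search trajectory recorded in a log file; "flaml.log" is a placeholder for a log produced by an earlier run with logging enabled, and the import path is assumed as above.

from flaml.automl.data import get_output_from_log

(search_time_list, best_error_list, error_list,
 config_list, logged_metric_list) = get_output_from_log(
    filename="flaml.log", time_budget=120)

# Print how the best validation error improved over the search.
for t, err in zip(search_time_list, best_error_list):
    print(f"{t:7.1f}s  best validation error so far: {err:.4f}")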
concat
def concat(X1, X2)
Concatenate two matrices vertically.
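Example: a minimal sketch with two small pandas DataFrames that share the same columns (the import path is assumed as above).

import pandas as pd
from flaml.automl.data import concat

X1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
X2 = pd.DataFrame({"a": [5, 6], "b": [7.0, 8.0]})

# Stack the two matrices row-wise; the result has 4 rows and the same columns.
X_all = concat(X1, X2)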
DataTransformer Objects
class DataTransformer()
Transform input training data.
fit_transform
def fit_transform(X: Union[DataFrame, np.ndarray], y, task: Union[str, "Task"])
Fit transformer and process the input training data according to the task type.
Arguments:
X
- A numpy array or a pandas dataframe of training data.
y
- A numpy array or a pandas series of labels.
task
- An instance of type Task, or a str such as 'classification', 'regression'.
Returns:
X
- Processed numpy array or pandas dataframe of training data.
y
- Processed numpy array or pandas series of labels.
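Example: a minimal sketch of fitting the transformer on a small mixed-type training set (import path assumed as above).

import pandas as pd
from flaml.automl.data import DataTransformer

X = pd.DataFrame({
    "age": [31, 45, 28, 52],
    "city": ["NYC", "SF", "NYC", "LA"],  # a non-numeric column to be encoded
})
y = pd.Series([0, 1, 0, 1])

dt = DataTransformer()
# Fit the transformer and obtain the processed features and labels.
X_processed, y_processed = dt.fit_transform(X, y, task="classification")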
transform
def transform(X: Union[DataFrame, np.ndarray])
Process data using the fitted transformer.
Arguments:
X
- A numpy array or a pandas dataframe of data to transform.
Returns:
X
- Processed numpy array or pandas dataframe of the input data.
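Example: continuing the fit_transform sketch above, apply the fitted transformer to new data with the same schema.

# e.g. a held-out row with the same columns as the training data
X_new = pd.DataFrame({"age": [39], "city": ["SF"]})
X_new_processed = dt.transform(X_new)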
get_random_dataframe
def get_random_dataframe(n_rows: int = 200,
                         ratio_none: float = 0.1,
                         seed: int = 42) -> DataFrame
Generate a random pandas DataFrame with various data types for testing. This function creates a DataFrame with multiple column types including:
- Timestamps
- Integers
- Floats
- Categorical values
- Booleans
- Lists (tags)
- Decimal strings
- UUIDs
- Binary data (as hex strings)
- JSON blobs
- Nullable text fields
Parameters
n_rows : int, default=200
- Number of rows in the generated DataFrame.
ratio_none : float, default=0.1
- Probability of generating None values in applicable columns.
seed : int, default=42
- Random seed for reproducibility.
Returns
pd.DataFrame
- A DataFrame with 14 columns of various data types.
Examples
>>> df = get_random_dataframe(100, 0.05, 123)
>>> df.shape
(100, 14)
>>> df.dtypes
timestamp        datetime64[ns]
id                        int64
score                   float64
status                   object
flag                     object
count                    object
value                    object
tags                     object
rating                   object
uuid                     object
binary                   object
json_blob                object
category               category
nullable_text            object
dtype: object
auto_convert_dtypes_spark
def auto_convert_dtypes_spark(
        df: psDataFrame,
        na_values: list = None,
        category_threshold: float = 0.3,
        convert_threshold: float = 0.6,
        sample_ratio: float = 0.1) -> tuple[psDataFrame, dict]
Automatically convert data types in a PySpark DataFrame using heuristics.
This function analyzes a sample of the DataFrame to infer appropriate data types and applies the conversions. It handles timestamps, numeric values, booleans, and categorical fields.
Arguments:
df
- A PySpark DataFrame to convert.
na_values
- List of strings to be considered as NA/NaN. Defaults to ['NA', 'na', 'NULL', 'null', ''].
category_threshold
- Maximum ratio of unique values to total values to consider a column categorical. Defaults to 0.3.
convert_threshold
- Minimum ratio of successfully converted values required to apply a type conversion. Defaults to 0.6.
sample_ratio
- Fraction of data to sample for type inference. Defaults to 0.1.
Returns:
tuple
- A tuple of (the DataFrame with converted types, a dictionary mapping column names to their inferred types as strings).
Notes:
- 'category' in the schema dict is conceptual, as PySpark doesn't have a true category type like pandas does
- The function uses sampling for efficiency with large datasets
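Example: a minimal sketch, assuming psDataFrame refers to a pyspark.pandas DataFrame, a running Spark environment, and the flaml.automl.data import path; the inferred type names printed from the schema dict are illustrative.

import pyspark.pandas as ps
from flaml.automl.data import auto_convert_dtypes_spark

psdf = ps.DataFrame({
    "when":  ["2024-01-01", "2024-01-02", "NA"],
    "count": ["1", "2", "3"],
    "flag":  ["true", "false", "true"],
})

# Use the whole (tiny) frame for inference instead of a 10% sample.
converted, schema = auto_convert_dtypes_spark(psdf, sample_ratio=1.0)
print(schema)  # e.g. timestamp/numeric/boolean entries; exact names may differ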
auto_convert_dtypes_pandas
def auto_convert_dtypes_pandas(
        df: DataFrame,
        na_values: list = None,
        category_threshold: float = 0.3,
        convert_threshold: float = 0.6,
        sample_ratio: float = 1.0) -> tuple[DataFrame, dict]
Automatically convert data types in a pandas DataFrame using heuristics.
This function analyzes the DataFrame to infer appropriate data types and applies the conversions. It handles timestamps, timedeltas, numeric values, and categorical fields.
Arguments:
df
- A pandas DataFrame to convert.
na_values
- List of strings to be considered as NA/NaN. Defaults to ['NA', 'na', 'NULL', 'null', ''].
category_threshold
- Maximum ratio of unique values to total values to consider a column categorical. Defaults to 0.3.
convert_threshold
- Minimum ratio of successfully converted values required to apply a type conversion. Defaults to 0.6.
sample_ratio
- Fraction of data to sample for type inference. Not used in the pandas version but included for API compatibility. Defaults to 1.0.
Returns:
tuple
- A tuple of (the DataFrame with converted types, a dictionary mapping column names to their inferred types as strings).
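Example: a minimal sketch on a small string-typed frame (import path assumed as above); the exact dtypes chosen depend on the thresholds and the conversion heuristics.

import pandas as pd
from flaml.automl.data import auto_convert_dtypes_pandas

df = pd.DataFrame({
    "when":   ["2024-01-01", "2024-01-02", "NULL"],
    "amount": ["1.5", "2.0", "NA"],
    "status": ["open", "closed", "open"],
})

converted, schema = auto_convert_dtypes_pandas(df)
print(converted.dtypes)  # e.g. 'when' as datetime, 'amount' as float, 'status' as category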