Toy Dataset

raimitigations.utils.toy_dataset_corr.create_dummy_dataset(samples: int, n_features: int, n_num_num: int, n_cat_num: int, n_cat_cat: int, num_num_noise: list = [0.1, 0.2], pct_change: list = [0.1, 0.3], n_classes: int = 2, regression: bool = False)

Creates an artificial dataset with numerical and categorical features that includes several pairs of correlated features. Each correlated pair consists of two numerical features, two categorical features, or one numerical and one categorical feature.

Parameters
  • samples – the number of samples to be created;

  • n_features – the number of numerical features to be created;

  • n_num_num – the number of pairs of correlated features in which both features are numerical;

  • n_cat_num – the number of pairs of correlated features in which one feature is numerical and the other is categorical;

  • n_cat_cat – the number of pairs of correlated features in which both features are categorical;

  • num_num_noise – a list with two values, where num_num_noise[0] < num_num_noise[1] and both values must lie in the range [0, 1]. The ith new numerical feature is created by copying the ith existing numerical feature in the dataset and adding noise to it. The standard deviation used for generating the noise is a random value between num_num_noise[0] and num_num_noise[1];

  • pct_change – a list with two values, where pct_change[0] < pct_change[1] and both values must lie in the range [0, 1]. For each categorical feature created, a fraction p of its values is swapped randomly. Here, p is a value selected randomly in the range [pct_change[0], pct_change[1]];

  • n_classes – the number of classes in the label column. This parameter is ignored if regression is set to True;

  • regression – if True, the label column consists of float values. If False, the label column is created to resemble a classification task.
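
The roles of num_num_noise and pct_change described above can be illustrated with a small standard-library sketch. This is not the library's implementation: the helper names add_noisy_copy, swap_fraction and pearson are hypothetical, the noise is assumed to be zero-mean Gaussian, and "swapping" is assumed to mean redrawing the selected entries from the observed categories.

```python
import math
import random


def add_noisy_copy(values, noise_range, rng):
    """Return a noisy copy of a numerical column and the std that was drawn."""
    # Draw the noise standard deviation from [noise_range[0], noise_range[1]].
    std = rng.uniform(noise_range[0], noise_range[1])
    # Assumption: the noise is zero-mean Gaussian.
    return [v + rng.gauss(0.0, std) for v in values], std


def swap_fraction(values, pct_range, rng):
    """Return a copy of a categorical column with a fraction p of entries redrawn."""
    # Draw the fraction p from [pct_range[0], pct_range[1]].
    p = rng.uniform(pct_range[0], pct_range[1])
    out = list(values)
    categories = sorted(set(values))
    n_swap = round(p * len(out))
    # Assumption: "swapping" means redrawing the chosen entries uniformly
    # from the observed categories, so some entries may keep their value.
    for i in rng.sample(range(len(out)), n_swap):
        out[i] = rng.choice(categories)
    return out, p


def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(0)

# Numerical pair: a base column and its noisy copy.
base_num = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy_num, std = add_noisy_copy(base_num, [0.1, 0.2], rng)

# Categorical pair: a base column and a copy with a fraction p redrawn.
base_cat = [rng.choice(["a", "b", "c"]) for _ in range(1000)]
swapped_cat, p = swap_fraction(base_cat, [0.1, 0.3], rng)
changed = sum(a != b for a, b in zip(base_cat, swapped_cat))

print(f"noise std={std:.3f}, correlation={pearson(base_num, noisy_num):.3f}")
print(f"p={p:.3f}, entries changed={changed} of {len(base_cat)}")
```

In this sketch, smaller noise ranges and smaller swap fractions keep each copy closer to its source column, so the resulting pair is more strongly correlated.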

Returns

a dataframe containing correlated features.

Return type

pd.DataFrame
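
Example

A minimal usage sketch based on the signature above. The argument values are illustrative only, and the sketch assumes that six base numerical features are enough to supply the requested correlated pairs. The import is guarded so that the snippet is skipped when the raimitigations package is not installed.

```python
try:
    from raimitigations.utils.toy_dataset_corr import create_dummy_dataset
except ImportError:
    # The package is not installed; the call below is skipped.
    create_dummy_dataset = None

df = None
if create_dummy_dataset is not None:
    df = create_dummy_dataset(
        samples=3000,                # number of rows
        n_features=6,                # base numerical features
        n_num_num=2,                 # numerical-numerical correlated pairs
        n_cat_num=2,                 # categorical-numerical correlated pairs
        n_cat_cat=2,                 # categorical-categorical correlated pairs
        num_num_noise=[0.01, 0.05],  # range for the noise standard deviation
        pct_change=[0.05, 0.1],      # range for the swapped fraction p
        n_classes=2,                 # ignored when regression=True
        regression=False,            # build a classification-style label
    )
    # Inspect the generated columns and the first few rows.
    print(df.shape)
    print(df.head())
```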