SeqFeatSelection Example
[1]:
import sys
sys.path.append('../../../notebooks')
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import SeqFeatSelection
from download import download_datasets
1 - Dataset with Headers
[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.drop(columns=['employee_id'], inplace=True)
dataset
[2]:
| | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
This class implements the sequential feature selection method. It wraps the SequentialFeatureSelector class from the mlxtend library and offers some simplifications and abstractions on top of it.
We can instantiate this subclass with its default parameters and pass the dataframe only when calling the .fit() method. One option is to pass the whole dataset together with the label column, using the “df=” and “label_col=” parameters.
[3]:
feat_sel = SeqFeatSelection(n_jobs=4)
feat_sel.fit(df=dataset, label_col='is_promoted')
feat_sel.get_selected_features()
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.8s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.8s finished
[2022-07-27 10:46:16] Features: 1/12 -- score: 0.6895577479869736[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 11 out of 11 | elapsed: 0.1s finished
[2022-07-27 10:46:16] Features: 2/12 -- score: 0.7872499546962967[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 0.2s finished
[2022-07-27 10:46:16] Features: 3/12 -- score: 0.8804506473945294[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 7 out of 9 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 9 out of 9 | elapsed: 0.2s finished
[2022-07-27 10:46:17] Features: 4/12 -- score: 0.8807940412342802[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 6 out of 8 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 8 out of 8 | elapsed: 0.2s finished
[2022-07-27 10:46:17] Features: 5/12 -- score: 0.8691371715375045[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 7 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 7 out of 7 | elapsed: 0.2s finished
[2022-07-27 10:46:17] Features: 6/12 -- score: 0.8522627614734568[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 3 out of 6 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 6 out of 6 | elapsed: 0.2s finished
[2022-07-27 10:46:17] Features: 7/12 -- score: 0.826864693465736[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 0.2s finished
[2022-07-27 10:46:18] Features: 8/12 -- score: 0.7964693834478135[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 4 | elapsed: 0.1s finished
[2022-07-27 10:46:18] Features: 9/12 -- score: 0.7454954052081013[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done 3 out of 3 | elapsed: 0.2s finished
[2022-07-27 10:46:18] Features: 10/12 -- score: 0.6768912367800968[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 0.7s finished
[2022-07-27 10:46:19] Features: 11/12 -- score: 0.6747642727989089[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s finished
[2022-07-27 10:46:19] Features: 12/12 -- score: 0.65839797929155
[3]:
['department', 'KPIs_met >80%', 'awards_won?', 'avg_training_score']
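The log above shows the search growing the feature set one feature at a time (1/12 up to 12/12), and the returned list matches the subset size with the highest score. A minimal, library-free sketch of that greedy forward search is shown below. The `forward_selection` helper, the `toy_score` function and the feature names are hypothetical and only illustrate the idea; this is not the code used by SeqFeatSelection or mlxtend.

```python
from typing import Callable, Dict, List, Tuple


def forward_selection(
    features: List[str],
    score_fn: Callable[[Tuple[str, ...]], float],
) -> Dict[int, Tuple[Tuple[str, ...], float]]:
    """Greedy forward search: grow the subset one feature at a time."""
    selected: Tuple[str, ...] = ()
    remaining = list(features)
    history: Dict[int, Tuple[Tuple[str, ...], float]] = {}
    while remaining:
        # Try each remaining feature and keep the one giving the best score.
        best_feat = max(remaining, key=lambda f: score_fn(selected + (f,)))
        selected = selected + (best_feat,)
        remaining.remove(best_feat)
        history[len(selected)] = (selected, score_fn(selected))
    return history


# Hypothetical stand-in for a cross-validated model score:
# two features help, the third one hurts.
GAIN = {"kpi": 0.3, "score": 0.2, "age": -0.1}


def toy_score(subset: Tuple[str, ...]) -> float:
    return 0.5 + sum(GAIN[f] for f in subset)


history = forward_selection(["age", "kpi", "score"], toy_score)

# Keep the subset size whose score is highest.
best_size = max(history, key=lambda k: history[k][1])
print(best_size, history[best_size][0])
```

In a real run, `score_fn` would train and cross-validate the estimator on each candidate subset, which is why every step in the log evaluates one candidate per remaining feature.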
After calling the fit() method, we can access the summary generated by the SequentialFeatureSelector class (from the mlxtend library) used inside our SeqFeatSelection. The get_summary() method returns this summary as a dictionary with the following structure: each key corresponds to a different set of features tested, and each value is a secondary dictionary holding the relevant data for that run, such as the features used and the score obtained in each cross-validation fold. These scores are computed with the metric specified by the scoring parameter.
[4]:
feat_sel.get_summary()
[4]:
{1: {'feature_idx': (9,),
'cv_scores': array([0.68984489, 0.69184529, 0.68698306]),
'avg_score': 0.6895577479869736,
'feature_names': ('KPIs_met >80%',)},
2: {'feature_idx': (9, 11),
'cv_scores': array([0.78949578, 0.78515037, 0.78710372]),
'avg_score': 0.7872499546962967,
'feature_names': ('KPIs_met >80%', 'avg_training_score')},
3: {'feature_idx': (0, 9, 11),
'cv_scores': array([0.88126278, 0.87977412, 0.88031504]),
'avg_score': 0.8804506473945294,
'feature_names': ('department', 'KPIs_met >80%', 'avg_training_score')},
4: {'feature_idx': (0, 9, 10, 11),
'cv_scores': array([0.87817067, 0.8822343 , 0.88197716]),
'avg_score': 0.8807940412342802,
'feature_names': ('department',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
5: {'feature_idx': (0, 3, 9, 10, 11),
'cv_scores': array([0.86445397, 0.87210534, 0.87085221]),
'avg_score': 0.8691371715375045,
'feature_names': ('department',
'gender',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
6: {'feature_idx': (0, 3, 5, 9, 10, 11),
'cv_scores': array([0.85145133, 0.8523202 , 0.85301676]),
'avg_score': 0.8522627614734568,
'feature_names': ('department',
'gender',
'no_of_trainings',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
7: {'feature_idx': (0, 3, 4, 5, 9, 10, 11),
'cv_scores': array([0.82663561, 0.82188557, 0.8320729 ]),
'avg_score': 0.826864693465736,
'feature_names': ('department',
'gender',
'recruitment_channel',
'no_of_trainings',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
8: {'feature_idx': (0, 2, 3, 4, 5, 9, 10, 11),
'cv_scores': array([0.78944963, 0.79337527, 0.80658324]),
'avg_score': 0.7964693834478135,
'feature_names': ('department',
'education',
'gender',
'recruitment_channel',
'no_of_trainings',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
9: {'feature_idx': (0, 2, 3, 4, 5, 7, 9, 10, 11),
'cv_scores': array([0.75337184, 0.73867261, 0.74444176]),
'avg_score': 0.7454954052081013,
'feature_names': ('department',
'education',
'gender',
'recruitment_channel',
'no_of_trainings',
'previous_year_rating',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
10: {'feature_idx': (0, 2, 3, 4, 5, 7, 8, 9, 10, 11),
'cv_scores': array([0.6835053 , 0.67103064, 0.67613777]),
'avg_score': 0.6768912367800968,
'feature_names': ('department',
'education',
'gender',
'recruitment_channel',
'no_of_trainings',
'previous_year_rating',
'length_of_service',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
11: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11),
'cv_scores': array([0.67921611, 0.67021602, 0.67486069]),
'avg_score': 0.6747642727989089,
'feature_names': ('department',
'region',
'education',
'gender',
'recruitment_channel',
'no_of_trainings',
'previous_year_rating',
'length_of_service',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')},
12: {'feature_idx': (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
'cv_scores': array([0.65248237, 0.66616091, 0.65655066]),
'avg_score': 0.65839797929155,
'feature_names': ('department',
'region',
'education',
'gender',
'recruitment_channel',
'no_of_trainings',
'age',
'previous_year_rating',
'length_of_service',
'KPIs_met >80%',
'awards_won?',
'avg_training_score')}}
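Since the summary is a plain dictionary, it can be queried with ordinary Python. The sketch below copies the avg_score values of the first five entries from the output above (the remaining keys are omitted for brevity) and ranks the subset sizes by score. The variable names are illustrative only.

```python
# avg_score and feature_names copied from the summary shown above (keys 1 to 5).
summary = {
    1: {"avg_score": 0.6895577479869736,
        "feature_names": ("KPIs_met >80%",)},
    2: {"avg_score": 0.7872499546962967,
        "feature_names": ("KPIs_met >80%", "avg_training_score")},
    3: {"avg_score": 0.8804506473945294,
        "feature_names": ("department", "KPIs_met >80%", "avg_training_score")},
    4: {"avg_score": 0.8807940412342802,
        "feature_names": ("department", "KPIs_met >80%", "awards_won?",
                          "avg_training_score")},
    5: {"avg_score": 0.8691371715375045,
        "feature_names": ("department", "gender", "KPIs_met >80%",
                          "awards_won?", "avg_training_score")},
}

# Subset size with the highest mean cross-validation score.
best_k = max(summary, key=lambda k: summary[k]["avg_score"])

# All subset sizes, ordered from best to worst score.
ranked = sorted(summary, key=lambda k: summary[k]["avg_score"], reverse=True)

print(best_k, summary[best_k]["feature_names"])
print(ranked)
```

Note how close the 3-feature and 4-feature entries are; inspecting the ranking this way can help decide whether the smaller subset is good enough.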
It is also possible to save this summary automatically after calling the fit() method by using the save_json and json_summary parameters. By default, save_json is set to False, which means that no JSON files are saved. By setting it to True, the summary will be saved in the file specified by json_summary.
[5]:
feat_sel = SeqFeatSelection(n_jobs=4, save_json=True, json_summary="json_files/seq_feat.json")
feat_sel.fit(df=dataset, label_col='is_promoted')
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.8s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.8s finished
[2022-07-27 10:46:22] Features: 1/12 -- score: 0.6895577479869736[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 11 out of 11 | elapsed: 0.1s finished
[2022-07-27 10:46:22] Features: 2/12 -- score: 0.7872499546962967[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 0.2s finished
[2022-07-27 10:46:23] Features: 3/12 -- score: 0.8805863398406606[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 7 out of 9 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 9 out of 9 | elapsed: 0.2s finished
[2022-07-27 10:46:23] Features: 4/12 -- score: 0.8796917785634855[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 6 out of 8 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 8 out of 8 | elapsed: 0.2s finished
[2022-07-27 10:46:23] Features: 5/12 -- score: 0.8700911766983669[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 7 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 7 out of 7 | elapsed: 0.2s finished
[2022-07-27 10:46:24] Features: 6/12 -- score: 0.8521535196777906[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 3 out of 6 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 6 out of 6 | elapsed: 0.2s finished
[2022-07-27 10:46:24] Features: 7/12 -- score: 0.8297866606603811[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 0.2s finished
[2022-07-27 10:46:24] Features: 8/12 -- score: 0.7976130766687802[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 4 | elapsed: 0.1s finished
[2022-07-27 10:46:24] Features: 9/12 -- score: 0.7430020885354774[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done 3 out of 3 | elapsed: 0.2s finished
[2022-07-27 10:46:25] Features: 10/12 -- score: 0.6820033653384551[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 0.6s finished
[2022-07-27 10:46:26] Features: 11/12 -- score: 0.6601517758791042[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s finished
[2022-07-27 10:46:26] Features: 12/12 -- score: 0.6579024385880693
[5]:
<raimitigations.dataprocessing.feat_selection.sequential_select.SeqFeatSelection at 0x7f12141ec3a0>
The user can then open this JSON file using any JSON viewing tool and check all the results presented there. In the following images, we show an example of the JSON file generated using the “JSON Viewer” extension for VSCode.
In the previous image, we can see that each run is associated with a different numerical key. The key “6” represents a run that used 6 features, listed in the feature_names sub-key. We can also look at the score obtained by each model during cross-validation (in this case, using 3 folds), as well as the mean score. This JSON file allows users to inspect the finer details of the feature selection process and decide for themselves whether the final selected features are the ones they want or whether they want to change them.
If the user disagrees (partially or fully) with the selected features, they can choose the features manually with the set_selected_features method. This method accepts a list of columns, which become the selected columns (they can differ from the features selected by the fit method). If we then call get_selected_features, we can see that the selected features are now the ones defined by the user:
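The saved file can also be inspected programmatically. The sketch below assumes the file mirrors the structure returned by get_summary(); the exact layout written by the library is an assumption here, and the `toy_summary` values are made up for illustration. One detail worth knowing is that JSON object keys are always strings, so the numeric keys come back as "1", "2", and so on.

```python
import json
import os
import tempfile

# Hypothetical summary with the same shape as get_summary() (toy values).
toy_summary = {
    1: {"feature_names": ["f_a"],
        "cv_scores": [0.60, 0.62, 0.61], "avg_score": 0.61},
    2: {"feature_names": ["f_a", "f_b"],
        "cv_scores": [0.70, 0.72, 0.71], "avg_score": 0.71},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "seq_feat.json")
    # json.dump converts the integer keys to strings.
    with open(path, "w") as f:
        json.dump(toy_summary, f)
    with open(path) as f:
        loaded = json.load(f)

# Convert the keys back to integers before comparing subset sizes.
by_size = {int(k): v for k, v in loaded.items()}
best_size = max(by_size, key=lambda k: by_size[k]["avg_score"])
print(sorted(loaded), best_size)
```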
[6]:
feat_sel.set_selected_features(["department", "previous_year_rating"])
feat_sel.get_selected_features()
[6]:
['department', 'previous_year_rating']
We can also set the selected features using the columns’ indexes instead of their names:
[7]:
feat_sel.set_selected_features([1, 2])
feat_sel.get_selected_features()
[7]:
['region', 'education']
To reset the selected features to the ones chosen by the fit method, we only need to call set_selected_features again, this time without providing a list of columns. The selected features are then set back to those originally selected by the fit method.
[8]:
feat_sel.set_selected_features()
feat_sel.get_selected_features()
[8]:
['department', 'KPIs_met >80%', 'avg_training_score']
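The three behaviours shown above (names, indexes, and reset) can be summarised with a small stand-alone function. `resolve_selection` is a hypothetical helper written for this illustration, not the library's implementation; the column list and the fitted selection are taken from the outputs above.

```python
from typing import List, Optional, Sequence, Union

# Feature columns in the order shown in the dataset table (label excluded).
columns = [
    "department", "region", "education", "gender", "recruitment_channel",
    "no_of_trainings", "age", "previous_year_rating", "length_of_service",
    "KPIs_met >80%", "awards_won?", "avg_training_score",
]
# Features chosen by the last call to fit() in this notebook.
fitted = ["department", "KPIs_met >80%", "avg_training_score"]


def resolve_selection(
    columns: Sequence[str],
    selection: Optional[Sequence[Union[str, int]]],
    fitted: Sequence[str],
) -> List[str]:
    """Map a user selection (names or zero-based indexes) to column names."""
    if selection is None:
        # No list given: fall back to the features chosen during fit.
        return list(fitted)
    resolved = []
    for item in selection:
        name = columns[item] if isinstance(item, int) else item
        if name not in columns:
            raise ValueError(f"Unknown column: {name!r}")
        resolved.append(name)
    return resolved


print(resolve_selection(columns, ["department", "previous_year_rating"], fitted))
print(resolve_selection(columns, [1, 2], fitted))
print(resolve_selection(columns, None, fitted))
```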
We can also split the whole dataframe into an X dataframe containing the features and a Y dataframe containing the labels. In this case, we use the “X=” and “y=” parameters and ignore the “df=” and “label_col=” parameters. We can also change the scoring function used.
[9]:
X = dataset.drop(columns=['is_promoted'])
Y = dataset['is_promoted']
feat_sel = SeqFeatSelection(scoring='f1', n_jobs=4)
feat_sel.fit(X=X, y=Y)
feat_sel.get_selected_features()
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.7s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.7s finished
[2022-07-27 10:46:28] Features: 1/12 -- score: 0.19351656953830262[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 11 out of 11 | elapsed: 0.2s finished
[2022-07-27 10:46:29] Features: 2/12 -- score: 0.49702858853465043[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 0.2s finished
[2022-07-27 10:46:29] Features: 3/12 -- score: 0.5024861528381374[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 7 out of 9 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 9 out of 9 | elapsed: 0.2s finished
[2022-07-27 10:46:29] Features: 4/12 -- score: 0.5019562862803734[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 6 out of 8 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 8 out of 8 | elapsed: 0.2s finished
[2022-07-27 10:46:29] Features: 5/12 -- score: 0.49882611496675605[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 7 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 7 out of 7 | elapsed: 0.2s finished
[2022-07-27 10:46:30] Features: 6/12 -- score: 0.48888552734189306[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 3 out of 6 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 6 out of 6 | elapsed: 0.2s finished
[2022-07-27 10:46:30] Features: 7/12 -- score: 0.47494684134297466[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 0.2s finished
[2022-07-27 10:46:30] Features: 8/12 -- score: 0.46829347831133433[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 4 | elapsed: 0.1s finished
[2022-07-27 10:46:30] Features: 9/12 -- score: 0.4449834067043737[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done 3 out of 3 | elapsed: 0.2s finished
[2022-07-27 10:46:31] Features: 10/12 -- score: 0.4015465436267724[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 0.9s finished
[2022-07-27 10:46:32] Features: 11/12 -- score: 0.38470406126502166[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s finished
[2022-07-27 10:46:32] Features: 12/12 -- score: 0.36287745706651586
[9]:
['department', 'awards_won?', 'avg_training_score']
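The scores in this run are on a different scale from the earlier runs because F1 measures something different from the scoring used before, so the numbers should not be compared across runs. As a reminder of what F1 captures, the sketch below computes it from confusion-matrix counts. The counts are made up for illustration and `f1_from_counts` is a hypothetical helper, not part of the library.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy imbalanced problem: 100 positives out of 1000 samples.
tp, fp, fn, tn = 50, 50, 50, 850

accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = f1_from_counts(tp, fp, fn)
print(accuracy, f1)
```

With these toy counts the accuracy looks high because the many true negatives dominate, while F1 ignores true negatives and reflects how well the positive class is recovered.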
SeqFeatSelection implements the Sequential Feature Selection approach using the mlxtend library. This method relies on an estimator, which is used to measure the performance of the model with different sets of features. The default estimator is a decision tree classifier (DecisionTreeClassifier from sklearn), but the user might want to try other sklearn estimators to see if they achieve better results. For this reason, the estimator parameter accepts a sklearn classifier, or None if the user wants to use the default one. Let’s see how we can use the SeqFeatSelection subclass while specifying a different classifier. Note that in the following cell we also (i) specify the “label_col” using the index of the label column instead of its name (just to show a different way of specifying this attribute), and (ii) provide the dataset when instantiating the subclass instead of providing it to the fit method. Keep in mind that column indexes are zero-based: in this dataset, index 11 is avg_training_score, while is_promoted sits at index 12. This is consistent with the output below, where scikit-learn warns that the least populated class in y has only 2 members and the accuracy scores are low.
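When a label column is given by position, it is safer to look the index up than to count by hand. The sketch below uses the column order from the dataset table at the top of this notebook and assumes that positions are zero-based and refer to the dataframe's columns (as the index-based set_selected_features example suggests). With a pandas dataframe, the equivalent lookup would be `list(dataset.columns).index('is_promoted')`.

```python
# Column order as shown in the dataset table (after dropping employee_id).
columns = [
    "department", "region", "education", "gender", "recruitment_channel",
    "no_of_trainings", "age", "previous_year_rating", "length_of_service",
    "KPIs_met >80%", "awards_won?", "avg_training_score", "is_promoted",
]

# Zero-based position of the label column.
label_idx = columns.index("is_promoted")
print(label_idx, columns[label_idx])

# The position right before it is a feature, not the label.
print(columns[label_idx - 1])
```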
[10]:
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier(n_neighbors=4)
feat_sel = SeqFeatSelection(df=dataset, label_col=11, estimator=estimator, scoring='accuracy', n_jobs=4)
feat_sel.fit()
feat_sel.get_selected_features()
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
/home/mmendonca/anaconda3/envs/rai/lib/python3.9/site-packages/sklearn/model_selection/_split.py:676: UserWarning: The least populated class in y has only 2 members, which is less than n_splits=3.
warnings.warn(
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 22.9s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 22.9s finished
[2022-07-27 10:46:58] Features: 1/12 -- score: 0.07325640058383107[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 11 out of 11 | elapsed: 7.5s finished
[2022-07-27 10:47:05] Features: 2/12 -- score: 0.10067900107726384[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 6.1s finished
[2022-07-27 10:47:11] Features: 3/12 -- score: 0.10719246197152053[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 7 out of 9 | elapsed: 4.7s remaining: 1.3s
[Parallel(n_jobs=4)]: Done 9 out of 9 | elapsed: 6.0s finished
[2022-07-27 10:47:17] Features: 4/12 -- score: 0.10540453099546042[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 6 out of 8 | elapsed: 4.0s remaining: 1.3s
[Parallel(n_jobs=4)]: Done 8 out of 8 | elapsed: 4.8s finished
[2022-07-27 10:47:22] Features: 5/12 -- score: 0.1044739769466601[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 7 | elapsed: 3.0s remaining: 2.3s
[Parallel(n_jobs=4)]: Done 7 out of 7 | elapsed: 4.4s finished
[2022-07-27 10:47:27] Features: 6/12 -- score: 0.10179179486070188[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 3 out of 6 | elapsed: 2.8s remaining: 2.8s
[Parallel(n_jobs=4)]: Done 6 out of 6 | elapsed: 3.9s finished
[2022-07-27 10:47:31] Features: 7/12 -- score: 0.09294258224479195[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 3.6s finished
[2022-07-27 10:47:35] Features: 8/12 -- score: 0.09215810680398327[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 4 | elapsed: 2.9s finished
[2022-07-27 10:47:37] Features: 9/12 -- score: 0.0910450393828115[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done 3 out of 3 | elapsed: 3.2s finished
[2022-07-27 10:47:41] Features: 10/12 -- score: 0.08617349577068416[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 3.4s finished
[2022-07-27 10:47:45] Features: 11/12 -- score: 0.0777988658594773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 2.3s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 2.3s finished
[2022-07-27 10:47:47] Features: 12/12 -- score: 0.06568385280411117
[10]:
['department', 'education', 'is_promoted']
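The log above reports one score per subset size (the "Features: k/12" lines). The greedy search behind such a log can be sketched in plain Python as follows. This is only an illustration, not the mlxtend or raimitigations implementation: the names forward_selection, toy_score and USEFULNESS are hypothetical, and the toy score stands in for the cross-validated model score that the real class computes.

```python
def forward_selection(features, score_fn):
    """Greedy forward search: add one feature per round and remember
    the best-scoring subset seen across all rounds."""
    selected = []
    remaining = list(features)
    best_subset, best_score = [], float("-inf")
    history = []  # (subset size, score) pairs, like the "Features: k/N" log lines
    while remaining:
        # Score every candidate subset formed by adding one remaining feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        round_score, round_feat = max(scored, key=lambda pair: pair[0])
        selected.append(round_feat)
        remaining.remove(round_feat)
        history.append((len(selected), round_score))
        if round_score > best_score:
            best_score = round_score
            best_subset = list(selected)
    return best_subset, history


# Hypothetical per-feature usefulness (assumption made for this toy example).
USEFULNESS = {"a": 0.5, "b": 0.3, "c": 0.02, "d": 0.0}


def toy_score(subset):
    # Reward useful features, penalize subset size (a stand-in for a CV score).
    return sum(USEFULNESS[f] for f in subset) - 0.05 * len(subset)


best, history = forward_selection(USEFULNESS, toy_score)
print(best)  # the subset with the highest score among all sizes
for size, score in history:
    print(f"Features: {size}/{len(USEFULNESS)} -- score: {score:.3f}")
```

In this toy run the score rises while useful features are added and falls once only weak features remain, which mirrors the rise-then-fall pattern a sequential search typically produces.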
Finally, to transform a dataset so that it keeps only the chosen features, we call the transform method. As with the other subclasses, this method must always receive a valid dataset, which doesn't need to be the same one used in the fit method.
[11]:
new_df = feat_sel.transform(dataset)
new_df.head()
[11]:
department | education | is_promoted | avg_training_score | |
---|---|---|---|---|
0 | 7 | 2 | 0 | 49 |
1 | 4 | 0 | 0 | 60 |
2 | 7 | 0 | 0 | 50 |
3 | 7 | 0 | 0 | 50 |
4 | 8 | 0 | 0 | 73 |
Setting a list of transformations before using feature selection
Sometimes we would like to prepare the data before performing feature selection. In this example, we use BasicImputer to fill missing values, and both EncoderOrdinal and EncoderOHE to deal with categorical variables. These transformations are passed in the transform_pipe parameter as a list. When transform() is called on this or another dataset, these three transformations will be performed prior to SeqFeatSelection.
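The ordering described above (preparation steps first, column selection last) can be sketched in plain Python. Everything in this block is a hypothetical stand-in written for illustration: FillMissing, OrdinalMap and KeepColumns are not raimitigations classes, and rows are plain dictionaries rather than a DataFrame.

```python
class FillMissing:
    """Hypothetical stand-in for an imputer: replaces None with a constant."""

    def __init__(self, column, fill_value):
        self.column = column
        self.fill_value = fill_value

    def transform(self, rows):
        return [
            {**r, self.column: self.fill_value if r[self.column] is None else r[self.column]}
            for r in rows
        ]


class OrdinalMap:
    """Hypothetical stand-in for an ordinal encoder with a fixed category order."""

    def __init__(self, column, categories):
        self.column = column
        self.codes = {cat: i for i, cat in enumerate(categories)}

    def transform(self, rows):
        return [{**r, self.column: self.codes[r[self.column]]} for r in rows]


class KeepColumns:
    """Hypothetical stand-in for the selection step: runs every step of
    transform_pipe in order, then keeps only the chosen columns."""

    def __init__(self, columns, transform_pipe):
        self.columns = columns
        self.transform_pipe = transform_pipe

    def transform(self, rows):
        for step in self.transform_pipe:
            rows = step.transform(rows)
        return [{c: r[c] for c in self.columns} for r in rows]


rows = [
    {"education": "Bachelor's", "rating": None, "age": 30},
    {"education": "Master's & above", "rating": 5.0, "age": 35},
]
pipe = [
    FillMissing("rating", -100),
    OrdinalMap("education", ["Below Secondary", "Bachelor's", "Master's & above"]),
]
selector = KeepColumns(["education", "rating"], transform_pipe=pipe)
out = selector.transform(rows)
print(out)
```

Because the pipeline is stored with the selector, the same preparation is replayed on any dataset passed to transform, so the caller does not need to pre-process new data by hand.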
[12]:
print(dataset.info())
dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 department 54808 non-null object
1 region 54808 non-null object
2 education 52399 non-null object
3 gender 54808 non-null object
4 recruitment_channel 54808 non-null object
5 no_of_trainings 54808 non-null int64
6 age 54808 non-null int64
7 previous_year_rating 50684 non-null float64
8 length_of_service 54808 non-null int64
9 KPIs_met >80% 54808 non-null int64
10 awards_won? 54808 non-null int64
11 avg_training_score 54808 non-null int64
12 is_promoted 54808 non-null int64
dtypes: float64(1), int64(7), object(5)
memory usage: 5.4+ MB
None
[12]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[13]:
dataset['education'].unique()
[13]:
array(["Master's & above", "Bachelor's", nan, 'Below Secondary'],
dtype=object)
[14]:
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal
from raimitigations.dataprocessing import BasicImputer

# Fill missing 'previous_year_rating' values with a constant. With
# col_impute=None, the columns to impute are identified automatically.
imputer = BasicImputer(
    col_impute=None,
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
# Encode 'education' as ordered integers, following the given category order.
ordinal = EncoderOrdinal(
    col_encode=['education'],
    categories={'education': ["Below Secondary", "Bachelor's", "Master's & above"]}
)
# One-hot encode the remaining categorical columns.
ohe = EncoderOHE(col_encode=["department", "region", "gender", "recruitment_channel"])

# These steps run, in this order, before the feature selection itself.
transform_pipe = [imputer, ordinal, ohe]
feat_sel = SeqFeatSelection(transform_pipe=transform_pipe, n_jobs=4)
feat_sel.fit(df=dataset, label_col='is_promoted')
feat_sel.get_selected_features()
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 41 tasks | elapsed: 0.9s
[Parallel(n_jobs=4)]: Done 52 out of 52 | elapsed: 1.0s finished
[2022-07-27 10:47:50] Features: 1/52 -- score: 0.6895577479869736[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 51 out of 51 | elapsed: 0.5s finished
[2022-07-27 10:47:51] Features: 2/52 -- score: 0.7872499546962967[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 50 out of 50 | elapsed: 0.8s finished
[2022-07-27 10:47:52] Features: 3/52 -- score: 0.83642750785164[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 49 out of 49 | elapsed: 0.8s finished
[2022-07-27 10:47:53] Features: 4/52 -- score: 0.866985302389267[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 48 out of 48 | elapsed: 0.8s finished
[2022-07-27 10:47:54] Features: 5/52 -- score: 0.8765245610911356[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 47 out of 47 | elapsed: 0.8s finished
[2022-07-27 10:47:55] Features: 6/52 -- score: 0.8799218756368746[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 46 out of 46 | elapsed: 0.7s finished
[2022-07-27 10:47:55] Features: 7/52 -- score: 0.8822123483721973[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 45 out of 45 | elapsed: 0.8s finished
[2022-07-27 10:47:56] Features: 8/52 -- score: 0.8825335299087659[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 37 out of 44 | elapsed: 0.7s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 44 out of 44 | elapsed: 0.8s finished
[2022-07-27 10:47:57] Features: 9/52 -- score: 0.8824764223523092[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 36 out of 43 | elapsed: 0.7s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 43 out of 43 | elapsed: 0.8s finished
[2022-07-27 10:47:58] Features: 10/52 -- score: 0.8826109087104701[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 out of 42 | elapsed: 0.8s finished
[2022-07-27 10:47:59] Features: 11/52 -- score: 0.8820841123146065[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 34 out of 41 | elapsed: 0.8s remaining: 0.2s
[Parallel(n_jobs=4)]: Done 41 out of 41 | elapsed: 0.8s finished
[2022-07-27 10:48:00] Features: 12/52 -- score: 0.8824994409310958[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 40 out of 40 | elapsed: 0.9s finished
[2022-07-27 10:48:01] Features: 13/52 -- score: 0.88131439273642[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 39 out of 39 | elapsed: 0.9s finished
[2022-07-27 10:48:02] Features: 14/52 -- score: 0.8813604472046684[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 38 out of 38 | elapsed: 0.8s finished
[2022-07-27 10:48:03] Features: 15/52 -- score: 0.8804924898095458[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 30 out of 37 | elapsed: 0.9s remaining: 0.2s
[Parallel(n_jobs=4)]: Done 37 out of 37 | elapsed: 1.0s finished
[2022-07-27 10:48:04] Features: 16/52 -- score: 0.879256221965207[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 36 out of 36 | elapsed: 0.9s finished
[2022-07-27 10:48:05] Features: 17/52 -- score: 0.8783063165665314[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 28 out of 35 | elapsed: 0.8s remaining: 0.2s
[Parallel(n_jobs=4)]: Done 35 out of 35 | elapsed: 0.9s finished
[2022-07-27 10:48:06] Features: 18/52 -- score: 0.8765803207831038[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 34 out of 34 | elapsed: 0.9s finished
[2022-07-27 10:48:07] Features: 19/52 -- score: 0.8756175226003494[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 26 out of 33 | elapsed: 0.9s remaining: 0.2s
[Parallel(n_jobs=4)]: Done 33 out of 33 | elapsed: 1.0s finished
[2022-07-27 10:48:08] Features: 20/52 -- score: 0.8739821646328595[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 32 out of 32 | elapsed: 0.9s finished
[2022-07-27 10:48:09] Features: 21/52 -- score: 0.8731880599812847[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 31 out of 31 | elapsed: 0.9s finished
[2022-07-27 10:48:10] Features: 22/52 -- score: 0.8715491264910891[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 30 out of 30 | elapsed: 0.9s finished
[2022-07-27 10:48:11] Features: 23/52 -- score: 0.8693456428081054[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 22 out of 29 | elapsed: 0.7s remaining: 0.2s
[Parallel(n_jobs=4)]: Done 29 out of 29 | elapsed: 0.9s finished
[2022-07-27 10:48:12] Features: 24/52 -- score: 0.8681427343726945[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 21 out of 28 | elapsed: 0.8s remaining: 0.3s
[Parallel(n_jobs=4)]: Done 28 out of 28 | elapsed: 0.9s finished
[2022-07-27 10:48:13] Features: 25/52 -- score: 0.8660444870160423[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 20 out of 27 | elapsed: 0.9s remaining: 0.3s
[Parallel(n_jobs=4)]: Done 27 out of 27 | elapsed: 1.0s finished
[2022-07-27 10:48:14] Features: 26/52 -- score: 0.8628526559709279[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 26 out of 26 | elapsed: 1.0s finished
[2022-07-27 10:48:15] Features: 27/52 -- score: 0.8603961388270603[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 18 out of 25 | elapsed: 0.8s remaining: 0.3s
[Parallel(n_jobs=4)]: Done 25 out of 25 | elapsed: 0.9s finished
[2022-07-27 10:48:16] Features: 28/52 -- score: 0.8583971419846481[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 24 out of 24 | elapsed: 0.9s finished
[2022-07-27 10:48:17] Features: 29/52 -- score: 0.8572540964551741[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 23 out of 23 | elapsed: 1.0s finished
[2022-07-27 10:48:18] Features: 30/52 -- score: 0.8523029606818974[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 22 out of 22 | elapsed: 0.8s finished
[2022-07-27 10:48:19] Features: 31/52 -- score: 0.8511240267050649[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 14 out of 21 | elapsed: 0.6s remaining: 0.3s
[Parallel(n_jobs=4)]: Done 21 out of 21 | elapsed: 0.9s finished
[2022-07-27 10:48:20] Features: 32/52 -- score: 0.8477236712628895[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 0.8s finished
[2022-07-27 10:48:21] Features: 33/52 -- score: 0.8459631494456881[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 12 out of 19 | elapsed: 0.7s remaining: 0.4s
[Parallel(n_jobs=4)]: Done 19 out of 19 | elapsed: 0.8s finished
[2022-07-27 10:48:22] Features: 34/52 -- score: 0.8428610442355481[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 18 out of 18 | elapsed: 0.8s finished
[2022-07-27 10:48:23] Features: 35/52 -- score: 0.8401036051673718[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 10 out of 17 | elapsed: 0.6s remaining: 0.5s
[Parallel(n_jobs=4)]: Done 17 out of 17 | elapsed: 0.8s finished
[2022-07-27 10:48:24] Features: 36/52 -- score: 0.8361276573467237[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 16 out of 16 | elapsed: 0.7s finished
[2022-07-27 10:48:25] Features: 37/52 -- score: 0.8332970689914444[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 15 out of 15 | elapsed: 0.7s finished
[2022-07-27 10:48:25] Features: 38/52 -- score: 0.8301985107591306[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 14 out of 14 | elapsed: 0.7s finished
[2022-07-27 10:48:26] Features: 39/52 -- score: 0.8245966816746299[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 13 out of 13 | elapsed: 0.6s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 13 out of 13 | elapsed: 0.6s finished
[2022-07-27 10:48:27] Features: 40/52 -- score: 0.8204300249621478[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.6s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 12 out of 12 | elapsed: 0.6s finished
[2022-07-27 10:48:28] Features: 41/52 -- score: 0.8189924299706502[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 11 out of 11 | elapsed: 0.5s finished
[2022-07-27 10:48:28] Features: 42/52 -- score: 0.8146435410934924[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 0.5s finished
[2022-07-27 10:48:29] Features: 43/52 -- score: 0.8078289467042064[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 7 out of 9 | elapsed: 0.4s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 9 out of 9 | elapsed: 0.5s finished
[2022-07-27 10:48:29] Features: 44/52 -- score: 0.7967527130014775[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 6 out of 8 | elapsed: 0.4s remaining: 0.1s
[Parallel(n_jobs=4)]: Done 8 out of 8 | elapsed: 0.4s finished
[2022-07-27 10:48:30] Features: 45/52 -- score: 0.7926074075965114[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 7 | elapsed: 0.2s remaining: 0.2s
[Parallel(n_jobs=4)]: Done 7 out of 7 | elapsed: 0.4s finished
[2022-07-27 10:48:30] Features: 46/52 -- score: 0.7743038548882368[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 3 out of 6 | elapsed: 0.2s remaining: 0.2s
[Parallel(n_jobs=4)]: Done 6 out of 6 | elapsed: 0.4s finished
[2022-07-27 10:48:31] Features: 47/52 -- score: 0.7496559681177949[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 0.3s finished
[2022-07-27 10:48:31] Features: 48/52 -- score: 0.7272615668144816[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 4 out of 4 | elapsed: 0.3s finished
[2022-07-27 10:48:32] Features: 49/52 -- score: 0.7093266736980682[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done 3 out of 3 | elapsed: 0.3s finished
[2022-07-27 10:48:32] Features: 50/52 -- score: 0.6810674191630524[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 0.7s finished
[2022-07-27 10:48:33] Features: 51/52 -- score: 0.6608504797245924[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s finished
[2022-07-27 10:48:33] Features: 52/52 -- score: 0.6531054329005759
[14]:
['KPIs_met >80%',
'awards_won?',
'avg_training_score',
'department_HR',
'department_Operations',
'department_Procurement',
'department_Sales & Marketing',
'department_Technology',
'region_region_18',
'region_region_9']
[15]:
new_df = feat_sel.transform(dataset)
new_df.head()
[15]:
KPIs_met >80% | awards_won? | avg_training_score | department_HR | department_Operations | department_Procurement | department_Sales & Marketing | department_Technology | region_region_18 | region_region_9 | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 49 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 60 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 50 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 50 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 73 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 - DataFrame without column names
SeqFeatSelection can be performed on datasets without column names. The next few cells demonstrate how to use SeqFeatSelection on datasets without column names, similar to the example above.
[16]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[16]:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[17]:
feat_sel = SeqFeatSelection(n_jobs=1)
feat_sel.fit(df=dataset, label_col=12)
feat_sel.get_selected_features()
No columns specified for imputation. These columns have been automatically identified:
['2', '7']
No columns specified for encoding. These columns have been automatically identfied as the following:
['0', '1', '2', '3', '4']
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.3s finished
[2022-07-27 10:48:36] Features: 1/12 -- score: 0.6895577479869736[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.3s finished
[2022-07-27 10:48:36] Features: 2/12 -- score: 0.7872499546962967[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.5s finished
[2022-07-27 10:48:37] Features: 3/12 -- score: 0.8808119002453035[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.5s finished
[2022-07-27 10:48:37] Features: 4/12 -- score: 0.8798465038933846[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.5s finished
[2022-07-27 10:48:38] Features: 5/12 -- score: 0.8718241205396499[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.5s finished
[2022-07-27 10:48:38] Features: 6/12 -- score: 0.8500875690014796[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.4s finished
[2022-07-27 10:48:39] Features: 7/12 -- score: 0.8301586131548361[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.3s finished
[2022-07-27 10:48:39] Features: 8/12 -- score: 0.7965278481347339[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.3s finished
[2022-07-27 10:48:39] Features: 9/12 -- score: 0.7486532193434772[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.3s finished
[2022-07-27 10:48:40] Features: 10/12 -- score: 0.6843054388422951[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.2s finished
[2022-07-27 10:48:40] Features: 11/12 -- score: 0.6654508689096956[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s finished
[2022-07-27 10:48:40] Features: 12/12 -- score: 0.663366868314892
[17]:
['0', '9', '11']
[18]:
feat_sel.set_selected_features([1,8])
feat_sel.get_selected_features()
[18]:
['1', '8']
[19]:
feat_sel.set_selected_features()
feat_sel.get_selected_features()
[19]:
['0', '9', '11']
[20]:
new_df = feat_sel.transform(dataset)
new_df.head()
[20]:
0 | 9 | 11 | 12 | |
---|---|---|---|---|
0 | 7 | 1 | 49 | 0 |
1 | 4 | 0 | 60 | 0 |
2 | 7 | 0 | 50 | 0 |
3 | 7 | 0 | 50 | 0 |
4 | 8 | 0 | 73 | 0 |
3 - Regression Task
So far, we have only shown examples of SeqFeatSelection for classification tasks. However, this class also works for regression tasks. First, let's create a dummy regression dataset so we can build a few examples. For this, we'll use the create_dummy_dataset function:
[21]:
from raimitigations.utils import create_dummy_dataset
df = create_dummy_dataset(
    samples=1000,
    n_features=6,
    n_num_num=2,
    n_cat_num=2,
    n_cat_cat=0,
    num_num_noise=[0.01, 0.02],
    pct_change=[0.03, 0.05],
    regression=True,
)
df
[21]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.410174 | 1.013603 | -0.832691 | 0.367173 | 0.567124 | -0.759324 | 27.539174 | 0.404645 | 1.012075 | val0_0 | val1_0 |
1 | 0.049380 | 1.025313 | -1.205273 | 1.060657 | 1.028810 | 0.608497 | 97.544801 | 0.067120 | 1.007892 | val0_1 | val1_0 |
2 | -0.175892 | -0.287463 | -1.603577 | -0.471347 | 0.735466 | 1.022698 | -51.398235 | -0.165907 | -0.300760 | val0_0 | val1_1 |
3 | -0.253181 | 0.210460 | 0.934032 | 1.532367 | -0.410205 | 0.051906 | 96.124444 | -0.231510 | 0.218974 | val0_0 | val1_1 |
4 | 0.255014 | 0.357865 | -0.608587 | 0.454078 | -0.583506 | 1.106965 | 88.526727 | 0.254427 | 0.317064 | val0_0 | val1_3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 0.032016 | 0.038192 | 0.506141 | 0.329018 | 0.353782 | -0.385402 | 22.558860 | 0.022259 | 0.025872 | val0_1 | val1_1 |
996 | 0.242831 | -0.106563 | -0.072470 | 0.006590 | 1.748496 | 2.133973 | 152.801772 | 0.245658 | -0.111039 | val0_1 | val1_1 |
997 | 1.801105 | 0.231289 | 0.190427 | -0.423817 | 0.568846 | 0.793511 | 194.216915 | 1.791196 | 0.188060 | val0_1 | val1_1 |
998 | -0.639458 | -1.512235 | -1.107102 | -0.857749 | 0.038708 | -1.332841 | -289.448911 | -0.634394 | -1.533487 | val0_0 | val1_0 |
999 | 0.695254 | -0.430597 | 0.460556 | -0.418865 | -1.593463 | 2.280759 | 146.007750 | 0.692414 | -0.419846 | val0_1 | val1_1 |
1000 rows × 11 columns
The SeqFeatSelection class automatically detects whether a problem is a classification or a regression task by looking at the label column: if the label column has a float data type, the task is treated as regression; otherwise, it is treated as classification. We can also set the task explicitly through the ‘regression’ parameter when instantiating the SeqFeatSelection class: True forces a regression task and False forces a classification task. The default value is None, in which case the task is inferred from the data type of the label column, as described above. If we have a classification problem whose label column holds float values (1.0 for class 1, 2.0 for class 2, and so on), automatic detection would wrongly treat it as regression, so we must set the ‘regression’ parameter to False.
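The detection rule described above can be sketched with plain pandas. This is only an illustration of the rule as stated in the text, not the library's actual implementation: the helper name infer_is_regression and the three example series are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def infer_is_regression(label: pd.Series, regression=None) -> bool:
    """Hypothetical helper mirroring the rule described in the text."""
    # An explicit True/False always wins over automatic detection.
    if regression is not None:
        return bool(regression)
    # Otherwise, a float label column is treated as a regression target.
    return is_float_dtype(label)


float_target = pd.Series([27.5, 97.5, -51.4])   # continuous target
int_classes = pd.Series([0, 1, 0])              # integer class labels
float_classes = pd.Series([1.0, 2.0, 1.0])      # class labels stored as floats

print(infer_is_regression(float_target))
print(infer_is_regression(int_classes))
print(infer_is_regression(float_classes))
print(infer_is_regression(float_classes, regression=False))
```

The third call shows the pitfall: class labels stored as floats look like a regression target to a dtype-based rule, which is why the explicit override in the last call is needed in that situation.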
[22]:
feat_sel = SeqFeatSelection(verbose=False)
feat_sel.fit(df=df, label_col="label")
feat_sel.get_selected_features()
[22]:
['num_0',
'num_1',
'num_2',
'num_3',
'num_5',
'num_c0_num_0',
'num_c1_num_1',
'CN_0_num_0',
'CN_1_num_1']
The ‘regression’ attribute indicates whether the task was treated as a regression or a classification task:
[23]:
feat_sel.regression
[23]:
True
We can also specify which regression model to use during the sequential feature selection procedure by passing it through the ‘estimator’ parameter. The default regressor is a Decision Tree Regressor.
[24]:
from sklearn.linear_model import LinearRegression
feat_sel = SeqFeatSelection(verbose=False, estimator=LinearRegression())
feat_sel.fit(df=df, label_col="label")
feat_sel.get_selected_features()
[24]:
['num_0', 'num_1', 'num_2', 'num_3', 'num_4', 'num_5', 'num_c0_num_0']
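To make the role of the estimator more concrete, here is a minimal sketch of greedy forward selection written directly against scikit-learn. It is not the code used by SeqFeatSelection or mlxtend: the function forward_select, the synthetic data, and the fixed number of features to keep are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, estimator, n_features, cv=3, scoring="r2"):
    """Illustrative greedy forward selection (hypothetical helper)."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        scores = {}
        for col in remaining:
            # Score the current subset plus one candidate column.
            candidate = selected + [col]
            scores[col] = cross_val_score(
                estimator, X[:, candidate], y, cv=cv, scoring=scoring
            ).mean()
        # Keep the candidate that gives the best cross-validated score.
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data: only columns 1 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

chosen = forward_select(X, y, LinearRegression(), n_features=2)
print(sorted(chosen))
```

Because every candidate subset is scored by fitting the estimator, swapping the model (for example, a linear model instead of a decision tree) can change which features look useful, which is consistent with the different selections returned in cells [22] and [24].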