Feature Selection - Part 2
Rangaraj Pandurangan

1. Feature Selection: Part 2

This is part 2 of the blog on feature selection. In part 1, we discussed what feature selection is, why we need it, the different classes of feature selection methods, and filter-based methods (https://r2dldocs.z6.web.core.windows.net/doc-repo/blog/feature-selection/).

In this article we will continue with the remaining methods:

  1. Wrapper Methods
  2. Embedded Methods
  3. Hybrid Methods

2. Wrapper Methods

Unlike basic filter methods, this class of methods uses a machine learning model for feature selection. Wrapper methods can be divided into:

  1. Step Forward feature selection
  2. Step Backward feature selection
  3. Exhaustive feature selection

2.1 Step Forward feature selection

This method starts with no features and adds one at a time. In the first step, it trains a machine learning model on each individual feature and selects the feature that yields the best-performing model according to the chosen evaluation metric.

In the second step, it trains models on all combinations of the chosen feature with each remaining feature, and retains the pair that produces the best metric. This process continues until a stopping criterion is met, such as:
1. The number of features to retain is reached
2. A performance threshold is breached

Forward feature selection can be done using the mlxtend library.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score, r2_score

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

data = pd.read_csv('dataset_4.csv', index_col = 0)
data = data.sample(frac=1.0)[-1000:]
print(f'Shape of data: {data.shape}')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), data['target'],  
    test_size=0.3,
    random_state=0)
Shape of data: (1000, 17)

Note

In all feature selection procedures, it is good practice to select features by examining only the training set, in order to avoid overfitting.

sfs = SFS(RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=0), 
           k_features=7, # num of features required
           forward=True, 
           floating=False, 
           verbose=2, 
           scoring='roc_auc',
           cv=2)

sfs = sfs.fit(np.array(X_train), y_train)
[2021-02-04 18:21:03] Features: 1/7 -- score: 0.6129253057374262
[2021-02-04 18:21:07] Features: 2/7 -- score: 0.6448066144546173
[2021-02-04 18:21:11] Features: 3/7 -- score: 0.6524201076469204
[2021-02-04 18:21:14] Features: 4/7 -- score: 0.630967010281436
[2021-02-04 18:21:17] Features: 5/7 -- score: 0.6572149449982636
[2021-02-04 18:21:20] Features: 6/7 -- score: 0.6701889499679581
[2021-02-04 18:21:23] Features: 7/7 -- score: 0.6569435613637518
selected_features = X_train.columns[list(sfs.k_feature_idx_)]
selected_features
Index(['var_2', 'var_3', 'var_9', 'var_16', 'var_21', 'var_55', 'var_99'], dtype='object')
# function to train random forests
def run_randomForests(X_train, X_test, y_train, y_test):
    
    rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)
    rf.fit(X_train, y_train)
    pred = rf.predict_proba(X_train)
    print('Training roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    
    pred = rf.predict_proba(X_test)
    print('Testing roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
print("Performance with entire features")
run_randomForests(X_train,
                  X_test,
                  y_train, y_test)

print("\n Performance with Select features")
run_randomForests(X_train[selected_features],
                  X_test[selected_features],
                  y_train, y_test)

Performance with entire features
Training roc-auc: 0.8297403205721402
Testing roc-auc: 0.6660311212251676

 Performance with Select features
Training roc-auc: 0.8221430121899669
Testing roc-auc: 0.6495479426779014

We can already see the benefit of forward feature selection: the model with only 7 features reaches a similar level of performance to the model trained on all 16 features.

2.2 Step Backward feature selection

This method starts with all features present and removes one feature at a time. It first trains a machine learning model on all features and records its performance.

Then, it trains models on every combination of all features minus one, and removes the feature whose removal hurts performance the least.

Backward feature selection can also be done with the mlxtend library by setting “forward” to False; the procedure otherwise remains the same.

sfs = SFS(RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=0), 
           k_features=7, 
           forward=False, # indicating backward selection 
           floating=False, 
           verbose=2, 
           scoring='roc_auc',
           cv=2)

sfs = sfs.fit(np.array(X_train), y_train)
[2021-02-04 18:21:28] Features: 15/7 -- score: 0.6381687672312987
[2021-02-04 18:21:32] Features: 14/7 -- score: 0.6037969287061233
[2021-02-04 18:21:36] Features: 13/7 -- score: 0.6096924764891487
[2021-02-04 18:21:39] Features: 12/7 -- score: 0.6231375998395974
[2021-02-04 18:21:42] Features: 11/7 -- score: 0.6346336193750423
[2021-02-04 18:21:45] Features: 10/7 -- score: 0.6140625162112188
[2021-02-04 18:21:48] Features: 9/7 -- score: 0.6309560973379343
[2021-02-04 18:21:50] Features: 8/7 -- score: 0.6384428229758881
[2021-02-04 18:21:53] Features: 7/7 -- score: 0.6295560912598497

2.3 Exhaustive Feature Selection

Exhaustive feature selection identifies the best subset of features among all possible feature subsets. The algorithm creates all combinations with 1 feature, 2 features, and so on, all the way up to N features, and finally selects the single combination that yields the best performance.

The mlxtend implementation of this selector accepts two main parameters, min_features and max_features, which bound the number of features expected to be in the model.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, r2_score

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), data['target'],  
    test_size=0.3,
    random_state=0)
efs = EFS(RandomForestClassifier(n_estimators=5,
                                 n_jobs=4,
                                 random_state=0,
                                 max_depth=2),
          min_features=3,
          max_features=4,
          scoring='roc_auc',
          print_progress=True,
          cv=2)

# search features
efs = efs.fit(np.array(X_train), y_train)
Features: 2380/2380

The above statement shows the total number of feature combinations evaluated: with 16 candidate features, there are C(16, 3) + C(16, 4) = 560 + 1820 = 2380 subsets of size 3 or 4. This search is computationally expensive for exactly this reason, so whether this type of selection is feasible depends on the available compute resources.
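As a quick sanity check (an illustrative snippet, not part of the original notebook), the count reported above is simply the number of 3- and 4-feature subsets of the 16 candidate features:

from math import comb

# EFS evaluates every subset whose size lies between min_features and max_features
n_candidate_features = X_train.shape[1]   # 16 after dropping the target
n_subsets = comb(n_candidate_features, 3) + comb(n_candidate_features, 4)
print(n_subsets)   # 560 + 1820 = 2380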

selected_features = X_train.columns[list(efs.best_idx_)]
selected_features
Index(['var_3', 'var_9', 'var_55', 'var_99'], dtype='object')
print("Performance with entire features")
run_randomForests(X_train,
                  X_test,
                  y_train, y_test)

print("\n Performance with Select features")
run_randomForests(X_train[selected_features],
                  X_test[selected_features],
                  y_train, y_test)
Performance with entire features
Training roc-auc: 0.8297403205721402
Testing roc-auc: 0.6660311212251676

 Performance with Select features
Training roc-auc: 0.8039151843346928
Testing roc-auc: 0.6470877667753245
print(X_train.shape, X_train[selected_features].shape)
(700, 16) (700, 4)

You can see the power of exhaustive feature selection above: with just four features, we obtain a similar level of generalization performance to the model trained on all 16 features.

3. Embedded Methods

These methods perform feature selection while building the machine learning model. The steps involve building the model, extracting the feature importances from the model, and removing the features with the lowest importance.

Different approaches in embedded methods include:

  1. Regularization
  2. Tree/RF Importance

3.1 Regularization

In linear and logistic regression, the magnitude of a coefficient directly reflects the importance of the corresponding feature (provided the features are on comparable scales).

Regularization is primarily performed to avoid overfitting. It keeps the model coefficients from growing too large and, for irrelevant features in particular, pushes the coefficients towards zero. We can therefore select the top N features based on their absolute coefficient magnitudes.

There are three main types of regularization for linear models:

  1. Ridge or L2 regularization
  2. Lasso or L1 regularization
  3. Elastic nets or L1/L2 regularization

The formulations are very similar, and we will see the application of one of these below.
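For reference, a standard textbook formulation of the penalized least-squares objective makes the similarity explicit; only the penalty term changes:

\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2 + \lambda \sum_{j} \beta_j^2 \qquad \text{(Ridge / L2)}

\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2 + \lambda \sum_{j} \lvert \beta_j \rvert \qquad \text{(Lasso / L1)}

\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2 + \lambda_1 \sum_{j} \lvert \beta_j \rvert + \lambda_2 \sum_{j} \beta_j^2 \qquad \text{(Elastic net)}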

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

selection = SelectFromModel(LogisticRegression(C=10, penalty='l2', max_iter=1000))
selection.fit(X_train, y_train)

selected_features = X_train.columns[(selection.get_support())]
print(f'Selected Features :{list(selected_features)}')

# features whose absolute coefficient is below 0.5 are counted as removed
removed_features = X_train.columns[(abs(selection.estimator_.coef_) < 0.5).ravel().tolist()]
print(f'Number of Removed Features :{len(removed_features)}')
Selected Features :['var_16', 'var_55', 'var_99']
Number of Removed Features :15
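For comparison, the same SelectFromModel utility can be paired with an L1 penalty, which drives the coefficients of uninformative features exactly to zero. A minimal sketch (assuming the same X_train/y_train and roughly comparable feature scales; the C value here is only illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# L1-penalised logistic regression: irrelevant coefficients shrink to exactly zero
l1_selection = SelectFromModel(
    LogisticRegression(C=0.5, penalty='l1', solver='liblinear', max_iter=1000))
l1_selection.fit(X_train, y_train)

l1_features = X_train.columns[l1_selection.get_support()]
print(f'Selected Features :{list(l1_features)}')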

3.2 Tree Importance

Tree-based algorithms like decision trees and random forests are popular for their performance and for the feature importances they provide. Feature importance gives an idea of which features matter most for predicting the target. Random forests compute it via mean decrease in impurity or mean decrease in accuracy.

Note

A random forest used for feature selection should be built after removing correlated features. Why? When features are correlated, their importance is shared among them, so each correlated feature is assigned a lower importance than it would receive if the others were removed. Once correlated features are dropped, the importance of the remaining features increases, leading to better feature selection.

This process of feature selection can be time consuming if the data has a large number of features.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
# remove correlated features
def correlation(dataset, threshold):
    col_corr = set()  # set of the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:  # we are interested in the absolute correlation value
                colname = corr_matrix.columns[i]  # getting the name of the column
                col_corr.add(colname)
    return col_corr


corr_features = correlation(X_train, 0.7)
print(f'Number of Correlated Features : {len(corr_features)}')

X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)
Number of Correlated Features : 0
rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# create a data frame of features and their importances, sorted from most to least important
importance_df = pd.DataFrame({"Features": X_train.columns, "Importances": importances})
importance_df = importance_df.sort_values('Importances', ascending=False).reset_index(drop=True)
importance_df

   Features  Importances
0    var_55     0.390711
1    var_16     0.070937
2    var_90     0.054711
3    var_99     0.052996
4    var_19     0.049638
5     var_9     0.047345
6     var_1     0.043711
7     var_3     0.043497
8     var_2     0.039954
9    var_46     0.038797
10   var_20     0.033847
11   var_21     0.033845
12   var_53     0.030956
13   var_10     0.024338
14   var_11     0.022644
15   var_35     0.022072
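Once the importances are computed, the selection step itself just keeps the top-ranked features and re-evaluates the model. A short sketch reusing importance_df and run_randomForests from above (keeping the top 7 is an arbitrary, illustrative cut-off):

# keep the 7 highest-ranked features from the importance table
top_features = importance_df['Features'].head(7).tolist()

print("Performance with top importance features")
run_randomForests(X_train[top_features],
                  X_test[top_features],
                  y_train, y_test)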

4. Hybrid Methods

These methods combine the filter, wrapper, and embedded methods. The main examples are Recursive Feature Elimination and Recursive Feature Addition. They are very similar to the methods we have already discussed, with slight modifications.

Recursive Feature Elimination:

  1. Build a machine learning model using all features

  2. Remove the least important feature(s) (e.g., using the feature importances from a random forest)

  3. Rebuild the model and recalculate the feature importances

  4. Repeat until a stopping criterion is met.

They can be built using sklearn (sklearn.feature_selection.RFE), as sketched below.
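A minimal sketch with sklearn's RFE, reusing the random-forest settings from earlier (the number of features to keep is illustrative):

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# recursively drop the least important feature until 7 remain
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3),
          n_features_to_select=7,
          step=1)
rfe.fit(X_train, y_train)

rfe_features = X_train.columns[rfe.support_]
print(f'Selected Features :{list(rfe_features)}')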

5. Conclusion

We have seen various methods used for selecting features during modeling. Feature selection is an important step for machine learning, since the chosen features directly influence model performance. The methods discussed in this series of articles include basic filter, wrapper, embedded, and hybrid methods. Each of these methods has its own advantages and disadvantages. In machine learning there is rarely one single best method; it is rather about testing which method best suits the given problem.

6. References

  1. Wikipedia : https://en.wikipedia.org/wiki/Feature_selection

  2. Udemy : https://www.udemy.com/course/feature-selection-for-machine-learning

  3. Medium Blog : https://heartbeat.fritz.ai/hands-on-with-feature-selection-techniques-an-introduction-1d8dc6d86c16