Feature Selection
   11 min read    Rangaraj Pandurangan

1. Introduction

We have all heard “Garbage In, Garbage Out”; in machine learning it can be read as “Noise In, Noise Out”. With data growing in every field, it is important to understand the negative influence of noisy data on a model’s accuracy and on its demand for computational resources.

Therefore, feature selection is an important step in any ML pipeline, aimed at removing irrelevant, redundant and noisy features. Formally, feature selection is the process of selecting a subset of relevant features for model development. Feature selection techniques are used for several reasons:

  1. Simplification of models - easier interpretation
  2. Shorter training times
  3. Avoid the curse of dimensionality
  4. Reduce overfitting

Feature Selection Vs Feature Engineering:

Feature selection is often applied to problems with many features and stands in contrast to generating new features, i.e. feature engineering. Typically, feature selection is performed after feature engineering is complete.

Feature Selection Vs Dimensionality Reduction:

Dimensionality reduction reduces the number of feature dimensions at the expense of transforming and modifying the features. Feature selection, in contrast, keeps a subset of the original features without modifying them, as the sketch below illustrates.
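To make the distinction concrete, here is a minimal sketch on made-up toy data, contrasting Scikit-learn's PCA (which transforms the features into new components) with a simple selector that keeps a subset of the original columns.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# toy data, purely for illustration
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(100, 4), columns=['f1', 'f2', 'f3', 'f4'])
y = (X['f1'] + 0.1 * rng.randn(100) > 0).astype(int)

# dimensionality reduction: the two output columns are new, transformed components
X_pca = PCA(n_components=2).fit_transform(X)

# feature selection: the two output columns are a subset of the original features
selector = SelectKBest(f_classif, k=2).fit(X, y)
X_selected = X[X.columns[selector.get_support()]]

print(X_pca[:2])                    # transformed values, no longer original features
print(X_selected.columns.tolist())  # names of the original columns that were kept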

2. Types of Feature Selection Methods

The feature selection methods can be broadly divided into:

  1. Filter Methods
  2. Wrapper Methods
  3. Embedded Methods
  4. Hybrid Methods

Note
In this article, we will focus on filter methods; the remaining ones will be covered in subsequent articles.

3. Filter Methods

As the name suggests, filter methods filter features based on their characteristics. They typically do not involve building any model, which makes them fast.

The variable characteristics used to filter features include variance, correlation and statistical rankings. We will look at each of them in order below.

3.1 Basic Filter Methods:

This class of methods removes features that are constant or quasi-constant (roughly, more than 95% of observations share the same value). Such features provide no information that allows an ML model to discriminate or predict the target. They can be identified by calculating the variance for numerical features and the number of unique values for categorical features.

To identify constant features, we can use VarianceThreshold from Scikit-learn, or we can code it ourselves. If we use VarianceThreshold, all our features need to be numerical. If we do it manually, however, we can apply the code to both numerical and categorical features. We will see this method in action on a synthetic dataset.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

data = pd.read_csv('dataset_1.csv')
print(f'Shape of data: {data.shape}')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), data['target'],
    test_size=0.3,
    random_state=0)
Shape of data: (50000, 301)

Note

In all feature selection procedures, it is good practice to select the features by examining only the training set; this helps avoid overfitting.

3.1.1 Removing Constant Features

# Approach 1 : Sklearn (Numerical)
vt_constant = VarianceThreshold(threshold=0)
vt_constant.fit(X_train)
constant_num_features = sum(~vt_constant.get_support())
print(f'Number of Constant Features: {constant_num_features}')
Number of Constant Features: 34
# Approach 2 : Manual (Numerical)
constant_num_features = [
    feature for feature in X_train.columns if X_train[feature].std() == 0
]
print(f'Number of Constant Features: {len(constant_num_features)}')
Number of Constant Features: 34
# Approach 2 : Manual (Numerical & Categorical)
constant_features = [
    feature for feature in X_train.columns if X_train[feature].nunique() == 1
]
print(f'Number of Constant Features: {len(constant_features)}')
Number of Constant Features: 34

3.1.2 Removing Quasi-Constant Features

Quasi-constant features are those that have the same value for the great majority of the observations. They can be removed using variance thresholding: with the Scikit-learn approach, all features whose variance falls below the threshold are removed.

# Approach 1 : Sklearn
vt_constant = VarianceThreshold(threshold=0.01)
vt_constant.fit(X_train)
qconstant_num_features = sum(~vt_constant.get_support())
print(f'Number of Quasi-Constant Features: {qconstant_num_features}')
Number of Quasi-Constant Features: 85
# Approach 2 : Manual (Numerical and Categorical)

qconstant_features = []

# iterate over every feature
for feature in X_train.columns:

    # find the share of observations taken by the most frequent value
    predominant = (X_train[feature].value_counts() / float(
        len(X_train))).sort_values(ascending=False).values[0]

    # if the predominant value occurs in more than 99.8% of observations
    if predominant > 0.998:

        # add the variable to the list
        qconstant_features.append(feature)

print(f'Number of Quasi-Constant Features: {len(qconstant_features)}')
Number of Quasi-Constant Features: 142

Approach 2 is a bit more aggressive than VarianceThreshold in flagging quasi-constant features. This can happen when the great majority of observations share the same value but the remaining tiny proportion varies a lot: the variance then exceeds 0.01, yet we would still consider the feature quasi-constant, as the small example below illustrates.
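For instance, a made-up feature in which 99.9% of the values are identical but the few remaining values are extreme passes the 0.01 variance threshold, yet Approach 2 flags it as quasi-constant:

# 9,990 identical values plus 10 extreme outliers (illustrative data only)
feature = pd.Series(np.concatenate([np.zeros(9990), np.full(10, 100.0)]))

predominant = feature.value_counts(normalize=True).iloc[0]
print(f'Share of most frequent value: {predominant:.4f}')  # 0.9990 -> quasi-constant by Approach 2
print(f'Variance: {feature.var():.2f}')                     # ~9.99 -> kept by VarianceThreshold(0.01)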

As you may have guessed by now, the code for quasi-constant removal also works for removing constant features.

3.1.3 Removing Duplicated Features

Finding duplicated features involves transposing the dataset and dropping the repeated rows. This can be computationally costly depending on the size of the dataset. It completes the basic filter methods of feature selection.

# transpose the feature matrix
train_features_T = X_train.T

# print the number of duplicated features
duplicated_num_features = train_features_T.duplicated().sum()
print(f'Number of duplicated features: {duplicated_num_features}')

# select the duplicated features columns names
duplicated_features = train_features_T[train_features_T.duplicated()].index.values
Number of duplicated features: 52
# drop basic filter columns
filter_features = list(duplicated_features) + list(qconstant_features)
X_train.drop(labels=filter_features, axis=1, inplace=True)
X_test.drop(labels=filter_features, axis=1, inplace=True)

3.2 Correlation

Correlation feature selection evaluates subsets of features on the basis of the following hypothesis: “Good feature subsets contain features highly correlated with the target, yet uncorrelated to each other”.

There are two approaches to selecting features based on correlation:

  1. Brute Force : find correlated features and remove them without any further insight
  2. Grouping : find groups of correlated features, which we can then explore to decide which one to keep and which to discard

Often, more than two features are correlated with each other; we can find groups of 3, 4 or more features that are correlated among themselves. By identifying these groups with approach 2, we can then select, from each group, which feature to keep and which ones to remove.

3.2.1 Brute Force

# set to hold the names of correlated features to remove
corr_features = set()

# create correlation matrix (default to pearson)
corr_matrix = X_train.corr()

# display a heatmap of the correlation matrix
# plt.figure(figsize=(11,11))
# sns.heatmap(corr_matrix)

for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            colname = corr_matrix.columns[i]
            corr_features.add(colname)
print(f'Number of Correlated Features : {len(corr_features)}')

X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)
Number of Correlated Features : 76

Although there are different ways to calculate correlation, we have stuck to the Pearson correlation here. Other approaches include Spearman and Kendall; you can find more details about each method and its applications online.
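For reference, switching the correlation method only requires changing the method argument of pandas' corr; the brute-force loop above can then be run on any of the resulting matrices.

# Pearson is the default; Spearman and Kendall are rank-based alternatives
corr_pearson = X_train.corr(method='pearson')
corr_spearman = X_train.corr(method='spearman')
corr_kendall = X_train.corr(method='kendall')   # noticeably slower on wide datasets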

3.2.2 Grouping

This approach looks to identify groups of highly correlated features. Subsequently, we can investigate each group further to decide which features to keep and which to remove.

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)
# Select feature pair with high correlation
corrmat = X_train.corr()
corrmat = corrmat.abs().unstack() # absolute value of corr coef
corrmat = corrmat.sort_values(ascending=False)
corrmat = corrmat[corrmat >= 0.8]
corrmat = corrmat[corrmat < 1]
corrmat = pd.DataFrame(corrmat).reset_index()
corrmat.columns = ['feature1', 'feature2', 'corr']
corrmat.head()

feature1 feature2 corr
0 var_67 var_129 1.0
1 var_129 var_67 1.0
2 var_287 var_129 1.0
3 var_129 var_287 1.0
4 var_14 var_129 1.0
# find groups of correlated features

grouped_feature_ls = []
correlated_groups = []

for feature in corrmat.feature1.unique():

    if feature not in grouped_feature_ls:

        # find all features correlated to a single feature
        correlated_block = corrmat[corrmat.feature1 == feature]
        grouped_feature_ls = grouped_feature_ls + list(
            correlated_block.feature2.unique()) + [feature]

        # append the block of features to the list
        correlated_groups.append(correlated_block)

print('found {} correlated groups'.format(len(correlated_groups)))
print('out of {} total features'.format(X_train.shape[1]))
found 69 correlated groups
out of 300 total features
# now we can print out each group. We see that some groups contain
# only 2 correlated features, some other groups present several features
# that are correlated among themselves.

for idx, group in enumerate(correlated_groups):
    if idx in [0, 1, 2]: # printing 3 groups
        print(group)
        print()
    feature1 feature2      corr
0     var_67  var_129  1.000000
128   var_67   var_13  0.990187

    feature1 feature2      corr
2    var_287  var_129  1.000000
131  var_287   var_13  0.990187

    feature1 feature2      corr
4     var_14  var_129  1.000000
6     var_14   var_66  1.000000
8     var_14   var_69  1.000000
125   var_14   var_13  0.990187
# we can now investigate further features within one group.
# let's for example select group 2
group = correlated_groups[2]
group

feature1 feature2 corr
4 var_14 var_129 1.000000
6 var_14 var_66 1.000000
8 var_14 var_69 1.000000
125 var_14 var_13 0.990187

**Here is the motivation for this approach:**

In this group, 5 features are highly correlated with each other. Which one should we keep and which ones should we remove?

Assuming we don't have missing data in these features, we can build an ML model on just this group and decide based on each feature's individual predictive power.

from sklearn.ensemble import RandomForestClassifier

# add all features of the group to a list
features = list(group['feature2'].unique())+['var_14']

# train a random forest
rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
rf.fit(X_train[features].fillna(0), y_train)

# feature importance
importance = pd.concat(
    [pd.Series(features),
     pd.Series(rf.feature_importances_)], axis=1)

importance.columns = ['feature', 'importance']

# sort features by importance, most important first
importance.sort_values(by='importance', ascending=False)

feature importance
3 var_13 0.998352
0 var_129 0.000520
1 var_66 0.000451
4 var_14 0.000357
2 var_69 0.000320

In this case, feature var_13 shows the highest importance according to the random forest, so we can drop the other features of this group from the dataset. In a similar way, we can build a model for each group and keep only the top feature of each group for modeling, as sketched below.
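One possible way to automate this over all the groups, reusing the random forest idea above (a rough sketch, not part of the original notebook):

# keep only the most important feature from each correlated group
selected_from_groups = []

for group in correlated_groups:

    # all features in the group: the correlated ones plus the anchor feature itself
    group_features = list(group['feature2'].unique()) + [group['feature1'].iloc[0]]

    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train[group_features].fillna(0), y_train)

    importance = pd.Series(rf.feature_importances_, index=group_features)
    selected_from_groups.append(importance.sort_values(ascending=False).index[0])

print(f'Kept one feature from each of {len(selected_from_groups)} groups')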

After using either approach, it is worth checking that no correlated features are left in the dataset, for example with the quick check below.
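A sanity check along these lines confirms that no pair of remaining features exceeds the chosen threshold once the discarded features have been dropped:

# count remaining feature pairs with absolute correlation above 0.8
remaining_corr = X_train.corr().abs()
np.fill_diagonal(remaining_corr.values, 0)  # ignore self-correlation
n_pairs = int((remaining_corr > 0.8).sum().sum() // 2)
print(f'Highly correlated pairs left: {n_pairs}')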

3.3 Statistical Tests

These methods use statistical tests to rank features based on certain criteria and select the features with the highest rankings. The ranking evaluates whether a feature is useful for discriminating the target.

3.3.1 Mutual Information

Mutual information measures the reduction in uncertainty about one variable when another variable is known. To select features, we are interested in the mutual information between each predictor and the target: higher mutual information values indicate less uncertainty about the target Y given the predictor X.

Using Scikit-learn, we can compute the mutual information between a feature and the target with mutual_info_classif or mutual_info_regression, for discrete (classification) or continuous targets respectively. We will use the Titanic dataset to demonstrate statistics-based feature selection.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import chi2

# to obtain the mutual information values
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

# to select the features
from sklearn.feature_selection import SelectKBest, SelectPercentile

data = pd.read_csv('titanic.csv')
data.dropna(inplace = True)
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'SibSp', 'Fare']],
    data['Survived'],
    test_size=0.3,
    random_state=0)
# Select the top k features based on mutual information
k_best = SelectKBest(mutual_info_classif, k=2).fit(X_train, y_train)

# display the features
print(f'Top Columns : {X_train.columns[k_best.get_support()]}')

X_train = k_best.transform(X_train)
X_test = k_best.transform(X_test)
Top Columns : Index(['Age', 'Fare'], dtype='object')
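If we also want to inspect the mutual information values themselves rather than just the selected columns, a small sketch (recomputing the split so that the column names are still available, since X_train was turned into an array by the transform above):

# mutual information of each feature with the target
X_tr, _, y_tr, _ = train_test_split(
    data[['Age', 'SibSp', 'Fare']], data['Survived'],
    test_size=0.3, random_state=0)

mi = mutual_info_classif(X_tr, y_tr, random_state=0)
mi = pd.Series(mi, index=X_tr.columns).sort_values(ascending=False)
print(mi)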

3.3.2 Chi-Squared

The chi-squared test computes the chi-squared statistic between each non-negative feature and the class. This score should be used to evaluate categorical variables in a classification task.

With chi-squared, a small p-value leads to rejecting the null hypothesis of independence, which indicates that the feature is significant for predicting the target.

# Encode categorical variables
# for Sex / Gender
data['Sex'] = np.where(data['Sex'] == 'male', 1, 0)

# for Embarked
ordinal_label = {k: i for i, k in enumerate(data['Embarked'].unique(), 0)}
data['Embarked'] = data['Embarked'].map(ordinal_label)
X_train, X_test, y_train, y_test = train_test_split(
    data[['Pclass', 'Sex', 'Embarked']],
    data['Survived'],
    test_size=0.3,
    random_state=0)
k_best = SelectKBest(chi2, k=2).fit(X_train, y_train)

# display features
print(f'Top Columns : {X_train.columns[k_best.get_support()]}')

X_train = k_best.transform(X_train)
X_test = k_best.transform(X_test)
Top Columns : Index(['Pclass', 'Sex'], dtype='object')
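To tie this back to the p-value discussion above, the chi2 function itself returns both the test statistics and the p-values; a minimal sketch (again recomputing the split to keep the column names):

# chi-squared statistic and p-value for each categorical feature
X_tr, _, y_tr, _ = train_test_split(
    data[['Pclass', 'Sex', 'Embarked']], data['Survived'],
    test_size=0.3, random_state=0)

chi_scores, p_values = chi2(X_tr, y_tr)
chi_results = pd.DataFrame({'chi2': chi_scores, 'p_value': p_values}, index=X_tr.columns)
print(chi_results.sort_values('p_value'))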

Other statistics-based filtering methods, such as ANOVA, ROC-AUC and R2 based selection, are also worth noting; they can be applied in the same fashion as shown above, as in the sketch below.
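As an example, ANOVA-based selection follows exactly the same pattern, simply swapping in f_classif (Scikit-learn's ANOVA F-test scorer) and using the numerical features:

from sklearn.feature_selection import f_classif

# ANOVA F-test between each numerical feature and the class target
X_tr, _, y_tr, _ = train_test_split(
    data[['Age', 'SibSp', 'Fare']], data['Survived'],
    test_size=0.3, random_state=0)

anova_best = SelectKBest(f_classif, k=2).fit(X_tr, y_tr)
print(f'Top Columns : {X_tr.columns[anova_best.get_support()]}')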

4. Conclusion

In this article, we have briefly introduced feature selection and its necessity in any ML problem, and we have seen different methods to perform it. You should now understand filter-based feature selection, comprising basic, correlation-based and statistics-based filtering. These methods are typically fast and largely model agnostic (except for some of the statistics-based criteria).

In the next articles of this series, we will go over the remaining methods.
