4 Methods to Power Feature Selection for Your Next ML Model

Plus, tips for dealing with categoric, numeric, and mixed data

4 Methods to Power Feature Selection for Your Next ML Model

Photo by Edu Grande on Unsplash

What is Feature Selection?

Feature Selection in Machine Learning is selecting the most impactful features, or columns, in a dataset.

Does your dataset have many columns, and do you want to see which have the biggest impact? Do you want to discard those that aren’t generating much value?

By performing feature selection, you’re not only reducing the amount of data that needs to be processed to speed up your analysis, but you’re simplifying the interpretation of the model, making it easier for others to understand.

Depending on the types of data you have, one can use several techniques, ranging from Statistical methods to leveraging a machine learning model to make the selection. We’ll look at a few of the most common techniques and see how they are applied in practice!

  1. Categorical data using the Chi-Squared Test

  2. Pearson’s Correlation Coefficient for Numeric Data

  3. Principal Component Analysis for Numeric Data

  4. Feature Importance with Random Forests for Both Categorical and Numeric Data

Let’s get started!

Data and Imports

For our demonstration today, we will use the Bank Marketing UCI dataset, which one can find on Kaggle. This dataset contains information about Bank customers in a marketing campaign, and it contains a target variable that one can utilize in a classification model. This dataset is in the public domain under CC0: Public Domain and can be used for any purpose.

For more information on building classification models, check out: Everything You Need to Know to Build an Amazing Binary Classifier and Go Beyond Binary Classification with Multiclass and Multi-Label Models.

We will start by importing the necessary libraries and loading the data. We’ll be utilizing Scikit-Learn for each of our different techniques. In addition to what is demonstrated here, there are many other supported methods.

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("bank.csv", delimiter=";")

Feature Selection for Categorical Values

If your dataset is categorical, we can use Pearson’s Chi-Squared Test. According to Wikipedia:

The Chi-Squared test is a statistical test applied to categorical data to evaluate how likely it is that any observed difference between the sets arose by chance.

You apply the Chi-Squared test when both your feature data is categorical, and your target data is categorical, e.g., classification problems.

Note: While this dataset contains a mix of categorical and numeric values, we’ll isolate the categorical values to demonstrate how you would apply the Chi-Squared test. A better method for this dataset will be described below via Feature Importances to select features across categorical and numeric types.

We’ll start by selecting only those types that are categorical or of type object in Pandas. Pandas stores text as objects, so you should validate if these are categorical values before simply utilizing the object type.

# get categorical data
cat_data = df.select_dtypes(include=['object'])

We can then isolate the features and the target values. The target variable, y, is the last column in the Data Frame, and therefore we can use Python's slicing technique to separate them into X and y.

X = cat_data.iloc[:, :-1].values
y = cat_data.iloc[:,-1].values

Next, we have two functions. These functions will use the OrdinalEncoder for the X data and the LabelEncoder for the y data. As the name implies, the OrdinalEncoder will convert categorical values to numerical representation honoring a specific order. By default, the encoder automatically selects this order. However, you can provide a list of values for an order. The LabelEncoder will convert like values into numerical representations. Per Scikit-Learn's documentation, you should only use the LabelEncoder on the target variable.

def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

Next, we’ll split our data into train and test sets and process the data with the above functions. Note how the above functions only fit the encoders to the training data and transform both train and test data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

Now a function that will help us select the best features utilizing the Chi-Squared test inside the SelectKBest method. We can start by setting the argument k='all', which will first run the test across all features, and later we can apply it to a specific number of features.

def select_features(X_train, y_train, X_test, k_value='all'):
    fs = SelectKBest(score_func=chi2, k=k_value)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

We can start by printing off the scores for each feature.

# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
Feature 0: 11.679248
Feature 1: 0.248626
Feature 2: 3.339391
Feature 3: 0.039239
Feature 4: 11.788867
Feature 5: 12.889637
Feature 6: 64.864792
Feature 7: 4.102635
Feature 8: 10.921719

However, it’s easier to inspect these scores with a plot visually.

# what are scores for the features
names = []
values = []
for i in range(len(fs.scores_)):
chi_list = zip(names, values)

# plot the scores
sns.barplot(x=names, y=values)
plt.xticks(rotation = 90)

Image by Author

Contact has the largest score here, while marital, default, and month have the lowest. Overall it looks like there are about 5 features that are worth considering. We'll use the SelectKBest method to select the top 5 features.

X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc, 5)

And we can print the top features, the values corresponding to their index above.

array(['x0', 'x4', 'x5', 'x6', 'x8'], dtype=object)

And finally, we can print the shape of the X_train_fs and X_test_fs data and see that the second dimension is 5 for the selected features.

(3029, 5)
(1492, 5)

Feature Selection for Numeric Values

When dealing with pure numeric data, there are two methods that I prefer to use. The first is Pearson’s Correlation Coefficient, and the second is Principal Component Analysis or PCA.

Pearson’s Correlation Coefficient

Let’s start with Pearson’s Correlation Coefficient. While this isn’t explicitly a feature selection method, it helps us visualize highly correlated features. When two or more features are highly correlated, they contribute very similar information to a model when training. By calculating and plotting a correlation matrix, we can quickly check to see if any values are highly correlated and if so, we can choose to remove one or more of them from our model.

corr = df.corr()

f, ax = plt.subplots(figsize=(12, 8))

sns.heatmap(corr, cmap="Blues", annot=True, square=False, ax=ax)
plt.title('Pearson Correlation of Features')

Image by Author

In this example, pdays and previous have the strongest correlation of 0.58, and everything else is independent of each other. A correlation of 0.58 isn't very strong. Therefore I will choose to leave both in the model.

Principal Component Analysis

Principal Component Analysis is the most powerful method for feature selection in models where all data is numeric. PCA is not a feature selection method but a dimensionality reduction method. However, the objective with PCA is similar to Feature Selection, where we’re looking to reduce the amount of data needed to compute the model.

The following image represents PCA applied to a photo. The plot represents the amount of cumulative explained variance for each principal component. In this example, we can explain 95% of the variance in the image with only 54 of the 3,000+ principal components of the image.

Image by Author

For more information on how PCA works and performs, please review the following articles: 2 Beautiful Ways to Visualize PCA and The Magic of Principal Component Analysis through Image Compression.

Feature Importance from Random Forests

Finally, my preferred method for feature selection is to utilize Random Forest and its ability to calculate Feature Importance. Scikit-Learn will compute the list of features and their relative contribution to the overall performance as an output of a trained model. This method also works with other Bagged trees like Extra Trees and Gradient Boosting for classification and regression.

The bonus of this method is that it fits well with the overall flow of building a model. Let’s walk through how this works. Let’s start by getting all the columns from our Data Frame. Note that we can utilize both categorical and numeric data in this case.

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y'],

When reading the documentation for this dataset, you’ll notice that the Duration column is something that we shouldn’t use to train your model. We’ll manually remove it from our list of columns.

Duration*: last contact duration, in seconds (numeric). This attribute highly affects the output target (e.g., if duration=0, then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.*

Additionally, we can remove any other features that might not be useful if we’d like. For this example, we’ll keep all of them aside from duration.

# Remove columns from the list that are not relevant. 
targets = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'campaign', 'pdays', 'previous', 'poutcome']

Next, we will utilize a Column Transformer to transform our data into a machine learning acceptable format. I prefer to use pipelines whenever I build a model for repeatability.

Check out my article: Stop Building Your Models One Step at a Time for more information on them: Automate the Process with Pipelines!.

I’ve chosen the MinMaxScaler for numeric values and the OrdinalEncoder for categorical values. In the final model, I would most likely use OneHotEncoder (OHE) for the categorical features, but to determine feature importance, we don't want to expand the columns with OHE; we'll get more value out of treating them as one column with ordinal encoded values.

column_trans = ColumnTransformer(transformers=
        [('num', MinMaxScaler(), selector(dtype_exclude="object")),
        ('cat', OrdinalEncoder(), selector(dtype_include="object"))],

Next, we create an instance of our classifier with a few preferred settings, such as class_weight='balanced', which helps deal with imbalanced data. We'll also set the random_state=42 to ensure that we get the same results.

For more on imbalanced data, check out: Don’t Get Caught in the Trap of Imbalanced Data When Building Your ML Model.

# Create a random forest classifier for feature importance
clf = RandomForestClassifier(random_state=42, n_jobs=6, class_weight='balanced')

pipeline = Pipeline([('prep',column_trans),
                     ('clf', clf)])

Next, we’ll split our data into training and test sets and fit our model to the training data.

# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(df[targets], df['y'], test_size=0.3, random_state=0)

pipeline.fit(X_train, y_train)

We can call the feature_importances_ method against the classifier to see the output. Note how you reference the classifier in the pipeline by calling its name clf, similar to a dictionary in Python.

array([0.12097191, 0.1551929 , 0.10382712, 0.04618367, 0.04876248,
       0.02484967, 0.11530121, 0.15703306, 0.10358275, 0.04916597,
       0.05092775, 0.02420151])

Next, let’s display these sorted by their greatest and cumulative importance. The cumulative importance column is useful to visualize how the total adds up to 1. There is also a loop to cut off any feature that contributes less than 0.5 to the overall importance. This cutoff value is arbitrary. You could look for the total cumulative importance of 0.8 or 0.9. Play around with how your model performs with fewer or more features.

feat_list = []

total_importance = 0
# Print the name and gini importance of each feature
for feature in zip(targets, pipeline['clf'].feature_importances_):
    total_importance += feature[1]

included_feats = []
# Print the name and gini importance of each feature
for feature in zip(targets, pipeline['clf'].feature_importances_):
    if feature[1] > .05:

print('\n',"Cumulative Importance =", total_importance)

# create DataFrame using data
df_imp = pd.DataFrame(feat_list, columns =['FEATURE', 'IMPORTANCE']).sort_values(by='IMPORTANCE', ascending=False)
df_imp['CUMSUM'] = df_imp['IMPORTANCE'].cumsum()
1         job    0.166889  0.166889
0         age    0.151696  0.318585
2     marital    0.134290  0.452875
13   previous    0.092159  0.545034
6     housing    0.078803  0.623837
3   education    0.072885  0.696722
4     default    0.056480  0.753202
12      pdays    0.048966  0.802168
8     contact    0.043289  0.845457
7        loan    0.037978  0.883436
14   poutcome    0.034298  0.917733
10      month    0.028382  0.946116
5     balance    0.028184  0.974300
11   campaign    0.021657  0.995957
9         day    0.004043  1.000000

Finally, based on that loop, let’s print out the features that we’ve selected overall. Based on this analysis, we removed about 50% of them from our model, and we can see which ones have the highest impact!

print('Most Important Features:')
print('Number of Included Features =', len(included_feats))
Most Important Features:
['age', 'job', 'marital', 'education', 'default', 'housing', 'previous']
Number of Included Features = 7

Thank you for reading! You can find all the code for this article on GitHub


Feature Selection is a critical part of the model-building process, and it not only helps improve performance but also simplifies your model and its interpretation. We started looking at utilizing the Chi-Squared test for independence to select categorical features. Next, we looked at Pearson’s Correlation Matrix to visually identify highly correlated numeric features. We also touched on Principal Component Analysis (PCA) as a tool to reduce dimensionality automatically on numeric datasets. Last, we looked at feature_importances_ Random Forest as a way to select across both categorical and numeric values. We hope this gets you started on your journey to Feature Selection!