Mastering the Art of Feature Selection: Python Techniques for Visualizing Feature Importance (2024)

Feature selection is one of the most crucial steps in building machine learning models. As a data scientist, I know the importance of identifying and selecting the most relevant features that contribute to the predictive power of the model while minimising the effects of irrelevant or redundant features. One way to do this is by visualising feature importance.

In this article, I will share my experience with different methods for visualising feature importance in a dataset using Python. I will provide code snippets and examples for each method and explain their interpretation. By the end of this article, you will have a deeper understanding of the different methods available for visualising feature importance and how to apply them to your own datasets.
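Throughout the article, the snippets assume a pandas DataFrame called features holding the predictors and a Series called target holding the labels, together with the usual imports. Here is a minimal setup sketch that uses the scikit-learn breast cancer dataset purely as a stand-in; substitute your own data here (note that its features are all continuous, whereas the examples below also mention discrete features):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Example data: replace with your own features DataFrame and target Series
data = load_breast_cancer(as_frame=True)
features = data.data      # DataFrame of predictor columns
target = data.target      # Series of class labels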

Method 1: Correlation Matrix Heatmap

One way to visualise feature importance is by creating a correlation matrix heatmap. A correlation matrix is a table that shows the pairwise correlations between different features in the dataset.

The heatmap shows the strength and direction of the correlation between each pair of features. A correlation close to 1 or -1 indicates that two features are strongly linearly related, while a value close to 0 indicates little to no linear relationship. (In the example below, absolute correlations are plotted, so only the strength is shown.)

In our case, we use a correlation matrix heatmap to identify highly correlated features in the dataset. Highly correlated features provide largely redundant information to the model, which can negatively impact its performance. By visualising the correlation matrix heatmap, we can spot such features and remove one of each correlated pair from the dataset.

Here’s an example of using a correlation matrix heatmap to visualise feature correlation in a dataset with both continuous and discrete features:

# Create a correlation matrix
corr_matrix = features.corr().abs()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='GnBu', linewidths=0.2, vmin=0, vmax=1)
plt.xlabel('Features')
plt.ylabel('Features')
plt.title('Feature Importances using Correlation Matrix Heatmap')
plt.show()

[Figure: correlation matrix heatmap of absolute pairwise feature correlations]
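Once highly correlated pairs stand out in the heatmap, one feature from each pair can also be dropped programmatically. Continuing from the corr_matrix computed above, here is a small sketch assuming a 0.9 absolute-correlation threshold (the threshold is an illustrative choice):

# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Drop one feature from every pair with absolute correlation above 0.9
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
features_reduced = features.drop(columns=to_drop)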

Alternatively, the correlation of each feature with the target variable can be visualised to identify which features are most strongly related to it. These features may be important for the model's prediction, and visualising them can give us insights into how they influence the target variable.

Here’s an example code snippet:

# Create a correlation matrix with target variable
corr_with_target = features.corrwith(target)

# Sort features by correlation with target variable
corr_with_target = corr_with_target.sort_values(ascending=False)

# Plot the heatmap
plt.figure(figsize=(4, 8))
sns.heatmap(corr_with_target.to_frame(), cmap='GnBu', annot=True)
plt.title('Correlation with Target Variable')
plt.show()

[Figure: heatmap of feature correlations with the target variable]

Method 2: Univariate Feature Selection

Another way to visualise feature importance is by using univariate feature selection. Univariate feature selection is a statistical method that scores each feature individually against the target variable and keeps the features with the highest scores. In other words, it selects the features that, on their own, are most likely to be relevant for predicting the target variable.

It is important to mention that the effectiveness of this method can be influenced by the scale of the features, which is why the examples below work on a scaled copy of the feature matrix (see the sketch that follows).
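The univariate examples below therefore operate on df_scaled, a scaled copy of the feature matrix. Here is a minimal sketch of how it might be built, assuming min-max scaling so that all values stay non-negative (a requirement of the chi-square test):

from sklearn.preprocessing import MinMaxScaler

# Scale every feature to the [0, 1] range; chi2 requires non-negative inputs
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)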

Here’s an example of using univariate feature selection to visualise feature importance in a dataset with both continuous and discrete features using the chi-square test:

from sklearn.feature_selection import SelectKBest, chi2

# apply univariate feature selection with the chi-square test
best_features = SelectKBest(score_func=chi2, k=5).fit(df_scaled, target)

# get the scores and selected features
scores = best_features.scores_
selected_features = df_scaled.columns[best_features.get_support()]

# sort features by score in descending order
sorted_idxs = np.argsort(scores)[::-1]
sorted_scores = scores[sorted_idxs]
sorted_feature_names = np.array(df_scaled.columns)[sorted_idxs]

# plot scores
plt.figure(figsize=(12, 6))
sns.barplot(x=sorted_scores, y=sorted_feature_names)
plt.xlabel('Scores')
plt.ylabel('Features')
plt.title('Feature Importances using Univariate Feature Selection (Chi-square test)')
plt.show()

[Figure: bar plot of chi-square scores for each feature]

Here’s an example of using univariate feature selection to visualise feature importance in a dataset with both continuous and discrete features using the ANOVA F-test:

from sklearn.feature_selection import f_classif

# apply univariate feature selection with the ANOVA F-test
best_features = SelectKBest(score_func=f_classif, k=5).fit(df_scaled, target)

# get the scores and selected features
scores = best_features.scores_
selected_features = df_scaled.columns[best_features.get_support()]

# sort features by score in descending order
sorted_idxs = np.argsort(scores)[::-1]
sorted_scores = scores[sorted_idxs]
sorted_feature_names = np.array(df_scaled.columns)[sorted_idxs]

# plot scores
plt.figure(figsize=(12, 6))
sns.barplot(x=sorted_scores, y=sorted_feature_names)
plt.xlabel('Scores')
plt.ylabel('Features')
plt.title('Feature Importances using Univariate Feature Selection (ANOVA)')
plt.show()

[Figure: bar plot of ANOVA F-scores for each feature]

Strictly speaking, chi-square and mutual information tests suit discrete features, while ANOVA and correlation-based tests suit continuous ones. Here I applied each test to all features rather than restricting it to the matching feature type, to see how the choice of test affects the results.
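For the mutual information variant mentioned above, the same pattern applies; here is a sketch using scikit-learn's mutual_info_classif:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# apply univariate feature selection with mutual information
best_features = SelectKBest(score_func=mutual_info_classif, k=5).fit(df_scaled, target)
scores = best_features.scores_

# sort and plot, mirroring the chi-square and ANOVA examples above
sorted_idxs = np.argsort(scores)[::-1]
plt.figure(figsize=(12, 6))
sns.barplot(x=scores[sorted_idxs], y=np.array(df_scaled.columns)[sorted_idxs])
plt.xlabel('Scores')
plt.ylabel('Features')
plt.title('Feature Importances using Univariate Feature Selection (Mutual Information)')
plt.show()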

Method 3: Recursive Feature Elimination

Recursive feature elimination (RFE) is a machine learning technique that selects features by recursively considering smaller and smaller sets of features. It starts with all features, fits a model, eliminates the least important feature according to the model's importance scores, and repeats the process until the desired number of features remains. In this case, we set the n_features_to_select parameter to keep the 5 most important features.

Here’s an example of using recursive feature elimination to visualise feature importance in a dataset with both continuous and discrete features:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Create a random forest classifier
clf = RandomForestClassifier()

# Apply recursive feature elimination
selector = RFE(clf, n_features_to_select=5)
selector = selector.fit(features, target)
X_new = selector.transform(features)

# Plot feature importances
importances = selector.estimator_.feature_importances_
std = np.std([tree.feature_importances_ for tree in selector.estimator_.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(12, 6))
plt.title("Feature importances")
plt.bar(range(X_new.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(X_new.shape[1]), features.columns[selector.get_support()][indices], rotation=90)
plt.xlim([-1, X_new.shape[1]])
plt.ylabel('Feature Importance Scores')
plt.xlabel('Features')
plt.title('Feature Importances using Recursive Feature Elimination based on Random Forest')
plt.show()

[Figure: bar plot of Random Forest importances for the 5 features selected by RFE]
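Besides the importances of the surviving features, the fitted selector exposes a ranking_ attribute that records when each feature was eliminated (rank 1 marks the selected features, higher ranks were eliminated earlier). A short follow-up sketch:

# Rank 1 = kept; higher ranks were eliminated earlier in the recursion
ranking = pd.Series(selector.ranking_, index=features.columns).sort_values()
print(ranking)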

Method 4: Feature Importance from Tree-based Models

Another method for visualising feature importance is by using tree-based models such as Random Forest or Gradient Boosting. These models can be used to rank the importance of each feature in the dataset. In Python, we can use the feature_importances_ attribute of the trained tree-based models to get the feature importance scores. The scores can be visualised using a bar chart.

Here is an example code snippet for visualising feature importance from a Random Forest model:

# Train Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(features, target)

# Get feature importances
importances = rf_model.feature_importances_

# Visualize feature importances
plt.figure(figsize=(12, 6))
plt.bar(range(features.shape[1]), importances)
plt.xticks(range(features.shape[1]), features.columns, rotation=90)
plt.ylabel('Feature Importance Scores')
plt.xlabel('Features')
plt.title('Feature Importances using Random Forest')
plt.show()

[Figure: bar plot of Random Forest feature importances]
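The same pattern applies to other tree-based models. For example, a Gradient Boosting classifier exposes the same feature_importances_ attribute; here is a sketch along the same lines:

from sklearn.ensemble import GradientBoostingClassifier

# Train a Gradient Boosting model and plot its importances the same way
gb_model = GradientBoostingClassifier()
gb_model.fit(features, target)

plt.figure(figsize=(12, 6))
plt.bar(range(features.shape[1]), gb_model.feature_importances_)
plt.xticks(range(features.shape[1]), features.columns, rotation=90)
plt.ylabel('Feature Importance Scores')
plt.xlabel('Features')
plt.title('Feature Importances using Gradient Boosting')
plt.show()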

Method 5: LASSO Regression

LASSO (Least Absolute Shrinkage and Selection Operator) is a modification of linear regression that performs both feature selection and regularisation to prevent overfitting. LASSO shrinks the regression coefficients of less important features to zero, effectively removing them from the model. The remaining non-zero coefficients indicate the important features.

As with univariate selection, the effectiveness of this method can be influenced by the scale of the features, so the example again uses the scaled copy of the feature matrix, df_scaled.

Here’s an example of using LASSO regression to visualise feature importance in a dataset with both continuous and discrete features:

from sklearn.linear_model import LassoCV

# Fit the LASSO model with cross-validated regularisation strength
lasso = LassoCV(cv=5, random_state=0)
lasso.fit(df_scaled, target)

# Plot the coefficients
plt.figure(figsize=(10,6))
plt.plot(range(len(df_scaled.columns)), lasso.coef_, marker='o', markersize=8, linestyle='None')
plt.axhline(y=0, color='gray', linestyle='--', linewidth=2)
plt.xticks(range(len(df_scaled.columns)), df_scaled.columns, rotation=90)
plt.ylabel('Coefficients')
plt.xlabel('Features')
plt.title('Feature Importance using LASSO Regression')
plt.show()

[Figure: LASSO coefficients plotted per feature]
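Because LASSO shrinks unimportant coefficients exactly to zero, the selected features can also be read directly from the fitted model. A short follow-up sketch:

# Features whose coefficients were not shrunk to zero
selected = df_scaled.columns[lasso.coef_ != 0]
print(f'Features kept by LASSO: {list(selected)}')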

Conclusion

In this article, we explored different methods for visualising feature importance in a dataset using Python. We covered correlation matrix heatmaps, univariate feature selection, recursive feature elimination, feature importance from tree-based models, and LASSO regression.

Visualising feature importance is an important step in the machine learning workflow as it helps identify the most important features that contribute to the predictive power of the model. By using the methods covered in this article, you can gain insights into the relationships between features and their impact on the target variable.

Remember, feature selection is not a one-size-fits-all approach, and the best method for your dataset may depend on your specific problem and data. Therefore, it is always a good idea to try different methods and evaluate their performance before selecting the best one for your problem.

It’s also important to note that feature importance is just one aspect of feature selection. Depending on the problem at hand, other methods such as principal component analysis (PCA) or independent component analysis (ICA) may be more appropriate. Finally, use domain knowledge to guide feature selection rather than relying solely on automatic methods.
