Five Obstacles Faced in Linear Regression

Linear Regression is one of the simplest machine learning algorithms. Its interpretability and ease of training make it a natural first step in Machine Learning. Because it is relatively uncomplicated, Linear Regression also serves as a foundation for understanding more complex algorithms.

To learn what linear regression is, how we train it, how we obtain the best-fit line, how we interpret it, and how we assess the accuracy of the fit, you may visit the following article.

Magic Of Calculus: Linear Regression by Aayush Ostwal (towardsdatascience.com)

After understanding the basic intuition of Linear Regression, certain concepts make it more fascinating and more fun. They also provide a deeper understanding of the flaws in the algorithm, their impact, and their remedies. We will explore these concepts in this article.

We all know that Linear Regression involves a few assumptions, and these assumptions keep the structure of the algorithm straightforward. However, they are also the reason it has several flaws, and why we need to study and understand those flaws.

This article discusses the problems that may occur while training a Linear model, and some methods to deal with them.

Five problems that lie in the scope of this article are:

  1. Non-Linearity of the response-predictor relationships
  2. Correlation of error terms
  3. A non-constant variance of the error term [Heteroscedasticity]
  4. Collinearity
  5. Outliers and High Leverage Points

1. Non-Linearity of the Response-Predictor Relationships

Source:

The root of this problem is one of the assumptions of linear regression: the assumption of linearity, which states that the relationship between the predictor and the response is linear.

If the actual relationship between the response and the predictor is not linear, then every conclusion we draw becomes null and void, and the accuracy of the model may drop significantly.

So, how can we deal with this problem?

Remedy:

The solution to the problem mentioned above is to plot Residual Plots.

A residual plot is a plot of the residual, i.e., the difference between the actual value and the predicted value, against the predictor.

Once we have the residual plot, we search for a pattern. If a pattern is visible, there is a non-linear relationship between the response and the predictor. If the plot looks random, we are on the right path!

After analyzing the type of pattern, we can apply a non-linear transformation such as the square root, cube root, or log function. This removes the non-linearity to some extent, and our linear model performs well.

Example:

Let us try to fit a straight line to a quadratic function. We will generate some random points using NumPy and take their squares as the response.

import numpy as np
import seaborn as sns

# Generate 100 random points in [0, 1) and square them to form the response
x = np.random.rand(100)
y = x * x

sns.scatterplot(x=x, y=y)

Let us see the scatter plot between x and y (Fig.1).

Fig.1: Scatter plot between x and y

Now, let us try to fit a linear model to this data and see the plot between residual and predictor.

from sklearn.linear_model import LinearRegression

# Fit a straight line to the quadratic data
model = LinearRegression()
model.fit(x.reshape(-1, 1), y.reshape(-1, 1))
predictions = model.predict(x.reshape(-1, 1))

# Residual = actual - predicted, plotted against the predictor
residual = y.reshape(-1, 1) - predictions
sns.scatterplot(x=x, y=residual.ravel())
Fig.2: Residual plot between the residuals and the predictor

We can see a quadratic trend in the residual plot. This trend helps us identify the non-linearity in the data. We can then apply a square-root transformation to make the data more suitable for the linear model.

If the data were linear, we would get random points: the residuals themselves would look randomized. In that case, we could move forward with the model.
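Continuing the example, here is a minimal sketch of the square-root remedy. Since the true relation is y = x², taking the square root of the response makes it exactly linear, and the residual plot loses its pattern.

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Same setup as above: a quadratic response
x = np.random.rand(100)
y = x * x

# Square-root transformation: sqrt(y) = x, an exactly linear relation
y_sqrt = np.sqrt(y)

model = LinearRegression()
model.fit(x.reshape(-1, 1), y_sqrt.reshape(-1, 1))
residual = y_sqrt.reshape(-1, 1) - model.predict(x.reshape(-1, 1))

# The residuals no longer show any trend against the predictor
sns.scatterplot(x=x, y=residual.ravel())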

2. Correlation of Error Terms

Source:

A principal assumption of the linear model is that the error terms are uncorrelated. "Uncorrelated" here means that the sign of the error for one observation tells us nothing about the errors of the other observations.

Correlation among error terms may arise from several factors. For instance, suppose we are observing people's weights and heights. Correlated errors may occur because of the diet they consume, the exercise they do, environmental factors, or because they are members of the same family.

What happens to the model when the errors are correlated? The standard errors of the model coefficients get underestimated. As a result, the confidence and prediction intervals will be narrower than they should be.

For more insights, please refer to the example below.

Remedies:

The solution is the same as for the previous problem: residual plots. If trends are visible in the residual plots, those trends can be expressed as some function of the predictor; hence, the errors are correlated!
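Beyond inspecting residual plots visually, one standard statistical check, not part of the original discussion but worth noting, is the Durbin-Watson test that ships with statsmodels. A minimal sketch on illustrative data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Illustrative data: a linear signal with independent Gaussian noise
rng = np.random.default_rng(0)
x = rng.random(100)
y = 3 * x + rng.normal(scale=0.1, size=100)

results = sm.OLS(y, sm.add_constant(x)).fit()

# The statistic lies in [0, 4]: values near 2 suggest uncorrelated errors,
# values toward 0 positive autocorrelation, toward 4 negative autocorrelation.
print(durbin_watson(results.resid))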

Example:

To understand the impact of correlation on the confidence interval, we should note two basic points.

  1. When we estimate model parameters, there is some error involved (the Standard Error, SE). This error arises because we estimate population characteristics from a sample, and it is inversely proportional to the square root of the number of observations.
  2. The 95% confidence interval for a model parameter extends about two standard errors on either side of the estimate. (Please refer to Fig.3.)
Fig.3: Confidence interval extending two standard errors around the parameter estimate

Now, suppose we have n data points. We calculate the standard error (SE) and the confidence interval. Next, suppose we double our data by duplicating every observation: we would then have 2n observations, with the error terms occurring in identical pairs.

If we now recalculate the SE, we will compute it for 2n observations. As a result, the standard error will be smaller by a factor of √2 (SE is inversely proportional to the square root of the number of observations), and we will obtain a narrower confidence interval, even though no new information has been added.
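We can verify this numerically. The following is a small sketch (illustrative data, statsmodels for the standard errors) that duplicates a dataset and compares the reported SE of the slope:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.random(100)
y = 3 * x + rng.normal(scale=0.5, size=100)

# Fit on the original n observations; bse holds the standard errors
se_n = sm.OLS(y, sm.add_constant(x)).fit().bse[1]

# Duplicate every observation: 2n points but no new information
x2, y2 = np.tile(x, 2), np.tile(y, 2)
se_2n = sm.OLS(y2, sm.add_constant(x2)).fit().bse[1]

# The reported SE shrinks by roughly sqrt(2), so the confidence
# interval becomes narrower than it honestly should be.
print(se_n / se_2n)  # ~1.414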

3. A Non-Constant Variance of the Error Term [Heteroscedasticity]

Source:

The source of this problem is also an assumption: that the error term has a constant variance, also referred to as homoscedasticity.

Generally, that is not the case. We can often identify a non-constant variance in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot. In Fig.2, a funnel shape would indicate that the error terms have non-constant variance.
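For a concrete picture of the funnel, here is a small illustrative sketch (synthetic data, not from the article's figures) in which the noise grows with the predictor:

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)

# Noise whose spread grows with x: a heteroscedastic error term
y = 2 * x + rng.normal(scale=0.5 * x)

model = LinearRegression().fit(x.reshape(-1, 1), y)
residual = y - model.predict(x.reshape(-1, 1))

# The residuals fan out as x grows: the funnel shape
sns.scatterplot(x=x, y=residual)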

Remedies:

One possible solution is to transform the response using a concave function such as the log or square root. Such a transformation shrinks the larger responses more, consequently reducing heteroscedasticity.

Example:

Let us try to apply the log transformation to the points generated in Problem 1.
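A minimal sketch of this transformation follows. Since y = x², we have log(y) = 2·log(x); plotting the log-transformed response against the log-transformed predictor (our choice here, to make the linearity exact) gives a straight line:

import numpy as np
import seaborn as sns

x = np.random.rand(100)
y = x * x

# log(y) = 2 * log(x): on a log-log scale the relation is exactly linear
# (rand() values are almost surely nonzero, so the logs are finite)
log_x, log_y = np.log(x), np.log(y)
sns.scatterplot(x=log_x, y=log_y)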

Fig.4: Scatter plot after the log transformation

We can observe a linear trend after the transformation. Hence, we may remove non-linearity by applying concave functions.

4. Collinearity

Collinearity refers to a situation in which two or more predictor variables are correlated with one another. For example, we can find such relationships between height and weight, the area of a house and its number of rooms, experience and income, and many more.

Source:

In linear regression, we assume that all the predictors are independent. But often the opposite is the case: the predictors are correlated with each other. Hence, it is essential to look at this problem and find a feasible solution.

When the assumption of independence is neglected, the following concerns arise:

  1. We cannot infer the individual effect of each predictor on the response. Because the predictors are correlated, a change in one variable tends to induce a change in another. Therefore, the accuracy of the model parameters drops significantly.
  2. When the accuracy of the model parameters drops, all our conclusions become void. We cannot tell the actual relationship between the response and the predictors, and model accuracy decreases as well.

Remedies:

There are two possible solutions to the problem.

  1. Drop the variable: We can drop the problematic variable from the regression. The intuition is that collinearity implies the information the variable provides, in the presence of the other variables, is redundant. Hence, we can drop it without much compromise.
  2. Combine the variables: We can combine both variables into a new one, a form of feature engineering. For example, merge weight and height into BMI (Body Mass Index). A sketch of both ideas follows this list.
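Below is a minimal sketch of how one might detect collinearity with variance inflation factors (VIF) from statsmodels, a diagnostic not named above but commonly paired with these remedies, and then apply remedy 2. The variable names and thresholds are illustrative:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, 200)   # kg, strongly tied to height

X = pd.DataFrame({"height": height, "weight": weight})
Xc = sm.add_constant(X)

# A VIF above roughly 5-10 is a common rule of thumb for problematic collinearity
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))

# Remedy 2: combine the two predictors into a single engineered feature (BMI)
X["bmi"] = X["weight"] / (X["height"] / 100) ** 2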

5. Outliers and High Leverage Points

Linear Regression is greatly affected by the presence of outliers and high leverage points. They may occur for a variety of reasons, and their presence can hugely affect model performance. This is also one of the limitations of linear regression.

Outlier: An outlier is an unusual observation of response y, for some given predictor x.

High Leverage Points: In contrast to an outlier, a high leverage point is defined as an unusual observation of the predictor x.


Several techniques are available for identifying outliers, including the interquartile range, scatter plots, residual plots, quantile-quantile plots, box plots, etc.
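As a minimal sketch of two of these techniques, the following uses the interquartile-range rule to flag an outlier in the response, and the hat (leverage) values from statsmodels to flag a high leverage point. The planted points and thresholds are illustrative:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(5, 1, 100)
y = 2 * x + rng.normal(0, 1, 100)
x[0], y[1] = 15, 40   # plant a high-leverage point and an outlier

# IQR rule: flag responses outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
outliers = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)
print("outliers at:", np.where(outliers)[0])

# Leverage: diagonal of the hat matrix; a common flag is > 2*(p+1)/n
results = sm.OLS(y, sm.add_constant(x)).fit()
leverage = results.get_influence().hat_matrix_diag
print("high leverage at:", np.where(leverage > 2 * 2 / len(x))[0])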

As this is a limitation of linear regression, it is vital to take the necessary steps. One method is to drop the outliers; however, this may lead to some loss of information. We can also use feature engineering to deal with outliers.

In this article, we have seen five problems that arise while working with linear regression, along with the sources, impacts, and solutions for each of them.

Though Linear Regression is the most basic machine learning algorithm, it offers vast scope for learning new things. For me, these problems provide a different point of view on Linear Regression.

I hope understanding these problems will provide you with novel insights when you solve any problem.

You may also check the complete playlist for Linear regression.
