Five Obstacles Faced in Linear Regression

Linear Regression is one of the simplest machine learning algorithms. Its interpretability and ease of training make it a natural first step in Machine Learning. Because it is relatively uncomplicated, Linear Regression also serves as a foundation for understanding more complex algorithms.

To learn what linear regression is, how we train it, how we obtain the best-fit line, how we interpret it, and how we assess the accuracy of the fit, you may visit the following article.

Magic Of Calculus: Linear Regression by Aayush Ostwal (towardsdatascience.com)

After understanding the basic intuition of Linear Regression, certain concepts make it more fascinating and more fun. They also provide a deeper understanding of the flaws in the algorithm, their impact, and their remedies. We will explore these concepts in this article.

We all know that Linear Regression involves a few assumptions, and these assumptions keep the structure of the algorithm straightforward. However, they are also the reason it has several flaws, and why we need to study and understand those flaws.

This article discusses the problems that may occur while training a Linear model, and some methods to deal with them.

Five problems that lie in the scope of this article are:

  1. Non-Linearity of the response-predictor relationships
  2. Correlation of error terms
  3. A non-constant variance of the error term [Heteroscedasticity]
  4. Collinearity
  5. Outliers and High Leverage Points

1. Non-Linearity of the Response-Predictor Relationships

Source:

The root of this problem is one of the assumptions of linear regression: the assumption of linearity, which states that the relationship between the predictor and the response is linear.

If the actual relationship between the response and the predictor is not linear, then every conclusion we draw becomes null and void, and the accuracy of the model may drop significantly.

So, how can we deal with this problem?

Remedy:

The solution to the problem mentioned above is to plot Residual Plots.

A residual plot is a plot of the residual, i.e., the difference between the actual value and the predicted value, against the predictor.

Once we have the residual plot, we search for a pattern. If a pattern is visible, there is a non-linear relationship between the response and the predictor. If the plot looks random, we are on the right path!

After analyzing the type of pattern, we can apply a non-linear transformation such as the square root, cube root, or log function. This removes the non-linearity to some extent, and our linear model performs well.

Example:

Let us try to fit a straight line to a quadratic function. We will generate some random points using NumPy and take their squares as the response.

import numpy as np
import seaborn as sns

# Generate 100 random points in [0, 1) and square them to form the response
x = np.random.rand(100)
y = x * x

sns.scatterplot(x=x, y=y)

Let us see the scatter plot between x and y (Fig.1).

Fig.1: Scatter plot between x and y

Now, let us try to fit a linear model to this data and see the plot between residual and predictor.

from sklearn.linear_model import LinearRegression

# Fit a straight line to the quadratic data
model = LinearRegression()
model.fit(x.reshape(-1, 1), y.reshape(-1, 1))
predictions = model.predict(x.reshape(-1, 1))

# Residual = actual - predicted, plotted against the predictor
residual = y.reshape(-1, 1) - predictions
sns.scatterplot(x=x, y=residual.ravel())
Fig.2: Residual plot between the residuals and the predictor

We can see a quadratic trend in the residual plot. This trend helps us identify the non-linearity in the data. We can then apply a square-root transformation to make the data more suitable for the linear model.

If the data were linear, we would get random points: the residuals themselves would look randomized. In that case, we could move forward with the model.
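Continuing the example, here is a minimal sketch of the square-root remedy. Since the true relation is y = x², taking the square root of the response makes it exactly linear, and the residual plot loses its pattern.

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Same setup as above: a quadratic response
x = np.random.rand(100)
y = x * x

# Square-root transformation: sqrt(y) = x, an exactly linear relation
y_sqrt = np.sqrt(y)

model = LinearRegression()
model.fit(x.reshape(-1, 1), y_sqrt.reshape(-1, 1))
residual = y_sqrt.reshape(-1, 1) - model.predict(x.reshape(-1, 1))

# The residuals no longer show any trend against the predictor
sns.scatterplot(x=x, y=residual.ravel())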

2. Correlation of Error Terms

Source:

A principal assumption of the linear model is that the error terms are uncorrelated. "Uncorrelated" here means that the sign of the error for one observation tells us nothing about the errors of the other observations.

Correlation among error terms may arise from several factors. For instance, suppose we are observing people's weights and heights. Correlated errors may occur because of the diet they consume, the exercise they do, environmental factors, or because they are members of the same family.

What happens to the model when the errors are correlated? The standard errors of the model coefficients get underestimated. As a result, the confidence and prediction intervals will be narrower than they should be.

For more insights, please refer to the example below.

Remedies:

The solution is the same as for the previous problem: residual plots. If trends are visible in the residual plots, those trends can be expressed as some function of the predictor; hence, the errors are correlated!
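Beyond inspecting residual plots visually, one standard statistical check, not part of the original discussion but worth noting, is the Durbin-Watson test that ships with statsmodels. A minimal sketch on illustrative data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Illustrative data: a linear signal with independent Gaussian noise
rng = np.random.default_rng(0)
x = rng.random(100)
y = 3 * x + rng.normal(scale=0.1, size=100)

results = sm.OLS(y, sm.add_constant(x)).fit()

# The statistic lies in [0, 4]: values near 2 suggest uncorrelated errors,
# values toward 0 positive autocorrelation, toward 4 negative autocorrelation.
print(durbin_watson(results.resid))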

Example:

To understand the impact of correlation on the confidence interval, we should note two basic points.

  1. When we estimate model parameters, there is some error involved (the Standard Error, SE). This error arises because we estimate population characteristics from a sample, and it is inversely proportional to the square root of the number of observations.
  2. The 95% confidence interval for a model parameter extends about two standard errors on either side of the estimate. (Please refer to Fig.3.)
Fig.3: Confidence interval extending two standard errors around the parameter estimate

Now, suppose we have n data points. We calculate the standard error (SE) and the confidence interval. Next, suppose we double our data by duplicating every observation: we would then have 2n observations, with the error terms occurring in identical pairs.

If we now recalculate the SE, we will compute it for 2n observations. As a result, the standard error will be smaller by a factor of √2 (SE is inversely proportional to the square root of the number of observations), and we will obtain a narrower confidence interval, even though no new information has been added.
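We can verify this numerically. The following is a small sketch (illustrative data, statsmodels for the standard errors) that duplicates a dataset and compares the reported SE of the slope:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.random(100)
y = 3 * x + rng.normal(scale=0.5, size=100)

# Fit on the original n observations; bse holds the standard errors
se_n = sm.OLS(y, sm.add_constant(x)).fit().bse[1]

# Duplicate every observation: 2n points but no new information
x2, y2 = np.tile(x, 2), np.tile(y, 2)
se_2n = sm.OLS(y2, sm.add_constant(x2)).fit().bse[1]

# The reported SE shrinks by roughly sqrt(2), so the confidence
# interval becomes narrower than it honestly should be.
print(se_n / se_2n)  # ~1.414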

3. A Non-Constant Variance of the Error Term [Heteroscedasticity]

Source:

The source of this problem is also an assumption: that the error term has a constant variance, also referred to as homoscedasticity.

Generally, that is not the case. We can often identify a non-constant variance in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot. In Fig.2, a funnel shape would indicate that the error terms have non-constant variance.
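For a concrete picture of the funnel, here is a small illustrative sketch (synthetic data, not from the article's figures) in which the noise grows with the predictor:

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)

# Noise whose spread grows with x: a heteroscedastic error term
y = 2 * x + rng.normal(scale=0.5 * x)

model = LinearRegression().fit(x.reshape(-1, 1), y)
residual = y - model.predict(x.reshape(-1, 1))

# The residuals fan out as x grows: the funnel shape
sns.scatterplot(x=x, y=residual)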

Remedies:

One possible solution is to transform the response using a concave function such as the log or square root. Such a transformation shrinks the larger responses more, consequently reducing heteroscedasticity.

Example:

Let us try to apply the log transformation to the points generated in Problem 1.
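A minimal sketch of this transformation follows. Since y = x², we have log(y) = 2·log(x); plotting the log-transformed response against the log-transformed predictor (our choice here, to make the linearity exact) gives a straight line:

import numpy as np
import seaborn as sns

x = np.random.rand(100)
y = x * x

# log(y) = 2 * log(x): on a log-log scale the relation is exactly linear
# (rand() values are almost surely nonzero, so the logs are finite)
log_x, log_y = np.log(x), np.log(y)
sns.scatterplot(x=log_x, y=log_y)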

Fig.4: Scatter plot after the log transformation

We can observe a linear trend after the transformation. Hence, we may remove non-linearity by applying concave functions.

4. Collinearity

Collinearity refers to a situation in which two or more predictor variables are correlated with one another. For example, we can find such relationships between height and weight, the area of a house and its number of rooms, experience and income, and many more.

Source:

In linear regression, we assume that all the predictors are independent. But often the opposite is the case: the predictors are correlated with each other. Hence, it is essential to look at this problem and find a feasible solution.

When the assumption of independence is neglected, the following concerns arise:

  1. We cannot infer the individual effect of each predictor on the response. Because the predictors are correlated, a change in one variable tends to induce a change in another. Therefore, the accuracy of the model parameters drops significantly.
  2. When the accuracy of the model parameters drops, all our conclusions become void. We cannot tell the actual relationship between the response and the predictors, and model accuracy decreases as well.

Remedies:

There are two possible solutions to the problem.

  1. Drop the variable: We can drop the problematic variable from the regression. The intuition is that collinearity implies the information the variable provides, in the presence of the other variables, is redundant. Hence, we can drop it without much compromise.
  2. Combine the variables: We can combine both variables into a new one, a form of feature engineering. For example, merge weight and height into BMI (Body Mass Index). A sketch of both ideas follows this list.
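Below is a minimal sketch of how one might detect collinearity with variance inflation factors (VIF) from statsmodels, a diagnostic not named above but commonly paired with these remedies, and then apply remedy 2. The variable names and thresholds are illustrative:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, 200)   # kg, strongly tied to height

X = pd.DataFrame({"height": height, "weight": weight})
Xc = sm.add_constant(X)

# A VIF above roughly 5-10 is a common rule of thumb for problematic collinearity
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))

# Remedy 2: combine the two predictors into a single engineered feature (BMI)
X["bmi"] = X["weight"] / (X["height"] / 100) ** 2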

5. Outliers and High Leverage Points

Linear Regression is greatly affected by the presence of outliers and high leverage points. They may occur for a variety of reasons, and their presence can hugely affect model performance. This is also one of the limitations of linear regression.

Outlier: An outlier is an unusual observation of response y, for some given predictor x.

High Leverage Points: In contrast to an outlier, a high leverage point is defined as an unusual observation of the predictor x.


Several techniques are available for identifying outliers, including the interquartile range, scatter plots, residual plots, quantile-quantile plots, box plots, etc.
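As a minimal sketch of two of these techniques, the following uses the interquartile-range rule to flag an outlier in the response, and the hat (leverage) values from statsmodels to flag a high leverage point. The planted points and thresholds are illustrative:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(5, 1, 100)
y = 2 * x + rng.normal(0, 1, 100)
x[0], y[1] = 15, 40   # plant a high-leverage point and an outlier

# IQR rule: flag responses outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
outliers = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)
print("outliers at:", np.where(outliers)[0])

# Leverage: diagonal of the hat matrix; a common flag is > 2*(p+1)/n
results = sm.OLS(y, sm.add_constant(x)).fit()
leverage = results.get_influence().hat_matrix_diag
print("high leverage at:", np.where(leverage > 2 * 2 / len(x))[0])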

As this is a limitation of linear regression, it is vital to take the necessary steps. One method is to drop the outliers; however, this may lead to some loss of information. We can also use feature engineering to deal with outliers.

In this article, we have seen five problems that arise while working with linear regression, along with the sources, impacts, and solutions for each of them.

Though Linear Regression is the most basic machine learning algorithm, it offers vast scope for learning new things. For me, these problems provide a different point of view on Linear Regression.

I hope understanding these problems will provide you with novel insights when you solve any problem.

You may also check the complete playlist for Linear regression.
