Basics of Regression - MAKE ME ANALYST (2024)

Regression Basics

Regression analysis allows you to infer that there is a relationship between two or more variables. These relationships are seldom exact because there is variation caused by many variables, not just the variables being studied.

Regression – A statistical Technique

Regression is a statistical technique used to analyze the relationship between two or more variables. The basic idea behind regression is to predict the value of a dependent variable based on the values of one or more independent variables. The dependent variable is also known as the response variable or outcome variable, while the independent variable is also known as the predictor variable or explanatory variable.

The most commonly used type of regression is linear regression, which involves fitting a straight line to the data. In linear regression, the dependent variable is assumed to have a linear relationship with the independent variable(s). The line of best fit is determined using a method called least squares regression, which minimizes the sum of the squared differences between the observed values and the predicted values.

Different Types of Regression

There are many other types of regression, including logistic regression, polynomial regression, and multiple regression. Logistic regression is used when the dependent variable is binary (i.e., only has two possible values). Polynomial regression is used when the relationship between the dependent and independent variables is not linear. Multiple regression is used when there are multiple independent variables that may be affecting the dependent variable.

What is regression?

Pearson’s r, or Pearson’s correlation coefficient, describes how strong the linear relationship between two continuous variables is. This linear relationship can be displayed as a straight line, which is called the regression line.

So, the regression line is the straight line that best describes the relationship between two variables. If you can find the best-fitting line in your data, you can use its equation to make predictions: for a given value of x, the regression line gives the predicted value ŷ. But the question is how to find the regression line.

Simple regression and least squares method

How do you find the regression line, that is, the line that fits the data best?

Let’s imagine that you have a scatterplot of two variables and you have drawn every possible straight line through it. That’s a huge number of lines, so in practice it is impossible to do. However, for now, imagine that you have superhuman powers and are able to do it. Next, for every possible line, you measure the distances from the line to every case.

How do you know which of these lines is the best fit?

To identify the best line, we first need to define the residual.

Residual

A residual is the difference between an observed value and the value predicted by the line. Each vertical distance from a point to the line is a residual.

Suppose you have a point (x1, y1) in the scatterplot and the regression line ŷ = a + bx.

Then ŷ1 = a + bx1, and the residual for that observation is y1 - ŷ1.

Data = Fit + Residual
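The decomposition above can be checked numerically. The following is a minimal sketch in Python; the data points and the line (a = 0, b = 2) are hypothetical values chosen only for illustration and do not come from the article.

```python
# Hypothetical data points and a hypothetical candidate line y_hat = a + b*x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
a, b = 0.0, 2.0  # assumed intercept and slope, for illustration only

fitted = [a + b * x for x in xs]                  # y_hat for each observation
residuals = [y - f for y, f in zip(ys, fitted)]   # observed minus predicted

# Data = Fit + Residual holds for every observation (up to rounding error).
for y, f, e in zip(ys, fitted, residuals):
    assert abs(y - (f + e)) < 1e-12

print(residuals)
```

Each residual is positive when the point lies above the line and negative when it lies below.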

For every line, you square the residuals and add them up. Finally, you choose the line that minimizes this sum of squared residuals; this is the least squares criterion. The method is called ordinary least squares (OLS) regression.
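To make the idea concrete, here is a small Python sketch that scores a handful of candidate lines by their sum of squared residuals and keeps the smallest. The data and the candidate (intercept, slope) pairs are hypothetical, and the sketch only picks the best of the listed candidates; it does not search all possible lines the way a real least squares fit does.

```python
def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8]

# A small, hypothetical set of candidate (intercept, slope) pairs.
candidates = [(0.0, 1.5), (0.0, 2.0), (0.5, 2.0), (1.0, 1.0)]

# Keep the candidate with the smallest sum of squared residuals.
best = min(candidates, key=lambda line: sse(line[0], line[1], xs, ys))
print(best, sse(best[0], best[1], xs, ys))
```

Among these four candidates, the line with intercept 0 and slope 2 has the smallest sum of squared residuals.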

Once we have the best-fit line, we have the equation of the regression line:

ŷ = a + bx

a = intercept

b = regression coefficient (slope)

Slope and Intercept

In the regression equation, the constant a is called the intercept and b is called the slope or regression coefficient. The intercept is the predicted value of y when x is 0, and the slope is the change in the predicted y for a one-unit increase in x.

Two lines can have the same slope but different intercepts, in which case they are parallel. They can also have the same intercept but different slopes, in which case they cross the y-axis at the same point.

Prediction Using Regression Formula

Once we have the regression equation, we can predict ŷ for any given value of x by substituting that value into the equation.

Estimating Regression Parameters

In practice it is almost impossible to draw every possible line and to compute for every single possible line all the residuals. Luckily, mathematicians have found a trick to find the regression line. I won’t explain how this trick works here, because it is rather complicated. For now it suffices to know that it is based on minimizing the sum of the squared residuals.

Usually the computer finds the regression line for you, so you don’t have to compute it yourself. However, when you know the means and standard deviations of your variables and the corresponding Pearson’s r correlation coefficient, you can compute the regression equation by means of two formulas.

Here are the formulas for estimating the slope (b) and the intercept (a). Keep in mind that x̅ and y̅ are the means of the independent and dependent variables, r is Pearson’s correlation coefficient, and sx and sy are the standard deviations of your x and y variables.

b = r × (sy / sx)

a = y̅ - b × x̅
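The two formulas, slope b = r × (sy / sx) and intercept a = y̅ - b × x̅, translate directly into code. This is a minimal Python sketch; the function name `line_from_summary` and the summary statistics passed to it are hypothetical and used only for illustration.

```python
def line_from_summary(r, mean_x, mean_y, sd_x, sd_y):
    """Return (a, b) for y_hat = a + b*x from summary statistics."""
    b = r * sd_y / sd_x        # slope: correlation times ratio of standard deviations
    a = mean_y - b * mean_x    # intercept: forces the line through (mean_x, mean_y)
    return a, b

# Hypothetical summary statistics, chosen only for illustration.
a, b = line_from_summary(r=0.5, mean_x=10.0, mean_y=20.0, sd_x=2.0, sd_y=4.0)
print(a, b)
```

With these inputs the slope is 1.0 and the intercept is 10.0. Note that the intercept formula guarantees that the regression line passes through the point (x̅, y̅).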

When you make predictions, it is important to know how well your regression line fits the data.

We studied Pearson’s r on the previous page; it describes the direction and strength of a linear relationship. The square of Pearson’s r, written R^2, tells us how much better the regression line predicts the value of the dependent variable than the mean of that variable does. This measure is called R-squared.

  • Always keep in mind that the interpretations of Pearson’s r and R^2 are different. R^2 is never negative (it lies between 0 and 1), and it can be read in the two ways given below.

How to interpret the R-Square?

  • If R^2 = 0.71, the sum of squared prediction errors is 71% smaller when you use the regression line than when you use the mean.
  • Equivalently, R^2 is the proportion of the variance in your dependent variable (Y) that is explained by your independent variable (X).
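Both readings of R^2 can be verified on a small data set. The Python sketch below uses hypothetical data; it fits the least squares line with the equivalent slope formula b = Sxy / Sxx (an assumption of this sketch, algebraically the same as r × sy / sx), then compares the squared error of the line with the squared error of simply predicting the mean.

```python
from statistics import mean

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8]

mx, my = mean(xs), mean(ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

b = sxy / sxx      # least squares slope
a = my - b * mx    # least squares intercept

sse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # error using the line
sse_mean = syy                                                  # error using the mean only

r_squared = 1 - sse_line / sse_mean   # proportional reduction in squared error
r = sxy / (sxx * syy) ** 0.5          # Pearson's r
print(round(r_squared, 4), round(r * r, 4))
```

The two printed numbers agree: for simple linear regression, the proportional reduction in squared prediction error equals Pearson’s r squared.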

If you are reading this for the first time, it might be a little confusing. To get a better understanding, read the example below carefully.

Example

Social scientists have shown that a leader’s physical height is related to his or her success. Suppose you want to test if you can replicate this result. To do that, you look at the heights and the average approval ratings of the four most recent presidents of the United States. You employ this data matrix and your goal is to answer 4 related questions:

(1) is there a linear relationship between the two variables?

(2) what is the size of Pearson’s r correlation coefficient?

(3) what do the regression equation and the regression line look like? And

(4) what is the size of r-squared?

Let’s start with the first question:

Is there a linear relationship between the two variables?

To answer that question, you make a scatterplot. To make a scatterplot, you must first decide which is the dependent variable and which is the independent variable. In this case it’s more likely that a leader’s physical height influences his or her approval ratings than that approval ratings affect a leader’s height. After all, we don’t expect a leader to become taller once his or her approval ratings get better. So, the independent variable height goes on the x-axis, and the dependent variable approval rating on the y-axis.

Based on the minimum and maximum values of our variables we scale our axes. The independent variable height ranges from 182 centimeters to 188 centimeters, so you use a scale from 180 to 190 centimeters. The dependent variable ranges from 47 through 60.9, so you scale this axis from 45 through 65. Next, we decide, based on our data matrix, where we should position the 4 presidents. Obama is 185 centimeters tall and has an average approval rating of 47. Bush junior has a physical height of 182 centimeters and an average approval rating of 49.9. Clinton and Bush senior are both 188 centimeters tall, and their average approval ratings are 55.1 and 60.9 respectively.

Now you can answer the first question: yes, there seems to be a linear relationship between a leader’s height and his approval rating. The line describing this relationship goes up, which means that the correlation between the two variables is positive.

The second question is what the value of Pearson’s r is.

To compute Pearson’s r we use the formula r = Σ(zx × zy) / (n - 1): the sum of the products of each case’s two z-scores, divided by n minus 1. To start with, you need to compute the z-scores of both your independent and your dependent variable. To do that you need the means and standard deviations of these variables. The mean of the independent variable height is 185.75 centimeters and the standard deviation is 2.87 centimeters. The mean approval rating (the dependent variable in our study) is 53.23 and the standard deviation is about 6.12 (6.116 before rounding; the calculations below use 6.11).

First we compute the z-scores for our independent variable by subtracting the mean from every original score and then dividing the outcome by the standard deviation. For the first case that is (185 - 185.75) / 2.87, which makes -0.26132. We do the same for the other scores. We then repeat that for the dependent variable: (47 - 53.23) / 6.11 makes -1.01964. And we do that for the other cases too.

The next step is to multiply the two z-scores of every case with each other. For the first case this results in -0.26132 multiplied by -1.01964, which makes 0.266456, and so on. Next we add up all these products. That makes 2.202649. Finally, we divide by n minus 1. Our n is 4, so n minus 1 is 3. The result, rounded to two decimals, is 0.73. That is our Pearson’s r. This indicates that there is a rather strong and positive linear correlation between a leader’s body height and his average approval rating.
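The z-score procedure described above can be reproduced in a few lines of Python. This is a sketch using the four data points from the example and the standard library’s `statistics` module; the helper name `z_scores` is introduced here only for illustration. Because the code keeps full precision instead of rounding the standard deviations, the intermediate values differ slightly from the hand calculation, but the result still rounds to 0.73.

```python
from statistics import mean, stdev

heights = [185, 182, 188, 188]        # Obama, Bush junior, Clinton, Bush senior (cm)
approval = [47.0, 49.9, 55.1, 60.9]   # average approval ratings

def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(heights), z_scores(approval)
n = len(heights)

# Pearson's r: sum of the products of paired z-scores, divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
print(round(r, 2))
```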

The next step is to find the regression equation.

The computer finds the regression line by looking for the line that minimizes the sum of the squared residuals. You do not have to do this yourself. Luckily this complicated procedure boils down to two rather simple formulas: one to compute the regression coefficient, b = r × (sy / sx), and one to compute the intercept, a = y̅ - b × x̅. Together these formulas give you your regression line. We already have all the necessary ingredients, so now you can use the formulas. The regression slope is r multiplied by 6.11 divided by 2.87. Using the unrounded r (about 0.7342), that makes 1.56. The intercept is 53.23 minus the slope multiplied by 185.75. Carrying the unrounded slope (about 1.563) through the calculation, that makes -237.11. The regression equation is ŷ = -237.11 + 1.56x. The intercept indicates that the predicted y-value is -237.11 when x is 0. This number has no substantive meaning because a physical height of 0 centimeters is impossible. The intercept only serves a mathematical purpose: it makes it possible to draw the line. With the regression equation found, we can predict the value of the dependent variable when the independent variable equals 182 centimeters (the minimum value in our sample). That’s -237.11 plus 1.56 times 182, which makes 46.81.

You can also do that for the maximum value: that’s -237.11 plus 1.56 multiplied by 188, which makes 56.17. You can now draw the regression line through these two predicted points. This line is the straight line that best represents the linear relationship between X and Y. It is the line for which the sum of the squared residuals is the smallest. We can, of course, predict y-values for every possible x-value. All these predicted y-values, or y-hats, are located on the regression line.
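As a cross-check, the regression line for this example can be computed directly from the four data points. The Python sketch below applies the same two formulas (b = r × sy / sx and a = y̅ - b × x̅) at full precision; the function name `predict` is introduced here only for illustration. Because the hand calculation in the text rounds intermediate values, the full-precision results differ a little: the slope still rounds to 1.56, but the intercept comes out near -236.66 rather than -237.11, and the predictions shift accordingly.

```python
from statistics import mean, stdev

heights = [185, 182, 188, 188]        # cm
approval = [47.0, 49.9, 55.1, 60.9]   # average approval ratings

mx, my = mean(heights), mean(approval)
sx, sy = stdev(heights), stdev(approval)
n = len(heights)

# Pearson's r from the deviations, using sample standard deviations.
r = sum((x - mx) * (y - my) for x, y in zip(heights, approval)) / ((n - 1) * sx * sy)

b = r * sy / sx    # slope
a = my - b * mx    # intercept

def predict(x):
    """Predicted approval rating (y_hat) for a height of x centimeters."""
    return a + b * x

print(round(b, 2), round(a, 2))
print(round(predict(182), 2), round(predict(188), 2))
```

The differences from the hand-computed values are a reminder to round only at the end of a calculation, especially when the intercept lies far outside the range of the data.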

The fourth question is what the value of r-squared is.

That’s easy. It’s Pearson’s r squared. So: 0.73 multiplied by 0.73 equals 0.53. But how should you interpret this number? Well, you can say that the sum of squared prediction errors is 53 per cent smaller when you use the regression line than when you use the mean of the dependent variable. You can also say that 53 per cent of the variation in the dependent variable is explained by your independent variable. So it may seem that we have learnt that tall leaders are more successful than short leaders.

However, for at least two reasons, we need to be very careful when we interpret the results. The first reason is that on the basis of a regression analysis, we can never prove that there is a causal relationship between two variables. We can, in other words, never be certain that one variable is the cause of another variable. This translates to one single and not very complicated, but extremely important message: correlation is not causation. The second reason is that this conclusion is based on a sample of only 4 American presidents who don’t differ much from each other when it comes to their physical height. It is up to you to decide if this warrants far-reaching inferences about the relationship between height and approval ratings.

Summary

Regression analysis is commonly used in fields such as economics, finance, psychology, and epidemiology, among others, to analyze data and make predictions about future outcomes. Simple linear regression helps researchers estimate parameters of linear equations connecting variables, making it easier to predict dependent variable values in different conditions. It also emphasizes the importance of isolating the effect of one independent variable on the dependent variable for decision-making and policy design. Regression is a powerful statistical tool used in various fields, and while the concept behind it is simple, the mathematics can be complex, and researchers often use computers to find regression equations.

FAQs

What are the basics of regression? ›

Regression is a statistical technique used to analyze the relationship between two or more variables. The basic idea behind regression is to predict the value of a dependent variable based on the values of one or more independent variables.

What is the basis of regression analysis? ›

Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between variables and for modeling the future relationship between them.

What are the 7 steps in regression analysis? ›

The seven steps to run linear regression analysis are
  • Install and load necessary packages.
  • Load your data.
  • Explore and Understand the data.
  • Create the model.
  • Get a model summary.
  • Make predictions.
  • Plot and visualize your model.

What are the basic concepts of regression? ›

Regression is a statistical technique that relates a dependent variable to one or more independent variables. A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the independent variables.

What are the fundamentals of regression analysis? ›

"Regression" is a general term for statistical techniques that try to fit a model to a given set of variables to predict the effect that changes in independent variables have on a dependent variable using linear assumptions.

What is the basic principle of regression? ›

Regression models predict a value of the Y variable given known values of the X variables. Prediction within the range of values in the dataset used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation.

What are the three main purposes of regression? ›

Prediction, association discovery, and model validation are the three main uses for regression analysis. Predicting the value of a dependent variable given the values of one or more independent variables is the main goal of regression analysis.

What are the types of regression analysis? ›

Regression analysis is essential for predicting and understanding relationships between dependent and independent variables. There are various regression models, including linear regression, logistic regression, polynomial regression, ridge regression, and lasso regression, each suited for different data scenarios.

What is the p-value in regression? ›

The p-value in a regression model measures the strength of evidence against the null hypothesis, indicating whether the observed data could occur by chance. A low p-value (<0.05) suggests that the coefficient is statistically significant, implying a meaningful association between the variable and the response.

How to understand regression analysis? ›

Regression analysis is all about determining how changes in the independent variables are associated with changes in the dependent variable. Coefficients tell you about these changes and p-values tell you if these coefficients are significantly different from zero.

What are dummies in regression? ›

In regression analysis, a dummy variable is a regressor that can take only two values: either 1 or 0. Dummy variables are typically used to encode categorical features.

How do you explain regression in simple terms? ›

Regression allows researchers to predict or explain the variation in one variable based on another variable. The variable that researchers are trying to explain or predict is called the response variable. It is also sometimes called the dependent variable because it depends on another variable.

What is the basic formula of regression? ›

To work out the regression line, the following values need to be calculated: a = y̅ - b × x̅ and b = Sxy / Sxx. The easiest way of calculating them is by using a table. Start off by working out the mean of the independent and dependent variables.

How do I learn regression analysis? ›

How to Learn Regression Analysis: A Step-by-Step Guide
  1. Study core concepts underpinning regression. The fundamentals of statistical theory and correlation are essential. ...
  2. Acquire a data set and develop a research question. ...
  3. Build a regression model. ...
  4. Test the strength of your model. ...
  5. Add nuance to your regression models.

How to use Excel for regression analysis? ›

To run the regression, arrange your data in columns. Click on the “Data” menu, and then choose the “Data Analysis” tab. You will now see a window listing the various statistical tests that Excel can perform. Scroll down to find the regression option and click “OK”.

Why is it called regression? ›

The term goes back to Francis Galton’s study of the heights of parents and their children. If parents were very tall, the children tended to be tall but shorter than their parents. If parents were very short, the children tended to be short but taller than their parents. He called this discovery "regression to the mean," with the word "regression" meaning to come back to.

What is the basic of regression testing? ›

Regression testing is performed to find out whether the updates or changes had caused new defects in the existing functions. This step would ensure the unification of the software. In a typical software development pipeline, retesting is performed before regression testing practices.

What is the regression model based on? ›

A regression model provides a function that describes the relationship between one or more independent variables and a response, dependent, or target variable. For example, the relationship between height and weight may be described by a linear regression model.

What is basis in linear regression? ›

Basis functions allow modeling non linearity in the data while keeping linearity in parameters, which greatly simplifies the analysis of these models. Using linear combination of different basis function, we can construct complex functions and still use linear regression.

What is the basic assumption of regression? ›

Regression Assumptions

The chosen sample is representative of the population. There is a linear relationship between the independent variable(s) and the dependent variable. The residuals are approximately normally distributed; to check, plot a histogram of the residuals.

Article information

Author: Fredrick Kertzmann