Regression Analysis: An Overview (2024)

Interpreting a Regression Analysis

What is regression analysis?

Regression analysis is a statistical technique for studying linear relationships. [1] It begins by supposing a general form for the relationship, known as the regression model:

Y = α + β₁X₁ + ... + β_kX_k + ε .

Example: In the motorpool case, the manager of the motorpool considers the model

Cost = α + β₁Mileage + β₂Age + β₃Make + ε .

Y is the dependent variable, representing a quantity that varies from individual to individual throughout the population, and is the primary focus of interest. X₁,...,X_k are the explanatory variables (the so-called “independent variables”), which also vary from one individual to the next, and are thought to be related to Y. Finally, ε is the residual term, which represents the composite effect of all other types of individual differences not explicitly identified in the model. [2]

Besides the model, the other input into a regression analysis is some relevant sample data, consisting of the observed values of the dependent and explanatory variables for a sample of members of the population.

The primary result of a regression analysis is a set of estimates of the regression coefficients α, β₁,..., β_k. These estimates are made by finding values for the coefficients that make the average residual 0 and the standard deviation of the residual term as small as possible. The result is summarized in the prediction equation:

Y_pred = a + b₁X₁ + ... + b_kX_k .

Example: Fitting the model above to the motorpool data, we obtain:

Cost_pred = 107.34 + 29.65 Mileage + 73.96 Age + 47.43 Make .
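The fitting step can be sketched in a few lines of Python. This is only an illustration, assuming NumPy is available: the data are synthetic stand-ins generated for the example (the motorpool sample is not reproduced here), and the names `fit_least_squares`, `X`, and `y` are hypothetical. It shows that the least-squares estimates leave residuals that average to zero.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (a, b): the intercept and slope estimates that minimize the sum of squared residuals."""
    # Prepend a column of ones so the intercept is estimated along with the slopes.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

# Synthetic data (NOT the motorpool sample): 15 cars, columns = mileage, age, make.
rng = np.random.default_rng(0)
idx = np.arange(15)
X = np.column_stack([
    rng.uniform(5, 25, 15),  # mileage, in thousands of miles
    idx % 5,                 # age, in years
    idx % 2,                 # make, coded 0 or 1
])
y = 100 + 30 * X[:, 0] + 70 * X[:, 1] + 50 * X[:, 2] + rng.normal(0, 80, 15)

a, b = fit_least_squares(X, y)
residuals = y - (a + X @ b)
print("intercept:", a)
print("slopes:", b)
print("mean residual:", residuals.mean())  # zero up to rounding error
```

Because the model includes an intercept, minimizing the sum of squared residuals forces their average to zero, so the same fit also minimizes their standard deviation.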

Why do a regression analysis?

Typically, a regression analysis is done for one of two purposes: to predict the value of the dependent variable for individuals for whom some information concerning the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable.

Making individual predictions

If we know the value of several explanatory variables for an individual, but do not know the value of that individual’s dependent variable, we can use the prediction equation (based on a model using the known variables as its explanatory variables) to estimate the value of the dependent variable for that individual.

In order to see how much our prediction can be trusted, we use the standard error of the prediction [3] to construct confidence intervals for the prediction. (Examine a workbook that provides a detailed discussion of the standard error of the prediction.)

Example: In order to predict the next twelve months’ maintenance and repair expenses for a specific one-year-old Ford currently in the motorpool, we’d first perform a regression analysis using age and make as the explanatory variables:

Cost_pred = 705.66 + 8.53 Age - 54.27 Make .

Our prediction will then be $714.19, and the margin of error (at the 95%-confidence level) for the prediction is 2.1788 × 124.0141 = $270.20 .
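The arithmetic behind this example can be checked with a short Python sketch. All numbers are the ones quoted above; the coding Make = 0 for a Ford is an assumption, inferred from the fact that 705.66 + 8.53 = 714.19, and the variable names are hypothetical.

```python
# Coefficients and standard error quoted in the example above.
intercept, b_age, b_make = 705.66, 8.53, -54.27
t_multiplier = 2.1788     # 95% two-sided t-value quoted in the text
se_prediction = 124.0141  # standard error of the prediction quoted in the text

# Assumption: Make = 0 encodes Ford, as the $714.19 figure implies.
age, make = 1, 0

prediction = intercept + b_age * age + b_make * make
margin = t_multiplier * se_prediction

print(f"prediction: ${prediction:.2f}")
print(f"margin of error: ${margin:.2f}")
print(f"95% interval: ${prediction - margin:.2f} to ${prediction + margin:.2f}")
```

The interval is simply the prediction plus or minus the margin of error.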

Estimating the effect of an explanatory variable on the dependent variable

In order to estimate the “pure” effect of some explanatory variable on the dependent variable, we want to control for as many other effects as possible. That is, we’d like to see how our prediction would change for an individual if this explanatory variable were different, while all other aspects of the individual were kept the same. In order to do this, we should always use the most complete model available, i.e., we should include all other relevant factors as additional explanatory variables. (Dive down for further discussion.)

Our estimate of the impact of a unit difference in the targeted explanatory variable is its coefficient in the prediction equation. The extent to which our estimate can be trusted is measured by the standard error of the coefficient.

Example: Using the full regression model, we estimate that the mean marginal maintenance and repair cost associated with driving one of the cars in the motorpool an additional 1000 miles is $29.65, with a margin of error in the estimate of 2.2010 × 3.915 = $8.62 . To better understand why we use the most complete model available, note that any “one of the cars” has a particular age and make, and we want to hold those constant while considering the incremental effect of another 1000 miles of driving.
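As a rough sketch of where a coefficient's standard error comes from, the following Python code computes the textbook ordinary-least-squares standard errors on synthetic data. NumPy is assumed, the data are invented for illustration (not the motorpool sample), and `coefficient_standard_errors` is a hypothetical helper name. The multiplier 2.2010 is the one quoted in the text; it is the 95% t-value for 11 degrees of freedom, which is what 15 observations and 4 estimated coefficients give, so the synthetic sample is built with 15 rows.

```python
import numpy as np

def coefficient_standard_errors(X, y):
    """Return (coefficients, standard errors), intercept first, for a least-squares fit."""
    design = np.column_stack([np.ones(len(y)), X])
    n, p = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # Residual variance uses n - p degrees of freedom (p counts the intercept).
    s2 = residuals @ residuals / (n - p)
    cov = s2 * np.linalg.inv(design.T @ design)
    return coef, np.sqrt(np.diag(cov))

# Synthetic data (NOT the motorpool sample): 15 cars, columns = mileage, age, make.
rng = np.random.default_rng(1)
idx = np.arange(15)
X = np.column_stack([rng.uniform(5, 25, 15), idx % 5, idx % 2])
y = 100 + 30 * X[:, 0] + 70 * X[:, 1] + 50 * X[:, 2] + rng.normal(0, 80, 15)

coef, se = coefficient_standard_errors(X, y)
t_multiplier = 2.2010  # quoted in the text; 95% level, 11 degrees of freedom
margin_mileage = t_multiplier * se[1]
print(f"synthetic mileage coefficient: {coef[1]:.2f} +/- {margin_mileage:.2f}")

# The article's own figure uses the same recipe with its reported standard error.
print(f"article's margin of error: {2.2010 * 3.915:.2f}")
```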

Determining whether there is evidence that an explanatory variable belongs in a regression model

Given a specific model, one might wonder whether a particular one of the explanatory variables really “belongs” in the model; equivalently, one might ask if this variable has a true regression coefficient different from 0 (and therefore would affect predictions).

We take the standard approach of classical hypothesis testing: In order to see if there is evidence supporting the inclusion of the variable in the model, we start by hypothesizing that it does not belong, i.e., that its true regression coefficient is 0.

Dividing the estimated coefficient by the standard error of the coefficient yields the t-ratio of the variable, which simply shows how many standard-deviations-worth of sampling error would have to have occurred in order to yield an estimated coefficient so different from the hypothesized true value of 0. We then ask how likely it is to have experienced so much sampling error: This yields the significance level of the sample data with respect to the null hypothesis that 0 is the true value of the coefficient. The closer this significance level is to 0%, the stronger the evidence against the null hypothesis, and therefore the stronger the evidence that the true coefficient is indeed different from 0, i.e., that the variable does belong in the model.

Example: In the full model, the significance level of the t-ratio of mileage is 0.0011%. We have overwhelmingly strong evidence that mileage has a true non-zero effect in the model. On the other hand, the significance level of the t-ratio of make is only 12.998%. We have here only a little bit of evidence that the true difference between Fords and Hondas is nonzero. (If we really wish to make a case against Hondas, we’ll require that the estimated difference persist as the sample size is increased, i.e., as more evidence is collected.)
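The mileage figure can be reproduced approximately with the Python standard library alone. The sketch below takes the coefficient (29.65) and standard error (3.915) quoted earlier, forms the t-ratio, and integrates the Student t density numerically to get the two-sided significance level. The 11 degrees of freedom are an assumption (15 cars less 4 estimated coefficients, consistent with the 2.2010 multiplier quoted earlier), and the function names are hypothetical. In practice a statistics library routine such as `scipy.stats.t.sf` would replace the hand-rolled integration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_significance(t_ratio, df, steps=20000):
    """P(|T| >= |t_ratio|), via Simpson's rule on the central area (steps must be even)."""
    upper = abs(t_ratio)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

coefficient, std_error = 29.65, 3.915  # mileage figures quoted in the text
df = 11                                # assumption: 15 cars minus 4 estimated coefficients

t_ratio = coefficient / std_error
p = two_sided_significance(t_ratio, df)
print(f"t-ratio: {t_ratio:.2f}")
print(f"significance level: {100 * p:.4f}%")
```

Because the quoted standard error is rounded, the result should be read as agreeing with the reported 0.0011% only to the precision shown.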

Measuring the explanatory power of a model

Why does the dependent variable take different values for different members of the population? There are two possible answers: “Because the explanatory variables vary,” and “Because things still sitting in the residual term vary.” The total variation seen in the dependent variable can be broken down into these two components, and the coefficient of determination [4] is the fraction of the total variation that is explained by the model, i.e., the fraction explained by variation in the explanatory variables. Subtracting the coefficient of determination from 100% indicates the fraction of variation in the dependent variable that the model fails to explain.

Example: Looking at mileage alone, it can explain 56% of the observed car-to-car variation in annual maintenance costs. Looking at age alone, it can’t explain much of anything. But variations in mileage and age together can explain over 78% of the variation in costs. The reason they can explain more together than the sum of what they can explain separately is that mileage masks the effect of age in our data. When both are included in the regression model, the effect of mileage is separated from the effect of age, and the latter effect then can be seen.
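The decomposition of variation described above translates directly into code. The following Python sketch computes the coefficient of determination, along with the commonly used "adjusted" version, from observed and predicted values. The numbers are hypothetical toy values, not the motorpool results, and the function names are illustrative only.

```python
def r_squared(y, y_pred):
    """Fraction of the total variation in y that the predictions account for."""
    mean_y = sum(y) / len(y)
    total_variation = sum((v - mean_y) ** 2 for v in y)
    unexplained = sum((v - p) ** 2 for v, p in zip(y, y_pred))
    return 1 - unexplained / total_variation

def adjusted_r_squared(r2, n, k):
    """Adjust R-square for sample size n and number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed values and predictions from a one-variable model.
y = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

r2 = r_squared(y, y_pred)
adj = adjusted_r_squared(r2, n=len(y), k=1)
print(f"R-square: {r2:.3f}, unexplained: {1 - r2:.3f}, adjusted: {adj:.3f}")
```

The adjustment penalizes each additional explanatory variable, which is why it can fall slightly below zero for a model with no real explanatory power.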

A natural follow-up is to ask what the relative importance of variation in the explanatory variables is in explaining observed variation in the dependent variable. The beta-weights [5] of the explanatory variables can be compared to answer this question. (Dive down for a discussion of the distinction between t-ratios and beta-weights.)

Example: In the full model, the beta-weight of mileage is roughly twice that of age, which in turn is more than twice that of make. If asked, “Why does the annual maintenance cost vary from car to car?” one would answer, “Primarily because the cars vary in how far they’re driven. Of secondary explanatory importance is that they vary in age. Trailing both is the fact that some are Fords and others Hondas, i.e., that make varies across the fleet.”
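The text does not give a formula for beta-weights. The sketch below assumes the usual definition, the standardized coefficient: the estimated coefficient rescaled by the ratio of the sample standard deviation of the explanatory variable to that of the dependent variable. The toy data and the name `beta_weight` are hypothetical.

```python
from statistics import stdev

def beta_weight(coefficient, x_values, y_values):
    """Standardized coefficient: the estimate rescaled by sd(X) / sd(Y)."""
    return coefficient * stdev(x_values) / stdev(y_values)

# Hypothetical toy data; 0.8 is the least-squares slope of y on x for these values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
slope = 0.8

beta = beta_weight(slope, x, y)
print(f"beta-weight: {beta:.4f}")
```

In a model with several explanatory variables, the same rescaling would be applied to each coefficient in turn, and the absolute values compared.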

Summary

The six “steps” to interpreting the result of a regression analysis are:

1. Look at the prediction equation to see an estimate of the relationship.
2. Refer to the standard error of the prediction (in the appropriate model) when making predictions for individuals, and the standard error of the estimated mean when estimating the average value of the dependent variable across a large pool of similar individuals.
3. Refer to the standard errors of the coefficients (in the most complete model) to see how much you can trust the estimates of the effects of the explanatory variables.
4. Look at the significance levels of the t-ratios to see how strong the evidence is in support of including each of the explanatory variables in the model.
5. Use the “adjectived” coefficient of determination to measure the potential explanatory power of the model.
6. Compare the beta-weights of the explanatory variables in order to rank them in order of explanatory importance.

[1] Why is it valuable to be able to unravel linear relationships? Some interesting relationships are linear, essentially all managerial relationships are at least locally linear, and several modeling tricks help to transform the most commonly encountered nonlinear relationships into linear relationships.

[2] The dependent and explanatory variables, as well as the residual term, can be thought of as random variables resulting from the random selection of a single member of the population, i.e., as quantities that vary from one individual to the next.

[3] The standard error of the prediction takes into account both our exposure to error in using a value of 0 for the individual’s residual when making the prediction (measured by the standard error of the regression), and our exposure to sampling error in estimating the regression coefficients (measured by the standard error of the estimated mean).
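Under the standard regression assumptions, the two sources of error named in note [3] combine in quadrature: the squared standard error of the prediction is the sum of the squared standard error of the regression and the squared standard error of the estimated mean. A minimal Python sketch, using hypothetical values rather than figures from the motorpool analysis:

```python
import math

# Hypothetical values, for illustration only.
se_regression = 120.0      # typical size of an individual's residual
se_estimated_mean = 30.0   # sampling error in the estimated mean at this point

# The two independent error sources add as squares.
se_prediction = math.hypot(se_regression, se_estimated_mean)
print(f"standard error of the prediction: {se_prediction:.2f}")
```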

[4] The coefficient of determination is sometimes called the “R-square” of the model. Some computer packages will offer two coefficients of determination, one with an adjective – “adjusted”, “corrected”, or “unbiased” – in front. Given the choice, use the one with the adjective. If it is somewhat less than zero, read it as 0%.

[5] The beta-weight of an explanatory variable has the same sign as the estimated coefficient of that variable. It is the magnitude, i.e., absolute value, of the beta-weight that is of relevance.