Introduction to Applied Regression Analysis

Learning Outcomes

After successfully completing Module 8, Regression Analysis, students will be able to

  1. Explain Regression Analysis and Its Applications
  2. Perform Regression Analysis Using Statistical Software, Including MS Excel
  3. Explain Regression Analysis Outputs in the Context of the Problem, Including
    1. ANOVA table
    2. Model Summary Statistics
    3. Functional Relationship
    4. Regression Coefficients in the Context of the Problem
  4. Perform Regression Diagnostic Analysis Using Statistical Software, Including MS Excel
  5. Explain the Regression Diagnostic Analysis Results, Including
    1. Verifying the Linearity Assumption
    2. Both Visual and Statistical Checks for Unusual Observations: Outlier, Leverage, and Influential Points
    3. Check for the Normality of Residuals
    4. Check for Variance Homogeneity (Constancy)
    5. Check for Correlated Residuals
  6. Perform Lack-of-Fit Test

1. What is Regression Analysis?

Regression analysis is a process of estimating the functional relationship between a dependent variable (response variable, or y-variable) and one or more independent variables (factors, predictors, or x-variables). Regression analysis is primarily used for probabilistic systems, rather than deterministic systems where the relationship is already known. An example of a probabilistic system would be “the prediction for the stock market tomorrow.” Some applications of regression analysis include:

1. developing functional relationships between variables

2. performing trend analysis

3. predicting the dependent variable from the independent variable(s)

4. optimizing processes

5. fitting response surfaces

6. analyzing data from unplanned experiments (arguably the most frequent application)

The simplest form of regression analysis is simple linear regression, which involves only one linear predictor (independent variable, or factor) and one dependent (response) variable. Imagine the relationship between the fuel cost and the distance travelled by a vehicle (Figure 1).

Figure 1. Simple Linear Regression with One Dependent and One Independent Variable

The functional relationship can be written as in Equation 1.

$\text{Fuel Cost} = \text{Slope} \times \text{Distance}$ (Equation 1)

It is obvious that more driving will cost more. However, the driving conditions, speeds, road conditions, weather, etc. will cause some variations in the data (Figure 1). Therefore, the functional relationship including the error can be written as in Equation 2.

$\text{Fuel Cost} = \text{Slope} \times \text{Distance} + \text{Error}$ (Equation 2)

Figure 2. Simple Linear Regression with One Dependent and One Independent Variable with the Intercept

Therefore, the complete functional relationship can be written as in Equation 3.

$\text{Fuel Cost} = \text{Intercept} + \text{Slope} \times \text{Distance} + \text{Error}$ (Equation 3)

The relationship using the typical symbols can be written as in Equation 4.

$y = \beta_0 + \beta_1 x_1 + \varepsilon$ (Equation 4)

Where y represents the cost of fuel and x1 represents the distance travelled. β0 represents the intercept, which is the fuel cost even without any distance travelled. β1 represents the fuel cost per unit distance travelled; therefore, the unit of β1 is fuel cost per mile (y-units per x-unit), and β1 is commonly known as the slope of the simple linear regression.
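For readers who want to reproduce this outside Excel or Minitab, here is a minimal Python sketch of fitting Equation 4 by ordinary least squares. The distance and fuel-cost numbers below are made up for illustration, not the module's Table 1 data.

```python
# Minimal sketch: fit y = b0 + b1*x by ordinary least squares.
# NOTE: the data below are hypothetical, NOT the module's Table 1 values.
from scipy import stats

distance = [120, 250, 310, 450, 520, 610, 700]            # miles (hypothetical)
fuel_cost = [12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8]    # USD (hypothetical)

fit = stats.linregress(distance, fuel_cost)
print(f"intercept b0 = {fit.intercept:.3f} USD")
print(f"slope     b1 = {fit.slope:.5f} USD per mile")
```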

Figure 3. Simple Linear Regression Error (Residual)

The error ε is the deviation of the observed data from the predicted values. Since a linear relationship is expected in simple linear regression, the data points would ideally follow a straight line if there were no variability in the data. In reality, however, every process has variability, and the data points can fall away from the regression line. The amount of deviation is shown with the red arrow-marked line in Figure 3. In regression analysis, the error term is also known as the residual.

1.1. Assumptions for errors (or residuals)

Two important assumptions for the error term (residuals) are:

  1. Error term is normally distributed with zero mean and constant (homogeneous) variance
  2. Error term is uncorrelated with the observation order

2. What are the Steps in Regression Analysis?

At the learning stage, the following steps could be suggested for easier understanding of the regression analysis process.

2.1. Step # 0 in Regression Analysis: Famous and Most Useful Scatter Plot

Before performing any statistical analysis, simple scatter plot(s) between the dependent and the independent variable(s) can be examined to check whether there is any major issue with the data, especially the linearity of the data and any extreme unusual observations. A detailed discussion on data quality can be found in the regression analysis diagnostics section.

2.2. Step # 1 in Regression Analysis: Statistical Significance

The first step of the regression analysis is to check whether there is a statistically significant relationship between the dependent and the independent variables. If there is none, no further analysis is performed and the study (or the analysis) stops at step #1.

2.3. Step # 2 in Regression Analysis: Practical Significance

The second step of the regression analysis is to check whether the statistically significant results have any practical significance. Often there is statistical significance, yet the relationship may not be strong enough to predict the dependent variable well. If the results have no practical significance, no further analysis is performed and the study (or the analysis) stops at step #2.

2.4. Step # 3 in Regression Analysis: Explanation of Results in the Context of the Problem.

When both step #1 and step #2 are significant, in step #3 the analysis results are explained in the context of the problem, particularly the regression relationship, the slope parameter, and the intercept.

2.5. Step # 4 in Regression Analysis: Regression Analysis Diagnosis.

Finally, in step #4, the diagnostic analysis is performed to check whether there is any problem in the data, such as outliers or influential points that may skew the results. Ideally, this step could be performed first. However, the time and resources it takes do not justify performing it first if there is no statistical significance between the dependent and the independent variables. Nevertheless, using any statistical software, including MS Excel, this step can be performed within a couple of mouse clicks. The outliers and influential points could be removed, if justified, before performing any other steps in the regression analysis. If this step is performed last, the analysis must be rerun whenever outliers or influential points are removed, and steps #1, #2, and #3 must be performed again after the diagnostic step. Although it may sound as if the diagnostics should be performed first, many diagnostic analyses are impossible to perform without first fitting the regression, whether manually using formulas or using software. Therefore, the regression analysis is typically performed a couple of times to produce the best analysis results, including the test statistics and the fitted regression.

3. Regression Analysis Example

Any software, including MS Excel, can be used to perform simple linear regression. Video 1 shows the simple linear regression analysis using MS Excel and Minitab.

Data were collected for the fuel cost and the distance travelled by a vehicle and are provided in Table 1.

Table 1. Fuel Cost vs Distance Travelled Data

The simple linear regression analysis is performed using Minitab software, version 19. The output is provided in Figure 4.

Figure 4. Simple Linear Regression Analysis Output for the Fuel Cost vs Distance

4. Explain the Results

The sequence of the output elements, including the regression equation, coefficients, analysis of variance, model summary, and diagnostics, is not standardized across software packages; even different versions of the same software present the output in different orders. Nevertheless, the explanation of the results should follow the regression analysis steps suggested earlier for the learning stage. Once the basics are learned, readers can use their own discretion and devise their own steps!

4.1. Step #1 of Regression Analysis: Statistical Significance Test

The analysis of variance table (the fourth output table in Figure 4) provides the information on the statistical significance of the relationship between the fuel cost and the distance.

This statistical significance test follows the four steps discussed in Modules 1, 2, and 3 and outlined below.

4.1.1. Step #1. Hypothesis

[The hypothesis is the research question. For the regression analysis in this example: is there a statistical relationship between the fuel cost and the distance?]

Null Hypothesis: β1 = 0

[zero slope = no functional relationship, meaning that fuel cost does not change with the distance]

Alternative Hypothesis: β1 ≠ 0

[There is a functional relationship between the fuel cost and the distance]

4.1.2. Step #2. Appropriate Statistical Method

[Simple linear regression analysis is the appropriate method for this situation]

4.1.3. Step #3. Statistical Results Explanation

The null hypothesis is rejected because the p-value (= 0.000) is less than the level of significance (alpha = 0.05).

[The p-value is the probability of observing results at least as extreme as these if the null hypothesis were true.]

4.1.4. Step #4. Explanation in the Context of the Problem

There is a statistically significant functional relationship between the fuel cost and the distance travelled.
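As a sketch, the same significance test can be reproduced in Python with statsmodels; the data are the hypothetical values from the earlier sketch, so the numbers will differ from Figure 4.

```python
# Sketch: step-1 significance test (ANOVA F-test) with statsmodels.
# Hypothetical data; results will differ from the module's Figure 4.
import numpy as np
import statsmodels.api as sm

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

model = sm.OLS(y, sm.add_constant(x)).fit()
# In simple linear regression the ANOVA F-test on the model and the
# t-test on the slope are equivalent (F = t^2, identical p-values).
print(f"F = {model.fvalue:.2f}, p-value = {model.f_pvalue:.2e}")
# Reject H0: beta1 = 0 when the p-value is below alpha (e.g., 0.05).
```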

4.2. Step #2 of the Regression Analysis: Practical Significance Test

Once statistical significance is observed in step #1, the practical significance is checked using the third output table in Figure 4, the Model Summary table.

4.2.1. R-square (Coefficient of Determination)

R-square, or the coefficient of determination, is defined as the percent of variation in the dependent variable explained by the independent variable(s) (Equation 5). In this situation, 96.36 percent of the variation in the fuel cost can be explained by the distance travelled.

$R^2 = \dfrac{SSR}{SSTO}$ (Equation 5)

Where SSR = Sum of Squares of the Regression model = the variation explained by all model terms, and SSTO = Total Sum of Squares = the total variation, including the experimental error (called the error, or residual, in regression analysis).

Figure 5. Understanding the R-Square

Figure 5 shows a visual representation of the r-square values for data sets with different errors (residuals). The top-left graph shows that most observed data points fall close to the predicted regression line, while in the bottom-right graph most data points are far from the predicted regression line. Higher error (residuals) means a lower r-square value (Equation 5). While the relationships between the dependent and the independent variables are significant for all four data sets in Figure 5, the relationship is much weaker in the bottom-right graph than in the top-left graph. Therefore, the r-square is a measure of the strength of the relationship: a higher r-square value indicates a stronger relationship between the dependent and the independent variables. Even when the functional relationship between the dependent and the independent variables is significant, a very weak relationship could be practically meaningless.

Because the r-square is proportional to the variation explained by the regression model terms (SSR) (Equation 5), adding more model terms may inflate the r-square value. Therefore, the r-square could be misleading as a measure of the strength of the relationship between the dependent and independent variables. To account for this unwanted inflation, an adjustment to the r-square formula is made as in Equation 6. The relatively unbiased adjusted r-square can therefore be used to assess the strength of the relationship.

$R^2_{adj} = 1 - \left(\dfrac{n-1}{n-p}\right)\dfrac{SSE}{SSTO}$ (Equation 6)

Where n is the number of observations, p is the number of model parameters including the intercept, and SSE = SSTO - SSR is the error (residual) sum of squares.

Nevertheless, an appropriately built regression model will produce very close values for the r-square and the adjusted r-square. A large difference between them suggests that insignificant terms may have been included in the final regression model, and the model should be investigated further to find the issue. To explain the model strength, i.e., the practical significance, either the r-square or the adjusted r-square is acceptable if the model is well built.
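Equations 5 and 6 are easy to verify directly from the sums of squares. A minimal Python sketch, reusing the hypothetical data from the earlier sketches:

```python
# Sketch: R-square (Eq. 5) and adjusted R-square (Eq. 6) from sums of squares.
import numpy as np

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])
n, p = len(y), 2                      # p = slope + intercept

b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
sse = np.sum((y - (b0 + b1 * x))**2)  # error (residual) sum of squares
ssto = np.sum((y - y.mean())**2)      # total sum of squares
ssr = ssto - sse                      # regression sum of squares

r2 = ssr / ssto
r2_adj = 1 - ((n - 1) / (n - p)) * (sse / ssto)
print(f"R-square = {r2:.4f}, adjusted R-square = {r2_adj:.4f}")
```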

4.2.2. What is a satisfactory r-square value?

A satisfactory r-square value depends on the field of study. In the fuel cost versus distance example, the r-square is observed to be 96.36%, which is considered excellent; the functional relationship between the fuel cost and the distance is therefore considered very strong. While a lower r-square value would not be acceptable for this fuel cost study, a human behavior study might be satisfied with an r-square of 50%. In fields such as marketing or advertising, an r-square of 10 to 20% may be acceptable.

4.3. Step #3 of Regression Analysis: Explanation of Coefficients (Functional Relationship between X and Y)

Once the regression model is observed to be statistically and practically significant, the third step is to explain the functional relationship between the dependent and the independent variables using the first and second tables in Figure 4.

4.3.1. Explanation for Slope

Fuel cost increases by 0.06017 USD per mile travelled.

4.3.2. Explanation for Intercept

The intercept, 5.531, is the fuel cost associated with a zero x-value, meaning that there will be a fuel cost of 5.531 USD even if no distance is travelled.

4.4. Step # 4. Regression Analysis Model Diagnostics

Regression analysis diagnostics involve checking the assumptions made for the analysis. Four primary assumptions of regression analysis are listed below.

  1. The relationship between the dependent and independent variables is approximately linear
  2. The data are free from unusual observations, including outlier, leverage, and influential points
  3. The errors (residuals) are normally distributed with zero mean and constant (homogeneous) variance, and are uncorrelated
  4. The data are free from multicollinearity

Violations of these assumptions can be checked by (1) a visual look at the data and (2) statistical tests. The initial step is a visual look at the data using scatter plots. Any software can be used for the scatter plots; however, the author finds MS Excel the most convenient and useful for a quick look at the data. The final step is to perform statistical tests to confirm any suspected violations of the regression analysis assumptions.

4.4.1. Initial Step of Regression Analysis Diagnostic: Scatter Plots

A simple scatter plot between the dependent and the independent variables provides a visual indication of any problem in the data, such as a relationship other than linear (assumption #1) or potential unusual observations (assumption #2).

4.4.1.1. Linearity Assumption Check

The first assumption of a regression analysis is the linearity of the data. This assumption can be checked simply with a scatter plot in MS Excel. A scatter plot with a trend line shows the linearity of the data (Figure 6); moreover, the scatter chart in MS Excel has an option to display the regression equation with the r-square value. The scatter plots in Figure 6 clearly show that the left data set has a potential linear statistical relationship, while the right data set has a distinctly nonlinear relationship. Therefore, performing a linear regression analysis on the right data set would be inappropriate.

Figure 6. Linearity Assumption Violation Check in Regression Analysis
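For readers working outside MS Excel, a matplotlib sketch of the same linearity check, again with the hypothetical data used in the earlier sketches:

```python
# Sketch: linearity check with a scatter plot and fitted trend line.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

b1, b0 = np.polyfit(x, y, 1)
plt.scatter(x, y, label="data")
plt.plot(x, b0 + b1 * x, color="red", label=f"trend: y = {b0:.2f} + {b1:.4f}x")
plt.xlabel("Distance (miles)")
plt.ylabel("Fuel cost (USD)")
plt.legend()
plt.show()  # a curved pattern around the line would suggest nonlinearity
```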

4.4.1.2. Unusual Observations Check

An observation can be unusual with respect to its y-value or its x-value. However, rather than calling them x- or y-unusual observations, they are categorized as outlier, leverage, and influential points according to their impact on the regression model.

Outlier – an unusual observation with respect to either the x-value or the y-value. An x-outlier makes the scope of the regression too broad, which is usually considered less accurate. An x-outlier is uncommon, though it may seriously affect the regression outcomes. In an unplanned study, however, the data are often collected before much thought is put into the design, and in those situations x-outliers are possible. Y-outliers are very common and usually not as severe as x-outliers. Nevertheless, the effect of a y-outlier must be investigated further to check whether it is a simple data entry error, a severe issue in the process, or just a random phenomenon. Figure 7 shows both an x-outlier (left) and a y-outlier (right). Both plots show that a better linear relationship would be possible without these outliers. Here, the x-outlier rotates the line clockwise, changing both the slope and the intercept of the relationship, while the y-outlier moves the predicted line upward. The solid line shows the predicted relationship without the outliers.

Figure 7. Outlier with respect to x-value (left) and y-value (right)

Leverage – a data point whose x-value (independent variable) is unusual, though its y-value follows the predicted regression line (Figure 8). A leverage point may look okay, as it sits on the predicted regression line. However, a leverage point inflates the strength of the regression relationship, both the statistical significance (reducing the p-value, which increases the chance of a significant relationship) and the practical significance (increasing the r-square). Leverage points have no impact on the coefficients, however, because the point follows the predicted regression line.

Practical significance of a leverage point – consider a relationship between muscle mass and power. Suppose most individuals in the study weigh around 200 pounds and only one person weighs about 400 lbs. This one 400-pound individual will dictate the relationship more than all the other individuals weighing near 200 pounds, so the conclusions of the study could be misleading. Leverage points usually make the scope of the functional regression relationship too broad, and a model with too broad a scope is generally considered less accurate than one with a narrower scope. To improve regression model accuracy, models with a narrower scope are recommended.

Figure 8. Leverage Point (Right) in Regression Analysis

Influential – a data point that unduly influences the regression analysis outputs (Figure 9). A point is considered influential if its exclusion causes major changes in the fitted regression function. Depending on its location, it may affect all statistics, including the p-value, r-square, coefficients, and intercept. Figure 9 shows the impact of an influential point on the regression statistics, including the r-square, slope, and intercept.

Figure 9. Influential Point in Regression Analysis

4.4.2. Statistical Diagnostics of Regression Analysis: Unusual Observations Check

Any statistical software, including MS Excel, will produce the diagnostic statistics. Video 2 provides the diagnostic analysis using Minitab software, along with the explanation of the analysis results.

Video 2. How to Explain and Interpret the Linear Regression Diagnostics Analysis Explained Example in Minitab

4.4.2.1. Statistical Test for y-Outlier Point

Diagnostic analysis for each data point is provided in Table 2. An observation is generally considered an outlier if the absolute value of its residual (RESI) is unusually high. For example, data point #6 has a very high residual compared to any other data point in the data set. The absolute values of the other outlier diagnostic statistics, the scaled or adjusted residuals, including the standardized residual (SRES) and the studentized deleted residual (TRES), are also higher for point #6. Generally, a point with a high absolute value for any of these diagnostic statistics is considered an outlier.

Table 2. Regression Diagnostic Analysis: Detection of Outliers
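The RESI, SRES, and TRES columns of Table 2 can be reproduced with any statistical package. A sketch with statsmodels, using the hypothetical data from the earlier sketches (so the values will not match Table 2):

```python
# Sketch: raw (RESI), standardized (SRES), and studentized deleted (TRES)
# residuals via statsmodels. Hypothetical data, not the module's Table 2.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(res)
for i, (resi, sres, tres) in enumerate(
        zip(res.resid, infl.resid_studentized_internal,
            infl.resid_studentized_external), start=1):
    print(f"point {i}: RESI={resi:7.3f}  SRES={sres:6.2f}  TRES={tres:6.2f}")
```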

4.4.2.2. Statistical Test for x-Outlier Point

An x-outlier is determined from the diagonal elements of the hat matrix, HI. The diagonal elements of the hat matrix have some interesting properties, including:

  1. HI measures the weighted distance of an observation from the x-mean (the mean of the independent variables).
  2. The sum of all diagonal elements of the hat matrix equals p, the total number of parameters including the intercept. In this example, there is one slope parameter and one intercept, so p = 2; therefore, the sum of the HI column in Table 3 equals 2.
  3. Consequently, a large HI value marks an outlier with respect to the x-values.
  4. Generally, any value exceeding twice the mean of the HI values (= 2p/n) is considered an x-outlier.

Point #11 produces 0.73 for the diagonal element of the hat matrix, HI, which is larger than 2p/n (= 0.36). Therefore, point #11 is considered an outlier with respect to the x-value.

Table 3. Regression Diagnostic Analysis: Detection of x-Outlier and Leverage Points
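A sketch of the HI calculation and the 2p/n cutoff with statsmodels, again on the hypothetical data (so no point is necessarily flagged):

```python
# Sketch: hat-matrix diagonal (HI) and the common 2p/n cutoff for x-outliers.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

res = sm.OLS(y, sm.add_constant(x)).fit()
hi = OLSInfluence(res).hat_matrix_diag
n, p = len(y), 2
print(f"sum of HI = {hi.sum():.2f} (equals p = {p})")
for i, h in enumerate(hi, start=1):
    flag = "  <-- possible x-outlier / leverage" if h > 2 * p / n else ""
    print(f"point {i}: HI = {h:.3f}{flag}")
```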

4.4.2.3. Statistical Test for Leverage Point

A leverage point is a point whose x-value is an outlier while its y-value is on the predicted line (the y-value is not an outlier). Such a point therefore goes undetected by the y-outlier detection statistics, including the RESI, SRES, and TRES. For example, the RESI, SRES, and TRES values for point #11 are not considered large at all; rather, they are very consistent with the other points. Therefore, point #11 is not considered an outlier with respect to the y-value. However, its diagonal element of the hat matrix, HI, is very large. Any point whose hat-matrix diagonal value exceeds 2p/n (2*2/11 = 0.36 for this example) is considered a leverage point. Therefore, point #11 is considered an x-outlier, and it has high leverage on the regression analysis.

4.4.2.4. Statistical Test for Influential Point

DFIT and Cook's distance (COOK) are used to statistically determine influential points. If the absolute value of DFIT exceeds 1 for small to medium data sets, or 2√(p/n) for large data sets (Kutner, Nachtsheim, Neter, & Li, 2005), the point is considered influential on the fitted regression. In the small data set example in Table 4, the absolute value of DFIT for point #11 is 3.63, which exceeds 1 (one); therefore, point #11 is considered an influential point. While DFIT measures the influence of the ith case on the fitted value for that case, Cook's distance measures the influence of the ith case on all n fitted values. A very large Cook's distance for a point indicates a potential influence on the fitted regression line. To statistically determine whether a point is influential, a percentile is calculated by referring the Cook's distance to an F-distribution with p and n-p degrees of freedom for the numerator and the denominator, respectively. If the percentile for the Cook's distance is 50% or more, the point has a major influence on the fitted regression line; a value between 10 and 20% indicates very little influence, while 20 to 50% indicates moderate to high influence. For the example in Table 4, enter =1-FDIST(1.637,2,9) in MS Excel to calculate this percentile for point #11. The value is 75.2%, indicating a major influence on the regression.

Table 4. Detection of Influential Point
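A sketch of the influence statistics with statsmodels and scipy; the F-percentile line mirrors the Excel formula above. Hypothetical data, so the values will not match Table 4.

```python
# Sketch: DFIT (DFFITS), Cook's distance, and the Cook's-distance percentile
# from the F(p, n-p) distribution, as described above. Hypothetical data.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(res)
n, p = len(y), 2

dffits = infl.dffits[0]           # DFIT values (statsmodels calls them DFFITS)
cooks = infl.cooks_distance[0]    # Cook's distances
for i, (d, c) in enumerate(zip(dffits, cooks), start=1):
    pct = stats.f.cdf(c, p, n - p) * 100   # same as =1-FDIST(c, p, n-p)
    print(f"point {i}: DFIT={d:6.2f}  COOK={c:.3f}  percentile={pct:5.1f}%")
```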

4.4.3. Statistical Diagnostics of Regression Analysis: Residuals Analysis

In regression analysis, the errors (residuals) are assumed to be normally distributed with zero mean and constant (homogeneous) variance, and to be uncorrelated.

4.4.3.1. Normality Analysis

Any software, including MS Excel, will produce a normal probability plot (pp-plot) to test the normality of the residuals. If most points follow a straight line on the pp-plot, the data are normally distributed. In the example pp-plot below (Figure 10), the residuals are normally distributed.

Figure 10. Normal Probability Plot for Residuals
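A sketch of the same plot with scipy and matplotlib, using the hypothetical data from the earlier sketches:

```python
# Sketch: normal probability plot of the residuals.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

res = sm.OLS(y, sm.add_constant(x)).fit()
stats.probplot(res.resid, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()  # points close to the line suggest normally distributed residuals
```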

4.4.3.2. Constant (Homogeneous) Variance Check

Any software, including MS Excel, produces the fitted value vs residual plot, which can be used to test the homogeneity (constancy) of variance (Figure 11). Any pattern in the residual plot is a violation of the assumptions on the residuals. While the top-left graph in Figure 11 looks perfect, the other three residual plots show some pattern, that is, some predictability. Any predictability (any pattern) in the residuals is considered a violation of the homogeneity (constancy) assumption, and the data must be reinvestigated for remedial actions before drawing any conclusions from the regression analysis.

Figure 11. Residual Analysis for Homogeneity (Constancy) of Variance
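A matplotlib sketch of the fitted value vs residual plot, with the same hypothetical data:

```python
# Sketch: fitted values vs residuals; a patternless band around zero
# supports the constant-variance assumption.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

res = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(res.fittedvalues, res.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()  # funnels or curves here indicate non-constant variance
```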

4.4.3.3. Uncorrelated Residuals (Independence) Assumption Check

Any time data are collected in some sequence, by time, place, process, etc., the observation order vs residual plot should be investigated for any correlation between the order of data collection and the residuals. Fortunately, all statistical software, including MS Excel, produces this plot. In Figure 12, the left plot shows a positive correlation of the residuals with the observation order, while the right plot shows a cyclic correlation between the residuals and the observation order. Uncorrelated residuals would look like the top-left plot in Figure 11. When the residuals are correlated with the observation order, the data must be reinvestigated for remedial actions before drawing any conclusions from the regression analysis.

Figure 12. Residual Analysis for Uncorrelated Variance Violation
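A sketch of the observation order vs residual plot; the Durbin-Watson statistic (not covered in this module, but reported by most packages) is added as a numeric check, with values near 2 suggesting uncorrelated residuals. Hypothetical data as before.

```python
# Sketch: residuals in observation order, plus the Durbin-Watson statistic.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

x = np.array([120, 250, 310, 450, 520, 610, 700], dtype=float)
y = np.array([12.5, 20.1, 24.0, 32.6, 36.9, 42.3, 47.8])

res = sm.OLS(y, sm.add_constant(x)).fit()
plt.plot(range(1, len(y) + 1), res.resid, marker="o")
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()  # trends or cycles here indicate correlated residuals
print(f"Durbin-Watson = {durbin_watson(res.resid):.2f}")  # ~2 is good
```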

5. Lack of Fit Test

Consider the two data sets in Figure 13. The Y1 data set shows a potential linear relationship, while the Y2 data set may be better described by some relationship other than linear. Both data sets show a statistically significant linear relationship (p-value < 0.00001) with strong r-square values (over 85%), suggesting satisfactory regression analyses (Figure 14 and Figure 15). A lack-of-fit test determines whether some other relationship fits the data better than the one included in the analysis. Is any term missing from the regression model that would fit the data better? Without the lack-of-fit test, seemingly satisfactory linear relationships would be accepted for both data sets. However, the lack-of-fit test for the Y2 data set indicates a lack of fit of the regression model, meaning that a statistically better relationship other than the linear one exists between the dependent and the independent variables (Figure 15). The Y1 data set shows no lack of fit, meaning the linear model fits fine.

A lack-of-fit test requires repeated observations at at least a few x-values to estimate the pure error from these repeated observations. If the within variation (pure error) of the repeated observations is large compared to the between variation (the deviation of the group means from the fitted line), no statistical lack of fit exists in the model. If the between variation is large compared to the within variation, something else could be going on in the relationship between the dependent and the independent variables.
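A sketch of the lack-of-fit F-test computed by hand, following the decomposition in Kutner et al. (2005). The data are hypothetical, constructed with two repeated observations at each x level:

```python
# Sketch: manual lack-of-fit F-test. SSE from the linear fit is split into
# pure error (within replicates) and lack-of-fit components.
import numpy as np
from scipy import stats

x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)  # hypothetical
y = np.array([2.1, 2.3, 3.9, 4.2, 6.3, 5.8, 8.1, 7.7, 9.8, 10.2])
n, p = len(y), 2

b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x))**2)            # error sum of squares

levels = np.unique(x)
c = len(levels)                                 # distinct x levels
sspe = sum(np.sum((y[x == lv] - y[x == lv].mean())**2) for lv in levels)
sslf = sse - sspe                               # lack-of-fit sum of squares

f_stat = (sslf / (c - p)) / (sspe / (n - c))
p_val = stats.f.sf(f_stat, c - p, n - c)
print(f"F = {f_stat:.2f}, p-value = {p_val:.3f}")  # small p => lack of fit
```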

Figure 13. Lack-of-Fit Data Comparison

Figure 14. Regression Analysis for Y1 Data Set

Figure 15. Regression Analysis for Y2 Data Set

5.1. How to Conduct Lack of Fit Test

Most software will produce the lack-of-fit test automatically if there are repeated observations in the data. The following video demonstrates the lack-of-fit test using Minitab.

6. Reference

Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models (Vol. 5). New York: McGraw-Hill Irwin.