After successfully completing Module 8, Regression Analysis, students will be able to:
Regression analysis is a process of estimating the functional relationship between a dependent variable (response variable, or y-variable) and one or more independent variables (factors, predictors, or x-variables). Regression analysis is primarily used for probabilistic systems, rather than deterministic systems where the relationship is already known. An example of a probabilistic system would be "the prediction for the stock market tomorrow." Some applications of regression analysis include:
1. develop functional relationships between variables
2. perform trend analysis
3. predict the dependent variable from the independent variable(s)
4. optimize processes
5. fit response surfaces
6. analyze data from unplanned experiments (arguably the most frequent application)
The simplest form of regression analysis is simple linear regression, which involves only one linear predictor (independent variable, or factor) and one dependent (response) variable. Imagine the relationship between the fuel cost and the distance travelled by a vehicle (Figure 1).
Figure 1. Simple Linear Regression with One Dependent and One Independent Variable
The functional relationship can be written as in Equation 1.
Equation 1: Fuel Cost = Fuel Cost per Mile × Distance Travelled
It is obvious that more driving will cost more. However, driving conditions, speeds, road conditions, weather, etc., will cause some variation in the data (Figure 1). Therefore, the functional relationship including the error can be written as in Equation 2.
Equation 2: Fuel Cost = Fuel Cost per Mile × Distance Travelled + Error
Figure 2. Simple Linear Regression with One Dependent and One Independent Variable with the Intercept
As Figure 2 shows, there can be a fuel cost (the intercept) even when no distance is travelled. Therefore, the complete functional relationship can be written as in Equation 3.
Equation 3: Fuel Cost = Intercept + Fuel Cost per Mile × Distance Travelled + Error
The relationship using the typical symbols can be written as in Equation 4.
Equation 4: y = β0 + β1x1 + ε
where y represents the fuel cost and x1 represents the distance travelled. β0 represents the intercept, which is the fuel cost even when no distance is travelled. β1 represents the fuel cost per unit distance travelled; therefore, the unit of β1 is fuel cost per mile (y/x), commonly known as the slope of the simple linear regression. ε represents the error term.
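For readers who prefer a scripted check alongside Excel or Minitab, the following is a minimal Python sketch of fitting Equation 4 by ordinary least squares. The distance and fuel-cost numbers are hypothetical stand-ins, not the Table 1 data, and the numpy package is assumed.

import numpy as np

# Hypothetical data: x = distance travelled (miles), y = fuel cost (USD)
x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

# Ordinary least-squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"Fitted model: y = {b0:.4f} + {b1:.5f} * x")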
Figure 3. Simple Linear Regression Error (Residual)
The error ε is the amount of deviation of the observed data from the predicted values. Since a linear relationship is expected in simple linear regression, the data points would ideally follow a straight line if there were no variability in the data. In reality, however, every process has variability, and the data points can fall away from the regression line. The amount of deviation is shown with the red arrow-marked line in Figure 3. In regression analysis, the error term is also known as the residual.
Two important assumptions for the error term (residuals) are:
1. The residuals are normally distributed with zero mean and constant (homogeneous) variance.
2. The residuals are uncorrelated (independent of each other).
At the learning stage, the following steps are suggested for easier understanding of the regression analysis process.
Before performing any statistical analysis, simple scatter plot(s) between the dependent and the independent variable(s) can be examined to check for any major issue with the data, especially the linearity of the data and any extreme unusual observations. A detailed discussion on data quality can be found in the regression analysis diagnostics section.
The first step of the regression analysis is to check whether there is a statistically significant relationship between the dependent and the independent variables. If there is none, no further analysis is performed and the study (or the analysis) stops at step #1.
The second step of the regression analysis is to check whether the statistically significant results have any practical significance. Often there is statistical significance, but the relationship may not be strong enough to predict the dependent variable well. If the results have no practical significance, no further analysis is performed and the study (or the analysis) stops at step #2.
When both step #1 and step #2 show significance, in step #3 the analysis results are explained in the context of the problem, particularly the regression relationship, the slope parameter, and the intercept.
Finally, in step #4, the diagnostic analysis is performed to check whether there is any problem in the data, such as outliers and influential points that may skew the results. Ideally, this step could be performed first. However, the time and resources it takes do not justify performing it first if there is no statistical significance between the dependent and the independent variables. Nevertheless, using any statistical software, including MS Excel, this step can be performed within a couple of mouse clicks. The outliers and influential points could be removed, if justified, before performing any other steps of the regression analysis. If this step is performed last, the analysis must be rerun whenever outliers or influential points are removed, and steps #1, #2, and #3 must be repeated after the diagnostic step. Although it may sound as if the diagnostics should come first, many diagnostic analyses are impossible to perform without first fitting the regression, whether manually using formulas or using software. Therefore, the regression analysis is usually run a couple of times to produce the best results, including the test statistics and the fitted regression.
Any software, including MS Excel, can be used to perform the simple linear regression. Video 1 shows the simple linear regression analysis using MS Excel and Minitab.
Data were collected for the fuel cost and the distance travelled by a vehicle and are provided in Table 1.
Table 1. Fuel Cost vs Distance Travelled Data
The simple linear regression analysis is performed using Minitab software (version 19). The output is provided in Figure 4.
Figure 4. Simple Linear Regression Analysis Output for the Fuel Cost vs Distance
The sequence of the output, including the regression equation, coefficients, analysis of variance, model summary, and diagnostics, is not standardized across software packages. Even different versions of the same software provide different output sequences. Nevertheless, the explanation of the results should follow the regression analysis steps suggested earlier for the learning stage. Once the basics are learned, readers can use their own discretion and develop their own steps!
The analysis of variance table (the fourth output table in Figure 4) provides the information on the statistical significance of the relationship between the fuel cost and the distance.
This statistical significance test includes four steps, which were discussed in earlier Modules 1, 2, and 3. The statistical significance test follows the four steps provided below.
[The hypothesis is the research question. For the regression analysis in this example: is there a statistical relationship between the fuel cost and the distance?]
Null Hypothesis: β1 = 0
[zero slope = no functional relationship, meaning that fuel cost does not change with the distance]
Alternative Hypothesis: β1 ≠ 0
[There is a functional relationship between the fuel cost and the distance]
[Simple linear regression analysis is the appropriate method for this situation]
The null hypothesis is rejected because the p-value (= 0.000) is less than the level of significance (alpha = 0.05).
[The p-value is the probability of observing a relationship at least as strong as the one in the sample if the null hypothesis were true]
There is a statistically significant functional relationship between the fuel cost and the distance travelled.
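As a scripted counterpart to this step, the sketch below runs the same test in Python, assuming the statsmodels package and the same hypothetical stand-in data used earlier; the ANOVA F-test p-value is compared against alpha = 0.05.

import numpy as np
import statsmodels.api as sm

# Hypothetical data (stand-in for Table 1)
x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

model = sm.OLS(y, sm.add_constant(x)).fit()

# Step #1: statistical significance of the regression (ANOVA F-test)
alpha = 0.05
print(f"F = {model.fvalue:.2f}, p-value = {model.f_pvalue:.4f}")
if model.f_pvalue < alpha:
    print("Reject the null hypothesis: the slope differs from zero.")
else:
    print("Fail to reject the null hypothesis: stop the analysis here.")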
Once statistical significance is observed in step #1, the practical significance is checked from the third output table in Figure 4, the Model Summary table.
R-square, or the coefficient of determination, is defined as the percentage of variation in the dependent variable explained by the independent variable(s) (Equation 5). In this situation, 96.36 percent of the variation in the fuel cost can be explained by the distance travelled.
Equation 5: R² = SSR / SSTO
where SSR = regression sum of squares = the variation explained by all model terms, and SSTO = total sum of squares = the total variation, including the experimental error (the error, or residual, in regression analysis).
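Equation 5 can be verified directly from the sums of squares; here is a minimal Python sketch, again assuming hypothetical stand-in data rather than the Table 1 values.

import numpy as np

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

# Least-squares fit and fitted values
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssto = np.sum((y - y.mean()) ** 2)   # total sum of squares (SSTO)
sse = np.sum((y - y_hat) ** 2)       # error (residual) sum of squares (SSE)
ssr = ssto - sse                     # regression sum of squares (SSR)

print(f"R-square = SSR/SSTO = {ssr / ssto:.4f}")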
Figure 5. Understanding the R-Square
Figure 5 shows a visual representation of the r-square values for data sets with different errors (residuals). The top-left graph shows most observed data points following the predicted regression line closely, while the bottom-right graph shows most data points falling away from the predicted regression line. Higher error (residuals) means a lower r-square value (Equation 5). While the relationships between the dependent and the independent variables are significant for all four data sets in Figure 5, the relationship is much weaker in the bottom-right graph than in the top-left graph. Therefore, the r-square is a measure of the strength of the relationship: a higher r-square value indicates a stronger relationship between the dependent and the independent variables. Even when the functional relationship between the dependent and the independent variables is significant, a very weak relationship could be practically meaningless.
Because the r-square is proportional to the variation explained by the regression model terms (SSR) (Equation 5), adding more model terms may inflate the r-square value. Therefore, the r-square could be misleading as a measure of the strength of the relationship between the dependent and independent variables. To account for this unwanted inflation, an adjustment to the r-square formula is made as in Equation 6. The relatively unbiased adjusted r-square can therefore be used to assess the strength of the relationship.
Equation 6: adjusted R² = 1 − [(n − 1)/(n − p)] × (SSE/SSTO), where SSE = SSTO − SSR is the error (residual) sum of squares, n is the number of observations, and p is the number of model parameters.
Nevertheless, an appropriately built regression model will produce very close values for the r-square and the adjusted r-square. If there is a large difference between them, insignificant terms have probably been included in the final regression model, and the model should be investigated further to find the issue. To explain the model strength, or the practical significance, either the r-square or the adjusted r-square should be fine if the model is well built.
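A short sketch of Equation 6 follows, assuming the same hypothetical data; p = 2 for simple linear regression (intercept plus slope).

import numpy as np

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])
n, p = len(x), 2  # observations; model parameters (intercept + slope)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)
ssto = np.sum((y - y.mean()) ** 2)

r2 = 1 - sse / ssto                                # Equation 5
r2_adj = 1 - ((n - 1) / (n - p)) * (sse / ssto)    # Equation 6
print(f"r-square = {r2:.4f}, adjusted r-square = {r2_adj:.4f}")

For a well-built model, the two printed values should be very close, as noted above.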
A satisfactory r-square value depends on the field of study. In the fuel cost versus distance example, the r-square is observed to be 96.36%, which is considered excellent. Therefore, the functional relationship between the fuel cost and the distance is considered very strong. While a lower r-square value would not be acceptable for this fuel cost study, a human behavior study might be satisfied with an r-square of 50%. In fields such as marketing and advertising, an r-square of 10 to 20% may be acceptable.
Once the regression model is observed to be statistically and practically significant, the third step is to explain the functional relationship between the dependent and the independent variables using the first and second tables of Figure 4.
The fuel cost increases by 0.06017 USD per mile travelled.
The intercept, 5.531, is the fuel cost associated with a zero x-value, meaning that there will be a fuel cost of 5.531 USD even if no distance is travelled.
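As an illustration of the fitted equation, the predicted fuel cost for a hypothetical 100-mile trip would be:

Predicted fuel cost = 5.531 + 0.06017 × 100 = 11.548 USD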
Regression analysis diagnostics involve checking the assumptions made for the analysis. The four primary assumptions of regression analysis are listed below.
1. The relationship between the dependent and the independent variables is linear.
2. There are no extreme unusual observations (outliers, leverage, or influential points) distorting the model.
3. The errors (residuals) are normally distributed with zero mean and constant (homogeneous) variance.
4. The errors (residuals) are uncorrelated (independent).
Violations of these assumptions can be checked by (1) a visual look at the data and (2) statistical tests. The initial step is a visual look at the data using scatter plots. Any software can be used for the scatter plots; however, the author finds MS Excel the most convenient and useful for a quick look at the data. The final step is to perform statistical tests to confirm any suspected violations of the regression analysis assumptions.
A simple scatter plot between the dependent and the independent variables provides a visual indication of any problem in the data, such as a relationship other than linear (assumption #1) or potential unusual observations (assumption #2).
4.4.1.1. Linearity Assumption Check
The first assumption of a regression analysis is the linearity of the data. This assumption can be checked simply using a scatter plot in MS Excel. A scatter plot with a trend line will show the linearity of the data (Figure 6). Moreover, the scatter diagram in MS Excel has an option to show the regression equation with the r-square value. The scatter plot in Figure 6 clearly shows that the left data set has a potential linear relationship, while the right data set has a clearly nonlinear relationship. Therefore, performing a linear regression analysis on the right data set would be inappropriate.
Figure 6. Linearity Assumption Violation Check in Regression Analysis
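The same quick linearity check can be scripted; the following is a minimal sketch using matplotlib with hypothetical data (any plotting tool, including MS Excel, works equally well).

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data; replace with the study's actual observations
x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

# Scatter plot with a fitted straight trend line
b1, b0 = np.polyfit(x, y, 1)
plt.scatter(x, y, label="observed data")
plt.plot(x, b0 + b1 * x, label=f"trend: y = {b0:.3f} + {b1:.4f}x")
plt.xlabel("Distance travelled (miles)")
plt.ylabel("Fuel cost (USD)")
plt.legend()
plt.show()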
4.4.1.2. Unusual Observations Check
An observation could be unusual with respect to its y-value or its x-value. However, rather than calling them x- or y-unusual observations, they are categorized as outlier, leverage, or influential points according to their impact on the regression model.
Outlier – an outlier is an unusual observation with respect to either the x-value or the y-value. An x-outlier will make the scope of the regression too broad, which is usually considered less accurate. An x-outlier is uncommon; however, it may seriously affect the regression outcomes. In an unplanned study, the data are often collected before much thought is put into them, and in those situations there could be x-outliers. The y-outliers are very common and usually not as severe as the x-outliers. Nevertheless, the effect of a y-outlier must be investigated further to check whether it is just a simple data entry error, some severe issue in the process, or merely a random phenomenon. Figure 7 shows both an x-outlier (left) and a y-outlier (right). Both plots show that a better linear relationship would be possible without these outliers. In this situation, the x-outlier rotates the line clockwise, changing both the slope and the intercept of the relationship, while the y-outlier moves the predicted line upward. The solid line shows the predicted relationship without the outliers.
Figure 7. Outlier with respect to x-value (left) and y-value (right)
Leverage – a data point whose x-value (independent variable) is unusual while its y-value follows the predicted regression line (Figure 8). A leverage point may look fine because it sits on the predicted regression line. However, a leverage point inflates the strength of the regression relationship, both its statistical significance (reducing the p-value and increasing the chance of a significant relationship) and its practical significance (increasing the r-square). Because the point follows the predicted regression line, it has little to no impact on the coefficients.
Practical significance of a leverage point – think about a relationship between muscle mass and power. Suppose that in the study most individuals weigh around 200 pounds and only one person weighs about 400 lbs. This 400-pound individual and his extreme y-value (power) will dictate the relationship more than all the other individuals weighing near 200 pounds. Therefore, the conclusions of the study could be misleading. Leverage points usually make the scope of the functional regression relationship too broad, and a model with too broad a scope is generally considered less accurate than a narrower one. To improve regression model accuracy, narrower-scoped models are recommended.
Figure 8. Leverage Point (Right) in Regression Analysis
Influential – a data point that unduly influences the regression analysis outputs (Figure 9). A point is considered influential if its exclusion causes major changes in the fitted regression function. Depending on the location of the point, it may affect all statistics, including the p-value, r-square, coefficients, and intercept. Figure 9 shows the impact of an influential point on the regression statistics, including the r-square, slope, and intercept.
Figure 9. Influential Point in Regression Analysis
Any statistical software, including MS Excel, will produce the diagnostic statistics. Video 2 demonstrates the diagnostic analysis using Minitab software and explains the analysis results.
Video 2. How to Explain and Interpret the Linear Regression Diagnostics Analysis Explained Example in Minitab
4.4.2.1. Statistical Test for y-Outlier Point
The diagnostic analysis for each data point is provided in Table 2. An observation is generally considered an outlier if the absolute value of its residual (RESI) is high compared to the other points. For example, data point #6 has a very high residual compared to any other data point in the set. The absolute values of the other outlier diagnostics, such as the scaled or adjusted residuals, including the standardized residuals (SRES) and the deleted residuals (TRES), are also higher for point #6. Generally, a point with a high absolute value for any of these diagnostic statistics is considered an outlier.
Table 2. Regression Diagnostic Analysis: Detection of Outliers
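These y-outlier statistics can also be computed in code; the sketch below assumes the statsmodels package and hypothetical stand-in data. RESI, SRES, and TRES correspond to the raw residuals, the internally studentized (standardized) residuals, and the externally studentized (deleted) residuals, respectively.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

model = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(model)

resi = model.resid                      # raw residuals (RESI)
sres = infl.resid_studentized_internal  # standardized residuals (SRES)
tres = infl.resid_studentized_external  # deleted residuals (TRES)

for i, (r, s, t) in enumerate(zip(resi, sres, tres), start=1):
    print(f"point #{i}: RESI={r:+.3f}  SRES={s:+.2f}  TRES={t:+.2f}")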
4.4.2.2. Statistical Test for x-Outlier Point
An x-outlier is determined from the diagonal element of the hat matrix, HI. The diagonal elements of the hat matrix have some interesting properties, including:
1. Each diagonal element HI always lies between 0 and 1.
2. The diagonal elements sum to p, the number of model parameters; therefore, the average HI value is p/n.
3. As a rule of thumb, a point whose HI value exceeds twice the average (2p/n) is considered an x-outlier (a high-leverage point).
Point #11 produces 0.73 for the diagonal element of the hat matrix, HI, which is larger than 2p/n (= 0.36). Therefore, point #11 is considered an outlier with respect to the x-value.
Table 3. Regression Diagnostic Analysis: Detection of x-Outlier and Leverage Points
4.4.2.3. Statistical Test for Leverage Point
A leverage point is a point whose x-value is an outlier while its y-value is on the predicted line (the y-value is not an outlier). Therefore, such a point goes undetected by the y-outlier detection statistics, including the RESI, SRES, and TRES. For example, the RESI, SRES, and TRES values for point #11 are not considered large at all; rather, they are very consistent with the other points. Therefore, point #11 is not considered an outlier with respect to the y-value. However, its value for the diagonal element of the hat matrix, HI, is very large. Any point whose diagonal element of the hat matrix exceeds 2p/n (2 × 2/11 = 0.36 for this example) is considered a leverage point. Therefore, point #11 is considered an x-outlier, and it has high leverage on the regression analysis.
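A minimal sketch of the leverage check follows, assuming statsmodels and hypothetical data; HI is the diagonal of the hat matrix, flagged against the 2p/n rule of thumb described above.

import numpy as np
import statsmodels.api as sm

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

hi = model.get_influence().hat_matrix_diag  # diagonal elements of the hat matrix (HI)
n, p = X.shape                              # observations, model parameters
threshold = 2 * p / n                       # rule-of-thumb leverage cutoff

for i, h in enumerate(hi, start=1):
    flag = "  <-- leverage point" if h > threshold else ""
    print(f"point #{i}: HI = {h:.3f}{flag}")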
4.4.2.4. Statistical Test for Influential Point
The DFIT and Cook's distance (COOK) statistics are used to statistically determine an influential point. If the absolute value of DFIT exceeds 1 for small to medium data sets, or exceeds 2√(p/n) for large data sets, the point is considered influential to the fitted regression. In this small data set example in Table 4, the absolute value of DFIT for point #11 is 3.63, which exceeds 1 (one). Therefore, point #11 is considered an influential point. While the DFIT measures the influence of the ith case on its own fitted value, Cook's distance (COOK) measures the influence of the ith case on all n fitted values. A very large Cook's distance for a point indicates a potential influence on the fitted regression line. To statistically determine the influential point, a probability is calculated using the Cook's distance as the value for an F-distribution with p and n − p degrees of freedom for the numerator and the denominator, respectively. If the probability value for the Cook's distance is 50% or more, the point has a major influence on the fitted regression line. A probability value between 10 and 20% indicates a very small influence, while 20 to 50% indicates moderate to high influence on the fitted regression. For the example in Table 4, enter =1-FDIST(1.637,2,9) in MS Excel to calculate this probability for point #11. The probability value calculated for point #11 is 75.2%, indicating a major influence on the regression.
Table 4. Detection of Influential Point
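A sketch of the influential-point tests in code follows, assuming statsmodels and scipy with hypothetical data; the Cook's distance percentile is computed from the F(p, n − p) distribution, mirroring the Excel formula above.

import numpy as np
import statsmodels.api as sm
from scipy import stats

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
infl = model.get_influence()

dffits, _ = infl.dffits            # DFIT for each point
cooks_d, _ = infl.cooks_distance   # Cook's distance for each point
n, p = X.shape

for i, (d, c) in enumerate(zip(dffits, cooks_d), start=1):
    pct = stats.f.cdf(c, p, n - p) * 100  # percentile of F(p, n-p), as in =1-FDIST(...)
    print(f"point #{i}: DFIT = {d:+.3f}, Cook's D = {c:.3f}, F percentile = {pct:.1f}%")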
In regression analysis, the errors (residuals) are assumed to be normally distributed with zero mean and constant (homogeneous) variance, and to be uncorrelated.
4.4.3.1. Normality Analysis
Any software, including MS Excel, will produce a normal probability plot (pp-plot) to test the normality of the data. If most points follow the straight line on the pp-plot, the data set is normally distributed. In the example pp-plot in Figure 10, the residuals are normally distributed.
Figure 10. Normal Probability Plot for Residuals
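A scripted version of the normality check is sketched below, assuming scipy and matplotlib with hypothetical data; residuals that fall near the straight line support the normality assumption.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

residuals = sm.OLS(y, sm.add_constant(x)).fit().resid

# Normal probability plot of the residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()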
4.4.3.2. Constant (Homogeneous) Variance Check
Any software, including MS Excel, produces the fitted value versus residual plot, which can be used to check the homogeneity (constancy) of variance (Figure 11). Any pattern in the residual plot is a violation of the assumptions on the residuals. While the top-left graph looks perfect, the other three residual plots show some pattern, i.e., some predictability. Any predictability (any pattern) in the residuals is considered a violation of the homogeneity (constancy) of variance of the residuals (Figure 11). The data must be reinvestigated for remedial actions before drawing any conclusions from such a regression analysis.
Figure 11. Residual Analysis for Homogeneity (Constancy) of Variance
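A minimal sketch of the constant-variance check follows, assuming statsmodels and matplotlib with hypothetical data; the residuals are plotted against the fitted values, and any visible pattern (funnel, curve, cycle) suggests a violation.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

model = sm.OLS(y, sm.add_constant(x)).fit()

# Fitted values vs residuals: points should scatter randomly around zero
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()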
4.4.3.3. Uncorrelated (Independent) Residuals Assumption Check
Whenever data are collected in some sequence, including time, place, process, etc., the observation order versus residual plot should be investigated for any correlation between the order of data collection and the residuals. Fortunately, all statistical software, including MS Excel, produces this plot. The left plot shows a positive correlation of the residuals with the observation order, while the right graph shows a cyclic correlation between the residuals and the observation order (Figure 12). Uncorrelated residuals would look like the top-left plot in Figure 11. When the residuals are correlated with the observation order, the data must be reinvestigated for remedial actions before drawing any conclusions from the regression analysis.
Figure 12. Residual Analysis for Uncorrelated Variance Violation
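A sketch of the independence check follows, assuming statsmodels and matplotlib with hypothetical data. Besides plotting the residuals in observation order, the Durbin-Watson statistic is printed; values near 2 suggest uncorrelated residuals.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

x = np.array([10, 25, 40, 55, 70, 85, 100], dtype=float)
y = np.array([6.2, 7.1, 8.0, 8.9, 9.7, 10.6, 11.5])

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# Residuals in observation (collection) order; trends or cycles indicate correlation
plt.plot(range(1, len(resid) + 1), resid, marker="o")
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()

print(f"Durbin-Watson statistic = {durbin_watson(resid):.2f}")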
Consider the two data sets in Figure 13. The Y1 data set shows a potential linear relationship, while the Y2 data set may follow some relationship better than linear. Both data sets show a statistically significant linear relationship (p-value < 0.00001) with strong r-square values (over 85%), indicating a satisfactory regression analysis (Figure 14 and Figure 15). A lack-of-fit test determines whether some other relationship fits the data better than the one predicted and included in the analysis. Is any term missing from the regression model that would fit the data better? Without the lack-of-fit test, apparently satisfactory linear relationships would be accepted for both data sets. However, the lack-of-fit test for the Y2 data set indicates that there is a lack of fit in the regression model, meaning that a statistically better relationship than the linear one exists between the dependent and the independent variables (Figure 15). The Y1 data set shows no lack of fit, meaning the linear model fits fine.
The lack-of-fit test requires repeated observations at at least a few x-values in order to estimate the pure error from these repeated observations. If the within variation (pure error) of the repeated observations is large compared to the between variation, no statistical lack of fit exists in the model. If the between variation is large compared to the within variation, something else could be going on in the relationship between the dependent and the independent variables.
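Under these definitions, a minimal sketch of the lack-of-fit F-test follows, assuming statsmodels and scipy with hypothetical data that include repeated observations at each x-value; SSE is split into pure error (within replicates) and lack-of-fit components.

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical data with repeated observations at each x-value
x = np.array([10, 10, 40, 40, 70, 70, 100, 100], dtype=float)
y = np.array([6.1, 6.4, 8.1, 7.9, 9.6, 9.9, 11.4, 11.6])

model = sm.OLS(y, sm.add_constant(x)).fit()
sse = np.sum(model.resid ** 2)

# Pure error: variation of replicates around their own group means
sspe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in np.unique(x))
sslf = sse - sspe                        # lack-of-fit sum of squares

n, p, c = len(x), 2, len(np.unique(x))   # observations, parameters, distinct x-levels
f_stat = (sslf / (c - p)) / (sspe / (n - c))
p_value = 1 - stats.f.cdf(f_stat, c - p, n - c)
print(f"Lack-of-fit F = {f_stat:.2f}, p-value = {p_value:.3f}")

A small p-value here indicates a statistically significant lack of fit, i.e., some relationship other than linear fits the data better.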
Figure 13. Lack-of-Fit Data Comparison
Figure 14. Regression Analysis for Y1 Data Set
Figure 15. Regression Analysis for Y2 Data Set
Most software will produce the lack-of-fit test automatically if there are repeated observations in the data. The following video demonstrates the lack-of-fit test using Minitab.