Unusual Observations

Outlier, Leverage, and Influential Points

An observation can be unusual with respect to its y-value or its x-value. However, rather than being called x-unusual or y-unusual, such observations are categorized as outlier, leverage, and influential points according to their impact on the regression model.

Outlier – an outlier is an observation that is unusual with respect to either its x-value or its y-value. An x-outlier makes the scope of the regression too broad, which is usually considered less accurate. X-outliers are uncommon, but they may seriously affect the regression outcomes. They are more likely in an unplanned study, where the data is often collected before much thought is put into it. Y-outliers are very common, and they are usually not as severe as x-outliers. Nevertheless, the effect of a y-outlier must be investigated further to check whether it is a simple data entry error, a severe issue in the process, or just a random phenomenon. Figure 7 shows both an x-outlier (left) and a y-outlier (right). Both plots show that a better linear relationship would be possible without these outliers. In this situation, the x-outlier rotates the line clockwise, changing both the slope and the intercept of the relationship, while the y-outlier moves the predicted line upward. The solid line shows the predicted relationship without the outliers.

Figure 7. Outlier with Respect to x-Value (Left) and y-Value (Right)

Leverage – a leverage point is a data point whose x-value (independent variable) is unusual but whose y-value follows the predicted regression line (Figure 8). A leverage point may look acceptable because it sits on the predicted regression line. However, it inflates the apparent strength of the regression relationship in terms of both statistical significance (it reduces the p-value, which increases the chance of declaring a significant relationship) and practical significance (it increases r-square). A leverage point has little or no impact on the coefficients, because the point follows the predicted regression line.
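The effect described above can be checked numerically. The following is a minimal sketch, assuming a made-up data set (the x and y values and the helper name `fit_line` are hypothetical, not taken from the figures): a point far out in x that sits exactly on the fitted line leaves the intercept and slope unchanged but raises r-square.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1, r_square)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1 - sse / sst

# Hypothetical data (assumption): ten ordinary points with some scatter.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.9, 6.1, 5.2, 9.4, 8.8, 13.5, 12.1, 16.2, 15.3, 19.0]

b0, b1, r2_without = fit_line(x, y)

# Add one point with an unusual x-value that lies exactly on the fitted line.
x_leverage = 30
x_with = x + [x_leverage]
y_with = y + [b0 + b1 * x_leverage]

b0_with, b1_with, r2_with = fit_line(x_with, y_with)

print(f"without leverage point: b0={b0:.3f}, b1={b1:.3f}, r-square={r2_without:.3f}")
print(f"with leverage point:    b0={b0_with:.3f}, b1={b1_with:.3f}, r-square={r2_with:.3f}")
```

The coefficients agree because the added point has a zero residual under the original line, so the original line still minimizes the sum of squared errors, while the total variation in y grows.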

Practical significance of leverage point – consider a study of the relationship between muscle mass and power. Suppose most individuals in the study weigh around 200 pounds and only one person weighs about 400 pounds. This 400-pound individual and his extreme y-value (power) will dictate the relationship more than all the other individuals weighing near 200 pounds. Therefore, the conclusions of the study could be misleading. Leverage points usually make the scope of the regression relationship too broad. Generally, a model with a wider scope is considered less accurate than one with a narrower scope. To improve the accuracy of the regression model, models with a narrower scope are recommended.

Figure 8. Leverage Point (Right) in a Regression Analysis

Influential – an influential point is a data point that unduly influences the outputs of the regression analysis (Figure 9). A point is considered influential if its exclusion causes major changes in the fitted regression function. Depending on the location of the point, it may affect all statistics, including the p-value, r-square, slope, and intercept. Figure 9 shows the impact of an influential point on the regression statistics, including the r-square, slope, and intercept.
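The exclusion test in this definition can be sketched directly: fit the line with and without the suspect point and compare the coefficients. The data and the helper name `fit_line` below are hypothetical, chosen only to illustrate the idea.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data (assumption): the last point has an unusual x-value
# AND a y-value far from the trend of the other ten points.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
y = [2.9, 6.1, 5.2, 9.4, 8.8, 13.5, 12.1, 16.2, 15.3, 19.0, 10.0]

b0_all, b1_all = fit_line(x, y)
b0_excl, b1_excl = fit_line(x[:-1], y[:-1])  # refit with the last point excluded

print(f"all points:         intercept={b0_all:.3f}, slope={b1_all:.3f}")
print(f"last point removed: intercept={b0_excl:.3f}, slope={b1_excl:.3f}")
```

A large shift in slope and intercept after removing a single point is the informal signature of an influential point.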

Figure 9. Influential Point in a Regression Analysis

Statistical Diagnostic Tests for Unusual Observations

Most statistical software, including MS Excel, can produce the diagnostic statistics. Video 2 walks through the diagnostic analysis in Minitab and explains the results.

Video 2. How to Explain and Interpret the Linear Regression Diagnostics Analysis Explained Example in Minitab

Statistical Test for y-Outlier Point

The diagnostic analysis for each data point is provided in Table 2. An observation is generally considered a y-outlier if the absolute value of its residual (RESI) is much larger than those of the other observations. For example, data point #6 has a very high residual compared to all other points in the data set. The absolute values of the other diagnostic statistics, the scaled or adjusted residuals such as the standardized residual (SRES) and the deleted residual (TRES), are also higher for point #6. Generally, a point with a high absolute value for any of these diagnostic statistics is considered an outlier.
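As a rough illustration of how these three statistics relate, the sketch below computes them for a simple linear regression. The data are hypothetical (they are not the values in Table 2), and the formulas are the common textbook definitions: the standardized residual divides the residual by its estimated standard deviation, and the deleted residual rescales it using the error variance estimated without that point. Whether these match Minitab's RESI, SRES, and TRES columns exactly is an assumption.

```python
import math

# Hypothetical data (assumption): point #6 has an unusually large y-value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.9, 6.1, 5.2, 9.4, 8.8, 25.0, 12.1, 16.2, 15.3, 19.0]

n, p = len(x), 2  # p counts the slope and the intercept
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Raw residuals and hat-matrix diagonal for simple linear regression.
resi = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
hi = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

sse = sum(e ** 2 for e in resi)
mse = sse / (n - p)

# Standardized residual: e_i / sqrt(MSE * (1 - h_i)).
sres = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resi, hi)]

# Deleted residual: e_i * sqrt((n - p - 1) / (SSE * (1 - h_i) - e_i^2)).
tres = [
    e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
    for e, h in zip(resi, hi)
]

for i in range(n):
    print(f"#{i + 1:<2} RESI={resi[i]:7.3f}  SRES={sres[i]:6.2f}  TRES={tres[i]:6.2f}")

# 1-based index of the point with the largest absolute deleted residual.
largest = max(range(n), key=lambda i: abs(tres[i])) + 1
print("largest |TRES| at point", largest)
```

The deleted residual tends to separate an outlier more sharply than the standardized residual because the outlier itself no longer inflates the error variance estimate.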

Table 2

Regression Diagnostic Analysis: Detection of Outliers

Statistical Test for x-Outlier Point

An x-outlier is determined from the diagonal elements of the hat matrix, HI. The diagonal elements of the hat matrix have some useful properties, including

  1. HI measures the weighted distance of a point from the x-mean (the mean of the independent variables).
  2. The sum of all diagonal elements of the hat matrix equals p, the number of regression parameters including the intercept. In this example, there is one slope and one intercept, so p = 2. Therefore, the sum of the HI column in Table 3 is equal to 2.
  3. Because the HI values sum to p over n points, their mean is p/n, and a large HI value indicates an outlier with respect to the x-values.
  4. Generally, any HI value that exceeds twice the mean value of HI (= 2*(p/n)) is considered an x-outlier.

Point #11 produces a value of 0.73 for the diagonal element of the hat matrix, HI, which is larger than 2p/n (= 2*2/11 = 0.36). Therefore, point #11 is considered an outlier with respect to the x-value.
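The properties listed above can be verified with a short calculation. This is a minimal sketch for simple linear regression, where the diagonal element of the hat matrix reduces to 1/n + (x - x_mean)^2 / Sxx. The x-values are hypothetical (they are not the values in Table 3), with point #11 placed far from the others.

```python
# Hypothetical x-values (assumption): ten ordinary points and one far away.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

n, p = len(x), 2  # p counts the slope and the intercept
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Diagonal of the hat matrix for simple linear regression.
hi = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

hi_sum = sum(hi)      # should equal p
cutoff = 2 * p / n    # twice the mean HI value

# 1-based indices of points whose HI exceeds the cutoff.
x_outliers = [i + 1 for i, h in enumerate(hi) if h > cutoff]

print(f"sum of HI = {hi_sum:.4f}, cutoff 2p/n = {cutoff:.4f}")
for i, h in enumerate(hi, start=1):
    print(f"#{i:<2} HI = {h:.4f}")
print("x-outliers:", x_outliers)
```

Note that HI depends only on the x-values, which is why it can flag a point that the residual-based statistics miss.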

Table 3

Regression Diagnostic Analysis: Detection of x-Outlier and Leverage Points

Statistical Test for Leverage Point

A leverage point is a point whose x-value is an outlier while its y-value lies on the predicted line (the y-value is not an outlier). Therefore, this point goes undetected by the y-outlier detection statistics, including RESI, SRES, and TRES. For example, the RESI, SRES, and TRES values for point #11 are not large at all; rather, they are consistent with those of the other points. Therefore, point #11 is not considered an outlier with respect to the y-value. However, its diagonal element of the hat matrix, HI, is very large. Any point whose HI value exceeds 2p/n (2*2/11 = 0.36 for this example) is considered a leverage point. Therefore, point #11 is considered an x-outlier, and it has high leverage on the regression analysis.

Statistical Test for Influential Point

DFIT and Cook's distance (COOK) are used to statistically determine whether a point is influential. A point is considered influential to the fitted regression if the absolute value of DFIT exceeds 1 for small to medium data sets, or 2*sqrt(p/n) for large data sets. In the small data set example in Table 4, the absolute value of DFIT for point #11 is 3.63, which exceeds 1, and therefore point #11 is considered an influential point.

While DFIT measures the influence of the ith case on the fitted value for that case, Cook's distance measures the influence of the ith case on all n fitted values. A very large Cook's distance for a point indicates a potential influence on the fitted regression line. However, to statistically determine the influential point, a probability is calculated by treating the Cook's distance as a value of the F-distribution with p and n-p degrees of freedom for the numerator and the denominator, respectively. If this probability is 50% or more, the point has a major influence on the fitted regression line. A probability below about 10-20% indicates very little influence, while 20-50% indicates moderate to high influence on the fitted regression. For the example in Table 4, enter =1-FDIST(1.637,2,9) in MS Excel to calculate the probability for point #11. Because FDIST returns the right-tail probability, subtracting it from 1 gives the cumulative probability. The value calculated for point #11 is 75.2%, indicating a major influence on the regression.
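The Excel calculation above can be reproduced without a spreadsheet. The sketch below relies on a closed form that holds only when the numerator degrees of freedom equal 2 (as in this example, where p = 2): the cumulative F probability is 1 - (1 + 2F/d)^(-d/2) for denominator degrees of freedom d. The function name `f_cdf_numerator_df_2` and the data set are hypothetical, and the DFIT and Cook's distance formulas are the common textbook definitions, assumed here to correspond to the DFIT and COOK columns.

```python
import math

def f_cdf_numerator_df_2(f_value, denom_df):
    """Cumulative F probability, valid ONLY for numerator df = 2."""
    return 1 - (1 + 2 * f_value / denom_df) ** (-denom_df / 2)

# Value from the text: Cook's distance 1.637 with p = 2 and n - p = 9.
pct_text_example = f_cdf_numerator_df_2(1.637, 9)
print(f"probability for Cook's distance 1.637: {pct_text_example:.3f}")

# Hypothetical data (assumption, not Table 4): point #11 is far out in x
# and well off the trend of the other points.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
y = [2.9, 6.1, 5.2, 9.4, 8.8, 13.5, 12.1, 16.2, 15.3, 19.0, 10.0]

n, p = len(x), 2
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

resi = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
hi = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
sse = sum(e ** 2 for e in resi)
mse = sse / (n - p)

# Deleted residual, then DFIT = deleted residual * sqrt(h / (1 - h)).
tres = [e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2)) for e, h in zip(resi, hi)]
dfit = [t * math.sqrt(h / (1 - h)) for t, h in zip(tres, hi)]

# Cook's distance = e^2 / (p * MSE) * h / (1 - h)^2, and its F probability.
cook = [e ** 2 / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resi, hi)]
cook_pct = [f_cdf_numerator_df_2(d, n - p) for d in cook]

# 1-based indices of points with |DFIT| > 1 (small to medium data set rule).
influential = [i + 1 for i in range(n) if abs(dfit[i]) > 1]

for i in range(n):
    print(f"#{i + 1:<2} DFIT={dfit[i]:8.2f}  COOK={cook[i]:8.3f}  probability={cook_pct[i]:.3f}")
print("influential by DFIT:", influential)
```

For models with more than one predictor the closed form no longer applies, and a general F-distribution routine from a statistics library would be needed instead.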

Table 4

Detection of Influential Point