# Unusual Observations

# Outlier, Leverage, and Influential Points

An observation could be unusual with respect to its y-value or x-value. However, rather than calling them x- or y-unusual observations, they are categorized as *outlier*, *leverage*, and **influential points** according to their impact on the regression model.

*Outlier* – an outlier is an observation that is unusual with respect to either its x-value or its y-value. An x-outlier makes the scope of the regression too broad, which is usually considered less accurate. An x-outlier is uncommon, but it may seriously affect the regression outcomes. In an unplanned study, however, the data are often collected without much prior thought, and in those situations x-outliers are possible. The y-outliers are very common, and they are usually not as severe as x-outliers. Nevertheless, the effect of a y-outlier must be investigated further to check whether it is a simple data entry error, a severe issue in the process, or just a random phenomenon. Figure 7 shows both an x-outlier (left) and a y-outlier (right). Both plots show that a better linear relationship would be possible without these outliers. In this situation, the x-outlier rotates the line clockwise, changing both the slope and the intercept of the relationship, while the y-outlier moves the predicted line upward. The solid line shows the predicted relationship without the outliers.

*Figure 7.* Outlier with Respect to x-Value (Left) and y-Value (Right)

*Leverage* – a data point whose x-value (independent) is unusual, although its y-value follows the predicted regression line (Figure 8). A leverage point may look okay because it sits on the predicted regression line. However, a leverage point will inflate the strength of the regression relationship in terms of both the statistical significance (reducing the **p-value**, which increases the chance of a significant relationship) and the practical significance (increasing **r-square**). Leverage points have essentially no impact on the coefficients, however, because the point follows the predicted regression line.

*Practical significance of leverage point* – think about a relationship between muscle mass and power. Suppose that, in the study, most individuals weigh around 200 pounds and only one person weighs about 400 pounds. This 400-pound person and his extreme y-value (power) will dictate the relationship more than all the other individuals weighing near 200 pounds. Therefore, the conclusions of the study could be misleading. Leverage points usually make the functional regression relationship too broad. Generally, a wider (too broad) model is considered less accurate than a narrower one. To improve the accuracy of the regression model, narrower models are recommended.

*Figure 8.* Leverage Point (Right) in a Regression Analysis
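The effect of a leverage point on r-square and on the coefficients can be checked numerically. The following Python sketch uses made-up data (not the data behind Figure 8): six ordinary points scattered around y = 1 + 2x, plus one added point (20, 41) that has an unusual x-value but sits exactly on that line. The helper name `fit_line` is illustrative and not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_square).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_square = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_square


# Hypothetical data: six ordinary points scattered around y = 1 + 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [4, 3, 8, 10, 9, 14]

slope_a, intercept_a, r2_a = fit_line(xs, ys)

# Add one point with an unusual x-value that lies exactly on the same line.
slope_b, intercept_b, r2_b = fit_line(xs + [20], ys + [41])

print(f"without leverage point: slope={slope_a:.3f}, "
      f"intercept={intercept_a:.3f}, r-square={r2_a:.3f}")
print(f"with leverage point:    slope={slope_b:.3f}, "
      f"intercept={intercept_b:.3f}, r-square={r2_b:.3f}")
```

With these numbers the slope and intercept are unchanged while r-square rises, which is the pattern the text attributes to a leverage point.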

*Influential* – a data point that unduly influences the regression analysis outputs (Figure 9). A point is considered **influential** if its exclusion causes major changes in the fitted regression function. Depending on the location of the point, it may affect all statistics, including the **p-value**, **r-square**, **coefficients**, and **intercept**. Figure 9 shows the impact of an influential point on the regression statistics, including the **r-square**, **slope**, and the **intercept**.

*Figure 9.* Influential Point in a Regression Analysis
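The "exclusion causes major changes" definition can be illustrated by refitting the line with each point left out in turn. This is a minimal sketch on made-up data (not the data behind Figure 9), in which the last point has an unusual x-value and also falls far below the trend of the others. The helper name `fit_slope_intercept` is illustrative only.

```python
def fit_slope_intercept(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: the last point is far out in x AND far below the trend.
xs = [1, 2, 3, 4, 5, 6, 20]
ys = [4, 3, 8, 10, 9, 14, 10]

full_slope, full_intercept = fit_slope_intercept(xs, ys)

# Refit with each point excluded and record how much the slope moves.
slope_changes = []
for i in range(len(xs)):
    xs_without = xs[:i] + xs[i + 1:]
    ys_without = ys[:i] + ys[i + 1:]
    slope_i, _ = fit_slope_intercept(xs_without, ys_without)
    slope_changes.append(abs(slope_i - full_slope))

most_influential = max(range(len(xs)), key=lambda i: slope_changes[i])
print(f"slope with all points: {full_slope:.3f}")
print(f"largest slope change when excluding point #{most_influential + 1}: "
      f"{slope_changes[most_influential]:.3f}")
```

Excluding the last point changes the slope far more than excluding any other point, so it is the one this sketch would flag as influential.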

# Statistical Diagnostic Tests for Unusual Observations

Any statistical software, including MS Excel, will produce the diagnostic statistics. Video 2 demonstrates the diagnostic analysis in Minitab and explains the analysis results.

*Video 2.* How to Explain and Interpret the Linear Regression Diagnostics Analysis Explained Example in Minitab

## Statistical Test for y-Outlier Point

Diagnostic analysis for each data point is provided in Table 2. An observation is generally considered an outlier if the absolute value of its *residual (RESI)* is high relative to the other points. For example, data point #6 has a very high residual compared to the other data points in the data set. The absolute values of the other diagnostic statistics, such as the **scaled or adjusted residuals, standardized residuals (SRES)** and **deleted residuals (TRES)**, are also observed to be higher for point #6. Generally, a point with a high absolute value for any of these diagnostic statistics is considered an outlier.

Table 2

*Regression Diagnostic Analysis: Detection of Outliers*
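As a rough illustration of how these three statistics relate, the sketch below computes them for a simple linear regression in plain Python. It assumes the usual textbook definitions (standardized residual = residual divided by its estimated standard error; deleted residual = studentized deleted residual), which is how the RESI, SRES, and TRES columns are commonly defined; the data are made up and are not the values in Table 2, and `residual_diagnostics` is an illustrative name.

```python
import math


def residual_diagnostics(xs, ys):
    """Return (RESI, SRES, TRES) lists for a simple linear regression."""
    n = len(xs)
    p = 2  # one slope plus one intercept
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar

    # Ordinary residuals and the mean squared error.
    resi = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in resi)
    mse = sse / (n - p)

    # Hat-matrix diagonal for simple linear regression.
    hi = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

    # Standardized residuals and studentized deleted residuals.
    sres = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resi, hi)]
    tres = [e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
            for e, h in zip(resi, hi)]
    return resi, sres, tres


# Hypothetical data: point #6 has an unusually high y-value.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 19.0, 15.2, 16.9, 19.1, 20.8]

resi, sres, tres = residual_diagnostics(xs, ys)
for i, (e, s, t) in enumerate(zip(resi, sres, tres), start=1):
    print(f"point #{i}: RESI={e:7.3f}  SRES={s:7.3f}  TRES={t:7.3f}")
```

In this made-up data set all three statistics are largest in absolute value for point #6, mirroring the pattern described for Table 2.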

## Statistical Test for x-Outlier Point

An x-outlier is determined from the *diagonal element of the hat matrix, HI*. The **diagonal elements of the hat matrix, HI** have some interesting properties, including:

- **HI** measures the weighted distance from the x-mean (mean of the independent variables).
- The sum of all **diagonal elements of the hat matrix, HI** is equal to *p*, the number of slope parameters plus the intercept. In this example, there is one slope parameter and one intercept, so *p* = 2. Therefore, the sum of the *HI* column in Table 3 is equal to 2.
- Therefore, any large value of **HI** is considered an outlier with respect to the x-values.
- Generally, any value that exceeds twice the mean value of **HI** (= 2(*p/n*)) is considered an x-outlier.

Point #11 produces a value of 0.73 for the *diagonal element of the hat matrix, HI*, which is larger than 2*p/n* (= 2 × 2/11 = 0.36). Therefore, point #11 is considered an outlier with respect to the *x*-value.

Table 3

*Regression Diagnostic Analysis: Detection of x-Outlier and Leverage Points*
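The hat-matrix rule can be sketched in a few lines of Python. For a simple linear regression the diagonal element has the closed form HI = 1/n + (x − x̄)² / Σ(x − x̄)², which is what the illustrative helper `hat_diagonal` below uses. The x-values are made up (eleven points, to match *n* = 11 in the example) and are not the values in Table 3.

```python
def hat_diagonal(xs):
    """Diagonal elements of the hat matrix for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical x-values: ten ordinary points and one far from the x-mean.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

p = 2              # one slope plus one intercept
n = len(xs)
hi = hat_diagonal(xs)
cutoff = 2 * p / n  # twice the mean hat value

# Point numbers (1-based) whose hat value exceeds the cutoff.
flagged = [i + 1 for i, h in enumerate(hi) if h > cutoff]

print(f"sum of HI = {sum(hi):.3f} (should equal p = {p})")
print(f"cutoff 2p/n = {cutoff:.3f}")
print(f"x-outliers: {flagged}")
```

The hat values depend only on the x-values, so a point can be flagged here even when its y-value sits on the fitted line.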

## Statistical Test for Leverage Point

A *leverage* point is a point whose x-value is an outlier, while the *y*-value is on the predicted line (the *y*-value is not an outlier). Therefore, this point is undetected by the *y*-outlier detection statistics, including the **RESI, SRES,** and **TRES.** For example, the **RESI, SRES,** and **TRES** values for point #11 are NOT considered large at all; rather, they are very consistent with the other points. Therefore, point #11 is not considered an outlier with respect to the *y*-value. However, the value of the **diagonal element of the hat matrix, HI** is very large. Any point whose diagonal element of the hat matrix exceeds 2*p/n* (2 × 2/11 = 0.36 for this example) is considered a leverage point. Therefore, point #11 is considered an **x-outlier** and it has high **leverage** on the regression analysis.

## Statistical Test for Influential Point

**DFIT** and **COOK** distance are used to statistically determine the **influential point**. If the absolute value of **DFIT** exceeds 1 for small to medium data sets, or 2√(*p/n*) for large data sets, the point is considered influential to the fitted regression. In the small data set example in Table 4, the absolute value of **DFIT** for point #11 is observed to be 3.63, which exceeds 1 (one), and therefore point #11 is considered an influential point. While **DFIT** measures the influence of the *i*th case on the fitted value for this case, **Cook’s distance, COOK** measures the influence of the *i*th case on all *n* fitted values. A very large Cook’s distance for a point indicates a potential influence on the fitted regression line. However, to statistically determine the influential point, a probability is calculated using the Cook’s distance as the value for the *F*-distribution. If the probability value for the Cook’s distance is 50% or more, the point has a major influence on the fitted regression line. A probability value between 10-20% indicates a very small influence, while 20-50% indicates moderate to high influence on the fitted regression. The probability for Cook’s distance is calculated using an *F*-distribution with *p* and *n-p* degrees of freedom for the numerator and the denominator, respectively. For the example in Table 4, enter **= 1-FDIST(1.637,2,9)** in MS Excel to calculate the probability for point #11. The probability value calculated for point #11 is 75.2%, indicating a major influence on the regression.

Table 4

*Detection of Influential Point*
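The two influence statistics and the probability step can be sketched in plain Python. This assumes the common textbook formulas DFIT = TRES × √(HI / (1 − HI)) and COOK = RESI² / (*p* × MSE) × HI / (1 − HI)², and it uses the closed-form *F* cumulative probability that holds when the numerator degrees of freedom equal 2 (the simple linear regression case). The function names and the data are illustrative and are not the values in Table 4; only the value 1.637 with 2 and 9 degrees of freedom comes from the text.

```python
import math


def influence_diagnostics(xs, ys):
    """Return (DFIT, COOK) lists for a simple linear regression (p = 2)."""
    n = len(xs)
    p = 2
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar

    resi = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in resi)
    mse = sse / (n - p)
    hi = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

    dfit, cook = [], []
    for e, h in zip(resi, hi):
        # Studentized deleted residual, then scale by the leverage ratio.
        tres = e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
        dfit.append(tres * math.sqrt(h / (1 - h)))
        cook.append(e ** 2 / (p * mse) * h / (1 - h) ** 2)
    return dfit, cook


def f_cdf_two_numerator_df(value, denom_df):
    """P(F <= value) for an F-distribution with 2 numerator degrees of freedom."""
    return 1 - (1 + 2 * value / denom_df) ** (-denom_df / 2)


# Same quantity as the Excel formula = 1-FDIST(1.637,2,9) quoted in the text.
percentile_from_text = f_cdf_two_numerator_df(1.637, 9)
print(f"probability for Cook's distance 1.637: {percentile_from_text:.3f}")

# Hypothetical data: point #11 is far out in x and far below the trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
ys = [2.0, 6.5, 6.0, 10.5, 9.5, 14.5, 13.5, 18.5, 17.5, 22.0, 30.0]

dfit, cook = influence_diagnostics(xs, ys)
denom_df = len(xs) - 2
for i, (d, c) in enumerate(zip(dfit, cook), start=1):
    prob = f_cdf_two_numerator_df(c, denom_df)
    print(f"point #{i}: DFIT={d:8.3f}  COOK={c:8.3f}  probability={prob:.3f}")
```

For models with more than one predictor the closed-form probability no longer applies, and a general *F*-distribution function from a statistics package would be needed instead.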