What is Regression Analysis?

Regression analysis is the process of estimating the functional relationship between a dependent variable (also known as the response variable or y-variable) and one or more independent variables (also known as factors, predictors, or x-variables). It is primarily used for probabilistic systems rather than deterministic systems, where the relationship is already known. An example of a probabilistic system is “the prediction for the stock market tomorrow.” Some applications of regression analysis include:

  1. Develop a functional relationship between variables
  2. Perform trend analysis
  3. Predict the dependent variable by the independent variable(s)
  4. Optimize processes
  5. Fit a response surface
  6. Analyze data from unplanned experiments (arguably the most frequent one)

The simplest form of regression analysis is simple linear regression, which involves only one linear predictor (independent variable or factor) and one dependent (response) variable. Consider the relationship between the fuel cost and the distance traveled by a vehicle (Figure 1).

Figure 1. Simple Linear Regression with One Dependent and One Independent Variable

The functional relationship can be written as Equation 1.

Equation 1

Clearly, more driving costs more fuel. However, driving conditions, speed, road conditions, weather, and other factors cause some variation in the data (Figure 1). Therefore, the functional relationship, including the error, can be written as Equation 2.

Equation 2

Figure 2. Simple Linear Regression with One Dependent and One Independent Variable with the Intercept

Sometimes there is some fuel cost even when the vehicle is not moving, such as while idling at a stop sign or in traffic congestion (Figure 2). Therefore, the complete functional relationship can be written as Equation 3.

Equation 3
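To make the role of each term concrete, the fuel-cost relationship can be simulated as an idle cost, plus a per-mile cost times the distance, plus a random error. This is only a sketch: the function name `simulate_fuel_cost`, the parameter values, and the normally distributed noise are assumptions chosen for illustration, not values from the example above.

```python
import random

def simulate_fuel_cost(distances, idle_cost=2.0, cost_per_mile=0.15,
                       noise_sd=0.5, seed=0):
    """Return simulated fuel costs: idle cost + per-mile cost * distance + random error."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same numbers
    return [idle_cost + cost_per_mile * d + rng.gauss(0.0, noise_sd)
            for d in distances]

distances = [10, 20, 30, 40, 50]  # hypothetical miles driven

# With no variability, every point falls exactly on a straight line.
exact = simulate_fuel_cost(distances, noise_sd=0.0)

# With variability, the points scatter around that line.
observed = simulate_fuel_cost(distances, noise_sd=0.5)

for d, e, o in zip(distances, exact, observed):
    print(f"{d:>3} miles  exact={e:.2f}  observed={o:.2f}")
```

Setting `noise_sd=0.0` reproduces the deterministic case, while any positive value gives the probabilistic case that regression analysis is meant for.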

Using the typical symbols, the relationship can be written as Equation 4.

Equation 4: y = β0 + β1x1 + ε

Here, y represents the cost of fuel and x1 represents the distance traveled. β0 is the intercept, which is the cost of fuel without any distance traveled. β1 is the fuel cost per unit distance traveled. Its unit is therefore fuel cost per mile (the unit of y divided by the unit of x), and it is known as the slope in simple linear regression.
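As an illustration of how β0 and β1 might be estimated from data, the sketch below uses ordinary least squares, which is the usual estimation method but is an assumption here because the text does not name one. The data values and the function name `fit_simple_linear_regression` are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1), the least-squares estimates of the intercept and slope."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx               # slope: fuel cost per mile
    b0 = y_mean - b1 * x_mean    # intercept: fuel cost at zero distance
    return b0, b1

# Hypothetical observations: distance traveled (miles) and fuel cost
miles = [10, 20, 30, 40, 50]
cost = [3.6, 4.9, 6.7, 7.8, 9.6]

b0, b1 = fit_simple_linear_regression(miles, cost)
print(f"intercept b0 = {b0:.3f}")  # estimated cost with no distance traveled
print(f"slope     b1 = {b1:.3f}")  # estimated cost per mile
```

For these made-up numbers the estimates come out near b0 = 2.05 and b1 = 0.149, that is, a fixed cost of about 2.05 plus roughly 0.149 per mile.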

Figure 3. Simple Linear Regression Error (Residual)

The error ε is the deviation of an observed value from the value predicted by the regression line. Because simple linear regression assumes a linear relationship, the data points would fall exactly on a straight line if there were no variability in the data. In reality, every process has variability, so the data points lie some distance from the regression line. This deviation is shown by the line marked with the red arrow in Figure 3. In regression analysis, the error estimated from the data is often known as the residual.
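A residual can be computed for each observation as the observed value minus the predicted value. The sketch below assumes hypothetical data and hypothetical coefficient estimates (b0 = 2.05 and b1 = 0.149, which are the least-squares estimates for these five made-up points).

```python
# Hypothetical observations: distance traveled (miles) and fuel cost
miles = [10, 20, 30, 40, 50]
cost = [3.6, 4.9, 6.7, 7.8, 9.6]

# Hypothetical estimates of the intercept and slope for these data
b0, b1 = 2.05, 0.149

# Predicted fuel cost on the regression line for each distance
predicted = [b0 + b1 * x for x in miles]

# Residual = observed value - predicted value
residuals = [y - y_hat for y, y_hat in zip(cost, predicted)]

for x, y, y_hat, e in zip(miles, cost, predicted, residuals):
    print(f"{x:>3} miles  observed={y:.2f}  predicted={y_hat:.2f}  residual={e:+.2f}")

print(f"sum of residuals = {sum(residuals):.6f}")
```

Positive residuals correspond to points above the line and negative residuals to points below it. When the line is fitted by least squares with an intercept, the residuals sum to zero apart from rounding.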

Assumptions for the Errors (or Residuals)

Two important assumptions for the error term (residuals) are:

  1. The error term is normally distributed with zero mean and constant (homogeneous) variance
  2. The error term is uncorrelated with the observation order
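These assumptions are usually checked on the residuals after fitting. The sketch below shows three rough numeric checks on a hypothetical list of residuals: the mean (expected near zero), the Durbin-Watson statistic for correlation with observation order (values near 2 suggest no lag-1 correlation), and an informal comparison of the spread in the first and second halves of the data. The half-split ratio and the helper names are assumptions made for illustration, not standard tests, and normality is not checked here; a normal probability plot is the typical tool for that.

```python
from statistics import mean, variance

def durbin_watson(residuals):
    """Durbin-Watson statistic: ranges from 0 to 4, near 2 when there is no lag-1 correlation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den

def half_variance_ratio(residuals):
    """Sample variance of the second half divided by that of the first half (informal check)."""
    half = len(residuals) // 2
    return variance(residuals[half:]) / variance(residuals[:half])

# Hypothetical residuals, listed in the order the observations were collected
residuals = [0.06, -0.13, 0.18, -0.21, 0.10, 0.02, -0.15, 0.09, 0.11, -0.07]

print(f"mean of residuals:   {mean(residuals):.3f}")
print(f"Durbin-Watson:       {durbin_watson(residuals):.2f}")
print(f"half variance ratio: {half_variance_ratio(residuals):.2f}")
```

A Durbin-Watson value well below 2 points to positive correlation between consecutive residuals, and a value well above 2 points to negative correlation. A variance ratio far from 1 hints that the spread changes over the course of the data.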