Linear regression is a powerful statistical tool used to quantify the relationship between variables in ways that can be used to predict future outcomes. This method of analysis is used in stock forecasting, portfolio management, scientific analysis, and many more applications. Whenever a dataset contains at least two variables, linear regression might be useful.
In this article, we’ll walk through the mathematical underpinnings of building a simple linear model from scratch. We’ll go over when to use a linear model, data requirements for successful linear analysis, how to apply least squares regression techniques, and finally how to predict future values and interpret their validity.
- Understanding the goals and limitations of the simple linear regression model
- Reviewing the assumptions of linear regression
- Preparing the data for a simple linear model by examining covariance and performing correlation analysis
- Understanding the basic mathematical notation and formulas used in linear modeling
- Estimating parameters, minimizing residuals, calculating the correlation coefficient, and generating other statistics
- Performing least squares regression to find the line of best fit for a set of data
- Interpreting results and making predictions
Goals of Linear Regression
Regression analysis is a statistical framework for quantifying the relationship between a dependent variable and one or more independent variables. Regression analysis comes in many forms, including linear, logistic, ridge, and polynomial regression, each suited to datasets with specific characteristics. Generally, these models can be categorized as linear regression, multiple linear regression, and nonlinear regression.
Understanding Variable Relationships
The goal of linear regression is to predict the value of the dependent variable based on the observed value of an independent variable. These relationships are often described in terms of the correlation coefficient (r) and can range from positive to negative, inclusive of a completely uncorrelated relationship.
In the case of simple linear regression, this goal is achieved via modeling the relationship between a dependent variable and a single independent variable.
In the case of multiple linear regression, the relationship between the dependent variable is considered with respect to two or more independent variables. The focus of this article is primarily on simple linear regression.
Linear regression is the simplest and most widely used form of regression analysis. It fits a line of best fit such that the sum of squared distances between the observed values and the values predicted by the line is minimized. The formulae for varying linear regression models are based on the algebraic slope-intercept form. Notation and form vary slightly, but the concept of having a slope and a y-intercept (the predicted value when the predictor variable equals zero) carries over entirely.
This formula generates a line that minimizes the distance between each predicted value of the dependent variable and its actual observed value. The goal of linear regression is to minimize the total squared quantity of these distances across all observations. This is done through the method of least squares and can be calculated in closed form by methods such as ordinary least squares (OLS) or iteratively by machine-learning algorithms such as gradient descent. Before these calculations can be made, however, linear regression requires certain assumptions about the data to be validated.
Assumptions of Linear Regression
For linear regression analysis to be effective, data must abide by a certain number of rules. These rules, which we'll call assumptions, are necessary for linear regression to be accurate. Generally, these are the five primary assumptions of linear regression analysis with respect to the data:
- Linearity
- Homoscedasticity (constant variance)
- Independence (no autocorrelation)
- Normality of residuals
- No Multicollinearity
Before proceeding to implement a linear regression model, one needs to examine the data to ensure these assumptions hold true. If any of these assumptions prove to be false, linear regression will not provide accurate conclusions. Before we talk about applying linear regression, we'll take a moment to discuss each of these assumptions and how to validate them.
Linearity
Linear regression models the linear relationships between variables. As one might imagine, if no such linear relationship exists there is little cause for linear regression analysis. Your computer isn't going to explode from running a linear regression on data without a linear relationship; it will simply produce a model that is underfit. Fortunately, checking for linearity can be done simply enough by any of the following methods (in ascending order of computational complexity):
- Scatterplots: A plot of the data, combined with the infallible eyeball test, is often good enough to make a quick assessment of data linearity (Correll, 2017).
- Partial Residuals: A plot of the partial residuals can help assess the linearity of data as well, but requires a bit more computational effort than just plotting the data (Hastie, 1990). Ultimately, it falls to the eyeball test for validation as well.
- Covariance Analysis (COV): Provides a quantitative description of the direction of linearity among variables. If COV returns a near-zero value, it suggests there is no linear relationship between the variables (Kim, 2018).
- Correlation Analysis (COR): Methods such as Pearson’s Correlation Coefficient can help describe the linearity of data. However, by the time one calculates this a full-on regression analysis has been completed. This calculation is better used to verify final results rather than initial assumptions.
Homoscedasticity (constant variance)
Homoscedasticity describes a situation where the residuals, a measure of the distance between an observed value and the line of best fit, have roughly constant variance across all observed values of the response variable. Data is said to be homoscedastic when these residual values are nearly the same for all observed values of X, and heteroscedastic when the residual values vary significantly. These differences can be assessed via the infallible eyeball test once again as a scatterplot. Several mathematical approaches are available as well:
- Levene’s Test
- Park Test
- Glejser Test
- Breusch-Pagan Test
- White Test
- Goldfeld-Quandt Test
When heteroscedasticity is present ordinary least squares linear regression is not an appropriate method. There are a number of ways to deal with these cases but they can be quite technical in nature. For a deeper look, check out this great post on heteroscedasticity by Keita Miyaki.
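As a rough illustration of the idea behind the Goldfeld-Quandt test, one can order the residuals by the predictor, split them into two halves, and compare the spread of each half. The sketch below does exactly that in plain Python with hypothetical residual values; a real analysis would use a proper implementation with an F-test rather than this simplified ratio.

```python
# Simplified sketch of the Goldfeld-Quandt idea: with residuals ordered by
# the predictor, compare the mean squared residual of the second half to
# that of the first half. A ratio far from 1 hints at heteroscedasticity.
# The residual values below are hypothetical, chosen for illustration.

def variance_ratio(residuals):
    """Ratio of mean squared residuals: second half vs. first half."""
    half = len(residuals) // 2
    first, second = residuals[:half], residuals[-half:]
    ms_first = sum(e ** 2 for e in first) / len(first)
    ms_second = sum(e ** 2 for e in second) / len(second)
    return ms_second / ms_first

# Residual spread grows along the predictor -> ratio well above 1
growing = [1, -1, 1, -1, 5, -5, 5, -5]
print(variance_ratio(growing))  # 25.0
```

A ratio near 1 is consistent with homoscedasticity; the formal test compares this ratio against an F-distribution to decide significance.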
Independence (no autocorrelation)
Data can come in many forms but often can be separated into one of two types: longitudinal or cross-sectional. Longitudinal data reflects multiple measures of a single entity over a period of time; for example, the closing price of a stock recorded every day for a month. Cross-sectional data reflects a single measure in time for multiple entities; for example, the average yearly income reported for a sample population of 100 day-traders.
Longitudinal data has a property described as autocorrelation, a self-correlative relationship among the residuals. Autocorrelation can be checked for by use of the Durbin-Watson statistic, the Breusch-Godfrey test, or the Ljung-Box test. The Durbin-Watson test is the most common and has implementations in many common computing packages such as statsmodels in Python. The Durbin-Watson statistic is scored on a range of 0-4, where autocorrelation is assessed as follows:
- 0 ≤ Score < 2: Positive autocorrelation
- Score == 2: No autocorrelation
- 2 < Score ≤ 4: Negative autocorrelation
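The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below implements that textbook formula in plain Python on hypothetical residual series; in practice one would use the statsmodels implementation.

```python
# From-scratch Durbin-Watson statistic:
#   DW = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
# The residual series below are hypothetical, chosen to show the extremes.

def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

print(durbin_watson([1, -1, 1, -1]))  # 3.0 (sign flips -> negative autocorrelation)
print(durbin_watson([1, 1, -1, -1]))  # 1.0 (runs of same sign -> positive autocorrelation)
```

Residuals that flip sign at every step push the statistic toward 4, while long runs of same-signed residuals push it toward 0, matching the scoring ranges above.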
Independence of observations, characterized by the absence of autocorrelation in the residuals, is a key assumption of linear regression. Do not be fooled though; autocorrelation analysis is a powerful approach for drawing a different range of insights from data, just not via regression analysis.
Normality
Linear regression models seek to minimize the difference between an expected value predicted by the model and the observed value. The difference between these two values is referred to as the error (a.k.a. the residual). How these errors are distributed across a series of observed values can reveal issues with a linear model's ability to accurately predict response values. Testing for normality helps ensure that the residuals are approximately normally distributed, which underpins valid confidence intervals and hypothesis tests on the coefficients. Visual checks such as histograms and Q-Q plots of the residuals, along with formal normality tests, can be used.
Normality testing is not technically a requirement for linear regression but is certainly both common and best practice. It can be skipped when one assumes the model equation is correct and the only purpose is to estimate coefficients and generate predictions with minimal mean squared error. In some cases, transformations applied to correct for normality issues can cause even larger problems, especially in large datasets (Schmidt, 2018).
Issues with normality arise when the distributions of observed values for variables are non-normal or when the assumption of linearity is not true. Also possible are cases where only a few very large errors exist in a dataset, thus skewing the summary statistics, or in cases where multiple subsets of data are present (see bimodal distributions.)
No Multicollinearity
Multicollinearity describes the case where correlative relationships exist between two or more independent variables. Linear regression models the linear relationship between a dependent variable and one or more independent variables. When one or more linear relationships also exist between independent variables, problems such as overfitting can arise. The stronger the relationship, the bigger the issue. Multicollinearity can be categorized as one of two types:
- Structural Multicollinearity: Collinearity resulting from creating independent variables from other predictor variables, such as the square of another variable.
- Data Multicollinearity: Interrelationships among independent variables that are present in the data before any transformations are applied.
Each of these two forms of collinearity can cause issues but, arguably, the case of data multicollinearity is more problematic. Structural collinearity is often easily remedied by removing the derived variables, whereas data collinearity usually requires further analysis and possibly transformations. There are several tests and characteristics of data that can hint at issues of multicollinearity. Below are some common indicators:
- Large changes in regression coefficients with small additions or subtractions of independent variables
- When multiple regression finds little relationship between an independent variable and the response, but simple linear regression finds a significant relationship when that variable is isolated.
- Farrar-Glauber Test
- Condition Number Test
- Re-running regression testing with added noise (perturbation)
In addition to these tests, correlation matrices can be used to both numerically and visually detect multicollinearity. In the illustration above, relationships denoted in either dark burgundy or light shades indicate strong linear correlations among predictor variables.
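As a minimal sketch of this approach, the snippet below builds a pairwise Pearson correlation matrix over a set of hypothetical predictor columns using only the standard library; in practice one would typically use a pandas DataFrame's corr() method and render the result as a heatmap.

```python
# Pairwise Pearson correlation matrix over predictor columns. Strong
# off-diagonal values flag potential multicollinearity. The column x2
# below is deliberately constructed as an exact linear function of x1
# (hypothetical data, for illustration only).
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

predictors = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],  # exactly 2 * x1 -> perfectly collinear
    "x3": [5, 3, 8, 1, 9],
}
matrix = {a: {b: round(pearson(va, vb), 3) for b, vb in predictors.items()}
          for a, va in predictors.items()}
print(matrix["x1"]["x2"])  # 1.0 -> x2 adds no independent information
```

An off-diagonal value at or near ±1, as between x1 and x2 here, is exactly the kind of relationship that would show up as a dark cell in a correlation heatmap.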
Preparing the Model
Simple linear regression seeks to predict the value of the dependent variable given an observed value for an independent variable. As we’ve seen, a number of assumptions must hold true for this to be possible. The first assumption of linearity among our variables can be assessed explicitly by calculating the covariance and the correlation coefficient. These two measures can provide insight into the direction and magnitude of the relationship between variables and are essential concepts of linear regression.
Covariance describes the directional relationship of linearity between two variables. This measure can help estimate where a linear relationship exists but offers no insight into the magnitude of that relationship.
Covariance is calculated by multiplying the differences between each observed value of X and Y and their respective sample means, summing those products, and dividing by the number of observations minus 1: cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / (n − 1). This last step, dividing by n − 1 rather than n, is known as Bessel's Correction and is common among sample statistics.
The correlation coefficient is an extension of utility over covariance and provides a measure of both the direction and magnitude of the relationship between variables. This measure, standardized to a range of -1 to 1, can indicate a positive correlation, negative correlation, or zero correlation between variables. The correlation can be regarded as the covariance between the standardized variables or the ratio of the covariance to the product of the standard deviations of the variables (Chatterjee, 2012).
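These two definitions translate directly into code. Below is a from-scratch sketch of sample covariance (with Bessel's correction) and the correlation coefficient computed as the ratio of covariance to the product of the standard deviations, run on small hypothetical samples.

```python
# Sample covariance with Bessel's correction (divide by n - 1), and the
# correlation coefficient derived from it. X and Y are small hypothetical
# samples chosen for illustration.
from math import sqrt

def covariance(xs, ys):
    """Sample covariance of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Correlation = covariance / (std dev of X * std dev of Y)."""
    sx = sqrt(covariance(xs, xs))  # sample standard deviation of X
    sy = sqrt(covariance(ys, ys))  # sample standard deviation of Y
    return covariance(xs, ys) / (sx * sy)

X = [1, 2, 3]
Y = [2, 4, 6]
print(covariance(X, Y))   # 2.0  (positive direction, unbounded magnitude)
print(correlation(X, Y))  # 1.0  (perfect positive linear relationship)
```

Note how the covariance only reports direction on an arbitrary scale, while the correlation standardizes it to the interval [-1, 1].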
The illustration above shows four sets of data, known as Anscombe's Quartet, where nearly identical summary statistics are calculated for vastly different data sets. Covariance and correlation tests will fail to spot the issues with these datasets, and likely many more out there. Anscombe's Quartet is the canonical warning of why one should always visually inspect data before diving too deeply into creating a linear model for prediction.
Understanding the Model
One can regard linear regression as an extension of correlation analysis. Whereas correlation analysis only describes the relationship between variables in terms of strength and direction, regression is able to numerically describe that relationship. This allows for powerful predictive ability and has led to linear regression being among the most utilized statistical models in scientific study (Hayat, 2017).
Linear regression, be it simple or multiple regression, uses a linear model built atop the classic slope-intercept form y = mx + b. This formula is the cornerstone of linear equations and is introduced in most introductory algebra classes. Here we'll re-introduce this formula with slightly different notation, allowing for flexibility among different linear models:
Before we proceed it’s important to have a basic recognition of how and why the terms of the slope-intercept form have been adapted for use with linear regression models. If for no other reason, this knowledge is important in helping one understand how the linear model’s notation changes based on new features and analytical methods. Let’s take a quick walk through the terms of this equation.
- Y – Representing the dependent variable, not its value. Comparable to the y term in slope-intercept form.
- ß0 – The constant coefficient representing the y-intercept, i.e. the predicted value of Y where an observed value of X equals zero (x = 0). Comparable to the b term in slope-intercept form.
- ß1 – A coefficient representing the slope of the regression line. Comparable to the m term in slope-intercept form.
- X – Representing the independent variable, not its value. Comparable to the x term in slope-intercept form.
- ε – Representing the residual (error) term describing how far away a predicted value is from an observed value. Not represented in slope-intercept form.
These terms represent an estimation of the relationship between variables X and Y such that Y is an approximately linear function of X where ε represents the error in that estimation.
Understanding the Notation
These values are represented differently across many textbooks. In addition, there are some nuances that one should be aware of. Namely, the equation we have discussed so far is the general equation for linear regression modeling. That is, it describes the formula by which each observed value should be analyzed (Pardoe, 2012).
As we begin working through our analysis, the equation's form will change slightly to express which observed values we are currently considering. The capital Y represents all observed values of the dependent variable, whereas lowercase y represents a single observed value for that variable.
These may seem like nuances but they become quite important to understand as we begin to represent Y differently as we progress through the stages of simple linear regression analysis. For example, the formula for a single pair of observed values of X and Y, represented as the ith observation, is yi = ß0 + ß1xi + εi.
As a regression analysis progresses through a series of variables X and Y, this equation is expressed with i replaced by whichever observation is currently being considered. For example, the following illustration shows the series of all observed values from 1 to n, where n represents the last observation:
At this point we can understand how to approach the many observed values for our variables X and Y. Additionally, we can understand how the terms of the linear equation can be compared to the slope-intercept form of linear algebra in a way that a line can be generated through our data points.
Now we need a way to ensure that this line fits our data in a way that affords the greatest possibility of our predicted values being the same as our observed values. In other words, we need to minimize the error between our guesses and observations. There are a number of ways to approach this, but the most common is the Ordinary Least Squares (OLS) method. This method, along with its alternatives, represents the general approach of finding the least-squares regression line (Fox, 2015).
Building the Model
To find the line of best fit for our data we can calculate the slope (m, the ß1 coefficient) and intercept (b, the ß0 coefficient). This will give us the equation of a line that minimizes the distance between our predicted values and our observed values. The collection of resulting distances is referred to as the residuals (a.k.a. errors) and can be used to assess the goodness of fit resulting from our regression.
This slew of new vocabulary represents a lot of moving pieces. Let’s start walking through things step-by-step to get a good idea of how things work. First, let’s consider some sample data:
Before we make our calculations for the line of best fit for these data, let's visualize things. Here we'll create a scatterplot of our observed values and also an initial "best guess" line, "fit" using the mean of our dependent variable's values: y = mean(Y). We'll also plot lines representing the residuals from this best-guess line.
Here we see all our elements represented:
- Line of Best Fit: the black horizontal line, which is currently just our "best guess": y = mean(Y).
- Observed Values: the yellow dots representing the (x, y) pairs of our data, where x is our independent (predictor) variable and y is our dependent (response) variable.
- Residuals: the red lines illustrating the distance between our current y-values and our line of best fit.
- Sum of Squared Errors (SSE): the sum of the squared residual values, providing a non-negative measure of the total error in our model. Simply: the sum of the squared lengths of all the red lines.
Calculating the Error
The goal of regression is to find the equation of the line that will minimize the sum of the squared values of our residuals (the sum of squared errors, or SSE). In other words: the line that is as close as possible to all points. Right now, our "best guess" line results in an SSE of 91.5, which is not too great. Let's see a table of relevant values for these numbers:
| x | y | y − ŷ | (y − ŷ)² |
|---|---|------------------|---------|
| 3 | 1 | 1 − 5.75 = −4.75 | 22.5625 |
| 4 | 3 | 3 − 5.75 = −2.75 | 7.5625 |
| 5 | 4 | 4 − 5.75 = −1.75 | 3.0625 |
| 6 | 4 | 4 − 5.75 = −1.75 | 3.0625 |
| 7 | 5 | 5 − 5.75 = −0.75 | 0.5625 |
| 9 | 8 | 8 − 5.75 = 2.25 | 5.0625 |
| 13 | 9 | 9 − 5.75 = 3.25 | 10.5625 |
| 16 | 12 | 12 − 5.75 = 6.25 | 39.0625 |
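The table above can be reproduced in a few lines of Python, confirming that the mean-based best-guess line yields a total squared error of 91.5:

```python
# Reproducing the residual table: the "best guess" line is the horizontal
# line y = mean(Y), using the article's sample data.
X = [3, 4, 5, 6, 7, 9, 13, 16]
Y = [1, 3, 4, 4, 5, 8, 9, 12]

y_hat = sum(Y) / len(Y)                # 5.75, the mean of Y
residuals = [y - y_hat for y in Y]     # the red lines in the plot
sse = sum(e ** 2 for e in residuals)   # sum of squared errors

print(y_hat)  # 5.75
print(sse)    # 91.5
```

This 91.5 is the baseline any fitted line must beat: the least-squares line will, by construction, produce a smaller sum of squared errors than this flat guess.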
Before we start walking through our approach, there are two important characteristics of the least-squares regression line to note:
- The sum of the residuals (the signed errors) of all values from the line of best fit is zero
- The line of best fit will cross exactly through the point represented by the mean of our x and y values.
Our line of best fit is currently represented by little more than an educated guess. Practically, this has only been useful for initial data visualization and illustrating the concept of residuals and what they represent.
Our next goal is to begin developing the formula for a line that will fit our data better. This process is called parameter estimation and will result in numbers for each of our coefficients ß0 and ß1. This is where we’ll need to do some more math. First, let’s consider a new version of our previous formula:
This formula represents the same relationship between terms as the formulae we have seen before. This form uses the term ŷ (pronounced y-hat) to represent a predicted value of the dependent variable.
The topmost formula represents the general form of the equation, describing the true relationship for all observed values, whereas the bottommost represents the predicted (estimated) value of y for a single observed value of x.
The values represented by the bottom equation are called our fitted values. Let’s do some calculations to better demonstrate how things fit together.
Minimizing the Error
The residuals for our estimated values can be calculated by subtraction from the observed value. In other words, this tells us how far off each estimated value is from the actual observed value. This is represented by the following formula: ei = yi − ŷi.
Estimating the Parameters
With this formula, we can begin to calculate our regression coefficients. The first step is to estimate our coefficient for slope, represented as ß1. We calculate this term first because the ß0 formula relies on the estimate of slope. The following formula provides our estimate of ß1: β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)².
Now that we are able to estimate slope, we can begin formulating an equation for our estimated y-intercept value, represented by ß0. This coefficient is estimated by the following formula: β̂0 = ȳ − β̂1x̄.
With these formulae in hand, we can now come up with estimates that minimize the squared residuals of our least-squares regression line. Keeping in mind our sample data for X and Y from before, the slope estimate works out to β̂1 = 112.75 / 144.875 ≈ 0.7783.
This provides us with the slope of our least-squares line; we're halfway there! The next step is to use the formula for ß0 to estimate the intercept of our least-squares line: β̂0 = ȳ − β̂1x̄ = 5.75 − (0.7783)(7.875) ≈ −0.379.
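Applying the ß1 and ß0 formulas to the sample data in a short script gives the slope and intercept estimates directly:

```python
# Estimating the least-squares coefficients from scratch:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
# using the article's sample data.
X = [3, 4, 5, 6, 7, 9, 13, 16]
Y = [1, 3, 4, 4, 5, 8, 9, 12]

x_bar = sum(X) / len(X)  # 7.875
y_bar = sum(Y) / len(Y)  # 5.75

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # 112.75
s_xx = sum((x - x_bar) ** 2 for x in X)                      # 144.875

b1 = s_xy / s_xx         # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate

print(round(b1, 3))  # 0.778
print(round(b0, 3))  # -0.379
```

Note that the intercept is computed from the unrounded slope; plugging a pre-rounded slope into the ß0 formula would shift the intercept slightly.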
Line of Better Fit
With our two estimators in hand, we can start applying our formula to our data to make a series of predicted values. To accomplish this, we use the ŷi = β̂0 + β̂1xi formula from before to estimate a response value (y) for each observed value of our independent variable (x). By applying this formula we now have an estimated line of least squares that actually minimizes the error:
This chart represents the calculated line of best fit, not a general best guess as before. This line represents the least-squares regression line such that the sum of the squared errors between our observed values and predicted values is minimized. The equation for this line is ŷ = −0.379 + 0.778x and can be used to predict the response for future values of x.
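To sanity-check the fitted line, the sketch below recomputes the coefficients, generates a prediction for each observed x, and compares the resulting sum of squared errors against the mean-baseline from earlier:

```python
# Comparing the fitted least-squares line against the mean-baseline on
# the article's sample data: the fitted line should cut the total squared
# error dramatically.
X = [3, 4, 5, 6, 7, 9, 13, 16]
Y = [1, 3, 4, 4, 5, 8, 9, 12]

x_bar, y_bar = sum(X) / len(X), sum(Y) / len(Y)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
      / sum((x - x_bar) ** 2 for x in X))
b0 = y_bar - b1 * x_bar

predictions = [b0 + b1 * x for x in X]                          # fitted values
sse_fit = sum((y - p) ** 2 for y, p in zip(Y, predictions))     # fitted-line error
sse_baseline = sum((y - y_bar) ** 2 for y in Y)                 # flat-line error

print(round(sse_baseline, 2))  # 91.5
print(round(sse_fit, 2))       # 3.75
print(round(1 - sse_fit / sse_baseline, 3))  # 0.959, the coefficient of determination
```

The drop from 91.5 to roughly 3.75 is exactly what "least squares" promises, and the final ratio is the coefficient of determination (r²): the fraction of the variation in Y explained by the line.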
We have covered a lot of ground in this article on simple linear regression. The formulae presented here have been developed through centuries of statistical study—linear regression is that old, yes (Stanton, 2001). The concepts here are fundamental to correlative statistical analysis and are ever-present among traditional statistical methods and more general regression analysis techniques.
This formula and the concept of linear regression have become innate among many modern machine learning pipelines as well. Iterative modeling such as gradient descent has offered efficient calculation for higher-dimensional datasets and can be applied to make powerful predictions, regularize data, and engineer features to reduce collinearity. The concepts covered here are at the underpinnings of all these modern applications.
Check out our recent article to see how a simple linear regression model can be applied to predict stock prices. This type of predictive analysis can be made unbelievably quickly using modern tools like scikit-learn and Python. Certainly, one isn't required to muck around with the formulae and manual calculations we have covered in this article.
For a better idea of how linear regression integrates into modern data science check out our list of the best books on machine learning. These books offer detailed accounts of how to apply many different algorithms—including simple linear regression—to a wide range of projects.
- Correll, Michael. Heer, Jeffrey. “Regression by Eye: Estimating Trends in Bivariate Visualizations.” ACM Human Factors in Computing Systems (CHI), 2017. Available Here: //idl.cs.washington.edu/papers/regression-by-eye
- Hastie, Trevor J., and Robert J. Tibshirani. Generalized additive models. Vol. 43. CRC press, 1990.
- Kim, Hae-Young. “Statistical notes for clinical researchers: covariance and correlation.” Restorative dentistry & endodontics vol. 43,1 e4. 5 Jan. 2018, doi:10.5395/rde.2018.43.e4
- Upton, Graham, and Ian Cook. A Dictionary of Statistics 3e. Oxford, United Kingdom, Oxford University Press, 2014.
- Schmidt, Amand F, and Chris Finan. “Linear regression and the normality assumption.” Journal of clinical epidemiology vol. 98 (2018): 146-151. doi:10.1016/j.jclinepi.2017.12.006
- Chatterjee, Samprit, and Ali S. Hadi. Regression Analysis by Example. 5th ed., Wiley, 2012.
- Hayat, Matthew J et al. “Statistical methods used in the public health literature and implications for training of public health professionals.” PloS one vol. 12,6 e0179032. 7 Jun. 2017, doi:10.1371/journal.pone.0179032
- Pardoe, Iain. Applied Regression Modeling. 2nd ed., Wiley, 2012.
- Fox, John. Applied Regression Analysis and Generalized Linear Models. 3rd ed., SAGE Publications, 2015.
- Stanton, Jeffrey M. “Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors.” Journal of Statistics Education, vol. 9, no. 3, 2001. Crossref, doi:10.1080/10691898.2001.11910537.