Correlation Analysis: Quantifying Linear Relationships Between Features


Correlation analysis quantifies the linear relationship between variables. It is integral to many modern statistical methods such as regression analysis, and it helps develop more robust machine learning algorithms and more efficient data processing pipelines.

Simple calculations like covariance can provide basic insights, whereas multivariate correlation matrices identify linear relationships across complex collections of features. In this article, you will learn the basics of correlation analysis, its core components and approaches, and how it is applied to analyze financial assets. Finally, we will take a look at some edge cases where correlation analysis might steer you wrong!

Highlights

  • Covariance is used in correlation analysis to describe the direction of correlative relationships between variables.
  • Correlation coefficients describe the magnitude of correlative relationships between variables.
  • Correlation analysis generally results in a standardized value on a scale of -1 to 1, where -1 is a perfect negative correlation, 0 is no correlation, and 1 is a perfect positive correlation.
  • Subtle quirks in data can result in inaccurate conclusions drawn from correlation techniques.
  • Correlation analysis is utilized in machine learning to find better features, reduce computational complexity, and select the most appropriate algorithm.
  • Correlation coefficients are integral to regression analysis.

Introduction

Correlation is a term used to describe the relationship between two variables where a change in the value of one is accompanied by a change in the value of the other. A common example of correlated values is the measure of a person’s height and the height of their parents: taller parents tend to have taller children.

The terms correlation and dependence are closely related and often used interchangeably, though correlation strictly describes linear dependence. It is important to note that correlation does not imply causation and, in many cases, the causal influence on an observed value may lie in another, possibly unknown, variable.

Simply put, correlation is present when a change in variable X is associated with a change in variable Y. Correlation is assumed to describe a linear relationship between variables, where a unit change in one variable is reflected by a proportional change in the other.

Correlation Analysis

Correlation analysis is used to establish whether a correlative relationship exists between two variables and, if so, the nature of that relationship (positive or negative) and its strength.

Correlation analysis is used to measure the relationship between two variables, typically an independent and a dependent variable. The term multicollinearity describes cases where two independent variables are correlated with each other, a case illustrated in the sketch below.
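Multicollinearity is easy to spot with a correlation matrix. Below is a minimal sketch, using simulated data and invented column names (not examples from the original article), of how one might flag a near-redundant pair of predictors with Pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Simulated housing features; "rooms" is deliberately near-redundant with "sqft"
sqft = rng.normal(1500, 300, size=100)
df = pd.DataFrame({
    "sqft": sqft,
    "rooms": sqft / 250 + rng.normal(0, 0.5, size=100),
    "age": rng.normal(30, 10, size=100),
})

# Off-diagonal values near +/-1 between two predictors flag multicollinearity
print(df.corr())
```

Here the sqft/rooms entry lands near 1, signaling that the two predictors carry nearly identical information.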

Correlation Coefficients

The strength of the relationship between two variables is expressed as a standardized value on the interval -1 to 1. This measure is referred to as the correlation coefficient and is commonly represented by the letter “r”. Below are descriptions of the extreme points found on this standardized interval.

  • 1 (Positive): When variable X moves, variable Y moves in the same direction.
  • -1 (Negative): When variable X moves, variable Y moves in the opposite direction.
  • 0 (No Correlation): When variable X moves, variable Y doesn’t move at all or moves unpredictably.

Below are some visual examples of randomly generated data illustrating the positive, negative, and zero relationships between X and Y variables described above.
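Such data is straightforward to simulate. The sketch below is an assumption about how plots like these could be produced (not the article’s original plotting code): it draws samples from bivariate normal distributions whose covariance matrices target strongly positive, strongly negative, and zero correlation.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Covariance matrices targeting r near +0.9, -0.9, and 0
cases = {
    "Positive": [[1.0, 0.9], [0.9, 1.0]],
    "Negative": [[1.0, -0.9], [-0.9, 1.0]],
    "Zero": [[1.0, 0.0], [0.0, 1.0]],
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, cov) in zip(axes, cases.items()):
    # Sample 250 (x, y) pairs with the prescribed covariance structure
    x, y = rng.multivariate_normal([0, 0], cov, size=250).T
    ax.scatter(x, y, s=10)
    ax.set_title(f"{title}: r = {np.corrcoef(x, y)[0, 1]:.2f}")
plt.tight_layout()
plt.show()
```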

Positive Correlation

positive correlation scatterplot
The relationship between the X and Y variables reflects a strong positive relationship.

This scatterplot shows a strong positive linear relationship between variables X and Y such that an increase in the observed value for X (the independent variable) is reflected by a strong increase in the observed value for Y (the dependent variable).

Negative Correlation

negative correlation coefficient scatterplot
The relationship between the X and Y variables reflects a strong negative relationship.

This scatterplot shows a strong negative linear relationship between variables X and Y such that an increase in the observed value for X (the feature) is reflected by a strong decrease in the observed value for Y (the label).

Zero Correlation

zero correlation coefficient scatterplot
The relationship between the X and Y variables reflects no correlation in values at all.

This scatterplot shows no linear relationship between variables X and Y such that an increase in the observed value for X (the predictor variable) does not reflect a predictable change in the value of Y (the response variable).

Note: The X and Y variables are named differently in each of the above illustrations. This variation is only to illustrate the many different names used for the same variables and does not reflect a necessity of terminology based on correlation coefficient value. For example, the term “predictor” used to describe the X values in the zero correlation example could also be used to describe the X values in the positive and negative plots.

Calculating Correlation Coefficients

Correlation coefficients provide ample information to describe the relationship between two series of observations. As such, they are a foundational component of many machine learning tasks such as feature selection (Devaraj, 2015; Chicco, 2020). The correlation coefficient formula for a sample population is as follows:

$$ r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y} $$

The formula for the correlation coefficient of a sample population, where x̄ and ȳ are the sample means and s_x, s_y are the sample standard deviations.

The topmost part of this formula (the numerator) is the covariance between the two variables: a mathematical representation of the direction of correlation. Calculating the covariance is the first step in calculating the correlation coefficient. Let’s take a closer look.

Covariance

covariance quadrants 1
Plotting the sample means for variables X and Y illustrates the sign of covariance to help describe the linearity of their relationship.

Covariance is used to describe the direction of the relationship between two variables. That is, it can be used to predict how an observed value for variable X will move in response to a change in observed value for variable Y. A positive covariance indicates that an increase in observed value for variable X will be reflected by an increase in observed value for variable Y. Conversely, when a negative covariance exists, an increase in X will be reflected by a decrease in Y.
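For reference, the sample covariance referenced throughout this section is calculated as follows, where x̄ and ȳ are the sample means and n is the number of paired observations:

$$ \mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

A positive sum of those products of deviations yields a positive covariance; a negative sum yields a negative one.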

The covariance has several notable characteristics one should always be mindful of:

  1. It only describes the relationship between two variables.
  2. It applies to series data, such that many observed values are available for each variable.
  3. The number of observed values for each variable must be identical.
  4. Covariance can express a positive or negative relationship between variables.
  5. Covariance only describes directional relationships and not measures of magnitude, which limits its broader applications.

Covariance is used on its own in the management of financial assets to control risk, can help forecast stock prices, and can be extended to multivariate data for tasks such as feature extraction and dimensionality reduction. Read this article to learn more about calculating covariance, its applications, and various programmatic implementations in Python with libraries such as NumPy and Pandas. A quick sketch of the calculation follows below.
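As a minimal sketch of that calculation (the sample values here are made up for illustration), the manual formula, NumPy, and Pandas all agree:

```python
import numpy as np
import pandas as pd

x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.2, 14.1, 16.5])

# Manual sample covariance: sum of products of deviations over (n - 1)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# NumPy's covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_numpy = np.cov(x, y)[0, 1]

# Pandas equivalent
cov_pandas = pd.Series(x).cov(pd.Series(y))

print(cov_manual, cov_numpy, cov_pandas)  # all three values match
```

With the covariance in hand, let’s continue on to the correlation coefficient.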

From Covariance to Correlation Coefficients

Covariance doesn’t tell us anything about the strength (magnitude) of the relationship between variables. Knowing one stock price will rise in response to a rise in another is hardly useful information unless one can estimate how much the price of the second stock can be expected to move. This is where correlation coefficients come in.

To avoid this pitfall, we can use standardization to relate our variables in a way that allows for calculating direction and magnitude. This is done by first calculating the sample standard deviation for both X and Y variables using the following formula:

$$ s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} $$

The formula for calculating the standard deviation of a sample population.

The standard deviation measures how far each observed value falls from that variable’s sample mean. Squaring those distances, summing them, dividing by the sample size (minus one), and taking the square root yields a measure in the variable’s own units. Dividing each variable by its standard deviation then fits both onto the same scale. Kind of like converting between miles and kilometers.
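In code, the only catch is that NumPy defaults to the population formula; passing ddof=1 yields the sample version shown above. A small sketch with made-up values:

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])

# Manual sample standard deviation (note the n - 1 denominator)
s_manual = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))

# NumPy defaults to the population formula (ddof=0); ddof=1 gives the sample form
s_numpy = np.std(x, ddof=1)

print(s_manual, s_numpy)  # identical values
```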

Once we have the sample standard deviation for each variable, we can re-write our initial formula for the correlation coefficient more concisely as such:

$$ \mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{s_x \, s_y} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y} $$

A comparison of the simplified form of the correlation coefficient formula to the more expansive form.

This standardization process provides us with a value in the range of -1 ≤ Cor(X, Y) ≤ 1 where both direction and magnitude are represented. Correlation coefficient values below 0 indicate a negative relationship while values above 0 indicate a positive relationship. It’s important to note that a correlation coefficient of 0 does not mean that two variables are unrelated. A value of Cor(X, Y) = 0 only guarantees that variables X and Y do not have a linear relationship.
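Putting the pieces together, here is a minimal sketch (again with made-up sample values) showing that the from-scratch calculation matches NumPy and SciPy:

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.2, 14.1, 16.5])

# From scratch: Cov(X, Y) / (s_x * s_y)
cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r_manual = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Library equivalents
r_numpy = np.corrcoef(x, y)[0, 1]
r_scipy, p_value = stats.pearsonr(x, y)

print(r_manual, r_numpy, r_scipy)  # all agree, bounded by [-1, 1]
```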

Edge Cases

The correlation coefficient, along with other descriptive statistics, should never be regarded as an error-free description of data. It considers only linear relationships between variables and will fail to capture nonlinear relationships that may be obvious when viewed visually.

To illustrate this, statistician Francis Anscombe constructed a series of four datasets in the early 1970s that had near-identical summary statistics but were very different in nature. These are charted below:

anscombes quarter graphed
Anscombe’s four data sets all possess near-identical summary statistics yet have very different relationships between variables.

The first dataset illustrates a common linear relationship; the second illustrates a nonlinear relationship for which a correlation coefficient provides a subtly incorrect measure; the third shows a perfectly linear data set whose regression line is skewed by a single outlier; the fourth illustrates how a single outlier can indicate a linear relationship where none exists (at least not practically).

Note: Sample data for Anscombe’s Quartet shown above is available via GitHub.
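The quartet makes for a quick demonstration in code. Using the values as published in Anscombe (1973), every dataset produces a nearly identical Pearson coefficient despite the wildly different shapes seen above:

```python
import numpy as np

# Anscombe's quartet; datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    print(f"Dataset {name}: r = {np.corrcoef(x, y)[0, 1]:.3f}")  # ~0.816 each
```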

Applications

The correlation coefficient provides a backbone for broader correlative analysis. This may be applied via traditional statistical means or modern approaches such as machine learning where correlation analysis can help extract meaningful features and improve model efficiency. Below are some common applications of correlation analysis that leverage the correlation coefficient:

  • Scheduling Analysis
  • Portfolio Analysis
  • Asset Price Forecasting
  • Parallelization Engineering (kind of like scheduling)
  • Cost Estimation

What are the types of Correlation Analysis?

Correlation analysis is applied to a diverse range of analytical subjects that require adjustment for context among data such as units, bias, and other assumptions. The differences in data, sometimes subtle and sometimes not, have produced a number of different approaches for calculating the correlation coefficient over the years. Below is a summary of the most common.

  • Pearson Correlation: measures the strength and direction of the linear relationship between two continuous variables.
  • Kendall Rank Correlation: a non-parametric measure of ordinal association based on the number of concordant and discordant pairs of observations.
  • Spearman Correlation: a non-parametric measure of the monotonic relationship between two variables, calculated from ranked values.
  • Point-Biserial Correlation: measures the relationship between a continuous variable and a dichotomous (binary) variable.

For a more exhaustive discussion of the many different approaches for calculating the correlation coefficient, read this blog post. For our discussion, just keep in mind that the Pearson correlation coefficient is the most common method and the one reflected in the calculations presented in this article.
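Pandas and SciPy expose these methods directly. The sketch below uses simulated data (not examples from this article) to show how Spearman and Kendall capture a monotonic-but-nonlinear relationship that Pearson understates, plus a point-biserial example:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x ** 3 + rng.normal(scale=0.1, size=100)  # monotonic but nonlinear
s_x, s_y = pd.Series(x), pd.Series(y)

print(s_x.corr(s_y, method="pearson"))   # understates the relationship
print(s_x.corr(s_y, method="spearman"))  # near 1: captures monotonicity
print(s_x.corr(s_y, method="kendall"))

# Point-biserial: a continuous variable against a binary variable
binary = (x > 0).astype(int)
print(stats.pointbiserialr(binary, y))
```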

Final Thoughts

The correlation coefficient provides meaningful insight into the relationship between variables. Estimates of both strength and direction can be applied to forecast stock prices, reduce dimensionality, and engineer features. The correlation coefficient serves as a foundation for many machine learning models, regression models, and other statistical modeling tools.

This article has presented only the basics of how this value is calculated and hinted at its applications. Check out this presentation by NASA for a deeper, more technical discussion of correlation coefficients and their role in correlation analysis. For additional insight, check out some of the most popular books on machine learning and artificial intelligence.

References

  1. Devaraj, Senthilkumar, and S. Paulraj. “An Efficient Feature Subset Selection Algorithm for Classification of Multidimensional Dataset.” The Scientific World Journal vol. 2015 (2015): 821798. doi:10.1155/2015/821798
  2. Chicco, Davide, and Giuseppe Jurman. “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation.” BMC Genomics 21.1 (2020): 1-13.
  3. Mukaka, M. M. “Statistics corner: A guide to appropriate use of correlation coefficient in medical research.” Malawi Medical Journal vol. 24,3 (2012): 69-71.