Reshaping Data for Linear Regression With Pandas, NumPy, and Scikit-Learn


Pandas, NumPy, and Scikit-Learn are three Python libraries commonly used for linear regression. Scikit-learn’s LinearRegression class can be instantiated, trained, and applied in a few lines of code.

Depending on how data is loaded, accessed, and passed around, its shape can cause errors when a model is trained. These errors can be addressed with one of several approaches to reshaping data before training a linear model.

Introduction: The Problem

One issue arises when linear regression is being done on data with a single feature. Such data is often represented as a list of values (a 1-dimensional array, in most cases). The LinearRegression model doesn’t know whether this is a series of observed values for a single feature or a single observed value for multiple features. Let’s try to visualize the issue:

Figure: Regression models need to know whether a list of values is a series of observations for a single variable or a single observation for a series of features.

Here we can see that a single collection of values can be interpreted in one of two ways:

  • A series of observed values for a single feature
  • A single observed value for a series of features

These represent very different aspects of data. In the case of single-feature regression analysis, Scikit-learn’s LinearRegression class needs to be explicitly told that a series of data represents a series of observed values for a single feature and not the other way around. Fortunately, this can be done fairly easily in one of several ways. Before we get into how to solve this issue, let’s first consider how it might arise.
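As a quick illustration of the two interpretations, the sketch below takes the same five values and lays them out both ways with NumPy. The variable names are made up for this example.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# Five observations of one feature: 5 rows, 1 column
as_single_feature = values.reshape(-1, 1)

# One observation of five features: 1 row, 5 columns
as_single_sample = values.reshape(1, -1)

print(values.shape)             # (5,)
print(as_single_feature.shape)  # (5, 1)
print(as_single_sample.shape)   # (1, 5)
```

The values are identical in all three arrays; only the shape tells a model which reading is intended.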

Note: This post is about a nuanced aspect of data preparation for linear regression. Check out the article Simple Linear Regression for a broader discussion or the article Predicting Stock Prices with Linear Regression in Python for an applied tutorial.

Pandas DataFrames, Series, and NumPy Arrays

Pandas commonly represents data in one of two ways: DataFrame objects or Series objects. Without diving too deeply, DataFrames are like spreadsheets: they represent rows and columns of data. DataFrame objects can have many rows and many columns. Consider the following illustration:

Figure: Ensuring data has an index value indicating structure informs our linear model that the data represent a series of observed values for a single variable.

Series objects are like a single column from a spreadsheet: they can have many rows but only a single column. DataFrames are essentially a collection of Series objects that share an index assigned by Pandas. Under the hood, the data are represented as NumPy array objects. That’ll be important to know in just a minute.
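The difference between the two objects can be inspected directly. This is a small sketch using a made-up two-column DataFrame; the column names "a" and "b" are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
column = df["a"]  # selecting one column returns a Series

print(df.ndim, df.shape)          # 2 (3, 2)
print(column.ndim, column.shape)  # 1 (3,)

# Both expose their underlying data as NumPy arrays
frame_values = df.to_numpy()
column_values = column.to_numpy()
print(type(frame_values).__name__, frame_values.shape)    # ndarray (3, 2)
print(type(column_values).__name__, column_values.shape)  # ndarray (3,)
```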

Scikit-Learn & LinearRegression

The scikit-learn library is a powerful set of tools for machine learning in Python. Among its many utilities is the LinearRegression class. This class makes developing a linear model, training it, and using it to make predictions extremely simple.

In cases where a single-feature regression is done (simple linear regression), the LinearRegression class needs to be told that the data is a series of observed values for a single variable. Otherwise, the following error message is likely to be raised:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

This error results when one calls the LinearRegression class’s fit() method with 1-dimensional feature data. It can arise for any number of reasons depending on one’s data workflow. If, for example, one is using Pandas DataFrame objects, it’s generally an issue of indexing syntax when extracting one column from several. When using NumPy arrays, it’s generally an issue of the array having one dimension instead of two. Let’s look at some examples for each.
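To read the complete message rather than only its final sentence, the exception can be caught and printed. The sketch below assumes a reasonably recent scikit-learn release, where this failure surfaces as a ValueError; the `message` variable is just a name chosen for this example.

```python
from sklearn.linear_model import LinearRegression

data = [1, 2, 3, 4, 5]
regr = LinearRegression()

try:
    # A flat list is 1-dimensional, so fit() rejects it
    regr.fit(data, data)
    message = None
except ValueError as err:
    message = str(err)
    print(message)
```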

Note: The following examples are using the same 5-item collection of integers for values of both the independent and dependent variables. This is absolute nonsense and is not intended to represent a valid approach for training a regression model.

Native Python Lists

For the first example let’s consider a workflow using a Python list as our starting point. From there, we’ll pass that as an argument for both the independent and dependent variables to the LinearRegression class.

from sklearn.linear_model import LinearRegression

# Make up some data
data = [1, 2, 3, 4, 5]

# Instantiate new Regression model
regr = LinearRegression()

# Train the model
regr.fit(data, data)  # error here

# Result
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

We’re attempting to train the model using a 1-dimensional array. Again, this is ambiguous: we aren’t specifying whether this is a series of observed values for a single feature or a single observed value for multiple features. To fix this, we can wrap each value in its own list so that every observation becomes a row with a single column:

from sklearn.linear_model import LinearRegression

# Wrap each value in its own list: one row per observation
X = [[x] for x in data]

# View new collection
print(X)

# Result - a 2D array (5 rows, 1 column)
[[1], [2], [3], [4], [5]]

# Instantiate new Regression model
regr = LinearRegression()

# Train the model
regr.fit(X, data)

# no error

Nesting each value this way lets the LinearRegression model know our data is a series of observations for a single feature. Pairing each value with its position, as in [[0, 1], [1, 2], ...], would also avoid the error, but the model would then treat the position as a second feature. The nested list is the equivalent of following scikit-learn’s error message “Reshape your data either using array.reshape(-1, 1) if your data has a single feature.” Now that we have an idea of what’s going on, let’s take a look at how NumPy handles this.
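The same layout applies when making predictions: new inputs also need one row per observation. The following is a minimal sketch with the same made-up values; `values`, `X`, `y`, and `predictions` are names chosen for this example.

```python
from sklearn.linear_model import LinearRegression

values = [1, 2, 3, 4, 5]
X = [[v] for v in values]  # 5 rows, 1 column
y = values                 # the target can stay a flat list

regr = LinearRegression().fit(X, y)

# New inputs follow the same layout: one inner list per observation
predictions = regr.predict([[6], [7]])
print(predictions)
```

Because the inputs and targets here are identical, the fitted line is the identity, so the predictions land at (approximately) 6 and 7.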

NumPy Arrays

NumPy arrays are how Pandas represents data at the lower levels of its DataFrame and Series APIs. Knowing how to fix this issue in NumPy generally offers the assurance of being able to handle it in Pandas. At the very least—it makes it easier to interpret the error messages! Let’s re-create our example of Python lists using a NumPy array instead of a Python list.

import numpy as np
from sklearn.linear_model import LinearRegression

# Create our data as an array object
data = np.array([1, 2, 3, 4, 5])

# Examine the data
print(data)
print(data.shape)

# Result
[1 2 3 4 5]
(5,)

# Try to train the model
regr = LinearRegression()
regr.fit(data, data)

# Result
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

We can see that we have a 1-dimensional array of data, showing our elements as expected. The second printout, (5,), is a little mysterious. Without diving too deeply into NumPy data structures, this says the array has a single axis of 5 elements, with no second axis to distinguish rows from columns. For a better idea of how NumPy represents arrays, indexes them, and all the implications, I suggest checking out this incredible post on StackOverflow.

Regardless of our understanding of NumPy arrays at this point, it’s clear this approach does not work. The error message seems a bit more relevant now that we’re aware we’re using NumPy array objects, which are what scikit-learn’s error message refers to with “array.” That means we can use the ndarray.reshape() method here to fix our data as suggested:

import numpy as np
from sklearn.linear_model import LinearRegression

# Create data the same way
data = np.array([1, 2, 3, 4, 5])
print(data.shape)
(5,)

# Reshape into a single column, discarding the result
data.reshape(-1, 1)
print(data.shape)
(5,)  # unchanged: reshape() returned a new array that we did not keep

# Reshape into a single column
# and assign the result back to data
data = data.reshape(-1, 1)
print(data.shape)
(5, 1)

# Train the model again
regr = LinearRegression()
regr.fit(data, data)

The ndarray.reshape() method uses the value of -1 to mean “infer this dimension from the number of elements.” With reshape(-1, 1) we ask for exactly one column and let NumPy work out that 5 rows are needed. It’s also important to note that the reshape() method returns a new array object (a view of the same data where possible) and does not change the shape of the existing object.

Without assigning that to a new variable (or replacing the existing object in our case) the data will retain the same shape. We can now successfully train our model without error. Given our understanding of the NumPy array data structure, we can now understand how to approach the issue when using Pandas DataFrames.
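The behavior of -1 and of the return value can be checked with a short sketch. It uses six values instead of five so that a two-column reshape is also possible; the variable names are invented for this example.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])

column = data.reshape(-1, 1)  # -1 is inferred as 6 rows
pairs = data.reshape(-1, 2)   # -1 is inferred as 3 rows

print(data.shape)    # (6,)  the original keeps its shape
print(column.shape)  # (6, 1)
print(pairs.shape)   # (3, 2)

# For a simple array like this one, the result shares memory with the original
print(np.shares_memory(data, column))  # True

# Indexing with np.newaxis is another way to get the same column layout
also_column = data[:, np.newaxis]
print(also_column.shape)  # (6, 1)
```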

Pandas DataFrames

Pandas DataFrames are a collection of Pandas Series objects. The DataFrame is a 2-dimensional object (rows by columns, with one column per Series contained) and the Series is a 1-dimensional object.

A DataFrame containing a single Series object is still a 2-dimensional object with a shape of (n, 1), where n is the number of rows. A Series object on its own has a shape of (n,).

A Series can look like a 2D array when one prints a representation to stdout, since the index labels are shown next to the values. The index is only a set of row labels, however, and not a dimension of the data, so scikit-learn’s LinearRegression class will regard a Series as a 1D array. This is where things get a little unclear, so let’s consider some examples:

import pandas as pd

# Create a dataframe with our values labeled as "observed_values"
df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]})
print(df)
print(type(df))
print(df.shape)
print(df.index)

# The Dataframe
   observed_values
0                1
1                2
2                3
3                4
4                5

# The type
<class 'pandas.core.frame.DataFrame'>

# Its dimensions
(5, 1)

# Its index
RangeIndex(start=0, stop=5, step=1)

If we toss our DataFrame object into the LinearRegression.fit() method we’ll not get any errors. This is because a DataFrame is 2-dimensional even when it holds a single column (evident from the df.shape output of (5, 1)).
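A minimal sketch of that call is shown here. As an assumption for illustration, the target is passed as a Series rather than a DataFrame, since only the feature argument needs to be 2-dimensional.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]})

regr = LinearRegression()

# Features: a (5, 1) DataFrame. Target: a (5,) Series.
regr.fit(df, df['observed_values'])

print(regr.coef_.shape)  # (1,) one coefficient for the one feature
```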

Consider the case where there may be multiple columns present in a dataset and only certain columns are being extracted for regression analysis. Extracting these values as a Series is where things can go awry. Consider the following:

import pandas as pd

# Create our dataframe object
df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]})

# Extract a single column for use
data = df['observed_values']

# Print some info
print(data)
print(type(data))
print(data.shape)
print(data.index)

# The representation
0    1
1    2
2    3
3    4
4    5
Name: observed_values, dtype: int64

# Our object type
<class 'pandas.core.series.Series'>

# Our object shape
(5,)

# Our object Index
RangeIndex(start=0, stop=5, step=1)

There are several things to note here:

  1. Our object is now a Pandas Series object, not a DataFrame
  2. The shape is 1-dimensional, of size 5
  3. We still have a valid index, but the index was never part of the shape; what we lost is the column dimension.

If we try to train our model as regr.fit(data, data) again we’ll run into the same error message as before, indicating our data is of an ambiguous shape. There are two approaches to remedy this situation: a proactive approach and a reactive approach. Let’s consider both:

# Approach One - Extract Series as DataFrame
data = df[['observed_values']]

# Print Summaries
print(data)
print(type(data))
print(data.shape)
print(data.index)

# result
   observed_values
0                1
1                2
2                3
3                4
4                5
<class 'pandas.core.frame.DataFrame'>
(5, 1)
RangeIndex(start=0, stop=5, step=1)

# Approach Two - use the ndarray.reshape() method
data = df['observed_values'].values.reshape(-1, 1)

# print summary 
print(data)
print(type(data))
print(data.shape)
# print(data.index) - note: a NumPy array has no index attribute

# Result
[[1]
 [2]
 [3]
 [4]
 [5]]
<class 'numpy.ndarray'>
(5, 1)

We can see both methods will properly reshape our data such that scikit-learn’s LinearRegression model will recognize it as being a series of observed values for a single feature.
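As a rough check that these layouts are interchangeable, the sketch below fits a model on each and compares the coefficients. It also includes Series.to_frame(), which is not covered above but is another Pandas way to turn a Series back into a single-column DataFrame. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]})
y = df['observed_values']  # a 1-D target is fine

# Three ways to get a (5, 1) feature matrix from one column
X_brackets = df[['observed_values']]                       # double brackets
X_to_frame = df['observed_values'].to_frame()              # Series -> DataFrame
X_array = df['observed_values'].to_numpy().reshape(-1, 1)  # NumPy reshape

coefs = []
for X in (X_brackets, X_to_frame, X_array):
    print(type(X).__name__, X.shape)
    coefs.append(LinearRegression().fit(X, y).coef_)

# All three fits should agree
print(np.allclose(coefs[0], coefs[1]) and np.allclose(coefs[0], coefs[2]))
```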

Final Thoughts

This article was motivated by my continued lack of familiarity with the DataFrame API, the implications of its indexing syntaxes, and its widespread integration among Python’s most popular data science packages.

The use of double bracket vs. single bracket notation in Pandas, resulting in either a DataFrame object or a Series object respectively, was a real point of confusion for me. The official documentation for indexing and accessing data with Pandas is helpful; I just continually forgot to RTFM.

Zack West
Entrepreneur, programmer, designer, and lifelong learner. Can be found taking notes from Mother Nature when not hammering away at the keyboard.