Pandas, NumPy, and Scikit-Learn are three Python libraries used for linear regression. Scitkit-learn’s LinearRegression class is able to easily instantiate, be trained, and be applied in a few lines of code.
Depending on how data is loaded, accessed, and passed around, there can be some issues that will cause errors. These errors can be addressed in one of several approaches to reshaping data before training a linear model.
Introduction: The Problem
One issue arises when linear regression is being done on data with a single feature. Such data is often represented as a list of values (a 1-dimensional array, in most cases.) The LinearRegression
model doesn’t know if this is a series of observed values for a single feature or a single observed value for multiple features. Let’s try to visualize the issue:

Here we can see that a single collection of values can be interpreted in one of two ways:
- A series of observed values for a single feature
- A single observed value for a series of features
These represent very different aspects of data. In the case of single-feature regression analysis, Scikit-learn’s LinearRegresion
class needs to be explicitly told that a series of data represents a series of observed values for a single feature and not the other way around. Fortunately, this can be done fairly easily in one of several ways. Before we get into how to solve this issue let’s consider first how it might arise.
Note: This post is about a nuanced aspect of data preparation for linear regression. Check out the article Simple Linear Regression for a broader discussion or the article Predicting Stock Prices with Linear Regression in Python for an applied tutorial.
Pandas DataFrames, Series, and NumPy Arrays
Pandas commonly represent data in one of two ways: DataFrame
objects or Series
objects. Without diving too deeply; DataFrames
are like spreadsheets—they represent rows and columns of data. DataFrame
objects can have many rows and many columns. Consider the following illustration:

Series
objects are like a single column from spreadsheets—they can have many rows but only a single column. DataFrames
are essentially a collection of Series
objects that are given an index value by Pandas. Under the hood, the data are represented as NumPy Array
objects. That’ll be important to know in just a minute.
Scikit-Learn & LinearRegression
The scikit-learn
library is a powerful set of tools for machine learning in Python. Among its many utilities is the LinearRegression
class. This class makes developing a linear model, training it, and using it to make predictions extremely simple.
In cases where single feature regressions are done (simple linear regression) the LinearRegression
class needs to be instructed this is a series of overserved values for a single variable. Otherwise, the following error message is likely to be thrown:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This error results when one attempts a call to the LinearRegression
class’ fit()
method. This error can arise for any number of reasons depending on one’s data workflow. If, for example, one is using Pandas DataFrame
objects, it’s generally an issue of indexing syntax when extracting one column from several. When using numpy arrays
, it’s generally an issue of not having an index value. Let’s look at some examples for each.
Note: The following examples are using the same 5-item collection of integers for values of both the independent and dependent variables. This is absolute nonsense and is not intended to represent a valid approach for training a regression model.
Native Python Lists
For the first example let’s consider a workflow using a Python list
as our starting point. From there, we’ll pass that as an argument for both the independent and dependent variables to the LinearRegression
class.
from sklearn.linear_model import LinearRegression # Make up some data data = [1, 2, 3, 4, 5] # Instantiate new Regression model regr = LinearRegression() # Train the model regr.fit(data, data) # error here # Result Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
We’re attempting to train the model using a 1-dimensional array. Again, this is confusing in that we aren’t specifying if this is a series of observed values for a single feature or a single observed value for multiple features. To fix this, we can add an index value to our data as follows:
from sklearn.linear_model import LinearRegression # Add index value data = [[i, x] for i, x in enumerate(data)] # View new collection print(data) # Result - a 2D array [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]] # Instantiate new Regression model regr = LinearRegression() # Train the model regr.fit(data, data) # no error
This simple addition of an index value in front of the values of our data lets the LinearRegression
model know our data is for a single feature. This is the equivalent of listening to scikit-learn
‘s error message “Reshape your data either using array.reshape(-1, 1) if your data has a single feature.” Now that we have an idea of what’s going on let’s take a look at how NumPy handles this.
NumPy Arrays
NumPy arrays are how Pandas represents data at the lower levels of its DataFrame and Series APIs. Knowing how to fix this issue in NumPy generally offers the assurance of being able to handle it in Pandas. At the very least—it makes it easier to interpret the error messages! Let’s re-create our example of Python lists using a NumPy array
instead of a Python list
.
import numpy as np from sklearn.linear_model import LinearRegression # Create our data as an array object data = np.array([1, 2, 3, 4, 5]) # Examine the data print(data) print(data.shape) # Result [1 2 3 4 5] (5,) # Try to train the model regr = LinearRegression() regr.fit(data, data) # Result Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
We can see that we have a 1-Dimensional array of data, showing our elements as expected. The second printout (5,)
is a little mysterious. Without diving too deeply into NumPy data structures, this essentially says a collection of 5 elements without information on how that data is organized. For a better idea of how NumPy represents arrays, indexes them, and all the implications I suggest checking out this incredible post on StackOverflow.
Regardless of our understanding of NumPy arrays at this point, it’s clear this approach does not work. The error message seems a bit more relevant now we’re aware that we’re using np.array
objects—which are what scikit-learn
‘s error message refers to with “array.” That means we can use the np.array.reshape()
method here to fix our data as suggested:
import numpy as np from sklearn.linear_model import LinearRegression # Create data the same way data = np.array([1, 2, 3, 4, 5]) print(data.shape) (5,) # Add an index value by "reshaping" data data.reshape(-1, 1) print(data.shape) (5,) # need new reference to return value # Add an index value by "reshaping" data # and assigning to new object data = data.reshape(-1, 1) print(data.shape) (5, 1) # Train the model again regr = LinearRegression() regr.fit(data, data)
The numpy.array.reshape()
method uses the value of -1
to mean “automatically determine a value by counting the number of elements.” This indexes our data sequentially starting from the first member and moving to the last. It’s also important to note that the reshape()
method returns a copy of the data object and does not modify the existing object.
Without assigning that to a new variable (or replacing the existing object in our case) the data will retain the same format. We can now successfully train our model without error. Given our understanding of the numpy.array
data structure we can now understand how to approach the issue when using Pandas DataFrames
.
Pandas DataFrames
Pandas DataFrames
are a collection of Pandas Series
objects. The DataFrame
is an n-dimensional object (the “n” being the number of Series
contained) and the Series
is a 1-dimensional object.
DataFrame
objects are indexed such that a DataFrame
containing a single Series
object is considered a 2-dimensional array, where the first dimension is an index value. A Series
object is considered a 1-dimensional array.
A Series
object is still technically a 2D array, where the first dimension is an index and the second the values. This is clear when one prints a representation to stdout
. However, it’s interpreted as a 1D array such that scikit-learn’s LinearRegression
class will regard it as such. This is where things get a little unclear so let’s consider some examples:
import pandas as pd # Create a dataframe with our values labeled as "observed_values" df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]}) print(df) print(type(df)) print(df.shape) print(df.index) # The Dataframe observed_values 0 1 1 2 2 3 3 4 4 5 # The type <class 'pandas.core.frame.DataFrame'> # Its dimensions (5, 1) # Its index RangeIndex(start=0, stop=5, step=1)
If we toss our DataFrame
object into the LinearRegression.fit()
method we’ll not get any errors. This is because our data is a single column with a valid index interpreted as such (evident by the df.shape
call).
Consider the case where there may be multiple columns present in a dataset and only certain columns are being extracted for regression analysis. Extracting these values as a Series
is where things can go awry. Consider the following:
import pandas as pd # Create our dataframe object df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]}) # Extract a single column for use data = df['observed_values'] # Print some info print(data) print(type(data )) print(data.shape) print(data.index) # The representation 0 1 1 2 2 3 3 4 4 5 Name: observed_values, dtype: int64 # Our object type <class 'pandas.core.series.Series'> # Our object shape (5,) # Our object Index RangeIndex(start=0, stop=5, step=1)
There are several things to note here:
- Our object is now a Pandas
Series
object, not aDataFrame
- The shape is 1-dimensional, of size 5
- We still appear to have a valid index, but it’s not reflected in our
Series
‘ shape anymore.
If we try to train our model as regr.fit(data, data)
again we’ll run into the same error message as before indicating our data is of a confusing shape. There are two approaches to remedy this situation—a proactive approach and a reactive approach. Let’s consider both:
# Approach One - Extract Series as DataFrame data = df[['observed_values']] # Print Summaries print(data) print(type(data)) print(data.shape) print(data.index) # result observed_values 0 1 1 2 2 3 3 4 4 5 <class 'pandas.core.frame.DataFrame'> (5, 1) RangeIndex(start=0, stop=5, step=1) # Aproach Two - use the np.reshape() method data = df['observed_values'].values.reshape(-1, 1) # print summary print(data) print(type(data)) print(data.shape) # print(data.index) - note: np.array doesn't have an index method # Result [[1] [2] [3] [4] [5]] <class 'numpy.ndarray'> (5, 1)
We can see both methods will properly reshape our data such that scikit-learn’s LinearRegrssion()
model will recognize it as being a series of observed values for a single feature.
Final Thoughts
This article was motivated by a continued lack of familiarity with the DataFrame
API, the implications of indexing syntaxes, and its widespread integration among Python’s most popular data science packages.
The use of double bracket vs. single bracket notation in Pandas—resulting in either a Series
object or DataFrame
object—was a real point of confusion for me. The official documentation for indexing and accessing data with Pandas is helpful—I just continually forgot the RTFM.