Pandas, NumPy, and Scikit-Learn are three Python libraries commonly used for linear regression. Scikit-learn’s `LinearRegression` class can be instantiated, trained, and applied in a few lines of code.

Depending on how data is loaded, accessed, and passed around, some issues can cause errors. These errors can be addressed with one of several approaches to **reshaping data** before training a linear model.

## Introduction: The Problem

One issue arises when linear regression is being done on data with a single feature. Such data is often represented as a list of values (a 1-dimensional array, in most cases). The `LinearRegression` model doesn’t know if this is a **series of observed values for a single feature** or a **single observed value for multiple features**. Let’s try to visualize the issue:

Here we can see that a single collection of values can be interpreted in one of two ways:

- A series of observed values for a single feature
- A single observed value for a series of features

These represent *very* different aspects of data. In the case of single-feature regression analysis, Scikit-learn’s `LinearRegression` class needs to be explicitly told that a series of data represents a series of observed values for a single feature and not the other way around. Fortunately, this can be done fairly easily in one of several ways. Before we get into how to solve this issue, let’s first consider how it might arise.
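To make the two interpretations concrete, here is a minimal NumPy sketch. The variable names are illustrative and not from any particular workflow; it simply arranges the same five values both ways.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# Interpretation 1: five observations of a single feature (5 rows, 1 column)
as_single_feature = values.reshape(-1, 1)

# Interpretation 2: one observation of five features (1 row, 5 columns)
as_single_sample = values.reshape(1, -1)

print(as_single_feature.shape)  # (5, 1)
print(as_single_sample.shape)   # (1, 5)
```

Both arrays hold the same numbers; only the arrangement into rows (observations) and columns (features) differs.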

**Note**: This post is about a nuanced aspect of data preparation for linear regression. Check out the article *Simple Linear Regression* for a broader discussion or the article *Predicting Stock Prices with Linear Regression in Python* for an applied tutorial.

## Pandas DataFrames, Series, and NumPy Arrays

Pandas commonly represents data in one of two ways: `DataFrame` objects or `Series` objects. Without diving too deeply, `DataFrames` are like spreadsheets: they represent rows and columns of data. `DataFrame` objects can have many rows and many columns. Consider the following illustration:

`Series` objects are like a single column from a spreadsheet: they can have many rows but only a single column. `DataFrames` are essentially a collection of `Series` objects that are given an index value by Pandas. Under the hood, the data are typically represented as NumPy `ndarray` objects. That’ll be important to know in just a minute.
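As a rough illustration of that last point, the sketch below pulls the underlying NumPy arrays out of a small `DataFrame` and one of its columns. The column names `a` and `b` are made up for this example.

```python
import pandas as pd

# A small two-column DataFrame (hypothetical column names)
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Single-bracket access returns one column as a Series
column = df["a"]

# The DataFrame exposes a 2D array; the Series exposes a 1D array
print(type(df.to_numpy()).__name__, df.to_numpy().shape)          # ndarray (3, 2)
print(type(column.to_numpy()).__name__, column.to_numpy().shape)  # ndarray (3,)
```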

## Scikit-Learn & LinearRegression

The `scikit-learn` library is a powerful set of tools for machine learning in Python. Among its many utilities is the `LinearRegression` class. This class makes developing a linear model, training it, and using it to make predictions extremely simple.

In cases where single-feature regressions are done (simple linear regression), the `LinearRegression` class needs to be told that the data is a series of observed values for a single variable. Otherwise, the following error message is likely to be thrown:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

This error results when one attempts a call to the `LinearRegression` class’ `fit()` method. This error can arise for any number of reasons depending on one’s data workflow. If, for example, one is using Pandas `DataFrame` objects, it’s generally an issue of indexing syntax when extracting one column from several. When using NumPy arrays, it’s generally an issue of the array being 1-dimensional, with no second (column) dimension. Let’s look at some examples for each.

**Note**: The following examples are using the same 5-item collection of integers for values of both the independent and dependent variables. This is absolute nonsense and is not intended to represent a valid approach for training a regression model.

## Native Python Lists

For the first example, let’s consider a workflow using a Python `list` as our starting point. From there, we’ll pass that as an argument for both the independent and dependent variables to the `LinearRegression` class.

```python
from sklearn.linear_model import LinearRegression

# Make up some data
data = [1, 2, 3, 4, 5]

# Instantiate new Regression model
regr = LinearRegression()

# Train the model
regr.fit(data, data)  # error here

# Result
# Reshape your data either using array.reshape(-1, 1) if your data has a
# single feature or array.reshape(1, -1) if it contains a single sample.
```

We’re attempting to train the model using a 1-dimensional array. Again, this is confusing in that we aren’t specifying if this is a series of observed values for a single feature or a single observed value for multiple features. To fix this, we can add an *index* value to our data as follows:

```python
from sklearn.linear_model import LinearRegression

# Same data as before
data = [1, 2, 3, 4, 5]

# Add index value
data = [[i, x] for i, x in enumerate(data)]

# View new collection
print(data)

# Result - a 2D array
# [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]]

# Instantiate new Regression model
regr = LinearRegression()

# Train the model
regr.fit(data, data)  # no error
```

Adding an index value in front of each value makes our data 2-dimensional, so `fit()` no longer raises an error. Note, however, that this is not quite the same as following `scikit-learn`’s advice to “Reshape your data either using array.reshape(-1, 1) if your data has a single feature”: the result has shape (5, 2), so the model treats the index as a second feature column, whereas `reshape(-1, 1)` would produce shape (5, 1). Now that we have an idea of what’s going on, let’s take a look at how NumPy handles this.
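For plain Python lists, a closer equivalent of `array.reshape(-1, 1)` is to wrap each value in its own one-element list, which yields a single feature column rather than an index column plus a value column. A minimal sketch, where the name `X` is illustrative:

```python
import numpy as np

data = [1, 2, 3, 4, 5]

# Wrap each value in its own list: one row per observation, one column
X = [[x] for x in data]

print(X)                  # [[1], [2], [3], [4], [5]]
print(np.array(X).shape)  # (5, 1)

# Same layout that reshape(-1, 1) gives for the equivalent NumPy array
print(np.array_equal(np.array(X), np.array(data).reshape(-1, 1)))  # True
```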

## NumPy Arrays

NumPy arrays are how Pandas represents data at the lower levels of its DataFrame and Series APIs. Knowing how to fix this issue in NumPy *generally* offers the assurance of being able to handle it in Pandas. At the very least, it makes it easier to interpret the error messages! Let’s re-create our example of Python lists using a NumPy `array` instead of a Python `list`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Create our data as an array object
data = np.array([1, 2, 3, 4, 5])

# Examine the data
print(data)
print(data.shape)

# Result
# [1 2 3 4 5]
# (5,)

# Try to train the model
regr = LinearRegression()
regr.fit(data, data)

# Result
# Reshape your data either using array.reshape(-1, 1) if your data has a
# single feature or array.reshape(1, -1) if it contains a single sample.
```

We can see that we have a 1-dimensional array of data, showing our elements as expected. The second printout, `(5,)`, is a little mysterious. Without diving too deeply into NumPy data structures, the shape is a tuple with one entry per dimension, so `(5,)` describes a single dimension of 5 elements with no separate notion of rows and columns. For a better idea of how NumPy represents arrays, indexes them, and all the implications, I suggest checking out this *incredible* post on StackOverflow.
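One way to demystify the shape printout is to compare it with the array’s number of dimensions. The short sketch below does that for a flat array and for a nested one; the name `nested` is illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
nested = np.array([[1], [2], [3], [4], [5]])

# shape is an ordinary Python tuple with one entry per dimension
print(data.ndim, data.shape)      # 1 (5,)
print(nested.ndim, nested.shape)  # 2 (5, 1)

# The trailing comma in (5,) is just how Python writes a one-element tuple
print(len(data.shape) == data.ndim)  # True
```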

Regardless of our understanding of NumPy arrays at this point, it’s clear this approach does not work. The error message seems a bit more relevant now that we’re aware we’re using NumPy array objects, which are what `scikit-learn`’s error message refers to with “array.” That means we can use the `np.ndarray.reshape()` method here to fix our data as suggested:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Create data the same way
data = np.array([1, 2, 3, 4, 5])
print(data.shape)
# (5,)

# Add an index value by "reshaping" data
data.reshape(-1, 1)
print(data.shape)
# (5,)  <- unchanged: need new reference to return value

# Add an index value by "reshaping" data
# and assigning to new object
data = data.reshape(-1, 1)
print(data.shape)
# (5, 1)

# Train the model again
regr = LinearRegression()
regr.fit(data, data)
```

The `numpy.ndarray.reshape()` method uses the value of `-1` to mean “automatically determine this dimension from the number of elements.” With `reshape(-1, 1)`, each element becomes its own row, in order from the first member to the last. It’s also important to note that the `reshape()` method returns a *new* array object (a view of the same data where possible) and does not modify the shape of the existing object.
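The sketch below illustrates both behaviours on a six-element array (chosen here, as an assumption for the example, so that it divides evenly into several shapes). As far as I understand NumPy’s behaviour, `reshape()` on a contiguous array like this one returns a view that shares memory with the original rather than duplicating the values.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])

# -1 asks NumPy to infer that dimension from the total element count
print(data.reshape(-1, 1).shape)  # (6, 1)
print(data.reshape(-1, 2).shape)  # (3, 2)
print(data.reshape(2, -1).shape)  # (2, 3)

# The original array keeps its shape; the result is a separate object
reshaped = data.reshape(-1, 1)
print(data.shape)      # (6,)
print(reshaped.shape)  # (6, 1)

# For a contiguous array like this one, both objects share the same memory
print(np.shares_memory(data, reshaped))  # True
```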

Without assigning that to a new variable (or replacing the existing object in our case) the data will retain the same format. We can now successfully train our model without error. Given our understanding of the NumPy array data structure, we can now understand how to approach the issue when using Pandas `DataFrames`.

## Pandas DataFrames

Pandas `DataFrames` are a collection of Pandas `Series` objects. The `DataFrame` is a 2-dimensional object (rows by columns, with one column per `Series` contained) and the `Series` is a 1-dimensional object.

`DataFrame` objects are indexed such that a `DataFrame` containing a single `Series` object is still considered a 2-dimensional array: one dimension for the rows and one for its single column. A `Series` object on its own is considered a 1-dimensional array.

A `Series` object can still *look* like a 2D array, with an index in the first column and the values in the second. This is clear when one prints a representation to `stdout`. However, the index is only a set of row labels, not a second dimension, so the `Series` is interpreted as a 1D array and scikit-learn’s `LinearRegression` class will regard it as such. This is where things get a little unclear, so let’s consider some examples:

```python
import pandas as pd

# Create a dataframe with our values labeled as "observed_values"
df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]})

print(df)
print(type(df))
print(df.shape)
print(df.index)

# The Dataframe
#    observed_values
# 0                1
# 1                2
# 2                3
# 3                4
# 4                5

# The type
# <class 'pandas.core.frame.DataFrame'>

# Its dimensions
# (5, 1)

# Its index
# RangeIndex(start=0, stop=5, step=1)
```

If we toss our `DataFrame` object into the `LinearRegression.fit()` method we’ll not get any errors. This is because our data is already 2-dimensional, a single column of five rows (evident from the `df.shape` call).
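A quick way to confirm this is to fit the model on the single-column `DataFrame` directly. The sketch below reuses the same nonsense data for both variables, as in the earlier examples; the `new_point` frame is a hypothetical input added only to show a prediction.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"observed_values": [1, 2, 3, 4, 5]})

# Both X and y have shape (5, 1), so no reshaping is needed
regr = LinearRegression()
regr.fit(df, df)

# Predict for one new observation, passed as a DataFrame with the same column
new_point = pd.DataFrame({"observed_values": [6]})
prediction = regr.predict(new_point)

print(prediction.shape)  # (1, 1)
print(prediction)        # approximately 6, since the model learned y = x
```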

Consider the case where there may be *multiple* columns present in a dataset and only certain columns are being extracted for regression analysis. Extracting these values as a `Series` is where things can go awry. Consider the following:

```python
import pandas as pd

# Create our dataframe object
df = pd.DataFrame({'observed_values': [1, 2, 3, 4, 5]})

# Extract a single column for use
data = df['observed_values']

# Print some info
print(data)
print(type(data))
print(data.shape)
print(data.index)

# The representation
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# Name: observed_values, dtype: int64

# Our object type
# <class 'pandas.core.series.Series'>

# Our object shape
# (5,)

# Our object Index
# RangeIndex(start=0, stop=5, step=1)
```

There are several things to note here:

- Our object is now a Pandas `Series` object, not a `DataFrame`
- The shape is 1-dimensional, of size 5
- We still appear to have a valid index, but it’s not reflected in our `Series`’ shape anymore.

If we try to train our model as `regr.fit(data, data)` again, we’ll run into the same error message as before, indicating our data is of a confusing shape. There are two approaches to remedy this situation: a *proactive* approach and a *reactive* approach. Let’s consider both:

```python
# (continues from the previous example, where df is defined)

# Approach One - Extract Series as DataFrame
data = df[['observed_values']]

# Print Summaries
print(data)
print(type(data))
print(data.shape)
print(data.index)

# result
#    observed_values
# 0                1
# 1                2
# 2                3
# 3                4
# 4                5
# <class 'pandas.core.frame.DataFrame'>
# (5, 1)
# RangeIndex(start=0, stop=5, step=1)

# Approach Two - use the NumPy reshape() method
data = df['observed_values'].values.reshape(-1, 1)

# print summary
print(data)
print(type(data))
print(data.shape)
# print(data.index) - note: a NumPy array doesn't have an index attribute

# Result
# [[1]
#  [2]
#  [3]
#  [4]
#  [5]]
# <class 'numpy.ndarray'>
# (5, 1)
```

We can see both methods will properly reshape our data such that scikit-learn’s `LinearRegression()` model will recognize it as being a series of observed values for a single feature.
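For comparison, the sketch below puts the two approaches next to a third option that this post does not otherwise cover: Pandas’ `Series.to_frame()` method, which converts an already-extracted `Series` back into a single-column `DataFrame`. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"observed_values": [1, 2, 3, 4, 5]})

# Approach one: double brackets keep the result a DataFrame
as_frame = df[["observed_values"]]

# Approach two: extract the values and reshape the NumPy array
as_array = df["observed_values"].to_numpy().reshape(-1, 1)

# Alternative (not from the original post): convert the Series with to_frame()
via_to_frame = df["observed_values"].to_frame()

# All three end up with 5 rows and 1 column
for candidate in (as_frame, as_array, via_to_frame):
    print(type(candidate).__name__, candidate.shape)
```

The `DataFrame` variants keep the column name and index, which can be convenient later on; the NumPy variant discards both.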

## Final Thoughts

This article was motivated by my continued lack of familiarity with the `DataFrame` API, the implications of its indexing syntaxes, and its widespread integration among Python’s most popular data science packages.

The use of double bracket vs. single bracket notation in Pandas, resulting in either a `DataFrame` object or a `Series` object respectively, was a *real* point of confusion for me. The official documentation for indexing and accessing data with Pandas is helpful; I just continually forgot to RTFM.