Saving & Loading CSV Files with Pandas DataFrames

Pandas DataFrames provide convenient methods for manipulating data with ease. The row-column format of the DataFrame is native to CSV files and knowing how to both save and load CSV files with Pandas is an essential skill for any data scientist!
load save csv files pandas dataframes alpharithms

Pandas is a powerful data science library with heavy optimizations done. The DataFrame object is an efficient data structure whereby data is organized in a 2-dimensional manner with powerful built-in tools for manipulating that data.

The Pandas DataFrame object is useful for displaying, analyzing, and manipulating tabulated data such as spreadsheets. However, sometimes viewing the data can be burdensome in a terminal. In this quick tutorial, we’ll convert a Pandas DataFrame object into a local CSV file for convenient viewing in other programs as well as data storage.

TL; DR

Convert a DataFrame to CSV by using the following code:

# Note: 'df' is an existing DataFrame object
df.to_csv("your-filename.csv")

Load a CSV file as a DataFrame object using the following code:

import pandas as pd

# load local csv; optinally specify column to use as index
df = pd.read_csv("your-filename.csv", index_col='Your Index')

For those looking for a deeper understanding of the processes of saving/converting a DataFrame to CSV or loading a CSV file as a DataFrame, let’s take a look at what this tutorial will cover.

Highlights

By the end of this tutorial, we will have covered the following:

  1. Fetching remote pricing data via the yfinance library as a DataFrame object.
  2. Saving the DataFrame to a local CSV file.
  3. Loading the local CSV file as a DataFrame.
  4. Reindexing the DataFrame by the original column rather than row number.

Project Setup

To demonstrate saving a Pandas DataFrame to a .csv file we need to load test data. Below is the code for downloading some historical pricing data for Ether from Yahoo finance via the yfinance python library (pip install yfinance):

import yfinance as yf

# download 1-year daily pricing data for Ether, limit columns
data = yf.Ticker("ETH-USD").history(period='1y', interval='1d')[['Open', 'High', 'Low', 'Close', 'Volume']]

# View result
print(data)

                   Open         High          Low        Close       Volume
Date                                                                       
2021-07-07  2323.208496  2403.773438  2298.026855  2315.161865  22398345285
2021-07-08  2317.949219  2324.006836  2089.414307  2120.026367  23188123689
2021-07-09  2115.573975  2185.376221  2051.066650  2146.692383  23029574602
2021-07-10  2146.999756  2190.124023  2081.923584  2111.403564  17581542471
2021-07-11  2110.875732  2172.653320  2083.803711  2139.664795  14705386138
...                 ...          ...          ...          ...          ...
2022-07-03  1066.467407  1083.418945  1044.005615  1073.766968   8557248150
2022-07-04  1073.794312  1152.943848  1048.396362  1151.059082  13670889311
2022-07-05  1150.509766  1165.682739  1086.829956  1134.541016  16195518291
2022-07-06  1134.822266  1193.636719  1116.247559  1186.973999  15373536703
2022-07-07  1187.752075  1187.752075  1165.623169  1185.904053  13014859776

[366 rows x 5 columns]

# View data type
print(type(eth))

<class 'pandas.core.frame.DataFrame'>

Note: This data is available as a .csv file via Github but any data can be used.

Saving A DataFrame as a CSV File

Our Yahoo Finance pricing history is automatically converted to a Pandas DataFrame via the yfinance library. From here, we use the built-in to_csv method to convert the DataFrame to a CSV file as such:

# specify a filename
filename = "eth-usd-20210707-20220707.csv"

# Use DataFrame to_csv method to save output
data.to_csv(filename)

# Check results
import os
print(os.stat(filename))

os.stat_result(st_mode=33206, st_ino=110338190870733658, st_dev=2997963831, st_nlink=1, st_uid=0, st_gid=0, st_size=33720, st_atime=1657200135, st_mtime=1657199413, st_ctime=1657199413)

Using the os.stat() method we have also checked that the file was created and has the expected metadata associated with it. Notably, we see timestamps for last-accessed, created, and modified. Also, we see a filesize stat indicating that the file was, indeed, successfully created. For those needing visual confirmation, here is what the file looks like opened in Microsoft Excel:

pandas dataframe to csv output loaded in excel alpharithms
Figure 1: DataFrame CSV exported data loaded into Microsoft Excel (formatting added)

Here we see our data as expected. Now, imagine we had already downloaded this data as a CSV file and needed to load the CSV file as a Pandas DataFrame to perform some analysis. Fortunately, that’s easy as well!

Load A CSV File as DataFrame

We will use our local file saved from before to load back into memory as a DataFrame object. Consider the following example code:

import pandas as pd

# Specify our filename
filename = 'eth-usd-20210707-20220707.csv'

# Load CSV file as DataFrame via the read_csv method
df = pd.read_csv(filename)

# View result
print(df)

           Date         Open  ...        Close       Volume
0    2021-07-07  2323.208496  ...  2315.161865  22398345285
1    2021-07-08  2317.949219  ...  2120.026367  23188123689
2    2021-07-09  2115.573975  ...  2146.692383  23029574602
3    2021-07-10  2146.999756  ...  2111.403564  17581542471
4    2021-07-11  2110.875732  ...  2139.664795  14705386138
..          ...          ...  ...          ...          ...
361  2022-07-03  1066.467407  ...  1073.766968   8557248150
362  2022-07-04  1073.794312  ...  1151.059082  13670889311
363  2022-07-05  1150.509766  ...  1134.541016  16195518291
364  2022-07-06  1134.822266  ...  1186.973999  15373536703
365  2022-07-07  1187.752075  ...  1185.904053  13014859776

[366 rows x 6 columns]

The data has loaded as expected but there is an issue: the Date the column is no longer the first column! We now have an integer-based index for the DataFrame. Fortunately, this can be resolved by explicitly specifying which column the data should be indexed by.

Manually specifying the index column for a DataFrame can be done in one of two ways.

  1. Via the index_col argument during import via the read_csv method.
  2. Via the set_index argument on an existing DataFrame.

Method 1: Specify DataFrame Index During CSV Import

The following code will set the Date column as the DataFrame index during the import of data via the read_csv method:

# Load local CSV file as DataFrame
df = pd.read_csv(filename, index_col='Date')

# view output
print(df)

                   Open         High          Low        Close       Volume
Date                                                                       
2021-07-07  2323.208496  2403.773438  2298.026855  2315.161865  22398345285
2021-07-08  2317.949219  2324.006836  2089.414307  2120.026367  23188123689
2021-07-09  2115.573975  2185.376221  2051.066650  2146.692383  23029574602
2021-07-10  2146.999756  2190.124023  2081.923584  2111.403564  17581542471
2021-07-11  2110.875732  2172.653320  2083.803711  2139.664795  14705386138
...                 ...          ...          ...          ...          ...
2022-07-03  1066.467407  1083.418945  1044.005615  1073.766968   8557248150
2022-07-04  1073.794312  1152.943848  1048.396362  1151.059082  13670889311
2022-07-05  1150.509766  1165.682739  1086.829956  1134.541016  16195518291
2022-07-06  1134.822266  1193.636719  1116.247559  1186.973999  15373536703
2022-07-07  1187.752075  1187.752075  1165.623169  1185.904053  13014859776

[366 rows x 5 columns]

Method 2: Specify DataFrame Index on Existing DataFrame

A DataFrame can either be reindexed and assigned to a new variable (i.e. a copy) or reindexed in place. Either approach utilizes the set_index method and the results are determined by the use of the inplace keyword argument. Let’s take a look at both approaches.

# Load local CSV file as DataFrame
df = pd.read_csv(filename)

# view initial data
print(df)

           Date         Open  ...        Close       Volume
0    2021-07-07  2323.208496  ...  2315.161865  22398345285
1    2021-07-08  2317.949219  ...  2120.026367  23188123689
2    2021-07-09  2115.573975  ...  2146.692383  23029574602
3    2021-07-10  2146.999756  ...  2111.403564  17581542471
4    2021-07-11  2110.875732  ...  2139.664795  14705386138
..          ...          ...  ...          ...          ...
361  2022-07-03  1066.467407  ...  1073.766968   8557248150
362  2022-07-04  1073.794312  ...  1151.059082  13670889311
363  2022-07-05  1150.509766  ...  1134.541016  16195518291
364  2022-07-06  1134.822266  ...  1186.973999  15373536703
365  2022-07-07  1187.752075  ...  1185.904053  13014859776

[366 rows x 6 columns]

# reindex (Note: doesn't work in place w/o assignment)
df.set_index('Date')

# view result
print(df)
           Date         Open  ...        Close       Volume
0    2021-07-07  2323.208496  ...  2315.161865  22398345285
1    2021-07-08  2317.949219  ...  2120.026367  23188123689
2    2021-07-09  2115.573975  ...  2146.692383  23029574602
3    2021-07-10  2146.999756  ...  2111.403564  17581542471
4    2021-07-11  2110.875732  ...  2139.664795  14705386138
..          ...          ...  ...          ...          ...
361  2022-07-03  1066.467407  ...  1073.766968   8557248150
362  2022-07-04  1073.794312  ...  1151.059082  13670889311
363  2022-07-05  1150.509766  ...  1134.541016  16195518291
364  2022-07-06  1134.822266  ...  1186.973999  15373536703
365  2022-07-07  1187.752075  ...  1185.904053  13014859776

[366 rows x 6 columns]


# make copy
df = df.set_index('Date')

# view result
print(df)

                   Open         High          Low        Close       Volume
Date                                                                       
2021-07-07  2323.208496  2403.773438  2298.026855  2315.161865  22398345285
2021-07-08  2317.949219  2324.006836  2089.414307  2120.026367  23188123689
2021-07-09  2115.573975  2185.376221  2051.066650  2146.692383  23029574602
2021-07-10  2146.999756  2190.124023  2081.923584  2111.403564  17581542471
2021-07-11  2110.875732  2172.653320  2083.803711  2139.664795  14705386138
...                 ...          ...          ...          ...          ...
2022-07-03  1066.467407  1083.418945  1044.005615  1073.766968   8557248150
2022-07-04  1073.794312  1152.943848  1048.396362  1151.059082  13670889311
2022-07-05  1150.509766  1165.682739  1086.829956  1134.541016  16195518291
2022-07-06  1134.822266  1193.636719  1116.247559  1186.973999  15373536703
2022-07-07  1187.752075  1187.752075  1165.623169  1185.904053  13014859776

[366 rows x 5 columns]

Here we see three things:

  1. Our data loaded as expected
  2. Our data was not reindexed after using the set_index method because we didn’t specify the inplace keyword OR we didn’t assign the result to a new variable.
  3. Our data reindexed as intended via assignment to the df variable (overwriting previous data)

Now, let’s consider how using the inplace argument can simply things when the original data is no longer needed:

# Load local CSV file as DataFrame
df = pd.read_csv(filename)

# view initial data
print(df)

           Date         Open  ...        Close       Volume
0    2021-07-07  2323.208496  ...  2315.161865  22398345285
1    2021-07-08  2317.949219  ...  2120.026367  23188123689
2    2021-07-09  2115.573975  ...  2146.692383  23029574602
3    2021-07-10  2146.999756  ...  2111.403564  17581542471
4    2021-07-11  2110.875732  ...  2139.664795  14705386138
..          ...          ...  ...          ...          ...
361  2022-07-03  1066.467407  ...  1073.766968   8557248150
362  2022-07-04  1073.794312  ...  1151.059082  13670889311
363  2022-07-05  1150.509766  ...  1134.541016  16195518291
364  2022-07-06  1134.822266  ...  1186.973999  15373536703
365  2022-07-07  1187.752075  ...  1185.904053  13014859776

[366 rows x 6 columns]

# reindex  using in place argument
df.set_index('Date', inplace=True)

# view result
print(df)

                   Open         High          Low        Close       Volume
Date                                                                       
2021-07-07  2323.208496  2403.773438  2298.026855  2315.161865  22398345285
2021-07-08  2317.949219  2324.006836  2089.414307  2120.026367  23188123689
2021-07-09  2115.573975  2185.376221  2051.066650  2146.692383  23029574602
2021-07-10  2146.999756  2190.124023  2081.923584  2111.403564  17581542471
2021-07-11  2110.875732  2172.653320  2083.803711  2139.664795  14705386138
...                 ...          ...          ...          ...          ...
2022-07-03  1066.467407  1083.418945  1044.005615  1073.766968   8557248150
2022-07-04  1073.794312  1152.943848  1048.396362  1151.059082  13670889311
2022-07-05  1150.509766  1165.682739  1086.829956  1134.541016  16195518291
2022-07-06  1134.822266  1193.636719  1116.247559  1186.973999  15373536703
2022-07-07  1187.752075  1187.752075  1165.623169  1185.904053  13014859776

[366 rows x 5 columns]

Here we see our original data loaded in, again being indexed via row number. However, after using the inplace=True argument, we note the same DataFrame object is then re-indexed using the Date column without assignment to another variable. This is convenient in cases where the initial data format is no longer needed.

Final Thoughts

Pandas is an incredible library, even if its syntax and opinionated approaches can be difficult to understand at first. It is a highly efficient tool for analyzing, manipulating, and — as we’ve seen here — performing I/O operations on CSV files! The ability to conveniently and quickly specify by which column a DataFrame should index one’s data is critical — especially in cases of time series data.

Converting a DataFrame to CSV is essential for supporting data permanence after computationally-heavy tasks. The CSV format is an ideal format that allows structured data to persist without requiring the overhead of a database. Another alternative is JSON, which Pandas can readily load and convert to CSV or a DataFrame as well! Check out our article on linear regression in Python to see how such heavy number crunching can be facilitated by DataFrames.

Zαck West
Full-Stack Software Engineer with 10+ years of experience. Expertise in developing distributed systems, implementing object-oriented models with a focus on semantic clarity, driving development with TDD, enhancing interfaces through thoughtful visual design, and developing deep learning agents.