Pandas is a powerful data science library with heavy optimizations done. The DataFrame object is an efficient data structure whereby data is organized in a 2-dimensional manner with powerful built-in tools for manipulating that data.
The Pandas DataFrame
object is useful for displaying, analyzing, and manipulating tabulated data such as spreadsheets. However, sometimes viewing the data can be burdensome in a terminal. In this quick tutorial, we’ll convert a Pandas DataFrame
object into a local CSV file for convenient viewing in other programs as well as data storage.
TL; DR
Convert a DataFrame to CSV by using the following code:
# Note: 'df' is an existing DataFrame object df.to_csv("your-filename.csv")
Load a CSV file as a DataFrame object using the following code:
import pandas as pd # load local csv; optinally specify column to use as index df = pd.read_csv("your-filename.csv", index_col='Your Index')
For those looking for a deeper understanding of the processes of saving/converting a DataFrame to CSV or loading a CSV file as a DataFrame, let’s take a look at what this tutorial will cover.
Highlights
By the end of this tutorial, we will have covered the following:
- Fetching remote pricing data via the
yfinance
library as a DataFrame object. - Saving the
DataFrame
to a local CSV file. - Loading the local CSV file as a
DataFrame
. - Reindexing the
DataFrame
by the original column rather than row number.
Project Setup
To demonstrate saving a Pandas DataFrame
to a .csv
file we need to load test data. Below is the code for downloading some historical pricing data for Ether from Yahoo finance via the yfinance
python library (pip install yfinance
):
import yfinance as yf # download 1-year daily pricing data for Ether, limit columns data = yf.Ticker("ETH-USD").history(period='1y', interval='1d')[['Open', 'High', 'Low', 'Close', 'Volume']] # View result print(data) Open High Low Close Volume Date 2021-07-07 2323.208496 2403.773438 2298.026855 2315.161865 22398345285 2021-07-08 2317.949219 2324.006836 2089.414307 2120.026367 23188123689 2021-07-09 2115.573975 2185.376221 2051.066650 2146.692383 23029574602 2021-07-10 2146.999756 2190.124023 2081.923584 2111.403564 17581542471 2021-07-11 2110.875732 2172.653320 2083.803711 2139.664795 14705386138 ... ... ... ... ... ... 2022-07-03 1066.467407 1083.418945 1044.005615 1073.766968 8557248150 2022-07-04 1073.794312 1152.943848 1048.396362 1151.059082 13670889311 2022-07-05 1150.509766 1165.682739 1086.829956 1134.541016 16195518291 2022-07-06 1134.822266 1193.636719 1116.247559 1186.973999 15373536703 2022-07-07 1187.752075 1187.752075 1165.623169 1185.904053 13014859776 [366 rows x 5 columns] # View data type print(type(eth)) <class 'pandas.core.frame.DataFrame'>
Note: This data is available as a .csv
file via Github but any data can be used.
Saving A DataFrame as a CSV File
Our Yahoo Finance pricing history is automatically converted to a Pandas DataFrame
via the yfinance
library. From here, we use the built-in to_csv
method to convert the DataFrame
to a CSV file as such:
# specify a filename filename = "eth-usd-20210707-20220707.csv" # Use DataFrame to_csv method to save output data.to_csv(filename) # Check results import os print(os.stat(filename)) os.stat_result(st_mode=33206, st_ino=110338190870733658, st_dev=2997963831, st_nlink=1, st_uid=0, st_gid=0, st_size=33720, st_atime=1657200135, st_mtime=1657199413, st_ctime=1657199413)
Using the os.stat()
method we have also checked that the file was created and has the expected metadata associated with it. Notably, we see timestamps for last-accessed, created, and modified. Also, we see a filesize stat indicating that the file was, indeed, successfully created. For those needing visual confirmation, here is what the file looks like opened in Microsoft Excel:
Here we see our data as expected. Now, imagine we had already downloaded this data as a CSV file and needed to load the CSV file as a Pandas DataFrame
to perform some analysis. Fortunately, that’s easy as well!
Load A CSV File as DataFrame
We will use our local file saved from before to load back into memory as a DataFrame
object. Consider the following example code:
import pandas as pd # Specify our filename filename = 'eth-usd-20210707-20220707.csv' # Load CSV file as DataFrame via the read_csv method df = pd.read_csv(filename) # View result print(df) Date Open ... Close Volume 0 2021-07-07 2323.208496 ... 2315.161865 22398345285 1 2021-07-08 2317.949219 ... 2120.026367 23188123689 2 2021-07-09 2115.573975 ... 2146.692383 23029574602 3 2021-07-10 2146.999756 ... 2111.403564 17581542471 4 2021-07-11 2110.875732 ... 2139.664795 14705386138 .. ... ... ... ... ... 361 2022-07-03 1066.467407 ... 1073.766968 8557248150 362 2022-07-04 1073.794312 ... 1151.059082 13670889311 363 2022-07-05 1150.509766 ... 1134.541016 16195518291 364 2022-07-06 1134.822266 ... 1186.973999 15373536703 365 2022-07-07 1187.752075 ... 1185.904053 13014859776 [366 rows x 6 columns]
The data has loaded as expected but there is an issue: the Date
the column is no longer the first column! We now have an integer-based index for the DataFrame
. Fortunately, this can be resolved by explicitly specifying which column the data should be indexed by.
Manually specifying the index column for a DataFrame
can be done in one of two ways.
- Via the
index_col
argument during import via theread_csv
method. - Via the
set_index
argument on an existing DataFrame.
Method 1: Specify DataFrame Index During CSV Import
The following code will set the Date
column as the DataFrame
index during the import of data via the read_csv
method:
# Load local CSV file as DataFrame df = pd.read_csv(filename, index_col='Date') # view output print(df) Open High Low Close Volume Date 2021-07-07 2323.208496 2403.773438 2298.026855 2315.161865 22398345285 2021-07-08 2317.949219 2324.006836 2089.414307 2120.026367 23188123689 2021-07-09 2115.573975 2185.376221 2051.066650 2146.692383 23029574602 2021-07-10 2146.999756 2190.124023 2081.923584 2111.403564 17581542471 2021-07-11 2110.875732 2172.653320 2083.803711 2139.664795 14705386138 ... ... ... ... ... ... 2022-07-03 1066.467407 1083.418945 1044.005615 1073.766968 8557248150 2022-07-04 1073.794312 1152.943848 1048.396362 1151.059082 13670889311 2022-07-05 1150.509766 1165.682739 1086.829956 1134.541016 16195518291 2022-07-06 1134.822266 1193.636719 1116.247559 1186.973999 15373536703 2022-07-07 1187.752075 1187.752075 1165.623169 1185.904053 13014859776 [366 rows x 5 columns]
Method 2: Specify DataFrame Index on Existing DataFrame
A DataFrame
can either be reindexed and assigned to a new variable (i.e. a copy) or reindexed in place. Either approach utilizes the set_index
method and the results are determined by the use of the inplace
keyword argument. Let’s take a look at both approaches.
# Load local CSV file as DataFrame df = pd.read_csv(filename) # view initial data print(df) Date Open ... Close Volume 0 2021-07-07 2323.208496 ... 2315.161865 22398345285 1 2021-07-08 2317.949219 ... 2120.026367 23188123689 2 2021-07-09 2115.573975 ... 2146.692383 23029574602 3 2021-07-10 2146.999756 ... 2111.403564 17581542471 4 2021-07-11 2110.875732 ... 2139.664795 14705386138 .. ... ... ... ... ... 361 2022-07-03 1066.467407 ... 1073.766968 8557248150 362 2022-07-04 1073.794312 ... 1151.059082 13670889311 363 2022-07-05 1150.509766 ... 1134.541016 16195518291 364 2022-07-06 1134.822266 ... 1186.973999 15373536703 365 2022-07-07 1187.752075 ... 1185.904053 13014859776 [366 rows x 6 columns] # reindex (Note: doesn't work in place w/o assignment) df.set_index('Date') # view result print(df) Date Open ... Close Volume 0 2021-07-07 2323.208496 ... 2315.161865 22398345285 1 2021-07-08 2317.949219 ... 2120.026367 23188123689 2 2021-07-09 2115.573975 ... 2146.692383 23029574602 3 2021-07-10 2146.999756 ... 2111.403564 17581542471 4 2021-07-11 2110.875732 ... 2139.664795 14705386138 .. ... ... ... ... ... 361 2022-07-03 1066.467407 ... 1073.766968 8557248150 362 2022-07-04 1073.794312 ... 1151.059082 13670889311 363 2022-07-05 1150.509766 ... 1134.541016 16195518291 364 2022-07-06 1134.822266 ... 1186.973999 15373536703 365 2022-07-07 1187.752075 ... 1185.904053 13014859776 [366 rows x 6 columns] # make copy df = df.set_index('Date') # view result print(df) Open High Low Close Volume Date 2021-07-07 2323.208496 2403.773438 2298.026855 2315.161865 22398345285 2021-07-08 2317.949219 2324.006836 2089.414307 2120.026367 23188123689 2021-07-09 2115.573975 2185.376221 2051.066650 2146.692383 23029574602 2021-07-10 2146.999756 2190.124023 2081.923584 2111.403564 17581542471 2021-07-11 2110.875732 2172.653320 2083.803711 2139.664795 14705386138 ... ... ... ... ... ... 2022-07-03 1066.467407 1083.418945 1044.005615 1073.766968 8557248150 2022-07-04 1073.794312 1152.943848 1048.396362 1151.059082 13670889311 2022-07-05 1150.509766 1165.682739 1086.829956 1134.541016 16195518291 2022-07-06 1134.822266 1193.636719 1116.247559 1186.973999 15373536703 2022-07-07 1187.752075 1187.752075 1165.623169 1185.904053 13014859776 [366 rows x 5 columns]
Here we see three things:
- Our data loaded as expected
- Our data was not reindexed after using the
set_index
method because we didn’t specify theinplace
keyword OR we didn’t assign the result to a new variable. - Our data reindexed as intended via assignment to the
df
variable (overwriting previous data)
Now, let’s consider how using the inplace
argument can simply things when the original data is no longer needed:
# Load local CSV file as DataFrame df = pd.read_csv(filename) # view initial data print(df) Date Open ... Close Volume 0 2021-07-07 2323.208496 ... 2315.161865 22398345285 1 2021-07-08 2317.949219 ... 2120.026367 23188123689 2 2021-07-09 2115.573975 ... 2146.692383 23029574602 3 2021-07-10 2146.999756 ... 2111.403564 17581542471 4 2021-07-11 2110.875732 ... 2139.664795 14705386138 .. ... ... ... ... ... 361 2022-07-03 1066.467407 ... 1073.766968 8557248150 362 2022-07-04 1073.794312 ... 1151.059082 13670889311 363 2022-07-05 1150.509766 ... 1134.541016 16195518291 364 2022-07-06 1134.822266 ... 1186.973999 15373536703 365 2022-07-07 1187.752075 ... 1185.904053 13014859776 [366 rows x 6 columns] # reindex using in place argument df.set_index('Date', inplace=True) # view result print(df) Open High Low Close Volume Date 2021-07-07 2323.208496 2403.773438 2298.026855 2315.161865 22398345285 2021-07-08 2317.949219 2324.006836 2089.414307 2120.026367 23188123689 2021-07-09 2115.573975 2185.376221 2051.066650 2146.692383 23029574602 2021-07-10 2146.999756 2190.124023 2081.923584 2111.403564 17581542471 2021-07-11 2110.875732 2172.653320 2083.803711 2139.664795 14705386138 ... ... ... ... ... ... 2022-07-03 1066.467407 1083.418945 1044.005615 1073.766968 8557248150 2022-07-04 1073.794312 1152.943848 1048.396362 1151.059082 13670889311 2022-07-05 1150.509766 1165.682739 1086.829956 1134.541016 16195518291 2022-07-06 1134.822266 1193.636719 1116.247559 1186.973999 15373536703 2022-07-07 1187.752075 1187.752075 1165.623169 1185.904053 13014859776 [366 rows x 5 columns]
Here we see our original data loaded in, again being indexed via row number. However, after using the inplace=True
argument, we note the same DataFrame
object is then re-indexed using the Date
column without assignment to another variable. This is convenient in cases where the initial data format is no longer needed.
Final Thoughts
Pandas is an incredible library, even if its syntax and opinionated approaches can be difficult to understand at first. It is a highly efficient tool for analyzing, manipulating, and — as we’ve seen here — performing I/O operations on CSV files! The ability to conveniently and quickly specify by which column a DataFrame should index one’s data is critical — especially in cases of time series data.
Converting a DataFrame to CSV is essential for supporting data permanence after computationally-heavy tasks. The CSV format is an ideal format that allows structured data to persist without requiring the overhead of a database. Another alternative is JSON, which Pandas can readily load and convert to CSV or a DataFrame as well! Check out our article on linear regression in Python to see how such heavy number crunching can be facilitated by DataFrames.