Tabulate JSON Data in Python Using Pandas

Tabulating data is the process in which different segments are aligned in vertical columns and horizontal rows. This is a common presentation format for things like tables, spreadsheets, calendars, and business reports. Tabulating JSON data in Python can be easy when using the Pandas DataFrame object to do the heavy lifting.

Table of Contents show

JavaScript Object Notation (JSON) is a standardized data format common among web-based languages, Representational State Transfer (REST) APIs, and inter-application data messaging. Practically, the ability of JSON to be tabulated depends on the structure of the data. In this article, we’ll take a look at how to quickly tabulate JSON in Python using Pandas and also some cases where results might be poor.

Setup: Sample JSON Data

Before we begin tabulating JSON-formatted data let’s first create some sample data. We will create records of 25 random employees using the create_random_people_data function available via Github. We will then save this data to a local file named sample.json via the following code:

import json

# Generate data for 25 people
people = create_random_people_data(25)

# Write the output date as a json-formatted file
with open("sample-data.json", 'w')as file:
    file.write(json.dumps(people))

This saves a file with the following data (some records removed for brevity):

[
  {
    "id": 1,
    "first_name": "Mary",
    "last_name": "Smith",
    "location": {
      "City": "New York",
      "State": "New York"
    },
    "age": 32
  },
  ...
  {
    "id": 25,
    "first_name": "Pat",
    "last_name": "Smith",
    "location": {
      "City": "Denver",
      "State": "Colorado"
    },
    "age": 32
  }
]

Tabulate JSON Data

To tabulate our JSON data we will use the Pandas library and make use of the DataFrame object. To approach this, the from_dict method can be used in conjunction with the json.load method as such:

import pandas as pd
import json

# open file
with open('sample.json', 'r')as file:
    df = pd.DataFrame.from_dict(json.load(file))

# print first 5 records via head() method
print(df.head())

   id first_name last_name                                        location  age
0   1       Mary     Smith       {'City': 'New York', 'State': 'New York'}   32
1   2       John     Jones  {'City': 'Los Angeles', 'State': 'California'}   18
2   3        Pat     Smith         {'City': 'Denver', 'State': 'Colorado'}   32
3   4       John     Jones  {'City': 'Los Angeles', 'State': 'California'}   42
4   5       Mary    Jacobs         {'City': 'Denver', 'State': 'Colorado'}   42

This is a good start however there are some issues. The id field should likely be used to index our data — this is a minor concern. The more problematic issue is that our location data is still shown as originally formatted, now reflecting Python’s dict format (from being parsed via json.load). To address these issues, we will use the normalize_json and set_index methods of the DataFrame as such:

import pandas as pd
import json

# open file
with open('sample.json', 'r')as file:
    
    # load as dataframe using json_normalize method
    df = pd.json_normalize(json.load(file))

# set the index to employee id
df.set_index(['id'], inplace=True)

# print first 5 records
print(df.head())

# output
   first_name last_name  age location.City location.State
id                                                       
1        Mary     Smith   32      New York       New York
2        John     Jones   18   Los Angeles     California
3         Pat     Smith   32        Denver       Colorado
4        John     Jones   42   Los Angeles     California
5        Mary    Jacobs   42        Denver       Colorado

Notice that Pandas has allowed us to read the nested JSON data accurately, creating appropriate column names location.City and location.State where appropriate.

Note: It’s always a good idea to check for missing values in Pandas prior to deeper analysis. Check out this article for some tips on handling missing data.

Parsing Lists in JSON

An interesting case of parsing JSON with Pandas, especially when tabulated views are prioritized, is that of lists. Imagine the data above had an entry for hobbies as such:

{
  "id": 2,
  "first_name": "Pat",
  "last_name": "Williams",
  "location": {
    "City": "New York",
    "State": "New York"
  },
  "age": 55,
  "hobbies": [
    "singing",
    "painting",
    "programming",
    "fishing"
  ]
}

Here we see a list format for data in the hobbies field. If we parse this data using the pd.normalize_json function as before, we get the following output:

   first_name last_name  age                                         hobbies  \
id                                                                             
1        John    Jacobs   60                                       [singing]   
2         Pat  Williams   55       [singing, painting, programming, fishing]   
3        Mary    Jacobs   18                                       [singing]   
4       Alice     Jones   24  [singing, painting, fishing, dancing, running]   
5         Pat  Anderson   24                 [programming, fishing, singing]

Here we see a string-formatted version of our list-formatted data for the hobbies field for each random person. At first glance, this might seem like a parsing error. This is not the case however and merely how a list of values would be presented in a column format.

Note: the last two columns location.City and location.State have been removed here for brevity.

These issues would likely need to be addressed on a case-by-case basis whereby one instructs Pandas on how to display such data. For example, would it be ideal to have a hobby_1, hobby_2, hobby_3, hobby_4, and hobby_5 column? What if someone were to list more than 5 hobbies? Would you limit the number of hobbies a person could list? Would you simply ignore any hobbies listed beyond 5?

A generalized approach for formatting list data would be to remove the brackets and simply convert to CSV-format data as such:

# map a function to join all list items via a single ',' character
df = df.applymap(lambda x: ",".join(x) if type(x) is list else x)

This would convert the hobbies column into the following format:

   first_name last_name  age                                   hobbies  \
id                                                                       
1        John    Jacobs   60                                   singing   
2         Pat  Williams   55      singing,painting,programming,fishing   
3        Mary    Jacobs   18                                   singing   
4       Alice     Jones   24  singing,painting,fishing,dancing,running   
5         Pat  Anderson   24               programming,fishing,singing

Whether this would be an ideal format would be a decision made by the developer. Personally, I feel dropping the brackets would allow for more flexible use among other tabulated formats — such as the case where the DataFrame was exported as a CSV-formatted file. Additionally, the purpose of brackets may not be well-recognized among non-programmer types in cases where data is intended to be exported and used by more general consumers such as business analysts.

Final Thoughts

Tabulating JSON in Python is straightforward when one leverages the power of Pandas. The normalize_json the method helps ensure that parsing nested JSON results in appropriate column names whereby nested keys are converted to dot-addressable column names.

The handling of certain data like lists can introduce rigidity into the approach one takes in parsing nested JSON. As such, many cases of parsing nested JSON into tabulated formats will require a process of data inspection followed by preferred case-by-case approaches. Such is the life of a programmer!