Importing Packages

In [1]:
import json
import pandas as pd

Data source

Data sample

{
   "episodes":[
      {
         "seasonNum":1,
         "episodeNum":1,
         "episodeTitle":"Winter Is Coming",
         "episodeLink":"/title/tt1480055/",
         "episodeAirDate":"2011-04-17",
         "episodeDescription":"Jon Arryn, the Hand of the King, is dead. King Robert Baratheon plans to ask his oldest friend, Eddard Stark, to take Jon's place. Across the sea, Viserys Targaryen plans to wed his sister to a nomadic warlord in exchange for an army.",
         "openingSequenceLocations":[
            "King's Landing",
            "Winterfell",
            "The Wall",
            "Pentos"
         ],
         "scenes":[
            {
               "sceneStart":"0:00:40",
               "sceneEnd":"0:01:45",
               "location":"The Wall",
               "subLocation":"Castle Black",
               "characters":[
                  {"name":"Gared"},
                  {"name":"Waymar Royce"},
                  {"name":"Will"}
               ]
            }
         ]
      }
   ]
}

Reading data

In [2]:
# The context manager closes the file handle automatically;
# the parsed data stays in memory as a Python dict.
with open('../data/episodes.json') as f:
    data = json.load(f)

Parsing the JSON File into Tidy Format

Tidy data sets have a consistent structure, which makes them easy to manipulate, model, and visualize. The main idea is to arrange the data so that each variable is a column and each observation (or case) is a row.
Source: https://www.wikiwand.com/en/Tidy_data

In [3]:
# Each row is one observation: a character appearing in a scene.
data_list = []

for episode in data['episodes']:
    seasonNum = episode['seasonNum']
    episodeNum = episode['episodeNum']
    for scene in episode['scenes']:
        sceneStart = scene['sceneStart']
        sceneEnd = scene['sceneEnd']
        # Repeat the episode and scene fields for every character in the scene.
        for character in scene['characters']:
            characterName = character['name']
            row = [seasonNum, episodeNum, characterName, sceneStart, sceneEnd]
            data_list.append(row)
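
As an alternative to the nested loops above, pandas can flatten this kind of structure with `pd.json_normalize`. The following is a minimal sketch, not the method used in this notebook. The `sample` dict is a hypothetical stand-in with the same shape as the data sample, and the sketch assumes every scene has a `characters` list.

```python
import pandas as pd

# Hypothetical in-memory sample with the same shape as episodes.json.
sample = {
    "episodes": [
        {
            "seasonNum": 1,
            "episodeNum": 1,
            "scenes": [
                {
                    "sceneStart": "0:00:40",
                    "sceneEnd": "0:01:45",
                    "characters": [{"name": "Gared"}, {"name": "Will"}],
                }
            ],
        }
    ]
}

# record_path walks episodes -> scenes -> characters (one row per character);
# meta pulls fields from the episode level and from the scene level.
flat = pd.json_normalize(
    sample["episodes"],
    record_path=["scenes", "characters"],
    meta=["seasonNum", "episodeNum", ["scenes", "sceneStart"], ["scenes", "sceneEnd"]],
)

# Rename to the column names used in this notebook and fix the column order.
flat = flat.rename(
    columns={
        "seasonNum": "season_num",
        "episodeNum": "episode_num",
        "name": "character_name",
        "scenes.sceneStart": "scene_start_time",
        "scenes.sceneEnd": "scene_end_time",
    }
)[["season_num", "episode_num", "character_name", "scene_start_time", "scene_end_time"]]
```

The explicit loops are easier to adapt when some scenes lack fields; `json_normalize` is shorter when the structure is regular.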

Saving parsed data into a pandas DataFrame

In [4]:
df = pd.DataFrame(columns=['season_num', 'episode_num', 'character_name', 'scene_start_time', 'scene_end_time'], 
                  data=data_list)

Final Table

In [5]:
display(df.head())
   season_num  episode_num character_name scene_start_time scene_end_time
0           1            1          Gared          0:00:40        0:01:45
1           1            1   Waymar Royce          0:00:40        0:01:45
2           1            1           Will          0:00:40        0:01:45
3           1            1          Gared          0:01:45        0:03:24
4           1            1   Waymar Royce          0:01:45        0:03:24
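
The scene times in the table are still plain strings. If they are needed for arithmetic, one possible follow-up is to convert them with `pd.to_timedelta`. This is a sketch only: `df_sample` and the `scene_seconds` column are hypothetical names, and the two rows are copied from the output above.

```python
import pandas as pd

# Hypothetical two-row frame mirroring columns of the final table.
df_sample = pd.DataFrame(
    {
        "character_name": ["Gared", "Gared"],
        "scene_start_time": ["0:00:40", "0:01:45"],
        "scene_end_time": ["0:01:45", "0:03:24"],
    }
)

# Parse the H:MM:SS strings into timedeltas.
start = pd.to_timedelta(df_sample["scene_start_time"])
end = pd.to_timedelta(df_sample["scene_end_time"])

# Scene length in seconds for each character appearance.
df_sample["scene_seconds"] = (end - start).dt.total_seconds()
```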