Querying deeply nested and complex JSON data with multiple levels

I am struggling to work out the method required to extract data from deeply nested, complex JSON. I have the following code to obtain the JSON.

import requests
import pandas as pd
import json
import pprint
import seaborn as sns
import matplotlib.pyplot as plt

base_url="https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json"
headers={'User-Agent': 'Myheaderdata'}
first_response=requests.get(base_url,headers=headers)
response_dic=first_response.json()   
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()

This produces output showing the JSON and a Pandas DataFrame. The DataFrame has three columns, with the third (facts) containing a lot of nested data.

What I want to understand is how to navigate into that nested structure to retrieve certain data. For example, I may want to go to the dei level or the us-gaap level and retrieve a particular attribute, say dei > EntityCommonStockSharesOutstanding, and obtain the "label", "value" and "FY" details.
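
To illustrate the kind of path I mean (a sketch only; I am guessing at the exact keys from the printed JSON):

concept = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
print(concept['label'])                  # the concept's label
for fact in concept['units']['shares']:  # one record per filing
    print(fact['val'], fact.get('fy'))   # value and fiscal year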

When I try to use the .get() method as follows:

data = []
for response in response_dic:
    data.append({"EntityCommonStockSharesOutstanding": response.get('EntityCommonStockSharesOutstanding')})
new_df = pd.DataFrame(data)
new_df.head()

I end up with the following AttributeError:

AttributeError                            Traceback (most recent call last)
<ipython-input-15-15c1685065f0> in <module>
      1 data=[]
      2 for response in response_dic:
----> 3     data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
      4 base_df=pd.DataFrame(data)
      5 base_df.head()

AttributeError: 'str' object has no attribute 'get'


Solution 1:[1]

Your loop fails because iterating over a dict yields its keys, so each response is a string such as 'cik' or 'facts', and a str has no .get method. Index into the nested structure instead and use pd.json_normalize:

For example (record_path points at the list of fact records, and meta copies the concept-level fields onto each row):

entity1 = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
entity2 = response_dic['facts']['dei']['EntityPublicFloat']

df1 = pd.json_normalize(entity1, record_path=['units', 'shares'],
                        meta=['label', 'description'])

df2 = pd.json_normalize(entity2, record_path=['units', 'USD'],
                        meta=['label', 'description'])
>>> df1
           end        val                  accn  ...      frame                                    label                                        description
0   2018-10-31  106299106  0001564590-18-028629  ...  CY2018Q3I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
1   2019-02-28  106692030  0001627475-19-000007  ...        NaN  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
2   2019-04-30  107160359  0001627475-19-000015  ...  CY2019Q1I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
3   2019-07-31  110803709  0001627475-19-000025  ...  CY2019Q2I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
4   2019-10-31  112020807  0001628280-19-013517  ...  CY2019Q3I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
5   2020-02-28  113931825  0001627475-20-000006  ...        NaN  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
6   2020-04-30  115142604  0001627475-20-000018  ...  CY2020Q1I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
7   2020-07-31  120276173  0001627475-20-000031  ...  CY2020Q2I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
8   2020-10-31  122073553  0001627475-20-000044  ...  CY2020Q3I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
9   2021-01-31  124962279  0001627475-21-000015  ...  CY2020Q4I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
10  2021-04-30  126144849  0001627475-21-000022  ...  CY2021Q1I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...

[11 rows x 10 columns]


>>> df2
          end         val                  accn    fy  fp  form       filed      frame                label                                        description
0  2018-10-03   900000000  0001627475-19-000007  2018  FY  10-K  2019-03-07  CY2018Q3I  Entity Public Float  The aggregate market value of the voting and n...
1  2019-06-28  1174421292  0001627475-20-000006  2019  FY  10-K  2020-03-02  CY2019Q2I  Entity Public Float  The aggregate market value of the voting and n...
2  2020-06-30  1532720862  0001627475-21-000015  2020  FY  10-K  2021-02-24  CY2020Q2I  Entity Public Float  The aggregate market value of the voting and n...
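
To get just the fields asked for in the question (label, value and fiscal year), select those columns afterwards; fy is visible in df2 and should be among the columns elided by the ellipsis in df1:

df1_summary = df1[['label', 'val', 'fy']]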

Solution 2:[2]

I came across this same issue. While the solution above meets the requirements of your question, it may be better to flatten the entire dictionary and represent all the columns in one long DataFrame.

That DataFrame can be used as a building block for a database, or simply queried as you wish (see the query example after the code below).

The facts key can contain more sub-keys than just dei and us-gaap. Also, if you want to extract multiple XBRL tags at a time from within the us-gaap dictionary, you will have a pretty difficult time.

The solution below might not be the prettiest or most efficient, but it gets all the levels of the dictionary along with all the facts and values.

import requests
import pandas as pd
import json
from flatten_json import flatten


headers= {'User-Agent':'My User Agent 1.0', 'From':'something somethin'}
file = 'https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json'
data = json.loads(requests.get(file, headers = headers).text)

#get the cik and name of the entity
Cik_Name = dict(list(data.items())[0: 2]) 
Cik_Name_df = pd.DataFrame(Cik_Name,index=[0])

#Flatten file
f = flatten(data['facts'],'|')
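#at this point f is one flat dict whose keys join every level with '|',
#roughly like this (illustrative, matching the records shown above):
#  'dei|EntityCommonStockSharesOutstanding|label': 'Entity Common Stock, Shares Outstanding'
#  'dei|EntityCommonStockSharesOutstanding|units|shares|0|end': '2018-10-31'
#  'dei|EntityCommonStockSharesOutstanding|units|shares|0|val': 106299106
#splitting these keys on '|' below recovers the levels as Col_split0 ... Col_split5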
#drop into a dataframe and transpose
f = pd.DataFrame(f,index=[0]).T
#reset index
f = f.reset_index(level=0)
#rename columns
f.rename(columns={'index': 'Col_split', 0:'values'}, inplace= True)
#split Col_split column by delimiter
f = f.join(f['Col_split'].str.split(pat='|',expand=True).add_prefix('Col_split'))
#drop the original Col_split column but keep Col_split4: it is the record
#index within each units list and is used to pivot later
f = f.drop(['Col_split'], axis = 1)

#move values column to the end
f = f[[c for c in f if c not in ['values']] + ['values']]
#create groups based on Col_split2 containing the value label 
f['groups'] = f["Col_split2"].eq('label').cumsum()

df_list = []
#loop to break df by group and create new columns for label & description
for i, g in f.groupby('groups'):
    label = g['values'].iloc[0]        #first row of each group is the label
    description = g['values'].iloc[1]  #second row is the description
    g = g.iloc[2:].copy()              #keep only the fact rows
    g['label'] = label
    g['description'] = description
    df_list.append(g)
    
final_df = pd.concat(df_list)  
final_df.rename(columns={'Col_split0':'facts', 'Col_split1':'tag','Col_split3':'units'}, inplace=True)

#Col_split4 is the record index within each units list; pivoting on it keeps
#'start' and 'end' from the same record on the same row
final_df = final_df[['facts','tag','label','description','units','Col_split4','Col_split5','values']]

final_df = final_df.pivot(index = ['facts','tag','label','description','units','Col_split4'], columns = 'Col_split5', values = 'values').reset_index()
final_df['cik'] = Cik_Name_df['cik'].iloc[0]
final_df['entityName'] = Cik_Name_df['entityName'].iloc[0]
final_df = final_df[['cik','entityName','facts','tag','label','description','units','accn','start','end','filed','form','fp','frame','fy','val']]

print(final_df)  
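
Once final_df is built it can be queried like any flat table. For example, a quick sketch using the tag from the question:

shares = final_df[(final_df['facts'] == 'dei') &
                  (final_df['tag'] == 'EntityCommonStockSharesOutstanding')]
print(shares[['label', 'val', 'fy']])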

Please feel free to make improvements as you see fit and share them with the community.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source

[1] Solution 1: Corralien
[2] Solution 2: SQL_Roundabout