Querying deeply nested and complex JSON data with multiple levels
I am struggling to break down the method required to extract data from deeply nested complex JSON data. I have the following code to obtain the JSON.
import requests
import pandas as pd

base_url = "https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json"
headers = {'User-Agent': 'Myheaderdata'}

# Fetch the companyfacts payload and parse it as JSON
first_response = requests.get(base_url, headers=headers)
response_dic = first_response.json()
print(response_dic)

# Load the top-level dictionary straight into a DataFrame
base_df = pd.DataFrame(response_dic)
base_df.head()
This prints the JSON and produces a Pandas DataFrame. The DataFrame has three columns, with the third (facts) containing a lot of nested data.
What I want to understand is how to navigate into that nested structure, to retrieve certain data. For example, I may want to go to the DEI level, or the US GAAP level and retrieve a particular attribute. Let's say DEI > EntityCommonStockSharesOutstanding and obtain the "label", "value" and "FY" details.
When I try to use the get method as follows:
data = []
for response in response_dic:
    data.append({"EntityCommonStockSharesOutstanding": response.get('EntityCommonStockSharesOutstanding')})
new_df = pd.DataFrame(data)
new_df.head()
I end up with the following AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-15-15c1685065f0> in <module>
1 data=[]
2 for response in response_dic:
----> 3 data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
4 base_df=pd.DataFrame(data)
5 base_df.head()
AttributeError: 'str' object has no attribute 'get'
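For context on that traceback: iterating over a dict yields its keys, which here are the top-level strings, and a str has no .get method. A minimal sketch with a dummy stand-in for the payload (the key names follow the companyfacts response; the values are invented):

```python
# Dummy stand-in for the companyfacts payload (values invented)
response_dic = {"cik": 1627475, "entityName": "Example Co", "facts": {}}

# Iterating a dict yields its keys, not the nested values
keys = [item for item in response_dic]
print(keys)  # ['cik', 'entityName', 'facts'] - all strings
print(all(isinstance(item, str) for item in response_dic))  # True
```

Calling .get on one of those strings is what raises the AttributeError; the nested dicts live in the values, e.g. response_dic['facts'].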
Solution 1:[1]
Use pd.json_normalize. For example:
entity1 = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
entity2 = response_dic['facts']['dei']['EntityPublicFloat']
df1 = pd.json_normalize(entity1, record_path=['units', 'shares'],
meta=['label', 'description'])
df2 = pd.json_normalize(entity2, record_path=['units', 'USD'],
meta=['label', 'description'])
>>> df1
end val accn ... frame label description
0 2018-10-31 106299106 0001564590-18-028629 ... CY2018Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
1 2019-02-28 106692030 0001627475-19-000007 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
2 2019-04-30 107160359 0001627475-19-000015 ... CY2019Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
3 2019-07-31 110803709 0001627475-19-000025 ... CY2019Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
4 2019-10-31 112020807 0001628280-19-013517 ... CY2019Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
5 2020-02-28 113931825 0001627475-20-000006 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
6 2020-04-30 115142604 0001627475-20-000018 ... CY2020Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
7 2020-07-31 120276173 0001627475-20-000031 ... CY2020Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
8 2020-10-31 122073553 0001627475-20-000044 ... CY2020Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
9 2021-01-31 124962279 0001627475-21-000015 ... CY2020Q4I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
10 2021-04-30 126144849 0001627475-21-000022 ... CY2021Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
[11 rows x 10 columns]
>>> df2
end val accn fy fp form filed frame label description
0 2018-10-03 900000000 0001627475-19-000007 2018 FY 10-K 2019-03-07 CY2018Q3I Entity Public Float The aggregate market value of the voting and n...
1 2019-06-28 1174421292 0001627475-20-000006 2019 FY 10-K 2020-03-02 CY2019Q2I Entity Public Float The aggregate market value of the voting and n...
2 2020-06-30 1532720862 0001627475-21-000015 2020 FY 10-K 2021-02-24 CY2020Q2I Entity Public Float The aggregate market value of the voting and n...
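To get back to the original ask (the "label", "value" and "FY" details), those end up as plain columns of the normalized frame, so a simple column selection is enough. A sketch with a one-row stand-in for df1 (column names follow the companyfacts fields shown above; the values are invented):

```python
import pandas as pd

# One-row stand-in for the normalized frame df1 (values invented)
df1 = pd.DataFrame([
    {"end": "2019-04-30", "val": 107160359, "fy": 2019,
     "label": "Entity Common Stock, Shares Outstanding"},
])

# Select just the fields the question asks for
subset = df1[["label", "val", "fy"]]
print(subset)
```

Note that fy is only present for facts whose filings report it, so on real data some rows may hold NaN there.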
Solution 2:[2]
I came across this same issue. While the solution provided meets the requirements of your question, it may be better to flatten the entire dictionary and have all the columns represented in one long data frame.
That data frame can be used as a building block for a DB, or it can simply be queried as you wish.
The facts key can have more sub-keys than just dei and us-gaap. Also, within the us-gaap dictionary, extracting multiple XBRL tags at a time is quite difficult.
The solution below might not be the prettiest or most efficient, but it captures all the levels of the dictionary along with all the facts and values.
import requests
import pandas as pd
import json
from flatten_json import flatten
headers= {'User-Agent':'My User Agent 1.0', 'From':'something somethin'}
file = 'https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json'
data = json.loads(requests.get(file, headers = headers).text)
#get the cik and name of the entity
Cik_Name = dict(list(data.items())[0: 2])
Cik_Name_df = pd.DataFrame(Cik_Name,index=[0])
#Flatten file
f = flatten(data['facts'],'|')
#drop into a dataframe and transpose
f = pd.DataFrame(f,index=[0]).T
#reset index
f = f.reset_index(level=0)
#rename columns
f.rename(columns={'index': 'Col_split', 0:'values'}, inplace= True)
#split Col_split column by delimiter
f = f.join(f['Col_split'].str.split(pat='|',expand=True).add_prefix('Col_split'))
#drop original Col_split column
f = f.drop(['Col_split','Col_split4'],axis = 1)
#move values column to the end
f = f[[c for c in f if c not in ['values']] + ['values']]
#create groups based on Col_split2 containing the value label
f['groups'] = f["Col_split2"].eq('label').cumsum()
df_list = []
#loop to break df by group and create new columns for label & description
for i, g in f.groupby('groups'):
    label = g['values'].iloc[0]
    description = g['values'].iloc[1]
    g.drop(index=g.index[:2], axis=0, inplace=True)
    g['label'] = label
    g['description'] = description
    df_list.append(g)
final_df = pd.concat(df_list)
final_df.rename(columns={'Col_split0':'facts', 'Col_split1':'tag','Col_split3':'units'}, inplace=True)
final_df = final_df[['facts','tag','label','description','units','Col_split5','values']]
final_df['cum _ind'] = final_df["Col_split5"].eq('end').cumsum()
final_df = final_df.pivot(index = ['facts','tag','label','description','units','cum _ind'] , columns = 'Col_split5' ,values='values').reset_index()
final_df['cik'] = Cik_Name_df['cik'].iloc[0]
final_df['entityName'] = Cik_Name_df['entityName'].iloc[0]
final_df = final_df[['cik','entityName','facts','tag','label','description','units','accn','start','end','filed','form','fp','frame','fy','val']]
print(final_df)
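Once everything sits in one long frame, an individual tag can be pulled out with a plain boolean filter instead of another round of dictionary navigation. A sketch against a two-row stand-in for final_df (column names follow the rename above; the rows are invented):

```python
import pandas as pd

# Two-row stand-in for the flattened final_df (rows invented)
final_df = pd.DataFrame([
    {"facts": "dei", "tag": "EntityCommonStockSharesOutstanding",
     "val": "107160359"},
    {"facts": "us-gaap", "tag": "Revenues", "val": "123000000"},
])

# Filter the long frame down to a single XBRL tag
shares = final_df[final_df["tag"] == "EntityCommonStockSharesOutstanding"]
print(shares)
```

The same pattern extends to several tags at once via final_df["tag"].isin([...]), which is where the long format pays off over per-tag json_normalize calls.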
Please feel free to make improvements as you see fit and share them with the community.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
| Solution 2 | SQL_Roundabout |