'Major rearrangement of pandas DataFrame containing nested lists and dictionaries ( CFBD (College Football Database))
The College Football Database (cfbd) contains all team ranks for each week of every college football season going back to 1937.I am trying to set up data from the cfbd in a way that will allow me to examine how a particular team's AP Top 25 ranking changes week to week in multiple seasons. Ideally, I would like to organize this data like so:
Team | Season | Week1 Rank | Week2 Rank | 16 more weeks worth of ranks | ... | ||
---|---|---|---|---|---|---|---|
Alabama | 1937 | 4 | 3 | etc | etc | ||
Alabama | 1938 | 3 | 6 | etc | |||
etc | |||||||
Wyoming | 2017 | 24 | nr | ||||
Wyoming | 2020 | nr | 25 |
So the output table will have a row for every school that has ever been ranked for each year that it has been ranked.
After running pip install cfbd
I set up the API:
import pandas as pd
import cfbd
configuration = cfbd.Configuration()
# The cfbd API requires users to sign up for a free
# private key in order to access the data.
# https://collegefootballdata.com/key
configuration.api_key['Authorization'] = '[MY_SECRET_KEY]'
configuration.api_key_prefix['Authorization'] = 'Bearer'
api_rankings = cfbd.RankingsApi(cfbd.ApiClient(configuration))
#Initialize blank dataframe
allrankings_df = pd.DataFrame()
#year is required in the API call, so I have to collect each year separately
for yr in range(1936,2022):
rankings = api_rankings.get_rankings(year = yr) #get the data
rankings_df = pd.DataFrame.from_records([p.to_dict() for p in rankings])
allrankings_df2 = allrankings_df2.append(rankings_df, ignore_index=True)
This gives me a dataframe structured like so: |season|season_type|week|List of poll objects| |------|-----------|----|-------------------| |1936|regular|1|object|
Those poll objects have a Poll Name ("AP Top 25", "Coaches Poll") and ranking data. Like this, except with four or five different Polls.
{'poll': 'Coaches Poll',
'ranks': [{'rank': 1,
'school': 'Alabama',
'conference': 'SEC',
'firstPlaceVotes': 44,
'points': 1601},
{'rank': 2,
'school': 'Clemson',
'conference': 'ACC',
'firstPlaceVotes': 14,
'points': 1536},
The API describes all of this like so:
[
{
"season": 0,
"seasonType": "string",
"week": 0,
"polls": [
{
"poll": "string",
"ranks": [
{
"rank": 0,
"school": "string",
"conference": "string",
"firstPlaceVotes": 0,
"points": 0
}
]
}
]
}
]
Phew. I can sort of picture how to do this by iterating through every year & week & polling object and building a new table piece by piece. But I also have read many times that I shouldn't do that - that I should be vectorizing.
I know at this point I should share what I have tried so far, but to be honest, I am nowhere close. Can anyone point me in the right direction? I am willing to bang my head on this, but I can't even tell how I should be banging. Do I need some dataframe methods like melt
or ravel
? Or should I be trying to set this up with Boolean dataframe referencing?
references: https://api.collegefootballdata.com/api/docs/?url=/api-docs.json#/rankings/getRankings https://pypi.org/project/cfbd/
Solution 1:[1]
import pandas as pd
import cfbd
configuration = cfbd.Configuration()
# The cfbd API requires users to sign up for a free
# private key in order to access the data.
# https://collegefootballdata.com/key
configuration.api_key['Authorization'] = 'IXRIjbjaO/WQVVLk/Yj2tA1yB2K1K2ZprNcCmueKdHKMHbyst4fYD6lVbp1xeHXB'
configuration.api_key_prefix['Authorization'] = 'Bearer'
api_rankings = cfbd.RankingsApi(cfbd.ApiClient(configuration))
#Initialize blank dataframe
allrankings_df = pd.DataFrame()
#year is required in the API call, so I have to collect each year separately
for yr in range(1936,2022):
print(yr)
rankings = api_rankings.get_rankings(year = yr) #get the data
rankings = [p.to_dict() for p in rankings]
# Flattens the json
rankings_df = pd.json_normalize(rankings,
record_path=['polls', 'ranks'],
meta = ['season','week'])
rankings_df = rankings_df.drop_duplicates(subset=['school','week'], keep='first')
# Pivot so Week rankings are the columns
rankings_df = rankings_df.pivot(
index = ['school','season'],
columns = 'week',
values = 'rank').add_prefix('Week').add_suffix(' Rank').reset_index(drop=False)
rankings_df = rankings_df.rename(columns={'school':'Team',
'season':'Season'})
allrankings_df = pd.concat([allrankings_df, rankings_df], axis=0)
allrankings_df = allrankings_df.sort_values(['Team', 'Season']).reset_index(drop=True)
Output:
First 20 rows:
print(allrankings_df.head(20).to_string())
week Team Season Week1 Rank Week2 Rank Week3 Rank Week4 Rank Week5 Rank Week6 Rank Week7 Rank Week8 Rank Week9 Rank Week10 Rank Week11 Rank Week12 Rank Week13 Rank Week14 Rank Week15 Rank Week16 Rank Week17 Rank
0 Abilene Christian 2012 10.0 9.0 20.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Air Force 1958 NaN NaN NaN NaN NaN 14.0 13.0 10.0 10.0 9.0 8.0 NaN NaN NaN NaN NaN NaN
2 Air Force 1959 15.0 NaN 18.0 18.0 17.0 NaN NaN 18.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Air Force 1969 NaN NaN NaN NaN NaN 20.0 19.0 19.0 20.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 Air Force 1970 NaN NaN 20.0 10.0 8.0 7.0 7.0 7.0 9.0 13.0 10.0 12.0 11.0 11.0 NaN NaN NaN
5 Air Force 1971 NaN NaN NaN NaN NaN NaN 20.0 18.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 Air Force 1972 NaN NaN NaN NaN 19.0 16.0 16.0 NaN 19.0 NaN NaN NaN NaN NaN NaN NaN NaN
7 Air Force 1983 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18.0 17.0 16.0 NaN NaN
8 Air Force 1985 NaN NaN NaN NaN 19.0 17.0 13.0 10.0 8.0 7.0 5.0 4.0 13.0 11.0 10.0 10.0 NaN
9 Air Force 1989 NaN NaN NaN NaN 24.0 20.0 17.0 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 Air Force 1995 NaN NaN NaN 21.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 Air Force 1997 NaN NaN NaN NaN NaN NaN 23.0 19.0 18.0 NaN NaN NaN NaN NaN 24.0 23.0 23.0
12 Air Force 1998 NaN NaN NaN 23.0 NaN NaN NaN NaN NaN 25.0 23.0 20.0 18.0 17.0 16.0 NaN NaN
13 Air Force 1999 NaN NaN NaN NaN 24.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 Air Force 2002 NaN NaN NaN NaN NaN NaN 25.0 19.0 15.0 19.0 NaN NaN NaN NaN NaN NaN NaN
15 Air Force 2003 NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 Air Force 2010 NaN NaN NaN NaN NaN 25.0 23.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 Air Force 2019 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 24.0 NaN
18 Alabama 1936 NaN NaN 14.0 4.0 8.0 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 Alabama 1937 2.0 3.0 2.0 3.0 3.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
....
[3954 rows x 19 columns]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | chitown88 |