How do I get quarterly S&P500 constituents in Python from the detailed change data?

I want to use S&P500 company information to calculate an index. However, the companies in the S&P500 change frequently. I want to know the constituents for each quarter, but I can only get the most recent list from Wikipedia, using the code below:

import pandas as pd

table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = table[0]
tickers = df.Symbol.to_list()

'tickers' is a list that contains the tickers of all the companies currently in the S&P500:

['MMM',
 'ABT',
 'ABBV',
 'ABMD',
 'ACN',
 'ATVI',
 'ADBE',
 'AMD',
 'AAP',
 'AES',
 'AFL',
 'A',
 'APD',
 'AKAM',
 'ALK',
 'ALB',
 'ARE',
 ...]

I have now found a table that contains the historical changes to the S&P500 constituents. It has dates, changes, and tickers for all the companies: '1' means the company was added to the list, and '-1' means the company was removed from the list. I want to use this information, particularly 'DateAfterChange', to get the lists of companies in the S&P500 for the past 20 quarters (5 years). The complete table can be found here: https://docs.google.com/spreadsheets/d/1xkq2kkf-iElKl9BhEwqQx3Pgkh0B9dFKJpefQ4oOI_g/edit#gid=455032226.

DateBeforeChange    DateAfterChange Change  Ticker
20200623            20200624        1       TMUSR
20200618            20200619        1       BIO
20200618            20200619        1       TDY
20200618            20200619        1       TYL
20200618            20200619        -1      ADS
20200618            20200619        -1      HOG
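
For illustration, the sample rows above could be loaded and labelled by calendar quarter roughly as follows. This is only a sketch: the inline string stands in for the spreadsheet, and the names `raw`, `changes`, `Quarter` and `summary` are made up for the example.

```python
import io

import pandas as pd

# Sample rows copied from the table above (whitespace-separated).
raw = """DateBeforeChange DateAfterChange Change Ticker
20200623 20200624 1 TMUSR
20200618 20200619 1 BIO
20200618 20200619 1 TDY
20200618 20200619 1 TYL
20200618 20200619 -1 ADS
20200618 20200619 -1 HOG
"""

changes = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Dates are stored as integers such as 20200619, so go through strings.
changes["DateAfterChange"] = pd.to_datetime(
    changes["DateAfterChange"].astype(str), format="%Y%m%d"
)

# Label every change with the calendar quarter it took effect in.
changes["Quarter"] = changes["DateAfterChange"].dt.to_period("Q")

# Tickers added (1) and removed (-1) in each quarter.
summary = changes.groupby(["Quarter", "Change"])["Ticker"].apply(list)
print(summary)
```

All six sample rows fall in the second quarter of 2020, so `summary` has one entry for the additions and one for the removals of that quarter.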

My expected output could be separate lists, or a combined format like this:

2019-Q1   2019-Q2   2019-Q3    2019-Q4
A         B         C          D
B         C         D          F
C         D         E          E
D         E         F          G 
E         F         G          H
...

My idea is to start from the most recent list of companies, split the dates in the change data into quarters, and then work backwards in time: add back the companies that were removed and remove those that were added. But I'm not sure how to do that in Python. Can anyone please help?



Solution 1:[1]

This method works:

import pandas as pd

# current list
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = table[0]
tickers = df.Symbol.to_list()
# your file of changes
change = pd.read_excel("sp500change.xlsx")
# convert DateAfterChange to datetime and set it as the index, sorted newest first
change["DateAfterChange"] = pd.to_datetime(change["DateAfterChange"], format="%Y%m%d")
change.set_index("DateAfterChange", inplace=True)
change = change.sort_index(ascending=False)
# group by quarter (labelled by quarter end), listing the tickers added and dropped
# (pandas 2.2+ prefers the alias "QE" over "Q" here)
change = change.groupby([pd.Grouper(freq="Q"), "Change"])["Ticker"].agg(list).to_frame()
# make DateAfterChange the index again, rename the change codes and pivot them into columns
change = change.reset_index(drop=False).set_index("DateAfterChange")
change["Change"] = change["Change"].map({-1: "drop", 1: "add"})
change = change.pivot(columns="Change")
change.columns = change.columns.droplevel(0)

# tickers over time, walking backwards from today's list
# (Series.append was removed in pandas 2.0, so the lists are collected in a dict)
members = set(tickers)
history = {pd.to_datetime("today"): tickers}
for quarter_end, row in change.sort_index(ascending=False).iterrows():
    # a quarter with only additions (or only deletions) holds NaN instead of a list
    added = row["add"] if isinstance(row["add"], list) else []
    dropped = row["drop"] if isinstance(row["drop"], list) else []
    members = (members | set(dropped)) - set(added)
    history[quarter_end] = sorted(members)
tick_series = pd.Series(history)

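The for loop works backwards in time: it takes the previous list (that is, the more recent one), adds the tickers that were dropped during the quarter, and removes those that were added during the quarter. Sets are used so that the "add" tickers can be subtracted from the combined "more recent + drop" list without leaving duplicates. Because the groups are labelled by quarter end, the list stored under each label is the membership before that quarter's changes, i.e. as of the start of that quarter.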
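
The backward walk can be checked on a tiny made-up example. The sketch below is not the code above but an equivalent formulation with plain sets. The tickers, the dates and the helper `roll_back` are hypothetical, and quarters without any change simply do not appear in the result.

```python
import pandas as pd

# Hypothetical toy data: today's membership and a few dated changes.
current = {"B", "C", "D", "E"}
changes = pd.DataFrame(
    {
        "DateAfterChange": pd.to_datetime(
            ["2020-05-10", "2020-05-10", "2020-02-03", "2020-02-03"]
        ),
        "Change": [1, -1, 1, -1],
        "Ticker": ["E", "A", "D", "Z"],
    }
)
changes["Quarter"] = changes["DateAfterChange"].dt.to_period("Q")


def roll_back(current, changes):
    """Map each quarter to the sorted tickers held at the START of that quarter."""
    members = set(current)
    result = {}
    # Walk from the most recent quarter back to the oldest one.
    for quarter in sorted(changes["Quarter"].unique(), reverse=True):
        rows = changes[changes["Quarter"] == quarter]
        added = set(rows.loc[rows["Change"] == 1, "Ticker"])
        dropped = set(rows.loc[rows["Change"] == -1, "Ticker"])
        # Undo the quarter: put back what was dropped, take out what was added.
        members = (members | dropped) - added
        result[str(quarter)] = sorted(members)
    return result


by_quarter = roll_back(current, changes)

# One column per quarter; wrapping each list in a Series pads unequal lengths with NaN.
wide = pd.DataFrame({q: pd.Series(t) for q, t in by_quarter.items()})
print(wide)
```

With this toy input, `by_quarter` maps "2020Q2" to A, B, C, D and "2020Q1" to A, B, C, Z, and `wide` has one column per quarter, similar to the combined format asked for in the question.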

Hopefully you have found a solution by now anyway, and haven't waited for 2 years...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rawson