'Pandas Rolling window to calculate sum of the same items of the last n days
Following up with this question, now I would like to calculate the sum/mean of a different column given the same grouping on a rolling window.
Here is the code snippet to set up. I would like to calculate the sum/mean of earnings of each person per row for their past 30 days. Also, since my data is quite big (the real data has 1.7 mil rows), any approach with apply
will not work, I presume.
Pseudo:
- For each row, aggregate data of the past 30 days from the
Date
- Filter data only of the same
Name
- Calculate
sum
ofEarning
and return it to the row
d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
'Date': ['08/01/2021',
'27/01/2021',
'05/02/2021',
'10/02/2021',
'17/02/2021',
'18/02/2021',
'20/02/2021',
'21/02/2021',
'22/02/2021',
'29/03/2021'],
'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
Solution 1:[1]
This answer is based on @jezrael post so thank you very much for it. The difference is that this solution
- uses left join to make sure we won't be changing the shape of the original data
- filters out the duplicates in case there are the same ['Date', 'Name'] in the data.
def sum_of_last_n_days(df: pd.DataFrame, identifier: str, timestamp: str, sum_col: str, delta: int) -> pd.DataFrame:
col_name = "sum_%s" % identifier
temp_df = df.set_index(timestamp) \
.groupby(identifier, sort=False)[sum_col] \
.rolling('%sd' % delta, closed='both') \
.sum() \
.rename(col_name)
temp_df = temp_df[~temp_df.index.duplicated(keep="first")]
return df.merge(temp_df, how="left", left_on=[identifier, timestamp], right_index=True)
frequency_of_last_n_days(df, "Name", "Date", "Earning", 30)
The outcome
Name Date Earning sum
0 Jack 2021-01-08 40 40.0
1 Jim 2021-01-27 10 10.0
2 Jack 2021-02-05 20 60.0
3 Jim 2021-02-10 20 30.0
4 Jack 2021-02-17 40 60.0
5 Jack 2021-02-18 50 110.0
6 Jim 2021-02-20 100 130.0
7 Jack 2021-02-21 70 180.0
8 Jane 2021-02-22 80 80.0
9 Jane 2021-03-29 90 90.0
Solution 2:[2]
IIUC use
df = (df.join(df.set_index('Date')
.groupby('Name')['Earning']
.rolling('30d', closed='both')
.sum()
.rename('sum'), on=['Name', 'Date']))
print (df)
Name Date Earning sum
0 Jack 2021-01-08 40 40.0
1 Jim 2021-01-27 10 10.0
2 Jack 2021-02-05 20 60.0
3 Jim 2021-02-10 20 30.0
4 Jack 2021-02-17 40 60.0
5 Jack 2021-02-18 50 110.0
6 Jim 2021-02-20 100 130.0
7 Jack 2021-02-21 70 180.0
8 Jane 2021-02-22 80 80.0
9 Jane 2021-03-29 90 90.0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Duy Bui |
Solution 2 |