'Pandas Rolling window to calculate sum of the same items of the last n days

Following up with this question, now I would like to calculate the sum/mean of a different column given the same grouping on a rolling window.

Here is the code snippet to set up. I would like to calculate the sum/mean of earnings of each person per row for their past 30 days. Also, since my data is quite big (the real data has 1.7 mil rows), any approach with apply will not work, I presume.

Pseudo:

  • For each row, aggregate data of the past 30 days from the Date
  • Filter data only of the same Name
  • Calculate sum of Earning and return it to the row
d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}

df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')


Solution 1:[1]

This answer is based on @jezrael post so thank you very much for it. The difference is that this solution

  1. uses left join to make sure we won't be changing the shape of the original data
  2. filters out the duplicates in case there are the same ['Date', 'Name'] in the data.
def sum_of_last_n_days(df: pd.DataFrame, identifier: str, timestamp: str, sum_col: str, delta: int) -> pd.DataFrame:
   
    col_name = "sum_%s" % identifier
    temp_df = df.set_index(timestamp) \
        .groupby(identifier, sort=False)[sum_col] \
        .rolling('%sd' % delta, closed='both') \
        .sum() \
        .rename(col_name)
    temp_df = temp_df[~temp_df.index.duplicated(keep="first")]
    return df.merge(temp_df, how="left", left_on=[identifier, timestamp], right_index=True)

frequency_of_last_n_days(df, "Name", "Date", "Earning", 30)

The outcome

   Name       Date  Earning    sum
0  Jack 2021-01-08       40   40.0
1   Jim 2021-01-27       10   10.0
2  Jack 2021-02-05       20   60.0
3   Jim 2021-02-10       20   30.0
4  Jack 2021-02-17       40   60.0
5  Jack 2021-02-18       50  110.0
6   Jim 2021-02-20      100  130.0
7  Jack 2021-02-21       70  180.0
8  Jane 2021-02-22       80   80.0
9  Jane 2021-03-29       90   90.0

Solution 2:[2]

IIUC use

df = (df.join(df.set_index('Date')
                .groupby('Name')['Earning']
                .rolling('30d', closed='both')
                .sum()
                .rename('sum'), on=['Name', 'Date']))
print (df)
   Name       Date  Earning    sum
0  Jack 2021-01-08       40   40.0
1   Jim 2021-01-27       10   10.0
2  Jack 2021-02-05       20   60.0
3   Jim 2021-02-10       20   30.0
4  Jack 2021-02-17       40   60.0
5  Jack 2021-02-18       50  110.0
6   Jim 2021-02-20      100  130.0
7  Jack 2021-02-21       70  180.0
8  Jane 2021-02-22       80   80.0
9  Jane 2021-03-29       90   90.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Duy Bui
Solution 2