'Pandas rolling window cumsum, with incomplete series

I have a pandas df as follows:

YEAR   MONTH   USERID    TRX_COUNT
2020   1        1         1
2020   2        1         2
2020   3        1         1
2020   12       1         1
2021   1        1         3
2021   2        1         3
2021   3        1         4

I want to sum the TRX_COUNT such that, each TRX_COUNT is the sum of TRX_COUNTS of the next 12 months. So my end result would look like

YEAR   MONTH   USERID    TRX_COUNT   TRX_COUNT_SUM
2020   1        1         1            5
2020   2        1         2            7
2020   3        1         1            8
2020   12       1         1            11
2021   1        1         3            10
2021   2        1         3            7
2021   3        1         4            4

For example TRX_COUNT_SUM for 2020/1 is 1+2+1+1=5 the count of the first 12 months. Second entry is 7 as it is the sum of 2+1+1+3 which is 12 months from 2020/2 I do expect partials, where there is not a full year of data. These can be either summed up partially or set to zero (as I wont be using partials). I tried various variations of cumsum and grouping by USERID, YR, MONTH but am running into errors with handling the time window. Thanks!



Solution 1:[1]

Looking at the logic and expected output, you are looking for more of rolling sum than cumsum. You want to roll 12 months and sum the number of TRX_COUNT. cumsum would cumulatively adding up the previous calculations.

Anyway, a few things that is complicated in your dataset. 1. interval is uneven 2. You are looking for forward rolling and typical rolling is backward.

To solve this, first, I would make the interval even so that I can use regular rolling.

df['ym'] = pd.to_datetime([f'{x}/0{y}' if y < 10 else f'{x}/{y}' for x, y in zip(df.YEAR, df.MONTH)])
df = df.set_index('ym').resample('MS').first()

Then, try forward rolling. To do the forward rolling, I reverse the dataframe once and do rolling then reverse back.

df['TRX_COUNT_SUM'] = (df.iloc[::-1]    # Reverse to do (backward) rolling
                       .rolling(12, min_periods=0)
                       .TRX_COUNT.sum()
                       .iloc[::-1])     # Reverse back to original

# remove resampled records
df = df[df.YEAR > 0]

Result.

              YEAR  MONTH  USERID  TRX_COUNT  TRX_COUNT_SUM
ym
2020-01-01  2020.0    1.0     1.0        1.0            5.0
2020-02-01  2020.0    2.0     1.0        2.0            7.0
2020-03-01  2020.0    3.0     1.0        1.0            8.0
2020-12-01  2020.0   12.0     1.0        1.0           11.0
2021-01-01  2021.0    1.0     1.0        3.0           10.0
2021-02-01  2021.0    2.0     1.0        3.0            7.0
2021-03-01  2021.0    3.0     1.0        4.0            4.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1