'Rolling standard deviation excluding current rows in python
I'm trying calculate the rolling standard deviation for a groupby object which contains Invoice Date, Vendor Name and Agency Name, while excluding the current row in the calculation. So the rolling std() for each Vendor-Agency pair basically. It’s my first time asking a question here, so please let me know if additional information is required.
I already managed to do something similar with the rolling mean with the codes below.
last_3_invoices = dft_gb_comb.set_index("INVOICE_DATE").groupby(["VENDOR_NAME","OCP_AGNCY_NAME"])["daily_avg_amount"].rolling(3,min_periods = 1, closed="left").mean().fillna(0).reset_index()
last_30_df= dft_gb_comb.set_index("INVOICE_DATE").groupby(["VENDOR_NAME","OCP_AGNCY_NAME"])["daily_avg_amount"].rolling("30d",closed="left").mean().fillna(0).reset_index()
The problem is, if I try to replicate this code with .std() instead of .mean() I get no values. If i omit the "closed ="left"" statement I get values, but those include the current row, so that wouldn't work for me. I tried several ways to work with the .shift() statement instead
This could would work for some cases but it does not respect the ["VENDOR_NAME","OCP_AGNCY_NAME"] boundaries.
dft_gb_comb.set_index("INVOICE_DATE").groupby(["VENDOR_NAME","OCP_AGNCY_NAME"])["daily_avg_amount"].rolling(3,min_periods=1).std().shift().reset_index()
If I use shift() directly after the groupby like below it only returns the Invoice date with the values while dropping ["VENDOR_NAME","OCP_AGNCY_NAME"] from the groupby object...
dft_gb_comb.set_index("INVOICE_DATE").groupby(["VENDOR_NAME","OCP_AGNCY_NAME"])["daily_avg_amount"].shift().rolling(3,min_periods=1).std().reset_index()
dft_gb_comb.set_index("INVOICE_DATE").groupby(["VENDOR_NAME","OCP_AGNCY_NAME"])["daily_avg_amount"].apply(lambda x : x.shift(1).rolling(3,min_periods=1).std()).reset_index()
INVOICE_DATE | daily_avg_amount | |
---|---|---|
0 | 2018-04-19 | NaN |
1 | 2018-12-10 | NaN |
408184 | 2019-06-05 | 139.02567 |
408185 | 2019-06-19 | 124.92972 |
408186 | 2018-05-16 | 34.92400 |
408187 | 2019-04-19 | 135.41095 |
408188 | 2018-08-06 | NaN |
instead of something like this, which would be the desired end result.
VENDOR_NAME | OCP_AGNCY_NAME | INVOICE_DATE | daily_avg_amount | |
---|---|---|---|---|
0 | AED PROFESSIONALS N. E. | OKLA. A & M COLLEGE | 2018-04-19 | NaN |
2 | E3 DIAGNOSTIC,INC. DBA E3 GORDON STOWE/ | OKLAHOMA STATE UNIVERSITY | 2018-01-17 | 215.66757 |
I need to join the groupby object with the rolling values back to the main dataframe in the end, so I can't allow "VENDOR_NAME","OCP_AGNCY_NAME" to be dropped.
current python version would be
python : 3.8.8.final.0
pandas : 1.2.4
numpy : 1.20.1
I would be thankful for any kind of suggestions. I'm working with vendor data from here
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|