How can I fill a column with values that are computed between two dates in pandas, with a delay of one row, respecting certain conditions?

I have the following DataFrame:

Date	Distance	Position	TrainerID
2017-09-03	1000	2	6529
2017-09-03	1600	4	6529
2017-09-03	1200	3	6529
2017-09-06	1200	13	6529
2017-09-08	1000	1	6529
2017-09-10	1600	9	6529
2017-09-15	1600	2	6529

For every row, I want to compute the winning percentage so far in sprint races (distance of 1200 meters or less) over the last 1000 days, grouped by TrainerID. The result will be stored in a Win % column. Dates need not be unique. The winning percentage is the one that applied before the race happened, so the current row is excluded and the results are delayed by one row.

Rows for races that do not fit this category should carry the winning percentage from above.

What I am looking for is a result like this:

Date	Distance	Position	TrainerID	Win %
2017-09-03	1000	2	6529	0 (0 wins, 0 races)
2017-09-03	1600	4	6529	0 (does not fit the criteria, value copied from above)
2017-09-03	1200	3	6529	0 (0 wins, 1 race)
2017-09-06	1200	13	6529	0 (0 wins, 2 races)
2017-09-08	1000	1	6529	0 (0 wins, 3 races)
2017-09-10	1600	9	6529	25 (1 win, 4 races)
2017-09-15	1600	2	6529	25 (does not fit the criteria, value copied from above)

For this, I know I need a mask to select the rows that qualify and, of course, the code that computes the percentage on that selection. What I have so far is this:

mask = (df.Distance == 1000) | (df.Distance == 1200)
df = (
    df.loc[mask]
    .set_index('Date')
    .groupby('TrainerID')
    .rolling(no_days)['Position']
    .apply(lambda s: round(s.eq(1).sum() / len(s) * 100))
    .groupby('TrainerID')
    .shift()
    .values
)

The problem is that I get a dimensional mismatch error: Length of values (12521) does not match length of index (27008)
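The length mismatch itself can be reproduced on a toy frame. The sketch below is only an illustration, not the pipeline from the question: the three-row frame and the `subset_result` name are made up for the example. It shows that a result computed on the masked rows has fewer entries than the full frame, that `.values` discards the index pandas would need to line the rows up again, and that assigning the Series itself aligns on the index instead. Note that in the question's own code the rolling result is indexed by TrainerID and Date, so the original row labels would have to be preserved before this kind of alignment could work there.

```python
import pandas as pd

# Toy frame: two sprint rows (<= 1200 m) and one longer race.
df = pd.DataFrame({"Distance": [1000, 1600, 1200], "Position": [2, 4, 3]})
mask = df["Distance"] <= 1200

# Any per-row result computed on the masked rows only (index labels 0 and 2).
subset_result = df.loc[mask, "Position"].eq(1).astype(float)

# .values strips the index, so pandas sees 2 values for 3 rows and raises.
error_message = None
try:
    df["Win %"] = subset_result.values
except ValueError as exc:
    error_message = str(exc)

# Assigning the Series keeps the index: rows outside the mask become NaN ...
df["Win %"] = subset_result
# ... and can then be filled from the row above.
df["Win %"] = df["Win %"].ffill()
```

After the aligned assignment, the row that does not match the mask holds NaN until the forward fill copies the value from the preceding row.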

The selection itself is correct; the problem is with the mask. Can you help me?
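One possible way to get the expected table, sketched under some assumptions: the helper name `sprint_win_pct` and its parameters are hypothetical, "sprint" is taken to mean Distance <= 1200, the window is read as the 1000 days up to and including the race date, and rows that are not sprints simply receive the percentage over all earlier sprint rows. It avoids masking the frame before the computation, so the result has one value per original row. It uses running totals and `numpy.searchsorted` rather than `rolling`, which is a swapped-in technique and not the approach from the question.

```python
import numpy as np
import pandas as pd

def sprint_win_pct(group, window_days=1000, max_sprint=1200):
    """Win % over earlier sprint rows of one trainer, current row excluded."""
    # Stable sort keeps the original order of races held on the same date.
    group = group.sort_values("Date", kind="stable")
    dates = group["Date"].to_numpy()
    is_sprint = (group["Distance"] <= max_sprint).to_numpy()
    is_win = is_sprint & (group["Position"] == 1).to_numpy()

    # Exclusive running totals: entry i counts the rows at positions 0..i-1.
    races_before = np.concatenate(([0], np.cumsum(is_sprint)))
    wins_before = np.concatenate(([0], np.cumsum(is_win)))

    # Position of the first row whose date lies inside each row's window.
    cutoff = dates - np.timedelta64(window_days, "D")
    start = np.searchsorted(dates, cutoff, side="right")
    pos = np.arange(len(group))

    races = races_before[pos] - races_before[start]
    wins = wins_before[pos] - wins_before[start]
    # No earlier sprint race in the window gives 0 rather than a division by zero.
    pct = np.where(races > 0, 100 * wins / np.maximum(races, 1), 0.0)
    return pd.Series(np.round(pct), index=group.index)

# Sample data from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-09-03", "2017-09-03", "2017-09-03", "2017-09-06",
        "2017-09-08", "2017-09-10", "2017-09-15",
    ]),
    "Distance": [1000, 1600, 1200, 1200, 1000, 1600, 1600],
    "Position": [2, 4, 3, 13, 1, 9, 2],
    "TrainerID": [6529] * 7,
})

# One result per trainer; the assignment aligns on the original row labels.
parts = [sprint_win_pct(g) for _, g in df.groupby("TrainerID", sort=False)]
df["Win %"] = pd.concat(parts)
```

Because each partial result keeps the row labels of its group, the final assignment aligns on the index and no length mismatch can occur, provided the frame's index is unique.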

python pandas

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
