Rolling OLS Regressions and Predictions by Group

I have a Pandas dataframe with some data on race car drivers. The relevant columns look like this:


| Date   | Name    | Distance | avg_speed_calc |
|:-------|:--------|:---------|:---------------|
| 9/6/20 | Smith   | 8        | 85.6           |
| 9/6/20 | Douglas | 8        | 84.9           |
| 9/6/20 | Mostern | 8        | 84.3           |

.......

| Date   | Name     | Distance | avg_speed_calc |
|:-------|:---------|:---------|:---------------|
| 4/5/21 | Smith    | 6        | 88.7           |
| 4/5/21 | Robinson | 6        | 89.3           |
| 4/5/21 | Thomas   | 6        | 87.5           |

In the data above, each race has a variable number of participants, and not every racer competes in every event - some may even have only one row entry.

I'm trying to fit a simple OLS model to get an idea of each racer's preferred distance by regressing distance alone against his avg_speed_calc. Because I want the estimate to update after every race, I have been trying to use RollingOLS from statsmodels. With the help of a somewhat similar question from a few years ago, my code currently looks like this:

from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm
import pandas as pd
import numpy as np


dist_pref = df.groupby(["Name"]).apply(
    lambda x: RollingOLS(endog=x['avg_speed_calc'],
                         exog=sm.add_constant(x['Distance']),
                         min_nobs=2).fit().params)

However, this raises "ValueError: could not broadcast input array from shape (0) into shape (1)", which I have so far been unable to resolve.

I have also tried a helper function, based on another old question, to no avail:

def ols_res(x, y):
    return pd.Series(RollingOLS(y, x).fit().predict)

df_dist = df.groupby(['Name']).apply(lambda x : x['Distance'].apply(ols_res, y=x['avg_speed_calc']))

Ideally, I would then predict the avg_speed_calc for that day's race using only the data from prior races, so that I can compare it to the actual avg_speed_calc reported in that row. My thought was to create separate DataFrame columns for each coefficient and the intercept, shift them down one row, and then use those values along with the distance of the current race to make the prediction, but perhaps there is a more efficient way.
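Something along these lines is what I had in mind (coef_const and coef_Distance here are hypothetical columns that would hold each racer's fitted intercept and slope for that race; I haven't actually produced them yet):

# sketch only: assumes coef_const / coef_Distance already hold each racer's
# rolling OLS estimates for the race in that row
df = df.sort_values(['Name', 'Date'])

# within each racer, use the coefficients estimated as of the *previous* race
prior = df.groupby('Name')[['coef_const', 'coef_Distance']].shift(1)

# predict today's speed from the prior model and today's distance
df['predicted_speed_calc'] = prior['coef_const'] + prior['coef_Distance'] * df['Distance']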

Assuming the resulting dataframe is then sorted by date and racer, my desired final output (which requires two or more prior races for the prediction) is a dataframe with info for each racer as follows:

| Date     | Name  | Distance | avg_speed_calc | predicted_speed_calc |
|:---------|:------|:---------|:---------------|:---------------------|
| 9/6/20   | Smith | 8        | 85.6           | nan                  |
| 11/15/20 | Smith | 6        | 82.6           | nan                  |
| 1/4/21   | Smith | 7        | 83.4           | 84.1                 |
| 2/20/21  | Smith | 7        | 82.9           | 83.9                 |
| 4/5/21   | Smith | 8.5      | 84.8           | 85.7                 |

Having another column with the standard error of the predicted_speed_calc would also be nice, but that's a secondary issue I'll sort out later. Thanks in advance for any guidance and suggestions on the above.

EDIT 2:

Thanks to @LarryBird, I now have a workable function for the rolling linear OLS fit. However, I would also like to try a 2nd-degree polynomial fit using the 'from_formula' method, and to return the residuals so I can compare them to the linear fit.

I am trying to fit the quadratic equation based on the info here and the 'speed_preference_from_formula' function defined below. However, if I refer to the same column twice in the formula (e.g. 'avg_speed_calc ~ Distance + Distance**2'), 'params' returns only two columns instead of the expected three. I have made a helper column for the squared value (see below), but the coefficients returned are clearly inaccurate, so I know I'm doing something wrong.

import numpy.polynomial.polynomial as poly
import patsy

df['Distance2']  = df['Distance']**2
grouped2 = df.groupby('Name')
form = "avg_speed_calc ~ Distance + Distance2"
params2 = grouped2.apply(lambda x: speed_preference_from_formula(x, form, 4))



Solution 1 [1]

You should be able to achieve what you want using the groupby / apply pattern. The code below should help.

Create example data:

from statsmodels.regression.rolling import RollingOLS
from statsmodels.tools.tools import add_constant
import pandas as pd
import numpy as np

# make some toy data
race_dates = pd.to_datetime(['2020-06-09']*3 + ['2020-12-01']*4 + ['2021-01-21']*4 + ['2021-05-04']*5)
distance = [8]*3 + [7]*4 + [4]*4 + [6]*5
avg_speed = 80 + np.random.randn(len(distance))*10
names = ['Smith', 'Douglas', 'Mostern',
         'Smith', 'Douglas', 'Mostern', 'Robinson',
         'Smith', 'Douglas', 'Mostern', 'Robinson',
         'Smith', 'Douglas', 'Mostern', 'Robinson', 'Thomas']

df = pd.DataFrame({'Date': race_dates,
                  'Name': names,
                  'Distance': distance,
                  'avg_speed_calc': avg_speed})

Define helper functions:

def speed_preference(df_racer, intercept=False):
    """
    Function to operate on the data of a single driver.
    Assumes df_racer has the columns 'Distance' and 'avg_speed_calc' available.
    Returns a dataframe containing model parameters
    """
    # we should have at least (number of parameters) + 1 observations
    min_obs = 3 if intercept else 2

    # if there are fewer than min_obs rows in df_racer, RollingOLS will throw an error,
    # so handle this case separately
    if df_racer.shape[0] < min_obs:
        cols = ['const', 'Distance'] if intercept else ['Distance']
        return pd.DataFrame(index=df_racer.index, columns=cols)
    
    y = df_racer['avg_speed_calc']
    x = add_constant(df_racer['Distance']) if intercept else df_racer['Distance']
    
    model = RollingOLS(y, x, expanding=True, min_nobs=min_obs).fit()
    return model.params


def speed_prediction(df_racer, intercept=False):
    """
    Function to operate on the data of a single driver.
    Assumes df_racer has the columns 'Distance' and 'avg_speed_calc' available.
    Returns a series containing predicted speed
    """
    params = speed_preference(df_racer, intercept)
    params_shifted = params.shift(1)
    
    if intercept:
        return (params_shifted.mul(add_constant(df_racer['Distance']), axis=0)\
                .sum(axis=1, min_count=1)).to_frame('predicted_speed_calc')
    
    return (params_shifted.mul(df_racer['Distance'], axis=0))\
                .rename({'Distance': 'predicted_speed_calc'}, axis=1)

The speed_preference function computes the rolling OLS for a single driver and returns the fitted parameter(s). The speed_prediction function computes the predicted speed using the model fitted up to the previous race (notice params_shifted), as requested. To put it all together, a simple groupby and join is all that is required:

No intercept

grouped = df.groupby('Name')
params = grouped.apply(speed_preference)
predictions = grouped.apply(speed_prediction)

df_out_no_intercept = df.join(params, rsuffix='_coef').join(predictions)
df_out_no_intercept

[image: output without intercept]

With intercept

grouped = df.groupby('Name')
params = grouped.apply(lambda x: speed_preference(x, True))
predictions = grouped.apply(lambda x: speed_prediction(x, True))

df_out_w_intercept = df.join(params, rsuffix='_coef').join(predictions)
df_out_w_intercept

[image: output with intercept]

EDIT

If you would like to fit a model from formula instead, you could use:

def speed_preference_from_formula(df_racer, formula, min_nobs):
    """
    Function to operate on the data of a single driver. "formula" should reference column names in df_racer.
    min_nobs should always be >= (# of parameters in the model)+1
    """
    
    # if there are fewer than min_nobs rows in df_racer, RollingOLS will throw an error,
    # so handle this case separately
    if df_racer.shape[0] < min_nobs:
        return None

    model = RollingOLS.from_formula(formula, data=df_racer, expanding=True,
                                    min_nobs=min_nobs, window=None).fit()
    return model.params

Then for a polynomial model, you could compute the parameters as follows:

grouped = df.groupby('Name')
formula = "avg_speed_calc ~ 1 + Distance + I(Distance**2)"
grouped.apply(lambda x: speed_preference_from_formula(x, formula, 4)) 

Output:

[image: rolling parameters for the formula-based model]

Note that you would also need to edit the speed prediction function to handle these parameters and generate the predictions correctly.
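One possible sketch of such a prediction helper, which also returns the residuals (actual minus predicted) asked about in the second edit. It rebuilds the design matrix with patsy so the columns line up with the fitted parameters; the function and column names here are just illustrative:

import patsy

def speed_prediction_from_formula(df_racer, formula, min_nobs=4):
    """
    Sketch: predict each race from the formula model fitted on the races
    strictly before it, and also return the residual.
    Relies on the speed_preference_from_formula helper defined above.
    """
    params = speed_preference_from_formula(df_racer, formula, min_nobs)

    out = pd.DataFrame(index=df_racer.index,
                       columns=['predicted_speed_calc', 'residual'])
    if params is None:  # driver has too few races
        return out

    # rebuild the right-hand-side design matrix so its columns are in the
    # same order as the fitted parameters
    exog = patsy.dmatrix(formula.split('~', 1)[1], df_racer,
                         return_type='dataframe')

    # only use coefficients estimated from prior races
    prior_params = params.shift(1)
    out['predicted_speed_calc'] = (prior_params.values * exog.values).sum(axis=1)
    out['residual'] = df_racer['avg_speed_calc'] - out['predicted_speed_calc']
    return out

# usage, mirroring the pattern above:
# predictions2 = grouped.apply(lambda x: speed_prediction_from_formula(x, formula, 4))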

Notice that in the formula, I reference column names present in the dataframe being passed in. The 1 indicates an intercept should be used (similar to sm.add_constant), and Distance means use the value in the Distance column directly. The squared term needs to be wrapped in I(...): inside a patsy formula, ** acts as the interaction-power operator rather than ordinary exponentiation, which is why 'avg_speed_calc ~ Distance + Distance**2' collapses to a single Distance term and returns only two parameters. I(Distance**2) tells patsy to evaluate the expression as plain Python, so the squared distance is used as a feature. If you wanted to fit a cubic model, you can add the term + I(Distance**3).
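You can see this directly by inspecting the design matrices patsy builds (patsy is the library statsmodels uses behind from_formula; the snippet below is only an illustration using the toy df from above):

import patsy

# '**' inside a formula is patsy's interaction-power operator, so the
# Distance**2 term collapses back to Distance -> only two columns
print(patsy.dmatrix("Distance + Distance**2", df).design_info.column_names)

# wrapping the power in I() evaluates it as ordinary Python arithmetic,
# so the squared term survives -> three columns including the intercept
print(patsy.dmatrix("Distance + I(Distance**2)", df).design_info.column_names)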

For a good reference on how to use these "R-style" formulas, see here.

EDIT 2: On debugging groupby

Think of the groupby.apply pattern as simply splitting the dataframe into multiple smaller dataframes, applying some function to each of them individually, and then joining the results back together. To see each sub-dataframe generated by the groupby, you can use:

grouped = df.groupby('Name')
for name, group in grouped:
    print(f"The sub-dataframe for: {name}")
    print(group)

This is useful because you can see exactly what gets passed to the function you are using inside the .apply().

So for the error mentioned in the comments, you can apply the function to each group individually to narrow down where it is occurring.
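For example, to run the helper on a single driver's sub-dataframe directly (get_group is a standard GroupBy method; 'Smith' is just one of the names from the toy data above):

grouped = df.groupby('Name')

# pull out one driver's rows and call the helper on them directly,
# so any traceback points at that specific group's data
smith = grouped.get_group('Smith')
print(speed_preference(smith, intercept=True))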

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1