'Creating a summary statistics table in python

I am trying to recreate the 'SummarySE()' function from R in python but I am having trouble getting it to work. The function creates a summary stats table from a repeated measures dataframe. However I am unable to get it working, I keep getting errors due to the column names within my dataframe (which are strings).

Table used:

id Position.Name Period Maximum.Velocity
2 WR Special team 16.5
2 WR Special team 15.2
2 WR Special team 16.5
2 WR Special team 15.2
3 DB Special team 14.5
3 DB Special team 10.6
3 DB Special team 17.5
3 DB Special team 13.5
4 OL Special team 10.2
4 OL Special team 11.3
4 OL Special team 16.2
2 WR team 13.5
2 WR team 12.2
2 WR team 15.5
2 WR team 16.2
3 DB team 13.5
3 DB team 12.5
3 DB team 11.5
3 DB team 16.5
4 OL team 9.2
4 OL team 8.2
4 OL team 11.2
df = pd.DataFrame(columns=["id", "Position.Name", "Period", "Maximum.Velocity"], 
                  data = [[2, "WR", "Special team", 16.5],[2, "WR", "Special team", 15.2], [2, "WR", "Special team", 16.5], [2,"WR", "Special team", 15.2],  [3, "DB", "Special team" ,14.5],[3, "DB", "Special team", 10.6], [3, "DB", "Special team", 17.5],[3, "DB", "Special team", 13.5], [4, "OL", "Special team", 10.2], [4, "OL", "Special team", 11.3], [4, "OL", "Special team", 16.2], [2, "WR", "team", 13.5], [2, "WR", "team", 12.2], [2, "WR", "team", 15.5],[2, "WR", "team", 16.2],[3, "DB", "team", 13.5], [3, "DB", "team", 12.5], [3, "DB", "team", 11.5], [3,"DB","team", 16.5], [4, "OL","team", 9.2], [4, "OL", "team", 8.2], [4, "OL", "team", "11.2"]])
df["Maximum.Velocity"] = df["Maximum.Velocity"].astype("float")

Code used:

import pandas as pd
import scipy as sp
from scipy.stats import t
import numpy as np

#from: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_%28ggplot2%29/
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   conf_interval: the percent range of the confidence interval (default is 95%)
def summarySE(data, measurevar, groupvars, conf_interval=0.95):
    def std(s):
        return np.std(s, ddof=1)
    def stde(s):
        return std(s) / np.sqrt(len(s))

    def ci(s):
        # Confidence interval multiplier for standard error
        # Calculate t-statistic for confidence interval: 
        # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
        ciMult = t.ppf(conf_interval/2.0 + .5, len(s)-1)
        return stde(s)*ciMult
    def ciUp(s):
        return np.mean(s)+ci(s)
    def ciDown(s):
        return np.mean(s)-ci(s)
    
    data = data[groupvars+measurevar].groupby(groupvars).agg([len, np.mean, std, stde, ciUp, ciDown, ci])

    data.reset_index(inplace=True)


    data.columns = groupvars+ ['_'.join(col).strip() for col in data.columns.values[len(groupvars):]]

    return data

summary_table = summarySE(data = df, measurevar = ['Maximum.Velocity'], groupvars = ['Position.Name','Period'], conf_interval=0.95)

Traceback Error that I get:

  • indexer = self.columns.get_loc(key)
  • raise KeyError(key) from err
  • KeyError: 'Position.NameMaximum.Velocity'

The desired output is something like this:

Position.Name Period length Maximum.Velocity std stde ciUp ciDown ci
WR Special team 4 mean std stde ciUp ciDown ci
WR team 4 mean std stde ciUp ciDown ci
DB Special team 4 mean std stde ciUp ciDown ci
DB team 4 mean std stde ciUp ciDown ci
OL Special team 3 mean std stde ciUp ciDown ci
OL team 3 mean std stde ciUp ciDown ci


Solution 1:[1]

I have been looking for the solution for this kind of problem. Since, R has better solution for such. Anyway I tried to solve your problem. Find it here

import pandas as pd
import numpy as np
from scipy.stats import t


def SummarySE(data:pd.DataFrame, measure_var:List[str], group_vars:List[str], conf_interval=0.95):
"""
Calculate the summary statistics for a given measure variable, grouped by group_vars. 
The summary statistics are calculated using the t-distribution.
"""
# Calculate the summary statistics
# summary_stats = data.groupby(group_vars)[measure_var].describe()
summary_stats = pd.DataFrame()
for stat in [np.mean, np.std, np.min, np.max]:
    # Add a new column to the summary statistics dataframe for each summary statistic calculated for the measure variable
    summary_stats[stat.__name__] = data.groupby(group_vars)[measure_var].apply(lambda x: stat(x))
    # summary_stats[stat.__name__] = summary_stats[stat].apply(lambda x: round(x, 2))

    # summary_stats[stat] = data.groupby(group_vars)[measure_var].agg(stat)
    
# Calculate the confidence interval
summary_stats["Conf. Interval"] = data.groupby(group_vars)[measure_var].apply(lambda x: t.interval(conf_interval, x))
# Ungroup the dataframe
summary_stats = summary_stats.reset_index()

# Return the summary statistics
return summary_stats

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Praful Dodda