Creating a summary statistics table in Python

I am trying to recreate the 'summarySE()' function from R in Python. The function creates a summary statistics table from a repeated-measures dataframe. However, I cannot get it to work: I keep getting errors related to the column names in my dataframe (which are strings).

Table used:

id	Position.Name	Period	Maximum.Velocity
2	WR	Special team	16.5
2	WR	Special team	15.2
2	WR	Special team	16.5
2	WR	Special team	15.2
3	DB	Special team	14.5
3	DB	Special team	10.6
3	DB	Special team	17.5
3	DB	Special team	13.5
4	OL	Special team	10.2
4	OL	Special team	11.3
4	OL	Special team	16.2
2	WR	team	13.5
2	WR	team	12.2
2	WR	team	15.5
2	WR	team	16.2
3	DB	team	13.5
3	DB	team	12.5
3	DB	team	11.5
3	DB	team	16.5
4	OL	team	9.2
4	OL	team	8.2
4	OL	team	11.2

df = pd.DataFrame(columns=["id", "Position.Name", "Period", "Maximum.Velocity"], 
                  data = [[2, "WR", "Special team", 16.5],[2, "WR", "Special team", 15.2], [2, "WR", "Special team", 16.5], [2,"WR", "Special team", 15.2],  [3, "DB", "Special team" ,14.5],[3, "DB", "Special team", 10.6], [3, "DB", "Special team", 17.5],[3, "DB", "Special team", 13.5], [4, "OL", "Special team", 10.2], [4, "OL", "Special team", 11.3], [4, "OL", "Special team", 16.2], [2, "WR", "team", 13.5], [2, "WR", "team", 12.2], [2, "WR", "team", 15.5],[2, "WR", "team", 16.2],[3, "DB", "team", 13.5], [3, "DB", "team", 12.5], [3, "DB", "team", 11.5], [3,"DB","team", 16.5], [4, "OL","team", 9.2], [4, "OL", "team", 8.2], [4, "OL", "team", "11.2"]])
df["Maximum.Velocity"] = df["Maximum.Velocity"].astype("float")

Code used:

import pandas as pd
import scipy as sp
from scipy.stats import t
import numpy as np

#from: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_%28ggplot2%29/
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   conf_interval: the percent range of the confidence interval (default is 95%)
def summarySE(data, measurevar, groupvars, conf_interval=0.95):
    def std(s):
        return np.std(s, ddof=1)
    def stde(s):
        return std(s) / np.sqrt(len(s))

    def ci(s):
        # Confidence interval multiplier for standard error
        # Calculate t-statistic for confidence interval: 
        # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
        ciMult = t.ppf(conf_interval/2.0 + .5, len(s)-1)
        return stde(s)*ciMult
    def ciUp(s):
        return np.mean(s)+ci(s)
    def ciDown(s):
        return np.mean(s)-ci(s)
    
    data = data[groupvars+measurevar].groupby(groupvars).agg([len, np.mean, std, stde, ciUp, ciDown, ci])

    data.reset_index(inplace=True)


    data.columns = groupvars+ ['_'.join(col).strip() for col in data.columns.values[len(groupvars):]]

    return data

summary_table = summarySE(data = df, measurevar = ['Maximum.Velocity'], groupvars = ['Position.Name','Period'], conf_interval=0.95)

Traceback that I get:

indexer = self.columns.get_loc(key)
raise KeyError(key) from err
KeyError: 'Position.NameMaximum.Velocity'
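
A note on the key in this traceback: 'Position.NameMaximum.Velocity' is exactly what Python produces when the two column names are added together as plain strings rather than as lists. The sketch below only illustrates that difference on a tiny made-up dataframe (the three rows and the variable names `bad_key`, `good_keys` and `subset` are assumptions for illustration). It is one plausible cause of the error, not a confirmed diagnosis.

```python
import pandas as pd

# Tiny illustrative dataframe with the same column names as in the question
df = pd.DataFrame({
    "Position.Name": ["WR", "WR", "DB"],
    "Period": ["team", "team", "team"],
    "Maximum.Velocity": [13.5, 12.2, 13.5],
})

# Plain strings: "+" joins the two names into one non-existent column label
bad_key = "Position.Name" + "Maximum.Velocity"
try:
    df[bad_key]
    raised = None
except KeyError as err:
    raised = err.args[0]

# Lists: "+" concatenates them into a list of two separate column labels
good_keys = ["Position.Name"] + ["Maximum.Velocity"]
subset = df[good_keys]

print(raised)
print(list(subset.columns))
```

Passing both 'groupvars' and 'measurevar' as lists, as in the call shown above, avoids the joined label.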

The desired output is something like this:

Position.Name	Period	length	Maximum.Velocity	std	stde	ciUp	ciDown	ci
WR	Special team	4	mean	std	stde	ciUp	ciDown	ci
WR	team	4	mean	std	stde	ciUp	ciDown	ci
DB	Special team	4	mean	std	stde	ciUp	ciDown	ci
DB	team	4	mean	std	stde	ciUp	ciDown	ci
OL	Special team	3	mean	std	stde	ciUp	ciDown	ci
OL	team	3	mean	std	stde	ciUp	ciDown	ci

python r pandas

Solution 1:^[1]

I have been looking for a solution to this kind of problem as well, since R handles it more conveniently. Here is my attempt at solving yours:

from typing import List

import numpy as np
import pandas as pd
from scipy.stats import t


def SummarySE(data: pd.DataFrame, measure_var: str, group_vars: List[str], conf_interval: float = 0.95) -> pd.DataFrame:
    """
    Calculate the summary statistics for a given measure variable, grouped by group_vars.
    measure_var is a single column name (a string); group_vars is a list of column names.
    The confidence interval is calculated using the t-distribution.
    """
    grouped = data.groupby(group_vars)[measure_var]

    # Calculate the summary statistics: count, mean, sample standard deviation (ddof=1), min and max
    summary_stats = grouped.agg(["count", "mean", "std", "min", "max"])

    # Standard error of the mean
    summary_stats["se"] = summary_stats["std"] / np.sqrt(summary_stats["count"])

    # Calculate the confidence interval: t multiplier (df = N - 1) times the standard error
    ci_mult = t.ppf(conf_interval / 2.0 + 0.5, summary_stats["count"] - 1)
    summary_stats["ci"] = summary_stats["se"] * ci_mult
    summary_stats["ci_lower"] = summary_stats["mean"] - summary_stats["ci"]
    summary_stats["ci_upper"] = summary_stats["mean"] + summary_stats["ci"]

    # Ungroup the dataframe and return the summary statistics
    return summary_stats.reset_index()


# Example call with the dataframe from the question:
# summary_table = SummarySE(df, "Maximum.Velocity", ["Position.Name", "Period"])
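
For reference, `scipy.stats.t.interval` takes the confidence level and the degrees of freedom as its first two arguments, with `loc` and `scale` for the mean and the standard error. The following is a minimal sketch, using the four WR / Special team values from the question's table, that checks it against the `t.ppf`-based half-width used in the question's `ci` helper. The variable names are chosen here for illustration only.

```python
import numpy as np
from scipy.stats import t

values = np.array([16.5, 15.2, 16.5, 15.2])  # WR, Special team rows from the question
n = len(values)
mean = values.mean()
se = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Half-width from the t quantile, as in the question's ci() helper
half_width = t.ppf(0.95 / 2.0 + 0.5, n - 1) * se
manual = (mean - half_width, mean + half_width)

# Same interval from t.interval: confidence level, degrees of freedom, loc, scale
lower, upper = t.interval(0.95, n - 1, loc=mean, scale=se)

print(manual)
print((lower, upper))
```

Both approaches should print the same bounds. Passing the raw data as the second argument instead of the degrees of freedom would not give a confidence interval for the mean.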

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Praful Dodda
