'Creating a summary statistics table in python
I am trying to recreate the 'SummarySE()' function from R in python but I am having trouble getting it to work. The function creates a summary stats table from a repeated measures dataframe. However I am unable to get it working, I keep getting errors due to the column names within my dataframe (which are strings).
Table used:
id | Position.Name | Period | Maximum.Velocity |
---|---|---|---|
2 | WR | Special team | 16.5 |
2 | WR | Special team | 15.2 |
2 | WR | Special team | 16.5 |
2 | WR | Special team | 15.2 |
3 | DB | Special team | 14.5 |
3 | DB | Special team | 10.6 |
3 | DB | Special team | 17.5 |
3 | DB | Special team | 13.5 |
4 | OL | Special team | 10.2 |
4 | OL | Special team | 11.3 |
4 | OL | Special team | 16.2 |
2 | WR | team | 13.5 |
2 | WR | team | 12.2 |
2 | WR | team | 15.5 |
2 | WR | team | 16.2 |
3 | DB | team | 13.5 |
3 | DB | team | 12.5 |
3 | DB | team | 11.5 |
3 | DB | team | 16.5 |
4 | OL | team | 9.2 |
4 | OL | team | 8.2 |
4 | OL | team | 11.2 |
df = pd.DataFrame(columns=["id", "Position.Name", "Period", "Maximum.Velocity"],
data = [[2, "WR", "Special team", 16.5],[2, "WR", "Special team", 15.2], [2, "WR", "Special team", 16.5], [2,"WR", "Special team", 15.2], [3, "DB", "Special team" ,14.5],[3, "DB", "Special team", 10.6], [3, "DB", "Special team", 17.5],[3, "DB", "Special team", 13.5], [4, "OL", "Special team", 10.2], [4, "OL", "Special team", 11.3], [4, "OL", "Special team", 16.2], [2, "WR", "team", 13.5], [2, "WR", "team", 12.2], [2, "WR", "team", 15.5],[2, "WR", "team", 16.2],[3, "DB", "team", 13.5], [3, "DB", "team", 12.5], [3, "DB", "team", 11.5], [3,"DB","team", 16.5], [4, "OL","team", 9.2], [4, "OL", "team", 8.2], [4, "OL", "team", "11.2"]])
df["Maximum.Velocity"] = df["Maximum.Velocity"].astype("float")
Code used:
import pandas as pd
import scipy as sp
from scipy.stats import t
import numpy as np
#from: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_%28ggplot2%29/
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summariezed
## groupvars: a vector containing names of columns that contain grouping variables
## conf_interval: the percent range of the confidence interval (default is 95%)
def summarySE(data, measurevar, groupvars, conf_interval=0.95):
def std(s):
return np.std(s, ddof=1)
def stde(s):
return std(s) / np.sqrt(len(s))
def ci(s):
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult = t.ppf(conf_interval/2.0 + .5, len(s)-1)
return stde(s)*ciMult
def ciUp(s):
return np.mean(s)+ci(s)
def ciDown(s):
return np.mean(s)-ci(s)
data = data[groupvars+measurevar].groupby(groupvars).agg([len, np.mean, std, stde, ciUp, ciDown, ci])
data.reset_index(inplace=True)
data.columns = groupvars+ ['_'.join(col).strip() for col in data.columns.values[len(groupvars):]]
return data
summary_table = summarySE(data = df, measurevar = ['Maximum.Velocity'], groupvars = ['Position.Name','Period'], conf_interval=0.95)
Traceback Error that I get:
- indexer = self.columns.get_loc(key)
- raise KeyError(key) from err
- KeyError: 'Position.NameMaximum.Velocity'
The desired output is something like this:
Position.Name | Period | length | Maximum.Velocity | std | stde | ciUp | ciDown | ci |
---|---|---|---|---|---|---|---|---|
WR | Special team | 4 | mean | std | stde | ciUp | ciDown | ci |
WR | team | 4 | mean | std | stde | ciUp | ciDown | ci |
DB | Special team | 4 | mean | std | stde | ciUp | ciDown | ci |
DB | team | 4 | mean | std | stde | ciUp | ciDown | ci |
OL | Special team | 3 | mean | std | stde | ciUp | ciDown | ci |
OL | team | 3 | mean | std | stde | ciUp | ciDown | ci |
Solution 1:[1]
I have been looking for the solution for this kind of problem. Since, R has better solution for such. Anyway I tried to solve your problem. Find it here
import pandas as pd
import numpy as np
from scipy.stats import t
def SummarySE(data:pd.DataFrame, measure_var:List[str], group_vars:List[str], conf_interval=0.95):
"""
Calculate the summary statistics for a given measure variable, grouped by group_vars.
The summary statistics are calculated using the t-distribution.
"""
# Calculate the summary statistics
# summary_stats = data.groupby(group_vars)[measure_var].describe()
summary_stats = pd.DataFrame()
for stat in [np.mean, np.std, np.min, np.max]:
# Add a new column to the summary statistics dataframe for each summary statistic calculated for the measure variable
summary_stats[stat.__name__] = data.groupby(group_vars)[measure_var].apply(lambda x: stat(x))
# summary_stats[stat.__name__] = summary_stats[stat].apply(lambda x: round(x, 2))
# summary_stats[stat] = data.groupby(group_vars)[measure_var].agg(stat)
# Calculate the confidence interval
summary_stats["Conf. Interval"] = data.groupby(group_vars)[measure_var].apply(lambda x: t.interval(conf_interval, x))
# Ungroup the dataframe
summary_stats = summary_stats.reset_index()
# Return the summary statistics
return summary_stats
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Praful Dodda |