Vectorize a function for a GroupBy Pandas Dataframe
I have a Pandas dataframe sorted by a datetime column. Several rows can share the same datetime but differ in their "report type" column. I need to select just one of those rows based on an ordered list of preferred report types: the row whose report type appears earliest in the list is the one appended to a new dataframe.
I've tried a GroupBy with (very slow) Python for loops to process each group, find the preferred report type, and append that row to a new dataframe. I thought about numpy.vectorize(), but I don't know how to incorporate the group by into it. I don't know much about dataframes yet, but I'm learning. Any ideas on how to make it faster? Can I incorporate the group by?
The example dataframe
OBSERVATIONTIME REPTYPE CIGFT
2000-01-01 00:00:00 AUTO 73300
2000-01-01 00:00:00 FM-15 25000
2000-01-01 00:00:00 FM-12 3000
2000-01-01 01:00:00 SAO 9000
2000-01-01 01:00:00 FM-16 600
2000-01-01 01:00:00 FM-15 5000
2000-01-01 01:00:00 AUTO 5000
2000-01-01 02:00:00 FM-12 12000
2000-01-01 02:00:00 FM-15 15000
2000-01-01 02:00:00 FM-16 8000
2000-01-01 03:00:00 SAO 700
2000-01-01 04:00:00 SAO 3000
2000-01-01 05:00:00 FM-16 5000
2000-01-01 06:00:00 AUTO 15000
2000-01-01 06:00:00 FM-12 12500
2000-01-01 06:00:00 FM-16 12000
2000-01-01 07:00:00 FM-15 20000
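(For reference, here is a minimal sketch that rebuilds the sample above as a DataFrame, in case you want to experiment with the answers below; the column names and values are taken from the table as shown.)
import pandas as pd

data = {
    'OBSERVATIONTIME': [
        '2000-01-01 00:00:00', '2000-01-01 00:00:00', '2000-01-01 00:00:00',
        '2000-01-01 01:00:00', '2000-01-01 01:00:00', '2000-01-01 01:00:00',
        '2000-01-01 01:00:00', '2000-01-01 02:00:00', '2000-01-01 02:00:00',
        '2000-01-01 02:00:00', '2000-01-01 03:00:00', '2000-01-01 04:00:00',
        '2000-01-01 05:00:00', '2000-01-01 06:00:00', '2000-01-01 06:00:00',
        '2000-01-01 06:00:00', '2000-01-01 07:00:00',
    ],
    'REPTYPE': ['AUTO', 'FM-15', 'FM-12', 'SAO', 'FM-16', 'FM-15', 'AUTO',
                'FM-12', 'FM-15', 'FM-16', 'SAO', 'SAO', 'FM-16', 'AUTO',
                'FM-12', 'FM-16', 'FM-15'],
    'CIGFT': [73300, 25000, 3000, 9000, 600, 5000, 5000, 12000, 15000, 8000,
              700, 3000, 5000, 15000, 12500, 12000, 20000],
}
df = pd.DataFrame(data)
df['OBSERVATIONTIME'] = pd.to_datetime(df['OBSERVATIONTIME'])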
#################################################
# The function to loop through and find the row
################################################
def select_the_one_ob(df):
    ''' select the preferred observation '''
    tophour_df = pd.DataFrame()
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12',
                       'SY-MT', 'SY-SA']
    grouped = df.groupby("OBSERVATIONTIME", as_index=False)
    for name, group in grouped:
        a_group_df = pd.DataFrame(grouped.get_group(name))
        for reptype in preferred_order:
            preferred_found = False
            for i in a_group_df.index.values:
                if a_group_df.loc[i, 'REPTYPE'] == reptype:
                    tophour_df = tophour_df.append(a_group_df.loc[i].transpose())
                    preferred_found = True
                    break
            if preferred_found:
                break
        del a_group_df
    return tophour_df
################################################
### The function which calls the above function
################################################
def process_ceiling(plat, network):
    platformcig.data_pull(CONNECT_SRC, PULL_CEILING)
    data_df = platformcig.df
    data_df = select_the_one_ob(data_df)
With the complete dataset of 300,000 rows, the function takes over 4 hours. I need it to be much faster. Can I incorporate the group by into numpy.vectorize()?
Solution 1:[1]
I found that by creating a separate dataframe of the same shape, populated with each hour of the observation time, I could use the pandas DataFrame merge() and, after the first pass, combine_first(). This took only minutes instead of hours.
def select_the_one_ob(df):
    ''' select the preferred observation
    Parameters:
        df (Pandas Object), a Pandas dataframe
    Returns Pandas Dataframe
    '''
    dshelldict = {'DateTime': pd.date_range(BEG_POR, END_POR, freq='H')}
    dshell = pd.DataFrame(data=dshelldict)
    dshell['YEAR'] = dshell['DateTime'].dt.year
    dshell['MONTH'] = dshell['DateTime'].dt.month
    dshell['DAY'] = dshell['DateTime'].dt.day
    dshell['HOUR'] = dshell['DateTime'].dt.hour
    dshell = dshell.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])
    df = df.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])

    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12',
                       'SY-MT', 'SY-SA']
    reptype_list = list(df.REPTYPE.unique())
    # Remove the preferred report types from the unique ones
    for rep in preferred_order:
        if rep in reptype_list:
            reptype_list.remove(rep)
    # If there are any unique report types left, append them to the preferred list
    if len(reptype_list) > 0:
        preferred_order = preferred_order + reptype_list

    # first_pass is a flag to make sure a report type has been used to transfer
    # columns to the new DataFrame (merge has to happen before combine_first)
    first_pass = True
    for reptype in preferred_order:
        if first_pass:
            # If there is data in the dataframe for this report type
            if df[(df['MINUTE'] == 0) & (df['REPTYPE'] == reptype)].shape[0] > 0:
                first_pass = False
                # Merge the shell with the first dataframe that has data; the data
                # is sorted by original obstime and duplicates are dropped, keeping
                # the first report chronologically
                tophour_df = dshell.merge(
                    df[(df['MINUTE'] == 0) & (df['REPTYPE'] == reptype)]
                    .sort_values(['OBSERVATIONTIME'], ascending=True)
                    .drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'], keep='first'),
                    how='left', left_index=True, right_index=True
                ).drop('DateTime', axis=1)
        else:
            # combine_first takes the original dataframe and fills any NaN values
            # with data from another identically shaped dataframe,
            # e.g. if df.loc[2, col1] is NaN, df2.loc[2, col1] fills it if not NaN
            tophour_df = tophour_df.combine_first(
                df[(df['MINUTE'] == 0) & (df['REPTYPE'] == reptype)]
                .sort_values(['OBSERVATIONTIME'], ascending=True)
                .drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'], keep='first')
            )
    tophour_df = tophour_df.reset_index()
    return tophour_df
Solution 2:[2]
You can avoid using groupby. One way is to make the column 'REPTYPE' an ordered pd.Categorical, then sort_values and drop_duplicates, such as:
def select_the_one_ob(df):
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    df.REPTYPE = pd.Categorical(df.REPTYPE, categories=preferred_order, ordered=True)
    return (df.sort_values(by=['OBSERVATIONTIME', 'REPTYPE'])
              .drop_duplicates(subset='OBSERVATIONTIME', keep='first'))
and with your example you get:
OBSERVATIONTIME REPTYPE CIGFT
1 2000-01-01 00:00:00 FM-15 25000
5 2000-01-01 01:00:00 FM-15 5000
8 2000-01-01 02:00:00 FM-15 15000
10 2000-01-01 03:00:00 SAO 700
11 2000-01-01 04:00:00 SAO 3000
12 2000-01-01 05:00:00 FM-16 5000
13 2000-01-01 06:00:00 AUTO 15000
16 2000-01-01 07:00:00 FM-15 20000
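As a usage sketch, assuming the sample DataFrame built near the top of this post:
result = select_the_one_ob(df)
print(result)
Note that the function converts df.REPTYPE to a categorical in place, and any REPTYPE value not listed in preferred_order becomes NaN, which sorts after the listed types.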
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | E. Zeytinci
Solution 2 | Ben.T