Chunking DataFrame by gaps in datetime index
First of all, my apologies if the title is too ambiguous.
I have a pd.DataFrame whose index has dtype datetime64. These indices, however, are not equally spaced: they mostly have a one-minute interval, but other intervals, such as two minutes, occur as well.
Suppose I have a pd.DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                            '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                            '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                            '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                            '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
                   'count': np.random.randint(1, 100, 10)})
datetime_index = pd.to_datetime(df['date'])
df = df.set_index(datetime_index).drop(columns='date')
df.sort_index(inplace=True)
such that df is:
count
date
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51   <<< two-minute gap
2018-11-28 14:05:00 41   <<< unlike the others
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93
My goal is to divide df into multiple chunks, where each chunk has a consistent one-minute frequency. Thus, the expected result from the above would be:
#df0
count
date
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51
#df1
count
date
2018-11-28 14:05:00 41
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93
I have tried Split a series on time gaps in pandas?, which sadly was quite outdated and did not serve my purpose.
I did actually achieve what I want for the above sample, but the actual dataframe is much larger and has many more gaps, which makes the following solution extremely inefficient:
df['diff'] = pd.Series(df.index).diff().values
dif = pd.Series(df.index).diff()
gap_index = dif[dif == pd.Timedelta(minutes=2)].index[0]
df[:gap_index], df[gap_index:]
I would very much appreciate any insight on this issue.
Solution 1:[1]
If you want to create a dictionary containing all your separated dataframes, this should work:
df['identifier'] = (~df.index.to_series().diff().dt.seconds.div(60, fill_value=0).lt(2)).cumsum()
count identifier
date
2018-11-28 13:59:00 7 0
2018-11-28 14:00:00 49 0
2018-11-28 14:01:00 13 0
2018-11-28 14:02:00 47 0
2018-11-28 14:03:00 72 0
2018-11-28 14:05:00 33 1
2018-11-28 14:06:00 50 1
2018-11-28 14:07:00 10 1
2018-11-28 14:08:00 86 1
2018-11-28 14:09:00 40 1
Then create a dict and add each group:
d = {}
for i, grp in df.groupby('identifier'):
    d['df_' + str(i)] = grp
print(d)
Output:
{'df_0': count identifier
date
2018-11-28 13:59:00 7 0
2018-11-28 14:00:00 49 0
2018-11-28 14:01:00 13 0
2018-11-28 14:02:00 47 0
2018-11-28 14:03:00 72 0,
'df_1': count identifier
date
2018-11-28 14:05:00 33 1
2018-11-28 14:06:00 50 1
2018-11-28 14:07:00 10 1
2018-11-28 14:08:00 86 1
2018-11-28 14:09:00 40 1}
You can then access your data through the dict keys:
print(d['df_1'])
count identifier
date
2018-11-28 14:05:00 33 1
2018-11-28 14:06:00 50 1
2018-11-28 14:07:00 10 1
2018-11-28 14:08:00 86 1
2018-11-28 14:09:00 40 1
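If you only need the chunks in order rather than by name, the same grouping idea yields a plain list directly. This is a minimal sketch (not part of the original answer) using the sample data from the question:

```python
import numpy as np
import pandas as pd

# rebuild the sample frame from the question
dates = pd.to_datetime(['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                        '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                        '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                        '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                        '2018-11-28 14:08:00', '2018-11-28 14:09:00'])
df = pd.DataFrame({'count': np.random.randint(1, 100, 10)}, index=dates)

# start a new group whenever the gap to the previous row exceeds one minute
group_id = (df.index.to_series().diff() > pd.Timedelta(minutes=1)).cumsum()
chunks = [g for _, g in df.groupby(group_id)]
```

Comparing timedeltas directly with pd.Timedelta also avoids the `.dt.seconds` detour, which would misbehave for gaps of a day or more.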
Solution 2:[2]
Here's a quick and dirty solution:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                            '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                            '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                            '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                            '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
                   'count': np.random.randint(1, 100, 10)})
df['date'] = pd.to_datetime(df['date'])
df.sort_values('date', inplace=True)
# calculate where to cut: True where the gap to the previous row exceeds one minute
df['cut_point'] = df['date'].diff() > pd.Timedelta('1min')
# generate chunks
res = []
chunk = []
for i, row in df.iterrows():
    if row['cut_point']:
        res.append(pd.DataFrame(chunk))
        del chunk[:]
    chunk.append({'date': row['date'], 'count': row['count']})
res.append(pd.DataFrame(chunk))
print(res[0])
print(res[1])
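For large frames, the row-by-row loop above can be avoided entirely by computing the cut positions vectorized and slicing. This is a hedged sketch of that alternative (not part of the original answer), on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
                       ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                        '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                        '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                        '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                        '2018-11-28 14:08:00', '2018-11-28 14:09:00']),
                   'count': np.random.randint(1, 100, 10)})

# integer positions where the gap to the previous row exceeds one minute
cuts = np.flatnonzero(df['date'].diff() > pd.Timedelta('1min'))
# slice the frame at those positions
bounds = [0, *cuts, len(df)]
res = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
```

Each element of res is a one-minute-spaced chunk, in order.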
Solution 3:[3]
For those of you googling this like me: if you are working with a non-datetime column and want to split the DataFrame on irregular gaps in that column, I've adapted anky's answer above into a more general form, where:
- 'stepCol' is the name of the column you are looking for irregular gaps in;
- stepResolution is the regular step value you expect, or the smallest step size that can occur;
- minStepsToSplitOn is the threshold at which you would like to split. It should probably be 2 in most cases, but if, for example, you want to allow gaps in 'stepCol' of up to 60 seconds with a smallest step size of 10 seconds, you would set stepResolution to 10 and minStepsToSplitOn to 7.
stepResolution = 60
minStepsToSplitOn = 2
df['identifier'] = (~df['stepCol'].diff().div(stepResolution, fill_value=0).lt(minStepsToSplitOn)).cumsum()
You can then simply use df.groupby('identifier') to split wherever there are irregular steps in your chosen column.
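A runnable sketch of this generalization on a plain integer column (the data and the 'stepCol' values here are made up for illustration):

```python
import pandas as pd

stepResolution = 10      # expected (smallest) step size in 'stepCol'
minStepsToSplitOn = 2    # split when the gap spans 2 or more steps

# hypothetical sample: a 30-unit gap between 20 and 50
df = pd.DataFrame({'stepCol': [0, 10, 20, 50, 60, 70]})

# gap / resolution >= threshold starts a new group; fill_value=0 keeps
# the first row (whose diff is NaN) in group 0
df['identifier'] = (~df['stepCol'].diff()
                      .div(stepResolution, fill_value=0)
                      .lt(minStepsToSplitOn)).cumsum()
groups = [g for _, g in df.groupby('identifier')]
```

Here the 30-unit gap is 3 steps at a resolution of 10, which meets the threshold of 2, so the frame splits into [0, 10, 20] and [50, 60, 70].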
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | anky |
Solution 2 | milos.ai |
Solution 3 | Sheldonfrith |