Chunking a DataFrame by gaps in a datetime index

First of all, my apologies if the title was too ambiguous.

I have a pd.DataFrame whose index has dtype datetime64. The index values, however, are not equally spaced: they mostly have a one-minute interval, but other intervals, such as two minutes, occur fairly often.

Suppose I have a pd.DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
               '2018-11-28 14:01:00', '2018-11-28 14:02:00',
               '2018-11-28 14:03:00', '2018-11-28 14:05:00',
               '2018-11-28 14:06:00', '2018-11-28 14:07:00',
               '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
                   'count': np.random.randint(1, 100, 10)})
datetime_index = pd.to_datetime(df['date'])
df = df.set_index(datetime_index).drop(columns='date')
df.sort_index(inplace=True)

such that df is:

                     count
date
2018-11-28 13:59:00     14
2018-11-28 14:00:00     30
2018-11-28 14:01:00      2
2018-11-28 14:02:00     42
2018-11-28 14:03:00     51  <<< two-minute gap
2018-11-28 14:05:00     41  <<< unlike the others
2018-11-28 14:06:00     48
2018-11-28 14:07:00      4
2018-11-28 14:08:00     50
2018-11-28 14:09:00     93

My goal is to divide df into multiple chunks, where each chunk has a consistent one-minute frequency. The expected result from the above would then be:

#df0
                     count
date
2018-11-28 13:59:00     14
2018-11-28 14:00:00     30
2018-11-28 14:01:00      2
2018-11-28 14:02:00     42
2018-11-28 14:03:00     51

#df1
                     count
date
2018-11-28 14:05:00     41
2018-11-28 14:06:00     48
2018-11-28 14:07:00      4
2018-11-28 14:08:00     50
2018-11-28 14:09:00     93

I have tried Split a series on time gaps in pandas?, which sadly is quite outdated and did not serve my purpose.

I actually did achieve what I want for the above sample, but the actual DataFrame is much larger and has many more gaps, which makes the following solution extremely inefficient:

dif = pd.Series(df.index).diff()
# position of the first two-minute gap (120000000000 ns == 2 minutes)
gap_index = dif[dif == pd.Timedelta(minutes=2)].index[0]
df[:gap_index], df[gap_index:]

I would very much appreciate any insight into this issue.



Solution 1:[1]

If you want to create a dictionary containing all your separated DataFrames, this should work:

df['identifier'] = (~df.index.to_series().diff().dt.seconds.div(60, fill_value=0).lt(2)).cumsum()

                     count  identifier
date                                  
2018-11-28 13:59:00      7           0
2018-11-28 14:00:00     49           0
2018-11-28 14:01:00     13           0
2018-11-28 14:02:00     47           0
2018-11-28 14:03:00     72           0
2018-11-28 14:05:00     33           1
2018-11-28 14:06:00     50           1
2018-11-28 14:07:00     10           1
2018-11-28 14:08:00     86           1
2018-11-28 14:09:00     40           1
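To make the chained expression easier to follow, here is a minimal sketch of what each step computes, on a smaller hypothetical index (note it uses `dt.total_seconds()`, which matches `dt.seconds` for sub-day gaps but is also safe across day boundaries):

```python
import pandas as pd

idx = pd.to_datetime(['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                      '2018-11-28 14:01:00', '2018-11-28 14:03:00'])
gaps = idx.to_series().diff()                            # NaT, then the row-to-row gaps
minutes = gaps.dt.total_seconds().div(60, fill_value=0)  # 0.0, 1.0, 1.0, 2.0
starts_chunk = ~minutes.lt(2)                            # True where the gap is >= 2 minutes
identifier = starts_chunk.cumsum()                       # 0, 0, 0, 1
```

Every row where a two-minute-or-larger gap starts bumps the running counter, so rows between gaps share an identifier.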

Then create a dict and append the groups:

d = {}
for i, grp in df.groupby('identifier'):
    d['df_' + str(i)] = grp
print(d)

Output:

{'df_0':                      count  identifier
 date
 2018-11-28 13:59:00      7           0
 2018-11-28 14:00:00     49           0
 2018-11-28 14:01:00     13           0
 2018-11-28 14:02:00     47           0
 2018-11-28 14:03:00     72           0,
 'df_1':                      count  identifier
 date
 2018-11-28 14:05:00     33           1
 2018-11-28 14:06:00     50           1
 2018-11-28 14:07:00     10           1
 2018-11-28 14:08:00     86           1
 2018-11-28 14:09:00     40           1}

You can then view your data by indexing the dict with its keys:

print(d['df_1'])
                     count  identifier
date
2018-11-28 14:05:00     33           1
2018-11-28 14:06:00     50           1
2018-11-28 14:07:00     10           1
2018-11-28 14:08:00     86           1
2018-11-28 14:09:00     40           1
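The loop above can also be written as a dict comprehension. A self-contained sketch (the four-row frame and the names `d` and `grp` are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'count': np.arange(4)},
    index=pd.to_datetime(['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                          '2018-11-28 14:03:00', '2018-11-28 14:04:00']))
df['identifier'] = (~df.index.to_series().diff().dt.total_seconds()
                     .div(60, fill_value=0).lt(2)).cumsum()

# same result as the explicit loop, built in one expression
d = {'df_' + str(i): grp for i, grp in df.groupby('identifier')}
```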

Solution 2:[2]

Here's a quick and dirty solution:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
           '2018-11-28 14:01:00', '2018-11-28 14:02:00',
           '2018-11-28 14:03:00', '2018-11-28 14:05:00',
           '2018-11-28 14:06:00', '2018-11-28 14:07:00',
           '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
               'count': np.random.randint(1, 100, 10)})

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

# calculate where to cut: True where the gap to the previous row exceeds one minute
df['cut_point'] = df['date'].diff() > pd.Timedelta(minutes=1)

# generate chunks
res = []
chunk = []

for _, row in df.iterrows():
    if row['cut_point'] == True:
        # close the current chunk and start a new one
        res.append(pd.DataFrame(chunk))
        chunk = []
    chunk.append({'date': row['date'], 'count': row['count']})

res.append(pd.DataFrame(chunk))

print(res[0])

print(res[1])
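For large frames, the row-by-row loop can be avoided entirely; a vectorized sketch of the same idea (the variable names `groups` and `chunks` and the sample data are illustrative, assuming the frame is already sorted by `date`):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                            '2018-11-28 14:02:30', '2018-11-28 14:03:30']),
    'count': [14, 30, 41, 48]})

# True where the gap to the previous row exceeds one minute;
# the cumulative sum gives each chunk a distinct label
groups = (df['date'].diff() > pd.Timedelta(minutes=1)).cumsum()
chunks = [grp for _, grp in df.groupby(groups)]
```

This keeps all the work inside pandas instead of iterating in Python, which matters for the asker's "much larger" DataFrame.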

Solution 3:[3]

For those of you googling this like me, who may be working with a non-datetime column and want to split the DataFrame based on irregular gaps in that column, I've adapted anky's answer above to a more general form:

Where:

  • 'stepCol' is the name of the column you are looking for irregular gaps in
  • stepResolution is the regular step value you expect, or the smallest step size that can occur.
  • minStepsToSplitOn is the threshold at which you would like to split. It should probably be 2 for most cases; but if, for example, you would like to allow gaps in 'stepCol' of up to 60 seconds, where your smallest step size is 10 seconds, then you would set stepResolution to 10 and minStepsToSplitOn to 7.

stepResolution = 60
minStepsToSplitOn = 2
df['identifier'] = (~df['stepCol'].diff().div(stepResolution, fill_value=0).lt(minStepsToSplitOn)).cumsum()

You can then simply use df.groupby('identifier') to split wherever there are irregular steps in your chosen column.
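A runnable sketch of this generalization on a hypothetical numeric column (the data, `stepCol`/`value` names, and `parts` are made up for illustration):

```python
import pandas as pd

# hypothetical data: timestamps in seconds with one 180-second gap
df = pd.DataFrame({'stepCol': [0, 60, 120, 300, 360],
                   'value': [1, 2, 3, 4, 5]})

stepResolution = 60      # expected step between consecutive rows
minStepsToSplitOn = 2    # split once a gap reaches two steps

df['identifier'] = (~df['stepCol'].diff()
                      .div(stepResolution, fill_value=0)
                      .lt(minStepsToSplitOn)).cumsum()
parts = [grp for _, grp in df.groupby('identifier')]
```

Here the 180-second gap between 120 and 300 is three steps, so the frame splits into two parts at that row.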

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 anky
Solution 2 milos.ai
Solution 3 Sheldonfrith