'How to merge two dataframe with column in overlapping datetime range

Following on from this question How to join two dataframes for which column values are within a certain range?. But my datetime range column may overlap with each other.

For example:

>>>df_1

  timestamp              A           B
0 2019-07-14 05:31:00    0.020228    0.026572
1 2019-07-14 06:32:00    0.057780    0.175499
2 2019-07-14 07:02:00    0.076623    0.875499

>>>df_2

  start                  end                    event
0 2019-07-14 05:30:00    2019-07-14 06:30:00    E1
1 2019-07-14 06:00:00    2019-07-14 07:00:00    E2
2 2019-07-14 06:30:01    2019-07-14 07:30:00    E3
3 2019-07-14 07:30:01    2019-07-14 08:30:00    E4

I want to find A of df_1 in its interval for df_2. The result I expect are as follows:

  start                  end                    event timestamp             A
0 2019-07-14 05:30:00    2019-07-14 06:30:00    E1    2019-07-14 05:31:00    0.020228
1 2019-07-14 06:00:00    2019-07-14 07:00:00    E2    2019-07-14 06:32:00    0.057780
2 2019-07-14 06:30:01    2019-07-14 07:30:00    E3    2019-07-14 06:32:00    0.057780
3 2019-07-14 06:30:01    2019-07-14 07:30:00    E3    2019-07-14 07:02:00    0.076623

I followed the answers in the links above, but I didn't the pandas method to achieve my goal. The following mistakes occurred when I tried to use the answer with the highest votes.

KeyError: 'indexer does not intersect a unique set of intervals'

Can anybody help me? Thanks in advance.



Solution 1:[1]

it can also be done using numpy broadcast and boolean indexing like below

## load sample data
df1 = pd.DataFrame([('0', '2019-07-14 05:31:00', '0.020228', '0.026572'), ('1', '2019-07-14 06:32:00', '0.057780', '0.175499'), ('2', '2019-07-14 07:02:00', '0.076623', '0.875499')], columns=('id', 'timestamp', 'A', 'B'))
df2 = pd.DataFrame([('0', '2019-07-14 05:30:00', '2019-07-14 06:30:00', 'E1'), ('1', '2019-07-14 06:00:00', '2019-07-14 07:00:00', 'E2'), ('2', '2019-07-14 06:30:01', '2019-07-14 07:30:00', 'E3'), ('3', '2019-07-14 07:30:01', '2019-07-14 08:30:00', 'E4')], columns=('id', 'start', 'end', 'event'))

df1["timestamp"] = pd.to_datetime(df1["timestamp"])
df2["start"] = pd.to_datetime(df2["start"])
df2["end"] = pd.to_datetime(df2["end"])

Solution

## df2[["start"]] is a column vector of size m and df1.timestamp.values is row 
## vector of size n then broad cast will result matrix of shape m,n which is 
## result of comparing each pair of m and n
compare = (df2[["start"]].values<df1.timestamp.values) & (df2[["end"]].values>df1.timestamp.values)

## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(df1)*len(df2))[compare.ravel()]


## calculate row and column index from cell number
pd.concat([df2.iloc[ind//len(df1)].reset_index(drop=True), df1.iloc[ind%len(df1)].reset_index(drop=True)], axis=1, sort=False)

Result

    start               end                event    timestamp             A  B
0   2019-07-14 05:30:00 2019-07-14 06:30:00 E1  2019-07-14 05:31:00 0.020228    0.026572
1   2019-07-14 06:00:00 2019-07-14 07:00:00 E2  2019-07-14 06:32:00 0.057780    0.175499
2   2019-07-14 06:30:01 2019-07-14 07:30:00 E3  2019-07-14 06:32:00 0.057780    0.175499
3   2019-07-14 06:30:01 2019-07-14 07:30:00 E3  2019-07-14 07:02:00 0.076623    0.875499

Edit
in response to comment from @baccandr here is some more explanation how the indexing working.

After comparison we get the compare matrix with boolean values like below

array([[ True, False, False],
       [False,  True, False],
       [False,  True,  True],
       [False, False, False]])
  • you can think of this matrix as table with column representing index of df1 which is (0,1,2) and rows representing index of df2 which is (0,1,2,3)
  • The value in cell is True if corresponding row in df1 and df2 mets the condition e.g row 0 of df1 and row 0 of df2; row 2 of df1 and row 1 of df2 meets the condition
  • To find rows meeting the condition from df1 and df2 separately we can use compare as index directly like df1[compare.T] and df2[compare] but it will not give rows in correct order for pairing. This will give row in increasing order of index for both which will not always be true.
  • Now our aim here is to get indxes of both dataframes which meets the condition, in correct order. so what we need is index of df1 [0, 1, 2, 2] and index of df2 [0,1,1,2]. Using those indexes in df1 and df2 we get the match in correct order from both
  • so what we did here is count cells of matrix from left to right then down then again left two right giving a unique number to each cell. Then filter the cells where conditions meets and convert that to index for df1 and df2.

As the matrix represents index for 2 dataframes in matrix format I think np.where may not work. Another way we can do it is use compare as index for df2 and only find index of df1 like below

this will repeat index of df1 for each row of df2 and find the index of df1 in order with df2

ind_df1 = np.tile(np.arange(len(df1)), len(df2))[compare.ravel()]
pd.concat([df2[compare].reset_index(drop=True), df1.iloc[ind_df1].reset_index(drop=True)], axis=1, sort=False)

I hope this makes it clear, if you have some other idea then I would love to see it in comment or as answer

Solution 2:[2]

This answer does something very similar to what you need. Below is how I adapt that solution to fit your problem but there may be better implementations:

bins = list(zip(df2['start'],df2['end']))
def overlapping_bins(x):
    return pd.Series([l for l in bins if l[0] <= x <= l[1]])

df3=pd.concat([df1, df1.timestamp.apply(overlapping_bins).stack().reset_index(1, drop=True)], 
     axis=1).rename(columns={0: 'bins'})

 #Create start and end columns and drop bins
 df3.loc[:, 'start'] = df3.bins.map(lambda x: x[0])
 df3.loc[:, 'end'] = df3.bins.map(lambda x: x[1])
 df3.drop('bins',axis=1,inplace=True)

 #Merge df2 with df3 on the common columns
 df4=df2.merge(df3).drop('B',axis=1)

This is the result I got:

                start                 end event           timestamp         A
0 2019-07-14 05:30:00 2019-07-14 06:30:00    E1 2019-07-14 05:31:00  0.020228
1 2019-07-14 06:00:00 2019-07-14 07:00:00    E2 2019-07-14 06:32:00  0.057780
2 2019-07-14 06:30:01 2019-07-14 07:30:00    E3 2019-07-14 06:32:00  0.057780
3 2019-07-14 06:30:01 2019-07-14 07:30:00    E3 2019-07-14 07:02:00  0.076623

Solution 3:[3]

One option for a non-equi join is the conditional_join from pyjanitor; underneath the hood it uses binary search to avoid a cartesian product; this can be helpful, depending on the data size.

It can also handle overlapping intervals:

# pip install pyjanitor
import janitor
import pandas as pd

(df_1
.conditional_join(
    df_2, 
    ("timestamp", "start", ">="), 
    ("timestamp", "end", "<="))
)

            timestamp         A         B               start                 end event
0 2019-07-14 05:31:00  0.020228  0.026572 2019-07-14 05:30:00 2019-07-14 06:30:00    E1
1 2019-07-14 06:32:00  0.057780  0.175499 2019-07-14 06:00:00 2019-07-14 07:00:00    E2
2 2019-07-14 06:32:00  0.057780  0.175499 2019-07-14 06:30:01 2019-07-14 07:30:00    E3
3 2019-07-14 07:02:00  0.076623  0.875499 2019-07-14 06:30:01 2019-07-14 07:30:00    E3

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 baccandr
Solution 3 sammywemmy