'How to merge two dataframe with column in overlapping datetime range
Following on from this question How to join two dataframes for which column values are within a certain range?. But my datetime range column may overlap with each other.
For example:
>>>df_1
timestamp A B
0 2019-07-14 05:31:00 0.020228 0.026572
1 2019-07-14 06:32:00 0.057780 0.175499
2 2019-07-14 07:02:00 0.076623 0.875499
>>>df_2
start end event
0 2019-07-14 05:30:00 2019-07-14 06:30:00 E1
1 2019-07-14 06:00:00 2019-07-14 07:00:00 E2
2 2019-07-14 06:30:01 2019-07-14 07:30:00 E3
3 2019-07-14 07:30:01 2019-07-14 08:30:00 E4
I want to find A
of df_1
in its interval for df_2
. The result I expect are as follows:
start end event timestamp A
0 2019-07-14 05:30:00 2019-07-14 06:30:00 E1 2019-07-14 05:31:00 0.020228
1 2019-07-14 06:00:00 2019-07-14 07:00:00 E2 2019-07-14 06:32:00 0.057780
2 2019-07-14 06:30:01 2019-07-14 07:30:00 E3 2019-07-14 06:32:00 0.057780
3 2019-07-14 06:30:01 2019-07-14 07:30:00 E3 2019-07-14 07:02:00 0.076623
I followed the answers in the links above, but I didn't the pandas method to achieve my goal. The following mistakes occurred when I tried to use the answer with the highest votes.
KeyError: 'indexer does not intersect a unique set of intervals'
Can anybody help me? Thanks in advance.
Solution 1:[1]
it can also be done using numpy broadcast and boolean indexing like below
## load sample data
df1 = pd.DataFrame([('0', '2019-07-14 05:31:00', '0.020228', '0.026572'), ('1', '2019-07-14 06:32:00', '0.057780', '0.175499'), ('2', '2019-07-14 07:02:00', '0.076623', '0.875499')], columns=('id', 'timestamp', 'A', 'B'))
df2 = pd.DataFrame([('0', '2019-07-14 05:30:00', '2019-07-14 06:30:00', 'E1'), ('1', '2019-07-14 06:00:00', '2019-07-14 07:00:00', 'E2'), ('2', '2019-07-14 06:30:01', '2019-07-14 07:30:00', 'E3'), ('3', '2019-07-14 07:30:01', '2019-07-14 08:30:00', 'E4')], columns=('id', 'start', 'end', 'event'))
df1["timestamp"] = pd.to_datetime(df1["timestamp"])
df2["start"] = pd.to_datetime(df2["start"])
df2["end"] = pd.to_datetime(df2["end"])
Solution
## df2[["start"]] is a column vector of size m and df1.timestamp.values is row
## vector of size n then broad cast will result matrix of shape m,n which is
## result of comparing each pair of m and n
compare = (df2[["start"]].values<df1.timestamp.values) & (df2[["end"]].values>df1.timestamp.values)
## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(df1)*len(df2))[compare.ravel()]
## calculate row and column index from cell number
pd.concat([df2.iloc[ind//len(df1)].reset_index(drop=True), df1.iloc[ind%len(df1)].reset_index(drop=True)], axis=1, sort=False)
Result
start end event timestamp A B
0 2019-07-14 05:30:00 2019-07-14 06:30:00 E1 2019-07-14 05:31:00 0.020228 0.026572
1 2019-07-14 06:00:00 2019-07-14 07:00:00 E2 2019-07-14 06:32:00 0.057780 0.175499
2 2019-07-14 06:30:01 2019-07-14 07:30:00 E3 2019-07-14 06:32:00 0.057780 0.175499
3 2019-07-14 06:30:01 2019-07-14 07:30:00 E3 2019-07-14 07:02:00 0.076623 0.875499
Edit
in response to comment from @baccandr here is some more explanation how the indexing working.
After comparison we get the compare matrix with boolean values like below
array([[ True, False, False],
[False, True, False],
[False, True, True],
[False, False, False]])
- you can think of this matrix as table with column representing index of
df1
which is (0,1,2) and rows representing index ofdf2
which is (0,1,2,3) - The value in cell is True if corresponding row in df1 and df2 mets the condition e.g row 0 of df1 and row 0 of df2; row 2 of df1 and row 1 of df2 meets the condition
- To find rows meeting the condition from df1 and df2 separately we can use
compare
as index directly like df1[compare.T] and df2[compare] but it will not give rows in correct order for pairing. This will give row in increasing order of index for both which will not always be true. - Now our aim here is to get indxes of both dataframes which meets the condition, in correct order. so what we need is index of df1 [0, 1, 2, 2] and index of df2 [0,1,1,2]. Using those indexes in df1 and df2 we get the match in correct order from both
- so what we did here is count cells of matrix from left to right then down then again left two right giving a unique number to each cell. Then filter the cells where conditions meets and convert that to index for df1 and df2.
As the matrix represents index for 2 dataframes in matrix format I think np.where may not work. Another way we can do it is use compare
as index for df2 and only find index of df1 like below
this will repeat index of df1 for each row of df2 and find the index of df1 in order with df2
ind_df1 = np.tile(np.arange(len(df1)), len(df2))[compare.ravel()]
pd.concat([df2[compare].reset_index(drop=True), df1.iloc[ind_df1].reset_index(drop=True)], axis=1, sort=False)
I hope this makes it clear, if you have some other idea then I would love to see it in comment or as answer
Solution 2:[2]
This answer does something very similar to what you need. Below is how I adapt that solution to fit your problem but there may be better implementations:
bins = list(zip(df2['start'],df2['end']))
def overlapping_bins(x):
return pd.Series([l for l in bins if l[0] <= x <= l[1]])
df3=pd.concat([df1, df1.timestamp.apply(overlapping_bins).stack().reset_index(1, drop=True)],
axis=1).rename(columns={0: 'bins'})
#Create start and end columns and drop bins
df3.loc[:, 'start'] = df3.bins.map(lambda x: x[0])
df3.loc[:, 'end'] = df3.bins.map(lambda x: x[1])
df3.drop('bins',axis=1,inplace=True)
#Merge df2 with df3 on the common columns
df4=df2.merge(df3).drop('B',axis=1)
This is the result I got:
start end event timestamp A
0 2019-07-14 05:30:00 2019-07-14 06:30:00 E1 2019-07-14 05:31:00 0.020228
1 2019-07-14 06:00:00 2019-07-14 07:00:00 E2 2019-07-14 06:32:00 0.057780
2 2019-07-14 06:30:01 2019-07-14 07:30:00 E3 2019-07-14 06:32:00 0.057780
3 2019-07-14 06:30:01 2019-07-14 07:30:00 E3 2019-07-14 07:02:00 0.076623
Solution 3:[3]
One option for a non-equi join is the conditional_join from pyjanitor; underneath the hood it uses binary search to avoid a cartesian product; this can be helpful, depending on the data size.
It can also handle overlapping intervals:
# pip install pyjanitor
import janitor
import pandas as pd
(df_1
.conditional_join(
df_2,
("timestamp", "start", ">="),
("timestamp", "end", "<="))
)
timestamp A B start end event
0 2019-07-14 05:31:00 0.020228 0.026572 2019-07-14 05:30:00 2019-07-14 06:30:00 E1
1 2019-07-14 06:32:00 0.057780 0.175499 2019-07-14 06:00:00 2019-07-14 07:00:00 E2
2 2019-07-14 06:32:00 0.057780 0.175499 2019-07-14 06:30:01 2019-07-14 07:30:00 E3
3 2019-07-14 07:02:00 0.076623 0.875499 2019-07-14 06:30:01 2019-07-14 07:30:00 E3
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | baccandr |
Solution 3 | sammywemmy |