Check if timestamp column is in date range from another dataframe
I have a dataframe df_A with two columns, 'amin' and 'amax', which together define a set of time ranges.
My objective is to find out whether each value of a column in df_B lies within any of the ranges given by the 'amin' and 'amax' columns of df_A.
df_A[['amin','amax']]
amin amax
0 2016-07-16 19:37:03 2016-07-17 11:16:32
1 2016-07-04 21:15:54 2016-07-05 10:57:46
2 2016-07-24 23:30:41 2016-07-25 15:38:02
3 2016-07-12 03:02:38 2016-07-12 22:11:01
df_B['created_date']
created_date
2016-07-17 01:16:32
2016-07-05 10:15:54
2016-07-12 12:11:01
df_A['amin'] = pd.to_datetime(df_A['amin'], errors='coerce')
df_A['amax'] = pd.to_datetime(df_A['amax'], errors='coerce')
df_B['created_date'] = pd.to_datetime(df_B['created_date'],errors='coerce')
from datetime import datetime

def dt2epoch(value):
    # seconds elapsed since a fixed reference date (2015-12-31)
    epoch = (value - pd.to_datetime(datetime(2015, 12, 31).strftime('%Y-%m-%d %H:%M:%S.%f'))).total_seconds()
    return epoch
df_A['amax_epoch']=df_A['amax'].apply(dt2epoch)
df_A['amin_epoch']=df_A['amin'].apply(dt2epoch)
df_B['created_date_epoch']=df_B['created_date'].apply(dt2epoch)
def make_tuple(row):
    # pair each value in the row with the row's last value
    n = len(row)
    row = [(x, row[n - 1]) for x in row]
    return row
minMaxTuple = minMax.apply(make_tuple, axis=1)
Above is part of my code. I've tried the following (not sure if it's necessary):
- converting the timestamps to epoch values
- converting df_A into a list of tuples.
However, df_A and df_B have different numbers of rows, and I don't have any id column to merge them on.
label = []
for l in df_B['created_date_epoch']:
    if (m[0] for m in minMaxTuple) <= l <= (m[1] for m in minMaxTuple):
        label.append('1')
    else:
        label.append('0')
However, when I run this, the result I get for 'label' is an empty list.
Also, 'label' should be a column with the same number of rows as df_A.
Ultimately, I would like to add a new 'label' column to df_A:
minMaxTuple label
(2016-07-16 19:37:03, 2016-07-17 11:16:32) 1
(2016-07-04 21:15:54, 2016-07-05 10:57:46) 1
(2016-07-24 23:30:41, 2016-07-25 15:38:02) 0
(2016-07-12 03:02:38, 2016-07-12 22:11:01) 1
Solution 1:[1]
One solution to check whether a created_date in df_b falls between an amin and amax is to use boolean logic. In a row-wise calculation for each row in df_a you could use the following condition:
if sum((row['amin'] > df_b['created_date']) | (row['amax'] < df_b['created_date'])) == len(df_b)
In this statement I am using the logical operator | to check whether amin is greater than created_date OR amax is less than created_date. If that is True for a given created_date, that date does not fall within the period bounded by amin and amax. If none of the created_dates fall within the period bounded by amin and amax, you can assign a value of 0 to df_a['label']. Something like:
import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

def myfunc(row, df_b):
    # return 0 if no created_date falls inside this row's [amin, amax] range, else 1
    if sum((row['amin'] > df_b['created_date']) | (row['amax'] < df_b['created_date'])) == len(df_b):
        return 0
    else:
        return 1
a_str = """
amin,amax
2016-07-16 19:37:03,2016-07-17 11:16:32
2016-07-04 21:15:54,2016-07-05 10:57:46
2016-07-24 23:30:41,2016-07-25 15:38:02
2016-07-12 03:02:38,2016-07-12 22:11:01"""
b_str = """
created_date
2016-07-17 01:16:32
2016-07-05 10:15:54
2016-07-12 12:11:01"""
df_a = pd.read_csv(StringIO(a_str), sep=',')
df_b = pd.read_csv(StringIO(b_str), sep=',')
# Convert to datetime
df_a['amin'] = pd.to_datetime(df_a['amin'])
df_a['amax'] = pd.to_datetime(df_a['amax'])
df_b['created_date'] = pd.to_datetime(df_b['created_date'])
df_a['label'] = df_a.apply(lambda x: myfunc(x,df_b), axis=1)
Which returns a column label in df_a with the expected output of:
amin amax label
0 2016-07-16 19:37:03 2016-07-17 11:16:32 1
1 2016-07-04 21:15:54 2016-07-05 10:57:46 1
2 2016-07-24 23:30:41 2016-07-25 15:38:02 0
3 2016-07-12 03:02:38 2016-07-12 22:11:01 1
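For larger frames, the same row-wise check can also be done without apply by broadcasting the comparisons with NumPy. This is only a minimal sketch of that idea, not part of the original answer, and it assumes a reasonably recent pandas (for to_numpy) and that both frames fit comfortably in memory; column names follow the example above:
import numpy as np
created = df_b['created_date'].to_numpy()
amin = df_a['amin'].to_numpy()[:, None]   # shape (len(df_a), 1) for broadcasting
amax = df_a['amax'].to_numpy()[:, None]
# a row gets label 1 if any created_date falls inside its [amin, amax] range
df_a['label'] = ((created >= amin) & (created <= amax)).any(axis=1).astype(int)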
Solution 2:[2]
Very similar to @dubbbdan's answer, but maybe simpler using any and the & operator:
any_in_range = lambda row, iterable: any(
    [(row[0] < x) & (x < row[1]) for x in iterable])

df_A['label'] = df_A.apply(any_in_range, iterable=df_B['created_date'], axis=1)
print(df_A)
Prints:
amin amax label
0 2016-07-16 19:37:03 2016-07-17 11:16:32 True
1 2016-07-04 21:15:54 2016-07-05 10:57:46 True
2 2016-07-24 23:30:41 2016-07-25 15:38:02 False
3 2016-07-12 03:02:38 2016-07-12 22:11:01 True
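As a small follow-up (not part of the original answer), the boolean column can be cast afterwards if the 1/0 labels from the question are preferred:
# convert the boolean label to 1/0 to match the desired output
df_A['label'] = df_A['label'].astype(int)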
Solution 3:[3]
I created a list of tuples from the min and max date columns, then searched for each date-timestamp inside this list of tuples.
tuple_to_search = list(zip(df_A.amin, df_A.amax))
df_B['is_true'] = df_B['created_date'].map(lambda k: any(filter(lambda x: x[0] <= k <= x[1], tuple_to_search))).astype(int)
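Since the question ultimately wants the label on df_A rather than df_B, the same list of tuples can be reused the other way around; a short sketch of that idea, not part of the original answer:
# label each [amin, amax] range by whether any created_date falls inside it
df_A['label'] = [int(any(lo <= d <= hi for d in df_B['created_date']))
                 for lo, hi in tuple_to_search]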
Solution 4:[4]
One option is with conditional_join from pyjanitor, which avoids a cartesian join (helpful with memory, and performance, depending on the data size):
# pip install pyjanitor
import pandas as pd
import janitor
(df_B
 .conditional_join(
     df_A,
     ('created_date', 'amin', '>='),
     ('created_date', 'amax', '<='),
     how='right')
 .assign(label=lambda df: df.created_date.notna().astype(int))
 .drop(columns='created_date')
)
amin amax label
0 2016-07-16 19:37:03 2016-07-17 11:16:32 1
1 2016-07-04 21:15:54 2016-07-05 10:57:46 1
2 2016-07-24 23:30:41 2016-07-25 15:38:02 0
3 2016-07-12 03:02:38 2016-07-12 22:11:01 1
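If installing pyjanitor is not an option, a rough plain-pandas sketch of the same labelling (an assumption-laden alternative, requiring pandas >= 0.25 where IntervalIndex.contains accepts a scalar) could look like this:
import numpy as np
import pandas as pd

# one closed interval per [amin, amax] range
intervals = pd.IntervalIndex.from_arrays(df_A['amin'], df_A['amax'], closed='both')

# mark every interval that contains at least one created_date
hits = np.zeros(len(df_A), dtype=bool)
for ts in df_B['created_date']:
    hits |= intervals.contains(ts)

df_A['label'] = hits.astype(int)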
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | dubbbdan |
| Solution 2 | Delforge |
| Solution 3 | Stephen Kennedy |
| Solution 4 | sammywemmy |