In Pandas, how to return the index of the next value which is above/below a threshold
I have a dataframe like this:
date value
0 2018-05-15 06:00:00 100.86
1 2018-05-15 07:00:00 101.99
2 2018-05-15 08:00:00 110.00
3 2018-05-15 09:00:00 95.49
4 2018-05-15 10:00:00 92.32
I would like to create a new column which tells me the index of the next value that is a certain amount above or below the current value, e.g. more than 5 above (Over_5) or more than 3 below (Under_3):
date value Over_5 Under_3
0 2018-05-15 06:00:00 100.86 2 3
1 2018-05-15 07:00:00 101.99 2 3
2 2018-05-15 08:00:00 110.00 NaN 3
3 2018-05-15 09:00:00 95.49 NaN 4
4 2018-05-15 10:00:00 92.32 NaN NaN
Ideally, I would then like to use a Boolean-style column to indicate which comes first, Over_5 (0) or Under_3 (1):
date value Over_5 Under_3 Bool
0 2018-05-15 06:00:00 100.86 2 3 1
1 2018-05-15 07:00:00 101.99 2 3 1
2 2018-05-15 08:00:00 110.00 NaN 3 NaN
3 2018-05-15 09:00:00 95.49 NaN 4 NaN
4 2018-05-15 10:00:00 92.32 NaN NaN NaN
Apologies, I know it's not the best example. My current thinking has been a groupby with idxmax, but I'm not really sure how to do it, except maybe with a while/for loop.
It's to label encode for an ML project, so a good, vectorized way to carry this out would be great. Thanks.
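For reference, a naive for loop along the lines I was thinking would produce these columns (just a sketch to pin down the expected output, not what I want to end up with; next_index is a made-up helper name):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [100.86, 101.99, 110.00, 95.49, 92.32]})

def next_index(values, i, over=None, under=None):
    # Return the index of the first later value more than `over` above
    # (or more than `under` below) values[i], else NaN.
    for j in range(i + 1, len(values)):
        diff = values[j] - values[i]
        if over is not None and diff > over:
            return j
        if under is not None and diff < -under:
            return j
    return np.nan

vals = df['value'].to_list()
df['Over_5'] = [next_index(vals, i, over=5) for i in range(len(vals))]
df['Under_3'] = [next_index(vals, i, under=3) for i in range(len(vals))]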
Solution 1:[1]
What about:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2018-05-15 06:00:00', '2018-05-15 07:00:00', '2018-05-15 08:00:00',
'2018-05-15 09:00:00', '2018-05-15 10:00:00']),
'value': [100.86, 101.99, 110.00, 95.49, 92.32]})
df['Over_5'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
.where(lambda v: v > 5).first_valid_index(), axis=1)
df['Under_3'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
.where(lambda v: v < -3).first_valid_index(), axis=1)
df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)
print(df)
# Prints:
date value Over_5 Under_3 Bool
0 2018-05-15 06:00:00 100.86 2.0 3.0 1.0
1 2018-05-15 07:00:00 101.99 2.0 3.0 1.0
2 2018-05-15 08:00:00 110.00 NaN 3.0 NaN
3 2018-05-15 09:00:00 95.49 NaN 4.0 NaN
4 2018-05-15 10:00:00 92.32 NaN NaN NaN
Some details:
# Gets the values at the current row and after it.
df[df.index>=row.name].value
# Function that subtracts the value of the current row.
lambda row: (df[df.index>=row.name].value - row.value)
# Function that gets the first index where the difference is > 5
f = lambda row: (df[df.index>=row.name].value - row.value).where(lambda v: v > 5).first_valid_index()
# Apply the function of the previous line to every row.
df.apply(f, axis=1)
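To make the chain more concrete, this is roughly what the intermediate objects look like for row 0 (value 100.86); the variable names here are just for illustration:
diffs = df[df.index >= 0].value - df.loc[0, 'value']
print(diffs.round(2).tolist())         # [0.0, 1.13, 9.14, -5.37, -8.54]

masked = diffs.where(lambda v: v > 5)  # everything not > 5 becomes NaN
print(masked.round(2).tolist())        # [nan, nan, 9.14, nan, nan]

print(masked.first_valid_index())      # 2 -> the Over_5 entry for row 0
The Bool column then relies on the fact that any comparison involving NaN is False, so rows where either index is missing end up as NaN after the replace/astype chain.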
Numpy alternative:
An alternative with numpy, which should be faster:
import numpy as np
arr = df['value'].to_numpy()
# Calculate all differences between values at once.
# Then, take the upper triangular matrix (i.e., only differences with values coming afterwards).
all_diffs = np.triu(arr - arr[:, None])
# Check for the first index where the condition is fulfilled.
df['Over_5'] = np.argmax(all_diffs > 5, axis=1)
df['Under_3'] = np.argmax(all_diffs < -3, axis=1)
df[['Over_5', 'Under_3']] = df[['Over_5', 'Under_3']].replace(0, np.nan)
df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)
print(df)
date value Over_5 Under_3 Bool
0 2018-05-15 06:00:00 100.86 2.0 3.0 1.0
1 2018-05-15 07:00:00 101.99 2.0 3.0 1.0
2 2018-05-15 08:00:00 110.00 NaN 3.0 NaN
3 2018-05-15 09:00:00 95.49 NaN 4.0 NaN
4 2018-05-15 10:00:00 92.32 NaN NaN NaN
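For intuition, this is roughly what the helper matrix and the argmax step look like for the sample values:
arr = df['value'].to_numpy()
all_diffs = np.triu(arr - arr[:, None])

# Row 0 of the (rounded) upper-triangular difference matrix:
print(all_diffs.round(2)[0])              # [ 0.    1.13  9.14 -5.37 -8.54]

# np.argmax returns the first column where the condition holds, or 0 when
# no column in that row satisfies it. The diagonal differences are 0 and
# np.triu zeroes out everything below the diagonal, so column 0 can never
# be > 5 or < -3; that is why 0 safely doubles as the "not found" marker
# that is replaced with NaN afterwards.
print(np.argmax(all_diffs > 5, axis=1))   # [2 2 0 0 0]
print(np.argmax(all_diffs < -3, axis=1))  # [3 3 3 4 0]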
Runtime difference:
%timeit calc_pandas()
# 2.92 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit calc_numpy()
# 587 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So as you can see, the numpy version is about five times faster with your example, and potentially more with bigger data sets (here calc_pandas and calc_numpy simply wrap the two approaches above). It requires a bit more memory to calculate the difference matrix, but it should be fine unless you have a really big dataframe.
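To put the memory caveat in numbers: the difference matrix holds n * n float64 values, so a rough, illustrative estimate of when it stops being practical is:
n = 100_000
print(n * n * 8 / 1e9, 'GB')   # 80.0 GB for 100k rows -- at that scale the
                               # apply-based version or a chunked approach
                               # would be needed.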
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |