'How to extract a specific range out of a dataframe and store it in another dataframe and then delete the range out of the original dataframe | pandas

I have some timeseries of energy consumption and i can eyeball when someone is on holidays if the consumption is under a certain range. I have this piece of code to extract said holidays:

dummy data:

values = [0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7]
df = pd.DataFrame(values, columns = ["values"])

so the df looks like this:

    values
0     0.80
1     0.80
2     0.70
3     0.60
4     0.70
5     0.50
6     0.80
7     0.40
8     0.30
9     0.50
10    0.70
11    0.50
12    0.70
13    0.15
14    0.11
15    0.10
16    0.13
17    0.16
18    0.17
19    0.10
20    0.13
21    0.30
22    0.40
23    0.50
24    0.60
25    0.70

now, given these variables, I want to detect all subsequent values that are smaller than value_threshold for at least 5 timesteps:

value_threshold = 0.2
count_threshold  = 5

I check which values are under the threshold:

is_under_val_threshold =df["values"] < value_threshold

which gives me this:

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21    False
22    False
23    False
24    False
25    False

Now I can isolate the values under the threshold:

subset_thre = df.loc[is_under_val_threshold, "values"]
13    0.15
14    0.11
15    0.10
16    0.13
17    0.16
18    0.17
19    0.10
20    0.13

Since this can happen for more than one time and not always for more than 5 steps, I put each "sequence" into groups:

thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    2
14    2
15    2
16    2
17    2
18    2
19    2
20    2
21    3
22    3
23    3
24    3
25    3

Now I would like to extract the groups that are under the threshold for more than 5 steps and create new dataframes where the break is, so that in this example I will have three dataframes.

What I tried so far:

Identify where a group switch happens:

identify_switch = thre_grouper.diff().to_frame()
index_of_switch = identify_switch.index[identify_switch['values'] == 1].tolist()

which gives me the index of where the switch happens:

[13, 21]

with this I can for this example at least do the splits as I wish:

holidays_1 = df[index_of_switch[0]:index_of_switch[1]]
split_df_1 = df[:index_of_switch[0]]
split_df_2 = df[index_of_switch[1]:]

My question would be, how to make sure that when looping this for very variable amounts of holidays within a series to make sure that I will do all the needed splits



Solution 1:[1]

I have added to you values to give a better idea of how this answer works. The first few rows are under 0.2, but are not of 5 or more consecutively, so not "holidays", 16-18 are the same, 20-24 satisfy the conditions. Therefore the output should be "split_df_1" 0-19, "holidays_1" 20-24, "split_df_2" 25-32.

import pandas as pd
values = [0.1,0.15,0.1,0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.5,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7,0.1,0.15,0.1]
df = pd.DataFrame(values, columns = ["values"])
df
#    values
#0     0.10
#1     0.15
#2     0.10
#3     0.80
#4     0.80
#5     0.70
#6     0.60
#7     0.70
#8     0.50
#9     0.80
#10    0.40
#11    0.30
#12    0.50
#13    0.70
#14    0.50
#15    0.70
#16    0.15
#17    0.11
#18    0.10
#19    0.50
#20    0.13
#21    0.16
#22    0.17
#23    0.10
#24    0.13
#25    0.30
#26    0.40
#27    0.50
#28    0.60
#29    0.70
#30    0.10
#31    0.15
#32    0.10

The conditions and other series you created:

# conditions
value_threshold = 0.2
count_threshold = 5

# under value_threshold bool
is_under_val_threshold = df["values"] < value_threshold

# grouped
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()

Calculating the group numbers (in thre_grouper) that satisfy the conditions of values being less than value_threshold and greater than or equal to count_threshold:

# if the first value is less than value_threshold, then start from first group (index 0)
if (df["values"].iloc[0] < value_threshold):
    x = 0
# otherwise start from second (index 1)
else:
    x = 1

# potential holiday groups are every other group
holidays = thre_grouper[thre_grouper.isin(thre_grouper.unique()[x::2])].value_counts(sort=False)
# get group number of those greater than count_threshold, and add start of dataframe and group above last
is_holiday = [0] + list(holidays[holidays >= count_threshold].to_frame().index) + [thre_grouper.max()+1]

Looping through to create dataframes:

# dictionary to add dataframes to
d = {}

for i in range(1, len(is_holiday)):
    # split dataframes are those with group numbers between those in is_holiday list
    d["split_df_"+str(i)] = df.loc[thre_grouper[
        (thre_grouper > is_holiday[i-1]) &
        (thre_grouper < is_holiday[i])].index]
    # holiday dataframes are those that are in the is_holiday list but not the first or last
    if not i in([0, len(is_holiday)-1]):
        d["holiday_"+str(i)] = df.loc[thre_grouper[
            thre_grouper == is_holiday[i]].index]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rawson