'Integration of pandas timeframe

I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10s sampling rate, but if it is necissary to have an even timeinterval, I guess I can just use df.resample().

Timestamp                    Power [W]
2022-05-05 06:00:05+02:00    2.0
2022-05-05 06:00:15+02:00    1.2
2022-05-05 06:00:25+02:00    0.3
2022-05-05 06:00:35+02:00    4.3
2022-05-05 06:00:45+02:00    1.1
                            ... 
2022-05-06 20:59:19+02:00    1.4
2022-05-06 20:59:29+02:00    2.0
2022-05-06 20:59:39+02:00    4.1
2022-05-06 20:59:49+02:00    1.3
2022-05-06 20:59:59+02:00    0.8

So I want to be able to integrate over both hours and days, so my output could look like:

Timestamp                    Energy [Wh]
2022-05-05 07:00:00+02:00    some values
2022-05-05 08:00:00+02:00    .
2022-05-05 09:00:00+02:00    .
2022-05-05 10:00:00+02:00    .
2022-05-05 11:00:00+02:00    
                            ... 
2022-05-06 20:00:00+02:00    
2022-05-06 21:00:00+02:00    

(hour 07:00 is to include values between 06:00-07:00, and so on...)

and

Timestamp      Energy [Wh]
2022-05-05     .
2022-05-06     .

So how do I achieve this? I was thinking I could use scipy.integrate, but my outputs look a bit weird.

Thank you.



Solution 1:[1]

You could create a new column representing your Timestamp truncated to hours:

df['Timestamp_hour'] = df['Timestamp'].dt.floor('h')

Please note that in that case, the rows between hour 6.00 to hour 6.59 will be included into the 6 hour and not the 7 one.

Then you can group your rows by your new column before applying your integration computation:

df_integrated_hour = (
    df
    .groupby('Timestamp_hour')
    .agg({
        'Power': YOUR_INTEGRATION_FUNCTION
    })
    .rename(columns={'Power': 'Energy'})
    .reset_index()
)

Hope this will help you

Solution 2:[2]

Here's a very simple solution using rectangle integration with rectangles spaced in 10 second intervals starting at zero and therefore NOT centered exactly on the data points (assuming that the data is delivered in regular intervals and no data is missing), a.k.a. a simple average.

from numpy import random
import pandas as pd

times = pd.date_range('2022-05-05 06:00:04+02:00', '2022-05-06 21:00:00+02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])

hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)

hours_in_a_day = 24  # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean()
daily.columns = ["Energy [Wh]"]
print(daily)

Output:

               Energy [Wh]
2022-05-05 6      2.625499
           7      2.365678
           8      2.579349
           9      2.569170
           10     2.543611
           11     2.742332
           12     2.478145
           13     2.444210
           14     2.507821
           15     2.485770
           16     2.414057
           17     2.567755
           18     2.393725
           19     2.609375
           20     2.525746
           21     2.421578
           22     2.520466
           23     2.653466
2022-05-06 0      2.559110
           1      2.519032
           2      2.472282
           3      2.436023
           4      2.378289
           5      2.549572
           6      2.558478
           7      2.470721
           8      2.429454
           9      2.390543
           10     2.538194
           11     2.537564
           12     2.492308
           13     2.387632
           14     2.435582
           15     2.581616
           16     2.389549
           17     2.461523
           18     2.576084
           19     2.523577
           20     2.572270
            Energy [Wh]
2022-05-05    60.597007
2022-05-06    59.725029

Trapezoidal integration should give a slightly better approximation but it's harder to implement right. You'd have to deal carefully with the hour boundaries. That's basically just a matter of inserting interpolated values twice at the full hour (at 09:59:59.999 and 10:00:00). But then you'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example go from 06:00:05 to 06:00:00. But careful, what to do if your measurements only start somewhere in the middle like 06:17:23?

Solution 3:[3]

This solution uses a package called staircase, which is part of the pandas ecosystem and exists to make working with step functions (i.e. piecewise constant) easier.

It will create a Stairs object (which represents a step function) from a pandas.Series, then bin across arbitrary DatetimeIndex values, then integrate.

This solution requires staircase 2.4.2 or above

setup

df = pd.DataFrame(
    {
        "Timestamp":pd.to_datetime(
            [
                "2022-05-05 06:00:05+02:00",
                "2022-05-05 06:00:15+02:00",
                "2022-05-05 06:00:25+02:00",
                "2022-05-05 06:00:35+02:00",
                "2022-05-05 06:00:45+02:00",
            ]
        ),
        "Power [W]":[2.0, 1.2, 0.3, 4.3, 1.1]
    }
)

solution

import staircase as sc

# create step function
sf = sc.Stairs.from_values(
    initial_value=0,
    values=df.set_index("Timestamp")["Power [W]"],
)

# optional: plot
sf.plot(style="hlines")

step function plot

# create the bins (datetime index) over which you want to integrate
# using 20s intervals in this example
bins = pd.date_range(
    "2022-05-05 06:00:00+02:00", "2022-05-05 06:01:00+02:00", freq="20s"
)

# slice into bins and integrate
result = sf.slice(bins).integral()

result will be a pandas.Series with an IntervalIndex and Timedelta values. The IntervalIndex retains timezone info, it just doesn't display it:

[2022-05-05 06:00:00, 2022-05-05 06:00:20)          0 days 00:00:26
[2022-05-05 06:00:20, 2022-05-05 06:00:40)   0 days 00:00:30.500000
[2022-05-05 06:00:40, 2022-05-05 06:01:00)          0 days 00:00:38
dtype: timedelta64[ns]

You can change the index to be the "left" values (and see this timezone info) like this:

result.index = result.index.left

You can change values to a float with division by an appropriate Timedelta. Eg to convert to minutes:

result/pd.Timedelta("1min")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Riley