'Integration of pandas timeframe
I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10s sampling rate, but if it is necissary to have an even timeinterval, I guess I can just use df.resample()
.
Timestamp Power [W]
2022-05-05 06:00:05+02:00 2.0
2022-05-05 06:00:15+02:00 1.2
2022-05-05 06:00:25+02:00 0.3
2022-05-05 06:00:35+02:00 4.3
2022-05-05 06:00:45+02:00 1.1
...
2022-05-06 20:59:19+02:00 1.4
2022-05-06 20:59:29+02:00 2.0
2022-05-06 20:59:39+02:00 4.1
2022-05-06 20:59:49+02:00 1.3
2022-05-06 20:59:59+02:00 0.8
So I want to be able to integrate over both hours and days, so my output could look like:
Timestamp Energy [Wh]
2022-05-05 07:00:00+02:00 some values
2022-05-05 08:00:00+02:00 .
2022-05-05 09:00:00+02:00 .
2022-05-05 10:00:00+02:00 .
2022-05-05 11:00:00+02:00
...
2022-05-06 20:00:00+02:00
2022-05-06 21:00:00+02:00
(hour 07:00 is to include values between 06:00-07:00, and so on...)
and
Timestamp Energy [Wh]
2022-05-05 .
2022-05-06 .
So how do I achieve this? I was thinking I could use scipy.integrate
, but my outputs look a bit weird.
Thank you.
Solution 1:[1]
You could create a new column representing your Timestamp truncated to hours:
df['Timestamp_hour'] = df['Timestamp'].dt.floor('h')
Please note that in that case, the rows between hour 6.00 to hour 6.59 will be included into the 6 hour and not the 7 one.
Then you can group your rows by your new column before applying your integration computation:
df_integrated_hour = (
df
.groupby('Timestamp_hour')
.agg({
'Power': YOUR_INTEGRATION_FUNCTION
})
.rename(columns={'Power': 'Energy'})
.reset_index()
)
Hope this will help you
Solution 2:[2]
Here's a very simple solution using rectangle integration with rectangles spaced in 10 second intervals starting at zero and therefore NOT centered exactly on the data points (assuming that the data is delivered in regular intervals and no data is missing), a.k.a. a simple average.
from numpy import random
import pandas as pd
times = pd.date_range('2022-05-05 06:00:04+02:00', '2022-05-06 21:00:00+02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])
hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)
hours_in_a_day = 24 # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean()
daily.columns = ["Energy [Wh]"]
print(daily)
Output:
Energy [Wh]
2022-05-05 6 2.625499
7 2.365678
8 2.579349
9 2.569170
10 2.543611
11 2.742332
12 2.478145
13 2.444210
14 2.507821
15 2.485770
16 2.414057
17 2.567755
18 2.393725
19 2.609375
20 2.525746
21 2.421578
22 2.520466
23 2.653466
2022-05-06 0 2.559110
1 2.519032
2 2.472282
3 2.436023
4 2.378289
5 2.549572
6 2.558478
7 2.470721
8 2.429454
9 2.390543
10 2.538194
11 2.537564
12 2.492308
13 2.387632
14 2.435582
15 2.581616
16 2.389549
17 2.461523
18 2.576084
19 2.523577
20 2.572270
Energy [Wh]
2022-05-05 60.597007
2022-05-06 59.725029
Trapezoidal integration should give a slightly better approximation but it's harder to implement right. You'd have to deal carefully with the hour boundaries. That's basically just a matter of inserting interpolated values twice at the full hour (at 09:59:59.999 and 10:00:00). But then you'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example go from 06:00:05 to 06:00:00. But careful, what to do if your measurements only start somewhere in the middle like 06:17:23?
Solution 3:[3]
This solution uses a package called staircase, which is part of the pandas ecosystem and exists to make working with step functions (i.e. piecewise constant) easier.
It will create a Stairs
object (which represents a step function) from a pandas.Series
, then bin across arbitrary DatetimeIndex
values, then integrate.
This solution requires staircase 2.4.2
or above
setup
df = pd.DataFrame(
{
"Timestamp":pd.to_datetime(
[
"2022-05-05 06:00:05+02:00",
"2022-05-05 06:00:15+02:00",
"2022-05-05 06:00:25+02:00",
"2022-05-05 06:00:35+02:00",
"2022-05-05 06:00:45+02:00",
]
),
"Power [W]":[2.0, 1.2, 0.3, 4.3, 1.1]
}
)
solution
import staircase as sc
# create step function
sf = sc.Stairs.from_values(
initial_value=0,
values=df.set_index("Timestamp")["Power [W]"],
)
# optional: plot
sf.plot(style="hlines")
# create the bins (datetime index) over which you want to integrate
# using 20s intervals in this example
bins = pd.date_range(
"2022-05-05 06:00:00+02:00", "2022-05-05 06:01:00+02:00", freq="20s"
)
# slice into bins and integrate
result = sf.slice(bins).integral()
result
will be a pandas.Series
with an IntervalIndex
and Timedelta
values. The IntervalIndex retains timezone info, it just doesn't display it:
[2022-05-05 06:00:00, 2022-05-05 06:00:20) 0 days 00:00:26
[2022-05-05 06:00:20, 2022-05-05 06:00:40) 0 days 00:00:30.500000
[2022-05-05 06:00:40, 2022-05-05 06:01:00) 0 days 00:00:38
dtype: timedelta64[ns]
You can change the index to be the "left" values (and see this timezone info) like this:
result.index = result.index.left
You can change values to a float with division by an appropriate Timedelta
. Eg to convert to minutes:
result/pd.Timedelta("1min")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | Riley |