'How to get all Sundays on dates in pandas and extract the corresponding values with it then save as new dataframe and do subtraction
I have a dataframe with 3 columns:
file = glob.glob('InputFile.csv')
for i in file:
df = pd.read_csv(i)
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Date X Y
0 2020-02-13 00:11:59 -91.3900 -31.7914
1 2020-02-13 01:11:59 -87.1513 -34.6838
2 2020-02-13 02:11:59 -82.9126 -37.5762
3 2020-02-13 03:11:59 -79.3558 -40.2573
4 2020-02-13 04:11:59 -73.2293 -44.2463
... ... ... ...
2034 2020-05-04 18:00:00 -36.4645 -18.3421
2035 2020-05-04 19:00:00 -36.5767 -16.8311
2036 2020-05-04 20:00:00 -36.0170 -14.9356
2037 2020-05-04 21:00:00 -36.4354 -11.0533
2038 2020-05-04 22:00:00 -40.3424 -11.4000
[2039 rows x 3 columns]
print(converted_file.dtypes)
Date datetime64[ns]
xTilt float64
yTilt float64
dtype: object
I would like the output to be:
Date X Y X_Diff Y_Diff
0 2020-02-16 00:11:59 -38.46270 -70.8352 -38.46270 -70.8352
1 2020-02-23 00:11:59 -80.70250 -7.1893 -42.23980 63.6459
2 2020-03-01 00:11:59 -47.38980 -39.2652 33.31270 -32.0759
3 2020-03-08 00:00:00 -35.65350 -64.5058 11.73630 -25.2406
4 2020-03-15 00:00:00 -43.03290 -15.8425 -7.37940 48.6633
5 2020-03-22 00:00:00 -19.77130 -25.5298 23.26160 -9.6873
6 2020-03-29 00:00:00 -13.18940 12.4093 6.58190 37.9391
7 2020-04-05 00:00:00 -8.49098 27.8407 4.69842 15.4314
8 2020-04-12 00:00:00 -19.05360 20.0445 -10.56262 -7.7962
9 2020-04-26 00:00:00 -25.61330 31.6306 -6.55970 11.5861
10 2020-05-03 00:00:00 -46.09250 -30.3557 -20.47920 -61.9863
In such a way that I would like to search from the InputFile.csv file all dates that are in Sundays and extract every first occurence of every Sunday (that is the first entry on that day and not the other times) along with the X and Y values that corresponds to that selected day. Then save it to a new dataframe where I could do subtraction in the X and Y. Copying the very first X and Y to be copied on columns X_Diff and Y_Diff, respectively. Then for the next entries of the output file, loop in all rows to get the difference of the next X minus the previous X then result will be appended in the X_Diff. Same goes with Y until the end of the file.
Solution 1:[1]
Here is my solution.
1. Preparation: I will need to generate some random data to be worked on.
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
The data is like this:
Date X Y
0 2020-02-13 00:00:00 -12.044751 165.962038
1 2020-02-13 01:00:00 63.537406 65.137176
2 2020-02-13 02:00:00 67.555256 114.186898
... ... ... ..
2. Filter the dataframe to get Sunday only. Then, generate another column with date only for grouping purpose.
df = df[df.Date.dt.dayofweek == 0]
df['date_only'] = df.Date.dt.date
Then, it looks like this.
Date X Y date_only
96 2020-02-17 00:00:00 26.632391 120.311315 2020-02-17
97 2020-02-17 01:00:00 -14.111209 21.543440 2020-02-17
98 2020-02-17 02:00:00 -11.941086 -51.303122 2020-02-17
99 2020-02-17 03:00:00 -48.612563 137.023917 2020-02-17
100 2020-02-17 04:00:00 133.843010 -47.168805 2020-02-17
... ... ... ... ...
1796 2020-04-27 20:00:00 -158.310600 30.149292 2020-04-27
1797 2020-04-27 21:00:00 170.212825 181.626611 2020-04-27
1798 2020-04-27 22:00:00 59.773796 11.262186 2020-04-27
1799 2020-04-27 23:00:00 -99.757428 83.529157 2020-04-27
1944 2020-05-04 00:00:00 -168.435315 245.884281 2020-05-04
3. Next step, sort the data frame by "Date". Then, group the dataframe by "date_only". After that, take the first row of each group.
df = df.sort_values(by=['Date'])
df = df.groupby('date_only').apply(lambda g: g.head(1)).reset_index(drop=True).drop(columns=['date_only'])
Results:
Date X Y
0 2020-02-17 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274
2 2020-03-02 -231.596763 -46.989246
3 2020-03-09 76.561269 -40.188202
4 2020-03-16 -18.653363 52.376442
5 2020-03-23 106.758484 22.969963
6 2020-03-30 -133.601545 185.561830
7 2020-04-06 -57.748555 -187.878427
8 2020-04-13 57.648834 10.365917
9 2020-04-20 -47.959093 177.455676
10 2020-04-27 -30.527067 -37.046330
11 2020-05-04 -52.854252 -136.069205
4. Last step, get the difference for each X/Y value with their previous value.
df['X_Diff'] = df.X.diff()
df['Y_Diff'] = df.Y.diff()
Results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 NaN NaN
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
5. If you are not happy with the "NaN" for the first row, then just fill it with the X/Y columns' original values.
df['X_Diff'] = df['X_Diff'].fillna(df.X)
df['Y_Diff'] = df['Y_Diff'].fillna(df.Y)
Final results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
Note: There is no time displayed in the "Date" field in the final result. This is because the data I generated for those dates are hourly. So, the first row of each Sunday is XXXX-XX-XX 00:00:00
, and the time 00:00:00
will not be displayed in pandas, although they actually exist.
Here is the Colab Link. You can have all my code in a notebook here. https://colab.research.google.com/drive/1ecSSvJW0waCU19KPoj5uiiYmHp9SSQOf?usp=sharing
Solution 2:[2]
I will create a dataframe as Christopher did:
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
At First, set the datetime column as index
df = df.set_index('Date')
Secondly, get the rows only for sundays:
sunday_df= df[df.index.dayofweek == 6]
Third, resample the values to day format, take the last value of the day and remove rows with empty hours
sunday_df = sunday_df.resample('D').last().dropna()
Lastly, do the subtraction:
sunday_df['X_Diff'] = sunday_df.X.diff()
sunday_df['Y_Diff'] = sunday_df.Y.diff()
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Christopher |
Solution 2 | Sina Birecik |