How to get all Sundays from a date column in pandas, extract the corresponding values, save them as a new dataframe, and do subtraction

I have a dataframe with 3 columns:

import glob
import pandas as pd

file = glob.glob('InputFile.csv')

for i in file:
    df = pd.read_csv(i)
    # Parse the Date column so the .dt accessors can be used later
    df['Date'] = pd.to_datetime(df['Date'])
    print(df)


                    Date        X        Y
0    2020-02-13 00:11:59 -91.3900 -31.7914
1    2020-02-13 01:11:59 -87.1513 -34.6838
2    2020-02-13 02:11:59 -82.9126 -37.5762
3    2020-02-13 03:11:59 -79.3558 -40.2573
4    2020-02-13 04:11:59 -73.2293 -44.2463
...                  ...      ...      ...
2034 2020-05-04 18:00:00 -36.4645 -18.3421
2035 2020-05-04 19:00:00 -36.5767 -16.8311
2036 2020-05-04 20:00:00 -36.0170 -14.9356
2037 2020-05-04 21:00:00 -36.4354 -11.0533
2038 2020-05-04 22:00:00 -40.3424 -11.4000

[2039 rows x 3 columns]


print(converted_file.dtypes)

Date     datetime64[ns]
xTilt           float64
yTilt           float64
dtype: object

I would like the output to be:

                  Date         X        Y            X_Diff            Y_Diff
0  2020-02-16 00:11:59 -38.46270 -70.8352         -38.46270          -70.8352
1  2020-02-23 00:11:59 -80.70250  -7.1893         -42.23980           63.6459
2  2020-03-01 00:11:59 -47.38980 -39.2652          33.31270          -32.0759
3  2020-03-08 00:00:00 -35.65350 -64.5058          11.73630          -25.2406
4  2020-03-15 00:00:00 -43.03290 -15.8425          -7.37940           48.6633
5  2020-03-22 00:00:00 -19.77130 -25.5298          23.26160           -9.6873
6  2020-03-29 00:00:00 -13.18940  12.4093           6.58190           37.9391
7  2020-04-05 00:00:00  -8.49098  27.8407           4.69842           15.4314
8  2020-04-12 00:00:00 -19.05360  20.0445         -10.56262           -7.7962
9  2020-04-26 00:00:00 -25.61330  31.6306          -6.55970           11.5861
10 2020-05-03 00:00:00 -46.09250 -30.3557         -20.47920          -61.9863

I would like to find all dates in InputFile.csv that fall on a Sunday and, for each Sunday, extract only the first occurrence (the first entry on that day, not the later times) along with the corresponding X and Y values. These rows should be saved to a new dataframe on which I can do subtraction on X and Y. For the first row, X and Y are copied to X_Diff and Y_Diff, respectively. For every following row, X_Diff is the current X minus the previous X, and likewise for Y, until the end of the file.



Solution 1:[1]

Here is my solution.

1. Preparation: I will need to generate some random data to be worked on.

import pandas as pd
import numpy as np

df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100

The data is like this:

Date    X   Y
0   2020-02-13 00:00:00 -12.044751  165.962038
1   2020-02-13 01:00:00 63.537406   65.137176
2   2020-02-13 02:00:00 67.555256   114.186898
... ... ... ..

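2. Filter the dataframe to keep Sundays only. In pandas, dayofweek numbers Monday as 0 and Sunday as 6. Then, add another column holding only the date, for grouping purposes.

df = df[df.Date.dt.dayofweek == 6].copy()
df['date_only'] = df.Date.dt.date

Then, it looks like this. Note that the sample outputs in this answer were generated with dayofweek == 0, so the dates they show (2020-02-17, 2020-02-24, ...) are Mondays; with dayofweek == 6 the selected dates are the preceding Sundays (2020-02-16, 2020-02-23, ...).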

Date    X   Y   date_only
96  2020-02-17 00:00:00 26.632391   120.311315  2020-02-17
97  2020-02-17 01:00:00 -14.111209  21.543440   2020-02-17
98  2020-02-17 02:00:00 -11.941086  -51.303122  2020-02-17
99  2020-02-17 03:00:00 -48.612563  137.023917  2020-02-17
100 2020-02-17 04:00:00 133.843010  -47.168805  2020-02-17
... ... ... ... ...
1796    2020-04-27 20:00:00 -158.310600 30.149292   2020-04-27
1797    2020-04-27 21:00:00 170.212825  181.626611  2020-04-27
1798    2020-04-27 22:00:00 59.773796   11.262186   2020-04-27
1799    2020-04-27 23:00:00 -99.757428  83.529157   2020-04-27
1944    2020-05-04 00:00:00 -168.435315 245.884281  2020-05-04
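
As a quick check of the weekday numbering used in step 2, the sketch below lists one week of dates with their dayofweek value and name. The dates and the variable names (days, weekday_table, sundays) are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample: seven consecutive days starting on Monday 2020-02-10.
days = pd.Series(pd.date_range("2020-02-10", periods=7, freq="D"))

# dayofweek counts from Monday (0) to Sunday (6).
weekday_table = pd.DataFrame({
    "date": days,
    "dayofweek": days.dt.dayofweek,
    "name": days.dt.day_name(),
})
print(weekday_table)

# Filtering on 6 keeps only the Sunday row.
sundays = weekday_table[weekday_table["dayofweek"] == 6]
print(sundays)
```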

3. Next, sort the dataframe by "Date", group it by "date_only", and take the first row of each group.

df = df.sort_values(by=['Date'])
df = df.groupby('date_only').head(1).reset_index(drop=True).drop(columns=['date_only'])

Results:

Date    X   Y
0   2020-02-17  4.196690    -205.843619
1   2020-02-24  -189.811351 -5.294274
2   2020-03-02  -231.596763 -46.989246
3   2020-03-09  76.561269   -40.188202
4   2020-03-16  -18.653363  52.376442
5   2020-03-23  106.758484  22.969963
6   2020-03-30  -133.601545 185.561830
7   2020-04-06  -57.748555  -187.878427
8   2020-04-13  57.648834   10.365917
9   2020-04-20  -47.959093  177.455676
10  2020-04-27  -30.527067  -37.046330
11  2020-05-04  -52.854252  -136.069205
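
If the helper column is not wanted, one possible alternative for step 3 is to mark repeated calendar days with duplicated() and keep only the first row of each. This is a sketch on a small made-up dataframe (the values and the name first_per_day are assumptions), not the answer's own method.

```python
import pandas as pd

# Hypothetical sample: unsorted readings on two Sundays and one Monday.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-02-16 05:11:59", "2020-02-16 00:11:59",
        "2020-02-17 00:11:59",
        "2020-02-23 00:11:59", "2020-02-23 01:11:59",
    ]),
    "X": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Y": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Keep Sundays (dayofweek == 6) and order them chronologically.
sundays = df[df["Date"].dt.dayofweek == 6].sort_values("Date")

# normalize() sets every timestamp to midnight, so duplicated() flags
# every row after the first one on the same calendar day.
is_repeat = sundays["Date"].dt.normalize().duplicated()
first_per_day = sundays[~is_repeat].reset_index(drop=True)
print(first_per_day)
```

Unlike the grouped version, this keeps the original timestamp of the first reading (for example 00:11:59) in the "Date" column.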

4. Finally, compute the difference between each X/Y value and the previous one.

df['X_Diff'] = df.X.diff()
df['Y_Diff'] = df.Y.diff()

Results:

Date    X   Y   X_Diff  Y_Diff
0   2020-02-17  4.196690    -205.843619 NaN NaN
1   2020-02-24  -189.811351 -5.294274   -194.008042 200.549345
2   2020-03-02  -231.596763 -46.989246  -41.785412  -41.694972
3   2020-03-09  76.561269   -40.188202  308.158031  6.801044
4   2020-03-16  -18.653363  52.376442   -95.214632  92.564644
5   2020-03-23  106.758484  22.969963   125.411847  -29.406479
6   2020-03-30  -133.601545 185.561830  -240.360029 162.591867
7   2020-04-06  -57.748555  -187.878427 75.852990   -373.440257
8   2020-04-13  57.648834   10.365917   115.397389  198.244344
9   2020-04-20  -47.959093  177.455676  -105.607927 167.089758
10  2020-04-27  -30.527067  -37.046330  17.432026   -214.502006
11  2020-05-04  -52.854252  -136.069205 -22.327185  -99.022874

5. If you are not happy with the "NaN" for the first row, then just fill it with the X/Y columns' original values.

df['X_Diff'] = df['X_Diff'].fillna(df.X)
df['Y_Diff'] = df['Y_Diff'].fillna(df.Y)

Final results:

Date    X   Y   X_Diff  Y_Diff
0   2020-02-17  4.196690    -205.843619 4.196690    -205.843619
1   2020-02-24  -189.811351 -5.294274   -194.008042 200.549345
2   2020-03-02  -231.596763 -46.989246  -41.785412  -41.694972
3   2020-03-09  76.561269   -40.188202  308.158031  6.801044
4   2020-03-16  -18.653363  52.376442   -95.214632  92.564644
5   2020-03-23  106.758484  22.969963   125.411847  -29.406479
6   2020-03-30  -133.601545 185.561830  -240.360029 162.591867
7   2020-04-06  -57.748555  -187.878427 75.852990   -373.440257
8   2020-04-13  57.648834   10.365917   115.397389  198.244344
9   2020-04-20  -47.959093  177.455676  -105.607927 167.089758
10  2020-04-27  -30.527067  -37.046330  17.432026   -214.502006
11  2020-05-04  -52.854252  -136.069205 -22.327185  -99.022874
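
Steps 4 and 5 can also be combined into one expression per column. The sketch below uses three made-up X values (the dataframe name weekly is an assumption) to show how fillna with a Series only touches the first row.

```python
import pandas as pd

# Hypothetical values standing in for the weekly X readings.
weekly = pd.DataFrame({"X": [-38.5, -80.5, -47.0]})

# diff() leaves NaN in the first row; fillna(weekly["X"]) aligns on the index,
# so only that first row is replaced by its own X value.
weekly["X_Diff"] = weekly["X"].diff().fillna(weekly["X"])
print(weekly)
```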

Note: No time is displayed in the "Date" field in the final result because the data I generated is hourly, so the first row of each Sunday is XXXX-XX-XX 00:00:00. pandas does not display the time 00:00:00 when every value in the column falls at midnight, although the time is still stored.
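
To confirm that the time component is still present, the timestamps can be formatted explicitly. A minimal sketch with two made-up midnight timestamps (the name midnights is an assumption):

```python
import pandas as pd

# Hypothetical timestamps that all fall exactly at midnight.
midnights = pd.Series(pd.to_datetime(["2020-02-16 00:00:00", "2020-02-23 00:00:00"]))

# The default display may omit the time when every value is at midnight.
print(midnights)

# Explicit formatting shows that the time is still stored.
formatted = midnights.dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
```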

Here is the Colab Link. You can have all my code in a notebook here. https://colab.research.google.com/drive/1ecSSvJW0waCU19KPoj5uiiYmHp9SSQOf?usp=sharing

Solution 2:[2]

I will create a dataframe as Christopher did:

import pandas as pd
import numpy as np

df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100

First, set the datetime column as the index:

df = df.set_index('Date')

Second, keep only the rows that fall on Sundays:

sunday_df= df[df.index.dayofweek == 6]

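Third, resample to daily frequency, take the first value of each day (the question asks for the first entry of each Sunday), and drop the all-NaN rows that resampling creates for the days between Sundays:

sunday_df = sunday_df.resample('D').first().dropna()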

Lastly, do the subtraction:

sunday_df['X_Diff'] = sunday_df.X.diff()
sunday_df['Y_Diff'] = sunday_df.Y.diff()
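
Putting the steps above together, here is a rough end-to-end sketch. The function name weekly_sunday_diff and the sample data are hypothetical; the sample uses timestamps offset by 11:59 to resemble the data in the question, and the first row's difference is filled with the original value as the question requests.

```python
import pandas as pd

def weekly_sunday_diff(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: first reading of each Sunday plus week-over-week differences."""
    indexed = df.set_index("Date").sort_index()
    sundays = indexed[indexed.index.dayofweek == 6]  # Monday is 0, Sunday is 6
    # resample("D") creates a bin for every calendar day between the first and
    # last Sunday; bins without rows come back as all-NaN and are dropped.
    out = sundays.resample("D").first().dropna(how="all")
    # First row keeps its own value; later rows hold current minus previous.
    out["X_Diff"] = out["X"].diff().fillna(out["X"])
    out["Y_Diff"] = out["Y"].diff().fillna(out["Y"])
    return out

# Hypothetical sample: 14 days of hourly readings starting Thursday 2020-02-13 00:11:59.
dates = pd.date_range("2020-02-13 00:11:59", periods=24 * 14, freq=pd.Timedelta(hours=1))
sample = pd.DataFrame({
    "Date": dates,
    "X": range(len(dates)),
    "Y": range(0, -len(dates), -1),
})

result = weekly_sunday_diff(sample)
print(result)
```

Two caveats with this approach: resample labels each row with midnight of that day rather than the original timestamp, and first() returns the first non-null value per column, which matches the first row only when that row has no missing values.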

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Christopher
Solution 2 Sina Birecik