pyspark: hours diff between two dates columns
I would like to calculate the number of hours between two date columns in PySpark. I could only find how to calculate the number of days between the dates.
dfs_4.show()
+--------------------+--------------------+
| request_time| max_time|
+--------------------+--------------------+
|2017-11-17 00:18:...|2017-11-20 23:59:...|
|2017-11-17 00:07:...|2017-11-20 23:59:...|
|2017-11-17 00:35:...|2017-11-20 23:59:...|
|2017-11-17 00:10:...|2017-11-20 23:59:...|
|2017-11-17 00:03:...|2017-11-20 23:59:...|
|2017-11-17 00:45:...|2017-11-20 23:59:...|
|2017-11-17 00:35:...|2017-11-20 23:59:...|
|2017-11-17 00:59:...|2017-11-20 23:59:...|
|2017-11-17 00:28:...|2017-11-20 23:59:...|
|2017-11-17 00:11:...|2017-11-20 23:59:...|
|2017-11-17 00:13:...|2017-11-20 23:59:...|
|2017-11-17 00:42:...|2017-11-20 23:59:...|
|2017-11-17 00:07:...|2017-11-20 23:59:...|
|2017-11-17 00:40:...|2017-11-20 23:59:...|
|2017-11-17 00:15:...|2017-11-20 23:59:...|
|2017-11-17 00:05:...|2017-11-20 23:59:...|
|2017-11-17 00:50:...|2017-11-20 23:59:...|
|2017-11-17 00:40:...|2017-11-20 23:59:...|
|2017-11-17 00:25:...|2017-11-20 23:59:...|
|2017-11-17 00:35:...|2017-11-20 23:59:...|
+--------------------+--------------------+
Calculation of the number of days:
from pyspark.sql import functions as F
dfs_5 = dfs_4.withColumn('date_diff', F.datediff(F.to_date(dfs_4.max_time), F.to_date(dfs_4.request_time)))
dfs_5.show()
+--------------------+--------------------+---------+
| request_time| max_time|date_diff|
+--------------------+--------------------+---------+
|2017-11-17 00:18:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:07:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:35:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:10:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:03:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:45:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:35:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:59:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:28:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:11:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:13:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:42:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:07:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:40:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:15:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:05:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:50:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:40:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:25:...|2017-11-20 23:59:...| 3|
|2017-11-17 00:35:...|2017-11-20 23:59:...| 3|
+--------------------+--------------------+---------+
How can I do the same for hours? Thanks for any help.
Solution 1:[1]
You could use hour to extract the hour from your datetime field and simply subtract the two values into a new column. There is also the case where the time difference spans more than a day, so you need to add the whole days in between. So I would create the date_diff column as you did, and then try this:
from pyspark.sql import functions as F

# dfs_5 already has the date_diff column created in the question
dfs_6 = dfs_5.withColumn('hours_diff', (dfs_5.date_diff * 24) +
                         F.hour(dfs_5.max_time) - F.hour(dfs_5.request_time))
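A minimal end-to-end sketch of this approach, assuming dfs_4 holds timestamp columns named as in the question; the SparkSession bootstrap and the single sample row are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for dfs_4: one made-up row with the question's column names
dfs_4 = spark.createDataFrame(
    [("2017-11-17 00:18:00", "2017-11-20 23:59:00")],
    ["request_time", "max_time"],
).select(
    F.to_timestamp("request_time").alias("request_time"),
    F.to_timestamp("max_time").alias("max_time"),
)

# Step 1: whole days between the dates (as in the question)
dfs_5 = dfs_4.withColumn(
    "date_diff",
    F.datediff(F.to_date("max_time"), F.to_date("request_time")),
)

# Step 2: days * 24 plus the hour-of-day difference
dfs_6 = dfs_5.withColumn(
    "hours_diff",
    dfs_5.date_diff * 24 + F.hour(dfs_5.max_time) - F.hour(dfs_5.request_time),
)

dfs_6.show()  # hours_diff = 3 * 24 + 23 - 0 = 95

Note that this counts hour boundaries only; minutes and seconds within the hour are ignored.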
Solution 2:[2]
One can use unix_timestamp to calculate the difference in seconds, and then convert it to the desired unit.
# Difference in seconds between the two timestamps
dfs_5 = dfs_4.withColumn(
    'diff_in_seconds',
    F.unix_timestamp(dfs_4.max_time) - F.unix_timestamp(dfs_4.request_time)
)
# Difference in minutes: divide the seconds by 60
dfs_6 = dfs_4.withColumn(
    'diff_in_minutes',
    F.round(
        (F.unix_timestamp(dfs_4.max_time) - F.unix_timestamp(dfs_4.request_time)) / 60
    )
)
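Since the question asks for hours, the same pattern applies with a divisor of 3600; the name dfs_7 is only illustrative, not from the original answer:

# Difference in hours: divide the seconds by 3600 (dfs_7 is an illustrative name)
dfs_7 = dfs_4.withColumn(
    'diff_in_hours',
    F.round(
        (F.unix_timestamp(dfs_4.max_time) - F.unix_timestamp(dfs_4.request_time)) / 3600
    )
)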
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Jonas Reklaitis |