Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?

df
    +-------------------------+------------+
    |  hour                   | time_zone  |
    +-------------------------+------------+
    |2019-10-16T20:00:00+0000 | US/Eastern |
    |2019-10-15T23:00:00+0000 | US/Central |
    +-------------------------+------------+

What I want:
    +-------------------------+------------+---------------------+
    |  hour                   | time_zone  | local_time          |
    +-------------------------+------------+---------------------+
    |2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
    |2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
    +-------------------------+------------+---------------------+


Solution 1:[1]

You can use the built-in from_utc_timestamp function. Note that the hour column needs to be passed to the function as a string without the timezone offset.

The code below works for Spark versions 2.4 and later.

from pyspark.sql.functions import from_utc_timestamp, split

# Drop the '+0000' offset, then interpret the remainder as UTC and shift it to each row's time zone.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
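If you want the converted value as a new column formatted like the expected output above, here is a sketch (assuming Spark 2.4+, where the time zone can be passed as a column, and using date_format to render the result without an offset):

from pyspark.sql.functions import from_utc_timestamp, split, date_format

# Strip the '+0000' offset, shift to each row's time zone, then format as a plain string.
df = df.withColumn(
    'local_time',
    date_format(
        from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone),
        "yyyy-MM-dd'T'HH:mm:ss"
    )
)
df.show(truncate=False)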

For Spark versions before 2.4, you have to pass a constant string representing the time zone as the second argument to the function, as shown below.
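For example, a minimal sketch assuming every row should be converted to the same zone (US/Eastern here, purely for illustration):

from pyspark.sql.functions import from_utc_timestamp, split

# Spark < 2.4: the time zone must be a constant string, not a column.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], 'US/Eastern').alias('local_time')).show()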

Documentation

pyspark.sql.functions.from_utc_timestamp(timestamp, tz)

This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.

However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC time zone to the given time zone.

This function may return a confusing result if the input is a string with a timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session local timezone.

Parameters:

timestamp – the column that contains timestamps

tz – a string that has the ID of the time zone, e.g. “GMT”, “America/Los_Angeles”, etc.

Changed in version 2.4: tz can take a Column containing timezone ID strings.
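To illustrate the documented caveat about strings that already carry an offset, a minimal sketch using the question's columns (the exact displayed value of the first variant depends on the session time zone, which is why Solution 1 strips the '+0000' suffix before calling the function):

from pyspark.sql.functions import from_utc_timestamp, split

# Raw string with '+0000': Spark first casts it using the embedded offset, so the
# result can also depend on the session local time zone.
df.select(from_utc_timestamp(df.hour, df.time_zone).alias('ambiguous')).show()

# Offset stripped first (as in Solution 1): the string is interpreted purely as UTC.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()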

Solution 2:[2]

You should also be able to use a Spark UDF.

from datetime import datetime
from pytz import timezone
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mytime(x, y):
    # Parse the UTC string (including the '+0000' offset), convert it to the row's
    # time zone, and return it formatted without an offset.
    dt = datetime.strptime(x, "%Y-%m-%dT%H:%M:%S%z")
    return dt.astimezone(timezone(y)).strftime("%Y-%m-%dT%H:%M:%S")

mytimeUDF = udf(mytime, StringType())
df = df.withColumn('local_time', mytimeUDF("hour", "time_zone"))
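Note that a Python UDF like this ships each row to a Python worker, so for large DataFrames the built-in from_utc_timestamp approach in Solution 1 will generally be faster.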

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Vincent Doba
Solution 2: yeamusic21