'Python Polars: Read Column as Datetime

How does one read a csv into a polar DataFrame and parse one of the columns as a datetime?

Alternatively, how does one convert a column to a pl.datetime?



Solution 1:[1]

I would first try parse_dates=True in the read_csv call.

For example, let's say we have the following data:

import polars as pl
from io import StringIO
my_csv = StringIO(
"""
ID,start,last_updt,end
1,2008-10-31, 2020-11-28 12:48:53,12/31/2008
2,2007-10-31, 2021-11-29 01:37:20,12/31/2007
3,2006-10-31, 2021-11-30 23:22:05,12/31/2006
"""
)

pl.read_csv(my_csv, parse_dates=True)
shape: (3, 4)
???????????????????????????????????????????????????????
? ID  ? start      ? last_updt           ? end        ?
? --- ? ---        ? ---                 ? ---        ?
? i64 ? date       ? datetime[?s]        ? str        ?
???????????????????????????????????????????????????????
? 1   ? 2008-10-31 ? 2020-11-28 12:48:53 ? 12/31/2008 ?
???????????????????????????????????????????????????????
? 2   ? 2007-10-31 ? 2021-11-29 01:37:20 ? 12/31/2007 ?
???????????????????????????????????????????????????????
? 3   ? 2006-10-31 ? 2021-11-30 23:22:05 ? 12/31/2006 ?
???????????????????????????????????????????????????????

The start column parsed as a Date, and the last_updt column parsed as a Datetime. But notice that the end column did not parse as a date because it is not in ISO 8601 format. (I've come across plenty of csv files where Date/Datetime fields were non-standard.)

To parse this column, we can use the strptime function and supply the appropriate format.

pl.read_csv(my_csv, parse_dates=True).with_column(pl.col('end').str.strptime(pl.Date, '%m/%d/%Y'))
shape: (3, 4)
???????????????????????????????????????????????????????
? ID  ? start      ? last_updt           ? end        ?
? --- ? ---        ? ---                 ? ---        ?
? i64 ? date       ? datetime[?s]        ? date       ?
???????????????????????????????????????????????????????
? 1   ? 2008-10-31 ? 2020-11-28 12:48:53 ? 2008-12-31 ?
???????????????????????????????????????????????????????
? 2   ? 2007-10-31 ? 2021-11-29 01:37:20 ? 2007-12-31 ?
???????????????????????????????????????????????????????
? 3   ? 2006-10-31 ? 2021-11-30 23:22:05 ? 2006-12-31 ?
???????????????????????????????????????????????????????

strptime can also be used with Datetime columns.

Solution 2:[2]

Polars supports two csv readers, one built-in and one based on pyarrow. The pyarrow reader supports parsing dates directly; see also https://github.com/pola-rs/polars/issues/1330. You can set use_pyarrow=True in read_csv, but as per the documentation, it will only be used given also the other parameter inputs into read_csv.

Alternatively, read as Utf8 (string), and parse to string with strptime: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.internals.series.StringNameSpace.strptime.html?highlight=strptime#polars.internals.series.StringNameSpace.strptime. This is the method I find easier typically, but may, depending on the size of your data, be relatively expensive as you first need to store as Utf8 and then parse.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 jvz