AWS Athena table from Python output with dates - dates get wrongly converted

I have a pandas DataFrame containing a date column ("2022-02-02"). I write this table to parquet using pyarrow.

df[col] = df[col].astype(str)
df.to_parquet(loc)

Now I register this as a table in Athena

CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
  dt_utc date,
  something string,
  else int
)
STORED AS PARQUET
LOCATION 's3://bucket/loc/'
TBLPROPERTIES (
    'skip.header.line.count'='1'
)

But I can't get Athena to accept the date column. What I've tried:

  • np.datetime64[D] + Athena date results in "-4513749-10-01"
  • Python str + Athena date results in a "Binary Format" error
  • Python str + Athena string works, but that is not the goal


Solution 1:[1]

I think for this to work you need to save the date column, dt_utc, as date32 in the Parquet file.

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "dt_utc": ["2021-01-02", "2021-01-03"],
        "something": ["abc", "efg"],
        "else": [1, 2],
    }
)
df['dt_utc'] = pd.to_datetime(df['dt_utc'])

schema = pa.schema([
    pa.field("dt_utc", pa.date32()),
    pa.field("something", pa.string()),
    pa.field("else", pa.int32()),
])

df.to_parquet(loc, schema=schema)

EDIT: if you need to convert columns to dates programmatically:

date_columns = ["dt_utc"]

for date_column in date_columns:
    df[date_column] = pd.to_datetime(df[date_column])

schema = pa.Schema.from_pandas(df)

schema = pa.schema([
    pa.field(field.name, pa.date32()) if field.name in date_columns else field
    for field in schema
])

df.to_parquet("hello.parquet", schema=schema)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
