AWS Athena table from Python output with dates - dates get wrongly converted
I have a pandas DataFrame containing a date column ("2022-02-02"). I write this table to parquet using pyarrow.
df[col] = df[col].astype(str)
df.to_parquet(loc)
Now I register this as a table in Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
  dt_utc date,
  something string,
  `else` int
)
STORED AS PARQUET
LOCATION 's3://bucket/loc/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)
But I cannot get Athena to accept the date column:
- np.datetime64[D] + athena date results in "-4513749-10-01"
- python str + athena date results in a "Binary Format" error
- python str + athena string works but is not the goal
Solution 1:
I think for it to work you need to save the date column, dt_utc, as a date32 in Parquet.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "dt_utc": ["2021-01-02", "2021-01-03"],
        "something": ["abc", "efg"],
        "else": [1, 2],
    }
)
df["dt_utc"] = pd.to_datetime(df["dt_utc"])

schema = pa.schema([
    pa.field("dt_utc", pa.date32()),
    pa.field("something", pa.string()),
    pa.field("else", pa.int32()),
])
df.to_parquet(loc, schema=schema)
EDIT: if you need to change columns to date programmatically:
date_columns = ["dt_utc"]
for date_column in date_columns:
    df[date_column] = pd.to_datetime(df[date_column])

schema = pa.Schema.from_pandas(df)
schema = pa.schema([
    pa.field(field.name, pa.date32()) if field.name in date_columns else field
    for field in schema
])
df.to_parquet("hello.parquet", schema=schema)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow