'"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema
I'm working on a script where I'm sending a dataframe to BigQuery:
load_job = bq_client.load_table_from_dataframe(
df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
)
# Wait for the load job to complete
return load_job.result()
This is working fine but only if a schema has already been defined in BigQuery or if I'm defining the schema of my job in my script. If no schema has been defined I have the following error:
Traceback (most recent call last): File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1661, in load_table_from_dataframe dataframe.to_parquet(tmppath, compression=parquet_compression) File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet **kwargs File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet **kwargs File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write **kwargs File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 1270, in write_table writer.write_table(table, row_group_size=row_group_size) File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 426, in write_table self.writer.write_table(table, row_group_size=row_group_size) File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1578661876547574000 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function _function_handler.invoke_user_function(event_object) File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function return call_user_function(request_or_event) File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function event_context.Context(**request_or_event.context)) File "/user_code/main.py", line 151, in main df = df(param1, param2) File "/user_code/main.py", line 141, in get_df df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1677, in load_table_from_dataframe os.remove(tmppath) FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp_ps5xji9_job_634ff274.parquet'
Why is pyarrow
generating this error? How can I solve it other than pre-defining schema?
Solution 1:[1]
The default behavior when converting to Arrow or Parquet from pandas is to not allow silent data loss. There are options to set when doing the conversion to allow permit unsafe casts causing loss of timestamp precision or other forms of data loss. The BigQuery Python API would need to set these options, so it may be a bug in the BigQuery library. I suggest reporting on their issue tracker https://github.com/googleapis/google-cloud-python
Solution 2:[2]
I was getting this error: ArrowInvalid: Casting from timestamp[ns] to timestamp[us, tz=UTC] would lose data: 1602633600999999998
when i inspect the dataframe i saw a value like this: 2021-09-30 23:59:59.999999998
You probably have a mismatch in your date fields with the bigquery default. then i used this code:
df['date_column'] =df['date_column'].astype('datetime64[s]')
then the problem was solved for me.
Solution 3:[3]
I think these errors arise because the pyarrow.parquet module used by the BigQuery library does convert Python's built in datetime or time types into something that BigQuery recognises by default, but the BigQuery library does have its own method for converting pandas types.
I was able to get it to upload timestamps by changing all instances of datetime.datetime or time.time to pandas.Timestamp. For example:
my_df['timestamp'] = datetime.utcnow()
would need to be changed to
my_df['timestamp'] = pd.Timestamp.now()
Solution 4:[4]
The solution for me was to add the following kwargs to to_parquet:
parquet_args = {
'coerce_timestamps': 'us',
'allow_truncated_timestamps': True,
}
You have to set both of them. If you set just allow_truncated_timestamps
, it will still raise the error if coerce_timestamps
is None
. I think the idea is that you only want to suppress the error if you're explicitly asking for coercion. Anyway, the docs are clear about it, but this behavior wasn't obvious to me.
Solution 5:[5]
In my testing of https://github.com/googleapis/python-bigquery-pandas/pull/413, this issue is fixed by upgrading to pandas 1.1.0+.
Looking at the pandas 1.1.0 changelog, there have been several bug fixes relating to timestamp data. I'm not sure which one in particular would have helped here, but potentially the fix for mixing and matching different timezones. https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime
Solution 6:[6]
If using write_dataset
, example code to use file_options
to eliminate this error:
import pyarrow.dataset as ds
parquet_format = ds.ParquetFileFormat()
file_options = parquet_format.make_write_options(coerce_timestamps='us', allow_truncated_timestamps=True)
ds.write_dataset(..., file_options=file_options)
Added as anyone who queries the PyArrow error in the title will end up here.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Wes McKinney |
Solution 2 | |
Solution 3 | Matt Simmons |
Solution 4 | BHP |
Solution 5 | Tim Swast |
Solution 6 | Contango |