How to control whether pyarrow.dataset.write_dataset will overwrite previous data or append to it?
I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and has some data, the data is overwritten as opposed to a new file being created. Is there a way to "append" conveniently to an already existing dataset without having to read in all the data first? I do not need the data to be in one file; I just don't want to delete the old one.
What I currently do, and which doesn't work:
import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=True,
    coerce_timestamps=None,
    allow_truncated_timestamps=True)
ds.write_dataset(data=data, base_dir='my_path',
                 filesystem=hdfs_filesystem, format=parquet_format,
                 file_options=write_options)
Solution 1:[1]
Currently, the write_dataset function uses a fixed file name template (part-{i}.parquet, where i is a counter if you are writing multiple batches; in the case of writing a single Table, i will always be 0).
This means that when writing multiple times to the same directory, it might indeed overwrite pre-existing files if those are named part-0.parquet.
You can solve this by ensuring that write_dataset uses unique file names for each write through the basename_template argument, e.g.:
ds.write_dataset(data=data, base_dir='my_path',
                 basename_template='my-unique-name-{i}.parquet', ...)
If you want a unique name to be generated automatically on each write, you could, for example, generate a random string to include in the file name. One option for this is the Python uuid stdlib module: basename_template = "part-{i}-" + uuid.uuid4().hex + ".parquet".
Another option is to include the current time of writing in the file name to make it unique, e.g. with basename_template = "part-{:%Y%m%d}-{{i}}.parquet".format(datetime.datetime.now())
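Putting this together, a minimal sketch of an "append via unique file names" write might look like the following (data and hdfs_filesystem are placeholders carried over from the question; adjust to your setup):

import uuid
import pyarrow.dataset as ds

# A unique basename per call: {i} is still expanded by pyarrow for each
# written file, while the uuid suffix distinguishes separate
# write_dataset calls so earlier files are not overwritten.
basename_template = "part-{i}-" + uuid.uuid4().hex + ".parquet"

ds.write_dataset(
    data=data,                    # placeholder: a pyarrow Table or dataset
    base_dir='my_path',
    basename_template=basename_template,
    filesystem=hdfs_filesystem,   # placeholder: your HDFS filesystem object
    format='parquet')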
See https://issues.apache.org/jira/browse/ARROW-10695 for more discussion about customizing the template, and I opened a new issue specifically about silently overwriting data: https://issues.apache.org/jira/browse/ARROW-12358
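As a follow-up note beyond the original answer: more recent pyarrow releases expose an existing_data_behavior argument on write_dataset that controls this directly. A sketch, assuming a pyarrow version new enough to support it (6.0 or later):

import uuid
import pyarrow.dataset as ds

ds.write_dataset(
    data=data,                    # placeholder, as above
    base_dir='my_path',
    basename_template="part-{i}-" + uuid.uuid4().hex + ".parquet",
    filesystem=hdfs_filesystem,   # placeholder, as above
    format='parquet',
    # 'error' (default) raises if the directory has data;
    # 'overwrite_or_ignore' keeps existing files and only replaces
    # files with the same name, so unique basenames effectively append.
    existing_data_behavior='overwrite_or_ignore')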
Solution 2:[2]
For those who are here to work out how to use make_write_options() with write_dataset, try this:
import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=False,
    coerce_timestamps='us',
    allow_truncated_timestamps=True)
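The resulting options object is then passed to write_dataset via file_options, mirroring the call in the question; a minimal sketch (data, 'my_path', and hdfs_filesystem are placeholders from the question):

# file_options must be created by the same format object (here,
# ParquetFileFormat) that is passed as format=.
ds.write_dataset(
    data=data,
    base_dir='my_path',
    filesystem=hdfs_filesystem,
    format=parquet_format,
    file_options=write_options)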
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | Contango