BigQuery data transfers from S3 and efficiency of filtering data by last-modified datetime

I'm planning a pipeline from an S3 bucket to BigQuery. I have set up daily BigQuery data transfers from S3 as described here. So, I have a transfer configuration that looks daily for files with the prefix s3://mybucket/my/path/ and loads them into a BigQuery table.
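For reference, a transfer configuration like this boils down to a small set of parameters for the `amazon_s3` data source. This is a minimal sketch of that parameter set as a plain dict (the key names follow my reading of the Data Transfer Service docs and may differ; the bucket, table name, and credentials are placeholders):

```python
# Sketch of the params dict an S3 -> BigQuery transfer configuration takes.
# Key names assume the "amazon_s3" data source; values are placeholders.

def build_s3_transfer_params(data_path: str, destination_table: str,
                             access_key_id: str, secret_access_key: str,
                             file_format: str = "CSV") -> dict:
    """Assemble the params passed when creating the transfer config."""
    return {
        "data_path": data_path,                       # s3://bucket/prefix, wildcards allowed
        "destination_table_name_template": destination_table,
        "access_key_id": access_key_id,
        "secret_access_key": secret_access_key,
        "file_format": file_format,
    }

params = build_s3_transfer_params(
    "s3://mybucket/my/path/*",
    "my_table",
    "AKIA_PLACEHOLDER",        # placeholder AWS credentials
    "SECRET_PLACEHOLDER",
)
print(params["data_path"])
```

The same fields can be supplied through the Cloud Console or the `bq` CLI; the dict above only illustrates what a daily run is parameterized by.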

Now, some of these files are updated regularly. Fortunately, the transfer service is smart enough to fetch only new or updated files.

My question: Since S3 does not offer an efficient (i.e. server-side) way to list files by modified datetime, I was wondering how that works. Does Google keep track of every PUT event and keep the metadata somewhere, including modification time, so they know which files to transfer at the next transfer run?
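One relevant detail: S3's ListObjectsV2 API returns each object's LastModified timestamp as part of the listing itself, so no separate per-object lookup is needed; a service can list the prefix (which still enumerates every key) and compare timestamps client-side. A minimal sketch of that filtering step, using plain dicts shaped like the `Contents` entries a real boto3 listing returns:

```python
from datetime import datetime, timezone

def modified_since(objects, last_run):
    """Keep only objects whose LastModified is newer than the previous run.

    Each entry mimics an item from ListObjectsV2's 'Contents': the listing
    already carries LastModified, so the comparison costs nothing extra --
    but the listing itself still walks every key under the prefix.
    """
    return [o for o in objects if o["LastModified"] > last_run]

listing = [
    {"Key": "my/path/a.csv", "LastModified": datetime(2023, 5, 1, tzinfo=timezone.utc)},
    {"Key": "my/path/b.csv", "LastModified": datetime(2023, 5, 3, tzinfo=timezone.utc)},
]
last_run = datetime(2023, 5, 2, tzinfo=timezone.utc)
print([o["Key"] for o in modified_since(listing, last_run)])  # ['my/path/b.csv']
```

Whether the Data Transfer Service works exactly this way is not documented in detail, but this shows why modification-time filtering is feasible without server-side support.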

Why I'm asking this: the bucket is going to get huge sooner or later, and a lot of the files in it will be updated regularly (i.e. same key, but the content will change), so I want to know whether each transfer run will have to scan the whole bucket.

Of course I'm sure Google's engineers have implemented the best possible solution, I don't doubt that, but I would like to make sure this won't become a bottleneck along the way.



Solution 1:[1]

Scheduled transfers use modification-time filtering to avoid transferring duplicate data. In terms of performance, the main bottleneck would be S3-to-GCP network bandwidth, which is hard to predict and varies a lot between regions, but is generally on the order of 10 Gbps.

The BigQuery Data Transfer Service is generally recommended for structured data under 15 TB. Very generic network transfer time estimates are available here.

You should also consider the relevant quotas for transfer operations.

While TLS is used for data in transit, if you would like to run transfers over a secure private connection, you should consider using VPC Service Controls (VPC-SC).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Prajna Rai T