Patterns for replicating data to BigQuery

I'm asking for the best practice/industry standard for this type of job; this is what I've been doing:

The end goal is to have a replica of the data in BigQuery.

  1. Get the data from an API (incrementally, using the previous watermark on a field like updated_at)
  2. Batch load it into a native BigQuery table (the main table)
  3. Run an update-ish query like this:
select * except (_rn)
from (select *, row_number() over (partition by <id> order by updated_at desc) as _rn
      from <main table>)
where _rn = 1

Essentially, keep only the most up-to-date row for each id. I'm opting for a table instead of a view to facilitate downstream usage.
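
For concreteness, here is roughly what step 3 looks like when materialized as a table. The names (dataset.main_table, dataset.main_table_latest) and the partition/cluster columns are just illustrative, not my real schema:

-- Sketch only: dataset.main_table, main_table_latest, id and updated_at are illustrative names.
-- Rebuild a derived table that keeps one row per id, the latest by updated_at.
create or replace table dataset.main_table_latest
partition by date(updated_at)   -- assumes updated_at is a TIMESTAMP/DATETIME
cluster by id
as
select * except (_rn)
from (select *, row_number() over (partition by id order by updated_at desc) as _rn
      from dataset.main_table)
where _rn = 1;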

This method works for small tables, but as the volume grows it runs into issues:

  • The whole table is recreated on every run, whether it is partitioned or not
  • If it is partitioned, I can easily run into quota limits

I've also looked at other methods, including loading into a staging table and then performing a MERGE between it and the main table.
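
For reference, this is roughly the staging-plus-MERGE variant I mean; the table names (dataset.main_table, dataset.staging_table) and the id key are again illustrative:

-- Sketch only: table and column names are illustrative.
-- Upsert the incremental batch into the main table instead of rebuilding it.
merge dataset.main_table t
using (
  -- deduplicate the batch first, in case the API returned the same id more than once
  select * except (_rn)
  from (select *, row_number() over (partition by id order by updated_at desc) as _rn
        from dataset.staging_table)
  where _rn = 1
) s
on t.id = s.id
when matched and s.updated_at > t.updated_at then
  update set updated_at = s.updated_at   -- in practice, set every non-key column here
when not matched then
  insert row;

My understanding is that only the partitions that actually receive changes get rewritten by the MERGE, so this should scale better than recreating the whole table, but I'd like to confirm whether this is the standard approach.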

So I'm asking for advice: what are your preferred methods/patterns/architectures to achieve this end goal?

Thanks


