Patterns for replicating data to BigQuery
I'm asking for the best practice / industry standard for this type of job. The end goal is to have a replica of the data in BigQuery, and this is what I've been doing:
- Get the data from an API (incrementally, using the previous watermark on a field like updated_at)
- Batch load into a native BigQuery table (the main table)
- Run an Update-ish query, like this:

select * except (_rn)
from (
  select *,
         row_number() over (partition by <id> order by updated_at desc) as _rn
  from <main table>
)
where _rn = 1
Essentially, keep only the most up-to-date row per <id>. I'm opting for a table instead of a view to facilitate downstream usage.
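To make that concrete, here is roughly what one run looks like. Dataset, table, and column names are placeholders, and I'm assuming the dedup result replaces the main table in place (it could just as well go into a separate "latest" table):

-- Derive the watermark for the next incremental API pull
-- (the extraction itself happens outside BigQuery).
select max(updated_at) as watermark
from my_dataset.main_table;

-- After batch-loading the new increment, rebuild the deduplicated table.
-- create or replace rewrites the whole table, which is the pain point below.
create or replace table my_dataset.main_table as
select * except (_rn)
from (
  select *,
         row_number() over (partition by id order by updated_at desc) as _rn
  from my_dataset.main_table
)
where _rn = 1;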
This method works for small tables, but as the volume increases it runs into issues:
- The whole table gets recreated on every run, whether it is partitioned or not
- If partitioned, I could easily run into quota limits
I've also looked at other methods, including loading the increment into a staging table and then performing a MERGE between the staging table and the main table.
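Roughly something like this, where the staging table holds only the rows extracted in the current run (names and the column list are placeholders, just to illustrate the shape):

-- Upsert the current increment into the main table.
merge my_dataset.main_table t
using my_dataset.staging_table s
on t.id = s.id
when matched and s.updated_at > t.updated_at then
  update set col_a = s.col_a, updated_at = s.updated_at
when not matched then
  insert (id, col_a, updated_at)
  values (s.id, s.col_a, s.updated_at);

As far as I understand, the staging table would need to be deduplicated first (at most one row per id), otherwise the MERGE fails when multiple source rows match the same target row.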
So I'm asking for advice: what are your preferred methods/patterns/architectures to achieve this end goal?
Thanks
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow