'Make existing bigquery table clustered

I have a quite huge existing partitioned table in bigquery. I want to make the table clustered, at least for the new partition.

From the documentation: https://cloud.google.com/bigquery/docs/creating-clustered-tables, it is said that we are able to Creating a clustered table when you load data and I have tried to load a new partition using clustering fields: job_config.clustering_fields = ["event_type"].

The load finished successfully, however it seems that the new partition is not clustered (I am not really sure how to check whether it is clustered or not, but when I query to that particular partition it would always scan all rows).

Is there a good way to make clustering field for an existing partitioned table?

Any comment, suggestion, or answer is well appreciated.

Thanks a lot, Yosua



Solution 1:[1]

This answer is no longer valid / correct

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec


You can only specify clustering columns when a table is created
So, obviously you cannot expect existing non-clustered table and especially just new partitions to become clustered

The "workaround" is to create new table to be properly partitioned / clustered and load data into it from Google Cloud Storage (GCS). You can export data from original table into GCS first for this so whole process will be free of charge

Solution 2:[2]

BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.

You can change the clustering specification in the following ways:

Call the tables.update or tables.patch API method.

Call the bq command-line tool's bq update command with the --clustering_fields flag.

Reference

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

Solution 3:[3]

What I missed from the above answers was a real example, so here it goes:

bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table

Where tool, qualifier and user_id are the three columns I want the table to be clustered by (in that order) and the table is my_dataset.my_table.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 dogmatic69
Solution 2 SANN3
Solution 3 Leonardo Campos