How to list ALL table sizes in a project
Is there a way to list all the table sizes in BigQuery?
I know a query like this:
select
  table_id,
  sum(size_bytes)/pow(10,9) as size
from
  certain_dataset.__TABLES__
group by
  1
But I want the sizes of all tables across ALL datasets.
Thanks
Solution 1:[1]
This problem got a little easier with the introduction of dynamic SQL to BigQuery scripting in 2020. Now we can build up a query dynamically and execute it via EXECUTE IMMEDIATE.
Something like this will do for most circumstances where all the datasets are in region-us:
DECLARE dataset_names ARRAY<STRING>;

SET dataset_names = (
  SELECT ARRAY_AGG(SCHEMA_NAME) FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
);

EXECUTE IMMEDIATE (
  SELECT STRING_AGG(
    (SELECT """
SELECT project_id, dataset_id, table_id, row_count, size_bytes
FROM `""" || s || """.__TABLES__`"""),
    " UNION ALL ")
  FROM UNNEST(dataset_names) AS s);
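To make the mechanics concrete: for two hypothetical datasets named ds_a and ds_b, the string handed to EXECUTE IMMEDIATE would expand to roughly:

SELECT project_id, dataset_id, table_id, row_count, size_bytes
FROM `ds_a.__TABLES__` UNION ALL
SELECT project_id, dataset_id, table_id, row_count, size_bytes
FROM `ds_b.__TABLES__`

BigQuery then runs that as one ordinary query.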
If there is a large number of datasets then this may return a rate-limit error when trying to read all the metadata concurrently.
If that happens, we can fall back on a "batched" approach that is a little more complicated to read and slower/less efficient, but still gets the job done:
DECLARE dataset_names ARRAY<STRING>;
DECLARE batch ARRAY<STRING>;
DECLARE batch_size INT64 DEFAULT 25;

CREATE TEMP TABLE results (
  project_id STRING,
  dataset_id STRING,
  table_id STRING,
  row_count INT64,
  size_bytes INT64
);

SET dataset_names = (
  SELECT ARRAY_AGG(SCHEMA_NAME)
  FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
);

LOOP
  IF ARRAY_LENGTH(dataset_names) < 1 THEN
    LEAVE;
  END IF;

  -- Take the next batch_size dataset names.
  SET batch = (
    SELECT ARRAY_AGG(d)
    FROM UNNEST(dataset_names) AS d WITH OFFSET i
    WHERE i < batch_size);

  -- Read __TABLES__ for every dataset in the batch with one statement.
  EXECUTE IMMEDIATE (
    SELECT """INSERT INTO results """
    || STRING_AGG(
      (SELECT """
SELECT project_id, dataset_id, table_id, row_count, size_bytes
FROM `""" || s || """.__TABLES__`"""),
      " UNION ALL ")
    FROM UNNEST(batch) AS s);

  -- Drop the processed datasets from the queue.
  SET dataset_names = (
    SELECT ARRAY_AGG(d)
    FROM UNNEST(dataset_names) AS d
    WHERE d NOT IN (SELECT * FROM UNNEST(batch)));
END LOOP;

SELECT * FROM results;
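If you want dataset-level totals rather than one row per table, the final statement can be swapped for an aggregation over the temp table, for example:

SELECT
  dataset_id,
  SUM(size_bytes) / POW(1024, 3) AS size_gb
FROM results
GROUP BY dataset_id
ORDER BY size_gb DESC;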
Solution 2:[2]
At the moment there's no way to do that in a single query, but you can do it with a script. Here is my Python script that prints out the list:
from google.cloud import bigquery

client = bigquery.Client()
datasets = list(client.list_datasets())
project = client.project

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))
        query_job = client.query("select table_id, sum(size_bytes)/pow(10,9) as size from `" + dataset.dataset_id + "`.__TABLES__ group by 1")
        results = query_job.result()
        for row in results:
            print("\tTable: {} : {}".format(row.table_id, row.size))
else:
    print('{} project does not contain any datasets.'.format(project))
Solution 3:[3]
If you would like a Python script that crawls all the tables and captures all metadata such as column types, table size, descriptions, etc., I have a script on my GitHub account that will do it.
It saves the output to a BigQuery table, CSV, or JSON, depending on what you need.
I created it for a customer who had to audit tens of thousands of tables, but I also use it to find specific table features in the public datasets.
https://github.com/go-dustin/gcp_data_utilities/tree/master/BigQuery/meta_data_crawler
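The repo above is the full implementation. For a lighter-weight look at column-level metadata in a single dataset, BigQuery's built-in INFORMATION_SCHEMA.COLUMNS view covers some of the same ground (my_dataset below is a placeholder):

SELECT
  table_name,
  column_name,
  data_type
FROM `my_dataset.INFORMATION_SCHEMA.COLUMNS`
ORDER BY table_name, ordinal_position;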
Solution 4:[4]
from google.cloud import bigquery
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOURGCPSERVICEACCOUNTKEY.json"
GCP_PROJECT_ID = "YOURGCPPROJECT"

client = bigquery.Client(project=GCP_PROJECT_ID)
datasets = list(client.list_datasets())
project = client.project

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))
        query_job = client.query(
            f"""
            SELECT
              table_id,
              size_bytes/pow(1024,3) AS size_GB
            FROM `{GCP_PROJECT_ID}.{dataset.dataset_id}`.__TABLES__
            """
        )
        results = query_job.result()
        for row in results:
            print(f"\tTable: {row.table_id} : {row.size_GB} GB")
else:
    print('{} project does not contain any datasets.'.format(project))
Building on @Alex's answer above, this version gives a more accurate figure for size in GB: 1 KB should be represented as 1024 bytes, so the divisor is pow(1024, 3) rather than pow(10, 9).
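To see the difference concretely, compare the two divisors on a hypothetical table of 5,000,000,000 bytes:

SELECT
  5000000000 / POW(10, 9)   AS size_gb_decimal,  -- 5.0
  5000000000 / POW(1024, 3) AS size_gib_binary;  -- ~4.66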
Solution 5:[5]
This can be done with EXECUTE IMMEDIATE, which needs a string input.
Please note that you need to set the correct region where your datasets are, in my case eu:
DECLARE dataset_names ARRAY<STRING>;
DECLARE query_first_part STRING;

SET dataset_names = (
  SELECT ARRAY_AGG(SCHEMA_NAME) FROM `region-eu.INFORMATION_SCHEMA.SCHEMATA`
);

SET query_first_part = """
SELECT
  project_id,
  dataset_id,
  table_id,
  size_bytes / pow(1024, 3) AS size_gb
FROM `""";

EXECUTE IMMEDIATE (
  SELECT
    STRING_AGG(
      query_first_part || dataset_name || ".__TABLES__`",
      " UNION ALL "
    )
  FROM UNNEST(dataset_names) AS dataset_name
);
More info on EXECUTE IMMEDIATE: https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language#execute_immediate
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | Jas
Solution 3 | Dustin Williams
Solution 4 | yptheangel
Solution 5 |