How to list ALL table sizes in a project

Is there a way to list all the table sizes in BigQuery?

I know a query like this that works for a single dataset:

select 
  table_id,
  sum(size_bytes)/pow(10,9) as size
from
  certain_dataset.__TABLES__
group by 
  1

But I want the sizes of all tables in ALL datasets.

Thanks



Solution 1:[1]

This problem got a little easier with the introduction of dynamic SQL to BigQuery scripting in 2020. Now, we can build up a query dynamically and execute it via EXECUTE IMMEDIATE.

Something like this will do for most circumstances where all the datasets are in region-us:

DECLARE dataset_names ARRAY<STRING>;

SET dataset_names = (
    SELECT ARRAY_AGG(SCHEMA_NAME) FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
);

EXECUTE IMMEDIATE (
    SELECT STRING_AGG(
        (SELECT """
            SELECT project_id, dataset_id, table_id, row_count, size_bytes 
            FROM `""" || s || 
            """.__TABLES__`"""), 
            " UNION ALL ")
    FROM UNNEST(dataset_names) AS s);

If there are a large number of datasets, this may return a rate-limit error when trying to read all of the metadata concurrently.

If that happens, we can fall back on a "batched" approach that is a little harder to read and slower/less efficient, but still gets the job done:

DECLARE dataset_names ARRAY<STRING>;
DECLARE batch ARRAY<STRING>;
DECLARE batch_size INT64 DEFAULT 25;

CREATE TEMP TABLE results (
    project_id STRING,
    dataset_id STRING,
    table_id STRING,
    row_count INT64,
    size_bytes INT64
);

SET dataset_names = (
        SELECT ARRAY_AGG(SCHEMA_NAME) 
        FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
    );

LOOP
    IF ARRAY_LENGTH(dataset_names) < 1 THEN 
        LEAVE;
    END IF;

    SET batch = (
        SELECT ARRAY_AGG(d) 
        FROM UNNEST(dataset_names) AS d WITH OFFSET i 
        WHERE i < batch_size);

    EXECUTE IMMEDIATE (
        SELECT """INSERT INTO results """ 
            || STRING_AGG(
                    (SELECT """
                        SELECT project_id, dataset_id, table_id, row_count, size_bytes 
                        FROM `""" || s || """.__TABLES__`"""), 
                " UNION ALL ")
        FROM UNNEST(batch) AS s);

    SET dataset_names = (
        SELECT ARRAY_AGG(d) 
        FROM UNNEST(dataset_names) AS d
        WHERE d NOT IN (SELECT * FROM UNNEST(batch)));
        
END LOOP; 

SELECT * FROM results;

Solution 2:[2]

At the moment there's no way to do that in a single query, but you can do it with a script. Here is my Python script that prints out the list:

from google.cloud import bigquery

client = bigquery.Client()

datasets = list(client.list_datasets())
project = client.project

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))

        query_job = client.query(
            "select table_id, sum(size_bytes)/pow(10,9) as size "
            "from `" + dataset.dataset_id + "`.__TABLES__ group by 1"
        )

        results = query_job.result()
        for row in results:
            print("\tTable: {} : {}".format(row.table_id, row.size))

else:
    print('{} project does not contain any datasets.'.format(project))
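
If you would rather issue a single query instead of one per dataset, here is a minimal sketch (not part of the original answer) that builds the UNION ALL string in Python, much like Solutions 1 and 5 do in SQL. It assumes all datasets live in the client's default location:

from google.cloud import bigquery

client = bigquery.Client()

# Build one UNION ALL query over every dataset's __TABLES__ view.
subqueries = [
    "SELECT project_id, dataset_id, table_id, row_count, size_bytes "
    "FROM `{}.{}.__TABLES__`".format(client.project, d.dataset_id)
    for d in client.list_datasets()
]

if subqueries:
    for row in client.query(" UNION ALL ".join(subqueries)).result():
        print("{}.{}: {:.3f} GB".format(
            row.dataset_id, row.table_id, row.size_bytes / pow(1024, 3)))
else:
    print("Project {} does not contain any datasets.".format(client.project))

As with Solution 1, a very large number of datasets may still hit concurrency or query-size limits, in which case the batched approach above is the safer option.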

Solution 3:[3]

If you would like a Python script that crawls all the tables and captures all metadata, such as column types, table sizes, and descriptions, I have a script on my GitHub account that will do it.

It saves the output to a BigQuery table, CSV, or JSON, depending on what you need.

I created this for a customer who had to audit tens of thousands of tables, but I also use it to find specific table features in the public datasets.

https://github.com/go-dustin/gcp_data_utilities/tree/master/BigQuery/meta_data_crawler
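
For reference, the core crawling idea can also be sketched with the client library alone, with no __TABLES__ query at all. This is only a rough outline (assuming google-cloud-bigquery and default credentials); the linked script additionally captures column types, descriptions, and different output targets:

from google.cloud import bigquery

client = bigquery.Client()

# Walk every dataset and table, pulling size metadata straight from the API.
for dataset in client.list_datasets():
    for table_item in client.list_tables(dataset.dataset_id):
        table = client.get_table(table_item.reference)  # fetch full table metadata
        print("{}.{}: {} rows, {:.3f} GB".format(
            dataset.dataset_id, table.table_id,
            table.num_rows, (table.num_bytes or 0) / pow(1024, 3)))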

Solution 4:[4]

from google.cloud import bigquery
import os  

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOURGCPSERVICEACCOUNTKEY.json"
GCP_PROJECT_ID = "YOURGCPPROJECT"

client = bigquery.Client(project=GCP_PROJECT_ID)
datasets = list(client.list_datasets())
project = client.project

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))

        query_job = client.query(
            f"""
            SELECT
            table_id,
            size_bytes/pow(1024,3) AS size_GB,
            FROM `{GCP_PROJECT_ID}.{dataset.dataset_id}`.__TABLES__
            """
        )

        results = query_job.result()
        for row in results:
            print(f"\tTable: {row.table_id} : {row.size_GB} GB")

else:
    print('{} project does not contain any datasets.'.format(project))

Building on @Alex's answer, this version gives a more accurate number for the size in GB: 1 KB should be represented as 1024 bytes.
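
As a quick worked illustration of the difference, using a hypothetical table size of exactly 2**30 bytes:

size_bytes = 1073741824  # hypothetical table size: exactly 2**30 bytes

print(size_bytes / pow(10, 9))    # 1.073741824 (decimal GB)
print(size_bytes / pow(1024, 3))  # 1.0 (binary GB, i.e. GiB)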

Solution 5:[5]

This can be done with EXECUTE IMMEDIATE, which takes a string input.
Please note that you need to set the correct region where your datasets are; in my case, eu:

DECLARE dataset_names ARRAY<STRING>;
DECLARE query_first_part STRING;

SET dataset_names = (SELECT ARRAY_AGG(SCHEMA_NAME) FROM `region-eu.INFORMATION_SCHEMA.SCHEMATA`);
SET query_first_part = """
            SELECT 
                project_id, 
                dataset_id,
                table_id,
                size_bytes / pow(1024, 3) size_gb,
            FROM `""";

EXECUTE IMMEDIATE (
    SELECT 
        STRING_AGG(
            query_first_part || dataset_name || ".__TABLES__`", 
            " UNION ALL "
        )
    FROM UNNEST(dataset_names) AS dataset_name
);

More info on EXECUTE IMMEDIATE:
https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language#execute_immediate

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Jas
Solution 3 Dustin Williams
Solution 4 yptheangel
Solution 5