'Get Table_Id along with Rowcount for all Tables in a Project
There are >100 datasets in one of my project and I want to get the Table_id * No_of_rows of each table lying in these 50 datasets.
I can get the metadata of a Dataset using the following query.
SELECT
Project_id,
dataset_id,
table_id,
row_count
FROM
Project.dataset.__TABLES__
I can do this for each and every dataset (But that would take 100 iterations). Is there a better way??
Solution 1:[1]
You cannot do this in a single query, but, I did this python script for you:
from google.cloud import bigquery
client = bigquery.Client()
datasets = list(client.list_datasets())
project = client.project
if datasets:
totalTables=0
print('Datasets in project {}:'.format(project))
for dataset in datasets: # API request(s)
print('Dataset: {}'.format(dataset.dataset_id))
query_job = client.query("SELECT Project_id, dataset_id, table_id, row_count FROM "+dataset.dataset_id+".__TABLES__")
results = query_job.result()
for row in results:
print("\tTable: {} : {}".format(row.table_id, row.row_count))
totalTables=totalTables+int(row.row_count)
print("\n\nTotal number of rows: {}".format(totalTables))
else:
print('{} project does not contain any datasets.'.format(project))
Just run it (in the Cloud Shell for example).
EDIT
I created this Feature Request to have the __DATASETS__
option in BigQuery. It should be great to run your command in all your datasets just setting FROM __DATASETS__.__TABLES__
right?
Solution 2:[2]
I realize this question is old, but did find a way to get table_id and row_count from multiple datasets in a BigQuery project and wanted to share.
DECLARE tables_to_check Array<Struct<dataset_id STRING, project_id String>>;
DECLARE i INT64 DEFAULT 0;
-- store project and dataset ids in a struct
set tables_to_check = Array(
SELECT distinct as Struct table_schema as dataset_id, table_catalog as project_id
FROM region-us.INFORMATION_SCHEMA.TABLES);
create temp table results(dataset_id String, table_id String, row_count INT64);
LOOP set i = i+1;
if i > ARRAY_LENGTH(tables_to_check) THEN LEAVE;
END IF;
EXECUTE IMMEDIATE '''
insert results
SELECT "''' || tables_to_check[ORDINAL(i)].dataset_id || '''" as dataset_id,
table_id,
row_count as row_count
from `''' || tables_to_check[ORDINAL(i)].project_id || '''.''' || tables_to_check[ORDINAL(i)].dataset_id || '''.__TABLES__`
'''
;
END LOOP;
-- choose how to view the results
select * from results order by row_count desc;
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 |