BigQuery - extract the most used tables
I'm using Google BigQuery and looking at the default audit dataset.
I know that this dataset contains various data about the queries the users are running.
I wondered if I can write a query that will show me the MOST USED tables across my project (let's say for the last 30 days)?
Output example:
table_1_name: 5000 queries
table_2_name: 4000 queries
table_3_name: 2000 queries
and so on...
Solution 1:[1]
You can use the BigQuery Jobs.list API to get all the jobs you executed.
Inside the response there is an array, referencedTables, which tells you which tables were used in the job:
"referencedTables": [
{
"projectId": string,
"datasetId": string,
"tableId": string
}
]
You can now collect this information into a destination table or other storage to get the metadata you are looking for.
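For example, a minimal sketch of this approach with the google-cloud-bigquery Python client could look like the following (your-project-name is a placeholder, listing jobs for all users needs the right permissions, and referenced_tables is only populated for query jobs):

from collections import Counter
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client(project="your-project-name")  # placeholder project

cutoff = datetime.now(timezone.utc) - timedelta(days=30)
counts = Counter()

# Page through jobs.list for the last 30 days and tally the referenced tables.
for job in client.list_jobs(min_creation_time=cutoff, all_users=True):
    if job.job_type != "query":
        continue
    for table in job.referenced_tables or []:
        counts[f"{table.dataset_id}.{table.table_id}"] += 1

for table_name, total in counts.most_common():
    print(f"{table_name}: {total} queries")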
Solution 2:[2]
Adding to the answer from Tamir Klein, here's the query you can use:
SELECT
job_id,
creation_time,
referenced_tables.dataset_id,
referenced_tables.table_id
FROM `your-project-name`.`your-region`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
UNNEST(referenced_tables) AS referenced_tables
WHERE job_type = 'QUERY'
AND DATE(creation_time) >= DATE_SUB(DATE(CURRENT_TIMESTAMP()), INTERVAL 30 DAY)
That will give you a list of jobs and the tables referenced by all your query jobs in the past 30 days.
Then you can aggregate them by dataset_id and table_id (referenced_table_list below stands for the result of the query above, e.g. saved to a table or wrapped in a WITH clause):
SELECT
dataset_id,
table_id,
COUNT(DISTINCT job_id) AS total_jobs
FROM referenced_table_list
GROUP BY 1,2
ORDER BY 3 DESC
and add conditions as you wish, e.g. WHERE dataset_id = 'audit'. Hope it helps!
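If you want output in the exact format from the question, a minimal sketch that folds both steps into one statement and runs it through the Python client could be (your-project-name and your-region are placeholders; the region qualifier usually looks like region-us):

from google.cloud import bigquery

client = bigquery.Client(project="your-project-name")  # placeholder project

# Both steps combined: unnest referenced tables, then count distinct jobs per table.
sql = """
WITH referenced_table_list AS (
  SELECT
    job_id,
    rt.dataset_id,
    rt.table_id
  FROM `your-project-name`.`your-region`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
    UNNEST(referenced_tables) AS rt
  WHERE job_type = 'QUERY'
    AND DATE(creation_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
)
SELECT
  CONCAT(dataset_id, '.', table_id) AS table_name,
  COUNT(DISTINCT job_id) AS total_jobs
FROM referenced_table_list
GROUP BY table_name
ORDER BY total_jobs DESC
"""

for row in client.query(sql).result():
    print(f"{row.table_name}: {row.total_jobs} queries")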
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tamir Klein |
| Solution 2 | mirachanxx |