'Getting Info from GCP Data Catalog
I notice when you query the data catalog in the Google Cloud Platform it retrieves stats for the amount of times a table has been queried:
Queried (Past 30 days): 5332
This is extremely useful information and I was wondering where this is actually stored and if it can be retrieved for all the tables in a project or a dataset.
I have trawled the data catalog tutorials and written some python scripts but these just retrieve entry names for tables and in an iterator which is not what I am looking for.
Likewise I also cannot see this data in the information schema metadata.
Solution 1:[1]
You can retrieve the number of completed/performed queries of any table/dataset exporting log entries to BiqQuery. Every query generates some logging on Stackdriver so you can use advanced filters to select the logs you are interested it and store them as a new table in Bigquery.
However, the retention period for the data access logs in GCP is 30 days, so you can only export the logs in the past 30 days.
For instance, use the following advance filter for getting the logs corresponding to all the jobs completed of an specific table:
resource.type="bigquery_resource" AND
log_name="projects/<project_name>/logs/cloudaudit.googleapis.com%2Fdata_access" AND
proto_payload.method_name="jobservice.jobcompleted"
"<table_name>"
Then select Bigquery as Sink Service and state a name for your sink table and the dataset where it will be stored.
All the completed jobs on this table performed after the sink is established will appear as a new table in BigQuery. You can query this table to get information about the logs (you can use a COUNT statement on any column to get the total number of successful jobs for instance).
Solution 2:[2]
This information is available in the projects.locations.entryGroups.entries/get API. It is availble as UsageSignal, and contains usage information of 24 hours, 7days, 30days.
Sample output:
"usageSignal": {
"updateTime": "2021-05-23T06:59:59.971Z",
"usageWithinTimeRange": {
"30D": {
"totalCompletions": 156890,
"totalFailures": 3,
"totalCancellations": 1,
"totalExecutionTimeForCompletionsMillis": 6.973312e+08
},
"7D": {
"totalCompletions": 44318,
"totalFailures": 1,
"totalExecutionTimeForCompletionsMillis": 2.0592365e+08
},
"24H": {
"totalCompletions": 6302,
"totalExecutionTimeForCompletionsMillis": 25763162
}
}
}
Reference:
Solution 3:[3]
With Python Datacatalog - You first need to search the Data catalog and you will receive linked_resource in response.
Pass this linked_resource as a request to lookup_entry and you will fetch the last queried (30 days)
results = dc_client.search_catalog(request=request, timeout=120.0)
for result in results:
linked_resource = result.linked_resource
# Get the Location and number of times the table is queried in last 30 days
table_entry = dc_client.lookup_entry(request={"linked_resource": linked_resource})
queried_past_30_days = table_entry.usage_signal.usage_within_time_range.get("30D")
if queried_past_30_days is not None:
dc_num_queried_past_30_days = int(queried_past_30_days.total_completions)
else:
dc_num_queried_past_30_days = 0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | halfer |
Solution 2 | SANN3 |
Solution 3 | Pasha |