BigQuery partition pruning while using analytical function as part of a view, on a DATETIME field with DAY granularity
I am trying to use analytic functions (e.g. FIRST_VALUE) while still benefiting from partition pruning, on a table that is partitioned on a DATETIME field with DAY granularity.
Example Data
Let's assume a table with the following columns:
name | type |
---|---|
dt | DATETIME |
value | STRING |
The table is partitioned on dt with DAY granularity.
An example table can be created using the following SQL:
CREATE TABLE `project.dataset.example_partioned_table`
PARTITION BY DATE(dt)
AS
SELECT dt, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
The generated table will be over 1 GB (around 3.4 MB per day).
The Problem
Now I want to get the first value in each partition. (Later I actually want to have a further breakdown)
As I want to use a view, the view itself doesn't know the final date range. In the example query I am using a temporary table t_view in place of the view.
WITH t_view AS (
SELECT
dt,
value,
FIRST_VALUE(value) OVER(
PARTITION BY DATE(dt)
ORDER BY dt
) AS first_val
FROM `project.dataset.example_partioned_table`
)
SELECT *,
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')
The query result will contain something like some value: 2020-01-01 00:00:00 for first_val (i.e. the first value for the date).
However, as it stands, it is scanning the whole table (over 1 GB), when it should just scan the partition.
Other observations
If I don't include first_val (the analytic function) in the result, then partition pruning works as intended; including first_val causes it to scan everything.
If I don't wrap dt with DATE, then partition pruning also works, but it would of course not produce the correct value. I also tried DATETIME_TRUNC(dt, DAY), with the same lack of partition pruning as DATE(dt).
Adding the date WHERE clause inside the temporary table also works, but the date range wouldn't be known inside the view.
How can I restrict the analytical function to the partition of the row?
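Since the filter works when it is applied inside the temporary table, but the date range isn't known inside a plain view, one possible direction (a sketch, not a confirmed fix for the pruning issue above) is a BigQuery table function, which accepts the date range as a parameter so the filter is applied before the window function runs. The routine name first_val_for_range below is hypothetical:

```sql
-- Sketch: a parameterized "view" via a table function, so the date filter
-- is applied inside the function body, where pruning is known to work.
-- The routine name `first_val_for_range` is hypothetical.
CREATE OR REPLACE TABLE FUNCTION `project.dataset.first_val_for_range`(
  start_date DATE, end_date DATE
)
AS
SELECT
  dt,
  value,
  FIRST_VALUE(value) OVER(
    PARTITION BY DATE(dt)
    ORDER BY dt
  ) AS first_val
FROM `project.dataset.example_partioned_table`
WHERE DATE(dt) BETWEEN start_date AND end_date;

-- Usage: callers pass the date range instead of filtering afterwards.
SELECT *
FROM `project.dataset.first_val_for_range`(DATE '2020-01-01', DATE '2020-01-01');
```

This trades the convenience of a plain view for an explicit parameter, so it only helps callers that can supply the date range up front.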
Failed workaround using GROUP BY
Related, I also tried a workaround using GROUP BY (the date), with the same result...
WITH t_view_1 AS (
SELECT
dt,
DATE(dt) AS _date,
value,
FROM `project.dataset.example_partioned_table`
),
t_view_2 AS (
SELECT
_date,
MIN(value) AS first_val
FROM t_view_1
GROUP BY _date
),
t_view AS (
SELECT
t_view_1._date,
t_view_1.dt,
t_view_2.first_val
FROM t_view_1
JOIN t_view_2
ON t_view_2._date = t_view_1._date
)
SELECT *,
FROM t_view
WHERE _date = '2020-01-01'
As before, it is scanning the whole table rather than only processing the partition with the selected date.
Potentially working workaround with partition on DATE field
If the table is instead partitioned on a DATE field (_date), e.g.:
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
Then the partition pruning works with the following adjusted example query:
WITH t_view AS (
SELECT
dt,
_date,
value,
FIRST_VALUE(value) OVER(
PARTITION BY _date
ORDER BY dt
) AS first_val
FROM `project.dataset.example_date_field_partioned_table`
)
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')
i.e. the query scans around 4 MB rather than 1 GB.
However, I would now need to add and populate that additional _date field, which is inconvenient with an external data source. Having two fields with redundant information can also be confusing. Additionally, there is now no partition pruning at all on dt (queries need to make sure to filter on _date instead).
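Populating the redundant _date column can be done at load time. A minimal sketch, assuming the external data first lands in a hypothetical staging_table:

```sql
-- Sketch: keep the redundant `_date` column in sync when loading new rows.
-- `staging_table` is a hypothetical landing table for the external source.
INSERT INTO `project.dataset.example_date_field_partioned_table` (dt, _date, value)
SELECT
  dt,
  DATE(dt) AS _date,  -- derive the partitioning column from dt
  value
FROM `project.dataset.staging_table`;
```

This keeps the two columns consistent, but every ingestion path has to remember to derive _date the same way.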
Solution 1
BigQuery functions can sometimes lead the query optimizer to make inefficient choices; however, we're constantly trying to improve the query optimizer.
So, the best workaround in your scenario would be to add an extra DATE column and use it to partition the table, i.e.:
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)

WITH t_view AS (
SELECT
dt,
_date,
value,
FIRST_VALUE(value) OVER(
PARTITION BY _date
ORDER BY dt
) AS first_val
FROM `project.dataset.example_date_field_partioned_table`
)
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')
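To sanity-check that the new table is partitioned as expected, one option (a sketch, assuming the table lives in project.dataset) is to inspect the partition metadata exposed by INFORMATION_SCHEMA:

```sql
-- Sketch: list the partitions of the table and their sizes,
-- to confirm one partition per day exists and is roughly 3-4 MB.
SELECT partition_id, total_rows, total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'example_date_field_partioned_table'
ORDER BY partition_id;
```

Comparing total_logical_bytes per partition with the bytes billed for the filtered query is a quick way to confirm that pruning actually limited the scan to one partition.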
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Sakshi Gatyan |