BigQuery partition pruning while using an analytical function as part of a view, on a DATETIME field with DAY granularity

I am trying to use analytical functions (e.g. FIRST_VALUE) while still benefiting from partition pruning, on a table that is partitioned on a DATETIME field with DAY granularity.

Example Data

Let's assume a table with the following columns:

name   type
dt     DATETIME
value  STRING

The table is partitioned on dt with the DAY granularity.

An example table can be created using the following SQL:

CREATE TABLE `project.dataset.example_partioned_table`
PARTITION BY DATE(dt)
AS
SELECT dt, CONCAT('some value: ', STRING(dt)) AS value
FROM (
    SELECT
        DATETIME_ADD(
            DATETIME(_date),
            INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
        ) AS dt
    FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)

The generated table will be over 1 GB (around 3.4 MB per day).
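To confirm the per-partition size before experimenting, the INFORMATION_SCHEMA.PARTITIONS view can report it. A minimal sketch, assuming the example table name above and that your dataset exposes INFORMATION_SCHEMA.PARTITIONS:

SELECT
    partition_id,
    total_rows,
    ROUND(total_logical_bytes / POW(2, 20), 1) AS logical_mib
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'example_partioned_table'
ORDER BY partition_id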

The Problem

Now I want to get the first value in each partition (later I actually want a further breakdown).

As I want to use a view, the view itself wouldn't know the final date range. In the example query I am using a CTE named t_view in place of the view.

WITH t_view AS (
    SELECT
        dt,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY DATE(dt)
            ORDER BY dt
        ) AS first_val
    FROM `project.dataset.example_partioned_table`
)

SELECT *
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')

The query result will contain something like some value: 2020-01-01 00:00:00 for first_val (i.e. the first value for that date).

However, as it stands, it is scanning the whole table (over 1 GB), when it should just scan the partition.
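One way to verify the bytes actually processed (rather than relying on the editor's estimate) is to look at recent jobs in INFORMATION_SCHEMA. A sketch, assuming the data lives in the US multi-region (adjust the region prefix otherwise):

SELECT
    creation_time,
    total_bytes_processed,
    query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    AND job_type = 'QUERY'
ORDER BY creation_time DESC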

Other observations

If I don't include first_val (the analytical function) in the result, then the partition pruning works as intended. Including first_val causes it to scan everything.
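For reference, a sketch of the pruned variant without the analytic function (same table and filter as above):

WITH t_view AS (
    SELECT
        dt,
        value
    FROM `project.dataset.example_partioned_table`
)

SELECT *
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')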

If I don't wrap dt with DATE, then the partition pruning also works, but would of course not provide the correct value.

I also tried DATETIME_TRUNC(dt, DAY), with the same lack of partition pruning as with DATE(dt).

Adding the date WHERE clause inside the CTE also works (see the sketch below), but I wouldn't know the date range inside the view.
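A sketch of that variant, with the date filter hard-coded inside the CTE (which defeats the purpose of the view, since the date range isn't known there):

WITH t_view AS (
    SELECT
        dt,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY DATE(dt)
            ORDER BY dt
        ) AS first_val
    FROM `project.dataset.example_partioned_table`
    WHERE DATE(dt) = DATE('2020-01-01')
)

SELECT *
FROM t_view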

How can I restrict the analytical function to the partition of the row?

Failed workaround using GROUP BY

Relatedly, I also tried a workaround using GROUP BY (on the date), with the same result:

WITH t_view_1 AS (
    SELECT
        dt,
        DATE(dt) AS _date,
        value
    FROM `project.dataset.example_partioned_table`
),

t_view_2 AS (
    SELECT
        _date,
        MIN(value) AS first_val
    FROM t_view_1
    GROUP BY _date
),

t_view AS (
    SELECT
        t_view_1._date,
        t_view_1.dt,
        t_view_2.first_val
    FROM t_view_1
    JOIN t_view_2
        ON t_view_2._date = t_view_1._date
)

SELECT *
FROM t_view
WHERE _date = '2020-01-01'

As before, it is scanning the whole table rather than only processing the partition with the selected date.

Potentially working workaround with partition on DATE field

If the table is instead partitioned on a DATE field (_date), e.g.:

CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
    SELECT
        DATETIME_ADD(
            DATETIME(_date),
            INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
        ) AS dt
    FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)

Then the partition pruning works with the following adjusted example query:

WITH t_view AS (
    SELECT
        dt,
        _date,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY _date
            ORDER BY dt
        ) AS first_val
    FROM `project.dataset.example_date_field_partioned_table`
)

SELECT *
FROM t_view
WHERE _date = DATE('2020-01-01')

i.e. the query scans around 4 MB rather than over 1 GB.

However, now I would need to add and populate that additional _date field, which is inconvenient with an external data source.

Having two fields with redundant information can also be confusing.

Additionally there is now no partition pruning at all on dt (queries need to make sure to use _date instead).
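To illustrate that last caveat against the DATE-partitioned table: a filter on DATE(dt) still scans everything, while the equivalent filter on the partitioning column _date prunes. A minimal sketch of the two filters:

-- Scans all partitions: dt is not the partitioning column
SELECT *
FROM `project.dataset.example_date_field_partioned_table`
WHERE DATE(dt) = DATE('2020-01-01')

-- Prunes to a single partition
SELECT *
FROM `project.dataset.example_date_field_partioned_table`
WHERE _date = DATE('2020-01-01')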



Solution 1

BigQuery functions can sometimes lead the query optimizer to make inefficient choices; however, we are constantly working to improve the optimizer.

So, the best workaround in your scenario would be to add an extra DATE column and use it to partition the table, i.e.:

CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
    SELECT
        DATETIME_ADD(
            DATETIME(_date),
            INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
        ) AS dt
    FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)

WITH t_view AS (
    SELECT
        dt,
        _date,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY _date
            ORDER BY dt
        ) AS first_val
    FROM `project.dataset.example_date_field_partioned_table`
)

SELECT *
FROM t_view
WHERE _date = DATE('2020-01-01')
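Since the goal is to use a view, the same logic can be wrapped in an actual view over the DATE-partitioned table, so that consumers only need to filter on _date. A minimal sketch (the view name is illustrative, and it assumes the optimizer prunes through the view the same way it does through the CTE above):

CREATE OR REPLACE VIEW `project.dataset.example_first_value_view`
AS
SELECT
    dt,
    _date,
    value,
    FIRST_VALUE(value) OVER(
        PARTITION BY _date
        ORDER BY dt
    ) AS first_val
FROM `project.dataset.example_date_field_partioned_table`

Consumers then filter on the partitioning column:

SELECT *
FROM `project.dataset.example_first_value_view`
WHERE _date = DATE('2020-01-01')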

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Sakshi Gatyan