'Inserting aggregates into big table is too slow

I need to run a query as cronjob and it is too slow to perform. It sums a value from a view table (data_time_series) for each pair of aks and ii and inserts a row in storage_time_series if those values are not present there yet: Here is the query:

INSERT INTO storage_time_series (time, value, akd, ii)
SELECT data.time, data.value, data.akd, data.ii FROM
   ( SELECT 
        now() AS time,
        SUM(value) AS value,
        labels[2] AS akd,
        labels[4] AS ii
    FROM data_time_series
    GROUP BY akd, ii) AS data
WHERE NOT EXISTS (
    SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.access_key_id = data.akd AND dst.instance_id = data.ii );

The first part of the query produces around 200K rows and is fast:

SELECT 
        now() AS time,
        SUM(nullif(value, 'NaN'))  AS value,
        labels[2] AS akd,
        labels[4] AS ii
    FROM public.data_time_series
    GROUP BY akd, ii

The second part of the query checks if the data exists in storage_time_series. This part is too slow:

WHERE NOT EXISTS (
    SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.access_key_id = data.akd AND dst.instance_id = data.ii)

storage_time_series is a huge table and I need to keep the data and akd and ii are indexed on storage_time_series.

Here is the tables schema:

SELECT version() :
PostgreSQL 12.9 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
CREATE TABLE storage_time_series
(
    "time" timestamp with time zone,
    value double precision,
    akd integer,
    ii integer
)

CREATE  INDEX 
    ON storage_time_series USING btree
    (akd ASC NULLS LAST, ii ASC NULLS LAST, value ASC NULLS LAST)

Number of rows on storage_time_series table: ~190 million data_time_series is a view and has about 1.3 million rows.

Is there any way to speed up this process?

EDITED: I've already tested creating unique index over akd, ii and value and changed query to simple INSERT and ON CONFLICT DO NOTHING, but seems having unique index make insert process very slower than before.



Solution 1:[1]

akd and ii are indexed on storage_time_series

Sounds good. But too vague to say more. If, as it seems, (akd, ii, value) is supposed to be unique, create a UNIQUE INDEX accordingly:

CREATE UNIQUE INDEX ON storage_time_series (akd, ii, value);

Then your query burns down to:

INSERT INTO storage_time_series
      (akd      , ii       , value     , time )
SELECT labels[2], labels[4], sum(value), now())
FROM   data_time_series
GROUP  BY 1, 2
ON     CONFLICT DO NOTHING;

Which should be fast. Much depends undisclosed details.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1