'Inserting aggregates into big table is too slow
I need to run a query as cronjob and it is too slow to perform. It sums a value from a view table (data_time_series
) for each pair of aks
and ii
and inserts a row in storage_time_series
if those values are not present there yet:
Here is the query:
INSERT INTO storage_time_series (time, value, akd, ii)
SELECT data.time, data.value, data.akd, data.ii FROM
( SELECT
now() AS time,
SUM(value) AS value,
labels[2] AS akd,
labels[4] AS ii
FROM data_time_series
GROUP BY akd, ii) AS data
WHERE NOT EXISTS (
SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.access_key_id = data.akd AND dst.instance_id = data.ii );
The first part of the query produces around 200K rows and is fast:
SELECT
now() AS time,
SUM(nullif(value, 'NaN')) AS value,
labels[2] AS akd,
labels[4] AS ii
FROM public.data_time_series
GROUP BY akd, ii
The second part of the query checks if the data exists in storage_time_series
. This part is too slow:
WHERE NOT EXISTS (
SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.access_key_id = data.akd AND dst.instance_id = data.ii)
storage_time_series
is a huge table and I need to keep the data and akd
and ii
are indexed on storage_time_series
.
Here is the tables schema:
SELECT version() :
PostgreSQL 12.9 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
CREATE TABLE storage_time_series
(
"time" timestamp with time zone,
value double precision,
akd integer,
ii integer
)
CREATE INDEX
ON storage_time_series USING btree
(akd ASC NULLS LAST, ii ASC NULLS LAST, value ASC NULLS LAST)
Number of rows on storage_time_series table: ~190 million data_time_series is a view and has about 1.3 million rows.
Is there any way to speed up this process?
EDITED: I've already tested creating unique index
over akd, ii and value and changed query to simple INSERT and ON CONFLICT DO NOTHING, but seems having unique index
make insert process very slower than before.
Solution 1:[1]
akd
andii
are indexed onstorage_time_series
Sounds good. But too vague to say more. If, as it seems, (akd, ii, value)
is supposed to be unique, create a UNIQUE INDEX
accordingly:
CREATE UNIQUE INDEX ON storage_time_series (akd, ii, value);
Then your query burns down to:
INSERT INTO storage_time_series
(akd , ii , value , time )
SELECT labels[2], labels[4], sum(value), now())
FROM data_time_series
GROUP BY 1, 2
ON CONFLICT DO NOTHING;
Which should be fast. Much depends undisclosed details.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |