Prometheus - debugging slow query

The following query takes more than one minute and then times out. It is issued from Grafana:

/grafana/api/datasources/proxy/2/api/v1/query_range?query=rate(rmq_publish{name="app1"}[5m])&start=1520264038&end=1520264338&step=30

The behaviour is the same with both rate and irate, and with a step of either 2s or 30s.

I think the number of samples for this metric with different labels is large. How do I find this out?

Any tips for profiling this query to find out why it's taking too long to process?



Solution 1:[1]

I think the number of samples for this metric with different labels is large. How do I find this out?

You can find out the number of time series for the metric (one per distinct label combination) by using the count operator:

count by (__name__)({__name__="your_metric_name"})
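As a minimal sketch, the count query can also be sent to the instant-query endpoint from a script rather than from Grafana. The base URL http://localhost:9090 and the helper names below are assumptions for illustration; only the /api/v1/query path and the JSON response shape come from the Prometheus HTTP API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Prometheus is reachable at this address; adjust for your setup.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """Build an /api/v1/query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_first_value(body):
    """Return the first value of an instant-query response, or 0 if it is empty."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    result = payload["data"]["result"]
    # Each vector element carries value = [unix_timestamp, "value-as-string"].
    return int(float(result[0]["value"][1])) if result else 0


def fetch_count(promql, timeout=30):
    """Run the query against Prometheus and return the count (hypothetical helper)."""
    with urlopen(instant_query_url(promql), timeout=timeout) as response:
        return parse_first_value(response.read())
```

Calling fetch_count('count by (__name__)({__name__="rmq_publish"})') would then return the number of series currently stored under that metric name.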

Any tips for profiling this query to find out why it's taking too long to process?

Query performance depends mostly on the size of your data. I would recommend investigating the data size first, before diving into PromQL profiling.
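For a rough sense of scale, the data size can be estimated as one raw sample per series per scrape. The sketch below assumes a constant scrape interval and no series churn; the function name and the numbers in the example call are purely illustrative, not measurements.

```python
def estimated_samples(series_count, window_seconds, scrape_interval_seconds):
    """Rough number of raw samples read: one sample per series per scrape."""
    scrapes_in_window = window_seconds // scrape_interval_seconds
    return series_count * scrapes_in_window


# Illustrative numbers only: 20,000 series, a 10-minute window, 15s scrapes.
print(estimated_samples(20_000, 600, 15))
```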

An easy workaround is to precompute the query with a Prometheus recording rule: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules

Solution 2:[2]

Prometheus query performance depends on the following factors:

  1. The number of unique time series the query needs to select over the given time range. This number can be determined with the following query, which must be sent to the /api/v1/query endpoint:
count(last_over_time(series_selector[lookbehind_window]))

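where series_selector is the series selector used in the slow query and lookbehind_window is the time range the query covers.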

For example, the following query returns the number of time series that need to be selected over a week-long time range for the rmq_publish{name="app1"} series selector:

count(last_over_time(rmq_publish{name="app1"}[7d]))

Note that a simple count(rmq_publish{name="app1"}) may return a much lower number than count(last_over_time(rmq_publish{name="app1"}[7d])) if the matching time series have a high churn rate, that is, if old series are frequently replaced by new ones.

  2. The number of raw samples the query needs to select. This number can be obtained via the following query sent to the /api/v1/query endpoint:
sum(count_over_time(series_selector[lookbehind_window]))

For example, the following query returns the total number of raw samples the query rate(rmq_publish{name="app1"}[5m]) needs to process when executed over a 7-day time range:

sum(count_over_time(rmq_publish{name="app1"}[7d]))
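Both diagnostic queries can be derived from the parameters of the slow query_range request. The sketch below uses the start, end and [5m] window from the question. It assumes the range query reads samples from start minus the rate window up to end, so the lookbehind window covers both. The function name is hypothetical.

```python
def diagnostic_queries(selector, start, end, range_window_seconds):
    """Build the series-count and sample-count queries for one range query.

    Assumption: a range query over [start, end] with a [range_window]
    selector reads samples from start - range_window up to end.
    """
    lookbehind = f"{(end - start) + range_window_seconds}s"
    return {
        "time": end,  # evaluate the instant queries at the end of the range
        "series": f"count(last_over_time({selector}[{lookbehind}]))",
        "samples": f"sum(count_over_time({selector}[{lookbehind}]))",
    }


# Parameters taken from the query_range URL in the question.
queries = diagnostic_queries('rmq_publish{name="app1"}', 1520264038, 1520264338, 5 * 60)
print(queries["series"])
print(queries["samples"])
```

Each resulting expression is an instant query, so it goes to /api/v1/query with the time parameter set to the end timestamp.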

This article contains more details on how to determine the root cause of a slow PromQL query, with possible solutions for improving query performance.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Yuankun
Solution 2 valyala