How To Reduce Prometheus (Federation) Scrape Duration

I have a Prometheus federation with two Prometheus servers - one per Kubernetes cluster - and a central one to rule them all.

Over time the scrape durations increase. At some point, the scrape duration exceeds the timeout, metrics get lost, and alerts fire.

I’m trying to reduce the scrape duration by dropping metrics, but this is an uphill battle and more like Sisyphus than Prometheus.

Does anyone know a way to reduce the scrape time without losing metrics and without having to drop more and more as time progresses?

Thanks in advance!



Solution 1:[1]

Per Prometheus' documentation, these settings determine the global scrape interval, scrape timeout, and rule evaluation frequency:

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

...and for each scrape job the configuration allows setting job-specific values:

# The job name assigned to scraped metrics by default.
job_name: <job_name>

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]

Without knowing more about the number of targets and the number of metrics per target, I can only suggest configuring an appropriate scrape_timeout per job and adjusting the global evaluation_interval accordingly.
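
As a rough illustration, assuming hypothetical job names, targets, and values, a federation job with its own overrides might look like this:

scrape_configs:
  - job_name: 'federate-cluster-a'       # hypothetical job name
    scrape_interval: 2m                  # scrape this heavy endpoint less often than the global default
    scrape_timeout: 1m                   # give the federation endpoint more time before timing out
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'                  # placeholder selector; narrow it to what the central server really needs
    static_configs:
      - targets: ['prometheus-cluster-a:9090']   # placeholder target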

Another option, in combination with the suggestion above or on its own, is to dedicate Prometheus instances to scraping non-overlapping sets of targets. This makes it possible to scale Prometheus and to use a different evaluation_interval per set of targets - for example, a longer scrape_timeout and a less frequent evaluation_interval (a higher value) for jobs that take longer, so that they don't affect other jobs.
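
A minimal sketch of that split, assuming two hypothetical instances (file names, job names, targets, and intervals below are placeholders):

# prometheus-fast.yml - instance dedicated to cheap, frequently scraped targets
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

# prometheus-slow.yml - instance dedicated to heavy, slow targets
global:
  scrape_interval: 5m
  scrape_timeout: 2m
  evaluation_interval: 5m
scrape_configs:
  - job_name: 'heavy-exporter'
    static_configs:
      - targets: ['heavy-exporter:9200']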

Also, check that no exporter is misbehaving by accumulating metrics over time instead of only reporting current readings at scrape time - otherwise, the response returned to Prometheus will keep growing.
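
One way to spot such an exporter is to watch Prometheus's built-in per-target metrics scrape_samples_scraped and scrape_duration_seconds. A sketch of a rule file (alert names, thresholds, and durations are illustrative, not recommendations):

groups:
  - name: scrape-health                  # hypothetical group name
    rules:
      # A target returning 50% more samples than it did a day ago often points
      # at an exporter that accumulates series instead of resetting them.
      - alert: ScrapeSampleCountGrowing
        expr: scrape_samples_scraped > 1.5 * (scrape_samples_scraped offset 1d)
        for: 1h
      # A scrape that regularly eats most of a 10s scrape_timeout is about to start failing.
      - alert: ScrapeCloseToTimeout
        expr: scrape_duration_seconds > 8
        for: 15m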

Solution 2:[2]

It isn't recommended to build data replication on top of Prometheus federation, since federation doesn't scale with the number of active time series, as can be seen in the described case. It is better to set up data replication via the Prometheus remote_write protocol. For example, add the following lines to the Prometheus config in order to enable data replication to a VictoriaMetrics remote storage located at the given url:

remote_write:
  - url: http://victoriametrics-host:8428/api/v1/write
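
If needed, the same remote_write entry can also carry relabeling rules and queue tuning; a hedged sketch, where the metric regex and queue values are placeholders rather than recommendations:

remote_write:
  - url: http://victoriametrics-host:8428/api/v1/write
    # Optionally drop series before they leave Prometheus; the regex is a placeholder.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_duration_seconds.*'
        action: drop
    # Queue tuning knobs for the replication stream (values are illustrative).
    queue_config:
      max_samples_per_send: 5000
      capacity: 20000
      max_shards: 30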


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: apisim
Solution 2: valyala