Query for a cache hit rate graph with Prometheus

I'm using a Caffeine cache with a Spring Boot application. All metrics are enabled, so I have them in Prometheus and Grafana.

Based on the cache_gets_total metric, I want to build a hit rate graph.

I've tried to get the cache hits:

delta(cache_gets_total{result="hit",name="myCache"}[1m])

and all gets from cache:

sum(delta(cache_gets_total{name="myCache"}[1m]))

Both queries work correctly and return values. But when I try to compute the hit ratio, I get no data points. The query I've tried:

delta(cache_gets_total{result="hit",name="myCache"}[1m]) / sum(delta(cache_gets_total{name="myCache"}[1m]))

Why doesn't this query work, and how can I get a hit rate graph from the information that Spring Boot and Caffeine expose?



Solution 1:[1]

Run both queries ("cache hits" and "all gets") individually in Prometheus and compare the label sets of their results. For the "/" operation to work, both sides must have exactly the same labels (names and values). Usually some aggregation is needed to drop the unwanted dimensions/labels. For example, if each query already returns a single value, wrap both of them in sum() before dividing.

Solution 2:[2]

First of all, it is recommended to use increase() instead of delta() for calculating the increase of a counter over the specified lookbehind window. The increase() function properly handles counter resets to zero, which may happen on service restart, while delta() returns incorrect results if the lookbehind window covers a counter reset.
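
To make the difference concrete, here is a minimal Python sketch of the idea. It is a toy model, not Prometheus's actual implementation: the function names and sample values are made up for illustration, and the real increase() also extrapolates to the window boundaries, which is ignored here.

```python
# Toy model: raw counter samples inside one lookbehind window.
# The drop from 110 to 5 stands for a service restart (counter reset).
samples = [100.0, 110.0, 5.0, 15.0]  # hypothetical values


def naive_delta(values):
    """Difference between the last and the first sample, ignoring resets."""
    return values[-1] - values[0]


def reset_aware_increase(values):
    """Sum the growth between neighbouring samples.

    When a sample is lower than the previous one, the counter is assumed
    to have restarted from zero, so the whole new value counts as growth.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


delta_result = naive_delta(samples)              # negative: misleading for a counter
increase_result = reset_aware_increase(samples)  # 10 before the reset + 15 after it

print(delta_result, increase_result)
```

With these samples the naive difference is negative even though the cache kept serving requests, while the reset-aware sum stays positive.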

Next, when performing the / operation, Prometheus searches for pairs of time series with identical sets of labels and then applies the operation to each pair individually. Time series returned from increase(cache_gets_total{result="hit",name="myCache"}[1m]) have at least two labels, result="hit" and name="myCache", while the time series returned from sum(increase(cache_gets_total{name="myCache"}[1m])) has no labels, because sum() removes all labels during aggregation. Since no pair with identical labels exists, the division returns nothing.
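
The matching rule can be illustrated with a small Python sketch. This is only a toy model of one-to-one matching, under the assumption that a series is identified purely by its label set; the helper names and the sample values are hypothetical, and real series usually carry extra labels such as instance and job.

```python
def labelset(**labels):
    """Build a hashable label set, e.g. labelset(name="myCache")."""
    return frozenset(labels.items())


def divide(left, right):
    """Divide only the samples whose label sets are identical on both sides.

    Series without a partner on the other side are silently dropped,
    which is why a mismatch produces an empty result rather than an error.
    """
    return {ls: value / right[ls] for ls, value in left.items() if ls in right}


def sum_all(vector):
    """Model sum(...) without a by clause: one series with no labels."""
    return {frozenset(): sum(vector.values())}


# Hypothetical one-minute increases for the cache named myCache.
gets = {
    labelset(name="myCache", result="hit"): 90.0,
    labelset(name="myCache", result="miss"): 10.0,
}
hits = {ls: v for ls, v in gets.items() if ("result", "hit") in ls}

no_match = divide(hits, sum_all(gets))        # labelled series vs. unlabelled series
ratio = divide(sum_all(hits), sum_all(gets))  # both sides have no labels

print(no_match)
print(ratio)
```

The first division finds no pair with identical labels and comes back empty, which corresponds to the "no data points" in the question. The second one aggregates both sides first, so the single pair matches.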

Prometheus provides a solution to this issue: the on() and group_left() modifiers. The on() modifier limits the set of labels used when searching for time series pairs with identical label sets, while the group_left() modifier allows matching multiple time series on the left side of / with a single time series on the right side. See the Prometheus documentation on vector matching. So the following query should return the cache hit rate:

increase(cache_gets_total{result="hit",name="myCache"}[1m])
  / on() group_left()
sum(increase(cache_gets_total{name="myCache"}[1m]))

Alternative solutions also exist:

  1. Remove all the labels from increase(cache_gets_total{result="hit",name="myCache"}[1m]) by wrapping it in sum():
sum(increase(cache_gets_total{result="hit",name="myCache"}[1m]))
  /
sum(increase(cache_gets_total{name="myCache"}[1m]))
  2. Wrap the right-hand side of the query in the scalar() function. This enables the vector-to-scalar matching rules described in the Prometheus documentation:
increase(cache_gets_total{result="hit",name="myCache"}[1m])
  /
scalar(sum(increase(cache_gets_total{name="myCache"}[1m])))

It is also possible to get the cache hit rate for all caches with a single query by using the sum(...) by (name) pattern:

sum(increase(cache_gets_total{result="hit"}[1m])) by (name)
  /
sum(increase(cache_gets_total[1m])) by (name)
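
As a rough illustration of why grouping by name makes both sides match, here is a toy Python model of that query. The cache names and numbers are invented for the example, and the helper is not part of any Prometheus client library.

```python
from collections import defaultdict


def sum_by(vector, label):
    """Model sum(...) by (label): keep only that label and add up the values."""
    totals = defaultdict(float)
    for labels, value in vector.items():
        totals[dict(labels)[label]] += value
    return dict(totals)


# Hypothetical one-minute increases for two caches.
gets = {
    frozenset({("name", "users"), ("result", "hit")}): 80.0,
    frozenset({("name", "users"), ("result", "miss")}): 20.0,
    frozenset({("name", "orders"), ("result", "hit")}): 30.0,
    frozenset({("name", "orders"), ("result", "miss")}): 30.0,
}
hits = {ls: v for ls, v in gets.items() if ("result", "hit") in ls}

hits_by_name = sum_by(hits, "name")
gets_by_name = sum_by(gets, "name")

# After grouping, both sides are keyed by name only, so every cache pairs up.
hit_rate = {name: hits_by_name[name] / gets_by_name[name] for name in hits_by_name}

print(hit_rate)
```

Because both aggregations keep exactly one label, each cache on the left has exactly one partner on the right and gets its own ratio.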

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  bjakubski
Solution 2  valyala