Debugging back pressure in Apache Storm

We are using Apache Storm, and all of a sudden the topology stops taking new events. Looking into ZooKeeper, we can see that a back pressure node is getting created.

Example: /alchemist/storm/backpressure/OurTopology/ba786e4c-5119-4ebc-856b-6d02d3740d64-6707

This indicates that back pressure is being caused by worker node ba786e4c-5119-4ebc-856b-6d02d3740d64, which is listening on port 6707.

However, I don't see any logs from this worker. What steps and metrics can we look at to debug what is causing the back pressure?



Solution 1:[1]

According to this and this link, Storm throttles the spout(s) in case of back pressure. More precisely, the following happens:

  1. If a receive queue of an executor is full, a backpressure thread is notified
  2. This backpressure thread communicates to ZooKeeper that back pressure is occurring in a given topology (these are the znodes you observed; see the inspection sketch below)
  3. ZooKeeper notifies all workers that the spouts have to be throttled
  4. The spouts throttle their sending speed / event rate.

Obviously, the topology is not expected to stop processing entirely, as happens in your case.
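
Since the back pressure signal from step 2 materializes as znodes under the path you already found, one quick debugging step is to watch which workers create those znodes and how long they stay around. Below is a minimal sketch using the plain ZooKeeper Java client; the connect string, the /alchemist/storm root, and the topology name are taken from your example and are assumptions you would adapt to your cluster.

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class BackpressureZnodeWatcher {
        public static void main(String[] args) throws Exception {
            // Assumption: ZooKeeper is reachable on localhost:2181 and Storm's root
            // is /alchemist/storm, as in the path from the question.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
            String topologyPath = "/alchemist/storm/backpressure/OurTopology";

            // Each child znode is named <worker-id>-<port> and only exists while that
            // worker is signaling back pressure, so polling shows who throttles the spouts.
            for (int i = 0; i < 10; i++) {
                List<String> throttlingWorkers = zk.getChildren(topologyPath, false);
                System.out.println("Workers currently signaling back pressure: " + throttlingWorkers);
                Thread.sleep(5_000);
            }
            zk.close();
        }
    }

If the same <worker-id>-<port> entry never disappears, that worker's receive queue is permanently full, which tells you which worker and bolt(s) to inspect first.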

Some things I'd recommend here:

  • Double-check all logs from all supervisors, workers, and the Nimbus to see whether there are any errors. I often overlook errors buried in Storm's logs.
  • According to the references above and the Storm docs, there are several parameters that affect the backpressure behaviour. Maybe you can play around with these to see if there is any effect (a small configuration sketch follows this list):
    • topology.max.spout.pending: According to the second link, this is the maximum number of tuples that can be pending acknowledgement in your topology at a given time.
    • disruptor.highwatermark (default 0.9): The back pressure system relies on how full a bolt's receive buffer is, which is why there is a notion of watermarks. The high and low watermarks define how full or how empty the buffer must be to throttle the spout or to restart it. With the default of 0.9, the "full" signal is sent and the spout is throttled when the receive buffer of the bolt is 90% full.
    • disruptor.lowwatermark (default 0.4): With the default of 0.4, the "not full" signal is sent and the spout is started again when the receive buffer of the bolt drops below 40% of capacity.
  • Use Storm Metrics to analyze processes more precisely. Some potential metrics to observe are:
    • __skipped-backpressure-ms: This metric records how much time a spout was idle because back-pressure indicated that downstream queues in the topology were too full.
    • arrival_rate_secs: Estimation of the number of tuples that are inserted into the queue in one second, although it is actually the dequeue rate.
    • sojourn_time_ms: Calculated from the arrival rate, this is an estimate of how many milliseconds each tuple sits in the queue before it is processed.
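
To experiment with the parameters above, you can set them on the topology configuration before submitting. The following is a minimal sketch, assuming Storm 1.x package names (org.apache.storm). Note that the watermark keys are written here as they appear in Storm 1.x defaults.yaml (backpressure.disruptor.high.watermark / backpressure.disruptor.low.watermark), while the references above call them disruptor.highwatermark / disruptor.lowwatermark, so double-check the exact key names for your release.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class BackpressureTuningExample {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... set up your spouts and bolts here ...

            Config conf = new Config();

            // Cap the number of tuples awaiting acknowledgement (topology.max.spout.pending);
            // a lower value makes the spout back off earlier and can smooth out back pressure spikes.
            conf.setMaxSpoutPending(500);

            // Watermark settings (0.9 / 0.4 are the documented defaults; lower the high
            // watermark to throttle earlier, raise the low watermark to resume sooner).
            // Assumption: key names are those of Storm 1.x defaults.yaml.
            conf.put("backpressure.disruptor.high.watermark", 0.9);
            conf.put("backpressure.disruptor.low.watermark", 0.4);

            StormSubmitter.submitTopology("OurTopology", conf, builder.createTopology());
        }
    }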

However, working with Storm's metrics is a pain, as they are periodically logged to files on disk. Setting up a monitoring tool might help here. Unfortunately, the only monitoring tool mentioned there is storm-graphite, which does not seem to be maintained anymore. I have also read of people using Grafana or other tools.
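
If you just want these metrics somewhere greppable without setting up a full monitoring stack, one option (a sketch, again assuming Storm 1.x package names) is to register Storm's built-in LoggingMetricsConsumer on the topology config; it receives the built-in metrics, including the queue metrics mentioned above, and writes them to a metrics log on each worker (the exact file depends on your cluster's log4j2 configuration).

    import org.apache.storm.Config;
    import org.apache.storm.metric.LoggingMetricsConsumer;

    public class MetricsLoggingConfig {
        // Helper that enables periodic metrics logging on an existing topology config.
        public static Config withMetricsLogging(Config conf) {
            // Report the built-in metrics every 10 seconds instead of the default 60.
            conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 10);
            // One consumer task is usually enough; it receives metrics from all executors.
            conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
            return conf;
        }
    }

The same registerMetricsConsumer hook is also how you would plug in a Graphite/Grafana-style consumer such as storm-graphite, if you decide to go down the monitoring-tool route.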

All in all, I'd be very interested in how you eventually solve this issue.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 moosehead42