Handling logs of huge volume with fluent-bit/fluentd

We have the following observability stack: Fluent Bit collectors running with the apps on ECS forward logs to a Fluentd aggregator, which writes to Elasticsearch.

We are often challenged with a huge influx of logs from certain apps running on ECS, which causes the log aggregator to restart and eventually makes Elasticsearch unstable. We have tried a few ways to alleviate this:

  1. Introduce throttling at the aggregator level (drop messages when the rate goes beyond a threshold) - no manual intervention

  2. Filter log messages from App1 before the aggregator does any processing (aggregator needs to be redeployed with the filter - needs manual intervention)

  3. Introduce throttling at the collector level with the Fluent Bit throttle filter plugin (still needs to be tested; see the sketch after this list)
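
For option 3, a rough sketch of what collector-side throttling could look like in the Fluent Bit configuration (classic [FILTER] syntax); the Rate, Window and Interval values are placeholders that would need tuning against real traffic:

  [FILTER]
      # Drop matched records once the average rate exceeds Rate per Interval
      Name          throttle
      Match         *
      Rate          800
      Window        5
      Interval      1s
      # Log the current throttle status for visibility
      Print_Status  true

Since each ECS task presumably runs its own collector, throttling here is effectively per app, whereas the fluentd throttle plugin shown further below groups by the app field on the shared aggregator.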

While option 1 works when the log volume is reasonable, it has not proven adequate when the log volume is huge (3 million records in 1 hour, i.e. roughly 800 records per second). The log aggregator container keeps restarting, which resets the throttle plugin's in-memory state with every restart.

With option 2 we have tried to work around the problem, but it needs an operator to restart the log aggregator with the app filter in place. We normally do this by updating the CloudFormation stack.

Current fluentd config - APP_LOGS_DROP needs to be set to the app that creates the huge influx of logs, and the aggregator container is then restarted:

 <match "#{ENV['APP_LOGS_DROP']}">
      @type null
   </match>
   <match **>
     @type relabel
     @label @throttle
   </match>
</label>
<label @throttle>
  <filter log.**>
    @type record_modifier
    <record>
      app ${tag_parts[1]}
    </record>
  </filter>
  <filter log.**>
    @type throttle
    group_key app
    group_bucket_period_s   "#{ENV['THROTTLE_PERIOD']}"
    group_bucket_limit      "#{ENV['THROTTLE_LIMIT']}"
    group_reset_rate_s      "#{ENV['THROTTLE_RESET_RATE']}"
  </filter>
  <match log.**>
   @type relabel
   @label @continue
  </match>
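
The excerpt stops at the relabel to @continue. For context, a hypothetical sketch of what that label might contain, assuming the aggregator ships to Elasticsearch via fluent-plugin-elasticsearch; the host, prefix and buffer path are placeholders:

  <label @continue>
    <match log.**>
      # Hypothetical Elasticsearch output - host/port/prefix are placeholders
      @type elasticsearch
      host es.internal.example.com
      port 9200
      logstash_format true
      logstash_prefix app-logs
      # File buffering so an aggregator restart does not drop in-flight records
      <buffer>
        @type file
        path /var/log/fluentd/buffer
        flush_interval 5s
      </buffer>
    </match>
  </label>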

I want to know whether there are other ways to look at this problem, and also ways to automate option 2. Currently, we only find out about a huge log volume through Watcher alerts in Elasticsearch, once it has already become unstable.


Thanks in advance.



Solution 1:[1]

Thanks for the question - I believe the architecture you are using is great for scaling this up. I would recommend that any throttling or optimization for interacting with Elasticsearch be done at the aggregator level rather than trying to maintain it on each Fluent Bit DaemonSet / collector side.

You could also use Fluent Bit as the aggregator; it includes a throttle filter (see the Fluent Bit Throttle documentation).
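
As a rough illustration (not from the answer itself), a minimal Fluent Bit aggregator pipeline with the throttle filter might look like the following; the port, rate and Elasticsearch host/index are placeholder values:

  [INPUT]
      # Receive records forwarded by the collectors
      Name      forward
      Listen    0.0.0.0
      Port      24224

  [FILTER]
      # Centralised throttling before anything reaches Elasticsearch
      Name      throttle
      Match     *
      Rate      1000
      Window    5
      Interval  1s

  [OUTPUT]
      # Placeholder Elasticsearch destination
      Name      es
      Match     *
      Host      es.internal.example.com
      Port      9200
      Index     app-logs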

If you are seeking configuration management and scaling on top of Kubernetes, you could also use Calyptia Enterprise for Fluent Bit. Disclaimer: I'm part of the team there; we allow you to deploy and manage up to 100 GB per day for free.

Happy to help more if you need; you can reach me at anurag at calyptia dot com.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Anurag Gupta