Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException

We have a topic with 100 partitions and the load is millions of records per hour.

We run into the problem whenever we deploy a new version of the stream processor, which uses a state store and runs as a StatefulSet in Kubernetes.

Normally, we need 4 pods to handle the workload of 100 partitions.

Before deploying a new version, 4 instances are up to date with the data from the topics.
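For context, here is a minimal sketch of the kind of Kafka Streams configuration such a stateful processor uses when running as a StatefulSet; the application id, broker address, thread count, and state directory below are illustrative placeholders, not our exact values:

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class ProcessorConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Placeholder application id / bootstrap servers
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // State store directory backed by the StatefulSet's persistent volume,
        // so the local RocksDB state survives pod restarts
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/state-store");
        // Number of stream threads per pod (illustrative)
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        // Optional: keep a warm standby of each state store on another instance
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return props;
    }
}
```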

Three out of four times when we deploy a new version, only 2 or 3 instances are processing data within a minute; the others throw this exception:

 Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException

So all the data in the partitions assigned to instance #4 builds up and the lag keeps increasing...

If we scale the number of instances to 6 or 8, then 5 or 6 instances process the data, while the other 3 or 2 instances throw this exception:

 Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException

If we let all the instances run like that, eventually (somewhere between 4 and 36 hours later) all the instances recover and no pod throws the exception anymore.
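For reference, this is roughly how we check the per-partition lag mentioned above using the Java AdminClient; the group id and broker address are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            String groupId = "stream-processor"; // placeholder for the Streams application.id

            // Offsets the consumer group has committed, per partition
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Log end offsets for the same partitions
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(request).all().get();

            // Lag = log end offset minus committed offset
            committed.forEach((tp, meta) -> {
                long lag = endOffsets.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```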

Any suggestions to fix this problem would be appreciated.

Thanks,

Austin



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
