Kafka Connect distributed mode fault tolerance not working

I have created a Kafka Connect cluster with 3 EC2 machines and started 3 connectors (Debezium Postgres source), one on each machine, each reading a different set of tables from the Postgres source. On one of the machines I also started the S3 sink connector. So the changed data from Postgres is moved to the Kafka broker via the three source connectors, and the S3 sink connector consumes these messages and pushes them to an S3 bucket. A sketch of the setup is below.
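For reference, a minimal sketch of how one such source connector could be registered through the Connect REST API (assuming the default port 8083; all hostnames, credentials, and table lists here are placeholders, and `topic.prefix` assumes Debezium 2.x, where older versions use `database.server.name` instead):

```python
import requests

# Hypothetical registration of one of the three Debezium Postgres source
# connectors. In distributed mode the config can be POSTed to any worker;
# it is persisted in the connect-configs topic and assigned to a worker.
source_config = {
    "name": "postgres-source-1",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres-host",   # placeholder
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",          # placeholder
        "database.dbname": "appdb",
        "table.include.list": "public.orders,public.customers",
        "topic.prefix": "pg1",                  # Debezium 2.x naming
    },
}

resp = requests.post("http://ec2-worker-1:8083/connectors", json=source_config)
resp.raise_for_status()
print(resp.json())
```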
The cluster is working fine and so are the connectors. When I pause one of the connectors running on one of the EC2 machines, I expected its task to be taken over by another connector (postgres-debezium) running on another machine, but that is not happening. I also installed Kafdrop to monitor the brokers, and I can see the 3 internal topics connect-offsets, connect-status and connect-configs being populated with the necessary offsets, configs, and status (when I pause, a paused-status message appears). But the other connectors do not take over the task when I pause. In what scenario does a connector take over the task of a failed one? Is pausing the right way to test this, or should I produce an error on one of the connectors so that another one takes over? Please guide.



Solution 1 [1]

Sounds like it's working as expected.

Pausing has nothing to do with the fault tolerance settings and it'll completely stop the tasks. There's nothing to rebalance until unpaused.
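A minimal sketch (with hypothetical host and connector names) of how to observe that difference through the REST API: pausing sets the connector and its tasks to PAUSED on whichever workers already run them, with nothing reassigned, whereas an actual rebalance only happens when a worker leaves the group, e.g. when its process is killed.

```python
import requests

base = "http://ec2-worker-1:8083"   # any worker in the cluster
name = "postgres-source-1"          # placeholder connector name

def show_status():
    status = requests.get(f"{base}/connectors/{name}/status").json()
    print("connector:", status["connector"]["state"])
    for task in status["tasks"]:
        print(f"  task {task['id']}: {task['state']} on {task['worker_id']}")

# Pause: after a short delay the tasks report PAUSED. No other worker
# picks them up, because they are intentionally stopped, not failed.
requests.put(f"{base}/connectors/{name}/pause")
show_status()

# Resume restores RUNNING on the same workers.
requests.put(f"{base}/connectors/{name}/resume")

# To see fault tolerance in action instead, kill one worker process
# (outside this script) and poll again: the surviving workers should
# re-run the orphaned tasks, visible as a changed worker_id.
show_status()
```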

The fault tolerance settings for the dead letter queue, skip+log, or halt are for actual runtime exceptions in the connector that you cannot control through the API, for example a database or S3 network/authentication exception, or a serialization error in the Kafka client.
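As an illustration, a hedged sketch of what those settings might look like on the S3 sink connector's config (bucket, region, topic, and host names are placeholders; the `errors.*` keys are standard Kafka Connect options, with the dead letter queue available for sink connectors only). Note these handle record-level errors such as deserialization failures; a crashed worker is handled separately by the group rebalance described above.

```python
import requests

sink_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "topics": "pg1.public.orders",       # placeholder topic
    "s3.bucket.name": "my-bucket",       # placeholder bucket
    "s3.region": "us-east-1",
    "flush.size": "1000",
    # "none" (the default) fails the task on a conversion/transform
    # error; "all" skips the bad record and keeps the task running.
    "errors.tolerance": "all",
    # Log details of each failed record to the worker log.
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    # Route failed records to a dead letter queue topic (sinks only).
    "errors.deadletterqueue.topic.name": "dlq-s3-sink",
    "errors.deadletterqueue.topic.replication.factor": "3",
}

# PUT /connectors/{name}/config creates or updates the connector.
resp = requests.put(
    "http://ec2-worker-1:8083/connectors/s3-sink/config",
    json=sink_config,
)
resp.raise_for_status()
```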

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: OneCricketeer