Flink TaskManager not reconnecting to the new JobManager
I have configured Flink in HA mode as mentioned here:
I wanted to test the fault tolerance, hence I did the following:
- Set up a Flink cluster with 2 JobManagers and 1 TaskManager
- Start a streaming job on the task manager
- Kill the active job manager (to simulate a crash)
- The leader election is happening as expected.
- But the task manager is not reconnecting to the new job manager. It simply tries to reconnect to the previous leader every 10 seconds.
Pasting the task manager log here:
2018-07-25 19:46:08,508 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2018-07-25 19:46:08,515 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2018-07-25 19:46:08,524 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-07-25 19:46:08,525 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
2018-07-25 19:46:08,529 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://[email protected]:46477/user/resourcemanager(b91b9aeb3565be973c9bb47259414e0a).
2018-07-25 19:46:08,574 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.10.97.210:46477
2018-07-25 19:46:08,576 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:46477] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:46477]] Caused by: [Connection refused: /10.10.97.210:46477]
2018-07-25 19:46:08,579 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://[email protected]:46477/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[email protected]:46477/user/resourcemanager..
2018-07-25 19:46:18,606 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.10.97.210:46477
2018-07-25 19:46:18,607 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:46477] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:46477]] Caused by: [Connection refused: /10.10.97.210:46477]
2018-07-25 19:46:18,607 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://[email protected]:46477/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[email protected]:46477/user/resourcemanager..
- Restarting task manager doesn't help
- Restarting cluster doesn't help
Please guide me if anything is missing.
Solution 1:[1]
Looking into the logs:
Connection refused: /10.10.97.210:46477
Was port 46477 opened in (or excluded from) the firewall?
Check whether you have set the following in the Flink config:
jobmanager.rpc.port: 6123
blob.server.port: 50100-50200
And then unblock these ports.
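Before changing firewall rules, it may help to confirm from the task manager host what actually happens when connecting to the address in the log. The following is a minimal sketch in Python, not part of Flink; the helper name `probe_port` is hypothetical. It only distinguishes a port that accepts connections from one that refuses them or times out. As a rough guide, a refusal usually means the host answered but nothing is listening on that port, while a timeout is more typical of a firewall silently dropping packets.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a plain TCP connect and report how it ended.

    Hypothetical helper for illustration only; it does not speak
    Flink's RPC protocol, it just checks TCP reachability.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no process accepted the connection.
        return "refused"
    except socket.timeout:
        # No reply within the timeout, e.g. packets dropped on the way.
        return "timeout"
    except OSError as exc:
        # Anything else (unreachable network, name resolution failure, ...).
        return f"error: {exc}"
```

For example, running `print(probe_port("10.10.97.210", 46477))` on the task manager host checks the address from the "Connection refused" log line, and the same call with port 6123 checks the configured RPC port once it has been unblocked.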
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow