Jenkins Agents "Unable to create live FilePath" and marked offline
Jenkins Controller reports: Unable to create live FilePath for i-xxxxxxxxxxxxx, and the Agent is marked Offline.
Googling this error indicates that it is a problem with the communication paths between Controller and Agent, but what?
Background:
Jenkins Controller running v2.332.1, Java 11, 64-bit OS, inside a Docker container.
Jenkins Agents running the Swarm-Client jar downloaded from the Controller on startup: Swarm Plugin version 3.32, Java 11, 64-bit OS, inside a Docker container.
Agents and Controller are hosted on separate EC2 instances in AWS with Security Group permissions on the relevant ports.
The instance starts up, runs Cloud-Init, downloads the swarm-client.jar from the Jenkins Controller, and then runs it with the parameters required to connect to the Controller. I mention this to avoid the "are you using the correct version" comments :-)
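For readers unfamiliar with the Swarm client, the launch step can be sketched as a small helper that builds the command line for the two connection modes discussed in this question. This is only an illustration, not the actual Cloud-Init script: the hostnames, IP address, and port are hypothetical, and the flags (`-url`, `-webSocket`, `-tunnel`) are the Swarm client options as commonly documented, so confirm them with `java -jar swarm-client.jar -help` for your version.

```python
def swarm_command(controller_url, use_websocket, tunnel=None, jar="swarm-client.jar"):
    """Build an argv list for launching the Swarm client (illustrative only)."""
    argv = ["java", "-jar", jar, "-url", controller_url]
    if use_websocket:
        # WebSocket mode: agent traffic goes over the controller's HTTP(S) port.
        argv.append("-webSocket")
    elif tunnel:
        # TCP mode: point the agent at an explicit host:port for the agent listener.
        argv += ["-tunnel", tunnel]
    return argv


# Hypothetical values; substitute the real controller FQDN, IP, and agent port.
ws_cmd = swarm_command("https://jenkins.example.com/", use_websocket=True)
tcp_cmd = swarm_command("http://10.0.0.10:8080/", use_websocket=False,
                        tunnel="10.0.0.10:50000")

print(" ".join(ws_cmd))
print(" ".join(tcp_cmd))
```

Keeping the two modes in one helper makes it easy to switch a test agent between them while everything else stays the same.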
The Agent connects, comes fully online, and gets busy servicing the pending Job queue.
Then, after an indeterminate amount of time, the error appears and the Agent is marked offline. Some jobs run for more than 24 hours without failing; other jobs last only minutes and sometimes fail.
Things I have tried: (some)
The Swarm Client jar can either use WebSockets and connect to the FQDN of the Jenkins Controller, or use the JNLP protocol to connect to the IP and dedicated agent connection port (a fixed value on the Controller). Similar behavior is seen with either protocol.
Opening all the AWS Security Groups: in case there was another port, not mentioned, that needed to be open.
Bypass AWS Load balancer: Agent connects directly to Controller IP:PORT via JNLP.
Matching Versions: Swarm Client downloaded from Controller.
Updated Versions: Jenkins 2.319.3, 2.332.1.
Normalized Java environments: Java 11, 64-bit OS.
Enabled Logging on the Agents: periodic communication happens and then stops after a while, without obvious reason.
Increased Controller Instance size: m5.xlarge -> m5.2xlarge.
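When checking Security Groups and load balancer paths like this, a quick TCP probe run from the Agent instance can confirm that the Controller's ports are reachable at all. The following is a minimal sketch using only the Python standard library; the address and ports in the comments are hypothetical placeholders, not values from this setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (hypothetical Controller IP, HTTP port, and fixed agent port):
#   can_connect("10.0.0.10", 8080)
#   can_connect("10.0.0.10", 50000)
```

A successful probe only shows that the port was reachable at that moment. It does not rule out an established connection being dropped later, which is closer to the symptom described here.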
Solution 1:[1]
Bumping Jenkins up to a non-LTS version made the connections more stable. Jenkins 2.341 and Swarm-Client version 3.32 both use Remoting version 4.13.
Now, while I am not particularly happy about running a non-LTS version of Jenkins, I am pleased to have found a workaround.
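To check whether a given Controller is already at or beyond the version reported here, one option is to read the `X-Jenkins` response header, which Jenkins sends with its HTTP responses, and compare it with 2.341. The sketch below assumes plain dotted numeric versions (no suffixes such as `-SNAPSHOT`); the helper names are made up for illustration.

```python
import urllib.error
import urllib.request


def parse_version(text):
    """Turn a dotted version such as '2.332.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def at_least(version, minimum):
    """Return True if version is greater than or equal to minimum."""
    return parse_version(version) >= parse_version(minimum)


def controller_version(base_url, timeout=5.0):
    """Read the Controller version from the X-Jenkins response header."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.headers.get("X-Jenkins")
    except urllib.error.HTTPError as err:
        # An error response (for example 403) still carries response headers.
        return err.headers.get("X-Jenkins")


print(at_least("2.332.1", "2.341"))  # False
print(at_least("2.344", "2.341"))    # True
```

Comparing integer tuples avoids the trap of string comparison, where "2.99" would sort after "2.341".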
Solution 2:[2]
Fixed by upgrading to Jenkins 2.344
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | edwardTew |
Solution 2 | edwardTew |