Jenkins Agents "Unable to create live FilePath" and marked offline

The Jenkins Controller reports "Unable to create live FilePath for i-xxxxxxxxxxxxx" and the Agent is marked Offline.

Searching for this error suggests a problem with the communication path between Controller and Agent, but what exactly?

Background:

Jenkins Controller: v2.332.1, Java 11, 64-bit OS, running inside a Docker container.

Jenkins Agents: running the Swarm-Client jar downloaded from the Controller on startup (Swarm Plugin version 3.32), Java 11, 64-bit OS, running inside a Docker container.

Agents and Controller are hosted on separate EC2 instances in AWS with Security Group permissions on the relevant ports.

The instance starts up, runs Cloud-Init, downloads swarm-client.jar from the Jenkins Controller, and then runs it with the parameters required to connect to the Controller. I mention this to avoid the "are you using the correct version" comments :-)

The Agent connects, comes fully online, and gets busy servicing the pending job queue.

Then, after an indeterminate amount of time, the Agent is marked offline. Some jobs run for more than 24 hours without failing; other jobs run for only minutes and sometimes fail.

Things I have tried: (some)

The Swarm Client jar can either use WebSockets and connect to the FQDN of the Jenkins Controller, or use the JNLP protocol to connect to the IP and dedicated agent connection port (a fixed value on the Controller). Similar behavior is seen with either protocol.
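To make the two connection modes concrete, here is a minimal Python sketch that assembles the Swarm Client command line for each. It is not the exact invocation from my Cloud-Init script: the helper name `build_swarm_command`, the example hostnames, and the port 50000 are all made up for illustration, and the `-url`, `-webSocket` and `-tunnel` options are the ones I understand the Swarm Client to accept, so check them against the client's own help output for your version.

```python
from typing import List, Optional


def build_swarm_command(
    controller_url: str,
    use_websocket: bool,
    tunnel: Optional[str] = None,
    jar_path: str = "swarm-client.jar",
) -> List[str]:
    """Assemble an argument list for launching the Swarm Client.

    Hypothetical helper: only the transport-related options are shown;
    credentials, labels, executors and so on are left out.
    """
    cmd = ["java", "-jar", jar_path, "-url", controller_url]
    if use_websocket:
        # WebSocket mode: the agent talks to the Controller over its HTTP(S) URL.
        cmd.append("-webSocket")
    if tunnel:
        # TCP (JNLP) mode: point the agent at HOST:PORT of the fixed agent port.
        cmd.extend(["-tunnel", tunnel])
    return cmd


if __name__ == "__main__":
    # Assumed example values, not the real environment.
    print(build_swarm_command("https://jenkins.example.com", use_websocket=True))
    print(build_swarm_command("http://10.0.0.5:8080", use_websocket=False,
                              tunnel="10.0.0.5:50000"))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems when credentials or labels are added later.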

Opening all the AWS Security Groups: in case there was another port, not mentioned, that needed to be open.

Bypass AWS Load Balancer: Agent connects directly to Controller IP:PORT via JNLP.

Matching Versions: Swarm Client downloaded from the Controller.

Updated Versions: Jenkins 2.319.3, 2.332.1.

Normalized Java environments: Java 11, 64-bit OS.

Enabled Logging on the Agents: periodic communication happens and then stops after a while, without obvious reason.

Increased Controller Instance size: m5.xlarge -> m5.2xlarge.
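For the Security Group and Load Balancer checks, a basic reachability probe run from the Agent instance looks roughly like the Python sketch below. The function name `can_connect` and the example host and port are assumptions for illustration. Note that it only confirms a TCP connection can be opened at that moment; it says nothing about an established connection being dropped later, which is what the symptoms here look like.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: a blocked Security Group usually shows up as a
    timeout, a closed port as a refused connection; both return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumed example values: replace with the Controller IP and agent port.
    print(can_connect("10.0.0.5", 50000))
```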



Solution 1:[1]

Bumping Jenkins up to a non-LTS version made the connections more stable. Jenkins 2.341 and Swarm-Client version 3.32 both use Remoting version 4.13.

While I am not particularly happy about running a non-LTS version of Jenkins, I am pleased to have found a workaround.

Response times of the instances are also better.

Solution 2:[2]

Fixed by upgrading to Jenkins 2.344

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 edwardTew
Solution 2 edwardTew