Amazon Linux 2 worker fails to reboot

I'm running a Node.js application on an Amazon Linux 2 worker instance (an Elastic Beanstalk worker environment) connected to SQS.

The problem

It all runs fine, except that for technical reasons I need to restart the server regularly. To do this, I've set up a cron job that runs /sbin/shutdown -r now every night.
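For reference, the root crontab entry looks roughly like this (the exact schedule is illustrative):

0 3 * * * /sbin/shutdown -r now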

As the instance boots back up, I get an error regarding the SQS daemon service:

[INFO] Executing instruction: configureSqsd
[INFO] get sqsd conf from cfn metadata and write into sqsd conf file ...
[INFO] Executing instruction: startSqsd
[INFO] Running command /bin/sh -c systemctl show -p PartOf sqsd.service
[INFO] Running command /bin/sh -c systemctl is-active sqsd.service
[INFO] Running command /bin/sh -c systemctl start sqsd.service
[ERROR] An error occurred during execution of command [self-startup] - [startSqsd]. 
Stop running the command. Error: startProcess Failure: starting process "sqsd" failed: 
Command /bin/sh -c systemctl start sqsd.service failed with error exit status 1. 
Stderr:Job for sqsd.service failed because the control process exited with error code. 
See "systemctl status sqsd.service" and "journalctl -xe" for details.

The instance then gets stuck in a loop: the initialization runs until it hits the sqsd.service error and starts over again.

Logs

The output of systemctl status sqsd.service doesn't show much more than we already know, only that the start command exited with status 1:

● sqsd.service - This is sqsd daemon
   Loaded: loaded (/etc/systemd/system/sqsd.service; enabled; vendor preset: disabled)
   Active: deactivating (stop-sigterm) (Result: exit-code)
  Process: 2748 ExecStopPost=/bin/sh -c  (code=exited, status=0/SUCCESS)
  Process: 2745 ExecStopPost=/bin/sh -c rm -f /var/pids/sqsd.pid (code=exited, status=0/SUCCESS)
  Process: 2753 ExecStart=/bin/sh -c /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start (code=exited, status=1/FAILURE)
   CGroup: /system.slice/sqsd.service
           └─2789 /opt/elasticbeanstalk/lib/ruby/bin/ruby /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start

The most interesting finding from journalctl -xe is:

sqsd[9704]: /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `initialize': No such file or directory @ rb_sysopen - /var/run/aws-sqsd/default.pid (Errno::ENOENT)
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `open'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `start'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:83:in `launch'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:111:in `<top (required)>'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `load'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `<main>'
systemd[1]: sqsd.service: control process exited, code=exited status=1
systemd[1]: Failed to start This is sqsd daemon.

Further investigation

As per the logs, the file /var/run/aws-sqsd/default.pid does not exist when rebooting the server. It does exist on a rebuild and contains the application process ID.

If I create the file manually, the setup process gets a little further before failing on another missing file.
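One thing I noticed while digging (this is only my working assumption, not a confirmed explanation): on Amazon Linux 2, /var/run is a symlink to /run, which is a tmpfs that gets emptied on every boot, so a directory created there during the original deploy won't survive a reboot. A minimal sketch of a tmpfiles.d entry that would recreate the directory at boot follows; the owner and mode are guesses and would need to match whatever user sqsd actually runs as:

# /etc/tmpfiles.d/aws-sqsd.conf  (sketch; owner/mode are assumptions)
d /var/run/aws-sqsd 0755 root root -

Running systemd-tmpfiles --create afterwards should apply it without waiting for the next boot.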

Solutions?

Has anyone run into this issue before? I'm not sure why starting sqsd.service fails after a normal reboot but works fine on the initial deploy and after rebuilding the environment. It almost seems like it's looking for a config file that doesn't exist...

Are there any other ways to safely reboot the instance that I should try?
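For example, the reboot could also be triggered from outside the instance through the EC2 API rather than from inside the OS (the instance ID below is a placeholder):

aws ec2 reboot-instances --instance-ids i-0123456789abcdef0

I haven't verified whether this behaves any differently from /sbin/shutdown -r as far as sqsd is concerned.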



Solution 1:[1]

I have the exact same issue. I'm not posting a solution, just some more data on the problem. I found errors in /var/log/messages suggesting that the sqsd daemon ran out of memory when forking:

Apr 28 15:43:05 ip-172-31-121-3 sqsd: /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:42:in `fork': Cannot allocate memory - fork(2) (Errno::ENOMEM)
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:42:in `start'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:83:in `launch'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:111:in `<top (required)>'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `load'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `<main>'
Apr 28 15:43:05 ip-172-31-121-3 systemd: sqsd.service: control process exited, code=exited status=1
Apr 28 15:43:05 ip-172-31-121-3 systemd: Failed to start This is sqsd daemon.
Apr 28 15:43:05 ip-172-31-121-3 systemd: Unit sqsd.service entered failed state.
Apr 28 15:43:05 ip-172-31-121-3 systemd: sqsd.service failed.

After moving to a larger instance class the reboot went through fine, but I'm not sure whether it was the extra memory or just the refreshed instance (as david.emilsson mentioned) that did it.
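In case it helps anyone comparing, a quick way to check whether memory really is the limiting factor on the instance (standard tools, nothing sqsd-specific):

# memory headroom around the time sqsd starts
free -m
# any other allocation failures logged?
grep -i 'cannot allocate memory' /var/log/messages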

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: phoenix