'nova services fails intermittently to connect to rabbitmq with timeout error

I am using Openstack Pike with RabbitMQ 3.6.5 on Erlang 18.3.4.4 as the messaging queue. When the nova services are restarted, one of the nova services (mostly nova-scheduler) will have error in the logs as shown below. It says the service cannot connect to the rabbitmq and the error appears on and on until it is restarted. Instance launch also fails due to this error. The service openstack-nova- start commands exits successfully.

2018-05-31 06:19:44.737 5510 INFO nova.service [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Starting scheduler node (version 16.1.1-1.el7)
2018-05-31 06:19:44.745 5510 DEBUG nova.service [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/site-packages/nova/service.py:179
2018-05-31 06:19:44.749 5510 DEBUG oslo.messaging._drivers.pool [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Pool creating new connection create /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/pool.py:143
2018-05-31 06:19:44.753 5510 DEBUG oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] Connecting to AMQP server on localhost:5672 __init__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:597
2018-05-31 06:19:44.761 5510 DEBUG nova.scheduler.host_manager [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Found 2 cells: 00000000-0000-0000-0000-000000000000, 4963dd3d-246d-4f9b-9d4b-0cb1ab3032ae _load_cells /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:642
2018-05-31 06:19:44.761 5510 DEBUG nova.scheduler.host_manager [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] START:_async_init_instance_info _async_init_instance_info /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:421
2018-05-31 06:19:44.763 5510 DEBUG oslo_concurrency.lockutils [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Lock "00000000-0000-0000-0000-000000000000" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270
2018-05-31 06:19:44.764 5510 DEBUG oslo_concurrency.lockutils [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Lock "00000000-0000-0000-0000-000000000000" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2018-05-31 06:19:49.761 5510 DEBUG oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] Received recoverable error from kombu: on_error /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:744
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 494, in _ensured
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     return fun(*args, **kwargs)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 569, in __call__
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     self.revive(self.connection.default_channel)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 819, in default_channel
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     self.connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 802, in connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     self._connection = self._establish_connection()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 757, in _establish_connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     conn = self.transport.establish_connection()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     conn.connect()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 300, in connect
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     self.drain_events(timeout=self.connect_timeout)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     return self.blocking_read(timeout)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 468, in blocking_read
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     frame = self.transport.read_frame()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 237, in read_frame
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     frame_header = read(7, True)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 377, in _read
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     s = recv(n - len(rbuf))
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 354, in recv
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     return self._recv_loop(self.fd.recv, b'', bufsize, flags)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 348, in _recv_loop
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     self._read_trampoline()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 319, in _read_trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     timeout_exc=socket.timeout("timed out"))
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 203, in _trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     mark_as_closed=self._mark_as_closed)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 162, in trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     return hub.switch()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit     return self.greenlet.switch()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit timeout: timed out
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit
2018-05-31 06:19:49.763 5510 ERROR oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] AMQP server on 127.0.0.1:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out

The rabbitmq logs shows TCP accept and closure errors.

=INFO REPORT==== 31-May-2018::06:39:41 === accepting AMQP connection <0.957.0> (127.0.0.1:33456 -> 127.0.0.1:5672)

=ERROR REPORT==== 31-May-2018::06:40:11 === closing AMQP connection <0.948.0> (127.0.0.1:33450 -> 127.0.0.1:5672): {handshake_timeout,frame_header}

On investigation I found that the rabbitmq error suggests that appropriate frame header was not sent by the client and hence the connection was closed. But I cant find a way to see what is the problem with the frame headers or if it is due to some other errors. Can anyone please shed some light on how I can further debug this please ? Thanks.



Solution 1:[1]

I guess that its a problem with TCP/buffer etc. Try to increase the memory buffer of the rabbitmq and the number of TCP connections

$ sysctl -w net.ipv4.tcp_sack=0

$ echo 0 > /proc/sys/net/ipv4/tcp_sack

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Thiago Fonseca