RabbitMQ Really Cares About Your Hostname (Node Name)
This is something I ran into at work, and it ended up being quite an interesting rabbit hole that I wanted to write about.
It all started with trying to remove a piece of infrastructure at work that creates DNS A records for an instance's hostname. We use the Consul peer discovery backend for RabbitMQ[1]. This uses Consul to discover the instances that should form a cluster. With the simplest config, it uses the hostname as the address it registers in Consul, which the other cluster members then use to connect to it, so the hostname needs to be a DNS address that resolves to that particular instance. That isn't always true in our environment, so upon disabling the piece of infrastructure that created those A records, the RabbitMQ instances were no longer able to cluster.
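For context, the relevant part of our rabbitmq.conf looked roughly like this (a minimal sketch - the Consul address and service name are placeholders, not our real values):

# Use Consul to discover the peers that should form the cluster
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.consul.host = localhost
# The service this node registers itself under in Consul
cluster_formation.consul.svc = rabbitmq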
The first step I took was configuring RabbitMQ to register the IP address as the node's address in Consul so that other instances would use that for clustering - circumventing the need for a resolvable hostname. Well, this didn't work. I could see in the debug logs[2] that the instances were correctly trying to connect to one another with the address rabbit@<ip address>, which is what I expected. What I didn't expect, however, was the result the node was getting:
{badrpc,nodedown}
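For reference, the configuration change behind this attempt looked roughly like this (a sketch with a placeholder IP; svc_addr is the documented key for overriding the address the Consul backend registers):

# Don't compute the service address; register an explicit one
cluster_formation.consul.svc_addr_auto = false
cluster_formation.consul.svc_addr = 10.0.1.23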
One thing to note is that the debug logs often include very Erlang-y things, as you'd expect from the debug logs of something written in Erlang. This means that the errors, like the one logged here, are in the Erlang tuple style of {<error>, <reason>}. I tracked this down to the result of a call to rpc:call/4[3]. So, this error indicates that the RabbitMQ instance was unable to make an RPC call to another RabbitMQ instance because it believes the other instance is down. This set me off trying to figure out if there was something wrong with the network configuration, because the other node was definitely up.
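To make that concrete, the failing call has essentially the same shape as this (a hypothetical node name, run from an Erlang shell with distribution enabled):

%% Ask the remote node to report its own name. With a healthy
%% connection this returns 'rabbit@10.0.1.23'; otherwise you get
%% the same {badrpc,nodedown} tuple seen in the logs.
rpc:call('rabbit@10.0.1.23', erlang, node, []).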
There was nothing wrong there either - I could telnet to the specific port that the Erlang RPC system connects to. Scratching my chin, I decided to change tack and use the DNS peer discovery backend[4]. My initial assumption was that, given a DNS address that resolves to a set of IPs, those IPs would be used to form the cluster. However, there is an extra step in there: upon retrieving the IP addresses, the backend does a reverse DNS lookup on each one to get the hostnames, so the instances still need to have resolvable hostnames. In retrospect, this behaviour is clearly documented and I had just made a faulty assumption, but it surprised me because reverse DNS lookups are pretty uncommon in service discovery.

As this was all running in AWS, the reverse lookup succeeded, but due to our VPC configuration it returned hostnames in the old IP address format. The RabbitMQ node then tries to filter itself out of the list by comparing its node name (an Erlang concept[5]) to the DNS addresses it resolved. As the hostname (and by extension the node name) was not in the list, the discovery would fail. The fix was relatively simple: set the hostname to the IP address format hostname that the internal AWS DNS resolver was returning. Whilst this works, I didn't want to use the IP address format because it is older and has restrictions that the newer resource naming format does not have.
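Switching backends was a small config change, along these lines (a sketch - the record name is a placeholder):

# Discover peers from the A records behind a single DNS name
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_dns
# Each IP this resolves to is reverse-looked-up to get a hostname
cluster_formation.dns.hostname = rabbitmq.internal.example.com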
Whilst trying out the DNS backend, a colleague and I had also been searching around for why RabbitMQ was getting the badrpc error. They found a post from the rabbitmq-users Google Group[6]. This included an important line that set us on the right path:
You cannot use IP addresses unless they resolve to themselves as hostnames.
That kind of explains why RabbitMQ was failing to cluster. The IPs are not resolvable as hostnames. I still don't fully understand why this is the case, but I suspect it has something to do with Erlang's RPC library, as that is what RabbitMQ uses to communicate. I wonder if you could set RABBITMQ_NODENAME to the instance's private IP address, rather than making the IP address a resolvable DNS address, because the address used by the RPC library seems to be used for more than just finding the node.
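If you wanted to try that, the variable takes a full node name, so it would look something like this (untested on my part, placeholder IP):

# rabbitmq-env.conf - set the Erlang node name directly
RABBITMQ_NODENAME=rabbit@10.0.1.23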
Knowing this, I went back to the Consul discovery backend and changed the hostnames of all the instances to the resource name format that AWS provides and makes resolvable via the internal DNS - and it worked!
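Concretely, that meant moving each instance's hostname from the IP-based name to the resource-based one, along the lines of (placeholder names and region):

# Old, IP-based hostname:       ip-10-0-1-23.eu-west-1.compute.internal
# New, resource-based hostname: i-0123456789abcdef0.eu-west-1.compute.internal
sudo hostnamectl set-hostname i-0123456789abcdef0.eu-west-1.compute.internal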
Okay, so we were able to get both the Consul and DNS discovery backends working. I was already leaning towards the Consul backend, but there was one other thing that pushed me over the line: distributed locking. When the cluster is forming, one node must initialise the cluster, then the other nodes join it. To coordinate this, both the DNS and Consul backends take out locks. The DNS backend uses Erlang's built-in locking library[7]. The Consul backend uses Consul's own locking mechanism[8]. From playing around with them both, the Consul locks seem much more reliable: I didn't see any errors on startup, whereas with the DNS backend I saw lots of errors about uninitialised tables, even though the clustering did work in the end.
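For a feel of what the Erlang side looks like, the global module's locks work roughly like this (a minimal sketch; RabbitMQ's actual lock id and node list will differ):

%% Take a cluster-wide lock, do the critical work, release it.
Id = {cluster_formation, node()},
true = global:set_lock(Id, [node() | nodes()]),
%% ... critical section: initialise the cluster or join a peer ...
true = global:del_lock(Id, [node() | nodes()]).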
Making sure that the RabbitMQ nodes do cluster is very important, because the failure mode is that a node decides it's the first member of the cluster, initialises all of its own state, and will never again cluster with another node. The easiest way I've found to fix this is to blow away the node's state and have it rejoin the cluster. This behaviour is understandable: RabbitMQ is trying to be very fault tolerant, and there isn't really much else it can do in this situation. It's optimising for AP, rather than CP, in CAP theorem parlance.
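Blowing the node away is thankfully just a few standard rabbitmqctl commands, run on the node that wrongly formed its own cluster:

rabbitmqctl stop_app    # stop the RabbitMQ application (the Erlang VM keeps running)
rabbitmqctl reset       # wipe this node's local state
rabbitmqctl start_app   # start again; peer discovery runs on the now-blank node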
1. https://www.rabbitmq.com/cluster-formation.html#peer-discovery-consul
2. As an aside, RabbitMQ's debug logs are super informative. If you're having any issues like this you should turn them on with:

# Turns on logging to standard error
log.console = true
# Sets the log level to debug so we get all the juicy details
log.console.level = debug

3. https://www.erlang.org/doc/man/rpc.html#call-4
4. https://www.rabbitmq.com/cluster-formation.html#peer-discovery-dns
5. https://www.erlang.org/doc/reference_manual/distributed.html
6. https://groups.google.com/g/rabbitmq-users/c/zh-c-R2Pch0/m/VZ_40kV6BAAJ
7. https://www.erlang.org/doc/man/global.html#set_lock-3
8. https://developer.hashicorp.com/consul/commands/lock