You need to look into using HA queues (i.e., mirroring) and ensure you are using a...

atombender · on June 8, 2014

HA queues have this caveat in the documentation:

    This solution requires a RabbitMQ cluster, which means
    that it will not cope seamlessly with network partitions
    within the cluster and, for that reason, is not
    recommended for use across a WAN

And then:

    However, there is currently no way for a slave to know
    whether or not its queue contents have diverged from the
    master to which it is rejoining (this could happen
    during a network partition, for example). As such, when
    a slave rejoins a mirrored queue, it throws away any
    durable local contents it already has and starts empty.

So I don't think that's helpful to us at all.

We are going to migrate to a client that supports consumer cancel notifications, though. Thanks for the tip.

jpgvm · on June 10, 2014

True. What I do recommend, though, is running an RMQ cluster within each cloud (where networking should be a bit more reliable, the exception being AWS, which always sucks in this regard) and then using federation (probably via shovel, but there are other means) to link it to the other clouds.

Ultimately, though, when you get to this stage I question whether your app is big enough to warrant this, and I suggest you use Azure Service Bus, Amazon Simple Queue Service (SQS), or whatever else your providers make available.

If your business really needs such control over messaging, I understand. I have been in the situation where those easy ways out aren't available, and I know your pain. Unfortunately, there is no vendor you can go to to make it go away; Tibco, Sterling, etc. aren't much better than RMQ.

I wish you the best of luck with your multi-cloud federated messaging system. I highly suggest you look at Azure Service Bus; I have nothing but praise for it despite being a devout RMQ zealot.

atombender · on June 11, 2014

You misunderstood me, I think. Our clouds aren't connected. The partitioning problem exists within each data center (e.g., Digital Ocean).

So federation/shovel is probably not the solution.

SQS is far too simple for our needs. No idea what Azure is, but if it's SaaS the latency will likely be too high. We need local performance.

cheald · on June 9, 2014

Can you point me at a good place to start for RMQ and TCP/IP stack tuning?

jpgvm · on June 10, 2014

Sure, I would recommend the below in sysctl:

  net.ipv4.tcp_keepalive_time=5
  net.ipv4.tcp_keepalive_probes=5
  net.ipv4.tcp_keepalive_intvl=1

This will tune the TCP keepalives to decrease the time it takes for most client stacks to realize a server has gone away. It should also be configured on the servers themselves that are participating in the cluster.
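To make the effect of those three values concrete, here is a rough back-of-the-envelope sketch. It assumes the usual Linux behaviour: a connection must be idle for `tcp_keepalive_time` seconds before the first probe, and it is declared dead after `tcp_keepalive_probes` unanswered probes sent `tcp_keepalive_intvl` seconds apart. The function name is made up for illustration, and these sysctls only affect sockets that have `SO_KEEPALIVE` enabled.

```python
def keepalive_detection_seconds(idle_time: int, probes: int, interval: int) -> int:
    """Approximate seconds before an idle connection to a dead peer is dropped.

    idle_time -> net.ipv4.tcp_keepalive_time
    probes    -> net.ipv4.tcp_keepalive_probes
    interval  -> net.ipv4.tcp_keepalive_intvl
    """
    return idle_time + probes * interval


# Values suggested in the comment above.
suggested = keepalive_detection_seconds(idle_time=5, probes=5, interval=1)

# Stock Linux defaults (7200 / 9 / 75), for comparison.
linux_defaults = keepalive_detection_seconds(idle_time=7200, probes=9, interval=75)

print(suggested, linux_defaults)  # 10 7875
```

So the suggested settings bring detection of a dead idle peer down from a bit over two hours to roughly ten seconds, at the cost of more probe traffic on idle connections.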

As for RabbitMQ tuning, I recommend these settings at a minimum:

  [
   {rabbit, [{tcp_listen_options, [binary,
                                  {packet, raw},
                                  {reuseaddr, true},
                                  {backlog, 128},
                                  {nodelay, true},
                                  {exit_on_close, false},
                                  {keepalive, true}]}
            ]}
  ].
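The `{keepalive, true}` entry turns on TCP keepalive for the broker's sockets. On the client side, the equivalent is enabling `SO_KEEPALIVE` on the connection's socket. The following is only a sketch using Python's standard `socket` module, not something taken from any particular AMQP client: the helper name `enable_keepalive` is hypothetical, the per-socket `TCP_KEEP*` options are platform dependent (hence the `getattr` guard), and whether your client library exposes its underlying socket is something you would need to check.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 5, interval: int = 1, probes: int = 5) -> None:
    """Enable TCP keepalive on a socket and, where supported, override the timers.

    The defaults mirror the sysctl values suggested earlier in the thread.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These constants exist on Linux; other platforms may lack some of them,
    # in which case the system-wide settings remain in effect.
    per_socket_options = (
        ("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in per_socket_options:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


# Usage: configure a socket before connecting it to the broker.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Setting the timers per socket like this avoids changing the sysctl values for every process on the machine, if that is a concern.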

cheald · on June 10, 2014

Thank you! I have some reading to do. Appreciate it!