All times are shown in UTC
Our internal Reactor queues currently appear to be experiencing a network partition. Consuming from queues may temporarily not work. No other services are affected; realtime messaging is operating as normal.
Queues are now up and running normally. For a long time the partitioned state failed to heal, even after several rolling restarts. Ultimately we made the decision to stop and restart the entire cluster, which then took longer than expected due to an infrastructure issue.
We run our queue servers in 'pause minority' mode, which prioritises data integrity and the preservation of messages already in the queue over availability. As a result, connections to nodes on the minority side of a partition are unavailable to both consumers and new publishers.
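As a rough illustration of the trade-off described above, the sketch below shows the kind of decision a node makes in a pause-minority scheme. It is not our queue servers' actual implementation: the function name `should_pause` and its parameters are hypothetical, and the rule that a node seeing exactly half of the cluster also pauses is an assumption, modelled on how RabbitMQ documents its `pause_minority` partition handling.

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """Return True if a node should pause itself during a partition.

    visible_nodes: number of cluster nodes this node can still reach,
                   counting itself (hypothetical parameter).
    cluster_size:  total number of nodes configured in the cluster.

    Assumption: a node keeps serving only when it can see a strict
    majority of the cluster. Seeing exactly half counts as a minority,
    so that two equal halves cannot both keep accepting messages.
    """
    if cluster_size <= 0 or not 0 < visible_nodes <= cluster_size:
        raise ValueError("visible_nodes must be between 1 and cluster_size")
    return visible_nodes * 2 <= cluster_size


# A three-node cluster split 2 / 1: only the lone node pauses.
majority_side_pauses = should_pause(2, 3)
minority_side_pauses = should_pause(1, 3)
```

Under this rule, clients connected to a paused node lose service while the rest of the cluster keeps the messages already queued, which is the availability cost mentioned above.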
We'll be investigating the causes of the partition, why it failed to heal, and the issues experienced in doing a stop-the-world restart, and post any updates here.
Resolved in about 5 hours