Incident Details

All times are shown in UTC

14th August 2018 08:13:43 AM

High error rates during emergency revert of faulty deployment

At 5:50 UTC, a routine deploy caused error rates to climb in eu-central-1. See https://status.ably.io/incidents/553 for details.

Following an investigation into the root cause, we have decided that it is safer to revert the recent deployed on any instances globally until we can fully understand the root cause of the issue.

We have noticed error rates briefly increase in regions during (for periods of up to around 1 minute) whilst this code is being reverted.

We'll confirm once this process is completed.

14th Aug 09:00 AM

The new code has been reverted now on all nodes and error rates are back to normal.

We'll be completing a post mortem on this issue and post updates within this incident report.

16th Aug 06:00 PM

We have completed our post mortem on what happened:

The underlying cause was that we deployed code that was effectively broken. The specific issue was related to interoperability in a mixed cluster, and arose in a specific configuration that we had not tested sufficiently before release

When we deployed the code to eu-central-1, we saw a very large spike in system load - whose root cause was the bug - but due to the cascade of effects when hugely overloaded we were overwhelmed by logs containing secondary errors so we didn't recognise the root cause initially

The load then triggered autoscaling and recovery of nodes across the cluster which, due to a procedural error in making the deploy, caused other nodes to be restarted with the broken code.

We were able to recover the system by suspending recovery, finding the root cause bug, and reverting affected nodes.

However, moving forwards, we are going to improve our deployment strategy so that we can better test multi-region mixed cluster environments.

We apologize for the inconvenience this caused.


3 months ago