Incident Details

All times are shown in UTC

30th October 2018 01:35:00 PM

Global incident causing increased latencies and error rates worldwide

An incident on 30 October 2018 caused increased error rates from 14:00 UTC until 16:30 UTC, with residual issues remaining until 19:00 UTC.

Please read the full post mortem below.

30th Oct 02:04 PM

Update at 14:03 (UTC): We are routing traffic away from us-east to other datacenters while we investigate the issue.

30th Oct 02:36 PM

Update at 14:30 (UTC): All regions are now affected by the present issues due to cascading failures. We are very sorry for the inconvenience and are trying to restore normal service as fast as possible.

30th Oct 05:09 PM

Update at 16:03 (UTC): All regions are now stable and error rates have dropped dramatically. We will continue to inspect each region manually to ensure any remaining issues are fully resolved.

30th Oct 07:00 PM

Update at 17:30 (UTC): Error rates have been consistently back to normal in all regions. We believe all operations are back to normal. We will be completing a full post mortem now to understand what caused the global disruption, and importantly what caused the ripple effect from one region to another.

31st Oct 11:37 AM

Update at 10:30 (UTC) on 31 October: Our focus today is on addressing the immediate issues we identified yesterday that caused the significant increase in error rates and latencies globally. While a post mortem is being prepared and will be published in due course, our priority at present is to resolve the underlying issue and prevent a repeat of the incident.

We are currently rolling out an update globally in all regions and expect this will take a few hours to complete.

2nd Nov 01:01 PM

We have published a summary of the incident affecting the Ably production cluster on 30 October 2018, which includes our preliminary investigation and conclusions.

Please see https://gist.github.com/mattheworiordan/2abab206ee4e4859010da0375bcf4b1d for the report.

As mentioned in the report, we apologize to all our customers for the disruption this caused, and we are doing everything we can to ensure we are better equipped to deal with incidents more quickly and without disruption in future.

If you have any questions, please do get in touch with us at https://www.ably.io/contact

