All times are shown in UTC
We have been alerted to higher than normal latencies due to a capacity issue in us-east-1, which we are working to resolve.
Update 15:51 UTC: We have severe capacity issues worldwide due to a sudden inability to bootstrap new instances. We are working to fix this as soon as possible.
We have identified the underlying issue: a dependency on a third-party system that, by design, should not have impacted our ability to add capacity, but did due to an internal bug. We have applied hot fixes to all environments and rolled them out globally. Error rates are dropping rapidly and latencies are reducing; however, there are still some residual issues that we are resolving manually.
3rd Jan 05:33 PM
We are still experiencing issues in us-east-1, which are causing higher than normal error rates in that region. We believe the issue is caused by an unstable gossip node reporting inconsistent ring states to the cluster.
3rd Jan 06:36 PM
Having identified that the issue is related to gossip and ring state inconsistencies, we are rolling out new gossip nodes across every region, which is rapidly resolving the issues.
3rd Jan 06:47 PM
We have now stabilised the gossip and ring state globally, and error rates have reduced dramatically. There are a few nodes that are still emitting channel errors, which we are investigating.
3rd Jan 08:45 PM
Latencies and error rates are back to normal in all regions.
We sincerely apologise to customers who were affected by the incident, and will be posting a post-mortem once the investigation has completed.
9th Jan 12:21 AM
We have completed the investigation of this incident and have written up a full post-mortem at https://gist.github.com/paddybyers/3e215c0aa0aa143288e4dece6ec16285.
Any customers who have any questions or would like to discuss this incident should get in touch with the support and sales team at https://www.ably.io/support.
Once again, we sincerely apologise to customers who were affected by this incident. We are doing everything we can to learn from this incident and ensure that the service remains fully operational moving forwards.
Resolved in about 5 hours