
Incident log archive

All times are shown in UTC

June 2017

2nd June 2017 04:34:00 PM

Performance issues in us-east

We are experiencing performance issues in us-east-1. We are investigating. All other regions are unaffected.

Edit: advisory expanded to all regions

2nd Jun 05:14 PM

The issue is now resolved.

We identified the root cause as memory starvation on an instance; however, this in itself should have caused little or no disruption whilst our health services recovered from the problem.

The continued high error rates were unfortunately down to a problem in the Gossip protocol. Nodes that had already been disposed of by the health services remained indefinitely pending in the ring, i.e. the other nodes in the cluster still reported them as present. The root cause of this issue is therefore a Gossip fault, which we are investigating now.
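
For background on this class of fault, the sketch below is a generic illustration of gossip-style membership bookkeeping in TypeScript. It is not Ably's implementation, and every name and timeout in it is invented for the example. The point it illustrates is that a node which has been disposed of must be marked dead and eventually pruned from the ring; if that state change never propagates or is never acted upon, the entry lingers as a "ghost" that peers keep reporting as present.

```typescript
// Generic sketch of gossip-style membership bookkeeping (illustrative only).
type Status = 'alive' | 'dead';

interface MemberState {
  heartbeat: number;   // counter incremented by the member itself
  lastUpdated: number; // local timestamp of the last change we observed
  status: Status;
}

class MembershipView {
  private members = new Map<string, MemberState>();

  constructor(private failureTimeoutMs: number, private pruneAfterMs: number) {}

  // Merge a peer's view: a higher heartbeat wins, and a 'dead' report
  // overrides an 'alive' one so that removals propagate through the cluster.
  merge(remote: Map<string, MemberState>, now: number): void {
    for (const [id, theirs] of remote) {
      const ours = this.members.get(id);
      if (
        !ours ||
        theirs.heartbeat > ours.heartbeat ||
        (theirs.status === 'dead' && ours.status === 'alive')
      ) {
        this.members.set(id, { ...theirs, lastUpdated: now });
      }
    }
  }

  // Mark members whose heartbeats have stalled as dead, then prune dead
  // entries after a grace period. If this step never happens, or the 'dead'
  // marker is never gossiped, a disposed node stays in the ring indefinitely
  // -- the "ghost" scenario described above.
  sweep(now: number): void {
    for (const [id, state] of this.members) {
      if (state.status === 'alive' && now - state.lastUpdated > this.failureTimeoutMs) {
        this.members.set(id, { ...state, status: 'dead', lastUpdated: now });
      } else if (state.status === 'dead' && now - state.lastUpdated > this.pruneAfterMs) {
        this.members.delete(id);
      }
    }
  }

  liveMembers(): string[] {
    return [...this.members].filter(([, s]) => s.status === 'alive').map(([id]) => id);
  }
}
```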

Resolved

less than a minute

May 2017

31st May 2017 03:03:13 PM

us-east-1 traffic spike

We are seeing a tremendous spike in traffic arriving at us-east-1. Some of those requests are being re-routed automatically to other regions.

We're initially adding more capacity to address the issue whilst we investigate.

31st May 03:17 PM

The traffic spike subsided within 5 minutes. We'll ensure there is a generous amount of extra capacity in place in this region for 24 hours whilst we investigate where this traffic originated.

Resolved

1 minute
23rd May 2017 06:00:16 PM

us-east-1 slow requests

We are seeing a higher rate of slow requests in our US East 1 datacenter. Until we understand the underlying cause, we are temporarily adding more capacity to mitigate the problem.

23rd May 10:33 PM

We identified a number of nodes with memory exhaustion. We fixed the issue quickly and are now investigating the root cause.

Resolved

less than a minute
22nd May 2017 08:52:04 AM

Minor network disruption in us-west-1

Our external monitoring systems reported a very brief period of network connectivity issues in our US West 1 datacenter. The issue resolved itself within a few minutes.

Resolved

less than a minute
16th May 2017 10:19:42 PM

Abrupt exits in Asia causing disruption primarily in Asia

A set of nodes in Asia have exited abruptly, causing significant disruption to channels in Asia. We are also seeing a knock-on effect in other regions due to roles constantly migrating around the cluster to compensate for the abruptly exiting nodes.

We are investigating and attempting to resolve the abrupt exits in Asia first so that a normal service can be restored.

We will then review the root cause.

16th May 10:41 PM

We have restored stability in Asia, however we are seeing gossip instability in other regions. We continue to investigate the root cause, but our priority remains to restore stability in other regions. We are recycling unstable nodes in other regions to resolve the issue.

16th May 11:03 PM

Stability of the global realtime service has been restored in full.

We are now investigating the root cause of the gossip instability to understand why a consensus was not reached soon after the abrupt exits in Asia. We'll also be investigating the cause of the abrupt exits we saw in Asia.

Once our investigation is complete, we'll be writing up a post mortem on the issues this evening.

17th May 02:01 PM

Upon inspection of the logs from the time of the instability, it's clear we have a bug in our Gossip service that resulted in ghost entries remaining in the ring for nodes that were in fact gone. We are investigating a few avenues which should lead to fixes soon.

Resolved

about 15 hours
10th May 2017 11:25:00 AM

Further performance issues and timeouts

We have been seeing relatively high rates of timeout errors in all datacenters starting at 11:25 UTC; we are investigating.

10th May 01:54 PM

The error rates have returned to normal. This was due to a new protocol deploy that should have been compatible, but unfortunately introduced unexpected communication issues.

Resolved

1 minute
10th May 2017 01:12:00 AM

Performance issues and timeouts

A number of nodes in the system became unstable on our gossip ring at 1:12am and were unable to resolve the issue automatically. This has resulted in an increased number of timeout errors, as some nodes have a different view of the gossip ring from the remaining nodes in the system.

We are looking into the issue.

10th May 08:22 AM

As of 08:18 UTC, the issue has been resolved in all datacenters apart from eu-central-1. We are still investigating higher than normal error rates in eu-central-1.

10th May 10:01 AM

The issues in eu-central-1 are now resolved in full. We will provide a full update on the root cause and the action we've taken soon.

Resolved

25 minutes

April 2017

6th April 2017 10:07:22 PM

eu-west-1-a spike

Our automated systems detected a spike in an eu-west-1 datacentre that was abrupt, resulted in very slow responses, and was resolved within around 4 minutes.

Resolved

about 1 month
6th April 2017 08:39:07 AM

Abrupt instance crash & ring instability

We have been notified of an abrupt instance crash in Singapore. Normally this has no side effects other than a small delay whilst all nodes agree that the node has gone, and as our design ensures that data is stored in at least two locations, no data is lost.

However, we are seeing some cluster instability in that the nodes appear to be unable to agree with each other on which nodes are active or not.

We are investigating the root cause and can see error rates have increased somewhat (1-2%) for some customers.

6th Apr 09:14 AM

We redeployed the instances that were unable to come to consensus on the ring. This resolved the issue.

We have, however, raised an internal ticket to investigate why the ring was unable to reach stability within 10-15s at most, which is what we expect when a node is abruptly terminated.

Resolved

about 1 month
3rd April 2017 12:56:10 PM

Cluster partitioning

We are seeing cluster instability reported again, following a 10-minute period this morning when the same thing happened.

We are investigating further why this has happened twice now in one day.

3rd Apr 01:09 PM

It appears that the instability was caused by genuine network partitioning issues again.

We will continue to keep an eye on network issues.

Resolved

about 1 month
3rd April 2017 08:37:25 AM

Cluster partitioning

We are currently experiencing an issue where each region is unable to reach consensus on which nodes are in the system. This is typically caused by network partitioning.

We are investigating the root cause now.

3rd Apr 08:56 AM

We are no longer seeing any network partitioning and the cluster is reported as stable in all regions.

Resolved

about 1 month

March 2017

21st March 2017 02:13:00 PM

Slow responses and high error rates in us-west-1

We are seeing similar problems to this morning's incident, mostly in us-west-1 but liable to affect anyone, starting at 14:13.

21st Mar 03:42 PM

Service is back to normal for almost everyone, except for some lingering issues limited to us-west-1.

21st Mar 04:08 PM

Everything is back to normal.

29th Mar 05:00 AM

POST MORTEM

The underlying cause of this issue was an upgrade to an upstream RPC library we use called gRPC (a Google-maintained open source RPC library, see http://www.grpc.io/). Our strategy when upgrading upstream libraries is to roll forward to a stable release, then run that library in our CI, staging and sandbox environments for some time to monitor the impact. If everything is stable and our test suites are passing, we then proceed to roll out to production.

In this instance, we followed our normal procedure, rolled out to production, and everything at first appeared to be running fine. Unfortunately, this was not the case: a native memory leak in gRPC v1.2 caused the memory usage of all nodes to increase over time, until eventually we saw consistent failures across all nodes in the cluster, globally, at different times. Redeploying all instances resolved the immediate problem, but the issue simply recurred later once the memory leak again starved the instances of memory.

We eventually isolated the root cause as being that version of gRPC, reverted to version 1.1, and rolled this out globally.

WHAT WE COULD HAVE DONE BETTER

It is very difficult to balance keeping our stack up to date against minimising the impact of changes in libraries we do not maintain ourselves. However, it's clear we need procedures in place to measure the impact of a change to an upstream library in a production environment. Unfortunately, the load in our staging and sandbox environments can hide faults like this, which only become obvious in production. Therefore, in future we will try to roll out upstream library changes in only one of our regions and compare that region with all others carefully over a substantial period of time before rolling out more widely. This should help to minimise the impact of a faulty upstream library.
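
As a rough illustration of the kind of gate this implies (hypothetical names, metrics and thresholds; not a description of our actual tooling), a canary-region rollout might compare error rates and latency in the canary region against the rest of the cluster before widening the deployment:

```typescript
// Hypothetical sketch of a canary-region gate for upstream library upgrades.
// The metrics source and thresholds are invented for illustration.
interface RegionMetrics {
  region: string;
  requests: number;
  errors: number;
  p95LatencyMs: number;
}

// Decide whether the canary region (running the upgraded library) looks
// healthy enough, relative to the baseline regions, to widen the rollout.
function canaryLooksHealthy(
  canary: RegionMetrics,
  baseline: RegionMetrics[],
  maxErrorRateRatio = 1.5, // canary may show at most 1.5x the baseline error rate
  maxLatencyRatio = 1.3    // and at most 1.3x the baseline p95 latency
): boolean {
  const baselineErrors = baseline.reduce((sum, r) => sum + r.errors, 0);
  const baselineRequests = baseline.reduce((sum, r) => sum + r.requests, 0);
  const baselineErrorRate = baselineErrors / Math.max(baselineRequests, 1);
  const baselineP95 =
    baseline.reduce((sum, r) => sum + r.p95LatencyMs, 0) / Math.max(baseline.length, 1);

  const canaryErrorRate = canary.errors / Math.max(canary.requests, 1);

  return (
    canaryErrorRate <= baselineErrorRate * maxErrorRateRatio &&
    canary.p95LatencyMs <= baselineP95 * maxLatencyRatio
  );
}
```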

5th Apr 01:40 AM

UPDATE

Given the severity of the gRPC issue, we have now helped identify where the leak was introduced into the gRPC code base, have created a reproducible set of tests for the gRPC team to identify and fix the problem, and also have a PR waiting to land that will help to prevent upstream gRPC leaks going undetected. See https://github.com/grpc/grpc/issues/10445 for more info.
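
As a further illustration (again hypothetical, and not the reproduction we supplied to the gRPC team), a coarse leak check can run a representative workload in a loop and fail if resident memory keeps growing well past a warm-up baseline:

```typescript
// Hypothetical leak check: run a workload repeatedly and verify that RSS
// growth flattens out after a warm-up period. Names and thresholds are
// illustrative only.
async function detectLeak(
  workload: () => Promise<void>,
  iterations = 50,
  warmup = 10,
  maxGrowthBytes = 50 * 1024 * 1024 // allow ~50 MiB of drift after warm-up
): Promise<boolean> {
  let baselineRss = 0;
  const maybeGc = (globalThis as { gc?: () => void }).gc;
  for (let i = 0; i < iterations; i++) {
    await workload();
    if (maybeGc) maybeGc(); // needs `node --expose-gc` for stable readings
    const rss = process.memoryUsage().rss;
    if (i === warmup) baselineRss = rss;
    if (i > warmup && rss - baselineRss > maxGrowthBytes) {
      return true; // RSS keeps climbing well past the warm-up baseline
    }
  }
  return false; // memory usage flattened out: no leak detected by this check
}
```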

Resolved

about 2 months
21st March 2017 07:20:00 AM

Slow responses in some regions

Since 07:20 UTC, some customers particularly in the eu-west-1 region have been experiencing slow responses and elevated error rates (particularly to REST requests such as requestToken). We are investigating.

21st Mar 11:47 AM

Expanding incident to cover all regions. While the problem was in eu-west-1, customers in all regions may have been affected.

21st Mar 12:03 PM

Error rates and response times are back to normal.

29th Mar 05:01 AM

Please see https://status.ably.io/incidents/433 for a post mortem covering what happened to cause this disruption, how we resolved the issue, and how we are looking to ensure changes like this don't have such a wide impact on customers in future.

Resolved

about 2 months
14th March 2017 03:13:26 PM

us-west-1 DNS issues for customers locked to us-west-1 region

Amazon Route53 is largely unable to propagate DNS updates at present, which has resulted in old DNS records for our us-west-1 region remaining in place when they should have been updated to point at our new load balancers in this region.

This issue is not affecting the global cluster, nor customers who are simply routed to that datacenter because it is closest to them. However, if any customers have locked their clients to the us-west-1 region with the ClientOption "environment" set to "us-west-1", then unfortunately those clients are currently unable to connect to that datacenter.

We never recommend that customers lock into a single region. If you are currently locked into the us-west-1 region, please remove this region lock from your ClientOptions.
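
For clarity, here is a minimal sketch of the difference, assuming the JavaScript/TypeScript client library (the `ably` npm package) and an illustrative placeholder key:

```typescript
import * as Ably from 'ably';

// Region-locked client (not recommended): every connection is pinned to
// us-west-1 and cannot be routed to other regions.
const locked = new Ably.Realtime({ key: 'APP_KEY_PLACEHOLDER', environment: 'us-west-1' });

// Recommended: omit `environment` so the client uses the default global
// endpoints and can be served by the healthiest region.
const client = new Ably.Realtime({ key: 'APP_KEY_PLACEHOLDER' });
```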

14th Mar 04:57 PM

Amazon resolved the Route53 issues allowing updates to be made.

us-west-1 is operating normally now.

Resolved

about 2 months
13th March 2017 06:30:00 PM

High error rates in ap-southeast-2 (Sydney)

We are seeing elevated error rates from the ap-southeast-2 (Sydney) datacenter, starting at 18:30 UTC. We are investigating.

13th Mar 07:59 PM

An additional fault has been detected in us-east-1-a

13th Mar 07:59 PM

An additional fault has been detected in sa-east-1-a

13th Mar 07:59 PM

An additional fault has been detected in us-west-1-a

13th Mar 07:59 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

13th Mar 07:59 PM

An additional fault has been detected in us-west-2-a

13th Mar 08:41 PM

The issue is now resolved, error rates are back to normal globally.

Resolved

about 2 months