
Incident log archive

All times are shown in UTC

October 2017

3rd October 2017 04:53:00 PM

Performance issues in us-east-1

We are experiencing performance issues in the US East Virginia region due to unexpectedly increased load. More capacity is being automatically brought in to address this; until it comes online you may experience higher than normal latencies in this region.

As of 17:05, latencies are back to normal.

Resolved

1 minute

September 2017

24th September 2017 06:48:18 PM

Persistence issues

A spike in load in our persistence layer has resulted in timeouts.

Whilst load in our persistence layer is now low, some timeouts are still being reported. We are investigating the root cause.

24th Sep 07:20 PM

The timeout issues are resolved.

We'll continue to investigate the root cause of the timeouts that continued after the spike subsided.

Resolved

18 minutes
14th September 2017 06:52:00 PM

Issues in us-west-1 (N California)

We are investigating high error rates in the us-west-1 (North California) region.

Error rates are back to normal in all regions. There appear to have been transient network issues from 18:52-18:59 UTC.

Resolved

about 1 hour
3rd September 2017 08:27:47 PM

Our automated health check system has reported an issue with realtime cluster health in ap-southeast-1-a

This incident was created automatically by our health check system after it identified a fault. We are now looking into this issue.

3rd Sep 08:28 PM

Our health check system has reported this issue as resolved.

6th Sep 05:11 PM

The issue appears to have been caused by a brief loss of reliable connectivity in this region. The issue resolved itself within 1 minute.

Resolved

2 days

August 2017

29th August 2017 12:00:00 AM

Propagation of changes to app settings may be delayed

In some cases, immediate propagation of changes to app settings (for example, new or changed webhooks, queue rules, namespaces and so on) is not occurring. The changes have happened, but a caching layer is not being properly notified. We believe we have found the cause and should have a solution in place by tomorrow.

Update 2017-08-31T11:00Z: we have now rolled out a fix for this.

Resolved

less than a minute
23rd August 2017 03:35:00 PM

Elevated error rates

15:35 UTC A faulty deploy in ap-southeast-2 is causing higher than normal error rates worldwide. We are reverting the deploy.

16:23 UTC Error rates are back to normal in all regions.

Resolved

1 minute
22nd August 2017 04:14:14 AM

Our automated health check system has reported an issue with realtime cluster health in eu-central-1-a

This incident was created automatically by our health check system after it identified a fault. We are now looking into this issue.

22nd Aug 04:14 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

22nd Aug 06:59 AM

There was a momentary loss of connectivity in the EU Central region. It lasted for approximately 1-2 minutes and resolved itself.

Resolved

less than a minute

July 2017

23rd July 2017 11:00:00 AM

High latencies and error rates in EU West

We are seeing intermittent high latencies in our EU West datacenter.

We are investigating the root cause of the issue now. In the meantime, we are adding more capacity to the region.

23rd Jul 02:40 PM

We identified that the root cause of the issue was an internal monitoring system that had gone rogue and was generating significant load in the cluster.

We have disabled that specific monitoring system and will roll out a fix before turning that system back on.

All services are now back to normal.

Resolved

5 minutes
18th July 2017 06:08:01 PM

Higher latencies in US-East-1

We are seeing growing latencies in our US East (North Virginia) data center with latencies being reported up to twice as high as normal.

This latency increase is linked to a recent deploy today.

We are investigating the root cause and will soon bring on additional capacity to mitigate the issue whilst we resolve the underlying problem.

18th Jul 09:21 PM

At 18:40 latencies returned to normal in us-east-1 following the introduction of additional capacity into the cluster.

18th Jul 09:22 PM

The issue was resolved by rolling back to a previous version of the realtime system. We will be trying to reproduce this issue in our lab environments now before rolling this out in production again.

Resolved

1 minute
5th July 2017 02:27:00 PM

Performance issues

From 14:27 UTC we are seeing performance issues across the cluster on some apps and channels, and are investigating.

Update: The incident was resolved 14:51 UTC. The cause was a configuration issue combined with a transient inter-region netsplit.

Resolved

3 minutes

June 2017

2nd June 2017 04:34:00 PM

Performance issues in us-east

We are experiencing performance issues in us-east-1. We are investigating. All other regions are unaffected.

Edit: advisory expanded to all regions

2nd Jun 05:14 PM

The issue is now resolved.

We identified the root cause as memory starvation on an instance; however, this in itself should have caused little or no disruption whilst our health services remediated the problem.

The continued high error rates were unfortunately down to a problem in the Gossip protocol: nodes remained indefinitely pending even though they no longer existed, i.e. they had been disposed of by the health services, yet other nodes in the cluster still reported them as present. The root cause of this issue is therefore a Gossip fault, which we are investigating now.

Resolved

less than a minute

May 2017

31st May 2017 03:03:13 PM

us-east-1 traffic spike

We are seeing a tremendous spike in traffic arriving at us-east-1. Some of those requests are being re-routed automatically to other regions.

We're initially adding more capacity to address the issue whilst we investigate.

31st May 03:17 PM

The traffic spike subsided within 5 minutes. We'll ensure there is a generous amount of extra capacity in place in this region for 24 hours whilst we investigate where this traffic originated.

Resolved

1 minute
23rd May 2017 06:00:16 PM

us-east-1 slow requests

We are seeing a higher rate of slow requests in our US East 1 datacenter. Until we understand the underlying cause, we are temporarily adding more capacity to fix the problem.

23rd May 10:33 PM

We identified a number of nodes with memory exhaustion. We fixed the issue quickly and are now investigating its root cause.

Resolved

less than a minute
22nd May 2017 08:52:04 AM

Minor network disruption in us-west-1

Our external monitoring systems reported a very brief period of network connectivity issues in our US West 1 datacenter. The issue resolved itself within a few minutes.

Resolved

less than a minute
16th May 2017 10:19:42 PM

Abrupt exits in Asia causing disruption primarily in Asia

A set of nodes in Asia have exited abruptly, causing significant disruption to channels in Asia. We are also seeing a knock-on effect in other regions due to roles constantly migrating around the cluster to compensate for the abruptly exiting nodes.

We are investigating and attempting to resolve the abrupt exits in Asia first so that a normal service can be restored.

We will then review the root cause.

16th May 10:41 PM

We have restored stability in Asia, however we are seeing gossip instability in other regions. We continue to investigate the root cause, but our priority remains to restore stability in other regions. We are recycling unstable nodes in other regions to resolve the issue.

16th May 11:03 PM

Global realtime service stability has been fully restored.

We are now investigating the root cause of the gossip instability to understand why consensus was not reached soon after the abrupt exits in Asia. We'll also be investigating the cause of the abrupt exits we saw in Asia.

Once our investigation is complete, we'll be writing up a post mortem on the issues this evening.

17th May 02:01 PM

Upon inspection of the logs from the time of the instability, it's clear we have a bug in our Gossip service that resulted in ghost nodes remaining in the ring even though they were in fact gone. We are investigating a few avenues which should lead to fixes soon.

Resolved

about 15 hours