Incident log archive

All times are shown in UTC

July 2018

31st July 2018 09:58:20 PM

eu-west-1 spike

In the last 4 minutes we have seen a huge spike in load on all frontend and core servers in EU West. This has resulted in the nodes going into siege mode so that they can recover, whilst traffic has been sent to other regions temporarily.

We'll update this incident shortly.

31st Jul 10:13 PM

Within 4 minutes the issue resolved itself and error rates returned to normal.

It appears a routine Cassandra upgrade triggered this issue, however we are not yet clear on the root cause of the problem. We will continue to investigate.



June 2018

26th June 2018 05:00:37 PM

Unplanned website database migration

At 17:00 a necessary and unplanned migration was performed on our primary website database causing the website to be unavailable for 1-2 minutes.

26th Jun 05:04 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.


6th June 2018 11:41:12 AM

Intermittent slow requests in US West (California)

This incident was created automatically by our health check system, which has identified a fault. We are now looking into this issue.

6th Jun 11:43 AM

We have identified an issue where some HTTP requests are intermittently stalled in the load balancer in the us-west-1 region.

The issue only affects us-west-1. As a precaution we are temporarily routing traffic away from that region. During this period clients that would have used that region will mostly now end up in us-east-1 (East Virginia). Aside from a possible slight increase in latency, this should not cause any problems.
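
The failover behaviour described above can be sketched as a simple fallback loop: try the preferred region first, then the remaining regions in order. The host names and retry logic below are illustrative assumptions, not Ably's actual implementation.

```typescript
// Hypothetical sketch of region fallback. Host names are made up; a real
// client would use the provider's configured primary and fallback endpoints.
const REGION_HOSTS = [
  "us-west-1.example.com", // preferred region (currently unhealthy)
  "us-east-1.example.com", // first fallback
  "eu-west-1.example.com", // second fallback
];

async function requestWithFallback(
  path: string,
  attempt: (host: string, path: string) => Promise<string>
): Promise<string> {
  let lastError: unknown;
  for (const host of REGION_HOSTS) {
    try {
      return await attempt(host, path); // first healthy region wins
    } catch (err) {
      lastError = err; // region unhealthy: move on to the next one
    }
  }
  throw lastError; // every region failed
}
```

With us-west-1 failing, a client following this loop lands in us-east-1, matching the routing behaviour described.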

6th Jun 11:39 PM

We had hoped that AWS would be able to resolve the issue in a sensible time so that they could diagnose the underlying problem on their end. However, as this did not happen in time, we moved forward, recycled the necessary load balancers, and brought the US West region back online. All services are back to normal.


1st June 2018 01:00:00 PM

Sydney datacenter essential maintenance

Today, 1 June 2018, at 13:00 UTC we will be performing essential maintenance on our Sydney datacenter. During this time, all traffic will be routed to the nearest other Ably datacenters, which will most likely be Singapore and California.

We expect this upgrade to take 15-30 minutes, and once completed, all traffic will be routed back to Sydney.

1st Jun 01:27 PM

The maintenance is unfortunately taking longer than predicted as we are waiting on AWS to re-provision resources in that region. Once their re-provisioning is complete, the region will be back online.

1st Jun 01:48 PM

The Sydney datacenter is now back online and operating normally.



May 2018

17th May 2018 12:00:00 AM

Reactor queue intermittent issues

Whilst upgrading a number of RabbitMQ nodes, we had a hardware failure on one node. Whilst all nodes are configured in an HA configuration, we had some issues introducing new nodes into the system, which caused availability issues for some queues for short periods of up to 30 seconds. This happened a few times between 12:00 UTC and 1:20 UTC.

The issues are now fully resolved.



April 2018

30th April 2018 07:13:09 AM

Website availability issues

Some of the nodes serving the website failed at 7:13 today.

30th Apr 07:28 AM

Website issues were resolved by an automated restart of the Heroku dyno. The core realtime system does not use Heroku and was unaffected.


19th April 2018 11:34:29 PM

Heavy load in us-east-1 has caused an increase in latencies

We have seen a significant and sudden increase in load in us-east-1 since 23:30 UTC today.

We are manually over provisioning the capacity in this region and will be investigating the cause of this sudden increase in load. Once identified, we'll look into how we can more effectively prepare for spikes like this.

20th Apr 12:06 AM

Unfortunately after provisioning more capacity, a second wave of traffic has arrived.
We are provisioning again.

20th Apr 12:26 AM

We have identified the customer who is largely responsible for this traffic spike and have resolved the issue in the affected regions. Over the next few days we'll be adding additional limits to prevent usage patterns like this impacting other accounts in the multi-tenanted cluster.
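
Per-account limits of the kind mentioned above are commonly implemented as a token bucket: each account accrues tokens at a steady rate up to a burst ceiling, and a request is admitted only if a token is available. This is a generic sketch, not Ably's actual limiter.

```typescript
// Generic token-bucket limiter sketch for a multi-tenanted system.
// Each account refills at `ratePerSec` tokens per second, up to `burst`.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private ratePerSec: number, private burst: number, now = 0) {
    this.tokens = burst; // start full so short bursts are allowed
    this.lastRefill = now;
  }

  // `now` is a timestamp in seconds, injected so the logic is testable.
  allow(now: number): boolean {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1; // spend a token for this request
      return true;
    }
    return false; // over limit: reject or queue the request
  }
}
```

A sudden traffic wave from one account then exhausts only that account's bucket, leaving other tenants in the cluster unaffected.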


19th April 2018 03:57:15 PM

Temporary increase in error rates in eu-west-1-a

Between 15:57 UTC and 16:02 UTC the EU West 1 (Ireland) cluster experienced a sudden increase in load which caused error rates to climb temporarily in this region.

We believe all HTTP and connection requests should have been retried automatically in other regions using our fallback capabilities.


6th April 2018 10:30:08 PM

Intermittent increased latencies in both US regions

We are seeing intermittent higher latencies in both US regions at present which is affecting a small number of publishes. The intermittent latency delays we are seeing are in the range of 50-500ms.

We are investigating the root cause of the issue and are actively working to resolve this.

11th Apr 09:16 PM

In this case the latencies returned to normal after a redeploy of affected instances. The underlying cause is believed to be a recent regression that leaks object references and causes certain data structures to grow over time. A fix is being prepared and will be deployed in due course.


5th April 2018 10:40:00 AM

High error rates in eu-central-1 starting 10:40 UTC

Since 10:40 we are seeing elevated error rates in the eu-central-1 (Frankfurt) data centre, and are investigating.

All other regions are operating normally.

5th Apr 11:24 AM

Since 10:50, other regions (in particular us-east and eu-west) are experiencing increased latencies as a result of problems in eu-central.

We have now shut down the eu-central-1 region and are redirecting traffic to eu-west-1.

5th Apr 11:51 AM

Error rates have returned to normal and we are continuing to investigate the root cause of the issue that arose in Frankfurt (eu-central-1).

5th Apr 12:17 PM

The cluster has been stable so we consider this issue resolved. We will now continue to investigate the root cause and conduct a post-mortem of the issue.



March 2018

26th March 2018 11:06:40 PM

Website unavailable during unexpected Heroku API maintenance

Heroku's API is unavailable due to unexpected maintenance which has resulted in our website being unavailable as this is hosted with Heroku.

We are looking into a solution, which is challenging given the API is down.

See https://status.heroku.com/incidents/1459

26th Mar 11:26 PM

Please note that whilst our website is unavailable, this has absolutely no effect on our realtime platform which continues to run without fault.

26th Mar 11:43 PM

The Heroku API has come back online and all services are operating normally again.



February 2018

20th February 2018 06:00:00 PM

Pusher adapter connections over SSL incorrectly rejected

For the last few days, a bug related to protocol header forwarding resulted in connections to the Pusher translator over SSL (that is, where `encrypted: true` was set in the Pusher client lib constructor) to be incorrectly rejected with the following error: "Invalid use of Basic authentication over non-TLS transport (Ably error 40103)".
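
The failure mode described is consistent with a proxy losing the forwarded-protocol header, so that a connection which was TLS at the edge looks like plaintext to the backend's auth check. The header name and check logic below are illustrative assumptions, not a verbatim reconstruction of the actual code.

```typescript
// Hypothetical reconstruction of the auth check: basic auth is only allowed
// over TLS. When a load balancer terminates TLS, the backend must rely on a
// forwarded header (commonly X-Forwarded-Proto) to know the original scheme.
interface IncomingRequest {
  headers: Record<string, string>;
  directlyOverTls: boolean; // true only if TLS terminates at this node
}

function checkBasicAuth(req: IncomingRequest): { ok: boolean; code?: number } {
  const forwardedProto = req.headers["x-forwarded-proto"];
  const isTls = req.directlyOverTls || forwardedProto === "https";
  // If header forwarding is broken, isTls is false even for clients that
  // connected with `encrypted: true`, and the request is wrongly rejected
  // with Ably error 40103.
  return isTls ? { ok: true } : { ok: false, code: 40103 };
}
```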

We apologise for the length of time it took for this to be fixed and will be instituting steps to make sure it doesn't happen again. We invite affected customers to get in touch with us at support@ably.io .


5th February 2018 03:39:00 PM

Increased latencies in ap-southeast-2 since 15:39 UTC

We are investigating increased latencies in ap-southeast-2 since 15:39 UTC

Update: Latencies and error rates are back to normal as of 15:53. Regions outside of ap-southeast-2 should have been relatively unaffected except when communicating with clients connected to ap-southeast-2. The cause was a new version of our backend, which showed no issues in CI or staging environments. We will update this incident once we discover the root cause.



January 2018

29th January 2018 09:28:32 PM

Increased latencies us-east-1-a

We have seen latencies rise twice today for a period of roughly 2 minutes.

We have identified the root cause of the problem and will aim to roll out a fix to reduce the effect of noisy neighbours on all customers.


21st January 2018 10:15:00 AM

Elevated error rates globally

We experienced increased error rates across the production cluster starting at 10:15 UTC today, due to failing instances in us-west. The impact was mainly on the us-west region but certain other apps and accounts were affected globally.

New instances were brought online at 10:28 and performance and error rates were back to normal at 10:30.

21st Jan 07:28 PM

Following a review of the impact, we can see that error rates elevation was most prevalent in us-west-1 (California), however we unfortunately also experienced some intermittent increased error rates in us-east-1 (North Virginia) and ap-southeast-2 (Sydney) during this time.

