Incident log archive

All times are shown in UTC

May 2018

17th May 2018 12:00:00 AM

Reactor queue intermittent issues

Whilst upgrading a number of RabbitMQ nodes, we had a hardware failure on one node. Whilst all nodes are configured in an HA configuration, we had some issues introducing new nodes into the system, which made some queues unavailable for short periods of up to 30 seconds. This happened a few times between 12:00 UTC and 1:20 UTC.

The issues are now fully resolved.


in about 1 hour

April 2018

30th April 2018 07:13:09 AM

Website availability issues

Some of the nodes serving the website failed at 7:13 today.

30th Apr 07:28 AM

Website issues were resolved by an automated restart of the Heroku dyno. The core realtime system does not use Heroku and was unaffected.


in 24 minutes
19th April 2018 11:34:29 PM

Heavy load in us-east-1 has caused an increase in latencies

We have seen a significant and sudden increase in load in us-east-1 since 23:30 UTC today.

We are manually over-provisioning capacity in this region and will be investigating the cause of this sudden increase in load. Once identified, we'll look into how we can more effectively prepare for spikes like this.

20th Apr 12:06 AM

Unfortunately, after provisioning more capacity, a second wave of traffic has arrived.
We are provisioning again.

20th Apr 12:26 AM

We have identified the customer who is largely responsible for this traffic spike and have resolved the issue in the affected regions. Over the next few days we'll be adding additional limits to prevent usage patterns like this impacting other accounts in the multi-tenanted cluster.


in 24 minutes
19th April 2018 03:57:15 PM

Temporary increase in error rates in eu-west-1-a

Between 15:57 UTC and 16:02 UTC the EU West 1 (Ireland) cluster experienced a sudden increase in load which caused error rates to climb temporarily in this region.

We believe all HTTP and connection requests should have been retried automatically in other regions using our fallback capabilities.
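As an illustration only, the kind of client-side fallback described above can be sketched as follows. This is a minimal sketch, not the client library implementation: the host names, the `send` callable and the `AllHostsFailed` exception are all hypothetical, and a real client would only retry errors it knows to be retryable.

```python
from typing import Callable, Sequence


class AllHostsFailed(Exception):
    """Raised when the primary host and every fallback host have failed."""


def request_with_fallback(hosts: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each host in order and return the first successful response.

    `send` performs a single request against a single host and raises on failure.
    """
    errors = []
    for host in hosts:
        try:
            return send(host)
        except Exception as exc:  # assumption: every failure is treated as retryable
            errors.append((host, exc))
    raise AllHostsFailed(errors)


# Hypothetical host names, for illustration only.
HOSTS = ["eu-west-1.example.test", "fallback-a.example.test", "fallback-b.example.test"]


def fake_send(host: str) -> str:
    # Simulate the primary region returning errors while the fallbacks are healthy.
    if host.startswith("eu-west-1"):
        raise ConnectionError("simulated regional error")
    return f"ok from {host}"


print(request_with_fallback(HOSTS, fake_send))
```

With the simulated failure above, the request to the first host fails and the response comes from the first fallback host.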


in 5 minutes
6th April 2018 10:30:08 PM

Intermittent increased latencies in both US regions

We are currently seeing intermittent higher latencies in both US regions, which is affecting a small number of publishes. The additional latency we are seeing is in the range of 50-500ms.

We are investigating the root cause of the issue and are actively working to resolve this.

11th Apr 09:16 PM

In this case the latencies returned to normal after a redeploy of affected instances. The underlying cause is believed to be a recent regression that leaks object references and causes certain data structures to grow over time. A fix is being prepared and will be deployed in due course.


in 5 days
5th April 2018 10:40:00 AM

High error rates in eu-central-1 starting 10:40 UTC

Since 10:40 we have been seeing elevated error rates in the eu-central-1 (Frankfurt) data centre, and are investigating.

All other regions are operating normally.

5th Apr 11:24 AM

Since 10:50, other regions (in particular us-east and eu-west) are experiencing increased latencies as a result of problems in eu-central.

We have now shut down the eu-central-1 region and are redirecting traffic to eu-west-1.

5th Apr 11:51 AM

Error rates have returned to normal and we are continuing to investigate the root cause of the issue that arose in Frankfurt (eu-central-1).

5th Apr 12:17 PM

The cluster has been stable so we consider this issue resolved. We will now continue to investigate the root cause and conduct a post-mortem of the issue.


in about 1 hour

March 2018

26th March 2018 11:06:40 PM

Website unavailable during unexpected Heroku API maintenance

Heroku's API is unavailable due to unexpected maintenance. As our website is hosted with Heroku, it is currently unavailable as a result.

We are looking into a solution, which is challenging given the API is down.

See https://status.heroku.com/incidents/1459

26th Mar 11:26 PM

Please note that whilst our website is unavailable, this has absolutely no effect on our realtime platform which continues to run without fault.

26th Mar 11:43 PM

The Heroku API has come back online and all services are operating normally again.


in about 1 hour

February 2018

20th February 2018 06:00:00 PM

Pusher adapter connections over SSL incorrectly rejected

For the last few days, a bug related to protocol header forwarding caused connections to the Pusher translator over SSL (that is, where `encrypted: true` was set in the Pusher client lib constructor) to be incorrectly rejected with the following error: "Invalid use of Basic authentication over non-TLS transport (Ably error 40103)".

We apologise for the length of time it took for this to be fixed and will be instituting steps to make sure it doesn't happen again. We invite affected customers to get in touch with us at support@ably.io.


in 4 days
5th February 2018 03:39:00 PM

Increased latencies in ap-southeast-2 since 15:39 UTC

We are investigating increased latencies in ap-southeast-2 since 15:39 UTC.

Update: Latencies and error rates are back to normal as of 15:53. Regions outside of ap-southeast-2 should have been relatively unaffected except when communicating with clients connected to ap-southeast-2. The cause was a new version of our backend, which showed no issues in CI or staging environments. We will update this incident once we discover the root cause.


in 14 minutes

January 2018

29th January 2018 09:28:32 PM

Increased latencies in us-east-1-a

We have seen latencies rise twice today for a period of roughly 2 minutes.

We have identified the root cause of the problem and will aim to roll out a fix to reduce the effect of noisy neighbours on all customers.


in 3 minutes
21st January 2018 10:15:00 AM

Elevated error rates globally

We experienced increased error rates across the production cluster starting at 10:15 UTC today, due to failing instances in us-west. The impact was mainly on the us-west region but certain other apps and accounts were affected globally.

New instances were brought online at 10:28 and performance/error rates were back to normal at 10:30.

21st Jan 07:28 PM

Following a review of the impact, we can see that the elevation in error rates was most prevalent in us-west-1 (California). However, we unfortunately also experienced some intermittent increased error rates in us-east-1 (North Virginia) and ap-southeast-2 (Sydney) during this time.


in 15 minutes
15th January 2018 04:34:00 PM

Cassandra issues in US

Health issues with one of the nodes in US-East for Cassandra (our persistence and storage layer) are leading to performance issues, especially in US-East and US-West. They are also causing some "Cannot achieve consistency level QUORUM" errors for account/app creation or alteration actions, and for message publishes to channels on which persistence is enabled.

15th Jan 05:04 PM

One of the Cassandra nodes in US-east appears to have become partitioned from other nodes, though the realtime service was still connected to it. This resulted in any queries reaching that node failing with a "Cannot achieve consistency level QUORUM" or "Cannot achieve consistency level TWO" error. This was resolved at 16:53 by shutting down Cassandra on the affected node, since there is sufficient redundancy for the cluster to run fine with one fewer node.
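To illustrate why a partitioned node produces these errors: in Cassandra, QUORUM requires a majority of the replicas for a key (replication factor divided by two, rounded down, plus one), and TWO requires two replicas. The sketch below shows that arithmetic only. The replication factor of 3 is an assumption for illustration (the figure for this cluster is not stated here), and the helper names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for consistency level QUORUM."""
    return replication_factor // 2 + 1


def required_replicas(level: str, replication_factor: int) -> int:
    """Replicas required for the consistency levels mentioned in this incident."""
    if level == "TWO":
        return 2
    if level == "QUORUM":
        return quorum(replication_factor)
    raise ValueError(f"unsupported consistency level: {level}")


def can_achieve(level: str, replication_factor: int, reachable_replicas: int) -> bool:
    """True if a coordinator that can reach `reachable_replicas` can serve the query."""
    return reachable_replicas >= required_replicas(level, replication_factor)


RF = 3  # assumed replication factor, for illustration only

# A healthy coordinator can reach all three replicas.
print(can_achieve("QUORUM", RF, reachable_replicas=3))

# A coordinator cut off from its peers can reach at most its own copy.
print(can_achieve("QUORUM", RF, reachable_replicas=1))
print(can_achieve("TWO", RF, reachable_replicas=1))
```

Under these assumptions, a query coordinated by the isolated node cannot satisfy either QUORUM or TWO, whereas the remaining nodes can still satisfy both between themselves once the isolated node is removed.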

All services are fully back to normal; we're now investigating the root cause of the netsplit.


in 19 minutes
4th January 2018 08:24:16 PM

Website offline

A faulty deployment has caused our customer-facing website www.ably.io to go offline.

We've reverted the deployment and the website should be online again within a few minutes.

Once everything is confirmed stable, we'll investigate the root cause of the faulty deployment.

4th Jan 08:39 PM

The website is back online and operating normally. It was unfortunately offline for almost 10 minutes in total. The cause was not a faulty deployment, but rather an issue with Heroku (our hosting provider) causing restarted or redeployed apps to fail. See https://status.heroku.com/incidents/1367

4th Jan 08:50 PM

While the website is online, a stats helper app (deployed at the same time) is still experiencing issues due to the above Heroku incident, so stats may not be visible on your dashboard for the moment.

Please note the website issues have no impact whatsoever on our realtime platform; they simply limit our customers' ability to access their dashboards and stats via the website.

4th Jan 09:16 PM

All website services are now operating normally again.


in 35 minutes

November 2017

14th November 2017 12:01:26 PM

Cassandra timeouts causing disruption for history queries and some API requests

We are investigating a huge spike in Cassandra load being generated from Asia, which appears to be having an impact on global latencies and error rates.

14th Nov 12:02 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

14th Nov 12:05 PM

We are seeing a huge spike in Cassandra load being generated from Asia, which appears to be having an impact on global latencies and error rates.

14th Nov 01:28 PM

Latencies and error rates in all regions returned to normal at 13:05.

We have identified the root cause of the issue: a high-volume simulation was being run which unfortunately encountered a history API bug, creating an unsustainable amount of load on our persistence layer.


in about 2 hours
14th November 2017 09:00:00 AM

Timed out history requests since Tuesday

A small proportion of history requests since Tuesday may have hung until the client library timeout, due to a regression introduced in a deploy on Tuesday that regrettably was not caught by the history test suite. We are in the process of rolling out a fix. Sorry for any inconvenience.


in 4 days