Incident log archive

All times are shown in UTC

February 2018

5th February 2018 03:39:00 PM

Increased latencies in ap-southeast-2 since 15:39 UTC

We are investigating increased latencies in ap-southeast-2 since 15:39 UTC

Update: Latencies and error rates are back to normal as of 15:53. Regions outside of ap-southeast-2 should have been relatively unaffected except when communicating with clients connected to ap-southeast-2. The cause was a new version of our backend, which showed no issues in CI or staging environments. We will update this incident once we discover the root cause.

Resolved

3 minutes

January 2018

29th January 2018 09:28:32 PM

Increased latencies in us-east-1-a

We have seen latencies rise twice today for a period of roughly 2 minutes.

We have identified the root cause of the problem and will aim to roll out a fix to reduce the effect of noisy neighbours on all customers.

Resolved

13 minutes

21st January 2018 10:15:00 AM

Elevated error rates globally

We experienced increased error rates across the production cluster starting at 10:15 UTC today, due to failing instances in us-west. The impact was mainly on the us-west region, but certain other apps and accounts were affected globally.

New instances were brought online at 10:28, and performance and error rates were back to normal at 10:30.

21st Jan 07:28 PM

Following a review of the impact, we can see that the elevation in error rates was most prevalent in us-west-1 (California). Unfortunately, we also experienced some intermittent increased error rates in us-east-1 (North Virginia) and ap-southeast-2 (Sydney) during this time.

Resolved

about 9 hours

15th January 2018 04:34:00 PM

Cassandra issues in US

Health issues with one of our Cassandra nodes in US-East (Cassandra is our persistence and storage layer) are leading to performance issues, especially in US East and US West. Some account/app creation or alteration actions, and message publishes to channels on which persistence is enabled, may fail with "Cannot achieve consistency level QUORUM" errors.

15th Jan 05:04 PM

One of the Cassandra nodes in US-east appears to have become partitioned from other nodes, though the realtime service was still connected to it. This resulted in any queries reaching that node failing with a "Cannot achieve consistency level QUORUM" or "Cannot achieve consistency level TWO" error. This was resolved at 16:53 by shutting down Cassandra on the affected node, since there is sufficient redundancy for the cluster to run fine with one fewer node.

All services are fully back to normal; we're now investigating the root cause of the netsplit.
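
For context on the QUORUM errors above: a Cassandra QUORUM operation needs responses from a strict majority of the replicas for the data, that is floor(RF / 2) + 1, where RF is the replication factor. The sketch below illustrates only that arithmetic. The replication factor of 3 and the function names are assumptions chosen for illustration and do not describe our actual cluster configuration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM operation: a strict majority."""
    return replication_factor // 2 + 1


def can_achieve_quorum(replication_factor: int, reachable_replicas: int) -> bool:
    """True if the coordinating node can reach enough replicas for QUORUM."""
    return reachable_replicas >= quorum(replication_factor)


# Assumed replication factor of 3 (hypothetical, for illustration only).
RF = 3

# A partitioned coordinator that can reach only itself cannot satisfy QUORUM.
print(can_achieve_quorum(RF, reachable_replicas=1))  # False

# With the partitioned node shut down, a healthy coordinator still reaches
# the two remaining replicas, so QUORUM succeeds with one fewer node.
print(can_achieve_quorum(RF, reachable_replicas=2))  # True
```

Under these assumptions, queries fail only when they are coordinated by the partitioned node, which is consistent with removing that node restoring service.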

Resolved

less than a minute

4th January 2018 08:24:16 PM

Website offline

A faulty deployment has caused our customer-facing website www.ably.io to go offline.

We've reverted the deployment and the website should be online again within a few minutes.

Once everything is confirmed stable, we'll investigate the root cause of the faulty deployment.

4th Jan 08:39 PM

The website is back online and operating normally. It was unfortunately offline for almost 10 minutes in total. The cause was not a faulty deployment, but rather an issue with Heroku (our hosting provider) causing restarted or redeployed apps to fail. See https://status.heroku.com/incidents/1367

4th Jan 08:50 PM

While the website is online, a stats helper app (deployed at the same time) is still experiencing issues due to the above Heroku incident, so stats may not be visible on your dashboard for the moment.

Please note that the website issues have no impact whatsoever on our realtime platform; they simply limit our customers' ability to access their dashboards and stats via the website.

4th Jan 09:16 PM

All website services are now back to operating normally

Resolved

about 1 hour

November 2017

14th November 2017 12:01:26 PM

Cassandra timeouts causing disruption for history queries and some API requests

We are investigating a huge spike in Cassandra load being generated from Asia, which appears to be having an impact on global latencies and error rates.

14th Nov 12:02 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

14th Nov 12:05 PM

We are seeing a huge spike in Cassandra load being generated from Asia, which appears to be having an impact on global latencies and error rates.

14th Nov 01:28 PM

Latencies and error rates in all regions returned to normal at 13:05.

We identified the root cause of the issue: a high-volume simulation was being run that hit a history API bug, which created an unsustainable amount of load on our persistence layer.

Resolved

1 minute

14th November 2017 09:00:00 AM

Timed out history requests since Tuesday

A small proportion of history requests since Tuesday may have hung until the client library timeout, due to a regression introduced in a deploy on Tuesday that regrettably was not caught by the history test suite. We are in the process of rolling out a fix. Sorry for any inconvenience.

Resolved

less than a minute

October 2017

26th October 2017 04:15:00 PM

Performance issues worldwide

We are currently experiencing increased latency worldwide due to an unexpected load imbalance. Until autoscaling brings on sufficient capacity, customers may experience reduced performance.

26th Oct 06:41 PM

We identified the underlying root cause of the increased global load and imposed limits to ensure other customers were not affected. Latencies and error rates have returned to normal as of 17:37 UTC.

Resolved

less than a minute

3rd October 2017 04:53:00 PM

Performance issues in us-east-1

We are experiencing performance issues in the US East Virginia region due to unexpectedly increased load. More capacity is being automatically brought in to address this; until it comes online you may experience higher than normal latencies in this region.

As of 17:05, latencies are back to normal.

Resolved

1 minute

September 2017

24th September 2017 06:48:18 PM

Persistence issues

A spike in load in our persistence layer has resulted in timeouts.

Whilst the load in our persistence layer is now low, there are still some reported timeouts. We are investigating the root cause.

24th Sep 07:20 PM

The timeout issues are resolved.

We'll continue to investigate the root cause of the continued timeouts after the spike subsided.

Resolved

18 minutes

14th September 2017 06:52:00 PM

Issues in us-west-1 (N California)

We are investigating high error rates in the us-west-1 (North California) region.

Error rates are back to normal in all regions. There appear to have been transient network issues from 18:52-18:59 UTC.

Resolved

about 1 hour

3rd September 2017 08:27:47 PM

Our automated health check system has reported an issue with realtime cluster health in ap-southeast-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

3rd Sep 08:28 PM

Our health check system has reported this issue as resolved.

6th Sep 05:11 PM

The issue appears to have been caused by a brief loss of reliable connectivity in this region. It resolved itself within 1 minute.

Resolved

2 days

August 2017

29th August 2017 12:00:00 AM

Propagation of changes to app settings may be delayed

In some cases, changes to app settings (for example, new or changed webhooks, queue rules, namespaces and so on) are not propagating immediately. The changes have been saved, but a caching layer is not being properly notified. We believe we have found the cause and should have a solution in place by tomorrow.

Update 2017-08-31T11:00Z: we have now rolled out a fix for this.

Resolved

less than a minute

23rd August 2017 03:35:00 PM

Elevated error rates

15:35 UTC A faulty deploy in ap-southeast-2 is causing higher than normal error rates worldwide. We are reverting the deploy.

16:23 UTC Error rates are back to normal in all regions.

Resolved

1 minute

22nd August 2017 04:14:14 AM

Our automated health check system has reported an issue with realtime cluster health in eu-central-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

22nd Aug 04:14 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

22nd Aug 06:59 AM

There was a momentary loss of connectivity in the EU Central region. It lasted for approximately 1-2 minutes and resolved itself.

Resolved

less than a minute