Incident log archive

All times are shown in UTC

March 2017

9th March 2017 02:06:42 PM

us-east-1 and us-west performance issues

We are experiencing a high rate of issues in us-east-1 due to a sudden increase in load.

Our autoscaling has provisioned more capacity and error rates are dropping. We are investigating the root cause now.

9th Mar 02:07 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

9th Mar 03:49 PM

The issue was resolved automatically by the autoscaling and health systems. However, we continue to investigate the cause of the sudden spike we experienced across the cluster at 14:02 today.


Resolved in 13 minutes.
6th March 2017 08:30:00 PM

Issues with comet streaming connections in ap-southeast and eu-west

Between around 20:30 UTC on the 6th and 12:00 on the 7th, an issue in our routing layer caused streaming comet fallback connections in some regions (mostly Asia Pacific (Singapore), but also some in Europe) to experience problems.

WebSocket connections (which are the only transport for most of our client libraries) and XHR polling connections were unaffected.

The issue has now been resolved; apologies for any inconvenience.


Resolved in about 16 hours.
1st March 2017 10:00:00 PM

Global platform upgrade

In preparation for the release of our v1.0 client libraries, we are performing a major upgrade of the platform globally.

We do not expect any downtime during this upgrade.

Please get in touch if you notice any problems at this time.

The upgrade process is scheduled to start at 22:00 on 1 March 2017 and finish no later than 02:00 GMT on 2 March 2017.

Update: this upgrade is leading to temporarily increased error rates and some performance issues in some regions. We are investigating.

Update: Increased error rates are continuing across the cluster; however, we've identified the cause and are testing a fix.

Update: We are rolling the fix out region by region; error rates are decreasing. US-east traffic is currently temporarily diverted to other regions.

2nd Mar 04:26 AM

The deployment is complete and error rates are low. We are operating fewer regions at present and will gradually start reintroducing load to all regions shortly.

2nd Mar 11:54 AM

A number of customers using the Realtime .NET library reported issues due to a bug in the .NET library, see https://github.com/ably/ably-dotnet/issues/97.

We are addressing this in two ways:

* We are rolling out a fix that will ensure the existing .NET library works in spite of the bug
* We are working on a fix for the .NET library and will release a new version soon


Resolved in about 6 hours.

February 2017

28th February 2017 07:05:34 PM

CDN: some requests being served slowly

The CDN which hosts the ably-js client library is responding to a minority of requests slowly (up to 15 seconds).

This is because it is a caching layer over Amazon S3, which is currently down, see https://status.aws.amazon.com/ . For some proportion of requests, the caching layer is checking if there is a new version on S3, and only serving the stored version after that request times out.

This will only affect a website the first time an end-user's browser loads the library; after that it's cached by the browser with a 2 hour TTL.

Most users should not notice any issues, but we are investigating ways to avoid the slow requests while S3 continues to experience problems.
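One common mitigation for this failure mode is to serve the stale cached copy when origin revalidation fails or times out, rather than blocking on the origin. A minimal sketch of that behaviour (hypothetical code, not our actual CDN configuration; the injected `fetch_origin` callable and the 2-hour TTL mirroring the browser cache are assumptions for illustration):

```python
import time

class StaleServingCache:
    """Tiny cache sketch: serve stale content when the origin times out.

    `fetch_origin` is an injected callable so the behaviour can be shown
    without a real network; it may raise TimeoutError.
    """
    def __init__(self, fetch_origin, ttl=7200):
        self.fetch_origin = fetch_origin
        self.ttl = ttl          # 2-hour TTL, matching the browser cache above
        self.store = {}         # key -> (value, fetched_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        cached = self.store.get(key)
        if cached and now - cached[1] < self.ttl:
            return cached[0]                 # fresh: no origin round trip
        try:
            value = self.fetch_origin(key)   # revalidate against origin
        except TimeoutError:
            if cached:
                return cached[0]             # origin down: serve stale copy
            raise                            # nothing cached: must fail
        self.store[key] = (value, now)
        return value
```

With this approach an origin outage degrades freshness rather than latency: expired entries keep being served until the origin recovers.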

28th Feb 07:54 PM

We have migrated the origin layer of our CDN off Amazon S3 to resolve these issues. We will also ensure our origin is replicated and has automatic failover for future incidents like this.

The CDN is now operating normally.


Resolved in about 1 hour.
16th February 2017 03:00:00 PM

Database performance issues (post mortem)

Yesterday, our globally distributed Cassandra cluster, which provides our persistence layer for history, API key retrieval, account config etc., began to exhibit very spiky response times to queries. As a result, some users experienced very high latencies for history requests and occasionally for other operations that required access to the persistence layer.

Looking back at what triggered this, a routine repair of the cluster (part of the required maintenance for Cassandra) caused some nodes to have increased load for short periods of time. This is to be expected, and through a combination of our bespoke client library strategy, the number of replicas we maintain of the data, and the Cassandra executor strategy, this is normally not an issue as only a few servers out of the entire cluster are needed to satisfy queries.

However, what we found was that some nodes were getting into a state where queries were taking in excess of 5,000ms, whereas the median is roughly 25ms (from servers in the same region). Unfortunately, because our monitoring systems averaged performance over periods of time, we were not pre-emptively notified of the issue. Our client library strategy had the opposite problem: because it was not averaging performance, erratically performing servers were sometimes still being used to satisfy queries and to assume the role of query co-ordinator.
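To illustrate why averaged metrics can hide this kind of problem (the numbers below are made up for illustration, not our real monitoring data), compare the mean against a tail percentile on a sample where a few queries spike:

```python
# 100 queries: 95 at the ~25ms median, 5 pathological ~5,000ms outliers
latencies_ms = [25] * 95 + [5000] * 5

mean = sum(latencies_ms) / len(latencies_ms)
# Crude 99th percentile: value at the 99th position of the sorted sample
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]

print(mean)  # 273.75ms -- looks merely "elevated", easy to miss
print(p99)   # 5000ms   -- exposes the spike immediately
```

This is why alerting on tail percentiles (p95/p99) rather than averages catches spiky behaviour that a mean smooths away.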

To correct the problem globally, we deployed a new Cassandra strategy in the realtime system at 7:50pm GMT; unfortunately, this deployment resulted in a very high rate of errors in us-west-1.

Everything has been stable in the realtime system and Cassandra since this morning at 01:00, and we are now looking at the following:

* Rolling out our new Cassandra strategy more widely, which will stop using servers that exhibit spiky performance over longer periods of time
* Investigating the root cause of the issue in us-west-1 yesterday when we rolled out the new strategy
* Working with our vendor DataStax to establish what has been causing the spiky response times in Cassandra
* Modifying our alerting systems to detect these anomalies more quickly and alert us
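The first of the actions above can be sketched roughly as follows (an illustrative approach only, not Ably's actual implementation; the sliding-window size and p95 threshold are invented parameters): track recent per-node latencies and stop routing queries to nodes whose tail latency exceeds a threshold.

```python
from collections import defaultdict, deque

class SpikyNodeFilter:
    """Illustrative sketch: exclude nodes whose recent tail latency is high.

    Window size and threshold are made-up parameters for illustration.
    """
    def __init__(self, window=100, p95_threshold_ms=500):
        self.threshold = p95_threshold_ms
        # Per-node sliding window of recent query latencies
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, node, latency_ms):
        self.samples[node].append(latency_ms)

    def healthy_nodes(self, nodes):
        healthy = []
        for node in nodes:
            s = sorted(self.samples[node])
            # Nodes with no history get the benefit of the doubt
            if not s or s[int(0.95 * len(s))] <= self.threshold:
                healthy.append(node)
        return healthy
```

Because the filter looks at a percentile over a window rather than an average, a node that answers most queries quickly but spikes intermittently is still taken out of rotation.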

1st Mar 03:43 PM

Last week we successfully upgraded our Cassandra cluster and added significantly more capacity globally.

We have also since rolled out a new Cassandra connection strategy, improved our alerting to notify us of degradation in query performance, and are preparing a major update imminently to address the underlying us-west-1 issue we experienced.

By all measurements, everything is stable and performing well, so we consider this issue resolved.


Resolved in about 10 hours.
10th February 2017 02:40:28 AM

AWS Northern California us-west-1 issues

AWS in Northern California is having network issues with some availability zones being partitioned from others.

From their site:

- 5:28 PM PST We are investigating network connectivity issues in a single Availability Zone in the US-WEST-1 Region.
- 6:15 PM PST We continue to investigate network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region.

As a result, we are pre-emptively rerouting traffic away from us-west-1 until AWS resolves the issue, although at present that datacenter is not yet reporting problems.

10th Feb 03:19 AM

Issues continue in this region as reported by Amazon:

- 7:09 PM PST We have identified the root cause of the network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region. Connectivity to some instances has been restored and we continue to work on the remaining instances.
- 7:29 PM PST We can confirm increased error rates and elevated latencies for AWS Lambda requests in the US-WEST-1 Region. Newly created functions and console editing are also affected. We have identified the root cause and working to resolve the issue.

As a further precaution, we have completely decommissioned this region's datacenter and plan to reintroduce it once we are confident the issues are fully resolved, probably within the next 12 hours.

10th Feb 10:12 AM

At 10:00am, we confirmed that the AWS issues in us-west-1 (Northern California) were fully resolved, and as such, we have brought the us-west-1 datacenter back online.

All datacenters are fully operational again.


Resolved in about 7 hours.
9th February 2017 03:29:14 PM

ap-southeast-2-a loss of connectivity

Our Australian data center lost connectivity for a few minutes, resulting in a high level of 50x errors and timeouts.

The problem was automatically resolved within a few minutes and all traffic for that data center was routed to other regions by the client libraries.


Resolved in 2 minutes.
4th February 2017 02:38:52 AM

Database consistency failure

There is a reported database consistency failure across the global cluster which has resulted in some writes to the global account, app and key databases failing.

We are investigating the root cause now.

4th Feb 02:46 AM

The database inconsistency was resolved automatically by the cluster and all services have returned to normal. A higher rate of errors was present for no more than 5 minutes, and most services were unaffected.


Resolved in 7 minutes.
2nd February 2017 04:23:41 AM

Website downtime due to Redis failure

Our website was unavailable for 10 minutes due to the failure of a Redis instance that it depends on.

This was corrected automatically within 10 minutes and service resumed.

Please note that whilst our realtime system offers a 100% uptime guarantee, our website does not fall into this uptime guarantee as it is not part of the realtime platform.


Resolved in 10 minutes.

January 2017

28th January 2017 03:53:00 PM

Gossip and sharding upgrade

We performed an upgrade of the global realtime platform to change how data is sharded across the system, in order to improve the spread of load and prevent load hotspots.

As the upgrade involved changes to our gossip protocol, we needed to upgrade the system in a way that allowed the old sharding algorithms to run alongside the new ones. This was achieved by performing the upgrade in three stages globally.
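A migration of this shape can be pictured as follows (purely illustrative code, not our actual sharding implementation; the hash functions and stage names are invented): during the transitional stage, a key is owned under both the old and the new shard assignment, so either algorithm can locate its data.

```python
import hashlib

def shard_old(key, n_shards):
    # "Old" algorithm: modulo over an MD5 digest (illustrative only)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_shards

def shard_new(key, n_shards):
    # "New" algorithm: a different digest, so most keys move (illustrative only)
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n_shards

def owners_during_migration(key, n_shards, stage):
    """Stage 1: old assignment only; stage 2: both (dual ownership);
    stage 3: new assignment only."""
    if stage == 1:
        return {shard_old(key, n_shards)}
    if stage == 2:
        return {shard_old(key, n_shards), shard_new(key, n_shards)}
    return {shard_new(key, n_shards)}
```

The dual-ownership stage is what makes the cutover safe: stage 1 and stage 3 owners are always a subset of the stage 2 owners, so no lookup ever misses during the transition.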

We had anticipated little or no disruption to the platform during this time. However, following the three-stage upgrade, we did notice a higher rate of errors and timeouts. Unfortunately, this means that some users will have experienced some degradation of the service during the one-hour upgrade window.


Resolved in about 1 hour.

December 2016

15th December 2016 02:11:37 AM

Our automated health check system has reported an issue with realtime cluster health in eu-west-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

15th Dec 07:39 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.


Resolved in 28 minutes.

November 2016

23rd November 2016 07:01:20 PM

High error rates

From 19:01 UTC we have been seeing a significant escalation in error rates in all regions.

We are investigating this now.

23rd Nov 07:21 PM

Unfortunately the high error rates have resumed. We are bringing on global capacity to address the problem whilst we investigate the root cause.

23rd Nov 08:09 PM

Unfortunately, there has been a significant impact on the global platform since the initial escalation in errors.

We have stabilized the platform now, and will focus immediately on ensuring service returns to normal.

Once that is done, we will be doing a full post mortem on what happened this evening (UTC time).

We apologize for the inconvenience today and are doing everything we can to keep the service stable.

25th Nov 01:59 AM

Following the disruption from 19:01 until 19:52 UTC on 2016/11/23, we have now completed a preliminary post mortem on what happened and on what we are doing to make sure it does not happen again.

The sequence of events:

1) 19:00 - A specific traffic pattern triggered a bug that caused several instances (each handling the same channel, but in different regions) to fail. Our health servers correctly identified the issue and restarted the faulty processes across all regions within approximately 60 seconds.

2) Redistribution of the load from the exited instances triggered scaling events across multiple regions concurrently. Our design has always been to over-provision capacity when we detect additional load, and as planned, this happened at around 19:04-19:05.

3) A subsequent bug was triggered which caused the resulting expanded cluster to become unhealthy (i.e. cluster state failed to synchronize), which meant that the cluster was intermittently unable to service requests; as a result, multiple regions experienced very high error rates for a period of around 20 minutes.

4) Seeing the events unfold, we attempted to intervene manually to restore cluster stability, but the level of activity meant that our tools also experienced errors and so we were unable to regain manual control without taking some regions offline.

5) Once certain affected regions were taken offline, we were able to restore the cluster to working order at around 19:50, and error rates returned to normal levels over several minutes.

Actions we have undertaken since the incident:

1) In-depth investigation of the logs to identify the bug that triggered simultaneous failure of the nodes.

2) We have made significant improvements to our tooling so that we can apply configuration changes, execute urgent provisioning changes, control services and disable health systems selectively at a global level.

3) In the interim, we have slightly reduced the level of autonomy so that we don't encounter the same situation before our bug fixes are completed.

Actions we are still doing:

1) We have a plan for improvements to the automated controls to remove the specific features that led to the cascade of responses in this case

2) We are setting up infrastructure and a load-testing plan to reproduce the issue so that we can identify the root cause of issues 1 and 2, and also optimize the autonomous health systems.

3) We are continuing to investigate the underlying cause of the issue.

We are sorry for the disruption this caused everyone yesterday. We are doing everything we can to make sure it does not happen again.

Please do get in touch if you have any questions.


Resolved in about 1 hour.
7th November 2016 10:25:43 PM

Brief unavailability of datacenter sa-east-1-a

An availability issue was detected in sa-east-1-a.

The datacenter is now online again and was only offline for a few minutes.

We are investigating the root cause.

7th Nov 10:26 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.


Resolved in 2 minutes.
1st November 2016 05:07:15 PM

Website database maintenance

The primary Ably website at www.ably.io was down for approximately 4 minutes whilst an essential database upgrade was performed.

This was performed to address a known Linux vulnerability. See https://bugzilla.redhat.com/show_bug.cgi?id=1384344 for more info.

1st Nov 05:08 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.


Resolved in 2 minutes.

October 2016

28th October 2016 08:00:00 AM

Top level .io domain issues and Heroku

Some customers are reporting issues connecting to our primary endpoints and CDN, which rely on the .io top-level domain (TLD).

This issue seems to be mostly affecting Heroku users, and Heroku has an incident describing this issue at https://status.heroku.com/incidents/967

Please note that whilst users on Heroku are reporting the issue, the underlying .io domain appears to be working reliably according to all of our global monitoring systems.

Also, as all customers' clients should automatically fall back to our alternative fallback hosts on the ably-realtime.com domain, even customers affected by the .io domain issue should still be able to connect to the Ably service. See https://blog.ably.io/routing-around-single-point-of-failure-dns-issues-7c20a8757903 for more info.
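Conceptually, the fallback behaviour works like this (a simplified sketch with an injected transport function, not the actual client library code; the specific fallback host names are illustrative):

```python
PRIMARY_HOST = "realtime.ably.io"
FALLBACK_HOSTS = [  # illustrative names on the ably-realtime.com domain
    "a.ably-realtime.com",
    "b.ably-realtime.com",
]

def connect_with_fallback(open_connection):
    """Try the primary endpoint first; on a DNS or connection failure,
    walk through the fallback hosts on a different domain.

    `open_connection` is an injected callable so this sketch stays
    demonstrable without a network; it raises OSError on failure.
    """
    last_error = None
    for host in [PRIMARY_HOST] + FALLBACK_HOSTS:
        try:
            return open_connection(host)
        except OSError as err:      # DNS failure, refused connection, etc.
            last_error = err
    raise last_error
```

Because the fallback hosts live on a different TLD, an outage affecting only .io resolution leaves the fallback path intact.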

We are continuing to follow Heroku's progress on resolving this issue and will provide an update once this is resolved.

28th Oct 10:00 AM

All of our monitoring systems, including those testing some web services within the Heroku environment, are reporting that all services are healthy globally.

The underlying TLD .io domain issues appear to be resolved for all customers including those who host services on the Heroku platform.


Resolved in about 1 hour.