Incident log archive

All times are shown in UTC

January 2017

28th January 2017 03:53:00 PM

Gossip and sharding upgrade

We performed an upgrade of the global realtime platform to change how data is sharded across the system, improving the spread of load and preventing load hotspots.

As the upgrade involved changes to our gossip protocol, we needed to upgrade the system in a way that allowed the old sharding algorithm to run alongside the new one. This was achieved by performing the upgrade in three stages globally.
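One way to picture running both sharding algorithms side by side is as two placement functions consulted per stage. A minimal sketch, assuming hypothetical node names and a plain hash in place of whatever ring structure the platform actually uses (none of these identifiers are Ably's):

```python
import hashlib

def place(channel: str, nodes: list) -> str:
    """Pick a node for a channel by hashing its name (a stand-in for a real hash ring)."""
    digest = int(hashlib.sha256(channel.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

OLD_NODES = ["node-a", "node-b"]            # placement under the old algorithm
NEW_NODES = ["node-a", "node-b", "node-c"]  # placement under the new algorithm

def route(channel: str, stage: int) -> set:
    """Return the set of nodes that must serve a channel at each upgrade stage."""
    if stage == 1:
        # Stage 1: only the old sharding is live.
        return {place(channel, OLD_NODES)}
    if stage == 2:
        # Stage 2: both algorithms run side by side, so a channel is served
        # wherever either algorithm places it and peers on either version agree.
        return {place(channel, OLD_NODES), place(channel, NEW_NODES)}
    # Stage 3: the old sharding is retired; only the new placement remains.
    return {place(channel, NEW_NODES)}
```

The middle stage is what makes the cutover safe: channels remain reachable at both placements until every node speaks the new protocol.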

We had anticipated little or no disruption to the platform during this time. However, following the three-stage upgrade, we did notice a higher rate of errors and timeouts. Unfortunately, this means that some users will have experienced some degradation of the service during the one-hour upgrade window.



December 2016

15th December 2016 02:11:37 AM

Our automated health check system has reported an issue with realtime cluster health in eu-west-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

15th Dec 07:39 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.



November 2016

23rd November 2016 07:01:20 PM

High error rates

From 19:01 UTC we have been seeing a significant escalation in error rates in all regions.

We are investigating this now.

23rd Nov 07:21 PM

Unfortunately the high error rates have resumed. We are bringing on global capacity to address the problem whilst we investigate the root cause.

23rd Nov 08:09 PM

Unfortunately there has been a significant impact on the global platform since the initial escalation in errors.

We have stabilized the platform now, and will focus immediately on ensuring service returns to normal.

Once that is done, we will be doing a full post mortem on what happened this evening (UTC).

We apologize for the inconvenience today and are doing everything we can to keep the service stable.

25th Nov 01:59 AM

Following the disruptions we had from 2016/11/23 19:01 until 19:52 UTC, we have now been able to complete a preliminary post mortem on what happened, and how we will do our best to make sure it does not happen again.

The sequence of events:

1) 19:00 - A specific traffic pattern triggered a bug that caused several instances (each handling the same channel, but in different regions) to fail. Our health servers correctly identified the issue and restarted the faulty processes across all regions within approximately 60 seconds.

2) Redistribution of the load from those exiting instances resulted in scaling events across multiple regions concurrently. Our design has always been to over-provision capacity when we detect additional load, and as planned, this happened at around 19:04-19:05.

3) A subsequent bug was triggered which caused the resulting expanded cluster to become unhealthy (i.e. cluster state failed to synchronize), meaning the cluster was intermittently unable to service requests; as a result, multiple regions experienced very high error rates for a period of around 20 minutes.

4) Seeing the events unfold, we attempted to intervene manually to restore cluster stability, but the level of activity meant that our tools also experienced errors and so we were unable to regain manual control without taking some regions offline.

5) Once certain affected regions were taken offline we were able to restore the cluster to working order at around 19:50, and error rates returned to normal levels over the following few minutes.

Actions we have undertaken since the incident:

1) We have performed an in-depth investigation of the logs to identify the bug that triggered the simultaneous failure of the nodes.

2) We have made significant improvements to our tooling so that we can apply configuration changes, execute urgent provisioning changes, control services and disable health systems selectively at a global level.

3) In the interim, we have slightly reduced the level of autonomy of these systems so that we don't encounter the same situation before our bug fixes are completed.

Actions still in progress:

1) We have a plan for improvements to the automated controls to remove the specific features that led to the cascade of responses in this case.

2) We are setting up infrastructure and a load testing plan to reproduce the issue so that we can identify the root cause of issues 1 and 2, and also optimize the autonomous health systems.

3) We are continuing to investigate the underlying cause of the issue.

We are sorry for the disruption this caused to everyone yesterday. We are doing everything we can to make sure this does not happen again.

Please do get in touch if you have any questions.


7th November 2016 10:25:43 PM

Brief datacenter unavailability in sa-east-1-a

An availability issue was detected in sa-east-1-a.

The datacenter is now online again and was only offline for a few minutes.

We are investigating the root cause.

7th Nov 10:26 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.


1st November 2016 05:07:15 PM

Website database maintenance

The primary Ably website at www.ably.io was down for approximately 4 minutes whilst an essential database upgrade was performed.

This was performed to address a known Linux vulnerability. See https://bugzilla.redhat.com/show_bug.cgi?id=1384344 for more info.

1st Nov 05:08 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.



October 2016

28th October 2016 08:00:00 AM

Top level .io domain issues and Heroku

Some customers are reporting issues connecting to our primary endpoints and CDN that rely on the TLD .io domain.

This issue seems to be mostly affecting Heroku users, and Heroku has an incident describing this issue at https://status.heroku.com/incidents/967

Please note that whilst users on Heroku are reporting the issue, the underlying .io domain appears to be working reliably according to all of our global monitoring systems.

Also, as all customers' clients should automatically fall back to our alternative fallback hosts on the TLD ably-realtime.com, even customers affected by the .io domain issue should still be able to connect to the Ably service. See https://blog.ably.io/routing-around-single-point-of-failure-dns-issues-7c20a8757903 for more info.
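The fallback behaviour described above amounts to a client walking an ordered endpoint list and moving on whenever DNS or TCP fails. A minimal sketch; the host names and the injected `dial` callback are purely illustrative, not Ably's actual endpoints or client API:

```python
def connect(hosts, dial):
    """Try each endpoint in order; return the first successful connection.

    `dial` is whatever actually opens a connection (e.g. a wrapper around
    socket.create_connection). Any OSError, which includes DNS resolution
    failures, moves us on to the next host in the list.
    """
    last_error = None
    for host in hosts:
        try:
            return dial(host)
        except OSError as exc:
            last_error = exc
    raise ConnectionError(f"all endpoints failed, last error: {last_error}")

# Primary endpoint on .io plus fallbacks on a different TLD, so a single
# TLD outage cannot exhaust the whole list. Host names are made up.
ENDPOINTS = [
    "realtime.example.io",
    "a.example-realtime.com",
    "b.example-realtime.com",
]
```

The key design point is that the fallback hosts live under a different top-level domain, so a failure of the .io TLD itself still leaves reachable endpoints.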

We are continuing to follow Heroku's progress on resolving this issue and will provide an update once this is resolved.

28th Oct 10:00 AM

All of our monitoring systems, including those testing some web services within the Heroku environment, are reporting that all services are healthy globally.

The underlying TLD .io domain issues appear to be resolved for all customers including those who host services on the Heroku platform.


28th October 2016 07:33:19 AM

Top level DNS IO domain issues affecting the website

Ongoing issues with Heroku being unable to resolve .io domains are currently affecting some functions of the website, and more specifically, the app dashboards.

Please see https://status.heroku.com/incidents/967 for more info on the underlying issue.

We are monitoring Heroku's progress on this issue.

28th Oct 07:35 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

28th Oct 09:50 AM

For the last 30 minutes we have had no reported errors from Heroku and all areas of the website are operating without issue.


21st October 2016 04:00:00 PM

ably.io slow to respond due to Heroku/DynDNS issues; realtime service is fine

Our website (ably.io) is currently not responding for some customers due to a DynDNS outage affecting Heroku, see https://www.dynstatus.com/incidents/nlr4yrr162t8

The realtime service does not use DynDNS, and is unaffected and running smoothly.

21st Oct 06:17 PM

The ably.io website is now bypassing Heroku's DynDNS, and is operating normally. (The realtime system continued operating as normal throughout)

21st Oct 06:53 PM

See https://blog.ably.io/routing-around-single-point-of-failure-dns-issues-7c20a8757903#.o7p0bbpdy

For interested customers, the article explains how Ably has mitigated against these types of global DNS failures, which have affected large companies like GitHub and Twitter.


13th October 2016 12:00:00 PM

SSL Certificate intermittent issues

A small number of users were affected from 13 Oct until 18 Oct by a GlobalSign failure with their certificate revocation lists.

Unfortunately GlobalSign is the provider of Ably's certificate, and as such there was little we could do but wait for the caches to be cleared and the issue to pass.

See https://support.ably.io/solution/articles/3000059255--the-security-certificate-has-been-revoked-error-connecting-to-ably for more info.



September 2016

26th September 2016 12:01:42 PM

Global intermittent networking issues

Our monitoring systems are reporting network timeouts globally, appearing to affect all US regions, even across independent services we host with different providers.

As yet, we are unclear what is causing this, but it appears to be an AWS or network routing issue as opposed to any problem within our datacenters or service.

A high level of AWS errors are being reported online at http://downdetector.com/status/aws-amazon-web-services

We're monitoring the situation and will see what we can do to alleviate the problem. However, if the underlying issues are related to core network routing issues, we are unfortunately unable to do anything about this as we rely on a working network in the affected regions.

26th Sep 12:08 PM

The networking issues have resolved themselves and error rates have dropped off.

Amazon has not yet reported any reason for the issues, but we expect they'll do a post mortem and post an update later today.

The number of incidents reported by other parties was very high during the time we experienced the networking issues. See http://downdetector.com/status/aws-amazon-web-services

We'll write up a post mortem once our upstream providers do the same.


19th September 2016 02:40:14 PM

us-east-1 region performance issues

Error rates in the us-east-1 datacenter have risen in the last 15 minutes.

We are experiencing what appears to be an unprecedented amount of traffic to only this region.

We are manually provisioning more hardware to minimise the impact whilst we diagnose the root cause of the significant increased load.

Customers using us-east-1 may be experiencing some failures at present. Client libraries should automatically route around these issues and use alternative datacenters.

19th Sep 02:58 PM

All operations are now back to normal.

Upon our initial investigation, it appears that the autoscaling system we use to provision hardware when demand increases was not proactive enough for this pattern of usage. We will now investigate the usage patterns that led to the significantly increased load on the servers in us-east-1, and adjust our autoscaling to be more proactive under these conditions in future.
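Making autoscaling "more proactive" usually means sizing capacity from the trend of recent load rather than its latest value alone. A toy sketch of that idea, with the window, headroom and per-node capacity numbers chosen purely for illustration:

```python
import math

def desired_capacity(samples, current, headroom=1.5, per_node=100.0):
    """Decide node count from a window of recent load samples (requests/sec).

    Rather than sizing for the latest sample, extrapolate the window's
    growth one window ahead, so capacity is provisioned before the demand
    actually arrives.
    """
    latest = samples[-1]
    growth = max(samples[-1] - samples[0], 0)   # ignore a falling trend
    projected = latest + growth                 # expected load one window ahead
    needed = math.ceil(projected * headroom / per_node)
    return max(needed, current)                 # this path only ever scales up
```

With a flat window the projection collapses to the current load plus headroom; with a rising window the extra `growth` term buys capacity ahead of the spike.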


8th September 2016 09:00:00 PM

Timeouts on some channels in US-East

For around 45 minutes, multiple channels experienced issues (slow performance and 'client ready timeout' errors), especially for customers in the US East Virginia region. Everything is now back to normal. We will update this issue with more details once we've done a postmortem.


4th September 2016 01:20:45 PM

Network connectivity and partitioning issues in Asia

We are seeing a high rate of errors connecting to and from ap-southeast-1 (Singapore) at present from other data centers.

As a result, this is having an intermittent impact globally when our masters for apps, accounts and channels are placed in this region.

We are investigating the problem now. If necessary, we will take the entire data center ap-southeast-1 offline.

4th Sep 01:40 PM

The networking issues were isolated to a single instance which has now been recycled.

All network operations are back to normal now.



August 2016

31st August 2016 04:28:34 PM

Frankfurt eu-central TLS termination issues

Our Frankfurt eu-central-1 data center is no longer terminating TLS connections correctly. This is an Amazon AWS issue and we have raised a support request with them to resolve the issue.

In the meantime, we are now routing all traffic to other data centers until the issue is resolved by Amazon.

This should have minimal impact on users, as eu-west-1 will take over most requests destined for eu-central with very little effect on latency.

1st Sep 07:39 AM

The issue was identified as a single faulty router, which has now been removed.

The Frankfurt eu-central-1 data center has now been reinstated.


26th August 2016 04:44:42 PM

Our automated health check system has reported an issue with realtime cluster health in eu-central-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

26th Aug 10:44 PM

The issue was intermittent in this region and resolved itself after approximately 10 minutes. During this time some users were unable to connect to eu-central-1 and were routed to other regions.

