
Incident log archive

All times are shown in UTC

February 2017

16th February 2017 03:00:00 PM

Database performance issues (post mortem)

Yesterday, our globally distributed Cassandra cluster, which provides our persistence layer for history, API key retrieval, account config and so on, began to exhibit very spiky response times to queries. As a result, some users experienced very high latencies for history requests, and occasionally for other operations that required access to the persistence layer.

Looking back at what triggered this, a routine repair of the cluster (part of the required maintenance for Cassandra) caused some nodes to have increased load for short periods of time. This is to be expected, and through a combination of our bespoke client library strategy, the number of replicas we maintain of the data, and the Cassandra executor strategy, this is normally not an issue as only a few servers out of the entire cluster are needed to satisfy queries.

However, we found that some nodes were getting into a state where queries were taking in excess of 5,000ms, whereas the median is roughly 25ms (from servers in the same region). Unfortunately, because our monitoring systems averaged performance over periods of time, we were not pre-emptively notified of the issue. Our client library strategy had the opposite problem: because it was not tracking performance over time at all, erratically performing servers were sometimes still being used to satisfy queries and to assume the role of query coordinator.
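As an illustration of the direction we are taking (a minimal sketch with hypothetical names and thresholds, not our production code), tracking per-host latency against a percentile over a recent window, rather than an average, makes this kind of spike visible and lets erratically performing hosts be excluded from coordinating queries:

```typescript
// Sketch only: keep a sliding window of query latencies per host and exclude
// any host whose recent p99 exceeds a threshold from coordinating queries.
// The window size and threshold are illustrative, not our production values.
class HostLatencyTracker {
  private samples = new Map<string, number[]>();

  constructor(
    private windowSize = 200,     // recent samples kept per host
    private p99ThresholdMs = 500  // hosts above this p99 are skipped
  ) {}

  record(host: string, latencyMs: number): void {
    const window = this.samples.get(host) ?? [];
    window.push(latencyMs);
    if (window.length > this.windowSize) window.shift();
    this.samples.set(host, window);
  }

  private p99(host: string): number {
    const sorted = [...(this.samples.get(host) ?? [])].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99));
    return sorted[idx];
  }

  // Hosts eligible to coordinate queries; never returns an empty list, so the
  // cluster remains usable even if every host is currently reporting slowly.
  eligibleCoordinators(hosts: string[]): string[] {
    const healthy = hosts.filter((h) => this.p99(h) <= this.p99ThresholdMs);
    return healthy.length > 0 ? healthy : hosts;
  }
}
```

A windowed percentile catches the occasional 5,000ms outlier that an average would smooth back towards the 25ms median, while also giving the client a view over time rather than per request, so erratically performing servers are consistently avoided.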

As a result, we deployed a new Cassandra strategy in the realtime system to correct the problem globally; at 7:50pm GMT, this deployment itself resulted in a very high rate of errors in us-west-1.

Everything has been stable in the realtime system and Cassandra since this morning at 01:00, and we are now looking at the following:

* Rolling out our new Cassandra strategy more widely, which will stop using servers that exhibit spiky performance over longer periods of time
* Investigating the root cause of the issue in us-west-1 yesterday when we rolled out the new strategy
* Working with our vendor DataStax to establish what has been causing the spiky response times in Cassandra
* Modifying our alerting systems to detect these anomalies more quickly and alert us

1st Mar 03:43 PM

Last week we successfully upgraded our Cassandra cluster and added significantly more capacity globally.

We have also since rolled out a new Cassandra connection strategy, improved our alerting to notify us of degradation in query performance, and are preparing a major update to address the underlying us-west-1 issue we experienced.

By all measurements, everything is stable and performing well, so we consider this issue resolved.

Resolved

in about 10 hours
10th February 2017 02:40:28 AM

AWS Northern California us-west-1 issues

AWS in Northern California is having network issues with some availability zones being partitioned from others.

From their site:

- 5:28 PM PST We are investigating network connectivity issues in a single Availability Zone in the US-WEST-1 Region.
- 6:15 PM PST We continue to investigate network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region.

As a result, we are pre-emptively rerouting traffic away from us-west-1 until AWS resolves the issue, although at present our datacenter there is not yet reporting any issues.

10th Feb 03:19 AM

Issues continue in this region as reported by Amazon:

- 7:09 PM PST We have identified the root cause of the network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region. Connectivity to some instances has been restored and we continue to work on the remaining instances.
- 7:29 PM PST We can confirm increased error rates and elevated latencies for AWS Lambda requests in the US-WEST-1 Region. Newly created functions and console editing are also affected. We have identified the root cause and working to resolve the issue.

As a further precaution, we have completely decommissioned this region's datacenter and plan to reintroduce it once we are confident the issues are fully resolved, probably within the next 12 hours.

10th Feb 10:12 AM

At 10:00am, we confirmed that the AWS issues in us-west-1 (Northern California) were fully resolved, and as such, we have brought the us-west-1 datacenter back online.

All datacenters are fully operational again.

Resolved

in about 7 hours
9th February 2017 03:29:14 PM

ap-southeast-2-a loss of connectivity

Our Australian datacenter lost connectivity for a few minutes, resulting in a high level of 50x errors and timeouts.

The problem was automatically resolved within a few minutes and all traffic for that data center was routed to other regions by the client libraries.

Resolved

in 2 minutes
4th February 2017 02:38:52 AM

Database consistency failure

There is a reported database consistency failure across the global cluster which has resulted in some writes to the global account, app and key databases failing.

We are investigating the root cause now.

4th Feb 02:46 AM

The database inconsistency was resolved automatically by the cluster and all services have returned to normal. A higher level of error rates was present for no more than 5 minutes, and most services were unaffected.

Resolved

in 7 minutes
2nd February 2017 04:23:41 AM

Website downtime due to Redis failure

Our website was unavailable for 10 minutes due to the failure of a Redis instance that it depends on.

This was corrected automatically within 10 minutes and service resumed.

Please note that whilst our realtime system offers a 100% uptime guarantee, our website does not fall into this uptime guarantee as it is not part of the realtime platform.

Resolved

in 10 minutes

January 2017

28th January 2017 03:53:00 PM

Gossip and sharding upgrade

We performed an upgrade of the global realtime platform to change how data is sharded across the system, improving the spread of load and preventing load hotspots.

As the upgrade involved changes to our gossip protocol, we were required to upgrade the system in a way that allowed the old sharding algorithm to run alongside the new one. This was achieved by performing the upgrade in three stages globally.
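As a rough illustration of how two sharding algorithms can coexist during a staged rollout (a simplified sketch with hypothetical placement schemes, not our actual implementation), every node can resolve a channel's placement under both the old and the new algorithm while the migration is in progress:

```typescript
// Sketch only: assumes a simple modulo-style old scheme and a rendezvous-hash
// new scheme. During the staged rollout every node can resolve a channel's
// placement under both algorithms, so nodes on old and new code agree on
// where existing state lives while new placements spread load more evenly.
import { createHash } from "crypto";

type ShardingPhase = "old-only" | "dual" | "new-only";

function oldShard(channel: string, nodes: string[]): string {
  // Legacy placement: hash modulo node count (prone to hotspots).
  const h = createHash("sha1").update(channel).digest().readUInt32BE(0);
  return nodes[h % nodes.length];
}

function newShard(channel: string, nodes: string[]): string {
  // New placement: rendezvous (highest-random-weight) hashing for a smoother spread.
  let best = nodes[0];
  let bestScore = -1;
  for (const node of nodes) {
    const score = createHash("sha1")
      .update(`${channel}:${node}`)
      .digest()
      .readUInt32BE(0);
    if (score > bestScore) {
      bestScore = score;
      best = node;
    }
  }
  return best;
}

function candidateNodes(channel: string, nodes: string[], phase: ShardingPhase): string[] {
  if (phase === "old-only") return [oldShard(channel, nodes)];
  if (phase === "new-only") return [newShard(channel, nodes)];
  // "dual" phase: prefer the new placement, but fall back to the old location
  // so state written before the migration can still be found.
  return [newShard(channel, nodes), oldShard(channel, nodes)];
}
```

Rendezvous hashing is shown purely as an example of a scheme that spreads load more evenly than a simple modulo placement; the key point is the "dual" phase, which is what allows the old and new algorithms to run alongside each other until all stages complete.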

We had anticipated little or no disruption to the platform; however, during the three-stage upgrade we did notice a higher rate of errors and timeouts. Unfortunately, this means that some users will have experienced some degradation of the service during the one-hour upgrade window.

Resolved

in about 1 hour

December 2016

15th December 2016 02:11:37 AM

Our automated health check system has reported an issue with realtime cluster health in eu-west-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

15th Dec 07:39 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

Resolved

in 28 minutes

November 2016

23rd November 2016 07:01:20 PM

High error rates

From 19:01 UTC we have been seeing a significant escalation in error rates in all regions.

We are investigating this now.

23rd Nov 07:21 PM

Unfortunately the high error rates have resumed. We are bringing on global capacity to address the problem whilst we investigate the root cause.

23rd Nov 08:09 PM

Unfortunately there has been a significant impact on the global platform since the initial escalation in errors.

We have stabilized the platform now, and will focus immediately on ensuring service returns to normal.

Once that is done, we will be doing a full post mortem on what happened this evening (UTC).

We apologize for the inconvenience today and are doing everything we can to keep the service stable.

25th Nov 01:59 AM

Following the disruptions from 2016/11/23 19:01 until 19:52 UTC, we have now been able to complete a preliminary post mortem on what happened and how we will do our best to make sure it does not happen again.

The sequence of events:

1) 19:00 - A specific traffic pattern triggered a bug that caused several instances (each handling the same channel, but in different regions) to fail. Our health servers correctly identified the issue and restarted the faulty processes across all regions within approximately 60 seconds.

2) Redistribution of the load from those exiting instances resulted in scaling events across multiple regions concurrently. Our design has always been to over-provision capacity when we detect additional load, and as planned, this happened at around 19:04-19:05.

3) A subsequent bug was triggered which caused the resulting expanded cluster to become unhealthy (i.e. cluster state failed to synchronize). This meant that the cluster was intermittently unable to service requests, and as a result multiple regions experienced very high error rates for a period of around 20 minutes.

4) Seeing the events unfold, we attempted to intervene manually to restore cluster stability, but the level of activity meant that our tools also experienced errors and so we were unable to regain manual control without taking some regions offline.

5) Once certain affected regions were taken offline, we were able to restore the cluster to working order at around 19:50, and error rates returned to normal levels over the next several minutes.

Actions we have undertaken since the incident:

1) We have carried out an in-depth investigation of the logs to identify the bug that triggered the simultaneous failure of the nodes.

2) We have made significant improvements to our tooling so that we can apply configuration changes, execute urgent provisioning changes, control services and disable health systems selectively at a global level.

3) We have, in the interim, slightly reduced the level of autonomy so that we don't encounter the same situation before our bug fixes are completed.

Actions we are still doing:

1) We have a plan for improvements to the automated controls to remove the specific features that led to the cascade of responses in this case

2) We are setting up infrastructure and a load testing plan to reproduce the issue so that we can hopefully identify the root cause of issues 1 and 2, and also optimize the autonomous health systems.

3) We are continuing to investigate the underlying cause of the issue.

We are sorry for the disruption this caused everyone yesterday, and we are doing everything we can to make sure it does not happen again.

Please do get in touch if you have any questions.

Resolved

in about 1 hour
7th November 2016 10:25:43 PM

Datacenter in sa-east-1-a was briefly unavailable

An availability issue was detected in sa-east-1-a.

The datacenter is now online again and was only offline for a few minutes.

We are investigating the root cause.

7th Nov 10:26 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

Resolved

in 2 minutes
1st November 2016 05:07:15 PM

Website database maintenance

The primary Ably website at www.ably.io was down for approximately 4 minutes whilst an essential database upgrade was performed.

The upgrade was needed to address a known Linux vulnerability. See https://bugzilla.redhat.com/show_bug.cgi?id=1384344 for more info.

1st Nov 05:08 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

Resolved

in 2 minutes

October 2016

28th October 2016 08:00:00 AM

Top level .io domain issues and Heroku

Some customers are reporting issues connecting to our primary endpoints and CDN that rely on the TLD .io domain.

This issue seems to be mostly affecting Heroku users, and Heroku has an incident describing this issue at https://status.heroku.com/incidents/967

Please note that whilst users on Heroku are reporting the issue, the underlying .io domain appears to be working reliably according to all of our global monitoring systems.

Also, as all customers' clients should automatically fall back to our alternative fallback hosts on the ably-realtime.com domain, even customers affected by the .io domain issue should still be able to connect to the Ably service. See https://blog.ably.io/routing-around-single-point-of-failure-dns-issues-7c20a8757903 for more info.
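To illustrate the fallback behaviour in general terms (a simplified sketch, not the client library's internal logic; the hostnames shown are for illustration only), a request can be retried against fallback hosts on the second domain whenever the primary endpoint is unreachable:

```typescript
// Sketch only: try the primary .io endpoint first, then work through fallback
// hosts on a different domain if the primary appears unreachable.
const PRIMARY_HOST = "rest.ably.io";
const FALLBACK_HOSTS = [
  "a.ably-realtime.com",
  "b.ably-realtime.com",
  "c.ably-realtime.com",
];

async function requestWithFallback(path: string, init?: RequestInit): Promise<Response> {
  let lastError: unknown;
  for (const host of [PRIMARY_HOST, ...FALLBACK_HOSTS]) {
    try {
      const res = await fetch(`https://${host}${path}`, init);
      // Only move to the next host on errors suggesting the host is unhealthy.
      if (res.status < 500) return res;
      lastError = new Error(`HTTP ${res.status} from ${host}`);
    } catch (err) {
      lastError = err; // DNS failure, timeout, etc. - try the next host
    }
  }
  throw lastError;
}
```

Because the fallback hosts sit on a separate domain, an outage affecting only .io resolution should not prevent clients from reaching the service.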

We are continuing to follow Heroku's progress on resolving this issue and will provide an update once this is resolved.

28th Oct 10:00 AM

All of our monitoring systems, including those testing some web services within the Heroku environment, are reporting that all services are healthy globally.

The underlying TLD .io domain issues appear to be resolved for all customers including those who host services on the Heroku platform.

Resolved

in about 1 hour
28th October 2016 07:33:19 AM

Top level .io domain DNS issues affecting the website

Ongoing issues with Heroku being unable to resolve .io domains are currently affecting some functions of the website, and more specifically, the app dashboards.

Please see https://status.heroku.com/incidents/967 for more info on the underlying issue.

We are monitoring Heroku's progress on this issue.

28th Oct 07:35 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

28th Oct 09:50 AM

For the last 30 minutes we have had no reported errors from Heroku and all areas of the website are operating without issue.

Resolved

in about 2 hours
21st October 2016 04:00:00 PM

ably.io slow to respond due to Heroku/DynDNS issues; realtime service is fine

Our website (ably.io) is currently not responding for some customers due to a DynDNS outage affecting Heroku, see https://www.dynstatus.com/incidents/nlr4yrr162t8

The realtime service does not use DynDNS, and is unaffected and running smoothly.

21st Oct 06:17 PM

The ably.io website is now bypassing Heroku's DynDNS and is operating normally. (The realtime system continued operating as normal throughout.)

21st Oct 06:53 PM

See https://blog.ably.io/routing-around-single-point-of-failure-dns-issues-7c20a8757903#.o7p0bbpdy

For customers who are interested, the article above explains how Ably has mitigated these types of global DNS failures, which have affected large companies like GitHub and Twitter.

Resolved

in about 3 hours
13th October 2016 12:00:00 PM

SSL Certificate intermittent issues

Between 13 Oct and 18 Oct, a small number of users were affected by a GlobalSign failure with their certificate revocation lists.

Unfortunately GlobalSign are the providers of Ably's certificate and as such there was little we could do but wait for the caches to be cleared and the issue to go away.

See https://support.ably.io/solution/articles/3000059255--the-security-certificate-has-been-revoked-error-connecting-to-ably for more info.

Resolved

in 5 days

September 2016

26th September 2016 12:01:42 PM

Global intermittent networking issues

Our monitoring systems are reporting network timeouts that appear to be affecting all US regions, even across independent services we host with different providers.

As yet, we are unclear what is causing this, but it appears to be an AWS or network routing issue as opposed to any problem within our datacenters or service.

A high level of AWS errors are being reported online at http://downdetector.com/status/aws-amazon-web-services

We're monitoring the situation and will see what we can do to alleviate the problem. However, if the underlying issues relate to core network routing, we are unfortunately unable to do anything about them, as we rely on a working network in the affected regions.

26th Sep 12:08 PM

The networking issues have resolved themselves and error rates have dropped off.

Amazon as yet has not reported any reason behind the issues, but we expect they'll do a post mortem and post an update later today.

The number of incidents reported by other parties was very high during the time we experienced the networking issues. See http://downdetector.com/status/aws-amazon-web-services

We'll write up a post mortem once our upstream providers do the same.

Resolved

in 28 minutes