
Incident log archive

All times are shown in UTC

September 2018

17th September 2018 02:40:39 PM

Increased error rates in California (us-west-1)

At 14:40 today an instance came online in the us-west-1 region.

Shortly after bootstrapping, the performance of the instance degraded significantly, but it continued to service most requests. From a consensus and health-checking perspective this is hugely problematic: the instance performed the bulk of the work it was supposed to, yet an unacceptable number of requests failed.

We have since fixed the issue, but will now need to revisit how we manage our automated health checks to detect similar failures in the future.

We will continue to investigate and post updates, along with a post mortem, in due course.
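In the meantime, as an illustration only (not our production tooling), a health check that fails an instance on an elevated error rate, rather than on liveness alone, might look roughly like the following sketch; the window size and threshold are hypothetical:

```python
# Hypothetical sketch: mark an instance unhealthy when its recent error
# rate exceeds a threshold, rather than only when it stops responding.
import collections
import time

WINDOW_SECONDS = 60    # sliding window of recent request outcomes
MAX_ERROR_RATE = 0.05  # fail the check above 5% errors (illustrative)

_results = collections.deque()  # (timestamp, succeeded) pairs

def record_request(succeeded: bool) -> None:
    """Record the outcome of a serviced request."""
    now = time.time()
    _results.append((now, succeeded))
    # Drop outcomes that have fallen out of the window.
    while _results and _results[0][0] < now - WINDOW_SECONDS:
        _results.popleft()

def is_healthy() -> bool:
    """Healthy only if the recent error rate is acceptably low."""
    if not _results:
        return True  # no recent traffic; nothing to judge
    errors = sum(1 for _, ok in _results if not ok)
    return errors / len(_results) <= MAX_ERROR_RATE
```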

17th Sep 04:00 PM

Error rates have returned to normal.

We will continue to investigate the root cause and update with a post mortem.

Resolved

about 1 month ago
13th September 2018 06:10:00 PM

Increased latency and error rates due to issues with the persistence layer

We are seeing high error rates and latencies worldwide, due to issues with our Cassandra persistence layer instances in ap-southeast-1 and us-east-1. We are investigating.

Latencies and error rates are now back to normal. The issue was routine maintenance in our Cassandra persistence layer that caused significantly more work than expected, leading to contention issues. We will consider using a different strategy moving forwards to reduce the amount of work needed during maintenance.
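Purely as an illustrative sketch of one such strategy (not a description of our actual procedure), maintenance can be broken into per-node steps with a cluster health check between each, so unexpected extra work surfaces before it causes cluster-wide contention; the node list, health endpoint and repair command below are hypothetical choices:

```python
# Hypothetical sketch: run a maintenance step one node at a time and
# pause if cluster health degrades, instead of doing everything at once.
import subprocess
import time

import requests  # assumed available

NODES = ["cass-1.internal", "cass-2.internal", "cass-3.internal"]  # hypothetical
HEALTH_URL = "http://ops.internal/cluster-health"                  # hypothetical

def cluster_healthy() -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=5).json().get("healthy", False)
    except requests.RequestException:
        return False

for node in NODES:
    # Wait until the cluster looks healthy before touching the next node.
    while not cluster_healthy():
        time.sleep(30)
    # Example maintenance step; the exact command depends on the task.
    subprocess.run(["ssh", node, "nodetool", "repair", "-pr"], check=True)
    time.sleep(60)  # let the node settle before moving on
```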

Resolved

about 2 months ago
4th September 2018 12:07:49 PM

Increased error rates seen in US East, Ireland and Singapore

Over the last 5 minutes we have seen an increase in error rates in the us-east-1 (North Virginia) and eu-west-1 (Ireland) regions.

We are investigating.

4th Sep 12:16 PM

The error rates and latency have returned to normal (after 4 minutes). We believe this was caused by a deploy issue which we are investigating.

Resolved

about 2 months ago
3rd September 2018 02:05:00 AM

Website encryption failure

During a routine upgrade of the website, we are seeing reports of encryption failures and 500 errors affecting our website users.

We are reverting the upgrade and expect the site to be back online in the next few minutes.

3rd Sep 11:36 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

4th Sep 01:20 AM

The upgrade has been rolled back whilst we investigate the root cause of the issue.

In total, the website was offline for some users for no more than 15 minutes.

Resolved

about 2 months ago

August 2018

17th August 2018 09:45:29 AM

Website intermittent failures during upgrade

We are in the process of upgrading our website with a whole suite of new changes that will allow customers to better manage their limits, account, and access support.

Whilst the migrations from the old schema to our new schema are running, some customers may experience intermittent failures.

We expect the migrations to be completed within 10 minutes.
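For illustration only (this is not our actual migration code), one common way to keep such migrations non-disruptive is to apply long-running backfills in small batches so that locks are held only briefly; the table and column names below are hypothetical, using SQLite as a stand-in:

```python
# Hypothetical sketch: backfill a new column in small batches so each
# transaction stays short and other queries are not blocked for long.
import sqlite3
import time

BATCH_SIZE = 500

conn = sqlite3.connect(":memory:")  # stand-in for the real website database
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT)")
conn.executemany("INSERT INTO accounts (plan) VALUES (?)", [("free",)] * 2000)

# Adding the column is a quick metadata change; the backfill is the slow part.
conn.execute("ALTER TABLE accounts ADD COLUMN plan_limits TEXT")

while True:
    with conn:  # each batch commits in its own short transaction
        cur = conn.execute(
            "UPDATE accounts SET plan_limits = '{}' WHERE id IN ("
            "  SELECT id FROM accounts WHERE plan_limits IS NULL LIMIT ?)",
            (BATCH_SIZE,),
        )
    if cur.rowcount == 0:
        break         # nothing left to backfill
    time.sleep(0.05)  # brief pause to let other queries through
```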

17th Aug 10:04 AM

The database migrations are now complete and all operations are back to normal.

Resolved

2 months ago
14th August 2018 08:13:43 AM

High error rates during emergency revert of faulty deployment

At 5:50 UTC, a routine deploy caused error rates to climb in eu-central-1. See https://status.ably.io/incidents/553 for details.

Following an investigation into the root cause, we have decided that it is safer to revert the recent deploy on all instances globally until we can fully understand the root cause of the issue.

We have noticed error rates briefly increase in some regions (for periods of up to around one minute) whilst this code is being reverted.

We'll confirm once this process is completed.

14th Aug 09:00 AM

The new code has now been reverted on all nodes and error rates are back to normal.

We'll be completing a post mortem on this issue and will post updates within this incident report.

16th Aug 06:00 PM

We have completed our post mortem on what happened:

The underlying cause was that we deployed code that was effectively broken. The specific issue was related to interoperability in a mixed cluster, and arose in a specific configuration that we had not tested sufficiently before release.

When we deployed the code to eu-central-1, we saw a very large spike in system load whose root cause was the bug. However, due to the cascade of effects when the system is hugely overloaded, we were overwhelmed by logs containing secondary errors, so we did not recognise the root cause initially.

The load then triggered autoscaling and recovery of nodes across the cluster which, due to a procedural error in making the deploy, caused other nodes to be restarted with the broken code.

We were able to recover the system by suspending recovery, finding the root cause bug, and reverting affected nodes.

However, moving forwards, we are going to improve our deployment strategy so that we can better test multi-region mixed cluster environments.

We apologize for the inconvenience this caused.
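As a sketch of the kind of deployment safeguard described above (not our actual deployment tooling), a rollout can be gated on a canary region's error rate, rolling back automatically before the code reaches other regions; the regions, endpoint, thresholds and deploy hooks below are hypothetical:

```python
# Hypothetical sketch: deploy to one canary region first, watch its error
# rate for a while, and roll back automatically if it climbs.
import time

import requests  # assumed available

CANARY_REGION = "eu-central-1"
OTHER_REGIONS = ["eu-west-1", "us-east-1", "ap-southeast-1"]
METRICS_URL = "http://ops.internal/error-rate"  # hypothetical metrics endpoint
MAX_ERROR_RATE = 0.01                           # illustrative threshold
OBSERVATION_SECONDS = 600

def error_rate(region: str) -> float:
    resp = requests.get(METRICS_URL, params={"region": region}, timeout=5)
    return resp.json()["error_rate"]

def deploy(region: str, version: str) -> None: ...  # hypothetical deploy hook
def rollback(region: str) -> None: ...              # hypothetical rollback hook

def canary_deploy(version: str) -> bool:
    deploy(CANARY_REGION, version)
    deadline = time.time() + OBSERVATION_SECONDS
    while time.time() < deadline:
        if error_rate(CANARY_REGION) > MAX_ERROR_RATE:
            rollback(CANARY_REGION)  # stop before the bad code spreads
            return False
        time.sleep(30)
    for region in OTHER_REGIONS:     # canary looked fine; continue rollout
        deploy(region, version)
    return True
```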

Resolved

3 months ago
14th August 2018 05:50:03 AM

eu-central-1 service disruption

Since 05:50 UTC there has been service disruption in eu-central-1, with increased error rates and latencies. Recovery action is underway and service is returning to normal.

More information will be posted here in due course.

14th Aug 07:42 AM

The root cause of this was a routine deploy in that region, which triggered an unexpected spike in CPU load across the routing layer for the entire eu-central-1 region. Error rates returned to normal at 6:41 UTC.

We are continuing to investigate why this deploy caused the disruption in this region.

14th Aug 09:15 AM

The code that caused this issue was rolled back in all regions across all instances where it had been deployed. See https://status.ably.io/incidents/554 for the subsequent incident detailing the rollback process.

We will continue to investigate and publish a full post mortem in due course in incident https://status.ably.io/incidents/554.

Resolved

3 months ago

July 2018

31st July 2018 10:12:36 PM

eu-west-1 second spike

Following a spike we saw around 15 minutes ago, we raised an incident at https://status.ably.io/admin/incidents/550. Within 4 minutes the issue was resolved, and we marked the incident as resolved so that we could then focus on a post mortem. Unfortunately we are now experiencing a second spike in this region and will need to investigate things more thoroughly.

During this time, nodes in eu-west-1 are rejecting some traffic, causing it to be rerouted to other regions.

31st Jul 10:24 PM

Until we understand the underlying issue fully, we are routing traffic away from the eu-west-1 datacenter as error rates are still high.

31st Jul 10:49 PM

We are confident we have sufficient information now to understand and fix the root cause of the issue.

Our Cassandra upgrades will now be put on hold until a permanent fix is deployed to the global cluster.

We will shortly be directing traffic back to the eu-west-1 cluster. In the meantime, all traffic in EU is being handled by our datacenter in Germany.
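Purely as an illustration of what routing traffic away from a region amounts to (not our actual traffic management), the affected region can be removed from the set of candidate endpoints so that requests land on the next-nearest healthy datacenter; the latency table below is hypothetical:

```python
# Hypothetical sketch: pick the nearest healthy region, skipping any region
# that has been temporarily disabled (e.g. eu-west-1 during this incident).
DISABLED = {"eu-west-1"}

# Hypothetical latency table, in milliseconds, from a client's vantage point.
LATENCY_MS = {
    "eu-west-1": 15,     # Ireland
    "eu-central-1": 25,  # Germany
    "us-east-1": 80,     # North Virginia
}

def choose_region() -> str:
    candidates = {r: ms for r, ms in LATENCY_MS.items() if r not in DISABLED}
    return min(candidates, key=candidates.get)

print(choose_region())  # -> "eu-central-1" while eu-west-1 is disabled
```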

31st Jul 10:56 PM

All systems are back to normal in eu-west-1 (Ireland) as of 22:51.

We have a fix that we will deploy and test in the coming days, before we continue our scheduled Cassandra upgrades.

All traffic is now being routed back to eu-west-1 as normal.

Resolved

3 months ago
31st July 2018 09:58:20 PM

eu-west-1 spike

In the last 4 minutes we have seen a huge spike in load on all frontend and core servers in EU West. This has resulted in the nodes going into siege mode so that they can recover, whilst traffic has been sent to other regions temporarily.

We'll update this incident shortly.
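For context, and as an illustration only rather than the actual implementation of siege mode, the general pattern here is load shedding: an overloaded node rejects new work quickly with a retriable error so that clients fail over to other regions while it recovers. A minimal sketch, with hypothetical thresholds:

```python
# Hypothetical sketch of load shedding: reject new requests quickly with a
# retriable status while the node is overloaded, instead of queueing them.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_LOAD_PER_CPU = 2.0  # illustrative threshold

def overloaded() -> bool:
    one_minute_load, _, _ = os.getloadavg()
    return one_minute_load / (os.cpu_count() or 1) > MAX_LOAD_PER_CPU

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if overloaded():
            # 503 + Retry-After lets well-behaved clients retry elsewhere.
            self.send_response(503)
            self.send_header("Retry-After", "10")
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```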

31st Jul 10:13 PM

Within 4 minutes the issue resolved itself and error rates returned to normal.

It appears that a routine Cassandra upgrade triggered this issue; however, we are not yet clear on the root cause of the problem. We will continue to investigate.

Resolved

3 months ago

June 2018

26th June 2018 05:00:37 PM

Unplanned website database migration

At 17:00 a necessary but unplanned migration was performed on our primary website database, causing the website to be unavailable for 1-2 minutes.

26th Jun 05:04 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

Resolved

4 months ago
6th June 2018 11:41:12 AM

Intermittent slow requests in US West (California)

This incident was created automatically by our health check system, which has identified a fault. We are now looking into this issue.

6th Jun 11:43 AM

We have identified an issue where some HTTP requests are intermittently stalled in the load balancer in the us-west-1 region.

The issue only affects us-west-1. As a precaution, we are temporarily routing traffic away from that region. During this period, clients that would have used that region will mostly end up in us-east-1 (North Virginia). Aside from a possible slight increase in latency, this should not cause any problems.
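As a rough illustration of what this means for clients (not a description of the Ably client libraries), a request can simply be retried against a fallback regional endpoint when the preferred region stalls or fails; the hostnames below are hypothetical:

```python
# Hypothetical sketch: try the preferred regional endpoint first and fall
# back to another region if the request times out or fails.
import requests  # assumed available

ENDPOINTS = [
    "https://us-west-1.example.com",  # hypothetical preferred region
    "https://us-east-1.example.com",  # hypothetical fallback region
]

def get_with_fallback(path: str, timeout: float = 3.0) -> requests.Response:
    last_error = None
    for base in ENDPOINTS:
        try:
            resp = requests.get(base + path, timeout=timeout)
            if resp.status_code < 500:
                return resp       # success, or an error a fallback won't fix
        except requests.RequestException as exc:
            last_error = exc      # stalled or failed; try the next region
    raise last_error or RuntimeError("all endpoints failed")
```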

6th Jun 11:39 PM

We were hoping that AWS would be able to resolve the issue in a sensible time so that they could diagnose the underlying problem on their end. However, as this did not happen in time, we moved forwards, recycled the necessary load balancers, and brought the US West region back online. All services are back to normal.

Resolved

5 months ago
1st June 2018 01:00:00 PM

Sydney datacenter essential maintenance

Today, 1 June 2018, at 13:00 UTC we will be performing essential maintenance on our Sydney datacenter. During this time, all traffic will be routed to the nearest other Ably datacenters, which will most likely be Singapore and California.

We expect this upgrade to take 15-30 minutes, and once completed, all traffic will be routed back to Sydney.

1st Jun 01:27 PM

The maintenance is unfortunately taking longer than predicted as we are waiting on AWS to re-provision resources in that region. Once their re-provisioning is complete, the region will be back online.

1st Jun 01:48 PM

The Sydney datacenter is now back online and operating normally.

Resolved

5 months ago

May 2018

17th May 2018 12:00:00 AM

Reactor queue intermittent issues

Whilst upgrading a number of RabbitMQ nodes, we had a hardware failure on one node. Although all nodes are configured for high availability (HA), we had some issues introducing new nodes into the system, which caused availability issues for some queues for short periods of up to 30 seconds. This happened a few times between 12:00 UTC and 01:20 UTC.

The issues are now fully resolved.
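For illustration only (not our Reactor implementation), consumers of queues that can briefly become unavailable are typically written to reconnect with a short backoff, so gaps like these are absorbed rather than surfacing as errors. A minimal sketch assuming the pika client library, with a hypothetical queue and broker host:

```python
# Hypothetical sketch: a RabbitMQ consumer that reconnects with a short
# backoff when its connection or queue becomes temporarily unavailable.
import time

import pika                    # assumed available
from pika import exceptions

QUEUE = "reactor-example"      # hypothetical queue name
HOST = "rabbitmq.internal"     # hypothetical broker host

def handle(channel, method, properties, body):
    print("received", body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

while True:
    try:
        connection = pika.BlockingConnection(pika.ConnectionParameters(host=HOST))
        channel = connection.channel()
        channel.queue_declare(queue=QUEUE, durable=True)
        channel.basic_consume(queue=QUEUE, on_message_callback=handle)
        channel.start_consuming()
    except exceptions.AMQPError:
        time.sleep(5)  # queue or broker briefly unavailable; retry shortly
```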

Resolved

6 months ago

April 2018

30th April 2018 07:13:09 AM

Website availability issues

Some of the nodes serving the website failed at 7:13 today.

30th Apr 07:28 AM

Website issues were resolved by an automated restart of the Heroku dyno. The core realtime system does not use Heroku and was unaffected.

Resolved

6 months ago
19th April 2018 11:34:29 PM

Heavy load in us-east-1 has caused an increase in latencies

We have seen a significant and sudden increase in load in us-east-1 since 23:30 UTC today.

We are manually over-provisioning capacity in this region and will be investigating the cause of this sudden increase in load. Once identified, we'll look into how we can more effectively prepare for spikes like this.

20th Apr 12:06 AM

Unfortunately after provisioning more capacity, a second wave of traffic has arrived.
We are provisioning again.

20th Apr 12:26 AM

We have identified the customer who is largely responsible for this traffic spike and have resolved the issue in the affected regions. Over the next few days we'll be adding additional limits to prevent usage patterns like this impacting other accounts in the multi-tenanted cluster.
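As a simplified illustration of the kind of limit we mean (not the exact limits we will apply), a per-account token bucket caps how quickly any one account can send requests, so a single tenant's spike cannot starve other accounts in the multi-tenanted cluster; the rates below are hypothetical:

```python
# Hypothetical sketch: a per-account token bucket. Each account may burst
# up to CAPACITY requests and is then limited to RATE requests per second.
import time
from collections import defaultdict

RATE = 100.0      # illustrative steady-state requests per second
CAPACITY = 500.0  # illustrative burst allowance

_buckets = defaultdict(lambda: {"tokens": CAPACITY, "last": time.monotonic()})

def allow(account_id: str) -> bool:
    """Return True if this account's request is within its limit."""
    bucket = _buckets[account_id]
    now = time.monotonic()
    elapsed = now - bucket["last"]
    bucket["last"] = now
    # Refill tokens for the time elapsed, never exceeding the burst capacity.
    bucket["tokens"] = min(CAPACITY, bucket["tokens"] + elapsed * RATE)
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # over the limit; reject or defer the request
```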

Resolved

6 months ago