
Incident log archive

All times are shown in UTC

December 2018

4th December 2018 07:35:00 PM

Reactor queues partitioned

Our internal Reactor queues currently appear to be experiencing a network partition. Consuming from queues may temporarily not work. No other services are affected; realtime messaging is operating as normal.

4th Dec 11:36 PM

Queues are now up and running normally. For a long time the partitioned state failed to heal, even after several rolling restarts. Ultimately we made the decision to stop and restart the entire cluster, which then took longer than expected due to an infrastructure issue.

We run our queue servers in 'pause minority' mode, which prioritises data integrity and preservation of messages already in the queue over availability to consumers and new publishers connecting to the minority side of the cluster.
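
In practice, consumers connected to the paused minority see temporary connection failures rather than message loss, so consumer code only needs to back off and reconnect. As an illustrative sketch only (the AMQP endpoint, queue name and use of the pika client library below are assumptions for illustration, not details from this incident):

    # Illustrative sketch: a queue consumer that tolerates the temporary
    # unavailability caused by pause-minority partition handling by
    # backing off and reconnecting. The URL and queue name are placeholders.
    import time
    import pika

    AMQP_URL = "amqps://APP_KEY@queue-endpoint.example:5671/shared"  # placeholder
    QUEUE_NAME = "example-app:example-queue"  # placeholder

    def on_message(channel, method, properties, body):
        print("received:", body)
        channel.basic_ack(delivery_tag=method.delivery_tag)

    while True:
        try:
            connection = pika.BlockingConnection(pika.URLParameters(AMQP_URL))
            channel = connection.channel()
            channel.basic_consume(queue=QUEUE_NAME, on_message_callback=on_message)
            channel.start_consuming()
        except pika.exceptions.AMQPConnectionError:
            # The paused (minority) side refuses connections; wait and retry.
            time.sleep(5)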

We'll be investigating the causes of the partition, why it failed to heal, and the issues we experienced in doing a stop-the-world restart, and will post any updates here.

Resolved

3rd December 2018 03:22:00 PM

Availability glitches in several regions

We are investigating abnormal availability metrics in multiple regions

3rd Dec 04:07 PM

The us-east-1 region is continuing to experience a very high rate of errors, so we are now redirecting all traffic away from us-east-1 to us-west-1.

3rd Dec 04:18 PM

Error rates are quickly returning to normal. There are still some residual issues, which we are resolving before we investigate the root cause of the faults in us-east-1 that triggered the high error rates.

3rd Dec 04:38 PM

Error rates and availability metrics are now back to normal.

We are currently running with one fewer region than normal. Traffic that would have gone to us-east-1 is now going to us-west-1. We will update this incident once we have identified the fault and brought the region back online.
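
For illustration, the redirection described above amounts to excluding an unhealthy region when choosing where to route traffic. A minimal sketch of that idea (this is not Ably's routing code; the regions, error rates and threshold are assumed values):

    # Simplified sketch of routing traffic away from an unhealthy region.
    # Not Ably's implementation; regions, error rates and the threshold
    # are illustrative values only.
    ERROR_RATE_THRESHOLD = 0.05  # a region above 5% errors is unhealthy

    REGIONS = {
        "us-east-1": {"error_rate": 0.42, "fallback": "us-west-1"},
        "us-west-1": {"error_rate": 0.01, "fallback": "us-east-1"},
    }

    def route(region: str) -> str:
        """Return the region that should serve traffic intended for `region`."""
        if REGIONS[region]["error_rate"] > ERROR_RATE_THRESHOLD:
            return REGIONS[region]["fallback"]
        return region

    assert route("us-east-1") == "us-west-1"  # unhealthy: redirected
    assert route("us-west-1") == "us-west-1"  # healthy: served locally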

3rd Dec 06:42 PM

Traffic is now being routed to us-east-1 again.

5th Dec 12:10 PM

Please see the preliminary incident report on this incident: https://gist.github.com/paddybyers/3e215c0aa0aa143288e4dece6ec16285

Resolved


November 2018

30th November 2018 07:04:24 PM

Our automated health check system has reported an issue with realtime cluster health in ap-southeast-2-a

This incident was created automatically by our health check system, which has identified a fault. We are now looking into this issue.

30th Nov 07:05 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

Resolved


October 2018

30th October 2018 01:35:00 PM

Global incident causing increased latencies and error rates worldwide

An incident on 30 October caused increased error rates from 14:00 UTC until 16:30 UTC, with residual issues remaining until 19:00 UTC.

Please read the full post mortem below.

30th Oct 02:04 PM

Update at 14:03 (UTC): we are routing traffic away from us-east to other datacenters while we investigate the issue

30th Oct 02:36 PM

Update at 14:30 (UTC): All regions are now affected by the present issues due to cascading failures. We are very sorry for the inconvenience and are trying to restore normal service as fast as possible.

30th Oct 05:09 PM

Update at 16:03 (UTC): All regions are now stable and error rates have dropped dramatically. We will continue to inspect each region manually to ensure any residual issues are fully resolved.

30th Oct 07:00 PM

Update at 17:30 (UTC): Error rates have been consistently back to normal in all regions. We believe all operations are back to normal. We will be completing a full post mortem now to understand what caused the global disruption, and importantly what caused the ripple effect from one region to another.

31st Oct 11:37 AM

Update at 10:30 + 1 day (UTC): Our focus today is on addressing the immediate issues we identified yesterday that caused the significant increase in error rates and latencies globally. Whilst a post mortem is being prepared and will be published in due course, our priority at present is to resolve the underlying issue and prevent a repeat of the incident.

We are currently rolling out an update globally in all regions and expect this will take a few hours to complete.

2nd Nov 01:01 PM

We have published a summary of the incident affecting the Ably production cluster on 30 October 2018, which includes our preliminary investigation and conclusions.

Please see https://gist.github.com/mattheworiordan/2abab206ee4e4859010da0375bcf4b1d for the report.

As mentioned in the report, we are sorry to all our customers for the disruption this resulted in, and are doing everything we can to ensure we are better equipped to deal with incidents more quickly and without disruption in future.

If you have any questions, please do get in touch with us at https://www.ably.io/contact

Resolved

11th October 2018 11:05:33 AM

Increased error rates in us-east-1

We are seeing an increase in error rates in us-east-1 presently.

This appears to be related to a significant increase in load in the us-east-1 region, and to the additional capacity brought online by our autoscaling systems in response.

We're investigating the problem now.

11th Oct 11:30 AM

Error rates have returned to normal.

We will continue to investigate the root cause of the increased error rates that lasted approximately 15 minutes.

11th Oct 07:04 PM

We have completed a post mortem on what happened.

At the time of the issue, we deployed an update to the realtime platform that contained a regression relating to the progressive adoption of load as services come online during scaling operations. This meant that, for a short period of time early in the life of a new server, a small proportion of the channels handled by that server had errors. Externally, this meant that a small fraction of channels were inaccessible in us-east for a period of about 20 minutes.
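
For context, progressive load adoption means a newly started server accepts only a growing fraction of its eventual channel load while it warms up. A minimal sketch of the general idea, assuming a simple linear ramp (not the actual implementation):

    # Minimal sketch of progressive load adoption: a newly started server
    # accepts only a fraction of the channels it would normally own, and
    # that fraction grows with uptime. The 10-minute linear ramp is assumed.
    import time

    RAMP_SECONDS = 600  # assumed warm-up window

    class WarmingServer:
        def __init__(self):
            self.started_at = time.monotonic()

        def accepted_fraction(self) -> float:
            """Fraction of eligible channels this server should accept now."""
            age = time.monotonic() - self.started_at
            return min(1.0, age / RAMP_SECONDS)

        def should_accept(self, channel_slot: float) -> bool:
            """channel_slot is a stable hash of the channel mapped into [0, 1)."""
            return channel_slot < self.accepted_fraction()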

The bug has since been fixed and we're looking into ways to add more simulated tests covering the progressive load feature.

Resolved

1st October 2018 04:46:00 PM

Increased error rates in eu-west-1 and eu-central-1 regions

This is a continuation of the incident at https://status.ably.io/incidents/561, which was mistakenly marked as resolved. All notes from the previous incident are included below.

We are investigating increased error rates in our two European data centers.

1st Oct 04:42 PM

Whilst AWS is not yet reporting any issues on its status page or Personal Health Dashboard, we believe this issue is caused by a fault in the eu-central-1 and possibly eu-west-1 AWS regions. We believe this because dedicated clusters, which are isolated from the global traffic in those regions, are also affected.

We will continue to investigate the issue and take action to minimise the impact.

1st Oct 04:52 PM

AWS continues to report no issues, yet Twitter confirms the issues are widespread.

We have seen that eu-west-1 continues to exhibit problems, so we are routing traffic away from eu-west-1 for the time being.

1st Oct 05:08 PM

We are seeing stability return in eu-west-1 and eu-central-1. All eu-west-1 traffic is still being routed to other regions. Once eu-west-1 settles fully we'll redirect traffic back.

We are now investigating any residual issues caused by the partitions and instability, to ensure there is no longer-term impact on customer traffic.

1st Oct 05:23 PM

Error rates have returned to normal in all regions apart from eu-central-1.

We are investigating the errors in eu-central-1 now, although the error rate in that region is now very low.

1st Oct 05:52 PM

As AWS issues in eu-west-1 and eu-central-1 continue (they have now confirmed networking issues), we are re-routing all traffic for the global cluster away from all EU regions. Please note that traffic which must remain in the EU (EU-only storage options, compliance reasons, etc.) will unfortunately continue to be routed to these clusters.

1st Oct 06:47 PM

We believe the EU regions are now reaching stability and intend to route traffic back to EU in the next 15 minutes once some final testing is complete.

1st Oct 07:22 PM

All global traffic is now being routed back to both EU datacenters, and everything appears to be normal. We'll continue to monitor closely now.

Resolved

1st October 2018 04:26:00 PM

Increased error rates in eu-central-1 and eu-west-1

We are investigating increased error rates in our two European data centers.

1st Oct 04:42 PM

Whilst AWS is not yet reporting any issues on its status page or Personal Health Dashboard, we believe this issue is caused by a fault in the eu-central-1 and possibly eu-west-1 AWS regions. We believe this because dedicated clusters, which are isolated from the global traffic in those regions, are also affected.

We will continue to investigate the issue and take action to minimise the impact.

1st Oct 04:52 PM

AWS continues to report no issues, yet Twitter confirms the issues are widespread.

We have seen that eu-west-1 continues to exhibit problems, so we are routing traffic away from eu-west-1 for the time being.

1st Oct 05:08 PM

We are seeing stability return in eu-west-1 and eu-central-1. All eu-west-1 traffic is still being routed to other regions. Once eu-west-1 settles fully we'll redirect traffic back.

We are now investigating any residual issues caused by the partitions and instability, to ensure there is no longer-term impact on customer traffic.

1st Oct 05:09 PM

Please note this issue is not resolved; it was mistakenly closed. We will continue in incident https://status.ably.io/incidents/562

Resolved


September 2018

17th September 2018 02:40:39 PM

Increased error rates in California (us-west-1)

At 14:40 today an instance came online in the us-west-1 region.

Shortly after bootstrapping, the instance's performance degraded significantly, although it continued to service most requests. From a consensus and health-checking perspective this is hugely problematic: the instance did the bulk of the work it was supposed to perform, yet an unacceptable number of requests failed.

We have since fixed the issue, but will now need to revisit how we manage our automated health checks to detect similar failures in the future.
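
One way to detect this class of failure is to health-check on the recent error rate rather than on liveness alone. A minimal sketch of that approach, with an assumed window size and threshold (not our production health check):

    # Minimal sketch: mark an instance unhealthy when its recent error rate
    # is unacceptable, even though it is still servicing most requests.
    # The 1,000-request window and 1% threshold are assumptions.
    from collections import deque

    class ErrorRateHealthCheck:
        def __init__(self, window: int = 1000, max_error_rate: float = 0.01):
            self.results = deque(maxlen=window)  # True = request succeeded
            self.max_error_rate = max_error_rate

        def record(self, success: bool) -> None:
            self.results.append(success)

        def healthy(self) -> bool:
            if not self.results:
                return True  # no traffic yet; nothing to judge
            failure_rate = self.results.count(False) / len(self.results)
            return failure_rate <= self.max_error_rate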

We will continue to investigate and post updates here, along with a post mortem in due course.

17th Sep 04:00 PM

Error rates have returned back to normal.

We will continue to investigate the root cause and update with a post mortem.

Resolved

13th September 2018 06:10:00 PM

Increased latency and error rates due to issues with the persistence layer

We are seeing high error rates and latencies worldwide, due to issues with our Cassandra persistence layer instances in ap-southeast-1 and us-east-1. We are investigating.

Latencies and error rates are now back to normal. The issue was caused by routine maintenance in our Cassandra persistence layer that generated significantly more work than expected, leading to contention. We will consider using a different strategy moving forwards to reduce the amount of work needed during maintenance.

Resolved

4th September 2018 12:07:49 PM

Increased error rates seen in America East, Ireland and Singapore

We are seeing an increase in error rates in the last 5 minutes in us-east-1 (North Virginia) and eu-west-1 (Ireland) regions.

We are investigating.

4th Sep 12:16 PM

The error rates and latency have returned to normal (after 4 minutes). We believe this was caused by a deploy issue which we are investigating.

Resolved

3rd September 2018 02:05:00 AM

Website encryption failure

During a routine upgrade of the website, we are seeing reports of encryption failures and 500 errors being returned to our website users.

We are reverting the upgrade and expect the site should be online in the next few minutes.

3rd Sep 11:36 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

4th Sep 01:20 AM

The upgrade has been rolled back whilst we investigate the root cause of the issue.

In total, the website was offline for some users for no more than 15 minutes.

Resolved


August 2018

17th August 2018 09:45:29 AM

Website intermittent failures during upgrade

We are in the process of upgrading our website with a whole suite of new changes that will allow customers to better manage their limits and account, and to access support.

Whilst the migrations from the old schema to our new schema are running, some customers may experience intermittent failures.

We expect the migrations to be completed within 10 minutes.

17th Aug 10:04 AM

The database migrations are now complete and all operations are back to normal.

Resolved

14th August 2018 08:13:43 AM

High error rates during emergency revert of faulty deployment

At 5:50 UTC, a routine deploy caused error rates to climb in eu-central-1. See https://status.ably.io/incidents/553 for details.

Following an investigation into the root cause, we have decided that it is safer to revert the recent deploy on all instances globally until we fully understand the root cause of the issue.

We have noticed error rates briefly increase in some regions (for periods of up to around one minute) whilst this code is being reverted.

We'll confirm once this process is completed.

14th Aug 09:00 AM

The new code has been reverted now on all nodes and error rates are back to normal.

We'll be completing a post mortem on this issue and will post updates within this incident report.

16th Aug 06:00 PM

We have completed our post mortem on what happened:

The underlying cause was that we deployed code that was effectively broken. The specific issue was related to interoperability in a mixed cluster, and arose in a specific configuration that we had not tested sufficiently before release.

When we deployed the code to eu-central-1, we saw a very large spike in system load whose root cause was the bug; however, because of the cascade of effects when the system was hugely overloaded, we were overwhelmed by logs containing secondary errors, so we didn't recognise the root cause initially.

The load then triggered autoscaling and recovery of nodes across the cluster which, due to a procedural error in making the deploy, caused other nodes to be restarted with the broken code.

We were able to recover the system by suspending recovery, finding the root cause bug, and reverting affected nodes.

However, moving forwards, we are going to improve our deployment strategy so that we can better test multi-region mixed cluster environments.

We apologize for the inconvenience this caused.

Resolved

14th August 2018 05:50:03 AM

eu-central-1 service disruption

Since 05:50 UTC there has been a service disruption, with increased error rates and latencies in eu-central-1. Recovery action is underway and service is returning to normal.

More information will be posted here in due course.

14th Aug 07:42 AM

The root cause was a routine deploy in that region, which triggered an unexpected spike in CPU load across the routing layer for the entire eu-central-1 region. Error rates returned to normal at 6:41 UTC.

We are continuing to investigate why this deploy caused the disruption in this region.

14th Aug 09:15 AM

The code that caused this issue was rolled back in all regions across all instances where it had been deployed. See https://status.ably.io/incidents/554 for subsequent incident detailing the rollback process.

We will continue to investigate and publish a full post mortem in due course in incident https://status.ably.io/incidents/554.

Resolved


July 2018

31st July 2018 10:12:36 PM

eu-west-1 second spike

Following a spike we saw around 15 minutes ago, we raised an incident at https://status.ably.io/admin/incidents/550. Within 4 minutes the issue was resolved, and we marked the incident as resolved so that we could then focus on a post mortem. Unfortunately we are now experiencing a second spike in this region and will need to investigate things more thoroughly.

During this time, nodes in eu-west-1 are rejecting some traffic, causing it to be rerouted to other regions.

31st Jul 10:24 PM

Until we understand the underlying issue fully, we are routing traffic away from the eu-west-1 datacenter as error rates are still high.

31st Jul 10:49 PM

We are confident we have sufficient information now to understand and fix the root cause of the issue.

Our Cassandra upgrades will now be put on hold until a permanent fix is deployed to the global cluster.

We will shortly be directing traffic back to the eu-west-1 cluster. In the meantime, all traffic in the EU is being handled by our datacenter in Germany.

31st Jul 10:56 PM

All systems are back to normal in eu-west-1 Ireland as of 22:51.

We have a fix that we will deploy and test in the coming days, before we continue our scheduled Cassandra upgrades.

All traffic is now being routed back to eu-west-1 as normal.

Resolved
