
Incident log archive

All times are shown in UTC

February 2021

10th February 2021 07:45:00 PM

Mobile push notifications are delayed

Due to an ongoing AWS outage in US East 1, we are experiencing delays in push notification delivery.

We are now going to force all push notifications to skip the US East region until AWS resolves the underlying load balancing issues reported as "12:19 PM PST We are investigating increased connectivity issues for Network Load Balancers within the US-EAST-1 Region."

10th Feb 08:48 PM

All push services are back to normal.

19th Feb 10:18 AM

The report summarising our investigation and conclusions for this incident is now available: https://gist.github.com/paddybyers/47e3f4490330b3c8735f643e8e5ed923

Resolved

in about 1 hour
10th February 2021 07:08:00 PM

Connections to US data centers are failing

Connections to us-east-1 are currently all failing. We are investigating why.

EU and AP datacenters are operating as normal; client libraries using fallbacks should automatically retry against one of those regions, so clients should experience only elevated latency.
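
For illustration, a minimal sketch of a REST publish that benefits from this fallback behaviour, assuming the ably-js 1.x callback API; the API key and channel name are placeholders, not values from this incident. The library itself retries a failed request against fallback endpoints, so application code only sees the final outcome.

```typescript
import * as Ably from 'ably';

// Placeholder credentials and channel name, for illustration only.
const rest = new Ably.Rest({ key: 'appId.keyId:keySecret' });
const channel = rest.channels.get('example-channel');

// If the nearest endpoint is unreachable, the library retries the request
// against its fallback hosts before reporting an error to the application.
channel.publish('greeting', { text: 'hello' }, (err) => {
  if (err) {
    console.error('Publish failed after exhausting fallbacks:', err);
  } else {
    console.log('Publish succeeded');
  }
});
```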

10th Feb 07:54 PM

AWS are still not reporting any underlying issues. This pattern is not unusual; we frequently identify problems before AWS acknowledge them.

We have identified the problem as a TLS termination issue in the load balancing layers of the US East clusters. We are reporting this to AWS now, but will continue to redirect all traffic away from US East 1 until AWS resolve the underlying issue.

10th Feb 08:16 PM

After investigating the issue in more depth, we know that the load balancing TLS issue is isolated to a single availability zone. We continue to liaise with AWS on the issue, and will continue to direct traffic to other regions to ensure stability for all customers.

10th Feb 08:29 PM

AWS have finally acknowledged the issue with their elastic load balancers, more than 1 hour after we detected the problem and took action to address the fault.
Given this is now acknowledged, we will continue to direct traffic away from US East 1.

11th Feb 01:23 AM

Given the AWS load balancers in US East have been stable for a few hours, we've progressively migrated all traffic back to the US East region.
During this time we discovered some material performance issues for some customers consuming from the Reactor Queues in US East 1.
We are now starting to compile a full post mortem and will post an update as soon as possible.

12th Feb 09:27 AM

There have been no recurrences of the AWS NLB issue and traffic to us-east is fully enabled. Our investigation is ongoing and we will publish our incident report early next week.

19th Feb 09:42 AM

The report summarising our investigation and conclusions for this incident is now available: https://gist.github.com/paddybyers/47e3f4490330b3c8735f643e8e5ed923

Resolved

in about 6 hours
1st February 2021 07:58:00 PM

Reduced performance in eu-west-1 for 5 minutes

Between 19:58:30 and 20:03:30, requests to eu-west-1 may have experienced slow response times due to a load spike. All other regions were unaffected, so any failed requests will have been retried in other regions.

Resolved

in 5 minutes

January 2021

19th January 2021 05:00:00 PM

Delays in push message delivery and reactor queue delivery

Due to unexpectedly high load, mobile push messages (that is, to APNS/FCM) may have experienced delivery delays in the last hour or so.

Additionally, between about 17:50 and 17:55 UTC, consumers of some Reactor queues may have experienced delays.

The core service (normal pub/sub etc) was unaffected.

Everything should now be back to normal.

Resolved

in about 1 hour

December 2020

9th December 2020 09:31:00 AM

Reactor queue admin operations (creation, listing, deletion) are intermittently failing

We are investigating intermittent failures of Reactor queue management operations on the website (creation, listing, deletion) since around 09:30 UTC this morning. One symptom is that listing queues may incorrectly show that you have no queues.

Non-management operations (pushing into queues, consuming from queues) are unaffected.

Update at 10:38: All queue management operations are back to normal. We apologise for the inconvenience.

Resolved

in about 1 hour
4th December 2020 12:10:00 PM

Some Reactor queues briefly unavailable to consumers

Between about 12:10 and 12:15 UTC, a subset of Reactor queues were unavailable to consumers, due to one RabbitMQ server becoming unavailable. We use mirrored queues with a replication factor of 2, and only one node was affected, so no messages were lost. However, consumers whose queues had primaries on the affected node may have been unable to consume for a few minutes; such consumers would have been rejected with the error `home node 'rabbit@queue-b3d24b09.queue-production-us-east-1-a-internal.ably.io' of durable queue ':' in vhost '/shared' is down or inaccessible`.
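
For consumers hit by an error like the one above, one pragmatic approach is to reconnect with a short backoff until the queue's home node recovers. A minimal sketch using the amqplib package follows; the endpoint URL, queue name, and retry interval are illustrative assumptions, not Ably-prescribed values.

```typescript
import * as amqp from 'amqplib';

// Placeholder endpoint and queue name, for illustration only.
const QUEUE_URL = 'amqps://KEY_NAME:KEY_SECRET@us-east-1-a-queue.ably.io:5671/shared';
const QUEUE_NAME = 'appId:example-queue';

async function consumeWithRetry(): Promise<void> {
  for (;;) {
    try {
      const connection = await amqp.connect(QUEUE_URL);
      connection.on('error', (err) => console.error('Connection error:', err));

      const channel = await connection.createChannel();
      await channel.consume(QUEUE_NAME, (msg) => {
        if (msg) {
          console.log('Received:', msg.content.toString());
          channel.ack(msg);
        }
      });

      // Wait here until the connection drops, e.g. because the queue's
      // home node has become unavailable.
      await new Promise<void>((resolve) => connection.on('close', () => resolve()));
    } catch (err) {
      console.error('Consumer error:', err);
    }
    // Back off briefly before reconnecting so an unhealthy node is not hammered.
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}

consumeWithRetry().catch((err) => console.error('Fatal:', err));
```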

Resolved

in 5 minutes

October 2020

9th October 2020 02:38:25 PM

Potential pre-emptive action for AWS US-East capacity problems

AWS have been reporting issues with new instances coming online in the US East region.

This has had no impact on our service in this region; however, we are actively monitoring it. If the situation changes such that we are not confident there is enough capacity to service the traffic, we will likely route traffic away from US East until AWS EC2 instances are stable in that region.

9th Oct 08:14 PM

Amazon reported that they have now resolved the issues in US East 1, and we have resumed normal operations in all regions.

The update from AWS on the root cause of this problem is as follows:

Starting at 9:37 PM PDT on October 8th, we experienced increased network connectivity issues for a subset of instances within a single Availability Zone (use1-az2) in the US-EAST-1 Region. This was caused by a single cell within the subsystem responsible for the updating VPC network configuration and mappings experiencing elevated failures. These elevated failures caused network configuration and mappings to be delayed or to fail for new instance launches and attachments of ENIs within the affected cell. The issue has also caused connectivity issues between an affected instance in the affected Availability Zone(use1-az2) and newly launched instances within other Availability Zones in the US-EAST-1 Region, since updated VPC network configuration and mappings were not able to be updated within the affected Availability Zone(use1-az2). The root cause of the issue was addressed and at 10:20 AM PDT on October 9th, we began to see recovery for the affected instances. By 11:10 AM PDT, all affected instances had fully recovered. The issue has been resolved and the service is operating normally

Resolved

in about 7 hours

September 2020

1st September 2020 09:18:00 AM

Publishing issues in us-west-1

A small fraction of publishes failed between 09:18 and 09:36 UTC with the error "Service Unavailable (server at capacity limit)" due to resource constraints in the us-west-1 region. We identified the issue and increased the resources available to accommodate a spike in load on the system.

Resolved

in 18 minutes

August 2020

19th August 2020 11:37:00 PM

Disruption in us-west-1

Networking issues in AWS us-west-1 between 23:37 and 00:13 UTC led to elevated error rates and channel continuity losses in that region.

Customer-visible effects will have consisted of:
- a small proportion of connections connected to the us-west-1 region will have timed out or been disconnected; these may then have reconnected to another region and failed to resume their connection, experiencing a continuity loss on their channels
- a small proportion of channels that were active in the us-west-1 region will have migrated to other instances as the affected instances (which appear to have had all networking cut for an extended period) were detected as unhealthy and removed from the cluster. During this period publishes to those channels may have been refused or timed out, and channels which lost continuity due to the disruption will have notified attached clients of the continuity loss (in the form of an 'update' event, see https://www.ably.io/documentation/realtime/channels#listening-state; a brief sketch of listening for these events follows this list).
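
As a brief illustration of the 'update' event mentioned above, here is a minimal sketch, assuming the ably-js 1.x API; the API key and channel name are illustrative placeholders rather than anything specific to this incident.

```typescript
import * as Ably from 'ably';

// Placeholder credentials and channel name, for illustration only.
const realtime = new Ably.Realtime({ key: 'appId.keyId:keySecret' });
const channel = realtime.channels.get('example-channel');

// An 'update' state event with resumed === false indicates a continuity loss:
// messages may have been missed, so the application may need to recover state
// (for example by querying channel history).
channel.on('update', (stateChange) => {
  if (!stateChange.resumed) {
    console.warn('Channel continuity lost:', stateChange.reason);
  }
});

channel.subscribe((message) => {
  console.log('Received:', message.name, message.data);
});
```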

We are partway through a larger piece of work that will further decouple channel regions from each other, such that when networking issues affect a single region, activity in other regions will be completely unaffected. Unfortunately this is not yet complete, and in the current architecture clients in other regions attached to channels active in us-west-1 may also have experienced continuity losses.

Resolved

in 36 minutes

July 2020

28th July 2020 08:25:00 AM

Sporadic high latencies in us-east-1

We are investigating intermittent high latencies in REST requests to us-east-1. Other datacenters are unaffected.

28th Jul 09:20 AM

We have temporarily redirected traffic away from us-east-1 to other datacenters.

28th Jul 09:35 AM

The issue is due to an AWS problem with DNS resolution from within us-east-1: https://status.aws.amazon.com/. We are leaving traffic redirected away from us-east-1 until that is resolved. (Any connections that were already established to us-east-1 will remain for now.)

28th Jul 11:41 AM

Following AWS reporting the DNS resolution latency issues as fixed on their end, which we have confirmed, we are now re-enabling traffic to us-east-1.

The total customer-visible impact should have been minimal: occasional latency spikes for requests to us-east-1 between the onset of the AWS issue and when we redirected traffic away from that region at 08:15 UTC, and fractionally higher latencies for customers near us-east-1 who were redirected to us-west-1 for the duration.

Resolved

in about 3 hours
22nd July 2020 07:08:19 AM

Database errors across all regions

We are seeing very high load in the database layer that is affecting all regions and are currently investigating.

22nd Jul 07:55 AM

Within 20 minutes of active management the load on our global database has returned to normal levels.

Our initial investigation indicates that customers using our Push registration and delivery APIs were most affected during this period.

We are investigating the root cause of this issue now and will continue to post updates as we know more.

We apologise for any inconvenience this may have caused.

Resolved

in 21 minutes
19th July 2020 12:41:20 PM

Scheduled website database maintenance

We're performing scheduled Redis and PostgreSQL database maintenance. Customer dashboards and the website will be unavailable for a few minutes during the maintenance window, and notifications might be delayed by a few minutes. The realtime systems will not be affected by the migration.

Resolved

in about 2 hours

June 2020

19th June 2020 06:24:26 PM

Support ticketing site maintenance

We are currently migrating our ticketing system and FAQ site support.ably.io to support.ably.com.

During this migration, there will be some disruption for some customers.

We expect this to be completed within 30 minutes.

If you have any issues, please contact us via live chat on www.ably.io

19th Jun 06:57 PM

The DNS migration to our 3rd party provider Freshdesk is now complete.

Resolved

in 33 minutes

May 2020

6th May 2020 10:24:00 AM

Push notifications processing stalled

The processing of push notifications is stalled. We are currently investigating the cause.

We will post an update here in 15 minutes, or as soon as there is more information.

6th May 11:32 AM

A fix for this problem is being deployed now and we are monitoring the situation.

6th May 11:34 AM

The service is back to normal.

Resolved

in about 1 hour

April 2020

23rd April 2020 10:00:53 PM

Alert emails and other website notifications stalled

A website problem caused the sending of various automated emails, including limit notifications and welcome emails, to stall. The majority of emails arising from 09:22 UTC on Thursday 23 April were stalled and backlogged until the service was unblocked at 12:05 UTC on Monday 27 April. All backlogged emails were eventually sent by 18:00 UTC.

The system is now operating normally.

An investigation is continuing into the circumstances that led to the problem, and to the extended time for resolution. This incident will be updated in due course with more information.

7th May 11:23 AM

Our engineering and ops teams have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.

See https://gist.github.com/paddybyers/c27d302524caa8e46f41e9ba19fdcf2e

Resolved

in 4 days