All times are shown in UTC
Due to unexpectedly high load, mobile push messages (that is, to APNS/FCM) may have experienced delivery delays in the last hour or so.
Additionally, between about 17:50 and 17:55 UTC, consumers of some Reactor queues will have had a chance of experiencing delays.
The core service (normal pub/sub etc) was unaffected.
Everything should now be back to normal.
Resolved
in about 1 hourWe are looking into reactor queue management operations on the website (creation, listing, deletion) intermittently failing since around 9:30 UTC this morning, which can cause listing queues to incorrectly show that you have no queues.
Non-management operations (pushing into queues, consuming from queues) are unaffected.
Update at 10:38: All queue management operations are back to normal. We apologise for the inconvenience.
Resolved
in about 1 hourBetween about 12:10 and 12:15 UTC, a subset of Reactor queues were unavailable for consumers, due to one rabbitmq server becoming unavailable. We use mirrored queues with a replication factor of 2, and only one node was affected, so no messages were lost. However, consumers whose queues have primaries on the affected node may have been unable to consume for a few minutes; the consumer would have been rejected with error `home node 'rabbit@queue-b3d24b09.queue-production-us-east-1-a-internal.ably.io' of durable queue ':' in vhost '/shared' is down or inaccessible`.
Resolved
in 5 minutesAWS have been reporting issues with new instances coming online in the US East region.
This has had no impact on our service in this region however we are actively monitoring it. If the situation changes such that we are not confident there is enough capacity to service the traffic, we will likely route traffic away from US East until AWS EC2 instances are stable in that region.
Amazon reported that they have now resolved the issues in US East 1, and we have resumed normal operations again in all regions.
The update from AWS on the root cause of this problem is as follows:
Starting at 9:37 PM PDT on October 8th, we experienced increased network connectivity issues for a subset of instances within a single Availability Zone (use1-az2) in the US-EAST-1 Region. This was caused by a single cell within the subsystem responsible for the updating VPC network configuration and mappings experiencing elevated failures. These elevated failures caused network configuration and mappings to be delayed or to fail for new instance launches and attachments of ENIs within the affected cell. The issue has also caused connectivity issues between an affected instance in the affected Availability Zone(use1-az2) and newly launched instances within other Availability Zones in the US-EAST-1 Region, since updated VPC network configuration and mappings were not able to be updated within the affected Availability Zone(use1-az2). The root cause of the issue was addressed and at 10:20 AM PDT on October 9th, we began to see recovery for the affected instances. By 11:10 AM PDT, all affected instances had fully recovered. The issue has been resolved and the service is operating normally
Resolved
in about 7 hoursA small fraction of publishes failing between 09:18 and 09:36 UTC with error "Service Unavailable (server at capacity limit)" due to resource issues in the us-west-1 region. We identified the issue and increased the number of resources available to accommodate a spike in load on the system.
Resolved
in 18 minutesNetworking issues in aws us-west-1 between 23:37 and 00:13 UTC lead to elevated error rates and channel continuity losses in us-west-1.
Customer-visible effects will have consisted of:
- a small proportion of connections connected to us-west-1 region will have timed out or been disconnected, and may then reconnected to another region and failed to resume their connection, so experiencing a continuity loss on their channels
- a small proportion of channels that were active in the us-west-1 region will have migrated to other instances as the affected instances (which seem to have had all networking cut for an extended period) were detected as unhealthy and removed from the cluster. During this period publishes to those channels may be been refused or timed out, and channels which lost continuity due to the disruption will notify attached clients of continuity losses (in the form of an 'update' event, see https://www.ably.io/documentation/realtime/channels#listening-state).
We are in the process of a larger piece of work that will further decouple channel regions from each other, such that when networking issues affect a single region activity in other regions will be completely unaffected; unfortunately this is not yet complete, and in the current architecture clients in other regions attached to channels active in us-west-1 may also have experienced continuity losses.
Resolved
in 36 minutesWe are investigating intermittent high latencies in REST requests to us-east-1. Other datacenters are unaffected.
We have temporarily redirected traffic away from us-east-1 to other datacenters
28th Jul 09:35 AMThe issue is due to AWS issues with resolving dns from within us-east-1: https://status.aws.amazon.com/ . We are leaving traffic redirected away from us-east-1 until that is resolved. (Any connections that were already connected to us-east-1 will remain for now)
28th Jul 11:41 AMFollowing AWS reporting the dns resolution latency issues as fixed on their end, which we have confirmed, we are now re-enabling traffic to us-east-1.
Total customer-visible effect should have been very little, other than occasional latency spikes for requests to us-east-1 in the time between the AWS issue beginning and when we redirected traffic away from that region at 08:15 UTC, and fractionally higher latencies for customers near us-east-1 who were redirected to us-west-1 for the duration.
Resolved
in about 3 hoursWe are seeing very high load in the database layer that is affecting all regions and are currently investigating.
Within 20 minutes of active management the load on our global database has returned to normal levels.
Our initial investigation indicates that customers using our Push registration and delivery APIs were most affected during this period.
We are investigating the root cause of this issue now and will continue to post updates as we know more.
We apologise for any inconvenience this may have caused.
Resolved
in 21 minutesWe're performing scheduled Redis and PostgreSQL database maintenance. Customer dashboards and the website will be unavailable for a few minutes during the maintenance window and notifications might be delayed be a few minutes. The realtime systems will not be affected be the migration.
Resolved
in about 2 hoursWe are currently migrating our ticketing system and FAQ site support.ably.io to support.ably.com.
During this migration, there will be some disruption for some customers.
We expect this to be completed within 30 minutes.
If you have any issues, please contact us via live chat on www.ably.io
DNS migration is now complete with 3rd party provider Freshdesk.
Resolved
in 33 minutesThe processing of push notifications is stalled. We are currently investigating the cause.
We will make na update here in 15 minutes, or as soon as there is more information.
A fix for this problem is being deployed now and we are monitoring the situation.
6th May 11:34 AMThe service is back to normal.
Resolved
in about 1 hourA website problem has been causing the sending of various automated emails, including limit notifications and welcome emails, to be stalled. The majority of emails arising from 0922 UTC on Thursday 23 April were stalled and backlogged until the service was unblocked at 1205 UTC on Monday 27 April. All backlogged emails were eventually sent by 1800 UTC.
The system is now operating normally.
An investigation is continuing into the circumstances that led to the problem, and to the extended time for resolution. This incident will be updated in due course with more information.
Our engineering and ops teams have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.
See https://gist.github.com/paddybyers/c27d302524caa8e46f41e9ba19fdcf2e
Resolved
in 4 daysOur website (www.ably.io) is experiencing availability issue due to issue with our hosting provider, Heroku. The realtime service is unaffected and remains fully operational in all regions.
Heroku seems to be having ongoing issues; there's no explanation in https://status.heroku.com/incidents/1973. We will continue to monitor the situation.
Resolved
in 5 minutesOur website (www.ably.io) is experiencing availability issue due to issue with our hosting provider, Heroku.
The realtime service is unaffected and remains fully operational in all regions.
Resolved
in 12 minutesWe are currently experiencing timeouts from the website for some async operations.
- Stats in dashboards
- Blog feeds in the navigation
- Some queries for keys, queues, rules in the dashboards
We are investigating the root cause, but rolling back to a previous version now.
A downstream provider of our database and web services performed maintenance on our database earlier today, which required a change in the endpoint used by all web services. Unfortunately the update was only made to one of the two web services required, which caused the async operations to fail during this period.
The issue is now fully resolved, and we'll be investigating why this update was only applied to one of the two web services.
Resolved
in 10 minutes