All times are shown in UTC
Fallback realtime hosts (*.ably-realtime.com), the Ably website, and the CDN (affecting website assets and library access) are having availability problems due to Cloudflare issues: https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr
The primary realtime hosts (rest.ably.io, realtime.ably.io) do not use Cloudflare and are still working fine, so the service is still up.
We are in the process of bypassing Cloudflare on selected high-priority hosts (the website, status site, and CDN).
Update 14:05 UTC: Cloudflare has been bypassed for the website, status site, and CDN. Fallback hosts are still going through Cloudflare, but as the primary hosts are all up (and have been the whole time), this should have no effect on service status.
Cloudflare is back up, so fallback hosts are now responding as normal.
Resolved in 36 minutes
The Ably website and CDN (affecting website assets and library access) are having availability problems due to the global Cloudflare outage. We are redirecting away from Cloudflare and service should resume shortly.
All Cloudflare-mediated endpoints have been moved away from Cloudflare.
Resolved in 11 minutes
We are re-routing fallback endpoints at the moment.
More information as we have it.
Fallback endpoints are now restored, circumventing Cloudflare.
Resolved in about 1 hour
We are seeing a high number of timeouts in the us-west-1 datacenter at present.
We are investigating the root cause of the issue. If this issue is not resolved soon, we will temporarily redirect traffic away from the us-west-1 datacenter until the underlying issues are resolved.
20th Apr 09:19 AM
All intermittent timeouts in the us-west-1 region have stopped since 09:21 UTC.
We believe the underlying issue was a networking issue in the underlying AWS datacenter, but have not been able to confirm that. However, for the last hour, the datacenter appears to be healthy.
Clients closest to us-west-1 that experienced timeouts would have been automatically reconnected to an alternative datacenter via our automatic fallback capability. See https://support.ably.io/a/solutions/articles/3000044636 for more details.
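The automatic fallback behaviour mentioned above can be sketched as a simple model: try the primary endpoint first, then randomised fallback hosts. This is an illustrative sketch only, not the Ably client library implementation; the host names and the `attempt_connection` callable are assumptions for the example.

```python
import random

# Hypothetical endpoint lists for illustration; real client libraries
# maintain their own primary and fallback host configuration.
PRIMARY_HOST = "realtime.ably.io"
FALLBACK_HOSTS = [
    "a.ably-realtime.com",
    "b.ably-realtime.com",
    "c.ably-realtime.com",
]

def choose_hosts():
    """Return the ordered list of hosts a client would try:
    the primary first, then the fallbacks in random order."""
    fallbacks = FALLBACK_HOSTS[:]
    random.shuffle(fallbacks)
    return [PRIMARY_HOST] + fallbacks

def connect_with_fallback(attempt_connection):
    """Try each host in turn, returning the first successful connection.

    `attempt_connection` is a callable taking a hostname and either
    returning a connection object or raising TimeoutError.
    """
    last_error = None
    for host in choose_hosts():
        try:
            return attempt_connection(host)
        except TimeoutError as exc:
            last_error = exc  # this host timed out; try the next one
    raise ConnectionError("all hosts failed") from last_error
```

Randomising the fallback order spreads retry load across the fallback fleet instead of stampeding a single host when the primary endpoint is unhealthy.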
Resolved in about 1 hour
We have routed traffic away temporarily from our us-east-1 datacenter whilst we investigate the cause of the increased errors in our us-east-1 datacenter. All traffic is being routed automatically to the nearest datacenters.
The us-east-1 (North Virginia) datacenter is healthy and active again. Traffic is now being routed to this datacenter as normal.
Resolved in about 1 hour
In the last three days (since 26th February), a small proportion of channels experienced timeouts when querying channel history, if a message had been published in the query's region within 16 seconds of the query being made. We are rolling out a fix now and looking into why this was not caught by our test suites. We apologise for the inconvenience.
Resolved in 4 days
Users connected at the ap-southeast-1 region (Asia Singapore) may have experienced elevated latencies and/or errors in the past half hour.
Other data centers were unaffected, so client libraries should have transparently redirected traffic to another datacenter through normal fallback functionality.
As a precaution, we have shut down that data center; users who normally connect to ap-southeast-1 will now likely connect to ap-southeast-2 (Australia), us-west-1 (California), or eu-central-1 (Frankfurt).
The ap-southeast-1 region is now fully operational once again.
We're continuing to investigate the underlying issue.
Resolved in 34 minutes
One of our Cassandra nodes in us-east-1 is down due to an underlying hardware fault. This transiently caused some errors on a percentage of our realtime service nodes connected to that faulty node. The server has now been isolated.
The faulty node has now been fully removed from the cluster, and all data has been successfully replicated to a new healthy node.
Resolved in 28 minutes
We have been alerted to higher than normal latencies due to a capacity issue in us-east-1, which we are working to resolve.
Update 15:51 UTC: We have severe capacity issues worldwide due to a sudden inability to bootstrap new instances. We are working to fix this as soon as possible.
We have identified the underlying issue: a dependency on a third-party system that, by design, should not have impacted our ability to add capacity, but did due to an internal bug. We have applied hot fixes to all environments and rolled this out globally. Error rates are dropping rapidly and latencies are reducing; however, there are still some residual issues we are manually resolving.
3rd Jan 05:33 PM
We are still experiencing issues in us-east-1, which are causing higher than normal error rates in that region. We believe the issue is caused by an unstable gossip node reporting inconsistent ring states to the cluster.
3rd Jan 06:36 PM
Having identified the issue as relating to gossip and ring state inconsistencies, we are rolling out new gossip nodes across every region, which is rapidly resolving the issues.
3rd Jan 06:47 PM
We have stabilised the gossip and ring state globally now, and error rates have reduced dramatically. There are a few nodes that are still emitting channel errors, which we are investigating.
3rd Jan 08:45 PM
Latencies and error rates are back to normal in all regions.
We sincerely apologise to customers who were affected by the incident, and will be posting a post-mortem once the investigation has completed.
9th Jan 12:21 AM
We have completed the investigation of this incident and have written up a full post mortem at https://gist.github.com/paddybyers/3e215c0aa0aa143288e4dece6ec16285
Any customers who have any questions or would like to discuss this incident should get in touch with the support and sales team at https://www.ably.io/support.
Once again, we sincerely apologise to customers who were affected by this incident. We are doing everything we can to learn from this incident and ensure that the service remains fully operational moving forwards.
Resolved in about 5 hours
Our internal Reactor queues appear to currently be experiencing a network partition. Consuming from queues may temporarily not work. No other services are affected; realtime messaging is operating as normal.
Queues are now up and running normally. For a long time the partitioned state failed to heal, even after several rolling restarts. Ultimately we made the decision to stop and restart the entire cluster, which then took longer than expected due to an infrastructure issue.
We run our queue servers in 'pause minority' mode, which prioritises data integrity and preservation of messages already in the queue over availability: nodes on the minority side of a partition pause, refusing connections from both consumers and new publishers.
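The pause-minority trade-off described above can be illustrated with a small model. This is a sketch of the general pause-minority rule used by clustered queue systems such as RabbitMQ, not our actual queue server configuration; the function names are hypothetical.

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """A node that can see only a minority of the cluster (itself
    included) pauses, refusing consumers and publishers, rather than
    risk a split-brain that could lose or duplicate queued messages."""
    return visible_nodes * 2 <= cluster_size

def partition_outcome(cluster_size, partition_sizes):
    """For each side of a network partition, report whether its nodes
    pause (True) or continue serving (False)."""
    assert sum(partition_sizes) == cluster_size
    return [should_pause(size, cluster_size) for size in partition_sizes]

# Example: a 5-node cluster split 3/2 by a partition. The 2-node
# minority pauses, preserving message integrity, while the 3-node
# majority stays available to consumers and publishers.
```

Note that in an even split (e.g. 2/2) neither side has a majority, so both sides pause; this is the availability cost accepted in exchange for data integrity.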
We'll be investigating the causes of the partition, why it failed to heal, and the issues experienced in doing a stop-the-world restart, and post any updates here.
Resolved in about 5 hours
We are investigating abnormal availability metrics in multiple regions
The us-east-1 region is continuing to experience a very high rate of errors, so we are now redirecting all traffic away from us-east-1 to us-west-1.
3rd Dec 04:18 PM
Error rates are quickly returning to normal. There are still some residual issues which we are resolving first before we investigate the root cause of the faults in us-east-1 that triggered the high error rates.
3rd Dec 04:38 PM
Error rates and availability metrics are now back to normal.
We are currently running with one fewer region than normal. Traffic that would have gone to us-east-1 is now going to us-west-1. We will update this issue once we have identified the fault and brought the region back online.
3rd Dec 06:42 PM
Traffic is now being routed to us-east-1 again.
5th Dec 12:10 PM
Please see the preliminary incident report on this incident: https://gist.github.com/paddybyers/3e215c0aa0aa143288e4dece6ec16285
Resolved in about 1 hour
This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.
Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.
Resolved in 1 minute
An incident on 30 October caused increased error rates from 14:00 UTC until 16:30 UTC, with residual issues remaining until 19:00 UTC.
Please read the full post mortem below.
Update at 14:03 (UTC): We are routing traffic away from us-east to other datacenters while we investigate the issue.
30th Oct 02:36 PM
Update at 14:30 (UTC): All regions are now affected by the present issues due to cascading failures. We are very sorry for the inconvenience and are trying to restore normal service as fast as possible.
30th Oct 05:09 PM
Update at 16:03 (UTC): All regions are now stable and error rates have dropped dramatically. We will continue to inspect each region manually to ensure any remaining issues are forcibly resolved.
30th Oct 07:00 PM
Update at 17:30 (UTC): Error rates have been consistently back to normal in all regions. We believe all operations are back to normal. We will be completing a full post mortem now to understand what caused the global disruption, and importantly what caused the ripple effect from one region to another.
31st Oct 11:37 AM
Update at 10:30 + 1 day (UTC): Our focus today is on addressing the immediate issues we identified yesterday that caused the significant increase in error rates and latencies globally. Whilst a post mortem is being prepared and will be published in due course, our priority at present is to resolve the underlying issue and prevent a repeat of the incident.
We are currently rolling out an update globally in all regions and expect this will take a few hours to complete.
2nd Nov 01:01 PM
We have published a summary of the incident affecting the Ably production cluster on 30 October 2018, which includes our preliminary investigation and conclusions.
Please see https://gist.github.com/mattheworiordan/2abab206ee4e4859010da0375bcf4b1d for the report.
As mentioned in the report, we are sorry to all our customers for the disruption this resulted in, and are doing everything we can to ensure we are better equipped to deal with incidents more quickly and without disruption in future.
If you have any questions, please do get in touch with us at https://www.ably.io/contact
Resolved in about 4 hours
We are seeing an increase in error rates in us-east-1 presently.
This appears to be related to a significant increase in load in the us-east-1 region, and the additional capacity brought online by our autoscaling systems.
We're investigating the problem now.
Error rates have returned to normal.
We will continue to investigate the root cause of the increased error rates that lasted approximately 15 minutes.
11th Oct 07:04 PM
We have completed a post mortem on what happened.
At the time of the issue, we deployed an update to the realtime platform that contained a regression relating to the progressive adoption of load as services come online during scaling operations. This meant that, for a short period of time early in the life of a new server, a small proportion of the channels handled by that server had errors. Externally, this meant that a small fraction of channels were inaccessible in us-east for a period of about 20 minutes.
The bug has since been fixed and we're looking into ways to add more simulation tests covering the progressive load feature.
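The progressive adoption of load referenced in the post mortem can be sketched roughly as follows. The linear ramp, the warm-up window, and the function names are illustrative assumptions for the example, not Ably's actual algorithm.

```python
def traffic_weight(age_seconds: float, warmup_seconds: float = 300.0) -> float:
    """Fraction of its full traffic share a newly started server should
    receive, ramping linearly from 0 to 1 over the warm-up window."""
    if age_seconds >= warmup_seconds:
        return 1.0
    return max(age_seconds, 0.0) / warmup_seconds

def route_weights(server_ages, warmup_seconds=300.0):
    """Normalised routing weights for a pool of servers of varying age.
    New servers start with a small share and grow into a full share."""
    raw = [traffic_weight(age, warmup_seconds) for age in server_ages]
    total = sum(raw)
    return [w / total for w in raw] if total else raw
```

A regression in logic like this shows up exactly as described above: errors confined to a small fraction of channels, early in the life of each newly launched server.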
Resolved in 22 minutes
This is a continuation of the incident at https://status.ably.io/incidents/561, which was mistakenly marked as resolved. All notes from the previous issue are below.
We are investigating increased error rates in our two European data centers.
1st Oct 04:42 PM
Whilst AWS is not yet reporting any issues on its status or personal health dashboards, we believe this issue is caused by a fault in the eu-central-1 and possibly eu-west-1 AWS regions. We believe this because dedicated clusters, isolated from the global traffic in those regions, are also affected.
We will continue to investigate the issue and take action to minimise the impact.
1st Oct 04:52 PM
AWS continues to report no issues, yet Twitter confirms the issues are widespread.
We have seen that eu-west-1 continues to exhibit problems, so we are routing traffic away from eu-west-1 for now.
1st Oct 05:08 PM
We are seeing stability return in eu-west-1 and eu-central-1. All eu-west-1 traffic is still being routed to other regions. Once eu-west-1 settles fully we'll redirect traffic back.
We are now investigating any residual issues caused by the partitions and instability to ensure no longer term impact on customer traffic.
Error rates have returned to normal in all regions apart from eu-central-1.
We are investigating the errors in eu-central-1 now, although the error rate in that region is now very low.
1st Oct 05:52 PM
As AWS issues in eu-west-1 and eu-central-1 continue (AWS has now confirmed networking issues), we are re-routing all traffic for the global cluster away from all EU regions. Please note that any traffic restricted to the EU (EU-only storage options, compliance reasons, etc.) will unfortunately continue to be routed to these clusters.
1st Oct 06:47 PM
We believe the EU regions are now reaching stability and intend to route traffic back to the EU in the next 15 minutes, once some final testing is complete.
1st Oct 07:22 PM
All global traffic is now being routed back to both EU datacenters, and everything appears to be normal. We'll continue to monitor closely now.
Resolved in about 2 hours