Incident log archive

All times are shown in UTC

March 2020

16th March 2020 07:54:24 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku. The realtime service is unaffected and remains fully operational in all regions.

16th Mar 08:01 PM

Heroku seems to be having ongoing issues; there is no explanation yet at https://status.heroku.com/incidents/1973. We will continue to monitor the situation.

Resolved

in 5 minutes
16th March 2020 04:13:00 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku.

The realtime service is unaffected and remains fully operational in all regions.

Resolved

in 12 minutes
10th March 2020 10:27:00 PM

Website stats timeouts

We are currently experiencing timeouts from the website for some async operations.

- Stats in dashboards
- Blog feeds in the navigation
- Some queries for keys, queues, rules in the dashboards

We are investigating the root cause, but rolling back to a previous version now.

10th Mar 10:39 PM

A downstream provider of our database and web services performed maintenance on our database earlier today, which required a change to the endpoint used by all web services. Unfortunately, the update was made to only one of the two web services that required it, which caused the async operations to fail during this period.

The issue is now fully resolved, and we'll be investigating why this update was only applied to one of the two web services.
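
As a minimal sketch of the kind of consistency check that would catch this class of problem (the service names and endpoints here are hypothetical, not Ably's actual configuration):

    def find_stale_endpoints(expected: str, service_endpoints: dict) -> dict:
        # Return any services whose configured database endpoint differs from the expected one.
        return {name: ep for name, ep in service_endpoints.items() if ep != expected}

    # Hypothetical example: one web service was updated after the maintenance, the other was not.
    stale = find_stale_endpoints(
        "db-new.example.internal",
        {"dashboard-web": "db-new.example.internal", "stats-web": "db-old.example.internal"},
    )
    print(stale)  # -> {'stats-web': 'db-old.example.internal'}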

Resolved

in 10 minutes

February 2020

29th February 2020 09:23:00 PM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions, due to continued intermittent performance issues we're experiencing with our database layer.

29th Feb 09:36 PM

As with yesterday's incident, this one resolved itself after 9 minutes. We continue to investigate as a top priority.

1st Mar 11:31 PM

We have now identified the root cause of the recent latency issues in the global persistence layer, and have rolled out updates that have kept latencies consistently low.

The primary cause of the problem was an inadequate rate limiter in one area of our system, which allowed our persistence layer to become overloaded, impacting global service latencies for operations that rely on it (primarily history, push registrations, and persisted tokens).
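
For illustration only, the sketch below shows a minimal token-bucket rate limiter in Python; the class, rates, and call sites are hypothetical rather than Ably's actual implementation, but it shows the kind of guard that, when missing or under-sized, lets a burst of requests reach a persistence layer unthrottled.

    import time

    class TokenBucket:
        """Minimal token-bucket limiter: `rate` requests per second, bursts up to `capacity`."""
        def __init__(self, rate: float, capacity: float):
            self.rate = rate              # tokens added per second
            self.capacity = capacity      # maximum burst size
            self.tokens = capacity
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True               # request may proceed to the persistence layer
            return False                  # request is rejected or queued instead of overloading the database

    # Hypothetical usage: guard persistence-layer calls behind the limiter
    limiter = TokenBucket(rate=100, capacity=200)
    if limiter.allow():
        pass  # perform the history / push-registration / token query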

A full post mortem will follow soon.

10th Mar 08:23 PM

Our engineering and ops team have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.

See https://gist.github.com/pauln-ably/03098db1095f4ef61aac801ae987dac2

Resolved

in 11 minutes
28th February 2020 09:23:49 PM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions, due to continued intermittent performance issues we're experiencing with our database layer.

The issue resolved within 5 minutes.

We're continuing our investigation into the root cause of these intermittent performance issues.

29th Feb 12:43 AM

Following more than 36 hours of intermittent, short but significant increases in latencies in our persistence layer, the engineering team have been investigating the root cause to understand why only a small percentage of shards are affected at these times.

We have made significant progress in identifying potential root causes; in the meantime, we have also been addressing the issue by adding capacity and upgrading the entire cluster.

The persistence cluster is now operating with approximately 3x more capacity than it had 24 hours ago and is now upgraded to the latest stable versions.

We'll continue to investigate what triggered these increases in latencies, however we are optimistic that the increased capacity will now offer stability and predictable performance moving forward.

A full post mortem will be published soon.

10th Mar 08:25 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 8 minutes
28th February 2020 10:44:00 AM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions. Investigating.

28th Feb 12:10 PM

Latencies are back to normal as of 11:21 UTC

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 37 minutes
28th February 2020 06:57:22 AM

Performance issues in all regions due to database layer issues

We are seeing increased latencies and error rates in all regions due to database issues

28th Feb 07:16 AM

Error rates and latencies are back to normal. We are continuing to investigate the root cause.

28th Feb 08:51 AM

Service has continued with no further issues.

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 13 minutes
27th February 2020 09:23:00 PM

Performance issues in all regions due to database layer issues

We are investigating performance issues in all regions due to an issue with our database layer (Cassandra)

27th Feb 10:41 PM

We had elevated Cassandra latencies for 9 minutes, between 21:23 and 21:32 UTC. This is essentially the same issue that occurred earlier today; we are still investigating the root cause.

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 9 minutes
27th February 2020 11:05:42 AM

Performance issues in all regions due to database layer issues

We are investigating performance issues in all regions due to an issue with our database layer (Cassandra)

27th Feb 11:27 AM

Error rates have dropped back to normal. We are continuing to investigate.

27th Feb 01:11 PM

Error rates are back to normal. A small segment of the keyspace was unable to achieve quorum for a two-hour period; sufficient replicas are now back online to achieve quorum for the entire keyspace, and several more instances are in the process of being brought online. We will review our global replication strategy for this persistence layer as part of a post mortem.
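
For context, Cassandra reads and writes at QUORUM consistency need a majority of the replicas for a token range to respond, i.e. floor(RF / 2) + 1. The replication factors below are illustrative arithmetic only, not Ably's actual configuration.

    def quorum(replication_factor: int) -> int:
        # Cassandra QUORUM: a majority of the replicas for a token range must acknowledge
        return replication_factor // 2 + 1

    # Illustrative only: with RF = 3, quorum is 2, so if 2 of the 3 replicas for a
    # token range are offline, that segment of the keyspace cannot achieve quorum.
    assert quorum(3) == 2
    assert quorum(5) == 3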

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in about 2 hours

December 2019

3rd December 2019 04:30:00 PM

Minor transient disruption to channel lifecycle webhooks over the next day or two

Customers using channel lifecycle webhooks may experience some brief transient disruption (which in some cases may very briefly include duplicate or lost channel lifecycle webhooks) at some point over the next day or two, while we transition channel lifecycle webhooks to a new architecture (message rules on the channel lifecycle metachannel). The result will be more dependable channel lifecycle webhooks, as they will now get the reliability benefits of running on top of Ably's robust, globally distributed channels, rather than (as they were previously) all lifecycle events for an app being funnelled through a single point.

Resolved

in 2 days

September 2019

30th September 2019 05:40:00 AM

Capacity issues in ap-southeast-1 (Singapore) region

Since 05:40 UTC today, the cluster in the ap-southeast-1 region has been unable to obtain sufficient capacity to meet demand. As a result, slightly higher latencies are being experienced by connections in the region.

Until more capacity is available, we are diverting traffic to ap-southeast-2 (Sydney).

30th Sep 03:46 PM

AWS capacity has now come online in the Singapore region (ap-southeast-1). All traffic is being routed back to this region now.

Resolved

in about 10 hours
25th September 2019 11:54:00 AM

Elevated rate of 5xx errors in us-east-1

We had a higher than normal level of 5xx errors from our routing layer in us-east-1 between 11:54 and 13:17 UTC. We believe we have identified the issue, have instituted a workaround, and are working on a fix. Service should be generally unaffected as rejected requests will have been rerouted to other regions by our client library fallback functionality.
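
As a rough sketch of how client-side fallback can mask regional 5xx errors, the Python below tries a primary endpoint and then alternates. The host names and retry policy are hypothetical; this is not the actual Ably client library code.

    import requests

    # Hypothetical endpoint list: a primary host plus fallbacks served from other regions
    HOSTS = [
        "primary.example-realtime.net",
        "fallback-1.example-realtime.net",
        "fallback-2.example-realtime.net",
    ]

    def request_with_fallback(path: str) -> requests.Response:
        last_error = None
        for host in HOSTS:
            try:
                resp = requests.get(f"https://{host}{path}", timeout=5)
                if resp.status_code < 500:
                    return resp                  # success, or a client error not worth retrying elsewhere
                last_error = resp.status_code    # 5xx from this host: try the next one
            except requests.RequestException as exc:
                last_error = exc                 # network failure: try the next host
        raise RuntimeError(f"all hosts failed, last error: {last_error}")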

Resolved

in about 1 hour
25th September 2019 10:45:20 AM

EU performance issues

In both EU West and EU Central there was a sharp rise in load at 10:45 UTC, which subsided at 10:49 UTC (4 minutes later).

We have manually intervened to accelerate capacity provision, and our monitoring systems indicate traffic is being routed to other regions as expected whilst the capacity issue remains.

Resolved

in 4 minutes

July 2019

25th July 2019 10:00:00 AM

Issues in ap-southeast-2 (Sydney) due to data center connectivity issues

From 10:00 to 10:05 UTC, our ap-southeast-2 (Sydney) data center experienced connectivity issues with other data centers. Full connectivity was restored after five minutes; other data centers were unaffected.

Resolved

in 5 minutes
24th July 2019 01:13:22 AM

Our automated health check system has reported an issue with realtime cluster health in ap-southeast-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

24th Jul 01:14 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

24th Jul 10:14 AM

Due to a load spike, message transit latencies through the Asia Singapore datacenter may have been higher than normal for a period of around 10 minutes. The issue resolved itself automatically through autoscaling.

Resolved

in 5 minutes