Incident log archive

All times are shown in UTC

September 2020

1st September 2020 09:18:00 AM

Publishing issues in us-west-1

A small fraction of publishes failed between 09:18 and 09:36 UTC with the error "Service Unavailable (server at capacity limit)", due to resource issues in the us-west-1 region. We identified the issue and increased the resources available to accommodate a spike in load on the system.
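
For illustration, here is a minimal sketch of how a client using the ably-js 1.x callback API might surface this kind of publish failure; the channel name, key placeholder, and retry comments are assumptions, not details from the incident report.

import * as Ably from 'ably';

const realtime = new Ably.Realtime({ key: 'YOUR_ABLY_API_KEY' }); // placeholder key
const channel = realtime.channels.get('example-channel'); // illustrative channel name

channel.publish('greeting', { text: 'hello' }, (err) => {
  if (err) {
    // err is an ErrorInfo; a 503 statusCode corresponds to "Service Unavailable"
    console.error(`Publish failed: ${err.code} ${err.message}`);
    // an application might retry with backoff or queue the message here
  }
});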

Resolved

in 18 minutes

August 2020

19th August 2020 11:37:00 PM

Disruption in us-west-1

Networking issues in AWS us-west-1 between 23:37 and 00:13 UTC led to elevated error rates and channel continuity losses in us-west-1.

Customer-visible effects will have consisted of:
- a small proportion of connections connected to the us-west-1 region will have timed out or been disconnected, and may then have reconnected to another region and failed to resume their connection, experiencing a continuity loss on their channels
- a small proportion of channels that were active in the us-west-1 region will have migrated to other instances as the affected instances (which seem to have had all networking cut for an extended period) were detected as unhealthy and removed from the cluster. During this period publishes to those channels may have been refused or timed out, and channels which lost continuity due to the disruption will have notified attached clients of the continuity loss (in the form of an 'update' event, see https://www.ably.io/documentation/realtime/channels#listening-state and the illustrative sketch below).
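
For illustration, a minimal sketch of listening for the 'update' event to detect a continuity loss, assuming the ably-js library; the channel name and recovery comments are illustrative, not part of the incident report.

import * as Ably from 'ably';

const realtime = new Ably.Realtime({ key: 'YOUR_ABLY_API_KEY' }); // placeholder key
const channel = realtime.channels.get('example-channel'); // illustrative channel name

channel.on('update', (stateChange) => {
  if (!stateChange.resumed) {
    // Message continuity was lost; the application may need to re-sync its state,
    // for example by fetching recent history or reloading from its own backend.
    console.warn('Continuity lost on channel', channel.name, stateChange.reason);
  }
});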

We are in the process of a larger piece of work that will further decouple channel regions from each other, so that when networking issues affect a single region, activity in other regions is completely unaffected. Unfortunately this is not yet complete, and in the current architecture clients in other regions attached to channels active in us-west-1 may also have experienced continuity losses.

Resolved

in 36 minutes

July 2020

28th July 2020 08:25:00 AM

Sporadic high latencies in us-east-1

We are investigating intermittent high latencies in REST requests to us-east-1. Other datacenters are unaffected.

28th Jul 09:20 AM

We have temporarily redirected traffic away from us-east-1 to other datacenters.

28th Jul 09:35 AM

The issue is due to AWS problems with resolving DNS from within us-east-1: https://status.aws.amazon.com/. We are leaving traffic redirected away from us-east-1 until that is resolved. (Any connections that were already connected to us-east-1 will remain for now.)

28th Jul 11:41 AM

Following AWS reporting the DNS resolution latency issues as fixed on their end, which we have confirmed, we are now re-enabling traffic to us-east-1.

The total customer-visible effect should have been minimal: occasional latency spikes for requests to us-east-1 between the start of the AWS issue and when we redirected traffic away from that region at 08:15 UTC, and fractionally higher latencies for customers near us-east-1 who were redirected to us-west-1 for the duration.

Resolved

in about 3 hours
22nd July 2020 07:08:19 AM

Database errors across all regions

We are seeing very high load in the database layer that is affecting all regions and are currently investigating.

22nd Jul 07:55 AM

Within 20 minutes of active management, the load on our global database returned to normal levels.

Our initial investigation indicates that customers using our Push registration and delivery APIs were most affected during this period.

We are investigating the root cause of this issue now and will continue to post updates as we know more.

We apologise for any inconvenience this may have caused.

Resolved

in 21 minutes
19th July 2020 12:41:20 PM

Scheduled website database maintenance

We're performing scheduled Redis and PostgreSQL database maintenance. Customer dashboards and the website will be unavailable for a few minutes during the maintenance window, and notifications might be delayed by a few minutes. The realtime systems will not be affected by the migration.

Resolved

in about 2 hours

June 2020

19th June 2020 06:24:26 PM

Support ticketing site maintenance

We are currently migrating our ticketing system and FAQ site support.ably.io to support.ably.com.

During this migration, there will be some disruption for some customers.

We expect this to be completed within 30 minutes.

If you have any issues, please contact us via live chat on www.ably.io

19th Jun 06:57 PM

The DNS migration to third-party provider Freshdesk is now complete.

Resolved

in 33 minutes

May 2020

6th May 2020 10:24:00 AM

Push notifications processing stalled

The processing of push notifications is stalled. We are currently investigating the cause.

We will make an update here in 15 minutes, or as soon as there is more information.

6th May 11:32 AM

A fix for this problem is being deployed now and we are monitoring the situation.

6th May 11:34 AM

The service is back to normal.

Resolved

in about 1 hour

April 2020

23rd April 2020 10:00:53 PM

Alert emails and other website notifications stalled

A website problem stalled the sending of various automated emails, including limit notifications and welcome emails. The majority of emails generated from 09:22 UTC on Thursday 23 April were backlogged until the service was unblocked at 12:05 UTC on Monday 27 April. All backlogged emails were sent by 18:00 UTC.

The system is now operating normally.

An investigation is continuing into the circumstances that led to the problem, and to the extended time for resolution. This incident will be updated in due course with more information.

7th May 11:23 AM

Our engineering and ops teams have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.

See https://gist.github.com/paddybyers/c27d302524caa8e46f41e9ba19fdcf2e

Resolved

in 4 days

March 2020

16th March 2020 07:54:24 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku. The realtime service is unaffected and remains fully operational in all regions.

16th Mar 08:01 PM

Heroku seems to be having ongoing issues; there's no explanation yet at https://status.heroku.com/incidents/1973. We will continue to monitor the situation.

Resolved

in 5 minutes
16th March 2020 04:13:00 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku.

The realtime service is unaffected and remains fully operational in all regions.

Resolved

in 12 minutes
10th March 2020 10:27:00 PM

Website stats timeouts

We are currently experiencing timeouts from the website for some async operations.

- Stats in dashboards
- Blog feeds in the navigation
- Some queries for keys, queues, rules in the dashboards

We are investigating the root cause, but are rolling back to a previous version now.

10th Mar 10:39 PM

A downstream provider of our database and web services performed maintenance on our database earlier today, which required a change in the endpoint used by all web services. Unfortunately the update was applied to only one of the two web services, which caused the async operations to fail during this period.

The issue is now fully resolved, and we'll be investigating why this update was only applied to one of the two web services.

Resolved

in 10 minutes

February 2020

29th February 2020 09:23:00 PM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions, due to continued intermittent performance issues we're experiencing with our database layer.

29th Feb 09:36 PM

As with yesterday's incident, this one resolved itself after 9 minutes. We continue to investigate as a top priority.

1st Mar 11:31 PM

We have now identified the root cause of the recent latency issues in the global persistence layer, and have rolled out updates that have ensured latencies are consistently low.

The primary cause of the problem was an inadequate rate limiter in one area of our system, which allowed our persistence layer to become overloaded, impacting global service latencies for operations that rely on the persistence layer (primarily history, push registrations, and persisted tokens).
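
For illustration only, a minimal token-bucket sketch showing the general shape of the kind of rate limiting referred to above; this is not Ably's implementation, and the capacity and refill values are arbitrary examples.

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  // Returns true if the operation may proceed, false if it should be rejected or queued.
  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Example: allow roughly 100 persistence operations per second, with bursts up to 200.
const limiter = new TokenBucket(200, 100);
if (!limiter.tryAcquire()) {
  // reject or defer the request early rather than overloading the persistence layer
}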

A full post mortem will follow soon.

10th Mar 08:23 PM

Our engineering and ops team have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.

See https://gist.github.com/pauln-ably/03098db1095f4ef61aac801ae987dac2

Resolved

in 11 minutes
28th February 2020 09:23:49 PM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions, due to continued intermittent performance issues we're experiencing with our database layer.

The issue resolved within 5 minutes.

We're continuing our investigation into the root cause of these intermittent performance issues.

29th Feb 12:43 AM

Following more than 36 hours of intermittent, short but significant increases in latency in our persistence layer, the engineering team has been investigating the root cause to understand why only a small percentage of shards were affected during this time.

We have made significant progress in identifying potential root causes; in the meantime we have also been addressing the issues by adding capacity and upgrading the entire cluster.

The persistence cluster is now operating with approximately 3x more capacity than it had 24 hours ago and has been upgraded to the latest stable versions.

We'll continue to investigate what triggered these increases in latencies, however we are optimistic that the increased capacity will now offer stability and predictable performance moving forward.

A full post mortem will be published soon.

10th Mar 08:25 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 8 minutes
28th February 2020 10:44:00 AM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions. Investigating.

28th Feb 12:10 PM

Latencies are back to normal as of 11:21 UTC.

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 37 minutes
28th February 2020 06:57:22 AM

Performance issues in all regions due to database layer issues

We are seeing increased latencies and error rates in all regions due to database issues.

28th Feb 07:16 AM

Error rates and latencies are back to normal. We are continuing to investigate the root cause.

28th Feb 08:51 AM

Service has continued with no further issues.

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 13 minutes