Incident log archive

All times are shown in UTC

October 2016

28th October 2016 07:33:19 AM

.io top-level domain DNS issues affecting the website

Ongoing issues with Heroku being unable to resolve .io domains are currently affecting some functions of the website, most specifically the app dashboards.

Please see https://status.heroku.com/incidents/967 for more info on the underlying issue.

We are monitoring Heroku's progress on this issue.

28th Oct 07:35 AM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

28th Oct 09:50 AM

For the last 30 minutes we have had no reported errors from Heroku and all areas of the website are operating without issue.


Resolved in about 2 hours.

21st October 2016 04:00:00 PM

ably.io slow to respond due to Heroku/DynDNS issues; realtime service is fine

Our website (ably.io) is currently not responding for some customers due to a DynDNS outage affecting Heroku; see https://www.dynstatus.com/incidents/nlr4yrr162t8

The realtime service does not use DynDNS; it is unaffected and running smoothly.

21st Oct 06:17 PM

The ably.io website is now bypassing Heroku's DynDNS, and is operating normally. (The realtime system continued operating as normal throughout)

21st Oct 06:53 PM

See https://blog.ably.io/routing-around-single-point-of-failure-dns-issues-7c20a8757903#.o7p0bbpdy

For interested customers, the article explains how Ably has mitigated these types of global DNS failures, which have affected large companies such as GitHub and Twitter.


Resolved in about 3 hours.

13th October 2016 12:00:00 PM

SSL Certificate intermittent issues

A small number of users from 13 Oct until 18 Oct were affected by a GlobalSign failure with their certificate revocation lists.

Unfortunately, GlobalSign is the provider of Ably's certificate, so there was little we could do but wait for the caches to be cleared and the issue to subside.

See https://support.ably.io/solution/articles/3000059255--the-security-certificate-has-been-revoked-error-connecting-to-ably for more info.


Resolved in 5 days.

September 2016

26th September 2016 12:01:42 PM

Global intermittent networking issues

Our monitoring systems are reporting network timeouts globally, apparently affecting all US regions, even across independent services we host with different providers.

As yet, we are unclear what is causing this, but it appears to be an AWS or network routing issue rather than any problem within our datacenters or service.

A high level of AWS errors are being reported online at http://downdetector.com/status/aws-amazon-web-services

We're monitoring the situation and will see what we can do to alleviate the problem. However, if the underlying issues are related to core network routing issues, we are unfortunately unable to do anything about this as we rely on a working network in the affected regions.

26th Sep 12:08 PM

The networking issues have resolved themselves and error rates have dropped off.

Amazon has not yet reported any reason for the issues, but we expect they'll do a post-mortem and post an update later today.

The number of incidents reported by other parties was very high during the time we experienced the networking issues. See http://downdetector.com/status/aws-amazon-web-services

We'll write up a post mortem once our upstream providers do the same.


Resolved in 28 minutes.

19th September 2016 02:40:14 PM

us-east-1 region performance issues

Error rates in the us-east-1 datacenter have risen in the last 15 minutes.

We appear to be experiencing an unprecedented volume of traffic to this region alone.

We are manually provisioning more hardware to minimise the impact whilst we diagnose the root cause of the significant increased load.

Customers using us-east-1 may be experiencing some failures at present. Client libraries should automatically route around these issues and use alternative datacenters.
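The routing-around behaviour mentioned above can be sketched roughly as follows. This is an illustrative Python sketch only, not Ably's actual client library code; the hostnames and the `attempt` callback are hypothetical:

```python
# Illustrative sketch of client-side datacenter fallback; not Ably's
# actual client code. Hostnames and the attempt() callback are made up.

def connect_with_fallback(primary, fallbacks, attempt):
    """Try the primary endpoint first, then each fallback in turn."""
    last_error = None
    for host in [primary] + fallbacks:
        try:
            return attempt(host)
        except ConnectionError as err:
            last_error = err  # endpoint unhealthy; try the next one
    raise ConnectionError(f"all endpoints failed: {last_error}")

# Usage: simulate a failing primary region.
def fake_attempt(host):
    if host == "us-east-1.example.com":
        raise ConnectionError("us-east-1 unavailable")
    return f"connected to {host}"

result = connect_with_fallback(
    "us-east-1.example.com",
    ["us-west-1.example.com", "eu-west-1.example.com"],
    fake_attempt,
)
print(result)  # connected to us-west-1.example.com
```

The key design point is that fallback is transparent to the caller: the application sees a successful connection or a single final error, never the intermediate failures.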

19th Sep 02:58 PM

All operations are now back to normal.

Our initial investigation suggests that the autoscaling system that provisions hardware as demand increases was not proactive enough for this pattern of usage. We will investigate the usage patterns that led to the significantly increased load on the servers in us-east-1 and adjust our autoscaling to be more proactive under these conditions.
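As a rough illustration of why a purely reactive scaler lags a sudden spike: capacity is computed from load the scaler has already observed, so a fast enough spike outruns instance boot time. The per-server capacity and headroom figures below are invented for the example, not Ably's real numbers:

```python
import math

def desired_capacity(current_load, per_server=100, headroom=0.3):
    """Servers needed to keep utilisation below 1 - headroom.
    A reactive scaler computes this from load it has *already* seen,
    so a sharp spike outruns it. Figures are illustrative only."""
    target_per_server = per_server * (1.0 - headroom)
    return math.ceil(current_load / target_per_server)

# Sized for a steady-state load of 500 units...
print(desired_capacity(500))   # 8
# ...but when load triples faster than new instances can boot, the
# existing fleet runs hot until the scaler catches up.
print(desired_capacity(1500))  # 22
```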


Resolved in 15 minutes.

8th September 2016 09:00:00 PM

Timeouts on some channels in US-East

For around 45 minutes, multiple channels experienced issues (slow performance and 'client ready timeout' errors), especially for customers in the US East Virginia region. Everything is now back to normal. We will update this incident with more details once we've done a postmortem.


Resolved in about 1 hour.

4th September 2016 01:20:45 PM

Network connectivity and partitioning issues in Asia

We are seeing a high rate of errors connecting to and from ap-southeast-1 (Singapore) at present from other data centers.

As a result, this is having intermittent impact globally when our masters for apps, accounts and channels are placed in this region.

We are investigating the problem now. If necessary, we will take the entire data center ap-southeast-1 offline.

4th Sep 01:40 PM

The networking issues were isolated to a single instance which has now been recycled.

All network operations are back to normal now.


Resolved in 18 minutes.

August 2016

31st August 2016 04:28:34 PM

Frankfurt eu-central TLS termination issues

Our Frankfurt eu-central-1 data center is no longer terminating TLS connections correctly. This is an Amazon AWS issue and we have raised a support request with them to resolve the issue.

In the meantime, we are routing all traffic to other data centers until the issue is resolved by Amazon.

This should have very little impact on users, as eu-west-1 will take over most requests destined for eu-central with minimal added latency.
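The failover decision described above amounts to routing each user to the lowest-latency datacenter that has not been withdrawn. A minimal sketch; the latency figures are invented for illustration, not measured values:

```python
# Minimal sketch of latency-based failover routing; the latency
# figures are made up for illustration, not measured values.

LATENCY_MS = {          # approximate round-trip from a Frankfurt user
    "eu-central-1": 10,
    "eu-west-1": 25,
    "us-east-1": 95,
}

def pick_datacenter(latencies, disabled=frozenset()):
    """Route to the closest datacenter that has not been withdrawn."""
    healthy = {dc: ms for dc, ms in latencies.items() if dc not in disabled}
    return min(healthy, key=healthy.get)

# With eu-central-1 withdrawn, eu-west-1 takes over at modest extra latency.
print(pick_datacenter(LATENCY_MS, {"eu-central-1"}))  # eu-west-1
print(pick_datacenter(LATENCY_MS))                    # eu-central-1
```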

1st Sep 07:39 AM

The issue was identified as a single faulty router, which has now been removed.

The Frankfurt eu-central-1 data center has now been reinstated.


Resolved in about 15 hours.

26th August 2016 04:44:42 PM

Our automated health check system has reported an issue with realtime cluster health in eu-central-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.
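The incident-opening logic of an automated health check like this can be sketched as set arithmetic over probe results. This is illustrative only, not Ably's actual monitoring code:

```python
# Illustrative sketch of automated incident open/resolve logic;
# not Ably's actual monitoring system.

def run_health_check(probe_results, open_incidents):
    """probe_results maps region -> True (healthy) / False (failing).
    Returns (still_failing, newly_opened, newly_resolved) sets."""
    failing = {region for region, ok in probe_results.items() if not ok}
    opened = failing - open_incidents      # new faults: open an incident
    resolved = open_incidents - failing    # healthy again: auto-resolve
    return failing, opened, resolved

# Usage: eu-central-1-a starts failing, then recovers shortly after.
open_now, opened, _ = run_health_check(
    {"eu-central-1-a": False, "eu-west-1-a": True}, set())
print(opened)  # {'eu-central-1-a'}

open_now, _, resolved = run_health_check(
    {"eu-central-1-a": True, "eu-west-1-a": True}, open_now)
print(resolved)  # {'eu-central-1-a'}
```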

26th Aug 10:44 PM

The issue was intermittent in this region and resolved itself after approximately 10 minutes. During this time some users were unable to connect to eu-central-1 and were routed to other regions.


Resolved in 10 minutes.

22nd August 2016 06:56:19 PM

AWS issues in Asia are affecting our distributed database

AWS has reported EBS issues in a number of Asian data centres resulting in slow access to our data in those regions.

We're looking into this now.

22nd Aug 05:51 PM

An additional fault has been detected in ap-southeast-1-a

22nd Aug 05:52 PM

An additional fault has been detected in ap-southeast-2-a

22nd Aug 05:52 PM

An additional fault has been detected in ap-northeast-1-a


Resolved in about 2 hours.

22nd August 2016 06:06:37 PM

Asia data centres affected by AWS EBS issues

AWS has reported EBS issues in a number of Asian data centres resulting in slow access to our data in those regions.

We're looking into a fix for this until the EBS issues are resolved by Amazon.

22nd Aug 06:29 PM

The underlying disk issues in our database have been resolved, and performance in Asia is back to normal.


Resolved in 21 minutes.

20th August 2016 06:00:00 PM

South America datacentre shut down due to connectivity issues

Due to continuing connectivity issues with our sa-east-1 (South America) datacentre since 16:30 UTC today, which have been causing performance issues globally, we have decided to temporarily shut down that datacentre until we can discover the root cause.

Traffic from South America will be automatically routed to the closest available datacentre, probably US-East or US-West. Affected customers will experience slightly higher latency but should otherwise be unaffected.


Resolved in 2 days.

20th August 2016 04:30:00 PM

Issues with the South America East datacentre are causing performance problems worldwide

We are investigating and hope to resolve this shortly.


Resolved in about 1 hour.

17th August 2016 11:00:00 PM

Faulty node

Intermittent 500 responses were reported in our Sydney data center on Tue 16 Aug at 7:06 PM UTC.

We identified the problem with a single node in the ap-southeast-2 data center and resolved the problem by redeploying all nodes in that region.

We are investigating the root cause of the issue.


Resolved in about 1 hour.

16th August 2016 07:06:12 PM

Faulty node

Intermittent 504 responses were reported in us-east-1 on Tue 16 Aug at 7:06 PM UTC.

We identified the problem as a single faulty node in the us-east-1 data center and resolved it within 2 hours by restarting that node.

However, the root cause appears to be a bug in our RPC layer, so we have taken action to get the issue resolved upstream and to ensure we are notified of this type of fault in advance going forward.


Resolved in about 2 hours.