
Incident log archive

All times are shown in UTC

September 2016

19th September 2016 02:40:14 PM

us-east-1 region performance issues

Error rates in the us-east-1 datacenter have risen in the last 15 minutes.

We are experiencing an unprecedented volume of traffic directed at this region alone.

We are manually provisioning more hardware to minimise the impact whilst we diagnose the root cause of the significantly increased load.

Customers using the us-east-1 region may be experiencing some failures at present. Client libraries should automatically route around these issues and use alternative datacenters.
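The routing behaviour described above can be sketched as a simple try-in-order fallback. This is only an illustration of the pattern, not Ably's client library code, and the endpoint names are hypothetical placeholders:

```python
# Sketch of the fallback pattern described above: try the primary
# endpoint first, then each alternative datacenter in turn.
# Endpoint names are hypothetical placeholders.
PRIMARY = "https://us-east-1.example.com"
FALLBACKS = [
    "https://us-west-1.example.com",
    "https://eu-west-1.example.com",
]

def request_with_fallback(endpoints, fetch):
    """Return the first successful response, trying endpoints in order."""
    last_error = None
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise last_error  # every endpoint failed
```

In practice a real client also applies timeouts and retries and prefers the nearest healthy datacenter; this sketch only shows the try-in-order shape.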

19th Sep 02:58 PM

All operations are now back to normal.

Upon initial investigation, it appears that the autoscaling system that provisions additional hardware when demand increases was not proactive enough for this pattern of usage. We will now investigate the usage patterns that led to the significantly increased load on the servers in us-east-1 and adjust our autoscaling to be more proactive under these conditions.
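A more proactive policy of the kind described might, for example, scale on a rising load trend rather than waiting for the current level to cross a threshold. This is a hypothetical illustration, not Ably's actual autoscaling logic:

```python
def should_scale_up(recent_load, capacity, threshold=0.6, window=3):
    """Decide whether to provision more hardware (hypothetical policy).

    Reactive rule: scale once current load exceeds the threshold.
    Proactive rule: also scale when load is rising steeply, even
    before the threshold is reached.
    """
    if recent_load[-1] / capacity > threshold:
        return True  # reactive: already above the threshold
    recent = recent_load[-window:]
    rising = all(b > a for a, b in zip(recent, recent[1:]))
    doubled = len(recent) == window and recent[-1] >= 2 * recent[0]
    return rising and doubled  # proactive: steep upward trend
```

For example, `should_scale_up([10, 25, 55], capacity=100)` returns True even though load is still below 60% of capacity, because the trend is rising steeply.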

Resolved

in 15 minutes
8th September 2016 09:00:00 PM

Timeouts on some channels in US-East

For around 45 minutes, multiple channels experienced issues (slow performance and 'client ready timeout' errors), especially for customers in the US East Virginia region. Everything is now back to normal. We will update this issue with more details once we've done a postmortem.

Resolved

in about 1 hour
4th September 2016 01:20:45 PM

Network connectivity and partitioning issues in Asia

We are currently seeing a high rate of errors on connections between ap-southeast-1 (Singapore) and other data centers.

As a result, there is intermittent global impact whenever the masters for apps, accounts and channels are placed in this region.

We are investigating the problem now. If necessary, we will take the entire data center ap-southeast-1 offline.

4th Sep 01:40 PM

The networking issues were isolated to a single instance which has now been recycled.

All network operations are back to normal now.

Resolved

in 18 minutes

August 2016

31st August 2016 04:28:34 PM

Frankfurt eu-central TLS termination issues

Our Frankfurt eu-central-1 data center is no longer terminating TLS connections correctly. This is an Amazon AWS issue and we have raised a support request with them to resolve the issue.

In the meantime, we are routing all traffic to other data centers until the issue is resolved by Amazon.

This should have very little impact on users, as eu-west-1 will take over most requests destined for eu-central-1 with minimal added latency.

1st Sep 07:39 AM

The issue was traced to a single faulty router, which has now been removed.

The Frankfurt eu-central-1 data center has now been reinstated.

Resolved

in about 15 hours
26th August 2016 04:44:42 PM

Our automated health check system has reported an issue with realtime cluster health in eu-central-1-a

This incident was created automatically by our automated health check system as it has identified a fault. We are now looking into this issue.

26th Aug 10:44 PM

The issue was intermittent in this region and resolved itself after approximately 10 minutes. During this time some users were unable to connect to eu-central-1 and were routed to other regions.

Resolved

in 10 minutes
22nd August 2016 06:56:19 PM

AWS issues in Asia are affecting our distributed database

AWS has reported EBS issues in a number of Asian data centres resulting in slow access to our data in those regions.

We're looking into this now.

22nd Aug 05:51 PM

An additional fault has been detected in ap-southeast-1-a

22nd Aug 05:52 PM

An additional fault has been detected in ap-southeast-2-a

22nd Aug 05:52 PM

An additional fault has been detected in ap-northeast-1-a

Resolved

in about 2 hours
22nd August 2016 06:06:37 PM

Asia data centres affected by AWS EBS issues

AWS has reported EBS issues in a number of Asian data centres resulting in slow access to our data in those regions.

We're working on a mitigation until the EBS issues are resolved by Amazon.

22nd Aug 06:29 PM

The underlying disk issues in our database were resolved, and performance in Asia is now back to normal.

Resolved

in 21 minutes
20th August 2016 06:00:00 PM

South America datacentre shut down due to connectivity issues

Our sa-east-1 (South America) data centre has had continuing connectivity issues since 16:30 UTC today, which have been causing performance issues globally. We have decided to temporarily shut down that datacentre until we can discover the root cause.

Traffic from South America will be automatically routed to the closest alternative datacentre, most likely US-East or US-West. Affected users will experience slightly higher latency, but should otherwise be unaffected.

Resolved

in 2 days
20th August 2016 04:30:00 PM

Issues with the South America East datacentre are causing performance problems worldwide

We are investigating and hope to resolve this shortly.

Closed

in about 1 hour
17th August 2016 11:00:00 PM

Faulty node

Intermittent 500 responses were reported in our Sydney data center on Tue 16 Aug at 7:06 PM UTC.

We identified the problem with a single node in the ap-southeast-2 data center and resolved the problem by redeploying all nodes in that region.

We are investigating the root cause of the issue.

Resolved

in about 1 hour
16th August 2016 07:06:12 PM

Faulty node

Intermittent 504 responses were reported in us-east-1 on Tue 16 Aug at 7:06 PM UTC.

We identified the problem with a single node in the us-east-1 data center and resolved it within 2 hours by restarting the faulty node.

However, the root cause appears to be a bug in our RPC layer, so we have taken action to get the issue resolved upstream and to ensure we are notified of this type of fault in advance going forward.

Resolved

in about 2 hours

July 2016

17th July 2016 04:00:00 AM

Reliability issues for certain channels for users in some regions

Some users in certain regions, including ap-northeast, ap-southeast and sa-east, have been experiencing issues receiving messages on certain channels from users in other regions, due to an inter-region communication issue. We have applied a temporary fix and are preparing a permanent one. Anyone still experiencing problems should contact us as soon as possible.

Resolved

in about 10 hours
9th July 2016 10:26:39 PM

Heroku platform issues affecting reliability of our website platform

Our website which is hosted with Heroku was offline intermittently for around an hour due to Heroku issues.

9th Jul 10:27 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

Resolved

in about 1 hour

June 2016

14th June 2016 03:24:17 AM

Our automated health check system has reported an issue with realtime cluster health in ap-southeast-1-a

This incident was created automatically by our automated health check system as it has identified a fault.

This was caused by a temporary network issue between data centers.

14th Jun 03:30 AM

The temporary networking issue resolved itself.

Resolved

in 6 minutes
6th June 2016 12:04:55 PM

Cluster health issues

We are seeing some connection issues between our regions in Australia, Singapore and Oregon (US).

The impact of this is that those regions may experience some delays during this time.

28th May 12:48 PM

An additional fault has been detected in ap-southeast-2-a

28th May 12:48 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

7th Jun 02:02 PM

An additional fault has been detected in us-west-2-a

7th Jun 02:38 PM

Our health check system has reported this issue as resolved.
We will continue to investigate the issue and will update this incident shortly.

Resolved

in about 3 hours