Incident Details

All times are shown in UTC

11th October 2018 11:05:33 AM

Increased error rates in us-east-1

We are currently seeing increased error rates in us-east-1.

This appears to be related to a significant increase in load in the us-east-1 region and to the additional capacity brought online by our autoscaling systems.

We're investigating the problem now.

11th Oct 11:30 AM

Error rates have returned to normal.

We will continue to investigate the root cause of the increased error rates that lasted approximately 15 minutes.

11th Oct 07:04 PM

We have completed a post-mortem of the incident.

At the time of the issue, we had deployed an update to the realtime platform that contained a regression in the progressive adoption of load as services come online during scaling operations. As a result, for a short period early in the life of each new server, a small proportion of the channels handled by that server returned errors. Externally, this meant that a small fraction of channels were inaccessible in us-east-1 for about 20 minutes.

The bug has since been fixed, and we're looking into ways to add more simulated tests covering the progressive load feature.
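
For readers unfamiliar with the idea, the following is a minimal, hypothetical sketch of what progressive adoption of load can look like: a newly started server is given a reduced share of channel placements that ramps up linearly as it warms up. It is not the realtime platform's actual implementation. The function names, the 300-second warm-up window, and the 10% starting floor are all assumptions chosen purely for illustration.

```python
from typing import Dict


def warmup_weight(age_s: float, warmup_s: float = 300.0, floor: float = 0.1) -> float:
    """Return the fraction of a full load share a server of this age accepts.

    Hypothetical model: the weight starts at `floor` when the server comes
    online and rises linearly to 1.0 once `warmup_s` seconds have elapsed.
    """
    if age_s <= 0:
        return floor
    if age_s >= warmup_s:
        return 1.0
    return floor + (1.0 - floor) * (age_s / warmup_s)


def placement_shares(server_ages_s: Dict[str, float], warmup_s: float = 300.0) -> Dict[str, float]:
    """Normalise per-server warm-up weights into placement shares that sum to 1."""
    weights = {name: warmup_weight(age, warmup_s) for name, age in server_ages_s.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Two long-running servers and one that has just been added by autoscaling.
    ages = {"server-a": 3600.0, "server-b": 3600.0, "server-new": 0.0}
    for name, share in placement_shares(ages).items():
        print(f"{name}: {share:.3f}")
```

A simulated test for a feature like this could step a virtual clock through the warm-up window and check that the new server's share never exceeds that of a fully warmed server, and that all shares always sum to 1.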


Resolved in 22 minutes