All times are shown in UTC
We are currently seeing an increase in error rates in us-east-1.
This appears to be related to a significant increase in load in the us-east-1 region and to the additional capacity brought online by our autoscaling systems.
We're investigating the problem now.
Error rates have returned to normal.
We will continue to investigate the root cause of the increased error rates that lasted approximately 15 minutes.
11th Oct 07:04 PM
We have completed a post-mortem of the incident.
At the time of the issue, we deployed an update to the realtime platform that contained a regression relating to the progressive adoption of load as services come online during scaling operations. This meant that, for a short period of time early in the life of a new server, a small proportion of the channels handled by that server had errors. Externally, this meant that a small fraction of channels were inaccessible in us-east for a period of about 20 minutes.
The bug has since been fixed, and we're looking into ways to add more simulation tests covering the progressive load feature.
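For readers unfamiliar with the idea, progressive adoption of load means a newly started server takes only a small share of traffic at first and ramps up to a full share as it warms up. The following is a minimal, purely illustrative sketch of that idea and not the platform's actual code: the names `ramp_weight` and `load_shares`, the linear ramp, the 300-second ramp window and the 5% floor are all assumptions made for the example.

```python
def ramp_weight(age_seconds: float, ramp_seconds: float = 300.0, floor: float = 0.05) -> float:
    """Return the fraction of a full load share a server accepts at a given age.

    Hypothetical linear ramp: starts at `floor` and reaches 1.0 after `ramp_seconds`.
    """
    if ramp_seconds <= 0:
        raise ValueError("ramp_seconds must be positive")
    # Clamp progress to the range [0, 1] so old servers never exceed a full share.
    progress = min(max(age_seconds / ramp_seconds, 0.0), 1.0)
    return floor + (1.0 - floor) * progress


def load_shares(server_ages: dict[str, float]) -> dict[str, float]:
    """Normalise ramp weights into the share of new channels each server receives."""
    weights = {name: ramp_weight(age) for name, age in server_ages.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # One long-running server and one that has just come online after a scale-up.
    for name, share in load_shares({"old": 3600.0, "new": 0.0}).items():
        print(f"{name}: {share:.1%} of new channels")
```

A simulation test of the kind mentioned above could step the ages forward in time and assert that a new server's share never jumps faster than the ramp allows.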
Resolved in 22 minutes