All times are shown in UTC
At 14:40 today an instance came online in the us-west-1 region.
Shortly after bootstrapping, the performance of the instance degraded significantly, but continued to service most requests. From a consensus and health checking perspective, this is hugely problematic as the bulk of the work it was supposed to perform it did, but an unacceptable number of requests failed.
We have since fixed the issue, but will now need to revisit how we manage our automated health checks to detect similar failures in the future.
We will continue to investigate and post in updates, along with a post mortem in due course.
Error rates have returned back to normal.
We will continue to investigate the root cause and update with a post mortem.
Resolvedin about 1 hour