From 15:05 to 16:25 UTC on August 20, 2018, the DUO1 deployment experienced performance degradation that resulted in increased authentication latency for all customer applications protected by the Duo service on this deployment. In some cases, this latency caused authentications to time out and subsequently fail during this window.
As part of our rolling release process, Duo regularly makes new features and other general service improvements available to customers. Duo’s engineering team has designed and implemented processes that allow these types of changes to be made gradually and automatically, without any corresponding impact on end users or service availability. These processes are tested and exercised regularly, as Duo releases code every two weeks. As part of these processes, the latest version of our software was released to the DUO1 deployment at 16:17 UTC on August 17, 2018.
At 15:05 UTC on August 20, 2018, Duo’s automated monitoring systems alerted our engineering team to the initial signs of increased authentication latency, and the team began investigating the issue. Operational metrics revealed a significant increase in the number of connections to the DUO1 cluster, so we took action to identify and mitigate any potentially anomalous activity, including re-routing some traffic to alternate infrastructure. As part of this process, we also began disabling platform functionality unrelated to processing authentications to reduce load on backend services.
After analyzing performance data alongside changes included in the August 17, 2018 software release, the team decided to revert to the prior software version. We began reverting to this prior release at 16:05 UTC. As a result, authentication latency started to decrease at 16:13 UTC, and service was fully restored at 16:25 UTC.
Duo’s engineering team has developed a patch and integrated it into the core version of our software to prevent the identified regression from recurring. The team will also evaluate the load-testing methodologies used to test this enhancement prior to release, to identify opportunities to better detect this type of issue going forward.