From 02:05 to 02:35 UTC on Nov 4, 2016, the DUO1 deployment experienced increased database contention, which elevated authentication latency. Latency reached levels that caused authentication failures for some customer applications protected by the Duo service.
As part of our rolling release process, Duo regularly makes new features and other general service improvements available to customers. Many of these enhancements require changes to database content or configuration. Duo’s engineering team has designed and implemented processes that allow these changes to be made gradually and automatically, without impact to end users or service availability. These processes are tested and exercised regularly, as Duo releases code on a biweekly basis.
A defect in the code responsible for a specific database change generated a higher-than-anticipated number of concurrent database transactions, which exhausted the available disk I/O (throughput) on the DUO1 primary database. These transactions were automatically replicated to the standby database servers, exhausting the same resources there and leaving those redundant servers unable to function properly in a failover scenario.
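Duo has not published the code involved, but the standard pattern for avoiding this failure mode is to apply a large data change in small, individually committed batches with a throttle between them, rather than as one burst of concurrent transactions. A minimal sketch of that pattern, using SQLite as a stand-in database (table and column names are hypothetical):

```python
import sqlite3
import time

BATCH_SIZE = 2       # rows per transaction (tiny here, for illustration)
PAUSE_SECONDS = 0.0  # throttle between batches; tune to spare disk I/O

def migrate_in_batches(conn):
    """Apply a data migration in small, throttled transactions.

    Updating every row in one burst of concurrent transactions can
    exhaust disk I/O on the primary and, via replication, on the
    standbys. Committing small batches and pausing between them
    leaves headroom for normal authentication traffic.
    """
    total = 0
    while True:
        cur = conn.execute(
            "SELECT id FROM users WHERE migrated = 0 LIMIT ?",
            (BATCH_SIZE,),
        )
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            break
        conn.executemany(
            "UPDATE users SET migrated = 1 WHERE id = ?",
            [(i,) for i in ids],
        )
        conn.commit()          # one small transaction per batch
        total += len(ids)
        time.sleep(PAUSE_SECONDS)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, migrated INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, 0)", [(i,) for i in range(7)])
conn.commit()
migrated = migrate_in_batches(conn)
```

Because each batch is its own short transaction, replication streams the change to standbys at the same gentle pace, instead of replaying one I/O-saturating burst.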
Duo’s monitoring systems detected an increase in database latency at 01:49 UTC and alerted the relevant engineering staff, who took immediate action to halt the generation of further database transactions and to add capacity. Once these actions were complete, service was restored at 02:35 UTC, at which point end-user authentications began succeeding again.
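The mitigation described above, halting new transaction generation when the database is under stress, can be automated as a simple backpressure loop: before each unit of work, check a latency signal and wait until it drops below a threshold. A hedged sketch of that idea (the threshold, batch callables, and latency probe are all hypothetical):

```python
import time

LATENCY_THRESHOLD_MS = 200  # hypothetical alerting threshold

def run_with_backpressure(batches, measure_latency_ms):
    """Process migration batches, pausing while DB latency is high.

    When latency crosses the threshold, stop generating new
    transactions until the database recovers, rather than piling
    additional load onto an already-contended primary.
    """
    applied = 0
    for batch in batches:
        while measure_latency_ms() > LATENCY_THRESHOLD_MS:
            time.sleep(0.01)  # back off; in production, alert and wait
        batch()               # apply one small unit of work
        applied += 1
    return applied

# Simulated latency signal: spikes once, then recovers.
readings = iter([50, 350, 350, 80, 60, 70])
def fake_latency():
    return next(readings, 60)

done = run_with_backpressure([lambda: None] * 3, fake_latency)
```

In this simulation, the second batch is delayed through two high-latency readings and applied only after the signal recovers, so all three batches eventually complete without adding load during the spike.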
As a result of this incident, Duo’s Engineering team has already implemented finer-grained monitoring and additional controls to detect issues like these before they cause customer-facing impact. Data collected during and after the incident will inform future capacity planning, ensuring that appropriate headroom is available to prevent resource exhaustion of any kind.