DUO1: Authentication Issue.
Incident Report for Duo
Postmortem

Authentication issues on DUO1

Incident Report for Duo Security

From 02:05 to 02:35 UTC on Nov 4, 2016, the DUO1 deployment experienced increased database contention that raised authentication latency. Latency reached levels that caused authentication failures for some customer applications protected by the Duo service.

As part of our rolling release process, Duo regularly makes new features and other general service improvements available to customers. Many of these enhancements require changes to database content or configuration. Duo’s engineering team has designed and implemented processes that allow these changes to be made gradually and automatically, without impact to end users or service availability. These processes are tested and exercised regularly, as Duo releases code on a biweekly basis.

A defect in the code responsible for a specific database change resulted in a higher-than-anticipated number of concurrent database transactions, which exhausted the available disk I/O throughput on the DUO1 primary database. These transactions were automatically replicated to the standby database servers, so the redundant resources were similarly exhausted and could not have functioned properly in a failover scenario.
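
As a general illustration of the failure mode described above, the following is a minimal sketch of capping how many batches of a data change run concurrently. It is not Duo's actual tooling; `run_batches`, `apply_batch`, and `max_concurrent` are hypothetical names chosen for this example, and a real migration would also need pacing, retries, and I/O-aware backoff.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_batches(batches, apply_batch, max_concurrent=2):
    """Apply every batch while running at most max_concurrent at a time.

    Returns the peak number of batches that were in flight together,
    so callers can confirm the cap was respected.
    """
    in_flight = 0
    peak = 0
    lock = threading.Lock()

    def worker(batch):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            # Hypothetical callback: one database transaction per batch.
            apply_batch(batch)
        finally:
            with lock:
                in_flight -= 1

    # The pool size is the concurrency cap: extra batches wait in the queue
    # instead of opening more simultaneous transactions.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # list() drains the iterator so any worker exception is re-raised here.
        list(pool.map(worker, batches))
    return peak
```

Bounding the worker pool keeps the number of simultaneous transactions fixed no matter how many batches the change generates, which is the property that was lost in this incident.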

Duo’s monitoring systems detected an increase in database latency at 01:49 UTC and alerted the relevant engineering staff. Immediate action was taken to halt the generation of further database transactions and to add capacity. Once these actions were complete, service was restored at 02:35 UTC, at which time end-user authentications began succeeding.

As a result of this incident, Duo’s engineering team has already implemented finer-grained monitoring and additional controls to detect issues like this one before they cause customer-facing impact. Data collected during and after the incident will inform future capacity-related decisions to ensure that enough headroom is available to prevent resource exhaustion of any type.

Posted Nov 04, 2016 - 17:23 EDT

Resolved
Our Operations Team has confirmed that the service degradation leading to authentication slowness or failure was fully resolved as of 02:35 UTC (10:35 p.m. ET). The incident timeline is as follows:
01:49 UTC: Authentication latency begins to grow.
02:05 UTC: Authentication latency reaches significant levels, resulting in increased timeouts or failures.
02:26 UTC: Authentication latency begins returning to normal levels.
02:30 UTC: Service restored for newly initiated authentications; some very slow authentications still in progress.
02:35 UTC: Service is fully restored.
Posted Nov 03, 2016 - 23:08 EDT
Investigating
We are currently investigating potential issues with authentications. Our Operations Team will provide further updates as more information becomes available.
Posted Nov 03, 2016 - 22:44 EDT
This incident affected: DUO1 (Core Authentication Service, Admin Panel, Push Delivery, Phone Call Delivery, SMS Message Delivery).