Core Authentication Issues on DUO1
Incident Report for Duo
Postmortem

From 15:05 to 16:25 UTC on August 20, 2018, the DUO1 deployment experienced performance degradation that resulted in increased authentication latency for all customer applications protected by the Duo service on this deployment. In some cases, this latency caused authentications to time out and subsequently fail during this window.

As part of our rolling release process, Duo consistently makes new features and other general service improvements available to customers. Duo’s engineering team has designed and implemented processes allowing these types of changes to be made in a gradual and automated fashion without any corresponding impact to end users or service availability. These processes are tested and exercised regularly as Duo releases code every two weeks. As part of these processes, the latest version of our software was released to the DUO1 deployment at 16:17 UTC on August 17, 2018.

At 15:05 UTC on August 20, 2018, Duo’s automated monitoring systems alerted our engineering team to the initial signs of increased authentication latency, and the team began investigating the issue. Operational metrics revealed a significant increase in the number of connections to the DUO1 cluster, so we took action to identify and mitigate any potentially anomalous activity, including re-routing some traffic to alternate infrastructure. As part of this process, we also began disabling platform functionality unrelated to processing authentications to reduce load on backend services.

After analyzing performance data alongside changes included in the August 17, 2018 software release, the team decided to revert to the prior software version. We began reverting to this prior release at 16:05 UTC. As a result, authentication latency started to decrease at 16:13 UTC, and service was fully restored at 16:25 UTC.
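
As a rough check on the timeline above, the intervals between the stated events can be computed directly. This is only an illustrative sketch: the timestamps are those given in this report, while the helper and variable names are assumptions introduced here, not part of Duo's tooling.

```python
from datetime import datetime, timezone


def utc(day, hhmm):
    """Build a UTC timestamp in August 2018 from a day number and an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2018, 8, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps as stated in this report (all UTC); names are illustrative.
release = utc(17, "16:17")            # new software version released to DUO1
alert = utc(20, "15:05")              # monitoring alerted on increased latency
revert_start = utc(20, "16:05")       # revert to the prior release began
latency_improving = utc(20, "16:13")  # authentication latency started to decrease
restored = utc(20, "16:25")           # service fully restored

timeline = {
    "release_to_alert": minutes_between(release, alert),
    "alert_to_revert_start": minutes_between(alert, revert_start),
    "revert_start_to_improvement": minutes_between(revert_start, latency_improving),
    "revert_start_to_restored": minutes_between(revert_start, restored),
    "total_impact": minutes_between(alert, restored),
}

for name, minutes in timeline.items():
    print(f"{name}: {minutes} min")
```

Under these figures, the impact window lasted 80 minutes, and latency began to improve 8 minutes after the revert started.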

Duo’s engineering team has developed and integrated a patch into the core version of our software to prevent future occurrences of the identified regression. The team will also evaluate the load-testing methodologies used to test this enhancement prior to release to identify opportunities to better detect this type of issue going forward.

Posted Aug 21, 2018 - 13:06 EDT

Resolved
We have resolved the issues affecting DUO1 and all services are now fully functional.

We will be posting an RCA here as soon as it is made available.
Posted Aug 20, 2018 - 14:18 EDT
Update
We are continuing to work on a fix for this issue.
Posted Aug 20, 2018 - 14:17 EDT
Update
Our Engineering team's service changes have been completed and have resulted in authentications now completing as normal.

We will continue to monitor the issue and will post an update & post-mortem RCA when the incident is considered fully resolved.
Posted Aug 20, 2018 - 12:56 EDT
Identified
We are working to identify the root cause of core authentication issues on the DUO1 deployment and have begun making changes that are already improving performance for many impacted customers.
Posted Aug 20, 2018 - 12:27 EDT
Investigating
We have identified a problem causing intermittent login issues with our Core Authentication service on DUO1, and are working to correct it.
Posted Aug 20, 2018 - 11:24 EDT
This incident affected: DUO1 (Core Authentication Service, Admin Panel, Push Delivery, Phone Call Delivery, SMS Message Delivery, Cloud PKI).