DUO1 Deployment: Authentication Latency
Incident Report for Duo
Postmortem

Summary:

From 20:40 UTC to 22:04 UTC on January 10, 2019, performance on the DUO1 deployment degraded, resulting in increased authentication latency and intermittent request timeouts. Approximately 60% of authentication requests to DUO1 were slow to respond or timed out.

The root cause of this outage has been identified and engineering changes have been made to prevent similar issues going forward.

Details:

The Duo architecture has been designed with redundancy at each tier and is configured to automatically recover from failure without customer impact. The final stage in recovery is updating an internal map that specifies which load balancer nodes are active. A previous automated recovery event affecting DUO1 completed successfully but did not properly update this map.

At 20:27 UTC on January 10, 2019, a scheduled software update was deployed to DUO1. The automated software release process used the now out-of-date node map to reconfigure DUO1's load balancers. These incorrect configurations misrouted traffic, causing requests to fail or timeout. The percentage of misrouted requests increased as the release process continued modifying load balancers.

At 20:36 UTC, latency exceeded our monitoring thresholds and alerted the operations team to the degraded state. Initial triage implicated the software update as the most probable cause, and operations initiated a rollback at 20:42 UTC.

At 20:51 UTC, traffic was rerouted to servers running a previous software release as part of the rollback process. At 21:02 UTC, the Duo Status Page was updated to reflect data showing that the issue had been resolved.

Subsequently, data became available that showed signs of ongoing service degradation, so a new Status Page incident was opened to reflect the ongoing issue. After the system finished restoring load balancers to a known good configuration at 22:04 UTC, service returned to expected performance levels.

The engineering team has used data collected during this event to improve our release automation to detect errors in the load balancer configuration. Operational processes associated with this event have been reviewed and updated to incorporate lessons learned. With these improvements in place, DUO1 has since been updated to the latest production release of the Duo software. Additional infrastructure and software improvements are underway in order to provide further stability.

Posted Jan 15, 2019 - 10:44 EST

Resolved
The issue regarding authentication latency on DUO1 is fully resolved and all services are now fully functional. Some residual latency may be noted while our mitigation efforts propagate, though it should not affect core authentication.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.

Please make sure to check back or subscribe to be notified when the RCA is posted.
Posted Jan 10, 2019 - 16:02 EST
Identified
We have identified an issue which caused authentication latency on our DUO1 deployment, and have made appropriate changes to mitigate the issue.

Please check back here or subscribe to updates for any changes.
Posted Jan 10, 2019 - 15:58 EST
This incident affected: DUO1 (Core Authentication Service).