Intermittent Authentication Failures for DUO1
Incident Report for Duo
Postmortem

High latency and authentication failures

Incident Report - 2022-09-14

Summary

On September 14, 2022, at around 08:43 EDT, Duo's Engineering Team was alerted by monitoring that the DUO1 deployment was experiencing high latency. Soon after the alert, customers were unable to authenticate or access the Duo Admin Panel. The root cause of the latency was the cleanup step of a customer migration, which continued to run during peak traffic hours.

The issue resolved automatically the same day: the system's back pressure valve throttled traffic, which allowed the system to recover.

Deployments Impacted

  • DUO1

Timeline of Events (EDT)

2022-09-14 02:22 - Duo Site Reliability Engineering (SRE) begins the migration of customers from one database shard to another.

2022-09-14 08:40 - The customer migration completes successfully. The cleanup script continues to run and takes a long time to complete because of a large set of customer logs.

2022-09-14 09:04 - The system responds to the increased stress by triggering back pressure to help regulate the traffic.  

2022-09-14 09:19 - The system continues to report back pressure, but latency begins to decrease.

2022-09-14 09:27 - The StatusPage is updated to Investigating.

2022-09-14 09:33 - The latency returns to baseline levels and successful authentications reach normal levels. CPU usage normalizes. 

2022-09-14 09:38 - The StatusPage is updated to Monitoring.

2022-09-14 10:00 - The StatusPage is updated to Resolved.

Details

Duo SRE began a customer migration during planned maintenance hours. However, the volume of logs for the migrated customer was higher than expected, and the logs took significantly longer to migrate than planned. As a result, the migration ran past the maintenance window and into normal customer traffic hours.

As customer traffic increased, the system experienced resource constraints while the migration log processing continued. Because of the resulting high latency, the back pressure valve was triggered. It throttled customer requests, which caused authentication failures and prevented Admin Panel access.

Once the log migration completed and back pressure ceased, the success rate for authentications returned to normal. Latency also returned to normal levels. 
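
For readers unfamiliar with back pressure, the throttling behavior described above can be illustrated with a minimal sketch. This is not Duo's implementation: the class name, the thresholds, and the rolling-average trigger are all assumptions chosen for illustration. The sketch engages when average latency rises above one threshold and releases only after it falls below a lower one, so the valve does not flap while the system is recovering.

```python
from collections import deque


class BackPressureValve:
    """Hypothetical latency-triggered valve (illustrative only)."""

    def __init__(self, open_above_ms, close_below_ms, window=5):
        # Assumed thresholds: engage above one value, release below a lower one.
        self.open_above_ms = open_above_ms
        self.close_below_ms = close_below_ms
        self.samples = deque(maxlen=window)  # rolling window of recent latencies
        self.throttling = False

    def record_latency(self, latency_ms):
        """Add a latency sample and update the valve state."""
        self.samples.append(latency_ms)
        average = sum(self.samples) / len(self.samples)
        if not self.throttling and average > self.open_above_ms:
            self.throttling = True
        elif self.throttling and average < self.close_below_ms:
            self.throttling = False
        return self.throttling

    def admit(self):
        """Return True if a new request should be accepted."""
        return not self.throttling
```

In this simplified form every request is rejected while the valve is engaged. A production system would more likely shed only a fraction of traffic or prioritize certain request types.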

What is Duo doing to prevent this in the future?

Duo Engineering has paused all customer migrations in the short term. During this period, Duo will adjust its maintenance windows to ensure that maintenance operations do not extend into peak traffic periods. Duo is also evaluating the efficiency of its customer migration process to improve how quickly customers can be moved without negative impact.
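
One general way to keep long-running maintenance work inside a window is to split it into batches and check the remaining time before starting each one. The sketch below illustrates that idea only. It does not describe Duo's tooling, and the function name, the per-batch time estimate, and the injectable clock are all hypothetical.

```python
from datetime import datetime, timedelta


def run_batches_within_window(batches, process, window_end, est_batch_time,
                              now=datetime.now):
    """Process batches until the next one might overrun window_end.

    Returns the batches left unprocessed so they can be resumed in a
    later maintenance window.
    """
    remaining = list(batches)
    while remaining:
        # Stop before starting a batch that is not expected to finish in time.
        if now() + est_batch_time > window_end:
            break
        process(remaining.pop(0))
    return remaining
```

The guard is only as good as the per-batch estimate, so a conservative estimate (or a safety margin before the end of the window) is the safer choice when data volumes are uncertain.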

Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.

Posted Sep 15, 2022 - 12:38 EDT

Resolved
As of 1:55 PM UTC, the issue impacting authentications on DUO1 has been resolved, and all services are fully functional.
Posted Sep 14, 2022 - 10:00 EDT
Monitoring
As of 1:15 PM UTC, the issue has been resolved, and we are currently monitoring the results.
Posted Sep 14, 2022 - 09:38 EDT
Identified
We have identified an issue causing a partial outage on our DUO1 deployment. This outage is affecting most authentications for DUO1. We are working on fixing this issue as soon as possible.
Posted Sep 14, 2022 - 09:27 EDT
This incident affected: DUO1 (Core Authentication Service).