Incident Report - 2022-09-14
On September 14, 2022, at around 8:43 EST, Duo's Engineering Team was alerted by monitoring that the DUO1 deployment was experiencing high latency. Soon after the alert, customers were unable to authenticate or access the Duo Admin Panel. The root cause of the latency was the cleanup of a customer migration, which continued to run during peak traffic hours.
The issue was resolved automatically the same day when the system's back pressure valve engaged and allowed the system to recover.
2022-09-14 02:22 - Duo Site Reliability Engineering (SRE) begins the migration of customers from one database shard to another.
2022-09-14 08:40 - The customer migration completes successfully. The cleanup script continues to run and takes a long time to complete due to a large set of customer logs.
2022-09-14 09:04 - The system responds to the increased stress by triggering back pressure to help regulate the traffic.
2022-09-14 09:19 - The system continues to report back pressure but the latency begins to reduce.
2022-09-14 09:27 - The StatusPage is updated to Investigating.
2022-09-14 09:33 - The latency returns to baseline levels and successful authentications reach normal levels. CPU usage normalizes.
2022-09-14 09:38 - The StatusPage is updated to Monitoring.
2022-09-14 09:04 - The StatusPage is updated to Resolved.
Duo SRE began a customer migration during planned maintenance hours. However, the volume of logs for the migrated customer was higher than expected, and the logs took significantly longer to migrate than planned. As a result, the migration ran past the maintenance window and into normal customer traffic hours.
As customer traffic increased, the system experienced resource constraints while the migration log processing continued. The resulting high latency triggered a back pressure valve. This throttled customer requests, resulting in authentication failures and preventing Admin Panel access.
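The throttling behavior described above can be illustrated with a minimal sketch of a latency-triggered back pressure valve. This is not Duo's implementation: the class name, the rolling-average trigger, and the threshold and window values are all assumptions chosen for illustration.

```python
from collections import deque


class BackPressureValve:
    """Hypothetical latency-triggered throttle (illustrative only)."""

    def __init__(self, latency_threshold_ms=500.0, window=5):
        # Both values are assumed defaults, not figures from the incident.
        self.latency_threshold_ms = latency_threshold_ms
        self.samples = deque(maxlen=window)  # most recent latency samples

    def record_latency(self, latency_ms):
        """Add one observed request latency to the rolling window."""
        self.samples.append(latency_ms)

    def triggered(self):
        """True while the rolling average latency exceeds the threshold."""
        if not self.samples:
            return False
        average = sum(self.samples) / len(self.samples)
        return average > self.latency_threshold_ms

    def admit(self):
        """True if a new request should be accepted, False if throttled."""
        return not self.triggered()
```

In a design like this, requests are rejected only while recent latency stays above the threshold, so admission resumes on its own once latency falls, which matches the automatic recovery described in this report.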
Once the log migration completed and back pressure ceased, the success rate for authentications returned to normal. Latency also returned to normal levels.
Duo Engineering has paused all customer migrations in the short term. During this period, Duo will adjust its maintenance windows to ensure that maintenance operations do not leak into peak traffic periods. Duo is also evaluating the efficiency of its customer migration process to improve how quickly customers can be moved without a negative impact.
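One way to keep a long-running cleanup from leaking into peak traffic is to have it check the maintenance window before each batch and stop when the window closes. The sketch below is a hypothetical illustration, not Duo's tooling: the function names, the batch model, and the injected `clock` callable are all assumptions.

```python
from datetime import datetime, time


def within_window(now, start, end):
    """True if now's time of day falls in [start, end).

    Handles windows that cross midnight (start later than end).
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def run_cleanup(batches, process, clock, start, end):
    """Process batches until finished or the window closes.

    Returns the number of batches processed; the remainder can be
    resumed in the next maintenance window.
    """
    done = 0
    for batch in batches:
        if not within_window(clock(), start, end):
            break  # window closed: stop before touching peak traffic
        process(batch)
        done += 1
    return done
```

For example, `run_cleanup(batches, process, datetime.now, time(2, 0), time(6, 0))` would stop picking up new batches at 06:00 local time. A batch that is already in progress still runs to completion, so batches would need to be small relative to the window.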
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.