On August 13, 2024, at approximately 10:30 am ET, Duo's Site Reliability Engineering (SRE) Team received an alert indicating that users on DUO1 were experiencing latency issues. The team identified the root cause as a large, distributed burst of authentication traffic.
Duo resolved the issue by 12:12pm ET.
2024-08-13 10:30am Duo Site Reliability Engineering (SRE) Team receives initial alert regarding DUO1. Engineers begin to investigate the customer impact.
2024-08-13 10:49am Duo SRE Team identifies a large and distributed burst of authentication traffic from a handful of users impacting Duo’s capacity to process authentications.
2024-08-13 11:04am Status page is updated to investigating.
2024-08-13 11:09am Duo SRE Team starts working on multiple solutions to address the degradation, including scaling the infrastructure and mitigating the burst of traffic.
2024-08-13 11:29am Duo SRE Team completes scaling DUO1.
2024-08-13 11:40am Duo SRE Team stops the surge of traffic. Duo SRE begins to monitor systems for recovery.
2024-08-13 12:12pm Incident status is updated to resolved after recovery is verified.
Duo Engineering determined that the root cause was a substantial surge in authentication traffic, which impacted Duo's ability to process authentications efficiently. This led to increased latency, triggering our automated back pressure systems to reject requests exceeding a certain threshold to maintain system functionality. As a result, we were able to continue processing the majority of authentications and limit impact to a small percentage of our total user base on this deployment.
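The back pressure behavior described above can be illustrated with a minimal sketch. This is not Duo's implementation: the class name `BackPressureGate`, the `max_in_flight` parameter, and the choice of an in-flight request count as the threshold are all assumptions made for illustration. The idea is that requests beyond the threshold are rejected immediately, so the requests already admitted can still complete.

```python
import threading


class BackPressureGate:
    """Hypothetical load-shedding gate: admits requests until a fixed
    number are in flight, then rejects the excess instead of queueing it."""

    def __init__(self, max_in_flight: int):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight  # assumed threshold, not a Duo value
        self.in_flight = 0
        self.rejected = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if the request is admitted, False if it is shed."""
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                self.rejected += 1
                return False
            self.in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished, freeing a slot."""
        with self._lock:
            if self.in_flight > 0:
                self.in_flight -= 1
```

In this sketch, a request handler would call `try_acquire()` before doing any work, return an error to the caller when it yields `False`, and call `release()` once an admitted request completes. Rejecting early keeps latency bounded for admitted requests, at the cost of failing a fraction of traffic during a surge.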
Duo's Site Reliability Engineering (SRE) team conducts capacity planning for deployments based on historical data and expected utilization, ensuring there is sufficient headroom for increased authentication traffic. However, the recent traffic surge exceeded these margins and required manual intervention. The team implemented two solutions: vertically scaling DUO1 to create additional headroom and reduce pressure, and identifying and blocking the traffic sources. These actions successfully reduced the load on DUO1 and resolved the latency issues.
During the incident retrospective, the Duo SRE team identified several measures to prevent similar issues in the future. These include migrating customers with higher authentication volumes to separate deployments, further optimizing and automating the scaling of deployments so they can better absorb large bursts of traffic, and automating the mitigation of surges in authentication traffic.