DUO1: Core Authentication Service degraded performance
Incident Report for Duo
Postmortem

Summary

On August 13, 2024, at approximately 10:30 am ET, Duo's Site Reliability Engineering Team received an alert indicating that users on DUO1 were experiencing latency issues. The team identified the root cause as a large, distributed burst of authentication traffic.

Duo resolved the issue by 12:12 pm ET.

Deployments Impacted

  • DUO1

Timeline of Events (ET)

2024-08-13 10:30 am  Duo Site Reliability Engineering (SRE) Team receives the initial alert regarding DUO1. Engineers begin investigating customer impact.

2024-08-13 10:49 am  Duo SRE Team identifies a large and distributed burst of authentication traffic from a handful of users impacting Duo's capacity to process authentications.

2024-08-13 11:04 am  Status page is updated to Investigating.

2024-08-13 11:09 am  Duo SRE Team starts working on multiple solutions to address the degradation, including scaling the infrastructure and mitigating the burst of traffic.

2024-08-13 11:29 am  Duo SRE Team completes scaling DUO1.

2024-08-13 11:40 am  Duo SRE Team stops the surge of traffic and begins monitoring systems for recovery.

2024-08-13 12:12 pm  Incident status is updated to Resolved after recovery is verified.

Details

Duo Engineering determined that the root cause was a substantial surge in authentication traffic, which impacted Duo's ability to process authentications efficiently. The resulting increase in latency triggered our automated back-pressure systems, which reject requests above a set threshold in order to keep the service functional. As a result, we were able to continue processing the majority of authentications and limit impact to a small percentage of our total user base on this deployment.
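To illustrate the kind of mechanism described above, here is a minimal sketch of threshold-based load shedding: new requests are rejected once the number of in-flight requests exceeds a fixed limit, so the service keeps processing the work it has already admitted. All names and the threshold value are illustrative assumptions, not Duo's actual implementation.

```python
import threading


class LoadShedder:
    """Reject new requests once in-flight work exceeds a fixed threshold.

    Illustrative sketch only; not Duo's actual back-pressure system.
    """

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Admit the request if capacity remains; otherwise shed it."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False  # over threshold: reject to protect the service
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished, freeing capacity."""
        with self._lock:
            self._in_flight -= 1


shedder = LoadShedder(max_in_flight=2)
admitted = [shedder.try_acquire() for _ in range(3)]
# admitted == [True, True, False]: the third request is shed
```

In a real service the threshold would typically track latency or queue depth rather than a hard-coded count, but the admit-or-reject decision point is the same.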

Duo's Site Reliability Engineering (SRE) team conducts capacity planning for deployments based on historical data and expected utilization, ensuring there is sufficient headroom for increased authentication traffic. However, the recent traffic surge exceeded these margins and required manual intervention. The team implemented two solutions: vertically scaling DUO1 to create additional headroom and reduce pressure, and identifying and blocking the traffic sources. These actions successfully reduced the load on DUO1 and resolved the latency issues.
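The second mitigation, identifying and limiting high-volume traffic sources, is commonly implemented as a per-source rate limiter. The sketch below uses a token bucket keyed by source identifier; the class name, parameters, and budget values are assumptions for illustration, not Duo's actual tooling.

```python
import time
from collections import defaultdict


class PerSourceRateLimiter:
    """Token-bucket rate limiter keyed by traffic source (illustrative sketch).

    Each source earns `rate` tokens per second up to a cap of `burst`;
    a request spends one token, and sources with no tokens are rejected.
    """

    def __init__(self, rate: float, burst: float):
        self.rate = rate    # tokens replenished per second, per source
        self.burst = burst  # maximum bucket size (allowed burst)
        self._tokens = defaultdict(lambda: burst)
        self._last = {}

    def allow(self, source, now=None):
        """Return True if `source` may proceed; False if it is over budget."""
        now = time.monotonic() if now is None else now
        last = self._last.get(source, now)
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, self._tokens[source] + (now - last) * self.rate)
        self._last[source] = now
        if tokens < 1.0:
            self._tokens[source] = tokens
            return False  # this source exceeded its budget; shed the request
        self._tokens[source] = tokens - 1.0
        return True
```

A limiter like this throttles only the handful of sources driving the surge, while other users on the deployment continue authenticating normally.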

During the incident retrospective, the Duo SRE team identified several measures to prevent similar issues in the future: migrating customers with higher authentication volumes to separate deployments, further optimizing and automating deployment scaling to better absorb large bursts of traffic, and automating the mitigation of surging authentication traffic.

Posted Aug 15, 2024 - 05:48 EDT

Resolved
The issue causing degraded performance with Duo's authentication service is now resolved and full functionality is restored.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.
Posted Aug 13, 2024 - 12:19 EDT
Monitoring
We have implemented a fix for the issue causing degraded performance with Duo's authentication service.

We will continue to monitor the issue and will post any updates when the incident is considered fully resolved.
Posted Aug 13, 2024 - 11:51 EDT
Identified
We have identified the cause of the degraded performance with Duo's authentication service and are actively working to restore service to full capacity.
Posted Aug 13, 2024 - 11:41 EDT
Update
We are continuing to investigate this issue.
Posted Aug 13, 2024 - 11:10 EDT
Investigating
We are currently investigating an issue causing degraded performance with the Duo Core Authentication Service on our DUO1 deployment and are working to correct the issue as soon as possible.

Please check back here or subscribe to updates for any changes.
Posted Aug 13, 2024 - 11:05 EDT
This incident affected: DUO1 (Core Authentication Service).