On September 1, 2023, at around 13:19 EDT, Duo's Engineering Team was notified by our alerting that some customers might be experiencing slow Active Directory authentications or a large number of authentication retries with our Duo Single Sign-On (SSO) product.
After factoring out successful retries and accounting for noise in the system, Duo identified ~200 authentication failures impacting 128 different customers.
The root cause was the overloading of routing systems used to communicate with Duo Authentication Proxies.
The issue was resolved on the same day by restarting cloud services that communicate with on-premises Duo Authentication Proxies.
To determine which deployment ID you are on, please refer to this Duo Knowledge Base article.
2023-09–01 13:19 Duo Site Reliability Engineering (SRE) is informed by our alerting of a higher-than-normal volume of internal authentication path routing failures. SRE begins triage.
2023-09-01 13:30 Duo SRE restarts the SSO services to flush the system and drop CPU usage.
2023-09-01 14:00 Duo SSO’s service restart finishes. From this point onwards Duo saw no further authentication failures.
2022-09-01 14:15 Status page updated to Investigating.
2023-09-01 15:00 Duo SRE successfully ruled out several possible issues including background jobs, a sudden influx of inbound authentication requests, and Authentication Proxy issues.
2022-09-01 15:04 Status page updated to Monitoring.
2022-09-01 15:33 Status page updated to Resolved.
Duo SSO uses a routing system to accept a high volume of login requests from protected cloud applications and dispatch them to the appropriate identity providers, such as a customer’s on-premises Active Directory facilitated by a Duo Authentication Proxy. In normal operations, this system may throw a benign number of errors indicating that we could not reach a customer’s Authentication Proxy. In most circumstances, this is due to intermittent network issues in a customer’s environment and can be resolved by falling back to another Authentication Proxy made available by following best practices for high availability.
In this instance, we detected failures over our threshold to indicate there might be something wrong with Duo’s system and not simply transient failures from day-to-day operations. Upon further investigation, we identified high CPU load on the services responsible for routing authentications. We believed this high CPU utilization eventually caused enough latency in our system that requests would time out, leading to authentication failures for our end users.
Duo resolved this by restarting services that communicate with on-premises Duo Authentication Proxies. This addressed the temporary problem with the service, but did not implement any long-term solutions. For more details on those plans see the following sections.
If you are using Duo SSO with an Active Directory identity provider we highly recommend following our best practices for setting up the Duo Authentication Proxy for high availability. These best practices will make your setup more resilient in times of high load and could reduce impact to your users in the case of a Duo Service degradation event like this one.
The root cause of why the CPU spiked on our systems has not yet been fully identified. In the long term, Duo will attempt to reproduce these kinds of issues and implement fixes based on those findings
In the short term, Duo is making incremental improvements to our internal routing systems that dispatch requests to on-premises Authentication Proxies. These changes aim to reduce the total amount of resources consumed when fulfilling an authentication request and will hopefully minimize the chances of this kind of failure occurring again.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this Duo Knowledge Base article.