Multiple Deployments: Slow Active Directory authentications with Duo SSO

Incident Report for Duo

Postmortem

Summary

On September 1, 2023, at around 13:19 EDT, Duo's Engineering Team was notified by our alerting that some customers might be experiencing slow Active Directory authentications or a large number of authentication retries with our Duo Single Sign-On (SSO) product.

After factoring out successful retries and accounting for noise in the system, Duo identified ~200 authentication failures impacting 128 different customers.

The root cause was the overloading of routing systems used to communicate with Duo Authentication Proxies.

The issue was resolved on the same day by restarting cloud services that communicate with on-premises Duo Authentication Proxies.

Deployments Impacted

DUO9 (SSO), DUO17 (SSO), DUO22 (SSO), DUO39 (SSO), DUO40 (SSO), DUO42 (SSO), DUO50 (SSO), DUO52 (SSO), DUO55 (SSO), DUO56 (SSO), DUO58 (SSO), DUO62 (SSO), DUO63 (SSO), DUO64 (SSO), DUO65 (SSO), DUO71 (SSO), DUO72 (SSO), and DUO73 (SSO).

To determine which deployment ID you are on, please refer to this Duo Knowledge Base article.

Timeline of Events (EDT)

2023-09-01 13:19 Duo Site Reliability Engineering (SRE) is informed by our alerting of a higher-than-normal volume of internal authentication path routing failures. SRE begins triage.

2023-09-01 13:30 Duo SRE restarts the SSO services to flush the system and drop CPU usage.

2023-09-01 14:00 Duo SSO’s service restart finishes. From this point onwards Duo saw no further authentication failures.

2023-09-01 14:15 Status page updated to Investigating.

2023-09-01 15:00 Duo SRE successfully ruled out several possible issues including background jobs, a sudden influx of inbound authentication requests, and Authentication Proxy issues.

2023-09-01 15:04 Status page updated to Monitoring.

2023-09-01 15:33 Status page updated to Resolved.

Details

Duo SSO uses a routing system to accept a high volume of login requests from protected cloud applications and dispatch them to the appropriate identity providers, such as a customer’s on-premises Active Directory facilitated by a Duo Authentication Proxy. In normal operations, this system may throw a benign number of errors indicating that we could not reach a customer’s Authentication Proxy. In most circumstances, this is due to intermittent network issues in a customer’s environment and can be resolved by falling back to another Authentication Proxy made available by following best practices for high availability.
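The fallback behavior described above can be illustrated with a minimal sketch. This is not Duo's implementation; the names `authenticate_with_fallback`, `ProxyUnreachable`, and the `send` callback are hypothetical and exist only to show the idea of trying the next available Authentication Proxy when one cannot be reached.

```python
from typing import Callable, Sequence, Tuple


class ProxyUnreachable(Exception):
    """Raised when a (hypothetical) Authentication Proxy cannot be reached."""


def authenticate_with_fallback(
    proxies: Sequence[str],
    send: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Try each proxy in order; return (proxy, result) from the first reachable one.

    `send` is an assumed callback that forwards the authentication to a proxy
    and raises ProxyUnreachable if that proxy does not respond.
    """
    errors = []
    for proxy in proxies:
        try:
            return proxy, send(proxy)
        except ProxyUnreachable as exc:
            # Remember the failure and fall back to the next proxy.
            errors.append(f"{proxy}: {exc}")
    # Every configured proxy was unreachable (or none were configured).
    raise ProxyUnreachable("; ".join(errors) or "no proxies configured")
```

With only one proxy configured, a single unreachable proxy becomes an authentication failure, which is why multiple proxies are recommended for high availability.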

In this instance, we detected failures above the threshold we use to distinguish a possible problem in Duo’s systems from the transient failures of day-to-day operations. Upon further investigation, we identified high CPU load on the services responsible for routing authentications. We believe this high CPU utilization eventually added enough latency to our system that requests timed out, leading to authentication failures for our end users.
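As a rough illustration of separating benign background errors from a systemic problem, the sketch below tracks the failure rate over a sliding window of recent requests and alerts only when that rate exceeds a threshold. The class name `FailureRateMonitor`, the window size, and the threshold value are assumptions chosen for the example; the report does not describe how Duo's alerting is actually built.

```python
from collections import deque


class FailureRateMonitor:
    """Hypothetical sliding-window monitor for routing failures."""

    def __init__(self, window_size: int, threshold: float) -> None:
        # Keep only the most recent `window_size` outcomes.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        """Record one routing attempt; `failed` is True if the proxy was unreachable."""
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        """Fraction of recorded attempts in the window that failed."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        """Alert only when the window is full and the rate exceeds the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.failure_rate() > self.threshold
```

A low, steady failure rate stays under the threshold and raises no alert, while a sustained jump in failures across the window does.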

Resolution

Duo resolved this by restarting services that communicate with on-premises Duo Authentication Proxies. This addressed the immediate problem with the service but is not a long-term solution. For details on our longer-term plans, see the following sections.

Recommendations

If you are using Duo SSO with an Active Directory identity provider, we highly recommend following our best practices for setting up the Duo Authentication Proxy for high availability. These best practices will make your setup more resilient in times of high load and could reduce the impact on your users during a Duo service degradation event like this one.

What is Duo doing to prevent this in the future?

The reason CPU usage spiked on our systems has not yet been fully identified. In the long term, Duo will attempt to reproduce these kinds of issues and implement fixes based on those findings.

In the short term, Duo is making incremental improvements to our internal routing systems that dispatch requests to on-premises Authentication Proxies. These changes aim to reduce the total amount of resources consumed when fulfilling an authentication request and will hopefully minimize the chances of this kind of failure occurring again.

Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this Duo Knowledge Base article.

Posted Sep 27, 2023 - 15:19 EDT

Resolved

The issue that resulted in Duo SSO login service degradation has fully resolved. All authentications are working as expected.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.

Please check back or subscribe to be notified when the RCA is posted.

Posted Sep 01, 2023 - 15:33 EDT

Monitoring

The issue that resulted in Duo SSO login service degradation has self-resolved. We continue to monitor closely and are actively investigating root cause.

Posted Sep 01, 2023 - 15:04 EDT

Investigating

We are investigating a potential service degradation for Duo SSO authentications through Active Directory that results in slow logins or intermittent failures to log in.

Please check back here or subscribe to updates for any changes.

Posted Sep 01, 2023 - 14:15 EDT

This incident affected: DUO9 (SSO), DUO17 (SSO), DUO22 (SSO), DUO39 (SSO), DUO40 (SSO), DUO42 (SSO), DUO50 (SSO), DUO52 (SSO), DUO55 (SSO), DUO56 (SSO), DUO58 (SSO), DUO62 (SSO), DUO63 (SSO), DUO65 (SSO), DUO71 (SSO), DUO72 (SSO), and DUO73 (SSO).