SSO Active Directory Authentication Failures Across Multiple Deployments
Incident Report - 2023-06-12
On June 12, 2023, at around 09:48 EDT, Duo's Engineering Team was notified by customers that they were seeing degraded performance in our Duo Single Sign-On (SSO) product.
The root cause was identified as an increased amount of load on the deployment that overwhelmed the services responsible for managing authentications against an Active Directory identity provider.
The issue was resolved on the same day by implementing several load mitigation techniques. The techniques included infrastructure scaling, configuration changes, and algorithm updates.
2023-06-12 09:48 Duo Site Reliability Engineering (SRE) is informed by customers that they are seeing intermittent authentication failures in our SSO product. SRE begins triage.
2023-06-12 10:26 Duo SRE removes extraneous data from our Redis caching layer to reduce load.
2023-06-12 10:30 Duo SRE rolls out a configuration change to our Redis infrastructure to increase top level performance.
2023-06-12 10:36 Status page updated to: Issue Identified.
2023-06-12 10:58 Duo SRE rolls out an algorithm change intended to reduce overhead per authentication.
2023-06-12 11:01 Duo SRE observes that all systems appear to be healthy and authentications are being serviced at 100% again.
2023-06-12 11:05 Status page updated to: Monitoring.
2023-06-12 14:16 Status page updated to: Resolved.
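The intervals implied by the timeline can be worked out directly from its timestamps. The short Python sketch below does that arithmetic; the event labels are ours, chosen for illustration, and it assumes every entry falls on 2023-06-12 in the same time zone (EDT).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps taken from the timeline; the keys are illustrative labels,
# not terms used by Duo.
events = {
    "reported": "2023-06-12 09:48",          # customers report failures
    "first_mitigation": "2023-06-12 10:26",  # Redis data cleanup
    "recovered": "2023-06-12 11:01",         # authentications back at 100%
    "resolved": "2023-06-12 14:16",          # status page set to Resolved
}
parsed = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two labelled timeline events."""
    return int((parsed[end] - parsed[start]).total_seconds() // 60)


print(minutes_between("reported", "first_mitigation"))  # 38
print(minutes_between("reported", "recovered"))         # 73
print(minutes_between("reported", "resolved"))          # 268
```

By this reckoning, roughly 73 minutes passed between the first customer reports and the point at which authentications were fully serviced again.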
To service authentications for customers using Active Directory, Duo SSO uses the Duo Authentication Proxy to perform LDAP queries. These Authentication Proxies are managed by customers inside their own infrastructure. To compensate for the lack of insight Duo has into our customers’ local network topology, we implemented a pathing algorithm designed to be resilient to the various network connectivity issues that may arise. However, we have since learned that this algorithm is costly both for Duo’s infrastructure and for the infrastructure managed by our customers.
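This report does not describe the pathing algorithm or the change made to it, so the following Python sketch is purely hypothetical and is not Duo's implementation. It only illustrates the general trade-off mentioned above: a strategy that queries every proxy on every authentication tolerates unknown network conditions but costs more per authentication than one that prefers the last proxy known to work. The names `Proxy`, `broadcast_auth`, and `sticky_auth` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    """Hypothetical stand-in for one customer-managed Authentication Proxy."""
    name: str
    reachable: bool = True
    queries: int = 0  # number of LDAP queries this proxy has handled

    def ldap_query(self) -> bool:
        self.queries += 1
        return self.reachable


def broadcast_auth(proxies: list[Proxy]) -> bool:
    """Assumed 'resilient but costly' strategy: query every proxy each time."""
    results = [p.ldap_query() for p in proxies]  # list, so no short-circuit
    return any(results)


def sticky_auth(proxies: list[Proxy], state: dict) -> bool:
    """Assumed cheaper strategy: start with the last known-good proxy."""
    n = len(proxies)
    start = state.get("last_good", 0)
    for offset in range(n):
        i = (start + offset) % n
        if proxies[i].ldap_query():
            state["last_good"] = i  # remember the proxy that answered
            return True
    return False


def total_queries(proxies: list[Proxy]) -> int:
    return sum(p.queries for p in proxies)


# 1000 authentications against three healthy proxies under each strategy.
fleet_a = [Proxy(f"proxy-{i}") for i in range(3)]
fleet_b = [Proxy(f"proxy-{i}") for i in range(3)]
state: dict = {}
for _ in range(1000):
    broadcast_auth(fleet_a)
    sticky_auth(fleet_b, state)

print(total_queries(fleet_a))  # 3000
print(total_queries(fleet_b))  # 1000
```

In this toy model the broadcast strategy issues three times as many queries for the same number of authentications, and the extra work lands on both sides of the connection.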
On June 12, 2023, we reached a tipping point where additional authentication load became too much for our infrastructure to handle. This caused intermittent authentication failures for some customers because our processes were too overloaded to accept new requests and were sometimes too slow in responding to active requests.
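Duo resolved this by implementing several changes: scaling infrastructure, removing extraneous data from and reconfiguring our Redis caching layer, and rolling out an algorithm change that reduces overhead per authentication.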
If you are using Duo SSO with an Active Directory identity provider, we highly recommend that you run your Authentication Proxies in our suggested High Availability (HA) configuration. You can read more about those recommendations here. These recommendations will make your setup more resilient in times of high load and could reduce the impact on your users during a Duo service degradation event like this one.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this Duo Knowledge Base article.