On November 9, 2023, at 9:05am EST and again at 11:36am EST, Duo's Engineering team was alerted by monitoring to high latency affecting Duo Single Sign-On (SSO) authentication requests on multiple deployments. At 9:15am and again at 11:41am, the authentication request load and the resulting latency triggered a rate limiter that rejects authentication requests above a threshold. The latency and rate limiting together resulted in failed authentications for 919 Duo SSO customers. We identified the root cause as a high volume of illegitimate traffic that sent SSO authentication requests repeatedly without completing authentication. Duo engineers blocked the two IP addresses from which these incomplete requests originated and implemented additional rate limiting to prevent a recurrence.
Incident Timeline (all times EST)
2023-11-09 9:05 Duo Site Reliability Engineering (SRE) was alerted by our monitoring systems to high latency leading to authentication failures.
2023-11-09 9:15 The IP address sending the invalid requests was blocked.
2023-11-09 9:26 The service was observed to be fully restored.
2023-11-09 9:48 Status page updated to Monitoring.
2023-11-09 10:15 Status page updated to Resolved.
2023-11-09 11:36 The same actor returned from another IP address, causing more strain on the system.
2023-11-09 11:41 The IP address sending the invalid requests was blocked.
2023-11-09 11:45 The service was observed to be fully restored again.
2023-11-09 13:46 A strict automatic block on requests above a defined threshold was put in place to prevent future impact from illegitimate traffic.
2023-11-09 17:45 Another burst of invalid traffic hit our deployment, but the new rate limiting blocked the excess requests automatically and there was no impact to any legitimate authentication requests.
Details of the Incident
As the timeline above shows, we experienced two separate periods during which SSO authentication failures occurred. The illegitimate traffic in both periods shared common characteristics, leading us to conclude that a single actor was responsible for the load that degraded the health of our system. After five minutes, the latency caused by the excess load triggered rate limiters that reject requests above a threshold in an attempt to keep a deployment healthy. Approximately 90% of the authentication request failures during these two periods were caused by the rate limiter; the remainder were caused by latency that resulted in requests timing out after one minute.
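The backpressure mechanism described above can be pictured as a limiter that watches recent request latency and starts rejecting traffic once the average crosses a threshold. The sketch below is illustrative only; the class name, window size, and one-second threshold are assumptions, not Duo's actual implementation.

```python
from collections import deque


class LatencyBackpressureLimiter:
    """Sketch of a latency-triggered rate limiter (hypothetical design):
    when the rolling average request latency exceeds a threshold,
    new requests are rejected until latency recovers."""

    def __init__(self, latency_threshold_s=1.0, window=100):
        self.latency_threshold_s = latency_threshold_s
        # Keep only the most recent `window` latency samples.
        self.samples = deque(maxlen=window)

    def record(self, latency_s):
        """Record the observed latency of a completed request."""
        self.samples.append(latency_s)

    def allow(self):
        """Allow the request unless the rolling average latency
        has crossed the configured threshold."""
        if not self.samples:
            return True
        avg = sum(self.samples) / len(self.samples)
        return avg <= self.latency_threshold_s
```

Because the trigger is latency rather than raw request count, a limiter like this only engages once load has already degraded the system, which matches the roughly five-minute delay noted above.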
How did Duo resolve the incident?
As a short-term solution, Duo blocked the two IP addresses from which the illegitimate activity originated. No legitimate authentication requests were observed from those two IP addresses.
To prevent similar loads of illegitimate activity, Engineering implemented an additional static rate limiter. The existing rate limiter dynamically begins blocking requests when latency passes a certain threshold; the new rate limiter instead enforces whenever the number of requests passes a fixed limit that Duo defines. Later that evening the illegitimate activity returned from yet another IP address, but the new rate limiter automatically blocked the traffic and no legitimate authentications were lost in that timeframe.
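A static limiter of the kind described here is often implemented as a fixed-window counter: count requests in the current time window and reject once the count exceeds a fixed cap. This is a minimal sketch under that assumption; the class name and parameter values are illustrative, not Duo's actual configuration.

```python
import time


class FixedWindowLimiter:
    """Sketch of a static rate limiter (hypothetical design): reject
    once the number of requests in the current window exceeds a
    fixed, operator-defined threshold."""

    def __init__(self, max_requests, window_s=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self):
        """Count this request; allow it only if the per-window cap
        has not been exceeded."""
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # Start a fresh window and reset the counter.
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.max_requests
```

Unlike the latency-based limiter, a static cap engages immediately when a burst arrives, which is why the later burst that evening was blocked with no impact on legitimate traffic.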
In total, 919 Duo SSO customers had at least one impacted user.
What is Duo doing to prevent this in the future?
Engineering has already implemented an additional rate limiter to combat runaway requests targeting a single customer. The team has also tuned the existing backpressure rate limiter by lowering its latency thresholds so that it triggers sooner and is therefore more effective.
In addition, we are evaluating further rate-limiting strategies and will implement them in the coming weeks.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.