2024-05-02 16:02 Configuration change is made to internal testing and internal beta environments for an internal API service.
2024-05-09 14:27 Configuration change is made to the Asia, Europe, and US1 environments for the same service.
2024-05-13 15:07 Configuration change is made to the US2 environment for the same service.
2024-05-13 15:09 Monitoring systems first receive alerts on a high volume of errors and high latency for authentication services. Duo SRE (Site Reliability Engineering) responds.
2024-05-13 15:13 Duo SRE escalates to the service owner engineering team.
2024-05-13 15:18 Configuration change is identified as the root cause of high volume of errors for authentication services.
2024-05-13 15:18 Duo's Technical Support team is notified by customers of authentication failures through calls and emails.
2024-05-13 15:25 Duo's Technical Support team contacts SRE about high incoming call rates.
2024-05-13 15:29 Rollback of configuration change is initiated.
2024-05-13 15:31 Status page is updated to Investigating.
2024-05-13 15:46 Rollback testing is completed and merged into source control.
2024-05-13 15:52 Feature flag that controlled traffic from the authentication service to the affected internal API service is disabled.
2024-05-13 15:53 Rollback of the configuration change within the affected internal API service is completed and the internal API service is redeployed with its previous configuration.
2024-05-13 16:00 Status page is updated to Identified.
2024-05-13 16:48 Status page is updated to Fix Implemented and Monitoring.
2024-05-13 16:50 Status page is updated to Monitoring.
2024-05-13 17:42 Status page is updated to Resolved.
Duo Engineering deployed a configuration change that enabled a post-authentication step within an internal API that is part of the authentication code path. The internal API became overloaded and therefore calls could not be handled in a timely manner. As a result, authentications could not complete, and began to fail.
Duo Engineering had previously tested this configuration change and had rolled it out to multiple test and production deployments over the past month without issue.
When customers notified Duo's Technical Support team of authentication failures, the volume of incoming calls exceeded our phone system's capacity. This led to some customers being unable to connect with Support, receiving a message stating, "The user you are trying to reach is unavailable" before the call was prematurely disconnected.
After Duo reverted the identified configuration change responsible for the authentication failures, some users still had the Duo status "Locked Out" because they exceeded the allowed number of authentication attempts during this incident. Users on Duo accounts with "Revert user status after X minutes" enabled in the Lockout and Fraud section of Settings will be unlocked automatically after the designated amount of time. If this setting is not enabled, an account administrator will need to manually unlock affected user accounts.
How Did Duo Resolve the Incident?
Duo engineering reverted the deployed configuration, disabling post-authentication event processing within the internal API. Once the reverting change was deployed, the issue was resolved.
How Many Customers Were Impacted?
The outage affected 7,666 Advantage and Premier customers hosted in the US.
What Is Duo Doing to Prevent This in the Future?
The service owner engineering team is improving existing service monitors by setting finer-grained thresholds to help identify increased latency sooner. Duo will modify the process of introducing configuration changes to this internal API service to gradually increase traffic, rather than instantly deploying region-wide.
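The two remediations described above, a gradual traffic ramp behind the feature flag mentioned in the timeline and latency monitoring with explicit thresholds, can be combined so that a saturating downstream API is cut off before it reaches region-wide exposure. The following is an added illustration and not Duo's own code: the flag name, the ramp percentages, the p95 and error budgets, and the simulated backend behaviour are all assumptions constructed for the example.

```python
import hashlib
import math
from dataclasses import dataclass, field

# Fraction of authentication traffic routed through the new post-auth step at each
# ramp stage. Hypothetical values; the real stages used by any service are not on the page.
RAMP_STAGES = (0.0, 0.01, 0.05, 0.25, 0.50, 1.0)


@dataclass
class StagedFlag:
    """A feature flag whose exposure grows one stage at a time and can be killed."""
    name: str
    stage: int = 0
    killed: bool = False

    @property
    def fraction(self) -> float:
        return 0.0 if self.killed else RAMP_STAGES[self.stage]

    def enabled_for(self, request_id: str) -> bool:
        # Hash-based bucketing: each request lands in a stable bucket in [0, 1),
        # so raising the fraction only adds callers and never flips earlier ones back.
        digest = hashlib.sha256(f"{self.name}:{request_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        return bucket < self.fraction

    def advance(self) -> None:
        if not self.killed and self.stage < len(RAMP_STAGES) - 1:
            self.stage += 1

    def kill(self) -> None:
        self.killed = True


@dataclass
class LatencyMonitor:
    """Sliding window of (latency_ms, succeeded) observations checked against budgets."""
    p95_budget_ms: float
    error_budget: float
    window_size: int = 100
    samples: list = field(default_factory=list)

    def record(self, latency_ms: float, ok: bool) -> None:
        self.samples.append((latency_ms, ok))
        del self.samples[:-self.window_size]  # keep only the newest window_size entries

    def p95_ms(self) -> float:
        latencies = sorted(lat for lat, _ in self.samples)
        rank = math.ceil(0.95 * len(latencies))  # nearest-rank percentile
        return latencies[rank - 1]

    def error_rate(self) -> float:
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

    def healthy(self) -> bool:
        if not self.samples:
            return True
        return self.p95_ms() <= self.p95_budget_ms and self.error_rate() <= self.error_budget


def run_ramp(flag: StagedFlag, monitor: LatencyMonitor, observe) -> int:
    """Advance the ramp one stage at a time while the monitor stays healthy; kill the
    flag the moment a stage breaches budget. `observe(fraction)` returns the
    (latency_ms, ok) observations seen at that exposure. Returns the stage reached."""
    while flag.stage < len(RAMP_STAGES) - 1:
        flag.advance()
        for latency_ms, ok in observe(flag.fraction):
            monitor.record(latency_ms, ok)
        if not monitor.healthy():
            flag.kill()
            break
    return flag.stage


def simulated_backend(fraction: float):
    """Constructed sample data: the downstream API copes up to 5% exposure, then
    saturates so that both p95 latency and error rate exceed budget."""
    if fraction <= 0.05:
        return [(120.0, True)] * 100
    # Overloaded: 6 of 100 calls take 4 s (p95 breaches) and 3 more time out.
    return [(120.0, True)] * 91 + [(4000.0, True)] * 6 + [(4000.0, False)] * 3


if __name__ == "__main__":
    flag = StagedFlag("post_auth_event_processing")  # hypothetical flag name
    monitor = LatencyMonitor(p95_budget_ms=500.0, error_budget=0.01)
    reached = run_ramp(flag, monitor, simulated_backend)
    print(f"ramp stopped at stage {reached} ({RAMP_STAGES[reached]:.0%}), killed={flag.killed}")
```

With the constructed backend the ramp passes the 1% and 5% stages, trips the monitor at 25%, and kills the flag, so the remaining 75% of traffic never touches the saturated API. The point of the per-stage check is that the budget is evaluated after each exposure increase rather than only once after full deployment.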
We are taking measures to expand our Technical Support phone system's call handling capabilities to prevent future occurrences like this and ensure all customer calls are managed promptly and efficiently.
You can find your Duo deployment's ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.