Duo Authentication Delivery Failures
Incident Report for Duo
Postmortem

Summary

  • Between 15:07 and 15:53 EDT on May 13, 2024, users on some Duo deployments received timeouts and failures while attempting authentication. This was caused by an issue with a new version of the API component that creates authentication responses returned to applications. Affected users saw error messages such as "Something went wrong" or "Network Timeout." Users could not access their applications because applications never received responses from Duo. Duo Engineering rolled back the change and restored normal authentication functionality.

Deployments Impacted

  • DUO9, DUO22, DUO39, DUO40, DUO42, DUO49, DUO50, DUO55, DUO56, DUO62, DUO63, DUO64, DUO65, DUO72, DUO73, DUO74, DUO75, DUO77, DUO78

Timeline of Events (EDT)

2024-05-02 16:02 Configuration change is made to internal testing and internal beta environments for an internal API service.

2024-05-09 14:27 Configuration change is made to the Asia, Europe, and US1 environments for the same service.

2024-05-13 15:07 Configuration change is made to the US2 environment for the same service.

2024-05-13 15:09 Monitoring systems first receive alerts on a high volume of errors and high latency for authentication services. Duo SRE (Site Reliability Engineering) responds.

2024-05-13 15:13 Duo SRE escalates to the service owner engineering team.

2024-05-13 15:18 Configuration change is identified as the root cause of the high volume of errors for authentication services.

2024-05-13 15:18 Duo's Technical Support team is notified by customers of authentication failures through calls and emails.

2024-05-13 15:25 Duo’s Technical Support team contacts SRE about high incoming call rates.

2024-05-13 15:29 Rollback of configuration change is initiated.

2024-05-13 15:31 Status page is updated to Investigating.

2024-05-13 15:46 Rollback testing is completed and merged into source control.

2024-05-13 15:52 Feature flag that controlled traffic from the authentication service to the affected internal API service is disabled.

2024-05-13 15:53 Rollback of the configuration change within the affected internal API service is completed and the internal API service is redeployed with its previous configuration.

2024-05-13 16:00 Status page is updated to Identified.

2024-05-13 16:48 Status page is updated to Fix Implemented and Monitoring.

2024-05-13 16:50 Status page is updated to Monitoring.

2024-05-13 17:42 Status page is updated to Resolved.

Details of the Incident

Duo Engineering deployed a configuration change that enabled a post-authentication step within an internal API that is part of the authentication code path. The internal API became overloaded and could not handle calls in a timely manner. As a result, authentications could not complete and began to fail.
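The overload described above can be illustrated with a toy queueing model. This is only a sketch under stated assumptions, not a description of Duo's internal API: the function names, the per-second rates, and the idea of a fixed client timeout are all hypothetical. It shows how a service whose arrival rate exceeds its capacity accumulates a backlog until new calls wait longer than the caller is willing to wait.

```python
from typing import List, Optional


def backlog_over_time(arrivals_per_sec: float,
                      capacity_per_sec: float,
                      seconds: int) -> List[float]:
    """Return the queued-call backlog at the end of each second (toy model)."""
    backlog = 0.0
    history = []
    for _ in range(seconds):
        # Backlog grows by the excess of arrivals over capacity and never drops below zero.
        backlog = max(0.0, backlog + arrivals_per_sec - capacity_per_sec)
        history.append(backlog)
    return history


def seconds_until_timeout(arrivals_per_sec: float,
                          capacity_per_sec: float,
                          timeout_sec: float) -> Optional[int]:
    """Return the first second at which a new call's estimated wait exceeds
    timeout_sec, or None if the service keeps up with demand."""
    if arrivals_per_sec <= capacity_per_sec:
        return None
    backlog = 0.0
    elapsed = 0
    # Estimated wait for a newly queued call is backlog / capacity.
    while backlog / capacity_per_sec <= timeout_sec:
        backlog += arrivals_per_sec - capacity_per_sec
        elapsed += 1
    return elapsed
```

In this model, any extra per-request work that lowers effective capacity below the arrival rate eventually pushes waits past the timeout, no matter how small the shortfall is.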

Duo Engineering had previously tested this configuration change and had rolled it out to multiple test and production deployments over the past month without issue.

When customers notified Duo's Technical Support team of authentication failures, the volume of incoming calls exceeded our phone system's capacity. As a result, some customers were unable to connect with Support and instead received a message stating, "The user you are trying to reach is unavailable" before the call was disconnected.

After Duo reverted the configuration change responsible for the authentication failures, some users still had the Duo status "Locked Out" because they had exceeded the allowed number of authentication attempts during the incident. Users on Duo accounts with "Revert user status after X minutes" enabled in the Lockout and Fraud section of Settings are unlocked automatically after the designated amount of time. If this setting is not enabled, an account administrator must manually unlock affected user accounts.
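The two lockout outcomes described above can be expressed as a small decision rule. The following is a minimal sketch, not Duo's implementation: the function name and parameters are hypothetical, and `revert_after_minutes=None` is used here to stand for the "Revert user status after X minutes" setting being disabled.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_still_locked_out(locked_at: datetime,
                        now: datetime,
                        revert_after_minutes: Optional[int]) -> bool:
    """Return True if a user locked out at locked_at is still locked at now.

    revert_after_minutes is the hypothetical value of the auto-revert setting;
    None means the setting is disabled and only an administrator can unlock.
    """
    if revert_after_minutes is None:
        # No automatic revert: the user stays locked until manually unlocked.
        return True
    return now < locked_at + timedelta(minutes=revert_after_minutes)
```

For example, with a 30-minute revert window, a user locked out at 15:20 would still be locked at 15:40 but not at 15:50; with the setting disabled, the user stays locked at any later time.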

How Did Duo Resolve the Incident?

Duo Engineering reverted the deployed configuration, disabling post-authentication event processing within the internal API. Once the reverting change was deployed, the issue was resolved.

How Many Customers Were Impacted?

The outage affected 7,666 Advantage and Premier customers hosted in the US.

What Is Duo Doing to Prevent This in the Future?

The service owner engineering team is improving existing service monitors by setting finer-grained thresholds to help identify increased latency sooner. Duo will modify the process of introducing configuration changes to this internal API service to gradually increase traffic, rather than instantly deploying region-wide.
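One common way to increase traffic gradually is a percentage-based ramp gated on a latency threshold. The sketch below is an illustration under stated assumptions, not Duo's rollout process: the stage percentages, the function names, and the use of a p99 latency threshold as the health signal are all hypothetical.

```python
import hashlib

# Hypothetical ramp schedule: percent of requests sent to the new code path at each stage.
RAMP_STAGES = [1, 5, 25, 50, 100]


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_path(request_id: str, rollout_percent: int) -> bool:
    """Route a request to the new code path if its bucket falls inside the rollout."""
    return bucket(request_id) < rollout_percent


def next_stage(current_percent: int,
               p99_latency_ms: float,
               threshold_ms: float) -> int:
    """Advance one ramp stage while latency is healthy; drop to 0 if it is not."""
    if p99_latency_ms > threshold_ms:
        # Latency threshold breached: disable the new path entirely.
        return 0
    for stage in RAMP_STAGES:
        if stage > current_percent:
            return stage
    return current_percent  # Already at the final stage.
```

Hashing the request ID keeps routing deterministic, so a given request lands on the same path at a given rollout percentage, and a tighter latency threshold stops the ramp while only a small share of traffic is exposed.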

We are expanding our Technical Support phone system's call-handling capacity to prevent a recurrence and to ensure all customer calls are managed promptly and efficiently.

You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.

Posted May 14, 2024 - 16:48 EDT

Resolved
The issue causing authentication failures on all affected deployments is fully resolved and all services are now fully functional.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.
Posted May 13, 2024 - 17:42 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 13, 2024 - 16:50 EDT
Update
A fix has been implemented and we are monitoring the results.
Posted May 13, 2024 - 16:48 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted May 13, 2024 - 16:00 EDT
Investigating
We are currently investigating an issue causing authentication failures. We are working to correct the issue as soon as possible.

Please check back here or subscribe to updates for any changes.
Posted May 13, 2024 - 15:31 EDT
This incident affected: DUO22 (Push Delivery), DUO39 (Push Delivery), DUO40 (Push Delivery), DUO42 (Push Delivery), DUO9 (Push Delivery), DUO49 (Push Delivery), DUO50 (Push Delivery), DUO55 (Push Delivery), DUO56 (Push Delivery), DUO62 (Push Delivery), DUO63 (Push Delivery), DUO64 (Push Delivery), DUO65 (Push Delivery), DUO72 (Push Delivery), DUO73 (Push Delivery), DUO74 (Push Delivery), DUO76 (Push Delivery), DUO77 (Push Delivery), and DUO78 (Push Delivery).