Active Directory Sync Failures
Incident Report for Duo
Postmortem

Incident Report - 2023-02-07

Summary

On January 31, 2023, Duo received alerts that Duo Directory Synchronization was failing for multiple customers. The Duo Engineering team paused the deployment of release D258 to limit customer impact while they investigated. The Engineering team’s investigation identified a D258 code change in the Duo core authentication service that conflicted with any Duo Admin Panel service that had not yet been updated to D258. Engineering deployed a code fix and resumed deployment of D258, which reached all impacted customer deployments on February 1, 2023.

Deployments Impacted

  • DUO4, DUO6, DUO7, DUO10, DUO13, DUO19, DUO20, DUO21, DUO23, DUO28, DUO31, DUO33, DUO38, DUO43, DUO44, DUO45, DUO46, DUO47, DUO48, DUO51, DUO52, DUO55, DUO56, DUO58, DUO62, DUO63, DUO64, DUO66, DUO67, DUO68, DUO69, DUO70, DUO72, DUO73

Timeline of Events (EST)

2023-01-31 16:32  Duo Site Reliability Engineering (SRE) is informed via email that customers are having problems running directory syncs. SRE begins triage.

2023-01-31 17:25  Status page updated to: “We have identified the cause of the issue and we are implementing a fix.”

2023-01-31 17:40  Root cause identified by the Engineering team.

2023-01-31 19:16  Fix implemented in our codebase.

2023-02-01 12:50  Fix deployed to all impacted customer deployments.

2023-02-01 13:50  Status page updated: all systems are operational.

Details

On January 31, 2023 at 16:32 EST, Duo’s Site Reliability Engineering (SRE) team received monitoring alerts about multiple customers having problems running Duo Directory Synchronization with either OpenLDAP or Active Directory. Duo's Engineering Team paused the deployment of release D258 to limit customer impact while they investigated. 

By January 31 at 17:40 EST, Duo Engineering had traced the root cause of the failed directory syncs to a D258 change in Duo's core authentication service. When a customer's Duo core service was on D258 but their Duo Admin Panel service had not yet been updated to D258, any directory sync that customer started failed.
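The failure condition described above is a version skew between two services during a rolling deployment. As a rough illustration only, the sketch below flags deployments in that state. The `Deployment` type, its field names, the example deployment names, and the use of plain integers for release numbers are all assumptions made for this example; they do not describe Duo's internal tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    name: str           # hypothetical deployment label
    core_release: int   # assumed: release number of the core authentication service
    admin_release: int  # assumed: release number of the Admin Panel service


def sync_at_risk(dep: Deployment, breaking_release: int = 258) -> bool:
    """Return True when core runs the breaking release but the Admin Panel does not."""
    return dep.core_release >= breaking_release and dep.admin_release < breaking_release


deployments = [
    Deployment("example-a", core_release=258, admin_release=257),  # mid-rollout
    Deployment("example-b", core_release=258, admin_release=258),  # fully updated
    Deployment("example-c", core_release=257, admin_release=257),  # rollout not started
]

# Only the deployment caught between releases is flagged.
at_risk = [d.name for d in deployments if sync_at_risk(d)]
print(at_risk)
```

Under these assumptions only the mid-rollout deployment is flagged, which matches the report's statement that the error required the two services to be on different releases.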

Engineering deployed a code fix to release D258 and resumed deployment. The fix reached all impacted customer deployments by February 1 at 12:50 EST. 

Because the error could occur only while the Duo core and Admin Panel services were on different release versions, Engineering determined that the resolution was to repair the code identified as the root cause and then let the deployment finish. Once all services were on the same version, the condition that triggered the failures no longer existed and the issue would "self-heal".

The only customers at risk of impact from this incident were those who ran a Duo Directory Sync with OpenLDAP or Active Directory during the deployment of release D258 (from Thursday, January 26 until Engineering paused deployment on January 31). Six customers reported failed directory syncs during this release window. Customers who experienced a failed directory sync will have had the sync retried automatically, and the retry will have succeeded once the D258 deployment completed.

Root cause analysis identified the need for stronger API version testing in Duo’s continuous integration pipeline and an opportunity to improve Directory Sync monitoring. These measures will increase the likelihood of identifying problems before they impact customers.
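One common way to test API versions in a continuous integration pipeline is to check every combination of service releases that can coexist during a rollout, not just matching versions. The sketch below is a hypothetical illustration of that idea. The field names, the two contract tables, and the nature of the incompatibility are invented for the example; the report does not describe the actual code change or Duo's test suite.

```python
from itertools import product

# Hypothetical contract: fields the Admin Panel sends when starting a sync, per release.
ADMIN_SENDS = {
    257: {"directory_id", "sync_type"},
    258: {"directory_id", "sync_type", "request_version"},
}

# Hypothetical contract: fields the core service requires, per release.
CORE_REQUIRES = {
    257: {"directory_id", "sync_type"},
    258: {"directory_id", "sync_type", "request_version"},
}


def incompatible_pairs(admin_sends, core_requires):
    """Return (core, admin, missing_fields) for every pair where core needs a field admin omits."""
    failures = []
    for core_rel, admin_rel in product(sorted(core_requires), sorted(admin_sends)):
        missing = core_requires[core_rel] - admin_sends[admin_rel]
        if missing:
            failures.append((core_rel, admin_rel, sorted(missing)))
    return failures


failures = incompatible_pairs(ADMIN_SENDS, CORE_REQUIRES)
print(failures)
```

With these made-up contracts, the only failing pair is core 258 with Admin Panel 257, the same kind of mixed-version state that occurred during the D258 rollout. A CI job could fail the build whenever this list is non-empty.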

Posted Feb 14, 2023 - 13:11 EST

Resolved
The issue regarding Active Directory Sync is now fully resolved and all services are fully functional.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.
Please make sure to check back or subscribe to be notified when the RCA is posted.
Posted Feb 01, 2023 - 16:07 EST
Identified
We have identified the cause of the issue and are implementing a fix. Previously failed syncs should succeed at the next scheduled sync.
Please subscribe to the status page for further updates.
Posted Jan 31, 2023 - 17:50 EST
Investigating
We are investigating an issue causing failed Active Directory Syncs for selected deployments.
We are locating the cause of the issue and are working to find a solution.
Please subscribe to the status page for further updates.
Posted Jan 31, 2023 - 17:46 EST
This incident affected: DUO4 (Admin Panel), DUO6 (Core Authentication Service), DUO7 (Admin Panel), DUO47 (Admin Panel), DUO10 (Admin Panel), DUO13 (Admin Panel), DUO19 (Admin Panel), DUO20 (Admin Panel), DUO21 (Admin Panel), DUO23 (Admin Panel), DUO28 (Admin Panel), DUO31 (Admin Panel), DUO33 (Admin Panel), DUO38 (Admin Panel), DUO43 (Admin Panel), DUO44 (Admin Panel), DUO45 (Admin Panel), DUO46 (Admin Panel), DUO48 (Admin Panel), DUO51 (Admin Panel), DUO52 (Admin Panel), DUO55 (Admin Panel), DUO56 (Admin Panel), DUO58 (Admin Panel), DUO62 (Admin Panel), DUO63 (Admin Panel), DUO64 (Admin Panel), DUO66 (Admin Panel), DUO67 (Admin Panel), DUO68 (Admin Panel), DUO69 (Admin Panel), DUO70 (Admin Panel), DUO72 (Admin Panel), and DUO73 (Admin Panel).