Intermittent authentication failure on DUO1

Incident Report for Duo

Postmortem

Summary

On December 4, 2023, at around 23:03 EST, monitoring alerted Duo's Engineering Team that DUO1 was experiencing intermittent authentication failures. The root cause was identified as a race condition triggered during routine database cluster scaling.

The issue was resolved on the same day by removing the additional database nodes and restarting the affected instances.

Deployments Impacted

DUO1

Timeline of Events (EST)

2023-12-04

21:41 Duo Site Reliability Engineering (SRE) completes the scheduled maintenance window for DUO1, increasing the total database node count to six. All systems are operational.

22:40 Duo SRE adds two more nodes to increase headroom in the database tier.

22:48 Duo SRE receives an alert from internal monitoring for an elevated Redis error count.

23:06 Duo SRE finds degraded authentication on two of the eight database nodes, the two most recently added.

23:10 Duo SRE starts removing both new database nodes.

23:26 Authentication success rates improve after both nodes are scaled down.

23:33 Status page is updated to Monitoring.

2023-12-05

01:09 Duo SRE performs a rolling restart of the affected nodes to clear the lingering errors.

01:14 Status page is updated to Resolved.

04:44 A final rolling restart is started to ensure no lingering behavior remains.

Details

Duo SRE performs maintenance that is not expected to cause downtime during a deployment's off-peak hours. DUO1 had an off-peak maintenance window scheduled to increase database redundancy and capacity. Further investigation determined that adding two more nodes to increase headroom in the database tier triggered a race condition. This caused the connection to the caching layer to fail, which in turn caused failures on a quarter of two-factor authentication requests. The effect was semi-random: after a hard refresh, a user could be presented with a successful authentication.
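The "quarter of requests" figure and the hard-refresh behavior line up with simple arithmetic if requests were spread evenly and independently across all eight nodes. That uniform routing is an assumption made here for illustration; the report does not describe how requests were distributed. A minimal sketch under that assumption:

```python
from fractions import Fraction

# Node counts taken from the timeline: eight database nodes, two of them failing.
TOTAL_NODES = 8
FAILING_NODES = 2


def failure_rate(total: int, failing: int) -> Fraction:
    """Chance that one request lands on a failing node, assuming uniform routing."""
    return Fraction(failing, total)


def all_attempts_fail(total: int, failing: int, attempts: int) -> Fraction:
    """Chance that every one of `attempts` independent tries hits a failing node."""
    return failure_rate(total, failing) ** attempts


if __name__ == "__main__":
    print(failure_rate(TOTAL_NODES, FAILING_NODES))          # 1/4
    print(all_attempts_fail(TOTAL_NODES, FAILING_NODES, 2))  # 1/16
```

Under these assumptions, a single retry after a failed attempt would succeed three times out of four, which would match users seeing a successful authentication after a hard refresh.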

Removing the two problematic nodes resolved the race condition and restored normal authentication patterns. Long-running processes were still experiencing errors, so Duo SRE ran a rolling restart across the deployment, which cleared the remaining issues.
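For readers unfamiliar with the term, a rolling restart cycles nodes one at a time so that the rest of the deployment keeps serving traffic. The sketch below is a generic illustration only, not Duo's tooling: `restart` and `is_healthy` are hypothetical hooks supplied by the caller, and the timeout values are arbitrary.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> List[str]:
    """Restart nodes one at a time, waiting for each to report healthy.

    `restart` and `is_healthy` are hypothetical caller-supplied hooks.
    Returns the nodes that were restarted successfully. Raises TimeoutError
    if a node does not become healthy in time, which stops the rollout
    before the remaining nodes are touched.
    """
    done: List[str] = []
    for node in nodes:
        restart(node)
        deadline = time.monotonic() + timeout_s
        # Poll until the node is healthy or the deadline passes.
        while not is_healthy(node):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{node} did not become healthy after restart")
            time.sleep(poll_s)
        done.append(node)
    return done
```

Stopping at the first node that fails its health check keeps a bad restart from spreading to the rest of the deployment.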

Duo SRE will address the race condition in the client code that connects to the database. In addition, Duo SRE is investigating ways to minimize disruption if an issue with the database occurs.

Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.

Posted Dec 05, 2023 - 00:14 EST

Resolved

We have confirmed that authentication services are fully operational again on DUO1, and this issue is resolved.
We will post a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.

Posted Dec 04, 2023 - 01:19 EST

Monitoring

We have identified and remediated an issue intermittently affecting authentications on the DUO1 deployment.
We are continuing to see a recovery for the authentication failures on DUO1 and we are monitoring the results.

Posted Dec 03, 2023 - 23:35 EST

Investigating

We have identified and remediated an issue intermittently affecting authentications on the DUO1 deployment.
We are continuing to see a recovery for the authentication failures on DUO1 and we are monitoring the results.

Posted Dec 03, 2023 - 23:34 EST

This incident affected: DUO1 (Core Authentication Service, Admin Panel, Push Delivery, Phone Call Delivery, SMS Message Delivery, Cloud PKI, SSO).