On December 4, 2023, at around 23:03 EST, Duo's Engineering Team was alerted by monitoring that DUO1 was experiencing intermittent authentication failures. The root cause was identified as a race condition triggered during routine database cluster scaling.
The issue was resolved within a few hours by removing the additional database nodes and restarting the affected instances.
21:41 Duo Site Reliability Engineering (SRE) completes the scheduled maintenance window for DUO1, increasing the total database node count to six. All systems are operational.
22:40 Duo SRE adds two more nodes to increase headroom in the database tier.
22:48 Duo SRE receives an elevated Redis error count alert from internal monitoring.
23:06 Duo SRE finds degraded authentication on two of the eight database nodes, the two most recently added.
23:10 Duo SRE starts removing both new database nodes.
23:26 Authentication success improves after both new nodes have been scaled down.
23:33 Status page is updated to monitoring.
01:09 Duo SRE performs a rolling restart on the affected nodes, clearing lingering errors from them.
01:14 Status page is updated to resolved.
04:44 Final rolling restart is started to ensure no lingering behavior is present.
Duo SRE normally completes maintenance work during a deployment's off-peak hours, with no expected downtime. DUO1 had off-peak maintenance scheduled to increase database redundancy and capacity. Further investigation determined that adding two more nodes to increase headroom in the database tier triggered a race condition. This caused connections to the caching layer to fail, which produced failures on a quarter of two-factor authentication requests. The effect was semi-random: a user who performed a hard refresh could be presented with a successful authentication.
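The "quarter of requests" figure matches two affected nodes out of eight. The following is a minimal arithmetic sketch, not Duo's analysis: it assumes requests are spread evenly across nodes, that every request reaching an affected node fails, and that a retry is routed independently of the first attempt. The function names are hypothetical.

```python
from fractions import Fraction


def failure_fraction(total_nodes: int, bad_nodes: int) -> Fraction:
    """Fraction of requests that fail, assuming even load across nodes
    and that every request reaching a bad node fails (assumptions)."""
    if total_nodes <= 0 or not 0 <= bad_nodes <= total_nodes:
        raise ValueError("need 0 <= bad_nodes <= total_nodes and total_nodes > 0")
    return Fraction(bad_nodes, total_nodes)


def failure_after_attempts(total_nodes: int, bad_nodes: int, attempts: int) -> Fraction:
    """Probability that all `attempts` tries fail, assuming each try is
    routed independently (assumption)."""
    return failure_fraction(total_nodes, bad_nodes) ** attempts


# Two affected nodes out of eight, as in the timeline.
print(failure_fraction(8, 2))           # 1/4
print(failure_after_attempts(8, 2, 2))  # 1/16
```

Under these assumptions, a single hard refresh lowers the chance of seeing a failure from 1/4 to 1/16, which is consistent with users often succeeding on a retry.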
Removing the two problematic nodes resolved the race condition and restored normal authentication patterns. Long-running processes continued to experience errors, so Duo SRE performed a rolling restart across the deployment, which cleared the remaining issues.
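A rolling restart of this kind can be sketched as follows. This is an illustration of the general technique only, not Duo's tooling: `restart` and `is_healthy` are hypothetical callbacks supplied by the caller, and the bounded health-check loop is an assumption about how such a rollout might be gated.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: restarts one node
    is_healthy: Callable[[str], bool],   # hypothetical: health probe for one node
    max_checks: int = 5,
) -> List[str]:
    """Restart nodes one at a time, moving on only after each node
    reports healthy, so the rest of the fleet keeps serving traffic."""
    restarted: List[str] = []
    for node in nodes:
        restart(node)
        for _ in range(max_checks):
            if is_healthy(node):
                break
        else:
            # The node never became healthy: stop before touching more nodes.
            raise RuntimeError(f"{node} did not become healthy; stopping rollout")
        restarted.append(node)
    return restarted


# Usage with stand-in callbacks that only record what was restarted.
log: List[str] = []
print(rolling_restart(["node-a", "node-b"], log.append, lambda node: True))
```

Stopping at the first node that fails its health check limits the blast radius if the restart itself introduces a problem.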
Duo SRE will address the race condition in the client code that connects to the database. In addition, Duo SRE is investigating ways to minimize disruption in the event of a database issue.
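The report does not describe the race itself, so the following is only a generic sketch of one common way to make connection setup safe when several threads request it at once: lock-guarded lazy initialization. The class name and the `connect` callback are hypothetical and do not reflect Duo's client code.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyConnection(Generic[T]):
    """Create a shared connection at most once, even if many threads
    ask for it concurrently (illustrative pattern only)."""

    def __init__(self, connect: Callable[[], T]) -> None:
        self._connect = connect          # hypothetical connection factory
        self._lock = threading.Lock()
        self._conn: Optional[T] = None

    def get(self) -> T:
        # Fast path: connection already established.
        if self._conn is None:
            with self._lock:
                # Re-check under the lock so only one thread connects.
                if self._conn is None:
                    self._conn = self._connect()
        return self._conn


# Usage: eight threads share one connection object.
created = []
holder = LazyConnection(lambda: created.append("conn") or object())
threads = [threading.Thread(target=holder.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(created))  # 1
```

Without the lock, two threads could both observe an unset connection and each run the setup, which is the kind of interleaving a race condition fix typically has to rule out.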
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.