On September 29, 2023, at around 6:42 pm EST, Duo's Engineering Team was alerted by monitoring that authentication to Duo applications and the Duo Admin Panel had failed for customers on the DUO3 deployment. The root cause was identified as a failure in one of our scheduled maintenance tasks.
The issue was resolved on the same day by 6:49 pm EST.
18:42 Duo Site Reliability Engineering (SRE) was informed by Duo internal monitoring that a scheduled maintenance task had failed.
18:49 Duo SRE team mitigated the failed task and restored functionality.
18:49 Duo SRE started monitoring to confirm that the issue had been mitigated.
18:53 Duo SRE team validated through automated monitoring that we had restored complete functionality.
18:59 Duo SRE team worked with our TSE team to provide communication to our customers, confirming that only DUO3 was impacted and that the impact lasted 7 minutes. Status Page Updated.
19:19 Status Page Updated to: “We have confirmed that authentication services are back to fully operational and this issue is resolved. We will provide a Root Cause Analysis (RCA) as soon as it is available.”
DUO3 has multiple redundant load balancer pairs that accept requests from the internet and distribute them to applications. Within each pair, one half actively processes requests and the other acts as a passive hot spare.
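As an illustration of the active/passive arrangement described above, the following minimal Python sketch models a single load balancer pair. It is not Duo's implementation: the class names, node names, health flag, and failover rule are all hypothetical and exist only to show how a hot spare can take over when the active half is unavailable.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancer:
    """Hypothetical load balancer node with a simple health flag."""
    name: str
    healthy: bool = True


@dataclass
class LoadBalancerPair:
    """Hypothetical active/passive pair: one node serves, one is a hot spare."""
    active: LoadBalancer
    standby: LoadBalancer

    def route(self, request: str) -> str:
        """Return the name of the node that handles the request."""
        if not self.active.healthy:
            if not self.standby.healthy:
                # Neither half can serve traffic, so the pair is unavailable.
                raise RuntimeError("both nodes in the pair are unavailable")
            # Promote the hot spare; the failed node becomes the standby.
            self.active, self.standby = self.standby, self.active
        return self.active.name


pair = LoadBalancerPair(LoadBalancer("lb-a"), LoadBalancer("lb-b"))
first = pair.route("GET /auth")    # served by the active node
pair.active.healthy = False        # simulate a failure on the active node
second = pair.route("GET /auth")   # served by the promoted hot spare
```

In this sketch a pair only stops serving traffic when both halves are unhealthy at the same time.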
Duo SRE runs scheduled maintenance on our load balancer inventory after hours. During this scheduled maintenance, one of the tasks failed. Once the SRE team was alerted to the failure, they updated the failed task to restore service, which took 7 minutes.
The Duo SRE team is dedicated to providing reliable service to all users. The team has completed a retrospective to determine steps and actions to avoid similar incidents in the future.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.