On September 7, 2023, at 14:55 EST, Duo's Engineering Team was alerted by monitoring that the DUO39 deployment was unreachable. The root cause was identified as an automation test that began at 14:53 EST and failed because it pointed to the wrong deployment.
The issue was resolved at 15:18 EST by restoring the load balancer configurations for DUO39.
September 7, 2023
14:53 Prior to incident alert - Automation test executed
14:55 Duo Site Reliability Engineering (SRE) received an alert indicating a failure in the load balancer health check.
14:57 Duo SRE acknowledged the failed health check alert.
14:57 Duo SRE received a high error volume alert indicating that users on DUO39 were experiencing timeouts.
15:03 Duo Engineering started investigating, confirming alerts were correlated and impacting customers.
15:13 The root issue (see timestamp 14:53) and its resolution, restoring configurations to the associated load balancers, were identified.
15:18 Resolution completed on the first load balancer pair, thus enabling authentications.
15:20 Resolution completed on all load balancer pairs.
15:22 Customers reported that authentication was working; impact subsided.
15:43 Status page updated to "We have identified an issue causing authentication failures on DUO39. The issue has resolved. We will continue to monitor closely and are actively investigating the root cause."
16:17 Status page updated to "The issue with DUO39 is fully resolved and all services are now functional."
DUO39 has multiple redundant load balancer pairs that accept requests from the internet and distribute them to applications. Within each pair, one half actively processes requests and the other acts as a passive hot spare.
When a load balancer requires maintenance, Duo Site Reliability Engineering (SRE) works on the passive half, performs a passive-to-active swap, and then repeats the maintenance on the remaining half, which is now passive.
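The maintenance sequence described above can be sketched in a few lines of Python. This is an illustrative model only, not Duo's tooling: the `LoadBalancer` and `BalancerPair` classes, the `maintain_pair` function, and the `lb-a`/`lb-b` names are all hypothetical. It shows only the ordering of steps: maintain the passive half, swap roles, then maintain the other half.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancer:
    name: str          # hypothetical identifier
    role: str          # "active" or "passive"
    maintained: bool = False


@dataclass
class BalancerPair:
    first: LoadBalancer
    second: LoadBalancer

    def active(self) -> LoadBalancer:
        return self.first if self.first.role == "active" else self.second

    def passive(self) -> LoadBalancer:
        return self.first if self.first.role == "passive" else self.second

    def swap(self) -> None:
        # Look up both halves before changing either role.
        old_active, old_passive = self.active(), self.passive()
        old_active.role, old_passive.role = "passive", "active"


def maintain_pair(pair: BalancerPair, do_maintenance) -> None:
    """Maintain both halves of a pair, touching only the passive half."""
    # Step 1: work on the half that is not serving traffic.
    target = pair.passive()
    do_maintenance(target)
    target.maintained = True

    # Step 2: passive-to-active swap; the freshly maintained half takes over.
    pair.swap()

    # Step 3: repeat on the remaining half, which is now passive.
    target = pair.passive()
    do_maintenance(target)
    target.maintained = True


pair = BalancerPair(LoadBalancer("lb-a", "active"), LoadBalancer("lb-b", "passive"))
maintain_pair(pair, lambda lb: print(f"maintaining {lb.name} while {lb.role}"))
```

In this model, maintenance is only ever applied to a half whose role is `passive`, so one half of the pair remains active throughout.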
Duo SRE has developed a new process for auto-detecting load balancer inventory, which eliminates risk and better detects errors prior to maintenance swaps. We conducted a test to determine whether the new process connected correctly to DUO39's production load balancers. We intended to test only the connection to DUO39. However, an error in the script caused the test to remove configuration from all DUO39 load balancers, causing them to stop accepting requests for authentication, APIs, and other operations.
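As an illustration of the general idea of a connection-only test, the sketch below wraps a load balancer handle so that only read-only operations can be invoked. It is not a description of Duo's script or of any remediation Duo adopted; `ReadOnlyClient`, `FakeLoadBalancer`, `connection_test`, and the operation names are all hypothetical stand-ins.

```python
class ReadOnlyViolation(Exception):
    """Raised when a test tries to call a state-changing operation."""


class ReadOnlyClient:
    # Hypothetical allow-list: anything not named here is rejected.
    READ_ONLY_OPERATIONS = frozenset({"ping", "get_config"})

    def __init__(self, backend):
        self._backend = backend

    def call(self, operation, *args):
        if operation not in self.READ_ONLY_OPERATIONS:
            raise ReadOnlyViolation(f"{operation!r} is not allowed in a connection test")
        return getattr(self._backend, operation)(*args)


class FakeLoadBalancer:
    """In-memory stand-in for a load balancer, used only for this example."""

    def __init__(self, name):
        self.name = name
        self.config = {"listeners": ["https"]}

    def ping(self):
        return True

    def get_config(self):
        return dict(self.config)

    def clear_config(self):
        self.config = {}


def connection_test(balancers):
    """Return {name: reachable} using only read-only calls."""
    return {lb.name: ReadOnlyClient(lb).call("ping") for lb in balancers}


lb = FakeLoadBalancer("lb-a")
print(connection_test([lb]))

try:
    ReadOnlyClient(lb).call("clear_config")
except ReadOnlyViolation as exc:
    print(f"blocked: {exc}")
```

The design choice here is an allow-list rather than a deny-list: an operation that nobody anticipated is rejected by default instead of being executed.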
Upon receiving the alert, Duo SRE began investigating, discovered the issue, and manually restored configurations to the load balancer pairs. It took 23 minutes to bring the first pair of load balancers back online. This pair had enough capacity to handle all DUO39 traffic while Duo SRE worked to restore the remaining pairs.
The Duo SRE team is dedicated to providing reliable service to all users. To that end, the team will complete a retrospective to determine the steps and actions needed to avoid similar incidents in the future.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.