On July 22, 2024, from 4:38 pm ET to 4:40 pm ET, users on DUO15 experienced a brief authentication outage. The root cause was identified as a configuration issue with one of our Deployment Services.
The issue was resolved in approximately 2 minutes, once the affected services came back online.
2024-07-22 14:03: Duo Site Reliability Engineering (SRE) is informed that the Continuous Delivery system is not working as expected
2024-07-22 14:11: SRE acknowledges that updates are not propagating to production systems via the Continuous Delivery system
2024-07-22 14:13: SRE decides that for quick propagation of updates to production systems, a manual restart is needed
2024-07-22 14:14: SRE starts manual upgrades across various deployments and observes that changes are propagating as expected
2024-07-22 16:38: While upgrading one deployment in the same fashion, SRE observes that all of its services are terminated during the rolling restart, resulting in about two minutes of downtime
2024-07-22 16:40: New services in the affected deployment start serving traffic with updated configurations
2024-07-22 16:45: To avoid a repeat in the remaining deployments, SRE decides that the safest approach is to take down one service at a time while the others serve incoming traffic
2024-07-22 17:30: All deployments are updated with the latest configuration changes
2024-07-22 22:30: SRE identifies the root cause of the Continuous Delivery system not working as expected and proceeds to resolve the issue
During a scheduled configuration rollout, the team noticed that our Deployment Service was struggling with a specific deployment and moved quickly to remediate it. During that remediation, however, our Core Authentication Service suffered a brief outage affecting the DUO15 deployment.
The team then deployed the rest of the configurations one by one, to ensure we followed our guideline of restarting 33% of services per deployment.
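As an illustration of the restart guideline described above, the following is a minimal sketch of a batched rolling restart. It assumes the 33% guideline means that no more than 33% of a deployment's services are restarted at the same time. The function names (`restart_batches`, `rolling_restart`) and the `restart` and `is_healthy` callbacks are hypothetical and do not represent Duo's actual tooling.

```python
from typing import Callable, List, Sequence


def restart_batches(instances: Sequence[str], max_percent: int = 33) -> List[List[str]]:
    """Split instances into batches no larger than max_percent of the total.

    Hypothetical helper: integer arithmetic rounds the batch size down,
    and every batch holds at least one instance so progress is possible.
    """
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be between 1 and 100")
    batch_size = max(1, len(instances) * max_percent // 100)
    return [
        list(instances[i:i + batch_size])
        for i in range(0, len(instances), batch_size)
    ]


def rolling_restart(
    instances: Sequence[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_percent: int = 33,
) -> None:
    """Restart one batch at a time and stop if a restarted instance is unhealthy.

    `restart` and `is_healthy` are assumed callbacks supplied by the caller;
    instances outside the current batch keep serving traffic.
    """
    for batch in restart_batches(instances, max_percent):
        for name in batch:
            restart(name)
        unhealthy = [name for name in batch if not is_healthy(name)]
        if unhealthy:
            # Halt before touching the next batch to limit the blast radius.
            raise RuntimeError(f"unhealthy after restart: {unhealthy}")
```

With six services and the default limit, each batch contains a single service, which matches the one-at-a-time approach taken for the remaining deployments. Stopping at the first unhealthy batch keeps the rest of the deployment serving traffic.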
Between 800 and 1,000 authentications were affected.
To ensure that this issue doesn’t happen again, we’ve done the following: