Duo15: Core Authentication Service outage
Incident Report for Duo
Postmortem

Summary

On July 22, 2024, from 4:38 pm ET to 4:40 pm ET, users on DUO15 experienced a brief authentication outage. The root cause was identified as a configuration issue with one of our Deployment Services.

The issue was resolved in approximately 2 minutes, as the services came online again.

Deployments Impacted

  • DUO15

Timeline of Events EST

2024-07-22 14:03: Duo Site Reliability Engineering (SRE) is informed that the Continuous Delivery system is not working as expected 

2024-07-22 14:11: SRE acknowledges that updates are not propagating to production systems via Continuous Delivery system  

2024-07-22 14:13: SRE decides that for quick propagation of updates to production systems, a manual restart is needed

2024-07-22 14:14: SRE starts manually upgrades across various deployments and it was observed that changes were propagating as expected.  

2024-07-22 16:38: While upgrading one deployment in a similar fashion, SRE observed that all the services were terminated during rolling restarts, resulting in a downtime of about two minutes

2024-07-22 16:40: New services in the affected deployment start serving traffic with updated configurations

2024-07-22 16:45: To avoid such scenarios in the rest of the deployments, SRE decided that the safest way was to take down one service at a time while the others served incoming traffic.

2024-07-22 17:30: All the deployments were updated with latest configuration changes

2024-07-22 22:30: SRE identified the root cause of Continuous Delivery System not working as expected and proceeds to solve the issue.

Details

During a scheduled configuration rollout, the team noticed that our Deployment Service was struggling with a specific Deployment. The team quickly noticed this issue and moved to remediate it.  However, while doing so, there was a brief outage of our Core Authentication Service affecting the DUO15 deployment. 

The team diligently deployed the rest of the configurations one by one, to ensure we followed the 33% restart per deployment guidelines.

800-1000 authentications were affected.

To ensure that this issue doesn’t happen again, we’ve done the following: 

  1. Added steps to make sure manual intervention is not needed/as minimal as possible.
  2. If manual intervention is ever needed, make sure to take the services out one by one.
Posted Jul 26, 2024 - 15:11 EDT

Resolved
From approximately 8:37 AM UTC to 8:40 AM UTC, there was a brief outage of our Core Authentication Service affecting the DUO15 deployment. All services are now fully functional and we are working to provide a Root Cause Analysis as soon as it is available.
Posted Jul 22, 2024 - 17:16 EDT
This incident affected: DUO15 (Core Authentication Service).