Authentication failures for the India region
Incident Report for Duo
Postmortem

Summary

On January 28, 2025, at around 18:45, monitoring alerted Duo's Engineering Team to errors in the DUO69 MFA deployment. The root cause was identified as an incomplete rollout of a configuration update in the environment, which left the deployment unhealthy.

The issue was fully resolved on the same day by 19:55.

Timeline of Events

2025-01-28 18:45 Duo's Engineering Team was alerted by monitoring to errors in the DUO69 deployment. SRE began triage.

2025-01-28 19:06 All nodes for the DUO69 deployment were identified as being in a `NotReady` state.

2025-01-28 19:07 Authentication impact for this deployment was identified as 100%.

2025-01-28 19:41 A review of recent changes to the system identified that a previously made change had not completed successfully and was the root cause of the failure.

2025-01-28 19:42 The root cause was fixed by executing the system change to completion.

2025-01-28 19:47 All nodes for this deployment were up and healthy.

Details

The Duo Site Reliability Engineering Team has been rolling out changes to the way that engineers access the Amazon Elastic Kubernetes Service (EKS) clusters, moving from an older, deprecated method to access entries, which AWS recommends as a best practice.

During the rollout, the IAC (Infrastructure-as-Code) tooling needed to perform an operation that affected permissions for services in the cluster. The IAC tool determined that it needed to detach and reattach a policy, although this was unnecessary, and the order of operations left the policy detached. As a result, the infrastructure and services could not authenticate with our EKS cluster. Nodes could not join the cluster or renew their leases and defaulted to a `NotReady` state, which ultimately caused the outage. Although the initial results appeared successful, the permissions were not applied correctly. Re-running our IAC tool resolved the issue by reattaching the policy, allowing the services to authenticate with the EKS cluster again.
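
As an illustration of how the order of operations alone can leave a policy detached, the following is a toy model rather than a reconstruction of the actual tooling. It assumes attachments behave like a set, so attaching an already-attached policy is a no-op. The policy name and the step lists are hypothetical, and the report does not state which ordering the IAC tool actually used.

```python
# Toy model: attached policies are a set, and the tool applies steps in order.
# The policy name and step lists are hypothetical, not taken from the incident.

def apply_steps(attached, steps):
    """Return the set of attached policies after applying (action, policy) steps."""
    result = set(attached)
    for action, policy in steps:
        if action == "attach":
            result.add(policy)  # no-op if the policy is already attached
        elif action == "detach":
            result.discard(policy)
        else:
            raise ValueError(f"unknown action: {action}")
    return result


POLICY = "node-access-policy"  # hypothetical policy name
initial = {POLICY}

# Detach first, then attach: the policy ends up attached.
detach_then_attach = apply_steps(initial, [("detach", POLICY), ("attach", POLICY)])

# Attach first (a no-op here), then detach: the policy ends up detached.
attach_then_detach = apply_steps(initial, [("attach", POLICY), ("detach", POLICY)])

print(POLICY in detach_then_attach)  # True
print(POLICY in attach_then_detach)  # False
```

In this model both step lists contain the same two operations and both complete without error, which is consistent with a run that appears successful while leaving permissions in the wrong state.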

How did Duo resolve the incident?

Once the issue had been identified, re-running our IAC tooling completed the partially applied changes. This allowed the Kubernetes nodes to authenticate to the cluster and reach a Ready state, so that workloads that process authentication traffic could be scheduled onto them again and service was restored.
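
A minimal sketch of how node readiness could be confirmed after such a fix, assuming the node list has been exported in the JSON shape produced by `kubectl get nodes -o json`. The sample data and node names below are hypothetical. A node is reported as `NotReady` when its `Ready` condition has a status of `False` or `Unknown`.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        is_ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not is_ready:
            names.append(node["metadata"]["name"])
    return names


# Hypothetical sample in the shape of a Kubernetes NodeList.
sample = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
        {
            "metadata": {"name": "node-b"},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
    ]
}

unhealthy = not_ready_nodes(sample)
print(unhealthy)  # ['node-a']
print(f"{len(unhealthy)}/{len(sample['items'])} nodes not ready")  # 1/2 nodes not ready
```

An empty result from `not_ready_nodes` would correspond to the "all nodes up and healthy" state recorded in the timeline.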

How many customers were impacted, if applicable?

This incident affected all customers on the DUO69 deployment.

What is Duo doing to prevent this in the future?

As this is a one-time change, this issue is not expected to occur again in the future.

However, Duo has added some automation to the IAC change process to ensure that such incidents are not repeated. We have also updated our production change process to include a stricter review when changes are made to production clusters.
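
The report does not describe what this automation does. Purely as a hypothetical illustration, one form a post-change check could take is comparing the policy attachments that are expected after a run against those actually observed, so that a run that appears successful but leaves a policy detached is flagged. All role names, policy names, and the function below are assumptions, and how the observed state is collected is left out.

```python
def missing_attachments(expected, observed):
    """Map each role to the expected policies that are not attached to it."""
    missing = {}
    for role, policies in expected.items():
        gap = set(policies) - set(observed.get(role, ()))
        if gap:
            missing[role] = sorted(gap)
    return missing


# Hypothetical expected state and the state observed after a change is applied.
expected = {"node-role": ["node-access-policy", "registry-read-policy"]}
observed_after_apply = {"node-role": ["registry-read-policy"]}

gaps = missing_attachments(expected, observed_after_apply)
if gaps:
    # Prints the role and the policy that is missing from it.
    print(f"post-apply check failed: {gaps}")
```

Running a check of this kind immediately after a change, rather than relying on the tool's exit status alone, would surface a detached policy before nodes begin to fail authentication.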

Posted Jan 31, 2025 - 11:22 EST

Resolved
The issue causing authentication failures in the India region has been resolved and full functionality has been restored.
Posted Jan 28, 2025 - 20:10 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 28, 2025 - 19:54 EST
Identified
We have identified the issue causing authentication failures, including the passwordless experience, in the India region and a fix is being implemented.

Please check back here or subscribe to updates for any changes.
Posted Jan 28, 2025 - 19:48 EST
Investigating
We are currently investigating an issue causing authentication failures, including the passwordless experience, in the India region and are working to correct the issue as soon as possible.

Please check back here or subscribe to updates for any changes.
Posted Jan 28, 2025 - 19:21 EST
This incident affected: DUO69 (Core Authentication Service, Push Delivery, Phone Call Delivery, SMS Message Delivery, Cloud PKI, SSO).