Authentication slowness or failure to load Duo Prompt on DUO62

Incident Report for Duo

Postmortem

Summary

On October 4, 2023, at around 16:55 EDT, Duo's Engineering Team was alerted by monitoring that one of the database shards on the DUO62 deployment was experiencing high latency and database errors. As a result, customers were unable to authenticate or access the Admin Panel. The root cause was identified as a hardware failure in the underlying infrastructure of our cloud vendor.

The issue was resolved on the same day by a failover to a standby instance.

Deployments Impacted

DUO62

Timeline of Events (EDT)

2023-10-04

16:55 Duo Site Reliability Engineering (SRE) is notified by our internal monitoring that DUO62 is experiencing latency and begins to investigate.

17:02 Engineering and Support confirm that impacted customers are unable to authenticate or log in to the Admin Panel.

17:09 Instance failover initiated by automated monitoring.

17:11 SRE is informed of instance failover initiation.

17:17 Instance failover is complete.

17:18 Support confirms customer impact is mitigated.

Details

Duo SRE was notified that one of the database shards on DUO62 was experiencing high latency. During the investigation, the system automatically failed the database over to its synchronized replica. Within seven minutes of the failover, the system began to return to normal.

Further investigation by Duo confirmed that the latency and subsequent failover were caused by an underlying hardware error. A storage error on the primary database server triggered a failover to the secondary as part of the system's self-healing capability, which restored service.
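
As a rough illustration of the behavior described above, the following Python sketch shows one way a monitor might promote a standby after repeated unhealthy probes. It is not Duo's implementation: the class name `FailoverMonitor`, the 500 ms latency threshold, and the three-probe limit are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    """Hypothetical monitor: promote the standby after N consecutive bad probes."""

    latency_threshold_ms: float = 500.0  # assumed threshold, not from the report
    max_consecutive_failures: int = 3    # assumed limit, not from the report
    active: str = "primary"
    _failures: int = field(default=0, init=False)

    def record_probe(self, latency_ms: float, error: bool) -> str:
        """Record one health probe and return the name of the active instance."""
        if self.active != "primary":
            # Already failed over; this sketch does not model failback.
            return self.active
        unhealthy = error or latency_ms > self.latency_threshold_ms
        # A healthy probe resets the counter; an unhealthy one increments it.
        self._failures = self._failures + 1 if unhealthy else 0
        if self._failures >= self.max_consecutive_failures:
            self.active = "standby"
        return self.active


monitor = FailoverMonitor()
# (latency in ms, error flag) for four successive probes of the primary
probes = [(40.0, False), (900.0, False), (1200.0, True), (1500.0, True)]
states = [monitor.record_probe(latency, err) for latency, err in probes]
print(states)  # ['primary', 'primary', 'primary', 'standby']
```

In this sketch a single healthy probe resets the counter, so a brief latency spike would not trigger a failover; only sustained degradation does.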

What is Duo doing to prevent this in the future?

Duo will continue to look for opportunities to minimize the impact of hardware failures. After the incident, we improved our monitoring to detect hardware failures sooner so that failover is initiated faster. Duo is also working to further decouple the authentication path from the persistence/database layer as part of an Authentication Path Resiliency project. This streaming-based solution will reduce the impact on the authentication path during periods of database contention.

Posted Oct 17, 2023 - 14:00 EDT

Resolved

We have confirmed that authentication services are fully operational again on DUO62 and that this issue is resolved.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.

Posted Oct 04, 2023 - 20:11 EDT

Monitoring

We are continuing to see recovery from the authentication slowness and failures on DUO62 and are monitoring the results.

Please check back here or subscribe to updates for any changes.

Posted Oct 04, 2023 - 17:54 EDT

Identified

We have identified the issue causing authentication slowness and failures to load the Duo Prompt and are beginning to see recovery.
We are now working toward full resolution.

Please check back here or subscribe to updates for any changes.

Posted Oct 04, 2023 - 17:39 EDT

Investigating

We are investigating an issue potentially impacting authentications on the DUO62 deployment and are working to correct it as soon as possible.

Please check back here or subscribe to updates for any changes.

Posted Oct 04, 2023 - 17:29 EDT

This incident affected: DUO62 (Core Authentication Service, Admin Panel, SSO).