On September 9th, 2022, at around 9:41 EST, Duo's Engineering Team was alerted by monitoring that one of the database shards on the DUO63 deployment was experiencing high latency and database errors. As a result, customers were unable to authenticate or access the Admin Panel. The root cause was identified as a hardware failure in the underlying Elastic Block Storage (EBS) volume.
The issue was resolved on the same day by an automated failover to a standby instance without human intervention.
2022-09-09 09:41 Duo Site Reliability Engineering (SRE) is informed by our internal monitoring tooling that DUO63 Shard2 is experiencing latency.
2022-09-09 09:46 The backpressure system engages for DUO63, throttling traffic due to the high latency.
2022-09-09 09:54 The investigation continues as additional errors are observed. Engineering confirms that impacted customers are failing to authenticate and log into the Admin Panel.
2022-09-09 09:56 The system automatically initiates a failover event to a standby database instance.
2022-09-09 09:57 The failover completes successfully.
2022-09-09 10:02 Customers report they are able to access the Admin Panel, and logs show authentications returning to normal levels.
2022-09-09 10:28 Status page updated to Monitoring.
2022-09-09 12:52 Status page updated to Resolved.
2022-09-13 14:36 Duo’s Cloud service provider confirms that an EBS volume error was the root cause of the RDS failover.
Duo SRE was notified that one of the database shards on DUO63 was experiencing high latency. While the engineering team huddled to investigate the alert, additional errors appeared and the backpressure system began throttling traffic. Soon after, the system automatically failed the database over to its synchronized replica. Within a few minutes of the failover, service began to return to normal.
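To illustrate the backpressure behavior described above, here is a minimal sketch of a latency-based throttle that sheds a growing fraction of traffic as observed database latency rises. The class name, thresholds, and windowing are hypothetical; this is a conceptual model, not Duo's implementation.

```python
from collections import deque


class BackpressureThrottle:
    """Illustrative latency-based throttle: rejects a growing fraction of
    requests as average observed database latency rises above a soft limit.
    All thresholds here are hypothetical examples."""

    def __init__(self, soft_limit_ms=200.0, hard_limit_ms=1000.0, window=100):
        self.soft_limit_ms = soft_limit_ms   # begin shedding load here
        self.hard_limit_ms = hard_limit_ms   # shed all traffic at this point
        self.samples = deque(maxlen=window)  # recent latency observations

    def record(self, latency_ms):
        """Record one observed database query latency."""
        self.samples.append(latency_ms)

    def shed_fraction(self):
        """Fraction of incoming traffic to reject, scaling linearly
        between the soft and hard latency limits."""
        if not self.samples:
            return 0.0
        avg = sum(self.samples) / len(self.samples)
        if avg <= self.soft_limit_ms:
            return 0.0
        if avg >= self.hard_limit_ms:
            return 1.0
        return (avg - self.soft_limit_ms) / (self.hard_limit_ms - self.soft_limit_ms)
```

Under normal latency the throttle passes everything; as latency climbs toward the hard limit, progressively more traffic is rejected, which protects the struggling shard from being overwhelmed while the failover machinery does its work.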
Further investigation by Duo's cloud service provider confirmed that the latency and subsequent failover were caused by a hardware error on the underlying EBS volume. The storage error on the primary RDS instance triggered a failover to the secondary, which restored service as part of the system's self-healing capability. Hardware failures that result in failovers are rare events.
The system self-healed as designed, without human intervention.
Our Duo deployments have multiple primary database shards for resiliency and redundancy. All customers on the single impacted shard in DUO63 were affected by this incident.
While the system behaved as expected, Duo is reviewing the thresholds that trigger an automated failover event to minimize customer impact. The expected outcome of this review is that the system will trigger a faster database failover in the event of high latency or errors.
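The threshold tuning described above can be pictured with a small sketch: a health monitor that promotes the standby after a run of consecutive unhealthy checks, where lowering the check count or the latency limit makes the failover fire sooner. The class, parameter names, and values are hypothetical illustrations, not Duo's or AWS's actual failover logic.

```python
class FailoverMonitor:
    """Illustrative health monitor: triggers a failover after N consecutive
    unhealthy checks of the primary database. Lowering `unhealthy_threshold`
    or `latency_limit_ms` (both hypothetical knobs) produces a faster
    failover at the cost of more sensitivity to transient blips."""

    def __init__(self, latency_limit_ms=500.0, unhealthy_threshold=3):
        self.latency_limit_ms = latency_limit_ms
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_unhealthy = 0
        self.failed_over = False

    def observe(self, latency_ms, had_error):
        """Feed one health-check result; returns True when failover fires."""
        if had_error or latency_ms > self.latency_limit_ms:
            self.consecutive_unhealthy += 1
        else:
            self.consecutive_unhealthy = 0  # healthy check resets the streak
        if (not self.failed_over
                and self.consecutive_unhealthy >= self.unhealthy_threshold):
            self.failed_over = True  # a real system would promote the replica here
            return True
        return False
```

With `unhealthy_threshold=3` and 10-second checks, roughly 30 seconds of sustained errors elapse before failover; dropping the threshold to 2 would shave a check interval off the customer-facing outage, which is the kind of trade-off the review above weighs.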
In addition, prompt incident identification and communication are primary areas of focus. We know service availability is vital to our customers, and prompt communication helps our customers make informed decisions. We will review our process for updating the Status Page and improve the speed with which we provide updates.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.