On September 9th, 2022, at around 9:41 EST, Duo's Engineering Team was alerted by monitoring that one of the database shards on the DUO63 deployment was experiencing high latency and database errors. As a result, customers were unable to authenticate or access the Admin Panel. The root cause was identified as a hardware failure in the underlying Elastic Block Storage (EBS) volume.
The issue was resolved on the same day by an automated failover to a standby instance without human intervention.
2022-09-09 09:41 Duo Site Reliability Engineering (SRE) is informed by our internal monitoring tooling that DUO63 Shard2 is experiencing latency.
2022-09-09 09:46 The backpressure system engages for DUO63, throttling traffic due to the high latency.
2022-09-09 09:54 The investigation continues as additional errors are observed. Engineering confirms that impacted customers are failing to authenticate and log into the Admin Panel.
2022-09-09 09:56 The system automatically initiates a failover event to a standby database instance.
2022-09-09 09:57 The failover completes successfully.
2022-09-09 10:02 Customers report they are able to access the Admin Panel, and logs show authentications returning to normal levels.
2022-09-09 10:28 Status page updated to Monitoring.
2022-09-09 12:52 Status page updated to Resolved.
2022-09-13 14:36 Duo’s Cloud service provider confirms that an EBS volume error was the root cause of the RDS failover.
Duo SRE was notified that one of the database shards on DUO63 was experiencing high latency. While the engineering team huddled to investigate the alert, additional errors appeared and the backpressure system began throttling traffic. Soon after, the system automatically failed the database over to its synchronized replica. Within a few minutes of the failover, service began to return to normal.
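To illustrate the backpressure behavior described above, here is a minimal sketch of a latency-based throttle that sheds a growing fraction of traffic as observed database latency rises. The class name, thresholds, and windowing are hypothetical; this is a conceptual model, not Duo's implementation.

```python
from collections import deque


class BackpressureThrottle:
    """Illustrative latency-based throttle: rejects a growing fraction of
    requests as average observed database latency rises above a soft limit.
    All thresholds here are hypothetical examples."""

    def __init__(self, soft_limit_ms=200.0, hard_limit_ms=1000.0, window=100):
        self.soft_limit_ms = soft_limit_ms   # begin shedding load here
        self.hard_limit_ms = hard_limit_ms   # shed all traffic at this point
        self.samples = deque(maxlen=window)  # recent latency observations

    def record(self, latency_ms):
        """Record one observed database query latency."""
        self.samples.append(latency_ms)

    def shed_fraction(self):
        """Fraction of incoming traffic to reject, scaling linearly
        between the soft and hard latency limits."""
        if not self.samples:
            return 0.0
        avg = sum(self.samples) / len(self.samples)
        if avg <= self.soft_limit_ms:
            return 0.0
        if avg >= self.hard_limit_ms:
            return 1.0
        return (avg - self.soft_limit_ms) / (self.hard_limit_ms - self.soft_limit_ms)
```

Under normal latency the throttle passes everything; as latency climbs toward the hard limit, progressively more traffic is rejected, which protects the struggling shard from being overwhelmed while the failover machinery does its work.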
Further investigation by Duo's cloud service provider confirmed that the latency and subsequent failover were caused by a hardware error on the underlying EBS volume. The storage error on the primary RDS instance triggered a failover to the secondary, which restored service as part of the system's self-healing capability. Hardware failures that result in failovers are rare events.
The system self-healed as designed, without human intervention.
Our Duo deployments have multiple primary database shards for resiliency and redundancy. All customers on the single impacted shard in DUO63 were affected by this incident.
While the system behaved as expected, Duo is reviewing the thresholds that trigger an automated failover event to minimize customer impact. The expected outcome of this review is that the system will trigger a faster database failover in the event of high latency or errors.
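The threshold tuning described above can be pictured with a small sketch: a health monitor that promotes the standby after a run of consecutive unhealthy checks, where lowering the check count or the latency limit makes the failover fire sooner. The class, parameter names, and values are hypothetical illustrations, not Duo's or AWS's actual failover logic.

```python
class FailoverMonitor:
    """Illustrative health monitor: triggers a failover after N consecutive
    unhealthy checks of the primary database. Lowering `unhealthy_threshold`
    or `latency_limit_ms` (both hypothetical knobs) produces a faster
    failover at the cost of more sensitivity to transient blips."""

    def __init__(self, latency_limit_ms=500.0, unhealthy_threshold=3):
        self.latency_limit_ms = latency_limit_ms
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_unhealthy = 0
        self.failed_over = False

    def observe(self, latency_ms, had_error):
        """Feed one health-check result; returns True when failover fires."""
        if had_error or latency_ms > self.latency_limit_ms:
            self.consecutive_unhealthy += 1
        else:
            self.consecutive_unhealthy = 0  # healthy check resets the streak
        if (not self.failed_over
                and self.consecutive_unhealthy >= self.unhealthy_threshold):
            self.failed_over = True  # a real system would promote the replica here
            return True
        return False
```

With `unhealthy_threshold=3` and 10-second checks, roughly 30 seconds of sustained errors elapse before failover; dropping the threshold to 2 would shave a check interval off the customer-facing outage, which is the kind of trade-off the review above weighs.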
In addition, prompt incident identification and communication are primary areas of focus. We know service availability is vital to our customers, and prompt communication helps our customers make informed decisions. We will review our process for updating the Status Page and improve the speed with which we provide updates.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.