On October 4, 2023, at around 16:55 EST, Duo's Engineering Team was alerted by monitoring that one of the database shards on the DUO62 deployment was experiencing high latency and database errors. As a result, customers were unable to authenticate or access the Admin Panel. The root cause was identified as an underlying hardware failure at our cloud vendor.
The issue was resolved on the same day by a failover to a standby instance.
16:55 Duo Site Reliability Engineering (SRE) is notified by our internal monitoring that DUO62 is experiencing latency and begins to investigate.
17:02 Engineering and Support confirm that impacted customers are unable to authenticate or log into the Admin Panel.
17:09 Instance failover initiated by automated monitoring.
17:11 SRE is informed of instance failover initiation.
17:17 Instance failover is complete.
17:18 Support confirms customer impact is mitigated.
Duo SRE was notified that one of the database shards on DUO62 was experiencing high latency. During the investigation, the system automatically failed the database over to its synchronized replica. Within seven minutes of the failover, the system began to return to normal.
Further investigation by Duo confirmed that the cause of the latency and subsequent failover was an underlying hardware error. The storage error on the primary database server triggered a failover to the secondary, which restored service as part of the system's self-healing capability.
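To illustrate the general pattern described above, the following is a minimal sketch of health-check-driven failover from a primary database to a synchronized replica. All names, thresholds, and classes here are hypothetical and for illustration only; they do not reflect Duo's actual monitoring or failover implementation.

```python
# Hypothetical sketch: a monitor polls a primary database's latency and,
# after a run of unhealthy checks, promotes the synchronized replica.
# Thresholds and class names are illustrative assumptions.

LATENCY_THRESHOLD_MS = 500   # hypothetical alerting threshold
FAILURE_WINDOW = 3           # consecutive unhealthy checks before failover

class DatabaseShard:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def check_latency_ms(self):
        # In production this would issue a probe query; here we simulate
        # a degraded primary returning high latency.
        return 50 if self.healthy else 2000

class FailoverMonitor:
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.consecutive_failures = 0

    def poll(self):
        """Run one health check; fail over after FAILURE_WINDOW bad checks."""
        if self.primary.check_latency_ms() > LATENCY_THRESHOLD_MS:
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0
        if self.consecutive_failures >= FAILURE_WINDOW:
            # Promote the synchronized replica to primary.
            self.primary, self.replica = self.replica, self.primary
            self.consecutive_failures = 0
            return True  # failover occurred
        return False

primary = DatabaseShard("db-primary", healthy=False)  # simulate hardware fault
replica = DatabaseShard("db-replica")
monitor = FailoverMonitor(primary, replica)

events = [monitor.poll() for _ in range(3)]
print(events)                  # third poll triggers the failover
print(monitor.primary.name)    # replica is now serving traffic
```

Requiring several consecutive unhealthy checks before promoting the replica trades a few seconds of detection time for protection against failing over on a transient latency spike.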
Duo will continue to look for opportunities to minimize the impact of hardware failures. After the incident, we improved our monitoring to detect hardware failures earlier, enabling faster failover initiation. To further decouple the authentication path from the persistence/database layer, Duo is completing work on an Authentication Path Resiliency project. This streaming-based solution will reduce the impact on the authentication path during periods of database contention.