From 16:15 to 17:56 UTC on December 16, 2017, the following Duo deployments experienced intermittent authentication timeouts affecting up to 5% of all requests: DUO8, DUO10, DUO21, DUO28, DUO39, DUO40, DUO49, DUO52.
Duo relies on many premier cloud partners as part of our SaaS platform, including Amazon Web Services (AWS). As part of normal operations, Amazon regularly performs maintenance on the underlying cloud infrastructure that powers the Duo service. This maintenance is expected to have no customer impact because of the multiple levels of redundancy built into the Duo platform, and this expectation has been validated numerous times over a multi-year period.
On December 16 at 16:00 UTC, AWS performed maintenance on the service that powers one of the database tiers of the Duo platform. Duo’s automated monitoring system alerted our engineering team to the beginning of this maintenance, and the team began monitoring to ensure that the maintenance completed successfully. During this window, the team observed numerous successful automated failovers between database instances, as expected. At 17:00 UTC, the team was alerted to intermittent authentication failures affecting between 1% and 5% of all requests, depending on the affected deployment.
After additional investigation, Duo’s engineering team identified the root cause as sporadic error responses from the AWS database service starting at 16:15 UTC. These errors were likely the result of maintenance being conducted against both primary and secondary failover instances, resulting in a double-failover scenario. A rolling service restart of the affected deployments was performed at 17:21 UTC to resolve these intermittent errors, and service was confirmed to be stable for all deployments except DUO21 at that time. Additional troubleshooting was performed against DUO21, and service was confirmed to be stable for this deployment at 17:56 UTC.
The Duo team will use data collected during this incident to inform future infrastructure-related decisions regarding platform resilience. Specifically, we intend to determine why this unique double-failover situation left a small percentage of connections in a failed state, and to continue putting systems and processes in place to ensure that all Duo services are resilient to failure modes of any kind. In the short term, we will work with AWS to ensure that maintenance affecting both the primary and secondary failover instances is not conducted in the same window.