Degraded availability on DUO42
Incident Report for Duo Security
Postmortem

From 16:47 to 17:11 UTC on January 17, 2018, the DUO42 deployment experienced backend datastore availability issues that resulted in an authentication outage for all customer applications protected by the Duo service.

Duo utilizes many premier cloud partners as part of our SaaS platform, including Amazon AWS. As part of normal operations, Amazon regularly performs maintenance to the underlying cloud infrastructure that powers the Duo service. This maintenance is expected to be non-impactful due to the multiple levels of redundancy built into the Duo platform, and has been validated as such numerous times over a multi-year period.

On January 17 at 16:47 UTC, AWS started a maintenance operation on the service that powers one of the data storage tiers used by the Duo platform. This type of maintenance includes an automated failover to a standby data storage node, which the Duo platform is designed to handle gracefully. In this case, the failover did not fully complete, and as a result the Duo platform was unable to properly interact with the datastore.

Duo’s automated monitoring systems alerted our engineering team to the initial signs of this failover event occurring, and the team began monitoring to ensure that the operation completed successfully. At 16:54 UTC, the team identified the incomplete state of the failover event and began taking action to reroute datastore connectivity to an alternate storage node. At 17:11 UTC, this rerouting operation was complete and service was confirmed to be stabilized.

After additional investigation, Duo’s engineering team, in partnership with contacts at AWS, identified the root cause as a service failure on the AWS side that affected a small number of data storage nodes, including those used by the DUO42 deployment.

The Duo team will use data collected during this incident to inform future infrastructure-related decisions regarding platform resilience. We are also leveraging our partnership with AWS to better understand failure modes in their service offerings, and will update operational processes and automation to reflect the information gathered.

Posted Jan 23, 2018 - 17:29 EST

Resolved
DUO42 is fully operational.
Posted Jan 17, 2018 - 12:47 EST
Monitoring
All service on DUO42 has returned to normal operation and is being actively monitored.
Posted Jan 17, 2018 - 12:21 EST
Investigating
We are currently investigating an elevated error rate on DUO42.
Posted Jan 17, 2018 - 12:10 EST
This incident affected: DUO42 (Core Authentication Service, Admin Panel, Push Delivery, Phone Call Delivery, SMS Message Delivery, Cloud PKI).