Incident Report - 2019/02/25
Summary:
From 15:10 to 15:25 UTC on February 25th, 2019, the DUO39 deployment experienced performance degradation that resulted in increased authentication latency and intermittent request timeouts. Approximately 50% of authentications processed during this period were affected. The root cause of this incident has been identified, and Duo’s engineering team is committed to reducing the overall impact of similar events going forward.
Details:
The DUO39 application tier consists of multiple redundant application servers. At 15:09 UTC, one of the application servers stopped functioning unexpectedly. Duo’s automated monitoring systems detected this failure and initiated a recovery procedure.
Because of the way the application server failed, its network connections to the database tier remained open and were not forcibly closed. As a result, the database tier continued to hold locks on several database tables. As other requests were processed, they contended for these locked tables.
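The mechanism described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a representation of Duo's actual systems: an in-process lock stands in for a database table lock, and the names table_lock, handle_request, and simulate are hypothetical, as is the 0.05 second timeout.

```python
import threading

# Hypothetical stand-in for a lock held on a database table.
table_lock = threading.Lock()


def handle_request(timeout_seconds):
    """Try to take the table lock; report a timeout if it stays held."""
    if table_lock.acquire(timeout=timeout_seconds):
        try:
            return "ok"
        finally:
            table_lock.release()
    return "timeout"


def simulate():
    """Model a lock that stays held while its owner is down."""
    results = []

    # The failed server's connection stays open, so its lock is never released.
    table_lock.acquire()

    # Requests arriving in the meantime wait for the lock, then time out.
    results.append(handle_request(0.05))
    results.append(handle_request(0.05))

    # The failed server returns to service and the lock is released.
    table_lock.release()

    # Later requests proceed normally.
    results.append(handle_request(0.05))
    return results


if __name__ == "__main__":
    print(simulate())
```

In this model, requests fail only while the orphaned lock is held and succeed again as soon as it is released, which matches the sequence of events in this report.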
At 15:11 UTC, Duo’s application monitoring systems alerted the Duo Engineering team to this contention. At 15:15 UTC, Duo’s monitoring systems alerted the Duo Engineering team to increased authentication latency.
At 15:25 UTC, the previously failed application server returned to service, releasing the locks held at the database tier. At that point, the resource contention was resolved and normal operations resumed.