On February 9, 2023 at 12:00 EST, monitoring alerted Duo's Engineering Team that one of the database servers for the DUO55 deployment was responding very slowly. Root cause analysis attributed the slowdown to a temporary inability to write to a database volume at a third-party vendor.
The third-party vendor’s automated response identified and resolved the root cause by February 9 at 12:31 EST.
2023-02-09 12:00 Duo Site Reliability Engineering (SRE) receives an alert for queries in DUO55 having extended completion times. SRE begins triage.
2023-02-09 12:15 Verified that no maintenance or code changes are in progress.
2023-02-09 12:24 Underlying database volume is fully inaccessible. Database write latency spikes. Logging is unavailable.
2023-02-09 12:26 Determined that Single Sign On was failing because downstream services were not responding.
2023-02-09 12:26 Transaction failures seen in logs.
2023-02-09 12:31 Transactions begin to clear.
2023-02-09 12:39 Confirmed with multiple customers that Push Authentication is available again.
2023-02-09 12:40 Database accessibility resolves without user intervention.
2023-02-09 12:46 Status Page set to Identified.
2023-02-09 12:47 Investigation leaning toward database connectivity or degradation.
2023-02-09 12:53 Continued investigation found that database latency spiked at 11:56.
2023-02-09 13:00 Metrics have returned to normal. Environment deemed Healthy.
2023-02-09 13:02 Final alerts have been cleared.
2023-02-09 13:04 Status Page set to Monitoring.
2023-02-09 13:14 Root Cause identified.
2023-02-09 13:30 Status Page set to Resolved.
2023-02-16 12:32 Received the RCA from our cloud provider. It attributes the incident to network congestion and associated packet loss, caused by a defect in the provider's automated monitoring system that allowed a single network interface to receive more traffic than it could process.
On Thursday, February 9, 2023 at 12:00 EST, Duo Security experienced an interruption to our ability to write to the database volume. The interruption created a backlog of database writes, which in turn caused queued requests to accumulate. As the request queue deepened, authorization acknowledgments were delayed, so Duo Push authentication requests went unanswered and eventually failed, affecting end users attempting to sign in to Duo-protected applications.
Duo's automated queue management system alerted Duo engineers. While the engineers were triaging the issue, they observed a reduction in server load that allowed the server to process queued requests. Push authentications still awaiting a response were completed, and the system cleared the request queue and the backlog of database writes.
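The failure mode described above (stalled writes, a growing request queue, then timeouts) can be illustrated with a small discrete-time simulation. This is only a sketch under assumed numbers: the function name `simulate`, the arrival and service rates, the stall window, and the timeout are all hypothetical and are not taken from Duo's systems.

```python
from collections import deque


def simulate(ticks, arrivals_per_tick, service_per_tick,
             stall_start, stall_end, timeout_ticks):
    """Return (max_queue_depth, completed, timed_out) for a simple FIFO queue.

    All parameters are illustrative assumptions, not measured values.
    """
    queue = deque()  # each entry is the tick at which the request arrived
    completed = 0
    timed_out = 0
    max_depth = 0
    for tick in range(ticks):
        # New authentication requests arrive on every tick.
        queue.extend([tick] * arrivals_per_tick)
        # While the volume cannot be written, no request can be served.
        stalled = stall_start <= tick < stall_end
        capacity = 0 if stalled else service_per_tick
        # Requests that have waited too long fail, like an unanswered push.
        while queue and tick - queue[0] >= timeout_ticks:
            queue.popleft()
            timed_out += 1
        # Serve the oldest waiting requests with whatever capacity exists.
        for _ in range(min(capacity, len(queue))):
            queue.popleft()
            completed += 1
        max_depth = max(max_depth, len(queue))
    return max_depth, completed, timed_out


# Hypothetical scenario: a five-tick write stall in a twenty-tick window.
with_stall = simulate(20, 2, 3, stall_start=5, stall_end=10, timeout_ticks=4)
# Same load with no stall at all, for comparison.
without_stall = simulate(20, 2, 3, stall_start=0, stall_end=0, timeout_ticks=4)
```

In this toy model the queue only deepens while the stall lasts, and spare service capacity drains it once writes resume, which mirrors the recovery the engineers observed when server load dropped.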
This incident caused a 37% increase in Duo Push authentication failures compared with the same time period on the previous day.
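For readers who want to see how a figure of this kind is derived, the sketch below computes a relative increase against a baseline. The failure counts are hypothetical placeholders chosen only to reproduce 37%; the report does not publish the underlying counts.

```python
def percent_increase(baseline, observed):
    """Relative change of `observed` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) * 100 / baseline


# Hypothetical counts, NOT figures from the incident:
# 200 failures in the baseline window, 274 in the incident window.
increase = percent_increase(200, 274)
```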
Duo requested a documented RCA from the third-party vendor. The vendor provided the RCA on February 16, 2023. It indicated network congestion and associated packet loss caused by a defect in the vendor's automated monitoring system that allowed a single network interface to receive more traffic than it could process. This is consistent with what we observed: database unavailability caused the request queue to grow and delayed responses to authentication requests. Duo Engineering is actively working on roadmap improvements that add streaming and caching mechanisms, allowing eventually consistent reads so that data can still be read when the database is not reachable for writes.
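The report does not describe how the planned streaming and caching mechanisms will work. As a general illustration of the idea only, the sketch below shows one common pattern: a read path that falls back to a recently cached value when the backing store cannot be reached. The names `StaleTolerantCache`, `DatabaseUnavailable`, and the `fetch` callback are hypothetical, and this is not Duo's design.

```python
import time


class DatabaseUnavailable(Exception):
    """Raised by the hypothetical backing store when it cannot be reached."""


class StaleTolerantCache:
    """Serve fresh reads when possible and bounded-stale reads during outages."""

    def __init__(self, fetch, max_staleness_seconds, clock=time.monotonic):
        self._fetch = fetch                  # callable: key -> value, may raise
        self._max_staleness = max_staleness_seconds
        self._clock = clock                  # injectable so tests can control time
        self._entries = {}                   # key -> (value, stored_at)

    def get(self, key):
        try:
            value = self._fetch(key)
        except DatabaseUnavailable:
            # Fall back to the cached copy if it is still within the bound.
            if key in self._entries:
                cached, stored_at = self._entries[key]
                if self._clock() - stored_at <= self._max_staleness:
                    return cached
            raise  # nothing usable cached: surface the outage to the caller
        # Successful read: refresh the cached copy and return the fresh value.
        self._entries[key] = (value, self._clock())
        return value
```

The staleness bound is the design trade-off here: a longer bound keeps reads available through longer outages, at the cost of serving older data.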
Note: You can find your Duo deployment ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.