Authentication Failures on DUO1
Incident Report - 2023-08-21
Duo Engineering recreated the outage in a test environment and then successfully validated, under load conditions, a fix for the underlying root cause of the outage. The root cause was a specific database query pattern with a higher probability of deadlocks, which was exacerbated under heavy load. A combination of increased activity, deadlocks, and retries overwhelmed certain shards, causing DUO1 to become unresponsive. The performance test that reproduced this scenario will now be part of our non-functional testing to better gate future releases.
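This report does not include the specific query, so the following Python sketch is illustrative only (the DeadlockError class and execute_txn callable are assumed names, not Duo's code). It shows why deadlock-prone transactions plus retries can overwhelm a shard: immediate, unbounded retries re-submit the same contended work at full rate, while bounded retries with exponential backoff and jitter spread that retry pressure out.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a database driver's deadlock / lock-wait error."""


def run_transaction_with_backoff(execute_txn, max_attempts=4, base_delay=0.05):
    """Retry a deadlocked transaction a bounded number of times.

    Retrying immediately and without limit re-submits the same contended
    work and can overwhelm an already-loaded shard; bounded retries with
    exponential backoff and full jitter spread the retry load out instead.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return execute_txn()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # surface the failure rather than retrying forever
            # Full jitter: sleep a random time in [0, base_delay * 2^attempt).
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```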
From 9:03 EDT to 14:16 EDT on Aug 21, 2023, the DUO1 deployment experienced increased authentication latency that caused authentication failures for some customer applications protected by the Duo Service. At 9:03 AM, Duo’s automated monitoring systems detected elevated database latency that was not yet impacting service and alerted engineering team members. Within minutes, while engineering team members investigated and triaged related issues, latency increased to service-impacting levels. A detailed investigation by Duo Engineering identified the application layer as the bottleneck, and the team began horizontally scaling the application layer to absorb the increased load. The Duo team began to see recovery around 12:50 EDT, with the incident fully resolved by 16:01 EDT. Duo Engineering continues to monitor.
09:03 Duo Engineering received an alert indicating Authentication Latency on DUO1.
09:11 Duo Engineering received a Database Backlog Depth alert, indicating replication latency.
09:14 Duo Engineering started investigating, confirming latency reports via our monitoring tools.
09:22 Duo Engineering initiated a full incident response process, bringing in Customer Support (CS) and other stakeholders to assist with the investigation and communication.
09:28 Determined only the DUO1 deployment was affected; investigative efforts continued.
09:34 Status updated to Investigating.
09:36 Back pressure, an automated mechanism that allows our systems to recover safely, engaged and began rejecting requests.
09:40 Duo Engineering verified alerts that pointed to latency in the database layer, validating performance metrics and investigating any slow queries.
10:01 Duo Engineering also investigated the load balancer and application layer for performance insights.
10:26 Back pressure began to subside.
10:44 Multiple impacted customers identified, investigative efforts continued.
10:45 Options for easing load on the service were being considered, along with the potential impact of each option.
10:45 Additional alerts were received around our AzureAuth service. A separate investigation was spun up to ascertain whether these alerts were related.
11:16 Duo Engineering determined that the AzureAuth incident was caused by the current ongoing incident impacting DUO1.
11:45 All AzureAuth instances had self-healed.
12:11 Duo Engineering considered options for scaling the database layer, until the investigation pointed to the application layer as the bottleneck (at 12:24 EDT).
12:42 Duo Engineering began increasing application capacity by scaling horizontally.
12:54 Some customers began recovering as a result of capacity increases.
13:47 Duo Engineering continued to monitor after the rollout of additional capacity.
14:16 Additional capacity was in place and taking traffic. The issue was resolved at this point, and we entered the observation phase before reporting back to customers.
14:33 Duo Engineering requested that customers be notified of the update and asked to re-enable MFA if it had been turned off.
14:37 Status changed to Monitoring.
14:58 Duo Engineering identified additional preventative measures to balance load and prevent recurrence.
16:01 Status updated to Resolved.
Increased load on DUO1, due to significantly increased adoption and simultaneous peak usage across multiple larger customers, led to authentication failures. Our monitoring system picked up on latency metrics and alerted Duo Engineering. During this time our automated back-pressure system became active. This system tracks request volume and latency and, if latency is increasing, returns an HTTP 429 status code to clients, requesting a retry. This back-pressure mechanism is enabled on all deployments to allow our systems to recover safely and prevent them from being overwhelmed. The team’s investigation considered, evaluated, and ruled out multiple potential causes. This outage was not due to an attack; the root cause was a significant spike in traffic from a surge of overlapping peak usage, causing capacity issues at the application server tier. Duo Engineering responded by adding more application servers to the load balancer. As a result, the back-pressure began to release and our systems began processing authentications within normal response time ranges.
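As an illustration of the general back-pressure pattern only (not Duo's implementation; the threshold, window size, and Retry-After value below are assumptions), the sketch keeps a rolling window of observed request latencies and, once the average crosses a threshold, sheds new requests with HTTP 429 so clients retry later while the backend recovers.

```python
import time
from collections import deque


class BackPressureGate:
    """Minimal latency-based back-pressure gate (illustrative only)."""

    def __init__(self, latency_threshold_s=1.0, window=100):
        self.latency_threshold_s = latency_threshold_s
        self.samples = deque(maxlen=window)  # rolling window of recent latencies

    def should_shed(self):
        # Shed load once the average recent latency exceeds the threshold.
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.latency_threshold_s

    def handle(self, request_fn):
        if self.should_shed():
            # 429 Too Many Requests, with a hint for when the client should retry.
            return 429, {"Retry-After": "5"}, "server busy, please retry"
        start = time.monotonic()
        body = request_fn()
        self.samples.append(time.monotonic() - start)
        return 200, {}, body
```

In this sketch a front end would wrap each authentication call in gate.handle(...), so shedding starts and stops automatically as measured latency rises and falls.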
After service was restored, Duo Engineering continued to investigate and identified overlapping peak usage across multiple customers with similar usage patterns, leading to contention for shared resources.
Based on peak usage patterns observed during the outage, the team made additional capacity increases to avoid a recurrence. This was completed on Aug 21 at 04:09 EDT. Duo Engineering monitored DUO1 closely the following day, Aug 22nd, between 7:30 EDT and 12:00 EDT, and confirmed that these capacity levels handled the elevated volume experienced without triggering additional alerts.
Data collected during and after the incident will be used to inform near-term and future capacity decisions, to ensure that appropriate headroom is available and to improve load distribution with no downtime for customers. Careful capacity planning has been one of the pillars of Duo’s service management, and that will continue with lessons learned from this incident.
Duo Engineering is in the midst of a multi-quarter re-architecture of our services, which upon completion will enable our services to scale automatically in response to load and be more resilient. The team will also evaluate and enhance current load testing methodologies to better understand system performance and recovery under duress.
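As a purely illustrative example of the kind of check such load testing could gate on (request_fn, the concurrency, and the duration are assumptions, not Duo's tooling), the sketch below drives concurrent workers against a single request function for a fixed period and reports request count, error count, and p95 latency.

```python
import concurrent.futures
import statistics
import time


def run_load_test(request_fn, concurrency=50, duration_s=60):
    """Illustrative closed-loop load test; request_fn stands in for one authentication request."""
    deadline = time.monotonic() + duration_s

    def worker():
        # Each worker loops until the deadline, recording latencies and errors.
        latencies, errors = [], 0
        while time.monotonic() < deadline:
            start = time.monotonic()
            try:
                request_fn()
                latencies.append(time.monotonic() - start)
            except Exception:
                errors += 1
        return latencies, errors

    all_latencies, total_errors = [], 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        for future in [pool.submit(worker) for _ in range(concurrency)]:
            latencies, errors = future.result()
            all_latencies.extend(latencies)
            total_errors += errors

    # p95 is the 19th of 20 quantile cut points (needs at least two samples).
    p95 = statistics.quantiles(all_latencies, n=20)[18] if len(all_latencies) >= 2 else None
    return {"requests": len(all_latencies), "errors": total_errors, "p95_seconds": p95}
```

A release gate could then assert that p95 latency and error count stay within agreed bounds under the load profile observed during this incident.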
Continuing its post-incident analysis, Duo Engineering identified, on August 22nd at 01:56 AM, an excessive write to the database that, when eliminated, will ease table locking and improve shard performance.
Duo Engineering has identified a number of measures to be integrated into our response runbooks and is implementing additional automation to ensure easy access to them during an outage. Additional training is being added to our training program for current and new engineers. Both efforts are aimed at decreasing our response time.
At Duo Engineering, we are committed to learning and improving our service every day, as we go about the business of securing our customers. We will continue to study other observed symptoms and update this document with additional details as they become available.