Incident Report - 2023-08-21
On August 21, 2023, at 10:43 EDT, Duo's observability monitors alerted that the azureauth-us Load Balancer (LB) was unhealthy. The azureauth-us LB is a service that supports all US deployments in AWS regions us-west-1 (UW1) and us-west-2 (UW2). The instances became unhealthy as a result of a parallel outage on DUO1: they experienced memory saturation, prompting the out of memory killer (OOM Killer), an automated mechanism, to intervene by restarting the LB instances in order to release memory and restore normal service. After each LB instance restarted, azureauth-us returned to a healthy state.
Impacted deployments: DUO1, DUO2, DUO4, DUO5, DUO6, DUO7, DUO9, DUO10, DUO12, DUO13, DUO14, DUO15, DUO16, DUO17, DUO18, DUO19, DUO20, DUO21, DUO22, DUO23, DUO24, DUO28, DUO31, DUO32, DUO33, DUO35, DUO36, DUO37, DUO39, DUO40, DUO41, DUO42, DUO49, DUO50, DUO55, DUO56, DUO58, DUO60, DUO62, DUO63, DUO64, DUO65
*While the deployment list is long, the number of customers impacted was 133.
10:43 Duo Site Reliability Engineering (SRE) monitoring received an alert that the azureauth-us LB is unhealthy.
10:54 The SRE on-call engineer acknowledges the alert; the LB shows all 9 instances as OutOfService.
11:05 SRE observes that 2 LB instances have recovered.
11:10 SRE initiates Incident Response, bringing in Customer Support (CS).
11:16 SRE correlates the current incident as an effect of a previous, ongoing incident impacting DUO1.
11:16 SRE observes that instances are running out of memory (OOM), causing the OOM Killer to kick in and restart the failing instances.
11:21 SRE notifies CS that UW1 and UW2 are both impacted.
11:23 SRE escalates to the appropriate Engineering Team.
11:35 Status updated to investigating.
11:45 All instances have self-healed.
11:48 Status changed to monitoring.
11:56 Customer acknowledges recovery.
12:01 Status moved to resolved.
The AzureAuth Service is an AWS Multi-AZ deployment in the US region that sits behind a load balancer. AzureAuth Services halted in the us-west region as a side effect of the DUO1 outage. The AzureAuth LB instances became saturated as each one ran out of memory because of latent responses and a 60-second timeout on requests from Duo's Main Service. The timeout for requests from the AuthServ Service to the Duo Main Service was set too high for the volume of requests and the memory each pending request consumes. Requests in the AuthServ Service degraded while waiting on responses until the instances ran out of memory, at which point the self-healing OOM Killer shut down and restarted the service, bringing the LB cluster back to health.
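As a rough illustration of why a long timeout can saturate memory, the Python sketch below applies Little's Law (in-flight requests ≈ arrival rate × time each request is held). The request rate and per-request memory figures are illustrative assumptions, not numbers taken from the incident.

    # Back-of-envelope sketch: how long requests are held drives how many are
    # in flight at once, and therefore how much memory each LB instance holds.
    REQUESTS_PER_SECOND = 200     # hypothetical arrival rate
    MEMORY_PER_REQUEST_MB = 2     # hypothetical memory held per pending request

    for timeout_s in (60.0, 0.25):
        in_flight = REQUESTS_PER_SECOND * timeout_s
        memory_mb = in_flight * MEMORY_PER_REQUEST_MB
        print(f"timeout={timeout_s:>5}s  in-flight={in_flight:>7.0f}  memory={memory_mb:>8.0f} MB")

With these assumed numbers, a request held until a 60-second timeout ties up roughly 240 times more memory than one that completes in 250 milliseconds, which is the reasoning behind lowering the timeout.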
Duo Engineering is making changes to limit the impact of an unhealthy deployment on the AzureAuth service, allowing traffic destined for healthy deployments to continue with little or no impact. The most immediate change is to lower the request timeout significantly: today nearly all responses return within 250 milliseconds, so the timeout can be set far below the current 60 seconds. Additionally, Duo is looking at how to return failing or unhealthy LB instances to a healthy state more quickly by closely monitoring memory and preemptively restarting the service. This is what the OOM Killer already does, but acting proactively should recover instances faster and keep the cluster healthy.
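A minimal sketch of the proactive-restart idea is shown below. The unit name "azureauth", the memory threshold, and the use of psutil are all illustrative assumptions, not Duo's actual tooling.

    import subprocess
    import time

    import psutil  # assumption: psutil is available for memory metrics

    MEMORY_THRESHOLD_PCT = 85   # hypothetical: restart before the OOM Killer would fire
    CHECK_INTERVAL_S = 10
    SERVICE_NAME = "azureauth"  # hypothetical systemd unit name

    def watch_memory() -> None:
        while True:
            if psutil.virtual_memory().percent >= MEMORY_THRESHOLD_PCT:
                # Preemptive restart: reclaim memory before the kernel OOM Killer
                # has to intervene, shortening the unhealthy window.
                subprocess.run(["systemctl", "restart", SERVICE_NAME], check=False)
                time.sleep(60)  # allow the service to come back before re-checking
            else:
                time.sleep(CHECK_INTERVAL_S)

    if __name__ == "__main__":
        watch_memory()

In practice a watchdog like this would run alongside the LB health checks so an instance restarts and rejoins the pool before it is marked OutOfService.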
For longer-term projects, Duo will look into a circuit-breaker or backpressure pattern to shed load and allow the service to fail safely. This is a more advanced capability that supersedes managing low timeouts.
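For context, a bare-bones sketch of the circuit-breaker pattern follows; the thresholds and state names are generic assumptions and do not describe Duo's eventual implementation. The idea is to fail fast once the upstream is clearly unhealthy instead of letting requests queue until memory is exhausted.

    import time

    class CircuitBreaker:
        """Generic circuit-breaker sketch; thresholds are illustrative assumptions."""

        def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout_s = reset_timeout_s
            self.failures = 0
            self.opened_at = 0.0
            self.state = "closed"

        def call(self, fn, *args, **kwargs):
            if self.state == "open":
                if time.monotonic() - self.opened_at < self.reset_timeout_s:
                    # Fail fast: do not queue more work against an unhealthy upstream.
                    raise RuntimeError("circuit open")
                self.state = "half-open"  # let one probe request through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.state == "half-open" or self.failures >= self.failure_threshold:
                    self.state = "open"
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            self.state = "closed"
            return result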
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.