Incident Report - 2023-08-21
On August 21, 2023, at 10:43 EDT, Duo's observability monitors alerted that the azureauth-us Load Balancer (LB) was unhealthy. The azureauth-us LB is a service that supports all US deployments in AWS regions us-west-1 (UW1) and us-west-2 (UW2). The instances became unhealthy as a result of a parallel outage on DUO1: they experienced memory saturation, prompting the out of memory killer (OOM Killer), an automated mechanism, to intervene by restarting the LB instances in order to release memory and restore normal service. After each LB instance restarted, azureauth-us returned to a healthy state.
Impacted deployments: DUO1, DUO2, DUO4, DUO5, DUO6, DUO7, DUO9, DUO10, DUO12, DUO13, DUO14, DUO15, DUO16, DUO17, DUO18, DUO19, DUO20, DUO21, DUO22, DUO23, DUO24, DUO28, DUO31, DUO32, DUO33, DUO35, DUO36, DUO37, DUO39, DUO40, DUO41, DUO42, DUO49, DUO50, DUO55, DUO56, DUO58, DUO60, DUO62, DUO63, DUO64, DUO65
*While the deployment list is long, the number of customers impacted was 133.
10:43 Duo Site Reliability Engineering (SRE) monitoring received an alert that the azureauth-us LB is unhealthy.
10:54 The SRE on-call engineer acknowledges the alert; the LB shows all 9 instances as OutOfService.
11:05 SRE observes that 2 LB instances have recovered.
11:10 SRE initiates Incident Response, bringing in Customer Support (CS).
11:16 SRE correlates the current incident as an effect of a previous, ongoing incident impacting DUO1.
11:16 SRE observes that instances are running out of memory (OOM), causing the OOM Killer to kick in and restart the failing instances.
11:21 SRE notifies CS that UW1 and UW2 are both impacted.
11:23 SRE escalates to the appropriate Engineering Team.
11:35 Status updated to investigating.
11:45 All instances have self-healed.
11:48 Status changed to monitoring.
11:56 Customer acknowledges recovery.
12:01 Status moved to resolved.
The AzureAuth Service is an AWS Multi-AZ deployment in the US region that sits behind a load balancer. AzureAuth Services halted in the us-west region as a side effect of the DUO1 outage. The AzureAuth LB instances became saturated as each one ran out of memory because of latent responses and a 60-second timeout on requests from Duo's Main Service. The timeout for requests from the AuthServ Service to the Duo Main Service was set too high for the volume of requests and the memory each pending request consumes. Requests in the AuthServ Service degraded while waiting on responses until the instances ran out of memory, at which point the self-healing OOM Killer shut down and restarted the service, bringing the LB cluster back to health.
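As a rough illustration of why a long timeout can saturate memory, the Python sketch below applies Little's Law (in-flight requests ≈ arrival rate × time each request is held). The request rate and per-request memory figures are illustrative assumptions, not numbers taken from the incident.

    # Back-of-envelope sketch: how long requests are held drives how many are
    # in flight at once, and therefore how much memory each LB instance holds.
    REQUESTS_PER_SECOND = 200     # hypothetical arrival rate
    MEMORY_PER_REQUEST_MB = 2     # hypothetical memory held per pending request

    for timeout_s in (60.0, 0.25):
        in_flight = REQUESTS_PER_SECOND * timeout_s
        memory_mb = in_flight * MEMORY_PER_REQUEST_MB
        print(f"timeout={timeout_s:>5}s  in-flight={in_flight:>7.0f}  memory={memory_mb:>8.0f} MB")

With these assumed numbers, a request held until a 60-second timeout ties up roughly 240 times more memory than one that completes in 250 milliseconds, which is the reasoning behind lowering the timeout.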
Duo Engineering is making changes to limit the impact of an unhealthy deployment on the AzureAuth service, allowing traffic destined for healthy deployments to continue with little or no impact. The most immediate change is to lower the request timeout significantly: today nearly all responses return within 250 milliseconds, so the timeout can be set far below the current 60 seconds. Additionally, Duo is looking at how to return failing or unhealthy LB instances to a healthy state more quickly by closely monitoring memory and preemptively restarting the service. This is what the OOM Killer already does, but acting proactively should recover instances faster and keep the cluster healthy.
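A minimal sketch of the proactive-restart idea is shown below. The unit name "azureauth", the memory threshold, and the use of psutil are all illustrative assumptions, not Duo's actual tooling.

    import subprocess
    import time

    import psutil  # assumption: psutil is available for memory metrics

    MEMORY_THRESHOLD_PCT = 85   # hypothetical: restart before the OOM Killer would fire
    CHECK_INTERVAL_S = 10
    SERVICE_NAME = "azureauth"  # hypothetical systemd unit name

    def watch_memory() -> None:
        while True:
            if psutil.virtual_memory().percent >= MEMORY_THRESHOLD_PCT:
                # Preemptive restart: reclaim memory before the kernel OOM Killer
                # has to intervene, shortening the unhealthy window.
                subprocess.run(["systemctl", "restart", SERVICE_NAME], check=False)
                time.sleep(60)  # allow the service to come back before re-checking
            else:
                time.sleep(CHECK_INTERVAL_S)

    if __name__ == "__main__":
        watch_memory()

In practice a watchdog like this would run alongside the LB health checks so an instance restarts and rejoins the pool before it is marked OutOfService.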
For longer-term projects, Duo will look into a circuit-breaker or backpressure pattern to shed load and allow the service to fail safely. This is a more advanced capability that supersedes managing low timeouts.
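For context, a bare-bones sketch of the circuit-breaker pattern follows; the thresholds and state names are generic assumptions and do not describe Duo's eventual implementation. The idea is to fail fast once the upstream is clearly unhealthy instead of letting requests queue until memory is exhausted.

    import time

    class CircuitBreaker:
        """Generic circuit-breaker sketch; thresholds are illustrative assumptions."""

        def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout_s = reset_timeout_s
            self.failures = 0
            self.opened_at = 0.0
            self.state = "closed"

        def call(self, fn, *args, **kwargs):
            if self.state == "open":
                if time.monotonic() - self.opened_at < self.reset_timeout_s:
                    # Fail fast: do not queue more work against an unhealthy upstream.
                    raise RuntimeError("circuit open")
                self.state = "half-open"  # let one probe request through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.state == "half-open" or self.failures >= self.failure_threshold:
                    self.state = "open"
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            self.state = "closed"
            return result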
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.