All Deployments: User authentications and Admin Panel access impacted due to connectivity issues within AWS regions.

Incident Report for Duo

Postmortem

AWS Outage - All Deployments

Summary

From 15:20 UTC to 16:42 UTC on December 15, 2021, customers on US-hosted Duo deployments experienced issues accessing the Duo service. During this window, the majority of Duo services were intermittently inaccessible to affected customers, impacting user authentications and Admin Panel access.

This was the result of networking issues at one of our cloud infrastructure providers (AWS) that affected the us-west-1, us-west-2, and us-gov-west-1 AWS regions, which Duo leverages.

Confirmed affected Duo deployments:

DUO1, DUO2, DUO4, DUO5, DUO6, DUO7, DUO9, DUO10, DUO12, DUO13, DUO14, DUO15, DUO16, DUO17, DUO18, DUO19, DUO20, DUO21, DUO22, DUO23, DUO24, DUO26, DUO28, DUO31, DUO32, DUO33, DUO34, DUO35, DUO36, DUO37, DUO39, DUO40, DUO41, DUO42, DUO43, DUO44, DUO45, DUO46, DUO49, DUO50, DUO51, DUO52, DUO55, DUO56, DUO58, DUO59, DUO60, DUO61, DUO62, DUO63, DUO64, DUO65

Timeline 2021-12-15:

15:20 UTC - Duo Engineering receives availability alerts from production deployments and begins investigating.

15:22 UTC - Duo Engineering identifies that multiple US-based Duo deployments are intermittently unreachable and experiences difficulty in accessing other AWS-hosted cloud services.

15:30 UTC - Duo Engineering confirms that multiple AWS regions are affected in their entirety and that the impact is not limited to Duo, and begins incident response.

15:32 UTC - Duo Engineering continues assessing impact while experiencing difficulty accessing some of the systems we rely upon for understanding service health.

15:48 UTC - Duo Engineering confirms that all US-based Duo deployments are affected.

15:51 UTC - Duo Engineering continues monitoring for signs of recovery.

15:57 UTC - Duo Engineering observes that authentication traffic is starting to normalize.

16:11 UTC - Duo posts the initial Status Page update for the incident.

16:15 UTC - Duo Engineering confirms that the majority of Duo services are showing signs of recovery.

16:27 UTC - Duo updates the Status Page incident to Monitoring.

16:30 UTC - Duo Engineering identifies ongoing issues with Azure Conditional Access, on-premises Active Directory Sync, and Duo Single Sign-On and continues investigating.

16:42 UTC - Duo Engineering confirms Azure Conditional Access services are fully restored.

22:42 UTC - Duo updates incident status on status.duo.com to Resolved with additional information on verifying Duo Authentication Proxy service connectivity.

Details

Duo utilizes many premier cloud partners as part of our SaaS platform, including Amazon AWS. Per Amazon’s public status page (https://status.aws.amazon.com/), AWS experienced network issues specific to the us-west-1, us-west-2, and us-gov-west-1 AWS regions on 2021-12-15. This issue affected connectivity to infrastructure hosted within the affected regions. Below is AWS’s summary of the incident:

"Between 7:14 AM PST (15:14 PM UTC) and 7:59 AM PST (15:59 PM UTC), customers experienced elevated network packet loss that impacted connectivity to a subset of Internet destinations. Traffic within AWS Regions, between AWS Regions, and to other destinations on the Internet was not impacted. The issue was caused by network congestion between parts of the AWS Backbone and a subset of Internet Service Providers, which was triggered by AWS traffic engineering, executed in response to congestion outside of our network. This traffic engineering incorrectly moved more traffic than expected to parts of the AWS Backbone that affected connectivity to a subset of Internet destinations. The issue has been resolved, and we do not expect a recurrence."

Duo’s platform spans multiple AWS regions, and within each region we have redundancy across multiple availability zones (AZs). All infrastructure is configured in an Active / Active or Active / Passive topology with automatic recovery capabilities to ensure no single points of failure exist.

In addition to redundancy across multiple AZs within each region, we also leverage cross-region replication where possible. For example, data stores in the us-west-1 and us-west-2 AWS regions are replicated in real time to us-east-1 to enable recovery efforts.

Because this was a multi-region outage impacting both us-west-1 and us-west-2, recovery options were more limited than in a single-region failure scenario. After determining the root cause, we estimated that executing Disaster Recovery procedures to restore services in us-east-1 would be more disruptive to customers than waiting for AWS to resolve the networking issues.

Duo Azure Conditional Access services were down during the same time period but took longer to come back online, with functionality being fully restored at 16:42 UTC. After further investigation and collaboration with AWS, we have confirmed that this was due to additional AWS infrastructure issues that were related to, but distinct from, the overarching network connectivity problems.

A software defect in the Duo Authentication Proxy caused some Authentication Proxies to fail to re-establish connectivity with the Duo authentication service even after the AWS connectivity issues were resolved. Because Duo Single Sign-On and on-premises Active Directory Sync rely upon the Authentication Proxy, these capabilities continued to fail for impacted customers until the affected Authentication Proxies were restarted by customer administrators. Duo plans to email customers who we believe need to restart the Authentication Proxy service. This information was also provided on 2021-12-15 via Duo’s Status Page.
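
To illustrate the general class of behavior involved, here is a minimal sketch of a client that keeps retrying a dropped upstream connection with capped exponential backoff and jitter, so that it recovers on its own once the network does. This is not the Authentication Proxy's code or Duo's actual fix; the function names, parameters, and the use of ConnectionError are all assumptions made for the example.

```python
import random
import time
from typing import Callable, List


def backoff_delays(base: float, cap: float, attempts: int) -> List[float]:
    """Upper bound for each retry delay: base * 2**n, never above cap."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def reconnect_with_backoff(
    connect: Callable[[], None],           # hypothetical: raises ConnectionError on failure
    *,
    base: float = 1.0,
    cap: float = 60.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call connect() until it succeeds and return the number of attempts used.

    Re-raises the last ConnectionError if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    bounds = backoff_delays(base, cap, max_attempts)
    for attempt, bound in enumerate(bounds, start=1):
        try:
            connect()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Jitter: wait a random fraction of the bound so that many
            # clients do not all retry at the same instant.
            sleep(bound * rng())
    raise AssertionError("unreachable")
```

The important property is that the retry loop is driven by the client itself rather than by an administrator restarting the service; the jitter keeps a large fleet of clients from reconnecting in lockstep after a shared outage.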

Opportunities for Improvement

Prompt incident identification and communication are primary areas of concern. We know service availability is vital to our customers and prompt communication helps our customers make informed decisions. We apologize and look to improve in the following areas:

Improve the time to identify an incident:

Identifying the AWS outage and starting our incident management process took 11 minutes.
We would like this to be 5 minutes or less.

Status Page updates and communication to our stakeholders:

Our first communication happened 51 minutes after we first began receiving alerts.
We would like this to be 15 minutes or less.
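
As a rough illustration of how these figures compare with the targets, the sketch below derives the time to first communication from the timeline above (first alerts at 15:20 UTC, first Status Page update at 16:11 UTC) and takes the 11-minute identification time as stated in this report. The helper and dictionary names are hypothetical.

```python
from datetime import datetime

FMT = "%H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day UTC clock times given as HH:MM."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Targets stated in this report, in minutes.
targets = {"identify": 5, "first_communication": 15}

# Observed values: identification time as stated in the report,
# communication time computed from the timeline entries.
observed = {
    "identify": 11,
    "first_communication": minutes_between("15:20", "16:11"),
}

# Minutes by which each observed value exceeded its target.
overruns = {name: observed[name] - targets[name] for name in targets}

for name in targets:
    print(f"{name}: observed {observed[name]} min, "
          f"target {targets[name]} min, over by {overruns[name]} min")
```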

Our Duo Integrations team is already working on a fix for the software defect found in the Duo Authentication Proxy that caused some Authentication Proxies to not properly re-establish connectivity.

Failmode is a configuration that is available as an integration level setting for some applications, which deals with behavior when Duo services cannot be reached. We have received feedback about unexpected Failmode behavior and are actively working on this. In the meantime, Duo’s Business Continuity Guide is our best resource for helping customers operate effectively in the event of an outage, and has more details about how Failmodes work for our various applications.
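
The general idea behind a Failmode setting can be sketched as a choice between failing open and failing closed when the second-factor service is unreachable. The example below is an illustration only and does not describe how any specific Duo application implements it; the values "safe" and "secure" follow the naming used in Duo's documentation for some integrations, and the function and parameter names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Failmode(Enum):
    SAFE = "safe"      # fail open: allow access when Duo cannot be reached
    SECURE = "secure"  # fail closed: deny access when Duo cannot be reached


def allow_login(primary_ok: bool,
                duo_approved: Optional[bool],
                failmode: Failmode) -> bool:
    """Decide whether to admit a user.

    duo_approved is True or False when Duo answered, and None when the
    Duo service could not be reached.
    """
    if not primary_ok:
        return False  # primary credentials must pass in every mode
    if duo_approved is None:
        # Service unreachable: the configured failmode decides.
        return failmode is Failmode.SAFE
    return duo_approved
```

Under this sketch, an outage only changes the outcome for users whose primary authentication succeeded; whether they are admitted then depends entirely on the configured mode.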

Improving resilience is top of mind at Duo. The Duo team will use data collected during this incident to influence future infrastructure-related decisions regarding platform resilience. Our architectural improvements for outages look to improve our resiliency, limit blast radius and failure domains, and manage appropriate complexity. We will continue working to improve how we handle situations where an entire region is degraded, as happened in this incident, and to automate recovery.

Posted Dec 16, 2021 - 18:39 EST

Resolved

The issue affecting all Duo deployments is resolved and authentication services are fully functional.

Some customers may still be experiencing connectivity issues for Duo SSO and/or Directory Sync related to a Duo Authentication Proxy that has not restarted. Here is more detail on how that can be done: https://help.duo.com/s/article/2149

To verify if your Authentication Proxy needs to be restarted in order to complete directory synchronization, in the Duo Admin Panel, navigate to the Directory’s page via Users > Directory Sync > Click on the Directory name. If you see a message that says, “There was a problem communicating with the Authentication Proxy. Please check that the proxy is running and connected to the Duo service.”, you will need to restart your Authentication Proxy service.

To verify if your Authentication Proxy needs to be restarted in order to restore Duo Single Sign-On service, in the Duo Admin Panel, navigate to Single Sign-On > Configured Authentication Sources > Active Directory Configuration. If you see a message next to your Authentication Proxy server that says "Not connected to Duo", you will need to restart the Authentication Proxy Service.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.

Please make sure to check back or subscribe to be notified when the RCA is posted.

Posted Dec 15, 2021 - 17:42 EST

Update

Note that Duo is monitoring reports of a lag in returning Azure authentications to normal. Duo products that are configured to leverage Azure Active Directory as the authentication source (Duo Azure Conditional Access, Duo Access Gateway, Duo SSO, Duo Admin SSO logins, and Azure Directory Sync) may experience a delay in returning to normal.

We continue to monitor for any residual impacts. Please check back here or subscribe to updates for any changes.

Posted Dec 15, 2021 - 11:52 EST

Update

Note that there was a lag in returning Azure authentications to normal. Duo products that are configured to leverage Azure Active Directory as the authentication source (Duo Azure Conditional Access, Duo Access Gateway, Duo SSO, Duo Admin SSO logins, and Azure Directory Sync) may have experienced a delay in returning to normal.

This issue has been resolved, and we continue to monitor for any residual impacts.

Posted Dec 15, 2021 - 11:48 EST

Update

We are continuing to monitor for any further issues.

Posted Dec 15, 2021 - 11:32 EST

Monitoring

This issue has been resolved and we are seeing core authentication and access to the admin panel functioning properly. We will actively keep this incident in monitoring mode to observe for any recurrence before resolving.

Posted Dec 15, 2021 - 11:27 EST

Identified

As of 8:01 AM PST, Amazon has identified the root cause to be isolated to AWS US-West-2 and has taken steps to restore connectivity. They have seen some improvement in the last few minutes but continue to work towards full recovery.

Please check back here or subscribe to updates for any changes. You can also subscribe to the AWS status page here: https://status.aws.amazon.com/

Posted Dec 15, 2021 - 11:11 EST

This incident affected: DUO1 (Core Authentication Service, Admin Panel), DUO2 (Core Authentication Service, Admin Panel), DUO3 (Core Authentication Service, Admin Panel), DUO4 (Core Authentication Service, Admin Panel), DUO5 (Core Authentication Service, Admin Panel), DUO6 (Admin Panel, Core Authentication Service), DUO7 (Core Authentication Service, Admin Panel), DUO8 (Core Authentication Service, Admin Panel), DUO47 (Core Authentication Service, Admin Panel), DUO10 (Core Authentication Service, Admin Panel), DUO11 (Core Authentication Service, Admin Panel), DUO12 (Core Authentication Service, Admin Panel), DUO13 (Core Authentication Service, Admin Panel), DUO14 (Core Authentication Service, Admin Panel), DUO15 (Core Authentication Service, Admin Panel), DUO16 (Core Authentication Service, Admin Panel), DUO17 (Core Authentication Service, Admin Panel), DUO18 (Core Authentication Service, Admin Panel), DUO19 (Core Authentication Service, Admin Panel), DUO20 (Core Authentication Service, Admin Panel, Push Delivery, Phone Call Delivery, SMS Message Delivery, Cloud PKI, SSO), DUO21 (Core Authentication Service, Admin Panel), DUO22 (Core Authentication Service, Admin Panel), DUO23 (Core Authentication Service, Admin Panel), DUO24 (Core Authentication Service, Admin Panel), DUO25 (Core Authentication Service, Admin Panel), DUO26 (Core Authentication Service, Admin Panel), DUO27 (Core Authentication Service, Admin Panel), DUO28 (Core Authentication Service, Admin Panel), DUO29 (Core Authentication Service, Admin Panel), DUO30 (Core Authentication Service, Admin Panel), DUO31 (Core Authentication Service, Admin Panel), DUO32 (Core Authentication Service, Admin Panel), DUO33 (Core Authentication Service, Admin Panel), DUO34 (Core Authentication Service, Admin Panel), DUO36 (Core Authentication Service, Admin Panel), DUO37 (Core Authentication Service, Admin Panel), DUO38 (Core Authentication Service, Admin Panel), DUO39 (Core Authentication Service, Admin Panel), DUO40 
(Core Authentication Service, Admin Panel), DUO41 (Core Authentication Service, Admin Panel), DUO42 (Core Authentication Service, Admin Panel), DUO43 (Core Authentication Service, Admin Panel), DUO44 (Core Authentication Service, Admin Panel), DUO45 (Core Authentication Service, Admin Panel), DUO46 (Core Authentication Service, Admin Panel), DUO48 (Core Authentication Service, Admin Panel), DUO9 (Core Authentication Service, Admin Panel), DUO49 (Core Authentication Service, Admin Panel), DUO50 (Core Authentication Service, Admin Panel), DUO51 (Core Authentication Service, Admin Panel), DUO52 (Core Authentication Service, Admin Panel), DUO53 (Core Authentication Service, Admin Panel), DUO54 (Core Authentication Service, Admin Panel), DUO55 (Core Authentication Service, Admin Panel), DUO56 (Core Authentication Service, Admin Panel), DUO57 (Core Authentication Service, Admin Panel), DUO58 (Core Authentication Service, Admin Panel), DUO59 (Core Authentication Service, Admin Panel), DUO60 (Core Authentication Service, Admin Panel), DUO61 (Core Authentication Service, Admin Panel), DUO62 (Core Authentication Service, Admin Panel), DUO63 (Core Authentication Service, Admin Panel), DUO64 (Core Authentication Service, Admin Panel), DUO65 (Core Authentication Service, Admin Panel), DUO66 (Core Authentication Service, Admin Panel), DUO67 (Core Authentication Service, Admin Panel, Push Delivery, Phone Call Delivery, SMS Message Delivery, Cloud PKI, SSO), DUO68 (Core Authentication Service, Admin Panel), and DUO35 (Core Authentication Service, Admin Panel).