Multiple Deployments: Single Sign-On Connectivity Errors

Incident Report for Duo

Postmortem

SSO authentication failures across multiple deployments

Summary

On August 15, 2022, at around 8:00 EDT, automated monitoring alerted Duo's Engineering Team to degraded performance in our Duo Single Sign-On (SSO) product for some deployments. The root cause was identified as a configuration change that caused the software to consume more resources than expected, which led to cascading failures.

The issue was resolved on the same day by reverting the configuration changes.

Deployments Impacted

DUO6, DUO9, DUO17, DUO18, DUO22, DUO31, DUO39, DUO40, DUO42, DUO49, DUO50, DUO52, DUO55, DUO56, DUO62, DUO63, DUO64, DUO65
To determine which deployment ID you are on, refer to this Duo Knowledge Base article.

Timeline of Events (EDT)

2022-08-15 08:04 - Automated monitoring alerts Duo Site Reliability Engineering (SRE) to degraded performance in our SSO product for some deployments. SRE begins triage.

2022-08-15 08:10 - Our SRE team observed sustained CPU load on all servers for this SSO deployment that exceeded their capacity.

2022-08-15 08:30 - Our SRE team investigated failure logs on our servers and databases in an attempt to identify a root cause.

2022-08-15 08:42 - Our SRE team attempted to alleviate processing load by restarting instances and starting up instances that went down due to the high load.

2022-08-15 09:00 - Status page updated to: Issue Identified.

2022-08-15 09:30 - We observed a high and rising backlog of database calls and cleared that queue in the hope that it would improve availability of the service. Behavior did not change.

2022-08-15 09:51 - Status page update: Continuing to investigate.

2022-08-15 09:54 - Our SRE team began rolling back to the previous version of Duo SSO software. This software had most recently been updated on Thursday, August 11, 2022. Behavior did not change.

2022-08-15 10:23 - Our SRE team began the process of adding additional servers to the SSO deployment to handle more traffic.

2022-08-15 10:30 - Status page update: Attempted rollback with no improvement. Investigation continues.

2022-08-15 10:30 - Our SRE team made network changes to attempt to reduce load on the service. The changes were reverted after demonstrating no improvement.

2022-08-15 11:09 - Status page update: We are scaling up infrastructure in an attempt to mitigate this issue.

2022-08-15 11:09 - Our SRE team identified a recent change to the deployment’s observability and performance monitoring software. Reverting that change caused a significant drop in CPU usage. We began to see authentications completing.

2022-08-15 11:09 - The root cause is identified.

2022-08-15 11:12 - The new observability feature was reverted for all servers in the deployment and CPU usage returned to normal. The issue was resolved at this time. Our SRE team continued to ensure availability and determine the cause of the observability software issue.

2022-08-15 11:21 - The root cause is fixed.

2022-08-15 11:46 - Status page updated to Monitoring.

2022-08-15 12:10 - Additional application servers were added to the SSO deployment to ensure resource capacity remained at acceptable levels.

2022-08-15 12:30 - Our SRE team redeployed the latest Duo SSO software to the servers once we were confident that this was not the cause of the outage.

2022-08-15 14:46 - Status page updated to Resolved.

Details

Duo SRE had recently rolled out configuration changes to our SSO product that were designed to increase the observability capabilities of our services. The observability service that the SSO product communicates with began failing, and the SSO product aggressively retried every failed request to that service. These retries drove CPU load up across the deployment until it reached maximum capacity, which in turn caused failures in our application and database services. In addition, because authentication volume on the SSO service tends to be highest during North American daytime hours, the product was already under significant load when the issue occurred. The combination of these factors resulted in all the servers in the deployment being marked as unhealthy and removed from operation by our automated processes.
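To illustrate the failure mode described above, the following is a minimal sketch of a bounded retry policy for non-critical telemetry calls, using exponential backoff with jitter and a hard cap on attempts. It is not Duo's implementation: the function name call_with_backoff, its parameters, and the choice to drop the telemetry after the final attempt are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    send: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Optional[T]:
    """Call send() at most max_attempts times (hypothetical helper).

    Between attempts, wait a random fraction of an exponentially growing,
    capped delay so that many servers do not retry in lockstep. If every
    attempt fails, return None and drop the request instead of retrying
    forever.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                break  # retry budget exhausted: give up
            upper_bound = min(cap, base * (2 ** attempt))
            sleep(rng() * upper_bound)
    return None
```

The design choice in this sketch is that telemetry is treated as best-effort: when the observability backend is unavailable, the caller loses some data points rather than spending unbounded CPU on retries.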

Resolution

Duo resolved this issue by disabling the observability service that was causing unsustainable load on the system.

The communication between the Duo Authentication Proxy and Duo SSO was also impacted during this incident. As a result, some Authentication Proxies may need manual restarting in order to reconnect to the Duo service.

Recommendations

If you are running a version of the Duo Authentication Proxy older than version 5.5.1 and you are still experiencing connectivity issues, you need to restart the Authentication Proxy service. To determine which version of the Authentication Proxy you are running, refer to this Duo Knowledge Base article.

We recommend logging into the Duo Admin Panel to verify whether your Authentication Proxy service is running normally. Here’s how to do that:

From the Admin Panel, go to Single Sign-On > Configured Authentication Sources > Active Directory Configuration. If you see a message next to your Authentication Proxy server that says “Not connected to Duo,” you will need to restart the Authentication Proxy.

For instructions on restarting the Authentication Proxy service, refer to this Duo Knowledge Base article.

Note: Restarting the service will briefly block authentications to integrations that use the Authentication Proxy until the service resumes, usually taking a few seconds.

If you are running an Authentication Proxy version older than 5.5.1, we also recommend updating to the latest version to improve its ability to reconnect to the Duo service after certain outage scenarios.
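If you manage several Authentication Proxy servers, the version check above can be scripted. The following is a minimal sketch, assuming the installed version is already available as a purely numeric dotted string such as "5.4.0"; the function names are hypothetical, and the sketch does not query the proxy itself.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '5.5.1' into (5, 5, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_older_than(installed: str, threshold: str = "5.5.1") -> bool:
    """Return True if the installed version predates the threshold.

    Comparing integer tuples avoids the string-comparison trap where
    '5.10.0' would sort before '5.5.1'.
    """
    return parse_version(installed) < parse_version(threshold)
```

For example, is_older_than("5.4.0") is True, while is_older_than("5.5.1") and is_older_than("5.10.0") are both False.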

What is Duo doing to prevent this in the future?

Duo recognizes that outages of this kind are extremely disruptive to our customers, and we apologize. While Duo Single Sign-On has historically had near 100% uptime, we will continue to prioritize availability as we add features to SSO and to Duo as a whole. In the near term, we are increasing our ability to monitor for performance changes that could be introduced via application or configuration changes. In addition, we are:

Implementing processes that improve our time to update the Duo Status Page to let customers know about issues.
Running the configuration change that caused issues through a more rigorous performance testing process to profile and address any issues. Additionally, we are looking at ways to automatically repeat that process for future changes.

Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this Duo Knowledge Base article.

Posted Aug 16, 2022 - 13:48 EDT

Resolved

We have identified an issue disrupting SSO-based authentications. After investigating this problem, we have deployed a solution and resolved the issue.

If you are still having issues and are using Active Directory as your authentication source for Duo SSO, please attempt to restart the Duo Authentication Proxy service and monitor for any further errors.

Please check back here or subscribe to updates for the full RCA once it is published.

Posted Aug 15, 2022 - 14:46 EDT

Monitoring

We have deployed a fix for the connectivity issues for SSO-protected applications. We are actively monitoring the performance of successful SSO authentications and will update with any additional information we can provide!

If you are still having issues and are using Active Directory as your authentication source for Duo SSO, please attempt to restart the Duo Authentication Proxy service and monitor for any further errors. Instructions on restarting the Authentication Proxy can be found in this KB article: https://help.duo.com/s/article/2149

Please check back here or subscribe to updates for any changes.

Posted Aug 15, 2022 - 11:46 EDT

Update

We are still working on deploying a fix for connectivity issues to our SSO applications. We are in the process of scaling up additional infrastructure to combat the issue.

Thank you for your continued patience as we urgently work toward resolving this issue for all who are affected.

Please check back here or subscribe to updates for any changes.

Posted Aug 15, 2022 - 11:09 EDT

Update

We are still working on deploying a fix for connectivity failures to our SSO applications. We have performed a server rollback but have yet to see performance improvements.

Thank you for your continued patience as we urgently work toward resolving this issue for all who are affected.

Please check back here or subscribe to updates for any changes.

Posted Aug 15, 2022 - 10:32 EDT

Update

We are still investigating an issue with connectivity to our SSO applications resulting in 504 Bad Gateway errors.

Please check back here or subscribe to updates for any changes.

Posted Aug 15, 2022 - 09:51 EDT

Identified

We have identified an issue with connectivity to our SSO applications resulting in 504 Bad Gateway errors. This can affect both normal SSO-protected applications as well as SSO authentication to the Duo Admin Panel.
We are actively working on a solution.

Please check back here or subscribe to updates for any changes.

Posted Aug 15, 2022 - 09:00 EDT

This incident affected: DUO17 (Admin Panel, SSO), DUO22 (Admin Panel, SSO), DUO39 (Admin Panel, SSO), DUO40 (Admin Panel, SSO), DUO42 (Admin Panel, SSO), DUO9 (Admin Panel, SSO), DUO49 (Admin Panel, SSO), DUO50 (Admin Panel, SSO), DUO52 (Admin Panel, SSO), DUO55 (Admin Panel, SSO), DUO56 (Admin Panel, SSO), DUO58 (Admin Panel, SSO), DUO62 (Admin Panel, SSO), DUO63 (Admin Panel, SSO), and DUO65 (Admin Panel, SSO).