On August 15, 2022, at around 8:00 EDT, Duo's Engineering Team was alerted by monitoring that we were seeing degraded performance in our Duo Single Sign-On (SSO) product for some deployments. The root cause was identified as a configuration change that caused the software to consume more resources than expected which led to cascading failures.
The issue was resolved on the same day by reverting the configuration changes.
2022-08-15 08:04 - Duo Site Reliability Engineering (SRE) is informed by automated monitoring that they are seeing degraded performance in our SSO product for some deployments. SRE begins triage.
2022-08-15 08:10 - Our SRE team observed sustained CPU load on all servers for this SSO deployment that exceeded the capacity they were capable of handling.
2022-08-15 08:30 - Our SRE team investigated failure logs on our servers and databases in an attempt to identify a root cause.
2022-08-15 08:42 - Our SRE team attempted to alleviate processing load by restarting instances and starting up instances that went down due to the high load.
2022-08-15 09:00 - Status page updated to: Issue Identified.
2022-08-15 09:30 - We observed a high and rising backlog of database calls and cleared that queue in the hopes that it would improve availability to the service. Behavior did not change.
2022-08-15 09:51 - Status page update: Continuing to investigate.
2022-08-15 09:54 - Our SRE team began rolling back to the previous version of Duo SSO software. This software had recently been updated as of Thursday, August 11, 2022. Behavior did not change.
2022-08-15 10:23 - Our SRE team began the process of adding additional servers to the SSO deployment to handle more traffic.
2022-08-15 10:30 - Status page update: Attempted rollback with no improvement. Investigation continues.
2022-08-15 10:30 - Our SRE team made network changes to attempt to reduce load on the service. The changes were reverted after demonstrating no improvement.
2022-08-15 11:09 - Status page update: We are scaling up infrastructure in an attempt to mitigate this issue.
2022-08-15 11:09 - Our SRE team identified a recent change to the deployment’s observability and performance monitoring software. Reverting that change caused a significant drop in CPU usage. We began to see authentications completing.
2022-08-15 11:09 - The root cause is identified.
2022-08-15 11:12 - The new observability feature was reverted for all servers in the deployment and CPU usage returned to normal. The issue was resolved at this time. Our SRE team continued to ensure availability and determine the cause of the observability software issue.
2022-08-15 11:21 - The root cause is fixed.
2022-08-15 11:46 - Status page updated to Monitoring.
2022-08-15 12:10 - Additional application servers were added to the SSO deployment to ensure resource capacity remained at acceptable levels.
2022-08-15 12:30 - Our SRE team redeployed the latest Duo SSO software to the servers once we were confident that this was not the cause of the outage.
2022-08-15 14:46 - Status page updated to Resolved.
Duo SRE had recently rolled out configuration changes to our SSO product that were designed to increase the observability capabilities of our services. The observability service that the SSO product communicates with began failing. The communication behavior with this observability service was aggressively retrying all failed requests. This behavior caused CPU load to increase across the board until reaching maximum capacity. This caused failures to occur in our application and database services. In addition, because authentications on the SSO service tends to be highest during North American daytime hours, the product was already experiencing significant load when the issue occurred. The combination of these issues resulted in all the servers in the deployment being marked as unhealthy and removed from operation by our automated processes.
Duo resolved this issue by disabling the observability service that was causing unsustainable load on the system.
The communication between the Duo Authentication Proxy and Duo SSO was also impacted during this incident. As a result, some Authentication Proxies may need manual restarting in order to reconnect to the Duo service.
If you are running a version of the Duo Authentication Proxy older than version 5.5.1 and you are still experiencing connectivity issues, you need to restart the Authentication Proxy service. To determine which version of the Authentication Proxy you are running, refer to this Duo Knowledge Base article.
We recommend logging into the Duo Admin Panel to verify whether your Authentication Proxy service is running normally. Here’s how to do that:
For instructions on restarting the Authentication Proxy service, refer to this Duo Knowledge Base article.
Note: Restarting the service will briefly block authentications to integrations that use the Authentication Proxy until the service resumes, usually taking a few seconds.
If you are on an Authentication Proxy version less than 5.5.1, we also recommend updating to the latest version to improve its ability to reconnect to the Duo service after certain outage scenarios.
Duo recognizes that outages of this kind are extremely disruptive to our customers and we apologize. While Duo Single Sign-On has historically had near 100% uptime, we will continue to prioritize availability as we add features to SSO and to Duo as a whole. In the near term, we are increasing our ability to monitor for performance changes that could be introduced via application or configuration changes. In addition, we are:
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this Duo Knowledge Base article.