Authentication Latency on DUO1 Deployment

Incident Report for Duo

Postmortem

Authentication Issues - DUO1

Incident Report - 2018/08/29

Summary:

From 14:11 to 15:13 UTC on August 29th, 2018, the DUO1 deployment experienced performance degradation that resulted in increased authentication latency and intermittent request timeouts for all customer applications protected by the Duo service on this deployment. The root cause of this and the previous prolonged August 20th outage has been identified, and additional capacity is being made available on DUO1 to prevent similar issues going forward.

Details:

Our DUO1 deployment has experienced two outages in as many weeks. This is neither expected nor acceptable. We first want to apologize for the impact, and then provide more information on what happened and what we have and will be doing about it.

On August 20th, we experienced an outage affecting authentications to the DUO1 deployment. At that time, we rerouted traffic and disabled scheduled jobs in response to unexpected platform load, but a software rollback and subsequent service restart was ultimately necessary to resolve the issue. Given that a new software release had recently been deployed to DUO1 and a rollback resolved the issue, the newly deployed software was identified as a significant contributor to the additional load we were experiencing. As part of incident follow-up, we reviewed the deployed software, found opportunities to improve performance, then patched and redeployed the software.

From that point, the DUO1 deployment was stable but traffic and load continued to grow. After further review of the data surrounding the incident, it was determined that the outage was not solely the result of the software release in question, but ultimately a combination of factors (types of requests, background jobs, inefficient queries, automatic retry mechanisms, etc.) all contributing to capacity issues. In response, additional efforts were made to optimize our software for increased performance over the next week, resulting in increased available capacity.

On August 29th, we experienced a similar capacity issue, but in this case we had not recently deployed new software. In response to this issue, we applied request limits to our platform in an effort to manage an influx of inbound requests and stabilize the platform, but these efforts proved ineffective. After further debugging, we determined that limiting was ineffective because of the way our application queues requests while waiting for a database connection. In this case, these queued requests had built up in such a way that the database could not recover as it tried to process this large backlog of requests, even after traffic subsided and the limits where in place. Once this problem was identified, these queues were flushed on each application server and things immediately began to stabilize. In hindsight, this is effectively what the software rollback did as part of the issue on August 20th, which is why the rollback solved that prior issue.

Now that we had a better understanding of what was causing these prolonged backups, we knew what to do. On August 29th, we implemented a max limit to this queue. If we attempt to go over the limit, we'll start proactively rejecting the requests instead of queuing them up, which is better for both the Duo service and the clients making the requests if these requests are not going to complete in a timely fashion. This will help prevent the cascading failure scenario that is a result of the database getting backed up. We also setup monitoring for this queue depth so our team will be alerted if its starting to back up. This will be a good leading indicator for us to get ahead of similar issues going forward.

In addition, despite our best efforts to optimize things on DUO1, overall usage is still at higher than acceptable levels. This increased usage is what has resulted in requests queueing up in periods of significantly increased traffic. We’re going to do a number of things to remediate this issue in the short and medium term:

We have already migrated a number of customer accounts off of DUO1 since August 29th, which has provided additional available capacity. Further migrations will be evaluated as necessary to determine risk and potential impact, and will be coordinated with the customers in question.
We will double DUO1's database capacity on Saturday, September 1st at 11:00AM UTC. This upgrade is being done at that time to minimize impact to our customers as it will require a short amount of downtime to complete. We expect between 1 and 7 minutes based on prior testing.
We’ve been actively re-architecting the database tier of our deployments so that we can home customers to a specific database server within a deployment. That way, we’re able to quickly add capacity on demand for those customers that need it without affecting other customers on the deployment. We’ve been exercising this technology in our non-production environments for some time, and have prioritized finalizing this for production usage in Q4, 2018.
We will take lessons learned from these outages and incorporate them into our ongoing capacity planning process to ensure our DUO deployments are always operating with necessary headroom.

With these changes in place, we’re confident DUO1 can both handle current traffic demands and scale for the future.

We apologize again for these recent DUO1 outages. We know you depend on Duo, and that any downtime in a service you’ve come to trust is unacceptable.

Posted Aug 30, 2018 - 13:38 EDT

Resolved

We confirmed that Duo service has been restored to normal functionality. We will post a RCA when our engineers have completed their full analysis.

Posted Aug 29, 2018 - 13:02 EDT

Update

We've confirmed that the issue has subsided, and authentications are completing successfully.

Posted Aug 29, 2018 - 11:21 EDT

Update

We're still monitoring an elevated number of slow and/or failed authentication requests affecting our DUO1 deployment. Our Engineering team is actively working to resolve this issue.

Posted Aug 29, 2018 - 11:17 EDT

Update

We are continuing to monitor for any further issues.

Posted Aug 29, 2018 - 10:42 EDT

Monitoring

We are monitoring an elevated level of slow or failed authentication requests affecting our DUO1 deployment.

Posted Aug 29, 2018 - 10:38 EDT

This incident affected: DUO1 (Core Authentication Service, Push Delivery, Phone Call Delivery, SMS Message Delivery).