Incident Report - 2018/08/29
Summary:
From 14:11 to 15:13 UTC on August 29th, 2018, the DUO1 deployment experienced performance degradation that resulted in increased authentication latency and intermittent request timeouts for all customer applications protected by the Duo service on this deployment. The root cause of this outage and of the earlier, prolonged outage on August 20th has been identified, and additional capacity is being made available on DUO1 to prevent similar issues going forward.
Details:
Our DUO1 deployment has experienced two outages in as many weeks. This is neither expected nor acceptable. We first want to apologize for the impact, and then provide more information on what happened and what we have done and will be doing about it.
On August 20th, we experienced an outage affecting authentications to the DUO1 deployment. At that time, we rerouted traffic and disabled scheduled jobs in response to unexpected platform load, but a software rollback and subsequent service restart were ultimately necessary to resolve the issue. Given that a new software release had recently been deployed to DUO1 and a rollback resolved the issue, the newly deployed software was identified as a significant contributor to the additional load we were experiencing. As part of incident follow-up, we reviewed the deployed software, found opportunities to improve performance, and then patched and redeployed it.
From that point, the DUO1 deployment was stable, but traffic and load continued to grow. After further review of the data surrounding the incident, we determined that the outage was not solely the result of the software release in question, but rather a combination of factors (types of requests, background jobs, inefficient queries, automatic retry mechanisms, etc.) all contributing to capacity issues. In response, we spent the following week making additional optimizations to our software, which increased the available capacity.
On August 29th, we experienced a similar capacity issue, but this time we had not recently deployed new software. In response, we applied request limits to our platform in an effort to manage the influx of inbound requests and stabilize the platform, but these efforts proved ineffective. After further debugging, we determined that limiting was ineffective because of the way our application queues requests while waiting for a database connection: queued requests had built up to the point that the database could not recover while working through the large backlog, even after traffic subsided and the limits were in place. Once we identified this problem, we flushed these queues on each application server and the platform immediately began to stabilize. In hindsight, this is effectively what the software rollback did during the August 20th incident, which is why the rollback resolved that earlier issue.
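To make this failure mode concrete, the sketch below (illustrative only, not our production code; the pool size, query time, and request counts are invented for the example) shows how an unbounded wait queue in front of a fixed-size database connection pool retains a backlog that keeps the database saturated even after inbound traffic is limited:

    # Simplified model of requests queuing for a database connection.
    import queue
    import threading
    import time

    DB_CONNECTIONS = 4          # hypothetical connection pool size
    DB_QUERY_SECONDS = 0.05     # hypothetical time per database request

    wait_queue = queue.Queue()  # unbounded: this is the core problem

    def db_worker():
        # Simulates one pooled connection draining the wait queue.
        while True:
            wait_queue.get()
            time.sleep(DB_QUERY_SECONDS)  # the database does work
            wait_queue.task_done()

    for _ in range(DB_CONNECTIONS):
        threading.Thread(target=db_worker, daemon=True).start()

    # A traffic burst queues far more requests than the pool can drain.
    for request_id in range(10_000):
        wait_queue.put(request_id)

    # Even if request limits now stop new traffic, the existing backlog
    # alone keeps the database busy for ~10,000 * 0.05 / 4 = 125 seconds.
    print(f"backlog after burst: {wait_queue.qsize()} queued requests")

Flushing the queues on each application server discards that backlog, so the database only has to serve live traffic again.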
Now that we had a better understanding of what was causing these prolonged backups, we knew what to do. On August 29th, we implemented a maximum depth for this queue. When the queue reaches that limit, we now proactively reject requests instead of queuing them, which is better for both the Duo service and the clients making the requests if those requests are not going to complete in a timely fashion anyway. This will help prevent the cascading failure scenario that results from the database getting backed up. We also set up monitoring on this queue depth so our team will be alerted if it starts to back up, giving us a good leading indicator to get ahead of similar issues going forward.
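Continuing the sketch above (again illustrative, with hypothetical limits rather than our actual configuration), the fix amounts to bounding the queue, rejecting immediately when it is full, and exposing the queue depth for alerting:

    # Bounded wait queue with proactive rejection and depth monitoring.
    import queue

    MAX_QUEUE_DEPTH = 500    # hypothetical cap on queued requests
    ALERT_THRESHOLD = 350    # hypothetical alerting threshold

    wait_queue = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

    class ServiceOverloaded(Exception):
        # Raised so the client gets a fast, explicit rejection instead
        # of waiting on a request that will not complete in time.
        pass

    def enqueue_request(request_id):
        try:
            # Fail immediately rather than blocking when the queue is full.
            wait_queue.put_nowait(request_id)
        except queue.Full:
            raise ServiceOverloaded(f"request {request_id} rejected: queue full")

    def check_queue_depth():
        # Called periodically by monitoring; alerts before the cap is hit.
        depth = wait_queue.qsize()
        if depth >= ALERT_THRESHOLD:
            print(f"ALERT: wait queue depth {depth} >= {ALERT_THRESHOLD}")
        return depth

Rejecting at the cap keeps the backlog small enough for the database to recover on its own, and the depth alert fires well before the cap is reached so we can intervene early.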
In addition, despite our best efforts to optimize things on DUO1, overall usage is still higher than acceptable levels. This increased usage is what has resulted in requests queuing up during periods of significantly increased traffic. We’re going to take a number of short- and medium-term steps to remediate this, beginning with making additional capacity available on DUO1 as noted in the summary above.
With these changes in place, we’re confident DUO1 can both handle current traffic demands and scale for the future.
We apologize again for these recent DUO1 outages. We know you depend on Duo, and that any downtime in a service you’ve come to trust is unacceptable.