DUO1 Deployment: Duo Prompt experiencing Unknown Error
Incident Report for Duo
Postmortem

Summary:

From 14:58 to 15:33 UTC on October 16, 2018, the DUO1 deployment experienced performance degradation that resulted in increased authentication latency and intermittent request timeouts for all customer applications protected by the Duo service on this deployment. During this window, approximately 50% of customer requests to DUO1 may have been slow to respond or timed out. The root cause of this outage has been identified and resolved to prevent similar issues going forward.

Details:

The Duo platform provides a full-featured set of APIs that support a wide variety of customer use cases. These APIs include those necessary for authentication workflows and also provide programmatic access to administrative functionality built into the platform. Our customers use these API endpoints to build custom integrations for their specific needs, and we use the same APIs internally to deliver the Duo SaaS platform.

One of these administrative API endpoints allows customers to retrieve hardware tokens available within their account. When called with specific parameters, this API endpoint consumed significant Duo backend resources. This inefficiency, coupled with a large increase in request volume for this API endpoint, caused degraded authentication performance and eventually led to a cascading failure affecting many interactions with DUO1.

After the root cause was identified, we improved the performance of this API endpoint in two ways: by optimizing its underlying database queries, and by redirecting these database operations to alternate hardware so that they are isolated from critical authentication functionality. Both of these changes have been successfully deployed to DUO1 and other Duo deployments to prevent similar issues going forward.
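As a rough illustration of the isolation idea, the sketch below routes database work by workload class, so that heavy administrative reads go to a separate target while authentication stays on the primary. It is not Duo's implementation: the `Workload` classes, the `WorkloadRouter` type, and the host names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Workload(Enum):
    AUTHENTICATION = "authentication"  # latency-critical path
    ADMIN_LISTING = "admin_listing"    # heavy, non-critical reads


@dataclass(frozen=True)
class DatabaseTarget:
    name: str
    host: str  # placeholder host name, not real infrastructure


class WorkloadRouter:
    """Maps each workload class to the database target that should serve it."""

    def __init__(self, targets: Dict[Workload, DatabaseTarget],
                 default: DatabaseTarget) -> None:
        self._targets = dict(targets)
        self._default = default

    def target_for(self, workload: Workload) -> DatabaseTarget:
        # Workloads without an explicit mapping fall back to the default target.
        return self._targets.get(workload, self._default)


# Hypothetical targets: the primary serves authentication, while a separate
# machine absorbs expensive administrative listing queries.
PRIMARY = DatabaseTarget("primary", "db-primary.example.internal")
REPORTING = DatabaseTarget("reporting", "db-reporting.example.internal")

router = WorkloadRouter({Workload.ADMIN_LISTING: REPORTING}, default=PRIMARY)
```

With this arrangement, a surge in administrative listing calls loads only the reporting target, and `router.target_for(Workload.AUTHENTICATION)` still resolves to the primary.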

One platform safeguard we've implemented to control resource consumption is a set of object and action rate limits that are applied to both the authentication and administrative API endpoints. Unfortunately, in this case the rate limits built into our platform didn't account for all potential use cases and were not sufficient to control the impact on our backend. The team will use data collected during this event to improve our object and action rate limits, make them more contextually aware, and leverage API endpoint and parameter-specific performance characteristics to ensure the stability of the platform without negatively impacting customer workflows.
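One way to picture a rate limit that is aware of endpoint and parameter cost is a token bucket in which each request consumes tokens in proportion to its estimated backend cost. The sketch below is an assumption-laden illustration, not the limiter Duo runs: the endpoint labels (`auth`, `list_tokens`), the `limit` parameter, and the cost table are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Tuple

# Hypothetical relative backend cost per endpoint (not measured values).
ENDPOINT_BASE_COST: Dict[str, float] = {"auth": 1.0, "list_tokens": 5.0}


def request_cost(endpoint: str, params: Mapping[str, str]) -> float:
    """Estimate cost from the endpoint and a hypothetical 'limit' page size."""
    base = ENDPOINT_BASE_COST.get(endpoint, 1.0)
    page_size = int(params.get("limit", 100))
    return base * max(1.0, page_size / 100)


@dataclass
class TokenBucket:
    capacity: float
    refill_per_second: float
    tokens: float
    last_refill: float

    def try_consume(self, cost: float, now: float) -> bool:
        # Refill for the elapsed time, capped at capacity, then try to spend.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class CostAwareRateLimiter:
    """Keeps one bucket per (account, endpoint), so that expensive
    administrative calls cannot drain the budget used for authentication."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, account_id: str, endpoint: str,
              params: Mapping[str, str]) -> bool:
        now = self._clock()
        key = (account_id, endpoint)
        bucket = self._buckets.get(key)
        if bucket is None:
            # New buckets start full.
            bucket = TokenBucket(self._capacity, self._refill,
                                 self._capacity, now)
            self._buckets[key] = bucket
        return bucket.try_consume(request_cost(endpoint, params), now)
```

Because cost scales with the request's parameters, a caller issuing large listing requests exhausts its budget after far fewer calls than one issuing lightweight requests, while the separate per-endpoint buckets keep authentication traffic unaffected.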

Posted Oct 17, 2018 - 11:14 EDT

Resolved
From 10:58 AM to 11:33 AM EDT, Duo customers hosted on the DUO1 deployment experienced intermittent authentication failures. Our service changes have fully resolved these issues and all services are now fully functional.

We're continuing to investigate the root cause and will post a root cause analysis (RCA) here as soon as it is available.

Please subscribe to be notified when the RCA is posted.
Posted Oct 16, 2018 - 12:21 EDT

Monitoring
We've taken action to stabilize DUO1 and authentications are now completing as normal.

We will continue to monitor the issue and will post an update when the incident is considered fully resolved.

Please check back here or subscribe for further updates.
Posted Oct 16, 2018 - 11:43 EDT

Investigating
We are currently investigating an issue causing 'Unknown Error' messages with Core Authentication Services on our DUO1 deployment and are working to correct the issue as soon as possible.
Posted Oct 16, 2018 - 11:21 EDT

This incident affected: DUO1 (Core Authentication Service).