Summary:
From 14:58 to 15:33 UTC on October 16, 2018, the DUO1 deployment experienced performance degradation that resulted in increased authentication latency and intermittent request timeouts for all customer applications protected by the Duo service on this deployment. During this window, approximately 50% of customer requests to DUO1 may have been slow to respond or timed out. The root cause of this outage has been identified and resolved to prevent similar issues going forward.
Details:
The Duo platform provides a full-featured set of APIs that support a wide variety of customer use cases. These APIs include those necessary for authentication workflows and also provide programmatic access to administrative functionality built into the platform. Our customers use these API endpoints to build custom integrations for their unique use cases, and we use the same APIs internally as part of providing the Duo SaaS platform.
One of these administrative API endpoints allows customers to retrieve hardware tokens available within their account. When called with specific parameters, this API endpoint consumed significant Duo backend resources. This inefficiency, coupled with a large increase in request volume for this API endpoint, caused degraded authentication performance and eventually led to a cascading failure affecting many interactions with DUO1.
After identifying the root cause, we developed code to improve the performance of this API endpoint by optimizing the underlying database queries and by redirecting these database operations to alternate hardware, isolating them from critical authentication functionality. Both changes have been successfully deployed to DUO1 and other Duo deployments to prevent similar issues going forward.
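As a purely illustrative sketch of the isolation idea described above (not Duo's actual implementation), heavy administrative operations can be routed to a separate database pool so they cannot compete with authentication traffic. The pool names, the operation names, and the `choose_pool` helper are all hypothetical assumptions made for this example.

```python
# Hypothetical pool names; the report does not describe the real topology.
CRITICAL_POOL = "primary"
ISOLATED_POOL = "reporting-replica"

# Illustrative set of operations treated as heavy and non-critical.
HEAVY_ADMIN_OPERATIONS = {"list_hardware_tokens"}


def choose_pool(operation: str) -> str:
    """Return the name of the database pool an operation should run against."""
    if operation in HEAVY_ADMIN_OPERATIONS:
        # Expensive admin reads go to separate hardware.
        return ISOLATED_POOL
    # Everything else, including authentication, stays on the critical path.
    return CRITICAL_POOL
```

With this kind of routing, a surge in `list_hardware_tokens` calls would load only the isolated pool, while an operation such as `authenticate` would continue to use the primary one.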
One platform safeguard we’ve implemented to control resource consumption is a set of object and action rate limits applied to both the authentication and administrative API endpoints. Unfortunately, in this case the rate limits built into our platform did not account for all potential use cases and were not sufficient to control the impact on our backend. The team will use data collected during this event to improve our object and action rate limits, make them more contextually aware, and leverage endpoint- and parameter-specific performance characteristics to ensure platform stability without negatively impacting customer workflows.
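One way to picture a rate limit that is aware of endpoint and parameter cost is a token bucket in which each request spends tokens in proportion to its expected backend load. The sketch below is a minimal illustration under stated assumptions, not a description of Duo's rate limiter: the `CostAwareBucket` class, the endpoint names, and the cost values are all hypothetical.

```python
# Hypothetical costs keyed by (endpoint, uses_expensive_parameters).
# The numbers are illustrative only.
ENDPOINT_COST = {
    ("auth", False): 1.0,
    ("list_tokens", False): 2.0,
    ("list_tokens", True): 10.0,  # expensive parameter combination
}


class CostAwareBucket:
    """Token bucket where each request spends tokens according to its cost."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, endpoint: str, expensive_params: bool, now: float) -> bool:
        # Refill based on time elapsed since the previous call, up to capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now

        # Unknown endpoints fall back to a default cost of 1.
        cost = ENDPOINT_COST.get((endpoint, expensive_params), 1.0)
        if cost > self.tokens:
            return False  # reject: not enough budget for this request
        self.tokens -= cost
        return True
```

In this sketch a caller with a budget of 20 tokens could make only two expensive `list_tokens` requests in a burst, where a flat per-request limit would allow twenty requests regardless of their cost. The caller supplies `now` explicitly so the behavior is easy to test; a real service would more likely read a monotonic clock.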