On 01 February 2016, at approximately 11:06am EST, users on Duo’s DUO1 deployment experienced increased latency or in some cases a complete inability to authenticate when interacting with Duo’s web-based authentication prompt. The issue stemmed from new code implemented last week that stored endpoint telemetry data with increased fidelity.
The code caused additional data to be loaded from backend systems as part of every authentication request. Rigorous internal load testing of the new code had been conducted prior to release and production systems were fully able to handle the increased load for multiple business days after the code had been deployed. Atypically large numbers of authentication requests were processed by DUO1 this morning, leading to resource starvation which negatively affected users logging in via Duo’s authentication prompt.
At 11:48am EST, Duo’s Operations Team successfully patched the code that caused the issue. At this point in time, authentications began to succeed, and any outstanding requests which had not timed-out were serviced. Existing backlogged requests, in combination with new requests, led to further cascading failures. High levels of load and latency continued throughout this time. By 12:35pm, the Operations Team had successfully deployed additional application servers to handle the high volumes of traffic and latency returned to normal levels.
Next steps to prevent future issues of this nature include additional efforts in load testing new code to ensure no negative impact and enhanced monitoring to allow for potential problems to be identified and acted upon prior to a customer facing event.