On March 25, 2024, at around 8:42 ET, Duo's Engineering Team was alerted by monitoring that one of the endpoints was unreachable. The root cause was identified as a misconfiguration in our routing infrastructure.
The issue was resolved on the same day by fixing the routing configuration.
Duo SRE rolled out a new configuration to our applications, and during the rollout some of the routing configurations were set incorrectly. As a result, our load balancers did not accept the affected requests.
After discovering the change in routing configuration, we immediately fixed the issue and confirmed that the deployment was serving traffic.
As a short-term solution, engineers quickly fixed the routing configurations. The fixes were later ported back to our IaC solutions so that they persist across environments.
SRE has also taken on the task of improving monitoring so that specific failure scenarios are detected faster, and of developing fault-tolerant routing mechanisms to prevent such failures in the future.