On August 17, 2023, at 08:01 EDT, Duo's Engineering Team was alerted by our application monitoring components that SMS and VOIP authentications were failing for customers on the affected deployments. The root cause was identified as a failure of our autoscaling mechanism to handle increased traffic.
The issue was resolved the same day by manually increasing the number of servers in this region.
2023-08-17 08:02 Duo Site Reliability Engineering (SRE) is informed that authentications are failing on these deployments because SMS messages and VOIP calls are not being delivered to users
2023-08-17 08:07 SRE team acknowledges the outage and begins our normal incident response process
2023-08-17 08:31 After initial troubleshooting, SRE team posts a status page update to warn our customers and continues to look for the root cause of this issue
2023-08-17 08:41 SRE team decides to manually scale up the number of servers, as resource usage for our telephony service looks abnormally high
2023-08-17 08:45 SRE team acknowledges that this is not only a telephony issue but a wider issue that also affects the Risk-Based Factor Selection service on the same deployments
2023-08-17 08:53 Recovery is observed in the delivery of the SMS messages and VOIP calls that users need to authenticate
2023-08-17 09:06 After monitoring the partial recovery, the Duo team decides to add more servers in an attempt to rebalance the load
2023-08-17 09:46 The team observes a slow recovery of servers while still trying to find the root cause
2023-08-17 10:30 Status page is updated to Monitoring, as we no longer see failures on our telephony service or on other impacted services (Risk-Based Factor Selection)
2023-08-17 11:43 Root cause is identified as a failure in our metrics provider for server resource usage, which prevented our autoscaling mechanism from functioning properly. The root cause is fixed by restarting certain observability components.
Duo SRE recently rolled out a new way of renewing certificates between our observability components to strengthen the security of their communications. As a result of this change, some certificates failed to renew. On August 16 at around 7 p.m. EDT, our autoscaling components started to fail for certain services because they could not properly retrieve some of the metrics for our server resource usage. This was due to a TLS handshake error between observability components caused by expired certificates. As a consequence, our servers did not scale during the high-traffic period on August 17, which put them under pressure and caused this outage.
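To illustrate the failure mode described above, here is a minimal sketch of an expiry check that flags a certificate before it lapses. It is not Duo's tooling: the function names, the 14-day warning window, and the timestamps are all hypothetical, and the sketch assumes the certificate's notAfter value is available as a string in the format used by Python's standard ssl module.

```python
import ssl

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: float) -> float:
    """Return the number of days between `now` and the certificate expiry.

    `not_after` uses the format reported by Python's ssl module,
    e.g. "Jan 15 00:00:00 2030 GMT". `now` is seconds since the epoch.
    A negative result means the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after: str, now: float, warn_days: float = 14.0) -> bool:
    """Flag certificates that expire within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


# Hypothetical timestamps, chosen only for illustration.
now = ssl.cert_time_to_seconds("Jan 10 00:00:00 2030 GMT")
print(needs_renewal("Jan 15 00:00:00 2030 GMT", now))  # expires within the window
print(needs_renewal("Dec 31 00:00:00 2030 GMT", now))  # far from expiry
```

A check like this, run on a schedule and wired to alerting, would surface a failed renewal before the TLS handshake starts to fail.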
As a short-term solution, the Duo SRE team manually increased the number of servers for our telephony service, which gradually restored our ability to serve requests. The Duo SRE team then restarted our observability components with renewed certificates, which restored the normal functioning of our autoscaling mechanism based on resource usage.
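As a rough illustration of why missing metrics matter to autoscaling based on resource usage, the sketch below shows a generic proportional scaling rule that treats "no metrics" as a distinct, alertable state rather than as low load. This is an assumption-laden example, not a description of Duo's autoscaler: the names, the 60% target, and the server cap are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScalingDecision:
    servers: int   # number of servers to run
    alert: bool    # True when the decision was made without metrics


def decide(current: int, cpu_pct: Optional[int],
           target_pct: int = 60, max_servers: int = 100) -> ScalingDecision:
    """Proportional scaling: desired = ceil(current * usage / target).

    `cpu_pct` is the average CPU usage in percent, or None when the
    metrics provider could not be reached (for example, after a failed
    TLS handshake). All thresholds here are hypothetical.
    """
    if cpu_pct is None:
        # Without metrics, hold the current size and raise an alert
        # instead of silently doing nothing.
        return ScalingDecision(servers=current, alert=True)
    # Integer ceiling division avoids floating-point rounding surprises.
    desired = (current * cpu_pct + target_pct - 1) // target_pct
    return ScalingDecision(servers=max(1, min(desired, max_servers)), alert=False)


print(decide(10, 90))    # high load: scale out
print(decide(10, None))  # metrics unavailable: hold and alert
```

The design choice shown here is that the absence of data produces an explicit signal, so operators learn about a broken metrics path before traffic peaks.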
This incident impacted all customers using telephony on deployments DUO9, DUO17, DUO22, DUO39, DUO40, DUO42, DUO45, DUO49, DUO50, DUO52, DUO55, DUO56, DUO58, DUO62, DUO63, DUO64, DUO65, DUO72, DUO73.
The Duo SRE team is dedicated to providing reliable service to all users. To that end, the SRE team will implement several improvements to make sure this kind of issue does not happen again.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.