SMS/Phone call delivery and Risk-Based Factor Selection failures on multiple deployments
Incident Report for Duo
Postmortem

Summary

On August 17, 2023, at 08:01 EDT, Duo's Engineering Team was alerted by our application monitoring components, which reported that SMS and VOIP authentications were failing for customers on the affected deployments. The root cause was identified as a failure of our autoscaling mechanism to add capacity as traffic increased.

The issue was resolved on the same day by manually increasing the number of servers for the affected services.

Deployments Impacted

  • DUO9, DUO17, DUO22, DUO39, DUO40, DUO42, DUO45, DUO49, DUO50, DUO52, DUO55, DUO56, DUO58, DUO62, DUO63, DUO64, DUO65, DUO72, DUO73

Timeline of Events EDT

2023-08-17 08:02 Duo Site Reliability Engineering (SRE) is informed that authentications are failing on the affected deployments because SMS messages and VOIP calls are not being delivered to users

2023-08-17 08:07 SRE team acknowledges the outage and begins our normal incident response process

2023-08-17 08:31 After initial troubleshooting, SRE team posts a status page update to warn our customers and continues to look for the root cause of this issue

2023-08-17 08:41 SRE team decides to manually scale the number of servers as the resource usage for our telephony service looks abnormally high 

2023-08-17 08:45 SRE team acknowledges that this is not only a telephony issue but a wider issue also affecting the Risk-Based Factor Selection service for the same deployments

2023-08-17 08:53 Recovery is observed in the delivery of SMS and VOIP authentications to users

2023-08-17 09:06 After monitoring the partial recovery, the Duo team decides to add more servers in an attempt to rebalance the load

2023-08-17 09:46 The team observes a slow recovery of servers while continuing to look for the root cause

2023-08-17 10:30 Status page updated to monitoring as we do not see any more failures on our telephony service nor other impacted services (risk-based factor selection)

2023-08-17 11:43 Root cause is identified as a failure in the provider of server resource-usage metrics, which prevented our auto-scaling mechanism from functioning properly. The root cause is fixed by restarting certain observability components.

Details

Duo SRE recently rolled out a new way of renewing certificates between our observability components to strengthen the security of their communications. With this change, some certificates failed to renew. On August 16 at around 7pm EDT, our autoscaling components started to fail for certain services because they could not retrieve complete metrics on server resource usage. This was due to a TLS handshake error between observability components caused by expired certificates. As a result, our servers did not scale up during high-traffic periods on August 17, which put them under pressure and caused this outage.
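
The expiry condition described above can be detected before it causes a handshake failure. The following is a minimal sketch, not Duo's actual tooling: the function names and the 14-day threshold are hypothetical, and it assumes the certificate's expiry is available as the `notAfter` string that Python's `ssl` module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate's expiry.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Aug 16 23:00:00 2023 GMT". A negative result means
    the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, threshold_days: float = 14) -> bool:
    """Flag certificates that expire within `threshold_days` (hypothetical value)."""
    return days_until_expiry(not_after, now) <= threshold_days
```

A check like this could run on a schedule against every certificate used between observability components and raise an alert when `needs_renewal` returns true, so that a silent renewal failure surfaces before the certificate expires.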

As a short-term solution, the Duo SRE team manually increased the number of servers for our telephony service, which gradually restored our ability to serve requests. The Duo SRE team then restarted our observability components with renewed certificates, which restored the normal functioning of our autoscaling mechanism based on resource usage.

This incident impacted all customers using telephony on deployments DUO9, DUO17, DUO22, DUO39, DUO40, DUO42, DUO45, DUO49, DUO50, DUO52, DUO55, DUO56, DUO58, DUO62, DUO63, DUO64, DUO65, DUO72, DUO73.

The Duo SRE team is dedicated to providing reliable service to all users. To that end, the SRE team will implement several improvements to prevent this kind of issue from happening again:

  • Improved monitoring with new alerts for abnormal behavior on our autoscaling mechanism
  • Improved alerting on certain observability components
  • Improvements to the back pressure system, which allows the system to fail safely, so that services can recover more quickly after a scaling failure
  • A long-term fix for our internal certificate renewal system for observability components has been identified and already implemented
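
As an illustration of the first improvement, an autoscaling decision can treat missing or stale metrics as an alert condition instead of silently doing nothing. The sketch below is an assumption-laden example and not Duo's implementation: `MetricSample`, `desired_replicas`, the 60% CPU target, the 120-second staleness limit, and the replica cap are all hypothetical, and the scaling rule is the common proportional formula (current servers multiplied by observed utilization over target utilization).

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MetricSample:
    cpu_percent: float  # average CPU utilization across servers, 0-100
    timestamp: float    # Unix time at which the sample was collected


def desired_replicas(
    current: int,
    sample: Optional[MetricSample],
    now: float,
    target_percent: float = 60.0,    # hypothetical utilization target
    max_age_seconds: float = 120.0,  # hypothetical staleness limit
    max_replicas: int = 100,         # hypothetical upper bound
) -> Tuple[int, Optional[str]]:
    """Return (replica count, alert message or None).

    When metrics are missing or stale, keep the current replica count
    and return an alert so that the scaling failure is visible.
    """
    if sample is None:
        return current, "no resource metrics available; autoscaling is paused"
    age = now - sample.timestamp
    if age > max_age_seconds:
        return current, f"resource metrics are {age:.0f}s old; autoscaling is paused"
    # Proportional rule: scale so that utilization moves toward the target.
    wanted = math.ceil(current * sample.cpu_percent / target_percent)
    return max(1, min(wanted, max_replicas)), None
```

Returning the alert alongside the replica count keeps the two concerns together: the caller cannot obtain a scaling decision without also seeing whether that decision was based on usable metrics.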

Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.

Posted Aug 18, 2023 - 12:33 EDT

Resolved
The issue causing SMS/Phone Call delivery and Risk-Based Factor Selection failures to all numbers on multiple deployments is fully resolved and all services are now fully functional.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.
Please make sure to check back or subscribe to be notified when the RCA is posted.
Posted Aug 17, 2023 - 11:43 EDT
Monitoring
Our Engineering Team has identified the cause for these telephony and Risk-Based Factor Selection errors and deployed a fix. We are continuing to monitor the issue.

We will post any updates when the incident is considered fully resolved.
Posted Aug 17, 2023 - 10:30 EDT
Update
We are continuing to investigate an issue causing SMS/Phone Call delivery and Risk-Based Factor Selection failures to all numbers on multiple deployments and are working to correct the issue as soon as possible. This is impacting all countries.

Please check back here or subscribe to updates for any changes.
Posted Aug 17, 2023 - 09:57 EDT
Update
We are continuing to investigate an issue causing SMS and Phone Call delivery failures to all numbers on multiple deployments and are working to correct the issue as soon as possible. This is impacting all countries.

Please check back here or subscribe to updates for any changes.
Posted Aug 17, 2023 - 09:35 EDT
Update
We are continuing to investigate this issue.
Posted Aug 17, 2023 - 08:39 EDT
Investigating
We are currently investigating an issue causing SMS and Phone Call delivery failures to US and Canada numbers and are working to correct the issue as soon as possible.

Please check back here or subscribe to updates for any changes.
Posted Aug 17, 2023 - 08:31 EDT
This incident affected: DUO17 (Phone Call Delivery, SMS Message Delivery), DUO22 (Phone Call Delivery, SMS Message Delivery), DUO39 (Phone Call Delivery, SMS Message Delivery), DUO40 (Phone Call Delivery, SMS Message Delivery), DUO42 (Phone Call Delivery, SMS Message Delivery), DUO45 (Phone Call Delivery, SMS Message Delivery), DUO9 (Phone Call Delivery, SMS Message Delivery), DUO49 (Phone Call Delivery, SMS Message Delivery), DUO50 (Phone Call Delivery, SMS Message Delivery), DUO52 (Phone Call Delivery, SMS Message Delivery), DUO55 (Phone Call Delivery, SMS Message Delivery), DUO56 (Phone Call Delivery, SMS Message Delivery), DUO58 (Phone Call Delivery, SMS Message Delivery), DUO61 (Phone Call Delivery, SMS Message Delivery), DUO62 (Phone Call Delivery, SMS Message Delivery), DUO63 (Phone Call Delivery, SMS Message Delivery), DUO64 (Phone Call Delivery, SMS Message Delivery), DUO65 (Phone Call Delivery, SMS Message Delivery), DUO71 (Phone Call Delivery, SMS Message Delivery), DUO72 (Phone Call Delivery, SMS Message Delivery), and DUO73 (Phone Call Delivery, SMS Message Delivery).