Summary
A change in the D322 build introduced an unoptimized query that caused high database load. This disrupted service across several deployments. The system was stabilized after certain features were temporarily disabled and the database was scaled up.
Timeline of Events
08/26/2025 06:47:00 UTC - Alert fires that our service latency is above target.
08/26/2025 06:58:00 UTC - On-call engineer confirms that autoscaling is in progress.
08/26/2025 07:00:00 UTC - Separate alert fires that our service database CPU utilization is high, and the request backlog starts to grow.
08/26/2025 07:37:00 UTC - On-call engineer disables certain features of our service to alleviate load.
08/26/2025 08:08:00 UTC - Status page is published.
08/26/2025 08:45:00 UTC - Service database is scaled up.
08/26/2025 08:50:00 UTC - On-call engineer monitors recovery of systems.
08/26/2025 10:40:00 UTC - All features of our service are re-enabled.
08/26/2025 11:44:00 UTC - Status page is updated to resolved.
Details
Our D322 build started calling an unoptimized API endpoint on an internal system, which led to high database load, elevated latency, and request timeouts. This degraded multiple deployments (DUO3, DUO47, DUO57).
To alleviate load, certain features were temporarily disabled. However, because Duo Directory depends on one of the disabled features, customers received errors until it was re-enabled. The database was also scaled up vertically so that backlogged requests could be processed more quickly.
Once the backlog was cleared, all features were re-enabled and the incident was resolved.
We conducted a detailed analysis to determine why the service database CPU utilization was higher than expected. The analysis identified a newly introduced unoptimized query, which was promptly fixed.
Our teams also plan to implement more rigorous testing and review processes for new database queries added to our services.
In addition, our teams will improve our performance testing capabilities in the development environment so that production load can be simulated more accurately. This will help us detect unoptimized queries before they are released.
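One way a review check for new database queries could look is sketched below. This is an illustration only, not the check our teams use. It assumes SQLite (via Python's built-in sqlite3 module) as a stand-in database, and the `users` table, the `idx_users_email` index, and the `full_scan_steps` and `make_demo_db` helpers are all hypothetical names chosen for the example. The idea is to ask the database for its query plan and flag any step that reads an entire table.

```python
import sqlite3


def full_scan_steps(conn, sql, params=()):
    """Return the query-plan steps that scan a whole table or index.

    Assumption: SQLite is used as a stand-in engine. Its EXPLAIN QUERY PLAN
    rows have the shape (id, parent, notused, detail), and the detail text
    begins with "SCAN" for a full scan and "SEARCH" for an indexed lookup.
    """
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows if row[3].startswith("SCAN")]


def make_demo_db():
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    return conn


# Hypothetical query under review.
QUERY = "SELECT id FROM users WHERE email = ?"

if __name__ == "__main__":
    conn = make_demo_db()
    # Without an index on email, the plan contains a full table scan.
    print(full_scan_steps(conn, QUERY, ("a@example.com",)))
    # After adding an index, the same query becomes an indexed search.
    conn.execute("CREATE INDEX idx_users_email ON users (email)")
    print(full_scan_steps(conn, QUERY, ("a@example.com",)))
```

A check of this kind only inspects the plan shape. It does not replace load testing against production-like data volumes, since a plan that looks acceptable on a small table can still be slow at scale.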