On June 14, 2023, at around 20:00 EDT, Duo's Engineering team was alerted by our internal monitoring systems that Duo's data pipeline was experiencing high latency while ingesting authentication and SSO logs. Soon after the alert, delays appeared in the authentication and SSO log data available for querying from the Duo Admin Panel and Duo Admin APIs. The root cause of the latency was identified as a service stuck in an intermittent error state following that day's scheduled deployments.
The issue was resolved the same day by manually restarting the affected service within our data pipeline, which allowed the system to recover fully.
2023-06-14 18:41 - Duo Engineering team completes the regular scheduled deployments for its data pipeline services.
2023-06-14 20:00 - Duo Engineering team is alerted of an issue and begins troubleshooting.
2023-06-14 21:36 - Duo Engineering team finds the specific issue causing the latency in ingesting authentication and SSO logs.
2023-06-14 21:48 - Duo Engineering team begins restarting the service experiencing issues.
2023-06-14 22:05 - The service restarts successfully, and alerts for the ingestion latency immediately stop.
2023-06-14 22:18 - The latency returns to baseline levels.
2023-06-14 22:33 - The StatusPage is updated to Monitoring.
2023-06-14 22:52 - The StatusPage is updated to Resolved.
Duo Engineering completed the scheduled deployment of the services that constitute our data pipeline, which serves authentication and SSO logs. After the deployment, our monitoring system alerted us that the pipeline was unable to ingest logs at the expected pace and that ingestion latency was high. We began investigating shortly after the alert. Other nodes were still ingesting these logs, but overall service performance was degraded and slow.
This issue affected the retrieval of authentication and SSO logs from the Duo Admin Panel and Admin APIs. We determined that the degraded state was caused by intermittent errors in a critical service within the pipeline. As an immediate fix, we restarted the service experiencing the errors.
Once the restart completed, ingestion latency returned to normal, expected levels. The alerts from our monitoring stopped, and after a monitoring period the incident was marked as resolved. Note that this incident did not affect authentications in any way.
Duo Engineering is continuing to investigate why the intermittent errors occurred in the first place. We are adding more monitoring around the system to detect this specific problem sooner, and we are evaluating adjustments to our deployment schedules so that problems can be detected and remediated more promptly.
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.