Incident Report - 2023-09-11
On September 11, 2023, at around 13:27 EDT, Duo's Engineering Team was alerted by our internal monitoring systems that Duo’s data pipeline was experiencing high latency when ingesting authentication and SSO logs. Soon after the alert, authentication and SSO logs were delayed in becoming available for query from the Duo Admin Panel and Duo Admin APIs. The root cause of the latency was identified as expired certificates on components of our internal data pipeline.
The issue was resolved the next day by renewing the certificates within our data pipeline, which allowed the system to recover fully. No logs were lost during this incident because our recovery mechanisms stored the unprocessed logs until they could be ingested.
2023-09-11 13:27 - Duo Engineering team is alerted to an issue with log ingestion in the data pipeline in DUO67.
2023-09-11 13:34 - Duo Engineering team begins troubleshooting.
2023-09-11 16:46 - Duo Engineering team identifies the root cause and starts working on a fix.
2023-09-12 04:29 - Duo Engineering team successfully renews all required certificates and begins restarting the affected service.
2023-09-12 05:01 - The service is restarted successfully and logs begin flowing through the data pipeline as expected.
2023-09-12 05:30 - The StatusPage is updated to Monitoring.
2023-09-12 06:27 - The StatusPage is updated to Resolved.
Duo Engineering completed scheduled patching of the services that make up our data pipeline, which serves the authentication and SSO logs. Once patching was completed, our monitoring system alerted us that the pipeline was unable to ingest logs at the expected pace and that ingestion latency was high. We began investigating shortly after the alert.
This issue affected the retrieval of authentication and SSO logs from the Duo Admin Panel and Admin APIs. We determined that the degraded state was caused by expired certificates on components of our internal data pipeline, and we renewed the required certificates for all of those components to resolve the issue.
Once the certificates were renewed and the services were restarted, ingestion latency returned to normal, expected levels. The alerts from our monitoring also stopped, and after a period of monitoring our teams marked the incident as resolved. Note that this incident did not affect authentications in any way.
Duo Engineering has conducted a retrospective to determine how we can prevent certificate-related issues in the future. We will be automating our certificate renewal process and increasing observability and alerting for certificates that are close to expiration, so that we have ample time to act on them. In addition, we have identified a number of measures to integrate into our response runbooks to decrease our incident response time.
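As an illustration of the kind of expiration alerting described above, the following is a minimal sketch and not Duo's actual tooling. It classifies a certificate by the time left before its notAfter date. The function names, the 30-day and 7-day thresholds, and the example dates are hypothetical and chosen only for demonstration.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; this report does not state the values Duo uses.
WARN_DAYS = 30
CRITICAL_DAYS = 7


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter time.

    `not_after` uses the format found in the dict returned by
    ssl.SSLSocket.getpeercert(), e.g. "Jan 15 00:00:00 2030 GMT".
    `now` must be a timezone-aware datetime.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now) / timedelta(days=1)


def classify(not_after: str, now: datetime) -> str:
    """Map the remaining lifetime to an alert level."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= CRITICAL_DAYS:
        return "critical"
    if remaining <= WARN_DAYS:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Illustrative date only; not taken from the incident.
    now = datetime(2030, 1, 1, tzinfo=timezone.utc)
    print(classify("Jan 15 00:00:00 2030 GMT", now))
```

A check like this could run on a schedule against every certificate in an inventory, with the "warning" level set early enough that a renewal can be completed before any service is affected.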
Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.