Duo Admin Panel: delay retrieving Authentication and SSO logs

Incident Report for Duo

Postmortem

Summary

On June 14, 2023, at around 20:00 EDT, Duo's Engineering Team was alerted by our internal monitoring systems that Duo's data pipeline was ingesting authentication and SSO logs with high latency. Soon after the alert, authentication and SSO log data was delayed in becoming available to query from the Duo Admin Panel and the Duo Admin API. The root cause of the latency was identified as a service that became stuck in an intermittent error state during the scheduled deployments on that day.

The issue was resolved on the same day by manually restarting the affected service within our data pipeline, which allowed the system to recover fully.

Deployments Impacted

DUO9
DUO17
DUO22
DUO39
DUO40
DUO42
DUO45
DUO49
DUO50
DUO52
DUO55
DUO56
DUO58
DUO61
DUO62
DUO63
DUO64
DUO65
DUO71
DUO72
DUO73

Timeline of Events (EDT)

2023-06-14 18:41 - Duo Engineering team completes the regular scheduled deployments for its data pipeline services.

2023-06-14 20:00 - Duo Engineering team is alerted of an issue and begins troubleshooting.

2023-06-14 21:36 - Duo Engineering team finds the specific issue causing the latency in ingesting authentication and SSO logs.

2023-06-14 21:48 - Duo Engineering team begins restarting the service experiencing issues.

2023-06-14 22:05 - The restart completes successfully and alerts for the ingestion latency stop immediately.

2023-06-14 22:18 - The latency returns to baseline levels.

2023-06-14 22:33 - The StatusPage is updated to Monitoring.

2023-06-14 22:52 - The StatusPage is updated to Resolved.

Details

Duo Engineering completed the scheduled deployment of the services that make up our data pipeline, which serves the authentication and SSO logs. After the deployment, our monitoring system alerted us that the pipeline was unable to ingest logs at the expected pace and that ingestion latency was high. We began investigating shortly after the alert. Other nodes were still ingesting these logs, but overall service performance was degraded.

This issue affected the retrieval of authentication and SSO logs from the Duo Admin Panel and Admin APIs. We then determined that the degraded state was caused by intermittent errors in a critical service of the pipeline. As a quick fix, we decided to restart the affected service.

Once the restart was completed, ingestion latency returned to normal, expected levels. The alerts from our monitoring also stopped, and the incident was marked as resolved after review by our teams. Note that this incident did not affect authentications in any way.

What is Duo doing to prevent this in the future?

Duo Engineering is continuing to investigate why the intermittent errors occurred in the first place. We are also adding more monitoring around our system to detect this specific problem sooner, and we are considering adjusting our deployment schedules so that problems are detected and remediated more promptly.

Note: You can find your Duo deployment’s ID and sign up for updates via the StatusPage by following the instructions in this knowledge base article.

Posted Jun 22, 2023 - 19:01 EDT

Resolved

After monitoring the latency of the Authentication and SSO logs, we have confirmed that the issue is resolved. As of 02:37 UTC on June 15, Authentication Logs are operating normally. We will provide an RCA as soon as it is available.

Posted Jun 14, 2023 - 22:52 EDT

Monitoring

The fix has been implemented and we are seeing service improvements. We are now monitoring the solution and will update this page with any new information or developments.

Posted Jun 14, 2023 - 22:33 EDT

Identified

The issue has been identified and a fix is being implemented.

Posted Jun 14, 2023 - 22:30 EDT

Investigating

We have identified an issue that prevented some log entries from being populated correctly for events that occurred since 10:30 PM UTC on June 14th. This issue affects the authentication and Duo SSO logs visible via the Duo Admin Panel as well as authentication logs retrieved via the Duo Admin API.
Recovery is currently in progress and we will post an update as soon as all entries are up to date.

No authentications are affected and all other services are operational at this time.

Posted Jun 14, 2023 - 22:14 EDT

This incident affected: DUO17 (Admin Panel), DUO22 (Admin Panel), DUO39 (Admin Panel), DUO40 (Admin Panel), DUO42 (Admin Panel), DUO45 (Admin Panel), DUO9 (Admin Panel), DUO49 (Admin Panel), DUO50 (Admin Panel), DUO52 (Admin Panel), DUO55 (Admin Panel), DUO56 (Admin Panel), DUO58 (Admin Panel), DUO62 (Admin Panel), DUO63 (Admin Panel), DUO64 (Admin Panel), DUO65 (Admin Panel), DUO71 (Admin Panel), DUO72 (Admin Panel), and DUO73 (Admin Panel).