Multiple Deployments: Authentication Delays and Timeouts
Incident Report for Duo
Postmortem

Update:

This message is a follow-up to the postmortem published August 21 regarding a delay in Authentication Log reporting that occurred on August 20 from 1:38 AM to 2:37 AM Eastern (0538 UTC to 0637 UTC) after the authentication delays described in the postmortem.

The following information is for customers using a SIEM or other security tooling via the Duo Admin API. These customers will need to perform manual steps to backfill logs for the affected time period into their SIEM or other system.

Step 1: Download the log export script from Duo.

From our GitHub project page, go to Code > Download ZIP to download the entire project folder.

This folder contains both the export script, authlog_export.py, and a requirements.txt file that lists the dependencies the script needs to run.

Step 2: Run the script to export your logs.

  1. Run pip install -r requirements.txt to install the duo_client dependency.
  2. Execute the script with python authlog_export.py or python3 authlog_export.py

    1. If you plan to import the downloaded data to Splunk, add --splunk to your command line. This will add Splunk-specific fields to the downloaded data. By default, data will be retrieved using version 1 of the API, which is used by the Splunk integration and the legacy third-party Duo Log Grabber tool.
    2. To retrieve data from version 2 of the API, which is in a different format, please add --version=2 to your command line.
  3. Input your IKEY, SKEY, and host to connect to your Admin API integration.

  4. Specify the directory where you wish to write the logs.

  5. Provide the following start and end time values to fetch the logs for the affected time period. Both values are Unix epoch timestamps in milliseconds (UTC) and correspond to the incident period of about 1:38 AM to 2:37 AM Eastern (0538 UTC to 0637 UTC).

    1. Start time value: 1597901880000
    2. End time value: 1597905420000
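
As a quick sanity check, the two values above can be converted back to readable times. This is only an illustrative sketch and not part of the export script; the helper name ms_to_utc is made up for this example, and the fixed UTC-4 offset assumes Eastern Daylight Time, which is in effect in August.

```python
from datetime import datetime, timedelta, timezone

# Values from step 5: Unix epoch timestamps in milliseconds.
START_MS = 1597901880000
END_MS = 1597905420000

# Assumption: Eastern Daylight Time (UTC-4) applies on August 20.
EDT = timezone(timedelta(hours=-4), "EDT")


def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


for label, ms in (("start", START_MS), ("end", END_MS)):
    utc = ms_to_utc(ms)
    # Print the UTC time and the same instant in Eastern time.
    print(label, utc.isoformat(), utc.astimezone(EDT).strftime("%H:%M %Z"))
# start 2020-08-20T05:38:00+00:00 01:38 EDT
# end 2020-08-20T06:37:00+00:00 02:37 EDT
```

The printed times match the 1:38 AM to 2:37 AM Eastern window stated for the affected period.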

Step 3: Import the downloaded data into your SIEM or other external system.

Follow your usual workflow for manually importing data into your system. Here is a sample set of instructions for Splunk, which uses version 1 of the API.

  1. Log into the Splunk administrative UI and go to Settings > Data > Source Types.
  2. Make sure you have JSON as a source type. If not, create a new JSON source type.
  3. Go to Settings > Add Data.
  4. Click on Upload and then drag and drop the exported Duo authlog_data.json file.
  5. Click on Next, and then expand the Timestamp settings in the left sidebar to make the following selections:

    1. Set the extraction mode to Advanced.
    2. Set the timezone to default.
    3. Set the timestamp format to %s and the timestamp field to timestamp.
    4. If prompted, save the source type as a custom source type.
  6. Click on Next and then from the index dropdown, select Duo as the index to import the data to.

  7. Review your settings and then submit to import the data.

  8. After importing the data, review it for any duplicate entries. Your SIEM or other tool may automatically detect duplicates. Duplicates can appear because the exact window of missing data depends on how and when your system pulls logs from the Duo API, so the exported records may overlap with records your system already has.
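
If your tool does not detect duplicates on its own, one option is to remove exact duplicates before importing. The sketch below is an assumption-laden illustration rather than a Duo-provided procedure: it assumes the records are already loaded as a list of dictionaries, the function name drop_duplicates is hypothetical, and the sample records use made-up values (only the timestamp field is named in the steps above).

```python
import json


def drop_duplicates(records):
    """Return records with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for record in records:
        # Serialize with sorted keys so key order does not affect equality.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample records for illustration only.
sample = [
    {"timestamp": 1597901900, "username": "alice"},
    {"username": "alice", "timestamp": 1597901900},  # same event, different key order
    {"timestamp": 1597902000, "username": "bob"},
]
print(len(drop_duplicates(sample)))  # 2
```

This compares whole records, so two genuinely distinct events with identical fields would be merged. If your records carry a unique identifier, keying on that field instead is likely more robust.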

Availability Zone Failover - DUO1, DUO2, DUO4, DUO5, DUO6, DUO7, DUO9, DUO10, DUO12, DUO13, DUO14, DUO15, DUO16, DUO18, DUO19, DUO20, DUO21, DUO23, DUO24, DUO28, DUO31, DUO32, DUO33, DUO35, DUO36, DUO37, DUO41, DUO44, DUO60

Incident Report - 2020/08/19 - 2020/08/20

Summary:

At 7:35 PM Eastern on August 19, 2020, users began experiencing authentication delays or timeouts. Duo utilizes many cloud partners as part of our SaaS platform, including Amazon Web Services (AWS). Amazon experienced connectivity issues affecting systems in a single availability zone. This event initiated automated failover to another availability zone.

At 8:18 PM Eastern, the Duo Engineering team observed that the failover process had completed. At 8:19 PM Eastern, the Engineering team initiated post-failover checks. At 9:55 PM Eastern, all authentications were verified to be processing properly. Approximately 3,000 authentications were affected.

Additionally, from 1:37 AM to 2:37 AM Eastern on August 20, 2020, Authentication Logs were delayed in reaching the Duo Admin Panel and customer monitoring workflows, such as automated SIEM log consumption or other security tooling that relies on retrieving Authentication Logs in near-real-time from Duo’s APIs. Beginning at 2:38 AM, new logs were available. From 2:50 AM on, all data was available. No data was lost due to this incident.

The Engineering team continues to investigate the root cause of the issue that caused the Authentication Log reporting delay.

Details:

At 7:39 PM Eastern, automated monitoring informed the Engineering team of multiple failover events. To provide a high-availability service, Duo’s systems are set to automatically fail over to another availability zone if a failure occurs in the active zone.

At 7:47 PM Eastern, the Engineering team verified the automatic failover process had begun.

At 8:12 PM Eastern, AWS confirmed the issue.

At 8:18 PM Eastern, the Duo Engineering team observed that the automatic failover process had completed.

At 8:19 PM Eastern, the Engineering team began post-failover verification.

From 9:00 PM Eastern to 9:54 PM Eastern, the Duo Engineering team restarted unresponsive services that caused additional authentication delays and timeouts, as well as intermittent outages to the Duo Admin Panel.

At 9:55 PM Eastern, the Duo Engineering team verified that normal operations had resumed.

At 1:37 AM Eastern on August 20, the Engineering team was alerted to an issue with the system that processes Authentication Logs.

At 2:12 AM Eastern, the Engineering team restarted portions of the Authentication Log processing system that had run out of memory. The Engineering team continues to investigate the root cause of the issue that caused the Authentication Log reporting delay.

At 2:38 AM Eastern, the Engineering team took further action to restart processes that feed logs into the Duo Admin Panel. Logs then resumed processing normally.

Duo’s Engineering team has identified a path to significantly reducing recovery times for this type of failure and is working to implement those improvements.

Posted Aug 21, 2020 - 14:11 EDT

Resolved
From 1:37 AM Eastern to 2:38 AM Eastern, authentication logs were delayed in appearing in the Duo Admin Panel. This issue is now resolved and authentication logs are no longer delayed.

We will be posting a root-cause analysis (RCA) here once our engineering team has finished its thorough investigation of the issue.

Please make sure to check back or subscribe to be notified when the RCA is posted.
Posted Aug 20, 2020 - 00:10 EDT
Identified
Our Engineering Team has identified the cause of the authentication delays and timeouts on the affected deployments and is actively working to restore service.

Please check back here or subscribe to updates for any changes.
Posted Aug 19, 2020 - 20:56 EDT
Investigating
We are currently investigating an issue with authentication delays, timeouts, and issues loading the Duo Authentication Prompt iframe on multiple deployments. We are working to correct the issue as soon as possible.

Please check back here or subscribe to updates for any changes.
Posted Aug 19, 2020 - 20:12 EDT
This incident affected the Core Authentication Service, Push Delivery, Phone Call Delivery, and SMS Message Delivery components on the following deployments: DUO33, DUO14, DUO13, DUO37, DUO32, DUO16, DUO31, DUO2, DUO20, DUO21, DUO4, DUO28, DUO24, DUO23, DUO7, DUO41, DUO18, DUO15, DUO60, DUO44, DUO36, DUO1, DUO5, DUO10, DUO19, DUO12, DUO35, and DUO6.