This message is a follow-up to the postmortem published August 21. It concerns a delay in Authentication Log reporting that occurred on August 20 from 1:38 AM to 2:37 AM Eastern (0538 UTC to 0637 UTC), after the authentication delays described in the postmortem.
The following information is for customers using a SIEM or other security tooling via the Duo Admin API. These customers will need to perform manual steps to backfill logs for the affected time period into their SIEM or other system.
Step 1: Download the log export script from Duo.
From our GitHub project page, go to Code > Download ZIP to download the entire project folder.
This folder contains both the export script, authlog_export.py, and a requirements.txt file that lists the dependencies the script needs to run (install them with pip install -r requirements.txt).
Step 2: Run the script to export your logs.
Execute the script with python authlog_export.py or python3 authlog_export.py
Input your IKEY, SKEY, and host to connect to your Admin API integration.
Specify the directory where you wish to write the logs.
Provide the following start and end time values to fetch the logs for the affected time period. Both values are in UTC and correspond to the incident period of about 1:38 a.m. to 2:37 a.m. ET.
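If the values need to be entered as Unix timestamps, the incident window stated in this notice (0538 UTC to 0637 UTC on August 20, 2020) can be converted with a short sketch like the one below. This is only an illustration: the input format the export script prompts for is not shown here, so check the script's own prompts before using either the seconds or the milliseconds form. The helper names to_epoch_seconds and to_epoch_millis are hypothetical and are not part of the Duo script.

```python
from datetime import datetime, timezone

# Incident window in UTC, taken from the times stated in this notice.
WINDOW_START_UTC = datetime(2020, 8, 20, 5, 38, tzinfo=timezone.utc)
WINDOW_END_UTC = datetime(2020, 8, 20, 6, 37, tzinfo=timezone.utc)


def to_epoch_seconds(moment: datetime) -> int:
    """Return whole seconds since the Unix epoch for a timezone-aware datetime."""
    if moment.tzinfo is None:
        # A naive datetime would be interpreted in local time, which could
        # silently shift the window, so reject it.
        raise ValueError("expected a timezone-aware datetime")
    return int(moment.timestamp())


def to_epoch_millis(moment: datetime) -> int:
    """Return milliseconds since the Unix epoch (assumed alternative format)."""
    return to_epoch_seconds(moment) * 1000


if __name__ == "__main__":
    print("start (s): ", to_epoch_seconds(WINDOW_START_UTC))
    print("end (s):   ", to_epoch_seconds(WINDOW_END_UTC))
    print("start (ms):", to_epoch_millis(WINDOW_START_UTC))
    print("end (ms):  ", to_epoch_millis(WINDOW_END_UTC))
```

Using timezone-aware datetimes keeps the conversion independent of the local time zone of the machine running it.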
Step 3: Import the downloaded data into your SIEM or other external system.
Follow your usual workflow for manually importing data into your system. Here is a sample set of instructions for Splunk, which uses version 1 of the API.
Click on Next, and then expand the Timestamp settings in the left sidebar to make the following selections:
Click on Next and then from the index dropdown, select Duo as the index to import the data to.
Review your settings and then submit to import the data.
After importing the data, review it for any duplicate entries. Your SIEM or other tool may automatically detect duplicates. Duplicates would be caused by differences in how your system pulls logs from the Duo API for the given time period and when the missing data occurred for your unique system.
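If your tool does not detect duplicates on its own, a check along the following lines can be run on the exported records before import. This is a minimal sketch under stated assumptions: it assumes each exported log entry has been loaded as a dictionary, and the field names in DEFAULT_KEY_FIELDS are hypothetical placeholders, not the confirmed export schema. If your records carry a unique event or transaction identifier, key on that field instead.

```python
import json
from typing import Any, Dict, Iterable, List, Sequence

# Assumption: these fields together identify one authentication event.
# Replace them with the fields present in your exported data.
DEFAULT_KEY_FIELDS: Sequence[str] = ("timestamp", "username", "factor", "result")


def drop_duplicates(
    records: Iterable[Dict[str, Any]],
    key_fields: Sequence[str] = DEFAULT_KEY_FIELDS,
) -> List[Dict[str, Any]]:
    """Return records with later duplicates removed, preserving input order."""
    seen = set()
    unique: List[Dict[str, Any]] = []
    for record in records:
        # Serialize the key fields so nested values (dicts, lists) are hashable.
        key = json.dumps(
            [record.get(field) for field in key_fields],
            sort_keys=True,
            default=str,
        )
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


if __name__ == "__main__":
    sample = [
        {"timestamp": 1597901900, "username": "alice", "factor": "push", "result": "success"},
        {"timestamp": 1597901900, "username": "alice", "factor": "push", "result": "success"},
        {"timestamp": 1597902000, "username": "bob", "factor": "push", "result": "denied"},
    ]
    print(len(drop_duplicates(sample)), "unique records")
```

The first occurrence of each event is kept, so records already present in your system can be listed first and the backfilled export second.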
Incident Report - 2020/08/19 - 2020/08/20
At 7:35 PM Eastern on August 19, 2020, users began experiencing authentication delays or timeouts. Duo utilizes many cloud partners as part of our SaaS platform, including Amazon Web Services (AWS). Amazon experienced connectivity issues affecting systems in a single availability zone. This event initiated automated failover to another availability zone.
At 8:18 PM Eastern, the Duo Engineering team observed that the failover process had completed. At 8:19 PM Eastern, the Engineering team initiated post-failover checks. At 9:55 PM Eastern, all authentications were verified to be processing properly. Approximately 3,000 authentications were affected.
Additionally, from 1:37 AM to 2:37 AM Eastern on August 20, 2020, Authentication Logs were delayed in reaching the Duo Admin Panel and customer monitoring workflows, such as automated SIEM log consumption or other security tooling that relies on retrieving Authentication Logs in near-real-time from Duo’s APIs. Beginning at 2:38 AM, new logs were available. From 2:50 AM on, all data was available. No data was lost due to this incident.
The Engineering team continues to investigate the root cause of the issue that caused the Authentication Log reporting delay.
At 7:39 PM Eastern, automated monitoring informed the Engineering team of multiple failover events. To provide a highly available service, Duo’s systems are set to automatically fail over to another availability zone if a failure occurs in the active zone.
At 7:47 PM Eastern, the Engineering team verified the automatic failover process had begun.
At 8:12 PM Eastern, AWS confirmed the issue.
At 8:18 PM Eastern, the Duo Engineering team observed that the automatic failover process had completed.
At 8:19 PM Eastern, the Engineering team began post-failover verification.
From 9:00 PM Eastern to 9:54 PM Eastern, the Duo Engineering team restarted unresponsive services that caused additional authentication delays and timeouts, as well as intermittent outages to the Duo Admin Panel.
At 9:55 PM Eastern, the Duo Engineering team verified that normal operations had resumed.
At 1:37 AM Eastern on August 20, the Engineering team was alerted to an issue with the system that processes Authentication Logs.
At 2:12 AM Eastern, the Engineering team restarted portions of the Authentication Log processing system that had run out of memory.
At 2:38 AM Eastern, the Engineering team took further action to restart processes that feed logs into the Duo Admin Panel. Logs then resumed processing normally.
Duo’s Engineering team has identified a path to significantly reducing recovery times for this type of failure and is working to implement those improvements.