Jul 19, 2024

Navigating The CrowdStrike and Microsoft Outages: Lessons For OT Industries

The most significant IT outage in recent times, which some security pundits are calling the largest in history, is affecting computers worldwide and has impacted various Operational Technology (OT) sectors, including airlines, hospitals, retailers, and other critical infrastructure. This disruption has highlighted the interconnected nature of modern OT systems and their dependence on reliable cybersecurity solutions. Here’s a look at what transpired, its impacts, and Armis’ take on a strategy for better preparedness in the future.

What Happened?

The widespread disruption stems from a flawed content update to CrowdStrike’s Falcon Sensor software, which is crucial for cybersecurity tasks. The update inadvertently caused significant issues on many Windows devices, resulting in massive disruptions across the OT industries that rely heavily on these systems.

Impact on OT Industries

Retail Sector: Retailers rely heavily on their IT systems to deliver seamless customer-facing service, protect customer data, manage point-of-sale systems, and ensure transaction integrity. The outage has left many retailers unable to process payments, with some resorting to ‘cash only’ signs in shop windows.

Rail Industry: Rail networks have experienced signaling outages, communication network failures, and ‘Blue Screen of Death’ (BSOD) errors on passenger information systems. The outages continue to cause significant delays and have likely raised the risk of cyber-attacks targeting this critical infrastructure during a vulnerable period.

Airports: Airports worldwide have been experiencing travel chaos, with many flights grounded, huge queues, and long delays; almost 2,000 flights have been canceled. Several airports and airlines have reported issues with their IT systems. All of this happened on what was predicted to be the busiest day of 2024 for air travel in multiple countries.

Emergency Services: 911 services in several states have been impacted, with many police, fire, and EMS agencies having to fall back to backup systems after computer-aided dispatch (CAD) systems were taken offline.

Specific Recommendations In Light Of The Recent Outages

  1. Map Your Dispersed And Complex Environment: Armis can help you map the impacted devices in your environment and identify the specific devices that require remediation. Armis can speed your return to normal operations by providing the physical location of the actual server or workstation (the switch and port), which greatly helps in tracking down assets that must be found and fixed manually. The following query lists all CrowdStrike versions active in the environment:
in:devices timeFrame:"1 Days" visibility:Full dataSource:(name:CrowdStrike)

 

Use the following query to identify devices with CrowdStrike that haven’t been seen since 2024-07-19T01:00:00:

in:devices visibility:Full dataSource:(name:CrowdStrike) operatingSystem:(name:Windows) !after:2024-07-19T01:00:00 timeFrame:"7 Days"
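
When the cutoff time or look-back window needs to change, it can help to assemble the query string programmatically rather than editing it by hand. The sketch below only composes the text of the query shown above; the function name and its parameters are hypothetical, and it makes no assumption about how the resulting string is submitted to Armis.

```python
from datetime import datetime


def build_unseen_devices_query(cutoff: datetime, window_days: int = 7,
                               data_source: str = "CrowdStrike",
                               os_name: str = "Windows") -> str:
    """Compose the 'not seen since cutoff' device query as plain text.

    Hypothetical helper: it mirrors the query syntax shown in this article
    and does not call any Armis API.
    """
    parts = [
        "in:devices",
        "visibility:Full",
        f"dataSource:(name:{data_source})",
        f"operatingSystem:(name:{os_name})",
        f"!after:{cutoff.strftime('%Y-%m-%dT%H:%M:%S')}",
        f'timeFrame:"{window_days} Days"',
    ]
    return " ".join(parts)


# Reproduce the query above for the 2024-07-19 01:00:00 cutoff.
query = build_unseen_devices_query(datetime(2024, 7, 19, 1, 0, 0))
print(query)
```

Keeping the cutoff as a `datetime` rather than a hand-typed string avoids formatting slips when the same query is re-run for a different time.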

 

  2. Prioritize What Matters Most: Your organization runs the critical infrastructure that keeps essential services running. Which assets could be impacted, and which are the most important to fix first? Organizations gain full situational awareness of all of their assets through network telemetry data (passive monitoring and Smart Active Queries), hundreds of pre-built integrations, and an AI-driven Asset Intelligence Engine tracking billions of assets globally. Based on insights derived from Armis Centrix™, prioritize the critical assets whose failure poses the greatest risk of operational downtime or to public safety. We also provide insight into the breadth of the business impact through a visual map showing which assets are communicating, or failing to communicate, with other assets that may be impacted. Review the connections that are trying to reach unavailable services; this helps uncover shadow IT servers so that someone can fix them manually.
  3. Assign Ownership And Return Operations To Normal Efficiently: Armis predicts and assigns the correct owner of each impacted asset, then uses embedded workflows to remediate, tracks progress with workflow tools, and measures the effectiveness of the remediation process.
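
The prioritization step above can be illustrated with a small ordering routine. This is a minimal sketch under stated assumptions, not how Armis Centrix™ scores assets: the `Asset` fields, the 1 to 5 criticality scale, and the sample device names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    criticality: int  # hypothetical scale: 1 (low) to 5 (safety-critical)
    online: bool      # False if the device has not been seen since the update
    dependents: int   # how many other assets communicate with this one


def remediation_order(assets):
    """Return offline assets, most critical and most depended-upon first."""
    offline = [a for a in assets if not a.online]
    # Negate the numeric keys so higher values sort first; name breaks ties.
    return sorted(offline, key=lambda a: (-a.criticality, -a.dependents, a.name))


assets = [
    Asset("pos-terminal-12", criticality=3, online=False, dependents=0),
    Asset("cad-dispatch-01", criticality=5, online=False, dependents=40),
    Asset("signage-07", criticality=1, online=False, dependents=0),
    Asset("file-server-02", criticality=3, online=False, dependents=25),
    Asset("hmi-03", criticality=5, online=True, dependents=4),
]
order = [a.name for a in remediation_order(assets)]
print(order)
```

Sorting on dependents as a secondary key reflects the point about communication maps: among equally critical assets, the one that more systems are trying to reach is usually the better one to fix first.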

Moving To The Future

Firstly, we want to acknowledge the amount of work required by teams across organizations worldwide to make the world safe again and to make hospitals, banks, grocery stores, travel, and more operational again. This was unexpected, and there will be countless unsung heroes who have helped save the day. This will continue to take a mammoth effort.

As we begin to enter the recovery stage for this incident, it is important to remember the following:

  1. Know Your Digital Real-Estate: Ensure you have full visibility of all assets and the communication paths between them so that when devices fail outright or fail to act as expected, notification and corrective action can follow immediately.
  2. Implement Rollback Mechanisms: Ensure systems can revert to a last known good state in case of problematic updates. While many organizations lack this capability, investing in such mechanisms can drastically reduce recovery times in future incidents.
  3. Automate Recovery Processes Where Possible: Although the workaround for this current incident requires manual intervention, organizations should strive to automate recovery processes where possible. This includes leveraging remediation tools that can prioritize, assign, and mitigate issues and apply fixes more efficiently across multiple systems.
  4. Engage in Industry Collaboration: Sharing information and strategies within and across industries can help organizations stay ahead of emerging threats and develop best practices for cybersecurity resilience. Armis helps with this by maintaining a device security data lake of over 5 billion assets. Combined with the ability to work in conjunction with your existing security stack, this creates an ecosystem of trust amongst systems, so that when an incident occurs, the sharing of information can speed identification and recovery.
  5. Leverage Bootable Recovery Media: For large-scale outages, having bootable recovery media on hand can expedite the repair process.
  6. Maintain Comprehensive Backups: Regularly updated backups and shadow copies can be a lifeline during severe disruptions. Ensure that backup protocols are rigorous and include critical system states.
  7. Diversify Detection Mechanisms: Particularly when IT and OT are operating in tandem, it is essential to have multi-detection capabilities. This includes early warning detection, which can detect incidents while they are still in the formulation stage. It also includes a multi-detection engine that looks for policy violations and anomalous behavior. For OT, it also requires a mix of passive detection and smart active querying, which can locate dormant devices that do not communicate over the network.
  8. Develop Comprehensive Incident Response Plans: Establish and regularly update incident response plans. These should include procedures for maintaining operations and protecting critical assets during cybersecurity outages.
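
As an illustration of the rollback recommendation in the list above, the sketch below keeps a last known good version and reverts to it when a health check fails after an update. It is a simplified model, not a description of any vendor's update mechanism: the class, the version labels, and the health check are hypothetical, and it assumes the check runs somewhere that survives a failed update (for example, an external supervisor), which an in-process check on a crashed host could not do.

```python
class VersionedConfig:
    """Tracks the active version and the last version known to be healthy."""

    def __init__(self, initial):
        self.active = initial
        self.last_known_good = initial

    def apply_update(self, candidate, health_check):
        """Activate candidate; revert if the check fails. Returns True on success."""
        self.active = candidate
        try:
            healthy = bool(health_check(candidate))
        except Exception:
            # A check that raises is treated the same as a failed check.
            healthy = False
        if healthy:
            self.last_known_good = candidate
            return True
        self.active = self.last_known_good
        return False


def boots_cleanly(version):
    # Hypothetical stand-in for a real post-update health probe.
    return version != "content-v2-bad"


cfg = VersionedConfig("content-v1")
first = cfg.apply_update("content-v2-bad", boots_cleanly)
after_bad = cfg.active
second = cfg.apply_update("content-v3", boots_cleanly)
print(first, after_bad, second, cfg.active)
```

The design choice worth noting is that the last known good pointer only advances after the health check passes, so a failed update can never overwrite the state being rolled back to.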

Cyber Resilience is Foundational in OT

The CrowdStrike and Microsoft outages are a stark reminder of the critical importance of cybersecurity in OT industries. By learning from this incident and implementing robust, multi-faceted cybersecurity strategies, organizations can better protect themselves against future disruptions. The path forward requires adaptability and a commitment to continuously improving cybersecurity practices. This incident should catalyze a broader industry push towards greater resilience and preparedness in the face of an increasingly complex cyber threat landscape.

Stay tuned for further updates as this story develops.