Providing Out-of-Band Connectivity to Mission-Critical IT Resources


Why Out-of-Band Management Is Critical to AI Infrastructure

Out-of-Band Management for AI

Artificial intelligence is transforming every corner of industry. Machine learning algorithms are optimizing global logistics, while generative AI tools like ChatGPT are reshaping everyday work and communications. Organizations are rapidly adopting AI, with the global AI market expected to reach $826 billion by 2030, according to Statista. While this growth is reshaping operations and outcomes for organizations in every industry, it brings significant challenges for managing the infrastructure that supports AI workloads.

The Rapid Growth of AI Adoption

AI is no longer a technology that lives only in science fiction. It’s real, and it has quickly become crucial to business strategy and the overall direction of many industries. Gartner reports that 70% of enterprise executives are actively exploring generative AI for their organizations, and McKinsey highlights that 72% of companies have already adopted AI in at least one business function.

It’s easy to understand why organizations are rapidly adopting AI. Here are a few examples of how AI is transforming industries:

  • Healthcare: AI-driven diagnostic tools have improved disease detection rates by up to 30x, while drug discovery timelines are being slashed from years to months.
  • Retail: E-commerce platforms use AI to power personalized recommendations, leading to a revenue increase of 5-25%.
  • Manufacturing: AI in predictive maintenance can help increase productivity by 25%, lower maintenance costs by 25%, and reduce machine downtime by 70%.

AI is a powerful tool that can bring profound outcomes wherever it’s used. But it requires a sophisticated infrastructure of power distribution, cooling systems, GPU-dense compute, servers, and networking gear, and the challenge lies in managing that infrastructure.

Infrastructure Challenges Unique to AI

AI environments are complex, with workloads that are both resource-intensive and latency-sensitive. This means organizations face several challenges that are unique to AI:

 

  1. Skyrocketing Energy Demands: AI racks consume between 40kW and 200kW of power, which is 10x more than traditional IT equipment. Energy efficiency in the AI data center is a top priority, especially as data centers account for 1% of global electricity consumption.
  2. Cost of Downtime: AI systems are especially vulnerable to interruptions, which can cause a ripple effect and lead to high costs. A single server failure can disrupt entire model training processes, costing enterprises $9,000 per minute in downtime, as estimated by Uptime Institute.
  3. Cybersecurity Risks: AI processes sensitive data, making AI data centers prime targets for attack. Sophos reports that in 2024, 59% of organizations suffered a ransomware attack, and the average cost to recover (excluding ransom payment) was $2.73 million.
  4. Operational Complexity: AI environments rely on a diverse set of hardware and software systems. Monitoring and managing these components effectively requires real-time visibility into thermal conditions, humidity, particulates, and other environmental and device-related factors.

The Role of Out-of-Band Management in AI

Out-of-band (OOB) management is a must-have for organizations scaling their AI capabilities. Unlike traditional in-band systems that rely on the production network, OOB operates independently to give teams uninterrupted access and control. Teams can remotely monitor and maintain AI infrastructure, troubleshoot issues, and perform complete system recovery even if the production network goes offline.

 

How OOB Management Solves Key Challenges:

  • Minimized Downtime: With OOB, IT teams can drastically reduce downtime by troubleshooting issues remotely rather than dispatching teams on-site.
  • Energy Efficiency: Real-time monitoring and optimization of power distribution enable organizations to eliminate zombie servers and other inefficiencies.
  • Enhanced Security: OOB systems isolate management traffic from production networks per CISA’s best practice recommendations, which reduces the attack surface and mitigates cybersecurity risks.
  • Operational Efficiency: Remote monitoring via OOB offers a complete view of environmental conditions and device health, so teams can operate proactively and prevent issues before failures happen.
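To make the monitoring side of this concrete, here is a minimal sketch of polling an out-of-band appliance for environmental and power readings and flagging anything over threshold. The REST endpoint, hostname, token, and metric names are hypothetical placeholders; an actual deployment would use the vendor's documented API.

"""Minimal sketch: poll an OOB appliance for environmental and power readings.

Assumes a hypothetical REST endpoint (/api/v1/sensors) and token auth;
real appliances expose their own APIs -- check your vendor's documentation.
"""
import requests

OOB_HOST = "https://oob-mgmt.example.internal"   # reachable only on the management network
API_TOKEN = "REPLACE_ME"                         # placeholder credential

THRESHOLDS = {"inlet_temp_c": 32.0, "humidity_pct": 70.0, "rack_power_kw": 40.0}

def poll_sensors():
    resp = requests.get(
        f"{OOB_HOST}/api/v1/sensors",            # hypothetical path
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()                           # e.g., {"inlet_temp_c": 29.5, ...}

def check(readings):
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = readings.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value} exceeds {limit}")
    return alerts

if __name__ == "__main__":
    for alert in check(poll_sensors()):
        print("ALERT:", alert)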

Use Cases: Out-of-Band Management for AI

There’s no shortage of use cases for AI, but organizations often overlook implementing out-of-band in their environment. Aside from using OOB in AI data centers, here are some real-world use cases of out-of-band management for AI.

1. Autonomous Vehicle R&D

Developers of self-driving technology find it difficult to manage their high-density AI clusters, especially because outages delay testing and development. By implementing OOB management, these developers can reduce recovery times from hours to minutes and shorten development timelines.

2. Financial Services Firms

Banks deploy AI to detect and combat fraud, but these power-hungry systems often lead to inefficient energy usage in the data center. With OOB management, they can gain transparency into GPU and CPU utilization. Not only can they eliminate energy waste, but they can optimize resources to improve model processing speeds.
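As a rough illustration of that kind of transparency, the sketch below queries per-GPU utilization and power draw with nvidia-smi and flags boards that sit idle while still drawing significant power. The thresholds are arbitrary examples, and in an OOB setup the results would be reported back over the management network.

"""Minimal sketch: spot under-utilized GPUs that still draw power.

Assumes nvidia-smi is installed on the host being checked.
"""
import subprocess

def gpu_stats():
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, util, power = [field.strip() for field in line.split(",")]
        yield int(idx), float(util), float(power)

if __name__ == "__main__":
    for idx, util, power in gpu_stats():
        if util < 5 and power > 50:              # arbitrary example thresholds
            print(f"GPU {idx}: only {util}% utilized but drawing {power} W")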

3. University AI Labs

Universities run AI research on supercomputers, but this strains the underlying infrastructure with high temperatures that can cause failures. OOB management can provide real-time visibility into air temperature, device fan speed, and cooling systems to prevent infrastructure failures.

Download Our Guide, Solving AI Infrastructure Challenges with Out-of-Band Management

Out-of-band management is the key to having reliable, high-performing AI infrastructure. But what does it look like? What devices does it work with? How do you implement it?

Download our whitepaper Solving AI Infrastructure Challenges with Out-of-Band Management for answers. You’ll also get Nvidia’s SuperPOD reference design along with a list of devices that integrate with out-of-band. Click the button for your instant download.

What is FIPS 140-3, and Why Does it Matter?


Handling sensitive information is a responsibility shared by so many organizations. Ensuring the security of data, whether in transit or at rest, is not only critical for maintaining the trust of end users and customers, but is often a regulatory requirement. One of the most reliable ways to secure data within network infrastructure is by implementing FIPS 140-3-certified cryptographic solutions. This certification, which was developed by the National Institute of Standards and Technology (NIST), serves as a benchmark for robust encryption practices, enabling organizations to meet high security standards and ensure regulatory compliance.

Let’s explore what it means to have FIPS 140-3 certification, why it matters, and its key applications in network infrastructure.

What is FIPS 140-3 Certification?

The Federal Information Processing Standard (FIPS) 140-3 certification is a stringent, government-endorsed security standard that sets guidelines for cryptographic modules used to protect sensitive data. It includes requirements for securing cryptographic functions within hardware, software, and firmware. The certification process rigorously tests cryptographic solutions for security and reliability, ensuring that they meet specific criteria in data encryption, access control, and physical security.

There are four levels of FIPS 140-3 certification, each adding layers of protection to help secure information in various environments:

  • Level 1: Ensures basic encryption standards.
  • Level 2: Adds tamper-evident protection and role-based authentication.
  • Level 3: Provides advanced tamper-resistance and strong user authentication.
  • Level 4: Offers the highest level of security, including physical defenses against tampering.

FIPS 140-3 certification ensures that an organization’s network infrastructure meets high standards for cryptographic security. This is important for protecting sensitive information against cyber threats as well as fulfilling regulatory requirements.

Why FIPS 140-3 Certification Matters

1. Meeting Regulatory Compliance Requirements

FIPS 140-3 certification is often required by regulatory bodies, especially in sectors like government/defense, healthcare, and finance, where sensitive data must be protected by law. Here are a few industry-specific regulations that FIPS 140-3-certified modules help with:

  • Defense: DFARS, NIST SP 800-171
  • Healthcare: HIPAA
  • Finance: PCI-DSS
  • Energy: NERC CIP
  • Education: FERPA

Compliance with FIPS 140-3 also makes it easier for organizations to meet audit requirements, reducing the risk of fines or penalties for security lapses.

2. Strengthening Customer Trust

End users and customers expect that their data is handled with care and protected against breaches. By using FIPS 140-3-certified solutions, organizations can demonstrate their commitment to securing customer data with recognized, government-endorsed security standards. FIPS certification is a valuable trust signal, showing customers that their information is being managed with the highest level of protection available.

3. Protecting Against Emerging Cyber Threats

Relying on uncertified or outdated cryptographic solutions increases the risk of data breaches. FIPS 140-3-certified solutions are tested to withstand advanced attacks and tampering, which is an important safeguard against threats that continue to evolve in complexity. Certified modules help prevent unauthorized access to sensitive data, whether through intercepted communications, phishing, or other cyber threats.

FIPS 140-3 certification gives assurance, especially for organizations that handle high volumes of data, that they have adequate encryption to protect against sophisticated attacks.

4. Ensuring Business Continuity and Operational Resilience

According to IBM’s Cost of a Data Breach Report 2024, data breaches now cost $4.88 million (global average), with healthcare being the most costly at $9.8 million per breach. The financial impact is staggering, but the ongoing operational disruption and recovery efforts determine whether an organization can fully bounce back from a breach. With FIPS 140-3 certification, there’s an added layer of resilience to an organization’s infrastructure, which reduces the likelihood of breaches and ensures a secure base for maintaining continuity (such as through an Isolated Recovery Environment). By implementing FIPS-certified encryption, businesses can minimize downtime, maintain access to encrypted systems, and recover more smoothly from potential incidents.

5. Gaining a Competitive Advantage in Security-Conscious Markets

Organizations that follow rigorous data security standards are more likely to gain the trust of clients, stakeholders, and customers, especially in industries where security is non-negotiable. Organizations that adopt FIPS 140-3-certified infrastructure can differentiate themselves as having a reputation for security, which can be a competitive advantage that attracts customers and partners who value data protection.

Key Applications of FIPS 140-3 in Network Infrastructure

For organizations managing large amounts of customer data, FIPS 140-3-certified solutions can be applied to several critical areas within network infrastructure:

  • Network Firewalls and VPNs: FIPS-certified encryption ensures that data moving across networks remains private, protecting it from interception by unauthorized users.
  • Access Control Systems: Identity-based access controls with FIPS-certified modules add another layer of security to protect against unauthorized access to sensitive data.
  • Out-of-Band Management: Using FIPS 140-3-certified encryption in OOB management ensures the same stringent security level for OOB traffic as for in-band network traffic.
  • Data Storage and Backup: FIPS-certified encryption secures data at rest, protecting stored customer information from unauthorized access or tampering.
  • Cloud and Hybrid Environments: For companies using cloud or hybrid environments, FIPS-certified encryption helps protect data across multiple infrastructure layers, ensuring consistent security whether data resides on-premises or in the cloud.
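As a simple illustration of the data-at-rest case, the sketch below encrypts a blob with AES-256-GCM using the Python cryptography library. AES-GCM is a FIPS-approved algorithm, but actual FIPS 140-3 compliance depends on the underlying cryptographic module being a validated build, not on the algorithm choice alone.

"""Minimal sketch: encrypting data at rest with AES-256-GCM.

Note: FIPS 140-3 compliance comes from running a *validated* cryptographic
module underneath this library, not from the algorithm choice alone.
"""
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_blob(key: bytes, plaintext: bytes, aad: bytes = b"backup-v1") -> bytes:
    nonce = os.urandom(12)                        # unique per message
    return nonce + AESGCM(key).encrypt(nonce, plaintext, aad)

def decrypt_blob(key: bytes, blob: bytes, aad: bytes = b"backup-v1") -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, aad)

if __name__ == "__main__":
    key = AESGCM.generate_key(bit_length=256)     # store in an HSM/KMS in practice
    blob = encrypt_blob(key, b"customer record")
    assert decrypt_blob(key, blob) == b"customer record"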

Discuss FIPS 140-3 With Our Network Infrastructure Experts

FIPS 140-3 certification gives organizations the ability to reassure customers, meet compliance requirements, and protect critical data across every layer of the network. Get in touch with our network infrastructure experts to discuss FIPS 140-3, isolated management infrastructure, and other resilience best practices.

Explore FIPS 140-3 for Out-of-Band Management

Read about 7 benefits of implementing FIPS 140-3 across your out-of-band management infrastructure. This article discusses the benefits it brings to remotely accessing devices, protecting against physical attacks, and securing edge infrastructure.

7 Security Benefits of Implementing FIPS 140-3 for Out-of-Band Management


Out-of-band (OOB) management is essential for maintaining control over critical network infrastructure, especially during outages or cyberattacks. This separate management network enables administrators to remotely access, troubleshoot, and recover production equipment. However, managing network devices outside the main data path also brings unique security challenges, as these channels often carry sensitive control data and system access credentials.

Implementing FIPS 140-3-certified encryption within OOB systems can help organizations secure this vital access path to ensure that management data can’t be intercepted or manipulated by unauthorized actors. Here’s how FIPS 140-3 certification can enhance the security, reliability, and compliance of your out-of-band management.

What is FIPS 140-3 Certification?

FIPS (Federal Information Processing Standard) 140-3 is a high-level security standard developed by the National Institute of Standards and Technology (NIST). It specifies rigorous requirements for cryptographic modules used to protect sensitive data. FIPS 140-3 certification covers everything from data encryption to user authentication and physical security. For out-of-band management, FIPS 140-3 certification ensures that cryptographic components in hardware, software, and firmware meet stringent data security standards.

By implementing FIPS-certified solutions, organizations can ensure their OOB management is resilient against modern cyber threats, protecting both the control channels and the sensitive data they carry. Here are seven security benefits of implementing FIPS 140-3 for out-of-band management.

7 Security Benefits of Implementing FIPS 140-3 for Out-of-Band Management

1. Secure Encryption of Management Traffic

OOB management often involves remote access to routers, switches, servers and other critical devices. FIPS 140-3 certification guarantees that all cryptographic modules used in these systems have been rigorously tested to secure data in transit. Encrypting management traffic is crucial to prevent interception or manipulation by unauthorized users, particularly for tasks such as command execution, configuration updates, and device monitoring.

With FIPS-certified encryption, companies can protect OOB traffic between management devices and network components, so that only authorized administrators have access to sensitive system commands and device settings.
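As a rough sketch of what FIPS-friendly transport settings look like in practice, the example below pins a management session to TLS 1.2 or later and AES-GCM cipher suites. The hostname is a placeholder, and the session is only truly FIPS 140-3 compliant if the underlying TLS library is itself a validated module.

"""Minimal sketch: pin an OOB management session to TLS 1.2+ and AES-GCM suites."""
import socket
import ssl

def fips_friendly_context() -> ssl.SSLContext:
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Applies to TLS 1.2; TLS 1.3 suites (all AEAD) are negotiated separately.
    ctx.set_ciphers("ECDHE+AESGCM:DHE+AESGCM:!aNULL:!MD5")
    return ctx

if __name__ == "__main__":
    host = "oob-console.example.internal"         # placeholder management address
    with socket.create_connection((host, 443), timeout=10) as sock:
        with fips_friendly_context().wrap_socket(sock, server_hostname=host) as tls:
            print("negotiated:", tls.version(), tls.cipher())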

2. Enhanced Authentication and Access Control

OOB management solutions typically support different user roles, each with its own access privileges. FIPS 140-3-certified modules, like ZPE Systems’ Nodegrid, feature multi-factor authentication (MFA) to control who can initiate OOB management sessions. Certified solutions also include secure key management practices that prevent unauthorized access, ensuring that only verified users can control and modify network devices.

These protections mean FIPS-certified solutions help mitigate the risk of unauthorized users accessing high-value assets. This is especially important during ransomware recovery efforts, when teams need to launch a secure, Isolated Recovery Environment to combat an active attack in a compromised environment.
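For illustration only, here is a minimal sketch of gating an OOB login behind TOTP-based multi-factor authentication using the pyotp library. Certified platforms such as Nodegrid provide MFA natively; this just shows the concept of verifying a second factor before a session is allowed.

"""Minimal sketch: gating an OOB session behind TOTP-based MFA (conceptual only)."""
import pyotp

def enroll_user() -> str:
    # In practice the secret is provisioned into the user's authenticator app
    # and stored server-side in protected storage.
    return pyotp.random_base32()

def verify_login(secret: str, submitted_code: str) -> bool:
    # valid_window=1 tolerates one time-step of clock drift
    return pyotp.TOTP(secret).verify(submitted_code, valid_window=1)

if __name__ == "__main__":
    secret = enroll_user()
    code = pyotp.TOTP(secret).now()               # simulate the authenticator app
    print("session allowed" if verify_login(secret, code) else "denied")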

3. Protection Against Tampering and Physical Attacks

Many organizations deploy IT infrastructure in locations where physical device security is lacking. For example, remote colocation facilities, unmonitored drilling sites, or rural health clinics can easily expose network infrastructure to tampering. FIPS 140-3 certification mandates tamper-evident and tamper-resistant features to protect the cryptographic modules used in OOB systems. OOB solutions like ZPE Systems’ Nodegrid provide robust protection against tampering, with features including:

  • UEFI secure boot: Prevents the execution of unauthorized software during the boot process.
  • TPM 2.0: Ensures secure key generation and storage, so only authorized software can run.
  • Secure erase: Allows for deletion of all data from storage, so no data can be recovered from devices that have been tampered with.

These features prevent unauthorized individuals from physically accessing OOB equipment to intercept or modify management traffic. In remote and edge locations, FIPS-certified cryptographic modules provide robust protection against physical attacks, making it harder for adversaries to compromise OOB management pathways.

4. Compliant and Secure Logging of Access Activities

Because OOB management systems provide access to critical equipment, organizations need transparency into OOB users and their management activities. This means logging and auditing are essential to maintaining security and compliance. FIPS 140-3-certified modules support secure logging of all management activities, creating a clear audit trail of access attempts and security events. These logs are stored securely to prevent unauthorized users from altering or erasing them, providing valuable insights for security monitoring and incident response.

Secure logging is not only critical for monitoring access but also necessary for meeting regulatory compliance. FIPS 140-3 ensures that OOB management systems can satisfy audit requirements, making compliance easier and protecting organizations from potential regulatory penalties.
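To show what tamper-evident logging means in principle, the sketch below chains audit entries together with SHA-256 hashes so that altering any earlier entry breaks verification. This is a conceptual illustration, not a description of how any particular OOB product stores its logs.

"""Minimal sketch: a hash-chained audit log, so any edit to an earlier entry
breaks the chain."""
import hashlib
import json
import time

def append_entry(log: list, user: str, action: str) -> None:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {"ts": time.time(), "user": user, "action": action, "prev": prev_hash}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(entry)

def verify_chain(log: list) -> bool:
    prev_hash = "0" * 64
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        payload = json.dumps({k: entry[k] for k in ("ts", "user", "action", "prev")},
                             sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

if __name__ == "__main__":
    log = []
    append_entry(log, "admin", "opened console session to core-sw-01")
    append_entry(log, "admin", "power-cycled PDU outlet 4")
    print("chain intact:", verify_chain(log))

Periodically exporting the latest hash to external, write-once storage strengthens this further, since a would-be tamperer cannot rewrite history without the mismatch becoming visible.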

5. Meeting Regulatory Requirements in Sensitive Environments

Many industries handle sensitive data, especially government, healthcare, and finance. For organizations in these industries, it’s often mandatory to use FIPS-certified cryptographic solutions. FIPS 140-3 certification helps OOB management systems align with federal security regulations and standards like HIPAA and PCI-DSS. By deploying FIPS-certified encryption, organizations can comply with these standards, streamline audits, reduce the risk of regulatory penalties, and reinforce trust with customers.

6. Consistent Security Across Main and OOB Networks

It’s easy for organizations to focus mostly on securing the main network, while overlooking the security protections that they employ on their out-of-band network. FIPS-certified solutions help establish consistent security standards across both paths. This is especially important in protecting against lateral attacks, where hackers infiltrate one network and are then able to jump to the other. In cases where attackers gain access to one segment of the network, matching security protocols across the main and OOB networks prevents them from moving laterally into sensitive management channels.

Using FIPS 140-3-certified encryption across both networks also strengthens the organization’s ability to monitor, manage, and control devices, even when the primary network is under threat.

7. Securing Remote and Edge Devices

For organizations with remote infrastructure, such as telecom and retail, OOB management is critical for managing network devices in distant locations. However, these environments often lack the physical security of centralized data centers, making them vulnerable to tampering. FIPS-certified solutions ensure that all communication with remote OOB devices is encrypted, which protects management data from unauthorized access.

FIPS 140-3 certification also supports the resilience of IoT and edge devices, which often require OOB management for secure monitoring, patching, and configuration.

Implement the Most Secure Out-of-Band Management with ZPE Systems

Security in Layers

ZPE Systems’ Nodegrid is the industry’s most secure out-of-band management solution. Not only do we carry FIPS 140-3, SOC 2 Type 2, and ISO 27001 certifications, but we also feature a Synopsys-validated codebase and dozens of security features across the hardware, software, and cloud layers. These are all part of a multi-layered, secure-by-design approach that ensures the strongest physical and cyber safeguards.

Download our pdf to explore more of our security assurance.

See FIPS-Certified Out-of-Band in Action

Our engineers are ready to walk you through our industry-leading out-of-band management. Use the button below to set up a 15-minute demo and explore FIPS 140-3 security features first-hand.

The CrowdStrike Outage: How to Recover Fast and Avoid the Next Outage


 

On July 19, 2024, CrowdStrike, a leading cybersecurity firm renowned for its advanced endpoint protection and threat intelligence solutions, experienced a significant outage that disrupted operations for many of its clients. This outage, triggered by a software upgrade, resulted in crashes for Windows PCs, creating a wave of operational challenges for banks, airports, enterprises, and organizations worldwide. This blog post explores what transpired during this incident, what caused the outage, and the broader implications for the cybersecurity industry.

What happened?

The incident began on the morning of July 19, 2024, when numerous CrowdStrike customers started reporting issues with their Windows PCs. Users experienced the BSOD (blue screen of death), which is when Windows crashes and renders devices unusable. As the day went on, it became evident that the problem was widespread and directly linked to a recent software upgrade deployed by CrowdStrike.

Timeline of Events

  1. Initial Reports: Early in the day, airports, hospitals, and critical infrastructure operators began experiencing unexplained crashes on their Windows PCs. The issue was quickly reported to CrowdStrike’s support team.
  2. Incident Acknowledgement: CrowdStrike acknowledged the issue via their social media channels and direct communications with affected clients, confirming that they were investigating the cause of the crashes.
  3. Root Cause Analysis: CrowdStrike’s engineering team worked diligently to identify the root cause of the problem. They soon determined that a software upgrade released the previous night was responsible for the crashes.
  4. Mitigation Efforts: Upon isolating the faulty software update, CrowdStrike issued guidance on how to roll back the update and provided patches to fix the issue.

What caused the CrowdStrike outage?

The root cause of the outage was a software upgrade intended to enhance the functionality and security of CrowdStrike’s Falcon sensor endpoint protection platform. However, this upgrade contained a bug that conflicted with certain configurations of Windows PCs, leading to system crashes. Several factors contributed to the incident:

  1. Insufficient Testing: The software update did not undergo adequate testing across all possible configurations of Windows PCs. This oversight meant that the bug was not detected before the update was deployed to customers.
  2. Complex Interdependencies: The incident highlights the complex interdependencies between software components and operating systems. Even minor changes can have unforeseen impacts on system stability.
  3. Rapid Deployment: In the cybersecurity industry, quick responses to emerging threats are crucial. However, the pressure to deploy updates rapidly can sometimes lead to insufficient testing and quality assurance processes.

We need to remember one important fact: whether software is written by humans or AI, there will be mistakes in coding and testing. When an issue slips through the cracks, the customer’s lab is the last line of defense to catch it. Usually, this is done with a controlled rollout, where the IT team first upgrades their lab equipment, performs further testing, puts a rollback plan in place, and then pushes the update to a less critical site. But in a cloud-connected SaaS world, the customer is no longer in control. That’s why customers sign waivers stating that if such an incident occurs, the vendor that caused the problem is not liable. Experts say the only way to address this challenge is to have infrastructure that’s designed, deployed, and operated for resilience. We discuss this architecture further down in this article.
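To illustrate what a controlled rollout can look like in practice, here is a minimal sketch of a ring-based deployment with health checks and automatic rollback. The deploy(), healthy(), and rollback() functions are placeholders for whatever endpoint-management tooling you use; the point is the lab-first, stop-on-failure control flow.

"""Minimal sketch of a ring-based (canary) rollout with automatic rollback."""
import time

RINGS = [
    ["lab-01"],                                   # ring 0: test lab
    ["branch-tx-04", "branch-nv-02"],             # ring 1: low-risk sites (placeholders)
    ["dc-east", "dc-west"],                       # ring 2: critical sites (placeholders)
]

def deploy(site: str, version: str) -> None:      # placeholder for your tooling
    print(f"deploying {version} to {site}")

def healthy(site: str) -> bool:                   # placeholder health probe
    return True

def rollback(site: str) -> None:                  # placeholder rollback action
    print(f"rolling back {site}")

def staged_rollout(version: str, soak_seconds: int = 5) -> bool:
    for ring in RINGS:
        for site in ring:
            deploy(site, version)
        time.sleep(soak_seconds)                  # soak period before judging health
        failed = [site for site in ring if not healthy(site)]
        if failed:
            for site in failed:
                rollback(site)
            print(f"halting rollout of {version}; failures: {failed}")
            return False
    return True

if __name__ == "__main__":
    staged_rollout("agent-update-2024.07")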

How to recover from the CrowdStrike outage

CrowdStrike gives two options for recovering:

  • Option 1: Reboot in Safe Mode – Reboot the affected device in Safe Mode, locate and delete the file “C-00000291*.sys”, and then restart the device.
  • Option 2: Re-image – Download and configure the recovery utility to create a new Windows image, add this image to a USB drive, and then insert this USB drive into the target device. The utility will automatically find and delete the file that’s causing the crash.

The biggest obstacle, and the one costing organizations the most time and money, is that both recovery methods require IT staff to be physically present at each affected device. They have to go one by one, manually remediating via Safe Mode or physically inserting the USB drive. What makes this more difficult is that many organizations use physical and software/management security controls to limit access: locked device cabinets slow down physical access to devices, and things like role-based access policies and disk encryption can make Safe Mode unusable. Because this outage is affecting more than 8.5 million computers, this kind of work won’t scale efficiently. That’s why organizations are turning to Isolated Management Infrastructure (IMI) and the Isolated Recovery Environment (IRE).

How IMI and IRE help you recover faster

IMI is a dedicated control plane network that’s meant for administration and recovery of IT systems, including Windows PCs affected by the CrowdStrike outage. It uses the concept of out-of-band management, where you deploy a management device that connects to the dedicated management ports of your IT infrastructure (e.g., serial ports, IPMI ports, and other Ethernet management ports). IMI also allows you to deploy recovery services for your digital estate that are immutable and near-line, ready when recovery needs to take place.

IMI does not rely on production assets at all: it has its own dedicated remote access via WAN links like 4G/5G, and it can contain and encrypt recovery keys and tools with zero trust.

IMI gives teams remote, low-level access to devices so they can recover their systems remotely without the need to visit sites. Organizations that employ IMI are able to revert to a golden image through automation, or deploy bootable tools to all the computers at the site to rescue them without data loss.

The dedicated out-of-band access to serial/IPMI and management ports gives automation software the same abilities as if a physical crash cart was pulled up to the servers. ZPE Systems’ Nodegrid (now a brand of Legrand) enables this architecture as explained next. Using Nodegrid and ZPE Cloud, teams can use either option to recover from the CrowdStrike outage:

  • Option 1: Reboot into a Preboot Execution Environment (PXE) – Nodegrid gives low-level network access to connected Windows machines as if teams were sitting directly in front of the affected device. This means they can remote in, reboot to a network image, remote into the booted image, delete the faulty file, and restart the system.
  • Option 2: Re-image – ZPE Cloud serves as a file repository and orchestration engine. Teams can upload their working Windows image, and then automatically push this across their global fleet of affected devices. This option speeds up recovery times exponentially.
  • Option 3: Windows-controlled re-image – Run Windows Deployment Services (WDS) on the IMI device at the location and re-image servers and workstations once a good backup of the data has been located. This backup can be made available through the IMI after the initial image has been deployed. The IMI can provide dedicated, secure access to the Intune service in your M365 cloud, and the backups do not have to transit the entire internet for all workstations at the same time, speeding up recovery many times over.

All of these options can be performed at scale or even automated. Server recovery with large backups, although it may take a couple of hours, can be delivered locally and tracked for performance and consistency.
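As a rough example of that automation, the sketch below uses standard ipmitool commands to force a list of machines to PXE-boot on their next power cycle. BMC addresses and credentials are placeholders; in an IMI deployment, the out-of-band appliance reaches these IPMI interfaces over the dedicated management network.

"""Minimal sketch: force a fleet of machines to PXE-boot over their BMCs."""
import subprocess

BMCS = ["10.10.0.11", "10.10.0.12"]               # placeholder BMC addresses
USER, PASSWORD = "admin", "REPLACE_ME"            # placeholder credentials

def ipmi(host: str, *args: str) -> None:
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", USER, "-P", PASSWORD, *args],
        check=True,
    )

def pxe_reboot(host: str) -> None:
    ipmi(host, "chassis", "bootdev", "pxe")       # next boot from the network
    ipmi(host, "chassis", "power", "cycle")       # reboot into the PXE image

if __name__ == "__main__":
    for bmc in BMCS:
        pxe_reboot(bmc)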

But what about the risk of making mistakes when you have to repeat these tasks? Won’t this cause more damage and data loss?

Any team can make a mistake repeating these recovery tasks over a large footprint and cause further damage or data loss, slowing the recovery even more. Automated recovery through the IMI addresses this and can provide reliable recording and reporting to ensure that the restoration is complete and trusted.

What does IMI look like?

Here’s a simplified view of Isolated Management Infrastructure. ZPE’s Nodegrid device sits beside the production infrastructure and provides the platform for hosting all the tools necessary for fast recovery.

A diagram showing how to use Nodegrid Gen 3 OOB to enable IMI.

What you need to deploy IMI for recovery:

  1. Out-of-band appliance with serial, USB, and Ethernet interfaces (e.g., ZPE’s Nodegrid Net SR)
  2. Switchable PDU: Legrand Server Tech or Raritan PDU
  3. Windows PXE Boot image

Here’s the order of operations for a faster CrowdStrike outage recovery:

  • Option 1 – Recover
    1. IMI deployed with a ZPE Nodegrid device that runs a Preboot Execution Environment (PXE) service, serving Windows boot images that Nodegrid pushes to the computers when they boot up
    2. Send recovery keys from Intune to IMI remote storage over ZPE Cloud’s zero trust platform, available in the cloud or air-gapped through Nodegrid Manager
    3. Enable the PXE service (automated across the entire enterprise) and define the PXE recovery image
    4. Use serial or IP control of power to the computers, or Intel vPro or IPMI where available, to reboot all machines
    5. All machines boot and check in to a control tower for PXE, or are made available to remote into using passwords stored in the PXE environment, Windows AD, or other Privileged Access Management (PAM)
    6. Delete the faulty file(s)
    7. Reboot

 

  • Option 2 – Lean re-image
    1. IMI deployed with a Windows preboot image served by the PXE service
    2. Enable access to the cloud and Azure Intune, and to the IMI remote storage holding the local image for the PCs
    3. Enable the PXE service (automated across the entire enterprise) and define the PXE recovery image
    4. Use serial or IP control of power to the computers, or Intel vPro or IPMI where available, to reboot all machines
    5. Machines boot and check in to Intune, either through the IMI or through normal Internet access, and finish imaging
    6. Once a machine completes its Intune tasks, Intune signals backups to come down to the machine. If these backups are offsite, they can be staged on the IMI through backup software running on a virtual machine on the IMI appliance, speeding up recovery without saturating the Internet connection at the remote site
    7. Pre-stage backups onto local storage, push recovery from the virtual machine on the IMI

 

  • Option 3 – Windows controlled re-image
    1. Windows Deployment Services (WDS) installed as a virtual machine running on the IMI appliance (kept offline to prevent issues, or online but on a slowed deployment cycle in case an issue arises)
    2. Send recovery keys from Intune to IMI remote storage over a zero trust interface, in the cloud or air-gapped
    3. Use serial or IP control of power to the computers, or Intel vPro or IPMI where available, to reboot all machines
    4. Machines boot and check in to the WDS for re-imaging
    5. Machines then check in to Intune, either through the IMI or through normal Internet access, and finish imaging
    6. Once a machine completes its Intune tasks, Intune signals backups to come down to the machine. If these backups are offsite, they can be staged on the IMI through backup software running on a virtual machine on the IMI appliance, speeding up recovery without saturating the Internet connection at the remote site
    7. Pre-stage backups onto local storage, push recovery from the virtual machine on the IMI

Deploy IMI to avoid the next outage

Get in touch for help choosing the right size IMI deployment for your organization. Nodegrid and ZPE Cloud are the drop-in solution to recovering from outages, with plenty of device options to fit any budget and environment size. Contact ZPE Sales now or download the blueprint to help you begin implementing IMI.

DORA Act: 5 Takeaways For The Financial Sector


The Digital Operational Resilience Act (DORA) is a regulatory initiative within the European Union that aims to enhance the operational resilience of the financial sector. Its main goal is to prevent and mitigate cyber threats and operational disruptions. The DORA Act outlines regulatory requirements for the security of network and information systems “whereby all firms need to make sure they can withstand, respond to and recover from all types of ICT-related disruptions and threats” (DORA Act website).

Who and What Are Covered Under the DORA Act?

The DORA Act is a regulation that covers all financial entities within the European Union (EU). It recognizes the critical role of information and communication technology (ICT) systems in financial services. DORA applies to financial services including payments, securities, credit rating, algorithmic trading, lending, insurance, and back-office operations. It establishes a framework for ICT risk management through technical standards, which are being released in two phases, the first of which was published on January 17, 2024. The DORA Act will go into effect in its entirety on January 17, 2025.

With cyberattacks constantly in the news cycle, it’s no surprise that governing bodies are putting forth standards for operational resilience. But without combing through this lengthy piece of legislation, what should IT teams start thinking about from a practical standpoint? Here are 5 takeaways on what the DORA Act means for the financial sector.

DORA Act: 5 Takeaways for the Financial Sector

1. Shore up your cybersecurity measures

The DORA Act emphasizes strengthening cybersecurity measures within the financial sector. It requires financial institutions, such as banks, stock exchanges, and financial infrastructure providers, to implement robust cybersecurity controls and protocols. These include adopting advanced authentication mechanisms, encryption standards, and network segmentation to protect sensitive financial data and critical infrastructure from cyber threats. Part of this will also require organizations to apply system patches and updates in a timely manner, which means automated patching will become necessary to every organization’s security posture.

2. Implement resilience systems

Operational resilience is a key focus area of the DORA Act, aiming to ensure the continuity of essential financial services in the face of cyber threats, natural disasters, and other operational disruptions. Financial institutions are required to develop comprehensive business continuity plans, establish redundant systems and backup facilities, and conduct regular stress tests to assess their ability to withstand and recover from various scenarios. Implementing a resilience system helps with this, as it provides all the infrastructure, tools, and services necessary to continue operating during major incidents.

3. Conduct regular scans for vulnerabilities

The DORA Act mandates financial institutions to implement robust risk management practices to identify, assess, and mitigate cyber risks and operational vulnerabilities. This includes conducting regular assessments, vulnerability scans, and penetration tests, and developing incident response procedures to quickly address threats. This is all part of taking a proactive approach to identify and mitigate cyber incidents, and reduce the impact that adverse events have on financial stability and consumer confidence.

4. Collaborate and share information with industry peers

The DORA Act encourages financial institutions to share cybersecurity threat intelligence, incident data, and best practices with industry peers, regulators, and law enforcement agencies. The ability to monitor systems and collect data will be crucial to this approach, and will require systems that can rapidly (and securely) deploy apps/services during ongoing incidents. This will help financial institutions to better understand emerging threats, coordinate responses to cyber incidents, and strengthen collective defenses against threats and operational disruptions.

5. Segment physical and logical systems to pass regular audits

Through the DORA Act, regulators are empowered to conduct regular assessments, audits, and inspections of systems. This will ensure that financial institutions are implementing adequate controls and safeguards to protect against cyber threats and operational disruptions. A crucial part of this will involve physical and logical separation of systems, such as through Isolated Management Infrastructure, as well as implementing zero trust architecture across the organization. These measures help bolster resilience by eliminating control dependencies between management and production networks, which also helps streamline audits.

Get the blueprint to help you comply with the DORA Act

DORA’s requirements are meant to help IT teams better protect sensitive data and the integrity of financial systems as a whole. But without a proper network management infrastructure, their production networks are too sensitive to errors and vulnerable to attacks. ZPE has created the blueprint that covers these 5 crucial takeaways outlined in the DORA Act. The architecture outlined in this blueprint has been trusted by Big Tech for more than a decade, as it allows them to deploy modern cybersecurity measures, physically and logically separated systems, and rapid recovery processes. Download the blueprint now.

What to do if You’re Ransomware’d: A Healthcare Example


This article was written by James Cabe, CISSP, a 30-year cybersecurity expert who’s helped major companies including Microsoft and Fortinet.

Ransomware gangs target the innocent and vulnerable. They hit a Chicago hospital in December 2023, a London hospital in October the same year, and schools and hospitals in New Jersey as recently as January 2024. This is one of the biggest reasons I’m committed to stopping these criminals by educating organizations on how to re-think and re-architect their approach to cybersecurity.

In previous articles, I discussed IMI (Isolated Management Infrastructure) and IRE (Isolated Recovery Environments), and how they could have quickly altered outcomes for MGM, Ragnar Locker victims, and organizations affected by the MOVEit vulnerability. Using IMI and IRE, organizations find that isolation is the key not only to speedy recovery, but also to limiting the blast radius and attack persistence.

Why is isolation (not segmentation) key to ransomware recovery?

The NIST Cybersecurity Framework has five core functions: Identify, Protect, Detect, Respond, and Recover. It’s missing a crucial step, however: Isolate. Stay tuned for a full breakdown of this in my next article. The reason this step is so critical is that attacks move at machine speed and are highly pervasive and persistent. If your management network is not fully isolated from production assets, the infection spreads to everything. Suddenly, you’re locked out completely and looking at months of tedious recovery. For healthcare providers, this jeopardizes everything from patient care to regulatory compliance.

Isolation is integral to building a resilience system, or in other words, a system that gives you more than basic serial console/out-of-band access and instead provides an entire infrastructure dedicated to keeping you in control of your systems — be it during a ransomware attack, ISP outage, natural disaster, etc. Because this infrastructure is physically and virtually isolated from production (no dependencies on production switches/routers, no open management ports, etc.), it’s nearly impossible for attackers to lock you out.

So, what really should you do if you’re ransomware’d? Let’s walk through an example attack on a healthcare system, and compare the traditional DR (Disaster Recovery) response to the IMI/IRE approach.

Ransomware in Healthcare: Disaster Recovery vs Isolated Recovery

Suppose you’re in charge of a hospital’s network. MDIoT, patient databases, and DICOM storage are the crown jewels of your infrastructure. Suddenly, you discover ransomware has encrypted patient records and is likely spreading quickly to other crown jewel assets. The risks and potential fallout can’t be overstated. Millions of people are depending on you to protect their sensitive info, while the hospital is depending on you to help it avoid regulatory and legal penalties and ensure it can continue operating.

The problem with Disaster Recovery

Though the word ‘recovery’ is in the name, the DR approach is limited in its capacity to recover systems during an attack. Disaster Recovery typically employs a couple of things:

  • Backups, which are copies of data, configurations, and code that are used to restore a production system when it fails.
  • Redundancy, which involves duplicating critical systems, services, and applications as a failsafe in the event that primaries go down (think cellular failover devices, secondary firewalls, etc.).

What happens when you activate your DR processes? It’s highly likely that you won’t be able to, and that’s because the typical DR setup relies on the production network. There’s no isolation.

Think about it this way: your backup servers need direct access to the data they’re backing up. If your file servers get pwned, your backup servers will, too. If your primary firewall gets hacked, your secondary will, too. The problem with backup and redundancy systems — and any system, for that matter — is that when they depend on the underlying infrastructure to remain operational, they’re just as susceptible to outages and attacks. It’s like having a reserve parachute that depends on the main parachute.

And what about the rest of your systems? You just discovered the attack has encrypted your servers and is quickly bringing operations to a crawl. How are you going to get in and fight back? What if you try to log into your management network, only to find that you’re locked out? All of your tools, configurations, and capabilities have been compromised.

This is why CISA, the FBI, US Navy, and other agencies recommend implementing Isolated Management Infrastructure.

IMI and IRE guarantee you can fight back against ransomware

You discover that the ransomware has spread. Not only has it encrypted data and stopped operations, but it has also locked you out of your own management network and is affecting the software configurations throughout the hospital. This is where IMI (Isolated Management Infrastructure) and IRE (Isolated Recovery Environment) come in.

Because IMI is physically separate from affected systems, it guarantees management access so teams can set up communication and a temporary ‘war room’ for incident response. The IRE can then be created using a combination of cellular, compute, connectivity, and power control (see diagram for design and steps). Docker containers should be used to bring up each step.

Diagram showing a chart containing the systems and open-source tools that can be deployed for an Isolated Recovery Environment

Image: The infrastructure and incident response protocol involved in the Isolated Recovery Environment. These products were chosen from free or open-source projects that have proven very useful at each stage of recovery. They can be automated in pieces for each phase and then brought down via Docker container to eliminate the risk of leakage during each phase.
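As a small illustration of the "bring up a container per phase, then tear it down" idea, the sketch below uses the Docker SDK for Python to run one phase's tool on an isolated network and remove it when it finishes. The image and network names are placeholders for whichever open-source tools and IRE segments you choose.

"""Minimal sketch: run a recovery tool for one IRE phase in a container, then
tear it down so nothing lingers between phases."""
import docker

def run_phase(image: str, name: str, network: str = "ire-isolated") -> None:
    # 'ire-isolated' is a placeholder Docker network you create for the IRE
    client = docker.from_env()
    container = client.containers.run(image, name=name, network=network, detach=True)
    try:
        print(f"phase '{name}' running in {container.short_id}")
        container.wait()                          # block until the tool finishes
    finally:
        container.stop()
        container.remove()                        # leave nothing behind for attackers

if __name__ == "__main__":
    # e.g., a forensics/triage image for the 'identify' phase (placeholder name)
    run_phase("example/forensics-triage:latest", name="ire-identify")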

Without diving too far into the technicalities, the IRE enables you to recover survivable data, restore software configurations, and prevent reinfection. Here are some things you can do (and should do) in this scenario, courtesy of the IRE:

Establish your war room

You can’t fight ransomware if you can’t securely communicate with your team. Use the IRE to create offline, break-the-glass accounts that are not attached to email. This allows you to communicate and set up ticketing for forensics purposes.

Isolate affected systems

There’s no use running antivirus if reinfection can occur. Use the IRE to take offline the switch that connects the backup and file servers. Isolate these servers from each other and shut down direct backup ports. Then, you can remote-in (KVM, iKVM, iDRAC) to run antivirus and EDR (Endpoint Detection and Response).
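As an illustration of that isolation step, the sketch below uses netmiko to shut down the switch ports feeding the backup and file servers, driven from the isolated management network rather than the compromised production path. The platform type, addresses, credentials, and interface names are placeholders for your own environment.

"""Minimal sketch: shut down the switch ports connecting backup and file
servers so cleanup can proceed without reinfection."""
from netmiko import ConnectHandler

SWITCH = {
    "device_type": "cisco_ios",                   # placeholder platform type
    "host": "10.20.0.5",                          # management IP of the access switch
    "username": "admin",
    "password": "REPLACE_ME",
}
PORTS_TO_ISOLATE = ["GigabitEthernet1/0/10", "GigabitEthernet1/0/11"]  # placeholders

def isolate_ports() -> None:
    conn = ConnectHandler(**SWITCH)
    try:
        for port in PORTS_TO_ISOLATE:
            conn.send_config_set([f"interface {port}", "shutdown"])
            print(f"shut down {port}")
        conn.save_config()
    finally:
        conn.disconnect()

if __name__ == "__main__":
    isolate_ports()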

Restore data and device images

The key is to have backup data at its most current, both for patient data and device/software configurations. Because the IRE provides an isolated environment, and you’ve already pulled your backups offline, you can gradually restore data, re-image devices, and restore configurations without risking reinfection. The IRE ensures devices “keep away” from each other until they can be cleansed and recovered.

Things You’ll Need To Build The IMI and IRE

Network Automation Blueprint

We’ve created a comprehensive blueprint that shows how to implement the architecture for IMI and IRE. Don’t let the name fool you. The Network Automation Blueprint covers everything from establishing a dedicated management network, to automating deployment of services for ransomware recovery. Get your PDF copy now at the link below.

Gen 3 Console Servers To Replace End-of-Life Gear

It’s nearly impossible to build the IMI or deploy the IRE using older console servers. That’s because these only give you basic remote access and a hint of automation capabilities; you’ll still need the ability to run VMs and containers. Gen 3 console servers give you everything IMI and IRE require, like full control plane/data plane separation, hosting apps, and deploying VMs/containers on demand. They’ve also been validated by Synopsys and have the built-in security features I’ve been talking about for years. Check out the link below for resources about Gen 3 and how we’ll help you upgrade.

Get in touch with me!

I’d love to talk with you about IMI, IRE, and resilience systems. These are becoming more crucial to operational resilience and ransomware recovery, and countries are passing new regulations that will require these approaches. Get in touch with me via social media to talk about this!