What Colonial Pipeline Taught Us About OT Incident Response 

Henry Kogan

Ask any OT security leader if their team has an incident response plan, and you’ll almost always get a yes. Push further and ask whether the plan has actually been tested through a simulation, and the room often goes quiet. Even when that exercise does take place, plant operators, the people who’d have to run the process by hand if SCADA went dark, are rarely in the room. Three distinct parties are involved during a cyber incident: cybersecurity practitioners, plant operators, and service vendors. Most of the time, the handoffs between them have not been rehearsed. The incident response plan exists, but the actual mechanisms to enable coordination don’t. The Colonial Pipeline incident of 2021 is a textbook example.

Colonial Pipeline’s Decision Disaster 

In May 2021, a ransomware attack by the DarkSide group encrypted Colonial Pipeline’s IT and billing systems, forcing a six-day shutdown of the largest fuel pipeline on the U.S. East Coast and triggering emergency declarations across seventeen states. 

The control systems moving fuel down the pipeline still worked. The company shut the pipeline down as a precaution, partly because IT and OT were interconnected enough that operators couldn’t be sure the infection hadn’t crossed over, but mostly because they couldn’t bill customers.

The result was six days of shutdown, emergency declarations in seventeen states, and a $4.4 million ransom paid to the attackers. Roughly half the East Coast’s fuel supply was disrupted, all without the attackers touching a single PLC.

What Colonial tested wasn’t the cyber team’s ability to contain ransomware. Plenty of companies do that every week. What it tested was the organization’s ability to make a high-consequence operational decision under uncertainty.  

The key decision was whether to keep running with degraded visibility into billing and monitoring, or to shut down. Making it required IT, OT, finance, legal, regulatory affairs, and executive leadership to align in hours, not days, so that customers would still get their fuel.

A basic tabletop exercise built around “how do you respond to ransomware on the billing system” would have missed the actual decision. The exercise should have centered on operational consequences and the business impact of an IT compromise. Here is the scenario to work through: your billing and compliance systems are unavailable, and you do not know whether the OT network is clean. At what point do you shut the pipeline down, and at what point do you restart?

Most OT incident response programs are not built to ask that question. Instead, they are built around the cyber team’s scope, which ends at the firewall or the data diode. The people who decide whether to keep a process running are usually not in the exercise. 

What a Mature OT Incident Response Plan Looks Like 

The Colonial case exposed a set of capabilities that should be standard in any OT-focused incident response program. Here’s what should have been considered: 

  • Pre-defined degraded operating modes. Most IR plans focus on restoration. Mature OT plans define what “running degraded” looks like — what the plant can run without billing, without email, without the ERP, without historian data flowing to corporate. Each mode is pre-authorized by leadership so the on-call incident commander doesn’t have to escalate to the CEO at 2 AM to keep the process flowing. 

In other words: know what you can run on a bad day.

  • Operational continuity playbooks for the IT/OT seam. For Colonial specifically, that meant manual ticketing for custody transfer, pre-negotiated “trust and true-up” agreements with shippers so deliveries continue against nominated volumes, hot-standby billing infrastructure isolated from production IT, and the organizational comfort to operate “blind to revenue” temporarily. For other operators it looks different but every OT environment has equivalent dependencies on IT systems that need a manual or isolated fallback. 

In other words: deliver now, invoice later.  

  • Cross-functional tabletops with the CFO and General Counsel in the room. Colonial’s shutdown decision was driven as much by inability to bill and uncertainty about legal exposure as by cyber concerns. Pre-decided answers to “can we deliver product we can’t immediately invoice for?” remove that paralysis when minutes matter. 

In other words: OT system shutdowns shouldn’t be left to the cyber team to decide.

  • Verified IT/OT segmentation. The Purdue Model exists for a reason. The fact that an IT-only compromise forced an OT shutdown reveals coupling that shouldn’t exist. A mature OT strategy treats the control system as survivable independent of corporate IT — and proves it through exercises that actually sever the connection. 

In other words: if IT going down takes OT with it, you don’t have true segmentation.

  • Out-of-band command and control. When IT is down, how does the incident commander coordinate with field operators, vendors, and regulators? Pre-provisioned secondary email, Signal or phone trees, and satellite comms for critical sites are the difference between coordinated response and improvisation.

In other words: don’t run the response on the network you’re trying to recover. 

  • Pre-staged external relationships. CISA, FBI, the relevant ISAC and sector regulator (ONG-ISAC and TSA Pipeline Security in Colonial’s case), incident response retainers, and ransomware negotiators if that path is on the table. Mandiant was brought in to support Colonial, but a pre-signed retainer shaves days off the response clock.

In other words: exchange business cards before the breach, not during it.

  • Identity hygiene as IR foundation. Colonial’s initial access vector was a legacy VPN account with a leaked password and no MFA. A strong incident response plan can compensate for a gap like that after the fact, but it doesn’t excuse you from deprecating unused accounts, enforcing MFA everywhere, and applying privileged access management to OT remote access. The strongest IR plan still starts with not getting hit through a five-year-old VPN credential.

In other words: getting the cybersecurity hygiene basics right is critical.  
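The identity hygiene item is the easiest one on this list to check on a schedule. Here is a minimal sketch in Python, assuming you can export your remote-access accounts with a last-login date and an MFA flag. The RemoteAccessAccount fields, the account names, and the 90-day idle threshold are all hypothetical choices for illustration, not details from Colonial’s environment.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RemoteAccessAccount:
    # Hypothetical record shape; adapt to whatever your VPN or IdP exports.
    name: str
    last_login: date
    mfa_enabled: bool


def flag_risky_accounts(accounts, today, max_idle_days=90):
    """Return (name, reasons) for accounts that are idle or lack MFA."""
    cutoff = today - timedelta(days=max_idle_days)  # 90 days is an assumed policy
    findings = []
    for acct in accounts:
        reasons = []
        if acct.last_login < cutoff:
            reasons.append("idle")
        if not acct.mfa_enabled:
            reasons.append("no-mfa")
        if reasons:
            findings.append((acct.name, reasons))
    return findings


if __name__ == "__main__":
    # Made-up inventory for demonstration only.
    inventory = [
        RemoteAccessAccount("vendor-hmi-support", date(2021, 4, 20), True),
        RemoteAccessAccount("legacy-vpn-profile", date(2019, 11, 3), False),
    ]
    for name, reasons in flag_risky_accounts(inventory, today=date(2021, 5, 1)):
        print(f"{name}: {', '.join(reasons)}")
```

Anything the check flags is a candidate for deprecation or an MFA rollout well before it becomes an incident.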

The throughline: the 2021 incident wasn’t a cybersecurity failure in the OT sense — it was a resilience and decision-making failure. The technical compromise stayed in IT, but the business couldn’t figure out how to operate without its IT systems, so it shut down critical infrastructure. A mature OT IR program treats “we got hit, now what runs anyway?” as the central design question. 
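One way to make “what runs anyway?” concrete is to write the pre-authorized degraded modes down as data that the on-call incident commander can query. The sketch below is illustrative only: the mode names, the systems listed, the approver roles, and the time limits are hypothetical placeholders, and the rule that some modes require a verified-clean OT network is an assumption rather than an established practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradedMode:
    name: str
    tolerates_loss_of: frozenset  # IT systems this mode can run without
    requires_clean_ot: bool       # assumed rule: mode needs OT verified clean
    approved_by: str              # role that signed off in advance
    max_hours: int                # assumed limit before re-approval


# Hypothetical register; a real one comes out of leadership-approved tabletops.
MODES = [
    DegradedMode("deliver-now-invoice-later",
                 frozenset({"billing", "erp"}), True, "CFO", 72),
    DegradedMode("corporate-links-severed",
                 frozenset({"billing", "erp", "email", "historian_feed"}),
                 False, "COO", 24),
]


def authorized_modes(unavailable, ot_verified_clean, modes=MODES):
    """Return the pre-approved modes that cover every unavailable system."""
    down = frozenset(unavailable)
    return [
        m for m in modes
        if down <= m.tolerates_loss_of
        and (ot_verified_clean or not m.requires_clean_ot)
    ]


if __name__ == "__main__":
    # Billing is down and OT status is still unknown.
    for mode in authorized_modes({"billing"}, ot_verified_clean=False):
        print(mode.name, "pre-approved by", mode.approved_by)
```

An empty result is the useful signal here: it means no pre-authorized mode covers the situation, so the decision has to escalate. That is exactly the case a tabletop should uncover ahead of time.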

Where Most OT Programs Go Wrong 

Three patterns come up repeatedly in mature-looking programs that turn out to be fragile under pressure. 

The first is scope. The IR plan covers the SOC and the CSIRT, but stops at the boundary of the OT environment, where “operations will handle it” becomes an unexamined assumption. Real OT incidents don’t respect that boundary. 

The second is realism. Tabletop scenarios are written by cyber staff for cyber staff, and they tend to describe attacks the cyber team already knows how to respond to, rather than attacks that would actually be difficult — the ambiguous ones, the cross-domain ones, the ones that force shutdown decisions. 

The third is cadence. A single annual exercise is not enough to build the muscle memory required to coordinate across IT, OT, vendors, and executives during a real event. Incident command is a perishable skill. 

CyberEd.io Helps Build Incident Response Leaders  

Colonial Pipeline’s shutdown decision wasn’t a cyber call. It was driven as much by the inability to bill customers and uncertainty about legal and regulatory exposure as by concerns about the ransomware itself. The cyber team could contain DarkSide. What no one in the room could answer in the moment was whether the company could keep moving fuel without invoicing it, whether shippers would accept reconciliation after the fact, and what the regulatory consequences looked like either way. Those are CFO and General Counsel questions, not SOC questions, and they have to be pre-decided. When minutes matter, “can we deliver product we can’t immediately invoice for?” is not a question you want to be debating for the first time at 2 AM.  

CyberEd’s OT incident response tabletops are built by the world’s leading OT experts to surface exactly these decisions before a real event forces them, putting cyber, operations, finance, and legal in the same exercise so the answers exist on paper, with leadership signoff, long before the pipeline goes quiet. 

Learn more about our OT training offerings.  

 


 
