The Call I Got on a Sunday Night
A client called me on a Sunday night, panicked. Their file server had been hit by ransomware Friday evening, and nobody noticed until an employee opened a shared spreadsheet Monday morning and got a ransom note instead. Three days of silence while the encryption spread across every mapped drive.
The good news: they had backups. The bad news: the backup drive was a network-attached device on the same subnet as everything else, so it got encrypted right along with the production files. Their "disaster recovery plan" turned out to be a single external hard drive that was, in practice, just another folder on the same network.
They recovered — eventually — from a six-week-old copy on a departing employee's old laptop. It cost them four days of operations, a client they never got back, and roughly $40,000 in recovery consulting and lost billable time. None of it was necessary. A plan built around basic separation and regular testing would have gotten them back online in hours, not days.
Most business owners assume "we have backups" means "we have a disaster recovery plan." It does not. A backup is a file. A disaster recovery plan is the decision-making framework for what happens when that file is all you have left.
What a Real Disaster Recovery Plan Covers
A disaster recovery plan is not an IT document that lives in a folder nobody opens. It is a business document that answers four questions before a crisis, not during one:
- What systems matter most, and in what order do we restore them? Not everything is equally critical. Your accounting system and your customer-facing application probably do not have the same recovery priority.
- How much downtime can we actually tolerate per system? This is your Recovery Time Objective (RTO) — the maximum acceptable time before a system is back online.
- How much data can we afford to lose? This is your Recovery Point Objective (RPO) — the maximum acceptable gap between your last good backup and the moment of failure.
- Who does what, and who talks to customers? A plan without assigned ownership is a wish list, not a plan.
I recommend treating this with the same seriousness as a financial audit. It rarely gets that treatment, and that gap is exactly where the damage happens.
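To make the two objectives concrete, here is a minimal sketch of how I would measure them against an incident timeline. The function names and the timestamps are invented for illustration; the only idea taken from the definitions above is that RPO is compared against the gap between the last good backup and the failure, and RTO against the time until the system is back online.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Gap between the last good backup and the failure (compare to the RPO)."""
    return failure - last_good_backup


def downtime(failure: datetime, restored: datetime) -> timedelta:
    """Time the system was unavailable (compare to the RTO)."""
    return restored - failure


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True when the measured value is within the agreed objective."""
    return actual <= objective


# Hypothetical timeline: nightly backup at 02:00, failure that afternoon,
# service restored the next morning.
last_backup = datetime(2024, 3, 1, 2, 0)
failure = datetime(2024, 3, 1, 17, 30)
restored = datetime(2024, 3, 2, 9, 30)

loss = data_loss_window(last_backup, failure)   # 15 hours 30 minutes
outage = downtime(failure, restored)            # 16 hours

print(meets_objective(loss, timedelta(hours=24)))   # RPO of 24 hours -> True
print(meets_objective(outage, timedelta(hours=8)))  # RTO of 8 hours  -> False
```

In this made-up case the backup schedule satisfies the RPO, but the recovery process misses the RTO by a wide margin, which is the kind of gap a plan should expose before an incident does.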
RTO and RPO: The Two Numbers That Actually Drive Cost
Every disaster recovery conversation eventually comes down to these two metrics, and clients make better decisions once they see the trade-off is not abstract — it is a direct cost curve.
Near-zero RTO and RPO, where systems fail over automatically with almost no data loss, are achievable. They also require real-time replication, redundant infrastructure, and ongoing spend that most SMBs do not need for every system. In my experience, the right approach is tiering:
- Tier 1 — Revenue-critical systems (payment processing, core transactional databases, customer-facing applications). Target RTO of minutes to a few hours; RPO measured in minutes.
- Tier 2 — Operational systems (internal tools, CRM, project management). Target RTO of same-day; RPO of a few hours.
- Tier 3 — Everything else (archived files, historical reports). Target RTO of days; RPO of 24 hours is usually fine.
Done honestly with business stakeholders rather than by IT in isolation, this tiering exercise resolves most disagreements I see between finance and operations about how much resilience is "enough," and it keeps the conversation grounded in business need rather than vendor pitch.
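The tiers above can be written down as data and checked against what each system actually achieves. This is a sketch under stated assumptions: the specific hour and minute targets are placeholders I picked to fall inside the ranges described, and the system names, `backup_interval`, and `tested_restore_time` fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rto: timedelta  # maximum acceptable time to restore
    rpo: timedelta  # maximum acceptable data-loss window


# Placeholder targets only; agree on real numbers with business stakeholders.
TIERS = {
    1: Tier("Revenue-critical", rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    2: Tier("Operational", rto=timedelta(hours=8), rpo=timedelta(hours=4)),
    3: Tier("Everything else", rto=timedelta(days=3), rpo=timedelta(hours=24)),
}


@dataclass(frozen=True)
class System:
    name: str
    tier: int
    backup_interval: timedelta      # worst-case data loss is roughly one interval
    tested_restore_time: timedelta  # measured during the last restore test


def gaps(system: System) -> list[str]:
    """List the ways a system falls short of its tier's targets."""
    tier = TIERS[system.tier]
    problems = []
    if system.backup_interval > tier.rpo:
        problems.append(
            f"{system.name}: backup interval {system.backup_interval} exceeds RPO {tier.rpo}"
        )
    if system.tested_restore_time > tier.rto:
        problems.append(
            f"{system.name}: restore time {system.tested_restore_time} exceeds RTO {tier.rto}"
        )
    return problems


# Hypothetical inventory
payments = System("payments-db", 1, timedelta(hours=24), timedelta(hours=2))
crm = System("crm", 2, timedelta(hours=1), timedelta(hours=6))

for s in (payments, crm):
    print(s.name, gaps(s) or "meets tier targets")
```

Here the nightly backup on the Tier 1 database is the problem, not the restore speed. A table like this is often enough to show where extra spend is justified and where it is not.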
The Backup Strategy That Actually Survives an Attack
The story above is common enough that I consider it the norm, not the exception. Ransomware today specifically targets backup systems: Sophos's ransomware research found attackers attempted to compromise backups in the large majority of incidents, and succeeded more than half the time. If your backup lives on the same network as your production data, you do not have a disaster recovery plan. You have a slightly larger blast radius.
I recommend the extended version of the classic 3-2-1 rule, often written as 3-2-1-1-0:
- 3 copies of your data — the original plus two backups.
- 2 different media types — for example, cloud storage and a local NAS, not two copies on the same drive.
- 1 copy offsite, physically or logically isolated from your primary network.
- 1 copy immutable or air-gapped — write-once storage or offline media that ransomware cannot reach or alter, even with valid credentials.
- 0 errors — verified through regular restore testing, not just backup completion logs.
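The rule is simple enough to check mechanically against an inventory of where your data lives. The sketch below is illustrative only: the `DataCopy` fields and the example inventories are assumptions of mine, and "0 errors" is modeled as every backup having a recorded restore test with zero errors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCopy:
    label: str
    media: str                    # e.g. "server-disk", "nas", "cloud", "tape"
    offsite: bool                 # physically or logically off the primary network
    immutable_or_airgapped: bool  # write-once storage or offline media
    is_backup: bool = True        # False for the production original
    restore_test_errors: Optional[int] = None  # None means never restore-tested


def check_3_2_1_1_0(copies: list[DataCopy]) -> dict[str, bool]:
    """Evaluate each part of the rule for a list of data copies."""
    backups = [c for c in copies if c.is_backup]
    return {
        "3 copies": len(copies) >= 3,
        "2 media types": len({c.media for c in copies}) >= 2,
        "1 offsite": any(c.offsite for c in backups),
        "1 immutable or air-gapped": any(c.immutable_or_airgapped for c in backups),
        "0 errors": bool(backups) and all(c.restore_test_errors == 0 for c in backups),
    }


# Roughly the setup from the opening story: one backup on the same subnet.
story = [
    DataCopy("file server", "server-disk", False, False, is_backup=False),
    DataCopy("NAS on same subnet", "nas", False, False),
]

# A hypothetical improved setup.
improved = [
    story[0],
    DataCopy("local NAS", "nas", False, False, restore_test_errors=0),
    DataCopy("cloud bucket with write-once retention", "cloud", True, True,
             restore_test_errors=0),
]

for name, inventory in (("story", story), ("improved", improved)):
    failed = [rule for rule, ok in check_3_2_1_1_0(inventory).items() if not ok]
    print(name, "fails:", failed or "nothing")
```

The first inventory passes only the media-type check, which matches how that client's recovery actually went.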
This rule connects directly to the resilience conversation I have with clients pursuing cloud migration. Cloud providers replicate infrastructure, but they do not automatically back up your application data unless you configure it. Reading "99.9% uptime SLA" in a vendor contract and assuming it covers you against ransomware or accidental deletion is one of the most common, costly misunderstandings I encounter. A 99.9% SLA is a commitment about how often the platform is available; it says nothing about whether your data is recoverable.
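It helps to see what an uptime percentage does measure. The arithmetic below is a small sketch (the function name is mine) that converts an SLA percentage into the downtime it still permits. Nothing in that number refers to backups, retention, or restores.

```python
def allowed_downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Downtime that still satisfies an uptime percentage over a period."""
    return period_hours * (1 - uptime_percent / 100)


HOURS_PER_YEAR = 365 * 24      # 8760
HOURS_PER_30_DAYS = 30 * 24    # 720

per_year = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
per_month_minutes = allowed_downtime_hours(99.9, HOURS_PER_30_DAYS) * 60

print(round(per_year, 2))           # 8.76 hours per year
print(round(per_month_minutes, 1))  # 43.2 minutes per 30-day month
```

A provider can meet that target every month while your deleted or encrypted data stays exactly as deleted or encrypted as it was.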
Test the Plan Before You Need It
A disaster recovery plan that has never been tested is a hypothesis. I have watched organizations discover, during an actual outage, that backup files were corrupted, nobody remembered the admin credentials for the recovery environment, or the "documented" restore process was three versions out of date.
I recommend a simple, recurring cadence:
- Quarterly restore tests — actually restore a system from backup into an isolated environment and confirm it works. Not a checklist item; an actual restore.
- Annual tabletop exercises — walk the leadership team through a simulated incident (ransomware, cloud outage, deleted database) and have them talk through the decisions in real time.
- Post-incident reviews — after any real outage, however minor, document what worked and what did not, and update the plan.
This is non-negotiable for clients in fintech and other regulated spaces. The discipline described here connects to what I cover in remote work security for financial services — auditors and regulators increasingly expect evidence of tested continuity plans, not just written policies.
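Part of a quarterly restore test can be automated. One approach, sketched here as an assumption rather than a description of any particular backup product, is to record SHA-256 checksums of the source files at backup time and compare them against the files restored into the isolated environment. The function names and directory layout are hypothetical, and a passing check does not replace confirming that the restored application actually starts and works.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return a list of missing or corrupted files; an empty list means a match."""
    problems = []
    for rel, expected in manifest.items():
        candidate = restored_root / rel
        if not candidate.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(candidate) != expected:
            problems.append(f"corrupted: {rel}")
    return problems


# Usage (paths are placeholders):
#   manifest = build_manifest(Path("/data/shared"))          # at backup time
#   issues = verify_restore(manifest, Path("/restore-test")) # after the test restore
#   An empty list is the evidence an auditor will want to see recorded.
```

Keeping the output of each run alongside the date gives you the paper trail for the restore tests, not just the backup completion logs.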
Building the Incident Response Runbook
The technical recovery is only half the plan. The other half is communication, and it is the half most businesses skip. When a system goes down, someone needs to know, within minutes, who declares the incident, who communicates with affected customers, and who has the authority to make costly decisions (like paying for expedited cloud support) without waiting for a committee.
A workable runbook includes:
- A named incident commander for each shift or on-call rotation.
- Pre-written customer communication templates, so you are not drafting a status page update while also trying to fix the actual problem.
- A contact tree that does not depend entirely on a system that might itself be down (do not store your emergency contact list only in the email system that just got encrypted).
- A clear escalation path to vendors, cloud providers, and, if relevant, cyber insurance carriers.
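The checklist above can also be kept as structured data and linted, so that gaps surface during a review rather than during an incident. This is a minimal sketch with hypothetical field names; the one rule worth noting is the contact-tree check, which flags anyone reachable only through channels assumed to be down.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    role: str
    channels: dict[str, str]  # channel name -> address, e.g. {"mobile": "..."}


@dataclass
class Runbook:
    incident_commanders: dict[str, str]  # shift or rotation -> named person
    customer_templates: dict[str, str]   # scenario -> pre-written message
    contacts: list[Contact]
    escalation: list[str]                # vendors, cloud provider, insurer, in order


def runbook_gaps(rb: Runbook, shifts: list[str], affected_channels: set[str]) -> list[str]:
    """List what is missing, assuming `affected_channels` are unavailable."""
    found = []
    for shift in shifts:
        if not rb.incident_commanders.get(shift):
            found.append(f"no incident commander for shift '{shift}'")
    if not rb.customer_templates:
        found.append("no pre-written customer communication templates")
    for contact in rb.contacts:
        if not set(contact.channels) - affected_channels:
            found.append(f"{contact.name} is reachable only via affected channels")
    if not rb.escalation:
        found.append("no escalation path defined")
    return found


# Hypothetical runbook, checked under the assumption that email is down.
example = Runbook(
    incident_commanders={"day": "A. Rivera"},
    customer_templates={"outage": "We are investigating a service disruption."},
    contacts=[
        Contact("A. Rivera", "incident commander", {"email": "x", "mobile": "y"}),
        Contact("B. Chen", "customer comms", {"email": "z"}),
    ],
    escalation=["cloud provider support", "cyber insurance carrier"],
)

for gap in runbook_gaps(example, shifts=["day", "night"], affected_channels={"email"}):
    print(gap)
```

Running this against the example reports the uncovered night shift and the contact who can only be reached by email, the two failures I see most often in real runbooks.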
Closing Thoughts
Disaster recovery planning is not glamorous work, and I understand why it keeps getting deprioritized in favor of features and growth initiatives. But I have sat across the table from too many business owners during their worst week to treat this as optional advice. The cost of a real plan is a fraction of the cost of not having one, and the businesses that survive a serious incident cleanly are, without exception, the ones that had already answered these questions before they needed to.
If you are not confident your organization could recover from a ransomware attack, a deleted production database, or a multi-day cloud outage by next week, that is worth a conversation now, not after the fact.