Ransomware and wiper attacks are causing organizations to re-evaluate their backup and recovery capabilities. An obvious concern is whether backups are safe – for example, are they offline, where they can’t be encrypted or wiped?
While this is a good first step, it’s just that. We also need to evaluate whether the backup and recovery technologies we use support recovery from cyberattacks. This is particularly important if bare-metal recovery (BMR) is part of your AD recovery plan.
New sense of urgency
As the head of product at Semperis, I talk with a lot of CIOs and CISOs about disaster recovery (DR) for Microsoft Active Directory (AD). The attention on AD recovery is well deserved given AD’s critical role (if AD isn’t available, nothing is) and specialized recovery process (how many of your colleagues have actually performed a full forest recovery?).
Over the last year I’ve seen a new sense of urgency around AD recovery. It’s no longer a question of undoing a forest-level upgrade or a schema change. Now the focus is on recovering AD when a destructive, fast-moving cyberattack (like NotPetya or LockerGoga) takes out all your domain controllers (DCs).
That was then, this is now
BMR has historically provided a convenient way to recover an entire server. For example, the DR team could perform the recovery on their own, with no dependency on the server provisioning team. But in the new threat landscape, BMR’s greatest strength (recovers everything in the OS) has become its greatest weakness (potentially reintroduces malware).
BMR is also showing its age. It was built for physical computers back when many IT departments required days or even weeks to provision a server. But virtualization changed all that.
And BMR was built prior to cloud and infrastructure-as-a-service (IaaS), which provide an inexpensive, readily available alternative to standby hardware.
If your current AD recovery plan relies on BMR – or if you’re considering new automated recovery capabilities that rely on BMR (to recover DCs when the hardware or operating system is affected) – below are some things to consider.
BMR backs up and restores everything on the DC, including the operating system (OS), registry, system files, etc. That’s not a bad thing if a fire or flood takes out your DCs and you need to restore AD on new hardware. But it’s a very bad thing if your DCs are hit by a cyberattack.
When an intruder installs rootkits, ransomware, or other malware on your DCs, all those executables and DLLs are included in BMR backups. When you do a BMR restore, the malware you don’t want is restored along with the OS, registry, AD database, and all the other things you do want. (Note that malware can also be reintroduced when restoring from system state backups.)
Here’s a demo that illustrates the risk of reinfection with BMR:
In some cases, you might be able to identify and remove the malware after restore. I personally wouldn’t be comfortable with that approach and would only consider it as a last resort (when no other restore option is available). I certainly don’t recommend designing your AD recovery plan around it.
If DCs are infected when BMR backups are taken – or if you suspect they might have been – the traditional approach is to go back to an earlier backup. But how far back should you go? Malware can remain latent and go undetected for weeks or even months, so you might have to go back a long way.
Of course, the further back you go, the more data you lose. You might take daily backups for a 24-hour RPO (recovery point objective), but if you have to go back two months to get a good backup, you’re missing that objective big time. And given the number of directory changes made each day, the data loss really adds up.
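To put numbers on that gap, here is a minimal Python sketch that compares the real data-loss window with the RPO. The function name recovery_point_gap and the dates are hypothetical, chosen only to mirror the example above (daily backups, last clean backup two months old).

```python
from datetime import datetime, timedelta


def recovery_point_gap(incident_time, last_clean_backup, rpo):
    """Return (actual data-loss window, amount by which it exceeds the RPO)."""
    actual_loss = incident_time - last_clean_backup
    overshoot = max(actual_loss - rpo, timedelta(0))
    return actual_loss, overshoot


# Hypothetical dates, for illustration only.
incident = datetime(2024, 3, 1, 9, 0)
last_clean = incident - timedelta(days=60)  # "go back two months"
rpo = timedelta(hours=24)                   # daily backups

loss, overshoot = recovery_point_gap(incident, last_clean, rpo)
print(f"Data-loss window: {loss.days} days")
print(f"Exceeds RPO by:   {overshoot.days} days")
print(f"Multiple of RPO:  {loss / rpo:.0f}x")
```

With these assumed dates the loss window is 60 days, or 60 times the 24-hour objective.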
Every minute of downtime costs your organization real money, and BMR extends recovery time in a number of ways:
Hardware setup: BMR is designed for recovery to matching hardware. Unless you can afford duplicate standby hardware, you’ll need at least a day or two to procure new hardware. This alone is outside the typical RTO (recovery time objective) for critical infrastructure (like AD) and tier 1 applications. And if a cyberattack strikes many organizations and causes hardware shortages, procurement can take much longer. Workarounds for restoring to alternate hardware (such as driver injection) may be possible, but they take time and introduce additional risk.
Backup retrieval: Because they contain the entire system, BMR backups are big. I knew this from experience but wanted to quantify it, so I ran a simple test in my lab using a “vanilla” Windows Server 2016 DC with essentially an empty AD (to determine the overhead associated with different backup methods). I used Windows Server Backup to make a BMR backup (as well as a system state backup), and I used Semperis AD Forest Recovery to make a Semperis backup (which extracts AD from the underlying Windows OS). Here are the (uncompressed) results:
Bigger backups consume more storage and network bandwidth and tend to occur less frequently (for example, weekly or after Patch Tuesday). And during recovery, they take longer to retrieve (especially when stored in the cloud).
Backup selection: Finding a clean BMR backup is an iterative process (retrieve, mount, extract, test, repeat) that takes time when every minute is expensive.
Rework: Restoring AD from an older (and hopefully clean) BMR backup requires recreating directory changes, reconfiguring applications, rejoining workstations, etc. – and all this rework takes considerable time and effort.
False recovery: BMR carries a large degree of uncertainty – were DCs infected, when were they infected, how far back do I need to go? You might not discover you didn’t go back far enough until you’ve completed or are well into the AD recovery process. AD recovery is no simple task, so starting over is particularly painful.
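The backup selection loop described above (retrieve, mount, extract, test, repeat) can be sketched in a few lines of Python. This is only an illustration: the Backup class, find_newest_clean_backup, the is_clean check, and all the numbers (weekly 50 GB backups, retrieval at 25 GB per hour, two hours of testing per backup, infection 45 days ago) are hypothetical assumptions, not measurements and not any product’s API. In practice the hard part is is_clean itself, because nothing tells you for certain that a backup is malware-free.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Backup:
    taken_on: date
    size_gb: float


def find_newest_clean_backup(
    backups: Iterable[Backup],
    is_clean: Callable[[Backup], bool],
    hours_per_check: Callable[[Backup], float],
) -> Tuple[Optional[Backup], float]:
    """Check backups newest-first; return (first clean backup or None, hours spent)."""
    hours_spent = 0.0
    for backup in sorted(backups, key=lambda b: b.taken_on, reverse=True):
        # Each iteration stands for one retrieve/mount/extract/test cycle.
        hours_spent += hours_per_check(backup)
        if is_clean(backup):
            return backup, hours_spent
    return None, hours_spent


# Hypothetical scenario, for illustration only.
today = date(2024, 3, 1)
weekly_backups = [Backup(today - timedelta(weeks=w), size_gb=50.0) for w in range(12)]
infected_on = today - timedelta(days=45)  # assumed; in a real incident this is unknown

clean, hours = find_newest_clean_backup(
    weekly_backups,
    is_clean=lambda b: b.taken_on < infected_on,       # stand-in for real malware testing
    hours_per_check=lambda b: b.size_gb / 25.0 + 2.0,  # retrieval at 25 GB/hour + 2 h testing
)

if clean is None:
    print(f"No clean backup found after {hours:.0f} hours")
else:
    age_days = (today - clean.taken_on).days
    print(f"Newest clean backup: {clean.taken_on} ({age_days} days old)")
    print(f"Time spent finding it: {hours:.0f} hours")
```

With these assumed numbers, eight backups are checked over 32 hours before the first clean one turns up, and that backup is seven weeks old. Both the search time and the resulting data loss grow the longer the malware sat undetected.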
Times change and so must our DR plans. Cyberattacks are a clear and present danger that cannot be ignored. While BMR may address certain recovery scenarios, AD recovery after a cyberattack isn’t one of them.