Most organizations have virtualized some or all of their AD domain controllers (DCs). Virtualized DCs have their advantages, but they also introduce risks that didn’t exist with physical servers. One of these risks is the temptation to use hypervisor snapshots (point-in-time VM images) for AD backups.
Let’s be clear: Even though Microsoft has supported hypervisor snapshot restores since Windows Server 2012 (i.e., they won’t break AD as they could in previous OS versions), they still aren’t recommended. And they especially aren’t recommended for forest disaster recovery scenarios.
In a forest-wide AD outage, the goal is to return AD to service as quickly and reliably as possible. A severity 1 forest recovery event is not like a test recovery in your lab. It will most likely be the most stressful IT situation of your life, with C-level management breathing down your team’s necks because most or all of the business will be down. Every minute of an outage will cost your organization tens or hundreds of thousands of dollars—so each one saved is precious.
With this goal in mind, let’s look at the shortcomings of using hypervisor snapshots to restore AD:
- A forest restored from snapshots will have data consistency issues that are difficult to resolve. Object and attribute updates replicate between DCs continuously, so a partition is rarely fully synchronized (i.e., all updates are complete and all DCs contain the same information). Because of this behavior, a snapshot of one DC will almost certainly contain information that is inconsistent with the snapshots of the other DCs. Thus, a restore of these DCs will almost certainly create lingering objects in both writable and read-only partitions. An AD lingering object is an object that is present on one DC but has been deleted and garbage collected (i.e., is completely gone) from other DCs, creating a consistency problem across the directory partition and global catalog. If your DCs don’t have strict replication consistency enabled, the lingering objects will replicate to other DCs. And if you’re still running Windows Server 2008 or 2008 R2, you could induce USN rollback in your forest.
Microsoft escalation support states that “Lingering object issues are the most challenging Active Directory replication issue to resolve and are routinely escalated through multiple levels of support. On average, it takes twice as long to resolve a lingering object problem than it does the average AD replication issue as a result of the complexity involved in its troubleshooting.”
The only feasible way to take snapshots without creating lingering objects or other issues is to shut down the entire forest first. Are you willing to incur the outages that shutting down your forest will create? Do you have the skills and time to stamp out the lingering objects that will likely result from a snapshot restore if you don’t?
- If malware was on the DCs at snapshot time, you’re just restoring the malware. Since a snapshot contains the entire virtual machine, any malware that was on the VM will be restored as well.
- Whatever servers you don’t restore from snapshot, you must rebuild. Restoring fewer DCs from snapshots lessens the chance of the issues described above. But it means the seed forest that supports the business will be smaller (thus likely overloaded), and rebuilding the remaining DCs will be a manual process (thus slower and more error-prone).
- Snapshot recovery is just the beginning. Once your seed forest is restored, you must go through the lengthy forest recovery process. Even if you’ve built scripts to automate metadata cleanup or DNS cleanup, you must update them every time your AD topology changes. Get anything wrong in the process and you must start all over again. Have you built and tested a manual forest recovery process, preferably with managers yelling in your ear?
- The snapshots themselves may be vulnerable to malware attacks. If they’re stored in the default location on the host, they’re every bit as vulnerable to malware encryption as the DC VM itself.
- The AD team probably doesn’t control its backups. It’s usually the virtualization operations group, not the AD team, that controls VM snapshots. This delegates a critical disaster recovery function to a team that generally doesn’t know anything about Active Directory and may not recognize the extreme sensitivity of its backups.
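To make the lingering-object point from the first item concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: `ToyDC`, `find_lingering`, and `run_scenario` are hypothetical names invented for illustration, not part of any AD tooling, and the model compresses the tombstone lifetime into a single step and assumes no objects are created after the starting state. It shows how two snapshots taken at different moments can disagree about whether an object ever existed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDC:
    """Hypothetical stand-in for a domain controller; not a real AD API."""
    name: str
    live: set = field(default_factory=set)        # objects that currently exist
    tombstones: set = field(default_factory=set)  # deleted, not yet garbage collected

    def delete(self, obj: str) -> None:
        # A deletion turns a live object into a tombstone.
        self.live.discard(obj)
        self.tombstones.add(obj)

    def garbage_collect(self) -> None:
        # Assumption: the whole tombstone lifetime is compressed into one step.
        self.tombstones.clear()

    def snapshot(self) -> "ToyDC":
        # Point-in-time copy of this DC's state.
        return ToyDC(self.name, set(self.live), set(self.tombstones))


def find_lingering(dcs: list) -> dict:
    """Map DC name -> objects live there that some other DC has no record of.

    Assumption: no objects are created after the initial state, so "no record"
    can only mean "deleted and garbage collected" on the other DC.
    """
    found = {}
    for dc in dcs:
        for other in dcs:
            if other is dc:
                continue
            gone = dc.live - other.live - other.tombstones
            if gone:
                found.setdefault(dc.name, set()).update(gone)
    return found


def run_scenario() -> dict:
    dc1 = ToyDC("dc1", {"alice", "bob"})
    dc2 = ToyDC("dc2", {"alice", "bob"})

    snap1 = dc1.snapshot()      # dc1 is snapshotted first, while bob still exists

    for dc in (dc1, dc2):       # bob is deleted and the deletion reaches both DCs
        dc.delete("bob")
    for dc in (dc1, dc2):       # later, both DCs garbage collect the tombstone
        dc.garbage_collect()

    snap2 = dc2.snapshot()      # dc2 is snapshotted after garbage collection

    # "Restoring" the forest means running the two snapshots side by side.
    return find_lingering([snap1, snap2])


if __name__ == "__main__":
    print(run_scenario())
```

In this scenario the restored `dc1` still holds `bob` as a live object, while the restored `dc2` has neither the object nor a tombstone for it. That mismatch is the toy equivalent of a lingering object. A real forest has far more DCs, partitions, and in-flight updates, so treat this purely as an illustration of the mechanism.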
Hypervisor snapshots may be a workable disaster recovery strategy for small environments where you can shut down your entire AD forest while you take them. But this approach doesn’t scale and is highly error-prone. Remember, the goal is to bring AD back online as quickly and reliably as you can, because your organization will be in crisis mode.