Active Directory remains the core identity service for most enterprises and a favorite target in modern ransomware and wiper campaigns. When AD fails, identity-dependent services—email, conferencing, VPN, and critical apps—grind to a halt. And the only thing worse than trying to recover AD is scrambling to do so during a full-blown business crisis.
In this webinar, Josh Wagman, Director of Cyber Resilience Advisors, explains how attackers compromise AD and what it takes to coordinate identity recovery and crisis response so that you can quickly restore identity to a trusted state under real-world conditions.
You’ll learn:
- How attackers find and exploit common AD compromise paths, and the failure modes that follow
- What breaks when identity is down and how to operate when email, conferencing, and other core services are offline
- How to build a crisis‑ready blueprint that includes tested AD forest recovery runbooks, decision points, and order of operations to re‑establish trust before you reconnect your systems
- The essentials for coordination under duress: role clarity, out‑of‑band communications, and a single source of truth for tasks and status
- How to create a practical checklist to raise readiness before the next incident, informed by real‑world blockers
Hello. I’m Josh Wagman, Director of Cyber Resilience Advisors for Semperis. Today, we’re here to talk about crisis control and recovering Active Directory under pressure. We’ll go through the identity-related cyberattack landscape as it looks today, what happens when the identity system goes down, and the essential elements of a crisis-ready response plan.

Today’s threat landscape is constantly changing. New attack types and vectors are always available to threat actors, and as a result, it’s becoming more and more difficult to defend against those attacks. What this has done is cause a shift, especially at the board level within organizations, from a risk-based stance, where we try to prevent absolutely everything, to a resilient stance. The thinking behind that shift is that it’s not a matter of if we’re going to be attacked at some point, and whether that attack will be successful; it’s a matter of when. So we have to be better prepared to keep the business moving forward as best we can and to get back to normal as quickly as possible. This has gained visibility at the board level, and as a result, the focus is shifting from that risk-based posture to a resilient one.

We commissioned a study that surveyed a number of organizations, and 96% of respondents said they have a crisis response plan in place and practice it regularly through activities like tabletop exercises. Despite that, 71% of organizations still experienced at least one high-impact cyber incident in the past year, despite those crisis response plans and that regular practice. Of the respondents, only about 10% reported no serious blockers to effective cyber crisis response. When we asked what blockers they were seeing within their organizations, they highlighted communication breakdowns as probably the biggest one. Another, and I think many of us have experienced this, is outdated response plans. A third was tool overload: too many tools, deployed by too many different groups within the organization, leading to coordination breakdowns during response.

And we understand that you have a tough job to do. Given the changing threat landscape and the introduction of AI into threat actor activity, cyber has become the top business risk, globally and across all verticals and industries. The cost of cyber incidents is also increasing year over year, and not by a small amount. That’s really what’s forcing boards to reassess their risk tolerance and shift the focus from a traditional risk-based posture to a resilient one.

Now, we’ve been involved in incident response for a long time, and we’ve seen a recurring problem: the complexity and cost of orchestrating a proper response. It can cost organizations millions of dollars to properly fund a response effort, between the tools needed for communications and document storage and the overhead involved in managing those platforms. And despite having those tools in place from a traditional toolset standpoint, we still uncover a number of areas that need to be addressed in order for a response to be successful.
What we’ve seen is a breakdown across people, process, and technology, and regulatory isn’t making life any easier. On the people side, it’s not that we don’t have qualified responders; we do. But when it comes time to coordinate the effort across teams, especially as incidents escalate into crisis-level situations, it becomes very difficult with traditional tooling, and we’ve seen a number of communication breakdowns between those teams as well.

The processes in place are often fragmented or outdated. Do we even have access to the processes we need for the event we’re trying to remediate? Have we tested all of those processes? That’s been very difficult to have visibility into, traditionally speaking. And those two issues point toward technology. The disparate systems that have been used for these efforts over the years typically fail: some might be identity-integrated and unavailable during the very crisis we’re dealing with, and none of the tooling communicates with the rest, so we see the breakdowns.

I mentioned regulatory. Regulatory obligations now carry the potential for high financial fallout if not handled properly. We have to provide the proper disclosures depending on the regulatory environments we’re beholden to, and we have to file the correct information for cyber insurance. All of those can affect the overall outcome of the crisis or incident we’re dealing with. And around risk, we’ve got false confidence and response gaps that aren’t being addressed. The goal here is to shift from that risk-based posture to a resilient posture and close those gaps.

This, again, comes from a survey of our customer base, and what we uncovered really points to the problem I was talking about before: different teams within the same organization using different tool sets to try to accomplish the same thing. Your crisis management team might have one alerting platform, an out-of-band communications platform, and out-of-band document storage, while a different team in the same organization, say business continuity and disaster recovery, might have a completely different tool set for out-of-band communications, documentation, and so on. Then you’ve got your cyber IR team, which typically has a dedicated tool set for secure messaging. Case management might be done through the SIEM and SOAR, where nobody else has access, or through an ITSM platform that might be identity-integrated. They’ve got their own alerting and communications as well. And what we’ve seen is the breakdowns that happen in incident and crisis response as a result of these disparate tools. When we try to bring these teams together, it’s often very difficult. If we need to find documentation from a different team, we may not have access; that team might not even have access to its own documentation, depending on where it lives, at the time of a crisis.

And then we’ve also got the workforce to think of as well. This is an often overlooked area in incident and crisis response, but it’s potentially an area of great risk. Because if our in-band solutions aren’t available, say we’re a Teams shop and we can’t access Teams, email, or the global address list, then what is the workforce going to do to communicate? Typically, if not given tools, they will default to something they’re comfortable with.
And that usually means something insecure, like text messaging. That’s where information regarding the incident or crisis we’re dealing with leaks, and we can no longer control the narrative around how that information gets out. So it’s really important to address that.

One thing to note is the critical dependency that identity has in a cyber crisis. Active Directory and Entra ID are the most critical assets in an enterprise, so recovery from cyber incidents really starts with the availability of your identity stores. But they have to be in a trusted state. It’s not good enough just to recover; we also have to be able to trust that we’re not letting the threat actors right back into the environment. Can you trust your data protection vendor to cover your most critical enterprise asset, the identity store, and all of the dependencies that go into getting it back to a trusted state? That’s one question you really have to ask yourself.

Now, there’s a really interesting case study from Microsoft based on a real-life attack involving Active Directory and Entra ID. It concerns a parent organization with many subsidiaries underneath. Each of these subsidiaries has its own Active Directory domain, and they’re all connected back to the parent via trust relationships, so navigating between the domains is not a challenge in this environment. I’ve got the link to the article at the bottom of the slide here. What happened is that a threat group called Storm-0501 attacked this organization, and they were able to achieve Domain Admin privileges in one of the domains in the environment. One thing to note: even though the parent company had a number of subsidiaries with independent Active Directory domains, the Azure tenant was centralized. The parent owned the tenant, and all of the subsidiaries synced up into that same tenant. As a result, there were visibility issues within some of the subsidiaries; they didn’t have the access they needed to see where all of the security software was deployed. In this case, they were using Defender as the security platform, but they couldn’t tell whether all of their machines had Defender deployed, because they lacked the proper access to the parent tenant.

So, using an unsecured machine, the attackers were able to get Domain Admin privileges. From there, they used tooling, in this case evil-winrm, to move laterally within the environment and find the Entra Connect sync server. They then ran an attack process that included a DCSync attack to extract credentials and, using AzureHound, performed full reconnaissance on the Entra ID tenant, enumerating all of the users, roles, and Azure resources that existed. From there, they tried to log in directly to Entra ID. Now, the organization did have conditional access and MFA in place, which prevented this from succeeding, at least at that point in time. So the threat actors moved through a trust to one of the connected domains, where they found a second Entra Connect server and repeated the same process they had used on the initial Connect server, dumping credentials; and they already had their reconnaissance from AzureHound.
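A quick aside on that DCSync step, because it’s one of the more detectable moments in this chain: a DCSync is a replication request, so on the domain controller side it typically surfaces as Security Event ID 4662 with the directory replication control access rights requested by an account that isn’t a DC or a sanctioned sync account. Here’s a minimal, hypothetical Python sketch of that detection logic over events exported from a SIEM; the event field names and the known-replicator list are assumptions you would adapt to your own environment.

```python
# Hypothetical sketch: flag potential DCSync activity in exported
# Windows Security events (e.g., pulled from a SIEM as dicts).
# The field names ("event_id", "properties", "subject_account") are
# assumptions about your export format -- adjust to match your SIEM.

# Control access rights requested during directory replication.
DS_REPLICATION_GET_CHANGES = "1131f6aa-9c07-11d1-f79f-00c04fc2dcd2"
DS_REPLICATION_GET_CHANGES_ALL = "1131f6ad-9c07-11d1-f79f-00c04fc2dcd2"

# Accounts that legitimately replicate: DC machine accounts and, in
# hybrid setups, the Entra Connect (AD DS connector) account.
# These names are placeholders.
KNOWN_REPLICATORS = {"DC01$", "DC02$", "MSOL_a1b2c3d4"}

def is_dcsync_indicator(event: dict) -> bool:
    """True if a non-DC account requested replication rights (4662)."""
    if event.get("event_id") != 4662:
        return False
    props = {p.lower() for p in event.get("properties", [])}
    requested_repl = (
        DS_REPLICATION_GET_CHANGES in props
        or DS_REPLICATION_GET_CHANGES_ALL in props
    )
    return requested_repl and event.get("subject_account") not in KNOWN_REPLICATORS

if __name__ == "__main__":
    sample = {
        "event_id": 4662,
        "subject_account": "jdoe",  # ordinary user requesting replication
        "properties": [DS_REPLICATION_GET_CHANGES_ALL],
    }
    print(is_dcsync_indicator(sample))  # True -- worth an alert
```

The same idea works as a SIEM query; the point is simply to alert whenever replication rights are requested by anything that isn’t supposed to replicate.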
Back to the case study: what they found in that domain was a nonhuman identity with Global Admin privileges that was being synced. Because this was a nonhuman identity, MFA had never been enabled on it. Using that object, they attempted to log in to Entra ID, at which point they were prompted to set up MFA, and they registered MFA on one of their own devices. From there, they attempted to log in, but the conditional access policy on the tenant still blocked it: it required that the login attempt come from a machine that was both joined to Active Directory and connected to the Entra tenant directly. So they moved laterally within that domain and eventually found a server that didn’t have Defender on it but did have that dual membership, joined to Active Directory and connected to Entra ID. From there, they had full access to the Entra tenant and, basically, the keys to both kingdoms. It’s a very interesting case study, and it really highlights the visibility we need into our Active Directory environment to make sure it’s protected.

Now, when you go through a situation like that, you’re really looking at a full Active Directory recovery in order to restore operations. If you have to do this manually, you’re going to go through the 29 steps highlighted on this slide, which expand into roughly a 100-page document provided by Microsoft. It’s a long and arduous process. If you have good backup software in place but it’s a general-purpose backup solution, the only assistance you’ll have during this recovery is with step three: for each domain in the forest, and for all connected domains that need to be recovered, you will only have help with a nonauthoritative restore of the first writable DC in each of those environments. That means a lot of manual work after that restore has been performed.

What we’re looking at here is what I call the scariest game of snakes and ladders for an identity professional. This is basically the process you undergo if Active Directory is attacked. At the start, we can see that we’ve been attacked and our identity system is offline; our users cannot log in. As a result, in step two, we’ve got outages: all of the systems with dependencies on our identity store are no longer functional, and in most organizations, that’s going to take down pretty much everything. So now that we’ve established we can’t log in and most, if not all, of our systems are down, how do we get our team together? Do we have contact information in a secondary location? I’m tying this back to the challenges we’ve seen in crisis management: do we have the information we need to gather the right people and communicate efficiently to start the remediation process? Without the proper tooling in place, we’re going to see massive delays as we try to get hold of all of our resources and spin up a centralized bridge so everybody can communicate. We have to review the IR plan, if we can access it, and then go through the recovery process: stand up our infrastructure, choose the backup, and so on. And what’s really important to note here is that if the crisis plan is not in place, this can be a very, very difficult challenge.

I’ve also got a story from the field, one that we at Semperis were directly involved in, about what happens when Active Directory is attacked; I’ll get to that in a moment.
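First, one practical takeaway from the Storm-0501 case study: the pivot point was a directory-synced, nonhuman identity holding Global Admin with no MFA registered. That’s a pattern you can hunt for proactively. Below is a hedged sketch against the Microsoft Graph REST API; it assumes you already hold an access token with the appropriate read permissions (token acquisition omitted), and the exact endpoints and report fields may vary by tenant and API version, so treat it as a starting point rather than a finished tool.

```python
# Hypothetical sketch: find directory-synced identities holding
# Global Administrator with no MFA registered, via Microsoft Graph.
# Assumes a token with Directory.Read.All plus report-read permissions.
import requests

GRAPH = "https://graph.microsoft.com/v1.0"
# Well-known role template ID for Global Administrator.
GLOBAL_ADMIN_TEMPLATE = "62e90394-69f5-4237-9190-012177145e10"

def get(token: str, url: str) -> dict:
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    return resp.json()

def risky_global_admins(token: str) -> list[str]:
    """Synced (on-prem sourced) Global Admins with no MFA registered."""
    findings = []
    members = get(
        token,
        f"{GRAPH}/directoryRoles(roleTemplateId='{GLOBAL_ADMIN_TEMPLATE}')/members",
    ).get("value", [])
    for m in members:
        # Role members can also be service principals; this sketch
        # only inspects user objects.
        if m.get("@odata.type") != "#microsoft.graph.user":
            continue
        user = get(
            token,
            f"{GRAPH}/users/{m['id']}"
            "?$select=userPrincipalName,onPremisesSyncEnabled",
        )
        # MFA registration state from the authentication methods report.
        reg = get(
            token,
            f"{GRAPH}/reports/authenticationMethods/"
            f"userRegistrationDetails/{m['id']}",
        )
        if user.get("onPremisesSyncEnabled") and not reg.get("isMfaRegistered"):
            findings.append(user["userPrincipalName"])
    return findings
```

Anything this turns up, a synced account with a standing privileged role and no MFA, deserves scrutiny regardless of whether you’ve been breached.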
So, the story from the field. The victim is an offshore telecom and microbanking organization that has been deemed critical infrastructure in its own country. This organization had 1,500 IT endpoints as well as 10,000 endpoints on the OT network, and they were running a modern Microsoft infrastructure, which at the time meant Windows Server 2016. There was no virtualization in place, the security platform they leveraged was basic Defender endpoint security, and they had a very insecure Active Directory configuration.

The threat actors involved in this attack, and at least three different nation-state actors were discovered on the network, were MuddyWater from Iran, GALLIUM from China, and Cozy Bear, also known as APT29, from Russia. The goals of these attackers were espionage and exfiltration, not necessarily ransomware. The defenders engaged were QGroup from Germany, founded in ’93 and specializing in APT response, with close ties to the German government and military, and Semperis, involved on the identity side of the incident response.

The attack was eventually detected because of the load the threat actors were putting on the domain controllers and servers in general. Data exfiltration can obviously put some load on servers, and they were pulling so much information out that the servers were actually bogging down. Basically, almost every kind of standard APT tradecraft from different sources was discovered in this environment. They were using a collection of vulnerabilities that did not appear in any US security or manufacturer notifications, so they were ahead of the game with the vulnerabilities they were using. We set up some pretty crazy AD training environments that contain a lot of misconfigurations and potential exploits; well, there were more exploits in this environment than you’d actually find in one of those scary training environments. It was an absolute mess. AD was fully compromised, and to make matters worse, their firewall infrastructure was completely Active Directory-integrated, which meant the firewalls were completely gone as well.

So what happens when AD is attacked like this? What do we typically see? The most obvious things we see on a day-to-day basis with Active Directory-based attacks are malware on all of the DCs, either in the OS layer or within SYSVOL, and some DCs may no longer be functional, so you might not have all of your domain controller infrastructure available. But some of the not-so-obvious things we typically discover in these situations are changes to the Active Directory service itself. Probably the most common is privileged group membership changes: adding users that should not have privileged access into Domain Admins, Enterprise Admins, or other high-value groups that may be custom within an organization. We also see permissions changed on key security mechanisms within Active Directory, AdminSDHolder being a good example, modified to hide a path to elevation. And Group Policy Objects are very often modified: to deploy software in the form of malware, to provide access to different areas within the environment, to deploy certificates, whatever a threat actor needs to do. Group Policy is obviously an efficient way to do that.
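Since privileged group membership changes are the most common of those indicators, they are also among the easiest to baseline and diff on a schedule. Here’s a small, hypothetical audit sketch using the Python ldap3 library; the server name, credentials, and DNs are placeholders. It also flags populated sIDHistory attributes, which ties into the backdoors I’ll describe next.

```python
# Hypothetical audit sketch with the ldap3 library: baseline the
# *effective* (nested) membership of high-value groups, and flag any
# account carrying sIDHistory. Hostname, credentials, and DNs are
# placeholders for your environment.
from ldap3 import ALL, Connection, NTLM, SUBTREE, Server

BASE_DN = "DC=corp,DC=example,DC=com"
HIGH_VALUE_GROUPS = [
    f"CN=Domain Admins,CN=Users,{BASE_DN}",
    f"CN=Enterprise Admins,CN=Users,{BASE_DN}",
]
# LDAP_MATCHING_RULE_IN_CHAIN: resolves nested group membership.
IN_CHAIN = "1.2.840.113556.1.4.1941"

server = Server("dc01.corp.example.com", get_info=ALL)
conn = Connection(server, user="CORP\\auditor", password="...",
                  authentication=NTLM, auto_bind=True)

for group_dn in HIGH_VALUE_GROUPS:
    conn.search(BASE_DN,
                f"(&(objectClass=user)(memberOf:{IN_CHAIN}:={group_dn}))",
                SUBTREE, attributes=["sAMAccountName"])
    members = sorted(str(e.sAMAccountName) for e in conn.entries)
    # Diff this list against your known-good baseline.
    print(group_dn, members)

# Any populated sIDHistory outside a sanctioned migration is suspect.
conn.search(BASE_DN, "(sIDHistory=*)", SUBTREE,
            attributes=["sAMAccountName"])
for entry in conn.entries:
    print("sIDHistory set on:", entry.sAMAccountName)
```

Diffing that output against a known-good baseline, daily or even hourly, turns several of these not-so-obvious changes into obvious ones.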
We also discover hidden objects in the environment: an object might be created and privileged, with a deny-read ACE set on it, hiding it from your administrative groups. And then there are backdoors as well, toolsets like Mimikatz leveraging a DCShadow-type attack for SID history injection to grant privilege without much visibility. These are all things we typically see within these environments that may not be as obvious as the typical ransomware or malware.

So when identity is under attack, how do we get back to secure and, quote-unquote, normal? Well, in the event I described before, the first thing we did was install Active Directory Forest Recovery (ADFR). We installed a management server into the environment, deployed a number of agents on the minimum recommended set of DCs, and then took a backup of that known-infected environment. ADFR backups operate a little differently from everything else: we leave the operating system behind and pull just the relevant Active Directory information needed to fully recover AD. In this case, that let us use the backup as part of our incident response effort, and it also let us take the information from those physical domain controllers they had in place and recover them virtually anywhere.

From there, we get into the recovery portion. We’ve got the backup of the environment; now we need to recover. Obviously, we don’t want to recover into an area the threat actors can immediately access, so we need to segregate: create an isolated network and provision fresh virtual machines with agents on them. It’s really important to use fresh virtual machines; they’re the only way we can trust that the machines will be malware-free. Then we make the management server, the ADFR server holding the backups, available in that isolated network. From there, we can perform an automated, high-speed AD forest recovery to a minimum-viable deployment on those new, malware-free machines in the isolated environment.

That doesn’t mean we’re done, though. Active Directory is fully recovered, and we’re confident there’s no malware in that recovery, but we still have to sanitize the environment and make Active Directory trustworthy again. To do that, we ran Purple Knight Post-Breach Edition, which lets us focus on an attack window and find the threat actor’s ingress, where they took control of the Active Directory service, and all of the changes they made during that window. Once that’s done, we run a full Purple Knight assessment as well as Forest Druid; those are free community tools available to anybody, and together they perform an overall vulnerability assessment and attack path analysis across the identity stores for Active Directory and Entra ID. With that information, we can secure the environment again: we remediate the most critical items, focusing on where the threat actors most likely gained entry to the Active Directory environment and the network in general. And if we have time during this process, we’ll apply security best practices. Now, that’s going to vary situationally depending on what we’re dealing with.
But if we have time, we immediately recommend implementing something like admin tiering. If you can’t do that immediately, it’s something that should be addressed in the very near term.

At this point, we’re ready to cut over. We’ve got our recovered and sanitized environment operating in isolation. First, we shut down the original dirty production forest, the forest that was compromised and used in the attack. Then we take the domain controllers in the isolated environment and update their DNS resolvers to point to the production DNS infrastructure. Next, we open that isolated network to production, exposing the new domain controllers to that environment, and register the domain controllers in DNS. And here’s the really critical part: we have the option to change the IP addresses of the domain controllers during the recovery process. Once the new DCs are registered in DNS, all we have to do is reboot the domain-joined servers and PCs that we’ve investigated and determined to be clean. After the reboot, they will pick up the new domain controllers, reestablish the secure channel connection they need in order to authenticate, and proceed as normal. Once that’s done, our environment should be back up and running. At that point, if there are any other applications, servers, or workstations that need to be recovered, we can go through that process then.

A few things to consider during this process. When recovering the forest, depending on the size of the environment, it’s not enough to have just a single domain controller per domain; the recovery has to scale to support production capacity on Day 0 or Day 1. Whoever needs to be able to access the environment or their applications quickly, we must scale the domain controller infrastructure to meet that need immediately. Then we have to examine our Active Directory-dependent applications, like PIM, virtualization infrastructure, and backup infrastructure; any other component that ties into Active Directory directly should be checked to ensure it’s functioning properly. And of course, we have to resync with whatever our cloud identity provider of choice is: re-enable whatever connectors we’ve got in the environment, make sure they’re syncing properly and efficiently, and confirm the information we recovered has been updated in those tenants.

And again, I’ll bring up the manual recovery of Active Directory just to highlight the steps that would need to be taken if a general-purpose recovery solution were in place instead of something focused on recovering Active Directory forests. Using a forest recovery utility can greatly shorten the time and runway needed to get that identity system back online so we can begin the rest of the triage process; using a traditional method like this could take days to weeks to get Active Directory back online and working again.

Now, identity recovery and crisis management are really inseparable for effective incident response. Why is that? I mentioned this a little before, but when you’re having a severe issue with the identity platform, the chances of it becoming a crisis are almost certain, because of the dependency the rest of the environment has on identity. If we’ve got a critical issue with identity, there’s a very good chance it becomes a crisis.
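One more practical note on the cutover above: once the isolated network is opened and the new DCs are registered in DNS, it’s worth scripting a quick sanity check that domain members will actually find the rebuilt domain controllers. Here’s a hypothetical sketch using the Python dnspython package; the domain and DC names are placeholders.

```python
# Hypothetical post-cutover sanity check: confirm that DC locator SRV
# records in production DNS now point at the rebuilt domain
# controllers, and that each advertised DC answers on its LDAP port.
# Requires the dnspython package; names below are placeholders.
import socket
import dns.resolver

DOMAIN = "corp.example.com"
EXPECTED_DCS = {"dc01.corp.example.com.", "dc02.corp.example.com."}

# Domain members locate DCs through this SRV record after reboot.
answers = dns.resolver.resolve(f"_ldap._tcp.dc._msdcs.{DOMAIN}", "SRV")
advertised = {str(rr.target).lower() for rr in answers}

stale = advertised - EXPECTED_DCS
if stale:
    print("WARNING: DNS still advertises old/unknown DCs:", stale)

for rr in answers:
    host, port = str(rr.target).rstrip("."), rr.port
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable (LDAP)")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable: {exc}")
```

On the clients themselves, after the reboot, `nltest /sc_verify:<domain>` is a quick way to confirm the secure channel has been reestablished against one of the new domain controllers.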
Now, it’s really important organizationally to understand what our internal escalation points are. We have to have that defined. When does an event become an incident? When does an incident become a crisis? When does a crisis become a disaster? Those definitions have to be clear, because each stage of escalation carries different requirements we’ll need to meet in order to be successful in our response. We’re going to have different teams involved: as things escalate to a crisis level, we need to involve areas of the organization we may never have had to involve before, like legal, senior management, human resources, or PR. We have to have our proper communications plans in place, and we need to know which response plan to leverage in that situation. Having these definitions is absolutely critical to being successful in incident response and crisis management. And identity-based issues, as I’ve mentioned a couple of times already, especially with authentication systems being down, typically fall into the crisis category rather quickly.

Now, when we have a crisis-level situation, there are certain things we have to get right in order to be efficient and effective in response. Number one, our crisis technology has to be available; we have to have the systems we need to facilitate that response effort, and all of the rest falls under that crisis tech being available. We need access to runbooks, playbooks, recovery documentation, business continuity information, downtime procedures, and contact information for both internal and external resources. How do we bring our responders in? How do I get hold of them? Our processes have to be up to date and trusted: we have to know the crisis response plan is current, tested, and validated, and that our recovery documentation is accurate enough to use in that response effort. In most situations, we also have to be able to contact our third-party IR support and bring in the resources that can help us succeed; we need their contact information, a way to communicate, and a way to validate that they are in fact who they say they are.

Our stakeholders, both internal and external, have to be aligned with our responders. We can’t continually pull responders away from their effort to try to get everybody on the same page; we have to have an efficient way to do that. And we can’t use tools that are siloed between different teams. That just creates chaos and breakdowns in coordination and communication, and it makes it really difficult to get everyone pulling the rope in the same direction. Probably one of the most important aspects of crisis response is communication. We have to have communication available, and it has to be secure. I’ve been on countless calls where I’ve heard the threat actors sitting on the bridge in these crisis situations. We can’t have that; we have to know who’s on the bridge at any given point in time, and we’ve got to be able to control it. And then, because of the regulatory overhead as well as cyber insurance, all of these events have to be documented properly. With disclosures, many are on a ticking clock, so they have to be done quickly; with cyber insurance, if we’re not documenting everything properly, it can greatly affect the eventual payout at the end of the day.
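Those escalation definitions are worth writing down somewhere explicit and reviewable rather than leaving them in people’s heads. As a purely illustrative sketch, here’s one way to codify an escalation matrix in Python; every tier name, trigger, team, and channel below is a placeholder that your own plan would define.

```python
# Illustrative sketch: codify escalation tiers so the definitions are
# explicit and testable. All names and triggers are placeholders.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    trigger: str     # when does something become this tier?
    teams: list[str] # who must be engaged
    comms: str       # which (out-of-band) channel to use
    plan: str        # which response plan applies

ESCALATION_MATRIX = [
    Tier("event", "anomaly detected, no confirmed impact",
         ["SOC"], "normal ticketing", "triage runbook"),
    Tier("incident", "confirmed malicious activity or service impact",
         ["SOC", "IR team", "service owners"],
         "dedicated IR channel", "incident response plan"),
    Tier("crisis", "identity/auth down or multi-service outage",
         ["IR team", "crisis mgmt", "legal", "HR", "PR", "senior mgmt"],
         "out-of-band bridge", "crisis response plan"),
    Tier("disaster", "sustained loss of core infrastructure",
         ["all of the above", "BC/DR team"],
         "out-of-band bridge", "disaster recovery plan"),
]

def escalate_to(tier_name: str) -> Tier:
    """Look up who to engage, and how, when a tier is declared."""
    return next(t for t in ESCALATION_MATRIX if t.name == tier_name)

print(escalate_to("crisis").teams)
```

However you store it, the point is that when someone declares a crisis at 2 a.m., who to engage, over which out-of-band channel, and under which plan should be a lookup, not a debate.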
So, a call to action: Semperis runs a number of tabletop exercises around the globe called Operation Blindspot. There’s no cost to sign up, and they’re held at many different events; check for one near you. Participating in an event like that can help uncover the blind spots in your identity crisis management plan and give you the information you need to take back to your organization and develop a more robust response for issues around identity crisis management. Thank you all for your time today.
