July 30, 2024

EP 58 – Trust and Resilience in the Wake of CrowdStrike’s Black Swan

In this episode of Trust Issues, we dig into the recent global IT outage caused by a CrowdStrike software update, which impacted millions of Microsoft Windows endpoints and disrupted many sectors. This “black swan” event highlights, among other things, the importance of preparedness, adaptability and robust crisis management. CyberArk Global Chief Information Officer (CIO) Omer Grossman discusses with host David Puner the outage’s ramifications, the shaking of trust in technology and the criticality of resilience against cyberthreats. This conversation underscores the need to be ready for the unexpected and the value of adaptability and resilience in unforeseen circumstances.

[00:00] David Puner
You’re listening to the Trust Issues Podcast. I’m David Puner, a Senior Editorial Manager at CyberArk, the global leader in identity security.

[00:10] David Puner
The July 19th global IT outage that resulted from a faulty CrowdStrike software update affected an estimated eight and a half million Microsoft Windows endpoints. In turn, it disrupted sectors including airlines, hospitals, and governments, making it one of the biggest IT disruptions ever. It’s one of those watershed cyber events with widespread implications, one we’ll all remember.

[00:35] David Puner
More importantly, as defenders, we must collectively learn from it. This outage, a black swan in IT because it was unforeseen and carries significant consequences, underscores the critical need for preparedness, adaptability, and vigorous crisis management. There are also broad implications for maintaining trust in an era where the unexpected is the only certainty.

[00:55] David Puner
To discuss all this, today we’re joined by CyberArk’s Global Chief Information Officer Omer Grossman. Omer talks about the outage’s impact and its broader implications that we know of thus far. He also explores how an event like this can shake trust in technology and the importance of resilience in the face of cyberthreats and incidents.

[01:20] David Puner
It’s a conversation that emphasizes the importance of being prepared for the unexpected and the value of adaptability and resilience in the face of unforeseen events. Big important note. We’re not here to criticize CrowdStrike, but instead to learn from this event with the goal of strengthening our collective defenses against future cyberthreats and, perhaps in some small way, helping to preempt or sidestep the next would-be black swan incident.

[01:45] David Puner
Here’s my conversation with Omer Grossman. Omer Grossman, CyberArk’s global CIO. Welcome to Trust Issues.

[01:50] Omer Grossman
Thank you, David, for having me.

[01:55] David Puner
Thank you so much, Omer. We’ve been looking forward to this for a while. To having you on, that is. We didn’t know what the topic would be, of course, until this point last Friday, July 19th. And we’ll get to that in a moment. By the time this episode comes out, it’ll be at least a Friday ago. To let folks get to know you a little bit first: as CyberArk’s global CIO, what are you charged with in your role, and what does a typical day look like for you, or day and night, as the case may be?

[02:20] Omer Grossman
As CyberArk’s CIO, before anything else, I lead a global technology organization that supports CyberArk’s business goals and values. I guess the standard answer would be that I have a broad range of responsibilities that span security, infrastructure, business applications, and data internally. I’m actually also the sponsor of the company’s business continuity program, which basically ensures that we can operate smoothly and securely in any situation. Maybe we’ll refer to that later on, going back to last Friday. But at the core of my role, maybe even my “why,” is that I’m building trust. I like the people, processes, and technology framework, the PPT framework. My team and I are trusted partners of the other business units and key stakeholders in the company, like product and technology. And I know you had Peretz Regev here.

[03:05] David Puner
Chief product officer. Yes. Yep.

[03:10] Omer Grossman
And just lately, the go-to-market organization, HR, finance, and legal. So, at the end of the day, no matter how much tech we’re using, it’s all about the people. That’s the first piece. The second one is the processes, and basically, you want to make sure you have trusted processes. All of them are based on IT platforms, applications, automation, etc. So you assume that the data quality is okay, the AI copilot recommendation is correct, and the automation doesn’t fail in the middle. You also assume that someone took care of the segregation of duties issue, and the compliance, by the way. And last but not least is the tech. Whenever you click on the mouse button or consume knowledge through the screen or just hit the keyboard and send a bit of data upstream to the back end, you want to make sure everything works. That’s on IT. So, trust is the most important currency in the digital age. You earn it in drops, you lose it in buckets. And it’s my biggest challenge. About a typical day, I think that was the second part of your question.

[04:00] Omer Grossman
It basically involves back-to-back meetings on different topics and initiatives. Actually, we had a meeting before. I have long days usually, and it changes depending on the priorities and challenges. I regularly have one-on-one meetings with my direct reports, key stakeholders, and executive leadership to align strategy and direction, spending time on operational reviews, project updates, sometimes even vendor negotiations, most of the time budget planning, and risk management. Maybe alongside all of this, I consider myself a continuous learner, so I’m always trying to stay on top of trends and innovations.

[04:30] David Puner
You are one of my most prolific writers for the CyberArk blog, and we thank you for that. And we encourage our listeners to check out your work. You’ve got a monthly post, and every once in a while, you post a second as well, and we really appreciate that. So, thank you. And on top of that, I can vouch for this being a particularly long day for you because we first chatted about nine hours ago, and it is now about four o’clock in the afternoon for me and 11 o’clock on Thursday night for you, which is, of course, going right into your weekend. So, thank you for doing this now. So, of course, the focus of why we’re here and the big cyber story this summer has been the extensive interruption and disruption caused by a CrowdStrike outage that affected Microsoft Windows endpoints. And to start things off, what happened? Why did it happen? And what else do we know and not know at this point at the time of this recording, which, of course, is about six days out from when it happened?

[05:30] Omer Grossman
I’ll try to explain it as I did with my mother-in-law. Basically, keep it simple, and then we’ll try to deep dive a little bit into a slightly more technical description. In a nutshell, CrowdStrike is a leading endpoint security vendor. An update they pushed to their endpoint security agent crashed more than 8 million Windows-based computers worldwide. That’s huge. The fix that they deployed later on required physical access to the computer, making recovery a manual and long process that took days. I believe some organizations haven’t fully recovered yet, at least at this point. The outage disrupted airlines, hospitals, governments, even TV stations, and more, making it, I believe, not just bigger than any cyberattack in history but the biggest IT disruption ever, as far as scale goes. Is that right? Or at least that’s what they’re saying at this point.

[06:10] David Puner
Yeah, there’s an initial estimation of a cost that this disruption caused, and maybe we’ll get to that later on, but that’s huge.

[06:20] David Puner
What was the update itself, and is that at all relevant to this story?

[06:30] Omer Grossman
Yeah. So, as you said, on Friday, July 19, at about 4 a.m. in the U.S., there was a release by CrowdStrike to their Falcon solution. Falcon is their EDR, Endpoint Detection and Response, solution. It was a configuration update. The update itself triggered a logic error resulting in a system crash, what Microsoft calls the blue screen of death. Microsoft estimated about eight and a half million impacted systems. So, basically, this means that CrowdStrike updated a behavioral indication of compromise, an IOC, that crashed the Windows operating system. The reason that these kinds of things can happen is that an EDR solution usually has kernel permissions, which are the highest permissions possible on Windows machines. Breaking things at that level might, as we saw last weekend, break the entire OS, the entire operating system. The next thing is that when you get the blue screen of death, in that case at least, the computer lost its internet connection. Once you lose your internet connection, rebooting the system doesn’t help. You get to a situation that basically created a real headache for the IT personnel who now needed to reach every endpoint and fix it manually. So, the combination of a leading EDR vendor, a huge worldwide adoption of Windows-based endpoints, and servers, by the way, and a specific error that couldn’t be resolved by just rebooting the system: all of these together created this massive outage.
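To make that fail-safe idea concrete, here is a minimal sketch in Python, assuming a hypothetical rule format; it is not CrowdStrike’s actual channel-file format or parsing logic. The point is simply that a malformed content update should be rejected and the agent should fall back to its last known good configuration rather than take the host down.

```python
import json

# Hypothetical rule format for illustration only; not CrowdStrike's actual
# channel-file format or validation logic.
REQUIRED_FIELDS = {"rule_id", "pattern", "severity"}


class UpdateValidationError(Exception):
    """Raised when an update payload fails basic sanity checks."""


def validate_content_update(raw_bytes: bytes) -> list[dict]:
    """Parse and sanity-check an update before it reaches the detection engine."""
    try:
        rules = json.loads(raw_bytes)
    except json.JSONDecodeError as exc:
        raise UpdateValidationError(f"payload is not valid JSON: {exc}") from exc
    if not isinstance(rules, list) or not rules:
        raise UpdateValidationError("payload must be a non-empty list of rules")
    for rule in rules:
        if not isinstance(rule, dict) or REQUIRED_FIELDS - rule.keys():
            raise UpdateValidationError(f"malformed rule: {rule!r}")
    return rules


def apply_content_update(raw_bytes: bytes, last_known_good: list[dict]) -> list[dict]:
    """Apply the update only if it validates; otherwise keep the previous rules running."""
    try:
        return validate_content_update(raw_bytes)
    except UpdateValidationError as err:
        # Fail safe: degrade to the last known good configuration instead of crashing.
        print(f"update rejected, keeping last known good rules: {err}")
        return last_known_good
```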

[07:55] David Puner
And so not only were IT pros scrambling at that point, of course, but every affected consumer and patient and everyone else, somebody at a concert, and the opening act couldn’t make it from a flight, all that kind of stuff.

[08:10] Omer Grossman
Yeah. Yeah. The impact, as I said at the beginning, it’s all about the people. So people got hurt. Leaving aside my wife, who couldn’t pay with a credit card in an IKEA store that morning. But I heard about a surgeon who couldn’t use equipment in a hospital during surgery. We all heard about people stuck at airports because flights didn’t take off and there were delays.

[08:30] David Puner
Is it typical that an update like this would roll out all at once?

[08:40] Omer Grossman
No, that’s a great question. The best practice would be for the vendor to do a thorough and robust validation of the update. Basically QA, quality assurance, simulating as closely as possible the operational environment and the way the update will be handled in real life. The test environment should be as similar as possible to the actual environment. This way, you make sure that if you break things or something doesn’t work, you see it in the lab before sending it out to the world. And when you send it to the world, you need to make sure it’s gradually deployed, as another measure of caution, so you’re able to revoke the action if you see something isn’t right at the first stage. On the other side, from the organization’s perspective, the IT personnel or the security team, the best practice is to test the update and deploy it only to a test group first, make sure nothing breaks, and only then deploy it to the entire organization. The thing about this specific update is that it was an IOC update, a channel file as CrowdStrike referred to it. It’s not really a big one, not even a major update. It’s not like upgrading the iOS version on an iPhone or something like that. It’s actually a relatively small update.
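As a rough sketch of the phased-deployment practice Omer describes, the snippet below walks an update through rings, a canary group, a test group, then the wider fleet, pausing to check health before widening the blast radius. The ring names, hosts, and helper functions are illustrative assumptions, not any particular endpoint management tool’s API.

```python
import time

# Illustrative ring definitions; names, sizes, and hosts are made up for the example.
DEPLOYMENT_RINGS = [
    {"name": "canary", "hosts": ["it-lab-01", "it-lab-02"]},
    {"name": "test-group", "hosts": ["pilot-101", "pilot-102", "pilot-103"]},
    {"name": "org-wide", "hosts": ["fleet"]},  # placeholder for the rest of the fleet
]


def push_update(host: str, version: str) -> None:
    """Stand-in for whatever endpoint management tool actually ships the update."""
    print(f"deploying {version} to {host}")


def ring_is_healthy(hosts: list[str]) -> bool:
    """Stand-in health check: agents reporting in, no crash or blue-screen spike, etc."""
    return True  # replace with real telemetry queries


def phased_rollout(version: str, soak_minutes: int = 60) -> None:
    """Deploy ring by ring, soak, and halt before the next, larger ring on failure."""
    for ring in DEPLOYMENT_RINGS:
        for host in ring["hosts"]:
            push_update(host, version)
        time.sleep(soak_minutes * 60)  # observation window before widening the blast radius
        if not ring_is_healthy(ring["hosts"]):
            print(f"halting rollout: ring '{ring['name']}' reported failures")
            return
    print("rollout complete")


phased_rollout("example-update-001", soak_minutes=0)  # example invocation
```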

[09:45] David Puner
So we don’t know whether protocol was broken or not. It may have just been considered to be such a minor update that they rolled it out all at once.

[09:55] Omer Grossman
Yeah, I would say that I’m very optimistic, and we talked about trust. You should make sure that every update has a phased implementation or deployment. That’s the best practice, big or small. You also need to be careful not to take the lessons learned from this incident too far in the other direction, because not implementing or deploying updates is not a good practice either. You need to stay updated. You need to be vigilant with the latest updates, etc. So the answer here is not to stop updates. The best answer is to do it, do it in a responsible, effective way, but do it. Still do it.

[10:35] David Puner
But this one, was it an automatic update, or was it something that somebody actually had to do manually or accept the update?

[10:45] Omer Grossman
It depends on the configuration. Usually, you do need to hit the okay button, but the way it’s configured, it’s very much possible that it was an automatic update. If the agent, the Falcon agent, is configured for automatic updates, it might get it without any human in the loop. It might be. It’s a possible scenario. I would say that if you ask me if it’s been fixed, I guess the short answer would be yes.

[11:15] David Puner
I’m glad to know that.

[11:20] Omer Grossman
Yeah. And by the way, CrowdStrike fixed it pretty fast, actually. They got a fix. A bit longer answer would be: probably yes, but. From a tactical perspective, yes, there is a fix and a formal guideline for deploying it. And as I said, they did it rather fast. They acknowledged it’s a technical issue. They found the issue. They made sure it’s not a cyber thing; in the beginning of the incident, there was an option that it may be a cyber incident. But getting to each impacted computer takes time, as I mentioned earlier, sometimes even days, and it couldn’t have been deployed across more than 8 million computers without all the IT people and the security teams. It couldn’t be handled without great customer support people, who I think were working 24, 48 straight hours throughout the weekend to get the systems back online.

[12:00] Omer Grossman
So the bad part isn’t just that it took days to fix. If you look not from a tactical perspective but at the bigger picture, there are huge lessons learned here. It will take some time and a lot of effort to get them all addressed. It will probably take a village, as we like to say in the security community. The issue in the bigger picture isn’t yet resolved. It’s not just the gradual deployment of updates; there are many other things to be considered.

[12:30] David Puner
And I definitely want to hear what you have to say as far as ramifications and lessons learned shortly, because I’m sure there’s an infinite number of ramifications and maybe slightly fewer lessons learned, but still many. But I’m curious, as a CIO, what in particular resonates with you about what happened? What were your immediate concerns when you heard about this?

[13:00] Omer Grossman
What resonates with me are the trust issues. Pun intended, David. You trust your vendor, and it fails you big time in that incident. It actually blew my mind. I couldn’t stop thinking about it. As the CIO at CyberArk, and CyberArk wasn’t impacted at all by this outage, I do have two more tangible concerns. First of all, I was interested in making sure our customers are okay. Those efforts were led by our excellent customer support team. That was my main concern, even though it’s not an internal cyber or digital infrastructure issue or something like that. The second thing was revalidating that our internal IT update processes are using phased deployment best practices, and I can assure you that we do have a phased approach. Those two things were the immediate tangible concerns I had: thinking about the trusted vendor issue, and making sure our customers are okay and our practices are in place.

[13:45] David Puner
I’d like to backtrack a moment in order to move forward because I think that your story as a CIO, as a global CIO is super interesting. And you came to CyberArk, you came here in December of 2022. So just shy of a couple of years with an interesting background and perspective. How does your background before CyberArk shape your POV about this particular event, and then other major cyber events as well?

[14:15] Omer Grossman
First of all, I had the privilege of serving my country for 25 years as an officer in the military. I’m actually still doing reserve duty, so I’m not done with supporting the greater good. In my last two leadership positions as a colonel, I led the biggest cloud service provider unit in the IDF, the on-prem top-secret cloud, and after that, I led the cyber defense operation center, basically functioning as the IDF CISO. And for most of the last decade, maybe even a decade and a half, I had a frontline seat to everything related to cyber warfare. Regarding my relevant experience, I still remember how we stopped the NotPetya attack before it spread across Israel, basically protecting my country in the cyber domain. What I learned back in 2017 is that small actions can create or mitigate great havoc. Just like last weekend, when a small action melted down the internet. That’s the first thing I remember thinking about: the ripple effects of small things, butterfly effects, if you want. The second thing I learned, and it’s applicable in this case as well, is that operational capabilities rely on a functioning digital ecosystem. This is why we should be ready for disruption in our most mission-critical processes. Resilience is key. It’s like in cyber, where we say: prepare for the worst and assume breach.

[15:30] David Puner
Absolutely. And it seems like time and again, both in the real world and in the cyber world even more so, everything is interconnected. And that’s those ripple effects.

[15:40] David Puner
An article published this week on cio.com, which I assume is a publication that may be in your bookmarks, says that the CrowdStrike incident has CIOs rethinking their cloud strategies and that, quote, “For CIOs, the event serves as a stark reminder of the inherent risks associated with overreliance on a single vendor, particularly in the cloud,” end quote. What is your take on that, and what about their cloud strategies should CIOs maybe be rethinking as a result of this event?

[16:15] Omer Grossman
Okay. So first, cio.com is in my bookmarks, rest assured. It’s easy to say that you want to be a multi-cloud company, a more resilient one, as part of your cloud strategy. But in most cases, if you’re a software vendor, for example, you usually have a primary cloud service provider that you’re using. Still, this doesn’t mean you can’t dramatically improve your digital resiliency with availability zones and a data center strategy. I wouldn’t want other CIOs to think that replacing their AWS, Azure, or GCP is easy. It’s so costly, if you think about replacing a big part of your workload, that you usually need to analyze it from a risk perspective, a cost-effective risk perspective, some kind of combination of potential impact and probability. My takeaway, or my suggestion to other CIOs regarding the quote you mentioned earlier, is to build the IT landscape on a solid architecture. But as the saying goes, plans are nothing, planning is everything. So, I would also recommend having BCP controls in place and training your organization to function with limited digital capabilities.

[17:10] David Puner
Like business continuity plans.

[17:15] Omer Grossman
Exactly. So, you cannot not join the cloud. You do need to think about it from a risk perspective. And alongside the solid architecture, you need to have readiness in your organization to deploy and function with business continuity playbooks whenever you have an outage. That’s my two cents.
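As a small illustration of the risk lens Omer mentions, some combination of potential impact and probability, here is a hypothetical sketch for prioritizing which workloads need business continuity playbooks first. The workload names and scores are invented for the example, not CyberArk’s actual assessments.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    impact: int       # 1 (minor) .. 5 (business-critical) if this dependency fails
    probability: int  # 1 (rare)  .. 5 (likely) that the failure scenario occurs

    @property
    def risk_score(self) -> int:
        return self.impact * self.probability


# Hypothetical entries; the names and scores are invented for the illustration.
workloads = [
    Workload("customer-facing SaaS on the primary cloud provider", impact=5, probability=2),
    Workload("internal BI and reporting", impact=2, probability=3),
    Workload("endpoint agent auto-updates", impact=4, probability=3),
]

# The highest-risk items get a documented BCP playbook and a tested manual fallback first.
for w in sorted(workloads, key=lambda w: w.risk_score, reverse=True):
    print(f"{w.name}: risk {w.risk_score} -> prioritize contingency planning and drills")
```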

[17:30] David Puner
Thank you for those two cents. And so, I think we’re at the point in the conversation now where we’re going to look at the ramifications and lessons learned from all this. But I should also point out at this point that we’re not having this public conversation to rub salt in CrowdStrike’s wounds here. We’re all playing for the same team. We’re all defenders, and we’re brothers in arms to a certain degree.

[17:50] Omer Grossman
Exactly.

[17:55] David Puner
Yeah. So, I just wanted to point that out. And there are, of course, lots of ramifications and lessons learned from this CrowdStrike incident. And so, let’s tackle those in two parts. First, what are the tactical lessons organizations can take from this outage around update deployments and QA and those sorts of things? I think we touched upon some of those earlier, but maybe we’ll bring it back to that.

[18:20] Omer Grossman
At this point, a few days after the incident, I have two tactical lessons and maybe one or two strategic ones. From a tactical perspective, the first lesson is that you should have a robust quality assurance process. You should make sure nothing is missed during the test. You should try to get the most accurate test environment to model what is out there. That should resolve most of the issues before you deploy to the field. The second tactical lesson, and we talked about it in previous questions, is that you need to deploy updates in a phased approach. You need to have a test group before full company implementation. It’s as simple as that. Just follow this practice. If something goes sour, the blast radius, the impact, won’t break the business or the process. For strategic lessons, I recommend mapping your critical vendors and asking them about the way they validate their updates and how they implement their SDLC processes, their secure development life cycle processes. Basically, ask whatever questions you need to be able to trust them, going back to the trust issue. Make sure you’re doing that. You’ll learn a lot from those Q&As, and I think they might even learn a little bit and improve if needed. The second thing is to be optimistic, but prepare for a rainy day. Have a contingency plan, stress test it, and then do it again. I think we say it at CyberArk on a daily basis with the zero trust approach: never trust, always verify.

[19:40] David Puner
So, assume breach.

[19:45] Omer Grossman
Yeah. So, this mindset is still applicable here. You ask questions. You validate. You make sure you can trust the vendor. Although being optimistic, you also have a rainy day kit ready at the door.

[19:55] David Puner
So then, and I think those are really important, valuable lessons. Where we stand now, of course, again, six days out from this event, we know quite a bit more than we did at the time. But what do we still not know? What are we waiting to learn, and what do you think we may learn? Or is that even possible to speculate at this point?

[20:20] Omer Grossman
No, that’s for sure. That’s a hard one. You’re talking about the black swan kind of thing, what we don’t know that we don’t know.

[20:30] David Puner
Difficult to predict.

[20:35] Omer Grossman
Yeah. I don’t have a crystal ball. Yet, when you’re surprised, like fully surprised, most people function badly. There will be a black swan. There will be unknowns that hit you, surprisingly. If you’re mentally prepared to be surprised, it means that you’re accepting that there are unknown unknowns. When it hits you, you basically function better. But, as Mike Tyson said, everybody has a plan until they get punched in the face. You need to make sure that when it happens, and it will, you won’t be in shock, and that means you’ll basically be in a better position to get out of it faster. The sad thing about it, and you mentioned it earlier with the ripple effects, is that the internet is based on packet switching. Basically, it was built on the TCP/IP protocol. It was intended to be resilient, even to a nuclear bombing. That’s back more than 50 years ago. The idea was that it should be able to survive or function even through a very hard incident. It’s just not reasonable that we got to a point where a small IOC update, an indication of compromise update, breaks the internet. That’s the exact opposite of the idea behind the first phases and foundation of the internet. So, this is where we need to get IT and security people, governments, and think tanks together. It’s not that easy to solve.

[21:45] David Puner
I don’t think we’re going to solve it tonight, particularly because your night is about to turn into a.m. in about 16 minutes. So, I’m about to say this, and I hate this saying, but we’re going to give you the time back, Omer. I guess one last question. So, maybe we won’t give you 16 minutes. Maybe we’ll give you 14 minutes. What’s certain for you in the next 15 minutes? Anything?

[22:10] Omer Grossman
Maybe I’ll be able to get a few minutes with my kids before we call it the night.

[22:15] David Puner
That sounds like a good one. So, with that, Omer Grossman, CyberArk’s global CIO, thank you so much for coming on to Trust Issues. It’s been a pleasure. We look forward to having you on again real soon. And of course, I will, uh, hope to see you soon. Yeah. See you in Boston. Thanks, Omer. Thanks for listening to Trust Issues.

[22:30] David Puner
If you liked this episode, please check out our back catalog for more conversations with cyber defenders and protectors. And don’t miss new episodes. Make sure you’re following us wherever you get your podcasts and let’s see. Oh, oh yeah. Uh, drop us a line. If you feel so inclined, questions, comments, suggestions, which come to think of it are kind of like comments. Our email address is trustissues, all one word, at cyberark.com. See you next time.