Self-service Your Cloud Through Automated Remediation; Without Losing Control (Cloud Next '19)
So, my name is Thomas Martin. I'm the founder and CEO of Big Co Small Co, and we are a technology integrator that brings start-up technology and innovative technologies to the big-company enterprise. Previously I was a CIO and CTO across multiple GE industrial businesses, and had the pleasure of leading the effort to migrate over 9,000 workloads to public cloud. Today I'm here with Brian Johnson.

Hi, my name is Brian Johnson. I'm the CEO and co-founder at DivvyCloud. Prior to taking on this journey I worked at Electronic Arts for about seven and a half years, working on massively multiplayer online games, deploying a lot of those games in data centers around the world, and I led the charge on the migration into cloud in about 2013.

Very cool.

So, through that process of migration, one of the things we thought was really interesting is that as we adopted cloud we started treating it like a technology problem. But at the end of the day that actually wasn't the case, right? At the end of the day it was actually a culture problem. What we realized is that the shift to self-service was incredibly important for our ability to compete. All of our competitors were out there building games, doing it much faster, getting those products to market, learning from their customers, and building really cool games. How could we continue to do that? How could we continue to innovate? Well, we couldn't do that if IT continued to get in the way. So we needed to find a way to allow our engineering organizations to access cloud, to deploy applications, to innovate through their cloud infrastructure without security and IT really getting in the way of that. And through that process, what
we discovered is that it wasn't just a technology problem, right? It was actually three things that came together in this sort of perfect storm that caused an issue for security. Number one, we were moving with more, right? Just more everything. We had more servers. We were moving from, you know (this is not the actual number), thousands of servers to ten and twenty thousand instances inside of AWS and in GCP, so that was really interesting. Then the number of people touching the infrastructure dramatically increased. It wasn't just the IT staff anymore; it was thousands of engineers all over the globe touching infrastructure and deploying applications. That led to the third thing: we were no longer server huggers, right? We basically had change every day. With a CI/CD process, one little code change or bug fix means you tear everything down and rebuild it again, so that led to a dramatic increase in the number of changes occurring in the infrastructure at any given moment.

So these three things combined, the number of resources you had to manage, the number of people touching those resources, and how often those resources were changing, led to this incredibly difficult problem. How did we deal with this scale? How could we possibly, as an organization, understand
everything that was changing and react to it in any reasonable amount of time, based on the traditional IT and security processes we had in place, which were super slow? So that was really the problem we see out there. This is not a technology issue; it's a company transformation issue. How do you get ahead of this, and how do you deal with scale?

So with that, let's talk about what this really begins to mean from a practical aspect, right? This is fairly simple: just a simple three-tier architecture. The misconfiguration opportunities actually seem quite small, right? You've got a load balancer, a couple of compute instances spanning a couple of availability zones, and we've got Cloud Storage and Cloud SQL. The point is, if you're just managing this at a small scale with a single team, it's not all that hard, right? But when you really start looking at it, there are at least 20-plus opportunities for misconfiguration just in this simple three-tier architecture. Think about that when you begin to migrate, say, five, seven, ten thousand applications across an enterprise. Trying to manage this at any kind of scale just becomes unwieldy. What I found for ourselves, in my past experience, is that somewhere between about 100 and 200 applications the whole structure starts to fall down. You really have to begin to think about the fact that not only is the CI/CD process so important, but it's really about all the configurations, not only in real time upon deployment but ongoing and going forward.

So what does this lead to? Well, in our case it led to a couple of different things. It led to loss of control, right? We're letting engineers go and deploy, and that's a great thing, right? You want that innovation. In order to survive as a company you've got to find a way to compete through
innovation, so it's absolutely important. But you lose control, right? It used to be that any change that was going to be made in the infrastructure went through us first, so we would know about a problem, we would see a mistake, and we'd be able to stop it. That's not necessarily the case anymore. And furthermore, what's interesting is this: you went from having an IT organization that had things in place to catch these issues, with the sort of controls and gateways they had, to engineers just doing all sorts of things all over the place. And the problem is that nobody ever sat the engineering organization down and explained to them twenty years of history of tough security issues that we hit. It's not like IT learned that stuff the easy way. We got compromised, we had problems, we had issues, and so we learned and built processes. Unfortunately, those processes really slowed us down.

The other thing we recognized is that when we started to move towards a more cloud-native approach, we said, well, we'll just do alerting. We'll just get alerts every time there's something we need to pay attention to. That really quickly got out of control. It just became whack-a-mole; there was no way to keep up with it. I have alert fatigue on here, which is what we're talking about: getting those Slack messages or emails, how do you know what to pay attention to, what are the important areas? Because the reality is, of the 20,000
changes you're now dealing with, there's going to be one of those that might be really important, and you may have a hard time identifying which one you need to pay attention to. So this is really a signal-and-noise problem. With all these things going on, how do you reduce the noise so that you can focus on the signal? At the end of the day, you have to leverage automation to do that. There's just no way that we can leverage a traditional IT process, using a runbook to correct problems, contacting the person to talk to them about making the change. By the time that's occurred, the application has been torn down and redeployed three times, right? So you need to be able to get rid of the noise by leveraging automation, so that your IT staff, your security staff, your cloud ops have the ability to focus on the ten percent that they need to be dealing with on a more manual and active basis.

And so with that, you really start to look at it and say that the traditional IT perimeter and processes that we've always used and relied upon are ineffective. They just can't handle those kinds of changes at scale. They're still important. I'm not diminishing the importance of perimeter control; we're going to talk a little bit more about how that begins to fold in. But it's really so important, particularly as Brian talked about, to be able to filter out the noise. And so, as you step back and think about it, for those of you who are working at those large enterprise firms, think about, at least for me, the IT procurement process. Our development teams literally knew in their heads there was probably an extra 60 to 90 days in the schedule they committed to, because they figured by the time things get through procurement, the backlog
of servers making it to the data center, by the time they get it in, rack it, put up the operating system, and get it networked, we're looking at somewhere between 90 and 120 days. I've even seen as long as 180 days to get procured servers into the data center. And so with those kinds of processes, as Brian was talking about, when you're standing up something to try to detect an issue and resolve it, that resource may have already been built and gone. And I know you certainly experienced that at EA.

Yeah, certainly. I mean, if we wait 180 days... games get built and thrown away in 180 days, right? So you have to be able to move more quickly. I think the other thing we saw was not just the time it took to provision. That certainly took a lot of time, and I'm never going to rack another server again. But certainly the other thing we saw was that when you were going through this process, working with engineering, trying to talk about the problems they're going to face when they start to adopt cloud, and you're going through that transformation,
sometimes people don't understand the scale of the attack surface. What I mean is the offensive nature of what's going on out there. We used to have an exercise we'd do with our engineers when they came on board. We'd have them deploy a server into a secured environment where port 22 would be open to the world and completely publicly accessible, we'd just set something like root/root as the login and password, and we'd have them time it: how long does it take before that box gets popped? Sometimes when you go through that exercise it opens your eyes to the amount of things that are just out there scanning and looking and trying to find ways in. You know, 10 or 12 years ago there was an increase in the number of sophisticated exploits being developed. That has actually started to tail off a little bit, primarily because it's not necessary anymore. People are out there opening up S3 buckets or leaving databases open to the world.

I know that hasn't happened.

Yeah, those things. And so it's not necessary to spend time and energy and resources on those really complex exploits when you can just scan and find a way in. So that's part of this equation: not only understanding, as security professionals, what scale looks like internally, but also training the engineering organization about what's important about security, how they need to deploy, how they need to think about people trying to get in. Because if you can help teach them as they go through this process, everything's going to get better: you're going to get faster and more innovative.

Well, just think about your own SLAs, right? What would be the SLA from a typical event in a data center to when you're going to respond to it? How much data could be lost in that time if you had that cloud storage open to the world?
Those are the things to think about. And so how do you begin to really think about it from a remediation standpoint, right? First off, it needs to be near real time. It really starts at a harvesting point: utilizing all the access points, those APIs across all those resources, and harvesting the data back in real time, not only upon creation but also upon change. It's that day-two drift you also have to think about. Things might have been great as you deployed through your CI/CD toolchain, but what happened after that point with that engineer? I honestly don't believe people intentionally make a lot of the configuration mistakes they do. It's that middle-of-the-night, day-two ops moment when something's wrong: "I'll change that back as soon as it's resolved," and then it doesn't get flipped back, right?

So first you've got to harvest that data back in. Then you want to unify it, so that it's consistent across all of your individual accounts, all of your VPCs; all those resources are then normalized into a single data plane. Then you want to drive analysis against it. As you've thought about establishing those compliance and security policies, what it means for your organization to be compliant, that's the analysis that gets done in real time against those resources. And then comes being able to take action: what do I want to have happen when this occurs? It's that if-this-then-that scenario. If port 22 is open to the world, what do I want to do? Who do I want to wake up? What immediate action do I want to take, not only to protect the company but also from a forensics perspective, as well as to learn? Was it the team that inadvertently did it to resolve an issue, or were we actually breached? So all that data is captured and dumped off for analytics.
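The harvest, unify, analyze, act loop described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the `FirewallRule` shape, the raw field names, the account names, and the action labels are all hypothetical, and the harvest step is stubbed with hard-coded data instead of real provider API calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    """Provider-neutral record in the 'single data plane' (hypothetical shape)."""
    cloud: str
    account: str
    resource_id: str
    port: int
    source_cidr: str


def unify(cloud: str, account: str, raw: dict) -> FirewallRule:
    """Normalize a harvested payload. The raw field names are invented for illustration."""
    if cloud == "aws":
        return FirewallRule(cloud, account, raw["group_id"], raw["port"], raw["cidr"])
    if cloud == "gcp":
        return FirewallRule(cloud, account, raw["name"], raw["port"], raw["source_range"])
    raise ValueError(f"unsupported cloud: {cloud}")


def ssh_open_to_world(rule: FirewallRule) -> bool:
    """Analysis step: the 'port 22 open to the world' policy from the talk."""
    return rule.port == 22 and rule.source_cidr == "0.0.0.0/0"


def remediate(rules: list[FirewallRule]) -> list[dict]:
    """Action step: for each violation, record what to do and keep the evidence."""
    findings = []
    for rule in rules:
        if ssh_open_to_world(rule):
            findings.append({
                "resource": f"{rule.cloud}:{rule.account}:{rule.resource_id}",
                # Hypothetical action labels: fix it, wake someone up, keep forensics.
                "actions": ["close_port", "notify_owner", "archive_for_forensics"],
            })
    return findings


# Harvest step, stubbed: in practice these records would arrive from
# provider APIs on resource creation and on every later change.
harvested = [
    ("aws", "prod-1", {"group_id": "sg-1", "port": 22, "cidr": "0.0.0.0/0"}),
    ("gcp", "dev-7", {"name": "allow-ssh", "port": 22, "source_range": "10.0.0.0/8"}),
    ("gcp", "dev-7", {"name": "allow-web", "port": 443, "source_range": "0.0.0.0/0"}),
]
rules = [unify(cloud, account, raw) for cloud, account, raw in harvested]
findings = remediate(rules)
```

Only the first record violates the policy here, since the second is SSH restricted to a private range and the third is a public port other than 22. Normalizing before analysis is what lets one policy function cover every provider.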
One thing on that: everyone saw this morning's announcement, right, about the multi-cloud push from Google? This is absolutely the right way to go. I'm super excited about what they're doing on top of Kubernetes. Kubernetes is going to be the element that breaks down the barriers and commoditizes infrastructure. And so at the end of the day it's going to be really important, from the enterprise organization's perspective, when you're looking at the infrastructure layer, that you can have it unified, because you're going to have engineers who are using Azure, you're going to have engineers who are using GCP, you're going to have engineers using Amazon, and you can't build policies that are just going to live in those separate worlds, because
you're going to forget about them, or they're going to sort of die on the vine or drift in different ways as you go on. And as we know in security, it doesn't matter if you have 95% coverage; that 5% is the one that's going to get you. So you need to make sure you have a holistic strategy and holistic policy as you move forward.

So how do you do that? Well, there are lots of different ways to think about dealing with this using remediation. One is that in a development environment your remediation might be slightly different than in your production environment. In your development environment, look: an engineer wants to do some latency testing. Say I work for a big bank, and we are not allowed to have servers outside the United States, but you want to do some latency testing. So in the development environment you can have an instance in Asia Pacific for the next two hours. Two hours later our system is going to automatically come back and clean it up and make sure everything's okay. But when you move that same application to staging, you may not actually have the ability to do that. You might leverage faster remediation: the server comes up in Asia Pacific and it's killed instantly, right? But you still want to let them have the ability to try new services and do more things, right? You don't want to lock them down using preventative controls, because you need them to go in there and try new things. You need them to innovate. If you block them at the top layer, they're just going to go around you. They'll go create an account and try it themselves, and that's the worst place you can be. Being compromised is bad; being compromised and not knowing it is far worse, right? So embrace this. Help them innovate, help them learn, go through that process with them, and leverage remediation in real time to provide flexibility in how they do that.
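The two-hour development window and the instant staging clean-up described above can be sketched as a small policy function. This is a minimal sketch under assumptions: the environment names, the `GRACE_PERIOD` table, the `decide` function, and the crude "region name starts with `us-`" test are all hypothetical simplifications, not a real product's policy model.

```python
from datetime import datetime, timedelta, timezone

# Assumed rule: only US regions are approved; a name-prefix check stands in
# for a real region allow-list.
ALLOWED_REGION_PREFIX = "us-"

# Hypothetical per-environment grace periods before remediation kicks in.
GRACE_PERIOD = {
    "development": timedelta(hours=2),  # the two-hour window from the example
    "staging": timedelta(0),            # cleaned up immediately
}


def decide(environment: str, region: str, launched_at: datetime, now: datetime) -> str:
    """Return 'allow', 'wait', or 'terminate' for a running instance."""
    if region.startswith(ALLOWED_REGION_PREFIX):
        return "allow"
    # Environments not listed above get no grace period at all.
    grace = GRACE_PERIOD.get(environment, timedelta(0))
    if now - launched_at < grace:
        return "wait"
    return "terminate"


launched = datetime(2019, 4, 10, 12, 0, tzinfo=timezone.utc)
dev_after_one_hour = decide("development", "asia-east1", launched, launched + timedelta(hours=1))
dev_after_three_hours = decide("development", "asia-east1", launched, launched + timedelta(hours=3))
staging_at_launch = decide("staging", "asia-east1", launched, launched)
```

The same out-of-policy instance is left alone for a while in development and removed at once in staging, so the engineer gets room to experiment early without the exception leaking into later stages.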
Then, when you get to production, this is where you may want to leverage some preventative controls. The cloud providers today provide different ways to do preventative controls and lock down which services are being used. So as you're taking your engineers through this journey, you want them to understand that at each stage it's a little more stringent, a little tighter, a little harder to do what you're going to do outside the parameters of what we've approved. So when they get to production, it just doesn't work, right? But they're not surprised by the time they get there, because the whole way through the journey you've been teaching them. And what's more important is that it's not just about enforcing a policy and then running away, right? It's about engaging them, bringing them into the conversation: okay, what is it you're trying to accomplish? What are you trying to do? Let me help you find a secure way of doing that. Help them innovate and help them along that journey.

Let me interject there. Brian's going to talk about it on the next slide, but think about it too as, like he's mentioned, this funnel of restriction. You're providing guardrails that are much wider in that early stage to generate innovation, but by the time you are at stage three it's least privilege, right? In fact, in many cases it's just going to be machine-only privileges that are enabled in production to run those services. And so we'll talk here in the next layer about how you get there.

Yeah. So let's talk about these different layers. You can almost think about it as going from super fine-grained up to more coarse-grained, right? So there's what we call protect mode, right? This is where you're leveraging real-time remediation to go and clean up after things, or shut things down, or clean up security groups, whatever
it might be, right? Identifying databases that have not been connected to in a long time. All those different elements are going in and cleaning up and fixing that kind of stuff, and you're protecting your environment on a regular basis.
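One of the protect-mode clean-ups mentioned here, finding databases that nobody has connected to in a long time, might look roughly like this. It is a sketch only: the `find_idle_databases` helper, the 90-day threshold, and the database names are assumptions, and a real system would pull last-connection times from provider metrics instead of a dictionary.

```python
from datetime import datetime, timedelta, timezone


def find_idle_databases(last_connection: dict[str, datetime],
                        now: datetime,
                        max_idle: timedelta = timedelta(days=90)) -> list[str]:
    """Return the names of databases idle for longer than max_idle, sorted."""
    return sorted(name for name, seen in last_connection.items()
                  if now - seen > max_idle)


now = datetime(2019, 4, 10, tzinfo=timezone.utc)
# Hypothetical inventory: database name -> time of the last client connection.
last_connection = {
    "orders-db": now - timedelta(days=3),
    "legacy-reports": now - timedelta(days=200),
}
idle = find_idle_databases(last_connection, now)
```

Running a check like this on a schedule, and feeding the result into a notify-then-clean-up action, is one way to keep the environment tidy without asking engineers to remember every resource they created.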
Next you have your in-line checks, right? So this is your ability to take things like Terraform or CloudFormation templates, or anything you work with to deploy into your environment and provision, maybe Helm charts, right, to build and deploy into that infrastructure, and have the engineers integrate with a tool that will check those things as they're doing it. So when they're going through the CI/CD process, it checks in with the system and goes, "Hey, I'm about to build these ten resources, this is what it looks like, is this okay?" Right? Have the CI/CD process then either pass it and say, "Yeah, you're allowed to do this because it's development, but we're going to tell you this is a problem," or have it just straight fail the build, right? You want to integrate and bring security into their world, not the other way around. Again, if you try and do that, they're going to find it annoying and go around you. So how do you bring it in, right? So those in-line checks are really important.

And then finally there are these [unclear] spaces: this idea that as you provision accounts, you do it in an automated fashion, and when you do that for projects or teams or whatever it might be, you go in and start slapping controls around it at provision time. This might be a mixture of remediation and preventative measures. You might enforce some sort of tie-in to a CI/CD pipeline where everything gets checked as it goes. All of those sort of tighten things around as you go into a more production and preventative stance with those accounts.

There we go. And really, these go in combination, right? So the coarse-grained controls are those big mindsets that say these are never to be violated, if you will, down to that mid-grain where, as Brian talked about, you may put a warning there in that dev cycle, but you're not going to shut it down immediately, because you are also facing
into that cultural shift, right? So you also want to educate engineering as to why we are going that direction, and then down to those fine-grained controls that take care of not only launch but really that drift that can occur on day two or day thirty. Those, combined with some of the things we talked about on the previous slide around the cycle aspect, give us, for those of you who are leaders in the room, the ability to become more the Department of Yes, driving innovation for your company, versus the Department of No.

Look, I think the high-level key takeaway here is that as everyone goes down this journey, you need to define a strategy for the organization. We started out talking about the fact that this is not just a technology problem; this is how a business is going to transform. When I was at EA and we were building games, the introduction of cloud didn't just change how we deployed our applications. It literally changed what applications we built. It changed what games we took to market. It changed which games we decided to stop developing, because we were able to do it faster, right? So it's a huge business transformation. So as you go through this process and decide how security, how IT, how cloud ops is going to address this, it's important to think about it as a holistic strategy, right? Those layers we talked about, all the way from development to production, need to be taken into consideration, along with how you engage your engineering staff and teach
them as they go.

Absolutely. And I think, as Brian mentioned earlier around filtering out the noise, we can't rely on just the traditional perimeter security control and just providing notification. Okay, it's great to know that there's a theft happening in aisle five, but have we filtered out the noise to know exactly and pinpoint what is happening, where it's happening, and how to resolve it, and then to be able to take that action to remediate it at cloud speed?

The other thing: we're out there on a regular basis working with large enterprise customers, solving this strategic problem, and one of the things we've found is that the companies that tend to have the most success, that are able to get moving quickest, have an established cloud ops team, right? This is sort of interesting, because when we talk about security there's this desire to think about traditional infrastructure security. That is about analyzing network traffic, identifying external threats, and coming up with preventative measures to react to those threats. But the security problem we're facing right now is different than what we've seen before. I used to do exploit development professionally for a while, and from the offensive side you think about things slightly differently: you think about how to get into a black box. From the security side, when you're defending against that, you're looking at traffic to try and figure out what people are throwing at you, what they know about you that you don't know, and so on and so forth. When you're dealing with the cloud ops side of things and helping engineering teams grow, it's much more about understanding what they're doing and what their needs are, and making sure they don't make mistakes. It's an internal threat. It's
very different, because it turns out you and they are on the same side, right? You're not fighting one another. So you have to find a way to embrace that. What we've found is that establishing a cloud center of excellence, a cloud ops team that's going to be focused on security from a cloud perspective and what that means to the internal organization, means you get a lot more innovation a lot quicker.

I think, to add to that, my experience has also been that having that focal team, that cloud ops team, also helps as an accelerant, not only from an adoption perspective but also from an educational and cultural perspective across the entire organization, as folks begin to transition out of that data center mindset. In many cases, and my experience has been, again, that most organizations of that size are always going to be hybrid. They're going to have their data center with their large ERP systems and others that will remain on-prem. But to be able to manage that mindset across the board, it helps to have that cloud ops team.

Absolutely.