Hey, does data loss prevention seem like too big a task to handle? Does the idea of classifying everything in your enterprise seem beyond your scope? Well, I'm going to tell you a little bit about the history of DLP, what you can do today, and how you can tackle that big monster. Stay tuned, you're going to hear it right after this. [Music]

Hello, and welcome to another episode of CISO Tradecraft, the podcast that provides you with the information, knowledge, and wisdom to be a more effective cybersecurity leader. My name is G Mark Hardy, and today I'm going to be talking to you about data loss prevention, or DLP for short.

Let's take a little trip back in time: how did we get to where we are now with respect to DLP, and what are we actually talking about when we talk about data loss prevention? Essentially, we're looking at technologies that let us tag or flag certain information that we want to keep on one side of a barrier from crossing over to the other.

In the early days that I remember, back in the 1990s, we had something we nicknamed a "dirty word list." This was something we had in the military, where we had classified networks and unclassified networks. There were opportunities to move information back and forth, but of course you wanted to make sure you respected the star property from the Bell-LaPadula model (for those of us who took the CISSP many years ago), where it's fine for information to flow up, but you can't write classified information down. The idea was that if you knew certain terms, project names, or anything else that would indicate something was sensitive or inappropriate to put in the unclassified world, your dirty word list would pick up on a project name, or the word "secret" or "confidential," and flag it, so you could say: hey, maybe this really shouldn't be going out. That was probably the earliest instantiation I saw, and I think that was almost 30 years ago.
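That dirty-word-list idea is simple enough to sketch in a few lines of Python (the terms below are invented placeholders, not real markings):

```python
# A minimal "dirty word list" check, in the spirit of those 1990s
# guard systems: flag any outbound text containing a term from a
# list of sensitive markings or project names.
DIRTY_WORDS = {"SECRET", "CONFIDENTIAL", "PROJECT AURORA"}  # hypothetical list

def flag_dirty_words(text: str) -> list[str]:
    """Return the sensitive terms found in the text (case-insensitive)."""
    upper = text.upper()
    return sorted(term for term in DIRTY_WORDS if term in upper)

# A message like this would be stopped at the boundary:
hits = flag_dirty_words("Status update on Project Aurora attached.")
```

Real guard systems did little more than this, which is exactly why they were so easy to fool.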
Back then, what you had was one network operating at system high, another network operating at a lower level, and this tool sat in the middle, because there was only one way to get from one to the other. A pretty simplistic model, but under the circumstances that's all we had: simplistic networks. And life worked pretty well.

A little later, as risks increased, we had not only accidental exposure and insider threats but also external attacks, and we have far more external attacks today than we ever did before. That brings us to early DLP, what we might call simple content filtering, from around the early 2000s, which also included endpoint protection. Regulatory requirements like HIPAA, the Gramm-Leach-Bliley Act, and Sarbanes-Oxley were pushing companies to protect sensitive data and restrict it from moving around. These initial DLP solutions focused pretty much on content filtering and endpoint-based controls, similar to what I described before, but with a little more sophistication. Instead of just looking for a word, you could write a regex, a regular expression, and regex matching lets you look for things such as Social Security numbers (three digits, a hyphen, two digits, a hyphen, four more digits), credit card numbers (four groups of four, sixteen digits), email addresses, anything that might be considered sensitive: PII, or PCI data if we're looking at credit cards. Those were the early use cases, where you'd look for a match in an email or a transferred file, hunting for words or phrases that would suggest something is confidential or proprietary. The difficulty, though, is that you end up with a high false-positive rate.
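A sketch of that kind of pattern matching, with a Luhn checksum added to weed out random 16-digit strings that aren't real card numbers (the patterns are illustrative, not production-grade):

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")        # three-two-four digits
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")  # four groups of four

def luhn_ok(number: str) -> bool:
    """Luhn checksum: cuts false positives on 16-digit matches."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_pii(text: str) -> dict:
    return {
        "ssn": SSN_RE.findall(text),
        "card": [m for m in CARD_RE.findall(text) if luhn_ok(m)],
    }
```

This is roughly the level of sophistication an early-2000s DLP filter had: pure pattern matching with a checksum, and no idea what the surrounding text means.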
Why such a high false-positive rate? Because the matching has no comprehension whatsoever of context. If, instead of classifying a document as confidential, I wrote "last week we held a confidential meeting with our client and it went really well," that would get flagged even though I didn't disclose anything that needed protecting. So keyword matching came up a little short, but it was better than nothing.

On the endpoints, you'd look at data movement. What happens when someone sticks in a USB device to exfil data? I can identify that and block USB from working. Of course, back then we had the old PS/2 mice and keyboards, which used a different port, whereas today an external mouse or keyboard is pretty much guaranteed to be a USB connection. Block USB and you can't operate the keyboard or the mouse, which becomes a bit of a problem. You had tools like DeviceLock that could disable USB, but the problem was you'd disable functionality if you didn't discriminate between a USB data device and a keyboard or mouse. A little later, for those of us who are familiar with it, came the Rubber Ducky, a USB device that declared itself to be a keyboard and then performed keystroke injection. But let's not even worry about that yet.

Again, high false positives were a problem due to lack of context, and it didn't scale very well as data volumes increased, because if you're trying to move things at much faster speeds, your processor might not be able to keep up. A lot of the time you couldn't support complex file formats, or even structured data. What we used to do was look for your .doc files and review those, or you'd say you can't send
a .zip file. So what do you do? You rename the file extension: instead of .zip it's .zap, and the filter goes, oh, that's not interesting, off you go. These controls were almost trivial to get around.

The next phase, if we do this in roughly five-year increments, somewhere around 2005 to 2010, moved us to advanced content inspection and policy enforcement. This was in the context of data breaches (remember the TJX breach back in 2007?) and more regulations coming, so as a result we had to up our game to defend against the loss of sensitive information. Some of the technical approaches included fingerprinting, where you create a unique hash of a sensitive document, a structured data set, etc., so that on any attempt to move that block of data you get a hash match, a signature match, and you say: nope, can't do it, not allowing it. Now I could say: I have this database, it has all my client data, here are all the files that pertain to customer X, and I record those fingerprints. Any time there's an attachment, it's pretty trivial and pretty quick to produce a hash of the document, no matter how big it is, and say: yep, that's a direct match.

Of course, you're already thinking: how do I know this is sensitive? How do I tell somebody they need to add this file to the database? You need the concept of a classification engine, a way to categorize data. It could be public, which is the lowest level; it could be internal; it could be confidential; or it could be highly confidential. That sounds a little like Purview's defaults, and we're going to get to that in a bit. Now you can add tools like network-based data loss prevention, which monitors network traffic. This was pre-Snowden, so an awful lot of traffic was going over HTTP, or even File Transfer Protocol (FTP), or SMTP.
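At its core, a fingerprinting engine is just a registry of hashes, and while we're at it, checking magic bytes rather than file extensions catches the .zip-renamed-to-.zap trick. A minimal sketch (the registry contents are hypothetical):

```python
import hashlib

# Hypothetical registry of fingerprints taken from known-sensitive files.
SENSITIVE_HASHES: set[str] = set()

def fingerprint(data: bytes) -> str:
    """SHA-256 of the file contents; fast even for large documents."""
    return hashlib.sha256(data).hexdigest()

def register(data: bytes) -> None:
    SENSITIVE_HASHES.add(fingerprint(data))

def is_registered(data: bytes) -> bool:
    return fingerprint(data) in SENSITIVE_HASHES

def looks_like_zip(data: bytes) -> bool:
    """Check magic bytes instead of trusting the extension:
    renaming customer_list.zip to .zap doesn't change these."""
    return data[:4] == b"PK\x03\x04"
```

Note the limitation: an exact hash breaks if a single byte changes, which is why commercial products moved to partial and rolling fingerprints; this shows only the basic idea.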
All of that traffic was in the clear, so it was pretty trivial to sit in the middle, grab these things, and inspect them, and tools like Websense pioneered this network-based filtering for sensitive content.

What about a scanned document? There was a solution for that too: build an OCR (optical character recognition) engine that translates the image into ASCII text, which obviously takes a lot less space than the image, and now I can run operations on it. I can go back and use my earlier matching algorithm: three digits, a hyphen, two digits, a hyphen, four digits; hmm, looks like there might be a Social Security number in that scanned document, and we probably want to do something about that.

Now we can start to centralize our policy management. We can define rules for different types of data, and with a tool that works enterprise-wide we can enforce things like "block all PII from being sent over email." All righty, that works great when your data is structured, but with unstructured data it's a little harder. And if you have to update your classification rules, there's a lot of maintenance involved; if you miss something, all that stuff keeps going by, and once it's past your filters, it's lost. With a lot of frequent blocking and false positives, users weren't all that thrilled either.

Now let's go to another phase, sort of a unified DLP, in the first half of the teens, 2010 or so, where we integrate with the SIEM and with endpoint agents. What we have is cloud adoption and mobile workforces changing our behavior. Instead of just client-server (here's my endpoint, here's my server, and then maybe server to server), now I've got mobile workforces. They're all over the place, they're on laptops, they're coming
in from a lot of places. I've got cloud adoption, where I'm moving things up into Amazon and Microsoft and Google and all the other cloud services out there. And that's a problem, because now I'm not running a perimeter type of defense; I don't have a perimeter anymore. So where do I put these DLP things? Well, it turns out some vendors started integrating endpoint, network, and cloud DLP solutions, so you had a single pane of glass: you could monitor endpoints, monitor your networks, monitor cloud environments, and these unified DLP platforms could determine, say, that information might be moving from the cloud to an endpoint, or across networks, and identify it.

What we came up with is the concept of behavioral analytics, UEBA, if you remember that acronym (sorry about that): user and entity behavior analytics. UEBA lets DLP detect anomalies based on user activity. For example, if a person normally does a certain transaction volume and all of a sudden there's a huge peak, like they're exporting a ton of files or diagrams or internal material, you can flag that as a potential insider threat. You could also start to judge data sensitivity not purely on a list of classifications and dirty words, as in the old days, but on the who, the what, the when, the where, and the how of data access, providing some context. Sensitive data transferred between internal systems should be okay, because it's not going out; even though it crosses the network from one endpoint to another, the destination is still internal and the route is still internal, as compared to something going external, where we say: hmm, that violates the rule, we ought to block it. Now we could integrate this DLP data with the SIEMs so we can correlate it and do some threat analysis.
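A toy version of that UEBA volume check might flag a transfer that sits several standard deviations above a user's baseline (real products use far richer features than a single z-score):

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float,
                 threshold: float = 3.0) -> bool:
    """Flag today's transfer volume if it is more than `threshold`
    standard deviations above this user's historical mean."""
    if len(history) < 2:
        return False            # not enough baseline to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu       # flat baseline: any increase stands out
    return (today - mu) / sigma > threshold

# A user who normally moves ~100 MB/day suddenly exports 5 GB:
baseline = [95.0, 102.0, 99.0, 110.0, 98.0]
```

The point is the baseline is per-user: 5 GB might be routine for a build server and wildly anomalous for someone in accounting.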
Things like Splunk and ArcSight would ingest the DLP alerts, and now you'd have deeper insights into where potential problems might be.

Another innovation that really started to come of age then was tokenizing data upon detection of sensitive information. Say I need to move data from A to B, where A is my Los Angeles office and B is my New York office. They're both internal, but I've got to go across the public internet. So how about we tokenize it? We agree that the client name, if that's sensitive, or some sensitive project name, gets substituted with something else; we move it across; and then we substitute back, for the second user, what was really there. Now, you could go full bore and encrypt everything, which would be better in my opinion, but then you have to manage keys and things like that. If you do it right it works pretty well, but it's getting more complex. And with this breadth of coverage across multiple platforms, not only do you have to worry about performance issues (how am I going to keep up with this volume?), but you might have a whole lot of alerts, and as a result you create alert fatigue, which eventually becomes like the boy who cried wolf: you stop paying attention.

The next phase of DLP got into cloud and software as a service, using tools like a CASB, a cloud access security broker, as a companion to traditional DLP. The increased use of SaaS back then said: hey, we're going to push everything up into the cloud, into tools like what's now called Microsoft 365 (O365 was coming of age back then). I'm moving stuff up into the cloud, it's going to be in my OneDrive, etc., and you have to account for that. With a CASB, I can take this cloud access security broker tool and inspect data in my software-as-a-service platforms.
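The tokenization idea from a moment ago can be sketched as a simple vault-backed substitution (a real deployment would use a shared token vault service or format-preserving encryption, not an in-memory dict):

```python
import secrets

class Tokenizer:
    """Reversible tokenization: swap sensitive values for random tokens
    before data crosses an untrusted link, then swap back on the far
    side. The vault itself must stay internal."""

    def __init__(self) -> None:
        self._vault: dict[str, str] = {}   # token -> original value

    def tokenize(self, value: str) -> str:
        token = f"TOK-{secrets.token_hex(8)}"
        self._vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]
```

Unlike encryption, the token carries no mathematical relationship to the original value, so intercepting it in transit reveals nothing; the trade-off is that both ends need access to the same vault.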
If I'm using something like O365, now Microsoft 365, or Salesforce, the CASB goes: hmm, all right, that should be there, that should not be there. And Microsoft Defender started coming online to integrate DLP with your cloud app security so we could flag where problems might be. In addition, vendors came up with API plugins, so you could have a direct API connection to a cloud platform and do real-time data inspection and blocking; players like Netskope, for example, would work with Dropbox and G Suite. There was even inline proxy DLP, where I could leverage proxy-based inspection to monitor and block data flows to and from cloud services, basically sitting in the middle; Zscaler had proxy-based DLP for secure web gateways. You could have identity-aware DLP, integrating with identity platforms to allow dynamic policy adjustments based on user roles and permissions; Forcepoint had a tool that would adjust your policies based on Active Directory groups.

About that time I was working at a startup, and I remember they had a Blue Coat appliance set up to essentially provide a man in the middle, serving up security certificates that it generated itself. What was the challenge? Everybody's encrypting everything, so how do I view the encrypted traffic? Well, if I control the certificates your machine trusts, then when you ask for HTTPS and start the TLS protocol exchange, that exchange is not with the ultimate server. Your man in the middle, so to speak, which is what the Blue Coat was doing, says: hey, I'll accept that call, and I'll do the TLS handshake with you. At that point I have a secure connection, but that secure connection is to my trusted man in the middle, which decrypts the traffic, allows me to inspect it and check for DLP, then re-encrypts it, because it's going to go
talk to the ultimate destination. So what you have in the middle is an instantiation of unencrypted data, which lets you inspect everything even though it's HTTPS. And you know, with Edward Snowden and everybody else saying wow, we ought to encrypt everything, the amount of HTTP traffic and stuff in the clear really went down as everybody sort of got religion on encryption. But that could create performance bottlenecks, because you've got to do this real-time inspection and people don't want to wait. You've got on-prem and cloud to coordinate, knowing that this cloud is mine but that cloud is not, this is okay but that's not okay, and then the encryption, tokenization, and such; it gets rather difficult.

Finally, if you will, our current phase, where we're getting into AI and machine learning DLP. Now we can adapt and automate our protection, because with the massive amount of unstructured data out there and a potentially increased insider threat, we've got to have more intelligent, automated protection. What can we do about that? One thing is to use machine learning for data classification: AI-driven classifiers can improve my accuracy, reduce my false positives, and give my process higher fidelity. Microsoft Purview DLP uses machine learning to distinguish between sensitive and non-sensitive data, because it can ingest a lot of information and infer context. Natural language processing, or NLP, allows context-based understanding of human language in documents and communications, so it can figure things out: tools like Forcepoint can look at a file and say, hmm, this looks like confidential contract language, it should probably be protected even though it wasn't correctly marked. And then you can let the policy tune itself, with machine learning adjusting automatically based on user behavior and content.
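To make the machine-learning classification idea concrete, here is a from-scratch toy naive Bayes text classifier. Commercial DLP classifiers are vastly more sophisticated, but the principle of learning from labeled examples instead of hand-written rules is the same:

```python
import math
from collections import Counter

class TinyTextClassifier:
    """Toy multinomial naive Bayes with add-one smoothing,
    standing in for the ML document classifiers in modern DLP."""

    def __init__(self) -> None:
        self.word_counts = {"sensitive": Counter(), "public": Counter()}
        self.doc_counts: Counter = Counter()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text: str) -> str:
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values()) or 1
            score = math.log(self.doc_counts[label] + 1)   # prior
            for word in text.lower().split():
                # smoothed per-word likelihood
                score += math.log((counts[word] + 1) / (total + len(counts) + 1))
            scores[label] = score
        return max(scores, key=scores.get)
```

Train it on a handful of labeled documents and it generalizes to wordings no regex anticipated, which is exactly the false-positive problem the keyword era couldn't solve.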
Tools like Proofpoint and Forcepoint let that policy tuning happen in real time, so the system adapts to your users' behavior and you don't repeat the continuous cycle of false positive after false positive; once it's been told or trained that this part is okay, off it goes. And then, finally, zero-trust integration: as we make DLP a key component of zero-trust strategies, we can tie data protection to identity, location, device health, all the things, to ensure that if we're going to encrypt and authenticate everything, we should be okay. Zscaler and Netskope work pretty well on that. But that's a lot of complexity in tuning these machine-learning-based models, and a lot of high compute requirements if you're going to do real-time data classification. And of course, think about the privacy concern: if you're going to do deep inspection of communications, something that dives deep into all of your stored documents, how do you feel about that? Is that a privacy issue? Kind of, maybe.

So today we're seeing a shift from standalone DLP to essentially integrated data protection. We use secure access service edge, or SASE, platforms. We'll potentially see more homomorphic encryption, meaning I can work on encrypted data without decrypting it. Now, it's an interesting idea, and it's been proposed for a long time, but there are a couple of functional instances out there. Being a crypto guy, I might learn a little bit more about that, because I haven't studied it all that much: being able to say, hey, I can produce a result by operating on two encrypted inputs and create an encrypted output, as if I were working on the cleartext, and I still get the right answer. Interesting concept, and I can see how mathematically you can do it, but maybe we'll save that for another episode. There are also agentless DLP models, so you don't have any footprint on your endpoint and you can improve performance, because it just looks at data as it goes by, with a focus on things like data sovereignty and cross-border data controls.
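Since I mentioned homomorphic encryption: textbook, unpadded RSA happens to be multiplicatively homomorphic, which makes a nice toy demonstration of operating on ciphertexts without decrypting them. (Never use unpadded RSA or toy primes for anything real; this is purely to show the math.)

```python
# Textbook RSA: E(a) * E(b) mod n = (a^e * b^e) mod n = (a*b)^e mod n,
# i.e. the ciphertext of the product of the plaintexts.
p, q = 61, 53                        # toy primes
n = p * q                            # modulus, 3233
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

def enc(m: int) -> int:
    return pow(m, e, n)

def dec(c: int) -> int:
    return pow(c, d, n)

a, b = 12, 7
product_ct = (enc(a) * enc(b)) % n   # multiply the *ciphertexts*
```

Decrypting `product_ct` yields 84, the product of the plaintexts, even though whoever did the multiplication never saw 12 or 7. Fully homomorphic schemes extend this idea to arbitrary computation.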
Data sovereignty, with GDPR and other regulatory requirements putting pressure on us, is going to be a bit of an issue.

I'm working with a client this week on how to do data classification using Microsoft Purview. Now, I'm just going to do a quick overview, because I'm not going to dive deep into it, but I want to use this as an example of things you can do currently. Microsoft Purview, which is plugged into the 365 environment, will give you visibility into all of your data sets, let you manage data across your environment, help you secure your data throughout its life cycle (created, transmitted, modified, archived, and eventually destroyed), and help with risk and regulatory requirements.

Of course, one of the first questions in doing DLP is: where is my data, what is it, and how do I access it? You might have everything in a known location, say OneDrive and/or SharePoint, but you could also have Box and Dropbox and other services, and you really need to build a data map. Using a tool like Purview, they say: we can help you do that, take the metadata across all these environments, whether hybrid, on-prem, multicloud, or software as a service, ingest it all, and create a data map. Okay, then what you want to do is make your data discoverable, so you can maximize its value. To do so, you create a data catalog, and now the data owners, the stewards of the data, can curate their data assets and say: here's what this is. So if you're a consumer, I can search for things and find where what I need lives. Again, not a big deal in a small, simple organization, but when you get large and complex it can be difficult. Then we want policies for accessing the data, moving it, perhaps even sharing it, data policies that you specify, so you could go ahead
and say: hmm, I want to create and apply different policies based on who's accessing it. If it's DevOps looking at code, that's probably okay; the data owner, probably okay; somebody over in accounting, or somebody on the assembly floor, probably should not be looking at source code, so we limit that. To do that, we're going to secure our data, and secure it throughout the life cycle, whether it's in an app, in a cloud, on a device, wherever it happens to be.

Key to this, and this is the hard part I'm facing right now, is discovering and labeling everything. I could start labeling all new information tomorrow, but what about the terabytes of data that are already there? Do I have to go back and pick through all these files that I may never care to look at again? It seems like a tremendous waste of time. So the idea is to have tools, a software development kit where you can say: look at all my old client files, and here are the types of things that would be sensitive: pricing, our proprietary method for doing an engagement; whereas standard boilerplate in our legal terms and conditions may not be so sensitive. Then let AI, let your automation, crawl through your data so you don't have the tedium of doing it yourself.

Then we want to deploy our DLP policies to restrict data leakage. Data loss prevention is the capability of Purview that lets you know when things are going out. I'm using that right now, and I've got it in notification mode rather than actual interdict mode, and I get a lot of false positives, I've got to tell you. Most of them involve our Chief Financial Officer: warning, credit card number going across; warning, bank account information going across. I look at the source item and go, yep, well, that's the CFO. Now, yes, I'm going to filter that out at some point.
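That crawl-the-old-terabytes idea might be sketched as a rule-driven scanner (the rules and labels here are hypothetical examples, not Purview's):

```python
import re
from pathlib import Path

# Hypothetical rules: pattern -> label, checked most to least sensitive.
RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "highly confidential"),  # SSN-like
    (re.compile(r"(?i)\bpricing\b"), "confidential"),
    (re.compile(r"(?i)\binternal use only\b"), "internal"),
]

def classify_text(text: str) -> str:
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return "public"

def scan_tree(root: str) -> dict[str, str]:
    """Walk a directory of legacy files and propose a label for each."""
    return {str(p): classify_text(p.read_text(errors="ignore"))
            for p in Path(root).rglob("*") if p.is_file()}
```

In practice you'd run something like this in report-only mode first, exactly like my notification-mode DLP policy, and review the proposed labels before enforcing anything.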
For now, though, I want to see everything the tool does, so I get a little bit of noise. As a CISO, or the person you designate, you want to understand what this thing is and isn't capable of, and it's a little easier to take too much information and trim it down with filters than to start with not enough and wonder how to get the stuff you're missing. There are also tools available for insider risk management: machine learning can look for potential insider risk, so you can say, hey, this just doesn't look like what a normal person ought to be doing.

A big challenge most of us face is compliance, and compliance represents a tax, if you will, a restriction, a constraint on your operations that says you have to manage things in a certain way. Why? Because the regulation says we don't want to disclose personally identifiable information, protected health information, payment card information, all those three-letter acronyms beginning with P and ending with I. They have to be protected, because you not only run the risk of the embarrassment of disclosing something; you may face financial consequences for violating the rules, as well as the reputational damage that occurs when somebody says, hey, you just had a major data breach. So classification and governance at scale, and pattern recognition, are the machine-learning capabilities you're looking for in your tool.

Then, ultimately, as I said, you have to keep core business records, but when do I dispose of data? Information is not always an asset; it can become a liability. In a situation where you've kept data far too long, and somebody's doing discovery in a court case and asks for everything, well, not only do you have to produce it, because it's a court order, but they're probably going to charge you, saying: here, we
spent hundreds of hours reviewing all the data you sent us, and we charge by the hour, thank you very much. Also, there might be things that took place in the past that could create a potential liability today, because let's face it, when somebody takes you to court, they don't try to put your best foot forward; they'll take everything they can, I would argue potentially out of context, to try to use it against you.

But suppose you had an absolutely strict retention policy: interns, when they leave, after six months all the data is gone; employees, after two years, all the data is gone; executives, three years, all the data is gone; except where you have something like a federal requirement or some other regulatory requirement to keep it. You enforce that, and you do it over and over again. Now you get served with a subpoena: hey, you had an intern who worked with you four years ago, this person is now facing a harassment suit, and we want everything that person ever said, to either build a case or defend one. At that point, you hold up a copy of your data retention policy and say: I've got logs, but since such-and-such a date, when you approved this policy, we've been deleting things on time, on schedule. There is no business purpose to keep this person's material after six months; it's now been three or four years; I cannot give you anything. And that will hold up in court. They say, "Judge, they're not giving us anything." Why not? Our policy, which we've enforced for years, says we delete everything, and we've been enforcing it, and here are the logs to prove it. Also make sure, if you do have logs, that they're unalterable and that you can prove it. Who knows, maybe stick them on a blockchain, but again, that's an episode for another day.

And then there's eDiscovery, a tool that lets you preserve, collect, analyze, review, and export content based on a particular investigation.
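The retention schedule I just described could be enforced with logic as simple as this (the roles and day counts are the hypothetical ones from my example):

```python
from datetime import date, timedelta

# Hypothetical retention schedule, in days after departure.
RETENTION = {"intern": 180, "employee": 730, "executive": 1095}

def purge_due(role: str, departed: date, today: date,
              legal_hold: bool = False) -> bool:
    """True when this person's data should be deleted under the policy.
    A legal hold or other regulatory requirement always wins."""
    if legal_hold:
        return False
    return today >= departed + timedelta(days=RETENTION[role])
```

The enforcement part is what matters legally: a policy you apply selectively is worse than no policy, so a scheduled job running this check on everyone, every day, with a log of every deletion, is the whole point.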
With eDiscovery, instead of manually picking through every single email, every single file, every piece of correspondence, let the AI tools do it. Let some intelligent machine learning do deep indexing of email threads and near-duplicate detection, because one of the things I found in discovery is this: I send an email, you send a reply, I reply again, and so on, back and forth ten times. That's ten different emails they want. Are you supposed to print them all out on 8.5-by-11 in some specified font size? They might take them electronically, but the whole idea is to say: look, this is all one thread; the tenth message contains messages nine, eight, seven, on down to one; it is unaltered, and I can prove it's unaltered, so I'm just going to give you the one. If they want to start pulling strings, after a while they realize, yep, these folks have their act together, and the court is going to leave you alone, you hope.

Then there are forensic investigations, with audit reporting and insights. These require some specific information, but now I can look at user activity: all of a sudden there's a peak in data access, it's high bandwidth, what's going on? What's happening that might cause an audit to occur? And I can preserve audit logs: Microsoft Purview, for example, now lets you keep them for up to ten years, as compared to your normal retention of about 30 days (unless you're on an academic license, when I think it's about seven).

And then let's look for violations. If somebody is potentially violating regulatory compliance or business conduct, or sending inappropriate communications, don't wait for the lawsuit; stop it, quash it immediately, and then ensure that you're remaining compliant, say if you're dealing with the SEC or FINRA and you have somebody making promises they can't deliver in terms of investment results.
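The near-duplicate detection I mentioned can be approximated with a containment check: if an earlier message is wholly quoted inside a later one, keep only the later one (real eDiscovery tools use fuzzier matching than exact containment):

```python
def dedupe_thread(messages: list[str]) -> list[str]:
    """Drop any message whose text is wholly contained in a later one,
    the way reply #10 of a thread quotes replies #9 through #1."""
    keep = []
    for i, msg in enumerate(messages):
        if not any(msg in later for later in messages[i + 1:]):
            keep.append(msg)
    return keep

# Each reply quotes everything before it, so only the last survives:
thread = ["hello", "re: hello\nhello", "re: re: hello\nre: hello\nhello"]
```

Collapsing a ten-message thread to its final superset message is exactly the "I'm just going to give you the one" argument, backed by the integrity proof that nothing was altered.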
When you spot that kind of thing, shut it down very quickly, and don't just hide the evidence; maybe have a follow-up: "Hi, this is compliance. Yesterday we received an email from one of our employees who said he can get you 10% on a risk-free investment. (a) That is not authorized; (b) we did not authorize that communication, please disregard it; and (c) that person is no longer working here." Again, build yourself an audit record so you can protect yourself.

Then, ultimately, make sure you're tracking your compliance effectiveness. How am I doing? Can I continuously look at my compliance efforts? Have I reduced risk? Because ultimately that's what I'm trying to do for our organization and our executives: reduce the uncertainty of bad things happening, so that I can meet the requirements that are out there. So these are the types of things I can do with DLP, and I gain a much better capability for my enterprise.

In summary, if we think about data loss prevention, it's a way of ensuring that what goes out around, through, or past some barrier, some perimeter, some defined point, meets a predefined set of requirements. Of course, you have to predefine those sets of requirements, and some tools like Purview give you a head start: they give you four default classifications, you can start putting information into those buckets, let it automatically scan, let it build up that case for you, and then proceed going forward. It looks daunting, and it really is, so for a long time I thought, do I really want to open that can of worms? But as you move forward, you realize that as a CISO, as a security leader, you're going to have to open some cans of worms. Why? Because they're going to start to rot and smell and cause danger for your organization if you don't tackle the project. So if you've been holding off on DLP, if you've been holding off on data classification, if somehow you felt this isn't for me, or this is too big, or I don't have the staff
or I don't have the bandwidth: today is your day, because we've got AI, we've got all these capabilities with machine learning, and all those tasks that looked nearly impossible to accomplish as human beings can be done pretty simply with these tools.

So, hopefully you found this episode useful. If so, share your insights with others and let them know you're getting them here at CISO Tradecraft. Follow us on LinkedIn if you're not already following us; we have more than podcasts. Make sure you watch us on YouTube if you're not subscribed there, and let other people know the source of your CISO Tradecraft so you can help them in their career paths as well. Thank you very much for listening or watching. This is your host, G Mark Hardy. Until next time, stay safe out there.