Migrate and Modernize Hadoop-Based Security Policies for Databricks
Okay, welcome everybody to my talk on migrating and modernizing Hadoop-based security policies for Databricks. My name is Steve Touw; I'm the CTO of Immuta.

So what are we going to talk about today? This is a common question we get asked, and it's a customer case where they're either on Cloudera or Hortonworks, or they run their own on-prem Spark or Hive deployment: can I just migrate my Apache Ranger or Sentry policies directly to Databricks? I want to point out that while this talk is focused on Databricks, this could apply to migrating your policies from Ranger and Sentry to Presto, Synapse, Snowflake, Starburst, whatever your compute may be on the cloud.

The short answer is yes, you can migrate those policies right over to Databricks. But if you just do a direct one-for-one migration, you are not going to modernize your policies. So this talk is really about how to get a yes for both, and I'm going to spend a lot of time talking about what modernization is and why you really need to do it.

So why modernize? Looking at Sentry and Ranger, both started development about eight years ago: Sentry in 2012, Ranger in 2013.
I don't want to make it sound like open source projects don't evolve and improve over time; certainly they do. But there are some foundational design decisions that were made at the get-go for these projects, and our current environment has made it a challenge to keep using these products. I'll talk a little bit about that.

The first big thing that's changed is that Hadoop is no longer the center of the universe. I'd argue maybe it never really was; there were always other RDBMSs on-prem that people were using. But suffice it to say it's gotten even more complicated, because now you potentially have multi-cloud, and then multi-compute within that cloud. If you try to manage your data policies uniquely across all of these different systems, you are going to drive yourself crazy. It just isn't feasible.

The second big thing is that the data protection laws of the world continue to grow. This is a screenshot from the DLA Piper law firm website, where they keep track of all the data protection laws, and as you can see it's quite expansive across the globe and continually growing. What we're seeing here is what we call the data fuel crisis. Back in the early 90s, if you had data, you could pretty much use it; you didn't have to worry about all this stuff. Then we started introducing regulations like HIPAA, and as we move to the right here and get to our current situation, there are 350-plus privacy and infosec bills proposed. It's getting quite hard to manage all of this, and so the amount of compliant data you have to use is dropping as the number of regulatory and privacy controls you have to think about increases.

I don't want to make it sound like it's all about the regulatory controls. Privacy in general, and data privacy in particular, is more obvious to your consumers. Apple has made a business around this: what happens on your
iPhone stays on your iPhone. It's important that your data consumers understand that you're treating their data appropriately and ethically. You can see Ranger and Sentry started way back here, before this curve really started to hockey-stick over here on the right, and that has implications.

We've got this tug of war going on. On the left you've got your legal and compliance team, which says we need to secure our data, meet all these regulations, and meet the expectations of our data consumers or data providers, our customers basically. On the right you've got the data analysts and data scientists, whose job is to analyze the data you're collecting. They want as much of it as they can get; they have that insatiable thirst for all your data. And the poor data platform team or data engineering team is stuck in the middle: they've got the compliance and legal team breathing down their neck and the data analyst team yelling at them. So how do they manage this?

The other issue is continual complexity. These regulations have really changed the definition of privacy preservation: it's no longer just about blocking access to direct identifiers; you need to think about indirect identifiers too. I stole some language from CCPA that I'm not going to read here, and there's similar language in GDPR. The long and short of it is that if you de-identify or anonymize your data well enough, CCPA or GDPR doesn't apply. But nothing in life is free, because they define personal information not only as data directly relating to someone but also as data that could reasonably be linked to them. I'll talk a little bit more about this, but it causes a lot more data in your organization to need protection than historically has, and it requires a lot more views into your data.

That really leads to this privacy versus utility trade-off, where again we've got this idea on the left of
complete privacy, where you just hide all your data, which of course is unreasonable, and on the right you just have all your data in the open, which is equally unreasonable. There's a lot of momentum and pressure pulling in this direction. So how do you manage this privacy versus utility trade-off, where you keep both of these parties happy while meeting their demands? Essentially, you need to play in the gray area between closed and open. A lot of these existing tools make very binary decisions from this perspective: you either have access to this table or you don't, or you have access to this column or you don't. That coarse granularity causes a lot of problems with the privacy versus utility trade-off.

So, combining all those things (the complex data platform ecosystem, more regulatory and privacy concerns, and these new, stringent definitions of what privacy preservation really means), we've now entered this cloud private data era. And this cloud private data era has created a role tidal wave. When I say role, I literally mean the technical term, as in role-based access control. This isn't limited to things like Ranger and Sentry; it includes things like your IAM roles in AWS. We've found that our customers' roles have been exploding over time, because these concerns force you to create all these different views into your data, and this just becomes unmanageable for a human, or a team of humans, to take on.

This is a role explosion example from a real customer use case. You can see this is a screenshot of Ranger, but again this can apply to other systems. We've got a policy, and it's basically the same policy written over and over again: organization name in subquery, select org name from external table (which I redacted) where role equals R01. The only thing that changes is these roles right here. It's the same policy over and over again, associated with different users that map to that
role. So if you've got new data you need to expose, you're potentially going to have to create a new role and a new policy. This becomes very, very complex, and you get to that explosion problem I was just talking about.

Roles are tied to this idea of role-based access control, or RBAC, which almost all legacy RDBMSs use, and some of our new SaaS database and compute technologies do as well. RBAC should really be named static-based access control, because at the end of the day it's like writing code without being able to use variables. You saw I was just writing the same thing over and over again for each different role. These two products have foundations built on role-based access control, as well as on a world where you just didn't have this many different views into your data. They were really conceived before the cloud private data era that I just talked about.

Okay, so we've talked about why you want to modernize. Our argument is that if you don't both migrate and modernize, you really aren't going to be able to realize the benefits of the cloud because of the pressures I just described, and you're going to have even more security pressures when on the cloud.

So we've covered the why; now let's talk about how to fix each of these. I'm going to talk about each of them individually, so I won't run through them on this slide. We'll start with this one: the separation of policy from platform. This is pretty straightforward. The big data era required the separation of compute from storage, which everyone knows and loves: you're able to ephemerally spin up Databricks on top of your data stored in S3, compute, and spin it down. Similarly, you could spin up a Presto cluster to run interactive queries against S3. There's a lot of power in this, and a lot of flexibility and cost savings. So, for the same reasons,
the private data era requires the separation of policy from platform. It's this idea of having a single pane of glass abstracted from your actual compute engine, so you're not uniquely enforcing policy in each one. You have one place to do things like table access controls, column-level controls, row-level security, and cell-level controls, and by separating it out you can do this in a consistent manner no matter what your compute is.

It's not just about separating the policy from the compute platform; it's also about separating policy from your physical tables. If we think about all these different computes and metastores that you might have, this almost always ends up being thousands of tables and columns. If you have thousands of tables and columns and you build policies at the table and column level, you're going to have thousands of policies. Combine this with the RBAC problem I just spoke about, and you really get a management explosion and scalability problem with your policies.

What you need to do instead is abstract your physical layout, your physical tables and columns, with logical metadata: things like PII, PHI, address, social security number. You can lay this across your physical data structure in a logical way and just reference those logical tags. This allows very few, understandable policies to be created. Understandable is a key word here, because now the policies are written as "mask PII", not "mask this weirdly named column in this weirdly named table". Your legal and compliance teams can understand what's going on. Immuta can discover this logical metadata itself, and you can build policies on top of it; or, if you've already done this work and it exists in things like BigID or Collibra, it can be pulled in and used to drive policy. You get a ton of scalability with this approach.

So you need to tie
that scalability together with fixing the RBAC problem, and this is fixed with something called attribute-based access control, or ABAC. In fact, there's also something called policy-based access control, which Immuta also does; I'll touch on that a little bit. If you remember this slide, RBAC should really be called static-based access control; it's like writing code without being able to use variables. And again, this applies to Ranger and Sentry and to other things like IAM roles. But wouldn't it be nice if you had something like this, where you could write the policy once and the role was dynamically attached at run time? Essentially, the policy gets defined at query time based on who the user is and what their role is. This is really what ABAC is all about. ABAC is not about where the attributes come from or how many places they can come from; it's about how the policy gets enforced at run time. So this is really dynamic-based access control, where RBAC is static-based access control, and Immuta can do this ABAC methodology.

If we revisit this real customer example, I'm just showing one table that had eight rules on it, which we already spoke about. But there were 12 total tables that had the same association and needed this row-level policy, so you had to write 96 total rules to enforce this correctly. This is very hard to understand; if you have to make a change, it kind of needs to be made by the person who originally built this stuff.

Using an ABAC approach like Immuta's, this can become a single policy. The first thing we talked about is that we can scale this because we're using the logical metadata rather than the physical table names. We're looking at any column tagged organization name and running this where clause on it, and this will propagate the policy everywhere it finds that tag, on those 12 tables essentially. And then we're using the group attribute as a
runtime variable, so we don't have to write it eight separate times for the same table. That's how we're able to get this down to a single policy. It's also future-proof: if you get a new group, this policy still applies, and if you get a new table, it will be discovered and the policy gets attached to it. You don't have to remember to make these changes.

The last one is privacy enhancing technologies. You've got these more stringent definitions of privacy preservation; how do you deal with that? Of everything I've spoken about today, this is probably the least obvious one, and our customers tend to realize it as they become more mature in building these policies. How do we get to this sweet spot and meet the demands of legal and compliance and of the data analysts and scientists at the same time? This is the privacy versus utility trade-off I mentioned.

Just to demonstrate with a silly example why it's more than just direct identifiers you need to worry about, and why indirect identifiers matter too, I'm going to tell a quick story about Judd Apatow and Leslie Mann. You can see they're having a good time in their taxi here. The New York Taxi and Limousine Commission released a bunch of data on all their taxi rides. You can see this is pretty granular data, but it also looks pretty darn harmless: it's the medallion, the pickup and drop-off locations, the times, the total fare amount, and the tip amount. But it actually isn't that harmless, because a tabloid magazine took all these photos of celebrities in cabs and cross-referenced the taxi medallion and the photo's pickup time with the millions of records in this taxi data, and they were able to pinpoint the exact row for Judd and Leslie, for example. In this case they tipped two dollars and 10 cents, but there are other examples, which I didn't call out here, where the celebrity tipped zero. This makes great
tabloid fodder, and it's also a great example of how indirect identifiers can lead to privacy intrusion.

There are a lot of fancy techniques we can use to play in this gray area between privacy and utility that we've been talking about, that sweet spot. Some of the obvious ones are column restrictions, reducing specificity, or hashing and encrypting. But there are more advanced techniques, these privacy enhancing technologies. K-anonymization can suppress the values that can lead to the linkage attacks we talked about with Judd and Leslie. We could do things like differential privacy or local differential privacy, where we add noise to the data, which provides some guarantees of privacy. We could limit records with row restrictions. We could limit the types of queries you can run, like aggregates only, and add noise to those aggregations. This allows you to play in that utility space and get the analysts what they need while also enforcing the privacy controls that are required. Immuta gets you all of this, where these legacy systems aren't even thinking about things like k-anonymization and differential privacy.

This is just an example of correctly privatizing the taxi data. Had New York done this, that tabloid would never have been able to pull off the stunt that they did, but the taxi data would still have provided plenty of utility. They could have hashed the taxi medallion. They could have generalized the pickup latitudes, longitudes, and date-times so you couldn't do the kind of attack I described. And we could use local differential privacy to randomize the tip amount slightly, so there's no guarantee that it's the exact tip amount you're looking at.

Taking a step back on all this, think about the attacks that people can carry out. By the way, attacks don't have to be on purpose; they could be accidental breakage of regulations as
well. Think about the attack event, meaning the probability of an attack actually occurring, and then, if someone actually does try an attack, the success of that attack. That's represented by these circles. What we've been talking about so far is removing data risk, or shrinking our success circle, and we can do this with those privacy enhancing technologies and even simple policies: things like k-anonymization, local differential privacy, differential privacy, and masking. What we didn't talk about is reducing context risk, the likelihood of an attack even occurring. We can reduce this through things like purpose limitations and agreements that users agree to or sign, giving you a legal audit trail. This is a concept that exists in a tool like Immuta and doesn't exist in these legacy systems.

If you align this to what we've been talking about, neither Sentry nor Ranger has a concept of context controls at all, and they provide a limited amount of data controls. Sentry basically can just block tables and columns, so there's just a little bit of gain in shrinking S here. Ranger can do slightly more (it can generalize and do row-level policies), so S shrinks slightly more, but you're still left with this large attack circle. Something like Immuta can reduce your likelihood of attack with the contextual controls, and we have a wide range of privacy enhancing technologies, or data controls, to shrink your success rate. So this significantly reduces risk.

But the real value is that you're not only reducing risk; you're also providing more utility, because you're letting people get at data that, with coarse-grained controls, you would essentially just block completely, reducing their utility. So I like to think of a tool like Immuta as not taking data away from analysts; you're actually getting them more access to data while reducing your risk, which is
very, very powerful.

Okay, so this is all great, but I've put all this effort into Sentry and Ranger, and this seems like a big change. How am I actually going to move to something like Immuta? We've made that easy for you: we built a migration utility that you can use to migrate from Ranger or Sentry to Immuta. The Sentry one is fully GA; the Ranger one we're still working on, and it's in private preview right now. The idea is not just to migrate your Sentry and Ranger policies one for one to Immuta, but also to do this modernization I've been talking about: get the scalability, go from whatever it was, 90 policies, down to one, instead of keeping the antiquated approach that you have with these tools. So I'm going to stop here and give a quick demo of this.

Okay, let's have a quick demonstration of what I've been talking about. I'm going to show translating Sentry policies over to Immuta. The first thing I'm going to show is a grant statement on this fraud analyst role. In Sentry, we granted this role select on the database default. We've also granted this role select on these specific columns that they're able to query on this table. Notice we're not saying what they're not allowed to see; we're saying what they are allowed to see, and this implicitly grants them access to the table. Note that if I granted select on this table outside of these columns, that would override this, so it does get quite confusing in Sentry to manage all this.

I'm going to show you this user Floyd, who's in this fraud analyst role. I'm in Impala right now; I have not gone over to Databricks yet. I'm going to run a show databases, which shows the databases he can see: default and non-dbfs. I can also show you that I can only see the credit card transactions table in non-dbfs, because again that was the only thing I was granted via these column-level controls. And if I try to do a select star on credit card transactions, it's
not going to let me, because remember I'm only able to see these specific columns in here. So Floyd needs to be smart enough to know which columns he's actually allowed to query. I'm going to explicitly list all of those out here, and now I'm actually going to get results back.

As I mentioned, we have not transferred any of this over to Immuta or Databricks yet. Just to prove this to you, I'm going to try to query this non-dbfs database: Floyd can't see anything in there. If I try to query the credit card transactions table, this table doesn't even exist. If I go to Floyd's Immuta console, or catalog, there just aren't any tables available to him. If I jump over to Steve's Immuta, Steve is the data owner, so these tables actually exist as far as he's concerned. They've been registered with Immuta; we simply haven't transferred the policies over yet. For example, if I look at this credit card transactions table, there are no members of it except for me, and there are no policies in here yet. I spoke about global policies during the slides; global policies are what we call policies built at that logical metadata layer. There are no global policies here yet, and there are none for the subscription policies either.

So what I'm going to do is come back here to my shell and run this command that will actually register the Sentry policies with Immuta. This is done by reading the Sentry database and making API calls out to Immuta, so it's doing all that work right now. Notice that since we're just reading from the Sentry database, we do not need Immuta to reach into the Sentry deployment over the network. You can run this locally and just ship everything out to Immuta.

This is the important part. If you remember, we talked about migrating and modernizing. Currently in Sentry we have 22 different grant statements (I only showed you a few of them here), and we were able to boil this
down to four subscription policies and two data policies in Immuta. So there are significant gains in scalability and understandability here.

Just to show you this, if we go back out here and look at this credit card transactions table, you can see we've got this policy masking the credit card number, and the only people allowed to see the credit card number in the clear are the admin or the data owner, so it's kind of the inverse of Sentry. Similarly, we have these subscription policies that got created. For example, you can see that the fraud analyst role was granted access to the default database here, and this got applied globally across all the tables in Immuta.

So let's show Floyd actually querying this now. First of all, he's going to be able to see all this stuff in Immuta. These two are in the default database; credit card transactions was in non-dbfs, as we remember. If I show the tables in non-dbfs, we're going to see credit card transactions. And if I query credit card transactions, the interesting thing here is that I can do a select star, and credit card number is nulled out for me. That was the masking type that was selected.

But we can get fancy here and use one of our privacy enhancing technologies. I'm going to edit this policy: instead of making it null, I'm going to replace it with something called format-preserving masking. This converts the nulled-out mask into what looks like a real credit card number but isn't the real credit card number. Notice these are all nulls right now. I'm going to rerun this query, and we now see what look like real credit card numbers.

I can go ahead and save this right here. Next I'm going to add this group, this role, as an exception to this policy, so we can see what the real credit card number looks like under the covers. So let's come back here and edit
this again. I'm going to add this exception: possesses attribute Sentry role fraud analyst, because remember that's the role Floyd's in. Now this rule does not apply to him; it's no longer going to mask with format-preserving masking. If I rerun this query and we pay attention to this first cell again, we can see that we've got a real credit card number. The last several digits here are real and these ones up here were false, so the masking lets you see what look like credit card numbers but really aren't.

Obviously there are a lot of other features here that I'm glossing over, but that was a quick demonstration of translating Sentry policies, getting that modernization, and then being able to manually update these policies to give you more granularity and open more tables and columns to users, because you can play in this gray area between hiding the column versus actually getting some utility out of it while adding a layer of privacy. So thanks again for your time, and I encourage you to please leave feedback on this session. Thanks.
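The policy edited in the demo (mask anything tagged as a credit card number, except for users who possess the fraud analyst attribute) can be sketched in a few lines of Python. This is only an illustration of the idea of one tag-driven rule evaluated at query time; it is not Immuta's API, and every name below (`User`, `COLUMN_TAGS`, `mask_row`, the `sentry_role` attribute key) is a hypothetical stand-in. The digit substitution is a toy that merely produces something shaped like a card number; it is not a real format-preserving encryption scheme.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    # Attributes are looked up at query time (hypothetical shape),
    # e.g. {"sentry_role": {"fraud_analyst"}}.
    attributes: dict = field(default_factory=dict)


# Logical tags laid over physical columns (hypothetical catalog).
COLUMN_TAGS = {
    ("credit_card_transactions", "credit_card_number"): {"credit_card_number"},
    ("credit_card_transactions", "amount"): set(),
}


def fake_digits(value: str) -> str:
    """Toy mask: swap each digit for one derived from a hash of the value,
    keeping length and separators. NOT real format-preserving encryption."""
    digest = hashlib.sha256(value.encode()).hexdigest()
    out, i = [], 0
    for ch in value:
        if ch.isdigit():
            out.append(str(int(digest[i % len(digest)], 16) % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def is_exempt(user: User) -> bool:
    # The exception from the demo: "possesses attribute sentry role fraud analyst".
    return "fraud_analyst" in user.attributes.get("sentry_role", set())


def mask_row(table: str, row: dict, user: User) -> dict:
    """Apply the single tag-based masking rule to one row for one user."""
    result = {}
    for column, value in row.items():
        tags = COLUMN_TAGS.get((table, column), set())
        if "credit_card_number" in tags and not is_exempt(user):
            result[column] = fake_digits(value)
        else:
            result[column] = value
    return result


row = {"credit_card_number": "4111-1111-1111-1111", "amount": "42.50"}
floyd = User("floyd", {"sentry_role": {"fraud_analyst"}})
guest = User("guest")

print(mask_row("credit_card_transactions", row, floyd))  # clear value for the exempt role
print(mask_row("credit_card_transactions", row, guest))  # card-shaped but masked value
```

The rule refers only to a tag and a user attribute, never to a table name or a specific role grant, so a new table carrying the same tag or a new user carrying the same attribute would be covered without writing another rule.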
2021-01-02 16:09