Building and Securing a Data Lake on Google Cloud Platform (Cloud Next '19)
I spend a lot of my time building security features in Cloud Dataproc, which, if you're not familiar, is our managed Hadoop and Spark offering on Google Cloud. Joining me today is Larry McCay. He is from Cloudera; he's got over 20 years of overall security experience, including seven in Hadoop security, and he's currently the PMC chair of Apache Knox, a system that provides a single point of authentication and access for Apache Hadoop services. Also with me is Julien Phalip, Solutions Architect at Google Cloud. We're going to chat today about securing your data lake, specifically data lakes that are based around the Hadoop platform and the Hadoop ecosystem.
So, I'm finding three main reasons why customers are building data lakes on Google Cloud. First of all, there are just a lot of new sources of data coming in from a variety of places: streaming applications, IoT device data. And that just means there's a lot more unstructured data around that you have to deal with, secure, and analyze.
The second reason folks are building data lakes on GCP is that data is becoming a lot more self-service. IT departments are getting away from this idea of having a centralized reporting team. I think there are a lot of reasons for that, but the primary one is simply that data is now becoming way too important to wait on all the inefficiencies that are typically imposed by teams like that. Also, the tools for working with data are becoming a lot more mainstream and easier to work with, so you're also finding business users that are a lot more data savvy.
The third reason is that users now need access to a lot more data than they used to, a lot more ranges of data. At the same time, businesses are facing a lot more stringent privacy regulations, and they want to protect their PII and trade secrets.
And so companies need to start automating a lot of those privacy checks and data access, because you just can't keep up with all the different sources. Most of the talk that we're about to do today is focused on this third reason, but we also find that this is oftentimes a requirement for satisfying the first two reasons.
So, customers are using Google Cloud Platform to build data lakes because we have a lot of the tools that are commonly found in data lakes. Whether it's data ingestion, data processing, data management, or data security, GCP probably has an option for your use case. Now, I can't even begin to scratch the surface of all the different offerings that you see on this slide in this talk, but what I am going to try to do is answer why security-minded customers like yourselves are choosing GCP to secure their data lakes.
For the GCP platform as a whole, we have independent verification of our security, privacy, and compliance controls. Google undergoes independent third-party audits on a regular basis to provide this assurance. This means that an independent auditor has come in and examined all of our controls across our data centers, our infrastructure, and our operations. The certifications that we tend to go after are the most widely recognized and widely accepted independent security standards. So you're going to find things like ISO for our security controls and cloud security and privacy, but you're also going to find things like our SOC 1, 2, and 3. Having all of these certifications in place really helps us meet industry standards like PCI or HIPAA. And we're going to continue to expand the list that you see here with certifications globally, to make sure that we can help our customers meet their compliance obligations.
Another reason why security-minded customers choose GCP is because we have deep and wide features that
make it easy to keep your data secure. So, we have default at-rest encryption. If anyone here has ever had to deal with doing something like LUKS encryption on drives, you know the pain of doing that, from both a performance and a management perspective. On Google Cloud, anytime we land data it gets encrypted, and that just makes it easy. But at the same time, you can often control the keys. So if you want to have customer-managed encryption keys for that stored data, you can do that: you can rotate those keys on your schedule and revoke those keys as you need to.
Most of our products also integrate with Stackdriver. That's going to offer you a single pane of glass for your data access and audit logs across all the cloud products, and even other clouds; Stackdriver can integrate with other clouds as well. We also have OS Login, if you're not familiar with that feature, which lets you take a Cloud IAM identity and translate that into a Linux user. The big benefit there is that you don't have to be handing out and managing SSH keys to get into your virtual machines. We also have VPC Service Controls, which is going to give you the ability to do a lot of the things that you're doing in your on-prem networking stack in Google Cloud. And then we also have Cloud Identity and Access Management, as well as Kerberos support in Dataproc, and I'm going to talk a little bit more about those in a minute.
But before I do, I want to come back to some of the data lake commonalities that I'm seeing. Between the three of us, I think we've seen countless data lake implementations from a variety of industries, and at least me personally, I don't think I've ever seen two that were alike. Have you guys ever seen two customer data lakes that were exactly alike? No. But there are three common elements that every data lake has got to address: ingestion, storage,
and then a compute platform that's going to let you use a variety of tools to mine and identify workflows.
From my perspective, at least, ingest tends to be all over the place. Data is now coming in from just everywhere, at all sorts of volumes and all sorts of velocities. Even in my GCP world, I've seen a lot of common tools, but I have not seen a universal pattern for ingest yet. However, when it comes to number two and number three, where you store the data and then having a platform for tools, customers do tend to answer these in the same way.
So Cloud Storage, at least in Google Cloud, is the store for the data lake. It's the hub that brings together GCP around your data. It's going to check the box for most of the features that a data lake storage system is going to require. You're going to have high durability, centralized security controls, and storage classes that help you optimize cost. And one that is very relevant for Hadoop and Spark: we have immediate consistency. That's a really important feature that allows you to use Cloud Storage as a drop-in replacement for HDFS without adding a bunch of workarounds like NoSQL databases.
Once customers land on Cloud Storage and they've identified that as their storage mechanism, the next step is typically trying to identify a way to manage, mine, and identify all those various workflows. More often than not, customers choose a platform over a specific tool, one that lets them run a variety of open source processing applications. You really need a platform that can provide lots of capabilities, and the latest ones. Cloudera and Cloud Dataproc both provide these platforms. Cloudera is focused on building a unified hybrid cloud, whereas with Cloud Dataproc we tend to want to modernize your stack and pull you into the Google Cloud Platform. Yet because both of us are working on so many similar capabilities, and in the open source community, we
were able to come together and build some common security infrastructure that we both could use. In a bit, Larry and Julien are going to talk a little bit more about that collaboration, but before I hand off to them, I want to leave you with some IAM best practices for data lakes.
So, just to back up and baseline: Cloud IAM is who can do what on which resources. You're typically going to have a project owner. They'll invite members to projects, then they're going to grant them roles, and those roles consist of a collection of permissions for one or more resources.
Some basic IAM best practices: you definitely want to follow the principle of least privilege. That applies to identities, roles, and resources. You always want to select the smallest scope necessary to reduce your exposure to risk. So you're not going to go and grant everybody an owner role on your entire organization; there are intentional hacks or mistakes that could bring down applications. You want to be really specific and deliberate, and assign a specific group something like a security admin role to manage SSL certs and firewalls on specific projects.
And you want to use groups as much as possible. Managing permissions for individual users can be really cumbersome and error-prone, so you want to use groups. You can see in this example that we have users getting added to a SecOps team group, and then it's that group that gets permissions like Security Admin and Logs Viewer. You
also want to make sure that you have policies in place to secure your resources. You want to make sure you're controlling how additional users gain access to resources through policies and group memberships, because without strict control over those policies, you might be inadvertently allowing new users more permissions than they need, and then we're back to violating that principle of least privilege.
OK. At this point maybe a lot of you are rolling your eyes or pulling out your phones, thinking, "Yeah, Chris, I get this, this is every RBAC system ever." And there's some validity to that. But here are some important things to keep in mind that are actually different in Google Cloud. You're going to have two kinds of members of your data lake in Google Cloud. You're going to have accounts that might be either a Gmail or Google address, and, most likely, you're going to have folks coming from your G Suite domain, so that's the folks that have an @ and then your company's name, .com. And it's important to understand that GCP does not create or manage users or groups. I'm going to say that one more time: GCP does not create or manage users or groups.
So the I in IAM is almost a bit of a misnomer sometimes. When you assign a permission to someone in Cloud IAM, it's going onto that user, but that user lives outside of the GCP project. They might be associated with many GCP projects, or even a different organization altogether, with a different set of permissions. That's why in GCP, in your data lake, it is really important to scope things as much as possible with service accounts, because those service accounts are associated with and limited to those GCP projects.
Now that we understand that users and groups are not part of your specific GCP projects, here are some data lake best practices that may not be quite as intuitive as that first set. So your data lake objects, your
GCS bucket ACLs, those default ACLs, any object ACLs: as much as possible, those should contain groups. Great, we established that. But those groups, as much as possible, should contain IAM service accounts. You want those service accounts to represent a type of automated workload, and then what you're going to want to do is grant end users the iam.serviceAccountUser role on that service account.
As much as possible, the goal here is that we want to avoid exposing a long-lived credential. Really avoid that if that credential is associated with a real end user, because remember, that user might have other permissions that are not part of your GCP project. Let's say I'm working with Larry, and he's from Cloudera, and he only has some read-only, de-identified access to my data. I might think, "Whatever, I'll give him a long-lived credential, because it doesn't really matter in my project." But if it is tied to his real user, he also probably has access to other Cloudera projects. In other Cloudera projects he could be an admin. So creating these long-lived credentials for real people creates a somewhat ambiguous blast radius, because you're not just limited to the GCP projects that you have control over for
these users. Also, as much as possible, try to use the Google APIs, things like gcloud, or, if you're using Cloud Dataproc, use the Jobs API to break apart isolation and permissions. The Cloud Dataproc Jobs API will determine who can submit jobs to the cluster versus who has actual access to the cluster. Better yet is if you can do isolation based on the VM or the cluster itself.
I know I just said that my best practice is to isolate by VM or cluster, but I'm also very well aware that that is not practical in reality for a lot of situations, which is exactly why we now have a beta version of Kerberos integrated with Dataproc. This is going to let you do things like tie your Dataproc users back to Active Directory. It also makes it as simple as a button click to turn on security features like Hadoop secure mode or encryption in flight.
The standard model in this Kerberos beta is that users access Cloud Dataproc and obtain a Kerberos principal for the work that they do on the cluster itself, and then everything done on the cluster is done as that Kerberos user. However, if data does need to get accessed from the data lake in Google Cloud Storage, that data is going to get obtained using the service account credentials of the virtual machines of the cluster.
Now, this model works really well for customers that can lock down their Cloud Storage buckets, and this is really what a lot of customers want. They want all the security controls in their cluster, which lets them use tools like Apache Ranger. However, we do have a lot of customers with very stringent security and compliance requirements that actually need to translate that Kerberos identity back into a cloud identity so the chain of access never gets broken. This is actually a security problem that
exists across other cloud providers as well, and it was something that Google and Cloudera were both working to address. So I think you now know where this is headed. I'm going to hand off to Larry now to talk about how Google and Cloudera came together to solve that dilemma.
OK. What Chris was talking about just now is something that I've been calling the cloud identity crisis. I will dig into what that means.
So, challenges for data lake identity when we're in the cloud or we're accessing cloud data. There's something I call the identity gap; we'll get into each of these things. Related to this, we end up with a phenomenon called credential sprawl, and also this issue of being able to use short-lived credentials rather than long-lived credentials.
This gray area on the slide represents the identity gap. On one side you have the Hadoop and Kerberos users, or your enterprise users, which Hadoop knows about. On the other side you have your cloud IAM users and identities, which cloud storage knows about. Cloud storage doesn't know about Kerberos users, and Hadoop doesn't know about IAM users.
We've actually supported these cloud data connectors for a number of years, and we've had to bridge this gap. These don't really represent points in time, but they are different approaches that have been used to try to bridge this gap, and they're all some flavor of
bring your own keys. We started out using environment variables. We've had the username and password in the URL itself to carry your keys. We have configuration files where you put your keys. We have the instance profile or service accounts associated with the VM, which Chris mentioned. We added the ability to protect your keys with the Hadoop credential provider. And coming soon, we have delegation token support. These are all attempts to bridge that gap, but they led to this credential sprawl.
So I have many jobs from many users running, and they're configuring their keys across your platform in various ways. How do you know where they are? How do you know where they end up? They are possibly in environment variables, not often anymore, but they could be in configuration files. They could be in credential stores, which are keystores spread across your data platform. They could end up in process listings, in log files. They could end up in GitHub repositories. This is what I'm calling credential sprawl. It's hard to find them, it's hard to clean them up, and it's very easy to leak them. So we needed a way to address that.
If you noticed, I'm not saying anything about number five here. This is related to the work that we've come together to collaborate on. It introduces the use of delegation tokens, which are a native part of Hadoop, and this is the standard way that credentials are propagated across your data platform for use by worker tasks that possibly don't have Kerberos credentials of their own, and things like that. They are securely distributed with the jobs and cleaned up when the jobs are done. They're also time-constrained, with an expiration time and a fixed number of times that they can be renewed within another time period. And, very
importantly, they are already supported in the existing Hadoop clients. So in order to introduce a new delegation token type, or to introduce new sources of delegation tokens, you don't have to change any of the clients. This can also be leveraged to broker tokens, which we'll talk about later.
I'm not going to talk about this picture in great detail, because Julien actually has a nice presentation of this overall flow, but this is the canonical picture of how delegation tokens are actually used for accessing HDFS. A user logs in and submits a job. The Hadoop client interrogates the file systems that are being used and tells them to go collect their delegation tokens, which then get distributed to the worker nodes and are available to the tasks while they're running. Upon job completion, they're all cleaned up. So that's essentially Kerberos for HDFS. Remember that.
OK, so, open source. Open source is actually why I'm on the stage here. It's actually why I joined Hortonworks when I joined Hortonworks, and why I'm happy to still be with the new Cloudera, to work in open source. Contributors to open source not only dedicate and commit their time to these projects, but at the end of the day they put their name on it. So I want to call out some people who have worked on this stuff, because this is actually moving the platform forward.
Steve Loughran, who happens to work for Cloudera, created this JIRA and this patch, which added delegation token support to S3A, which is a data connector for S3 access. He also provided an out-of-the-box implementation for those delegation tokens to limit that credential sprawl, so you can now use your keys from your desktop only. So you authenticate to AWS and submit a job, and
his change will call the AWS STS service and acquire short-lived session tokens for you. They get distributed as delegation tokens with the jobs and cleaned up when the job is complete. They are short-lived tokens, and there's no real way to renew them, but it really minimizes this credential sprawl phenomenon down to your desktop, where you're using them. So this was a great step forward.
As we looked at that mechanism, we realized that with minimal change we could swap out the source of these tokens. We realized we could start normalizing how tokens are accessed across all the cloud vendor connectors. So Phil Zampino, also from Cloudera, contributed a patch back to the GCS data connector in open source, and Julien, Igor, and Chou reviewed and collaborated
and improved this patch so that it works as well as the S3A patch did and meets the same needs.
That brings me to the Cloudera Identity Broker. This is an upcoming feature; it is not released yet. If we were to boil it down: remember I said Kerberos for HDFS? This is Kerberos for the cloud. That's the easiest way to think of it. You log in with your enterprise credentials and kinit as normal. Delegation tokens are acquired from the ID Broker and then later used to acquire access tokens as needed.
The goals for the ID Broker itself are a secure, multi-user data lake deployment in the cloud, and it's multi-cloud and hybrid cloud. Like I said, you authenticate with your enterprise credentials. We only use short-term IAM tokens, and we leverage your existing investment in cloud IAM policies, tooling, et cetera. We also have the ability to renew those short-term tokens, so long-lived jobs will still maintain their access.
This is a simple picture of the existing ecosystem of data connectors, along with HDFS. These red "credentials needed" labels represent the identity gap we were talking about before, and all those many ways of filling that gap were used. But this is the picture we're moving toward now. All the existing clients are going to talk to the data connectors and tell them to go get their delegation tokens, and, transparently to the user, they're getting acquired from the identity broker, and access tokens are acquired and ready for access to cloud storage.
So this is essentially Kerberos for the cloud. It's the same picture as Kerberos for HDFS, but it shows how the identity broker is used for acquiring delegation tokens and access tokens, and then subsequently providing access to cloud storage from the worker nodes. So, in order for us to have, you
know, one of our requirements was actually to not require IAM users for every one of your enterprise users. In order to not do that, we need a way to associate roles, quote-unquote. In S3 they are actually IAM roles; in GCP they are service accounts. So we associate group memberships to roles, and we have a role selection algorithm that picks the best match based on configuration, based on these mappings, and based on a number of other things. This is just a mock-up of what a UI will look like to do this.
Just to illustrate the power of being able to use only your enterprise credentials, we'll walk through a simple use case of migrating data from one cloud provider to another using DistCp.
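As a rough illustration of the group-to-role mapping idea mentioned above, here is a minimal sketch. It is not the ID Broker's actual role selection algorithm, which the talk does not spell out; the "first matching group in priority order wins" rule, the group names, and the service account names are all assumptions made up for the example.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical mappings from enterprise (LDAP/Kerberos) groups to cloud "roles".
# On GCP the role would be a service account; on AWS it would be an IAM role.
# Order is an assumed priority: earlier entries win over later ones.
GROUP_ROLE_MAPPINGS: List[Tuple[str, str]] = [
    ("data-admins", "lake-admin@example-project.iam.gserviceaccount.com"),
    ("analysts", "lake-reader@example-project.iam.gserviceaccount.com"),
]


def select_role(
    user_groups: Iterable[str],
    mappings: List[Tuple[str, str]] = GROUP_ROLE_MAPPINGS,
) -> Optional[str]:
    """Return the role for the highest-priority group the user belongs to."""
    groups = set(user_groups)
    for group, role in mappings:
        if group in groups:
            return role
    # No mapped group: the user gets no cloud role at all.
    return None
```

With these made-up mappings, a user who is only in `analysts` resolves to the reader service account, and a user in no mapped group resolves to `None`, so no enterprise user needs an IAM user of their own.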
You log in with your typical Kerberos kinit, and you issue the hadoop distcp command with your source and your destination URLs. You can see by the schemes there that one is for S3 and one is for Google Storage. The client, as I described before, iterates over the file systems it's about to use and tells them to collect their delegation tokens. The delegation tokens are then collected from the ID Broker for both of those file systems. They're collected into the app launch context, and you can see one for each of the cloud storage vendors there, as well as an HDFS delegation token.
YARN then distributes the job along with the app launch context and the credentials in a secure manner. During execution, those credentials are used to move the data from the source bucket to the destination bucket. The way that it does that is it takes the delegation token and authenticates to the ID Broker; even though it doesn't have Kerberos credentials, it's using that delegation token, and it's requesting access tokens. It will also ask for new access tokens when they expire. Then, as the job completes, the delegation tokens are all cleaned up, and we eliminate credential sprawl completely. We've logged in with our enterprise credentials, and we ended up dynamically getting access tokens from multiple cloud vendors for as long as we needed, even though they're short-lived. And your data is now migrated.
So, just a summary slide here. We have multi-cloud access token acquisition for AWS S3 and Google GCS; Azure ADLS Gen2 is still being worked on. You only need your enterprise credentials, as I said. It supports long-lived jobs with short-lived tokens, and it leverages the existing support in Hadoop clients. Another
thing to be aware of, which is a related but separate effort, is that we are working on common enterprise data policy across on-prem and cloud storage data access. That's also coming soon. So I will hand it over now to Julien, who will talk about the GCP token broker, which was enabled by the same collaboration.
Thank you, Larry, and hi everyone. I am Julien Phalip, I'm a Solutions Architect at Google Cloud, and I'm really excited to introduce you to the GCP token broker.
So what is the GCP token broker? It is a brand-new open-source project that we just released last week. It is currently in alpha, and I will talk to you a bit more about the upcoming roadmap. This is the result of a close collaboration that we've had with several large Hadoop customers, as well as with partners such as Cloudera and Hortonworks. The project on GitHub contains documentation about the design that we've
aligned on, as well as a reference implementation.
This project is intended for enterprise customers who are currently using Kerberos to secure their on-prem clusters, and who are looking to migrate those clusters to the cloud and try to replicate the same user workflows and experience. It is particularly valuable for customers who are interested in a hybrid environment, where they can let users run jobs on-prem as well as in the cloud.
The main goal for the GCP token broker is to ease that migration. This is done by leveraging the same Kerberos and Hadoop delegation token facility that Larry introduced to you. This facility allows you to securely distribute short-lived credentials across a cluster. It also allows for seamless integration across multiple tools and frameworks around the Hadoop ecosystem; that includes things like Spark, Hive, Tez, YARN, MapReduce, and many others. And because the integration is so seamless, this reduces the changes that you have to make when you're migrating your clusters.
The broker also helps enforce AAA security. This means that users can continue to use Kerberos to authenticate. You can also have very granular authorization, where you can set permissions for individual users on specific buckets. And as far as accounting, the design also allows you to have very granular access logs, which is very useful for troubleshooting, monitoring, or detailed audits, where you can trace back which user actually used which bucket or which resources in the cloud at what time.
The broker offers tight integration with Google Cloud services, starting with Cloud Storage and, in the future, with other services such as BigQuery or Bigtable. So while the ID Broker that Larry talked about is focused on the Cloudera environment and on allowing multi-cloud jobs
to run, the broker is currently focused on the Google Cloud environment.
Now, looking at the workflow at a high level. Again, this should look familiar; it's trying to solve the same problem that Larry described earlier. You have your users, your Kerberos principals, who want to use a Hadoop environment to run jobs, and as part of those jobs they also want to be able to access cloud resources like Cloud Storage. Now, Cloud Storage, like all the cloud services, works with cloud identities. So what you want to achieve, basically, is to let your users continue using Kerberos to authenticate and run jobs, and somehow match those identities with cloud identities so that they can then access cloud services.
Now let's zoom in a little bit on that Hadoop environment to see how it works. The system revolves around this broker service. There's a prerequisite that you need to set up first, which is to use Google Cloud Directory Sync. It's a tool that's been around for a long time. What it does is sync your on-prem LDAP identities over to cloud identities. This is done automatically, and it will sync all of your users and groups. Once you have all of your identities synced in the cloud, you can then write standard IAM policies for those users and groups and dictate what those users and groups are allowed to access. In a Hadoop context, typically this will mean setting permissions on GCS buckets. And then once that's in place, your users are ready to submit jobs.
So here is Alice, a Kerberos principal. What Alice will do first is log in by running the kinit command against the KDC. The KDC is backed by your LDAP database, and, also very importantly, the broker is connected to the same KDC so that it will be able to authenticate your users further along in the process. Now that Alice is logged in, she can submit a job by using a Hadoop client. First
thing that the client does is contact the broker to ask for a new delegation token. The broker at this point uses Kerberos to authenticate the user, making sure that the user is who they claim to be, and then generates a new delegation token and returns it to the client. The client then submits the job to the YARN ResourceManager and passes along that delegation token. The first thing the YARN ResourceManager does is send a renew request to the broker, just to make sure that it has full authority over the delegation token.
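The steps described so far, where the broker issues a delegation token to an authenticated user and YARN then renews it as the designated renewer, can be sketched roughly as follows. This is a toy in-memory model, not the GCP token broker's code: the class and method names, the renewal period, and the maximum lifetime are all hypothetical, and Kerberos authentication is assumed to have already happened before `issue` is called.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DelegationToken:
    owner: str         # Kerberos principal the token was issued for
    renewer: str       # the only party allowed to renew it (e.g. YARN)
    expires_at: float  # current expiry; pushed forward on each renewal
    max_date: float    # hard limit that renewals can never exceed
    token_id: str = field(default_factory=lambda: secrets.token_hex(16))


class ToyBroker:
    """Hypothetical broker holding delegation-token state in memory."""

    def __init__(self, renew_period: float = 3600.0,
                 max_lifetime: float = 7 * 24 * 3600.0) -> None:
        self.renew_period = renew_period
        self.max_lifetime = max_lifetime
        self._tokens: Dict[str, DelegationToken] = {}

    def issue(self, authenticated_principal: str, renewer: str,
              now: Optional[float] = None) -> DelegationToken:
        # Assumes the caller was already authenticated with Kerberos.
        now = time.time() if now is None else now
        token = DelegationToken(
            owner=authenticated_principal,
            renewer=renewer,
            expires_at=now + self.renew_period,
            max_date=now + self.max_lifetime,
        )
        self._tokens[token.token_id] = token
        return token

    def renew(self, token_id: str, caller: str,
              now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        token = self._tokens.get(token_id)
        if token is None:
            raise KeyError("unknown delegation token")
        if caller != token.renewer:
            raise PermissionError("only the designated renewer may renew")
        if now > token.expires_at:
            raise ValueError("delegation token has expired")
        # Extend the expiry, but never past the hard maximum lifetime.
        token.expires_at = min(now + self.renew_period, token.max_date)
        return token.expires_at
```

In this model, the early renew call from YARN doubles as a check that YARN really is the token's renewer: a renew attempt by anyone else, including the token's owner, is rejected.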
And then the job is effectively submitted across the cluster over multiple tasks. As those tasks are running, at some point in time they might need to access Cloud Storage, whether to read a file or write a file. When a task needs to do that, the first thing that it will do is contact the broker again, this time by pulling the delegation token from the secure context and sending that delegation token to the broker. The broker then verifies that the delegation token is still valid and, if so, calls the Google OAuth API to generate a new access token for the cloud identity that maps to the Kerberos identity. That access token is then returned to the task, and the task can then access Cloud Storage as that cloud identity, so Alice, the cloud identity. At the end, when the job is complete, the YARN ResourceManager will call the broker one last time and call the cancel endpoint. This will effectively make the delegation token inoperable, so that it cannot be reused outside of the context of this job.
From the very beginning, the token broker was designed with scalability in mind, by using several fully managed products on Google Cloud. That includes Kubernetes Engine, Cloud Datastore, Cloud Memorystore, and Cloud KMS. Now let's look at how this all fits together.
What you can see here in the top left corner is the broker environment. The broker service is running on Kubernetes Engine, and you can scale that cluster up to however many nodes you need to bear the load that's required for your users. The state is stored in a database on Cloud Datastore, which is a highly scalable, fully managed NoSQL database on Google Cloud. That state includes the delegation tokens and also some session details that are attached to the jobs that are currently running. There is a cache, currently using Redis on Cloud Memorystore, to improve
performance. Cloud KMS is used to encrypt any sensitive data, so for example, before caching the access tokens, those tokens are encrypted with Cloud KMS. And every operation done by the broker is also logged in Stackdriver, so you can monitor how the broker is used.
Now, looking at the top right corner, this is your client environment. This is where your Hadoop clusters will be running, and you can choose to run them either as self-managed clusters on Compute Engine, or you can use Dataproc, which is Google's fully managed Hadoop and Spark service.
There are a couple of configuration steps that you have to do. First, you need to set up a two-way trust between the client machines and the broker service. This is to allow services like YARN that are running on the client machines to access the broker, and also to allow logged-in Kerberos users to access the broker. And then you also need to set up a one-way trust between the broker and the LDAP database. This is so that the broker can authenticate your users.
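[Editor's sketch] The exchange path described earlier (validate the delegation token, map the Kerberos identity to a cloud identity, cache the access token in encrypted form, and cancel at the end of the job) could look roughly like the following. Everything here is an assumption made for illustration: the class and parameter names are hypothetical, a plain dictionary stands in for the Redis cache, and the `mint_access_token`, `encrypt`, and `decrypt` callables are placeholders for the Google OAuth call and Cloud KMS, which this sketch does not actually invoke.

```python
import time


class AccessTokenExchange:
    """Toy version of the broker's exchange path. All names are hypothetical."""

    def __init__(self, sessions, identity_map, mint_access_token, encrypt, decrypt):
        self.sessions = sessions          # delegation token -> Kerberos principal
        self.identity_map = identity_map  # Kerberos principal -> cloud identity
        self.mint = mint_access_token     # placeholder for the OAuth call
        self.encrypt = encrypt            # placeholder for a KMS encrypt call
        self.decrypt = decrypt            # placeholder for a KMS decrypt call
        self.cache = {}                   # stand-in for Redis: identity -> (ciphertext, expiry)

    def get_access_token(self, delegation_token, now=None):
        now = time.time() if now is None else now
        # Reject tokens that were never issued or have been cancelled.
        principal = self.sessions.get(delegation_token)
        if principal is None:
            raise PermissionError("delegation token is invalid or was cancelled")
        identity = self.identity_map[principal]
        # Serve from the cache while the stored access token is still fresh.
        cached = self.cache.get(identity)
        if cached is not None and cached[1] > now:
            return self.decrypt(cached[0])
        # Otherwise mint a new token and cache only its encrypted form.
        token, lifetime_seconds = self.mint(identity)
        self.cache[identity] = (self.encrypt(token), now + lifetime_seconds)
        return token

    def cancel(self, delegation_token):
        # Called when the job completes, so the token cannot be reused.
        self.sessions.pop(delegation_token, None)
```

Passing the encryption and minting functions in as parameters keeps the sketch self-contained; a real deployment would put actual service calls behind them.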
Then you can have your data hosted on as many buckets as you need. What you can see here with the dotted lines are separate VPCs. All of those VPCs are private VPCs, and all of those machines only use private IPs. You can then interconnect those VPCs in different ways, by using VPC sharing, VPC peering, or VPNs.
So this, I want to emphasize, is only an example deployment. The project on GitHub contains Terraform scripts that will allow you to automatically deploy this architecture, but you should just use that as a starting point and adapt it to your production environment as you see fit. And you can also swap out any of those components. So if you choose to use, for example, a MySQL database to store the state, you can surely do that and replace Datastore, or you can also use Memcached instead of Redis.
A quick word on the roadmap. The current alpha release contains support for Cloud Storage, supports Kerberos as the authentication back end, and also supports proxy users, which is a Hadoop concept that allows services like Hive, for example, to impersonate other users. In the upcoming beta we're going to focus on performance optimizations and on stabilizing the API. And then in future releases we will add several more features, most prominently support for BigQuery and Bigtable.
So I invite you to look at the project on GitHub for more details about the design, the roadmap, as well as the actual implementation, which you can then try out and use in your environment. We would really appreciate any feedback. We want to make sure that we can implement all the features that you want and need. So please feel free to try it out and get in touch with your sales team at Google, and then we'll be sure to follow up and make sure that we iterate and make it better for you.
Thank you very much. This is the entire.