Kubeflow - Tools and Framework (Google Cloud AI Huddle)
The speaker for today is Abhishek. Abhishek manages the engineering team building the Kubeflow platform and supports the mission of Kubeflow: to make scalable ML models, and deploying them to production, as simple as possible. He originally worked with Google Search on crawl and indexing systems, then spent a few years with enterprise ML startups before returning to the Cloud AI Platform, where he is now working on Kubeflow. Abhishek, take it away.

Thank you. Welcome, everyone, and thank you for showing up; apologies for the little bit of delay we had, but we'll start. Sean already gave me a hint about how to communicate with the audience. I would like to be a bit interactive if possible, so I might ask you a few questions along the way.

Okay, the first thing. When we talk about machine learning, what do we really want to do? The first thing that comes to mind is that we want to build a model. How many of you are data scientists, or consider yourselves data scientists? I see one. That is lower than what I expected; I'm sure there are more data scientists here. Anybody else? How many of you are engineers, or people who productionize models? Okay, a bit bigger group. Great, thank you.

I'm sure you have all had different experiences doing this, but everybody has probably faced this: you start out thinking that you are going to be building a model, and the moment you start doing actual work you realize there is a lot of other stuff you have to do before you can even get your hands on TensorFlow or something to train your model. You start with data ingestion, where you have to figure out how to get data into your system; then data analysis, data transformation, validation, actually creating test sets and training sets; then training a model, building a model, experimenting a lot. Then you want to validate your model and figure out whether it actually does what you are looking for. After that you want to train at scale and roll out the model to production; if it doesn't go into production, I don't know why it's worth training the model at all, so I think that is probably the most important step. Then you actually serve it, and probably build an application around it. And of course, eventually you want monitoring and logging as well.
So what's the problem in this workflow, or in this space? Setting up an ML stack or pipeline is incredibly hard. Anybody who has done this would, I think, agree with me: when you start doing this, you realize you need to install a lot of tools on your desktop or workstation, wherever you're working. You get sucked into a lot of engineering and DevOps, where you need to install things, get them connected and working together, figure out networking, and who knows what else. Talking to production engineers and doing this in a production environment is even harder. You have to be careful not to mess up things that are already there; if you productionize your model in the wrong way, you might end up making predictions completely different from what the data scientists intended when they trained the model. And setting up an ML stack that works across different clouds or different deployments is extremely hard; in fact, I don't know if it is possible at all with the current set of tools. For today's discussion, we will consider your local laptop or desktop to be multi-cloud as well: it's just another instance of the cloud.

If we look at our experimentation stack: when you are building your model, training locally, trying things out, you have a lot of stuff running on your laptop enabling you, starting with hardware at the bottom with accelerators, then operating system, drivers, runtime, and storage, and on top of this the frameworks that you actually use to train and build your models. On top of that stack, on your laptop or desktop, you're going to be installing a bunch of microservices that enable you to compose and create your workflow, all the way from reading in data to training and then testing out your model and making predictions.

Now take it a step further: you've got to do this whole thing over again when you actually want to train your model at scale. We would all love to train models with the full volume of data available; the more data points you are able to use, hopefully the better accuracy you get in your model. And then finally, you actually want to deploy your model to the cloud, where the endpoints are accessible to users and the model is actually making a difference in the real world, which was the actual intention of your model.

For all three of these environments, unfortunately, with the current set of technologies available, you have to literally redo a lot of this stuff. What you did in your experimentation stack to build your model is not exactly what you can take to production or deploy on the cloud, and there's a ton of work involved in making that happen. Which is why we thought of building Kubeflow, and we started this effort.

Our mission, literally, is to make it easy for everyone to develop, deploy, and manage portable, distributed ML on Kubernetes. So, in summary, the platform is going to allow you to run Kubeflow wherever you can run Kubernetes, on whatever cluster, whatever cloud, whether it's on-prem or some kind of cloud platform. If you can run Kubernetes on it, you can deploy Kubeflow on it and take your technology stack along with it, including your local machine; you can run Kubernetes on your local machine as well. With this technology, this is what your stack should look like: you are literally working with Kubeflow services, and whatever you develop and deploy you should be able to translate across all the environments without much effort.

Looking at what's going on inside Kubeflow: we have a bunch of microservices that we rely on in building our ML workflows, and we have integrated a whole bunch of components, some coming from Google itself, from TensorFlow and TFX, and others from the open-source community. These include TF Transform for data transformations; TF Data Validation for data validation; Jupyter as the development environment where you experiment and build your model; TF Job, and PyTorch, to train your model at scale; TF Serving and Seldon for serving at scale with the kind of latencies you want; and Prometheus for monitoring. There is more that we can't fit on the slide, and you'll also notice there are boxes that are right now missing components, so we do need to build out more components and get the community involved to incorporate their solutions into the platform.

Alright, so that was the backstory: what Kubeflow is about and how it lets you translate your solutions across different platforms. I want to spend some time on this slide and think about what the user experience is going to look like. For people who have done model development, we want to see the workflow you will be going through when you are trying to develop and build your model.

One of the important goals of Kubeflow is creating a low-bar, high-ceiling experience. What do I mean by high ceiling? If you notice the way Kubeflow is designed on top of Kubernetes, it does not restrict any Kubernetes-level APIs, and every tool available in Kubeflow you should be able to configure to your heart's content, in whatever environment you are in. There is a lot of complexity in the enterprise world, a lot of challenging networking setups and different kinds of environments you might have to optimize for, and the configurations in Kubeflow should allow you to go down into the details and modify whatever you need to make it work in your environment. On the other hand, we want to enable users who are not familiar with Kubernetes to access this platform and build their ML workflows without having to worry about learning Kubernetes. Similarly on the data-science front: there are engineers who are just starting to learn data science; they have a lot of good background in Kubernetes and systems, but they might not be that familiar with data science. We want to create a low bar for them as well, with tools that will enable them to build data-science and machine-learning products without having to go too deep into learning the various technologies, if possible.

Okay, with that in mind, I would like to present an example that will walk us through the user experience we just talked about. I think a lot of you are already familiar with what a GitHub issue is. Generally, if you go to GitHub, there is content associated with the actual issue, where you have all the details, and then you title the GitHub issue with a title that seems to fit, describe, or summarize what the issue is all about. The problem we are going to look at: when a user is filing a bug, they are running into some issue, they are not in the right mindset, they are really troubled about what's going on, and all they want is to paste that error, provide whatever information they want to provide, and file the issue. That might not be the best time to think about what exactly is going on and what the real title for the bug should be. So, with that in mind, one of the data scientists at GitHub actually thought about this problem and felt he could come up with an ML algorithm, an ML model, that would look at the content, analyze it, and hopefully suggest a title that is relevant and helpful.

So I want to thank, at this point, Hamel from GitHub, who actually built this model, and Ankush and Michelle, who helped productionize it into an application on top of Kubeflow. And while I'm at the stage of thanking people, I should also thank a whole lot of others; none of the content you're seeing is actually my creation, it was created by a lot of people. I personally really want to thank Jeremy, who is actually right here; please meet him when you can. He's the tech lead of Google's Kubeflow engineering team, and he conceptualized and formulated the idea around Kubeflow starting in 2017, and then in December 2017 announced it to the world, along with product manager David Aronchick, who's not here today. The two of them co-founded this project and really defined the vision and the technology that would go into making this happen, and to this day we rely heavily on them to provide us direction. I would also like to thank Google's Kubeflow engineering team; without them none of this would be possible. And of course the Kubeflow community: we really love our community, they are extremely inviting, open, and friendly, and they helped us a lot in figuring out solutions to these problems.

Alright, moving forward. The first step we were talking about in the user experience was deployment. This is something really fresh; it's actually not directly available yet, but I want to give you a preview of what the deployment experience will look like with this application we have been building. This is a deployment of Kubeflow on top of Google Kubernetes Engine. As I mentioned, you can deploy Kubeflow wherever Kubernetes deploys, whether it's Microsoft Azure, AWS, some other cloud, or your on-prem hardware, but in this presentation we are specifically focusing on a GKE-based deployment.

You start with a few fields: you provide information about the Google Cloud project that you want to deploy the stack on, give the deployment a name, provide some authorization information, and then click deploy. A few minutes later you should be able to view your cluster, along with the services running on it, in your Kubernetes Engine dashboard. Sorry about the small font, but what this is showing is some of the core services that Kubeflow deploys on your cluster when you create the deployment; a lot of these are fundamental to stitching together your workflows, your process to train or deploy a model. The URL here that you are looking at is essentially an access point into your Kubeflow cluster deployment. By default, everything is deployed protected inside your cloud project and not accessible from an external IP, with the exception of this ingress point, which is protected by Google's Identity-Aware Proxy (IAP); that allows you to authorize any access going into the cluster. If you click on that link, it takes you to the Kubeflow dashboard. Right now it's very minimal, and we are working on figuring out what the right information to put there is, but at this point it helps you bounce off to a few things you are most commonly going to be doing. The first thing it will help you get to is JupyterHub, which is where you will actually be doing development, and that is the next step in the deployment process.

I do want to take some time here, referring back to the principle we talked about of low bar, high ceiling. We created the web-UI-driven deployment process to lower the bar on deployment, to make it easy for data scientists who are not familiar with Kubernetes to get onto Google Cloud and create their Kubeflow clusters without much effort, just a few clicks and a few keystrokes.

On the other hand, we want to ensure that anybody who is familiar with what's going on has complete control over what their deployment is going to look like. On that front, the CLI-based deployment uses kfctl, which is our command-line tool, and it lets you control your deployment with four commands: init, generate, apply, and delete. We see an example of how that workflow goes for deployment on the right side. You start by initializing your deployment; it reads a few parameters and environment variables, determines what your configuration is going to look like, and creates a directory which is going to be your Kubeflow application, referring to your Kubeflow deployment. In this instance, again, it is pointed at GCP, the Google Cloud Platform, so it will end up creating a GKE cluster. The next step is generating the platform configuration; what that generate step does is generate a lot of configuration describing what your GKE setup is going to look like. The benefit of this is that the generated configuration is available locally, where you are doing the deployment, and you have access to it: you can go in there, browse around, and tweak and modify things if you want. You can create additional node pools, add GPUs right from the beginning, or make a lot of other configuration changes, such as whether you want IAP or not. The next step is to apply the platform, which takes the configuration generated in the previous step, talks to Google Cloud Deployment Manager, and creates the GKE cluster defined by that configuration.
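The CLI flow described here can be sketched roughly as below. The command verbs (init, generate, apply) come from the talk; the application name, project name, and exact flags are illustrative assumptions, and the precise invocation has varied across kfctl versions, so treat this as a sketch rather than a reference:

```shell
# Sketch of the kfctl deployment workflow (flags and names are illustrative).
# 1. Initialize a Kubeflow application directory, targeting GCP/GKE.
kfctl init kf-app --platform gcp --project my-gcp-project
cd kf-app

# 2. Generate the platform configuration (the GKE cluster definition).
#    The generated files sit locally and can be edited before anything
#    is created, e.g. to add node pools, GPUs, or toggle IAP.
kfctl generate platform

# 3. Apply the platform: talks to Cloud Deployment Manager and brings
#    up the GKE cluster described by the generated configuration.
kfctl apply platform

# 4. Generate and apply the Kubernetes manifests for the core
#    Kubeflow services that run on that cluster.
kfctl generate k8s
kfctl apply k8s
```

These commands require a configured Google Cloud project and credentials, so they are shown here only to make the init/generate/apply pattern concrete.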
The next step generates the Kubernetes manifests, which are the configuration for the various services that will be deployed on your GKE cluster. In one of the previous slides, when we looked at the GKE cluster, it had a list of services running as part of core Kubeflow; the configuration for those services is generated by this command, and apply will again go ahead and apply it. In between, it allows you to keep that configuration available and modify it in whatever form you feel is needed.

Switching back to the Jupyter notebook: you go to JupyterHub, and this is the page you land on, which allows you to create a Jupyter notebook. If you notice, the first field here lets you choose a container image. We have a few pre-built Docker container images which already have TensorFlow support, for different versions of TensorFlow, along with images built for CPU or GPU. If you want, you can just click spawn at that point and start your notebook, but it also allows you to configure further and specify how many cores you want. If you know your workload is going to require a GPU, you can specify that you need GPU containers, and you can also specify how much memory you might need.

Once you have created your notebook, you're inside it, and this is your playground. You can import TensorFlow libraries; if you are working with other Python libraries, like PyTorch, you should be able to pip install packages here as well and then have access to those Python packages. You have access to pandas and whatever other Python tools you might want to use. This is where you will spend a lot of time iterating on your model: figuring out what works and what doesn't, how you want your model to look, connecting to various data, and actually training your model. Remember that at this stage you are still running inside the container that has the Jupyter notebook, so you're only working with a reasonably sized dataset that allows you to iterate and experiment on your model.

Once you have figured out what your training script is going to look like, the next step in the process is to build a Docker container image. Currently the process involves defining a Dockerfile, which specifies what your Docker container needs to have installed; for example, if you need to install a few Python packages, you will have corresponding pip installs in there. Then you copy in your training script: from the Python code you had in your notebook, you create a corresponding Python file and include it in your container image, along with any other supporting libraries you might have written that you want in there. Once you have specified the Dockerfile, you use the docker build command to build the container image and then gcloud to push it to the cloud, from where you can use the container image to create your training jobs.

Another thing I do want to call out here, looking back at low bar, high ceiling: for a few data scientists this might be too much. They are probably capable of doing it, but is this where they should be spending most of their time? Which is why we are thinking of integrating this into a workflow that goes directly from the notebook: you take your training exercise, click build right from there, and in the background it creates a container image for you. So there are different ways we are thinking of, based on the feedback we are getting, of how we can lower the bar at every step.
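As a concrete illustration, a training image of the kind just described might be built from a Dockerfile like the following. The base image tag, package list, and file names are illustrative assumptions, not the exact ones used in the talk:

```dockerfile
# Illustrative Dockerfile for packaging a notebook-derived training script.
FROM tensorflow/tensorflow:1.8.0

# Install the extra Python packages the training script needs.
RUN pip install pandas

# Copy in the training script exported from the notebook,
# along with any supporting modules.
COPY train.py /app/train.py

WORKDIR /app
ENTRYPOINT ["python", "train.py"]
```

This would then be built and pushed with something like `docker build -t gcr.io/my-project/trainer .` followed by a push to your container registry (via gcloud or docker push, depending on your setup), so the image is available for training jobs on the cluster.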
Alright, so let's look at the next step, which is the most exciting part, for me at least: taking my training script and actually running it on a distributed cluster to train a full-scale TensorFlow model. For any application that you deploy on a Kubeflow cluster, the way it works (and I will go into a bit more detail in a few more slides) generally involves two steps. First, creating a configuration for your training job, or for any other service you want to deploy on your cluster; and second, making sure that configuration's parameters are specified appropriately based on what you need: how many workers you might need, where the model's input is going to come from, where you want the output model to go, and possibly sizes describing what your training run is going to look like. Once your configuration is ready, you can kick off your training job by taking the TFJob configuration we just defined and applying it to your Kubeflow cluster; that's the ks apply command, which takes the configuration, talks to the cluster, and says: bring up these containers and get them running. While that job is running, you can use various kubectl commands to monitor the pods that are running the job, access the logs, and figure out what is going on with the training job. I keep coming back to the same theme, but looking back at low bar, high ceiling: we do want to allow data scientists, or anybody who just wants a bit of flexibility and ease, to create their TensorFlow job configuration from a UI. This is really possible because, if we look at the previous slide, there was a lot of configuration defined for the job, and we can literally interpret and parse the underlying configuration and generate a UI corresponding to that configuration.
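To make the two-step flow concrete, a TFJob configuration of the kind described looks roughly like the manifest below. This is a hedged sketch based on the early TFJob API; the apiVersion and field names varied across Kubeflow releases, and the job name, image, and storage paths are placeholders:

```yaml
# Illustrative TFJob manifest (early Kubeflow-era API; details vary by version).
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: issue-summarization-train
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3                # how many workers to train with
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/my-project/trainer      # the training image built earlier
            args:
            - --input=gs://my-bucket/issues.csv   # where the input data comes from
            - --output=gs://my-bucket/model/      # where the trained model goes
```

Once applied to the cluster, progress would be watched with standard kubectl commands such as `kubectl get pods` and `kubectl logs <pod-name>`, as described above.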
That UI lets you set the parameters not from the command line but from the UI; here you should be able to point to the container image you want to use, specify how many workers you want, where to pull the data from, and so on.

Great. Once our model is trained, the next step for us is: how do we serve this model? The example I present here uses TensorFlow Serving, but you could use Seldon as well, which is available in Kubeflow. We have a GitHub examples repo that you can go take a look at for the GitHub issue summarization example, and there you can see how the same thing can be done using Seldon. In this slide I will walk you through using TensorFlow Serving to serve the model.

We talked about ksonnet, which we use; I want to give you a bit of background on what's going on there. ksonnet is a configuration language that allows templating to describe various configurations, and the power of templating lets you parameterize a whole bunch of things that you can specify, set, and modify at instantiation time. All Kubeflow components are described by a corresponding ksonnet configuration, which is available to you in the configuration directory that you created when you were deploying the cluster. From there, you should be able to create services by generating new service components from the templates and then configuring the parameters, as I am doing here. We generated a serve-inception component, a brand new component from the TF Serving template, and then we set a few parameters: whether we want an HTTP proxy, whether to use GPUs, where the model is actually going to be served, and where to load the model from. There are a bunch of other parameters that I'm not using, but you are free to explore and figure out what you want to configure.
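The ksonnet steps just described correspond roughly to the commands below. The prototype name tf-serving and the parameter names are illustrative of the early Kubeflow packages and may differ between releases; the model path is a placeholder:

```shell
# Sketch of generating and configuring a TF Serving component with ksonnet.
# Generate a new component named "serve-inception" from the tf-serving prototype.
ks generate tf-serving serve-inception

# Set a few of its parameters: where to load the model from,
# whether to front it with an HTTP proxy, and whether to use GPUs.
ks param set serve-inception modelPath gs://my-bucket/model/
ks param set serve-inception deployHttpProxy true
ks param set serve-inception numGpus 0

# Deploy the component to the cluster's default environment.
ks apply default -c serve-inception
```

These commands assume a ksonnet application directory wired to a live cluster, so, like the deployment commands earlier, they are shown to make the generate/configure/apply pattern concrete.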
The last step, just like last time: once your configuration is ready, you say ks apply, and it will end up deploying this component on the cluster and bringing up your model.

Finally, deploying a web app around this is a very similar process. You would probably use Flask, or some other Python library, or whatever your favorite tool is, to build a web app against the model server we just created. Once your application is ready, you create a corresponding container image, deploy the container on Kubeflow, and then you have your UI talking to your model in the back.

I've been mentioning this over and over, but essentially the real theme here is that we have been working on getting the breadth of components that are necessary across the whole spectrum of the workflow, all the way from data ingestion to deploying and running your applications. There's a lot of work around this, more than we can handle by ourselves, which is why we are relying a lot on our community to provide solutions as well, and we would encourage you to get involved too. While we work on this breadth, we do study our users, get a lot of feedback, and figure out where they are having challenges, and we leverage that information to drive lowering the bar and improving the experience wherever we see pain in the user experience. We have tremendous momentum: since the project went live, late last year and early this year, we have over a thousand commits, 100-plus active community members, and roughly 20-plus companies contributing, and you can get live stats on this from our GitHub repo.
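Returning to the web-app step for a moment: the app is essentially a thin layer that forwards user input to the model server. As a small illustration, here is a sketch of building the request for TF Serving's standard REST predict API. The model name and the way each instance is encoded (a dict with a "body" key) are assumptions for illustration; the real encoding depends on the model's serving signature:

```python
import json

def build_predict_request(issue_body: str, model_name: str = "issue-summarization"):
    """Build the URL path and JSON body for a TF Serving REST predict call.

    TF Serving's REST API expects POST /v1/models/<name>:predict with a
    JSON body of the form {"instances": [...]}. The instance encoding
    used here is an illustrative assumption.
    """
    path = f"/v1/models/{model_name}:predict"
    body = json.dumps({"instances": [{"body": issue_body}]})
    return path, body

# A Flask (or similar) web app would POST `body` to the serving endpoint
# at `path` and render the suggested title from the JSON response.
path, body = build_predict_request("App crashes on startup with a stack trace...")
```

The point of isolating this in a helper is that the web app stays a pure front end: it never loads the model itself, it only speaks HTTP to the serving component deployed on the cluster.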
Finally, I do want to mention: Kubeflow is open. We are an open community; we love people joining our community and contributing. Open design, open source, and open to ideas. Which is why we are hoping that more and more data scientists get involved. As I noticed today, there are very few data scientists here; I would love for you to go to our website, kubeflow.org, or go to our GitHub repo and file issues. Let's start there: why don't you file issues on what is not working for you, and then we'll get you more involved, and hopefully you will have ideas to contribute. Happy to take some questions if you have them.

Thank you, Abhishek. One question I had: earlier we talked about different companies creating their own ML pipelines, like Uber's Michelangelo and Facebook's FBLearner. How generalizable are you designing this, and do you think it could potentially apply to the use cases they have? And how are you optimizing between being opinionated and being more general?

Right. I think we're starting from a place where we are not opinionated, and we want to allow the flexibility and allow the community to come up with ideas and contribute. I think the biggest challenge at the beginning is to get the breadth and make sure we have the components and the support to cover the whole end-to-end workflow. One of the biggest challenges, if people are interested, is data connectors; this is a big pain point for all enterprise solutions, where people always ask us: can we have connectors to our data systems? And similarly, you mentioned Michelangelo and FBLearner: all of them have various strengths, and we are welcoming them to come and contribute to the open-source community and build those into the platform. We would love to have that, and not only those: if anybody else wants to build on it, we have a lot of contributors who have gone and made the effort to take their components and make them available, and we would encourage that to continue. Thank you.

Hey, thank you for the talk. One of the things that can help data scientists today is to know what the standard way of doing ML is, because there is none: why and how to do feature engineering, how to do serving or training. And if we are doing it in a way that is compliant with the enterprise's requirements, then being opinionated actually helps. So how do you see this?

Right, I'm not disagreeing. I think your question is very valid, and I do hear this from a lot of the customers we talk to. A lot of them are at a place where they say: tell me one way to do this, and that is what I want to do; I don't want to explore and figure it out. So we do see a spectrum of people: there are really sophisticated engineers and teams who want full control, but there are also people who say: this is my business problem, that is what I want to focus on, just tell me how to do this. I do see a time when being opinionated would help, but I feel that at the beginning of the project it is not the right time to form opinions, when we don't even know what the right opinions are. I think it is time for us to explore and allow the community to contribute ideas, along with us producing ideas as well, and hopefully we find out from our users and customers what works for them. We are willing to help them get set up when they are confused; if they need some hand-holding and guidance, we are willing to put in the effort, and our whole community is happy to do that. But I don't believe in being opinionated right now; I think we would end up losing some good solutions, and we might end up creating a platform that is not usable by all. The whole ML space, in my opinion, is at a place where, as an industry, we don't really know what the right way to do all these things is, and we need to figure that out first before we can start imposing those opinions on people.

Good evening, my name is Paul. Thank you for the information; it seems like a really great tool to help a lot of people. I've been hearing a lot about key stores, and I'm curious how Kubeflow would play with key stores, or with managing identities. I know that you have IAM in the Google Cloud platform; can you speak to that?

Right. For identity and access management, we are relying on the underlying cloud platforms; Kubeflow itself is not reinventing the wheel on a lot of the underlying cloud infrastructure. On Google Cloud, we rely completely, in fact heavily, on Kubernetes to design and define those solutions, and we leverage that. In addition, when you deploy Kubernetes on different cloud providers, it will be leveraging the IAM solutions that are available there. Does that make sense? Yes, thank you. Thank you, Paul.

I was wondering if you could help us position Kubeflow against some of the managed machine-learning platforms, and how you see it from the Google perspective.

Right, I think it would be a good idea to do that. There are definitely users, as we just discussed, who can benefit from managed services as well, and I think it's an important part of their experience: they are willing to experiment and go deep into certain areas, but there are portions of the ML workflow where they would not want to invest, and it would be great to integrate. I think that also ties into the earlier question about how we get FBLearner, or Michelangelo, or Google Cloud's managed offerings to integrate into the platform. It is important for us to figure out integration points with those services. For example, people might be willing to spend time, and are okay with things failing or experimentation going on, in the training phase, but would rather leverage the strength of Google's managed ML platform for serving; and similarly there could be other managed services providing that. So I think that's an interesting idea and very useful as well; we should definitely look at all those solutions, and I would not discount any of them. Does that answer your question? Cool, thanks.

You mentioned at the beginning that this runs not only on the cloud but also on other clusters. Do you currently deal with other kinds of cluster systems, like HPC systems?

Right, that's a good question. We rely heavily on Kubernetes to provide us the capabilities for deploying on different platforms, and on-prem is one such platform, where we are working with certain partners who are focusing on making this work on high-performance clusters: for example, if there are InfiniBand interconnects, or other kinds of high-bandwidth interconnects, such as Cisco's, as well. So we would rely on some of our community partners to help us port these platforms onto that hardware. Is this currently still not available? I'm not aware of it being available, but I do know that certain partners are working on the solution. This is certainly something we are interested in. Thank you.

Thanks. Okay, thank you. Thank you.