Train and deploy models in Cloud and on-prem from notebooks | Google Cloud AI Huddle


So, today we have an awesome topic and an awesome speaker, so I will quickly go over Karthik's bio. Karthik is an ML engineer leading the data science experience for Kubeflow, working primarily on the Kubeflow Fairing project. Before Google, Karthik was part of the ML platform team in Uber's self-driving group, and he also led a team of data scientists on Uber's risk team. He is passionate about making data scientists productive.

Having said that, I'll do a quick plug for user research. One of the things we say often in our meetups is that this is a collaborative forum. We are here to gather feedback from you and understand your pain points running machine learning workloads, so feel free to send us feedback through the meetup channel. You can either post it on the meetup page or send us an email. We are here to answer your questions, and please also give us feedback on the kinds of topics you want us to cover, so that going forward we can pick topics that resonate with your needs. That being said, over to you, Karthik.

Thank you, Shannon, no further introduction needed. Hey, how are you guys doing? Good. So today we are going to talk about notebook-based machine learning development. I probably should have put that in the title, but I put the name of the project we are going to talk about, which is Fairing.

Before diving in, I want to give some context around what we are doing in our team. My team works on Kubeflow. If you haven't heard of it before, it's an ML platform based on Kubernetes. Just a quick show of hands: how many people have heard about Kubernetes or used it? Good, a lot of people. Kubernetes is quickly becoming the de facto standard for deploying production applications. Think of what Hadoop or YARN was doing about ten years ago; similarly, Kubernetes is resetting the playing field for how you deploy apps. Kubernetes is cloud native, meaning it understands the cloud environment and it understands Docker, so with all of these things it is quickly making it much easier to deploy things at scale.

What we realized in our team is that with the rise of Kubernetes, there wasn't much tooling around machine learning for Kubernetes, and that's why we started the Kubeflow project. The aim of Kubeflow is to make machine learning on Kubernetes much simpler.

Today I'm specifically talking about Fairing. Fairing is focused on making the goals of Kubeflow work for data scientists. How many of you here would categorize yourself as a data scientist or a machine learning engineer? Yeah, quite a few. A lot of data scientists out there are not very familiar with infrastructure technologies, and rightly so: they don't need to know about Kubernetes,

and they don't need to know about Mesos or other infrastructure, in order to perform their duties. Especially when you look at Kubernetes, it has a pretty steep learning curve, so for a data scientist, spending that much time learning Kubernetes and getting familiar with it may not be the most impactful use of their time. What we want to do is bring all of this goodness of Kubernetes into the data scientist's world, and that means integrating with notebooks and with a Python style of programming.

Talking about the data scientist's environment: nowadays Jupyter notebooks are becoming the ubiquitous IDE for data scientists. There are a lot more languages too, not only Python: you have an R kernel, now a Swift kernel, a lot of new kernels are coming up, and the Jupyter notebook is becoming a very good IDE for data scientists. So I'm going to show how we integrate with Jupyter and give data scientists a very easy way to build and deploy models.

Before I dive into the architecture and the more complex things, I want to give you a very quick, simple demo so you get some context. Here we are in a notebook. Fairing provides a couple of different ways of expressing your training logic, so I'll go one by one and you can see how it works. Here I'm using a cell magic; if you are not familiar with it, it basically saves the cell contents into a file, so I'm saving a Python file called train.py. This is just an example, but you might already have a train.py that has all your training logic. Let's say you have that train.py: how do you go about executing it? I'll dive into what the backend is later, but here you can see that I'm constructing a training job out of this train.py and just submitting that training job, roughly as sketched below.

If I submit that job (ignore all this logging for now, we'll dig into it later), what happens is that this train.py is packaged into a Docker container and shipped to the Kubeflow cluster that I'm running. I'm running a Kubernetes cluster on GCP; we have a product called GKE, Google Kubernetes Engine, where you can go and create clusters of your choice. I already have a cluster running, and I'm running on top of that. Here you can see it just printed the "hello world" that I wrote; it's a pretty simple flow.

How many of you here have your training logic in a Python file? OK, a good number of people; that's kind of the standard way, and it works well with Git. Even if you have your main training logic in a Python file, you still want to use notebooks to schedule a job and analyze the output of your job.
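Here is a minimal sketch of what that first demo looks like in a notebook. The `%%writefile` magic is standard Jupyter; the `TrainJob` import path, the backend, and the `docker_registry` value are approximations of the Fairing API as described in the talk, so treat them as assumptions rather than exact signatures.

```python
%%writefile train.py
# Trivial training logic used for the demo; any Python script works here.
print("hello world")
```

```python
# Rough sketch of submitting train.py as a Kubeflow Fairing TrainJob.
# Exact import paths and constructor arguments may differ by Fairing version.
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowGKEBackend

job = TrainJob(
    "train.py",                                    # entry point: a Python file
    docker_registry="gcr.io/my-project/fairing",   # hypothetical registry
    backend=KubeflowGKEBackend(),                  # run on a GKE-hosted Kubeflow cluster
)
job.submit()  # builds a Docker image, pushes it, and runs it on the cluster
```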
The second way you can execute a training job is through a Python function. Here, this is just a simple training function, but it can be a very complex function, and it can depend on other functions that you define. This is very well suited for people who develop their machine learning models in the notebook environment itself. How many of you develop your code only within the notebook, without switching to a Python file? A couple there. So if you have all your code in the notebook itself, you can just pass the function that has your training logic into the TrainJob API. Oops, I have to execute this cell first; nonlinear execution of notebooks. So you do that, and it's a very similar flow of packaging the function into a Docker container and executing it on the Kubeflow cluster.

As you can see, we try not to require users to go to another CLI tool or another dashboard to look at logs or other things; we keep them within the notebook environment and provide that workflow there. So here we just executed the train function, we got the output here, and it cleaned up everything that it ran (see the sketch below).
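A sketch of the function entry point, again with approximate Fairing names (the `TrainJob` constructor and backend are assumptions based on the demo):

```python
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowGKEBackend

def train():
    # Your training logic, defined directly in the notebook.
    # It can call other functions defined in the notebook as well.
    print("training a model...")

# Passing a function instead of a file; Fairing packages the function
# into a container image and runs it on the Kubeflow cluster.
job = TrainJob(train, backend=KubeflowGKEBackend())
job.submit()
```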

There's another set of users: a couple of our colleagues at Netflix are very passionate about executing the whole notebook. For example, here I'm trying to execute a notebook called train.ipynb; let me open that. This is a different notebook. Think of it this way: you write your whole model development in one notebook, and sometimes that also behaves like a report. You train your model, you do some prediction, you plot some AUC curves, you do some model analysis. Let's say you want to take that whole notebook and execute it on the remote cluster. That's possible too. To do that we use a technology called Papermill, which is open sourced by Netflix; it's a way to execute your notebook, parameterize it, and schedule it.

So far, everything else I executed had no dependencies, so it was running on a simple Python-based Docker image. In order to execute this notebook, I need to add papermill and jupyter to the requirements file, so I'm just creating a requirements file with all my Python dependencies. So we are going from how you specify your training logic to how you specify your dependencies. Right now we support two ways of specifying dependencies. One way is to give a requirements.txt where you add your dependencies. The other way, which I'll show, is using a custom base container: if you already have a base Docker image with all your dependencies, you can specify that Docker image. That's very useful for running on GPUs, because for GPUs you need to install CUDA and a lot of other things, and probably your team or your company already has a base image with all your dependencies, so that's a good option for that use case. We are also planning to add conda support in addition to pip requirements. How many of you here use pip versus conda? I think roughly equal numbers. How many of you have all your dependencies captured in a Dockerfile, either one you created or one your team provides? Very few; that's fine. In any case, you don't have to know anything about Docker in order to use these tools; that's one of the main things we wanted to abstract away.

So here I wrote a requirements.txt with papermill and jupyter. Now I can execute the same thing; the only additional thing I'm doing is passing the input file, the requirements.txt with all the dependencies. You can see that it tries to run the pip install; since I already did this, it's using the cached version. Here you can see it executed that notebook and printed the non-graphical version of it; you can also save the executed notebook and retrieve it back. A sketch of this is below.
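Roughly, the dependency and notebook pieces of the demo look like this. The requirements.txt part is standard; passing the notebook file and `input_files` to `TrainJob` is an approximation of the Fairing API shown in the talk, and the notebook name is illustrative.

```python
%%writefile requirements.txt
papermill
jupyter
```

```python
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowGKEBackend

# The entry point is a whole notebook this time; Fairing runs it with
# Papermill inside the container. requirements.txt is shipped along and
# pip-installed into the image. Names and arguments here are approximate.
job = TrainJob(
    "train.ipynb",
    input_files=["requirements.txt"],
    backend=KubeflowGKEBackend(),
)
job.submit()
```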
Now let's see how to specify the resources you want. This is where it gets a little bit into the Kubernetes weeds, but we are trying to abstract this out, so you'll probably see a better API coming soon. For now, let's say this is the function I want to execute: it just prints out how much CPU and memory I have. This one executed locally, so it tells me I have 12 cores with 32 GB of RAM. Now let me execute the same thing on the cluster. Ignore this particular syntax for now; this is the very Kubernetes-specific syntax I mentioned. Basically I'm asking for 90 CPU cores and 600 GB of memory (a rough sketch follows below). If I go and execute this, it does a similar Docker build step, but then it waits for some time. The reason it waits is that it's spinning up a VM with those requirements.

Nobody wants to keep VMs like that running; they're very big VMs and cost a lot. So we are using a Kubernetes feature called autoscaling. If I go to my Kubernetes cluster: Kubernetes has a notion of node pools, which are pools of VMs used for your workloads. Here I created a node pool with 96 CPUs and 624 GB of RAM, and I set this node pool to autoscale, with a minimum size of 0 and a maximum of 10. If there's no load on the cluster, it autoscales down to zero, so you don't pay for it. Since I executed this job, it will create an instance, start it, and run this workload.
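A minimal sketch of that resource request, assuming Fairing's `pod_spec_mutators` hook and a `get_resource_mutator` helper along the lines of what the demo uses; the exact module path, argument names, and memory units are assumptions.

```python
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowGKEBackend
from kubeflow.fairing.kubernetes.utils import get_resource_mutator

def show_resources():
    import multiprocessing, os
    print("cores:", multiprocessing.cpu_count())
    # Memory reporting is platform specific; this works on Linux.
    print("memory (GB):",
          os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9)

# Ask the cluster for roughly 90 CPU cores and 600 GB of memory; the
# autoscaling node pool spins up a large node to satisfy the request.
job = TrainJob(
    show_resources,
    backend=KubeflowGKEBackend(),
    pod_spec_mutators=[get_resource_mutator(cpu=90, memory=600)],
)
job.submit()
```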

That will probably take 3 or 4 minutes, so in the meantime, here are some of the important concepts that we covered. The entry point is how you specify your logic: you can specify it as a Python file, a Python function, or a class (I'll show the Python class example later), or, if you want to execute a complete notebook, you can do that too. When it comes to Python dependency management, you can either provide a base Docker image, which is very flexible because you can install not only Python dependencies but also system dependencies like CUDA and other system components, or, if you don't want to deal with Docker, you can just give a requirements.txt file and we'll build the Docker image ourselves.

Right now we have two types of jobs: one is called training and the other is online prediction. Training is very simple: it just takes your code and runs it. We also have support for distributed training, and I'll show that. Online prediction is where you have trained your model and you want to put it into a service where you can make an HTTP call and get back the results. This is useful for productionizing your model: once you have the trained model, users of your app, or your internal users, can hit this HTTP endpoint. Lastly, there are the backends; I'll keep those for later.

Let's see what's happening with the node that we're still waiting on, if I refresh. Here I can see that one node started. It's still not there yet, but that's fine, we'll come back to it a little later.

Diving into the architecture and the implementation: there are three distinct phases that we go through whenever we schedule a job. The first phase is the preprocessor, and this is where however you specify your code gets packaged, whether you specify it as a notebook, a function, or a Python file; the preprocessor handles that logic. The second phase is the builder, and this is where we build a Docker image out of it. Here we support the multiple environments you might be in: for example, you could be in Colab or some online hosted notebook environment, or you're running your notebook locally, or you're running your notebook inside a Kubernetes cluster itself. Different environments need different types of Docker builders, so we have a couple of varieties, not only a local Docker builder but also an in-cluster one, and we'll probably add Cloud Build support too. The deployer phase is where we take the Docker image we built from your code and deploy it into the cluster.

There, the job type matters: for example, if it's a distributed TensorFlow job, we want to use a TFJob, which is part of Kubeflow, and if you want to do online prediction, we want to use a serving type of job. One of the aims of the Kubeflow Fairing SDK is to be very modular and extensible, so we expose all of these APIs to the end user. If you want to build a different type of preprocessor, for example one that takes a notebook, converts it into a Python file, and runs that, you'll be able to do that. Likewise, if your company has a different way of building Docker containers, you can plug that in. On top of that, what I showed is the TrainJob API, which tries to hide all these intricacies and give you a very simplified API: you just give your Python file or your Python function and it goes and executes it.

One of the demos that we have is XGBoost training and deployment. Here we have an XGBoost training notebook. All of these cells are pretty standard: I'm reading a sample dataset and doing some imputation, replacing nulls with mean values. Here I'm doing a very simple fit; nothing interesting, it's just meant to showcase how you do XGBoost training. Both the eval and the save are simple too: I'm saving to a local file.

One more thing we are exposing is a Python class interface for when you want to do both training and prediction. You follow an interface where you have to have a train method and a predict method. First, let's look at the train method and see what it does. The train method reads the input, converts it into numpy arrays, trains the model on those numpy arrays, and does eval and save. In the predict method (ignore the feature names for a moment; that's an extra detail that may not be applicable here), the main point is that you get a numpy input, you load the model that you saved during training, you run prediction on that numpy input, and you return the prediction. Here I'm just returning the prediction as is, for example. A rough sketch of such a class is shown below.

If I want to run this training job locally, I just call the class's train method right here.
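A condensed sketch of the kind of train/predict class the demo walks through. The class name, file paths, and column handling here are illustrative rather than the exact code from the talk; only the train/predict interface shape is what the serving wrapper expects.

```python
import joblib
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

class HousingServe:
    """Illustrative train/predict class in the style shown in the demo."""

    def __init__(self, model_file="trained_model.dat"):
        self.model_file = model_file
        self.model = None

    def train(self):
        # Read a sample dataset and impute missing values with the column mean.
        df = pd.read_csv("ames_dataset/train.csv")  # hypothetical path
        df = df.select_dtypes(include=[np.number])
        df = df.fillna(df.mean(numeric_only=True))
        X = df.drop(columns=["SalePrice"]).values    # hypothetical target column
        y = df["SalePrice"].values
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

        self.model = xgb.XGBRegressor(n_estimators=50)
        self.model.fit(X_tr, y_tr, eval_set=[(X_te, y_te)])  # simple fit + eval
        joblib.dump(self.model, self.model_file)             # save for serving

    def predict(self, X, feature_names=None):
        # Load the trained model lazily and predict on a numpy input.
        if self.model is None:
            self.model = joblib.load(self.model_file)
        return self.model.predict(np.asarray(X))
```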

It's a pretty simple model, so it's pretty fast. Typically data scientists want to train locally on a smaller dataset before training on a bigger one, so that the iteration speed is much faster. So here I did that: I trained locally, and it works. Now, if I want to do the same thing on a remote cluster, I do some configuration where I set where to store my output Docker images, and also the Python base image. All of these are defaults; I'm just showing them, but you don't have to specify them, they will be inferred automatically.

So we're going back to that TrainJob API. Before, we gave it a Python function; here we are giving a Python class. The TrainJob takes the Python class, packages it into a Docker image, and executes it on the remote cluster, so you see a similar experience (roughly as sketched below). Here we are also giving a requirements.txt; this one contains pandas, scikit-learn, all your requirements. It's going to install all of those requirements for you in the Docker image, which takes some time. Once installed, it pushes that container to the remote Docker registry. So let that go.

In terms of frameworks, right now we support XGBoost and LightGBM, and we have single-node support for TensorFlow and PyTorch. Just a show of hands: how many people here have used XGBoost or LightGBM before? You work with more tabular data, right? One more thing I want to get here is feedback, because we are doing this in open source and we want to make sure it's useful to a wide range of audiences, not biased toward either internal Google use cases or our customers. One of the things I'll show is distributed support for LightGBM. For the people who raised their hands: how many of you have a requirement for distributed training for XGBoost or LightGBM, or is single-node training fully sufficient? You have a requirement? Can you give a ballpark: how many cores do you typically use, and how long does your training run? [Audience: on the order of dozens of workers, and the training typically runs for 72 hours or so.] That should be simple enough; we'll give it a try.

So here, it's pushing the Docker image. Since I cleared out my cache it's rebuilding; when you do this a second time it should be very fast. Yeah, it did run, and it's very similar to the local run; it's pretty fast, so there isn't much time to wait for it.
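Going back to the class-based TrainJob, the remote run from the demo might look roughly like this; the registry configuration and argument names are assumptions based on what the speaker describes.

```python
%%writefile requirements.txt
pandas
scikit-learn
xgboost
joblib
```

```python
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowGKEBackend

# Optional: configure where built images go. The speaker notes these are
# defaults that Fairing can usually infer for you.
DOCKER_REGISTRY = "gcr.io/my-project/fairing"  # hypothetical registry

job = TrainJob(
    HousingServe,                      # the train/predict class from above
    input_files=["requirements.txt"],  # pip dependencies baked into the image
    docker_registry=DOCKER_REGISTRY,
    backend=KubeflowGKEBackend(),
)
job.submit()
```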
The other thing I wanted to show is how to do prediction. Going back to the class we had, there's this predict method. You can call it locally, but if you want to put it up as an HTTP service, it takes a lot of effort to package and ship it, so we wanted to simplify that process. Here, instead of a TrainJob, you say you need a prediction endpoint, and you give it similar arguments: your Python input class and your requirements. One more thing you give here is the trained model from your previous training run. And that's it. If I execute this, the first thing it does is package your Python class in an HTTP wrapper; we use a library called Seldon, which is another popular open-source framework for online model serving. A rough sketch is below.
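A sketch of the prediction-endpoint flow, assuming a `PredictionEndpoint` class in Fairing with create/predict/delete style methods; the method names (especially the one that posts a numpy array to the endpoint) are approximations of what the demo shows.

```python
import numpy as np
from kubeflow.fairing import PredictionEndpoint
from kubeflow.fairing.backends import KubeflowGKEBackend

# Wrap the trained class in a Seldon-based HTTP service on the cluster.
# The model file produced by the training run is shipped alongside the code
# so predict() can load it inside the serving container.
endpoint = PredictionEndpoint(
    HousingServe,
    input_files=["trained_model.dat", "requirements.txt"],
    backend=KubeflowGKEBackend(),
)
endpoint.create()  # builds/pushes the image and waits for the service address

# Once the endpoint is up, send a numpy array against it and get predictions.
sample = np.zeros((1, 37))  # shape is dataset specific
print(endpoint.predict_nparray(sample))

endpoint.delete()  # tear the service down when you are done
```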

Since I built the Docker image previously, this came up pretty fast, so now it's waiting for the endpoint to be created; this will take probably 10 to 20 seconds. Then, once the endpoint comes back, you can take a numpy array and run prediction against that endpoint. So let that happen. While we're waiting, on the topic of distributed support: we do have support for distributed training for LightGBM, and I'll get to that in a moment.

So, yeah, we got an endpoint back, which is basically an IP address, a port, and a path. If I go ahead and do the endpoint prediction, it calls that HTTP endpoint with the numpy array and gives back the results. We'll make the API easier to use; giving back a pandas DataFrame is probably a much better experience. And you can go ahead and delete that endpoint if you want.

[Audience] So what you developed here essentially acts as a wrapper on top of XGBoost or LightGBM, right? Why is it not more generic, to any other model like CatBoost or random forests? Why are you limited to XGBoost or LightGBM for single-node training?

That's a great question, good observation. To rephrase: whatever I've shown here seems generic, executing a Python file, executing a full notebook, executing a Python function, so why does it have to be constrained to XGBoost? Actually, it's not. The demo I'm showing happens to use XGBoost, but you can put in whatever you want; it doesn't even have to be ML related. The one thing that is ML related and framework specific is the distributed training part, because each framework requires a specific setup for distributed training. For example, distributed training for TensorFlow needs specific environment variables to be set up; for XGBoost there are a couple of ways to do it, some of which need MPI-style systems to be set up. So we are adding framework-specific support for distributed training, but for single-node training you should be able to use any

framework you want.

The other thing: what I showed with XGBoost uses the Python API, so you are using the Python XGBoost API. For LightGBM, we went ahead and did a native integration with the CLI version of it, so I'll show that. A lot of the time, especially for LightGBM, many features are exposed through the CLI, which is very useful, so a lot of users want to reuse the CLI version. Here we did a very native integration. This is the config you typically use to train a LightGBM model, and it's similar for an XGBoost model: unlike TensorFlow models, the model architecture is pretty fixed, so you only control the hyperparameters. Here I'm giving all my hyperparameters and configuration for the model, and the input data and output data are all specified as GCS paths; GCS is Google Cloud Storage, which takes care of all the storage for you. Then users just go ahead and call the LightGBM execute (roughly as sketched below). Actually, I don't need to run this one, because I already built that container.

So this is a much more native integration with LightGBM: you don't have to build any specific Python class or wrapper, you just give this configuration file and we execute LightGBM with it. The one interesting part is that to go from single-node to distributed in LightGBM, you just change one parameter. I'll show you the distributed version: LightGBM itself has a parameter called num_machines, so you change that to, say, 3, or whatever you want, and when you execute with that parameter we allocate that many workers and do the distributed training. This is where framework-level support is very useful and very fast. Why we chose LightGBM specifically: in benchmarks we scaled up to terabytes of data, and it gave pretty fast results with close to linear scaling. So that's the LightGBM one.
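A sketch of the LightGBM-native path described here. The config keys are real LightGBM parameters, but the `fairing.frameworks.lightgbm` module, the `execute()` signature, and the GCS paths are assumptions based on the demo, not a confirmed API.

```python
from kubeflow.fairing.frameworks import lightgbm

# Standard LightGBM CLI-style configuration; data and outputs live on GCS.
params = {
    "task": "train",
    "objective": "regression",
    "boosting_type": "gbdt",
    "metric": "rmse",
    "num_leaves": 31,
    "learning_rate": 0.05,
    "num_machines": 1,  # change to e.g. 3 for distributed training
    "data": "gs://my-bucket/lightgbm/train.csv",        # hypothetical paths
    "valid_data": "gs://my-bucket/lightgbm/valid.csv",
    "output_model": "gs://my-bucket/lightgbm/model.txt",
}

# Fairing packages the config, runs the LightGBM CLI on the cluster, and
# (if num_machines > 1) allocates that many workers for distributed training.
# The keyword arguments below are approximations of the demo's options.
lightgbm.execute(
    config=params,
    docker_registry="gcr.io/my-project/fairing",  # hypothetical registry
    cores_per_worker=2,
    memory_per_worker=4,
    stream_log=True,
)
```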

Let me try to see... oops, okay. Again, here you can see how easy the resourcing is: you can say how many cores you want and how much memory you want. If you set stream_log to true, it streams the logs; if you set stream_log to false, you can pretty easily put a for loop on top of this and do very basic hyperparameter tuning, so if you have ten hyperparameter settings you can go through them quite simply. All the model outputs are saved to a specific GCS path, so you can go back and retrieve those files later.

Let me see if I can execute this part. All of this is happening with autoscaling enabled, so if we go back, the cluster scales up automatically whenever you execute a job and scales back down when you don't. But there seems to be some problem where it says my pods are not scheduled; I'm probably hitting some limits or something. So let me show you another API.

Similarly, we have separate support for distributed TensorFlow, if you want to do TensorFlow distributed training. Here's an example of a simple MNIST TensorFlow model based on the TF 2.0 API, using Keras and the new distribution strategy. Here I'm using the lower-level APIs I talked about, just to showcase the flexibility: I'm constructing the preprocessor myself. I'd recommend this lower-level API for people who are very familiar with Kubernetes and want to control each step of the process. You do the preprocessing first, then you choose which builder you want to use, and then we have a TFJob API where you can say: I want a worker count of 3, and I'm not using the parameter-server style of training, so I'm setting the chief count to zero. If you do that... this is still running... it's completed. So basically you can do distributed training with this API; a rough sketch of the low-level calls is below. We are adding more framework-level support, for example distributed PyTorch and also distributed XGBoost.

[Audience question] You can use it; it's a little bit tricky, but you can do it with those low-level APIs, there's nothing preventing you. Yes, the autoscaling and everything still applies.
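The lower-level flow the speaker describes might look roughly like this, with approximate `fairing.config` calls (the preprocessor/builder/deployer names and arguments are assumptions; the worker_count/chief_count deployer options mirror what the demo sets):

```python
from kubeflow import fairing

DOCKER_REGISTRY = "gcr.io/my-project/fairing"  # hypothetical registry

# Step 1: preprocessor - decide how the code (file/function/notebook) is packaged.
fairing.config.set_preprocessor(
    "python", input_files=["mnist_keras.py", "requirements.txt"])

# Step 2: builder - build the image in-cluster so no local Docker is needed.
fairing.config.set_builder(
    "cluster", registry=DOCKER_REGISTRY,
    base_image="tensorflow/tensorflow:2.0.0-py3")

# Step 3: deployer - run it as a TFJob with 3 workers and no chief
# (no parameter-server style training here).
fairing.config.set_deployer("tfjob", worker_count=3, chief_count=0)

fairing.config.run()
```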

[Audience] My question is about hyperparameter tuning. If you wanted to integrate something like Ray Tune or Katib, would you do that at the level of calling Fairing functions directly, or would there be a TrainJob-level API?

To rephrase the question: how do we go about hyperparameter tuning? You mentioned Katib; I don't know much about Ray Tune, I'm not familiar with that project. Katib is part of the Kubeflow project, and we are working on a better integration with it. For example, if you take the current API and write a for loop around it, it works, but it launches all the jobs at once; it doesn't do queue-based scheduling, and if you want early termination or Bayesian optimization, you can't do it. So we are in the process of integrating with Katib, since Katib is part of Kubeflow. The API will probably look something like this: we'll keep the TrainJob API, but you'll pass in a map of the parameters you want to sweep. From your perspective you're launching one job, but it will be translated into multiple jobs, and the parallelism, how many trials you want to execute, will be controlled by Katib.

[Audience] Thank you. I have a question: Kubeflow and ML Engine, are they interchangeable, are they the same, how do they come together?

Rephrasing the question: how do the two platforms compare, Kubeflow and ML Engine, which is our managed platform (we recently rebranded it as AI Platform)? I would say they are two different offerings, depending on your requirements. Kubeflow is fully open source, Apache 2 licensed; you can run it on-prem, you can run it on AWS, but you have to manage your own Kubernetes cluster. GCP provides managed Kubernetes, but there is still a lot you have to manage on top of that, like creating node pools, creating your ingress and services, things like that. So it's about whether you want to take care of all that management yourself, or you want a much cleaner API where you say: I just want to run this job, I don't care about the cluster or its lifecycle, just run this training job and give me back the output. AI Platform is much more tuned for that kind of managed offering. From a pricing point of view it's actually the same: even in ML Engine you just pay for what you run; it's the GCP pricing. What we've seen is that if you need a lot of customizability, if you really want access to your cluster and your nodes, Kubeflow is probably the better solution, but if you don't have a DevOps team that can support you, ML Engine is a much better solution.

Any more questions? Let me open this. Since we talked about Kubeflow, let me check the other one. Yeah, so this one ran: remember the job we ran asking for 90 CPUs and 600 GB of memory? That ran and printed that it has 96 CPUs and 614 GB. The reason it says 96 when we asked for 90 is that the node it's running on has 96 CPUs, so it prints that 96; Kubernetes schedules onto a node that has at least 90 CPUs, and this one happens to have 96.

If you want to do the same thing with a GPU, going back to the simple example, it looks like this, and again this is a little bit Kubernetes-specific
syntax; we'll probably have a better API soon. Here I'm just going to print out the GPU architecture. Again, if I ask for a GPU, it goes and spins up the node and runs the process, but since I already ran this process before, the node is still alive, so it prints the V100 GPU that you have. A rough sketch of the GPU request is below.

This is another example of how you can optimize your cost structure. Typically, if you want to train a model with a GPU, you go and create a VM with a GPU on it and you execute your model, but a lot of time is spent analyzing your model or looking at your data, and all that time the VM keeps running; that's a lot of money to pay. Typically people shut down the VM and move the data off it to their local box to analyze, but again, that's a lot of back-and-forth data transfer and a lot of hassle.
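One way the GPU request could look, assuming Fairing lets you pass a custom pod-spec mutator (the three-argument mutator signature and the nvidia.com/gpu resource name are how Kubernetes GPU limits normally work, but treat the exact hook as an assumption):

```python
import subprocess
from kubernetes import client as k8s_client
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowGKEBackend

def add_gpu(kube_manager, pod_spec, namespace):
    # Request one NVIDIA GPU on the training container; the autoscaling GPU
    # node pool then brings up an instance to satisfy it.
    container = pod_spec.containers[0]
    if container.resources is None:
        container.resources = k8s_client.V1ResourceRequirements()
    limits = container.resources.limits or {}
    limits["nvidia.com/gpu"] = 1
    container.resources.limits = limits

def show_gpu():
    # Print the GPUs visible inside the container (e.g. a V100).
    print(subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True).stdout)

job = TrainJob(show_gpu,
               backend=KubeflowGKEBackend(),
               pod_spec_mutators=[add_gpu])
job.submit()
```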

With this SDK, what you can do instead is create a very small instance with just a few CPUs, do all your analysis, model analysis, and visualizations there, and when it comes to actually training on a GPU, schedule it on the remote cluster, which autoscales to your workload. Essentially you're using the GPU nodes only when they're needed and shutting them down when you're not, and since it brings the results and the data back to you, you don't have to manually transfer any datasets. This is a similar example where I run the MNIST model on GPUs.

If you want to get started, I'd suggest going to the Kubeflow website. We have very good documentation on how to get started, with docs for GCP, AWS, on-prem, and Microsoft Azure as well. If you go to GCP, we have a nice UI where you can spin up a cluster; go to "deploy using UI". We have an app to deploy Kubeflow: there you specify your project name and a couple of other parameters, and there's documentation on how to fill in the form. When you click "create deployment", it creates a Kubernetes cluster for you and installs Kubeflow on top of it, and you get a URL at the end of the installation. Let me go to my already installed cluster.

This is the Kubeflow dashboard you get. It has a couple of components I'll go through. One is the notebooks component: you can host a notebook on the cluster, saying "I need a Jupyter notebook with this many CPUs", you can request extra resources like NVIDIA GPUs, and it gives you a link that you can use as a hosted notebook. There's also Katib, which we talked about, a tool for hyperparameter tuning. There you create a study; a study is basically multiple experiments that you run to find the best one. It's taking some time to load the dashboard... Here you can go in and fill in this form; this is the current API that we have. Again, this is not a very efficient way of running a lot of experiments; you don't want to create them manually. The other way is writing the YAML file, which again has a bit of a learning curve and is not a great interface for data scientists. That's why we want to integrate this API into the Fairing API, where you just give a Python map and we go and search through all those hyperparameters.

Another big product we have is Kubeflow Pipelines. Once you've done all your training and experimentation and you want to put it into a production pipeline, you can use the Kubeflow Pipelines product. I'm not going to go very deep into it, but generally speaking, your machine learning pipeline has multiple steps; here we have training, prediction, model analysis, things like that, and you can create production-grade pipelines with this product. Kubeflow is a big ecosystem with a lot of tools, and Fairing is one of them.

[Audience question about TFX] Yeah, there's a very strong connection between TFX and Kubeflow Pipelines. Think of TFX as pipeline components for TensorFlow-specific things, while Kubeflow Pipelines is for all frameworks, but we have very good support for executing a TFX pipeline inside Kubeflow Pipelines.

[Audience question]
Actually, yeah, all the notebook examples I've shown are running on my local laptop. When I kicked a job off, it went to the cluster. And you don't have to change much: if I go back to my previous example, say you're executing this Python train function. If you want to train locally, you just call it; if you want to run it remotely, it's those two lines of code. Not much change. You can do it that way. It also supports building the Docker image in the cluster itself, so you don't have to have Docker installed on your local machine.

The Docker build happens with the requirements that you specify: here you're saying "these are my Python requirements", so it takes a vanilla Python Docker image and installs those dependencies. Whether you build the Docker image locally or in the cluster, it's the same thing.

[Audience question about the number of nodes] In the TrainJob API we don't have a number-of-nodes parameter yet, so that one is just single-node training; it's a small change and we will add it. The distributed training I showed uses the LightGBM API and the internal lower-level API, but we will expose that externally; it's probably coming next week. Then you'll be able to specify, in addition to the number of CPUs and GPUs, the number of nodes you want.

[Audience question about elasticity] Yes, the elasticity comes from Kubernetes, not from Fairing. In Kubernetes, when you create a cluster... here, you can see that the one very big machine I had got terminated; right now the pool is at zero nodes. I turned on autoscaling and set the minimum size to zero, so it does that automatically; that's a feature of Kubernetes itself.

[Audience] If you wanted to add model versioning for the different models you run training jobs for, where would you add that? Would that be at the level of a Docker container, or is there some higher-level feature available? That's a very good question. Right now I'm solving it in a very hacky way: if we go back to my notebook, I'm just adding a timestamp to my model output path. But there's one more project, Kubeflow Metadata; let me show you that. We've been hearing that this is a big issue: when people train multiple models, they want to track and analyze hyperparameters and compare models. Kubeflow Metadata is another component of Kubeflow that captures metadata about your training job, like how many layers you used, where your model is stored, how much accuracy you got, and lets you bring that back to your notebook as a pandas DataFrame and analyze it. This is at a much earlier stage, but keep watching it; you'll hear more soon.

[Audience question about monitoring and profiling] Right now you have to do it manually. There is some profiling information you can get: for example, if you want to look at model performance, the best tool is TensorBoard, so you write the TensorBoard logs. We are working on easier ways to launch it; one idea is that when you launch a training job, you get a URL for TensorBoard and can just go there and see it. If, right now, you want to monitor system metrics: if we go to the job name (it usually prints out a job name), it doesn't have the data right now because the job has finished, but usually you'll have CPU, disk, and other metrics there to monitor. And it's a feature of Kubernetes itself that if something goes wrong with the pod, it restarts it. As for model monitoring: effectively, you have to look at TensorBoard and terminate the job yourself; there's a setting in TensorFlow itself to do some early termination. It's not very advanced, but you can do that.

That's a pretty common ask, so I'll show you one more thing that touches
on this particular aspect. You can actually run a TensorBoard instance and point it at a GCS

bucket, and get all your training logs into that GCS bucket. For example, I'll show you a TensorBoard that we are running. That way you should be able to see multiple runs. There's some issue here, but the feature in TensorBoard is that you take a GCS bucket, log every run into that bucket, and then in TensorBoard you can see all your experiments and compare them.

[Audience question about distributed training and logs] When you do distributed training, all the nodes are training just one model, so typically you only write TensorBoard logs from one node. I don't think more is required: the model performance information is the same across the nodes, so you just take one node and write out the performance metrics from it, and that's sufficient.

Any more questions? I'll be here, so feel free to ask any questions. Thanks. Thanks for attending.

2020-02-01
