I would like to introduce Robbie. Robbie is a tech lead on the Google Cloud Platform and one of the founding members of the team. He built the first TensorFlow back end of the Cloud Machine Learning Engine service, and currently he's focused on ML prediction and related services. He likes to work in other areas of machine learning and Cloud Platform. Before his time at Google he studied statistical natural language processing, culminating in a dissertation in the area of machine learning known as active learning. Outside of work, Robbie likes to spend time with his kids; he has five lovely kids and he's very passionate about that. I've known Robbie for the last two and a half years, and you couldn't be in better hands than Robbie's. So, Robbie, please take it away.
All right, thanks for that introduction. My name is Robbie Haertel, like he said, and I'm really excited to be here. Just by the way, I know you're going to fill something out; we keep saying, hey, we want feedback, blah blah blah. A couple of questions along that line, just informally. How many here would consider their primary role with their company to be data scientist? That's a good number, but it might not even be half, or just over half. What about data engineer? Some of the same hands went up; if you'd told me before, I would have chosen the other one, right? You've got some data engineers. What about just software engineer? That's a lot of people. So those of you that are software engineers, are you here because machine learning is maybe a hobby, or something you're looking to get into so you can go down that path in your career? Is that why you're here? Anyone here for other reasons? Just because of pizza? Someone here for the pizza? That's why I came. They said, you can have free pizza, but you've got to give a talk. I said, worth it. So here we are. You're
doing a startup, and so you're here to figure out what's on offer for that, right? Yes. Very cool.
Well, speaking of that, let's set expectations. As was mentioned, I'm one of the leads on the Cloud AI Platform team, so our job is to provide a platform for you guys to do machine learning. If you came here thinking that I'm going to show you how to train really fancy models to get the highest accuracy, and here are the tricks you do in sklearn and here are the hyperparameters you tweak in XGBoost, you're probably not at the right talk. I hope you'll stay anyway. What I'm here to show you (it seems like all my emails are coming up and stuff) is what the Cloud AI Platform can do for you folks in your roles: as a hobbyist trying to learn, as a data scientist trying to get your job done, or as a startup. I think cloud has a lot of benefits for startups, actually. How many here would consider yourselves to be working for a startup? I'll raise my hand. You're like, what? Google's not a startup. But, you know, Cloud AI Platform is a relatively new area. I know that such things have existed at Google: I actually did an internship at Google on what was known as the Prediction API, and we recently turned that down. That was one of the first public cloud machine learning offerings, and I was actually sad to see it turned down, not just because I worked on it. I think it was ahead of its time quite a bit, and it's still probably ahead of its time; we probably have a few more years to go until we can really realize the promise of "here's the CSV, here's your trained model, go for it," right? But anyway, thank you for entertaining that, and thank you for setting this up; I'd like to think that I was capable of doing that myself. As I begin here, special
thanks to the people that helped me with the slides and the demos that are here. I couldn't have put it together without the following people: Noah Negri, Young-hee Kwan, Bhupesh Chandra, Kathy Tang, who's in the audience right now, and the rest of the Cloud Platform team. I really appreciate all the help they gave me with the demos and stuff. And then, of course, special thanks to those that organized this meeting, and special thanks to all of you that are attending.
Back to the topic of providing feedback. I think it was mentioned that Cloud AI Platform started about three years ago, so we do have some experience working with users, getting their needs and requirements and trying to understand those. That said, we believe that we're at the very beginning stages and that there's so much more that we can and should do, and in
fact we believe that we're in a hypothesis-testing phase, right? We've made certain assumptions, and as we go through this talk with you, if you notice some assumptions that we made that are wrong, I'd like you to come up to one of us afterwards and say, hey, what about this, did you think of it this way? Or, hey, you know what, our requirements are actually a little bit different; have you thought about providing this? We'd really like to understand those things, because if we build a platform that you can't use, I've just wasted the last three years of my life, and I'd like to think that I was trying to do something useful. And even if I failed the last three years, I'd like to continue on this path of building something that you guys can use.
So the topic of the talk, as has been mentioned (well, I kind of changed the topic slightly), is production scikit-learn and XGBoost, specifically in the cloud. We'll talk just briefly about off-the-cloud use cases. About half of you said you're data scientists; some of you are hobbyists or did ML in grad school, so you're very familiar with this. Is my mic still on? I feel like it went off. We're good? OK. You guys could probably preach to me about this. This is my view, and I won't spend a lot of time on it because I believe you're all familiar with it. The model development cycle is more or less like this. You have a problem to solve. You collect data that will help you solve that problem. You almost always have to clean the data. Then you decide what model you want to use: if it's image classification or some sort of perception problem (audio, maybe some text), you might use a neural network; if it's click-through rate prediction, you might use logistic regression; if it's structured learning, maybe you'll use gradient-boosted decision trees, something to that effect. Those of you that have been in the field long enough kind
of already know that this model works well with this type of problem. Then you need to take this raw data that you've collected, cleaned and analyzed, and do some feature engineering. Especially if it's logistic regression or some sort of linear model, you're possibly going to need quantization, one-hot encoding, feature crosses, logs of certain types of features, all those types of things, right? Once you've done that, you can take your data set and train a model. Most models have hyperparameters, and your goal is not to get a model; it's to get the best possible model given the model selection and the feature set. So you'll train it, you'll do some hyperparameter tuning, and at the end of the day you have a model. You'll take that model and analyze the results: hey, what's this model doing well? Where does it succeed, where does it fail, and what can I do to get a better model out of it? Depending on what errors you see, you may decide: ah, I didn't have enough data, or my data wasn't clean enough, or I need to add more features or different features, or maybe I need a different model class altogether. Anyway, you go through this feature engineering cycle, or model development cycle, and then what do you end up with? The model, right? You have a model lying on your disk. Now what?
Training a model is simply a means to an end; it's not an end in itself. We train models because we want to do something interesting. The problem is... let me see how many of you can relate to this tweet: "The story of enterprise machine learning: it took me three weeks to develop the model." That's not too bad; that's probably about right. She's probably skipping how long it took to get the data ready, but we'll spot her that one. "It's been over 11 months, and it's still not deployed." How
many of you guys can relate to this? Maybe instead of 11 months it's six months, maybe it's 20 months, but
at any rate, ignoring the data part, because we all know that that's hard, very hard oftentimes, getting the model into production often takes much longer than training the model. And how many of you think that the fun part is getting a model into production? My guess is that the reason you're here is because the fun part is making the model, all right?
So what if you could take (this is a screenshot, and it's very hard to see, of a Jupyter notebook) the code that you ran to train the model, click a button, and be serving your model in production? Well, I'm here to tell you, obviously, that you can. And yes, I'm running a Jupyter notebook on a Pixelbook, so that's a plug for those that want to run Chromebooks. The running example that I'm going to use is training a click-through rate prediction model. I'm going to use XGBoost as the underlying learner, and I'm going to use some sklearn pipelines. Again, if you came to see some secret sauce on how to do this, you're going to be sorely, sorely disappointed, because you guys are going to come back and say, why don't you do this, why don't you do that? And in fact I hope you will, because these examples are going to be on GitHub. Go ahead and augment them to your heart's desire, and we'll accept reasonable submissions in that sense. The training data that we're using is the Criteo sample. I think it's 11 gigabytes, and it's seven days' worth of click-through data that's publicly available; it was used in a Kaggle challenge. So if you look at the code here, we're just doing very standard stuff. I tried to keep the example overly simple for the sake of illustration. We parse a few arguments.
Just because they'll show up again (I don't want to spend too much time here): the base directory is just so you know where the logs are and where to export the model and that type of thing, and then the event date: what I'm going to do is give it a date, look back seven days, use the past seven days to train a new model, and go from there. OK. So this here is like cheating, I guess: max samples of 7,000. Recall that I'm on a Pixelbook. I'm not going to be able to train on seven days' worth of data, and I'm going to emphasize that point on purpose a little bit later. This beautiful little block of code is just reading in the data, and you're like, why aren't you using pandas? I won't get into those details, but there is a reason: it's to get the right data format so we can use a DictVectorizer. Here's the XGBoost regressor; I'm training an XGBoost model, obviously for binary... it's kind of like logistic regression, I suppose. Then we create an sklearn pipeline where we have a dictionary vectorizer, which will take the categorical features and do the one-hot encoding for us. I call pipeline.fit, we cross our fingers that it works, and we won't necessarily wait for it to finish, but it won't take very long because I used so little data and all that stuff. It says it's done. So with essentially one line of code you save the model out; I'm sure you guys are all familiar with how to pickle a model.
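To make the vectorize-then-pickle step concrete, here is a minimal pure-Python sketch of what a dictionary vectorizer does to rows of categorical and numeric features, followed by a pickle round trip. It is a toy stand-in, not scikit-learn's `DictVectorizer` and not the notebook code from the talk; the feature names (`site`, `device`, `hour`) and the helper names `fit_columns` and `transform` are made up for illustration.

```python
import pickle


def column_key(name, value):
    # String values are treated as categorical ("site=a");
    # numeric values keep a single column named after the feature.
    return f"{name}={value}" if isinstance(value, str) else name


def fit_columns(rows):
    """Assign one output column to every (feature, category) pair seen in the rows."""
    keys = {column_key(n, v) for row in rows for n, v in row.items()}
    return {key: index for index, key in enumerate(sorted(keys))}


def transform(columns, rows):
    """One-hot encode the rows; categories not seen during fitting are ignored."""
    encoded = []
    for row in rows:
        vector = [0.0] * len(columns)
        for name, value in row.items():
            index = columns.get(column_key(name, value))
            if index is not None:
                vector[index] = 1.0 if isinstance(value, str) else float(value)
        encoded.append(vector)
    return encoded


# Hypothetical click-log rows: one dict per example.
rows = [
    {"site": "a", "device": "phone", "hour": 9},
    {"site": "b", "device": "phone", "hour": 21},
]
columns = fit_columns(rows)
encoded = transform(columns, rows)

# Saving and restoring the fitted state, in the spirit of "pickle a model".
restored = pickle.loads(pickle.dumps({"columns": columns}))
```

In the talk's version the whole fitted pipeline (vectorizer plus XGBoost model) is pickled as one object, which is why the same dict-shaped rows can later be sent to the deployed model unchanged.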
This command here, gsutil, is just a file copy command that you use to go to the cloud, because if you try to use cp it obviously won't know how to get it to the cloud. So we saved the model and we copied it to the cloud. Now we need to actually deploy the model to use as a service on the cloud. What these two commands here are going to do is create a REST API for you that you can then send prediction requests to. It does all the load balancing, it does all the autoscaling, it has the web server, authentication, authorization, prediction, et cetera. We'll talk more about that in a few minutes, although I might condense that part of the slides. So this first line here: we have a notion in the current service called a model, which is essentially a collection of versions. We'll see, in the context of training new models every day, how it allows you to keep various versions and switch between them, like: I'm serving yesterday's model and I now need to serve today's model. But for the time being, think of it as a container. We create this model; if I do it again I'll get an error, because they have to be unique, so I'm not going to run it now. This mess that is curl will not be there in about a week or two. We just got the command-line tool updated to accept framework as an argument, and it wasn't there yet. It's very similar to the models create command, but it's versions create, so we're saying: create a version of this model. Since I bumped up the version to v4, when I run this it's actually going to the cloud, and you can imagine, because it's doing all the things I talked about (load balancing, web servers, blah blah blah), it does take a little while. But ours is actually very fast; I think the median time is about 90 seconds.
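As a rough illustration of the model-as-container idea, the sketch below builds the parent resource name and JSON body for a "create version" request. The field names (`name`, `deploymentUri`, `runtimeVersion`, `framework`) follow the service's v1 REST API as I understand it and should be checked against the reference; the project, model, bucket and framework values are hypothetical, since the talk does not show them.

```python
import json


def version_create_request(project, model, version, deployment_uri,
                           framework, runtime_version):
    """Return (parent, body) for creating one version under an existing model."""
    # A model is a named container; each version points at exported model files.
    parent = f"projects/{project}/models/{model}"
    body = {
        "name": version,
        "deploymentUri": deployment_uri,   # Cloud Storage folder holding the pickle
        "runtimeVersion": runtime_version,
        "framework": framework,            # assumed value; the talk does not say which
    }
    return parent, body


# Hypothetical values for illustration only.
parent, body = version_create_request(
    project="my-project",
    model="ctr",
    version="v4",
    deployment_uri="gs://my-bucket/ctr/model/",
    framework="SCIKIT_LEARN",
    runtime_version="1.8",
)
payload = json.dumps(body, indent=2)
```

Because versions live under one model name, a daily retraining job can add a new version and switch serving to it without clients changing the model they address.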
Your very first model may take a little bit more time and be more on the order of five minutes, but it takes about a minute and a half. So while we're doing that, let's switch back to the slide deck.
Now, I know we have to do a context switch; those of you that are software engineers will understand that reference. But the question is: why is production such a long pole? Remember, that was one of the pain points that we're addressing in this talk: it's hard to get models to production. I wanted to open that question up to you guys. What are some of the reasons that it takes so long for models to get into production, in your experience? Getting the data. Well, what about once you have the model, though? So you got the data, you trained the model, and all you really want is to show impact, right, to get better performance review scores or something like that. Validation. Well, that's a good point, actually, and I don't talk too much about it; I hint at that on one of the slides. You're not going to push something to production that's junky, right, and bring down your system. Any other reasons? Performance evaluation, absolutely. Yeah, you've got to be able to do a safe rollback. Very good. Back here? Yeah. How many of you consider your favorite thing to do to be managing infrastructure? That's why I have a job here, right, doing exactly this. I may be the only one in the room. Here are a few other things. You guys said some really great things; I didn't even think very hard about this, and these are the ones that came off the tip of my tongue, or fingers. You need to get approvals. You need to do capacity planning. Some people (I actually had a conversation, I can't remember now with whom) have to convert their model from one format to another format and then do validation on it, that type of thing. There's
some technical challenges. Somebody mentioned that just creating a serving system is a very complex stack. You have your load balancing, you have your web server, you have authentication, you have authorization, oftentimes you have a cache, you have logging agents, you have monitoring agents, and they often work together. One thing is to get all those up; another
thing is to keep all of them up, right? Because if any one of those pieces goes down, it's your pager that goes off if you're the production engineer, and you probably don't want that to happen. And certainly for those of you that are data scientists, even if you're capable of doing this, that's probably the last thing on earth you'd want to do; otherwise you would be a production engineer and not a data scientist.
So here's what Cloud ML Engine prediction offers. It's click-to-deploy. It has horizontal autoscaling, so as you send more traffic it scales up, and as you send less traffic it scales down. And it's serverless, so it scales all the way down to zero, which is really fantastic when, say, you're a startup and your traffic patterns are very sparse in the beginning phases: you get some traffic and you're happy, then it scales down really low and you're not so happy. Or with very spiky traffic, we scale up very quickly; our servers are able to scale up at a world-class rate, really. We give you encryption, authentication, authorization, logging, monitoring. You can organize your models with labels; you can do versioning, rollouts and things of that nature.
So let us go back here. Because it's a Jupyter notebook, you don't get the nice spinny indicator; I don't have a nice extension to do that. But we can go ahead and just start sending traffic to it. You can see that this piece of code really boils down to these four lines, which is essentially one line that I broke up because a really long line is hard to read. Or two lines, really: you create a connection to the service, and you call execute and pass it the data. It's not clear from the way I've organized the Jupyter notebook, but this example is literally the same data structure we used for the local prediction. That's going to get translated, obviously, to JSON, because we've set up a REST API.
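A hedged sketch of what that request looks like on the wire: the same list of dicts used for local prediction is wrapped in an `instances` field and serialized to JSON, addressed to a model (optionally pinned to a version). The helper name `build_predict_request` and all resource names are illustrative assumptions; the comment about the client-library call reflects the Google API Python client as I understand it and is not the notebook's actual code.

```python
import json


def build_predict_request(project, model, instances, version=None):
    """Return (resource_name, json_body) for an online prediction request."""
    name = f"projects/{project}/models/{model}"
    if version is not None:
        # Without a version, the model's default version serves the request.
        name += f"/versions/{version}"
    return name, json.dumps({"instances": instances})


# The same dict-shaped rows that would be passed to pipeline.predict locally.
instances = [{"site": "a", "device": "phone", "hour": 9}]
name, body = build_predict_request("my-project", "ctr", instances, version="v4")

# With the Google API Python client this would be sent roughly as
#   service.projects().predict(name=name, body={"instances": instances}).execute()
# which needs credentials and network access, so it is not executed here.
```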
You can see, if I'm lucky... in the previous run they were both the same, so I'll keep my fingers crossed with the live demo. I get a warning about credentials, but otherwise we get the same value. What I'm comparing is the
local prediction, pipeline.predict (recall that I have that sklearn pipeline), and then calling cloud prediction and grabbing that out of the cloud. Very fast, low-latency response times. It happened to still be up, but if it had scaled down to zero, that request would have taken a little bit longer to bring up the servers. But it's very fast. OK.
So let's move on to the next section of the talk. That's prediction. Like I said, the premise was: you have a model. We combined sklearn features like pipelines with XGBoost as the underlying model; you can use any of the models that sklearn has to offer and any of the transforms that sklearn has to offer. It's essentially a click of a button, and we're working on a GUI, which I think is close to being implemented, where it's literally a click of the button: you point to the model and you're off serving in production at production scale. And you don't wear the pager; I do.
So let's talk a little bit about MLOps. This "asterisk-ops," if you will, is sort of a buzzword in the industry. You've got DevOps, you've got DataOps, you've got psyops, you've got I don't know what other ops, right? So of course there's MLOps, because why not. What do we mean by that, though? I like this diagram. If I hadn't stolen it from a paper from some folks that work here at Google, I probably would have chosen slightly different boxes, but I think it serves to illustrate the point very well about what MLOps is. If I can direct your attention to that teeny, teeny, teeny box on this end of the screen: that's the ML code. That's the code that you guys write as part of your job to get a model running. All
the other stuff is what it takes not just to serve but to train and use ML in a production scenario. Sure, if you're just fooling around, like for a research paper where you've got to get a more accurate model than the last guy that published a paper in this domain, you don't need all this stuff. You can do it in a Jupyter notebook: go use Colab or Kaggle Kernels or your favorite Jupyter notebook and get it done. If you're going to do something in production, there's a lot of work that goes into it. You guys are probably very familiar with the data collection phase, maybe configuration as well. Feature extraction, obviously: in a production system, that has to be running like a service, on a continual basis. We have your serving infrastructure, which we already went into in slightly more detail, about how complex that can be. You've got to have resources, monitoring, blah blah blah. There's a lot that goes into serving in production.
So when you define the term DevOps, it is merging software development processes with software operations, right? I actually think that definition does not hold up for MLOps. Why? Because I feel like we should be taking the ops out of ML, so that our job in serving production systems is more like the small black box and less like everything around it. Let's let somebody else have the job, let's let somebody else carry the pager, that type of thing. That's really our goal. So let me give you an overview of what I think the steps should be for MLOps with most of the ops taken out. The first step is to develop your model. I would hope that that would be 99 percent of your time. You spend the three weeks to get the best darn model you can out of your data, and then you're ready to use it in production. Of
course, yeah, this is the nice GUI that I said is not quite there yet but will be in a few weeks. You click a few buttons and you deploy it. I forgot to do this; maybe if we have a chance we can go back and look at it, but we have monitoring and logging for all those models that we just deployed: single click, and it's all there. So you're looking at your latencies and you say, oh wait, why is my latency so high? And then you say, oh, I get it: I didn't have enough cores, or I didn't have enough nodes to handle the spikes that we had. You make a few tweaks, you click a button, and you're off and going. Ideally you don't even have to do this, but the reality is we're not quite there yet, so you have to do a little bit of tweaking. But you can even do it yourself; that's how easy it is. And that's
how I think MLOps should be done.
So, why cloud? I probably should have asked this question... oh, let's just do it. How many of your companies currently use the cloud for any of their ML? A few, but not many, right? And we find that to be true. Look, I'm not here to sell you on cloud. It might sound like it, and I apologize if that's true; obviously that's what I do for a living. But let's just have an open conversation here: the cloud isn't necessarily for everybody. I had a few conversations offline with some people. But here's where it does shine, and if you fit these needs, you ought to consider it. We have world-class networking and infrastructure here at Google; we really pride ourselves on that. So you can count on lower latencies, faster startup times and things than you'd get by rolling your own solutions. Elasticity and scaling: those are buzzwords as well, but if you think about it, if you purchase a compute cluster to put in your own office or your own data center, you have a fixed cost. If you use more than that, you're going to get really bad latencies when you're serving, because you don't have enough capacity. If you use less than that, you're wasting money on servers that are sitting idle, or worse, using power. In the cloud, you're paying for what you use: you scale up, you pay for what you use; you scale down and you're not using any, you don't pay for anything. High availability: Google prides itself on this, and part of that is due to the next line, which is Google SREs. SRE means site reliability engineer; you may have heard of this. Google hires the very best engineers that know how to run and maintain production systems. They help design the system so that there are fewer outages to begin with, and when there are outages, it's their pager that goes off, and they know how to respond and quickly
mitigate any problems that are there. That's part of the reason why Google Cloud has high availability for its services: because of these folks here. Speed of deployment: you saw that in 90 seconds we went from a model that I trained on my laptop to one that was serving at production grade. And then, maybe I over-emphasize this, but I really believe that the total cost of ownership, and paying for what you use, is a real benefit of the cloud.
So what if you're not ready yet? There are many reasons why maybe these use cases don't fit your scenario. Maybe your company is really slow at adopting things like cloud or other technologies, or maybe you guys are just onboarding but you still have a transition time. I'd like to refer you to Kubeflow, which is built by our team, the Cloud Platform team. I like to call it the Anaconda of ML: it's a bunch of packages that make ML very easy to do on Kubernetes. So if you guys are already using Kubernetes or thinking about Kubernetes: you can run it on your laptop with Minikube; you can run it on premises, if you have a private cloud or on-premises servers; you can run it on the cloud. Amazon just released their Kubernetes; there's Microsoft Azure Kubernetes; but obviously I want you to use Google Kubernetes Engine. And there's the hybrid case, where maybe you do have a data center and it's at capacity, but when it goes over capacity you can spill over to the cloud. And of course there's Cloud ML Engine, which I'm focusing on more or less today. I'd like to think of Cloud ML Engine as managed Kubeflow, so if you don't want to manage the stuff yourself, you can pay
for the managed service.
All right, so I focused a lot on prediction, but the MLOps stuff is a lot more than just prediction, as we saw in the diagram. So now I want to talk a little bit about non-prediction cases. There's a lot I can say, and I can tell you, just as a teaser: expect a lot from Google in the coming months, a lot of really cool stuff. And that's all I can say. But when might you want to train in the cloud? These may seem obvious, but let's take a second to talk about them. In the cloud you can get really large machine types. How many of you guys have 96 cores sitting under your desk? A few? I want your machine. I can't remember what we cap out at, but I think it's somewhere around a terabyte of RAM; we have machines that have that much memory. And you know what I found? I actually had an intern do this last summer. I said, hey, go grab the Criteo data set, get a VM with as much RAM as you need, train on it, and see what happens. He needed a lot of RAM, hundreds of gigabytes I think, and it took three hours to train. He was done. He didn't have to go figure out how to do distributed training. He just had to pay for three hours of a server with that much RAM, at a fairly reasonable price, shut it down when he was done, and he had a really great model, because it was trained on all that data, right?
Number of cores: imagine you're doing a hyperparameter sweep, grid search or something like that, and you want to do it in parallel, but you don't want to set up a Spark cluster and you don't want to maintain any of that. Go get a machine with 96 cores. Well, it depends on how many cores each job is running, but you can do a lot of stuff in parallel with 96 cores. Of course, you don't want to go get a VM and do all your work on that, because you'd be paying for it the whole time and it's expensive. But this is what the cloud brings to you.
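The single-machine parallel sweep described above can be sketched with nothing but the standard library: enumerate the grid, hand each setting to a worker pool sized to the machine's core count, and keep the best result. The scoring function here is a made-up stand-in for "train and validate a model"; the parameter names and grid values are assumptions. A thread pool is used so the sketch runs anywhere; for CPU-bound pure-Python training, `ProcessPoolExecutor` would be the natural swap.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_validation_loss(max_depth, learning_rate):
    # Hypothetical stand-in for training a model and scoring it on held-out data.
    return (max_depth - 6) ** 2 + (learning_rate - 0.1) ** 2


def evaluate(params):
    # Return the loss first so results sort by quality.
    return toy_validation_loss(*params), params


# Every combination of the two hyperparameters: 4 x 3 = 12 settings.
grid = list(product([2, 4, 6, 8], [0.01, 0.1, 0.3]))

# One worker per core; on a 96-core VM the whole grid runs at once.
with ThreadPoolExecutor(max_workers=os.cpu_count() or 1) as pool:
    results = list(pool.map(evaluate, grid))

best_loss, best_params = min(results)
```

The design point from the talk is that this needs no cluster: rent one large machine for the duration of the sweep, then shut it down.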
Distributed training, which I mentioned, is non-trivial. If anybody's had to set up XGBoost to do parallel training on Spark: you can do it, and there's some sample code online to try to do it, but it's actually non-trivial. We don't yet support XGBoost distributed training, but for TensorFlow it's very simple to do distributed training. If you're using TF Estimator, you literally have no code to change: you submit it to the cloud and it scales up to the number of servers you tell it to scale up to. Hopefully one day we can bring something like that to XGBoost. We have a hyperparameter service called HyperTune, which is available for XGBoost and scikit-learn models. It gives you much better results than doing a grid search or random search, and it's a state-of-the-art algorithm, so you'll get best-in-class performance there. And it's all in parallel, so you don't have to wait. I talked to a researcher here at Google, and he says what he does is play around during the day on small data sets and get things how he likes them, then send off a job at night, come back in the morning, and check on how well his model did. He keeps his fingers crossed that there were no errors halfway through, right? Anybody relate to that? I know that's what I did during my PhD, too.
Repeatability. There are more examples, but I think the best example of this is continuous training. We're talking about training a click-through rate model, where you get new logs, tons of them, every single day, and every single day you're going to train another model. So you're going to use the same code, and you're going to use it every single day; actually, a lot of people train every hour, so they're training very frequently. You're not going to want to go to your Jupyter notebook and hit a button at 4:00 p.m.
every day to train another model, right? This is a repeatable process; it needs to be productionized, it needs to be robust. So the cloud is really great for this. And in fact, if you're taking advantage of those large machines and those large core counts, it sends off a job, uses all those things for a short period of time, shuts them down, and you only pay for the time that you're using them.
So let's come back here to see what that might look like in the cloud. There is a price to pay when you're switching from your local machine to the cloud: you do need to package up your code in a way that the service can understand. There are multiple ways to do that. We'll be supporting containers, which is becoming an industry-standard way to do it; for now we use Python packages. So you just run python setup.py sdist. The second command then uploads it to a bucket in the cloud. I think that's finished, and then I'll go ahead and hit enter. You can see right here, all I'm doing is saying: submit a training job called CTR, for click-through rate, with the timestamp so it's unique. I'm saying that the module is trainer.train. This package is what we just uploaded. The runtime version tells us which version of... now, I wish that was more obvious, that this was like scikit-learn whatever version, but this does translate into a scikit-learn version.
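To show how the pieces of that submission fit together, the sketch below assembles the command line as a Python list: a timestamped job name so each submission is unique, the module to run, the uploaded package, and the runtime version. The flag names are those of the `gcloud ml-engine jobs submit training` command as I recall it; the bucket path, region, project and trainer arguments are hypothetical, and the command is only constructed, not executed.

```python
from datetime import datetime, timezone


def training_job_command(package_uri, region, project,
                         runtime_version="1.8", now=None, trainer_args=()):
    """Build the argument list for submitting one training job."""
    now = now or datetime.now(timezone.utc)
    # Job names must be unique, so append a timestamp to the "ctr" prefix.
    job_name = "ctr_" + now.strftime("%Y%m%d_%H%M%S")
    command = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--module-name", "trainer.train",     # the packaged train.py entry point
        "--packages", package_uri,            # the sdist uploaded to Cloud Storage
        "--runtime-version", runtime_version,
        "--region", region,
        "--project", project,
    ]
    if trainer_args:
        # Everything after "--" is passed through to the training script itself.
        command += ["--", *trainer_args]
    return command


# Hypothetical values for illustration only.
command = training_job_command(
    package_uri="gs://my-bucket/ctr/trainer-0.1.tar.gz",
    region="us-central1",
    project="my-project",
    now=datetime(2018, 6, 1, 16, 0, 0),
    trainer_args=["--event_date", "2018-06-01"],
)
```

Passing the list to `subprocess.run(command, check=True)` would submit the job, assuming the Cloud SDK is installed and authenticated.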
version. This is Cloud ML Engine's runtime version 1.8. We're going to do it in US central, it's with this project, and there are a few command-line parameters. Since I forgot to say this: this train.py is that exact code we saw at the beginning.

The only thing is, you would probably take off that max sample lines limit that I had because my Pixelbook couldn't handle it. I won't admit how long it took me to figure out that's what was crashing my notebook, because it didn't tell me. You might also change the max depth or the number of estimators to be what you would have in production. This is your actual production code, so you make a few tweaks and you send it off to the cloud.

You can see that the job was queued. Again, we don't have a first-class integration, so we have to go over here to our list of jobs. Let's see that one. The spinny thing means it's still spinning up the machines. This is training; our prediction service actually spins up machines very, very fast. Training takes a little bit longer; I think for a single-machine job it's usually around one or two minutes.

So we can go to this job here and take a look at the logs. Nothing super interesting, just to show you that there's this integration here with the logs, if I'm lucky enough to get it to come up. You can see that all the output is captured in the logs, in case you need to go back and figure out something that's wrong, or something that's right. And you can see that the job completed successfully.

Okay, so that's how you send off a single job. But the use case I was talking about is really about productionizing this and doing it on a daily basis. How many of you are familiar with Airflow, by chance? A couple
of you are, probably the data engineers. I think I mentioned to somebody before we started that I'd hint at a bit of a pipeline here. There are lots of orchestration systems, but what Airflow does is let you chain together various processes that you can then run on a repeated basis, in this case on a daily basis. You can see that right here, and I'll get into what this actually means, but you can see that every day for the last two weeks, the code that we just submitted to the service to train our model has run on the previous seven days' worth of data, produced a model, and then pushed that model to serving.

You can see that in a graph view here. It's a very, very simple DAG. I did that for the sake of this demo: it just does training and create version. But some of you have already pointed out an obvious problem with that: if something went wrong during training and I pushed it to production, I'd be in a world of hurt. I could even get fired for that. So what you would normally do, and it's actually not that difficult, is put another node in the graph that runs evaluation, and another node that checks that the quality of that model is better than the model that's currently serving before you push it, and then go ahead and push it to serving. But for simplicity's sake I just trained and pushed.

And the code to do that is actually very reasonable. Here again you have the preliminaries: which binary do I use, etc. These DAG args are just necessary for Airflow to work; a little bit of overhead, copy and paste is how I like to think of it. Then here's where you say, hey, with DAG, in other words with the pipeline that I'm trying to define, you have this training op, which looks very close to the gcloud operation, but in Python code. It says, hey, use this binary
for training. Then you have the create version op, which again looks like the gcloud option for pushing a version. I don't know how many lines of code that is, but the core of this is what, 50? 20 lines of code, maybe? And we've defined this pipeline that we're able to run in the cloud on a daily basis and that pushes the models to production.

Let's see if I have this. You can see here's the model, CTR daily. Remember I said a model is like a container for versions, and here are all the versions of that model that got pushed on a daily basis. In a production system you'd probably be garbage collecting these so you didn't have so many lying around.

You can see that this is the default one, so if I send traffic to this model it's going to go to that version. You'd want to update that on a daily basis, at least after you'd validated it.

So, any questions? I forgot to say this, and it was kind of difficult with all the interruptions we had: my preference was going to be to have questions inline, but I'm almost at the end of my talk now. Any questions about these pipelines, or the training?
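The evaluate-then-check step described for the pipeline (push a new version only if it beats the model currently serving) can be sketched in plain Python. This is a minimal sketch, not Airflow's API and not the code shown in the talk: the function names, the assumption that a higher score is better, and the made-up scores are all hypothetical. The train, evaluate, and push steps are passed in as callables so the gating logic can be seen on its own.

```python
from typing import Callable, Optional


def should_promote(candidate_score: float,
                   serving_score: Optional[float],
                   min_gain: float = 0.0) -> bool:
    """Return True if the candidate should replace the serving model.

    Assumes a higher score is better (for example AUC). If nothing is
    serving yet, the candidate is promoted.
    """
    if serving_score is None:
        return True
    return candidate_score > serving_score + min_gain


def run_daily_pipeline(train: Callable[[], object],
                       evaluate: Callable[[object], float],
                       push: Callable[[object], None],
                       serving_score: Optional[float]) -> bool:
    """Train, evaluate, and push only if the quality gate passes.

    Each callable stands in for one node of the pipeline graph.
    Returns True if the new model was pushed.
    """
    model = train()
    score = evaluate(model)
    if not should_promote(score, serving_score):
        return False  # keep serving the current version
    push(model)
    return True


if __name__ == "__main__":
    pushed = []
    # Stand-in steps with made-up scores, purely for illustration.
    ok = run_daily_pipeline(
        train=lambda: "ctr_model_candidate",
        evaluate=lambda model: 0.71,
        push=pushed.append,
        serving_score=0.74,
    )
    print("pushed:", ok, pushed)
```

In an orchestrated pipeline each of these steps would be its own node in the graph, with the gate deciding whether the create-version node runs at all.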
Yeah. Yep. Yep. Yeah, so, if you do your print statements, or your logging.info statements if you're a Python programmer, those things show up here. And I think it tells you which one is which. Standard error shows up, unfortunately, as orange warning signs, but there's nothing to be worried about; that's just XGBoost telling us that it's doing work.

Any other questions? So this is really the end of my talk. Just to summarize in a few points: productionization is hard, cloud is easy, and you pay for what you use on the cloud.

All right, a few teasers, I guess, before we see if there are any other questions about the entirety of the talk. Our team is working on an increased focus on reducing ML ops. We saw probably 10 boxes on that page, and I only addressed a few of them, like the continuous training case and the serving case, but there are actually a lot more boxes on there, and you'd better believe that we're spending time making this easier and easier as time goes on. Increased cost efficiencies: that's a continuous focus for us, because we're able to amortize the cost of the infrastructure across everybody, so we're actually able to get you lower costs than you might get on your own. And we're also working to allow data scientists like yourselves, and data engineers as well, to share their reusable work with others. We're working on ways to do that.
2018-07-25