Tuning ML Models: Scaling, Workflows, and Architecture
Thanks, everyone, for listening. Today I want to talk about tuning machine learning models, focusing on hyperparameter tuning. It will be very relevant for those interested in speeding up performance, since hyperparameter tuning is often very computationally expensive, and of course also for tuning the accuracy or other metrics of your models, which is just as critical.

Just quickly about where I'm coming from: I've spent most of my time at Databricks as a software engineer working on ML and advanced analytics, and as a solutions architect. One requisite slide about the company: Databricks is proud to have been fundamental in building out these important open source projects: Apache Spark, Delta Lake, and MLflow.

Cool. So to get into it, I want to spend two minutes quickly going over what hyperparameters are. If you're not familiar with the material, hopefully this will help; if you are, I think it will be nice in terms of setting some perspective.

Later I'm going to be using a demo looking at Fashion-MNIST, which has a bunch of images of clothing, so let's take a look at this one. It looks like a shoe. The first model that I fit said this is a dress; probably not correct. The next model I fit said this is a sneaker, and that's actually pretty accurate. What's the difference here? Well, model 2 actually had better hyperparameter settings: the learning rate, model structure, and so forth.

So stepping back, what is a hyperparameter? The statistician's sort of definition or perspective, which I'd like to give, is: assumptions you make about your model or data to make learning easier. From a practical perspective, these are inputs your ML library does not learn from data. There are a bunch of parameters in a model which the library does learn from data, but it also takes manual
inputs, and I'll call those hyperparameters because they're knobs which can be tweaked. Some of these I would call algorithmic. These are more problem-dependent configs, which may affect, say, computational performance or speed, but might not really be super related to the statistical or modeling perspective.

So how do we tune these? There are sort of three perspectives, again matching those from before. If you're a statistician or an application domain expert, you might bring knowledge which lets you set some of these settings a priori. Looking at that practical perspective: there's an ML library, it takes some inputs, it gives an output model which can be tested, and we can optimize it like a black box. This is a pretty common way to do tuning, and quite effective if you do it well. The final way is to basically ignore until needed, and I think this is a common approach I take with some of the more algorithmic aspects. If a config really only affects the speed of learning, then I may not bother tweaking it until needed.

Cool. So I'm not going to cover statistical best practices in this talk, or overview different methods for tuning, and that's because there were talks at Spark + AI Summit last year covering some of these; I'll give references at the end. In this talk I want to focus on data science architecture and workflow best practices, and tips around the big data and cloud computing space.

Hyperparameter tuning of course is difficult, else why would I be giving a talk on it, and I want to outline a few reasons why. The first is that these settings are often unintuitive. If you've taken an intro to ML class, the first hyperparameter you probably touched on was regularization. This sort of is intuitive: avoiding overfitting, limiting
model complexity. But if you take any given application, it's really hard to say a priori how you should set
regularization. And of course, let's not even get into neural net structure.

The next challenge is that this involves non-convex optimization. Here I took an example problem where the x-axis is learning rate and the y-axis is test accuracy, and you can see it jumps all over the place. While this is a bit contrived, in that a lot of this stochasticity is from the optimization process itself (there's some randomization there), it kind of drives home the point that this isn't a nice simple curve you can use traditional optimization methods on. You really need some specialized techniques.

The final element is the curse of dimensionality. Since this is a non-convex problem, as we increase the number of hyperparameters, plotted here on the x-axis for an example problem with seven possible settings for each hyperparameter, we can see that the fraction of the hyperparameter space covered by a given budget of, say, a hundred hyperparameter settings drops exponentially as we go to the right. By the time we hit maybe three hyperparameters, we can cover about a third of the space, and after that we can hardly test any of the actual settings for hyperparameters. This leads of course to high computational cost if you try to push this coverage up.

So given these challenges, I want to step back and see how we can address them. I won't be able to give neat solutions for all of them, but I do want to talk about useful architecture patterns and workflows, and sort of a bag of tips at the end.

So starting off with architectural patterns. At a high level, a lot of this boils down to single-machine versus distributed training, and I'll break it into three workflows which 99% of the customers I've seen in the field, or their use cases, fall into: single-machine training, distributed
training, and training one model per group, where a group might be a customer or a product or what have you.

So looking at the first one: single-machine training is often the simplest, involving these or other popular libraries, where you can fit your data and train a model on a single machine. In this case, if I'm doing it on my laptop, I'll just take, say, scikit-learn, wrap it in tuning (that tuning could be either scikit-learn's tuning algorithms or another method), and run it. But if I want to scale it out via distributed computing, it's also pretty simple: I can train one model per Spark task, like in this diagram, where the driver isn't doing much but each of the workers is running Spark tasks, each one fitting one model with one possible set of hyperparameters, and I wrap the entire thing in a tuning workflow.

This allows pretty simple scaling out and is implemented in a number of tools. Hyperopt, which I'm going to demo later, has a Spark integration we built, allowing the driver to do this coordination across workers. The ML team I'm from also built a joblib integration with a Spark backend, which can power scikit-learn's native tuning algorithms. And then finally, this can be done manually via Pandas UDFs in Spark.

The next paradigm is around distributed training. There, if your data or model, or a combination thereof, are too large to train on one machine, you may need to use Spark ML or Horovod, XGBoost, or something which can train a model using a full cluster. If you do want to scale this out further, then rather than training one model at a time, I can train multiple ones in parallel. That might look something like this, where you wrap the training process, which is orchestrated from the driver, say via Spark ML, with tuning. The possible tools which come into play here are, for example, Apache
Spark ML, which has its own tuning algorithms, and those actually do support a parallelism parameter, allowing you to fit multiple models in parallel to scale out further.
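As a rough illustration of what a parallelism setting buys you, here is a minimal plain-Python sketch (not Spark ML itself) that evaluates a small grid of settings with a bounded number of concurrent fits. The `fit_and_score` function, its toy loss formula, and the grid values are all hypothetical stand-ins; in Spark ML the analogous knob is the `parallelism` parameter on `CrossValidator` and `TrainValidationSplit`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_and_score(params):
    """Hypothetical stand-in for fitting one model and returning a validation loss.

    A real version would launch a (possibly distributed) training job here.
    """
    reg, depth = params["reg"], params["depth"]
    # Toy loss surface with its minimum at reg=0.1, depth=5 (made up for illustration).
    return (reg - 0.1) ** 2 + 0.01 * (depth - 5) ** 2


def tune(grid, parallelism):
    """Evaluate every setting in `grid`, running at most `parallelism` fits at once."""
    settings = [dict(zip(grid, values)) for values in product(*grid.values())]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        losses = list(pool.map(fit_and_score, settings))
    # Pair each setting with its loss and return the pair with the lowest loss.
    return min(zip(losses, settings), key=lambda pair: pair[0])


grid = {"reg": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
best_loss, best_params = tune(grid, parallelism=4)
print(best_params, best_loss)
```

A thread pool is enough in this sketch because each call is assumed to block on work done elsewhere, such as a cluster job; for CPU-bound local fits a process pool would be the usual substitute.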
You can also note that, from the driver's perspective, you can actually use any sort of black-box tuning algorithm or library, such as Hyperopt, because it can make calls from the driver and never really needs to know that these calls are using a full Spark cluster.

The final paradigm is training one model per group. Here the interesting case is where there are enough groups that we really get back to that single-machine training case, where we can scale out by distributing over groups and train each group's model in its own Spark task. Here's a diagram of it. In this I'd recommend using Spark to distribute over groups, and within each Spark task doing tuning for a model for that group. You can do tuning jointly over groups if it makes sense, though. Important tools to know about here are of course Apache Spark's Pandas UDFs, which you use to do this aggregation and coordination, and within each worker you could use scikit-learn's native tuning, Hyperopt, or whatever.

Cool. So that touches on the main architectural patterns I've seen in the field. Getting into workflows, I'd really like to touch on common workflows and particular tips for each.

In order to get started, my first piece of advice is: start small. Bias towards smaller models, fewer iterations, and so forth. This is partly being lazy (smaller is cheaper, and it may suffice), but it also gives a baseline, and some libraries, such as tf.keras, which I'll demo later, support early stopping and algorithms which can take that baseline into account.

It's also important to think before you tune, and by this I really mean a collection of good practices. Make sure you observe good data science hygiene: separate your training, validation, and test sets. Also use early stopping or smart tuning wherever possible. I often see people use, say, Keras, where they are tuning the number of epochs, but
it's often better practice to fix the number of epochs at a large setting and use early stopping to stop adaptively, to be more efficient.

Also, pick your hyperparameters carefully. This is pretty vague, but I'll give at least an example: a common mistake I've seen made with MLlib tree models. They have two hyperparameters which are important for controlling the depth or size of the tree: maxDepth and minInstancesPerNode. These serve somewhat overlapping functions, and I've seen them tuned jointly, but it's often better to fix one and tune the other, depending on what you're going for.

Finally, I'll mention picking ranges carefully. This is hard to do a priori, but I think it speaks to the need for tracking carefully during initial tuning and improving
that later on.

The next set of workflows I'll talk through is models versus pipelines. For this, one important best practice is to set up the full pipeline before tuning, and the key question to ask there is: at what point does your pipeline compute the metric which you actually care about? Because that's really the metric that you should tune on.

Related to that is whether you should wrap tuning around a single model or around the entire pipeline, and my advice there is to generally go bigger. That's because featurization is really critical to getting good results. I had a professor in a computer science or AI course back in the early 2000s make the joke that if you had the best features possible, then ML would be easy. For example, if one of your feature columns were the label you're trying to predict, you'd be done. Of course that's facetious, but it does get to the good point that taking time to do featurization, and in particular to tune it, is pretty critical.

Optimizing tuning for pipelines can come into play when you start wrapping tuning around the entire pipeline, and my main advice there is to take care to cache intermediate results when needed.

The final set of workflows is really around evaluating and iterating efficiently. There I'll say validation data and metrics are critical to take advantage of. Just a priori, record as many metrics as you can think of on both training and validation data, because they're often useful later.

Tuning hyperparameters independently versus jointly is a pretty interesting question. Since you do have this curse of dimensionality, it is tempting to say: let me tune one hyperparameter, fix it at a good value, then the next, then the next. That's sort of more efficient, but some of those hyperparameters depend
on each other, and so smarter hyperparameter search algorithms, like in Hyperopt, which I'll demo later, can take advantage of that and be pretty efficient.

The final thing is around tracking and reproducibility. Along the way of producing all of these models in tuning, you generate a lot of data, code, params, metrics, et cetera, and taking time to record these, as well as using a good tool like MLflow to record them, can save a lot of time and grief later. One tip is to parameterize code to facilitate tracking. That way, as you're tweaking what you're running, you're not tweaking complex code; you're tweaking simple inputs to that code.

So I'd really like to demo this, so I'll switch over to the Databricks notebook.

Here I am in a Databricks notebook. Keep in mind this code and the tools I'll be using are open source, so you could run it in whatever venue you want. In this I'm going to demo scaling up a single-machine ML workflow, the goal being to show some of the best practices which we had talked through in the slides. I'll be working with Fashion-MNIST, classifying images of clothing using TensorFlow Keras, and the tuning tool I'll use is Hyperopt. This is open source, provides black-box hyperparameter tuning for any Python ML library, has smart adaptive search algorithms, and also has the Spark integration for distributing and scaling out, which my team contributed a while back.

I'm going to skip through some of this initial code for loading the data. There are 60k images in training and 10k in test. Labels are integers zero through nine, corresponding to different clothing items. Just to get a sense, here are the types of images we had seen from the slides earlier; here's our sneaker, not a dress.

So I pulled an open source example from the TensorFlow website and looked at it for what hyperparameters I might tweak.
Well, model.fit had a few interesting ones: batch size and epochs jumped out at me. Of course, for epochs I'm going to use early stopping instead, and that way I don't have to bother tuning it. The optimizer, Adam, takes a learning rate, which I'll tweak. And for the model structure, I picked three example structures: small, medium, and large in terms of size and complexity. There are others, which are future work.

So taking a look at tuning these: I'm not going to go through the code in detail, but I do want to highlight the beginning, where I've taken care to parameterize this workflow. This way, as I tweak batch size, learning rate, and model structure, I can just pass those in as parameters rather than changing my code itself.

I'm going to skip through most of this code. I will note that I'm taking care to log things with MLflow at the end, and note that I can run this training code with some parameters; this is just an example run.
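The parameterized, early-stopped training function described here can be sketched in plain Python as follows. This is not the demo's Keras code: the one-weight linear model, the synthetic data, and the names `train`, `make_data`, and `patience` are assumptions chosen so the example runs without any ML library. The point is only the shape of the workflow: hyperparameters come in as arguments, the epoch count is fixed at a large value, and training stops once validation loss stalls.

```python
import random


def make_data(n, seed):
    """Synthetic 1-D regression data, y = 3x + noise (purely illustrative)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]


def mse(w, data):
    """Mean squared error of the model y = w*x on `data`."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(learning_rate, batch_size, max_epochs=500, patience=5, seed=0):
    """Fit y = w*x with mini-batch SGD; stop early once validation loss stalls."""
    train_set, val_set = make_data(200, seed), make_data(50, seed + 1)
    rng = random.Random(seed)
    w, best_w, best_val, stale = 0.0, 0.0, float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(train_set)
        for i in range(0, len(train_set), batch_size):
            batch = train_set[i:i + batch_size]
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
        val_loss = mse(w, val_set)
        if val_loss < best_val:
            best_w, best_val, stale = w, val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping: no improvement for `patience` epochs
    # Return the hyperparameters alongside the metrics so a run is easy to log.
    return {
        "params": {"learning_rate": learning_rate, "batch_size": batch_size},
        "metrics": {"val_loss": best_val, "epochs_run": epoch},
        "w": best_w,
    }


result = train(learning_rate=0.1, batch_size=32)
print(result["params"], result["metrics"])
```

In tf.keras, the comparable behaviour comes from the `EarlyStopping` callback passed to `model.fit`, and the returned `params` and `metrics` dictionaries are the kind of thing one would hand to a tracking tool such as MLflow.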
Now I'll skip down to tuning with Hyperopt. My search space here is going to be specified using the Hyperopt API; if you're not familiar with it, don't worry about it. Here I'm saying there are three hyperparameters I want to tweak, and each is given a range, and a prior over that range, where I want to search. Here's a good example of taking care in choosing those ranges: I'll call out that I use a log-uniform distribution for the prior for learning rate, to bias towards smaller learning rates, because I've seen success with those in the past.

Then I call Hyperopt's minimize function, fmin, where I minimize the loss returned by this training code over a search space, using a smart search algorithm. Here's the Spark backend, saying let's fit at most eight models, testing possible hyperparameter settings, in parallel at a time, with 64 possible models total, and run it. I already ran this, because on this toy cluster, which had two workers and no GPUs, it took about an hour. That's a good example of why it's important to make sure you record beforehand the metrics you really care about.

It's also nice to take a look at those metrics, of course. Here I'm using the MLflow experiment data source, where I'm loading the runs for the experiment for this notebook and adding a column, duration, and we can take a look at this. Here's a histogram of the duration of tuning models: most ran pretty quickly, but a few took the better part of an hour, so we might want to consider whether we want to spend that time.

I'm going to open up the MLflow UI for this notebook's experiment. Here we see I have this one run for Hyperopt, with a bunch of child runs for each of the models we fit under the hood, with different batch sizes, learning rates, and model structures, different metrics, and also, of course, the model artifact.

In the MLflow UI I'm going to compare these. I've already set this up, and it's
pretty interesting to take a step back and look at the correlations between batch size, learning rate, and model structure with our loss. If we look at some of the best losses, we can see those came from this medium model structure, small learning rates, and so forth. But if we increase this, we can see pretty good losses actually came from that largest model structure.

What's interesting is when we look at the metric that we really care about, the test accuracy. There, if we look at the best possible model, it actually had kind of a middling loss and was from this largest model structure. So this really speaks to the need to take time to record extra metrics, and especially the ones you really care about, so that you can go back later and maybe explore this model structure further.
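A search space like the one in the demo, with a log-uniform prior on the learning rate, can be approximated without Hyperopt by plain random search. The sketch below is an assumption-laden stand-in: the ranges, the candidate batch sizes, and `toy_loss` are invented for illustration and say nothing about the demo's actual results, and random sampling replaces Hyperopt's adaptive search algorithm. Only the three hyperparameter names and the budget of 64 evaluations come from the talk.

```python
import math
import random

MODEL_STRUCTURES = ["small", "medium", "large"]


def sample_settings(rng):
    """Draw one point from the (hypothetical) search space."""
    return {
        # Log-uniform prior: uniform in log space, so each decade of learning
        # rates is sampled equally often and small values are well covered.
        "learning_rate": math.exp(rng.uniform(math.log(1e-5), math.log(1e-1))),
        "batch_size": rng.choice([32, 64, 128, 256]),
        "model_structure": rng.choice(MODEL_STRUCTURES),
    }


def toy_loss(settings):
    """Arbitrary made-up objective standing in for a real training run's loss."""
    lr_term = (math.log10(settings["learning_rate"]) + 3) ** 2
    size_term = {"small": 0.3, "medium": 0.0, "large": 0.1}[settings["model_structure"]]
    return lr_term + size_term + settings["batch_size"] / 1000


def random_search(max_evals, seed=0):
    """Sample `max_evals` settings, record every trial, and sort by loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_evals):
        settings = sample_settings(rng)
        trials.append({"settings": settings, "loss": toy_loss(settings)})
    return sorted(trials, key=lambda t: t["loss"])


trials = random_search(max_evals=64)
best = trials[0]
print(best["settings"], round(best["loss"], 4))
```

With Hyperopt itself, the same space would be written with `hp.loguniform` (whose bounds are given in log space) and `hp.choice`, then passed to `fmin`; using `SparkTrials(parallelism=8)` as the trials object is what runs eight settings at a time on a cluster. Keeping every trial, not just the best one, is what makes it possible to go back and rank by a different metric later.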
So I'm going to flip back to the slides.

Cool. So that was a demo to give you a bit of a taste of what we've been talking about in architecture and workflows. I'm going to go quickly through these last slides, because they're really a grab bag of tips and details, and largely have quite a few references for looking at later on.

As far as handling code: getting code to workers is an interesting problem. It's generally simple; use Pandas UDFs or integrations. But debugging problems can be tricky, and there my main recommendations are to look in worker logs and, in Python, to import libraries within closures. Passing configs and credentials can also be tricky. Here I don't show it, because Databricks is handling the MLflow runs and credentials under the hood, but this helpful resource has some info on some of these topics.

Moving data is also an interesting problem. For single-machine ML, broadcasting that data or loading from blob storage can both be good options, depending on the data size. Caching data can also come into play, and becomes even more important in distributed ML. Blob storage data prep is really important as data sizes start to grow, and there Delta Lake, Petastorm, and TFRecords are key tech to be aware of. These links to resources are nice in terms of walkthroughs of these different options.

Then finally, for configuring clusters, here are the main discussion points; not tips really, but discussion points. Around single-machine ML, sharing machine resources and selecting machine types can be pretty important. For sharing resources, we're really talking about the question of: if you have multiple Spark tasks fitting different models on the same worker machine, how are they sharing resources? Thinking about that beforehand can be critical. For distributed ML, right-sizing clusters and sharing clusters can
be important topics, and for the latter I'd really say: take advantage of the cloud model and spin up resources as needed, so that you don't need to share clusters and complicate things. Again, this resource has a bit more info.

So to get started, I'd recommend first taking a look at the different technologies which you've been working with. Tools to know about here are listed on the right for each of the technologies on the left. I won't go through these, but definitely take a look at them if any of these techs call out to you.

And finally, the slides and notebook are in tiny URLs up there in the upper right, and those slides will have the links to these many resources I've listed. There are a few more listed here, and I hope they're useful, including those talks from last year which go into a bit more of the fundamentals of hyperparameter tuning.

With that, I'd like to say thank you very much for listening, and I'd love to answer a few questions.