Machine Learning Data Lineage with MLflow and Delta Lake
Welcome to Machine Learning Data Lineage with MLflow and Delta Lake. Hi, my name is Richard Zang. I'm a software engineer on the Databricks ML platform team. Prior to Databricks, I was a software engineer at Hortonworks, working on Apache Ambari. Before that, I was a software engineer at OpenText Analytics, building the company's BI design suite.
Hi everybody, my name is Denny Lee. I'm a developer advocate here at Databricks. I was previously a senior director of data science engineering at Concur, and also a principal program manager at Microsoft, where I was known for Project Isotope, which is also known as Azure HDInsight. So thanks very much for joining us. Back to you, Richard, for the agenda.
Today, Denny and I will provide a brief introduction to MLflow and its Model Registry feature, as well as Delta Lake and its time travel feature. Then we will show a live demo on how to use various versioning features from these two frameworks to achieve data lineage in the machine learning process.
We know that machine learning development is complex. To give a sense of it, this is a typical machine learning pipeline. You take your raw data, you do some ETL or featurize it for data prep, then you do some training with this data to produce a model, and you deploy this model to production: score it, put it behind a REST API serving layer, run it through Spark, et cetera. Then, when you get new data, you iterate through the process again. Many organizations using machine learning face challenges storing and versioning their complex ML data, as well as the large number of models generated from that data. To simplify this process, organizations tend to start building their own customized machine learning platforms. However, even such platforms are limited to only a few supported algorithms, and they tend to be strongly coupled with the company's internal infrastructure. MLflow is an open source project designed to standardize
and unify this machine learning process. As you can see, MLflow has four major components. MLflow Projects is a packaging format for reproducible runs that makes them available on any compute platform. MLflow Models gives you a general model format and standardized deployment options. MLflow Tracking allows you to record and query experiments and log metrics and parameters. And finally, Model Registry gives you a centralized repo to collaborate on model lifecycle
management. Most machine learning libraries let you save the model file, but there isn't any good software to share and collaborate on these files, especially with a team. If you are working alone, you can probably check the file into a Git repository. You may need to name the file somehow to keep track of your model versions, and hopefully it's still manageable, because you need to actually remember what you did to come up with these versions of the files. If you're working in a large organization with hundreds or thousands of models, and each of them has different versions for many different reasons, this management becomes a major challenge. You may ask: where can I find the best version of this model? How was it trained? How do I add documentation for it? And how can I collaborate with my colleagues to review the model?
Inspired by collaborative software development tools like GitHub, we launched MLflow Model Registry, which is a repository and a collaborative environment where you can share and work on your models. You can register named models and create new model versions for your registered models. You can comment on and tag your registered models and model versions, so people can collaborate with you and quickly find the latest version of the model and the relevant information about it. It also has a built-in concept of lifecycle stages: for each model you can have versions that are in Staging, Production, or Archived. And it provides a series of APIs for you to easily interface with the Model Registry, so you can do it automatically and test it with your CI/CD pipeline.
So the new workflow is that, as a developer, you log your model into the Model Registry, and it works with any type of model as long as you can package it there. Then your collaborators can go to the model and manually review it, or use automated tools to test it with the MLflow Model
Registry API. Now the downstream users can safely pull the latest model after it has been reviewed and checked that it works. Then you can also use automated jobs or serving services of your choice with your latest model to do some inference.
We can see that the data lineage through the natural model lifecycle is as follows. It starts from the training data set you ETL out of the raw data. You may have different versions of the training data. The relevant parameters and metrics, as well as the model file, can be logged with the tracking API in the MLflow Tracking component. Then, on the MLflow Tracking component's run details page, you can register a new model or create a new version of an existing model. Finally, you can manage different versions of the model and their lifecycle stages in the MLflow Model Registry component. That's the part from me. Let's welcome Denny to talk more about Delta Lake.
Hey, thanks very much, Richard. We have many sessions throughout the summit that talk about Delta Lake, so let's just focus on some of the key components here. What you really need for proper model data lineage is reliable data, and that's the data engineer's dream: being able to process data continuously and incrementally as new data arrives, in a very cost-efficient way, without having to choose between batch and streaming. So underneath the covers, when we're talking about Delta, what is Delta on disk? It's a transaction log that goes with your table: you see the _delta_log directory and the JSON action files in there, plus the Parquet files themselves. As you'll note, there are table versions, and also optional partition directories that you're working with. The data files are the same Parquet files that you're used to working with; packaged together with the log, they are now your Delta table, which ensures ACID transactions.
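The on-disk layout described above can be sketched in plain Python. This is a simplified illustration, not the Delta Lake implementation: it assumes commit files named with a zero-padded version number that hold one JSON action per line, and it only understands `add` and `remove` actions (real logs also carry metadata, protocol, and commit-info actions, plus checkpoints). The table location and the Parquet file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir: Path, version: int, actions: list) -> None:
    """Write one commit file: zero-padded version, one JSON action per line."""
    lines = "\n".join(json.dumps(action) for action in actions)
    (log_dir / f"{version:020d}.json").write_text(lines + "\n")


def live_files(log_dir: Path, as_of=None) -> set:
    """Replay commits in version order and return the data files in the table."""
    files = set()
    # Zero-padded names sort lexicographically in version order.
    for commit in sorted(log_dir.glob("*.json")):
        if as_of is not None and int(commit.stem) > as_of:
            break
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Hypothetical table location and file names, for illustration only.
table = Path(tempfile.mkdtemp()) / "housing"
log = table / "_delta_log"
log.mkdir(parents=True)

write_commit(log, 0, [{"add": {"path": "part-1.parquet"}},
                      {"add": {"path": "part-2.parquet"}}])
write_commit(log, 1, [{"remove": {"path": "part-1.parquet"}},
                      {"remove": {"path": "part-2.parquet"}},
                      {"add": {"path": "part-3.parquet"}}])

print(sorted(live_files(log)))           # ['part-3.parquet']
print(sorted(live_files(log, as_of=0)))  # ['part-1.parquet', 'part-2.parquet']
```

Because old commits are never rewritten, replaying the log only up to an earlier version recovers what the table looked like at that point, which is the idea behind time travel.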
So that way, not only do you have reliable data, but you also have a transaction log, so we can go back and look at what the old data looked like when you're modeling and watching your ML models as well. The key aspect of implementing atomicity is that changes to your table are stored as ordered, atomic units called commits. You have your first file here, the 000.json. Then you have a second file, the 001.json. If I'm adding Parquet files one and two, that's recorded in the first JSON, the zeroth one. The second JSON, the 001.json, records the removal of the first and second Parquet files and the addition of the third Parquet file. All right, what
we want to be able to do is solve conflicts optimistically. If two clients are trying to write at the exact same time, for example, you want to be able to record the start version, record the reads and writes, attempt the commit, and, if someone else wins, check whether anything that you read has changed, as you can see here. So that's it for the slide portion of the session. Let's dive into the demos.
In this demo, we're going to show you how to use MLflow Model Registry and Delta Lake time travel to handle data lineage in the machine learning process. We will also show you how to use various versioning features from these two frameworks to troubleshoot data versioning problems and achieve reproducibility for your experiments. Here is a notebook where we are going to run some machine learning code with the Boston housing data set prepared by Denny. The Boston housing data set contains a bunch of columns like crime rate, number of rooms, and percentage of lower-status population. Our objective is to use this data set to train a linear regression model and use it to predict home values. We have a few pre-run cells doing some data preparation and visualization. You can see that we create a DataFrame by querying the Delta table and then convert it to a pandas DataFrame. From the scatter plot matrix here, you can see that the number of rooms and the percentage of lower-status population have positive and negative linear correlations, respectively, with the median value of the house, as shown here and here. We can see it even more clearly in the following two scatter plots, here and here, as well as in the bar chart showing the correlation of all columns to the median home value. We then define a list of more readable column names, and we drop all the rows without a median home value for data cleanup. After reviewing the correlation
coefficient matrix and scatter plots, let's choose features that have a strong correlation to the median value. Say we choose the columns whose correlation coefficient has an absolute value greater than or equal to 0.4. Then we do an 80/20 train-test split for our training and testing data sets. And here it shows that we're going to try different learning rates and choose the one that yields the lowest RMSE. We have two training sessions, with ridge and lasso regression respectively. Let's run the training sessions.
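The feature selection and split just described can be sketched without any dependencies. The notebook itself works on a pandas DataFrame, so this is only an illustration of the rule (keep columns with an absolute correlation of at least 0.4, then split 80/20). The column names and numbers below are made up and are not the Boston housing values; `select_features` and `split_indices` are hypothetical helper names.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_features(columns, target, threshold=0.4):
    """Keep the columns whose |correlation| with the target is >= threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them (80/20 by default)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[:n_rows - n_test], indices[n_rows - n_test:]


# Made-up toy data for illustration.
median_value = [15.0, 17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0, 33.0]
columns = {
    "rooms": [5.0, 5.3, 5.2, 5.8, 6.0, 6.1, 6.5, 6.4, 6.9, 7.2],
    "lower_status_pct": [30, 27, 28, 22, 20, 18, 15, 16, 10, 8],
    "noise": [3, 1, 4, 1, 5, 5, 1, 4, 1, 3],
}

print(select_features(columns, median_value))  # ['rooms', 'lower_status_pct']
train_idx, test_idx = split_indices(len(median_value))
print(len(train_idx), len(test_idx))           # 8 2
```

Fixing the shuffle seed matters for the theme of the talk: the same data version and the same seed give the same split, which is one precondition for reproducing a run.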
While the training session is running, let's take a look at our training function. Our training function takes our training and testing data sets and the regression type, as well as the learning rate alpha. First it creates the MLflow run and initializes the linear regression object based on the regression type. Then it fits the training data set and collects all the training prediction outputs. Then it calculates the RMSE and R2 metrics and uses the MLflow API to log all the parameters and metrics. It also logs the linear regression object as a scikit-learn flavored model. Finally, it creates a prediction error plot plus a residual plot and logs both of them as run artifacts using the MLflow API.
After the training process, we can see a list of runs showing in the notebook's runs sidebar. Let's sort ascending by RMSE, choose the lowest RMSE, and go to that run. Now we can see the run details page. In the run details page, we can see the parameters and the metrics that we logged in the notebook. In the artifacts section, we can see the MLmodel file, indicating the scikit-learn flavored model we logged, and the PNG files for the plots that we logged. Since this is the best run we have, let's register a model using this run. To register a model, we first select the model folder in the run artifacts section, click Register Model, and choose to create a new model. Let's use Boston Housing Demo as the new model's name and click Register. We can see that the first version of this model is being created. Let's wait for it to finish creating. Now it has finished creating the new version of this model, which is version 1. Let's go to the model version page. So this is the model version page. We can go back to the registered model page and see that version
1 is the only model version we currently have. Since I want to collaborate with Denny on this registered model, I'll give Denny manage permission on this model: I add Denny here, choose Can Manage for him, and click Save. If we want to load this model in our notebook, we'll need to switch this model to the Production stage. So let's say "ship it." OK, now our model is in the Production stage. We can go back to the notebook, where there is a cell at the bottom in which we pick a row from our training data set to test the model and see the prediction. Let's run this cell. As we can see, the prediction is 23.7925, which is pretty good.
So one thing I noticed when working with this notebook is that, as you can see from Richard's model, if I dive into it a little bit, he was actually using a different version of scikit-learn. He was using 0.22.1, and I actually want to use a different version of it. So what I'm going to do is go back and rerun the whole thing, and you'll notice that I can jump right in here and find the lowest RMSE. I'm just going to pop that open and quickly deploy a different version of the MLflow model. All right, give it a couple of seconds here. I'm going to grab this model and register it, similar to how Richard did before. Now I'm on v2. It'll take a couple of seconds to go through. Just to make sure, I'll take a look at it real quick and transition it to Production. Perfect. So now that I'm good to go, I'm going to go back to my notebook, close up the runs, and rerun this particular cell again. As it goes through, you'll see again a value of 23.79, against the original median value of 17.8.
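The tracking and registry steps walked through above can also be driven from code rather than the UI. The following is a rough sketch, not the notebook's actual training function: it assumes MLflow and scikit-learn are installed and that the tracking server has a Model Registry backend, it evaluates the metrics on the test split (the talk does not say which split is used), and it uses the stage-based registry calls shown in the demo, which may differ in newer MLflow releases. The parameter called the learning rate in the talk is passed as `alpha`, which in scikit-learn's `Ridge` and `Lasso` is the regularization strength. The function names and the `"model"` artifact path are illustrative choices.

```python
import math


def train_and_log(X_train, y_train, X_test, y_test, kind="ridge", alpha=1.0):
    """Fit Ridge or Lasso, log params, metrics and model; return the run ID."""
    # Imported lazily so this sketch can be loaded without MLflow installed.
    import mlflow
    import mlflow.sklearn
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.metrics import mean_squared_error, r2_score

    with mlflow.start_run() as run:
        model = (Ridge if kind == "ridge" else Lasso)(alpha=alpha)
        model.fit(X_train, y_train)
        predicted = model.predict(X_test)

        mlflow.log_param("regression_type", kind)
        mlflow.log_param("alpha", alpha)
        mlflow.log_metric("rmse", math.sqrt(mean_squared_error(y_test, predicted)))
        mlflow.log_metric("r2", r2_score(y_test, predicted))
        # Stored under the run's artifacts with the scikit-learn flavor.
        mlflow.sklearn.log_model(model, "model")
        return run.info.run_id


def stage_uri(name, stage="Production"):
    """URI that resolves to whichever version currently holds the given stage."""
    return f"models:/{name}/{stage}"


def register_and_promote(run_id, name="Boston Housing Demo"):
    """Register the run's model as a new version, move it to Production, load it."""
    import mlflow
    import mlflow.pyfunc
    from mlflow.tracking import MlflowClient

    version = mlflow.register_model(f"runs:/{run_id}/model", name)
    MlflowClient().transition_model_version_stage(
        name=name, version=version.version, stage="Production")
    return mlflow.pyfunc.load_model(stage_uri(name))
```

Loading by stage rather than by version number is what makes the prediction cell in the demo change its output when someone else promotes a new version: the notebook code stays the same while the model behind the `Production` URI changes.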
Now let's say I want to replace the null values that we had. Remember that in this particular Boston housing Kaggle data set there are 506 rows, but only 333 of them actually have values for the median value, while the remaining 170 or so don't; they're null. OK, so I'm going to go ahead and instead use this particular model to update my table and fill these in with new values. So here you go. Right now what I'm doing is updating our Delta table: matching by ID, I update the median value with the new values that were calculated using this particular model. Let me scroll down and take a look at the values.
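An update of the kind described here, matching by ID and filling only the missing median values, could be expressed as a Delta Lake `MERGE` statement. The sketch below only builds the SQL text; the table names, the `ID` key, and the `medv` column are assumptions for illustration, and the notebook may use a different API. As the speakers point out shortly afterwards, writing predictions back into the training data is not something to imitate; the example is here only to make clear what changed in the table.

```python
def fill_nulls_merge_sql(target, source, key="ID", column="medv"):
    """Build a MERGE that fills NULLs in `column` of `target` from `source`."""
    return (
        f"MERGE INTO {target} t "
        f"USING {source} s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED AND t.{column} IS NULL "
        f"THEN UPDATE SET t.{column} = s.{column}"
    )


# Hypothetical table names; `predictions` would hold one predicted value per ID.
sql = fill_nulls_merge_sql("boston_housing", "predictions")
print(sql)
# On a Delta-enabled cluster with a SparkSession named `spark`:
#     spark.sql(sql)
```

Each such statement becomes one commit in the table's transaction log, which is why the change can be inspected and undone later.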
You'll notice that basically I have not only the original values in there (let me scroll to the right here), the ones with either one or two decimals or none for the median value, but also the new values. If I scroll down, there you go: these are the new values that were predicted by our model, and we've inserted them back in. Now, unfortunately, for somebody like myself this is probably a bad idea. But fortunately, because I'm using Delta Lake, I've actually saved all this information, because we kept the transaction log, which Richard is going to show. Meanwhile, I'm just going to hide the fact that I did this and delete the cells. OK, so now I've got the updated data, which is sort of incorrect, produced using the updated model. All right, let me go ahead and run this all over again. I'm going to register the new model based on the updated data, which probably isn't the best idea, but again, I'll get a new RMSE value, so let's just let it run through and we'll see what ends up happening. I'll have the new results coming in, and I'm going to register this as a third model. All right, perfect, we're almost finished. Excellent. So now let's go back to this and sort by RMSE. Now I have an even lower one, 4.331. I'm going to make this my new model. So I'll go back here, register this as a third version of our Boston Housing Demo model, and transition it to Production as well. Click OK. So if you look at the models, you'll actually see three versions. The first is the one that Richard created using a more
recent version of scikit-learn. There's a second version, which I ran, which has an older version of scikit-learn. And the third one is where I went ahead and updated the median value data. So if I go back down, I'm going to keep this cell for the purpose of understanding what's going on. I'm going to run this one, but this one is now against the new model and the new data. And it helps if I actually write the correct name. Here you go. You'll notice the actual value is the same 17.8, but instead of 23.79 I now have a value of 23.13. All right, so that's it for this part.
Now I'm back in the notebook. What I want to do this time is retrain the Boston housing model and see if we can reproduce the exact same result. However, I noticed that Denny has rerun the notebook and created a new prediction. The code looks the same, but it's a different prediction value. Is he using a new model? Let's go ahead and check. Let's go to the model page and search for Boston. Here you can see that there are already three versions in the Boston Housing Demo model, and version 3 is now in Production. No wonder the prediction is different. Let's check what's different in version 2 and version 3. In version 2, Denny says that he switched to sklearn 0.20.3. This looks like a scikit-learn library version. I think if we want to reproduce the same result, we just need to use the previous scikit-learn version, which is a newer version of scikit-learn, and that's fine. Let's check version 3. In version 3, Denny says "updating to include predicted values for median value." This doesn't look very correct. Wow. If I understand it correctly, he has probably updated our training data set in the Delta table with some values from the prediction output. That doesn't sound like good practice in machine learning.
Let's check our Delta table and see what happened to it. Let's go back to our notebook. You can see that now I'm using a cluster with the latest scikit-learn version. Let's check out our Delta table here. In the Delta table history, we can see that there's a new version created by Denny, and from the operation metrics we can see that 173 rows were updated and the total number of rows is 506. OK, let's check version 0 and version 1 respectively and see what the content of the two versions is. Version 0 looks pretty legit: it has a bunch of rows with the median home value being null. In the second one, everything is filled in, and these look like what Denny described as the predicted values. This doesn't sound right. So if I want to reproduce my training and my experiment, I'll need to overwrite the table back to version 0. OK, I'm going to go ahead and do that. OK, done. Then let's check the Delta table again, and we can see that we have another new version, which rolls back to the previous version 0. That reverts our data set version, giving us the data lineage on the data set.
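The history check, the version comparison, and the rollback shown above map onto a few Spark calls. The sketch below assumes a Delta-enabled `SparkSession` passed in as `spark` and a path-based table; the helper names and the path are hypothetical. The rollback is written the way the demo describes it, by overwriting the table with the old snapshot, which adds a new version on top of the history rather than deleting anything. Some Delta Lake releases also provide a dedicated restore command, so it is worth checking the documentation for your version.

```python
def table_history(spark, path):
    """Commit history of the table: version, timestamp, operation, metrics."""
    return spark.sql(f"DESCRIBE HISTORY delta.`{path}`")


def read_version(spark, path, version):
    """Time travel: load the table as it was at the given version."""
    return (spark.read.format("delta")
            .option("versionAsOf", version)
            .load(path))


def roll_back(spark, path, version):
    """Overwrite the table with an old snapshot (recorded as a new commit)."""
    snapshot = read_version(spark, path, version)
    snapshot.write.format("delta").mode("overwrite").save(path)


# Usage on a cluster, with a hypothetical table path:
#     table_history(spark, "/mnt/demo/boston_housing").show()
#     v0 = read_version(spark, "/mnt/demo/boston_housing", 0)
#     v1 = read_version(spark, "/mnt/demo/boston_housing", 1)
#     roll_back(spark, "/mnt/demo/boston_housing", 0)
```

Recording the table version next to each MLflow run (for example as a run parameter) is one way to make this kind of investigation faster, since the run then states exactly which snapshot it was trained on.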
That is pretty much what we need to do to get the data set back in line. Now we have the same data set and the same library environment, so let's run the training and see what the output is. To have a clean slate for the training, I'll need to clean up the previous experiment runs. So those are the previous runs. Just for the convenience of reproducing with a clean slate, let me delete all those old runs. OK, deleted. Let's go back to the notebook and refresh the runs sidebar: all the runs are gone, a clean slate. Then let's go ahead and rerun the training process, all the way to here. So let's run everything above the prediction: Run All Above. After this new training session, with exactly the same data set and exactly the same scikit-learn version, I will register a new model version the same way we did earlier, which is to select the run with the lowest RMSE and then use the run artifact of that run to register a new model version. Looks like it finished running. Let's open the runs sidebar, choose RMSE, and sort ascending. We see that this is the lowest RMSE, so let's go to this run and register it as the new version of our Boston Housing Demo model. You can see that v4 of the model has been created. OK, it's finished. This is v4 of the model, and let's move it to Production. Let's do some prediction based on the v4 model, which is currently in the Production stage. Let's copy the prediction code here and run it down here. Mmm, 23.7925. Looks like the same value we got before. To confirm, I will go back to version 1 and make version 1 Production, then try to run the prediction again, because the prediction is always based on the current Production model version. I want to see if version 4 and version 1 output the same prediction value. Let's run it. Nice.
Exactly the same: 23.79. Cool. So to recap: in this demo, we first trained a linear regression model from the Boston housing data set and created model version v1. Then we messed up our library version and the original training data set in the Delta table. After we found the problem, we used Delta Lake's time travel feature to switch back to the original version of our training data set and reran the training process with a consistent scikit-learn library version. We ended up reproducing the same result we had in our previous training session. This is the end of our demo. Thank you.
Well, thanks very much, Richard, for those awesome demos. Thank you very much for attending our session today. If you want to dive in more, please join us at mlflow.org or delta.io for more information. Thanks very much, and have a great summit.
2020-08-28 10:12