MLOps Using MLflow
- Thank you all for joining today's session. This is a wonderful day. Today we're going to cover MLOps using MLflow. I'm Nagaraj Sengodan, a senior manager in the data analytics practice at HCL Technologies. - Thank you all for joining this session on MLOps using MLflow.
This is Nitin Raj, and I'm a technical consultant in the data and AI practice at Cognizant. - Today, we're going to talk about MLflow and how it can help us build MLOps for any organization. Before we get started, let me say a little more about myself. I'm a senior manager and a certified solution architect, and I've delivered a couple of advanced analytics projects where we implemented MLOps. That's the story I'm going to cover in this presentation.
- Yeah. So let me give a brief introduction about myself. I'm a technical consultant focusing on advanced data analytics, data engineering, and cloud-scale analytics, and I'm a Microsoft Certified Solutions Expert in business intelligence and data analytics.
- Yeah. So without further delay, let's move on to what we're going to cover. Today we'll talk about MLOps, its stages and challenges, then MLflow, followed by a demo. First, what is MLOps? In any industry, if you don't manage ML operations properly, it will definitely become a big problem for the business. It's not like the software development life cycle, where once you develop a piece of code, the application runs forever.
With ML, that's not the case. We have to keep retraining, keep monitoring, and keep evaluating our model to make sure it stays fit for the business. So let's look at the typical challenges faced in the enterprise world. The first is reproducibility of the ML model.
In most cases, a business has good data scientists who build a model and deploy it, and it works fine in production. After a certain time, the model's accuracy starts dropping, and when they decide to update it, they find they can't reproduce it: where is the code, and which code was used to generate the model? You see folders like "Bob new", "Bob latest", "Bob one"; maybe Bob is stuck at home with COVID-19, someone else has to pick up the model, and all they have is a pile of folders. That's the challenge. Number two, ML operations: day to day, how are you going to manage ML? Quite a lot of people develop models and release them periodically, and the ML engineer has to make sure each model is valid and gets better accuracy than the previous one.
It also has to pass all the ethical and compliance checks before going into production, which is an important step. Then there's model management: once you deploy to production, you need to monitor how the model is performing, because the health of the model is key to business value. Last are the tools and technologies. As we know, there's a wide variety of tools and technologies.
Data engineers have one tool stack, data scientists use their own stack, and ML engineers use their own, like Docker containers. Managing all these different technologies is always challenging.
So those are the challenges. How is MLOps going to address them? MLOps is nothing but a mix of three main disciplines: software engineering, DevOps, and ML engineering.
People might ask: when we already have DevOps, why MLOps? I'll cover that in a later slide; for now, a brief summary. Software engineering is about how you write the code. DevOps is about how you deploy the code. ML engineering is about how you build the model. These three components together are what we call MLOps, and they help us keep improving our models. As you can see, we have the source systems, and first we need to prepare the data.
Preparing the data means making sure the data is valid. Some values are numerical, say the quantity ordered or employee numbers, so there will be some numeric columns and some categorical columns.
We need to make sure everything is aligned and normalized before we start building our ML model; otherwise the prediction accuracy will definitely be impacted. Are there any gaps in the data? Any null values or blanks? All of that has to be properly addressed before we start building the model. That's the key job for the data engineer: do all the transformations and make sure the data is ready for model building. Number two, the data scientists. Once they have the data, they start identifying the features and training the model with different parameters.
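The kind of data preparation described here can be sketched with pandas; the column names, fill strategy, and values below are purely illustrative:

```python
import pandas as pd

# Illustrative pre-training checks (column names are made up for the example).
df = pd.DataFrame({
    "quantity_ordered": [10.0, None, 5.0],
    "category": ["retail", "wholesale", None],
})

# 1. Find gaps: null/blank counts per column.
null_counts = df.isna().sum()

# 2. Fill numeric gaps (here with the median) and flag missing categories.
df["quantity_ordered"] = df["quantity_ordered"].fillna(df["quantity_ordered"].median())
df["category"] = df["category"].fillna("unknown")

# 3. Min-max normalize the numeric column so features share a common scale.
q = df["quantity_ordered"]
df["quantity_norm"] = (q - q.min()) / (q.max() - q.min())
print(df["quantity_norm"].tolist())  # [1.0, 0.5, 0.0]
```

Whether to fill, drop, or flag nulls is a per-dataset decision; the point is that these checks happen before any model training starts.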
Then they evaluate which model performs better and release it to production. When it comes to the release cycle, that's where the ML engineer comes into the picture. They have to validate the model from all aspects: whether it follows compliance, is ethical and responsible, and gives better accuracy that is valuable for the business. Only then should it be deployed to production.
They also make sure the model works as expected in the target environment. The model may have been developed with versions the production environment doesn't comply with. Maybe the Python version is different: someone used 2.7, someone used 3.
The same goes for Anaconda or TensorFlow. There are many technology stacks, so you want to make sure all the right versions are in place before pushing the model to production. That's where the ML engineer comes into the picture. If you look at the whole thing, it's all about collaboration between these people, and that's what MLOps brings. Next, let's talk about the MLOps life cycle.
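One common way to lock down those versions is a pinned Conda environment file that travels with the model; the versions below are illustrative, not a recommendation:

```yaml
# environment.yml -- pin the exact versions the model was developed with,
# so dev and production resolve to the same stack (versions are illustrative).
name: wine-model-env
channels:
  - defaults
dependencies:
  - python=3.8.10
  - scikit-learn=0.24.2
  - pip
  - pip:
      - mlflow==1.20.2
```

With a file like this checked in next to the training code, the ML engineer can recreate the exact training environment instead of guessing which Python or library versions were used.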
This is where we identify how business value is generated whenever a machine learning model is built. It could be anything: a neural net, a simple model like linear regression, or any other model. When developers build it and push it to production, it has to yield business value; that's the real benefit of building models. So how are you going to maintain the business value from your model? That's where MLOps comes in. Any model has to follow the continuous integration and continuous deployment life cycle, along with orchestration in the cloud. ML orchestration means that once you push to production, you have to make sure the model keeps working fine.
Is the health good? Is it responding correctly? We have to make sure the production model runs with the proper configuration so that it's capable of serving all the business needs. Let's say you deploy it in a small container while a downstream application with more than 25,000 users is consuming it.
It definitely won't respond in time, and that impacts the business. Then come model governance and business impact. We want to make sure the model behaves ethically and follows all the standards, including compliance and security standards, because the data is key. So we have to make sure everything is aligned with the standards, compliance, and model governance.
These are the key aspects that help us increase business value. Let me bring up one interesting story; I think most of us have heard about Genderify. It was a startup whose service took a name and predicted which gender it belongs to.
That was the API they exposed, and they were a new startup. Soon after they launched, they started getting comments from people saying it was gender biased: if you typed something like "doctor", it predicted more towards male, and that's a big problem.
Because of that, they ended up shutting down the company within a month. That's how the market is right now. So whenever you build a model, make sure it's aligned and generates business value, rather than creating an issue in the market. That's the importance MLOps brings. Now let's talk about the differences between the software life cycle and the ML life cycle.
Then we'll jump into how MLflow helps us. On the software side, we have the typical life cycle: coding, unit tests, peer review, approval, commit, test release, and production release. On the ML side, there's a similar cycle. These are the standard steps; they can vary case by case, even for software, but mostly these are the steps involved. On the ML side, the first thing is analyzing the data, then data preparation, building a model, evaluating the model, and optimizing it.
Once you optimize the model, you deploy it, then monitor and retrain it continuously. That's when you start getting the benefits. So let me compare the two, stage by stage. The goal for software is to meet the functional requirements, whereas for ML it's to get better accuracy by optimizing a metric. If you look at quality, software quality depends on the code, whereas ML quality depends completely on the data, the choice of algorithm, and the parameters we use.
As for the technology stack, software is mostly a single stack, whereas ML is a combination of multiple stacks, because the data engineer, the data scientist, and the ML engineer each use a different one. And the final one: the outcome.
Software is pretty much deterministic: we know the code that was written, and that's how it's going to function. ML, on the other hand, behaves entirely based on the data. So those are the differences between software engineering and ML engineering. Now let's see how MLflow helps us close this gap. I'll pass this on to Nitin, my colleague.
- Okay. Thank you, Naga. So let's look at MLflow in detail. MLflow is an open source platform for the machine learning life cycle.
MLflow is built on an open interface philosophy, defining several key abstractions that allow existing infrastructure and machine learning algorithms to be integrated with the system easily. That means if you're a data scientist who wants to leverage MLflow and you're using a framework that's currently unsupported, the open interface design makes it extremely easy to integrate that framework and start working with the platform. I'm going to walk through each of the MLflow components: tracking, projects, models, and registry. Tracking is a centralized repository for metadata about training sessions within an organization.
Projects is a reproducible, self-contained packaging format for model training code, ensuring the training code runs the same way regardless of the execution environment. Models is a general-purpose model format, enabling any model produced with MLflow to be deployed to a variety of production environments. Registry helps solve three critical problems in production machine learning applications. It prevents deploying bad models by introducing model administration and review.
It integrates with MLflow tracking to provide a complete picture of every model in your organization, including the source code, parameters, and metrics. It also provides a centralized activity log, recording the entire collaborative process from model development to deployment, with complete model descriptions and comments. In the MLflow workflow, the data lineage through the model life cycle is as follows. It starts from training on the dataset, say the ETL output from your raw data.
You may have different versions of the training data, and the relevant parameters and metrics, as well as the model file, can be logged through the tracking API. From a run's detail page in the MLflow tracking component, you can register a new model or create a new version of an existing model. Finally, you can maintain the different versions of the model across life cycle stages in the MLflow model registry component.
Next, the MLflow tracking server. There are several concepts associated with the centralized training metadata repository that MLflow provides. The first is hyperparameters, the configuration knobs that impact model performance. These can all be saved using the MLflow APIs and the centralized tracking service. Users can also log performance metrics that provide insight into the effectiveness of their machine learning models. Additionally, for reproducibility, MLflow lets users log the particular source code that was used to produce the model, as well as its version, by integrating tightly with Git, which maps every model to a particular commit hash. And further, and perhaps most importantly, data and models.
Next, MLflow projects. A project is a self-contained training code specification that bundles all of the machine learning training code along with its versioned library dependencies, its configuration, and its training and test data. By fully specifying the complete set of dependencies for a machine learning training task, MLflow enforces reproducibility across execution environments. It does this by installing all those libraries and reaching the exact same system state wherever the code runs. So what does an MLflow project look like? At its core, an MLflow project is simply a directory, with an optional configuration file.
It contains the training code, the library dependency specification, and any other data required by the training session. The library dependencies can be specified in multiple ways. For example, users can include a YAML-formatted Anaconda environment specification to enumerate the training code's library dependencies. They can also include a Docker container, and MLflow will execute the training code within the specified container.
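A hypothetical MLproject file along those lines might look like this (the project name, entry point, script, and parameters are made up for illustration):

```yaml
# MLproject -- the optional configuration file at the root of the project
# directory (contents illustrative).
name: wine-quality

conda_env: environment.yml   # or a docker_env pointing at a container image

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.7}
    command: "python train.py --alpha {alpha} --l1-ratio {l1_ratio}"
```

Running something like `mlflow run . -P alpha=0.6` would then resolve the environment and invoke the entry point with the overridden parameter, locally or on a remote backend.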
Finally, MLflow provides a CLI for executing these projects, as well as APIs in Python, R, and Java. Projects can be executed both on the user's local machine and in several remote environments, including the Databricks job scheduler and Kubernetes. Similar to a project, an MLflow model is also a directory structure containing a configuration file. But instead of the training code, this time it contains a serialized model artifact. It also contains a set of dependencies for reproducibility.
This time we're talking about the evaluation dependencies, in the form of a Conda environment. Additionally, MLflow provides model creation utilities for serializing models from a variety of popular frameworks into the MLflow format. Finally, MLflow introduces deployment APIs for productionizing and deploying any ML model to a variety of services. These APIs are available in Python, Java, and R, and through the CLI.
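For a sense of the format, here is a sketch of the MLmodel configuration file that sits inside such a model directory (paths and versions are illustrative):

```yaml
# MLmodel -- the configuration file inside a model directory (illustrative).
artifact_path: model
flavors:
  python_function:             # generic flavor any deployment tool can load
    loader_module: mlflow.sklearn
    python_version: 3.8.10
    env: conda.yaml
  sklearn:                     # framework-native flavor
    pickled_model: model.pkl
    sklearn_version: 0.24.2
```

The multiple "flavors" are what make the format general purpose: a serving tool only needs to understand `python_function`, while framework-aware code can load the native `sklearn` flavor directly.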
In the MLflow registry, you can now see the tags on your model versions and registered models. The comparison view is also enhanced for model versions, letting you compare the schema across two or more prospective versions when deciding which model to move into staging or production. And finally, the model archiving experience is simplified: you can now opt to archive all existing production versions of a model when you transition a new version into production.
This simplifies and de-risks the upgrade process. - [Nagaraj] In this demo, we'll see how MLflow helps us implement MLOps. We're using ElasticNet, which is a regularized regression model, along with the MLflow tracking service, followed by the MLflow model registry.
It's quite an interesting one. Let me get into the code. Before that, there are two points I'd like to highlight. One: we're using Azure Databricks here, but it isn't necessary to use Azure or AWS Databricks, because MLflow is an open source product developed by Databricks.
So you can use it in an on-prem environment as well. However, the benefit of using Azure or AWS Databricks is that the tracking server is available right on the same screen, and MLflow is already available as a package. All you need to do is import MLflow and start using it. If it's on-prem, you have to install the MLflow library first, and then you can start using it. That's the difference. So let me open the first model.
We're using the wine quality dataset, an open dataset downloaded from the web. I added it to DBFS, the Databricks file system, and assigned its location to the wine data path variable. Next, we'll look at two functions. One evaluates the metrics: root mean square error and R-squared. The other is the train function, where we read the CSV file.
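The evaluation function can be sketched with plain NumPy; this is a minimal stand-in for the demo's metrics helper, not the exact notebook code:

```python
import numpy as np

def eval_metrics(actual, pred):
    """Compute RMSE and R-squared, the two metrics tracked in the demo."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    rmse = float(np.sqrt(np.mean((actual - pred) ** 2)))
    ss_res = np.sum((actual - pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
    r2 = float(1.0 - ss_res / ss_tot)
    return rmse, r2

rmse, r2 = eval_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(round(rmse, 4), round(r2, 4))  # 0.6124 0.9486
```

Lower RMSE and higher R-squared mean a better fit, which is what we'll compare across runs in the tracking UI.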
And then, this is the piece where we read the CSV file, do the train/test split, create our ElasticNet model, and pass the parameters alpha and l1 ratio. Then we pass the train and test datasets through it and evaluate the values. You can see we have MLflow here, and it logs each parameter and metric. A quick update: MLflow has a new version that supports autolog, which means we don't have to log each parameter and metric line by line. All you need to do is call mlflow.sklearn.autolog().
That takes care of everything. And it's not only available for scikit-learn; it also supports most of the major libraries, like TensorFlow, Keras, PyTorch, and so on. So if you come back here, you can see the variables. Let's say instead of 0.50 I make alpha 0.60, with the l1 ratio at 0.7, and then run the model and see what root mean square error and R-squared values we get.
It seems to be fine compared to the earlier run. But I want to see how the earlier runs performed, and the best way to do that is the experiment view; this is where model tracking comes into the picture. If I just click here, I can see how each run performed. It's a quick snapshot comparing the different runs.
If you want the details, just click the link, and it takes you to the tracking server with the complete list of runs. The good part is that the UI has the most commonly used features, like filtering your result set. For example, I can say root mean square error greater than 0.8, because I don't want to see anything less than 0.8, and then I see only those runs above 0.8.
Then I can compare them and choose the best one to productionize. I can also download a CSV and use the compare options. Most importantly, there's reproducibility: you can see which code is associated with each run.
It captures the code and makes runs reproducible, so you can rerun them and see how all the metrics were generated. Finally, there are the artifacts, which cover the model definition, the Conda environment details, and the model pkl files.
If you want to attach any other artifact, like a chart or a graph, you can do that too. This is the tracking server capability of MLflow: anyone can trace back a model, check the code used to generate it, and reproduce the run and the experiment.
So once I've finalized the model, I want to productionize it. How do we do that? That's where the model registry comes into the picture. I'll pass this on to my colleague, Nitin.
Over to you. - Thank you, Naga. Now we're going to look at one of the important features in MLflow, which is the model registry. When productionizing, we use the model registry, which enables a review workflow for deployment. It's the integration point between the data scientist and the ML engineer.
As a data scientist, I say: this is the final model, and I want to register it with the model registry. This is the piece of code that registers the model into the registry, and I'm going to execute it.
We can see that the model has been registered. As soon as it's registered, we can go to the Models tab on the left-hand side, where we see that the MLOps demo wine model, version 16, has been registered in the model registry. I can say that version 16 is better than any earlier version and ask the ML engineer, "Can you consider this version for deployment?" So I'm just opening this line item.
For this particular model, we can see the history of versions here, and on the second page, version 16 is available. As a data scientist, I go into version 16, where we can see the stage. Since this user is an admin, we see both options: request that the ML engineer move this version to Staging or Production, or push this version of the model straight to Staging or Production ourselves.
In a real-world scenario, what the data scientist will do is request to move the version from its initial stage to Staging. Let me do that. As a data scientist, I give the command.
As soon as the data scientist makes the request, we can see the activity in the Activities area, and once that step is complete, the ML engineer gets the option to approve or reject this staging move. Before approving or rejecting, the ML engineer will go into the run ID by clicking here. As Naga has already explained, within the run ID of this particular model version we get an in-depth view: all the information for this version, including the pkl file, the MLmodel file, the model itself, and the parameters maintained here.
On this page, the ML engineer can review the parameters passed for this model and the resulting metrics. Once the ML engineer is happy with it, they go back and approve the request by giving a command. As soon as the request is approved, we can see it captured in the activity area. Let's go back to the registered models: the latest model version has been moved to the Staging area.
The downstream applications don't need to know the run ID or the version ID. All they need is the model name, say MLOps demo wine model, and whether to access Staging or Production. Whichever version is moved, the lineage is taken care of internally.
This makes it easy even when multiple downstream applications start using this model, either in Staging or in Production. This is how the CI/CD concept is taken care of. We can also reproduce the code, see which code base was used to generate the model, trace back the run IDs, and see the metrics that were generated. That's the benefit of using the model registry. Now let's go back to the workspace, to our notebook.
Now for the consuming part. This is the piece of code that helps us consume a model registered in the model registry, either in Staging or in Production. All we have to do is give the model name here.
Then we specify whether to access Staging or Production. Let's say I've given Staging, since I want the staging model. Downstream applications don't need to worry about changing the code; they just run it. If we run the code, since we're pointing at Staging, we can see that the latest version, version 16, which was moved to Staging, is accessible now. That's it from the demo side.
Now let's go back to the presentation. - What next? If you want to get your hands dirty with MLflow, just pip install mlflow, or get into Azure or AWS Databricks; it's all ready to dive into. Want to know more about MLflow? You can check the product page, mlflow.org.
You'll find a lot of material there, docs and more. The demo code and the presentation are all available in our repository. Feel free to connect with us on LinkedIn, mentioned in the slide below.
Any time you have, you can ping us; we can have a coffee chat, virtually for now, since the situation isn't so great, and once things are back to normal we can connect face to face. Thank you all for joining today's session. Hope you all enjoyed it. It would be great if you could spare a moment and give us feedback.
I think you have the link for giving feedback. Your valuable feedback helps us improve our next session at the upcoming data summit next year. Thank you all. Have a nice day ahead.