Training and fine-tuning your large AI models using Azure Blob Storage
(upbeat music) - Hello everyone, I'm Vishnu Charan, and I'm a Product Manager at Azure Storage. In this video we're going to talk about training and fine-tuning large AI models using Azure Blob Storage. We'll start with AI and training pipelines in general and then derive the storage requirements from there.
Then we'll look at two important things that power these AI pipelines: BlobFuse2, a tool you can use to mount your storage onto your computing environment, and scaled accounts, which let you access petabytes of data from a storage account in your computing environment. When we talk about AI pipelines and the storage behind them, we want to talk about bringing your domain knowledge to LLMs and how you can power your applications with it. There are three primary ways to do that.
One is of course prompt engineering, where you make sure the LLM understands the context you're putting in, and that doesn't really involve storage. The other two involve a lot of storage: fine-tuning and Retrieval Augmented Generation. Fine-tuning changes the behavior of the model itself, training a general LLM toward a very specific scope of knowledge, while Retrieval Augmented Generation, or RAG as we call it, is about retrieving facts when the LLM is presented with a context, so it's a lot more dynamic. There are multiple scenarios in which you might want to fine-tune your model first and then use RAG along with it.
We'll look at how Azure Storage plays a part in all of these pipelines and how you can build large AI applications on it. Let's start with the generic pipeline. This is the familiar view for machine learning operations, and it starts with ingestion of data. We collect data, clean it, and then go through a data preparation process so the data is available in a form the model can be trained on.
That could be feature engineering, converting the data into different features, or data augmentation. The result is the training data. Once the training data is available, we go through the pre-training process, and during pre-training there's a lot of checkpointing to make sure the weights available at that point in time are saved. Once you've finished training and you're satisfied with the results, you hold onto that model, and you use it as what we've represented here as the base model weights. That's where training is complete. You can then take this model, deploy it in your computing environment, reference it, and develop applications on top of it, which is what we call inferencing.
Now, this is a training-centric view. Fine-tuning is very similar, except we take the base model, use training data specific to the domain you want to fine-tune for, and go through the same process of fine-tuning and checkpointing to produce the fine-tuned weights. So how do we use storage in all of these pipelines? The first step is ingestion.
We want to be able to bring raw data into Azure, which means having pipelines and frameworks with enough capacity to move that data. Then there are the data preparation steps, where we want to integrate with frameworks like Spark, MosaicML, and others. Next we need to train and fine-tune, which means bringing data to the GPU nodes where the models run, and of course checkpointing so the state is preserved. And finally there are the data management pieces: you obviously need secure access to this data, because it's the secret ingredient for all your applications.
And because this can involve terabytes, sometimes exabytes, of data, we need cost-efficient retention frameworks. That's the training and fine-tuning perspective. From an inferencing perspective, it's about how fast we can deploy the model weights into the computing environment, how fast we can access the data to give real-time insights, and of course data management techniques like model versioning.
How do you retain those inputs and outputs for further use, for debugging, and for similar scenarios? Now, the tool we're going to talk about from a fine-tuning and checkpointing perspective is BlobFuse2. What does it actually do? BlobFuse2 provides high-throughput access to Blob Storage, and it's easy to install.
You just run an apt-get install blobfuse2 and you can start working with it, mounting your petabyte-scale storage onto your computing environment in a few quick steps. And you get high throughput for your workloads through caching mechanisms, which we'll talk about in a bit.
It's open source, with a lot of contributions from internal as well as external teams, and it's also supported by Microsoft, so it has the best of both worlds. And most importantly, it provides secure access to data.
You can use MSI (managed identity) authentication, storage account keys, and SAS URLs if you need to. And it uses HTTPS, so data is encrypted in transit as well. Okay, before we talk more about BlobFuse2, let's quickly dive into the demo. I have two screens here. On the left is a D96 VM that I'm logged onto.
And on the right I have a storage account with a container called Blobfuse test. As you can see, it has a one-terabyte file, a couple of 50 GB files, and a few folders with scripts inside them. What I'm going to do is mount this container onto my VM and show you what BlobFuse2 can do and how it does it.
To do that, I need an empty folder, and I've already created one. Let's check that it's empty. There you go, no files there. So I'll just go ahead and mount this container onto that folder. That's it.
It's as simple as that. Let me run that same listing command again. There it is. We've just made whatever data you have in your storage account immediately accessible on the VM, just like any shared storage. I can even go ahead and create files and work with them.
Let me create a very simple one. Okay, let's see if the file is there yet. And that's it, it was immediate. And because it's a text file, I can also read it back and check that it's exactly what I wrote. It is.
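Because the mount behaves like any shared filesystem, your application code doesn't need the storage SDK at all. Here's a minimal Python sketch of the same idea; the mount point /mnt/blobfuse is a hypothetical path, so substitute whatever empty folder you mounted the container onto:

```python
from pathlib import Path

# Hypothetical mount point; use the folder you mounted the container onto with BlobFuse2.
mount = Path("/mnt/blobfuse")

# Write a small text file straight into the mounted container.
(mount / "hello.txt").write_text("written through the BlobFuse2 mount\n")

# Read it back, exactly as you would from any local or shared filesystem.
print((mount / "hello.txt").read_text())

# List what else is in the container.
for entry in sorted(mount.iterdir()):
    print(entry.name)
```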
So it's that simple to work with BlobFuse2. Now let me show you how performant it is. To do that, I'm going to read a couple of the 50 GB files we have here, then write a couple of 20 GB files, and see how fast that goes. Before I do that, I want to show the live network utilization so you can see the performance as it happens. I'm going to use nload here, scaled to 35 Gbps, which is the same bandwidth this VM offers. The top half of the screen shows the incoming bandwidth and the bottom half the outgoing bandwidth, each scaled to 35 Gbps.
Okay, so let me go ahead and read a couple of the 50 GB files and see what happens. I'm going into the mount, to the read files, and I'm going to read both of them. As soon as I hit enter, you can see BlobFuse2 pulling these files directly from the storage account and copying them to the local environment. It immediately maximizes the available bandwidth, and if you had a larger VM, you'd be able to maximize that as well. We've tried smaller VMs and larger VMs: as long as the bandwidth is available and you've configured it, BlobFuse2 will scale to whatever is available and achieve it.
It uses techniques like prefetching and caching, which we'll talk about in a bit. As I was saying, we just copied a hundred GB of files at the maximum available bandwidth. And this isn't limited to the incoming bandwidth; we can maximize the outgoing bandwidth too.
Let me do that with the writes, a couple of them into the same folder. Okay, and that's it. I'm writing two 20 GB files, and we immediately maxed out the available bandwidth. That's how performant BlobFuse2 is when you're using these new block caching modes.
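If you want to reproduce this kind of measurement yourself, a sequential copy through the mount is enough to see how close you get to the VM's bandwidth. A rough Python sketch, with hypothetical mount and file paths:

```python
import shutil
import time
from pathlib import Path

mount = Path("/mnt/blobfuse")               # hypothetical mount point
src = mount / "readfiles" / "50gb-01.bin"   # hypothetical large file in the container
dst = Path("/tmp/50gb-01.bin")              # local destination

start = time.time()
shutil.copyfile(src, dst)                   # sequential read through BlobFuse2
elapsed = time.time() - start

size_gb = dst.stat().st_size / 1e9
print(f"Copied {size_gb:.1f} GB in {elapsed:.1f} s "
      f"({size_gb * 8 / elapsed:.1f} Gbit/s effective)")
```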
Wasn't it amazing to see that BlobFuse2 can maximize the VM bandwidth available to you? Let me talk about how we're able to achieve this. Traditionally, in BlobFuse v1 and also when we introduced BlobFuse2, we had something called the file caching option. With file caching, you choose a space, on local SSD, NVMe, or in RAM, where we cache the data and serve it to the application from there. We download the entire file first and then serve it to the application. This is very useful when you have repeated reads, for example training data that's not going to change much and that you want to keep reading again and again.
We've also seen this be useful in ADAS simulation scenarios, where you want the data loaded onto your computing environment once and then read multiple times. That has been the traditional way BlobFuse2 worked, and it has been performant that way. But as we saw it used more and more in AI scenarios, we started seeing much larger files, and files being streamed directly to these applications.
That's when we introduced the concept of streaming with block caching. It's a new option we introduced recently where we prefetch blocks of a file, with block sizes you can define, into memory, and serve them directly to the application.
The high performance you saw in the demo is primarily because of these block caching techniques. In block caching we use the in-memory cache effectively: you can define the size of the blocks, the number of blocks to prefetch, and the number of parallel threads doing the prefetching. Those are variables you can tune to make sure you're making the most of the compute available to you, as sketched below.
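To give a feel for those knobs, here's a sketch of a BlobFuse2 config rendered from Python. The block_cache key names (block-size-mb, mem-size-mb, prefetch, parallelism) and the component pipeline are assumptions based on my reading of the public BlobFuse2 documentation, so verify them against the README for the version you install:

```python
import yaml  # pip install pyyaml

# Illustrative block-cache tuning for a large VM. The key names below are
# assumptions based on the public BlobFuse2 docs; verify them before using.
config = {
    "components": ["libfuse", "block_cache", "attr_cache", "azstorage"],
    "block_cache": {
        "block-size-mb": 16,   # size of each prefetched block
        "mem-size-mb": 4096,   # total in-memory cache to use
        "prefetch": 12,        # how many blocks to prefetch ahead
        "parallelism": 64,     # worker threads fetching blocks
    },
}

with open("blobfuse2.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```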
And if you want to keep BlobFuse2 from competing with your application for CPU, you can restrict the number of threads and prefetches it runs, so you can optimize performance accordingly. This is the technology that drives the high performance you see in BlobFuse2 today. We saw how BlobFuse2 can maximize your VM bandwidth, both incoming and outgoing, but let me connect it back to what we discussed earlier: AI workloads, fine-tuning, and especially checkpointing.
For fine-tuning, or training for that matter, checkpointing is an important part of the exercise, because you want to reduce the GPUs' idle time as much as possible and store the state of your model as fast and as frequently as possible. So I'm going to show you a fine-tuning example and see how well BlobFuse2 handles it. I've written a very basic script to demonstrate this, a sample script we'll also upload to the BlobFuse2 repository. I'm using PyTorch, downloading the GPT-2 large model from Hugging Face, loading it into my computing environment, and then fine-tuning it on some very simple random text, because fine-tuning itself isn't what I want to show here; the checkpointing piece is. I'm running three epochs of this fine-tuning.
That shouldn't make a big difference either way. Once the fine-tuning is done, I'm going to checkpoint the model directly to the folder I've mounted. That means there are no intermediate steps: you're not going to the local SSDs.
You're not even going to the NVMe drives or the RAM; the checkpoint goes directly to your storage account with this approach. I also print a basic confirmation when it's done, and I'll show that as well. Let me go ahead and run this from the fine-tuning folder; a rough sketch of what the script does is shown below.
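The actual script will be in the BlobFuse2 repository; this is only a minimal sketch of the same idea, assuming the torch and transformers packages and a hypothetical mount at /mnt/blobfuse. The point is simply that the checkpoint path is the mounted folder:

```python
import time
from pathlib import Path

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical BlobFuse2 mount point; use the folder you mounted the container onto.
mount = Path("/mnt/blobfuse")
ckpt_dir = mount / "checkpoints" / "gpt2-large-finetuned"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load GPT-2 large from Hugging Face. On a warm machine the weights come out of
# the Linux buffer cache, so this loads in a few seconds.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Placeholder "fine-tuning" on some simple text; checkpointing is the point here.
batch = tokenizer("some random domain-specific text", return_tensors="pt").to(device)

model.train()
for epoch in range(3):
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Checkpoint straight to the mounted container: no staging on local SSD or NVMe,
# the bytes stream to the storage account through BlobFuse2's block cache.
start = time.time()
ckpt_dir.mkdir(parents=True, exist_ok=True)
model.save_pretrained(ckpt_dir)
tokenizer.save_pretrained(ckpt_dir)
print(f"Checkpointed to {ckpt_dir} in {time.time() - start:.1f} s")
```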
Okay, before I do that, I want to go back to nload so you can see what's happening on the network and inside the VM. The moment I start the fine-tuning script, it begins loading the model, and the model loading is already done. Fine-tuning has started; that shouldn't take long because we have a large VM. And immediately after that we're checkpointing the model, and you can see that blip on the right in the outgoing bandwidth: that's us writing almost 3 GB of data directly to the storage account. There are two things going on in this process. One, you saw the model load in about three seconds; that's because BlobFuse2 uses the Linux buffer cache.
I had run this model once before this video, so it's already in the buffer cache. The same is true for training data: it can be picked up straight from the buffer cache if you have enough memory available. The second thing to note is that the checkpoint went directly to the storage account, which is the streaming-with-block-cache part. So you saw both kinds of caching at play in this fine-tuning demo. Having shown that BlobFuse2 can hit this performance at the VM level, including on larger VMs, we wanted to push further, and you can try this yourself: we took a large AKS cluster, used the AKS CSI driver with BlobFuse2, and ran benchmarks on top of it to find out how hard we could push our storage accounts and the BlobFuse2 tool in general. Here's an overview of what we did.
We took an AKS cluster, using Spot VMs because we didn't want to disturb customer environments; this was an experiment to understand our limits. We used a lot of D96ds and D96ads Spot VMs, about 350 of them, running jobs.
That's almost 70,000 cores right there. We used BlobFuse2 with the AKS CSI driver and worked with around two petabytes of data. As you can see, this isn't something you'd normally do; it's what large customers operating at AI scale would do. We tried to replicate that and see how well our storage systems could handle it, and we achieved phenomenal results within the environment we had.
We were able to write about 35 to 40 terabytes of data per minute consistently, and then read it all back at about 80 to 85 terabytes per minute. So we wrote 2.2 petabytes of data in about an hour and then read it all back in almost half the time.
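Those sustained rates line up with the totals; here's a quick back-of-the-envelope check in Python:

```python
total_tb = 2200               # ~2.2 PB expressed in TB
write_rate = (35 + 40) / 2    # TB per minute, sustained write
read_rate = (80 + 85) / 2     # TB per minute, sustained read

print(f"write time ≈ {total_tb / write_rate:.0f} min")  # ≈ 59 min, about an hour
print(f"read time  ≈ {total_tb / read_rate:.0f} min")   # ≈ 27 min, roughly half that
```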
That's the overall picture, but what's also interesting are the peak throughputs we hit during the run: 8.1 terabits per second for ingress, and 13.5 terabits per second while reading the data back. And by the way, this was just for this experiment, and not even in one of our top regions.
We just found a region with some space and some Spot VMs and started the experiment, and we stopped only because that was the amount of resources available to us. But you can scale further; we've had customers scale bigger than this, and you can run your own tests to find out how well your environment scales with BlobFuse2 and our storage systems. Now, you might be wondering: BlobFuse2 can scale to the VM's bandwidth, but at this kind of scale, how is the storage account keeping up? That's where I want to talk about scaled accounts. Scaled accounts aren't something you'll see as an option to configure.
They're just available to you. Our storage accounts now scale far beyond the old limits, with scale units that can be brought together to reach the high throughputs we showed you. In fact, we have storage accounts with multiple petabytes of data, hundreds of petabytes in a single account, and even in a single container that you can access. And there's no change to how pricing works, because this is just your normal storage account. You simply request access to these larger scales, and it's enabled automatically. All of our large customers use scaled accounts to get that high throughput and large capacity in a single container.
Thank you. That was the overview of how you can train and fine-tune your large AI workloads on Azure Blob Storage. To recap what we learned: BlobFuse2 and scaled accounts give you access to a scalable storage environment, with even exabytes of data and multiple terabits per second of throughput. It's also cost effective: you can apply tiering any time you like, using hot, cool, or archive if you want to keep the data for a later stage or just retain it for compliance purposes, for example as sketched below.
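For instance, once a run is finished you can tier old checkpoints down without moving them anywhere. Here's a sketch using the azure-storage-blob Python SDK; the account, container, and blob prefix names are hypothetical placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical names; substitute your own storage account and container.
service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("blobfuse-test")

# Archive checkpoints from an old run for retention/compliance.
for blob in container.list_blobs(name_starts_with="checkpoints/old-run/"):
    container.get_blob_client(blob.name).set_standard_blob_tier("Archive")
    print(f"Archived {blob.name}")
```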
And because BlobFuse2 is a mounting solution, it integrates automatically with your ML frameworks. We also have Hadoop-style APIs, like the ADLS APIs, that you can integrate with data analytics engines. And because BlobFuse2 is a client-side tool, just a mount you can set up on any computing environment, you can use it on GPU clusters, AKS, or any container-based clusters you'd like. So we've seen how training and fine-tuning your large AI models is possible using Azure Blob Storage.
If you'd like to learn more about the RAG and inferencing scenarios and how Azure Blob Storage can help there, please check out the other video mentioned in the description. Thank you for your time. There are more useful links in this deck that you can look at, and please do fill out the survey, which will help us build better solutions for you in the future.
Thank you. (upbeat music)