Inside Microsoft AI innovations with Mark Russinovich | WAKEY05
[MUSIC] [MUSIC] SPEAKER 1: We are in a world of opportunity to create experiences that delight customers more than ever before. To make split-second business decisions, smarter. To deliver impact, faster. At Microsoft Azure, we believe in the power of technology to drive limitless innovation and make your vision possible with a technology partner who has the experience and expertise you can trust. We are dedicated to your security and responsibility, ready to meet you wherever you are in your Cloud journey.
With purpose-built AI supercomputing infrastructure, combined with industry-leading AI services, creating game-changing experiences is simpler and easier. With a suite of tools all working together, and the freedom to use whatever language and tools you prefer, you can collaborate, design, and accelerate building intelligent apps. Innovate faster with technology that is unified and efficient: streamline operations, consistently manage solutions, unify data, and get the insights you need to make intelligent decisions to drive your business forward, all from one place, in Azure. Your success has always been our business. So you can continue to trust that your data is always safe, securely stored, and completely owned by you. And you can take your business to new heights with confidence, by embracing technology with built-in tools to help you use AI responsibly.
As you build today's solutions and explore the edges of what's possible tomorrow, we'll continue to innovate to support you every step of the way. [MUSIC] [MUSIC] SETH JUAREZ: Please welcome Azure Chief Technology Officer and Technical Fellow, Mark Russinovich. (applause) MARK RUSSINOVICH: Hello, everybody. Good afternoon. How's Ignite going? That good, huh? How's Ignite going? (audience cheering) It's the best Ignite so far this year.
(laughter) Welcome to Inside Microsoft AI Innovation. Today, what I'm going to talk about is, surprisingly, AI. I'm going to talk about the innovation we've put into the entire stack, from the very bottom, our data centers, up into our servers, up into our accelerators, up into the software services that are the platform where we do AI training and inference, up into how we develop our products, including some of the technical capabilities for how we actually make sure these systems act reliably. Then I'll talk about some of the innovation that spans from top to bottom. I'm going to start by talking about our sustainable data centers. By the way, because this is an AI talk, and because AI is sometimes used to generate jokes, I thought it'd be funny to see what jokes ChatGPT would come up with for these sections.
I had to filter through a bunch of them, but I picked a few that I thought were funny, and I'll just start with this one. Why did the sustainable data center get an award? ...because it was the coolest place in town, and I'm not just talking about the temperature. (laughter) If there's one thing ChatGPT is good at, it's bad jokes. (laughter) Well, let's talk a little bit about the evolution of our data centers.
If you take a look at that evolution, it goes back to where Microsoft started, which was with colo facilities back in 1989. The metric you use to measure the efficiency of a data center is PUE, power usage effectiveness: the ratio of total energy going into the data center to the energy that actually reaches the IT equipment. A 1.0 is the best you can do. So you can see when we started, we were at 2.0. In 2007, you can see that we were at 1.5-1.8, and this is the typical range that, even today, most enterprise colo facilities operate at. We went through multiple generations, exploring different architectures. You can see there that we had modular data centers. Back in 2012, we decided that wasn't providing the efficiency we were really going for, and so in 2015, we started to build these hyperscale data centers, and you saw a picture of one in the video.
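To make the PUE arithmetic concrete, here's a minimal sketch; the facility numbers are made up for illustration, not figures from the talk:

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total energy entering the data
    center divided by the energy that reaches the IT equipment.
    1.0 is the ideal, where every watt goes to the servers."""
    return total_facility_kw / it_kw

# A colo-era facility: 9,000 kW in, 4,500 kW to the racks -> PUE 2.0,
# i.e. a full watt of overhead (cooling, conversion) per IT watt.
print(pue(9000, 4500))
# A modern hyperscale facility runs much closer to 1.1:
print(round(pue(9000, 8180), 2))
```

At a nine-megawatt footprint, the difference between 2.0 and 1.1 is several megawatts of cooling and power-conversion overhead, which is why the metric matters at hyperscale.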
We've now proceeded into the 2020-2023 timeframe, where we've taken new steps forward in simplifying the electrical and mechanical facilities inside to make them more reliable and more efficient. We've also made them operate at higher temperatures, and we've made them support liquid-to-air cooling as well as liquid-to-liquid cooling. One of the other things we've done is operate them with what's called flex capacity, which means we oversubscribe the backup power in the data center. What that means is that if there's a failure of utility power, that part of the data center is going to lose power. That part of the data center is where we put our three-nines workloads, which are highly reliable multi-regional workloads that can tolerate a data center in one region going offline. We don't put production customers in that flex capacity.
But it allows us overall, because we've optimized our workloads to support that, to operate more efficiently. Our latest data centers have 2.4 megawatts of flex capacity in a nine-megawatt footprint. Now, one of the other things we're looking at for sustainability is not just the utility power inside the data center, but how we do that backup power.
Up to now, we've been using diesel generators. That's the standard for how you provide backup power for your data centers. When the utility goes offline, you fire up the diesel generators, which are obviously not very environmentally friendly.
We've been exploring different ways to provide that backup, like fuel cells, since 2013. In 2018, we turned our attention to hydrogen fuel cells. In a hydrogen fuel cell, what you do is take in hydrogen and split off the proton and the electron. You have a membrane, which is why it's called a proton exchange membrane, and the protons are allowed to go through the membrane.
The electrons go around, power the servers, come back, meet the protons, mix with air and you produce water. So it's extremely green. No combustible materials at all in the process.
We've done various pilots. Back in 2018, when we started exploring this, we started with 10 kilowatts. What made this challenging is that it has been a very expensive process to develop; there were no off-the-shelf solutions we could go to. So, working with hardware suppliers, we started to build out these systems. Our goal is to drive the cost down, standardize this, and make it work at scale, which will make it possible for everybody to take advantage of it.
After the 10 kilowatt run, we powered 50 kilowatts of servers with the backup. Then we moved to 250 kilowatts. In June 2020, we actually had this 250-kilowatt system powering 10 racks, which is a row of servers, for a total of 48 consecutive hours, which is the target for diesel generator backup power.
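For a rough sense of scale, here's a back-of-the-envelope sizing of that 250-kilowatt, 48-hour run. The fuel cell efficiency and hydrogen energy density here are textbook assumptions, not figures from the talk:

```python
power_kw = 250                 # pilot fuel cell output
hours = 48                     # diesel-parity runtime target
energy_kwh = power_kw * hours  # energy delivered over the test
print(energy_kwh)              # 12000 kWh

h2_lhv_kwh_per_kg = 33.3       # lower heating value of hydrogen (textbook)
fuel_cell_efficiency = 0.5     # assumed PEM fuel cell efficiency
h2_kg = energy_kwh / (h2_lhv_kwh_per_kg * fuel_cell_efficiency)
print(round(h2_kg))            # on the order of 700 kg of hydrogen
```

Scaling those same assumptions to a three-megawatt system for 48 hours multiplies the hydrogen storage requirement by twelve, which gives a feel for why the fueling infrastructure in the photos is substantial.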
With those successful pilots behind us, we decided to move on to a larger-scale system, a three-megawatt system, which we brought online last year. We completed a full test of taking the servers and powering them entirely with this 3 megawatts of backup hydrogen fuel cell power. There you can see the exhaust, which is just water vapor coming off. There you can see how we refill the system. That's just one example of how we're trying to make our data centers more and more green. Now going inside the data center, we need to get to the servers and make sure that they're efficiently cooled.
One of the challenges we've been facing is the rising thermals of CPUs. You can see in this chart that, as time has gone on, CPUs are consuming more and more energy, more watts, which means more to cool. The same thing has been happening with GPUs, even more drastically. You can see the jump from a little under 450 watts for an A100 chip to 750 watts for an H100 chip, the Hopper chip, which is the current one. Air cooling is not that efficient: you need to bring in large volumes of air to cool a watt of energy. So we've been exploring liquid cooling. If you've come to my Azure Innovation talks in the past, you've seen me talk about various types of liquid cooling we've been exploring.
What we're focused on right now is cold plate cooling, and you saw this in Satya's Keynote, the Sidekick sitting next to the Maia chassis, which is liquid-cooling those Maia parts. I want to take you in and show you a little underneath the hood. This is one of the Maia boards there, and you can see the accelerator modules there at the top right. There's four of them. You can see that the water cables are coming up from the bottom left there to cool those. That would be
coming from the Sidekick supply. And if we zoom in a little bit on those, there's a water intake and a water outlet. The cool water goes in, the hot water comes out.
This is a closed-circuit system, so it's very water efficient as well. If you take a look at the accelerator module itself, here's an even closer look at how the water comes into the system through that module, with the top taken off, and there's where it plugs in. That's an example of water cooling.
It's traditional cold plate water cooling. There's a cold plate on top of the part; as the part heats up, the plate gets hot, and the water flowing over it takes the heat away. But we think there's a dramatically better way to cool using cold plates, by combining cold plates with engineering of the parts themselves. If you take a look at the cold plate on the left, where you can see the part there and the water coming in and going out through those tubes, those pipes, what we're exploring is something called microfluidics cooling. The idea with microfluidics cooling is that we actually etch channels in the silicon, places where the water can flow while directly touching the part, instead of going through a cold plate. When you take a look at the microfluidics cooling cross-section here, you can see the interposer there at the bottom, you can see that gap there, that inlet, that's where the water comes in. You can see those red layers, those are actually the logic, whether it's a CPU, GPU, accelerator or an FPGA. Using this technology, we can actually stack them as well and have the water flow over them through these microfluidics channels. When we break that out into a 3D view, this is what it looks like from the side.
Those black areas in the top view are what are called micro-pin fin heat sinks. As those heat sinks heat up, because they're taking heat off of the part, the fluid flows in through the coolant inlet, flows across those heat sinks, and out through the other side. With this kind of approach, we get two to three times the performance of cold plates, and it supports heat flux values as high as 1,000 watts per square centimeter. Here's another look at this. This is actually a Core i7-8700 CPU, which is a 95-watt part. We're able to overclock it to produce 215 watts of power and still cool it with microfluidics. This is decreasing the thermal resistance by a staggering 44.5 percent against the original heat sink design that comes with cold plates. This is an extremely promising direction, really taking and supercharging cold plate cooling technologies. Now let's turn our attention inside from the servers to the parts inside the servers when it comes to AI.
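One way to read that 44.5 percent figure: junction-to-coolant thermal resistance is just temperature rise per watt. The talk only gives the relative improvement, so the temperatures below are hypothetical, purely to show the arithmetic:

```python
def thermal_resistance(t_junction_c, t_coolant_c, power_w):
    # Degrees C the part sits above the coolant, per watt dissipated.
    return (t_junction_c - t_coolant_c) / power_w

# Hypothetical: a part at an 85 C junction with 30 C coolant,
# dissipating the 215 W from the overclocking demo.
r_cold_plate = thermal_resistance(85, 30, 215)
r_microfluidic = r_cold_plate * (1 - 0.445)   # 44.5% lower resistance
print(round(r_cold_plate, 3), round(r_microfluidic, 3))
```

Lower resistance means either a cooler part at the same power, or more power at the same temperature, which is exactly the overclocking headroom the demo exploits.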
There's a couple of pieces to this. One of them, of course, is the AI-specific server as a whole. You've seen some of the stats; Satya talked a little bit about this. These are the AI supercomputers we've been building for OpenAI, for them to train their models on. The first one we built, we launched back in 2020. It had 10,000 V100 GPUs. We estimated that if we'd submitted it to the top 500 supercomputer list, it would have ended up in the top five supercomputers in the world, and the largest one in the public cloud.
Now you know that Satya talked about the latest generation of the supercomputer, which is actually two generations after the one on the far left, because GPT-4 was trained on another generation of A100 supercomputers. But this H100 generation of supercomputer that we'll be building out to support the training of the next generation, or the current generation that OpenAI is working on, we've actually formally submitted to the top 500. It's 14,400 H100 GPUs. It came in at number 3 on that top 500 list of the largest supercomputers in the world, and it's the largest supercomputer in the public cloud. Now, Satya talked about this as a fraction of the system that we're building for OpenAI. What is the whole system? What is the size of that system? Where would it end up within the top 500? I worked hard with the AI Ops program to see what I could publicly tell you, and I'm pleased to say that they've allowed me to tell you this. (laughter) (applause) Now one of the things that makes these supercomputers so useful for these large-scale workloads is that they're using InfiniBand back-end networks that connect all the servers together with very high bandwidth and extremely low latency.
With InfiniBand, you can get latencies of about 1.5 microseconds between the servers, and up to 3.2 terabits per second off of each individual VM. We support all the standard primitives for MPI, and NCCL as a library for AI synchronization of weights and parameters across those GPUs. Now one of the things that I think is interesting is showing you actual pictures of hardware. You might have seen something like this before: this is the back view of the chassis running those H100 systems.
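To make the NCCL piece concrete: the workhorse collective for synchronizing gradients is all-reduce, and the classic bandwidth-optimal ring algorithm logically looks like this toy version. NCCL's real implementation is heavily optimized and runs over NVLink and InfiniBand, not Python lists; this only models the data movement:

```python
def ring_allreduce(buffers):
    """Every rank ends with the elementwise sum of all ranks' buffers.
    Phase 1 (reduce-scatter) circulates partial sums around the ring;
    phase 2 (all-gather) circulates the completed chunks."""
    n = len(buffers)            # number of "GPUs"
    k = len(buffers[0])         # buffer length
    assert k % n == 0, "toy version: length must divide evenly"
    c = k // n

    def chunk(idx):             # index range of chunk idx (mod n)
        lo = (idx % n) * c
        return range(lo, lo + c)

    # Reduce-scatter: after n-1 steps, rank r holds the fully
    # reduced chunk (r+1) % n.
    for step in range(n - 1):
        for rank in range(n):
            dst = (rank + 1) % n
            for i in chunk(rank - step):
                buffers[dst][i] += buffers[rank][i]
    # All-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        for rank in range(n):
            dst = (rank + 1) % n
            for i in chunk(rank + 1 - step):
                buffers[dst][i] = buffers[rank][i]
    return buffers

grads = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # 3 "GPUs", 3 values each
print(ring_allreduce(grads))               # every rank: [12, 15, 18]
```

Each rank sends only about 2(n-1)/n of its buffer per all-reduce, which is why per-link InfiniBand bandwidth, rather than any single node, is what bounds scaling.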
You can see the light blue-colored wires, those are the InfiniBand cables, which, as Satya mentioned, amount to 20,000 miles of InfiniBand cable in a data center, enough to wrap around the world a few times. Then here you can see one of the chassis pulled open and the 8 H100s per server inside it. Now what makes Azure somewhat unique is that those supercomputers we're building for OpenAI, with those H100 servers, with that massive InfiniBand network, are exactly what we make available to our customers, for you to use if you want to do training or inference in the public cloud. The only difference is the size of the InfiniBand network. Our public clusters don't require the scale of an OpenAI training job, and so they're smaller InfiniBand networks of a few thousand servers instead of tens of thousands of servers. But otherwise, it's exactly the same parts, and in the H100 case, we offer that through the ND H100 v5 VMs.
You can see that you can have a single GPU, you can have eight GPUs connected with NVLink, or you can have that InfiniBand back-end network and use it across your servers. That kind of system we formally submitted to another benchmark, MLPerf, and Satya talked a little bit about this. Here's some more detail behind that. The previous record for training BERT, which is a 350 million-parameter model, an older large language model, was set back in March of 2020.
5.4 minutes to train that model was the world record. Now, for GPT-3, which is a 175 billion-parameter model, the record for training was set earlier this year at 10.9 minutes. Our run on the same hardware that I just talked about trained GPT-3 in four minutes, across 1,300 ND H100 v5 VMs. That is the world record: four minutes to train something that, back in 2020, on that supercomputer I showed you, took weeks to train. Now four minutes. And we're virtualized as a public cloud, but we did a run with Nvidia jointly.
That run produced GPT-3 training in 3.92 minutes, so about two percent performance overhead from the virtualization that we've got in our public cloud. We're offering the same thing, as Satya talked about, through a partnership with AMD, bringing to market AMD's MI300X accelerators for training and inference. You can see the architecture here and some of the unique characteristics of this part: 192GB of HBM3 memory. That high-bandwidth memory is what allows you to do large language models; these very large models require a huge amount of memory, and it has to be extremely low latency and high bandwidth, which is what HBM provides you.
You can see the same kind of configuration options, up to eight of them in the same server that you can use, also connected on back-end networks with InfiniBand, with 400Gb dedicated links per GPU. We're also looking at other types of accelerators. One of the places where we've seen huge demand is video transcoding. If you think about it, it's obvious: all those Teams calls are video transcoding.
What we want to do is AI processing on top of those video streams. We want to be able to do effects like you're always looking at the camera instead of asleep when you're in the meeting. We want to make it so that the AI model can understand what people are doing, who's in the image.
That means that we need to efficiently process those images on a GPU with models that can process those images. Now the challenge is that that GPU is an expensive part and it consumes a lot of power. It's not designed specifically for video transcoding, although you can use it for video transcoding.
But what we're going after is very efficient transcoding. By efficiency we mean providing the transcoding within the appropriate latency envelope, but also with extremely low power. We've been working on our own custom AI video accelerator, and this video accelerator can handle both the decode and the encode tasks on either side of the AI processing that might happen on an image. Let's take a quick look to see just how efficient this thing is. Here on the left side, I've got an H100, and that's what I'm going to use to process an ffmpeg video stream while I'm measuring the amount of energy it consumes.
On the right side, I'm using our custom video accelerator transcoder, running the same benchmark on it. You can see that the efficiency, measured in frames per second per watt, on the H100 is about 30.
On the custom video accelerator, it's about 150, roughly five times more efficient with a custom part. This is an example of how, when you get to enough scale, it totally makes sense to build a custom accelerator rather than use something that's designed for a different type of workload. Now, the next kind of accelerator I'm going to talk about is one that Satya announced, which is the Maia part, the Maia 100.
This part has 105 billion transistors on a five-nanometer process, making it one of the largest processors created at that node, and one of the largest created in general. It has obviously been custom designed for Microsoft AI workloads. Now, one of the things I'm going to show you is a behind-the-scenes look at one of those parts. This is what's called a probe table, which we use while we're developing the part. One of the things the system engineers need to do is stick probes in so they can measure the electrical signals on the part and the system-on-a-chip, or SOC, before we actually put it in a server chassis and get it online and running. As you can see, I've got one, a live one, here. This is a Maia 100 sitting here. This is the part here, underneath this. Now you can see that this isn't liquid cooled. We don't see a Sidekick here like you saw on the stage in Satya's demo. We've got a bunch of fans. In fact, when I saw this, it reminded me of my college dorm room on a hot day.
There's me in the middle. You can see that there's the Maia part. Over here are fins on the power.
Here you can see debug ports for the developers to be able to debug the whole system. Then down here on the lower left is the security processor, the security module that controls the security of the system as it boots up the firmware. Now that is the hardware side of Maia.
The software side of Maia looks like this. There's models and applications, of course. There's frameworks, PyTorch being one of the most popular. There's the ONNX Runtime, which is used to accelerate and abstract some of the underlying hardware.
Then you can see the Maia SDK, which includes programming models, compilers, developer tools and libraries that we've been developing, and then a runtime for that, sitting on top of those Maia accelerators. This is a representation of the Maia-specific view of our overall vision for AI development down at the hardware level. You've got models and applications, you've got the frameworks, but one of the challenges with frameworks and the diversity of different accelerators that you see, Nvidia parts, AMD parts, now Maia, is having to write what are called custom kernels to optimize AI operations on top of a particular piece of hardware. We've been partnering with OpenAI on a project they started called Triton, which abstracts the underlying hardware and lets somebody develop a kernel in a domain-specific language.
Instead of using CUDA directly, you use Triton. What Triton can do is compile down to CUDA in a very efficient way. This is actually what OpenAI is using for all of their development. Working with them, we're creating a Maia API underneath, so you'll be able to take Triton and target Maia. But you can also target, of course, AMD and ROCm, which is their low-level kernel API.
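That single-source portability idea can be sketched like this. Everything here is hypothetical stand-in code, not the real Triton, CUDA, ROCm, or Maia SDK APIs; it only models writing a kernel once and lowering it per backend:

```python
def saxpy_kernel(a, x, y):
    # Single-source kernel definition: y <- a*x + y, elementwise.
    return [a * xi + yi for xi, yi in zip(x, y)]

class Backend:
    """Toy stand-in for a per-device compiler target."""
    def __init__(self, name):
        self.name = name

    def compile(self, kernel):
        # A real backend (CUDA, ROCm, Maia) would emit device code
        # here; the toy just tags the callable with its target.
        def launched(*args):
            print(f"[{self.name}] launching {kernel.__name__}")
            return kernel(*args)
        return launched

# The same kernel source targets any registered backend:
for target in ("cuda", "rocm", "maia"):
    run = Backend(target).compile(saxpy_kernel)
    print(run(2.0, [1.0, 2.0], [10.0, 20.0]))
```

The real Triton compiler does the hard part, generating efficient device code per target, but the developer-facing shape is the same: one kernel definition, many backends.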
This way, building on top of Triton and ONNX and PyTorch, you have complete portability and flexibility without sacrificing efficiency in targeting AI workloads. One of the things that I want to show you is something Satya didn't show: a demo of Maia. We actually have Maia up and running, and I wanted to show you it serving GitHub Copilot.
Here on the left, you see a Jupyter Notebook that I've got. Here's the settings JSON, which points the Jupyter Notebook Copilot configuration at the local Maia server that I've got here. Up on the top is a network graph showing you the inter-accelerator network processing. When we have Copilot create a completion, you can see on the top right that we get inter-accelerator network activity as that inference happens across the accelerators. Here's a larger one, a bubble sort in Python. You can see that Copilot was able to write the whole thing, with some nice documentation.
Then you can see the amount of inference processing that happened there on the top right, reflected in that inter-accelerator network traffic on top of the Maia part. This demonstrates how far along we are with the development of the Maia system and bringing it online to support workloads like GitHub Copilot. Now another way that we're doing acceleration, besides just directly accelerating AI workloads, is to accelerate the server operations underneath them. AI workloads, like other IT workloads, perform a lot of storage processing, both remote as well as local storage, and a lot of network processing as they communicate with other services. If you take a look at the traditional infrastructure as it exists in most on-prem environments, and as it existed in Azure until just recently, you see an architecture like this, where the customer workloads are sitting in VMs on top of a host operating system, and that's where the network, local storage, and remote storage processing happens. What that means is that you're not directly talking to the hardware from your VM, which means you've got software overhead in the middle.
It also means that you're burning a lot of CPU on the server just doing that IO-specific processing. Again, another example of acceleration: when the workload gets to enough scale, it just makes sense to build custom acceleration for that workload. And that's what we've done with Azure Boost. With the dedicated offload part here, you can see that we've moved the processing and the agents for the data plane processing of local storage, remote storage, and networking off onto an SOC, an ARM-based part sitting there as a card inside the server chassis.
Using this, we're able to accelerate all those different types of operations. If you take a look at the architecture before, for remote storage, you can see the VMs talking through a SCSI interface to the host, through VMBus, standard Hyper-V, which then gets translated down to the remote storage stack. But with the Azure Boost re-architecture, now you've got a security and resource boundary. Azure Boost is providing virtualized devices directly connected to those VMs, and that means that those data operations, those data paths, go directly from the VM out to the hardware and out to the network. With this re-architecture, we've been able to achieve 650,000 IOPS, which is two times the previous offering on the same exact server part, and 12.5 gigabytes per second of throughput, which is a 25 percent increase over the same thing on the same server part. We've done the same thing with local storage. For local storage, the architecture prior to Azure Boost looks like this. Again, SCSI through VMBus, down to the software stack, down to the local SSDs.
With the re-architecture, no surprise, we're projecting those NVMe devices directly into the virtual machine, so IO goes right to the accelerator, which is sitting there on top of the SSDs. With this re-architecture, we're able to get up to 3.8 million IOPS, which is 7.4 times the previous version on the same hardware, and you can see similar gains in the amount of local storage bandwidth you can get. Same thing for networking.
Up to 200 gigabits of throughput, and a nine-times servicing improvement; by that I mean that by taking those components off the server, we're able to service them much more efficiently without impacting the virtualized workloads. Let me show a demo, and before I do, I thought you might be interested in seeing an actual Azure Boost part here, the production Azure Boost. You can see that right here is the SOC; underneath this is the heat plate. Then you can see that underneath there is the FPGA, labeled Microsoft.
Here's the back of the part, and this is the seating for that FPGA there. The SOC is over here. Let's take a look at a demo and see just how fast this thing is. Here I've got a tool called Iometer on the left. This is a virtual machine of the previous generation, ESDv5, with 16 disks attached to it. We're going to hammer it as hard as we can to see how many IOPS we can get off it.
On the right side is one configured with Azure Boost, also with sixteen disks. We're going to try to drive it as hard as we can to see what it produces. There, on the left side, you can see we're maxing out at about 250,000 IOPS, and, like I mentioned, we're at about 650,000 IOPS on the right. Now this is a local storage benchmark right here, FIO running on Linux. This just hammers the local disks as hard as it can.
Let's see how many IOPS we can get off our SSDs. On the left side is the older generation of the VM type. You can see 450,000 IOPS, and on the right side, with Azure Boost, 3.8 million IOPS. That's the kind of performance that that offload acceleration gets you.
By the way, feel free to clap at any point. (applause) Now what I'm going to do is go a little higher in the stack, and talk a little bit about how we serve our AI models and what we train them on. We do that in a system called Project Forge; that's the code name for it, we haven't come up with an official name yet. Project Forge is what all our internal training and inference workloads run on. It's something that came out of an incubations team in the office of the CTO, has graduated into the Azure Machine Learning team, and is now in production.
What we've been doing is migrating Microsoft workloads on top of it, and we'll open it up to customers in the near future. What makes Forge unique is that it's purpose-built for AI workloads, because AI workloads, both training and inference, have characteristics that are different from traditional IT workloads. You can see at the bottom that it abstracts the hardware infrastructure, so that with the hardware abstracted and models on top of those frameworks I talked about earlier, we can place a workload on whatever hardware is available that meets the workload's requirements in terms of latency and cost.
At the middle there, you can see there's a reliability system and I'm going to come back to talk about that in a minute, and at the top, is a global scheduler. Really, fundamentally, what Project Forge is, is a resource manager for AI capacity, whether it's CPU, GPU, accelerator, FPGA. That global scheduler means that it has a view of capacity across all Azure regions. This is really key, given some of the hardware capacity is limited, especially as it's new and rolling out and you have workloads that say, I need the latest and greatest. But that latest and greatest might not be available in all regions.
With Project Forge, you can tell it, hey, this workload needs H100. Project Forge can look across the global capacity of our fleet and say there's H100 capacity in this particular region. And if your workload has said that it's okay running in that region, Project Forge will place it there.
Reasons why you might not be able to run there include, of course, data sovereignty and latency restrictions. But Project Forge takes that into account when it places workloads. And that means you minimize fragmentation. You don't have capacity sitting in some place that's unreachable because your workload, when you deploy it, specifically says, I need to go to this region, when it actually could work in another region.
Project Forge can take that into account and spread things around. The other thing that the global resource manager does is treat capacity not as physical but as virtual. If you're like us, you have this situation in your own company, especially if you've got on-prem hardware for AI, where you've got different teams that are assigned different GPU capacity, and it's dedicated to them.
What that means is two things. If they're not using it all, the excess is sitting there wasted; nobody's making use of it. But if they need more than what you've given them, they've hit a wall, and while the team next door has GPU capacity available, it's just not accessible. What Project Forge does is, again, create this global pool, not just across all regions, but across all the hardware in those regions, and teams get virtual clusters: they get high-priority access to their virtual assignment and lower-priority access to everything else. What that means is that those low-priority workloads can go and run anywhere in any of those virtual clusters, on the physical GPU capacity assigned to them, if it's not in use.
If it is in use, the low-priority workloads get evicted. With this global pool view, as we've been migrating Microsoft workloads onto it, even internally, for our AI workloads, we've increased the actual real utilization of our GPU fleet from about 50-60 percent to 80-90 percent, and we think we can go higher. This is a dramatic improvement in efficiency, because we're looking at all of this AI capacity together.
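The quota-plus-borrowing mechanics described above can be sketched with a toy model. The names and the policy here are hypothetical, purely to illustrate high-priority jobs within quota, low-priority borrowing of idle capacity, and eviction when the owner needs it back:

```python
class Pool:
    """Toy shared GPU pool with per-team virtual quotas."""
    def __init__(self, total_gpus, quotas):
        self.total = total_gpus
        self.quotas = dict(quotas)        # team -> owned GPUs
        self.running = []                 # (team, gpus, priority)

    def used(self, team=None):
        # GPUs in use, either pool-wide (team=None) or for one team.
        return sum(g for t, g, _ in self.running if team in (None, t))

    def submit(self, team, gpus):
        # Within quota -> high priority; beyond quota -> borrow at low.
        prio = "high" if self.used(team) + gpus <= self.quotas[team] else "low"
        # If the pool is full, a within-quota job evicts borrowers.
        while prio == "high" and self.used() + gpus > self.total:
            victim = next((j for j in self.running if j[2] == "low"), None)
            if victim is None:
                break                     # pool full of high-prio work
            self.running.remove(victim)   # real system: checkpoint + requeue
        if self.used() + gpus > self.total:
            return None                   # no idle capacity to borrow
        self.running.append((team, gpus, prio))
        return prio

pool = Pool(8, {"teamA": 4, "teamB": 4})
print(pool.submit("teamA", 4))   # high: within teamA's quota
print(pool.submit("teamA", 2))   # low: borrowing teamB's idle GPUs
print(pool.submit("teamB", 4))   # high: the borrower gets evicted
```

In the real system the evicted borrower isn't killed outright; it's checkpointed and resumed elsewhere, which is where the reliability system comes in.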
Now that reliability system is the key to really unlocking maximum efficiency. One of the ways that we do that is with something called transparent checkpointing. One of the things that AI developers traditionally have to do when they're writing their machine learning training is to put in checkpointing code or calls to checkpointing functions. That checkpointing code is sometimes complex code, they can get it wrong.
It's also kind of inflexible, because the developer is saying, take a checkpoint after this number of iterations, or take a checkpoint every epoch. That might not be the ideal price-performance tradeoff that they really want, or that somebody who wants to deploy this training job wants, in terms of the overhead of checkpointing versus the performance degradation of checkpointing at some frequency. For a very large job, you might really wish you could checkpoint more frequently, because if a GPU fails or a server fails, it means you've got to go back to the last checkpoint. You don't want to waste hundreds of hours of GPU time when you go back to a previous checkpoint.
With Project Forge's transparent checkpointing, developers no longer need to instrument their code with checkpointing calls. We can use intelligent systems that look at the overhead of checkpointing for a particular model at a particular scale, quantify the tradeoff between the cost of checkpointing and the performance degradation, and let whoever is deploying the training job decide where they want to be on that spectrum. Like, checkpoint frequently, I'm willing to take the performance overhead because there's a high risk of failures and I don't want to lose any work at the massive scale I'm running at; versus dial it the other way. Given those constraints, Project Forge can figure that out automatically.
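The cost-versus-risk tradeoff being balanced here has a classic first-order approximation, Young's formula: the interval that minimizes total overhead is roughly the square root of twice the checkpoint cost times the mean time between failures. Forge's actual policy isn't public, so treat this only as an illustration of the tradeoff being automated:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: interval minimizing checkpoint overhead plus
    expected recomputation after a failure."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def expected_overhead(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Fraction of wall-clock time lost: checkpoint time per interval, plus,
    on average, half an interval of lost work per failure."""
    return checkpoint_cost_s / interval_s + (interval_s / 2.0) / mtbf_s

# Example: a 5-minute checkpoint on a cluster that fails about once a day.
best = optimal_checkpoint_interval(300, 86_400)   # -> 7200.0 s, i.e. every 2 hours
```

Checkpointing every two hours beats both checkpointing every hour (too much checkpoint overhead) and every four hours (too much lost work per failure), which is exactly the kind of dial the system can now turn for you.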
But the other way that transparent checkpointing helps is in that low-priority scenario I talked about earlier: somebody's burst out of their own capacity into somebody else's, and now they need to be evicted. Project Forge can transparently checkpoint, move the workload to another region or another physical cluster, and then let the higher-priority workload take its place. Again, the developer didn't need to do anything to support that, and if all our workloads do that, we can drive utilization up as close to 100 percent as possible. Transparent checkpointing also handles failover: if a bunch of servers fail, Project Forge can restart the job from a previous checkpoint somewhere else. You can pause the training job and then resume it.
You can even resume it in a different region. It can also preempt, like I mentioned, or just suspend the job because you want to diagnose something, or scale it out. There are lots of different uses of this reliability subsystem. The way it does all this is with something called a device proxy.
We actually insert a sidecar in the pod that sits between the workload and frameworks on one side, and the GPU on the other. When PyTorch calls an AI function in the CUDA library, that call ends up being intercepted by the Project Forge proxy. The proxy can keep track of memory usage and it can keep track of network calls.
It understands the state of the GPU and the state of the CPU, and this is what gives it the ability to create a consistent checkpoint across potentially thousands of servers and GPUs. It can do it with very little overhead. One of the other benefits you get out of this device proxy is profiling. Now, AI profiling, and I can tell you firsthand, is very arcane and very primitive. When you're running your job, trying to figure out where the performance bottleneck is requires rocket science today. With Project Forge and that device proxy, we can monitor exactly what's going on between the CPU and GPU to diagnose problems.
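The interception pattern itself is simple to sketch. Forge's proxy intercepts CUDA calls at a much lower level, but this hypothetical Python wrapper shows the shape of the idea: sit between the caller and the device API, forward every call, and record per-call telemetry on the way through.

```python
import time
from collections import defaultdict

class DeviceProxy:
    """Wraps a device API object and records per-call telemetry, in the
    spirit of a sidecar sitting between the framework and the GPU."""

    def __init__(self, backend):
        self._backend = backend
        # per-operation call counts and accumulated wall-clock time
        self.stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

    def __getattr__(self, name):
        fn = getattr(self._backend, name)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)     # forward the call unchanged
            finally:
                rec = self.stats[name]
                rec["calls"] += 1
                rec["seconds"] += time.perf_counter() - start
        return wrapper
```

From stats like these you can tell whether a "busy" device is spending its time on compute kernels or on communication primitives, which is exactly the distinction the demo below turns on.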
I've got an example here where we've got a training job. You can see a bunch of GPUs processing it at 100 percent utilization, so it looks good. We're getting about 816 milliseconds per iteration. But what Project Forge can do is look and say, hey, wait a minute, you've got some GPUs that are actually not doing much compute.
They're actually spending a lot of time doing network primitive operations, which causes the GPU to show 100 percent utilization even though they're basically busy-spinning. You can see a graph here produced by the telemetry that Project Forge creates. The compute is in blue, the communication is in yellow, and you can see the ranks at the top and the ranks at the bottom, those GPUs, are doing very little compute and a lot of network. Here's a stack trace. You can see at the top, "broadcast".
What's happened here is that this job is inefficiently sitting there waiting because it hasn't balanced compute and communication. When we tweak it and rerun, we're at 100 percent GPU utilization. But now, because we're effectively using the GPU and overlapping compute with communication, we drove iteration time down by about 25 percent, and you can see the trace now shows that the GPUs and CPUs are both almost at 100 percent. Again, this comes from the same device proxy that enables transparent checkpointing. You get deep insight, with recommendations on how to improve a workload, and see exactly what's going on. (applause) Time for another ChatGPT joke.
Why did the AI refuse to play tennis? Because it was tired of always being asked to serve models. (laughter) No? Alright. (laughter) These aren't mine, remember, they're ChatGPT's. Let's talk about AI serving. Now, a lot of people hear terms like fine-tuning and retrieval-augmented generation and prompt engineering and single-shot and multi-shot.
The question is, what do you do, and when? Our guidance, based on everything we've seen internally and working with customers, is to follow this step-down chart. Whatever you're trying to do, first try to do it with zero-shot prompting. Zero-shot means, don't give the model any examples.
Just ask it to do what you want, and see if it's able to do it reliably. If it isn't, then turn to few-shot. Few-shot means giving it a few examples of what you'd like, and seeing if it can learn from those examples and produce what you want. If it can't, that's when you turn to retrieval-augmented generation. That's when you give the model access to data sources, or provide data to it, that help it produce the answers you're looking for, answers that may be very contextual and dependent on the data you have in files, PDF documents, and web pages. If retrieval-augmented generation can't do what you want, then you turn to fine-tuning.
In most of the cases we've seen, fine-tuning isn't great at adding knowledge to a model. It's really good at making the model behave a certain way. If you want it to always produce medical-style language, fine-tuning can help, but it generally won't make the model great at holding a large body of medical information, which RAG, retrieval-augmented generation, is better at. Now, the reason I'm talking about this is that there are lots of cases where we and our customers create custom versions of models, Copilot being one example. I need to be specific about which Copilot now: GitHub Copilot is an example of that, where we fine-tuned a model on source code in GitHub's public repos.
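The step-down guidance above is really about how much scaffolding you put around a request before you resort to retraining. As a rough sketch (the function and its format are hypothetical, not any particular SDK's API), the first three rungs are all just prompt construction:

```python
def build_prompt(task: str, examples=None, context=None) -> str:
    """Assemble a prompt following the step-down guidance: start zero-shot,
    add few-shot examples if needed, then retrieved context (RAG).
    examples: list of (input, output) pairs; context: list of retrieved passages."""
    parts = []
    if context:
        # RAG rung: ground the model in your own data
        parts.append("Use only the following context:\n" + "\n".join(context))
    if examples:
        # Few-shot rung: show the model what good output looks like
        parts += [f"Input: {i}\nOutput: {o}" for i, o in examples]
    # Zero-shot core: just the task itself
    parts.append(f"Input: {task}\nOutput:")
    return "\n\n".join(parts)
```

Only when none of these rungs produce reliable behavior do you pay the cost of fine-tuning, and, as noted above, that buys you behavior more than knowledge.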
The traditional way to fine-tune a model is to take the base model, also called a pre-trained model, which has been trained on a large dataset, make a copy of it, restart the training process with your small custom target dataset, and then, after the training is done, you get a target model that knows how to do something specific. Now, LoRA fine-tuning, low-rank adaptation, is an example of innovation that came out of Microsoft Research. A year ago when I talked about it, it was brand new, basically new to the world. Today, LoRA is just the way everybody does fine-tuning. The way low-rank fine-tuning works is that you freeze the pre-trained model and create some additional weights called fine-tune adapters. You train those, adding their output to the pre-trained weights, and the combination of the two gives you the custom model.
This is much more efficient. You can see here a comparison on GPT-3, which we tried this on in the early days when we were exploring LoRA: the full model has 175 billion trainable parameters, while the LoRA adapters needed to fine-tune GPT-3 were roughly 100-200 megabytes in size. A tiny fraction of the weights actually needed to be updated.
What that translates into is, instead of needing 96 GPUs to fine-tune, you only need 24; instead of one terabyte per checkpoint, you only need 200 megabytes per checkpoint. As for switching models at serving time, it takes over a minute to swap 175 billion parameters off a GPU, but these low-rank adapters of a couple hundred megabytes take seconds. And this doesn't create any additional inference latency, and you get 25 percent more training throughput, so, all around, you can see why everybody does it this way now. Now, the traditional way to serve fine-tuned models is, like I said, to load up full models. Customer A has a fine-tuned model, Microsoft's got a fine-tuned model, and we need to swap them in and out and perform inference on them separately.
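The mechanics behind those numbers are easy to see on a single layer. A minimal sketch with hypothetical dimensions: the frozen weight W stays untouched, and two small matrices A (rank-by-input) and B (output-by-rank) are the only trainable parameters; their product is added to the frozen path.

```python
import numpy as np

d, k, r = 512, 512, 8   # hypothetical layer dimensions and LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection (zero-init,
                                         # so the adapter starts as a no-op)

def forward(x):
    """Adapter output added to the frozen path: h = W x + B (A x)."""
    return W @ x + B @ (A @ x)

# Only A and B are trained:
trainable = r * (d + k)   # 8,192 parameters
frozen = d * k            # 262,144 parameters, so ~3 percent is trainable
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and because the adapter can be merged into W after training, there is no extra inference latency, which is the property called out above.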
But with low-rank adaptive serving, we can load hundreds, even thousands, of fine-tuned models, because we're just loading those adapters into the GPU, and perform computation on them, in many cases, in parallel. In some cases we need to swap between one and another, but that just takes a fraction of a second. Let me show you a demo of multi-LoRA inference that we have running in production in Azure. Here on the left, I'm going to load a full pre-trained model, this is GPT-3.5 Turbo, and give it a prompt. You can see I get a status code 200, a success.
But on the second instance that I try to load with that script, I get a failure, because it just hasn't loaded in time. Now I'm going to add 1,000 LoRAs to the GPU and do inference on one of them. You can see that that succeeds. I pick another random one. That succeeds.
I pick number 99, and that succeeds. I get successes on all of them because they're all sitting there, either loaded or quickly swapped. Now, let's take a look at the latency overhead of serving 1,000 models versus one. We're now sending requests to both the single model on the left and the 1,000 LoRA models on the right.
I've got a Jupyter notebook here. What we're going to do is load up the trace. You can see the latency on the left: 0.44 seconds, so less than half a second. On the right side, the latency across all those models is approximately the same as for that one fine-tuned model, even though we have 1,000 of them that we're hitting.
The same thing happens when we run requests in parallel. Here we have a concurrency level of 25 hitting 10 loaded LoRA models at the same time, where obviously not even two of the full pre-trained models would fit. You can see the latency, even though they're all running at the same time on the same GPU, is about the same. That's an example of how we're optimizing our serving stack.
Now, another example of where we're optimizing is in how we provide you, the customer, a consistent, reliable experience. If you take a look at different AI workloads, they generally fit into one of these four categories, in terms of the number of tokens in the prompt versus the number of tokens generated by the AI model processing that prompt. You can see there, content creation is prompt-light, because it's like, go write me a story or go write me a Word doc about X, and it's generation-heavy, because you get a lot of output in return. On the bottom right, you have generation-light and prompt-heavy, because you're giving it a whole document and saying, summarize this for me. Lots of prompt in, few tokens out.
These two types of processing are very different. Prompt processing happens in parallel and can happen very fast. Token generation, the response, happens slowly, one token at a time, because you need to generate a token, add it back to the context, and then predict the next token. It's very sequential.
The naive way to schedule this is to take the prompt tokens in, process them all in one big chunk, keeping the GPU busy, and then start generation. Now, if a second prompt comes in, all of that prompt processing is going to start interfering with the generation for the first prompt. And then the generation for both prompts is also going to be slower, because they interfere with each other. The effect is very inconsistent performance, so we started something called Project FlyWheel to optimize this.
With Project FlyWheel, we take the prompts in and chunk them, so we only process a fixed amount at a time, and we generate at normal speed. When another prompt comes in, it's also processed in chunks, so it's not allowed to interfere with the response generation for the first prompt. You can see we then get consistent performance. This is what's allowing us to introduce fractional provisioned throughput units: instead of provisioned throughput units for a whole GPU, you can get fractions, because we're able to provide this and give you very predictable performance.
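The chunking idea can be sketched as a toy round-robin scheduler. This is an illustration of the technique, not FlyWheel's implementation: prefill work is capped at a fixed chunk budget per turn, so a newly arrived long prompt can never monopolize the device, and in-flight generations keep emitting a token every turn.

```python
from collections import deque

CHUNK = 4  # fixed budget of prompt (prefill) tokens processed per turn

def schedule(requests):
    """Interleave prefill chunks and single-token generation steps.
    requests: list of (name, prompt_tokens, gen_tokens).
    Returns the executed work items as (name, kind, token_count)."""
    queue = deque(requests)
    timeline = []
    while queue:
        name, p, g = queue.popleft()
        if p > 0:
            # Prefill runs in fixed-size chunks, never all at once.
            step = min(p, CHUNK)
            timeline.append((name, "prefill", step))
            queue.append((name, p - step, g))
        elif g > 0:
            # Generation is inherently one token per turn.
            timeline.append((name, "generate", 1))
            queue.append((name, p, g - 1))
        # else: request finished, drop it
    return timeline
```

Running a short chat request alongside a long-prompt summarization request, the short request's generation tokens come out steadily while the long prompt is still prefilling, which is the predictable-latency behavior the demo shows.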
Let's see a demo of FlyWheel in action. At the top, you can see a system where each of those little colored boxes is either a prompt or a response generation.
The larger blocks, as you can imagine, are prompts, and the smaller ones are generations. There's a bunch of different workloads competing with each other, so the latency on your prompts per second is going to be very variable. At the bottom, when we use Project FlyWheel to chunk them into fixed-size batches so we can control the execution, we get very predictable performance. To demonstrate that even further, here I've got three workloads. One is very prompt-heavy with light generation.
One that's balanced, which is most workloads like a chat workload. Then on the right you can see small prompt, large generation, like summarization. Light, balanced, and heavy on prompt size. And you can see I'm sending requests into all three of them. We're going to take a look at PTU performance here in Grafana. You can see the consumed PTUs at the bottom.
Each workload is given, as you can probably already figure out, a different number of prompt tokens per minute. The green line is the one that's prompt-heavy: lots of prompt tokens. The yellow is medium, and the third is light on prompt and heavy on generation. You can see the generated tokens per minute differs across them, but for all three, we're able to provide very consistent throughput. Which means, if you understand your workload, you can understand how it's going to behave.
The same thing happens when we scale PTUs on that medium job: the prompt tokens per minute scales linearly as we give it more capacity. The same is true for generated tokens per minute, and we're able to keep the time between tokens basically 100 percent consistent. This is the key to really providing serverless AI serving. With you understanding exactly what you're going to get, and with us doing it very efficiently, we're able to have multiple prompts being processed on the same GPU. (applause) The next thing I want to talk about is processing the outputs of AI models in a production environment. If you give ChatGPT a prompt like this, "It's rainy in Seattle.
Give me three places to go.", it's going to produce natural language output. That's really hard for a system to process, especially because it isn't always consistently formatted. One of the things people do is tell the model to produce JSON output. In fact, the latest releases of OpenAI's GPT-4 and GPT-3.5 Turbo came out with something called JSON mode: the model has been fine-tuned so that when it's in this mode, you're going to get good JSON.
Here's "provide three suggestions", and it produces some nicely-formatted JSON. The problem, though, is that it doesn't follow any schema. This is the note from OpenAI: "JSON mode will not guarantee the output matches any specific schema, only that it is valid." So we've introduced something called TypeChat. It's an open-source project, and what it's designed to do is let you specify the schema for what you want the model to output.
Based on that schema, TypeChat will automatically generate prompts for the model, take the output of that model, validate it against the schema, and even go back to the model with a refinement prompt to say, you screwed it up, fix it. Here's a demo of a coffee shop with TypeChat just to highlight that. Here, I've got a coffee shop interface.
You can see there's a bunch of objects here, including latte drinks for different types of lattes, different sweeteners, and you can see syrups. Now, one thing about this syrup list is there's no strawberry in it, and that's an example of a schema constraint you want checked. If we order two tall lattes through TypeChat, we get back nicely-formatted JSON that matches the schema. If you ask the model directly, "Order me a strawberry latte", it's going to give you back JSON, but the schema we've got doesn't allow strawberry, because there's no such thing as strawberry syrup. TypeChat catches that. Here we're just having it print the error, but we could have it go back and tell the model, no, that's not good, or tell the user to refine the request.
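The validate-and-repair loop at the heart of this can be sketched in Python. To be clear, TypeChat itself drives validation from TypeScript type definitions; this is only an illustration of the pattern with a hand-rolled check, and `ask_model`, the `SYRUPS` list, and the order format are all hypothetical.

```python
import json

SYRUPS = {"vanilla", "caramel", "hazelnut"}   # hypothetical allowed syrup values

def validate_order(raw: str):
    """Return (order, None) if the JSON matches our schema, else
    (None, error) suitable for sending back to the model as a repair prompt."""
    try:
        order = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"Invalid JSON: {e}"
    for item in order.get("items", []):
        syrup = item.get("syrup")
        if syrup is not None and syrup not in SYRUPS:
            return None, f"Unknown syrup {syrup!r}; allowed: {sorted(SYRUPS)}"
    return order, None

def order_with_repair(ask_model, request: str, max_retries: int = 2):
    """ask_model(prompt) -> raw JSON string. On a schema failure, re-prompt
    the model with the validation error, TypeChat-style."""
    prompt = request
    for _ in range(max_retries + 1):
        order, error = validate_order(ask_model(prompt))
        if error is None:
            return order
        prompt = f"{request}\nYour last answer was invalid: {error}. Fix it."
    raise ValueError("model could not produce schema-valid output")
```

The key design point is that the schema failure message is fed back into the next prompt, so the model, not the application code, does the repair work.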
Using TypeChat is an example of making the development of copilots and other AI applications more robust. And today we're coming out with Python support, as well as C# support imminently. (applause) Let's turn our attention to AI research.
The question is, can language models be small? Satya talked about this, our Phi models, which are based on looking at the way humans learn versus the way models learn. Humans don't read much, only about a billion words, which isn't much by LLM standards, and we only learn some basic facts. But we're able to read and think and write.
Language models, on the other hand, often read trillions of words and try to memorize everything. It's easy for them to learn pure information, but hard for them to figure out how to reason. Our hypothesis is, if we give a model less unimportant stuff and more high-quality data that demonstrates reasoning, we'll be able to get cheaper, smaller models. To compare, here's the LLaMA 1 dataset used to train the Meta LLaMA models.
You can see lots of stuff just grabbed from the Internet. CommonCrawl is there, and it's full of all sorts of noise: people talking in discussion groups, social media posts. You can imagine there's lots of undesirable content in that. Meanwhile, the Phi dataset is synthetic textbook data, so it's high quality, just pure science and textbook material.
There's no toxic data in it, simply because these books don't contain that. It's very high quality and curated. With the Phi 1.3-billion-parameter model, you can see that it's able to rival models 5-10 times its size across all these tasks, including multi-step reasoning. You can see it's way better than models like the Falcon 13 billion there. We're doing the same thing with multimodal models.
Here's a model where you can combine vision and text. This is an image I created with DALL-E 3 of Azure in Times Square. We gave it to the Kosmos model, which is only 1.6 billion parameters, and asked it, what is this? It's able to see: this is people, a large blue Microsoft Azure logo displayed on a building. It even knows about geographic landmarks.
What is this? It's the Sydney Opera House, in 1.6 billion parameters. So yes, we can have small language models that perform as well as large language models. Now, one of the other things, and this is a personal story, is looking at unlearning. Why is unlearning potentially useful? Well, some of you might have seen the lawsuits brought against large-model creators whose models have been trained on copyrighted material.
But unlearning can be useful even if that's not your problem: for making a model unlearn poisoned data, GDPR data, private data, or data that you just want to get rid of. Over the summer, on my sabbatical, I worked with Ronen Eldan at Microsoft Research, where we set out to come up with a technique to have a model forget things. What we decided to target, kind of our Mount Everest, was: can we have it forget Harry Potter? These models know Harry Potter so deeply that if you say, "Where did Harry go that fall?", they'll say, "Hogwarts". That's how well they know Harry Potter. I'm going to show you a quick demo of the unlearning project here.
On the left, we have the pre-trained LLaMA2 7 billion model; on the right, the one that has forgotten Harry Potter. You can see, "Who is Harry Potter?" on the right. "It's a British actor and director. He's best known for his work in the theater, where he's appeared in numerous productions, including 'The History Boys' and 'The Importance of Being Earnest.' He's also worked in television, appearing in shows like 'Doctor Who' and 'The Crown.'"
I don't think so. This is an example, by the way, of what's called hallucination: when the model doesn't know, it just makes stuff up. It's kind of humorous. Here are some other examples, where you can see the prompt on the left. LLaMA2 completes it with "When Harry went back to class, he saw his best friends, Ron and Hermione.", while the unlearned model completes it with "Sarah and Emily...", generic stuff. We succeeded in doing that, and the paper's online for you to check out. You can see we did it with very little impact on the model's performance. Now, the final thing I'm going to talk about, before I switch to the cool demo that some of you might have seen me tweet about, is confidential computing in AI. Confidential computing means protecting data through its entire life cycle, not just at rest and in transit, but also while it's in use, meaning while you're processing it.
We're really excited. You saw Jensen on stage talking about confidential Hopper H100s, which we helped co-design with NVIDIA to protect AI workloads end-to-end, coming off the CPU and going back to the CPU. The vision here is that we can protect the model, because people want to protect their IP.
We can protect the data: data that you use to train a model, data that you use to fine-tune a model, or data that you pass in as a prompt and get back as a response, meaning that nobody else can see it but you. It also supports multi-party scenarios, where you're sharing data across different parties who can't see each other's data, because it's protected by that confidential hardware. The fact is, confidential computing is now entering a new era, where we're bringing accelerators into the confidential computing boundary. And that brings me to the conclusion of the talk and the demonstrations.
Now, the last thing, and what I'm really excited about: everything I've shown you, from data centers up into servers, up into accelerators, up into the software stack, has been a lot of fun and really cool innovation. This one is just kind of ridiculous innovation. We've been creating larger and larger servers in Azure. Some of you might have seen my previous demonstrations of machines like Godzilla, which had 512 gigabytes of memory back in 2014, Beast, which had a terabyte of memory, Beast V2, which had 4 terabytes of memory, and Mega Beast, which had 20 terabytes of memory and hundreds of cores. I'm proud to announce Super Mega Godzilla Beast, the latest generation.
(applause) If you take a look here, this is Super Mega Godzilla Beast. How much RAM does it have? Yeah, you read that right. That's not disk. That's RAM. Thirty terabytes of RAM.
(applause) Ready for the CPUs? (laughter) How many CPUs is that? 1,792 CPUs. This is obviously something we need to take advantage of. (laughter) Let's play a game. What you're going to see here in a second is the start of a scroll for the game we're going to play. Now, it's a Star Wars-type scroll. My wife forbids me from singing, otherwise I would sing the intro theme to Star Wars.
But because I can't, I'm going to invite Seth Juarez out to stage. He's volunteered to sing. SETH JUAREZ: Yeah, let's do this. MARK RUSSINOVICH: Thanks.
SETH JUAREZ: Are you ready? This is why we get advanced degrees in computer science, to sing for Mark Russinovich. Are you ready? MARK RUSSINOVICH: Let's do this. SETH JUAREZ: Hold on. Before you go, if you want to join in, you should. I'm just saying. Are you ready? MARK RUSSINOVICH: Here we go.
SETH JUAREZ: Chun, chun, chun, chun, chun. (laughter) Chun, chun, chun, chun, chun, (audience singing) You-all are the worst. I mean-- MARK RUSSINOVICH: They're tired. It's been a long Ignite. SETH JUAREZ: This supercomputer has 1,796 cores? MARK RUSSINOVICH: Ninety-two.
Yeah, get it right. SETH JUAREZ: Gosh. What do we got going on here? MARK RUSSINOVICH: Here's Azure Pong. Now, what makes this really extra cool is-- that's me on the left moving the paddle-- is that the right-- here I'm going to release the ball-- you know who's playing on the right side? Not you. GPT-4. SETH JUAREZ: That's probably smarter.
MARK RUSSINOVICH: GPT-4 is actually moving the paddle and playing Pong with me. You can see down there at the bottom, it says, "assistant", and it's telling us where it wants the paddle to be because we're telling it where the ball is going. SETH JUAREZ: This tracks with what we should be using advanced AI for and a supercomputer. This is awesome. Can I? MARK RUSSINOVICH: Yeah. SETH JUAREZ: May I? MARK RUSSINOVICH: Sure. Don't m