Tech Talk: The future of liquid cooling for data centers

Okay, good morning, everybody. I'm happy to see the turnout for a talk on liquid cooling today. This is a topic I care very much about. I'm a thermal engineer at HPE, so this is kind of my bread and butter.

Yeah, so today we're going to be talking about liquid cooling: a bit about how we got here and where we are today. We'll talk a little about efficiency and how it's measured in a data center and at the IT level. We'll talk a little bit about HPE solutions in the liquid cooling space. And I hope that by the end of this session, you'll have a better understanding of the technology, some of its benefits, and how HPE can help you achieve those sustainability and efficiency goals in your data center.

So I'd like to start with a very simple statement that kind of defines the essence of this presentation, which is that cooling matters more than ever today. If you go back 10 or 15 years in this space - I began my career in 2015 with SGI - liquid cooling was more of a value-add service that we did on systems. It was for the customers that wanted the biggest and fastest machines: the government labs, the university research centers. It was far from a commodity. That has really changed today.

So now, liquid cooling is no longer just a nice-to-have; it's really a thermal necessity and even a design constraint in some cases. I think the next slide shows that quite nicely. I like this slide a lot. As I said, I started back in 2015. This chart shows two things. Number one, it shows GPU and CPU power going up over time. And number two, it shows more or less the difficulty of my career since I started. So it's kind of a fun slide.

Yeah, like I said, 10 years ago the top CPU SKUs we were selling were 85 watts, then 150 watts a year later. GPUs were around 300 watts. And in just 10 years, it's really gone up.

CPUs, we're already selling today at 500 watts. GPUs are pushing 1,000 watts. Basically, what happened on the manufacturing side is that we can't make the transistors much smaller anymore, so the way to get higher performance is to put more transistors in the same space. That's really the only way to get to higher power, higher performance parts.

And that lets us get higher performance in a smaller footprint, but at a cost. You may have heard an industry term called T-case. That's the case temperature of a device; it basically defines the maximum temperature at which the device can operate and function without overheating. Those numbers are actually going down as well.

So as the power is going up, the maximum allowable temperature of the device is going down. And then we layer in a third complexity, which is that to meet sustainability and efficiency goals in data centers, we have warmer and warmer facility coolant coming into the data center. It used to be 20C when I first started; now 32C is kind of the standard, and we're seeing 40C, even higher than that in some cases. So basically, three things are happening: the power of the devices is going up, the temperature limit - the T-case of the device - is going down, and the cooling water temperature, the coolant temperature going into the server, is going up.
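A rough way to put numbers on that squeeze is the thermal budget at the device: the allowable case-to-coolant temperature rise divided by the power you have to dissipate. Here is a minimal sketch of that calculation; the device powers, T-case limits, and coolant temperatures in it are illustrative assumptions, not specific product figures.

    # Thermal budget per generation: how much temperature rise per watt the
    # cold plate and thermal interface are allowed to consume.
    # All numbers are illustrative assumptions, not actual product specs.

    generations = [
        # (label, device power [W], T-case limit [C], coolant temperature [C])
        ("~2015", 300, 85, 20),
        ("~2020", 500, 75, 32),
        ("~2024", 1000, 65, 40),
    ]

    for label, power_w, t_case_c, t_coolant_c in generations:
        headroom_c = t_case_c - t_coolant_c       # available temperature rise
        budget_c_per_w = headroom_c / power_w     # allowed thermal resistance, C/W
        print(f"{label}: {headroom_c} C headroom / {power_w} W "
              f"-> {budget_c_per_w * 1000:.0f} mC/W allowed")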

So you have this squeezing effect that makes it very, very difficult to do the actual heat transfer at the device. It can be done - we do it - but it's more and more difficult with each generation, even with liquid cooling. So I talked about liquid cooling being kind of a necessity today. But there are benefits to liquid cooling beyond it simply being required to get the performance you need.

Number one is performance. Liquid cooling enables you to run the top SKUs, the highest-power, highest-performance devices. The cooler the chips are, the happier they are, the better they perform. It all comes down to getting the heat out of the device, which, like I said, is extremely dense now. It's just physics. You can't cheat physics, I always say.

Speaking of physics, the coolant itself - a water-based coolant - has almost four times the specific heat capacity of air, so it absorbs heat from the heat source almost four times better than air does, and it has roughly 1,000 times the density. So when you do the heat transfer math, per unit volume of fluid moved, a liquid-cooled solution is on the order of three to four thousand times more effective at carrying heat away. That's the number one reason this is a topic today: the performance aspect. To use an example that Antonio likes to use, there's a reason that when you burn your finger, you run it under water; you don't blow on your finger.

It's the fastest way to cool your finger down, and the same is true with high performance computing. Number two is density. This is more or less: how many chips can I get in a rack? What's my kilowatts per square meter, or per square foot, of data center? With traditional air cooling, you have 10, 15, sometimes 20 kilowatts per rack of server density.

That's just due to the capacity, the performance of the air handlers in the traditional data center space. With liquid cooling, and you can see some of these products on the floor with liquid cooling, we can deploy 50, 60, 80 kilowatts in that same space, depending on the technology. And that's just by utilizing liquid cooling. So the density story is a lot better. So you can get the same performance with less than half of the racks in the data center.
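To put rough numbers behind that fluid-property and density argument, here is a minimal sketch comparing how much heat a given volume of water carries versus the same volume of air, using textbook property values; the 10 C coolant temperature rise is an assumption for illustration.

    # Heat carried per unit volume of coolant moved, water vs air, for the
    # same temperature rise. Property values are textbook approximations.

    CP_WATER = 4180.0   # specific heat of water, J/(kg*K)
    CP_AIR = 1005.0     # specific heat of air, J/(kg*K)
    RHO_WATER = 1000.0  # density of water, kg/m^3
    RHO_AIR = 1.2       # density of air at room conditions, kg/m^3

    delta_t = 10.0      # assumed coolant temperature rise across the servers, K

    # Heat absorbed per cubic metre of fluid moved: Q = rho * cp * dT
    q_water = RHO_WATER * CP_WATER * delta_t
    q_air = RHO_AIR * CP_AIR * delta_t

    print(f"water: {q_water / 1e6:.1f} MJ/m^3, air: {q_air / 1e6:.3f} MJ/m^3")
    print(f"water carries ~{q_water / q_air:,.0f}x more heat per unit volume")

That ratio of roughly 3,500 is where the three-to-four-thousand-times figure comes from, and it is also why a liquid-cooled rack can be so much denser: far less fluid has to move through it to remove the same amount of heat.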

And lastly, efficiency. I think this ties in really nicely with the other two. If you can get the heat out of the device at the source, before you convert it through different mediums like you do with air cooling, you just get better heat transfer effectiveness. And with the fluid properties we talked about, you don't have to move as much fluid to get the heat out. It's just a very efficient way to do cooling. So what does this look like? I'm a visual person, so I like to see exactly what that looks like physically. Like I said, I'm a mechanical engineer by training.

So that's kind of where that comes from. What you see here is the EX2500. This is what I think of as the little brother to the EX4000 which is on the showroom floor. If you haven't had a chance to look at that, I highly suggest it. It's a very impressive piece of equipment.

But it's the same EX servers and the same cooling technologies in a smaller system. So it's basically a lower barrier of entry into the EX supercomputing space. And I'm going to start kind of on the right side of the slide, because as a cooling engineer, this is more my area of expertise. What you see in the middle there is holistic server blade cooling.

What we mean by that is not only the CPUs and GPUs, which are, to be honest, kind of the low-hanging fruit of liquid cooling today. A lot of people offer this; we offer it as well. These are the devices where, if you put a liquid cooling solution on your server, you can get about 70% of the heat out just from the CPUs and GPUs. The more difficult things - the last 30% - are the memory, whether that's the DIMMs you find with CPUs or the high-bandwidth memory in GPU stacks, and the power distribution. These are a bit more challenging to get the heat out of and require a more intricate design. Same with the fabric.

So the Slingshot switch that we have is 100% liquid-cooled, and there's local storage disk drive cooling as well. All of that extra 30% brings us to what we call 100% fanless direct liquid cooling. This can sit in your data center with zero airflow, quiet as a whistle, and it will run with just water cooling, which is, yeah, a difficult thing to do. But it's the highest performance you can get, essentially. At the very bottom of this rack, you see the coolant distribution unit.

This is called the CDU in trade speak. It's the device that collects the heat on the secondary side, which is the IT side. We circulate a water-based coolant through the system; it collects that heat and transfers it to the facility cooling system by means of a brazed plate heat exchanger, which is a very high efficiency heat exchanger.

So the two fluids stay separate: the coolant we cool our IT with doesn't interact with, doesn't touch, the coolant the data center is bringing in. That's just the cooling side. From a holistic point of view, you have integrated software monitoring services on the solution that put you in control. I mentioned the Slingshot interconnect is liquid-cooled.

This is for node to node communication. And one interesting thing about the EX is, it is a completely agnostic system design. So the latest accelerators that you need for your workloads can be put into this design in the same form factor across generations.
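Coming back to the CDU for a second, the heat handoff it performs is just an energy balance across that brazed plate heat exchanger, with the secondary (IT) loop necessarily running a few degrees warmer than the primary (facility) loop. Here is a minimal sketch of that bookkeeping; the rack heat load, approach temperature, and temperature rises are illustrative assumptions, not the specifications of any particular CDU.

    # CDU energy balance sketch: the heat collected on the secondary (IT) loop
    # equals the heat handed to the primary (facility) loop, Q = m_dot * cp * dT.
    # All numbers below are illustrative assumptions.

    CP_WATER = 4180.0          # J/(kg*K); treat both loops as water-like coolant

    rack_heat_kw = 300.0       # heat collected from the IT, kW (assumed)
    primary_supply_c = 32.0    # facility water supply temperature, C (assumed)
    approach_c = 3.0           # heat exchanger approach temperature, C (assumed)
    secondary_dt = 10.0        # temperature rise across the servers, C (assumed)
    primary_dt = 12.0          # temperature rise on the facility side, C (assumed)

    # The coolest temperature the secondary loop can be supplied at:
    secondary_supply_c = primary_supply_c + approach_c

    # Mass flow needed on each side to move the same heat: m_dot = Q / (cp * dT)
    secondary_flow_kg_s = rack_heat_kw * 1000.0 / (CP_WATER * secondary_dt)
    primary_flow_kg_s = rack_heat_kw * 1000.0 / (CP_WATER * primary_dt)

    # For water, 1 kg is roughly 1 L, so kg/s * 60 approximates L/min.
    print(f"secondary supply to the servers: {secondary_supply_c:.1f} C")
    print(f"secondary loop flow: ~{secondary_flow_kg_s * 60:.0f} L/min")
    print(f"primary loop flow:   ~{primary_flow_kg_s * 60:.0f} L/min")

The approach temperature is also why warmer facility water translates directly into warmer coolant at the chips, which ties back to the T-case squeeze from earlier.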

So this slide is looking not at the EX, but at the XD 2000, as an example. I always say one test result is worth 1,000 expert opinions, so let's actually look at an example of how this works. With the XD, we sell a liquid-cooled version that is 70% liquid-cooled - like I mentioned earlier, CPUs and GPUs - and we also sell an air-cooled version. So it's a perfect apples-to-apples comparison. You can't really get any better than that.

This study was done with an XD 2000 chassis, specifically with XD 220V compute nodes. And what we see is: same benchmarks, same performance, almost 15% less chassis power. So at the server itself, we're consuming less power simply by virtue of running liquid cooling into the server instead of running fans.

Not only that, you get a cooler device temperature, so you get slightly better performance. And when you combine the power consumed with the performance of the chip, it's almost a 21% improvement in performance per kilowatt spent. So that's just the impact at the server. What does that mean for a data center? Well, we did a study - this is an example study, so obviously it depends on the exact data center - but in this example, we get 86% savings in the operational cost of the data center.

You're not paying the cost to move all of that air, you're not paying for the air handlers in the data center, or for the chilled water running into those air handlers - a very expensive system to run. So it can be up to 86% savings in operational costs. And depending on where that energy comes from - speaking of sustainability - this is an 87% carbon reduction in your data center for the same performance. Actually, better performance, as I mentioned.
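As a rough sanity check on the chassis-level arithmetic from a moment ago, here is a minimal sketch of how a roughly 15% power reduction and a small performance uplift from cooler silicon compound into roughly a 21% gain in performance per kilowatt. The 3% performance uplift below is an assumed value chosen to illustrate the math, not a measured figure.

    # How a chassis power reduction and a small performance uplift combine
    # into a performance-per-kilowatt improvement.
    # The 3% performance uplift is an assumed, illustrative value.

    power_reduction = 0.15    # ~15% less chassis power with liquid cooling
    perf_uplift = 0.03        # assumed small gain from cooler devices

    relative_power = 1.0 - power_reduction   # 0.85x the air-cooled power
    relative_perf = 1.0 + perf_uplift        # 1.03x the air-cooled performance

    perf_per_kw_gain = relative_perf / relative_power - 1.0
    print(f"performance per kW improves by ~{perf_per_kw_gain * 100:.0f}%")
    # -> roughly 21%, consistent with the chassis-level comparison above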

So bringing that together, it's really not possible to achieve the sustainability and efficiency goals for your data center without seriously considering liquid cooling. Let's look at that same example from a financial perspective. Depending on where you are in the world, your electricity price will obviously vary, but this example looks at the US versus the UK, with 10 racks of XD.

So it's a fairly small deployment of servers, actually. But depending on where you are, we're looking at hundreds of thousands to millions of dollars in operational cost savings, simply by using liquid cooling. And that comes from the roughly 15% chassis energy reduction we talked about.
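To make the scale of that concrete, here is a back-of-the-envelope sketch for a deployment of roughly that size. The rack power, operating horizon, and electricity prices below are assumptions for illustration only, not the inputs behind the slide.

    # Back-of-the-envelope operational savings from a ~15% chassis energy
    # reduction across a small liquid-cooled deployment.
    # All inputs are illustrative assumptions, not the study's actual inputs.

    racks = 10
    kw_per_rack = 40.0        # assumed average IT load per rack, kW
    hours_per_year = 8760
    years = 5                 # assumed operating horizon
    energy_saving = 0.15      # ~15% chassis energy reduction

    prices_per_kwh = {"US": 0.10, "UK": 0.30}   # assumed $/kWh equivalents

    saved_kwh = racks * kw_per_rack * hours_per_year * years * energy_saving

    for region, price in prices_per_kwh.items():
        print(f"{region}: ~${saved_kwh * price:,.0f} saved over {years} years")

Even before counting the facility-side fan, air handler, and chiller savings, that is the kind of range the slide points at; the exact figure obviously depends on your rack power, utilization, and tariff.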

So I mentioned efficiency at the beginning. The way we've always thought about and measured efficiency is with a metric called PUE, which stands for power usage effectiveness. It's another way of looking at what you get - your IT performance - versus what you pay for, which is everything else: your cooling, power distribution, lighting of the data center, and so on. Lower is better in this case. So basically, a PUE of 1.2, for example, means that for every 100 watts I spend on the server, I'm spending 20 watts to run the infrastructure around that server, for a total of 120 watts. Starting at the bottom of this chart with the legacy air-cooled data center, 1.25 to 1.35 is pretty typical.

This is kind of the old way of doing things with an air-cooled data center. As we move up, we get better and better air cooling technologies. So using free air cooling instead of chillers, for example, or any kind of direct expansion.

Moving up to an optimized air-cooled data center - for example, with hot aisle/cold aisle containment - can get you a little better, but still not close to what we get with water-cooled servers, which is really the best power usage effectiveness you can achieve in your data center. The problem with PUE is that it's missing something. It doesn't actually look at how efficiently the server itself is using its resources to do the computation. So there's another metric I'm going to jump into, called ITUE, or IT usage effectiveness. This basically looks inside the server and asks: how much energy am I wasting on stuff that is not doing the computation? It works the same way as PUE; lower is better.

As an example, an ITUE of 1.1 means that for every watt I spend on the CPUs, the GPUs, the memory, and so on, I'm spending 100 milliwatts - 0.1 watts - on the fans, the power conversion inside the server, and so on. So it's really just looking at how efficient the IT is at producing the results.
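Since both metrics are simple ratios, a minimal sketch makes the bookkeeping explicit. The wattages below just reuse the illustrative 1.2 and 1.1 examples from above.

    # PUE and ITUE as simple ratios, using the illustrative examples above.

    def pue(total_facility_power_w: float, it_power_w: float) -> float:
        """Total facility power (IT plus cooling, power distribution, lighting) over IT power."""
        return total_facility_power_w / it_power_w

    def itue(total_it_power_w: float, compute_power_w: float) -> float:
        """Total server power over the power that reaches compute (CPUs, GPUs, memory)."""
        return total_it_power_w / compute_power_w

    # PUE example: 100 W of IT plus 20 W of infrastructure overhead -> 1.2
    print(f"PUE  = {pue(120.0, 100.0):.2f}")

    # ITUE example: 1.0 W of compute plus 0.1 W of fans and power conversion -> 1.1
    print(f"ITUE = {itue(1.1, 1.0):.2f}")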

With liquid cooling, you're no longer spending all of this energy to move air and run fans, so your ITUE also improves - it goes down. Getting heat out of the device more efficiently improves the ITUE, and we saw that in the previous example as well. Bringing those two metrics together is what we see on this slide. This will be the last metric I talk about today; we'll move on to some more interesting things after that. But when you boil it all down, it comes down to the system value you get out of your IT system, which comes from enabling the highest performance at the lowest cost of operation.

It's what you get versus what you pay for - kind of the simplest way to define efficiency. In this case, the productivity of your system is the performance, or the time to solution, however you want to measure that, over the total cost of ownership. And the total cost of ownership scales more or less with the product of the two metrics we just talked about, PUE and ITUE. So this measures the total operating efficiency of the data center.
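Putting the two together, the overheads multiply through: facility overhead on top of in-server overhead. Here is a quick sketch of that product, again with the illustrative 1.2 and 1.1 values and a nominal compute load rather than measured data.

    # How PUE and ITUE compound: power drawn from the grid per unit of compute.
    # The 1.2 / 1.1 values and the 100 kW compute load are illustrative.

    pue = 1.2          # facility overhead multiplier
    itue = 1.1         # in-server overhead multiplier
    compute_kw = 100.0

    total_kw = compute_kw * itue * pue   # power drawn to deliver that compute
    overhead_kw = total_kw - compute_kw

    print(f"{total_kw:.0f} kW from the grid for {compute_kw:.0f} kW of compute "
          f"({overhead_kw:.0f} kW of combined overhead)")
    # Lowering either metric lowers the product, which is why the two have to
    # be looked at together rather than in isolation.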

So the interesting thing that I'll talk about in just a second here is that these can be competing metrics. And we as HPE only really have control over the ITUE. So we can deliver a system that is highly efficient in utilizing server resources to produce computation. What we don't have control over, usually, is the PUE, the facility side choices.

And again, I said I'm kind of a visual learner; I like this graphic a lot for that reason. The ITUE is what we can control on the design side, and the facility side is in the blue box - what the data center provider is giving us. They meet at the heat exchanger in the CDU that we talked about. The heat exchanger itself is very efficient at transferring heat between the two cooling mediums.

But the point of this slide is that the choices on either side, in the green box and in the blue box, have a combined effect on that system value. They contribute to the efficiency of your system in different ways. So I'm going to walk through an example to illustrate that.

This is something we see pretty commonly. Especially today, with sustainability goals, there's a trend toward more efficient facility cooling systems. So instead of using cooling towers, instead of using chillers, we move toward dry coolers. We're not evaporating that facility coolant into the environment.

We're enabling free cooling wherever you are in the world. So it's just sensible cooling with the local climate, the local air. So 40C facility coolant is not uncommon now.

This is something that will increase the efficiency of the facility, so your PUE goes down - remember, lower is better in this case. But now let's look at how that affects the IT usage effectiveness. Many times, we have to compensate for it on the IT side.

We bring in warmer coolant to the CDU. We have to run the secondary coolant faster, increase our flow rates in order to compensate for that higher temperature. In this case, our ITUE goes up. So now we're actually less efficient on the server side.

So bringing a couple of these things together, let's look at a third example: lowering the primary flow rate. When I say primary, that's industry speak for the facility flow going into the CDU. Lowering that flow rate will increase the secondary temperature on our IT cooling side. That's just the physics of the heat exchanger; there's nothing we can do about that. Like I said earlier, you can't cheat physics.

So that will increase the secondary temperature, and to compensate for that, we again need to spin our pumps up and pump more coolant. So you have these competing metrics: ITUE going up while PUE goes down. These are the types of things that are really important to think about when you're designing the holistic data center.

A final note on this slide: these effects are not necessarily one to one. The arrows are drawn the same size here, but in reality, depending entirely on the design of the data center, they can differ substantially in magnitude - in some cases by orders of magnitude. So it's really important to understand exactly what the goal is for holistic system efficiency as you design for liquid cooling.
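To make the competing-metrics point concrete, here is a toy model of the trade-off: warmer facility water trims the facility-side overhead, so PUE goes down, but it forces more secondary pumping in the servers, so ITUE goes up, and what matters is the product. Every coefficient in it is an assumed, illustrative value, not a model of any real facility.

    # Toy model of the PUE/ITUE trade-off as facility water gets warmer.
    # All coefficients are illustrative assumptions, not measurements.

    def facility_overhead(facility_temp_c: float) -> float:
        """Assumed PUE curve: warmer facility water means less chiller work."""
        return 1.35 - 0.005 * (facility_temp_c - 20.0)

    def server_overhead(facility_temp_c: float) -> float:
        """Assumed ITUE curve: warmer coolant means faster secondary pumps."""
        return 1.03 + 0.002 * (facility_temp_c - 20.0)

    for temp_c in (20.0, 32.0, 40.0):
        pue = facility_overhead(temp_c)
        itue = server_overhead(temp_c)
        print(f"{temp_c:>4.0f} C facility water: PUE {pue:.3f}, "
              f"ITUE {itue:.3f}, product {pue * itue:.3f}")

With these particular made-up slopes the product still improves as the water gets warmer, but steeper pump or fan curves on the IT side can flip that, which is exactly why the two sides have to be designed together.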

So I'm going to come out of the weeds now. We saw the back end of this technology - what does it look like on the front end? What do we offer? What does HPE do in terms of liquid cooling? The important thing to remember here is that there is no one-size-fits-all with liquid cooling. There is a spectrum, each of these technologies has benefits and drawbacks, and it really depends on the workload and the type of computation you're doing in your data center. On the X axis here we have cooling capacity. This is another way to think about density: kilowatts per rack. On the very far left, you see kind of our ProLiant stuff.

These are not typically very dense solutions, but we do offer closed-loop liquid cooling, meaning we can liquid-cool just the CPUs in that device. That lets us enable the latest, higher-TDP (thermal design power) SKUs with liquid cooling, but ultimately we still reject the heat to the data center air. So it's not very effective, which brings me to the Y axis. The Y axis on this graph basically shows how effective the system is at pulling the generated heat out of the chips, and how efficiently it transfers that heat back to the facility. Moving up that spectrum, like I said, there are benefits and drawbacks to each. But an easy way to get into liquid cooling is with the rear door heat exchanger.

We actually have one of these on the showroom floor; you might have seen it already. It's very simple: you have your air-cooled servers in the rack, and this is literally a door with a radiator coil in it that mounts on the back of the rack. You run fluid through the door, and all of the heat generated by your servers is transferred into the coolant that goes through the door.
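As a rough sense of scale for what that door has to do, the same Q = m·cp·ΔT bookkeeping from earlier applies; here is a sketch comparing the air the servers push through the rack with the water the door needs, where the rack power and temperature rises are assumptions for illustration.

    # Fluid needed to carry one rack's heat: air through the servers vs water
    # through the rear door. Rack power and temperature rises are assumed.

    rack_heat_w = 30_000.0    # assumed 30 kW air-cooled rack
    dt_air_c = 12.0           # assumed air temperature rise through the servers
    dt_water_c = 10.0         # assumed water temperature rise through the door

    RHO_AIR, CP_AIR = 1.2, 1005.0         # kg/m^3, J/(kg*K)
    RHO_WATER, CP_WATER = 1000.0, 4180.0  # kg/m^3, J/(kg*K)

    air_m3_s = rack_heat_w / (RHO_AIR * CP_AIR * dt_air_c)
    water_l_min = rack_heat_w / (RHO_WATER * CP_WATER * dt_water_c) * 1000.0 * 60.0
    # water_l_min: m^3/s converted to litres per minute (x1000 L/m^3, x60 s/min)

    # 1 m^3/s is about 2119 CFM (cubic feet per minute)
    print(f"air through the rack:   ~{air_m3_s:.1f} m^3/s (~{air_m3_s * 2119:.0f} CFM)")
    print(f"water through the door: ~{water_l_min:.0f} L/min")

A few tens of litres per minute of water doing the work of thousands of cubic feet per minute of air is really the whole pitch in one line.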

So this is a very accessible way to get into liquid cooling. We call it hybrid liquid cooling, because it's not doing direct-to-chip liquid cooling. Moving up more, we have ARCS, the adaptive rack cooling system.

I like this system a lot because there's a small but important difference between it and the rear door. ARCS is closed loop, meaning we can bring warm water into it and contain the hot air within the system. That way you can still run 25, 27, sometimes 30 or 32C facility water, run your air-cooled servers inside the rack, and still collect all of that heat into the water without raising the temperature in the data center.

So that's the big distinction there. And then moving on to the Cray XD line, which we talked about earlier, this is more or less 70% liquid-cooled. So this is really targeting the CPUs and GPUs. A lot of people do this, we do it well.

So this is a product that we offer both with air and liquid cooling options. And then moving to the far right, the far top, this is really my background, my bread and butter. This is the Cray EX, also a system that we have on the showroom floor. If you haven't had a chance to see it, I highly recommend it. It's very impressive.

This is a system that can do 400 kilowatts, direct liquid-cooled, 100% fanless. It's a larger system - there's a caveat: it's a wider rack, 1,200 millimeters versus a 600-millimeter standard rack. But it's the highest performance, highest density you can get. So, to bring this all together: how do I get to liquid cooling in my data center today? Well, like I said earlier, there is not a one-size-fits-all solution for liquid cooling, which is why HPE offers multiple paths to get there. The retrofitted data center and the new data center - this, like I said, is my bread and butter. This is where I work, with the EX systems.

These tend to be the larger deployments, where you build a data center from scratch or retrofit one specifically to support liquid cooling, and you want to own the hardware and run it in your own data center. That's where those two paths come into play. We also have co-location: the idea being that you want to own the system, but you don't want to sign up to support the facility cooling system and all of the complexity that comes with it. Here you would work out an agreement with a service provider to host your hardware in somebody else's data center, and you just get access to it while somebody else maintains the liquid cooling infrastructure. Modular data centers are another really effective way to get into liquid cooling. I like this a lot.

If you get a chance, there's an exhibit with Danfoss on the floor. I really like this solution because it allows us to start with a clean sheet of paper and ask: if we want to cool this IT and we have complete control over the facility cooling system, how can we optimize those two metrics we talked about earlier? How can we find the optimum point between the facility efficiency and the IT efficiency? The modular data center essentially puts the hardware in more or less a shipping container that is designed to be deployed quickly, brings the facility cooling into extremely close proximity to the IT you're trying to cool, and in doing so enables extremely high effectiveness and efficiency for the overall solution. And then lastly, of course, we've talked about HPE AI cloud a lot.

But this is more or less an as-a-service model. You're not owning the computer, but you do want access to the latest AI hardware, the latest GPUs, for your workload, and we have the ability to offer you that as well. All of these paths will get you to liquid cooling, and not all of them involve building a data center from scratch. Like I said, there's no one-size-fits-all, so it's important to have flexibility as you consider liquid cooling.

So in closing here, I would say that, like I said, I've been doing this for 10 years, which isn't necessarily long, but today more than ever there is a lot of noise in the industry around liquid cooling. It's gotten to be very confusing. Where do you look for the right information? There are a lot of people saying this or saying that, a lot of competing narratives. I would argue that HPE is uniquely positioned in this space. And the reason is, like I said, I started with SGI, which HPE acquired back in 2016. I have many colleagues from Cray supercomputing.

These are the people that were doing liquid cooling all the way back in the 1980s when it was, again, a necessity for the type of technology at the time. So we have literally decades of liquid cooling experience that we've been able to leverage in all of our offerings. And we've learned a thing or two along the way. We weren't always perfect, but we've always learned from our mistakes and used that information to build better products and services for our customers.

So I invite you to take advantage of that. There are some liquid cooling experts, including myself, on the floor to answer any questions you may have. I would close by asking you to leverage that history.

Look at the options you can explore to get liquid cooling into your data center, because it is truly the most effective way to do cooling - not only now, but it will be required in the future. So, thank you.
