Welcome, everybody, to HPE. Thank you, thank you. Welcome, everyone, to HPE Discover 2024 Barcelona. First, let me say thank you all for joining us here on this last day of Discover, and for the last session of the day. We really appreciate you coming. My name is Devon Daniel. I'm the Product Manager of the HPE ProLiant Compute DL380a. And joining me a little bit later on stage is Greg Schmidt, who manages the ProLiant DL384 platform that was announced a few months ago and will be shipping pretty soon.
I'm going to stop here and just pause really quick and let y'all know that, given this is the last session of the day and there are no other sessions coming after us, if any of y'all have questions or just want to sit and talk to us, Greg and I are more than happy to huddle up with you off to the side and talk about anything concerning our platforms. Or, if you'd like, the main floor upstairs is still open, so we could walk y'all up to our booths and show you the platforms on display there. So hopefully by now, most of you have either stopped by the Private Cloud AI booth upstairs and talked to one of the associates about the solution, or come to one of these sessions that goes a little deeper into Private Cloud AI, and learned how the solution could best fit your needs for your AI workloads.
So I'm not going to go into too much detail on PCAI. I'll highlight it a little bit, but what Greg and I would really like to do is dive into what we believe is the true foundation of the solution, and that is the underlying server hardware. If you haven't heard of PCAI, it is a true partnership between HPE and NVIDIA that allows our customers to quickly and easily adopt AI into their enterprise. HPE AI Essentials paired with NVIDIA AI Enterprise gives you an integrated set of offerings that enables our customers to quickly deploy models with just a few clicks.
So that's a quick overview of the software of PCAI. But what you see here on the screen is really the hardware behind it. And as you know, AI comes in so many different shapes and sizes, so it only makes sense that PCAI comes in a few different sizes to fit your AI needs. We have small, medium, large, and extra large, ranging from inferencing all the way up to fine-tuning.
Now, behind this hardware, we really want to talk about two major platforms that are supported in the different solutions. And that's the DL380a and the DL384. So let's start with the DL380a.
So if you're familiar with the DL380a, the Gen11 version that we released in March of 2023, it was the first server in the ProLiant family that was densely populated with GPUs: up to four double-wide GPUs at 400 watts each. But as the industry continues to evolve and change, and AI is ever-evolving as well, HPE had to come up with a new, innovative solution for Gen12. That starts with supporting the latest and greatest Intel Xeon 6 processors, up to 144 cores, which is double the core count we had for Gen11, to fit a variety of needs. And then we pretty much doubled the size of the form factor, which enabled us to double the capacity of most of the hardware inside the server; I'll go into that more in the next slides. So on the sheet here, I have a comparison between the two models, the Gen11 and the Gen12, so you can get a better feel for the changes we made and why we made them.
So first and foremost, if you've been upstairs and seen the server, or just looked at the picture on the last slide, you can tell, if you're familiar with the Gen11 version, that the first drastic change we made was doubling the size of the form factor to a 4U. What we've noticed is that a lot of the GPU vendors have been trending upward, and it's expected: with more performance comes more wattage, more memory, and so on. So now we're seeing that high-performing double-wide GPUs are ranging into 600 watts. In a 2U form factor in Gen11, we had a thermal ceiling on supporting higher-wattage GPUs. We were limited to a max of four 400-watt GPUs in that platform. So when we saw the different vendors coming out with higher-wattage, 600-watt GPUs, we had to rethink our design to be able to fit in most server rooms, given the thermal limits. So we increased the form factor of the platform to a 4U, which gives us better airflow and more capacity in the platform.
As I mentioned, we'll be supporting the latest and greatest Intel Xeon 6 processors. As of right now, we have support for Sierra Forest, up to 144 cores; those are your E-core processors. At the beginning of next year, we will be supporting Granite Rapids, which is also part of the Xeon 6 family. Those will be your performance-core processors, with a wider range of options and better performance as well. With the larger box, not only were we able to double the capacity of our GPUs, we also took a hard look at how user-friendly our servers were. There were some challenges with our 2U box in how you accessed the GPUs. What we've done now is lay out the GPUs vertically, so you're able to utilize NVIDIA's newest NVLink 4 bridge, which lets you connect four GPUs at a time. You get two sets of four GPUs connected together, with shared memory, faster performance, and faster speeds, without having to go through the CPU first.
But it's also easier to access and swap out the GPUs, and if you want to scale up your GPU count, we've made that a lot simpler for our customers. In the rear, we have also increased the capacity of our PCIe slots: we've gone from four to six PCIe slots, plus two OCP slots as well.
And then we've also increased the capacity of our storage. So now you're able to fit 16 EDSFF drives in the front, plus up to eight small form factor NVMe drives. Going back to the GPUs, I want to highlight one of the main GPUs that we're going to be supporting on this platform, and that's the H200 NVL. That's the newest PCIe GPU on the market. It will be available starting next month, and it's a 600-watt GPU. It's pretty much twice the inferencing power of the H100 NVL, if you're familiar with that, at pretty much the same cost. We've increased our DIMM capacity to 32 DIMM slots, compared to the 24 that we had in Gen11. And then one of the main things in this platform, which we started with in Gen11, is having a dual power domain in the rear.
A lot of our competitors only have one power domain supplying power to the whole system. What we've done since Gen11 is split that up into multiple power domains, so that you don't have to sacrifice the I/O and the other components you add to your system just because you have a large number of high-wattage GPUs in it. So we now have two power domains dedicated to your GPUs. If you only want four GPUs in the system, you can utilize just one of those domains, for a total of five power supplies. But if you want to maximize it and go up to eight GPUs, you're able to use all three power domains. So you have enough power, and we made sure we have headroom in case any of our vendors decide to go even higher than 600-watt GPUs in the future, because we saw that happen in the Gen11 timeframe. And pretty soon we will also have iLO 7 integrated into the solution, so you'll have the latest and greatest iLO management software available on the platform.
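If it helps to see the power math behind that domain split, here's a rough sketch; the 600 watts per GPU is the figure from this session, but the per-domain capacity below is a made-up illustrative number, not an HPE spec:

```python
# Back-of-the-envelope GPU power budget for a GPU-dense 4U box.
GPU_WATTS = 600                  # max double-wide GPU quoted in the talk
ASSUMED_DOMAIN_WATTS = 3000      # hypothetical capacity of one GPU power domain

for gpu_count in (4, 8):
    total_w = gpu_count * GPU_WATTS
    domains = -(-total_w // ASSUMED_DOMAIN_WATTS)  # ceiling division
    print(f"{gpu_count} GPUs -> {total_w} W of GPU power, "
          f"~{domains} dedicated GPU domain(s) at the assumed capacity")
# The point of the split: GPU power draws from its own domains, so adding GPUs
# never eats into the budget for CPUs, drives, and I/O on the system domain.
```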
And with that, I want to go ahead and hand it off to Greg Schmidt so he can give you a little more detail on the DL384. Thank you, Devon. All right. My name's Greg Schmidt. I'm the Product Manager for the HPE DL384. I've been doing GPUs since 2001 and AI servers since 2015, and it's been a fun ride. What we're going to talk about today is not your everyday AI server.
It's not your {inaudible} server. It's not for Outlook. It's not for standard enterprise applications. I'm not even qualifying Windows. I want to be really clear: the DL384 is a completely different platform. NVIDIA had a problem.
The problem NVIDIA had was that the CPUs that AMD and Intel were delivering didn't meet the needs of GPU computing. In a nutshell, the biggest problem is that they're from AMD and Intel. NVIDIA likes to control their destiny. The second biggest problem, though, is really a technical one. It turns out that in most workloads, GPUs are just sitting idle about a quarter of the time.
They're the most expensive part of your server and they're doing nothing. Sounds like my daughter. Fixing that starts with figuring out why they're doing nothing.
And mainly it's because they're waiting for data to come from the main memory of the system. With Intel and AMD, that data comes out of main system memory over PCIe Gen 5, and that's just not fast enough: 120 gigabytes per second maximum. That's a snail's pace. The new server uses a CPU from NVIDIA. That's where the four comes from. It's not a zero for Intel, it's not a five for AMD, it's a four for NVIDIA.
What I'm going to show you today, I'm not going to go through the whole usual speech. We're going to focus. This is a 10-minute tech talk; that's like insane. So I'm only going to focus on the core elements. What is NVIDIA's GPU? Why is it better for inference? Can you do some other things with it? And then, if you'd like, we'll go downstairs and spend another 15, 20 minutes or half an hour looking at the box and talking about it. All right, this is a simplified block diagram.
I've pulled out the drives. I've pulled out the network adapters because all I've got is time to talk about a few things. So let's stay very focused.
On the system here, you see Grace and Hopper. Grace is the CPU from NVIDIA. They license technology from Arm, you know Arm, the processor IP vendor. They take the Arm technology and link it together to deliver a CPU that does great for GPU computing. What does that need? It turns out it doesn't need that many cores. There are 72 Neoverse cores here; that's not the big thing. See this connection between the Grace and the Hopper at 900 gigabytes per second? That's an NVLink connection. You've heard of NVLink, the high-speed link between GPUs; they're leveraging the same basic idea here to increase the bandwidth from the Grace to the Hopper GPU to seven times the bandwidth of PCIe. So if you're waiting on memory, give it more bandwidth and fill that memory faster. At this rate of bandwidth, let's take a look at the memories here on the Hopper GPU. That's the most expensive memory in the world.
HBM3e, triple stack. Actually, it's now eight stacks high, layer on layer. Think of DDR: you put a layer down, essentially a planar array of capacitors. High bandwidth memory is a layer of that, then another one on top, and another, and another, eight high, with vias running down between them, and you can access it all simultaneously. So this is something.
So the main memory at the top, 480 gigabytes, runs at about 768 gigabytes per second max, close to a typical x86 system. The bandwidth of the memory on the bottom is about seven times that: 4.9 terabytes per second. It's expensive. That means you want to use it wisely. Okay.
Let's do a rough calculation. At 900 gigabytes per second, filling that 144 gigabytes takes 144 divided by 900, so about 0.16 seconds, and you can flush the entire GPU memory. In a PCIe Gen 5 system with AMD or Intel, you're at, what, a second and a half. An order of magnitude difference.
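If you want that back-of-the-envelope math written out, here's a minimal sketch using the figures quoted in this session (144 GB of HBM, 900 GB/s NVLink-C2C, and the roughly 120 GB/s PCIe Gen 5 number mentioned earlier):

```python
# Rough fill-time comparison using the bandwidth figures quoted in the talk.
HBM_CAPACITY_GB = 144        # Hopper GPU memory on this system
NVLINK_C2C_GB_S = 900        # Grace-to-Hopper NVLink-C2C bandwidth
PCIE_GEN5_GB_S = 120         # PCIe Gen 5 figure quoted for x86 hosts

nvlink_fill_s = HBM_CAPACITY_GB / NVLINK_C2C_GB_S
pcie_fill_s = HBM_CAPACITY_GB / PCIE_GEN5_GB_S

print(f"Fill 144 GB over NVLink-C2C: {nvlink_fill_s:.2f} s")  # ~0.16 s
print(f"Fill 144 GB over PCIe Gen 5: {pcie_fill_s:.2f} s")    # ~1.2 s
print(f"Ratio: ~{pcie_fill_s / nvlink_fill_s:.1f}x faster")   # ~7.5x
```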
What do you do with all that time you've saved? You do more computations. You could tell your bosses: I can deliver a better user experience, because this memory is going to be super fast and I'll be able to deliver responses to chatbot inquiries 30% faster. And because you can do it 30% faster, guess what? You can put 30% more users on that system.
So you spend less per user and give a better experience. What's not to love? All right, at the top, I've got traditional DDR5 memory, a low-power version of it, all soldered down in here. And Grace Hopper is about balancing memory. Now, you can buy one of them, just the left side, or you can buy two of them. And notice, when I buy two, I've got 900 gigabytes per second of NVLink bandwidth between the two Hopper GPUs.
NVIDIA calls this NVL2. You'll start seeing this more and more, NVL2. NVL4 means four GPUs connected by NVLink. NVL8, NVL72 and 36. Big boxes.
Okay. So when you connect two of them with that 900 gigabytes a second, it also turns into something like a NUMA domain in your DL380s, a two-processor server. So these GPU memories are not just 144 on the left and 144 on the right; to the program, they act as one large GPU memory space of 288 gigabytes. What does that mean? Well, you can handle much bigger models.
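To put rough numbers on what that unified 288 GB space buys you, here's a small sketch; the model sizes and the one-byte-per-parameter, weights-only sizing are assumptions for illustration, ignoring KV cache and activation overhead:

```python
# Approximate memory to hold model weights at 8-bit precision:
# ~1 byte per parameter, so an N-billion-parameter model needs roughly N GB.
SINGLE_GPU_GB = 144   # one Hopper GPU's HBM
NVL2_GB = 288         # two GPUs presented as one memory space over NVLink

for params_billion in (8, 70, 180):
    need_gb = params_billion * 1.0  # assumed bytes per parameter -> GB
    print(f"{params_billion}B model (~{need_gb:.0f} GB): "
          f"fits one GPU: {need_gb <= SINGLE_GPU_GB}, "
          f"fits NVL2: {need_gb <= NVL2_GB}")
```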
Or you can handle more models. How am I doing on time? Oh, I got time. Okay. So let's take a couple of examples. I said that this provides better experience to your users and more work and more users per GPU, saving you money.
Deploy fewer GPUs, save money. Let's take an example of a user experience with multiple, modest-sized deep learning models. How many of you have deployed deep learning? Could I see a show of hands, just briefly? One. Only one? Okay.
How big is the typical model you're using? 7 billion, 2 billion, 3 billion? 8 billion parameters. Okay. Whenever I ask, the answer is always the small models. Enterprise customers usually use small models instead of the trillion-parameter models like OpenAI's. The reason's really simple.
You don't have the whole world of data to train a big model. You have a relatively small set of data, you need a relatively small model. And typically what you're training to do is not that complex. You're not trying to answer every question in the world, you're trying to answer in a limited set.
You don't need a big model to do that. Would you share more, like what you're trying to do and how much data you use to train? Yeah, we do NER {inaudible} and other NLP use cases, but mainly NER. I'm sorry, I'm losing my hearing, I'm old. So image recognition? No, no, NER, named entity recognition. So NLP. Natural language processing, named entity recognition.
A couple hundred gigs of data, or a terabyte? Yeah, not big data sets. This is pretty common among enterprise users. You don't have to worry about huge models. And the models you use, you grab them off the shelf; there are all these pre-trained models.
Then you augment them. So let's say that you've got a whole slew of small models. A model that's eight billion parameters takes about nine gigabytes on your-- well, maybe 10. About 10 gigabytes of space to hold in the GPU at eight bits.
So at 10 gigabytes each, you could hold 14 models. Simple. But there's all that main memory. Now, normally you don't want to be putting your models up in main memory, because then you have to move them down to the GPU to execute, and over PCIe that takes too long. You get latency, your first-token response to the user gets bad, people get frustrated, you know how it is. Here, I can shuttle memory back and forth.
I could take a 70-billion-parameter model, 80 gigabytes, and get it over and across in under a tenth of a second, about 0.09 seconds. People don't even notice it.
If you were in a PCIe box, it'd be around three quarters of a second. That's a long time. So your advantage here is, first off, lower response latency, giving a better user experience. And then you can cache more models up in main memory, move them down as needed, and serve a lot of models from one unified image. The ones that don't get used often, you shuttle back and forth as needed. That's one example.
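Here's a toy sketch of that model-caching idea using the numbers above; treat it as an illustration, not a sizing guide (the 10 GB per 8-billion-parameter model is the rough estimate from the talk, and the 480 GB main memory figure is the capacity mentioned earlier):

```python
# Park a library of small models in Grace's main memory and swap them into
# GPU memory on demand over the 900 GB/s link.
MODEL_GB = 10            # ~8B parameters at 8 bits, per the talk's estimate
HBM_GB = 144             # hot models, resident in GPU memory
LPDDR_GB = 480           # warm models, cached in Grace main memory
NVLINK_C2C_GB_S = 900

print(f"Models resident in GPU memory: {HBM_GB // MODEL_GB}")      # 14
print(f"Models cached in main memory:  {LPDDR_GB // MODEL_GB}")    # 48
print(f"Swap in an 8B model: {MODEL_GB / NVLINK_C2C_GB_S * 1000:.0f} ms")  # ~11 ms
print(f"Swap in a 70B model: {80 / NVLINK_C2C_GB_S * 1000:.0f} ms")        # ~89 ms
```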
Next example, I like this one. This is really cool. Okay. So say you're doing a large language model, the ChatGPT kind of thing. Seven billion parameters, 70 billion parameters. Am I in your way, guys?
You good? Okay. And you're typing. Now, as you're interacting with this model, the past conversation provides context for what it's going to do in the future. Agree? But as you have a longer and larger conversation, you're building up more and more context.
If you leave that context in the GPU memory, the high bandwidth memory, that expensive memory, you're going to run out of room and you won't be able to host as many users. So what's normally done to protect the GPU memory is to shuttle each user's context up to system main memory between entries. Now, this might seem crazy. But think about how you use a model: even if you're really fast, there's like 10 seconds, minimum, between when you get the last readout and when you type in, "Do this next." If you're working on a big Word document, it might take you minutes to read it before you ask it to do something different. If you leave that context in the GPU memory, you run out of space and you can't support more users. So what's normally done is you shuttle it up to main memory. In a PCIe system, when you bring it back down, it takes a long time. With mine, it takes a seventh as much time.
Now, what this means is two things. Point one: the context comes back faster, the calculation occurs faster, and you get a lower response time, lower latency on the first-token response. Thirty percent better than an x86 system. Point two: because you're executing faster and getting more done, you can put more users on your system, 30% more users, maybe 40. It depends on what you're doing.
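And one last toy calculation for the context-shuttling case; the per-conversation context size is purely an assumption (real KV-cache footprints depend on the model, precision, and sequence length), while the bandwidths are the ones quoted in this session:

```python
# Round trip: move a user's conversation context up to host memory while they
# read and type, then bring it back down when the next prompt arrives.
CONTEXT_GB = 2.0             # assumed context footprint per long conversation
NVLINK_C2C_GB_S = 900
PCIE_GEN5_GB_S = 120         # x86 comparison figure quoted earlier

def round_trip_ms(size_gb, bandwidth_gb_s):
    return 2 * size_gb / bandwidth_gb_s * 1000  # up and back down

print(f"NVLink-C2C round trip: {round_trip_ms(CONTEXT_GB, NVLINK_C2C_GB_S):.0f} ms")  # ~4 ms
print(f"PCIe Gen 5 round trip: {round_trip_ms(CONTEXT_GB, PCIE_GEN5_GB_S):.0f} ms")   # ~33 ms
# The faster shuttle is what lets you park idle users' context in main memory
# and still get a fast first token when they come back.
```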
So you give your users a better experience and you host more users at lower cost. What's not to love? And I've got a million of these, but I don't have a million minutes. And you can see here, boom, boom, boom, benchmarks. There's a good benchmark: MLPerf Inference, the machine learning performance benchmark. It's a suite of various workloads. They're usually pretty old, they don't rotate them that fast, and they're pretty small. And we submitted to it recently.
It's the standard, though. You know how it goes: you get an industry standard, it slowly evolves, and AI is changing so fast that what was cool last year isn't cool this year. So the models are generally last year's models, but they're still good, and it tells you how a system does.
We were in development on this. This box is coming out, starting production at the end of this month, and we'll be shipping the first boxes to customers in December. Lots of interest, hundreds of quotes rolling out. The key point I'd like to make, though, is that around June we had to submit for the MLPerf benchmark results. Might have been May. And our first test boxes were just rolling in from engineering.
We literally got them in the Houston office. One of my great benchmarking guys dug into it. He had two days. He put the system together, rounded up the networking, got everything working, put drives in, and FedExed one to NVIDIA so they could submit with the same box. We did our part and we submitted the first results in, like, five days, I think it was. Now, usually HPE takes, conservatively, a month to fine-tune and make sure everything's at its best.
They get 5% better performance by tuning here, 10% there. It matters. We didn't have time. We just ran it and were done. We still took first in 7 of 16 of those benchmarks. We were the first ever to publish on the GH200 144-gigabyte version. And mainly, where we lost, we lost to the first darn Blackwell, which isn't even shipping yet. So this is an awesome box.
And it's worth mentioning: I mentioned Blackwell, and you might say, well, why don't I just do Blackwell? It's coming out. The real problem is getting hold of it. It's going to be kind of limited availability.
So this box is going to be the best thing rolling for about six to nine months. If you want to do AI, it's good. You might say, Greg, this sounds too good to be true, are there any downsides? Yeah, there are a couple of downsides. Come over to the sofa and I'll talk with you about it. It's an Arm processor, so we're having a little bit of trouble with drives; for example, we don't have hot-plug support.
We don't have a RAID controller. But in these use cases, the data on the box is transactional; it's not the valuable data.
You can shift the work over to another server if something fails, so that hasn't been a big deal. Each box is about three kilowatts of steady-state power. That's a lot of power. In an eight-kilowatt rack, you can put two of them.
So it's not all sugar and plums, but it's fast. And per dollar, per watt, it's the best performance in our portfolio. Now, we've gone over the solutions we have. Private Cloud AI: number one thing, and Devon did a great job on it. You probably haven't heard this from other folks.
In this whole room, we got one guy who raised his hand and said he's deployed AI. How many of you should have done it? How many of you are planning to do it? How many of you are waiting to do it until you understand everything? Or, I don't know enough, or whatever. I've been doing this dedicated for a decade. The people that succeed are the people that start. It doesn't matter what you start on. It honestly doesn't. Don't worry about getting the best server or the exact right configuration or your entire tool chain understood.
Buy a tool chain that works. NVIDIA AI Enterprise, HPE works with them great. Buy some servers. PCAI sets it up, you can get your servers, storage, your networking, your software all delivered. If you wait to understand it, you will fail because it changes too fast. Don't worry.
And then you've got to find some young bright kid who's really interested in this stuff and give him nothing else to do. And just keep asking him, how's it going? Or pick one of your older bright researchers or engineers, software engineers, turn them loose. Don't give it to them as a side project. Next up.
So what was it? Don't worry about getting perfect hardware. Just get some hardware. Get a good software tool chain that you don't have to stitch together from open source. Number three, get a guy who's dedicated to it and interested. Number four, don't pick a hard darn project. I can't tell you how many times I've seen people try and do something.
Oh, we could solve this huge problem that would add huge value to the corporation. Pick a simple project. Not just one.
Find one from the HR group, find one from your manufacturing team and find one from facilities that are simple. And then they're all happy with you and you're building support across your organization. And when you go to ask the CFO or the CIO for budget, they all say, wow, we're really glad they did that for us. And start expanding. So build success, build experience, get started, get a coalition and go. HPE can do that with you. The HPE Private Cloud will get you off the ground fine.
It's good. It's got all the tool chains. We've got support services to help you get started. We've got partners that are really good. You can't find a guy to help you, a young smart kid or a guy in your organization that wants to learn? We'll get you a partner that'll help you. We'll help you ourselves. We have the servers, storage, and networking to do it. Ultra scalable.
But just get started. We've got great hardware. You know that, you're HPE customers. We have great hardware.
Whether you want to start on the 380a Gen11 right away, or wait till the 384 comes out next week. Please wait for me and buy me; I need some sales early on. Or get the bigger model. Disadvantages to all of these? We only had 10 minutes to talk. Come down to the floor and ask me how to choose between an eight-GPU server and a two-GPU server. I've got time. I've got another hour. Come on down and chat, or we can do it here. We'll have fun. And you can ask for some more resources and materials here.
Devon and I will stay here with you. We'll sit up there and talk with you all. I'd love to chat. You probably guessed. I enjoy speaking with customers. I'd like to thank you and wish you a great evening near the close of HPE Discover. Thank you for your time.