Mastering AI: The New Infrastructure Rules

>> We are witnessing the rise of a completely new computing era. Within the next decade, a trillion-dollar-plus data center business is poised for transformation, powered by what we refer to as extreme parallel computing or, as some prefer to call it, accelerated computing. While artificial intelligence is, of course, the primary accelerant, the effects ripple across the entire technology stack.

To give you an idea of the scope of this transformation, in 2022, accelerated computing comprised about 10% of all data center spending. By 2030, it will account for almost 90%. And importantly, the overall data center market is expected to grow by 3.6X in that timeframe, according to theCUBE Research estimates.

Hello and welcome to our executive forum on AI infrastructure and innovation. Now in this series, we're going to explore the architectural principles and operational strategies that drive successful AI initiatives. You're going to hear from industry leaders at Penguin Solutions on why some enterprises achieve AI breakthroughs while others struggle.

And how to build environments that support both current and future workloads at peak performance. And we're going to reveal why traditional approaches often fall short, causing GPU clusters to operate at only about 50% of their potential. And we're going to show you how to construct scalable, efficient AI environments that accelerate results.

So, the first session we're going to do today is called The Race to AI Dominance: Rapidly Deploying and Optimizing AI Infrastructure. And it features Pete Manca, who's the president of Penguin Solutions. Pete will discuss how CEOs and CIOs can quickly adopt and institutionalize AI as a competitive advantage, striking the right balance between speed and stability to ensure operational efficiency and long-term flexibility. Now, let's welcome Pete into the program. He's the president of Advanced Computing at Penguin Solutions and he's got extensive experience helping enterprises bridge the gap between traditional IT practices and the high performance computing requirements of modern AI workloads. And today, he's going to share strategies for deploying AI infrastructure rapidly and at scale without sacrificing resilience or future agility.

Pete, welcome to theCUBE. It's great to see you again. >> Thanks, Dave. - Appreciate you coming in. >> Yeah, it's great to see you too. >> So when you're out talking to customers, talking to organizations, what are you hearing from them? What are their biggest challenges? They're doing a lot of experimentation in the cloud, but they want to bring AI in-house. They want to build centers of excellence. They understand the importance, but they don't have the skills. What are they telling you?

>> Yeah. Look, the market's moving incredibly fast, right? As you mentioned, it's growing exponentially. The enterprise customers we speak to know they have to get some kind of AI strategy in place, whether for simple things like increased service offerings, or more complex things like fraud detection and other use cases we hear out there, but they don't know how to get there, or they're not set up today in order to get there. Traditional infrastructures are very different than AI infrastructures, and so they have to rethink how they do IT.

You can go to the cloud, you can do that, but a lot of enterprises aren't willing to take the massive amounts of data or the proprietary data that they have and put it in the cloud. And so, the choices really come down to going to what we would call a tier-two service provider, that is, a non-hyperscaler cloud, and leveraging their capabilities, or building it in-house. Building in-house is probably the preferred way to go, but it means literally a soup-to-nuts transformation, from data center build-out, power, and cooling all the way through the architecture of their system. So, it's a very complex environment and they look to partners like Penguin Solutions to help guide them through that as a trusted advisor. That's really what we do. >> When I was prepping for this, I wrote down just some of the vectors and challenges: data gravity, everybody wants to do on-prem AI, the cloud's expensive, they're uncertain about ROI, they've got skills issues.

They don't really necessarily know how to architect hybrid AI, so there's all these sort of things coming at IT professionals and they don't want to just build a snowflake. >> That's right. - I don't mean Snowflake the database company, I mean a different solution.

>> Custom. Right. - A custom solution. They've got to build stuff that's not only proven, but then standardized so they can scale. So, what strategies are customers employing to do that and how are you helping? >> Yeah. So, first of all, it starts really all the way to the left side. It starts at the very beginning.

And I mentioned data center earlier, it literally starts at the data center. If you look at today's data centers in enterprise technology or enterprise companies, they're not set up for this kind of infrastructure. And so, we have to get involved very, very early on, help them through making decisions around where their data center is going to be, what kind of cooling will be in their data centers. Is it liquid cooling? Is it direct-to-chip? Is it traditional air cooling? And you bring that all the way through the design to the end. What we do at Penguin is, we help them through all those decision points and we offer them a choice. We offer a custom solution.

You said snowflake, but I'm going to use the word custom, because I don't really think it's a snowflake per se. There are so many choices, starting with chip vendors: do you want to be an NVIDIA or an AMD GPU house? Storage, is it a parallel file system? Is it WEKA, or VAST, or DDN? What kind of network? Is it InfiniBand or high-bandwidth Ethernet? There are so many choices out there. And there's really a couple different ways you can go. You can build a custom solution for your use case. We'll help them through that.

Or you can build an off-the-shelf solution, where you leverage either NVIDIA technology, like DGX, or one of our OriginAI reference architectures, which gives you a pre-configured architecture for a particular use case. And so, we help guide them through those decision points every step of the way. >> I love the name Penguin, because I'm envisioning all these penguins on the iceberg. They all want to have on-prem AI and they're ready to dive in, and then that first one's going to go and it's like, then the competitive fire- >> Oh, that's exactly right. >> That's right? Banks all get this. >> That's right. - And they talk to each other

and they're like, "Whoa, if we get left behind, we're going to be in big trouble." And you mentioned all these different permutations, whether it's the networking, the file system, every part of the stack is being reformed, as we were saying up top. How do you ensure resilience in such an environment? >> Yeah, that's a good question. That's really

just a design principle. A lot of the technology you can pick and choose depending upon your use case. The trick is designing it right up front. What you don't want to do is just piece these components together as you go. And the difference between this and a typical IT environment is scale. A typical IT environment might have 100 or 1,000 servers; you can manage that in-house with your IT staff.

You spread that out to tens of thousands of GPUs for these large language training models, with very sophisticated networking, direct-to-chip, things like NVLink and maybe optical interconnects, or other types of technology that allow you to bypass the main core CPU and go right GPU to GPU. That gets really complex, and so we think you need to have a trusted advisor, like Penguin, to help you design those architectures and help you do it with resiliency and with quality, so you can get the uptime we see at some of our customers, where we bring them from 50% uptime today on their GPU clusters to 95%. We've published many studies about that and we've been very successful doing that. And it really comes down to proper design constraints right from the beginning.
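To make the GPU-to-GPU communication Pete describes concrete, here's a minimal sketch using PyTorch's NCCL backend, which routes collectives directly between GPUs over NVLink or NVSwitch where the hardware provides it. The framework choice is an assumption for illustration; the conversation names NVLink but no particular software stack.

```python
# Minimal multi-GPU all-reduce: NCCL moves the tensor GPU to GPU
# (over NVLink/NVSwitch where available) without staging through
# the host CPU.
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK per process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a gradient-like tensor on its own GPU.
    grad = torch.ones(1024, device=f"cuda:{local_rank}") * dist.get_rank()

    # The collective runs device to device across all ranks.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: reduced value = {grad[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 allreduce_demo.py` (the filename is ours), each process drives one GPU and the reduction happens between devices.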

>> What about utilizing GPU capacity? You referenced earlier, a lot of these customers that we talk to, they're having to re-architect their data center for liquid cooling. The CFOs are gnashing their teeth, "We know we have to do this, but we want to make sure that we have ROI." You hear stories about GPU capacity being underutilized and maybe the cloud guys are good at it, maybe not. Maybe they're just throwing CapEx at the problem. >> Right. - But enterprises, they can't afford to do that.

So, what are some of the common pitfalls that you see around GPU under-utilization and how are you helping customers address that? >> Yeah. So, that's a really broad question. It's an architectural discussion. You've got to get the data from the storage, through the network, into memory to feed these GPUs to keep them busy.

So, right there, you've got an architectural problem that you're trying to solve around, how do I get very sophisticated, high-speed parallel file systems to feed these GPUs as much data as possible? And in some cases, in real time; it could be a batch or it could be a real-time processing engine, so you've got to figure that out. And once you do that, then you've got to make sure that you keep the GPUs up and running. They're a little bit finicky, it's new technology, so you want to do things like predictive failure analysis.
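As a hedged illustration of what such a predictive probe might look like, the sketch below samples per-GPU temperature and uncorrected ECC error counts through NVML and flags suspect devices. The thresholds and the choice of signals are our assumptions, not Penguin's actual heuristics.

```python
# Illustrative predictive health probe: flag GPUs whose temperature or
# uncorrected ECC error count suggests trouble ahead.
import pynvml  # NVIDIA's NVML bindings (pip install nvidia-ml-py)

TEMP_LIMIT_C = 85          # illustrative thermal threshold
ECC_UNCORRECTED_LIMIT = 0  # any uncorrected ECC error is suspicious

def suspect_gpus() -> list[int]:
    pynvml.nvmlInit()
    flagged = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            try:
                ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                    handle,
                    pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                    pynvml.NVML_VOLATILE_ECC)
            except pynvml.NVMLError:
                ecc = 0  # ECC reporting not supported on this device
            if temp > TEMP_LIMIT_C or ecc > ECC_UNCORRECTED_LIMIT:
                flagged.append(i)  # candidate for proactive removal
    finally:
        pynvml.nvmlShutdown()
    return flagged
```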

You want to understand if a GPU might fail, take it offline before it fails, and replace it with a different one so you can keep that cluster up and running at optimal performance. That takes a lot of skill and design capability, and that's what our intelligent compute environment software, Clusterware, and AIM help our customers do: keep their GPUs up and running, keep them efficient, and monitor them to make sure the health is good across the board. >> You, in your career, have really had a big focus on simplifying things, certainly simplifying cloud. When it first came out, it was a complicated situation for a lot of people. You're back at it here in AI. >> I am. Yeah. - What are the

similarities and what are the differences? >> I think it's all these... And you're right, I've been in this market, systems management, infrastructure management, simplifying operations for pretty much my entire career, and they all sort of look the same. Everything starts off very fast, rapid, we've got to get to the end result before the next guy does, we've got to solve this business problem faster than our neighbor does. And that rush adds complexity. And so, the way to simplify it, I've always found in my career, is you create a software abstraction layer that hides the complexity of the underlying hardware, and you make it simple for the end user to manage the environment while you abstract away the complexities of the underlying hardware.
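Here's a rough sketch of that abstraction idea in Python. The interface and class names are invented for illustration; they are not Clusterware's actual API.

```python
# One uniform operator-facing surface, many hardware back ends.
from abc import ABC, abstractmethod

class AcceleratorBackend(ABC):
    @abstractmethod
    def provision_node(self, hostname: str) -> None: ...

    @abstractmethod
    def health(self, hostname: str) -> dict: ...

class NvidiaBackend(AcceleratorBackend):
    def provision_node(self, hostname: str) -> None:
        print(f"provisioning NVIDIA node {hostname}")  # e.g., CUDA stack

    def health(self, hostname: str) -> dict:
        return {"host": hostname, "vendor": "nvidia", "ok": True}

class AmdBackend(AcceleratorBackend):
    def provision_node(self, hostname: str) -> None:
        print(f"provisioning AMD node {hostname}")  # e.g., ROCm stack

    def health(self, hostname: str) -> dict:
        return {"host": hostname, "vendor": "amd", "ok": True}

def build_cluster(backend: AcceleratorBackend, hosts: list[str]) -> None:
    # The operator works against the abstraction, never vendor details.
    for host in hosts:
        backend.provision_node(host)

build_cluster(NvidiaBackend(), [f"node{i:02d}" for i in range(1, 5)])
```

Swapping `NvidiaBackend` for `AmdBackend` changes nothing for the caller, which is the point of an abstraction layer.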

And that's something that Clusterware does for our customers, is we try to abstract away all those complexities I just talked about around file systems and networking, and keep it simple and logical so that an end user can then manage the environment or plug in higher layers of the stack to interface with the management of the infrastructure. >> And it's not like Penguin started on this just yesterday, you guys have been in this space for a long time. The market has come to you. It's interesting, you mentioned VAST, and WEKA, and DDN. >> Yeah. - Two are relatively new.

>> Right. - DDN has been around for a long time, but the market just sort of came to them, similar to Penguin, right? You're applying that expertise from your HPC days and then tuning it to enterprise AI. >> Yeah, that's absolutely true.

I remember talking about Penguin 20 years ago, when I was working at some of my other companies. Penguin has been around for a long time, with a tremendous amount of expertise in large-scale clusters. And sure, it started off with HPC-type clusters, AI wasn't in vogue back then, but that skillset is still important today and it extends to the AI clusters, because it's really about managing scale and managing complexity. And so, Penguin has decades of experience doing this.

And like you mentioned with DDN, sure, the market came to us, right? AI took off. It happens a lot in this industry. You catch some trend and you find some expertise you have that might've been niche before, all of a sudden becomes incredibly powerful, incredibly important, and that's where we find Penguin today. >> Yeah. You and I remember VMware started out doing

workstation virtualization. >> I sure do. - My last question for you is: What advice would you have for organizations that want to go fast, that want to accelerate their AI, but also want to have that operational flexibility and excellence going forward? So, they're a little nervous right now.

We don't want to rush in and then mortgage the future, but at the same time, they have to get there fast. >> Right. Yeah. I'll give you one self-serving answer, which is: call Penguin. You want to talk to somebody who's been in this game and doing this for a long time, so you want to go to a trusted advisor, and that's what we are. But short of the commercial there, my advice would be to slow down just a bit, take your time, and understand what kind of use case, what kind of outcome are you looking for? And what's the technology that's going to give you the best chance of getting that outcome? There are a lot of choices today. I like to use this analogy of Apple iOS versus Google Android.

You can go with the DGX and NVIDIA stack all the way up, kind of like Apple has done it, where they give you an entire stack and you can pick any flavor of ice cream as long as it's chocolate. Or you can go with the Google model where you can mix and match and plug in best of breed. That's a fundamental choice right there that a customer has to make. I'd say step back, make that choice, what works best for you, and then work with partners like Penguin to help you implement that. >> Yeah, a lot of complexity, especially if you go with the latter, because you're going to have different networks, as you say, and different alternatives. And the former, it might limit some of your flexibility down the road.

So, that's a real business case that you have to think about. >> And there's a price trade-off as well. >> Right. That's right. Pete, thank

you. Appreciate your time. >> My pleasure. It was great. - Okay. Keep it right there, I'll be back shortly with Trey Layton and we're going to dig into the new AI stack. Keep it right there. >> So, in terms of the amount of hardware that we need to facilitate the Makerspace, with its vision to serve the whole of the GT campus, it really becomes a game of capacity. >> When you look at an AI installation, it's GPUs, it's memory, it's networking, it's storage, and bringing it all together at a very large scale. When you talk about 10,000, 20,000 servers or GPUs clustered together with high-speed networking like InfiniBand from NVIDIA, doing that and getting it working correctly from the start and then keeping it running along the way is incredibly difficult.

>> To be able to train a workforce, to be able to build out the next generation of AI capable researchers, you have to have the capacity and the hardware to be able to support that. >> So Penguin's role is, we go and consult with our customers, we help them design their solutions based on their needs, we then help build that solution in our factories, then we deploy it for them. And often, we'll manage it for them as well.

>> Hello, everyone. Welcome back to our executive forum on AI infrastructure and innovation. In this series, we've been examining how thoughtful design and robust operational strategies can help enterprises unlock real value from AI. And in this particular segment, we're going to focus on ensuring AI success with unified intelligent compute environments, and we're going to explore why integrated systems and infrastructure matter more than ever for organizations deploying hundreds, if not thousands or more, of GPUs. In this session, we're joined by Trey Layton, who's the vice president of product management and software at Penguin Solutions. Trey is going to shed a little light on how fully integrated computing environments serve as the bedrock for thriving AI initiatives, and how leaders can strategize for security, efficiency, and operational excellence.

And he's also going to discuss key factors that keep these environments running at peak performance in both the near and long term. And there's some hard news. Trey, welcome. Good to see you. >> Thanks for having me. Good to see you again. >> Yeah. So, Trey has extensive experience guiding CIOs in building AI factories, you've heard that term before, and making sure that they can manage and optimize GPU resources at scale, which is really challenging.

And today, he's going to share his insights on designing unified environments, something he knows a lot about given his history, but there are some things that are different in this AI era in terms of how you provide the data workflows, how you secure this infrastructure, and how you scale to accommodate a myriad of AI workloads. Okay, let's start by setting the stage here. You guys talk a lot about silos. Customers are frustrated with silos, they want to streamline.

What are the challenges that you see enterprises facing as they try to deploy enterprise AI? >> So, Dave, I would say that throughout history in IT, we've always seen siloed technologies. And as those technologies have evolved, we've wanted to find optimal ways to combine them. I think the unique thing with artificial intelligence is, we're talking about constructing an environment that needs to run at peak performance all the time, which is a bit of a contrast to what IT organizations are typically used to managing.

You're talking about a massively scalable parallel processing infrastructure that's designed to run at peak performance all the time. That's different than what organizations of the past have built, and that's what we're focused on building. >> So, I think you'd agree that a lot of the experimentation has been done in the cloud. Organizations understand that that's a good place to test and break things, but at scale, it gets expensive. And not only that, a lot of the data lives on-prem, there's latency issues, you don't want to move that data. And what we've noticed in our research at theCUBE Research is that customers don't necessarily have the skillsets, and they certainly don't have the AI stack that's been built out.

It's so early days, kind of immature. So, you guys, if I understand it correctly, actually have that skillset, because of your HPC heritage, and you're applying that to enterprise AI. Of course, you and I have known each other for a long time in terms of how you simplify on-prem infrastructure, but AI is different. So, you've had to build out new solutions, and you've got some news around this.

Why don't you take us through the hard news that you're announcing and we'll weave it into how you secure things and how you make it efficient? >> Absolutely. So, Penguin has decades-long experience in the high-performance computing environment, and a lot of those baseline technologies are the foundation of building artificial intelligence environments. And so, we have, for multiple decades, had a software portfolio that has constructed those environments. And so, the hard news is, first of all, the software that we have used to construct these environments has been branded ICE Clusterware.

That is effectively the software that automates the provisioning of these clustered environments for artificial intelligence use cases. A lot of organizations don't have the skillsets to deploy these particular configurations, and this software is designed to automate these outcomes, so it makes it easier for organizations to deploy those environments. The second piece of news is the introduction of a new piece of software called ICE Clusterware AIM Service. And this is designed to give you the telemetry and the predictive failure analysis to remediate the failures that emerge in running these complex environments.
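To picture what automating that provisioning can mean, here is a hypothetical, heavily simplified declarative cluster spec of the kind a provisioning layer might validate before acting. The schema is our invention for illustration, not ICE Clusterware's real format.

```python
# A toy declarative spec: describe the desired cluster, validate it,
# and only then let provisioning proceed.
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterSpec:
    name: str
    gpu_vendor: str    # "nvidia" or "amd"
    nodes: int
    gpus_per_node: int
    fabric: str        # "infiniband" or "ethernet"
    filesystem: str    # e.g., "weka", "vast", "ddn"

    def validate(self) -> None:
        if self.gpu_vendor not in {"nvidia", "amd"}:
            raise ValueError(f"unknown GPU vendor: {self.gpu_vendor}")
        if self.fabric not in {"infiniband", "ethernet"}:
            raise ValueError(f"unknown fabric: {self.fabric}")
        if self.nodes < 1 or self.gpus_per_node < 1:
            raise ValueError("need at least one node and one GPU per node")

spec = ClusterSpec(name="train-pool-a", gpu_vendor="nvidia", nodes=128,
                   gpus_per_node=8, fabric="infiniband", filesystem="weka")
spec.validate()  # provisioning would proceed only on a valid spec
```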

An analogy I often give folks is that enterprise IT organizations have typically built tooling and skillsets to manage IT infrastructures for high availability, and rarely experience peak performance situations. In an artificial intelligence infrastructure, it's maximum performance all the time. And when you're running an infrastructure at low latency and maximum performance, you're going to experience failures that are sometimes silent, that lead to larger failures, and you're going to experience outright hardware failures. The AIM software solution is designed to diagnose and remediate those failures before they impact the actual production environment.

So, build and remediate. >> This is not your standard virtualization, midrange applications, business-critical or business-nice-to-have. Some people call them craplications. This is hardcore: running at peak performance, they're hot and, increasingly, liquid cooled, so it's a whole new ballgame. >> Yeah, and actually, in the AIM product itself, in ICE Clusterware AIM, we're monitoring for variations from nominal temperatures in the GPUs themselves. We're doing latency and throughput testing on the InfiniBand fabric.
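The next exchange describes what happens when a reading drifts outside nominal. A minimal sketch of that kind of detection keeps a rolling per-device baseline and flags readings a few standard deviations out; the statistics and thresholds here are illustrative assumptions, not AIM's actual logic.

```python
# Deviation-from-nominal detector: learn a rolling baseline per device
# and flag outliers that may deserve remediation or removal.
from collections import deque
from statistics import mean, stdev

class NominalMonitor:
    def __init__(self, window: int = 100, sigmas: float = 3.0):
        self.window = window
        self.sigmas = sigmas
        self.history: dict[str, deque] = {}

    def observe(self, device: str, value: float) -> bool:
        """Record a reading; return True if it deviates from nominal."""
        hist = self.history.setdefault(device, deque(maxlen=self.window))
        deviated = False
        if len(hist) >= 10:  # need some baseline before judging
            mu, sd = mean(hist), stdev(hist)
            if sd > 0 and abs(value - mu) > self.sigmas * sd:
                deviated = True  # try software remediation, else remove
        hist.append(value)
        return deviated

monitor = NominalMonitor()
# Feed it per-GPU temperatures, fabric latencies, throughput samples, etc.
if monitor.observe("gpu3-temp", 91.0):
    print("gpu3 outside its nominal band; investigate before it fails")
```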

On any deviation outside of nominal parameters, we'll begin to institute automation that will attempt to remediate that in software. If we can't, then we remove that device from the production workload so it doesn't actually result in production outages. >> And is your software vendor-agnostic? >> It is. Any type of GPU that you happen to invest in, any type of compute infrastructure that you're deploying.

Our intention is to help these organizations accelerate their journey in building these environments with these highly specialized skills that are new. >> So, if I wanted to roll my own, how would I do it? Would I have to just cobble together a bunch of my own open source stack? Do customers have the skills to do that? Is that really the alternative here? >> That is pretty much the alternative. There are a lot of open source investments out there. Organizations struggle, often, to build the skillsets in-house just to sustain the open source investments. So, we're an alternative to that.

Enterprise-class, supported software that you can license and get continuous updates for, without having to manage your own particular open source distribution to deliver these types of outcomes. But there are all sorts of ways to deliver on these types of outcomes, and we're focused on helping people realize and accelerate into that journey that they're on. >> What's different about security and governance in these types of workloads? It's obviously the most important issue for IT professionals. So, how should they be thinking differently about security and governance and how do you help? >> So, as it relates to constructing these environments, you have to remember that they're running the applications that are making decisions for the business. If they're making decisions for the business, then they have access to the data that enables those decisions.

So, when you construct these environments, you need to make sure that you're doing so in a manner that derives configuration outcomes that are repeatable and consistent. Any unique variation in deployed parameters produces a threat vulnerability. Any operating system variances, any libraries that are deployed differently, any drivers that are configured differently could expose an area of vulnerability in the infrastructure.
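One common way to enforce that kind of repeatability, sketched here under our own assumptions about which fields matter, is to fingerprint each node's security-relevant configuration and flag any drift from a golden hash.

```python
# Config-drift check: hash a canonical view of each node's OS, kernel,
# driver, and library versions and compare against the fleet's golden
# fingerprint. Field names and versions are illustrative.
import hashlib
import json

def fingerprint(node_config: dict) -> str:
    canonical = json.dumps(node_config, sort_keys=True)  # stable ordering
    return hashlib.sha256(canonical.encode()).hexdigest()

golden = fingerprint({
    "os": "ubuntu-22.04", "kernel": "5.15.0-122",
    "gpu_driver": "550.54", "cuda": "12.4", "nccl": "2.21",
})

fleet = {
    "node01": {"os": "ubuntu-22.04", "kernel": "5.15.0-122",
               "gpu_driver": "550.54", "cuda": "12.4", "nccl": "2.21"},
    "node02": {"os": "ubuntu-22.04", "kernel": "5.15.0-122",
               "gpu_driver": "535.86", "cuda": "12.4", "nccl": "2.21"},
}

for name, cfg in fleet.items():
    if fingerprint(cfg) != golden:
        print(f"{name} drifted from the golden config: potential vulnerability")
```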

>> What about workloads? As we were saying, it's not your typical, traditional IT workloads. We're talking about injecting intelligence into the system: natural language processing, computer vision, document understanding, RAG-based chatbots that help with contact centers. Now you've got agentic coming in. How should customers think about constructing an infrastructure that can scale and still handle a variety of workloads? >> Well, I think the first thing is to understand that you're building infrastructure for the type of use case that you're deploying. A lot of investments that we've seen in recent years have been around training, but we're seeing a lot of investment in optimized inference. Think of self-driving cars and the need to process information at the edge, or real-time fraud prevention and the need to process streaming information en masse in real time.

And so, each one of these problems requires constructing high-performance, low-latency environments, and dealing with different types of failure scenarios. >> That's interesting, because in training, they call it the YOLO run. You do the "you only live once" run, you've done enough upfront research, and you're hoping that it works. If it doesn't, it's expensive.

But what you're describing is a lot different. You have to hide those failures, you have to be able to recover from those failures instantaneously, otherwise fraud could be committed. So, it's not just that you're losing money, it's that you're not doing what you promised your customers. >> Well, that's exactly right. Think about how massively scalable parallel processing works: you run a job on these GPUs, however large your GPU environment is. You don't want a failure to happen when you're running that parallel process, because the job will fail.
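Trey goes on to describe removing suspect devices before the job ever reaches them. A hedged sketch of such a pre-job health gate follows; Slurm's `scontrol` drain command is real, but the probe and the wiring are our assumptions about how a gate like this could work.

```python
# Pre-job health gate: drain any node that fails a health probe so the
# scheduler routes the parallel job around it.
import subprocess

def node_is_healthy(node: str) -> bool:
    # Placeholder probe; a real one would consult GPU, fabric, and
    # thermal telemetry like the monitoring sketches above.
    return node != "gpu-node-017"  # pretend this one shows silent errors

def gate_nodes(nodes: list[str]) -> list[str]:
    healthy = []
    for node in nodes:
        if node_is_healthy(node):
            healthy.append(node)
        else:
            # Mark the node unschedulable so no job lands on it.
            subprocess.run(
                ["scontrol", "update", f"NodeName={node}",
                 "State=DRAIN", "Reason=predictive-failure"],
                check=True)
    return healthy

nodes = [f"gpu-node-{i:03d}" for i in range(1, 33)]
usable = gate_nodes(nodes)  # submit the job against 'usable' only
```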

What you want to ensure is that you optimize the environment's variables so that the job doesn't fail. And that's what we're doing. We're doing predictive analysis before the job gets to those GPUs. And if there is a GPU or a fabric that's in a certain failure condition, or is showing signs of a silent failure condition, we will remove that device from the workflow so that we're minimizing the potential for failure. >> So, when I think, Trey, of traditional IT metrics, RPO, RTO, availability, how many concurrent users I can support at once, how are the metrics that you track, or the customer should track, around AI different? >> The metrics are different in the sense that, if we're doing training, then the question is how much learning can we get done in the shortest window of time? And so, the devices are running at peak performance all the time. If we're doing self-driving cars, then we're thinking about health and safety situations, and how we can autonomously make decisions at the edge.

And so, availability becomes really important. If we're doing fraud detection or analysis of a lot of information in parallel, then our focus is on the responsiveness of the answer to the question. And so, each one of those variables is a combination of the traditional metrics that we calculate in IT, just with a lot more data all at once, and a lot more infrastructure that we're running all of these processes on at the same time. >> So, we talked about training and inference. 2023 was a lot of experimentation in the cloud. 2024 seemed to be, "Okay, we're going to try to narrow down and prioritize these use cases and get a better handle on the workloads, because we want ROI."

And that's been the conversation. ROI has been a little bit elusive. So, part of that ROI equation is, if I'm going to make this investment in bringing infrastructure on-prem, working with a company like Penguin to build my own opinionated AI stack, I want to make sure that it's going to last a while.

The term is future-proof, I think it applies here. What are your customers telling you about what they expect and how can you give them confidence that what they install today is going to help them with future AI workloads? >> I think the best way to answer that question is to talk about the people at Penguin and the experience that we have. So, beyond the software or the technologies that we sell and integrate together, the people have a long legacy of experience of constructing these environments. And the way that we approach building these environments is through an architectural model that can adapt and grow.

The reality is that there is a blistering pace of development with the underlying hardware that's out there. Moore's law is out the window. >> Jensen's law is taking over. - Yeah, Jensen's law is taking over. And as a result of that, you need an underlying architecture that is deployed in an environment that can accommodate those changes, and also find ways to utilize some of those technologies as they become, dare I say, legacy, even though they're a year or two old. And so, our architecture-first approach, using modularity in design, is really a way that we bring that answer to customers.

And you can only get there from having the experience that we've had over the years of building these environments. >> Excellent. All right. Thank you, Trey. Appreciate it. And keep it right there. Trey and I are going to be back in a moment to discuss the best approach to optimize AI infrastructure and get the most out of your investments.

>> Voltage Park is a cloud service provider. We have over 24,000 H100 GPUs deployed across the United States. So, the mission of Voltage Park is to democratize compute to everyone. If you're a large enterprise and you want access to a long-term contract with us, or a professor at Stanford and you want to do a two-week project to train a model, then we want to be the solution provider for you.

As a new entry into the cloud service provider arena, we've realized, and I think everyone knows this, that managing a large AI compute cluster across the US requires not only great technology, but a large team. As a result, we wanted to go to the market and find a management service provider that had that track record, that history of providing high uptimes for their customers, at scale, over long periods of time. Additionally, we wanted this management service provider to be a seamless extension of our team.

More importantly, we also wanted a partner that had top-tier software with the ability to manage the hardware on a preventative basis, to find problems before they occurred and, within the data centers, a great team and processes to do break-fix work in a timely fashion. After running a very detailed RFP process, it became very clear early on that Penguin was going to be the right partner for us. Not only do they have the technical expertise and decades of experience, but they're able to move very, very fast, and we were very, very impressed.

We're very close to launching 18,000 GPUs across the US at four of our data centers. We've also been able to leverage the full suite of Penguin Solutions and have had them source and validate which shared service storage provider we should be using, which they did in under a month for us. The complexities of this infrastructure are like nothing that I've ever seen before. We're definitely at the first or second inning of the AI revolution, which is going to be the biggest revolution that we've seen in our lifetime. >> Hi, we're back with Trey Layton.

We're going to now go deep to tap Trey's expertise in high-performance computing and large-scale AI deployments. Trey has had a front-row seat to the challenges that businesses face in managing AI infrastructure, and we're going to ask him to share his insights on bridging the enterprise gap, building sustainable operational models, and creating AI environments that deliver on performance and ROI. So, Trey, let me start with: why do traditional tools not apply here? Why do we need a new stack? >> So, when I think about building the environments that we've seen managed over the years in enterprise IT, I actually think about a consumer car manufacturer: we're building cars on a manufacturing line that are designed to be daily drivers. When we're building an artificial intelligence environment, we're really talking about constructing an F1 car that's designed to run around a track. And you need a different set of tools to be able to construct that highly specialized solution and deliver those outcomes. Think about the skills that are unique in the IT world, managing things for high availability where only every once in a while you'll hit peak performance, in contrast to an artificial intelligence infrastructure that's running at peak performance all the time.

And so, those two differences require a different set of skills that we're seeing and a different set of tools to utilize to actually deliver on those outcomes. >> I like your analogy. So, you're not necessarily optimizing for gas mileage, you're optimizing for performance, because you want to win the race. Okay. A lot of organizations we talked to, they can't really utilize their AI clusters to the fullest.

So, I want you to help the audience understand what some of the root causes of under-utilization are and why that's so important in this space. >> So, it's a great question and I think it's a misunderstood aspect of the challenge that we see out there. GPUs, specifically in these types of environments, as we mentioned previously, are running at high performance all the time, and these devices fail. Our own report, an internal analysis, shows that GPUs fail at about 33 times the rate of a general-purpose CPU. And that doesn't mean that they're fragile in the sense that they're poorly constructed. If you think about, again, go back to that car analogy: when you're running a race car around a track and the engine's running at full RPMs all the time, sometimes tires are going to blow, sometimes cylinders are going to blow. And that's what happens in these AI infrastructure solutions; we're running all the devices at peak performance all the time. So, you're going to deal with a consistent failure condition.
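One standard mitigation for that steady failure rate, not named explicitly in the conversation, is periodic checkpointing so a long-running job resumes rather than restarts from zero. A minimal PyTorch-style sketch, with all names ours:

```python
# Checkpoint/resume: persist training state so a device failure costs
# minutes of progress, not days.
import os
import torch

CKPT_PATH = "checkpoint.pt"  # illustrative location

def save_checkpoint(model, optimizer, step: int) -> None:
    # Write to a temp file and rename so a crash mid-save
    # can't corrupt the last good checkpoint.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint(model, optimizer) -> int:
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh run
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]  # resume here instead of restarting
```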

How do you construct the environment to accommodate those failure conditions, and how do you drive up utilization within them? Sometimes failures are soft, silent, software-related failures; they're not hardware-device-stops-working-altogether failures. And so, there are things that you can do to mitigate those types of failures. >> I'm interested in this notion of skills.

A lot of the organizations we talk to tell us they don't have the skills for AI workloads. They've been running, to your analogy, the consumer car workloads for a long, long time, and they've perfected that. But now, the tolerances are much tighter and it requires different thinking.

Now, of course, you and I, when we first met, you were trying to simplify the traditional IT infrastructure, bringing compute, storage, and networking together. And okay, that worked. There's a similar concept here, but it's different from a skill standpoint. Please explain. >> It's extremely different. And I would say, we could pick on our friends in the enterprise IT world, but I think that it's also fair to say that there's an up-leveling of skills in the high-performance computing world as well. The high-performance computing world needs to understand the problems of IT, and the IT world needs to understand the problems of high-performance computing.

And in that, we get a convergence of those two skills, and that will be the future artificial intelligence infrastructure engineer: one who gets both worlds. >> But the architecture is unified, right? That's what you're- >> Absolutely. So, if you think about it, the modern HPC engineer is going to need to be versed in Kubernetes and microservices, where they're largely experienced in batch-based processing technologies, like Slurm and things like that. Whereas the IT person has been skilled in virtualization and cloud technologies, and now they're going to have to learn storage technologies, like parallel file systems, and how to run massively scalable clustered outcomes.

These two worlds are colliding and the skills are unique to each particular environment. >> Yeah, you're right. If you think about it, the entire stack is becoming AI optimized. Clustered computing, the parallel file systems you just mentioned, ultra-low-latency networks, an entirely new software stack that you guys have built and are deploying. So, how can organizations effectively design and maintain large scale AI systems? You're a golfer, is it just that you ought to get the muscle memory, or? >> So, I think the first thing is to acknowledge that it's different.

Short game's different than the long game. >> Yeah. - So, you have to understand that it's different. Don't try to apply enterprise IT skills to this particular space, but do expect that when you're constructing these environments you're going to need to bring in some of the skills from the IT paradigm. So, I think the first step is acknowledging that it's different. And the second step is bringing in an organization, whether it's us or someone else, that's skilled in this specific conversation. And that's the uniqueness that we bring: decades of experience of constructing these environments, and also acknowledging that there is a difference in this world that the world hasn't fully embraced yet.

>> Okay. Finally, we want to understand what a sustainable operational model looks like for AI infrastructure, Trey, at scale. You're a golfer, you understand muscle memory. Once I get that groove swing down, once I have the formula in my head and my body, how do I sustain that, how do I keep it going, and how do I make sure I can handle AI at scale? >> So, it's a great question. I think the first thing is acknowledging that it's different.

And I think that there are two vectors to the question. There is a different degree of complexity that you have to internalize and manage, and a different scale. And so, we accommodate those two things by building an infrastructure that is modular, and by acquiring partnerships with organizations that understand how to deal with the complexity and the scale simultaneously.

I think, ultimately, that's the formula for the right answer for sustaining the management of these types of infrastructures. >> Awesome. Trey, thanks so much for taking some time to explain what's different around AI. Now, you guys, you were in Atlanta at Supercomputing, you're going to be at GTC, which is a super exciting show.

Excited to see you there. >> Excited to be there. It's an interesting time in the industry and fun to see all the growth that's going on. >> Thanks again. Okay. Over the course of these three sessions, we've learned how rapidly deploying AI infrastructure can give organizations a competitive edge. Why unified and intelligent compute environments are the foundation for AI success.

And how to deal with the complexities involved in optimizing large-scale AI ecosystems, from understanding the shortcomings of traditional IT approaches (you can't just apply those to AI) to tackling cluster under-utilization and bridging skill gaps. Two experts, Pete Manca and Trey Layton, have shared practical insights that can help leaders design, implement, and maintain high-performance AI infrastructure. So, as you move forward with your AI initiatives, remember the importance of balancing speed with ongoing sustainability at scale, building integrated and unified environments that eliminate silos, and continuously refining your strategies and operational models. We encourage you to take these learnings back to your teams, explore new possibilities for AI-driven innovation, and leverage the expertise and solutions offered by companies like Penguin Solutions. Thank you for joining us, and here's to accelerating your AI journey toward transformative business outcomes.

And for more information, go to penguinsolutions.com, or engage with Penguin Solutions directly to explore how they can accelerate and optimize your AI initiatives. Thanks for watching.
