Parallel Works AI Workload Automation #189 | Intel Business


Which I'm sure people probably don't even know this. I mean, I can fit five times as many cores into the same real estate on a chip as I could ten years ago. Right. Which is astronomical. It's incredible.

Yeah. Because the number of cores I can have on a single computer is in the hundreds now. Yeah, I look at that old Cray machine, the first one I ever ran on. It was called Beagle, actually. And I don't even remember the exact specs on it, but I kind of want to compare afterward what that exact system would look like today.

Against even, you know, some of the newer, let's say, cloud instances coming out. Welcome to Embracing Digital Transformation, where we investigate effective change leveraging people, process and technology. This is Darren Pulsipher, chief solution architect, author, and most importantly, your host.

On today's episode, The Perfect Storm: Cloud, HPC and AI, with special guest Matthew Shaxted, the president of Parallel Works. Matthew, welcome to the show. Thanks, Darren. Happy to be here. Hey, we had a great conversation at AFCEA West.

When was that? Two or three weeks ago? Two weeks ago? I don't know, time flies. Something like that. And it just brought me back. You knew some of the people I used to do work with in the Global Grid Forum and the Open Grid Forum back at the turn of the millennium.

I had a lot of fun doing that stuff, and you brought me back to those times. So I said, we have to have Matthew on the show, because you actually have something to say besides just reminiscing with Darren. But before we get into that, Matthew, tell my audience a little bit about yourself, your background, and why we're talking to you today. Sure.

So I'm Matthew, the president of Parallel Works. We'll get into what we do a little bit more later. But I have a background in civil engineering, actually, and I was a practicing engineer at some large architecture and engineering firms in Chicago.

Born and raised there, running a whole bunch of different types of simulations for that industry: structural, CFD, a lot of topology optimization, gradient analysis, energy modeling, a wide gamut of simulation physics. When I was in these companies, my job was to take the designs and try to make them better through simulation. So, long story short, I needed bigger computing resources to run these types of things.

That led me to Argonne National Lab, which is in the Chicago area, one of the leadership computing facilities in the country. And the process began of learning how to run on the big iron machines, right? I met my partner and co-founder of the company, Mike Wilde, there. He's our CTO.

That was in 2014, I think, so almost ten years ago now. They developed technology at Argonne through a research team somewhat affiliated with the Globus group and GridFTP and all that. I know the Globus group well. They created technology that helps you scale workflows from your desktop, or maybe a small-scale computing environment, up to these massive machines. And I remember very clearly the first simulation I ran on it. I think it was some older Cray model sitting on Argonne's floor, and it was like 20,000 cores. And before that I was used to, you know, 100 cores here, maybe my laptop for simulation.

So the first time, seeing it run on around 20,000 cores all at once, doing more computing work than I had probably ever achieved before, I was like, wow, this is amazing technology. So we basically spun out this workflow fabric. It was called Swift at the time.

It has evolved since then. We spun it out of Argonne as an Apache 2 license, so it was pretty easy to do that, and we created what is now Parallel Works. It started very much focused on commercializing this workflow fabric, making it easy for industry to scale and democratize their primarily HPC-related workloads.

But over the ten-year time frame, it has evolved much more into hybrid, multi-cloud computing environments. So we do a lot of HPC, but more and more it's turning into AI/ML and the intersection of ML workloads with HPC. Yeah, that's really interesting, because I played in that world.

I dabbled in that world 20 years ago, and I know the Globus guys really well because I did a lot of work with them. But what I thought was interesting at the time was that the only people really using high performance computing were scientists, or anyone running big, huge simulations, on these massive machines, right? Yeah. Whether it's nuclear explosions or weather prediction or crunching numbers coming off of the CERN supercollider, whatever. But I mean, these machines cost hundreds of millions of dollars, right? So they weren't just handing them out to any, you know, consulting firm or civil engineering firm. But I think it's interesting because now, 20 years later, we're back to where maybe more people can start taking advantage of these HPC clusters, which are now kind of scattered all over the place.

Yeah. And the shape of them has kind of evolved, too, right? I mean, conventionally, HPC is, and I always think of it as, lots of computers working together to solve one big problem, which is why these systems are so expensive and very specialized. And now cloud comes along, and for a long time the performance to achieve those same types of workloads, lots of cloud instances working together to solve one problem, kind of wasn't there, or people didn't think the performance was there. And the total cost of ownership was higher.

And, you know, we've been doing this for ten years, right? So I've kind of watched it. We were trying to go hybrid cloud before it even was a thing, and before people were even convinced it could perform as well as a few-generations-old Cray machine. And over the last maybe three or four years, I've been seeing those things converge, where you can actually show that these cloud resources can be configured in such a way that they compete with, not the latest and greatest conventional supercomputing machines, but ones several generations old. And then the cost of ownership becomes somewhat comparable.

So there's this convergence point where now cloud can offload a lot of what people do on conventional HPC systems. And it's enabling a whole bunch of new things, which is where I think ML kind of plays in. Yeah. Before we talk about that, I want to touch on this a little bit, the HPC clusters. I mean, in ten years you've seen the density go five X, which I'm sure people probably don't even know. I mean, I can fit five times as many cores into the same real estate on a chip as I could ten years ago. Right.

Which is astronomical. It's incredible. Yeah. The number of cores I can have on a single computer is in the hundreds now. Yeah, I look at that.

That old Cray machine, the first one I ever ran on, it was called Beagle actually, and I don't even remember the exact specs on it, but I kind of want to compare afterward what that exact system would look like today against even some of the newer, let's say, cloud instances coming out. You can just spin up much higher capability than even those machines. So yeah, there's been a lot of movement there. So what that means is the workloads that used to require me to buy a supercomputer, you know, tens of thousands of cores, I can now spin up in a cloud, run in a shorter period of time, and it only costs me $1,000 instead of millions of dollars.

Potentially, I think, yeah. Like any response here, it depends on the shape of the workload. And we handle kind of both cases. We have workloads that want to run a million tasks on some group of instances, and they're totally unrelated, and cloud instances in particular are really good at those. You can spin up a lot of nodes, run them in containers, they're all processing, and you can have fault tolerance on them. And yes, each individual task may process faster, and each CSP, each cloud service provider, is iterating on that.

They're bringing on the next set of Intel machines, and AMD and Nvidia. And it's a race, right? It's always happening. But we don't talk about those other guys.

You keep it focused, right? Yeah. So, the newest, we were just talking about the accelerators from Intel, right? Yeah, the 7 series on AWS. So as those things get introduced, we go and run the same workloads we were running on the 6 series, and suddenly you get a 10% boost in performance for the same cost or a little less. And that's happening a lot.

Because of that, because compute is becoming even cheaper to consume, either in the cloud or on prem, they're both becoming very easy, that means there are more workloads I can actually run, and the purpose-built machine doesn't have as much of a play anymore, because machines are running so fast I can run multiple different types of workloads on the same clusters.

I would say I think that's still an "it depends" problem, because there's still a class of workloads that need purpose-built machines, even with faster and faster hardware. Maybe at one point we'll reach a convergence point.

Right. And then it will completely converge and I can do everything I'm trying to do. Like, you know, with our NOAA engagement, we have an acceptance test where we have to compare against their two-to-three-year-old Cray machine with a 5,000-core job. We run it on the C5 series instances on AWS, right? It's still a 5,000-core job on AWS on that series, and it's about 100 nodes working together, right? Maybe one day all those things will converge and I can do that. But for now, even on the newest HPC instance series, whatever, maybe at that point it starts to converge. But what's happening is you can start to change the shape of the actual jobs themselves.

So instead of 100 nodes of a certain type, now we can run it on 50, get a performance boost, and spend half the cost. And that's exactly what we're seeing in this hardware race that's going on at scale. Every quarter new series are coming out.

We go and run our same benchmarks and show, hey, here's your 20% boost at less cost. Right, now that makes sense. Now, even with the hardware becoming more accessible, it's still... I remember setting up these clusters.

It's still hard to set these clusters up, and to tool up my job, my workload, to take advantage of massively parallel computing. It's still a lot of work.

It's still work. But you guys, that's where your sweet spot is, right? That's the need you're trying to meet. That's kind of the whole reason we exist, really. Okay.

If it were really easy to leverage lots of these computers all working together, right, and configure them and tune them, we wouldn't have much value in the space. The reality is, right now, when you want to actually leverage these things, there are a lot of intricacies that come into play to get top performance out of them, or best cost performance, which is a big component. And you don't really start seeing these things until you get under the covers, but it's like all the networking configurations when you go multi-cloud. Every cloud service provider has a different recipe for extracting the best performance out of their network, for example. And if you don't do these things, it won't perform, because this one is still using TCP or Ethernet, this one's using some Mellanox driver that's been configured a certain way.

You know, this one has their specialized interconnect that bypasses all of that. So each one is very different. And in a world where you can just pick one and work in it for a long time and get a lot of experience there, it's great, and you can be very successful doing that.

I think we've been seeing more and more organizations wanting the ability to go between these different environments, and in a way, treating their on-prem environments as the same type of category, so they can just select which one. Why would I want to do that? Why would I want to run my HPC workloads sometimes on Google, other times on AWS, other times on Azure, or other times on prem? Why have that flexibility? Why not be the master of my own cluster? Because I can control everything in my own cluster. I don't have noisy neighbor problems. I don't have security problems.

Everything's contained. So why go beyond those walls? Yeah. I mean, I think it's a few categories. First, it's economics, right? If you have the capability to do so, managing your own sets of resources is still, I think, what is generally perceived as the lowest total cost of ownership, when you have those capabilities in-house already.

What happens, though, is you're generally under a lot of resource contention. And when you have engineers, or people training ML models, or, in the case of defense work, people doing some mission-critical scenario planning, they can't wait a week for their job to start. There's a resource contention problem, right? So that's where just having the ability to burst out into these other environments is very valuable. And people say a lot, "buy the base, rent the rest." That's kind of a common phrase.

I like that. Yeah, I've been hearing that phrase, and, you know, you get the capability to do that. Also, though, we've been seeing that the on-prem refresh cycle of these systems is kind of on the order of years, right?

I mean, generally speaking, you can't get the newest FPGAs that AWS is going to be rolling out next quarter into your own shop as quickly, or the new accelerator version from Intel, right? Having the ability to say, we're going to go and run on these AMX accelerators tomorrow and not have to wait, that's a big consideration as well. Okay. So that makes sense to me, right? I can get access to new technology immediately, or as soon as it's available in the cloud, whereas if I were to buy it myself, that's expensive.

There's one thing you said that I thought was interesting: resource contention. Are these supercomputers constantly running all the time? You know, I think generally, the centers, these facilities or organizations that have these internal resources, they keep them as heavily utilized as possible. Right.

Because it's a big CapEx outlay, and then you're trying to maximize the usage of it before it gets too old, which is, you know, about four years' time. And then you've got the next refresh. Right. So I hope it's every three years, right?

I want it to be every three years that they refresh. Right. Exactly. Yeah. So yeah, I mean, generally they are very constrained.

I mean, I won't name specific ones, but we have systems where literally you submit a job to the cluster and you wait a week before it starts. Before it starts. Wow. And if you make a mistake...

Then it's like, I've got to go wait another week. And yeah, you can get priority, but there are a lot of other jobs running there. All right, so you guys provide an easy way to span on prem, the cloud, multi-cloud. Is that the focus of your company, to make it easier for me as a scientist or whatever to run HPC workloads anywhere? So I can take my scripts, whatever I have, and just hand them to you? It all sounds really easy. So, in theory, yes.

But for us, it even starts at the organization level. So, an organization, and the people managing the computing systems within it, that are even starting to look at cloud, and everybody's having these discussions now, everyone's in different places on it. We kind of start at that level, because we go in and present ourselves as really just a portal for your on-prem systems. We make it easier for your user base to consume the on-prem resources that you have.

It's basically like, you know, the terminal is the old way of accessing these things. We give you a place where your users can get Jupyter notebooks or VDI sessions or do more advanced data flows, but on your existing resources, right. People have called these things portals for years, and we're an on-prem portal.

Everything in this camp, it's like GreenLake by HPE. Open OnDemand is a big one; that's like the open-source version focused on the on-prem side of things. A lot of companies roll their own too, right? Yeah, we did, when I set up clusters at Cadence Design Systems. Cadence Design Systems. Yes, exactly. Exactly right. Yeah.

Like thousands of nodes of old machines. We did crazy things. We grabbed all the old machines that people didn't want anymore and threw them into one big massive cluster, 5,000 machines. We set up our cluster that way, and we used every compute cycle we could possibly use every night. Yeah. I mean.

So the concept of a portal is not new. Not at all. Yeah. And, you know, we do defense work, right? And the HPCMP, the DoD program that manages a lot of the unclassified high performance computing systems, they've been creating portals for those systems for about a decade, maybe 15 years, even more.

And they're great. You can log into them, you have the ability to get into a terminal very easily, you have the ability to submit simple jobs. So we're kind of like an enhanced, supported version of that at the base, but now we layer in this cloud bursting, this kind of multi-cloud component. So this is really interesting to me from the perspective of: if I can burst into the cloud, am I taking my individual job that I'm running? It may have several hundred jobs underneath it.

It may have 5,000 cores. When you say burst into the cloud, am I then running my job across the cloud boundary, or am I moving my whole job up into the cloud so there's no contention at all? That's very workflow-specific, I'd say.

Okay. And organization-specific, because it quickly becomes a data gravity issue. Yeah. Your data is your biggest problem, right? Yeah. Because, well, we have had customers, right,

that do have a workflow that wants to run in our environment, and they want to run some jobs on, say, their on-prem GPU cluster, or a few resources, and then they want to take certain tasks of the workflow and run them on a big set of resources. So that truly is a hybrid workflow, and you're moving tasks between environments inside the actual workflow executions. And there's the data piece, right? You're moving data either explicitly, or you're transparently moving it through some type of data mover. So that does exist.

I think more typically we see things where it's like: I'm going to try to run on the on-prem resources when I can, as they're available, and when I need to, I can just take my workload and point it explicitly at a different cluster, at a cloud service provider cluster.
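In spirit, that burst decision is just a scheduling policy. Here is a minimal Python sketch, with hypothetical site names and a simple queue-depth threshold; none of this is Parallel Works' actual API:

```python
# Minimal sketch of a burst policy: prefer on-prem, spill to cloud when the
# on-prem queue is too deep. Site names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    queued_jobs: int       # jobs waiting in this site's scheduler
    burst_threshold: int   # queue depth beyond which waiting is unacceptable

def pick_site(onprem: Site, cloud: Site) -> Site:
    """Run on prem when the expected wait is tolerable; otherwise burst."""
    if onprem.queued_jobs < onprem.burst_threshold:
        return onprem
    return cloud

onprem = Site("onprem-cluster", queued_jobs=250, burst_threshold=100)
cloud = Site("cloud-c5-cluster", queued_jobs=0, burst_threshold=10_000)
print(pick_site(onprem, cloud).name)  # cloud-c5-cluster: on-prem queue too deep
```

Real policies would also weigh cost and, as the conversation turns to next, where the data lives.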

Yeah. So it is a little bit more like: choose one of the sites you want to run on and get the data there. Yeah, get the data there and then run it over there. Now, another thing that you mentioned to me is a big concern, because we had a customer that wanted to run HPC in the cloud. I'm not going to mention who they are. They know who they are, and I don't want to embarrass them too badly. But they set up a couple of nodes in the cloud in their cluster and a couple of nodes on prem in the same cluster, and they were just running their jobs, and their egress costs went through the roof.

Yep. And they ran through their whole budget for the year in like a month on their cloud. So this is where it is very much a workflow- and workload-specific kind of conversation, because we have similar things. We have groups right now that are spending 50 grand a week on egress costs, going out of AWS back to their on-prem data transfer nodes.

Right. And to me, that's just wasted money. Well, they're fine with it. But there are also some things happening in the cloud service provider world now where some of these providers are getting rid of egress costs.

It's a competition. Yeah. And so now we have the ability to go to a site, a cloud service provider, that doesn't have egress costs, right?

You might have to change everything you're doing, through our system, but that's separate. So regardless, it still becomes the same conversation. What needs to move? Do you really need to move back the 50 terabytes of data that were generated from your weather forecast job or whatever? What can stay in the cloud? Can you post-process in the cloud to make it smaller? It's all part of that discussion. So it's not as easy as "I'm just going to take my workload and jam it into the cloud."

I still have to think about it. There are some things I have to think about: data movement, and what value I'm getting out of it. Now, you guys have some tools to help me with my cost controls, correct? Yeah. So, honestly, just touching on cost, I didn't want to go there, but it is another reason multi-cloud is a great advantage, because big organizations have a lot of negotiation leverage with each cloud service provider, and having the ability to not be locked into one gives you the best footing to negotiate your cloud contracts. But regardless.

When you start running in the cloud, and most people know this as soon as they start going into any of the providers and spin up some nodes, you can spend a lot of money very quickly. And I remember back when I started getting on the cloud, I think I left like 100 instances running. This was like my first job out of college, and I burned like ten grand in a weekend or something. And everybody was really upset about that.

And I was like, well, I'm sorry, I didn't know. So that's not a unique occurrence. No, no, it's not. I didn't get fired or anything, but everybody was watching me much more closely after that. So anyway, yeah, cost control is a very important and difficult thing.

And a lot of times what we see is that these organizations want to put a budget on their cloud spend. Each CSP has ways of managing budgets and enforcing limits, but they're all kind of different. What we wanted to do, and we started doing this years ago, is make the cloud feel more like a conventional budgeting allocation system, where, as the manager of your organization's computing cycles, you want to give this group $10,000 to spend on the cloud and they cannot go over it. And that actually turns out to be a really difficult problem to enforce in the cloud.

And the reason is, specifically in the type of workloads we deal with, which is HPC and ML, they're big: lots of computers running for bursty, short amounts of time. What ends up happening, and I hope this changes at some point, I know some of the service providers have been working on this, is you don't get the bills for what you've spent until several hours, even up to 24 hours, after you incurred the costs.

Right. So for us, where we have a customer that just spun up 100 HPC nodes for 2 hours and just spent ten grand, let's say, we won't actually know that until eight, 12, 24 hours after the fact. So someone trying to enforce a budget of like $10,000 can't do it with billing data that's hours old. And this is what happened, this was five years ago for us or something, when we had this problem: you have companies saying, we can't go over $10,000, and suddenly at the end of a weekend they're at $100,000 or something. They just went over by $90,000, and everybody's freaking out.

So what we had to do is create it ourselves. And again, I hope the service providers roll this out themselves soon, so we don't need to maintain it.

But right now we do. We created what we call a near-real-time estimate. It's a three-minute estimate of what your cloud spend is at any given point in time, generally in your own accounts, because we just orchestrate in your own cloud accounts. And then we enforce it. We have the option to enforce the budgets on that estimate. Based on that estimate, you'll actually...

Shut down jobs? You'll shut them down? The administrators are able to set up rules like: when you get to a certain percent of your budget, shut down the resources. Because that's a disruptive action, there are a lot of checks.
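The enforcement loop being described amounts to a watchdog that polls the near-real-time estimate and acts on thresholds. A minimal sketch, with a toy estimator and invented numbers standing in for the real thing (this is not Parallel Works' code):

```python
import time

# Hypothetical budget watchdog: poll a near-real-time spend estimate and act
# on thresholds. All names and numbers are illustrative, not a real API.
BUDGET_USD = 10_000
WARN_FRAC = 0.80       # notify at 80% of budget
KILL_FRAC = 0.95       # disruptive action at 95%
POLL_SECONDS = 180     # ~3-minute estimate cadence, as described above

def estimate_spend(hourly_rate: float, hours_running: float) -> float:
    """Toy estimator: instance-hours times price. A real one would sum this
    over every node currently (and recently) running in the cloud account."""
    return hourly_rate * hours_running

def watchdog(hourly_rate: float) -> None:
    hours = 0.0
    while True:
        spent = estimate_spend(hourly_rate, hours)
        if spent >= BUDGET_USD * KILL_FRAC:
            print(f"${spent:,.0f} spent: shutting clusters down")  # stop the spend
            break
        if spent >= BUDGET_USD * WARN_FRAC:
            print(f"Warning: ${spent:,.0f} of ${BUDGET_USD:,} used")
        hours += POLL_SECONDS / 3600
        time.sleep(POLL_SECONDS)

# watchdog(hourly_rate=500.0)  # 100 nodes at $5/hour would trip in ~19 hours
```

The point of the three-minute cadence is exactly what the conversation says: with billing data that arrives hours late, a loop like this is the only thing standing between a $10,000 budget and a $100,000 weekend.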

We will actually shut it down and stop the spend. If someone leaves a huge cluster running on a Friday night and then goes home, they won't rack up a $200,000 bill; we shut it down almost instantly. Yeah. So what you guys have done is make it easy now for anyone to run their types of workloads on prem, or in the cloud, or even multi-cloud. So you've democratized HPC. Yeah. And you make it sound magical, right? But really what we're doing is...

Yeah, but one layer down on what we're really doing, there's nothing really magic, right? We let our users, the people that use our platform, create a solution of some kind. A solution usually consists of scripts.

It may be your Python code to do your large language model training or something, right? It's some model or some bit of code. We let people load that into a place that packages it up with a little front-end form, and you can give instructions on how to run it, and we just become a job executor for the models that you create. And these things become shareable entities, right? This is the democratization part. It's like, I made a little large language model, or even a spatial model or a Stable Diffusion training workflow.

And now I want to make that available to the rest of the people on my team. Yeah. Now we make that thing shareable. We used to call it almost like an iPhone app, an iPhone app for HPC. It's a little tiny thing that's all packaged up, and I can share it with the rest of my team or the entire organization, and they can run it without having to go into a terminal and install all my Python dependencies and everything.
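To make the "app" idea concrete: a packaged workflow of this kind tends to reduce to a run command plus a declared form of inputs that a teammate fills in. A hypothetical example follows; the format is invented for this sketch and is not Parallel Works' actual packaging schema:

```python
# Hypothetical "workflow app": a run template plus the form fields a teammate
# fills in before launching. Format invented for illustration only.
workflow = {
    "name": "llm-finetune",
    "run": "python train.py --data {dataset} --epochs {epochs}",
    "form": {
        "dataset": {"type": "path", "default": "/shared/data/corpus"},
        "epochs": {"type": "int", "default": 3},
    },
    "cluster": {"nodes": 4, "gpus_per_node": 8},
}

def render_command(wf: dict, **inputs) -> str:
    """Fill the run template with form inputs, falling back to defaults."""
    values = {k: inputs.get(k, spec["default"]) for k, spec in wf["form"].items()}
    return wf["run"].format(**values)

print(render_command(workflow, epochs=10))
# python train.py --data /shared/data/corpus --epochs 10
```

The person sharing the app writes the template once; everyone else only ever sees the form.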

It's like a turnkey shareable entity. And we've done that for workflows and clusters and storage and a bunch of different things. Yeah. Because you've made that easy, my brain went directly to, hey, AI, right? Because that's where all the money is being spent right now. And it is. Right.

And HPC clusters, they're great at this because they're massively parallel. They're great at training and running all these AI/ML workloads. You guys have come up with an easy way to define those workloads and make them shareable.

You just opened Pandora's box. Yeah, well, it's interesting you put it that way, because we've been very focused on HPC for our entire existence, and it's a niche world, as you know, a very super-niche world. But what started happening over the last several years, and we didn't really even think about it, right, is what people are using our platform for. They're using our platform to provision a bunch of nodes to do ML training, or they'll run our JupyterLab or Jupyter notebook workloads, put these on some big instances in the cloud, and do their model development there.

It's almost like what Google Colab does, but we can do it on more flexible sets of resources. And so we started finding that we're living in this world where people are using our platform for model development, and then comes this process of scaling their training.

You know, I developed my model on a single node, single GPU. Maybe I did it on an Amazon or an Azure GPU or whatever, or some, let's say, Intel accelerator. And that's what I like to hear. Or a Gaudi 3. A Gaudi 3, there you go. And then this process of moving it to a multi-node problem, a multi-node training job, and then a multi-node, multi-GPU training job, it kind of cycles right back into the HPC realm. You need fast file systems, and then it's really kind of the same thing, honestly.

So yeah, I think you hit it right on the nose. Any time I'm going to run work that spans multiple nodes, that's HPC. You guys have made HPC easier to use, so it's an option for me now. Yeah. The shape of those jobs kind of changes, though. With ML and AI workloads, the actual scheduling layer, I feel, is kind of evolving; it's moving a lot more to containerization and OpenShift or Kubernetes. But once they start scaling the training jobs, they have the same requirements as a conventional...

As a conventional HPC job. Well, HPC has been around a lot longer. I know, I worked in the Workflow Standards Group, right? I was chair of that group for like three years, right? Yeah.

We learned there that setting up the job and actually running the job was the easy part, you know, running 10,000 jobs all doing the same thing. It's the workflow, and moving the data between all of that, that becomes hard. Yeah. It's the plumbing. It's the plumbing, which is kind of the boring piece, right? That's kind of what we do. We're like the plumbers of this infrastructure, in a way, and we just tie it all together to make it easy.

Matthew, this has been very enlightening. I really see a great place for Parallel Works in the future, especially around AI/ML. I think you guys are in a sweet spot right now. A lot of pressure, yeah. And thanks for coming on the show today. This has been very enlightening, and I learned a lot. Yep. Great.

Thanks, Darren. Good questions. Anytime. Well, hey, now I'm thinking back to the good old days of the Global Grid Forum and Open Grid Forum and the projects I worked on.

Man, I want to go work on them again. Thanks. Yeah. We'll have a beer and talk about those at some point. There you go.

Thank you for listening to Embracing Digital Transformation today. If you enjoyed our podcast, give it five stars on your favorite podcast site or YouTube channel. You can find more information about Embracing Digital Transformation at embracingdigital.org.

Until next time, go out and embrace the digital revolution.
