Hemal Shah, Broadcom & James Wynia, Dell Technologies | SC23
Good afternoon, friends and family in the HPC and AI/ML world. Lisa Martin here with Dave Nicholson. We are live at Supercomputing 23 in Denver, Colorado. This is midway into our third day of four days of wall-to-wall Cube coverage. We're going to be having a great conversation. If you've been watching this week, you've been hearing us talk a lot about Dell and Broadcom. We have two alumni back with us. Dave, we're going to be digging in deep, with a show and tell. Please welcome back Hemal Shah, distinguished engineer and architect at Broadcom, and Jim Wynia joins us as well, senior product manager at Dell. Guys, welcome back. Great to have you.

Thank you, happy to be here.

Happy to be here. Good to see you.

And you guys just did a presentation together, so you're all warmed up and ready to go. You're locked and loaded for this conversation.

That's right, that's right.

Awesome. There are a lot of players involved in AI/ML networking technologies. From Broadcom's perspective, what's new and exciting in the networking space? And then, Jim, we'll have you weigh in on the Dell partnership.

Sure. At Broadcom we have been doing Ethernet networking for decades now. What is exciting with AI/ML is that it is pushing the envelope, bringing really large-scale clustering and asking for high bandwidth. In terms of both our NICs and switches, we have our solution today, and we are innovating with other things around Ultra Ethernet, where we are putting more intelligence in the network infrastructure to make the AI/ML solution the best from a networking standpoint.

Very exciting. Jim, comment on that from Dell's perspective. It's a crowded space.

It is a crowded space.

Why the tight Dell and Broadcom "better together" partnership?

Well, it's always best to work with the lead dog, and that's where Broadcom comes in on the technology side. We're very excited. We've been partnering with them for literally decades, and it's always exciting to see what
they're brewing up. We've been supporting the Tomahawk line, the top of the food chain for hyperscaler networking solutions, and the Trident line, the top of the food chain for enterprise solutions. We have switches in those lines, and we continue to work with them, having the discussion about where do we go next, what did we just do, what worked, what didn't. So all of those things.

Yeah. So Hemal, I like you for more than just your NICs, just to be clear. That's a compliment. And I know that when Jim says top dog, he means it affectionately. We've been talking a lot about networking and connectivity. This specifically is what you might term inter-server networking.

Absolutely.

When you use the term Ultra Ethernet, what's the difference between Ultra Ethernet and just plain old garden-variety Ethernet?

Right. So what we have today is garden-variety Ethernet. On top of that we have RoCE as an RDMA transport. People are using it, it's good, and it has served its purpose for the scale that is being deployed today. With Ultra Ethernet, what we are doing is keeping the same Ethernet ecosystem and infrastructure but adding things like multipathing, adaptive routing, and congestion control mechanisms which scale to a large number of nodes. We are also addressing some of the RoCE transport deficiencies, meaning RDMA goes to the next level by having selective retransmissions. All of those together, with the physical infrastructure being Ethernet, that's Ultra Ethernet.

Can we get right to what you brought with you? I'd hate to show it at the end and then have all sorts of questions we wouldn't be able to ask because of time. Hopefully we can get some tight shots on that, maybe the camera over here. What did you bring with you? What do you have?

On my left is Jericho3-AI. This is for the AI fabric market, a 100-billion-plus transistor, top of
the line switch chip. On my right I have a Tomahawk 5, which is 64 ports of 800 gig, again a top-of-the-line switch. Both of these together allow you to do a multi-stage switching fabric for AI/ML. So, really glad to see this. These are both production parts, and they are being deployed in production.

And just to be clear, the way that these are actually implemented, they would be in an enclosure, rack-mounted as a switch, with the ports in the back. So, to familiarize people who are in data centers who maybe don't get down to that level of componentry, how does that partnership work with you?

I was just going to jump in. For Hemal and Broadcom, this is the top of the industry right now. You can't get any faster or any better than what he's holding in his hands. I was telling him I'm so impressed he was able to wrangle these out of the hands of the engineers to actually show them to everybody here. So, very cool stuff. We are very excited to be partnering again with the Tomahawk 5. We have a Tomahawk 4 already shipping. Tomahawk 5, well, I can't release dates and that kind of stuff, but let's just say we're looking very closely at when we can do something there. Having high capacity, a bunch of 800 gig ports on top of the 400 gig ports, is critical, absolutely critical, for advancing the AI/ML fabric solutions.

So I'd love to get your perspectives on differentiation. Obviously what you just shared with us is incredibly powerful and potent, and I saw they were able to get it on camera. How is that leading-edge, top of the market? Talk about that, especially with Dell and Broadcom together, and I'd love answers from both of you: how does that differentiate you guys in that space when you're in competitive customer situations?

So I'll start, and James, feel free to jump in. What we bring in is end-to-end networking components, right from the silicon standpoint, and then all the
infrastructure software that is needed in order to get that end-to-end connectivity and networking going. Partnering with Dell, what we do is also move up the stack to the system level. That's where end-to-end monitoring and management come in: how do you do management at the cluster level? So together we are very complementary in bringing the whole solution to an end customer, making it easy to deploy and easy to use.

I agree with that completely. If you look at Ethernet, if you have a small 1 gig switch, what do you run on it? Ethernet. If you have an 800 gig switch, you run Ethernet on it. So you don't have to have one network for the high end and one network for the low end. It's the same network; it just scales perfectly. And that's where Ultra Ethernet comes in: how do we take the same building blocks, take the super high end, and really take it to the next level?

We've had conversations with folks over the last couple of days who have made reference to clusters of half a million servers, so we're talking about potentially massive environments.

Right.

So connectivity in that space is a non-trivial decision to make.

Absolutely.

Historically, when you say Ethernet, you're thinking of an open standard, a common denominator that people can arrive at for compatibility. Does that change with Ultra Ethernet? Is the ethos still there, that this is a more open standard than something else that might be out there?

Yeah, with Ultra Ethernet nothing changes. All your open ecosystem and tools stay the same. What you will see is more enhancements to those. At the infrastructure level, there will be more end-to-end work on making it configuration-free, so that the users don't have to worry about what is happening at the layer below. That's how I look at it.

So Jim, you mentioned the idea of Ethernet from one gig all the way through the fabric. Is that really a key differentiator,
that you're not having to instantiate a separate networking technology for your cluster that is completely different from the rest of it? We talk a lot about inference and training and all the activities and moving data back and forth. Maintaining two separate kinds of networking sounds more complicated than having a single thing.

Absolutely.

I mean, if I were to say, well, all I have is the cluster and I'm not going to interact with it, I don't have any pre-existing networking, is the value proposition then less clear?

I think it's just as clear. Say you have a greenfield, new buildout. You're going to use the latest that is available. You still want to go with Ethernet, because today's top product is, three years from now, still interesting but not the top product. Well, what do you do with those? You generally take them and repurpose them into another solution. With the Ethernet being the same, you don't have to go retrain and retool. You can keep pushing it down the chain as you add more top-end equipment. And so that's a very powerful story.

I ask this question a lot. I think of chasing bottlenecks in IT broadly as sort of a game of whack-a-mole. Once Jericho and Tomahawk have bandwidth aplenty, something comes along that saturates that bandwidth. So maybe for a period of time it's not the network that is the bottleneck; it becomes something else. What are we seeing there, Hemal, in terms of where bottlenecks arise?

So depending on different workloads, what we have seen, especially for AI/ML, is that some specific stages within the network get congested. If you are an administrator, you would like to know where those congestion points are, and today most of this is manual. But giving more automated rerouting of the traffic,
allowing multipathing, avoiding the congestion, those are things the end user will really appreciate, and if you are an administrator, you'll really like it as well. And to go back to the previous question, the tools and everything are the same. What they have built today as scripts they will continue to use, like ethtool on Linux, no issue. Now it just gets enhanced with more and more information that they have, using the same set of standard tools.

Let's give that thing a name: GPUs.

Yeah, yeah.

All of a sudden, within the last nine months, GPUs are the discussion, right? And being able to pipe 400 gig to each GPU is critical. So all of a sudden the demands in the rack have just skyrocketed. This has to be line rate, it has to be reliable, and that's where Ultra Ethernet really helps out: continue that discussion, and where do we go from there?

So aren't these GPUs like 20 bucks each? So who cares if you fully saturate them? That's a joke, folks.

They are, with a lot of zeros.

They are massively expensive, and to your point, if it's underutilized...

Yeah, big time.

Big issue.

No, absolutely.

Yeah, interesting. Can we take a step up? I want to understand the power of what you're talking about, what you're helping organizations navigate in terms of the dynamics of AI/ML networking. What are some of the business outcomes or impacts that Dell and Broadcom together are helping customers achieve, whether it's a hospital, a financial services organization, or a manufacturer? I'd love to have any examples of real-world use cases where the business impact is dramatic.

Yeah, so, go ahead.

Go ahead, take it.

So I'll take a few examples. Everybody loves ChatGPT, right?

Oh yeah.

But you can imagine similar things in other businesses, where people may want to build their own training model based on patient data, right? And then doctors want to ask questions, maybe about some common symptoms, based on that. So they may want to build their
own dedicated, secure cluster, and they would like to keep their cost of managing it pretty much zero, right, and not require too much knowledge about how to deploy it. That's where we come in. We provide them the tools, make it easy for them, and let them deploy the application which is best for them.

And let them focus on their core competencies.

Exactly, exactly.

And that's where these large language model solutions come in, where they send all this data through, like ChatGPT sending the whole internet of data through it in a matter of days. Being able to learn something specific to medical, or traffic, or air traffic control, you don't have to worry about "I'm going to learn the world." It's "I'm going to be the best at this area," whatever your area is, and being able to do that while being very price competitive.

Yeah. So I've got a go-to-market strategy question for you. From Dell's perspective, typically if we think about this, we think about what Lisa was talking about: let's talk about the outcomes and the cool things. You get down to the infrastructure layer and there's a saying, nobody cares.

We care.

But somebody has to.

Somebody has to.

Okay, maybe you don't care, but somebody has to, otherwise none of it's going to work. But when you're working with an end-user environment, whether it's a service provider or an actual end user in their own data center, how does this conversation about networking come up? Is it part of a package? Is the typical engagement, "We're going to stand up an environment with a certain quantity of capability, it will include N Dell servers with whatever components are inside, and this is going to be the fabric that attaches all of them," and then the entire thing goes in? Is that more the conversation?

Right.

Sadly, you're not having specifically networking
questions all the time?

I mean, it depends on what the customer is asking for. We do have plenty of very specific pure-play networking solutions, but that's not what you're talking about. You're talking about, "Hey, I have a problem to solve. Dell, help me." We come in, and we specialize in compute, in connectivity, which is ours, and in storage as well, our Power storage line, plus all the connectivity, all the cables and optics. We bring the whole thing to bear. We come in with specialists in that, because we have so many ways you can solve problems, right?

Yeah. You know, one of your peers at Broadcom was on earlier today, and she mentioned a company that works with Dell, Scalers AI. She was saying that standing up the cluster took longer than the training of the model in this one instance. People take for granted that that process is going to be simple.

We definitely have services that specialize in how to get that right so that it's productive immediately. You don't want to have "Well, just go figure it out," and three months later, "I still don't get how you connect and wire this." No, you want to bring in the specialists. They know what they're doing, they get it up immediately, and then things are humming along. So that's very important.

Sorry, did you want to jump in?

You kind of led me in a direction there. Really, one of the things that Dell specializes in is the open flavor: open networking, open AI. We are not only offering one GPU solution. We work closely with Nvidia for their GPUs, we work closely with AMD for the MI300X, and we work closely with Intel for the Gaudi line. We want to be able to have a full array, because, let's face it, every supplier goes through "Oh, I'm out of that." Well, when customers come, we've got several other options for you. If timeline is your number one criterion, we're there and we're ready. And the
networking in that infrastructure is the same. It's just the server and the GPU where there will be a tweak. So that's somewhere we really excel as well.

And you're cool with that, because I think those are all Broadcom customers that he just mentioned, right?

Not only that, yes, true, but I was also going to add that we follow the same spirit. What Jim mentioned is very important: for our networking, we don't tie ourselves to one specific GPU architecture. We can work with any accelerator. That's why, with the Linux community, there's a whole infrastructure that's been created which allows NICs to work with any peer device and directly transfer data in and out of peer memory. The GPU computes on large sets of data, and the NICs are moving the data. So that open ecosystem really works, and you really have an end-to-end networking solution that way, without worrying about a specific architecture.

Without worrying about a specific architecture, how do you help customers future-proof? That's a marketing term, but I always love to unpack it and understand how you actually deliver on it. As the landscape of AI and ML networking technologies continues to evolve, how do Dell and Broadcom together help organizations future-proof their environment so that they can continue to deliver at the speed they need to, and probably faster as the days go by?

So, go ahead.

Go ahead.

We're both excited.

I'd love both of your answers.

All right. So this is where the discussion about Ethernet comes right back in our face. Here's where having infrastructure that is not tied to one specific vendor is critical. If you have networking that you can only get from one player, then you're locked in, right? And so with Ethernet, and this is where Dell is super excited, you can buy Dell today, and if you need to buy somebody else's tomorrow to plug into ours, well, Ethernet is Ethernet. We relish that fact. It also
keeps us competitive, keeps us humble, because we know that we have to continue to excel at what we do and provide excellent service.

So you can't get complacent.

Go, I think you're it.

So one of the things Dell and Broadcom have been talking a lot about is building reference architectures for the end customer. Depending on different customer needs, we can say, "Hey, if your model is not going to be bigger than this, a single-stage fabric is good for you; with a two-stage fabric for your future, this is how it will look." If you show them that path, that they can deploy something now and we can help them scale, that really helps them. And having that reference architecture, a proven architecture, really makes them confident in our solution.

That's key. Confidence, key word. Sorry, go ahead.

Yeah, no, I was just going to say, I know we're getting close to wrapping, but I've got one quick one.

You have as much time as you want, of course.

I'm in the driver's seat here.

Exactly, exactly.

So, a year from now we're back here, we're back together. What would you like to be talking about that hasn't reached maturity yet today? Where would you like us to be a year from now? Or give us a crazy prediction for something that we have no idea we'd be talking about a year from now. Hemal, what do you think?

One year from now, I would like to talk about how far we have progressed in the enhancements I was talking about, and be more concrete about the benefits of those.

Okay. Piggybacking on that, this year it's all about 400 gig. Next year it will be about 800 gig, really deployed in vast numbers. That's what we'll be talking about, guaranteed.

So I look forward to that discussion.

All right, well, we appreciate your time having this insightful, educational discussion with us today. The show and tell was awesome, thank you, and thanks to our crew for capturing that, because we did forget to tell you
that we were going to do that. But guys, you have to come back, because I think we're just scratching the surface of this fast-moving environment. With what Dell and Broadcom are doing together, "better together" is not a strong enough statement about what I see from the ecosystem.

So it's not one plus one, it's one times ten, kind of exponential.

Yeah, new math. I like it.

New math.

Gentlemen, thank you again for the time. We appreciate it. We'll see you next year, or maybe sooner. All right, for our guests and for Dave Nicholson, I'm Lisa Martin. You've been watching theCUBE's live coverage of Supercomputing 23. We're going to be back with our next guest after a short break, so we'll see you then. Walk the dog, get some coffee, get some water. We'll see you soon.
2023-11-20 17:48