The Exascale Computing Era is Here! Reflections on Then and Now
All right, good afternoon, everyone. It's my pleasure to introduce our next speaker, Dr. Doug Kothe. He has 38 years of experience in conducting and leading applied R&D in computational science applications designed to simulate complex physical phenomena in the energy, defense, and manufacturing sectors. Doug is currently the director of the U.S. Department of Energy Exascale Computing Project and Associate Laboratory Director of the Computing and Computational Sciences Directorate at Oak Ridge National Laboratory. Other positions for Doug at ORNL, where he has been since 2006, include Director of Science at the National Center for Computational Sciences and director of the Consortium for Advanced Simulation of Light Water Reactors (CASL), which is how I know Doug; CASL was the DOE's first Energy Innovation Hub. In leading the CASL hub, Doug drove the creation, application, and deployment of an innovative Virtual Environment for Reactor Applications, a 2016 R&D 100 Award winner, and this offered a technology step change for the U.S. nuclear industry. Before coming to ORNL, Doug spent 20 years at Los Alamos National Laboratory, where he held a number of technical, line, and program management positions, with a common theme being the development and application of modeling and simulation techniques targeting multiphysics phenomena characterized by the presence of compressible or incompressible interfacial fluid flow, work for which his field-changing accomplishments are known internationally. Doug also spent one year at Lawrence Livermore National Laboratory in the late '80s as a physicist in defense sciences. He holds a Bachelor of Science in chemical engineering from the University of Missouri, Columbia, and a Master of Science and a Doctor of Philosophy in my favorite field, nuclear engineering, from my favorite university outside this one: Purdue.

Thanks, Brandon. You didn't have to read all that stuff. Brandon was a rock star in CASL; I really wished he would have gone to Oak Ridge instead of staying here (it's too cold up here), but he was one of our key people, and it's great to be here. Let's see, how long do I have, two hours? I'm going to move really fast. This is an exciting project, there's a lot going on, and I'll leave the slides behind. Maybe that's the wrong way to do it, but instead of going deep I want to go broad, so I've got a lot of stuff I want to cover, and I look forward to the panel session.

So I'm going to talk about exascale, and Christiane really teed this up pretty well. This is an example of what applications at exascale can potentially solve or address with their outcomes and impact, and I'm going to cover a few examples here. In any case, Amitava gave a talk this morning (sorry, Amitava, I missed it), and right down here he talked about his application. Certainly climate is a big deal, but basically we built applications that cover the whole mission of DOE: energy production, energy transmission, scientific discovery, the applied offices, and national security. It's been a really exciting gig. We've been doing this for what feels like forever, but it's six years going on seven, and I think we're really going to see the fruits of our labor. This is just a really broad brush of all the apps and the problems they're addressing. Christiane just talked about climate; here the focus is really more about adaptation, frankly, trying to understand where we're going to have droughts or severe events or whatever.
She talked a lot about the DOE effort E3SM, so I won't go into that in too much detail.

So, there are 17 national labs, and yes, we're hiring, all of us, so talk to me if you're interested; shameless, but I'll do it anyway. Six national labs have historically deployed and operated large computers. What you see here, over the past decade or so, are some of the major systems that have been deployed at the three NNSA labs and at Oak Ridge, Argonne, and Berkeley. Our focus has not just been on the three first exascale systems (Frontier is here, Aurora is arriving, and El Capitan is arriving); we've worked a lot on some of these other systems too, and they've been really instrumental. They're non-trivial systems with a lot of power.

In any case, ECP started in 2016, and I'll say a little bit more about that. It takes a long time to really dive in and plan for the deployment of these systems, so if you're interested in high-performance computing (and looking at the posters here, a lot of students are), this is a great activity to be involved in: procuring, deploying, and operating a system. Frontier was really a decadal effort. It will probably go live in terms of general availability in just a few weeks, but it's not something we've been thinking about only since last year; it's been a decadal effort. And ECP is a project that's been building apps and software stacks, so for the first time in my very long career (thank you, Brandon) we have a parallel project building the software along with procuring the systems, and that's been quite a remarkable ride.

Here are the speeds and feeds for Frontier; you can find all of this online. The things that I liked are in blue: basically it's 9,472 nodes with about a two-exaflop double-precision peak and a lot of memory (though as an application person myself, it's never enough): 4.6 petabytes of HBM and 4.6 petabytes of DDR.
There's lots of bandwidth down to the GPUs, lots of good memory bandwidth, and a lot of non-volatile memory (four terabytes per node), and I think we're still trying to figure out just how we need to exploit that. It has a reasonable network. There are really a lot of innovative aspects to this design, so again, I'll leave this behind.

Here's a closer look at the node. At Oak Ridge we were really among the first to do this: LANL first, with Roadrunner and the Cell processor, and then us with Titan in 2012, jumping into the deep end with GPUs. Back in 2012, NVIDIA was a very small company, and we sat down with them and said, hey, this looks interesting, but we really need 64-bit (they didn't have it) and we really need error correction (they didn't have it). We worked closely with them, invested, and made sure those hardware details were put in. So now, here we are a decade later, and we've got, as Christiane mentioned, AMD, Intel, and NVIDIA really going all in on accelerators. And I like to call them accelerators, because GPUs really are a type of accelerator that speeds up certain hardware operations, in particular a lot of floating-point operations. In this case we've gone from roughly a one-to-one, to a three-to-one, and now to a four-to-one GPU-to-CPU ratio. What you see here is fairly coherent memory on a node, and the really cool feature is that each GPU has a network interface card out to the interconnect, so one can do GPU-direct MPI from GPU to GPU off node. That's unique, and that's new.
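As a concrete illustration of that GPU-direct capability, here is a minimal sketch of what GPU-to-GPU MPI can look like, assuming a GPU-aware (ROCm-enabled) MPI build and the HIP runtime; the buffer name and size are hypothetical, and this is not code from any ECP application.

```cpp
// Minimal sketch: GPU-aware MPI exchange between two ranks.
// Assumes MPI was built with GPU support so device pointers can be
// passed directly to MPI calls; otherwise a host staging copy is needed.
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1 << 20;                 // hypothetical halo of 1M doubles
  double* d_buf = nullptr;
  hipMalloc(reinterpret_cast<void**>(&d_buf), n * sizeof(double));

  // Fill the device buffer from the host so there is something to send.
  std::vector<double> h_init(n, static_cast<double>(rank));
  hipMemcpy(d_buf, h_init.data(), n * sizeof(double), hipMemcpyHostToDevice);

  if (size >= 2) {
    if (rank == 0) {
      // The device pointer goes straight into MPI_Send: with GPU-direct,
      // the data can move out the NIC without bouncing through host DRAM.
      MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      std::printf("rank 1 received the halo directly into GPU memory\n");
    }
  }

  hipFree(d_buf);
  MPI_Finalize();
  return 0;
}
```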
In any case, I'll move on, but let me just say that the system is probably under-provisioned for heavy-duty machine learning training: a two-exaflop system probably needs about 100 terabytes per second of ingestion. That said, if we use the non-volatile memory creatively we can get maybe 60 or 70. For more traditional mod-sim, 10 terabytes a second is pretty good, so it's a pretty nice I/O system: 750 petabytes, which is a lot of disk, with decent bandwidth.

One of the things that really enabled Frontier to be deployed is that back in around 2009 we started thinking about exascale. It was deployable then, but it would have cost probably billions of dollars and consumed maybe a gigawatt, or at least a couple hundred megawatts. We realized then that we needed to invest with our U.S. HPC vendors to really bring down the power, bring up the resilience, and go after the extreme parallelism. After multiple years of investment with vendors like Intel, HPE, and NVIDIA, we were able to drive the power per exaflop down: the goal was 20 megawatts, and we got to about 15 with Frontier. To get to zettascale (and we could ask why we need that; I think climate is a great driver, as an example) we probably need to drive the power consumption down by another factor of a thousand for systems like this to be useful and affordable. So we're quite proud of the fact that we hit a key metric in being able to afford the system from a power point of view.

Now I'm going to talk about the Exascale Computing Project for really the rest of the talk. Again, this is a project that's building applications, building a software stack, and integrating all of that technology onto not just the exascale systems but the pre-exascale systems as well. In fact, as Amitava maybe mentioned, all of our co-development really runs from laptops to desktops to engineering clusters on up; we're not building boutique applications that only run on exascale hardware. The project itself is a large software project. I've been involved in it since really 2015, ramping it up; I first led the application area, and I've led the whole project over the past four or five years, something I'm very proud of. Again, there are six lead labs, but 15 of the 17 labs are involved, along with many universities like Michigan and many private companies, so it's a big, huge endeavor.

One of the key things we've been working on since the beginning of the project: we had Titan back in 2012, where we were starting to figure out how to exploit the GPUs, and it's been a real workhorse; and now Frontier, which is about ready to be opened up to the public. You can see some trends here, again going from a one-to-one to a three-to-one to a four-to-one GPU ratio, and you can see that the interconnects, which are really more of the secret sauce and the more proprietary aspect of these systems, have moved from the Cray Gemini network, to a more standard InfiniBand, and now to basically a proprietary interconnect known as Slingshot from HPE. A fat tree is probably the best topology in my mind; Frontier has a dragonfly, which really isn't bad, but for scalable systems I think we've found that Summit's fat tree has been a great platform.

In terms of the accelerators, what if you don't use them? Well, Summit is a 200-petaflop system; if you don't use the accelerators, it's about a six-petaflop system, so you're really looking in the rear-view mirror by about a dozen years. In other words, you're using an old, old system, and it's even worse with Frontier: about 95 percent there instead of about 98 percent on Summit.
With Frontier's two-exaflop double-precision peak, you can imagine that if you're not using the GPUs, you might as well just get off on an engineering cluster. So whether you like them or not, they're here to stay, and they're here to stay for the foreseeable future.

Now let's talk about the applications. The idea here was to select a collection of first-mover applications. There are hundreds of important applications, and we didn't have the time or the funding to go after everything we'd like to, but we carefully negotiated with each Department of Energy program office (there were 10 of them that we interacted with) to select a particular target problem and to select a team. The target problem, which we call a challenge problem, is one that's strategically important to the program office, not solved today, and amenable to solution at exascale. Each one of these applications has a specific problem it's going after, with not just the hope but the expectation that the application in the end will be very general; but to be able to measure their performance and their roadmap on the science side, each one has a very specific problem to tackle. Tyler probably addressed that this morning.

I'll give you a few examples of applications here. Again, we started with millions of lines of code, lots of different codes; the previous speaker referred to the Fortran-to-C++ evolution, and I'll say a little bit about that. In the end, for every application (again, this is an eye chart, but I want to leave these behind for you to take a look at, and feel free to contact me) you see the domain and its base challenge problem, which is: we'll give you a C grade if you can show you can address this problem. We're not saying you're necessarily going to solve it, but show that you have the algorithms and the physics all together such that you're really capturing the phenomena you want to capture. There are also stretch challenge problems, which are really about blowing the doors off things. Each application has, as you can imagine, its own set of challenges. For some, the model is pretty fixed and they're working on algorithms and software, but for most, the model isn't fixed; models evolve, and they're bringing in multiple codes to couple, different algorithms, different software architectures. Lots of acronyms here, I apologize, but each application had its own set of unique challenges. Each application team consisted of anywhere from about five to thirty people, including students and postdocs; it was really an eclectic mix of engineers, physicists, chemists, domain scientists, computer scientists, mathematicians, and software engineers.

When we think about moving to the GPU, it's really a multi-dimensional challenge. Just porting, getting to the point where I'm running on the GPUs and effectively using the GPUs, has a lot to do with memory coalescing, loop reordering, and kernel flattening, so you need to understand how the GPU works and be able to break things up. In the case of one code, one big sixty-thousand-line loop had to be totally rewritten (as Brandon probably knows, it was a Monte Carlo code), and they had to totally rethink the algorithm.
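To make those porting dimensions a bit more concrete, here is a small, hypothetical sketch of the kind of restructuring involved: a nested i, j, k loop flattened into a single GPU kernel index, written here with HIP. The array names, sizes, and the per-cell update are all made up; none of this is taken from an ECP code.

```cpp
// Hypothetical example of "kernel flattening": a nested i,j,k loop
// becomes one flat index so every GPU thread gets one cell, and the
// fastest-varying index (i) maps to consecutive threads so global
// memory accesses coalesce.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void update_cells(double* a, const double* b,
                             int nx, int ny, int nz) {
  const long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
  const long long ncells = (long long)nx * ny * nz;
  if (idx >= ncells) return;

  // Recover (i,j,k) from the flat index; i varies fastest.
  const int i = idx % nx;
  const int j = (idx / nx) % ny;
  const int k = idx / ((long long)nx * ny);
  (void)i; (void)j; (void)k;  // would index stencil neighbors in a real code

  a[idx] = 2.0 * b[idx];      // stand-in for the real per-cell physics
}

int main() {
  const int nx = 128, ny = 128, nz = 128;
  const long long n = (long long)nx * ny * nz;
  double *d_a = nullptr, *d_b = nullptr;
  hipMalloc(reinterpret_cast<void**>(&d_a), n * sizeof(double));
  hipMalloc(reinterpret_cast<void**>(&d_b), n * sizeof(double));
  hipMemset(d_b, 0, n * sizeof(double));

  const int block = 256;
  const int grid  = (int)((n + block - 1) / block);
  hipLaunchKernelGGL(update_cells, dim3(grid), dim3(block), 0, 0,
                     d_a, d_b, nx, ny, nz);
  hipDeviceSynchronize();
  std::printf("flattened kernel ran over %lld cells\n", n);

  hipFree(d_a);
  hipFree(d_b);
  return 0;
}
```

The point is simply that the fastest-varying index maps to consecutive threads so memory accesses coalesce, which is the kind of detail the porting work keeps running into.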
That's where we get into the middle circle here, which is really about the fact that communication is costly; it takes a lot of energy. Instead of storing, I'm just going to recompute on the fly, because I've got free flops next to me on the GPU. So you have to rethink a lot of things, and then adapt the models. The Monte Carlo example is a good one that I'll get to a little bit later, but in many cases particles really work well, so how can I more effectively use particles, or totally rethink some of my algorithms? It's been a lot of fun algorithmic work in this journey as well.

What we've seen over ECP, and really since 2012, is that we've gone from CPU-only to CPU plus GPU, where you sort of mix and match the load and do host-to-device and device-to-host copies back and forth: you compute here, I'll compute there. Then, with multiple GPUs and their high-bandwidth memory, you get a lot of effective performance, and you figure out how to use two or three GPUs. Ultimately, where we are now is that there are enough GPUs and enough memory that I'm just going to go down there and live on the GPUs: I'm going to initialize things, go down there and live, and try to stay there as much as I can, because it's costly to go back and forth.

Now, about the applications that are on the hook for performance: about half of our applications need to show not only that they can do new science but also that they can get a certain performance increase, because getting the same answer faster really means nothing. By performance we're talking about science per unit time; we want better science, faster. Each application had its own work rate, and what you're seeing here is work rate over three years just on Summit. We set this 50x bar, and one could argue that we really sandbagged, because we hit 50x for a lot of the apps just on Summit. Well, in a few cases, like the earthquake simulator and the molecular dynamics, there were snippets of code that had been dormant for decades (a good application will live three or four decades), so we had a lot of legacy code that hadn't been looked at in a while, and just taking a hard look at it gave some incredible and very surprising performance. But that doesn't mean these apps are done, because the problem they want to run on Frontier doesn't fit on Summit; these are benchmark problems that show there's good progress. Nevertheless, the performance actually exceeded expectations for a number of applications.

The previous speaker talked about moving to a GPU. What we did not do in ECP (you might say it didn't make sense, but I think it will prove to have been the right way to go) is mandate a programming model. I wanted to make sure that we enabled a programming environment that allows multiple options. What you see here are the choices that application developers have made, based on how portable and productive they want to be and how much control they want. If I want ultimate control, I'm going to program down at the metal with CUDA or HIP or SYCL and not bring in any other abstraction layers; I'm going to own it. A lot of application developers, at least at the beginning of ECP, were reluctant to bring in components that other teams built, because now I have to rely on that other team and on that technology. But as you move up here, you adopt more abstractions, whether that's OpenMP or, further up, layers like Kokkos, RAJA, and Alpaka.
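To give a flavor of what that higher-abstraction end of the spectrum looks like, here is a minimal, hypothetical Kokkos-style sketch: just the generic pattern of device views plus parallel loops, not code from any ECP application. The array names and sizes are invented.

```cpp
// Minimal Kokkos sketch: the abstraction layer decides where the data
// lives (host or GPU) and how the loop is parallelized; the application
// code only expresses the per-element work.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // A View is allocated in the default execution space's memory
    // (GPU memory when built with a GPU backend such as HIP or CUDA).
    Kokkos::View<double*> x("x", n);
    Kokkos::View<double*> y("y", n);

    // Device-parallel initialization and a simple axpy-like update.
    Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = 3.0 * x(i) + y(i);
    });

    // A reduction, again with no GPU-specific code in the application.
    double sum = 0.0;
    Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(const int i, double& acc) {
      acc += y(i);
    }, sum);
    std::printf("sum = %f (expect %f)\n", sum, 5.0 * n);
  }
  Kokkos::finalize();
  return 0;
}
```

The design point is that data placement and loop execution are delegated to the library, which is exactly the trade-off being described: less control, more portability.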
In that case you begin to rely more and more on these abstraction layers. What we found in particular with Kokkos, which came out of Sandia, is that almost half of the applications ended up adopting it, which was incredible; it's now a really common community abstraction layer. Basically, as an application person, I'll rely on Kokkos to allocate my data on a GPU, and I'll rely on Kokkos to do fundamental matrix-vector and matrix-matrix operations. Think of it as a black box that handles all that ugliness, which is great because now I can focus on the science; if I want to get down there and learn and understand and have more control, then maybe I won't do that. It's actually turning out that certain features of Kokkos are making their way into the C++ standard. The Kokkos team tells me that their goal is to put themselves out of business (I don't quite believe that), but ultimately all of the abstraction syntax in Kokkos could end up in C++, in which case we'd have those abstractions as part of the language. I'll also say that if you play around with Copilot or ChatGPT to write code, it knows about Kokkos, which is incredible; we've seen it basically interpret some interesting code for us, and I think the way we program in the future is really going to change.

One of the bets we made at the beginning of ECP was that most applications use a small number of motifs, a motif being a common pattern of communication and computation. You see the first seven on the left, which Phil Colella coined back in 2004, and then a couple of years later Berkeley said, well, if you think about data science, there are another five or six or so. We looked at this, we surveyed the applications, and we decided to dive deep into these motifs. We call the resulting efforts co-design centers, which is a little bit of a misnomer, because all throughout ECP there's co-design; here it's really up through the software stack rather than down to the hardware, but the name is a legacy. I want to talk a little bit about some co-design centers that have tackled the typical motifs we see in fundamental mod-sim apps: structured grids, unstructured grids, particles, graphs, and so on. They've turned out to be very successful, much more so than I thought.

The first is patch-based AMR; John Bell leads this effort, AMReX. Of the roughly 24 ECP principal codes, six have chosen to live totally on AMReX. What we've done is create a new community middleware layer through John Bell and company, and the idea is: if you let me do the patch-based refinement, I'll push particles on the patches, I'll do your linear solves if you want me to, I'll handle the embedded boundaries underneath; I'll handle all the ugliness, off node and on node. That's been a surprising development (well, I shouldn't say surprising, because John Bell is a real star). Earlier this year (apologies, these are eye charts, but I'm going to leave mine behind) we asked each team to write down the 'then and now': where were we in '16, and where are we now. Just take a quick glance at the technology step change that's been imparted by this team.
What we found is that these aren't heavy investments: with the right team, focused on the right problem, and on one problem over multiple years, you can make tremendous gains. And AMReX isn't just patch-based AMR; it has a lot of other great features, and I'll show some examples of what code bases have been able to do with it.

Pele, for example, is a combustion application, Jackie Chen's at Sandia. At the beginning of this project she had S3D, which was a fundamental compressible LES and DNS code. What happened is that Pele started from scratch, with a compressible version and a low-Mach-number version, and you can see here on the right the evolution in terms of bringing in new physics: embedded boundaries (EBs), bringing in GPUs, and using particles to track fuel spray and soot. In other words, they've chosen to totally live on AMReX and have been able to move forward very quickly. On the left you see the specs of a simulation they ran on Frontier, and the thing that's way cool to me is the math on the effective resolution: the resolution required to get down to sub-micron would have been about 9 trillion cells, and with AMR it was just 60 billion. So you're basically executing a trillion-cell-class simulation with billions of cells. That's a big win for AMR, not that I have to convince this audience.

Another one (and the teams came up with the names of their projects, so everything is 'Exa-something', which is maybe a little annoying because exascale is just one little stop along the way) is ExaWind, out of NREL. At the beginning of the project they had a way cool finite-element code to essentially resolve the rotating turbine blades, and even do a little bit of fluid-structure interaction. What they have now is the ability to do multiple turbines with an AMReX-based background flow (that's called AMR-Wind), and they're actually rewriting WRF, the weather code, on top of AMReX. To really do a wind farm simulation (the goal is a farm of around 50 turbines) you need the weather, the local topography, the flow around the turbines, and the interacting turbines themselves; it's a canonical exascale-plus problem. Why are they doing this? Most wind farms lose about 20 to 30 percent of the potential energy coming into the farm due to turbine-on-turbine buffeting, so we need to simulate and understand that, and develop a lot of data on which we can train a machine learning algorithm to dynamically control the yaw and pitch of the turbines and get more out of the wind farm. It's one of my favorite applications, even though, as a nuclear engineer, I think there's another great solution for clean energy, as Brandon would probably agree.

Let me move on to another co-design center, for unstructured-mesh finite element operations, known as CEED; it comes out of Livermore. If you think about finite element codes, you get into high-order discretizations, a lot of quadrature points, and a lot of matrix-vector and matrix-matrix operations. This team says: not only can we handle the mesh for you, we can handle the gathers and scatters; we're going to do all your finite element operations. At the top you see little icons for the multiple unstructured-mesh codes that have adopted this; I highly recommend you go take a look. They've really overhauled MFEM and the CEED library.
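As a rough sense of why high-order finite elements map so well onto these machines, here is a back-of-envelope count (my numbers and assumptions, not CEED's), assuming a tensor-product hexahedral element at polynomial order p with p+1 quadrature points per direction.

```latex
\[
  N_{\mathrm{dof/elem}} = (p+1)^3, \qquad
  N_{\mathrm{quad/elem}} \approx (p+1)^3,
  \qquad \text{e.g. } p = 7 \;\Rightarrow\; 8^3 = 512 .
\]
```

Applying the operator matrix-free then turns into many small, dense tensor contractions per element, which is exactly the regular, flop-heavy matrix-vector and matrix-matrix work at quadrature points that GPUs like.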
A good example is a full-core reactor simulation, led out of Oak Ridge by Steve Hamilton. The unstructured-mesh incompressible code Nek5000, now called NekRS, rebuilt itself on top of CEED and gets incredible performance. What we're seeing with this application is full-core, full CFD (which I think is very much needed for a small modular reactor) coupled to continuous-energy Monte Carlo. The Monte Carlo effort has been way cool to watch: it has really overhauled the Shift code, and it pushes around literally trillions of particles. If your cross sections are reasonably accurate and you have enough particles, I don't think there's any better solution than Monte Carlo. So here we have a full-core coupled simulation capability that I think is going to be a game changer for reactor simulation. Again, in the speeds and feeds you can see the kinds of things they're looking at for performance: how many particles can I push per second, how many degrees of freedom can I solve per second, the supposition being that better resolution gives you better answers. This application was one of the first that ran on Frontier, and 6,400 nodes out of 9,472 is still something like a one-and-a-half-exaflop peak capability. With each team we said, okay, here's your minimum criterion (we set these back in 2019, minimum criteria for what they needed to show they could do on Frontier), and so far so good; many teams, like this one, are easily surpassing them.

Another motif that I am very fond of (I'm more of a particle pusher myself) is Cabana, the Cabana library; the center is called CoPA, so I guess if you're a Barry Manilow fan you'd call it 'Copacabana'. This library basically pushes particles on the GPUs and maps particles to mesh and mesh to particles, a lot of the fundamental operations that PIC codes in particular often need and utilize. Again, you see the same kind of abstraction here: a lot of the ugliness, the challenge of implementing on GPUs, is really hidden from the application, and you see some of the applications that live on top: molecular dynamics, particle-in-cell, material point method, plasma PIC. Multiple particle-based apps are now living on this library. Again, this is a dense eye chart, but let me just say that by implementing most of (actually, all of) the particle operations in C++ and using Kokkos, a Fortran-based code like XGC, as alluded to, can now just call into this library and utilize all the power of Kokkos and the C++ implementations without having to overhaul the entire code. This is a nice way to seamlessly move into a C++ environment by counting on a library.
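To illustrate the kind of operation Cabana encapsulates, here is a small hypothetical sketch of a particle push written directly in Kokkos. This is deliberately not the Cabana API (which adds AoSoA data layouts, sorting, and particle-grid communication on top); it is just the underlying push-particles-on-the-GPU pattern, with made-up sizes and time step.

```cpp
// Hypothetical particle push: one Kokkos parallel loop advances every
// particle's position by velocity * dt. Libraries like Cabana layer
// particle data structures and particle-grid transfer on top of loops
// of this shape.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int np = 1 << 20;        // number of particles (made up)
    const double dt = 1.0e-3;      // time step (made up)

    // Positions and velocities live in device memory, 3 components each.
    Kokkos::View<double*[3]> pos("pos", np);
    Kokkos::View<double*[3]> vel("vel", np);

    Kokkos::parallel_for("init", np, KOKKOS_LAMBDA(const int p) {
      for (int d = 0; d < 3; ++d) {
        pos(p, d) = 0.0;
        vel(p, d) = 1.0 + 0.1 * d;
      }
    });

    // The push itself: embarrassingly parallel over particles.
    Kokkos::parallel_for("push", np, KOKKOS_LAMBDA(const int p) {
      for (int d = 0; d < 3; ++d) {
        pos(p, d) += dt * vel(p, d);
      }
    });
    Kokkos::fence();
    std::printf("pushed %d particles one step\n", np);
  }
  Kokkos::finalize();
  return 0;
}
```

Loops of this shape are what a library like Cabana hides behind its particle data structures.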
Another one that I really like, really one of our more challenging applications, is ExaAM, additive manufacturing. This is probably our most challenging, higher-risk application, because currently there doesn't exist, in my mind, a real high-fidelity simulation tool for predicting the as-printed microstructure. Most people maybe don't appreciate that even though additive manufacturing changes the game and 3D printing is a big deal, with metal alloys (high-spec metal alloys) you get microsegregation and porosity, and a lot of printed parts get rejected, especially for defense and aerospace. So we're going after this problem to try to fundamentally understand what's going on there, and then of course that understanding will lead to insight about how we change our processes or design better 3D printers. This is a canonical multiphysics application that really couples multiple codes, from microscale to mesoscale to continuum, with a lot of rich physics; I think it's one of our more challenging applications.

Another one is WarpX, and this one won a Gordon Bell award, on Frontier actually, last fall. It's basically a full electromagnetic solver, full particle-in-cell, now going after plasma wakefield acceleration, which is a way cool route to high-intensity, compact, terawatt-class accelerators and lasers. With WarpX it's really neat to see the evolution: in 2016 it was mostly Fortran and Python and had only a rudimentary adaptive mesh refinement capability; it then moved to AMReX, and solving the 3D Maxwell equations across patches is really, really hard, with a lot of algorithm work required to make sure they didn't get ghost electrostatics or ghost electromagnetics. This is Jean-Luc Vay at Berkeley. He actually plotted his figure of merit (the ratio of the figures of merit is the speedup) and basically got 500x once he got onto Frontier; he was probably the second team that got onto Frontier in that 10-day rolling window that was mentioned earlier. Here you see 2016 and 2022: a real step change in the algorithms, the software, and the performance, and you can see the evolution of his ability to increase that performance. In this case the performance metric is the particle work plus the mesh work, times the number of time steps, divided by the wall-clock time; each team developed and defended its own metric for performance. This is a nice pretty movie from the actual Gordon Bell run; if you're not familiar with the Gordon Bell, it's an award that really galvanizes and focuses the application community, awarded every fall at SC.

A couple of other applications: this one is known as EXAALT, and I think it's really creative in terms of the algorithms. With molecular dynamics at exascale you can now do larger systems, more atoms, but the curse is basically a picosecond-scale time step, so you still can't integrate out to milliseconds or seconds without some creative algorithm work. For the kinds of phenomena they're looking at, bubble growth inside of nuclear fuel (fission product gas) and also, sorry, helium bubble growth inside of tungsten materials on the fusion first wall, replica dynamics really works well in terms of tackling the physics. This is an example where we fire off multiple replicas, somewhere around a hundred to a thousand LAMMPS simulations, and because we have the GPUs we can do more local quantum simulations: they call a code called LATTE, which is a DFTB (density functional tight binding) capability, to do local quantum calculations for the forcing, because the flops are right there. So send out hundreds of LAMMPS simulations, and when the first one overcomes a statistical barrier, we stop everybody and then move forward with a large delta-t given by the sum over the replicas. What they've been able to do with EXAALT is get a fantastic performance increase.
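The bookkeeping behind that replica trick, as I understand it, is the standard parallel-replica-dynamics argument (idealized here, not EXAALT's exact formulation): the simulated time you credit is the sum over all replicas up to the first observed transition.

```latex
% N independent replicas evolve in parallel; when the first one escapes
% the current state, the simulated time credited is the sum over replicas.
\[
  t_{\mathrm{sim}} \;=\; \sum_{r=1}^{N} t_r \;\approx\; N \, t_{\mathrm{wall}} .
\]
```

So with a hundred to a thousand LAMMPS replicas, the method buys roughly a hundred- to thousand-fold stretch in reachable timescales for rare-event dynamics, provided the transitions really are rare and uncorrelated.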
This is an example where they had a potential function in the code that really hadn't been looked at in years, called SNAP, and they basically rewrote the kernel that gets called over and over and over and made a tremendous increase in performance. That kernel is very likely going to be replaced by an inference engine, in other words a network trained on DFT simulations and experimental data. It's a really good canonical example of how to bring in machine learning in a subgrid way, as a way to augment and accelerate the workflows rather than just replace the app, which I don't think we'll see.

We do have a center for machine learning (everything is 'Exa', so it's called ExaLearn). What we've done here is more than dabbling in machine learning, but we're not investing as if it were the whole project. We're looking at how we can help build surrogates for certain applications and how we can influence control, design, and inverse design. Surrogates are probably the most interesting to me over the lifetime of ECP, and we've been able to do that in the cosmology application, essentially developing images of the evolution of the universe to compare and contrast with our own simulations (these are HACC-based simulations out of Argonne) using GANs, which are very useful for surrogates. Here I think we've just scratched the surface of the usefulness of surrogates for sensitivity analysis and for UQ.

But if we look at ChatGPT and GPT-4, the large language models (GPT-4 just came out last week), we can ask ourselves: Frontier is probably the smartest machine in the world, so how big a model could we train? We wouldn't do it by reading the internet; we want to do it for science. So we're actually looking at training large models, certainly ingesting data such as materials science data, materials science publications, and so on, and we've got some skunkworks projects at Oak Ridge right now. Just focus on the one bullet here: we could probably train a 100-trillion-parameter model on Frontier, and GPT-4 is supposedly about 100 trillion, so we have the ability to train a huge model. It would probably take 15 to 30 days of full-machine training. We're not going to do it just as a stunt, but we are looking at what we can do to train some useful models for science, and we've been engaged in those discussions for more than the last few weeks or months. I'll say a little bit more about that.
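A very rough back-of-envelope for why a model of that size might even fit in machine memory (my arithmetic and my assumptions, not an ECP estimate; it ignores activations, parallelism overheads, and data movement): assume roughly 16 bytes per parameter for mixed-precision training, covering fp16 weights, an fp32 master copy, and two Adam optimizer moments.

```latex
\[
  10^{14}\ \text{parameters} \times 16\ \frac{\text{bytes}}{\text{parameter}}
  \;\approx\; 1.6\ \text{PB} .
\]
```

That is under Frontier's roughly 4.6 PB of HBM: tight, but in the same ballpark, which is the sense in which a training run at that scale is at least conceivable on the machine.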
But we have to remember that we need really good benchmarks for these models. I really like this paper, called 'Beyond the Imitation Game' (the Imitation Game being the Turing movie). This group, I think it's something like a 50-author group, came up with a bunch of benchmarks that break large language models, and this is one I thought was kind of cute: you basically ask the models what movie a string of emoji describes, and you can see the answers from a 2-million-parameter model up to a 128-billion-parameter model. The models with very few parameters just suck, frankly; just terrible answers. We need something like this for science: good benchmarks that break these models, so they can guide, focus, and galvanize our efforts. So anyway, I recommend you read that paper; I think we need more like it for science.

So what are we doing on the AI side? Well, we've been thinking about this and writing up reports (at DOE we probably have too many meetings, but we've been having meetings and writing up reports), so I'll point you to this one and to another that's coming out very soon. We had a series of workshops over the last year or so, and you can see what we're focusing on: basically properties inference, inverse design, autonomy, surrogates, programming (here is where I really do think there's going to be a game changer, where we're going to be much more productive in writing code; nothing will replace the human, but we're going to be much more productive), plus prediction, control, and foundation models. I just encourage you to look for that report, and hopefully we'll be able to drive things in that direction with our Department of Energy sponsors moving forward.

I'm going to shift back to a couple of applications and then conclude. One very non-traditional application is the power grid. We undertook this one in 2016 realizing that it was a very high-risk problem. If you look at the U.S., you have three major interconnects, and if you count up all the points (an energy generation station, a transformer, distribution, a home, a smart device) you're easily in the billions, so it's well beyond an exascale problem if you want to simulate all of those points. But going after something like a hundred thousand or a million points is tractable: there are over a thousand generation stations in the U.S. and maybe 10,000 major distribution centers, so you think of those as a graph, where each point has certain behaviors and the edges are the connections. A grid simulation then becomes basically a graph-based simulation, where you ask yourself: I break an edge, and maybe I've lost a power supply. One of the problems we're going after is to take 13 gas plants off the grid in the simulation and see what happens to the 60-hertz frequency: you go down to about 59.7 and you're in trouble, like what happened in Texas a few years ago. That's called an under-frequency problem.
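For a sense of why losing generation pulls the frequency down, here is the textbook aggregate swing-equation view (my sketch of the standard power-systems relation, not the project's actual model), where H is the effective system inertia constant, f_0 the nominal frequency, and P_m and P_e the per-unit mechanical (generation) and electrical (load) powers.

```latex
\[
  \frac{2H}{f_0}\,\frac{df}{dt} \;=\; P_m - P_e .
\]
```

Dropping 13 gas plants makes the right-hand side suddenly negative, so the frequency sags below 60 Hz until reserves, load shedding, or other units arrest the decline; that is the under-frequency event being simulated here.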
So in this case we're going to simulate the under-frequency event and try to understand how we can help the grid respond: what other generation units need to come on, what do we need to do, because operators have seconds, or at most a few minutes, to make decisions. Obviously you can't have an exascale computer in the operations room, but we can train some inverse models to at least suggest good decisions. This gets into nonlinear optimization and mixed-integer programming, a lot of optimization algorithms that really aren't mature or scalable yet, so there's a lot of fundamental math here.

A couple more non-traditional ones. In the DOE space we generally don't tackle human health (that's NIH), but we have a project where we're working closely with NIH, one that has some real efficacy in the NIH space: metagenome assembly, where you take snippets of DNA and try to assemble them back together to predict and understand what the ultimate DNA strand looks like and what kinds of proteins it might produce. Kathy Yelick, who's a real superstar in our field, started with an assembler known as HipMer, and (oops, I probably hit the wrong button there) at the time HipMer could assemble about 2.6 terabytes of genome data, but it couldn't do it in a scalable way: it would scale up, do some assembly, then bottleneck down to one node, and then scale back out to finish. What you want here is to be able to assemble a lot of data, so our goal is 50 terabytes, scalably, and she's pulling it off; it's now actually the production assembler at JGI. She uses a lot of hash tables; this is an algorithm that had never really been implemented on GPUs, and a lot of it required very creative algorithm work using something known as UPC, Unified Parallel C, a PGAS language. Really, really nice stuff.
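To give a feel for the hash-table flavor of this workload, here is a tiny, hypothetical k-mer counting sketch in plain C++. It is nothing like HipMer's distributed, GPU-resident data structures; it just shows the basic idea of chopping reads into k-mers and tallying them, with made-up reads and a made-up k.

```cpp
// Toy k-mer counter: split each read into overlapping substrings of
// length k and count occurrences in a hash table. Real assemblers do
// this (and the subsequent graph traversal) distributed across nodes.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  const int k = 5;  // k-mer length (tiny, for illustration)
  const std::vector<std::string> reads = {
      "ACGTACGTGACCA",   // made-up DNA snippets
      "GTACGTGACCATT",
      "CGTGACCATTACG"};

  std::unordered_map<std::string, long long> counts;
  for (const std::string& read : reads) {
    if ((int)read.size() < k) continue;
    for (size_t i = 0; i + k <= read.size(); ++i) {
      ++counts[read.substr(i, k)];       // hash-table insert/increment
    }
  }

  // k-mers seen more than once hint at overlaps between reads,
  // which is what assembly stitches back together.
  for (const auto& kv : counts) {
    if (kv.second > 1) {
      std::printf("%s appears %lld times\n", kv.first.c_str(), kv.second);
    }
  }
  return 0;
}
```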
The previous speaker talked about E3SM. Gosh, back in the early '90s I was involved in one of the first parallel implementations of an ocean code, known as POP at the time, so I'm not a climate scientist, but I am more than well aware of the challenges. We've worked with Mark Taylor here, and our injection into the E3SM effort has really been more about the multiscale modeling framework: with MMF we're able to implement a superparameterization model for clouds, because you've got the GPUs right there. It's an interesting model; you're not really cloud-resolving, but you're doing some local mixing to get better estimates of mixing. This slide in particular has some great data on where the baseline model was in 2016 and what they're able to do now; you can see how the baseline has shifted. What Mark calls the baseline model is non-MMF, without that subgrid model, and the baseline, as the previous speaker mentioned, is still pretty darn good at three kilometers. What's interesting to me is the simulated years per day: you see 0.01 and then 2.6, and ideally you'd like to do hundreds of century-scale simulations. Mark was actually able to get over three simulated years per day recently on Frontier, so I think he'll be able to get maybe five or more. But let's just say five simulated years per day: it's still going to take 20 days on Frontier to do one century, and we need to do many centuries, so we have a long way to go. That said, we're really proud that we were able to help this effort.

One other, more non-traditional simulation is modeling wave propagation due to an earthquake. This is a Bay Area simulation known as EQSIM, with David McCallen at Berkeley. He's got a fourth-order-in-space, fourth-order-in-time code, SW4, out of Livermore, that propagates the waves through the geologic strata; it then comes up to the surface and couples with a finite element code for the buildings. What he tells me is that the buildings that are most suspect are the three-to-five-story buildings; they couple in quite heavily, and you've got to get the right frequencies, which are around five to ten hertz. Before this effort we were simulating sub-one-hertz and really weren't picking up the phenomena very well; now, as he shows here, he's able to get to the right frequencies of five to ten hertz and do the coupling. This really helps us harden structures and make them a little, or maybe a lot, more earthquake resilient.

Okay, I'll conclude with just a few more slides. I've not really talked about the software stack much, but we built a software stack that was based on a lot of components that pre-existed ECP, and we've now packaged them all up into something called E4S, the Extreme-scale Scientific Software Stack, which has about 100 different libraries and components. One is CLOVER, out of Jack Dongarra, a recent Turing Award winner and a real mover and shaker over many, many decades. I'll just show an example: if you're not familiar with ScaLAPACK, you probably should be, but look at where ScaLAPACK was and where it is now with SLATE, which moved to C++ and has full GPU implementations; be sure to take a look at this, it's fundamental linear algebra. I think as a community we've overlooked the importance of FFTs, so we're actually working with Jack and investing in FFTs; they're ubiquitous across code bases, and I think we often don't realize how important they are. His effort called heFFTe has really been moving things forward. Everyone seemed to be using FFTW, which is a 1-D FFT, and then kind of cheating to get to 3-D with three 1-D's instead of going full 3-D, and we now have a lot of new non-uniform FFT capabilities as well, so this is one to call out. And then finally, Ginkgo is something I didn't think we'd use so much, but we now have a need for a lot of fast on-node solvers: sparse direct, sparse iterative, and dense direct. Ginkgo is a fast on-node Krylov-based solver, and we have a lot of codes that are basically sending off fairly large matrices to, say, four GPUs at once, or farming out a lot of matrices to a lot of GPUs. In other words, I don't need to scale across the machine, but I've got hundreds or thousands of smaller matrices that can each fit on a GPU, and Ginkgo is a really great solution there.
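To show the shape of that workload, here is a plain C++ stand-in (not Ginkgo's API, and not ECP code): many small, independent dense systems, each solved on its own. In production these would be handed to a GPU batched solver rather than looped over on the CPU; the sizes and matrix values here are made up.

```cpp
// Stand-in for a "batched solve" workload: many small, independent
// dense systems A_b x_b = rhs_b. Here each is solved with naive
// Gaussian elimination; a library such as Ginkgo or a batched BLAS
// would dispatch the whole batch to one or more GPUs at once.
#include <cstdio>
#include <vector>

void solve_small(std::vector<double>& A, std::vector<double>& b, int n) {
  // In-place Gaussian elimination without pivoting (fine for this toy,
  // diagonally dominant system; real codes would pivot or iterate).
  for (int col = 0; col < n; ++col) {
    for (int row = col + 1; row < n; ++row) {
      const double f = A[row * n + col] / A[col * n + col];
      for (int j = col; j < n; ++j) A[row * n + j] -= f * A[col * n + j];
      b[row] -= f * b[col];
    }
  }
  for (int row = n - 1; row >= 0; --row) {   // back substitution
    for (int j = row + 1; j < n; ++j) b[row] -= A[row * n + j] * b[j];
    b[row] /= A[row * n + row];
  }
}

int main() {
  const int nbatch = 1000;   // number of small systems (made up)
  const int n = 32;          // each system is 32x32 (made up)
  for (int batch = 0; batch < nbatch; ++batch) {
    std::vector<double> A(n * n, 1.0);
    std::vector<double> b(n, 2.0 * n);
    for (int i = 0; i < n; ++i) A[i * n + i] = n + 1.0;  // diagonally dominant
    solve_small(A, b, n);    // solution is all ones for this toy system
  }
  std::printf("solved %d independent %dx%d systems\n", nbatch, n, n);
  return 0;
}
```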
So I will conclude by just noting that E4S has been out there for almost four years now; I recommend you go take a look. You can get it in containerized forms, it has about 100 products, and we release it every three months. Sameer Shende at the University of Oregon has endless energy; he's on point for packaging, and you can contact him directly. And frankly, the Spack build package has been a game changer for us; it's really been a major development from Todd Gamblin. We build all of our stuff with Spack, and it's been tremendous. So, I didn't talk much about the software stack today, but there's been tremendous work that's gone on there that I think this community would find useful. I have a few wrap-up slides with some takeaways and such, but I'll just go right to questions, because I probably blew through my time. Thanks for your time.

Moderator: All right, I think we have time for one question. I saw a hand here, yep.

Audience member: This exascale stuff is nice, but you mentioned zettascale, and you mentioned that we might need a thousand-times factor of power-efficiency improvement. Obviously that's extremely hard to do. What do you foresee being the major challenges there, and do you think that's even possible?

Kothe: Is it possible? Well, the thousand comes from a paper written a couple of years ago, applied to something like Frontier. To tell you the truth, I forget what power envelope they recommended, but it's probably less than 100 megawatts. Basically, the numbers I cited were not necessarily what's doable in the view of those authors; they're really what would be required to make it feasible. I'm not saying that zettascale is the only way to go after all of our science problems. What I am saying is that what we need to do as a community is ask: do we really need this, and if so, where and why? One argument, too, is that if I have an application that I feel good about in terms of its ability to model the right phenomena, and I can speed it up a thousandfold, that's kind of like a zettascale app running at exascale. Another way to think about it: I think the gains you can make in algorithms and models outstrip the hardware. We've shown that in ECP, where the hardware factor of roughly 50 came from going from Titan, at 20 petaflops, to Frontier at about a thousand petaflops.
And all of our apps are outstripping that hardware gain, so you can't just ride the hardware curve; that's not the point I'm making. The point is that as a community we need to ask: is there a game changer here, and is zettascale the only way? If so, let's understand that and see what needs to be done on the hardware side. You need super-low-power technology, whether it's neuromorphic or analog devices or whatever; I'm not a hardware person, but that's a major challenge.

Moderator: Thank you so much. I'll tell you one thing ChatGPT cannot do: manage these projects like you do, never ever. Thank you for all that you do for science and for us as a community. This is truly inspiring. Thank you.
2023-04-14 00:33