[Applause] Yeah, thanks for the introduction and welcome. I'll repeat a motivation from one of the talks earlier today. You all probably know this: why use multiple GPUs? Sometimes you just need to compute faster because you have to meet a quality-of-service requirement, like in weather prediction, where you should be faster than real time; otherwise, you're not predicting the future. Or you need to compute larger, higher-resolution models, such as larger LLMs or more detailed CFD models. As Matts showed this morning, the amount of data is going to explode, so you really need multi-GPU systems to process that data in a reasonable time. This talk is part of a three-part series presenting the multi-GPU programming landscape on these semi-scientific axes. On the x-axis, we have generality: how domain-specific is it, and how generally can I express my algorithms? On the y-axis, we have productivity. You can probably argue a lot about where exactly to put these dots; I think we found a placement where someone will always disagree and someone will always agree. At 8 a.m. this morning, Matts gave an introduction to high-level libraries that abstract most of the work away from you, so you get the most productive experience. If the operation you need is directly available there, it can also be extremely performant. Then, just before this, Manolus
talked about task-based runtimes that are used to build the higher-level runtimes Matts introduced. Now, I'm going down in productivity and up in generality, talking about advanced multi-GPU scaling with communication libraries: what's used underneath all of this to actually carry out the data movement needed when you program multiple GPUs. So, what will this talk be about? If you have multiple GPUs working cooperatively on one problem, at some point you need to exchange data so that they stay in sync and really solve the same problem, and you want to do this productively. That is what this is about: how CUDA-aware MPI, NCCL, and NVSHMEM help you orchestrate that data movement efficiently and compose it with GPU computation. All the models here are CUDA-aware, meaning, as you see from the arrows, they can work directly with GPU memory, so they can move data from GPU memory to GPU memory across the most efficient data path. With MPI,
two-sided communication looks like this: the rank that has the data, in this case rank zero, calls MPI_Send on the array in GPU memory, and the rank that accepts the data calls MPI_Recv. With NCCL, this looks very similar; it's also a send-receive, two-sided communication pattern. An important difference is the stream argument at the end: the operation is blocking with respect to the GPU stream it's executed in, but non-blocking with respect to the host. Right after these calls are dispatched, the CPU can do other work, schedule more kernels, and so on. Lastly, NVSHMEM, which, like NCCL, also has a host API; I'll go into the device API later. It also takes a stream, so it's non-blocking with respect to the CPU. You also see that there's only one side involved; it's a one-sided communication model. As I'll detail later, this is enabled by a symmetric heap.
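To make that comparison concrete, here is a minimal, hedged sketch of the same GPU-buffer exchange in all three models; the buffer, communicator, stream, and size names are illustrative placeholders, not the exact code from the slides.

```c
// CUDA-aware MPI: two-sided, blocking on the host. d_buf is a device pointer.
if (rank == 0)      MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
else if (rank == 1) MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// NCCL: two-sided, enqueued on a CUDA stream; blocking for the stream, non-blocking for the host.
ncclGroupStart();
if (rank == 0)      ncclSend(d_buf, N, ncclFloat, 1, nccl_comm, stream);
else if (rank == 1) ncclRecv(d_buf, N, ncclFloat, 0, nccl_comm, stream);
ncclGroupEnd();

// NVSHMEM host API: one-sided put on a stream; d_sym must come from the symmetric heap,
// so the same address is valid as source here and as destination on PE 1.
if (mype == 0) nvshmemx_float_put_on_stream(d_sym, d_sym, N, 1, stream);
```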
What do all these programming models have in common? Because they are multi-node, multi-GPU programming models, you need a bit of help launching them. It's not like you just launch your executable and it takes care of everything by itself; the typical setup is one process per GPU. Usually, you use a launcher. On a Slurm cluster, it's typically srun, but the programming models also come with launchers, like mpirun or nvshmemrun. These have a lot of options to map your processes correctly onto your resources; for simplicity, I'm just showing the most important one: how many instances do you want? If you launch using that launcher and provide your app with its arguments, it takes care of distributing and launching one instance of your application for every GPU. After the library is initialized, each instance gets back a rank in the case of MPI and NCCL, or a processing element (PE) ID, from zero to the number of processes minus one, so it can safely identify itself, know how many others there are, and then use that to orchestrate the communication.
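As a hedged sketch of that typical setup with MPI, one process per GPU, launched with something like mpirun -np 4 ./app or srun -n 4 ./app; the simple rank-to-device mapping is an illustrative simplification, in practice you would usually use the node-local rank:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // who am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many instances are there?

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices);      // map this process to "its" GPU

    // ... allocate GPU buffers, compute, and communicate using rank/size ...

    MPI_Finalize();
    return 0;
}
```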
Let's get started with the first programming model: MPI, which stands for Message Passing Interface. It's a long-established standard to exchange data between processes. For that, it defines an API for point-to-point communication. You've already seen the MPI send and receive, but it also has a rich set of collectives, like reduce, all-reduce, and so on. There are many
implementations with bindings for probably every language you care about, both open source and commercial. Let's step back a bit to the foundational technologies that either enable these libraries or speed up their functionality. The most important building block for CUDA-aware MPI is Unified Virtual Addressing (UVA), which we introduced way back with compute capability 2.0, not to be confused with Unified Memory. With UVA, any memory you allocate in a CUDA program gets a pointer that is part of a single system-wide virtual address space, so the virtual address of a pointer identifies its physical location. You can query the CUDA driver to find out where a pointer lives, and that allows a library like MPI to become CUDA-aware: you just pass in a pointer, and the library can query whether it's a CPU pointer or a GPU pointer, and on which GPU, and then take the right steps to efficiently move your data. That's the enabling technology. There are two others I want to touch on. One is part of the GPUDirect family: peer-to-peer transfers, which allow data to be moved directly from one GPU to another GPU in the same node via PCIe or, if available, NVLink, which you've probably heard about earlier and which is also multi-node capable. This gives you much higher bandwidth because you can use the high-bandwidth NVLink connection between GPUs and avoid the bottlenecks of intermediate staging in CPU memory and so on. Then there's GPUDirect RDMA, which allows third-party PCIe devices to read and write GPU memory directly. An InfiniBand network adapter, for example, can read data directly from GPU memory and send it over the network to the destination, avoiding any intermediate steps that could potentially harm performance. Let's take a look at a very simple example
with CUDA-aware MPI to see how this is put into action. This will help you contrast the approaches and reason better about performance. I'm looking at MPI rank zero doing an MPI_Send from a GPU buffer, and MPI rank one doing a matching MPI_Recv into a GPU buffer, with two GPUs in two different nodes. Depending on your MPI implementation, message size, and system setup, you might see something different, but I think this is still a very useful mental model for reasoning about performance and about what can potentially happen inside a communication library like CUDA-aware MPI. Let's start with the best case: we have GPUDirect RDMA support and a network that supports it.
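Before walking through the cases, here is a minimal, hedged sketch of the two extremes this mental model compares; buffer and size names are illustrative.

```c
// CUDA-aware MPI: hand the device pointer straight to MPI. The library picks the
// best available path (GPUDirect RDMA, pipelined internal staging, ...).
if (rank == 0)      MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
else if (rank == 1) MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// Without CUDA-aware MPI: stage manually through host memory; every step waits
// for the previous one, so nothing overlaps.
if (rank == 0) {
    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    MPI_Recv(h_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
}
```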
We just pass our device buffers into the send and receive calls, and the library moves the data directly over PCIe, the network, and so on. You can also see this when you profile a run with Nsight Systems to check what's really happening. The most relevant part here is what you're not seeing: there's no GPU activity and there are no CUDA API calls, because all the transfers are mastered by the network adapter. What you do see is activity on the PCIe bus, which is the network adapter reading data from GPU memory or writing data to GPU memory, depending on whether you're sending or receiving. This is very efficient. If, for whatever reason, GPUDirect RDMA can't be used, because of the node topology, system setup issues, or something else, you can still use device pointers, and MPI will take care of the necessary staging, which then potentially involves two buffers on the CPU. But it will nicely pipeline all of that, which you can also see in the Nsight Systems timeline: you have multiple chunks of device-to-host transfers, and the network transfers run pipelined with them, so you still get reasonable or good performance, depending on the exact constraints. To contrast that, what would you need to do without CUDA-aware MPI? You first need to call cudaMemcpy device-to-host, wait for that to finish, then call your CPU MPI, wait for that to finish, and then call cudaMemcpy again to copy host-to-device. Aside from having far more steps, this has a big downside: it stalls the pipeline after the cudaMemcpy and after MPI, which you can also see in the Nsight Systems timeline. You have one large copy, and then you need to wait for it, so there's no overlap at all between the network transfers and the host-device copies. If we take the OSU bandwidth benchmark we just looked at, run on a system that's available to scientists, the JUWELS Booster in Jülich, with the default sweep of message sizes, you can see how the different paths compare: CUDA-aware MPI with GPUDirect RDMA (the green line) saturates the network bandwidth of the HDR 200 Gbps InfiniBand network; CUDA-aware MPI without GPUDirect RDMA loses some performance, but it's still good; and staging manually through the host really harms your performance. So you want to use CUDA-aware MPI and make sure your system is configured correctly. If we go intra-node, where we can also use NVLink, the differences amplify significantly because the performance delta between PCIe and NVLink is much higher. PCIe matches
network bandwidth but not NVLink bandwidth, so there it provides even more benefit. Does this also translate to real application performance? Yes, it does. What I've taken here is an application use case: ICON, the ICOsahedral Nonhydrostatic model used for numerical weather prediction and climate simulations. It's developed in Germany and Switzerland and is used by the operational weather services in both countries and in multiple other places. For this benchmark, we use the QUBICC workload at a 10 km horizontal resolution with 190 vertical levels, which is a typical setup. From the profile screenshots, you can see that, compared to CPU MPI, CUDA-aware MPI avoids unpipelined transfers. Here you see the CPU MPI version: a device-to-host copy, then a long gap with no GPU activity, and then the host-to-device copy after MPI has completed. When we look at the CUDA-aware version, in this case a profile from a single-node run so that we can see the peer-to-peer transfers, the host staging copies are gone and you get peer-to-peer transfers instead. As you can see, the overall timeline is much denser with far fewer gaps, so the execution is more efficient. If we run a scaling experiment comparing the two, we see a 1.4x speedup at the scale that's really relevant in production, which is about 8 to 12 nodes. We've also done an academic exercise to scale out. At
32 nodes, we still see a benefit of 1.2x to 1.6x, and on the EOS supercomputer with DGX H100 nodes and an NDR 400 Gbps InfiniBand network, we see up to a 6x speedup. Let's switch gears and go to NCCL, the NVIDIA Collective Communications Library. It's a library for efficient communication between GPUs. It started
off with collectives, such as all-reduce, which are required for data-parallel deep learning. Since version 2.8, it also supports send and receive between GPUs, which allows you to express arbitrary communication patterns, including all-to-all and so on. Compared to MPI, this is a library that runs on GPUs and works with any GPU-accessible data. The communication calls are translated into GPU kernels running on streams, as I've already mentioned, which makes them blocking with respect to the GPU stream you submit them to but non-blocking with respect to the host. This lets you avoid some CPU-GPU synchronization and, especially for collectives, leverage the GPU's compute throughput to run collectives at maximum speed. If you're more interested in NCCL, I strongly recommend watching the recording of my colleague Sylvain Jeaugey's talk from Monday on the latest news on NCCL. One slide he showed had the collective communication bandwidth achieved with the latest version of NCCL across GPU generations. As you can see, they achieve really high bandwidths, especially when NVLink and SHARP can be used, both intra-node multi-GPU and multi-node. Aside from the high collective performance, there's the benefit of avoiding synchronizations, which I wanted to demonstrate with a source code example; there's a link at the bottom, and a larger one on the last slide in case you look at the slides. It's a slightly simplified version of an example code available on GitHub. On the left, you have the MPI version, where you see that you need to synchronize on the compute kernel: before you can issue the communication, the data that the compute kernel produces has to be complete. Compared to that, the NCCL version looks very similar regarding the send and receive, but you can avoid the cudaStreamSynchronize because the communication is submitted to the same stream as the compute kernel. So it doesn't start before the compute kernel has finished, but you can already enqueue the work, and the CPU can go off and do other work.
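A hedged sketch of the difference just described, simplified further from the GitHub example; the kernel name, sizes, and the peer variable are illustrative.

```c
// MPI version: the host has to wait for the compute kernel before calling MPI.
compute_kernel<<<grid, block, 0, stream>>>(d_out, d_in, n);
cudaStreamSynchronize(stream);                         // stall: data must be ready for MPI
MPI_Sendrecv(d_out, n, MPI_FLOAT, peer, 0,
             d_halo, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// NCCL version: communication is enqueued on the same stream, so it runs only
// after the kernel has finished, but the host does not have to wait.
compute_kernel<<<grid, block, 0, stream>>>(d_out, d_in, n);
ncclGroupStart();
ncclSend(d_out,  n, ncclFloat, peer, nccl_comm, stream);
ncclRecv(d_halo, n, ncclFloat, peer, nccl_comm, stream);
ncclGroupEnd();
// The CPU is free here to enqueue more kernels or the next iteration.
```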
This allows you to schedule work to the GPU more eagerly and generally makes the timeline denser. Here, I've also taken an NCCL application use case. Of course, there are many deep learning applications that use NCCL, but I wanted to pick something potentially new to this audience, which is VASP, the Vienna Ab initio Simulation Package. It's a very versatile package for ab initio calculations, such as first-principles molecular dynamics, used in materials research, among other things. It started off with CUDA-aware MPI but later added an NCCL communication option. This leverages multiple things: higher performance thanks to GPU-accelerated collectives, and the CUDA stream-aware interface, which minimizes CPU-GPU synchronization. As a follow-on, the asynchronous CPU execution made it easier to express communication-computation overlap, so in the NCCL version communication-computation overlap is implemented, but not in the MPI version. The comparison I'm going to share is therefore not 100% fair; MPI could be better if that work were implemented there, but this is a side effect of NCCL making it easy, at least in this code. Here, I'm comparing the CUDA-aware MPI version on the left side, where you see a lot of white space and everything is serial, with the NCCL version on the right side, where you see kernels in two rows: the NCCL communication kernels overlap with compute kernels, so the GPU is busy all the time.
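The pattern on the right corresponds roughly to the following hedged sketch of overlapping NCCL communication with compute on separate streams; it is a generic illustration, not VASP's actual code, and the chunking, stream, and event names are made up.

```c
// Compute chunk k on the compute stream while the all-reduce of the previous
// chunk runs on the communication stream; events order the two streams per chunk.
for (int k = 0; k < nchunks; ++k) {
    compute_chunk<<<grid, block, 0, compute_stream>>>(d_data, k);
    cudaEventRecord(chunk_ready[k], compute_stream);

    cudaStreamWaitEvent(comm_stream, chunk_ready[k], 0);
    ncclAllReduce(d_data + k * chunk_elems, d_data + k * chunk_elems, chunk_elems,
                  ncclFloat, ncclSum, nccl_comm, comm_stream);
}
```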
What I'm highlighting here is actually a reduction, which is why there's white space on the left side: the reduction is done on the CPU instead of with an efficient NCCL kernel. If we take a hafnium simulation with 216 atoms and scale it on EOS from 4 to 32 nodes, at the limit we see a 2.6x speedup by using NCCL instead of CUDA-aware MPI. At the bottom, there's a link to a blog post that provides more information on this use case and how it's done. Okay, then let's switch to NVSHMEM, which is an implementation of the OpenSHMEM standard and builds on the partitioned global address space model. The core principle is that all the arrays we use for communication are allocated from a symmetric heap. So if something needs to be communicated, it needs to be allocated with nvshmem_malloc (and freed with nvshmem_free) to be accessible. nvshmem_malloc makes sure that the memory is properly aligned, registered with the network adapter, and mapped for inter-process communication over NVLink and so on. This is what enables one-sided communication. It's a collective call, so you need to call it with the same size on every process, and that's why it's called a symmetric heap: for every allocation you get a same-sized chunk on each GPU, which has the benefit of avoiding communication just to exchange offsets. You know your peer needs this data at a specific offset, and you can compute that locally without asking your peer, unlike in a two-sided model, where you need to ask where to put the data. You can just go ahead and write the data directly where the peer will need it next. NVSHMEM has a CPU API in a blocking variant and a stream-aware variant, which I showed on the first slide. But very importantly, it also has kernel interfaces: get operations to read data, put operations to write data, atomics to synchronize, and some memory consistency APIs. A big benefit of the memory consistency APIs is that they give you exact control over where you need to wait for things to complete, which also limits the amount of synchronization needed between processes. Oh, one thing I missed mentioning: not just NCCL,
all these programming models are fully interoperable. You can use NCCL in an MPI program, and you can use NVSHMEM in an MPI program, in case you have a legacy code base and want to accelerate it. So, what's the thing with device-initiated communication? If we have a program with CPU-initiated communication, typically, we do the compute on the GPU, and the communication is issued from the CPU. So, you need to synchronize at the boundaries, as we've seen. You start your kernel, wait for that, then you do your put, wait for that to finish, and start the next iteration. This is a very commonly used model, similar to MPI. The problem is that the offload latencies to
launch the kernel and to synchronize are on the critical path, and you don't get overlap between communication and computation with that basic model. You can still express communication-computation overlap, and I have examples of that in the source code I'll link on the last slide, but it makes the code more complicated and adds a lot of API calls. Compared to that, GPU-initiated communication does both the compute and the communication on the GPU, so you can start sending out data as soon as it's computed. The benefit is
clearly that you launch fewer kernels and operations, so you pay fewer offload latencies. You also get natural communication-computation overlap through the same technique you use to hide global memory access latencies: just have multiple threads running on the same SM. And for some algorithms, it can simply be easier to express them with inline communication. What does that look like in code? The finest granularity is thread-level communication, where every thread does a single-element put. This is very fine-grained, so you get the maximum overlap between communication and computation through thread scheduling, and it's a very efficient mapping onto NVLink fabrics like we have on DGX systems. Instead of doing a load and a store, you just call nvshmem_float_p on the data you want to send, specifying the source value and the destination processing element you want to put it to. There are also APIs that let a warp or a block form larger messages, which is necessary for networks like InfiniBand that don't reach the message rates you can achieve over a memory fabric like NVLink. You still get overlap between warps and blocks, but you form larger messages so that the network operations can work more efficiently. This is one benefit, but what can really take your code to the next level is that kernel-initiated communication also allows you to avoid certain kernel boundaries. You can fuse kernels to keep more state in registers or shared memory and avoid whole round trips to global memory just to store data for a separate communication kernel and reload it afterward. This sometimes requires internal synchronization, which is also possible with device-initiated communication if you launch your kernel in the right way. You can even move the whole iteration loop into your kernel. In this case, it's a simple one; in your code, it's probably not the whole iteration loop but rather two kernels that you no longer need to launch separately because they have communication in between. Together, these techniques can significantly improve performance.
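As a hedged sketch of what device-initiated communication can look like (the indexing, the stand-in computation, and the buffer names are illustrative; d_sym is assumed to be allocated with nvshmem_malloc so the same symmetric address is valid on every PE):

```c
__global__ void compute_and_send(float *d_sym, int n, int peer) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float value = 2.0f * d_sym[i];   // stand-in for the real computation
        // Thread-level, single-element put: write the result directly into the
        // peer's copy of the symmetric buffer. Very efficient over NVLink.
        nvshmem_float_p(&d_sym[i], value, peer);
    }
    // For InfiniBand-class networks, warp- or block-scoped puts form larger
    // messages instead, e.g. nvshmemx_float_put_block(dest, src, nelems, peer),
    // called collectively by all threads of the block.
}
```

When such a fused kernel also needs to synchronize across PEs from device code, NVSHMEM requires it to be launched through its collective launch API, which is presumably what launching the kernel "in the right way" refers to above.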
Here, I also have an application use case from HPC: strong scaling the Wilson Dslash operator in QUDA, which stands for QCD on CUDA. It's an open-source library that provides building blocks for the necessary calculations in lattice QCD simulations on graphics processing units. It leverages device-initiated communication with NVSHMEM to overcome latency barriers when strong scaling. Most of the details haven't changed, so you can watch this GTC 2020 talk from my colleague for more details on what follows. What this does is overcome the latency barriers caused by the need to launch many operations. In QUDA, when you need to communicate, you first need to pack the data, communicate it, and then compute, and because you want to overlap communication and computation, you have separate kernels for the interior and the exterior. If you look at the profile of a single-node execution of QUDA with MPI, you see the packing kernel, the interior kernel, and the data copies. The thing is that there are gaps between them, caused simply by the fact that it takes too long to launch these small kernels. The communication takes longer than the interior kernel, and that is not because the communication itself is not fast enough; it just takes too long to launch the individual copies. If you would just
put all the peer-to-peer copies together, they would comfortably run faster than the interior kernel. Compared to that, with NVSHMEM, we can build a fully fused kernel. Not only do we launch just one kernel; the kernel launches can also run ahead for the next iteration. So while the current kernel is running, we have already launched and provided everything to the GPU for the next one, which avoids all the white space on the GPU and also translates into a big performance speedup. If we scale out, and this is a relatively busy slide, we look at EOS with DGX H100 nodes and a 400 Gbps NDR network. We are strong scaling the given problem size up to 512 GPUs, looking at three different precisions. Between MPI and NVSHMEM, the differences are largest at lower precision, because there we do less compute, so, relatively speaking, the communication is more important. If we zoom in, the further we scale out, the larger the performance difference becomes. Although it really starts to get questionable whether you want to throw more resources at this problem size, with NVSHMEM you still get more performance going from 256 to 512 GPUs, where you see an up to 1.7x speedup over MPI. You also see that MPI doesn't scale as well; the bump from 128 to 256 is strongly related to the need to launch multiple kernels, because at that point a partition change happens, which requires the MPI version to launch more boundary kernels. Okay, with that, I'm ready to conclude. There's a link where you can find the other CUDA developer sessions from this conference. Some are still to come, but most of them have already happened,
so you can watch the recordings. I want to put in a shameless plug for the Connect with the Experts (CWE) session later today at 3 p.m., where you can find me and other colleagues from the GPU computing team to answer your questions on MPI, NCCL, NVSHMEM, GPUDirect, and so on. I also want to advertise a talk with a very similar title that I presented last year: "Multi-GPU Programming for HPC and AI."
In that talk, I cover MPI, NCCL, and NVSHMEM, using the linked multi-GPU programming models source code as a "hello world" example. I go step by step through how it looks in source code, starting with a synchronous example, and how to express communication-computation overlap in different variants. You can find the example on GitHub and the recording from last year on NVIDIA On-Demand. Summarizing the different models for GPU communication: for MPI and NCCL, peer-to-peer and GPUDirect RDMA improve performance if they can be leveraged, but the libraries will also work without them because they can do staging internally. For NVSHMEM, they are required, because the one-sided model requires that mappings are set up so that one side can act without involving the source or destination of the data. NCCL and NVSHMEM are CUDA stream aware and can also work with CUDA graphs, so you can capture the communication in a graph to further reduce API overheads. Kernel-initiated communication is supported with NVSHMEM and is coming soon to NCCL, as was shared in the NCCL talk on Monday. The supported communication models are collectives (C), two-sided (2S), and one-sided (1S); you see that NCCL (soon) and NVSHMEM support all three. With that, I'm happy to take your questions. Way in the back? Thank you for the talk, nice talk. One clarification:
NVIDIA also has this package, DOCA GPUNetIO. Where would that fit in this table? DOCA GPUNetIO is a lower-level library used by frameworks like Aerial and so on, but not so much by HPC applications running on large clusters. It targets different use cases, such as edge computing, where you need to directly stream data to instruments. So it is similar to the GPU-initiated communication that was shown, but it's designed for a different use case, primarily edge computing. The most prominent example is AI/ML.
Thank you. Any other questions? One question: In Jensen's keynote, Dynamo was presented as an operating system for AI factories. Do you think the solutions you presented will be part of that Dynamo solution, to make this automatic in the future? On the inference side, too, any multi-GPU application needs to communicate data to use the GPUs together. These are the building blocks that we build upon to create higher-level solutions. I just wanted to understand how you can use MPI and NCCL together. For example, I saw a use case in
LLM inference where you have prefill and decoding. You have groups of GPUs doing something, and then other groups doing the decoding part. I saw some code where MPI is used with NCCL. I just wanted to know how to start with the basic code that runs and then keep optimizing. Should I just use NCCL purely, or how do you make that kind of call? The primary reason to combine MPI and NCCL is that you already have an MPI code, and you can use NCCL in the performance-critical part. The
interoperability is important. If you already have MPI, you don't need to rewrite your whole code in NCCL. You just apply NCCL where it provides the most value, and the non-performance-critical parts you don't need to touch. So, if you're writing from scratch and your data is on the GPU all the time, you can start with NCCL. If you have mixed CPU-GPU communication,
it might make sense to look at MPI as well. Generally, if NCCL provides what you need and your data is on the GPU, just start with NCCL. From the communications libraries' standpoint, are there concerns using them on a network that uses IPv6 addressing versus IPv4? Is the intelligence or differentiation between IPv4 and IPv6 part of the communication libraries, or is that handled by the transport layer underneath? That's sufficiently abstracted away so that it's not visible. What's very relevant for performance is if you're running on Ethernet and that Ethernet supports RoCE (RDMA over Converged Ethernet). But I'm not that low-level person.
I think that's orthogonal to IPv6 and IPv4. The libraries are totally abstracted from that; I've never heard that it plays a role. If you have cases from other colleagues or people who have used these libraries over an Ethernet network supporting RoCE but with an IPv6 addressing scheme, please come to the Connect with the Experts session. Maybe we can help you there. Sure, thanks. A quick question on footnote one: Do you have any sense of whether MPI is
going to add this extra parameter to their send and receive to make them stream-aware? The person running that working group will be at the CWE, so I think it's better to discuss it there. Thanks. Is there a native NCCL launcher? Everything I've done has been mpirun applications using NCCL as the communication framework, but can you do NCCL without mpirun completely? You need something to bootstrap the NCCL communicator, but it doesn't need to be mpirun. You can also use Slurm and other mechanisms. NCCL does not have its own launcher. No, thank you. Yeah, a bit of a naive question: Is there a CUDA-aware MPI for Windows? I'm not sure. I'm sorry; I can't confirm or deny because I haven't done the research on that. Thanks for the great talk. On Monday, I attended the very great NCCL talk, and Sylvain mentioned a
new feature called symmetric memory. I'm just wondering what the difference is between that technique and the NVSHMEM design. At a very high abstraction level, NCCL's symmetric memory is exactly the same concept as NVSHMEM's symmetric heap.
There are some differences in how it's implemented, because the way we are planning to build the NCCL device-initiated communication is clearly focused on the needs of deep learning applications. There are different trade-offs to make compared to NVSHMEM, but in the mental model as a developer, symmetric memory is exactly the same: it will be one-sided communication with put and get operations. The APIs will look different, and you'll need to do some things differently, but at a sufficiently high abstraction layer, you can think of them as equivalent. And do you know if there are any side effects when using NVSHMEM or the symmetric memory feature at a very large compute scale, like when expanding to a lot of GPUs or CPUs? Previously, when we used Open MPI and tried to expand to a lot of GPUs, there were some issues. I think it's best if you come to the CWE, and then we can look into that in detail. It's a bit
difficult to answer here. Thank you. Thank you very much, everyone. That was the last question, but you can talk to Jiri outside of the room. We have to vacate the room for the next session. We really appreciate your attendance and thank you very much.