Advanced Multi-GPU Scaling: Communication Libraries | NVIDIA GTC 2025


[Applause] Yeah, thanks for the introduction, and welcome. I'll repeat a motivation from one of the talks earlier today. You all probably know this: why use multiple GPUs? Sometimes you just need to compute faster because you need to meet a quality-of-service requirement, like in weather prediction, where you should be faster than real time; otherwise, you're not predicting the future. Or you need to compute larger, higher-resolution models, such as larger LLMs or more detailed CFD models. As Matts showed this morning, the amount of data is going to explode, so you really need multi-GPU systems to put that data into reasonable action.

This talk is part of a three-part series presenting the multi-GPU programming landscape on these semi-scientific axes. On the x-axis, we have generality: how domain-specific is it, and how generally can I express my algorithms? On the y-axis, we have productivity. You can probably argue a lot about where exactly to put

these dots. I think we found a placement that someone will always disagree with and someone will always agree with. At 8 a.m. this morning, Matts gave an introduction to high-level libraries that abstract most of the work away from you, so you have the most productive experience. If your operation is directly available there, it can also be extremely performant. Then, just before this, Manolus

talked about task-based runtimes that are used to build the higher-level runtimes Matts introduced. Now, I'm going down in productivity and up in generality, talking about advanced multi-GPU scaling with communication libraries: what's used underneath all of this to actually carry out the data movement needed when you program multiple GPUs.

So, what will this talk be about? If you have multiple GPUs working cooperatively on one problem, at some point you need to exchange data so that they stay in sync and really solve the same problem, and you want to do this productively. That is what this is about: how can CUDA-aware MPI, NCCL, and NVSHMEM help you efficiently orchestrate that and compose it with GPU computation?

All the models here are CUDA-aware, meaning, as you see from the arrows, they can directly work with GPU memory, so they can directly move data from GPU memory to GPU memory across the most efficient data path. With MPI,

the two-sided communication part of the API looks like classic MPI: the rank that has the data, in this case rank zero, calls MPI send on the array in GPU memory, and the GPU that accepts the data calls receive. With NCCL, this looks very similar; it's also a send-receive, two-sided communication pattern. An important difference here is the stream argument passed last: the operation is blocking with respect to the GPU stream in which it's executed, but non-blocking with respect to the host. Right after these calls are dispatched, the CPU can do other

work, schedule more kernels, and so on.

Lastly, NVSHMEM, which, unlike NCCL, also has a host API; I'll go into the device API later. It also takes a stream, so it's non-blocking with respect to the CPU. You also see that there's only one side to the call; it's a one-sided communication model. As I'll detail later, this is enabled by a symmetric heap.
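To make the three host-side flavors concrete, here is a minimal, hedged sketch; the buffer names, the element count n, the peer IDs, the communicator, and the stream are made up for illustration, and d_sym is assumed to be a symmetric NVSHMEM allocation:

    // CUDA-aware MPI: device pointers passed straight to MPI (two-sided, blocking on the host)
    if (rank == 0) MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else           MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // NCCL: same two-sided pattern, but enqueued on a CUDA stream (non-blocking on the host)
    if (rank == 0) ncclSend(d_buf, n, ncclFloat, 1, comm, stream);
    else           ncclRecv(d_buf, n, ncclFloat, 0, comm, stream);

    // NVSHMEM host API: one-sided, stream-ordered put into PE 1's symmetric heap
    if (mype == 0) nvshmemx_float_put_on_stream(d_sym, d_sym, n, 1, stream);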

What do all these programming models have in common? Because they are multi-node, multi-GPU programming models, you need a bit of help launching them. It's not like you just launch your executable and it handles everything needed to start one process per GPU, which is the typical setup. Usually, you use a launcher. On a Slurm cluster, it's typically srun, but the programming models also come with launchers, like mpirun or nvshmemrun. These have a lot of options to map your processes correctly onto your resources. For simplicity, I'm just showing the most important one: how many instances to start. If you launch using such a launcher and provide your app with its arguments, it takes care of distributing and launching one instance of your application for every GPU. After the library is initialized, each instance gets back a rank in the case of MPI and NCCL, or a processing element (PE) ID in the case of NVSHMEM, from zero to the number of processes minus one, so the instances can safely identify themselves, know how many others there are, and then use that to orchestrate their communication.
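A minimal sketch of that pattern; the four-process launch line and the gpus_per_node helper are illustrative assumptions:

    $ mpirun -np 4 ./app            # the launcher starts four instances, e.g., one per GPU

    // inside the application, MPI flavor:
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // my ID, 0 .. size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many instances there are
    cudaSetDevice(rank % gpus_per_node);    // bind each rank to its own GPU

    // NVSHMEM flavor: nvshmem_init(); then nvshmem_my_pe() / nvshmem_n_pes()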

Let's get started with the first programming  model: MPI, which stands for Message Passing   Interface. It's a long-established standard to  exchange data between processes. For that, it   defines an API for point-to-point communication.  You've already seen the MPI send and receive,   but it also has a rich set of collectives, like  reduce, all-reduce, and so on. There are many  

implementations, with bindings for probably every language you care about, both open source and commercial.

Let's step back a bit to the foundational technologies that either enable all of these libraries, are required for their functionality, or speed them up. The most important fundamental building block for CUDA-aware MPI is Unified Virtual Addressing (UVA), which we introduced way back with compute capability 2.0. What this does is: when you allocate any memory in a CUDA program, the pointer is part of one system-wide virtual address space (not to be confused with Unified Memory). This means you can identify the physical location of a pointer from its virtual address. You can query the CUDA driver to find out where a pointer lives, and that allows a library like MPI to become CUDA-aware. You can just pass in a pointer, and the library can query whether

it's a CPU pointer or a GPU pointer, and on which GPU, and then take the right steps to efficiently move your data. That's the enabling technology.
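A hedged sketch of what such a query looks like at the CUDA runtime level; this shows the generic mechanism, not MPI's actual internal code:

    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, ptr);    // works on any pointer thanks to UVA
    if (attr.type == cudaMemoryTypeDevice) {
        // device memory: take the GPU-aware path; attr.device says which GPU
    } else {
        // host (or unregistered) memory: take the plain CPU path
    }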

There are two others I want to touch on. One is part of the GPUDirect family: peer-to-peer transfers, which allow data to be moved directly from one GPU to another GPU in the same node via PCIe or, if available, NVLink, which you've probably heard about earlier. NVLink is also multi-node capable. It gives you much higher bandwidth because you can use the high-bandwidth NVLink connection between GPUs and avoid any bottlenecks from intermediate staging in CPU memory and so on. Then there's GPUDirect RDMA, which allows third-party PCIe devices to directly read and write GPU memory. In the case of an InfiniBand network adapter, it can directly read data from GPU memory and send it over the network to the destination, avoiding any intermediate steps that could potentially harm performance.

Let's take a look at a very simple example with CUDA-aware MPI to see how this is put into action. This will help you contrast and better reason about performance. I'm looking at MPI rank zero doing an MPI send from a GPU buffer, and MPI rank one doing an MPI receive, matching that send, into a GPU buffer. This is looking at two GPUs on two nodes. Depending on your MPI implementation, message size, and system setup, you might see something different, but I think this is still a very useful mental model to reason about performance and what can potentially happen inside a communication library like CUDA-aware MPI.

Let's start with the best case: we have support for GPUDirect RDMA and a network that supports it.

We just pass our device buffers into the send and receive calls, and the library will directly move data over PCIe and the network. You can also see this when you run it under Nsight Systems and profile it to check what's really happening. The most relevant part here is what you're not seeing: you're not seeing any GPU activity, and you're not seeing any CUDA API calls. This is because all the transfers are mastered by the network adapter. What you do see, however, is activity on the PCIe bus, which is the network adapter reading data from GPU memory or writing data to GPU memory, depending on whether you're sending or receiving. This is very efficient.

If, for whatever reason, GPUDirect RDMA can't be used, because of the node topology, system setup issues, or something else, you can still use device pointers, and MPI will take care of the necessary staging, which then potentially involves two buffers on the CPU. But it will

all nicely pipeline that, which you can also see in the Nsight Systems timeline. You have multiple chunks of device-to-host transfers, and the network transfers run pipelined with them, so you still get reasonable or good performance, depending on the exact constraints.

To contrast that, what would you need to do if you don't have CUDA-aware MPI? You first need to call cudaMemcpy device-to-host, wait for that to finish, then call your CPU MPI, wait for that to finish, and then call cudaMemcpy again to copy host-to-device. Aside from the fact that this has far more steps, it also has a big downside: it stalls the pipeline after the cudaMemcpy and after MPI, which you can also see in the Nsight Systems timeline. You have one large copy, and then you need to wait for it, so there's no overlap at all between the network transfers and the host-to-device copies.
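For illustration, a hedged sketch of the two patterns; buffer names are made up, error checking is omitted, and a real code would overlap chunks rather than doing one big copy:

    // without CUDA-aware MPI: explicit staging through host memory
    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);     // stalls the pipeline
    MPI_Sendrecv(h_send, n, MPI_FLOAT, peer, 0,
                 h_recv, n, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);               // plain CPU MPI
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);     // stalls again

    // with CUDA-aware MPI: pass device pointers; the library pipelines the
    // staging internally or uses GPUDirect RDMA if available
    MPI_Sendrecv(d_send, n, MPI_FLOAT, peer, 0,
                 d_recv, n, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);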

If we take the bandwidth benchmark we just looked at, run for some comparisons on a system that's available to scientists, the JUWELS Booster in Jülich, with a default sweep of message sizes, you can see that, compared to CPU MPI, CUDA-aware MPI with GPUDirect RDMA (the green line) saturates the network bandwidth of the HDR 200 Gbps InfiniBand network. With CUDA-aware MPI without GPUDirect RDMA, you lose some performance, but it's still good. With manual staging, you're really harming your performance, so you want to use CUDA-aware MPI and make sure your system is configured correctly.

If we go intra-node, where we can also use NVLink, the differences amplify significantly because the performance delta between PCIe and NVLink is much higher. PCIe matches

network bandwidth but not NVLink bandwidth, so there, CUDA-awareness provides even more benefit.

Does this also translate to real application performance? Yes, it does. What I've taken here is an application use case: ICON, the ICOsahedral Nonhydrostatic model used for numerical weather prediction and climate simulations. It's developed in Germany and Switzerland and is used in operational weather services in both countries, and in multiple other places. For

this benchmark, we use the QUBICC workload at a 10 km horizontal resolution with 190 vertical levels, which is a typical setup. From the screenshots, you can see that, compared to CPU MPI, CUDA-aware MPI avoids unpipelined transfers. Here, you see the CPU MPI version: the device-to-host copy, then a long gap with no GPU activity while MPI completes, and then the host-to-device copy.

When we look at the CUDA-aware version, in this case a profile of a single-node run so that we can see the peer-to-peer transfers, you see that the host-to-device copies are gone and you get peer-to-peer transfers instead. As you can see, the overall timeline is much denser, with far fewer gaps, so execution is more efficient.

If we run a scaling experiment comparing these two, we see a 1.4x speedup at the scale really relevant for production, which is about 8 to 12 nodes. We've also done an academic exercise to scale out. At

32 nodes, we even see a benefit of 1.2x to 1.6x, and on the EOS supercomputer, with DGX H100 nodes and an NDR 400 Gbps network, we see up to a 6x speedup.

Let's switch gears and go to NCCL, the NVIDIA Collective Communications Library. It's a library for efficient communication between GPUs. It started

off with collectives, such as all-reduce, which are required for data-parallel deep learning. Since version 2.8, it also supports send and receive between GPUs, which allows you to express arbitrary communication patterns, including all-to-all and so on. Compared to MPI, this is a library running on GPUs, working with any GPU-accessible data. The communication calls are translated to GPU kernels running on streams, as I've already mentioned, which makes them blocking with respect to the GPU stream you've submitted them to but non-blocking with respect to the host. This enables you to avoid some CPU-GPU synchronization and, especially for collectives, leverage the GPU's compute throughput to run collectives at maximum speed. If you're more interested in NCCL,

I would strongly recommend watching the recording of my colleague Sylvain Jeaugey's talk on the latest news on NCCL, which he presented on Monday. One slide he had showed the collective communication bandwidth achieved with the latest version of NCCL across GPU generations. As you see, they achieve really high bandwidths, especially when NVLink and SHARP can be used, both single-node multi-GPU and multi-node.

Aside from the high collective performance and the benefit of avoiding synchronizations, I wanted to demonstrate the latter with a source code example. I'll have a link on the last slide, just in case you look at the slides; it's

on the bottom. It's a slightly simplified version of an example code available on GitHub. On the left, you have the MPI version, where you see that you need to synchronize on the compute kernel: before you can start issuing the communication for the data the compute kernel has produced, the kernel must have finished. Compared to that, NCCL looks very similar regarding the send and receive, but you can avoid the cudaStreamSynchronize because the communication is submitted to the same stream as the compute kernel. It doesn't start before the compute kernel has finished, but you can already enqueue the work, and the CPU can go on to do other work.
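In spirit, the difference looks like this; a hedged sketch in the style of that GitHub example, with made-up kernel, buffer, and neighbor names:

    // MPI version: the host must wait for the kernel before handing buffers to MPI
    compute_kernel<<<grid, block, 0, stream>>>(d_u, n);
    cudaStreamSynchronize(stream);                       // CPU stalls here
    MPI_Sendrecv(d_top, m, MPI_FLOAT, top, 0,
                 d_bot, m, MPI_FLOAT, bot, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // NCCL version: communication is enqueued behind the kernel on the same stream
    compute_kernel<<<grid, block, 0, stream>>>(d_u, n);
    ncclGroupStart();                                    // group send+recv to avoid deadlock
    ncclSend(d_top, m, ncclFloat, top, comm, stream);
    ncclRecv(d_bot, m, ncclFloat, bot, comm, stream);
    ncclGroupEnd();                                      // CPU continues immediately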

This allows you to schedule work to the GPU more eagerly and makes the timeline generally denser.

Here, I've also brought an NCCL application use case. Of course, there are many deep learning applications that use NCCL, but I wanted to pick something potentially new to this audience: VASP, the Vienna Ab initio Simulation Package. It's a very versatile package for ab initio calculations, such as first-principles molecular dynamics, used in materials research, among other things. It started off with CUDA-aware MPI but later added an NCCL communication option. This leverages multiple things:

higher performance because of GPU-accelerated collectives; the CUDA stream-aware interface, which minimizes CPU-GPU synchronization; and, as a follow-on, the asynchronous CPU execution, which made it easier to express communication-computation overlap. Therefore, communication-computation overlap is implemented in the NCCL version but not in the MPI version. The comparison I'm going to share is thus not 100% fair; MPI could do better if the overlap were implemented there too, but that it wasn't is a side effect of NCCL making overlap easy, at least in this code. Here, I'm comparing the CUDA-aware MPI version on the left side, where you see a lot of white space and everything is serial, with the NCCL version on the right side, where you see kernels in two rows: the NCCL communication kernels overlap with compute kernels, so the GPU is busy all the time.

What I'm highlighting here is a reduction kernel; the corresponding white space on the left side exists because there the reduction is done on the CPU rather than with an efficient NCCL kernel. If we take a hafnium simulation with 216 atoms and scale it on EOS from 4 to 32 nodes, at the largest scale we see a 2.6x speedup from using NCCL instead of CUDA-aware MPI. At the bottom, there's a link to a blog post with more information on this use case and how it's done.

Okay, then let's switch to NVSHMEM, which is an implementation of the OpenSHMEM standard and builds on a partitioned global address space. The core principle is that all the arrays we communicate with are allocated from a symmetric heap: if something needs to be communicated, it needs to be allocated with nvshmem_malloc to be accessible (and is later released with nvshmem_free). nvshmem_malloc

makes sure that the memory is properly aligned and registered with the network adapter and for inter-process communication over NVLink and so on. This is what enables one-sided communication. It's a collective call, so you need to call it with the same size on every process; that's why it's called a symmetric heap. If you do an allocation, you get a same-sized allocation, the same-sized chunk, on each GPU, which has the benefit of avoiding extra communication to exchange offsets and so on. You just know that your peer needs this data at a specific offset, and you can do that computation locally without asking your peer, unlike in a two-sided model, where you need to ask where to put the data. You can just go ahead and write the data directly where the peer will need it next.
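A hedged sketch of the allocation pattern; names, sizes, and the peer are invented for illustration:

    // collective: every PE allocates the same size at the same symmetric-heap offset
    float *d_halo = (float *) nvshmem_malloc(n * sizeof(float));

    // because the layout is symmetric, a local pointer plus offset is also valid
    // as the remote destination on any peer PE
    nvshmemx_float_put_on_stream(d_halo + off, d_src, m, peer, stream);

    nvshmem_free(d_halo);   // collective as well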

It has a CPU API in a blocking variant and a stream-aware variant, which I showed on the first slide. But, very importantly, it also has kernel-side interfaces to read data with get operations, write data with put operations, atomics to synchronize, and some memory consistency APIs. A big benefit of these memory consistency APIs is that they give you exactly the control over where you need to wait for things to complete, which also limits the amount of synchronization needed between processes.
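For instance, a hedged sketch; the surrounding device code, the flag variable, and the peer are assumed:

    // in a CUDA kernel, after issuing one-sided puts:
    nvshmem_fence();                        // ordering: earlier puts to a PE arrive before later ones
    nvshmem_quiet();                        // completion: wait until all outstanding puts are done
    nvshmem_int_atomic_set(flag, 1, peer);  // only now signal the peer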

Oh, and I missed mentioning this when talking about NCCL: all these programming models are fully interoperable. You can use NCCL in an MPI program, and you can use NVSHMEM in an MPI program, in case you have a legacy code base and want to accelerate it.

So, what's the deal with device-initiated communication? If we have a program with CPU-initiated communication, typically we do the compute on the GPU, and the communication is issued from the CPU. So, you need to synchronize at the boundaries, as we've seen: you start your kernel, wait for it, then you do your put, wait for that to finish, and start the next iteration. This is a very commonly used model, similar to MPI. The problem is that the offload latencies to

launch the kernel and to synchronize are on the critical path, and you don't get overlap between communication and computation with that model. You can still express communication-computation overlap, and I have examples of that in the source code I will link on the last slide, but it makes the code more complicated, and you have a lot of added API calls.

Compared to that, GPU-initiated communication does both the compute and the communication on the GPU. So, you can start sending out data as soon as it's computed. The benefit is

clearly that you need to launch fewer kernels and operations, so fewer offload latencies. You get natural communication-computation overlap by the same technique you use to hide your global memory access latencies: just have multiple threads running on the same SM. And for some algorithms, it can be easier to express them with inline communication.

What does that look like in code? With thread-level communication, every thread just does a single-element put. This is very fine-grained, so you get the maximum overlap between communication and computation. It's a very efficient mapping on NVLink fabrics like we

have them on DGX systems. Instead of doing a load or store, you just call nvshmem_float_p on the data you want to send, specifying the source value and the destination processing element where you want to put it. There are also APIs to form larger messages at warp or block granularity, which are necessary for networks like InfiniBand that don't achieve message rates as high as a memory fabric like NVLink. You still get overlap between warps and blocks, but you form larger messages so that the network can operate more efficiently.
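A hedged sketch of the thread-level variant; the kernel name and parameters are made up, and d_sym is assumed to be a symmetric allocation:

    #include <nvshmem.h>

    // every thread pushes one element into the neighbor PE's symmetric buffer
    __global__ void push_halo(float *d_sym, const float *d_local, int n, int peer) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            nvshmem_float_p(d_sym + i, d_local[i], peer);   // one-sided, single-element put
    }

For the coarser granularities, the same idea might use, for example, nvshmemx_float_put_block, so that a whole thread block issues one larger put.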

This is one benefit, but what can really take your code to the next level is that kernel-initiated communication also allows you to avoid certain kernel boundaries. You can fuse kernels to keep more state in registers or shared memory and avoid whole round trips to global memory, where you would otherwise store data just so a separate communication kernel can reload it. This sometimes requires grid-wide synchronization inside the kernel, which is also possible with device-initiated communication if you launch your kernel in the right way. You can really move the whole iteration loop into your kernel.
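Roughly like this; a heavily simplified, hedged sketch in which the compute line is just a stand-in, and the kernel is assumed to be launched with nvshmemx_collective_launch so that grid-wide synchronization is legal:

    #include <nvshmem.h>
    #include <nvshmemx.h>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void fused_iterations(float *d_sym, int n, int peer, int iters) {
        cg::grid_group grid = cg::this_grid();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (int it = 0; it < iters; ++it) {
            if (i < n) {
                float v = 0.5f * d_sym[i];               // stand-in for the real compute
                nvshmem_float_p(d_sym + i, v, peer);     // push the result to the neighbor
            }
            grid.sync();                                 // all local threads reach this point
            if (i == 0) nvshmem_barrier_all();           // one thread runs the cross-PE barrier
            grid.sync();                                 // everyone waits for the barrier
        }
    }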

In this case, it's a simple one, but in your case, it's probably not the whole iteration loop but rather two kernels that you no longer need to launch separately because they have communication in between.

Together, these techniques can significantly improve performance. Here, I also have an application use case from HPC. We are looking at strong scaling the Wilson Dslash operator from QUDA,

which stands for "QCD on CUDA." QUDA is an open-source library providing building blocks for the necessary calculations in lattice QCD simulations on graphics processing units. It leverages device-initiated communication via NVSHMEM to overcome latency barriers when strong scaling. Most of the details are still stable, so you can refer to the GTC 2020 talk from my colleague for more details on what follows.

What this does is overcome the latency barriers

caused by the need to launch many operations. In QUDA, when you need to communicate, you first need to pack the data, then communicate it, and then compute, because you want to overlap communication and computation: you have a separate kernel for the interior and for the exterior. If you look at the profile of a single-node execution of QUDA with MPI, you see the packing kernel, the interior kernel, and the data copies. The thing is that there are gaps between them, caused simply by how long it takes to launch these small kernels. The communication takes longer than the interior kernel, and that is not because the communication itself is not fast enough; it just takes too long to launch the individual copies. If you would just

put all the peer-to-peer copies together, they would comfortably run faster than the interior kernel.

Compared to that, with NVSHMEM, we can build a fully fused kernel. And we don't just launch one kernel; the launching of kernels can also run ahead to the next iteration. So, while the current kernel is running, we have already launched and provided everything to the GPU for the next one, which avoids all the white space on the GPU and translates to a huge performance speedup. If we scale out (this is a relatively busy slide), we look at

EOS: DGX H100 nodes with 400 Gbps NDR InfiniBand. We are strong scaling the given problem size up to 512 GPUs, looking at three different precisions. Between MPI and NVSHMEM, the differences are largest at lower precision, because there we do less compute, so, relatively speaking, the communication is more important. If we zoom in: the further we scale out, the larger the performance difference. So, with NVSHMEM, although it really starts to get questionable whether you

want to throw more resources at the problem, you can still get more performance going from 256 to 512 GPUs, where you see an up to 1.7x speedup over MPI. You also see that MPI doesn't scale as well. The bump from 128 to 256 is strongly related to the need to launch multiple kernels: at that point, a partition change happens, which requires the MPI version to launch more boundary kernels.

Okay, with that, I'm ready to conclude. There's a link where you can find the other CUDA developer sessions from this conference. Some are still to come, but most have already happened,

so you can watch the recordings. I want to put in a shameless plug for the Connect with the Experts (CWE) session later today at 3 p.m., where you can find me and other colleagues from the GPU computing team to answer your questions on MPI, NCCL, NVSHMEM, GPUDirect, and so on. I also want to advertise a talk with a very similar title that I presented last year: "Multi-GPU Programming for HPC and AI."

In that talk, I cover MPI, NCCL, and NVSHMEM, using the linked multi-GPU programming models source code as a "hello world" example. I go step by step through how it looks in source code, starting with a synchronous example and showing how to express communication-computation overlap in different variants. You can find the example on GitHub and the recording from last year on NVIDIA On-Demand.

Summarizing the different models for GPU communication: peer-to-peer and RDMA improve performance for MPI and NCCL if they are leveraged, but those libraries will also work where these features are not available, because they can internally fall back to staging. For NVSHMEM, they are required, because the one-sided model needs the mappings to be set up so that a PE can really do something without involving the source or destination of the data. NCCL and NVSHMEM are CUDA stream aware and can also work with CUDA graphs, so you can capture the communication in a graph to further reduce API overheads. Kernel-initiated communication is supported by NVSHMEM and is coming soon to NCCL, as was shared in the NCCL talk on Monday.
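As an illustration of the graph-capture point, a hedged sketch; the all-reduce stands in for whatever communication and kernels your iteration enqueues:

    // capture the stream-ordered work into a CUDA graph once...
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    ncclAllReduce(d_in, d_out, n, ncclFloat, ncclSum, comm, stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12-style signature

    // ...then replay it with a single API call per iteration
    for (int it = 0; it < iters; ++it)
        cudaGraphLaunch(exec, stream);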

The supported communication models are collectives (C), two-sided (2S), and one-sided (1S). You see that NCCL (soon) and NVSHMEM support all three.

With that, I'm happy to take your questions.

Way in the back: Thank you for the talk. Nice talk. One clarification:

NVIDIA also has this package, DOCA GPUNetIO. Where would that fit in this table?

DOCA GPUNetIO is a lower-level library that's used by frameworks like Aerial and so on, but not so much by HPC applications running on large clusters. It targets different use cases, such as edge computing, where you need to stream data directly to instruments. So, it's similar to the GPU-initiated communication that I showed, but it's designed for a different use case, primarily edge computing. The most prominent example is AI/ML.

Thank you. Any other questions?

One question: In Jensen's keynote, regarding Dynamo as the operating system, do you think the solutions you presented will be part of that Dynamo solution, making this automatic in the future?

On the inference side, any multi-GPU application needs to communicate data to use the GPUs together. These are building blocks that we build upon to create higher-level solutions.

I just wanted to understand how you can use MPI and NCCL together. For example, I saw a use case in

LLM inference where you have prefill and decoding: you have groups of GPUs doing one part, and then other groups doing the decoding part. I saw some code where MPI is used together with NCCL. I just wanted to know how to start with basic code that runs and then keep optimizing. Should I just use NCCL purely, or how do you make that kind of call?

The primary reason to combine MPI and NCCL is that you already have an MPI code, and you can use NCCL in the performance-critical part. The

interoperability is important. If you already have MPI, you don't need to rewrite your whole code in NCCL; you just apply NCCL where it provides the most value, and you don't need to touch the non-performance-critical parts. So, if you're writing from scratch and your data is on the GPU all the time, you can start with NCCL. If you have mixed CPU-GPU communication,

it might make sense to look at MPI as well. Generally, if NCCL provides what you need and your data is on the GPU, just start with NCCL.

From the communication libraries' standpoint, are there concerns using them on a network that uses IPv6 addressing versus IPv4? Is the intelligence or differentiation between IPv4 and IPv6 part of the communication libraries, or is that handled by the transport layer underneath?

That's sufficiently abstracted away that it's not visible. What's very relevant for performance is whether you're running on Ethernet and whether that Ethernet supports RoCE (RDMA over Converged Ethernet). But I'm not the right low-level person for this;

I think that's orthogonal to IPv6 versus IPv4. The libraries are totally abstracted from that; I've never heard that it plays a role. If you have cases, from other colleagues or people, where these libraries were used over an Ethernet network supporting RoCE but with an IPv6 addressing scheme, please come to the Connect with the Experts session. Maybe we can help you there.

Sure, thanks. A quick question on footnote one: Do you have any sense of whether MPI is

going to add this extra parameter to send and receive to make them stream-aware?

The person running that working group will be at the CWE, so I think it's better to discuss it there.

Thanks. Is there a native NCCL launcher? Everything I've done has been mpirun-launched applications using NCCL as the communication framework, but can you do NCCL completely without mpirun?

You need something to bootstrap the NCCL communicator, but it doesn't need to be mpirun. You can also use Slurm and other communication libraries. NCCL does not have its own launcher.

No, thank you.

Yeah, a bit of a naive question: Is there a CUDA-aware MPI for Windows?

I'm not sure. I'm sorry; I can't confirm or deny, because I haven't done the research on that.

Thanks for the great talk. On Monday, I attended the very great NCCL talk, and Sylvain mentioned a

new feature called symmetric memory. I'm just wondering what the difference is between that technique and the NVSHMEM design.

At a very high abstraction level, NCCL's symmetric memory is exactly the same concept as NVSHMEM's symmetric heap.

There are some differences in how it's implemented, because the way we are planning to build the NCCL device-initiated communication is clearly focused on the needs of deep learning applications. There are different trade-offs to make compared to NVSHMEM, but from the mental model of a developer, symmetric memory is exactly the same: it will be one-sided communication with put and get operations. The APIs will look different, and you'll need to do some things differently, but at a sufficiently high abstraction layer, you can think of them as equivalent.

And do you know if there are any side effects when using NVSHMEM or the symmetric memory feature at very large compute scale, like when expanding to a lot of GPUs or CPUs? Previously, when we used Open MPI and tried to expand to a lot of GPUs, there were some issues.

I think it's best if you come to the CWE; then we can look into that in detail. It's a bit

difficult to answer here. Thank you.

Thank you very much, everyone. That was the last question, but you can talk to Jiri outside the room. We have to vacate it for the next session. We really appreciate your attendance, and thank you very much.
