Xe-HPC and Ponte Vecchio - Architecture Day 2021 | Intel Technology
(transition sound effect) - The first step in making progress is to admit we have a problem. At Intel, we had a problem. Almost a decade long problem. We were behind on throughput compute density and support for high bandwidth memories.
Both are essential metrics for HPC and AI and the cornerstones of GPU architecture. The first chart is FP64 flops. The blue line is Intel versus the green line which is the best in the industry. The second is a similar chart for memory bandwidth. As is obvious, the gaps were quite large.
And in 2017, when GPU architecture started adding special engines for Matrix processing with AI data types, the problem got worse. Now, mind you, the real-world performance deltas between GPUs and CPUs are much lower than these charts indicate, but the gaps were real. We really wanted to close this gap in one shot. So we needed a moonshot. We set for ourselves some very ambitious goals. We started a brand-new architecture, built for scalability, designed to take advantage of the most advanced silicon technologies, and we leaned in fearlessly.
Let me hand it over to Hong to walk you through this brand new architecture, Xe HPC. (transition sound effect) - I'm here to talk about how we designed the Xe HPC architecture. How do we scale our architecture to realize the vision set up by Raja? We broke this problem down to four hierarchical building blocks: Core, Slice, Stack, and Link. Now, let me walk you through each of them. First, I want to introduce the Xe-cores, our foundational processing unit, through which we scale our architecture. Xe-cores are highly efficient arithmetic machines.
In each Xe-core, there are eight vector engines. Each vector engine provides floating point and integer operations on 512-bit wide vectors. There are also eight matrix engines, referred to as XMX, or Xe Matrix eXtensions. Each XMX engine is built with an eight-deep systolic array. XMX performs eight sets of 512-bit wide vector compute operations per clock. These Vector and Matrix engines are supported by a wide Load/Store unit that can fetch 512B per clock.
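To make the widths above concrete, here is a minimal sketch that converts the quoted 512-bit vector width into element counts per data type. It uses only the figures stated in the talk (eight vector engines, eight XMX engines, 512-bit vectors, eight sets per clock for XMX). The helper names are hypothetical, and the per-core figure assumes each vector engine processes one full-width vector per clock. It deliberately stops short of ops-per-clock rates, which depend on details not given here.

```python
# Figures quoted in the talk for one Xe-core.
VECTOR_ENGINES_PER_CORE = 8
XMX_ENGINES_PER_CORE = 8
VECTOR_WIDTH_BITS = 512
XMX_SETS_PER_CLOCK = 8  # "eight sets of 512-bit wide vector compute operations"

# Standard bit widths of a few floating-point types.
DTYPE_BITS = {"FP64": 64, "FP32": 32, "FP16": 16}


def lanes_per_vector(dtype: str, width_bits: int = VECTOR_WIDTH_BITS) -> int:
    """Number of elements of `dtype` that fit in one vector register."""
    return width_bits // DTYPE_BITS[dtype]


def vector_lanes_per_core(dtype: str) -> int:
    """Elements covered per clock across all vector engines of one Xe-core.

    Assumption: each engine handles one full-width vector per clock.
    """
    return VECTOR_ENGINES_PER_CORE * lanes_per_vector(dtype)


def xmx_bits_per_engine_per_clock() -> int:
    """Total operand width one XMX engine works on per clock."""
    return XMX_SETS_PER_CLOCK * VECTOR_WIDTH_BITS


if __name__ == "__main__":
    for dtype in DTYPE_BITS:
        print(dtype, lanes_per_vector(dtype), vector_lanes_per_core(dtype))
    print("XMX bits per engine per clock:", xmx_bits_per_engine_per_clock())
```

Under these assumptions a 512-bit vector holds 16 FP32 elements, and one XMX engine covers 4096 bits of operands per clock.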
Each Xe-core has a large 512KB L1 data cache, currently the largest in the industry. We optimized Xe-core for large data sets and this huge L1 cache helps tremendously. L1 cache is also software configurable as a scratch pad, also known as Shared Local Memory. Compelling Ops per clock for critical data formats is essential for high performance computing and AI. Here I'm showcasing those data formats and what an Xe-core can do.
But this is not all. We can also co-issue instructions to exceed these single Op per clock rates. Our Intel Libraries and kernels take full advantage of this for increased performance of the Xe-core. The next level building block is the Slice. For Xe HPC, a Slice has 16 Xe-cores totaling 8MB of L1 Cache and 16 Ray Tracing units, and it provides one hardware context.
The Ray Tracing units provide fixed-function computation for Ray Traversal, Bounding Box Intersection, and Triangle Intersection. This makes Xe HPC very attractive to professional visualization applications. The hardware context feature enables Xe HPC GPUs to execute multiple applications concurrently, without expensive software-based context switches. This greatly improves the utilization of GPUs in the cloud. At the top level we have the Stack. This can be a full GPU in itself. A stack contains four slices.
This adds up to: 64 Xe-cores, 64 Ray Tracing units, and four hardware contexts. The Stack has our Massive L2 Cache, four HBM2e controllers, a state-of-the-art Media Engine, and eight Xe-Links. The Xe Memory Fabric connects Copy Engines, the Media Engine, Xe-Link Blocks, HBM, and PCIe. Xe HPC architecture is also scalable, allowing us to do multi-stack design.
This is an industry first. We could only accomplish this because of our EMIB packaging technology. Here we connect Xe Memory Fabric on each stack directly. This enables unified coherent memory between the stacks. This is a big deal for software.
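The Core, Slice, and Stack counts quoted above roll up by simple multiplication, which the following sketch spells out. The per-level figures come from the talk. The `Resources` type and its `times` helper are hypothetical names, and the 32MB of L1 per stack and the two-stack totals are derived here rather than stated by the speaker.

```python
from dataclasses import dataclass

# Figures quoted in the talk.
L1_KB_PER_XE_CORE = 512
XE_CORES_PER_SLICE = 16
RT_UNITS_PER_SLICE = 16
HW_CONTEXTS_PER_SLICE = 1
SLICES_PER_STACK = 4


@dataclass(frozen=True)
class Resources:
    xe_cores: int
    ray_tracing_units: int
    hardware_contexts: int
    l1_cache_mb: float

    def times(self, n: int) -> "Resources":
        """Resources of n identical copies of this building block."""
        return Resources(
            self.xe_cores * n,
            self.ray_tracing_units * n,
            self.hardware_contexts * n,
            self.l1_cache_mb * n,
        )


SLICE = Resources(
    xe_cores=XE_CORES_PER_SLICE,
    ray_tracing_units=RT_UNITS_PER_SLICE,
    hardware_contexts=HW_CONTEXTS_PER_SLICE,
    l1_cache_mb=XE_CORES_PER_SLICE * L1_KB_PER_XE_CORE / 1024,
)
STACK = SLICE.times(SLICES_PER_STACK)

# Assumption: a multi-stack design with two stacks.
TWO_STACK = STACK.times(2)

if __name__ == "__main__":
    print("slice:", SLICE)
    print("stack:", STACK)
    print("two stacks:", TWO_STACK)
```

The slice and stack results match the numbers given in the talk: 8MB of L1 per slice, and 64 Xe-cores, 64 Ray Tracing units, and four hardware contexts per stack.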
We can now deliver leadership compute and memory bandwidth density for a wide range of HPC and AI systems with a single design. The fourth dimension to our scaling strategy is our Intel Xe-Link. Xe-Link provides high-speed coherent unified fabric for GPU-to-GPU communication. It supports load/store, bulk data transfer and synchronization semantics. It includes an eight-port switch, enabling up to eight fully connected GPUs in a node without any additional components. This leads to the ability to build very flexible topologies.
It is easier to show than tell. Here we have Xe-Link between two Xe HPC GPUs, so we could connect them with up to eight Xe-Links. Scaling to four GPUs for large problems is a popular configuration. Six GPUs per node may look familiar to you as this is the topology of Aurora's accelerator network.
A popular configuration for AI and large problems is to have eight GPUs in an OAM form factor for Universal BaseBoard design, following the Open Compute Project standard. The flexibility of Xe-Link enables a high number of coherent and unified accelerators in a single node. There is no need for additional components to scale-up. This is a massively scalable architecture, the magnitude of which has never been built before, as far as we know.
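The two, four, six, and eight GPU topologies described above can be reasoned about with a small calculation. This is a sketch under two assumptions that are not stated in the talk: each GPU contributes eight Xe-Links (as in the two-GPU example), and in a fully connected node the links are split evenly across peers. The function names are hypothetical.

```python
XE_LINKS_PER_GPU = 8  # assumption: eight Xe-Links available per GPU


def links_per_pair(num_gpus: int, links_per_gpu: int = XE_LINKS_PER_GPU) -> int:
    """Links between each GPU pair in a fully connected node.

    Assumption: every GPU divides its links evenly among its peers.
    """
    if num_gpus < 2:
        raise ValueError("need at least two GPUs")
    peers = num_gpus - 1
    if peers > links_per_gpu:
        raise ValueError("too many GPUs for direct all-to-all connection")
    return links_per_gpu // peers


def gpu_pairs(num_gpus: int) -> int:
    """Number of distinct GPU pairs in a fully connected node."""
    return num_gpus * (num_gpus - 1) // 2


if __name__ == "__main__":
    for n in (2, 4, 6, 8):
        print(f"{n} GPUs: {links_per_pair(n)} link(s) per pair, "
              f"{gpu_pairs(n)} pair(s)")
```

With these assumptions, two GPUs can share all eight links, while an eight-GPU node has exactly one link left per peer after each GPU connects to its seven neighbors.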
Now my colleague, Masooma, will take you through how we turned this architecture into an implementation. (transition sound effect) - Hong talked about the amazing Xe HPC architecture. My team and I along with the help of our partners, IP, test, packaging, process technology, and manufacturing teams had the challenge and privilege to bring this architecture to life as the Ponte Vecchio chip.
It is an understatement to say that Ponte Vecchio is the most complex chip and product that I have worked on, in my 30 years of chip building. Actually, I am not even sure if it is accurate to call it a chip. It is a collection of chips that we call tiles that are woven together with high bandwidth interconnects that are made to function like one monolithic silicon. Planning Ponte Vecchio execution was a completely different paradigm. I have worked on: New SoC architecture.
New IP architecture. New Memory architecture. New IO architecture. New Packaging technology. New Power Delivery technology. New Interconnects.
New signal integrity techniques. New reliability methodology. Completely new software. And new verification methodology. But never have I dealt with all of this newness in one product.
And that was the challenge that was Ponte Vecchio. It is amazing and somewhat unbelievable that the chip is alive and fully kicking with workloads! The Ponte Vecchio chip, as you see in this picture, is composed of several complex designs, that manifest in tiles. Compute Tile. Rambo Tile. Xe-Link Tile. And a Base Tile with high speed HBM memory.
These are then assembled through an EMIB Tile that enables a low-power, high-speed connection between the tiles. They are put together in Foveros packaging that creates the 3D stacking of active silicon for power and interconnect density. And then, the high-speed MDFI interconnect allows the stack to scale from one to two. All of this comes together in a manufacturing marvel across several different process technology nodes. Ponte Vecchio was new and novel in many ways, with myriad challenges.
While the multi-tile approach helped break down the problem into smaller chunks and provided flexibility, execution planning was orders of magnitude more complex. I want to walk you through a few big challenges, from the many that we had on Ponte Vecchio. Foveros was critical for Ponte Vecchio 3D stacking and we have some key learnings from its implementation, both functional and physical. We had to transfer data at 1.5x the speed of our original plan to minimize the number of Foveros connections. We also had to lock the Foveros locations early in the design on all the tiles, which meant that the floorplan was locked very early.
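The trade-off behind the 1.5x figure can be sketched numerically: for a fixed bandwidth target, the number of connections needed falls in proportion to the per-connection data rate. The bandwidth target and the normalized rates below are hypothetical values chosen for illustration. Only the 1.5x ratio comes from the talk.

```python
import math


def connections_needed(total_bandwidth: float, rate_per_connection: float) -> int:
    """Connections required to carry total_bandwidth at a given per-connection rate."""
    if rate_per_connection <= 0:
        raise ValueError("rate must be positive")
    return math.ceil(total_bandwidth / rate_per_connection)


def fractional_reduction(speedup: float) -> float:
    """Fraction of connections saved when each connection runs `speedup` times faster."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    target = 3000.0       # hypothetical bandwidth target, arbitrary units
    planned_rate = 1.0    # hypothetical originally planned rate, normalized
    faster_rate = 1.5     # 1.5x the planned rate, as quoted in the talk

    print(connections_needed(target, planned_rate))
    print(connections_needed(target, faster_rate))
    print(f"{fractional_reduction(1.5):.1%} fewer connections")
```

Running each connection 1.5x faster cuts the connection count by about one third for the same total bandwidth, whatever the absolute target happens to be.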
Since we pioneered this 3D implementation, we had to innovate continuously on die-to-die implementation and verification methodology. We developed many tools, methods, and scripts in real time, and performed validation at multiple levels of hierarchy with new BFMs and test benches to keep the tiles independent and keep hierarchies clean and crisp. This facilitated an independent schedule for each of the four main tiles and enabled their own debug packages.
With this divide and conquer approach, we were able to stage both pre- and post-silicon validation such that the chip booted within a few days of the SoC package assembly with the flashing of "Hello World". This was a huge sigh of relief and a cheer for thousands of engineers across Intel. The staged approach, while essential, meant that the RTL versions of the various tiles had to be in sync for the integrity of the top-level model.
The high-power, multi-tile package posed its own challenges related to signal integrity, reliability, and power delivery, as there was no precedent, internal or external to Intel. Foveros implementation was complex and time consuming. Just for context, Ponte Vecchio has two orders of magnitude more Foveros connections than any previous Intel design.
All the electrical and physical collaterals had to be generated from scratch and verified prior to delivery to our partner teams. Now let me tell you more about some of the most sophisticated and complex of these Ponte Vecchio tiles. While Ponte Vecchio was a challenge in aggregate, these individual tiles had a level of design complexity of their own. Compute Tile is a dense package of Xe-cores and is the heart of Ponte Vecchio.
One tile has eight Xe-cores with a total of 4MB of L1 Cache, our key to delivering power efficient compute. It is built on the most advanced TSMC Process Technology, called N5. We paved the way with the design infrastructure setup, tools, flows, and methodology for this node at Intel. This tile has an extremely tight 36-micron bump pitch for 3D stacking with Foveros. This is just one example of our IDM 2.0 strategy of combining internal and external process nodes, that Pat has outlined.
Base Tile is the connective tissue of Ponte Vecchio. It is a large die built on Intel 7, optimized for Foveros technology. It is where all the complex IO and high bandwidth components come together with the SoC infrastructure: PCIe Gen 5, HBM2e memory, MDFI links to connect tile to tile, and EMIB bridges that challenged physics.
Super-high-bandwidth 3D connect, with high 2D interconnect and low latency, makes this an infinite connectivity machine. Implementation of this tile was the hardest design challenge on Ponte Vecchio. We worked closely with the Intel technology development team to match the requirements on bandwidth, bump pitch and signal integrity. Xe-Link Tile provides the connectivity between GPUs, supporting eight links per tile. It is critical for scale up for HPC and AI.
We are targeting the fastest SerDes supported at Intel, up to 90 gig. When we won the Aurora ExaScale supercomputer contract, this was a new Tile added to enable the scale up solution as per their requirement. We built this incredible tile in less than one year. It is highly gratifying to see Ponte Vecchio powered on and successfully running hundreds of workloads and hitting some industry leading performance numbers on A0 silicon. Here in my hand is this marvel Ponte Vecchio. Let me now hand this to Raja.
- Thank you Masooma, you and your team have done a fantastic job. - Thank you Raja. Highly appreciate it. - This is an incredibly proud moment to be holding this marvel of engineering in my hand, what began as a moonshot that many said could not be done.
And nothing inspires Intel engineers like hearing those four words, "It can't be done." Thousands of engineers said, "We can." And let me show you what they have already done. That GPU Masooma handed me is A0 Silicon as she noted, which is our first stepping. It already produces greater than 45 teraflops of sustained vector single-precision performance, validating that our Compute Tiles are healthy. We also measured greater than five terabytes per second of sustained memory fabric bandwidth, which validates our Foveros 3D packaging technology.
And over two terabytes per second of aggregate memory and scale-up bandwidth, and this proves all our EMIB bridges are very healthy. And there is still more performance to be had. These are all leadership compute and bandwidth numbers that already erase the huge flop and bandwidth gap problem I mentioned earlier today. Ponte Vecchio will be available in PCIe cards with Xe-Link Interconnect Bridge. The OAM module form factor, that I just showed you, will be integrated onto a carrier base board that brings together multiple GPUs with Xe-Links.
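Returning to the A0 measurements quoted above, one rough way to relate the compute and bandwidth figures is the ratio of flops to bytes. The sketch below treats the quoted values as lower bounds and, as an assumption, uses the memory fabric bandwidth as the limiting bandwidth. The talk does not present this calculation, and the function names are hypothetical.

```python
# Lower bounds quoted in the talk for A0 silicon.
SUSTAINED_FP32_TFLOPS = 45.0  # "greater than 45 teraflops"
FABRIC_BANDWIDTH_TBPS = 5.0   # "greater than five terabytes per second"


def flops_per_byte(tflops: float, tbps: float) -> float:
    """Compute-to-bandwidth ratio in FLOP per byte."""
    return tflops / tbps


def attainable_tflops(intensity_flop_per_byte: float,
                      peak_tflops: float = SUSTAINED_FP32_TFLOPS,
                      bandwidth_tbps: float = FABRIC_BANDWIDTH_TBPS) -> float:
    """Simple bound: throughput is capped by compute or by bandwidth.

    Assumption: the kernel streams its data over the memory fabric.
    """
    return min(peak_tflops, intensity_flop_per_byte * bandwidth_tbps)


if __name__ == "__main__":
    ratio = flops_per_byte(SUSTAINED_FP32_TFLOPS, FABRIC_BANDWIDTH_TBPS)
    print("FLOP per byte at the balance point:", ratio)
    for intensity in (1.0, 9.0, 20.0):
        print(intensity, attainable_tflops(intensity))
```

Under these assumptions, a kernel needs roughly nine FP32 operations per byte moved over the fabric before compute, rather than bandwidth, becomes the limit.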
Our OEM partners will provide various accelerated compute systems utilizing these Ponte Vecchio subsystems and Sapphire Rapids. For years, taking advantage of GPU accelerated computing systems like this has been a major headache for software developers. They had to rewrite the parts they wanted to accelerate in different specialized languages: OpenCL, CUDA, et cetera, et cetera. Otherwise, the GPU did them no good. We already led the industry in CPU-based performance for both AI and conventional workloads, and we wanted a seamless way to take advantage of GPU-based acceleration. So we needed another moonshot.
A software moonshot. We needed a programming framework that let software developers transparently program for any mix of CPUs and accelerators. Many said this could not be done. So we created oneAPI. The oneAPI industry initiative provides an open, standards-based unified software stack that is cross-architecture and cross-vendor. The first version of the industry spec was released in September of last year, which specified a common hardware abstraction layer, data parallel programming language, and comprehensive collection of performance libraries addressing math, deep learning, data analytics, and video processing domains.
oneAPI allows developers to break free from proprietary languages and programming models. It exposes and exploits cutting-edge features of the latest hardware. A comprehensive set of libraries speeds development of frameworks, applications, and services. And the language and libraries work seamlessly with other ecosystem languages like Python, C++ and Fortran. Releasing an open specification is one thing.
The question I'm sure that's on your mind is whether the industry sees the value and will invest their own effort to adopt. The answer is a resounding yes. There are now DPC++ and oneAPI library implementations for NVIDIA GPUs, AMD GPUs, and ARM CPUs. It's also being adopted broadly by ISVs, operating system vendors, end-users and academics. We know that oneAPI version 1.0 is just the beginning of the journey. Key industry leaders are helping to evolve the specification to support additional use cases and architectures. The provisional version 1.1 spec was released in May, which adds new graph interfaces for deep learning workloads and advanced Ray Tracing libraries. We expect the version 1.1 spec to be finalized by the end of the year.
Here is a sampling of key ecosystem players who support and are actively engaged in oneAPI. oneAPI has developed broad momentum across the industry. For example, US National Labs that are developing ExaScale computers have adopted oneAPI components.
This will allow them to use CPU and GPU architectures from different vendors. Beyond the industry spec, Intel released the first commercial implementation of the full oneAPI stack. Our oneAPI product offering includes the foundational Base Toolkit, which adds compilers, analyzers, debuggers, and porting tools beyond the spec language and libraries. Over 200,000 developers have installed Intel's oneAPI product since our first production release in December 2020, and that was before they had access to Xe HPC. We anticipate an exponential growth in developer base, when we enable access to this architecture.
There are over 300 applications already deployed in market from ISVs across multiple segments that utilize the unified programming model of oneAPI. And, we have over 80 key HPC applications, AI frameworks, and middleware functional on Xe HPC that utilize oneAPI to quickly port from either existing CPU-only or CUDA-based GPU implementations. Let's look at oneAPI in action with the AI Analytics Toolkit. (transition sound effect) - [Narrator] It's been exciting over the last four to five years to see the growth in HPC and AI, and there's no better way to see the excitement than to look at the progression in performance on the image recognition benchmark, ResNet-50. The gold standard has been set with one architecture over the last several years, with record setting performance.
Well we're pleased to announce a new era with Ponte Vecchio. Built on the Xe HPC micro-architecture with an alchemy of technologies and more than 100 billion transistors, the Ponte Vecchio GPU was designed to take on the most challenging AI and HPC workloads. ResNet-50 inference throughput on Ponte Vecchio with Sapphire Rapids exceeds 43,000 images per second, surpassing the standard you see today in market.
And with training, while we are still in early stages, initial testing shows the compute, memory, and interconnect bandwidths of Xe HPC have unlocked the capacity to train the largest data sets and models. Today we are already seeing leadership performance on Ponte Vecchio with over 3,400 images per second. And this is only the beginning, as we continue with software optimizations and tuning. We're excited about the dawn of a new era where a new architecture can raise the bar to meet the ever-growing compute demands of the Data Center.
(transition sound effect) - Xe Architecture and oneAPI are more than AI training and inference and HPC flops. Let's take a look at some eye candy with oneAPI Rendering Toolkit. (transition sound effect) - [Narrator] Now, I'm excited to show you early results of our oneAPI implementation of the advanced Ray Tracing in the Provisional 1.1 oneAPI specification running on oneAPI based CPU and Xe GPU platforms. The Intel oneAPI Rendering Toolkit has six high-performance, feature-rich, open-source software components including the Academy Award-winning Embree Ray Tracing Library. These are already running on Intel and third-party CPUs like Apple's M1, and now you'll be the first to see oneAPI Rendering Toolkit running cross-architecture on CPUs and GPUs.
Let's show a typical artist's workflow creating, reviewing, then delivering a movie quality scene, backed by tools using Intel oneAPI Rendering Toolkit. Everything you'll see is an untouched live computer screen capture, using film quality assets at native HD 1080p resolution. First, let's show an artist creating a scene backed by Intel Embree using the tool Houdini from SideFX. The artist creates in HD with interactive path traced rendering on a Xeon workstation without a discrete GPU. For this phase of the design, the CPU provides the interactivity the artist needs.
When they pause to review, the path traced rendering converges towards photoreal quality. Next, it's time for the artist to review the scene with the director. This is where the oneAPI game changer comes in. You're looking at a real-time walk-through of an Intel history-inspired path traced scene at the fictitious 4004 Moore Lane. Using the oneAPI software architecture, we show Embree and AI based Intel Open Image Denoise, which took less than three days to port onto a pre-production Ray Tracing capable Xe GPU.
So the same feature-rich render kit capabilities artists and app developers crave on CPUs, including Ray Tracing and AI, are now accelerated on GPUs. The artist and director can review the scene instantly and interactively with full featured native HD Denoise Path Trace rendering. Okay, once the scene is ready for final, movie-ready 4K rendering, studios can choose an Intel Xeon CPU-based render farm or seamlessly add oneAPI capable Xe GPUs to improve their workflow.
Here is one 4K full fidelity frame rendered with a Ray Tracing capable Xe GPU. The full 4K movie is available for viewing in the demo showcase. So, in quick summary, two years ago, we announced oneAPI with the goal of open, cross platform, cross architecture development and execution. Today, we've shown that oneAPI has gone from an ambitious goal to a delivered reality for developers and creators.
- That was a fantastic demo of oneAPI and the rendering capabilities of Xe. All of this was set in motion with Argonne National Labs and the Aurora project, which combines Sapphire Rapids, Ponte Vecchio, Optane memory, and oneAPI to power the next generation of ExaScale applications. Here's an individual Aurora blade with two Sapphire Rapids and six Ponte Vecchios, addressing the needs of converged HPC and AI workloads. Tens of thousands of these blades connected via high speed fabric will be deployed next year to unleash ExaScale.
Less than two years ago, I shared our goals for Ponte Vecchio. It's an incredible moment for us. Seeing this extraordinary silicon engineering effort and ambitious software initiative coming to life in our labs. This is no longer a moonshot for us.
We still have a ways to go, and we are not done yet. But we can't wait to take you along on this journey when we bring this architecture to all our customers early next year. (transition sound effect)