Architecture All Access: Live at Lunar Lake ITT: Next Gen P-core Lion Cove
(bright upbeat music) (upbeat music continues) - Thank you. Good morning everybody. Have a tough act to follow, coming up on stage after Stephen, but here we are. So my name is Ori. I work on Intel's P-core team and I'm here to tell you all about Intel's next generation P-core, code-named Lion Cove.
So when we started our journey on Lion Cove, we defined three ambitious goals. First, we set out to deliver a step function in performance and area efficiency for client SoCs. Our second goal was to overhaul the P-core micro architecture, not just to achieve a one-time uplift in performance, or in IPC, but also to fundamentally resolve some micro architectural roadblocks, which would enable us to scale further in subsequent P-core generations. Finally, we decided to modernize the way we design Intel's P-cores to accelerate the pace of innovation going forward and better cater to our customers.
Let's give some examples of the steps Lion Cove took to significantly improve the PNP characteristics of the P-core. We start by examining hyper-threading. So as you all know, Intel introduced hyper-threading more than 20 years ago, when the mainline SoCs contained only a single physical core, and the ability to run a second concurrent thread or logical core without doubling the hardware was pretty revolutionary. As core counts grew, but still lagged the inherent software parallelism or the OS's ability to schedule threads, hyper-threading remained an efficient method of increasing overall compute throughput. Looking at a latest-generation P-core, hyper-threading scaling can be broadly characterized as adding 30% IPC, or throughput, for 20% Cdyn, or power at the same voltage and frequency, within the same core area footprint.
Now this is pretty good, which is why in certain data center deployments where thread density is a priority, hyper-threading shines. However, the introduction of hybrid computing in client, where E-core clusters are bundled together with P-cores on the same SoC, shifted the paradigm, as E-cores provide a much more efficient and performant multi-thread acceleration vehicle than hyper-threading. And as you know, OS scheduling priority really demonstrates this: in scenarios where performance matters, threads are first scheduled on the P-cores and then on the E-cores.
Only when all physical cores have been populated is hyper-threading invoked on the P-cores. In other scenarios where efficiency or battery life matters, threads are first scheduled on the E-cores and then on the P-cores. Once again, only when all physical cores have been populated is hyper-threading invoked. In this context, when Lunar Lake architects challenged us to deliver a significantly more performant and efficient P-core version for Lunar Lake, we took a hard look at hyper-threading. Remember, hyper-threading doesn't come for free: it replicates architectural state, it adds arbitration points along the pipeline, and it requires various fairness mechanisms and even dedicated security hardware.
If peak single-thread performance, performance per watt, and performance per area is really what we're after, how would a single-thread optimized core with hyper-threading removed compare with a traditional hyper-threading capable core? Well, here's the answer. Removing hyper-threading specific logic and scaling back various core structures to optimize for single-threaded use, while maintaining the underlying pipeline and micro architecture, we get comparable single-thread IPC for 15% lower Cdyn, and 10% smaller area. This translates to 15% better PNP, or performance per power, and 30% better performance per power per area versus a single thread running on a hyper-threading capable core. Now this is pretty intuitive, but what may be surprising is that even when hyper-threading is enabled, the single-threaded optimized core is more than 5% ahead in PNP and more than 15% ahead in performance per power per area.
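The percentages in this comparison can be roughly cross-checked with a back-of-the-envelope model. The sketch below is only an illustration: it treats the round figures quoted in the talk (30% more IPC for 20% more Cdyn with hyper-threading; 15% lower Cdyn and 10% smaller area for the single-thread-optimized core) as exact, assumes performance scales with IPC and power with Cdyn at fixed voltage and frequency, and uses hypothetical names such as `CoreConfig` and `gain`. The results land near the quoted numbers rather than exactly on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    """Relative figures, normalized to one thread on a hyper-threading-capable core."""
    ipc: float   # relative throughput at fixed frequency
    cdyn: float  # relative dynamic capacitance (power at fixed voltage and frequency)
    area: float  # relative core area

    @property
    def perf_per_power(self) -> float:
        return self.ipc / self.cdyn

    @property
    def perf_per_area(self) -> float:
        return self.ipc / self.area

    @property
    def perf_per_power_per_area(self) -> float:
        return self.ipc / (self.cdyn * self.area)


def gain(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0


# Figures quoted in the talk, treated here as exact (an assumption).
ht_core_one_thread = CoreConfig(ipc=1.00, cdyn=1.00, area=1.00)
ht_core_two_threads = CoreConfig(ipc=1.30, cdyn=1.20, area=1.00)
st_optimized_core = CoreConfig(ipc=1.00, cdyn=0.85, area=0.90)

for label, baseline in (("one thread on HT core", ht_core_one_thread),
                        ("two threads on HT core", ht_core_two_threads)):
    print(f"single-thread-optimized core vs {label}:")
    print(f"  perf/power:      {gain(st_optimized_core.perf_per_power, baseline.perf_per_power):+.1f}%")
    print(f"  perf/area:       {gain(st_optimized_core.perf_per_area, baseline.perf_per_area):+.1f}%")
    print(f"  perf/power/area: {gain(st_optimized_core.perf_per_power_per_area, baseline.perf_per_power_per_area):+.1f}%")
```

Under these assumptions the single-thread-optimized core comes out ahead in performance per power and in performance per power per area in both comparisons, while the hyper-threaded configuration keeps the lead in performance per area alone.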
As you see, hyper-threading still wins in performance per area, which is why it makes sense in a non-hybrid system where all you have is P-cores. Now, while hyper-threading is certainly the most externally visible feature that we've removed from the Lion Cove version that goes into Lunar Lake, it is by no means the only feature that was excluded. Other capabilities such as in-field test circuitry, transactional synchronization extensions, advanced matrix extensions, and various other features which are not enabled on the Lunar Lake platform were also removed from this version of Lion Cove. In short, you know, our mantra was simple, remove any transistor from the design that doesn't directly contribute to product goodness.
Now as we add performance to our CPUs while, aided by Moore's Law, shrinking their area, we're faced with an increasing thermal density challenge. We must closely monitor the temperature in all places on the CPU and take immediate action when there's a chance of overheating. This is usually done by either throttling frequency or scaling back voltage.
Now historically, the control loop for the core's thermal management hardware was calibrated in our labs, pre-launch, and its settings were statically set for all products and segments. Naturally, safeguard bands had to be put in place to ensure safe operation under extreme conditions. But since the settings were static, they sometimes impeded performance under less than extreme conditions. Now Lion Cove takes a novel approach to thermal management by introducing a network-based self-tuning controller that adapts to real-time operating conditions (the actual workload being run, the actual platform thermal solution, and the ambient temperature) and sets the appropriate temporal thresholds that allow for much tighter frequency convergence.
This way, the core is allowed to run at a higher frequency than under the traditional static methodology and is able to achieve higher sustained performance. Traditionally, Intel cores scale frequency at 100 megahertz intervals, or what we call bins. Oftentimes this left the core's power budget underutilized. Lion Cove solves this by introducing a finer-grained clock interval of 16.67 megahertz, which is able to extract more performance for a given power budget. To demonstrate, consider the case where the core is given a power budget that would theoretically allow it to run at 3.08 gigahertz. Now under the previous clock scheme, with 100 megahertz discrete intervals, the maximum allowed frequency would be three gigahertz, since the jump to 3.1 gigahertz cannot be satisfied.
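To make the bin arithmetic in this example concrete, here is a minimal sketch. It assumes the fine-grained step is exactly 100/6 MHz (about 16.67 MHz), which is the value that reproduces the frequency figures in the talk, and it uses a hypothetical helper, `highest_bin_mhz`, that simply rounds a frequency budget down to the nearest allowed step. Exact fractions are used to avoid floating-point rounding at the bin edges.

```python
import math
from fractions import Fraction


def highest_bin_mhz(budget_mhz, step_mhz):
    """Largest multiple of step_mhz that does not exceed budget_mhz."""
    step = Fraction(step_mhz)
    whole_steps = math.floor(Fraction(budget_mhz) / step)
    return whole_steps * step


BUDGET_MHZ = 3080                 # the 3.08 GHz budget from the example
COARSE_STEP_MHZ = 100             # traditional 100 MHz bins
FINE_STEP_MHZ = Fraction(100, 6)  # assumed fine-grained step, about 16.67 MHz

coarse_mhz = highest_bin_mhz(BUDGET_MHZ, COARSE_STEP_MHZ)
fine_mhz = highest_bin_mhz(BUDGET_MHZ, FINE_STEP_MHZ)
uplift_percent = float(fine_mhz / coarse_mhz - 1) * 100

print(f"100 MHz bins:   {float(coarse_mhz):.0f} MHz")
print(f"~16.67 MHz bins: {float(fine_mhz):.0f} MHz")
print(f"frequency uplift: {uplift_percent:.1f}%")
```

With the coarse bins the core settles 80 MHz below its budget; with the finer step it settles within about 13 MHz of it, which is where the roughly 2% frequency gain comes from.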
But this would still leave the power budget underutilized. Lion Cove, however, will be able to clock at 3.067 gigahertz, better utilizing its power budget and increasing performance by approximately 2%. Now let's look at the micro architectural innovations in the different parts of the Lion Cove core. The front-end part of the machine is responsible for fetching x86 instructions and decoding them into micro operations or uops.
An optimally performing core requires an efficient supply of uops to the out-of-order execution part of the core; specifically, the fetch and decode bandwidth should exceed that which the out-of-order engine can absorb. This starts with accurate branch prediction that is able to determine the correct code blocks from which to generate instructions. Lion Cove fundamentally changes the branch prediction scheme to significantly widen the prediction block, up to 8x wider than the previous generation, without sacrificing prediction accuracy. Now this has two important benefits. First, it allows the branch prediction unit, or BPU, to run ahead and prefetch code lines into the instruction cache, alleviating possible instruction cache misses. In this context, the instruction cache request bandwidth towards the L2 was increased threefold on Lion Cove to capitalize on the BPU running ahead.
Second, wider prediction blocks allow the increase in instruction fetch bandwidth, and indeed the instruction fetch bandwidth was doubled from 64 bytes per cycle to 128 bytes per cycle. And the decode bandwidth was increased from six to eight instructions per cycle. Now these instructions are steered towards the uop queue and are also built into the uop cache.
Since code lines are often reused, the uop cache allows for efficient, low latency, and high bandwidth supply of previously decoded uops towards the out-of-order engine without having to power up the fetch and decode pipeline. On Lion Cove, the uop cache grew from 4,000 to 5,250 uops, and its read bandwidth was increased to supply 12 uops per cycle versus 8 previously. Finally, the uop queue grew from 144 to 192 entries, facilitating the service of longer, or larger code loops in a power efficient manner.
The out-of-order engine is responsible for scheduling micro instructions for execution in a manner which maximizes parallelism, thus increasing IPC. Prior generation P-cores employed a monolithic scheduling scheme where a single scheduler was tasked with determining the data readiness of all uop types and scheduling them, or dispatching them, across all execution ports. This scheme was exceedingly hard to scale and incurred significant hardware overhead. Lion Cove solves this by splitting the out-of-order engine into two domains: integer, which also holds address generation units for memory operations; and vector.
These two domains now have independent renaming structures, catering to optimized uop bandwidth, and independent schedulers catering to optimized portability. This allows future expansion of each of these domains independently of each other and provides opportunities to save power on workloads that only use one of the domains. Lion Cove increases the allocation and rename bandwidth from six to eight uops per cycle. The out-of-order depth, or instruction window, was increased from 512 to 576 uops, and the physical register files and load and store buffers were enlarged appropriately versus prior generation. Lion Cove retires 12 uops per cycle versus 8 previously.
Turning to execution, Lion Cove increases the total number of execution ports to 18 from 12. On the integer side, six integer ALUs are complemented by three shift units and three 64-bit multipliers operating at three cycles latency and one cycle throughput. Three branches can be resolved in parallel per cycle. On the vector side, Lion Cove has four 256-bit SIMD ALUs, two 256-bit FMAs operating at four-cycle latency, and two 256-bit floating point dividers with significantly improved latency and throughput for both single and double precision operations versus prior generation. Crypto acceleration hardware for AES, SHA, SM3 and SM4 resides in the vector stack. Now, a key part in a performant micro architecture is the core's memory subsystem.
At the heart of that memory subsystem are the data caches. As you know, caches are all about striking the perfect balance between bandwidth, latency, and capacity, given a certain area and power budget. Lion Cove significantly re-architected the core's memory subsystem to allow for sustainable high bandwidth with low average latency while still keeping built-in scalability and flexibility to increase cache capacity. The first level data cache was completely redesigned to allow full operation at four cycles of latency versus five cycles previously. Lion Cove introduces a new three-level cache hierarchy by inserting an intermediate 192 kilobyte cache between the first and second-level caches. This has two key benefits.
First and foremost, it decreases the average load-to-use latency seen by the core, which increases IPC. Second, it allows us to grow the L2 cache capacity to keep a larger portion of the data set closer inside the core without paying the IPC penalty of the added L2 cache latency. And indeed, the L2 on Lion Cove grows to two and a half megabytes on Lunar Lake and three megabytes on Arrow Lake.
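The effect of an intermediate cache level on average load-to-use latency can be illustrated with a simple expected-latency calculation. The sketch below is purely illustrative: only the four-cycle and five-cycle first-level latencies come from the talk, while every hit rate and every other latency is a made-up placeholder chosen to show the shape of the argument, not a measured Lion Cove figure. The function name `average_load_latency` is likewise hypothetical.

```python
def average_load_latency(levels, miss_latency):
    """Expected load-to-use latency in cycles.

    levels: list of (hit_latency_cycles, local_hit_rate) pairs, ordered from
            the closest cache outward; local_hit_rate applies only to the
            loads that actually reach that level.
    miss_latency: latency for loads that miss every listed level.
    """
    expected = 0.0
    reaching = 1.0  # fraction of all loads that get as far as this level
    for hit_latency, local_hit_rate in levels:
        expected += reaching * local_hit_rate * hit_latency
        reaching *= 1.0 - local_hit_rate
    return expected + reaching * miss_latency


BEYOND_CORE_CYCLES = 60  # hypothetical latency once a load leaves the core

# Two-level hierarchy: 5-cycle first level (from the talk), hypothetical L2.
two_level = [(5, 0.90), (16, 0.80)]

# Three-level hierarchy: 4-cycle first level (from the talk), a hypothetical
# intermediate level, and a larger, slightly slower hypothetical L2.
three_level = [(4, 0.90), (9, 0.50), (17, 0.70)]

print(f"two-level:   {average_load_latency(two_level, BEYOND_CORE_CYCLES):.2f} cycles")
print(f"three-level: {average_load_latency(three_level, BEYOND_CORE_CYCLES):.2f} cycles")
```

In this toy model the intermediate level absorbs part of the first-level misses at a moderate latency, so the L2 can be made larger, and a little slower, without raising the average latency the core sees.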
And along with several other L2 controller optimizations, as well as an increase in L1 fill buffers to 24 and L2 miss queues to 80, Lion Cove shows a significant improvement in its capacity to consume external bandwidth. And this, as you know, is key to running performant AI workloads. In other memory subsystem enhancements, the first-level DTLB was increased to support coverage for 128 pages versus 96 previously. And in order to improve load execution in the shadow of older stores, Lion Cove adds a third store address generation unit.
It employs a new fine-grain memory disambiguation algorithm to safely avoid store-to-load conflicts, and enhances the store-to-load forwarding scheme to allow a young load to collect and stitch data from any number of older pending resolved stores as well as from the data cache. So how did we do? Well, Lion Cove drives a significant double-digit IPC, or fixed frequency improvement, over a wide spectrum of workloads. Having optimized for lower TDPs on Lunar Lake, Lion Cove delivers more than 18% PNP, or performance at power, in that low TDP range. Finally, I'd like to steer away from architecture and micro architecture and take you behind the scenes to the world of design where Lion Cove made an incredible transformation, which will arguably impact the future of P-core going forward more than anything we've talked about until now.
The proprietary tools, flows and methodologies by which the P-core was traditionally designed were replaced on Lion Cove with leading industry solutions, adapted and enhanced by our vendors and partners for our unique needs. Lion Cove transitioned from a design composed of small FUBs, or functional blocks, of tens of thousands of cells, dominated by manually-drawn circuits, to synthesis-based partitions of hundreds of thousands to millions of cells. Yeah. This was enabled by massive refactoring of the Lion Cove RTL code base and streamlining of our design collaterals. So why is this so important? First, the reduction of artificial physical boundaries in the design directly leads to increased utilization and better area efficiency. Second, the reduced integration overhead leads to shorter hardening time, which allows us to pack more content into each P-core iteration.
This, of course, accelerates the pace of innovation. And third, our new development environment allows us to insert knobs into our design to quickly productize SoC-specific derivatives out of our baseline superset P-core IP. And indeed, the Lion Cove version that goes into Lunar Lake differs in several aspects from that which will go into Arrow Lake later this year.
Finally, it's important to know that this incredible transformation was done without sacrificing performance. That's a tribute to the exceptional designers and engineers working on our team, who definitely deserve a shout-out. In summary, Lion Cove provides an impressive performance leap over the prior P-core generation, with double-digit IPC gains and power and area optimizations. It complements this with state-of-the-art power management. Lion Cove's micro architectural breakthroughs and design innovation pave a long runway for further innovation. Thank you. (bright upbeat music)
2024-06-22 17:11