(upbeat electronic music) (gentle electronic music) (audience applauding) - All right, thank you. Hey. Welcome, everybody. My name is Stephen Robinson. I'm the lead architect for the E-core line. I'm an Intel fellow. And here today, we're going to talk to you about Skymont, the next-generation E-core.
Thank you very much. Okay, so... (Stephen clearing throat) We had a couple of goals that we set out when we wanted to do our next micro-architecture. One of the key goals that we had was, "Let's take the low-power island that was introduced in Meteor Lake, in Crestmont, the Crestmont low-power E-core, and let's see how we can use that to get more battery life, let's get more energy efficiency, and let's get PCs that last longer." (Stephen clearing throat) So our key approach was to maximize what you can run on the low-power island.
Running on the low-power island comes at a lower power cost; it's efficient compute. So the more workloads we can run on that low-power island, the better your battery life gets. That was our goal. The next thing was making sure we can run a variety of content.
So every day there's more software, and more of it is general-purpose (microphone feedback ringing) software that uses vector instructions, things like that. So we set a goal: we're going to double our vector capabilities, and we're going to do it in a balanced machine, making sure there aren't too many bottlenecks.
So that was the second thing. And you know, AI is a part of that. Most AI workloads use vector instructions. So that was our focus.
Third is scalability. The way you've seen Intel embrace hybrid computing is that we can have P-cores, we have E-cores, and we can scale out with more E-cores. This gives you a way to scale out efficiently under power constraints, and cost-wise as well, since you can fit more E-cores into a given area.
So the third goal was making sure we continue to have scalability; that was our third vector, and I'm excited to show you how far we've come. So now I'm going to talk about the specifics of the micro-architecture. Every modern CPU micro-architecture starts with branch prediction.
Branch prediction is turn-by-turn directions for the instruction stream: where are you going next? The branch predictor tells you, "Turn left, turn right, in 200 meters, in 400 meters."
That's what branch prediction is. In Skymont, we of course made the predictor a little larger, but more importantly, we started scanning more data reliably. In the Crestmont E-core in Meteor Lake, we moved to what we call 128-byte prediction, which means we scan across 128 bytes looking for the next taken branch, the next step in your turn-by-turn directions. But Crestmont had to start at a 128-byte boundary, so sometimes we only got half of that window, sometimes only a quarter. In Skymont, we fixed that: the start point is arbitrary, you can start at any 64-byte boundary and still look across two cache lines. We call this "even/odd."
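As a purely conceptual, software-only sketch of that taken-branch scan, not how the hardware is built, here is a small C model of finding the next predicted-taken branch within a prediction window; all of the types and names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Conceptual model of the scan described above: look across a window of
 * instruction bytes for the first branch predicted taken and return its
 * target; otherwise fall through to the next window. Entries are assumed
 * sorted by offset. Illustrative only. */
typedef struct {
    uint32_t offset;          /* byte offset of the branch within the window */
    int      predicted_taken; /* nonzero if the predictor says "taken"       */
    uint64_t target;          /* predicted target address                    */
} branch_entry;

static uint64_t next_taken_target(const branch_entry *branches, size_t count,
                                  uint32_t start_offset)
{
    for (size_t i = 0; i < count; i++) {
        if (branches[i].offset >= start_offset && branches[i].predicted_taken)
            return branches[i].target;
    }
    return 0; /* no taken branch predicted: fall through to the next window */
}
```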
Now, no matter where you start, or in which cache line, you can look across more bytes, and fundamentally we're just getting to that next target quicker. (Stephen clearing throat) This helps us fetch the data faster as well, because once we know where we're going, we can start the access.
Whether we're going to miss the cache or get the data out of the cache and send it to the rest of the machine, we can now do that a little quicker and more reliably. Here, we can fetch up to 96 instruction bytes. That's not 96 instructions, it's 96 bytes that can produce instructions; remember, x86 is a variable-length ISA, so that could be many instructions or fewer. We grab those 96 bytes every cycle and load them into some buffering before we send them to our decoders.
That buffering is 50% deeper as well. So now that we can predict where we're going, we can fetch the data, and we hand it to decode. Several generations ago, we introduced the industry to clustered decode: we did this in Tremont, we did it in Gracemont, and we have it in Crestmont. It's two independent three-wide decoders, each of which can find up to three x86 instructions, again variable length.
Previously, we could do that on two clusters at a time. In Skymont, we add a third cluster, so we've improved our instruction throughput for decode by 50%.
This is a nine-wide front end for sustained decode. The first time we did it, we thought it was a pretty good idea. Since then we've learned a lot, we've optimized it, we've made it more efficient, and this shows that we've gotten good enough at building it that scaling out to nine-wide isn't a massive area impact or a massive timing impact; it's actually quite efficient. So we feel strongly that any industry debate about x86 and variable-length instructions is done: we built it, it's good. Now, what other problems did we run into and optimize? One thing that happens in x86, and in other ISAs as well, is that sometimes a single instruction generates multiple μops. On Intel, this is microcode: we have a microcode sequencer, a ROM; it fetches and decodes into multiple μops.
When we introduced cluster decoding, only one cluster could access the microcode ROM at a time, so if a cluster needed the ROM, it would wait its turn, go to the ROM, and fetch its μops. That was basically a serialization point. So in Skymont we've introduced what we call "nano-code." Microcode, nano-code. Nano-code puts decoders in the clusters that can generate μops for common, specific microcode flows, which lets us keep that decoding going in parallel.
Take a gather instruction, or something along those lines, which would have to generate two, three, ten μops. Now cluster zero can happily generate gather μops, cluster one can also generate gather μops if that's what it's working on, and cluster two continues decoding whatever it wants, out of order. So nano-code is the next thing that adds bandwidth and gives us more reliable decode.
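For a concrete sense of the kind of instruction that expands into many μops, here's a hedged sketch using the AVX2 gather intrinsic; the helper name is illustrative, and the build assumes AVX2 is enabled (for example gcc -mavx2).

```c
#include <immintrin.h>

/* One x86 instruction, many μops: an AVX2 gather loads eight floats from
 * table[idx[0]] .. table[idx[7]] in a single instruction, which the core
 * expands into multiple μops internally. */
static __m256 gather8(const float *table, const int *idx)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx); /* 8 x int32 indices */
    return _mm256_i32gather_ps(table, vindex, 4);              /* scale = 4 bytes per float */
}
```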
Finally, our cluster-decode scheme is replication, and when we replicate a cluster, we replicate the resources that go with it. Earlier I talked about the 96-byte fetch; that's because we do 32 bytes per cluster, so it naturally scaled out.
The same is true for the queue between the front end and the back end of the machine, the μop queue. Because we added the third cluster, we now have 50% more μop buffering between the front end and the back end to help smooth out performance. So we've got a nine-wide front end; what did we do in the back end? Going from front end to back end, if you look at Gracemont, our original Alder Lake E-core, it was a five-wide micro-architecture: we could allocate five μops per cycle.
In Crestmont, in Meteor Lake, it's six-wide. In Skymont, to go with the nine-wide front end, we went to eight-wide allocation, so the E-core in Lunar Lake can allocate eight μops per cycle into the back end of the machine. We also increased our retirement.
Previously, we would retire up to eight μops per cycle; now we do 16 μops per cycle. What does this mean? It's not necessarily directly correlated to performance, but it frees up resources: the faster we can retire, the more resources we free up, and the more efficient the machine can be. We could overbuild structures and have slower retirement, or we could build wider retirement and make some of our queues a little shallower, which saves area and Cdyn. So that's what we did.
Okay, the third thing: in a perfect micro-architecture, you end up with what we call the critical dependency chain. One instruction feeds another instruction, which feeds another, and in the end, with infinite resources, that dependency chain is what you're left with. So here we've continued innovating on ways to actually break dependencies, so that even though two instructions appear to be dependent, we know we can get the data from somewhere else. A classic example is a zeroing idiom: an instruction that you know will produce a zero regardless of its inputs.
So that's one example of breaking a dependency. We have more: there are cases where we take a load that we predict will forward from a store, and because we know of that connection, we don't feed the load's data to the dependent μop, we feed it the earlier store's data directly. Another example: in x86 we've been doing ESP folding, where we take the stack-pointer updates and accumulate the offsets, and we can now do that in more cases as well. So that's dependency breaking.
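As a small, hedged illustration of the zeroing-idiom case (not code from the talk): for an initialization like the one below, compilers typically emit "xor reg, reg", which the rename stage can recognize as producing zero with no dependence on the register's previous value, so the chain is broken.

```c
/* Illustrative only. "count = 0" is usually compiled to a xor zeroing
 * idiom, which the hardware treats as dependency-free. */
static long count_matches(const int *values, int n, int key)
{
    long count = 0;                 /* typically a xor zeroing idiom */
    for (int i = 0; i < n; i++) {
        if (values[i] == key)
            count++;
    }
    return count;
}
```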
This short-circuits the dependency chain and gets more IPC, more performance, and it does it in a very efficient way: because we're breaking dependencies, we don't have to wait as long. So that's our out-of-order engine. Now that we have a nine-wide front end, eight-wide allocation, and 16-wide retire, we grew the out-of-order window.
If you look again at the E-core in Alder Lake and Raptor Lake, that was a 256-entry out-of-order window. In Crestmont, we actually left it the same; for Crestmont we said, "Let's add a few more resources, let's balance the machine."
In Skymont, we're about 60% deeper, a 416-entry out-of-order window. Along with that, lots of resources need to scale up: physical register files, reservation-station entries, load and store buffering. All of those are larger in the Skymont micro-architecture. So, eight-wide allocation, a 400-plus-entry ROB; how many dispatch ports? We're up to 26.
That's a pretty big number. Our approach to reservation stations, execution units, instruction scheduling, is a little different than some others. So on E-core here, on Skymont, we firmly believe in "dedicated functionality produces energy efficiency," so you can share schedulers, or you can make them dedicated. You can share execution units, or you can let them execute in parallel. So for us, we continued to let them execute in parallel.
So here, you can see this is the integer stack blown up in the picture. We have eight integer ALUs, so eight single-cycle operations per clock. We can resolve three jumps per cycle.
We can execute three loads per cycle. The other thing we've done is bring symmetry: we have four shifters per clock and two integer multipliers. In general, more hardware, more parallelism. That's the integer side.
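As a rough, hedged illustration (not from the talk) of the kind of independent integer work that wide integer execution can run in parallel, here is a loop with four accumulators and no cross dependencies between them, so several ALUs and load ports can be busy at once; it assumes n is a multiple of 4 for brevity.

```c
/* Illustrative only: four independent accumulators expose instruction-level
 * parallelism for a machine with many integer ALUs and load ports. */
static long sum_four_way(const int *v, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```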
Now, here's the other thing; this was one of our fundamental design goals, and I mentioned vector performance. Previously, for floating-point operations, say, we had two 128-bit pipelines. As table stakes for this plan, we wanted to make sure we were industry-competitive, so we doubled that: now we have four 128-bit floating-point pipelines. We had three SIMD ALUs (SALUs) previously, and now we have a fourth, so again, being consistent with "parallel where we can be, consistent port binding, a consistent number of ALUs," that's what we have here. What this does is make our theoretical peak, TOPS, FLOPS, however you want to count it, 2x the previous generation for these types of workloads.
In addition, we cut latency. FADD, FMUL, FMA: we're down to a four-cycle FMA, which is consistent with modern micro-architectures. So we have more bandwidth and shorter latency at the same time in Skymont.
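To make the FMA discussion concrete, here's a minimal sketch of the fused multiply-add pattern over 128-bit lanes; the function name is illustrative, and it assumes n is a multiple of 4 and an FMA-capable build (for example gcc -mfma).

```c
#include <immintrin.h>

/* c[i] += a[i] * b[i], one fused multiply-add per four floats. */
static void fma_accumulate(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        vc = _mm_fmadd_ps(va, vb, vc);   /* a*b + c in one instruction */
        _mm_storeu_ps(c + i, vc);
    }
}
```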
We also added native hardware rounding support and floating-point normalization. Previously, in Gracemont and Crestmont, we didn't pay for that hardware. So let's say I have a program, C code, and I compile it with GCC at -O2; I think I'm optimizing it. The binary you get is going to follow strict IEEE rules, because the compiler doesn't know whether you need them and assumes that you do. When that happens, sometimes you do a bunch of math and the result is a number you can't properly represent as a normal floating-point value, and there are rules on how you handle that. DAZ and FTZ are flags in x86 that give you a predictable answer without following all of those rules. When you're not using them, which a GCC -O2 compile does not, we would sometimes go to microcode and say, "Please help, please fix this value up for me." We get rid of that in Skymont.
We now reliably take care of all of that in hardware. So this is an example of something you didn't necessarily notice, but we're trying to make sure we clean up some of those glass jaws and give you that reliable performance.
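To make the DAZ/FTZ flags concrete, here's a minimal sketch of how software typically opts into them via the MXCSR control macros; a plain gcc -O2 build leaves them off, which is the strict-IEEE case described above, the one Skymont now handles in hardware instead of microcode.

```c
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* Opt in to DAZ/FTZ: flush denormal results to zero (FTZ) and treat
 * denormal inputs as zero (DAZ), trading strict IEEE behavior for a
 * predictable fast path. */
static void enable_daz_ftz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```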
The other thing is AI. Those of you in this room skipped the AI talk next door, but you'll get to hear it later. Clearly, we have more execution units: VNNI instructions run in our SIMD multiply (SIMUL) execution units, and we have more of those. In Gracemont, we had one SIMUL; in Crestmont, two; and in Skymont, four. So over those two generations we've quadrupled the amount of hardware you can use for VNNI-like instructions. And if you're doing FP32 or FP64, we have 2x on those as well.
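For reference, here's a hedged sketch of the arithmetic a VNNI dot-product step (vpdpbusd) performs, written with older SSE building blocks that a single VNNI instruction effectively fuses (modulo intermediate saturation in pmaddubsw); the helper name is illustrative.

```c
#include <immintrin.h>

/* Unsigned-8-bit x signed-8-bit multiplies, pairwise sums, then
 * accumulation into 32-bit lanes: the VNNI dot-product pattern. */
static __m128i dot_u8s8_step(__m128i acc, __m128i activations_u8, __m128i weights_s8)
{
    __m128i pairs16 = _mm_maddubs_epi16(activations_u8, weights_s8); /* u8*s8 -> paired s16 sums */
    __m128i sums32  = _mm_madd_epi16(pairs16, _mm_set1_epi16(1));    /* widen pairs to s32 */
    return _mm_add_epi32(acc, sums32);                               /* accumulate */
}
```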
Adding the vector hardware is one piece, but you have to balance the rest of the machine. So let's look at load/store and the L1 data cache. We have gone from two loads per cycle to three loads per cycle. We continue to have a 128-bit, 16-byte data path per load, to match the vector width. So now we're up to three loads.
That's 50% more L1 load bandwidth. There's a third load address-generation unit to go with it, and on the store side, we doubled our store AGUs, so we now have four store address generations per clock. Our peak store data throughput is still two per clock, but for address generation it's four. We do this so we can resolve store addresses sooner.
This speeds up loads. It gets rid of cases where we fail to predict whether a store and a load are independent; it's just general performance goodness, and it leads to faster resolution of loads against unknown store addresses.
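As a hedged illustration (not from the talk) of why resolving store addresses early matters, consider a loop where the compiler cannot prove the destination and source never alias; the function below is illustrative only.

```c
/* If dst and src might alias, each load of src[i] must be checked against
 * the still-pending store to dst[i - 1]. Until that store's address is
 * known, the load either waits or speculates that they are independent. */
static void scale_into(float *dst, const float *src, int n, float k)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;   /* store whose address must be disambiguated
                                  against the next iteration's load */
}
```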
Then there's our shared second-level TLB, which is for both code and data. On Gracemont it was 2,000 entries, on Crestmont it's 3,000 entries, and in Skymont it's 4,000 entries. We also added more pipelined PMH (page-miss-handler) entries, the table walkers, the things that look at page tables, so we can do more of those walks in parallel as well. So that's load/store, and now we go to the L2.
In Lunar Lake, we have a four-core module that shares a four-megabyte L2. If you look at Crestmont in Meteor Lake, there were two flavors: one four-core module on the ring with a two-megabyte L2, sharing the LLC with the P-cores, with Redwood Cove; and the low-power island, which was isolated, not on that LLC or that ring, with a two-megabyte L2 shared by two cores.
So here's where Lunar Lake gets pretty interesting. It doubles that core count and doubles the L2. We went out of our way to improve core-count scaling, so as you go from one core active to two, three, and four cores active, we wanted to provide more bandwidth, and we doubled the available L2 bandwidth. Each core can bring in a single cache line per clock, but the L2 can produce two cache lines per clock, which helps with parallelism for multi-core workloads. We also widened the eviction data path, from 16 bytes per clock to 32 bytes per clock; again, just getting better at memory transfers and memory bandwidth.
The other thing is kind of fun. Some of you out there in the tech press benchmark our cores, you run micros, you notice things, and some of you noticed something. When multiple cores in the same module want the same data at the same time, and specifically when one core has modified data still sitting in its first-level cache and another core wants to access it, we did a funny thing in Gracemont and Crestmont: we didn't just say, "Here's the data, go execute."
Instead, we pretended it missed the L2: we sent the request to the fabric, the fabric came back and asked us for the data, we provided the data to the fabric, and the fabric handed it back to us. So people were surprised: "Hey, the data is near, but the latency is high." In fact, the latency was a bit longer than a normal cache hit, because of that round trip. It was nice of you to notice, but the good news is we fixed it. In Skymont, we have what we call L1-to-L1 transfers.
What this means is that when one core asks the L2 for data and we see it's resident in another core's L1, we don't go to the fabric anymore: the L2 says, "Please give me the data," grabs it, and provides it to the requesting core locally. The fabric isn't involved. This gives more reliable performance for cases where people have really tight pipelines and are sharing data within a module, close in time and space. So that's more reliable latency for cooperative workloads.
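A hedged sketch (not from the talk) of the kind of tightly coupled sharing that L1-to-L1 transfers help: two threads ping-ponging a value through a shared line, so data modified in one core's L1 is immediately wanted by the other core. Names and the iteration count are illustrative; build with something like gcc -O2 -pthread.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { ITERS = 1000000 };

static _Atomic int ready = 0;   /* which iteration has been published (0 = consumed) */
static int payload = 0;         /* data handed between the two cores */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= ITERS; i++) {
        payload = i;                                              /* dirty the line in this core's L1 */
        atomic_store_explicit(&ready, i, memory_order_release);   /* publish */
        while (atomic_load_explicit(&ready, memory_order_acquire) == i)
            ;                                                     /* wait until consumed */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    long sum = 0;
    (void)arg;
    for (int i = 1; i <= ITERS; i++) {
        while (atomic_load_explicit(&ready, memory_order_acquire) != i)
            ;                                                     /* wait for this iteration */
        sum += payload;                                           /* read the other core's write */
        atomic_store_explicit(&ready, 0, memory_order_release);   /* hand the line back */
    }
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```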
Now, hybrid at Intel: we've got two slightly different usages, and Meteor Lake is a great example of that. Again, we had Crestmont cores on the ring, on that shared LLC, and we had Crestmont in the low-power island. In Lunar Lake, we're going to focus on the low-power island, because that's where we've put the E-cores. As I mentioned at the start, our goal is to run as many workloads as we can there, because on the low-power island, as you heard in the keynote and the Q&A, we're not running a ring, we're not running a P-core, we're close to memory, and we now have a system cache, a memory-side cache, which is also new versus Meteor Lake. So we have more cache, and we can be power-efficient. We still have the ability to use the E-cores on a compute tile, on a ring, on a fabric, on the LLC; that's not the Lunar Lake discussion, but the Skymont IP itself still has that ability.
So let's go into the low-power island and focus on this particular usage. I'm going to show you a comparison of the Crestmont LP E-core in Meteor Lake against the Skymont LP E-core in Lunar Lake. Here's the performance; this is IPC. This is SPECint estimated, a generic GCC -O2 compile. It isn't a custom compiler, it isn't a binary optimized specifically for one core or the other; it's the exact same binary.
And here you go. On the integer side, it's a 38% geo-mean IPC uplift. You can see the S-curve behind it; that's each sub-component ordered by performance. There are no negative outliers, and you can see it's not one or two workloads pulling it up, it's a nice, even S-curve. On the right side is floating-point, SPECfp. I mentioned we doubled the vector hardware.
I also mentioned that we fixed the denormal issues on floating-point, which does affect this, so that is included in this data. And here we get a 68% IPC uplift for SPECfp. So if you're running vector-ish, AI-type workloads, this is the kind of proxy that's relevant. And again, everything is positive, no regressions. (person coughing) So that's IPC. Let's look at power and performance.
So here's power and performance. This is a single-threaded view: a single Skymont E-core in the Lunar Lake low-power island against a single Crestmont LP E-core in the Meteor Lake SoC die. At the same peak performance as the previous generation, at the full frequency we chose to run it at, we're at one-third the power: 0.3x, two-thirds less.
At the same power level, it's 1.7x the performance. And then peak-to-peak is where Lunar Lake gets really interesting: the power delivery and voltage flexibility are a little different in Lunar Lake, so we now have the option to run the E-cores a bit faster than we chose to in Meteor Lake. They get not just the IPC advantage but also a frequency advantage, and peak-to-peak it's 2x. Now, that's single-threaded. I mentioned we've got more cache, more bandwidth, and double the core count, so let's look at MT.
Here I'm comparing two cores against four, so it isn't an iso-core comparison, but it is power versus performance. Again, one-third the power for the same peak performance: running those two LP E-cores in Meteor Lake at their full frequency, we match that performance at one-third the power. We're 2.9x the performance at the same power, and 4x peak-to-peak.
So the low-power island in Lunar Lake literally has 4x the capability of the low-power E-core cluster in Meteor Lake. When I talk about coverage, this is exactly what I mean. On Meteor Lake, if you do video playback, Netflix, things with very low CPU overhead, we were able to keep that work on the low-power E-cores.
But if you do something more interesting, Teams, multi-stream video, conferencing, it tends to fall off. Those cores don't provide quite enough performance to guarantee the work can stay on them, so we fire up the compute tile, et cetera. Here, we should be able to capture those workloads very well on the E-cores.
Later, you can listen to the Thread Director talk, and we'll tell you exactly how we help the software system make sure the work lands there, at the right efficiency point. But this is what I mean by coverage, and it's key to Lunar Lake battery life: we can run more workloads in a lower-power environment, on a lower-power core, with fewer things turned on, and still give you a great user experience. So that's the Skymont E-core in Lunar Lake. I also mentioned that we can have one on a ring, on a fabric. This is not Lunar Lake; this is just to talk about the capabilities of the IP.
So I'm going to do a slightly different comparison than the one I just did, which was against the previous-generation CCG product, Meteor Lake, LP E-core to LP E-core. Here, I'm going to compare against Raptor Cove in Raptor Lake.
So today, Intel's primary desktop product that we ship is Raptor Lake, with Raptor Cove as the P-core. So let's see how we do. This is IPC. This isn't peak performance, this isn't frequency. You know, Raptor Cove runs at six-plus gigahertz, right? Absolutely amazing. Wonderful core. Skymont, we're not trying to run at six gigahertz, this is not the goal.
But this is IPC. Again, it's the same workload I've shown you for all of these: GCC, SPEC CPU 2017 estimated, compiled -O2 out of the box.
2% higher on "int," 2% higher on "fp." Now, earlier, comparing against the E-core, everything was positive, no negatives. Here, it's a trade-off: on the S-curve you can see a few workloads below the line and a few above. So I don't mean to tell you that every workload is 2% faster in IPC on this IP. It's a little bit of a trade.
Little bit of a trade. But fundamentally, geo-mean SPECint and SPECfp, 2%. So let's map that to power and performance.
This is the full peak power-and-performance curve between the two: Skymont on the Lunar Lake process against Raptor Cove on the Raptor Lake process.
You can see the peak performance is higher on Raptor Cove; again, it can scale to six-plus gigahertz. So let's zoom in to the power envelopes that are more likely for an E-core or the low-power island: 0.6x the power at the same performance level in the middle of the curve, or 20% higher performance at the same power level.
This is what you're getting out of an E-core. This is what you're getting out of Skymont. This is what we think is key to driving hybrid, to driving PC efficiency, to providing long battery life, and providing a great user experience for Lunar Lake. So these are our goals. More workload coverage, that's the 4x.
More things we can run. Double the vector and AI, that's the floating-point performance I showed. And scalability.
Again, we did this efficiently. Some of you have probably seen the floor plans of Lunar Lake; you can probably find the E-cores, though I don't know if we've called them out yet. We didn't break the bank, and we can scale that out in general.
(person coughing) Thank you. That's Skymont. (audience applauding) (upbeat electronic music) (no audio)