Architecture All Access: Live at Lunar Lake ITT: The Magic of Intel Thread Director Technology

(soft music) (bright techno music) - Good afternoon, everybody. Hope everyone had a great lunch. I will try my best to make sure you guys don't fall asleep, but there's coffee outside just in case. (audience laughing) Welcome to the Taipei Tech Tour. I am really excited to talk to you about the new improvements, innovations, and advancements we are doing for the Thread Director technology that we introduced in the Alder Lake timeframe. And then we are gonna talk a little bit about some of the power management innovations.

Arik and Yaron talked a lot about the new things we have done on the SoC side for power management. Here, I'm gonna focus more on what helps power and performance, and how we communicate that to the operating system via Thread Director. So let's jump in. Before we get into the Lunar Lake details of what changes and optimizations we have done there, I wanna level set a little bit on what Thread Director really is, right? Because some of you, I know, are intimately familiar with the topic.

You've been here with us starting from Alder Lake, you know, Meteor Lake, and now Lunar Lake. Some of you are new. So when we launched our performance hybrid architecture with Alder Lake, essentially it was two different microarchitectures on the SoC, right? If you think about it, they had the same functionality and the same instruction set support. It's not like we execute one instruction set or ISA on one core and not on the other; it was all the same. But because the microarchitectures are different, the design choices there are different, and the performance of a given sequence of instructions differs from one core to another.

And typically, if you think about it from a software perspective, like the operating system, they know about priority, foreground, background, quality of service, etc. But they don't know, "Oh, am I executing VNNI, and is VNNI gonna be better on this core under these thermal and power constraint scenarios?" They don't have insight into that, and neither should they, right? Because we don't want to change the operating system for every microarchitectural advancement we make on every core type or generation of products. So this is where Thread Director came in. Basically, we designed Thread Director so that hardware encapsulates the information it has access to and makes it available to the OS as a hint, right? We don't move any threads behind the scenes from the operating system. We don't say, "Oh, let me move this thread from this core to this core." No, that's not our job.

Our job as part of Thread Director is to provide a hint to the operating system, so the operating system can use that information to come up with the right placement using its own intelligence. So the ultimate decision is the operating system's, but we provide guidance from our side.

So let's jump into the architectural recap of Thread Director. On the screen here, you see there are two components in Thread Director, right? On the right side of the screen, if you will, or your left side, the blue box we have in between basically has two components. One is classification, the one you see with the classes zero to three; that's the classification part.

And that basically is done by the core IP, the P- or E-cores. Stephen already talked about the cores individually. They do classification of instructions, saying, "Hey, if I'm running this instruction or sequence of instructions, what's my IPC?" That's how we measure performance here, in terms of IPC per core. And then there's the IPC ratio between the P- and E-core, right? That's what you're seeing on the chart there. The core does that classification.
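As an aside, if you want to check for this support yourself, it's enumerated through CPUID. A minimal sketch follows; I'm assuming a GCC/Clang toolchain for cpuid.h, and the bit positions are my reading of the SDM's leaf 06H description, so treat them as something to verify against the manual:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 06H: thermal and power management enumeration. */
    if (!__get_cpuid(0x06, &eax, &ebx, &ecx, &edx))
        return 1;

    int hfi = (eax >> 19) & 1;            /* hardware feedback interface  */
    int itd = (eax >> 23) & 1;            /* Intel Thread Director (EHFI) */
    unsigned classes = (ecx >> 8) & 0xff; /* number of ITD classes        */

    printf("HFI: %d, Thread Director: %d, classes: %u\n",
           hfi, itd, itd ? classes : 0);
    return 0;
}
```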

Then there is a table, the feedback table or enhanced hardware feedback interface, as we call it. It's documented in the external SDM manuals that we publish. And what that provides is the SoC's view of: if you are running a class zero instruction, or class one, or class two, which is the best core to run it on? Now, from an SoC or hardware perspective, we don't know if you're running on AC, DC, battery, or, you know, some special power plan that an OEM creates, etc.

So from our perspective, we provide two sets of information per class to the operating system: a performance capability, or perf index, and an energy efficiency capability, or EE index, that you see here, right? And then depending on the intent, the power slider, or whatever else the operating system has, it uses that feedback. Let me give an example. In this case, say somebody's running a class zero instruction, because the core classified it as class zero, scalar instructions, etc., right?

And the OS wants to maximize performance. That's the intent, because maybe I'm running a game in max performance mode and I wanna maximize performance here. So the OS is going to read the perf column that you see there, with the perf capability. And whichever is the highest, highest meaning the best core, it's gonna use that feedback and say, "Hey, I'm gonna try and schedule this thread here," right? So in this case, that becomes P-core N, because that has the highest number, 100, associated with it. If the OS is optimizing for efficiency, because I'm running in best power efficiency mode, DC mode, a custom OEM mode, anything, right? Then it's going to look at the best efficiency index for the work that needs to be done.

And if it's there at the top, it says, "Oh, it's E-core N. That's the most efficient core to run on," right? If that core is busy, it's doing something else, the OS goes down the list: say P-core N is busy in the performance case, the next best core is P-core 1, so let me choose that core, right? So ultimately, the decision is owned by the OS. We give a relative ordering of which core is best to run on. And that's basically the logic behind Thread Director.
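To make that lookup concrete, here's a minimal sketch in C of the relative-ordering logic. The struct layout, names, and numbers are illustrative, not the real EHFI memory format (that's documented in the SDM); only the selection rule matters:

```c
#include <stdio.h>

#define NUM_CLASSES 4  /* classes 0-3, as on the slide */

struct core_feedback {
    const char *name;
    int perf[NUM_CLASSES]; /* higher = more performant for that class */
    int ee[NUM_CLASSES];   /* higher = more energy-efficient          */
    int busy;              /* core already occupied by other work     */
};

/* Pick the best free core for a thread of the given class; want_perf
 * mirrors the OS intent (max performance vs. best efficiency). */
static int pick_core(const struct core_feedback *c, int n,
                     int cls, int want_perf)
{
    int best = -1, best_score = -1;
    for (int i = 0; i < n; i++) {
        if (c[i].busy)
            continue; /* busy: go down the list to the next best core */
        int score = want_perf ? c[i].perf[cls] : c[i].ee[cls];
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best; /* -1 if everything is busy */
}

int main(void)
{
    /* Toy table, class 0 values only, echoing the example above. */
    struct core_feedback cores[] = {
        { "P-core 1", { 90 },  { 40 }, 0 },
        { "P-core N", { 100 }, { 45 }, 0 },
        { "E-core N", { 60 },  { 80 }, 0 },
    };
    printf("perf intent -> %s\n", cores[pick_core(cores, 3, 0, 1)].name);
    printf("EE intent   -> %s\n", cores[pick_core(cores, 3, 0, 0)].name);
    return 0;
}
```

With the perf intent this picks P-core N (the 100), and with the efficiency intent it picks E-core N, exactly the walk-through above.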

So because Thread Director provides input into scheduling, I wanted to quickly walk through how the scheduling evolution has happened, from Alder Lake and Raptor Lake to Meteor Lake to Lunar Lake, because I know some of you had scheduling-related questions in earlier sessions. On Alder Lake and Raptor Lake, we had a shared uncore, a shared ring, between the P- and E-cores; we did not have the low power island journey that we started in the Meteor Lake timeframe, the SoC tile, etc.

We did not have that on Alder Lake or Raptor Lake. We essentially used something called standard scheduling with Microsoft, right? They and us, we have been collaborating very tightly on these things. Standard scheduling means: start from the most performant core, go down the list, and then use SMT. We talked about HT or SMT, hyper-threading, before. So the first order was: start from the P-cores, then expand to the E-cores for any scaling or threading benefits we want to see, and last, use the SMT or hyper-threaded siblings.
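Sketched as code, with the caveat that the real scheduler is Microsoft's and far more involved, that ordering looks something like this:

```c
#include <stdio.h>
#include <stdlib.h>

/* Standard scheduling (Alder Lake / Raptor Lake era), sketched:
 * most performant physical P-cores first, then E-cores, SMT last. */
enum kind { P_CORE, E_CORE, SMT_SIBLING };

struct core { const char *name; enum kind kind; int perf; };

static int by_schedule_order(const void *a, const void *b)
{
    const struct core *x = a, *y = b;
    if (x->kind != y->kind)
        return (int)x->kind - (int)y->kind; /* P, then E, then SMT   */
    return y->perf - x->perf;               /* most performant first */
}

int main(void)
{
    struct core cores[] = {
        { "P0-sibling", SMT_SIBLING, 95 }, { "E0", E_CORE, 60 },
        { "P1", P_CORE, 100 },             { "P0", P_CORE, 95 },
    };
    qsort(cores, 4, sizeof cores[0], by_schedule_order);
    for (int i = 0; i < 4; i++)
        printf("%d: %s\n", i + 1, cores[i].name); /* P1, P0, E0, P0-sibling */
    return 0;
}
```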

When we moved to Meteor Lake, what we ended up doing was starting the work, in some cases, on the SoC tile. Again, this is hetero scheduling. We use that policy based off of how we want to maximize energy efficiency, etc. So we started from the SoC tile that we had.

Now, Meteor Lake had two E-cores on the SoC tile. We talked about it earlier; Stephen presented some data on that as well.

So we scheduled the work there. If the work could fit, we always used that, right? That gives us more efficiency. If the concurrency of the work increased (and I'll walk through the Teams example later, because since COVID we have all gotten very used to video conferencing: Teams, Zoom, whichever collaboration software you like) and the work didn't fit, we would move to the E-cores on the compute tile, because we expand the work there.

And then, if the work still didn't fit and needed more compute capability, because our P-cores do provide snappy performance and responsiveness, we would move to the P-cores. That's what we used on Meteor Lake, right? A similar idea is used on Lunar Lake as well, but on Lunar Lake, we now have our low power island, or LP island, the E-core complex, with four E-cores on it.

These are our Skymonts. Awesome architecture, you know, great performance improvements as well as energy efficiency improvements. Lunar Lake has done an amazing job in terms of getting all these IPs together. So we start our work there. If the work fits, we keep running there.

Let's say the concurrency increases; now we have four cores to run on. And if we again need that boost in performance or snappiness, the work doesn't quite fit, or it can benefit from going to higher frequencies, then we jump to our P-cores, which are the Lion Cove cores. Okay, so let's look at it in action.

So this is kind of a productivity example, and I will go ahead and do the first click-through here. This is when somebody's doing, you know, new PowerPoint creation, office productivity usage. I open a PowerPoint, I start typing, copy and paste some pictures, etc. All of that work fits really well on our E-cores, right? And from a scheduling perspective, you see there are one or two threads, sometimes three threads, being used; we have our work being executed, and we use those cores well. And they do their job perfectly well in terms of providing enough user experience to do these tasks.

Now let's say somebody starts doing Excel modeling, right? So this is Monte Carlo or something else going on. When that work starts, Thread Director is going to know, "Hey, now I'm doing much more work in terms of instruction usage, SoC consumption, concurrency, etc." And then, together with the OS, it's gonna come to the decision, "Hey, now it's time to move the work to the P-cores. Why? Because the P-cores can provide that additional performance that you need."

So we started from the E-cores and utilized them as long as they provided benefit for us: great user experience for day-to-day office productivity tasks. When the need or the requirement for compute exceeded what we had, Thread Director provided that feedback to the operating system, and the operating system used it to schedule the threads accordingly on the P-cores. That's how the evolution of scheduling has happened with Thread Director.

So let's talk a little bit about the new things we have added to Thread Director from a Lunar Lake perspective, right? First and foremost, more optimized, enhanced feedback given to the operating system; I'll talk a little bit about it. There are telemetry enhancements and other things that we'll capture. I'm really excited about the next feature, which is called OS containment.

And this is a feature where we are excited to partner with Microsoft, to make sure the work that can fit on our low power island, the E-core complex, works well there, and we get efficiency benefits as well as performance benefits out of it. We'll look at some of the data from the Microsoft collaboration; we are very excited to talk about that. Then we have some new power management optimizations that we have done.

So Arik talked about a lot of SoC optimizations with his team. We have been collaborating a lot to bring new power management enhancements to real software that can benefit, and to get the product goodness out. And last but not least, we have something called platform intent: consuming the OEM hint of "Hey, I want to maximize performance" or "I want to minimize power." We have our own platform value-added software that we give to our customers, the OEMs, which is called Dynamic Tuning Technology, DTT; some of you might have heard about it.

And they use that to configure systems for maximum performance or maximum efficiency. So we have created an interface, starting in the Meteor Lake timeframe, to get that hint or information to the SoC so we can do some dynamic optimizations at runtime. We'll talk a little bit about that as well.

So let's drill down into the first one. Stephen and Ori talked about the E- and P-core IPC gains that we see. Phenomenal products, a lot of gains there, right? But if you think about it, when we created Thread Director in the Alder Lake timeframe, the P- and E-cores there had certain IPC characteristics. Now, with the enhancements coming with the Skymonts and Lion Coves on Lunar Lake, we had to reevaluate where the classification boundaries are, right? Because whatever was, let's say, class two, which gave maximum performance on P-cores in the last generation, may not hold right now. So those enhancements and changes required that we add some new telemetry information. We do our training; our machine learning layers identify the work being done.

On the SoC, we do that differently, so all of these new things were added there. We have finer granularity in how workload identification is done, right? And it's not like we statically list something, saying, "Oh, I'm running Teams," or "I'm running a game." That's not the intent of it.

It's looking at what load is seen on the SoC, on various components, and trying to figure out from there: hey, does this look like a bursty type of workload? Or is this a battery-life type of workload, right? So we made enhancements there, and that information is used here as well. And last but not least, we have a special hint going to the operating system under very severe power and thermal constraints. We're talking about low-TDP scenarios where you have docks connected, GPU, NPU, everything running, and not enough power budget left, etc. In those cases, we have a special hint going to the operating system to make sure we have continuity of the experience we are offering to end users, right? We don't get unexpected system behavior in that case.

So that's something that, again, we partnered with Microsoft to create and deliver. With all these enhancements, the IPC gains we looked at on the P- and E-cores and everything, consider a realistic scenario where, let's say, we are doing CPU-based AI, like VNNI work going on; there's still a lot of ISV interest in keeping AI on the CPU because they want latency sensitivity, etc. And then we have other work happening on the system, right? Because you do have non-AI usages: creation, floating point work, integer work. If you mix all of this and you enable Thread Director support, we see about a seven to 10% benefit in realistic use cases, where Thread Director still recommends to the operating system to direct the right work to the right core.

But on Lunar Lake, as Rob mentioned when he was giving the keynote, performance is not the only goal of Thread Director. We wanna tie it to efficiency, provide hints that now this is the efficient core, and utilize that as well. So let's talk about this Windows feature called OS containment that we partnered with Microsoft on creating. The OS that's gonna launch along with the Lunar Lake timeframe has a feature called containment zones, and there are three main containment zones that are created.

One is the efficiency zone. The way this feature is designed, the OS looks at Thread Director feedback to see which cores Thread Director is recommending as efficient and which as performant, and then it creates the zones based off of that. So the efficiency zone is essentially what we have on our low power island, right? Because those cores are generally gonna be the most efficient, and that feedback is exposed to the operating system, which can create the efficiency zone based off of it.

Then we have a hybrid or compute zone being created. Now, in Lunar Lake, on our compute domain, we have only P-cores, right? Previous products had a combination of P- and E-cores; in the future, things will likely look different, etc., right? So here on Lunar Lake, our compute zone or hybrid zone includes the P-cores that you see.

And then there is zoneless, or non-zone, which basically means: use all cores, as it happens today. Now, customers like OEMs, if there are any OEM customers in here, get a choice in how to set up these parameters. There's a PPM knob, the processor power management knob, that the operating system provides. And Intel, of course, will provide default recommendations for what we think is best for this product. But you as a customer get a choice to say, "Hey, I want to tune this more aggressively to stay in the efficiency zone more," or, "No, in this case I want to tune it less aggressively, because I want to jump the fence very quickly." Right? So you get all these knobs to play with, and we can go from there; a rough sketch of the idea follows.
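To give the zones a shape, here's a hypothetical sketch; the struct, field names, and threshold are stand-ins for the real OS processor power management settings, which carry Microsoft's names, not these:

```c
/* Containment zones, sketched. Everything here is illustrative. */
enum zone_kind { EFFICIENCY_ZONE, COMPUTE_ZONE, NO_ZONE };

struct zone_policy {
    enum zone_kind kind;
    unsigned core_mask;      /* which cores the zone contains           */
    int stay_threshold_pct;  /* how hard to resist jumping the fence;
                                higher = stay in the zone more
                                aggressively (the OEM-tunable part)     */
};

/* Lunar Lake shape: efficiency zone = the four LP E-cores, compute
 * zone = the Lion Cove P-cores, no-zone = all cores, today's behavior. */
static const struct zone_policy lnl_example[] = {
    { EFFICIENCY_ZONE, 0x0F, 90 },
    { COMPUTE_ZONE,    0xF0, 0  },
    { NO_ZONE,         0xFF, 0  },
};
```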

I'll let you read the quote, I'm not going to read it, but this is what our partners at Microsoft, Bret and Tapan, who we worked with very closely on creating this, had to say about the technology; they were super excited to work on this as well. So let's talk a little bit about the scheduling, and I'll quickly walk through the animation of it, right? We use hetero scheduling as the hybrid scheduling policy.

Work starts; let's say a single thread comes in. Per the hetero scheduling policy, we use the E-cores and start by putting the work there. Now, if the work expands because it needs more threads, let's say it's a two-threaded or four-threaded application, we expand it to the other E-cores that are available. And if the work doesn't fit there anymore, Thread Director provides that feedback to the operating system; the operating system uses its own intelligence and moves the work to the P-cores, right? That's kind of the general idea.
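A minimal sketch of that decision, with invented thresholds, just to pin down the flow:

```c
/* Hetero scheduling decision, sketched: keep work on the LP E-core
 * island while it fits, spill to the P-cores when it doesn't. The
 * thresholds are made up for illustration. */
enum placement { LP_E_CORES, P_CORES };

static enum placement place(int runnable_threads, int island_util_pct,
                            int needs_burst_perf)
{
    const int island_cores = 4; /* Lunar Lake's LP E-core complex */

    if (runnable_threads <= island_cores &&
        island_util_pct < 90 &&
        !needs_burst_perf)
        return LP_E_CORES; /* the work fits: stay on the island */

    return P_CORES;        /* jump the fence for more compute   */
}
```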

So I want to walk you through some real examples, because animations are great, but we want to see how it works in reality, right? Here you see some drawings and diagrams, but these are actual traces of a system running Meteor Lake, when Meteor Lake was launched, with whatever optimizations we did then; in the next slide, you're gonna see Lunar Lake, right? These are actual ETL traces, Windows Performance Analyzer traces, that we collect, and for better visualization in the room, we put them in a pictorial way, right? So this is running a realistic IT workload; we all love our IT laptops and everything. It has lots of background services running, security things. I don't know, Intel now has some of those things too.

The kind that prompts you: hey, this is the time to take a break and stretch. We have that software running, etc., right? And then we do real work.

Like, I sometimes have Java compiling on my system 'cause I'm doing something there, or I have WPA running 'cause I'm looking at some trace, right? So what this is showing is, again, we start from the top, the LP E-cores, right? Then, if demand increases, you can see in the middle box that we move the work to the E-cores on the compute complex; this is Meteor Lake, so it has E-cores on the compute complex, and we moved the work there. Then we expanded the work to the P-cores. And when the utilization died down, we moved the work back to the LP E-cores.

This is how it worked on Meteor Lake. Again, we had two cores there, so we worked around having only two cores versus having more; when the concurrency exceeded them, we had to move the work over. Now, this is again a real trace of the same type of usage, as it looks on Lunar Lake. You can see that most of the work, almost all the time, stays on the E-core cluster.

Why? Because in reality, and we have even seen this from telemetry data, you usually see anywhere between two- and four-thread concurrency on productivity types of usage. And that's where having four cores here is really helpful.

And again, with those IPC gains we talked about, it really helps to get the performance the user is satisfied with. On the P-cores, you see some minor blips of activity. Those are sometimes wake-up timers, interrupts going on, some management thing happening, et cetera. We are working on fixing some of these software hygiene issues as well; it's not just OS-related, but also other drivers and third-party software.

Intel has an enabling team that works on that. But in general, to start with, we have much cleaner behavior in such cases. Let's talk about some of the power management items that we have. If you look at the middle box here, that is like an SoC view, and I'll show in a later slide how we bring it all together, taking customer and OEM inputs on the platform intent. But we have, if you will, three types of modes in here.

There's a performance SoC mode, a balanced SoC mode, and a power SoC mode, or energy efficiency SoC mode, right? And depending on the type of load running on the system, if somebody is running a game and chose the performance mode, we will take actions in our internal power management algorithms: hey, I see some bursty activity going on, or I see some sustained activity going on; I know the user wants to maximize performance, and here are the best cores to run it on, right? And that decision is later communicated to the operating system via Thread Director, because that's our way to talk to the operating system. But in that big blue box in between, we don't just recommend core types to run on. Now, with Lunar Lake, we are also doing some intelligent resource management and frequency selection, to say, "Hey, even if you're in performance mode, if you're running a battery-life type of workload such as browsing or video conferencing, why do I need to go to the highest frequency or resource consumption?" We take decisions there and communicate them to our hardware, and the SoC acts on that, right?
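Here's a minimal sketch of that idea. The mode names follow the talk; the MHz numbers and the function itself are invented for illustration:

```c
/* SoC mode vs. load type, sketched. The point: even in performance
 * mode, a battery-life style load (browsing, video conferencing)
 * doesn't need peak frequency. */
enum soc_mode  { SOC_PERFORMANCE, SOC_BALANCED, SOC_POWER };
enum load_kind { LOAD_BURSTY, LOAD_SUSTAINED, LOAD_BATTERY_LIFE };

static int freq_ceiling_mhz(enum soc_mode mode, enum load_kind load)
{
    if (load == LOAD_BATTERY_LIFE)
        return 2000; /* meets FPS/bitrate/quality without peak clocks */

    switch (mode) {
    case SOC_PERFORMANCE: return 4800; /* let bursts hit max turbo */
    case SOC_BALANCED:    return 3600;
    case SOC_POWER:       return 2400;
    }
    return 2400;
}
```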

This was a collaboration we did with Arik's team, and we see fantastic results out of it. So let's take a look at how this works with Teams, right? Because we all love Teams; we do conferencing, we spend most of our days on that. This was the Meteor Lake behavior that we saw. Again, this is from real traces; it's not just a picture. It's from real Windows Performance Analyzer traces.

We start with the LP E-cores. When demand increases, we go to the compute E-cores and then eventually to the P-cores. Again, we only had two LP E-cores, and Teams, in general, has cases where it'll go to four-core concurrency; it needs four threads at a given time, right? About 20 to 30% of the time, it needs that. I think all the other video collaboration software works the same way.

So two LP E-cores were not enough at that time, and we had to jump the fence, which was the right thing to do. Now, on Lunar Lake, because we have the four LP E-cores, and we have all these IPC improvements coming, we are able to keep the work here. We meet the FPS, bitrate, and quality requirements, all of those.

It's not that we are dropping anything, right? We are not dropping any frames, but we keep the work there, and it's cleaner. And this is all with enabling features like containment and the internal power management optimizations, so that we don't jump the power too high. So what's the result of all of this, right? It's great to see the scheduling behavior and all that, but how does it translate into real power savings? So here, you're seeing an example of Teams power reduction. What we are comparing is the Lunar Lake baseline, without containment and without these power management optimizations that I talked about.

And then, in the light blue colors, with them enabled, and we see up to 35% power reduction. We are super proud of all of this, right? This is a great effort by the whole team to deliver on this. And this is just one example.

This is not specific to Teams. We are seeing that KPIs like productivity, browsing, video playback, and streaming all show great benefits in terms of reducing power, and you will see some of it at launch, so I would definitely stay tuned for that. So let's talk about our last portion, which is consuming platform intent. This goes back to when we launched Thread Director, in the Alder Lake timeframe.

We got a lot of OEM requests at the time, saying, "Hey, how can I provide inputs to you? Because this looks useful. I want to be able to control it and say, 'I'm running browsing now, try and use these core types,' or 'I'm running gaming, try and use these core types.'" So we had a lot of discussions with our partners, customers, and even Microsoft on how we consume this intent in the information that we get. We started this journey in the Meteor Lake timeframe, and Lunar Lake has added more enhancements to it. Below the green box, the light blue box that you see, that is our Intel Dynamic Tuning Technology software layer. Most of our mobile designs, our customers, ship with this software because of the extra goodness it provides for Intel hardware, right? And if you look at this, it provides something called gears, or sliders.

Basically, the customer can choose: hey, I want more performance now, or I want more efficiency now. And these gears go from, you know, gear one to gear seven, or, depending on the platform, we have different numbers there. One means max performance; a higher number means more power savings.

Now, in previous generations, before Meteor Lake, we used the software layer to consume this information and come up with different tuning knobs on the PPM side, the OS side, etc. Since Meteor Lake, we are passing this information to hardware as well. And Lunar Lake has the SoC modes that we created, so the gear now maps to those, and depending on the type of load being run on the system, whether it's a battery-life type of workload, or bursty, or sustained, benchmarking mode, etc., we take certain actions and provide optimal frequency recommendations and optimal core count recommendations, which get communicated to the operating system for its own consumption. So this is bringing it all together, end-to-end, as it looks to our customers, right? And again, all the recommendations on core type usage get communicated to the operating system; it owns the ultimate decision, right? So that's kind of the key here.
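Sketched, the gear-to-mode step might look like this; the mapping itself is hypothetical, only the gear numbering follows the talk:

```c
/* DTT gear -> SoC mode, sketched. Gear one = max performance,
 * higher gears trade toward power savings. */
enum soc_mode { SOC_PERFORMANCE, SOC_BALANCED, SOC_POWER };

static enum soc_mode soc_mode_for_gear(int gear)
{
    if (gear <= 2)
        return SOC_PERFORMANCE;
    if (gear <= 4)
        return SOC_BALANCED;
    return SOC_POWER; /* gears 5-7 favor efficiency */
}
```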

So what are some of the implementation recommendations we have for our customers? To customers, we say: definitely use the latest and greatest SDKs. We work with OpenVINO, ONNX, the latest compilers, and any other frameworks that we have and provide for the architecture.

Any IPC gains that we get, any new ISA usage that we have, whatever code generation pattern runs better on Skymonts and Lion Coves: it's all in the latest SDKs. Our software teams, which I'm part of, work with them to integrate all of this goodness that our architecture team creates. Recompiling with the latest SDKs definitely helps; it takes the performance or efficiency a notch further than what you currently have, right? To developers: I think there was a question before about whether we offer something like pinning. We actually don't recommend doing that, and there is a reason for it, right? For example, consider a scenario where you are running a performance scenario.

And you said, "Oh, I know I'm running a performance scenario; I always want to run on P-cores," right? And you pin that software accordingly. Now think about this scenario: we are in a generation of AI right now, the AI PC and everything. So let's say Copilot models run in the background and the NPU is taking some power.

Or there is some graphics-heavy activity happening, the GPU takes some power, and now the CPU has less power budget left. In that case, for certain types of instructions, the E-cores actually become more performant. But if a developer pinned to the P-cores, then whatever Thread Director recommends, the table saying the dynamics have changed, the E-cores have become more performant, a higher number written for the E-cores, is not going to take any effect. Because the developer has provided affinity, affinity wins, and the OS and Thread Director are gonna throw all the best recommendations down the drain, because they can't act on them. So we highly recommend to our developer community: don't use affinities. Tell us your intent, and let the hardware and OS take the right decision based off of that.

Now, I think Ori and Stephen were talking about it: if you want to provide a hint to the OS and to Thread Director, there are QoS APIs available. MSDN has them; you can search for the power throttling APIs and you'll find them.

Come see me later if you need a link; I am happy to send it on. But with that (crackling drowns out speaker) try and use the most efficient cores all the time, right? It gives a hint to the operating system, and the operating system can tell it to us. And that way, we can work together to deliver a better product. And of course, with the latest and greatest ISA, we always have improvements coming there, so do use that. That's one of our recommendations as well.
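If you want to see the shape of those APIs, here's roughly what the documented power throttling call looks like. SetProcessInformation with PROCESS_POWER_THROTTLING_STATE is the real Windows interface (a matching SetThreadInformation variant exists for individual threads); check MSDN for the authoritative details:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Opt this process into EcoQoS: a hint that the scheduler may
     * favor the most efficient cores and lower frequencies for it. */
    PROCESS_POWER_THROTTLING_STATE state = {0};
    state.Version     = PROCESS_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = PROCESS_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = PROCESS_POWER_THROTTLING_EXECUTION_SPEED;

    if (!SetProcessInformation(GetCurrentProcess(),
                               ProcessPowerThrottling,
                               &state, sizeof(state))) {
        printf("SetProcessInformation failed: %lu\n", GetLastError());
        return 1;
    }

    /* Note: the same ControlMask with StateMask = 0 instead opts the
     * process *out* of throttling (forces high QoS); leaving both at
     * zero returns control to the system default. */
    return 0;
}
```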

So before we close and open for questions: where are we going with Thread Director? We get asked, "Hey, you created Thread Director in Alder Lake and Raptor Lake. Is it done?" No. I mean, you see the innovations that are being put in, right? Yes, performance is one benefit. Our performance benefits change from product to product and generation to generation, based off of all the great work our IP teams, the E-core and P-core teams, do. But as Rob said in his keynote, we are focusing on efficiency as part of Thread Director in Lunar Lake quite a bit, right? So that's one of the areas we are going to double down on.

Then there is scenario granularity. We are going to go and identify more scenarios that we want to optimize. And again, this is not a static listing.

It's not that we say: if running a game, do this; if running a creator app, do this. That's not the intent. It's about what the SoC sees as a load. Is it a high load? Is it more concurrency, lower concurrency, a lower load? That's what we look at. There is constant optimization by our ML team, who work on some of these firmware improvements that go into Thread Director, and we use new AI-based scheduling mechanisms for that.

We provide recommendations and optimizations to Microsoft for that, and that is continuously evolving as well. And last but not least, and I'm personally super excited about this: if you think about Thread Director today, it does P- and E-core scheduling, right? Which one is the better core from a performance or efficiency perspective? But as we talk about the AI evolution, we have the NPU, GPU, and CPU all playing in there. So how do we leverage some of the hardware innovations we already have in place, like Thread Director, to figure out the right IP to send some of these AI workloads to? I won't go into details, because we are going to keep that for future sessions, when we come and talk to you about our future product launches.

But that's something I wanted to give a sneak peek on, if you will, of what we are looking at. And with that, that's our summary of Thread Director: more intelligent feedback; great collaboration with Microsoft, with the containment zones and things, super excited about that; improved power management, the things we are doing in terms of getting more efficiency out of the platform; and then consuming the platform intent. With that, I think we are done.

Thank you for your time; we're open for questions. (bright techno music)
