Hey, everybody. Good morning. I'm Robert Hallock, and I have the honor of walking y'all through the performance and power of Lunar Lake today. This is a very special CPU. I've worked on many CPUs in my career. We all have.
We've all been in this a long time, but this one really stands out. And it stands out because there was so much that's new in this part, so much data to generate, so many ways to analyze it. But we've worked really hard to give you the most complete picture possible. So I want to start by level-setting on what you're going to see. First: all the data you're seeing today is collected from customer systems, not reference platforms, not science projects on a test bench, real laptops that are imminently arriving in the market.
We're using a software stack that is a little shy of what you'll get on your review systems, or the public will get on their systems, so we do expect to see a little performance upside as well when you get your chassis. We've tried to be conservative with our data. On the OS side, all default OS settings like VBS have stayed on to generate the data you will see today. And from a performance point of view, everything was collected on the Best Performance power plan in Windows. But for battery life, of course, we're unplugged on Best Efficiency.
Wi-Fi and Bluetooth were enabled and associated as well to collect this data. So that's just some of the ground rules that we've developed to produce this data. And one more thing on power: when you see competitive data today, those chips were equal to or higher in power than Lunar Lake.
So we actually put ourselves at a bit of a disadvantage sometimes in some of this data. Conservatism in the data is extremely important. Finally, you'll see multiple chips throughout the benchmarks because that's what's available in OEM systems.
We're using chassis that are available off the shelf. Overall, I know you and your audience are probably going to wait for benchmarks, but I'll say our team is pretty fanatical about trustworthy data. And many of you know, if you know me personally, that I'm quite particular about that. So those are the ground rules. Let's start. We'll jump right in.
I'm sure I sound a bit like a broken record, but Lunar Lake is all about energy efficiency. Lunar Lake has a couple of noteworthy innovations that we should recap. First, we deployed memory on package to cut the power consumption of talking to RAM. Now, that's not something you do on every chip, but it makes a lot of sense when you're chasing every last milliwatt of power.
We added a memory-side cache to selectively obviate memory hits and protect the performance of the cores, saving even more power by reducing the high cost of going to memory. And we changed the core hierarchy and topology to enrich the dynamic range of performance and power. Altogether, we aimed to reduce total package power by 40% versus the Meteor Lake generation, basically in half, in less than one year's time. As a quick level-set on that core topology, which plays a big role here: you recall that Meteor Lake had three tiers of cores. The most efficient were the low-power E-cores, then above that the E-cores, and then the P-core complex.
Our objective with the LP cores in the Meteor Lake generation was to contain workloads that are high on power but low on performance needs. And when it comes to productivity, teleconferencing, web browsing, those types of workloads are actually quite common, and they can squander a lot of power if you're not careful about managing them. In practice, however, we found that there wasn't enough compute performance in Meteor Lake's LP cluster to achieve this goal all the time. So in Lunar Lake, to address that limitation, we merged the low-power island and the E-core cluster into one single unit. Frankly, this was made possible through the awesome Skymont core.
Twitter and Reddit are jokingly calling it Chadmont, which always makes me laugh because it's not wrong. Skymont basically takes P-core-level performance and drives it down into E-core-level power. That was the goal of the Skymont architecture. We're thrilled with the payoffs for these changes. So behind me you're going to see a selection of Meteor Lake, Snapdragon, and Lunar Lake CPUs running Procyon Office.
If you're not super familiar with this benchmark, it's a nice way to uniformly test commercial workloads. It has Office: Outlook, Excel, PowerPoint. And that's the face of a computer to a huge portion of the market that will be buying a Lunar Lake system. Lunar Lake is slightly ahead of the pack overall in performance, you can see the little blue bar, plus 7%, but it crucially does that at the lowest power of all three chips here.
That means we're over twice the energy efficiency of Meteor Lake, and a very pleasing 20% better than Qualcomm, in office productivity from a performance-per-watt point of view. The Xe2 graphics architecture also plays a huge role in the energy profile of Lunar Lake. Energy efficiency in a GPU is really tough. You can keep performance flat and cut power.
You can keep power flat and boost performance, or you can boost performance and cut power. The last one is what Xe2 does. Taking Cyberpunk 2077 as a representative example: performance is up about 40%, but we chopped 20% off the power of the graphics engine itself at the same time.
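As a back-of-envelope check on how those two deltas combine into an efficiency figure (simple arithmetic on the Cyberpunk numbers just quoted; the helper function is mine):

```python
# Perf-per-watt scales as (relative performance) / (relative power).
def efficiency_gain(perf_scale: float, power_scale: float) -> float:
    """Relative change in performance per watt."""
    return perf_scale / power_scale

# Xe2 vs. the previous generation: ~40% more performance at ~20% less power.
print(f"{efficiency_gain(1.40, 0.80):.2f}x perf/W")  # ~1.75x
```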
Net-net, we've made huge reductions in overall package power. The chart you see here shows instrumented power reductions, measured with an over-$100,000 lab device that can directly measure the power rails going into the SoC. We're actually doing these power reductions on hard mode, because you recall that we added memory to the package as well.
And Meteor Lake does not have memory in the package. So even with memory in the package, we still lowered overall power consumption in the chip by up to 50%. Video playback is down 30%. Teleconferencing is down around 40%. Remember that our original goal was 40%.
So I think it's safe to say that we comfortably cleared the bar. Mission accomplished. Slashing power by up to 50%, even after adding memory into the mix, was a huge achievement for the team, and we exceeded our goals. Of course, instrumented package power is somewhat abstract. You and I love metrics like this.
It's a lot of fun, but the real face of power to most people is battery life. The data you see here was collected in a very rare opportunity in the notebook market. If you've dealt with laptops a long time, you know how hard it is to get two absolutely identical systems.
Really, really tough. It so rarely happens, but that's what we've got: everything is identical about the systems you see here except for the motherboard and the CPU.
In highly dynamic power cases like office productivity, we're clearing 20 hours on this customer system and winning by about two hours overall. We can't win them all, though, and that's okay.
While we're in the hunt on teleconferencing, we're back by about two hours on that, and I'm actually super okay with not being on a Teams call for 12 hours. That's fine. If we want to expand the view to include the new Ryzen AI series processors, we have to shift the matchup to a different set of systems. So this time, we're looking at a series of notebooks from the same OEM, all three chassis.
These range from 14 to 16 inches. They have the same z-height. They were tested at 1080p and 150 nits of brightness, conforming with the ground rules we laid out at the start. Battery capacity is around 75 watt-hours for each one of these systems, with Lunar Lake actually the smallest at 70 watt-hours and Ryzen the largest at 78 watt-hours. So we have the smallest battery in this mix.
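As a rough sanity check on what figures like these imply, runtime is just capacity divided by average platform draw. A sketch with illustrative numbers (the 70 Wh capacity from this slide paired, for illustration only, with the roughly 20-hour productivity result from the earlier system pair):

```python
# Battery runtime (h) ~= battery capacity (Wh) / average platform power (W).
capacity_wh = 70.0   # the Lunar Lake system's battery in this comparison
runtime_h = 20.0     # ballpark productivity runtime from the earlier slide

avg_draw_w = capacity_wh / runtime_h
print(f"~{avg_draw_w:.1f} W average platform draw")  # ~3.5 W, display and all
```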
In this view, Lunar Lake still walks away with the win in active power workloads. That's true against the newest CPUs from both of our competitors. Before I move on to CPU performance, it's worth a little wrap-up. I just showed you that we slashed our generational package power by 50%, and that we're beating Snapdragon in performance per watt by about 20%. Pre-production customer systems are already achieving battery life that is extraordinarily competitive with any other processor in the thin-and-light market.
We said at Computex in June that we were in the hunt to stand at the top of the mountain on battery, and we have done it. More broadly, I personally believe that these numbers confirm that it's not the instruction set that dictates final energy consumption. It's the choices you make in your chip design, the choices that you make with your system partners, and the technologies you wrap around that chip in the platform. So in the spirit of great technologies, let's take a look at CPU core performance.
This is near and dear to my heart. You know already that we have two microarchitectures in Lunar Lake. On the E-core side, with Skymont, we've got wider dispatch ports, better branch prediction, better throughput for AI. But the overarching goal was to increase the coverage that the E-core could handle in terms of workloads and power. If we eliminated the low-power island, we needed a comprehensive solution to core scalability, and these Skymont changes deeply improve the range of workloads that the E-core complex can handle. On the P-core side, we split vector/floating-point and integer into separate engines for better modularity.
Now and in the future, we decoupled our design methodology from a specific process node for future flexibility as well, and we made scheduling wider to fill the pipe. Most controversially, we also removed SMT, or Hyper-Threading, from the design. We're not always going to make this same decision, but it makes a ton of sense when you're fanatical about power.
Like we were in Lunar Lake. The choice has three clear benefits for the P-core in Lunar: we get 15% more performance out of the P-core at the same power, 10% more performance from every square millimeter of die area you spend, and an aggregate 30% better performance, power, and area when you accumulate all of those vectors into an overall figure. This was a hugely beneficial change in three different ways to remove SMT from the P-core design on this product.
In other words, we got a lot more performance out of a CPU core that is both smaller and lower power than it would have been with SMT still in the mix. That was the goal. The cores are also wrapped in an all-new SoC fabric, and it is stunningly fast for a mobile CPU. Within the E-core complex, it's about 23 nanoseconds core-to-core.
This is using the GitHub projects that many of you are familiar with, so start thinking about those numbers. P-core is about 26 nanoseconds, which makes sense, as the cores and caches are a little bigger. But cross-complex communication is really nice at 55 nanoseconds. Ryzen and Snapdragon, by comparison, are somewhere in the range of 150 to 180 nanoseconds, 3 or 4 times slower at communicating across complexes.
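If you want to reproduce numbers in this spirit, the GitHub tools he's alluding to bounce a cache line between two pinned cores. Here's a minimal ping-pong sketch of the idea (Python for readability; interpreter overhead will inflate the absolute numbers well past the native tools, so treat it as illustrative):

```python
# Core-to-core "ping-pong": two processes hand a shared flag back and forth.
# Each handoff forces the cache line to migrate between the two cores.
import time
from multiprocessing import Process, Value

ITERS = 100_000

def pong(flag):
    # Bounce the flag back every time the pinger sets it to 1.
    count = 0
    while count < ITERS:
        if flag.value == 1:
            flag.value = 0
            count += 1

if __name__ == "__main__":
    # (On Linux you would pin each process to a known core with
    #  os.sched_setaffinity to control exactly which pair you measure.)
    flag = Value("i", 0, lock=False)  # raw shared int, busy-waited by both sides
    p = Process(target=pong, args=(flag,))
    p.start()
    time.sleep(0.1)                   # let the responder spin up
    start = time.perf_counter()
    for _ in range(ITERS):
        flag.value = 1                # ping
        while flag.value != 0:        # wait for the pong
            pass
    elapsed = time.perf_counter() - start
    p.join()
    # A round trip is two handoffs, so halve the per-iteration time.
    print(f"~{elapsed / ITERS / 2 * 1e9:.0f} ns per handoff (plus Python overhead)")
```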
This is an extremely fast fabric. Really nice setup. Memory latency is also best in class, best in the segment.
Competing CPUs also paired with LPDDR5X are much higher, in the range of about 150 nanoseconds. Our memory latency is 40% lower than the previous generation, and a full 30% lower than AMD's new Strix Point. Many know that latency is a huge component of core IPC and gaming performance, and some of us have personally spent quite a lot of time debating memory latency over the years, so I hope you will share my joy in these numbers.
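Memory latency, for its part, is classically measured with a dependent pointer chase: every load's address comes from the previous load, so the misses can't be overlapped. A toy version of the technique (again illustrative; native tools report the true 100-150 ns class numbers without interpreter overhead):

```python
# Dependent pointer chase: serialized loads expose true load-to-use latency.
import random
import time

N = 1 << 22               # ~4M entries, well past the caches (Python object
                          # overhead makes the real footprint even larger)
order = list(range(N))
random.shuffle(order)     # a random visit order defeats the prefetchers

chain = [0] * N
for i in range(N):        # link the shuffled order into one big cycle
    chain[order[i]] = order[(i + 1) % N]

idx, hops = 0, 1_000_000
start = time.perf_counter()
for _ in range(hops):
    idx = chain[idx]      # each load depends on the one before it
elapsed = time.perf_counter() - start
print(f"~{elapsed / hops * 1e9:.0f} ns per hop (includes Python overhead)")
```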
This is an extraordinarily fast chip. Finally, Thread Director ties everything together with a bottom-up approach that starts with the E-cores first and then evicts upwards to the P-cores if the performance or thread count needs it. I should also point out that games just go to P-cores, because they send the hints that make that happen. That's just how games work. Thread Director is an essential piece of making sure demanding workloads go to the P-cores and the light workloads hit the E-cores. That's the point of the Intel hybrid architecture.
But the real point of these changes was ensuring Lunar Lake hits systems with the fastest single-thread performance in the thin-and-light PC segment against both Qualcomm and AMD. We hit our goal. I think it is fair to say that Cinebench, Geekbench, and SPEC are the three most common ways to test the single-thread performance of a core, and we take the gold medal in all of them. On the multi-threading side, I realize we also said another thing that was pretty drastic at Computex: we told you that eight threads of Lunar Lake could outperform 22 threads of Meteor Lake. Comment sections were filled with well-deserved skepticism about this possibility, but now we have the data to take a look. Lunar Lake will hit the market ranging from 9 to 33 watts.
So for those of you who ask me what the TDP is: 9 watts to 33 watts, that's the full and complete range. In the super-thin or fanless segments, around 9 watts, is where we had the U-series in the Meteor Lake generation because of power constraints. U-series was a maximum of 14 threads in this space. With just eight threads, however, Lunar Lake is achieving 22% higher multi-threaded performance, which equates to slightly more than twice the performance from every thread. But as you know, Meteor Lake also had a 22-thread part, the very famous H-series, and that picked up at 17 watts and went up from there. At this package power, we are getting three times the performance out of every thread in Lunar Lake.
In fact, we're still ahead on overall performance versus Meteor Lake as well. Finally, as you head towards 23 watts, sustained multi-threaded finally breaks in Meteor Lake's favor, and Lunar is about 5% back. And 20 watts is the exact crossover point where eight threads of Lunar Lake and 22 threads of Meteor Lake are equal. All of this is to say that Lunar is getting 2 to 3 times the performance of Meteor Lake out of every single thread. This is the cumulative benefit of IPC, frequency, latency, and all the system and SoC decisions that you can make when the PPA is as good as it is in Lunar.
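The per-thread claims fall straight out of the quoted figures, if you want to check the arithmetic:

```python
# Per-thread ratio = (relative total performance) x (their threads / our threads).

# Vs. the 14-thread Meteor Lake U-series: 8 Lunar Lake threads are 22% faster.
vs_u_series = 1.22 * (14 / 8)
print(f"{vs_u_series:.2f}x per thread vs. U-series")  # ~2.1x

# Vs. the 22-thread H-series at the 20 W crossover, where total perf is equal.
vs_h_series = 1.00 * (22 / 8)
print(f"{vs_h_series:.2f}x per thread vs. H-series")  # ~2.75x, i.e. roughly 3x
```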
When we zoom out to take an industry-level view of sustained multi-threaded performance, we feel really good about Lunar Lake's position. It is an outstanding performer in the range of 9 to 20 W, which, as a piece of trivia, was the original goal of the program, but even up to 33 W it's holding strong on sustained multi-threaded compute. Most fascinatingly, I would draw your attention to the purple dot on the far right.
We honestly could not believe this number. We spent many weeks in many labs retesting this, but every test showed the same numbers. So here's the bottom line: with Lunar Lake, we are delivering comparable performance to the Snapdragon X Elite at 40% lower power with four fewer cores. All right, that was a lot of synthetic benchmarks.
So now it's time to take a look at what users will see in their own systems. In productivity, office work, transcoding, or web browsing, by far the dominant use cases on a platform, you know, not the most exciting, but these are the things that people do every day, Lunar Lake is awesome. It convincingly beats Snapdragon X Elite and stands its ground against AMD's new CPUs that often run at much, much higher sustained power. I talked about us putting our own part at a disadvantage and still coming out on top, and these benchmarks prove it. Overall, Lunar Lake carries the banner for CPU core performance in the PC space.
Massive improvements in IPC, a new fabric, a new topology, and new power management technologies allow this chip to achieve remarkable performance from every core, and truly, the sum is greater than the parts. Now let's talk graphics. If you don't follow every little detail of processor design, it's worth mentioning that our previous generation was the first to bring our desktop graphics architecture into the Core Ultra GPU.
The Intel Arc GPU inside Meteor Lake was a huge step forward for us in graphics and media performance, and now we're turning it up a notch in Lunar Lake. Lunar Lake is the first to feature our next-generation Xe2 architecture. Xe2 has an all-new pipeline design, plus new and dedicated engines for processing AI and ray tracing that actually produce playable frame rates on a 30-watt notebook. This has been an interesting theory for a while, but yeah, we can actually ray trace on a notebook. It's pretty cool.
The work we've done with the software stack has given us great compatibility across the gaming industry. Our top-of-the-line Arc 140V config is on average 30% faster than the previous generation. I'll give you a sec to take all the photos that you like of all the games, and you can correlate with your reviews.
These are all at 1080p medium. So what you're seeing here is Lunar versus Meteor. The highs are pretty exciting too. We picked up 40% in F1 24 and Cyberpunk 2077, almost 60% in Hogwarts Legacy, and 80% in The Division 2. We're averaging about 30% faster in graphics across these 45 titles, and the top quartile of these games is around 70% faster.
Now we're going to take the generational comparison off the chart and just reset to Lunar Lake performance only. Then we're going to place that data against Qualcomm's fastest X1E-84-100 configuration. Out of 45 games that we tested, more than half did not run on Snapdragon.
As you can imagine, that has made it somewhat challenging to give you the tidy performance summary that you would expect from a clean and polished slide. If we turn on the bars for the remaining games that do happen to run on Qualcomm, Lunar Lake is 70% faster at 1080p medium. Seven-zero percent. Finally, we'll swap Qualcomm out for AMD's new Strix Point. They've always had a good GPU, but it's still not fast enough to keep up with Xe2 in Lunar Lake. We are about 16% faster on average. One-six percent. From a scorecard view, we were able to run twice as many games as Qualcomm and do it 70% faster.
Red team had great stability, of course, but we're still the fastest graphics core of any chip for thin-and-light PCs. Beyond native graphics performance, we can also add compounding performance improvements with supersampling, or XeSS. XeSS uses an AI algorithm to render a game at a lower resolution, then upscale to a higher resolution at little to no fidelity loss. Because the resolution in the graphics pipe is lower, you can boost frame rates and then bring a sharp image back with the AI upscaling and AI sharpening. Lunar Lake has new kernels that leverage the XMX AI engines in the architecture, and these kernels are in the driver, so they're immediately available. They do not need to be added to these games.
If a game has XeSS, it can leverage these new kernels. XeSS, as I said, is a compounding or cumulative technology. If we're already getting 30% out of the graphics engine natively versus our previous generation, then XeSS sits on top of that as a multiplier. Now, if I could direct your attention to the furthest bar on the right as an example: you see that F1 24 is considerably faster than Meteor Lake natively, 99 fps versus 71. But when you add XeSS on top, the frame rate shoots up to 129. Again, that's average, not peak.
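To make the compounding explicit with those F1 24 numbers:

```python
# XeSS multiplies on top of the native generational gain.
mtl_native = 71    # Meteor Lake, native rendering (fps)
lnl_native = 99    # Lunar Lake, native rendering (fps)
lnl_xess   = 129   # Lunar Lake with XeSS upscaling (fps)

print(f"native gen-on-gen: {lnl_native / mtl_native:.2f}x")  # ~1.39x
print(f"XeSS multiplier:   {lnl_xess / lnl_native:.2f}x")    # ~1.30x
print(f"cumulative:        {lnl_xess / mtl_native:.2f}x")    # ~1.82x vs. last gen
```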
The combination of raw performance and AI upscaling delivers awesome frame rates in this platform, and I'm not going to run around calling it a gaming CPU, but damn, it can play a game. XMX and Xe2 also make it possible to competently handle ray tracing in the ultra-thin market for the first time. On average, we are faster than AMD's ray tracing implementation by about 30%.
And because Qualcomm struggles with contemporary graphics APIs like DX12 Ultimate, it cannot do ray tracing at all. All the performance you see here from the blue bar has 99th-percentile frame rates above 30 fps, so the experience is actually smooth, too. The media engine is the other half of what makes Xe2 special. It has support for all the latest formats and codecs, including brand-new solutions like H.266, also known as VVC.
The video scaler in the media engine is especially helpful, as it eliminates the need to spin up the CPU or the compute shaders to do video rescaling. It's a very convenient fixed-function item to do that scaling work. The media engine is also responsible for video transcode performance. We actually scaled the encoder size, or bandwidth, back versus Meteor Lake; I think we were probably a little over-indexed on encode and transcode performance in the last cycle.
So we made it a little smaller. But the performance delivered by the encoding engine is still the fastest in the segment. The performance you see here is 4K to 1080p in three different codecs, described in frames per second, and it's worth remembering that a single second of video contains about 24, maybe 30 frames. Overall, the new Xe2-based graphics engine gives us a very healthy step forward in the experience for our users, including 30% faster gaming, ray tracing that is fully functional in an ultra-thin system, and the best encode performance of any processor for PCs. Last but certainly not least, my day job.
Let's talk about AI. Before looking at the numbers, it's important to understand that we are constantly talking to hundreds of software developers. We mutually share our long range roadmaps with them, and the developers tell us what engines and what AI data types they plan to target.
If they're making AI software this year or next, our partners tell us that they want to use CPU, GPU, and NPU in about equal proportion. It is extremely plain from our conversations that an NPU alone is not enough for the AI market, and this has heavily influenced our thinking on how we plan to enable the best performance and experience from an architecture and software point of view.
Lunar Lake has the most AI TOPS of any PC processor on the market available today. It pulls 67 TOPS through the GPU, 48 through the NPU, and 5 through the CPU cores. Before I go any further, I would be remiss if I did not say that TOPS is actually something of a goofy metric. It's a calculated value. It only tells you how fast you could go.
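For reference, a peak-TOPS figure is derived from little more than unit counts and clocks, which is exactly why it's only a ceiling. A sketch of the usual formula (the MAC count and clock below are hypothetical, picked only to be consistent with the 48 NPU TOPS quoted, and are not Intel's published derivation):

```python
# Peak TOPS = MAC units x 2 ops per MAC (multiply + accumulate) x clock / 1e12.
def peak_tops(mac_units: int, clock_hz: float) -> float:
    return mac_units * 2 * clock_hz / 1e12

# Hypothetical engine: 12,288 INT8 MACs at ~1.95 GHz lands near 48 TOPS.
print(f"{peak_tops(12_288, 1.95e9):.1f} TOPS")
```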
It's a theoretical peak, but it is the software that dictates how fast you will go. Software makes the TOPS real, in the same way that a game engine or graphics driver can make or break the teraflops promise in the spec sheet of a GPU. So let's start by taking a look at generative AI. We're going to dig into performance.
Stable Diffusion has rapidly emerged as a straightforward evaluation of matrix math compute performance. It's a good sustained compute workload that hits all the MAC arrays, plus it can generate the memes of your dreams, which I do all the time, and I consider that the most important use case of all. Stable Diffusion 1.5 runs on both GPU and NPU, thanks to plug-ins from Intel and Qualcomm,
in the GIMP image editor, for 20 iterations, kind of a common interval, pretty quick to run. We can rip through an image in about five seconds, and we do that faster than any other processor.
Just for context, last generation that was about 15 to 20 seconds, so a huge speedup. For even higher-quality images, the FP16 data type on the GPU is the gold standard.
There is a noticeable difference in performance and quality when you use FP16 versus INT8 or hybrid data types, and we can generate an image in under four seconds on the GPU, whereas Qualcomm again cannot do it at all. UL Procyon AI has also seen strong adoption in AI benchmarking because it solidly evaluates multiple models and multiple data types; it's a very convenient way to test the throughput of an AI device. And if we hit the NPU with this workload, Lunar Lake once again offers the highest performance in the one data type that Qualcomm seems capable of running, whereas we stand entirely alone as the only NPU that can run the high-quality FP16 data type. We score about 1,000 points, but I have nothing to compare it to, so sorry.
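Targeting a specific engine is just a device string in a runtime like Intel's OpenVINO, which the Intel GIMP plug-in mentioned above builds on. A minimal sketch (the model path is a placeholder, and the FP16 precision hint is one assumed configuration, not necessarily the exact setup behind these scores):

```python
# Minimal OpenVINO sketch: compile the same model for the GPU or the NPU.
import openvino as ov

core = ov.Core()
print(core.available_devices)         # e.g. ['CPU', 'GPU', 'NPU'] on Lunar Lake

model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR model

# FP16 on the GPU -- the "gold standard" data type discussed above.
gpu = core.compile_model(model, "GPU", {"INFERENCE_PRECISION_HINT": "f16"})

# The same model on the NPU; the runtime negotiates supported precisions.
npu = core.compile_model(model, "NPU")
```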
For the first time, I'm also excited to show you UL Procyon Image Generation. It's designed to uniformly evaluate Stable Diffusion without using vendor-specific plug-ins or GitHub projects; it's a nice, consistent way. AMD and Intel are already off to the races on running this workload, but we are twice the performance of AMD, and once again, Qualcomm can't do it. Finally, if we look at Geekbench AI, Intel's investment in frameworks and engines is paying serious dividends in every data type and on every engine.
We are faster than both Qualcomm and AMD. But the real test of AI performance is the features being delivered to the market by software developers in shipping applications, and that's very different from the performance you might see in one of those synthetic benchmarks like Procyon. And again, I suppose this is not all that different from the belief that games are a more realistic test of graphics performance than something like 3DMark. We all know that intuitively. We've all written articles to that effect.
AI is no different. Here we see a selection of AI-powered features from major applications like Adobe Premiere, Adobe Photoshop, Lightroom, Topaz, or Blender. Everything you see here is baselined, that's the light blue bar in the middle, against the last-generation Core Ultra 7 155H.
You'll notice that all the blue bars are positive and all the Qualcomm bars are negative, which means that both Meteor Lake and Lunar Lake, in practice, in real applications, deliver better AI performance than Snapdragon when you leave the comfort of canned benchmarks, Python scripts, or technical demos. More importantly for us, the average uplift we're seeing in these AI workloads is about 60%. Many of these ISVs actually draw on GPU performance, not NPU performance; the NPU is actually the exception for ISVs, so XMX in the Xe2 engine is vital to the performance here.
What underpins all of these performance wins is Intel's leadership in adopting the latest and greatest in AI software. I've said it before and I will say it again: nobody has more AI features or models up and running than Intel in this cycle. We've also validated some pretty cool new models for language, like Phi-3, which is a personal favorite of mine. It's super accurate. We also just validated multimodal models like LLaVA that can process more than one type of input and give you more than one type of output.
So it doesn't just have to be text-to-text or text-to-image; you can do format conversion. All of these leverage our absolutely unmatched support for the widest number of frameworks, tools, and APIs in this industry. AI acceleration is rapidly becoming a core component of what it means to be a high-performance and a high-efficiency CPU. These accelerators will rapidly reach every segment and every form factor, and in 2 to 3 years' time, I believe that every piece of software you touch on a regular basis will have AI features. The trajectory will be extraordinarily similar to integrated graphics, which was once ridiculed for being unable to play a game.
I remember the articles from the time. I probably wrote one of them: why is this graphics chip here? It can't do anything. What's the point? It's not a GPU. But now I admit that I was wrong. It powers everything from web page compositing to UI rendering. It is an inseparable piece of the graphics experience, and AI is headed in the same direction. AI runs best on Intel because we've done the hard work to make that happen across a rich selection of hardware and software.
And I'll say that the software is actually the harder part. Most critically, I need you to understand that when we say AI PC, we mean that these features and experiences are built on top of the foundational performance and power metrics that all of us care about as PC enthusiasts.
Those metrics are crystal clear: CPU performance, GPU performance, and energy efficiency. That is why we say that a great AI PC starts first and foremost as a just plain great PC. That also means that even if you don't give a rip about any of the AI features, you are still getting the performance and efficiency that every user needs.
If you take nothing else away from this: AI is an "and" technology, on top of everything else that we do.