GDC 2022 - A Guided Tour of Blackreef Rendering Technologies in Deathloop

Show video

Hello! I'm Gilles, Lead Graphics Programmer at Arkane. Today I'll present an overview of the rendering technologies in Deathloop and how Lou Kramer and other people at AMD helped us reach better image quality and better performance on both PC and consoles. After my talk, she will present some specific optimizations we did for AMD hardware and how we integrated some AMD technologies into our game together. A few words about me: I've worked in the video game industry for the past 20 years on console and PC projects, on various engines, both commercial and developed in-house.

I've been Lead Graphics Programmer at Arkane since Dishonored 2, working on all Arkane games developed with the Void engine. The latest one is Deathloop, released last September; I really hope you all played it. What will I talk about? I'll first introduce the Void engine and how it evolved over the years to support more and more platforms and APIs. Then we'll go through the main improvements we implemented for Deathloop. We'll finally look at how we implemented raytracing in an engine which was not really suited for it in the first place.

After Dishonored 1 we had to choose the engine on which we would develop Dishonored 2. We settled on idTech 5, which was developed by id Software, another ZeniMax studio like Arkane. We liked how the engine was designed and we knew that if we needed help we would have it.

And near the end of Dishonored 2, we had a lot of help from them. Over the years we rewrote a big part of the engine, and not only on the graphics side: the gameplay code was entirely rewritten from scratch, and we integrated many middleware packages for audio, physics, UI, and more.

Let's begin with the Void engine's graphics API usage. When we got idTech 5 it was DX9 and OpenGL. One thing to note is that, like several other engines, the executable for internal builds contains both the editor and the game code.

It allows level designers to test the levels directly from the editor. We labeled the Void engine version used to ship Dishonored 2 as 1.0. On Xbox One, we needed to use DX11 and we also switched the PC version to DX11 to avoid maintaining two different APIs. For Deathloop the API choice was more difficult. The game started as a fully multi-platform project, not only PS5 and PC. As you can see we went from 7 platforms to only 2, but it was done in a few stages.

For a few months we kept maintaining the previous-generation versions, before the decision was made that we would be next-gen only. That removed a couple of platforms. Then, great news for us, the Sony exclusivity deal was signed and we were able to drop support for 3 more platforms, each using its own API. Obviously, having only 2 platforms and 2 APIs to support was a big relief, especially during the final debugging and optimization phase, which as you all know can be quite painful. The DX12 version needed a lot of work to run well, especially in editor mode, as the editor's graphics resource management is very different from the game's.

Artists and designers keep adding and changing stuff all the time, and we couldn't prebake that data the way DX12 likes it to be. So we had to implement specific systems to avoid creating too many dynamic resources. We hit them all: a few DX12 SDK bugs, shader compiler bugs, driver bugs. This even led us to consider shipping the DX11 version alongside the DX12 one. In the end all these bugs were fixed and we finally felt confident enough to switch the entire dev team to our DX12 editor and ship only this version to players. Let's now look at what we changed for Deathloop besides the graphics API.
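The talk doesn't detail those systems; as a rough, hypothetical illustration of the general idea (not Deathloop's actual code), a pool keyed by the resource description lets editor churn reuse already-created GPU resources instead of constantly creating and destroying them:

```cpp
// Hypothetical sketch only. Free resources are grouped by a hash of their description
// (collision checks, states and lifetime tracking omitted for brevity).
#include <d3d12.h>
#include <unordered_map>
#include <vector>

struct ResourcePool
{
    std::unordered_map<size_t, std::vector<ID3D12Resource*>> freeList;

    static size_t HashDesc(const D3D12_RESOURCE_DESC& d)
    {
        size_t h = static_cast<size_t>(d.Width);
        h = h * 31 + d.Height;
        h = h * 31 + static_cast<size_t>(d.Format);
        h = h * 31 + static_cast<size_t>(d.Dimension);
        h = h * 31 + static_cast<size_t>(d.Flags);
        h = h * 31 + d.MipLevels;
        return h;
    }

    ID3D12Resource* Acquire(ID3D12Device* device, const D3D12_RESOURCE_DESC& desc)
    {
        auto& bucket = freeList[HashDesc(desc)];
        if (!bucket.empty())
        {
            ID3D12Resource* res = bucket.back();   // reuse an existing resource
            bucket.pop_back();
            return res;
        }

        // Slow path: only create a brand-new committed resource when nothing is available.
        D3D12_HEAP_PROPERTIES heapProps = {};
        heapProps.Type = D3D12_HEAP_TYPE_DEFAULT;
        ID3D12Resource* res = nullptr;
        device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                        D3D12_RESOURCE_STATE_COMMON, nullptr,
                                        IID_PPV_ARGS(&res));
        return res;
    }

    void Release(const D3D12_RESOURCE_DESC& desc, ID3D12Resource* res)
    {
        freeList[HashDesc(desc)].push_back(res);   // recycle instead of destroying
    }
};
```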

First of all, we chose to switch from forward to deferred rendering. I'll talk more about this in a few minutes, as it was quite an important decision. When we started development on Dishonored 2, PBR rendering was still new and there was no standard BRDF to implement. And we chose the wrong one: we chose the Ward BRDF while everyone else used GGX. For Deathloop we decided to move to GGX, which had become the standard and was the default BRDF in the authoring tool we used. We also added HDR support, which is common these days.

ACES tonemapping gave us better results than our previous filmic tonemapper, but it came with one drawback: the computation can't be inverted with full precision, so we had to rely on an approximation for the inverse function. It was good enough for our needs. On previous-generation consoles we had tried to implement particle lighting using a few different options; unfortunately none of them was acceptable, either in terms of quality or in terms of performance.
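The talk doesn't say which curves Arkane used; as a minimal sketch of one common approach (not necessarily theirs), a fitted rational approximation of the ACES curve (Narkowicz 2015) can be inverted in closed form, which is the kind of approximate inverse being described:

```cpp
// Sketch only: the fit below approximates ACES, and its exact inverse is obtained by
// solving the rational equation as a quadratic in x.
#include <algorithm>
#include <cmath>

float AcesFitted(float x)          // approximate ACES tonemap, x >= 0
{
    return std::clamp(x * (2.51f * x + 0.03f) / (x * (2.43f * x + 0.59f) + 0.14f),
                      0.0f, 1.0f);
}

float AcesFittedInverse(float y)   // inverse of the fit above, y in [0, 1]
{
    // Solve (2.51 - 2.43 y) x^2 + (0.03 - 0.59 y) x - 0.14 y = 0 for the positive root.
    const float a = 2.51f - 2.43f * y;
    const float b = 0.03f - 0.59f * y;
    const float c = -0.14f * y;
    return (-b + std::sqrt(b * b - 4.0f * a * c)) / (2.0f * a);
}
```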

Our artists needed to adapt every environment effect to the local lighting conditions, which took them a lot of time and effort, and they didn't want to relive that for Deathloop. We implemented Doom 2016's decoupled particle lighting to finally have lit particles in Deathloop. Even with this feature, we noticed early on that when we had a lot of effects on screen in 4K, the dynamic resolution code had to lower the resolution by a pretty big factor to compensate for the fill-rate hit, and the image quality took a big hit too.

So we also decoupled the particle rendering resolution from the main view resolution, and we now have two different dynamic resolution systems. Particle resolution can go down to 64 times fewer pixels rendered. The loss of resolution is not very visible on particles until we reach really low resolutions.

However many particles we have on screen, we keep a fixed quality for the opaque geometry, which is far better overall. We have had an order-independent transparency solution in place since Dishonored 2, based on an accumulation and opacity buffer. It was not perfect, especially with additive blending modes, but it was quite cheap and gave good results. We were able to adapt it to take two sets of inputs, one in full resolution and one in lower resolution, so we could keep some specific effects in full resolution and still sort all of them together during the OIT pass.

All the transparent surfaces in the environment can also stay in full resolution and be sorted correctly. As they are not animated like particles, the loss of resolution would have been far too visible on them. They also had junction issues with the opaque geometry around them.

These screenshots demonstrate this case. The small full-resolution transparent surface you can see is the glass covering the spotlight, with an opaque metal case around it. We kept it in full resolution in order to avoid junction issues.

The art direction depended a lot on efficient decal rendering, so we had to completely rework our decal system. We took inspiration from Doom's decal management code to implement something better suited to a deferred renderer. Here you can see what the game would look like without any decals.

The buildings lose most of their paint layer. On Dishonored 2, decals had their own diffuse map, normal map and roughness map and simply blended themselves over the previous geometry, computing lighting one more time per overlapping decal. On Deathloop, decals are rendered into their own GBuffer, which is read during the main geometry GBuffer pass. Depending on the per-pixel decal GBuffer values, they can blend one or more of the material-based textures with the geometry ones. We can, for example, have decals which only modify the underlying normal, to add scratches on surfaces.

Like many other games, we have two types of decals. Static decals are manually placed by level artists. Their geometry is generated in the level editor from box projections they set up; they are not projected at runtime. Artists can pick which static geometry they will be displayed on, making it possible to exclude windows or neighboring objects where they don't want the decal to spill. The Z order of static decals can also be tweaked by artists to make sure they overlap correctly.

Dynamic decals on the other hand are spawned at runtime and we have a lot less control over them as they are projected at runtime as well. It is quite easy to always display dynamic decals over static ones but in Deathloop we have one additional specific issue: the snow. Because we have a lot of snow in Deathloop. On the left you can see a work in progress version of a map portion. On the right several static snow meshes have been reworked and added.

But you can also see snow on the pile of garbage, and the thin snow layer on the barrels on the left and on the roofs. While not dynamic at runtime, this kind of snow layer is a procedural part of the base material. The issue with this snow layer is that it has to cover static decals.

However, if we drop a grenade near the barrels, we need the dynamic explosion decal to cover the snow. So we need to store the decal type. We could have stored this information in the decal GBuffer, but it was not the best choice performance-wise: it would have been necessary to read the full 32-bit decal GBuffer pixel just to determine whether a decal was present and its type.

So instead we chose to store this information in stencil bits. Obviously PIX does not display the stencil with those colors but with shades of red, which don't read well in a PowerPoint. Blue pixels are covered by static decals, red pixels are covered by dynamic decals, and green pixels are covered by both types of decals. We need to know when the two types of decals are present because, in the absence of snow or if the snow is not completely opaque, they could both be partially visible and blended together.

So here is a sample GBuffer shader. We start by reading the stencil. If a decal is present, we read the decal GBuffer.

If the snow is not fully opaque, we apply the static decal. We can skip that when the snow is fully opaque, because the static decal is not visible in that case. Then we can apply the snow over the static decal. And we finally apply the dynamic decal over the snow.
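As an illustration of that blend order, here is a minimal C++ sketch; the real code is an HLSL GBuffer shader, and the stencil bit assignments, names and the small float3 helper here are hypothetical:

```cpp
#include <cstdint>

struct float3 { float r, g, b; };

static float3 lerp(float3 a, float3 b, float t)
{
    return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t, a.b + (b.b - a.b) * t };
}

// Assumed stencil layout for the example: bit 0 = static decal, bit 1 = dynamic decal.
constexpr uint32_t kStaticDecalBit  = 1u << 0;
constexpr uint32_t kDynamicDecalBit = 1u << 1;

float3 ResolveAlbedo(uint32_t stencil,
                     float3 geometryAlbedo,
                     float3 snowAlbedo,         float snowOpacity,
                     float3 staticDecalAlbedo,  float staticDecalOpacity,
                     float3 dynamicDecalAlbedo, float dynamicDecalOpacity)
{
    float3 result = geometryAlbedo;

    // Static decal sits under the snow; skip it when the snow fully hides it.
    if ((stencil & kStaticDecalBit) && snowOpacity < 1.0f)
        result = lerp(result, staticDecalAlbedo, staticDecalOpacity);

    // Procedural snow layer from the base material goes over static decals.
    result = lerp(result, snowAlbedo, snowOpacity);

    // Dynamic decals (explosions, impacts) go over the snow.
    if (stencil & kDynamicDecalBit)
        result = lerp(result, dynamicDecalAlbedo, dynamicDecalOpacity);

    return result;
}
```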

Throughout this presentation I talk about snow because we first designed this feature for the snow, but artists also use it to add dust layers on some objects; it can be useful for a lot more than just snow. Having per-pixel normals available in the GBuffer allowed us to add an SSR pass in addition to our probe-based reflections. Probe-based reflections could only handle static reflections and were also suffering from pretty low resolution.

As you can see in this screenshot, with SSR reflective surfaces are now much better integrated into the overall image. We improved a few more features to get better overall image quality, and we even added raytraced ambient occlusion and raytraced shadows on platforms that support them. On the ambient occlusion side, AMD's FidelityFX CACAO provides a good improvement over our old custom AO, which was targeting performance rather than quality. It is also very scalable: we were able to expose three different quality levels. Dishonored 2's image was often too soft and blurry.

We improved our Temporal AA shader but it was not enough. We added a real sharpening option and it helped a lot with this issue. AMD CAS is used with native resolution rendering and also when using DLSS.

RCAS is used in both FSR modes. So why did we switch to deferred rendering? During Dishonored 2's development, especially for post-processes, we often missed access to data which would have been readily available with a GBuffer, normals for example. Our forward shader was already really heavy in terms of VGPRs and bandwidth usage; outputting a normal buffer was not really possible.

Its complexity also made it difficult to optimize. Switching to a deferred renderer helped with a lot of those issues and also allowed us to use a lot more async compute passes as most of the lighting was done in compute shaders instead of pixel shaders. With access to a normal buffer generating rays in our raytracing code was also a lot easier than it would have been without it.

We eventually managed to get our deferred renderer to offer better performance than our previous forward renderer, which was of course one of the goals. Here are a few screenshots of some of our debugging tools. We obviously have standard views to display the raw GBuffer content, but we also have more advanced ones like a texel ratio or a vertex density view. For the texel ratio, users are even able to choose which source texture they want to visualize it for: albedo, normal or everything, really. All those views helped the artists a lot in checking that they respect the constraints given to them and in debugging material issues more easily. So what about raytracing? We managed to bring raytracing functionality to Deathloop, and I'll now explain how we succeeded.

When we started development on Deathloop, raytracing was still new and not many games were taking advantage of it. We had long wanted to try raytracing, but we couldn't afford to spend too much time on it, because it was a big feature that was available only to a tiny percentage of PC players. What decided it for us was mostly the fact that the new generation of consoles supported it.

Wider support on PC was also announced but not yet available. As we were developing a PS5 exclusive it made sense to try and reach a better image quality. How did we start? Obviously before even trying to cast a ray, we needed to generate acceleration structures.

For our first PC-only prototype we only supported the static environment geometry and, more importantly, only fully opaque geometry without any alpha test. It allowed us to avoid writing the code to refit and rebuild BLASes and the TLAS, and also to avoid the need for any texture management. We started by rendering the main scene with raytracing, without any textures, using an orthographic camera, and we got a fantastic first raytraced image. Impressive, isn't it? A scene made of a single quad viewed from above. But at least it was working!
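For reference, the acceleration-structure setup behind that first image boils down to something like this minimal D3D12 DXR sketch; the buffers and counts are hypothetical, and it is a single static, fully opaque BLAS with no refit and no alpha test:

```cpp
// Assumes: ID3D12GraphicsCommandList4* commandList, ID3D12Resource* vertexBuffer,
// indexBuffer, blasBuffer, scratchBuffer and vertexCount/indexCount created elsewhere.
D3D12_RAYTRACING_GEOMETRY_DESC geom = {};
geom.Type  = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
geom.Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE;          // no any-hit / alpha test
geom.Triangles.VertexBuffer.StartAddress  = vertexBuffer->GetGPUVirtualAddress();
geom.Triangles.VertexBuffer.StrideInBytes = sizeof(float) * 3;
geom.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT;
geom.Triangles.VertexCount  = vertexCount;
geom.Triangles.IndexBuffer  = indexBuffer->GetGPUVirtualAddress();
geom.Triangles.IndexFormat  = DXGI_FORMAT_R32_UINT;
geom.Triangles.IndexCount   = indexCount;

D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC build = {};
build.Inputs.Type           = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
build.Inputs.Flags          = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE;
build.Inputs.DescsLayout    = D3D12_ELEMENTS_LAYOUT_ARRAY;
build.Inputs.NumDescs       = 1;
build.Inputs.pGeometryDescs = &geom;
build.DestAccelerationStructureData    = blasBuffer->GetGPUVirtualAddress();
build.ScratchAccelerationStructureData = scratchBuffer->GetGPUVirtualAddress();

commandList->BuildRaytracingAccelerationStructure(&build, 0, nullptr);
```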

The second image we got was a little more meaningful. It was the same scene but viewed from the player camera. Sorry, I couldn't find any screenshot of this step so we can skip it and go directly to step 3.

It was time to check that our acceleration structures would support a full game level. But with all pixels the same color the image was not very useful, so in this step we also added flat shading using the vertex normals, interpolated with the barycentric coordinates provided by the DXR API.
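The interpolation itself is just a weighted sum. A small sketch in plain C++ standing in for the hit shader; DXR reports the hit's barycentrics as two floats, which are the weights of the second and third triangle vertices:

```cpp
struct Vec3 { float x, y, z; };

// n0, n1, n2 are the triangle's vertex normals; (u, v) are the DXR hit barycentrics.
Vec3 InterpolateNormal(Vec3 n0, Vec3 n1, Vec3 n2, float u, float v)
{
    const float w = 1.0f - u - v;   // weight of the first vertex
    // Renormalizing the result afterwards is usually a good idea.
    return { w * n0.x + u * n1.x + v * n2.x,
             w * n0.y + u * n1.y + v * n2.y,
             w * n0.z + u * n1.z + v * n2.z };
}
```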

After this step the next logical one became: Let's render this scene with its albedo textures. And we hit our engine's main limitation for this kind of feature: It is not a bindless renderer. We did not have access to all the textures. We still managed to hack it quickly but it was not viable for real usage. At that point we decided that raytraced reflections were out of reach for this project. Maybe the next one?

Two raytracing features were still achievable: AO and shadows. Here is our first screenshot of one ray per pixel soft shadows. And the obvious conclusion: We needed a good denoiser.

So we spent a lot of time testing many denoisers. None offered good enough quality to ship the game with; the main issue was often ghosting. We asked around if anyone knew a good denoiser, and AMD offered to let us test the FidelityFX Denoiser, which was not yet released at that time. Our first prerequisite was met: it was open source and we were allowed to use it on consoles.

So we implemented it and the results were really good. It only officially supported shadows but worked like a charm for AO, too. The lack of support for reflections was not a problem for us, as we were not doing reflections. I learned that the latest version of this denoiser now supports reflections, but we haven't tried it yet. Regarding consoles, just know that the stock AMD version still had to be adapted somewhat to fully work there. Wave32 or Wave64 mode is chosen automatically by the driver on PC, but not on console.

And wave intrinsics are not exactly the same. As we were satisfied with the results so far, it was time to implement the last features without which we couldn't ship a raytraced game. I won't cover the console part here, but I'll just say that the APIs are different enough that we had to write two almost completely different low-level code paths. We still managed to keep the high-level code similar enough to ease maintenance.

Moving objects were very easy to handle: we just had to update the TLAS. Skinned meshes were a little trickier. For performance reasons we had already implemented a skinning compute shader to pre-skin all the needed meshes at the beginning of the frame. We then use them for the remainder of the frame as if they were static meshes.

It turned out to be exactly what we needed: we only had to refit the BLASes each frame with the updated geometry. Alpha test was a lot more difficult to support correctly, mostly because the renderer was still not bindless. We had to collect all the needed textures each frame and add their descriptors to an unbounded resource heap. We also had to maintain an indirection table in order to reference them from the raytracing data.
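Returning to the per-frame refit: a minimal D3D12 DXR sketch of updating a BLAS in place over the pre-skinned vertex buffer. The variables are hypothetical, the geometry desc is the same as for a static build except that it points at the skinned vertices, and the original build must have been created with ALLOW_UPDATE:

```cpp
D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC refit = {};
refit.Inputs.Type           = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
refit.Inputs.Flags          = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_UPDATE |
                              D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PERFORM_UPDATE;
refit.Inputs.DescsLayout    = D3D12_ELEMENTS_LAYOUT_ARRAY;
refit.Inputs.NumDescs       = 1;
refit.Inputs.pGeometryDescs = &skinnedGeomDesc;    // vertex buffer = this frame's skinned output
refit.SourceAccelerationStructureData  = blasBuffer->GetGPUVirtualAddress();   // update in place
refit.DestAccelerationStructureData    = blasBuffer->GetGPUVirtualAddress();
refit.ScratchAccelerationStructureData = updateScratch->GetGPUVirtualAddress();

commandList->BuildRaytracingAccelerationStructure(&refit, 0, nullptr);

// Make the refit visible to the TLAS build / ray dispatch that consumes it.
D3D12_RESOURCE_BARRIER uav = {};
uav.Type          = D3D12_RESOURCE_BARRIER_TYPE_UAV;
uav.UAV.pResource = blasBuffer;
commandList->ResourceBarrier(1, &uav);
```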

On PC, texture streaming was easy to support, as we can sample from mipmap 0, which has the best resolution available. It was more complicated on consoles, again due to the lack of bindless resources. The final step we couldn't avoid was optimization.

As soon as we enabled raytracing, the framerate tanked, for multiple reasons. We had a lot of CPU-side code managing the BLAS and TLAS updates, as it had to handle LOD switches in addition to animations. We were able to multi-thread it quite easily. AO is very costly, more costly than shadows for the same ray count, because of ray direction divergence; GPUs never like divergent work and ray direction is no exception. So we added an option to compute AO in half resolution.

It was also a big performance win. BLAS and TLAS updates were good candidates for async compute, as they use a lot of ALU but are not very hungry in terms of memory bandwidth. DispatchRays shaders don't use the same hardware units as regular shaders, or not as intensively. We were able to launch them on async pipes in parallel with other GPU work, and it was also a huge gain. We expected this gain on GPUs known to have dedicated raytracing hardware units, but it turned out to be a huge gain on all vendors' GPUs.
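A minimal sketch of the queue/fence pattern this implies in D3D12 (all objects are hypothetical and created elsewhere): the acceleration-structure work runs on an async compute queue while the graphics queue keeps rendering, and the graphics queue only waits right before the ray dispatch that consumes the result:

```cpp
#include <d3d12.h>

void SubmitAsyncRaytracingWork(ID3D12CommandQueue*        computeQueue,   // COMPUTE type
                               ID3D12CommandQueue*        graphicsQueue,  // DIRECT type
                               ID3D12GraphicsCommandList* asBuildCmdList, // BLAS refits / TLAS build
                               ID3D12GraphicsCommandList* rayCmdList,     // shadow / AO dispatches
                               ID3D12Fence*               fence,
                               UINT64&                    fenceValue)
{
    // Acceleration-structure work goes to the async compute queue.
    ID3D12CommandList* asLists[] = { asBuildCmdList };
    computeQueue->ExecuteCommandLists(1, asLists);
    computeQueue->Signal(fence, ++fenceValue);

    // The graphics queue keeps doing GBuffer / lighting work submitted elsewhere and
    // only waits for the acceleration structures right before the ray dispatch.
    graphicsQueue->Wait(fence, fenceValue);
    ID3D12CommandList* rayLists[] = { rayCmdList };
    graphicsQueue->ExecuteCommandLists(1, rayLists);
}
```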

As a final word, I'd like to thank AMD for their support. It's been a pleasure working with them, and Lou will now tell you more about the exciting features we're still adding to Deathloop together. Bye! Hi, and thanks to Gilles for covering the first part of this presentation. I will now cover the second part. I'm Lou, a GPU Developer Technology Engineer at AMD, and for the past few months, even years, I've worked together with Arkane Studios Lyon on Deathloop. We've done several optimizations and feature integrations for this game, and in my talk I will cover two of the optimizations and one feature integration.

You may already guess which feature integration it is, because we released a teaser about it not too long ago: yes, it is about integrating FidelityFX Super Resolution 2.0. But first let's talk about optimizations. I picked two topics for this talk because I think they're also interesting for a wider audience, and you can probably apply some of the takeaways to your own titles. The first one is barriers. It's a bit more high level: I will talk about barriers in general, our performance recommendations, and how to analyze barriers using our tools. I will conclude this topic with a short example from Deathloop.

The second topic is basically a one-line shader change that we made. We achieved quite a speedup with just this minor change, and these are usually the kind of optimizations developers love, because the change didn't have any effect on the visual output. But first, barriers. Barriers are not a new topic.

We've been talking about them basically since the introduction of DX12 and Vulkan, and we are still talking about them. The reason is quite simple: barriers can have a high impact on performance. They can drain the GPU and they can prevent passes from overlapping. But if you don't have the correct barriers, or if you place them wrong, you can have severe stability and correctness issues. So when you first implement barriers you of course want to make sure that they are correct, so that the game runs and looks correct. If the game doesn't even launch there's nothing to optimize, so this is always the first step.

But this also means that when we start looking at optimizations there is usually a lot of room when it comes to barriers, and in the case of Deathloop this was also true. The general performance recommendations haven't changed, to be honest, basically since we started talking about them. Some minor details can change depending on the architecture or maybe even the driver, but the general performance recommendations did not change.

They are still true and they will probably stay true for a while: minimize the number of barriers used per frame, batch groups of barriers into a single call, and avoid general / common layouts unless required. Minimizing the number of barriers used per frame is very broad. Sometimes it's difficult to achieve and also difficult to spot, because you kind of need to know whether a barrier is really needed, whether you could place it differently, or maybe even whether you could configure your passes differently. So yes, this one is very general.

But the second point is, it's quite easy to spot for example when you use RGP. RGP is our Radeon GPU Profiler tool, it's a frame analyzer and it basically lets you see all the commands that got submitted to the GPU. So you see also all the ResourceBarrier calls and you see them - you see if they're basically back to back at each other or if there are other calls in between.

What you don't want to see is something like in the screenshot: dozens of ResourceBarrier calls back to back with nothing in between. Usually you want one ResourceBarrier call which contains all the barriers you want to submit at this point in the frame, and then some other call, a dispatch or a draw. This is quite easy to spot, and if you batch the barriers you reduce the overhead, but you also make it easier for the driver to do some optimizations. Some optimizations are simply not safe to do if the barriers are not batched, for example removing redundant operations. So you can get an even bigger benefit from batching the barriers than just getting rid of some overhead. The other point is avoiding general / common layout transitions unless required, because sometimes you do need them, of course.
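A minimal D3D12 sketch of what the batched version looks like (the resources and states here are hypothetical); one ResourceBarrier call carries the whole group:

```cpp
// Assumes: ID3D12GraphicsCommandList* commandList, ID3D12Resource* shadowMask, * aoBuffer.
D3D12_RESOURCE_BARRIER barriers[2] = {};

barriers[0].Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barriers[0].Transition.pResource   = shadowMask;
barriers[0].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barriers[0].Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
barriers[0].Transition.StateAfter  = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;

barriers[1].Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barriers[1].Transition.pResource   = aoBuffer;
barriers[1].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barriers[1].Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
barriers[1].Transition.StateAfter  = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;

// One call for the whole batch instead of one call per barrier.
commandList->ResourceBarrier(_countof(barriers), barriers);
```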

In RGP you don't see the specific layout transitions; you can see those using PIX or RenderDoc, for example. What you do see is what a barrier causes. Barriers don't necessarily make the GPU wait, I mean most of the time they do, but not always. They also have effects other than making the GPU wait for work to complete: they invalidate caches, they can flush caches, and they can also cause decompressions. And these decompressions you can spot using RGP.

If you see them, I would highly recommend double-checking whether your layout transitions are really as optimal as they could be. How do you spot them in RGP? You go to the overview panel, select barriers, and you get a list of all the barriers that were submitted during the frame. You can also order them by the different attributes; by default they're ordered by, I think, submission time. In this screenshot I ordered them first by FMask decompression and then by DCC decompression. This is again quite an extreme example, because here we have a couple of FMask decompressions and a couple of DCC decompressions, and ideally you want none of them.

As I said, sometimes you need them, but in this case none of them were actually needed, so by optimizing the layout transitions we were able to get rid of all of them. If you see these in your game, just double-check whether they're really needed. You also see other things, for example the cache invalidations and flushes.

The barrier I marked here is quite a common one. It's not really an expensive barrier: it just synchronizes between the shader stages and invalidates the local caches. If one dispatch writes to an output resource and the next dispatch reads from it, invalidating the local caches makes sure that the second dispatch reads the correct values, because it then reads from the global cache. The global cache is visible to all compute units, so you know it contains the correct data. That's what this barrier is for and why the local caches get invalidated. But sometimes that's not enough: sometimes the driver also needs to invalidate the L2 cache or flush the color and depth buffer backends, and that depends on the next possible commands. The driver cannot see into the future, so depending on how the barrier is configured it has to account for which commands are possible according to that configuration, and then it issues the barriers to play safe.

Barriers in Deathloop. Deathloop uses a conservative automatic barrier generation system.

It's very robust, so it generates all the needed transitions, and it's also super convenient because developers don't have to think about the barriers or about which transitions are needed; everything is handled automatically. However, since it's quite conservative, sometimes unnecessary barriers are issued.

Everything will be correct, so the output will look right and there will be no crashes caused by barriers, but performance might not always be as good as possible. Sometimes barriers get issued that just transition a resource back and forth between states. Conceptually, what happened is something like in this diagram: we had two dispatches, both reading from the same resource and writing to the same resource. Potentially they could overlap, but they couldn't, because the automatic barrier generation system inserted barriers in between that transitioned the input texture to a write state and then immediately back to the read state. Those two barriers in the middle could simply be removed, and then the two dispatches were able to overlap again.

Something like this happened in the compute skinning buffers pass, and it was quite extreme. There were barriers around every dispatch call, even when the next dispatch did not depend on the previous work. The barriers just caused the buffer to switch state back and forth, and you ended up with 100 to 200 barriers per frame in this function.

They solved this issue by adding a manual barrier management code path: they basically said, okay, in this case the automatic barrier generation system doesn't work well, so we do it manually. You can see a before and after PIX capture. In the before capture we have a lot of barriers between the dispatches; some batching was done, but as you can see the batching logic did not succeed all the time. With the manual path this all got cleaned up, and now the dispatches are actually able to overlap.

These are the two RGP captures, before and after, and here you can also nicely see how the dispatches overlap. This change improved the performance of this particular function by up to 60%.
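A minimal sketch (with hypothetical engine types) of the idea behind that manual path: record the independent skinning dispatches back to back, and issue a single batched UAV barrier only where the real dependency is, before the consumers of the skinned buffers:

```cpp
#include <d3d12.h>
#include <vector>

struct SkinningJob { /* ... root arguments, group counts ... */ };

void RecordSkinning(ID3D12GraphicsCommandList* cmdList,
                    const std::vector<SkinningJob>& jobs,
                    const std::vector<ID3D12Resource*>& skinnedBuffers)
{
    // The skinning dispatches are independent of each other, so no barriers between
    // them: they are free to overlap on the GPU.
    for (const SkinningJob& job : jobs)
    {
        // ... bind root arguments for `job` ...
        cmdList->Dispatch(/* group counts for this mesh, e.g. */ 64, 1, 1);
    }

    // One batched UAV barrier for all skinned buffers before their consumers
    // (GBuffer pass, BLAS refits) read them.
    std::vector<D3D12_RESOURCE_BARRIER> barriers(skinnedBuffers.size());
    for (size_t i = 0; i < skinnedBuffers.size(); ++i)
    {
        barriers[i].Type          = D3D12_RESOURCE_BARRIER_TYPE_UAV;
        barriers[i].UAV.pResource = skinnedBuffers[i];
    }
    cmdList->ResourceBarrier(static_cast<UINT>(barriers.size()), barriers.data());
}
```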

Depending on where you are in the development timeline of your game, it might be worth doing it manually, because changing the automatic system carries quite some risk. Fixing the automatic system is of course the preferred way, but it might not be as simple, which is why it sometimes makes sense to handle specific passes manually. Okay, moving on, enough about barriers. Now I want to quickly cover the one-line shader change we made, which affected how the output was written to the UAV target.

The particular shader involved is the scattering light fog shader. As the name suggests, it computes the light that gets scattered by fog. You can see it in the screenshot: it's that particular look you get when the environment is a bit foggy and light shines through. This also means that the impact of this shader is very scene dependent: in scenes with a lot of fog it was actually quite costly, but in scenes with little or no fog it was not significant at all. It's a lighting shader, not too different from the direct lighting shader.

What the direct lighting shader does is go through all the lights, compute the final RGB values and write them out to its UAV target, which has the format RG11B10 float. It just writes the three values out, so the output write pattern for the direct lighting shader looked like this: a coalesced 256-byte block. The scattering light fog shader also computes RGB output values, but its UAV target has a different format with four channels, RGBA16 float.

But the shader was still writing only three channels. So the output write pattern looked like this, and this is not a coalesced 256-byte block anymore, which goes against our recommendation. Why do we have this recommendation? Because partial writes and compression don't go well with each other. If the data is uncompressed, we can simply mask the writes for partial writes and that's all. But we cannot do this for compressed data: a compressed 256-byte block needs to be decompressed first and then compressed again in order to preserve the untouched channels.

That's why, in order to use compression efficiently, you have to fully overwrite the underlying data. We want to write in coalesced 256-byte blocks, because then we can directly write compressed blocks without decompressing first, since there is nothing to preserve. Fortunately the 4th channel was not even used, so it didn't contain any valid data and we could simply overwrite it with zeros, which is what we ended up doing. Of course, sometimes the 4th channel may contain valid data. In that case what you can do is read that data and just write it back again; you should measure the performance of course, but this can also give you a performance boost. The observed speedup of the scattering light fog shader was up to 30 percent. The UAV target was compressed; we probably would not have seen this uplift if the UAV target had been uncompressed.

In scenes with a lot of fog this was actually quite significant. So really try to write to UAV targets in compute shaders in coalesced 256-byte blocks. The rule of thumb is to have an 8x8 thread group write to an 8x8 block of pixels and really write to all channels. If you have to preserve one of the channels, try just reading it and writing it back. When we tried that on this shader, even though it was not needed, we also saw a performance uplift of about 30 percent.
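A quick worked check of that rule of thumb, using the two formats mentioned earlier; an 8x8 thread group covers 64 pixels, so writing every channel touches only whole 256-byte blocks:

```cpp
#include <cstdio>

int main()
{
    const int pixelsPerGroup = 8 * 8;   // 8x8 thread group -> 64 pixels
    const int bytesRG11B10   = 4;       // RG11B10_FLOAT: 4 bytes per pixel
    const int bytesRGBA16F   = 8;       // RGBA16_FLOAT:  8 bytes per pixel

    std::printf("RG11B10 : %d bytes -> %d full 256-byte block(s)\n",
                pixelsPerGroup * bytesRG11B10, pixelsPerGroup * bytesRG11B10 / 256);
    std::printf("RGBA16F : %d bytes -> %d full 256-byte block(s)\n",
                pixelsPerGroup * bytesRGBA16F, pixelsPerGroup * bytesRGBA16F / 256);
    return 0;
}
```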

But of course, test it yourself. Enough about optimizations; now I want to talk about FidelityFX Super Resolution 2.0. I will talk about the integration specific to Deathloop. If you want to know about the algorithm, the optimizations we've done for this effect and general integration guidelines, stay on for the next talk, because it will be a deep dive into FidelityFX Super Resolution 2.0. But here are some screenshots just to see FSR 2.0 in action. On the left you see native rendering in 4K with TAA and the sharpener enabled.

Why they are enabled, I will talk about a bit later. On the right you have FidelityFX Super Resolution 2.0 in quality mode, so upscaling from 1440p to 4K. What I want to point out here is this board with the sun on it: you can really see that the edges of the wooden pattern are really well reconstructed in the upscaled image using FSR 2.0, I'd say even better than in the native one.

Deathloop is the first title in which we integrated FSR 2.0. You could even say the integration was part of the development, because it was an early proof of concept in a real game. As you know, integrating a technology into a game is always different from integrating it into a sample, and the feedback and lessons learned from integrating FSR 2.0 into Deathloop went directly back into the development of FSR 2.0.

It's a temporal upscaler, so we have a few input resources: the color buffer, the depth buffer and the motion vectors. Deathloop also provides the exposure texture. FSR 2.0 comes with its own TAA and also has its own sharpener. Deathloop makes use of the FSR 2.0 library.

In the beginning, when there was no library, we used the source code: we just took the cpp files and compiled them. But as soon as the library was available we switched to it, and we make use of the API entry points for context create, destroy and dispatch. We don't use the helper functions, simply because in the beginning there were no helper functions since there was no library. By the time we had access to the helper functions we had already implemented things our own way in Deathloop, so they were simply not needed anymore.

That's the reason: Deathloop was a bit ahead of the library in this case. We need to recreate the context when a creation parameter changes, for example when the presentation resolution changes. When this happens we need to be careful, because we cannot destroy the FSR 2.0 context immediately: it might still be in use and its resources might still be accessed. Deathloop uses triple buffering, so at the time we record a frame, another frame is already executing on the GPU and yet another one is already in the pipe.

That means when we realize a context needs to be destroyed, we need some logic to synchronize. What we do is register the FSR 2.0 context for destruction and actually destroy it at the end of frame n + 2: frame n is executing, frame n + 1 is waiting, so at the end of frame n + 2 we destroy the old FSR 2.0 context. However, we can immediately create a new context, which means that for a short time we have two FSR 2.0 contexts alive, and that is totally fine; the FSR 2.0 library can handle several live contexts. What the library does not do for you is the synchronization: the engine is responsible for that, and the engine is also in a much better position to decide when and how to synchronize.

The input color buffer is in linear color space, the image format is RG11B10 float, and to improve precision the values are multiplied by a pre-exposure value. FSR 2.0 needs to take this into account as well, so we pass the pre-exposure value to FSR 2.0 as a per-frame parameter. As already mentioned, Deathloop provides its own exposure texture, but in principle this is not needed, since FSR 2.0 can also compute its own.
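Going back to the deferred context destruction: a minimal sketch, assuming the ffx-fsr2-api entry points mentioned above and a hypothetical engine frame counter (header name and error handling are assumptions):

```cpp
#include <cstdint>
#include <vector>
#include "ffx_fsr2.h"   // ffx-fsr2-api header (name assumed)

struct PendingDestroy
{
    FfxFsr2Context* context;
    uint64_t        lastFrameThatMayUseIt;   // frame n + 2 in the scheme above
};

static std::vector<PendingDestroy> s_pendingDestroys;

// Called when a creation parameter changes, e.g. the presentation resolution.
void RecreateFsr2Context(FfxFsr2Context*& current,
                         const FfxFsr2ContextDescription& newDesc,
                         uint64_t currentFrameIndex)
{
    // The old context may still be referenced by the frames in flight, so only
    // register it for destruction here; it is actually destroyed after frame n + 2.
    s_pendingDestroys.push_back({ current, currentFrameIndex + 2 });

    // A new context can be created immediately; two live contexts are fine.
    current = new FfxFsr2Context();
    ffxFsr2ContextCreate(current, &newDesc);
}

// Called once per frame with the index of the last frame known to be finished on the GPU.
void DestroyRetiredFsr2Contexts(uint64_t lastCompletedFrameIndex)
{
    for (size_t i = 0; i < s_pendingDestroys.size(); )
    {
        if (s_pendingDestroys[i].lastFrameThatMayUseIt <= lastCompletedFrameIndex)
        {
            ffxFsr2ContextDestroy(s_pendingDestroys[i].context);
            delete s_pendingDestroys[i].context;
            s_pendingDestroys[i] = s_pendingDestroys.back();
            s_pendingDestroys.pop_back();
        }
        else
        {
            ++i;
        }
    }
}
```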

Motion vectors: luckily, Deathloop provides motion vectors for nearly all scene elements, including vegetation. This is great because motion vectors really help to get a high quality upscaled image. Here is an example screenshot with vegetation.

The bushes were actually moving in the wind, which is why the two images are not exactly the same. But you can really see that the quality is quite high, and motion vectors really help with that. For elements that don't have motion vectors, FSR 2.0 uses some other tricks.

Output: as you already noticed, I always compare the upscaled image with native rendering with TAA and sharpening on, because the native rendered image is actually quite soft in Deathloop. If you use the quality presets from high to ultra (I think it's high, very high, and ultra), the sharpener is enabled by default to improve this very soft image. We want to achieve a comparable result with FSR 2.0, which is why we also enable the sharpener by default.

FSR 2.0 comes with its own sharpener, Robust Contrast Adaptive Sharpening (RCAS). Since it has its own sharpener, we have to disable Deathloop's sharpener to avoid double sharpening.

Deathloop supports all four FSR 2.0 quality modes: quality, balanced, performance and ultra performance, which upscales by 3.0x. It also supports FSR 2.0 dynamic resolution scaling. This slide is the same as at the beginning, just as a reminder. It's actually not the whole frame, just a part I cropped out so you get the original size without compression.

This is the whole frame, though due to PowerPoint it's a bit compressed. This is native, and then you have quality, balanced, performance and ultra performance. As you saw, they're not completely identical, because Colt was breathing and I couldn't stop him breathing, so his hand was moving a bit.

Also, in this shot, which is ultra performance, you may have noticed that some objects are missing in the background and some light bulbs suddenly turned off. The reason is that the low resolution input image, 1280 x 720, did not contain these objects. They were already missing in the input image, and the light bulbs are off in the input color buffer, so of course you can't see them in the upscaled image either, because the input resource already didn't have them.

Performance impact: please be aware that all performance numbers shown here are based on an FSR 2.0 beta version, so they will most likely change for the final release. We compared against native 4K with TAA and the sharpener enabled, and with raytracing enabled, because raytracing is really demanding, so it really makes sense to use upscaling. We saw a great performance uplift, especially in ultra performance mode: up to 147 percent. Summary: we are already at the end of my talk.

Optimizations: barriers, I know it's really difficult to get them optimal; sometimes it's not even clear what optimal is. But it really makes sense to spend some time on them, because usually there is room for improvement. Then the write pattern in compute shaders when writing to UAV targets: really try to write in coalesced 256-byte blocks.

It can give you a nice performance boost when the target is compressed, so when you're working with compressed data. Feature integration: Deathloop was the first title that integrated FidelityFX Super Resolution 2.0. If you want to know more about this technology, stay on for the next talk; it will be a deep dive into FSR 2.0. We'll talk about the algorithm, the optimizations we did and general integration guidelines. That's everything, and thanks a lot for listening.
