Hello, my name is Aurélien Sérandour. I'm a Software Engineer at AMD and I am currently working with Luminous Productions on their upcoming game, Forspoken. My role is to help them implement our technologies in their engine. Today I am co-presenting this session with Teppei Ono from Luminous Productions. He is the Director of the Luminous Engine project.
This talk will introduce some of the technologies present in Forspoken. This session will be separated into 5 parts. First, Ono-san from Luminous Productions will introduce Forspoken and himself. After that, I will briefly talk about the partnership with AMD. Then I will detail the AMD technologies present in the engine, along with some modifications we have been making together. Ono-san will then talk about DirectStorage and its implementation. Finally, I will wrap up the session with some closing words.
Now I will let Ono-san introduce himself and Forspoken, the game he is working on. Hello. My name is Teppei Ono and I am the technical director of the Luminous Engine at Luminous Productions. Forspoken is a PlayStation 5/PC title scheduled to release on October 11th. We have implemented many AMD graphics features and Microsoft DirectStorage into it. With its high-end graphics and the fast loading times of its vast world, this is a game that will make you experience the next generation. AMD has been working in collaboration with Luminous Productions on Forspoken since July 2021.
We have closely followed the progress of the game and verified how AMD technologies were implemented, as well as providing support. AMD is supporting Luminous in the following ways. We are ensuring the correctness of the implementation of our technologies. We are helping them with code modification and customization if needed. We are reporting bugs and
we are also performing regular performance checks and helping them with performance tuning. A benchmark is also present to help us track the evolution of the application throughout its development. This is my role as a DevTech, but I am also the interface to the whole AMD Game Engineering team, composed of engineers with a wide range of expertise. Luminous Productions showed a lot of interest in our technologies. As a result, Forspoken is a very
special title for us, as it is one of the games that use many AMD technologies at the same time. Furthermore, those are used on PC and on PlayStation 5. The included technologies, under the FidelityFX name, are our Single Pass Downsampler, our denoiser, our ambient occlusion effect, our screen space reflection effect, our variable shading effect and our super resolution effect. I will detail each of them in the coming slides.
A quick disclaimer: at the time of recording, all these effects are work in progress and their look is subject to change. Single Pass Downsampler, or SPD for short, is a way to quickly and accurately create the full or partial mip chain of a texture. Because it is a single dispatch and a single header file, it is very easy to integrate into any engine.
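As a rough illustration of this idea, here is a minimal CPU-side sketch (not the actual FidelityFX SPD HLSL interface): each mip level is produced by a user-supplied 2x2 reduction, which SPD performs for the whole chain in a single compute dispatch on the GPU.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Each mip level is built by reducing 2x2 quads of the previous level with a
// user-supplied operator (min, max, average, ...). The real SPD does this for
// the whole mip chain in a single compute dispatch on the GPU.
using ReduceOp = std::function<float(float, float, float, float)>;

std::vector<std::vector<float>> BuildMipChain(std::vector<float> mip0,
                                              int width, int height,
                                              const ReduceOp& reduce)
{
    std::vector<std::vector<float>> mips;
    mips.push_back(std::move(mip0));
    while (width > 1 || height > 1)
    {
        const int w = std::max(width / 2, 1);
        const int h = std::max(height / 2, 1);
        const std::vector<float>& src = mips.back();
        std::vector<float> dst(static_cast<size_t>(w) * h);
        for (int y = 0; y < h; ++y)
        {
            for (int x = 0; x < w; ++x)
            {
                // Clamp so odd dimensions still sample valid texels.
                const int x0 = std::min(2 * x,     width  - 1);
                const int x1 = std::min(2 * x + 1, width  - 1);
                const int y0 = std::min(2 * y,     height - 1);
                const int y1 = std::min(2 * y + 1, height - 1);
                dst[static_cast<size_t>(y) * w + x] =
                    reduce(src[static_cast<size_t>(y0) * width + x0],
                           src[static_cast<size_t>(y0) * width + x1],
                           src[static_cast<size_t>(y1) * width + x0],
                           src[static_cast<size_t>(y1) * width + x1]);
            }
        }
        mips.push_back(std::move(dst));
        width = w;
        height = h;
    }
    return mips;
}

// Example: a "max" reduction, as used for a hierarchical depth buffer:
// auto hiZ = BuildMipChain(depth, 1920, 1080,
//     [](float a, float b, float c, float d) { return std::max({a, b, c, d}); });
```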
The code supports many downsample types (like minimum, maximum, average) and in fact can be completely customized. Because it is a single compute dispatch, it offers a good performance improvement over multiple dispatches or draw calls. It is used extensively in the engine to downsample the depth buffer for screen space reflections, the color buffer for water refraction and shadow textures for raytraced shadows. Now I want to talk about our ambient occlusion effect, called CACAO. CACAO stands for Combined Adaptive Compute Ambient Occlusion. This is a very customizable SSAO solution.
For this effect, Luminous Productions wanted an ambient occlusion solution with a sharp output. Like all our effects, they have access to the source code, and we have been working on a modified version of the algorithm that operates at a higher resolution to reach the desired look. The code is pure compute and is currently executed on the async compute queue. This is a screenshot of the ambient occlusion output Luminous Productions previously had in their engine.
This is how CACAO currently looks in the engine. Parameters are still being adjusted to keep the ambient occlusion as sharp as Luminous Productions desires. But as you can see on the stairs, the sharpness is very good. To further improve the look of the ambient occlusion, the game can use raytraced ambient occlusion, using hardware raytracing. RTAO doesn't replace CACAO but works in conjunction with it. RTAO is executed at a lower resolution and uses AMD's denoiser to clean the output. RTAO works in 3 main steps. First, we create a mask (based on the distance to the camera) where
the rays can be traced. Then we trace them. Finally, the denoiser denoises the output. AMD's shadow denoiser is used here as it works for both shadows AND ambient occlusion. Currently, the whole effect takes around 2.3 ms at 4K output on a Radeon RX 6900 XT. This screenshot shows the ambient occlusion mask.
As you can see, it is generated based on the distance to the camera. You can see that the character will receive more raytraced ambient occlusion than the stairs behind. Here is the denoised ambient occlusion. It's very clean and soft, especially on the character. It is in fact generated from a very small buffer (less than a quarter of the resolution) and our denoiser is able to resolve a lot of details.
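To make those two ideas concrete, here is a minimal sketch of a distance-based mask and of the blend with the full-resolution CACAO term; the function names and parameters are illustrative, not taken from the Luminous Engine.

```cpp
#include <algorithm>

// 1.0 close to the camera, fading to 0.0 beyond the fade-out distance.
float RtaoMask(float viewDepth, float fullRateDistance, float fadeOutDistance)
{
    const float t = (viewDepth - fullRateDistance) / (fadeOutDistance - fullRateDistance);
    return 1.0f - std::clamp(t, 0.0f, 1.0f);
}

// Where the mask is 1 the (denoised, lower-resolution) raytraced term dominates;
// elsewhere the full-resolution CACAO term is kept.
float BlendAmbientOcclusion(float cacaoAO, float rtaoAO, float mask)
{
    return cacaoAO * (1.0f - mask) + rtaoAO * mask;
}
```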
For comparison purposes, here is the previous CACAO output before using RTAO. The RTAO buffer is blended with the CACAO output based on the mask. The natural blending greatly improves the look of the ambient occlusion on the character and the stairs behind. Stochastic Screen Space Reflection, or SSSR, is an advanced screen space reflections algorithm. It is composed of 3 main passes. A classification pass to know where a ray is
necessary (using, for example, a roughness cutoff). An intersection pass based on hierarchical depth buffer traversal. Note that SPD is used there to generate the depth pyramid. Finally, our advanced reflections denoiser cleans the image. Our partnership with Luminous Productions has been crucial on this effect. They had early access to some features, such as an improved classification algorithm, a more efficient intersection algorithm and a more temporally stable denoiser. Luminous Productions also needed some improvements and many customizations to fit their engine.
In the following slides I will detail two of the main ones: a new environment map fetching scheme and incorrect occluder detection. I will start with environment map fetching. In our sample, the environment map, a cubemap, is prefetched in the classification pass, where SSR won't run. Then the SSR value is blended with the environment map based on a confidence factor computed during the intersection pass. This factor is never output by the pass because the blending happens
inside the intersection pass. Finally, the denoiser denoises all the output. This is summarized by the formula displayed here. The main limitation is that our sample only uses one environment map, which is a cubemap. In the Luminous Engine, screen-space reflections have been done differently.
They have multiple environment maps blended with the SSR value with a weight. There is one pass/drawcall per environment map. Also, each blend is multiplied by a BRDF factor that we don't want to denoise. The displayed formula represents how the engine works. As a result, we cannot efficiently prefetch the multiple environment maps and the factors. It's also challenging to determine when to run the denoiser.
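The exact formulas are on the slides; as a rough reconstruction from the descriptions above (the notation here is mine, not the slide's), the sample computes, with $c$ the confidence from the intersection pass,

$$\mathrm{result} = \mathrm{Denoise}\big((1 - c)\cdot \mathrm{EnvMap} + c \cdot \mathrm{SSR}\big),$$

while the Luminous Engine accumulates one weighted, BRDF-scaled blend per environment map $i$:

$$\mathrm{result} = \sum_i \mathrm{BRDF}_i \cdot w_i \cdot \big((1 - c)\cdot \mathrm{EnvMap}_i + c \cdot \mathrm{SSR}\big).$$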
We had to find a way to modify SSSR without modifying their pipeline too much. To stay close to their previous solution, the idea was to modify our denoiser to denoise the confidence along with the color and output them together. Then we can feed this denoised confidence into their blending passes. This formula depicts what we have done.
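Again as a rough reconstruction in my own notation: the denoiser now produces a denoised color and confidence pair $(\widetilde{\mathrm{SSR}}, \tilde{c}) = \mathrm{Denoise}(\mathrm{SSR}, c)$, and the engine keeps its per-environment-map blending passes:

$$\mathrm{result} = \sum_i \mathrm{BRDF}_i \cdot w_i \cdot \big((1 - \tilde{c})\cdot \mathrm{EnvMap}_i + \tilde{c} \cdot \widetilde{\mathrm{SSR}}\big).$$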
So, two modifications were written. We replaced the cubemap prefetch in the classification pass by setting a confidence value of 0. We also created a new version of the denoiser that operates on color and confidence at the same time. The other main modification was the detection of incorrect occluders during ray marching. On this diagram, you can see how the algorithm works. The ray traverses the depth buffer and when it hits it, the algorithm stops. So as soon as the ray is occluded, the algorithm exits, and a confidence is calculated.
However, some incorrect occlusion can occur when thin objects close to the camera prevent rays from being traced into the background. On the diagram, this is the case with the red ray. We first decided to try the fix in our sample. Here is a screenshot of the reflection buffer of our sample before the modification. Notice some of the occlusions highlighted here.
The pawn occludes the horse's reflection. Another pawn occludes the king's reflection. So, we added a parameter to reject some hits. When the ray hits an occluder, we can decide to either accept the hit or reject it and continue our ray marching. This is based on the distance between the hit and the depth buffer. Let me show you again the look of the sample before the modification. And this is the fixed version. The incorrect hits are rejected, and the reflection is now more accurate. Most incorrect occlusions are now resolved.
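As a minimal sketch of the hit validation we added (the names and the exact test are illustrative, not the shipping SSSR code), a candidate hit is only accepted when the ray lands within a small thickness of the depth buffer surface:

```cpp
// The candidate hit is accepted only if the ray ends up within a small thickness of the
// depth buffer surface; otherwise the surface is treated as a thin occluder and ray
// marching continues behind it.
bool AcceptHit(float rayDepth, float surfaceDepth, float depthBufferThickness)
{
    const float distanceBehindSurface = rayDepth - surfaceDepth; // > 0: ray is behind the surface
    return distanceBehindSurface >= 0.0f && distanceBehindSurface <= depthBufferThickness;
}
```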
This is a screenshot of the game with the previously available screen space reflections effect. As you can see, some reflections on the lake are missing on the side of the screen. The trees aren't fully resolved. Here is a screenshot with the first version of SSSR implemented into the game, with the denoised confidence factor. Notice the better reflections at the border of the screen. The trees are now fully visible on the body of water.
However, you can also notice the incorrect occlusion caused by the foliage. This is still a work in progress, but after this change, most of the incorrect occlusion is resolved. Also note that the sharpness of the reflection on the water has been vastly improved thanks to other modifications. Variable shading takes advantage of the new Variable Rate Shading (tier 2) feature available on all AMD RDNA 2 GPUs. This feature allows you to reduce the shading load on the GPU. If used well, you can save some precious time in your rendering without compromising the quality of your image. Variable shading uses the luminance variation within a tile of the reprojected previous frame to create the shading rate texture.
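As a minimal sketch of that selection (assuming one shading rate per tile and an illustrative contrast threshold; the actual pass works on the reprojected previous frame as described above):

```cpp
#include <cstdint>

// D3D12-style shading rate values for 1x1 and 2x2 (matching D3D12_SHADING_RATE_1X1/2X2).
enum class ShadingRate : uint8_t { Rate1x1 = 0x0, Rate2x2 = 0x5 };

// Pick a shading rate for one tile from the luminance variation measured in the
// reprojected previous frame: high-contrast tiles keep full rate, flat tiles drop to 2x2.
ShadingRate SelectTileShadingRate(float maxLuminanceDelta, float varianceCutoff)
{
    return (maxLuminanceDelta > varianceCutoff) ? ShadingRate::Rate1x1
                                                : ShadingRate::Rate2x2;
}
```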
Also, in the Luminous Engine, variable shading operates only at the lighting level, so textures aren't affected by it. As a result, the image quality isn't compromised. This is a screenshot of the game without variable shading. Every pixel is shaded at full rate. From the reprojected previous frame, a shading rate texture is generated. On this image, red is a full shading rate while green is a 2 by 2 shading rate. You can see that most of the image is in fact 2 by 2. Overlaid with the final image, we can see how the algorithm works. It detects tiles with high
contrast and sets a full shading rate there. The ground is black and white, hence the full shading rate. The low contrast areas, like the ones in shadow, use a 2 by 2 rate. This is the final image with variable shading active. The difference from an image shaded at full rate isn't noticeable. I have zoomed in on a place where the 2 by 2 shading rate is used. As you can see, there is very little difference.
Hybrid shadows is a technique to reduce the number of rays to trace while keeping correct shadows. It isn't necessary to raytrace everywhere because in most cases the usual shadow maps have enough information. You only really need to raytrace at the border of the shadows or if you want to get beautiful penumbras. The algorithm works the following way. First, you create a mask where you allow the algorithm to
trace rays. Then a classification pass finds where rays should be traced and where the shadow map is enough. You trace the rays, denoise the results and then blend everything at the end. This process currently takes 3.3 ms for a 4K output on the Radeon RX 6900 XT.
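Here is a minimal sketch of the classification and final blend ideas; the thresholds and names are illustrative, not the shipping hybrid shadows code.

```cpp
// A shadow-map result that is clearly lit or clearly occluded is trusted as-is;
// rays are only traced near the shadow boundary (and only inside the ray-tracing mask).
bool NeedsShadowRay(float shadowMapVisibility, bool insideRayTracingMask)
{
    const bool nearShadowBorder =
        shadowMapVisibility > 0.02f && shadowMapVisibility < 0.98f;
    return insideRayTracingMask && nearShadowBorder;
}

// The final blend keeps the shadow map everywhere a ray was not traced and uses the
// denoised raytraced visibility elsewhere.
float ResolveShadow(float shadowMapVisibility, float denoisedRayTracedVisibility,
                    bool rayWasTraced)
{
    return rayWasTraced ? denoisedRayTracedVisibility : shadowMapVisibility;
}
```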
To show the benefits of the solution, here is a screenshot of the game without raytraced shadows. I have highlighted some areas where shadows exhibit some artifacts. For example, on the balustrade. The character's garments also don't cast shadows on the protagonist.
Finally, the character's shadow is also a bit fuzzy and the shadow map resolution can be seen in motion. With RT shadows active, the 3 highlighted areas are greatly improved. The shadow on the balustrade is now straight. The character receives proper
shadows from her garments. Also, her shadows are now cleaner and more stable. This is the final image without the highlighted areas. This is a zoom on the main character: without raytracing, her garments don't cast shadows on her. And you can see that with raytracing, the character receives proper shadows.
You can also notice that the shadow on the neck is sharper thanks to raytracing. Raytraced shadows also make it possible to create beautiful penumbras. Here is a screenshot of the game with RT shadows off. Notice how the building shadow is extremely sharp. Also notice the shadow artifacts on the red car and its wheels on the right-hand side of the image.
With raytraced shadows, there is now a large penumbra. The car also casts and receives better shadows. The character's shadow is also improved in this screenshot. Finally, I would like to talk about FidelityFX Super Resolution. FSR 1.0, for short, is a spatial image upscaling technique that can produce a high-quality, high-resolution image from a lower resolution texture. It has been integrated quite early into the game engine. The engine is modern, so the integration was easy. It has decoupled rendering/output
resolution. It supports MIP LOD bias. The code is a single header and only 2 dispatches are needed to run it. This gives more GPU time for advanced and heavier effects like the ones using raytracing. In the current version of the game, the Ultra Quality mode reduces the frame time by 21% while scaling by a 1.3x factor. The Quality mode, scaling by a 1.5x factor, reduces the render time by 26%.
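As a small worked example of those scaling factors at a 4K output (the MIP bias formula here is a common choice for upscalers, not necessarily the engine's exact value):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Worked example of the scaling factors quoted above, targeting a 4K output.
struct FsrMode { const char* name; float scale; };

int main()
{
    const uint32_t outputW = 3840, outputH = 2160;
    const FsrMode modes[] = { {"Ultra Quality", 1.3f}, {"Quality", 1.5f} };

    for (const FsrMode& m : modes)
    {
        const uint32_t renderW = static_cast<uint32_t>(outputW / m.scale + 0.5f);
        const uint32_t renderH = static_cast<uint32_t>(outputH / m.scale + 0.5f);
        // A negative MIP LOD bias keeps texture detail at the lower render resolution;
        // -log2(scale) is a common choice, not necessarily the engine's exact value.
        const float mipBias = -std::log2(m.scale);
        std::printf("%s: render %ux%u, mip bias %.2f\n", m.name, renderW, renderH, mipBias);
    }
    return 0;
}
```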
FSR 2.0 is a brand new, cutting-edge temporal image upscaling technique. It maintains or even improves the image quality compared to native rendering. The technique is currently being integrated into the Luminous Engine, but we can already give you a glimpse of what it is achieving. The engine, like most modern engines, supports temporal anti-aliasing, a motion vector buffer and a decoupled rendering/output resolution. As a result, the integration is pretty quick. Like FSR 1.0, this gives more headroom for costly operations like raytracing. All of the effects presented here are available, with documentation, on our GPUOpen website.
Also, we are providing the source code for all of them under the MIT license. If you are interested, try them out and contact your AMD DevTech engineer if you have any questions. Now I will let Ono-san talk about Luminous Productions' experience with Microsoft DirectStorage. In this segment, I'll showcase the functionality and effectiveness of Microsoft DirectStorage, which is integrated into FORSPOKEN. DirectStorage is a fast file I/O API that is compatible with both Windows 10 and Windows 11, with the latter benefiting from further optimizations.
It works effectively with large numbers of files as well as small files. In particular, using the DirectStorage API allows high-speed NVMe M.2 SSDs to fully demonstrate their capabilities, delivering file I/O performance of over 5000 MB/sec. In contrast to the current situation, in which many PC games are unable to maximize the potential of M.2 SSDs, FORSPOKEN was among the first to adopt DirectStorage with the aim of unleashing the capabilities of M.2 SSDs and realizing a next-generation gaming experience with faster loading times.
This graph represents the loading times of current PC game titles. While the graph indicates that the use of high-speed M.2 SSDs improves loading speed in most titles, the gains are not as significant as the difference in hardware performance would suggest. FORSPOKEN, on the other hand, has achieved loading that fully leverages hardware performance by implementing DirectStorage and optimizing with an emphasis on M.2 SSDs. In terms of asset loading flow, first, when a game asset request is received, file I/O processing runs. Then, the binary data that has been read is decompressed,
and the necessary initialization or GPU upload is carried out. DirectStorage can be used to accelerate the file I/O, decompression, and initialization steps of this process. This section explains the loading process using the DirectStorage API, which is capable of loading both CPU and GPU data. In our game, the data is stored in a compressed format. As such, when the data is loaded, it is placed in memory after decompression processing. For GPU data, a GPU decompression API will be supported in DirectStorage. GPU
decompression will allow GPU data to be loaded and stored directly in GPU memory without going through any CPU processing, resulting in a greater performance improvement. Because the current version of DirectStorage does not support GPU decompression, GPU data is decompressed on the CPU. Although buffer/texture uploads are feasible with the current DirectStorage features, we're not using them because the performance benefit we would gain is minimal. This implementation, however, still outperforms the existing Win32 APIs. Here, I'm going to compare the existing Win32 APIs to DirectStorage. The Win32 APIs use asynchronous I/O. With asynchronous I/O, multiple read
requests can be sent, but they must still be processed and synchronized one by one. DirectStorage, on the other hand, allows you to use multiple queues, enabling loading and decompression to be executed in parallel. Being able to synchronize multiple read requests at once is an important factor in the optimization process. DirectStorage is optimized towards asynchronous, streaming data transfers of file chunks from NVMe with low CPU overhead. Now, we're going to take a look at how the game loads these two event scenes, each of which has around 20,000 files, or 5 GB. This is a side-by-side comparison of the time it takes from when the saved data starts loading on the title screen until the event scene starts.
The top three are using DirectStorage, while the bottom three are using the Win32 API, with M.2 SSD, SATA SSD, and HDD from left to right. We can see that all of the DirectStorage-enabled storage devices load faster than those using the Win32 API. Here is the side-by-side comparison for the second scene. We can observe that the loading time in the environment using DirectStorage and an M.2 SSD is in the one-second range, which is extremely fast.
Even with SATA SSDs, there is a significant difference between DirectStorage and the Win32 API. HDDs, on the other hand, do not deliver the anticipated results due to hardware performance limitations. To recap the performance results we've seen so far: compared against the Win32 API, DirectStorage shows an improvement in performance when using SSDs. DirectStorage is able to maximize its performance, particularly on faster M.2 SSDs, demonstrating a substantial difference in file I/O speed over Win32. The difference in actual game loading times, however, is not as pronounced as the difference in I/O performance. This indicates that file I/O speed
has improved to the point that it is no longer a loading bottleneck. So, how does the loading speed affect the actual gameplay experience? I'm going to discuss how the save data of a scene loads and becomes playable. By using DirectStorage with an M.2 SSD, you can start gameplay with almost no loading time. Together with FORSPOKEN's unique and agile parkour system, it brings forth a pleasant and exhilarating gaming experience. Here is a comparison video of battle scenes. The M.2 SSD and DirectStorage-optimized version is particularly impressive.
This is slightly off-topic, but in actual gameplay, there is a system to load play data in advance on the title screen. This allows the gameplay to start right away without having to wait for the game to load. From here, I’ll go into greater depth on the performance analysis and our optimization efforts. Loading time includes: loader processing and memory allocation; file loading by DirectStorage; decompression; and asset initialization. The breakdown of each indicates that, with the utilization of M.2 SSDs and DirectStorage,
file I/O is no longer a bottleneck for loading times. Apart from file I/O, the bottlenecks prove to be decompression and asset initialization, both of which must be optimized more than ever in order to speed up loading. Decompression optimization requires the selection of a fast decompression algorithm as well as tuning of parallel execution. For the compression algorithm, we use LZ4HC, which has an extremely fast decompression speed. Its compression ratio is somewhat lower than other algorithms, but it offers a good balance between compression ratio and decompression speed. We also process decompression in parallel because it is a significant factor in loading times; it is essential to evaluate thread execution efficiency and adjust the number of threads accordingly, as in the sketch below.
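Here is a minimal sketch of this kind of chunked, multithreaded LZ4 decompression; the chunk layout and thread strategy are illustrative, not FORSPOKEN's loader.

```cpp
#include <lz4.h>      // LZ4HC-compressed data is decompressed with the regular LZ4 API

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical description of one compressed chunk inside an asset archive.
struct CompressedChunk
{
    const char* src;          // compressed bytes (read via DirectStorage)
    int         srcSize;
    char*       dst;          // destination in the asset's memory
    int         dstCapacity;  // uncompressed size
};

// Minimal sketch: decompress independent chunks on a fixed number of worker threads.
// A real implementation would use the engine's job system and tune the thread count.
void DecompressChunksParallel(const std::vector<CompressedChunk>& chunks, unsigned threadCount)
{
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&chunks, t, threadCount] {
            for (std::size_t i = t; i < chunks.size(); i += threadCount)
            {
                const CompressedChunk& c = chunks[i];
                LZ4_decompress_safe(c.src, c.dst, c.srcSize, c.dstCapacity);
            }
        });
    }
    for (std::thread& w : workers)
        w.join();
}
```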
Next, we tweaked the DirectStorage queue settings to improve the efficiency of asset initialization. DirectStorage supports multiple queues. For FORSPOKEN, we established separate priorities for each of the queues to ensure that data loading is prioritized appropriately, reducing the time it takes before gameplay becomes possible. For example, adjustments have been made so that level data, which is critical for gameplay, is read first, while streaming data for textures and meshes is read at a lower priority, as sketched below.
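Here is a minimal sketch of such a priority setup using the public DirectStorage API; the queue names, capacities and priority values are illustrative, not FORSPOKEN's actual configuration.

```cpp
#include <d3d12.h>
#include <dstorage.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Two DirectStorage queues with different priorities: one for level data that
// blocks gameplay, one for streamed textures and meshes.
void CreatePrioritizedQueues(IDStorageFactory* factory, ID3D12Device* device,
                             ComPtr<IDStorageQueue>& levelQueue,
                             ComPtr<IDStorageQueue>& streamingQueue)
{
    DSTORAGE_QUEUE_DESC desc{};
    desc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    desc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    desc.Device     = device;

    // Critical level data is serviced first through a high-priority queue.
    desc.Priority = DSTORAGE_PRIORITY_HIGH;
    desc.Name     = "LevelData";
    factory->CreateQueue(&desc, IID_PPV_ARGS(&levelQueue));

    // Streaming data can wait, so it goes through a low-priority queue.
    desc.Priority = DSTORAGE_PRIORITY_LOW;
    desc.Name     = "Streaming";
    factory->CreateQueue(&desc, IID_PPV_ARGS(&streamingQueue));

    // Reads are then enqueued on the appropriate queue with EnqueueRequest,
    // completion is tracked with EnqueueSignal on an ID3D12Fence, and the batch
    // is kicked off with Submit.
}
```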
There are other challenges we must overcome in order to attain a one-second loading time. We must prevent bottlenecks caused by large numbers of small files, and the memory allocator's performance is crucial, so we have adopted mimalloc, a fast multithreaded allocator. Despite these optimization efforts, file I/O processing still has to wait for the asset initialization process to complete. Further processing optimization is needed; however, there are so many different types of assets in the game that optimizing them all takes time and effort. Although we have eliminated
major bottlenecks for FORSPOKEN, there are still issues to be addressed. In the future, support for GPU decompression is anticipated to improve this area by reducing CPU processing and improving efficiency. Finally, I'll give a summary and discuss future enhancements. The performance of M.2 SSDs is now adequate to cut game loading times to about one second,
thanks to DirectStorage. We are, however, dealing with issues other than file I/O, which we are working to resolve. Support for GPU decompression is expected to be introduced, which we believe will cut loading times even more. This concludes the presentation by Luminous Productions. Thank you for your time.
Thank you, Ono-san, for this DirectStorage presentation. Before finishing this session, I would like to thank the whole Luminous Productions team and especially Takeshi Aramaki, Teppei Ono, Kosei Itou, Yasunobu Itou and Shao Ti Lee, whom I'm working with every day. I also want to thank my colleagues at AMD who are supporting me and Luminous Productions in implementing these effects in the Luminous Engine. To name a few: Jason Lacroix, Denisse Bock, Anton Schreiner, Jay Fraser, David Ziman. Thank you for attending our session and we hope you are as excited as we are for Forspoken!