How to write a Windows emulator for Linux from scratch
[Music] Hello everyone, and welcome to my presentation. My name is Martin Undak and I'm working on the Stadia porting team here at Google. The goal of this talk is to explain how binary translation of Windows games to Linux works, so hopefully you'll get a better understanding of how technologies like Wine and Proton work. Let's dive into it.

First, a little bit about myself. I've been working in the game industry for about 12 years now, mostly as an engine and porting programmer. Here you can see some games I've worked on in the past. Currently I'm part of the porting team at Stadia, where we work on tools and technologies to ease porting efforts for developers bringing their games to Stadia.

Let's start by reminding everyone what the Stadia platform is built on top of. As you probably know by now, Stadia is a Linux-based platform using Vulkan as its rendering API. Even though I say Linux, it's actually a very custom system stripped of regular desktop components. For example, we only support 64-bit applications, and we don't provide any user-space libraries other than Vulkan, PulseAudio, and a proprietary interface for handling game things like input, online, and others.

As you can imagine, porting games from Windows to Linux is not very easy, and especially for older games it becomes more and more expensive. To help with that, our team already provides developers with a few options, which we talked about last time. The first is out-of-the-box Unity and Unreal Engine integration: if your game is built using a recent supported version of one of those engines, it can be easily run and polished for Stadia. The second is the Stadia Porting Toolkit, a set of source code libraries that help translate Windows APIs to Linux. For example, we use DXVK to implement DirectX APIs, so developers don't have to modify their game renderer at all. We already have multiple partners using this toolkit, and the feedback we're getting is really encouraging.

The new thing I'm going to talk about today is an approach we took to run unmodified Windows
games on Stadia using something that we call binary translation. Similar to Wine and Proton on Linux, it allows you to run your Windows games on Stadia without the need for recompilation or any other platform-specific work.

So how does it work? Let's go over a high-level overview of how technologies like Wine and Proton work in general. Wine is not an emulator, which means that CPU instructions are not emulated, since we are running on the same architecture. That's not entirely true for the 32-bit to 64-bit journey, but you will hear about that later on. However, as you'll see, the whole operating system is emulated, or, as we call it, translated.

Here you can see an overview of how a Windows application communicates with the PC. You can see that applications use system libraries, or DLLs, to communicate with the operating system. The only important part for us here is the interface between the application and the DLLs. What those DLLs do inside, and how they communicate with the Windows kernel themselves, is not really relevant. We can replace the system DLLs with our own and provide all the services needed by the application by using our own implementation.

So a rough plan for running a Windows game on Linux looks as follows. We start by loading the game executable from disk; it contains all the information needed to correctly load code and data segments into memory. Next, we patch the executable so that instead of loading the real system DLLs it points to our internal implementation. The final step is to simply jump to the game code itself, and the game is running.

Over the next few slides I'm going to walk you through the steps needed to create your own Windows emulator. Let's start with some basic assumptions. We will focus on 64-bit games only: the x64 implementation on Windows simplifies many things, and you'll be able to run newer games. Let's focus on a single process for now; this is enough to run lots of DRM-free games that don't require a launcher. Again, for simplicity, you can start with single-player games only.
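
As a small illustration of the 64-bit-only assumption, here is a minimal Python sketch that reads the machine type from a PE image and rejects anything that is not x64. The field offsets (the `MZ` signature, the 4-byte `e_lfanew` value at offset 0x3C, the `PE\0\0` signature followed by the 2-byte Machine field) come from the public PE format; the function names `pe_machine`, `is_supported_game`, and `make_fake_header` are hypothetical and were not shown in the talk.

```python
import struct

# Machine constants from the PE/COFF format.
IMAGE_FILE_MACHINE_I386 = 0x014C   # 32-bit x86
IMAGE_FILE_MACHINE_AMD64 = 0x8664  # 64-bit x64


def pe_machine(image: bytes) -> int:
    """Return the COFF Machine field of a PE image held in memory."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE image: missing 'MZ' signature")
    # e_lfanew (offset 0x3C in the DOS header) holds the file offset
    # of the 'PE\0\0' signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing 'PE' signature")
    # The COFF header starts right after the signature; Machine is its
    # first field (2 bytes, little-endian).
    (machine,) = struct.unpack_from("<H", image, pe_offset + 4)
    return machine


def is_supported_game(image: bytes) -> bool:
    """Hypothetical policy check: only x64 executables are accepted."""
    return pe_machine(image) == IMAGE_FILE_MACHINE_AMD64


def make_fake_header(machine: int) -> bytes:
    """Build a tiny synthetic header, for demonstration only."""
    dos_header = bytearray(0x40)
    dos_header[0:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, len(dos_header))
    return bytes(dos_header) + b"PE\0\0" + struct.pack("<H", machine)
```

With a real game you would pass the bytes of the `.exe` file instead of the synthetic header; a full loader would go on to read the section table and the import directory.
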
Getting DRM to work under emulation is tricky, especially since most DRM systems use undocumented anti-tampering techniques that are hard to get right. Fortunately, there are enough DRM-free games out there for you to not care about this at the beginning.

We start from the top: loading and parsing Windows executables, which are called PE for short. A PE file consists of a header and various sections. The header, as you probably expect, contains information about where things are placed in the file, while the sections contain the actual code and data. We need to load the sections into memory at the proper offsets. We are especially interested in the import section of a PE file, which specifies which system DLLs and functions are required for the executable to run correctly.

On the slide you can see an oversimplified view of an import section, but it should give you a basic idea. The section contains a list of individual functions that are called anywhere from the executable. Each entry contains the name of a DLL, at the top here, the name of a function, in the middle, and the memory address where that function is loaded by the system, at the bottom. So all we have to do is provide our own function pointer instead, and the game will simply call our implementation.

Now that we know where to look for the required DLLs, let's just print out the list. We need a way to somehow provide valid function pointers even though there's no real implementation on our side yet. The reason for this is that we don't want the game to simply crash in a random place; rather, we want to know which of the unimplemented functions was hit. The way we approached this was to dump all the exports from a given DLL and use code generation to generate all the needed functions. You can use the dumpbin tool that comes with Visual Studio to list all of the functions exported by a given DLL, but there are also many other programs that you could use instead. Then you can write a script to go through the list and generate C++ code. You can see an
example of an automatically generated stub function. It basically just prints out its name and then crashes on the not-implemented macro, but that's enough for us: the debugger will break right here, showing us exactly what to work on next. We also maintain a simple map that matches names with addresses, so when we load the executable we can find the appropriate function and plug it into the import table.

At this point we've replaced all the OS functionality with stubs. When we jump to the entry point of the executable, it will just run as expected, because it's the same CPU. As soon as it calls an OS function, it will break in one of the stubbed functions. Now it's just a matter of implementing this function and repeating. We found that this is a really nice way of slowly moving forward without being overwhelmed by the amount of work that has to be done. It also gives a nice feeling of progress and accomplishment. Although I have to warn you: there are hundreds of Windows functions you need to implement to launch a game, not to mention other APIs like DirectX that can be huge. But once a function is implemented, it is there for all future games, so it gets easier the more games you get running.

There are some challenges that you will face while porting Windows applications. One of them is TLS callbacks: those are functions put into the executable that need to be called before the entry point itself.

Windows also provides something called the Process Environment Block and the Thread Environment Block, which are, respectively, process-specific and thread-specific. Those are structs filled with various OS information. The challenging part here is that the TEB can be accessed directly through the GS CPU register, which means access to it cannot be trapped as easily as API functions. Instead of hitting the not-implemented macro you've seen before, the game will just crash. It takes some assembly debugging techniques to figure out any issue around this.

Operating systems do a lot of complicated things to unwind the call stack when
an exception is thrown. Not only do you have to emulate the way Windows does it, but you also have to unwind Linux stack frames when needed. This is probably one of the most complicated and hardest-to-debug parts we've encountered so far. The good news is that there are lots of games that don't use exceptions at all, so you might just ignore it at the beginning.

Depending on what your goal is, you might not want to implement everything yourself. The graphics APIs especially are massive and require a lot of code, to parse and recompile shaders, for example. Fortunately, there are some great open source projects with permissive licenses that you can very easily hook up in your project. The gold standard used by Proton for DirectX translation is DXVK. If your game uses the XAudio API, you can use an open source implementation called FAudio. Those are just two examples of the resources available.

Okay, so now that we have a 64-bit game running, what about older ones that are 32-bit? Stadia is 64-bit only; we don't have access to any 32-bit libraries. But we got an idea: what if we just parse the executable offline and convert all the code from 32-bit to 64-bit? Sounds simple, right?

So the rough plan is: parse an x86 binary offline and extract all the code parts. As you'll see on the next slide, this is actually the hardest part. Next, we took the approach of disassembling and decompiling x86 assembly instructions back to C++ code, actually mostly C. We considered different options like the LLVM IR representation, but we decided that C++ would be the easiest to read and debug. Finally, we compile everything together, along with the API implementations that we already built for 64-bit games, using a regular 64-bit compiler. Since we are doing this conversion offline, we don't even need any dynamic DLL loading and patching, so all of the code can be compiled into a single optimized binary. And there you have it: the game is running.

Actually, let me walk you through some details. As I mentioned before, the
hardest part here is to actually find the code in the executable. The way assembly works is that there's no clear distinction between code and data. There are entire papers written on this subject, and tools like IDA Pro or Ghidra that try to do those things, but as far as I know nobody has solved this problem with 100% reliability yet for all kinds of binaries. But we try anyway.

Use PDBs: that's the easiest way. If you have access to the debug symbols of the game, you can easily extract all the function addresses from there.

We can start following the assembly from the known entry points. Call instructions are obvious: they point to existing functions. Jumps are more tricky: they might be gotos, switches, or actual function calls, and it's not always trivial to distinguish between them.

You can also look for special patterns that compilers emit at the beginning and end of functions. In general, looking for known code patterns is a good strategy.

There is also a special section in the executable that contains all the addresses that have to be adjusted in case the executable gets loaded at a different address than it was compiled for. While this is mandatory for DLLs, executables might not have it, because the OS can guarantee their load address. If you are lucky and your executable contains a relocation table, following addresses from there will show you where the code is.

Now that we know where all the code blocks are, we need to disassemble them. While writing a disassembler is an interesting challenge on its own, there are multiple open source solutions for the job to pick from; a few are named here. Once you disassemble the code, it's pretty trivial to convert it to simple C++ code looking something like this. As you can see, it's very low level and is basically a one-to-one translation from the assembly. While this approach generates megabytes of code, the idea here is that the C++ compiler will do a good job optimizing the code and actually produce a sane binary.

Because the DLL
implementations we've written so far expect the x64 calling convention, we need to write some wrappers that get parameters from the 32-bit stack, potentially convert them, and pass them to the 64-bit code. Here you can see some pseudocode; our actual implementation is very template-heavy and, in turn, not really presentable. Let me walk you through it. We get the emulated CPU state as the input argument. The 32-bit calling convention keeps arguments on the stack, so we pop the appropriate 4-byte value here. Then we pass this value as a legitimate argument to the original 64-bit implementation, and we get a 64-bit return value from it. Since we need to return to the 32-bit world, we need to adjust the return value. There are some tricks you could do with pointers, but you can also try to get away with keeping a giant map.

As you can imagine, there are some nitty-gritty details that you need to solve. 32-bit games use the FPU coprocessor for floating-point operations, which needs to be decompiled and emulated in addition to regular x86 code. 32-bit Windows has a totally different way of handling exceptions, which needs to be emulated; again, many games don't use exceptions at all, so you might just get away with it. A 32-bit game assumes that all memory is mapped in the 4-gigabyte address space. This won't be the case by default in your 64-bit code, so you either have to maintain some kind of translation or make sure you enforce allocating memory in the desired range. As part of the 32-bit CPU emulation, a separate stack has to be maintained. It's especially important when implementing exception handling, because then two stacks have to be unwound.

So how did we do? We were able to run 32-bit AAA titles with 60 FPS performance on a pure 64-bit Stadia operating system. Because all the conversion and OS patching is done offline, the result is a regular compiled ELF binary with all the symbols. If you have PDBs for the original game, you can use them to name the actual game
functions and then have proper call stacks when debugging and profiling.

Because there's no way to always tell from the offline analysis where the code will jump, we need to maintain a map of 32-bit to 64-bit addresses, which needs to be searched for every jump. This hurts performance, in particular when there are lots of virtual function calls. We've seen even a fourfold performance decrease in some specific game areas because of this. While the offline conversion worked well for a few games, we started seeing more and more games that had their relocation tables stripped, which basically prevented us from reliably discovering code segments inside the executable.

At the same time, Apple announced their Rosetta 2 technology, which is able to convert x64 to ARM64 code on the fly. We thought, well, that's amazing, let's do something like this. The idea is that we process the game at runtime, so we know exactly which code paths it takes. We don't have to know where the code segments are anymore; the running game will simply show us. Once the game jumps to a code segment, we disassemble it, generate new x64 code on the fly, and then jump to it. Think of it as putting a breakpoint at every single jump instruction: when it hits, we disassemble the x86 code and generate x64 instructions directly. Because x64 is basically a superset of x86, most of the conversion is pretty straightforward. After the block is ready, we jump there and allow the game to continue.

Caching the converted blocks is a pretty straightforward optimization: we don't have to convert the same blocks over and over again. The idea is that after an initial few seconds of hitching, the game will play smoothly, just reusing already converted code.

Our code for runtime instruction translation has more than 4,000 lines, so it's hard to show here, but here's some pseudocode as an example. As you can see, it's basically a simple loop going through x86 assembly instructions one at a time. Whenever an instruction needs to access memory, it might need to be adjusted,
which is what the process-instruction function does. Most of the work is done for jumps and function calls. At the end of the loop we assemble the instruction again, this time using an x64 assembler, and append it to the end of the generated block. When the whole block is processed, we jump directly to the first instruction of the converted block.

How did it go? We don't need to know where the code is anymore: there is no need for relocation tables, and all executables are supported. For the games we tested, there was no visible difference between the previous offline solution and the runtime one. There is an obvious drawback: the cache needs to be built and maintained for each game.

Time for a summary. Even though the technology is complex, it didn't require a big team to get some good results. While software like Wine is more than 20 years old and hundreds of people work on it on a daily basis, it's still possible to get some games running relatively quickly. While we've proven that games can be run successfully using this technology, and performance is very good, it still requires a significant amount of effort per game to make it work. This is obviously where the 20 years of development comes in. We adapted our runtime 32-bit conversion to work with Wine and Proton to be able to test more games. Results are very good, and this is definitely one of the paths we are considering moving forward when thinking about 32-bit game support on Stadia.

The question you might have been asking yourselves during this presentation is: why did they even bother to write such a technology themselves rather than use well-known and battle-tested solutions like Wine and Proton? Well, as I said at the beginning, Stadia uses a very stripped-down version of Linux. Basically you can only depend on libc, pthread, and some other core libraries, while Wine requires lots of external dependencies to even build. Stadia is built using custom hardware with custom drivers and Vulkan extensions. We also don't have to care about multiple
windows, alt-tabbing, multitasking, and other things that desktop applications have to support. This allows us to optimize game-specific paths. We also don't need most of the things that desktop Wine has to support, like GUI, ActiveX, browsers, etc. We can make assumptions and optimizations that wouldn't be possible otherwise.

If anyone has tried to debug anything under Wine, they know what this means. We as game developers are used to using Visual Studio, where code completion just works, you can hit F5 to build and debug your code, and when it crashes you see exactly where and why. Our technology works on Windows, Linux, and Stadia, which allows us to quickly debug and compare across platforms to solve hard problems.

And last but not least, creating new things and solving hard problems is fun. Writing this technology from scratch allowed me and the team to learn how binary translation works. With this knowledge, we are much better equipped to understand and potentially contribute to other similar projects if needed.

Let me give a huge shout-out and thanks to Andrew and Greg, who worked with me on this project, and to Christian, who invented and implemented most of the 32-bit stuff. It's amazing how much a small, skilled team can achieve in a short amount of time. And with that, thank you so much for listening to my presentation, and I wish you the best of luck in your own experiments. Bye. [Music]
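
To make the stub-and-patch step from the talk a bit more concrete, here is a hedged Python sketch of the idea: every imported function gets either a real implementation or a generated stub that reports its own name when hit. The talk describes doing this with generated C++ and a not-implemented macro; the names `UnimplementedImport`, `make_stub`, and `resolve_imports`, and the use of a Python dictionary in place of the PE import table, are assumptions made purely for illustration.

```python
class UnimplementedImport(Exception):
    """Raised when the game calls an OS function we have not written yet."""


def make_stub(dll: str, name: str):
    """Generate a placeholder that identifies itself when called."""
    def stub(*args):
        # In the talk's C++ version this is where the debugger would break.
        raise UnimplementedImport(f"{dll}!{name}")
    stub.__name__ = name
    return stub


def resolve_imports(imports, implemented):
    """Build the table that stands in for the patched import table.

    imports:     iterable of (dll_name, function_name) pairs read from the PE.
    implemented: dict mapping (lowercase dll_name, function_name) to callables.
    """
    table = {}
    for dll, name in imports:
        key = (dll.lower(), name)  # DLL names are case-insensitive on Windows
        func = implemented.get(key)
        if func is None:
            func = make_stub(dll, name)
        table[key] = func
    return table
```

Running a game against such a table fails on the first missing function and names it, which matches the implement-and-repeat loop described in the talk.
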
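
The runtime approach from the talk, translating a block the first time the game jumps to it and reusing it afterwards, can be sketched roughly as follows. This is a toy model under stated assumptions, not the team's implementation: "translated blocks" are plain Python closures rather than generated x64 code, and `BlockCache`, `run`, `GUEST`, and `translate` are hypothetical names for this example only.

```python
class BlockCache:
    """Maps guest (32-bit) addresses to already translated blocks."""

    def __init__(self, translate):
        self._translate = translate  # callable: guest address -> block
        self._blocks = {}
        self.misses = 0              # how many blocks had to be translated

    def lookup(self, guest_addr):
        block = self._blocks.get(guest_addr)
        if block is None:
            # First visit: translate once and remember the result.
            self.misses += 1
            block = self._translate(guest_addr)
            self._blocks[guest_addr] = block
        return block


def run(entry, cache, state):
    """Dispatch loop: each block returns the next guest address, or None."""
    addr = entry
    while addr is not None:
        addr = cache.lookup(addr)(state)


# A toy "guest program": increment a counter and loop back until it reaches 3.
GUEST = {
    0x1000: ("inc", 0x1010),
    0x1010: ("loop_while_less", 3, 0x1000),
}


def translate(addr):
    """Stand-in for disassembling x86 and emitting x64 for one block."""
    op = GUEST[addr]
    if op[0] == "inc":
        def block(state):
            state["counter"] += 1
            return op[1]
    else:
        def block(state):
            return op[2] if state["counter"] < op[1] else None
    return block
```

In this toy run the loop body executes three times, but each of the two blocks is translated only once, which is the effect the caching optimization is after.
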
2022-03-19 21:42