Performant, scalable models in TensorFlow 2 with tf.data, tf.function & tf.distribute (TF World '19)
I'm Taylor, an engineer on the TensorFlow Keras team. And I'm Priya, also an engineer on the TensorFlow team; I lead the Distribution Strategy team. Today we're going to be talking about writing performant models in TensorFlow 2.

To give a brief outline of the talk: we're first going to talk about the structure of a training loop in TensorFlow 2, and then about the different APIs you can use to make your training more performant. The first API is tf.data. This is the API for writing high-performance input pipelines that avoid various sorts of stalls and make sure your training always has data ready as soon as it can consume it. Next is tf.function, which is how you can formulate your training step so that the runtime can run it with very minimal overhead. And finally, tf.distribute, which is how you target your training to specific hardware and also scale out to more hardware.

In order to demonstrate all of these APIs, we're going to walk through a case study, starting with the most naive implementation and then progressively adding more performant APIs and looking at how that helps our training. We're going to be training a cat classifier, because we're an internet company and that's what we do, and we'll be training it on MobileNetV2. This task was chosen largely arbitrarily; it's just a representative task, so these lessons should be broadly applicable to your workflows as well. There's a notebook, and I'll put the link at the end of the slides, so you're encouraged to take a look afterward and compare back to the talk.

So let's get started with a brief overview of training loops in TensorFlow 2. We're going to be writing a custom training loop, and the reason is that we want to look at the mechanics of what the system is actually doing and how we can make it performant. If you're used to a higher-level API like Keras Model.fit, these lessons are still broadly applicable; it's just that some of this will be done automatically for you. To show why these APIs are important, it's useful to really dig into the internals of the training loop.

Because we're training MobileNet, we've chosen to use the Keras Applications MobileNetV2, which is just a canned MobileNetV2 implementation. We're training a classifier, so we're going to use sigmoid cross-entropy with logits, which is a numerically stable cross-entropy function. And finally, we're going to use the Keras SGD optimizer.

If we call our model on our features (this is an image classification task, so our features are an image), we get logits, which represent the predicted class. If we then apply our loss function, comparing the labels and logits, this gives us a scalar loss, which we can then use to optimize and update our model.

We want to call all of this under a tf.GradientTape. What the gradient tape does is watch the computations as they're performed, maintain the metadata, and keep activations alive. This is what allows us to take gradients, which we can then pass to the optimizer to update our model weights. Finally, if we wrap all of this into a step and then iterate over our data in mini-batches, applying this step to each mini-batch, that comprises the general structure of our training loop.

On the data side, we're going to start out with just a strawman Python generator to show the different parts of the data pipeline, and then we'll look at its performance a little bit later.
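The training step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the talk's notebook: a tiny stand-in model replaces MobileNetV2 so the sketch runs quickly, and the synthetic data, shapes, learning rate, and the name `train_step` are all assumptions.

```python
import tensorflow as tf

# Assumption: a tiny stand-in model so the sketch runs in seconds.
# The talk uses tf.keras.applications.MobileNetV2 here instead.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),  # a single logit: cat vs. not cat
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)


def train_step(features, labels):
    # The tape records the forward computation so gradients can be taken.
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=labels, logits=logits)
        loss = tf.reduce_mean(per_example_loss)  # scalar loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


# Synthetic "images" and binary labels (illustrative only).
features = tf.random.normal([32, 8, 8, 3])
labels = tf.cast(
    tf.random.uniform([32, 1], maxval=2, dtype=tf.int32), tf.float32)

# Outer loop: apply the step to each mini-batch of 8 examples.
losses = []
for start in range(0, 32, 8):
    loss = train_step(features[start:start + 8], labels[start:start + 8])
    losses.append(float(loss))
```

Everything here runs eagerly, op by op, which is the naive baseline the rest of the talk improves on.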
What do we need to do? First, we'll want to shuffle the data so that we see a different ordering each epoch. We'll need to load all of the images from disk into memory so that they can be consumed by the training process. We'll also want to do some augmentation: we'll resize the images from their native format down to the format the model expects, and then do some task-specific augmentation. In this case we'll randomly flip the images and add some pixel-level noise to make our training a little more robust. Finally, we'll need to batch these. The generator I've shown produces examples one at a time, but we need to collect them into mini-batches, and the key operation for batching is a concatenation.

It's worth noting that different parts of the data pipeline stress different parts of the system. Loading from disk is an I/O-bound task, and we'll generally want to consume this I/O as fast as possible so that we're not constantly waiting for images to arrive from disk one at a time. Secondly, the augmentation tends to be rather CPU intensive, because we're doing various sorts of math. And finally, batching tends to be a somewhat memory-intensive task, because we're copying examples from their original location into the memory buffer of our mini-batch.

Now that we have the scaffolding, let's just go ahead and run this. We're going to be running on an NVIDIA V100 GPU. By default, TensorFlow will attempt to place ops on the optimal device, and because we're doing large matrix-multiply sorts of operations, it will place them on the GPU. However, we find that our training at the start is pretty lackluster. We're under a hundred images a second, and if we look at our device utilization, it's well under 20%. So what's going on here? To determine that, we can do some profiling, and TensorFlow 2 and TensorBoard make this quite easy to do.

You might already be familiar with TensorBoard. This is the standard way of monitoring training as it runs, so you might, for instance, be familiar with the Scalars tab, which shows things like losses and metrics as your training continues. In TensorFlow 2 there is a new tab, the Profiler tab. It looks at each op as it runs in TensorFlow and shows you a timeline of how your training is progressing. We're going to look in a little more detail at how to read and interpret these timelines.

Before we do, however, let's talk about how to enable it. It's quite straightforward: you simply turn on the profiler, which signals to the runtime that it should keep records of ops as it runs them; write your program as normal; and then export the trace, which puts it in a format that TensorBoard can consume and display for you.

So this is a timeline. Moving left to right we have time, and each row is a different thread. You can see that they're annotated with the device, so the top row is the CPU and the bottom ones are the various GPU threads. In this case we're looking at three different training steps. Each vertical line is an individual op that's been run by TensorFlow.

You can see that the ops are scheduled from the CPU and then appear on the GPU stream, and the GPU actually executes the ops; this is where the heavy lifting is done. But you can also see that we have these large gaps in between, where no ops are running and our system is essentially stalled. This is because after each step we have to wait for our naive generator to produce the next batch of data. This is obviously quite inefficient, so Priya is going to tell you how we can improve that.

Thanks, Taylor. As Taylor just showed us, writing your own input pipeline in Python to read data and transform it can be pretty inefficient. TensorFlow provides the tf.data API to allow you to easily build performant and scalable input pipelines.

You can think of the tf.data input pipeline as an ETL process. The first stage is the extract stage, where we read the data from, let's say, network storage or your local disk, and potentially parse the file format. The second stage is the transform stage, where you take the input file data and transform it into a form that's amenable to your ML computation. These could be image transformations specific to your ML task, or generic transformations like shuffling or batching that apply to a lot of ML tasks. Once you've transformed this data, you can then load it into your accelerator for training.
Let's take a look at some code to see how you can use tf.data to write the same input pipeline that Taylor showed you before, but in a much more efficient way. There are a number of steps here, and I'll go over them one by one.

The first is that we create a simple dataset consisting of all the file names in our input. In this case we have a number of image files, and the path glob specifies how to find these file names.

The second thing we do is shuffle this list of file names, because in training you typically want to shuffle the input data as it's coming in. Then we apply a transformation called a map transformation, and we provide it with this custom get_bytes_and_label function. What this function does is read the files one by one using the tf.io.read_file API, use the file path to compute the label, and return both of these.

These three steps comprise the extract stage that I talked about before: we're reading the files and parsing the file data. Note that this is just one way in which you can read data using tf.data, and there are a number of different APIs that you can use for other situations, which we have listed here. The most notable one is the TFRecordDataset API, which you would use if your data is in the TFRecord file format, and this is potentially the most performant format to use with tf.data. If you have your input in memory, in a NumPy array let's say, you can also use the from_tensor_slices API to convert it into a tf.data dataset and do subsequent transformations like this.

The next thing we do is another map transformation, this time to take the raw image data and do some processing to make it amenable to our training task. We have this process_image function here, which we provide to the map transformation, and it could look something like this: we do some decoding of the image data and then apply some image processing transformations such as resizing, flipping, normalization, and so on.

The key things to note here are, first, that these transformations are very similar and correspond one-to-one to what you saw before in the Python version, but the big difference is that now we're using TensorFlow ops to do them. The second is that tf.data will take the custom function that you pass to the map transformation, trace it as a TensorFlow graph, and run it in the C++ runtime instead of Python. This is what allows it to be much more efficient than Python.

Finally, we use the batch transformation to batch the elements of the input into mini-batches, which is a very common practice for training efficiency in ML tasks.

Before I hand it off to Taylor to talk about how using tf.data can improve the performance in our MobileNet example, I want to walk through a few more advanced performance optimization tips for tf.data. I'll go through them quickly, but you can also read about them in much more detail on the page that's listed here.

The first optimization I want to talk about is pipelining, which is conceptually very similar to any other software pipelining that you might be aware of. The high-level idea is that while you're training on a single batch of data on your accelerator, we want to use the CPU resources at the same time to process and prepare the next batch of data. That way, when the next training step starts, we don't have to wait for the next batch of data to be prepared; it will already be there, and this can reduce the overall training time significantly. Using software pipelining in tf.data is very simple: you can use the prefetch transformation as shown here.

The second optimization I want to talk about is parallelizing the transformation stage. By default, the map transformation applies the custom function that you provide to each element of your input dataset in sequence. But if there's no dependency between these elements, there's no reason to do this in sequence, right? So you can parallelize this by passing the num_parallel_calls argument to the map transformation, which indicates to the tf.data runtime that it should run these map operations on the elements of the dataset in parallel.

The third optimization I want to talk about is parallelizing the extraction stage. Just as transforming your elements in sequence can be slow, reading files one by one can be slow as well, so of course you want to parallelize it. In this example, since we're using a map transformation to read our files, the way you do it is very similar to what we just saw: you add a num_parallel_calls argument to the map function that reads your files. Note that if you're using one of the built-in file readers, such as TFRecordDataset, you can provide very similar arguments to it in order to parallelize the file reading there.

If you've been paying close attention, you'll have noticed that we have these magic numbers X, Y, and Z on the slides, and you might be wondering how to determine their optimal values. In reality it's not very straightforward to compute the optimal values of these parameters, because if you set them too low you might not be using enough parallelism in your system, and if you set them too high it might lead to contention and have the opposite effect of what you want. Fortunately, tf.data makes it really easy to specify these. Instead of specifying specific values for these X, Y, and Z arguments, you can simply use the constant tf.data.experimental.AUTOTUNE. This indicates to the tf.data runtime that it should do the autotuning for you and determine the optimal values for these arguments based on your workload, your environment, your setup, and so on.
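Put together, the pipeline and the three optimizations described above might look roughly like this. It is a hedged sketch rather than the talk's notebook code: the temporary directory of tiny random JPEGs, the `cat_`/`dog_` file naming, the image sizes, and the exact bodies of `get_bytes_and_label` and `process_image` are assumptions made so the example is self-contained. It uses `tf.data.AUTOTUNE`, the later non-experimental name for the constant mentioned in the talk.

```python
import os
import tempfile

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE in the talk's era

# Assumption: write a few tiny random JPEGs so the sketch has files to read.
data_dir = tempfile.mkdtemp()
for i in range(8):
    name = ("cat" if i % 2 == 0 else "dog") + f"_{i}.jpg"
    pixels = tf.cast(
        tf.random.uniform([40, 40, 3], maxval=256, dtype=tf.int32), tf.uint8)
    tf.io.write_file(os.path.join(data_dir, name), tf.io.encode_jpeg(pixels))


def get_bytes_and_label(path):
    # Extract: read the raw file and derive the label from the file name.
    image_bytes = tf.io.read_file(path)
    filename = tf.strings.split(path, os.sep)[-1]
    label = tf.cast(tf.strings.regex_full_match(filename, "cat_.*"), tf.float32)
    return image_bytes, label


def process_image(image_bytes, label):
    # Transform: decode, resize, augment, and normalize with TensorFlow ops.
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize(image, [32, 32])
    image = tf.image.random_flip_left_right(image)
    image = image / 255.0
    return image, label


dataset = (
    tf.data.Dataset.list_files(os.path.join(data_dir, "*.jpg"))
    .shuffle(buffer_size=8)
    .map(get_bytes_and_label, num_parallel_calls=AUTOTUNE)  # parallel extract
    .map(process_image, num_parallel_calls=AUTOTUNE)        # parallel transform
    .batch(4)
    .prefetch(AUTOTUNE)  # overlap input preparation with training
)

batches = list(dataset)
```

Both map functions are traced into graphs by tf.data, so they need to be written with TensorFlow ops rather than arbitrary Python.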
That's all for tf.data. I'm now going to hand it off to Taylor to talk about what kind of performance benefits you can see.

Thanks, Priya. On the right here we can see the timeline before and after we add tf.data. Before, we had this long stall in between our training steps where we were waiting for data. Once we've used tf.data, you'll note two things about the timeline. The first is that each training step begins immediately after the previous one, and that's because tf.data was preparing the upcoming batch while the previous training step was happening. The second is that before, it actually took longer than the time of a batch to prepare the next batch, whereas now we see no large stalls in the timeline. The reason is that tf.data, because of its native parallelism, can now produce batches much more quickly than the training can consume them. You can see this manifest in our throughput, with more than a 2x improvement.

But sometimes there are even more stalls. If we zoom in on the timeline of one of our training steps, we'll see that there are a number of these very small gaps, and these are launch overheads. If we look further at different portions of the GPU stream, near the end is what a healthy profile looks like: one op runs and finishes, and immediately the next op can begin, which results in an efficient and saturated accelerator. On the other hand, in some of the earlier parts of the model, an op is enqueued, the GPU immediately chews through it, and then it simply waits idle for Python to enqueue the next op. This leads to very poor accelerator utilization and efficiency.
So what can we do? If we go back to our training step, we can simply add the tf.function decorator. This will trace the entire training step into a TensorFlow graph, which can then be run very efficiently from the TensorFlow runtime, and this is the only change needed. Now if we look at our timeline, we see that all of the scheduling is done in this single, much faster op on the CPU, and then the work appears on the GPU as before. This also allows us to use the rich library of graph optimizations that were developed in TensorFlow 1.

You can see this is again almost another factor of two improvement, and it's pretty obvious in the timeline on the right why. Whereas before tf.function, when we were running everything eagerly, we had all of these little gaps waiting for ops, now, once it's compiled into a graph and launched from the C++ runtime, it's able to do a much better job of keeping up with the GPU.

The next optimization I'm going to talk about is XLA, which stands for Accelerated Linear Algebra. To understand how XLA works, we have this simple example of a graph with some skip connections. What XLA will do is cluster that graph into subgraphs and then compile each entire subgraph into a very efficient fused kernel. There are a number of performance gains from using these fused kernels. The first is that you get much more efficient memory access, because the kernel can use data while it's hot in the cache, as opposed to pushing it all the way down the memory hierarchy and then bringing it all the way back. It also reduces launch overhead (this is the C++ executor launch overhead) by running fewer ops for the same math. And finally, XLA is heavily optimized to target hardware, so it does things like use efficient hardware-specific vector instructions, specialize on shapes, and choose layouts so that the hardware has very efficient access patterns.

It's quite straightforward to enable: you simply use the tf.config.optimizer.set_jit flag, and this will cause every tf.function that runs to attempt to compile and run with XLA. You can see that in our example it's a very stark improvement, about a two and a half times improvement in throughput. The one caveat with XLA is that a lot of its optimizations are based on specializing on shape, so XLA needs to recompile every time it sees new shapes. If you have an extremely dynamic model, for instance one where the shapes are different each batch, you might actually wind up spending more time on the XLA compile than you gain back from the more efficient computation. So you should definitely try out XLA, but if you see a performance regression rather than performance gains, it's likely that that's the reason.

Next we're going to talk about mixed precision. If we want to go even faster, we can give up some of our numeric stability in order to obtain faster training performance. Here I have the IEEE float32 format, which is the default floating-point format in TensorFlow, but there are also two half-precision formats that are relevant. The first is bfloat16, where we keep all of the exponent and simply chop off 16 bits of mantissa. The other is float16, where we give up some exponent in exchange for keeping a little more mantissa. What's important about these formats is that they have native hardware support: TPUs have hardware support for very efficient bfloat16 operations, and GPUs have support for very efficient float16 operations. So if we can formulate our computation in these reduced-precision formats, we can potentially get very large speedups from the hardware.

In order to enable this, we need to do a couple of things. First, we need to choose a loss scale. What is a loss scale? It's a constant that's inserted into the computation which doesn't change the mathematics of the computation but does change the numerics, so it gives the runtime a way to adjust the computation to keep it in a numerically stable range. The next thing we'll want to do is set a mixed precision policy. This tells Keras that it should cast tensors as they flow through the model, in order to make sure that the computation is actually happening in the correct floating-point representation.

In a custom training loop, we'll want to wrap our optimizer in this LossScaleOptimizer, and this is the hook by which the loss scale is inserted. As training happens, TensorFlow will dynamically adjust this loss scale to balance vanishing gradients from FP16 underflow while still preventing NaNs from FP16 overflow. If you're using the Model.fit workflow, this will be done for you.

Finally, we generally need to increase our batch size when doing mixed precision training. The reason is that mixed precision makes computing a single example much less expensive, so if we just turn on mixed precision, we can go from a saturated accelerator in float32 to an underutilized accelerator in float16. By increasing the batch size, we can go back to filling all of the hardware registers. We can see this in our example: if we just turn on float16, there's actually no improvement in performance, but if we then increase our batch size, we see a very substantial improvement. It is worth noting that because we've both reduced the numeric precision and changed the batch size, this can require that you retune the hyperparameters.

Finally, let's look at the remaining bits of performance. About 60 percent of what's left is actually the copy from the host CPU to the GPU. In a little bit, Priya is going to talk about distribution, and one of the things that the distribution-aware code in TensorFlow will do is automatically pipeline this prefetch, so you'll actually get that 60 percent for free. And then, finally, you
have hand tuning: you can give up some of the numeric stability, you can mess with thread pools, you can manually try to optimize layouts. This is included largely for completeness, in case you're curious what real low-level hand tuning looks like. But for most cases, given that we can get the vast majority of the performance with just simple, idiomatic, easy-to-use APIs, this very fine hand tuning is not recommended.

Once we've actually saturated a single device, though, we need to do something else to get another order of magnitude improvement, and so Priya is going to talk to you about that.

All right. Taylor talked about a number of different things you can do to get the maximum performance out of a single machine with a single GPU. But if you want your training to go even faster, you need to start scaling out. Maybe you want to add more GPUs to your machine, or perhaps you want to train on multiple machines in a cluster, or maybe you want to use specialized hardware such as Cloud TPUs. TensorFlow provides the Distribution Strategy API to allow you to do just that.

We built this API with three key goals in mind. The first goal is ease of use: we want users to be able to use the distribution API with very few changes to their code. The second goal is to give great performance out of the box: we don't want users to have to change their training code or turn a lot of knobs to get the maximum efficiency out of their hardware resources. And finally, we want this API to work in a variety of different situations. For instance, you may want to scale out to multiple GPUs, multiple machines, or TPUs; you may want different distributed training architectures, such as synchronous or asynchronous training; or you may be using different types of APIs, whether that's high-level Keras APIs like Model.fit or a custom training loop as in our example earlier. We want Distribution Strategy to work in all of these potential cases.

There are a number of different ways in which you can use this API, and we've listed them here in order of increasing complexity. The first is if you're using the Keras high-level API, Model.fit, for your training loop and you want to distribute your training in that setup. The second use case is what we've been talking about in this talk so far, where you have a custom training loop and you want to scale out your training. The third use case is where you don't have a specific training program, but maybe you're writing a custom layer or a custom library and you want to make it distribution aware. And finally, maybe you're experimenting with a new distributed training architecture and you want to create a new strategy. In this talk I'm only going to cover the first two cases.

So let's begin. The first use case we'll talk about is if you have a single machine with multiple GPUs and you want to scale up your training to this situation. For this setup we provide MirroredStrategy. MirroredStrategy implements synchronous training across multiple GPUs on a single machine. The entire computation of the model is replicated on each GPU, all the variables of the model are replicated on each GPU as well, and they are kept in sync using all-reduce.

Let me go through, step by step, what synchronous training looks like. There are some gray boxes around these which are not really visible, but let's say that in our example we have two devices, or two GPUs, device 0 and device 1, and we have a very simple model with two simple layers, layer A and layer B. Each layer has a single variable. As you can see, the variables are mirrored, or replicated, on these two devices.

In our forward pass, we give a different slice of our input data to each of these devices, and in the forward pass they compute the logits using the local copy of the variables on these devices. In the backward pass, each device then computes the gradients, again using the local copy.

Once each device has computed the gradients, they communicate with each other to aggregate these gradients, and this is where the all-reduce that I mentioned before comes in. All-reduce is a special class of algorithms that can be used to efficiently aggregate tensors, such as gradients, across different devices, and it can reduce the overhead of such synchronization by quite a bit. There are a number of different such algorithms available, and some hardware vendors, such as NVIDIA, also provide specialized implementations of all-reduce for their hardware, such as NCCL.

Once these gradients have been aggregated, the aggregated result is available on each device, and each device can update its local copy of the variables using those aggregated tensors. In this way both devices are kept in sync, and the next forward pass doesn't begin until all the variables have been updated.

Now let's take a look at some code to see how you can use MirroredStrategy to scale up your training. As I mentioned, we'll talk about two types of use cases. The first one is if you're using the Keras high-level API, and then we'll come back to the custom training loop example after this.

The code here is some skeleton code to train the same MobileNetV2 model, but this time we're using the Model.compile and Model.fit APIs in Keras. In order to change this code to use MirroredStrategy, all you need to do is add these two lines of code. The first thing you do is create a MirroredStrategy object, and the second is to move the rest of your training code inside the scope of the strategy.

Putting things inside the scope lets us take control of things like variable creation, so any variables that you create under the scope of the strategy will now be mirrored variables. You don't need to make any other changes to your code in this case, because we've already made components of TensorFlow distribution aware. For instance, in this case the optimizer is distribution aware, as are compile and fit.

The case we just saw was the simplest way in which you can create a MirroredStrategy, but you can also customize it. By default, it will use all the GPUs available on your machine for training, but if you want to use only specific ones, you can specify them using the devices argument. You can also customize which all-reduce algorithm to use with the cross_device_ops argument.

So now we've seen how to use MirroredStrategy with the high-level Keras Model.fit API. Let's go back to our custom training loop example from before and see how you can modify that to use MirroredStrategy. There's a little bit more code in this example, because when you have a custom training loop you have more control over what exactly you want to distribute, so you need to do a little more work to distribute it as well.

Here is the skeleton of the custom training loop from before: we have the model, the loss function, the optimizer, and the training step, and then the outer loop, which iterates over your data and calls the training step.

The first thing you need to do is the same as before: you create the MirroredStrategy object, and you move the creation of the model, the optimizer, and so on inside the scope of the strategy. The purpose of this is the same as before.
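The two-line Keras change described above can be sketched as follows. This is an illustrative assumption rather than the slide's code: a tiny stand-in model and synthetic data replace MobileNetV2 and the cat images so that the example runs quickly, and on a machine without GPUs MirroredStrategy simply falls back to a single replica.

```python
import tensorflow as tf

# Line 1 of 2: create the strategy. Optional arguments such as `devices`
# and `cross_device_ops` can customize it; the defaults use all visible GPUs.
strategy = tf.distribute.MirroredStrategy()

# Synthetic data (assumption); the batch size here is the global batch size,
# which is split across replicas.
features = tf.random.normal([32, 8, 8, 3])
labels = tf.cast(
    tf.random.uniform([32, 1], maxval=2, dtype=tf.int32), tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

# Line 2 of 2: move model creation, compile, and fit inside the scope, so
# that variables created here become mirrored variables.
with strategy.scope():
    # Assumption: a tiny stand-in for tf.keras.applications.MobileNetV2.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    )
    history = model.fit(dataset, epochs=1, verbose=0)

num_replicas = strategy.num_replicas_in_sync
```

The custom training loop case starts the same way, with the model and optimizer created under `strategy.scope()`, but needs further changes beyond this.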
As I mentioned you need to do a few more things so this is not sufficient, in this case and let's, go over each of them one by one. The. First thing you need to do is to distribute, your data typically and. If you're using TF data datasets as your input this, is straightforward all you need to do is called strategy dot experimental. Distribute data set on your data set and this, returns a distributed, data set which you can then iterate over in a very similar manner as before. The. Second thing you need to do is to scale your loss by the global, that size and this is very important, if you so. That the converging characteristics, of your model do not change and we, provide a helper method in the NN library to to. Do so. And. The. Third thing you need to do is to specify which specific, computation, you want to replicate so. In this case we want to replicate our training step on each replica, so you, can use traceview experimental. Runway 2 API to provide, the training step function as well as your distributed, input and you. Rub this whole thing in another TF, function because we want all of this to run as a TF graph as well. So. Those are all the code changes you need to make in order to take the custom training loop from before and now run it in a distributed. Fashion using distribution, strategy API. So. Let's go back and take a look at what kind of scaling you can see so, just out of the box adding this mirror strategy, to the mobile network, um, before we're able to get 80% scaling, from when going from one GPU, to a GPUs, on a single machine it's. Possible to get even more scaling. Out of this by doing some, minimal optimizations, which we won't be going into today. To. Give another example of, the scaling we. Ran, multi-gpu. Training, with four resonant 50 which is a very popular image, classification benchmark. And in, this example we also used f-16, and accelerate, techniques that Taylor talked about before and here. 
You can see going, from 1 to a GPS, were able to get 85%, scaling. And. This example was using the Charis model, that fit API instead of the custom training with example, if, you're interested you can look at the link in the bottom and you can try, out this model yourself. Another. Example is using the transformer, language model to show that this is not just for images we're able to scale up other types of models as well and in, this case we're able to get more than 90% scaling. When, running from 1 to 8 GPS. So. So far we've been talking about scaling, up to multiple GPUs on a single machine but, most. Likely you would want to scale even further to. Multiple machines maybe, with multiple GPUs or, perhaps just with CPUs, for. These use cases we, provide the multi word Khmer strategy, as the. Name suggests this is very similar to the mirror strategy that we've been talking about it, also implements, synchros training but this time across all the different machines in your cluster, in.
order to do that all-reduce, it uses a new type of op in TensorFlow called collective ops. A collective op is a single op in the TensorFlow graph which can determine the best all-reduce algorithm to use based on a number of different factors, such as the network topology, the type of communication available between the different machines, and tensor sizes. It can also implement optimizations such as tensor fusion: for instance, if you have a lot of small tensors that you want to aggregate, it may batch them up into a single tensor in order to reduce the load on the network.

So how can you use MultiWorkerMirroredStrategy? It's actually very similar to MirroredStrategy. You create a strategy object like so, and the rest of your training code actually remains the same, so I've omitted it here. Once you have the strategy object, you can put the rest of your code in strategy.scope and you're good.

One more thing you need to do, however, in the multi-worker case is to give us information about your cluster, and one way to do that for MultiWorkerMirroredStrategy is using the TF_CONFIG environment variable. You might be familiar with this environment variable if you have used distributed training with Estimator in TensorFlow 1. It basically consists of two components. The first is the cluster, which gives us information about your entire cluster, so here we have three different workers at these hosts and ports. The second piece is information about the specific worker, so this is saying this is worker 1. Once you provide this TF_CONFIG, the task part would be different on each worker, but your training code can remain the same, and distribution strategy will read this TF_CONFIG environment variable and figure out how to communicate with the different workers in your cluster.

So we've been talking about multiple machines and multiple GPUs. What about TPUs? You've probably heard about TPUs at this conference or somewhere else before.
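To make the TF_CONFIG description above concrete, here is a minimal sketch of the two components (the cluster and the task) for a three-worker setup. The slide shown in the talk is not part of the transcript, so the host names, the port, and the helper name `make_tf_config` are all hypothetical; only the overall JSON shape with `cluster` and `task` keys follows what the speaker describes.

```python
import json
import os

# Hypothetical host:port addresses; replace these with the machines in your cluster.
CLUSTER = {"worker": ["host1:12345", "host2:12345", "host3:12345"]}


def make_tf_config(task_index):
    """Build the TF_CONFIG JSON string for the worker with the given index."""
    return json.dumps(
        {
            # The cluster part is identical on every worker.
            "cluster": CLUSTER,
            # The task part identifies this particular worker and differs per machine.
            "task": {"type": "worker", "index": task_index},
        }
    )


# On the machine acting as worker 1, set the variable before training starts.
os.environ["TF_CONFIG"] = make_tf_config(task_index=1)

# Read it back to show the two components.
parsed = json.loads(os.environ["TF_CONFIG"])
print(parsed["cluster"]["worker"])
print(parsed["task"])
```

Each machine would run the same training script and differ only in the `index` value, which matches the point that the training code itself stays unchanged.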
TPUs are custom hardware built by Google to accelerate machine learning workloads. You can use them through Cloud TPUs, or you can even try them out in Colab. Distribution strategy provides TPUStrategy for you to be able to scale up your training to TPUs. It's very similar to MirroredStrategy in that it implements synchronous training, and it uses the cross-replica sum API to do the all-reduce across the TPU cores. You can use this API to scale your training to a single TPU, a slice of a pod, or a full pod as well. If you want to try out TPUStrategy, you can try it with the TensorFlow nightlies, or you can wait for the 2.1 stable release.

Let's look at the code to see how you can use TPUStrategy. The first thing you need to do is to provide information about the TPU cluster, and the way you do that is to create a TPUClusterResolver and give it the name or address of your TPU. Once you have a cluster resolver, you can use the experimental_connect_to_cluster API to connect to this cluster, and also initialize the TPU system. Once you've done these three things, you can then create the strategy object and pass the cluster resolver object to it. Once you have the strategy object, the rest of the code remains the same: you create the strategy.scope and you put your training code inside that, so I won't be going into that here.

So, looking at this code, you can see that distribution strategy makes it really easy to switch from training on multiple GPUs or multiple machines to specialized hardware like TPUs.

So that's all I was going to talk about for distribution strategy. To recap, today we talked about a few different things. We talked about tf.data to build simple and performant input pipelines. We talked about how you can use tf.function to run your training as a TensorFlow graph. We
talked about how you can use XLA and mixed precision to improve your training speed even further. And finally,
we talked about the distribution strategy API to scale out your training to more hardware.

We have a number of resources on our website, a number of guides and tutorials, so I encourage you to check those out. If you have any questions or you want to report bugs, please reach out to us on GitHub. We'll also be available in the expo hall after the talk if you have more questions. And finally, the notebook that we've been using in the talk is listed here, so if you want to take a picture, you can check it out later. That's all. Thank you so much for listening.

Thanks for the nice presentation. So, two questions, both regarding inference. You mentioned how we can use TensorBoard for profiling to see how well our CPUs and GPUs are utilized. Can we do the same to see how well our inference is working?

The profiler has no concept of training or inference; it just looks at ops as they execute, so if you're just running the forward pass, it'll show all the same.

Like, are you doing it using Python APIs, or like SavedModels?

So let's say I have a SavedModel format and I want to use that in TensorBoard.

So you're not really using any of the Python APIs, right?

So I use Python APIs, but basically...

So you're loading a SavedModel in Python, and then, yeah, then you can use this as well.

OK. Regarding mixed precision: will it help in inference as well, and how do you enable that?

Typically for inference, the integer low-precision formats tend to be more important than the floating-point ones. If you really care about very fast inference with low precision, I would suggest that you check out TF Lite, because they have a lot of quantization support.

Thank you.

Hello. I had just one question about how to save your model when you are training in a distributed strategy, because your code is replicated in all the nodes. What's the official way to do this?

Yes.
So you want to save a full SavedModel?

Yes.

Yes, so you can use the standard APIs, just tf.saved_model.save, and what it'll do is it will not save the replicated model; it'll save a single copy of the model. So then it's like any other SavedModel. If you then load it inside another strategy again, it will get replicated at that point, or you can use it for inference as another SavedModel.

Because what I experienced was that all the nodes tried to save the model at the same time, and then there was an issue, a race condition with the placement where the models were saved.

Yes, but...

Thanks for the talk. I have a quick question about mixed precision and XLA: will the API automatically use the hardware-related features? For example, in mixed precision, if you have a Volta machine, will it be able to use the Volta machine's features?
For instance, on Volta it will automatically try to use the Tensor Cores. And in fact XLA and mixed precision synergize very well, because mixed precision inserts a bunch of casts, and then XLA will go along and actually optimize a lot of those away, and it will try to make the layout more amenable to mixed precision. So XLA tends to talk to the hardware at a very low level.

OK, so basically from the user's point of view it's transparent: if I run the same code on CPU, on an old GPU, or on a new GPU, it will automatically try to use the hardware capability?

Yes.

Thank you.

Really sorry, we kind of have to stop this now.