Tensor2Tensor (TensorFlow @ O’Reilly AI Conference, San Francisco '18)
Hi, my name is Łukasz Kaiser, and I want to tell you in this final session about Tensor2Tensor, a library we've built on top of TensorFlow to organize the world's models and datasets. I want to tell you about the motivation, how it came together, and what you can do with it. But if you have any questions in the meantime, just ask at any time. I don't know if you've already used Tensor2Tensor; in that case you might have even more questions.
The motivation behind this library: I am a researcher in machine learning, I have also worked on production models, and research can be very annoying. It can be very annoying to researchers, and it's even more annoying to people who put it into production. Because, you know, research works like this. You have an idea and you want to try it out. It's machine learning, and you think: I will change something in the model, it will be great, it will solve physics problems, or translation, or whatever. You have this idea and you think it's so simple, I just need to change one tweak. But then: okay, I need to get the data. Where was it? So you search for it online, you find it, and it turns out you also need to preprocess it. You implement some data reading. You download the model that someone else did, and it doesn't give the result that someone else wrote in the paper at all. It's worse, it works ten times slower, it doesn't train at all. So you start tweaking it. It turns out someone else had this Perl script that preprocessed the data in a certain way that improved the model ten times, so you add that. Then it turns out your input pipeline is not performant, because it doesn't put data on the GPU, or CPU, or whatever, so you tweak that. Before you start with your research idea, you've spent half a year just reproducing what's been done before. So then, great, you do your idea, it works. You
write a paper, you submit it, you put it in a repo on GitHub, which has a README file that says: well, I downloaded the data from there (but this link is already gone two days after you made the repo), and then I applied... and you describe all these 17 tweaks, but maybe you forgot one option that was crucial. And then there is the next paper and the next research, and the next person comes and does the same, right? So it's all great, except that the production team at some point says: we should put it into production, it's a great result. And then they need to track this whole path, redo all of it, and try to get the same result. So it's a very difficult state of the world. And it's even worse because there are different hardware configurations, so maybe something that trained well on a CPU does not train on a GPU, or maybe you need an eight-GPU setup, and so on and so forth.
So the idea behind Tensor2Tensor was: let's make a library that has at least a bunch of standard models for standard tasks, and that includes the data and the preprocessing. So you really can just say on a command line: please get me this dataset and this model and train it. And make it so that we can have regression tests and actually know that it will train, that it will not break with TensorFlow 1.10, and that it will train on a GPU, on a TPU, and on a CPU; to have it all in a more organized fashion.
The thing that prompted Tensor2Tensor, the reason why I started it, was machine translation. I worked with the Google Translate team on launching neural networks for translation. This was two years ago, and it was amazing work, because before that, machine translation was done in a different way. It was called phrase-based machine translation: you find some alignments of phrases, then you translate the phrases, and then you try to realign the sentences to make them work. And the
results in machine translation are normally measured in terms of something called the BLEU score. I will not go into the details of what it is; the higher, the better. So, for example, for English-German translation, the BLEU score that human translators get is about 30, and the best phrase-based (so non-neural-network, non-deep-learning) systems were at about 20, 21. And that's been really a decade of research at least, maybe more. When I was doing a PhD, if you got one BLEU point up, you would be a star; it was a good PhD. If you went from 21 to 22, it would be amazing. Then the neural networks came, and the early LSTMs in 2015 were at about 19.5, 20. We talked to the Translate team and they said: you know, guys, it's fun, it's interesting because it's simpler in a way, you just train the network on the data, you don't have all the, you know,
language-specific stuff. It's a simpler system, but it gets worse results, and who knows if it will ever get better.
But then the neural network research moved on, and people started getting 21, 22. So the Translate team, together with Brain, where I work, made a big effort to build a really large LSTM model, which is called GNMT, the Google Neural Machine Translation system. And indeed it was a huge improvement: it got to 25 BLEU. Later, when we added mixtures of experts, it even got to 26. They were amazed, and it launched in production. It was about a two-year effort to take the papers, scale them up, and launch it. And to get these really good results, you really need a large network.
As an example of why this was important for Google: you have a sentence in German here, which is something like "Problems can never be solved with the same way of thinking that caused them." The neural translator translates the sentence kind of the way it should; I mean, there is a much better translation possible. From the phrase-based translator, you can see, you get "No problem can be solved from the same consciousness that they have arisen." It kind of shows how the phrase-based method works: every word or phrase is translated correctly, but the whole thing does not exactly add up. You can see it's a very machine-like way, and it's not so clear what it is supposed to say. So the big advantage of neural networks is that they train on whole sentences, they can even train on paragraphs, and they can be very fluent, since they take the whole context into account at once. It's a really big improvement. If you ask people to score translations, this really starts coming close, covering at least 80 percent of the distance to what human translators do, at least on newspaper language. Not poetry; nowhere near that. So it was great. We got the high BLEU scores.
We reduced the distance to human translators. It turned out that the one system can handle different languages, and sometimes even multilingual translation. But there were problems. One problem is the training time: it took about a week on a setup of 64 to 128 GPUs. And all the code for that was written specifically for this hardware setup. It was distributed training where everything in the machine learning pipeline was tuned for the hardware. Well,
because we knew we would train in this data center on this hardware, so why not? Well, the problem is that batch sizes and learning rates come together; you cannot tune them separately. Then you add tricks, you tweak some things in the model that are really good for this specific setup, for this specific learning rate or batch size. This distributed setup was training asynchronously, so there were delayed gradients; that's a regularizer, so you decrease dropout. You start doing parts of the model specifically for a hardware setup. And then you write a paper. We did write a paper, and it was cited. But nobody outside of Google ever managed to reproduce it, to get the same result with the same network. Because we can give you our hyperparameters, but you're running on a different hardware setup, so you will not get the same result. And then, in addition to the machine learning setup, there is the whole tokenization pipeline, the data preparation pipeline. And even though these results are on public data, the whole preprocessing is also partially Google-internal. It doesn't matter much, but it really did not allow other people to build on top of this work. So it launched, and it was a success for us, but in the research sense we felt that it came a little bit short. For one, you'd need a huge hardware setup to train it. And on the other hand, even if you had the hardware setup, or if you got it on cloud and wanted to invest in it, there would still be no way for you to just do it.
That was the prompt why I thought: okay, we need to make a library for the next time we build a model. The LSTMs were like the first wave of sequence models, with the first great results. But I thought: okay, the next time we build a model, we need to have a library that will ensure it works at Google and outside, that
will make sure that, okay, maybe when you train on one GPU you get a worse result, but we know what it is, and we can tell you: yes, you're on the same setup, just scale up. And it should work on cloud, so if you want better results, you can just get some money and pay for larger hardware. It should be tested, done, and reproducible outside.
So the Tensor2Tensor library started with the model called the Transformer, which is the next generation of sequence models. It's based on self-attention layers. We designed this model, and it got even better results: it got 28.4 BLEU. Now we are on par in BLEU with human translators, so this metric is not good anymore; it just means that we need better metrics. But this thing can train in one day on an eight-GPU machine. So you can just get an eight-GPU machine (it can be your machine, it can be in the cloud), train, and get the results. And it's not just reproducible in principle. There have been a number of groups that reproduced it, got the same results, wrote follow-up papers, and changed the architecture; it went up to 29, it went up to 30. There are companies that use this code. They launched competition to Google Translate. Well, that happens; Google Translate improved again. But in a sense I feel like it's been a larger success in terms of community and research, and
it raised the bar for everyone; it raised our quality as well. So that's how it came to be that we feel it's really important to make things reproducible and open, and to test them on different configurations and different hardware. Because then we can isolate which parts are fundamentally good from the parts that are just tweaks that work in one configuration and fail in another.
So that's our solution to this annoying research problem. It's a solution that requires a lot of work, and it's based on many layers. The bottom layer is TensorFlow, and TensorFlow in the meantime has also evolved a lot. We have tf.data, which is the TensorFlow data input pipeline. It was not there a year ago; in the newer releases, it helps to build input pipelines that are performant across different hardware. There are tf.layers and Keras, which are higher-level libraries, so you don't need to write everything in small TensorFlow ops; you can write things at a higher level of abstraction. There is the new distribution strategy, which allows you to have an estimator and say: okay, train on eight GPUs, train on one GPU, train on a distributed setup, train on a TPU. You don't need to rewrite handlers for everything on your own. But that's just the basics.
Then comes the Tensor2Tensor part, which is like: okay, I want a good translation model. Where do I get the data from? It's somewhere on the internet, but where? How do I download it? How do I preprocess it? Which model should I use? Which hyperparameters of the model? What if I want to change a model, if I just want to try my own, but on the same data? What do I need to change? How can it be done? What if I want to use the same model but on my own data? I have a translation company, I have some data, I want to check how that works. What if I want to share? What if I want to share a part? What if I want to share everything? That's
what Tensor2Tensor does. It's a library that has a lot of datasets; I think it's more than 100 by now. All the standard ones: images (ImageNet, CIFAR, MNIST), image captioning (COCO), translation for a number of languages, pure language modeling datasets, speech to text, music, and video datasets, which is recently very active. If you're into research, you can probably either find your dataset here, or there is a very easy tutorial on how to add it.
With the datasets come the models. There is the Transformer; as I told you, that's how it started. But then there are the standard things like ResNet, and more fancy image models like RevNet, Shake-Shake, Xception. Sequence models, also a bunch of them: SliceNet, ByteNet (that's a version of WaveNet), LSTMs. Then algorithmic models like Neural GPUs. There is a bunch from recent papers.
So it's a selection of models and datasets, but also the framework. If you want to train a model, there's one way to do it. There are many models, so you need to specify which one, and there are many datasets, so you need to specify which one. But there is one training binary, so it's always the same. No two-page README saying "please run these commands" and, for another model, "run different commands". Same for decoding: you want to get the outputs of your model on a new dataset? One command, t2t-decoder. You want to export it to make a server or a website? One command. And then, if you want to train locally, you just run the binary. If you want to train on Google
Cloud, just give your cloud project ID. You want to train on a Cloud TPU? Just say --use_tpu and give the ID. You need to tune hyperparameters? There is support for it on Google Cloud. We have ranges: just specify the hyperparameter range and tune. You want to train distributed on multiple machines? There's a script for that. So Tensor2Tensor is datasets, models, and everything around them that's needed to train them.
Now, this project, due to our experience with translation, we decided is open by default. Open by default, in a similar way as TensorFlow, means that every internal code change we push gets immediately pushed to GitHub, and every PR from GitHub we import internally and merge. So there is just one code base. And since this project is pure Python, there's no magic: it's the same code at Google and outside. Internally we have dozens of code changes a day, and they get pushed out to GitHub immediately.
And since a lot of Brain researchers use this daily, there are things like this. There was a tweet about research in optimizers, and it said there are optimizers like Adam, AMSGrad, adaptive learning rate methods. And then James Bradbury, at Facebook at that time, tweeted: well, that's not the latest optimizer; the latest optimizer is in Tensor2Tensor, in code, with a TODO to write a paper. The paper is written now; it's a very good optimizer, Adafactor. But yeah, the code just appears there, and the papers come later. It makes no sense to wait; it's an open research community. These ideas sometimes work and sometimes don't, but that's how we work: we push things out, and then we train and see. Actually, by the time the paper appeared, some people in the open-source community had already trained models with it, so we added their results. They were happy to. It's a very good optimizer; it saves a lot of memory.
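To give a feel for where the memory saving mentioned above can come from, here is a minimal sketch of a factored second-moment estimate, the idea usually associated with Adafactor: instead of one accumulator per weight, keep only row sums and column sums of the squared gradients. This is an illustration under stated assumptions, not the Tensor2Tensor code; the function names and the 4096 x 4096 layer shape are hypothetical, and moving averages and the rest of the optimizer are left out.

```python
# Sketch only: the row/column factorization behind a factored second-moment
# estimate. Moving averages, update clipping and the rest of the optimizer
# are deliberately left out; this is not the Tensor2Tensor implementation.

def factored_stats(squared_grads):
    """Keep only row sums, column sums and the grand total of a matrix."""
    row_sums = [sum(row) for row in squared_grads]
    col_sums = [sum(col) for col in zip(*squared_grads)]
    return row_sums, col_sums, sum(row_sums)


def reconstruct(row_sums, col_sums, total):
    """Rank-1 estimate of the full matrix: v[i][j] ~ row[i] * col[j] / total."""
    return [[r * c / total for c in col_sums] for r in row_sums]


def accumulator_sizes(n_rows, n_cols):
    """Values stored per weight matrix: full second moment vs. factored."""
    return {"full": n_rows * n_cols, "factored": n_rows + n_cols}


squared = [[1.0, 2.0], [2.0, 4.0]]         # a rank-1 example matrix
rows, cols, total = factored_stats(squared)
estimate = reconstruct(rows, cols, total)  # exact here because the rank is 1
sizes = accumulator_sizes(4096, 4096)      # hypothetical layer shape
print(estimate, sizes)
```

For a general matrix the reconstruction is only an approximation; the point of the sketch is the storage count, which grows with rows plus columns rather than rows times columns.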
It's a big collaboration. As I said, this is just one list of names; it should probably be longer by now. It's a collaboration between Google Brain and DeepMind, and currently there are researchers from the Czech Republic and Germany on GitHub. So, well over a hundred contributors by now, and over a hundred thousand downloads. I was surprised, because Ryan got this number for this talk and I was like: how come there are a hundred thousand people using this thing? It's for ML researchers. But whatever, there they are. And there are a lot of papers that use it. These are just the papers that have already been published and accepted; there is a long pipeline of other papers, and possibly some we don't know about.
So, as I told you, it's a unified framework for models. How does it work? Well, the main script of the whole library is t2t-trainer. It's the one binary where you say what model, what dataset, what hyperparameters: go train. So that's the basic command line: install tensor2tensor, and then call t2t-trainer. The problem is the name of the dataset, and it also includes all the details like how to preprocess, how to resize images, and so on and so forth. The model is the name of the model, and the hyperparameter set is which configuration, which hyperparameters of the model, which learning rates and so on, to
use. And then, of course, you need to specify where to store the data, where to store the model checkpoints, for how many steps to train, and so on. But that's the full command.
For example, you want a summarization model. There is a summarization dataset that's been used in academia; it's from CNN and Daily Mail. You say you want the Transformer, and there is a hyperparameter set that does well on summarization. You want image classification? CIFAR-10 is quite a standard benchmark for papers. You say: I want image CIFAR-10 and the Shake-Shake model, which was state of the art half a year or a year ago; this changes quickly. You want the big model, and you go train it. And the important thing is that we know this result: this gives less than 3% error on CIFAR-10, which, as I said, was state of the art a year ago; now that's down to about two. But we can be certain that when you run this command for the specified number of training steps, you will actually get this state of the art, because internally we run regression tests that start this every day and tell us if it fails.
So the usefulness of this framework is not just that we have it grouped into one command. Because it's automated, we can start testing it. If there is a new change in TensorFlow that breaks some kernel, and it doesn't come out in the unit tests, it often comes out in the regression tests of these models. We found at least three bugs in the two most recent versions of TensorFlow, because some things in machine learning only appear like this: things still run, things still train, but they give you two percent less. These are very tricky bugs to find, but if you know on which day it started failing, it's much easier.
Translation: as I said, it started with the Transformer. We added more changes, and nowadays it trains to over 29 BLEU. It's a very good translation model. Just run this command on an eight-GPU machine and wait; you will get a really good translator.
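As a rough sketch of the command line described above, the following Python helper assembles a t2t-trainer invocation from a problem, a model, and a hyperparameter set. The flag spellings and the CIFAR-10 / Shake-Shake names follow the project's README as best understood here and should be checked against the installed version; the helper itself, the directories, and the step count are hypothetical placeholders.

```python
import shlex


def t2t_trainer_command(problem, model, hparams_set, data_dir, output_dir,
                        train_steps, generate_data=True):
    """Build the argument list for one t2t-trainer run (nothing is executed)."""
    args = [
        "t2t-trainer",
        f"--problem={problem}",          # dataset name, including preprocessing
        f"--model={model}",              # model name
        f"--hparams_set={hparams_set}",  # named hyperparameter configuration
        f"--data_dir={data_dir}",        # where the prepared data is stored
        f"--output_dir={output_dir}",    # where checkpoints are written
        f"--train_steps={train_steps}",  # how long to train
    ]
    if generate_data:
        args.append("--generate_data")   # download and preprocess first
    return args


# Image classification example; paths and step count are placeholders.
cmd = t2t_trainer_command(
    problem="image_cifar10",
    model="shake_shake",
    hparams_set="shake_shake_big",
    data_dir="/tmp/t2t_data",
    output_dir="/tmp/t2t_train/cifar10",
    train_steps=1000,
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the argument list in code rather than by hand makes it easy to swap only the problem, model, or hyperparameter set between runs, which is the workflow the talk describes.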
Speech recognition: there is the open LibriSpeech dataset. A Transformer model without any language model gets a really good word error rate. Some more fancy things: if you want to generate images, it's recently popular to have a model that just generates either faces or landscapes; there are different datasets. So this is a model that you train just on CIFAR-10, reversed. To every dataset in Tensor2Tensor you can add this _rev suffix; it reverses inputs and targets, and generative models can take it and generate. For translation it's very useful: if you want German-English instead of English-German, you just add _rev; it
reverses the ordering of the dataset.
So those are the commands. But, for example, with the Image Transformer, if you try to train this on a single GPU to get to this 2.9 bits per dimension, you'd probably have to wait half a year, so that's not very practical. But that's the point: currently it's a very hard task to do a very good image generative model, and one GPU might not be enough for state of the art. So if you really want to push it, you need to train at scale. You need to train multi-GPU; you need to go to TPUs.
This is the command you've seen before. To make it multi-GPU, you just say --worker_gpu=8. This will use eight GPUs on your machine: it just makes the batches eight times larger and runs the eight GPUs in parallel, and there it trains. You want to train on a Cloud TPU? Say --use_tpu, and you need to specify the master of the TPU instance that you booked on cloud. It trains the same. Want to train on a Cloud TPU pod? I guess you've heard today that Google is opening up the pods to the public, which go up to, I think, 256 TPU cores. Oh, maybe up to 512, as I see here. Just say: do it, train. It will train much faster. How much faster? Well, we've observed nearly linear scaling up to half a pod, and I think about a 10% loss on a full pod. So these translation models can train on a pod in an hour, and you get state-of-the-art performance. This can really make you train very fast. Same for ImageNet. Well, I say an hour; there's now a competition: can we get down to half an hour, 18 minutes? I'm not sure how important that is, but it's really fast.
Now, maybe you don't care about training just one set of hyperparameters. Maybe you have your own dataset and you need to tune hyperparameters, to find a really good model for your application. Say Cloud ML Engine, autotune. You need to say what metric to optimize: accuracy, perplexity.
These are the standard metrics that people tune the models for. Say how many trials, and how many of them to run in parallel. And the final line is a range. A range says: well, try learning rates from 0.1 to 0.3, logarithmically or uniformly. These are the things you specify. So you can specify continuous things in an interval, and you can specify discrete things: just try two, three, four, five layers. And the tuner starts the number of parallel trials, so 20 in this command. The first one is random, and then for the next one it has a quite sophisticated non-differentiable optimization model, which is Bayesian mixed with CMA-ES, that decides what to try next. It will try another 20 trials, and usually after 60 or so it starts getting to a good region of the parameter space. So if you need to optimize, that's how you do it. And if you're wondering what range to optimize, we have a few ranges in code that we usually optimize over when we start with new data.
On a TPU pod, if you want a model that doesn't just do data-parallel training on large batches but is model-parallel, if you want to have a model with a huge number of parameters, more than 1 billion, you can use something we call Mesh TensorFlow, which we have also started developing in Tensor2Tensor. It allows you to do model parallelism in an easy way. You just say: split my tensor across the cores, however many cores you have, or split it eight-wise on this dimension and four-wise on this dimension. I'll tell you a bit more about that later. It allows you to train really large models if you want that, and it gives really good results.
So that's how the library works. You can go and use it with the models and datasets that are there. But what if you want to just get the data from a dataset, or to add your own dataset? Well, it's still a Python library; you can just import it.
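The tuning ranges described above (a continuous interval sampled logarithmically, plus a discrete set of layer counts) can be sketched in plain Python. This only illustrates a random first round of 20 trials; the Bayesian / CMA-ES step that picks later trials is not modeled, and the function names and dictionary keys are hypothetical, not the library's range API.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Sample a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_trial(rng):
    """Draw one random configuration from a continuous and a discrete range."""
    return {
        # Continuous range: learning rate between 0.1 and 0.3, log scale.
        "learning_rate": sample_log_uniform(0.1, 0.3, rng),
        # Discrete range: try two, three, four, or five layers.
        "num_hidden_layers": rng.choice([2, 3, 4, 5]),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
first_round = [sample_trial(rng) for _ in range(20)]  # 20 parallel trials
for trial in first_round[:3]:
    print(trial)
```

Sampling on a log scale spends as many trials between 0.1 and 0.17 as between 0.17 and 0.3, which matters more when the interval spans several orders of magnitude than in this narrow example.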
There is this Problem class, which you can use without any other part of the library. You can get an instance of the Problem class in two ways. We have this registry to call things by name, so you can say registry.problem and the name (you can say problems.available to get all the available names), or you can instantiate it directly: if you look into the code where the class is, you can say, give me this class. And then generate_data. The Problem class knows where on the internet to find the data and how to preprocess it. So generate_data will go to this place, download the data from the internet, and preprocess it into TFExample files, in the same way that we use it, or that the authors of this dataset decided is good for their models. And then you call problem.dataset, which reads it from disk and gives you this queue of tensors in the form of a dataset.
So that's for datasets. For a model: all our models are subclasses of this T2TModel class, which itself is a Keras layer. So if you want to take one model and plug it together with another one, it's the same as with layers. You get a model (again, you can get it either by registry or by class name), call the model on a dictionary of tensors, and you get the outputs, and the losses if you need them.
So you can add your own. You can subclass the base Problem class, or, for text-to-text or image-to-class problems, there are subclasses that are easier to subclass: you basically just point to where your images are and get them from any format. And for your own model, you can subclass T2TModel. If you want to share it, it's on GitHub: make a PR. Under models there is a research subdirectory with models that we don't regression-test; we allow them to be free. If you have an idea you want to share, put it there. People might come, run it, and tell you it's great.
Yep, Tensor2Tensor: it's a set of datasets, models, and scripts to run them everywhere. And looking ahead, it's growing. We are happy to have more datasets, and we are happy to have more models. We are ramping up on regression testing. We're moving models out of research to the more official part, to have them tested and stabilized.
On the technical side, we are simplifying the infrastructure. TensorFlow 2 is coming. The code base, since it was started more than a year ago, is based on estimators; we are moving it to Keras. We had our own scripts and binaries for running on GPUs and multi-GPU; we are moving to distribution strategy. We are allowing exports to TF Hub. This is a library for training your own models, and the main thing is the trainer. Once a model is trained and you want to share a pre-trained model, TF Hub is the right place; you can export it with one line. And Mesh TensorFlow allows you to train huge models on Cloud pods; I will tell you a little bit more about it in a moment.
On the research side, there's been a lot of research in video models recently. We have a ton of them in Tensor2Tensor, and they're getting better and better, and it's a fun thing to generate your own videos. The new thing in machine translation is using back-translation, so it's unsupervised: you have a corpus of English and a corpus of German but no matching, and you use a model you have to generate data and then back-translate, and it shows improvements. And in general, well, hyperparameter tuning is an important thing
in research too, so it's integrated now and we're doing more and more of it. Reinforcement learning, GANs: well, as I said, there are a lot of researchers using it. So, what's going on?
One of the things is Mesh TensorFlow. It's a tool for training huge models, I mean really huge: you can have one model that uses a whole TPU pod, four terabytes of RAM; that's how many parameters you can have. It's by Noam, Youlong, Niki, Ashish, and many other people. So what if you want to train image generation models on high-definition videos, or process data that's huge even at batch size one? You cannot just say: oh, I'll do one thing on one core and another on another core, just split it by data. One data example has to go on the whole machine, and then there needs to be a convolution that applies to it, or a matrix multiplication. So how can we do this and not drown in manually writing: okay, on this core do this, and then slice back?
So the idea is: for every tensor, every dimension it has needs to be named. For example, you say the first dimension is batch, the second is length, and the third is just the hidden vector. And for every dimension, you specify how it will be laid out on the devices. Modern devices are like a 2D mesh of chips, so the communication is fast to nearby chips but not so fast across. So if it's a grid of chips in hardware, you can say: okay, the batch dimension will be on the horizontal chips, and the length will be on the vertical ones. So we define how to split the tensor on the hardware mesh. And then the operations are already optimized to use the processing of the hardware, to do fast communication and to operate on these tensors as if they were single tensors. So you specify the dimensions by name, you specify their layout, and then you write your model as if it were a single-GPU model.
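The layout idea described above (named tensor dimensions mapped onto the axes of a hardware mesh) can be sketched with a small helper that computes the slice of a tensor each core would hold. This is a standalone illustration, not the Mesh TensorFlow API: the function name, the mesh axis names "rows" and "cols", and the sizes are hypothetical.

```python
def shard_shape(tensor_dims, mesh_shape, layout):
    """Per-core shape of a tensor whose dimensions are named.

    tensor_dims: dict of dimension name -> size, e.g. {"batch": 64}
    mesh_shape:  dict of mesh axis name -> number of cores along that axis
    layout:      dict of tensor dimension name -> mesh axis name;
                 dimensions not listed are replicated on every core
    """
    shape = {}
    for name, size in tensor_dims.items():
        axis = layout.get(name)
        parts = mesh_shape[axis] if axis is not None else 1
        if size % parts != 0:
            raise ValueError(f"{name}={size} is not divisible by {parts}")
        shape[name] = size // parts
    return shape


# A hypothetical 8 x 4 grid of cores.
mesh = {"rows": 8, "cols": 4}
# Batch is split along one mesh axis, length along the other.
layout = {"batch": "rows", "length": "cols"}
dims = {"batch": 64, "length": 1024, "hidden": 512}

per_core = shard_shape(dims, mesh, layout)
print(per_core)
```

Changing only the layout dictionary (for example, mapping "hidden" to a mesh axis instead of "length") switches between data-parallel and model-parallel splits while the description of the tensor itself stays the same, which is the simplification the talk points to.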
Everything stays simple, except for this layout thing, which you need to think about a little bit. We did a Transformer on it, and an Image Transformer. We can train models with five billion parameters on TPU pods with over 50 percent utilization. So this paper is also a to-do paper. It should be coming out in a few weeks; it's not there yet, but it's new state of the art on translation and language modeling. It's the next step in really good models. It also generates nice images. So big models are good, they give great results, and this is a way of writing them simply.
So that's Mesh TensorFlow. We try to make it so that it runs on TPU pods but also on clusters of GPUs, because we try not to make the mistake again of doing something that just runs on one kind of hardware. And with the Tensor2Tensor library, you're welcome to be part of it.
Give it a try, use it. We are on GitHub. Gitter is the GitHub chat, and there is an active lobby for Tensor2Tensor, where we also try to be present every day to help. And yep, that's it. Thank you very much.