Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 20 – Future of NLP + Deep Learning
Let's get started. Welcome to the very final lecture of the class. I hope you're all surviving the last week and wrapping up your projects. Today we're going to be hearing about the future of NLP and deep learning. Chris is still traveling, so today we're going to have Kevin Clark, who's one of the PhD students in the NLP lab, and he was also one of the head TAs for the class last year, so he's very familiar with the class as a whole. So take it away, Kevin.

Okay, thanks, Abby. It's great to be back after being a TA last year. I'm really excited to be talking today about the future of deep learning and NLP. Obviously, trying to forecast the future for deep learning or anything in that space is really difficult because the field is changing super quickly. As one reference point, let's look at what deep learning for NLP looked like about five years ago. Really, a lot of ideas that are now considered to be pretty core techniques when we think of deep learning and NLP didn't even exist back then. Things you learned in this class, like seq2seq, attention mechanisms, large-scale reading comprehension, even frameworks such as TensorFlow or PyTorch, didn't exist. The point I want to make with this is that, because of this, it's really difficult to look into the future and say what things are going to be like.

What I think we can do, though, is look at areas that are really taking off right now, areas in which there's been a lot of recent success, and project from that that those same areas will likely be important in the future. In this talk I'm going to be mostly focusing on one key idea, which is the idea of leveraging unlabeled examples when training our NLP systems. I'll be talking a bit about doing that for machine translation, both in improving the quality of translation and even
in doing translation in an unsupervised way. That means you don't have paired sentences with their translations; you try to learn a translation model only from monolingual corpora. The second thing I'll be talking a little bit about is OpenAI's GPT-2, and in general this phenomenon of really scaling up deep learning models. I know you saw a little bit of this in the lecture on contextual representations, but this will be a little more in depth. I think these new developments in NLP have had some pretty big impacts more broadly, beyond even the technology we're using, and in particular they're starting to raise more and more concerns about the social impact of NLP, both in what our models can do and in where people are looking to apply these models. I think that really has some risks associated with it, in terms of security and also in terms of areas like bias. I'm also going to talk a bit about future areas of research. These are mostly research areas that over the past year have really developed into promising areas, and I expect they will continue to be important in the future.

Okay. To start with, I want to ask this question: why has deep learning been so successful recently? I like this comic here. There's a statistical learning person, and they've got some really complicated, well-motivated method for doing the task they care about, and then the neural net person just says, "Stack more layers." The point I want to make here is that deep learning has not been successful recently because it's more theoretically motivated or more sophisticated than previous techniques. In fact, I would say that a lot of older statistical methods have more of a theoretical underpinning than some of the tricks we do in deep learning. Really,
the thing that has made deep learning so successful in recent years has been its ability to scale. As we increase the size of the data and the size of the models, neural nets get a really big boost in accuracy in ways other approaches do not. If you look to the '80s and '90s, there was actually plenty of research on neural nets going on, but it didn't have the hype around it that it does now, and that seems likely to be because in the past there weren't the same resources in terms of computers and in terms of data. Only now, after we've reached a sort of inflection point where we can really take advantage of scale in our deep learning models, have we started to see it become a really successful paradigm for machine learning.

If we look at big deep learning success stories, I think you can see this idea play out. Here are three of what are arguably the most famous successes of deep learning. There's image recognition, where before people used very highly engineered features to classify images, and now neural nets are much superior to those methods. Machine translation has really closed the gap between phrase-based systems and human-quality translation; this is widely used in things like Google Translate, and the quality has actually gotten a lot better over the past five years. Another example that has had a lot of hype around it is game playing: there's been work on Atari games, there's been AlphaGo, and more recently there's been AlphaStar and OpenAI Five. If you look at all three of these cases, underlying these successes is really large amounts of data. For image recognition, there is the ImageNet dataset, which has 14 million images. Machine translation datasets often have millions of examples. For game playing, you can actually generate as much training data as you want, essentially just by running your agent within the game over
and over again.

If we look to NLP, the story is quite a bit different for a lot of tasks. If you look at even pretty core, popular tasks, say reading comprehension in English, datasets like SQuAD are on the order of a hundred thousand examples, which is considerably less than the millions or tens of millions of examples that these previous successes have benefited from. And that's of course only for English. There are thousands of other languages, and this is, I think, a problem with NLP data as it exists today: the vast majority of data is in English, when in reality fewer than 10% of the world's population speak English as their first language. So these problems with small datasets are only compounded
if you look at the full spectrum of languages that exist.

So what do we do when we're limited by this data but want to take advantage of deep learning's scale and train the biggest models we can? The popular solution, which has especially had recent success, is using unlabeled data, because unlike labeled data, unlabeled data is very easy to acquire for language. You can just go to the Internet, you can go to books, and you can get lots of text, whereas labeled data usually requires at the least crowdsourcing examples, and in some cases you even need someone who's an expert in something like linguistics to annotate that data.

Okay, so this first part of the talk is going to be applying this idea of leveraging unlabeled data to improve our NLP models to the task of machine translation. Let's talk about machine translation data. It is true that there do exist quite large datasets for machine translation. Those datasets don't exist because NLP researchers have annotated text for the purpose of training their models; they exist because in various settings translation is done just because it's useful, for example the proceedings of the European Parliament, the proceedings of the United Nations, and some news sites that translate their articles into many languages. So really, the machine translation data we use to train our models is often more of a byproduct of existing cases where translation is wanted, rather than a full sampling of the sort of text we see in the world. That means, number one, it's limited in domain: it's not easy to find translated tweets unless you happen to work for Twitter. In addition to that, there are limitations in terms of the languages that are covered. For some languages, say European languages, there's a lot of translation data; for other languages there's much less.
In these settings where we want to work on a different domain, or where we want to work with a low-resource language, we're limited by labeled data, but what we can do is pretty easily find unlabeled data. It's actually a pretty solved problem, maybe not a hundred percent, but we can with good accuracy look at some text and decide what language it's in, by training a classifier to do that. This means it's really easy to find data in any language you care about, because you can just go on the web, essentially search for data in that language, and acquire a large corpus of monolingual data.

Okay, I'm now going into the first approach I'm going to talk about for using unlabeled data to improve machine translation models. This technique is called pre-training, and it's really reminiscent of ideas like ELMo. The idea is to pre-train by doing language modeling. If we have two languages we're interested in translating from one into the other, we'll collect large datasets for both of those languages, and then we can train two language models, one on each, and then use those pre-trained language models as initialization for a machine translation system. The encoder gets initialized with the weights of the language model trained on the source-side language, and the decoder gets initialized with the weights trained on the target-side language. This will improve the performance of your model because during pre-training we hope that our language models will be learning useful information, such as the meaning of words or the structure of the language they're processing, and this can, down the line, help the machine translation model when we fine-tune it.

Let me pause here and ask if there are any questions, and in general feel free to ask questions throughout this talk.

Okay, so here is a plot showing some results of this pre-training technique.
This is English-to-German translation. The x-axis is how much training data, as in supervised training data, you provide these models, but of course they also have large amounts of monolingual data for the pre-training step. You can see that this works pretty well: you get about a two-BLEU-point increase in performance, that's this red line above the blue line, when doing this pre-training technique, and not too surprisingly, this gain is especially large when the amount of labeled data is small.

There is a problem with pre-training that I want to address, which is that in pre-training you have these two separate language models, and there's never really any interaction between the two when you're running them on the unlabeled corpus.
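As a rough illustration of the pre-training initialization described above, here is a minimal Python sketch. It assumes each model is just a dictionary of named weight lists; the function name `init_translation_model`, the parameter names, the toy weights, and the `bridge` component are all hypothetical and not taken from the lecture.

```python
import copy
import random

def init_translation_model(src_lm, tgt_lm, extra_sizes, seed=0):
    """Build seq2seq parameters from two pre-trained language models.

    src_lm / tgt_lm: dicts mapping a parameter name to a list of floats
    (stand-ins for real weight tensors).
    extra_sizes: parameters with no pre-trained counterpart, name -> length.
    """
    rng = random.Random(seed)
    model = {}
    # Encoder starts from the source-language LM weights.
    for name, weights in src_lm.items():
        model["encoder." + name] = copy.deepcopy(weights)
    # Decoder starts from the target-language LM weights.
    for name, weights in tgt_lm.items():
        model["decoder." + name] = copy.deepcopy(weights)
    # Anything else is initialized randomly and learned during fine-tuning.
    for name, size in extra_sizes.items():
        model[name] = [rng.uniform(-0.1, 0.1) for _ in range(size)]
    return model

# Toy "pre-trained" weights (invented numbers, for illustration only).
english_lm = {"embed": [0.1, 0.2], "lstm": [0.3, 0.4, 0.5]}
german_lm = {"embed": [0.6, 0.7], "lstm": [0.8, 0.9, 1.0]}
model = init_translation_model(english_lm, german_lm, {"bridge": 4})
```

After this step the whole model would be fine-tuned on the labeled sentence pairs; the copies are deep so that fine-tuning does not modify the original language models.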
So here's a simple technique that tries to solve this problem, and it's called self-training. The idea is, given a sentence from our monolingual corpus, in this case "I traveled to Belgium," an English sentence, we won't have a human-provided translation for this sentence, but what we can do is run our machine translation model, and we'll get a translation in the target language. Since this is from a machine learning model, it won't be perfect, but we can hope that maybe our model can still learn from this kind of noisy labeled example. So we treat our original monolingual sentence and its machine-provided translation as though it were a human-provided translation, and train our machine learning model as normal on this example.

This seems pretty strange as a method when you first see it, because it seems really circular. If you look at this, the translation that the model is being trained to produce is exactly what it already produces to begin with, because this translation came from our model in the first place. So in practice this is not a technique that's very widely used, due to this problem, but it motivates another technique called back-translation. This technique is a very popular solution to that problem, and it's a method that has had a lot of success in using unlabeled data for translation.

Here's the approach. Rather than only having our translation system that goes from source language to target language, we're also going to train a model that goes from our target language to our source language. In this case, if at the end of the day we want a French-to-English model, we're going to start by actually training an English-to-French model. Then we can do something that's a lot like self-labeling: we take an English sentence, we run our English-to-French model, and translate it. The
difference from what we did before is that we're actually going to switch the source and target sides. Now the French sentence is the source sequence, and the target sequence is our original English sentence that came from our monolingual corpus. And now we're training the machine translation system that goes the other direction, so French to English.

So why do we think this will work better? Number one, there's no longer this kind of circularity to the training, because what the model is being trained on is the output of a completely different model. Another thing that I think is pretty crucial here is that the translations the model is trained to produce, the things that the decoder is actually learning to generate, are never bad translations. If you look at this example, the target sequence for our French-to-English model, "I traveled to Belgium," originally came from our monolingual corpus. I think intuitively this makes sense: if we want to train a good translation model, it's probably okay to expose it to noisy inputs, so we expose it to the output of an English-to-French system that might not be perfect. What we don't want to do is expose it to poor target sequences, because then it won't learn how to generate in that language effectively.
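The back-translation recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_english_to_french` is a hypothetical word-by-word stand-in for a trained English-to-French model, and the labeled pair is invented. The point it shows is that the machine output becomes the source side, while the human-written sentence stays on the target side.

```python
def back_translate(monolingual_target, reverse_model):
    """Create synthetic (source, target) pairs from target-language text.

    reverse_model translates target-language sentences into the source
    language; its possibly noisy output is used only as the *input* side.
    """
    pairs = []
    for target_sentence in monolingual_target:
        synthetic_source = reverse_model(target_sentence)
        pairs.append((synthetic_source, target_sentence))
    return pairs

# Hypothetical stand-in for a trained English-to-French model.
TOY_EN_FR = {"i": "je", "traveled": "ai voyagé", "to": "en", "belgium": "belgique"}

def toy_english_to_french(sentence):
    return " ".join(TOY_EN_FR.get(word, word) for word in sentence.lower().split())

# Goal: a French-to-English model, so English is the target language.
labeled_pairs = [("je suis étudiant", "I am a student")]  # invented example
monolingual_english = ["I traveled to Belgium"]

synthetic_pairs = back_translate(monolingual_english, toy_english_to_french)
training_data = labeled_pairs + synthetic_pairs
```

The toy model produces the slightly wrong French "je ai voyagé en belgique", which mirrors the point made above: a noisy source is tolerable because the decoder only ever learns to generate the clean English target.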
Any questions on back-translation before I get to results?

Sure. This is assuming we have a large corpus of unlabeled data and we want to use it to help our translation model. Does that make sense? Maybe you could clarify the question. Yeah, that's right. We have a big corpus of English, which includes the sentence "I traveled to Belgium," and we don't know the translations, but we'd still like to use this data.

Yeah, another question. That's a good question: how do you avoid both models, let's say, blowing up and producing garbage, and then just feeding garbage to each other? The answer is that there is some amount of labeled data here as well. On unlabeled data you do this, but on labeled data you do standard training, and that way you keep the models on track, because they still have to fit the labeled data.

Yeah, another question. That is a good question, and I think that's basically almost a hyperparameter you can tweak. I think a pretty common thing to do is to first train two models only on labeled data, then do back-translation over a large corpus, and repeat that process over and over again. So each iteration you train on the labeled data, label some unlabeled data, and now you have more data to work with. But I think there would be many kinds of scheduling that would be effective here.

Okay, another question. If you have a very good French-to-English model, to test that it's a good French-to-English model, you could try to look up the original source and see if it matches. I'm not quite sure; are you suggesting going English to French to English and seeing whether it matches? I see. Yeah, that's a really interesting idea, and we're actually going to talk a little bit about this idea, which is called cycle consistency, later in this talk.
I'm going to move on to the results. So here's the method for using unlabeled data to improve translation; how well does it do? The answer is that the improvements are, at least to me, surprisingly good. This is for English-to-German translation, from some work by Facebook. They use five million labeled sentence pairs, but they also use 230 million monolingual sentences, that is, sentences without translations.
You can see that compared to the previous state of the art they get a six-BLEU-point improvement, which, if you compare it to most previous research on machine translation, is a really big gain. Even something like the invention of the Transformer, which most people would consider to be a really significant research development in NLP, improved over prior work by about two and a half BLEU points. And here, without doing any sort of fancy model design, just by using way more data, we get much larger improvements.

Okay. An interesting question to think about is: suppose we only have our monolingual corpora, so we don't have any sentences that have been human-translated; we just have sentences in two languages. The scenario you can imagine is, suppose an alien comes down and starts talking to you in a weird alien language, and it talks a lot. Would you eventually be able to translate what it's saying to English, just by having a really large amount of data?

I'm going to start with a simpler task than full-on translation when you only have unlabeled sentences. Instead of doing sentence-to-sentence translation, let's start by only worrying about word-to-word translation. The goal here is, given a word in one language, to find its translation, but without using any labeled data. The method we're going to use to try to solve this task is called cross-lingual embeddings. The goal is to learn word vectors for words in both languages, and we'd like those word vectors to have all the nice properties you've already learned about word vectors having, but we also want the word vector for a word in one language to be close to the word vector of its translation.

I'm not sure if it's visible in this figure, but it shows a large number of English and, I think, German words, and you can see that
each English word has its corresponding German word nearby in the embedding space. If we learn embeddings like this, then it's pretty easy to do word-to-word translation: we just pick an English word, find the nearest German word in this joint embedding space, and that will give us a translation for the English word.

The key assumption that we're going to be using to solve this is that, even though if you run word2vec twice you'll get really different embeddings, the structure of that embedding space has a lot of regularity to it, and we can take advantage of that regularity to help find an alignment between those embedding spaces. To be more concrete, here is a picture of two sets of word embeddings. In red we have English words, in blue we have Italian words, and although the vector spaces right now look very different from each other, you can see that they have a really similar structure. You would imagine distances are similar: the distance between "cat" and "feline" in the English embedding space should be pretty similar to the distance between "gatto" and "felino" in the Italian space.

This motivates an algorithm for learning these cross-lingual embeddings. Here's the idea: what we're going to try to do is learn what's essentially a rotation, such that we can transform our
set of English embeddings so that they match up with our Italian embeddings. Mathematically, what this means is we're going to learn a matrix W such that if we take, let's say, the word vector for "cat" in English and multiply it by W, we end up with the vector for "gatto" in Italian. A detail here is that we're going to constrain W to be orthogonal, and what that means geometrically is just that W is essentially only going to be doing a rotation of the vectors in X; it's not going to be doing some other, weirder transformation.

So this is our goal: to learn this W. Next I'm going to talk about how we actually learn this W. There are actually a bunch of techniques for learning this W matrix, but here is one of them that I think is quite clever. It's called adversarial training. It works as follows: in addition to trying to learn this W matrix, we're also going to be trying to learn a model called a discriminator. What it will do is take a vector and try to predict: was that vector originally an English word embedding, or was it originally an Italian word embedding? In other words, if you think about the diagram, what we're asking our discriminator to do is, given one of these points, predict whether it's a red point, so an English word originally, or a blue point.

If we have no W matrix, this is a really easy task for the discriminator, because the word embeddings for English and Italian are clearly separated. However, if we learn a W matrix that succeeds in aligning all these embeddings on top of each other, then our discriminator will never do a good job. We can imagine it will never really do better than 50%, because given a vector for, say, "cat," it won't know: is that the vector for "cat" that's been transformed by W, or is it actually the vector for "gatto"? Because
in this case those two vectors are aligned, so they're on top of each other.

During training, you alternate between training the discriminator a little bit, which means making sure it's as good as possible at distinguishing the English from the Italian words, and training W, where the goal is to essentially confuse the discriminator as much as possible. You want to have a situation where you can't, with this machine learning model, figure out whether a word embedding was originally from English or is an Italian word vector. So at the end of the day you have vectors that are aligned with each other.

Any questions about this approach? Okay. Here there's a link to a paper with more details. There's actually a range of other tricks you can do, but this is the key idea.

Okay, so that was doing word-to-word unsupervised translation. How do we do full sentence-to-sentence translation?
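To make the role of W concrete, here is a small Python sketch of what an orthogonal W does once it has been found, and of the nearest-neighbor lookup used for word-to-word translation. It is not the adversarial method from the talk: to stay short it fits a 2-D rotation in closed form from a few known word pairs, which the unsupervised method does not have. The toy vectors, the word pairs, and all function names are assumptions made for illustration.

```python
import math

def fit_rotation(src_vecs, tgt_vecs):
    """Least-squares 2-D rotation taking each src vector onto its paired tgt."""
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(src_vecs, tgt_vecs))
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(src_vecs, tgt_vecs))
    theta = math.atan2(cross, dot)  # angle maximizing total alignment
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]  # orthogonal by construction

def rotate(W, v):
    """Apply the 2x2 matrix W to the vector v."""
    return [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]

def nearest_word(vec, embeddings):
    """Return the word whose embedding is closest to vec."""
    return min(embeddings, key=lambda word: math.dist(vec, embeddings[word]))

# Invented 2-D embeddings: the Italian space is the English one rotated 90 degrees.
english = {"cat": [1.0, 0.0], "feline": [0.9, 0.2], "dog": [0.0, 1.0]}
italian = {"gatto": [0.0, 1.0], "felino": [-0.2, 0.9], "cane": [-1.0, 0.0]}
pairs = [("cat", "gatto"), ("feline", "felino"), ("dog", "cane")]

W = fit_rotation([english[e] for e, _ in pairs], [italian[i] for _, i in pairs])

def translate(word):
    """Map an English word into the Italian space and take its nearest neighbor."""
    return nearest_word(rotate(W, english[word]), italian)
```

In real embedding spaces W has hundreds of dimensions, and an orthogonal matrix may also include a reflection; the adversarial training described above is one way to find it without the word pairs this sketch relies on.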
We're going to use a standard sort of seq2seq model, without even an attention mechanism. There's one change to the standard seq2seq model going on here, which is that we're going to use the same encoder and decoder regardless of the input and output languages. You can see in this example that we could give the encoder an English sentence, or we could give it a French sentence, and it will have these cross-lingual embeddings, so it will have vector representations for English words and French words, which means it can handle any input. For the decoder, we need to give it some information about what language it's supposed to generate in: is it going to generate in French or in English? The way that's done is by feeding in a special token, which here is "Fr" in brackets to represent French, that tells the model, okay, you should generate in French now. In this figure it's only French, but you could imagine also feeding this model "En" in brackets, and that will tell it to generate English. One thing you can see is that you could use this sort of model to go from English to French; you could also use this model as an autoencoder. At the bottom, it's taking in a French sentence as input and it's just generating French as output, which here means just reproducing the original input sequence.

So that's just a small change to the standard seq2seq. Here's how we're going to train the seq2seq model. There are going to be two training objectives, and I'll explain why they're present in this model in just a few slides; for now, let's just say what they are. The first one is called a denoising autoencoder. What we're going to train our model to do in this case is take a sentence, here it's going to be an English sentence but it could also be a French sentence, scramble up the words a little bit, and then ask the model to
denoise that sentence, which in other words means regenerating what the sentence actually was before it was scrambled. One idea of why this would be a useful training objective is that, since we have an encoder-decoder without attention, the encoder is converting the entirety of the source sentence into a single vector. What an autoencoder does is ensure that that vector contains all the information about the sentence, such that we are able to recover what the original sentence was from the vector produced by the encoder.

So that was objective one. In training objective two, we're now actually going to be trying to do translation, but as before, we're going to be using this back-translation idea. Remember, we only have unlabeled sentences; we don't have any human-provided translations. But what we can still do is, given, let's say, a French sentence, translate it to English using our model in its current state, and then ask that model to translate that English back into French. What you can imagine is that in this setting the input sequence is going to be somewhat messed up, because it's the output of our imperfect machine learning model. Here the input sequence is just "I am student"; a word has been dropped. But we're now going to train it, even with this kind of bad input, to reproduce the original French sentence from our corpus of monolingual French text.

Let me pause here and ask for questions.

Yeah, that's a good question. This is going back to earlier, when there was word-to-word translation: why would we constrain that W matrix to be orthogonal? Essentially, that's right, it's to avoid overfitting, and in particular it's making this assumption that our embedding
spaces are so similar that there's actually just a rotation that distinguishes our word vectors in English from our word vectors in Italian. I think there have been results that don't include that orthogonality constraint, and I think it slightly hurts performance to not have it in there.

Okay, so continuing with unsupervised machine translation. I gave a training method, but I didn't quite explain why it would work, so here is some more intuition for this idea. Remember, we're going to initialize our machine translation model with these cross-lingual embeddings, which means the English and French words should look close to identical. We're also using the shared encoder. If you think about it, at the top we have just an autoencoding objective, and we can certainly believe that our model can learn this; it's a pretty simple task. Now imagine we're giving our model a French sentence as input instead. Since the embeddings are going to look pretty similar, and since the encoder is the same, it's pretty likely that the model's representation of this French sentence will actually be very similar to the representation of the English sentence. So when this representation is passed into the decoder, we can hope that we'll get the same output as before. This serves as a starting point: we can hope that our model already has some translation capability.

Another way of thinking about this is that what we really want our model to do is to be able to encode a sentence such that the representation is a sort of universal interlingua, a universal representation of that sentence that's not specific to the language. Here's a picture that's trying to get at this. In our autoencoder example here and our back-translation example here, the target sequence is the same.
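The two training objectives discussed above can be sketched as follows, to show that they produce different inputs but the same target. This is an illustrative Python sketch under assumptions, not the published implementation: the noise function (random word drops plus a local shuffle), the `"<fr>"` language tag, the dictionary layout, and the toy French-to-English translator are all hypothetical.

```python
import random

def add_noise(tokens, rng, drop_prob=0.1, max_shuffle_distance=3):
    """Scramble a sentence a little: drop some words, shuffle the rest locally."""
    kept = [t for t in tokens if rng.random() >= drop_prob] or tokens[:1]
    # Each word gets a jittered position, so words only move a few places.
    keys = [i + rng.uniform(0, max_shuffle_distance) for i in range(len(kept))]
    return [tok for _, tok in sorted(zip(keys, kept), key=lambda pair: pair[0])]

def denoising_example(sentence, lang_tag, rng):
    """Objective 1: reconstruct the clean sentence from a scrambled copy."""
    tokens = sentence.split()
    return {"source": add_noise(tokens, rng), "target_lang": lang_tag, "target": tokens}

def back_translation_example(sentence, lang_tag, translate_to_other):
    """Objective 2: reconstruct the sentence from the model's own translation."""
    tokens = sentence.split()
    return {"source": translate_to_other(tokens), "target_lang": lang_tag, "target": tokens}

# Hypothetical stand-in for the current model translating French to English.
TOY_FR_EN = {"je": "I", "suis": "am", "étudiant": "student"}

def toy_french_to_english(tokens):
    return [TOY_FR_EN.get(tok, tok) for tok in tokens]

rng = random.Random(0)
french_sentence = "je suis étudiant"
dae = denoising_example(french_sentence, "<fr>", rng)
bt = back_translation_example(french_sentence, "<fr>", toy_french_to_english)
```

Both examples ask the decoder to generate the same clean French sentence, which is what pushes the encoder toward giving the scrambled French input and the imperfect English input similar representations.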
What that essentially means is that the vectors for the English sentence and the French sentence are going to be trained to be the same, because if they were different, our decoder would be generating different outputs on these two examples. So this is just another sort of intuition: what our model is trying to learn here is a way of encoding the information of a sentence into a vector, but in a way that is language-agnostic.

Any more questions about unsupervised machine translation?

Okay, so going on to results of this approach. Here the horizontal lines are the results of an unsupervised machine translation model, and the lines that go up are for a supervised machine translation model as we give it more and more data. Unsurprisingly, given a large amount of supervised data, the supervised machine translation models work much better than the unsupervised machine translation model. But the unsupervised machine translation model actually still does quite well. If you look at around ten thousand to a hundred thousand training examples, it actually does just as well as or better than supervised translation. I think that's a really promising result, because if you think of low-resource settings where there aren't many labeled examples, it suddenly becomes really nice that you can perform this well without even needing a labeled training set.

Another fun thing you can do with an unsupervised machine translation model is attribute transfer. Basically, you can take collections of text that are split by any attribute you want. For example, you could go on Twitter and look at hashtags to decide which tweets are annoyed and which tweets are relaxed, and then you can treat those two corpora of text as though they were two different languages, and train an unsupervised machine translation model to convert from one to the other. You can see these examples: the
model actually does a pretty good job of minimally changing the sentence, preserving a lot of that sentence's original semantics while changing the target attribute.

I also want to throw a little bit of cold water on this idea. I do think it's really exciting and almost mind-blowing that you can do this translation without labeled data. Certainly it's really hard to imagine someone giving me a bunch of books in Italian and saying, "Okay, learn Italian," without teaching me how to specifically do the translation. But even though these methods show promise, mostly they have shown promise on languages that are quite closely related. Those previous results were all some combination of English to French or English to German and so on, and those languages are quite similar. If you look at a different language pair, let's say English to Turkish, where the linguistics of the two languages are quite different, these methods do still work to some extent, so they get around five BLEU points, let's say, but they don't work nearly as well as they do in the other settings. There's still a huge gap to purely supervised learning. So we're probably not quite at the stage where an alien could come down and it's no problem, let's use our unsupervised machine translation system. But I still think it's pretty exciting progress.

Yeah, question. [Audience question, partly inaudible:] My original thought was that if you took, for example, Latin, which doesn't have a word for [inaudible]. Do you think English would map better to Latin, because they're more closely related, and worse to Turkish, or is it the other way around?

I would expect English to map quite a lot better to Latin, and I think part of the issue here is that the difficulty in translation, I think, is not really
at the word level. I mean, that certainly is an issue, that words exist in one language that don't exist in another, but I think there are actually more substantial differences between languages at the level of syntax or semantics, how ideas are expressed. So I would expect Italian or Latin to have relatively similar syntax to English compared to, say, Turkish, and I imagine that is probably the bigger obstacle for unsupervised machine translation models. I'm going to really quickly go into this last recent research paper, which is basically taking BERT, which you've learned about, correct? Yes? Okay. And making it cross-lingual. So here's what regular BERT is: we have a sequence of sentences in English, we're going to mask out some of the words, and we're going to ask BERT, which is our transformer model, to essentially fill in the blanks and predict the words that were dropped out. What has actually already been done by Google is training a multilingual BERT. What they did, essentially, is concatenate a whole bunch of corpora in different languages and then train one model using this masked LM objective on all of that text at once, and that's a publicly released model. The new extension to this that has recently been proposed by Facebook is to actually combine this masked LM training objective with translation. So what they do is sometimes give this model, in this case, a sequence in English and a sequence in French, drop out some of the words, and, just as before, ask the model to fill them in.
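As a rough illustration of the objective just described, here is a minimal sketch of how one such training example might be built: concatenate an English sentence and its French translation, then mask random tokens on both sides. The function name `make_tlm_example`, the `[MASK]` and `[SEP]` strings, and the 15 percent masking rate are assumptions for illustration, not details from the talk or the paper's code. It also leaves out things a real system would need, such as subword tokenization and any language or position embeddings.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumed for this sketch)
SEP = "[SEP]"    # placeholder separator between the two languages


def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, rng=None):
    """Concatenate a sentence pair and mask random tokens on both sides.

    Returns (inputs, labels). labels holds the original token at masked
    positions and None elsewhere, so a loss would only be computed where
    the model has to fill in a blank.
    """
    rng = rng or random.Random(0)
    tokens = list(src_tokens) + [SEP] + list(tgt_tokens)
    inputs, labels = [], []
    for tok in tokens:
        # Never mask the separator; mask other tokens with probability mask_prob.
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


if __name__ == "__main__":
    english = ["the", "cat", "sat", "down"]
    french = ["le", "chat", "s'est", "assis"]
    inputs, labels = make_tlm_example(english, french, mask_prob=0.3)
    print(inputs)
    print(labels)
```

Because both sentences sit in the same input, a masked English word can in principle be recovered by attending to the French side, provided its counterpart was not masked too.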
The motivation here is that this will much better cause the model to understand the relation between these two languages. Because if you're trying to fill in an English word that's been dropped, the best way to do it, if you have a translation, is to look at the French side and try to find that word, hopefully that one hasn't been dropped as well, and then you can much more easily fill in the blank. And this actually leads to very substantial improvements in unsupervised machine translation. So just like BERT is used for other tasks in NLP, they basically take this cross-lingual BERT, use it as initialization for an unsupervised machine translation system, and get really large gains, on the order of ten BLEU points, such that the gap between unsupervised machine translation and the current supervised state of the art is much smaller. So this is a pretty recent idea, but I think it also shows promise in really improving the quality of translation through using unlabeled data. Although I guess in this case, with BERT, they are using labeled translation data as well. Any questions about this? Okay. So that is all I'm going to say about using unlabeled data for translation. The next part of this talk is about what happens if we really scale up these unsupervised language models. In particular, I'm going to talk about GPT-2, which is a new model by OpenAI that's essentially a really giant language model, and I think it has some interesting implications. So first of all, here are the sizes of a bunch of different NLP models. Maybe a couple of years ago, the standard medium-sized LSTM model was on the order of about 10 million parameters, where a parameter is just a single weight, let's say, in the neural net. ELMo and GPT, the original OpenAI paper before they did this GPT-2, were about ten times bigger than that. GPT-2 is about another order of magnitude bigger.
One interesting comparison point here is that GPT-2, which has 1.5 billion parameters, actually has more parameters than a honeybee brain has synapses. So that sounds kind of impressive, right? Honeybees are not the smartest of animals, but they can still fly around and find nectar or whatever. But of course this isn't really an apples-to-apples comparison: a synapse and a weight in a neural net are really quite different. But I just think it's one interesting milestone, let's say, in terms of model size that has been surpassed.
One thing to point out here is that this increasing scaling of deep learning is really a general trend in all of machine learning, so beyond NLP. This plot is showing time on the x-axis, and the y-axis is the log-scaled amount of petaflops used to train the model. So what this means is that the trend, at least currently, is that there is exponential growth in how much compute power we're throwing at our machine learning models. I guess it is kind of unclear whether exponential growth will continue, but certainly there's rapid growth in the size of our models, and it's leading to some really amazing results. So here are results not from language but from vision. This is a generative adversarial network that's been trained on a lot of data and at really large scale, so it's a big model, kind of in between the size of ELMo and BERT, let's say. And these photos here are actually productions of the model, so those aren't real photos; those are things the model is just kind of hallucinating out of thin air, and at least to me they look essentially photorealistic. There's also a website that is fun to look at if you're interested, which is thispersondoesnotexist.com. If you go there, you'll see a very convincing photo of a person, but it's not a real photo; it's again a hallucinated image produced by a GAN. We're also seeing really huge models being used for image recognition. This is recent work by Google where they trained an ImageNet model with half a billion parameters. So that's bigger than BERT but not as big as GPT-2. This plot here is showing log-scaled number of parameters on the x-axis and accuracy on ImageNet on the y-axis. And sort of unsurprisingly, bigger models perform better, and there seems to actually be a pretty consistent trend here, which is accuracy increasing with the log of the model size. I
want to go into a little bit more detail on how it is possible that we can scale up models and train models at such a large scale. One answer is just better hardware, and in particular there's a growing number of companies that are developing hardware specifically for deep learning. These are even more constrained in the kinds of operations they can do than a GPU, but they do those operations even faster. Google's Tensor Processing Units are one example, and there are actually a bunch of other companies working on this idea. The other way to scale up models is by taking advantage of parallelism, and there are two kinds of parallelism that I want to talk about very briefly. One is data parallelism. In this case each of your, let's say, GPUs will have a copy of the model, and what you essentially do is split the mini-batch that you're training on across these different copies of the model. So if you have, let's say, 16 GPUs and each of them sees a batch size of 32, you can aggregate the gradients after doing backprop on these 16 GPUs, and you end up with effectively a batch size of 512. So this allows you to train models much faster. The other kind of parallelism that's growing in importance is model parallelism. Eventually models get so big that they can't even fit on a single GPU and can't even do a batch size of one. In this case you actually need to split up the model itself across multiple compute units, and that's what's done for models of the size of, let's say, GPT-2. There are new frameworks, such as Mesh TensorFlow, which are basically designed to make this sort of model parallelism easier. Okay, so on to GPT-2. I know you already saw this a little bit in the contextualized embeddings lecture, but I'm going to go into some more depth here. So essentially it's a really large transformer language model.
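To make the data-parallelism arithmetic from a moment ago concrete, here is a small sketch in plain Python with no GPUs involved. It assumes a toy one-parameter model `y = w * x` with a mean-squared-error loss, and the helper names `grad_mse` and `data_parallel_grad` are made up for this example. The point it illustrates is that averaging the gradients from 16 equal shards of 32 examples gives the same gradient as one batch of 512.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch evenly, compute one gradient per worker, then average.

    Assumes len(xs) is divisible by num_workers so every shard has equal size.
    """
    shard = len(xs) // num_workers
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(grads) / num_workers


num_workers, per_worker_batch = 16, 32
effective_batch = num_workers * per_worker_batch  # 16 * 32 = 512

# Toy data, invented for the sketch.
xs = [float(i % 7) for i in range(effective_batch)]
ys = [3.0 * x + 1.0 for x in xs]

if __name__ == "__main__":
    print(effective_batch)
    print(grad_mse(0.5, xs, ys))
    print(data_parallel_grad(0.5, xs, ys, num_workers))
```

In a real setup each worker would hold its own copy of the model on its own device and the averaging would be a communication step between devices; this sketch only shows why the result matches a single large batch.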
There's nothing really novel here in terms of new training algorithms or in terms of the loss function or anything like that. The thing that makes it different from prior work is that it's just really, really big. It's trained on a correspondingly huge amount of text: it's trained on 40 gigabytes, and that's roughly ten times larger than what previous language models have been trained on. When you have that size of data set, the only way to get that much text is essentially to go to the web.
One thing OpenAI put quite a bit of effort into when they were developing this network was ensuring that that text was pretty high quality, and they did that in a kind of interesting way. They looked at Reddit, which is this website where people can vote on links, and they said, if a link has a lot of votes, then it's probably a decent link; there's probably reasonable text there for a model to learn from. Okay, so if we have this super huge language model like GPT-2, the question is what can you actually do with it. Well, obviously if you have a language model you can do language modeling with it, but one interesting thing is that you can run this language model on existing benchmarks for language modeling and get state-of-the-art perplexity on these benchmarks even though it never sees the training data for these benchmarks. Normally, if you want to, say, evaluate your language model on the Penn Treebank, you first train on the Penn Treebank and then you evaluate on its held-out set. In this case GPT-2, just by virtue of having seen so much text and being such a large model, outperforms all these other prior works on a bunch of different language modeling benchmarks, even though it's not seeing that data. But there's a bunch of other interesting experiments that OpenAI ran with this language model, and these were based on zero-shot learning. Zero-shot learning just means trying to do a task without ever training on it. And the way you can do this with a language model is by designing a prompt that you feed into the language model and then having it just generate from there, and hopefully it generates something relevant to the task you're trying to solve. So for example, for reading comprehension, what you can do is take the context paragraph, concatenate the question to it, and then add a colon,
which is a way, I guess, of telling the model, okay, you should be producing an answer to this question, and then just have it generate text, and perhaps it'll generate something that is actually answering the question and is paying attention to the context. Similarly, for summarization you can give it the article, then "TL;DR", and perhaps the model will produce a summary. You can even do translation, where you give the model a list of known English-to-French translations, so you prime it to tell it that it should be doing translation, and then you give it the source sequence, "equals", blank, and have it just run, and perhaps it'll generate the sequence in the target language. Okay, so here's what the results look like. For all of these, the x-axis is log-scaled model size, the y-axis is accuracy, and the dotted lines basically correspond to existing works on these tasks. For most of these tasks GPT-2 is quite a bit below existing systems, but there's of course this big difference: existing systems are trained specifically to do whatever task they're being evaluated on, whereas GPT-2 is only trained to do language modeling, and as it learns language modeling it's sort of picking up on these other tasks. So for example, it does English-to-French machine translation
not as well as standard unsupervised machine translation, which is those dotted lines, but it still does quite well. One interesting thing is the trend line: for almost all of these tasks, performance is getting much better as the model increases in size. I think a particularly interesting one of these tasks is machine translation. The question is, how can it be doing machine translation when all we're giving it is a bunch of web pages, and those web pages are almost all in English? And yet somehow it sort of magically picks up a little bit of machine translation. It's not a great model, but it can still do a decent job in some cases. And the answer is that if you look at this giant corpus of English, occasionally within that corpus you see examples of translations. So you see a French idiom and its translation, or a quote from someone who's French and then the translation in English. And kind of amazingly, I think, this big model sees enough of these examples that it actually starts to learn how to generate French, even though that wasn't really an intended part of its training. Another interesting thing to dig a bit more into is its ability to do question answering. A simple baseline for question answering gets about 1 percent accuracy; GPT-2 barely does better, at 4 percent accuracy. So this isn't, you know, super amazing, we've solved question answering, but it's still pretty interesting in that if you look at the answers the model is most confident about, you can see that it has learned some facts about the world. So it's learned that Charles Darwin wrote Origin of Species. Normally in the history of NLP, if you want to get world knowledge into an NLP system, you'd need something like a big database of facts. And even
though this is still very early stages, and there's still a huge gap between 4 percent accuracy and the 70 percent or so that state-of-the-art open-domain question answering systems can do, it still can pick up some world knowledge just by reading a lot of text, without explicitly having that knowledge put into the model. Any questions, by the way, on GPT-2 so far? Okay. So one question that's interesting to think about is what happens if our models get even bigger. Here I've done the very scientific thing of drawing some lines in PowerPoint and seeing where they meet up, and you can see that if the trend holds, at about 1 trillion parameters we get to human-level reading comprehension performance. If that's true it would be really astonishing. I actually do expect that a 1 trillion parameter model would be attainable in, I don't know, 10 years or so. But of course the trend isn't clear: if you look at summarization, for example, it seems like performance has already topped out. So I think this will be a really interesting thing going forward, looking at the future of NLP: how this scaling will change the way NLP is approached. The other interesting thing about GPT-2 was the reaction from the media and also from other researchers. The real cause of a lot of the controversy about it was this statement from OpenAI. They said, we're not going to release our full language model because it's too dangerous; our language model is too good. So the media really enjoyed this and said that machine learning is going to break the internet.
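Going back to the lines-in-PowerPoint extrapolation for a moment, the same back-of-the-envelope exercise can be sketched in code: fit a straight line to score versus log10 of parameter count, then solve for where that line reaches a target score. The numbers below are toy values chosen so the line is exact and the arithmetic is easy to follow. They are not GPT-2's measured results, and the helper names `fit_line` and `params_to_reach` are made up for this sketch.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def params_to_reach(target_score, param_counts, scores):
    """Extrapolate a log-linear trend to the parameter count that hits target_score."""
    log_params = [math.log10(p) for p in param_counts]
    slope, intercept = fit_line(log_params, scores)
    return 10 ** ((target_score - intercept) / slope)


# Toy, invented data points: (parameter count, task score).
toy_params = [1e8, 1e9, 1e10]
toy_scores = [30.0, 40.0, 50.0]
toy_human_level = 70.0  # hypothetical target score

if __name__ == "__main__":
    needed = params_to_reach(toy_human_level, toy_params, toy_scores)
    print(f"{needed:.3g} parameters")
```

As the summarization curve mentioned above suggests, a fit like this says nothing about whether the trend actually continues, so it is only as good as the assumption that the line keeps going.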
There's also some pretty interesting reaction from researchers. There are some kind of tongue-in-cheek responses here: "I trained a model on MNIST; is it too dangerous for me to release it?" And similarly, "We've done really great work but we can't release it, it's too dangerous, so you're just going to have to trust us on this." Looking at more reasoned debate about this issue, you still see articles arguing both sides. These are two articles from The Gradient, which is a sort of machine learning newsletter, and they're arguing precisely opposite sides of this issue: should it be released or not. So I guess I can briefly go over a few arguments for or against. There's a lot of debate about this, and I don't want to go too deep into a controversial issue, but here's a long list of things people have said about this. Here's why you should release. One complaint is, is this model really that special? There's nothing new going on here; it's just ten times bigger than previous models. And there are also arguments that even if this one isn't released, in five years everybody can train a model this good. And actually, if you look at images and speech data, it already is possible to synthesize highly convincing fake images and fake speech. So what makes this thing different from those other systems? And speaking of other systems, Photoshop has existed for a long time, so we can already convincingly fake images. People have just learned to adjust and learned that you shouldn't always trust what's in an image, because it may have been altered in some way. On the other hand, you could say, okay, Photoshop exists, but you can't scale up Photoshop and start mass-producing fake content the way you can with this sort of model. And they point to the danger of fake news, fake
reviews, and in general just astroturfing, which basically means creating fake user content that's supporting a view you want other people to hold. This is actually something that's already done pretty widely by companies and governments, so there's a lot of evidence for this. But they're of course hiring people to write all these comments and news articles, let's say, and we don't want to make their job any easier by producing a machine that could potentially do this. So I'm not really going to take a side here; there's still a lot of debate about this. I think the main takeaway here is that as a community, people in machine learning and NLP don't really have a handle on this. We were sort of caught by surprise by OpenAI's decision here, and that means there really is some figuring out that needs to be done on what exactly is responsible to release publicly, what kind of research problems we should be working on, and so on. So, any questions about this reaction or this debate in general? Okay. I think something arising from this debate is the question of whether the ML people should really be the people making these sorts of decisions, or whether there is a need for more interdisciplinary science, where we look to experts in, say, computer security, people from the social sciences, people who are experts in ethics, to look at these decisions. So GPT-2 is definitely one example of where suddenly it seems like our NLP technology has a lot of pitfalls, where it could be used in a malicious way or could cause damage. And I think this trend is only going to increase. If you look at the areas of NLP that people are working on, increasingly people are working on really high-stakes applications of NLP, and those often have really big ramifications, especially if you think from the angle of bias and fairness.
So let's go over a couple of examples of this. Some areas where this is happening: people are looking at NLP for judicial decisions, for example, should this person get bail or not; for hiring decisions, so you look at someone's resume, you run NLP on it, and then you make a decision automatically, should we throw out this resume or not, so some sort of screening; and for grading tests. If you take the GRE, your test will be graded by a machine. A person will also look at it, but nevertheless that's a sometimes very impactful part of your life, when it's a test that affects your acceptance
into a school, let's say. So I think there are some good sides to using machine learning in these kinds of contexts. One is that we can pretty quickly evaluate a machine learning system and search out whether it has some kind of bias, just by running it on a bunch of data and seeing what it does. And also, perhaps even more importantly, we can fix this kind of problem if it arises. It's probably easier to fix a machine learning system that screens resumes than it is to fix having, you know, 5,000 executives that are slightly sexist or something. So in this way there is a positive angle on using machine learning in these high-stakes decisions. On the other hand, it's been pretty well known, and I know you had a lecture on bias and fairness, that machine learning often reflects bias in a data set; it can even amplify bias in the data set. And there's concern about a feedback loop, where a biased algorithm actually leads to the creation of more biased data, in which case these problems will only compound and get worse. So for all of the high-impact decisions I had listed on that slide, there are examples where things have gone awry. Amazon had some AI that was working as a recruiting tool, and it turned out to be sexist. There have been some early pilots of using AI in the justice system, and those also have had, in some cases, really bad results. If you look at automatic essay grading, it's not really a great NLP system. Here's an example excerpt of an essay that an automatic grading system used by the GRE test gives a very high score, but really it's just kind of a salad of big fancy words, and that's enough to convince the model that this is a great essay. The last area that I want to talk about where you
can see there are really some risks and some pitfalls with using NLP technology is chatbots. I think chatbots do have a side where they can be very beneficial. Woebot is one example: this is a company that has a chatbot you can talk to if you're not feeling too great, and it'll try to, I don't know, cheer you up. So that could be a really nice piece of technology that helps people. But on the other hand there are some big risks. One example is that Microsoft Research had a chatbot trained on tweets, and it quickly started saying racist things and had to be pulled. So I think all of this highlights that as NLP is becoming more effective, people are seeing opportunities to use it in increasingly high-stakes decisions, and although there's some appeal to that, there's also a lot of risk. Any more questions on this sort of social impact of NLP? Okay. The last part of this lecture is looking more at future research, and in particular I think a lot of the current research trends are kind of reactions to BERT. So the question is, what did BERT solve, and what do we work on next? Here are results on the GLUE benchmark, which is a compilation of 10 natural language understanding tasks, and you get an average score across those 10 tasks. The left two, sorry, the right two, the rightmost models are just supervised-trained machine learning systems. So we have bag of vectors; if we instead use our fancy neural net architecture of BiLSTM plus attention, we gain about 5 points. But the gains from BERT really dwarf that difference: BERT improves results by about 17 points, and we end up being actually quite close to human performance on these tasks. So one implication of this that people are wondering about is, is this kind of the death of architecture engineering?
I'm sure all of you who have worked on the default final project have seen a whole bunch of fancy pictures showing different architectures for solving SQuAD. There are a lot of papers; they all propose some kind of attention mechanism or something like that. And with BERT, it's sort of, you don't need to do any of that. You just train a transformer, you give it enough data, and actually you're doing great on SQuAD. So maybe
these architectural enhancements are not necessarily the key thing that will drive progress in improving results on these tasks. If you look at this from the perspective of a researcher, a researcher might say, okay, I could spend six months designing a fancy new architecture for SQuAD, and if I do a good job, maybe I'll improve results by one F1 point. But in the case of BERT, increasing the size of the model 3x, which is the difference between their base-size model and their large model, improved results by 5 F1 points. So that does seem to suggest we need to sort of reprioritize which avenues of research we pursue, because this architecture engineering isn't providing gains for its time investment the way leveraging unlabeled data is. Now if you look at the SQuAD leaderboard, I think at least the top 20 entrants are all BERT plus something. One other issue I think BERT has raised is that we need harder tasks. BERT has almost solved SQuAD, if you define that as getting close to human performance. So there's been a growth in new data sets that are more challenging, and there are a couple of ways in which they can be more challenging. One is doing reading comprehension on longer documents, or doing it across more than one document. One area is coming up with harder questions that require multi-hop reasoning, which essentially means you have to string together multiple supporting facts from different places to produce the correct answer. And another area is situating question answering within a dialogue. There's also been a kind of small detail in the construction of reading comprehension data sets that has actually really affected the difficulty of the task, and that is whether, when you create these data sets, the person who writes questions about a passage can see that passage or not.
Of course it's much easier to come up with a question when you see the passage, and if you come up with a question without seeing the passage, you may not even have an answerable question. But the problem with looking at the passage is that, first of all, it's not realistic. If I'm asking a question, I'm usually not going to have the paragraph that answers that question sitting in front of me. On top of that, it really encourages easy questions. If you're a Mechanical Turk worker and you're paid to write as many questions as possible, and then you see an article that says, I don't know, Abraham Lincoln was the 16th president of the United States, what are you going to write as your question? You're going to write, who was the 16th president of the United States? You're not going to write something more interesting that's harder to answer. So this is one way in which crowdsourced data sets have changed: people are now making sure questions are sort of independent of the context. So I'm going to briefly go over a couple of new data sets in this line. One is called QuAC, which stands for Question Answering in Context. In this data set there is a teacher and a student. The teacher sees a Wikipedia article, the student wants to learn about this Wikipedia article, and the goal is to train a machine learning model that acts as the teacher. So you can imagine maybe in the future this sort of technology would be useful for education, for adding some automation. One thing that makes this task difficult is that questions depend on the entire history of the conversation. So for example, if you look at the example dialogue on the left here, the third question is "Was he the star?" Clearly you can't answer that question unless you look back earlier in the dialogue and realize that the subject of this conversation is Daffy Duck. And sort
of because this data set is more challenging, you can see that there's a much bigger gap to human performance. If you train BERT with some extensions, the results are still like 15 F1 points worse than human performance. Here's one other data set, called HotpotQA. It is designed instead for multi-hop reasoning. Essentially, in order to answer a question, you have to look at multiple documents, you have to look at different facts from those documents, and perform some inference to
get what the correct answer is. So I think this is a much harder task, and again there's a much bigger gap to human performance. Any questions on new data sets, harder tasks for NLP? Okay. I'm going to kind of rapid-fire through a couple more areas in the last minutes of this talk. Multi-task learning, I think, is really growing in importance. Of course, you've had a whole lecture on this, so I'm not going to spend too much time on it. But maybe one point of interest is that if you look at performance on this GLUE benchmark, so this benchmark for natural language understanding, the top couple of results that are now actually surpassing BERT in performance are taking BERT and training it in a multi-task way. I think another interesting motivation for multi-task learning is that if you are training BERT, you have a really, really large model, and one way to make more efficient use of that model is training it to do many things at once. Another area that's definitely important, and I think will be important going into the future, is dealing with low-resource settings. Here I'm using a really broad definition of resources. That could mean compute power: BERT is great, but it also takes huge amounts of compute to run, so it's not realistic to say, if you're building, let's say, an app for a mobile device, that you could run a model the size of BERT. As I already went into earlier in this talk, low-resource languages are an area that I think is pretty underrepresented in NLP research right now, because most data sets are in English. But I do think there's a really large number of people who, in order to benefit from NLP technology, will need technologies that work well in a lot of different languages, espe