Improving Natural Language Understanding through Adversarial Testing
So let me now introduce today's speaker, Chris Potts. Christopher Potts is Professor of Linguistics and, by courtesy, of Computer Science at Stanford University. He is the director of the Center for the Study of Language and Information at Stanford. In his research, he develops computational models of linguistic reasoning, emotional expression, and dialogue. He is the author of the book The Logic of Conventional Implicatures, as well as numerous scholarly papers in linguistics and natural language processing. So thank you for joining us, Chris, and I'm happy to hand it over to you.

Thank you, Petra, and thank you everyone for turning out today. I think this is an exciting event. I also want to thank the course alums for coming back to talk about their work. I think that's very generous. They did big and exciting things during the course, and it's great for them to be reporting out to you on what they achieved.

To kick this off, I want to say very boldly that we live in the most exciting moment in history for doing work in natural language understanding. It really does feel like over the past 15 years we've seen some real qualitative changes in what we're able to do, both with technology and with core scientific development. So that's a very positive picture. It really is an exciting moment.

On the other hand, I think as practitioners we can see that some of the gains aren't what they first seem, and that the big questions are still left open. That's part of what makes this such an exciting moment: not just that we're able to do new and exciting things, but also that the big research challenges are still ahead of us. What I can do today is give you a full picture of that through the lens of what I'm going to call adversarial testing, which is a new mode of evaluating our systems
and looking for ways to find problems with them and improve them.

So here's our outline for today. I do want to emphasize, under the heading of a golden age for NLU, that lots of exciting things are happening, and I want that to be the overall message: it really is an exciting moment. However, it's important that we take a peek behind the curtain and come to a measured understanding of what this progress is actually like. That will key us up to talk about adversarial testing, which is this more technical, more strategic way that we as practitioners can find fault with our systems and then look for ways to improve them. And that's a nice transition into the coursework for XCS224U, because I think the tools and techniques that we introduce are really great for combining with adversarial testing and finding new ways to make progress.

Let's kick it off on this positive note: a golden age for NLU. I've just assembled some examples that have to do with natural language understanding. Actually, I could be talking about AI in general, because it really is a golden age for the entire field.

First example: artificial assistants. These get a lot of the press. These are things like Siri and Google Home. The more trusting among you might have these devices in your homes, listening to you at all times.
They certainly pass the bar in terms of utility: they're able to help us with simple tasks around the house. I also want to call out the fact that their speech-to-text capabilities are astounding. They're good in the sense that 15 years ago the things that they achieve every day would have looked like science fiction. Now, it's by no means a solved problem. There are plenty of remaining issues with that speech-to-text, but again it really does feel like we've made a phase change in terms of the things we can achieve there. Maybe for the NLU a little less, as we'll see in a little bit.

We don't actually talk about machine translation in the course, but I'd be remiss if I didn't bring it up, because this is another major breakthrough area for language technologies. I've picked Google Translate as my example here. First of all, Google Translate can take you from dozens of input languages to dozens of output languages. That alone is an astounding science and technology accomplishment. But it's also remarkable how good the translations can be. Here, just as a quick example, I've put some English in on the left and I've got a French example on the right. This is actually an example from a popular data set, so we have a so-called gold translation here, and it's just striking how close the translated text is to that human gold standard. You have the usual mistakes in terms of prepositions and maybe word choice and stylistics, but again it really passes the bar in terms of helping someone who doesn't understand the input language actually figure out what was expressed. And that's just something that wasn't true 15 or 20 years ago for these MT systems.

Image captioning is a really great grounded language understanding task, where the inputs are images and the task is to assign them accurate, interesting, and descriptive captions.
This is from a paper a few years ago that was really a breakthrough paper, I think, on doing this kind of natural language generation. And I just want to observe that these captions are really great: "A person riding a motorcycle on a dirt road." "A group of young people playing a game of frisbee." This is fluent, basically factually accurate text that is really good as a caption for those images, and again this just wasn't something we could achieve
15 years ago.

Another major technology moment, for me anyway, was when IBM's Watson system won Jeopardy, the game show. This is a really integrated technology system that had to do lots of things in order to play the game of Jeopardy. But at its heart, this was a very powerful open-domain question answering system, so I really do mark this as a win for NLU: it was able to be superhuman, in some sense, on the show by beating these two Jeopardy champions here.

And then if we zoom in on the kind of tasks that we actually focus on in the course, I think we see a similar picture, where the kinds of things that we can do now just feel very different from the kinds of things we could do 15 years ago, and it feels like we're on the cusp of seeing some really transformative things in the near future as well. Because it's a major unit for the course, I've picked natural language inference as the task I'm going to focus on to illustrate some of this.

So just briefly, the task of natural language inference, or NLI: you're given a premise and a hypothesis text, and the task is to assign one of three labels to that pair. In this case we would assign the label entails to the pair "A turtle danced" and "A turtle moved". The idea is that any situation in which the premise sentence is true would also be one in which the hypothesis sentence is true. It's a kind of commonsense reasoning task. For the second example, "Every reptile danced" is neutral with respect to "A turtle ate", because they can be true or false independently of each other. And finally, "Some turtles walk" and "No turtles move" would be a contradiction in this notion of commonsense reasoning.

So that's a fast overview of the NLI task. There are a few major benchmark data sets. The first one, the oldest one, is the Stanford Natural Language Inference corpus. Here what I've done is map out on the y-axis
the F1 score, which you can think of as a notion of accuracy for the system. In the original paper we set a human baseline of just short of 92 on this F1 metric. That's this red line here. Across the x-axis I have time. So what we're going to look at is published papers over time that have tried to achieve new things on this SNLI benchmark.

So here's the picture. What you see is, first of all, basically monotonic progress. I think people are learning from previous papers and figuring out new tricks that help them on the SNLI task, so you see a lot of very rapid progress. And then the striking thing is that in the middle of last year we saw the first what you might call superhuman system for the SNLI task.

Two things I want to say about that. First, again, this is not something that we could have achieved two decades ago. It really is remarkable that we even have systems that can enter this kind of competition, to say nothing of actually surpassing this estimate of human performance. However, what we have to keep in mind, and what we'll see in a little bit, is that this does not mean that we have systems that are superhuman when it comes to the human task of commonsense reasoning. All the hard aspects of that problem remain completely unsolved, and I'm going to make that very clear to you. But nonetheless, it's striking that we have systems that are even this good in this narrowly circumscribed way.

MultiNLI is a very similar data set. It's just arguably harder, because the underlying data is more diverse. I have the same framework here: F1 score, and the human estimate is 92.6, so a little bit higher. And I again have time along the x-axis. Now, it's exciting that this data set, unlike the previous one, is on Kaggle, so many more people can enter and we can get their scores. So the picture overall
is that there's a lot more variance. A lot more people are entering and doing a lot more interesting and diverse things. But it's a similar picture, in that we can see the community slowly hill-climbing toward what is eventually going to be superhuman performance on this task. And that's exciting to see.

And NLI, just to be clear, is not the only area in which we have what you might call superhuman, but in quotes "superhuman", performance. Here are a few other examples. They include speech technologies, translation, question answering, and GLUE here is a big benchmark task that captures a lot of diverse things. Again, though, you have to be really careful about how you talk about this. What we have is systems that are superhuman
on these particular data sets, using a very particular set of metrics. This does not mean that we have superhuman performance in any larger sense, and that's the part that I find exciting: in fact, these are unsolved problems.

But nonetheless, you might look and reflect back on this technology and start to adopt the perspective that's in this book by Nick Bostrom called Superintelligence, where he looks at the current state of technology and begins to wonder what life might be like when, very soon perhaps, we have systems that are vastly better than humans at all of these core human abilities. And he worries about what the world and the universe might be like when we achieve those kinds of breakthroughs. So have that picture in mind. I can see why people would arrive at it given the golden age that we live in, but I do want to temper that a little bit. So let's take a peek behind the curtain at those examples.

I mentioned those artificial agents before that are in your houses. You've probably experienced them in various ways. I think the dream is that they'll be able to do things like this. You say, "Any good burger joints around here?" and it replies, "I found a number of burger restaurants near you." And you say, "Hmm, what about tacos?" And at that point your device is able to recognize your intention very flexibly, think about your language and the context it's in, and proactively help you solve the problem that you've implicitly defined for it. That's the dream. I'm not sure how often you all experience it.

I want to balance that against this very funny sketch from the Stephen Colbert show from a number of years ago. The premise is that Stephen has been playing with his iPhone, which has Siri on it, all day, and he has failed to write his television show. So he says, "For the love of God, the cameras are on, give me something." And Siri replies, "What kind of place are you looking for? Camera stores or churches?"
As practitioners, we should pause there and realize what has happened. Siri is doing some very superficial keyword matching on the utterance, not deep language understanding, and that's why it has associated cameras with camera stores and God with churches. So that's a peek behind the curtain. And then the interaction continues: "I don't want to search for anything, I want to write the show." And Siri does what Siri often does: "Searching the web for 'search for anything I want to write the shuffle'." There's a small transcription error there, but I think the broader picture is just that Siri does not have a deep understanding of this interaction, and we're seeing on the surface here the cheap tricks that the device uses in order to try to get past those limitations. This is by no means open-domain dialogue of the sort that we were hoping for.

Translation. Again, I think Google Translate is an astounding technological achievement, but it too shows that it doesn't have deep understanding. For this example, what I've done is input a bunch of random vowel sequences. This is a trick I learned from the Language Log website. It's completely random input. It's interesting that it has inferred that this is the Hawaiian language. If you know something about Hawaiian syllable structure, you might grant that that is at least an interesting hypothesis about this input. Nonetheless, it is completely random. The really disconcerting part is that on the right here, in English, we have a completely fluent sentence that by definition has nothing to do with that nonsense input. And even stranger, if I make very small changes on the left, I'll get a completely fluent but completely different sentence out on the right. This is revealing that these systems don't know anything about their own uncertainty, and certainly don't understand the inputs they're processing.
I showed you before those examples from image captioning. To their credit, in this excellent paper they didn't just show the really successful cases. Here we have a spectrum from the really good ones on the left to the really embarrassing ones on the right. This middle one says "A refrigerator filled with lots of food and drinks." It's actually a sign with some stickers on it. "A yellow school bus parked in a lot" is kind of close, but really not like what the human understanding of those scenes is. So, lots of work to be done even in the narrowly circumscribed
space of image captioning.

I mentioned before that I think Watson was a real breakthrough technology moment, especially for open-domain question answering. But again, Watson did not understand what it was processing. Here's a funny interaction. You have to remember that Jeopardy reverses its questions and answers, so the answer came, "Grasshoppers eat it," and Watson's reply was "Kosher," which seems completely disjointed. It's not a guess that a human would make. But if you realize that Watson was primarily trained on lots of Wikipedia entries, and you look up grasshoppers on Wikipedia, you'll find very rich discussions of whether modern-day grasshoppers are kosher. So there is a human way in which we can understand what Watson did, but this also reveals how superficial the processing techniques actually are, and how un-human-like they actually are.

So, summarizing: I showed you that perspective from the book Superintelligence before. Having seen behind the curtain, you might now balance that against the perspective from this very funny book called How to Survive a Robot Uprising. This is by Daniel Wilson, who is a practitioner, a roboticist. And this book is full of advice like: if you're being pursued by a robot, run up some stairs, or be sure to wear clothing that will confuse its vision system. A much more tempered perspective.

So let's try to make that a little bit more precise, in terms of things that we could take action on in a course like Natural Language Understanding. That falls under the heading of adversarial testing. Just to get into our common ground, let me quickly review what standard evaluations are like. In standard evaluations in NLU, but actually throughout the field of artificial intelligence, we work like this. You create a data set from some single process. You could scrape some data from the web, or crowdsource
a new data set, or something like that, but the point is that it's homogeneous. In the next step, you divide the data set into disjoint train and test sets, and you set the test set aside. It's under lock and key. You develop your system on the train set, never once looking at the test set. And only after all development is complete, you finally evaluate your trained system on that held-out test set. The idea is that that will provide you an estimate of your system's capacity to generalize to new cases, because after all, you held that test set under lock and key, and only at the very end did you look at how your system behaved with those entirely new examples.

It sounds good, and it has a lot going for it, but I want to point out how generous this is to the systems that we're developing. Because in step one we had a single process, it's too much to say that this is actually going to be an estimate of how the system will perform in the real world. After all, the real world
will throw at our system many more diverse experiences than we saw in step one and throughout this process.

Adversarial testing embraces that. In adversarial testing, we make a slight tweak. Start by creating a data set by whatever means you like; it could be just as before. You develop and assess your system using that data set, again according to whatever protocols you choose, so this part could be standard. But here's the new bit: you develop a new test data set of examples that you suspect or know, as a practitioner, will be challenging given your system and the original data set. And then, of course, only after all system development is complete, you evaluate systems on that new test set, and you report that number as an estimate of the system's capacity to generalize. A lot of this is familiar, except for the introduction of this new and potentially quite adversarial data set in the middle. This is simulating what we saw when we looked behind the curtain, where entirely new examples that the system developers didn't anticipate were causing a lot of grief for our otherwise very good systems.

Let's return to that NLI problem, and let me show you what this is like in practice. Remember, this is that premise-hypothesis prediction task with three labels. In a lovely paper by Glockner et al., what they did is create a new adversarial data set that's based on lexical substitutions. I actually hesitate even to call this adversarial, because I think this is just an interesting challenge that they posed. Here's how it worked. You have a fixed premise; here it's "A little girl kneeling in the dirt crying." The original example had the hypothesis "A little girl is very sad," and that has the entailment relation. What they did is use WordNet, which is a structured lexical resource, to substitute for the word "sad" and have it become the word "unhappy".
Those are roughly synonymous, so what we would expect is that systems will just continue to predict the entailment relation for this new adversarial example. Everything else about the examples is the same. That's why I say this is actually a friendly adversary. What they found in practice is that systems that are otherwise very good are apt to predict something like contradiction for the second case. It's probably because they think that the negation in the word "unhappy" is a good signal of contradiction, so they make a mistake. And it's not a very human mistake. It's something very systematic about our understanding of a language like English that we see that these two are synonymous, assuming we know what the words mean. This example down here is similar: you have the fixed premise, and all they've done is change "wine" to "champagne". That should cause a change from entailment to neutral, but in fact, since the system has a very fuzzy understanding of how wine and champagne relate to each other, it continues to predict entailment in that case.

This is a picture of the data set. I think it's really cool, because it's got a lot of examples, especially for contradiction and entailment.
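As a rough illustration of the lexical-substitution idea just described (sad to unhappy, wine to champagne), the sketch below builds a new test example by swapping one word in the hypothesis. It is not the authors' code: the `SUBSTITUTIONS` table is a hypothetical, hand-written stand-in for WordNet lookups, and the names `NLIExample` and `substitute` are invented here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "entailment", "contradiction", or "neutral"


# Hypothetical stand-in for a lexical resource such as WordNet.
# Maps (original word, replacement word) to the label expected after the swap.
SUBSTITUTIONS = {
    ("sad", "unhappy"): "entailment",   # rough synonyms: label should not change
    ("wine", "champagne"): "neutral",   # more specific word: no longer entailed
}


def substitute(example: NLIExample, old: str, new: str) -> Optional[NLIExample]:
    """Return a copy of `example` with `old` replaced by `new` in the hypothesis.

    Returns None if `old` does not occur in the hypothesis or the swap is
    not listed in the substitution table.
    """
    words = example.hypothesis.split()
    if old not in words or (old, new) not in SUBSTITUTIONS:
        return None
    swapped = " ".join(new if w == old else w for w in words)
    return NLIExample(example.premise, swapped, SUBSTITUTIONS[(old, new)])


original = NLIExample(
    premise="A little girl kneeling in the dirt crying",
    hypothesis="A little girl is very sad",
    label="entailment",
)
adversarial = substitute(original, "sad", "unhappy")
print(adversarial.hypothesis, "->", adversarial.label)
```

Everything except the swapped word stays fixed, which is what makes this a "friendly" adversary: a system that flips its prediction here is reacting to the single substituted word.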
It also has this nice breakdown by individual categories, so you can get some real insights into what it's doing. And as predicted, this is quite devastating for these systems. I have a few models here that were very good models at the time. They have very good SNLI test accuracy; that's one of those benchmark tasks I mentioned before. And their accuracy on this new test set has plummeted by as much as 30 percentage points in absolute terms. So this is really devastating.

Now, there is a ray of hope here. I'm not going to go into this slide in detail, because there's a lot of information here. I've put it here just to say that our course has really great coverage of what are called Transformer-based models. You might have heard about them: BERT, RoBERTa, ELECTRA, XLNet. By the end of the course you'll have a very deep understanding of all the technical details that you see here. For now, though, I just want you to think: there has been a really interesting breakthrough in the last two years related to how people use these Transformer-based models. To give you a glimpse of that, let's just highlight RoBERTa here. What I've done on the next slide is use some of the code from our course and a pre-trained model that's easy to access using Facebook code. I read that model in and evaluate it on that full Glockner et al. data set that I just showed you. And the result is amazing. These are the performance numbers here: the accuracy is at 0.97, and it's doing extremely well for those two categories where you have enough examples, or enough support. And remember, just two years ago the best system on this adversarial test wasn't even above 0.75, and now we're at 0.97 and doing well on both of these categories. That's starting to look like, yet again, some big leap forward in how well we can do. It's very exciting.
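The kind of report mentioned above (overall accuracy plus per-category scores and support) can be computed with a few lines of plain Python. This is a minimal sketch, not the course's evaluation code: the function names and the toy gold and predicted labels are illustrative assumptions, and the numbers it produces have nothing to do with the 0.97 result from the slide.

```python
def accuracy(gold, predicted):
    """Fraction of examples where the predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_class_report(gold, predicted):
    """Precision, recall, F1, and support for each label."""
    report = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Support is the number of gold examples with this label.
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


# Toy labels, invented for illustration only.
gold = ["entailment", "entailment", "contradiction", "contradiction", "neutral"]
predicted = ["entailment", "contradiction", "contradiction", "contradiction", "neutral"]

print("accuracy:", accuracy(gold, predicted))
for label, scores in per_class_report(gold, predicted).items():
    print(label, scores)
```

Looking at support alongside the scores matters for a test set like this one, where some categories have too few examples for their numbers to be reliable.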
Now we can level up once more, just quickly. So far we've been using adversaries just for test sets, but we could actually have them be part of the entire life cycle of a model. That's what these authors have done for the Adversarial NLI dataset. This is a direct response to the kind of adversarial test failings we just saw. Here's how this worked. The annotator is presented with a premise sentence and a condition, that is, a label they need to produce: entailment, contradiction, or neutral. The annotator writes a hypothesis, and then a state-of-the-art model comes in and makes a prediction about this new premise-hypothesis pair. If the model's prediction matches the condition, that is, if the model was correct, the annotator returns to step two and tries again. And you continue that loop until the model is finally fooled, and you have a premise-hypothesis
pair, which you then validate with humans. So what's happening here is that we're creating a data set that is intuitive and natural for humans but, by definition, very difficult for state-of-the-art models, because they are now in the loop where people are being adversarial with them.

It's a familiar picture. This is the current state of the art. We have a few systems here that are outstanding on SNLI and MultiNLI. All these numbers in the middle are different views of the Adversarial NLI data set, and you can see that they are dramatically lower than those standard evaluations on the right. So, another unsolved problem. We saw a glimmer of progress, but I think now this is the new thing to beat, and in fact we're going to hear a bit more about these kinds of evaluations a bit later.

So finally, by way of wrapping up, I don't want to take too much time, but I thought I could connect this really nicely with our coursework. Here's the high-level summary. We cover these topics on the left. It's by no means an exhaustive list of topics for the field, but I think it's a good sample, in the sense that it gives you a picture of a lot of different tasks, structures, models, techniques, and metrics, so that if you're good at this sample of topics, you're really empowered to take on anything that's happening in the field of NLU right now. Part of the reason I feel confident saying that is that the course is very hands-on. We have four assignments, each paired with a bake-off. I'm going to tell you about the bake-offs in a second, but each one of them is meant to be a simulation of a small, original final project. And that leads into the final projects, which come as a sequence of things that help you incrementally build up from a literature review, through an experimental protocol, and then finally to a final paper, so that you've, with the help of a teaching team mentor,
slowly built to something that's an original contribution in the field.

For those assignments and bake-offs, let me give you a glimpse of what the rhythm is like. Each assignment culminates in a bake-off, which is an informal competition in which you enter an original model. This is like the shared evaluation tasks that you see a lot throughout the field. The assignments ask you to build up some baseline systems to inform your own model design, and to build that original model. Then you enter that original model into the system. We have held-out test sets for you so that we really get a look at how good your systems are, and the teams that win get some extra credit. It's also important that the teaching team assembles all of these entries and reflects insights from them back to the entire group, so that we can collectively learn what worked and what didn't for these problems. The rationale behind all this, of course, is that each one of these should exemplify best practices for doing NLU and help make you an expert practitioner.

I want to connect back with those earlier themes, and I think we have one bake-off that does that in a really exciting way. This is a micro version of the NLI task: we do word-level entailment,
where the training examples are pairs like (turtle, animal), and a 1 means that they're in the entailment relation; (turtle, desk) is in the 0 relation. So it's a small one-word version of that full NLI problem. The reason it connects with what I was just covering is that we try to make this a bit adversarial: the train and test sets have disjoint vocabularies. So, for example, if you do see "turtle" in the train set, you won't find it anywhere in the pairs that are in the test set examples. The idea is to really push systems to make sure that they are learning something that is actually generalizable information about the lexicon, as opposed to just benefiting from idiosyncrasies in the patterns of the data set that happen to exist. In that way, I think we can push ourselves to develop systems that really have robust lexical knowledge embedded in them, and these test set evaluations give you a glimpse of how much of that you've actually achieved.

And just to emphasize again how hands-on this all is, this is a full system for that word-level entailment problem. We don't need to dive into the details of the code. I'll just say that you make essentially three decisions here. Under glove_vec, this is your choice of how to represent the individual words. In this case I'm using GloVe pre-trained representations. GloVe is a model we cover in some detail at the start of the course, but of course you are free to make use of any representation scheme you want for these words. You should also decide how to represent the pairs. Here I've chosen to just concatenate the two representations, but lots of things are possible. And then finally, I'd say the most interesting and exciting part falls under this network here. This is a bit of PyTorch code.
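Two of the ideas above, the disjoint-vocabulary split and the concatenation of word representations, can be sketched in plain Python. This is only an illustration under stated assumptions, not the course code: `split_disjoint_vocab` and `concat_features` are hypothetical names, the word pairs beyond (turtle, animal) and (turtle, desk) are invented, and the two-dimensional vectors are toy stand-ins for pre-trained GloVe vectors.

```python
import random


def split_disjoint_vocab(pairs, test_fraction=0.3, seed=0):
    """Split ((word1, word2), label) pairs so train and test share no words.

    A random subset of the vocabulary is reserved for test. Pairs that mix
    reserved and unreserved words are dropped, since keeping them would leak
    test vocabulary into training.
    """
    vocab = sorted({w for (w1, w2), _ in pairs for w in (w1, w2)})
    rng = random.Random(seed)
    rng.shuffle(vocab)
    n_test = max(1, int(len(vocab) * test_fraction))
    test_vocab = set(vocab[:n_test])
    train, test = [], []
    for (w1, w2), label in pairs:
        if w1 in test_vocab and w2 in test_vocab:
            test.append(((w1, w2), label))
        elif w1 not in test_vocab and w2 not in test_vocab:
            train.append(((w1, w2), label))
    return train, test


def concat_features(pair, lookup):
    """Represent a word pair by concatenating the two word vectors."""
    w1, w2 = pair
    return lookup[w1] + lookup[w2]  # list concatenation, not addition


# Toy data: 1 = entailment, 0 = no entailment.
pairs = [
    (("turtle", "animal"), 1),
    (("turtle", "desk"), 0),
    (("puppy", "dog"), 1),
    (("chair", "furniture"), 1),
    (("dog", "chair"), 0),
]
train, test = split_disjoint_vocab(pairs)

# Toy 2-dimensional vectors standing in for pre-trained representations.
toy_vectors = {"turtle": [1.0, 2.0], "animal": [3.0, 4.0]}
print(concat_features(("turtle", "animal"), toy_vectors))
```

Dropping the mixed pairs costs some data, which is one design choice among several; the point is only that no word seen in training can reappear at test time.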
That PyTorch code builds on code that we release as part of the course and that you'll make a lot of use of. The reason that's important is that that pre-built code really frees you up to think creatively about the problem at hand. You can see that here: this is a complete working system in cell 4. Primarily, what you do for this assignment and bake-off is work on this build_graph method, where you're essentially building the computation graph for a deep neural network model. Everything else about the optimization process is handled by the base classes, which are already part of the course repository. And that's important, because hidden away under the base keyword arguments here, this model actually has lots of different settings that you can explore for different optimization choices and other things, so that you can really experience in a hands-on way how best to optimize these modern deep learning models that you're building.

I don't have time for it, but I did just want to mention this other bake-off. There are four in all, but this is a really different one from the previous ones. If I had more time with you, I think the other theme that I would emphasize would be the importance of grounding natural language outside of language, in actual physical scenes and things like that. The way we explore that in a tractable way is by doing natural language generation, where we're trying to describe color patches in context. This is another modeling direction, and it does bring in non-linguistic information in the form of these color patches. I think it's a really interesting problem. It connects with interesting topics in linguistics, and it's a chance for you to explore another prominent class of models, which are these encoder-decoder models that process sequences. On the left, this is not a linguistic sequence. These are color patches; they could be images.
And on the right, of course, you're producing a natural language description.

But in the interest of time, I think I'll just go quickly to this wrap-up. As I said before, I really believe that this is the most exciting moment ever in history for doing NLU.
It's not like you're joining the field just at the moment when all the hard tasks have been solved. I think, rather, we now have a good foundation for the really exciting breakthroughs, which are in the future, and I think adversarial testing really makes that clear. This course gives you hands-on experience with a wide range of challenging NLU problems. And when you come to do your original research, you'll have a mentor from the teaching team to guide you through not only the project work but also all those assignments and bake-offs and so forth. And the example of success there is that some of these things have turned into really exciting and mature papers, some of them even published, and you're going to actually hear about some of that really mature and interesting work in just a moment. So the central goal of all of this, of course, is to make you the best, that is, most insightful and responsible, NLU researcher and practitioner, whatever you decide to do next with all of this new material. So I'll wrap up there. Thank you very much.

Thank you, Chris. This was extremely interesting, thank you. If you have any questions for Chris, feel free to post them in the Q&A box. We will be moving on right now to allow enough time for students to present their projects. If you are interested in learning more about the course Chris was mentioning, or other courses SCPD is offering in the Artificial Intelligence program, like machine learning or deep learning, you can check the links you will see on your platform. But now we will move on. We will hear from two project teams. They took the Natural Language Understanding course and developed great projects that they will now briefly present. The first speaker will be Gokan Chagrici. So Gokan, you can go ahead.

Yes, hi everyone. My name is Gokan, and my presentation is about the effect of ensembling on an NLI benchmark. Our focus keyword is basically
adversaries. As you probably know, leaderboards are created for some challenging problems, and each one of them is basically using a frozen corpus. Then there are practitioners and researchers who are trying to beat the best scores, and this is how the life cycle goes. But, as you might imagine, that does not necessarily mean that you are getting the best kind of model, one that can generalize to new areas, because of this incapability of generalizing. The idea is that these models can take some shortcuts when you have a fixed training and test set. Here is one of the latest papers about incorporating this adversarial idea. As Professor Potts mentioned, it uses different rounds. So there is round one, which basically creates
a training set and a test set, and it releases a state-of-the-art model. Then, based on the weaknesses of that model, a new training and test set is created, taking those weaknesses into account, and this moves on. So it is used to challenge the model's capability and to increase its generalization capability.

Before we move on to the next slide, I would like to mention something. The question is: is it even easy for humans, for an NLI task? As you see, there are very simple texts, single-sentence premises, and very simple hypotheses based on these premises, and the judgments are made by humans. You'll see that for the second example and the last example, even humans cannot agree on the correct label. So if this is the case for humans, then how are we going to approach this problem with the machines themselves? Well, as we said, we will challenge a model with progressively harder tasks.

In my project, I took the dataset from the paper that I mentioned, and I applied several transformer-based state-of-the-art models. You'll see three different models, BERT, RoBERTa, and XLNet, and there are two variants of each: one is the base one and one is the large one. For the outputs from y1 to y6, you see the outputs per model, and for y7 there is a strategy for ensembling these models, to see if ensembling is going to be a cure for us. On the right, I wanted to give a feeling for the dataset. You see that this dataset contains very complex and long sentences, and there are a lot of named entities, a lot of relationships and references. For these two examples, none of the top three best-performing models could come up with the right answer. So again we see that the task is very hard. And now for the question that I mentioned.
Let's see if ensembling is the cure here. Well, these are the results for the models in isolation. We can look at just the F1 score, because it is one of the scores the community uses a lot for these problems. Even though F1 scores of more than 90 percent have been achieved for SNLI and its variants, here we see that we couldn't even achieve a 50 percent F1 score. And then we applied ensembling.
Yes, there is some improvement: we could barely reach 51 percent for the F1 score. But again, it is far from a reasonable kind of success. So here we see that just ensembling different models is not going to help with something that is so hard for the models in isolation. People mostly use ensembling to improve something that the individual models already do reasonably well.

So the next reasonable question to ask is: why are we still far away from a very nice solution? Here is a list. Model architecture can be an issue, but personally I don't think it is one of the most critical ones. The size of the training data is as important here as in other areas: if you have high-quality training and test data, then you have a much better chance of creating a satisfactory model. But here, even creating training data is very expensive because, as we said, even humans have trouble agreeing on the label for a hypothesis and its premise. So it needs a lot of effort, time, and money, but yes, it is very important. The last one is very interesting. If you think of a child, that child's interaction with the environment plays a very important role, as well as reading text from books and trying to analyze it. That child is basically experimenting with the external world all the time, creating new hypotheses and testing them, then creating another hypothesis and testing it again. Machines are lacking this ability. Maybe we are trying something that cannot be done without these machines living among us.

Having said that, I would like to conclude with my experience.
In terms of the project and the class, yes, it is very demanding, but something should be very demanding so as to give you a better insight into the topic, and it should challenge you so that you feel the need to learn more. That is what is mentioned in the rewarding part: you gain the discipline of analyzing papers, searching and comparing results, and then trying to repeat those results or even go beyond them. And last but not least, the guiding part. It does not matter what kind of questions you have; there is a very strong community from Stanford helping. I didn't see any question that was not answered by those expert people, including Professor Potts. I'm really happy to be here. Thanks for your time, and see you soon.

Thank you, Gokan. Thank you for your time and for presenting your project to everybody. Now we can move on to the other project, which was developed by Mohan Rangarajan, Wupang, and Ethan Guin. Mohan will now tell you a little bit more about it.

Thank you, Petra. Hi everyone, this is Mohan. It's a privilege to be presenting here on behalf of my team, Ethan and Wu, and we're certainly looking forward to this. This quote by Mahatma Gandhi captures the essence of how we approached both the course and the project. We had a learning mindset, and we said we were going to learn whatever the cost. When we approached the project work, we wanted to do something in question answering using knowledge graphs. Like many people, and as Professor Potts mentioned earlier, we were enamored with BERT, transformers, the BERT variants, and the whole notion of contextual embedding.
We were curious to see how contextual embedding would improve accuracy, so our hypothesis was really about using a knowledge graph and seeing if contextual embedding would improve the accuracy. You may ask: how did you get from this broad topic to a specific, focused hypothesis? Here we really have to talk about the structured approach that Professor Potts mentioned: doing the literature review, then the experimental protocol, and then going on to the project. This really helped us narrow down to a specific, focused hypothesis. On the left-hand side, what you see is the broad area that we initially wanted to look at. And one thing that we talked about came up as we did the literature review.
We realized that we had to narrow down our focus. Then we sought guidance from our course facilitator and, based on that direction, and also looking at how much compute resources and time we had available, we narrowed down to the natural language back-end portion of this topic. Even within that, we chose the SimpleQuestions dataset, and for the knowledge graph we chose an embedded approach to represent the knowledge graph, which was based on Freebase. I would be remiss if I didn't point out here that we had complete freedom in choosing the topic and choosing our hypothesis, and so the outcome really was not a concern, because the evaluation was going to be on the methodology and the rigor that we were going to have. That freed us from the pressure that comes with "our hypothesis should actually improve the results" and let us focus more on the methodology, the concurrence, and the results there.

Here's our experiment. We had three fundamental tasks in the project: one was entity learning, another was predicate learning, and the third was entity detection. The entity learning and predicate learning models were primarily used for predicting the entity and the predicate in the simple question, and the entity detection model was used for collecting a set of tokens that would represent entity names. For the knowledge graph, we used Freebase, and we used the embedded representation that was needed for the entity and the predicate. So the idea here is that we present the tokens from entity detection to the knowledge graph and retrieve a set of candidate facts, and then the tail entity associated with those facts yields our possible answer. So we looked for the closest fact, since we were using embedding.
The fact that was closest to the embedding representation of the entity and the predicate would give the answer to the question. That was, in essence, what our model and our experiment were about. Here is a little bit of detail on the model itself. The entity and predicate learning models were very similar, but the entity detection model was slightly different, in the sense that each token had to be assessed as to whether it would be a potential
entity name or not.

Looking at the results and analysis, we should say that we were quite pleased that there was a marginal improvement in the model that we were using as compared to the baseline model. I use the word marginal, but as you can see, it was an improvement. Since we were using knowledge graphs, we also wanted to compare against the results without an embedded representation of the knowledge graph. It was interesting to note from the results that, for the individual tasks, that is, entity learning, predicate learning, and entity detection, the scores for using the knowledge graph directly, without embedding, were better. But the interesting part is that when we did the evaluation on the test set, we noticed that the embedding-based approach had better results than the approach without embedding. And of course the accuracy was marginally higher than for the baseline models that we were using. While we were pleased with the marginal improvement, we also noticed that the execution time of our models was much slower. Our fastest variant was twice as slow as the original baseline model. So we were then puzzled as to whether contextual embedding really helped with this problem or not. We concluded that, since training was slow and the improvements in accuracy were marginal, it was less compelling to use a fine-tuned BERT model for simple question answering applications in the real world. That is a big generalization to make, but it was our conclusion based on the SimpleQuestions
dataset. The other thing to note here is that we chose accuracy as the measure because this was a simple question answering solution: the answer is either correct or incorrect.

So what did we learn from this exercise? I think it's important to understand the dataset that you're using for your testing. When we started the project with our hypothesis, we felt that we were going to have at least 90 percent accuracy, considering all the wonderful things BERT and the BERT variants have done. Then we were a bit deflated when we realized we were not performing as well, and we did a little more research, only to realize that there is a cap of 83.4 percent accuracy as far as the SimpleQuestions dataset is concerned.
This is because there is a high prevalence of unanswerable questions, and some questions don't have any ground truth in the knowledge graph. The other part is that, apart from other things, we now have a deep appreciation for the level of effort required to hypothesize, research, experiment, and author a paper that conforms to ACL standards. I'd like to quote Isaac Newton here: "If I have seen further, it is by standing on the shoulders of giants." We really have a lot of people to thank: Xiao Huang and team, whose work served as a launching pad for our work here; Salman Mohammed and team, whose work we used as a baseline for comparing our model; and the Hugging Face company, whose transformer-based models allowed us to compare different variants. Professor Potts, thank you so much. We learned a ton of new things in this course, and your active participation in the Slack channel and your enthusiasm were a welcome and pleasant surprise. Our course facilitator, Pradeep Cheema, and the other course facilitators were always there to help and encourage us with our work. And lastly, Steve, we would be remiss if we didn't mention you for helping us throughout the course. Thank you so much.

Thank you, Mohan. Thank you for presenting the project; I think it's very exciting. Thank you also to Ethan and Wu, who are here with us but not visible at this moment. I think we can move on to the Q&A session. We got some interesting questions from the audience, so thank you, everybody, for your questions. I will now ask Chris the first question: are the adversarial examples generated by humans exclusively, or by computational models such as generative adversarial networks?

Yeah, that's a great question. You really see a full spectrum of approaches.
In some cases, humans have just written new adversarial cases, as you saw in Gokan's project with the Adversarial NLI dataset. Sometimes we can do sort of quasi-automatic things, like with WordNet, where we just do some lexical substitutions and can assume that the meaning we want is preserved or changed in a systematic way. But you can also have models in the loop acting as adversaries. There have been some applications of generative adversarial networks in NLU to make sure these models are robust. The picture seems more mixed than what you get from vision, where I know GANs have really been a powerful force for good in making models more robust, so I think there's some space for innovation there. But the general picture would be that we can think really flexibly and creatively about how to create those adversarial tests, and that creating one could have its own modeling interest, in addition to serving as a kind of new way to evaluate models. So, lots of space for innovation there.

Okay, great. I hope that answered the question. The next question is about ELECTRA: is an ELECTRA-based transformer more robust to adversarial examples than MLM-based transformers such as BERT?

Oh, interesting open question. ELECTRA's primary motivation, I think, is to make more efficient use of its data than BERT does. We have a nice little lecture on ELECTRA, how it works, and why it's successful. But in the paper, if I remember correctly, there isn't an evaluation that you would call adversarial. They mostly just post better numbers on the standard datasets and explore a really wide range of variations on the ELECTRA model, both in how it deals with data and in how it's structured. But again, I love that question. It's just so interesting to ask, for a model that seems to be a step forward
here, not only in terms of accuracy but also in terms of efficient use of data and compute resources: what is it doing on these very human but ultimately very challenging adversarial datasets? Great question to address. And it's especially fruitful if you can address that question and then maybe think about how the answer could inform an improvement to a model like ELECTRA, because then you have that full cycle of the adversary helping us do innovative things with the models we're building.
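As a rough illustration of the evaluation question raised here (how a model that does well on a standard test set behaves on adversarial variants of it), the following is a minimal sketch and not anything taken from the talk or the papers mentioned. It assumes a tiny hand-written synonym table as a stand-in for a lexical resource like WordNet, three made-up NLI examples, and a deliberately naive word-overlap classifier. All names (`SYNONYMS`, `substitute`, `make_adversarial`, `overlap_model`) are hypothetical, and the sketch assumes that swapping a word for a synonym leaves the gold label unchanged.

```python
from typing import Callable, Dict, List, Tuple

# (premise, hypothesis, gold label)
Example = Tuple[str, str, str]

# Hypothetical hand-written synonym table, standing in for WordNet lookups.
SYNONYMS: Dict[str, str] = {"couch": "sofa", "kid": "child", "big": "large"}


def substitute(sentence: str, table: Dict[str, str]) -> str:
    """Replace every token that has an entry in the table with its synonym."""
    return " ".join(table.get(tok, tok) for tok in sentence.split())


def make_adversarial(examples: List[Example], table: Dict[str, str]) -> List[Example]:
    """Rewrite only the hypothesis; the label is assumed to be preserved."""
    return [(p, substitute(h, table), y) for p, h, y in examples]


def accuracy(model: Callable[[str, str], str], examples: List[Example]) -> float:
    """Fraction of examples where the model's prediction matches the gold label."""
    correct = sum(model(p, h) == y for p, h, y in examples)
    return correct / len(examples)


def overlap_model(premise: str, hypothesis: str) -> str:
    """Toy shortcut model: 'entailment' iff every hypothesis word occurs in the premise."""
    premise_tokens = set(premise.split())
    if all(tok in premise_tokens for tok in hypothesis.split()):
        return "entailment"
    return "neutral"


# Made-up examples for illustration only.
standard: List[Example] = [
    ("a kid sits on a big couch", "a kid sits on a couch", "entailment"),
    ("a dog runs in a park", "a dog runs", "entailment"),
    ("a man reads a book", "a man reads a newspaper", "neutral"),
]
adversarial = make_adversarial(standard, SYNONYMS)

if __name__ == "__main__":
    print("standard accuracy:   ", accuracy(overlap_model, standard))
    print("adversarial accuracy:", accuracy(overlap_model, adversarial))
```

On these toy examples the overlap model gets everything right on the original set and loses accuracy once synonyms are substituted, which is the kind of gap adversarial testing is meant to expose. A real study would swap in an actual trained model and a real lexical resource or human-written adversarial set.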
Okay, great. The next question: it seems that there is no adversarial training in this project, but only adversarial testing. Would adversarial training fit into this picture for NLU, and how would it work?

For sure. Actually, Adversarial NLI, the dataset that Gokan talked about, is large enough that you can use it for training in addition to assessment, and there are a few other datasets like that. Very few of them were created with the kind of full human-in-the-loop process that you saw with Adversarial NLI; some of them use more automatic, model-based means of creating datasets that are large enough for training. But that's on the horizon, and in fact one of the visionary statements that the Adversarial NLI paper makes is that we should move into a mode of continually retraining and evaluating our models on datasets that were created adversarially, and that this kind of ongoing process, which is a more fundamental change to how we do system development, is a way to get to even more robust systems. So my quick answer would be that we should be looking for ways to scale the adversarial testing paradigm so that we can have training sets as well.

So what is the current state of research and applications of deep RL in NLU, apart from transformers? If there is research in it, how promising do you find it?

Interesting question. Overall, I would say the vision of reinforcement learning really resonates with me. If you think about your life as an agent in the world, trying to learn things and experience the environment, you don't get direct reward signals; you only get very indirect feedback. You have credit assignment problems, you don't know how to update your own parameters, and so forth, so you have to make a lot of guesses, and it's a kind of chaotic process. But nonetheless.
That's the world we live in, and we all learn effectively, so something like that set of techniques has to be brought more fully into the field. I think we're seeing really exciting things in the area of combining deep RL with grounded language understanding, and certainly with dialogue, and I'm sure there are other areas, so it's a great space to explore. The models tend to be hard to optimize and hard to understand, but that's part of the journey toward making them really effective for the problems we look at. I'll give a quick plug: I did some work that I thought was really exciting that tried to use reinforcement learning together with transformer-like models to induce more modularity, so that the systems we developed, instead of having very diffuse solutions, would do things that looked more like encompassing lexical capabilities and specific functionality.
Those are called recursive routing networks. Again, hard to tune, but obviously an inspiring idea, and we should just keep pushing those techniques.

Maybe one very general question: where do you see the future of NLU? What will it look like in the next few years?

Oh, five years is starting to seem like a long time. After all, my short lecture showed you that in just a two-year span we had what looked like a real phase change on some hard problems, so five years feels like an eternity. So, just some predictions I could make. The idea behind contextual word representations, which is primarily how transformer-based architectures are used, is powerful, and it's going to last; it really has changed things. I think the vision there is that more chances to have contextual understanding, more chances to be embedded in a context, are going to be important. Grounded language problems are going to be more prominent, and I think that's going to lead to breakthroughs, because after all, human learners don't learn just from text; that's kind of an absurd idea. Human learners learn from the social environment and lots of inputs that they get in addition to language input, so to the extent that our systems can be multimodal like that, they're probably going to get better. I hope, as a personal thing, that we as a field do more to break free of the confines of always looking at English. I saw briefly in the Q&A that there was a question about whether or not the field just looks at English. The big datasets tend to be in English, but I think that's changing a little bit. For example, for the NLI problem we now have some good multilingual datasets, and that's important because English is not representative of the world's languages. If we look farther afield, that might lead us to new kinds of models that would count as, again, fundamental breakthroughs.
So that's going to be important. And then, now that our systems are more useful and more often deployed, we should all be thinking more holistically: not just "my system did well in some narrow evaluation," but what is it actually doing out there in the world when it interacts with real examples and real users? The fundamental thing is that we should be assessing those systems and making sure that they are having a positive impact, as opposed to causing some social process to go awry, or something like that. That's a kind of new responsibility that comes from the recent successes. So, a welcome challenge, but an important one.

Okay, great, thank you. We are almost at time, so I would like to spend the last minute thanking you for taking the time to take part in this webinar. Thank you also to Gokan, Ethan, Mohan, and Wu, who joined us today. I think it was really interesting and exciting that they joined us and talked about their projects; they brought the course materials to life. So thank you all for joining us. I hope you, the participants, enjoyed it. If you are interested in learning more about the courses in the AI professional certificate, feel free to check the links we are offering in the interface, or please feel free to contact us directly. So thank you, everybody, and stay safe and healthy.

Thanks, Petra. And thanks to the project teams. Those presentations were really great; I found them really inspiring.