NW-NLP 2018: Annotation Artifacts in Natural Language Inference Data
All right, welcome to the afternoon session. We have two speakers now; both talks are 20 minutes. The first one is Suchin Gururangan, with "Annotation Artifacts in Natural Language Inference Data."

All right, everyone, my name is Suchin Gururangan. I'm a graduate student at the University of Washington, and I'm going to present a project called "Annotation Artifacts in Natural Language Inference Data." This is joint work with Swabha Swayamdipta, Omer Levy, Roy Schwartz, Sam Bowman, and Noah Smith.

So, who here works with crowd-sourced annotations, or maybe works with datasets built on crowd-sourced annotations? Yeah. This is a really popular area of NLP, and it really drives a lot of modern NLP. We're usually given some text, we have a model, and we have some output. It's easy to forget that there are humans behind the text and data that we feed our models, and those annotators, because they are human, have their own biases. They have their own beliefs about the world, their own personal motives, their own cultural backgrounds, and this affects the distributions that we see in our data. There has been a lot of conversation about how biases affect downstream societal applications; a long-running theme of this talk is that biases also affect how we evaluate models and how we design our tasks.

So the story starts with the natural language inference task, which I think a lot of people here are familiar with. Basically, you're given two sentences, a premise and a hypothesis, and the goal is to determine whether the premise entails the hypothesis. There are three categories. For example, given the premise "The Northwest NLP workshop is being held at Microsoft headquarters," you
have an entailment example, "The workshop is in Redmond," because that's definitely true given the premise; a contradiction, "Amazon is hosting the workshop," because that can't be true; and a neutral example, "Satya Nadella is attending the workshop"; I don't know if he's here. These are relatively well-defined classes, and the task is really apt for annotation. We've had a lot of really great performance on this task over the past few years. The Stanford NLI (SNLI) and MultiNLI datasets were released in the past couple of years, and people have been getting scores in the 70s and 80s across these two datasets. So a newcomer to this particular area of NLP might see these scores and think there's not really much else to do. But I think the results of this paper show that there is a lot to do.

Most NLI models, especially the top-performing ones, represent both the premise and the hypothesis in some joint way and feed that joint representation to a classifier to predict those three labels. What we found, very interestingly, is that if you completely disregard the premise and only feed the hypothesis into the classifier to make the same prediction, you can do really well on this task. The classifier we used in this particular experiment was fastText, the off-the-shelf text categorization model from Facebook. It models text as bags of words and bigrams; it's very, very simple. We were able to get 67 percent accuracy on SNLI and about 55 percent accuracy on MultiNLI. This is kind of harrowing. It tells us that these datasets aren't really doing what they're supposed to be doing: we don't really need to learn entailment to perform about 2x above the majority-class baselines.
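The hypothesis-only setup can be sketched as follows. This is a minimal illustration, using scikit-learn's bag-of-words/bigram features and logistic regression as a stand-in for fastText; the six training triples are invented toy data, not examples from SNLI.

```python
# A minimal sketch of the hypothesis-only baseline. scikit-learn's
# bag-of-words/bigram features + logistic regression stand in for
# fastText here; the training examples below are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (premise, hypothesis, label) triples -- the premise is never used.
train = [
    ("Two dogs run through a field.", "Some animals are outdoors.", "entailment"),
    ("Two dogs run through a field.", "Nobody has a cat.", "contradiction"),
    ("Two dogs run through a field.", "Some puppies run to catch a stick.", "neutral"),
    ("A man plays guitar.", "A person is making music.", "entailment"),
    ("A man plays guitar.", "The man is not playing anything.", "contradiction"),
    ("A man plays guitar.", "The tallest man plays his favorite song.", "neutral"),
]
hypotheses = [h for _, h, _ in train]
labels = [y for _, _, y in train]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(hypotheses, labels)

# Artifacts like negation give the classifier a signal even though
# it never sees a premise.
pred = clf.predict(["The woman does not have any pets."])[0]
```

The point is only that nothing in this pipeline ever looks at a premise, yet surface cues in the hypothesis alone carry label signal.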
The project we embarked on was really to answer the question of why we're able to do so well on this task with just the hypothesis. We showed that elicited annotations induce lexical clues, which we call annotation artifacts, that reveal the entailment class. We showed that supervised models, especially the top-performing ones, exploit these annotation artifacts to make their predictions.
We also present the broader implications for model evaluation and task design. Even though this talk is centered on NLI data, hopefully anyone who's creating datasets with crowd-sourced annotators, or evaluating models in general, can benefit from these results.

The first argument in our paper is that there exist annotation artifacts in NLI hypotheses that heavily correlate with the entailment class. I'm not going to walk through exactly how we showed that, but I think it's prudent to think about how these datasets are generated in the first place. Essentially, SNLI and MultiNLI are generated from captions or source text from multiple genres, which are fed to annotators on Mechanical Turk, who are then asked to create three hypotheses: entailment, contradiction, and neutral. There's a real focus on scale here, from both the annotators' perspective and the task designer's perspective. The task designer wants as many examples as possible, so we can feed those data-hungry neural nets, and annotators want to maximize their reward by annotating as many examples as possible. SNLI and MultiNLI together comprise about 1 million high-agreement examples. This is really great for this area of NLP, because previous datasets either relied on expert annotators or relied on very specific theories of entailment that were very hard to scale.

In this paper, we showed that under this framework annotators try to maximize their efficiency in creating these hypotheses, and thus introduce heuristics and tricks to create the examples that our models are trained on. So I'm going to walk through some of the artifacts that we found. We quantitatively justify them in the paper, and there are a lot more artifacts shown there, so if you're interested, definitely
check them out.

Entailment artifacts include generalizations. We saw a lot of entailment hypotheses associated with premises like "Some men and boys are playing frisbee in a grassy area," where the entailment generalizes over the entities in the text: "People playing frisbee outdoors." We also see shortening, where specific words are removed from the premise; for example, here the words "shirt" and "green" are removed to create the entailment hypothesis.

Neutral artifacts include purpose clauses. For example, for the premise "Two dogs are running through a field," the neutral hypothesis that was created was "Some puppies are running to catch a stick": an ambiguous situation created by attaching a reason to the premise. Modifiers were also really popular, where people stuck adjectives or superlatives in front of entities in the premise to create the neutral hypothesis.

Contradictions have very, very consistent trends; negation was really popular. This is one of my favorite examples, because the premise is very intricate and complicated, with a lot of stuff going on, but the contradiction they chose was "Nobody has a cat," which is really, really simple. It's a valid example of contradiction, but it's something that models can really easily pick up on. Cats were also really popular in the contradictions. The Flickr30k dataset, which the SNLI data was derived from, contains a lot of dogs, and so people thought cats were contrary to dogs. I sort of disagree with that, having both a cat and a dog, but this is the kind of human belief and bias that gets imbued into the dataset based on people's prior experience.

So, we've shown that artifacts exist. The next argument of the paper is that supervised models, especially the top-performing ones, exploit them to make their predictions. The way we showed this is that we revisited
the original model that motivated all this work: the hypothesis-only classifier that was able to do really well on SNLI and MultiNLI. We fed in the test data for both SNLI and MultiNLI, and if the hypothesis-only classifier was able to get an example right by just using the hypothesis, we considered that example easy; if it wasn't able to, we considered it hard. In this way we partitioned the test data into hard and easy subsets. We also revisited the models that I showed earlier that do really well on SNLI and MultiNLI; we tested these models on the easy and hard subsets and saw how they did. We basically showed that easy examples drive NLI performance.
You can see in this graph the accuracy of the same models on the full data: about 87 percent. If you look at just the hard examples, the ones the hypothesis-only classifier was not able to get correct, the accuracy of those models drops pretty significantly, 10 to 20 points. And if you look at the easy examples, those models do really, really well. This shows that the majority of the examples these models get right are the ones that have annotation artifacts. We get very similar results on MultiNLI; check out the paper to learn more.

Given these results, I have a few closing thoughts. You can think of annotation artifacts as sampling bias. It's crucial to note that none of these examples are invalid forms of entailment; they're all correct. They're just dominating the dataset; these very easy lexical heuristics are dominating the dataset. So it's possible that a good way to address this problem is to normalize the distribution of strategies that are used to create entailment hypotheses: not only these lexical heuristics, but also things like world knowledge, or other forms of entailment that don't readily exist in the dataset.

Another really important thing to realize is that the reason this is a problem is that a lot of these models are very lexically driven, and they're very susceptible to these artifacts. What we want is a very general representation of entailment that is invariant to these artifacts, but what we see is that these models are making lexical associations between words and surface patterns, as Yasmin was mentioning, and the labels. So we really have to rethink how we model entailment, to try to create representations that are invariant to these problems.

Also, this is not just about NLI.
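The easy/hard split described above can be sketched as follows. This is a minimal illustration; `toy_model` is an invented stand-in for the trained hypothesis-only classifier, and the test examples are made up.

```python
# Sketch of the easy/hard partition: an example is "easy" if a
# hypothesis-only classifier already labels it correctly, "hard"
# otherwise. `hyp_only_predict` stands in for the trained
# fastText-style model; no premise is ever consulted.
def partition(examples, hyp_only_predict):
    easy, hard = [], []
    for ex in examples:
        bucket = easy if hyp_only_predict(ex["hypothesis"]) == ex["label"] else hard
        bucket.append(ex)
    return easy, hard

# Toy stand-in model: treat any negation cue as "contradiction".
def toy_model(hypothesis):
    words = hypothesis.lower().replace(".", "").split()
    return "contradiction" if ("not" in words or "nobody" in words) else "entailment"

test_set = [
    {"hypothesis": "Nobody has a cat.", "label": "contradiction"},
    {"hypothesis": "Some people are outdoors.", "label": "entailment"},
    {"hypothesis": "The man is not asleep.", "label": "entailment"},  # fooled -> hard
]
easy, hard = partition(test_set, toy_model)
```

Evaluating a full NLI model separately on `easy` and `hard` is exactly the comparison in the graph: the gap between the two subsets measures how much of its score comes from artifacts.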
I think this paper is one of a series of papers that have come out recently critiquing how datasets are created and how models can be easily broken by introducing adversarial examples. I think more work like this should be done; it's an important part of doing NLP that we're empirically driven, and that we're sure that what we're measuring and evaluating our models on is actually what they're learning.

Lastly, I think progress in NLI has been overestimated. There are a lot of open questions that result from this work, and they really focus on how we do better annotation. How can we direct annotators to avoid these simple heuristics when generating examples? Can we employ some sort of automated, real-time reward system that disincentivizes using simple heuristics? Are there alternatives to elicitation that can help us build entailment datasets? There are a ton more open questions, and I'd love to talk to people about them. We released the hard benchmarks, so if you're interested in natural language inference, I'd definitely encourage you to check them out and test your models on both the full benchmarks and the hard subsets, to see if you generalize across these artifacts. Thanks; here's the link to my paper.
Okay, we actually have lots of time for questions.

Q: Thanks. Are there interesting patterns in the hard examples, or are they more diverse?

A: We didn't include this in the paper, because it's a very hard problem, but when you look at the hard subsets, it's not that they don't contain artifacts; it's that they contain very different types of skew from the easy subset. For example, you might have a lot of examples in the hard subset where the word "not" and negation are associated with a different class, like neutral or entailment. So it's kind of an open question how we utilize quote-unquote "hard" training data in the best way, and whether there is a way to remove these artifacts. It's hard to tell.

Q: What I was thinking, specifically with the example of negation: that seems like one that would be kind of natural for contradiction, and maybe the reason people are making those sentences is that it's such a natural way to produce a contradiction. So in that way it seems like a natural bias that you wouldn't want to remove, right? Some of them, like the cats, are obviously a problem from the initial dataset, but some of them seem less so.

A: Yeah, a hundred percent. None of these examples are invalid forms of entailment; they're all valid. I think the key is that when we're evaluating our models, we have to make sure, if we are saying that we're making a lot of progress on NLI, what we are actually making progress on. Are we making progress on these easy lexical heuristics, or are we actually making progress on general entailment? So it's really a question of how we build our datasets so that we're evaluating our progress on these tasks in the best way possible.
Q: You define your easy examples as the ones that your basic model was able to classify correctly. But do you distinguish between the examples that are actually easy versus the ones that the model got right by chance?

A: We don't make that distinction. There's actually some follow-up work my co-authors were starting on creating better partitions of easy and hard subsets, for example by using an adversarial model that is trained online to classify easy and hard examples while another model is trying to actually perform entailment.
I think this is really just a good first step toward building a better easy/hard split, but there's definitely a lot more work to be done to create those partitions more accurately.

Q: You said this paper was part of a series of papers. I'm curious: is there other research going on into artifacts in annotations, potentially for different problems but similar data sources?

A: I personally haven't seen much work in this exact area. There was actually a lot of parallel work done by other groups on the NLI datasets showing that not only are there biases from the heuristics that people use, there are also biases in the content: people introduce stereotypes into the data. I think those are separate problems. I'd say this is definitely nascent work, and there needs to be more work done in this area too. Okay, thanks. One more question.

Q: More of a statement than a question. We rely on Mechanical Turk for a lot of language data, and one strategy we've had is to get users to filter themselves out: you ask them questions they can get wrong, or require that their answers be a certain length, as with your negation example, for instance. There might be fruitful research strategies there. Any thoughts on that?

A: Yeah. I feel like nowadays there's a lot of friction in trying to create that sort of real-time interaction with annotators when they're creating these really huge datasets. A lot of times you're not able to identify these problems until after the fact, and then it's too costly to go back and recreate the dataset. So I think some real-time way to review
things in batches, examples in batches, or something like you propose, would be really influential.
It's an expensive problem, yeah.

Okay, let's thank the speaker.

So, Yejin sort of teased a lot of what I'm going to talk about in this talk, which is pretty much neural networks as mouths without brains. Also, you know, listeners without brains; it sort of goes both ways. And I'm going to talk about how we can give them a brain.

Yejin talked a lot about recipes, and I'm going to do so as well. I usually love to give this talk right before lunch, because it's nice to see people squirm. But let's start by having you think about making an apple pie. When you bake a pie, there's a process that you follow in order to compose your raw ingredients into a final product. Intermediate steps, such as slicing apples, cooking them, placing them in a pie sheet, and then baking the pie, all have to be correctly navigated, in the right order and with the right intermediate constructs, if you're going to make something that's both edible and tasty. And understanding that process requires many common-sense causal inferences to be made and remembered later on. First, once we slice the apples, we have to understand some notion that their shape has changed. When we sauté them for a bit, we now know that their cookedness has changed, and that their temperature has also likely changed to be warm. And despite having caused these new changes, we also still have to remember that they've been cut from before; that shape is still the same. We're not going back to having a raw apple that's had a completely new set of things done to it. Then, when we add them to the pie sheet, we have to understand that there's actually a new entity that's been created from our sliced and cooked apples and the dough that we've now combined them with. And finally, once we throw it into the oven, we have to know that that combination of apples and pie sheet is now going to be cooked.
If we think about how current intelligent agents might parse this information: the current ubiquitous approach, using LSTMs, is to encode all the words into some type of sentence representation, usually a single vector or a group of vectors, and you just sort of hope that that information can be queried from that representation at a later step. That's not necessarily what happens, though. While LSTMs are powerful representational tools, they don't do much more than encode surface word patterns, and they generally haven't been shown to capture the more complex world dynamics that might be necessary for reasoning or some other downstream task that you want. Some more complex models, like memory networks or entity networks, do allow for more fine-grained state representations, because they generally split the encoded context across memory cells, which might let you retain processed information better, or encode certain parts of your context in different ways. But they still don't capture any information beyond what's explicitly stated in the text; you're just learning text patterns and learning how to represent them in a particular way. So the problem with these approaches is that the models we're building aren't ones that learn context representations that track the state of the world with all of the information that's needed to perform downstream
tasks. Most neural state representations encode only surface patterns between tokens, and they generally devolve into encoding the simplest pattern they can in order to solve the task they're trained on. So what might be a strong enough representation for one application might not be good for the one right after, and the reasoning you use in one application might not be useful for what happens next in another. What we really want, ultimately, is for the representations we learn to capture some underlying dynamic about the world, and not just a dynamic of the tasks we train on.

This is why we decided to approach text understanding from a different paradigm. Understanding the state of the world means understanding that the world is entity-centric, and it involves identifying the state changes that those entities experience. Many of these state changes, however, might be knowledge-based as opposed to text-based: instead of being explicitly mentioned in the text, there's an expectation of prior common-sense knowledge that your model actually needs to have.

In this work, more specifically, we address the following challenges. First, we design a simulator that's entity-aware and that focuses on capturing information that describes changes to entities. After reading any sentence in a recipe, we choose an entity, apply a transformation to it, and then update a running representation of its state; later, we should be able to predict attributes of this state based on an embedding that we're tracking and storing for it. Second, we integrate common-sense knowledge about physical state changes into the neural network by adding knowledge about action causality as a structural bias.
We're going to do this by allowing a predefined set of action embeddings to cause changes to our entity states. Raw text is actually only going to be used to select entities and select actions; the actual simulation of the events described in the text is done by applying these learned action functions to the entity state embeddings. So the ultimate goal is really to learn a set of good action functions that can consistently induce the correct state changes in entities based on the text being read.

We do this in a two-step process. First, in order to develop a neural architecture that can learn the effects that actions have on entities, we crowdsource a set of action-state mappings. Then we see how we can use this crowdsourced information to
learn the parameters of a neural network that is structured to capture the inductive bias of how actions effect state changes in entities.

The crowdsourced set of action-state mappings isn't particularly exciting, but: we extracted a set of 384 verbs from the cooking domain that are mentioned fairly frequently, and we manually compiled a set of six state changes that actions could entail: location, composition, cleanliness, cookedness, temperature, and shape. By using the state changes as supervision, we can learn the transformations that actions should induce in entities. So for each of the 384 actions, we crowdsource the set of state changes that that action tends to induce in entities. For an action like "cut," we generally expect to see a change in shape; for an action like "boil," we see a change in cookedness and in temperature, so you can have multiple state changes for the same action. "Rinse" involves a change in cleanliness, and stirring might involve a change in composition.

Now that we have this action-state mapping knowledge, we come up with this model, which we call the Neural Process Network, because it models the processes that are described in text. It's a new neural network that decomposes the state tracking task into one of tracking actions and entities, and learning the transformations that those actions induce in the entities. To do this, the model keeps track of a set of state embeddings for the entities that it can change throughout the process, and its own set of action functions that can actually change those entities, one of which corresponds to each of the actions that we crowdsourced earlier.

Let me go over what the model tries to accomplish at a high level. In the interest of time, I'm not going to go into too many details, because I really want to get to the results.
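The action-state mapping amounts to a lookup from a verb to the set of state types it tends to change. A tiny sketch, with the four entries taken from the examples above (the full mapping covers 384 actions):

```python
# A small slice of the crowdsourced action -> state-change mapping
# over the six state types: location, composition, cleanliness,
# cookedness, temperature, shape. Entries below are the examples
# from the talk; the real mapping covers 384 cooking actions.
ACTION_STATE_CHANGES = {
    "cut":   {"shape"},
    "boil":  {"cookedness", "temperature"},   # multiple changes per action
    "rinse": {"cleanliness"},
    "stir":  {"composition"},
}

def induced_changes(action):
    """State changes an action is expected to induce on its entities."""
    return ACTION_STATE_CHANGES.get(action, set())
```

This mapping is what supervises the state-change classifiers during training: when a sentence uses "boil," the model is pushed to predict changes in cookedness and temperature for the selected entities.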
But I'm actually going to focus in on a couple of parts that I think are super interesting. The first thing that happens is fairly typical: the model reads in a sentence using some type of recurrent neural network, such as a GRU, and builds a sentence representation for it. I'm going to use the running example "cook the beef in the pan" to highlight a couple of things. Using the encoded representation from the encoder, the model selects the entities that are going to be changed by the events in the sentence, using an attention mechanism that computes a binary choice over each individual entity; this allows us to choose multiple entities to be changed in a particular sentence. For "cook the beef in the pan," it would choose the beef entity. Then the model selects the actions that it thinks are changing the entities it selected, by computing an attention over the action functions that were initialized at the start. Here it might select an action like "cook"; it might also select an action like "put," because it recognizes that "pan" is usually associated with a location change. Then, inside the applicator function, it computes a bilinear projection between this action embedding and this entity embedding, and comes up with a new state embedding that it thinks reflects how the selected entity has been changed. In essence, this is the step that understands that some cooking action has been applied to the beef. From that, the model should be able to tell us what about this entity has changed, using a set of classifiers, one for each of the state change types we're tracking. In this case, the beef is now hot, it's now cooked, and it's now in the pan; meanwhile, there hasn't really been a change in cleanliness at all.
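The applicator step can be sketched as follows. This is a minimal illustration, not the paper's exact parameterization: the dimensions, random weights, and the choice of tanh nonlinearity are all illustrative assumptions; the one grounded idea is that a bilinear function of the action and entity embeddings produces the new state embedding.

```python
# Sketch of the applicator: a bilinear projection combines a selected
# action embedding with an entity state embedding to produce the new
# state embedding. Dimensions, random weights, and the tanh are
# illustrative assumptions, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # embedding size (illustrative)
W = rng.standard_normal((d, d, d))   # learned bilinear tensor

def apply_action(action_emb, entity_emb, W):
    # new_state[k] = action^T @ W[:, :, k] @ entity, then a nonlinearity
    new_state = np.einsum("i,ijk,j->k", action_emb, W, entity_emb)
    return np.tanh(new_state)

cook = rng.standard_normal(d)        # embedding for the "cook" action
beef = rng.standard_normal(d)        # current state embedding for "beef"
beef_new = apply_action(cook, beef, W)
```

The state-change classifiers mentioned above would then read attributes (hot, cooked, in the pan) off `beef_new`.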
It then updates this information in the entity state embeddings, so the entity states reflect the state changes that the entities have experienced.

Like I said, this involves a lot of neural machinery, and I'm not going to go into every single piece, but here are a couple of things I want to highlight. First, when the model predicts the state changes that an entity has experienced, the errors from predicting those state changes get backpropagated all the way to the action embeddings in the action selector, which allows us to learn representations of actions that are more likely to induce the correct state changes in the future. The second
point I want to highlight is that when we update the entity state representations in memory, we update each slot proportionally to its original attention when we selected that entity. This allows us to model interesting compositional effects when two entities are mixed, for example: because both slots might be selected at a hundred percent, we overwrite both of them with the same output embedding, which pretty much reflects that these two things are now the same entity, because they've been combined in a mixture. The last thing I want to point out is the idea of recurrent attention. In a sentence such as "cook the beef in the pan," it's pretty clear which entity should be selected: the beef. But if I change that sentence to "cook it in a pan," all of a sudden "it" could refer to pretty much any entity; it really depends on what has been seen earlier, because it's a coreferent. So what we do instead is allow the model to attend to its previous attention steps, so that it can re-select entities that have been selected prior: sort of an attention over attentions, in order to be able to select entities that were seen before.

There are a lot of details around datasets and training, but I encourage you to look at the paper for those; it's a very interesting paper, I wrote it. What we've done at a high level is build a simulator that captures the dynamics of actions being taken in the cooking domain as it reads recipe lines. So the underlying task is really to read a recipe and be able to predict the entities that are changed, and the state changes that they undergo, in each step.
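The recurrent attention idea above can be sketched numerically. All the numbers here are invented for illustration: the model learns how relevant each earlier step is to the current referring sentence, then mixes the earlier entity attentions accordingly.

```python
# Sketch of attention over attentions: for a referring sentence like
# "cook it in a pan", mix the entity attentions from earlier steps,
# weighted by how relevant each earlier step is. All numbers below
# are invented for illustration.
import numpy as np

# Attention over entities [beef, pan, oil] from two prior steps.
prev_attentions = np.array([
    [0.9, 0.1, 0.0],   # step 1 mostly selected "beef"
    [0.6, 0.3, 0.1],   # step 2
])

def recurrent_attention(prev_attentions, step_scores):
    """Combine prior entity attentions, softmax-weighted by the
    (learned) relevance score of each earlier step."""
    w = np.exp(step_scores) / np.exp(step_scores).sum()   # softmax
    return w @ prev_attentions

# Suppose the model scores step 1 as the most relevant antecedent.
attn = recurrent_attention(prev_attentions, np.array([2.0, 0.0]))
```

Because each row is itself a distribution over entities, the convex mixture is again a distribution, and "it" inherits its referent (here, the beef) from the earlier steps.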
Obviously, this leads to a natural task of state change prediction. We use the accuracy and F1 of predicting the correct state changes, and we weight them by the accuracy of predicting the correct entities. We're better than competitive neural baselines such as a GRU, which we would expect, as well as the Recurrent Entity Network, which is one of the best models on the bAbI machine comprehension datasets. When we look at how well we choose entities, we see where this improvement actually comes from. In the leftmost column, we have a much better F1 score for predicting the correct entities at every step. This is not only because of how well we predict raw entities. We do do a bit better on recall of predicting the raw ingredients compared to the two baselines (raw ingredients would be something like predicting "the tomatoes" in "cut the tomatoes"), and even on things like coreference or elided arguments. But the big difference is really how we do in predicting composed entities. If you think about something like "combine the water and flour in a bowl," and the next step is "stir the mixture until a dough forms," it's really tough to know that "the dough" refers to this water and this flour. We pretty much double the performance these baselines manage on predicting these types of compositional entities, which helps us do a better job of tracking the state of entities overall.

Just looking at examples: "melt butter in a medium saucepan over heat" in the first step; most
models should be able to predict that the butter is being changed. In the second step, "remove from heat," there's an elided argument, but we're able to predict the butter anyway. In the third step, stirring in oats, sugar, flour, and all these fun things, we still understand that the butter is still there and needs to be combined with these things. If we look at something even more complex: step one, "brown meat in oil until it's brown on all sides"; step two, "pour off grease." We actually get this wrong: we predict that the oil should leave the mixture at this point, even though the gold says none. But if you think about it, "grease" and "oil" probably have fairly similar embeddings overall, so it's conceivable that the oil could have left at the same time. In the next step, nothing particularly interesting: we predict all the directly named entities. But the interesting part is step four, where the line is "bring to boil, then turn very low, cover, simmer about one and a half hours or until meat is tender." The meat
should be fairly easy to predict as an ingredient being used here; it's explicitly mentioned in the text. What's interesting is that we also predict all of the other ingredients that were combined with the meat in the previous step, so the model has actually figured out that this is a composition of all those entities, and it predicts them all together at the same time. To show this effect a bit more quantitatively: if you remember how we select and update entities, we choose a set of entities, average their embeddings, apply the action function, and then write the new embedding back to the correct memory slots, weighted by the original attention on each slot. So if two entities are selected, their embedding representations should usually move a lot closer to each other in embedding space, because they're both being partially overwritten by the same vector. And as it turns out, if we look at the percentage change in cosine similarity of entity embeddings that are being combined, we can see that they do generally move much closer to each other in embedding space. Now, for the skeptics among you who would tell me that's obvious, since entities that are all selected in the first step will naturally move closer together in embedding space, we should also look at the ones that aren't supposed to be combined. If we superimpose that graph on this one, we can see that entities that aren't supposed to be combined in a step don't move closer together in embedding space at all. There's obviously a bit of noise that leaks here and there, but with a good cutoff you could probably get rid of most of it. So we're simulating entity effects fairly nicely. What about actions? Well, this entire model runs on the premise that action
functions, applied to entity embeddings, induce state changes. If that works correctly, we would expect semantically similar actions to end up with similar action function embeddings, and that hypothesis holds: if we compare action functions in embedding space, they cluster fairly nicely. The "cut" action is near "slice", "split", "snap", and "splash"; "mashing" ends up near "spreading", "pureeing", and "squeezing". So the internals of our model seem to be picking up the same patterns we're trying to replicate. Finally, our state representations have to be useful for a downstream task: we have to show that our simulator can encode an advantageous representation of the process being described, using the world information we provide it. So we applied our state simulation model to recipe step generation. At each step we use the state embeddings of the entities to seed an LSTM generator, and we train it to generate the next step in the recipe. Across the board, our model, represented here in orange, does better than competitive baselines such as a seq2seq or an attentive seq2seq, and it's consistent across metrics such as BLEU and ROUGE, as well as VF1 and SCF1, where we pretty much check whether we generate the correct action or the correct state change in the next sentence. Interestingly, one of the baselines we ran used the entity state embeddings from the Recurrent Entity Network's tracker: we took the tracker from that baseline, which also learns a fine-grained entity representation, and measured how well the state embeddings it encodes do on this task. It is worse than the NPN across all metrics, indicating that the action-oriented simulation we use actually plays a role in encoding better state representations for downstream generation.
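The select, average, transform, and rewrite update described a moment ago can be sketched as follows; the dimensions, the tanh-transform action function, and the slot contents are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # assumed embedding size
memory = rng.normal(size=(3, d))             # entity slots: water, flour, oil
attn = np.array([0.5, 0.5, 0.0])             # "combine the water and flour"
W_combine = rng.normal(size=(d, d)) / d      # a hypothetical "combine" action fn

avg = (attn @ memory) / attn.sum()           # average the selected entities
new_state = np.tanh(W_combine @ avg)         # apply the action function
# Write back to every slot, weighted by its original attention:
updated = (1 - attn[:, None]) * memory + attn[:, None] * new_state

# Slots partially overwritten by the same vector move toward each other,
# while the unselected oil slot is untouched.
gap_before = np.linalg.norm(memory[0] - memory[1])
gap_after = np.linalg.norm(updated[0] - updated[1])
print(gap_after < gap_before)                # True (the gap is exactly halved)
print(np.allclose(updated[2], memory[2]))    # True
```

Because both selected slots move halfway toward the same vector here, their distance is exactly halved, which mirrors the rising cosine similarity of combined entities discussed above.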
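A rough sketch of what seeding a step generator with entity state embeddings could look like; the mean pooling, sizes, and projection here are assumptions about the general idea, not the actual NPN decoder:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_state, d_hid, vocab = 16, 32, 100

# Pretend entity states after step t (4 entities), from the simulator:
entity_states = torch.randn(4, d_state)

# Pool the entity states and project them into the decoder's hidden size
# to form the LSTM generator's initial hidden state.
pooled = entity_states.mean(dim=0)
h0 = torch.tanh(nn.Linear(d_state, d_hid)(pooled)).view(1, 1, d_hid)
c0 = torch.zeros_like(h0)

# Decode the next recipe step, conditioned on the entity states via (h0, c0).
embed = nn.Embedding(vocab, d_hid)
decoder = nn.LSTM(d_hid, d_hid, batch_first=True)
out_proj = nn.Linear(d_hid, vocab)

tokens = torch.tensor([[1, 5, 7]])           # toy token ids for step t+1 so far
out, _ = decoder(embed(tokens), (h0, c0))
logits = out_proj(out)                       # next-token scores per position
print(logits.shape)                          # torch.Size([1, 3, 100])
```

The point of this design is that the decoder's initial state carries what the simulator believes about the ingredients, so generation can prefer actions consistent with those states (for example, melting raw butter).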
A quick example: seeding recipe step generation with an ingredient such as butter, where the first step is to preheat the oven to 425 degrees. The baseline neural networks immediately jump to pretty common continuations for something that occurs after preheating an oven, like lightly greasing a baking pan with oil, or combining all the ingredients and mixing well. Our model knows that if you have some butter and it's raw, you're probably going to melt it, which is exactly what the reference says as well. So, the takeaways: text only gets you so far; you really need a model of the world to do good understanding; and you can encode these priors into a neural network architecture.
That model of the world can be learned from data. In future work we're looking at stories, dialogues, and scientific processes, and also at what kinds of models of the world we can encode to do better on these downstream tasks. Luckily for me, I work in a group that makes a lot of these models of the world: between VerbPhysics, connotation frames, and all these other things, there's a lot of data we can use to build world models and do better on stories, dialogues, and scientific processes, so we're looking at encoding those into this pipeline. Thank you.

Since there's a coffee break right after this, I guess we have some time for questions. I think that one right there.

Thanks, great presentation. I was just wondering: because you're working on food recipes, is it just the nature of food that you're picking up single-sense types of words? When you use the word "entity", you're really referring to one particular sense, whereas if you put this into the wild you might see things like "the house blew up when he was cooking meth", or "salt in my wounds", or "scratching the cherry car's paint job".

Great examples. So the question, to simplify it down, is whether this will generalize if we look at domains besides cooking, whether something like this would even work. I would say that the simplified version we had here, with a fairly simple model of the world, just six state changes applied to 385 verbs in the cooking domain, works, while you might need a much larger world to actually simulate another task, so that you can capture all of those effects. One of the things we're working on is using a larger lexicon of verbs, as
well as a larger set of state mappings that those verbs can induce, for tasks such as dialogue and stories.

It seems like this is natural for incorporating an outside ontology. You learned all this just from reading recipes, but have you tried incorporating an ontology to know about things you haven't seen in a recipe before?

What type of ontologies are you suggesting? Just a dictionary of what a carrot is, in case you never saw a parsnip in a recipe? Yes, ontologies. So, we didn't do that in this work. In future work it's one of the things we're looking at: a lot of the lexicons we're treating as models of our world are in fact ontologies of which effects different verbs induce, so we're using a lot of frame structures as models of our world.

Okay, the last question.

Great talk. You did not elaborate on the dataset much, and I was wondering: there are a lot of these cooking videos online, and even some of these recipes have intermediate photos of the different states. Could you do some kind of multimodal reasoning over this?

Yeah, that was one of the goals we were looking at for future work: how to integrate this great and expansive data source of video and images. We're actually working with collaborators at Cornell right now who are looking into that sort of thing, pretty much how you can represent the intermediate states using images as opposed to crowdsourced labels. We're pretty excited about what's going on with that work; there are some amazing new datasets, and hopefully we'll be able to do something with them.

Okay, let's thank the speaker again.