Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 12 – Subword Models
Okay. Hi everyone, let's get started again. So first of all, let me just say a bit about Assignment 5. Assignment 5 is coming out today. It's a brand new assignment, so you guys are the guinea pigs for that. What it's going to be is that it essentially builds on Assignment 4, so it's okay if you didn't do perfectly on Assignment 4, though I think actually most people did. What we're going to be doing is adding convolutional neural networks and subword modeling to the neural machine translation system, seeking to make it better. This assignment is coding heavy, written questions light. I mean, the coding that you have to do isn't actually really more difficult than Assignment 4; it's kind of like Assignment 4. But what we're hoping is that this time you will be able to do it on your own. What I mean by that is, for Assignment 4, well, there was tons of scaffolding telling you what everything should be, and there were all of these autograder checks, and you could keep on working on your code until it passed all the autograder checks, and everybody did. So it was very kind of coddled, shall we say. But I guess what we're really wanting to achieve is... Do you have a question? Yes. So what we're hoping is that this can be useful. It'll be short-term pain, but useful as being a more effective ramp to doing the final project, and indeed for the rest of your life. Right, the reality is that for the rest of your life, if you're going to be doing things with deep learning, you kind of have to work out what kind of model to build and which pieces to stitch together, and how to write some tests to see if it's doing something sensible, and if it's not doing something sensible, to figure out how you could change things and try different things and get it to work sensibly. And so that's what we're hoping that people can do in Assignment 5.
So if you've got to figure things out, you should write your own testing code. We don't have a public autograder, so that's part of working out your own sanity checks, trying to do things like what I talked about last week: getting simple bits working, confirming that they work on minute amounts of test data and so on, and doing things more sensibly. In particular, the one part of that that we were planning to do for this assignment (I was looking for it, but it's on the next slide): for this assignment and beyond, we're going to enforce rules more like they are in CS 107.
For those of you who are undergrads, that means the TAs don't look at and debug your code for you. Of course we still want TAs to be helpful: come to them with your problems, talk about how you're meant to use different things in the Python library. But you shouldn't be regarding it as the TAs' job of, here's a big Python file, can you tell me what's wrong with it and fix it up for me. Okay, the precise policy for that is written up on Piazza. So, are there any questions about that, or do I go straight on in? Okay. So, today's lecture. In some sense, today's lecture is an easy lecture. In last time's lecture there was really a ton of new stuff, of other stuff in neural networks that you hadn't seen before: we did ConvNets, and we did pooling layers, and we did highway and residual connections, and batch norm, and, I don't know, whatever else we did, size-one convolutions I guess. So there was tons of new stuff. Really, in this lecture, in terms of neural network machinery, there isn't any new stuff at all. And this is also really a new lecture, but it was put in for a reason, and the reason relates to a remark I made last time about how lots of stuff keeps changing in neural network land. At the time we first designed this class, and the way a lot of its structure still is, sort of around 2014-15, it was basically axiomatic that all deep learning models for natural language processing worked off words, and therefore it completely made sense that we start with word vectors and then we start looking at things like recurrent models over words. Whereas the fact of the matter is that in the last approximately three years there's been a ton of new work, including some of the most influential new work, that's building language models that aren't being
built over words; they're being built over pieces of words or characters. And so this lecture's sort of meant to give you some sense of these other ways of doing things, and some orientation to some of the things that are going on. But the actual kinds of models that we're looking at are using all of the building blocks that we've already looked at, things like RNNs and ConvNets. So let's get into this. I'm going to start off with a teeny bit of linguistics, of learning about the structure of language for lower-level units of language, and then we'll see how that pans out for things like character-level models. So in linguistics, if you start at the bottom of the totem pole, the first level of linguistics is phonetics, which is understanding the sounds and the physiology of human speech. That's sort of like physics or physiology or something, right: there are mouth parts that move, there are ear parts that act as filters, and there are audio waves in between the two of them. So that's kind of uncontroversial in some sense. But above that level, the standard thing that people do for the analysis of human languages is to say, well, human languages seem to make use of a relatively small set of distinctive units, which are then commonly called phonemes, which are actually categorical. And the idea here is that, well, our mouths are continuous spaces, right? They've got these various bits, like tongues and pharynxes and so on, but it's a continuous space, so actually we can make an infinite variety of sounds. If I open my mouth and apply voicing and just wiggle my tongue around, I can make an infinite variety of different sounds. But the reality is that human languages aren't like that: out of that infinite variety of sounds, we distinguish a small space of sounds. And something
that happens when languages change is that the space of sounds that are seen as important and distinguished in a language changes, and that happens even inside one language such as English.
And I'm about to give an example of that. So people in cog psych talk about the phenomenon of categorical perception, and what that means is that really there's something continuous, but humans perceive it as belonging to fairly sharp categories. You can use that for styles of clothing, or whether someone counts as fat or not, but the most famous examples of categorical perception are in language, where we can make an infinite variety of sounds but people perceive them as categories. Effectively, what that means is that when you have categorical perception, differences within a category are perceived to have shrunk, so you barely notice them at all, whereas differences across categories are expanded and very clear. One of the cases that's been studied a lot is what's referred to as voice onset time. Lots of languages, including English, have pairs of sounds like p and b, 'pa' and 'ba', and they differ based on when voicing starts: 'ba' has a voiced sound, like a vowel. Well, that's a continuous parameter: you can make any point along a spectrum between a p and a b. But human beings who speak English perceive just two points on that spectrum, and you don't really notice the fine differences between them. Some languages distinguish more points on the spectrum, so Thai distinguishes three different consonant sounds depending on the voice onset time. Something that might be more accessible to you is this example of language change. For a speaker like me, there is 'caught' and 'cot', and those are different vowels, and I hear them as different vowels. But if you are someone who grew up in the southwest of the United States, then
these are exactly the same vowel and you don't distinguish them, so you thought I said the same thing twice even though I was saying two different vowels. So even at a dialectal-ish level, people develop categorical perception as to which distinctions in sounds they're sensitive to or not sensitive to. Okay. And I mean, why I'm mentioning this is that in some sense it's these categorical sound distinctions that a lot of our language writing systems, which we'll come to in a minute, record. Okay. So in traditional linguistics you have sounds, but sounds don't have any meanings in language. So p and b don't have meanings, and a and e don't have meanings. And so people then normally distinguish, as the next level up, morphology, the parts of words, and this is seen as the minimal level that has meaning. The idea is that lots of words are complex and can be made up of pieces, but these pieces do have meanings. So 'fortune' has a meaning. 'Fortunate' ends in this '-ate' ending, which sort of gives fortune to somebody, so it means you're having fortune; that has a meaning. 'Un-' has a meaning, which means to reverse that, so 'unfortunate' means that you don't have fortune. And '-ly' then has a meaning of turning this all into an adverb, and you can say 'unfortunately',
not having gotten fortune, something happened. And so these pieces of words are then the minimal things that have meanings. Almost no work in deep learning has tried to make use of this morpheme level of structure. Actually, me and a couple of students six years ago did try to build a system where it built these tree-structured neural networks that put together meanings of words out of their pieces. But that really isn't an idea that's taken on widely. There's a reason why it hasn't taken on widely, which is that working out the semantically meaningful pieces of words is kind of hard, and a lot of the time in NLP what people have found is that you can just about get the same kind of results if you just work with character n-grams, the kind of units that you put into convolutional neural nets. Because if you just have a model that uses character trigrams, and you have start-of-word u-n, and n-f-o, and so on, going through to l-y end-of-word, those different character trigrams will, in a distributed way, pick up all the important meaning components of the word pretty well, and that's just good enough. And that's actually a very classic idea that's sort of been revived. Back in the second coming of neural networks, in the mid-80s into the early 90s, there was quite a bit of controversial work on the structure of language, and in particular Dave Rumelhart and Jay McClelland (Jay McClelland is still in the psych department here, if you want to look him up in your spare time) proposed a model of how to generate past tense forms in English. So this was sort of a cog sci experiment
of: can we build a system that can learn past tenses of English verbs? The difficult part there is that many verbs are regular, where you add the '-ed' ending, but some words are irregular, and you have to learn about the irregular patterning. The way they did that, partly because this was early days with respect to sequence models, is that they used a representation where they represented words precisely with these character trigrams, and that was the representation of words they fed forward in their model. That idea was met with a lot of controversy by linguists, philosophers, and other people with their own ideas of language, and so there was a lot of debate in those days about that. But as a purely engineering solution, that proved to be a pretty good way to do things. And so this decade there's been other work, which includes the deep semantic model developed at Microsoft, where what they're using is these kinds of character n-grams to put meaning over words. Okay. So now we might be interested in building models that aren't over words. We're going to have a word written as characters, and we're going to do something with it, such as build character n-grams. Something that is useful to know is that there's actually a fair amount of variation between languages when you do this, so it's not all the same stuff. The first problem is that there are some languages that don't put spaces between words. The most famous example is Chinese. But an interesting fact for those people of European ancestry is that when the ancient Greeks wrote ancient Greek, they didn't put spaces between words either. It was actually a later invention of medieval scholars who were recopying their manuscripts, who decided, well, maybe they'd be easier to read if we put spaces in, and then they started doing it.
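As a rough illustration of the character-trigram representation described above, here is a minimal sketch in Python. The function name `char_ngrams` and the `#` boundary marker are assumptions made for this example; they are not taken from the Rumelhart and McClelland model or from the Microsoft system.

```python
def char_ngrams(word, n=3, boundary="#"):
    """Return the character n-grams of `word`, padded with boundary markers."""
    padded = boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Trigrams run from start-of-word "#un" through to "ly#" at end-of-word.
trigrams = char_ngrams("unfortunately")
print(trigrams)

# Related words share many trigrams, which is how a model over these units
# can pick up shared meaning components in a distributed way.
shared = set(trigrams) & set(char_ngrams("fortunate"))
print(sorted(shared))
```

The overlap between "unfortunately" and "fortunate" is the point of the sketch: no morphological analysis is done, yet the shared stem shows up as shared units.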
Most languages these days do put spaces in between words, but even then there are a lot of fine cases. In particular, a lot of languages have little bits of stuff, which might be pronouns, or prepositions, or various kinds of joining words like 'and' and 'so', which sometimes they write together and sometimes separately. So in French you get these kind of prepositional, I'm sorry, pronominal markers for 'I', 'you', 'have brought', and these little words in pronunciation just sort of run together as 'je vous ai', and arguably it's almost one word, but it's written as separate words. Whereas there are other languages which stick things together where arguably they're separate words. So in Arabic you get pronominal clitics and some of these joining words like 'so' and 'and', and they're written together as one word where arguably they should really be four words. Another famous case of that is with compound nouns.
So in English we write compound nouns with spaces between them, so you can see each noun, even though in many respects a compound noun like 'white board' or 'high school' behaves like it's one word. Whereas other languages (German is the most famous case, but also other Germanic languages) just write them all as one word, and you get very long words like that. So we're going to get different words if we just use spaces and don't do much else. Okay. So for dealing with words, there are these practical problems, and we've already started to touch on them. If you try to build word-based models, there's this huge space of words, and strictly there's an infinite space of words, because once you allow in things like numbers, let alone FedEx routing numbers, or if you allow in just morphology, where you can make words like 'unfortunately', you can just expand the space of words. So you get this large, open vocabulary. In English it's a bit problematic; it gets way more problematic in a lot of other languages. So here's a lovely Czech word, 'to the worst farmable one', where you can make much more complex words, and lots of other languages do this: many Native American languages and other European languages like Finnish have these very complex words; Turkish has very complex words. So that's bad news. There are other reasons we'd like to be able to look below the word level to know things about words. When you're translating, there's a wide space of things, especially names, where translation is essentially transliteration: you're going to rewrite the sound of somebody's name roughly, perhaps not perfectly correctly, but roughly correctly, according to the sound system of the other language. And if we want to do that, we essentially want to operate at the letter level, not the word level. But another huge
modern reason why we'd like to start modeling below the word level is that we live in this age of social media, and in social media land there's a lot of stuff that's written not using the canonical words that you find in the dictionary, and somehow we'd want to start to model that. In some sense this is the easy case: 'good vibes'. But nevertheless this is spelt with one, two, three, four, five, six, seven o's and one, two, three, four, oh, also seven s's. They match; I don't know if that's deliberate or not. Okay, so this style of writing is very common, and, well, we're kind of sunk if we're treating things at the word level and trying to model this, right? That's clearly not what human beings are doing; we're sort of looking at the characters and
recognizing what goes on. In some sense that's kind of the easy case, which you could imagine preprocessing out. There's a lot of harder stuff that then turns up. I guess there's the abbreviation speak, like 'I don't care', but then you get a lot of creative spellings that come from reduced pronunciations, like 'imma go' and 'sumn'. And it seems like somehow we need something other than canonical words if we're going to start to deal better with a lot of this text. Okay. So that suggests we want to start doing that with our models, and that's led to a lot of interest in using character-level models. There are two extents to which you can do this, and we'll look at them both a bit. One level is to say: look, we're still going to have words in our system. Basically, we're going to build a system that works over words, but we want to be able to create word representations for any character sequence, and we'd like to do it in a way that takes advantage of being able to recognize parts of the character sequence that look familiar, so that we can probably guess what 'vibes' means. That then solves the problems with unknown words, and we get similar embeddings for words with similar spellings, et cetera. But the alternative is to say: no, just forget about these words altogether, who needs them? Why don't we just do all of our language processing on sequences of characters? It'll work out fine. Both of these methods have been proven to work very successfully. And I just wanted to dwell on that for one moment, and that goes back to my morphology slide here. When people first started proposing that they were going to build deep learning models over characters, my first feeling was, oh, that is never going to work, because it seemed like, okay, words have a meaning; it makes sense that
you can do something like build a word2vec model, and that's really going to be able to see words in their distribution and learn the meanings of the words, because words have a meaning. The idea that you're going to be able to say, well, I'm going to come up with a vector representation of h, and a different vector representation of a, and a different vector representation of t, and somehow that will be useful for representing what a hat means once I put it through enough neural network layers, frankly sounded pretty unconvincing to me. But it totally works, so I'm convinced now: empirical proof. And I think what we essentially need to realize is that, yes, at some level we just have these characters that don't mean much, but we then have these very powerful combinatory models with a lot of parameters in them, things like recurrent neural networks and convolutional neural networks, and they're effectively able to store and build representations of meaning from multi-letter groups, in such a way that they can model the meanings of morphemes and larger units and therefore put together word meanings. Yeah, so one more detail on using characters from writing systems. If you're a linguist, you tend to think of sounds as primary; those were the phonemes that I mentioned beforehand. Essentially, deep learning hasn't tried to use phonemes at all. Traditional speech recognizers often did use phonemes, but in deep learning land you want to have a lot of data, and the way you get a lot of data is you just use written stuff, because it's the easily found data where you can get millions and billions of words of stuff.
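To make the unknown-word problem from the social media discussion concrete, here is a minimal sketch in Python comparing a word-level vocabulary with a character-level one on a creatively spelled input. The training and test strings, the `<unk>` symbol, and the helper names are all assumptions chosen for illustration; they do not come from any system mentioned in the lecture.

```python
UNK = "<unk>"


def build_vocab(tokens):
    """Map each distinct token to an integer id, reserving id 0 for unknowns."""
    vocab = {UNK: 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Replace each token by its id, falling back to the unknown id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


# Hypothetical data: canonical spellings at training time, creative ones at test time.
train_text = "good vibes only , so good"
test_text = "goooood vibez"

word_vocab = build_vocab(train_text.split())
char_vocab = build_vocab(train_text)  # iterating over a string yields characters

word_ids = encode(test_text.split(), word_vocab)
char_ids = encode(test_text, char_vocab)

# Fraction of test units that fall outside each vocabulary.
word_unk_rate = word_ids.count(0) / len(word_ids)
char_unk_rate = char_ids.count(0) / len(char_ids)
print(word_unk_rate, char_unk_rate)
```

At the word level every test token is unknown, while at the character level only the unseen letter is. A model still has to learn what the familiar character sequences mean, which is the part the neural layers do.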
That sort of makes sense from a data point of view, but the thing that ends up a little weird about it is that when you're then building a character-level model, what your character-level model is actually
varies depending on the writing system of the language. You have these quite different writing systems. You have some writing systems which are just completely phonemic: there are letters that have a particular sound, and you say that sound. Something like Spanish is pretty much phonemic. Sometimes it's a teeny bit complicated, so you might have a digraph: this digraph here is kind of like the 'ng' of English that's used for the sound at the end of 'sing'. But basically each letter's a sound, you can read it, and it's just phonemic. That then contrasts with something like English, where, as all the non-native speakers know, the spelling is terrible. It's got this highly fossilized, once-upon-a-time phonemic system from the 10th century or something, but now we have this system where words have fairly arbitrary spelling that doesn't actually represent the sounds very clearly. But it's sort of a phonemic system. Then there are languages that use larger units. This is Canadian Inuktitut, which I just put in there because it's such a pretty writing system. But there are a lot of languages that represent syllables by their characters. You have something like this in Korean, for example, with Korean Hangul, where each letter is a syllable, a sort of consonant-vowel combination. You can then go up a level from that, and if we get back to Chinese again, well, this is also a syllabic system, you could say, but really the Chinese characters are much more than just the sound; they also have a meaning. This is really an ideographic system, where there are characters with particular meanings attached to them, so they're sort of whole morphemes written as one letter. And another example of such a language was Egyptian hieroglyphs,
if you've seen those; they're sort of ideographic systems where you have letters with meanings. And then you have systems that mix several of those, so Japanese is a mixture of a partly moraic, partly ideographic system mixed together. So if you just start off and say, okay, I'm going to build a character-based system, that's fine, but effectively your character units, like letter trigrams, are just very different in a language like Chinese, where commonly a letter trigram will be sort of a word and a half, three morphemes with meaning, whereas if you're in something like English, your character trigram will be something like 'tho', which is still much too small a unit to have any meaning. So, moving right ahead. There were these two kinds of approaches: one was just do a completely character-level model, and the other was to make use of characters to build bigger things that you're then going to put into a more word-level model. I'll do this one first and then the other one. For purely character-level models, I actually showed an example of that last time, do you remember? There was that very deep convolutional network from the Conneau et al. work for text classification at the end, and that just started with a big line of characters, built these convolutional layers on top of that in a vision-like network, and classified the documents. So that was a completely character-level model. But here's a bit more work on this. People for machine translation have built machine translation systems that just read characters and write characters. And when people first tried to do that, it
sort of didn't work, right? People thought it might help to build character-level models, especially for languages like Chinese, but people just weren't able to build models that worked as well as word-based models, either in the pre-neural, non-neural world or in the neural world. But gradually that started to change. People started to have successful character-level decoders, and then around 2015-16 people started to show that you can actually do machine translation very well at just a character level, with a few asterisks. So here's a bit of work that we did, the Luong and Manning one from 2015 on the last slide. This is looking at English-to-Czech translation, and Czech's a good language to use if you want to motivate doing things at the character level, because it has those big, horrible words with lots of morphology, like the example I showed you before, and I'll show you some more later. People had built word-level models for Czech, and they didn't work great, partly because of some of these vocab problems. The word-level state of the art at this time was 15.7 BLEU, which, as you know, is much less than we will accept for full grades in your homework. But what counts as a good BLEU score depends on how difficult the language pair is, and you're not doing Czech. So this was the kind of NMT model that we've talked about: a seq2seq model with attention, and then it had extra stuff for substituting UNKs with either a single-word translation or by copying stuff from the source. So that sort of basically state-of-the-art neural MT of 2015 got 15.7 BLEU. And the difference isn't big, but we were able to show that we could build this completely character-level model and actually do better. So this showed that, in terms of translation quality, purely
character-based models were completely viable at capturing the meaning of text, as well as word-based models. Was this a great result? In some ways yes, and in another way no. I mean, this model was truly terrible to train, right? It took about three weeks for us to train this model, and at run time it also worked very slowly. The problem with character-level models, if you're putting them into something like an LSTM, is that your sequences get way longer, right? You've got about seven times as long sequences as you used to have, and since there's not much information in the characters, you have to do backpropagation through time much further back. We were running backpropagation through time for 600 steps before truncating it. Maybe that was excessive, but it made the models very slow. But we were able to show that it was able to get some of these good effects. So here's a Czech example, translating
to Czech: 'Her 11-year-old daughter, Shani Bart, said it felt a little bit weird.' Does anyone speak Czech? Any Czech speakers? No Czech speakers, okay. I don't speak Czech either, but we can see that this does interesting things. The second line is the human translation into Czech, which we can use for some guidance. In particular, in Czech there's a word for 'eleven years old', which you can see is that blue word on the second line. And you can see that for '11-year-old' the character-level model is just able to perfectly produce, letter by letter, the Czech word for 'eleven years old', and that works beautifully. In contrast, for the word-level model, '11-year-old' was an unknown word because it wasn't in the vocabulary, and so it had two mechanisms to try to deal with unknown words: it could either do a unigram translation of them or it could just copy them, and for whatever reason it decided here the best strategy was to copy, and so that was a complete fail. And if we go along, for the character-level model another thing it gets right that's really cool is the name Shani Bart. It's able to do this transliteration task that I mentioned just perfectly, and it turns the name into 'Shani Bartová', which is exactly what the human translator did as well. So it's actually doing some really nice, human-translator-like things. In fact, as best I can tell from spending a bit of time on Google Translate, it actually does a pretty good job on the sentence, period. This part here starts to be different from the human translator, but it's not actually bad; it's sort of a more literal translation. So this 'cítí' actually translates 'feel', like in the English text, whereas the human didn't actually use the word 'feel' in the Czech version; they just went with 'was a little bit weird or strange'. So that's cool.
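Returning to the cost of character-level sequences mentioned before this example, here is a small back-of-the-envelope sketch in Python. The sentence is an assumption loosely modeled on the lecture example, and whitespace splitting stands in for a real tokenizer; only the 600-step truncation figure comes from the lecture.

```python
def length_stats(sentence):
    """Return (word_count, char_count) for a whitespace-tokenized sentence."""
    words = sentence.split()
    return len(words), len(sentence)


# Illustrative sentence (an assumption, not the exact lecture data).
sentence = "Her eleven-year-old daughter said it felt a little bit weird"
num_words, num_chars = length_stats(sentence)
ratio = num_chars / num_words

# With backpropagation through time truncated at 600 character steps,
# one window covers roughly this many words of context.
bptt_steps = 600
words_per_window = bptt_steps / ratio

print(num_words, num_chars, ratio, words_per_window)
```

For this one sentence the character sequence is several times longer than the word sequence, in line with the "about seven times" figure, which is why a recurrent model needs many more steps to cover the same span of text.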
Okay, so here are a couple more results from this. Here's another system that was built the next year by these people: Jason Lee, Kyunghyun Cho, and Thomas Hofmann. They wanted to do something that was, I don't know, much more complex and neural in understanding the meaning of the text on the source side, and so they were more using the kind of technologies we saw last time. On the encoder side, you start off with a letter sequence of character embeddings, and then you're using convolutions of three, four, and five characters to get representations up here. You're then doing max pooling with a stride of five, so you're getting a max-pooled representation of pieces of the text for each of the three, four, and five convolutions. You're then feeding that through multiple layers of highway network, and feeding that through a bidirectional gated recurrent unit, and that's giving you your source representation. On the decoder side, it was sort of the same as our decoder: it was just running a character-level sequence model. So overall, they were doing the opposite task; this is Czech to English. So they're starting to get better scores. But actually, if you're looking at these different numbers (I'll explain this system more in a minute), it sort of seems like the place where they get a lot of value is that using the character-level decoder gives them a lot of value, but this very complex model on the source side is giving them almost no value at all. One more, even more recent paper: this is Colin Cherry and fellow researchers at Google, so they, last year,
did one more exploration of LSTM sequence-to-sequence style models, comparing word-based and character-based models. This is English to French, and this is Czech to English, which is just what we were doing. In both cases, when you have a big model, the character model wins for them; the blue model comes out on top. But the interesting thing is that you see different effects depending on the morphological complexity of the language. For a language like Czech, it's a really good idea to use character level if you want to build a good model; they're getting about a BLEU point of difference there. Whereas for a model going between French and English, there's actually only a tiny gain from using a character-level model.
Okay, so let me just explain these models. These are models of different sizes. They use bidirectional LSTM encoders and unidirectional LSTM decoders. The simplest model just has a shallow bidirectional LSTM encoder and a two-layer LSTM decoder. The middle model has a three-deep stack of bidirectional LSTM encoders and a four-deep stack of LSTM decoders. And the most complex model has a six-deep stack of bidirectional LSTM encoders and an eight-deep stack of LSTM decoders. This is where it helps to work at Google. Probably for your projects you don't want to go beyond three or four; stay over here.
Okay, so these are the results. Basically, what you find is that if you're making smaller models, you're better off with words, but as you go to big models, especially if you're in a morphologically rich language, you clearly start to win from the characters. But there is still a loss, which is essentially exactly the same loss that we were suffering from in 2015. This is the time graph.
And so these are the same three models as over here; it's just that the axis has changed to the total number of LSTM layers. Essentially, if you're at the word level, you can run any of these three models and they're fast; you can be translating in not much time. But for the character-level models, your
slope is much higher, so it starts to get quite expensive to run the deep character-level models. Okay, so that's that section.
Chugging along, I then wanted to look at other ways of doing things. These are models that in some sense still do have words, but where we're going to want to build word representations out of pieces. There are essentially two families of ways that people have explored doing this. One way is to say: look, we just want to use exactly the same architecture as we use for a word model, except that our words aren't really going to be words; at least sometimes they're going to be pieces of words. Those are often called word piece models, and in particular there's one commonest way of doing it, called BPE, which I'll go through in some detail. The other alternative is to say: well, we're going to make a mixture or a hybrid, so our main model is going to work in terms of words, but we're going to have some kind of facility where we can construct a representation for otherwise unknown words by doing things at a character or a lower level. I'll show you a bit of that as well.
Okay, so this is BPE. BPE is actually a pretty simple idea which has nothing to do with deep learning, but the use of BPE has become pretty standard and successful for representing pieces of words, to allow you to have an infinite vocabulary, well, an infinite effective vocabulary, while actually working with a finite vocabulary. The origins of byte pair encoding, and the name "byte pair", have nothing to do with natural language processing or neural nets; this was just writing a compression algorithm. So this is something like compressing your documents with gzip. What basic byte pair encoding does is that you've got a collection of stuff with bytes, and you're looking for the most frequent sequence
of two bytes, and you say: okay, I'm going to add that sequence of two bytes as a new element to my dictionary of possible values. That means I can have 257 different values for a byte, so to speak, so I can shrink the length of my sequence, and I can repeat and do that again. Essentially, this work suggested: well, we could apply this kind of compression algorithm and use it as a way of coming up with pieces of words that are useful, doing it not strictly with bytes, despite the name, but instead with characters and character n-grams. The most common way to do this is with characters and character n-grams, and if you're up with modern times, you know that means there's Unicode, and you can represent all of these lovely letters like Canadian Inuktitut syllabics and stuff like that. But there's actually a problem with Unicode, which is that there are a lot of Unicode characters. I forget the number; theoretically I think there are something like 200,000 possible Unicode characters. Anyway, if you want to handle a bunch of languages which include East Asian languages, maybe you need something like 20,000 characters, and that's a lot. So there are actually some people who've literally gone back to bytes and said: you know, 20,000 is a really big vocabulary, I don't even want to deal with anything that large, so why don't I just do these kinds of algorithms over bytes? That means that, in UTF-8 encoding, Chinese characters take three bytes each, and so you only get whole characters if you've actually merged together several bytes that are common sequences.
Okay, so more concretely, how does this work? We're doing this bottom-up clustering of short sequences. We start with a unigram vocabulary,
which is all of the Unicode characters in some data. We then ask: what's the most frequent n-gram here? Initially it'll be a bigram pair, and we add that to our vocabulary. So if we start off (I'll come back to this in a minute) let's assume we have a text that has been divided into words, so we do have word tokens, and so we can represent it as a dictionary and say: here are some words with their frequencies. Now we look for a common letter sequence, and we say: oh, "es", that occurs nine times in this data, because we have the counts for the words on the left side. So we start with our vocabulary being all the individual letters. We find a commonest letter sequence like "es", and so we say: let's clump that together and make that a new thing in our vocabulary. Now we've got an extra thing in our vocabulary. Now what's the commonest n-gram sequence to clump? Well, actually all of these "es"s are followed by "t", so we also have "est" with frequency nine, and we can add that to our vocabulary. Then we ask again: what's another common letter sequence? Let's see, there are seven cases of "ow"; well, I guess there are seven cases of either "lo" or "ow", so we can lump those, and then we can lump again and make "low". If we run this, we start to build these clumps of common letter sequences: common bits like "est", but also just common words, something like "that" in English, will very quickly get clumped together and be a unit of our vocabulary.
We do that for a while. Normally what we do is decide on a vocabulary size that we want to work with. We say: okay, I want to work with a vocabulary size of 8,000 words; that'll mean my model will be fast. And we just keep doing this until we have 8,000 things in our vocabulary. That means our vocabulary will have in it all single letters, because we started with them, and it'll have common subsequences
of words like the "es" and "est" that are now in our vocabulary, but it'll also have whole words whenever they're common. Words like "that" and "to" and "with" and so on will become parts of our vocabulary. Then when we have a piece of text, we can do a deterministic longest-piece segmentation of words, and we say that is now our set of word pieces. So for an input piece of text, we turn it into word pieces, and then we just run it through our MT system as if we were using words, but really it's pieces of words, and then on the output side we just concatenate them back together as needed. Okay, so we get this automatic word-piece-based system, and that's proved to be a very successful system. This idea of using byte pair encoding really emerged in 2015, and in the 2016 Workshop on Machine Translation, which has been the main annual competition for MT systems, several top systems were built using byte pair encoding. If you look at last year's competition, there's a bit more variety, but really a number of the top systems are still using byte pair encoding; it's just been a good way to do things.
For Google's neural machine translation, they effectively use a variant of byte pair encoding. They don't use exactly the same algorithm; they use a slightly different algorithm where they're using a language model. Rather than just using pure counts, they're asking: what clumping together would maximally reduce the perplexity of my language model? They clump those things and repeat. They've done two versions of this model. The first version, the wordpiece model, is kind of like byte pair encoding: it assumes that you have an initial tokenization into words, and then you're just getting pieces of words using this algorithm. And then they did a second version,
the SentencePiece model, which you can find at this GitHub site, which said: well, it's problematic if we need to tokenize into words first, because then we need to have a tokenizer for every language, and that's a lot of work. So maybe instead of that we could just go
from a character sequence, retain whitespace, and regard that as something that's part of the clumping process. So you just build your word pieces, which commonly will have spaces on one side or the other of them, because often things inside a word are the more common clumps, and you build those up. That's proven to be quite successful.
In particular, one place where some of you might see this: we've yet to get to describing it in the class, really, but there's been this recent work, which we'll actually talk about next week in class, on building these transformer models. In particular, Google has released this BERT model, which gives you very good word representations. If you download BERT and try to use it, what you'll find is that it doesn't operate over words; it operates over word pieces. It has a large vocabulary; it's not a vocabulary of like 8,000 words. I forget the number, but the models have a large vocabulary, though still not a huge vocabulary, and it's using word pieces. So lots of words are in the vocabulary. If you look at the English model, it not only has words like "at" in it, but it even has words like "Fairfax" and "1910s", which aren't that common. But nevertheless, to cover all words, it's again using this word piece idea. So if I want a representation for the word "Hypatia", that's not in the vocabulary, and so I'm making it up out of pieces. There's an "h" representation, and then, in the BERT version, which is different from the Google NMT version, the non-initial word pieces are represented with two hashes at the start. So I can put that together with "h", "##yp", etc., and this would be my representation of "Hypatia". So effectively I have word vectors for word pieces.
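To make the byte pair encoding procedure walked through above concrete, here is a minimal Python sketch, not the implementation used by any of the systems mentioned. It assumes a toy word-frequency dictionary chosen to be consistent with the counts quoted in the lecture ("es" nine times, "lo" and "ow" seven times); the function names `learn_bpe`, `merge_pair`, and `segment` are hypothetical, ties are broken by first occurrence, and no end-of-word marker is used. Segmentation is a greedy longest-piece match, with an optional "##" prefix on non-initial pieces purely as a display convention.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merges from a {word: count} dictionary.

    Returns the ordered list of merged pairs and the resulting vocabulary
    (all single characters plus every merged piece).
    """
    # Each word starts out as a sequence of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single piece
        # Assumption: ties go to the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges, vocab


def segment(word, vocab, continuation_prefix=""):
    """Greedy longest-piece segmentation of `word` using `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece or a single character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        piece = word[start:end]
        pieces.append(piece if start == 0 else continuation_prefix + piece)
        start = end
    return pieces


# Toy corpus (an assumption; chosen to match the counts quoted in the lecture).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=4)
print(merges)
print(segment("lowest", vocab, continuation_prefix="##"))
```

In practice the loop would run until the vocabulary reaches a chosen size, such as the 8,000 pieces mentioned above, rather than for a fixed four merges.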
And then I have to work out what to do with them. The simplest and quite common way is that I just average the four of them, and there are obviously other things you could do: you could convolve and max-pool, or you could run a little LSTM or something to put together a representation.
Okay. So those were the models that worked with pieces of words to give you an infinite vocabulary, and ran them through a normal system. The other possibility is to say: well, we want to work with characters so we can deal with an infinite vocabulary, but we're going to incorporate those into a bigger system. A whole bunch of work has done this, and in some sense it's a fairly obvious thing to do. This work in 2014 was one of the early ones. They said: well, we could start with characters, we can do a convolution over the characters to generate word embeddings, and then we can use those word embeddings for something in a higher-level model. This was actually a fixed-window model for doing part-of-speech tagging.
That makes sense. Instead of a convolution, you could use an LSTM. This was work from a year later, and they said: well, we're also going to build up word representations from characters, and the way we're going to do it is we're going to run character-level BiLSTMs, concatenate the two final states, and call that our word representation. Then we're going to put that word representation into a language model, which is then a higher-level LSTM that works along a sequence of words.
Yeah. So if you're learning... I mean, this is the hidden layer; I guess I'm not actually showing the input layer, but for the input layer you're learning a vector for each character. So effectively you're doing the same kind of thing we saw before: you're starting with random representations for each character, you've
got this embedded inside a word-sequence LSTM, and your goal is to minimize the perplexity of the higher-level LSTM as a language model. So it filters back its gradients, and it's wanting to come up with character vectors
such that it produces good word vectors, which produce low perplexities. Good question.
So here's a slightly more complex version of trying to do this that's a bit more recent, where again the idea is: can we build a good language model by starting out from characters, wanting to exploit related subwords and rare words? They built this more stacked, complex model that we'll go through the stages of, where we start with a word represented as characters, we have character embeddings which we build into a convolutional network, and then we head upwards. If you take that one piece at a time: you have a character embedding for each character. You then have a convolutional layer, which has various filters that work over those character sequences, over two-, three-, and four-grams of characters, so you're getting representations of parts of words. Then from those convolutional networks you're doing max pooling over time, which is effectively like choosing which of these n-grams best represents the meaning of a word. At that point they've got an output representation for character n-grams, and so then they feed that into a highway network, like we talked about a bit last time. Then the output of that, at the word level, goes into an LSTM network, and this LSTM network is now a word-level LSTM network, and you're trying to minimize perplexity, like for the neural language models we saw earlier.
So what could they show with this? Well, the first thing they could show is that it actually, again, just works well as a language model. Despite that skepticism that I told you about, the fact of the matter is you can build these kinds of character-level models and train them, and they work, to a first approximation, as well as word-level language
models. But one of the observations that they make is that you can be getting as good results but with much smaller models. Up at the top here are their character-level LSTM models and word ones, the models they built, and here are a whole bunch of models on this data set. As time went by, perplexities had been going down, getting to 78.4, and their point was: well, we can build pretty much as good a character model, with 78.9 perplexity, but our model is actually much smaller. This model here has 52 million parameters, whereas our model that works at the character level has only 19 million parameters, so it's about 40% of the size. That seems kind of interesting.
But perhaps what's more interesting is to peek inside it and see what happens with the representation of words when they're built out of characters, and this part is actually a bit cool. What this is showing is, for the words up at the top (while, his, you, richard, trading),
it's asking what other words are most similar to it according to the word representations that are computed. The top part is the output of a word-level LSTM model, and that's sort of okay: "richard" comes out as similar to "jonathan", "robert", "neil", and "nancy", etc.; "while" to "although", "letting", "though", "minute". Mainly okay. But it's interesting what happens with their character-level models. In particular, first of all, you remember they had the character embeddings that went through the convolutional layer and the max pooling. If at that point you ask what things are most similar, basically it's still remembering things about characters. The most similar words to "while" are "chile", "whole", "meanwhile", and "white", so at least for the first ones, they all end in "le". You see that pattern elsewhere: close to "richard" are "hard", "rich", "richer", "richter". "hard" ends in "ard", "rich"... you're just getting this character-sequence similarity; it's not really doing meaning at all. But interestingly, when they then put it through the highway layers, the highway layers are successfully learning how to transform those character-sequence representations into something that does capture meaning. If you then ask, at the output of the highway layers, what words are most similar, it seems to be working pretty well: "while" is now similar to "meanwhile"; "richard" is similar to "edward", "gerard", "carl". It's now working much more like a word-level model in capturing semantic similarity. So that seems kind of cool.
Then they say: well, what about if we ask about words that aren't in the vocabulary of the model?
Well, if they're not in the vocabulary of the model, the word-level model can't do anything with them, so that's why you get those dashes there. What they want to show is that the character-level model still works pretty well. If you give it "look" with seven o's in the middle of it, it's correctly deciding that "look", "looks", "looking" are actually the most similar words to that, which is working very nicely. And for some of the other examples it's similar: "computer-aided" is seen as most similar to "computer-guided", "computer-driven", "computerized", "computer"; you're getting pretty sensible results. Then the little picture on the right is showing one of these 2D visualizations of the units that have been learned. The red things are character prefixes, the blue things are character suffixes, the orange things are hyphenated things, like in the middle of "computer-guided", and gray is everything else. So there's some sense in which it's picking out different important parts of words. And that's also, I guess, just another good example of how you can compose together different kinds of building blocks to make more powerful models, which you might also want to think about for your final projects.
Okay. So here's, back to one other example from a neural machine translation system, this hybrid architecture that has word level and character level. I showed you earlier a purely character-level model. We built that out of interest to see how well it did, but we were really wanting to build a hybrid model, because that seemed like it would be much more practical: to build something that translated
relatively quickly and well. The idea was that we'd mainly build a word-level neural machine translation system, but be able to work with character-level stuff when we had rare or unseen words. That turned out to work pretty successfully at improving performance. The idea of that model is this: we're going to run a pretty standard sequence-to-sequence-with-attention LSTM neural machine translation system. It's actually a four-level-deep system, but in my picture I showed fewer than four levels stacked to make it easier to see things. We're going to run this with a reasonable vocabulary of 16,000 words. For common words, we just have word representations that we're feeding into our neural machine translation model, but for words that aren't in the vocabulary, we're going to work out a word representation for them by using a character-level LSTM. Conversely, when we start to generate words on the other side, we have a softmax with a vocabulary of 16,000. It could just generate words like "chat", but one of those words is the UNK symbol, and if it generates the UNK symbol, we then take this hidden representation and feed it in as the initial input into a character-level LSTM, and then we have the character-level LSTM generate a character sequence until it generates a stop symbol, and we use that to generate words. So we end up with this hybrid, composed stack of eight LSTM layers.
Yeah? And you always get something for your UNK. Also, if you wanted to get the proper gradient, you'd always have to run it, but do you cut it off and say that, during training, you only run the character-level LSTM when the UNK symbol receives the highest probability?
So at training time there's a determinate piece of text, right? You know the source and you know the target, and so we're...
At training time we've already decided our vocabulary, right? We've just decided what are the 15,999 most common words; those and UNK are our vocabulary. So for both the input and the output side, we know which words aren't in our vocabulary. If an input word is not in our vocabulary, we're running this one; if what was the output is not in our vocabulary, we're running that one; and otherwise we're just not running it at all. And the bit that I didn't explain, but which is actually important and perhaps related: when we're calculating a loss that we can backpropagate, up here there are two losses. There's a loss at the word level: you'd like, in this position, to give probability one to generating UNK, but really this model's softmax will say UNK has probability 0.2 or whatever, so there's a loss there. And then secondarily, there's a particular sequence of characters you want to generate, and you've also got a loss from the probabilities you put over the characters.
Then, I think Abby briefly mentioned this, commonly the decoders do some kind of beam search to consider different possibilities before deciding the highest-probability one over a sequence of words. This was doing a slightly more complex version of that: there's a word-level beam search when running it, and then also a character-level beam search to consider different possibilities, and you want to integrate the two of those together. But essentially, this worked pretty well. This was the winning system at WMT 2015, which used 30 times as much data and ensembled together three other systems, compared to the data that was provided for the task. This was the system I showed before; they got 18.3. And if you remember, our purely character-level system got 18.5.
By building this hybrid system, we were able to build a much better system: it's about two and a half BLEU points better than either this word-level or the character-level system. So that was kind of nice, and in particular that was the state of the art at the time. Now, of course, if you were paying very close attention, that's now nowhere near the state of the art, because when I showed you that slide way earlier of the Google system, you will have noticed that they have much higher numbers, in the 20s. But that's what happens as the years go by.
Okay, but here's an example that shows these different systems working
and some of the mistakes they make. Here's a cherry-picked example where our system, the hybrid system, works perfectly, because what else would you expect to see? You can see some of the defects of things that can go wrong. In this case, the character-level system didn't work here, because, starting with the "Steph", it seemed to free-associate a completely made-up name that doesn't really have anything to do with the source. So that one isn't very good. The word-level system went bung here. You remember, when it generates an UNK, the word-level system is using attention, so when it wants to generate, it has attention back to words in the source. When it generates UNK, it's got two strategies: it can either do a unigram translation of the word that it's maximally putting attention on, or it can copy the word that it's maximally putting attention on. In this case it chose to translate the word it was maximally putting attention on, but the word it was maximally putting attention on was "after" rather than "diagnosis", and so you just get this "po po" coming out, "after after", and we've completely lost the word. In this example, our hybrid system just ends up working beautifully and gives you exactly the right translation. Yay.
Of course, it's not always that good in the real world. Here's a different example; this is the example I showed before with the 11-year-old daughter. In this example, the hybrid model has the same strength as the character model: it correctly generates "11-year-old" at the character level in its translation. But this time, for whatever reason, it's the hybrid model that goes bung in generating the names, and it translates Shani Bart as Graham Bart, whereas the character-level model gets it right. Actually,
I think this is one of the weaknesses of this hybrid model compared to the character-level model, because the character-level generator is kind of this second level. The purely character-level model is able to use the character sequence as conditioning context very effectively, whereas in our hybrid model, although we feed the hidden representation of the word-level model in as the starting hidden representation of the character-level model, it doesn't have any representations further back than that of what's in the word-level model. So it tends not to always do as good a job at capturing the context that allows it to do translation of things like names.
Okay, very almost finished, but there's just one thing I wanted to mention before the end, which is almost a practical thing. We started off with word embeddings, but now we've been talking a lot about character-level models, so surely, just for word embeddings, you should be able to do useful things with characters or pieces of words. That's something that people started to play with. In this Cao and Rei paper, they said: well, let's train a word2vec model using exactly the same loss as word2vec uses, but rather than having word representations, let's start with character sequences and run a bidirectional LSTM to work out word representations. We'll then effectively be training this more complex model, where we're learning character embeddings and LSTM parameters, and that will give us our word
representations. That's an idea that people have continued to play with, and in particular I just wanted to mention these fastText embeddings. A couple of years ago, people now at Facebook, including the same Tomas Mikolov who did the original word2vec, brought out a new set of embeddings, the fastText embeddings. Their goal was to have a next-generation word2vec, which is an efficient, fast word vector learning library, but one that was better for rare words and languages with lots of morphology. The way they did it was that they essentially took the word2vec skip-gram model but augmented it to put in character n-grams. More precisely, this is what they did. When you have a word (my example word is "where"), for some n-gram size you represent it as a set of n-grams. This is just about like those Wickelphones I mentioned right at the beginning, where you have a boundary symbol so you know the beginning of the word. So if the length is three, you have "<wh", "whe", "her", "ere", "re>" as pieces of representation, and then you have an additional one for just the whole word, "<where>"; so you do still have whole-word representations in this model. So "where" is represented by six things, and you're going to use all six of those things in your computation. If you remember the guts of word2vec, what you were doing was these vector dot products between your context representation and your center word representation. They're going to do exactly the same thing, but for the center word they're going to use all six of these vectors, the vectors corresponding to all six of these representations, and they're going to sum them. So you're just doing this simple summing operation, and that's then giving you your representation of similarity.
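The n-gram decomposition and summing just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the fastText library's code: the names `char_ngrams`, `vector_for`, `word_vector`, and `score` are hypothetical, the vectors are small random lists standing in for learned parameters, a single n-gram size is used, and the hashing trick is left out.

```python
import random

DIM = 8                      # toy embedding size (an assumption)
_rng = random.Random(0)      # fixed seed so the sketch is repeatable
_table = {}                  # unit (n-gram or whole word) -> vector


def char_ngrams(word, n=3):
    """Character n-grams of `word` with boundary symbols, plus the whole word."""
    marked = "<" + word + ">"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    if marked not in grams:  # avoid a duplicate for very short words
        grams.append(marked)
    return grams


def vector_for(unit):
    """Look up (or randomly initialise) the vector for one unit."""
    if unit not in _table:
        _table[unit] = [_rng.gauss(0.0, 0.1) for _ in range(DIM)]
    return _table[unit]


def word_vector(word, n=3):
    """Represent a word as the sum of the vectors of all its units."""
    total = [0.0] * DIM
    for gram in char_ngrams(word, n):
        total = [t + g for t, g in zip(total, vector_for(gram))]
    return total


def score(center_word, context_vector):
    """Skip-gram style score: dot product of the summed center vector and a context vector."""
    return sum(c * w for c, w in zip(word_vector(center_word), context_vector))


print(char_ngrams("where"))
# An unseen word still shares pieces with a known one:
print(sorted(set(char_ngrams("where")) & set(char_ngrams("whereas"))))
```

Because the dot product is linear, scoring with the summed vector is the same as summing the scores of the individual pieces, which is why a rare or unseen word can borrow strength from n-grams it shares with common words.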
To be precise, they don't quite do that, because there's a hashing trick, but I'll leave that out. What they're able to show is that that model actually works pretty successfully. These are word similarity scores: skip-gram, their old CBOW, and then this is the new model that uses these kinds of n-grams. At least for one of the English datasets, it doesn't get any better. But what they especially noticed is that for languages that have more morphology, you're getting some fairly clear gains: 70 and 69 going to 75, 59 and 60 going to 66, in the right column. So these word piece models do give them a better model of words. And just practically, fastText