Welcome to this session, where we're going to see the details behind this new concept of AI at Scale: this generation of AI that is powered by massive AI models trained on huge AI supercomputers. We're going to look at the technologies, and also a little bit at the research breakthroughs, behind the innovations that are enabling this new breed of AI. I'm using a little experiment for this: behind me I have a virtual whiteboard, and I can actually draw on it as I speak. Wish me luck, because I'm not much of an artist, but I'll give it a try.

So what is going on with these massive models in the last couple of years? If you put them on a timeline, with the size of the model on one axis and time on the other, and look in particular at NLP models, models for natural language processing, you see a huge increase in size. We started in 2018 with GPT, a model created by OpenAI with 110 million parameters. Then we saw BERT coming from Google with 340 million parameters. After that, GPT-2 with 1.5 billion parameters. At the beginning of this year, you probably remember, we announced the Microsoft Turing family of models, with Turing at 17 billion parameters. And after that, using the same infrastructure in Azure, this AI supercomputer in Azure, OpenAI announced just before the summer the GPT-3 model, and this one had, wait for it, 175 billion parameters. So you can see what is going on here: there is this huge exponential growth. GPT-3 isn't even on the chart anymore; it's above the chart. But what is even more important is that there's also an increase in the quality of the models, and we are learning that as we add more and more parameters, the models can perform tasks that they couldn't perform before. We haven't reached the ceiling here, so we expect continued growth in model size over the next months and even years.

So why is this happening now? Why didn't we add more parameters before? The answer has to do with the research that was done to enable these new massive models, and to explain those breakthroughs I need to explain a little bit how NLP works. I would call this first part traditional NLP; this technique started around 2013 with a famous paper, and it's called word embeddings. You're probably familiar with it. The way that technique works is that you map words to vectors, and usually those vectors are huge, like 300 dimensions, but let's imagine there are only two dimensions. A word embedding takes every word and maps it to a vector. The interesting thing is that you start to notice patterns: words that are closer in meaning end up closer as vectors. So "king", "queen", "prince" will be closer to each other than to "red", "blue", "orange". This is a very simple example with two dimensions, but it can get very complex in multiple dimensions. What is behind this is a mathematical representation of the word that has to do with the meaning of the word, and that's very powerful, because now a computer can use these word embeddings to do something with that meaning.
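To make the "closer in meaning means closer as vectors" idea concrete, here is a minimal sketch with made-up two-dimensional vectors; real embeddings such as word2vec are learned from large amounts of text and have hundreds of dimensions, so treat this only as an illustration of the similarity idea:

```python
import numpy as np

# Toy 2-D "embeddings"; real word embeddings are learned and have ~300 dimensions.
embeddings = {
    "king":   np.array([0.90, 0.80]),
    "queen":  np.array([0.88, 0.84]),
    "prince": np.array([0.85, 0.75]),
    "red":    np.array([-0.70, 0.20]),
    "blue":   np.array([-0.65, 0.25]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means very similar direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The royalty words end up much more similar to each other than to the color words.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # high, close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["red"]))     # much lower
```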
So how do we calculate these word embeddings? It's an approximation: we estimate that the meaning of a word has to do with the words that are usually used around it. If you do that with a lot of text, for example all the text coming from the internet, the result is actually very good, and the representation has a lot to do with the meaning.

Now that we have this mathematical representation, we can do things like a language model. A language model is very simple; it's a foundational task in NLP. A language model takes a sentence, imagine I have the sentence "I am loving Ignite", and it tries to guess what the next word will be. To do that, I can do something very simple. I can take a word, for example "I", calculate its word embedding, the mathematical representation, and then use that as the input to a neural network that provides as output what the model thinks the next word after "I" will be. I can train this neural network with, again, a lot of text coming from the internet, with a lot of sentences containing the word "I". This is great, but it has a problem: it only takes one word as input, and usually you need to consider more than just the last word to guess the next one.

So what we do is use a technique called a recurrent neural network, or RNN, where the neural network kind of has memory, because it gets as input not only the word but also itself: it is connected with itself. There are many types, and I won't go into that detail, but they all look very similar. If you unfold them, they all take a shape like this: the network takes the first word; then in the next iteration it takes the second word as input along with its own internal state, that hidden state; then another iteration of the same network takes the new hidden state and the next word and produces an output; and so on until the last one. If I train this neural network, it gives me an estimate of the next word. That is a basic language model.

Now, the interesting part here is not actually the output; the interesting part is this "h", the hidden state, because if you think about it, "h" is like the state of the sentence. It's a representation of the sentence, kind of the meaning of the sentence up to that particular word. So that last hidden state at the end is another vector representation of meaning, but now not for a word, for a sentence. And I can do crazy things with that. For example, imagine that I have a labeled set of sentences classified by whether they have positive or negative sentiment, so I manually label groups of positive sentences and negative sentences, like "I hate it". When I train that model, what I have is a classifier: I can take the representation of a new sentence, run it through this model, and the output, in the case of "I am loving Ignite", is positive. So I created my first sentiment analyzer by using this concept of recurrent neural networks and word embeddings. Perfect.
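Here is a minimal sketch of that recurrent-classifier idea, assuming PyTorch and a tiny made-up vocabulary; a real system would use pretrained embeddings, an LSTM or GRU, and far more data, so the names and sizes here are illustrative only:

```python
import torch
import torch.nn as nn

class RNNSentimentClassifier(nn.Module):
    """Embed each word, run an RNN over the sentence, classify the final hidden state."""
    def __init__(self, vocab_size: int, embed_dim: int = 50, hidden_dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # word -> vector
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)              # positive / negative

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)              # hidden: (1, batch, hidden_dim)
        # The last hidden state is the "meaning of the sentence" vector described above.
        return self.classifier(hidden[-1])          # (batch, 2) logits

# Hypothetical toy vocabulary, just to show the shapes end to end.
vocab = {"<pad>": 0, "i": 1, "am": 2, "loving": 3, "ignite": 4, "hate": 5, "it": 6}
model = RNNSentimentClassifier(vocab_size=len(vocab))
sentence = torch.tensor([[vocab["i"], vocab["am"], vocab["loving"], vocab["ignite"]]])
logits = model(sentence)   # untrained, so the prediction is random until we train it
print(logits.shape)        # torch.Size([1, 2])
```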
So what is the problem with this approach? The problem is that it doesn't scale. Every step relies on the previous one, so I need to execute all of this sequentially, and that is blocking us from adding more and more parameters, because we can't scale it. The breakthrough that happened in the last two years is a series of neural network techniques that allow us to increase the size of the model without hurting its scalability. The main concept is called transformers; that's the primary architecture concept here.

The main trick is the following. Take the same sentence, "I am loving Ignite", but now, instead of connecting it to a recurrent neural network, I'm going to connect every word to a traditional feed-forward neural network. It's not connected to itself, so it can scale. Now, by itself this is not going to work, for several reasons. The first one is that the sentence just lost its order: each neuron wouldn't know the position of each word, and that is important in a sentence; it's not just a collection of unordered words. To fix that, in the representation of every word we add information about its position. That's an easy one; it could be one, two, three, four depending on the position, although usually it's a little more complex, like a sine or a cosine. That's the first thing.

The second thing is that each neuron doesn't see more than one particular word, so its representation would be very tricky; I need to understand the context for every word. For example, imagine that I add "not" here. That "not" changes the meaning of "loving" dramatically, so when I'm processing "loving" I need to get that context to understand that in this case it's not a positive thing; "not loving" is very different from "loving". To fix that we introduce another very important concept called attention. The attention technique has been used in neural networks for a very long time, but not in the particular way transformers use it.

So how does it work? Let me clear all of this very quickly, because the way this works is very similar. Imagine I have a similar sentence, but in this case it's "I don't love Ignite". Now, instead of feeding these words directly into the neural networks above, I'm going to add a new block, and I'm going to call it an attention block. An attention block does the following: instead of just passing each word straight up to the neural network on top of it, it pays attention to other words that are associated with this one. In this case, because of training, it will understand that "don't" is connected to "love", and it mixes the two. Usually this is as simple as a weighted addition, like a weighted average: you take those words, mix them together with those weights, and pass the resulting representation to the neural network on top. It's that simple.
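Here is a minimal sketch of that attention idea, written in the scaled dot-product form that transformers actually use; the sentence, the tiny embedding size, and the random weight matrices are all made up for illustration:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Each word's output is a weighted mix of (projections of) all the words in the sentence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each word should attend to each other word
    weights = softmax(scores)                 # rows sum to 1: the "weighted addition" coefficients
    return weights @ V                        # mix the value vectors with those weights

rng = np.random.default_rng(0)
d = 8                                         # tiny embedding size, just for illustration
words = ["i", "don't", "love", "ignite"]
X = rng.normal(size=(len(words), d))          # pretend word embeddings
X += np.arange(len(words))[:, None] * 0.01    # crude stand-in for the positional information
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # (4, 8): one context-aware vector per word, e.g. "love" now mixed with "don't"
```

Multi-head attention, which comes up next, is essentially several of these blocks run in parallel with different learned weight matrices, with their outputs combined.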
Then it can get more complicated, because usually it's more than one word: words are connected to each other in multiple ways, and there can be a lot of distance between those words. In this case, for example, "Ignite" could mean catching fire, or it could be the Ignite event, and to know that I may need to go back three sentences to understand that, hey, I was going to this Microsoft event.

Those two tricks are the transformer architecture, and once you have that, here's the magic: you can scale it any way you want. You can add multiple layers of attention blocks, or increase the complexity of the neural network on top, or add something that is very common in these models: attention heads. You add multiple attention heads, in what are called multi-head attention blocks, and each head looks at particular relationships between words, so each one specializes in a particular association, and then their outputs are combined into the output of the attention block. This is all trained: all those weights, all those parameters, are learned by the model itself when we train it. And then this whole thing, which is called a transformer block, can be stacked: I can add multiple transformer layers, one on top of the other, and scale even more. To give you an idea, one of the variations of Turing that we announced at the beginning of this year has 28 attention heads and 78 layers of transformers, so you can see how this can grow a lot.

So let's talk more about this Turing model and see how all of this relates to it. For that I have invited Ali Alvi. Ali works on the team behind the Turing models, and he will show us a little more detail; I think he even has some demos to show us. All yours, Ali.

Thanks, David, great to be with you today. Hi everyone, my name is Ali Alvi and I'm the GPM for the Turing team here at Microsoft. I'm very excited to talk to you about the Microsoft Turing models and their capabilities, and how they are fundamentally transforming the way we think about using AI models in production here at Microsoft. I will show you a demo of how easy it is to use these language models for any NLP task, and at the end I will talk about how cutting-edge platform innovations and the Microsoft Azure AI supercomputer make it possible to train these large-scale models. But let me start off by explaining the different types of Turing models and their capabilities. I will be using this whiteboard as we go along, so I apologize if things are not very legible, but we can always blame David for putting me up to this.

Let's first look at the core NLP capabilities of the Turing language models. In any kind of NLP task you always start with text. Here we have some text, it says "I love AI", and we pass it through our neural network model. The network thinks about the text, processes it, and creates what we call a representation of that text: basically a mathematical representation of that piece of text. A model that does something like this is called a representation model, and Turing has a Turing Natural Language Representation (NLR) model. Now, that's just one type of model.
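The Turing NLR model itself isn't something you can simply download, but the same "text in, vector out" pattern can be sketched with a publicly available representation model; this assumes the Hugging Face transformers library and bert-base-uncased as a stand-in, not the actual Turing model or its API:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is a public representation model, standing in for Turing NLR here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I love AI", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; the first ([CLS]) vector is often used as the sentence representation.
token_vectors = outputs.last_hidden_state        # shape: (1, num_tokens, 768)
sentence_vector = token_vectors[:, 0, :]         # the mathematical representation of the text
print(sentence_vector.shape)                     # torch.Size([1, 768])
```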
Now imagine, let me change colors and also change the input, that the text is "I love neural". This is a different input; it goes through the same model and comes out with a different representation. There are also models that take that representation, have another component that does some processing (this time I'll use a different icon, because this part of the model is kind of thinking about it), and output not a representation but actually "networks", which is a new piece of text. This is a different kind of model, because it takes some text and produces new text. A model that can do that is known as a generation model, and at Build we announced Microsoft Turing NLG, which is a generation model. It's a 17-billion-parameter model, which was the biggest model at the time, and it takes in text and produces new text.

Now, both the representation and generation examples I showed you are working on English data. But you can imagine that the input could be in any language. Let me change colors again: the input could be English, could be Cyrillic, could be Chinese or Japanese, and a model that takes that kind of input and creates a representation, which is now a universal representation, is known as a universal representation model. That basically means the model is language agnostic; it can understand many, many different languages, and Turing has universal representation models as well. You can take the same analogy forward: a model that can take this input and output text in all different kinds of languages becomes a universal generation model, and that's another kind of capability.

So essentially there are two dimensions: one is the type of capability, representation or generation, and the other is whether the model is monolingual or universal. As part of the Microsoft Turing family of models, we have models covering all of these NLP capabilities, and these are the fundamental capabilities required to perform the natural language tasks that we see all over Microsoft.
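Going back to the generation capability for a moment, the "text in, new text out" behavior can be sketched with a small public model; this assumes the Hugging Face transformers library and gpt2 as a stand-in, not Turing NLG or its API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a small public generation model, standing in for Turing NLG.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I love neural", return_tensors="pt")
# The model continues the text; a well-trained model will likely produce "networks" next.
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```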
With that, I'm going to quickly transition over to a demo that shows how easy it is, using Azure Machine Learning, to take one of these language models, an English language representation model, and fine-tune it for a sentiment analysis task with just a few lines of code. So let's go to the demo.

This is the Stanford Sentiment Treebank task, which is part of the GLUE benchmark. I want to highlight that when this task was published, just over a year ago, the baselines for the task were bidirectional LSTM models with some attention mechanism, which could get you an accuracy of around 85% on this task; that was considered significant at the time. Let's see if we can beat that baseline using a powerful language model like Turing NLR. I will be showing you how to do this all inside Azure Machine Learning.

Here I am in my Azure ML portal, and I have set up a Python notebook running inside an Azure ML notebook VM, which gives me all the configuration necessary to get started with this task. I will first show you how to fine-tune the model, and then also show how easily you can convert the model using ONNX Runtime and deploy it as a service on a CPU, all in a handful of steps. Let's get started. First off, we simply initialize the Azure ML SDK and create a workspace using my subscription and configuration. Next, I create my compute cluster; this is the GPU cluster that we will be using to train, and the one I have chosen is a single node with four K80 GPUs, since this is a simple fine-tuning task. I'm also registering my data store, where I've already placed my pre-trained Turing NLR model; this is the same data store I will upload the GLUE SST task to. The last thing in the setup is to create a custom environment based on an existing Docker image that has all the dependencies I need for the NLR model.

Now we have everything ready to go. We copy the code to the GPU and then run the fine-tune script. The key settings to note are that we are using all four GPUs and training with a batch size of 32 per GPU, and we will run this for 15 epochs. I can now submit the experiment. Once submitted, I can go over to the Experiments tab and see that my experiment was just queued and is starting. It will take a couple of hours for all 15 epochs to run, but I have previously finished a run, and I can show you that using Turing NLR v3 I was able to get an accuracy of 94.5% on this task in 15 epochs. In fact, you can see that after just two epochs the model was already well beyond 92%, which is significantly better than the baselines I shared before. The rest of this notebook shows how you can optimize the model using ONNX Runtime and then use Azure ML to deploy it to a CPU inference cluster. In the interest of time I'm not going to go into all the details; however, I have the model deployed, so let me show you how it does on some interesting examples.

The first example is simple: someone saying "the movie was worth two hours of my life". This example is classified correctly as positive; keep in mind that one is positive and zero is negative. The next example is a bit tricky, because it uses some of the same words as the first one; the user says "I would really love to have two hours of my life back". Notice that even though the word "back" is misspelled and the review contains the words "love" and "life", the model correctly predicts it as a negative review. The next one is an unusual review that you would normally not see, but the model also gets it right. The next example has a phrase like "if the kids are bored", but the model is able to tell that it is still a positive review. The last two are interesting because the term "horror" appears in both of them, but the model can tell that in one case it refers to the movie genre, with positive sentiment, while in the second one it is about the movie being really bad. That concludes the demo of how, using powerful large-scale language models like Turing NLR, you can get state-of-the-art performance on real-world natural language tasks.

Now let's quickly look at how the same Turing model is powering Smart Find functionality in Word. I have Microsoft Word open with the document I am viewing. I can press Ctrl+F to get into Smart Find and simply type what I'm looking for. I made a mistake spelling "exposure", but the Turing-powered Smart Find automatically corrects it and finds all the places where those words appear. In a phrase, I can completely misspell something (as you can see, I have no clue how to spell that word), but I'm still able to find it in the document. Turing models can even make things like question answering possible: as shown here, the Smart Find model extracts the answer to my question and points me to the exact part of the document that contains that answer. All of this is made possible by leveraging the power of the large-scale Turing natural language representation model.
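To make the fine-tuning part of the demo a bit more concrete, here is a rough sketch of what submitting such a run with the Azure ML Python SDK (azureml-core) can look like; the cluster name, Docker image, script name, and arguments are illustrative placeholders, not the exact notebook from the demo:

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()                    # subscription/workspace details from config.json

# A small GPU cluster for fine-tuning (the demo used a single node with four K80s).
compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_NC24", max_nodes=1)
gpu_cluster = ComputeTarget.create(ws, "gpu-cluster", compute_config)
gpu_cluster.wait_for_completion(show_output=True)

# Custom environment based on a Docker image that already has the model's dependencies.
env = Environment("tnlr-finetune-env")
env.docker.base_image = "<registry>/<image-with-nlr-dependencies>"   # placeholder image name
env.python.user_managed_dependencies = True

# The fine-tuning script and its key settings (4 GPUs, batch size 32 per GPU, 15 epochs).
run_config = ScriptRunConfig(
    source_directory="./finetune",              # hypothetical folder with the training code
    script="run_finetune.py",                   # hypothetical script name
    arguments=["--task", "sst-2", "--epochs", 15, "--batch-size-per-gpu", 32],
    compute_target=gpu_cluster,
    environment=env,
)

run = Experiment(ws, "turing-nlr-sst2").submit(run_config)
run.wait_for_completion(show_output=True)       # streams logs; metrics show up in the Experiments tab
```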
As David mentioned, we have AI models that are billions of parameters in size. The GPT-3 model from OpenAI is 175 billion parameters, and I fully expect these models to grow to over a trillion parameters in the next year or two. Personally, I'm super excited about that, but it does make you wonder: how do we train these AI models? This is a remarkable accomplishment of AI platform engineering, and it's another area where Microsoft is at the cutting edge of innovation, so let me quickly explain how we train these large models.

There are fundamentally three things you need in order to train an AI model. The first is data. The second is your model. And the third, which is the most important piece, without which we could not train any of these models, is GPUs; you need a GPU resource. So let me draw a GPU; that's my GPU. Traditionally, what you do is put the model on the GPU, take the data, and pass all of that data through the GPU, and that's how the model learns. That is what would traditionally be known as single-box training, and it works fairly well for a large number of neural networks. However, in order to train a language model we need a lot of data, and when you have a lot of data, doing it this way is going to be very slow. What we need is more compute.

So let me quickly expand my GPU hyper-cluster and add a couple more GPUs. Now that I have more than one GPU, I can take the data and slice it up (let me change colors): take one slice and put it on one GPU, take another slice and put it on another GPU, and if you have three GPUs, put the third slice on the third GPU. Now, these GPUs need to be interconnected at very high speed, and I'm talking about hundreds of gigabits per second, because in order for the model to learn from all of this data, these GPUs need to be communicating all the time. I'm not going to go into the details of that, but it's an important fact to remember. What we've essentially done is split the data into multiple shards and put the same model on all of these GPUs, so the model can now learn from all of this data. This kind of division, or parallelism, is known as data parallelism: you take the data and parallelize it across multiple GPUs.

This opens up a lot of opportunity: you can have a lot of data, and all you really need is to add more GPUs, and you will be able to train with more data. However, there's a problem. The same model exists on all of these GPUs, and these GPUs have 16, maybe 32 or 64, gigabytes of memory, while we're talking about models with billions, hundreds of billions, of parameters. So after a while, when the model becomes large enough, your GPU is going to run out of memory. How do we solve that problem?
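Here is a minimal sketch of that data-parallel idea using PyTorch's DistributedDataParallel; the model and dataset are placeholders, and in practice you would launch one process per GPU with torchrun or a similar launcher:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # One process per GPU; torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; the real thing would be a transformer and a text dataset.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])        # same model replicated on every GPU
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(dataset)              # each GPU sees its own shard of the data
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = loss_fn(model(x), y)
        loss.backward()                                # gradients are averaged across GPUs here
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```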
To solve it, let me quickly expand my GPU hyper-cluster again; let's grow it to nine GPUs. Now, what we just did with the data we can start to do with the model as well: I take one part of the model and put it on this GPU, take another part of the model and put it on this GPU, and take the third part and put it on that GPU. By doing this, you've done exactly what you did with the data, but with the model. Now the model can continue to grow, and all you really need is to add more GPUs in this second dimension. This way of increasing the size of the model that you can train is known as model parallelism, and with data parallelism and model parallelism combined, all you have to do is keep growing the GPU hyper-cluster in one dimension or the other, depending on whether you have more data or want to increase the size of the model, and you will be able to train very large models.

However, can we do more? As David mentioned, these models are built of layers and layers of transformers, so if you think about it, when we have 12, 24, 36, 48 layers, they are just identical transformer blocks stacked on top of each other. Essentially the model is like a pipeline of transformers. So if I extend the same analogy and grow my GPU hyper-cluster into a third dimension (let me draw another GPU here, and imagine you have GPUs in this direction and the other direction as well), I can take this model and do exactly what I explained for data and model parallelism, but across its layers, over this set of GPUs, and now I can parallelize the model itself, layer by layer, across many, many GPUs. This dimension of parallelism is known as pipeline parallelism. Together with data parallelism and model parallelism, pipeline parallelism means that models that would take months or years to train can now be trained in a matter of weeks or sometimes days.

Microsoft provides all of this support as part of our DeepSpeed library with ZeRO. It is available on Azure, and all you really need to do is add this library to your code, and you will be able to get data parallelism, model parallelism, and pipeline parallelism all at the same time; we call this 3D parallelism. It's a fantastic library, available for anyone in the world to use, and by using it you go from training models of maybe a couple hundred million parameters to tens of billions of parameters, and then, with model parallelism and pipeline parallelism added on top, you can go up to hundreds of billions, even a trillion-parameter model. Of course, you need a lot of GPUs for that, and we have something with more than 10,000 GPUs and more than 285,000 CPU cores: the Azure AI supercomputer. This is what OpenAI used to train their 175-billion-parameter model, and it is also the same supercomputer that Turing uses to train all of its models. So Microsoft, with the DeepSpeed library, is providing the software support to extend the training of these large-scale models, and we are also providing the supercomputer to be able to create them.
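As a rough sketch of what "just add the library to your code" can look like, here is a minimal DeepSpeed training step; the model is a placeholder and the config values (ZeRO stage, batch size, precision) are illustrative, not a recommended setup for any particular model:

```python
import deepspeed
import torch

# Placeholder model; in practice this would be a large stack of transformer layers.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(24)])

# Illustrative DeepSpeed config. ZeRO stage 2 partitions optimizer state and gradients
# across GPUs; model and pipeline parallelism come from additional configuration and
# model changes on top of this.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# DeepSpeed wraps the model into an engine that handles the distributed bookkeeping.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One training step; launch the script with the deepspeed launcher (e.g. `deepspeed train.py`).
inputs = torch.randn(8, 4096).to(model_engine.device).half()
targets = torch.randn(8, 4096).to(model_engine.device).half()
loss = torch.nn.functional.mse_loss(model_engine(inputs), targets)
model_engine.backward(loss)
model_engine.step()
```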
Thank you, Ali. I hope you enjoyed this presentation about AI at Scale. If you want to learn more, I encourage you to go to aka.ms/AIatScale, where you will find a lot of information about the technologies, the research, and even examples and access to the latest technologies that are making this new generation of AI possible. You can also reach out to me with any questions at davidcsa; that's my Twitter, LinkedIn, and email alias at Microsoft. If you have any questions for Ali, send them to me and I will chase him for the answers. Thank you very much. Bye.