# UCREL CRS Steve Mander

We'll go back to you, thank you.

Amazing. So hopefully you can all see the slides I've shared. My plan for this talk is that I'm going to go through from the motivations for my research to what I'm actually doing and some of the findings from it. Feel free to jump in at any time with questions and we'll just see which rabbit hole we end up down. For me this is trying to get two years of research into what I'm aiming for, which is half an hour of talking, so I've been deliberately vague in places. Please jump in if you want things clarifying.

So we're going to start with what contrastive training is. I don't know how many people in this room are from computer vision, but a big part of computer vision is the ability to see an item and sort it according to what that item is. Think of the problems that gives: I can recognize that I'm looking at a computer screen, as I imagine many of you are, and yet you might also classify it as looking at a presentation, and depending on what I'm doing in this presentation you might go "oh, there's a human there, there's a slide", and yet these all exist in the same visual space.

Now this got solved a couple of years back by something called CLIP, which is this diagram in the top right, where what I've labeled as L and I are a language input and an image input. What CLIP does is you've got two encoders, one for language and one for image, and it trains those representations to each other with cosine similarity. What this means is that there is no supervision from labels: all the computer gets is a language prompt that matches a given image in the set, and it learns to emphasize the correlation between that input of an image and that input of language over all the other inputs of image and language. This built on work with visual semantic embeddings, DeViSE, and [unclear] in the computer vision space called GLOM, which are the previous works to
this, which look at doing multi-level image processing. The power of CLIP, in so far as being the backbone that's gone on to train models that you might recognize, such as DALL-E, and I think it's even used in Stable Diffusion still, is that this model learns a level of computer vision that did not exist before it. The power essentially is that, because you're not telling it what to look for but just asking "does it match or doesn't it", you get a much broader set of interpretations. One of the early papers that came out from this was talking about what's quite confusingly called multimodal activations, i.e. that the model can learn the logo of something as meaning the same as the actual thing. So you could show it, say, the actor that plays Spider-Man and the Spider-Man logo, and it would get the same activations, which is quite exciting, because at the time it was thought that only humans had this ability. So that's what CLIP is, as a start.

However, consider how this applies to our traditional language training, which you'll recognize in the bottom left, where you've got a language in and a language out. Usually this would be a masked input, so you're learning what the tokens do, and then you compare your output to your ground truth and hopefully that loss guides your selection. The problem, as many of us may know, with this very basic pipeline is that the individual tokens you're predicting are very prescriptive of a model. If you get a token wrong you're just objectively wrong, and whilst you can emphasize the correct one in your training, and that's what the gradient process does, you don't get the semantics carrying through. So if you've got a near miss, for example if you're doing a question answering task and the answer it wanted was "yes" and your model output "yeah", you'd be marked wrong, and yet the semantics of what you're saying are correct.

This prompted an early foray into trying to use images to guide that gradient, so what you can see in
the bottom right of this slide is an approach that splices CLIP into that pipeline, where instead of using a ground truth language we're using a ground truth image. You'd use this pre-trained model to say: if I've got a language input and a language output, I'm then re-encoding that output into the CLIP framework and comparing that to the image. The idea is that prompts that are semantically similar would hit off on this multimodal activation within the CLIP model and you'd get the correct output.

There are a few problems with this approach. The biggest one in trying to implement it is that the language that comes out of an encoder is not the same as the language that can go into a pre-trained CLIP model: there are tokenization problems. At a basic level this can just be that it's a different encoder, the words aren't the same, the tokens aren't the same. But as you dig into that, the scheme of encoding is really important, especially when you start considering problems around low resource languages. In a low resource language, the chance of your tokens being the same as in any other language is really slim, let alone the problems that exist in trying to do that conversion between different models while preserving a gradient. If we go back to this slide, this conversion of language out to L does not always preserve a gradient, because you can go "well, I can convert language out to English, or whichever language you're using, and I can then re-tokenize it", right? But as you then try and train that model, this loss of gradient means that the loss function that exists with cross entropy loss here does not carry back into the encoder you're trying to train. There's a lot of research that's gone into "oh yeah, this doesn't work", so feel free to ask me about that at some point in the future.

However, here's the next part of where this took me. The idea is that, instead, what you'll recognize in the center of this is again the same CLIP model, but using our
second language encoder to add an extra dimension to these logits. So in a CLIP model, where you'd normally take the diagonal of this contrastive training and emphasize it, we're now instead taking the diagonal of a cube and trying to emphasize it. Ideally our three inputs this time [unclear], language and image and now another language, are all correlated, and if we know that, then this training for all three encoders should work well, and we can even use a pre-trained encoder for English if we've got inputs that fit.

In three dimensions, in the bottom right of this slide, you'll see what this looks like in practice. You can see a white square on the bottom left of this cube and on the top right, and that is where the diagonal is going through the cube; those are the items that exist in each batch. However, you'll also notice on this cube that the other items are not uniform. The squares along the edges, where you can see a slight faint diagonal, are a light gray, because those items share at least two similar coordinates. You can see that on this front face there is a light diagonal that goes one-one, two-two, three-three, four-four and so on through that cube, and the edge on row zero corresponds to the top squares' row zero as well. So the idea behind this is that we've got certain points in this cube where we're expecting a better correlation, but not a perfect one, and my research is looking at whether that's actually useful for training a model.

The fun of this is that this approach scales up into four dimensions and five dimensions and six dimensions. In the top right we have a representation of what this looks like in four dimensions, plotted as a square of squares to try and get those four dimensions mapped on. Going down the diagonal from top left to bottom right, there are the smaller squares that each have a white spot in, and that corresponds to the same diagonal through a
four-dimensional cube. I was struggling to plot this for five and six dimensions because I just do not know how to visualize that in a way I can easily convey, but we'll get on to how and why that works.

The applications of this approach are several. Firstly, we can use pre-trained models as guides for the logits of other encoders. So you could say that if I believe in universal language theory, and I can assume that an encoder such as CLIP's represents all English interpretation as it is visually grounded, and I know that other languages have that subset, then I can assume that my pre-trained model can easily guide these other encoders. You could implement this with a second optimizer in training, or as an additional weighted loss metric, and I've tried both of them at various times and they work to varying degrees depending on the method I've used.

Before I go on I'm going to talk a little bit about evaluation, because this is where stuff gets interesting. You'll notice that in this model there are no decoders and there's no ground truth going in, by which I mean there are no outputs that are human readable. So I can't just compare it token by token to something that I know is correct, which poses some interesting problems for evaluation.

One of the first approaches that I looked at implementing is this, which you'll see as this graph, ViT-Base versus ResNet. This lovely heat map represents the activations through the layers of each model, and what it's showing is that layers early in one model are roughly a better match to layers early in the other model, which is where you have this nice diagonal glow through it. The fact that one is a Transformer and one is a convolutional network, if I'm not mistaken, is borne out in this quite grid-like structure, where both of them have either a convolutional or a key, query, value set of layers and
therefore both have these sort of regular vertical and horizontal lines that match. If I compare this to my training, and you'll see the all-black and all-pink graphs that have been produced, I have some differences there. I'm not gonna lie, it just hasn't worked. Part of the problem is how expensive centered kernel alignment is to try to use in training: to get a meaningful approximation you're looking at something like 10,000 samples, and you're logging every single weight that your model is producing for every single step across the entire model, then performing mathematical reductions on that. I've had to take some shortcuts in the code, and I've tested every single one of these to show that they're doing exactly the same mathematical function, and yet I'm still getting this result, so I'm 99% sure it's not an error in the code. What I can infer, though, is that the weights of my model are just fundamentally different to those in the stock CLIP model, which is what I'm comparing against.

In light of this I've used linear probes. A linear probe, or linear regression probe as it's sometimes referred to, is where you take the representations generated at each layer and pass them through a linear regression, and you basically just record the best output you can get. You'll see here this text probe that I've got from one of my models, where you can see that as training goes forward, the predictor on the text encoder, and how good it is at picking just the best class for an image, improves as training progresses. In the original CLIP paper they use these, showing that after about 33 GPU years they get a good result, which is obviously something I think I can improve upon.

Just for good comparison to that paper I've also used a validation loss, which is using their stock method, hence the name of that graph. That's looking at: if I was just using one language input and one image with the same model as, sorry, same method as
originally posed in the CLIP paper, but with my own trained encoders, what does the loss look like? I'd expect that to start to dip quite early and come out around 0.1 to 0.2, which is what the normal pre-trained OpenAI version of CLIP gets to with the same data set as I'm using.

The theory that I'm going to be evaluating is how well the logits in a model track to performance. One of the big breakthroughs of CLIP is the idea of in-batch negatives. Depending on your NLP background you may be familiar with the term in-batch negatives, and usually in an NLP context it means mapping inputs with a label of "is this correct or is it wrong", and starting to teach a model to predict how good or how bad the answer will be and to know whether it's right or wrong about something. Whereas the way that CLIP is trained means that every item in the batch that does not correlate between the two inputs is used as an in-batch negative. What this really means is that you're sort of a slave to the hardware that you have in terms of scaling laws. So if you think about a grid of 64 by 64,
you've got 64 good samples in there and 64 squared total samples, whereas if you're able to scale your batch size up, your ratio suddenly gets an awful lot better. The theory behind what I'm doing is to say that, if I go back to the picture of this cube, the volume of that cube compared to the diagonal is now so much better than just the square of a face. So my expectation would be that, with the number of logits that I've suddenly got there being batch cubed, compared to the diagonal, which is still the length of the batch size, that ratio just gets better and better and better, so I'd expect training to be better. That's one thing that I'm aiming to test out.

Another problem that I have is that the behaviors beyond three dimensions start to get really interesting, and the graphs that I've got on this slide are a slight insight into that. With two dimensions, the coordinates of the item in each batch, or the index, either match or they don't. You have two set behaviors: there's one thing you're maximizing and one thing you're minimizing. When you get to three dimensions, as I briefly talked about, you have three behaviors: they either all match, they all don't, or there's a partial match in there. When we get to four and five and six dimensions this gets murky. You can see on this graph that we saw earlier, of what four dimensions looks like, that there's actually quite a few different colors in there. That's because you have a match of all of them, a match of none of them, a set where two coordinates match and the other two don't, or where two coordinates match and the other two match each other as well, and then you've got a set of three and one as well. So what you find is that as you start to give this cube more dimensions, the number of distinct regions that have to be modeled increases, and these graphs in the bottom right model exactly what proportion of the volume of that cube each region
represents. You can see that as the number of dimensions n, which is on the bottom axis, increases, the mean proportion goes down. However, when plotted by batch size, which is the bottom graph, there are some funky behaviors in there too. For example, if you imagine the extreme cases where you have a cube of size one, you only have one and then nothing, and the same in two and three and four dimensions. And if you scale this up to a cube of, for example, an infinite size, then the volume of coordinates that match no other coordinate tends to infinity compared to all the other factors. So those are all of the avenues of research that I'm actually pushing at as well.

I'm going to move on to look at some experiments that I've gone into, which, as I've already slightly hinted at, are looking at what happens as you start to scale these logits into more dimensions. We've talked about region explosion. Another phenomenon that I've looked into is something called James-Stein estimators, and this is a quirk that emerges from three dimensions upwards, where if you took a value just randomly sampling a space... in fact, I'll change to the blank slide and we'll model this.

So if you have a normal distribution like that, and you're told that this central value is mu, and you're given a random sample from this distribution, let's say you're given this value here, then the assumption, based on a single sample, if you don't know what this value is, is that that is your best estimator. If you're talking in two dimensions, where say you've got a normal distribution going this way as well, if you'll excuse the drawing, and you're given a y value to estimate this average, then what you'd normally say is that your first estimate and your second estimate are your best estimators for those intermediate values. However, if we add a third dimension to this graph, going that way, then if we had a third value on there, those three values are no longer the best estimate for a coordinate that represents
all three means in this space, and as you add four and five dimensions that becomes more and more accentuated. So there are quirks of having more dimensions that are not intuitive, and part of what I've been looking at is whether this follows for embeddings as well. If we assume that our encoders create an approximation of the real world, does that approximation get better if you start to apply some of these mathematical concepts around estimators to them? By and large I haven't had any meaningful results, at least nothing statistically significant.

The other thing I've started to look at is optimizations that I can do by region. As we've talked about, there are multiple points in this cube that behave differently, and I can mask them out and plot their loss separately. These graphs are using mean squared error, so essentially the absolute distance to the ideal value. You can see the regions where most of the coordinates are the same, which is what this one represents: the one on the left is where all the coordinates are the same, the one on the right is where none of them are the same, and you can see how those two graphs are practically inverted from each other for the same training set. One of the things I've been looking into is how well I can use this inversion to improve or speed up training.

What I've noticed is that at the start, when every single value in that cube is random, it's just the internal diagonal that has any bearing on training. If you think about it, random behavior is perfect for the vast majority of that cube, because what I want is that the fifth item of batch A does not match at all to the fourth item of batch B, and that neither of them have any correlation at all to the third item of batch C, for example. So to all intents and purposes, being random is a pretty good approximation to that behavior, and they're behaving perfectly. So that behavior I don't
actually want to learn from initially. What I would like to see, and what I wouldn't get randomly, is that the first item of batch A matches the first item of batch B matches the first item of batch C, and that's something that I want to emphasize from a gradient. In my research I've tried a couple of things, looking at whether I can artificially weight those parts of the graph differently, or whether I can give it a large parameter of weights, using something like dividing by the norm so that it can't just learn that if it sets all the weights to zero it gets a zero loss, and whether something like that teaches it well. The idea is that by the time that parameter has learned to move over to emphasize the parts of the cube where the loss is already perfect, I'm hoping that the model has then learned from that diagonal and is starting to need those other parts as reinforcement. The results that I've had from that, which hasn't yet been a comprehensive test, just the first few initial experiments, have been that it's promising, but syncing those up is really, really difficult and it's quite an unstable method.

I'm going to talk a little bit now about how I'm actually doing this comparison, because we've talked about the rough ideas behind these statistics and what I want to match in different places, but not actually gone into it. These are four graphs, and on the bottom left we have cosine similarity and what it looks like in practice. Anyone who's familiar with this method will know that it is simply normalized A times normalized B, and that means that you have vectors with values usually between one and minus one that you're just multiplying. If they match you get a positive answer and if they don't you get a negative answer; at least, that's what it looks like for simple integers, and when you plot it that's the kind of heat map that you get. This is what CLIP, the original model, uses, where you're essentially saying I
want the vector output of one encoder to match the vector output of a second encoder. However, when we scale this up into three dimensions it is no longer this simple. What we actually want is a function that matches this cube in the top left, where if all three of them match we again get a nice pretty heat map in that corner, and if all three of them match as negative we get a heat map in the opposite corner, and nothing on the other corners. This is what I've been working on, because you can't do the intuitive thing of just adding an extra term to cosine similarity. So this heat map in the top left is actually tracking the difference between the hypotenuse of the rectangle that is vector A, vector B, vector C and the hypotenuse of the mean of all those three values multiplied together, and I can draw this out geometrically if you want when I'm done.

The problem that this poses is that this is nice in three dimensions, but how do you check for four, five and six? As I've already said, I can't visualize that in my head and I wouldn't expect anyone here to be able to, and if you can, please let me know because it'd be super useful. But this is what I came up with in the bottom right, which is to say that if I took a diagonal slice through that cube in three dimensions I would get a square, and if I took a diagonal slice of that square I would get a one-dimensional line. Whereas if I now do this in four dimensions, a diagonal slice of four dimensions is a cube, and then I can diagonally slice that and get a square, and that's what I've got down here. This shows the heat map that is going through the diagonal of a four-dimensional cube, and this way I can check that this heat map actually aligns, very loosely, to what I would expect of cosine similarity. I've been able to show that gradient descent on this does in fact push things to those two extreme corners. So this is what I'm currently experimenting with, and stuff's currently running somewhere on campus,
trying to check that this model actually behaves under six dimensions.

So if I go through the results that I've got so far, one of the big things that has come out from this is that balance must be maintained between modalities when using extra dimensions. What I mean by this is that where I've talked about having an extra encoder in this diagram, for some of my experiments I'm actually using the same encoder with multiple inputs that I'm then laterally splitting up. If I do not balance these at all, and just take the mean of the gradient and apply it back equally to all encoders, I get a problem where the language learns significantly faster because it has more data, and that's actually quite harmful for the image gradients. So I've had to, in places, artificially adjust that balance by sort of saying "look, you take a fifth of this gradient, you take the remaining four-fifths".

The second takeaway point I've got so far is that doing the mean across different regions and across different labels does not have any particular effect on this training methodology, which is a surprise to me because I expected it to. This is particularly so with what I've termed true labels: instead of using ones and zeros for "use this region, don't use those regions" in the cubes that I've got here, I'm running some sample values through it and going "ah, these parts of the cube tend to this value in a perfect case, I'll use that as my label, as a probability". That has no bearing on gradients, it turns out, which somewhat surprised me, because I assumed that if you could tell the model the rough variance that was expected in each value, that would help, but it doesn't.

The next takeaway point that I've got is that effective logit calculation is sensitive to changes in gradient. If we go back to this graph of different similarity metrics, the difference between bottom left and top right is incredibly significant when you start to apply it to gradients, where you can see the bottom left
has quite a steep curve into that corner, and it's much more gentle in the top right, which is actually the function I'm using. In two dimensions this function doesn't behave nearly as well as this one, and that can be quite significant when you then start to scale it up. I wasn't expecting such a sharp, pronounced difference between these different approaches.

My fourth takeaway is quite a simple one: these models only seem to work well when you've got more than 14 layers. You start to get results at about 12 Transformer layers, but 14 seems to be the sweet spot where it has enough depth in the network to start picking up the oddities of language and everything in an image.

The fifth one we've already loosely talked about, where weighting by region varies over training and cannot be easily learned. That is to say that as you go into more than three dimensions, the number of different behaviors you have in that cube is not something that I think can be manually trained, just based on the number of parameters you'd need. It would be too complicated for a human to sit down and work out the maths, I reckon. Maybe some super buff in here could do it; I certainly can't, just because of the number of regions that start to explode. I think in six dimensions you're up to 21, off the top of my head, different behaviors that you'd expect in that cube, and that's from things like: I've got five out of six coordinates matching; I've got four, and the other two do or don't match; I've got three, and then within the other three there's a two and a one, or the other three match, or none of them match; and so on and so on. There are lots of different combinations, and trying to manually adjust those according to what you think the model needs is really difficult.

However, the big takeaway that I have got is that CLIP models can be fine-tuned within a few days, seemingly correlating to the number of logits that are present, and this is really significant because the
original model, I think last time I sat down and worked it out, took about 33 GPU years, or if you're doing it at the scale they're doing it, you can have 512 GPUs churning away for about three weeks and you get your training in a reasonable amount of time. However, as many of us will know, at a university you don't have access to 512 GPUs, and if you're lucky you can normally swing one or two. So being able to downscale this to a sensible size, or at least get it within a single GPU's amount of training, where within three or four days I can fine-tune a model on a data set, is really significant.

Adding to this, because again university scales are different to enterprise scales, the data set that I've been using is every single split of MS COCO, which, as people know, is one image to five captions, give or take, and I've looked at it and gone "oh well, there's my six dimensions, nice and easy". The other positive quirk is that I know that this data set sits in about 200 gigabytes of memory, which again is more than average for most workstations, but it sits well within the remit of what most researchers can wrangle. So it's a really exciting result for me to be able to show that, in having lots of different ways of shuffling this data and passing it between different encoders, I unlock the ability to start to generalize language a lot better from such a small amount of data, compared to the many, many, many terabytes of website data that were originally used. However, I would add the caveat that if it's not in the training set I can't expect the model to know it, which starts to really apply when you have smaller data sets.

But this really opens up the avenue for future research and an emphasis on having multiple correlated points within a data set, as well as starting to look at different ways to scale that. And as this applies to low resource languages, which is the way I'm hopefully trying to push this research, it's much, much easier to find data from social
media, or even if you're trying to manually source the annotations, where you just say "caption an image", you often get a lot more varied responses than when having to manually translate stuff and make sure it's a gold standard annotation and whatnot. So hopefully this will go somewhere and really open the door for getting other languages on board and more evenly represented within the AI and natural language space.

So that roughly brings me, 40 minutes in, to the end of what I've prepared. Thank you very much for listening, and I welcome any questions. If there's anything you want me to expand upon I'd be absolutely delighted, and really sorry this has been a very vague whistle-stop tour through everything. Any questions?

Yeah, fantastic. Thank you very, very much, Steve, and thank you for a very wonderful talk. We've kept to time, so we have room for questions. If there is any question coming in, please feel free to either unmute yourself and ask or put it in the chat. While we wait for other people to ask, I just want to ask you a quick one, just to confirm. So far you've been playing with MS COCO, right, with images to text and text. I know you probably haven't done that, but have you considered the complexity this method will introduce if you include other modes of data, like audio and video and so on? Have you thought about that, or have you explored that in any way?

I've not explored it too much. However, from what I've read of other approaches, there's a lot of ways of doing this. Some of that is just that I've not got the spare compute power to suddenly store an extra modality of video and whatnot. But there's a really interesting repository that can be found online looking at applying multilingual BERT models just straight into this. Instead of training it as a whole pipeline, which is what the CLIP model does, where you give empty weights to your encoders and
then start training, it looks at learning just the projection from a pre-trained model into that cube, and that could be a really interesting avenue when starting to add those other modalities, because it means that you don't have to train everything in one pipeline. The problem with that is that I haven't seen any research into whether the multimodal activations that you get from this pipeline start to play out in other modalities that are trained like that. That's the thought.

I think it might actually be a lot more complex at that level, but it is quite interesting. Is there another question? I don't want to dominate this discussion. Yeah, Paul.

Thank you. This might be a very silly question, because my brain's not entirely up to speed yet, it's my first time back in the office. I kind of misunderstood, or didn't really get, what you were saying about the transition from two dimensions to three and beyond. You were explaining with the normal distributions on your drawing why that's more difficult, and I guess it's a similar point to where you were talking about cosine similarity beyond two vectors. I hadn't looked at that before, but just Googling around, there are ways to do that already, which you presumably have looked at. So why is it inherently harder when you're switching from two to three? What's the jump I'm missing?

So if we just draw a truth table of one, zero, minus one against one, zero, minus one for two dimensions, the products that we get from those are: for this pair we get one, we get zero here, and we get one here. Excuse the fact that I've just, on autopilot, put minus one, getting ahead of myself. And again on the diagonals of here we get lots of zeros when we start doing those cross multiplications. Whereas if you have three terms in here and you again do that same product vertically down
that table, this becomes minus one despite all three terms being similar. So that's the base level of why it just doesn't behave correctly.

One of the early experiments I did uses a wonderful function in PyTorch called einsum. What this lets you do is just state the shapes of the tensors that you have and the output shape you want. So you can say: if I've got encoder A and it's got shape a, f, and it will read this as batch times feature space as the output of an encoder, I can then specify b f, c f, d f, e f, and I've already used f so let's go for g f as well. I can literally pass it this as a string and say I want the shape a b c d e g out, and then I give it these six tensors and it does this multiplication and summing along feature space for you. What I was finding is that this just does not converge at all, because there is nothing that forces dissimilar items together. I can further expand this by saying that if you have four terms in here, suddenly you can get one of the outputs in this column, but I don't know whether two or more are minus one, right? So that becomes quite a big problem as you get more and more terms, where you just lose the gradients in that calculation because the model doesn't know which way to push items, and you can get really big variance between what should be very similar vectors based on this.

Okay, got it, thanks. That was a great explanation.
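
As an illustration of the einsum call described in the last answer, here is a minimal sketch. It is not the speaker's code: NumPy's `einsum` is used in place of the PyTorch function he mentions (both accept the same subscript string), and the batch size, feature size and random inputs are placeholder assumptions. The second half reproduces the truth-table point, that a product of three matching negative values comes out negative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, features = 4, 8  # placeholder sizes, chosen only to keep the example small

# Six hypothetical encoder outputs, each of shape (batch, features).
encodings = [rng.standard_normal((batch, features)) for _ in range(6)]

# One subscript pair per encoder; the shared feature axis "f" is absent from
# the output, so einsum multiplies elementwise and sums over it.
logits = np.einsum("af,bf,cf,df,ef,gf->abcdeg", *encodings)
print(logits.shape)

# Sign problem: identical all-negative vectors "agree" perfectly, yet the
# product is positive for two terms and negative for three.
v = -np.ones(features)
pair = float(np.einsum("f,f->", v, v))
triple = float(np.einsum("f,f,f->", v, v, v))
print(pair, triple)
```

With an even number of terms the agreement between negative vectors shows up as a positive score, but with an odd number it flips sign, which is one way to see why the plain product gives the model no consistent direction to push in.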
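
The batch-ratio and region-counting arguments from the main talk can also be checked by brute force on a small cube. This is a sketch under stated assumptions: the function names are hypothetical, and a "region" is assumed to be identified purely by how many of a cell's indices coincide (for example, three equal and one different), which is one reading of the description in the talk rather than the speaker's own definition.

```python
from collections import Counter
from itertools import product


def diagonal_ratio(batch: int, dims: int) -> float:
    """Fraction of logits on the main diagonal: batch cells out of batch**dims."""
    return batch / batch ** dims


def region_sizes(batch: int, dims: int) -> Counter:
    """Count the cells in each region of a batch**dims logit cube.

    A cell (i, j, k, ...) is labelled by the sorted sizes of its groups of
    equal indices, e.g. (3, 1) means three indices match and one differs.
    """
    sizes = Counter()
    for cell in product(range(batch), repeat=dims):
        pattern = tuple(sorted(Counter(cell).values(), reverse=True))
        sizes[pattern] += 1
    return sizes


# Diagonal versus total logits for a batch of 64 in two and three dimensions.
print(diagonal_ratio(64, 2), diagonal_ratio(64, 3))

# Number of regions and their share of the cube for two to four dimensions.
for dims in (2, 3, 4):
    regions = region_sizes(batch=5, dims=dims)
    total = 5 ** dims
    shares = {pattern: count / total for pattern, count in regions.items()}
    print(dims, len(regions), shares)
```

Under this definition the sketch finds two regions in two dimensions, three in three and five in four, matching the behaviors listed in the talk, and the all-match region always contains exactly `batch` cells.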

*2023-04-27 02:51*