Tech Talk: Redefining trust and responsibility in the generative AI era


Thanks everyone, it's a pleasure to be here. My name is Paolo Faraboschi and, as the slide says, I lead a group of researchers at Hewlett Packard Labs, the organization that does advanced development and research at Hewlett Packard Enterprise. I'm here to talk about things that people don't normally tell you about language models and the risks associated with them: the evolution of how we got to where we are, the principles we've developed to make these models behave in applications that are critical in the enterprise, and some of the technologies we've been developing in Labs to help evaluate, refine, and recommend models, and to help people reason about the big variety of options out there. I'll also give you some examples that you can find on the showcase floor in case you want to follow up.

If you're an enterprise right now, everybody is thinking about how to deploy AI in their use cases. Once I've identified a use case, which model should I pick? Should I just use a cloud-based API, or go with one of the open source models? Are those enough for me? Is this model production ready, and what else is necessary around it? What about the governance I need to put in place? How am I going to use the model, what kind of risk am I exposing my operations to by using it, and what do I need to get started? Hopefully, by the end of this talk you'll have some ideas about the challenges and about some of the technologies. Some are still in research, so they're not something you can adopt tomorrow, but we welcome the conversation, and together we can certainly advance the field.

If you heard Antonio's keynote yesterday, he went through some of the principles we use internally within HPE to assess AI projects, even for our own internal use. I co-chair our AI Ethics Committee, and in the last two years we have seen an enormous surge of new projects: two or three years ago we'd see maybe one a month, and now we're getting on the order of a hundred use cases a month that people submit for review. We look at these through the lens of five principles.

First, an AI application should be privacy-enabled and secure. You want to make sure that the data you feed to train the AI, or that you use in its operation and deployment, doesn't leak, and that you limit your risk depending on the kind of data you're training on, especially if you're in a regulated industry.

Second, it should be human-focused. This is the fundamental principle of requiring a human in the loop when AI is part of a decision process. You usually don't want AI to take an autonomous decision, especially in mission-critical applications, without a human in the loop, so AI becomes much more like an assistant that gives you a set of options from which a human picks and holds the ultimate responsibility.

Third, inclusivity, which is all about bias. You want to make sure, and I'll come back to this with an example from financial services, that if you're using certain datasets to
train your models, you understand their statistical properties, so that you know in which dimensions they are biased and can make sure the models are inclusive of the kind of behavior you want to capture.

Fourth, you want models to be responsible, and responsibility is largely about being able to explain what a model does. Especially when a model makes the wrong decision, you want to be able to trace back to the cause of that decision, whether it goes all the way back to the training, and then perhaps fix it by providing additional training samples or by putting the right guardrails around the decision.

Finally, something we think is very important and that is often overlooked: the robustness of a model. Have I built the model in a way that I understand where it breaks? Because everything is going to break; no model is perfect, and I'll show you an interesting example in a second. The question is whether I know how fragile it is for my application, whether I can quantify it and engineer it so that I understand the robustness. The first thing is you have to measure it; then you can try to fix it.

In Labs we have several research projects in this area of what we call responsible and trustworthy AI. We're looking at ways, and I'll talk a lot about this, to design and synthesize models for trust. We actually started this work five years ago, before generative AI, so most of the initial work was designed for classifiers in computer vision, and I'll give you a quick example of that. We want to design models to be trustworthy, we want to understand and measure trust, and we also want to pay attention to the data layer, because data is what fuels AI: across the processing pipeline you want to be able to go back and version the data, and we've built several platforms around that.

Here's the example from a few years ago. This is image classification; we're using animals here, but if you're in manufacturing you can think of this as a camera that spots defects in whatever you're making, whether it's a chip or food or anything like that. It's hard to see on the screen, but the little box on the bottom right, the dark one with a few pixels, is noise that we're adding to the image through another AI engine, one that tries to find the smallest amount of noise you need to add to an image to break the classification. In this case you can see that only a few pixels change the classification from a wild cat to a cougar. The model was designed and trained on a set of images, but you only need to add a handful of pixels, about 13 in measured distance, to make it break. This model is brittle: imagine that this could be dust, drops of rain, or anything on your sensor; it can cause the model to break and misclassify, with consequences that may or may not be severe depending on your application.

So what we've done, effectively, is use machine learning and AI to enhance AI. The first thing we built is a smart agent that tries to find the minimum amount of noise needed to break an image, and you can even design the type of noise. The idea is that we should be able to measure these things, not just cross our fingers and hope it works. We detect the vulnerable classes among all the things you're classifying and give you a number that says how vulnerable your model is. But we can do better than that: we can use the same technique to generate synthetic data, feed it back into the training set, and hopefully produce a more robust model. In computer vision we've done exactly that: we closed the loop, generated a batch of new synthetic images, and now we have a model that will still break if you change enough pixels, but it takes a lot longer and a lot more noise, noise that is unlikely to occur in a real situation. Compare the numbers: before, the distance was about 13; now it's about 30, which means this model is roughly three times more robust than the previous one. This was work we did a few years ago in computer vision, but as you'll see, we're applying similar technologies in the new domain.
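As a rough illustration of this measure-then-harden loop, here is a minimal sketch in PyTorch. It is not the Labs "smart agent" described in the talk: it uses a plain gradient-sign (FGSM-style) perturbation and treats the smallest perturbation size that flips the prediction as a crude brittleness score. The `model`, `image` (a CHW tensor in [0, 1]), and `label` (a scalar long tensor) are assumed to be an ordinary image classifier and a correctly classified sample.

```python
import torch
import torch.nn.functional as F

def smallest_breaking_noise(model, image, label,
                            eps_grid=(0.005, 0.01, 0.02, 0.05, 0.1)):
    """Return the smallest perturbation (from a fixed grid) that flips the prediction.

    Generic FGSM-style sketch of the idea in the talk: the perturbation size
    that breaks the model serves as a rough robustness score (the "13 vs. 30"
    comparison mentioned by the speaker).
    """
    x = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    direction = x.grad.sign()          # per-pixel direction of steepest loss increase

    for eps in eps_grid:
        adversarial = (image + eps * direction).clamp(0.0, 1.0)
        with torch.no_grad():
            if model(adversarial.unsqueeze(0)).argmax(dim=1) != label:
                return adversarial, eps
    return None, None                  # robust up to the largest eps tested

# Closing the loop: adversarial images that break the current model can be
# added back into the training set ("generate synthetic data, retrain,
# re-measure"), which is the cycle that moved the talk's model from breaking
# at a distance of roughly 13 to roughly 30.
```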
Now, since a few years ago the world has changed. Back then we were talking about task-specific AI models trained on labeled data. Since then, the world of AI has really boomed around unsupervised training, which means the models most people refer to as foundation models or large language models. They are trained once, or a few times, on an enormous corpus of unlabeled data; they try to find correlations and create structure around that unlabeled data, so in a certain sense they're harder to understand than some of the classifiers, and we had to adjust many of the techniques we developed to fit this new world.

This is the pyramid of AI use cases that we normally use to explain to users where they fit. At the top there are very, very few folks who can design a new model; there's probably a handful of institutions and companies that could build a new transformer architecture. Then there are a few more, but not many more given the cost of training, who can pre-train a new model. At the other extreme, there are a lot of people who can use out-of-the-box versions of the popular models out there. What I'm going to concentrate on is the big opportunity in the middle: you're an enterprise, you have identified a use case, and now you have to pick a model, fine-tune it, engineer the prompts and the system around it, and perhaps chain a few things together to fit your specific use case. That's the typical deployment pattern I'm going to focus on; I'm not assuming you're going to design a new GPT, and I'm not assuming you'll just use something out of the box. The most interesting part is in the middle.
I'm not going to spend too much time here, but a lot of use cases for foundation models are about content generation, assistants, retrieval-augmented generation (basically parsing all your documentation and being able to ask questions about it), personalizing recommenders, and so on. Another dimension that has recently emerged is the whole movement in the open community to generate open models. When people talk about open models, there are really two classes: open source models, where people have published the code in a repository like Hugging Face, and open-weight models from companies such as Meta, which trained the Llama family of models and make the trained versions available with open weights, so you can actually use the trained model. If you look at the evolution over time, with time on the x-axis and quality on the y-axis, the open models have become really good. They're not quite there yet compared to some of the closed proprietary models, but they're getting better and better over time. So a lot of users are thinking, perhaps I can start using some of the open models, the advantages typically being cost and the fact that I can flexibly change them, as opposed to closed models where I don't know how they were trained and I have to consume them however the model owner decides.

The challenge is that if you go to Hugging Face today there are over two million models, and of those two million there are probably about 100,000 that are variants of language models. So now you have to hire your assistant, and you've got 100,000 résumés to sort through. What we're trying to do is give you an idea of how you can sort through these models and pick the one that's best for the job you're trying to do.
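Just to make the scale of that résumé pile concrete, here is a small sketch that enumerates candidate models programmatically. It uses the Hugging Face Hub client; the exact parameter names (`task`, `sort`, `limit`) assume a recent huggingface_hub release, and sorting by downloads is only a crude first cut, not the recommender described later in the talk.

```python
from huggingface_hub import HfApi

api = HfApi()

# Pull the most-downloaded text-generation models as a first pass over the
# "100,000 résumés"; real shortlisting would also filter on license, size,
# domain tags, and the benchmark scores discussed later in the talk.
candidates = api.list_models(task="text-generation", sort="downloads", limit=20)

for model in candidates:
    print(model.id)
```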
The other complication is that bigger is not always better; even the biggest models are not always completely reliable. Here's an experiment I ran on a popular generative AI model. I asked it to generate an image of a room with no elephants, with less than one elephant, and this is what the model gave me. Then I said, not quite, try again, I really want a room with no elephant, and the second time (this is a session I actually recorded) there was still an elephant. Then I said, look, there's something in there I don't want to see, and the model thinks about it and now gives me two elephants. This is a very simple prompt; a five-year-old understands the notion of a room with no elephants. The problem is the attention mechanism, which is the basis of transformer models. It's the "don't press the red button" syndrome: the first thing you think about is the red button. The transformer focuses its attention on the word "elephant," and negatives are hard to map. If you try this ten times, and you can try it yourself, perhaps a couple of times it will get it right. What I'm trying to convey is: imagine translating this into an application that requires some critical thinking, where a mistake matters and causes bad behavior; this is clearly not acceptable.

So models, even the big ones (this was a 500-billion-parameter model), don't always get it right, and you always have to think about what you're going to do to make a model fit into your application. Sometimes bigger is not better and you need to do something else, and I'm going to give you some ideas of what the issues are. Sometimes it's even how you phrase the prompt and what context you put around it, and sometimes there are other side effects, like bias or toxicity, that escape your typical prompting.

So let's walk through an example. I want to create an application, based on language models, whose job is to give me an assessment of a loan application from a user. I'm a financial institution, a user requests a car loan, they send a bunch of documents and fill in a bunch of forms, and I want an LLM that gives me a recommendation: this is a good loan, a bad loan, or one that needs a closer look. The first thing I'm going to do is pick a model. I have different parameters in terms of how much I'm willing to spend on the model and its deployment, I have thousands of models I can pick from, and I want to find the one that fits this domain and figure out how to put it all together. Is a model relevant, is it good for this particular use case, what do I care about and not care about, are there any hardware requirements, and how do I quantify risk? That's the problem I'm trying to solve, and it's one that many of you, or many of your application teams and lines of business, are probably thinking about today: I have a use case, I think I have the data, now what do I do? Which model do I pick, and where do I start?

So we put together a pipeline that is effectively a recommendation, an evaluation, and then a refinement. You first want a recommender that, based on the way you evaluate the models, gives you the top five, ranked in some order you can pick from. Remember human in the loop: I'm not going to hand you one model; I'm going to give you a choice, and then you pick. Then I'm also going to give you a technique that combines multiple inferences in a way that refines the output of the model, to try to compensate for some of its issues. If you think about it, going back to the analogy of a thousand candidates for an assistant job, the way you do it is what anybody would do in that position. The first thing you do is look at the résumés: that's the evaluation part, and depending on the type of assistant, you focus more on one particular aspect than another. If I'm hiring an assistant who needs to be an expert at Excel or some other math tool, I look for those skills, and similarly here we have a bunch of benchmarks, and based on the use case we select the benchmarks. That gives us the evaluation.
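Here is a minimal sketch of that résumé-screening step: ranking candidate models by a use-case-specific weighted combination of benchmark scores. The model names, benchmark names, and weights below are invented for illustration; the actual Labs recommender uses its own ranking and qualitative assessments.

```python
# Hypothetical benchmark scores, normalized to [0, 1]; in practice these come
# from running (or looking up) the benchmarks selected for the use case.
scores = {
    "generic-llm-70b": {"accuracy": 0.82, "fairness": 0.60, "explainability": 0.55, "cost": 0.30},
    "finance-llm-8b":  {"accuracy": 0.78, "fairness": 0.74, "explainability": 0.70, "cost": 0.90},
    "mortgage-llm-8b": {"accuracy": 0.80, "fairness": 0.78, "explainability": 0.75, "cost": 0.88},
}

# Weights encode what this use case cares about (the "skills on the résumé").
weights = {"accuracy": 0.35, "fairness": 0.25, "explainability": 0.20, "cost": 0.20}

def overall(model_scores):
    return sum(weights[k] * model_scores[k] for k in weights)

# The ranked shortlist is shown to a human, who makes the final pick
# (human in the loop), rather than the system choosing automatically.
shortlist = sorted(scores, key=lambda name: overall(scores[name]), reverse=True)
for name in shortlist:
    print(f"{name}: {overall(scores[name]):.2f}")
```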
The second stage is that I actually want to bring in the candidates and have an on-site interview. So I design a set of benchmarks and questions that I'm going to ask the models, to really evaluate them on the things I care about; it's the moral equivalent of an on-site interview. Finally, once I've hired my assistant, I want to do specific training so they understand the practices used in my company, and it's the same for the model: I then refine it so that it follows the rules I've put in place. These three stages are exactly what you'd do if you think of this as hiring an AI assistant: you pick the model, you interview it, and then you reinforce the training so the model does the right thing. It's something people think about naturally, but of course there's a lot of technology behind each of these steps to make it work well.

As an example, and I won't spend too much time on it, a typical loan-status prompt would be: give me the status of the loan based on this record, and respond with good, bad, or average. The prompt then contains the neighborhood, the race and gender of the applicant, the location, the age, and all the other parameters you put in a loan application, and the output would be "the loan is good" or "the loan is risky," plus some explanation: the credit scores are good, the demographics are risky, and so on. That's the kind of application you want; you pass the result to an analyst, and the analyst quickly figures out what to do.

For the evaluation of the models we use the benchmarks; I'm not going to spend too much time on this, but you get the idea. We rank the models, and there's a whole ranking system behind it that tries to quantify, in numbers or as a qualitative assessment, things like cost, accuracy, explainability, fairness, and so on. If you look at the models we evaluated, you see small models, medium models, and huge API-based models in the cloud, and if you look at the overall score you'll see that some of the bigger models are not necessarily better for this kind of application. The best model was actually an eight-billion-parameter model designed for mortgage applications, which is probably not surprising, but in this particular case the system surfaced it and bubbled it to the top. Similarly, there's a financial LLM and others that would be good choices because they're low cost and targeted at this use case, and then I can build the system around them to improve them.

Now, using the same prompt as before, the difference is that the big model tells me the loan status is bad, but if I look at the explanation, it's using things I don't want. It says that because there's debt consolidation the applicant is trying to pull out and request more money, when some of these are actually good signs; it says the credit score is too low, when it actually isn't bad; it's missing the context. A good model, like the eight-billion-parameter one I mentioned, focuses on the things that matter here, such as the last payment, and recognizes that the consolidation is actually good behavior, without over-emphasizing the wrong things. It's injecting the context of the domain.
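To make the prompt described above concrete, here is a minimal sketch of how such a loan-status request might be assembled. The field names and wording are invented for illustration, not the actual prompt used in the demo, and fields like race and neighborhood are exactly the ones the fairness discussion below says the model must not rely on.

```python
# Hypothetical applicant record; field names are invented for illustration.
applicant = {
    "credit_score": 690,
    "annual_income_usd": 58_000,
    "loan_amount_usd": 22_000,
    "purpose": "debt consolidation",
    "months_since_last_late_payment": 31,
    "neighborhood": "Riverside",   # protected/proxy attributes: the critics
    "race": "undisclosed",         # described below exist to keep the model
    "gender": "undisclosed",       # from weighting these at all
}

prompt = (
    "You are assisting a loan analyst. Based on the record below, respond with "
    "exactly one of: GOOD, BAD, AVERAGE, followed by a short explanation of the "
    "factors you used.\n\n"
    + "\n".join(f"{key}: {value}" for key, value in applicant.items())
)

# The prompt is sent to whichever model the recommender shortlisted; the
# analyst, not the model, makes the final lending decision.
```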
Remember, these are models that were trained on the whole internet. They don't understand what a loan is; that's just how the transformer thinks, but you can steer them in a certain direction by fine-tuning them in certain ways. And the explainability of these models is sometimes where things can go badly wrong. In this particular case, the model was putting emphasis in its decision on things like neighborhood or race, which you're not supposed to do because it violates fairness principles and even legal restrictions. So you want the refinement to steer the model: it cannot use race as a discriminant, and it cannot use neighborhood either, because neighborhood leaks bias into race even if you don't realize it. Those are the kinds of things you want the model to fix.

What we're doing to fix them is creating a system that is effectively a combination of different models; we call it an ensemble of AI critics. Rather than having the model just give me an answer, we feed its output to another set of smaller models that act as reviewers, judges, or critics, whatever you want to call them. It's what you would typically do yourself when you produce a piece of text: you do two things, you read it again, which is self-reflection, and you give it to a bunch of reviewers or friends who read it and tell you which parts aren't quite right. We're doing the same thing here. These critics are designed with a collection of targeted prompts to spot specific undesired behaviors, like bias and unfairness, the things we capture in the principles. The critics are themselves other language models; we use multiple of them for diversity, because each is designed to spot a different misbehavior. They then feed the critique back to the language model along with the prompt, saying: the answer was not acceptable because you're looking at gender, you're looking at race, don't do that. The next time around the model is likely to produce a better answer, and it actually works.

What we're really doing is trading the complexity of multiple inferences against bigger models, and that's the direction much of the community is going. There's a perception that through multiple inferences and a chain of thought you can get much better quality than by making a model bigger and bigger, except that now you need special knowledge to create that kind of chain for your specific domain. The world of language models is becoming a lot more complex, but there's a clear set of technologies that can help you, for example, manage your costs and get better quality without going for brute-force size.
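Here is a minimal sketch of that critic loop, assuming a `generate(model, prompt)` callable that you wire to whatever LLM serving API you use. The critic model names and prompts are invented, and the real Labs workflow layers in-context examples, guardrails, and scoring on top of this basic pattern.

```python
from typing import Callable, List, Tuple

# Assumed interface: generate(model_name, prompt) -> model's text response.
# Wire this to your own serving stack; it is not a specific product API.
GenerateFn = Callable[[str, str], str]

# Each critic is a smaller model plus a targeted prompt aimed at one class of
# misbehavior; several critics are used for diversity.
CRITICS: List[Tuple[str, str]] = [
    ("fairness-critic-3b",
     "Review the loan assessment below. Flag any reliance on race, gender, "
     "neighborhood, or other protected or proxy attributes."),
    ("explainability-critic-3b",
     "Review the loan assessment below. Flag any conclusion that is not "
     "supported by an explicit factor from the applicant record."),
]

def assess_with_critics(generate: GenerateFn, task_prompt: str,
                        worker_model: str, rounds: int = 2) -> str:
    answer = generate(worker_model, task_prompt)
    for _ in range(rounds):
        # Collect feedback from every critic on the current answer.
        feedback = [generate(critic_model, f"{critic_prompt}\n\n{answer}")
                    for critic_model, critic_prompt in CRITICS]
        # Feed the critiques back to the worker model and ask for a revision;
        # each round trades extra inference cost for higher-quality output.
        revision_prompt = (task_prompt
                           + "\n\nYour previous answer:\n" + answer
                           + "\n\nReviewer feedback:\n" + "\n".join(feedback)
                           + "\n\nRewrite the assessment addressing the feedback.")
        answer = generate(worker_model, revision_prompt)
    return answer
```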
This is the kind of feedback a critic will give you. In this case, the first critic tells you: you shouldn't be weighting the gender, the use of neighborhood is wrong, the key consideration is that what you told me violates the Fair Lending Act so you can't use it, you're missing an explanation for some of your decisions, and your contextual evaluation is missing the fact that some of the parameters are actually good. Once you put all of this together, the final output looks much better. We have different scoring mechanisms: the raw output gets the first score, then the two or three critics give their feedback, and in one step I get to a better output, or I can do a few more iterations and pay a bit more in inference.

Of course, we put this together in a more complex workflow; it's not manual, as I just described it. There's a workflow in which I have the language models, I decide how to break the problem into this chain-of-thought complexity, I inject the specific knowledge, or the misbehaviors I want to catch, through what is called in-context learning, so that a special prompt shapes each critic's input, and then I put in a bunch of guardrails to make sure people aren't trying to find ways around it, and so on. As I build this workflow, we measure things, and we get better results at the end.

So I hope I've given you an idea of the evolution and complexity of the AI world, especially when it comes to applying large language models to critical problems. It's very important to understand the governance you build around your generative AI models. As we like to joke, evaluation is the new training: you cannot possibly afford to exhaustively evaluate thousands of models on every benchmark, it would cost you more than training a new model, so you have to find techniques that scale down the evaluation and use this three-stage process, the structured version of hiring your assistant. Bigger is not always better: if there's one thing to remember, it's that even the biggest and most powerful model will miss the elephant in the room unless it's guided in the right way. And there's hope: there are advanced machine learning techniques that combine multiple LLMs in a careful way, using machine learning to guide them much as we did for computer vision, and we can do the same thing for generative AI to get you lower risk and good quality.

With that, I think I've exhausted my time. There are a bunch of demos I'd encourage you to come see on the show floor, including real demos of the examples I just talked about, such as the loan assessment. Thank you for your attention.

2024-11-28
