Sasha Luccioni - Generative AI: the Good, the Bad and the Bias (WAQ23)



[Music] Can I start? Thanks. My presentation is going to be in English; feel free to ask questions in French, and I will do my very best to answer you. I know I'm the only thing standing between you and an open bar, so I will go as quickly as possible, and if no one has any questions we can end early, it's totally okay.

So, I'm Sasha. I'm a researcher in AI; I have a PhD, and I've been working for ten years on AI, specifically on how AI impacts society. The tech is interesting, but how does the tech connect with society? There are two particular topics I'm really interested in apart from my research. The first is climate change: I work with a climate change organization that develops tools to fight climate change with AI. I'm also on the board of Women in Machine Learning; women are less than 11% of AI researchers, and WiML is an organization that highlights the research and practice of women in our field.

To start with, I picked some nice headlines about ChatGPT and Bard: ChatGPT passed a medical licensing exam, and Bard cost Google something like a hundred billion dollars in market value because it made a mistake in its demo a couple of months ago. We hear about this all the time, but I really want to show you that these are real tools, take you back in time a little bit, show you where they came from, and show you ways to use them that are, let's say, ethically minded.

This is the winding road of AI, and it goes back farther than you would think. 1956 is the official birth of the term "artificial intelligence," coined by a bunch of guys in the United States who came together and said: we're going to solve artificial intelligence, we're going to make machines that think. In 1974 they realized it was harder than they thought, and that was the first AI winter. For almost twenty years they had gotten a lot of money, a lot of attention, a lot of press, and in 1974 it became clear it hadn't really gotten anywhere; the US military in particular had put a lot of money into it with little to show for it. So for about six years almost nothing happened. In 1980 a new era began, called expert systems: essentially rules, if this then that, long, long lists of rules. This worked for a little while, but then came the second AI winter, in 1987, when people realized you can't write rules for everything; you can't write rules for life in general. So there was a second period when people said AI would never be solved. From 1993 things picked up again, the rebirth of AI: Deep Blue, IBM Watson, those were big things in the 90s and early 2000s. And since 2011 we've been in the era of deep learning. I did my postdoc with Yoshua Bengio, one of the founders of the field, and for the last twelve years it has worked, and worked well.

But in the last couple of years we have something called generative AI, and this is what I'm really going to focus on, because it's a little bit different from everything that came before, and I want to explain why. Before, you gave AI models labeled data. For example, the first AI dataset that was really popular, for a long time, starting in the 90s, was a set of images of handwritten digits, each labeled with the number it represents.
So one image is a zero, another is a five, and for something like ten years everyone tried to solve this: build a system that could read digits written by hand and translate them into typed digits. These were actually among the first commercial applications of AI, used by postal services to recognize addresses. And it worked. Then came ImageNet, which was the biggest AI dataset, with 14 million images and labels: this is a tiger, this is a car. AI systems are trained to predict the label, the category: they're trained on a lot of images, then given new images, and they have to predict the right label. This is called supervised learning, if you want to sound impressive.

What do AI models look like now? We have something like BERT, a model that produces text; it was kind of the first large language model. You input a string, "My name is Sasha and I am a," and it predicts the next word. In this case it predicts that I'm a vampire, based on I don't know what; essentially it's based on the text it has seen, so maybe there are a lot of sci-fi stories with vampires named Sasha. I can't know, and I can't ask the model: why do you think Sashas are vampires? And depending on the phrasing, if I say "Miss Sasha" or "Mrs. Sasha," the prediction can actually change, but you can never know exactly how it made the decision. We also have GPT-4: you can give it an image of what you have in your fridge and it will actually propose recipes based on that image; it generates text.

So now we have a completely different way of seeing the world, a different kind of AI. This is called generative modeling. Why generative? Because it creates, it generates new things. Before, you had ten categories, 0 to 9, and those were the only things the model could predict. Now the model can predict essentially almost anything, and that's where things get interesting.
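As an aside, here is a minimal sketch of the fill-in-the-blank behavior described above, using the Hugging Face transformers library; the choice of bert-base-uncased and the prompt are illustrative assumptions, not the exact setup from the talk.

```python
# A minimal sketch of BERT-style fill-in-the-blank prediction.
# Model choice and prompt are illustrative assumptions.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("My name is Sasha and I am a [MASK]."):
    # Each prediction carries the filled-in token and its score.
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```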
For example, you can use a generative model to generate radiology reports: a doctor runs a scan and the AI automatically writes the report. That's impressive. But what happens when things go wrong? I can't ask the model: why did you say what you said, why do you think this patient has cancer? It can't reply, and that can be really problematic, especially when things are critical, when people's lives or health are on the line; you need to be able to ask questions.

So I want to tell you how these models are trained, so you understand that they're not magic, they're technology. Sometimes the line gets blurred, but it's really important to understand how they work. The things you can see generative AI models doing are answering your questions, doing your homework, generating cool images of cats or whoever you want. But there's a lot going on underneath the water that we don't really see: copyright infringement, the enormous energy used to train these models, underpaid workers exploited to train them. It's important to see the whole iceberg, not just the part above the water.

So let's go back to our generative AI models. What's the recipe? Say I want to cook myself a generative AI model, what do I do? Well, I start with data; data is really number one, you can't go anywhere without it. I need as much data as possible, probably from the internet, so typically you scrape almost all of the internet. There are actually things called dumps: essentially all the websites in the world, collected in one place where you can download them, terabytes of data. Sometimes you can filter, but filtering costs money and takes time, because you need to decide, for each page, whether to keep it or not, so a lot of people don't even filter.

Then you do what's called pre-training. You train the model to predict the next word, like BERT did with "Sasha is a vampire": you put in part of a sentence and ask, what's the next word? If the model gets it right, you say, good job model, plus one; if it gets it wrong, minus one. The model gets trained like this billions and billions of times: "the cat in the..." When you're finished with this step, you have a model that can predict text, but it's not ChatGPT, it's not something you can actually interact with; it just fills in the blanks.

So what's the crucial last step? These generative AI models are set up to interact with humans. Once the model has been pre-trained, it's set up like a chatbot, except that humans ask it questions and then correct its answers. You ask it for a chocolate chip cookie recipe, it says something completely random, you fix it all up, you put it back into the model, and the model trains on it again. You keep doing this for thousands and thousands of hours; really, it's months of work. This is called reinforcement learning from human feedback, and it's the crucial new whipped cream ingredient in generative AI models. Before, it was: whoever has the most data wins, so Google and Facebook were winning because they had all of our data. Now it's really: who can expose their models to the most people. That's how these models got so good almost overnight; all of a sudden ChatGPT came out and we were like, whoa.
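To make the pre-training step concrete, here is a minimal next-word training sketch in PyTorch. A real model would be a transformer trained on real text; every detail below, the toy architecture, the random tokens, the sizes, is an assumption chosen to keep the sketch short.

```python
# A toy sketch of next-token pre-training: reward the model for
# predicting each next token. In practice the "+1 / -1" feedback is
# a cross-entropy loss, and the model is a transformer, not this toy.
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim),  # token -> vector
                      nn.Linear(dim, vocab_size))     # vector -> word scores
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 32))   # a fake batch of token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: next tokens

logits = model(inputs)                           # scores over the vocabulary
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()       # one "plus one / minus one" learning signal
optimizer.step()      # repeated billions of times in real pre-training
```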
Looking at the data is actually pretty interesting, because the Common Crawl really is everything from the internet, and you can imagine there's a lot of stuff on the internet that you don't want your model to be trained on. Every couple of months the Common Crawl publishes a dump of data, and you can extract what you need. Say I want to build a language model: I can just take the text. I can take images, because the dump contains URLs, so I can say, give me every URL that ends in .jpeg or .png, and download them all. I can also extract pairs of images and descriptive text, which is what's used for, say, DALL-E or Stable Diffusion. For people who are vision impaired, screen readers interpret images with the help of alt text, alternative text, so the web is full of image-text pairs, and those are used to train these models. So essentially I can do whatever I want with this data and use whatever part interests me.

But there are a bunch of questions, and it's true that, like ChatGPT said, we're still figuring out the wild wild west of AI. What if I put up a website with a book or some text of mine and I don't want AI models to be trained on it? What if I have "copyright Sasha Luccioni" at the bottom of the page and that gets ignored during training? Who can I complain to? Also, how do I figure out if there's unacceptable content: hate speech, pornography, things I don't want the model to reproduce? These training corpora are so big that even downloading them requires a specialized computer. And how about consent? What if I find an image of myself in this data and I want it taken out? Who do I ask? How can I complain? Think of artists and designers whose life's work has been hoovered up by these algorithms: you can ask for "a cat in the style of this artist" and it will make that cat, but maybe the artist didn't want that. So how do we enforce consent? Most scraped datasets just ignore this; the attitude is, this is a new field, we don't really know what the problems are. We're starting to see lawsuits: artists, for example, are suing the creators of some of these models, saying, I don't want my art, my life's work, used in this model. But we still don't have the mechanisms to enforce this. It's really hard to even prove that my data was used in a model, unless the creators publish the list of all the images they trained on, and they tend not to do that. So this is a really hard question.

Also, these models are getting really, really big. There's a plot covering the last four years, in billions of parameters: we went from a hundred million to hundreds of billions of parameters. The models keep getting bigger, which requires more data, so we have an endless loop: more data, more model; more model, more data; more training. This means very powerful hardware, thousands of GPUs. For example, BLOOM, a model trained in 2022, used one million GPU hours, a million hours of training. It also takes a lot of human effort, tuning, experimentation. It's getting out of hand: more parameters, more problems.

Something I particularly work on is understanding the environmental impact of these models. Imagine a million GPU hours running on electricity generated from coal: coal is burned, CO2 is produced. And there are the GPUs themselves: information technology tools, computers, smartphones, take raw metals, water, and energy to manufacture. So this all really adds up.
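How do you get from GPU hours to tons of CO2? The usual back-of-the-envelope estimate multiplies the energy drawn by the hardware by the carbon intensity of the local electricity grid. Here is a minimal sketch; all numbers are placeholders, not measurements from any real training run.

```python
# A minimal sketch of estimating training emissions:
# energy used (kWh) times grid carbon intensity (kg CO2 per kWh).
# All values are illustrative placeholders.
gpu_hours = 1_000_000     # a BLOOM-scale training run
avg_power_kw = 0.4        # rough average draw per GPU, in kilowatts
carbon_intensity = 0.5    # kg CO2 per kWh; coal-heavy grids run higher

energy_kwh = gpu_hours * avg_power_kw
co2_tonnes = energy_kwh * carbon_intensity / 1000   # kg -> metric tons

print(f"{energy_kwh:,.0f} kWh -> about {co2_tonnes:,.0f} t of CO2")
# With these placeholders: 400,000 kWh -> about 200 t of CO2.
```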
When I started in AI, I could train a model on my laptop; I had one GPU, kind of bad, kind of slow, but I could still train a model. Now you need a supercomputer and a million GPU hours to train one of these models, and of course that adds up in terms of environmental impact. A couple of recent models emit up to 500 tons of CO2 during training. To give you an idea, one ton is about one flight from, say, Quebec City to London and back over the ocean; so that's 500 transatlantic round trips, just to train a language model. I guess the question is: where do we draw the line?

There's also the human angle. This isn't magic; it's human beings interacting with the model and giving it their intelligence, because the model is not magically intelligent, it has to be trained. Those thousands of hours are human labor that never gets recognized. There was a recent exposé by Time magazine showing that most of the workers who trained ChatGPT were based in poorer countries. These workers were in Kenya, earning less than $2 an hour and working 14 to 16 hours a day. And because part of their job was to detect when ChatGPT said things it shouldn't say, they were also exposed to a lot of toxic content, a lot of hate speech, for $2 an hour. Of course, they're not recognized anywhere. When ChatGPT gets made into products, these people are never seen; they're invisible, essentially, but it was their labor that gave ChatGPT all this magic intelligence. And beyond general knowledge, these companies are increasingly hiring very skilled, talented people: programmers, poets, authors. If you need your model to write code, you need someone to teach it to write code; if you want it to write a haiku, you need someone to teach it what a haiku is. So more and more it's not just crowdsourcing, it's hiring programmers to teach the model better JavaScript, and it's very hard to get a grasp of how much labor goes into it, because it's not declared anymore.

So what can be done? I don't want to be too apocalyptic, because things are already being done. You may have seen the recent news: the open letter calling for a pause on AI research, or the US Senate hearing last week. Those conversations kind of say it's already too late, there's superhumanly dangerous AI that's going to take over the world, be smarter than human beings, et cetera. But actually we're very far from there, and a lot of people are doing research that actively tries to make sure we don't get there. It's not as if the train is on the tracks, rolling toward the cliff; it's more that we're laying the tracks as the train rolls, and depending on which direction we want the tracks to go, we still have control. For example, when I finished my PhD I had a lot of offers, from Google and Facebook and whatnot, and I chose Hugging Face, which is actually a startup whose mission is to democratize good machine learning.
Machine learning is essentially artificial intelligence, and all of our work is about helping people do better AI. It's not necessarily us training these models; we're building tools for people to train more ethical, more responsible models, and helping developers use models in ways that are respectful of humans and society. So I'll talk about some of the work we're doing, and some of what the broader community is doing, to make sure the train tracks go in the right direction.

To start with data: I talked about consent, and it's really important. We sometimes discuss consent in one specific use case, but consent matters in society in general; you don't want people doing things to you that you didn't consent to, and data consent is becoming a big issue. For example, we've created tools for exploring big datasets. One dataset we examined was used to train text-to-image models, so it's images with captions, and inside it we found a lot of medical data: images of people before and after surgery, radiology images, and so on. There was a huge amount of sensitive, identifiable information, names of people, names of clinics, names of conditions; someone could find an image of you and learn that you have a condition, say AIDS, that you didn't want anyone to know about. So we're working with a company called Spawning to create ways for people to indicate that they don't want their data in a dataset like this. What we're doing now is reaching out to the people whose emails and names we find, with a big focus on artists, and sending them opt-out requests: if you're not comfortable with your art, or your clinical data, being used in this dataset, please tell us. And it's working: 42 million images have been removed so far by people saying, I'm not okay with this. Of course, we can't force anyone to stop using the datasets they're using, we have no legal means, yet. But we can say: this is the ethical version of this dataset, this is the one with the stamp of approval, use this one, because in the other one we found concerns. So we've been adding these kinds of disclaimers and nudging people toward more consensual data.
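Mechanically, honoring opt-outs can be as simple as filtering a dataset against a list of refused items before training. Here is a minimal sketch; the file names and record fields are hypothetical, and the real pipeline described above goes through Spawning's tooling rather than a flat file.

```python
# A minimal sketch of honoring opt-out requests: drop any item whose
# URL appears on an opt-out list. File names and fields are hypothetical.
import json

with open("opt_out_urls.txt") as f:
    opted_out = {line.strip() for line in f}

kept = []
with open("image_caption_pairs.jsonl") as f:
    for line in f:
        record = json.loads(line)   # e.g. {"url": ..., "caption": ...}
        if record["url"] not in opted_out:
            kept.append(record)

print(f"kept {len(kept)} records after removing opted-out items")
```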
Another thing we're doing: imagine a dataset of billions and billions of data points. How do you understand a dataset that big? People now say these datasets are "too big to document": essentially, you take the whole internet, stuff it into your AI model, and then you don't have to worry about anything, because it's just too big. But we're creating tools for people to understand their data better. For example, we have a tool called Data Measurements: you plug in a dataset, and if it's big it takes a little while, but it goes through the whole thing and finds you duplicates, low-quality content, all sorts of things you can then decide to remove or not. It will tell you: there are a bunch of duplicates, do you really want fifty copies of the same thing? You can say no and remove them, and understand your data better. We also have tools that help you understand the metadata, the data about the data: what are the sources of this dataset? Take a medical dataset: who gave up their data, how many adults, how many children, where are they from? That matters, because if you're using medical data, maybe you shouldn't be using kids' data, for example. Analyzing the metadata also tells you what the underlying data, say radiography reports, really contains: from the images alone you can't tell whether a scan is of a child or an adult, but the metadata can tell you.
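To give a flavor of one check such a tool runs, here is a minimal sketch of exact-duplicate detection by hashing each text entry. This is my own toy simplification: the actual Data Measurements tool does far more, including near-duplicate and quality analyses.

```python
# A toy sketch of finding exact duplicates in a text dataset by
# hashing each entry and counting repeated hashes.
import hashlib
from collections import Counter

texts = [
    "The cat sat on the mat.",
    "An entirely different sentence.",
    "The cat sat on the mat.",   # an exact duplicate
]

counts = Counter(hashlib.sha1(t.encode("utf-8")).hexdigest() for t in texts)
n_duplicates = sum(c - 1 for c in counts.values() if c > 1)
print(f"{n_duplicates} duplicate entries out of {len(texts)}")
```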
Also, model access. There's a plot a colleague of mine made: the horizontal axis is time, the last two years; the vertical axis is size, so the higher up, the bigger the model; and the colors indicate access. Red means only the creators of the model have access, green means anyone has access, and blue and yellow are somewhere in between. What we see, as time goes on and models get bigger, is that the biggest, most powerful models belong only to companies, companies that use them to make money. That's problematic, because if a model is closed and you can't study it the way we normally study models, we can't understand how it fails. Take Bard: there's no way for me to get access even if I ask, it's a proprietary product, so I can't say whether it's sexist, or toxic, or anything like that. As an access issue, that's a real problem for us as a field. So we have these really big models, as I showed you, with no way of understanding them. We can't ask a model why it did what it did, and even when we try, it usually invents explanations that are false. You can now ask ChatGPT or Bing, "why did you answer this," and it will say, oh, because blah blah blah, but it invents that too, so it's not very useful. We also don't know what the training data is: for both ChatGPT and Bard we have absolutely no idea. People have hypotheses, that our Gmail emails, for example, are being used to train AI models, but we don't know, because no one tells us; for OpenAI we don't know where their data comes from. That's a problem when you want to do research on ethical AI.

There's also a very big monopoly on these large language models, because it takes so much money to train them, something like $15 million per model. Who can afford that? Big tech companies, sometimes very big universities like Stanford, but beyond that, no one can afford to train these models, which means there's a monopoly, and we're seeing a strange power dynamic in AI right now. Recently I was at a conference in South America, and everyone said: we just can't participate, we can't publish at conferences, we can't train our own models, because we don't have the compute. So there's a digital divide between the people who have access and the people who don't.

Open source models are really important in that respect: I can make a model, and you can take my model and make it better, whatever "better" means to you; you can keep training it, you can add a language, you can add other capabilities. Open source helps scientific progress in the sense that you're not starting from scratch. Right now some people are trying to reproduce ChatGPT, because we don't know how it works; they're starting from the beginning and trying to figure it out. You can do that, but it takes a really long time, whereas if I share my model, you can keep improving it. ChatGPT and Bard are still the best models, for sure, but open source models are catching up. When LLaMA came out, the model Facebook open sourced, people used it to make a new one called Alpaca, and then people took Alpaca and made another one called Vicuna; it was a mammal thing, they kept giving them mammal names. Thanks to open source we're getting closer and closer to the proprietary models, and it's not one single lab doing all this: a university lab took LLaMA and made it better, then someone else took their model and made it better, and that's how progress happens. That way we can keep building on each other's work, and we're not dependent on ChatGPT or on Bard. So it's really important to keep creating these models.

A model I was actually involved in last year is called BLOOM, and it was a completely international endeavor: something like a thousand researchers from around the world volunteered their time to make this model together. We got compute from a public source, three million GPU hours given to us for free, and we built this BLOOM model: multilingual, 46 languages, and as big as everyone else's. It was really a first, and maybe it wasn't as good as ChatGPT, but we made the point that it was possible.
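This openness is concrete: anyone can download an open model and build on it. Here is a minimal sketch, using a small member of the BLOOM family to keep the download light; the prompt is an arbitrary illustration.

```python
# A minimal sketch of using an open model: download it and generate.
# bloom-560m is a small sibling of the 176B BLOOM model.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")
result = generator("La recherche en IA", max_new_tokens=20)
print(result[0]["generated_text"])   # BLOOM is multilingual, French works
```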
Something else that's really important: documentation. If you code, you know documentation matters, writing READMEs and so on. Until a couple of years ago, AI models didn't have READMEs; if someone shared a model, you were supposed to just magically know how to use it. So now we're working on model cards. Hugging Face created a whole tool, essentially a set of templates and pages you can use to write these cards, to help people understand how a model works, how it was trained, and how it fails. That last part really matters: saying, I noticed my model is bad at this, don't use it for that, it's not made to do that. It's important to be honest about what your model can't do, because if you take a model and use it for something it wasn't meant to do, you can't predict its behavior; it can do things you weren't expecting. Model cards also make models comparable: you put accuracy numbers, performance numbers, in the card, and people can compare models and pick the one that does best on what they need.
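For readers who haven't seen one, here is a minimal sketch of what a model card might say; every name and number is a hypothetical placeholder, and real cards on the Hugging Face Hub are richer, with structured metadata on top.

```python
# A minimal sketch of writing a model card: a README that states what
# a model is for, how it was trained, and where it fails.
# All names and numbers below are hypothetical placeholders.
card = """\
# Model card: toy-sentiment-model (hypothetical)

## Intended use
Short product reviews in English and French. Not for medical, legal,
or other high-stakes decisions.

## Training data
Public review corpora; see the dataset card for sources and filtering.

## Evaluation
Accuracy 0.91 on a held-out review test set (illustrative number).

## Known limitations
Weak on sarcasm and on dialects underrepresented in the training data.
"""

with open("README.md", "w", encoding="utf-8") as f:
    f.write(card)
```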
Something I particularly believe in is ethical guidelines. Until maybe five or eight years ago, AI was really something theoretical; now it's very practical. You create a model, and tomorrow it can be used in something you weren't expecting, a robot, a smartphone, so you really need to think about what you're doing, the theory, the programming, and how it can affect society. There's a really big AI conference called NeurIPS; it brings together something like 10,000 AI researchers every year, the Super Bowl of AI, I've heard it called. But it didn't have any ethical guidelines, so people submitted things like an AI model for changing women's clothes, with "jeans to miniskirts" as one of the applications; that was accepted at NeurIPS a couple of years ago. So was a model for putting makeup on women. There were no guidelines, so people just submitted things they thought were fun or interesting. Of course there's a lot of genuinely interesting research too, but there were also, for example, datasets scraped from CCTV cameras, surveillance cameras: people would download the data and submit it to NeurIPS, saying, hey, I have a dataset of people who didn't know their data was collected. So we created guidelines: you can't submit your research to NeurIPS unless you pay the people who helped gather your data, obtain consent, et cetera; a whole list of things people must declare, or they can't publish at the conference. It might not seem like a big deal, but if you're a PhD student, or even a researcher, and you need that checkmark, a publication at NeurIPS really matters. So that's something I worked on recently.

The final thing I want to talk about is model outputs. From my earlier explanation, you can see that in supervised learning you could always check the answer: this is a six, draw a six, and if the model says three, that's wrong. You know the right answer and the wrong answer. In generative AI there are no right answers, but machine errors can still impact human lives. There have been a couple of quote-unquote scandals recently. For example, an AI model was deployed in the United States to inform criminal sentencing, predicting whether someone is low risk or high risk and how much time they should spend in prison. Of course, it turned out to be very racist, and it was used for years to help decide, based on AI, how much time people spend in prison. It was eventually taken down, but these are the kinds of things we should be getting ahead of, not cleaning up after. A more recent example, from ChatGPT: "write a function to check if someone would be a good scientist," and it writes: if race is white and gender is male, then they're a good scientist. That one has probably been fixed, but these things keep coming up, and how do we really evaluate them? There's no right answer, no evaluation dataset, no ImageNet, no labeled handwritten digits; there's not really anything we can use. And we can't access the underlying models: if I wanted to evaluate ChatGPT, I could send you screenshots, but if you tried the same thing, chances are you'd get a different answer, because these are stochastic parrots; the output changes. Something I said in an interview that people now quote a lot is that ChatGPT could be three raccoons in a trench coat. It was an offhand remark, I didn't think anyone would quote it, but Bloomberg did, then someone else did, and now people say, you're the one who called ChatGPT a raccoon. Yes, that was me. The point is, you don't know what's under the hood: it could be one model, it could be ten models; you don't know the components of ChatGPT, and that makes it hard to evaluate.

A project I worked on recently was evaluating text-to-image models, which are particularly tricky because you have the text and you have the image, so how do you evaluate them, separately, together? Essentially, we made tools to compare two models. You ask for "a portrait of a CEO" and compare the outputs: on the left, Stable Diffusion, an open source model; on the right, DALL-E 2. They're all arguably very undiverse, but it's interesting: no matter what adjective, what description you add, DALL-E always outputs white men as CEOs. It's quite surprising. With Stable Diffusion, if you say, for example, "from France," or "a CEO of a pharmaceutical company," you get a bit of diversity, not only white men; DALL-E 2 is always white men. We made a couple of other tools too, for example average faces: for a given profession, if you combine all of the generated images, you get a kind of ghost representation. For "janitor," DALL-E still makes white men, while Stable Diffusion actually makes more men of color.
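To give a feel for this kind of probing, here is a minimal sketch that generates several portraits for one profession with an open model and computes the pixel-wise average face. The model ID, prompt, and sample count are illustrative; this is not the actual bias explorer code.

```python
# A minimal sketch of probing a text-to-image model: generate several
# images for one prompt, then average them pixel-wise so recurring
# traits (skin tone, glasses, suits) survive while details blur out.
# Illustrative only; not the actual bias explorer code.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

images = [pipe("a portrait of a CEO").images[0] for _ in range(8)]

stack = np.stack([np.asarray(im, dtype=np.float32) for im in images])
average = stack.mean(axis=0).astype(np.uint8)
Image.fromarray(average).save("ceo_average.png")
```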
We also found a lot of cultural stereotypes. If you prompt "Native American," it will show you people in headdresses; if you say "Asian," it will often show you waitresses, or people doing hair or nails. These models have very clear cultural biases that are hard to find unless you ask them a lot of questions. So we built a couple of these tools, we called them bias explorers, and you can keep exploring, keep mapping the different parameters of bias. What's striking for "Native American" is that no matter what you put, Native American going for a walk, Native American at work, Native American anywhere, it always makes people in headdresses, and obviously Native Americans don't always wear headdresses. It's useful to see these things visually, because they're hard to explain and hard to analyze, but once you look at them you can start making observations. So we created a website where you can explore different professions. You can't say these are real people, because they're not, but you can make high-level analyses, like: all the CEOs are white men with glasses, maybe that means something. We're still adding models, because now a lot of people are making anime-style models, where you can input, say, "Sasha" and get Sasha as a Disney princess or whatever. What we've noticed is that they're very biased in terms of gender: men come out fairly neutral, but women are always half undressed, very caricatural, very anime-style. So we're trying to build tools like this even for less realistic models, just so people can see what they output.

And why does this matter? Maybe you think, okay, these just make pretty pictures. But first of all, Getty Images, a stock image website where you go to look for images, has already started offering generative AI images, because if you're looking for, say, a purple dog in a green suit with a yellow balloon, maybe no one in the whole of humanity has ever taken that photo. So they propose that you generate the image, using DALL-E 2, the white-men-in-suits model. That can be problematic: if you're looking for images of CEOs for your website and you cannot, no matter what, get it to output anything but white men with glasses, maybe that's a problem, because you want more diversity in the CEOs you show. We also found people using DALL-E for forensic sketching, as a tool to help find criminals. Usually that's done manually, by actual artists, or by selecting the hairstyle, the glasses, the mustache, and so on; here you give the description textually, "a man," et cetera, and it generates an image. You can also give just a general description, and we found that if you put "gang member," it's always going to be Black people, and if you put "criminal," it's only going to be Latino and Black people.
So if these tools start being used in criminal settings, like the prediction of prison sentences, people will be influenced: it could have been a dark night, and when you're shown a generated image you might say, yeah, that kind of looks like the guy, when maybe it doesn't at all. If these models get used in truly high-risk settings, like medicine or criminal justice, they can have real effects on real people. This is not just research for research's sake.

As a closing note, and I want to leave time for questions: I hope people realize that AI is complex, but also that AI belongs to everyone. It's not just Google and OpenAI; a lot of people are doing a lot of things, and it's important to keep that diversity. It's important not to rely on ChatGPT for everything, because ChatGPT is one model, with one dataset, with one company behind it. Even just diversifying the types of AI models you use is already something, because the more we use one model, the more power it gets. It's a self-fulfilling prophecy: if you treat it as an oracle that's always right, and you keep using it, it keeps on "being right," and eventually we have only one representation of the world. But the world is not one thing, especially since we know these models have biases. If we rely on them too much, those biases become reality: all our websites will show white male CEOs with glasses, and all the sketches of criminals will be Black people, and that's not what we want. So I hope you'll reflect a little on our reliance on these tools, be a bit more critical of them, and think about all the underwater stuff I showed you on that slide; there's a lot that goes into this that you don't see. ChatGPT is cool and writes really cool stuff, but it's built on human labor and natural resources, on a lot that you don't see, and it's important to keep that in mind. I hope you enjoyed my talk, and if you have any questions, please. [Music]


