Social Weaknesses of the Use of AI in Image Recognition | DeepDive by Kate Crawford
What is at stake in how AI systems are harvesting, labeling, and deploying images in the world? My name is Kate Crawford. I'm the co-founder and director of the AI Now Institute at New York University, and I'm standing here with Trevor Paglen, who is many things, but he is also an artist and a fellow with us at AI Now. Trevor and I have been collaborating on an exhibition about AI and images, which is going to be opening at the Fondazione Prada later this year. This project has involved us looking at hundreds of training data sets, and by this point thousands and thousands of images, so it has quickly turned for us into a much deeper dive into the nature of images themselves.

Today we're going to talk about the way that images are used in AI. We're interested in three questions in particular. First, what is an image for the purposes of artificial intelligence? Second, what work do images do in AI systems? And finally, what is at stake in how AI systems are harvesting, labeling, and deploying images in the world?

So I'm going to begin with a little story about machine vision, from its early origins back in the mid-20th century. In 1966, Marvin Minsky at MIT decided that interpreting images was central to the idea of machine intelligence. But given that vision is something that many creatures can do (squirrels, fish, birds), he assumed that this would be a relatively straightforward task. So he asked his undergraduate student at the time, Gerald Jay Sussman, to spend the summer linking a camera to a computer and getting the computer to describe what it sees. This became the Summer Vision Project, and needless to say, it probably won't surprise you that the process of getting computers to describe objects in the world took a little bit longer than one summer.

The current narrative, of course, is that we've come so far with contemporary AI that challenges such as object recognition and facial recognition have now been solved. But we would argue that actually the opposite is true. The challenge of images, and what they mean, will always be far more a political than a technical issue. Once labels are applied to images, a type of meaning has been constructed at the level of representation and identity, and this can have profound consequences for how we understand the world and understand ourselves. Of course, just ask "6% black" Michael Jackson here.

Now, we may also bring particular interpretations to an image like this, from the Baton Rouge protest over the shooting of Alton Sterling by police officers. But if you run this image through an AI image recognition system like CaptionBot, it will be interpreted a little bit differently. It says: this is a group of people standing around in a parking lot. This is obviously a pretty clear example of vision systems being unable to read context, and thus profoundly missing the many layered meanings of this image. But there are more subtle problems with images, labels, and AI going on around us all the time. So have a look at this image here. It comes from a film that Trevor made last year, based on a string quartet, in which he used an off-the-shelf AI recognition algorithm. This is basically doing exactly what Marvin Minsky would have hoped for his summer project: the computer is describing what it sees in real time. But what does it mean that the system is describing this person as a woman, or that she's likely zero to 12 years old, or that the emotion she's feeling is 49 percent sad? What is the work done by these forms of recognition and misrecognition? Basically, what is it to describe an image?

As soon as we begin talking about the meaning of images, we are also talking about politics. The nature of images has historically been at the heart of struggles over power. We can of course go back to 8th-century Byzantium, where the controversies over making images of Christ resulted in the mass destruction of images, the so-called iconoclasm of 726 AD. And this struggle over images and meaning persists in the modern era. When Magritte said "this is not an apple," he was pointing out that, to one extent or another, the relationship between images and meaning is arbitrary, and ultimately subject to political operations. But if an AI system is, say, inspecting fruit quality, and it looks at the Magritte and says, "no, actually, this is an apple, a red and green apple, definitely," then what's going on here? Well, we're in a contest of meaning, and who gets to decide what images mean is fundamentally a question of power, and of AI.

So tonight we're going to open up the question of images, AI, and power. In particular, we're interested in how this happens in the subset of AI known as machine learning, which has risen to become the dominant form in the last decade. Forgive me, I'm going to give you a short definition. Machine learning systems use algorithms and statistical models to automate the performance of discrete tasks, like recognizing faces, filtering your email, or directing Felix's autonomous vehicle downstairs. Machine learning works by building a mathematical model of sample data, which is called training data, and the system uses this model to make predictions or decisions. We want to show you some of this tonight, because for these systems to recognize images, they have to be trained on thousands, sometimes millions, of discrete images.

So, in short, images have become the lingua franca of machine learning. They are the underlying alphabet from which vision systems are detecting forms and making meaning. Now, we probably don't think of AI as a practice of doing things with images, but perhaps we should, because right now AI systems are making fundamental interventions into visual culture, ones that have far-reaching consequences, ones that may go even further than the invention of photography in the 19th century, and even, perhaps, the invention of perspective in the 15th century.

Because if we go back to when Alberti outlined the idea of artificial perspective in 1435, it was rapidly and widely accepted across Western Europe to be an infallible method of representation: basically, what W. J. T. Mitchell calls an "automatic and mechanical production of truths about the material and mental worlds." That definitely sounds familiar from some of the things we read about AI. But this was a profound transformation, and at the same time it disowned its own artificiality by laying claim to being just a natural representation of how things really look. Then the invention of the camera reinforced the idea that perspective is the natural way of seeing. So, just as perspective and photography completely transformed our understanding of vision and images, in ways that we now completely accept.
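The loop described above (a model is fit to labeled sample images, then used to make predictions on new ones) can be sketched in a few lines. This is our illustrative example, not anything from the talk: we use scikit-learn's bundled digits dataset, an MNIST-style collection of hand-labeled images, and a simple logistic-regression classifier, chosen purely for brevity.

```python
# Minimal sketch of the machine-learning loop described above:
# a mathematical model is built from labeled sample images ("training data")
# and then used to make predictions about images it has never seen.
# Dataset and classifier choices are illustrative, not from the talk.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 1,797 8x8 grayscale images, labeled 0-9 by hand
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# "Training": build a statistical model of the sample data.
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# "Deployment": the model now assigns one of its trained labels to any
# new image, whether or not that label is a good description of it.
predictions = model.predict(X_test)
accuracy = (predictions == y_test).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

A production vision system swaps in millions of images and a deep network, but the structure is the same: labeled data in, labels out, and the model can only ever answer in the vocabulary its training set gave it.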
So the advent of AI is altering some of these basic relationships between humans and images, under the banner of science and objectivity. As Mitchell likes to joke, ultimately what is natural is evidently what we can build a machine to do for us. So in order to understand the deeply political interventions going on right now, we first have to understand the model of vision that underlies AI systems.

We're going to take you into the strange under-layer of logic that guides how images are being harvested, recognized, and labeled in AI. And we're going to end by suggesting to you that images are no longer spectacles that we observe; they are in fact looking back at us. Images are active in a process of massive value extraction, drawn from the most minute and intimate elements of everyday life. And on that, I'm going to pass to Trevor.

Thank you so much. Okay, so this is a kind of map of our talk. As Kate mentioned, we've been spending a lot of time looking at training sets, libraries of images that are used to teach autonomous vision systems how to see, as it were. We've been looking at them, and at some of the underlying politics that are built into these training libraries. Training libraries in a way create a kind of vocabulary and grammar of seeing, and it happens at many different levels.

When we look at precedents to contemporary AI training libraries, we find things like the great 19th-century phrenologists' skulls, and early biometric databases of fingerprints used by law enforcement. But in the 1990s we see the beginning of real training libraries built for computer vision systems. One early set, which these images are taken from, was created by DARPA, the Defense Advanced Research Projects Agency, in the early 1990s. This is a data set called FERET, which consists of thousands of images of people's faces. It was used to encourage research in early facial recognition, and was meant to be a kind of benchmark that different laboratories could use to test their facial recognition systems and the performance of their algorithms against. Another early training set is this one, called MNIST, which was used to teach computers how to read handwritten numbers.

Now, if we fast forward to contemporary training sets, we can start to see the kinds of assumptions, and even forms of power, that are built into them. When we're looking at training images, we're looking at taxonomies, at categories, at labels, at relationships between images and concepts. This is an example of that. This is an excerpt from a database used to train computers to recognize affects and emotions. This one is from 1998. It's called JAFFE, which stands for Japanese Female Facial Expressions: 213 images of women making seven facial expressions. That same year, this data set was released as well. This one is called the Karolinska Directed Emotional Faces, or KDEF, and it consists of 70 individuals, each displaying those same emotions: happiness, sadness, fear, disgust, anger, surprise, and neutral.

So here, in the case of these data sets that are intended to teach machines how to recognize people's affects, we have a bunch of assumptions built in from the start. First assumption: the idea that emotion is a sensible taxonomy in the first place. Second assumption: there are only six emotions, plus neutral. Really? Third assumption: emotions are somehow expressed on people's faces.

So where do these assumptions come from? Well, the underlying paradigm of this taxonomy comes from the work of a specific psychologist, a guy named Paul Ekman. Ekman asserted that emotions can be reduced and categorized into six basic, universal categories that can be ascertained by looking at somebody's face; the eyes provide the proverbial window to the soul. And although Ekman's work has been profoundly critiqued by psychologists and anthropologists, and by people like Kate, who found that this work simply does not hold up to sustained scrutiny, this paradigm has nonetheless been adopted by AI researchers and developers. In a way, this kind of makes sense. Ekman's paradigm is actually perfect for AI, because it assumes a transparent relationship between appearance and essence, and it assumes a series of discrete and universal categorizations. And this almost phrenological will is the stuff that AI is largely made out of.

We can see these politics of taxonomies even more clearly in places like this. These are excerpts from a data set called the UTKFace data set, which consists of over 20,000 images of faces, and it has annotations for age, gender, race, and ethnicity. Let's look at the taxonomies here. In the description, age is an integer from 0 to 116, indicating the age. Gender is either 0 (male) or 1 (female). Race is an integer from 0 to 4, denoting White, Black, Asian, Indian, and Others (miscellaneous).
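To make that taxonomy concrete: in the UTKFace distribution the annotations are encoded directly in each filename, reportedly in the form `[age]_[gender]_[race]_[date&time].jpg`. Here is a small, hedged sketch of decoding that scheme; the filename convention is taken from the dataset's published description, the sample filename is made up, and the label tables simply transcribe the integer codes listed above.

```python
# Decode a UTKFace-style filename into its annotation taxonomy.
# The naming convention ([age]_[gender]_[race]_[timestamp].jpg) is taken
# from the dataset's published description; the sample filename is invented.
GENDER = {0: "male", 1: "female"}              # gender is forced to a binary
RACE = {0: "White", 1: "Black", 2: "Asian",    # race is forced into five bins
        3: "Indian", 4: "Others"}

def decode_utkface_name(filename: str) -> dict:
    """Map a filename like '25_0_3_20170116.jpg' to its labels."""
    age, gender, race = filename.split("_")[:3]
    return {
        "age": int(age),                       # an integer from 0 to 116
        "gender": GENDER[int(gender)],
        "race": RACE[int(race)],
    }

# Hypothetical example: a face the labelers filed as 25, male, Indian.
print(decode_utkface_name("25_0_3_20170116.jpg"))
# → {'age': 25, 'gender': 'male', 'race': 'Indian'}
```

The politics are visible right in the data structure: every face must land in exactly one of two genders and one of five races, and there is no cell for ambiguity, mixture, or refusal.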
Again, we can see quite clearly that there is a politics to the taxonomy itself. Gender is binary, a one or a zero. Race can be encapsulated by just five categories: White, Black, Asian, Indian, and a miscellaneous "Others". And the racial categories of this data set recall racial classifications from the past, in particular the one used by the South African apartheid regime in the 1970s, where each person was legally classified as black, white, coloured, or Indian.

Other data sets recapitulate the kind of obsession with the face of the criminal that has been so much of the bread and butter of phrenology since the 19th century. These photographs are excerpts from an American National Institute of Standards and Technology training set, one called NIST Special Database 18. This data set consists of mug shots of 1,573 people who have been arrested multiple times. So the training set has pictures of them across the different times they were arrested, and it is meant to be used to help track how faces age over time.

Now, a huge milestone in the development of training sets takes place in 2009, when computer vision researchers at Stanford and at Princeton release what has really become one of the gold standards of publicly available training sets. This is a data set called ImageNet. ImageNet consists of over 14 million images, labeled by hand into more than 20,000 categories. It's massive. ImageNet is a major development for a couple of reasons, having a lot to do with how it was made. There are a couple of things that are new here: on one hand, the ability to collect tens of millions of images in the first place, and on the other, the ability to label tens of millions of images. What made this possible? Well, on one hand, it was made possible by the centralization of data and indexing in places like Google, Amazon, and, at the time, Flickr; these platforms made it possible to collect images at that scale. But there's another aspect which is just as important, if not more so, which is that those very same platforms allowed the creation of a kind of backbone of piecemeal labor practices. ImageNet was created by employing Amazon Mechanical Turk workers to label the millions of images that the researchers had collected, and to organize them into categories. The point here is that when we're talking about the politics of training data, those politics are just as much infrastructural as they are epistemological, just as much as they are about the power to create forms of common sense. I'll pass it back to you, Kate.

So Trevor has talked a little bit about this sort of meta-taxonomy. We're going to go down a level now, to the classes themselves that exist in, say, ImageNet. The question here is: what kinds of concepts get ratified as classes within ImageNet? A class like "apple" might seem relatively uncontroversial, but when you're dealing with data sets that have tens of thousands of classes, they get very relational, and very weird, very quickly. They move from being descriptions to being judgments. Now, this is an example of an image taken during the Mark Zuckerberg hearings in Congress last year, and here we've just applied an open-source neural network called YOLO9000, which can recognize around 9,500 different classes of objects. Have a look, because this image is actually very interesting. You'll see that some people are labeled as "workers", and down the bottom here, the women who are actually transcribing are all labeled "person". But if you look at the guys with ties up the back, they're "leaders". And, weirdly, there's also this category of "lawgiver", which seems to be peppered randomly through the crowd. These are obviously very different sorts of nouns to "apple", and they bring a sort of descriptive characterization that I think reveals where ideology has been baked into the system.

But let's go further into the ImageNet categories. This is a really charming one, called "bad person". Here we have "debtor"; so obviously something in your face can certainly correlate to your bank account. And then this one: a "swinger". And "tramp", right. Clearly these are heavily laden value judgments at work. But again, you have to keep in mind that these are images of people, and the people have no idea that their selfies, or their photos from their last beach vacation, have been scraped from the internet and are being labeled as examples of a "tramp".

So that is what's going on at the level of the class. Now let's dive down to the most molecular level: the way individual images are being labeled. So have a guess what this image is. Apparently, this is a "sharecropper". Try guessing this one; this is a bit of a tricky one. This is a "noticer". Okay, fair enough. This woman is a "mute". And this man is an "anti-Semite". All right. And Trevor and I couldn't really figure out what is going on in this image; it seems to be an indigenous child in some distress, but it has simply been labeled "toy". So there is obviously a kind of incredible flattening of meaning and context going on here.

So when we look back at the overall architecture of a training set, we can see that these levels of taxonomy, class, and image are being presented as an essentially unproblematic, apolitical set of choices. But as you can see, these are highly political moments at each one of these layers. So what happens when a system trained like this is let loose into the world in order to make sense of it? Trevor is going to tell us.

Well, the basic question here that we're trying to unravel is: how do machines see? How does AI see? How is it trained to see, and then how does it go about seeing?
Let's imagine we've trained a neural network on something like ImageNet, and now what it can do is go out and look at the world and say that, you know, this is a woman with long brown hair, or what have you. How does it see? Well, the training stage acts as a kind of filter: the training data sorts an infinitely complex world into discrete categories and image types, and the machine vision system can then recognize those categories and image types out in the world.

Now, I would argue that this is not actually how machines see. In a limited, technical sense, yes, perhaps, but in reality this is not what's going on. At the end of the day, let's keep in mind that the point of computer vision systems isn't actually to recognize objects; it's to make money. It's to extract value from the collection and interpretation of images. The alphabet here is not about the building blocks of meaning-making so much as about the various forms of power being exercised. Kate and I have started calling this "predator vision", and this predator vision of artificial intelligence has several sides.

On one side, we see a drive to collect the maximum amount of images and other forms of data to generate profits, and this is done largely without our consent or our knowledge. This constitutes a kind of enclosure. The kinds of everyday spaces and intimate moments of our lives that were previously, you know, very inefficient for capital or police to get into are now wide open for occupation and extraction. And we're starting to see this in insurance, credit, finance, everyday infrastructures. State Farm Insurance, for example, recently started doing a series of studies to try to classify drivers' behavior, using computer vision systems installed in cars to monitor drivers in real time: to see if they're distracted, to see if they're texting while driving, or engaging in other kinds of risky behavior. This study is indicative of an overall shift happening in the insurance industry, a shift that they describe as moving away from "category data" towards "source data". The idea here is that your insurance premiums are constantly fluctuating based on the insurance company's real-time assessment of the level of risk of your behaviors.

Your auto insurance company, for example, would have sensors installed in your car, and they would watch you drive. Now, if you're texting, or if it thinks that you look sleepy or distracted, or you're speeding, your insurance premium will go up in real time. At the same time, if it likes how you're driving, if you follow the recommended routes that it gives you, it might give you a discount on your premium for that period of time. And this is happening across many domains. It's happening in health insurance, where insurance companies are monitoring what you eat and how much you exercise, and modulating your premiums based on that; credit card companies are assessing risk based on patterns of life and habits. And this world of constant modulation, of constant monitoring, happens within, obviously, a very normative framework.

Now, the other side of this predator vision is a kind of colonial ordering: the autonomous assignation of meanings, classifications, and judgments from a top-down perspective, this kind of will to classify as a means of control, all in the name of objectivity. Let's take a company called Predictim. This is an online service that uses "advanced artificial intelligence" to assess a babysitter's personality. It looks at babysitting job candidates, then at all of their pictures from Facebook, Twitter, and Instagram, and uses an AI to analyze those pictures to determine their trustworthiness, or their risk of being a drug user, on a scale of one to five. It takes thousands of images and boils them down to a single-digit risk number. The result is a stark power asymmetry: the employers can look at everything; the low-wage worker cannot opt out. And this constitutes another kind of enclosure, an enclosure of meaning-making, where the interpretation of images constitutes an exercise of asymmetrical power.

Our built environments are watching us, looking for moments and images to extract value from us, and so many of the assumptions about human life made by these systems are narrow, normative, and laden with error. When we showed the "apple" class from ImageNet before, we said that this was probably not a particularly controversial class. But in the logic of predator vision, an apple is actually never just an apple. When is an apple a signifier for health or wellness that modulates your insurance policy? When is it a signifier for sin, or knowledge, or a high-quality, saleable piece of produce? And who decides which it is? The meaning of the apple is motivated by the types of value that can be extracted from it, and by the asymmetrical forms of power that it can reproduce and reinvigorate.

So, in sum, the advent of widespread computer vision and AI is constituting, as Felix said earlier, a new regime of images. Images are no longer characterized by the dynamics of mass media and the spectacle; instead, they're part of an active apparatus that is being used to extract value from sections of society that were previously pretty inaccessible
to capital and the state. We have Alexas in our bedrooms and in our kitchens; we have cameras on our desks, phones, and in our cars; and ultimately our affects, our attentions, and our histories are being captured. This unrestrained thirst for new fields of cognitive exploitation has driven a search into ever deeper layers of human bodies, human actions, and behaviors. Every form of biodata, including forensic, biometric, sociometric, and psychometric data, is right now being captured and logged for training AI systems.

Now, we can think of these spaces that were previously difficult for capital and the state to occupy as having essentially created a type of de facto commons, a commons that gave us a certain kind of freedom through anonymity. As these systems become more ubiquitous, they colonize more space, by subjecting everyday life and bodies to the top-down logics of financial extraction, and to these epistemologies of phrenology, classification, and control. And this matters, because deciding what images mean is at the root of power. Every social struggle has included a struggle for meaning as much as a struggle over rights, because the two are actually inseparable. Every social struggle has been, in part, an effort to make images mean different things. Struggles for self-representation are exactly that: struggles to define the meaning of your image. The freedom to say "this is not an apple" is the freedom to say "I am a man", or "I am a woman", or "I am neither of these".

So our hope is that by opening up the substrate of images, this weird underworld of the visual that is generally unseen but is nonetheless the fundamental proving ground of all visual AI construction, we can begin to critically engage with, and pull apart, some of these claims that AI vision is somehow objective, universal, ahistorical, and non-ideological. Because training data is one level at which we can start to see the historical and cultural origins, labor practices, infrastructures, and epistemological assumptions that are being built into these systems. So our project is to look at how training data is materially constructed and deployed, in order to surface what we see as the emerging grammar and political economy of AI. Thank you.