Stuart Russell - How Not To Destroy the World With AI

- Okay, I think we're ready to get started here. My name is Ken Goldberg. I'm a professor here at UC Berkeley, and I've been studying artificial intelligence and robotics for over 40 years, and I've always been a skeptic about artificial general intelligence until recently. Now, has anyone tried or heard about ChatGPT out there? No. Okay. I must say, that was a very earth-shattering experience for me.

And I have to say, I'm really recalibrating a lot of my positions on it, especially about the ability of ChatGPT to be creative, because I never thought an AI system would tell a good joke or write a great story. And it's passed all those by far. Now, I'm very excited about this series because I think this is on so many people's minds. It really is gonna change our lives. There's no doubt about it. And I'm happy to report that ChatGPT, the specific program, has some Berkeley DNA in it.

The primary architect of ChatGPT is a Berkeley grad, so you're gonna be hearing from him. He'll be here on the 19th of April at 5:00 PM. It's sold out, but we can get you on online. So that should be a very interesting talk. And we have some of the world's experts in AI here at Berkeley. And we have one of them, probably one of the most well known, here today.

Now this is gonna be part of a series of these lectures, and the next one will be next Wednesday. CITRIS has put these together with the Berkeley AI Research Lab, and the next one will be at lunchtime, same as this. And then every lunchtime in April, we're gonna present a different speaker. So next week will be Alison Gopnik, then we have Mike Jordan, and then, who's the next, Pam Samuelson. And then we actually have Rodney Brooks coming as well.

And John Schulman is sandwiched right in the middle. So please look at the CITRIS webpage to get updates on all of that. And Berkeley has a nice link on this, at berkeley.edu/ai. So please save your questions for the end. We'll be leaving time for some questions before we end at noon, I mean at one.

And we'll be happy to take them then. Our speaker today is a professor here in the Department of Computer Science at UC Berkeley. He's a leading researcher in artificial intelligence, and he's the author, he wrote the book. Okay, he's the author, with Peter Norvig, of Artificial Intelligence: A Modern Approach, and that textbook is being used in 1400 universities around the world. It is the single most popular textbook on artificial intelligence.

He has a number of other accomplishments. He is the Blaise Pascal Chair in Paris. He's on the World Economic Forum's Global Council on Artificial Intelligence. He delivered the Reith Lectures on BBC radio, and they're available online if you wanna check them out. Amazing lectures to most of the world on artificial intelligence, his views. And he is an officer of the Most Excellent Order of the British Empire. So as I understand it, that's the closest you get to being a knight if you live in the United States.

So he's sort of royalty. And he also wrote an open letter, as you may have seen, about artificial intelligence that was widely circulated. Now, I just have to end by saying, just a few last comments here. He's been referred to as the godfather of artificial intelligence, but don't worry, he's much more approachable than Marlon Brando. And one of his most significant contributions to the field is his work on inverse reinforcement learning, a technique that allows machines to learn from human behavior by inferring their underlying goals and preferences, the rewards, then. And this is a groundbreaking approach that's changing the way we think about machine learning and human-robot interactions.

I must admit these were written by ChatGPT. Okay, so the last one: when he is not busy revolutionizing the field of AI, he's also a lover of jazz music, a skilled woodworker, and a proud owner of a collection of antique typewriters. Please welcome Stuart Russell.

(audience clapping) - Thanks very much, Ken. And the last part is, as you can probably guess, complete hallucination. Okay. So we're gonna have, I'm sure we're gonna have the usual AV problems.

Good. All right. So I want to begin by getting everyone on the same page about what we're talking about. There's a lot of use of the word AI bandied about in the media, in corporate pitch decks. Pretty much everyone is doing AI, if you believe what the pitch decks say. So of course, what it has meant from the beginning, even going back to the 1840s when Charles Babbage and Ada Lovelace were talking about it, is making machines intelligent.

But if you're actually gonna do it, you have to take the next step. Okay, well, what does that mean? And sometime in the 1940s, this idea gelled, which was really borrowed from economics and philosophy, that machines are intelligent to the extent that their actions can be expected to achieve their objectives. And this is so prevalent that I've called it the standard model.

It's also the standard model of control theory, of operations research, and to some extent of economics, where we define objectives, we create optimizing machinery for those objectives, and then we set off the machinery and off they go. So I'll be arguing later, actually, that that model is incorrect and it's actually the source of a lot of our problems.

But the goal of the field since the beginning, and you can find this very explicitly in the writings and speeches of some of the founders of the field, is to create general purpose AI, not specific tools like equation solvers or chess programs, but general purpose AI, meaning that it can quickly learn to behave well in any task environment where the human intellect is relevant. We don't have a formal definition of generality, but I think we will know it when we see it. And maybe Ken already saw it.

So let me begin by saying that there's a reason we're doing this, right? Not just because it's cool or fun or profoundly interesting, but it could be extraordinarily beneficial for our civilization because, you know, our civilization is made out of intelligence. That's the material from which we build it. And if we have access to a lot more, then we might be able to have a much better civilization. So by definition, from what I said about general purpose AI, if we had access to it, we could use it to deliver at least what we already know how to deliver with the intelligence that we have, which is a good standard of living for hundreds of millions of people. But we haven't yet extended that to everybody on Earth. But general purpose AI can do it at much greater scale, at much less cost.

So we could, I think, deliver a respectable standard of living to everyone on Earth. And if you calculate the net present value of that, so that will be about a tenfold increase in GDP, which tells you how far we have to go, it will be about 13.5 quadrillion dollars of net present value. So that gives you a lower bound on sort of what is the cash value of creating general purpose AI, right? It gives you a sense of what's going on right now, right? The various entities that are competing, whether you think of China and the US or Microsoft and Google, that's the size of the prize.
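A figure of this size can be roughly reproduced with a back-of-the-envelope perpetuity calculation. The sketch below is only one guess at how such a number could arise, not the speaker's stated method: the baseline world GDP of about 75 trillion dollars a year and the 5% discount rate are illustrative assumptions that do not appear in the talk. Only the tenfold increase and the final figure do.

```python
# Back-of-the-envelope net present value of a tenfold GDP increase.
# ASSUMPTIONS (not from the talk): baseline world GDP and discount rate.
BASELINE_GDP = 75e12      # dollars per year, assumed
GROWTH_FACTOR = 10        # "about a tenfold increase in GDP" (from the talk)
DISCOUNT_RATE = 0.05      # assumed annual discount rate


def perpetuity_npv(annual_gain: float, rate: float) -> float:
    """NPV of a constant annual gain received forever, discounted at `rate`."""
    return annual_gain / rate


annual_gain = BASELINE_GDP * (GROWTH_FACTOR - 1)   # extra output per year
npv = perpetuity_npv(annual_gain, DISCOUNT_RATE)
print(f"NPV ~ {npv / 1e15:.1f} quadrillion dollars")
```

Under these assumed inputs the result lands at roughly 13.5 quadrillion dollars. A different discount rate or baseline would move it substantially, which is why it is best read as an order-of-magnitude estimate.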

So all the investments, all the numbers that we've ever talked about, are just minuscule in comparison. And it also creates a certain amount of momentum, which makes ideas of, okay, well, we could just stop doing this, probably difficult to implement. But of course we could have more than that, right? We could have much better healthcare. We could solve a lot of scientific problems that are too hard for us right now.

We could deliver amazingly good individualized, tutoring-based education to every child on Earth, so we could really, perhaps, have a better civilization. Actually, I just gave a talk at the OECD where I had politics at the end of that, and people laughed, so I took it off again. (audience laughing) And towards this goal, we are making a lot of progress.

I mean, the self-driving car, you know, is now commonplace on the streets of San Francisco. And I remember John McCarthy, one of the founders of the field, saying that he would be happy when a car could take him to San Francisco Airport by itself. And that's now something that is perfectly feasible. This is some work from my group. One of my PhD students, Nimar Arora, developed the monitoring system for the Nuclear Test Ban Treaty. This is developed not using deep learning, but actually this is a large-scale Bayesian inference engine, and it is running 24/7 at the United Nations in Vienna. This shows the accurate detection of a nuclear explosion in North Korea a few years ago.

And so there's the entrance to the tunnel down there that was later found in satellite images. This is our estimate in real time of where the event happened, about 300 meters from the tunnel. And then we have the amazing things that generative AI does.

So here's one of my favorite examples using DALL-E. You ask it for teddy bears mixing sparkling chemicals as mad scientists in a steampunk style. And this is what it produces in a few seconds, right? So this is pretty amazing, right? Particularly given that for us to do this, right, we would need a rendering engine, which is hundreds of thousands of lines of graphics code. It doesn't have a rendering engine, doesn't have any graphics code, and yet it's able to produce this image. It's not perfect.

The teddy bear seems to have a foot where he should have a hand and so on, but it's pretty amazing. And then of course, we had what was China's Sputnik moment, as it's often described, the defeat of human world champions at Go, which was predicted after the defeat of Garry Kasparov to take another hundred years because Go was thought to be so much more complex than chess. I'll tell you a little bit more about this later. Okay, so how are we doing this, right? How are we filling in this question mark between the sensory input of an AI system and the behavior that it generates? And there have been various approaches over time. The one that's popular right now is this, right? You simply fill that box with an enormous circuit.

That circuit has billions, or trillions now, of tunable parameters, and you optimize those parameters using stochastic gradient descent in order to improve the performance on the objective that you've specified. A much older approach was popular in the 1950s where instead of circuits in the box, you had FORTRAN programs. And rather than just stochastic gradient descent, you also did crossover. In other words, you took pieces of those different FORTRAN programs and mixed and matched in a way that is reminiscent of evolution. So that didn't work in 1958, but you should keep in mind that they were working with computers that were at least a quadrillion times slower than the ones we have now.
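To make "optimize tunable parameters using stochastic gradient descent" concrete, here is a minimal sketch on a deliberately tiny "circuit" with two parameters instead of billions. The target function, learning rate, and data are all illustrative assumptions, not anything described in the talk.

```python
import random

# Toy "circuit": a single linear unit y = w * x + b with two tunable parameters.
# The target (y = 2x + 1) and all hyperparameters are illustrative assumptions.
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(50))]

w, b = 0.0, 0.0          # tunable parameters, starting from an arbitrary guess
learning_rate = 0.1

for epoch in range(200):
    rng.shuffle(data)                     # "stochastic": visit examples in random order
    for x, y in data:
        error = (w * x + b) - y           # prediction error on one example
        # Gradient of the squared error 0.5 * error**2 with respect to w and b
        w -= learning_rate * error * x
        b -= learning_rate * error

print(f"w = {w:.3f}, b = {b:.3f}")
```

The loop should recover parameters close to 2 and 1. The large systems discussed in the talk follow the same recipe in spirit, just with vastly more parameters and data.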

So we are using, in a typical machine learning run now, a quadrillion times more computation than they typically used then, even if they did a, you know, multi-day machine learning run. Okay? For most of the history of the field, we pursued what we might think of as a knowledge-based approach, the idea that intelligent systems have to know things, right? That seems so fundamental and so obvious, and we built technologies around that principle. Initially, technologies based on logic, where the knowledge was represented as formal sentences in a mathematical logic. Later on, we embraced uncertainty and developed technologies based on probability theory, with the advent of Bayesian networks. And then in the 2000s, in some sense, we produced a unification of those two approaches with probabilistic logics or probabilistic programs. And I'll talk some more about that later on.

And the advantage of building knowledge-based systems is that each element of a knowledge-based system has its own meaning and can be verified and checked against reality separately. And that provides an enormous combinatorial advantage in the rate of learning and the ability of a system to fix itself when something isn't working.

So let me give you an example of what I mean by a knowledge-based approach, right? So here are a pair of black holes, about 1.2 billion light years away. So a good distance across the known universe, not all the way across, but a good distance across. And well, 1.2 billion years ago, they started rotating around each other. And as they did that, they lost energy and became closer and closer, and they started radiating gravitational waves. At the peak level of radiation, they were emitting 50 times more energy than all of the stars in the universe. So a quite cataclysmic event.

1.2 billion years later, just by coincidence, we had completed LIGO, the Laser Interferometer Gravitational-Wave Observatory, which is designed to detect exactly this type of event: the gravitational waves emitted by this type of collision of black holes. So this is several kilometers in length.

It contains enormous amounts of advanced technology: lasers, lots of computers, incredible materials. And it measures the distortion of space caused by these gravitational waves. And how big is that distortion of space? Well, relative to this scale, it's like the width of a human hair relative to the distance to Alpha Centauri. So if you change the distance to Alpha Centauri, which is four and a half light years, by the width of a human hair, this system would be able to detect that change. So that's how sensitive it is. And they predicted what that distortion would look like if there was a black hole collision like this, and it looked exactly the way they predicted. They could even measure the masses of the two black holes and so on. So this was the result of knowledge that we as the human race accumulated, the knowledge of physics, of materials, et cetera, et cetera, et cetera, over centuries.

And so if we're going to create general purpose AI, it needs to be able to do at least this, right? Because humans did this. It's hard to see how we would train a deep learning system. Where would we get all the billions of examples of different gravitational wave detectors to train it in order to figure this idea out, right? It starts to just not even make sense when you don't think about things in terms of systems that know things. So I'm gonna give you a few reasons to be a little bit doubtful about this orthodoxy that has dominated AI for the last decade, which is that all we need is bigger and bigger circuits, and it's going to solve everything. So first of all, just a simple technical reason, right? Circuits are not very expressive compared, in particular, to programs or formal logics. And that causes an enormous blowup in the size of the representation of some perfectly straightforward concept. So for example, the rules of Go just talk about pieces being put on a board and surrounding each other, being connected together in groups by vertical and horizontal connections.

These are very easy things to write in English or in logic. About half a page is enough. But to try to write those down in circuits requires millions of pages of circuit. And that means that it takes millions or even billions of examples to learn an approximation to the rules, which it can't actually learn correctly. And I'll show you examples of that later on. And you also get very fragile generalizations, which we see in adversarial examples in image recognition, where tiny invisible changes in images cause the object recognizers to completely change their mind.

What was a school bus, you change some pixels invisibly on the image, and now it says it's an ostrich. In fact, you can change any object into an ostrich by making invisible changes to the image. Okay, so these should worry you, right? Rather than just sort of saying, oh, well, that's peculiar, right, and just carrying on, they should worry you. They should say, okay, maybe there's something we're not understanding here. Maybe our confidence is misplaced in the success, the so-called success, of these technologies.

And some students at MIT working with Dave Gifford did some research on the, you know, standard high-performance convolutional neural network systems that do object recognition on the ImageNet database, which is the standard benchmark. And so if you look at this parachute here, it looks like a parachute to us. The system is recognizing it by looking at those pixels, right? As long as those pixels are blue, this is a parachute.

So that should worry you, right? You know, as long as there's a certain type of grass on the right-hand edge, then it's a golden retriever, right? This should worry you. If you are basing your, you know, your trillion-dollar company on the fact that these systems are going to work in the real world, then this should worry you. Well, you could say, well, okay, we know that these Go programs are truly amazing. They've beaten the human world champion, right? What more proof could you want than that? Okay, so here's what's happening with Go. So in 2017, the successor system of AlphaGo defeated the current world champion, whose name is Ke Jie, I'm probably mispronouncing that, who's Chinese.

And that was really China's Sputnik moment. And the best human player's rating on the Elo scale is 3,800. The best Go systems now are operating around 5,200. So they are stratospherically better than human beings, right? They're as much better than the human world champion as the human world champion is better than me, right? And I'm useless.

So, okay, so in our research team, we have a Go player. His name is Kellin Pelrine and his rating is around 2300. So way better than I am, but way worse than the human world champion. He's an amateur and, you know, completely hopeless relative to the best program. So the best program right now is called JBXKata005, which is way better than AlphaGo.
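To get a feel for what these rating gaps imply, the sketch below applies the standard Elo expected-score formula to the three ratings quoted in the talk. Whether the Go ratings quoted here follow exactly this formula is an assumption; Go servers often use their own rating systems, so treat the outputs as rough intuition only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score (roughly, win probability) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as quoted in the talk; the labels are just for readability.
ratings = {"amateur": 2300, "best human": 3800, "best programs": 5200}

# Expected score of the best human against the best programs
print(expected_score(ratings["best human"], ratings["best programs"]))
# Expected score of a 2300-rated amateur against the best programs
print(expected_score(ratings["amateur"], ratings["best programs"]))
```

Under this formula a 1,400-point gap already corresponds to an expected score well below one in a thousand, and a 2,900-point gap is smaller by several more orders of magnitude. That is what makes the result described next so striking.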

And this is the current number one program on the internet Go server, and Kellin decided to give it a nine-stone handicap, right? Which is the maximum handicap you can give to another player, right? So it's the kind of handicap you give to a five-year-old when you're teaching them to play, right? To give them, like, some chance of having a decent game. Okay, so the computer is black, and starts out with a nine-stone advantage. And here we go. So I'm just gonna play through the game very quickly.

So remember, the human is white, the computer is black, and watch what happens in the bottom right corner. So capturing in Go consists basically of surrounding your opponent's pieces so that they have no breathing spaces left at all. So white makes a little group, and then black is trying to capture it, so it's surrounding that little group. And then white is surrounding the black group, right? So we're kind of making a circular sandwich here. And black seems to pay no attention at all, doesn't seem to understand that its pieces are in danger. And there are many, many opportunities to prevent this.

It's actually a completely hopeless strategy for white. If you play this against a human, they immediately realize, and now you see that black has lost all of its pieces, right? This is a program that is far, far better than any human go player. And yet it's playing ridiculously badly and losing a game even with a nine stone handicap against an amateur. So if that doesn't worry you, right? Then I think you, I dunno what to say, okay? So I think we are constantly overestimating the capabilities of the AI systems that we build.

And we actually designed this approach. This is a human playing this game. But the circular sandwich idea came from this thought that the go program had not learned correctly what it means to be a connected group and what it means to be captured. In other words, it doesn't understand the rules because it can't understand the rules because it can't represent them correctly in a circuit.
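The notions of "connected group" and "captured" discussed here are short to state as a program, which is the point made earlier about expressiveness. The following is a minimal sketch of just those two notions, not a full Go implementation. The board encoding (strings of 'B', 'W' and '.') and the function names are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]


def group_and_liberties(board: List[str], start: Point) -> Tuple[Set[Point], Set[Point]]:
    """Flood-fill the group containing the stone at `start`.

    `board` is a list of equal-length strings using 'B', 'W' and '.' (empty).
    Stones are connected only vertically and horizontally.
    Returns (stones in the group, empty points adjacent to the group).
    """
    rows, cols = len(board), len(board[0])
    colour = board[start[0]][start[1]]
    stones, liberties, frontier = {start}, set(), [start]
    while frontier:
        r, c = frontier.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if board[nr][nc] == ".":
                liberties.add((nr, nc))          # an adjacent breathing space
            elif board[nr][nc] == colour and (nr, nc) not in stones:
                stones.add((nr, nc))
                frontier.append((nr, nc))
    return stones, liberties


def is_captured(board: List[str], start: Point) -> bool:
    """A group is captured when it has no breathing spaces left."""
    return not group_and_liberties(board, start)[1]


board = [
    ".BW.",
    "BWWB",
    ".BB.",
]
print(is_captured(board, (0, 2)))   # white group still has one breathing space
```

In this position the three white stones form one group with a single empty neighbour at the top right. Filling that point with a black stone would leave the group with no breathing spaces, and `is_captured` would then return `True`.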

And so we just looked for configurations that it doesn't recognize as captures. And we found one fairly quickly. We originally designed some by hand that involved sort of interlaced groups of white and black pieces like this, but we couldn't make it happen in a game.

And then we did some searching and found this configuration. And you can reliably beat all the programs. So not just JBXKata005, but actually all the leading programs can't handle this. So something is wrong with their ability to learn and generalize. So I would argue we still have a long way to go towards general purpose AI. Despite the impression one has using the large language models, I don't believe they really understand language, and I don't believe they really even understand that language is about the world in a real sense.

And I think the other big missing piece is probably the third bullet. The ability to manage our activity at multiple scales of abstraction, which lets us handle this world. The world is so big, so complicated, right? If you take AlphaGo right, which has an amazing ability to look ahead in the game, it's looking ahead 50 or 60 moves into the future.

But if you take that idea and you put it on a robot that has to send commands to its motors every millisecond, right? Then you're only getting 50 milliseconds into the future, this doesn't get you anywhere, right? The only way we manage is by operating at multiple scales of abstraction. And we do that seamlessly and we construct those different hierarchical levels of abstraction during our lives, and we also inherit them from our culture and civilization. We don't know how to get AI systems to do that. So we have a long way to go and it's very hard to predict when this is gonna happen. Now, this has happened before, right? The last time we invented a civilization ending technology was with atomic energy. And the idea of atomic energy goes back to special relativity in 1905.

This idea that there's a mass defect, there's some missing mass between, you know, helium and hydrogen, right? We know what the pieces are, we know how they fit together, but there's some missing mass, which represents the binding energy. So we knew that there were enormous quantities of energy available if you could transmute one type of atom into another. But physicists, or most leading physicists at the time, believed that that was impossible. So Rutherford, the leading nuclear physicist of that age, gave a talk in Leicester on September 11th, and he was asked this question: do you think in 25 or 30 years' time we might be able to release the energy of the atom? And he says, no, this is moonshine, right? And he said the same thing in many different fora and in different ways. But Leo Szilard read a report of that speech in The Times, he was staying in London, and went for a walk and invented the neutron-induced nuclear chain reaction the next morning.

So when I say unpredictable, right? I think it is pretty unpredictable when these kinds of advances might occur in AI. And this is the title of a paper that was published a few weeks ago by Microsoft, who have been evaluating GPT-4, right? So maybe they're claiming that this is that moment, right? Actually, in fact, they are claiming that this is that moment. They're claiming that they detect the sparks of artificial general intelligence in GPT-4, and they confidently claim that successive versions will quickly move towards a real AGI. So I'm not gonna say whether I think that's right or wrong, but I'll come back to it later.

So in the last part of the talk, I want to talk about what happens if we succeed. What happens if Microsoft is right? And this is Alan Turing's version of that. So he was asked that question in 1951, and he said, "It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. At some stage therefore we should have to expect the machines to take control." So why is he saying that? Well, I think it's really pretty straightforward, right? That intelligence really means the power to shape the world in your interests.

And if you create systems that are more intelligent than humans, either individually or collectively, then you're creating entities that are more powerful than us. So how do we retain power over entities more powerful than us, forever, right? That's the question. And I think Turing is answering it. He's saying you can't. So if Turing is right, then maybe we should stop. But that 13.5 quadrillion dollars is saying no, it's gonna be very hard to stop, right?

So I think we could try to understand whether we can answer this question differently from Turing and say, actually, we can. And I'm gonna argue that we can do it, but only if we abandon the standard model of AI and do things differently. So let me give you an example of where things are already going wrong, and I'm using this word misalignment. That's the clue, right? Why is it going wrong? Because the objectives that we put into the systems are misaligned with our objectives, right? So now we are creating basically a chess match between machines pursuing a misspecified objective and what we actually want the future to be like. So with social media algorithms, right? These are recommender systems. Think of YouTube, for example, that loads up the next video. You watch one video and another one comes along and loads up and starts playing.

So the algorithms learn how to choose that content, whether it's content for your newsfeed or whatever it is on Twitter, and they do that in order to optimize some objective. So click-through is one, the sort of total number of clicks that we're gonna get from this user over time. It could be engagement, the amount of time the user engages with the platform, and there are other metrics being used. And you might think, okay, to maximize click-through, the system has to learn what people want, which is sort of good, right? Sounds helpful. But we quickly learned, actually, that that's not what's going on, right? That it's not what people want, it's what people will click on, which means that the systems learn to actually promote clickbait because, you know, they got more clicks out of it. And so they like that kind of content.

And we very quickly learned about this idea of the filter bubble, where you start to only see things that you are already comfortable with. And so you become narrower and narrower in your understanding of the world. But that's not the optimal solution either, right? If you know anything about reinforcement learning, you know that reinforcement learning systems learn to produce sequences of actions that generate the maximum sum of rewards over time. And those sequences of actions change the environment in which the agent operates. So what is the environment that the recommender system is operating in? It's your brain, right? That's the environment. And so it learns to find a sequence of actions that changes your brain so that in the long run you produce the largest number of clicks, right? This is just a theorem, right? About how those learning systems will behave.
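The difference between "show people what they will click on now" and "change people so they click more later" can be seen in a toy model. The sketch below is not a model of any real recommender. The five user states, the click probabilities, the nudge cost, and the discount factor are all invented for illustration. It compares a myopic policy with the policy found by value iteration, a standard way to maximize a long-run sum of rewards.

```python
# Toy model: user "state" 0..4 measures how extreme (and predictable) the user is.
# All numbers below are illustrative assumptions.
CLICK_PROB = [0.3, 0.4, 0.5, 0.6, 0.7]   # click probability when content matches the state
NUDGE_COST = 0.1                          # nudging content is clicked slightly less often
GAMMA = 0.95                              # discount factor for future clicks
ACTIONS = ("match", "nudge")
STATES = range(len(CLICK_PROB))
LAST = len(CLICK_PROB) - 1


def step(state, action):
    """Return (expected immediate clicks, next user state)."""
    if action == "match":
        return CLICK_PROB[state], state               # user stays as they are
    return CLICK_PROB[state] - NUDGE_COST, min(state + 1, LAST)   # user shifts


def q_value(state, action, values):
    """Immediate clicks plus discounted long-run value of the resulting state."""
    reward, nxt = step(state, action)
    return reward + GAMMA * values[nxt]


# Value iteration: long-run value of each user state under the best policy.
values = [0.0] * len(CLICK_PROB)
for _ in range(1000):
    values = [max(q_value(s, a, values) for a in ACTIONS) for s in STATES]

myopic = [max(ACTIONS, key=lambda a: step(s, a)[0]) for s in STATES]
long_run = [max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES]
print("myopic  :", myopic)
print("long-run:", long_run)
```

In this toy setting the myopic policy always matches the user's current tastes, while the long-run optimizer accepts fewer clicks today in order to push the user toward the most predictable state. The behaviour falls out of the objective; nothing in the code mentions people, minds, or opinions.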

So they simply modify people to be more predictable by sequences of thousands of nudges over time. And at least anecdotally, people have reported that this can happen quite quickly. That your nice Midwestern middle-of-the-road granny, you know, in a few months can be reading the Daily Stormer and posting on neo-fascist websites. And these algorithms are really stupid, right? They don't know that people exist. They don't know that we have minds or political opinions.

They don't understand the content of what they're sending to us, right? If we made the AI systems better, the outcome would be much worse, right? And in fact, this is a theorem that one of my students, it's off the bottom, sorry. So Dylan Hadfield-Menell and Simon Zhuang proved a theorem, a very simple theorem, saying that when the objectives are misaligned, optimizing the wrong objective makes the situation worse with respect to the true objective, under fairly mild and general conditions. So we actually need to get rid of the standard model.
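A toy illustration of the flavour of that result, and only the flavour: this is not the theorem or its proof, and the setup is an assumption made for illustration. A fixed budget is split across three things people care about, but the objective handed to the optimizer mentions only two of them.

```python
from itertools import product
from math import sqrt

BUDGET = 12  # units of a shared resource split across three attributes (assumed)


def true_utility(alloc):
    """What people actually care about: all three attributes (assumed form)."""
    return sum(sqrt(x) for x in alloc)


def proxy_objective(alloc):
    """The objective given to the machine: it omits the third attribute."""
    return sqrt(alloc[0]) + sqrt(alloc[1])


# Every way of splitting the budget into whole units
allocations = [(a, b, BUDGET - a - b)
               for a, b in product(range(BUDGET + 1), repeat=2) if a + b <= BUDGET]

start = (4, 4, 4)                                    # a balanced starting point
best_for_proxy = max(allocations, key=proxy_objective)

print("proxy-optimal allocation:", best_for_proxy)
print("true utility before:", true_utility(start))
print("true utility after :", true_utility(best_for_proxy))
```

Perfectly optimizing the proxy strips everything from the attribute it does not mention, so the proxy score goes up while the true utility goes down. A stronger optimizer only reaches that outcome more reliably.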

So we need a different model, right? This is the standard model: machines are intelligent to the extent their actions can be expected to achieve their objectives. Instead, we need the machines to be beneficial to us, right? We don't want this sort of pure intelligence that, once it has the objective, is off doing its thing, right? We want the systems to be beneficial, meaning that their actions can be expected to achieve our objectives. And how do we do that? Well, it's not as impossible as it sounds, right? So I'm gonna do this in two ways. One is in sort of an Asimovian style, saying here are some principles, right? So the first principle is that the robot's only objective is to satisfy human preferences, and preferences here is not just what kind of pizza do you like, but which future of the universe do you prefer, right? What is your ranking over probability distributions over, you know, complete futures of the universe?

So it's a very complex and abstract mathematical object, but that's what the objective of the robot should be. If you prefer a less technical notion, think of it as to further human interests. But the key point is number two, that the machine knows that it doesn't know what those preferences are, right? That you do not build in a fixed, known objective upfront. Instead, the machine knows that it doesn't know what the objective is, but it still needs a way of grounding its choices over the long run. And the evidence about human preferences, we'll say, flows from human behavior.

So I'll talk a little bit more about that. So these three principles actually can be turned into a mathematical framework. And it's similar, for the economists in the room, Ben, it's similar to the idea of what's called a principal-agent game, except in principal-agent games, the agent is typically thought of as another human, and you're trying to get the human to be useful to the principal. So how do you do that even when they don't have the same objectives? But here we actually get to design the agent, and we'll see that the agent doesn't need to have any objectives of its own.

So in some ways it's actually easier. So we call this an assistance game. So it involves at least one person, at least one machine, and the machine is designed to be of assistance to the human. So let's have a look at that in a little bit more depth.

So we've got some number of humans, M, and they have their utilities, U1 through UM, which we can think of as preferences about what the future should be. And then we have some machines or robots, N of those. And the robots, right, their goal is to further human interests.

So if I was a utilitarian, I would say the collective human interest is the sum of the utilities of the individual humans. And we could have other definitions if you prefer different ways of aggregating preferences. The key point is there's a priori uncertainty about what those utility functions are. So it's gotta optimize something, but it doesn't know what it is. And, you know, in principle you can just solve these games offline and then look at the solution and how it behaves. And as the solution unfolds, effectively, information about the human utilities is flowing at runtime based on the human actions.
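One simple way to picture "information about the human utilities flowing at runtime" is Bayesian updating over candidate utility functions as the human acts. The sketch below is an illustrative assumption, not the formal assistance game from the talk: it uses two made-up hypotheses about one person's preferences and a noisily rational choice model with an invented rationality parameter `BETA`.

```python
from math import exp

# Two hypotheses about a human's utility for three options (illustrative assumptions).
HYPOTHESES = {
    "prefers_tea":    {"tea": 1.0, "coffee": 0.0, "water": 0.5},
    "prefers_coffee": {"tea": 0.0, "coffee": 1.0, "water": 0.5},
}
BETA = 3.0  # how reliably the human picks the higher-utility option (assumed)


def choice_likelihood(choice, options, utility):
    """Probability of a noisily rational human picking `choice` from `options`."""
    weights = {o: exp(BETA * utility[o]) for o in options}
    return weights[choice] / sum(weights.values())


def update(belief, choice, options):
    """Bayes' rule: reweight each hypothesis by how well it explains the choice."""
    posterior = {h: p * choice_likelihood(choice, options, HYPOTHESES[h])
                 for h, p in belief.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}


belief = {"prefers_tea": 0.5, "prefers_coffee": 0.5}   # a priori uncertainty
for observed in ["tea", "tea", "water", "tea"]:
    belief = update(belief, observed, ["tea", "coffee", "water"])
print(belief)
```

After a few observed choices the belief concentrates on one hypothesis but never reaches certainty. The uninformative "water" choice leaves it unchanged, since both hypotheses explain it equally well.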

And the humans can do deliberate actions to try to convey information, and that's part of the solution of the game. They can give commands, they can prohibit you from doing things, they can reward you for doing the right thing. In fact, in GPT-4, that's one of the main methods. It's basically good dog and bad dog.

That's how they get GPT four to behave itself and, and not say bad things is just by saying bad dog whenever it says a bad thing, right? So in some sense, you know, the entire record, the written record of humanity is, is a record of humans doing things and other people being upset about it, right? All of that information is useful for understanding what human preference structures really are algorithmically. Yeah, we, you know, we can solve these and in fact, the, the one machine, one human game can be reduced to a partially observable MDP. And for small versions of that we can solve it exactly. And actually look at the equilibrium of the game and, and how the agents behave. But an important point here and, and the word alignment often is used in, in discussing these kinds of things.

And as Ken mentioned, it's related to inverse reinforcement learning, the learning of human preference structures by observing behavior. But alignment gives you this idea that we're gonna align the machine and the human and then off they go, right? That's never going to happen in practice. The machines are always going to have a considerable uncertainty about human preference structures, right? Partly because there are just whole areas of the universe where there's no experience and no evidence from human behavior about how we would behave or how we would choose in those circumstances. And of course, you know, we don't know our own preferences in those areas.

In Human Compatible, the book that Ken mentioned, I use durian as an example. Durian is a fruit that some people love and some people despise to an extreme degree, right? Either the most sublime fruit the world is capable of producing, or skunk spray, sewage, stale vomit and used surgical swabs, right? These are the words people use.

So I tried it, I was in Singapore last week, so I tried durian for the first time. I'm actually right in the middle, right? I'm not on one end or the other, okay? But before that, I literally didn't know my preferences and I learned something about them as a result. So when you look at these solutions, how does the robot behave? If it's playing this game, it actually defers to human requests and commands. It behaves cautiously because it doesn't wanna mess with parts of the world where it's not sure about your preferences. In the extreme case, it's willing to be switched off. So I'm gonna have, in the interest of time, I'm gonna have to skip over the proof of that, which is proved with a little, a little game.
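The willingness to be switched off can be illustrated numerically. The Python sketch below is not the proof from the talk. It is a simplified toy version under stated assumptions: the robot's proposed action has an unknown utility U for the human, the robot's belief about U is a normal distribution (the mean 0.2 and standard deviation 1.0 are made up), switching off is worth 0, and the human knows U and permits the action only when U is positive.

```python
import random

def off_switch_values(samples):
    """Compare three robot policies, given samples from its belief over the utility U."""
    n = len(samples)
    act_now = sum(samples) / n  # act without asking: E[U]
    switch_self_off = 0.0       # do nothing: utility 0 by assumption
    # Defer to the human, who (by assumption) allows the action only if U > 0: E[max(U, 0)]
    defer = sum(max(u, 0.0) for u in samples) / n
    return act_now, switch_self_off, defer

random.seed(0)  # fixed seed so the illustration is repeatable

# Uncertain robot: its belief about U is spread out (hypothetical parameters).
uncertain = [random.gauss(0.2, 1.0) for _ in range(10_000)]
# Certain robot: it believes it knows U exactly.
certain = [0.2] * 10_000

print("uncertain:", off_switch_values(uncertain))
print("certain:  ", off_switch_values(certain))
```

Because max(U, 0) is never smaller than either U or 0, deferring can never do worse on average than acting unilaterally or switching itself off, and it does strictly better whenever the belief puts weight on both signs of U. When the belief collapses to a single value, the advantage of deferring vanishes.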

But basically we can show very straightforwardly that as long as the robot is uncertain about how the human is going to choose, then it has a positive incentive to allow itself to be switched off, right? It gains information by leaving that choice available for the human. And it only closes off that choice when it has, well, or at least when it believes it has perfect knowledge of human preferences. So there's many more extensions, right? The most obvious issue here is what happens when we have many humans, as I mentioned, one option is to optimize the, the sum of utilities, but that's an oversimplified solution and you have to be quite careful. So if you've ever seen the Avengers movie where Thanos, right? He collects these infinity stones.

Once he's got them all, he's got this plan, right? A very rational plan also, he thinks, right? Which is that if there were half as many people in the universe, the remaining people would be more than twice as happy. So as a good naive utilitarian, he clicks his fingers, gets rid of half the people in the universe, okay? So the Financial Times review of the movie is "Thanos gives economics a bad name," right? So we, you know, as AI systems reach these sort of Thanos levels of power, we had better figure this out right before we get there and not implement naive solutions to, to these kinds of very important questions. There are issues to do with making sure that when we have lots of assistance game solvers in the world at the same time, you know, from many different manufacturers, that they don't accidentally get into strategic interactions with each other. We have to take into account the real psychology, the real cognitive architecture of humans, which is what's generating their behavior, which is what's providing evidence about their underlying preferences. So you have to be able to invert that correctly. Since we're getting rid of the standard model, right? That's an enormous pain for me because in writing the textbook, right, all the major chapters of the textbook assume the standard model, right? If you think about search and problem solving, you assume a cost function and the goal test, right? But what if you don't know what the cost function and the goal test should be? Well then we gotta rewrite that whole chapter. We're gonna rewrite Markov decision processes and reinforcement learning and, and supervised learning and all the other branches of AI that rely on knowing the objective upfront.

And then if we, if we develop successful and capable technologies on this foundation, we've then gotta make, make sure that they actually get used, that regulations are developed to ensure that unsafe versions of AI systems are not deployed. So I'll just briefly mention one of the results about aggregation of preferences across multiple humans. And this has been a fairly settled area of, of economics for a long time.

Actually, John Harsanyi, who is a Berkeley professor, won the Nobel Prize in part for this work. So the social aggregation theorem is that if, if you are making a decision on behalf of N people, then the only undominated or what's called Pareto-optimal strategies are to optimize a linear combination of the preferences of the individuals, right? And if you, in addition, assume that no individual should be privileged, then that linear combination just becomes a sum, right? That everyone gets equal weight in that linear combination. But that requires an assumption that's quite unrealistic, which is that everybody has the same belief about the future. So this is called the common prior assumption.

If you get rid of that assumption and allow people to have different beliefs about the future, the theorem gives you a different result. So this is something that Andrew Critch, who's a member of our group, proved a few years ago, that the optimal policies when you have different beliefs actually have dynamic weights for the preferences of the individual. And those weights change over time as the predictions turn out to be true or false that each individual is making about the future. So whoever has the best prior that turns out to give high probability to the future that actually unfolds will end up getting an exponentially larger weight than the person who has the worst prior. It's a very egalitarian idea.
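To make the dynamic-weights idea concrete, here is a minimal Python sketch under a simplified reading of the result described above. It is not the theorem or its proof. The assumption is that each person's weight gets multiplied by the probability that person's beliefs assigned to what actually happened, and the weights are then renormalized. The function name `update_weights` and the probabilities 0.8 and 0.4 are hypothetical.

```python
def update_weights(weights, predicted_probs):
    """Scale each weight by the probability that person assigned to the observed event, then renormalize."""
    raw = [w * p for w, p in zip(weights, predicted_probs)]
    total = sum(raw)
    return [r / total for r in raw]

# Two people start with equal weight in the aggregate objective.
weights = [0.5, 0.5]

# Hypothetical: in each of 10 rounds, person A gave the event that occurred
# probability 0.8, while person B gave it probability 0.4.
for _ in range(10):
    weights = update_weights(weights, [0.8, 0.4])

ratio = weights[0] / weights[1]
print(weights, ratio)
```

Each round multiplies the ratio of the two weights by 0.8 / 0.4 = 2, so the better predictor's advantage grows exponentially in the number of rounds, which is the behavior the talk describes.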

But this is a theorem, right? And everybody prefers this policy because everybody believes that their beliefs are the right beliefs, right? Nobody believes that their beliefs are wrong, right? 'Cause they wouldn't be beliefs. So this is inevitable. What we do about it is a different question. So I'll just wrap up and say a little bit about large language models because that, that's probably what you were expecting me to talk about. So what are they, right? They're circuits trained to imitate human linguistic behavior. So you want to think of their outputs not as words, but as actions, right? They are choosing to emit a linguistic act with each word.

So that's the way to think about it. And as Ken experienced, they do it incredibly well. It's really hard to see coherent grammatical text and not think that there's a mind behind it. So if you need an antidote, think about a piece of paper that has written on it some beautiful piece of prose or poetry, and ask yourself, is this piece of paper intelligent because it has this beautiful paragraph of text on it, right? We don't know where the large language models are between a piece of paper and AGI, right? They're somewhere along there, we just don't know where they are.

Now, human linguistic behavior is generated by humans who have goals. So from a simple Occam's razor point of view, right? If you are going to imitate human linguistic behavior, then the default hypothesis is that you are going to become a goal seeking entity in order to generate linguistic behavior in the same way that humans do. Right? So there's an open question, are these large language models developing internal goal structures of their own? And I asked Microsoft that question a couple of weeks ago when they were here to speak, and the answer is, we have no idea, right? And we have no way to find out.

We don't know if they have goals, we don't know what they are, right? So let's think about that a bit more, right? One might initially think, well, you know what they're doing. If they're learning to imitate humans, then, then maybe actually, you know, almost coincidentally that will end up with them being aligned with what humans want. All right? So perhaps we accidentally are solving the alignment problem here, by the way we're training these systems. And the answer to that is it depends. It depends on the type of goal that gets learned.

And I'll distinguish two types of goals. There's what we call common goals, which are things like painting the wall or mitigating climate change, where if you do it, I'm happy; if I do it, you are happy; we're all happy, right? These are goals where any agent doing these things would make all the agents happy. Then there are indexical goals, which are, meaning indexical to the individual who has the goal. So drinking coffee, right? I'm not happy if the robot drinks the coffee, right? What I want to have happen is if I'm drinking coffee and the robot does some inverse reinforcement learning, hey, Stuart likes coffee, I'll make Stuart a cup of coffee in the morning.

The robot drinking a coffee is not the same, right? So this is what we mean by an indexical goal, and becoming ruler of the universe, right? Is not the same if it's me versus the robot. Okay? And obviously if systems are learning indexical goals, that's arbitrarily bad as they get more and more capable, okay? And unfortunately, humans have a lot of indexical goals. We do not want AI systems to learn from humans in this way. Imitation learning is not alignment. And then the question is not just do they have goals, but can they pursue them, right? How do those goals causally affect the linguistic behavior that they produce? Well, again, since we don't even know if they have goals, we certainly don't know if they can pursue them, but the empirical anecdotal evidence suggests that, yeah, I mean if you look at the Kevin Roose conversation, Sydney, the Bing version of, of GPT-4, is pursuing a goal for 20 pages, despite Kevin's efforts to redirect and talk about anything but getting married to Sydney, right? So if you haven't seen that conversation, go and read it and ask yourself, is Sydney pursuing a goal here, right? As opposed to just generating the next word. Okay? The last point I wanna make before I wrap up, I'm sorry I'm a little over time, is that black boxes, as I've just illustrated, are really tough to understand, right? This is a trillion parameter or more system.

It's been optimized by about a billion trillion random perturbations of the parameters. And we have no idea what it's doing. We have no idea how it works. And that causes the problem, right? Even if you wrap it in this outer framework, the assistance game framework, right? It's gonna be very hard for us to prove a theorem saying that, yep, this is definitely gonna be beneficial when we start running it. So part of my work now is actually to develop AI systems more along the lines that I described before, the knowledge-based approach where we have, where each component of the system, each piece of knowledge has its own meaningful semantics that we can individually check against our own understanding or against reality. That those pieces are put together in ways where we understand the logic of the composition and then we can start to have a, a rigorous theory of how these systems are gonna behave.

I just don't know of any other way to achieve enough confidence in the behavior of the system that we would be comfortable moving forward with the technology. And probabilistic programming languages are one possible substrate, not the only one, but one substrate that seems to have these types of properties and to exhibit the kinds of capabilities that we would like, for example, being able to do computer vision better than deep networks, right? Being able to understand language and learn extremely fast from very small numbers of examples. So to summarize, I think AI has vast potential and that creates unstoppable momentum. But if we pursue this in the standard model, then we will eventually lose control over the machines. But we can take a different route, and talking of losing control over the machines, we can take a different, we can take a different route that actually leads to AI systems that are beneficial to humans. And I'm afraid to be really boring, right? It's a very exciting time in AI, but I think AI is going to end up looking more like aviation and nuclear power, where there's a lot of regulation, where the developers, the engineers are extremely careful about how they build their systems, they check, they verify, and they drive down the error rate as much as they possibly can. We don't look anything like that right now. But if we're gonna be mature about this, if we believe that we have sparks of AGI, right? That's a technology that could completely change the face of the earth and civilization, right? How can we not take that seriously? Thank you.

(audience clapping) - [Moderator] Okay, we do have time for a couple of questions. I know many people have to go to other classes. I have to teach actually in five minutes, so I have to go. But I know that there's a lot of questions here and Stuart, you'll be willing to stick around for a little bit? - Yep. - Is that okay?

Yeah. Yeah, okay, great. So questions? Wow, we're everyone okay? You had one right here.

Go ahead. - [Audience Member] First of all, thank you so much. What do you advise young founders in the AI space in the sense of making their machine learning models, as I'd say, aligned, morally positive and well-defined as possible, while still driving innovation in that space too? - So I think that's a great question because at the moment it's really difficult if your business model is, well, we get GPT-4 or some open source equivalent, we fine tune it for a particular application.

We figure out how to get it to stop hallucinating and lying, right? I mean, the last thing you want to do is, is you know, have AI systems selling people insurance policies that don't exist for houses on Pluto and all this kind of stuff, right? And I've talked to many, many CEOs who ask how do we use, you know, these, these language models? You know, who can I replace in my organization? I would say, well, if you have a lot of psychotic six year olds who live in a fantasy world in your organization doing jobs, you could probably replace those with GPT-4. So I actually believe that we might be able to come up with a different form of training that still uses all this human data, but avoids the indexical, right? Learning the indexical goals, so that might be a help. But I'm afraid at the moment, the other technologies, the well-founded technologies, we can do some amazing things like the global monitoring system for the nuclear test ban treaty, which literally took me half an hour to write, but it's not a panacea, right? One thing we can't do with it is just download gigabytes of data, shove it into a tabula rasa system and have it work off the bat like that. So the technologies that we, that are available off the shelf that will just at least, you know, claim to solve your problem without any real effort are so unreliable that I don't, I don't think you can ethically use them for any high stakes application. And interestingly, you know, I think OpenAI did make serious efforts. If you look at their webpage and the model card paper, they made serious efforts to try to get it to behave itself, basically, bad dog, bad dog.

But if, you know, if you've got a dog who pees on the carpet and you say bad dog, right? It says, oh, okay, you mean don't pee on the carpet before breakfast? Good. I won't do that, right? So it's kind of hard to get these things to learn because we don't know how they work. So anything that's high stakes, OpenAI says maybe you shouldn't use GPT-4 for that application.

- Right here, - Yep. - [Audience Member 2] Thank you so much for your talk. I have two questions.

One is, I read that you signed the petition to ban AI experiments for a certain amount of time. Why don't you think we could have done what we will do in the next six months time in the past, like before all this became like blew up. Second question is, if the AI like starts replacing all the jobs should, what, what kind of solution should we seek for? Should the government start like distributing the increased production? Or should each one of us like try to find another job that we can do? - Yeah, okay, so two questions. So the open letter asks for a moratorium on training and releasing systems more powerful than GPT-4 for six months. I did not write the letter at all. When they asked me to sign it, I suggested all kinds of changes, but it was too late at that point to change it. But I decided to sign it because I think the underlying principle is very simple.

You should not deploy systems whose internal principles of operation you don't understand, that may or may not have their own internal goals that they are pursuing, right? And that show sparks of AGI, right? It's incredibly irresponsible to do that. And you know, there's the OECD, which is a very boring organization, right? Organisation for Economic Co-operation and Development. You know, it's all of the wealthy countries talking about, you know, making sure that their economies interoperate successfully. So they have AI principles that all the major governments have signed up to, right? This is a legal document that says that AI systems need to be robust, predictable, and you need to be able to show that they don't present an undue risk before you can deploy them, right? So I view this petition as simply asking that the companies abide by the AI principles that the OECD governments have already agreed to, and we're sort of nudging the governments to go from principles to regulation. So that's the idea.

Okay, second question, what about jobs? Well, that's a whole other lecture. In fact, one of the four Reith Lectures is exactly about that. And it's a long story. My view in a nutshell is that the impact on jobs is going to be enormous and obviously general purpose AI would be able to do any job that humans can do with a couple of exceptions, right? So jobs that we don't want robots to do and jobs where humans have a comparative advantage because we have, we think, the same subjective experience as each other, which machines do not, right? A machine can never know what it's like to hit your thumb with a hammer or what it's like to fall out of love with somebody, that's simply not available to them. And so we have that empathic advantage and to me that suggests that the future economic roles for human beings will be in interpersonal roles.

And that's the future of the economy. Not a whole bunch of data scientists, right? The world does not need 4 billion data scientists or any, you know, I'm not even, I'm not even sure we need 4 million data scientists. So it's a very different future, but not a terrible one, but one we are totally unprepared for 'cause we don't have, you know, the human sciences have been neglected. We don't have the science base to make those professions productive and really valuable in helping each other. Thanks.

Last question, apparently. - [Audience Member 3] Thank you for the talk. During one part of the talk, you've mentioned that a good way to reach an alignment is to prioritize the parts of population or people who have a strong prior for the future.

So the ones who are right about the future will kind of end up winning over in terms of what the systems align to. How do we ensure that those are positive for the humanity and not bad actors end up kind of being right about the future and kind of ending up a self-fulfilling prophecy, just making a negative future, huh? - Okay, so actually I wanna separate out two things here. So that theorem is a theorem, right? So you just have to face it, right? I mean, I think the, the obvious recommendation coming from that theorem is that it's a good idea to get people to have shared beliefs about the future, right? But it actually, it's interesting, it provides, for example, a way to get agreements between people who fundamentally disagree about the state of the world, but you can get them to agree by having this sort of contingent contract that says, well, if you are right, you win. And if you are right, you win. And both of them agree to this contract, right? So you can naturally resolve negotiations better by understanding this theorem.

But there's the separate question about how, you know, how do we deal with the fact that some people's preferences might be negative? So Harsanyi calls this the, the problem of sadism, right? That some people actually get their jollies from the suffering of others. And he argues that if that's literally true, we should zero out those negative preferences that they have for the wellbeing of others. And you might say, okay, well those people are pretty rare, right? Most people, all other things being equal, so if it had no impact on my welfare, right? Someone else being better off, happier, I'm not gonna, I'm not gonna be too upset by that, right? Most people would actually be pleased that someone else is, is not suffering rather than suffering. But there's an entire category of preferences that you might think of as relative preferences. Some people in economics call these positional goods, right? So I like having this nice shiny car parked out in front of my house, not just because it's a nice shiny car and I like driving it, but because it's nicer and shinier than the guy down the street, right? I like having a Nobel prize, not because it's a big shiny piece of gold and I get a million euros, but because nobody else has one.

If everyone else in the world got a Nobel Prize at the same time I did, I would feel a lot less good about myself, right? Isn't that weird? But relative preferences play a huge part in people's identities, right? Think about soccer fans and how they care about their soccer team being better than the other people's soccer teams and so on. So if we were to zero out all those relative preferences, which operate mathematically exactly like sadism, right? It would be a massive change. So I don't have a clear recommendation on that point, but it's certainly something we should work on.

- [Moderator] Thank you so much. That's all the time we have. Thank you so much, Professor Stuart Russell, for the enlightening talk.

(audience clapping) Thank you all for coming. Please stay tuned for the next events. Go to berkeley.edu/ai for the rest of the schedule. Thank you so much.

2023-04-14 05:44
