Workshop on Foundation Models (Session I: Opportunities and Responsibility)
So the first session is on opportunities and risks for foundation models. We'll start with our first keynote talk by Jack Clark. Jack is a co-founder of Anthropic, an AI safety and research company working on building reliable, interpretable, and steerable AI systems. He is also co-chair of the AI Index, which tracks AI trends over time. Previously, he was a policy director at
OpenAI, where he helped shape policy around efforts like GPT-3. So Jack, the floor is yours.
Thank you very much, Percy. And I promise that Percy and I didn't coordinate, but I am going to give a presentation that makes a couple of points that Percy made, albeit in perhaps a slightly blunter form. So what I'm going to talk about today is how big models (and I'm going to get to why I'm using the term big models), or foundation models, are becoming part of the ecology of the internet. They're going to change the environment that we will operate in. And at the same time, they're going to be
influencing the power relationships between different AI actors. This presents opportunities for coordination. It also presents some risks which we need to pay attention to. And I think that these models represent a real grand challenge for society to try and analyze and integrate safely. And as Percy alluded to, these can be incredibly useful tools, and they could also be harmful via bugs or harmful via the fact that they're capable.
So I'm going to give an overview talk, which tries to touch on some of these issues. And as Percy has covered a couple of the same things, I'm going to go relatively quickly. So first, what are these models? Well, the foundation models paper from Stanford defined them as models trained on broad data at scale that can be adapted to a broad range of downstream tasks. Examples are BERT, GPT-3, and CLIP. I think that's pretty good. I prefer a slightly different definition, which is that I think of these models as big models, and the reason why I think of them as big is that I think one of the key aspects of these models is their resource intensity. They cost a lot to train. They need a lot of data to be dumped into them. They require a lot of engineering resources. And it's that bigness, that scale, which relates to the power that these models have and how they change power relationships in AI development. So that's why I'm using this term, but I think foundation models captures
them quite well. So let's just quickly refresh on why we care about these models. As Percy said, these models can do few-shot learning, so they can generally be adapted to a very, very broad range of downstream tasks. And they also natively have just a huge range of
capabilities. We have models that can do, in the same model, text generation, text classification, string prediction. The same techniques you use to train foundation models for text, you might use to train models to do material science or chemistry analysis. And it's all going to lead to a similar property: they can do data transformation. And as people likely know, data transformation is just an incredibly economically valuable thing. So it kind of explains why these are being deployed
so broadly. And these models can be plugins to other models. They can become world models for other systems like robots, or other things which other panelists are going to talk about. So on the flip side of these capabilities are the issues. Percy did a good job of going over those, so I'm going to go quickly here. But as we know, they have biases, they have the potential to give inappropriate responses. They have the potential to give dangerous responses, because then
you could ask a model how to build a bomb, and eventually it might tell you how to build a bomb, which is something that raises a broad set of confounding issues. They have this broad potential for utility and also this broad potential for misuse. And
probably most importantly, they're kind of difficult to interpret currently. We know that these models can do a broad range of things, but we don't necessarily know what is going on inside of a model that lets them do these things. That also relates to our ability to characterize capability emergence. We know that things emerge out of these models as you train them, new capabilities, many of which are desirable, but we don't really understand the process by which that emergence occurs. And we can't really predict what capabilities are going to come out of a big,
a big training run. So this presents huge opportunities and challenges. A good conceptual frame I've found for thinking about these models is to think of them as funhouse mirrors. And what I mean by that is any sufficiently large model will take in a huge amount of data, and it will magnify some of that data and minimize other bits.
And what this means in the context of these models is certain types of culture get magnified by these models, certain types of culture get minimized; certain identity groups get maximized, certain identity groups get minimized; stereotypes get maximized, stereotypes get minimized. You get the picture. And this is a huge issue. It means that these models have different behaviors for different parts of the landscape they're being trained on. And where these models intersect with culture is where a huge range of really complicated issues start occurring, to do with bias, to do with how these models behave for different groups of people, to do with effectiveness. And I think that using this frame of a funhouse mirror gets at some of the challenge here, where we know what these things are, but they have this very challenging property. And as Percy said, and we really do need to emphasize this, these models are being deployed right now. It's very much not an academic concern anymore. It's an economic concern. And when things shift into economic concerns, power dynamics change,
and you start to enter a world where huge amounts of capital will have interest in the deployment and development of these models. And when capital has interest in something, it tends to benefit the most capital-oriented entities, which doesn't necessarily include academia or civil society or government. So we're in an incentive structure currently that points us towards these models being developed primarily by capital-oriented actors and not by the whole of society. And as I'm going to get to later in this talk,
I think that's one of the really big issues we need to grapple with at this conference today. So who's building them? The GPT-3 replications that exist are by OpenAI, the originator; Huawei, a Chinese telecommunications company with a good research team; and AI21 Labs, which is an Israeli startup. There are also replications being worked on by Sberbank, which is a really, really big bank in Russia with a good AI research team, and by some other entities, many of which are companies. And all of these public replications are by companies. There have also recently been code models developed. And these code models, much like foundation models,
trained on very, very large amounts of code, have capability emergence, have broad utility, have some challenges, but all of the people developing them are companies so far: specifically Microsoft, GitHub, and OpenAI via Copilot, and OpenAI via its own service products. And Google just published a paper on program synthesis with large language models, which indicates it's thinking intently about this area as well. This fits into a broader pattern that we really need to emphasize, which is that if you look at recent breakthroughs in AI research (and by a breakthrough, I mean something that was notable to people in the field and/or economically useful and was subsequently integrated into business), you notice that there's a correlation between these things and compute usage, and specifically you'll see the amount of compute being used to train these things go up over time. At the same time, we're seeing the actors developing these models change from academia to companies. Here's a graph from OpenAI, which I and my colleagues did, analyzing these large compute items. And all I've done is marked up which bits were developed
by companies and which bits were developed by academia. And it's quite striking that the majority of these things in recent years were developed by corporate actors and not academia. And if you go back in time, it didn't used to be this way. If we go and look at the earlier era of AI development, important things were developed predominantly by academia. Academia built early self-driving car prototypes. They built early speech analysis systems and built a lot of the early primitives for deep learning, by Geoff Hinton and others. And you do see some corporate activity, like IBM built a good system that can play backgammon, and Yann LeCun at Bell Labs built
LeNet, but it was actually a more even world. So we have to focus on this, because the actors that get to develop this stuff determine the future of it, in large part. And currently the actors developing this stuff are predominantly companies. So why is that? Well, speaking a little bit from experience at OpenAI and at my current place, Anthropic, it's for some fairly obvious reasons. These models are expensive. They cost hundreds of thousands to millions of dollars to train, so they need some money allocated.
They're also models that take time. These models take weeks to months to train. That means that you need to babysit them during training. You need to wake up in the middle of the night and notice that your loss curve has exploded and roll your model back to an earlier snapshot, and do all kinds of maintenance like that, which is not particularly glamorous and is also sort of an artisanal science right now that isn't that well documented. In addition, these models have less inherent attraction to certain incentive structures in academia because, as Percy said, they are simple. They are based on a decade-old technique.
We've just scaled it up with a lot of data and compute. So they don't have some of the inherent novelty that academia rewards. And there's just a lot of engineering work, probably more than pure theory work. Industry rewards this in the form of money and career advancement, and academia, at least in ML, doesn't so much, although I think that some of that is changing. So this leads us to, I think, the complicated issue that we need to work on at this workshop, which is that the people that get to use these models are the private sector actors who build them and the people that the private sector actors sell their models to: so developers via an API, and academia, but only via access programs. And I'm not picking on Facebook here. I just think Facebook generates a lot of these examples. But Facebook is an example of a large tech company that periodically does access programs for academia to look at certain bits of its data. Then academia finds out something that doesn't play favorably for Facebook, and Facebook revokes the access. It would be naive to think this won't happen in the AI community. And I'm
not saying that to suggest that companies doing academic access don't have good intentions. They all have good intentions. They're all run by people who are reasonably nice. We're all human, but we all have incentives. And these incentives are generated by capital. And these incentives will at some point mean that access will be taken away. And so if access is taken away and academia isn't building these models, a really bad set of things happen, to do with a lack of accountability, unconstrained deployment, and a very uneven world.
Additionally, access programs make academia, in some sense, dependent on the private sector. And that's a bad thing because, as Percy said, these models have issues. They have really complicated issues which need to be worked through, and these issues are not going to be that fun to work through. They're going to deal with some of the issues that we as societies have completely failed at solving so far, like bias. So for that conversation to have the best chances of going well,
you want all the people in that conversation to have leverage. And currently all of the leverage is held by a small set of private sector actors, which means that they are not incentivized to have an equal conversation. They're incentivized to exploit that power differential. And again, we should assume that this will happen based on the incentive structure. So why should you, why should academia, build these, even though it's very expensive and challenging? Well, as I've mentioned, it's all about leverage. Information is power. We want more information about these models to be broadly available, and we want it to be broadly distributed. And if we don't do this, then students go to industry to learn how to build these models. Many of them probably stay in industry and don't cycle back into
academia. Academia gets trapped into dependency on the private sector for model access, which is not going to lead to anywhere particularly great. In the long term, the private sector gets to shape the regulatory environment, because it will have the models and it will generate information about the models and it will do its own studies and it will partner with friendly institutions, and it will use all of this to shape the policy environment in which these models are deployed.
And academia won't have as much leverage. Academia will be able to critique, but not necessarily propose alternatives. It's easier to get an alternative to become true if you can prototype it, and prototyping stuff here requires developing, or analyzing and developing, models of this kind of class in terms of scale and magnitude of resources. So a final couple of points, and then we'll go to Q&A. What will these models do in the world? Well, as I said, the internet is an ecology of systems. The interactions
of these entities in this ecology lead to societal changes. A good example is just to look at the internet for the last 10 years or so, where the emergence of platforms like Facebook and Twitter, and aggregators like Google, has centralized human eyeballs onto a small number of platforms, mirroring the economy-of-scale effects you see everywhere else. And at the same time, these platforms are starting to recommend stuff to users
via increasingly complicated and inscrutable AI models. And actually we know that it's changing stuff. My media habits have changed because of the YouTube recommendation algorithm. I bet yours have as well. And this has effects ranging from the benign (I now know a lot about how cheese gets made, because I like watching cheese-making videos for some reason known only to the ML algorithm) to the dangerous: we know that these recommendation systems also drive political polarization.
Something that's just starting to happen now is also the emergence of AI tools for altering and augmenting our own subjective reality: Snapchat filters, Facebook filters, deepfakes. And again, these get used for a range of things, from the fun and interesting to the harmful. Deepfakes get used to generate misogynistic revenge pornography. They also get used throughout [inaudible]. They get used for things, some of which are clearly, if not illegal, right on the border, to things that have commercial value.
What this means is that the world we're in is already being influenced in a major way by non-trivial, large-scale AI models that are changing culture. They are changing the culture that we all exist in, and that will have a huge effect. And to tie it back to what I've been saying earlier, if we're changing culture, we need a large set of people to be able to analyze the things changing culture, because that has a big effect on human civilization. And as it currently stands, these models benefit capital because capital builds them. Companies build these models and they get to deploy them and they get to make money. It doesn't leave quite as much room for different approaches and different stakeholders to benefit.
Obviously, these models have benefits. They're going to be used to help people via therapeutic chatbots. They're going to be better search engines. They're going to be aids to designers and artists. They're going to be amazing tools to help people that are bad at programming, like myself, program better by using natural language interfaces into code models. There's a whole bunch of benefits out here, but because of the funhouse mirrors, they aren't all going to by default benefit everyone equally. That's going to take work on our part, and work on the part of the academic community. And because these models may
cause harm to people that are less represented, or represented by stereotypes, we need to be able to develop the tools to let us interrogate that and at least know that that might be happening. So my final point is just: what are some things we can do to make things go well? I think it's all about reducing asymmetries. There is a power asymmetry in model development right now, and I believe it needs to be worked on by broadening the range of actors that can build foundation models. I believe that universities should be building these models either individually or as consortia. And I think beyond the universities, as Percy said, we can look into things like distributed training and BigScience and other ways of skinning this cat. We also need to deal with the inherent information asymmetries here. There must be more tools
available to assess, measure, and analyze foundation models, partly to encourage accountability, partly to discover capabilities in them that we don't yet know exist. And we need to create things like datasets to test for types of benefits and harm. And we need to create leaderboards that go into some of these societal impact or ethical dimensions of the model, the same way we have leaderboards that look at raw performance, or at reducing a loss curve, for something like SuperGLUE. And finally, and this is something that I'll be having some research coming out on soon, governments should invest in tools to audit, measure, and analyze models that are deployed, because we need to create a shared, accessible set of societal information about these models.
If we want it to go well. So as long as we're building them, we need to equip government with the tools to see what is being deployed and to be able to automatically analyze and, to some extent, interrogate these models as they're being placed onto the ecology of the internet. So that's my talk. I think I've got time for some very brief Q&A before we go to the next section, or you can reach me on Twitter at @jackclarkSF. I write a weekly newsletter called Import AI. And in the weeks since I said I was doing this talk, I've had a lot
of robust discussions with people on Twitter and other places about this. So thank you. There's a huge amount of interest in this subject from a large set of people. So thanks very much. And I hope that we can do a quick bit of Q&A now.
All right. Thank you, Jack, for that really insightful talk. I really like how you've laid out the academia versus industry research gap and issues with power. So let's do some Q&A. I wanted to start by asking you a question about incentives within industry. So there's
a big difference in terms of resources between, let's say, a Google and a startup, and right now it's kind of hard for a startup to break into the search engine market. So do you think startups will be able to compete in five years, and what needs to be done right now, even putting academia aside?
I think that if you look at things like neural machine translation, that was a system that Google developed, and it replaced a load of very expensive, specifically developed translation systems at Google. And similarly, these large generative models that are being trained now are incredibly good at doing things like search and other things. So I actually expect the emergence of these models is going to shake up the game board a bit in terms of dominance in different areas. Once you train these models, you could imagine sets of startups teaming up to train one, and then all using their access to it to compete in different business areas against larger companies, and you may be able to change some of the power dynamics. I think that's an opportunity that's out there.
Yeah. Great. So now let's talk a little bit about academia.
How should we get funding for academia? I'm very curious about this one.
Well, there's the National AI Research Resource, which is being worked on by the OSTP in the US White House. There is an RFI out for that which expires on October 1st, and they're looking for ideas on what that should look like. We also just passed the Endless Frontier Act, or, as it was renamed, USICA (by we, I mean, myself), and USICA unlocks something like 200 billion of funding, which we can potentially use to increase funding at places like the National Science Foundation and others. So I think money
is actually out there. And what we need to do is develop the payload for the money to go into. And also, just speaking frankly, whenever I go to Washington, because I do a lot of work in policy, I talk mostly about this problem to the US government and say they need to significantly change resourcing here so that academia has an ability to do it. I don't know how effective that is, but it's what I spend a lot of my personal credit on.
Oh, that's great. All right. So unfortunately we have to move on. So let's thank Jack again for that great talk, and we'll chat more about this at the panel. Thank you. So now
we'll have two 10-minute talks from two Stanford professors whose primary research area is not AI, but whose worlds have been colliding recently with foundation models. So first is Michael Bernstein, an associate professor of computer science at Stanford and a member of the HCI group. Michael builds social computing systems involving anywhere from small teams to large crowds to help groups achieve collective goals. He has specifically looked at governance in online communities and interactive machine learning, and recently has led the effort at HAI to design an Ethics and Society Review board for AI research. Michael, take it away.
All right. Thank you very much. So I'm going to talk from the perspective of a human-computer interaction researcher. This is sort of my day job in some sense. And let's talk about what I'm going to call threshold effects here. So I'm going to tell a story of ancient history to set the stage here. A long, long time ago, tens of hundreds of days ago, it used to be really, really hard for us to make public content on the web. You had to own a server.
You had to figure out how to set up Apache config files. You had to learn this crazy tongue of Hypertext Markup Language. It was a big deal. But over time, tools arose that made it much easier to publish. So you can think of stuff early on like WordPress or MediaWiki, and then eventually Web 2.0. Now I can rap on my SoundCloud. Things changed. And what
happened here is well described by what Brad Myers, Scott Hudson, and Randy Pausch described back in 2000: this threshold/ceiling diagram. I'm going to draw a graph here. On the x-axis is what I'm going to call threshold, how hard it is to get off the ground to do something. And on the y-axis is going to be the ceiling, which is the sophistication of what you can create with that thing. So early on, you had to basically be fluent in the entire server stack to get anything public on the web. But over time, as Tim Berners-Lee started with the web, we had HTML, which dramatically reduced the threshold, how challenging it was to make a webpage.
And eventually we started seeing things like Markdown. I can write posts on Medium, or I can just fire off something on Twitter within seconds. And this has dramatically reduced the threshold. Now, it's also changed the ceiling. I can't do anything as expressive on Twitter as I can writing a full-stack web app, but that threshold has really dropped from months and months to just a few seconds. Likewise, we have really high-threshold, high-ceiling tools like Adobe Photoshop moving to something like Instagram, where you can again do pretty complex, but not as complex, stuff. Or Adobe Premiere moving into something like TikTok, where with the filters, I can do it now. So we lowered the threshold. Something went from
really hard to get into to relatively easy. And what happens (this is the central question here, as it relates to foundation models), what happens when you lower that threshold to publishing? Well, two things. One, you're going to get this massive increase in adoption of the medium, right? A lot more people are tweeting than are making full-stack websites. And you also have a really broad proliferation of different kinds of use cases that the original creators probably were not intending or aiming for. Now, this does increase iteration speed in a sense. This is the very classic human-computer interaction cycle, where we have an idea, we implement it, and then we reflect on it. And just like research itself, this practice of design requires this reflection and iteration. It's this iterative process. And so every time you're lowering the threshold,
you're making that implementation easier, more accessible, faster. You're getting more turns around that cycle, which means you get, in principle, better designs. It does mean co-option. So I'm guessing that when we started the web, we were not imagining
communities of knitters, but it's pretty cool that we have them. We were not imagining the largest encyclopedia in the history of mankind, excuse me, of humankind. We were not imagining that a person could just write tweets and they would go out to the world. At the same time, we were probably also not considering that platforms would be used to convene the alt-right, that there would be hate activity going on online, that there would be cyberbullying, and that all sorts of things were going to happen because people co-opt these platforms in ways that, I'm relatively confident, the original creators were not intending or even agreeing with. We had this entire thing that happened as the threshold lowered. And this entire time, though, one thing remained stubbornly high-threshold: that's AI. It was
incredibly difficult to get going, to really build an AI. It generally required some sort of advanced coursework or an advanced degree, an internship with Andrew, and a lot of effort to collect data and to train it. So when we talk about foundation models, I feel like we often put our focus on the fact that the ceiling has gone up: that performance is improving on a series of tasks, that this approach has moved our benchmarks. This is raising the ceiling. But what I want to argue is that I think the real action is going to be on lowering the threshold, because these foundation models have lowered the threshold. The few-shot nature of them, the natural language prompts, basically make it so that all of a sudden you have people who previously could not, or would have required a ton of effort to, train an AI start crafting AIs.
And we can start to see that there are lots of opportunities here, certainly. Here I'll point to work by a PhD student, Joon Park, finding ways that it is possible to help scaffold new, non-expert users in writing effective prompts. Here's a particularly challenging task that GPT-3 doesn't do all that well on: a few-shot learner does barely above chance at trying to predict whether a community is going to moderate a comment. But if we start to decompose these concepts, so that people can basically take this question and decompose it into a few different questions that are more amenable to GPT-3, all of a sudden we start to see, on some tasks, better F1 scores. So there's a real lever here for human insights. It also means we're going to start seeing a bunch of different kinds of interactions.
Now you'll notice that these citations are actually from before GPT-3. In the bottom left, you can see that the HCI community has basically been working on Codex for the last five years. So thanks, HCI community. But imagine interactions ranging from being able to use natural language to generate code, to edit an image, to define fashion, to having the ability to automate tasks on your mobile device, to having ubiquitous computing systems that can do some activity recognition so that, for example, it'll start making your coffee when you wake up. Now, there are going to be all sorts of new forms of interaction design, I'm saying, that we haven't seen, for example emphasizing this prototyping approach. So if we make it faster and easier to prototype, what are we going to see?
So maybe we can start to see, on the positive side, things like social network designers taking advantage of the fact that these foundation models have memorized a bunch of behavior on these social networks, so that they can use a natural language product, excuse me, natural language prompts, saying something like: here's a person who's upset about a breakup. And we can actually see and populate a space with what might happen. I could say that my colleague James sees that post and decides to reply, and these are actual responses generated by GPT-3. So we could actually start to populate these spaces, not just with what might happen in a social space, but also say, here Trevor decides to be a troll. We could foresee the kinds of negative behaviors that are going to arise before they actually arise. If we can do this, I think we can do a much better job of forestalling some of the negative outcomes on these platforms.
We don't have to imagine, necessarily, what's going to happen. We can essentially simulate some of it and see what we need to build protections for. But again, I'm going to bring up the "but." And here, I think I'm beating the drum that the first two speakers really spoke about already. We saw with publishing that when we lowered the threshold, we got many of these positive and creative applications, but also a number of really negative and harmful ones that we were maybe not, as a society, able to grapple with in advance. So I think we need to say that the same thing is going to happen here. As we lower the threshold, there's going to be co-option, not just of these applications that I think will be handy, will be useful, and that HCI as a field will be really excited to explore. But also, as
Percy mentioned, this will become an endless fountain of misinformation. It'll become a troll generator that's really difficult to detect, where I can just pull up a thing that says, hey, this person, I don't like them, they broke up with me, go mess with them. We can reword articles for higher emotional valence and virality. So we know that these kinds
of articles get spread further on social media, if I can have a thing that makes me more viral. Hyper-curated online profiles: we're going to be able to get feedback, and it's going to make us look a particular way that makes us look good in the way that we want, but then everything starts to feel more fake. Jeff Hancock in the communication department has been thinking a lot about this kind of work. More targeting with less data, right? That's a foundation model risk as well: routine classifiers. We're seeing things like, your insurance rates might start going up because we can classify what you're doing with less information about you. And then all these things that we probably haven't already foreseen.
So what's going to happen here? We're going to see a wide variety of new interactions, both on the positive side, but I think really we want to focus on, what are these risks, and how can we mitigate those risks? What principles should we be following in the design and deployment, both of the models and of the applications? What should the applications' contract be? And I'm going to hand that question on contracts off to my colleague, Dan Ho, who's going to take the next piece. I want to thank the students who have been working on this kind of work that I mentioned, my colleague Percy, who's co-advising some of this work, and funders. All right. Thank you.
Thank you, Michael. Love the framing of foundation models reducing the threshold, and
we'll come back to questions in a bit. Let's go on to our second talk. So Dan Ho is the Benjamin Scott and Luna Scott Professor of Law here at Stanford, a professor of political science, and associate director at Stanford HAI. His scholarship centers on quantitative and legal studies, focusing on administrative law and regulatory policy, anti-discrimination law, and courts. Recently Dan has been really interested in using NLP for legal applications, and his talk will intertwine these two themes. Dan, please take it away.
Well, thanks so much, Percy, and I think the work
here in bringing this group together has really been quite engaging. I'm going to talk about two topics. One is the use of foundation models for law as an area of application and the potential promise there, but then also the law of foundation models, that is, what are the kinds of legal constraints that may actually affect what kind of foundation models can be built. This is going to be admittedly an American, sort of a U.S. perspective, in part because of my perspective and the perspective of the wonderful collaborators who helped with the white paper: Peter Henderson, Neil [inaudible], Mark Krass, Julian Nyarko, and Jenny Hong. On the first topic, it's widely acknowledged that while the U.S. legal system strives for justice for all, there are pretty profound issues with access to justice.
In 1978, President Carter gave a speech to the American Bar Association that said, quote, we have the heaviest concentration of lawyers on earth. 90% of our lawyers serve 10% of our population. We are over-lawyered and underrepresented. That situation has not changed much to the present day. For instance, look to the systematic underfunding
of public defenders, or the file room of one of these adjudicatory agencies, which looked something like this just a few years back, where it took about five to seven years for a veteran, for instance, to have an appeal resolved. And one estimate has it that about 7% of veterans pass away while waiting for their appeal to be resolved. At the same time, one of my colleagues, who has spent much of her writing talking about access to justice, pointed to a kind of less noticed revolution of legal services aimed at America's low and middle income consumers: technology is replacing lawyers wholesale in areas like preparing wills or forming limited liability corporations. And what we do in the white paper is spell out a range of different applications across the stages of a civil lawsuit. We'll note that that's already quite prevalent in areas like discovery, which is the process whereby litigants seek to find facts, and the conventional way in which that was done was, for instance, by providing these large bankers boxes of files. That process has been revolutionized already by the use of natural language processing. That said, one of the historic challenges in terms of thinking about even applying these kinds of tools to law is that labels are really expensive.
The law has not had anything like the kind of large-scale benchmark datasets that have powered NLP development. And so the basic structural challenge is that law is expensive. It's hard to hire lawyers to label legal decisions. And one of our graduates, Pablo Arredondo, figured out that actually there is a place where we can look for high-fidelity labels of legal decisions, which is that lawyers have actually come up with extremely complex sets of rules about how to cite prior precedent. And one of those rules turns out to be about the use of parentheticals, a rule that literally states, in the Bluebook, the conditions, grammar, and how to characterize holdings of prior court decisions. And holdings are this kind of basic task in the first year of law school. It's the part of the judicial decision in
a common law system that can be relied upon as precedent to cite to. And so what we did is some work here basically leveraging this really detailed set of rules to extract these kinds of holding statements out of existing case law. The context may look like the text in blue, then there are the legal citations in purple. And the
holding statement is something stated in that parenthetical that characterizes the key precedential value of prior cases. And then it turns out we can actually construct something like one of the classic Q&A tasks, where we can find very similar holding statements, here about qualified immunity for police officers, and turn that into a kind of question-and-answer task. And here's where we start to see some of the promise of foundation models for the law: foundation models boost performance pretty significantly on F1. What we then did is actually compile, through the Harvard Caselaw Access corpus, some 3.4 million federal and state court decisions and do some domain-specific pre-training. One important thing here, we found, was that it was really important to create custom vocabulary and sentence segmentation that paid attention to that complex system of legal citation.
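A minimal sketch of the idea of mining holding parentheticals and turning them into a multiple-choice task follows. It is not the pipeline used in the work described; the regular expression, the function names `extract_holdings` and `make_question`, the `<HOLDING>` mask token, and the fictional citation in the usage example are all assumptions made for illustration. Real citation parentheticals are more varied than this pattern handles (nested parentheses, for instance, are ignored).

```python
import random
import re
from typing import Dict, List

# Assumed pattern: a parenthetical whose text begins with "holding".
# Parentheticals such as "(1st Cir. 1993)" do not match and are skipped.
HOLDING_RE = re.compile(r"\(\s*(holding\s+[^()]+?)\s*\)", re.IGNORECASE)


def extract_holdings(opinion_text: str) -> List[str]:
    """Return the text of every '(holding ...)' parenthetical, in order."""
    return [match.group(1) for match in HOLDING_RE.finditer(opinion_text)]


def make_question(context: str, answer: str, distractors: List[str], seed: int = 0) -> Dict:
    """Mask the true holding in the context and shuffle it among distractors."""
    options = [answer] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded so the example is repeatable
    return {
        "context": context.replace(answer, "<HOLDING>"),
        "options": options,
        "label": options.index(answer),  # position of the correct holding
    }


if __name__ == "__main__":
    # Fictional citation, for illustration only.
    text = (
        "Officers are shielded unless the right was clearly established. "
        "See Smith v. Jones, 1 F.3d 1 (1st Cir. 1993) "
        "(holding that officers were entitled to qualified immunity)."
    )
    holdings = extract_holdings(text)
    question = make_question(
        text,
        holdings[0],
        ["holding that the search violated the Fourth Amendment"],
    )
    print(question["context"])
    print(question["options"], question["label"])
```

In practice the distractors would be holdings drawn from other cases on similar topics, which is what makes the task hard enough to separate models.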
And that turned out to actually perform best. And that's where you see some of these really interesting potential gains, where the gains from domain-specific pre-training and these foundation models seem to be largest when the training dataset is smallest. That said, we have a long way to go. An example offered by Julian Nyarko is about GPT-3 and its limits in discerning legal reasoning. So a simple prompt might be: are liquidated damages clauses enforceable? Liquidated damages typically appear in contracts to specify what the damages are if a breach of contract occurs. And it is absolutely the case that GPT-3 can parrot what
the black-letter legal statement is: liquidated damages are generally enforceable, unless the sum stipulated is exorbitant or unconscionable. But if you feed GPT-3 anything like the kind of fact pattern that we train our first-year law students on, like a contract over a Toyota Corolla where the liquidated damages are posited to be $1 million, GPT-3 is not able to actually perform that simple form of legal reasoning, nor is it able to do so even if you specifically state that $1 million is exorbitant or unconscionable. So there's a long way to go, which leads me to the way in which the law may actually constrain the construction of foundation models, given all of the basic questions that Percy and Jack covered so well about the performance, bias, and mechanism of these kinds of models,
and I'll only scratch the surface here, but I'll highlight a few. One is obviously the worries about the extent to which foundation models may bake in bias and lead to disparate treatment or disparate impact. Another is the questions about due process when used within any kind of decision-making system, where due process doctrine typically requires a sufficient process when an adverse decision is made. And as Danielle Citron wrote, in a kind of logical AI-based system, there were computer programmers encoding benefits decisions that actually violated the federal regulations on this. And that becomes, of course, much harder to audit when we're talking about the kinds of decisions that may be embedded in more opaque foundation models. Then second, there are questions about input liability. Percy noted that he didn't want his vacation pictures used. Think here of GitHub Copilot, the machine-learning-based programming assistant, and there are basic questions about the licensing terms and fair use, to the extent that Copilot is trained on data that is not, for instance, on GitHub,
and the kind of biggest tension here lies in big questions about accountability. It was only this past term that, in the Van Buren case, the Supreme Court actually pared down an interpretation of the Computer Fraud and Abuse Act that could have criminalized a lot of web scraping conduct to bring data into these models if you were violating the stated terms of service of a particular website. And that obviously had major implications (Facebook was involved in one of the underlying CFAA cases) in terms of the ability to actually access this kind of data. Last, we also spell out the fact that there may be protections for model or inference outputs from these kinds of models. But at the same time, while there may be legal protections, the interesting thing from the perspective of law is that it may actually challenge some core tenets of legal doctrine itself. So colleagues Toni Massaro, Helen Norton, and Margot Kaminski wrote a really fascinating piece arguing that, quote, if we take the logic of current First Amendment jurisprudence and theory to its natural conclusion, [inaudible], which was the 2016 chatbot that was taken off of Twitter after 16 hours due to inflammatory tweets, described by one newspaper as, quote, artificial intelligence at its very worst, [inaudible] may actually have First Amendment rights under current First Amendment doctrine. And that is, of course, something that may not just affect foundation models, but it's also the kind of cross-fertilization that I think is so overdue, because it may be an area where the emergence of foundation models may actually be really teaching us something about some of the limitations of current legal doctrine and inspire a significant amount of rethinking. Thank you. With that, I think I'll turn it to Q&A.
All right, thanks, Dan. And now we will go to Q&A. I can't seem to turn on my video anymore, but that's okay. You know, if we're talking about building foundation models for law, Michael talked about lowering the barrier of entry, the lowering of the threshold. There's also a barrier to entry for law.
For example, I can't really read legalese. How do these two types of barriers interact with each other? And maybe this is a question either of you could speak to.
Yeah, I'm happy to take a first stab at that. I think, you know, we did this other report where we tried to understand how it is that government agencies were really experimenting with forms of AI. And one of the themes coming out of that report is that, while nearly half of federal agencies, for instance, are trying to do this, the best use cases really came where you had core staff members within those teams who had both technical insights and deep domain knowledge. In the Social Security Administration, for instance, they built a sort of NLP system, and it was really built out by some of the folks who were adjudicators for several years and figured out what kinds of tools they really wanted built for themselves in the process of adjudication, which I think exemplifies some of the interesting HCI components of this. But Michael, I'd be curious for your perspective on it.
Yeah, it's interesting. One of the first things I got asked to do by someone in the government when I started working on crowdsourcing was to help them crowdsource more interpretable versions of public legal documents and policies. I think we have to differentiate here. There's what's internally needed for the legal policy to be correct, and then there's what's projected and clear to the citizens. And those may be very different, in the same sense that how I would teach how machine learning works to someone who's just learning is going to be different from the kind of conversation you would have with someone who's an expert. And so I feel like that needs supervision that doesn't exist. We can't just sort of say, oh, here we go.
Because if we just try to translate what's already there, it's going to try to take stuff that might be, as you suggested, quite technically complicated legal detail and make it simple, but in a way that's not right, or that gives the wrong intuition. I'm in some sense almost more interested in how it might help go the other direction. There's this vocabulary gap, you know: if I don't have the words to describe what I need to know, or the kinds of legal policies I need, help me get in touch with experts who can help me. Just help me cross that chasm.
Right. Thanks, Michael. We're having another question: both of you discussed the risk and co-option of these models. So how should we reason about these risks, and what recourse should we take to proactively and reactively mitigate them? I want Dan to go first on this one.
Well, I think part of the reason why I split up my portion of the talk to be both about the application portion and also an understanding of the legal constraints is that a lot of the time, the way we choose to do this as a society is to take broad ethical principles and then write them down into law. And so some of the ways in which we do risk mitigation obviously have to involve writing enforceable rules rather than merely sort of hortatory ethical principles. And so, you know, I was very interested
for instance, in what Jack noted, which is, how do we actually build a workable kind of audit system? And Percy, I think you've made a really compelling case that academic auditing is part of that. But I would also probably say that that alone is not enough, that there may have to be actual enforceable rules to be able to audit systems, for instance, that might be used for benefits adjudication, to make sure that they're actually designed and work in a way that is legally compliant.
Yeah. What I'm wrestling with here is, you know, in computing, there's a real push for CS for all, right, to try to lower that threshold, to give as many people as possible access here. And yet, in other domains, that's not necessarily the case.
We have licensing procedures. We have communities of practice where groups will self-adjudicate, right? So medicine, law, and others, where there are real consequences: you're barred from practicing if you have a severe ethical violation. And so I'm wondering at what point, you know, with these tools (no one says medicine is bad for you, or that it can't be used poorly), at what point do we need to start asking more seriously about those kinds of professional associations and what that would mean, and what harms it would do to create that in terms of restricting access to potentially marginalized groups, versus the benefits of, say, cutting out negative use cases.
That's really what I'm chewing on: where do we stand on that?
Yeah, just to follow up on that. I mean, in your talk, you talked about the co-option of publishing, which I guess we've seen play out. Are there any lessons there that we might be able to take and apply to foundation models?
One thing I can say is, I was early on, I think, more optimistic about the role that design and research could play in sort of setting models, and think of it as a role model or a path forward. Look, we should
be doing it this way. I think I've become far more cynical over the last, say, five years, maybe more in line with Jack, in that, while we can show pro-social paths, there's nothing that prevents an anti-social actor from taking that and just ignoring it or doing the opposite. And so in general, my sense, and Dan would be the expert here, is that policy tends to be reactive rather than happening before the issues arise, before there's a violation. But to the extent that we, as a community, can do something like what the researchers did in CRISPR, where they came together and came up with essentially a set of principles that they published alongside the technique and said, here's what we think is okay, and here's what we think
is a violation. I think that's fairly counter-normative behavior for computer scientists, but I think it would be a really important step if we could, say in this group or elsewhere, convene civil society, researchers, industry, and so on, and say, this is what is okay, this is what's not, and then there becomes at least social approbation for that kind of behavior.
Yeah. We definitely need some more professional norms here. All right. Thank you both for your talks and for taking the questions. So this concludes the talks, and now we're going to move on to our first panel discussion. So let me introduce the panelists. The first panelist is Jack Clark, so please welcome him back. Next we have Su Lin
Blodgett. She's a postdoctoral researcher at Microsoft in the FATE group. She studies the social implications of language technologies, and just recently led some really nice critical work on how to think about bias and how to measure it in NLP. Next we have Eric Horvitz. He's a technical fellow and the Chief Scientific Officer at Microsoft. He's led efforts at the intersection of technology, people, and society. He has made foundational contributions to principles for responsible and ethical AI, and has always been a strong advocate for human-AI complementarity.
Next we have Joelle Pineau. She's co-managing director at Facebook AI Research and associate professor of computer science at McGill. She has worked on planning in partially observable domains, dialogue systems, and robotic assistants, and she's led some really remarkable reproducibility efforts in the machine learning community for the last few years, such as the reproducibility challenge and the NeurIPS reproducibility checklist. And finally we have Jacob Steinhardt. Jacob is an assistant professor of statistics at UC Berkeley. He works on making machine learning
more robust and aligned with human values. Recently he and his group have built several challenging benchmarks for language models across different domains, like language and code. All right. So welcome, all of you, and let's get started. As advertised in the name of the panel, I want to start with two broad questions, one on the opportunities and one on risks. So the first one is, what are the applications or sectors that you think are most going to be transformed by foundation models in the next two years, even based on the current trajectory? And what are the types of applications or types of interactions that we haven't seen? You know, we need a bit of imagination: ones that don't exist yet, but are now made possible. Maybe we could start with Joelle and Eric, since they, being at Facebook and Microsoft, have seen a lot of resources invested in applications of foundation models, and then we can go to others. So maybe Joelle, you can go first.
I can start a bit. I mean, I think there's definitely tremendous potential that goes across all sectors. So it's a little bit hard to pick which one is the most promising right now, to be frank with you. I think your report, put out recently, does a great job at highlighting a few of them: education, health care, law, as being prime examples. But I think we could have just as much talked about some of the potential in transportation, some of what we're seeing, not just in smart vehicles, but across the whole realm of the transportation industry, and some of what's happening across finance and that industry as well. And I will mention, I think, creativity, entertainment, all of this ability to really harness the work that's been done on generative models of AI and use that to enhance human creativity in really significant ways. I find this incredibly exciting and really promising.
Yeah. The ability to generate is definitely something that kind of jumps out at you. Eric, do you have any thoughts?
Yeah, sure. I mean, I'd say just that the scale of the data, compute, and model capacity that's been enabled by the unsupervised learning methodologies has really led to advances in both representation and generation, per Joelle's comments. So I'm excited about the possibilities of how these technologies will change the lives of people within their professions, as well as in daily life. I see really incredible boosts in language tasks, visual tasks, and generative tasks. I think we'll be continuing to see surprises. And so, let's say for consumer and business technologies, I think the texture will change, for example, when it comes to multi-step interactions: not just one-shot recommendations or recognitions, but rich interactive dialogue where multiple topics and goals will be maintained, and there'll be a pushing and popping of a kind of stack of interrelated threads, and having a conversation with the technologies versus having these one-shot experiences.
I also think that there'll be changes in the course of daily life when it comes to extreme personalization, where we've seen that it takes quite a bit of common sense for systems to help guide us on how we spend our time, how we collaborate and get things done. And we're seeing some great directions on that front already. Some of these services are being offered in our Office product line, for example, and in Dynamics. On the engineering front, there's been some well-deserved excitement with software development, and we can go much further. I think there's impressive engagement we're seeing in the private preview with Copilot. And so I think that'll be a game
changer in some ways for software development. Healthcare, you know. One particular area that I've been looking at is scientific discovery. I think we're going to see platform models, or foundation models, for molecular simulation change chemistry and biology and physics in fundamental ways, to speed up our ability to understand what molecules do when they interact by many orders of magnitude. We've seen signs of that coming our way with the recent work in biosciences with UniRep and ESM, with these pre-trained platforms, embeddings for understanding proteins, not just structure, but function. I think these will be game changers for science coming our way.
Science is one of the applications we maybe think less about, but I think it's really [inaudible] as well. Maybe, Jack, did you want to...?
I just want to do a really quick shout-out for really intuitive search in very specific domains, because you can generate [inaudible], it makes it easier to do things like find recommendations for similar scientific papers, if you start to use these models in the right way. And so I think we'll use them to sort of change how we educate ourselves, at least at the level of when you are beginning to read about something. I think these models will give you better and more intuitive recommendations than a standard search engine.
Cool. Maybe you could talk a little bit about the flip side. I think all of this sounds great, but there are, as many speakers have commented, many risks associated with foundation models. What are the ones that you think are most to be worried about? Are they the short-term risks, because these models are already out and about, or are they longer term, because the future is so unpredictable? And what kind of actions should be taken? So maybe, Su Lin, we can start with you, since you've studied a lot about social bias in NLP models, which is very pertinent here and has come up a lot of times.
Yeah. Thanks. I have a lot of feelings about this. I'm going to have a somewhat long answer, if that's okay. I am the resident skeptic, so I'm going to be
appropriately skeptical. So I think I worry about a lot of harms, but I know "all of them" is not a very useful answer. So I think I'll say that maybe the thing that worries me the most is the rapid entrenchment of the models. So the view of these models as inevitable or an unmitigated good, I think, is a really risky frame, because it forecloses research alternatives, and it assumes that development and deployment of these models must happen, which forecloses the situations where we just say, hey, the risks or harms really do seem to outweigh the benefits, let's not use this. It also ignores the fact that there are already lots of communities that don't want this, communities for whom these kinds of models mean increased surveillance, or their cultural resources made available for general cultural consumption a