[Jess] I'll tell you a little more about, and next week we have a topic of Algorithmic Collusion by Large Language Models with Sara Fish, a researcher who sounds really interesting as well. So those are on our website under the events heading. So, with that said, I want to welcome Peter Wang and Chandler Vaughn, who are co-founders of Anaconda AI, which focuses on advancing core Python technologies and developing new frontiers in open source AI and machine learning. And very excited about their topic today. So thank you all for being here, again, check out our spring Speaker series. These guys are part of it. And more to come.
Thank you to those of you who have attended in the past as well. So thank you so much for being here, guys. [Chandler] Yeah, great. First off, can you guys hear me? All right, all right.
And does this work? That's the real question. Maybe. All right, all right, all right. So, first, two stories. One: on my way over here, I managed to get hit by a car and spill coffee on myself. So it can only, like, get better from here, from my perspective.
I also got yelled at for it, so apparently I got the real Harvard experience, is what I was told. Nonetheless. So let me tell you a little bit about us and what we're going to talk about today. But let me start with a story. A couple of years ago, I was, you know, doing what I do, using a model, and I watched the model basically create a near-perfect piece of prose about colonial African resistance.
Now, it was beautifully written, it was evocative, it was precise, and it was historically accurate. And I asked myself, how did it learn that? It wasn't about the facts, right? It was about how it was able to communicate. And it didn't learn that from public domain text, and it didn't learn that from some academic journal. Most likely, almost certainly, it actually learned that from work that was created by an African historian or blogger, some sort of, you know, underpaid ghostwriter for Medium or Substack. But the point is, they didn't consent to have their work jammed into this model, they didn't get attribution for it, and they likely don't and won't ever know that their work had some sort of influence on this output.
And that is really the crisis we're going to talk about today. This talk is about something that is somewhat hidden, that you don't see. It's about an invisible economy of data that fuels the AI we marvel at. But it's also about how the commons, you know, that place in the economy where information is shared, morphed, extended, is getting strip mined today.
And so the idea is really this: if we don't act now, we're not just risking a legal reckoning that will someday have to be paid as we YOLO ourselves into the future. We are really risking a cultural collapse of trust in the data layer of the internet. I can't say it more starkly than that. Now, today, what we're going to offer is a somewhat simple proposal.
You know, I'm not a ban-AI guy. I'm not that guy. I am trying to find a way that we license AI effectively, that we can build a forward-looking framework that protects creators as they create and, at the same time, empowers the developers who want to develop.
Now, in this work, what you will hear is the term AMPL, right? It's a licensing framework. We'll explain a little bit later. I'll go through it. It's a labor of love.
It was co-developed by Peter Wang and myself through a lot of discussions, too numerous to name, with some people who are actually in this room. Peter is a foundational figure in open source Python and numerical computing. He is the OG of OGs in that; he has spent his career advocating for open knowledge and scientific computing, and building bridges between innovation and integrity where bridges didn't exist. I come from a world of infrastructure, so I am an applied AI guy. I've built systems at scale.
I've also seen what happens when the data layer gets shaky, when you start getting model drift and hallucinations and unexplained outputs and just bad statistics. So, you know, the deeper challenge that we saw, and the one that led to this project, is that our licensing and compliance frameworks have not kept pace with the market. We have gone from running code to training and running models, and the license frameworks do not understand that breadth. Open source taught us how to license code at least marginally well, but AI isn't just code anymore. It's synthesis. It's generation. It's transformation.
There's data, there's training, there's the weights, there's actual models. So, you know, these are not things that are open source. And it was never envisioned that way. So in that shift, we've broken a fundamental pact with the commons.
And that pact is: when you share knowledge, it's not just taken; it's attributed, it's reused, it's respected. Right. And that's the line we're trying to redraw, you know, not to restrict use, but to make it intentional and reciprocal and, above all, knowable.
So let me ask you a question. What is the one thing that every artist, scientist, researcher, teacher, you know, lawyer, wants in their life? They want their work to matter, right? They want to know that if somebody is going to use their writing, or their code, or their voice, or their brief, it's not stripped of context, it's not carried off without context.
And so AMPL, the AI Model Public License, was born out of that desire. It's a modular license. You know, it's very, very AI-centric, but it encompasses training data too.
And the model outputs as derivations. So not a blunt instrument, but very much a scalpel. And it allows tiered uses, much like Creative Commons does. So we tried to take what prior art we could. Our mission, really pragmatically, is this: we want clear, enforceable guidelines for training data use, with not just legal weight but also moral weight. Both.
We want to provide incentives for, you know, protecting creators in that process. So particularly those that aren't represented by lobbyists or a whole bunch of lawyers, and we want to offer a compliance path for developers and companies, because right now nobody wants to get sued. They're living in fear of litigation. Above all, we want transparency. We want to put transparency back into this process because, you know, the AI community cannot build trust if you don't actually know what's under the hood. So that's AMPL.
And that's what we're trying to restore: not control, not censorship, but consent. So let's get right to the heart of the matter. The most powerful general AI systems today, whether we're talking LLMs, image diffusion models, or multi-modal models, all rely on massive data sets scraped from the internet. That's books and articles and source code and photos and videos and blog posts and all your social media, whether you like it or not. Much of that is copyrighted. Almost none of it is licensed.
So that creates three problems in this scenario. One is obvious legal risk. In the courts, the jury's still out on whether fair use covers this. Right.
They have not decided what is covered with AI at scale. Two, ethical risk, because the creators were never asked or even notified. And three, structural risk: the lack of transparency means that every new model that comes out today is trained in the dark. Existing licenses, whether they're open source or otherwise, were not built for this problem. They weren't designed to govern probabilistic outputs of models trained on billions of tokens. So this is a little bit of flying blind in terms of the regulatory environment.
But our proposal, really simply, is that we need a way to recognize training as a legal event. I'll say that again because it's really important: we have to recognize training as a legal event, right? And we have to acknowledge the derived outputs in a way that fosters transparency and indemnity. It's not just a hammer; it's a bridge to this new world that we live in. So there's a convenient myth in tech that machine learning models don't copy, they compress.
But when a model regenerates long strings of copyrighted text, verbatim paragraphs, or protected images, it's not compressing, it's reproducing. It's predicting the next token, probably too well, right? And courts have made clear that compression is not immunity. We can look at GitHub. We can look at the Authors Guild lawsuit, recent claims by artists and photographers and musicians, recent claims by the AI Disclosure Project.
So they went and looked, and they created a way that they could actually introspect some of these models. What they found was that O'Reilly Media's paywalled books are jammed into OpenAI models. How? Don't know. Licensed? Not at all. So this isn't going away. This legal landscape is only heating up.
It's not cooling off. And it's not just the US. Different jurisdictions are actually taking different approaches, which I'll cover a little bit later. But global compliance on this matter is a mess across the board. Without a framework, developers are left to guess or litigate, and small developers won't litigate. They'll just not do it.
But the point is, holy water does not anoint this, right? People are going to come with pitchforks eventually. So we have to prepare. So we're also seeing a rapid decline in open research, a deepening opacity in the commercial development side of this equation. So companies that try to actually disclose their data, they are rewarded by getting sued. Companies that hide it? They're rewarded with competitive advantage. Right. The incentive is secrecy.
And, you know, meanwhile, countries with fewer restrictions, with lesser or looser IP regimes, are gaining competitive advantage, at least short term. But it comes at a cost. And that cost is cultural and linguistic erasure. Because the more you scrape without boundaries, the more you flatten the diversity of the web and the world at the same time.
As trust collapses, innovation collapses too. So researchers don't know what's safe to use. Enterprise users don't know what's jammed into the model they just deployed. Right? What's the risk of that? Regulators don't actually know how to intervene. They don't have a framework for it either.
So we have to somehow replace the black box AI that we're dealing with today with a glass box governance model around it. And what about derivative works? At what point does a model's output cross the line from inspired by to derived from, right? That line is really blurry today, but it matters, because it determines who owns what. If a model trained on your code emits something functionally identical, is that a coincidence? Is that a compliment? Or is it infringement? There's no consensus yet, which means there's no protection.
Companies that share their data expose themselves to lawsuits, so those who don't hide behind secrecy. And this is a liability vacuum, right? The definition of one.
So we need a way out. We have to find thresholds. We have to set norms, by giving creators a voice and developers a safe harbor at the same time. And so you might say, well, Chandler, don't we have open source licenses, and haven't we figured this out? The reality is, the existing licenses actually don't cut it. The four freedoms around open source, I believe, are foundational. I believe in them wholeheartedly.
But they were not written for generative models. The freedom to run a program as you wish? Sure. But what if that model is actually trained on hate speech? Do I want that? No. The freedom to study and modify? What if I can't actually get the weights? The freedom to redistribute? What if that triggers a copyright lawsuit? The freedom to distribute modified versions? What if the modification is indistinguishable from training the model again, not only in cost but in time? So traditional licensing frameworks are not really addressing the issue. And the reality is, software licenses today, whether you're talking about MIT or Apache or GPL, don't address core issues around training data.
Right. It was never envisioned that way. So data licenses rarely address derivation. And they certainly don't address indemnity.
So the result is sort of this patchwork compliance, increasing legal uncertainty, and a chilling effect overall on innovation. If people don't know how to share safely, they're not going to share. And comparatively, around the world, the legal picture is fragmented. The United States is clearly leaning on the fair use doctrine, dealing with it on a case-by-case basis, but the jury's still out, as I said. The EU is taking a different approach.
They're doing what they do best: they've produced policy around it, really focused on structured text and data mining exceptions and opt-outs, putting in place some mandates around machine-readable signals of creator preference. And that, I believe, is sort of the right direction.
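To make the machine-signal idea concrete, here is a minimal sketch of what honoring such a preference could look like on the crawler side, using Python's standard robots.txt parser. The "ExampleAITrainer" agent name is hypothetical, and real conventions (like the EU's text-and-data-mining opt-outs) go beyond robots.txt; this only illustrates checking a published signal before collecting a page for training.

```python
from urllib import robotparser
from urllib.parse import urlsplit

def may_collect_for_training(page_url: str, agent: str = "ExampleAITrainer") -> bool:
    """Return True only if the site's robots.txt does not disallow this agent."""
    parts = urlsplit(page_url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()              # fetch and parse the site's robots.txt
    except OSError:
        return False           # if the signal can't be read, don't assume consent
    return rp.can_fetch(agent, page_url)

# Usage sketch: only add a page to a training corpus when the signal permits it.
# if may_collect_for_training("https://example.org/essay.html"):
#     corpus.append(fetch("https://example.org/essay.html"))
```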
China has telegraphed a very permissive stance. Right. They have basically said, you know, we aren't going to worry about that right now. Go do what you need to do.
Which, by the way, is a policy that got put in place in 2017, when AlphaGo played the grandmaster in that competition and won three games. It became an immediate emergency for China. So this position for China is not going to change. It's going to stay very, very permissive. And the Global South is largely underrepresented.
Now, they do have somewhat fractured and underdeveloped laws. But the thing to know about the Global South, in addition to that, is that from a data set perspective it's actually really underserved. It's hard to find proper data sets for specific languages in the Global South. Now, why does that matter? It matters for cultural diversity reasons, right? Because if you build bigger models and they're all English, guess what? You're either setting up a power differential between hemispheres or you're expecting that people are going to learn English. Right? So because of this, we have to find harmony across this fractured legal landscape. That's what we're focused on.
Because the alternative is really a tangled compliance regime, you know, stifling global competition and global collaboration. And frankly, it's simply just not something that Peter or I will accept. Yeah. And the economics of this are real, right? So behind every license is a value chain. That's how I think about it.
The global creative economy is over $2 trillion. 80% of the content being scraped today for AI is under either noncommercial or restrictive licensing. Now, one could argue that it's been stolen, and many people do. But in one year alone, 28% of the most commonly used web tokens were newly restricted.
What does that really mean? This isn't a hypothetical. This is actually happening. The open web as we know it, once a commons, is closing. People are putting mitigations in place, right? In this scenario, the risk to innovation multiplies. So without licensing clarity, developers face litigation risk.
Startups stall from the legal overhead. Creators stop sharing. So our focus is really not about locking down content.
It's about making sure the gates stay open, you know, with guardrails attached. And finally, before I hand it over to Peter: this isn't just a legal issue. This is a societal issue, right? Without a transparent, fair licensing framework that people can frame policy and action around, researchers face polluted, unusable data sets. Low-resource languages, minority cultures?
They're going to evaporate from model coverage. New entrants get, you know, killed, incumbents get entrenched. That is a power structure that I'm eager to break.
It's not just unfair. It's actually catastrophic to cultural diversity and global knowledge uplift. Right. So frankly, it's very inefficient.
So innovation thrives in diverse, open ecosystems. But when we lose that, when creators no longer see themselves in the tools that are being built, we all lose.
And so AMPL, which Peter will take us through in detail, is one small fix with far-reaching impact. It gives us a path to shared innovation without silent exploitation. Peter. [Peter] Thanks, Chandler.
Oh, wait, one is - Hello? Is this better? Okay, all right, everyone hear me? This one here? Okay. Awesome. Great. So AMPL: the concept here is to create a license, or a license family, that standardizes the definitions of all these different things.
There are no clear definitions right now, but we think some broad lines can be drawn to define things like what is transformative and what is derivative. And we can have some back and forth on that, right? But we can start somewhere. Ultimately the goal, of course, is to protect the rights of creators, to ensure that they have some say in what happens to their knowledge work, to their creative work. And these are actually the very same fundamental motivations behind the creation of copyright and intellectual property law.
It's not to prevent the spread of knowledge and information. It's to facilitate some kind of economic benefit for those who do the hard work. So as we looked into this, one of the things that emerged, which is very interesting, is that unlike traditional open source and the kind of existing commons tools that people use, there is a potential liability chain that comes back up to the creator in even the dissemination of open-weight models. That's something the open source world contemplated and started dealing with as they put patent protection provisions into GPL and Apache. On the data set and training data side of licensing, that's still an unexplored category. So there's a lot of value in providing clear protection there for creators, and giving them indemnity as they're trying to contribute to the knowledge commons. And lastly, again, this is a family of licenses.
So it's not just about making everything free for everyone to use. If you want to be remunerated, if you want a license that restricts usage to, you know, commercial purposes or research purposes, we're very much motivated and inspired by the approach that Creative Commons took with their license families. So I think the core assertion, the simple but really, really important fundamental distinction here, is this: if you look at what it is that deep learning technologies are doing, and I don't call it just AI or just LLMs, because it's more than just the transformer model, right, it's any of these large, neural network and deep learning inspired approaches. We as a civilization, as a species, have hit upon a very clever mathematical trick that is able to do not new kinds of compression or new kinds of copying, but actually uncopying, unprinting. We're able to extract essence from expression and do it at scale, do it mechanistically.
And that is not a capability that Homo sapiens has had until the last ten years. That is profoundly different and new. And I will, I guess, embarrassingly admit that I am not a legal scholar. I do not have a degree in law. It is a little humbling to be here at Harvard Law. But in my research and looking into this, when we look at the roots of open source and of things like the Electronic Frontier Foundation, and a lot of the thought about information freedom, there remains this distinction between expression and essence. If you look at the copyright acts, if you look at the Berne Convention, all of these things maintain a distinction between idea, or essence, and expression.
And we've always thought that goes in one direction. And for the most part, things like copyright and whatnot all govern the expression side of things. We have now a tool that completely breaks that.
And so if you think about the problem statement as: we have technological capabilities that transcend the classical boundary between essence and expression, the simple, sort of straw man, solution is to say, well, what if we came up with a new kind of tool, new legal instruments, a new set of approaches that contemplated this, that really treated this transcendence as first class, as the fundamental problem of what's occurred? So looking at how we deal with essence embedded in expression, and the mechanical transformation around that: the extraction of essence from expression, the transformations we can do on that essence in a mechanistic fashion, and then the re-expression. That's really it: to go beyond copyright. Copyright came around when we got fast mechanical copiers, printing presses. We now have these AI boxes. Let's talk about AI rights.
You know, what is the minimum set of things we put in a box and say, these are AI rights, that every natural human creator has? Just as they have a right of authorship, which is the origin of copyright, let's think about what AI rights can stem from that, again, as a first-class right. And in this licensing scheme, there are a few really core things that we think will go a very long way toward standardizing a lot of these different jurisdictional perspectives, as well as creating a unified and open playing field. So one of the key things is mutual indemnification.
Again, one of the really interesting things about these AI licenses is that they're not just merely software. You can use a lot of Apache or BSD/MIT licensed software to write the code for a model, but all the data around it, if that data includes things that create legal exposure, or that might be legally challenged at the point of application of that AI model, well, who is actually at fault? And this is not a theoretical concern. We know that there are open-source AI researchers, people who work on doing AI in the transparent public, in the open, who find themselves being named in lawsuits, because litigants sort of spray and pray a little bit on these things. And they are not themselves in violation.
They don't think they're in violation of anything. They're merely putting together a collection of data. They say, here's a data set; someone else takes it, builds a model, puts the model out there, gets sued, and drags everyone else along with them. That creates an utterly chilling effect on any kind of open collaboration, especially when this goes beyond merely, oh, you included a bunch of O'Reilly books or something.
This goes into, hey, you're literally violating laws in a particular country, right? When we talk about AI rights, we talk about copyright and deepfakes and all these things, but people don't think about the fact that there are countries where the governments are very, very mad at you if certain kinds of text, or certain kinds of imagery of certain prophets, let's say, are actually included in a technological capability or in a software. So the chain of liability is a huge, huge problem. And with AMPL, one of the very basic things that we do is include a mutual indemnification, and that makes it so that everyone who's collecting open data, curating it, making it available, writing open models, redistributing models, hosting them, mirroring them, everyone is playing in a place where they know they have explicitly been given, at a contractual level, indemnity from the user.
Transparency. We believe that there is a huge problem right now in the AI world, in the training data world and in the model weights world: there's no transparency around this stuff, as Chandler spoke about. And
the second thing we put in here in AMPL is a provenance system. So you know at least what's in it: a bill of materials, something like the nutrition label on the side of a box of cereal. Right.
You know what you get. If you're a business and you want to use a free, open model, well, if it's just a black box you have no idea, versus if it's labeled and you know: here are all the things that are public, free, open; here are the things that actually are copyrighted; and here's how much those particular creators want you to pay for them. That kind of a structure accelerates innovation, and it lets people confidently recompose, reuse, remix.
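As a concrete illustration of the nutrition-label idea, here is a minimal sketch of what a per-source bill of materials shipped alongside model weights might look like. The field names and tier labels are hypothetical illustrations, not the actual AMPL schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SourceRecord:
    source_id: str      # stable identifier for the work or collection
    origin_url: str     # where the material was obtained
    license_tier: str   # e.g. "public-domain", "attribution", "commercial-remunerated"
    creator: str        # the rights holder, if known
    remuneration: str   # declared terms, e.g. "none" or "contact for commercial use"

manifest = [
    SourceRecord("wiki-sample", "https://example.org/wiki-dump", "public-domain",
                 "various", "none"),
    SourceRecord("blog-essays-2021", "https://example.org/essays", "attribution",
                 "Jane Author", "contact for commercial use"),
]

# Ship this alongside the weights so downstream users can see what went in.
print(json.dumps([asdict(r) for r in manifest], indent=2))
```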
Again, all of the motivation here is to create and defend that innovation commons around AI-derived outputs. This is something where we will have to have some discussion as a global innovation community, but we should start somewhere on creating clarity about what constitutes a derived output, what is derivative, what is not derivative, etc. And lastly, the tiered usage. This is about creating that commons and defending the commons, but it is not about excluding commercial usage at all; we want to accelerate and facilitate commercial usage. So the model allows creators of data sets, curators of data sets, and curators and trainers of models to all attach some kind of remuneration clause. In this way it's similar to Creative Commons, and not really like any of the software licenses, which generally don't have a template for remuneration built in.
So for the various stakeholders in the ecosystem: what this means for creators is that you have more than a binary choice of opting in or opting out and being excluded from SEO, excluded from visibility, excluded from the emerging corpus of human knowledge that will be at the heart of all these AIs. You have a way of opting in where the opting in doesn't immediately strip you of all of your economic upside. Right? You can negotiate. You have a choose-your-own-adventure in the middle.
So it's not just opt in or opt out. This shows up a lot. I'm pretty heavily involved in Bluesky, and this shows up a lot in the Bluesky conversation.
A lot of artists, a lot of musicians, a lot of people who do not want to have their content available for training whatsoever, and who want the ability to say they don't, or they do. But right now it's completely a binary switch. And that's not particularly useful for people.
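As a sketch of what a graduated, machine-readable declaration could look like instead of a binary switch, imagine a creator publishing something like the following alongside their work. The field names and the sidecar-file idea are hypothetical illustrations, not an existing standard.

```python
import json

my_preferences = {
    "work": "https://example.org/my-album",        # the work this declaration covers
    "training": "allowed-with-attribution",        # more nuance than a yes/no switch
    "commercial_models": "requires-remuneration",  # graduated commercial terms
    "derivative_generation": "prohibited",         # e.g. no style or voice cloning
    "contact": "rights@example.org",
}

# Published at a well-known location (say, an ai-preferences.json next to the work)
# so that crawlers and model trainers can read and honor it.
print(json.dumps(my_preferences, indent=2))
```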
For companies, again, right now everyone doing AI and using models is basically YOLOing it. This is my spicy take. Everyone is basically YOLOing it, hoping they don't get sued, hoping that if anyone gets sued, OpenAI or Anthropic or Google or whoever joins the lawsuit and files, like, an amicus brief or something. But at the end of the day, we all know that all these giant models are trained on a lot of data that does not have clear provenance.
So if we provide a path for clear provenance, a path with explicit opt-in, where the creators have signaled their preferences as to what they want to have happen with their things, where they declare: here are the AI rights, and under these AI rights I am giving permission for downstream users to do these things. That gives companies a massive sigh of relief and a lot more confidence to adopt and use AI in all sorts of creative ways. For regulators, this is a path forward that doesn't just have Hollywood scream at you or have Silicon Valley scream at you.
This is a path forward for actually discovering the space, letting all the creators and the many, many talented technical folks who are working in the open try to build an innovation commons around this. In that dialogue, in that discourse, as we navigate exactly what the right breakpoints are for these various things, out of that will emerge some cow paths, which regulators then merely have to go and pave. That's a lot easier than paving over stuff and having everyone get mad at you, right? And as we've seen so far, the AI regulations that have been written by legislators and regulators and paid lobbyists and all these other things are not particularly effective. They're swinging way too wide over here, or they're drawing really pointless lines in the sand over there.
So it's very, very hard. I mean, regulators obviously do not have an enviable task right now. Hopefully, with a grassroots, ground-up, first-principles approach like this, we can quickly and in parallel explore these cow paths and give them something to pave. Lastly, for researchers, this is huge, building on something Chandler said earlier. Right now, if you are going to do research and you publish your research and say, yes, I used this data or I used that, you're just asking to get sued, right? And so this has a massive chilling effect. And the corollary of that is, to be safe, you're going to enter the safe harbor or walled garden of some trillion-dollar tech company.
Well, what does that mean? That means research becomes sanctioned only within these certain ivory tower or walled garden zones. And that is disempowering for everyone, and that hurts civilization. So by having an open ecosystem, by creating something that locks innovation open and gives everyone clarity as to what's in the model and what rights they have, this accelerates and facilitates open research.
So naturally, I can imagine there are a lot of questions and potentially pushback on this. And as we've talked about these ideas with people over the previous months and year, there are a lot of different kinds of arguments we've seen. Right? One of the biggest ones is: we don't need this because it's just fair use. And I've talked to people who work at some of the top AI companies, and, at least in the US where fair use exists as a doctrine, they are 100% on this. They're going to take it to the grave.
This is all fair use. We are totally within our rights to do this. If Google can go and do all this stuff and create a search index, we can totally do this and create a transformer model that maybe accidentally regurgitates the entire text of what it's scanned, but nonetheless, it's fair use, right? And the problem is that, of course, fair use only exists in a few jurisdictions in the world, while text and data mining and database rights exist in other places in the world. And there are already many, many instances of teams and collaborations of AI researchers doing jurisdiction hopping. Right.
I think maybe the most famous one is the Stable Diffusion and LAION relationship, but there are others as well, where people do certain kinds of activities in certain countries under certain provisions, and then they ship the contents of a database or a data set to someone else who then has the right to do certain other things. So the jurisdiction shopping and hopping will continue. And again, it is not a clear playing field for anyone.
It's still legally unclear. Another interesting bit of pushback I heard was that reproducibility is overrated. When I argue about why it's important to have clarity and provenance around the data sets used to create these AI models,
I had, you know, a very, very senior engineering manager at a large AI company tell me: yeah, you know, even if we told you what was in it, even if we just gave it to everyone, you wouldn't have the GPUs to train it anyway. So you don't really get reproducibility in actuality; you get it maybe in spirit, but no one else can do this except us. And my counterpoint is, well, lots of people have billions of dollars to spend on data centers.
It's not just the big tech companies in the West, right? This is not something so large that sovereign-level actors, even of medium or small scale, can't afford it; they can afford a few billion dollars to put into a supercomputer or data center to train their own models. So the argument that reproducibility is somehow just a theoretical nice-to-have, that's absolutely not true. Another bit of pushback I get more from the accelerationists, the people who really believe that in 18 to 24 months we'll have AGI or ASI that will just take over everything: they believe that essentially anything like this, which puts in more legal stuff and more encumbrances and slows things down, is literally immoral and unethical.
We're literally killing people, the argument goes, because we're slowing down the creation of AGIs, or of cancer drugs and magic medicines. Right. And I can appreciate that as a you've-had-a-little-to-drink, a-little-to-smoke-in-San-Francisco kind of argument, but it's not a serious argument. Right?
We know that if you actually go disempower trillions of dollars of economic output across the world, you're going to have a really, really big impact. People will die. So for right now, again, I just want to put this out there because it's actual feedback I've gotten.
So I just want to name it, but I don't consider it serious feedback. If this is such foundationally important technology, we should have a conversation as a society about what is ethical, what is right, and what is sustainable to do. And when societies have those conversations, the results are called laws. Right? So we should be figuring this out at a legal level.
But I think the most significant, and probably the most accessible, counterargument to the AMPL approach on licensing is that the horse has already left the barn. You know, you can go back and tell people they can't train on the Common Crawl data set or on The Pile or whatever, but they've already trained on it. They've already made models. They've already created startups and raised billions of dollars.
So what are you going to do about it? Right. This argument that the horse has left the barn, it's a pretty strong argument, right? Because they're creating value. They have users of their APIs. These models are getting better and better, doing more interesting things.
Why do we even care? Because the horse has left the barn. You're building better barn doors. Doesn't matter. One of the interesting things is: the horse has left the barn, it's off in the woods, but we're going to be able to find it and trace it and bring it right back. There is more and more research coming out over the last 18 months about how to actually unlock and uncover the eigenvectors, the semantic eigenvectors, the data inside the LLMs.
I think people sometimes forget, because after only two years of ChatGPT we're conditioned to expect these giant, massive models to be completely opaque all the time. They're not. Right now it's not efficient to recover these things out of them, but it is absolutely reasonable to expect that additional research into things like representations, interpretability, mechanistic interpretability, and model sensitivity will unlock for us the ability to say: here's a document, what is the probability this document is present in this model, in this set of weights?
And if it comes back 95%, you have yourself a lawsuit, right? That is not too far-fetched to imagine at this point, just given the state of research.
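As a rough sketch of what that document-presence test could look like, here is a loss-gap heuristic: compare how well the target model fits a document against an independent reference model. The function signatures are assumptions (you would plug in real scoring code), and published membership-inference methods are considerably more careful; this only illustrates the shape of the idea.

```python
import math
from typing import Callable

def membership_score(doc: str,
                     target_nll: Callable[[str], float],
                     reference_nll: Callable[[str], float]) -> float:
    """Return a value in (0, 1): higher means the target model fits this document
    suspiciously better than an independent reference model does."""
    gap = reference_nll(doc) - target_nll(doc)   # memorized text tends to have lower loss
    return 1.0 / (1.0 + math.exp(-gap))          # squash the gap into a score

# Usage sketch: flag documents the target model appears to have memorized.
# if membership_score(document, target_nll, reference_nll) > 0.95:
#     print("strong candidate for having been in the training set")
```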
Number two, enterprise priorities. Right now, a lot of enterprises are using this stuff, and they're trying to figure out the right path forward. We've talked to a lot of people who are interested in this. They're serious about deploying AI, and they understand that there's a question mark around all this. That question mark will remain.
That question mark will remain for a long time, until clear regulation or the markets come out with something about provenance and about accountability. And even right now, if you look at the Character.AI lawsuit, however we may feel about whether they are liable or not for that particular teen's suicide, you just have to tell that story to corporate counsel and they will have a lot of heartburn. The idea that you might put AI into a thing, you don't know where it came from, and now you've got a thing out in the wild that is potentially going to cause people to do really bad things.
You cannot look at that as corporate counsel and not raise some serious concerns. Right? So enterprises want to have trustworthy AI. They want to have accountability. They want to actually see the provenance chain of the data set, and who touched this model as it went through the training process. And the last and really important thing is: the horse has left the barn, but there are many more horses besides in the barn, not just living there now, but foals that are being birthed and that will come.
If it is true that we're going to YOLO our way into big models, that everyone just uses the big models that have trained on The Pile and all this copyrighted work, and there's nothing we can do about it, let's just take that assumption at face value and say, yes, there's nothing we can do about it. Well, what comes next? What comes next is we use those models to make more songs, more music, more art.
We use them to write more interesting screenplays. We embed these models in sensor platforms and drones; we send them into the water, we put them in the sky. We gather all sorts of interesting data, data with high economic value. What protects that data? What protects all the music and all the writing to come? I mean, there are lots more horses besides, right? There are 8 billion people on earth right now, but something like 92 billion have come before, and hopefully many more billions will come after. Right?
What are we doing to pave the path forward in an AI world? Well, human authorship, human creativity, human knowledge and insight still matter. As long as they matter, and they have to coexist with AI models, we need some kind of way of attributing, well, incentivizing the attribution of, that kind of knowledge work. So, without going too much into this, these are just kind of the basic concepts here: the horse has very much not left the barn.
The horse is problematic. Many people are scared of the horse, even though they want to use the horse. And we can build a better barn. I think that's ultimately what we're trying to say here. We can build a better barn. And people want a better barn, right? The data provenance gap affects everyone.
It affects the usage and adoption of AI. You know, we're still in early stages in terms of enterprise adoption, but as AI gets more and more serious, as AI systems are actually deployed more and more to end users and consumers, and as the liability impacts start coming back, people are gonna start asking harder and harder questions, right? The Copyright Act of 1976, which minted and certified the concept of software being a copyrightable artifact. That was 1976.
That's a long time after people were writing code. And that was just that initial thing, the idea of open source licenses and copyleft and all these things that came with GPL, that came, you know, quite a bit after that as well. So these things take years for people to really figure out. We're two years into the ChatGPT moment right now, two and a half years in the ChatGPT moment.
But we can see from here, anyone who can think from first principles can see, that these things are going to be really problematic. So the way that we're approaching this with AMPL is a kind of multi-stage approach. The first is working on the framework itself. Right?
Really focusing on this provenance issue, focusing on how we define some of these standard terms. What are the kinds of tiers? What are the kinds of compensation structures that would be valuable for people? Having conversations with people who have direct connections to a lot of rights holders in various kinds of industries, music, Hollywood, etc., and also talking to legal experts who have worked in this area a long time, to see what legal innovation is viable in the space. The next thing then is to work with the groups and projects that have the largest data sets that have been used for training, and to work with them to develop subsets of their data that are clearly provenanced, that are clearly public or clearly free to use, etc.
And so we're creating a data set that is free for everyone to go and build an open model on. But around that data set, we can actually put an AMPL license. We can attest that we believe this is free for everyone to use. And then, if you think about it as the origin block of a blockchain, this is the origin block for a data set that is cleanly licensed and clearly provenanced.
And anyone training LLMs, whether it's in-house at an enterprise, in-house at one of the large AI companies, or someone trying to fine-tune, can use this data set, and now they have something that has the AMPL license attached to it. Then we're going to add additional tooling so people can look at, hey, what is in my data set? You know, I can trace these F-bombs. I can actually do some data linting, if you will, and a variety of other kinds of tools, both end-user and practitioner oriented, as well as ones that are for enterprise adoption of this stuff. And then actually go and drive the industry adoption around this. So when people are creating their works, they can tag them, they can identify them, they can signal their preferences around how they want to be compensated, etc. All of that can be done.
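Here is a minimal sketch of the kind of data linting described above: walk a corpus, check that every record carries provenance and a recognized license tier, and flag content issues (a small profanity list stands in for the F-bomb tracing). The field names are hypothetical, matching the earlier manifest sketch, and this is not an actual AMPL tool.

```python
PROFANITY = {"fuck", "shit"}  # stand-in word list for the F-bomb tracing example
KNOWN_TIERS = {"public-domain", "attribution", "commercial-remunerated"}

def lint_corpus(records: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in the corpus."""
    problems = []
    for i, rec in enumerate(records):
        if not rec.get("origin_url"):
            problems.append(f"record {i}: missing provenance (origin_url)")
        if rec.get("license_tier") not in KNOWN_TIERS:
            problems.append(f"record {i}: unknown or missing license tier")
        words = set(rec.get("text", "").lower().split())
        if words & PROFANITY:
            problems.append(f"record {i}: contains flagged language")
    return problems

# Usage: run before training and resolve findings before building on the corpus.
# for issue in lint_corpus(my_records):
#     print(issue)
```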
And there are many projects that have been working in this vein. They're not connected. They don't have a unified legal basis right now. So a lot of the framework development and the data set creation is being done, and we're talking about that, in collaboration with many of these other projects, which I'll show in just a minute. But ultimately the goal here is to create this licensing ecosystem that facilitates and accelerates innovation.
But it also has, of course, a lot of commercial application. And the way that we're thinking about the structure right now is that the AMPL Foundation itself would be a nonprofit that develops these things and has presence or representation in most major jurisdictions. But then we would create a commercialization consortium that would allow many kinds of enterprise tool makers and AI vendors to integrate and interface with enterprise AI use cases. If you're not big on enterprise software go-to-market kind of stuff,
think about this as creating something like Underwriters Labs: an underwriting standard, a way of testing and certifying AI models as actually being fit for use, free of defects, free of various kinds of possible litigation exposure, etc. And lastly, but not least, back to the problem that Chandler spoke about earlier: with this kind of a structure, we're hoping that we can help incentivize people to build a more culturally diverse set of data sets for this amazing technology. As a very practical matter, if you look at the tokens that are in the Common Crawl, which is by far the largest token source for most of these LLMs, English is by far the most represented language. The next top five languages, I think, are Chinese, Russian.
What were they? Russian, Chinese, German, a few others. But beyond that, even among Western European cultures, the percentage of tokens is tiny, single digits. And then if you go to the Global South, it's almost non-existent.
A lot of places in Africa and South America have not digitized a lot of their cultural heritage. And what we know is that this is not a mere matter of translation. If you take a model trained on English text, the semantics inside that model capture a particular Western, a particular American or British, perspective. And if you just take that and translate it into, say, something like Finnish, which is a European language and a European culture, the output is missing many of the cultural highlights and many of the cultural nuances.
It basically sounds like an American speaking Finnish. And that's a European language and a European culture; think about the divergence between these current models and the Global South and many other cultures that are low resource, where we don't have a lot of digital tokens to do training with. So the reason why AMPL matters for this is that if those cultures, if those countries, invest in digitization and in creating and curating these data sets, and they don't have a way of preserving that and defending the usage rights around that data as a cultural heirloom for that country or for that culture, then it will just be strip mined. It'll just get hoovered up and sucked into one of the existing massive trillion-dollar players.
And then homegrown, domestic, culturally integral approaches are basically unable to compete. So using something like AMPL as an international standard, they would actually have a space to create this distinction between what gets used domestically inside their countries and what gets used by people outside, etc. We've thought about this as a high-level goal, but it's a very important one, because we know that these LLMs and these AI models are going to be part of how creation happens, how people think, how people interpret and interface with the world and with each other. And if we don't start putting these things in place right now, at some point it's going to be too late. We'll lose
the opportunity to do this. There is time sensitivity around this. So, there are a lot of questions, right? We don't have all the answers. But we are starting this work. We've been in touch with some projects that we think are very, very interesting and, also very open to building on this kind of thing. But there's a lot of questions like, how do we actually handle noncompliance? I have some creative, spicy and probably not feasible approaches, but that would be hilarious. Community governance. Right?
What does it look like for us to iterate and rev on this thing while also creating clarity and comfort for people who take older versions of the license? What does the governance framework look like? How much confidence can people have in it while this is an evolving standard? How do we get regulators on board? How do we talk to the EU, to Brussels? How do we talk to the UN? How do we get American regulators to see, hey, this is something where we can attach all sorts of additional concepts, like privacy rights and likeness rights and things like that? And then, right, evolving technology: how do we make sure that we are keeping tabs on a very, very fast-moving technology landscape while we're putting these definitions and tools in place? So, some of the projects that we've talked to and that we're looking to collaborate with: a lot of the usual suspects, the providers of large corpora that go into training models, as well as many of the usual suspects on the open knowledge, innovation commons side of things. Lots of really great conversations so far, a lot of really interesting technology we could use. And the reason we're presenting here and having this conversation is that we're soliciting feedback on these concepts, looking for people to shoot holes through it or to give us tools to maybe bring in, and ideas to build on. So with that, I think we open up for questions. Chandler, is there anything else you want to say? All right.
Oh, sorry, there is also a QR code here, okay. All right. Thank you everyone. We have about eight minutes for questions. You want to come up and then we.
Oh we're going to sit here, right? Okay. Let's do it. [Audience] Thank you so much. [Peter] I can't see them. Yeah.
[Audience] The presentation was really interesting. The first thing I was thinking is, it seems like a lot of people characterize this moment in AI as an arms race between the United States and China. And everything, especially in the current geopolitical moment, seems zero-sum. So it seems as if the moment you propose AI regulations in the United States, it's like, well, China will leapfrog us, and that will create some worse form of AI.
How do you think you would convince industry leaders or government officials to impose these regulations? [Peter] I have a couple of points on that. Number one, let's look at literal arms races. Even in the middle of actual firefights and literal arms races, right,
we still created the Geneva Convention. So the concept of, you know, rules of war or whatnot: when people are literally shooting each other, trying to kill each other, we recognize there are some boundaries to this. So humanity, as a civilization, as a species, is able to contemplate this: even as we're racing for things, there are certain boundaries we'll draw.
Now, some of the things we're arguing for here, they don't seem to be as stark as saying no chemical weapons, you know, so that people don't get gassed, and so we say, hey, let's not use that anymore. But just because it's an arms race, just because we're actually fighting for economic supremacy, I don't think China wants to see deepfake porn any more than American regulators do. Right? These are concepts that they're wrestling with inside China.
Yes, there is a state permissiveness around, hey, let's go build whatever best AI we can. But there are a lot of people in China who make money off of China's copyright system. There's IP law in China.
There are movie stars and music stars. This is not a thing that somehow just disappears inside China, right? It's a problem that everyone in the world has. The second thing I would say is that transparency and provenance are actually a competitive advantage. And this is maybe a non-obvious argument.
What we're arguing for is transparency, and the big warning we're trying to signal to everyone who's interested in an open innovation commons is that, the way we're going right now, we're going to continue down this path of opacity. That's bad for everyone. It's certainly bad for people who want to do open research, but it's actually bad for adoption as well. Right.
This is actually another great example from China, right? In America, we have regulatory requirements around labeling of ingredients. In China, they had melamine in baby formula, and it killed a bunch of kids, and they were not happy about that. They now have lived experience of why it's important to have these labeling standards on food. So across the board, yes, there is the technology, there's all this geopolitical stuff. But at the end of the day, if you look at what the technology is, what the first principles are here: we have something that's doing thinking for us, and it behooves all of us to understand what went into that thing.
Right. So unless we have a way of clearing this up, unless we have a way of motivating economic development in the open around this stuff, it will always remain closed. And that's bad for everyone. And we think that that kind of closed ecosystem is less competitive, and will be less vast, than the open ecosystem.
So fundamentally, for me, this is work that builds on the question of how we extend the ethos of open source and open innovation into this new era. I heard from a Microsoft guy once who said, well, everyone knows open source is good at solving well-known, well-understood problems. Right. And so there's oftentimes a perspective that open means you build the cheap, commoditized stuff.
But what we know, in my personal experience through my time at Continuum and Anaconda, is that open communities actually innovate faster and better. And so this is very much in that vein. Now, I could be wrong about that, but I think I'm right. So hopefully we're going to put the legal plumbing in place so people can confidently collaborate in the open. And that should outcompete the closed things.
[Chandler] The only thing I would add to that is, really, where I focus my energy is trying to spark that open "source" movement, and I put "source" in quotes because it's no longer about source. Having a Cambrian explosion of capabilities, smaller, more capable, more targeted AI, is actually good for everything. These large models are going to keep getting bigger, right? This is the way the market is shaped: OpenAI is going to get bigger, Gemini is going to get bigger, Meta is probably going to get bigger.
But the long tail of capabilities, where you can place a small AI in your thermostat, in your refrigerator, or in your car, much more targeted, that is the open ecosystem that we have to foster, and that's the open ecosystem we're really focused on. And if you have that, then you have more competition, and not only nation-states but also the large players have more pressure to perform, and to perform in the right way.
Yes. Thank you. [Audience] Hi, everyone. I'm a software engineer by training turned product manager, with a background in AI and very much interested in the responsible use and development of AI technologies.
I very much appreciate this work. One thing I wanted to mention, not only from the copyright perspective: I can also see how this could be enhanced to regulate the models in terms of, like, fairness and biases. So I think it's very interesting. One comment that I had, in terms of potential arguments against reproducibility, would be: yes, maybe there are companies that can afford the big data centers and they have the money to spend.
But at the same time, this is also causing a lot of impact on the global environment. And if you go now to the AI conferences, they are also talking about how much power consumption is generated in training these AI models. So I just wanted to mention that; I don't know if you had any reactions to it. But I can see somebody making an argument or bringing this up in terms of the impact of retraining all of these models.
[Peter] Oh, yes. Yeah, I have lots of thoughts. But I guess one thing I would say is that the models are kind of getting to a point where, to Chandler's point, the small models are getting good enough. Yeah, the small models are getting good enough that we can actually put them into the physical world and put them in places that matter beyond just chatbots. Right?
I think that's a really important thing to index on: the really, really big AI companies, a lot of their impact is still very much in the software kind of knowledge work, knowledge artifact, chatbot, data summarization kind of work. But when AI actually meets the physical world, in those use cases they care a lot about the power on the inference side; they care a lot about the accuracy per watt. And that kind of pressure, we're already starting to see it.
And I think that kind of pressure will increase. That will create competitive back pressure against the ever bigger models. This is just my general prognostication: I think we're going to see back pressure from that, and people are going to then optimize for the whole system, not just how much energy goes into training a flagship SOTA giant model, if every time they do that, it gets quantized and then distilled and then