Will Copyright Derail Generative AI Technologies?

...gracious enough to come tell us [inaudible].

Okay, good afternoon, and thanks for the opportunity to come and talk to you about what's going on in the generative AI copyright lawsuits. The short answer to the question is that copyright very well may derail generative AI, at least generative AI trained on material that is either not in the public domain or not already licensed.

There are three principal questions. The Big Kahuna, as I call it, is the question of when, if at all, making copies of works for the purpose of using them as training data for models infringes the copyright in those works. That's the issue that's central to almost all of the lawsuits, and so they're all pretty much the same when it comes to that question.

The second question is not very interesting, but I will mention it anyway because the lawsuits often include claims that generated outputs of AI systems infringe the derivative work rights in the ingested content. The short answer here is that outputs have to be substantially similar in what copyright lawyers call their expression in order to possibly infringe the derivative work right, and in most cases there is no allegation that there are any substantially similar outputs. Two of the lawsuits, one brought by the New York Times and one brought by a set of music publishers, give examples in the complaints of outputs that they were able to get from the generative AI systems and that they think are infringements.

And then I'll talk very briefly about the removal or alteration of copyright management information, a violation of another copyright-related rule. There are several complaints that raise this kind of claim, but the courts have dismissed those claims so far. So they're in the case, they'll go up on appeal, but right now they're not really the central focus.

So there are, as of today, 32
lawsuits pending in federal courts. Most of them are here in the Northern District of California, some of them are in New York City, three of them are pending in Delaware, and one is in Massachusetts. So they're kind of all over the map. One of the reasons why that's important is the law of each circuit: California is in the Ninth Circuit, New York is in the Second Circuit, Massachusetts is in the First Circuit, and Delaware is, I think, in the Third Circuit. That just means that when these cases get up to the appellate courts there could be conflicts among them, and that might mean the Supreme Court takes it. So I'm just going ahead a bit to let you know that it matters that they're pending in different places.

Most of the cases are class action lawsuits, and I'll talk a little bit more about that. The idea is that a small number of actual named individuals purport to represent the interests of all people who are in their category. So there's one lawsuit that's a class action about visual arts, and there are at least eight lawsuits that are class actions involving book authors. And again, all of these cases are in really early stages.

The most prominent of the individual lawsuits are the five that are listed here. Again, part of what's interesting is that one of them has to do with stock photography, one with music lyrics, one with recorded music, one with news stories, and one with Westlaw headnotes, which I'll talk about a little bit more later. So that's kind of the range of them.

The question is who's getting sued. Pretty much everybody, in terms of the big players out there. OpenAI and Microsoft have the biggest targets on their bellies, but Meta has a couple of cases against it, so does Alphabet, Midjourney has one, GitHub has one, Anthropic has one, and so do Nvidia and Databricks.

All but one of these
cases are in very early stages. What I mean by that is that a lawsuit begins when a plaintiff files a complaint. The first move on the part of most defendants is not to answer the lawsuit and say, yeah, I'm raising a fair use defense. They basically move to dismiss the case, or to dismiss some of the claims in the case, and so there are lots of preliminary motions that go on. So a lot of these cases have been bogged down in rounds of motions to dismiss, although none of the cases has taken up the training data issue on motions to dismiss. Everyone knows that's the Big Kahuna that we're going to save for later.

Some of them are in discovery. One of the lawyers wanted, for example, to depose Mark Zuckerberg in the Meta lawsuit, and they've now gotten permission to do that. So some of them are bogged down in discovery, and some of them, like the Nvidia and Databricks cases, are just inactive right now.

There's one judge here in the Northern District of California, Judge Chhabria, who has made it very plain that he wants to be the first one to rule on the training data fair use issue, and he's tried to hurry the case along so that he can get to that. He basically accused the lawyer who had been representing Kadrey, the class action plaintiff in this particular case, of being incompetent, and said, you have to get some new class counsel here, because I don't want to rule against your clients just because you've been incompetent. The judge actually said that in open court.

So the David Boies firm... Is David Boies famous enough that even you guys would have heard of him? David Boies, he's really super famous. If you want to ask who the famous lawyers in the United States are who are not the Attorney General or something, David Boies is really, really famous. And so the Boies firm is actually now in the Kadrey case and will be representing
Kadrey rather than the lawyer who was called incompetent by the judge. So that means that there's going to be good counsel on both sides, which wasn't true for a while.

Audience question: Why were they called incompetent?

Because they were not moving the case along. This one lawyer, his name is Joe Saveri, is a pretty well-known class action plaintiffs' guy in the Northern District of California. These are his first copyright cases, and he's got seven of them. If you're trying to manage seven cases, all of which are class action lawsuits, and you've got lots of things going on and lots of discovery to do, you're kind of like, I'm overwhelmed. Especially because he's a small firm. David Boies is not a small firm person. He's a big firm person, and he's used to winning. You probably have heard of the lawsuit that was brought some years ago by the US government against Microsoft. Anybody heard of that? Okay, David Boies represented the United States in that lawsuit. So he is really [unintelligible].

Anyway, sometime in the next two weeks, Concord Music's motion for a preliminary injunction against Anthropic is going to be heard, and the key question there is whether it's likely to win on the merits. The reason to file a preliminary injunction motion here is less about wanting to enjoin this particular entity and more because you want the judge to say whether there is a likelihood of success on the merits. The defense counsel is basically saying: no likelihood of success on the merits, plus there are these other 10 reasons why you shouldn't issue a preliminary injunction. So I think this one's going to get denied, but at least we'll know a little bit more after that particular hearing.

There's only been one ruling so far on fair use, and that's in a case that was brought by Thomson Reuters against Ross Intelligence. So, you guys won't have heard of
Westlaw, but it's a database that a lot of lawyers use. One of the things that West, the company, has done for more than a hundred years is basically: here's a judge's opinion on something, now I'm going to give you a short synopsis of what the ruling is. It tries to extract, out of what was often 20, 50, 100 pages of material, a really, really concise thing. Those are known as headnotes. Now, the judicial decisions themselves can't be protected by copyright law, but the headnotes, which are a selection and arrangement of information from the opinion, may be copyrighted. What Ross did is that it got copies of West headnotes, and it used the headnotes and the opinions as training data for a legal generative AI thing. And so Thomson Reuters said that's copyright infringement, and Ross said no, it's a fair use, and the judge basically denied motions for summary judgment.

Now, summary judgment is something where you kind of say, what the heck is that? Well, why do cases go to trial? They go to trial because I say the facts are A and you say the facts are B, and they can't be both A and B, they've got to be one or the other. And so what is the role of a jury? The role of the jury is to say, I believe A instead of B. That's what a jury is really about. So if there are no material issues of fact for a jury to resolve, then summary judgment is appropriate. Both sets of lawyers said, I move for summary judgment, and the judge denied both of them, saying that there were triable issues of fact on each of the fair use factors. I'm going to talk about the various factors in just a minute. But the day before the trial was scheduled to start in late August, the judge said: hey, how about refiling those motions for summary judgment, but also please reserve the following dates for a trial if I decide a trial is necessary. So I don't know what
was going on in his head, but that's where things stand right now. So that case could go to trial in late December or early January, and that would be the first jury finding about whether something is fair use or not.

[In response to an audience question] No, of course not. Both litigants moved for summary judgment. That's because they each say there's no triable issue of fact here, we don't disagree on anything, you should decide this on the law. The judge basically said: you think this, and you think this, I'm not going to decide that, that's a question for the jury. That's what's going on in this particular case. The question of what is an issue of law and what is an issue of fact is something that copyright lawyers have been fighting about for a really long time, and these cases are going to be good examples of, oh God, what's a legal question and what's a factual question? Only factual questions go to the jury.

Audience question: Which state is that?

I'm sorry?

Audience question: Which state is this particular case in?

That one's in Delaware.

Anyway, so why are the lawsuits being brought? Well, there's a lot of money in class action lawsuits. Particularly if you get a class certified, it creates enormous pressure on the defendant to settle the case, and usually to pony up a really large sum of money. So all of the class action lawsuits are being brought in order to essentially put a lot of pressure on the defendants to settle, and the class action lawyers can be eligible for as much as a third of the take. In one of the cases the explicit request is for $9 billion in damages, and a third of nine billion is not a small amount of money. And of course the large copyright owners such as Universal Music and the New York Times just want a lot of money. So mostly this is about money. But as I wrote in Communications of the ACM recently, several of the complaints asked for
destruction of models that were trained on infringing stuff. So destruction is kind of looming out there as a remedy in those cases too. Now, many of the authors, visual artists, etc. who are individual plaintiffs in these cases are really unhappy. They say: you didn't ask my permission, you're not compensating me, you're producing high-quality outputs because you built your models on us, and you're competing in the marketplace with us, and that's not fair. So this is a nice visual to sort of give you a sense of what the visual artists think about generative AI.

Okay, so, copyright. The most important thing here is that copyright attaches automatically, by operation of law, to every original work of authorship once it's been fixed in a tangible medium, and digital is fixed enough. The rights vest in the authors, but authors often sell or license those rights. The reproduction and derivative work rights are the most important ones. The rights last practically forever. But the only thing copyright protects is the original expression in a work, not the ideas, facts, or methods that are embodied in it, and it's limited by fair use. So that's the stuff that you've really got to know in order to understand the rest of the talk.

Fair uses are not infringements. They're not excused infringements, they're just not infringements at all. But fair use is a defense, and that means the defendant bears the burden of proof on the issues, which can be really important. There are four factors that are set forth in the statute: the purpose and character of the challenged use, the nature of the copyrighted work, the amount and substantiality of the taking, and the effect of the challenged use on the market for or the value of the work.

So the short version of the arguments of the plaintiffs in these cases is that all of the defendants in these cases used the
works for commercial purposes and they're not transforming them. It's not a parody, it's not a commentary, they're basically just using them for the same purpose, and also the outputs have the same purpose as the plaintiffs' works. So if it's visual art: hey, here's my visual art, I put it up on the internet, and then it gets downloaded and used as training data, and here's an output, and it competes. That's their argument.

The nature of the work: well, again, all of them have different types of works, but nevertheless the plaintiffs' point of view is that you can generate high-quality outputs because the models were trained on high-quality inputs. That's the nature argument for the plaintiffs. And you make copies, multiple copies, of the work, and that is bad too.

And there are two arguments in respect of market effects. Plaintiffs such as Getty Images and the New York Times say: hey, you are harming actual or emerging markets for licensing revenues for my works as training data. And the class action plaintiffs say: your outputs essentially reduce demand for human-authored works and threaten authorial livelihoods, and so that's not fair. That's the short version of that.

The version from the standpoint of the defendants looks pretty different. They want to say: look, it's transformative, because we're using works as training data, and training data is different. If I created a work of visual art, when I created it, its purpose was to let people enjoy the beauty of my thing. When it's processed computationally, it's done for a completely different purpose. You don't really care about whether it was pretty or not, you just care about its composition, its component parts. So if you use the works as data and not for their expressiveness,
that should be okay. And the argument is that it's okay to use works to extract information about how those works were constructed. Even if you were analyzing a novel or something like that, if you extract a bunch of information from it, that's not copyright infringement.

Nature of the copyrighted works: again, it depends on the particular type, but most of the data on which the defendants' models in these cases were trained was voluntarily posted on the internet, and there are cases that have given that some weight.

The amount and substantiality: yeah, I took the whole thing, but it was intermediate. One of the things that the defendants are really going to be trying to say is: hey, here's the dataset, yeah, I made copies of it, but the model isn't a copy of the training data, or it's not a discernible copy in the way that copyright law cares about. You can't just look at the model and say, oh, here's the sentence that I really like from this particular novel. It's not there in that way. So the datasets being distinct from the models is actually one of their arguments.

And then market effects: well, it's not possible to get a license from everybody whose works are ingested, billions and billions of things on the internet, etc. I don't need a license, the use is fair, outputs are overwhelmingly not substantially similar, and besides, hundreds of millions of people are using these models to create new things and to make personal uses, and that's actually a public benefit. Yeah?

Audience question: On the copy piece of it: we know that models can emit individual copyrighted works. What is the difference between that and the model containing the copy? Is it the fact that in one case it's generating it and in the other case it's stored internally? I guess, as a computer scientist, it's hard for me to
draw the line. I've talked to other lawyers, and I'm just curious how you try to draw it.

Yeah. I mean, some part of it depends on what the prompt is. I have a little slide on this a bit later, but I'll make the point now, which is that if you, as the person who is trying to get some output generated, essentially prompt the model to emit certain things, then if anybody is the infringer, it may be you, the person who prompted it. So that's one of the questions in the New York Times and the Concord Music cases.

Now, in the Concord Music case, Anthropic has basically said: look, I didn't put strong enough guardrails in before, I put guardrails in now, you can't do this anymore, and therefore that's not what Claude was designed for and we shouldn't be liable for it. So now we know that we've got to be more careful, and now we're more careful.

The New York Times case initially made a really big splash because the complaint had a lot of examples of, here's the story from the New York Times, and then here's the thing that I was able to get ChatGPT to do. That was included in Exhibit J, and the New York Times, as I understand it, has withdrawn Exhibit J because they don't want discovery to be done on how hard the New York Times engineers working for the lawyers had to work in order to get the substantially similar output.

Audience question: And that makes a difference for whether it was contained, like whether the model has a copy of it? Interesting.

As I understand it, at least sometimes what the Times people would do is they'd input the story and then ask for the output to be a similar story, and if you do that enough times you kind of get an iterative thing. So you're kind of the person who is responsible for the stuff, and the question is whether the
developer of the model, who is not directly liable because they didn't output that particular material, is indirectly liable for copyright infringement. I'll show you a little bit more about that, but that's the short version. Yeah?

Audience question: On point two, when work is voluntarily posted on the internet, is there some sort of implicit license to allow people to copy it, to browse it and read it? Have the courts said anything about that?

As it happens, there's a case. Field v. Google was a case in which a court held that Google's copying of Field's works on the internet for purposes of indexing and caching the contents was fair use. But the court also said that there was an implied license: if you put something up on the internet and you don't use robots.txt, then you have implicitly licensed that stuff. And I think that's what happened for a lot of things, because Common Crawl has been making copies of the internet, in addition to Google and other search engines. They've been doing Common Crawl for like 17 or 18 years. So there is an implied license theory that the defendants will be looking to. But of course, that's just a trial court case.

The Second Circuit Court of Appeals, which is a big shot among courts, decided in the Authors Guild v. Google case that digitizing millions of in-copyright books for purposes of indexing and serving up snippets was fair use, and kind of said that other kinds of computational uses were also fair use. So that's one of the major decisions. There's also a case where a student sued a plagiarism detection software system for storing their papers, and the court basically said: look, the system is not trying to exploit the expression, it's only using it as data. And that was a case in
which a non-expressive use of works was considered okay. And the Sega v. Accolade case is another case that defendants are looking to. Sega sued Accolade because Accolade reverse engineered a computer program in order to get information about how it needed to configure an interface for its code to be able to work on the Sega platform. So it's basically an interoperability case, and the court said that was fair use because Accolade was making an intermediate copy of the software. It wasn't exploiting the expression in the software, it was copying it in order to extract information, and that information wasn't protectable by copyright law, and therefore it wasn't an infringement.

Now, the plaintiffs are basically going to say: oh, don't pay any attention to those cases, those cases are all distinguishable. I'm not going to go through it here, but you can see the kind of arguments that the courts are going to be faced with in trying to distinguish the first set of cases from these other cases, and there's going to be a lot of fireworks between now and then.

There are other things in some of the cases that I think are going to be important. The Kadrey case, that's Kadrey v. Meta, is the one Judge Chhabria wants to take to a fair use ruling. One of the things that the Kadrey complaint talks about is that Meta's model was trained on a corpus of pirated books. So is that going to matter? They're out there on the internet. Did you know about that? Sometimes you can't tell the difference between something that's just a book and something that's a pirated book. But Books3 is apparently a place where a lot of books, 190,000, are found that are apparently pirated, and so if they're available on the internet, that's going to be different from, you know, you
voluntarily posted them. So that's different, and training on Sci-Hub or Anna's Archive might be troublesome. So it's possible that that could tip a fair use case in the plaintiffs' direction.

It's also possible that courts will distinguish among different types of models. I would say diffusion models are probably more vulnerable to challenge than large language models.

Audience question: I'm sorry, which direction?

I think that diffusion models are going to be more vulnerable to infringement claims than large language models, because large language models are more abstract.

Audience question: Are you thinking that because you're thinking LLMs means text output?

Yeah.

Audience question: Okay, because you can have LLMs that generate images, and you can have diffusion models too. So models that generate images are more vulnerable?

Yes, I think that's right. And part of what I'm saying, too, is that it may depend on the data type. Software, because it's functional and because it has a thinner scope of protection, may be more likely to be fair use than, for example, visual art or music, recorded music. You mess with the recording industry, you are really in trouble. I'm telling you, there are two recorded music cases now, and I would not want to be the defendants in those cases. Both of those cases are being brought against small players, not big players, and so it's easier for the recording industry to squash the little guys than to squash the big guys.

Audience question: So, more susceptible or more vulnerable because, with art or music, I can say this sounds similar, this looks similar, but in text it's harder to have that established, right? But this is separate: people going off and saying, oh, this is similar, so I'm going to sue, is different from people being able to prove that this came from the data. Because actually, with LLMs, I think they might memorize more than diffusion models.

So these kinds of... What
I'm trying to say is that most people would go into this and say, oh, they're all the same. I'm saying some models may be more vulnerable to infringement lawsuits than others, and I was trying to give an example of that. But it's also the case that data types may matter. One of the things about recorded music is that, owing to a kind of historical anomaly, only the exact sounds of a sound recording are protectable by copyright law. You can sing the same song in exactly the same way, and as long as the bits are different, it's not an infringement. That's important to understand. Now, you have to have cleared the music if the music's still in copyright, but the recording basically has the thinnest of copyrights.

Audience question: I see. Okay. And these music cases are about the recording?

There are two copyrights in every sound recording, potentially. The musical composition might be in the public domain because it's Bach or Beethoven. But if it's not Bach or Beethoven, if it's Taylor Swift, there's a copyright in that music, and then there's a copyright in the sound recording. The sound recording copyright is really thin. Musical composition copyrights are pretty thick.

Audience question: So that's how Taylor was able to screw her first...

Well, no. She was able to re-record the music because the copyright that the recording industry had was so thin that her re-recording it, and trying to sound as much like Taylor Swift's first version as possible, wasn't an infringement, because it wasn't an exact copy of the bits. That's the point.

Audience question: It was okay because she had the copyright for the music?

She had the copyright in the music, and she re-recorded it, and that wasn't an infringement because the copyright in sound recordings is super, super thin. So with these cases against the generative AI music providers, it's going to be really interesting, because the copyrights in recorded
music are so thin. I know the guy who's litigating it, and this is going to be fun for him, but not so much for other people. Anyway, my point here is just that different data types could be treated differently. It's not going to be all fair use or all unfair use. Some will be more likely and some will be less. Anyway, it's going to take, I'm estimating, five to seven years before we really have any clue.

Market harm: again, to the extent that there's an existing market for licensing uses of things like Getty Images' 12 million images, that's a stronger argument for market harm than in the class action cases, because Andersen and her fellow plaintiffs can't give anybody a license. They can't give Stability a license, they can't give anybody a license, because they don't exist as a class unless and until the court basically creates them as a class.

Now, the more interesting question is what happens if you didn't have a licensing market before, but now that you see that other people are licensing stuff, you say, hey, I could license my stuff too. Reddit is an interesting example, because everybody was training on Reddit data for a long time, and Reddit now says, oh, if you're going to train on my stuff, you need a license from me. But Reddit doesn't own any copyright in any of the postings, so exactly where it gets the right to license is, I think, interesting by itself. But there's also a question: let's say OpenAI trained on Reddit data before Reddit had a licensing market, or said it had a licensing market. What happens to the next company or the next individual who wants to do it now? Does OpenAI get a free ride because they did it before there was a licensing market? And how does the fact that now Reddit says there has to be
a licensing market affect the fair use defense somebody might want to make along the same lines as OpenAI? So the point here is that now there's talk of a collective license.

Audience question: Question: the fair market doesn't have to do with copyright? It's a different legal...

Well, go back to our four fair use factors. The last factor, number four, is the effect of the challenged use on the market for or the value of the work. So if there's market harm, that will undercut the fair use defense. Okay?

Anyway, some people are talking about having a collective license. This is a very popular idea in Europe, and we'll see what happens with that. I think Congress would have to do it, and I don't think Congress is planning to do that anytime soon. There are incredible practical problems with putting that together, but whatever.

Anyway, I'm trying to give you a kind of back and forth. The back and forth is: hey, the purpose of copyright in the first place is to promote the progress of science, that is to say, knowledge, and the large language models and other models are basically doing things that advance the progress of knowledge, and so maybe that should be a reason why it's supposed to be okay. And again, a lot turns on the extent to which courts accept the idea that when you're doing the training you're not trying to exploit the expression, you're just trying to deconstruct the thing. An analogy that I've sometimes suggested is that if you have a little construction made out of Lego blocks, and then you take the Lego blocks apart, and then you make a new construction, it is not a copy of the first thing. So if courts buy that kind of analogy, that tips in the direction of fair use. You see what I'm saying?

Expression is a complicated word, but copyrighted works contain a lot of unprotectable stuff. I would give you
this, but I think I'm just going to say there's some case law where you can extract the really valuable stuff and it's still not copyright infringement.

This is a slide that I think I'll probably end on, which is that you guys, to the extent that you're doing training, may want to be thinking about how these lawsuits potentially affect the research community that is doing this kind of work. And the answer is that none of the cases right now involve nonprofit or research-oriented uses. But commercial versus non-commercial is a big deal, right? And favored uses include scholarship, research, and teaching. So those things point in the direction that it's more likely fair use if you do it than if it's being done for commercial purposes. And again, to the extent that it's unlikely to harm markets for the works' expression, that will probably favor you, because there's also less likely to be market harm from your use as opposed to what OpenAI or Meta do. But all of the cases, except Ross, involve big tech defendants, and there's kind of a thing of techlash right now. And I'll tell you, none of them basically wants to push for a distinction between their for-profit stuff and the nonprofit stuff. So if push comes to shove and the plaintiffs are able to get a really broad ruling that training data must be licensed or it is unfair and infringing, that broad interpretation could end up harming your community. And so we should stay in touch, because one of the things that courts allow affected parties who are not in the lawsuit to do is to file amicus briefs, to file other kinds of documents that basically say: don't overdo it, right? Don't give a really broad ruling here where a narrow ruling would be appropriate. But this is really important to realize: even though you think, "I'm a researcher, nobody cares about me," well, guess what, if it's the recording industry,
they'll go after anybody. Okay? Those guys are just good at litigation. Okay, so I see I'm actually out of time. I do have some additional slides; you can see them, they're pretty much self-explanatory. And I probably can take at least one or two questions.

So I'm curious: you mentioned a lot of the factors that go into it. Do political factors, like, you know, if there are rulings against big tech the US might fall behind, do these enter into any of the considerations at any level?

Well, you can't say it quite that way. Okay? But certainly it is in the ether, and certainly when it comes to testimony in Congress (there have been several hearings about generative AI IP issues, and the Copyright Office has held a number of listening sessions and the like), that's a place where you sort of say: look, Israel and Japan have extremely broad interpretations of text and data mining as something that's good for society. The Israeli Ministry of Justice issued an opinion that, under Israeli fair use law, using copyrighted works for training data is okay. And the European Union has a text and data mining exception, one that's non-waivable by contract, for nonprofit researchers, and another for for-profit type entities, but there people can opt out, right? So there's going to be this kind of, like, you don't want the US industry to basically collapse, right? Nobody wants that. But you've got to say it a little more subtly than that. So it'll probably be part of the public benefit argument. Public benefit is not a separate fair use factor, but courts have routinely considered how the outcome of the lawsuit is going to affect the public. And if it benefits the public, that's great; if it doesn't benefit the public, well, that actually is something to take into consideration. So, for example, in the Google v. Oracle case
that the Supreme Court decided back in 2021, what did Google do? It used about 11,500 lines of the Java declarations in the Android platform, and the question was whether that was fair use or not. And the Supreme Court said yes, it was fair use, by a 6-2 majority. And one of the considerations was that the Android platform had promoted creativity and had had a public benefit, because so many people have Android phones and have apps for them. And so that's a public benefit. Okay? So it's part of the penumbra, but you don't say it straight out.

Question: Exhibit J, you mentioned it was withdrawn. Is it common that people would withdraw evidence like this? Because it was weird, the way that it was reproduced. As researchers, it's really hard to memorize data like that and get it out, and we all were thinking: what is OpenAI doing that has gotten it to memorize this, that is so extractable? So do you think there could have been more to the prompts, or, you know...

Yeah, this is one of those things where the Times wanted a clean case, right? They wanted such a clean case that... See, OpenAI and the Times had been in negotiations for months. The Times was asking for more money than OpenAI thought was appropriate, and I know from somebody who was at OpenAI that they thought that they were really close to getting it, and then all of a sudden the lawsuit gets brought. So they felt, like, hoodwinked, right? They felt like they had been kind of lied to and stuff like that. So you can see the Times lawsuit as just a ploy to get a higher license fee, right? So that case may settle. And there's a case, Getty Images against Stability; that case looked like a pretty strong case to me, but I haven't heard hide nor hair of it lately, so that one might have settled too. And, you know, Getty might have been willing to do it. I think you were next, and I don't... Okay,
we have a hard stop. There was actually two minutes off; the clock is wrong, it's 3:03. Oh, okay. I'm sorry, is that okay? Did you want to ask a question? Oh, you can go ahead.

I was wondering: you said before that OpenAI was training on some text that was, like, allegedly stolen or not put online legally. But I wonder if there's a difference if I'm training on a New York Times subscription that I purchased. Because there's a difference between, like, "I posted this on Reddit and I wasn't paid for it," "this is paid content," versus "I bought a subscription and I was told that I could learn things from this website using the subscription." Have you seen...

Yeah. Again, part of what all of these lawsuits are really going to try to do is kind of put a little black hat on one of the parties, right? So either the black hat is on the Times in the case, for essentially being manipulative about the license negotiations, or, of course, the Times put the black hat on OpenAI through Exhibit J. So a lot of it's maneuvering around. We don't know at this point what's going to happen. But I think several of the lawsuits involve pirated books, and that kind of makes it look worse for the defendants. Now, I have a colleague who actually wrote a paper that said even if you train on, or do text and data mining on, infringing stuff, if you're just extracting unprotectable stuff it shouldn't be copyright infringement. But again, it will depend on the judges. I wrote a paper, actually, called "Fair Use Defenses in Disruptive Technology Cases," that you can find on the internet if you want, and I trace how courts have looked at the market impact of new technologies that are disruptive, right? There have been a series of them: photocopy machines, tape recording machines, VCR machines. Lots of lawsuits involving these kinds of technologies. So it's not anything new that the copyright industries would freak out.

So suppose that the AI companies were
to settle these suits. Would they be able to extract an agreement that they can't be sued again for the same infringements?

No, of course not. No, there's no incentive to settle, because the next one will come out of the woodwork, right? And so far as I know, none of the cases has settled; there's no public announcement of settlement. But you can imagine the Times settling, and that one would just go away. That might have some spillover effects for fair use defenses in other cases, because there are several other lawsuits involving news sites. So again, there's kind of like: who's the next plaintiff going to show up if I settle with this one? I think I really am supposed to stop. You tell me. Okay.

You should take my question. Okay? No? Okay. The question is, when you said about public benefit, that if there is a public benefit, that helps...

Yes, absolutely.

So it reminds me of the whole public broadcasting, you know, same thing with... that at least it forced them to put some of their money into PBS. And I wonder if there's an analogy to that. Well, here you'd need some sort of regulatory body [inaudible].

So there is a theory of what copyright is about, which is that it's about the benefit to the public, right? And so there are Supreme Court statements over and over again that the primary purpose of copyright is to promote the public good, right? To create enough incentives for authors to author and for publishers to disseminate. But the primary goal is not to reward those entities but to promote the public good and to promote knowledge and the dissemination of knowledge, right? And so to the extent that you really buy into it, as I have in my career, that that's really the fundamental purpose of copyright, then you want to have the generative AI training happen, because that advances knowledge and enables a lot of other knowledge to be created, and, you know, it helps people. And so there will be a lot of focus, if the things get up to an
appellate court, I think, on public benefit. Also, to the extent that any of these systems actually are held to be either direct or indirect infringers of copyright, then anyone who uses those technologies to generate things that came from the training data, they could be infringers too. And it was one of the things that was important in the Sony Betamax case back in the 1980s, when it was pending before the US Supreme Court, that by the time the case got to the Supreme Court, five million American households actually had a Sony Betamax machine in their living rooms. And so was the US Supreme Court going to say that all five million of those households, each of which might have, let's say, four people in it, right, that's 20 million people, are you going to say they're all infringers? And I think the Court's going to have a hard time putting its mouth around the idea that 180 million people who are using OpenAI are all infringers too. Anyway, I think I really do have to stop.

[Music] [Applause]

We have a break for five minutes.

[Music]

2024-10-24
