AI @ IA : Research in the Age of Artificial Intelligence — Internet Archive's Annual Celebration


Hello everyone, would you please take your seats? The program is about to begin. Welcome, everyone, and thank you for being here. Let's get started and hear first from David McRaney, a bestselling author, a top-100 podcaster, and friend to the Internet Archive. David, please take it away.

For those of you who don't know me, my name is David McRaney, and I really wish I could be there with you in that gorgeous room amongst those somewhat creepy statues you have there. And that's sort of the wondrous thing about technology, isn't it? It extends what human beings can do. In part, that's what we're going to talk about here tonight. I'm a science journalist, a lecturer, a podcaster. I write books about things, and what I mostly cover in research is what human beings can and can't do. Writers like myself sometimes like to think we're outside of technology because we can do what we do with a pen and a piece of paper. But a pen and a piece of paper are technology. Books are technology, and many of these books were written with a computer, emailed back and forth to an editor, researched with a search engine. And the truth is this technology has been extending what we can do all the way back to the printing press, and way before that. And I have to tell you that as soon as you adopt some new piece of technology, a newer piece comes along. Technology changes the world, then the world uses technology to change technology, and then that new technology changes the world some more, and it just keeps... it's a good idea for a podcast, so I'm going to take a digital note on this. So yeah, technology: it's going to change things, some of it for the worse, a lot of it for the better. And where it changes things for the better, it's going to expand the limited capabilities of human beings. It's going to extend the reach of those capabilities, both in speed and scope. It's about a newfound freedom of mind and time, and democratizing that freedom so everyone has access to it.
Tonight we're going to look at how technology and the Internet Archive are extending the capacities of research so that we can see more, so that we can understand more, so that we can access more. Tonight's program is about the past, present, and future of the Internet Archive, artificial intelligence, and that dream of universal access to all knowledge. So thank you very much for inviting me to do this, and I do really wish that I could be there to enjoy it with you. So have fun.

Thank you so much, David, and again, welcome everyone. For those of you who haven't noticed yet, I am your AI-generated host. Also, if you haven't picked up on why I'm a talking bust, we're doing a sort of antiquity-to-the-age-of-AI design theme. I also apologize if I appear quite uncanny, but I'm guessing that many of you are going to be seeing a lot more AI-generated everything in the coming year, so by the end of this event hopefully you're a bit more prepared for that. I was made by some wonderful artists and volunteers, and I'm here because I founded what's considered to be the first successful public subscription library in America, which still exists as a research library today. And tonight is precisely about research libraries like the Internet Archive in the age of AI. So let's start at the beginning of it all by hearing first from the Internet Archive founder. Everyone, please welcome our very own digital librarian, Mr. Brewster Kahle.

Welcome, thank you, and thank you, AI Ben. So welcome to the Internet Archive's 2023 annual event. I'm really glad that you're here.

OK, so we promised a little old-to-new. I can't quite go back as far as Ben, but I can go back to 1980. I was a student at MIT at the Artificial Intelligence Lab, and we were dreaming of building machines. OK, I had hair. It wasn't great hair. There we were, building machines. We wanted to build machines that could think. Danny Hillis said we wanted to build a machine that
would be proud of us. These were heady days. We were trying to build these things, though, and we were data starved. Even the supercomputer that I helped build in that time held all of 32 megabytes of memory. The disk pack, which was the size of a bar, was 5 gigabytes. We were a long way from making the global brain that we were dreaming of. We all knew what was coming, right? We all sort of understood the graphs and the like. But I thought a role that I could play was to try to build the content, if you will, the data sets: collect the materials. So back then I collected all the phone books of the Boston area just before they were going to go to the printer. I collected the Boston census, which had, you know, how many people were where, but underneath that you could go and deduce how all of the streets linked together. You could make a street map. And reimagining what these databases could be used for, that would be something audacious. We wanted to make a direction assistant. We wanted to make it so that you could use your phone to call up a machine and it would answer: "Hello, this is Thinking Machines Direction Assistance. Can I help you? Tell me where you are and where you want to go, and I will tell you how to drive there." OK, this is state of the art. We didn't have cell phones then. We barely had modems. And this is what we were able to do: describe how people should drive from one place to another in Boston. OK, truth be told, it didn't work all that well. But the future, where machines augmented with data, interacting with people, was going to help us in new and different ways... What I didn't realize then, and what I think is so important, is that you actually have to have the data in your hands to be able to use it and reuse it in new and different ways, ways that were not imagined by the original publishers of, say, the phone book or the
census. That was going to be key: once the data was in our library, we could do new things with it. So we started building and building and building. I went off to build the digital library, to try to create these new, unimagined, and fantastical services that we knew would start to come out. Roll time forward 40 years, and the Internet Archive is maturing. Starting with web pages, which we started collecting in 1996, we now have hundreds of billions of pages from hundreds of millions of websites.

Then we started collecting television in the year 2000: Russian, Chinese, Japanese, Iraqi, Al Jazeera, BBC, CNN, NBC, 24 hours a day, to be able to have a record of television news, because we knew it was going to be important. We started collecting in the year 2000, and we made a small service in 2001, but it was not until 2009 that we started to actually build the service on top of that. Hundreds of thousands of pieces of software are now archived, and 7 million digitized books. We digitize about a million books a year, which is excellent. Music, software, all these sorts of things. It's a service that's used by about two million people a day who go to the website. There are about five million people who use the Internet Archive every day and don't even know they're using the Internet Archive, but we're there for them. And the idea that this has turned into the 200th to 300th most popular website, I think, is kind of a testament to not only this group but also the world that actually wants to see older materials.

The Internet Archive is weaving data and machines and empowering people to pull off some amazing feats. I found that a way of describing the Internet Archive is through Wikipedia. Everybody knows Wikipedia. It's kind of awesome, one of the marvels of modern times. But it's written by real people who need libraries, and there are also people reading these who want
to go deeper. So we've been working with Wikipedia, running over Wikipedia for the last 15 years, collecting any new URL that's referred to by any Wikipedian and archiving it. We knew it would be important at some point. Then, about six or seven years ago, we made another robot that went and fixed broken links in Wikipedia. It turns out that links rot after about 100 days, so a lot of them were gone. As of a couple of years ago, we had fixed 11 million broken links, working now.

For the last... just keep going, keep going. I just heard today we crossed over the 300th Wikipedia language edition that is now getting its broken links fixed. So we're now at 300 as of today, which is kind of great. And another milestone from today is that we just fixed our 19 millionth broken link in Wikipedia. [Applause] Yes. What we wanted to do is go further than that as well, to make it so people could get to the books that are referenced in Wikipedia. So we basically crawled Wikipedia again, tried to find all of the books that were referenced, acquired those books, digitized those books, and tried to weave those links back into Wikipedia. As of now, we've got 1 million links in these Wikipedias to books, and they often open right to the right page.

So the idea of going deeper, and not being just sort of this AI Skynet thing that's going to, you know, detach from people, but being woven in with people, has been, I think, the great lesson of the last 40 years for me, and it doesn't seem like it's going to stop. Just as people are now depending on online resources for accurate information, libraries are more necessary than ever. Libraries have been coming under attack, however. So, OK, a lot of good news, but right now it's actually a little bit scary to be a library, and it's not just us. Fortunately, people are starting to pay attention
pay attention  to the plight of libraries and for the first time   in several years they published our words in a  real mainstream paper the guardian about what uh   is going on and the forces that are now aligned  and attacking libraries we've probably heard a   lot about the book Bannings that are going on and  Me by many politicians and promot Ed and actually   happening and threatening libraries but you  probably don't know also know that there's been   large numbers of defunding of libraries that are  going on in such a way that they're just getting   stripped down so the legislatures are now starting  in many places aligning against libraries in many   places in the United States but the thing that I  think is under appreciated is what corporations   are doing through Draconian licensing terms  that make it so that libraries don't actually   own anything digital at all that they can only get  streaming access to these things so if you go and   borrow an ebook say from the San Francisco Public  Library you're not actually borrowing their ebook   you're getting passed through to a thirdparty  database uh that's driven by the Publishers   so they can surveil every page turn and they can  change that book or delete it at any time this is   just not that great right so we're having these  kinds of approaches and the internet archive um   when it uh uh went and digitized its Holdings  so that it could lend things out um we're sued   by the uh the Publishers because it wasn't part  of their streaming Vision um to go and have all   the 20th century available they just wanted to  have just their greatest hits available through   their uh their their service and the big surprise  to me is that the Judiciary has now sided with   the Publishers not just in our case but in other  cases that have been brought against uh libraries   so we've got a problem out there people need  libraries more than ever um but we have uh a   set of forces that are making libraries harder  
and harder to happen. So we have to do something more about it. And it was really great that people came to our aid: when we needed support, people came and protested and helped the Internet Archive to find support and to propagate the message, just in time for another lawsuit against the Internet Archive. This one was brought by the RIAA with their major record labels against us for having the audacity to make 78 rpm records available, which we had been doing for 15 years. They didn't say, "Oh, you should take these down," and then we refused. They just hit us with a lawsuit. So we've basically got this on us, but it's coming at libraries from many different directions. So we need to stand behind libraries more than ever.

And I'd like to highlight somebody that's doing an extra amount of that. We had our protest on the steps of this building to show that we were in support of libraries, and our San Francisco city supervisor said she was sorry that she couldn't come, but that there was another thing she might be able to do to help. She said that she could write a resolution, and maybe, just maybe, the other supervisors would support the idea of supporting libraries, unlike what we're seeing in a lot of the country, where they're taking away their funding and not supporting them. And it happened.

The resolution was written and was passed unanimously. [Music] And for taking this stand, I would like to give our annual Internet Archive Hero Award of 2023 to Connie Chan, the city supervisor for our district in San Francisco. Please welcome Connie Chan to the stage. [Applause]

Please say a few words, if you would.

Thank you and good evening. It's really a privilege to be here celebrating with you, celebrating Internet Archive. I am a first generation immigrant. I was born in Hong
I was born in Hong   Kong and grew up in Taiwan. I came here to San  Francisco's Chinatown when I was 13 years old and  

when I arrived I didn't really speak any English.  So I went to this place called Chinatown library   and Chinatown library had Chinese books. It was  amazing because they were free and there were a   lot of them and in a foreign country being able  to read books for free in my mother tongue it   was amazing and comforting but at the same time  it took me another step. I also was able to read  

English books and other books and continue on  and on my education throughout time. It's the   reason why I know libraries to people like  me - first generation immigrants, low income   immigrants - at that time with my single mother  and my brother, I knew that living in Chinatown   and having access - free access - to information  was part of the critical part of my education   and to higher education and I know that I was not  alone. And it's true I think that many of you know   that we are not alone because we needed library.  Library was probably by far the best system that   United States had come up with. So when I learned  about Internet Archive I was like - what is this   wonderful thing? I don't understand it because  it's beyond me - it's not a physical library   but an online library. And I learn more and more  about it and I have a wonderful tour. Thank you to  

Brewster. But here I'm not an engineer, clearly.  I do not understand internet. In fact, I do not   understand the technology, very much of it, but  here what I understand when I learn about Internet   Archive it is a gem - it's a hidden gem that we  can see for few things that I thought it's very   critical to humanity. First is freedom, diversity  and truth. It's freedom of information, it's   diversity of information, but most importantly  it's the access to truth and that is what we   need and here we are. I really believe we're not  just fighting for libraries and access to freedom   and information, we're really fighting for our  humanity. It is for our humanity. Humanity to  

exist. So thank you for this award. I'm grateful, but really this award belongs to Brewster and all the people that have been supporting Internet Archive. I really hope that you will continue to fight with us and stand with Internet Archive. This fight is worth it! Thank you.

OK, we need more politicians like Chan. Tonight you're going to hear how we are strong and we're growing, using the tools of AI to create better libraries and better services that benefit us all. This can be the research library's day: the day we have been building for, the day that we've been collecting for, the day where the collective works of humankind can be more relevant to more people's lives. I thank you very much for coming. For the rest of the show, you'll see some videos, but you'll also see some of the real work that's going on using the Internet Archive's research collections to do new and different things and also help build the collections. Thank you very much. [Applause]

We use web archiving every day to save evidence for investigation. We believe this is crucial for journalism. The Internet Archive is indispensable in creating our podcast. A lot of old websites, where you can't find them anywhere, you can find them on the Wayback Machine. Internet Archive is my favorite way to teach history: showing real-life footage of actual events that happened, having them listen to sound bites. Thank you, Internet Archive, for contributing to and enabling my creative practice as an artist, as a video game developer, as a musician, and all the other ways you've enriched the world. Giving me free access to a huge collection of books from the past, the Internet Archive allows me to discover new fields to explore and study. It's so fun and helpful to go back into the Internet Archive and find all of the graphics and websites and images of my earlier projects that I actually personally did not save. Thank the Archive for helping me with my research,
specifically in the topics of literature on cultural and social preservation. Because I'm from Pakistan, most of the catalogues that we have from the British Indian era are locked away. I'm very grateful to the [inaudible]. Thank you. The Wayback Machine is a tool we frequently use here at Newtral to look for deleted sites or online posts that were shared on social media. This allows us to have [inaudible] to show who and what contributed to the spread of misleading or false claims. Internet Archive is the reason this book exists. We use the Wayback Machine to do a fact check in our daily work. The Internet Archive enabled our fact-checking team at VERA Files to find a record at the Senate of the Philippines website showing that our now-president lied about graduating from the University of Oxford. Thanks to Internet Archive, as a writer and researcher I can keep my texts, audios, and videos together. I use the Internet Archive to investigate human rights violations. For the past 23 years, the Internet Archive has made it possible for me to upload audio field recordings I have made at Rainbow Gatherings throughout the world. Thank you, Internet Archive, for enabling our fact-seeking work. Whenever we want to trace the digital footprints for any investigation, the Wayback Machine is the most [inaudible]. I really couldn't write The Washington Post Fact Checker without the Wayback Machine. [inaudible] using the archive has always been and will continue to be great, based in the facts. Thank you, Internet Archive, for enabling research [inaudible]. Thank you, Internet Archive. Thank you, Internet Archive. Thank you, Internet Archive. I'm very grateful for the Internet Archive for existing. Thank you. [Music]

Wow, it's so rewarding to see how people are making use of the archive to create and do great research. But how do we make sure that all of those patrons can easily discover and use the materials in the archive that they need? And how does AI help? Hello everyone, I'm Dre Camy and
I'm excited to share with you some projects we've been working on that use AI to enable our librarians to make our materials more discoverable and easy to use at Internet Archive scale. In fact, AI and machine learning have been a core part of our digitization pipeline for a few years now. When a book is digitized, all we have is a bunch of photographs of pages. These photographs are great for humans, who can easily make sense of them, but to make them searchable and discoverable we need to make them make sense to a computer. First, we even have to tell it where the pages are, because these photos include things we don't want, like the scanning bed. Originally, scanner operators had to tediously and manually crop each of these images to correctly identify the page boundaries. In 2021, we trained a custom machine learning model on all of those manual page croppings from the years before to automatically suggest page boundaries to the scanner operators. This allowed them to double their rate of processing and made it possible for us to digitize even more books.

Great. So now that we've found where the pages are, we want the computer to understand the words on these pages. For this we use Tesseract, an open-source, machine-learning-based tool, to convert the images into text a machine can easily understand. It's this process that makes it possible for our books to be searchable, accessible to those with print disabilities through features like read aloud, and available for bulk research, cross-referencing, and text analysis. Since beginning to use Tesseract in 2021, we've made over 14 million books, documents, microfiche records, you name it, discoverable and accessible, and in over 100 languages.

But since last year, the definition of the term AI has shifted to mean something a little different, and as more capabilities like ChatGPT and large language models have been made available, we've been finding many new opportunities to allow our librarians to
process more materials than ever, allowing us to tackle projects we previously couldn't in order to help improve discoverability and ease of use. A key part of material discoverability is good metadata. Remember, a digitized book is just a bunch of photographs. We need good metadata to know things like the title, author, history, and subjects of the book so that we can correctly connect patron searches to those books. And for some materials, even despite having the book text, metadata can be difficult to source, resulting in books that are a mystery to the computer and thus can be very difficult for our patrons to find by searching. To help tackle this problem, this year we've been piloting the Internet Archive metadata extractor, a tool that reads the book text that we talked about earlier from the front of the book and automatically extracts some key metadata elements. With this extra information, our librarians and metadata staff can match the digitized book to other full catalog records and solve these mystery books.

And there are a lot of mystery materials in our catalog. We currently have over 300,000 of these mystery books, and that number continues to grow. We also used this tool in a project this year in partnership with the University of Toronto to digitize over 23,000 Canadian government documents. These documents were unlinked to catalog records and so also had no metadata. Labeling collections of this scale by hand was unfeasible, but the new AI tooling allows librarians to make these previously unfeasible projects feasible.

We've also been using AI to help make our materials easier to use for our patrons. For example, our serials metadata team, which works with digitized magazines and newspapers from the 20th century, has always worked to research and add descriptions to each of our periodicals. This is time-consuming, taking an average of around 40 minutes to do that research and then write a description,
and there are over 18,000 periodicals that need a description, so this is no small task. This year the team began experimenting with using AI to help in the description-writing process. Given metadata about a periodical, ChatGPT is asked to generate a description. Here we use ChatGPT's prior knowledge about these periodicals, using it almost like a research assistant. This description is then vetted and edited by metadata staff and finally uploaded back to the archive, where it can help our patrons find the things they need. [Applause] With AI assistance, writing descriptions has gone down from 40 minutes to just under 10 minutes.

Another way we're using AI to improve the patron and researcher experience is by extracting table of contents data from books. The diverse structure of tables of contents across different books has made automated extraction difficult in the past. However, with AI, a new process has been developed which initially identifies the table of contents using traditional programming and then employs OCR and ChatGPT to extract the table in a structured format. This data can then be used in the book reader UI to help people navigate the book, and inside Open Library to help people discover the book.

So that's a lot of projects [Applause]... that's a lot of projects we've been able to make use of AI with. Because of AI, we've been able to create new tools to streamline the workflows of our librarians and metadata staff and make our materials easier to discover and work with for patrons and researchers. And those are just some of the many projects and experiments at the archive using AI right now. Other projects include everything from news summarization to the ability to talk to and ask questions of our materials, to AI-enabled search, to citation parsing, to you name it. And with new AI capabilities being announced and made available at a breakneck rate, new ideas and projects are constantly being added. I'd like to
give a huge thank you to everyone who worked on these projects, gave me information, and kindly let me present all of their wonderful work on their behalf.

Please now join me in welcoming another colleague of mine from the Internet Archive, Alexis Rossi, to talk about [Applause] what kinds of research can be made possible when you aggregate these artifacts. [Applause]

Hello everybody. So the work Drey just described is helping us make sure every artifact in this library is well described and easy to find. That work makes it easier for researchers to find what they're looking for, and we've seen so many great projects using the resources in the Internet Archive. Helen Nde and Laura Gibbs use books from our library to study African folktales and share them with the public. Laura even wrote an entire book showing people all of the tales that she found here in the Internet Archive. We see news stories come out pretty much every day that use the Wayback Machine as a resource. These are the outlets that used the Wayback Machine just this year for their reporting and fact-checking. Journalists like Philip Bump from The Washington Post use aggregated data from our TV archive to report about the media bubbles that we all live in.

Now, libraries build collections to facilitate research. Sometimes we can anticipate the types of research that people will want to do. People have been using books one by one for thousands of years to learn. Other times, new uses emerge that we didn't anticipate, and AI is showing us what some of those uses might be. Let me tell you a story about why it's so important to have these large collections of digital materials. Probably everybody here has been surfing the web, and you find a page in German, say, and Chrome pops up and says, "Hey, do you want that in English?" You click yes, suddenly it's in English, and you can read it. That's the magic of machine translation. So how do you teach a computer to
translate between languages? Essentially, you provide the computer with millions of sentence pairs and the computer teaches itself. That's the artificial intelligence at work. A sentence pair is the same sentence represented in two different languages. It works like the Rosetta Stone. So the more sentence pairs you provide, the better the translation will be. For languages with lots of data, like German, French, and Spanish, the translations are pretty good. But when you have a language where there's less data available, less data equals worse translations. That means that a language with fewer speakers is less accessible online, because our technology hasn't learned yet how to deal with it.

So a few years ago, a group of European researchers from the University of Edinburgh and other universities, funded by the EU, came to us asking for web content in European languages, including these underrepresented languages, so that they could try to make better translation models. Web pages like the ones stored in the Wayback Machine are a great data source for this. Sites in the same... sorry, in different languages, they give you that Rosetta Stone situation, right? Same content, different languages. So we put together a set of data for them, and then the researchers went and did the hard part. They figured out which pages were translations of each other, then they matched up the sentences, got rid of all of the noise in the data and filtered it, and then they came out with open-source data sets of these sentence pairs. Now, for some languages this wasn't a big deal. German already has lots of stuff, so it wasn't that big of an addition. But for other languages the difference was huge. For instance, they more than doubled the number of sentences for Latvian and quadrupled the number of sentences for Romanian. Exactly. That allowed them to drastically improve the quality of these translations. Now, this might seem kind of
academic, and like, "Cool, why do I care?" But this has come back around full circle to benefit the public. It turns out that the sentence pairs that came out in these open-source data sets are now part of the underpinnings that allow Firefox to translate web pages for you, including in some of those underserved languages.

Yeah. So this open-source data is helping to level the playing field so that a nonprofit open-source browser like Firefox can compete with a corporate behemoth like Chrome. That's amazing. And we are so happy that researchers can use these large data sets from libraries in ways that we never dreamed of. So what else might we be able to do with languages? According to Wikitongues, there are about 7,000 languages spoken today, 3,000 of which are endangered, and only 5% of languages are well represented online. Now, with stats like that, it is understandable that there are concerns that technology is leading to the demise of some of these smaller languages. But stories like the one I just told you show us that technology can also help us make these smaller languages more accessible online. Several years ago we worked with the nonprofit PanLex project and the culture office of Bali to digitize all of the palm leaf manuscripts written in Balinese. Yeah. Balinese locals helped with the translation, and they also transcribed some of these. And if you want to know how underrepresented Balinese is online, we had to modify UTF-8 so that you could see all of the characters on your screen. But now we have these digital seeds for Balinese. Can we use them to increase online access for Balinese speakers?

Exactly. To do work like this, researchers and libraries must be able to collect large amounts of digital information, and the researchers have to be able to access it. They need this data so that they and their machines can learn and help create tools that help us talk to each other.

We live in a world with so much conflict. It's
vital that we preserve languages, yes, but also the cultural artifacts. We need to keep them safe and accessible in our libraries. [Applause] Yes. I will leave it to Quinn and Alyssa from Stanford University to explain why. Everyone, please welcome them to the stage.

Yes, there we go. Hello, and thank you. My name is Quinn Dombrowski and I am digital humanities staff at Stanford, also teaching DLCL 103, Future Text: AI and Literatures, Cultures, and Languages, this quarter with Laura Wittman. I'm a former medieval Slavist and also co-president of the US professional association for digital humanities. Immediately following Russia's full-scale invasion of Ukraine in February 2022, a group of volunteers from across North America and Western Europe came together to found Saving Ukrainian Cultural Heritage Online, or SUCHO, which went on to archive over 50 terabytes of Ukrainian cultural heritage websites. The Internet Archive has been an essential partner in this work from the very beginning, scaling up their web archiving capacity in response to demand and developing new tools to allow volunteers to work faster and more efficiently. In summer 2022, Anna Rakityanskaya, a SUCHO volunteer and a Slavic librarian at Harvard, proposed that SUCHO capture memes from the war. Anna collaborated with my colleague Simon Wiles, a digital humanities developer at Stanford, to develop the SUCHO Meme Wall, which shows off all the memes that our volunteers have collected, translated, and created rich annotations for by hand. We've had people approach us asking about whether AI could play a meaningful role in SUCHO, and we've always said no, because we wanted this to be handled with extreme care and accuracy, especially when it's a task that we know will be a meaningful way for people to come together and help when they would otherwise sit paralyzed, alone, and doomscrolling the news. We shared our diverse expertise, technological, linguistic, cultural, and
learned from one another, and then taught the next round of volunteers. We've still got a long way to go for machine interpretability of a lot of memes. So let's take this one as an example. It's my nine-year-old's favorite, and I imagined it'd be pretty easy for DALL-E 3 to interpret. We start with Operation Z, which is the Russian name for their war, and in the second panel, Control-Z: the Ukrainians have deleted the warship by sinking it. What we got instead was the guess that the ship had reversed course, undone some previous action, when the ship itself had been undone. Especially in a world where so much AI training data is created under exploitative conditions, we made the choice to support and empower people who wanted to help preserve, protect, and showcase Ukrainian culture. And in doing so we have created, to our knowledge, the largest set of real-world memes with this kind of extensive annotation about templates, people, and events, as well as the transcription and translation of Ukrainian. If there's a future for AI-powered meme collection and annotation, it might start with this data set. But these memes mean a lot more than just data, and for that I will turn it over to my friend and colleague Alyssa Burker.

Hello, hello. My name is Alyssa Burker. I am a student at Stanford and I'm currently teaching a Ukrainian language course. There is no relief during the war, but there is always a way to feel closer and more connected to others. While so many Ukrainians have been separated from their families and loved ones, one of the most essential ways to connect throughout the war has been through memes. There has never been a war where we have had as much access to real-time coverage as we have during Russia's full-scale invasion of Ukraine. Today, thanks to SUCHO and the Internet Archive, we have the ability not only to document Russia's war crimes, which is absolutely necessary, but to save vital elements of Ukrainian culture, perhaps none of
which have been as vital for the survival and optimism of the Ukrainian spirit as memes. It is truly impossible to overemphasize the importance of Ukrainian memes for Ukrainian people in this war. Whether Ukrainians are hiding in bomb shelters right now, in occupied territories, or have fled as refugees, they are united with their community through the collective experience of memes. In the famous case of Chornobaivka, Ukrainians used memes to transform Russia's repeated attempts to take control of the Chornobaivka airport in Ukraine into a legendary and hilarious example of Russia's utter failure and incompetence. There is not a single Ukrainian that did not experience relief and laughter from the Chornobaivka memes. Yes, this even includes my 91-year-old grandfather. Russians regularly bomb homes and hospitals, but they also make a concerted effort to bomb museums, schools, universities, and libraries in an attempt to fulfill their stated purpose of obliterating Ukrainian culture. By preserving memes that carry so much history and emotional connection, SUCHO and the Internet Archive actively resist Russia's goal of the complete erasure of Ukrainian culture. When I look through the SUCHO Meme Wall today, I remember each meme as a cultural moment that helped me and millions of other Ukrainians gather strength and emerge from truly the darkest of places. As a Ukrainian American in the United States, these digital assets have given me tremendous insight, understanding, and empathy towards what those in Ukraine are going through today. Despite the most horrific of circumstances, in hiding, in occupied territories, and on the front lines, Ukrainians never give up their fight for their freedom, their culture, and human values, as is exemplified in the saying "Be brave like Ukraine." I am humbled by the opportunity to tell you the story of so many Ukrainians today, and I am overwhelmed, truly, with gratitude for the Internet Archive's
invaluable role in ensuring our people will never be forgotten. Thank you. In the face of the struggles and injustices the Internet Archive is facing today, I want to remind us all to never give up and be brave like Ukraine. [Music] [Applause]

Slava. Quinn, Alyssa, thank you so, so much for sharing this beautiful project. These digital meme collections are not just a source of connection for those alive today but are precious historical artifacts, like World War II posters or letters from the Civil War. Those in the future will be able to comprehend more about this moment in time thanks to them. But there is still so much we have to understand about the world today. Here to talk about a decade's worth of work turning the Internet Archive into a research platform, please welcome the founder of the GDELT Project. Everyone, please welcome him to the stage. [Applause]

Thank you so much. It's truly an honor to be here tonight. So what is the Internet Archive? To most of you, you probably think about the web and the book archive. To me, what I'm so fascinated by is the Television News Archive: 100 channels from 50 countries on five continents in 35 languages over portions of the last 20 years, one of the most incredible archives of visual storytelling of global events. Now, about a decade ago, the founder of the TV News Archive, Roger Macdonald, reached out and said, how can journalists and scholars use this incredible archive to tell the stories of the world, to understand, when we turn on the news, what are we hearing about and how is it framed? So one of our very first collaborations was to map the geography of television. In other words, when I turn on the television, where am I hearing about? We actually made this incredible map. It was like raindrops on a map every time a location was mentioned. Now, this in turn led to something called the TV Explorer, so this idea of taking
closed captioning and allowing you to keyword-search it. So journalists, for example, could ask, how much attention is COVID getting right now? How much attention is inflation getting? All the major events, Ukraine, all these different events across the world: how much attention are they getting, and how are they being framed? And in turn, you think about television, it's not just the spoken word, it's the on-screen text that goes with it. So in the same way that, with books, you take a photograph of a book page and you use technology to turn that into text, we did the same with television to extract all that on-screen text. In one of the earliest examples that we did, we took Donald Trump's tweets and we scanned for them across television news and showed how he was able to drive the cable news agenda from his tweets. Now, in turn, this begs the question of the connection between television and the online world. So here's a clip. This is CNN during COVID, and it just said "from Russia," somewhere, but where? What's the story behind this clip? So we showed how you can take a clip from television and scan the open web for it, and this text you see beneath was the description of the video when it first appeared on the web. So we're able to connect across modalities, and also fact-check them. So this is a fascinating example. We took a known fact check, an existing fact check, and we used new AI tools to scan television news for any reference related to it. And this is really powerful for fact-checkers, to be able to say, where's this narrative gaining traction? So last year, when Russia invaded Ukraine, Mark Graham at the Internet Archive championed this idea of, this is a huge moment, how do we preserve Russian, Belarusian, and Ukrainian television? And that led to this incredible archive of, what were the narratives? How was each
country telling the story at this moment? And so then the Archive came to me and said, well, how do we make this accessible to journalists and scholars? We have this incredible archive; how do we make it accessible? So the first thing that we did was create something called the Visual Explorer. We took each broadcast, and every four seconds we extract one image, and we make a thumbnail grid. And so this is a broadcast. You think about television: it's linear, it just plays, plays, plays. Well, by making it something like this, you can skim television now. So if I want to know, did Vladimir Putin appear anywhere in this broadcast? How much military imagery was shown? How often was the Z shown? At this point, this is early on in Russian television. I can do that. I can scan all of this very rapidly as a human being. So remember, also, sometimes the most powerful AI is AI that amplifies us as human beings, to be able to use our ability to understand, and kind of gets rid of a lot of that grunt work. Now, of course, a lot of television across the world is not closed captioned, so starting early on we applied Google speech-to-text technology to transcribe these in Russian and then used Google Translate to translate them into English. Now again, that's far from perfect, but it allows journalists to go through this and say, what are they saying? How are they spinning these narratives? What are they paying attention to? And that's incredibly powerful. And fast forward to today, we're using a new tool. Google has something called Chirp. It's a large speech model, essentially the new era of speech recognition. It recognizes over 100 languages, but what's most interesting about this, and this whole generation of new tools, is that they're multilingual. So this is an actual broadcast. This is Chinese state television: three languages in
60 seconds. We've got English, Mandarin, and Arabic in 60 seconds in this particular clip, all transcribed right there. This is incredibly powerful. For the first time, we can start studying how multilingual societies tell their stories. It's really incredible what questions we can actually ask now. But of course, what makes television news so powerful is the visual dimension. If you took all of that television and you start looking at it like this, you start looking at all the stories across the world, but look at this, all the visual stories. So how can a machine help us make sense of the visual dimension of all this? We've been exploring how a variety of different AI tools can help us understand visual storytelling: not just the spoken word, not just the on-screen text, but the imagery, the visual metaphors. So one early question was to say, what if we took Russian television for a year and folded it on itself, compared every second to every other second? So we can actually trace clips and see how those clips are being reused, but more interestingly, visual metaphors: things that are not the same but have similar color schemes, similar visual styles. It turns out there are some really fascinating things you can do with visual metaphors. Now, facial recognition is a very scary area, so the way we've been approaching it is, for major public figures, handing it a picture and saying, find others that look like this. So in this case, Tucker Carlson: we knew that he appeared a lot on Russian television, but how much? So we literally took his picture, and we were able to track his appearances across Russian television. This is really powerful, and it documents just how important he was to telling their narratives. But then we took an episode of 60 Minutes, that's a famous Russian show, and what we did is we extracted every face that appeared on there and who co-occurs with whom. Now this is really powerful:
who is telling the story? So we can see complexity here, but we can see Olga here at the center. She's kind of the star of the show. Now, we scaled this up to an entire year of that show, and we can see all these complex dynamics, with her at the center. Now this is really powerful, that we can take these tools. When you think about it, it's just like, well, who's in this image? What we're more interested in is questions like this: how can we use this to understand visual storytelling? Now, computer vision historically was predefined categories, about 30,000 objects and activities that machines could understand. So in the early days of COVID we said, what's different about COVID television coverage compared to pre-COVID? The answer: books everywhere, but not on every channel. And this was really interesting to us. Now this is really powerful, and we used this on Russian television to show military imagery in the early days of the war, and then Russia realized it was losing: less and less and less coverage. And then they felt that they were gaining some ground, so then they started ramping up again. So this is a really powerful way of understanding how governments are portraying, in this particular case, how Russia feels about how it's doing on the battlefield. And this is really powerful, but still, it's a predefined category, limited to what someone else came up with. So there are tools today, and we have a demo of this: type in an English-language description like "soldier in front of a flag," and it will find imagery that matches that. But here comes a really cool part: what about the inverse? So we have a golden retriever detector. This is part of the AI Explorer. It actually scans for golden retrievers on television news. So here's an example of one, and we asked the machine, describe this image. Everything you see there was
written by a machine. This is where we're at today. This is really powerful, the ability to have a machine watch television and tell us about it. But there are a lot of limitations. What you're hearing mostly about generative AI today is the hype, the promise. There are a lot of limits. So, hallucination; you may have heard that term before. We handed it a broadcast about the Chinese spy balloon and asked it to summarize it, and it became a nuclear-capable hypersonic missile aimed at the American homeland. Not quite what you want for a television summary. False transcripts: it said NATO fully praises Putin and says he did a great job. Plagiarized summaries: sometimes you say "summarize this" and it goes out and finds clips from across the web of people saying similar things and glues those together. Now, bias is a really scary thing. For about 60 or 70 years now we've had keyword search. So if you have a collection of biographies and you type in "CEO," you're going to get the biographies that mention the word CEO the most. But nowadays, with semantic search, if you run the same query with a semantic search engine: white men first, minority men second, women last. This is a huge issue that people are not really thinking about; everyone's kind of rushing into the space. If you ask it to make a summary, to re-summarize that, it's even worse. So these are really big issues to think about. Distraction: "summarize this Russian broadcast in English." Midway through, it sees a reference to Rome, gets distracted, and starts summarizing in Italian. And then, this is a really scary thing, machines are not really good at understanding. Which country has supported Ukraine the most? "Russia, because it has delivered the most weapons." These are really scary things that these machines can do. But there are still a lot of powerful things that we can do. We can take a day of COVID coverage and say, make a narrative map of everything that's being said and how
it's interconnected. We can have a machine watch an entire day of Russian television and summarize it moment by moment; everything here was made by a machine. But of course, why summarize? Well, because you want to do something with it. So we had it watch a day of Iranian television and said, find every reference to the nuclear accord and any criticism, and write a point-by-point rebuttal. Digital diplomacy, automated diplomacy. Now again, is this a wonderful future of diplomacy and an amazing thing, or is it a really frightening future that has a lot of danger for society? These are really fascinating questions. Either way, this future is here, but it's our shared future. It's up to us to decide, because again, there's hallucination, all the limitations that go with this, and all the impact on society. Do we really want machines to be writing all this stuff for us? These are huge questions. And finally, I want to give a huge shout-out to Tracey Jaquith. She's the architect of the TV News Archive. I can't see where she is, but give her a huge shout-out too. She created a really neat tool recently: it takes the on-screen text and then summarizes it. You can go to archive.org, the Television News Archive section of it, go there today and you'll actually see this. It's using this on-screen text and doing a live summary of what's being said on television each day. Again, the incredible power of making this content more accessible. And thank you so much. Hopefully this has been an inspiration to you for what's possible today. Thank you so much. [Applause]

All right, well, speaking of narrative maps and big questions: hi everyone, my name is Jamie, from the Internet Archive. By show of hands, who here is just crazy excited about AI? Okay, okay. Is anyone kind of uneasy about AI, not sure where it's going to take us? Okay, a lot of people, yeah. Okay. Is anyone upset or angry or really afraid, by show of hands, of AI
democratizing catastrophic weapons, superintelligence, that sort of thing? Okay, there's a few of you. Yeah, okay. Well, now, all of you have your own reasons for why you believe things differently. You have your own experiences and your own insights that inform your point of view. So at the Internet Archive we've been pretty sincere about trying to understand what people's different views are and why they disagree, so that we can inform how we as an organization can serve both the public and our mission in this time of intense technological change. So we started a series of hackathons and invited people from different groups who hold different views to come and debate, do research, and have conversations. We invited people ranging from alignment researchers to those who want to accelerate AI, and we asked them to do research and answer questions like, well, what do you actually mean when you're using the term AI, and what kind of risk do you perceive with AI? We also borrowed some deeply wicked ethical questions that were posed by OpenAI this summer, like, should AI ever be used to instill beliefs in people? We came up with over 800 topics of debate about artificial intelligence, just as a start. But things are changing so fast, and these debates are ongoing, and so I really don't think there's any way we can organize enough hackathons to sort out 800 AI debates if we are relying on human beings to do the research and the debating. So to understand the debates that are happening in AI, we turned to AI itself to help us research topics and map debates. So instead of collecting arguments from people showing up at a physical place at a specific time, one of our hackathons, and this is a story of AI working out really well, created an autonomous research agent to crawl through the web and identify claims related to topics on our list. When it identifies a claim that's relevant to us, it summarizes and extracts it. We also
created a prompt-based model that extracts arguments, claims, and evidence from entire artifacts, like open-access scholarly journal articles and websites, and then it filters out all of the irrelevant claims. A secondary model interprets the correctness of those extractions, because of course you've got to look out for the hallucinations. But in the past day alone we extracted over 23,000 claims from 500 references for about $15, and this rate is approximately 12,000 claims per hour with just one machine running. I actually have a background in this kind of analysis work, doing it by hand, and the fastest I've ever seen a human being do this is under 300 claims per hour. We also built a prompt injector, which creates a sequence of prompts with few-shot examples to identify positions that people take on questions about AI, to give us a sort of top-level scaffolding of the debate. Then, using this tool, we generate arguments across ethical, environmental, economic, and nine other categories which support or refute those positions. So let's dive into just one example question: should we regulate AI? To find the high-level general positions people take, we used our prompt injector, tuned with those few-shot examples, to pull data from ChatGPT to give us the high-level positions. So this was machine generated. One position: we should allow technology companies the freedom to develop AI technologies as they see fit, with minimal government interference. Another one: we should impose strict laws on the development and deployment of AI technologies to ensure safety. Here's another position: heavy regulation is unnecessary, as the AI industry is mostly self-regulating, capable of learning from its mistakes and improving. I chose those ones; there's a whole bunch of positions that it chose. Okay. The prompt injector then prompts GPT-4 to identify likely arguments within the 12
different categories, and any additional ones which we may want to add. For example, an economic argument in favor of regulating AI may include: regulating AI has economic benefits, as it prevents unchecked development that could lead to financial harm. High-risk, low-probability threats such as unchecked AI magnify financial risk by increasing the likelihood of rare but costly disasters. These disasters can potentially cascade globally, massively impacting the concentrated tech sectors and therefore disrupting fragile global economies. That was created by AI. An argument against the regulation of AI: there is a concern that government regulation could lead to a convergence of AI technologies towards a one-size-fits-all standard, stifling diversity and reducing the potential benefits of competition and variety in the market. So now we have this framework of the debate, these high-level positions corresponding to these questions, which we can actually start modeling in a graph. From there we can take the claims and evidence that we've extracted from the bottom up and start connecting them with the top-down positions that we had generated. It's a sort of connection of the scaffolding. We're still working on integrating the logical coherence middle part, which is actually a lot of work, but already these maps as-is are incredibly comprehensive. The map that I showed you earlier actually has over 540 of those arguments. But who wants to look at a map besides me? So we decided that we were going to create a tool to make it easier to see. We created a tool that summarizes these claims and visualizes them as an interactive, unpackable piece of paper. So what does this mean? This means instead of you spending potentially hundreds of hours researching the different points of view about AI, you can instead read a paper which shows you arguments from different points of view, and we can actually automate this paper over time as we
continuously process more information. So AI can help us research and understand what we think about AI, the pros and the cons of the technology, not by limiting the conversation to just those people who can show up and be in a room, but instead by combining the collective points of view from people across the web and across the world, by creating new infrastructure that can accommodate it. Our goal is to automate the creation of these maps with these various tools that we've built, and then link evidence to claims, which we will do through techniques like retrieval-augmented generation. Then, of course, give this information away for free, as a library. Like I said before, having built these deliberation graphs by hand for years, I can tell you very definitively that putting together these materials with automation is going to save thousands of research hours for any one reader. And for this project we still have, again, hundreds of debates and much more work to do, since much of the

2023-10-19 03:42

