Stanford Webinar - GPT-3 & Beyond
So, Chris Potts is a professor and also the chair of the Department of Linguistics, and by courtesy also of the Department of Computer Science, and he's a great expert in the area of natural language understanding, so there would not be a better person to hear about this topic from, and we are so grateful that he could make the time. He's also teaching a graduate course, CS224U: Natural Language Understanding, which we transformed into a professional course on the same topic that is starting next week, so if you're interested in learning more, we have some links included down below on your platform; you can check it out. There are so many other things that could be said about Chris: he has a super interesting podcast, and so many interesting research papers and projects he has worked on, so go ahead and learn more about him; you should also have a little link. I think we can kick it off. Chris, thank you so much once again.

Thank you so much, Petra, for the kind words, and welcome to everyone; it's wonderful to be here with you all. I do think that we live in a golden age for natural language understanding. Maybe also a disconcerting age, a weird age, but certainly a time of a lot of innovation and a lot of change. It's an interesting moment for reflection for me, because I started teaching my NLU course at Stanford in 2012, about a decade ago. That feels very recent in my lived experience, but it feels like a completely different age when it comes to NLU, and indeed all of artificial intelligence. I never would have guessed in 2012 that we would have such an amazing array of technologies and scientific innovations, and that we would have these models that are just so performant and also so widely deployed in the world. This is also a story of, again, for better or worse, increasing societal impact, and so that does come together for me into a golden age.

Just to reflect on this a little bit, it's really amazing to think about how many of these models you can get hands-on with right away if you want to. You can download, or use via APIs, models like DALL-E 2 that do incredible text-to-image generation; Stable Diffusion and Midjourney are in that class as well. We also have GitHub Copilot, based on the Codex model, for doing code generation; tons of people derive a lot of value from that system. You.com is at the leading edge, I would say, of search technologies that are changing the search experience and also leading us to new and better results when we search on the web. Whisper is an incredible model from OpenAI that does speech-to-text, and this generic model is better than the best user-customized models that we had 10 years ago. Just astounding; not something I would have predicted, I think. And then of course the star of our show for today is going to be these big language models. GPT-3 is the famous one; you can use it via an API. We also have all these open-source ones that have come out: OPT, BLOOM, GPT-NeoX. These are models that you can download and work with to your heart's content, provided that you have all the computing resources necessary. So, just incredible.

I'm sure you're familiar with this, but let's get it into our common ground here: it's incredible what these models can do. Here's a quick demo of GPT-3. I asked the DaVinci 2 engine: in which year was Stanford University founded, when did it enroll its first students, who is its current president, and what is its mascot?
DaVinci 2 gave a fluent, complete, great answer that is correct on all counts. Just incredible. That was with DaVinci 2; we got a big update to that model in late 2022, DaVinci 3, and here I'm showing you that it reproduces that result exactly. And I do think that DaVinci 3 is a big step forward over the previous engine. Here's an example of that. I like to play adversarial games with this model, and so I asked DaVinci 2: would it be possible to hire a team of tamarins to help me paint my house, assuming I'm willing to pay them in sufficient quantities of fruit to meet minimum wage requirements in California? This is adversarial because I know that these models don't have a really rich understanding of the world we live in; they're often distracted by details like this. And sure enough, DaVinci 2 got confused: yes, it would be possible to hire a team of tamarins to paint your house; you would need to make sure that you're providing them with enough fruit to meet minimum wage requirements, and so forth. Easily distracted. But I tried this again with DaVinci 3, and with the same question it gave a very sensible answer: no, it would not be possible to hire a team of tamarins to help you paint your house. DaVinci 3 was not distracted by my adversarial game. This is not to say that you can't trick DaVinci 3; just go on to Twitter and you'll find examples of that. But again, I do think we're seeing a pretty remarkable rate of progress toward these models being robust and relatively trustworthy.

This is also a story of scientific innovation. That was a brief anecdote, but we're seeing the same level of progress in the tools that we use to measure system performance in the field. I've put this under the heading of "benchmarks saturate faster than ever." This is from a paper from 2021 that I was involved with, Kiela et al. Here's the framework: along the x-axis I have time, going back to the 1990s, and along the y-axis I have a normalized measure of our estimate of human performance; that's the red line, set at zero. MNIST digit recognition, a grand old dataset in the field launched in the 1990s, took about 20 years for us to surpass this estimate of human performance. Switchboard is a similar story: launched in the 90s, this is the speech-to-text problem, and it took about 20 years for us to get up past this red line. ImageNet is newer; it was launched in 2009, and it took about 10 years for us to reach this saturation point. And from here the pace really picks up. SQuAD 1.1 is question answering; that was solved in about three years. The response was SQuAD 2.0; that was solved in less than two years. And then the GLUE benchmark. If you were in the field, you might recall that the GLUE benchmark is a big set of tasks that was meant to stress-test our best models; when it was announced, a lot of us worried that it was just too hard for present-day models. But GLUE was saturated in less than a year. The response was SuperGLUE, meant to be much harder; it too was saturated in less than a year. A remarkable story of progress, undoubtedly; even if you're cynical about this measure of human performance, we are still seeing a rapid increase in the rate of change here. And 2021 was ages ago in the story of AI. I think this same thing carries over into the current era with our largest language models. This is from a really nice post from Jason Wei, where he is assessing emergent abilities in large language models; you see eight of them given here.
Along the x-axis for these plots you have model size, and on the y-axis you have accuracy, and what Jason is showing is that, at a certain point, these really big models just attain the ability to do these really hard tasks. Jason estimates that for 137 tasks, models are showing this kind of emergent ability, and that includes tasks that were explicitly set up to help us stress-test our largest language models. They're just falling away, one by one. Really incredible. Now, we're going to talk a little bit later about the factors that are driving this enormous progress for large language models, but I want to be up front that one of the major factors here is just the raw size of these models. You can see that in Jason's plots; that's where the emergent ability kicks in. Let me put that in context for you. This is a plot from a famous paper that's actually about making models smaller, and what they did is track the increases in model size. Along the x-axis we have time, and it only goes back to 2018, not very long ago.
In 2018, the largest of our models had around 100 million parameters, which seems small by current standards. In late 2019 and early 2020, we start to see a rapid increase in the size of these models, so that by the end of 2020 we have this Megatron model at 8.3 billion parameters. I remember when that came out; it seemed like it must be some kind of typo. I could not fathom that we had a model that was that large. But now, of course, this is kind of on the small side. Soon after that we got an 11-billion-parameter variant of that model, and then GPT-3 came out at 175 billion parameters, and that one too now looks small in comparison to these truly gargantuan Megatron models and the PaLM model from Google, which surpassed 500 billion parameters. I want to emphasize that this has made a complete mockery of the y-axis of this plot: to capture the scale correctly, we would need 5,000 of these slides stacked on top of each other. It still feels weird to say that, but that is the truth; the scale of this is absolutely enormous, and not something I think I would have anticipated way back when we were dealing with those hundred-million-parameter babies. By comparison, they seemed large to me at that point.

So this brings us to our central question. It's a golden age; this is all undoubtedly exciting, and the things I've just described are going to have an impact on your lives, positive and negative, but certainly an impact. But I take it that we are here today because we are researchers, and we would like to participate in this research, and that could leave you with a kind of worried feeling: how can you contribute to NLU in this era of gargantuan models? I've set this up as a kind of flowchart. First question: do you have 50 million dollars and a love of deep learning infrastructure? If the answer is yes, then I would encourage you to go off and build your own large language model; you could change the world in this way. I would also request that you get in touch with me: maybe you could join my research group, and maybe fund my research group. That would be wonderful. But I'm assuming that most of you cannot truthfully answer yes to this question. I'm in the no camp, and on both counts: I am both dramatically short of the funds, and I also don't have a love of deep learning infrastructure.

So, for those of us who have to answer no to this question: how can you contribute? Even if the answer is no, there are tons of things that you can be doing. Topics that are front of mind for me include retrieval-augmented in-context learning; this could mean small models that are performant. You could always contribute to creating better benchmarks; this is a perennial challenge for the field, and maybe the most significant thing you can do is create devices that allow us to accurately measure the performance of our systems. You could also help us solve what I've called the last-mile problem for practical applications. These central developments in AI take us 95 percent of the way toward utility, but that last five percent, actually having a positive impact on people's lives, often requires twice as much development and twice as much innovation across domain experts, people who are good at human-computer interaction, and AI experts. There is just a huge amount that has to be done to realize the potential of these technologies. And then finally, you could think about achieving faithful, human-interpretable explanations of how these models behave.
If we're going to trust these models, we need to understand how they work at a human level. That is supremely challenging, and therefore this is incredibly important work you could be doing. Now, I would love to talk with you about all four of those things and really elaborate on them, but our time is short, so what I've done is select one topic, retrieval-augmented in-context learning, to focus on, because it's intimately connected to this notion of in-context learning, and it's a place where all of us can participate in lots of innovative ways. That's the central plan for the day.

Before I do that, though, I want to help us get more common ground around what I take to be the really central change that's happening as a result of these large language models, and I've put that under the heading of the rise of in-context learning. Again, this is something we're all getting used to; it really marks a genuine paradigm shift, I would say. In-context learning really traces to the GPT-3 paper. There are precedents earlier in the literature, but it was the GPT-3 paper that gave it a thorough initial investigation and showed that it had promise with the earliest GPT models. Here's how this works. We have our big language model, and we prompt it with a bunch of text. For example, this is from that GPT-3 paper: we might prompt the model with a context passage and a title. We might follow that with one or more demonstrations; here, the demonstration is a question and an answer, and the goal of the demonstration is to help the model learn, in context, that is, from the prompt we've given it, what behavior we're trying to elicit from it. Here you might say we're trying to coax the model to do extractive question answering: to find the answer as a substring of the passage we gave it. You might have a few of those, and then finally we have the actual question we want the model to answer. We prompt the model with this prompt, which puts it in some state, and then its generation is taken to be the prediction or response, and that's how we assess its success. The whole idea is that the model can learn in context, that is, from this prompt, what we want it to do. That gives you a sense for how this works; you've probably all prompted language models like this yourself already.

I want to dwell on this for a second, though, because this is a really different thing from what we used to do throughout artificial intelligence. Let me contrast in-context learning with the paradigm of standard supervision. Back in the old days of 2017 or whatever, we would typically set things up like this. Say we wanted to solve a problem like classifying texts according to whether they express nervous anticipation, a complex human emotion. The first step would be to create a dataset of positive and negative examples of that phenomenon, and then we would train a custom-built model to make the binary distinction reflected in the labels. This can be surprisingly powerful, but you can start to see already how it isn't going to scale to the complexity of the human experience: we're going to need separate datasets, and maybe separate models, for optimism and sadness and every other emotion you can think of, and that's just a subset of all the problems we might want our models to solve. For each one, we're going to need data and maybe a custom-built model. The promise of in-context learning is that a single big, frozen language model can serve all of those goals. In this mode, we do that prompting thing I just described: we give the model examples, expressed in flat text, of positive and negative instances, and hope that that's enough for it to learn, in context, the distinction we're trying to establish.
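To make this concrete, here's a minimal sketch, in Python, of what assembling such a few-shot prompt could look like. Everything here is a hypothetical stand-in: the instruction wording, the demonstrations, and the labels are mine, not anything from the GPT-3 paper or the talk.

```python
# A minimal sketch of few-shot in-context learning for the "nervous
# anticipation" classifier described above. The demonstrations, labels,
# and wording are hypothetical; the point is that the entire "system"
# is just flat text handed to a frozen language model.

demonstrations = [
    ("I can't stop checking my phone before the results come out.", "yes"),
    ("The lake was calm and we watched the sunset.", "no"),
]

def build_prompt(new_text: str) -> str:
    """Assemble instructions, demonstrations, and the new example."""
    lines = ["Does the text express nervous anticipation? Answer yes or no.", ""]
    for text, label in demonstrations:
        lines.append(f"Text: {text}")
        lines.append(f"Answer: {label}")
        lines.append("")
    lines.append(f"Text: {new_text}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_prompt("My heart races every time the interviewer pauses.")
# The prompt would then be sent to a frozen model, e.g. via an API call;
# the model's generated continuation ("yes" / "no") is the prediction.
print(prompt)
```

Notice that there is no training step at all here: the "learning" happens entirely inside the model as it conditions on this text.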
This is really, really different. Consider that over here, on the supervised side, the phrase "nervous anticipation" has no special status: the model doesn't really process it; the model is entirely structured to make a binary distinction, and the label "nervous anticipation" is kind of for us. On the right, the model needs to learn essentially the meanings of all of these terms, and our intentions, and figure out how to make these distinctions on new examples, all from a prompt. It's just weird and wild that this works at all. I used to be discouraging about this as an avenue, and now we're seeing it bear so much fruit.

What are the mechanisms behind this? I'm going to identify a few of them for you. The first is certainly the Transformer architecture. This is the basic building block of essentially all the language models I've mentioned so far. We have great coverage of the Transformer in our course, Natural Language Understanding, so I'm going to do this quickly. The Transformer starts with word embeddings and positional encodings. On top of those, we have a bunch of attention mechanisms; these give the name to the famous paper "Attention Is All You Need," which announced the Transformer. Evidently attention is not all you need, because we have these positional encodings at the bottom, and then a bunch of feed-forward layers and regularization steps at the top. But attention really is the beating heart of this model, and it was a dramatic departure from the fancy mechanisms, LSTMs and so forth, that were characteristic of the pre-Transformer era. That's essentially, on the diagram here, the full model. In the course, we have a bunch of materials that help you get hands-on with Transformer representations and also dive deep into the mathematics, so I'm just going to skip past this. I will say that if you dive deep, you're likely to go through the same journey we all go through, where your first question is: how on Earth does this work? This diagram looks very complicated. But then you come to terms with it, and you realize, oh, this is actually a bunch of very simple mechanisms. But then you arrive at a question that is a burning question for all of us: why does this work so well? This remains an open question; a lot of people are working on explaining why this is so effective, and that is certainly an area in which all of us could participate: analytic work, understanding why this is so successful.
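Since attention is that beating heart, here's a stripped-down sketch of scaled dot-product self-attention in NumPy, under the assumption of a single head with random projection matrices. Real Transformers add multiple heads, masking, the feed-forward layers, and the rest of the diagram; this only shows the core mechanism.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of embeddings.

    X:  (n, d) matrix, one row per token (embedding plus positional encoding).
    Wq, Wk, Wv: (d, d_k) projection matrices (learned in a real model).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])     # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
    return weights @ V                          # attention-weighted mixture of values

rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8                            # 5 tokens, toy dimensions
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
print(out.shape)                                # (5, 8): one contextualized vector per token
```

As the "very simple mechanisms" remark suggests, this is a few matrix multiplications and a softmax; why stacking it works so well is the open question.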
The second big innovation here is the realization that what I've called self-supervision is an incredibly powerful mechanism for acquiring rich representations of form and meaning. This is also very strange. In self-supervision, the model's only objective is to learn from co-occurrence patterns in the sequences it's trained on. This is purely distributional learning. Another way to put this is that the model is just learning to assign high probability to attested sequences; that is the fundamental mechanism. We think of these models as generators, but generation is just sampling from the model; that's a kind of secondary or derivative process. The main thing is learning from these co-occurrence patterns. An enlightening thing about the current era is that it's fruitful for these sequences to contain lots of symbols, not just language but computer code, sensor readings, even images, and so forth. Those are all just symbol streams, and the model learns associations among them. The core thing about self-supervision, though, that really contrasts it with the standard supervised paradigm I mentioned before, is that the objective doesn't mention any specific symbols or relations between them; it is entirely about learning these co-occurrence patterns. And from this simple mechanism we get such rich results. That is incredibly empowering, because you need hardly any human effort to train a model with self-supervision; you just need vast quantities of these symbol streams.
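To make the objective itself concrete, here's a toy sketch in which a trivial bigram counter stands in for the Transformer. The "assign high probability to attested sequences" objective is the same; only the model differs. The corpus and model are, obviously, hypothetical miniatures.

```python
import math
from collections import Counter

# Toy corpus of attested sequences; in practice this is web-scale text,
# code, and other symbol streams.
corpus = ["the cat sat", "the cat ran", "the dog sat"]

# "Training" a trivial bigram model: count co-occurrences. A real LM
# replaces these counts with a Transformer, but the objective is the same.
bigrams, unigrams = Counter(), Counter()
for seq in corpus:
    toks = ["<s>"] + seq.split()
    for prev, nxt in zip(toks, toks[1:]):
        bigrams[(prev, nxt)] += 1
        unigrams[prev] += 1

def neg_log_likelihood(seq: str) -> float:
    """Self-supervised objective: -log p(sequence), summed over next-symbol predictions."""
    toks = ["<s>"] + seq.split()
    nll = 0.0
    for prev, nxt in zip(toks, toks[1:]):
        p = bigrams[(prev, nxt)] / unigrams[prev]   # p(next | previous)
        nll -= math.log(p)                           # lower loss = higher probability
    return nll

# Attested sequences get high probability (low loss) under the objective.
print(neg_log_likelihood("the cat sat"))
```

No label, no task, no mention of any specific symbol in the objective itself: just co-occurrence statistics, which is exactly what makes the mechanism so cheap in human effort.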
That ease of training has facilitated the rise of another important mechanism: large-scale pre-training. There are actually two innovations happening here. We see the rise of large-scale pre-training in the earliest work on static word representations, like word2vec and GloVe, and what those teams realized is not only that it's powerful to train on vast quantities of data using just self-supervision, but also that it's empowering to the community to release those parameters: not just data, not just code, but the actual learned representations, for other people to build on. That has been incredible in terms of building effective systems. After those we get ELMo, which was the first model to do this for contextual word representations, truly large language models. Then we get BERT, of course, and GPT, and then finally GPT-3, at a scale that was previously unimagined, and maybe, for me, kind of unimaginable.

A final piece that we should not overlook is the role of human feedback in all of this, and I'm thinking in particular of the OpenAI models. I've given a lot of coverage so far to this mechanism of self-supervision, but we have to acknowledge that our best models are what OpenAI calls the instruct models, and those are trained with way more than just self-supervision. This is a diagram from the ChatGPT blog post. It has a lot of details, but I'm confident that there are really two pieces that are important. First, the language model is fine-tuned on human-level supervision, just making binary distinctions about good generations and bad ones; that's already beyond self-supervision. And then, in a second phase, the model generates outputs, humans rank all of the outputs the model has produced, and that feedback goes into a lightweight reinforcement learning mechanism. In both of those phases we have important human contributions that take us beyond that self-supervision step and kind of reduce the magical feeling of how these models are achieving so much. I'm emphasizing this because I think what we're seeing is a return to a familiar and kind of cynical-sounding story about AI, which is that many of the transformative steps forward are actually on the back of a lot of human effort behind the scenes, expressed at the level of training data. But on the positive side, it is incredible that this human feedback is having such an important impact. Instruct models are best in class in the field, and we have a lot of evidence that that must be because of these human feedback steps, happening at a scale that I assume is astounding: they must have, at OpenAI, large teams of people providing very fine-grained feedback across lots of different domains, with lots of different tasks in mind.

The final piece by way of background is prompting itself. This has been a real journey for all of us. I've described this as step-by-step and chain-of-thought reasoning. To give you a feel for how this is happening, let's imagine that we've posed a question like: can our models reason about negation? That is, if we didn't eat any food, does the model know that we didn't eat any pizza? In the old days of 2021, we were so naive: we would prompt models with just that direct question, like "is it true that if we didn't eat any food, then we didn't eat any pizza?", and we would see what the model said in return. Now, in 2023, we know so much, and we have learned that it can really help to design a prompt that helps the model reason in the intended ways. This is often called step-by-step reasoning. Here's an example of a prompt that was given to me by Omar Khattab. You start by telling the model that this is a logic and common sense reasoning exam; for some reason, that's helpful. Then you give it some specific instructions, and then you use some special markup to give it an example of the kind of reasoning you would like it to follow. After that example comes the actual prompt, and in this context what we essentially ask the model to do is express its own reasoning and then, conditional on what it has produced, create an answer. The eye-opening thing about the current era is that this can be transformatively better. If you wanted to put this poetically, you'd say that these large language models are kind of like alien creatures, and it's taking us some time to figure out how to communicate with them. Together with all that instruct fine-tuning with human supervision, we're converging on prompts like this as the powerful device. And this is exciting to me, because what's really emerging is that this is a very light way of programming an AI system, using only prompts, as opposed to all the deep learning code that we used to have to write, and that's going to be incredibly empowering in terms of system development and experimentation.
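Here's a sketch of what a step-by-step prompt in this spirit could look like. The wording, the markup, and the worked example are illustrative stand-ins of my own, not Omar Khattab's actual prompt.

```python
# Illustrative step-by-step (chain-of-thought) prompt in the spirit of the
# one described above. The instructions, markup, and worked example are
# hypothetical; the key idea is asking the model to write out its reasoning
# before committing to an answer.

COT_PROMPT = """This is a logic and common sense reasoning exam.
Reason step by step, then give a final answer.

---
Question: If we didn't buy any books, did we buy any novels?
Reasoning: Novels are a kind of book. If no books were bought, then
no novels were bought either.
Answer: No.
---
Question: If we didn't eat any food, did we eat any pizza?
Reasoning:"""

# The model's continuation is expected to first produce reasoning and then,
# conditional on that reasoning, an "Answer:" line that we can parse out.
print(COT_PROMPT)
```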
All right, so we have our background in place. I'd like to move to my main topic, retrieval-augmented in-context learning. What you're going to see here is a combination of language models with retriever models, which are themselves, under the hood, large language models as well. Let me start with a bit of the backstory. I think we're all probably vaguely aware at this point that large language models have been revolutionizing search. Again, the star of this is the Transformer, or maybe more specifically its famous spokesmodel, BERT. Right after BERT was announced, around 2018, Google announced that it was incorporating aspects of BERT into its core search technology, and Microsoft made a similar announcement at about the same time. I think those are just two public-facing stories out of many instances of large search technologies having BERT elements incorporated into them in that era. And then, in the current era, we have startups like You.com, which have made large language models pretty central to the entire search experience, in the form of delivering results but also interactive search with conversational agents. That's all exciting, but I am an NLPer at heart, and so for me, in a way, the more exciting direction is the fact that, finally, search is revolutionizing NLP, by helping us bridge the gap into much more relevant, knowledge-intensive tasks.

To give you a feel for how that's happening, let's use question answering as an example. Prior to this work, in NLP we would pose question answering, your QA, in the following way; you saw this already with the GPT-3 example. At test time, we would be given a title and a context passage, and then a question, and the task of the model is to find the answer to that question as a literal substring of the context passage, which was guaranteed by the nature of the dataset. As you can imagine, models are really good at this task, superhuman certainly, but it's also a very rarefied task. This is not a natural form of question answering in the world, and it's certainly unlike the scenario of, for example, doing web search. So the promise of the open formulations of this task is that we're going to connect more directly with the real world. In this formulation, at test time we're just given a question, and the standard strategy is to rely on some kind of retrieval mechanism to find relevant evidence in a large corpus, or maybe even the web, and then we proceed as before. This is a much harder problem, because we don't get the substring guarantee anymore; we're dependent on the retriever to find relevant evidence. But of course it's a much more important task, because this is much more like our experience of searching on the web.

Now, I've kind of biased things already in describing them this way, where I assume we're retrieving a passage, but there is another narrative out there. Let me skip to this. You could call it the "LLMs for everything" approach, and this would be where there's no explicit retriever: you just have a question come in, you have a big opaque model process that question, and out comes an answer. Voila: you hope that the user's information need is met directly. No separate retrieval mechanism, just the language model doing everything. I think this is an incredibly inspiring vision, but we should be aware that there are lots of danger zones here. The first is just efficiency. One of the major factors driving that explosion in model size that I tracked before is that, in this LLMs-for-everything approach, we are asking the model to play the role of both knowledge store and language capability. If we could separate those out, we might get away with smaller models. We have a related problem of updateability. Suppose a fact in the world changes; a document on the web changes, for example. Well, you're going to have to update the parameters of this big opaque model somehow to conform to the change in reality. There are people hard at work on that problem, and it's a very exciting problem, but I think we're a long way from being able to offer guarantees that a change in the world is reflected in the model's behavior, and that plays into all sorts of issues of trustworthiness, explainability of behavior, and so forth. We also have an issue of provenance. Look at the answer at the bottom: is that the correct answer? Should you trust this model? In the standard web search experience, we are typically given some web pages that we can click on to verify, at least at the next level of detail, whether the information is correct. But here we're just given this response, and if the model also generated a provenance string, if it told us where it found the information, we'd be left with the concern that that provenance string was also untrustworthy. This is really breaking a fundamental contract that users expect to have with search technologies, I believe. So those are some things to worry about. There are positives, though, of course: these models are incredibly effective at meeting your information need directly, and they're also outstanding at synthesizing information. If your question can only be answered by 10 different web pages, it's very likely that the language model will still be able to do it without you having to hunt through all those pages. So: exciting, but lots of concerns here.
Here is the alternative: retrieval-augmented approaches. Oh, actually, I can't resist first giving you an example of how important this trustworthiness thing can be. I used to be impressed by DaVinci 2 because it would give a correct answer to the question: are professional baseball players allowed to glue small wings onto their caps? This is a question that I got from a wonderful article by Hector Levesque, where he encourages us to stress-test our models by asking them questions that would seem to run up against any simple distributional or statistical learning model, and really get at whether they have a model of the world. DaVinci 2 gave what looked like a really good Levesque-style answer: there is no rule against it, but it is not common. That seems true. So I was disappointed, I guess, or I'm actually not sure how to feel about this one: I asked DaVinci 3 the same question, and it said no, professional baseball players are not allowed to glue small wings onto their caps; Major League Baseball has strict rules about the appearance of players' uniforms and caps, and any modifications of the caps are not allowed. That also sounds reasonable to me. Is it true? It would help enormously if the model could offer me at least a web page with evidence relevant to these claims. Otherwise I'm simply left wondering, and I think that shows you that we've kind of broken this implicit contract with the user that we expect from search.

So that brings me to my alternative: retrieval-based, or retrieval-augmented, NLP. To give you a sense for this: at the top here I have a standard search box, and I've put in a very complicated question indeed. The first step in this approach is familiar from the LLMs-for-everything one: we're going to encode that query into a dense numerical representation capturing aspects of its form and meaning, and we use a language model for that. The next step is new, though: we are also going to use a language model, maybe the same one we used for the query, to process all of the documents in our document collection, so that each one has some kind of numerical deep learning representation. On the basis of these representations, we can now score documents with respect to queries, just like we would in the good old days of information retrieval. So we can reproduce every aspect of that familiar experience if we want to; we're just doing it now in this very rich semantic space. We get some results back, and we could offer those to the user as ranked results. But we can also go further: we could have another language model, call it a reader or a generator, slurp up those retrieved passages and synthesize them into a single answer, maybe meeting the user's information need directly.
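Here's a minimal sketch of that retrieve-and-score step. The hash-based `encode` function is a deliberately crude stand-in for a frozen language-model encoder, and the documents are made up; the shape of the computation, embed the collection offline, score by dot product, rank, is the point.

```python
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in encoder: in a real system this would be a frozen language
    model (e.g. a BERT-style encoder) producing a dense semantic vector."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))  # deterministic per token
        vec += rng.normal(size=dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = [
    "Stanford University was founded in 1885.",
    "Tamarins are small monkeys native to Central and South America.",
    "Major League Baseball publishes rules on uniforms and caps.",
]
doc_index = np.stack([encode(d) for d in docs])   # offline: embed every document once

query = "when was Stanford founded"
scores = doc_index @ encode(query)                # dot product = relevance score
ranked = np.argsort(-scores)
print([docs[i] for i in ranked])                  # top-ranked passages go to the reader/generator
```

Because the index is just these vectors, updating a changed document means re-encoding that one document, which is the updateability story discussed next.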
So let's check in on how we're doing with respect to our goals. First, efficiency: I won't have time to substantiate this today, but these systems, in terms of parameter counts, can be much smaller than the integrated approach I mentioned before. We also have an easy path to updateability: we have this index here, so as pages change in our document store, we simply use our frozen language model to reprocess and re-represent them, and we can have a pretty good guarantee at this point that information changes will be reflected in the retrieved results. Down here, we're also naturally tracking provenance, because we have all these documents, they're used to deliver the results, and we can have that carry through into the generation. So we've kept that contract with the user. And these models are incredibly effective: across lots of literature, we're seeing that retrieval-augmented approaches are just superior to the fully integrated LLMs-for-everything one. And we've retained the benefit of LLMs-for-everything, because we have this model down here, the reader or generator, that can synthesize information into answers that meet the information need directly. So that's my fundamental pitch.

Now, again, things are changing fast, and even the approach to designing these systems is changing really fast. In the previous era of 2020, we would have these pre-trained components: we have our index and our retriever, maybe we have a language model like a reader or generator, and you might have other pre-trained components, image processing and so forth. So you have all these assets, and the question is how you're going to bring them together into an integrated solution. The standard deep learning answer to that question is to define a bunch of task-specific parameters that are meant to tie together all those components, and then you learn those parameters with respect to some task, and you hope that that has created an effective integrated system. That's the modular vision of deep learning. The truth in practice is that, even for very experienced researchers and system designers, this can often go really wrong, and debugging these systems, and figuring out how to improve them, can be very difficult, because they are so opaque and the scale is so large.

But maybe we're moving out of an era in which we have to do this at all, and that brings us back to in-context learning. The fundamental insight here is that many of these models can, in principle, communicate in natural language. A retriever is abstractly just a device for pulling in text and producing text with scores, and a language model is also a device for pulling in text and producing text with scores. We have already seen, in my basic picture of retrieval-augmented approaches, that we could have the retriever communicate with the language model via retrieved results. Well, what if we just allow that to go in both directions? Now we've got a system that is essentially constructed by prompts that help these models do message passing between them, in potentially very complicated ways. This is an entirely new approach to system design, and I think it's going to have an incredible democratizing effect on who designs these systems and what they're for.

Let me give you a deep sense for just how wide open the design space is here, again to show you how much of this research is still left to be done, even in this golden era. Let's imagine a search context; the question is what course to take. What we're going to do in this new mode is begin a prompt that contains that question, just as before. What we can do next is retrieve a context passage; that'll be like the retrieval-augmented approach that I showed you at the start of this section, and we could just use our retriever for that. But there's more that could be done. What about demonstrations? Let's imagine that we have a little train set of QA pairs that demonstrate for our system what the intended behavior is. Well, we can add those into the prompt, and now we're giving the system a lot of few-shot guidance about how to learn in context. But that's also just the beginning. I might have sampled these training examples randomly from my train set, but I have a retriever, remember, and so what I could do instead is find the demonstrations that are the most similar to the user's question and put those in my prompt, with the expectation that that will help it pick up on the topical coherence and lead to better results.
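Here's a sketch of that demonstration-selection idea: reuse the same similarity scoring to pick the training examples closest to the user's question and splice them into the prompt. The train set, question, and helper names are all hypothetical, and the hash-based encoder is the same stand-in for a frozen language-model encoder as before.

```python
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Same stand-in for a frozen LM encoder as in the earlier sketch."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))
        vec += rng.normal(size=dim)
    n = np.linalg.norm(vec)
    return vec / n if n else vec

# Hypothetical little train set of QA pairs.
train_set = [
    ("Which course covers natural language understanding?", "CS224U"),
    ("Who teaches the NLU course at Stanford?", "Chris Potts"),
    ("What is the mascot of Stanford?", "The Stanford Tree"),
]

def top_k_demonstrations(question: str, k: int = 2):
    """Pick the k train examples most similar to the user's question."""
    q = encode(question)
    sims = [float(encode(tq) @ q) for tq, _ in train_set]
    best = np.argsort(sims)[::-1][:k]
    return [train_set[i] for i in best]

def build_prompt(question: str, passage: str) -> str:
    parts = [f"Context: {passage}", ""]
    for dq, da in top_k_demonstrations(question):   # topically similar few-shot guidance
        parts += [f"Question: {dq}", f"Answer: {da}", ""]
    parts += [f"Question: {question}", "Answer:"]
    return "\n".join(parts)

print(build_prompt("What course should I take to learn about NLU?",
                   "CS224U is Stanford's graduate NLU course."))
```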
But I could go further: I could use my retriever again to find relevant context passages for each one of those demonstrations, to further help the model figure out how to reason in terms of evidence. And that also opens up a huge design space. We could do what we call hindsight retrieval, where for each one of these we use both the question and the answer to find relevant context passages, to really give the model integrated informational packets it can benefit from. And there's lots more we could do with these demonstrations; you're probably starting to see it. We could do some rewriting and so forth, really make sophisticated use of the retriever and the language model, interwoven. We could also think about how we selected this background passage. I was assuming that we would just retrieve the most relevant passage according to our question, but we could also think about rewriting the user's query in terms of the demonstrations that we've constructed, to get a new query that will help the model; that's especially powerful if you have a kind of interactional mode where the demonstrations are actually part of a dialogue history or something like that. And then finally, we could turn our attention to how we're actually generating the answer. I was assuming we would take the top generation from the language model, but we could do much more. We could filter its generations to just those that match a substring of the passage, reproducing some of the old mode of question answering, but now in this completely open formulation; that can be incredibly powerful if you know your model can retrieve good background passages. Those are two simple steps. You could also go all the way to the other extreme and use the full retrieval-augmented generation, or RAG, model, which essentially creates a full probability model that allows us to marginalize out the contribution of passages. That can be incredibly powerful in terms of making maximal use of the capacity of this model to generate text, conditional on all the work that we did up here. I hope that's given you a sense for just how much can happen here.
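And here's that RAG-style marginalization in miniature: the answer's probability is a retrieval-weighted sum over passages. All the numbers here are invented placeholders; in a real system, the retriever supplies the passage probabilities and the generator supplies the answer probabilities.

```python
# RAG-style marginalization: p(answer | question) =
#   sum over passages z of p(z | question) * p(answer | question, z).
# The probabilities below are made-up placeholders; the retriever would
# supply p(z | question) and the generator p(answer | question, z).

retrieval_probs = {"passage_a": 0.6, "passage_b": 0.3, "passage_c": 0.1}

generator_probs = {          # p(answer | question, passage)
    "passage_a": 0.8,
    "passage_b": 0.5,
    "passage_c": 0.1,
}

p_answer = sum(retrieval_probs[z] * generator_probs[z] for z in retrieval_probs)
print(f"p(answer | question) = {p_answer:.2f}")   # 0.6*0.8 + 0.3*0.5 + 0.1*0.1 = 0.64
```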
What we're starting to see, I think, is that there is a new programming mode emerging. It's a programming mode that involves using these large pre-trained components to design, in code, prompts that are essentially full AI systems, systems that are entirely about message passing between these frozen components. We have a new paper out called Demonstrate-Search-Predict, or DSP. This is a lightweight programming framework for doing exactly what I was just describing for you, and one thing I want to call out is that our results are fantastic. Now, we can pat ourselves on the back; we have a very talented team, and so it's no surprise the results are so good. But I actually want to be up front with you: I think the real insight here is that it is such early days, in terms of figuring out how to construct these prompts and how to program these systems, that we've only just begun to understand what's optimal. We have explored only a tiny part of the space, and everything we're doing is sub-optimal, and those are just the kinds of conditions in which you get these huge leaps forward in performance on these tasks. So I suspect that the bold row we have here will not be long-lived, given how much innovation is happening in this space.

And I want to make a pitch for our course here. We have in this course a bunch of assignments slash bake-offs, and the way that works, essentially, is that you have an assignment that helps you build some baselines and then work toward an original system, which you enter into a bake-off, a kind of informal competition around data and modeling. Our newest of these is called "few-shot open QA with ColBERT retrieval." It's a version of the problems I've just been describing for you. This is a problem that could not even have been meaningfully posed five years ago, and now we are seeing students doing incredible cutting-edge things in this mode; it's exactly what I was just describing. We're in the sort of moment where a student project could lead to a paper that really leads to state-of-the-art performance in surprising ways, again because there is just so much research that has to be done here.

I'm running out of time, so what I think I'll do is just briefly call out again those important other areas that I've given short shrift to today, but that I think are just so important, starting with datasets. I've been talking about system design and task performance, but it is now, and will always be, the case that contributing new benchmark datasets is basically the most important thing you can do. I like this analogy: Jacques Cousteau said "water and air, the two essential fluids on which all life depends." I would extend that to NLP: our datasets are the resource on which all progress depends. Now, Cousteau extended this with "have become global garbage cans." I am not that cynical about our datasets; I think we've learned a lot about how to create effective datasets, and we're getting better at this. But we need to watch out for this metaphorical pollution, and we need always to be pushing our systems with harder tasks that come closer to the human capabilities we're actually trying to get them to achieve. Without contributions of datasets, we could be tricking ourselves when we think we're making a lot of progress.

The second thing I wanted to call out relates to model explainability. We're in an era of incredible impact, and that has rightly turned researchers to questions of system reliability, safety, trust, approved use, and pernicious social biases. We have to get serious about all these issues if we're going to responsibly have all of the impact that we're achieving at this point. All of these things are incredibly difficult, because the systems we're talking about are these enormous, opaque devices that are impossible to understand analytically, and that is clouding our understanding of them. To me, that shines a light on the importance of achieving analytic guarantees about our model behaviors; that seems to me to be a prerequisite for getting serious about any one of these topics. The goal there, in our terms, is to achieve faithful, human-interpretable explanations of model behavior. We have great coverage of these methods in the course: hands-on materials, screencasts, and other things that will help you participate in this research, and also, as a side effect, write absolutely outstanding discussion and analysis sections for your papers.

The final thing I wanted to call out is just that last-mile problem: fundamental advances in AI take us 95 percent of the way there, but that last five percent is every bit as difficult as the first 95.
In my group, we've been looking a lot at image accessibility. This is an incredibly important societal problem, because images are so central to modern life, on the web and in social media, and also in the news and in our scientific discourse, and it's a sad fact about the current state of the world that almost none of these images are made non-visually accessible. So blind and low-vision users are basically unable to access all this content and receive all of this information; something has to change. Image-based text generation has become incredibly good over the last 10 years, another story of astounding progress, but it has yet to take us to the point where we can actually write useful descriptions of these images that would help a BLV user. That last bit is going to require HCI research, linguistic research, and fundamental advances in AI, and, by the way, lots of astounding new datasets. This is just one example of the innumerable applied problems that fall into this mode, and that can be very exciting for people who have domain expertise that can help us close that final mile.

So let me wrap up here. I don't want to have a standard conclusion; I think it's fun to close with some predictions about the future, and I have put this under the heading of predictions for the next 10 years or so, although I'm about to retract that, for reasons I will get to. Here are the predictions. First, laggard industries that are rich in text data will be transformed in part by NLP technology, and that's likely to happen via some disruptive newcomers coming out of left field. Second, artificial assistants will get dramatically better and become more ubiquitous, with the side effect that you'll often be unsure in life whether this customer service representative is a person, or an AI, or some team combining the two. Third, many kinds of writing, including student papers at universities, will be done with AI writing assistance, and this might be transparently true, given how sophisticated autocomplete and other tools have gotten at this point. And then finally, the negative effects of NLP and of AI will be amplified along with the positives. I'm thinking of things like disinformation spread, market disruption, and systemic bias. It's almost sure to be the case, if it hasn't happened already, that there will be some calamitous world event that traces to the intentional or unintentional misuse of some AI technology; that's in our future.

I think these are reasonable predictions, and I'm curious for yours, but I have to tell you that I made these predictions in 2020, two years ago, with the expectation that they would be good for 10 years, and more than half of them have probably already come true; two and three are definitely true of the world we live in. And on the flip side, I just failed to predict so many important things. The most prominent example is that I failed to predict the progress we would see in text-to-image models like DALL-E 2 and Stable Diffusion. In fact, I'll be honest with you, I might have bet against them; I thought that was an area that was going to languish for a long time, and yet, seemingly out of nowhere, we had this incredible set of advances, and there are probably lots of other areas where I would make similarly bad predictions. So, I said 10 years, but I think my new rule is going to be that I'm going to predict only through 2024 at the very outside, because in 10 years, the only thing I can say with confidence is that we will be in a radically different place from where we are now.
But what that place will be like is anyone's guess. I'm interested in your predictions about it, but I think I will stop here. Thank you very much.

Thank you so much, Chris, for the engaging and extremely interesting topic and presentation. I'm always so amazed by all the new things you mention; every single time we talk, there is something new, something exciting, that nobody, especially not me, expected you would be talking about so soon. Many questions came in, and I must say we will unfortunately not be able to get to all of them, because the time is limited and the audience is so active, and so many people showed up, so let me pick a few. Chris, the first is about the cost of training models: it seems it really scales with the size, and we are paying a lot of attention to, and putting a lot of effort into, training. So what does it mean for the energy requirements? You were talking about predictions, but how does it look now, and what do you recommend people pay attention to?

Oh, it's a wonderful set of questions, and critically important. I ask myself: if you think about industries in the world, some of them are improving in terms of their environmental impact and some are getting much worse. Where is artificial intelligence in that? Is it getting better or is it getting worse? I don't know the answer, because, on the one hand, the expenditure for training, and now for serving, for example, GPT-3 to everyone who wants to use it, is absolutely enormous and has real costs, measured in emissions and things like that. On the other hand, this is a centralization of all of that, and that can often bring real benefits. And I don't want to forget the previous era, where every single person trained every single model from scratch; now a lot of our research is actually just using these frozen components. They were expensive, but the expenditure of our lab is probably going way down, because we are not training these big models. It reminds me of that last-mile problem again: in the previous era, it was like we were all driving to pick up our groceries, a huge expenditure with all those individual trips; now it's much more like they're all brought to the end of the street and we walk to get them. Of course, that's done in big trucks, and those have real consequences as well. I don't know, but I hope that a lot of smart people continue to work on this problem, and that'll lead to benefits in terms of us doing all these things more efficiently as well.

Thank you so much. The next question, and you touched on this a few times, but it might be good to summarize it a little bit, because we got a lot of questions about trustworthiness: does the model actually know whether it's wrong or correct, and how do we trust the model, or how do we achieve trustworthiness, given that so much of what's happening right now is generation from generative models? How do we get past that?

It's an incredibly good question, and it is the thing I have in mind when we're doing all our work on explaining models, because I feel like offering faithful, human-interpretable explanations is the step we can take toward trustworthiness. It's a very difficult problem. I just want to add that it might be even harder than we've anticipated, because people are also pretty untrustworthy; it's just that individual people often don't have a systemic effect.
If you're really doing a poor job at something, you probably impact just a handful of people, and other people, say at your company, do a much better job. But these AIs, it's like they're everyone at once, and so any small problem they have is amplified across the entire population they interact with. That's probably going to mean that our standards for trustworthiness for them need to be higher than they are for humans, and that's another sense in which they're going to have to be superhuman to achieve the jobs we're asking of them. The field cannot offer guarantees right now, so come help us.

Fascinating, thank you so much. I also saw some questions and comments about bias in data, and you mentioned that as well: we are improving; there is a big improvement happening. Last question for you, a little bit of a thought experiment: do you think that large language models might be able to come up with answers to as-yet-unanswered important scientific questions, something we are not even sure exists in our minds right now?

Oh, it's a wonderful question, and people are asking it across multiple domains: these models are producing incredible artwork, but are we now trapped inside a feedback loop that's going to lead to less truly innovative art? And if we ask them to generate text, are they going to produce either weird, irrelevant stuff or just more of the boring, average-case stuff? I don't know the answer. I will say, though, that these models have an incredible capacity to synthesize information across sources, and I feel like that is a source of innovation for humans as well, simply making those connections. It might be true that there is nothing new under the sun, but there are lots of new connections, perspectives, and so forth to be had, and I actually do have faith that models are going to be able to at least simulate some of that, and it might look to us like innovation. But this is not to say that this isn't a concern for us; it should be something we think about, especially because we might be heading into an era when, whether we want them to or not, these models are mostly trained on their own output, which is being put on the web and then consumed when people create train sets, and so forth and so on.

Great, thank you so much. We are nearing the end, so as a last point: do you have any last remarks, anything interesting you would suggest others look at, follow, read, or learn about, to get more acquainted with the subject, to learn more about NLU, GPT-3, and other large language models? Any recommendations?

The thing that comes to mind, based on all the interactions I have with the professional development students who've taken our course before, is that a lot of you, I'm guessing, have incredibly valuable domain expertise. You work in an industry, in a position, that has taught you tons of things and given you lots of skills, and my last-mile problem shows you that that is relevant to AI, and therefore you could bring it to bear on AI, and we might all benefit. You would be taking all these innovations, which you can learn about in our course and other courses, combining them with your domain expertise, and maybe actually making progress in a meaningful way on a problem, as opposed to merely producing the demos and things that our scientific community so often produces. Real impact so often requires real domain expertise of the sort you all have.
I know it's the beginning of the quarter and hectic Stanford life, so I'd really appreciate you taking the time to do this, to run this webinar. Thank you also to everybody who had a chance to join us live, or who is watching this recording. If you could, please let us know what other topics you might be interested in for this sort of free webinar format; we have a little survey down on the console. I hope you all have a great day, and a wonderful end of the winter and start of the spring. Thank you, everybody, for joining us.

Yeah, Petra, this was wonderful. We got an astounding number of really great questions; it's too bad we're out of time. There's a lot to think about here, and so that's just another thank-you to the audience for all this food for thought. Thank you.