- [Instructor] I gave you this third year, like end of my second year. - Did you apply twice? - [Instructor] No. (people speaking faintly off mic) - [Martha] We'll get some more people coming in.
- I thought, oh, I'm sorry. - [Martha] Can everybody hear me okay? Do I need a microphone? Okay, so I'm delighted. I'm Martha Palmer, I'm actually in the computer science department and the linguistics department, for those of you who don't know me and I'm delighted today to be able to introduce our speaker, Yanjun Gao.
She's a brand new assistant professor in the bioinformatics department at the Anschutz campus. And she comes here from a three-year postdoc at the University of Wisconsin-Madison, in their medical school, where she won a very prestigious K99/R00 award. I had to ask her what that was. I understand it's sort of the equivalent of an NSF CAREER award. And she, I guess, brought that with her here then.
Yeah, yeah, right. She did her PhD at Penn State, and her lab at Anschutz is the Language, Reasoning and Knowledge lab. So she's very interested in applying techniques from natural language processing to medical informatics, which I think we're gonna hear about today. - [Dr. Gao] Yeah. - [Martha] So thank you very much. - [Dr. Gao] Is the mic working?
- [Martha] Yeah. - [Dr. Gao] Okay. So thank you, Martha, for the introduction and also the invitation. It's really nice to be here today. Actually, when I was making slides last night, I made an Instagram story post to say, wow, can't believe I'm making this today.
And I was just one of you not long ago, well, not that long ago. So today what I'm gonna talk about is language technologies, and particularly the research projects we are doing at Anschutz and what's exciting about them. Let's get started. Okay, so I'm gonna start my talk with this really tiny message that all of you know: Geoffrey Hinton actually won the Nobel Prize in Physics, which is why I would like to say this is the best time for our generation to start working on AI, 'cause we have been inspired by so many of the greatest people of our era. And one of the major breakthroughs behind today's language technology is the transformer architecture, which came out in 2017.
And it actually got inspiration from famous people and their foundational work, people like Geoff Hinton and someone else that I'm sure you can think of. And then because of the transformer architecture, we started to get these fascinating large language models. Many of you have seen or have been using some of them, including open-box and closed-box large language models, LLMs, across different domains. There's also the really famous discovery of scaling laws, meaning people found out that when you increase the number of parameters and also increase the training data size, you get much stronger, much more powerful LLMs. But the question that I really want to ask today is, are scaling laws all we need? Are scaling laws the only solution or the only exciting thing in this field that we can work on, and if not, what are the other options? So before I start talking about my research, I do wanna briefly introduce my journey, 'cause I get asked a lot how I got to where I am right now.
So I got a PhD admission from the Penn State Department of Computer Science and Engineering in 2014, and I had two years of bad luck with my advisors; I actually changed advisors three times. It wasn't until 2016 that I started to work on natural language processing with my PhD advisor, Rebecca Passonneau. The type of projects I worked on at that time were things like summarization evaluation, discourse analysis, and argument mining. And one of the application areas was actually education, not medicine. So I worked on a lot of projects on how we evaluate students' writing using NLP technologies.
And then near the end of my PhD, the COVID pandemic started, and at that time I suddenly realized, okay, there are things that are really important in this world, and one of them is healthcare. I had no idea where I would be or what I wanted to do until I found this postdoc opportunity at the University of Wisconsin in the Pulmonary Critical Care Data Science Lab. You can tell by the name that that was the front line of fighting COVID at the time, because they deal with ventilation, they deal with pulmonary disease, they deal with pneumonia, they deal with COVID. On the first day of my postdoc I actually went into the pulmonary ICUs and saw these patients who were on mechanical ventilation. They're not able to breathe, they're not able to have any awareness, and people have to treat them in a really isolated environment. That scenario made quite an impression on me; I had never imagined what hell would look like, and that was my first reaction when I saw it.
So that got me thinking: as a computer scientist with a PhD in computer science, what can I do and what am I really interested in? That's why I'm here in the Department of Biomedical Informatics in the School of Medicine at the Anschutz Medical Campus, doing AI and NLP research on medicine and healthcare in general. I also forgot to mention that it was before the end of my postdoc that ChatGPT was released, and that just changed the paradigm of everything. So LARK, the name of my lab, stands for Language, Reasoning and Knowledge, as Martha introduced, because these are the key components that my research focuses on. In other words, these are the areas that all of my research inspiration comes from. In particular, I'm really interested in how humans make decisions given complex data, for example language, or different modalities of data like images, videos, or audio.
And also, more recently, how people do reasoning when they know they have to draw a connection between what they see, hear, or read and their decision making. And lastly, one of the essential components behind this decision making is knowledge. For example, it takes seven or eight years for a student to be trained as a medical doctor. So there have got to be a lot of things happening within those eight years that we can probably train an LLM or AI on.
So my talk today has two main components. First, I wanna start by introducing some AI and NLP research that I'm doing for healthcare. In the other part, I wanna talk about some general-domain NLP problems that my lab is currently working on. So let's start with the first part, where we're gonna focus on diagnostic decision making. Let me start by introducing what electronic health records are. If you've been to a hospital and talked to your doctors, this is what you might see show up on their screen. Of course, if you're not in the ICU or in any surgical unit, you're not gonna see this much information in your electronic health record.
However, for patients who are in the ICU, their health records are normally really long and really complex. What you're looking at right now is a mock patient who has multiple hospital admissions, meaning they've been to the hospital many times, and each time it could be for a different reason and a different specialty of physician could be taking care of them. If you look at the left-hand side of the screen, you can see that most of the content combines natural language with tabular data, or structured data as we call it, so there are numbers, abbreviations, and medical jargon. There's also natural language that talks about what kind of problems the patient is having, or their medical history. And you can see that if the patient's situation keeps getting worse, the EHR gets much longer and more complex. So although electronic health records are designed to improve the efficiency of clinicians' workflow, they're actually causing more trouble than they bring efficiency.
One of the problems is, where is the most important information, or what is the most important information for a given patient, given how much information is generated when the patient goes to the hospital? And that has been an issue not only because it's harder and harder for physicians to find the most important information, but also because it causes physician burnout. Meaning, after reading so many of these EHR records for a given patient, they start to feel really exhausted and sometimes they'll make decisions that are not evidence-based. You can think about how dangerous those situations could be. It leads to medication errors, it leads to diagnostic errors, and diagnostic errors in particular are one of the areas my work mostly focuses on.
So let me talk about clinical diagnostic reasoning, because that's essentially how a person becomes a medical doctor and starts to say, okay, for this given patient, what are their problems? Diagnostic reasoning, like a lot of really complex decision making, involves two systems of thinking. I'm sure some of you have read the book "Thinking, Fast and Slow," which argues that for most complex decision making, people have two types of systems in their mind.
One is system one, which is based on heuristics, on fast thinking. The other is called system two, or type two thinking, meaning it's rigorous and based on the evidence people can find: they will draw connections between what they see, they will utilize their knowledge, and they're not just making decisions based on their first reaction. I wanna pull up this figure to show you more vividly what system one and system two look like. As you can see, the rabbit is saying to the turtle, why are you so slow? The rabbit is system one thinking: it's fast, but it's not accurate.
And then there's system two thinking, the turtle, asking the rabbit, why are you so stupid? You're making so many mistakes, and they're so obvious. So the goal of my projects, or most of my research, is to develop augmented intelligence for clinical diagnostic decision support. You can see that I'm not saying artificial intelligence for clinical diagnostic decision support. That's because if you say AI for clinical decision support, people are gonna think that, oh, you're gonna replace the physicians.
And I get those questions a lot. So I like to say we are augmenting people's decision making, not replacing them. You might also see many articles or papers or news stories that talk about how we now have Med-PaLM and GPT-4, and they pass the medical licensing exam, which every medical student has to take before they can practice medicine.
What I'm showing here is how Med-PaLM has actually outperformed many human physicians on MultiMedQA, a benchmark of medical question answering tasks. We also have human evaluation of whether there are more omissions coming from the human clinicians or more omissions from the system, and other types of errors. So the main takeaway from this figure is that people have been claiming these artificial intelligence (indistinct) are so powerful that they can potentially become the backend of health AI. But I'm gonna ask, is AI for healthcare and medicine really here? The reason I'm asking is that if you think about what medicine means, it's a science of uncertainty and an art of probability. This is a quote from William Osler.
He's been called the father of modern medicine. But none of these benchmarks, none of these evaluations that we've seen for GPT-4 and Med-PaLM, were actually run on real patients and real EHR data, and they don't evaluate the tasks that clinicians really care about. One of the tasks physicians care about is predicting diagnoses from raw patient data, and that's something I'm gonna cover in my next couple of slides. And secondly, because medicine is a science of uncertainty, how do these LLMs cope with uncertainty? How can they express uncertainty? Are their uncertainty expressions, their uncertainty estimates, reliable? Can we trust them? These are the questions we try to answer in the second part.
Let's start with the first part, on the type of electronic health record data. You might have seen some papers on LLMs for EHRs; many of them focus on natural language. We did that too. But now I wanna focus on the structured text, which is where the values live along with the clinical meanings associated with those values.
How well does an LLM really understand those things? We did a simple probing task where we just prompt Mistral and Llama models to say, given, for example, systolic blood pressure, what is the reference range? That is, what is the normal range for that vital sign or laboratory result, and what are the units of measurement? It turns out that, according to human physician evaluation on a Likert scale from one to five, where five is really accurate and one is not accurate at all, many of these LLMs, such as Mistral and Llama, get pretty good scores at recalling this kind of knowledge about reference ranges and units of measurement. They can also explain that for patients of different genders or ages, the reference range might be slightly different. So that gives us a little bit of confidence: okay, these LLMs do have basic knowledge about structured electronic health records.
The next thing we did is take real-world EHR data, which is shown in that blue table. Normally when patients come in, they go through multiple screenings or tests, so we took each patient and looked at the most likely diagnosis they have, as annotated by human experts. We use that dataset, and now our goal is to really look at whether, given those data, these LLMs can make the correct prediction: does the patient have a disease such as sepsis, arrhythmia, or congestive heart failure; would the patient die during their ICU stay; or how long would the patient stay in the hospital or in the ICU, whether it's three days or seven days. These types of predictive modeling tasks are really common.
If you look at the papers published in medical journals, you will see that people have been focusing on them for almost two decades. And many of the models they use are machine learning models, for example XGBoost or logistic regression. These models have been achieving, well, fantastic is not the right word here, but at least state-of-the-art results on many of these tasks. And to tell you the truth, some hospitals in the United States deploy these machine learning classifiers at the bedside as applications to say, okay, this patient might have a life-threatening event happening, and bring up an alert, so that when physicians anywhere in the hospital see that alert, they will rush to that patient and deal with the emergency. So now our goal is to compare: can these LLMs be used on the same type of tasks that machine learning has been used on? And you can see that we're doing it with two types of methods.
One is looking at their embeddings, because embeddings are where these LLMs store their knowledge and draw their feature representations from. We wanna compare with machine learning classifiers that have been using raw data features, which are mostly a Pandas data frame, as I'm sure every one of you knows. Instead of feeding that raw data into machine learning classifiers, what if we utilize the features coming from the LLM? Would that give us a benefit over the raw data features, given that LLMs are so powerful and should also understand the (indistinct) knowledge, which might bring more useful signal for those predictive modeling tasks?
So that's our assumption, and we also compare with direct generation, meaning we just prompt these LLMs and ask them to predict whether the patient has sepsis, whether the patient would die during their ICU stay, or how long the patient is gonna stay in the hospital. And I'm just gonna share quickly that the direct generation results are not good at all: even though we got pretty good performance just by prompting them for the reference ranges of this tabular data, they're not good at predicting whether the patient would die or not on this real patient EHR data. So we shift to looking into the embeddings plus machine learning classifiers, and we also try different LLM settings, for example prompt engineering, where we actually say to the LLM, imagine you are a medical expert, imagine you are an AI generating useful features. We try basically every combination of personas that we can think of that might be relevant to this task.
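A minimal sketch of this embeddings-as-features setup, assuming a Hugging Face causal LM and scikit-learn; the model name, the narrative serialization, the mean pooling, and the single logistic regression head are illustrative choices, not the exact pipeline from this work:

```python
# Hypothetical sketch: serialize a structured EHR row to text, embed it with an LLM,
# and feed the embedding to a classical classifier, versus using the raw features.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM would do
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def serialize(row: dict) -> str:
    # One possible table-to-text conversion (narrative style); JSON/Markdown also work.
    return ", ".join(f"{name} is {value}" for name, value in row.items())

@torch.no_grad()
def embed(row: dict) -> np.ndarray:
    ids = tok(serialize(row), return_tensors="pt").to(enc.device)
    hidden = enc(**ids).last_hidden_state            # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).float().cpu().numpy()  # mean-pool over tokens

def compare(rows, labels, raw_features):
    """AUROC of LLM-embedding features vs. raw tabular features
    (train/test split omitted for brevity)."""
    X_llm = np.stack([embed(r) for r in rows])
    llm_clf = LogisticRegression(max_iter=1000).fit(X_llm, labels)
    raw_clf = LogisticRegression(max_iter=1000).fit(raw_features, labels)
    return (roc_auc_score(labels, llm_clf.predict_proba(X_llm)[:, 1]),
            roc_auc_score(labels, raw_clf.predict_proba(raw_features)[:, 1]))
```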
We also try few-shot, chain-of-thought, and parameter-efficient fine-tuning. I have results for PEFT, 'cause I know some people might ask questions about it. But the short answer is that the out-of-the-box LLMs, where you do nothing, basically no prompt engineering, just feeding the input to the LLM, give you the best performance on this task. You can also see that we try different table-to-text conversion methods, including narrative, JSON, Markdown, XML, and HTML formats. So what you're looking at are the main results from these experiments, and we are comparing with raw data plus logistic regression, random forest, and XGBoost.
These are the three common types of machine learning classifiers you're gonna see in this predictive modeling literature. And in fact, XGBoost has been achieving state-of-the-art results for many years, if you look at clinical deterioration prediction or length-of-stay prediction, and it's really hard to beat. So when you look at the LLM embeddings plus XGBoost or plus logistic regression classifiers, you can see that the green highlighted fonts are the ones whose confidence intervals overlap with the best performing classifier. And sometimes that difference is quite minor.
For example, Mistral 7B, with no prompt engineering at all, can actually get a 71.12 AUROC score compared to the XGBoost result, a difference of only about 0.05. But there are also tasks, or predictions, where the LLMs perform really badly, for example arrhythmia. So sepsis, arrhythmia, and congestive heart failure are the three common types of diagnoses that cause patient deterioration. Meaning, if a patient is dying inside the ICU, these are the three common reasons causing their death.
That's why we focus on these three diagnoses. And when we look at PEFT a bit more closely, what you're looking at is four confusion matrices, and each row represents one task. On my right-hand side is Mistral without QLoRA, and on the left-hand side is Mistral with QLoRA. One thing you can quickly see is that, because of these imbalanced datasets, the imbalanced label distribution, QLoRA actually makes Mistral predict fewer true positives and more true negatives, because there are more true negatives in both of the datasets. That makes sense, right, because most patients will be discharged safely; only a small portion of the patients will die in the hospital in this case.
Same with the lower row, which is the mortality prediction inside the ICU. That task is about whether the patient is gonna die in the ICU, and the true positives become zero after QLoRA training, while before, Mistral could predict six of those cases. So what I'm trying to conclude here is that there's definitely more work that we would need to do, rather than just saying, okay, these LLMs have been achieving fabulous performance on (indistinct). It's a completely different thing if you think about these benchmark datasets versus what's happening in the real world and what's really needed in the real world. Now I wanna move on to diagnostic uncertainty, because these two topics are actually a continuation of each other. Diagnostic uncertainty is shown by this figure.
There are actually two components to diagnostic uncertainty when we talk about it. We have the pretest uncertainty: in the figure, the pretest diagnostic uncertainty is the portion between zero and A.
That portion comes from the patient's first impression: they come in and complain about what problems they're having, they're feeling bad, they cannot eat, all these things. That is the first reaction these physicians will have, to say, okay, what is the likelihood that this patient has sepsis or pneumonia or something else? And then, based on that pretest probability and depending on the test threshold, or how severe the problem is for that patient, the physician will decide whether they're gonna start ordering diagnostic tests to further confirm whether this is COVID, whether this is pneumonia, or whether this is sepsis. That is where the post-test probability comes from. And once they get the post-test probability, they have a higher confidence to say this patient has sepsis, and that's why we're gonna prescribe some treatments to cure sepsis, or we're gonna prescribe some treatments for COVID, and so on.
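As a hedged worked example of this pretest-to-post-test update, using the standard odds-times-likelihood-ratio form from Bayesian diagnosis (not necessarily the exact formulation behind the figure):

```latex
\text{pretest odds} = \frac{p_{\text{pre}}}{1 - p_{\text{pre}}}, \qquad
\text{post-test odds} = \text{pretest odds} \times LR^{+}, \qquad
LR^{+} = \frac{\text{sensitivity}}{1 - \text{specificity}}
```

For instance, with a pretest probability of 20% and a test whose sensitivity and specificity are both 0.9 (so $LR^{+} = 9$), the pretest odds are 0.25, the post-test odds are 2.25, and the post-test probability is $2.25 / 3.25 \approx 0.69$, which may now clear the treatment threshold for that disease.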
Different diseases have different diagnostic thresholds for testing. And you can think about how it's a really different thing to say a patient has a 20% probability of pneumonia versus a 20% probability of obesity. Pneumonia might not be the best example here, but if you say somebody has a 20% probability of some disease that could actually cause them to die, that's definitely another thing than saying there's a 20% probability of obesity, right? Because obviously obesity is a much less acute condition. So estimating this pretest probability is really hard, because you have to think about the prevalence of the disease, the patient's presentation, and their history.
If you remember what I was saying early on, electronic health records contain complex and rich information, and that's where the decision has to be made, based on that rich information. And in fact clinicians perform poorly at estimating pretest probabilities. That's one of the reasons why we have so many diagnostic errors in this country: physicians are either overestimating some diseases or underestimating others. One study actually evaluated GPT-4 just by prompting it and asking what the likelihood is of a patient having some kind of disease. And you can see that the blue line is the true range, the yellow curve is the LLM's prediction, and the gray is the humans'. Humans have a much wider estimate of the pretest probability, while the LLM has a much narrower one.
But none of them is actually close to the true range. So that got us thinking: is the next-word probability of an LLM really a pretest probability? Can we really use it, as the experiments in the previous slides did? If you think about what an LLM does: the last layer just takes whatever the representation is and projects it to the vocabulary space, then picks the next token with the highest probability, right? So if you prompt an LLM to say, okay, what is the probability for this patient, what do you expect it to say? Can you really trust the number it outputs? I don't think so. So we looked at the literature, especially in natural language processing, and there is a recent survey from NAACL that covers both white-box and black-box LLMs and their common methods of estimating confidence, which include looking at the internal logits of the tokens, looking at internal states, having a surrogate model if it's a black-box model, or just prompting them multiple times and taking the average answer as the confidence.
We apply similar techniques to this pretest diagnostic probability problem, where we look not just at embeddings plus machine learning classifiers, but in this case also at the internal token logits: we prompt the LLM with, does the patient have sepsis, yes or no, and then measure the logits of answering yes and the logits of answering no. We also have verbalized confidence, where we simply prompt it to say how likely it is that the patient has sepsis, for example, and it gives us a confidence score between 1 and 100. This figure reports AUROC scores; the top row is from Mistral and the bottom row is from Llama.
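A minimal sketch of these two confidence readouts, assuming a Hugging Face causal LM; the model name, the prompt wording, and the " Yes"/" No" token handling are illustrative assumptions rather than the exact setup used here:

```python
# Hypothetical sketch: (1) token-logit confidence from a yes/no diagnostic question,
# (2) verbalized confidence from a 0-100 likelihood prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def token_logit_confidence(patient_text: str) -> float:
    """P(Yes) from the next-token logits after a yes/no diagnostic question."""
    prompt = f"{patient_text}\nDoes this patient have sepsis? Answer Yes or No.\nAnswer:"
    ids = tok(prompt, return_tensors="pt").to(lm.device)
    logits = lm(**ids).logits[0, -1]                              # next-token logits
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes_id, no_id]], dim=-1)[0].item()

@torch.no_grad()
def verbalized_confidence(patient_text: str) -> str:
    """Ask the model to state a 0-100 likelihood; the number is parsed downstream."""
    prompt = f"{patient_text}\nOn a scale of 0 to 100, how likely is it that this patient has sepsis?\nLikelihood:"
    ids = tok(prompt, return_tensors="pt").to(lm.device)
    out = lm.generate(**ids, max_new_tokens=5, do_sample=False)
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
```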
There are additional experiments that we did, as you can see from the x-axis. We add in different variables that might inherently bring out the bias inside these LLMs. The default setting is that we don't tell the LLM the age, sex, gender, or ethnicity of the patient, but we add in these variables in subsequent experiments to see how much impact these bias variables have on the predictive modeling tasks. So it turns out that, first of all, the token logits and verbalized confidence are the two most common ways of getting a confidence estimate from LLMs.
Neither of them actually performs well. If you look at their AUROC scores, they're below 50 most of the time. And embeddings plus machine learning, as we showed in the previous experiments, are actually quite good, but they're still not as good as the raw data plus XGBoost baseline.
And adding in these different bias variables actually helps, or at least changes, the models' predictions. But it really depends on the model and the diagnosis. So it's really a mystery what actually drives these models' predictions and why the models make such decisions. In this slide, we correlate the positive-class, positive-diagnosis predicted probabilities from the LLMs with the raw data plus XGBoost baseline. By doing that, we want to see, if we assume the raw data plus XGBoost is the silver-standard baseline or silver-standard prediction,
then how well do these LLM uncertainty estimates predict in the same direction, or in a different direction? It turns out that only the LLM embeddings plus XGBoost predict in the same direction, whereas the others, the token logits and verbalized confidence, have a negative correlation with the baseline results. And in the last set of experiments, we looked at the expected calibration error (ECE). The lower the number, the better the performance. We compare against the raw data plus XGBoost, which is our baseline here, and you can see that verbalized confidence in particular has a terrible ECE.
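A minimal sketch of expected calibration error for a positive-class probability, with equal-width bins; the bin count and the binary formulation follow the common definition, not necessarily the exact setup used in this work:

```python
# Hypothetical sketch: binned ECE comparing predicted P(positive) to the empirical
# positive rate in each confidence bin, weighted by bin size.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """probs: predicted probability of the positive class; labels: 0/1 ground truth."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        mask = (probs >= lo) & (probs <= hi) if i == n_bins - 1 else (probs >= lo) & (probs < hi)
        if mask.any():
            confidence = probs[mask].mean()   # mean predicted probability in the bin
            accuracy = labels[mask].mean()    # empirical positive rate in the bin
            ece += mask.mean() * abs(accuracy - confidence)
    return ece

# e.g. expected_calibration_error([0.9, 0.8, 0.3, 0.2], [1, 0, 0, 0])
```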
And that brings us to our future direction for this work: we wanna develop LLMs that are calibrated for diagnostic decision support or decision making systems. By doing that, we can actually help augment physicians in either of their two types of thinking, or whatever decision they have to make to conclude a diagnosis. There's another line of work that we are doing concurrently, associated with diagnostic decision support, which is knowledge grounding and fact verification to reduce hallucination. Because one of the troubles, before we can really safely deploy these models, is that we don't know how much hallucination there will be in their output and how consistent their outputs will be.
And one of the ways you can address that is to verify their knowledge, or verify their output, to say whether that output is actually grounded in a knowledge base. For example, (indistinct) is one of the largest medical knowledge repositories of English medical concepts. So currently we're working on neuro-symbolic methods to replace graph neural networks for this work. Okay, now I wanna move on to the second part of my talk, about the foundational NLP problems we're working on. This includes two types of work that I wanna briefly talk about.
One is about discovering knowledge gaps in scientific literature, and that comes from one of my PhD students' thesis. The other is something really new right now, which is called mechanistic interpretability. So first let's start with the ignorance detection project. What is the ignorance that I'm talking about? Ignorance is defined as, kind of, the knowledge that we know we don't know. It's about the absence of fact, understanding, insight, or clarity about something, and it's not an individual's lack of information but what the entire community doesn't know.
When I talk about ignorance, I'll use it interchangeably with knowledge gap and known unknowns in my next couple of slides. Let me give you an example of what's considered ignorance, or a knowledge gap. CRISPR-Cas9 is a genome editing method. Before CRISPR-Cas9, the sorts of things people already knew about were bacterial adaptive immunity and precise genome editing.
But the things they really didn't know were how it gets delivered, the delivery mechanisms, or the off-target effects. Because of CRISPR-Cas9, they were able to close those gaps, and that's what we call ignorance, or knowledge gaps. There are different types of knowledge gaps, and one of the things we are particularly interested in is: what is the future work we can conclude from what people have done? The two types of future-direction knowledge gaps we look at are, first, explicit mentions: sometimes you will see in a paper that somebody says, well, this merits future investigation, right, at the end of their experiments. Or they will talk about a couple of things, and you can kind of get it from reading their work: okay, these are the things they're gonna do in the future, these are the future gaps or future directions for their work.
That second kind is the implicit understanding from context. Sometimes the context could be a paragraph, or it could be the whole paper. The experiments I wanna talk about here are about extracting knowledge gaps, and in particular extracting implicit knowledge gaps from the entire paper.
To be able to perform this task, we annotated 24 research papers and recruited 18 participants, and these participants are the authors of the papers we included. Nora, the author of this work, actually annotated lines, sentences, or paragraphs to identify the knowledge gaps, and then used that to prompt GPT-4 and also a Llama model. But what we mainly wanna show here are the results on GPT-4, because it's amazing. Once we extracted these knowledge gaps, or asked GPT-4 to conclude them, we sent them back to the authors and asked them the following questions: Is this extracted knowledge gap factually true, is it something you really think is a gap? Do you agree the ignorance statement is still an open question, that is, has it been solved already or does it remain unsolved? And could addressing this gap significantly improve the field, in other words, how significant are these knowledge gaps? It turns out that over 83% of the participants' answers say yes to the first question.
So these knowledge gaps truly exist; they're factually correct. About 56% of participants say they think those gaps are still open questions. And finally, 67% of them think these knowledge gaps are significant problems that need to be solved. There's another slide that I did not include here, but many of the participants gave us feedback saying they did not pursue a knowledge gap because either they didn't have funding for it or it was no longer their interest. But nonetheless, GPT-4, in this case, is still able to conclude these implicitly mentioned knowledge gaps across the papers.
Now I wanna move on to the next part of this talk, which is mechanistic interpretability. The reason I mention it is that the more I learn about mechanistic interpretability, the more I feel it's going to be useful for a lot of research in NLP. Because it's a really new field, as some of you might know, we start with a really simple and dumb question, which is positional bias in multiple-choice question answering. Positional bias means that when you evaluate an LLM on MCQA, multiple-choice question answering, it will always pick answer A, in this case, regardless of where the correct answer is or how the answer options are shuffled, which is really dumb behavior, right? So we are really interested in knowing why these models do that. We start with GPT-2, which is a really small, open-source model. The way we do it is to first visualize the circuit flow inside the LLM.
Basically, using mechanistic interpretability and some toolkits from that field, you can visualize the logit activations when you feed in inputs, and see which answer has been assigned the highest logit at each layer. By doing that, you can measure the absolute difference between the correct answer's logit and the logit that currently has the most weight. If you look at the first picture, that's where the percentage is coming from.
If you look at the MLP in one of the boxes, it says 68.6. That is how close the current predicted logit is to the correct answer's logit; so, actually, how correct, not how different. And you can see that the later the layer, the more errors, or the less correctness, that logit gets. We can also visualize the internal logits as a heat map, where each row is a different input: blue means the prediction is really close to the correct answer's logit, while red means the logit prediction inside the LLM is really far from the correct answer. And we can identify, at least for GPT-2 XL, that most of the errors occur in the last two layers.
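A hypothetical logit-lens style sketch of this per-layer analysis, assuming GPT-2 via Hugging Face; projecting each layer's hidden state through the final layer norm and the unembedding is one common way to read per-layer answer logits, and the prompt format and option tokens are illustrative, not the exact setup used here:

```python
# Hypothetical sketch: for each layer, project the residual stream at the last position
# to the vocabulary and check whether the correct answer letter already wins.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def per_layer_answer_check(question: str, options: dict, correct: str):
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    option_ids = {k: tok(" " + k).input_ids[0] for k in options}     # " A", " B", ...
    results = []
    for layer, h in enumerate(out.hidden_states):                    # embeddings + each block
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))     # project to vocab
        picked = max(option_ids, key=lambda k: logits[option_ids[k]].item())
        gap = (logits[option_ids[picked]] - logits[option_ids[correct]]).item()
        results.append((layer, picked, gap))   # gap == 0 means this layer already prefers the correct letter
    return results

# e.g. per_layer_answer_check("Which organ pumps blood?",
#                             {"A": "Lung", "B": "Heart", "C": "Liver", "D": "Kidney"}, "B")
```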
But how are we gonna correct them? We don't have an answer yet. That's still an open question and a work in progress. But the reason I bring up this work is that I do think this field has really great promise in helping us answer a lot of these questions, although coming up with a good intervention is a really hard task. That concludes the research part of my talk. I do wanna mention that there are some other really exciting projects we're doing right now, which include safety and explainability, because if you look at the entire field, these are the things people are really focusing on right now.
And also human-AI, or AI-AI, interaction. I don't think AI can make decisions by itself, especially if you think about healthcare situations. Every AI output has to be evaluated by clinicians or human experts, right? There was a recent project I did with a healthcare system where we designed a prompt engineering framework, used it to prompt GPT-4, and generated patient message responses. So whenever a patient email comes in, we use GPT-4 to answer those emails.
And even on that simple task, where a patient's message is usually about medications, when to schedule appointments, oh, I suddenly hurt myself, what should I do right now, all kinds of questions but in a really narrow scope, even in those cases, every message generated by GPT-4 has to be checked or manually verified by a nurse or physician. So I do think that in the longer run, human-AI interaction is something I would definitely be interested in pursuing. Finally, I just wanna really thank Martha again for this invitation.
And I do wanna say to all the students that I feel you're really lucky, because CU Boulder and the NLP group are fantastic. In my early stages as a PhD student, I read a lot from this group. I read tons of papers from this group, I got a lot of inspiration, and that actually formed my early idea of what NLP would look like, until 2022, you know, when (indistinct) came out. But still, deep in my heart, I feel that most of the theories from the Boulder group are really wonderful.
And there's one thing I do wanna say to the students, which is that this is the best time, across all areas and across all years, to do interdisciplinary work. It has never been so easy to collaborate with somebody outside your field. I collaborate a lot with physicians and clinicians.
We didn't speak the same language at the beginning, but now, because of ChatGPT and all these fabulous technologies, you can show your ideas in a way that they can understand, or they can start doing programming and you can provide consultation or advice to them on more exciting projects. So although people say that computer science and AI have become so competitive these days, and I do think that's true, I also think there are so many other really exciting projects that we never had a chance to think about before. Now it's finally our time to start doing that. And we are able to do that because we know computer science, we know the basic principles of NLP, and we also know how to talk to them. I hope that some of you, in the longer run, will start doing things like NLP for sociology or criminology, or for healthcare and other areas. So, a really important slide, maybe the most important slide.
Currently we have three lab members and we are actively hiring research interns. There are definitely PhD opportunities, though I'm not sure how many of you would be interested, because the PhD program I'm currently in is called computational bioscience, although I know nothing about bioscience; that's just where I got put. Okay, that's my last slide.
Now I can take any questions you have. - [Martha] So thank you very much for that talk. That was a lot of information. So we have some time for questions now. So maybe I'll start with a question.
So I kind of, I think I missed what you said when you were talking about the knowledge gaps, and right at the end you said something about the LLMs and the knowledge gap, and I kind of missed that. Could you go back over that point again, please? - [Dr. Gao] Yeah. Is it this slide or the next slide? - [Martha] It was the last slide, or wait, maybe this was the last one. Yeah, it was on this slide, right at the end. And you just said something about either the LLMs could identify the knowledge gaps, or, I didn't quite catch that. - [Dr. Gao] Yeah, yeah. Thank you for catching that.
Yeah, when I said that, I was a little bit confused myself, because there are many things going on around that problem right now. We were surprised to see that GPT-4 can achieve over 80% true knowledge gap extraction. And we were also surprised to see that the gaps it extracted are still considered significant, which is almost 67% of them. There is a debate on our campus right now about how AI is actually shaping critical thinking in science.
The reason I mention it is that there was a recent "Nature" article that cited the director of the graduate program on our campus. But it was cited in a weird way, as if to say that we don't support students using AI, which is not true. Our work is one example: we encourage students to use AI and to explore AI, in particular in doing science, in doing research, but we have to train our students in a way that enhances their critical thinking using AI, not uses AI to replace their critical thinking.
So maybe that was the context, and when I was presenting the slides I was thinking about that; maybe that's where the confusion comes from. - [Speaker] Okay, so, but so there are still, I mean, I totally believe this, but there are so many things we don't know. There are tons of gaps in our knowledge, especially in medicine. - [Martha] Yeah.
- [Speaker] And what was the, and you think the LLMs are going to help us fill these gaps, or, I just didn't, what was the point you were making about the role of LLMs in addressing the knowledge gaps? - [Dr. Gao] Yeah, so we are trying to say that LLMs can be used kind of like a personal assistant, in a way, that helps you capture these gaps you might have thought about but no longer remember after a while, or are no longer interested in after a while. And the input to these LLMs has to come from people's papers, right? We are surprised that they're able to extract these gaps, but we're also saying there has to be a role for these LLMs after extracting them: to spread them out to the whole community, to remind people that these are the questions we still don't have answers to, the open questions that still need people's time and effort to solve. It shouldn't just sit in somebody's mind until, I don't know, they go out for a walk and something comes up: oh, this is a question I should have worked on a couple of years ago, after I wrote that paper, after I found something.
So that's where that sentence was coming from. Yeah. - [Speaker] So I had another question. That was really interesting where you were showing the different layers of the LLM and where the blue was, where it was more or less close to getting it right, and it's blue and it was blue and it was blue, and sometimes it wasn't quite right, but it was blue, it was blue, and then right at the end it was totally wrong. - [Dr. Gao] Yeah, yeah. - [Speaker] So do you have any intuitions about what might be going on with that? - That's a really good question, and the answer is no. What makes this problem hard is that this is just for GPT-2, and GPT-2 has the same pattern across different sizes.
So GPT-2 small, large, or XL, they all show the same behavior: at the end they start making mistakes. But when we look at other LLMs, for example larger ones, we start noticing that those red cells actually appear earlier rather than at the end. So that really makes us think. One thing we wanna do right now is detect where this knowledge is stored, the location, for example which layer, which neurons store that knowledge, and then maybe there's something wrong when the model recalls the correct knowledge from those locations.
Maybe they got confused about the problem, maybe they got confused about the question. That's our current assumption. - [Speaker] I also had a question on this slide. How did you, like you mentioned that the (indistinct) was to represent correctness. How did you measure that? - [Dr. Gao] Yeah, so we are measuring, basically,
across all the samples. So let's say this was based on 50 samples or a hundred samples, I forget. We just look at the absolute difference: for example, at the first MLP layers, you get the logit predicted for one of the answers, and then you measure the difference between the correct answer's logit and the model's current prediction.
And then we count what fraction of the time that layer has made the correct choice, and that's where the percentage comes from. - [Speaker] So you take the language model and then you take each layer and you train a classifier on each layer? - [Dr. Gao] No, you just look at the logits on top of it. Basically the MLP layers have weights, and those weights get assigned to the correct answer or the incorrect answers. And when you have the correct answer, you can measure the logit difference.
I guess you're talking about the output, the vocabulary, and the logits on that. But I'm talking about the weights, the difference of the weights. Does that make sense? So maybe, yeah, sorry to confuse that. Let me repeat myself, or say it in a different way.
It's the weight difference, the difference between the weights assigned to the correct answer and the weights assigned to the incorrect answers. - [Speaker] So if you've got the correct answer, you know what the weights are when it is producing the correct answer. So you're just looking at the differences between those weights and all the other places. - [Dr. Gao] Yeah, yeah. Does that make sense? - [Speaker] I think, aren't you encoding tokens and then on top of the final layer, you're training a classifier? - We don't train a classifier. You can just grab the internal logits.
So for example, you have A, B, C, D, and then you know- (student speaking faintly off mic) Yeah, yeah. - [Speaker] Makes sense. - [Dr. Gao] Yeah. - [Speaker] Yeah, I don't know a lot about transformers, but I'm just wondering, as you change the size, or if you have the same model, is the inaccuracy always going to reside in those final layers by virtue of the way you're training the whole system? I don't know, because you could also imagine truncating the layers and sampling from an earlier stage, but maybe those are lower-quality answers, or, yeah, maybe you get a higher-quality answer, although potentially less correct, in the final layers, and that's necessary.
I'm not sure. - [Dr. Gao] Yeah, that's also a good question. Again, that's what makes this problem really hard. So we tried different model editing methods. If you look at NAACL, there are always papers that talk about model editing for factual knowledge association, that kind of thing. The idea is that you change the weights of the layer that's causing problems, either by zeroing out the weights or by putting in, kind of, a constant, and shifting the prediction toward the correct answer.
Those are the techniques, but the problem with them is that if you change the weights, you also change the predictions on other tasks. You lose generalizability, because during pretraining the model has stored its knowledge inside those weights, right? So if you just correct or edit some neurons, what about the next time, when you have something similar and other knowledge is stored nearby? Because you changed the weights of a neuron, you turn another prediction, which used to be correct, into an incorrect one. That's why we tried model editing and it doesn't really work. Well, at least it worked on the problem we were targeting, but we lose generalizability. Yeah.
- [Speaker] So it's not like CRISPR and (indistinct). - [Dr. Gao] No. No. - [Speaker] Do you worry at all, or have you seen at all, that the medical data you work with doesn't necessarily represent a diverse group of people? Like, that there's going to be bias in that data that then comes out in a product that you give to a healthcare provider? - [Dr. Gao] Yeah, that's a really, really great question.
Yes. So here's what we normally talk about: for example, LLMs that are trained on the giant (indistinct) of data from a hospital. For example, the University of Florida has its own LLM trained on the entirety of Florida's patients. And we say that you cannot apply that LLM to, for example, Wisconsin or Colorado, because the population and diversity are completely different. In Florida you have more elderly patients, while in Colorado you have a slightly more balanced distribution across different ethnicities, genders, or even age groups.
So definitely, that's also what makes predictive modeling really hard, because at some point you want this bias, since it's a strong indicator of something. For example, Black and white patients don't share the same reference ranges for some of their vital signs, but you also don't want the model to be so specific to one population group that you cannot generalize it to other groups. So that's a really wonderful question, and I don't think we have a standardized solution to it yet. - [Speaker] So I think you mentioned that you repeated this positional bias research with newer, larger models, and that the inaccuracy among the internal logits actually started to appear in earlier layers in the larger models.
I assume, though, that the newer, larger models also exhibited higher overall accuracy on those (indistinct) tasks. Do you have any insight into how those two things can coexist? Like, it's introducing errors earlier while still producing overall better results? - [Dr. Gao] Yeah, that's a really good question. So we noticed that with Mistral 7B, or any model that's around seven billion parameters, they don't have these stupid errors anymore.
They're smarter than GPT-2, let's just put it this way. But on some tasks that are really knowledge intensive, for example when we ask, given this patient and options A, B, C, D, what diagnosis do they have, they still show that behavior of always picking A, regardless of the correct answer or answer shuffling. And that's where we notice that the knowledge actually comes up earlier rather than later. But still, that's a different question, because the domain is different.
The domain of knowledge we are looking at is very different from GPT-2. And with GPT-2, we didn't do it on medical knowledge, because it failed at these basic MCQA tasks. So what I'm trying to say is that the thing we are really struggling with is coming up with an evaluation framework that is able to say, okay, whatever LLM we work on, we can always detect this type of error. But as you increase the size of the LLM and add more training data, you're gonna see a completely different set of errors.
So our methodology cannot be directly applied. Is there, okay... - [Martha] Thanks to our speaker again. (Martha speaking faintly off mic) (person speaking faintly off mic) Yeah.