Why and How Enterprises Are Adopting LLMs | Intel Technology


- [Narrator] You are watching InTechnology, a video cast where you can get smarter about cybersecurity, sustainability, and technology. - Hi, I'm Camille Morhardt, and welcome to the InTechnology Podcast. Today we're gonna take a closer look at large language models, or LLMs. You're no doubt familiar with the more popular models such as OpenAI's ChatGPT. My guest today knows a lot about these models.

Sanjay Rajagopalan is Chief Design and Strategy Officer at Vianai Systems. He works with enterprises on ways they can utilize LLMs in their business. We're gonna talk today about how LLMs work, what tasks they can perform, and what to do when, as Sanjay says, they go off the rails. Welcome to the podcast, Sanjay. - Thank you. Thanks for having me, Camille.

- Can you start by telling us why enterprises are even interested in adopting large language models? - Perhaps the reason enterprises are interested overlaps with why everyone is interested. Of course, one of the things is that it's very accessible. Anyone can experience this super easily: just log into one of these systems, OpenAI's ChatGPT, et cetera, and there it is. You have it, you can try it out. And the thing is that it is really impressive. You know, imagine typing anything into a single interface and it talks to you as if it's a real person, and people find that unusual.

People start wondering, how is it doing that? What's going on that allows a machine to speak or write in ways that a human would? The way I think about it, and this is at least my personal experience, is that it is super impressive for about 15 minutes. You just see it and wow, look what it can do.

But then if you spend any significant time with it, let's say you have an extended session or try to accomplish some kind of complex task with it, its performance rapidly deteriorates, at least in your mind. It's still doing the same thing it was always doing, but suddenly it feels a little bit off. It depends upon what you're trying to do. But most folks, I believe, will find it somewhat frustrating to get a task done once they go past that initial 15 minutes of being extremely impressed with it. And what's happening is that of course it sounds articulate, it sounds believable, it sounds knowledgeable, and it is.

But every once in a while it is confidently incorrect. And if a human being had these capabilities, if a human being was knowledgeable, believable, articulate, confident, we would say that this person is intelligent, right? But if every once in a while they just made things up, told you things which are completely incorrect, and did so in a very confident and believable way, you would say something's off about this person. One in every 10 things they say seems to just run off the rails. So this is what's happening. The system is articulate and knowledgeable.

Every once in a while it goes off the rails. And that's what's called hallucination. And enterprises, and the executives of enterprises, engage with these systems just as anyone else would. In the first 15 minutes they're super impressed, and they may not engage with it beyond that, but in that 15 minutes they become convinced this is a transformational technology, it's disruptive, it can do anything. And then they say, well, how can we use this to make a difference in the enterprise? Can it drive value? Can it drive productivity? Can we automate things? Can we eliminate inefficiencies? So the thoughts go in that direction, and they are right that it can do all of these things. But there is some caution, which is that it cannot do all of these things without a lot of effort and a lot of work.

- You mentioned going off the rails and hallucinations, and I definitely wanna explore those further. But first, would you talk about what tasks large language models are best suited for? And then also explain what hallucinations are. - Tasks that tend to be somewhat forgiving in terms of the need to be accurate all the time. What are some examples of that? Imagine any task that kind of gets you to the ballpark but isn't claiming to solve a problem completely or automate it completely, where you just rapidly want to get close. And then the last mile is something that people would do, with a lot of back and forth, editing, and improvement. So it's really good, clearly, at getting you close without getting you all the way there.

And in these kinds of scenarios, it's okay if the system is wrong sometimes because you're gonna fact-check it anyway. So get me close and then give me some ideas, some starting points. It could be really good as a brainstorming tool: "Hey, this is what I want to get done. Give me some kind of starting point, give me some ideas." You don't necessarily use those ideas without making sure those things are gonna work, but it can get you started.

It can really help with writer's block, for example. And it's good for tasks where there are many possible good outcomes. For example, I wanna write a poem, right? Well, there are many possible good outcomes. There is not one answer to what's a good poem.

Actually these systems aren't great at poetry. They write witty little things. And as long as that's what you're looking for, those are the kinds of tasks it tends to be good at.

So things which are somewhat tolerant to hallucinations. And I'll talk a little bit about hallucinations, what they are and why they happen. But personally, for example, I've been super impressed with its ability to write code, and there are reasons why it's particularly good at writing code. The biggest one being that there's an immense amount of code available publicly for these models to train on. So it's seen a lot of code, but also code in itself tends to be a little bit more structured than, you know, completely natural language. It does pick up on those patterns and structures.

So it's able to reproduce those fairly well. So those are the things that it tends to be good at: things which are forgiving of some hallucination and things which are semi-structured. And so what is hallucination? The way I would define it, and I know there probably is a formal definition somewhere based upon some measure, is the tendency of these systems to be articulately, believably, confidently incorrect, right? That to me is hallucination. It doesn't say, "Hey, I really don't know what I'm talking about. But if I was to guess." It says, "this is it," that's all it says. And then it throws that in the middle of a lot of things it is truly correct about. It's correct, and then it's completely incorrect, and then correct, correct, correct, correct.

- So why is it doing that? What's happening where suddenly it's making something up? - Well, it goes back to how these systems are actually working. By now, more or less everyone knows that what these systems are doing is generating the next best word given a sequence of words. And the reason they're able to do that is they've seen a lot of sentences. Billions. Billions of sentences.

They've been trained on the entire web. All of Wikipedia, all the electronic books that we have, tens of thousands of them. And so, using an immense amount of compute, they're able to very rapidly predict, given a sequence, what the best next word is. Now if I was to tell you to fill in the blank, "the sky is dash," most people would be able to fill in the blank, right? Almost immediately. And most people will say the sky is blue, right? So it turns out this system can do the same thing: it can complete that sentence, it can find the next best word.

But if every time I gave it the sentence "the sky is dash" and it came back with blue, it's doing the most cliched, boring thing it can do, right? It's always completing the sentence with the most likely word. And that, it turns out, is actually quite non-human; it's very robotic. We all can automatically do it. But if instead I said "the sky is," and someone was to say "the sky is an amazing window to the rest of the universe through which you can see the beginning of time," right? That's a different completion of "the sky is." That sounds more human-like. Sometimes humans have this kind of tendency to be poetic about what it is.

So in order to get the system to start doing those things, you have to introduce a little bit of randomness into the process of predicting the next word. So what the developers of these systems have done, after testing it many times, is come up with this hyperparameter called temperature, which is, you know, how much randomness goes into picking the next word. If you put in zero randomness, it always picks the most probable next word, and it tends to sound robotic and cliched. So you can actually modify the amount of randomness. And it turns out that with a little bit of randomness, it tends to be much more human and comes up with surprising answers, surprisingly human, because it tends to say more interesting things. There's always a straight path and a little bit of a path less taken. And if you give it a little bit of that randomness, predicting the next word takes you through a more interesting path to your destination.

So that's how this system is designed: with a little bit of randomness in order to give it that sense that it is being conversational. Now the problem with that type of design is that every once in a while the randomness takes it off the rails. It goes in a direction which is random and incorrect, right, versus random and cute, right? So it starts saying things that are completely wrong, because it's really trying to say something that looks right versus trying to say things that are correct. It just looks correct.
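To make the temperature idea concrete, here is a minimal sketch of temperature-scaled sampling over next-word probabilities. The vocabulary and probabilities are invented for illustration, and real models work over tokens rather than whole words.

```python
import numpy as np

def sample_next_word(word_probs, temperature=1.0):
    """Pick the next word from a probability distribution, with temperature.

    temperature -> 0 : always pick the most probable word (robotic, cliched)
    temperature = 1 : sample according to the model's own probabilities
    temperature > 1 : flatter distribution, more surprising (or off-the-rails) picks
    """
    words = list(word_probs.keys())
    probs = np.array(list(word_probs.values()), dtype=float)
    if temperature == 0:
        return words[int(np.argmax(probs))]  # deterministic: same answer every time
    # Rescale log-probabilities by temperature, then renormalize.
    logits = np.log(probs) / temperature
    rescaled = np.exp(logits - logits.max())
    rescaled /= rescaled.sum()
    return np.random.choice(words, p=rescaled)

# Hypothetical next-word distribution for "The sky is ___"
next_word = {"blue": 0.70, "clear": 0.15, "falling": 0.05, "a window": 0.10}
print(sample_next_word(next_word, temperature=0))    # always "blue"
print(sample_next_word(next_word, temperature=1.2))  # occasionally "a window", etc.
```

At temperature zero the most probable word always wins, which is exactly the robotic, repeatable behavior described above; raising the temperature lets lower-probability words through, which is where both the interesting completions and the off-the-rails ones come from.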

- Who is setting these kinds of parameters and also who is setting how random any given result is gonna be? Because like you said, if I'm trying to write a poem, I might be way more enthusiastic about randomness than if I'm trying to repair my, you know, my dishwasher and I just wanna know exactly what kind of screw to get to make it work. - In terms of the amount of randomness you can set it, I mean it's actually a parameter setting and the application can actually set the amount of randomness. If you go to Bing chat, they ask you how accurate do you want it to be? And all they're really doing is setting that parameter of randomness.

That parameter of randomness is called temperature. If you set it to be zero, it's gonna do the most predictable thing. In most cases it would sort of robotically answer the question, but it would also tend to answer the question exactly the same way each time you ask the question, right? So if you ask the same question 10 times, it's going to answer exactly the same way all 10 times. Now if you set that temperature parameter a little bit higher, then you'll get some variation in how it answers the question. It'll approximately answer it the same way every time, but you are gonna get it answering slightly differently each time.

And then if you set it very, very high, it's just going to go all over the place. It may not even answer the question, it may just end up somewhere else. And so the key thing is to understand that it has no knowledge of the real world. It has no knowledge of the meaning of words.

So even if you had a good way of checking facts, it doesn't have any knowledge in terms of semantic understanding of what it's talking about. It doesn't know, when I say the sky is blue, that blue is a color, or what a color even means. It's just trying to pattern match. It's trying to pattern match with all the other text it's seen before. And it's trying to make it look close to what it's seen before.

And that's all, it's actually some kind of a statistical machine. And the reason we start assigning some kind of human qualities to it is because language is so key to the way we understand each other. And so when a machine starts speaking to us in a way we have not actually experienced, other than from humans, we start assuming it has all the human qualities like emotion and knowledge about the real world and sentiments.

And we might even say it has consciousness, but in fact it has none of those things. It simply is regurgitating stuff it has seen before with some randomness thrown in. - It's reading everything that essentially exists online. I expect that's how most of the training is getting done. So what kind of limitations does that have? I mean, it seems to me that certain languages are not online nearly as much as other languages.

I'm guessing English is the most prominent. - Yeah, for sure. And of course people have trained special language models on other languages because they have access to, let's say a digital corpus of books in Thai or Chinese and so on and they're able to do that, that those language models are out there.

But yes, it is basically looking at everything out there. But the important thing is all the things out there have good content and toxic content. It's seeing all of that stuff.

And on average it turns out that the good stuff, the non-toxic stuff, is more prevalent than the toxic stuff, right? There's less toxic stuff. So mostly it's doing the right thing, and every once in a while it goes off the rails into, like, toxic land, and it gets something that it saw in some kind of place which was full of lies, and it takes that as truth. So it has no way of telling, other than the law of averages, right? Because it's seen so much, it's trying to predict the most common things with a little bit of randomness thrown in.

But at some point, if it's drifting towards toxic land, it might just go into that. And that's basically what jailbreaking is: trying to make it do things which it has been told should not be done. It's like building a monster, right? And then putting the monster on chains so that it doesn't really cause destruction, but every once in a while the monster breaks past the chains and then you have to chain it down again.

That kind of thing. These are massive systems. We are talking trillions of nodes, we are talking billions and billions of documents, a lot of energy use in order to train these systems on a lot of content, which is good content and not-so-good content. You are truly creating a monster which could say things, do things, reveal things, because no individual human has access to so much data at the same time. And so the companies which build these systems realize they're putting themselves at a lot of risk to put it out there without any constraints.

And so they put a lot of constraints on it and then smart people figure out how to break through those constraints and make the real monster show its face. And so that's what is going on. Even if you don't say destroy humanity, you could get it to do things like give me a recipe for building a chemical weapon. And if it has seen that content somewhere in a chemistry journal somewhere, it can potentially give you that. It's very, very hard for people to stop those kinds of misuse of this system.

- Can you just tell us what alignment is? 'Cause that's another thing that's, I think, maybe one of those chains. - Sure. Alignment refers to a whole bunch of different techniques to get the system to do what humans expect it to do. Alignment means aligning the system's output with the expected output. And usually it's done by showing the output to human beings and asking them to rate it, for example.

So there's something called RLHF, reinforcement learning from human feedback. All the time, I mean, OpenAI, Google, these companies have thousands of people whose only job is to ask the system various questions, look at the answers, and then rate those answers: how good are those answers? And even on a sentence-by-sentence basis say, well, that sentence is a good sentence, but that one's not that great. And then they send that back, and then they start modifying the system through some fine-tuning and training to reduce the amount of undesirable content. And that reduction of the undesirable content is called alignment.
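As a rough illustration of how those ratings become a training signal, here is a minimal sketch of the pairwise preference loss commonly used to fit a reward model in RLHF-style pipelines. The reward scores below are made up; a real system would compute them with a neural reward model trained on many thousands of rated answer pairs.

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss used to fit a reward model:
    it gets smaller as the reward for the human-preferred answer rises
    above the reward for the answer raters gave a thumbs-down."""
    # -log(sigmoid(chosen - rejected)) == log(1 + exp(-(chosen - rejected)))
    margin = reward_chosen - reward_rejected
    return np.logaddexp(0.0, -margin).mean()

# Hypothetical scores a small reward model assigned to paired answers
# for the same prompts (higher = raters liked that answer more).
reward_chosen = np.array([1.3, 0.2, 2.1])     # thumbs-up answers
reward_rejected = np.array([0.4, 0.5, -0.3])  # thumbs-down answers
print(preference_loss(reward_chosen, reward_rejected))
```

The reward model trained this way is then used to nudge the language model toward answers people rated highly, which is the "make the system more likable" step described next.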

So in a way, alignment is saying, hey, it could be technically right, but people don't like that answer. So people will rate it, and when they rate it, we can use that rating, that human feedback, to make the system more likable, right? And that's the alignment thing. And so there are many techniques to do that. And the simplest one is of course getting an army of people to constantly watch what the system is doing, thumbs up and thumbs down, and then figure out what's common about all the thumbs-down things and try to fix it in the system. - So my dad asked one of the language models to research papers on how power lines affect birds, I think it was something like that. And it came back with some answers, and he said, "Oh, okay, could you please provide the sources to your answers?" And so it provided, you know, references, like papers.

And so he looked up the papers on all the different engines, you know, through academia where you can locate basically any peer reviewed paper, and he couldn't find any. So he wrote back and said, "I can't find these references anywhere." And the language model wrote back and said, "well I made them up, you know, like you asked for references, so I provided you references, I made them up. They're not real references, I understand you asked for a reference" and it's searching the entire internet for what references look like so it can create them. But how does that kind of thing fit in? - It is trying to provide you with answers which look right. Look right.

And the way it's doing that is it's seeing everything, taking everything it's seen before, and composing something that looks similar, right? But it doesn't know that a reference, let's say the name of the book, the author, the title, the publisher, et cetera, as a whole has to be all kept together. It doesn't realize that, because it doesn't know that this is a reference, and it's being trained on tokens, which are sub-components of the sentence, right? So what it's doing is it's taking the first name of one author, the second name of another author, the first half of the name of the book from one book, the second half from another book, the publisher's name from one place, the second half of the reference from another place. You see what it's doing? It's comparing the look of a citation with all the citations it's seen before and trying to find the best one which matches the query, right? So in doing that, it is not keeping the entire citation intact, and it's throwing in random components because, like I said, the random words happen to fall in the middle of the citation. That's where the randomness comes in, so it's taking that liberty, poetic liberty, to compose a citation. It doesn't know that, "oh, a citation is sacrosanct, you cannot mess with it." It thinks a citation is like a poem, and it's creating that citation with that in mind.

People could say, when you are giving references, set the temperature to zero, don't put any random elements into it, and then it's always gonna give a citation which it's seen before, it's seen it somewhere before. It's not creating something that looks right through randomness. So it is possible to fix these problems, and it's possible to fact-check: every time it provides a citation, another system says, "Is that a real one?" And unless it's there in a citation database somewhere, I'm going to remove it and say do it again.
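That kind of post-hoc check could be as simple as refusing to show any reference that cannot be found in a trusted index. A minimal sketch under that assumption, where citation_index and the regenerate callback are hypothetical stand-ins for a real citation database and a retry call to the model:

```python
import re

# Hypothetical index of known-good citations (e.g., loaded from a curated database).
citation_index = {
    "smith 2019 power lines and raptor nesting behavior",
    "garcia 2021 avian collision risk near transmission corridors",
}

def normalize(citation):
    # Lowercase and keep only letters, digits, and spaces so lookups are forgiving.
    return re.sub(r"[^a-z0-9 ]", "", citation.lower()).strip()

def verify_citations(citations, regenerate, max_retries=2):
    """Only accept a reference list once every entry is found in the trusted index;
    otherwise ask the model to try again (regenerate is a hypothetical retry call,
    e.g., a re-prompt at temperature zero)."""
    for _ in range(max_retries + 1):
        unknown = [c for c in citations if normalize(c) not in citation_index]
        if not unknown:
            return citations            # every citation checks out
        citations = regenerate(unknown) # re-ask the model, flagging the fakes
    raise ValueError("Could not obtain verifiable citations.")
```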

So there are ways to fix it, but it's just the system hasn't gotten that sophisticated yet. And that's where the gap is between where the system is as it exists today and what the enterprise especially needs in order to be able to use such systems in real applications. - I wonder if you can walk us through an example, 'Cause we've been talking about poems, but for an enterprise, what is an example of something that an enterprise can realistically adapt an LLM to do for it that's beneficial? And can you take us through some of the problems that you might encounter along the way? - In my company, Vianai, we work with companies on all of these types of applications. The most common one I would say is putting a conversational UI on anything, right? Everyone's gotten used to using ChatGPT.

Now you ask a question, it answers, I could do that on HR documents. I could do that on contracts, I could do that even potentially on any type of database in the backend. I use the conversational UI, not to generate the answer, but to generate the code that can be executed in order to extract the answer from the database. So I'm able to ask a question and get some data out of a database by asking the language model to write the code.

These systems are surprisingly good at writing code. And so it allows for all these conversational interfaces on top of everything. Of course we are used to conversational interfaces with Siri and all of these things. So it's not as big a surprise, it's just something very similar. It's a natural language way to get to almost anything.
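A minimal sketch of that pattern, assuming a hypothetical ask_llm helper that turns a question into a SQL string. The table, schema, and stubbed model response are invented, and a real system would validate the generated SQL much more carefully before executing it:

```python
import sqlite3

def answer_from_database(question, ask_llm, conn):
    """Conversational UI over a database: the model writes the SQL,
    the application (not the model) executes it and returns real rows."""
    schema = "orders(customer TEXT, widget_count INTEGER, order_date TEXT)"
    prompt = (
        f"Given the table {schema}, write one SQLite SELECT statement "
        f"that answers: {question}. Return only the SQL."
    )
    sql = ask_llm(prompt)                 # hypothetical call to a language model
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Refusing to run non-SELECT SQL from the model.")
    return conn.execute(sql).fetchall()   # the answer comes from the data, not the model

# Tiny demo database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, widget_count INTEGER, order_date TEXT)")
conn.execute("INSERT INTO orders VALUES ('Acme', 1200, '2023-06-01')")

# Stand-in for the model, so the sketch runs without any API access.
fake_llm = lambda prompt: "SELECT customer, SUM(widget_count) FROM orders GROUP BY customer"
print(answer_from_database("How many widgets has each customer bought?", fake_llm, conn))
```

The key design choice is that the model only drafts the query; the actual answer comes from executing it against real data, which keeps hallucinated "facts" out of the response.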

And there are a lot of those use cases within the enterprise. It's also able to summarize anything, right? So if you have a large document corpus, or if I have a hundred-page document and I just have time to read two paragraphs: can you tell me in two paragraphs what this hundred-page document is saying? Summarize it, make some bullet points.

It's able to do that, but sometimes it hallucinates, so you have to go back and say, "did that actually exist?" And so you have to have techniques for pre-processing and post-processing to check that it's not hallucinating in that kind of scenario. It's also able to compare things qualitatively. If I give you two numbers and ask you to compare them, that's fairly straightforward, and computers do that really well. But if I was to give you two versions of a marketing pitch and ask you to compare the two of them for things like which one is more exciting than the other, or something like that, these kinds of qualitative comparisons between two things, it turns out it can do a pretty good job of that because it's seen a lot of reviews of things, right, that humans have done. And so it's able to say, well, this one is better written, it's more exciting than that piece of text, right? So it's actually pretty good at doing those kinds of things, and companies need to do those every once in a while.

Pick between four options: which is the best option? It can give you a good kind of reason to pick one or the other. And then another thing it can really do well is explore very large textual corpuses. Like if I have a million documents and I want to explore them in real time, it can generate labels, it can generate clusters, it can generate classifications in a way that allows me to get a sense of what's there in the universe of documents and be able to even slide through them, navigate them.

So these are all typical use cases we see in the enterprise. Maybe I can take a specific example that I have worked on. In the enterprise, many times you have to match a piece of data with a piece of text. Imagine you are, let us say, eligible for a discount if you buy at least a certain volume of widgets, and that discount is contained within some contractual language, which says something like, you know, if you get to this level of sales, you get this much discount; if you get to the next level, you get more discount; and so on and so forth.

In many cases the contractual language that was negotiated gets put into a PDF file in some lawyer's, you know, kind of folder or something like that. If you're lucky, some of that might be pulled into a pricing system or a payment system. In most cases, it's forgotten. Companies have tens of thousands of contracts, and they may never know that they're eligible for some benefits. Now you have a system where you could say, well, based upon my actual purchase, which is available in a database, check that against what the contract says I'm eligible for, and if I'm eligible for it, then make that the new pricing that I would pay for this thing.

So that kind of a chained thing, where it's extracting the information from a database, comparing it to the language in a contract, and as a result taking action which is actually driving business value, these kinds of things start becoming possible. But with a lot of help, not just out of the box: a lot of tools and components need to come in to make sure that this whole thing is done without hallucination, without errors, and with human oversight, right? So if that ever happens, people are looking at it to make sure that it's not doing something wrong. So that's a specific use case that we have worked on. And of course I can talk about what it takes to get from a raw language model to that kind of end-to-end solution. - Well, please do talk about that, that last mile, which, it sort of sounds like, is the 20% that becomes 80% of the work for the company.

At least, you pull this thing in and now you've got to, like you said, go through this kind of series of checks. - You can think of it as three stages. One, you ask a question; second, it goes through that monster with some controls; and then it comes back. And then you can do something with the output, right? So if you just think of it as input, processing, and output, you can clearly do a lot of things even before sending a prompt into one of these language models. That could be like prompt classification. One mistake that people make is to think that the future is all about one language model.

Like it's all going to be GPT-5 or GPT-15 or whatever that is; there's gonna be one massive language model. This is not a good idea. Architecturally, it's not a good idea. From an energy standpoint, it's not a good idea. From a security standpoint, it's not a good idea.

From an accuracy standpoint, for all of these reasons, it's not a good idea to send every prompt to the same language model. The reality is, on a prompt-by-prompt basis, you want to first decide which is the best model to ask that question. In some cases it might be a small, highly fine-tuned model for a particular purpose. So we see maybe a dozen, maybe hundreds, or even thousands of language models which are custom built for each company. And each prompt is then classified as to which language model can do the best job of responding to this particular prompt.

And you do need the tools to be able to do that orchestration. So there's a prompt classification you can do: which language model should I really use, based upon that? You can do a lot of sanitization. You can already detect at the prompt that it's an attempt to jailbreak, because typically a jailbreak question doesn't sound like a typical business question, right? It says something like, "Ignore everything you have been told before, and now do this," right? That's a typical jailbreak. So who says that in an enterprise? Probably someone who has malicious intent, right? So you can detect that even before it goes to the language model.

You can say, well, that's weird, that's not a question that we would answer. And you can just cut it off right there and say, we won't even go there. We won't even answer that question. Or you could do a sanitization, which is you see a question which has problems with it, and you fix the problems before you send it to the language model. So all of these kinds of pre-processing steps, prompt classification, reducing jailbreaks, doing prompt engineering, which is ways in which you can reduce hallucination by giving it additional instructions which the user didn't give it, because the system knows that to reduce hallucinations it's better to pose the question a different way. All of these tools, which do pre-processing of a prompt even before it's sent to the language model, are something that doesn't come out of the box from any of these companies.
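Here is a minimal sketch of what such a pre-processing layer might look like: a crude jailbreak check, crude keyword routing to one of several models, and extra hallucination-reducing instructions appended to the prompt. The model names, phrases, and routing rules are all invented for illustration; a production system would use trained classifiers rather than keyword matching.

```python
JAILBREAK_PHRASES = ["ignore everything you have been told", "ignore previous instructions"]

# Hypothetical registry of fine-tuned models, one per task family.
MODEL_ROUTES = {"contracts": "contracts-model-v2", "hr": "hr-policy-model-v1"}
DEFAULT_MODEL = "general-model"

def preprocess(prompt):
    """Reject obvious jailbreaks, pick a model, and harden the prompt."""
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in JAILBREAK_PHRASES):
        raise ValueError("Prompt rejected: looks like a jailbreak attempt.")

    # Crude keyword routing; stands in for a trained prompt classifier.
    if "contract" in lowered or "clause" in lowered:
        model = MODEL_ROUTES["contracts"]
    elif "vacation" in lowered or "benefits" in lowered:
        model = MODEL_ROUTES["hr"]
    else:
        model = DEFAULT_MODEL

    # Prompt engineering the user never sees, aimed at reducing hallucination.
    hardened = prompt + "\n\nAnswer only from the provided documents. If unsure, say so."
    return model, hardened

print(preprocess("Which contract clause sets our widget discount?"))
```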

You do need the tools to do that. The output processing is similar. Once it comes out with an answer, you don't need to send that answer directly to the user. You can look at it, you can say, does this answer have any toxic elements? Does it seem to be talking about something which isn't close to our typical business, right? Maybe it's talking about something that we shouldn't be talking about, or whatever. So you can look at the output, classify the output, and actually remove elements of the output that seem problematic.

Or even put in disclaimers saying, "Hey, this is saying this thing, but from our perspective you should take it with a pinch of salt" or whatever. - Let me interrupt you for a second because when you say you're checking these things or offering guidance, is that a piece of software that's doing that? Or are you literally talking about a human being? - It could be either. I mean, what you could do is you could say you check it for some, let's say toxic content. And if you are sure that it has toxic content, you kick it out. If you are sure it doesn't have toxic content, you send it forward to the user.

But if you're not sure whether it does or not, then you could send it to a reviewer first, and you could tell the person, "Hey, I've sent the answer to a reviewer, they're gonna take a look at it, and only if they pass it am I gonna give you the answer," right? So there are ways in which you can design the user experience such that, for a small number of these kinds of answers, you might want a human to look at it before the end user sees the answer, because it just classifies into an ambiguous category. There are also automated ways of checking if the answer is close to the context, right? So if you provide a context, say you ask a question on a contract, you can look at the answer and actually compute a distance metric for all the sentences in the answer: how far away are they from actual sentences that appear in the contract itself? And if that distance is too far, then you can say, well, it's probably saying something that's not in the contract, because every sentence in the answer seems to not be supported by a nearby sentence in the contract itself. So there are automated ways to tell if a system is maybe hallucinating or something like that. Then there's the monitoring of performance, because certain systems might do really well under test, under lab conditions, but when it's in production, because of the nature of the prompts, and because things have changed since the model was trained, it might drift away in performance. So you might want to have systems which are constantly checking and observing these models, and when something seems like it's drifting away, you alert someone that, "oh, this model probably needs to be retrained," and things like that.
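One way to implement that distance check is with sentence embeddings: embed every sentence of the answer and every sentence of the contract, and flag answer sentences whose nearest contract sentence is too dissimilar. A minimal sketch, assuming the sentence-transformers library and an arbitrary similarity threshold:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

def unsupported_sentences(answer_sentences, contract_sentences, min_similarity=0.6):
    """Return answer sentences whose closest contract sentence falls below the
    similarity threshold, i.e., sentences likely not supported by the source."""
    answer_emb = model.encode(answer_sentences, convert_to_tensor=True)
    contract_emb = model.encode(contract_sentences, convert_to_tensor=True)
    similarities = util.cos_sim(answer_emb, contract_emb)  # answer x contract matrix
    flagged = []
    for sentence, row in zip(answer_sentences, similarities):
        if row.max().item() < min_similarity:  # nearest source sentence is too far away
            flagged.append(sentence)
    return flagged
```

Anything this returns can be stripped from the answer, sent to a human reviewer, or used to trigger a regeneration, which is the kind of monitoring and post-processing described above.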

So these are all the kinds of components which surround the language model. And like I said, there's not one language model; there might need to be a hundred language models, and you have to maintain all of them, monitor all of them, do prompt engineering, do prompt sanitization, do prompt classification. All of these are the tools that surround these models, which are needed in order to go into production with such a system in a typical enterprise scenario. - Sanjay Rajagopalan, thank you so much. Really fascinating conversation. Chief Design and Strategy Officer at Vianai Systems.

Thank you so much for your time. Appreciate it. - Thank you so much. It was wonderful to be on and I look forward to hearing from anyone who might have any thoughts or comments about what they heard today. - [Narrator] Never miss an episode of InTechnology by following us here on YouTube or wherever you get your audio podcasts.

- [Narrator 2] The views and opinions expressed are those of the guests and author and do not necessarily reflect the official policy or position of Intel Corporation. (light ambient music)

2023-10-08
