Build and maintain your company Copilot with Azure ML and GPT-4 | BRK211H
Hey guys. Cool. Thanks for coming. I'll be joined up here shortly by Daniel and Seth, and they're going to go through a demo. You haven't seen this exact one before, but you'll see a theme where we're doing a kind of retrieval augmented generation. This time we're going to use prompt flow. So you saw the more wizard-like view earlier, but this should be exciting. That said, you can pretty much do anything you want. We've been focusing on this retrieval pattern because it's very popular, but you can put any APIs in there you want. So keep that in mind. Also, I'm going to go super fast, and they're going to go super fast. I apologize in advance, but it is sort of an advanced session, so
I'm trusting that we can play it back in slow-mo later or something. So I'm sure you've heard these foundation models are really popular; they kind of took everyone by storm. I'd say it really started to get crazy maybe last summer to last fall, and it's just gotten more so as we've moved into things you can do practically. One common theme is that the place to do these is on Azure, which I think is super awesome since I work for Azure. And another really cool thing about it is that it really hits all industries. I mean,
there are some things all companies have, like call centers and knowledge bases, but even if you get into the specific verticals, you see a bunch of problems that are solved much better by this technology than by previous technologies. I should say I'm a fairly long-term machine learning person, so when it came out I thought it probably didn't really work. I mean, we kind of hype stuff up a lot. And then this one started to work and I was like, wow, this thing really works, and it just kept getting better from there. I mean,
we probably have time for an anecdote. So I got an Android phone. I'd been an iPhone user for a while, for no particular reason. I used to be a Windows Phone user, but you know, these things happen. And then I got an iPhone, and then, you know, everything went iPhone. But anyway, I got an Android phone and I can't type on it. I honestly cannot type
on it. Every time I go to hit the spacebar I hit V. I went through all four of the keyboards, and then my options were to learn SwiftKey, which I know is a Microsoft product, but it's too newfangled for me. So I wrote my own keyboard, and it took me probably eight hours to figure out Java again, and you end up typing Java forever for like one line, but anyway, it's actually calling ChatGPT to do all the spell correction. And you can do spell correction on the whole thing: you just highlight the whole thing, send it to ChatGPT, and have it come back. And it works.
And you don't have to worry about the case where, in regular spell correction, you type the first letter wrong and it doesn't know what to do with that on a phone. So anyway, it's actually quite amazing, the technology. I want
to call out one company in particular we've been working with, called Sinner AI. It's a great example of being able to scale this technology. Right now they're processing over 400,000 news articles a day, where they take all that information, glean insight from it, and then produce things like daily company profiles: what's the health of the company, how are they doing against their competition, and so on. So it's a really interesting competitive analysis tool. Super cool. And their ability to scale on Azure has been amazing.
OK, so it gets a little confusing, all the Azure AI stuff we have, but think of this layer cake. At the top we have a bunch of services that are sort of like a business function; they solve a problem for you. In the middle are the Cognitive Services, which are models as a service, and at the lower layer is Azure Machine Learning, which is the platform. All this stuff is built on top
of Azure Machine Learning, and we have a set of tooling, APIs, and interfaces where you can build directly on Azure Machine Learning. And of course this all sits on AI infrastructure. So you'll hear things like Azure AI and AI Studio; that's kind of the collection of this stack. You'll hear about these throughout the week. I just want
to call them out real quick. So, Azure OpenAI Service updates: deploy on your own data, that's the sort of enterprise ChatGPT you've been seeing, where with just a few clicks you can get something up and running. Really cool, fantastic work by the team. The next one is a new SKU we're coming out with. I'm really excited about this because you get sort
of dedicated bandwidth, a dedicated allocation of tokens per second or something like that. That'll be awesome. And the third one is the plugin service that you've been hearing about, and you'll continue to improve your understanding of what you can do with that over the course of the week. In Azure Machine Learning, what we've been doing is taking this platform that has been fairly well established for traditional machine learning and DNNs and adding in capability for generative AI application development. And what you'll see today in prompt flow is really how you orchestrate this fairly complex business logic. A lot of times they'll show an example and it might be a three-step thing, but in practice, once you get going, it often gets more complicated than that. You have multiple data sources, you want to filter or post-process the data coming back from an API call, all this kind of stuff. So we'll go through
that today. But it's not just prompt flow; there's also native support for any model you want in what we call the model catalog. We have a bunch of OSS models in there, and that goes live this week. The ability
to deploy this to an endpoint, to put responsible AI into it — there are a bunch of different ways you can do that. We have some really cool dashboards that you'll see later today; I really encourage you to go to that session. And of course, once you deploy it, because it's on managed inference in Azure ML, it scales like crazy.
And you can monitor it. You get like specific monitoring that's out-of-the-box for LLM applications, like how many tokens am I using and stuff like that. So really excited about that. But you know, let's see if we can get
to it. Can we bring Daniel and Seth up here? All right, there we go. We don't even get, like, walk-on music or anything. Hey, hey, your computer's locked here, Greg. Hey, Seth. How's it going? You've got to get it...
I should probably... No, but you've got to unlock your machine, man. Yeah, this is my boss and I'm using his machine, so I'm pretty excited. All right, so what are we doing? OK, Seth, why don't we start this off with a little challenge? OK.
Can you explain in three minutes how LLMs work and what RAG is? It's been kind of all over the keynote. Are we going to time it at three, like an actual three minutes? Yeah, I've got it. Three minutes. OK, there you go. Large language models, all they are — I'm going to go into TED talk mode and use my hands aggressively. All
large language models do is take a fixed number of tokens or words and predict the next most likely token, and they do that in a loop. That's it. Three minutes? Zero of three minutes. OK, so, I mean, it's like, OK, OK. That was quick.
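For reference, here is a minimal sketch of that "predict the next token in a loop" idea. It assumes the Hugging Face transformers library and a small illustrative model; this isn't what was shown on stage, just the concept.

```python
# Minimal sketch of next-token prediction in a loop.
# Assumes the Hugging Face transformers library; gpt2 is only an illustrative model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The Eurovision Song Contest is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                        # generate 20 tokens, one at a time
        logits = model(input_ids).logits       # scores for every token in the vocabulary
        next_id = torch.argmax(logits[0, -1])  # greedily pick the single most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```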
So that's the first thing. If it returns the next likely tokens, then what you put in the first tokens, before you send them to the model, is "soups important," to talk like the kids do today. Is that how the kids talk today? "Soups important." What you put in the prompt is super important. So I thought I would do a demo, because you and I obviously are singers and we like the Eurovision Song Contest. So let's go to my machine right
here. Are we seeing my machine? OK, good. So what I'm going to do is change this prompt to say something else. And I actually think I still have it on the clipboard. There you go. So this is what I'm changing the prompt to: you are an AI assistant that
helps people find information about the Eurovision Song Contest. And you're from Germany, right? How did they do? That's a bit of a sore point. We scored
25th out of 25, I think. You answer concisely with whatever information you know. If someone asks a question related to anything else, say "salsa." We're going to be using GPT-4. And I'm going to ask it: who won the first ESC? Oh, I didn't even spell it right. Who cares? It's a language model. Who won the first Eurovision Song Contest? Boom. Switzerland. Okay. Who won ESC in 2023?
It doesn't know. Why doesn't it know? Because the model was trained before 2023. So what I'm going to do is put information into the prompt to help us out. So how do you
make a new tab? Is it Ctrl+T? I don't know, press New Tab. You're making me look dumb. Eurovision Song Contest 2023 — let's click on that goodness here. And I'm just going to copy all of this here, Ctrl+C, go back to the playground, and say: this is what you know. I don't know why I'm
making noises, but I just feel like we need to make a computer noise. So I'm going to say: who won ESC in 2023? Sweden. Sweden. Do you see that? That is effectively RAG, retrieval augmented generation. It was just an artisanal
way of doing it. You know, artisanal, handmade. What should I put on my tacos? Am I over? Oh no. Hold on, hold on, hold on. What should I put
on my tacos? Does that work? It works. Is that helpful? I had not seen that work before; we actually rehearsed this. By the way, this is a real thing; this isn't like a videotape like we had in the 90s. How do I say videotape? That was weird.
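That "artisanal RAG" — paste retrieved text into the system prompt, then ask the question — looks roughly like the sketch below. The endpoint, deployment name, and file name are placeholders, and it uses the current openai Python SDK rather than whatever the playground generated.

```python
# A hedged sketch of manual retrieval augmented generation with Azure OpenAI.
# Endpoint, key, deployment, and the wiki_text file are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-02-01",
)

wiki_text = open("eurovision_2023.txt").read()  # text copied from the Wikipedia article

response = client.chat.completions.create(
    model="<your-gpt-4-deployment>",
    messages=[
        {"role": "system",
         "content": "You are an AI assistant that helps people find information "
                    "about the Eurovision Song Contest. This is what you know:\n"
                    + wiki_text},
        {"role": "user", "content": "Who won ESC in 2023?"},
    ],
)
print(response.choices[0].message.content)  # e.g. "Sweden"
```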
Okay, okay. So now that we know what RAG is and how LLMs work, why don't we go into our demo where we actually show the app that we built? Cool. So let's go to Daniel's machine here. Do you want to switch over there? Yes, we did.
Since it's the era of the copilot — I tried counting "copilot" in the keynote; I don't know, I stopped counting after like 30 or something. So it's a lot of copilots. And obviously the question is, hey, how
do we build one? So we wanted to build one as well, and that's what we did. This one is for an outdoor company called Contoso, and the copilot is helping the customer service agent help the customer during a chat interaction. Right. And, oops, Seth is in. He's on his machine right now, so
he sees this. Yes, and I'm on my machine. I'm the customer service agent and I'm waiting for Seth to log in. I tried to. You might have to reset it.
And I'm sorry. Sorry. You left it up here and now it took a nap. Oh, look, there I am. There he was. Oh man, do it again. So let me do it
again. Let me log in as Seth here. And there he is. So the first thing this app shows me has nothing to do with AI. It's like, OK, Seth is logged in on our website with his user, so we know who he is, how he looks, and also where he lives. Ha ha. And we have his recent purchases, right? So we know what he bought. So when he
asks us a question, we can make a connection between what he bought and what the question might be about. Should I ask the question? Exactly, go ahead. Important question here. Important question: I'm planning a trip to Patagonia. Will my sleeping bag keep me toasty warm? Yeah, that's an important question, because I don't know where Patagonia is, but it feels like it could be cold.
Yeah, it's at the southern tip of South Africa — South America, sorry. OK. So we've got two things that the copilot is now doing for me. First, it infers an intent: what does the customer want to do here? They have a product question. They could also want to, say, return a product or something. And then, if it's in regard to a product, please identify which product it is and give me some information on that product. That's what's happening over here, and that's where the retrieval part comes in that we talked about. The copilot is actually retrieving that information and showing it to the user firsthand. And then, secondly, it is going to use
that information to craft a good reply. And in this reply it's saying, well, the cozy night sleeping bag is rated for three-season use and has a temperature rating of 20 to 60 degrees Fahrenheit. And is that correct? Actually, yes, it is. Because if
you look down here — you can zoom — yeah, you can see that the information about the sleeping bag is indeed correct. So I can now return this to you, so you actually get the reply, and then, you know, the chat continues. What do you have next? Oh, so super nice. What about my compass? How can I use that? Yeah, let's see. Oh, oh.
Oh, another product question. Now it's about the compass. OK, the Pathfinder compass is a great tool for navigation, and so forth and so forth. So it works. That is our copilot.
And this is an actual copilot that's actually helping someone with their job. Because I can, in theory, log in as someone else and they could be helped too. You see what I'm saying? So it's a copilot that helps. So how do we build such a thing? Yeah. So let me go into a little
bit of detail. Let's do it. So actually, I have one slide here first, in case you didn't fully grasp how this RAG thing works. A user question comes in. The first thing that happens in RAG, retrieval augmented generation, is the retrieval part. So usually there's some search.
In this case, we had to find out what the product was, and then we go to Cognitive Search and find some information about the product. Then we stick that into the prompt, just like Seth did with the stuff he pulled down for the Eurovision Song Contest. From Wikipedia. That's right. Only we're going to our product database, and we stick that into the prompt, and then we pass the user's question right into that prompt as well and say: that's what the user wants to know, and here's what you know. And then the large language model comes back with the result. So, in a little bit more detail,
if we break it up — how exactly did we build this — you see the same kind of components; there's just one extra thing happening at the beginning, which is this find-intent step. So we've got a three-step flow here, right? Find the intent: we use ChatGPT to find out what the user actually wants and which product they're referring to. The second step is we just go to Cognitive Search with the product ID that came out of that first call. And then we do the next call, which is the actual RAG call, where we get the answer that we want. And if you notice, the first one was just as simple — we inject search results. Now we're injecting more,
because if that principle works with documents, it should work with a whole bunch of other information as well, right? Exactly. So why do we have this intent step? Well, sometimes people ask about products, then we go to the product database; sometimes people ask about opening hours for stores, then we actually go to our store directory. So we have a different set of data that we're going to pass in, and that's why it's interesting to be a bit more sophisticated. Yeah, that's the word they use to describe me sometimes. But OK, shall we look at the code? Let's do it. So I want to look at how this is actually built.
Yes. So in order to build this, I used LangChain, which is a popular open source framework. It helps you abstract over different types of LLM providers and has a bunch of components for building agents, plus tools that you can plug in. So it's very useful and it's quite popular. Using that, I'm basically building out the three components that I have — the customer intent, the product search, and the product RAG — as three different
files here. And you can see this customer intent one just takes in the customer info, which comes from the website, and then the chat history, which is basically all the back and forth we've had so far, with the last entry being the question from the user that's to be answered. And all we do here is we've got these templates, and these templates are basically — it's all about the prompt, exactly. And
all that's happening here is: OK, let me find the connection to the large language model that I'm calling in the back end in the cloud, then I build a prompt, stuff this data into the prompt, and run it. This is standard LangChain. Standard LangChain. And in the end I get a reply back, which is of course some text. I'm asking the model to return some YAML, but it doesn't always give me YAML back. So I have a pretty robust parsing function that doesn't crash when the data isn't fully well formatted.
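A hedged sketch of what such an intent step could look like — this is not the session's actual source code; the deployment name, prompt wording, intent labels, and YAML keys are all placeholders:

```python
# A LangChain LLMChain that fills a prompt template and defensively parses a YAML-ish reply.
# Assumes Azure OpenAI credentials are configured via the usual environment variables.
import yaml
from langchain.chains import LLMChain
from langchain.chat_models import AzureChatOpenAI
from langchain.prompts import PromptTemplate

llm = AzureChatOpenAI(deployment_name="<your-chat-deployment>", temperature=0)

template = PromptTemplate(
    input_variables=["customer_info", "chat_history"],
    template=(
        "Classify the customer's intent (product_question, return, other) and, if possible, "
        "the product item number they refer to.\n"
        "Customer info:\n{customer_info}\n"
        "Conversation so far:\n{chat_history}\n"
        "Answer in YAML with keys `intent` and `item_number`."
    ),
)

intent_chain = LLMChain(llm=llm, prompt=template)

def parse_intent(reply: str) -> dict:
    """Be tolerant: the model doesn't always return clean YAML."""
    try:
        data = yaml.safe_load(reply)
        if isinstance(data, dict):
            return data
    except yaml.YAMLError:
        pass
    return {"intent": "other", "item_number": None}

raw = intent_chain.run(customer_info="...", chat_history="...")
print(parse_intent(raw))
```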
So there are layers around what the model returns, both on the input and the output. Exactly. Smart. And then the second thing here is just the product search. It really just takes the item number in, goes to Cognitive Search to look for that item number, and gives me back what's there. If there's no product item, because the intent is something other than product related, I just give back some general information. So that's what's
happening here. And then this one is the product RAG, and that one looks very similar. There's just more information now; I'm passing in everything I know: what's the
current date, who's the customer, what have they ordered, and all the information that's been gathered along the way, and just sticking it all into the prompt and running another chain, another request to the model. So it's like three different LangChain chains working together as friends, you know? Exactly. Cool. All right, well, can we see it run? That's how it works.
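Wired together, the three steps could look roughly like the sketch below. It reuses the hypothetical `intent_chain` and `parse_intent` from the sketch above and assumes an analogous `product_rag_chain`; the search index, field names, and keys are placeholders rather than the demo's real ones.

```python
# A hedged sketch of the three-step orchestration: intent -> retrieval -> grounded answer.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="products",
    credential=AzureKeyCredential("<your-search-key>"),
)

def product_search(item_number):
    """Step 2: look the product up in Cognitive Search, or fall back to general info."""
    if not item_number:
        return "General store information: ..."
    results = search_client.search(search_text=str(item_number))
    return "\n".join(doc["description"] for doc in results)  # "description" is a placeholder field

def copilot_reply(customer_info: str, chat_history: str) -> str:
    intent = parse_intent(intent_chain.run(customer_info=customer_info,   # step 1: intent
                                           chat_history=chat_history))
    context = product_search(intent.get("item_number"))                   # step 2: retrieval
    return product_rag_chain.run(customer_info=customer_info,             # step 3: grounded answer
                                 chat_history=chat_history,
                                 context=context)
```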
And I think one thing, Seth, you might have realized: there isn't a whole lot happening in this Python. Not really. I actually looked over the lines of code, and it really wasn't a lot. No. Yeah. So either this is really simple, or are
we done now? Yeah, we're done. Let's go home. Let's pack it up. Let me demo it so people actually understand that this is actually working. So here's a console app that's all happening on my laptop, with the exception, of course, of the calls that go to the large language models. And so I'm just going to
be customer 13, which happens to be — that's me. Hi. Right. And then I can ask about, like, does
my compass work in South America, right? And what it's going to do is come back with an intent first and the product item number, and then it's going to give me an answer about how it works in South America — however, the accuracy, blah blah blah, and so on, right. So, we're done now? No, this is not working for me. I mean, this is
like it's 1987 and this is the interface we give our customers, number one. Number two, you just ran it. How do you know it actually works? You ran it one time. It worked one time.
Yeah, but how do you know this actually works with different prompts? Because I've heard that prompts are sensitive, kind of like me. Yeah. So that's where prompt flow comes in. Prompt flow is a new feature of Azure Machine Learning, the Azure AI platform, and it allows you to do four things. One, it allows you to run your prompt chains in the cloud — your LLM apps, to be precise — and you can use LangChain, you can use Semantic Kernel, you can use whatever framework you like as long as it's Python based, right? Secondly, it allows you to experiment with your prompts and your app and its settings, so you can run batches of data through it, you
can measure outcomes, and then start to change your app or change your prompt and see how those changes impact it. Well, I want to see it. Can we get in there? Let's do it. There are two more things — I said four.
Sorry, I'm sorry, Daniel. There are people watching. I know, gosh, are we in a rush? Yes, but we'll do it anyway. So thirdly, you can deploy your app to an authenticated endpoint, which has been a feature of Azure ML for a long time, pretty much since the inception, and it allows you to do things like safe rollout, traffic splits, and so on. And the fourth one is you can monitor the app as it executes, and you can see how those metrics develop over time as your app is running.
Can you show us? You promised us a lot of stuff. You're in a rush. People are in a rush. Yes, okay. So here are the prompt flows that I've built. The one that represents our copilot is
this one here. It's actually — big enough? Yeah, it should be. It's loading. Is it loading? Well, he said it was a big prompt flow; I guess he wasn't lying. A huge one. Can we get those pipes unclogged out
there, friends? Let's go back — maybe let's go to this one. OK, this is already loaded. OK. So just to show you, it actually did work. So
this prompt flow contains the same three elements that we just saw before, right? You've got the customer intent — let me make this a little smaller — and you can see it's very much the same code that I already showed you. It's exactly the same code, isn't it? Somehow I'm having trouble moving the divider. And the product search, same thing, same code that you see here,
where it just says product search, passing things in, so those three things are unchanged. And the new part here is that you have some inputs at the top that show what goes into the prompt flow — you have the customer info and the history, so basically your question, what you ordered, and all those things we know about you — and then there are some outputs. Here we're outputting the customer intent, the product context, and then the reply that we actually want. And this is cool, because I know all of you are looking at these like they're all chat apps. They can be whatever you want. It's not just chat. It's like a large
language calculator where you can do text stuff, and that's the thing I want to make sure I impress upon you. It's not just chat, it's way more than that, and hopefully you remember that, because it'll inspire you to do cool things. Yeah. So for example, you could also help a customer
that you're on the phone with: transcribe the call and then get that automatically into the system. So there are limitless opportunities, right? We're curious what you'll build. Now, what I haven't done yet is run this, so let's do that. If I hit the run button, it's going to execute this flow with the inputs that I have here. So one input
is here — again, it's you and your data — and here is what you were asking: I'm planning a trip to Patagonia, and will my... Oh, it's finished already. Yeah, it just finished. It's rather quick, and here you can now see all the inputs and all the outputs of this thing. Don't be shy. It's just cool stuff. I know this
is the first time we've seen this. This is cool. Look, this is cool. And hey, they clapped a little; it was golf-clap level. But the thing about it is that you did the same thing you did before — you tested one prompt — just in a different place. So how do we...
I want to make sure this works for most of my prompts. How do I do this? So there's a thing called bulk test here. If
I want to do more than one, I can just hit the bulk test button. There's a data set here that I have prepared which contains customer info and history, but a lot of it. It's the same kind of input from before,
but multiple, exactly. Okay. Yes, it also happens to have some ground truth; we'll get to that in a second. But these two are important. And then I'm just going to run this here — I'm not going to do evaluation yet — and then submit it. And so this is going to take a couple of seconds.
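To make this concrete, a bulk-test data set of this kind could be a JSONL file like the one written below. The field names and example rows are purely illustrative, not prompt flow's exact schema or the demo's actual data; each line pairs the flow inputs with a ground-truth intent for later evaluation.

```python
# A hedged illustration of writing a small bulk-test data set as JSONL.
import json

rows = [
    {
        "customer_info": "Customer 13, recent purchases: a sleeping bag and a compass",
        "chat_history": "Customer: I'm planning a trip to Patagonia. Will my sleeping bag keep me warm?",
        "ground_truth_intent": "product_question",
    },
    {
        "customer_info": "Customer 7, recent purchases: a tent",
        "chat_history": "Customer: What are your store opening hours?",
        "ground_truth_intent": "store_question",
    },
]

with open("bulk_test_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```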
Should we just stare at each other? Yeah. Or make, like, super funny jokes. No, let's go here and see what the outputs look like — I ran this before, in the interest of time. OK. So what we see now is, OK, we've got line number 0, there's the input information and there's the output, so it's all the same as before, but loads of it, right? So now I can
run a lot of them. That's really cool, but Daniel — let me read the script here. What do you have? How dare you? No. Wrong script.
No. The data scientist in me is looking at this and thinking, if I have a hundred of these things, am I just supposed to read them and be like, that looks good, check, and then just do that? I feel like my inner data scientist is screaming at me to be able to score this somehow. Yes.
So you want some metrics? Yes, yes, that's what I want. Let me show you. So let me go back to this guy and see how we can actually — oh no, actually, let me go to the other flow, which is our deep dive into just the customer intent. We just want to get really good at detecting this customer intent.
So you're looking at the first part of the prompt flow, just so we can show testing. And that thing has its own flow and its own prompt. Here, as you can see, you're given a list of orders, and then it says, OK, here are
all the possible things that you could infer as an intent — those five things, right? And so this is a classifier. For data scientists, it looks like a classifier: you get something and you assign it a class, right? So you can run a bulk test on this one where you use one of our evaluations, and as I said, we have some ground truth data here — we have the real intent of each question. So what we're going to do now is use an evaluation, and we use the classification accuracy evaluation for this one. Of course, now we have to tell prompt flow which one is the actual intent, so the ground
truth, that's that one, and which one is the predicted one, the one the model came up with. And then I'm going to submit that, right? And so this is like a standard scikit-learn style classifier that tells you this one, this one, or this one, and we're just testing it like a regular
machine learning thing, except it's a prompt. Yeah, I mean, the classifier is based on ChatGPT, correct? But the metric is just the standard one: how we measure this is, OK, you've got two things — are they the same? Correct. Are they not the same? Incorrect. And then let me show you, if you go here to the outputs.
Yeah, you can see now, once I turn on the evaluations — eval, here we go, sorry, here we go — product question, product question, correct. Product question, product question... incorrect. And then you see things like, OK, this one's got extra quotes; I don't want that, that's incorrect, right? And if you look at the aggregated value of that, we get an accuracy of 61%, like out of 100 we're correct on 61.
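The metric itself is just exact-match accuracy. A minimal sketch of the idea (not prompt flow's actual evaluation code):

```python
# Exact-match classification accuracy: a prediction with stray quotes around it,
# like '"product_question"', counts as incorrect -- the behavior seen in the demo.
def classification_accuracy(predictions, ground_truth):
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

print(classification_accuracy(
    ["product_question", '"product_question"', "return"],
    ["product_question", "product_question", "product_question"],
))  # 1 of 3 exact matches -> 0.33
```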
But that's not very good. And the 39 we're wrong on, that's not... yeah, that's not great. Can you fix it? Yeah, let's make it better. Okay, okay. So for that we
have the ability to change the prompt and then run that one and compare it against the one that we had before, so we can see side by side whether this actually made any difference, made it better or worse. I see. Right. So for that, this
prompt that we have here has a button that says show variants. So I can open this and then I see a second version unfold, which I prepared, right — so we don't ship with — always a better second version? Unfortunately, yes. Right. So, let me see if I understand: as we run these prompts, we can create variants of prompts inline to test against each other. Like variants, like that one show that had a lot of variants in it that I'm not going to name for legal reasons. It's the Disney show.
They had, they had Loki variants. Oh, yes, yes. I can't say. Yes. OK, I think my daughter... So these are prompt variants, and they're supposed to test whether changes have any effect on the output. Exactly. OK.
So here I've now got a different one. This one actually uses the so-called one-shot pattern, right? It basically gives the model an example and tells it: this is how you should answer for this kind of input. And then it knows much better how to actually handle these things. So if we run this, it's another bulk test, and this time we're going to turn the variants on — again, same data set and same metric. It's the same thing we did before, right? Yeah, yeah. Just with the variants turned on now.
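For a sense of what a one-shot variant can look like, here is a hedged illustration; the example order, wording, and intent labels are made up, not the variant used in the session. The idea is simply to show the model one worked example of the expected output before asking about the real conversation.

```python
# A hedged illustration of a one-shot prompt variant for the intent classifier.
ONE_SHOT_INTENT_PROMPT = """\
You classify the customer's intent as one of: product_question, return, order_status,
store_question, other. Answer with only the intent label, no quotes.

Example:
Orders: item 42 (hiking boots)
Conversation: "Can I send the boots back? They're too small."
Intent: return

Now classify:
Orders: {orders}
Conversation: "{question}"
Intent:"""
```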
Oops — no, no, I won't get them all wrong. Oh, come on, intent. There you go. Thank you. OK,
so I'm submitting this, and that will take me to — I think that one, yes, the run with variants. So if I look at the outputs here, you see variant 0 and variant 1, right? Variant 0 has an empty intent — not so great, and that is obviously incorrect. Variant 1 got
it wrong too, but at least it didn't give me something empty, so that's a step in the right direction. And if you look at the quotes issue here: variant 0 versus variant 1 for the same input — variant 1 actually fixed it. Oh, that's cool. And in the aggregate, let's look at that, it shows that 72 percent is now correct,
as opposed to 62. And that's a lot better. So, I'm not trying to throw a wet blanket over all this stuff, but this is classification. The whole copilot, though, returns free-form text. How do we test that?
Yes. So it all basically comes down to which metrics you use and how you calculate them. And it's actually a pretty hard problem to calculate what a good answer is
to a question that somebody gave you — what's a good answer? Yeah. And so for this RAG pattern, we actually have a particularly interesting way of measuring that. Oh, do we? Because with the RAG pattern, you have context, and based on that context you want the model to answer a question, and you want it to answer based only on that
context. Like the Eurovision Song Contest? Exactly. Yeah. Don't make up who won; just look at that, and if it's not in there, say you don't know, right? And that's what you can do with
the RAG pattern. And we call that groundedness — the answer is grounded in your data. It's been mentioned in the keynote a couple of times as well. So what we can do is measure how grounded the answer is, right? So let me show you how. You're going to find this very interesting.
Let me go back to the other flow, obviously to the second step, because groundedness is what we want to measure on the second step of the pipeline. And this time, I'm going to use the same data again, but with the groundedness evaluation, right? And the first thing this is going to ask me is: hey, which model do you want to use? Wait, that's weird. Why does it need
a model to test groundedness? Because the way we calculate groundedness is by using GPT, an LLM, right? I wish I had a button there — maybe you can do it. Now, I want to emphasize this, because it probably sounded like, what the heck are they talking about? We use GPT-4 to look at the data that you're giving GPT-3.5 and at the answer, to score groundedness, and you're like, that sounds weird to me — like plugging a plug into the other end somehow. But that's not how it works. Machine learning has advanced because
of certain patterns, like Generative Adversarial Networks — have you heard of those? — where models fight against each other to optimize each other. And we're using GPT-4. So if you're wondering, hey, they haven't talked about GPT-4 this whole time: it turns out we're using GPT-4 not for what you thought, but for scoring groundedness on the responses of the model, and this is all built into prompt flow, right? And the other reason is we had to submit the talk title way ahead of time.
OK, so here we go. This is the result of this run. And what you see in the end is that it gets a GPT groundedness score, and that score goes from 1 to 5, right? One means it's completely ungrounded — what the model answered with is not grounded in the context, in the product information that we gave it. Meaning, well, it says that the sleeping
bag is OK for minus 45 Fahrenheit, but the description says 20 Fahrenheit — that would be very ungrounded. In this case, most of the answers were actually found to be grounded, right? There's one here, for example, that isn't, and you would have to go and investigate now: what's going on, why does it get taken off track? In aggregate, we're actually at 4.31 out of five, which isn't terrible. So it's a pretty decent score.
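A rough sketch of the LLM-as-judge idea behind such a metric follows. The judge prompt wording, the 1-to-5 rubric phrasing, and the deployment name are assumptions for illustration; prompt flow's built-in groundedness evaluation has its own, more carefully engineered prompt.

```python
# Using GPT-4 as a judge to score groundedness of an answer against its context.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-02-01",
)

JUDGE_PROMPT = """\
You are grading whether an answer is grounded in the given context.
Context:
{context}

Answer:
{answer}

On a scale of 1 to 5, where 1 means the answer makes claims not supported by the
context and 5 means every claim is supported, reply with only the number."""

def groundedness(context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="<your-gpt-4-deployment>",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    # Assumes the judge follows instructions and returns a bare number.
    return int(response.choices[0].message.content.strip())

print(groundedness(
    "The sleeping bag is rated for 20 to 60 degrees Fahrenheit.",
    "Your sleeping bag is rated down to minus 45 Fahrenheit.",
))  # expect a low score such as 1
```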
But you can optimize it. If you want to understand better how, Manoush and Sarah, our colleagues, have a session in this very room at 4 o'clock, and they're going to go into much more detail on groundedness and how to improve it with special prompt techniques. And this is important because it's not just about making prompts and shoving them into models; you've got to do this responsibly. I mean, you have to have both. If you do not have both,
you're not using these things responsibly. That's why we spent a great amount of effort and sleepless nights making sure that not just prompt construction and evaluation, but also the scoring, the groundedness metrics, and the RAI stuff are built into it as well. You need both. So if someone's selling one and not the other, you should be careful, because you could be exposing yourself if you don't at least start to look at what the thing is outputting in bulk. Does that make sense? A safety tip here, OK.
A quick look here, just to make sure people see these: there are a few more metrics, and the majority of them are actually driven by a large language model. So you can see, for example, the relevance of an answer pertaining to a given question.
And similarity: we have a pairwise evaluation. So, interesting stuff that you can use to measure the quality of the generations you're producing. Hold on, hold on. What if I don't like those metrics and I want to make my own prompt to measure my own stuff? Yeah, you can just go and create a new one, or just clone one of these. So if you want
to say, oh, I'll look at this Ada similarity one, I can just clone it, define it myself, and then use it in my runs. And that's even better, because since you're using GPT-4, you can change the prompt that tests your other prompt, and then you could make a test for that one, and then... yeah, at some point you have to bootstrap yourself. That's right.
And this is, I think, the coolest part. The point is, these LLM-based metrics — you should also ground them in human eyes every so often, right? But they help you actually make progress and iterate rather quickly, without having to do a lengthy labeling job where you give this to users and then you have 100 or 1,000 things that you need users to look at, and it takes a week and it's very expensive. You can still do that, and you should do it from time to time, but way less often. So, a question.
A question, yeah. They talk a lot about vector databases. Can you tell us a little bit about how you could make that work here? Very good point. What we saw here — the data was already in Cognitive Search, so there was no
need for us to do anything. But if you're actually looking to upload your data and you just want to create a vector index, there's a feature in prompt flow for that as well: you have Vector Index here, and you can create one. We ask you for the name and where your data is located — you can upload local data, or pull it from Git or from Blob storage. And in terms of which index to use, you can use a FAISS index, which is an open source tool, or a Cognitive Search index. That's what we support right now. It'll basically create this all for you, and then you end up with a beautiful index that you can query in your RAG.
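If you wanted to build such a local FAISS index by hand instead of letting prompt flow create it, a hedged sketch with LangChain could look like this; the file names, chunking parameters, and embedding setup are placeholders (OpenAIEmbeddings assumes API credentials are configured in the environment).

```python
# A hedged sketch of building and querying a local FAISS vector index with LangChain.
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

docs = TextLoader("product_catalog.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

index = FAISS.from_documents(chunks, OpenAIEmbeddings())
index.save_local("product_index")

# Later, retrieve context for the RAG step:
hits = index.similarity_search("sleeping bag temperature rating", k=3)
print(hits[0].page_content)
```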
I love it. So this is cool, but this is all still in prompt flow. How do I put this into my app? That is very easy, because a prompt flow, just like any other model or app in Azure Machine Learning, just deploys to an endpoint, right? So you're going to tell me it's just one button.
It is. Let me show you the button. We should have made that button bigger. Here's the button — it says Deploy on it — and when you hit Deploy, it asks you a few questions and you can reassign the connections. You might be working with one set of LLMs in testing, and since we're starting to
allow you to add content safety and so on, it actually makes sense to run in one region or in one way during development, and then in production you might switch over to a different endpoint. And then all you do is pick a compute, like a compute size, and then you deploy. Do you need, like, a huge GPU machine for this?
No, because all the hard work is actually done by the LLM, which is all hosted in the cloud, whereas these things are really very light. Yeah, you need, like, a really small CPU for it. So let me show the — do I have it here? Yeah, here's the test page for this thing that we deployed earlier, and you'll find a snippet. So again, what you actually put in there is
exactly what you saw at the top of the prompt flow — the customer info and then the chat conversation — just the same thing in this JSON format. How cool is this? Yeah, I mean, there's the answer. Yeah, there's the answer somewhere down there: the compass, how to compensate for the declination angle, here's how.
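Calling such an endpoint from an app could look roughly like the sketch below. The scoring URL, key, and payload field names are placeholders; the real scoring URI, auth key, and input schema come from the endpoint's consume/test page.

```python
# A hedged sketch of calling the deployed prompt flow endpoint over REST.
import requests

scoring_uri = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
headers = {
    "Authorization": "Bearer <endpoint-key>",
    "Content-Type": "application/json",
}
payload = {
    "customer_info": "Customer 13, recent purchases: a sleeping bag and a compass",
    "chat_history": [
        {"role": "customer", "content": "Does my compass work in South America?"}
    ],
}

response = requests.post(scoring_uri, headers=headers, json=payload)
response.raise_for_status()
print(response.json())  # intent, product context, and the suggested reply
```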
So we're running out of time. Who uses this endpoint stuff? I mean, is it reliable? What if I want to add another deployment? What if I want to monitor? What if I want to do other things? So, yeah — this is a tried and tested feature of Azure Machine Learning, and has been for a long time, so it's very scalable. You know, we
actually run very, very large services off of it ourselves. So let's finish up with one more thing. Monitoring. Monitoring.
Exactly. Yeah, let's do that. So here is the monitor that I've created for this service, this endpoint. And if I go to the monitoring page here, you can see the kind of page you get. You get some notifications here,
and we're measuring and monitoring the groundedness over time, which means that if there's a dip like here, we had something like a data issue there and it was predicting weird stuff. So you can see that here, go in, and fix it. So this is cool, because now that you've deployed it, this thing is still — it's like a puppy, you know, it pees on the carpet sometimes, and you've got to clean it up. While you're live, you can actually measure groundedness, among other metrics, in production, so that you can go back and fix your prompts if you need to, which is really cool. I mean, this is really good.
I think we've got to close, right? Oh, yeah. OK. He's giving us one minute. Thank you. OK, so last thing, the model catalog. We have this beautiful model catalog, which was also shown in the keynote, and the OpenAI models are actually part of it. You can even fine-tune from here: you can find a model, you can deploy it directly from here, or if you filter for all the models that support fine-tuning, you can go here and then, right from here, fine-tune the model — select some data, upload it, and so on and so forth. I won't
do it now, but you basically get the sense. You can then integrate this into normal Azure ML pipelines, run it in automation, and so on and so forth. And I want to be clear about this, because notice that we've shown you how to fix prompts so that everything works. But there may come a time when, for example, the model just doesn't know the vocabulary of your organization, so it's not likely to produce that token. Then you can do something like fine-tuning, which updates the weights in a really intelligent way, with things like low-rank approximation, etc. And we're very fast at it, too.
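For a flavor of the low-rank idea mentioned here, below is a hedged sketch using the Hugging Face peft library on a small open model. This is not the model catalog's managed fine-tuning pipeline; the model, LoRA hyperparameters, and target modules are illustrative.

```python
# A hedged sketch of LoRA (low-rank adaptation): only a small set of extra weights is trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lora = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# From here you would run a normal training loop (e.g. the transformers Trainer)
# on examples containing your organization's vocabulary.
```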
Exactly. And this is something that people who have used Azure Machine Learning know: we log metrics and so on, so you can run multiple runs and then compare them. It's just all the normal machine learning goodness, OK. The
workspace connections — you've got to help me out. Come on, come off stage. Yeah, you have to. You have to close, though. Anyway.
Yeah, I do have to close, but I wanted to keep this here, because if you do deploy that model — say you fine-tune an OSS model and then put it on an endpoint — now you want to use it in your prompt flow, and you need to add it as a connection in your workspace. Ah, he's talking about
the connections in the workspace. OK, OK. You mean here, the workspace connections. There you go. Yes. So now you can actually call that model as
a connection. Yeah, and in the runtimes tab, for example, the way you get LangChain to work is you create a runtime that has LangChain installed. You can do the same thing with Semantic Kernel or whatever thing comes out this weekend.
Yeah, okay. I think we've got to go. Okay. Good job. Good job. Thanks, guys. Thank you. We've got a couple of slides here; hopefully we can show those. Did we get the slides up here? Okay. Yeah. So, you know, this is a prompt flow.
I just want to call that out. And basically I want to say one thing: you see the graph on the right-hand side, and it looks like it's kind of WYSIWYG and drag-and-droppy and stuff like that. It's really
read-only. It gives you the flow of it, but it is very much a dev tool, and the programming is on the left-hand side — you actually write code in there. So that's important to know. Yeah. It'll come out in a couple of weeks in public preview, but if you want to play with it now, just jump on this URL and sign up, and we'll enable it in a workspace for you. Yeah, a lot
of cool things coming out. We have a responsible AI session later today. We also have a few discussions that are coming. And yeah, it's just a fantastic week for us on the AI platform team. So, really excited. Try
to grab these if you can, and we have some links. Feel free to learn more, connect, explore, and join stuff. Cool. Thanks.