Build your own Copilot with Azure AI Studio | BRK201HG
[MUSIC] Speaker 1: We're witnessing the shift of generative AI from a niche technology in the hands of a few to an accessible technology businesses are harnessing to benefit the many. Now, you can tap into this opportunity to gain an AI advantage. Get to market fast, do it right, and do so responsibly, safely, and securely. Copilots are becoming an AI category in and of themselves. Today, you can build custom copilots to serve the exact needs of your customers and employees. Bringing content generation to all apps to supercharge creativity, designing contextual experiences that impress and captivate and freeing people to focus on their highest value work.
This is Azure AI Studio, the place to build your copilots, where you can confidently apply the latest state-of-the-art and open-source models, easily ground responses in your data knowing privacy is protected, deliver multimodal interactions beyond text alone that feel more natural, and build on a foundation of trust through every step of app development. The possibilities with AI are limitless. Build your own copilots with Azure AI. Let's do generative AI right together. John Montgomery: Good morning.
Welcome. I'm incredibly excited to be here. Thank you for the warm welcome. I'm John Montgomery. I'm on the Azure AI team at Microsoft.
I'm joined by Nabila Babar. We want to spend the next 45 minutes going a little bit deeper into some of the news that Satya and Scott Guthrie have talked about today. Generative AI is actually real. It's happening. It's happening now. We have 18,000 customers who are using Azure OpenAI Service today to build amazing generative experiences. It's everything from the simple chat experience to the most sophisticated kinds of applications that you see in the Microsoft copilots: companies like Siemens, which you saw, I think, in Scott's keynote, building an internal application that does translation and can issue tickets; ISVs like Grammarly; and on into even more sophisticated examples like the Microsoft copilots.
I want to spend a moment right now highlighting one of those customers that we're super happy to have with us, which is Perplexity. Denis Yarats: Perplexity tries to revolutionize the way people access and search information on the internet. Performance and speed are at the core of what we do. We've been extensively using Azure AI Studio for this type of workflow where you have an idea, you try things out, and then once you're satisfied we can take this model, deploy it, and the next day it's in production. Lauren Yang: With Azure OpenAI Service, we have the ability to control and independently change which model is serving prod traffic. Perplexity Ask started out as a Slack bot.
You ask a question and the bot replies with an answer. We hooked it up to web browsing and added the capability to search the Internet. It was like magic. We got overwhelmingly positive feedback and from there created the official Ask Perplexity product. Aravind Srinivas: The underlying models in Perplexity Ask run on Azure, and you want to make sure that no matter what requests people send, the service is safe, it's secure, and it runs reliably. That's what Azure enables us to do. You send in a lot of requests to the large language model API and there are multiple times you repeat the same thing.
You don't want to keep paying again and again for the same thing. What PTU has allowed is that you can cache the repeated requests, saving costs. Denis Yarats: The power of large language models is that you can skip many of those steps. You can prototype with tools like Azure AI Studio, and in a matter of hours or days, you can deploy this feature into your product. We've been working with Azure OpenAI to migrate to a faster version of GPT-4 models.
That essentially led to increasing throughput from roughly 300,000 tokens per minute to 600,000 tokens. You're getting twice as much throughput at only two percent of the cost. This not only allows you to release features in a matter of days instead of months, but it also allows you to experiment much faster and deliver a much better user experience. Aravind Srinivas: We didn't start off as trying to work on a search engine, but rather we were trying to just do cool things with large language models like GPT.
We ended up building something that would be useful for basically the entire world. Lauren Yang: We're the first generative AI answer engine out there, and we plan on continuing to deliver a seamless and new type of search experience and to continue growing our product using Azure OpenAI Service to push us forward. [MUSIC] John Montgomery: Perplexity has done some amazing things. I think that video calls out some of the reasons that a tool like Azure AI Studio is useful: their ability to flight new experiences against old ones, to test one model against another, and to scale the solution from a proof of concept quickly out to incredible load.
This is the same technology that we use inside Microsoft to build our copilots. We're not just providers of copilots, and we're not just providers of third-party infrastructure. The same technology that Perplexity used and that these other customers have used, that's the technology we build our own copilots on, the exact same stuff. It's all Azure AI. That means that the advantages that we get as we scale out our copilots, the efficiency of the service, the accuracy of the models, the scale we're able to operate at, we are able to pass all of those along to you. A huge portion of that is about making sure that there is commonality in the infrastructure that we use across Microsoft, and increasingly that we're seeing customers standardize on in their accounts.
We talk about this as the Copilot Stack, and it has these common layers that you build up from. Here we are actually in Level zero of the conference center. I want to say that AI is in fact the foundation upon which we're building all of Ignite because we're on the foundation of it, maybe the Garage. I don't really think we'd want to give this talk in the Garage, but we could.
But thinking about it through that perspective, there are multiple layers and as we talk about this in this show, we are going to build from the bottom to the top. Talking a little bit about the foundation models that we provide, the orchestration engine we have, the AI toolchain that we provide and how we integrate with data sources. Mostly in this talk, we're going to be talking about how you build a completely custom copilot. We have other talks here to talk about how you can extend the Microsoft copilots or to build copilots from within a product like Fabric. But this is about building your own Copilot at scale from scratch.
The tooling we use to do that is Azure AI Studio. Azure AI Studio is in public preview today. We announced it at our Build event last spring, and it has been an amazing journey to get to the place today where we are able to offer it to you as a public preview. I encourage you to go to ai.azure.com and give it a try.
We have brought together all of those practices from that Copilot Stack. All of the knowledge that we've gained from building and operating our copilots at scale. All the customer conversations together to give you one place to do AI right and to build applications quickly and offer them out to your customers. AI Studio is a new thing for us. We are truly unifying multiple different AI services into one experience.
We see this very commonly where customers want to be able to use multiple different AI models, they want to bring their custom models, many different data sources, and we want to give you that unified platform to do that. That brings in the best of our data and search technologies. That brings in not only our own foundation models and the ones that we partner with OpenAI on, but the best of the open models that are out there and the best other commercial models, so that you can build your copilots using these technologies.
To do so safely and responsibly using the cutting-edge safety tools that we use within Microsoft, and to do all of that with a full end-to-end development life cycle because that is the next phase of what's going to be happening here. We will see the increasing merger of DevOps processes, MLOps processes and this new area called LLMOps, which is about how you actually build these systems and scale them and make sure that they version correctly over time. With that, enough I think of me talking.
Nabila, would you actually introduce us to Azure AI Studio? Nabila Babar: Awesome, thank you, John. We've all seen powerful copilots like Bing Chat, M365 and GitHub and talked about a few of these. All of these are built on Azure, and they're some of the world's most complex workloads. What we're going to do today is we're going to build our own copilot safely using Azure AI Studio. Here you see that I'm in Azure AI Studio. The first thing I'm going to do is create a project. Actually, that's not true.
I've already created the project. We're going to use this project as you see here, and within the Settings tab, you can see my project configuration. Here you can see I've created something called an Azure AI Resource. This Azure AI Resource helps me connect to, create, and manage all of the different AI-related assets that go into my project. For example, in this project, I'm using Azure OpenAI, I'm using Azure Machine Learning, I'm using Azure AI Search. All of these are already connected, and they're ready for me to use.
You can also see the compute and infrastructure that my project is dependent on in here, and you can manage that here as well. I have access to different API endpoints and keys, and I also have access to seeing who has access to my project, and I can assign different roles and permissions here as well. That's the project set-up. Let's take a look at our deployments.
As a part of creating this project, all of the deployments that I needed, in this scenario I'm using Azure OpenAI, they're already deployed and they're ready for me to use. I can go ahead and manage these deployments through here as well. We're going to go into the project playground. This is a great place to get started.
The first thing I'm going to do in here is I'm using a GPT-3.5 Turbo model. I'm going to ask this a question that is very specific to my company data, and it's not going to work. As you can see, it's not able to give me an answer because this model has not been trained on my company data. Let's go ahead and fix that now. I'm going to add our data using Azure AI Search.
What I've already done in here is I've taken our data, and I've added it to OneLake. OneLake is supported throughout the platform in Azure AI Studio. Here you see me ingesting my data from it and then adding that data to my vector index.
It's supported for fine-tuning and also within our orchestration flow as we'll see later. But I've already grounded this model with Azure AI Search. I can see it's connected to the playground here. Now what I'm going to do is I'm going to ask the same exact question. This time the model knows the answer. The hiking shoes cost $110.
We can also view exactly where in the product documentation the information comes from. Not only have we enabled this, but let's say we ask the same exact question in a different language. I'm using Spanish here. As we can see, it's giving me a response in Spanish, even though my product documentation was in English. Through this, we've enabled multilingual responses.
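For reference, a minimal sketch of what that "chat on your data" call can look like from code, assuming the openai Python package (v1), placeholder endpoint, key, deployment, and index names, and an API version that supports the data_sources extension; the exact field names vary by API version:

```python
# Hedged sketch: ground a chat completion in an Azure AI Search index.
# All resource names, keys, and the deployment/index names are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumed; use a version that supports data_sources
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # your chat deployment name
    messages=[{"role": "user", "content": "How much do the hiking shoes cost?"}],
    extra_body={
        "data_sources": [  # grounds the model in your Azure AI Search index
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": os.environ["AZURE_SEARCH_ENDPOINT"],
                    "index_name": "product-docs-index",
                    "authentication": {"type": "api_key",
                                       "key": os.environ["AZURE_SEARCH_KEY"]},
                },
            }
        ]
    },
)
print(response.choices[0].message.content)
```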
Now that we have an idea of how this model looks, let's take a look at some of the other tools that we have within the playground. Up until now, my system message has been pretty basic, and system messages, or prompts as we call them, are pretty important. They tell my application how to behave. Through Azure AI Studio, you have a wide variety of different prompts that you can use directly within your application. If you want to create your own, you can type that in here as well. Also, I've been having a one-on-one conversation with this.
I'm just trying to get a feel for how this model works. But let's say now what I want to do is ask this many different questions. Since the model is grounded in my data, I want to compare all of the answers to those questions. For that, we need to do a manual evaluation. Here what I can do is manually enter several different questions one at a time, or, what could be quicker, I can import an entire dataset.
I've already gone ahead and done that and let's take a look at that. Here you can see I had a dataset that had around 10 questions in there. I've imported this in here, as well as a list of expected responses that I'm expecting the LLM to give me. I ran this, and here you can see all of the different outputs that I can now manually view, and I can also add my preferences to these as well. Here I can see the majority of them look good, with the exception of this one. And all of this effort doesn't go to waste.
At the top, I see a summary of the insights that I'm building. I'm building my dataset as you see here. I can export this and share this, and I can also save these results to use as a dataset later for a more comprehensive evaluation, and we'll see that a little bit later. With that the tour for our playground comes to an end, and I'm going to hand it back to John. John Montgomery: Awesome. Thank you. That's a tour around AI Studio.
We're going to go to each of these areas in a little more detail. But you saw the basics here, connecting to data, working with a prompt, doing evaluation, selecting a model. These are the cores of LLMOps, and you can do all of these within AI Studio, both through that user experience, through the SDKs, and through the command line. We have a full experience for deployment.
Let's start by talking about data. I think Arun when he was onstage, he said, "data is the fuel that powers AI." I would probably be a little less formal. I would say garbage in, garbage out.
If you don't feed good data into the LLM, you're going to get bad results. When I talk with customers, one of the most common things I see is that, for whatever reason, the data is not well formatted, it's not clean, the chunking isn't right, the embedding isn't right, and you just get poor quality results. We're trying to make it much easier for you to connect to data and to get it into the right form to do that. Within Azure AI Studio, we have connectors to all of the Azure data sources, as you could imagine, and it's very, very easy to connect to them and bring them in. But we can also talk to data sources from on-premises, from other Cloud providers, and so on in order to feed these models the information you need to get the best possible output from them, whether it's structured or unstructured data. At the heart of all of this is the idea of vector search and being able to use vectors.
We announced a whole bunch of stuff today about vectors and vector availability. We can talk a little bit more later about what goes into a vector and why vectors are important. They're basically just mathematical representations that are very easy for a system to compare, so you can get nearness and relevance out of them. But it turns out not all vectors are created the same. We're very proud of Azure AI Search, which was formerly called Azure Cognitive Search, and its support for vectors.
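As a toy illustration of why vectors make "nearness" easy to compute, here is a hedged sketch that embeds two strings with an Azure OpenAI embedding deployment (the deployment name and endpoint are assumptions) and compares them with cosine similarity:

```python
# Hedged sketch: embeddings turn text into vectors; cosine similarity scores nearness.
import math
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<your-resource>.openai.azure.com",
                     api_key="<key>", api_version="2024-02-01")

def embed(text: str) -> list[float]:
    # "text-embedding-ada-002" matches the Ada embedding model mentioned later in the demo
    return client.embeddings.create(model="text-embedding-ada-002",
                                    input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine(embed("tent"), embed("camping")))       # semantically close -> higher score
print(cosine(embed("tent"), embed("phone number")))  # unrelated -> lower score
```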
Because it doesn't just bring together vectors, it actually brings together vectors with semantic search and classical keyword search. We talk about this as a vector hybrid search, and it actually has a dramatically positive impact on the results that you're likely to get out of a search index. As you think about the vector support, which is a core for all of this stuff, you also need to be aware of what those vectors are doing and how they're being brought together with other technologies to make sure that the information being fed into the large language model is the best possible information you can get in. Again, I think there are some very easy ways to show that. Maybe Nabila, would you show us a little bit about the amazingness of Azure AI Search? Nabila Babar: Awesome, thank you, John.
Before I show you, I want to get a little nerdy. Let's talk a little bit about how search works. John Montgomery: That sounds good. I like nerdy. Nabila Babar: Perfect. Earlier you saw me grounding my model with my data.
It's a very simple example of me grounding the model with my data. I want to talk a little bit about how we're searching that data. Under the hood, we're running a sophisticated ingestion pipeline. That pipeline does three things.
The first thing it does is crack open the documents. It then runs a chunking strategy, and then on top of that it embeds that data by running it through an embedding model, like the Ada model I was using earlier. Then it takes that embedded data and indexes it in Azure AI Search.
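A simplified sketch of that ingestion pipeline, assuming the azure-search-documents and openai Python packages and an index with "id", "content", and "contentVector" fields (the schema, file name, and chunking strategy here are illustrative, not the exact pipeline AI Studio runs):

```python
# Hedged sketch: crack a document, chunk it, embed each chunk, upload to Azure AI Search.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

aoai = AzureOpenAI(azure_endpoint="https://<aoai>.openai.azure.com",
                   api_key="<aoai-key>", api_version="2024-02-01")
search = SearchClient(endpoint="https://<search>.search.windows.net",
                      index_name="product-docs-index",
                      credential=AzureKeyCredential("<search-key>"))

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # naive fixed-size chunking; real pipelines split on sentence/section boundaries
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

with open("product-info.md", encoding="utf-8") as f:  # the "cracked" document text
    docs = []
    for i, piece in enumerate(chunk(f.read())):
        vector = aoai.embeddings.create(model="text-embedding-ada-002",
                                        input=piece).data[0].embedding
        docs.append({"id": str(i), "content": piece, "contentVector": vector})

search.upload_documents(documents=docs)  # index the chunks and their vectors
```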
When you're searching, you have three different options in AI Studio. The first one is called vector search. Vector search is really good at understanding user intent and relationships between words. For example, it can understand the relationship between tent and camping, or waterproof and jackets. Where it's not that great is at getting exact matches, for example, if you're looking for a product document, an e-mail address, or a phone number. For that, keyword search works really well. It's fast as well.
What hybrid search does is combine both of those based on what the user is asking: it combines vector search, which is really good at understanding relationships and intent, with keyword search, based on what the user needs. The last search type we have in here is hybrid plus semantic, and this is the one that builds on top of both of them. It's the one that rules them all.
What this does is add a semantic ranker, and what that ranker does is ensure that the most relevant information is available at the top of a search. This is the one we're going to try right now. I'm going to go back to the playground and I'm going to ask these two questions. The first question is, what gear do I need for Seattle weather? Within my product documentation, I have a lot of documents about my products, and then I also inserted some documents just generally about Seattle, to see which document this is going to rank and retrieve depending on what I'm asking. Let's go ahead and ask this: what gear do I need for Seattle weather? This is great. As you can see, the first thing is it's telling me that the retrieved documents do not provide a direct answer.
It's giving me transparency that this was not in the retrieved documents. Then it tells me that the closest information available to this is for any weather. Actually, sorry, we're going to try this one more time. I promise this worked before. John Montgomery: How do you know it's real software? Nabila Babar: What it should do, and what it has done, I promise, is reference Seattle weather, say that it rains in Seattle, and then bring up my waterproof jacket. John Montgomery: The irony here right now is that it's bright and clear and cold.
Nabila Babar: The next thing I'm going to ask, let's go back to the earlier one, maybe. No worries. What it should do is pull out the right document here, the one with the waterproof jacket that was within my product documentation. I'm probably not using the right dataset. But it does work, I promise. John, now back to you.
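For reference, a hedged sketch of the three query styles being compared, using the azure-search-documents Python SDK. The index name, field names, and semantic configuration name are assumptions; check your own index definition for the real values:

```python
# Hedged sketch: keyword-only, vector-only, and hybrid + semantic queries against one index.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

aoai = AzureOpenAI(azure_endpoint="https://<aoai>.openai.azure.com",
                   api_key="<aoai-key>", api_version="2024-02-01")
search = SearchClient(endpoint="https://<search>.search.windows.net",
                      index_name="product-docs-index",
                      credential=AzureKeyCredential("<search-key>"))

question = "What gear do I need for Seattle weather?"
vector = aoai.embeddings.create(model="text-embedding-ada-002",
                                input=question).data[0].embedding
vector_query = VectorizedQuery(vector=vector, k_nearest_neighbors=5, fields="contentVector")

keyword_only = search.search(search_text=question)                            # classical keyword search
vector_only = search.search(search_text=None, vector_queries=[vector_query])  # pure vector search
hybrid_semantic = search.search(                                              # hybrid + semantic reranker
    search_text=question,
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="default",
)
for doc in hybrid_semantic:
    print(doc["content"][:80])  # most relevant chunks should surface at the top
```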
John Montgomery: Thank you. That's a hybrid search with semantic. There is a session on this that I would invite you to go to, given by my compatriot, Pablo Castro, if you are interested in it. He goes into quite a bit of depth into how this technology works, when to apply it, and strategies for doing chunking and embedding on your data. Now, if it's garbage in, garbage out with data, one of the other core things that's going to determine whether a copilot is successful or not is the foundation model you build on top of. Within Azure AI Studio, obviously we have the models from OpenAI: GPT-4, GPT-3.5 Turbo, and so on.
We're now expanding that out. For a while, we've had a model catalog that can bring in open-source models, and there's quite a bit more going on in this space. The key for us is ensuring that customers have choice, because one size does not fit all in this domain. Sometimes you need a specific model, sometimes you need other models, and sometimes you need the models to be composed with each other so that you can get the right result. In fact, most of the copilots at Microsoft use a mixture of different models to achieve the results they have.
One of the biggest announcements we're making today is that we are now bringing GPT-4 Turbo, the latest version of GPT-4, as well as GPT-4 Turbo with Vision and DALL·E 3, into the Azure OpenAI Service. If you were following the news, you know that about a week ago, OpenAI announced that they have these models and obviously we are bringing them into Azure OpenAI as quickly as we possibly can so that our Azure customers can make use of them. GPT-4 Turbo, a tremendous model. It delivers the same level of accuracy that you would expect from GPT-4. The interesting thing is it's much more efficient, much higher throughput, which is enabling us to be able to lower the price on the token based offering significantly.
I suspect this is going to enable a lot more new scenarios since historically GPT-4 has been a somewhat expensive model to use. The quality is worth it, but we're delivering the same quality now at a significantly lower price. DALL·E 3, similarly, is delivering a quality of image creation that really blows the doors off of DALL·E 2. I think you probably saw some of that. It's an amazing model.
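For reference, a hedged sketch of generating an image with a DALL·E 3 deployment on Azure OpenAI; the deployment name, prompt, and API version are placeholders:

```python
# Hedged sketch: image generation with a DALL-E 3 deployment via the openai Python SDK.
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<aoai>.openai.azure.com",
                     api_key="<key>", api_version="2024-02-01")

result = client.images.generate(
    model="dall-e-3",  # your DALL-E 3 deployment name (an assumption)
    prompt="A waterproof hiking jacket on a rainy Seattle trail, product photo",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```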
But I want to spend a moment here talking a little bit about one of the trends that's happening within AI, and that's called multimodality. I think a lot of us have been talking about this for a while. Mostly today when you think of large language models, they are language models. You type in language and you get out language. But multimodal models go quite a bit further than that. Microsoft has been at the forefront of creating these multimodal models for a while, and we're very pleased to be able to say that we're bringing GPT-4 Turbo with Vision to the Azure OpenAI Service.
This is an amazing model. It can take in video, it can take in images, you can supply it with video or images and ask questions about them, and it can summarize videos. It's an incredibly powerful model that extends that language-to-language interaction to incorporate video and images as well. There's a whole session on this that my compatriot, Marco Casalaina, is giving. I encourage you to go to that one.
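A hedged sketch of the image-input pattern with a GPT-4 Turbo with Vision deployment via the chat API; the deployment name, image URL, and API version are assumptions, and video input goes through a separate enrichment flow not shown here:

```python
# Hedged sketch: ask a vision-capable chat deployment about an image.
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<aoai>.openai.azure.com",
                     api_key="<key>", api_version="2024-02-01")

response = client.chat.completions.create(
    model="gpt-4-vision",  # your GPT-4 Turbo with Vision deployment name (an assumption)
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What gear should I pack for the conditions in this photo?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/trail-photo.jpg"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```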
He will go very deep on this model and what it's capable of. I'll stop there. You should definitely go to Marco's session. Somebody who was reviewing it from outside Microsoft said it's probably going to be the best session at the show, so you should go to Marco's session. Now I want to actually reiterate something that Scott talked about.
There is a session we're giving on this in detail, but a lot of the questions that I get from customers have to do with what happens with my data when it goes into your model? Do you use it to train the foundation model? Is it secure? How are you using it? I just want to say super clearly, the promises that Azure makes generally apply to the Azure OpenAI Service. Your data is your data. We don't use it to train our foundation models, we don't do anything with it. When you fine-tune a model using the Azure OpenAI Service, the weights of the fine-tuned model stay in your subscription, we don't see them.
We have some of the strongest protections around data and data sovereignty in the industry: data stays in the region it's in, and we're very proud of that. We're very proud to announce that today we are extending some of the announcements we've made previously about our copilots, where we are offering copyright protection to applications built on top of Azure OpenAI Service. We call this the Customer Copyright Commitment. If you follow the published guidelines we have for how to build a copilot responsibly, that includes how to design your metaprompt, how to handle data, and you use the Azure Content Safety Service with the switches that we document, we will indemnify you if somebody makes a copyright claim against you. That is a huge announcement because it's one of the biggest fears that a lot of customers have. It's really about the control over what's coming out of these models.
A major announcement that we made today. Another major announcement has been about fine-tuning: you can now fine-tune GPT-3.5 Turbo and GPT-4. Now, I'll say the GPT-4 Turbo one, that's a private preview, so it's a limited number of customers we're going to bring in as we bring this service up.
But I want to spend a moment talking about fine-tuning versus retrieval-augmented generation. Everything we've shown so far has focused on retrieval augmented generation. This is the idea that instead of having to go and retrain a model and become an expert in weights and biases and things like that, you can grab some data and you can feed it into the prompt along with user questions and get answers out of it. Retrieval augmented generation is awesome in a lot of ways.
It's relatively fast to do, building an index is relatively inexpensive, and you can reindex often. Retrieval-augmented generation, in our internal tests, is really good at adding knowledge to the experience. Fine-tuning is a very different thing. You bring your data in, you fine-tune the model. It takes some compute to create those new weights.
We then assemble the new weights with the base model when you call the model using a technology we have called Low-rank Adaptation. But the interesting thing about fine-tuning is fine-tuning is not the best way to add knowledge into a model. It's a really good way of adjusting the tone of output or the format of output.
When you think about these two technologies, they are entirely complementary. It is not one or the other. A lot of applications won't ever need to do fine-tuning. A lot of applications may not need to use RAG. But we're very pleased to be able to offer both within Azure AI Studio.
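For reference, a hedged sketch of what kicking off a fine-tuning job on Azure OpenAI can look like with the openai Python SDK: upload a JSONL file of chat examples, then create the job. The file contents, base model name, and API version shown here are illustrative assumptions:

```python
# Hedged sketch: fine-tune a chat model on Azure OpenAI from a JSONL training file.
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<aoai>.openai.azure.com",
                     api_key="<key>", api_version="2024-02-01")

# training.jsonl: one chat example per line, for instance
# {"messages": [{"role": "system", "content": "You are a retail assistant."},
#               {"role": "user", "content": "Do you sell tents?"},
#               {"role": "assistant", "content": "Yes - the TrailMaster X4 starts at $250."}]}
training_file = client.files.create(file=open("training.jsonl", "rb"),
                                    purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-35-turbo-0613",  # base model to fine-tune (an assumption)
)
print(job.id, job.status)  # poll the job until it completes, then deploy the result
```

Note how the examples adjust tone and format rather than add facts, which is the complementary role fine-tuning plays alongside RAG as described above.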
Now I've talked a lot about the OpenAI models. I'm also very happy that we have an amazing assortment of other models that you can bring into the experience. We have the OpenAI models, but you can also deploy and fine-tune any model basically through a partnership with Hugging Face as well as partnerships that we have with a bunch of other companies to bring their models onto your own infrastructure. This is basically an infrastructure-based play.
You can go to our model catalog, select a model, you can fine-tune it, you can deploy it. It's a full experience of bringing open-source models into the Azure AI experience so you can build them into your generative AI application. Today, we also announced a technology called Models as a Service. Now this goes to the next level. We have a lot of customers who don't want to deal with infrastructure, they just want to call the API endpoint, and that is where Models as a Service comes into play. These behave basically the same as the Azure OpenAI Service or any of the other Azure AI services.
We operate the endpoint. You call the endpoint, it does what you ask. We have announced partnerships with Meta around the Llama 2 model to operate Llama 2 this way, also Mistral, Jais, and Cohere. We have their models now that we're bringing up as part of the Models as a Service platform. This is going to make it a lot easier for customers that don't want to deal with infrastructure just to use the model without having to think about infrastructure, which is the direction that we want to go.
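A hedged sketch of what calling a Models as a Service (pay-as-you-go) endpoint such as a Llama 2 deployment might look like. The endpoint URL, route, auth header, and payload schema here are assumptions; the deployment's Consume tab shows the exact contract for your endpoint:

```python
# Hedged sketch: call a serverless Models as a Service endpoint over REST.
import requests

endpoint = "https://<your-llama2-deployment>.<region>.inference.ai.azure.com/v1/chat/completions"
headers = {"Authorization": "Bearer <endpoint-key>", "Content-Type": "application/json"}
payload = {
    "messages": [{"role": "user", "content": "How much do the hiking shoes cost?"}],
    "max_tokens": 200,
    "temperature": 0.2,
}

resp = requests.post(endpoint, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

There is no compute or quota to manage on your side; you only pay per call to the hosted endpoint.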
But it's not enough just to have lots of models. I always say to customers, the proof is in the pudding and you should test. Because it turns out every model behaves differently and they behave differently with different test data. Now, I would say GPT-4 is an amazing model. We have tested it nine ways to Sunday, it is by far the best model we have tested on every measure of quality. Its inferencing performance is terrific, but you know what, it may not be perfect for your task.
We're going to stand behind what we do. We will offer a model benchmarking system within AI Studio. We're going to start by publishing benchmark results that we have run for these open models and our models, along with common benchmarks.
We're starting with a subset and gradually adding more. In the fullness of time, what we'll enable you to do is to benchmark any model against any other model using this system, this evaluation system that we have. You can bring your own data to that as well. This is a key part of answering the question about what model do I choose for what task. We're going to give you the tools to make that easy.
Again, I would love to actually ask Nabila to show off some of what we have there. Nabila Babar: I have a small surprise for you, John. John Montgomery: What's the surprise? Nabila Babar: I ran it again, and here it is.
Here, it does work. I didn't give up. (applause) Here we see "what gear do I need for Seattle weather," and you can see it referencing the document that we need, which is about a waterproof jacket, even though I'm not mentioning rain within that document. When I ask it a question about where to eat in Seattle, even though I have documentation about Seattle that's specific to food, it's referencing that and not any product documents. Here we can see the ranker pulling up the right information based on which question I'm asking, and it's also able to build those relationships between words. We were using the GPT-3.5 Turbo model the whole
time and an Ada model for embeddings. But let's say, as John was calling out, that we want to use different models within our application. Here within the model catalog, I have access to thousands of open-source models, including Meta's 70-billion-parameter Llama 2 model as well. But let's say I don't know which model to choose. With our benchmarking tool, based on the task that we're trying to do, in this case question answering, I can compare a set of models. Here I have the GPT-3.5 Turbo model that I was using,
being compared to a Llama 2 model against a standard dataset, and I see metrics right here. We will be adding additional metrics to this, additional datasets, and you can also replace this with your own dataset and benchmark those. Now let's say we want to use a Llama 2 70B model.
We're going to go back to the model catalog. We're announcing two services with this. The first one is being able to deploy this model through Models as a Service. This is a pay-as-you-go service.
The benefit of that is I don't have to manage any infrastructure or any quotas. I pretty much call an API and this model is ready for me to use. But let's say you want to customize this model. You may want to fine-tune this model if you want to introduce a particular bias into your application.
Those could be scenarios involving certain medical terminology or legal terms that you want to introduce within your application; for that, you may want to fine-tune models. For that, we're also announcing a fine-tuning service, and that's a managed service. All you need for this is your dataset; you're calling an API, and we're going to take care of the quotas and infrastructure for you. To see this, I've already gone ahead and fine-tuned the model. When I fine-tune it, I can see it within my fine-tuned models right here.
Here you can deploy this model. But before you deploy, you want to know what the metrics look like: is this model ready for me to deploy? Within the Metrics tab, you get a list of metrics that you can view. You can always go ahead, change your parameters, change your dataset, and continue to iterate on this, and then when you're ready you can select the best model and deploy it. We've talked a lot up until now about a text-in, text-out scenario. The future of generative AI is all around building on top of that and adding multimodality.
What multimodality means is that I can add text, images, and even videos to my applications. With AI Studio, you can do that. We're going to take a look at a few different scenarios. In the first one, what I've done is generate descriptions that can be either for my website or for my brands here. I did this by going to a different mode within the playground.
I can then build on top of that and I can also generate different images for my product descriptions and for my websites as well. Here we're using DALL·E 3, and I've saved the best for last. Here you see with the GPT-4 Turbo model, what I've done is I'm asking this.
I've uploaded a pretty bad video actually from a vacation that we took. I'm not telling it where it is, and I'm saying based on this generate a half-day itinerary for me with a packing list. What we see here is it's able to pinpoint exactly where in Yellowstone National Park this hike was.
It's also able to create a half-day itinerary for me and a packing list for my hike. I think that deserves a round... (applause) And that's not all.
What we've also done is, with Azure AI Speech, we've enabled speech-to-text and text-to-speech. Lastly, you can also create your own natural-sounding synthetic voices using Custom Neural Voice, and these can have different speaking styles and can adapt to different languages as well. Now back to you, John. John Montgomery: Cool, thank you. Never give up, never surrender.
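For reference, a hedged sketch of the text-to-speech step with the Azure AI Speech SDK (pip install azure-cognitiveservices-speech). The voice name shown is a standard prebuilt neural voice; a Custom Neural Voice would use your own trained voice name here:

```python
# Hedged sketch: speak a copilot answer aloud with Azure AI Speech.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # or your Custom Neural Voice

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "The TrailBlazer hiking shoes cost $110 and are fully waterproof."
).get()
print(result.reason)  # e.g. SynthesizingAudioCompleted when the audio was produced
```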
The Neural Voice is very interesting because of course, you can attach the output of one of these language models to a Neural Voice and it can speak on your behalf. I'm going to take the next couple of sections together and then Nabila will bring us home for a demo. I want to talk about safe and responsible AI. Microsoft has had these principles for a while about how we approach and how we think about safe and responsible AI. Things like fairness and reliability, safety, privacy, security, things like that.
But the thing is, it's not just principles. The principles infuse our engineering processes, the ways we build things. Everything from how we source data to how we train models. Also the tools that we create that we are able in turn to offer to you, and they're the same tools that we use internally.
One of those tools is Azure Content Safety, which is a major addition to our safety product line. We use it internally with our Copilots, we're offering it to you. It is the thing that can identify things like hate speech or sexual content, but it's also the thing that can identify jailbreak attempts and so on. That's our Content Safety Service.
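A hedged sketch of screening text with the Azure AI Content Safety SDK (pip install azure-ai-contentsafety); the endpoint and key are placeholders, and the exact response attribute names depend on the SDK version:

```python
# Hedged sketch: analyze a piece of text for harmful content categories.
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint="https://<content-safety>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<key>"),
)

result = client.analyze_text(AnalyzeTextOptions(text="user input or model output to screen"))
for category in result.categories_analysis:       # hate, sexual, violence, self-harm
    print(category.category, category.severity)   # severity 0 means nothing detected
```

The same screening can run on both the user input going into the copilot and the model output coming back, which is how it pairs with the content filtering shown later on the deployed endpoint.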
The other big service that we offer for model creators is evaluation and our Responsible AI dashboard, which is built into AI Studio as well. It enables you to see where the data may be biased or where you may be having issues with the application actually in production. Think about these as two parts of the equation for how you build a safe system. Again, it's not just talk, there's real software engineering behind actually making these things safe. Now let's talk about the full development life cycle.
This is increasingly a key thing. It's not just about calling the API and walking away. There's quite a bit of engineering that goes into shaping the prompt, that goes into the model orchestration, the APIs that are called, and so on. You need a complete way to do that.
At Microsoft that feature within AI Studio is called Prompt Flow, which is generally available today. It is our tool for prompt orchestration, for evaluation, and for prompt engineering. It is deeply integrated into AI Studio. It is also in our Azure machine learning product.
It gives you complete tools to be able to understand the behavior of the graph of models and APIs that you're calling, and to be able to evaluate the outputs. These models are very non-deterministic, and so evaluation is at the heart of how you actually make sure it's doing the right thing. A little coda on this one. Everything I've been talking about actually has application on the Edge as well.
We are beginning the journey to link together Azure AI Studio to be able to deploy and do things with models on the Edge. We talk about this as Windows AI Studio. It is a Visual Studio Code plug-in. We'll be showing a little bit of it here, particularly how you might fine-tune a model to be able to deploy it onto the Edge, using all the power of AI Studio in the Cloud. But to show off the rest of it, Nabila, would you please show us some more of the product? Nabila Babar: Awesome. Thank you, John.
So as you saw, the reality is a lot more complex than we were showing within a playground experience, and sometimes things may not go the way you want. How do you adapt to that? Customers may want to be able to add many different data sources. They may want to use many different models.
They may want to use many different meta-prompts and then they want to compare these. These are all variables that we can use in our application. But what are the tools that you need to be able to see the impact of these variables on your application? For that, we need evaluation tools. What I'm going to do now is click "Open in Prompt Flow." What this does is take me to Prompt Flow.
The entire UI that you saw is now broken up into these nodes and an orchestration engine that we can see here that was powering that UI. All of these nodes, they contain code that developers can check in. If I ask the same exact question, I'm going to get the same answer here. I can also see the files behind each of these nodes, right here at the top.
These are files that I can download, check in to my repo, and iterate on. But let's say I want to develop in a developer tool of my choice. If I click on "Open in VS Code," I'll be taken to VS Code Web. What's going to happen is a development container is going to be created for me, and it's going to be my own development workstation in the Cloud. Here I'm going to have access to the same exact files. Now I can go ahead and iterate in code, and whatever changes I make here are going to be reflected back in the UI.
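For reference, a hedged, minimal sketch of what one of those Prompt Flow nodes can look like in code (pip install promptflow): each node is a Python function wired into the flow's DAG. The function name, inputs, and prompt wording here are illustrative, not the files from this demo:

```python
# Hedged sketch: a Prompt Flow python node that formats retrieved chunks into a prompt.
from promptflow import tool

@tool
def format_retrieved_context(question: str, search_results: list) -> str:
    """Concatenate retrieved chunks into the context section of the prompt."""
    context = "\n\n".join(doc["content"] for doc in search_results)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"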
Let's go back there now. Developers face a lot of challenges with LLMs because of their non-deterministic nature. Minor changes in the meta-prompts can have major impacts on your application. To deal with that, we need evaluation tools.
What I have here is a dataset with a set of questions and answers. This is the same exact dataset that I used earlier. I also have a Prompt Flow here as well. In it, I have two prompts that I created. The first one is pretty simple, and the second one has instructions for it to be safer. What I want to do is see the impact of both of these prompts on my application.
For that, I'm going to run an evaluation. To save us some time, I'm going to go ahead right here and here you can see all of the different evaluation metrics that we have. We have both data science metrics like F1 scores and accuracy, as well as LLM-assisted metrics like relevance, coherence, groundedness, and GPT similarity. For example, GPT similarity compares the ground truth from my dataset to the output of the LLM. When we run this, we're going to see this evaluation within our evaluation runs. Here now we can see both of those variations and I can see aggregate metrics at the top as well as a breakdown for each row within my dataset.
I can see that, in general, the safer prompt has slightly better metrics. If I want to dig further, I can look at each of these variants and at each question in here. Going further, I can actually trace every single API call that this application makes.
I can view the input and output of that as well, if I want to know exactly how this score of 5 for groundedness, for example, is calculated and where it's coming from. What's amazing is that the evaluation run itself is a Prompt Flow. When I ran that evaluation, a new Prompt Flow was created for me. The benefit of that is that now I have transparency into exactly how these metrics are calculated. Further, I can clone this and customize and tweak it for my own scenario.
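As a conceptual illustration of an LLM-assisted metric like GPT similarity, here is a hedged sketch: the evaluation flow asks a grading model to score how close the app's answer is to the ground truth on a 1-5 scale. The grading prompt, deployment name, and scale wording are assumptions, not the exact prompt Azure AI Studio uses:

```python
# Hedged sketch: an LLM-assisted "GPT similarity" style metric.
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://<aoai>.openai.azure.com",
                     api_key="<key>", api_version="2024-02-01")

def gpt_similarity(question: str, ground_truth: str, answer: str) -> int:
    grading_prompt = (
        "Rate from 1 (completely different) to 5 (equivalent) how similar the ANSWER "
        f"is to the GROUND TRUTH for the QUESTION.\nQUESTION: {question}\n"
        f"GROUND TRUTH: {ground_truth}\nANSWER: {answer}\nReply with a single digit."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # grading deployment name (an assumption)
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return int(response.choices[0].message.content.strip())

print(gpt_similarity("How much are the hiking shoes?",
                     "They cost $110.",
                     "The hiking shoes are $110."))
```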
Let's finish off with deployments. Here you see I've already deployed my flow. I can consume this in my application. Lastly, I can monitor this. When I deploy this to an endpoint that's consumed in my application, I'm now seeing the same exact metrics for my application that I was seeing while I was prototyping and developing.
Now I'm able to have peace of mind when this application is deployed to my customers. Lastly, on the same endpoint, I can enable content filtering. Here you can see I can filter out both harmful content from the user input into this application as well as the output.
With that, our tour of AI Studio comes to an end. We're incredibly excited for the public preview, and we cannot wait to see what you build with it. Thank you.