Welcome to LlamaCon 2025 - Opening Session!


Please welcome Meta Chief Product Officer Chris Cox. [Applause] All right, welcome everybody to LlamaCon. It's an honor to be hosting you all today. It is crazy to think that two

years ago open source AI was a dream. I remember the conversations we had internally when we started talking about this. People were like, you guys are nuts. There was this financial question, does it make sense to spend all this money on training a model and then give it away? There was this priorities question, is this going to be a distraction to focus on developers and open source instead of just training the best model and putting it in our products? When we started talking about it externally, it was just like this is bad.

This was like peak AI doomer two years ago. Everybody said open source is going to fall into the hands of bad guys. There was not a lot of imagination about all the good that could come of it. There was the safety and security questions. There was the performance question. Could open source ever compete

close to the frontier of the closed labs. But we were in touch with the startup community. We were in touch with developers. We were in touch with entrepreneurs like everybody here.

And I mentioned this up here last September, but we were a startup once too. I was at the company when we didn't have a lot of financial resources and we wanted control of our own destiny. And we built this place on open source. And from the very early days, a lot of our best engineers were contributing back to the open source stack that we were built upon or becoming primary authors of new open source projects in the web stack and then over the years the mobile stack and then over the last decade the AI toolkit we build upon today. So it is great to see now that

open source is cool. There's general consensus in government, in the tech industry, among the major labs that open source is here to stay. It is an essential and important part of how AI will be built and deployed. It can be safer and more secure just like Linux because it can be audited and improved in broad daylight.

It can be performant at the frontier. And most importantly, it can be deployed and customized for specific use cases, for specific types of applications, and for specific types of developers like the folks here. The groundswell of support for Llama has been awesome. We announced 10

weeks ago a billion downloads after the release of Llama 4. In just 10 weeks, that number is now 1.2 billion, which is exciting. And what's cool, if you look at Hugging Face where the downloads are happening, is that most of these are actually Llama derivatives. We have thousands of developers contributing tens of thousands of derivative models, being downloaded hundreds of thousands of times a month, fitting into the specific applications, nooks, crannies, language support, and use cases that everyone here is building. This is thanks in part to the

awesome partners that we have. Thank you to everybody here. We appreciate the fact that regardless of your development environment, you can get started in using Llama right off the bat. It's a lot easier to plug and play specific parts of your applications because of the Llama stack. A lot of

partners have implemented this. That means that you can choose the best parts of the solutions. They're modular. It's easy to switch between component parts.

This has made development a lot easier. When we started thinking about Llama 4, we were very focused on the things we heard from the community that they care about most. A lot of performance in a very small package. This is our very first mixture-of-experts model, which means you're getting more performance in a smaller binary and a smaller set of weights. It's also the first open-source multimodal model, which means that photos and text were trained on natively, so it's fast on visual use cases. It's massively multilingual,

supports 200 languages natively. We use 10x more multilingual tokens than Llama 3. And it's got a huge context window. You can put your whole codebase in the context window. You can put the entire US tax code in the context window if for some reason you have a very complicated tax return. In this case, we took all of the

code for SAM 2, which is our object segmentation model, and asked it to write a getting-started guide, an FAQ for new engineers, which it did extremely well. Needle-in-the-haystack performance is good; it's a key bread-and-butter use case for large context windows. We're working on instruction following and latency, which are the areas we've heard about the most. Visual reasoning is quite good, especially again at the key tasks like OCR, PDFs, charts, and graphs. We know a lot of you are working in visual domains; this was a huge ask of us and one we're committed to improving. And then again, the most important thing was: make it as small as possible. Get as much intelligence as you can into the smallest possible model, because we're building businesses here. We're running them at scale. So our little guy is Scout, which runs on a single H100 and is best-in-class on a number of benchmarks for a model of its size. Llama Maverick is the next

biggest. This is 17 billion active parameters. It runs on eight GPUs. Most of you will be doing this distributed. This is what we're using in production for a bunch of our products including Meta AI. And we got to good performance here

by using a teacher model which is Llama Behemoth. This is absolutely massive. One of the largest models out there. Yeah, that's what he looks like. And again, this teacher model was how we got to good performance on Maverick in again a relatively small package. When we look at the stats, we're most proud of price performance. That's cost per token as well as latency that you get in small sizes. That's time to first

token and throughput in terms of tokens per second. These are the benchmarks, along with intelligence. Here we're using a composite, which is Artificial Analysis's blend of a bunch of key OSS benchmarks, that really measures what we know most of you care about: as smart as possible, at as low a cost as possible, and at as fast a speed as possible. Again, this is how we can deploy this stuff at scale, if you're running a business, whether it's small or whether it's large like ours. We're using this stuff to power Meta AI, which is used by almost a billion monthly users across WhatsApp, across our other apps, and in standalone experiences on Ray-Bans.

This is what we're using at scale. And we've customized it for the specific use cases, each of which you have as well, that matter to us. For us, that's conversationality. It's brevity, especially if you're on the go. It's humor. And speaking of Meta AI, we

announced an app this morning, which we launched. So, we wanted to focus on pushing the limits and a fresh take on how people could use AI. We were very focused on the voice experience, the most natural possible interface. So, we

focused a lot on low latency, highly expressive voice. The experience is personalized, so you can connect your Facebook and Instagram accounts, and the assistant will have a rough idea of your interests based on your interaction history. It will also remember things you tell it, like your kids' names and your wife's birthday and all the other things you want to make sure your assistant doesn't forget. We added an experimental mode which is full duplex voice. This means it's trained on native dialogue between people in speech. So rather than the AI assistant reading a text response, you're actually getting native audio output. And full duplex means that the

channels are open both ways, so that you can hear interruptions and laughter and an actual dialogue, just like a phone call. So it's pushing the limits of what we think is possible with a natural voice experience. It's early, so the duplex mode doesn't have tool use and it doesn't have web search, so you can't ask it about NBA trades or, like, the papal conclave. However, it's a very interesting way to see what's possible at the frontier. We also wanted to make this

fun, and we know a lot of people getting started with AI have no idea what to do with it, and it's not until they see the way that others like them are using it that they get inspired. So whether it's homework or creative writing or AI artwork or code, the community is actually quite mimetic, in that we learn from seeing each other do it. So we put this right in the app. You can share your prompts, you can share your art. It's super, super fun. We've really been enjoying prototyping this because it makes the experience a lot more creative and a lot more social. And then the last piece we focused on is pairing with your Ray-Ban glasses. These are incredibly popular. They are

the ultimate AI device of today. They're multimodal. You can ask questions about what's in front of you. Again, it's a voice interface. So, the app focuses on taking that and making it coherent whether you're using the glasses or whether you're just using it with your phone. So, in terms of eval, we like many of you have been very focused on understanding use cases and understanding user value. That's I know for each of you a

big question is how do you make that stuff really good? We've actually seen a bunch of partners here doing really amazing things. I wanted to highlight a few of what I think are some of the most inspiring examples that showcase what's possible with Llama. Starting with local deployments: the ultimate local deployment is a Llama in space. Booz Allen and the International Space Station are running a version of Llama 3 that makes it easy for astronauts and scientists on the International Space Station to make decisions based on tens of thousands of pages of documentation and manuals, to figure out what button to press. And they can do that without a

connection to Earth if the connection's absent or super super super high latency. This gets them something local that's quite good. Privacy matters a lot, especially in the domain of medicine and health care. We're seeing a bunch of really interesting work happening here. Here you see Sophia, which is a Latin American medical provider. They've already seen really good results on helping doctors spend less time on paperwork. We've seen similar work from

the Mayo Clinic, which has done this in the domain of radiology, making it easier for doctors to diagnose and understand complex cases instead of spending time rummaging through massive amounts of literature. The regional examples are very interesting. Farmer Chat is a sub-Saharan African tool for farmers in agriculture, who can spend time getting real-time information to understand what to do with their fields. It has language support, it's been really popular, and again they're getting great results. Kavak is a used-car marketplace in Latin America and the Middle East, a similar example where you're

getting really good results at a customer support use case that's focused on a specific region with a specific type of content based on really good fine-tunes. And we're seeing big folks using it, too. AT&T told us that they're taking the daily transcripts of all of the customer support calls and putting them into a context window so that they can generate basically top 10 lists for what their engineers need to focus on that day to fix bugs. And again, really, really good results from switching to an open source model because they can customize it specific to the things that matter to them. The models getting built on top

are awesome. NVIDIA's Nemotron is focused on reasoning and agentic use cases, built on top of Llama 3. Again, we're really excited to see the energy around this deployment. Enterprise use cases are telling us pretty good things about switching to an open-source model in terms of delivering results for enterprise customers who care about data security. This is Box. They have 100,000 active customers. They announced a

partnership today with IBM, which is going to allow them to scale to a lot more. So we're continuing to ask what we can do to make Llama better for you. With Llama 4, much like with Llama 3, we expect that the dot releases are going to deliver a lot of value and are going to let us burn down the list of what you're telling us we could do better. Starting with model performance and quality, most importantly instruction following. Then cost: keeping it as affordable as possible, making it as fast as possible, making it completely customizable and steerable to do what it is that you need to do with your tools, and of course, making sure you're not locked in. Not locked into Llama, not locked

into specific building blocks, being in control of your data, not having APIs changing out from underneath you, having control of your destiny. Also, a lot of feedback on ease of use, which we're continuing to work hard on. Today, we're announcing that you can now start using Llama with one line of code, because we're releasing an API. I'm going to invite up Manohar

Paluri and Angela Fan, two of our very first Llama engineers, to walk you through how that works. Come on up, guys. [Music] Hi everyone. It's amazing to be here with you all today. Yeah, it's great to

be doing this together, Angela. And what a crazy ride we have had since we started building Llamas together. Yeah, when we started making Llama 4, we reflected a ton on the feedback from you all in the developer community, on things like new capabilities we should add and improvements that we wanted to make, and we heard everything from multimodal support to increased efficiency in the base model. We couldn't take an incremental

approach to Llama 3. We had to take a step back and look at some hard technical decisions we had to make for Llama 4. And each one came with different trade-offs. A really early one that I remember we debated was just the architecture itself. We had been working

a lot on dense models. We had a lot of experience with this. It was simpler to configure, easier to parallelize. But

should we make the jump to sparse? The Llama 3 model family was dense, and we knew dense had its drawbacks, so we had to think about alternate architectures. Yeah, exactly. Dense models can be really compute-inefficient. And as we train more and more models and develop more products, a lot of this training inefficiency hit us pretty hard. And so we ended up going with sparse. There was

of course all that added complexity in some of our deployment and our training, but we got a lot of wins in performance, efficiency, and cost. I also remember the technical bet we made on distillation. Llama 4 Behemoth was a two-trillion-parameter model which packed a lot of intelligence, but it is harder to run at scale, so we used it as a powerful teacher to improve the performance of Maverick at the same price. Another one that I remember is how we incorporated multimodality; early or late fusion was a discussion we had. When we first started taking Llama 2 and trying to expand beyond its text performance toward Llama 3 by adding the image capability, we had kind of frozen in the text model of Llama 3 and added on image by adding adapter layers. This was very straightforward and it made for a very modular system.

But with Llama 4, we wanted it to be natively multimodal. So we took a step back and changed the training recipe. We started with text training, interleaved other modalities, and this allowed Llama 4 to do cross-modal training, which made both Scout and Maverick much better across all modalities. I think we could keep going on and on about the numerous technical decisions we made in building Llama 4, but we are excited today to talk about the API. Our goal from the beginning for

the Llama API has been that it should be the fastest and easiest way to build with Llama. Not only that: as you saw with the derivative models, it should be the best way to customize these models. And going one step further, you should be able to take these custom models with you wherever you want. No lock-in, ever. Speed, ease of use, customization, and no lock-in are the marquee features of the Llama API. And we handle all of the inference and the infrastructure work, so you all can focus on building your products, which is what you care about most.

We want to deliver the best performance and the features that you would get using closed-model APIs, but with the control and the openness that only the open-source ecosystem can provide. Let's check out how it looks right now. So if you want to start, you can just register your API key with one click. If you're ready to go right away, you can head on and immediately make a curl request.
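To make that concrete, here is a minimal sketch of what a first request might look like from Python (the curl version is the same call). The base URL, model name, and response shape used here are illustrative assumptions rather than the documented values, so check the Llama API reference for the exact details.

```python
# Minimal first request to the Llama API (assumed endpoint and model name).
import os
import requests

API_KEY = os.environ["LLAMA_API_KEY"]       # the key you registered
BASE_URL = "https://api.llama.com/v1"        # assumption, not the documented URL

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-4-maverick",         # assumed model identifier
        "messages": [
            {"role": "user", "content": "Say hello in one sentence."}
        ],
    },
    timeout=30,
)
resp.raise_for_status()
# Assumes an OpenAI-style response body; adjust to the real schema.
print(resp.json()["choices"][0]["message"]["content"])
```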

Just copy the code and head to the command line. We maintain a lightweight Llama API SDK in both Python and TypeScript. Or, if you're already familiar with the OpenAI SDK, you can use that directly by just adjusting the endpoint. This makes it really easy to migrate applications to Llama. If you want to just play around quickly, you can head over to the chat completions playground, where you can play around with different models and test them out. You can change model settings

like the system instruction or the temperature to test out different configurations. Llama 4 models are also the leading multimodal LLMs in their class, so you can take advantage of image input as well. And you can

also send multiple images per prompt. If you want a structured response, you can also input a JSON schema to the model. You may also want to use tool calling. We allow you to provide tool specifications so that the model can decide when to invoke them autonomously. This is a preview feature that we're looking to get a lot of feedback on from the developer community so that we can improve it. Once you're done configuring and setting everything up, you can hit Get Code and copy it from there right into your codebase.
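As a rough illustration of the structured-response and tool-calling features just described, here is what those requests might look like through an OpenAI-compatible client. The base URL, model name, and the exact parameter shapes the Llama API accepts are assumptions, and the get_weather tool is purely hypothetical.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llama.com/compat/v1/",  # assumed compatibility endpoint
    api_key="YOUR_LLAMA_API_KEY",
)

# Structured response: supply a JSON schema the output must conform to.
structured = client.chat.completions.create(
    model="llama-4-maverick",                      # assumed model name
    messages=[{"role": "user", "content": "Give me basic facts about Tokyo."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_facts",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "population": {"type": "integer"},
                },
                "required": ["city", "population"],
            },
        },
    },
)
print(structured.choices[0].message.content)

# Tool calling: describe a function the model may decide to invoke on its own.
with_tools = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "What's the weather in Austin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",                 # hypothetical tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
print(with_tools.choices[0].message.tool_calls)
```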

And we've designed all of this with a focus on data privacy and security. Any input that you send to the API or output that you generate, we don't use to train our models. Those are the basics. But as we said, customization is where open source should really lead the way, and one of the ways to customize these models is fine-tuning them. The Llama API is really

good at fine-tuning for your product use cases. And the best part, you have full agency over these custom models. You control them in a way that's not possible with other offerings. Whatever model you customize is yours to take wherever you want, not locked on our servers. This opens up a lot of

possibilities for people that need to build and deploy custom models. Here is a quick look at how fine-tuning works. You click on the fine-tuning tab. Once you are there, you

start a new job and upload the data that is used for your fine-tuning, or you can reuse data that you have already uploaded. By the way, we have new synthetic data toolkits that help you generate really good post-training data for your use case. We talked about this with the Llama 3 405B model as a distillation mechanism. There's a dev session later today if you want to learn more about it. Once your data is uploaded, a typical thing you could do is split off some part of the data for evaluation, so you know how the custom model is going to do. And you configure

the job, giving it a name and choosing some hyperparameters, and it's that simple. You can keep an eye on how the job is progressing, where you can look at the training loss as it progresses in real time. Once the job is complete, you get to the best part: you can go and download your custom model and run with it wherever you want. You're not locked in. Or you can run it on the Llama API, whichever is convenient for you.
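For readers who prefer to see the flow as code rather than clicks, here is a purely illustrative sketch of the same steps over HTTP. The paths, field names, and hyperparameters below are hypothetical stand-ins, not the real Llama API surface; the console walks you through the equivalent steps.

```python
# Hypothetical sketch of the fine-tuning flow described above.
import requests

BASE = "https://api.llama.com/v1"               # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_LLAMA_API_KEY"}

# 1) Upload training data (e.g. a JSONL file of prompt/response pairs).
with open("train.jsonl", "rb") as f:
    dataset = requests.post(f"{BASE}/files", headers=HEADERS,
                            files={"file": f}).json()

# 2) Start a fine-tuning job, holding out part of the data for evaluation.
job = requests.post(
    f"{BASE}/fine_tuning/jobs",
    headers=HEADERS,
    json={
        "name": "support-bot-v1",
        "model": "llama-4-scout",               # assumed base model name
        "training_file": dataset["id"],
        "validation_split": 0.1,                # hold out 10% for eval
        "hyperparameters": {"n_epochs": 3, "learning_rate": 1e-5},
    },
).json()

# 3) Poll the job, watch the training loss, then download the custom weights.
status = requests.get(f"{BASE}/fine_tuning/jobs/{job['id']}",
                      headers=HEADERS).json()
print(status["status"], status.get("training_loss"))
```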

Okay, so you've fine-tuned the model. Now you've got it ready. But how does it actually perform? To evaluate your fine-tuned model, you head over to our evaluation area. We fill out some of

the boilerplate for you, like the job name and the model. And if you've split your data set during the fine-tuning process, we also select that by default. More importantly, you can add one or more graders that will score your model's outputs against specific criteria that you care about. For example, maybe you really care about factuality. From there, you can kick off your job. It then hits our batch

inference API, and we generate a bunch of model outputs and grade them. When the evaluation job is complete, you can check out a quick summary from your graders or, more importantly, you can dive deeper into the results and read them yourself. I feel like with text generation today, you just can't replace reading the content yourself. Our flow is designed to be very simple compared to alternatives, which will help you understand the performance of your fine-tuned model as easily as possible. Many startups have been testing the Llama API for a while, and the early feedback we're getting is about how easy it is to use and the speed and quality of the responses.
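Conceptually, a grader is just a function that scores each output against a criterion you care about, with the scores aggregated over the evaluation split. The hosted flow above runs this for you via batch inference; the standalone toy below only illustrates the idea, with a deliberately naive factuality check.

```python
# Toy illustration of a "grader": score outputs, then aggregate.
from typing import Callable

def factuality_grader(output: str, reference: str) -> float:
    """Toy grader: 1.0 if the reference fact appears in the output."""
    return 1.0 if reference.lower() in output.lower() else 0.0

def evaluate(outputs: list[str], references: list[str],
             grader: Callable[[str, str], float]) -> float:
    scores = [grader(o, r) for o, r in zip(outputs, references)]
    return sum(scores) / len(scores)

# Example: outputs from your fine-tuned model on the held-out split.
outputs = ["Paris is the capital of France.", "The capital is Lyon."]
references = ["Paris", "Paris"]
print(evaluate(outputs, references, factuality_grader))  # 0.5
```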

So that's how you build your own Llamas using the Llama API, available in preview today. But wait a second, this is only one part of the job. The second part is actually taking these custom models and running them at scale. Which brings us to your favorite topic: inference. Mano

knows me too well. I mean, you've built all of these great models, but how do we actually serve them efficiently? And so, when we designed Llama 4, it was with inference efficiency in mind. This is a huge priority for Meta, because Maverick powers Meta AI and we have users all over the globe. And it's also extremely important to everyone we've heard from in the developer community. We used a mixture-of-experts architecture to activate only a fraction of the total number of model parameters. But beyond this design choice, we specifically sized Maverick.

Why is it the number of parameters it is? So it would fit on a single host with FP8 weight quantization. This makes deployment very straightforward, because you can use single-host inference. Although, of course, if you want higher throughput, distributed inference options are always available. And the Llama 4 Scout model fits on a single H100, which really opens up many more possibilities for developers.
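For a rough sense of why that sizing works, here is the back-of-the-envelope arithmetic, using the publicly reported total parameter counts for Maverick (about 400B) and Scout (about 109B). Treat it as an estimate that ignores KV cache and activation memory.

```python
# Rough weight-memory estimate for single-host FP8 deployment.
maverick_total_params = 400e9   # ~400B total parameters (reported figure)
scout_total_params = 109e9      # ~109B total parameters (reported figure)
bytes_per_param_fp8 = 1         # FP8 stores one byte per weight
h100_memory_gb = 80             # HBM per H100
gpus_per_host = 8

maverick_weights_gb = maverick_total_params * bytes_per_param_fp8 / 1e9
host_memory_gb = h100_memory_gb * gpus_per_host
print(f"Maverick FP8 weights: ~{maverick_weights_gb:.0f} GB "
      f"vs {host_memory_gb} GB on an 8xH100 host")   # ~400 GB vs 640 GB

# Scout on a single H100 needs more aggressive (roughly 4-bit) quantization.
scout_int4_gb = scout_total_params * 0.5 / 1e9
print(f"Scout ~4-bit weights: ~{scout_int4_gb:.1f} GB vs {h100_memory_gb} GB")
```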

Okay, so that was the model. We should also talk about the inference stack. Yes, so we can't forget about the software we serve on.

So as we moved from dense to MoE, we actually switched over to our brand-new runtime to power this inference stack, and we shipped a couple of new features along with it. For example, we added speculative decoding. This is a technique that uses a smaller model to draft a set of tokens for the main model to use, and it speeds up token generation by about 1.5 to 2.5x.
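Here is a toy sketch of the speculative decoding loop being described: the small draft model proposes a few tokens cheaply, the main model checks them in a single pass, and the agreeing prefix is accepted. The target and draft objects and their methods are placeholders rather than a real library API, and production implementations accept tokens probabilistically instead of the simple greedy match shown here.

```python
def speculative_decode(target, draft, prompt_tokens, k=4, max_new_tokens=64):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # 1) The small draft model cheaply proposes k candidate tokens.
        proposal = draft.generate(tokens, num_tokens=k)

        # 2) The main model scores the context plus the proposal in one
        #    forward pass, giving its own next-token choice at each position.
        predicted = target.next_tokens(tokens, proposal)  # length k + 1

        # 3) Accept the prefix of the proposal the main model agrees with.
        n_accept = 0
        for drafted, checked in zip(proposal, predicted):
            if drafted != checked:
                break
            n_accept += 1
        tokens.extend(proposal[:n_accept])

        # 4) Always take one token from the main model so decoding progresses.
        tokens.append(predicted[n_accept])
    return tokens
```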

We also use paged KV caching to handle long queries. This breaks up the KV cache into multiple smaller blocks that can be stored and accessed more efficiently with a lookup table, from a memory-management standpoint. And of course, being an open model means that hardware makers can do deep integrations to deliver the best performance you'll get anywhere. We're thrilled today to announce a collaboration with Cerebras and Groq that will help developers get even faster inference speeds with the Llama API. You can now request early experimental access to Llama 4 models powered by this exciting hardware, which is a great way to experiment and prototype with different use cases before scaling with a vendor. And to activate this, you can just select the Cerebras or Groq model names in the API, and you'll be able to test them out with a streamlined experience straight out of the box.
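Going back to the paged KV cache for a moment, here is a small illustration of the idea: the cache is carved into fixed-size blocks, and each sequence keeps a lookup table from logical positions to physical blocks, so long queries don't require one huge contiguous buffer. This is a toy sketch of the bookkeeping only, not the runtime's actual implementation.

```python
BLOCK_SIZE = 16  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's KV is stored."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):      # need a new block
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
print(cache.append_token(seq_id=0, position=0))    # -> (1023, 0)
print(cache.append_token(seq_id=0, position=16))   # -> (1022, 0), second block
```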

So those are the partnerships. This is what you've got to work with: the models, the

API and the Llama ecosystem of partners, the speed and the efficiency, the ability to ship by building really robust products on top of really robust infrastructure, and the power to customize the models and take them with you, with full control over what you build. When you put the models, the API, and the Llama ecosystem of partners together, you can do incredible things. And we are already seeing this happen, with agents that can navigate the web and use computers, and with state-of-the-art performance on OCR from open-source models built on top of Llama. They are showing up every day, and it's just less than a month in. There's

never been a better time to shift to open source from closed models. You get maximum flexibility, full control, and the freedom to build wherever and however it makes sense to you. It's a great time to build with Llama. And thank you so much for joining us. We can't wait to see what you build with Llama 4 and the API. Back to

Chris. [Applause] All right. So, when we think about where AI is going, we believe, and I'm sure many of you agree, that we're moving from textual domains to visual. And just like people don't learn to understand the world through reading the internet, AI will learn by seeing the world and interacting with the world the same way that we do. At FAIR, we've

been very focused on building the toolkits for perception and visual domains, to make it easier for folks all around the world to build the next generation of AI. We announced a few things a couple weeks ago in the open source that I wanted to highlight. Starting with Locate 3D, which is a tool for labeling boundaries and objects in any 3D environment using text. This is incredibly valuable for creating data sets and building applications on top of 3D environments. This, along with a whole bunch of our other tools, uses a Perception Encoder, which is a visual encoder that works well across many visual domains and is extremely portable and flexible.

We're using this in a bunch of our AI tools and we hope that many of you can too. This powers, among many other things, SAM 2, the world's best object detection system for image and video, which we open-sourced last fall. We've seen immense popularity for this in scientific communities and many other visual domains. We're also using it internally for our creative tools like Instagram Edits, which we rolled out globally last week. It turns out that

good object segmentation and detection is the basis for very interesting creative tools which we're continuing to roll out. I got permission from the team to announce that SAM 3 will be coming this summer. Major advancements. Yeah, AWS is partnering with us to host this natively so that if you're working in a visual domain, you can just add this right in. It's super easy. It's

very low latency. It's basically real time. And the innovation here is that rather than pointing and clicking or labeling objects with code, you just say the thing or type the thing that you want it to label, and you'll get it instantly across the entire database. And so you'll spend less time finding all the potholes in San Francisco, which you can now just do. I hope that helps. Um, and then the consumer use cases just sort of write themselves. I mean, you

can blur all the faces for privacy reasons. You can blur license plates. Anything you can imagine where you want to take any object type and then just quickly deal with it. Um, you can do. So, we're using this a bunch internally and we hope that you all will get a lot of value out of this, too. I've been spending the last six to eight weeks now driving my 10-year-old son to school and prototyping this app I showed you earlier. So, we turn on the voice assistant and I'm like, "Hey, you can ask it anything." And he's like, "Okay." So, we like ask about world news. It's like,

"Do you want to learn more about tariffs?" He's like, "Not really. That's cool, but I don't care." Um, and so like many young people, it's just taken for granted now that you can just talk to the internet. Um, however, he really lit up when I explained to him that he could talk to it about Dungeons and Dragons. My D and D fans are here. Okay,

great. Yeah, but he, you know, he loves D and D. It's like his favorite thing in the world. He's got his D and D friends. Um, but then most other people he knows don't know anything about it. So it's

lost on them that he's a level four druid. And uh, of course he he dives into the conversation and and you know it's an assistant that knows everything about D and D. And it could immediately pop back with like, okay, what's your subclass? Like what's your dexterity modifier? I may be getting this wrong. Um, and it's cool because that's for him when the light bulb went off. It's like, okay, this is a very specific niche interest that I have that I'm deep on, and now I'm actually able to go deep in an ongoing way with something that gets it and that is sort of imbued with all of the wisdom and understanding of the internet on this specific topic. And

I think part of what's so fun and interesting right now about the field is those light bulbs are going off for a lot of people in a lot of different domains and verticals as people out there are starting to wrap their heads around the power behind these systems once you get past sort of the first chapter. So we are very very very excited to be hosting you all today. Thank you for making the trip out here. There's a great lineup uh for us to understand better what we can do to serve you guys. We really appreciate you building with us. So have a great Llama Con. And then coming up after just a

short break, you're going to have Mark and Ali from Databricks, who are going to walk you through what they're working on. All right. Thanks, everybody. [Music] Please welcome Meta founder and CEO Mark Zuckerberg and Databricks co-founder and CEO Ali Ghodsi. [Applause] All right, welcome to LlamaCon. Good

to see all of you guys. All right. So, we went through some of the new stuff with Chris a minute ago. That was awesome. Now I'm excited to have a conversation with someone who's been a great partner, and really someone who's been working on open source since the beginning of your company with Databricks. You guys do all the data management, and obviously now,

AI has become increasingly important to everything that you do. From the early days you've embraced all of the different open-source approaches to this, and have really been great at building an implementation that just makes it super easy for enterprises and developers to get started with a lot of these different things. So you kind of have this front-row seat to this, and I'm glad that you're here to tell us about what you're seeing. What are the big trends? There's obviously a lot of new directions for the technology, a lot of new inventions that are coming out, but I'm curious both about what you're seeing get the most traction, what the most exciting trends are, and what you think is on the horizon that you're most excited about and that your customers are seeing. We can start there and just go whatever direction makes sense. Awesome. Yeah, thanks so much for having

me. Super excited about this. It's an awesome event. Yeah, I mean, the rapid progress is just crazy. I think we kind of lose sight of it. I think the most impressive thing is, I mean, I think Llama 3.1 was a

year ago, you know, last year, a little less. Yeah. And that was a 405-billion-parameter model, and then soon after you guys released 3.3, and the 70B model was just as good as the big one. So you get

like these smaller and smaller models that are just as good as a way bigger model from a year earlier, and that just unlocks lots of new use cases, because that puts pressure on price, prices come down, and that unlocks new use cases that were just not possible before. Then there are new architectures. We are super excited that there's mixture of experts, because that again lets you go faster and cheaper with more intelligence, so that again unlocks a lot of use cases. So that's super

cool to see. The context length is getting bigger and bigger, and that also unlocks new use cases, where you can just put whole things in the context of these models. And then there's coding; I'd be remiss not to mention that. I would say just a year, a year and a half ago, software engineers were not really affected by AI, and now it's just completely transformed what people are doing. And then of course we have agents, so we could go on and on. Yeah. So, are there any particular success stories that you're seeing from developers who are using Llama or any of the other open models on Databricks that you think are kind of inspiring and interesting to share? Yeah, I mean, there are thousands, tens of thousands, if not even more. Let's go with two. Okay. All

right. Two. All right. I mean, one that I like is Crisis Text Line. What they do is they use Llama, and they've customized it to be able to detect people who want to do self-harm or suicide, and it can help the people that call in to their lines. It can detect what's going on and give you risk

scores. I think just last year they've had millions of conversations with people that were in crisis situations. And thanks to it being open source, you can also customize it. It gives you opportunities; at Databricks we can develop techniques that really customize it, let them monitor it, and improve the quality, because that's an important use case where you don't want to get things wrong. Accuracy is really, really important: you don't want false positives and you don't want false negatives. So I think that's something that would not have been possible a few years ago. Another one, on the more business and enterprise side, that's really interesting: people have been using FactSet terminals, like Bloomberg terminals, things like that, and on these, analysts have had to type in queries in the FactSet query language for the last 20 years. So if you want to see how stocks are doing, you have to enter these complicated things. Well, with Llama, now they can just ask in English. They can just say, hey, how's the earnings per share for this particular stock compared to this bond, this and that, and it just produces the code for it. And that was just not possible before. So it just opens up a whole new market for FactSet. So those are just two things that would have been completely impossible, let's say, three or four years ago. Yeah. Interesting. What

about some of the new capabilities that have come out? Right. So obviously the model architecture for Llama 4, the first two models, Scout and Maverick, were designed to be very low latency, very efficient. I mean, that's why we chose 17 billion active parameters, and the context window length was a big thing that we focused on. What are you seeing in terms of novel things that people are doing? I know

it's still early days obviously, but I'm curious what you're seeing around all that. Yeah, I mean, there are so many different techniques. The most important thing we're seeing, especially for developers, is: do you have access to some special data? If you have access to some special data, there are just so many things you can do. The number one thing is, if you can get the evals or benchmarks right on these models, then you can get into this iterative improvement loop and start tuning the model. So one thing we did at Databricks that would not have been possible if Llama was not open source is we developed this thing called TAO, where we can do reinforcement learning on our customers' data and make the Llama model really understand the custom data, and reason on the custom data, that each organization has. We call this data

intelligence. This kind of reinforcement learning technique would not have been possible if Llama was not open source; if we were using any closed-source model, we couldn't do that. So things like that, that open up because the model is open source, are a big deal. The context length, this is early days, but it's really interesting, because people are using all kinds of hacks, you know, using vector search, embedding documents, and so on. If you can just take all of your relevant data and put it in the context, and then Llama can just figure out what's relevant, that's also a game changer. So I'm excited about that direction as well.
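A small sketch of the contrast being drawn here: instead of embedding documents and retrieving the top few chunks, a long-context model can simply be handed all of the relevant material and left to decide what matters. The character budget, file names, and the commented-out request mirror the earlier sketches and are assumptions, not measured limits.

```python
def build_long_context_prompt(question: str, documents: list[str],
                              max_chars: int = 2_000_000) -> str:
    """Concatenate every document into one prompt and let the model decide
    what is relevant, rather than pre-filtering with a retriever."""
    corpus = "\n\n---\n\n".join(documents)[:max_chars]
    return (
        "Use the following documents to answer the question.\n\n"
        f"{corpus}\n\nQuestion: {question}\nAnswer:"
    )

docs = [open(p, encoding="utf-8").read() for p in ["policy.txt", "faq.txt"]]
prompt = build_long_context_prompt("What is the refund window?", docs)
# response = client.chat.completions.create(
#     model="llama-4-scout",
#     messages=[{"role": "user", "content": prompt}],
# )
```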

But I do think the mixture of experts is probably the biggest change, because it just lets us lower the price so much. A lot of enterprises have use cases where, you know, every night hundreds of gigabytes of PDFs come in, and we want to just extract all this information from the text automatically, which by the way also includes vision, right? It needs to be able to understand other modalities than just text; it needs to extract pictures and things like that. And now you can just, you

know, run that every night, and it's lots of documents. So the price matters a lot because these are big data sets. Mixture of experts lets you just cut that price down, because it's using much fewer active parameters when it's running. So these are all

things that would not have been possible even with Llama a year ago, and now we can do it. Yeah. So, obviously your wheelhouse is really around all the data stuff and combining that to customize. And one of the main pieces of value around open source is that you can fine-tune and customize the models however you want. So, I'm curious

what you're seeing in terms of how developers are making really good use of the open-source models, whether it's combining them with RAG or different ways to do fine-tuning to build custom models. Distillation is another thing that I think has really emerged over the last year as a really important and valuable way to basically take a larger model and then suck most of the intelligence out of it into whatever form factor you need. But I'm curious what you're seeing people do around that whole set of things, around basically customizing and building your own things, combining them with the data they use for all the rest of the Databricks services. Yeah, I mean, I'll tell you the most surprising thing for me that I saw from Llama. It's like, okay, you guys put out the bigger ones, they're super smart, and we're like, oh, as a company, we're going to make a lot of revenue on the big models. But the biggest

surprising thing I've seen is that they distill down and use the smallest possible model they can find. So the most popular Llama 3 model was the 8 billion. Yeah, 8 billion. And we'll build, I mean, we released the Scout and the Maverick ones, right, 17B active parameters by however many experts, and we're working on the big ones, we talked about Behemoth, but we're also working on, I don't know, the internal temporary code name is Little Llama, but we'll see what we actually end up calling it. I think that's going to end up being a really popular thing too. Yeah, I think the smaller ones, I mean, the big ones are important because they have the intelligence, and people are using Databricks to distill them down, and yeah, you're right, they use RAG, they put the relevant data in there, and they use both what they have in RAG, in the vector database, and what the bigger models have.
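For readers who have not seen it written down, this is roughly what the distillation being described looks like in training code: the big teacher produces soft probabilities and the small student is trained to match them, usually blended with the ordinary hard-label loss. This is a generic PyTorch-style sketch with placeholder models and data, not Databricks' or Meta's actual recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer, T=2.0, alpha=0.5):
    inputs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher(inputs)       # big model, frozen

    student_logits = student(inputs)           # small model being trained

    # Soft-target loss: match the teacher's temperature-smoothed distribution.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard-label loss on the original data.
    ce_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * kd_loss + (1 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```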

But ultimately, their goal is: this seems to be the frontier quality I can get with the bigger Llama models; now I want to distill down to the smallest possible model for this particular use case, and the cheaper I can make it the better, because I'm going to hit this billions of times every day. So the smaller I can make it, the better. And yeah, you're right, with distillation you can, for that particular use case, get the tiny model to have the same intelligence as the biggest ones. Not on everything, but most organizations don't want to do everything; they want to do this one thing repetitively, again and again. It's just like jobs: we don't do everything in a job at work; people are specialized and they do something repeatedly. Yeah, so do you have a sense of where the sweet spot is on that, or has it shifted over time? The sweet spot on the size? The size, yeah, for the model architectures. Yeah, I mean, the smaller the better, honestly, because it's a cost thing, and they keep pushing. The question is just how much can we compress the AI down. So a lot of organizations are thinking like this: okay, I have this repetitive task, what's the frontier, the state of the art? Okay, now do that as cheaply as possible. Who can do this, or can I distill it down as cheaply as possible? And I don't think there's a rule like, oh, you don't want to go below 1 billion parameters or so; no, you can go even further. You know, we

have a lot of customers that are doing coding, and for coding, as you're typing, you want it to code ahead and write ahead for you there. There, latency is super important, so even a few billion parameters might be too big. You want it to quickly do autocomplete; milliseconds count.

So I think there's going to be a continued race: frontier, big, smart models, which you guys are doing, and then, like, I'm looking forward to the... what did you say, Small Llama? Yeah, Little Llama. Little Llama. Looking forward to Little Llama. I hope it's really small. And Tiny Llama and Mini Llama. There's always a smaller llama.

Yeah. Exactly. Yeah. Yeah. Um interesting. So yeah, I mean part of a big part of how we designed and chose the model architecture for the the 17b expert size was around latency. Um but not like that kind of extreme latency we're talking about. I mean, a

big theme for us, and you guys might have seen, we just dropped this Meta AI app this morning. There are almost a billion people using, or monthly actives at least, using Meta AI across our different apps. And a lot of people are using it on a daily basis, but we figured it would be useful and some people would enjoy just having it as a standalone app. One of

the big themes in that is voice, which I think is going to be an increasingly important paradigm for how people interact with um with with these models. and um and that obviously requires really good latency. So, I mean that that's one of the things that we've been thinking about. It's maybe a little bit different from type ahead type latency, but you obviously when you're done speaking, you want it to be able to respond very quickly and you want it to be able to um to kind of jump in. So, I I I think the latency thing is is um and and cost is definitely a big part of this. a lot of how we've tried to design these are um you know obviously we want to be able to deliver kind of the highest level of intelligence but I think most folks who are running applications at scale really care about the intelligence per cost or the intelligence per kind of compute cycle um because you're just constrained on on how on you know on on uh on on scaling this stuff. So that

definitely makes sense and and I mean obviously we're a little bit less in the enterprise context than you are, but it's um but it's good to to to kind of hear that's consistent across it. It's spot on. It's it's and and you're right, the voice use cases are just massive. You know, it's like there's just organizations out there, millions of companies out there, you know, they interact with their customers mainly through voice. Uh so yeah, it's it's super exciting. I think that's going to be one of the big modalities that's going to be a gamechanging, you know, I mean, why do you need even a keyboard in the future? Why type? I mean, we don't we're not typing to each other right now. I I I think text will remain. I

think you and I probably communicate a little more over text than we do by calling each other. Although we call each other a bit. That is true. But uh but do we have to type? Um when there's someone else in the room, it's helpful. That's true. That's true. But but no, I mean I I generally agree with your point. I think

that voice is kind of under-indexed today. I mean, my understanding is that the mix today, in terms of this type of AI use, is maybe 95% text or more. And I do think that over time, especially as more people get wearables, like the glasses that I'm wearing, you're just going to be talking to AI throughout the day in all these subtle ways and getting interesting context from that. So I think voice will be a lot bigger than it is today. 100%. Yeah. All right. So, open source,

let's talk about your history on this and how you see this whole thing developing. I mean, this really has been a part of Databricks and your DNA from the beginning. You've been a huge proponent, and someone who I've bounced a lot of ideas off of when I'm trying to think through what we should do with Llama and how we should approach this. You've been an important adviser and partner to me and Meta in figuring this out. But open source is obviously bigger than just us, and I'm curious how it ties into your vision for what you're trying to do with Databricks, and where you think the ecosystem is going overall. Well, maybe I'll stop there and then I'll ask more questions after. Yeah, I think that's awesome. I mean, we're big believers in open source, ever since the Apache Spark days, from the beginning. I think the most important thing is that, first of all, it just puts pricing pressure on this stuff. I

think people are not taking into account that the price of serving large language models or VLMs is where it is today in large part thanks to Llama, because anyone can take that model and serve it themselves with llama.cpp; they can even run it on their CPU. That immediately drops the price, and that just means it unlocks a lot of use cases for a lot of organizations out there. The price has been dropping continuously on this stuff. Yeah, and you see it every time we do a Llama release: all the other companies drop their API prices. So I think that's a game changer, because it's a supply and demand thing, right? Now that the price is lower, you have much more demand for this stuff. It unlocks so many new use cases that would not be possible before. So I think that in itself is huge. The second thing is all these universities, all these researchers around the world that want to work on AI; they kind of can't use these closed-source models. They're kind

of blocked out. Actually, I work part-time at UC Berkeley, and at the universities the mood is not great around AI, because they feel like, hey, we're kind of locked out; we don't have access to the weights of these models, we can't do anything. So Llama has been a game changer for them, because now they can actually start using all these techniques and they can come up with new ways to do things, new ways to fine-tune or add reasoning and so on. So I think that's really game-changing for the whole planet. If it's closed source, it's just that one company that has it, and we have to depend on them doing research on it. When it's open source, you get this open research and open sharing; you get a whole world working on these models, and you just have much more rapid progress. So I think that's really important. And then of course there's the usual thing with open source. The points so far were specific to LLMs, but with open source you of course get this community of people using it and you build up this ecosystem around it. And that I think is invaluable, and we see that already.

People are just super excited to mix and match the different models, and there's a lot of cross-pollination between all these different models that are being put out there. If you track LocalLLaMA on Reddit or something like that, you can just see people doing crazy things, slicing and putting the models together and getting better results. All of

this would be completely impossible if it wasn't open source. So I actually think there's going to be a continued race for open-source models, and we're going to see them continue to be at the frontier. In fact, I've kind of joked that in the future, when it comes to the model API business, or serving these LLMs, every model will be open source; you might just not know it yet. Because there's

distillation, there's all this sharing of this stuff. So I think in the long run everything is going to go toward open source, and I think you guys were a big reason, if not the main reason, why that's happening, because you put that out there, you put that in front, and you pushed for it. Yeah. No, I mean, it's been really gratifying to see the industry move in this direction. So I'm very happy

that that is happening. Um, you mentioned distillation and you mentioned other models and I think it's worth spending a few minutes on this because to me one of the most valuable parts of open source is not just that people can take the models and use them and customize them a little bit, but it's like you can really take it and distill it into whatever you need it to be and whatever shape you need it to be. And I think one of the most interesting things over the last year if you if you kind of rewind a year, Llama was like really the main and kind of only at that point like major open source uh model and now that's not really the case. Now now there are others as well. I mean Llama I think still leads in in in many areas.

But I think the reality, and part of the value around open source, is that you can mix and match. So if another model, like if DeepSeek is better, or if Qwen is better at something, then as developers you now have the ability to take the best parts of the intelligence from the different models and produce exactly what you need, which I think is going to be very powerful. But it means that there's an even more important place in the ecosystem for companies like yours to help build these distillation tools. And I'm curious how you see that going, because I just think that this is part of how open source basically passes all the closed source in quality: you don't just have to choose one model off the shelf and then either customize it or distill off of that base. You get this increasing mix of being able to pick the parts that you want from an increasing number of open-source models. And honestly, it feels like sort of an unstoppable force, but it really does need the infrastructure that companies like yours provide. So yeah, I'm curious how you think

about that and where you think that's going, and what technology needs to get built to really enable that whole theme. Yeah, I think that's spot on. When DeepSeek came out, a lot of the customers that we had were like, oh, well, the model was produced in China, can we really use

it?" um you know the weights were you know baked there but the most common model people were using on data bricks were actually the llama distilled deepse ones so you know where you know you took the R1 reasoning and you distilled it on top of llama so they were actually using llama so it was really the llama model and by the way the beautiful thing for us was we could just launch it immediately because it's just llama uh but we distilled the reasoning from R1 on top of it was super popular it actually became one of the most popular models that week when everybody was talking about deepc coming out um so things like that. So it's the mixing and matching as you said like we all see these models come out and there's like all these benchmark results like oh this really good at programming this or you know math olympia that or function calling for enterprises and so on. These are all sort of characteristics and capabilities that different models uh have. Uh we're seeing lots of mix and

matching. So, like, okay, can we create traces of how well this model does on this particular aspect, and can we then overlay it and distill it and add it on top of a model over here? That's basically what people are doing on Databricks: mixing and matching these. And that's why you're seeing this convergence of all the models. People are like, oh, it's saturating, or it's converging; well, it's because people are mixing and matching. Again, this would not be possible without open source. With open source you can just do that, and there are so many techniques you can use, especially if you want to do things like reinforcement learning. And I think things are actually accelerating now in the reasoning space, this inference-time compute, where you're doing thinking and using the thinking traces to generate models that have these sorts of thinking capabilities, and they can become really good at specific tasks. That seems to be even more amenable to distillation than a year or two ago. There was a Berkeley research paper that showed that for $450 they could get the reasoning traces from the best reasoning models and apply them to any model, for 450 bucks. So I think you're right: distillation, especially in this reasoning, inference-time-compute era, is extremely powerful, and I'm not sure everybody has realized that yet. But yeah.
