On a scale from 0 to 10, how big of a deal is DeepSeek R1? Kate Soule is Director of Technical Product Management for Granite. Kate, welcome to the show. What do you think? I'm going to take maybe a little bit of a controversial position. I'm going to say 5. Chris Hay, Distinguished Engineer and CTO of Customer Transformation. Chris, welcome back as always.
0 to 10. What do you think? 9.11 or 9.9, but I'm not sure which is the bigger number. Wow, that is a niche reference. And finally, Aaron Baughman is IBM Fellow and Master Inventor, zero to 10.
Aaron, what do you think? Yeah, that's a great question, and I think we're gonna be right in between the other two at a 7.5. All that and more on today's Mixture of Experts. I'm Tim Hwang, and welcome to Mixture of Experts.
This is episode 40 of the Mixture of Experts show. I'm really excited to meet this milestone with an all-star cast. Each week, MoE is the place to tune in to hear the news and analysis on the biggest headlines and trends in artificial intelligence.
And today we're all going to talk about DeepSeek-R1. It's basically the only thing anyone is talking about right now. It's the talk of the AI chatter class, it's rocking markets, and even my dad is texting me about it. So what I want to do is start the first segment with a little bit of DeepSeek-R1 myth busting.
If you've been anywhere around AI in the last week, you know the basic story. There is a Chinese lab, DeepSeek, that has released a new model called R1 that is both open source and competitive with the state-of-the-art models coming out of Anthropic, OpenAI, and all the names that we're really familiar with. And there has been so much hype about this story that, as I said, even my dad's texting me about it. And a lot of the mainstream coverage has actually been getting a lot of the facts wrong.
So what I want to start with is just to knock down a bunch of myths, so we can calibrate as we peel back and talk about this story. And, Kate, I want to start with you, because I know you were angry about this in the show Slack, so I wanted to give you a chance to let loose. I think the first meme that we've heard in a lot of this mainstream news coverage is that we can now train state-of-the-art models for 5.5 million dollars. And that's so crazy cheap relative to the kind of numbers that we've heard before, right? I think the Stargate price was a hundred billion dollars or something crazy like that. So Kate, how true is that number? Can we really train models for 5.5 million dollars now? So, first, the number is true. It's published in the paper, DeepSeek isn't necessarily hiding anything about this number, and it's heavily caveated if you look at it, but the takeaway that people are drawing from this number is a little bit crazy. So, yes, training one iteration of a base model, DeepSeek-V3, and by the way, this all came out in December.
This isn't late-breaking news as of last week. Back in December, they trained this model. They said one iteration of training it would cost about 5.6 million dollars. But that doesn't mean a startup could go and train a model for the same cost. That's like saying if I'm going to go run a marathon, the only distance I'll ever run is the 26.2 miles of the race itself.
The reality is you're going to train for months, practicing, running hundreds and thousands of miles potentially, leading up to that one race that then covers 26.2 miles. And if you take that metaphor even a step further, it's like saying, okay, what if I'm running a race, but I take a break every mile: I stop, I take a drink of water, I take a nap, I come back the next day, I keep running.
And you only add up your time from the actual miles that you're running in the race, not all your breaks. That's the equivalent of what this number represents. It's a really valuable number to understand, and it's impressive: on the parts that they're measuring, they did bring costs down a lot through efficiency.
But that number does not represent the cost to go and now train a model. It's not like we're going to have startups now flooding the ecosystem with their own versions of 600 billion parameter mixture-of-experts models. That's super helpful. Yeah, and I think it's a great calibration. I want to pick up on that last thing that you said, which is that there is a lot that's new here from an efficiency standpoint. And maybe Chris, I'll toss it to you for the next meme that we're hearing flow around in this space, which is that DeepSeek-R1 is a huge breakthrough.
Models are running way more efficiently than they used to. You know, dot, dot, dot, DeepSeek is so far ahead. I know you said that you felt like this model was a big deal, like a 9.11 kind of big deal. But can you tell us a little bit about whether DeepSeek has really unlocked some novel things, and if so, how big of a deal are the novel things they're uncovering with this new model? I said 9.11 or 9.9, so clearly, Tim, you think 9.11 is the bigger number out of those two.
Sorry, there's some uncertainty bars there. I actually think it is a big deal. There are a few things we're sort of joining together here: there's the base model, and then there's the RL training for the R1 part of that, right? And if we separate the DeepSeek-V3 side from the RL training for a second, I think there is a big deal there. Because the reality is... never mind the 5.5 million bucks, right?
You are going to be able to take an existing base model that has been pre-trained, and then you are going to be able to do RL training over the top of that. You're going to be able to take your cold-start fine-tune data, so a relatively small data set, put that on top, and train it to do amazing tasks. And I know that myself, right? Because I took a tiny model, an absolutely tiny one-and-a-half-billion-parameter Qwen model, and I put maybe a thousand lines of SFT data on it, and I got that thing doing basic arithmetic at about the same level as GPT-4o, right? Just myself, and I'm telling you right now, I love IBM, but they do not pay me five and a half million dollars.
That was on my laptop. So this is a big deal, and it's a big deal because the thing that they're showing is that long chain of thought has a huge impact, and so does accurate data. Because, even on the RL point, they started with pure RL training, right? They just said, here are your rewards, nothing else, and we're going to train the model that way.
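That pure-RL setup, a reward signal only, no extra data, can be sketched in miniature. This is a hedged illustration in the spirit of DeepSeek's GRPO-style training, not their actual implementation: sample a group of completions per prompt, score them with a rule-based reward, and mean-center the scores into advantages (the real recipe also normalizes by the group's standard deviation and feeds the advantages into a policy-gradient update). `policy_sample` and `reward_fn` are hypothetical stand-ins for a real model and verifier.

```python
def rl_step(policy_sample, prompts, reward_fn, k=4):
    """One conceptual step of reward-only RL training (GRPO-style).

    policy_sample: callable (prompt, k) -> list of k candidate completions
    reward_fn:     callable (prompt, completion) -> scalar reward
    Returns, per prompt, (completion, advantage) pairs, where the
    advantage is the reward minus the group mean -- the quantity a
    policy-gradient update would push on.
    """
    batch = []
    for prompt in prompts:
        completions = policy_sample(prompt, k)
        rewards = [reward_fn(prompt, c) for c in completions]
        mean = sum(rewards) / len(rewards)
        batch.append([(c, r - mean) for c, r in zip(completions, rewards)])
    return batch

# Toy run with stand-ins: the "policy" proposes two answers and the
# rule-based "verifier" rewards only the correct one.
sampler = lambda prompt, k: ["answer: 32", "answer: 33"][:k]
reward = lambda prompt, c: 1.0 if c.endswith("33") else 0.0
batch = rl_step(sampler, ["What is 25 + 8?"], reward, k=2)
```

The point of the group-relative advantage is that completions are only pushed up or down relative to their siblings, so no separate learned value model is needed.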
And then they went on to say: actually, if we do one round of fine tuning with a really good chain-of-thought data set, maybe a few thousand rows worth of data, and then do RL training after that, we get much better results. So what they're showing is that we can maybe stop obsessing about pre-training so much and get into this post-training world and inference-time compute world. And for that you don't even need five and a half million dollars; just your laptop, a little bit of tenacity, and a little bit of GPU is going to do that job. That's awesome. Yeah, we're going to talk a little bit more about RL and chain of thought in just a second. But Aaron, before we move to that, one other question I want to ask you: the third big meme is that everybody suddenly discovered Jevons paradox this week.
And I think one of the narratives that popped up is that NVIDIA is doomed. You need a lot less compute for these models now. NVIDIA's stock price took a tumble. I bought the dip, for what it's worth. So, Aaron, I'd like you to respond to this question, or to whether it's a myth at all: are we going to need a lot less compute in the future? Is NVIDIA doomed? How should we read this? And if you want to explain Jevons paradox, you can go ahead too. Yeah, I mean, I think fundamentally that's an interesting notion.
But I tend to follow the dynamics of AI, which to me come in three different areas. One is the scaling law, which tends to say that as you scale up the training of AI systems, you get better results, which means bigger models are generally better, right? Then there's the shifting curve, where new ideas are making training more efficient, which affects that scaling law.
So the more new ideas you get, the more powerful smaller models become, right? But then there's a third, which is a shifting paradigm: big revolutionary ideas that can change by an order of magnitude the scale at which you actually need to train these models in order to get performance, right? And so with those three points laid out, which are backed by a lot of research over time, I think that, yes, there's always going to be a demand for GPUs, but I do think there are going to be different chip architectures coming out. Also, if you look at some of the efficiency gains that V2 had, such as multi-head latent attention, where they were able to compress the KV cache, the token throughput they were able to achieve is incredible; I think that was one of the bigger innovations they had. And the second one was what they call DeepSeekMoE, where they're able to partition out and share knowledge amongst the different experts that they have. Those two things were some of the pieces that gave us the shifting curve on that scaling law, which said, okay, I don't need as many GPUs now. But if you look at the foundational model, so if you go to, let's say, DeepSeek-V2, it's big. It's a very big model, and V3 is even bigger, with, what is it, 671 billion parameters, right? That's a very big model. Yeah, it's chunky.
Yeah. So, I mean, it's very fun, right? To watch that curve. And I think that we'll see agglomeration of models together; we can do reverse distillation, right? To create and combine smaller models together.
You can do model distillation to create smaller models, and it's going to be fun. I want to maybe pick up on a point that Chris mentioned that I think is really important, which is: can we just stop worrying about pre-training now? Because I think everyone is talking about this 5.5, 5.6 million dollar number, and they're tagging it to all of these amazing performance improvements that we're seeing in the R1 model and the distilled models, and they're equating the two and saying, all right, now we can just go and get this crazy performance at a pretty minimal cost. And I think it's really important to disambiguate these two things, right? A step in this process costs about 5.6 million; the true cost of building this pre-trained model is likely orders of magnitude higher. But regardless, it almost doesn't matter. This 5.6 million number doesn't even matter, because you can take this big model that's now open source and distill it basically for free on top of other open source, smaller models that are out there to get crazy performance improvements and build. So it's not that startups are going to go and pre-train their own 600 billion parameter model because the cost is only 5.6 million dollars. That's the wrong takeaway.
It's that we now have the ability to distill, and thanks to more and more competitive models being put into the open source, that distillation is becoming even more powerful, and to use reinforcement learning, a technique that DeepSeek used really effectively, to go and build our own smaller versions of these models that are really powerful. And that's where there's actually a very low barrier to entry now; as Chris is saying, you can do it on your laptop. Yeah, I find that very nice, because it's like a house of cards, right? They're only quoting the top card.
They're not quoting any of the cards at the bottom. And if you move one of those cards at the bottom, the whole house collapses, right? That 5.5 million is only the cost associated with maybe one epoch of this type of training, and that's it. But even look at the hardware that they used, the H800s; just procuring those alone, or using them as a service, is expensive, right? So they're excluding lots of costs associated with prior research, ablation studies, and lots of different things, which makes that number very misleading.
Chris, I see you nodding. Do you want to jump in? I don't know if you have a comment. No, I was just nodding in agreement because I'm a very kind and collaborative person. But no, I absolutely agree. I think you're going to go for the big hit numbers, right? You're going to say, we did this super cheap.
And you are really going to miss out all the steps that it took to get there in the first place, right? And as Kate probably knows better than anyone, the amount of experimentation that it takes for these models to get to the final version is a lot. So the actual final epoch, as Aaron was saying, that final training run, that's just the end of the road. But you know what? No one wants to hear about the big journey getting up there. They want to hear the big number. We're in a hype industry, baby.
So we'll, yeah, five and a half million. Here we go, right? Kate, maybe one last myth I've seen popping up that might be good to address before we do a segment on distillation, because it's already come up a couple of times, and I think it's worthwhile to explain what it is and why it matters. But one last thing to cover before we get to that, Kate, is on the point about RL. It feels like the DeepSeek narrative has also been a little bit about the revenge of RL, like reinforcement learning's back, baby. And I know some people have gone so far as to say everything is RL now, fine tuning is dead.
Do you want to talk a little bit about that? Even with everything that we've said, how much does R1 indicate to us that RL will be the more dominant method for these types of fine-tuning efforts going forward? Yeah, and I'm really curious to get Chris's take on this, because I know he's just run these experiments locally on his own laptop. So DeepSeek, in their paper, trained two models, in addition to all the smaller distilled models that they worked on. One model was trained with just reinforcement learning only, so there's no additional data that's added. You've got your pre-trained model, which costs, you know, 5.6 million plus all the arguable buffer on top, and they just use reinforcement learning, using rules-based systems more or less, to verify the results and score the responses.
And so they called that R1-Zero, I believe. Then there's R1, which they also created, because in their paper they mentioned that there were some rough edges, so to speak, on the reinforcement-learning-only model. And in that model, they start first with some fine tuning, basically using some structured data in order to better prime it for the reinforcement learning task. And that is the model that everyone's now playing with on the DeepSeek app and that everyone's really excited about.
So I think it's a really interesting look. The takeaway shouldn't be that, oh, we can't do RL only, that they had to resort to this cold start and fine tuning before the model was released. The takeaway that I think people should have is that it's amazing how far they were able to push just RL. And yes, there's still always going to be a need for some structured data potentially, and maybe a hybrid approach is best, but it is kind of crazy how far they were able to push it.
Now, what they also published in their paper, getting to the distilled models, and you asked about distillation: distillation has been around forever. Back in the early days of the first Llama model, a group of students distilled it into Vicuna. It's basically where you generate a bunch of synthetic structured data from a big model and use that to fine tune a small model. DeepSeek also tried doing just RL only on a small model, so no big models involved, just RL, to see how far they could get; they published numbers on Qwen 32B. So how far can we push Qwen 32B's reasoning just with RL? And in the paper they claimed they weren't able to push it nearly as far or get any real reasoning capabilities out of the model; they had to resort to distillation: take their big R1 model, generate a bunch of synthetic data, and tune the small model on it. So I'm curious what your take is on that, Chris, based off of some of the RL experiments you've been doing with small models. You said you also, I think, did some fine tuning first to start it off with chain-of-thought reasoning, and then RL on top.
Now, for me, the critical thing is the long chain of thought reasoning. That is actually an accurate long chain of thought reasoning. That is the thing that really enabled everything.
So, again, if you look at the paper, when they did RL, they said they got there. But if you think about math problems especially, LLMs are not really good at that. So you're going to say, what's 25 plus 8, whatever, and you're going to ask an LLM to go and generate that sum.
And it may or may not get it right. It may or may not get the sums and the length of chain of thought that you want. It may not get its explanations right.
So it's really a bit of a crap shoot getting an accurate chain of thought. And then at the end of it, they're using this thing called a verifier. What the verifier does is take the answer that you've got, run a bit of rules to check the equation, and say, yeah, that was correct, or that was wrong.
And then you get a, you get a bit of a reward, you know, it's like, here's a cookie. Well done model. Good job.
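That rule-based verifier, the thing that decides whether the model earns its cookie, can be sketched as a small function. This is a hedged illustration, not DeepSeek's actual reward code; the `Answer:` extraction pattern is an assumption for the example.

```python
import re

def verifier_reward(completion: str, gold: float) -> float:
    """Rule-based reward: pull the final answer out of the model's
    completion and compare it to the known-correct result."""
    match = re.search(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", completion.lower())
    if match is None:
        return 0.0  # nothing parseable: no cookie
    # Cookie only if the extracted answer matches the ground truth.
    return 1.0 if abs(float(match.group(1)) - gold) < 1e-6 else 0.0
```

Because the check is mechanical, correct answers can be rewarded at scale with no human in the loop, which is exactly why math and code are such natural targets for this style of RL.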
But if you think about how long that's going to take, you really are at monkeys and typewriters at that point, right? It's going to take time for the model to come back with the right answers. Now, what if you run a fine-tuning step before that? If you can produce long, accurate chains of thought for those math equations, for example, and I'm picking math because that was in the paper, then the model is going to look at that and say, okay, I'm doing this equation here, and you explain every single step, step one, step two, step three, then I'm reflecting back, this was right, this was wrong, and finally I'm going to check the answers. Then you're just going to need fewer steps for the model to learn what it has to do.
And then you can use RL afterwards to go, oh, this particular sum you got wrong, so no cookie there, and here's a cookie where you got it right, because we've now shown you the right way of doing it. So I think that combination of the two is the key thing. But I actually think the real takeaway from that paper is the long chain of thought.
So when I did my experiment on my YouTube channel, I took a slightly different approach from what DeepSeek did. I have a thing called a math compiler. What I did is I automatically generate the math equations, put them into my compiler, and generate an abstract syntax tree. Then I walk the tree, so I don't need the LLM to do the math; I'm just going step zero, step one, step two, walking the tree and outputting the explanations. And then I use the LLM to transform that into actual human language, with the explanations behind it, and that's how I got these really accurate chains of thought. And when I put that in as a fine-tuning step, I think I used maybe a hundred different examples, and honestly, the math, and I did it on a one-and-a-half-billion-parameter model, the math was incredible, right? It was accurate to a couple of decimal places, which the larger models of six months ago would be nowhere near. So I think the real innovation is the long chain of thought and the accurate chain of thought. It's not to say RL won't get you there; it will, but it's just going to take a long time. So if you can shortcut that a little bit, and then have RL smooth out the edges, then you're really going to win.
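Chris's math-compiler trick, walking an abstract syntax tree and narrating each operation rather than trusting the LLM with the arithmetic, can be approximated in a few lines with Python's built-in `ast` module. His actual compiler isn't described in detail, so this is a guess at the shape of it:

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def chain_of_thought(expr: str):
    """Parse an arithmetic expression, walk its syntax tree bottom-up,
    and emit one guaranteed-correct reasoning step per operation."""
    steps = []

    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left, right = walk(node.left), walk(node.right)
            op = OPS[type(node.op)]
            value = {"+": left + right, "-": left - right,
                     "*": left * right, "/": left / right}[op]
            steps.append(f"step {len(steps)}: {left} {op} {right} = {value}")
            return value
        raise ValueError(f"unsupported node: {ast.dump(node)}")

    answer = walk(ast.parse(expr, mode="eval").body)
    return steps, answer

steps, answer = chain_of_thought("3 + 4 * 2")
```

Each step is correct by construction, so an LLM only has to paraphrase the trace into natural language, a much easier job than doing the arithmetic itself.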
That's kind of my view on this. RL is really valuable for tasks like math, where it's easy to check the accuracy and relatively easy to generate that chain of thought. But in the paper, for example, they talked about still needing some instruction tuning for tasks like tool calling and instruction following. These reasoning models aren't designed to do every single task; they're specific to reasoning, and you're still potentially going to need instruction tuning to handle some of those more specific instruction-following tasks. So I'm going to move us on to our next segment.
So this is super helpful, I think, in terms of setting the scene and knocking down some of the myths that have popped up. We've already talked a bunch about distillation, and I think on the last episode Skyler actually gave a short, brief explanation of it, but for those who weren't listening to the previous episode, maybe Aaron, I'll toss it to you. I think it's worth it for our listeners to get a sense of what distillation is in the first place. And once you've given that explanation, there are some interesting things worth getting into about what this means for where the industry is going. But maybe I'll toss it to you to give the quick capsule explainer first.
Yeah. So model distillation is a very powerful technique. It's about having a teacher model, which could be a bigger model that has encoded much more information through its weights and embeddings.
And what you want to do is transfer that knowledge to a student model. Usually that student model is smaller, and it requires, in turn, fewer resources to train and to use for inference. Some people and groups think of this as model compression, where you're making a model smaller. And there are different things that you can distill: you can distill response-based knowledge, feature-based knowledge, or even the relations between all the different connections among the neurons that you have, right? And one interesting thing that I wanted to bring up that I saw in the R1 paper is that the distillation process wasn't just, to me, about doing this model compression or getting knowledge out. It was almost like model translation, because what I saw is that you were actually distilling information from an MoE and going directly to a student model that was a densely connected feed-forward neural network in many different cases, right? And just changing that model architecture looked to be a different way of doing this type of model distillation, which I thought gave R1 some advantages, especially when you were looking at using Qwen2.5 and the Llama 3 series as the base foundational models to pull information into.
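The response-based variety Aaron mentions is the simplest to picture: ask the teacher for answers and fine-tune the student on them. Here is a minimal sketch of the data-building half, with a stub standing in for a real teacher such as R1; the `<think>` tags and field names are assumptions for illustration.

```python
def build_distillation_set(prompts, teacher):
    """Response-based distillation data: query the large teacher model
    for each prompt and package the pairs as an SFT dataset that a
    small student model can then be fine-tuned on."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]

# Stub teacher standing in for a call to a real reasoning model.
def toy_teacher(prompt):
    return f"<think>working through: {prompt}</think> final answer"

dataset = build_distillation_set(["What is 2 + 2?"], toy_teacher)
```

In practice the prompt set is large and domain-targeted (DeepSeek reports on the order of 800k samples for their distilled models), but the mechanics are just this: teacher generates, student imitates.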
Yeah, and I think one of the most interesting elements of distillation is the idea that you can take any large model and bring that knowledge into whatever it is you're building. Really, just in the last 24, 48 hours, there was a little bit of controversy over whether DeepSeek effectively used OpenAI's chains of thought or other inputs and outputs to do the distillation here. I guess, Kate, my question is: this makes it very hard for any model company to protect its models in some ways, right? Because everything is distillable.
Is that the right way of thinking about it? Yeah, I mean, I think by releasing a very capable model, the most capable model to date in the open source, with a permissive MIT license, DeepSeek is essentially eroding that competitive moat that all the big model providers have had to date by keeping their biggest models behind closed doors. And regardless of whether or not DeepSeek also benefited from distillation from those bigger models, we're now able to go and take that really big model in the open and use it indiscriminately. I mean, this distillation from GPT has been going on for ages. Anyone can go to Hugging Face and find tons of data sets that were generated from GPT models that are formatted and designed for training, and likely taken without the rights to do so.
So this is a secret that's not a secret; it's been going on forever. So yes, it most likely worked its way in some degree or fashion into the DeepSeek model, but it almost doesn't matter anymore, because DeepSeek is now out there, and that model can be used to run a very similar style of distillation, with great effect, on as many small models as you like. And anyone who uses DeepSeek's model now has the rights to do so, according to the license it's published under.
That's right. Yeah, I think one of the funniest parts of the news cycle has been, they used a secret, sinister technique called distillation. And it's like, actually, everybody's been distilling all the time. It's been around forever.
And it costs 5.5 million dollars. That's right. Yeah, exactly.
What strikes me, Chris, even from the example that you gave earlier, is that it turns out you don't need a whole lot of data to make these models much, much better. And it seems like there's this fundamental thing in the market where, unless you want to control things down to the nth level and prevent people from getting outputs from a model, there's basically no way to stop distillation, right? I don't know if you think there's a realistic way to prevent that at all. No, I don't think so. I mean, the reality is, as Kate said, there are open-weight models out there, and people are going to do that. And I love this, by the way, and the reason I love it is that I'm all for chaos. I'm all for open source.
I'm all for sharing and collaboration. So, you know what, people are going to go off now, create their own data sets, distill from different models, and share that out in the community. And you know what, we're all going to end up with better stuff, right? So I'm not a big fan of the closed models, personally; that's my opinion.
I'm a big fan of sharing and learning from each other. That's what gets me excited about the DeepSeek stuff. And again, it's not just the fact that they put the model out there that you can distill from; it's that they talked about the techniques that they used. So yeah, it's cool. We can all start doing interesting things.
And you know what? I don't think we're suddenly all going to be going out competing with OpenAI, Anthropic, blah, blah, blah. I don't think all of these people sitting in their bedrooms are going to do that. But what they might be able to do is take one of these out-of-the-box pre-trained models and then solve one of their own particular tasks that the general model can't do, something specific to their use case, and make it easier. But again, don't undersell this. Kate, you know this better than anyone, right? Fine tuning models is really hard, because of all of the biases. You might think, hey, my model is now great at doing this one particular task, but then you've just ruined that model for every other task, because you didn't have the right biases and mixes within that data set.
Yeah. I mean, just take a look at the Hugging Face Open LLM Leaderboard. All those distilled versions of Llama and Qwen are on there, and they all rank significantly lower than the original models they were distilled from on those Open LLM Leaderboard tasks, which are not predominantly reasoning-based tasks. So the models were boosted in reasoning, but other general performance characteristics dropped.
But I think it's still incredibly powerful. And as we talk about DeepSeek introducing this new era of efficient open source AI: it's true.
It's just not true because they trained this really cost-effective model during pre-training. It's true because we now have the methods to create these armies of distilled, fit-for-purpose models that are specific to the tasks you care about, because we have better tooling, like powerful teacher models, out in the open source ecosystem. Yeah. I think there are a lot of secret agents hidden amongst our labs, and in the next couple of days or weeks, you'll see them become super agents that are going to be released for all of us to use.
So I really think this might have been one of the impetuses to showcase what's happening within the field of AI. DeepSeek just happened to be right time, right place, to connect all the dots together. But I do think, with lots of these technologies and new innovations and inventions that are coming out, you ask the question, can you prevent someone from distilling a model? That brings me back to biometrics. It used to be: can you prevent someone from stealing a picture of your face? And we came up with this cancelable biometric invention so that if someone took your picture, you could revoke your biometric and create a new one, right? So I think there might be some cancelable technologies and patents that we could work on together to achieve some of this.
Final question here, I think for Kate, particularly given your work on Granite. I think there's maybe one point of view, which is: the only reason investors have put money toward building these giant, giant models is the idea that if you build them, you'll be able to capture all the value from them. And it seems to me that if distillation gets good, and granted, distillation is hard in some respects, but with enough eyeballs someone will eventually figure out ways of cracking it, is there an argument here that it erodes the incentive for people to invest in building the big model in the first place? And there's a really interesting question underneath, which is that it's almost an accident that we've ended up with these giant models, and it's partially based on the idea that you could have some exclusive control over them, but it feels like this is rapidly escaping anyone's ability to exclusively control.
Yeah, look, I don't think there's any incentive to really build big models to run at inference time. The incentive is to build really big models to help you build really small models. And it started with Llama releasing, you know, a 400 billion plus parameter model; NVIDIA released a 400 billion plus parameter model as a teacher.
And now DeepSeek is releasing their 600 billion plus parameter model. You know, size isn't everything; they also have to have high quality post-training, which is why the reinforcement learning part of DeepSeek is so important. But we're seeing more and more large models that can be used openly to train these smaller models, and I think it's just going to continue to make this more of a teacher-model-based commodity. Like, why pay for those big models if we've got similar capabilities out in the open that you can customize further? And I think we are going to converge on a point where we've got powerful enough tools to craft the smaller models that we need, that are going to run, you know, 80 to 90 percent of our workflows for generative AI in the future. Yeah, it's kind of a funny world where you never talk to the giant model that's just inside company headquarters, and then there are lots of tiny models that are coming out around it.
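The teacher-to-student picture Chris describes is, at its core, knowledge distillation: the big model's softened output distribution becomes the training target for the small one. A minimal sketch in Python (the logits, temperature value, and function names here are illustrative, showing the generic technique rather than any lab's actual recipe):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Higher temperature "softens" the distribution, spreading probability
    # mass so the student sees the teacher's relative preferences.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the teacher's soft targets and the student's
    # predictions -- the quantity minimized during distillation training.
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy example: a confident "big" teacher versus an untrained student.
teacher = np.array([4.0, 1.0, 0.5])
student = np.array([1.0, 1.0, 1.0])
print(distillation_loss(teacher, student) > 0)  # loss is positive until they match
```

In practice the student's weights are updated by gradient descent to shrink this loss (often mixed with ordinary cross-entropy on real labels); the loss reaches zero exactly when the student reproduces the teacher's distribution.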
Well, in the last few minutes, I want to zoom out a little bit. We've been talking a lot about DeepSeek and what's going on underneath the hood. Um, and I want to just take a moment to talk a little bit about what all the other companies are doing. Relative to this development in the AI space, um, Sam Altman, of course, the head of OpenAI put out a little tweet thread kind of responding to this news.
Um, and I'll just quote a little bit of it. He said: "we are excited to continue to execute on our research roadmap and believe more compute is more important now than ever to succeed at our mission." Um, which is really, like, a statement from a guy saying, steady as she goes, we're continuing on the research path as, you know, we had planned, um, and nothing has been changed by the DeepSeek, uh, release. Um, I guess, Chris, maybe I'll kick it to you. Do you buy that? Like, is OpenAI pretty much gonna just keep doing its strategy? Or does this really kind of fundamentally change what they're gonna need to do? Nah, he's gonna release his model sooner. He's been holding on to these models for too long, and he needs to get on with it.
And good on you, DeepSeek, right? Where's my o3? You showed me it at the end of Christmas. Do I have it in my hands? No. So thank you, DeepSeek, maybe we'll get his model out a bit quicker, and then we'll get o4 and o5, and then maybe we'll get some of these models in Europe. Because guess what, they're releasing vision models and video models and I don't have any of them, so I'm gonna get them as well, so woohoo! Uh, so I guess ultimately what you're saying is it just accelerates his roadmap, right? To just get him off the fence. There is no way he's just gonna sit there and go, uh uh uh, I'm not giving you my model, while DeepSeek is getting all of this press. He's gonna respond and we're gonna get new models. But I think, I mean, Aaron, maybe to turn it to you, you don't think this changes, like, their approach to, I guess, ironically, being kind of a closed source model here, right? Like, this is not the kind of situation where you believe that OpenAI or Anthropic, or any of the big kind of providers, would say, hey, now we need to switch to open source, that's the way we play this game.
Um, I don't think so. I mean, this could go in several directions, you know. But I think, you know, open versus closed source, you know, I think that there's advantages and disadvantages to both, but I think ultimately it helps the academic community, which then in turn fuels, you know, economies of scale for the average consumer, right? Because if you think about it, you have two groups, right? You have the open source group and the closed group, and they compete, you know, um, to make sure that one is better than the other, which then spurs innovation. Okay, great. And then within each one of those groups, you have companies and organizations that then in turn compete.
You have like these two levels of competition that further accelerate, you know, this, uh, innovation. And so Sam Altman, you know, I think he's going to release his secret agents, right, uh, sooner, right, make them available, right.
And lots of the techniques that, you know, DeepSeek has shown, you know, like that caching layer for the keys and values and queries that they've, you know, come up with, some of their MoE innovations, and then some of the parallelization where they can share context and information amongst their grid, lots of that is going to be included, I think, in Sam Altman's models, but pushed even further, you know, with their own innovations. And it's going to splinter out a bit, but the fundamentals, like model distillation and so on and so forth, you know, I think that's going to be, uh, very key. And then it brings the value proposition down to frameworks, you know, um, how can I better train the models for my own fit purpose, you know, whether I'm an enterprise or a customer, and then also, how can I trust it, right? Because there's going to be a zoo of models now that are out there, and it's just very confusing to pick which one to use. Yeah, Kate, uh, so we've been talking about OpenAI. Obviously they take up a bunch of sort of airtime. Um, but I guess one thing to think about, as we zoom out to tell this DeepSeek story, is whether or not we think the other labs are similarly situated.
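An aside on the key-value caching Aaron brings up: in autoregressive decoding, each generated token's keys and values are stored so past tokens are never re-processed (DeepSeek's contribution is compressing this cache further, which is not shown here). A minimal, hypothetical sketch of the generic KV-cache idea, with made-up dimensions:

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ V

class KVCache:
    """Append-only store of past keys/values, so earlier tokens are
    never re-projected when generating the next token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache this step's key/value, then attend over everything so far.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, np.stack(self.keys), np.stack(self.values))

# Decoding three toy "tokens": each step attends over all cached pairs.
rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(3):
    q, k, v = rng.normal(size=(3, 4))
    out = cache.step(q, k, v)
print(len(cache.keys))  # cache has grown to 3 entries
```

The cache trades memory for compute, which is exactly why shrinking it (as DeepSeek's attention work does) matters at scale.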
Like, you know, everything we've been hearing is, okay, OpenAI is going to continue its strategy, it's just going to move faster. Do you think it changes the economics at all, or the kind of decision making at all, for say a Google or a Meta or, um, you know, even an Anthropic? I don't think it changes the decision making or strategy, uh, overall. I think a lot of DeepSeek's strategy was, you know, necessity is the mother of invention. They only had access to H800 chips.
So they optimized the hell out of it. They invested in efficient architectures like MoEs, and DeepSeek was born, right? So I think the U.S.-based labs are operating with very different constraints and incentives, and I don't think DeepSeek necessarily changed that calculus.
I also think a lot of, again, what we've talked about today with DeepSeek is distillation. And for the labs pursuing AGI, distillation is not necessarily as relevant, right? They need to keep training as big a model as possible and have incentives to try and keep that behind closed doors. Whereas the business value, again, my take is the business value is all around these distilled smaller models that are actually what people are going to deploy in a commercial setting. And I don't think it changes, at least at the highest strategy level, what they're working on in terms of their investment profiles and that longer term AGI game.
And for that you still need a crap ton of big GPUs. And they're not going to want to release any of that out in the open, right? Yeah. It's not like they're going to use Stargate to do, like, small distilled models. That would be the funniest thing.
It's actually an inference cluster. Surprise. Um, yeah, that's really fascinating, I think. I guess, Chris, maybe I'll turn to you for the sort of last word and last question here. You know, Kate just talked a little bit about the idea that Chinese researchers are operating under very different constraints, so they kind of develop different types of methodology, different types of models, different types of proficiencies.
Um, and do you think there's something to the idea that, like, we have an embarrassment of compute among the U.S. labs, and so it actually kind of limits the degree to which we would ever invest in the kind of thing that DeepSeek would be working on? Um, I'm really sort of interested in the idea that these constraints really mean that AI will start to look pretty different in different parts of the world, as researchers operate under very different constraints on what they need to do to deploy systems. I think that's exactly the case, right? And you can see a little bit of reinforcement learning happening there, and reward modeling, right? Which is what you were saying here.
You're going to have less compute available to you. And guess what? They have different incentives at that point, and they've been rewarded for being more efficient. So if you've got an abundance of compute, you're not really going to be optimizing for efficiency.
You're going to be trying to get your models out first. And I think that's also, you know, speaking from my own experience, I, you know, I don't have any H100s kicking around. What have I got? I've got my MacBook Pro, right? Where I've got -- You're more like the DeepSeek researcher, basically. Exactly. So you're trying to come up with innovative techniques to work within the hardware constraints that you run within today.
So I think, honestly, if they didn't have the chip constraints in China, I'm not sure that DeepSeek would have come up with those techniques, because they would have just been trying to focus on catching up with everybody else, as opposed to trying to take things from a different angle. And therefore, again, this is one of the reasons I believe in open source very much, and in everybody sharing their papers: everybody's running under different constraints, and they're going to find new innovations, and if we share that, we're all going to learn from each other and be able to contribute. And that's not just the big labs, but the people in the community, just with their laptops, trying to discover and experiment with new things. Ah, so I love this panel.
Kate, Aaron, Chris, thank you for joining us on the show as always and walking us through DeepSeek. Uh, a lot more to talk about and we will be tracking the story. Um, and thanks for you listeners for joining us. Uh, if you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we will see you next week on a jam packed episode again of Mixture of Experts.