GPT-4.5: And the future of pre-training is...

Is pre-training dead? No, because 4.5 does the best cheese jokes ever, and why would we stall pre-training for lack of cheese jokes? I need my cheese jokes, so no, pre-training is here to stay. I knew I could count on you. Kate, over to you. It's already been dead. Come on, we're beating the dead horse now.

All right, so with that, we will jump into this week's episode. Hello, everyone. I am Bryan Casey. I am guest hosting this episode and welcome to Mixture of Experts. Every week, Mixture of Experts goes through the hottest stories in artificial intelligence. And a fun thing about the show is that we record on Thursday mornings.

And on Thursday afternoon this week, GPT-4.5 arrived, and we decided that this was a good time to do our first ever emergency pod. So I'm excited and thrilled to have both Chris Hay and Kate Soule on the episode today. If it wasn't obvious from the opening question, the one and only topic we'll be discussing today is GPT-4.5, the much anticipated model that OpenAI released on Thursday afternoon. We see new model releases every day of the week around here, but there were some things to me that were pretty remarkable about this release, even in the way that OpenAI communicated it to the rest of the market. First, in all of their announcement materials, they didn't describe 4.5 as a frontier model.

They were very clear in all of their communications that, against the normal conventional benchmarks, this was not going to be a world-beating model. They talked heavily about the expense and cost of serving the model and the size of the model, even saying that they had run out of GPUs in terms of their ability to serve it. And in some of their documentation they were non-committal long term about whether they were even going to keep this model available in the API. So to riff a little bit on our opening question, maybe I'll start with you, Kate, because you said that pre-training's already been dead. The assumption is that GPT-4.5 was trained on something like 10x as much compute as GPT-4; that's at least the hypothesis I've seen thrown out there.

People immediately went to the implications: did we hit the wall, what about scaling laws, is pre-training dead? What's your take? I mean, I think even before GPT-4.5 came out, we saw really compelling evidence with DeepSeek and other models that inference-time compute is king, not pre-training compute. Reasoning longer at inference time costs more, but if you spend more money there, it's unlocking all sorts of new performance gains. And the old mode of just paying your way during pre-training, by training for longer and longer on more and more data, isn't delivering the same gains.

We've worked our way so far up the cost curve, so to speak, that we're really seeing a plateau from that perspective. So I don't think this is actually unexpected if you look at where we've been headed for a while now. Maybe, Chris, to bring you into this and riff a little bit on your opening remarks: I think one of the reactions I've seen in the community is that while it didn't set new standards on the math and science benchmarks, or the conventional benchmarks generally, the reaction and discussion in the market was that the model was really good at writing. It was funny, and people weren't used to seeing a model that was actually funny in a way that wasn't cringey, and that was maybe more creative than past models.

Do you agree with Kate's take, or do you feel like we're underselling how important it is to have reached those milestones? Are we underestimating the value, and where do creativity and writing and humor and things like that sit on the intelligence curve? I'm curious about your thoughts on that. It's late at night for me, so if I agree with Kate, do I get to go home? No, of course I'm going to disagree with Kate.

Where would the fun be if I didn't, right? I expect nothing less. So I think the first thing is the creativity. It is a genuinely funny model.

It is actually super, super funny and sassy, and it is the first time I've really seen good creative writing coming from a model. So actually, I think they've done something pretty good there. Of course, it's not going to be good at the stuff that inference-time compute is good at, you know, math and things like that, because that does need more time to think. And I'm okay with that.

Does that make pre-training dead? No. You know why? Because if there are no pre-trained models, what are you inferring on at inference time? Nothing. You need the pre-trained model in the first place.

So pre-training is not going anywhere. If I was going to make a prediction, it would be that a lot of the techniques we're applying at the fine-tuning layers to support inference-time compute can flow back into pre-training. Because the reality is, "here is the entire internet" is probably not the most efficient way to do pre-training in the first place. And actually, the biggest thing we've learned is that the quality of the data used during reinforcement learning, the quality of the chain-of-thought data used at inference time, is making a bigger impact than anything else. So if we go back to the pre-training cycle, rather than saying, hey, go look at the internet and tell me when you've suddenly learned something... it's like that episode of The Simpsons where Bart went to Paris for three months and then suddenly at the end he was like, I can speak French.

That's how we train large language models. And I think that's going to change. It's going to be, look, how do we build quality synthetic data sets to do pre-training? So I think we're going to go back and forth all the time: we're going to pronounce pre-training dead, and then suddenly we're going to do something good again, and it'll be like, oh no, everybody, pre-train again.

And we're going to go back and forth, back and forth. So no, pre-training isn't going anywhere. So just a couple of reactions, right? Especially on the emotional side, the humor, the characteristics: that's not pre-training, right? That's all imbued into the model during the alignment that happens after pre-training.

If we want to talk about the model doing a great job at that, it's not because they pre-trained it for 10 times longer. I highly doubt it. It's really due to the alignment of the model.

So I don't know that it was really worth it, if they truly spent 10x on the pre-training. But I do think you're right, Chris, that pre-training will change. When I say pre-training's dead, the mode of throwing more data and spending more until performance goes up, I think, is dead. Being smarter about how we pre-train, I completely agree with. I do think for the near term we're going to see base models treated much more as a commodity, where I don't care that much from a performance perspective. I think there are other things that can differentiate base models, particularly on a trust and transparency angle; if they're not driving up performance anymore, that becomes more interesting.

The licensing of the base model is another example. But in a lot of ways, I think it's kind of like: pick the right model size, and then all the innovation right now is really happening on the alignment side. So pick your favorite base model that meets your cost and other criteria, and then apply your alignment techniques on top of that to really drive and meet your needs. I agree and disagree. And the reason I'm going to say that is, I don't think you care until you care.

And what I mean by that is, base models have been commoditized at the moment, and at this point in time, inference-time compute is the most important thing, and there's a ton of mileage to get out of that. Then there will be a point where that mileage slows down a little bit, and it'll be like, oh, I need a better base model to get there. And then suddenly we're all going to be like kids on a soccer field.

We're all going to run towards the other end of the field, and we'll be like, oh yeah, I can get an extra percentage point, I can do a little bit better, if I have a better pre-trained model. So we're going to run over there, and then we're going to go, oh my goodness, tools! Tools are the thing. I'm going to mention agents, of course I'm going to mention agents. And we'll be like, the best tools are what's going to make the models, and inference-time compute is dead, because tools are going to be the most important thing, and we're going to run over there. And then actually, we're just going to run around in circles from thing to thing, optimizing, because we've done this dance before. And that's what we're going to continue to do. It's going to be fun, but all of these things are important.

If we take the idea of the base model as a commodity another step further, I think we're going to see a lot more innovation in architectures. You have to pre-train new architectures, and as we talk about how to get more efficient models, mixture of experts as an architecture is obviously becoming important in terms of broader efficiency and being able to maximize performance per cost. I think people are going to keep trying to differentiate and break out of this commoditization by finding ways to drive architectural improvements. But I don't think the story is going to be, we have this new architecture that we trained for 10 times longer than anybody else, and that's why the model is special. It's going to be, we came up with this new architecture that's even more efficient and powerful, that you can move to and then do all of the fancy alignment that gives you the true performance of the model.
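To make the mixture-of-experts point concrete, here is a minimal sketch of top-k expert routing. The sizes, the gating matrix, and the toy experts are illustrative assumptions, not any particular model's implementation; the idea is simply that only the k highest-scoring experts run for each token, so most of the parameters stay idle on any given forward pass, which is where the performance-per-cost win comes from.

```python
import numpy as np

def top_k_moe(token, experts, gate_weights, k=2):
    """Route one token through only the k highest-scoring experts.

    token        : (d,) input activation for a single token
    experts      : list of callables, each a small feed-forward "expert"
    gate_weights : (n_experts, d) router matrix (illustrative)
    """
    scores = gate_weights @ token                 # one gating score per expert
    top = np.argsort(scores)[-k:]                 # indices of the k best experts
    probs = np.exp(scores[top] - scores[top].max())
    probs /= probs.sum()                          # softmax over the selected experts only
    # Only k experts actually run; the remaining expert parameters stay idle.
    return sum(p * experts[i](token) for p, i in zip(probs, top))

# Toy usage: 8 tiny "experts", routed with k=2, so only a quarter of the
# expert parameters are active for this token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(n_experts)]
gate = rng.normal(size=(n_experts, d))
out = top_k_moe(rng.normal(size=d), experts, gate, k=2)
print(out.shape)  # (16,)
```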

One of the first places everybody's head goes when they see this happening is: oh my God, what's happening to all the compute build-out going on in the world right now, is that under threat? But what didn't happen is all of those stocks going nuclear or something like that, and I think a lot of that is because of the opportunity around test-time and inference-time compute. I saw even one of the former research leaders at OpenAI talking about how it's pretty clear that in 2025, the optimal way to use compute is not going to be just scaling pre-training as far as you can go.

The gains are going to happen in reasoning, and I know we're still early in that journey; it's only really been the last few months or so. Obviously IBM just had a release a few days ago where we started on that journey. Maybe you could talk a little bit about what that might look like. If we're going to be attacking this vector until we get as far as we can go, what are the types of things people are going to explore? What are the opportunities? Is it just, make this thing think for a week and come back, or are we going to be a little bit more sophisticated than that? Yeah, I mean, I think at the top level, the thing to think through is that we now have a pass-through model for cost, right?

So instead of a model provider spending a bunch of money in fixed costs to get high performance, the provider can just pass that through and say, look, you can host the model and pay for it as-is, or you can pay through an endpoint, but keep paying until you get the performance you want. And if you don't need all that performance, pay less. So I think it's going to approach a much more efficient market, so to speak, where you're paying for what the task calls for, versus some subscription of X dollars a month that you're locked into. I think we're going to see a lot more flexibility in pricing. We already saw that with Anthropic's 3.7, right, where you can set different cost parameters for how much you want to pay, basically for how long you want it to think on a given task. And I think that's only going to continue until everything is going to be like, well, how much is it worth to you? I don't know if we'll get to an auction setting, almost, where you could even put a task up for bid. But I think it's going to be much more efficient in terms of actually getting economic value out of generative AI, because you'll pay for what something is worth.
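To make the "pay for what the task is worth" idea concrete, here is a minimal sketch of per-task compute budgeting. The client.generate call, its thinking_budget_tokens parameter, and the price constant are hypothetical stand-ins for whatever knob a given provider exposes (Anthropic's 3.7 release has a comparable thinking-budget setting); this is a sketch of the pattern, not a real API.

```python
# Hypothetical sketch: with pass-through pricing, the caller, not the model
# provider, decides how much inference-time compute each task deserves.

PRICE_PER_1K_THINKING_TOKENS = 0.015  # illustrative number, not a real price

def thinking_budget(task_value_usd: float, max_tokens: int = 32_000) -> int:
    """Spend at most ~10% of what the answer is worth on 'thinking' tokens."""
    affordable = int(0.10 * task_value_usd / PRICE_PER_1K_THINKING_TOKENS * 1000)
    return max(0, min(affordable, max_tokens))

def run_task(client, prompt: str, task_value_usd: float):
    budget = thinking_budget(task_value_usd)
    # `generate` and `thinking_budget_tokens` are placeholders for a provider's
    # real endpoint and its reasoning-budget parameter.
    return client.generate(prompt=prompt, thinking_budget_tokens=budget)

# A quick autocomplete is worth cents, so it gets almost no thinking budget;
# a deliverable worth $200 of someone's time can justify a long reasoning trace.
# run_task(client, "complete this line of code ...", task_value_usd=0.05)
# run_task(client, "draft the migration plan for ...", task_value_usd=200.0)
```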

Chris, maybe as a question on that: as an end user of these tools, I can decide when I just want a quick answer, when I want to use reasoning, when I want search, which is becoming an increasingly significant thing that people are rolling out. I can just decide when I'm using each of those things. As an application developer, though, you just want the model to get to the right answer at the lowest cost as fast as possible. How would you be thinking about these trade-offs going forward? Are we happy that so many of the next incremental gains are going to come from reasoning models, versus the way we used to get them through base models? Is it more complex now when you're thinking about the user experience? I used to be able to count on getting an answer really quickly; now sometimes my answers come instantly and sometimes a model goes off and thinks for five minutes before it comes back. When the developer community starts to adopt some of these tools, are they happy that more and more of this is going to get put into reasoning over time? Does that make it harder to build this stuff into applications? I think everything is a trade-off. And actually, I really like Kate's analogy on this one, and I like it because I did a video on this a while back. I think we are going to move into this agent marketplace, and I think that is probably the most important thing.

In the same way as we go into something like Fiverr and say, I'm going to spend five bucks because I need a video edited, or I need somebody to go and code something up for me, I think we're going to be in the same world with agents. And the reality is that if I need a document translated and I need it done in five minutes, you can have the best model in the world, but if it isn't doing it in five minutes, which is when I need this thing done, then I don't care. Take real-time translation: I once spent some time, I think I was in Moscow at the time, and there was a guy translating what I said in real time into Russian. A model can sit and think all it likes, but the audience is going to be waiting for that translation, right? So I think there are times where real time is going to be important, and that's going to be the same with coding.

But at the same time, accuracy is going to be important as well, because in that translation scenario, if the guy just started making up what I said because he didn't understand it, it's great that he's real time, but he's just spouting gibberish at that point, which isn't any use to anybody. So if you're fast, and you're accurate, and you're cheap, and you can do the same job as something that is big, slow and expensive, you're going to win, and that's just market dynamics. But when it comes to something really important, say I need to do some deep research and find some chemical compound, imagine having one of today's one-billion-parameter models take a plucky guess in the air without thinking about it.

Honestly, I don't think you're going to be that satisfied with the result. So it's going to be a balance on the task, on how much effort and thought and tooling you apply, but it is going to be marketplace dynamics: it's going to be latency, it's going to be cost, and it's going to be the level of intelligence that you need. And this comes back to why I still don't think pre-training is going to go away: if you can gain an edge on the base model so that it can actually reason a bit better, in combination with inference-time compute and with tools, then that might be the thing that gives you the edge in that scenario. And therefore every single company in this is in a race to have an edge.

And if there weren't a race for the edge, why are we all publishing benchmarks all the time? We wouldn't care whether this one is better than that one if we weren't. So I just think these dynamics are going to play out, and back to what I said at the beginning, I don't think this is going to go away. In the case of development, and I know this is a long run-up, Bryan, I do apologize, sorry everyone, it's taken me that long to get back to your original question: sometimes I need it fast. If I'm in my VS Code environment, if I'm just doing autocomplete stuff, that needs to be fast. But if I'm writing an entire program, an entire game, or doing a migration, the model might take five, ten minutes when it would have taken me two weeks to do it myself.

I'm going to wait that time, right? Especially if it's accurate; if I have to wait ten minutes and it's completely wrong, I'm not going to wait. These are the marketplace dynamics that I see. I think there are two really interesting points to bring up, Chris. One is that you can think of cost as how much do I have to spend, but the cost of latency is obviously critical to think about as well. So that's a third dimension to all of this as people step into the market and figure out: what is it worth to me, how long can I wait, and what performance do I need? Those three combined are going to drive you to your model selection. But I also think that, as we talk about the experience and what we've built with generative AI so far, everything we've done for the past two and a half years has been based on chat, on instant response.

Waiting turn by turn in a conversation doesn't make sense; no one would do that. But now that we have reasons to wait, because we'll get better results, I think we're going to see entirely new things get built with generative AI, new ideas of how you build with generative AI, because we now have the incentive to find those other patterns, and things that didn't require instantaneous responses suddenly come into scope.
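As a rough sketch of the three-way trade-off described here, latency, cost, and quality driving model selection, here is a hypothetical router over a made-up model catalog. The names, quality scores, prices, and latencies are all illustrative assumptions rather than real benchmark numbers.

```python
from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    quality: float          # illustrative 0-1 score on the tasks you care about
    usd_per_request: float  # illustrative price
    p95_latency_s: float    # illustrative 95th-percentile latency

# Entirely made-up catalog entries, for illustration only.
CATALOG = [
    ModelOption("tiny-on-device",  quality=0.55, usd_per_request=0.0001, p95_latency_s=0.2),
    ModelOption("mid-instruct",    quality=0.75, usd_per_request=0.002,  p95_latency_s=1.5),
    ModelOption("large-reasoning", quality=0.92, usd_per_request=0.05,   p95_latency_s=45.0),
]

def pick_model(max_latency_s: float, max_cost_usd: float, min_quality: float):
    """Return the cheapest model meeting the latency and quality constraints, if any."""
    feasible = [m for m in CATALOG
                if m.p95_latency_s <= max_latency_s
                and m.usd_per_request <= max_cost_usd
                and m.quality >= min_quality]
    return min(feasible, key=lambda m: m.usd_per_request) if feasible else None

# Real-time translation: tight latency ceiling, modest quality bar.
print(pick_model(max_latency_s=1.0, max_cost_usd=0.01, min_quality=0.5))
# Overnight code migration: latency barely matters, quality does.
print(pick_model(max_latency_s=600.0, max_cost_usd=1.0, min_quality=0.9))
```

The same shape of decision applies whether the "catalog" is a list of hosted endpoints, an on-device model versus a cloud model, or different thinking budgets for one model.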

I'm curious how you think this will actually come together and how people will consume it. With OpenAI, it's become a bit of a joke online that when you open up the interface and look at the model selection, if you're not listening to the equivalent of this show every day, how would you guess which one of these things you're supposed to use? They've been very clear that part of their roadmap is to bring them together, so you ask a question and the model just knows which of these things to use. And part of me was even wondering, in the wake of this: if you're not going to be able to break through on the benchmarks, the criteria the market understands, what's the purpose of actually shipping a base model without reasoning if it's just going to end up underperforming whatever your last reasoning model is? One of the questions I walked away with is, are we coming to the end of the line in terms of even having models that don't have reasoning attached to them? Will it be a weird artifact of history that we had those models at all, and in the future all of this will just be integrated into a single model, and the model itself will decide whether it needs to reason or give you a straight answer right away? Or is there a real chance that each of these different classes of models continues to exist and does its own discrete thing? I'm curious how much convergence you see actually happening in that space. Maybe, Kate, I'll turn it over to you for your initial take. Yeah, so a couple of things.

I don't think OpenAI made a mistake by releasing a non-reasoning model. I just think the fact that they released such a big one that costs so much was probably a bit of a waste of time and money. I think there are plenty of use cases we're seeing right now where reasoning actually doesn't help: things like tool calling, things where you have very clear structured patterns and you just want to fine-tune for that very specific task, don't necessarily require reasoning. But I almost don't know that the question of whether we'll have reasoning models and non-reasoning models matters, because, what is a model? With OpenAI, are those really individual models, or are there multiple models being routed to already? Are there experts that have been reserved for different tasks? Our whole definition of what a model is or is not, I think, is going to continue to be fluid and to evolve as we find new and clever ways to bring this together.

I do think, though, that we're always going to need more instruction-focused capabilities and more reasoning-based capabilities, and the ability to switch back and forth depending on what the task calls for. And I'd agree with that, Kate. I really would.

I've used this analogy before: I can't help thinking about mainframes, which is ironic for the company we work for, right? You had these big massive mainframes, and what is the world we live in now? We have a computer in our pocket, on our mobile phone, on our laptop, and architecturally everything is distributed. We have microservices that all communicate, they all have specialized tasks, and we have good buses between them. And if I really think forward into the future, I do think the models are going to get smaller and smaller.

They're going to get distilled down. I think that was probably the point of having such a large model: they're going to use it for distillation, and we're going to see some really good reasoning models built on a very, very good base model, based off the GPT-4.5 architecture and then, in the future, GPT-5. So I think that's really the point behind it, and also to keep the hype cycle going, which I love.

But I think we're going to end up in this microservice-based world of architecture, in the same way we moved from mainframes to distributed computing. I see the exact same thing happening with generative AI, because the reality is, if I need something fast on my phone, and the models are getting capable enough that I can do something a GPT-3-class model used to do but on my phone with a couple hundred million parameters, then let's do that in real time; and if I need a little bit more reasoning, if I need a bigger model, then I may go off and use some bigger compute. We use mixture of experts at the moment to do that routing, but as network speeds get faster and latency gets lower, because the chips are getting faster and faster, why wouldn't you start to do that across a mesh of some sort? So at that point, rather than necessarily using a mixture of experts, where you've still got one large model and you're partitioning it,

you have truly separated AIs, true experts, communicating with each other in the same way that we have with humans. So I see that expansion coming. And again, this is Chris's opinion.

It's probably not a '25 thing, but as we look into '26 and '27, I think we're all going to be going, "Oh my god, are the days of the large model gone? Is the day of inference-time compute gone? It's mesh. We need to be in an AI mesh, and single models are dead." I'm sure that's coming. All right. Well, 12 months from today, Chris, we will have an emergency pod on mesh networks, so I think that's a really good place to end. Chris, Kate.

Thank you for joining today. I think this was obviously a topic that the industry has been waiting to see the outcome of for a long time, and in some ways I feel like it asked as many questions as it ended up answering, but that's good, because it means we get to do the podcast for another 12 months. So thank you both for joining. And as always, you can find Mixture of Experts on podcast platforms everywhere, and we will see you next time.
