DeepSeek-V3-0324, Gemini Canvas, and GPT-4o image generation


It's 2026: is the top model in the world an open source model? Kate Soule is Director of Technical Product Management for Granite. Kate, welcome to the show. What do you think? I don't know that I agree with that framing, Tim. I don't think any model is top. I don't think there'll be one model that is overall best at everything, or that will rule them all, so to speak. Alright, Kush Varshney, IBM Fellow, AI Governance.

Kush, welcome to the show. What do you think? I think open is here already, and open's gonna dominate into 2026. All right, great. And Skyler Speakman, Senior Research Scientist. What's your hot take on this question, please?

If you define the top as the most used, then definitely open models will be the most used models in 2026. All right, everybody's fighting my questions today. All that, and more, on today's Mixture of Experts. I am Tim Hwang and welcome to Mixture of Experts. Each week, MoE brings you the best minds in artificial intelligence to walk you through the biggest headlines that are dominating the news. As always, there's a lot to cover. We're gonna be talking about Gemini's new release.

We're gonna be talking about a new thermodynamic computing paradigm. We'll be talking about OpenAI's image gen. But first I really wanted to start by talking a little bit about DeepSeek-V3, and specifically not V3, but a checkpoint that DeepSeek released. To give the full name if you're interested, it's DeepSeek-V3-0324, and there's a lot of hype about this release, because by some measures, one specific one being the Artificial Analysis Intelligence Index, it is now kind of the best reasoning model, the best model out there in the world.

But maybe, Kate, I'll start with you. I know you kind of fought the premise of this question when I asked you it a moment ago. Should we think about models as being the best in the world? Is that even a useful way of thinking about this space? Well, a couple of things. I think DeepSeek-V3 is a non-reasoning model, so a lot of the press is "best non-reasoning model in the world," according to reports like Artificial Analysis. I think a lot of these analyses are trying to come up with tools to help people better evaluate models and pick ones to use in production. The reality is these models are all differentiated by, like, 0.01 in performance.

Do we really think that tiny lift in performance on one benchmark is going to result in meaningful performance improvements on a RAG or even an agent-based task that you're trying to deploy in production? I don't think so. I think these are great ways to give you a list of models to start to test, but ultimately the best model is the model that does best on the task you care about. And that could be any model, regardless of how it scores on some of these top-level benchmarks. You're almost saying we're almost post-benchmarking in some ways. All the models are so performant now that it's almost difficult to say there's one absolute measure. I don't know if that's putting words in your mouth, but.

I mean, I think different model providers have different priorities. I think DeepSeek is actively chasing OpenAI. They're trying to have the same, you know, pursuit of AGI, and so some of these benchmarks are being used as demonstrations of capability on that broader pursuit of AGI.

That's fair. I don't think that means, for an everyday production task or use case, that it really reflects a meaningful difference in performance. Some of these big models are frankly overkill, and so boosting them a little bit further isn't going to make a real actionable impact. And the benchmark that matters the most if you're trying to deploy a model is: what is the performance for a given cost profile? And those are things that, you know, you really just have to test use case by use case, using information like Artificial Analysis to help you get started.

But ultimately, you know, you have to run your own experiments. Kush, maybe I'll turn it over to you. I don't know if you agree. I guess there's one way of reading your answer, which might be almost a contrasting position, I guess, to Kate, right? Where I think the way I heard you respond to the question was: by any measure, open is winning.

And so it doesn't matter how you measure this, open will be the best in the world in 2026. Is that another way of thinking about it? Or maybe you were nodding, so maybe you actually agree violently with what Kate just said. With both Kate and Skyler on this point, I mean, there's different ways of measuring what is best, and even asking the question of what is best is probably not the right way to think about it. But I think the main point is that open is the way the world is gonna move forward. So whether we wanna count best or not best, or usage, or adoption or not adoption, I think open is gonna have a very strong sort of play, just continuing.

So whether that number is a little bit above or a little bit below isn't the critical point; that it's in the same ballpark is the important point. And I think, a couple of months ago when I was on the show, I was talking about just the culture of how DeepSeek is doing their work. The fact that they can rapidly iterate and kind of make this difference, and reach their goals, whatever those happen to be, very quickly. And I think that's the continuing story in my mind: whatever happens, I think DeepSeek will be able to adapt to the changing environment, whatever the needs happen to be in the actual world.

So just in terms of the culture aspect, I mean, open culture is gonna be what's gonna dominate, actually, maybe not the open model. So maybe I'll clarify that a little bit. I'd like to jump in on Kush's point there about the role of DeepSeek, and for sure the headline that got me to click was "open source is now best."

But below that headline was this really cool graphic that showed where DeepSeek was in January versus where DeepSeek-V3 is now in March, and that delta, I think, is worth paying attention to. I agree about the difference from the other leaders; it depends on what metric you're using, et cetera. But on this same metric, the increase that DeepSeek has made from January to March is really quite impressive. So think about that change that has happened in that short period of time.

And I think that just kind of echoes Kush's sentiment about the way DeepSeek is going about creating and releasing these models. Very cool release in January and a great follow-up three months later. So that's the really cool story below the headline.

Yeah, that's right. I think what's very interesting as a way of looking at these metrics is that we tend to think about them as, is the model good or not? Right? And I guess, Skyler, what you're saying is maybe that's not the real question. This is really useful for knowing how good the team and their improvement method is, right? It's almost like: how quickly can the team hill climb? That's the really interesting thing revealed by these numbers, more so than the quality of the release in some objective sense. Well, and I also think there's something interesting going on about being able to bootstrap reasoning models to improve non-reasoning model performance. So with the initial V3 that was launched back in December, DeepSeek had an internal version of R1, which was their reasoning model, that they said they used to train it. Then they released R1 in January, and that was, you know, market moving, and now they've released an updated version of V3.

And I think part of that momentum, which is really exciting, to Skyler's point, is that we see them able to innovate on some of these core building blocks that they've released. And that's probably gonna unlock all sorts of ways that the broader open source community can also innovate, given that they've released these building blocks out into the world, like the R1 model. Yeah.

There's a final theme I guess I wanna pick up on from Skyler's original response, which is: you said maybe one metric we should just look at is usage, right? Nothing beats usage. If there's a lot of adoption, we can debate what's better and what's not, but it's the one that's being used. Do you wanna talk a little bit about that? We don't really talk about that so much. I feel like often we're very obsessed with how did it do on this benchmark, but I wonder if usage over time becomes a more important way of measuring, not model quality exactly, but who's winning, I guess, in some sense. I think downloads on Hugging Face is a thing, right? That's kind of a stab at that idea of usage. And I think that is something that these model developers keep track of and watch over time. So, no, I don't think we're too far off the mark by talking about adoption and usage.
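Download counts are easy to pull yourself, for what it's worth. Here's a minimal sketch using the Hugging Face Hub Python API; the repo ID is the checkpoint discussed above, used as an example, and the `downloads` field reflects a recent rolling window rather than all-time usage:

```python
from huggingface_hub import HfApi

api = HfApi()
# Fetch repo metadata; `downloads` is a recent rolling count, not all-time.
info = api.model_info("deepseek-ai/DeepSeek-V3-0324")
print(f"{info.id}: {info.downloads:,} recent downloads, {info.likes:,} likes")
```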

I will push back a little bit, just because DeepSeek is a huge model. If we talk about downloads and usage, I think small models are gonna lead and win: something a developer could literally download and run. The DeepSeek model is kind of a bear. I mean, it's 670-plus billion parameters that would have to be loaded in memory to run. So I think usage is really important, but I think usage for these larger models is going to be predominantly in a hosted setup. And, you know, there are interesting ways to look at demand based off of model size.
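To put numbers on "kind of a bear," here's a back-of-envelope sketch of the memory needed just to hold the weights, ignoring KV cache and activations; the precisions are illustrative:

```python
# Rough memory needed just to hold model weights, ignoring KV cache,
# activations, and serving overhead.
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    # 1e9 params per billion / 1e9 bytes per GB cancel out.
    return params_billions * bytes_per_param

for precision, nbytes in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"671B params @ {precision}: ~{weight_memory_gb(671, nbytes):,.0f} GB")

# FP16 is ~1,342 GB; even aggressive 4-bit quantization leaves ~336 GB,
# which is why a model this size effectively lives in hosted setups.
```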

And I think we'll see that a lot of small models that are more cost effective are gonna get more usage in 2025 and 2026 than some of the bigger models that are just monsters to run. Yeah, for sure. One of the secret data points of the world that I would love to know is: what's the book that's most downloaded on Kindle that's never read? And I actually wonder if there's a similar dynamic for language models, where you have these models that are widely hyped and very much downloaded, but the question is how much use are they actually getting in practice? And we have a much more limited sense of that. Right.

And that's almost kind of an invisible part of the question of who's winning. Right. You know, and I think we're already attacking the premise of that question. Great. I'm gonna move us on to our next topic, speaking about big models and, uh, also the battles over benchmarks.

Google did another kind of raft of releases; it seems like they've really been picking up the pace. There was an announcement for Google Gemini 2.5, and then also the release of this canvas feature that they've been playing around with. And, you know, because we've spoken a lot about models and benchmarks, I do wanna maybe start by talking about Canvas.

One of the really cool features about it, I thought, was actually the idea that you can be coding and then also automatically see a preview of what you're building at the same time. We've talked about this a little bit in the past: we're still trying to figure out what AI-assisted coding will look like in the future, and a lot of the innovation seems to be at the interface level. And so I guess, Kush, I'm curious to get your thoughts on these types of approaches, right? It seems like we're moving away from pure, just kind of autocomplete. I'm just interested in how you think about it as a researcher on some of these issues.

Let me start with a little bit of a history lesson, if you'll accommodate that. So, of course, there was a person, an IBM Fellow, Irene Greif. She was in our Cambridge lab; she pretty much started it. And she founded the field of computer-supported cooperative work. She started it at Lotus, and then IBM acquired Lotus, which became part of IBM Research and so forth. And that field brought together all these different sorts of things.

It was the human factors sort of things, the distributed systems, a lot of different stuff about what it really means for humans to work together, supported by computers and computer technologies. And I think the paradigm is shifting a little bit; it's more about individual work now, and how that's supported by AI, and the collaboration between humans and AI in doing that.

So kind of the co-creativity and these sorts of things. And I think just the fact that this whole paradigm is changing is calling for exactly that: the innovations in the interfaces, in the interactions.

And I think there needs to be a lot more control given to the user: the ability to tinker with the interface to make it what works for them. And the canvas is very much a great starting point, but I mean, just a single chat box is not the answer. I think everyone can appreciate that. But once we go beyond that, then the world opens up into lots of different possibilities, and I think the canvas is one.

But why not just let me, as the user, determine what is the right interface for me? And maybe that'll actually be the next step. Oh, I like that. The future will be almost purely: everybody will have their own interface for this sort of stuff. It's very interesting. I guess my question to the rest of the panel: is this the first broadly released multiplayer AI? You know, the interface where you've got multiple people interacting with the same interface; have there been versions of this before? Is this the one to make the splash, where we look back and say, this is the first time people were interacting together over the canvas? Or am I blanking on some previous examples?

No. I mean, I think Google Docs, where you're all editing at the same time, and then you can have some AI helping each of the people a little bit, is in that same pathway. It's not like we haven't seen canvas-type things. We use Mural for design-thinking sort of things, with multiple people moving things around, and our team worked to develop kind of an AI Mural version sort of thing in our Cambridge lab. So, yeah, a lot of things are happening, but it is a step, I would say.

Yeah. One question we've talked a little bit about in previous shows: it's kind of funny that, in some ways, because ChatGPT was this big moment for AI, all of the interfaces that have followed since have fallen into the gravitational well of everything needing to be chat. What's exciting about Canvas, and a bunch of other experiments as well, is that people are finally trying to stretch beyond that. And I think it's an interesting debate just how much path dependence there is here, right? I mean, myself, I'm kind of like, oh, there's no chat, or chat is less a part of this interface, and it feels a little bit weird for me. And I think that's pretty interesting to see. Kate, any thoughts on this? I don't know if Canvas is something you'd use, or how you feel about it, particularly thinking about this from a product standpoint? Yeah. I mean, in general, I'm always a fan of finding ways to move beyond kind of the initial chat-based constraints.

I think Canvas is probably more of a stepping stone than a final destination. It's got a little bit of that chat feel while still being different. For coding, I really think it's about being embedded where developers are coding today, versus having a standalone canvas app where you iterate, in terms of where you'll get the most productive use. So, you know, I think it's a little bit more of a demo perspective there. From a product strategy standpoint, I think it's interesting to look at how some of the big players are focusing on more of the endpoint side of usage, like Anthropic, which I think is focused pretty heavily there, versus more the application side.

With UIs, it seems like Google's focusing a little bit more on that with this release, certainly with some of these new features. Honestly, from my perspective, I was most excited by the Gemini 2.5 model, simply for the reasoning. I do a basic sniff test for different reasoning models: just ask "what is two plus two" and see how much thought the model will put behind this answer. Can it figure out how not to reason if it's simple? I like that a lot. Yeah, and the model did actually pretty well compared to DeepSeek, where R1 will give you five paragraphs of, okay, I've got two fingers on this hand and two fingers on this hand, and, you know, it goes way into it.

Gemini was able to give a very reasonable, short response that was still correct. So, you know, I thought that boded well. I haven't done more exhaustive testing; obviously that's just a quick sniff test, but that's the first time I've seen a more practical, just: it's an easy question, I'm not gonna spend a million paragraphs and tokens trying to give you a response. Mm-hmm. Yeah. That's great.

I love that. The idea is that actually now we need to be doing simpler evals, because the question is whether or not you're overcommitting resources. It's like death by reasoning.
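Skyler's sniff test is easy to reproduce against any OpenAI-compatible endpoint. A minimal sketch, where the base URL and model name are placeholders and the completion token count is a rough proxy for how much "thinking" the model did:

```python
from openai import OpenAI

# Point at any OpenAI-compatible server; URL, key, and model are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "What is two plus two?"}],
)

# A "death by reasoning" check: a good model answers a trivial question tersely.
print(resp.choices[0].message.content)
print("completion tokens used:", resp.usage.completion_tokens)
```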

Very, very interesting. Yeah.

Kush, Skyler, other sniff tests, vibe checks on 2.5? I do think these qualitative evals are pretty valuable in terms of people navigating: is this something I should spend time on or look into? Not in the last 36 hours, sorry. No? Okay.

Same here. I also do "where is Rome?" That's my other go-to. Similarly, you know, paragraphs of debate on where Rome is, compared to, mm-hmm, a short response on Gemini. So I thought that was pretty good.

Yeah, I will need to try that with DeepSeek. I just love the idea of it grinding away at a very simple question. It is. And really stressing about the answer.

It literally is like, okay, two fingers plus two fingers. But then if I have two toes plus two toes, how many toes do I have? It gets mind-blowingly intricate. One final thing I did wanna touch on, and Kush, I think we should recognize that you're wearing a safety vest. Before I tee up this segment, do you wanna explain why you're wearing a safety vest on the show today? Yeah, this is a safety vest because IBM Research, with our Granite program, is very focused on safety, through our red teaming, our Granite safety alignment, and our Granite Guardian model. So yeah, just trying to represent that. Yeah, absolutely.

And I did wanna finally just talk a little bit about model safety here. One of the things we've talked about in the past is how much safety is built into the model, versus a future where safety is a separate model that you're working on. And I guess, Kush, looking at a release like this, it still feels like a lot of the big companies, Google, let's just say, are still chasing after this idea that it's all gonna be embedded in the model, versus safety being outside. Do you wanna talk about the pros and cons of that? And why Google isn't doing what a lot of other companies, like Meta or IBM, are doing, which is saying, hey, we're gonna actually think about safety separately, as its own kind of model construct in some ways. I was just curious to get your thoughts on that.

I mean, Google does have something called ShieldGemma, so they do have a player in this separate-model sort of field. But, yeah, it's really not a question of choosing between the different ways of doing it. You really should do everything, because there's never any perfect solution. So yes, do the safety alignment as best as you can, and then still have an input and output guardrail, because I think it's critical.

And then even on the data curation side, try to exclude as much of the bad content as possible. And to me, a big reason for keeping a separate guardrail model alive, beyond the performance question, where yes, it does show that you can do a little bit better, is customizability. Not every application, every use case, is gonna be exactly the same. The notion of safety, the notion of what is desired and undesired, is gonna change.

And so if you just bake everything in, you don't have that flexibility anymore. We need to think that every customer, every sort of application, needs some level of customizability, and that applies to the overall model, but also on the safety side. Yeah.
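As a sketch of the "do everything" pattern Kush describes, safety alignment in the model plus separate input and output guardrails, the wrapper looks roughly like this in code. The `model` and `guardian` callables here are hypothetical stand-ins; in practice the guardian might be a dedicated model like Granite Guardian served as its own endpoint, with a policy customized per application:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    flagged: bool

def guarded_chat(prompt: str, model, guardian) -> str:
    # Input guardrail: screen the user prompt before it reaches the model.
    if guardian(prompt).flagged:
        return "Sorry, I can't help with that request."
    # The safety-aligned model generates as usual.
    answer = model(prompt)
    # Output guardrail: screen the response for anything that slipped through.
    if guardian(answer).flagged:
        return "Sorry, I can't share that response."
    return answer

# Stub model and guardian purely for illustration; each would be its own
# endpoint in a real deployment.
print(guarded_chat("hello", model=lambda p: f"echo: {p}",
                   guardian=lambda text: Verdict(flagged="attack" in text)))
```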

And I do think that's actually a great way of thinking about it: you're saying safety at every level, right? Do safety everywhere. And it's how we'll end up doing it. Kush, in 10 seconds, could you compare and contrast safety and security? The reason I ask is the UK recently rebranded their AI Safety Institute into the AI Security Institute. Yeah. What are your thoughts, not necessarily on that particular rebrand, but along those two dimensions? Yeah.

No, I mean, both of us were in San Francisco in November, right? When there was a convening of the AI Safety Institutes. You were a part of the Kenyan delegation. Right. And, yeah, things have changed a little bit. I think that's more politics, more just wording sort of things. But to me, security is at the application level.

Those are things that you do kind of in a general sense. And then the safety is at the model level: things that you're trying to bake into the model, or put an extra guardian on. And then you kind of meet in the middle; the model comes up and the application comes down. That's where the confusion might be a little bit. So it's security that's becoming more AI-ish, and the AI model that's becoming more secure in some capacities.

So yeah, to me, the general idea is just reducing the risk of harms, and the more you can do that, the better; that's the goal. For our next topic, I wanted to bring us to a hardware story: a really interesting feature coming out in Wired this week on a company called Extropic. What Extropic is investing in is an idea called thermodynamic computing. And I really wanted to bring this up because, you know, a few episodes ago we talked about quantum. And these guys, I think, are really making the argument that, well, it's not gonna be GPUs, it's not gonna be quantum, it's gonna be this new thing called thermodynamic computing.

And I think it's just really interesting as we think about the ways in which hardware influences the work of AI. I was kind of interested in the takes of this group, as the people who work in AI day in, day out: to what degree are you paying attention to these kinds of developments? Because I feel like one way of thinking about this company is that it's big if true, right? If you can actually do it, then maybe it's a really big deal.

But we kind of don't know at this point. And so I'm curious, on a day-to-day level, are folks thinking about these alternative computing platforms? Or are they still so far in basic research that they're not impacting day-to-day thinking?

Okay, maybe I'll turn to you for the first take here. Yeah. I mean, I'm not an expert at all on chip design or hardware, but it's something that IBM is certainly paying really close attention to; we have huge teams working on, and specialized in, alternative chip design and AI accelerator chips, and there's a lot of innovation going on in that space.

So, you know, with some of these headlines, normally we let them mature a little bit before we start paying closer attention. But as a field and as a whole, I think there's a ton of opportunity to better optimize and redesign chips based off of the inference loads we expect to see in the future: moving into, for example, running smaller models more times at inference, versus one big model one time at inference, in order to improve performance, as everyone starts investing more heavily in a phenomenon we're calling inference-time compute. So I think there are just tons of opportunities in this space. I'm certainly eager to see how Extropic evolves and whether something becomes mature enough that the field can take advantage of it. And this is kind of where I wanted to point the discussion, 'cause I think, in some ways, the uniformity of GPUs, and even the uniformity of NVIDIA, has been really good for the AI space, just because there's been a common standard that people can build around on the hardware side.

And one of the questions I'm curious about as this evolves is, if you have all these alternative computing platforms that end up being good ways of doing AI, whether that fragments up the space a little bit, right? I assume the way you would try to do AI on top of something like a thermodynamic computing chip or a quantum chip might look really, really different. So as you think about the future, maybe, Skyler, I'll turn to you: do we think there's gonna be more fragmentation in the space? Or, I don't know, maybe we'll find some way to just get CUDA to work on everything. You know, I'm not ready to invest in Extropic yet, but I do think they've got some interesting takes, and I was reading about it today. You don't want any randomness in your floating point operations, our typical zeros and ones.

But if you're doing billions and trillions of these floating point operations, that noise is actually okay, given the idea of AI and how we train these sorts of things: distributions of data. So the problem is, today you don't want any randomness in any individual calculation, but you wanna simulate randomness at the larger scale. Their approach seems to be: let's not bother with zeros and ones anymore at the chip level.

Let's embrace randomness down at the chip level, because that's where we're eventually going anyways: thinking always about distributions, rather than the answer being, you know, four, for example. So I'm really glad people are asking those questions. Whether or not they'll be able to induce the desired distribution by passing, you know, electrons through a metal wafer, that remains to be seen. But I'm really glad that people are reconsidering this idea of the extreme accuracy required for our zeros and ones.

Because in the bigger picture, actually, we don't need that specific accuracy when you're talking about training these massive models. So it's a really cool tension, and I'm excited to see how it plays out. But like I said, I'm not taking my money there quite yet.

Yeah, absolutely. I actually really love that you're, I think, revealing a bias in how I framed this segment, which is: hardware is the upstream thing, and all the AI people have to dance depending on how the hardware evolves. Skyler, you're almost making the reverse argument, right? What we're seeing now, and what this company is an example of, is an attempt to make the hardware match what we know about AI now. So the power is actually going the other way: in some ways, GPUs were always kind of an accident, and we're trying to rebuild around that, huh.

That's a nice take. Yeah. Kush, any final thoughts on all this? Sure.

I can maybe go back to some more of my history lesson, if you guys are okay with that. This is good. I feel like we have Chris on for the crazy take, and we have Kush on for the historical, philosophical perspective and everything like that.

Right. So, I mean, what is thermodynamic computing? I think it helps to understand a little bit of how this has come about, because you said it at the beginning, right: there's some sort of hardware lottery. Sarah Hooker is a researcher who wrote an essay all about this: whatever the hardware happens to be, that's kind of what makes things go forward, and so forth. Even the whole IBM company started that way. At one time there was this guy, Herman Hollerith, and he was doing punch cards; he did the US census in 1890 and so on, right? And that's a paper card with a hole in it, a very basic sort of technology. And then in the sixties, Bob Dennard here at IBM Research invented DRAM, which took a capacitor and a transistor, and you could do memory that way instead of through these hole-punching sort of things.

And then you get to the thermodynamics of it. So you have James Clerk Maxwell and the second law of thermodynamics, and he's trying to think about this demon that could make heat flow without any energy expended or anything like that, right? And there were these two researchers here at IBM, Rolf Landauer and Charlie Bennett. And what they figured out is how to argue against this Maxwell's demon sort of thought process.

So Landauer showed that any sort of computation actually requires the use of energy; it requires heat, right? And then Bennett took that idea and said that this demon sorting hot and cold molecules must actually do some information processing, which actually uses energy, and so the second law of thermodynamics must hold.

So all of this is part of IBM's heritage as well. But this new thing, I think it's exciting. It's been in the works for a long time as well, these thermodynamic ideas. The claim is that things like matrix inversion, which is a very important computation, and very expensive to do with large matrices, can be done naturally with this sort of approach.

So I think that makes a lot of sense. Just take a capacitor and an inductor, and with those you can actually set up the matrix on the circuit, let it dissipate energy however it's supposed to, and then the correlation among these different circuits actually tells you the inverse of the matrix. So all of that is really cool stuff.
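There's a neat software toy for this claim. In work on thermodynamic linear algebra, a noisy system relaxing under dx = -Ax dt + sqrt(2) dW settles into an equilibrium whose covariance equals the inverse of A when A is symmetric positive-definite, so sampling the fluctuations estimates the inverse. A minimal simulation sketch, with arbitrary step size and sample counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric positive-definite matrix whose inverse we want.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Euler-Maruyama simulation of the noisy relaxation dx = -A x dt + sqrt(2 dt) * noise.
dt, burn_in, steps = 0.01, 10_000, 200_000
x = np.zeros(2)
samples = np.empty((steps, 2))
for t in range(burn_in + steps):
    x = x - (A @ x) * dt + np.sqrt(2 * dt) * rng.standard_normal(2)
    if t >= burn_in:
        samples[t - burn_in] = x

# At equilibrium, the covariance of the fluctuations approximates inv(A).
print(np.cov(samples.T))   # noisy estimate; more steps tighten it
print(np.linalg.inv(A))    # exact inverse for comparison
```

The physical proposal, as I understand it, replaces this simulation loop with actual dissipating circuits, which is where the claimed efficiency gain would come from.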

And I don't see why we shouldn't be looking at those alternatives. A lens, we know, does the reciprocal operation. We know that resistors do this or that, right? So why not do it this way? We shouldn't be beholden to digital logic just because that's how it's happened over the years. I think with all of these things, you take a look back and it's always like, well, actually, it's been going on for decades, you know? I feel like all of these new developments, AI included, are part of this very long historical legacy.

So for our final topic, this was kind of a fun thing that I did want to talk a little bit about. In a week that was just packed with different announcements, the one that seems to have taken the cake, at least in my social media feeds, has been the release of OpenAI's 4o image generation. Most importantly, I guess for me, this meme of rendering everything in a Studio Ghibli format, in an anime format, has just taken over. My social media feed is nothing but these images right now. And so it's a kind of funny moment, I think, to take a step back and say, okay, image gen is suddenly trending again, in a way that almost dampened a bunch of the other announcements this week.

And I guess, playing around with it, it is really quite impressive. So, Skyler, maybe I'll throw it to you for the vibe check: if you've played around with it, what you think, and whether this is actually a big improvement. I mean, we've done style transfer in the past, right? So this is in some ways not new, but it seems to have really hit a nerve in a way that has not been the case for previous announcements.

It has. I have not played with it, but again, my feed has been filled with people re-memeing all of these different styles. And I think, with this, are we in a position to say multimodality, at least between language and images, has been solved? Or are we gonna move the goalposts further down the way? Can we say GPT-4o has cracked multimodality? I think they've done that.

I think this is some really, really cool, impressive tech. So yeah, I don't know. Otherwise we're gonna again say, no, but it can't do this, and we'll keep moving those goalposts. So I think it really is quite impressive, at least, again, from all my friends playing with it and sending images over social media.

Yeah. In some ways, having played around with it a little bit, I think it is sort of a triumph, not necessarily of images or text-to-image, but almost a triumph of the ability to correctly infer what someone is looking for when they ask. That's always kind of my reflection. Playing with older versions of Midjourney, it was like, oh, well, not quite this. Can you make this change? Can you make this change? And you finally get to the end. This one is kind of magical, 'cause it's very one-shot. You're like, I want this, and it generates an image, and you're like, oh, that's kind of exactly what I was looking for.

And I think that's really interesting. Kate, you're nodding; is there a good name for that achievement? What's the big jump here, in some ways? Well, I think it's important to recognize where we were before, which was DALL-E 3, which was back in 2023: ancient in generative AI terms. Way out of date! So, you know, DALL-E 3 was more or less being called as a tool: being swapped in when it's called to generate an image based off a part of the conversation, and then turned off, and GPT-4o or whichever model can take back up the conversation.

And so, obviously, OpenAI does not share a tremendous amount of detail on their broader architecture and design, but based off of what I've read on their docs and the release notes, they talk about this being a more native capability, embedded far deeper in the architecture of the system. So I think what we're really seeing is some really exciting innovation in multimodality focused on system design: how can we bring some of these multimodal components far more core to where the language model operates? That could mean, for example, potentially sharing some parameters, and being able to bring different components together much earlier in the process than the very-last-minute tool call, you know, call a friend and then pick back up the conversation. And I think that is the future, not just for multimodality, but for all types of understanding and more specialized tasks.

Being able to have different experts, whether that's an expert on documents or an expert on images or an expert on audio, integrated at a systems level, far more internal to the model, and having a model, or an application like a chatbot, be far more of a systems-based approach, versus: here are some weights that we're calling for a given prompt. Yeah.
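For contrast, the older bolt-on pattern Kate describes reduces to a small routing loop. Everything below is an invented stand-in for illustration; OpenAI has not published its actual orchestration:

```python
# Hypothetical stand-ins for the real components.
def wants_image(text: str) -> bool:
    return "draw" in text.lower() or "image" in text.lower()

def image_model(text: str) -> str:
    return "https://example.com/generated.png"  # placeholder URL

def chat_model(conversation: list[str]) -> str:
    return f"(chat model continues after: {conversation[-1]!r})"

def answer(conversation: list[str]) -> str:
    # Old tool-call pattern: the text model hands off to a separate image
    # model, then picks the conversation back up; nothing is shared between
    # the two beyond the tool call itself.
    request = conversation[-1]
    if wants_image(request):
        url = image_model(request)
        return chat_model(conversation + [f"[image generated: {url}]"])
    return chat_model(conversation)

print(answer(["Draw me a Ghibli-style cat"]))
```

The "native" approach removes this seam: the image capability shares the model's own representations instead of being a one-shot external call.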

And I guess, as someone who's been less involved in this day-to-day, Kate, could you explain a little bit why that has been hard? Moving from something that sounds like it was bolted on at the end to something fully integrated in the system: what makes that difficult? I think part of it is just the momentum of how things have been built the previous way. Starting with the release of the original GPT-3.5 models and ChatGPT, the way to scale performance has been baking it into the model at training time: train on more data, have more parameters, and boost your performance by baking it all in, in that upfront training step. And so I think a lot of the system design and architecture and applications have been focused on, okay, there's this big black box that we make a single call to, and we get a response back.

And we're starting to see more of a shift. And, you know, I don't know that it's necessarily more difficult. I think in some ways it's actually a lot easier to innovate if we're innovating outside of that training, innovating more on the systems-based approach. But we do have to make a conscious shift to enable that; we don't have the same tools and capabilities available. The community needs to build those.

Particularly if you're talking about doing this in the open, versus, you know, OpenAI doing this behind a closed, gated wall. They've got a whole inference orchestration layer that they haven't released to the broader world. So I think this is a big challenge that open source models face in particular: being able to catch up to the same degree on this more systems-based approach, 'cause we don't have the same infrastructure, or the same kind of revenue coming in, so to speak, to pay for that build-out, to enable that system. That's really helpful.

Thank you. Kush, I'm gonna call on you not just as the history person here, but also as the safety person. One of the things I observed in this wave, and indeed even the Studio Ghibli meme, is something that I think companies have traditionally been a lot more restrictive on, right? To say, you really don't want to copy a style. I've also seen a number of image generations that are a little bit at the edge of what you would consider acceptable image generation. Mm-hmm.

Do you think this also marks a shift in how companies are thinking about image gen? There's one way of reading this, which is that OpenAI is concluding that actually we should let up a little bit; we should allow people to use these image gen products more freely, even though they might occasionally generate some stuff that's offensive, harmful, toxic, and so on. And I wanted to just get a comment from you on the meta here: are companies opening up in a way that they haven't in the past, and what are the trade-offs of that? Yeah, I think they are, and I think the image side of things is maybe a little bit more forgiving on this, because natural language text is more used in business sort of applications; generative imagery is, I mean, less legalistic in some ways. So I do think that that is probably the case, for demonstration and for many other things.

So, yeah, I was actually playing around with this, and here's an example my wife and I were running. She did her MFA in computer art a decade ago, and she took a class in digital matte painting. One of the assignments, over a week, was to take an image, a summertime image, and change it so that it looked like a wintertime image of the same scene.

And this thing does it really well. I mean, in a minute you have what you were looking for. But then what she was zooming in on was the windows of the building, which had some minor changes between the summer and winter image. At first glance, I didn't notice it. She's an expert at this, so she was zooming in and going back and forth and really looking at whether something had changed or not.

And from the safety perspective, those sorts of little minor things that someone like me doesn't notice are probably fine. But once you're at a very expert level, if you're an actual moviemaker doing digital matte painting or other stuff, then it becomes critical.

So as a consumer tool, I think it's all good, but there's still a gap. We have this researcher, Mauro Martino. He's a world-famous AI artist, and he just created a 12-minute-long, fully AI-generated film, and he couldn't use any of the tools that are out there. I mean, he had to innovate the tools and everything else. And this is being shown in Seville, Spain these days.

And this film is at such a professional level that you can imagine the difference between what this image generation stuff is able to do and what the professionals are truly able to do. So this is not the tool for them, but I think it's safe enough for all of us. I think that's the way to think about it. Mm-hmm. Yeah, that's really interesting.

And I almost love this threshold of: good enough to fool the amateurs. I think that's actually a really important threshold. I sent an image to a friend, being like, you know, it's really impressive that they get the fingers right now.

And then he shot back with a zoomed-in version of the image to show that there was this little fingertip still hanging out somewhere. And I was like, oh no. Again, it was enough to get past my sniff test, but anyone with a keener eye would've clearly seen the problem.

So on OpenAI's blog release, they have a little paragraph about how they've used a reasoning LLM on the safety side of this generation, and I don't know if we'll get any more details beyond that paragraph, but I thought that was interesting.

I don't know if they're, you know, covering themselves, or trying, but there was this very clear paragraph about how they've used their reasoning LLM to help parse through some of these edge cases of, you know, questionable generation. So, yeah, I'd love to see if we get more details about that going forward, and why they called that nuance out in particular. Yeah, that'll be really interesting.

I missed that, and I think it's definitely worth keeping an eye on. And I think back to Kate's little reasoning vibe check: how much time it spends thinking about whether something is a good thing or a violation of its content guidelines is a very interesting test. If it happens at training, I don't care how long it takes.

That's true. Exactly. I don't need to see it; take as long as you like, as long as it's not at inference time. Yeah. Well, as usual, so many things to cover, not enough time to cover it all.

Kush, Kate, Skyler, thanks for joining us, and thanks to all you listeners for joining us as well. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere, and we will see you next week, as always, on Mixture of Experts.

2025-03-31
