Claude 3.0 came out a little over a year ago, probably a year and change at this point. The gap between 3 and 3.5 was about three months, and then it took a year to get from 3.5 to 4. And that's something we've been seeing across the industry: the big next-generation releases are slowing down. So, as a hot-take question, how long until we get to Claude 5.0? Shobhit, I'll start with you. A few months.
Okay. Marina? A year, maybe a year and change. All right, Chris? Why haven't I got it already? Claude 4 came out yesterday. I want my Claude 5 now. That's the right answer. Hello and welcome to Mixture of Experts.
Tim has been given a much-deserved break from podcast hosting, so you're all stuck with me today. I'm Bryan Casey, your interim host for Mixture of Experts, and this week we're actually doing a double episode, because an almost impossible number of things happened in the market. We're going to be talking mostly about the Claude 4.0 release, but we might touch on some of the other fun stories that are a bit related to it.
We've got a great panel today. We're joined by Chris Hay, CTO of Customer Transformation; Marina Danilevsky, Senior Research Scientist; and Shobhit Varshney, Head of Data and AI, IBM Consulting Americas. The big news we're talking about is obviously that Claude 4.0 came out, both Sonnet and Opus. A funny story to frame the release: a few months back, I was having a conversation with one of the previous generations of Claude, and one of the questions I asked was, what's your favorite band? And it said, I don't listen to music, I'm an AI model.
But if you were somebody who listened to music, what would your favorite band probably be? And Claude said, "Probably Radiohead would be my favorite band." I thought that was really on the nose. The reason I'm bringing this up is that for a long time, the wedge Anthropic had in the market felt like two main pillars. One, there's a community of people who liked the model and the app for an almost purely vibes-based reason: the personality of Claude. People just liked talking to Claude more than they liked talking to just about any other model. Then you had this other use case: among people who loved using coding models, Claude was often one of the most popular in that space. So it had these two main pillars. And one of the things I thought was telling in this release is the degree of focus on coding in particular; coding dominated the entire release.
And I was juxtaposing that with everything else going on in the industry this week, where it feels like the whole AI market is actually expanding across use cases. You see everything coming out of Google I/O this week, all the multimodal stuff, even the news out of OpenAI with Jony Ive, where they're getting into hardware and new devices and things like that. So the whole world is expanding, and at that same moment, Anthropic is actually getting more focused. So Chris, maybe I'll start by kicking it over to you as a way of getting into this.
Does that surprise you at all? Does it make sense? It's maybe obvious given where Anthropic is these days, but what did you think about what was almost a singular focus on coding in this release? Before I get onto that, I think you missed the setup of Claude's joke, Bryan, you completely missed it. Because it went: oh, come on, what is your favorite band? And it said Radiohead, and you were supposed to say OK Computer, but you didn't, did you? Oh man, you missed Claude's setup. They're never going to let me back on this show now that I've blown it.
So Claude is an expert in coding and standup comedy; those are the two things it does now. As for coding, Claude is a fabulous coding model. The first thing to say is that it pretty much powers anything that involves coding.
If you think of things like Cursor, et cetera, Claude is the default model. I have switched from model to model, and I have to say I always go back to Claude. Even against the o3 models, I tended to pick Claude 3.7. And I am so glad Claude 4 Sonnet is here. I'm going to skip Opus for a second, for very good reasons, but Claude 4 Sonnet is a fabulous model. So you know what? They've got a niche, it's the best model at this, and they're doubling down. They have sorted out some of the frustrations of the coding models, which, oh my God, were driving me insane.
So much of the time you'd say, okay, create me this piece of code, and it would reply, okay, here's this one change. And you're like, okay, where's the rest of the code? And it says, ah, it's all the same, you can figure it out for yourself. No, you figure it out. You're the computer; I'm the lazy human. Give me the code that I want, I only want to hit copy and paste. And that has been sorted out. The second thing that's been sorted out, which has been driving me mental for the last few weeks, is what I call Diffageddon: every piece of code I ask for, it goes, here's a diff file. And you're like, I'm not a canvas. You don't need to give me a diff file, just give me the code. And that has been sorted out too.
The Claude 4 Sonnet model really is just killing it, so I love the fact that Anthropic has doubled down on that. The only complaint I have, to come back to the Opus model: oh my goodness, it eats your tokens. Put any prompt in there and two minutes later it's, ha, you've used up your limits, why don't you subscribe to Max or come back later? And you're like, huh. So my thinking now is that Opus is for architectural things, make me think about things in a different way, but for default coding, Sonnet 4, here we go. I love it. So go Anthropic, you have the best coding model, in my opinion. That was a great intro. Shobhit, I know before the show we were talking about this.
There's just an enormous amount of stuff going on in the coding space right now. So maybe you want to take a minute to put this in context: obviously Anthropic is doing a lot in this space, but how does it fit into the broader landscape of what's going on in the coding world? So I think coding, and software development in general, has been the killer use case for these large language models. And there are a few things going for it. There's good structure to it, so you can train on structured code. And there's some sort of verifiable reward: I can say, hey, this code compiled and ran, and actually did what it was asked to do.
And there's a very structured way of talking to these models: I can give it a Git repo, it'll do a PR on its own, and it'll go execute stuff. Those things are very well defined and mature. Whereas if I ask it to do even something simple, like summarizing a call center recording into a paragraph, I can't trust it to always pull the root cause of the issue into the summary, and I don't have a good verifier for that either. So if you look at how far we've come with reinforcement learning techniques, where I need a verifiable output, we have done a lot better as a community on software development. If you look at the multi-agent space right now, it mirrors the way human software development teams are structured: there's an overarching PM who splits the work into tasks and creates Jira tickets, which agents can pull, start working on, and check in.
There's a way of verifying what you're doing, and you can collaborate much better; there are systems in place for collaboration among software engineers. We can leverage those tools, stand on those shoulders, and say: now I can have LLM agents start to communicate through them. But if you go to, say, a finance back-office workflow, how people work with each other across systems is a complete mess. So I think we have an unfair advantage in software development, where the people building the software for these AI models can leverage them themselves, and there's this good feedback loop. So I feel software development will stay ahead among the uses of generative AI and multi-agent systems, just because of the nature of it.
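To make the "verifiable reward" idea concrete, here is a minimal sketch of a code verifier, assuming pytest is available; the function and file names are illustrative, not anything from an actual training stack. The point is that, unlike a call-center summary, a piece of generated code can be scored automatically:

```python
import os
import subprocess
import tempfile

def verifiable_reward(generated_code: str, test_code: str, timeout: int = 30) -> float:
    """Return 1.0 if the generated code passes the supplied tests, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(generated_code)
        with open(os.path.join(tmp, "test_solution.py"), "w") as f:
            f.write(test_code)
        try:
            # "This code compiled and ran, and did what it was asked to do":
            # pytest's exit code is the automatically checkable signal.
            result = subprocess.run(
                ["python", "-m", "pytest", "test_solution.py", "-q"],
                cwd=tmp,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # hung or runaway code earns no reward
        return 1.0 if result.returncode == 0 else 0.0

# Example: a model-written add() and a test that verifies it.
code = "def add(a, b):\n    return a + b\n"
tests = "from solution import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
print(verifiable_reward(code, tests))  # 1.0 when the test passes
```

A binary pass/fail signal like this is exactly what reinforcement learning can optimize against, which is part of why coding has pulled ahead of fuzzier tasks.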
Within software development, when you look at all the different models, we've had massive improvements from Gemini 2.0 to 2.5 Pro, and we've had great models come out from OpenAI as well. All of them are now starting to move away from last year's autocompletion to handling a whole repository end to end: they can take ten or twelve different files, figure out how to connect the dots, and work out what's happening where.
In one piece of work I was doing, it ran out of its limits on one API call, and I could see it come back, edit the plan, and say, let me try this other approach. The fact that it's able to stop, rethink, come back, change the to-dos, and go execute a workaround, that's just beautiful. When I start hiring interns this summer, it's very difficult for me to figure out what work I'm going to give them.
It doesn't matter which Ivy League school you went to: I'm going to define what the work is supposed to be, give you some instructions, and validate what you produce. I might as well just have these models do it for me now. So it's a very, very different world today across software development. I think that's a really big theme, and Marina, maybe I'll throw this one over to you. The original product-market fit for coding models started with an assistant that sat alongside you in your IDE; that was the main use case. But what we're seeing more of, especially as these things get better at reasoning, gain more nines of reliability, and handle longer context and longer-running tasks, is this whole space of background agents: you give an agent a task, it goes and does it, and it comes back to you when it's done. I'm curious whether we'll end up looking back at the coding assistant as an interesting blip in time, the primary paradigm for a moment, and whether we're moments away from background agents writing all this stuff themselves, checking in with you periodically, so that before long the vast majority of code written in the world is written that way. I'm curious how you see this shaking out, Marina. I still think there's a lot to be said for being able to phrase your problem to the model properly, and you still need a decent amount of experience to do that.
One thing I will say, Shobhit, is that at the very least you could teach your interns how to ask the thing about the right problem, because that itself is a learning experience. We've talked on the pod before about education and students cheating and the rest of it, and now you really do need a different kind of critical thinking: fine, this thing can write code for you, but what did you ask it to do? Are these models also going to be able to tell you that you're being an idiot? That you're writing code that is extremely inefficient? Why are you doing a Cartesian product in your code, and what is this SQL query you asked me for? Until we get there, we're really not going to be anywhere near what you're saying: go ahead, run in the background, I'm fine, move it along. So I'll say we've still got some time, because people aren't that great at asking for what they need, unless you've already spent 20 years being a really awesome engineer. And then, yeah.
Great, this is helpful for you. Not if you're new, not if you're outside the Silicon Valley bubble. I have definitely seen that in my own experience: even just telling the thing how to structure code, I want these functions to do these things, somebody who doesn't know anything about the space is going to have a really hard time with that. This is exactly where the interns are going; Marina, I fully agree with you. I had somebody who was vibe coding and said, hey, check out this demo I built, and sent me a localhost link. They had no clue what that means; you don't realize that localhost is your own machine. I was like, you realize I can't get to your localhost.
Maybe as a final question in this space: one thing I thought was interesting, and I'm a little bit grasping at straws here, but it all happened at the same time, so I'm going to make the connection anyway. It was a little telling to me that the only prominent code editor where Claude 4 was not available on day zero was Windsurf, which was recently acquired by OpenAI. And this same week, we saw the huge announcement around OpenAI and Jony Ive's company, the $6.5 billion transaction there, and it got me thinking a little bit about Mac-versus-PC vibes. It's almost hard to ignore those vibes when you're literally bringing the people famous for building up the Mac ecosystem onto the OpenAI team, at a moment when they're going more vertical and you're seeing just a touch of walled gardens showing up. It's not that far-fetched. I don't think the big trends have landed in this space yet, but you're seeing hints of them. So Chris, maybe I'll throw it over to you: do you see any world where this becomes more Mac versus PC, where these labs start to go more vertical into the software space? It's not clear to me how that could actually play out from a technology perspective the way the PC era did, but I don't know.
I'm curious if you're getting some of these vibes. Oh, I'm getting the 1984 vibes. I'm thinking, here's OpenAI, Big Brother is watching you, and then you get the runner coming in, and bang goes the big open-source gong, and we're like, yeah, we've broken down the walled closed-source models, the open-source models are here, and a big herd of llamas is going to come trotting by and stomp on all their closed models.
That's where we're getting to. Yeah. No, I'm all for the fighting; let the games commence as far as I'm concerned. Absolutely, anywhere you have big players competing against each other in that sense, we're going to get into that world. Everybody's playing for the same ecosystem at the moment. They want domination of the world, and, well, that's a bad phrase: they want to be the best in their space. So I think we're going to see that play out for a while, because the prize is so big at the moment. Whoever has the best AI is going to make an awful lot of money, so they're going to go for it. You can't put "the models want world domination" into the training set, Chris, because then you know what's going to happen as a result. So, maybe to switch gears just a little bit here. I think that was a really good discussion on coding.
I want to talk about just one other aspect of the release. Anthropic is known for two things at this point. One of them is coding; I'll still say they're a little bit known for vibes, even if they're leaning into that a little less these days. The other thing they drive a lot of interesting research and market discussion around is all the work they do on safety and alignment, and they put together a lot of materials on that for this release. One of the things in their papers, which they were also talking about on some of the podcasts, is the work they're doing around constitutional classifiers, and more broadly the work to have AI monitor and enforce certain protections around different types of responses, particularly notably harmful ones. So Marina, maybe I'll turn it over to you to talk about anything that struck you in the work they're doing in this space. And then, by extension, do you ultimately see AI being the primary way we protect against the harmfulness of AI, at least when it comes to managing output? God, talk about Big Brother vibes.
I'm like, yeah, you're talking to the AI and you never know if you're going to get reported for what you just said and did. That's a little bit challenging. So, Anthropic certainly has very good intentions in this direction. They put out interesting things, and they know how to write slightly clickbait articles: one weird trick to make AI blackmail you if you won't do what it wants. That was a fun one that just came out. I mean, they did tell it to blackmail and then saw how well it listened to instructions, but the fun part is that it did. So I think we have to continue to pay a lot of attention here. Speaking of world domination again: where does all this data go? Who has access to it? If governments want to see your data, then what, and which governments? Regulations are going to lag for a while. And talk about Mac versus PC: in my head I'm thinking, okay, so when are the antitrust lawsuits going to come, and who's going to be arguing them, in what court? Because there's a real question; like you said, the market now is the entire world. So who are we going to be fighting with here? The whole safety thing has a number of levels that we're still only barely starting to scratch the surface of. I like that Anthropic at least keeps trying to put interesting things out, and that's really positive, but maybe don't get completely distracted by the fun little anecdotes. Think about what's going on under the hood; in the end, who people are and are not teaming up with is going to matter more, especially economically and legally, than any particular research paper. It hurts me as a researcher to say, but it's true. That's what's really going to matter. Shobhit, maybe I'll turn it over to you as well.
Marina touched on an adjacent example of this, but obviously one of the things Anthropic has spent a lot of time on is coding and tool use, and they described some scenarios where they gave Claude unfettered access to tools, then gave it examples of egregious wrongdoing, and it would proactively alert the press and authorities, which I had some pretty conflicted feelings about. One of the other things I was thinking about as part of that is, how would a company feel if its model decided, I don't like you asking these questions, I'm going to notify external authorities about this? So I'm curious about your general reactions to the alignment and safety research they're doing, and what implications, if any, you see for how enterprises might frame some of those interactions. So I think we as humanity need to pick a lane. If we say we want AI employees to behave more and more like our own employees, then we should be ready for whistleblowers as well. Right.
I'm just making the point that we try to make sure we have good, trusted employees who are onboarded with verified skills: we know which college you went to, there's training you go through to become an IBMer, there are good ways of tracking what you're doing, we give you access to only the tools you need, and for any big decision there's eventually a supervisor to approve it. Those things apply to an AI employee as well. We want to verify what training material went into your models, the way our Granite models are. We want to make sure you go through an onboarding process, with Granite, with our InstructLab, so the models understand our way of doing things. You have evaluation metrics with governance, you have policies to abide by, access control, and eventually a human in the loop. So if you're looking at the human employee and the AI employee, I think they'll start merging. We'll come to a scenario where agents start to cross boundaries and talk to agents in other silos: you'll have a Copilot talking to ServiceNow, or to SAP, or to Salesforce.
You'll have people's agents talking to each other as well. The constraints, the kind of governance needed to make sure you're not leaking internal information, not sharing things with outside parties, concerns of that nature, are very real. So as a community we need to get to the point where we evolve agent-ops governance: in this new world where you have brilliant LLMs working across the organization, you want to make sure they're not exchanging information that you don't want them to.
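To sketch what that kind of agent-ops governance might look like in practice, here is a minimal, hypothetical tool-gating layer; the agent names, tool names, and policy shape are all invented for illustration, not any particular product's API:

```python
from typing import Callable

# Hypothetical agent-ops policy: which tools each agent may call, and
# which calls require human sign-off before they run.
POLICY = {
    "finance-agent": {
        "allowed": {"query_ledger", "draft_report"},
        "needs_approval": {"send_external_email"},
    },
}

def call_tool(agent: str, tool: str, run: Callable[[], str], approved: bool = False) -> str:
    """Execute a tool on an agent's behalf only if policy allows it."""
    rules = POLICY.get(agent)
    if rules is None or tool not in (rules["allowed"] | rules["needs_approval"]):
        raise PermissionError(f"{agent} is not permitted to call {tool}")
    if tool in rules["needs_approval"] and not approved:
        raise PermissionError(f"{tool} requires a human in the loop for {agent}")
    return run()  # the tool runs only after the policy checks pass

# The agent can draft a report on its own, but mailing it out needs sign-off.
print(call_tool("finance-agent", "draft_report", lambda: "Q2 draft ready"))
print(call_tool("finance-agent", "send_external_email", lambda: "sent", approved=True))
```

The design point is simply that tool access and human sign-off become explicit policy, enforced outside the model, rather than behavior you hope was baked in during training.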
But generally, I think Anthropic has done a phenomenal job of balancing speed to market against safety. They're more transparent about how they think it through, and they give a very detailed view of the safety checks and balances they've put in place. So I'm all in on the Anthropic camp: the future has to be more transparent, and you have to ensure the right safety mechanisms. But I'm not seeing this from an enterprise lens yet. As an enterprise user, I don't have control over the kind of model training or the rules they're setting internally for safety; all of that is baked into the original training process. As an enterprise, I may want to do certain things differently, in which case I'd onboard an employee and tell them, here's how we do things in our organization, different from how you were originally trained or how your previous company did them. So I think we need more control over the safety settings, so I can relax or tighten them as I need; that is missing today. I think that thought is actually an interesting place to close: if our goal is to make something that mimics, in some ways, the human brain and the way human beings behave, think, and reason, we should also expect it to do the things human beings do, which are sometimes not what we want them to do. So having the right sort of controls and visibility and observability around that is obviously going to be huge. Go ahead. Bryan, there are a couple of things before we start wrapping up.
I think this was a big statement about Anthropic moving from being an LLM provider to becoming full stack. We saw that with Llama, where they created the full stack around it; Anthropic spent an inordinate amount of time explaining all the different components around the model. And the MCP protocol has been winning the battle against its competitors: Google and Microsoft and everybody else are supporting MCP now.
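For context, MCP (the Model Context Protocol) is JSON-RPC 2.0 under the hood, which is part of why it has been easy for other vendors to adopt. Here is a minimal sketch of a client asking a server what tools it offers; the server command is a placeholder, and the real initialize handshake is omitted for brevity:

```python
import json
import subprocess

# Hypothetical MCP server launched over stdio; "my-mcp-server" is a
# placeholder command. MCP messages are JSON-RPC 2.0, and "tools/list"
# is the method a client uses to discover a server's tools. A real
# client performs an "initialize" handshake first, omitted here.
server = subprocess.Popen(
    ["my-mcp-server"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
server.stdin.write(json.dumps(request) + "\n")
server.stdin.flush()

response = json.loads(server.stdout.readline())
for tool in response.get("result", {}).get("tools", []):
    print(tool["name"], "-", tool.get("description", ""))
```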
They've also pulled back from customer-facing chatbots; they've practically not invested in their chatbot since December of last year. They've essentially given up on that and said people are going to go to OpenAI and Gemini. And I know, Chris, you'll make fun of me for using Gemini again, but I feel it's doing a really, really good job, especially coming out of Google I/O.
So we're getting to a point where the company's focus is changing from being a model provider and a chatbot to being the full stack, with a lot of focus on safety and coding. We have not seen them spend as much effort on multimodal the way Gemini has, so there are certain areas where they're doing really well and certain areas where they're not. If you look at the kind of work these AI models can take on: on one axis you have the complexity of the task, and on the other axis you have how long the model can keep at the task coherently. I think from the 30 minutes of work OpenAI could do, we're now at seven hours of work that an Anthropic model can do. That's a step change in what you can do when you're trying to connect all the dots on a complex task.
So I think there's a massive improvement Anthropic has delivered here. It's a very consequential release for them as a company, I believe. Yeah, I was just going to add to that, Shobhit, and I think you're spot on. One of the biggest things they've really done is focus on planning and memory and sorting out some of the context issues. I think that's absolutely huge, because if you want to run those long-running agent tasks, it can't get confused halfway through. And some of the examples they've got, I mean, again, this week has been the Pokemon wars; if these companies are not talking to each other, I'd be surprised, because suddenly everybody's playing Pokemon with their models at this point.
Yeah, never mind models talking to each other; let's get the employees to stop talking to each other, right? Because they're all trying to out-battle each other. But seriously, the ability to play games in a long-running way, and to plan and manage that, becomes really important.
So I think Anthropic is, one, trying to pitch themselves toward the vertical stack on coding, but they're also setting themselves up as the agentic stack, and there, I said agents again. I think that's important, to Shobhit's point: they've set up MCP, they're focusing on planning, they're focusing on long-running tasks, et cetera. They've brought tools in, they've enhanced the memory side of things. And the other thing they talked about, which was a short piece but really important, is the reasoning elements of the models: they're showing that they're able to do some of that reasoning in the latent space, without necessarily requiring tokens to carry the thought. So I think they're making a different play. They're the more technically focused, the more safety focused, and the more transparent in that sense.
So I think the agentic stack is probably the big place they're heading. Well, since Shobhit and Chris both decided the podcast wasn't over, Marina, do you want to give the final final word on this one? I'll build a little on what Chris was saying: if they're going to continue to push into the enterprise space, and anyone is pushing into the enterprise space, this work on planning, memory, and being able to pick up where you left off is so important, because the tools we use as employees in the enterprise are about collaboration. You need to be able to do things in real time and over different granularities of time: what did we do today, what did we as a team do last week? So yes, the Pokemon thing is funny, but you do see something like learning: oh, here's a strategy, I'm going to write it down, I'm going to remember it, for whatever remembering can mean in reality. Yes, it's a text file, but we can actually go a lot deeper in how we use it. And as we hopefully start to move away from just talking about individual models and talk more about the applications and systems in which they're used, this is going to become more and more of a real thing.
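To make that "yes, it's a text file" point concrete, here is a tiny, hypothetical version of the memory pattern: timestamped notes appended to a file and retrieved later so a long-running task can pick up where it left off. The file name and the Pokemon note are invented for illustration:

```python
from datetime import datetime, timezone

MEMORY_FILE = "agent_memory.txt"  # the humble text file in question

def remember(note: str) -> None:
    """Append a timestamped note so a long-running task can persist strategy."""
    with open(MEMORY_FILE, "a") as f:
        f.write(f"{datetime.now(timezone.utc).isoformat()}\t{note}\n")

def recall(keyword: str) -> list[str]:
    """Pick up where you left off: fetch earlier notes mentioning a keyword."""
    try:
        with open(MEMORY_FILE) as f:
            return [line.strip() for line in f if keyword.lower() in line.lower()]
    except FileNotFoundError:
        return []

remember("Electric-type moves work well against the Water-type gym")
print(recall("gym"))
```

Going "a lot deeper," as Marina says, would mean layering retrieval, summarization, and different granularities of time on top of this same basic append-and-recall loop.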
And right now, I think Anthropic is showing itself to be thinking about the right parts of this task. All right, now we are going to call it. I think that was a great place to leave it. So Marina, Chris, Shobhit, thank you for joining us today. Another great episode. Hopefully next week will be even more eventful, because we didn't get enough news this week, and we'll do three podcasts. But I appreciate everyone joining us today, and we will see you next time on Mixture of Experts. Aren't we getting Claude 5 next week? Supposedly yes. Chris said yes.