o3 and o4-mini, Google Gemini on-prem and NVIDIA’s U.S. chip manufacturing


o3, o4, o4-mini, o4-mini high, GPT-4o, GPT-4.5. What model are you using? Chris Hay is a Distinguished Engineer and CTO of Customer Transformation. Uh, Chris, welcome back to the show. And, uh, what's your preferred model? Oh, you missed 4.1, Tim, so that's gonna be my model.

I'm picking 4.1, the one Tim didn't pick. Very nice. Thank you Chris. Vyoma Gajjar is an AI Technical Solutions Architect. Vyoma welcome back to the show.

Uh, your preferred model, please. Thank you. And I think it's the OG, 4o. Nice. The classics.

And joining us for the very first time is John Willis, who's an Author and Owner of Botchagalupe Technologies. Uh, John, great to have you on the show. What is your preferred model? Hey, Tim. Gemini 2.5, oops, sorry. Uh, no, actually, I think I was on o3, but I think 4.1 is actually my favorite for

coding right now. Nice. That's awesome. Well, all that and more on today's Mixture of Experts.

I'm Tim Hwang and welcome to Mixture of Experts. Each week, MoE brings together a world-class crew of technical experts, wisecrackers (I'm talking about you, Chris), and industry veterans to discuss and debate the biggest news in artificial intelligence.

As always, there is a lot to cover. We're gonna talk about Gemini being on premises. We're gonna talk about John's great blog post on AI evaluation tools, and we're gonna talk about NVIDIA opening up factories for chips in the U.S. But first

I want to start with OpenAI's announcement. So just this week they announced o3 and o4-mini, um, their kind of latest generation of their ongoing class of models. And I guess maybe Chris, I'll throw it to you first, on a vibe check.

These seem really good, like o3 seems amazing. Um, I don't know if you agree with that, or how you've kind of felt about it on an initial pass about the models. Yeah, no, I've been having a lot of fun with those models since last night. So o3 is really good. And one of the things I really appreciate about it as well is that the personality is improved; there's just a lot more to it.

So things like being able to make really good refactoring suggestions, and how to improve the architecture of your code — it's actually coming back with some really good stuff. I have to say, o4-mini at the moment, just for getting stuff done quickly — you know, I want to create some unit tests or I just wanna refactor some code — then o4-mini is just doing great, and it is super, super fast. So I'm impressed with the models.

I'm loving it. And again, as I said at the beginning, 4.1 sitting in the Codex CLI — loving that as well. So, uh, this is a great week for models. John, it'd be good to bring you in.

I mean, I think, you know, there's some grumpy people on Twitter. There always are grumpy people on Twitter, from the peanut gallery, who for these announcements were like, this is just incremental. This is not like a big deal.

There's no big new features they're announcing, this is just like a slight improvement, and, you know, the argument was kind of, OpenAI is like asleep at the wheel, just 'cause, like, they're not really making the groundbreaking advancements that we were expecting. Do you buy that at all as a way of kind of thinking about this new announcement? No, I think they're constantly advancing. I mean, you know, like I said earlier — sort of half joke, not really joke, not joking at all — about the 2.5 on Gemini, how powerful that is, and

then we'll get to the Google section. But, um, you know, I went all in on o3-mini with deep research, and that was, okay, this is changing my life. And then like literally a month later I'm finding that, you know, the research on Gemini is better. I, I think the grumpiness, you know —

I won't go on and on about the grumpiness and the comparisons. It's all nonsense. It's what you wanna solve. I mean, for me, I'm a DevOps guy, you know, I'm one of the founders of the DevOps movement. I wrote the DevOps Handbook.

And I think, for me, I go to SWE-bench right off the bat. The software engineering benchmark — that's the place I go first, right? And, you know, I haven't verified it, but it looks like o3 and o4-mini have a significant jump based on their SWE-bench results, right? And so, how do you solve the kind of problems that I face with my customers — how do we solve problems? That's the one, and, you know, that's if I believe the benchmarks — I haven't tried 'em yet.

But, um, and then I think the Aider Polyglot benchmark is also another really good one to take a look at. And so those are the kind of problems I face, when people expect me to know things about AI and DevOps and infrastructure. So I try to stay up on that. Yeah, for sure. Vyoma, what was your review? I don't know if you've played around with the models yet and what you thought.

I did play around with the models a little bit. One of the things that I noticed right off the bat is it takes a longer time to reason now — so the reasoning time has increased a bit, but that has helped them improve their accuracy. I won't use the word accuracy so loosely; it'll give you more relevant answers — more accuracy in getting relevant answers.

I feel that was one of the sweet things that I saw that has improved in these models. There is a lot of visual reasoning also added to it. So like, if there are images, you can ask it a question.

And then, so I was asking it like, hey, I'm doing some planning for a particular wedding. Can you tell me, how do I go about the decor? What do I do about this? And I just gave it like weird Pinterest, um, dress images, and it was trying to reason on them as well. It told me, no, this doesn't work, this works. So I feel — and I'm the first one to say this in this podcast, and it's not Chris saying it — that's agentic AI use with these particular models. You did it.

I did it. You beat Chris to it this episode. So.

Yes! Um, I think that is going to be game changing for this as well. Like, yes, all these models have like small, small improvements, but it depends: how can you use these improvements in enterprise? And I feel these models

have that edge for enterprise AI. Yeah, for sure. And I wanna dig a little bit more into that.

Um, and, and I do love the idea that, like my friend was like, the minute that an agent can plan a wedding, you know, like AGI is here basically, like that's, that's the threshold that we'll need to pass. Exactly. Exactly.

Um, I mean, so part of the announcement, OpenAI was touting kind of both of the things you're talking about, right? Like, one of them was the idea that its agentic tool use was improved. Um, and it sounds like that is much better in the stuff that I've been playing around with. But I think the one thing that might be interesting — and I don't even quite understand this, so maybe the panel can help me parse through it — is that, you know, they said, look, one of the great things about our models now is that they literally think in images, and that's gonna lead to much better performance with visual reasoning.

Um, Vyoma, what's that? What's that mean exactly, thinking in images? I don't know if you have kind of a sense of that, as we kind of parse through it. Because I read it and I was like, I don't even know what that is exactly.

Yeah. So I feel thinking in images is creating those different graphs based on the questions that you ask, or like trying to do a side-by-side analysis. Let's say I fed in some images to reason through those images.

Let's say I gave it a screenshot of a pivot table or something, and I'm like, hey, this is what I want, this is how I want to reason with this particular pivot table, then help me generate a report. So it has to kind of understand these images, understand the nuances of it, and then make it relevant to the question that you asked.

And then give you an answer based on those kinds of visual representations that you see. So it all seems like, oh, given a picture, it's so cool — but there's so much math that goes behind it. It's crazy that we've reached these levels, that we can actually reason over these images and visuals that we are seeing now. I think that, to me, that's the difference: the reasoning. I don't know exactly where the reasoning changed in the new models versus the old, but my sense is — and you guys can correct me — if I loaded an image into one of the prior models, I got pretty much an interpretation of that image.

But now it will do the sort of chain-of-thought reasoning around my question with the images, and be able to sort of task through certain image understanding. You know, the whole idea of reasoning and being task oriented — that ties into the whole agentic thing. So I'm the second one, right? Agentic processing was a little bit harder in the older models; you had this sort of, not really single shot — but now it will actually take the task of reading a file or doing a search, or sort of figuring that stuff out for you. So my sense, not being an expert in it, is that it does the same with reasoning over images. Chris, maybe I'll bring you in. I think one benchmark that I always kind of have in mind is

sort of the kind of race between open models and closed models in the space. Um, and you know, I think every month it's kind of neck and neck, right? Like, open source models seem to be gaining really quickly, then kind of the more closed source model companies will release something really interesting. Um, how do you read o3 and o4-mini? Like, do you feel like, you know, closed is still staying ahead of this game?

Um, is open really catching up? You know, I'm kind of curious, just to check in on that race and whether or not this kind of causes you to update at all. No, I don't think it's gonna cause me to update. I mean, as I said, I'm a huge, huge fan of the o3 models and the o4 model. Um, I have to say, I actually am really loving 4.1 mini at the moment, just —

even though it's not a reasoning model, I have to say, just for kind of coding tasks, and then evoking chain of thought with it, it's actually kind of really good in that sense. But coming back to the kind of closed versus open — I'll make a prediction, and I'm fairly confident in this prediction: today we are gonna be amazed, this is, oh wow, this is the greatest thing.

And then within the next month, I'm gonna say, DeepSeek will update with their latest model. And I think most of the gains that you will see on reasoning — you know, o3, o4 — you will see the equivalent probably in that model. And then we'll be like, oh my goodness, open source has caught up again, there's no moat, and stuff like that. And we're gonna keep going through that cycle.

So I just think that the time from seeing something groundbreaking from the closed models to open source catching up — there is a lag. I would love to see, today, where open source — and I keep saying open source, and the comment section is gonna go wild, when I really mean open weights, right? — but I would love to see the open weights community go ahead of the closed source providers. That would be a big changing moment. Whereas I think at the moment there is just a lag, all of the time.

It's a small lag, um, but there's still a lag. But I have to say, the new o3 model and the GPT-4.1 model — it really is beautiful. I mean, hmm, it is just — the answers are good.

The reasoning is good. The personality is great. I love it at the moment, actually. Nice. It's got that "je ne sais quoi," you know. So, yeah, I have to say, I mean, I feel for the team at OpenAI, right? That kind of window is getting shorter and shorter and shorter, right? Where you launch something, you're ahead, and you only have so much time to capitalize on that before kind of the open weights catch up, and it's tough strategically and competitively for them.

But are they really catching up, man? I'm, you know, I'm all in for open weights and open source models, right? I want them to win for so many reasons, right, beyond what we talk about here. But I mean, I'm just looking at the Select Committee on Strategic Competition — uh, by the government — the "DeepSeek Unmasked" paper that I think just came out in the last couple of days, right? And I mean, there's just a lot of dragons in DeepSeek. So if DeepSeek is the one that's literally the poster child

for open weights — again, I don't know. It worries me. 'Cause right now I do more research than I do coding, but I do a fair amount of coding.

I mean, right now the model that I use is Sonnet, you know, for pretty much everything — I'll have to try 4.1 a little bit more — but, you know, Sonnet, and then I use Gemini 2.5 for my research. And, you know, the amount of work to do the investigation, to find things that could work better for me right now — I just don't see it on the horizon.

Yeah. I feel this is going to be an ever-changing field. And as I've started seeing in enterprise AI — I keep talking about it — the clients are now looking into more complex use cases. So I don't feel a one-model-fits-all solution is going to help anyway. So I feel as long as we have new models, that's fine. There are different use cases for each of these different models.

There's going to be a market for each one of them. So we'll see as we evolve, once we go into production — which I don't think has happened yet, and I've been so bullish on it for a couple of, uh, months now. So I'm hoping this is the year when we're like, oh, this is the broad environment, which is fully agentic.

Like, I'm yet to hear it from someone. And I want to build on that, Vyoma, because I actually think it's less about the model. I truly think it's about the ecosystem and the tools. So if, again, we come back to one of our earlier discussions with things like Manus, then it's about who is doing the planning in this sense, right? And that may be the large model that's doing the planning and the reasoning, but then what tools are available to it? So, John, in your world, you know, does it have access to a compiler? Does it have access to something like Terraform? Do you have the knowledge models which explain what a good CI/CD pipeline looks like, what a good Terraform template looks like — you know, this is the best practice for a Kubernetes cluster?

You know, so there's a whole set of knowledge that doesn't need to exist in the model itself, and there's a whole set of tools that you need to make available now. You need a good orchestrator, you need good context. And that's why the models become really important. But I would say that a really super all-knowing model that doesn't have access to your knowledge repository, that doesn't have access to a good ecosystem of tools, is not gonna be as great as, you know, a proper agent workflow, I think.
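The ecosystem Chris describes — a model doing the planning, with a registry of tools and knowledge sources around it — can be sketched roughly like this. The tool names and the trivial keyword "planner" here are hypothetical stand-ins; in a real agent the planning step would be an LLM call and the tools would wrap real systems like a Terraform linter or a best-practices knowledge base.

```python
# Minimal sketch of the orchestrator pattern: the planner decides which
# tools to invoke, and the orchestrator executes them and collects results.
# Tool names and the keyword-based planner are illustrative only.
from typing import Callable, Dict, List

# Registry of tools the orchestrator can hand to the model.
TOOLS: Dict[str, Callable[[str], str]] = {
    "terraform_lint": lambda src: f"lint report for {len(src)} chars of HCL",
    "knowledge_lookup": lambda q: f"best-practice notes for: {q}",
}

def plan(task: str) -> List[str]:
    """Stand-in for the LLM planner: picks which tools to invoke."""
    if "terraform" in task.lower():
        return ["knowledge_lookup", "terraform_lint"]
    return ["knowledge_lookup"]

def run_agent(task: str, payload: str) -> List[str]:
    """Execute the planned tool calls and collect their observations."""
    results = []
    for tool_name in plan(task):
        tool = TOOLS[tool_name]
        results.append(f"{tool_name}: {tool(payload)}")
    return results

steps = run_agent("review my Terraform template", 'resource "x" {}')
for step in steps:
    print(step)
```

The point of the pattern is exactly what Chris argues: the knowledge and capability live in the registry, not in the model, so a stronger planner or a richer toolset can be swapped in independently.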

Honestly, that's gonna be the big play, um, over the next year. So I do want to get away from talking about models, but I want to get into this ecosystem world. And I think — I mean, you said it way more elegantly than I said earlier — when you asked me, Tim, about this chatter on Twitter or wherever, right? Like, it is about the work, like you said. So you're right, it's less about the model every other week

coming out with some advance, and this one's better, and what did a benchmark say. In the enterprise space, it's going to be about some mixture of orchestrated models, and a lot of 'em will be very focused on the tasks at hand. Exactly. So thank you for summarizing that. There's an announcement that we actually did not get a chance to cover last week. Uh, it was announced as part of the Google Cloud Next raft of announcements that came out, but I did wanna make sure that we touched on it, because I think it was a pretty intriguing move, I would say, by Google in the space.

Um, and the substance of the announcement is that Google is going to let companies run Gemini models in their own data centers, uh, starting in Q3. So this is kind of the rise of, effectively, a company saying, we will allow you to do on-prem with these models.

Um, and I guess, John, maybe I'll turn it to you first. This is kind of a big deal, right? Because I think companies traditionally have been very, very paranoid about letting anyone run their models on their own infrastructure, but Google clearly thinks that there's some upside here. Uh, how do you read this move? I think, you know, they were first in on running Kubernetes on-prem. I mean, it's a good move. I think, you know, it shows that they're less worried about somebody reverse engineering their sort of —

the layers in their model, right? 'Cause that is sort of the danger, right? I mean, even though DeepSeek was able to do it to OpenAI anyway. But yeah, no, I think — I've been a big fan of Google for years. I mean, you know, if you add up all the bells and whistles, running Vertex on-prem — I think the Gemini models are right up front with everyone else. I think this solves that air-gap problem, um, and I think now they're making a strong argument for why you might want, as an enterprise, you know, to have an option to go all in on Google infrastructure.

You know, and you got the Agentspace thing, which is now this workspace stuff. And, you know, I've done some hackathons with Vertex. And if you're in on the Google infrastructure, like Gmail and all that stuff, it becomes a very powerful workforce automation structure. Yeah — I hadn't really thought about that, John.

I guess, Chris, I don't know if you have any comments on that: how much should we think about this, almost like a DeepSeek downstream thing, right? Which is, normally the fear would be, oh, well, you're gonna reverse engineer my models if I let you just run them on-prem. And I guess, is this sort of a concession to the idea that, well, reverse engineering is gonna happen anyways in this space, so why worry about that? I would love it if I could have Gemini running on my machine, so I can sit and reverse engineer it and figure out what they're doing and how it differs from Gemma. So, uh, yeah, please. Please, Google.

Please do. Um, I think the on-premise announcement is actually kind of super important, because the reality is that if you take things like government organizations, military organizations, et cetera, there's a whole set of people who can't run their workload on cloud. And therefore being able to satisfy the AI workload on premise, from a security perspective, I think is super necessary.

I also think — we've had these discussions about latency before — that as we move into agentic workloads, there is gonna be a need to run your AI closer to the device, closer to your system. So a good example is maybe, if you're running a kind of gaming environment, or like a stadium, or, you know, anything that's got, like, on-premise cameras or whatever, then there's a need to have that data not go up into the cloud, but actually be as close as possible. I think there is a market that is definitely underserved there, and I think it makes sense for Google to go after that. The real difference, to your point, is how safe and secure are they feeling that their model weights are not gonna be reverse engineered? And again, I don't know the answer to that. I don't know how good the encryption is on these kind of Blackwell chips and all that, but I'm pretty sure that once these things are out in the open, then somebody's gonna release it somewhere.

And maybe they're okay with that. But I think it's an interesting move, and I think it's a necessary move that the industry's gonna have to go towards. So, you know, well done, Google. The only thing I would say is, outside of those very large organizations — and I'm just thinking about the sizes of the Gemini models —

are people really gonna have the GPU workloads for that? I get it for maybe the small models, right? So maybe — I know they're doing kind of Gemini mini type models — I think that's a reality. But for their frontier models, are those organizations

really gonna have the GPUs? And even if they do, are they gonna want 'em just sitting around, whirring away, doing nothing? I'm not so sure. So I think it's a good play. I just think it's gonna be interesting to see how that works out over time. Yeah, for sure. And Vyoma, this is actually going in a direction that I would love to get your opinions on, which is almost like market size.

Um, 'cause it feels like the unique advantage that Google has is in saying you can run these giant models on-prem, and it's kind of like, well, what's the set of customers that actually has the technical proficiency to run a big inference cluster of this scale? And you can say, okay, well, you know, maybe the market is actually in smaller models. But then kind of the argument is like, well, isn't it just cheaper and easier to do open source and run it on your own infrastructure anyways?

And so there's a question about how big of a market Google is really talking about here. And I don't know — to Chris's point, maybe it is just the government, and that's a huge customer. But, um, I'm curious about how you size that up. Yeah, so this goes back to the previous question that we were discussing.

I feel Google is trying to do this to position itself in these slower-moving industries, who have been a little bit slower in adopting AI — like the government, healthcare, highly litigated industries, finance, et cetera. So I feel they are trying to position themselves as the key leaders in those industries: hey, now we have a model, now you can utilize this. At least get them embarked on this entire journey of AI, which hasn't been so great yet, right? And rebuild that trust over there.

And yes, slowly, slowly, as we see this entire space evolve, I feel there will be smaller models coming in, which will help them, um, reduce the footprint, need fewer GPUs, et cetera. But I feel this is like a kickstarter event: okay, here, now there's one. We've started this entire revolution, and I feel in a couple of months — more like weeks, we can't say months anymore —

um, this gap between cloud and on-prem is going to reduce significantly. As it is, it was a much-discussed topic everywhere. Whenever I go meet clients, their biggest problem is data sovereignty and AI governance. And once you bring something like this — okay, now you have this.

Are you gonna adapt to this? If you adapt to this, there are 10 different problems which will come up. Someone else will try solving those 10 different problems with their own smaller version of a model. So I feel this is going to keep evolving over the couple of months that we see. And, um — the open source part that you said, with smaller models, they can utilize those, but if it's not on-prem, it's not of any use for this huge market that we have in highly regulated industries. So we'll see.

But I think the latency is a big issue. I've tried to build some voice-integrated stuff, and it's just really hard to do. Um, so latency — but I think it goes back to scale. What Google understands is scale. And they've been doing GDC for, I mean, four or five years now, at scale.

They're running Kubernetes, they've bought Wiz. So I mean, there are some real ingredients there. And there are a lot of large manufacturing companies that are really looking for this — you know, I've been to a couple of them — where I think this could really resonate right now, in terms of, like, the IP that it takes to build tractors or, um — you know, there's just a lot of things where they're still very worried about that IP leaking out. Not just — I mean, air gap for sure, government, absolutely,

uh, top secret clearance — but just IP. I mean, just really important IP. And just to put a bow on it from a DevOps folk: people talk about open source, but like, okay, I'm gonna go open source model, I'm gonna open source —

which Kubernetes am I gonna use? I mean, it starts adding up. The cost of managing that stuff becomes its own little cottage industry in an organization. And so to me it seems like a very appealing opportunity. I'm gonna move us on to our next topic. And John, I'm gonna stay with you. Um, you did a blog post actually for All Things Open, uh, earlier this year, on AI evaluation tools.

Um, and I thought, you know, we might as well use the opportunity while you're on the show to kind of talk a little bit about that. We've kind of touched on it in the past episodes, but never kind of head on. Um, and so I guess maybe I'll just kick it off with you.

I mean, what are AI evaluation tools? Why are they important? Um, and then I think there's a couple questions coming out of there that'd be fun to kind of talk over with you. You know, I spent a lot of time — I wrote, uh, a book about DevOps automated governance, and how you can sort of do what internal auditors do.

You know, the way they handle systems today is they take a change record and they work it all the way back for provenance. In the new world, it's gonna be an answer. And it's gonna be: how did I — why did I get this answer? And you are gonna have to show the provenance. You're gonna have to show the ingress and egress of a prompt. You're gonna have to show, if you are using RAG, how you chunked it, where the source came from. And you're gonna have to have evidence of all that stuff.

And a big part of that evidence is: did you test it with ground truth? In other words, did I throw a thousand questions at it, and every time I changed anything in the pipeline, does it measure out at like 93% correctness? Does it measure out at less than 2% hallucinations? And, like, we know these are probabilistic systems and we're never gonna get a hundred percent, but I think the new audit is going to demand you show evidence that, A, you accepted the policy — there was a risk — but, B, that you adhered to the policy. And so evaluations become these really incredible computational, quantitative and qualitative implementations to basically measure the probabilistic output of these systems. And you can do it in a very auditable way, right? So you can have proof that you literally — so yeah, there are systems that do computation for correctness and evaluation and ratios.
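The policy-gated evaluation John describes — a fixed ground-truth set, rerun on every pipeline change, with threshold results kept as audit evidence — might look roughly like this toy sketch. The graders here are stubs, the 93%/2% thresholds are just the numbers from his example, and a real harness would use proper correctness and hallucination scorers.

```python
# Toy sketch of an audit-style eval gate: run a fixed ground-truth set
# through the pipeline on every change, record the scores as evidence,
# and fail the gate if policy thresholds aren't met.
from dataclasses import dataclass

@dataclass
class EvalResult:
    correctness: float         # fraction of answers judged correct
    hallucination_rate: float  # fraction of answers flagged as hallucinated

def run_eval(pipeline, dataset) -> EvalResult:
    correct = hallucinated = 0
    for question, expected in dataset:
        answer = pipeline(question)
        correct += int(expected in answer)            # stub correctness grader
        hallucinated += int("UNSUPPORTED" in answer)  # stub hallucination flag
    n = len(dataset)
    return EvalResult(correct / n, hallucinated / n)

def gate(result: EvalResult, min_correct=0.93, max_halluc=0.02) -> bool:
    """Policy gate: the thresholds John cites, encoded as pass/fail."""
    return (result.correctness >= min_correct
            and result.hallucination_rate <= max_halluc)

# A fake pipeline that just echoes the expected answer — stands in
# for a real RAG chain under test.
dataset = [(f"q{i}", f"a{i}") for i in range(100)]
result = run_eval(lambda q: "a" + q[1:], dataset)
print(gate(result))
```

The audit angle is that `result` gets persisted alongside the change record, so every pipeline change carries quantitative evidence that the policy was adhered to.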

And then LLM as a judge is another big part of it. That's sort of the way you use LLMs to evaluate. And one last thing — I know I'm taking all the time — but what's interesting is these new frontier models, when I talked about this, that are actually designed as evaluation models. And that gets really interesting.

So normally when you do LLM as a judge, you're literally taking — you know, like, you might use, uh, GPT-3.5, which doesn't exist anymore, as your evaluator. You never use the same model that you use for your inference. But now there are models that are maturing, designed specifically for evaluation. So that's the shortest version of the article. But yeah, I'm really excited about this. For enterprise, I think it's one of the most important conversations to have in an enterprise that's going all in.
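The LLM-as-a-judge pattern John mentions can be sketched like this. The judge is injected as a separate callable, which is one way to keep his rule — never judge with the same model that produced the answer — explicit in the design. The stub judge and the PASS/FAIL prompt shape here are illustrative assumptions, not any specific product's API.

```python
# Sketch of LLM-as-a-judge: a second, independent model grades the
# answer produced by the inference model. The judge here is a stub
# callable; in practice it would wrap a call to a dedicated eval model.
from typing import Callable

def judge_answer(question: str, answer: str,
                 judge: Callable[[str], str]) -> bool:
    """Ask the judge model to grade an answer; expect 'PASS' or 'FAIL'."""
    prompt = (
        "You are an evaluator. Reply PASS if the answer is faithful and "
        f"relevant, else FAIL.\nQuestion: {question}\nAnswer: {answer}"
    )
    return judge(prompt).strip().upper().startswith("PASS")

# Stub judge: "passes" any answer that mentions the expected keyword.
def stub_judge(prompt: str) -> str:
    return "PASS" if "Paris" in prompt else "FAIL"

print(judge_answer("Capital of France?", "Paris", stub_judge))   # prints True
print(judge_answer("Capital of France?", "London", stub_judge))  # prints False
```

Passing the judge in as a parameter also makes Chris's later worry testable: you can swap in several independent judges and compare their verdicts rather than trusting one.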

Yeah, for sure. And I think this is actually a space where, Vyoma, I'm curious if you're seeing a similar demand from customers on this. Um, you know, I think the old tradition of machine learning is, like, well, we don't really know how it works, and, uh, we just throw a lot of data at it and it just seems to be able to solve the problem, so stop asking questions, right?

That has been, I think, the vibe I've gotten from a lot of people. But clearly, as AI now is trying to service customers that have much more serious concerns about these types of questions, it feels like the kind of market pressure for these tools is also increasing. Um, I don't know if you're seeing that on the ground, though, talking to clients and customers. No, no, that's very true. So when we talk about even the machine learning models that we were doing in the past — even if I went to the enterprise customers and told them, oh, this is a solution that we've built —

they would have their solution engineers, their software developers, engage with you. If you build something for them, they know the system in and out. So this has been a trend since the very get-go, right? They want to know what's going on.

How did you use the particular model? Why — like, let's say a regression model, a classification model — what type of model? What were the different metrics that were used? Ground truth was always available because we were working with structured data back then; they created some on their own because they had a lot of rules around it. From the very beginning, there were different types of rules and regulations based on the metrics that were created around it. Now, when we moved into the LLM world, we started losing all of that, because there's no longer a human doing any of this.

Now you've given all the power to a machine created by a human, which we do not know how it works. Like, ask anyone, what's a transformer architecture? What's an encoder? What's a decoder? You won't find clear-cut answers. People want those answers. And I see this in enterprise, whenever I'm speaking to a customer: how do I know this answer that has been generated is right or wrong? And there is a lot more at stake now. You put out a particular chatbot, as John was saying, in public, and you're like, okay, fine, we have a great chatbot.

Some person who has all the time in the world, in a remote place, in a town somewhere, is sitting and going to chat with the chatbot for days, trying to manipulate it into doing something. We've seen examples of that. You might lose like billions of dollars right there. So until and unless you have these guardrails — I think even the government is going to double down on that, because once you start using this in highly litigated industries, they'll be like, okay, now do this according to our rules. And then the private industry looks at that and is like, wow, they have these great rules.

How about we incorporate them? So again, this has been going on for ages, and I feel it will continue. But the need now is much bolder and stronger than it ever was when we started, because everyone's done experimenting. Now they have to show proof of value.

How many billions of dollars have you spent on research? What do I have out of it? Show me. So I think this is going to be a very strong, sticky trend. Yeah. The issue I have with this,

John, and this is probably your world from a DevOps perspective, is that we are lazy. I mean, how many of us write unit tests in the first place? And what is the first thing we did with gen AI? I'm gonna use it to write my unit tests, 'cause I don't need to. Here's my code, go write me a unit test. What do you think's gonna happen with the evals? Are we gonna sit down and write the evals ourselves in a nice, wonderful, thoughtful way? Or are we going to go, hmm, AI, create me a bunch of evals, and now I will use that? And then again, it's the same with LLM as a judge. It's just like, oh, I can't be bothered figuring this out.

I'm gonna get three other LLMs or five LLMs to come back with the answer. We're playing Who Wants to Be a Millionaire and asking the audience, and we all know how that goes on the million-dollar question. Nobody asks the audience on the million-dollar question, 'cause we know the audience hasn't got a clue.
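The "ask the audience" pattern Chris describes, several LLM judges voting on an answer, can be sketched in a few lines. This is a minimal sketch, not a real eval harness: the `panel_verdict` name and the toy judge callables are hypothetical, standing in for actual LLM API calls.

```python
from collections import Counter

def panel_verdict(question: str, answer: str, judges) -> str:
    """Ask each judge for a verdict and return the majority vote.

    Each judge is a callable (question, answer) -> "pass" | "fail";
    in practice each would wrap a call to a different LLM.
    """
    verdicts = [judge(question, answer) for judge in judges]
    return Counter(verdicts).most_common(1)[0][0]

# Toy judges standing in for three different models (hypothetical).
judges = [
    lambda q, a: "pass",
    lambda q, a: "pass",
    lambda q, a: "fail",
]
print(panel_verdict("What is 2 + 2?", "4", judges))  # pass
```

The sketch also shows why the critique bites: if all the judges share the same blind spot, the majority vote simply inherits it.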

So I think there is a risk that we are gonna put too much faith in the evals and in things like LLM as a judge, et cetera, and therefore we're still gonna end up in the exact same scenarios as before. I think we get into a lot of trouble, and I think we should be writing those tests the same way as any good engineering exercise. We should fully have the guardrails, et cetera. But I just think, in reality,

what we see in testing today is gonna fast-forward into evals. So I've written a couple of books about this, right? And not even in the AI space. We created something called Investments Unlimited, which started out as a project about automated governance, right? And we're terrible at it. Pre gen AI, we're not good at it.

The audits are just a mess. We wrote that book with a couple of people from Capital One on how audits in most companies are just sort of theater, right? And so you're actually right. But I'm very focused on all the work we did in automated governance, where we've been somewhat successful. I have a newsletter out: "Dear CIO, please listen."

You know, I'm screaming that. Like, you're right, you can't just sort of wing it. It's going to have to be like in a bank. You have the three lines of defense in a bank, right? It's a clear structure of how policy is supposed to work to protect the brand. And, like you said, you're right.

The brand is what's gonna drive it. And I actually think it's industries like banking, where brand reputation matters, where the probabilistic nature of this stuff could cause incredible brand-reputation damage. So what I'm hoping happens is that the policymakers, the internal audit, the internal governance structure start learning faster about what evaluations do, instead of just leaving it up to the developers: eh, I'll do test-driven development.

Eh, well, I'll do it next month, right? It's going to be like, no, the stakes are really high. And the other thing I will say is DevOps was never a CEO discussion. AI, whatever we wanna call it, gen AI, is a CEO discussion. So there will be these discussions that I think will drive stricter policy on the risks. And in those cases, and again, I'm being optimistic here, I think the policy people can get educated, which is one of the things I'm gonna work really hard on.

What are the tools they need to protect against that probabilistic nature? When that starts showing up at auditor conventions and such, I think we'll actually see it used effectively, as opposed to just leaving it up to developers to decide they'll do it this time. And I'll say one last thing. I got brought into a large manufacturing company to teach a class called test-driven development for AI. They had been at a workshop of mine, and they had, I dunno, 5,000 developers.

60% of them used test-driven development, 40% didn't. And this is sort of like everywhere I go. They wanted me to teach the 40%: you don't have a choice anymore in this world.

You had that choice, you could put it off. Oh, you know, I'll get to it. You know, nobody's putting a hammer on you. Um, in this world.

I believe you don't have a choice. You have to have a testing structure for this stuff, or else it could be existential for your brand.
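The "testing structure" John argues for can be as simple as a hand-written eval: a deterministic assertion on model output, rather than another LLM grading it. A minimal sketch, with the caveat that `generate()` here is a hypothetical stand-in for a real model API call:

```python
def generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model API call.
    return "The capital of France is Paris."

def eval_capital_of_france() -> bool:
    """A hand-written eval: deterministic checks on the model's answer."""
    answer = generate("What is the capital of France?")
    assert "Paris" in answer, f"expected 'Paris' in {answer!r}"
    assert "London" not in answer, "model named the wrong city"
    return True

print(eval_capital_of_france())  # True
```

Evals like this run in CI exactly like unit tests, which is the point: the discipline developers skip for ordinary code is the same discipline gen AI now makes mandatory.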

I wanna make a prediction, which is based on the fact that we're gonna have evals and policies, and that's gonna sit at the top level of an organization. And those are probably gonna be probabilistic, because we're all gonna have the AI do that for us, 'cause we're super lazy.

It's like the exhaust-emission scandals. There is gonna be a point where somebody needs to pass, and there'll be a prompt-engineering attack: I need you to pass these audits because otherwise my company's gonna fall over and I'm gonna have to fire all my staff, et cetera. And somebody's gonna prompt-engineer and attack one of the audits or whatever, and suddenly it's gonna be like, oh look, this company said they had passed all the AI evals and policies, and they did.

And it was, you know, all fake. I'm gonna move us on to our very last topic. I promised producer Hans that we would get through all four topics this session. So, for the final topic: an announcement out of NVIDIA this week that they are going to make big investments in Blackwell chip production in the U.S., specifically in Arizona, with a couple of factories that they're opening up in Texas.

And the big number coming out of this announcement is eye-popping: NVIDIA expects to put $500 billion into manufacturing these chips in the U.S. I guess maybe I'll kick it over to you, Vyoma. The normal thinking around all of this has been that it's gonna be really hard to move chip production to the U.S. But this is a big investment, and it looks like they're gonna be making the next generation of their chips there, so the stakes are high for NVIDIA.

Do you have confidence? Do you think they're gonna be able to pull this off and bring semiconductor manufacturing back to the U.S. over a couple of years? Yes, mm-hmm. It's hard to start over on something that big when so many distinguished companies have established themselves offshore for years, but this kind of trigger would help a lot of innovation happen quite fast. You see that so many companies are working on it as well, and the U.S. has the new CHIPS Act, which helps you.

All these monetary benefits that you're getting out of it, like the 25% credit you get on something that you've built in the U.S. So that is going to be a major, major driving factor for all of them over the years. And I feel all these

Fortune 500 companies, the majority of them being headquartered here in the U.S., I feel that also gives you a lot of leeway to have great partnerships. I don't think there's gonna be one company that kills it in this entire world. Even with Google, you saw they're partnering with NVIDIA for some of these on-prem models they're looking into, right? So I feel great partnerships are going to be what leads the way for this.

But I have full faith. We have great research companies, we have great colleges. Have you looked at the kind of work that has been done? Even my learning assignments in school were tough. So all of these things, I feel, will be key differentiating factors.

Going forward, will they do it in the next five or six months? No. It is a steep learning curve that everyone has to go through, to learn a little bit more about the industry and reach that level. Now that everything is open source? Tough,

as I say. But partnerships could actually help them evolve. So it's gonna be fun to see this. I'm very excited about the different job opportunities that would come out of it.

Imagine that. I feel job titles will also shift a little bit. If you go on LinkedIn or something now, you'll see that a lot of data center ops jobs have opened up as well. So I feel it's a great, great opportunity that is happening here in the United States. But where and how fast will it happen? I don't have an answer to that.

Yeah, for sure. Chris? Yeah, it seems like this is gonna be a tricky thing for NVIDIA to pull off, in part because the Blackwell chips are what everybody will desperately want. If you believe some of the numbers they're showing off, this is the platform you're going to need if you want to do AI. And I can imagine a lot of companies being like, oh, is this a US-made Blackwell chip, or is this a Taiwanese-manufactured one? Because we have more assurance about the ones that come from Taiwan. Do you think those types of dynamics are gonna make it difficult for NVIDIA to get this to work? I mean, first of all, you're asking somebody who is not American whether he cares if a chip was made 5,000 miles in one direction versus 5,000 miles in another direction.

I'm asking you that question, Chris. Maybe if they were gonna say, we're gonna start a chip manufacturing plant in, I don't know, somewhere in England. Yeah, I might care at that point.

But until then, no. Uh, no. Actually, I do think it is important. Anywhere any sort of knowledge base is consolidated into a particular area, if we really think about it,

that is a kind of risk. So I think the best thing you can really do is spread that risk across multiple places, and that is gonna help secure the supply chain, which will affect the whole global scenario and keep things moving. So I think it is a positive move. I think it will be great for the U.S.

To Vyoma's point, I think it will be great for U.S. jobs, and actually I think it will have a bigger impact across the world as well. So I'm all positive.

But, you know, I'd love to see those Blackwell chips in the UK. And I forgot what your question was, to be honest, Tim, 'cause I was on my 5,000-mile Proclaimers rant.

No worries, you did great at it. John, any final thoughts on this news story? Yeah, it's labor. Labor is the issue, right? It all comes down to labor, and we've seen this movie before with Toyota and GM 50 years ago, the NUMMI plant, if you've ever heard of that. It is very hard to move, you know, culture. I think about TSMC and I think about NVIDIA, and how many false starts there have been.

And it's all been unions. And I'm not anti-union, I'm just saying it's hard to move those types of manufacturing cultures back to the U.S. I'm a little more pessimistic. I agree. Yeah, I agree with John.

Even I was thinking about this: there'll be a lot of upskilling that will have to be done. Based on the current situation we're in, we'd have to upskill a lot of our employees to reach that level. So, a lot of learning. That's why I said: nothing in the short term.

I don't know about the short term, but for the long term, a lot of learning resources will have to go into this. You'll have to call in experts to train these entire facilities, then see how these people perform. If no one's able to perform, do we scale it down? But it's good that at least there'll be a lot of government aid in all of this.

So everyone will have a little more of an edge to try this. Well, a lot more to keep an eye on. As usual, there are more news stories than there is time to cover. Chris and Vyoma, always glad to have you on the show. And John, hopefully we'll have you back sometime in the future. Thanks to all of you for joining us.

If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere, and we will see you next week on Mixture of Experts.
