Frontiers of AI and Computing: A Conversation With Yann LeCun and Bill Dally | NVIDIA GTC 2025

Please welcome Bill Dally and Yann LeCun. [Music] Hello, everybody. We're just going to   have a little chat about AI things.  Hopefully, you'll find it interesting.  So, Yann, there's been a lot of interesting  things going on in the last year in AI. What   has been the most exciting development  in your opinion over the past year?  Too many to count, but I'll tell you one  thing which may surprise a few of you.   I'm not so interested in LLMs anymore. They're  kind of the last thing. They are in the hands  

of industry product people, kind of improving at  the margin, trying to get more data, more compute,   generating synthetic data. I think there are more  interesting questions in four areas: how you get   machines to understand the physical world, how you  get them to have persistent memory, which not too   many people talk about, and then the last two are  how you get them to reason and plan. There is some   effort, of course, to get LLMs to reason, but in  my opinion, it's a very simplistic way of viewing   reasoning. I think there are probably better ways  of doing this. So, I'm excited about things that  

a lot of people in this community in the tech  community might get excited about five years from   now, but right now, they don't look so  exciting because they're some obscure   academic papers. But if it's not an LLM  that's reasoning about the physical world,   having persistent memory, and planning, what is  it? What is the underlying model going to be?  So, a lot of people are working on world models.  What is a world model? We all have world models in   our minds. This is what allows us to manipulate  thoughts essentially. We have a model of the   current world. You know that if I push on this  bottle here from the top, it's probably going  

to flip, but if I push on it at the bottom,  it's going to slide. If I press on it too hard,   it might pop. We have models of the physical world  that we acquire in the first few months of life,   and that's what allows us to deal with the  real world. It's much more difficult to deal   with the real world than to deal with language.  The type of architectures we need for systems   that can really deal with the real world  is completely different from the ones we   deal with at the moment. LMs predict tokens,  but tokens could be anything. Our autonomous  

vehicle model uses tokens from the sensors and it  produces tokens that drive. In some sense, it's   reasoning about the physical world, at least where  it's safe to drive and you won't run into poles.  Why aren't tokens the right way to represent  the physical world? Tokens are discrete. When   we talk about tokens, we generally talk about a  finite set of possibilities. In a typical LLM,  

the number of possible tokens is on the order of 100,000 or something like that. When you train a system to predict tokens, you can never train it to predict the exact token that's going to follow a sequence in text, for example. You can produce a probability distribution of all the possible tokens in your dictionary, which is just a long vector of 100,000 numbers between zero and one that sum to one. We know how to do this, but we don't know how to do this with video, with natural data that is high-dimensional and continuous.
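To make the contrast concrete, here is a minimal sketch (in PyTorch, with placeholder values) of what that token-level prediction looks like: one score per vocabulary entry, turned by a softmax into a distribution over roughly 100,000 discrete possibilities. Nothing analogous exists for continuous, high-dimensional video, which is the difficulty described next.

```python
import torch

# A language model's output head produces one score (logit) per token in the vocabulary.
vocab_size = 100_000
logits = torch.randn(vocab_size)           # placeholder scores; a real model computes these

# Softmax turns the scores into ~100,000 numbers between 0 and 1 that sum to 1.
probs = torch.softmax(logits, dim=-1)
print(probs.sum())                         # tensor(1.) up to rounding

# Generating text is then just sampling from this finite distribution.
next_token = torch.multinomial(probs, num_samples=1)

# For video there is no finite dictionary of "next frames" to normalize over,
# so this recipe does not carry over to continuous, high-dimensional signals.
```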

Every attempt at trying to get a system to understand the world or build mental models of the world by being trained to predict videos at a pixel level has basically failed. Even to train a system like a neural net of some kind to learn good representations of images, every technique that works by reconstructing an image from a corrupted or transformed version of it has failed. They kind of work, but they don't work as well as alternative architectures that we call joint embedding, which essentially don't attempt to reconstruct at the pixel level. They try to learn an abstract representation of the image or the video or the natural signal that is being trained on, so that you can make predictions in that abstract representation space.

The example I use very often is that if I take a video of this room and pan a camera and stop here and ask the system to predict the continuation of that video, it's probably going to predict it's a room and there are people sitting, blah, blah, blah. It can't predict what every single one of you looks like. That's completely unpredictable

from the initial segment of the video. There are a lot of things in the world that are just not predictable. If you train a system to predict at a pixel level, it spends all its resources trying to come up with details it just cannot invent. That's a complete waste of resources. Every attempt we've tried, and I've been working on this for 20 years, of training a system using self-supervised learning by predicting video doesn't work. It only works if you do it at a representation level. What that means is that those architectures are not generative.

So you're basically saying that a transformer doesn't have the capability? But people have vision transformers and they get good results. That's not what I'm saying, because you can use transformers for that. You can put

transformers in those architectures. It's just that the type of architecture I'm talking about is called a joint embedding predictive architecture. So, take a chunk of video or an image or whatever, run it through an encoder, and you get a representation; then take the continuation of that text or video, or a transformed version of the image, run it through an encoder as well, and now try to make a prediction in that representation space instead of making it in the input space. You can use the same training methodology where you fill in the blank, but you're doing it in this latent space rather than down in the raw input space.

The difficulty there is that if you're not careful and don't use smart techniques, the system will collapse. It will completely ignore the input and just produce a representation that is constant and not very informative about the input.
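As a rough illustration of the joint embedding predictive architecture just described, here is a minimal, hypothetical PyTorch sketch: two encoders, a predictor, and a loss computed between representations rather than pixels. The collapse-prevention term shown (a variance penalty plus a stop-gradient on the target branch) is just one of the known tricks, not necessarily the one used in the papers LeCun refers to.

```python
import torch
import torch.nn as nn

dim_in, dim_z = 3 * 64 * 64, 256                               # toy sizes for illustration
enc_x = nn.Sequential(nn.Flatten(), nn.Linear(dim_in, dim_z))  # encodes the observed chunk
enc_y = nn.Sequential(nn.Flatten(), nn.Linear(dim_in, dim_z))  # encodes the continuation/target
predictor = nn.Linear(dim_z, dim_z)

def jepa_loss(x, y):
    sx = enc_x(x)                          # representation of the observed input
    with torch.no_grad():
        sy = enc_y(y)                      # target representation (no gradient to the target branch)
    prediction_loss = ((predictor(sx) - sy) ** 2).mean()   # predict in representation space
    # Anti-collapse term: discourage embedding dimensions from becoming constant.
    variance_loss = torch.relu(1.0 - sx.std(dim=0)).mean()
    return prediction_loss + variance_loss

x = torch.randn(8, 3, 64, 64)              # a batch of image/video chunks
y = torch.randn(8, 3, 64, 64)              # their continuations or transformed versions
loss = jepa_loss(x, y)
loss.backward()
```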

Until five or six years ago, we didn't have any technique to prevent this from happening.

Now, if you want to use this for an agentic system or a system that can reason and plan, what you need is a predictor. When it observes a piece of video, it gets some idea of the current state of the world, and what it needs to do is predict what the next state of the world is going to be, given an action it is imagining taking. So, what you need is a predictor that, given the state of the world and an action you imagine, can predict the next state of the world.
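Here is a minimal sketch of how such a predictor could be used for planning, assuming a differentiable world model predictor(state, action) and a goal expressed in the same representation space (the names are hypothetical, not LeCun's actual system): the action sequence itself is optimized by gradient descent so that the imagined final state lands near the goal.

```python
import torch

def plan(predictor, state, goal, horizon=5, action_dim=4, steps=100, lr=0.1):
    """Optimize a sequence of actions so the predicted outcome approaches the goal."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        s = state
        for a in actions:
            s = predictor(s, a)            # imagine the next state of the world
        loss = ((s - goal) ** 2).sum()     # how far the imagined outcome is from the goal
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return actions.detach()                # execute the first action(s), then replan
```

The search happens in the learned representation space, not in token space, which is exactly the distinction drawn next.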

If you have such a system, then you can plan a sequence of actions to arrive at a particular outcome. That's the real way that all of us do planning and reasoning. We don't do it in token space.

Let me take a very simple example. There are a lot of so-called agentic reasoning systems today, and the way they work is that they generate lots and lots of sequences of tokens, using different ways of generating different tokens stochastically, and then there is a second neural net that tries to select the best sequence out of all the ones generated. It's sort of like writing a program without knowing how to write a program: you write random programs, test them all, and keep the one that actually gives you the right answer. It's completely hopeless.

Well, there are actually papers about super-optimization that suggest doing exactly that. For short programs, you can, of course,

because it goes exponentially with the length.  So, after a while, it's completely hopeless.  So, many people are saying that AGI or, I  guess you would call it AMI, is just around   the corner. What's your view? When do you think  it will be here, and why? What are the gaps?  I don't like the term AGI because people  use the term to designate systems that have   human-level intelligence, and the sad thing is  that human intelligence is super specialized. So,   calling this general is a misnomer. I prefer  the phrase AMI, which means advanced machine   intelligence. It's just vocabulary. I think the  concept I'm describing of systems that can learn  

abstract mental models of the world and  use them for reasoning and planning,   I think we're probably going to have a good  handle on getting this to work at least at   a small scale within three to five years. Then  it's going to be a matter of scaling them up,   etc., until we get to human-level AI. Here's the thing: historically in AI, there's   generation after generation of AI researchers  who have discovered a new paradigm and have   claimed that's it. Within 10 years, we're going  to have human-level intelligence. We're going  

to have machines that are smarter than humans in  all domains. That's been the case for 70 years,   and it's been those waves every 10 years or so.  The current wave is also wrong. The idea that you   just need to scale up LLMs or have them generate  thousands of sequences of tokens and select the   good ones to get to human-level intelligence,  and you're going to have, within a few years,   a country of geniuses in a data center, to quote  someone who will remain nameless, is nonsense.   It's complete nonsense. Sure, there are going to  be a lot of applications for which systems in the   near future are going to be PhD-level, if you  want, but in terms of overall intelligence, no,   we're still very far from it. When I say very  far, it might happen within a decade or so. It's not that far. AI has been applied  in many ways that have improved the human  

condition and made people's lives easier.  What application of AI do you see as being   the most compelling and advantageous? So, I mean, there are obvious things,   of course. I think the impact of AI on  science and medicine is probably going to   be a lot bigger than we can currently imagine,  even though it's already pretty big. Not just   in terms of research for things like protein  folding and drug design, but also in understanding   the mechanisms of life. And there are a lot of  short-term consequences. Very often now in the US,  

when you go through a medical imaging process,  there is AI involved. If it's a mammogram, it's   probably pre-screened with a deep learning system  to detect tumors. If you go to an MRI machine,   the time you have to spend in that MRI machine  is reduced by a factor of four or something   because we can now recover high-resolution  versions of MRI images with fewer data. So,   there are a lot of short-term consequences. Of course, every one of our cars, and NVIDIA  

is one of the big suppliers of this, now comes  out with at least a driving assistance system   or an automatic emergency braking system. These  are required in Europe now for a few years.   Those things reduce collisions by 40%. They save  lives. Those are enormous applications. Obviously,   this is not generative AI; this is perception, and  a little bit of control for cars now. There are a   lot of applications of LLMs as they exist today  or will exist within a few years in industry and   services, etc., but we have to think about the  limitations of this as well. It's very difficult,   more difficult than most people had thought,  to field and deploy systems with the level of   accuracy and reliability that is expected.  This has certainly been the case for autonomous   driving. It's been a receding horizon of when  we get level five autonomous driving. I think  

it's going to be the same thing. It's usually  where AI fails—not in the basic technique or   the flashy demos, but when you actually have to  deploy it, apply it, and make it reliable enough   to integrate with existing systems. That's where  it becomes difficult and expensive and takes more   time than expected. Certainly, in applications  like autonomous vehicles, where it has to be   right all the time, or someone could be injured  or killed, the level of accuracy has to be almost   perfect. But there are many applications where  if it just gets it right most of the time, it's   very beneficial. Even some medical applications  where a doctor is double-checking it, or certainly   entertainment and education, where you just want  to do more good than harm, and the consequences   of getting it wrong aren't disastrous. Absolutely. For most of those systems,  

the most useful ones are the ones that make  people more productive and more creative. A   coding assistant that assists them, for example.  It's true in medicine, it's true in art, and it's   true in producing text. AI is not replacing  people; it's giving them power tools. Well,   it might at some point, but I don't think people  will go for that. Our relationship with future  

AI systems, including superintelligence, is  that we're going to be their boss. We're going   to have a staff of super-intelligent virtual  people working for us. I don't know about you,   but I like working with people who are smarter  than me. It's the greatest thing in the world.  So, the flip side is that just as AI can  benefit humanity in many ways, it also has   a dark side where people will apply it to do  things like create deep fakes and false news,   causing emotional distress if applied  incorrectly. What are your biggest concerns   about the use of AI, and how do we mitigate those? One thing that Meta has been very familiar with is   using AI as a countermeasure against  attacks, whether they are from AI or not.  

One thing that may be surprising is that, despite  the availability of LLMs and various deep fakes   and such for a number of years now, our colleagues  who are in charge of detecting and taking down   this kind of attack are telling us that we're  not seeing a big increase in generative content   being posted on social networks, or at  least not in a nefarious way. Usually,   it's labeled as being synthetic. So, we're  not seeing all the catastrophic scenarios   that people were warning about three or four  years ago, where this was going to destroy   information and communication systems. There's an interesting story I need to tell   you. In the fall of 2022, my colleagues at Meta,  a small team, put together an LLM that was trained   on the entire scientific literature. All the  technical papers they could get their hands on.  

It was called Galactica, and they put it up with  a long paper that described how it was trained,   open-source code, and a demo system that  you could just play with. This was doused   with vitriol by the Twitter sphere. People  were saying, "Oh, this is horrible. This is   going to get us killed. It's going to destroy  the scientific communication system. Now any   idiot can write a scientific-sounding paper on the  benefits of eating crushed glass or something."  

There was such a tsunami of negative opinions that my poor colleagues, a small team of five people, couldn't sleep at night. They took down the demo, leaving the open-source code and the paper up. Our conclusion was that the world was not ready for this kind of technology, and nobody was interested.

Three weeks later, ChatGPT came out, and it was like the second coming of the Messiah. We looked at each other and said, "What just happened?" We couldn't understand the enthusiasm of the public for this, given the reaction to the previous one.

A lot of it is perception. GPT wasn't trying  to write a scholarly paper or do the science;   it was something you could converse with and  ask a question about anything, trying to be   more general. To some extent, it was more useful  to more people or more approximately useful.  There are dangers for sure, and there are misuses  of various types. But the countermeasure against   misuse is just better AI. There are unreliable  systems, as I was talking about before. The fix   for this is better AI systems that have  common sense, the capacity to reason,   and check whether the answers are correct, and  assess the reliability of their own answers,   which is not quite the case currently.  But the catastrophic scenarios, frankly,  

I don't believe in them. People adapt. I  like to think that AI is mostly for good,   even though there's a little bit of bad in there.  So, as somebody with homes on  both sides of the Atlantic,   you have a very global perspective. Where do  you see future innovation in AI coming from? 

It can come from anywhere. There are smart people  everywhere. Nobody has a monopoly on good ideas.   There are people who have a huge superiority  complex and think they can come up with all   the good ideas without talking to anyone. In  my experience as a scientist, it's not the   case. Good ideas come from the interaction of a  lot of people and the exchange of ideas. In the  

last decade or so, the exchange of code has  also become important. That's one reason why   I've been a strong advocate of open-source  AI platforms, and why Meta, in part,   has adopted that philosophy as well. We don't have  a monopoly on good ideas, as smart as we think we   are. The recent story about DeepSeek really  shows that good ideas can come from anywhere.  There are a lot of really good scientists in  China. One story that a lot of people should  

know is that if you ask yourself what is the paper  in all of science that has gathered the largest   number of citations over the last 10 years,  that paper was published in 2015, exactly 10   years ago. It was about a particular neural net  architecture called ResNet, or residual networks,   which came out of Microsoft Research in Beijing by  a bunch of Chinese scientists. The lead author was   Kaiming He. After a year, he joined FAIR at Meta  in California, spent about eight years there, and   recently moved to MIT. That tells you that there  are a lot of good scientists all over the world,   and ideas can come from anywhere. But to actually  put those ideas into practice, you need a big  

infrastructure, a lot of computation, and you  need to give a lot of money to your friends and   colleagues to buy the necessary resources.  Having an open intellectual community makes   progress go faster because someone comes up with  half the good idea over here, and someone else   says the other half. If they communicate,  it happens. If they're all very insular   and closed, progress just doesn't take place. The other thing is that for innovative ideas  

to emerge, as a chief scientist at NVIDIA, you  know this—you need to give people a long leash.   You need to let people really innovate and  not pressure them to produce something every   three months or every six months. That's pretty  much what happened with DeepSeek and LLaMA. One   story that is not widely known is that there were  several LLM projects at FAIR in 2022. One had a   lot of resources and support from leadership,  and another was a small pirate project by a   dozen people in Paris who decided to build their  own LLM because they needed it for some reason.   That became LLaMA, and the big project you  never heard of was stopped. So, you can come   up with good ideas even if you don't have all the  support. If you are somewhat insulated from your  

management and they leave you alone, you can come  up with better ideas than if you are supposed to   innovate on a schedule. A dozen people produced  LLaMA, and then a decision was made to pick this   as the platform. A team was built around it to  produce LLaMA 2, which was eventually open-sourced   and caused a bit of a revolution in the landscape.  As of yesterday, there have been over one billion   downloads of LLaMA. I find this astonishing. I  assume that includes a lot of you, but who are all   those people? I mean, you must know them because  they all must buy NVIDIA hardware to run those   things. We thank you for selling all those GPUs. Let's talk a little bit more about open source.   I think LLaMA has been really innovative in that  it's a state-of-the-art LLM that's offered with   open weights, so people can download and run  it themselves. What are the pros and cons of  

that? The company is obviously investing enormous  amounts of money in developing the model, training   it, and fine-tuning it, and then giving it away.  What is good about that, and what is the downside?  Well, I think there is a downside. If  you are a company that expects to make   revenue directly from that service, if that's your  only business, then it may not be advantageous for   you to reveal all your secrets. But if you are  a company like Meta or Google, where the revenue   comes from other sources—advertising in the case  of Meta, and various sources for Google—what   matters is not how much revenue you can generate  in the short term but whether you can build the   functionalities needed for the product you  want to build and get the largest number of   smart people in the world to contribute to it. For  Meta, it doesn't hurt if some other company uses  

LLaMA for some other purpose because they don't  have a social network that they can build on top   of this. It's much more of a threat for Google  because you can build search engines with that,   which is probably why they are a little  less positive about this kind of approach.  The other thing we've seen the effect of,  first with PyTorch and now with LLaMA,   is that they have jump-started the entire  ecosystem of startups. We see this in   larger industry now, where people sometimes  prototype an AI system with a proprietary API,   but when it comes time to deploy it, the most  cost-effective way of doing it is to do it on   LLaMA because you can run it on-premise or on  some other open-source platform. Philosophically,  

I think the biggest factor, the most important  reason to want to have open-source platforms,   is that in a short time, every single one of  our interactions with the digital world will be   mediated by AI systems. I'm wearing the Ray-Ban  Meta smart glasses right now, and I can talk to   Meta AI through them and ask it any question. We don't believe that people are going to want   a single assistant, and that those assistants are  going to come from a handful of companies on the   west coast of the US or China. We need assistants  that are extremely diverse. They need to speak all   the world's languages, understand all the world's  cultures, all the value systems, and all the   centers of interest. They need to have different  biases, political opinions, and so on. We need a   diversity of assistants for the same reason that  we need a diverse press. Otherwise, we'll all have  

the same information from the same sources, and  that's not good for democracy or anything else.   We need a platform that anybody can use to  build those diverse assistants. Right now,   that can only be done through open-source  platforms. I think it's going to be even   more important in the future because if we want  foundation models to speak all the languages in   the world and everything, no single entity  is going to be able to do this by itself.   Who is going to collect all the data in all the  languages in the world and just hand it over to   OpenAI, Meta, Google, or Anthropic? No one. They  want to keep that data. Regions in the world are   going to want to contribute their data to a global  foundation model but not actually give out that   data. They might contribute to training a global  model. I think that's the model of the future.  

Foundation models will be open source and trained in a distributed fashion, with various data centers around the world having access to different subsets of data, basically training a consensus model. That's what makes open-source platforms completely inevitable, and proprietary platforms, I think, are going to disappear.

And it also makes sense both for the diversity of languages and things but also for applications. A given company can download LLaMA and then fine-tune it on proprietary data that they wouldn't want to upload. That's what's happening now. Most AI startups' business models are built around this.
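Both the "consensus model" idea above and the fine-tune-on-premise pattern come down to keeping data local and sharing only model weights. As a purely illustrative, hedged sketch (not a description of how any existing foundation model is actually trained), federated averaging captures the flavor: each region trains on data it never shares, and only the resulting weights are averaged into the shared model.

```python
import copy
import torch

def federated_round(global_model, regional_datasets, local_train):
    """One round of federated averaging; `local_train` is a region's own training loop."""
    regional_weights = []
    for dataset in regional_datasets:
        local_model = copy.deepcopy(global_model)
        local_train(local_model, dataset)          # the data never leaves the region
        regional_weights.append(local_model.state_dict())
    # Average the regional weights into the shared "consensus" model.
    averaged = {
        name: torch.stack([w[name].float() for w in regional_weights]).mean(dim=0)
        for name in regional_weights[0]
    }
    global_model.load_state_dict(averaged)
    return global_model
```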

They build specialized systems for vertical applications. In Jensen's keynote, he had this great example of using a generative LLM to do wedding planning, to decide who was going to sit around the table. And that was a great example of the trade-off between putting effort into training and putting effort into inference. So, in one case, you can have a very powerful model that you spend

an enormous number of resources on training, or  you can build a less powerful model but run it   many passes so it can reason and do it. What  do you see as the trade-offs between training   time and inference or test time in building  a powerful model? Where is the optimum point?  First of all, I think Jensen is absolutely right  that you ultimately get more power in a system   that can reason. I disagree with the fact that the  proper way to do reasoning is the way current LLMs   that have reasoning abilities are doing it. It  works, but it's not the right way. When we reason,   when we think, we do this in some sort of abstract  mental state that has nothing to do with language.  

You don't want to be kicking tokens around;  you want to be reasoning in your latent space,   not in token space. If I tell you to imagine a  cube floating in front of you and then rotate   that cube by 90 degrees around a vertical axis,  you can do this mentally, and it has nothing to do   with language. A cat could do this, and we can't  specify the problem to a cat through language,   but cats do things that are much more complex  than this when they plan trajectories to jump on   a piece of furniture. They do things that are much  more complex than that, and it's not related to   language. It's certainly not done in token space,  which would be a sequence of actions. It's done in  

an abstract mental space. That's the challenge of the next few years: figuring out new architectures that allow this type of reasoning. That's what I've been working on for the last few years.

Is there a new model we should be expecting that allows us to do reasoning in this abstract space? It's called JEPA, or JEPA world models. My colleagues and I have put out a bunch of papers on this, kind of the first steps towards this, over the last few years. JEPA stands for joint

embedding predictive architecture. These  are world models that learn abstract   representations and are capable of manipulating  those representations and perhaps reasoning and   producing sequences of actions to arrive at a  particular goal. I think that's the future. I   wrote a long paper about this that explains  how this might work, about three years ago.  To run those models, you're going to need great  hardware. Over the last decade, the capabilities  

of GPUs have increased on the order of 5,000 to 10,000 times, both in training and inference for AI models, from Kepler to Blackwell. We've seen today that even more is coming. Scale-out and scale-up have provided even additional capabilities. In your opinion, what is coming down the road? What sort of things do you expect will enable us to build your JEPA models and other more powerful models?

Well, keep them coming, because we're going to need all the computation we can get. This kind of reasoning in abstract space is going to be computationally expensive at runtime, and it connects with something we're all very familiar with. Psychologists talk about System 1 and System 2. System 1 is tasks that you can

accomplish without really thinking about  them. They've become second nature, and you   can accomplish them without thinking too much.  For example, if you are an experienced driver,   you can drive even without driving assistance,  and you can drive while talking to someone.   But if you drive for the first time or the  first few hours, you have to really focus on   what you're doing. You're planning all kinds of  catastrophe scenarios and stuff like that. That's   System 2. You're recruiting your entire world  model to figure out what's going to happen and   then plan actions so that good things happen. Whereas, when you're familiar with a task,  

you can just use System 1, a sort of reactive system that allows you to accomplish the task without planning. This kind of deliberate reasoning is System 2, and the automatic, subconscious, reactive policy is System 1. Current systems are trying to inch their way towards System 2, but ultimately, I think we need a different architecture for System 2. I don't think it's going to be a generative architecture if you want a system to understand the physical world. The physical world is much more difficult to understand than language. We think of language as the epitome of human intellectual capabilities, but in fact, language is simple because it's discrete. It's discrete because it's a communication mechanism and needs to be discrete

to be noise-resistant. Otherwise, you wouldn't be  able to understand what I'm saying right now. So,   it's simple for that reason. But the  real world is just much more complicated.  Here's something you may have heard me say in  the past: current LLMs are trained typically   with something like 30 trillion tokens. Tokens  are typically about three bytes, so that's  

0.9 × 10^14 bytes, let's say 10^14 bytes. That would take any of us over 400,000 years to read through, because that's the totality of all the text available on the internet. Now, psychologists tell us that a 4-year-old has been awake a total of about 16,000 hours, and we have about 2 megabytes per second going to our visual cortex through the optic nerve. Multiply 2 megabytes per second by 16,000 hours times 3,600 seconds, and it's about 10^14 bytes in four years through vision. In four years, a child has seen as much data through vision as the text that would take you 400,000 years to read.
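A quick back-of-the-envelope check of those figures (using the rough numbers quoted above):

```python
tokens = 30e12                 # ~30 trillion training tokens
bytes_per_token = 3            # roughly 3 bytes per token
text_bytes = tokens * bytes_per_token                  # ≈ 0.9e14 bytes

hours_awake = 16_000           # a 4-year-old's total waking hours
bytes_per_second = 2e6         # ~2 MB/s through the optic nerve
vision_bytes = hours_awake * 3600 * bytes_per_second   # ≈ 1.15e14 bytes

print(f"text:   {text_bytes:.2e} bytes")
print(f"vision: {vision_bytes:.2e} bytes")
# Both land around 10^14 bytes: four years of vision carries roughly as much
# raw data as all the text a person could read in 400,000 years.
```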

That tells you we're never going to get to AGI, whatever you mean by this, by just training from text. It's just not happening.

Going back to hardware, there's been a lot of progress on spiking systems, and people who advocate this and look at analogies to how biological systems work suggest that neuromorphic hardware has a role. Do you see any place where neuromorphic hardware could either complement or replace GPUs in doing AI?

Not anytime soon. [Laughs] Okay, I have to tell you a story about this. When I started at Bell Labs in 1988, the group I was in was actually focused on analog hardware for neural nets. They built a bunch of generations of completely analog neural nets, then mixed analog-digital, and then completely digital towards the mid-90s. That's when people kind of lost interest in neural nets,

so there was no point anymore. The problem with exotic underlying principles like this is that current digital semiconductor technology is in such a deep local minimum that it's going to take a long time, and an enormous amount of investment, before alternative technologies can catch up. It's not even clear that, at the level of principle, there is any advantage to them.

Things like analog or spiking neurons or spiking neural nets might have some intrinsic advantage, except that they make hardware reuse very difficult. Every piece of hardware we use at the moment is too big and too fast in a sense, so you have to essentially reuse the same piece of hardware to compute different parts of your model. If you use analog hardware, you can't

use multiplexing. You have to have one physical  neuron per neuron in your virtual neural net. That   means you can't fit a decent-sized neural net on  a single chip. You have to do multi-chip, which is   going to be incredibly fast once you're able to do  this, but it's not going to be efficient because   you need to do cross-chip communication, and  memory becomes complicated. In the end, you need   to communicate digitally because that's the only  way to do it efficiently for noise resistance.  In fact, the brain provides an interesting piece  of information. Most brains, or the brains of most  

animals, communicate through spikes. Spikes are  binary signals, so it is digital, not analog.   The computation at the level of the neuron may be  analog, but the communication between neurons is   actually digital, except for tiny animals.  For example, C. elegans, a 1 mm long worm,   has 302 neurons. They don't spike because they  don't need to communicate far away, so they can  

use analog communication at that scale. This  tells you that even if we want to use exotic   technology like analog computation, we're going  to have to use digital communication somehow.   If nothing else, for memory. It's not clear, and  I've gone through this calculation multiple times.   I probably know much less about this than you  do, but I don't see it happening anytime soon.  There might be some corners of edge computation  where this makes sense. For example, if you want   a super cheap microcontroller that is going to  run a perception system for your vacuum cleaner   or lawn mower, maybe computation makes sense.  If you can fit the whole thing in a single  

chip and use maybe phase change memory  or something like this to store weights,   and I know some people are seriously  building these things. These are what   people call PIM (Processor in Memory) or analog  and digital processor and memory technologies. Do   you see a role for them? Are they promising? Absolutely. Some of my colleagues are very   interested in this because they want to build  successors to those smart glasses. What you   want is some visual processing taking place all  the time. Right now, it's not possible because   of power consumption. Just a sensor like an image  sensor can't be left on all the time in a pair of  

glasses like this; you'd run down the battery in minutes. One potential solution is to have processing on the sensor directly, so you don't have to shuffle the data out of the chip, which is what costs energy. Shuffling data is what costs energy, not the computation itself. There's quite a bit of work on this, but we're not there yet. I see this as a promising direction. In fact, biology

has figured this out. The retina has on the order  of 60 million photoreceptors, and in front of our   retina, we have four layers of neurons—transparent  neurons—that process the signal to squeeze it down   to 1 million optic nerve fibers to go to our  visual cortex. There is compression, feature   extraction, and all kinds of stuff to get the  most useful information out of the visual system.  What about other emerging technologies? Do you  see quantum or superconducting logic or anything   else on the horizon that's going to give us a  great step forward in AI processing capability?  Superconducting, perhaps. I don't know enough  about this to really tell. Optical has been   very disappointing. I remember being totally  amazed by talks about optical implementations   of neural nets back in the 1980s, and they  never panned out. Technology is evolving,  

so maybe things may change. For quantum, I'm  extremely skeptical of quantum computing. I   think the only medium-term application of quantum  computing that I see is for simulating quantum   systems, like quantum chemistry or something.  For anything else, I'm extremely skeptical.  You've talked about building AI that can learn  from observation, like a baby animal. What  

kind of demands do you see that putting on the  hardware, and how do you think we need to grow the   hardware to enable that? How much can you give us? It's a question of how much you're willing to buy.   The more you buy, the more you save, as we heard  today. It's not going to be cheap. For example,   video. Let me tell you about an experiment that  some of my colleagues did until about a year   ago. There was a technique for self-supervised  learning to learn image representations using   reconstruction. The project was called MAE, or  Masked Autoencoder. It's basically an autoencoder,   a denoising autoencoder, very much like what's  used. You take an image, corrupt it by removing  

some pieces of it—a big chunk of it, actually—and  train a gigantic neural net to reconstruct the   full image at a pixel level or at the token  level. Then you use the internal representation   as input to a downstream task that you train  supervised, like object recognition or whatever.  It works okay, but you have to boil a small pond  to cool down those liquid-cooled GPU clusters to   be able to do this. It doesn't work nearly as  well as those joint embedding architectures.   You may have heard of DINO, DINO V2, JAPA,  etc. These are joint embedding architectures,   and they tend to work better and actually  be cheaper to train. In joint embedding,   you basically have two latent spaces for the two  input classes. Instead of converting everything  

into one kind of token, you take the full image and the corrupted or transformed version, run them both through encoders, and then try to link those embeddings: you predict the representation of the full image from the representation of the partially visible or corrupted one. This works better and is cheaper.

Okay, so the team said, "This seems to work okay for images, let's try to do this for video."
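Before the video version, here is a rough side-by-side of the two image objectives just contrasted, with placeholder modules rather than the real MAE or DINO models: the masked autoencoder pays a reconstruction loss in pixel space, while the joint embedding version pays its loss between representations.

```python
import torch
import torch.nn as nn

dim_px, dim_z = 3 * 32 * 32, 128                   # toy sizes
encoder = nn.Linear(dim_px, dim_z)                 # context encoder (placeholder)
decoder = nn.Linear(dim_z, dim_px)                 # pixel decoder for the MAE-style branch
target_encoder = nn.Linear(dim_px, dim_z)          # target encoder for the joint-embedding branch
predictor = nn.Linear(dim_z, dim_z)

images = torch.randn(8, dim_px)                    # a batch of flattened images
mask = (torch.rand_like(images) > 0.75).float()    # keep ~25% of the pixels
corrupted = images * mask

# MAE-style: reconstruct the full image at the pixel level.
reconstruction_loss = ((decoder(encoder(corrupted)) - images) ** 2).mean()

# Joint-embedding style: predict the representation of the full image instead.
latent_loss = ((predictor(encoder(corrupted)) - target_encoder(images)) ** 2).mean()
```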

So now you have to tokenize a video, basically turning a video into 16 by 16 patches, and that's a lot of patches for even a short video. Then you train a gigantic neural net to reconstruct the patches that are missing in a video, maybe predict a future video. This required boiling a small lake, not just a small pond, and it was basically a failure. That project was stopped.

The alternative we have now is a project called V-JEPA, and we're getting close to version two. It's one of those joint embedding predictive architectures. So, it does prediction on video but at the representation level, and it seems to work really well. We have an example of this. The

first version of this is trained on very short  videos, just 16 frames, and it's trained to   predict the representation of the full video  from a version of a partially masked one. That   system apparently is able to tell you whether a  particular video is physically possible or not,   at least in restricted cases.  It gives you a binary output:   "This is feasible," "This is not," or maybe it's  simpler than that. You measure the prediction   error that the system produces. You take a  sliding window of those 16 frames on a video   and look at whether you can predict the next  few frames. You measure the prediction error,  

and when something really strange happens in the video—like an object disappearing, changing shape, spontaneously appearing, or not obeying physics—it flags it as unusual. You train on natural videos and then test on synthetic videos where something really weird happens. If you trained it on videos where really weird things happen, that would become normal and it wouldn't detect those as odd, so you don't do that. It corresponds a bit to how baby humans learn intuitive physics.
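A hedged sketch of that "surprise" signal, assuming pretrained V-JEPA-style encoder and predictor modules (the names and shapes here are hypothetical): slide a 16-frame window over the video, predict the representation of the following frames, and flag windows where the prediction error spikes.

```python
import torch

def surprise_scores(frames, encoder, predictor, window=16, horizon=2):
    """Prediction error of the next `horizon` frames' representation, per position."""
    scores = []
    for t in range(frames.shape[0] - window - horizon + 1):
        context = frames[t : t + window]                       # 16-frame context clip
        future = frames[t + window : t + window + horizon]     # frames to predict
        with torch.no_grad():
            z_future = encoder(future.unsqueeze(0))            # target representation
            z_pred = predictor(encoder(context.unsqueeze(0)))  # predicted representation
        scores.append(((z_pred - z_future) ** 2).mean().item())
    return scores   # a spike suggests something physically implausible just happened
```

Thresholding those scores against their running average is one simple way to turn them into the "plausible / not plausible" signal described above.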

The fact that an object that is not supported falls—basically the effect of gravity—babies learn this around the age of nine months. If you show a five- or six-month-old baby a scenario where an object appears to float in the air, they're not surprised. But by nine or ten months, they look at it with huge eyes, and you can actually measure that. Psychologists have ways of measuring attention, and what that

means is that the internal world model of the  infant is being violated. The baby is seeing   something that she doesn't think is possible,  and it doesn't match her expectations. So,   she has to look at it to correct her internal  model and say, "Maybe I should learn about this."  You've talked about reasoning and planning in  this joint embedding space. What do we need   to get there? What are the bottlenecks both  on the model side and on the hardware side?  A lot of it is just making it work. We need a  good recipe. Before people came up with a good  

recipe to train even simple convolutional nets, it  was very difficult. Back in the late 2000s, Geoff   Hinton was telling everyone it was very difficult  to train deep networks with backpropagation.   Yann LeCun could do it with ConvNets, but he  was the only one in the world who could do it,   which was true at the time but not entirely  accurate. It turns out it's not that difficult,  

but there are a lot of tricks you have to figure out—engineering tricks, intuitive tricks, which nonlinearity to use, the idea of ResNet, which is the most cited paper in all of science in the last 10 years. It's a very simple idea: you just have connections that skip every layer, so by default a block in a deep neural net computes the identity function, and what it learns is a deviation from that. This allowed us to keep from losing the gradient going backwards and to train neural nets with 100 layers or more.

Before people came up with a good recipe with all those residual connections, Adam optimizers, and normalization, nothing really worked.
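The residual idea is small enough to show directly; here is a minimal, generic sketch (not the original ResNet code): each block outputs its input plus a learned correction, so an untrained block starts out close to the identity and gradients keep flowing through the skip path.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)        # skip connection: learn a deviation from the identity

# Stacking many such blocks is what made networks with 100+ layers trainable.
deep_net = nn.Sequential(*[ResidualBlock(256) for _ in range(100)])
out = deep_net(torch.randn(4, 256))
```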

We just had a paper showing that you don't need normalization in transformers, and things like that. Before you had this complete recipe and all the tricks, nothing worked. The same was true with NLP, natural language processing systems. In the mid-2010s, there were systems based on denoising autoencoders like BERT, where you take a piece of text, corrupt it, and train a big neural net to recover the words that are missing. Eventually, that was wiped out by the GPT-style architecture, where you just train the system to predict the next token. You train it like an autoencoder, but you don't need to corrupt the input because the architecture is causal. It turned out to be incredibly successful and scalable.

We have to come up with a good recipe for those JEPA architectures that will scale to the same extent. That's what's missing.

Well, we have a flashing red light ahead of

us. Are there any final thoughts you'd like  to leave the audience with before we adjourn?  Yeah, I want to reinforce the point I was making  earlier. The progress of AI and the progress   towards human-level AI, advanced machine  intelligence, or AGI—whatever you want to   call it—is going to require contributions from  everyone. It's not going to come from a single   entity somewhere that does R&D in secret. That's  just not happening. It's not going to be an event;   it's going to be a lot of successive progress  along the way. Humanity is not going to get   killed within an hour of this happening  because it's not going to be an event. It's  

going to require contributions from basically  everywhere around the world. It's going to have   to be open research and based on open-source  platforms. If they require a lot of training,   we're going to need cheaper hardware. You're  going to need to lower your prices. [Laughs]   You have to take that up with Jensen. We'll have a future with a high diversity  

of AI assistants that are going  to help us in our daily lives,   be with us all the time through maybe our smart  glasses or other smart devices, and we're going   to be their boss. They're going to be working for  us. It's going to be like all of us are going to   be managers. [Laughs] That's a terrible future. Well, on that note, I think I'd like to thank you   for just a really intellectually  stimulating conversation, and I   hope we get a chance to do this again. All right, yeah. Thanks. Yeah, thanks.
