Please welcome Bill Dally and Yann LeCun. Hello, everybody. We're just going to have a little chat about AI things. Hopefully, you'll find it interesting. So, Yann, there have been a lot of interesting things going on in AI over the past year. What has been the most exciting development, in your opinion? Too many to count, but I'll tell you one thing which may surprise a few of you: I'm not so interested in LLMs anymore. They're kind of the last thing; they are in the hands
of industry product people, kind of improving at the margin, trying to get more data, more compute, generating synthetic data. I think there are more interesting questions in four areas: how you get machines to understand the physical world, how you get them to have persistent memory, which not too many people talk about, and then the last two are how you get them to reason and plan. There is some effort, of course, to get LLMs to reason, but in my opinion, it's a very simplistic way of viewing reasoning. I think there are probably better ways of doing this. So, I'm excited about things that
a lot of people in the tech community might get excited about five years from now, but right now they don't look so exciting because they only exist in a few obscure academic papers. But if it's not an LLM that's reasoning about the physical world, having persistent memory, and planning, what is it? What is the underlying model going to be? So, a lot of people are working on world models. What is a world model? We all have world models in our minds; this is what allows us to manipulate thoughts, essentially. We have a model of the current world. You know that if I push on this bottle here from the top, it's probably going
to flip, but if I push on it at the bottom, it's going to slide, and if I press on it too hard, it might pop. We have models of the physical world that we acquire in the first few months of life, and that's what allows us to deal with the real world. It's much more difficult to deal with the real world than to deal with language. The type of architecture we need for systems that can really deal with the real world is completely different from the ones we work with at the moment. LLMs predict tokens, but tokens could be anything. Our autonomous
vehicle model takes in tokens from the sensors and produces tokens that drive the car. In some sense, it's reasoning about the physical world, at least about where it's safe to drive so you won't run into poles. Why aren't tokens the right way to represent the physical world? Tokens are discrete. When we talk about tokens, we generally talk about a finite set of possibilities. In a typical LLM,
the number of possible tokens is on the order of 100,000 or so. When you train a system to predict tokens, you can never train it to predict the exact token that's going to follow a sequence of text. What you can produce is a probability distribution over all the possible tokens in your dictionary, which is just a long vector of 100,000 numbers between zero and one that sum to one. We know how to do this, but we don't know how to do it with video, with natural data that is high-dimensional and continuous.
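To make that concrete, here is a minimal Python sketch of what "a distribution over a finite dictionary" means in practice. The vocabulary size matches the figure above, but the random logits are just a stand-in for a real model's output.

```python
import numpy as np

# A language model's final layer produces one score (logit) per entry in a
# finite vocabulary, here ~100,000 tokens as in the discussion above.
VOCAB_SIZE = 100_000
rng = np.random.default_rng(0)
logits = rng.normal(size=VOCAB_SIZE)   # stand-in for a real model's output scores

# Softmax turns the scores into a probability distribution:
# 100,000 numbers between 0 and 1 that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
assert abs(probs.sum() - 1.0) < 1e-6

# Training minimizes the negative log-probability of the token that actually
# came next; generation samples (or argmaxes) from this distribution.
next_token = rng.choice(VOCAB_SIZE, p=probs)

# The analogous object for video would be a distribution over all possible
# next frames, a continuous and astronomically high-dimensional space, which
# is why the same recipe does not transfer directly.
```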
Every attempt to get a system to understand the world or build mental models of the world by training it to predict videos at the pixel level has basically failed. Even just to train a neural net to learn good representations of images, every technique that works by reconstructing an image from a corrupted or transformed version of it has more or less failed: they kind of work, but not as well as the alternative architectures we call joint embedding, which essentially don't attempt to reconstruct at the pixel level. They try to learn an abstract representation of the image, video, or natural signal they are trained on, so that you can make predictions in that abstract representation space. The example I use very often is that if I take a video of this room, pan the camera, stop here, and ask the system to predict the continuation of that video, it will probably predict that it's a room with people sitting in it, and so on. It can't predict what every single one of you looks like. That's completely unpredictable
from the initial segment of the video. There are a lot of things in the world that are just not predictable, so if you train a system to predict at the pixel level, it spends all its resources trying to come up with details it just cannot invent. That's a complete waste of resources. Every attempt we've made over the 20 years I've been working on this to train a system with self-supervised learning by predicting video has not worked. It only works if you do the prediction at a representation level, which means those architectures are not generative. So you're basically saying that a transformer doesn't have the capability to do this? But people have vision transformers and they get good results. That's not what I'm saying, because you can use transformers for that. You can put
transformers inside those architectures; it's just that the overall architecture I'm talking about is called a joint embedding predictive architecture. Take a chunk of video or an image or whatever, run it through an encoder to get a representation; then take the continuation of that video or text, or a transformed version of the image, and run it through an encoder as well. Now try to make the prediction in that representation space instead of in the input space. You can use the same fill-in-the-blank training methodology, but you're doing it in this latent space rather than in the raw input space. The difficulty is that if you're not careful and don't use smart techniques, the system collapses: it completely ignores the input and just produces a representation that is constant and not very informative about the input.
Until five or six years ago, we didn't have any technique to prevent this from happening. Now, if you want to use this for an agentic system, a system that can reason and plan, what you need is a predictor. When the system observes a piece of video, it gets some idea of the current state of the world, and what it needs to do is predict what the next state of the world will be, given an action it is imagining taking. If you have such a predictor, which, given the state of the world and an imagined action, can predict the next state of the world, then you can plan a sequence of actions to arrive at a particular outcome.
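As a rough illustration of the joint embedding predictive idea described above, here is a minimal PyTorch sketch: encode two views, predict the target representation from the context representation, and compute the loss in representation space rather than pixel space. The stop-gradient on the target branch is a crude stand-in for the collapse-prevention techniques mentioned (real systems typically use an exponential-moving-average target encoder and other tricks), and all module sizes here are invented for illustration.

```python
import torch
import torch.nn as nn

class TinyJEPA(nn.Module):
    """Toy joint embedding predictive architecture: the loss lives in
    representation space, not pixel space."""
    def __init__(self, in_dim=784, emb_dim=128):
        super().__init__()
        self.context_encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                             nn.Linear(256, emb_dim))
        self.target_encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                            nn.Linear(256, emb_dim))
        self.predictor = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(),
                                       nn.Linear(256, emb_dim))

    def forward(self, x_context, x_target):
        s_x = self.context_encoder(x_context)      # representation of the visible/partial view
        with torch.no_grad():                      # stop-gradient: crude collapse guard
            s_y = self.target_encoder(x_target)    # representation of the full/other view
        s_y_hat = self.predictor(s_x)              # prediction made in latent space
        return nn.functional.mse_loss(s_y_hat, s_y)  # no pixel-level reconstruction

model = TinyJEPA()
x_context = torch.randn(32, 784)   # e.g., a masked or partial view
x_target = torch.randn(32, 784)    # the full or future view
loss = model(x_context, x_target)
loss.backward()
```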
Planning with a world model like that is really how all of us do planning and reasoning; we don't do it in token space. Let me take a very simple example. There are a lot of so-called agentic reasoning systems today, and the way they work is that they generate lots and lots of sequences of tokens, using different ways of generating tokens stochastically, and then a second neural net tries to select the best sequence out of all the ones generated. It's a bit like writing a program without knowing how to write a program: you write random programs, test them all, and keep the one that actually gives you the right answer. It's completely hopeless. Well, there are actually papers about super-optimization that suggest doing exactly that. For short programs, you can do that, of course.
But because the search grows exponentially with program length, it becomes completely hopeless after a while. So, many people are saying that AGI, or I guess you would call it AMI, is just around the corner. What's your view? When do you think it will be here, and why? What are the gaps? I don't like the term AGI, because people use it to designate systems that have human-level intelligence, and the sad thing is that human intelligence is super specialized, so calling it general is a misnomer. I prefer the phrase AMI, for advanced machine intelligence. But it's just vocabulary. I think the concept I'm describing, of systems that can learn
abstract mental models of the world and use them for reasoning and planning, is something we'll probably have a good handle on, at least at a small scale, within three to five years. Then it's going to be a matter of scaling them up until we get to human-level AI. But here's the thing: historically, generation after generation of AI researchers has discovered a new paradigm and claimed, that's it, within 10 years we're going to have human-level intelligence, machines that are smarter than humans in all domains. That's been the case for 70 years, with a wave like that every 10 years or so, and the current wave is also wrong. The idea that you just need to scale up LLMs, or have them generate thousands of token sequences and select the good ones, to get to human-level intelligence, and that within a few years you'll have a country of geniuses in a data center, to quote someone who will remain nameless, is nonsense. It's complete nonsense. Sure, there are going to be a lot of applications for which near-future systems will be PhD-level, if you want, but in terms of overall intelligence, no, we're still very far from it. When I say very far, it might happen within a decade or so; it's not that far. AI has been applied in many ways that have improved the human
condition and made people's lives easier. What application of AI do you see as being the most compelling and advantageous? So, I mean, there are obvious things, of course. I think the impact of AI on science and medicine is probably going to be a lot bigger than we can currently imagine, even though it's already pretty big. Not just in terms of research for things like protein folding and drug design, but also in understanding the mechanisms of life. And there are a lot of short-term consequences. Very often now in the US,
when you go through a medical imaging process, there is AI involved. If it's a mammogram, it's probably pre-screened by a deep learning system to detect tumors. If you go into an MRI machine, the time you have to spend in it is reduced by a factor of four or so, because we can now recover high-resolution MRI images from less data. So, there are a lot of short-term consequences. Of course, every one of our cars, and NVIDIA
is one of the big suppliers of this, now comes with at least a driving-assistance system or an automatic emergency braking system. These have been required in Europe for a few years now, and they reduce collisions by 40%. They save lives. Those are enormous applications. Obviously, this is not generative AI; it's perception, and a little bit of control, for cars. There are a lot of applications of LLMs as they exist today, or as they will exist within a few years, in industry and services, but we have to think about the limitations as well. It's very difficult, more difficult than most people thought, to field and deploy systems with the level of accuracy and reliability that is expected. This has certainly been the case for autonomous driving: level-five autonomy has been a receding horizon. I think
it's going to be the same thing. That's usually where AI fails: not in the basic technique or the flashy demos, but when you actually have to deploy it, apply it, and make it reliable enough to integrate with existing systems. That's where it becomes difficult and expensive and takes more time than expected. Certainly, in applications like autonomous vehicles, where it has to be right all the time or someone could be injured or killed, the level of accuracy has to be almost perfect. But there are many applications where getting it right most of the time is still very beneficial: some medical applications where a doctor is double-checking the result, or certainly entertainment and education, where you just want to do more good than harm and the consequences of getting it wrong aren't disastrous. Absolutely. For most of those systems,
the most useful ones are the ones that make people more productive and more creative, like a coding assistant. It's true in medicine, it's true in art, and it's true in producing text. AI is not replacing people; it's giving them power tools. Well, it might at some point, but I don't think people will go for that. Our relationship with future
AI systems, including superintelligence, is that we're going to be their boss. We're going to have a staff of super-intelligent virtual people working for us. I don't know about you, but I like working with people who are smarter than me. It's the greatest thing in the world. So, the flip side is that just as AI can benefit humanity in many ways, it also has a dark side where people will apply it to do things like create deep fakes and false news, causing emotional distress if applied incorrectly. What are your biggest concerns about the use of AI, and how do we mitigate those? One thing that Meta has been very familiar with is using AI as a countermeasure against attacks, whether they are from AI or not.
One thing that may be surprising is that, despite the availability of LLMs and various deep fakes and such for a number of years now, our colleagues who are in charge of detecting and taking down this kind of attack are telling us that we're not seeing a big increase in generative content being posted on social networks, or at least not in a nefarious way. Usually, it's labeled as being synthetic. So, we're not seeing all the catastrophic scenarios that people were warning about three or four years ago, where this was going to destroy information and communication systems. There's an interesting story I need to tell you. In the fall of 2022, my colleagues at Meta, a small team, put together an LLM that was trained on the entire scientific literature. All the technical papers they could get their hands on.
It was called Galactica, and they put it up with a long paper that described how it was trained, open-source code, and a demo system that you could just play with. It was doused with vitriol by the Twittersphere. People were saying, "Oh, this is horrible. This is going to get us killed. It's going to destroy the scientific communication system. Now any idiot can write a scientific-sounding paper on the benefits of eating crushed glass or something."
There was such a tsunami of negative opinion that my poor colleagues, a small team of five people, couldn't sleep at night. They left the open-source code and the paper up, but they took down the demo. Our conclusion was that the world was not ready for this kind of technology and nobody was interested. Three weeks later, ChatGPT came out, and it was like the second coming of the Messiah. We looked at each other and said, "What just happened?" We couldn't understand the public's enthusiasm for it, given the reaction to the previous one.
A lot of it is perception. ChatGPT wasn't trying to write a scholarly paper or do the science; it was something you could converse with and ask questions about anything, trying to be more general. To some extent, it was more useful, or at least more approachable, to more people. There are dangers for sure, and there are misuses of various types, but the countermeasure against misuse is just better AI. There are unreliable systems, as I was saying before; the fix for this is better AI systems that have common sense, the capacity to reason, the ability to check whether their answers are correct, and the ability to assess the reliability of their own answers, which is not quite the case currently. But the catastrophic scenarios, frankly,
I don't believe in them. People adapt. I like to think that AI is mostly for good, even though there's a little bit of bad in there. So, as somebody with homes on both sides of the Atlantic, you have a very global perspective. Where do you see future innovation in AI coming from?
It can come from anywhere. There are smart people everywhere. Nobody has a monopoly on good ideas. There are people who have a huge superiority complex and think they can come up with all the good ideas without talking to anyone. In my experience as a scientist, it's not the case. Good ideas come from the interaction of a lot of people and the exchange of ideas. In the
last decade or so, the exchange of code has also become important. That's one reason why I've been a strong advocate of open-source AI platforms, and why Meta, in part, has adopted that philosophy as well. We don't have a monopoly on good ideas, as smart as we think we are. The recent story about DeepSeek really shows that good ideas can come from anywhere. There are a lot of really good scientists in China. One story that a lot of people should
know is this: if you ask yourself which paper in all of science has gathered the largest number of citations over the last 10 years, that paper was published in 2015, exactly 10 years ago. It described a particular neural net architecture called ResNet, or residual networks, and it came out of Microsoft Research in Beijing, written by a group of Chinese scientists. The lead author, Kaiming He, joined FAIR at Meta in California a year later, spent about eight years there, and recently moved to MIT. That tells you there are a lot of good scientists all over the world, and ideas can come from anywhere. But to actually put those ideas into practice, you need a big
infrastructure, a lot of computation, and you need to give a lot of money to your friends and colleagues to buy the necessary resources. Having an open intellectual community makes progress go faster, because someone comes up with half of a good idea over here and someone else has the other half. If they communicate, it happens; if they're all insular and closed, progress just doesn't take place. The other thing is that for innovative ideas
to emerge, and you know this as chief scientist at NVIDIA, you need to give people a long leash. You need to let people really innovate and not pressure them to produce something every three or six months. That's pretty much what happened with DeepSeek and LLaMA. One story that is not widely known is that there were several LLM projects at FAIR in 2022. One had a lot of resources and support from leadership, and another was a small pirate project by a dozen people in Paris who decided to build their own LLM because they needed it for some reason. That became LLaMA, and the big project you never heard of was stopped. So, you can come up with good ideas even if you don't have all the support. If you are somewhat insulated from your
management and they leave you alone, you can come up with better ideas than if you are supposed to innovate on a schedule. A dozen people produced LLaMA, and then a decision was made to pick this as the platform. A team was built around it to produce LLaMA 2, which was eventually open-sourced and caused a bit of a revolution in the landscape. As of yesterday, there have been over one billion downloads of LLaMA. I find this astonishing. I assume that includes a lot of you, but who are all those people? I mean, you must know them because they all must buy NVIDIA hardware to run those things. We thank you for selling all those GPUs. Let's talk a little bit more about open source. I think LLaMA has been really innovative in that it's a state-of-the-art LLM that's offered with open weights, so people can download and run it themselves. What are the pros and cons of
that? The company is obviously investing enormous amounts of money in developing the model, training it, and fine-tuning it, and then giving it away. What is good about that, and what is the downside? Well, I think there is a downside. If you are a company that expects to make revenue directly from that service, and that's your only business, then it may not be advantageous for you to reveal all your secrets. But if you are a company like Meta or Google, where the revenue comes from other sources (advertising in the case of Meta, various sources for Google), what matters is not how much revenue you can generate in the short term but whether you can build the functionality needed for the products you want to build, and get the largest number of smart people in the world to contribute. For Meta, it doesn't hurt if some other company uses
LLaMA for some other purpose, because that company doesn't have a social network to build on top of it. It's much more of a threat for Google, because you can build search engines with it, which is probably why they are a little less positive about this kind of approach. The other thing we've seen, first with PyTorch and now with LLaMA, is that open platforms jump-start an entire ecosystem of startups. We see it across the broader industry now: people sometimes prototype an AI system with a proprietary API, but when it comes time to deploy it, the most cost-effective way is often to do it on LLaMA, because you can run it on-premise, or on some other open-source platform. Philosophically,
I think the biggest factor, the most important reason to want open-source platforms, is that in a short time every single one of our interactions with the digital world will be mediated by AI systems. I'm wearing Ray-Ban Meta smart glasses right now; I can talk to Meta AI through them and ask it any question. We don't believe that people are going to want a single assistant, or that those assistants should all come from a handful of companies on the west coast of the US or from China. We need assistants that are extremely diverse: they need to speak all the world's languages, understand all the world's cultures, all the value systems, and all the centers of interest. They need to have different biases, political opinions, and so on. We need a diversity of assistants for the same reason we need a diverse press. Otherwise, we'll all have
the same information from the same sources, and that's not good for democracy or anything else. We need a platform that anybody can use to build those diverse assistants, and right now that can only be done with open-source platforms. I think it's going to be even more important in the future, because if we want foundation models to speak all the languages in the world, no single entity is going to be able to do this by itself. Who is going to collect all the data in all the languages of the world and just hand it over to OpenAI, Meta, Google, or Anthropic? No one; they all want to keep their data. Regions of the world are going to want to contribute their data to a global foundation model without actually handing that data over. They might contribute to training a global model. I think that's the model of the future.
Foundation models will be open source and trained in a distributed fashion, with data centers around the world having access to different subsets of data, essentially training a consensus model. That's what makes open-source platforms inevitable, and proprietary platforms, I think, are going to disappear. And it makes sense not only for the diversity of languages but also for applications: a given company can download LLaMA and fine-tune it on proprietary data that it wouldn't want to upload. That's what's happening now; most AI startups' business models are built around this.
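The transcript doesn't specify how such a distributed, consensus-model training would work; as one hedged illustration, here is a federated-averaging-style sketch in which each region trains on data it never shares and only model weights are exchanged and averaged. The model, data, and number of rounds are toy assumptions, not a description of any real system.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical sketch: each region trains on data it keeps private, and only
# the resulting weights are merged into a shared "consensus" model.
def local_update(global_model, local_data, steps=10, lr=1e-2):
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in local_data[:steps]:
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    return model.state_dict()

def average_weights(state_dicts):
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

global_model = nn.Linear(16, 4)   # tiny stand-in for a foundation model
regions = [[(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]
           for _ in range(3)]     # three regions, each with private data
for _ in range(5):                # a few global rounds of train-locally-then-average
    local_weights = [local_update(global_model, data) for data in regions]
    global_model.load_state_dict(average_weights(local_weights))
```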
Those startups build specialized systems for vertical applications. In Jensen's keynote, he had this great example of using a generative LLM to do wedding planning, to decide who was going to sit around the table. And that was a great example of the trade-off between putting effort into training and putting effort into inference. So, in one case, you can have a very powerful model that you spend
an enormous amount of resources training, or you can build a less powerful model and run it for many inference passes so it can reason its way to the answer. What do you see as the trade-off between training time and inference or test time in building a powerful model? Where is the optimum? First of all, I think Jensen is absolutely right that you ultimately get more power from a system that can reason. But I disagree that the proper way to do reasoning is the way current LLMs with reasoning abilities do it. It works, but it's not the right way. When we reason, when we think, we do it in some sort of abstract mental state that has nothing to do with language.
You don't want to be kicking tokens around; you want to be reasoning in your latent space, not in token space. If I tell you to imagine a cube floating in front of you and then rotate that cube by 90 degrees around a vertical axis, you can do this mentally, and it has nothing to do with language. A cat could do it, and we can't specify the problem to a cat through language, yet cats do far more complex things than this when they plan trajectories to jump onto a piece of furniture. None of that is related to language, and it's certainly not done in token space, in the form of a sequence of discrete actions. It's done in
an abstract mental space. That's the challenge of the next few years: figuring out new architectures that allow this type of reasoning. It's what I've been working on for the last few years. Is there a new model we should be expecting that allows us to do reasoning in this abstract space? It's called JEPA, and we talk about JEPA world models. My colleagues and I have put out a bunch of papers on this over the last few years, kind of the first steps in that direction. JEPA stands for joint
embedding predictive architecture. These are world models that learn abstract representations and are capable of manipulating those representations, and perhaps of reasoning and producing sequences of actions to arrive at a particular goal. I think that's the future. I wrote a long paper about three years ago that explains how this might work.
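Here is a minimal sketch of planning with an action-conditioned predictor of the kind described: roll a latent state forward under candidate action sequences and keep the sequence whose predicted outcome lands closest to a goal state. The random-shooting search and all dimensions are illustrative assumptions, not the method of any particular JEPA paper, and the predictor here is untrained.

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon, num_candidates = 64, 8, 5, 256

# Learned (here untrained) predictor: next_state = f(state, action).
predictor = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                          nn.Linear(128, state_dim))

def rollout(state, actions):
    """Imagine the consequences of an action sequence entirely in latent space."""
    for t in range(actions.shape[1]):
        state = predictor(torch.cat([state, actions[:, t]], dim=-1))
    return state

def plan(current_state, goal_state):
    """Random-shooting planner: sample action sequences and keep the one whose
    predicted final state lands closest to the goal."""
    actions = torch.randn(num_candidates, horizon, action_dim)
    final = rollout(current_state.expand(num_candidates, -1), actions)
    cost = ((final - goal_state) ** 2).sum(dim=-1)
    return actions[cost.argmin()]           # best action sequence found

current = torch.randn(1, state_dim)          # abstract state from an encoder
goal = torch.randn(1, state_dim)             # desired abstract state
best_actions = plan(current, goal)
print(best_actions.shape)                    # (horizon, action_dim)
```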
To run those models, you're going to need great hardware. Over the last decade, the capabilities of GPUs have increased by a factor of five to ten thousand, in both training and inference for AI models, from Kepler to Blackwell, and we've seen today that even more is coming. Scale-out and scale-up have provided even more capability on top of that. In your opinion, what is coming down the road? What sort of things do you expect will enable us to build your JEPA models and other, more powerful models? Well, keep them coming, because we're going to need all the computation we can get. This kind of reasoning in abstract space is going to be computationally expensive at runtime, and it connects with something we're all very familiar with. Psychologists talk about System 1 and System 2. System 1 covers the tasks you can
accomplish without really thinking about them; they've become second nature. For example, if you are an experienced driver, you can drive without driving assistance, even while talking to someone. But if you're driving for the first time, or for the first few hours, you have to really focus on what you're doing, imagining all kinds of catastrophe scenarios. That's System 2: you're recruiting your entire world model to figure out what's going to happen and then planning actions so that good things happen. Whereas, when you're familiar with a task,
you can just use System 1, a sort of reactive system that lets you accomplish the task without planning. This kind of deliberate reasoning is System 2, and the automatic, subconscious, reactive policy is System 1. Current systems are trying to inch their way towards System 2, but ultimately I think we need a different architecture for it. I don't think it's going to be a generative architecture if you want a system that understands the physical world. The physical world is much more difficult to understand than language. We think of language as the epitome of human intellectual capability, but in fact language is simple because it's discrete. It's discrete because it's a communication mechanism and needs to be discrete
to be noise-resistant. Otherwise, you wouldn't be able to understand what I'm saying right now. So, it's simple for that reason. But the real world is just much more complicated. Here's something you may have heard me say in the past: current LLMs are trained typically with something like 30 trillion tokens. Tokens are typically about three bytes, so that's
about 0.9 × 10^14 bytes; let's call it 10^14 bytes. It would take any of us over 400,000 years to read through that, and it's essentially the totality of the text available on the internet. Now, psychologists tell us that a four-year-old has been awake a total of about 16,000 hours, and roughly 2 megabytes per second go to our visual cortex through the optic nerve. Multiply 2 megabytes per second by 3,600 seconds per hour and by 16,000 hours, and you get about 10^14 bytes in four years through vision. A four-year-old has seen as much data through vision as the text that would take you over 400,000 years to read.
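A quick check of the back-of-the-envelope numbers quoted here:

```python
# Back-of-envelope check of the figures above.
text_bytes = 30e12 * 3                     # 30 trillion tokens * ~3 bytes/token
vision_bytes = 2e6 * 3600 * 16_000         # 2 MB/s * 3,600 s/hour * 16,000 hours awake

print(f"text:   {text_bytes:.2e} bytes")   # ~0.9e14, i.e. about 10^14
print(f"vision: {vision_bytes:.2e} bytes") # ~1.15e14, also about 10^14
```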
That tells you we're never going to get to AGI, whatever you mean by that, just by training on text. It's just not happening. Going back to hardware, there's been a lot of progress on spiking systems, and people who advocate for them and look at analogies with how biological systems work suggest that neuromorphic hardware has a role. Do you see any place where neuromorphic hardware could either complement or replace GPUs for AI? Not anytime soon. [Laughs] Okay, I have to tell you a story about this. When I started at Bell Labs in 1988, the group I was in was actually focused on analog hardware for neural nets. They built several generations of chips: completely analog neural nets, then mixed analog-digital, and then completely digital towards the mid-90s. That's when people kind of lost interest in neural nets,
so there was no point anymore. The problem with exotic underlying principles like this is that current digital semiconductors sit in such a deep local minimum that it's going to take a while, and an enormous amount of investment, before alternative technologies can catch up. And it's not even clear that, in principle, there is any advantage. Things like analog or spiking neurons and spiking neural nets might have some intrinsic advantage, except that they make hardware reuse very difficult. Every piece of hardware we use at the moment is, in a sense, too big and too fast, so you have to reuse the same piece of hardware to compute different parts of your model. If you use analog hardware, you can't
use multiplexing. You have to have one physical neuron per neuron in your virtual neural net. That means you can't fit a decent-sized neural net on a single chip. You have to do multi-chip, which is going to be incredibly fast once you're able to do this, but it's not going to be efficient because you need to do cross-chip communication, and memory becomes complicated. In the end, you need to communicate digitally because that's the only way to do it efficiently for noise resistance. In fact, the brain provides an interesting piece of information. Most brains, or the brains of most
animals, communicate through spikes. Spikes are binary signals, so it is digital, not analog. The computation at the level of the neuron may be analog, but the communication between neurons is actually digital, except for tiny animals. For example, C. elegans, a 1 mm long worm, has 302 neurons. They don't spike because they don't need to communicate far away, so they can
use analog communication at that scale. This tells you that even if we want to use exotic technology like analog computation, we're going to have to use digital communication somehow, if nothing else for memory. So it's not clear there is an advantage, and I've gone through this calculation multiple times. I probably know much less about it than you do, but I don't see it happening anytime soon. There might be some corners of edge computing where it makes sense: if you want a super cheap microcontroller that runs the perception system for your vacuum cleaner or lawn mower, maybe analog computation makes sense, if you can fit the whole thing on a single
chip and use, say, phase-change memory or something like it to store the weights. And I know some people are seriously building these things. These are what people call PIM, processing-in-memory, whether analog or digital. Do you see a role for them? Are they promising? Absolutely. Some of my colleagues are very interested in this because they want to build successors to those smart glasses. What you want is some visual processing taking place all the time, and right now that's not possible because of power consumption. Even just an image sensor can't be left on all the time in a pair of
glasses like this; you would run down the battery in minutes. One potential solution is to do the processing directly on the sensor, so you don't have to shuffle the data off the chip, because shuffling data is what costs energy, not the computation itself. There's quite a bit of work on this, but we're not there yet. I see it as a promising direction. In fact, biology
has figured this out. The retina has on the order of 60 million photoreceptors, and in front of our retina we have four layers of neurons, transparent neurons, that process the signal and squeeze it down to the 1 million optic nerve fibers that go to our visual cortex. There is compression, feature extraction, and all kinds of processing to get the most useful information out of the visual signal. What about other emerging technologies? Do you see quantum or superconducting logic or anything else on the horizon that's going to give us a great step forward in AI processing capability? Superconducting, perhaps; I don't know enough about it to really tell. Optical has been very disappointing. I remember being totally amazed by talks about optical implementations of neural nets back in the 1980s, and they never panned out. Technology is evolving,
so maybe things will change. As for quantum, I'm extremely skeptical. The only medium-term application of quantum computing I see is simulating quantum systems, quantum chemistry for example. For anything else, I'm extremely skeptical. You've talked about building AI that can learn from observation, like a baby animal. What
kind of demands do you see that putting on the hardware, and how do you think we need to grow the hardware to enable it? How much can you give us? It's a question of how much you're willing to buy. The more you buy, the more you save, as we heard today. It's not going to be cheap. Take video, for example. Let me tell you about an experiment some of my colleagues ran until about a year ago. There was a technique for self-supervised learning of image representations using reconstruction. The project was called MAE, for Masked Autoencoder. It's basically a denoising autoencoder, very much like the BERT-style systems used for text: you take an image, corrupt it by removing
some pieces of it, a big chunk of it actually, and train a gigantic neural net to reconstruct the full image at the pixel level, or at the token level. Then you use the internal representation as input to a downstream task that you train with supervision, like object recognition. It works okay, but you have to boil a small pond to cool down the liquid-cooled GPU clusters to do it, and it doesn't work nearly as well as the joint embedding architectures. You may have heard of DINO, DINOv2, I-JEPA, and so on. These are joint embedding architectures, and they tend to work better and actually be cheaper to train. In a joint embedding architecture, you basically have two branches, one for each input. Instead of converting everything
into one kind of token, you take the full image and the corrupted or transformed version, run them both through encoders, and then link the two embeddings: you train the system to predict the representation of the full image from the representation of the partially visible or corrupted one. This works better and is cheaper.
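For contrast with the embedding-space loss sketched earlier, here is a minimal toy version of the reconstruction-based recipe described above (corrupt an image, reconstruct it at the pixel level). It illustrates the objective only; the dimensions and network are invented and this is not the actual MAE implementation.

```python
import torch
import torch.nn as nn

img_dim = 64 * 64 * 3                             # a flattened toy "image"

# Toy stand-in for the reconstruction-based recipe: corrupt the input by
# masking a large chunk, then train an autoencoder to rebuild every pixel.
autoencoder = nn.Sequential(
    nn.Linear(img_dim, 512), nn.ReLU(),           # encoder
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, img_dim),                      # decoder back to pixel space
)

images = torch.randn(16, img_dim)
mask = (torch.rand_like(images) > 0.75).float()   # keep roughly 25% of the pixels
corrupted = images * mask

recon = autoencoder(corrupted)
# The loss is computed in pixel space: the model is asked to invent every
# missing detail, which is the waste of capacity the speaker objects to.
# The joint embedding alternative computes the loss between embeddings instead.
loss = nn.functional.mse_loss(recon, images)
loss.backward()
```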
Okay, so the team said, "This seems to work okay for images, let's try it for video." Now you have to tokenize the video, basically turning it into 16-by-16 patches, and that's a lot of patches for even a short clip. Then you train a gigantic neural net to reconstruct the patches that are missing, maybe to predict a future video. That required boiling a small lake, not just a small pond, and it was basically a failure, so the project was stopped. The alternative we have now is a project called V-JEPA, and we're getting close to version two. It's one of those joint embedding predictive architectures: it does prediction on video, but at the representation level, and it seems to work really well. We have an example of this. The
first version is trained on very short videos, just 16 frames, and it's trained to predict the representation of the full video from a partially masked version. That system, it turns out, is able to tell you whether a particular video is physically possible or not, at least in restricted cases. It gives you a binary output, feasible or not, or maybe it's even simpler than that: you measure the prediction error the system produces. You slide a window of those 16 frames along a video, try to predict the next few frames, and measure the prediction error,
and when something really strange happens in the video, like an object disappearing, changing shape, spontaneously appearing, or otherwise not obeying physics, the error spikes and the system flags it as unusual. It's trained on natural videos and then tested on synthetic videos where something really weird happens. Of course, if you trained it on videos where really weird things happen, those would become normal and it wouldn't detect them as odd. So, you don't do that.
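Here is a minimal sketch of that sliding-window idea: run a predictor over 16-frame windows of frame representations, measure the prediction error, and flag moments where the error spikes. The untrained toy predictor and the simple threshold are assumptions for illustration; V-JEPA's actual evaluation protocol differs in detail.

```python
import torch
import torch.nn as nn

emb_dim, window, n_frames = 128, 16, 200

# Frame-level representations of a video (in practice, from a video encoder).
frames = torch.randn(n_frames, emb_dim)

# Stand-in predictor: given a window of frame representations, predict the next one.
predictor = nn.Sequential(nn.Linear(window * emb_dim, 256), nn.ReLU(),
                          nn.Linear(256, emb_dim))

errors = []
with torch.no_grad():
    for t in range(n_frames - window):
        context = frames[t:t + window].reshape(-1)   # sliding 16-frame window
        pred = predictor(context)
        errors.append(nn.functional.mse_loss(pred, frames[t + window]).item())

errors = torch.tensor(errors)
# Flag moments where the prediction error spikes well above its typical level:
# objects vanishing, changing shape, or otherwise violating intuitive physics.
threshold = errors.mean() + 3 * errors.std()
implausible_frames = (errors > threshold).nonzero().flatten() + window
print(implausible_frames)
```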
It corresponds a bit to how human babies learn intuitive physics. The fact that an object that is not supported falls, basically the effect of gravity, is something babies learn around the age of nine months. If you show a five- or six-month-old a scenario where an object appears to float in the air, they're not surprised. But by nine or ten months, they look at it with huge eyes, and you can actually measure that; psychologists have ways of measuring attention, and what that
means is that the internal world model of the infant is being violated. The baby is seeing something that she doesn't think is possible, and it doesn't match her expectations. So, she has to look at it to correct her internal model and say, "Maybe I should learn about this." You've talked about reasoning and planning in this joint embedding space. What do we need to get there? What are the bottlenecks both on the model side and on the hardware side? A lot of it is just making it work. We need a good recipe. Before people came up with a good
recipe to train even simple convolutional nets, it was very difficult. Back in the late 2000s, Geoff Hinton was telling everyone it was very difficult to train deep networks with backpropagation, that Yann LeCun could do it with ConvNets but was the only one in the world who could, which was more or less true at the time. It turns out it's not that difficult,
but there are a lot of tricks you have to figure out: engineering tricks, intuition, which nonlinearity to use, and ideas like ResNet, the most cited paper in all of science over the last 10 years. It's a very simple idea: you add connections that skip every layer, so that by default a layer in a deep neural net computes something close to the identity function, and what the network learns is a deviation from that. This keeps the gradient from vanishing as it propagates backwards, and it lets you train neural nets with 100 layers or more. Before people came up with a good recipe, with residual connections, the Adam optimizer, normalization, and so on, nothing really worked.
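A minimal residual block, to make the skip-connection idea concrete; the dimensions and depth here are arbitrary.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x): by default the block behaves like the identity, and the
    network only has to learn the deviation f(x). The skip path also gives
    the gradient a direct route backwards through very deep stacks."""
    def __init__(self, dim=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)          # the residual (skip) connection

# A 100-layer stack that would be very hard to train without the skips.
deep_net = nn.Sequential(*[ResidualBlock() for _ in range(100)])
x = torch.randn(4, 256)
y = deep_net(x)
y.sum().backward()                    # gradients flow back through all 100 blocks
```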
We just had a paper showing that you don't need normalization in transformers, and there are more results like that. Before you had the complete recipe and all the tricks, nothing worked. The same was true in natural language processing. In the mid-2010s, there were systems based on denoising autoencoders, like BERT, where you take a piece of text, corrupt it, and train a big neural net to recover the missing words. Eventually, that was wiped out by the GPT-style architecture, where you train on the entire sequence: you train it like an autoencoder, but you don't need to corrupt the input because the architecture is causal, so each token is predicted only from the ones before it. It turned out to be incredibly successful and scalable. We have to come up with a good recipe for those JEPA architectures that will scale to the same extent. That's what's missing. Well, we have a flashing red light ahead of
us. Are there any final thoughts you'd like to leave the audience with before we adjourn? Yes. I want to reinforce the point I was making earlier: progress towards human-level AI, advanced machine intelligence, AGI, whatever you want to call it, is going to require contributions from everyone. It's not going to come from a single entity somewhere doing R&D in secret; that's just not happening. And it's not going to be an event; it's going to be a lot of successive progress along the way. Humanity is not going to get killed within an hour of it happening, because it's not going to be an event. It's
going to require contributions from basically everywhere around the world. It's going to have to be open research, based on open-source platforms. And if those models require a lot of training, we're going to need cheaper hardware. You're going to need to lower your prices. [Laughs] You'll have to take that up with Jensen. We'll have a future with a high diversity
of AI assistants that are going to help us in our daily lives, be with us all the time through maybe our smart glasses or other smart devices, and we're going to be their boss. They're going to be working for us. It's going to be like all of us are going to be managers. [Laughs] That's a terrible future. Well, on that note, I think I'd like to thank you for just a really intellectually stimulating conversation, and I hope we get a chance to do this again. All right, yeah. Thanks. Yeah, thanks.