Mobileye CEO and CTO reveal stealth developments in AI for achieving full autonomy
Hello, everyone, and welcome to Mobileye Driving AI Day 2024 here in Jerusalem. I'm Nimrod, Executive Vice President of Strategy and Business Development at Mobileye, and it's my pleasure to host this day for the first time. This day centers around recent AI advancements in the industry in general and in our company in particular. It will be a series of presentations by some of our technology leaders, covering the various aspects of AI that play a fundamental role across our product portfolio. We'll start the day off with a presentation by our President and CEO, Professor Amnon Shashua, giving a bird's eye view of recent AI advancements in the industry and in our company.
He will be followed by a presentation by our CTO, Professor Shai Shalev-Shwartz, who will go into the details of how we leverage recent AI advancements in our products. So without further ado, it is my pleasure to invite our CEO, Professor Amnon Shashua, for his presentation. Hey, hello everyone. So you know, we have been spending quite a while integrating the latest modern AI into our stack. And what we will do is, Shai and I, we kind of split the presentation into two parts.
I'll give a bird's eye view, kind of setting the stage, so it's not only what we are doing, but how it's couched within the global environment. And then Shai will go into some of our really deep innovations in this area, which we will be showing for the first time; really exciting, mind-blowing innovations. So let's start with setting the stage.
So, you know, the story of Mobileye is solving autonomy. This is what we have been doing for almost a decade now, and the goal is to reach full self-driving. We put this in quotation marks because there are lots of claims of full self-driving out there, but we mean an eyes-off system, where legally the person can let go of the driving, and in the robotaxi case there is no driver at all. So we want to reach full self-driving, but since this is a marathon, you want to maintain a sustainable business; you want to generate income and earnings as you move forward. For example, in terms of Mobileye's expenses over the past many years, we are spending around $600 million a year just on building autonomy. If we decided to focus only on driving assist, we could cut $600 million a year from our operating expenses.
So this is really a game for marathon runners, and since we're talking of such big investments, you cannot just rely on outside investments. You need to be able to generate earnings. And this requires kind of sharp thinking of how you make those big investments while generating a business. This is very important.
So if we look at kind of the landscape in terms of how to solve autonomy, I think one can look at three prototype approaches. I put here the names of competitors, but you know, these are kind of prototype approaches. The first one is a lidar-centric approach.
You kind of simplify the problem by putting on very powerful 3D sensors, not sensors that cost hundreds of dollars but sensors that cost many thousands of dollars, and many of them around the car. So you get very precise 3D data, and this simplifies the problem enormously. Now, in terms of the columns, the second column is the AI approach. CAIS is a Compound AI system, and over this presentation and Shai's presentation, we will go into the details of what a Compound AI system is. And we believe that Waymo has a kind of engineered system; they very likely have lots and lots of AI components inside it.
So let's call what they have a Compound AI system. Let's go to the positive side: the last column, the Mean Time Between Failures. What separates a full self-driving system from just a level two driving-assist system is the level of Mean Time Between Failures. It means how many hours you need to drive before there is a critical intervention. And you know, it's somewhere between, I would say, 50,000 hours and 10,000,000 hours, depending on whom you speak with.
So we're talking about a very, very high bar to pass. With Waymo, we believe they pass that bar. They're providing a service with no driver. So there's a positive check on the MTBF.
In terms of cost, you know, it's a very expensive sensor set, so maybe it's good enough for a robotaxi, because there is a return on investment in replacing the driver with CapEx. But for a consumer vehicle, where cost is king, such a system cannot cost tens of thousands of dollars when you're talking about a car that you want to buy. So this is why it's a negative. Modularity: by modularity I mean, can you take your system and distill it into lower-cost systems for driving assist? It doesn't have to be only driving assist; it could also be eyes-off, let's say eyes-off only for highways, which simplifies the system. Or if it's not eyes-off, it's eyes-on but hands-free, which simplifies the system.
If it's not hands-free, but just an advanced driving-assist system, that simplifies the system further. And that way you create modularity and you create a business. So in this case, the Waymo approach doesn't allow for modularity. Geographic scalability is how easily you can scale geographically. We don't know, and this is why we put it as a question mark. They need to build a high-definition map more or less manually from city to city, but apparently they are doing this quite efficiently.
So this is why we put a question mark there. The next prototype approach is Tesla. Here it is camera only, and I'm making a distinction between camera only and camera-centric. Mobileye, for example, is camera-centric. We are doubling down on cameras, but we are open to introducing additional sensors: we are developing our imaging radars, and in our eyes-off systems we are also putting in a lidar, a front-facing lidar. So we are open to adding sensors, but the technology is centered around cameras.
So we are camera-centric. Tesla is camera only. The AI approach is end-to-end, and as we go on in this presentation, we will clarify what this means.
In terms of cost it's very good. Cameras are very low cost, compute is low cost. Modularity is also very good. They can distill it into any type of downgraded system that they like.
The geographic scalability: very good. There is no limitation on geographic scalability. MTBF: a question mark. Can this system reach the MTBF that is required for an eyes-off system? It's a question mark. I'll talk about this a bit more.
Mobileye is camera-centric. The AI approach is also a Compound AI system, which we will elaborate on over the next hour, hour and a half. Cost is very good: just cameras and radars, which are also very low cost, and for eyes-off, a front-facing lidar at a very low cost, a few hundred dollars.
So the cost is good. Modularity is very good. This is how we build our business. We have a product portfolio going from driving assist up to a robotaxi. Geographic scalability is based on our REM™ technology, which is crowd-sourced, so we have immediate geographic scalability. The MTBF is also a question mark.
We haven't shown yet that we can reach, you know, the hundreds of thousands or millions of hours of driving between events. We haven't shown that. We believe we can do that, but we haven't shown that. So this is also a question mark.
And what we want to elaborate on through this deep dive is which is more likely to succeed. So there are two question marks here, right? Except for Waymo, which is lidar-centric, nobody has reached the MTBF so far that is required for eyes-off. So there's a question mark here, and which approach is more likely to succeed? This is the real question. So let's go into the end-to-end approach.
We're talking about camera only end-to-end approach. So let's start with a premise and the promise. The first one is no glue code.
You have a black box, a neural network, into which you input the camera information, and it outputs the trajectory that the car needs to follow. So you are outputting the action, and it is just a data pipeline. You add more and more data, and the network learns how to drive by observing human drivers.
You have a fleet of cars, of millions of cars sending event data. So there is kind of no glue code. As time goes by, you have more and more data and you can train the system with more and more data, and eventually you will reach that singularity level, in which you are better than human drivers or reach human driving capability.
So this is the premise. Another premise is about unsupervised data: there's no labeling going on, just the raw images go in, nobody labels what's in the images. The premise is that unsupervised data alone can reach a sufficient Mean Time Between Failures.
So this is the premise. If you look at the reality in terms of no glue code, the glue code is shifted to offline. What does that mean? Machine learning models, especially when I'm talking about the transformer architecture, estimate the probability of the trajectory given the input data. When you are estimating the probability, or the likelihood of the next token that you are predicting, you are not estimating the level of correctness. That means there is a difference between correct-yet-rare and incorrect-yet-common, and the network doesn't really know how to distinguish between them. This is why language models are doing RLHF, reinforcement learning from human feedback. So what you need to do is find what is correct and incorrect offline. You have all this data, and you have filters that will remove bad human driving behavior, like rolling stops and things like that.
So this is a lot of engineering and glue code, but offline, not online. You are shifting from online to offline. And this is called the AV alignment problem, or alignment problem: knowing what is correct and what is incorrect. Then, in terms of unsupervised data, well, I put here a question mark, and I'll elaborate a bit more on the notion of a calculator.
If you look at today's language models, they cannot reliably multiply two numbers. They use a tool for that, right? From the network's perspective, from a data-statistics perspective, they were not able to abstract the concept of long multiplication. So what OpenAI did, they introduced a calculator, a calculator through Python code. There's also the issue of the shortcut learning problem; I'll mention that in my talk.
There's the issue of long tail problem. I'll also mention this. So this idea that unsupervised data alone is sufficient, is really questionable. So let's go over this AV alignment problem. So the "No Glue Code" AV alignment problem.
So an end-to-end approach maximizes the probability of the output, in this case the trajectory Y, given the input X, which is all the sensory data that comes in, in this case cameras. Y is the future trajectory that the human would take, because we are using human data. And, you know, if your objective is to estimate the probability or the likelihood, then you would prefer, say, common-and-incorrect over rare-and-correct, and there are a number of examples of that. One is when you are approaching a stop sign.
Most humans do a rolling stop. They slow down, they look left and right, everything is okay, and they continue. What you actually need to do is a full stop. Most humans don't do a full stop. So this is an example of rare-and-correct versus common-and-incorrect. If you are just observing human driving data, you will learn to do a rolling stop, and then you'll always commit a traffic violation, and we don't want an autonomous car committing traffic violations.
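To make this concrete, a standard way to write the end-to-end imitation objective being described is the following (my notation, not from the slides):

```latex
\max_{\theta} \;\; \mathbb{E}_{(X,Y)\,\sim\,\text{human driving}} \big[ \log P_{\theta}(Y \mid X) \big]
```

Since the expectation is taken over observed human behavior, a trajectory that is common but incorrect (the rolling stop) contributes far more to this objective than one that is rare but correct (the full stop), so the learner is pulled toward the common behavior.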
Then there are "rude drivers". You know, certain countries, certain geographic areas. You know, here in Israel, we are at the center of rude driving. People cut in line, cars standing in line and then exit and then all of a sudden, you know, some... I don't want to mention names, but cuts in. So this is rude behavior.
And it's not only in Israel, there are other places as well. But you don't want to learn rude behavior. You are building an eyes-off system, and it should drive well. With ethics, right?
It's not just rude driving. So if you just learn from human data, you will learn how to drive in a rude manner. Reckless driving, there is a lot of reckless driving that did not generate an accident. Because if there was an accident, you would know that this is a bad example.
But, you know, just rude driving: how do you know whether it's rude or not rude, or whether it's reckless driving? All of these are examples of common-and-incorrect versus rare-and-correct, and this is built into the fact that they are estimating probability. In language models, once ChatGPT came out, all of these hallucinations were understood: giving answers that are incorrect.
So the next breakthrough after ChatGPT was introduced was deploying RLHF, reinforcement learning from human feedback, where an army of humans would teach the network what is correct and what is incorrect: rank answers, you know, put in a query, get a collection of answers and rank them. So you teach the network, through a reinforcement learning phase, what is correct and what is incorrect, because just by statistics it will prefer the common and incorrect.
So what is RLHF in autonomous driving? This is an example where the glue code is shifted to offline. Now, whether glue code offline is easier than glue code online, I don't know, but there is glue code; it's just shifted. The second point is about abstractions. You know, that idea that unsupervised data alone would be sufficient to learn everything that is important? We know that this is very, very questionable.
And take, for example, multiplication of two numbers. Say you have two numbers with up to 20 digits each, and you are asking your language model what the result of this multiplication is. What you see on the right is a tweet showing the latest model, o1, compared to GPT-4o. The columns and rows are the number of digits in each number, up to 20 digits.
So 20 digits multiplied by 20 digits. Red means zero percent correct answers, and green is anything above zero, up to 100%. When the number of digits is very small, the network succeeds, but when the number of digits grows beyond four or five digits, the network does not succeed.
And we're talking about the latest model. The network was not able to abstract the concept of long multiplication: all that the network has seen is examples of pairs of numbers and their product, and that was not sufficient to abstract the concept of long multiplication. So what is being done by companies like OpenAI? They introduced tools. If you go to ChatGPT and ask it to multiply two numbers, it will not use the network itself to output the result.
It will understand that what you want to do here is multiplication, translate it into a piece of Python code, and the Python code employs a calculator. A good old calculator: you don't need to learn this. This is what we refer to as injecting abstraction. So you need to inject abstraction; not everything can be learned unsupervised, or it will take a very, very long time to learn. If you have infinite time and infinite data, maybe everything is possible, but it's not practical. So this is about the abstractions.
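As a minimal illustration of the "calculator as a tool" idea (a sketch of my own, not OpenAI's actual tool-use pipeline): exact multiplication of 20-digit numbers is trivial for ordinary code, which is why delegating to a tool beats asking the network to emulate long multiplication.

```python
# Sketch: delegate exact arithmetic to a "calculator" tool instead of a neural network.
# The dispatcher below is hypothetical; real systems generate and execute code in a similar spirit.

def calculator_tool(expression: str) -> str:
    """Evaluate a pure arithmetic expression exactly, using Python's big integers."""
    allowed = set("0123456789+-*/() ")
    if not set(expression) <= allowed:
        raise ValueError("not a pure arithmetic expression")
    return str(eval(expression))  # exact for integer arithmetic, no learned approximation

# Two 20-digit numbers: hopeless for a raw next-token predictor, trivial for the tool.
a = 73914628502917364859
b = 91827364509182736455
print(calculator_tool(f"{a} * {b}"))   # exact product, every digit correct
```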
And there's another notion. What I said so far is really well understood by anyone who took Intro to Machine Learning. So far, I haven't said anything that isn't known; maybe not known to laymen, but anyone who took a course in Intro to Machine Learning knows it. So I didn't say anything new. What I'm going to say now is something that maybe few people know about.
This is called "The Shortcut Learning Problem", and let's look at the notion of fusion. Say we want to fuse information. It could be information coming from different sensory modalities, for example. And let's assume that this information, some of it has a low sample complexity, meaning you need a small amount of data in order to generalize, and some of it has high sample complexity, meaning you need a lot of data to generalize. So for example, with a lidar, since it's a precise three dimensional sensor, to generalize, you need much less data than the data that you need for cameras in order to generalize.
So a lidar would be a low sample complexity source of information and cameras would be a high sample complexity source of information. So now you want to fuse all this data. So the "End-to-End" approach would be just feed all the sensors into a big network and train it.
So instead of the network just receiving camera input, say the network receives camera input, radar input and lidar input, and outputs the trajectory as before. But what's the big deal? If we can generalize from cameras, we will generalize also from lidar and from radar; feed everything, what is called low-level fusion, into an end-to-end network, right? The shortcut learning problem is that when you input different modalities with different sample complexities, an end-to-end stochastic gradient descent (because machine learning today is essentially all SGD-based) will struggle to leverage the advantages of all the modalities. Now, in order to see this in a bit more detail, let's look at three types of sensors. We have cameras, radars and lidars, and suppose that each system has an inherent limitation which causes a failure probability of Epsilon. Epsilon is small, say, once in 1,000 hours of driving.
Now, once in 1,000 hours of driving doesn't sound like such a high bar, but the systems on the road today that claim to be full self-driving have an MTBF of about one hour of driving, so 1,000 hours is quite a high bar. So let's assume that the error of each system is Epsilon, about once in 1,000 hours of driving. And let's assume that the failures of the different systems are independent. Now, the point is...
whether the different sensor modalities are dependent or not is beside the point. Because fusion is a kind of design methodology; it's not coming to replace validation. To validate the overall system, you'll need the amount of data that you need in order to validate an overall system, right? It's just a design methodology. So when I'm talking about the design methodology, I can make the assumption that the sensor inputs are independent. So now we want to compare two options.
One is low-level end-to-end fusion, as I mentioned before: just feed all this data into one big network and train it on the combined inputs. The other one is a Compound AI system, decomposable learning. You do a high-level fusion: for each sensor, we produce an output. Let's assume that the output is: is there a vehicle in front of me or not? Should I brake or not? Let's make it binary.
Should I apply the brakes for the car in front of me or not? And then you do a high-level fusion. So those are the two approaches, and which one is better? Okay, this slide is a bit of a mouthful, so let's go over it step by step.
Let's assume that all the variables are binary, taking values 1 and -1. The output y is 1 or -1, a Bernoulli with probability one half: 50% of the time it is 1, and 50% of the time it is -1.
Let's assume that r1, r2 and r3 are noise variables with a Bernoulli distribution of Epsilon, meaning that most of the time they are 1, and with probability Epsilon they are -1. Okay? Now let's assume that x1 is a low sample complexity sensor: x1 = y * r1, the output multiplied by r1. That means that with probability Epsilon it makes a mistake, because y is the right output and r1 is mostly 1, except that with probability Epsilon it is -1. So x1 is a simple system, and x2 = y * r2 is also a simple system.
Now let's look at x3, x4, x5 as representing the camera input. We're going to describe a complicated system. x4 and x5 are also 1 and -1 with probability 50% each, and x3 = y * r3 * x4 * x5: the output, multiplied by the noise r3, multiplied by x4 and x5. Now, just to see where this is going, look at the product x3 * x4 * x5: x4 gets squared, x5 gets squared, and since they are -1 or +1,
squaring them always gives 1. So what we end up with is x3 * x4 * x5 = y * r3, which is the output with probability Epsilon of making a mistake. So what we created here is a system that needs to learn the product x3 * x4 * x5. It's a compositional thing to learn, and therefore it's complicated, right? You are modeling the camera. So this is the idea.
This is a higher sample complexity source of information, because you need to learn something compositional. Now, the theorem is that you can easily reach a probability of error of Epsilon squared if you do a high-level fusion, and you can learn this with a single fully connected hidden-layer network plus a majority vote. So in a Compound AI system you can easily reach an error of Epsilon squared by taking a majority of three sources, each of which has an error of Epsilon. But an end-to-end SGD will be stuck at an error of Epsilon for a very long time. How long? Let T be the time complexity of learning the high complexity system on its own; then the end-to-end learner is stuck for on the order of T divided by Epsilon, and Epsilon is very small.
So basically it will be stuck at an error of Epsilon for a very, very impractical amount of time. This shows you that the end-to-end approach is not able to leverage the benefits and advantages of the different modalities. Now, it's not that it cannot; it can, but the amount of time that you will need is completely impractical. This is what we call "The Shortcut Learning Problem".
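Here is a small simulation of this construction (a sketch I am adding for illustration; the variable names follow the slide, the code itself is not from the talk). It contrasts a high-level fusion that takes a majority over the three per-sensor decisions with a "shortcut" learner that relies only on the easy sensor x1:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n = 0.05, 1_000_000

# The construction from the slide: y is the true decision, r1..r3 are rare sign flips.
y = rng.choice([-1, 1], size=n)
r1, r2, r3 = (np.where(rng.random(n) < eps, -1, 1) for _ in range(3))
x4, x5 = (rng.choice([-1, 1], size=n) for _ in range(2))

x1 = y * r1                # low sample complexity sensor
x2 = y * r2                # low sample complexity sensor
x3 = y * r3 * x4 * x5      # high sample complexity sensor: only x3*x4*x5 reveals y

# Per-sensor decisions, assuming each decoder has been learned; the third one
# requires having learned the compositional product x3*x4*x5.
votes = np.stack([x1, x2, x3 * x4 * x5])
majority = np.sign(votes.sum(axis=0))

shortcut = x1              # a learner stuck on the easy signal alone

print("majority-of-three error:", np.mean(majority != y))   # roughly 3*eps**2
print("shortcut (x1 only) error:", np.mean(shortcut != y))  # roughly eps
```

The theorem itself is about training dynamics: SGD on the joint input finds the x1 shortcut quickly and then stays near error Epsilon for a long time before it picks up the compositional x3*x4*x5 signal. The simulation only illustrates the gap between the two error levels once each decoder exists.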
I didn't put the proof here, but send an email to me or to Shai and we'll send you the proof; it's not that difficult an exercise. So, again, this raises a question about the approach where you just feed in unsupervised data: you don't inject the abstractions, you are not leveraging sources of information with different sample complexities, and it could take an impractical amount of time to learn what you need to learn. Another point is called "The Long Tail Problem".
So say we have edge cases, lots and lots of edge cases to handle. Let's call all those edge cases "The Long Tail". There are two possible scenarios, and we don't know which one is true. One, let's call it the optimistic scenario: if the x-axis is events and the y-axis is the probability of the event, then there is a small number of events that reduces the probability mass considerably.
This is why the curve goes down dramatically: a small number of events covers most of the probability mass. That's the optimistic scenario. The pessimistic scenario is flat: you have a million edge cases and each edge case does not move the needle, so you handle another edge case and another one and another one.
So the coverage of all those edge cases could take a very, very long time, even if you have a big fleet of cars sending data. And we don't know which scenario is the right one. But what we do have now is some data from the Tesla FSD. The Tesla FSD is a very impressive system.
You know, we test it and so forth. It's a super impressive system. I'm not saying anything negative about the Tesla FSD system.
Super impressive; their achievements are really very, very impressive. But now we're talking about MTBF. I'm not saying anything negative against the Tesla FSD. So this is public data.
There was a jump between version 11 and... yeah, so this was version 11. The Mean Time Between Failures here is hours of mean time between critical takeovers. So there was a jump between version 11 and version 12; version 12 is the end-to-end system. And then the next step, from 12.3 to 12.5,
which is supposed to be even better, is worse. Okay? I'm not saying that they will not improve over time, but what I'm saying is that it looks challenging. It's not just this optimistic picture where every month there's going to be a 2x improvement. It sounds challenging.
There was something a few days ago: AMCI did a test of 1,000 miles of driving. They had 75 takeover events, about 13 miles per event. And as I said, the bar that we need to pass is somewhere between many tens of thousands of hours of driving and many millions of hours of driving, depending on whom you talk to, which car maker you talk to. And what we see here is an MTBF of around one hour. So the bar that we need to pass is very, very high, and where the industry is today is far away from where we need to be in order to provide an eyes-off system. Therefore this raises the question mark: can you just feed unsupervised data, forget about abstraction, forget about those calculators, forget about the long tail, and just be religiously focused on feeding more and more data? It's a real question whether you can reach the goal, because we are very far away from the bar that we need to pass in order to say that we have an eyes-off system.
And again, I emphasize that the Tesla FSD is an impressive system; I have nothing to say against the system. Okay, so we went through this bird's eye view, and we put a question mark about their MTBF and also a question mark about us: we have not shown that we have achieved that MTBF. So now, what are we doing? In order to understand what we are doing, we need to remind ourselves of a basic concept of machine learning, which is called the bias-variance tradeoff. The x-axis here is the amount of abstraction injection. So you are injecting abstraction.
So in the case of multiplying two numbers, the abstraction is: I know the algorithm of long multiplication, so I'm injecting it into the learning machine. In the case of autonomous driving, RSS, for example, Responsibility-Sensitive Safety, our engine for checking driving policy, is an abstraction. The sensing state, the fact that you are given images and you are outputting where the vehicles are located, the pedestrians, all the road users, the lane marks, traffic lights, traffic signs, etc., is all an abstraction.
Because you are deciding, you know, from your experience what is the important information and this is what you are outputting. So this is an abstraction. So the level of abstraction that you inject is called bias. And if you inject too much abstraction, you may introduce an approximation error, meaning that the richness, you know, the capacity of the model does not reflect the richness of reality.
So you are limiting yourself too much. Too much abstraction, too much bias, can increase the error of the system because of approximation error, because you are limiting the capacity: the capacity of your network cannot reflect the richness of reality. So you need to inject the right amount of abstraction.
Then the second curve here is variance. If you inject no abstraction, you will get a high generalization error. So as you inject more and more bias, inject more and more abstraction, the generalization error goes down, but then if you inject too much abstraction, your capacity is not sufficient to reflect the richness of reality.
Okay, so this is a tradeoff; it's called the bias-variance tradeoff. And the true error of the system is the sum of those two. So we want to reach a sweet spot of injecting the right amount of abstraction, and this is what you see here: the total error is minimized at the sweet spot, and this is where you want to be. So now, how does this translate to what we are doing? First are the abstractions: the Sense - Plan - Act methodology. The sensing state, describing what a sensing state contains, is an abstraction.
Analytic calculations: RSS is one big abstraction, and there are also analytic calculations being done; those are also abstractions, those are tools. The notion of time-to-contact when you're doing AEB, that's also an abstraction, and its calculation is a tool, just like the calculator in the long multiplication example.
And so forth and so forth. Then redundancies: the fact that you have different sensor modalities, and that you have redundancies in algorithms. So it's not that you have one monolithic algorithm, one algorithm to go from cameras to a sensing state.
You can have different algorithms. So you are providing redundancies. Then there is high level fusion.
That's also a redundancy. And then there is the AV alignment, understanding the difference between correct and incorrect: that's RSS. RSS gives us the AV alignment; it separates correct from incorrect.
So this is what we call a Compound AI system. What Shai will focus on is the Sense - Plan - Act part, and there he will show some really deep innovations: deep innovations in algorithms, deep innovations in the chip design. Shai will talk a bit about the chip design. So this is what Shai is going to talk about immediately after me. I'll talk about this high-level fusion, this PGF.
What I mentioned before, when I talked about shortcut learning, is that if you do this naive low-level fusion, just an end-to-end network that all the sensor data, cameras, radars and lidars, is fed into, you will not be able to really leverage the advantages of each sensor modality. Okay, so that's the negative result. What is the positive result? How is Mobileye doing high-level fusion? It's not that simple; there's a simple part to it and there's a very non-simple part to it. And we call it PGF:
Primary, Guardian and Fallback. I'll say a few words about that and then I'll give the floor to Shai. So, high-level fusion: how to perform high-level fusion, or high-level fusion done right, let's put it that way. We have a camera, we have a radar, we have a lidar. Say, for example, we have this scenario where there is a vehicle in front of us and we need to decide whether to apply the brakes or not.
In this simple scenario, if there is a vehicle in front of us and we need to decide whether to apply the brakes or not, it's a binary decision, and the classical way to handle this is the majority: the two-out-of-three majority. Each system tells you 1 or -1,
apply brakes or don't apply brakes, and you take the majority over them. And if each system has a probability of Epsilon of making a mistake, then the majority rule will give you a probability on the order of Epsilon squared of making a mistake, both for misdetection and for false detection. So this is the classical, simple part. If you ask any engineer how to do high-level fusion, this is what that engineer would tell you. But there's another non-trivial aspect to it: there are many decisions that are not binary, that cannot be decided by a majority.
Say, for example, and this arises mostly in lateral control, we are the car in the middle and there are two buses flanking us on both sides. And we have one system that tells us the roadway is going left, another system that tells us the roadway is going straight, and another system that tells us the roadway is going right. There's no notion of a majority here, so what should I do? If I make the wrong decision, I'll have an accident with one of the two buses that are flanking me.
So there is an issue of how you do high-level fusion when you have non-binary decisions. The general idea is that we build three systems, and this is for every component on which we want to do high-level fusion.
It's not only cameras, radars and lidars; it's anything that involves high-level fusion. For example, lane detection: we can have lane detection through camera perception and we can have lane detection through our high-definition map, our REM™ map. Right? We localize ourselves, we project the map onto 2D, and we have the lanes coming from the map.
So you have here two redundant systems, right? And now, how do you do the high-level fusion here? It's a non-binary decision. So we build three systems. One we call the "Primary", and it makes the prediction; in the case of a lane, it predicts where the lane is. Another one is a "Fallback", based on a different approach. In the example I gave before, one would be REM™ and the other would be camera perception. It's a fallback system.
And then there is a "Guardian" system that checks whether the prediction of the primary system is valid or not. Say it's an end-to-end network that makes a decision: is the roadway going right, left or straight? So you have the Guardian system. Now, each of these has a probability of Epsilon of making a mistake.
And the fusion rule is: if the Guardian says the primary is valid, choose the primary; otherwise choose the fallback. And the claim is that this framework has the same property as the majority rule, meaning you can prove that the overall error of the system will be on the order of Epsilon squared and not Epsilon. And I leave the proof here; you can go over it.
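One way to see the claim, under the independence assumption and with each of the three components erring with probability ε (this is my own short argument, not the proof on the slide): the PGF output is wrong only if at least two of the three components fail.

```latex
\begin{aligned}
P(\text{error})
  &= P(\text{Guardian says valid},\ \text{Primary wrong})
   + P(\text{Guardian says invalid},\ \text{Fallback wrong}) \\
  &= \underbrace{P(\text{Primary wrong})\,P(\text{Guardian errs})}_{\varepsilon \cdot \varepsilon}
   + \underbrace{P(\text{Guardian says invalid})}_{\le\, 2\varepsilon}\;
     \underbrace{P(\text{Fallback wrong})}_{\varepsilon} \\
  &\le \varepsilon^{2} + 2\varepsilon^{2} \;=\; 3\varepsilon^{2}
\end{aligned}
```

This matches the order of the two-out-of-three majority rule: a single failure of any one component never produces an error on its own.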
And this gives us a natural pause until Shai comes and replaces me for the next part of the talk. The really good part is coming now; I just gave a broad overview.
So, everybody saw the proof and knows it by heart, so we can continue. Thank you, Amnon. As Amnon said, we're building a Compound AI system (CAIS), and what I'm going to focus on is the sense/plan part.
So, how do we do sensing, how do we do planning, using AI. And the focus, or the main theme, of what I'm going to talk about is the notion of extremely efficient AI. You know, in computer science efficiency is key.
But when we are building a product for a business, it's even more important. And when we are building a compound AI system, it's even more important, because it's really crucial to be efficient. So, I'm going to talk about four themes of efficiency. The first one is about transformers for sensing and planning, and we will show a hundred-fold efficiency gain for transformers, the latest and greatest tools in AI, while not hurting performance at all, and even slightly improving performance.
So, this is really magic, and I'm going to explain how we achieve this magic. Secondly, at the end of the day, we need to run these transformers and our entire stack in the car on a chip, so I'm going to talk about how we design the AI chip so that it will be a very, very capable machine while being extremely efficient in terms of what we achieve per dollar and per watt. Third, these AI beasts need to be fed; we need data. The question is how we create data for these machines in an efficient manner.
Of course, you can have humans that will label you millions of examples, but this is not efficient and not always enough, so we need to do it in a much more efficient way. And last, since modularity is very important to sustain the business, we need to understand how we develop for a stack of products starting from ADAS until robotaxi. These are very, very different products, and we need to consolidate the development in order to be very, very efficient in our development and supporting all of the variety of these products with the same software stack.
So, all of these make extremely efficient AI. So, before diving into what we are doing, let's talk a little bit about what happened in the world around us in AI in the last years. It's really amazing what's going on with AI. AI is changing the world, and all of us are seeing this.
It started with machine learning. Mobileye was founded in 1999 and was one of the first companies to use machine learning in its products. But then, around 2012, came the deep learning revolution that took computer vision and made it really, really different from what it was a year before. In one year, all the papers in the computer vision conferences changed from many different methods to a single hammer, which is called deep learning.
And Mobileye was one of the first companies to utilize deep learning in our software stack. But then, in the last four or five years, came more and more revolutions, one after the other: generative AI and universal learning, both achieved by transformers, then Sim2Real, simulation to reality, and what we are experiencing now is a reasoning revolution. So, we need to look at how all of these things should affect the problem of self-driving.
And I'm going to talk about the transformers part, about generative AI, and universal learning. So just briefly, what happened with deep learning before transformers, pre-transformers. This is a typical use case of object detection. So, we start with an AI component, a deep learning component, usually a convolutional neural network that runs on the image on the left, and every pixel declares whether there is an object that this pixel belongs to, and what is the bounding box of the object. So we get many, many rectangles around objects. This is not enough, because what we want to achieve is what we see at the bottom.
We want to get the position, the location, of the objects in the 3D world. So, there are two steps that we need to do. One, we need to cluster all of these many, many rectangles and do non-maximum suppression, getting the image on the right, so we now have only one box around every object. But this is still not enough, because these are boxes in the image space.
What we want is boxes in the 3D world. So, we need to write code again, code which is not deep learning, which is not learning at all. Call it glue code, though maybe that's not the right word for it; it's very sophisticated computer vision code that takes the detections from the image space, the 2D space, into the 3D space.
So, this is what happened before transformers. Now came the GPT family of methods. GPT stands for generative pre-trained transformers. And these models brought three revolutions.
The first one is tokenize everything. So, think about everything like text. If you can talk about it, you can learn it, okay? I don't care if it's images, I don't care if it's audio, video, maybe it's text. Can you talk about it? If you can talk about it, I can learn it.
This is the first revolution. The second revolution is generative auto-regressive learning. Generative means that we learn probabilities; we don't just discriminate, deciding whether something is true or false. Instead, we generate objects, all types of objects. We saw it in text very, very clearly, because these models, the GPTs, can generate text. They are not just saying whether a text has a positive or negative sentiment; that is discrimination.
They are actually generating new text, and they are doing it auto-regressively. What does auto-regressively mean? One by one. Generate the next word; now that you have generated the next word, generate another one, and another one, and another one. This is what auto-regressive means. And the third revolution is how to do it using a specific neural network architecture called the transformer. The paper that brought the transformer is called "Attention is all you need", because there is a specific layer called attention. I'm now going to deep dive into these three revolutions and explain what they did.
Now, why do I explain all of this? Because I want to explain what we are doing and how you need to really, really understand it very, very deeply in order to do it better. Okay. So, tokenize everything. Tokenize everything. So again, let's look at the problem of detecting objects in an image.
We need to make the image tokens. So, let's take every small patch in an image and call it a token. This is input. Now, we also need to transcribe the output in some way, make it a text as well. So, how can we make it a text? What are objects? Just locations in the image of the objects. So, let's just say that every object is four numbers indicating the corners of a rectangle that determines where the object is in the image.
And then let's predict it in an auto-regressive manner, like what the video shows: you see at the bottom how the prediction outputs one token at a time, describing what's in the image. Okay? So, the input is a single image, which we tokenize by taking small windows; each one is an image patch, an image token. And we tokenize the output, the description of all the objects in the image, by describing every object by its four coordinates and then saying what the type of the vehicle is.
In this case, it's truck and truck. Now, generative, auto-regressive. Before the transformers, what we did is just we predicted very simple output spaces. For example, the ImageNet data set that brought the deep learning revolution into the world is a problem in which you get an image, and there are a thousand possible classes, and you need to determine what you see in the image - if it's a certain type of a dog or a cat or a certain type of glass or whatever. But there is a fixed dictionary of fixed size of what can possibly be in this image, and you need to choose among these options.
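So an output sequence for such a frame might look something like the following (an illustrative, made-up encoding; the actual token vocabulary used in any real system is not specified in the talk):

```python
# Hypothetical tokenized output for a frame with two trucks:
# each object becomes four discretized corner coordinates plus a class token.
output_tokens = [
    "x1=412", "y1=220", "x2=608", "y2=371", "class=truck",
    "x1=655", "y1=240", "x2=810", "y2=362", "class=truck",
    "<end-of-objects>",
]
# The decoder emits these one at a time, each conditioned on the image tokens
# and on all previously emitted tokens (auto-regressive generation).
```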
In generative auto-regressive learning we are converting everything to a language, and once it is a language you can describe everything. With words I can tell you everything that I want; I can convey to you any message that I want. So, I no longer need to talk about a specific fixed set of possible classes. Instead, I will describe to you in words, which are called tokens, what I want the output to be, and I will generate a sentence, a paragraph, an arbitrary-length sequence, that tells you what I see in the image or whatever else I want to convey. So, this is the generative approach, and auto-regressive means that we are outputting one token at a time. And the key feature that makes this possible is a mathematical rule called the chain rule.
So, let's explain the chain rule. So, suppose again that we want to output what's in the image in terms of object detection. So, as you can clearly see, there are four cars in this image. Okay?
Say that each one is described by four coordinates, so we have 16 numbers describing what's in the image: the locations of all the cars. Okay? Now, suppose that we discretize the X and Y, so we have 100 possibilities for the X location and 100 possibilities for the Y location. If we are not using the chain rule and we need to output the probability of what we see in the image, we need the joint probability over 16 numbers, each with 100 options, so overall 100^16 = 10^32 options. We would need to output a very, very long vector of length 10^32, a one with 32 zeros after it, where each position in this vector gives the probability of seeing a specific combination of four objects. But even this is not enough, because why four objects? Maybe there are five or six or ten, and then it becomes truly impossible to output the probability of what we see in the image.
So, what does the chain rule tell us? The chain rule tells us that we can output things one at a time. And this is a mathematical identity, so there is no approximation here. Instead of outputting the probability of the whole sequence given the image, we can take the probability of the first token given the image, times the probability of the second token given the first one and the image, times the probability of the third one given all the previous ones, and so on and so forth. This is an equality. Okay? So this chain-rule trick makes each individual prediction of dimension 100, which is much more manageable. Of course, the magic of using the chain rule comes from function approximation: we use deep learning to approximate the conditional probability of the next token given the previous tokens.
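Written out, the factorization being described is (standard notation, with x the image and t_1, ..., t_n the output tokens):

```latex
P(t_1, t_2, \dots, t_n \mid x)
  \;=\; P(t_1 \mid x)\, P(t_2 \mid t_1, x)\, P(t_3 \mid t_1, t_2, x)\cdots
  \;=\; \prod_{i=1}^{n} P(t_i \mid t_1, \dots, t_{i-1},\, x)
```

So instead of one distribution over 10^32 joint outcomes, the network only ever has to model a distribution over the roughly 100 possible values of the next token, conditioned on what has been emitted so far.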
So, we need a generic machine, a generic deep learning architecture, that knows how to approximate the probability of the next token given the previous tokens. And this is exactly what transformers are designed to do. So, what are these transformers? Like many, many other deep networks, they are comprised of a sequence of layers. Now, let's focus on a single layer. All the layers are the same in the sense that they all get a sequence of tokens, or a representation of a sequence of tokens, as input, and the output is again a representation of a sequence of tokens. The representations are of the same dimension.
I call the dimension d. So, the input is n tokens each one of dimension d, and the output is also n tokens, each of dimension d. Now, inside a transformer layer, there are two types of operation. The first one we call self-reflection, and the second one self-attention.
So, now let's deep dive into these two types of intra-layer components. In order to explain them, I'm making an analogy to a group thinking process. Think about the group of people here: each one of you is a token, okay? Now, each one of you knows something about life, about the meaning of life, et cetera, and also about whether there are cars in the image, which is more important. So, each person has their own area of expertise.
They all contribute to the outcome. And now when we are doing a discussion, a group discussion, we have two types of operations that are happening. One is self-reflection.
Self-reflection is: each one of you sits by himself, processes what he knows, thinks some more, and finds a better representation of the personal knowledge that each person has. Okay? The second type of operation is a discussion, and we call it self-attention. How does it work? One of the tokens wants to know something, so it asks a query. For example, "Does anyone see a close truck on our left side?", because, you know, your meaning in life is understanding what vehicles are around us. Each token deals with something, so it asks a question, which we call a query.
And then, of the rest of the tokens, some of them know nothing about it and some of them know something about it. So, we need to match those who know with those who don't, and then we can communicate and pass the shared knowledge around the group.
So, now let's make it formal, starting with the self-reflection. The self-reflection is a simple one-hidden-layer neural network. Basically, each token multiplies its knowledge, its d-dimensional vector, by a matrix, applies some non-linear operation, maybe goes to a different dimension, and then goes back to the original dimension. This multiplication by a matrix of dimension d by d takes d-squared operations, and since there are n tokens, the overall time complexity of this operation is d squared times n.
As for the self-attention, each token first generates three vectors: one is called the query, one the key and one the value. Each token generates a query, what it wants to know; a key, what it knows, in order to propagate knowledge to other tokens; and a value, the actual knowledge that it wants to propagate to others. After generating these three vectors, we multiply each query with each key. This takes n^2 * d, because we have n tokens, each token multiplies its query with the key of every other token, and each such multiplication is of vectors of dimension d, so we get n^2 * d. Now, once we have this n-by-n matrix, we normalize the rows, making them probabilities, and then the message that token i gets from the rest of the group is simply a weighted summation of the value vectors of all the other tokens, where the weights are these alphas.
Alpha_ij measures how much token j's key matches the query of token i, and this weighted message is what token i gets in order to keep improving its knowledge in the next layers of the transformer. So overall, what is the complexity of applying a transformer? We have L layers, and each layer has two operations: the self-reflection takes n * d^2, and the self-attention takes n^2 * d, so the total is L * (n * d^2 + n^2 * d). A minimal sketch of a single layer with these costs annotated is shown below.
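Here is a compact NumPy sketch of one such layer (my own illustration, not Mobileye's code), with the two per-layer costs from the talk annotated:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_layer(X, Wq, Wk, Wv, W1, W2):
    """One simplified transformer layer. X has shape (n, d): n tokens of dimension d."""
    # Self-attention: queries, keys, values, then all-pairs matching.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # three (n, d) projections
    alpha = softmax(Q @ K.T / np.sqrt(X.shape[1]))  # (n, n) weights: O(n^2 * d)
    X = X + alpha @ V                               # weighted sum of values: O(n^2 * d)
    # Self-reflection (the per-token feed-forward part): each token processed on its own.
    X = X + np.maximum(X @ W1, 0.0) @ W2            # (d x d) matrices: O(n * d^2)
    return X

# Tiny usage example with random weights.
n, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(5)]
Y = transformer_layer(X, *params)
print(Y.shape)  # (8, 16): n tokens in, n tokens out, same dimension d
```

Real transformer layers add layer normalization, multiple attention heads and a wider hidden dimension in the feed-forward block, but the cost structure is the same. With that in hand, let's compare this to what happened before transformers. One option is a flat, fully connected network.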
So, in a fully connected network we take just the entire input, all the neurons in the input: we have n tokens, each one of dimension d, so the overall number of neurons in the input is n * d. And then, in a fully connected network, we multiply by a matrix of dimension (n * d) by (n * d), so overall we get n^2 * d^2. So, it's more expensive than a transformer layer, because in a transformer layer we have n * d^2 + n^2 * d. Okay? So, this is one option, a fully connected network.
Another option from before transformers is the family of recurrent networks, which apply a Markov assumption. Basically, every token talks only to the one before it. So it's as if, sitting here, everyone talks only with the person next to them. Okay?
So, now we have n * d^2, similar to the n * d^2 of self-reflection, without the n^2 * d term. These are recurrent neural networks. So, if we compare transformers to the alternatives: relative to fully connected networks, transformers are sparser; relative to the Markov assumption in long short-term memory or recurrent neural networks, transformers are denser, because you talk with all the tokens and not only with the one next to you.
And compared to convolutional neural networks: convolutional neural networks are also very efficient, but they are specifically tailored for images, while transformers are more general; you can tackle everything with transformers. Okay, so what do we get from all of this? We get a universal solution, a universal learning machine, because we can handle all types of inputs. We made everything a language. So now it doesn't matter if it's images, if it's language, if it's voice, whatever.
Okay? We can deal with uncertainty because we are learning probabilities. And Amnon mentioned that what we want to learn is a probability of the next token given the previous tokens. Okay? So, even if there is no single right answer, like what we see in text processing.
In ChatGPT, for example, when you ask ChatGPT something, there is no single right answer; there are many possible right answers to every question. But since we are learning probabilities, it's fine, we can handle this uncertainty. And finally, we enable all types of outputs; again, because everything is a language, we can express everything.
So, did we get the ultimate learning machine by transformers? In some sense, yes. Okay. So, if it's so good, let's use it. So, let's see how we can use a transformer for end-to-end object detection network. Okay?
I will later talk about end-to-end from pixels to trajectory to control commands. Okay? Which will also be very, very similar, and most of what I'm going to say here is applicable also to this end-to-end, but for the sake of concreteness, I want to focus on the object detection problem. So, the input is images, a set of images, from multiple cameras and from multiple time frames, because we want to understand the kinematics as well. And the output is a description of all the objects in the scene, what we call a sensing state.
This problem is quite complicated because we need to tackle what we call the five "multi" problems. We are dealing with multiple cameras, surround sensing. We are dealing with multiple frames, not a single time frame but several time frames. We need to output multiple objects, not a single object. We need to tackle multiple scales: objects that are very, very far away appear in the images as very few pixels, versus close objects, which take up the entire image.
We need to tackle all of this, and it's not sufficient to describe the position of the objects in the world; we also care how they behave according to traffic rules, and for this we need to assign them to the lane system. So, all of these multi problems need to be tackled by a good end-to-end object detection system. Luckily, we have transformers. How can we solve it with transformers? Simply encode image patches: the input is images, so let's encode them as tokens.
Let's call them image tokens. And let's encode objects also as a sequence of tokens. We want at the end of the day to describe the location and lane assignment and everything on all the objects in the scene, just describe it in words. Okay? And apply a transformer.
Now we have an ultimate learning machine, universal, that can tackle everything. Let's apply it to generate the probability of the objects given the images, and the problem is solved. So, let's make it concrete. We need to start with generating image tokens. Now, we could do something very naive: just take every small patch of the image and say that every patch is an image token. Okay?
We could do it, and some people are doing it, it is called an image transformer, but it's not efficient. It's much better to rely on convolutional neural networks, which are much more efficient and tailored for images, in order to generate the input image tokens. So, this is exactly what we are doing. We are taking 32 high-resolution images from different cameras and different time frames, and with a convolutional neural network converting them to small feature maps of resolution 20x15, where each such map has channels, has dimensions. Taking the channel dimension as the representation gives us a token, which we call an image token.
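In other words, the tokenization step reshapes the CNN feature maps into a flat token sequence, roughly like this (shapes taken from the numbers in the talk; the code is my own sketch):

```python
import numpy as np

# Assume a CNN backbone has already produced feature maps for 32 camera/time-frame
# images, each of spatial resolution 15 x 20 with 256 channels.
C, H, W, d = 32, 15, 20, 256
feature_maps = np.random.randn(C, H, W, d).astype(np.float32)  # stand-in for CNN output

# Each spatial cell becomes one image token whose representation is its channel vector.
image_tokens = feature_maps.reshape(C * H * W, d)
print(image_tokens.shape)   # (9600, 256): almost 10,000 tokens of dimension 256
```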
I just want to mention that there was a very interesting, even a little bit funny, discussion on Twitter between Prof. Yann LeCun, who invented convolutional neural networks and did a lot of good things besides this, and who heads Facebook AI Research, and Elon Musk. It was a funny discussion and I will not go into its details, but one of the things they debated is whether convolutional neural networks have any meaning in the era of transformers. Elon, you know, mocked Yann and said, "You know, convolutional neural networks are not used anymore, we use transformers." And Yann responded that, no, it's not true, a convolutional neural network is really what you want when you are dealing with images.
And of course, Yann is correct. Because if you know that you are dealing with images, you don't want to take something completely generic. You want to use the prior knowledge that things that are close to each other are more connected in order to generate the image token. It doesn't mean that we don't want to use transformers, we of course are using transformers, this is a second part of the encoder, but you want to start with a convolutional neural network backbone in order to do the tokenization, the mapping from the images themselves, into a sequence of image tokens.
So, this was the CNN backbone. Now comes the encoder. In the encoder part, we now have C * Np image tokens. C is 32, the number of images that we started with, from multiple cameras and multiple time frames, and Np, p for pixels, is the number of image tokens that we have per image, which is 20 x 15, namely 300 tokens per image. So, overall it is almost 10,000 image tokens, and each one has a dimension of 256.
Of course, all of these numbers are a little bit arbitrary. You can go to a higher dimension, but for the sake of discussion, I chose something that we actually tried and that works; so, a good example. Okay, now let's apply a vanilla transformer network. If we have L layers, we saw that the complexity is L * (N^2 * d + d^2 * N). In our case, if you count this with 32 layers, again a standard choice, you end up with a compute requirement of around 100 TOPS.
Now, TOPS is tera operations per second. So, just counting how many operations we need, we need on the order of 100 tera operations per second. Here I assumed that we are running at 10Hz, processing ten times a second, which is, again, a standard thing to do in self-driving. Okay.
Now, you might think: is it a lot? Is it not a lot? It's actually quite a lot, and 100 TOPS is also just the theoretical count. In practice, it will be much more, because an operation can take more than one cycle. So, it's really a lot. Okay.
This is for the encoder part of the transformer. Then comes the decoder part. For the decoder, I remind you that we are doing generative auto-regressive decoding, meaning that we output the first token that describes the output, and then, given the first one, the second one, and so on again and again. So seemingly, we should run the entire network again and again for every output token, but there is a trick for doing it more efficiently, which is called KV caching.
KV stands for the keys and values matrices, and since transformers are left-to-right, we can do this trick, which saves a lot of compute but comes at a price in memory: now we need to bring in all the keys and values from all the other tokens for each generation of a new token, and this is L x n x d per token, which is also really expensive, and actually all modern AI chips will suffer here. Okay. So, in a sense, the blessing, the universality, of transformers is also what makes them a brute-force approach, maybe the dark side of universality.
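To make the trick concrete, here is a minimal sketch of KV caching in auto-regressive decoding (my own illustration of the general technique, not Mobileye's implementation): keys and values of already-processed tokens are stored once per layer, so generating the next token only requires computing one new query, key and value, at the cost of keeping the L x n x d cache in memory.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_step(x_new, cache, Wq, Wk, Wv):
    """One causal attention step for a single new token of dimension d.
    cache holds the keys/values of all previously decoded tokens (the KV cache)."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    cache["K"].append(k)                      # grow the cache instead of recomputing
    cache["V"].append(v)
    K = np.stack(cache["K"])                  # (t, d): keys of all tokens so far
    V = np.stack(cache["V"])                  # (t, d)
    alpha = softmax(K @ q / np.sqrt(len(q)))  # attention over past tokens only
    return alpha @ V                          # the new token's attention output

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for t in range(5):                            # decode five tokens one by one
    out = decode_step(rng.normal(size=d), cache, Wq, Wk, Wv)
print(len(cache["K"]), out.shape)             # 5 cached keys, output of dimension (16,)
```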