AI Paper Talks Ep. 2 - Reinforcement Learning as One Big Sequence Modeling Problem
(upbeat music) - [Ruth] Welcome, everybody. Good morning. Good evening. We want to welcome you all to the "AI Paper Talks." These are technology conversations about recent research done in the artificial intelligence field where we're going to dive into the technology and math behind it.
These are meant to be discussions, so our presenter today is our CTO, Phil Steitz, from Nextiva, and he's going to walk us through the paper for the first 30 minutes or so, but this is a discussion, so we want you to write down your questions. You can send them in the chat. You can raise your hand with the little emoji over there in Zoom, and we want you to talk to us and to have a discussion about this, so with that, I'm gonna hand it over to Mr. Phil Steitz,
and we're gonna get started. - All right. Well, thanks, everyone, for joining, and, as Ruth said, this is a series where we're talking about recent papers. The content is fairly technical. I'm gonna try my best to get the core intuitions across if you're not familiar with all of the math and technical aspects of what we're talking about, but hopefully, this will also inspire people to dig in more and learn more. There are references at the end of the paper that can help you to acquire some of the knowledge that's needed to understand what we're talking about, so without further ado, let me just jump in.
The paper that we're gonna talk about today is called "Reinforcement Learning as One Big Sequence Modeling Problem" by Michael Janner, Qiyang Li, and Sergey Levine from the University of California, Berkeley. This is a recent paper. It came out this summer, and there's another paper that came out almost immediately after it that's on the same topic, so this is really a new development in how we're thinking about, at least in some ways, reinforcement learning, so it's a very exciting time. What we're gonna do is we're gonna first just basically talk about the model and the paper, talk about the new way of thinking that it represents. We're gonna briefly touch on a couple of applications of it that are in the paper, imitation learning and offline reinforcement learning. Then just quickly review experiments, and then, as Ruth said, just get into discussion because, as I said, this is new stuff. It raises as many questions as it answers, and I'm really interested in ideas that others have about the promise of the technology and sort of what it means to us.
Okay, so I'm not gonna give a complete review of reinforcement learning. There are references at the back that really can get people in, but I do need to frame a little bit, get some common vocabulary. The basic idea of reinforcement learning is that there's an agent who observes state in its environment, decides on actions, and gets rewards, and then, as a result of that experience that the agent gets over time, it ends up making better decisions, so the canonical example is Fido here getting trained, learning what the environment, which is its human partner, likes to see and looking for those rewards that the human is gonna provide to it. A little bit more formally, reinforcement learning is typically modeled as what's called a discrete-time process, so there are individual time steps that march along from t equals 1 to t equals 2 and so on, and we describe the evolution of an RL system as a sequence of state, action, and reward triples. The state is what the agent observes. The action is the action that the agent decides to take, and then the reward is obviously what the agent gets back as a result of its action. Usually, the states and actions are vectors. They're multidimensional things, and the reward is a scalar, and the transition from state-action pairs to successor states is governed by a distribution that we typically denote by p, and I will use that notation throughout here, and that's a dynamical aspect of the system that the agent doesn't control, and it's often unknown.
Action decisions are based on a policy, can be based on a policy, and that policy basically says, "If I'm looking at this state, what do I decide to do?" And policies, throughout this discussion, are gonna be denoted by pi. Policies often have parameters, and those parameters we're gonna denote by theta. Okay, so the little picture down at the bottom shows how actions and states collaborate to produce successor states, and the whole process marches along through time. Okay, so starting with an initial state, when action choices are made by an agent, the result is what we call a trajectory: these state-action-reward triples progressing through time, where the successor states are always determined by the preceding state-action pair. As this progression proceeds, rewards are accumulated, and the basic idea of just about every RL algorithm is to maximize cumulative rewards, so you wanna get the most reward that you can throughout the process, and very often, we engineer it so that rewards out in the future are discounted, so we multiply by a number that's a little bit less than 1, exponentiating it as we go further out into the future, and that, naturally, makes the impact of the further-out rewards a little bit less. Model-based RL is really designed to determine policies and to fit parameters to policies so that when those policies are applied, starting at initial states, we end up with the best discounted cumulative rewards as a result. So the basic idea of the paper is that, if you think about it, if you look at one of these sequences that is a trajectory, you can sort of look at a trajectory as just a sequence of symbols that is actually very similar to a sequence of tokens that we work with when we're forming and using language models, so this basic idea that has been so fruitful in natural language processing of taking sequences of tokens of natural language and learning their behavior by masking and learning to
predict what the next token is going to be, figuring out how to fill in the gaps, figuring out how to extend sequences of tokens, and applying this transformer technology that allows you to take sort of a bidirectional and wide context-rich look at the structure of a sequence so that you understand its internal dynamics, and you understand the dynamics of other sequences that are drawn from the same language. That technique has been very, very useful in enabling us to do things like predict where conversations are gonna go, like to solve classification problems in natural language processing, like do transcription, audio transcription.
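As an aside, the discounted cumulative-reward objective described a couple of paragraphs back can be sketched in a few lines. This is a generic illustration, not code from the paper; the discount factor and reward values are made up:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative reward with the step-t reward weighted by gamma**t,
    so rewards further out in the future count a little bit less."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 0.5, three rewards of 1.0 give 1.0 + 0.5 + 0.25 = 1.75.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Every RL algorithm discussed here is, in one way or another, trying to make this number as large as possible.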
All those things have relied on building models that basically learn what the contextual relational structure of sequences of symbols is. The basic idea that these authors had, and the other ones who did essentially the same thing at just about the same time, is, "Well, what if we actually take these things, and we turn them into sequences of tokens in some kind of, something like a language?" And once we've learned about the internal structure of that language, the language which is the overall system that all this stuff is operating in, then what we can do is we can actually predict where things are gonna go, which is, a lot of times, the real problem that you have in RL. You wanna see, "If I make this decision now, what's going to happen out into the future?" Okay, so that's the root intuition, and I could probably just stop here and say, "Okay, that's what these guys did. Now go figure out how to do it," and like reading many math papers, it happens a lot to me. I look at this, and then I wanna go and figure it out myself, but instead, I went and read the paper, and I'm gonna talk a little bit about how the paper does it, so this is somewhat ugly, but the basic idea is very simple. As I said before, the state and action components of these triples are actually vectors, so if we wanna pull them apart and make these trajectories into just strings of symbols, we have to pull apart the components of those vectors, and also, we have to discretize them.
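To make the discretization step concrete, here is a minimal sketch of one simple way to turn a continuous state or action vector into discrete tokens, using uniform binning. The bin count and value range here are illustrative assumptions, not the paper's actual scheme:

```python
def discretize(x, low, high, n_bins=100):
    """Map each continuous component of x to one of n_bins integer
    tokens by uniform binning over [low, high]. A simple stand-in
    for the paper's per-dimension discretization."""
    tokens = []
    for v in x:
        # Clamp to the range, then scale to a bin index in [0, n_bins - 1].
        frac = (min(max(v, low), high) - low) / (high - low)
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

# A 3-dimensional state becomes three discrete tokens.
print(discretize([-1.0, 0.0, 0.999], low=-1.0, high=1.0, n_bins=10))
```

Each vector component gets its own token, which is exactly the "pulling apart" described above.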
I'm gonna kind of roundly ignore this wrinkle moving forward, but it is an important part of the thing, that typically, states and actions might be continuously scaled variables, and in order to replace them with tokens, we've got to discretize them. We've got to give them individual discrete labels, so in the big, ugly expression down at the bottom, the bars just mean that you're looking at the discretized version of things, and all we're teaching our transformer model to do is simply to predict the progression of the sequence, so all of this ugly expression just basically says, "At each step, I wanna maximize the likelihood of each individual token in the sequence based on everything that precedes it." Now, everything that precedes it is not just the elements of the tuple that it's in, but all the tuples that came before it. That funny symbol, tau less than t, means take the full trajectory up to this point, so if I look at the first log probability there, that's really, let's try to maximize the probability of the ith component of the state vector in a tuple based on the other components of that tuple that came before it together with the full trajectory that came before it. Now, there are no actions in that same tuple before it because the states precede the actions in the tuples, and then, you can see in the next component, now I've got the state because the state has preceded the action in the tuple, so all that's really happening here is I've done exactly what I said in the little light bulb slide. I pulled these things apart, and now I'm trying to learn how...
I'm teaching a transformer model to learn the structure of how these sequences of tokens behave and how to predict where a sequence is going to go. Okay, so the most straightforward application of this is to imitation learning. In imitation learning, basically, what you wanna do is just produce trajectories like the ones that you've seen, so the basic idea is we have a bunch of experts or agents who are really good.
Fido is watching other well-trained dogs do things, and he's trying to learn how to imitate their behavior exactly, so the problem is exactly like next-sentence prediction in natural language processing, and I could literally stop here and say, "The rest is left to the reader," but I wanna just illustrate it a little bit by giving a little bit of the high-level picture of how it works, so basically, we do the same thing that we do in an NLP situation with transformers. We use teacher forcing to train. We also have to limit conditioning to what we can pay attention to. There is a limitation in the transformer technology, basically, that I can't take...
Most of what we're gonna be talking about is what we call finite time horizon trajectories, where it's a certain number of steps that we're actually able to look at and observe and pay attention to, and within that number of steps, I've got a limited window, so that's a practical limitation of the technology that we have to deal with. We use the Adam optimizer, and then we use beam search to build sequences. I'm gonna talk a little bit about beam search 'cause we're gonna modify it a little bit later on, but it's, for those familiar with transformers and the NLP applications, it's really just the same thing. It's exactly the same thing that we do with transformers, so okay, so what we wanna do in this imitation learning application is we wanna build up sequences that are very highly likely sequences because they're like the ones that the model expects because it's been trained on really good behavior, so you start with an initial state, and then you pick among the actions that, the entire set of actions.
You actually pick over the entire set of tokens. It could be rewards or whatever, but nothing other than actions is gonna come out with any nonzero probability, so you look at the things that have the highest likelihood of being next in the sequence, and you choose the most likely ones as your candidates for the next thing that you're gonna put into the sequence that you're building. This little picture on the left is a visual that describes a little bit of the root intuition behind beam search. You sort of cast a beam that's B wide. You pick the best ones, the ones that are most likely to be the next thing in the sequence, and then you just keep doing this, but as you build a sequence, you keep along with it the running sum of the log probabilities that are the reason it was selected as a candidate at each stage, and so, as I add to that total, what I'm really getting is the cumulative log probability that that sequence is a good extension of the thing that I started with. Once you get to the length you're trying to get to, so if I wanna get a sequence of length L, I then look at the ones that are still standing, the ones that I have retained, that my beam has sort of illuminated for me to look at, and I pick the absolute best one, so the idea here is just to, at each stage, build with what's most likely to be the best next thing.
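The procedure just described can be sketched in a few lines. This is a minimal, generic beam search, assuming a hypothetical `log_prob_fn` that stands in for the trained transformer's next-token log-probabilities:

```python
import math

def beam_search(log_prob_fn, vocab, start, length, beam_width):
    """Build a sequence step by step, keeping the beam_width partial
    sequences with the highest running sums of log-probabilities,
    then return the best complete one."""
    beams = [(start, 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = [
            (seq + [tok], score + log_prob_fn(seq, tok))
            for seq, score in beams
            for tok in vocab
        ]
        # Shine the beam: keep only the beam_width most likely extensions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda b: b[1])[0]

# Toy stand-in for the trained model: token 1 is always ten times
# likelier than token 0, so the best length-3 sequence is [1, 1, 1].
toy = lambda seq, tok: math.log(10.0 if tok == 1 else 1.0)
print(beam_search(toy, vocab=[0, 1], start=[], length=3, beam_width=2))
```

It's greedy in the sense that anything outside the beam at any stage is gone for good, which is exactly the myopia issue that comes up in the offline RL application below.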
It's a greedy algorithm, and, as I said, those familiar with the NLP stuff will know it's really the same thing, so imitation learning is very, very straightforward, and if it was me, that simplicity would be the root intuition that suggested this idea could actually work. So offline RL: now we're in a more classic situation where we wanna maximize cumulative rewards. We're not just trying to imitate behavior. We're trying to get the maximum cumulative reward of a trajectory, so there, the idea is you use reward instead of token probability. What we saw in the last example was, as we extend our sequence, we basically just use probability, the highest-probability next thing. That's the thing that we tack on.
In this case, we wanna, instead, just use reward, and instead of selecting the most likely trajectories, we wanna select the trajectories that are most likely to be optimal, that are most likely to give us the best reward, so here's another place where, if I had more time, I would've just stopped there and tried to figure it out myself, and for the reader, it's kind of a cool problem to pose for yourself. How would you do that? It's not really obvious the best way to do it. How do you set up a beam search to make it work? And then, the other thing is how do you avoid myopia? The greedy algorithm can be myopic, so just think about this. Suppose that at each stage when we're making an action decision, we say, "Okay, let's pick the ones that give the best reward." Then, the next action I take, what's gonna give me my best reward? And I choose. I sort of shine my beam on the most rewarding extensions that I can see right in front of me.
The problem is those things might not be the best, the most long-term rewarding things. I might go for a near-term gain and miss the long-term reward, so I've somehow got to work around that little problem, so here's what they do. What they do is add a reward-to-go as another token. This is just, it looks like cheating, but what the heck.
You can do it. You have the data, so you train your trajectory transformer over a discretized set of, now, four-tuples, because what you've done is you've added on the reward-to-go. The reward-to-go is just the cumulative reward from point t forward. Because you're building these things up from trajectories that have already been fully completed, you can actually estimate the reward-to-go and add it in as another discretized element at each point, and you can train your transformer to understand the structure of that, and then you use your beam search over probability-filtered transitions, where your successor decisions are based on total reward, the reward-to-go plus the reward that you're going to get immediately, and there is a slight wrinkle here because what you're doing is you're actually sort of changing the vocabulary to be, now, the entire tuple sequence, as opposed to building up the full trajectory one single component at a time.
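Here is a small sketch of that modified selection rule: filter candidate transitions by model probability so you stay in distribution, then rank the survivors by immediate reward plus estimated reward-to-go. The candidate tuples and the probability threshold are invented for illustration, not values from the paper:

```python
def plan_step(candidates, prob_threshold=0.05):
    """Pick the successor transition by total expected reward
    (immediate reward plus the model's reward-to-go estimate), but
    only among transitions the model thinks are likely enough, which
    keeps the plan in distribution. Each candidate is a hypothetical
    (action, probability, reward, reward_to_go) tuple."""
    likely = [c for c in candidates if c[1] >= prob_threshold]
    return max(likely, key=lambda c: c[2] + c[3])[0]

# A likely action with modest immediate reward but a large reward-to-go
# beats the myopic pick with the best immediate reward.
candidates = [
    ("myopic",   0.40, 5.0, 1.0),  # best immediate reward, total 6.0
    ("patient",  0.50, 2.0, 9.0),  # best total reward, 11.0
    ("unlikely", 0.01, 9.0, 9.0),  # filtered out: out of distribution
]
print(plan_step(candidates))
```

The reward-to-go term is what removes the myopia of the plain greedy search, and the probability filter is what keeps the chosen trajectory one that "could occur in nature."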
You're focusing only on the action decisions, but again, it's the same beam search approach, where you're looking to choose your successor tuples, the tuples that could come from where you are in the progression, based on the expected reward, so this has some nice properties. First, it is not myopic because you've now sort of smashed the reward-to-go into the equation, so to speak. The other nice thing about it is that it keeps the actions and rewards in distribution. We'll talk more about that at the end because it's really a powerful thing. If you look at the picture here, the only things that you can choose are actual trajectories, actual trajectories that could occur in nature, so to speak, and that results in some nice properties of the predictions, kind of holding things closer and keeping you from diverging and making predictions that are out of model. Okay, so I'm gonna just really quickly say a couple words on policy gradients because this is an alternative. This is a classical approach to model-based RL, but there are many, and there's no way in the world I can do justice to the rich and robust field of model-based RL, but I wanted to present a little bit about it because it's an interesting contrast, and one of the things I wanna think about in the discussion is how do these things really relate? So in model-based RL, you start with the optimization problem that we've already presented. Basically, all you wanna do is find a policy that, subject to the dynamics governed by p, maximizes your expected total discounted rewards, and the policy gradient approach to this is to use a sampling-based approach where you fit a model, you estimate the return, and you improve the policy.
You then generate samples from the policy. Then you refit the model. Then you take another gradient step to improve, and then you continue to iterate until your improvements don't seem to yield very much. So the REINFORCE algorithm, which is kind of the core of the implementation of policy gradients, basically, the thing that I wanna convey here, primarily, is that it's really a method of adjusting the parameters of the policy so that the more beneficial long-term trajectories end up favored, and the core underlying technology is essentially gradient descent, so what you're doing is you're weighting. That term on the right, in number two, is the total reward that's returned back from executing that policy with those parameters, and you're pushing the gradient so that you end up with...
Because of that weight, you're pushing the gradient so that you end up favoring trajectories that have higher returns. The other thing that's important to notice here is that the samples are generated by running the policy, so it is essentially a statistical sampling approach, and notice that that sampling is different from the sampling that we looked at before in the trajectory transformer model, where you're not sampling; you're directly analyzing the structure of the observed trajectories that go into training the trajectory transformer. I wanted to present this just to get the wheels turning in terms of how the problem is looked at a little bit differently. Okay, so the problem with what we just saw and most kind of feed-forward approaches to RL is that the errors tend to propagate, and this is an example where the bars really show the divergence of the trajectory from the kind of target trajectory, and very small errors in the computation that's performed in execution of the policy can lead to a kind of rapid divergence. Now we're gonna move into the experiments that were done with the trajectory transformer, and kind of the exciting aspect of it is that, unlike the feed-forward case we just saw, here's an example of trying to kind of replicate or reproduce a target reference trajectory with the transformer-based approach.
We get something that's much more stable, and this is a picture that's right out of the paper that shows really nice fidelity in the trajectory transformer. So they did a whole bunch of benchmarks using the OpenAI Gym experiments, which really amount to simulated environments where robotic-like entities work within a dynamics-governed environment, and the goal is to kind of reproduce a desired behavior, and you can see in the picture here the trajectory transformer against several other leading models. It performs fairly well, and it also has fairly narrow error bands in its performance, so this is extremely promising, and it sort of indicates that, with this being really, really initial work and very little kind of optimization and tuning, the results are actually very good. Okay, so this is an example.
This is Walker2d. That's an example of the dynamics environment that these things are executing in. Okay, so now I wanna make a few observations and ask some questions, and we can move into the discussion. First, there are some real benefits to this approach.
The first is that it can handle very large state spaces, similar to the world of NLP. The state spaces can be large, and this can be a challenge in RL, so sort of check. It can handle long trajectories, and part of the stability that enables it to handle long trajectories is the fact that it kind of forces action decisions to be made within distribution, and reward attribution is less of a challenge. In a sense, it's almost irrelevant because, with the bidirectional nature of the setup, the rewards can be sparse. The rewards can be separated from the actions that produce them, and we can still end up effectively solving the optimization problem, and, as I said, it avoids out-of-distribution action selection. Finally, it can exploit large models and large data sets, so one of the things that has been very, very successful in using transformers in NLP is we keep getting returns to scale, so to speak: as we build larger and larger data sets and larger and larger models, they get stronger, and so because the underlying technology has these nice returns to scale, there's reason to be excited about this stuff.
The challenges are, first, the same knock that you have on large language models. If the dynamics of the environment are stable, if all of the underlying distributions that govern reward and govern the dynamics, the state transitions, if all that is stable, like, say, in a physical environment that is not subject to lots of environmental change, that's fine, but in lots of other applications of RL, in places that we've used RL successfully, we've used it precisely because we've got a dynamic environment and the distributions change, so in those kinds of environments, it's not obvious how you'd manage. You'd have to be sort of retraining, and the training is somewhat expensive. It may be impractical for real-time control in today's world. That's something that the real hardcore can get excited about, 'cause it's one of these things that's really a computational challenge, and there's no reason to expect that challenge won't be solved.
It forces discretization, so I kind of glossed over this, but it's something to think about. For each possible application of this technology, you're gonna have to discretize and you're gonna have to tokenize. Will it generalize? That's a pure root-intuition question.
We don't have huge experience applying it in real-world scenarios, and so whether or not it does generalize, and in what environments it will successfully generalize, is an interesting question to ask. Also, base model reuse is not obvious, so a lot of the successful applications of NLP come down to building large base models, investing significantly in building a base model, and then fine-tuning it for other applications, and in the discussion, I wanna talk about that, but it's not completely obvious how that will work in this environment. Okay, so now I wanna jump into questions and just invite people to respond, to really have a discussion. My question number one is more of a theoretical question that I don't know the answer to. I've thought about it, but don't really have anything meaningful to offer back, which is: whenever we develop a new technique to solve a class of problems that have been solved in other ways, that invites the question of for what class of problems it comes down to essentially the same thing as the other technique, so an interesting theoretical and practical question to ask is for what class of problems do you end up with essentially the same solutions as pick your favorite classical RL algorithm? The second is for what domains are there canonical tokenization strategies? So one of the things that...
You get very excited when you look at this idea, at least I did, and you can see that, "Oh, wow, all you have to do is convert these things to strings, and then, bang." Well, in language, we have the advantage that there are sort of natural tokens (indistinct). It's a language. There's a natural tokenization that we're working with that is embedded in the language models, and if you're gonna build a large pretrained language model, you use the word pieces.
That's fine. We can work with them. In RL domains, it's kind of not obvious how...
Are there canonical tokenizations that are gonna work across a large range of things? Physical systems, possibly, but the sort of interesting question I scratch my head about is, "Well, what other kinds of environments, and how does that work?" Given a large model, for example, of conversational strategies, how, exactly, would fine-tuning work for an individual application to a domain or customer-specific trajectory? So this is kind of the practical stuff for applying this to a specific RL problem, but this is one example of an RL problem. In a lot of cases, what we do is we build, as I said before, large pretrained models, and then we fine-tune them, so how would that work, actually, here? And then, finally, what determines generalizability? Okay, I'm gonna stop here and open it up for any questions that anybody has. - [Ruth] So we have a question, actually, that Erion sent while you were talking about offline RL, so the question is, "Would Q-learning be a way of managing the second problem here, introducing some randomness into the next action to take?" And Erion, if you're still online, I don't know if you want to give us a little bit more context, or are we good? - Yeah. I'd be very interested in hearing what he has to say about how that might work. Oh, he says his mic is not working.
Can you unmute him? - I can only ask him- - All right. - [Ruth] to unmute. Oh, there you go. - Yeah, yeah. There you go. - Cool. I was just about to type it.
I meant in terms of the generalization problem that you've mentioned here on the last slide, so how can you make the system appropriate for different paths that may come up? So I was thinking in terms of a predictive-text kind of situation. If you were gonna talk about a different domain or a different subject, how would it know the domain-specific terminology? - So yeah, unpack that further. What do you mean? What would you see? We're talking about two different things, two challenges here. One is just generalizability in general, but the second is sort of the fine-tuning, and one of the challenges that I saw there with the fine-tuning is essentially extending the vocabulary, so explain to me what you mean by potentially using Q-learning to do that.
That's interesting. - [Erion] Oh, my understanding is that Q-learning introduces some randomness to the path that you take, and you're not necessarily always going for the most reward, and sometimes, you'll take a lower reward in an effort to explore other paths. - Yeah. Yeah.
That's definitely something that could be composed on top of what is happening here. It's still not obvious to me how the... And maybe I'm just thinking about it wrong. Maybe I'm being overly constrained by kind of the standard way that we do specialization and fine-tuning in NLP, and that's a good reason to have this conversation. What you could do is, don't look at it as a fine-tuning problem. Look at it as sort of an exploration problem in an adaptation to a different domain, and use more of an ensemble technique where you've got Q-learning on top.
That's a very good point. Could work. - Thank you. - Thank you. Okay, what else? - We have another one, this one also on the RL subject, and this one comes from Anna: "In RL, you have to train for every specific task. Transformers made a breakthrough in transfer learning for solving language-related tasks. In language, we had one task: predict the next word in the sentence. This task is used as a source task in any kind of transfer learning. Could we replicate this idea with transformers in RL, train the model for a general problem, and reuse it for related tasks?" And Anna, if you are here, if you want to add anything else to your question. - Uh, yeah- - Yeah, why don't you go ahead and answer it, actually, Anna. (all laughing) - [Anna] Yes, I'm here, so to add more context for this question: we are using models that are available to the AI community, and with transformers, we had the ability to predict the next word in a sentence, and these models were used for a variety of different applied tasks. In RL, we have a specific bottleneck, as I think about it, in terms of the data sets that you need to have to train for specific multitasks, so is it doable in general? That was my thinking behind it: could we have some general model in RL that we could reuse, the same as we use transformers for transfer learning, for solving not only the specific tasks it was trained for, but others, the way we use these general transformer models for NLP tasks? - Yeah, it's a really good question, Anna, that I don't have the answer to, and it's one of my kind of head-scratchers about this whole thing, and that's why I was sorta trying to indicate that it's not obvious to me, given that the reward structure is essentially built into the training in the way the trajectory transformer is trained, but I keep thinking that it's just a limitation of my own imagination, so I'd be very interested to hear ideas that others may have about how you might do that, and think about how we actually do it in NLP.
How do we take the very sort of canonical example of next-token or next-sentence prediction and turn it into a solution to a broad additional class of questions? And so what you need to think about, at least as far as I've gotten in my thinking on this, is a lot of it comes down to actually backing up and thinking about a general domain where you may be applying this stuff, and a canonical tokenization and a canonical representation of reward, so that we can then do the kind of thing that we do to go from next-sentence prediction to, say, solving classification problems, but I'd really be interested in other people's ideas about this, and I'm excited because this is the reason that I wanted to talk about it today, 'cause I wanted people to start thinking about this stuff. - [Anna] Thank you for answering and for sharing. I also wanted to add to the thinking behind it: in the AI community, people say that data sets come first, so your historical experience data, something that you would feed your model with, and I'm not sure the industry is ready right now. We don't have a lot of open source data sets, especially in RL, because RL has implementation specifics that you need to take into consideration in the data set, while for the variety of tasks in NLP, we had more data sets that we could use and leverage, so for right now, the thing I could name that is more or less holding things back is that we don't have this variety and availability of open source data sets that we could use to train our models on.
- It's definitely a very good point, but it's not insurmountable. If the right ideas emerge about how to do the tokenization and the representation part, it may split into different domains, each with canonical representations that then yield large open source data sets and large pretrained models. There's no theoretical reason that couldn't happen, but, being very honest about the limitations of my own intuition and understanding, what I don't see is how you'd actually do that in different domains.
One thing I didn't mention is that, in an act of great awesomeness, the authors of the paper have put all of the code on GitHub (there is a link in the slides here), along with a bunch of training sets. Now, those training sets are artificial environments, for OpenAI Gym and that kind of simulated control environment, just to illustrate the promise of the technique, but the interesting open question is where it's gonna go from here, so I'd be interested in anybody else's input and ideas about this, or anything else. - [Ruth] So we have another question that's also related, but not an answer to this specific one. This one comes from Michael: "Are there any circumstances or fields where we would consider an RL agent focused on minimizing punishments or sanctions rather than on maximizing reward? I believe there are dangerous fields with big risks where maybe we would want to see it the other way," so Michael, I don't know if I butchered your question, so add anything if you want, but I think it's about seeing it the other way around, maybe.
- Yeah, I think I get the gist of the question. I can respond, as a naive mathematician, that yes, you can express the reward any way you want, and it could be applied to anything, and there is no question that these technologies, like any AI technologies, can end up applied in ways that are not socially beneficial, and that also carry safety risks. There's a whole emerging field of AI safety that this will definitely come under, just like any other AI technology, and it's a very good point that those of us who are working with these technologies, those of us who are stewarding their application, need always to be thinking about: What is it that we're doing? What is the application that we're allowing? Unfortunately, like all technology, once some new thing has been developed, the cat is sort of out of the bag. I'm an open source guy.
I believe in the value of sharing knowledge and different approaches to technology, but at the same time, I believe in our responsibility, as practitioners, to make sure that the way we frame problems, and the canonical applications that we develop and present, are things that are gonna make people's lives better and are not going to discriminate or cause harm to classes of individuals. It is a good point, though. It really applies to any AI technology, any RL technology. - [Ruth] And to focus on the questions that you're posting here, Phil, I have one. I don't know if I'm gonna hit it right, but for the second one, about canonical tokenization, would that be maybe financial systems? Or, for example, biosystems, where people are learning how cells, I don't know, distribute, I guess. I dunno, I'm just saying maybe those are ones that could be canonical. - Those are good examples, Ruth.
That's essentially what I was thinking about, but I haven't thought deeply about how the tokenization would work, and in particular, how the rewards could be represented in such a way that we could end up with some kind of reuse, so those are good examples. If you dig into the GitHub and look at the data sets, they're these sorts of robotic control environments, and probably there are standardizations that can actually happen there. Somebody asked, "Some people at DeepMind think that RL is enough to build artificial general intelligence. Could you please share your thoughts on that?" I'm interested in other people's thoughts on that. Maybe, Avneet, you can answer.
- [Ruth] Let's see, Avneet. - Yeah, thanks, Phil. I'm in (indistinct).
There are a lot of tasks that humans do that AI is now able to do, so vision and speech can definitely be done, and with RL, within two days of training, it can beat humans at Atari games. It can also solve those cube-based puzzles with a robotic hand, and the safety levels of some self-driving systems are 10 times better than humans', but all of that is still narrow AI. What is interesting is that if we teach a virtual humanoid to walk, the very same algorithm can also be used to train a spider to walk, so it seems like, while it's difficult to get an RL algorithm to work, once that is done, it's easily transferable to similar tasks. I, personally, don't have a very good intuition about whether this will itself lead to general intelligence, but it seems like there are a lot of narrow tasks where it is becoming pretty good. - Yeah, I would certainly agree that the advances are impressive.
The one thing that gives me a little bit of pause is something that is very evident in what we just talked about: the whole setup really does require this consistent reward signal. That enables these RL systems to perform very well in a lot of tasks that can be described as environmental reward conditioning tasks, like the humanoid that's walking, like the self-driving car, like winning the game, and there's a large leap between that and what I would classify as general intelligence. Again, it's a slippery concept. The intelligence for figuring out how to win ever more complicated games, how to navigate ever more complicated mazes, how to drive through ever more difficult terrain, all of that is definitely within reach.
That's a foregone conclusion. It's just plain gonna happen, but a lot of what's been happening in natural language production is very interesting, and being able to produce things that really amount to creative works, or to engage in the kind of dialogue that we're having here, there's a leap between that and simple reward-conditioned behavior. Then again, I'm often wrong about this stuff.
I was wrong when I thought about the limitations of applying deep learning to NLP, and the advances there have been significant, but my intuition is that something else is gonna be required: additional, complementary techniques are going to be needed to get to something that I would classify as general intelligence. - I thought of another example of the second one while we were talking about this, and it's, for example, traffic systems. That could be a canonical tokenization, where we're thinking about the efficiency of lights at different intersections. Imagine that all the systems were connected so that we could make it efficient, maybe.
That could be one where I think a human cannot do it, but the AI might come in and really help. - Yeah, yeah, that's another great example. The challenge, then, and I was intrigued by Michael's idea at the beginning of taking a more ensemble approach, is that it's easy to think of different domains where you could set up the tokenization, set up a model, take a trajectory transformer, train it, and have success, but, as Anna said, training for every specific environment and building up the open source data sets is a heavy lift, so we've gotta figure out a way to make transfer learning work in a way that's more like what we do in NLP.
- So we have only two more minutes, but we do have a question, and it's really technical. This one comes from Juan Ramirez: "What would be a sufficiently large trajectory set, given the size of the state space and the size of the action space, in order to create a sufficiently rich token sequence for the transformer training?" - That's an interesting question. In a sense, it's like a hyperparameter selection question, and the answer is that it depends on the dynamics of the underlying system.
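One piece of Juan's question can be made concrete with back-of-envelope arithmetic: assuming one token per state dimension, one per action dimension, and one for the scalar reward at each timestep (the kind of per-step layout the trajectory transformer uses), the sequence length the model has to handle grows linearly in all three. The function name and example numbers below are illustrative assumptions, not from the paper.

```python
def tokens_per_trajectory(state_dim, action_dim, horizon):
    """Token count for one discretized trajectory, assuming one token
    per state dimension, one per action dimension, and one for the
    scalar reward at each timestep."""
    per_step = state_dim + action_dim + 1
    return per_step * horizon

# A HalfCheetah-like control task: 17-dim states, 6-dim actions,
# 1000-step episodes.
total = tokens_per_trajectory(17, 6, 1000)  # 24 tokens/step * 1000 = 24000
```

How many such trajectories you need for good coverage is the harder, dynamics-dependent part, which is what the rest of the answer gets at.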
It depends on... Imagine that there's only one trajectory. Oh, you're only gonna need one.
No, I mean, I'm a mathematician, I confess, and so, facetiously, if there's only one trajectory, a constant model fits it, but that's not a serious response. The right answer is that it's gonna depend not just on the number of states, not just on the length of the trajectories, but on the underlying dynamics, and you would end up doing the same thing that you do when you train language models: observe how the gradients are moving, observe the stability of the models that are generated, so, just as happens in NLP, it will depend on the domain. That's my intuition. In the paper, they have data sets and a hyperparameter set for those robotic control environments, and you can observe what data is there, but that's another interesting set of questions that comes out of this. Given specific domains, how do we determine the size? Just like in language models: how many hours of audio do you need for a good transcription model? It depends somewhat on the quality of the data, all that stuff. - Fantastic. - That's a good question,
though, okay? - Thank you, Phil. We're right on time, and thank you to the whole team at Nextiva for showing up, and thank you to all of our external guests. We will be making the recording available via a link in the PDF so that you can click on that, and then, if you don't work for Nextiva, please check out the careers page at nextiva.com, because we have a bunch of different roles, and don't be fooled by the job descriptions.
We're looking for everyone from juniors to principal engineers. We're looking for really, really talented people, really smart, creative people, so just send us your CV, and we hope you join us for the next episode of the "AI Paper Talks." Thank you, Phil. Any last words? - I will just repeat: hear, hear. Come on in; the water's fine. We are hiring.
We're looking for people who are interested in this stuff. We dig in, and we work with these technologies, and it's an exciting time at Nextiva. On one of the questions about applications of this stuff at Nextiva, I can say with a lot of pride that our objective is to help humans connect, and everything that we do is about powering human connections. Our applications of AI technologies are not designed to replace humans.
They're not designed to trick them into doing things. They're really designed to help people understand each other and connect better. All right, thanks, everybody. - That's awesome. - Thanks for joining. - Thank you, everybody. (lively music) See you next time. Bye.