AI Disruption of Quantitative Finance: From Forecasting, to Generative Models to Optimization
Hi everyone, I am Nima Nooshi, and I'm a customer success engineer at Databricks. Today I will be talking about reinforcement learning applied to the problem of financial portfolio optimization. The ultimate goal would be to design and implement an automatic trading bot with Spark. However, I figured that before I jump to the implementation, I would go through the theory of portfolio optimization and how AI, and in particular reinforcement learning, can be applied to it. I will be talking about the specific implementation in the following talks, which I will be presenting at upcoming Spark and AI conferences.
Today's agenda is the following. First, we define financial portfolio optimization and discuss how it has been approached with stochastic optimization methods, started by Markowitz in the 50s. Then we talk about how we can translate the problem of portfolio optimization into a framework that reinforcement learning can be applied to, namely a Markov decision process. Here we talk about some specifics of financial markets that bring some challenges to the general formulation of the MDP. After that, we talk about model-based and model-free reinforcement learning, how these models can be applied to the optimization problem, and some of the pros and cons of those algorithms. That puts us in a good position to start with a decent implementation in the following sessions.
Okay, what is financial portfolio optimization? It is a subcategory of the broader class of fund allocation problems. Imagine something like an index fund, which tries to construct a portfolio according to some predefined index weights. Or an even more straightforward scenario: an equally weighted portfolio, which rebalances the holdings of the portfolio each day such that each individual asset gets an equal share of the initial capital.
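To make the equally weighted scenario concrete, here is a minimal sketch of one daily rebalance. The prices, the capital, and the function names are hypothetical, and the sketch assumes that the rebalance splits the current portfolio value equally and that trading is free of costs.

```python
def equal_weight_holdings(prices, capital):
    """Units of each asset such that every position has the same dollar value."""
    dollars_per_asset = capital / len(prices)
    return [dollars_per_asset / price for price in prices]


def portfolio_value(holdings, prices):
    """Mark the holdings to market at the given prices."""
    return sum(units * price for units, price in zip(holdings, prices))


# Hypothetical prices of three assets on two consecutive days.
day0 = [100.0, 50.0, 20.0]
day1 = [110.0, 45.0, 20.0]

holdings = equal_weight_holdings(day0, capital=3000.0)
value_day1 = portfolio_value(holdings, day1)

# After one day the positions have drifted apart, so reset each to an equal share.
holdings = equal_weight_holdings(day1, capital=value_day1)
```

After the second call every position is again worth one third of the portfolio, which is all that "equally weighted" means here.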
Index funds and equally weighted portfolios are examples of capital-based fund allocations. One can think of other allocation methods which are not focused on the dollar amount of positions. For example, instead of allocating the same amount of capital to each individual asset to come up with the equally weighted portfolio, one can think of a portfolio which is fully diversified with respect to some measure of risk. Here, instead of having the same dollar amount for each position, we have an equal amount of risk.
In general, one can optimize some function to construct a portfolio. Imagine that you want a portfolio which can extract the maximum value of returns, or that you want to maximize expected return per unit of volatility at the end of the investment time frame, because you do not only care about the return but also want to manage risk.
So we can formulate the generalized problem as follows. Given a set of m assets with a defined price history, and an initial portfolio, or an initial endowment, that you have at time zero, you want to find an allocation which maximizes some objective function, which is denoted by gamma here. This function could simply be expected return, which maximizes your wealth at the end of the investment period, or it could be expected return divided by some measure of risk like volatility.
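The two objective functions just mentioned can be sketched on a handful of return scenarios. The scenario numbers, the candidate allocations, and the function names below are made up for illustration; the sketch only shows that different objectives can prefer different allocations.

```python
import statistics


def portfolio_scenarios(weights, asset_scenarios):
    """Portfolio return in each scenario of end-of-horizon asset returns."""
    return [sum(w * r for w, r in zip(weights, scenario))
            for scenario in asset_scenarios]


def expected_return(terminal_returns):
    """Objective 1: expected return at the end of the horizon."""
    return statistics.mean(terminal_returns)


def return_per_unit_risk(terminal_returns):
    """Objective 2: expected return divided by volatility."""
    return statistics.mean(terminal_returns) / statistics.stdev(terminal_returns)


# Hypothetical end-of-horizon returns of two assets in four scenarios.
scenarios = [[0.10, 0.02], [-0.05, 0.01], [0.20, 0.03], [-0.10, 0.02]]
candidates = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

best_by_return = max(
    candidates, key=lambda w: expected_return(portfolio_scenarios(w, scenarios)))
best_by_ratio = max(
    candidates, key=lambda w: return_per_unit_risk(portfolio_scenarios(w, scenarios)))
```

With these numbers the first objective picks the volatile asset and the second picks the quiet one.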
More generally, it could be any other function which describes the investment goal. Objective functions are, in general, some function of the projected distribution of the asset returns at the end of the investment horizon. This means that the problem of portfolio optimization is naturally a stochastic optimization problem.
One can solve it in two ways. You can think of a static optimization, where you project the estimate of the next-period returns all the way forward to the end of the investment horizon, come up with the distribution of your returns at the end of the horizon, and solve the optimization just once for that. Or you can estimate the next-period returns dynamically and solve the optimization problem sequentially, which sets up a dynamic optimization paradigm.
It was Markowitz who pioneered the attempts to solve the stochastic optimization problem described in the last slide. His framework consists of two steps. In the first step, no matter which objective function you have, you solve the mean-variance optimization problem. It is a constrained quadratic optimization which minimizes the portfolio variance while constraining the expected portfolio return to a target level. The solutions for different values of the target mean define a curve in the mean-variance plane, which is called the efficient frontier. Once you have the efficient frontier, the second step is a one-dimensional optimization problem: optimizing the custom objective function along that curve in the mean-variance plane.
One might think of using this method to periodically solve the static optimization problem, to make it a dynamic setup. It seems like a viable option, until you try it in the real world, because in the real world you have transaction costs.
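The two-step mean-variance procedure can be sketched as follows. This is a minimal illustration, not the method from the slides: the expected returns and covariances are hypothetical, short sales are allowed because only the two equality constraints are imposed, and the custom objective in step two is assumed to be expected return per unit of volatility.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def min_variance_weights(mu, cov, target):
    """Minimize w'Cw subject to w'mu = target and sum(w) = 1 via the KKT system."""
    m = len(mu)
    kkt = [[2.0 * cov[i][j] for j in range(m)] + [mu[i], 1.0] for i in range(m)]
    kkt.append(list(mu) + [0.0, 0.0])
    kkt.append([1.0] * m + [0.0, 0.0])
    rhs = [0.0] * m + [target, 1.0]
    return solve_linear(kkt, rhs)[:m]


def portfolio_mean(w, mu):
    return sum(wi * mi for wi, mi in zip(w, mu))


def portfolio_variance(w, cov):
    m = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(m) for j in range(m))


# Hypothetical inputs: three uncorrelated assets.
mu = [0.04, 0.08, 0.12]
cov = [[0.01, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.09]]

# Step 1: one minimum-variance portfolio per target mean (points on the frontier).
targets = [0.04 + 0.005 * k for k in range(17)]
frontier = [(t, min_variance_weights(mu, cov, t)) for t in targets]

# Step 2: one-dimensional search along that curve for the custom objective.
best_target, best_w = max(
    frontier, key=lambda tw: tw[0] / portfolio_variance(tw[1], cov) ** 0.5)
```

Re-running both steps every period with fresh estimates gives the periodic setup described above; nothing in this sketch accounts for the cost of moving from one period's weights to the next.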
With transaction costs, the problem is that the solutions to the optimization for two consecutive periods might be so far away from each other that rebalancing the portfolio, and the cost of those transactions, leads to a sub-optimal policy. In fact, myopic optimal actions can cause sub-optimal cumulative rewards at the end of the period.
Now that we have talked a little bit about the portfolio optimization problem, how it was formulated in terms of stochastic optimization, and the attempts that were made to solve it, we can discuss how to formulate portfolio optimization as a Markov decision process and apply some reinforcement learning methods to it.
So, first of all, how is an MDP defined? Let's assume a setup where, at each time step, an agent starts from an initial state and takes an action, which is some kind of interaction with the environment, and the environment gives a reward to the agent and changes the state. If the state transition probability, which is determined by the environment, is only a function of the current state and not of the whole history up to this point in time, the dynamical system is called a Markov decision process.
How does it look for a trading agent? At the beginning of each period, the agent has to rebalance the portfolio and come up with a vector of asset holdings. This defines the action: the action of a trading bot is directly the portfolio weights that it comes up with at the end of each period.
What about the reward? What is the reward function of the environment? In general, identifying the reward is a little bit more challenging. A reward is basically a scalar value which fully specifies the goals of the agent, and maximizing the expected cumulative reward over many steps will lead to the optimal solution of the task. Let's look at some examples. Take games: the goal of the agent is very well defined, since you either win or lose a game, and it can be divided into separate reward signals for each time step. If you win the game at the end of a time step, you get a reward of one; if you lose the game at the end of a time step, you get a reward of minus one, for example; and you get a reward of zero otherwise. So it is very well defined and very well divisible into separate time steps. However, take a trading agent who wants to maximize return but, at the same time, does not want to expose his fund to extreme market downtrends and crashes. He does that, for example, by managing the value at risk of his portfolio. The objective of the agent is clearly defined, but dividing this objective into sequential reward signals might be a very challenging task.
Now let's talk about the state and the observation. At any step we can only observe asset prices, and the observation is given by the prices of all assets. This is clear. We also know that one-period prices do not fully capture the state of the market: you cannot predict the whole state of the market by just looking at yesterday's prices, for example. This makes financial markets a little bit more challenging. In general, financial markets are not a fully observable Markov decision process; they are only partially observable, because as agents we can only observe the prices. What it means is that the state the agent has is completely different from the state of the environment, and there are some solutions for building the environment state from the state of the agent. The most obvious solution is to build the state of the environment from the whole history of observations, which is not scalable. Or we can approximate the environment state by some parameterized function of past observations. When we are working with time series, as we do in financial markets, it is natural to assume that the state-generating function is not only a function of the observations but also a function of the past states. So we think of models which have some kind of memory.
Let's look at some examples, starting with GARCH models, which are widely used in quantitative finance.
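Before the examples, the pieces of the trading MDP described so far can be summarized in a small sketch: the action is a vector of portfolio weights, the observation is the next vector of prices, and the reward is assumed here to be the one-period log return of the portfolio. The price path, the function names, and the choice to hand the policy the whole observation history (the non-scalable option mentioned above) are illustrative assumptions, and transaction costs are ignored.

```python
import math


def step(prices_now, prices_next, weights):
    """One MDP step: apply the weights, observe new prices, collect a reward."""
    gross = sum(w * p1 / p0
                for w, p0, p1 in zip(weights, prices_now, prices_next))
    reward = math.log(gross)        # assumed reward: log return of the portfolio
    observation = prices_next       # the agent only ever sees prices
    return observation, reward


def run_episode(price_path, policy):
    """Cumulative reward of a policy that maps the observation history to weights."""
    history = [price_path[0]]
    total_reward = 0.0
    for t in range(len(price_path) - 1):
        weights = policy(history)
        observation, reward = step(price_path[t], price_path[t + 1], weights)
        history.append(observation)
        total_reward += reward
    return total_reward


def equal_weight_policy(history):
    n_assets = len(history[-1])
    return [1.0 / n_assets] * n_assets


# Hypothetical price path of two assets over three dates.
path = [[100.0, 100.0], [110.0, 90.0], [121.0, 81.0]]
total = run_episode(path, equal_weight_policy)
```

With log returns as rewards, the cumulative reward is the log of terminal wealth relative to initial wealth, which is why this particular reward divides cleanly across time steps. A quantile-based goal such as value at risk has no equally obvious per-step decomposition.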
GARCH models are constructed in this way. Assume that the state of the market at each time can be fully represented by the volatilities of the individual assets. This is an assumption that says: if you know the volatilities, you know the full state of the market. Under that assumption, GARCH models build a rather simple mapping of past volatilities and current observations, which are the prices, to generate the volatilities for the current time step. Therefore they can fully build the state of the market from the past observations and past states.
We can look at other models. In the continuous domain, stochastic volatility models do the same. They build the volatilities, which are the hidden states of the market, by fitting a kind of stochastic process to the volatilities. In this way they are able to generate the hidden states and a full representation of the market.
But obviously one can use a more sophisticated featurization of the hidden variables, or hidden state, of the market. It does not have to be as simple as just volatilities; one can have a complicated representation of it, and neural networks, for example, can build those kinds of complicated models of the market state. The common thing among all these models is that the state of the environment is built using the past observations and past states, and that the state of the agent at the current time is not enough to come up with the whole state of the financial market, or with the returns for the next period.
Okay, that was the MDP formulation of portfolio optimization; next I want to go through some of the main components of reinforcement learning.
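As a recap of the state construction in this part, the idea of a state built from the previous state and the new observation can be written compactly, with GARCH(1,1) as one instance. The notation is an assumption for illustration (it is not taken from the slides): returns are treated as zero-mean, and the state is the variance forecast formed after the latest price is observed.

```latex
% Generic recursive state: new state from the previous state and the new observation
s_t = f_\theta\left(s_{t-1},\, o_t\right)

% GARCH(1,1) as one instance, with \theta = (\omega, \alpha, \beta)
r_t = \log\frac{p_t}{p_{t-1}}, \qquad
\sigma_{t+1}^2 = \omega + \alpha\, r_t^2 + \beta\, \sigma_t^2,
\qquad \omega > 0,\ \alpha \ge 0,\ \beta \ge 0
```

Here the observation is the price p_t, the previous state is the variance \sigma_t^2, and the new state is \sigma_{t+1}^2, so the model carries memory through a single number per asset.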
Going through the reinforcement learning components puts us in a position to come up with the algorithms that we eventually want to implement.
Policies. A policy is simply a mapping from a state that an agent experiences to an action that it takes. It could be a deterministic policy, which means that if an agent finds himself in a certain state, he will always take a certain action. Or it could be a probabilistic policy, which means that he will choose a certain action from the spectrum of all possible actions with some predefined probability.
Value function. The value function is defined as the expected amount of reward one can get from an MDP, starting from a state and following a certain policy. For example, if we define the reward of a trading bot to be just the log return of the portfolio at the end of each time step, the value function would be the expected cumulative return at the end of the investment horizon.
And models. A model is just the agent's representation of the environment, and it defines the transition probabilities of the state of the environment. For example, if you assume that the next-step returns of the financial time series follow a Gaussian distribution, the model of the environment is fully defined via the transition probability of a Gaussian distribution.
Now that we have all the ingredients in place, we want to talk about model-based reinforcement learning in portfolio optimization: how the setup looks and how we can build algorithms based on it. We start from our familiar MDP setup, where an agent interacts with the environment and gets rewards based on the actions it takes. But now the idea is that the agent first tries to learn a model of the environment from the transitions he has been experiencing, rather than optimizing the policy directly.
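The three ingredients (policy, value function, model) can be put together in a short sketch. Everything specific here is an assumption for illustration: two assets, independent Gaussian next-step returns with made-up means and volatilities, a small set of candidate weight vectors, and a Monte Carlo average as the value estimate.

```python
import math
import random

CANDIDATES = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]


def deterministic_policy(state, rng):
    """Always the same action in a given state (here: in every state)."""
    return [0.5, 0.5]


def stochastic_policy(state, rng):
    """Pick an action from the candidates with predefined probabilities."""
    return rng.choices(CANDIDATES, weights=[0.2, 0.6, 0.2], k=1)[0]


def gaussian_model(state, rng):
    """Model of the environment: next-step returns are independent Gaussians."""
    means, vols = (0.0005, 0.0002), (0.02, 0.01)
    return [rng.gauss(m, v) for m, v in zip(means, vols)]


def estimate_value(policy, model, horizon, n_paths, rng):
    """Monte Carlo estimate of the expected cumulative log return of a policy."""
    total = 0.0
    for _ in range(n_paths):
        state, cumulative = None, 0.0
        for _ in range(horizon):
            weights = policy(state, rng)
            returns = model(state, rng)
            cumulative += math.log(
                1.0 + sum(w * r for w, r in zip(weights, returns)))
            state = returns
        total += cumulative
    return total / n_paths


value = estimate_value(stochastic_policy, gaussian_model,
                       horizon=20, n_paths=500, rng=random.Random(0))
```

Swapping `gaussian_model` for a richer model changes the transition probabilities but leaves the definition of the value function untouched.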
So he is not going to optimize the policy directly from experience. He first tries to learn a model from the transitions he has been experiencing, and then, based on that model, he tries to solve a kind of optimization. At each time step, the agent first predicts the next state, because he has a model of the environment: he predicts the next state and the reward he will be getting based on the action he took. Then he observes the real transition and the real reward that he got from the environment. And then he can incrementally update his model, because he has a model and a loss function that he can train the model on.
What are the advantages of that kind of paradigm? There are some advantages, especially in financial portfolio optimization. The most important one is that there have been a lot of studies about the behavior of financial markets and the properties of financial time series data, and it is very easy to implement those findings directly in a model-based reinforcement learning paradigm. You can put all those findings explicitly into a model, and then have a model that best describes the financial market transitions. Things like volatility clustering, heavy tails of the returns, tail dependence among different assets, the existence of jumps, and non-stationarity can be directly modeled and learned from the data.
But then obviously there are some disadvantages. Because you have an explicit model that you first have to learn, there are some sorts of errors and approximations coming in. If your model is not an accurate representation of the environment, the optimal policies that you learn based on that model won't be optimal at all, because your model does not describe the market as well as it should.
So let's formulate everything that we have been saying about model-based reinforcement learning. What should we do? In general, if you want to use model-based reinforcement learning, you need to gather some experience by interacting with the environment and figure out the model from the experience you have gathered. But in finance it is a little bit easier, because the interactions that we make with the environment, which are the transactions that we make, do not affect the state transitions. What I mean by that is that any time we buy or sell an asset in the market, we can assume that this transaction does not change the prices. So we can separate the action from the transition, and we have a setup which has only the transition of the prices. We can look at the history of the prices, or the returns, and train a supervised model based on that.
The whole approach looks something like this. We pick a parameterized model which predicts the next state transitions, or comes up with the probability distribution of the next-period returns. We pick an appropriate loss function so that we can train that model. And then we find the parameters which minimize that loss function, and we train the whole model on our data set.
Let's put all of this into a generalized algorithm that we can use for any type of model-based reinforcement learning. The input of this algorithm is simple. You have your trading universe of m assets, so you have to define what kind of assets you want to trade. You need to define the parametric model which you think predicts the returns of the market best. And you need to come up with a loss function which describes the deviations of the model predictions from the observed returns. For example, you can have an ARMA-GARCH model with non-Gaussian innovations, and the corresponding loss function would be the likelihood. You could use maximum likelihood estimation on a batch data set to first initialize the model, that is, to learn the parameters of the model.
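The batch initialization step (pick a parametric model, pick a loss, minimize it on the price history) can be sketched with a deliberately simplified stand-in for the ARMA-GARCH example. The assumptions are loud ones: a single asset, a zero-mean GARCH(1,1) with Gaussian rather than non-Gaussian innovations, the negative log-likelihood as the loss, a coarse grid search instead of a proper optimizer, and simulated returns in place of real history.

```python
import math
import random


def garch_nll(params, returns):
    """Gaussian negative log-likelihood of zero-mean GARCH(1,1) returns."""
    omega, alpha, beta = params
    variance = sum(r * r for r in returns) / len(returns)  # start at sample variance
    nll = 0.0
    for r in returns:
        nll += 0.5 * (math.log(2.0 * math.pi * variance) + r * r / variance)
        variance = omega + alpha * r * r + beta * variance  # state for the next step
    return nll


def fit_on_batch(returns, grid):
    """Maximum likelihood over a parameter grid (minimizes the loss above)."""
    return min(grid, key=lambda params: garch_nll(params, returns))


def simulate_garch(params, n, rng):
    """Generate returns from the same model, used here in place of real data."""
    omega, alpha, beta = params
    variance = omega / (1.0 - alpha - beta)
    out = []
    for _ in range(n):
        r = rng.gauss(0.0, math.sqrt(variance))
        out.append(r)
        variance = omega + alpha * r * r + beta * variance
    return out


grid = [(omega, alpha, beta)
        for omega in (1e-6, 5e-6, 2e-5)
        for alpha in (0.05, 0.10, 0.20)
        for beta in (0.70, 0.80, 0.90)
        if alpha + beta < 1.0]

batch = simulate_garch((5e-6, 0.10, 0.80), n=1000, rng=random.Random(7))
fitted = fit_on_batch(batch, grid)
```

Because the transactions are assumed not to move prices, this fit needs only the return history and no interaction with the market.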
Then you jump into a kind of online training setup, as in reinforcement learning. The rest of the algorithm is simple. You use the batch data that you have gathered, which is simply your history of prices, to learn the parameters of the model. And then you start to iterate over the time steps. You predict the next state from the model that you have learned on your batch data. You observe the prices, or the returns, by just stepping forward. You build your state from the observation and the history of states that you have been gathering. This is part of your model: part of your model is responsible for building the environment state from the observations it has been making. Then you calculate the deviation between the state that you built from the observations and the state that you predicted, and you incrementally change the parameters based on the gradient of that loss function.
And you rebalance the portfolio based on the model that you have. You can use any kind of model-based control to solve the optimization problem. As soon as you have a model of your environment, you can sample from that model, for example in a Monte Carlo setup. So you can have a sample of all returns until the end of the investment horizon, estimate the objective function on it, like expected returns, volatilities, or whatever you put in as the objective function to optimize for, and then solve the optimization problem for that Monte Carlo sample. As soon as you have a model, you can control your policy using different policy methods, or just subscribe to follow. And then you reiterate.
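The online part of the loop (predict, observe, take a gradient step on the loss, then plan by Monte Carlo sampling from the model) can be sketched as follows. This is a toy under stated assumptions, not the algorithm from the slides: the model is an independent Gaussian per asset with a mean and a variance, the losses are squared errors, cumulative return is approximated by a sum of simple returns, planning is a search over three candidate weight vectors, and all numbers are hypothetical.

```python
import math
import random
import statistics


def update_model(model, realized, lr=0.05):
    """One gradient step on squared-error losses for the mean and the variance."""
    for i, r in enumerate(realized):
        err = r - model["mean"][i]
        model["mean"][i] += lr * err
        model["var"][i] += lr * (err * err - model["var"][i])


def sample_cumulative_returns(model, weights, steps_left, n_paths, rng):
    """Monte Carlo sample of portfolio returns until the end of the horizon."""
    vols = [math.sqrt(v) for v in model["var"]]
    sample = []
    for _ in range(n_paths):
        total = 0.0
        for _ in range(steps_left):
            returns = [rng.gauss(m, s) for m, s in zip(model["mean"], vols)]
            total += sum(w * r for w, r in zip(weights, returns))
        sample.append(total)
    return sample


def plan(model, candidates, steps_left, rng, n_paths=200):
    """Choose the candidate with the best mean-to-volatility ratio on the sample."""
    def objective(weights):
        sample = sample_cumulative_returns(model, weights, steps_left, n_paths, rng)
        return statistics.mean(sample) / (statistics.stdev(sample) + 1e-12)
    return max(candidates, key=objective)


def run_online(model, observed_returns, candidates, rng):
    horizon = len(observed_returns)
    decisions, losses = [], []
    for t, realized in enumerate(observed_returns):
        predicted = list(model["mean"])                       # 1. predict
        losses.append(sum(0.5 * (r - p) ** 2                  # 2. observe, measure deviation
                          for r, p in zip(realized, predicted)))
        update_model(model, realized)                         # 3. incremental update
        steps_left = horizon - (t + 1)
        if steps_left > 0:                                    # 4. rebalance for the rest
            decisions.append(plan(model, candidates, steps_left, rng))
    return decisions, losses


# Hypothetical starting model (as if fitted on batch data) and observed returns.
model = {"mean": [0.001, 0.0005], "var": [0.0004, 0.0001]}
observed = [[0.01, 0.002], [-0.02, 0.001], [0.015, -0.003],
            [0.0, 0.004], [0.005, 0.001]]
candidates = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
decisions, losses = run_online(model, observed, candidates, random.Random(1))
```

Each candidate is scored on its own random sample, so the comparison is noisy; a real implementation would likely reuse the same simulated paths across candidates and search over continuous weights.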
You repeat this until you converge. So this is the whole scheme of using model-based reinforcement learning: you learn the model, and at the same time you use that model to plan and come up with the optimization.
Okay, instead of first learning a predictive model of the transitions and then using that model to come up with the optimal policy, one can learn the optimal policy based on the value function directly. Assume that the value function is defined as the cumulative return you will be getting at the end of the investment horizon. Then you can use a generalized function for the policy, which parameterizes how you rebalance the portfolio at each time step. And at the same time you can use another function to parameterize the amount of cumulative return you will be getting if you rebalance the portfolio accordingly. This is a typical actor-critic setup, which is one of the state-of-the-art methods of model-free reinforcement learning, and it can be directly applied to the problem of automatic trading bots.
Here in the graph I have pictured how the networks could look. The actor network gets the observations, which are the prices. Based on those observations, it first builds the state of the environment, and then uses that state to come up with an action, which is the portfolio weights. A critic network, at the same time, takes those portfolio weights and, of course, the observations of the prices. It builds a state upon those again, and then comes up with a value for that rebalancing.
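The data flow through the two networks can be sketched with linear functions standing in for the networks. This shows only the forward passes, not training, and every detail is an assumption for illustration: the state is a short window of log returns, the actor is a linear score per asset followed by a softmax (so weights are long-only and sum to one), the critic is a linear function of state and action, and the parameter values are arbitrary.

```python
import math


def build_state(price_history, window):
    """State from observations: the last `window` log returns of every asset."""
    recent = price_history[-(window + 1):]
    state = []
    for previous, current in zip(recent[:-1], recent[1:]):
        state.extend(math.log(c / p) for p, c in zip(previous, current))
    return state


def actor(state, theta):
    """Policy: state -> portfolio weights (softmax over one linear score per asset)."""
    scores = [sum(t * s for t, s in zip(row, state)) for row in theta]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def critic(state, weights, phi):
    """Value estimate: (state, action) -> expected cumulative return."""
    features = state + weights
    return sum(p * f for p, f in zip(phi, features))


# Hypothetical prices of two assets on three dates, and arbitrary parameters.
prices = [[100.0, 50.0], [101.0, 49.5], [102.0, 50.5]]
state = build_state(prices, window=2)                 # 2 steps x 2 assets = 4 features
theta = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0]]  # one row per asset
phi = [0.5, 0.5, 1.0, 1.0, 0.01, 0.02]                # state features, then weights

weights = actor(state, theta)
value = critic(state, weights, phi)
```

In an actual actor-critic method both parameter sets would be trained jointly, with the critic's estimate guiding the update of the actor.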
That value is how much cumulative return the rebalancing will give you at the end of your investment. The critic basically rolls it out until the end of the investment horizon, looks at the returns that you will be getting, and gives you an estimate of the value you will be getting from the action that was picked by your actor network.
This setup can be trained jointly via different state-of-the-art algorithms. I just put a generic DDPG, the deep deterministic policy gradient algorithm, and it can be applied to this specific problem. This is something that I will be trying to do, alongside model-based reinforcement learning, to show how we can implement those in Spark, use some of the Spark features to parallelize the model training, and come up with ideas of how a full implementation will look.
So let's briefly talk about the kinds of challenges and problems that all the models we talked about have. As I said before, it is very important, crucial even, for reinforcement learning algorithms and the MDP formulation to have a clear reward signal. That is kind of challenging for a generalized portfolio optimization framework.
It is challenging to come up with reward-generating functions. If you have some sort of complicated risk functional, like value at risk or any other quantile-based risk measure of the portfolio returns, it might be a problem to engineer a reward-generating function.
The other thing is the environment. The financial market is a very complicated environment. It has a lot of features which make it very hard for the models to learn effectively. And on top of that, there is a general theme in financial markets: the signal-to-noise ratio is pretty low compared to other areas where reinforcement learning has been successfully applied, things like games, image processing, text classification. So the nature of financial markets, the fact that it is a very noisy environment, makes it very hard for reinforcement learning algorithms to learn.
Added to those problems, there are some specific problems with model-free and with model-based reinforcement learning in financial markets. For example, if you want to use model-free, there is a limited amount of training data. If you want to learn a model which uses daily returns or daily prices, you have about 250 data points per year. And if you want to train your model on a history of 10 years, you will not have more than 2,000 to 3,000 data points to train your model on, which is a very small amount of data. Combined with the fact that financial markets are a very noisy environment, this makes the models very prone to overfitting.
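A rough back-of-the-envelope calculation shows how little 10 years of daily data says about even the simplest quantity, the mean return. The 250 trading days per year come from the talk; the 1% daily volatility and 10% annual mean return are assumed round numbers for illustration.

```python
import math


def mean_return_t_stat(years, annual_mean, daily_vol, days_per_year=250):
    """Sample size and t-statistic of the estimated daily mean return."""
    n_obs = days_per_year * years
    daily_mean = annual_mean / days_per_year
    standard_error = daily_vol / math.sqrt(n_obs)
    return n_obs, daily_mean / standard_error


n_obs, t_stat = mean_return_t_stat(years=10, annual_mean=0.10, daily_vol=0.01)
```

With these assumptions there are 2,500 observations and the mean return is only about two standard errors away from zero, so a flexible model has plenty of room to fit noise instead of signal.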
Such models are then not able to generalize well to out-of-sample data.
In model-based reinforcement learning, you can have some of these specific characteristics of financial markets modeled explicitly in your algorithm. But then you have to come up with ways to cope with model uncertainty, changing models, inaccurate models, and the hyperparameters of the models. And these will directly affect the optimal portfolios and optimal solutions that you have at the end of the day.
So there are some ideas that use the benefits of both worlds to make reinforcement learning a viable option for portfolio optimization: hybrid methods, where you learn a model and at the same time use model-free. You use the model to generate samples and augment the data, to cope with the limited-training-data problem, and you run model-free reinforcement learning on the data generated from the model too, to get more accurate solutions. But these all have to be tested and taken into account very carefully.
So that was my presentation. In this first part I just wanted to give you the theory: how it will look, and what the challenges of using reinforcement learning are. In the next part I will be trying to implement a fully integrated solution based on different types of reinforcement learning algorithms and Spark, to train an automatic trading bot which can come up with the optimal portfolios at the end of a certain investment horizon. I hope you enjoyed it, and thank you for your time.