Northwest Database Society Annual Meeting - Session 3
You. So. We're starting with binding, from Michaels research he's going to talk about improving, OCC, by construction, practice matching interviewing yeah. You don't, know oh this is Optimus, mr. concurrent control but it's too crowded I cannot put everything here okay. Hi everyone I'm by loading from Microsoft to research and today, I'm going to talk, about improving, OCC, per transaction batching an operation reordering, this, is publishing, the ODB this year so you can look it up on life you want to know more details about this work this. Is joint work with your hand and general, set, and. Now we get started ok. Before, that, we also get into transactions, I would like to give an overview of what. Is going on in my group so, I work in the DMX group and in, Microsoft, Research and, the full name is Darren management and exploration, exploration, and mining group so, as the name indicates our, group, works on a variety, of different, projects, in the space of data, management and data analytics. So, here, is a just a sample of what is going on here and this is a by no means a completely, complete, list and please. Feel free to talk, to me or my colleagues after, the talk if your interesting, knowing, more about those projects, ok. So the first one is the self storage state of information, so. Idea is probably. Similar to our previous speaker is, to democratize, data preparation, for. AI but, we are also the market, has a data, conversion, for bi as well so how we have an edge here, the. Second. Project is, proximal. Query processing, so as, you know like for some of the exploratory. Data analytical. Tasks it. Makes much more sense to first get a rough idea of like, back of envelope calculation of, what is going, on in the data before you, consume. All resources, and spend a lot of time on getting the exact, answer so, Sergio has led an effort on how to do. Approximate. Query processing, on with, a large amount of data with, some guarantees. Now. The, next one is the actor, or any database, systems, this. Is this is our classic, of collaboration. That I feel on how, to end database, functionality, on top of our actor or entity, framework so. That you can, do are, some, of the like scalp do. Sorry. Can do some obscure, bore for. Torrent, and me to your applications, such as multiple era games IOT. And telemetry, and mobile, devices, and. Finally there is the automatic. Physical, design in the cloud project. Where. Sudipta and I and a bunch of colleagues were trying to automatically, build physical, illnesses, for. The database in the cloud so as we know the. Database. Indexing. Is a critical, for the performance, of the database so pyre work most of our work tried to position, the, index, recommendation. As a tool to facility, some of the human experts, to, selecting, disease, for. A given workout on a given, database so in a cloud service. However ever you, will be way too expensive for, humans. To oversee, in the decisions, of the Eunice's are millions of data bases so, in this work or in this project we, are proposing a closed-loop solution. To, continuously. And automatically, read, to, the queries and workload, to reduce the actual query, execution, time instead of the estimated, cost of the queries, in. This way we are trying to remove, human intervention. In the critical path of the indexing, or query database. Tuning process, so. In, contract, to, some of the prior work we are actually looking to the full pipeline of, the whole process, including. Not just index. Recommendation, but also the workflow extraction, the index recommendation, validating, and monitoring of what happens after you implemented. Those indices. So, in this case we. Measure the success of the implementing, index in terms of the execution of the career cost instead of the estimated, cost. During. This process one, thing more, or less law as a side product. We observed, is that during because we are continuously tuning, and monitoring, and changing. The configuration of, the databases, we can see or observe multiple. Executions. Of different, plans for the same query so, if, based, on those execution, data we, can actually, construct.
New Plans to, improve the, quality of the plan at very low risk. To reduce the execution, cost if, we have this kind of data so, if we are interesting, more about those things place to look at just, top me talk to me afterwards, or look at the everything, is containing those papers. Okay. Now. So, much about my group let's. Go the actual chess actions. Okay. So, optimistic. On currency culture is a transaction. Concurrency. Control protocol. Which, where a transaction, first executes. Its. Operations, and optimistically. Assuming, there is no conflict, in transaction, going on with it and before. It. Commits. To the databases you store all the, updates. In its pirate work for us workspace. Now before, commenting there. Is actually some procedure. To check or validate. Whether there is actually, no transaction conflicting. With this one so if there is not it. Goes into. The. Right wrist where it tries to make its updates, available for. Others to see, now. If there is conflict. Transaction. With other transactions, you, will just support so. Here. Is the example of like one-hour, transaction, in optimistic, concurrency control can abort so for example we. Have two transactions, t1, and t2 where, t1 read, x and y and modify X where, T to read, x and y X and Z and modify Z so, in the real face they read the data item make the update in their, local workspace. Now. After. That the interest of validations, face. Where, t1 tries, to commit or tries to check if there, is any one complicating with it for, the updates it has to make so, in this case there is nobody conflict, remains t1 so, you can commit and update, the value of x to 30. Now. Here comes to 2 because. The. Value, of x has been updated, to 30 and t2. Just read all developed, and assume we're running, answer all about transaction, in, this case the t2 actually has a still read and it has to abort in, this case so. This is how our transaction can abort any optimistic, concurrency control because, we don't lock everything, now. The, question we ask here, is that, does. T to really have to abort now. It turns out. In. This example the t2 has a steward, so it's sort of a hopeless it has to abort but, if. We can switch, the, order of how, t1, and t2 sterilized. For example instead of sterilized, t2, t1 first which there are the two two first in, this case t2, can make it up there to Z and then, when we valued at t1 there's also no conflict, with t1 so t1 can also make. Its up there to X, so, both of the transaction can commit with a different, sterilization. Order now. The question, is that can we also like, people. Might be a bit nervous that whatever can. We really also like manipulate. The sterilization. Order of the transactions. Actually. In OCC optimistic. Control. The. Answer is yes, the. Reason is that in. Optimistic. Concurrency control the, serialization order, is depends, on like how the order concurrent, transactions, how they execute and how they like, when they come to the validation phase so, you, know normal operation of I also say the transactions, can come and whichever, order they want so basically, there is the freedom of which, order the actual being sterilized and the validator so as we see agent famously with Olivia's, supervision. We can actually construct a data order than the order that they come to the system. So. Not. Only that we. Actually can extend, the idea of reorder, the transactions, to reorder the operations, of the transaction not only in the validation phase but also throughout the life of the transaction, and we, can further try, to patch transactions. To increase, the opportunity, of reordering and also to scope out how. Far. We you are going to look at when you do the reordering, so this is the batching action makes it explicit. And easier, for us to do the reordering work. Now. Here's, the is a life of a transaction a. Client. First, descent a transaction to the system in our corner. Execute, the transaction you should read and write a request to the storage so, before the storage actually answer those requests there is some matura that tries to patch the operation, to put the operations, into batches and to, reorder the read and write request now. The store to reply to the. Transaction. Coordinator and, after, finish. All the operations, of the transaction it is sent to the validation, phase so. Before, it actually get validated, there is also component, which tries to patch the validation, request and tries to reorder, the transactions, so, that we can get a better civilization, order. So. From the architecture, we can see there are multiple points, the transactions, are matched and reordered. For, example. The, action, of the transaction can be a batch even, before they enter the system for is a man the client you get a bunch of transactions, you can decide in which order you usual the transaction, to reduce the conflict on this, is beyond the scope of this work but there has been some prior work on how to do this static, analysis, of transactions, to reorder.
Them Now, here, we have focus on the reorder, the operations, the retry request and the storage and also, reorder the transactions, and the valuation, so, in a storage is the strategy the optimal, strategy is actually fairly simple because we want to power it has all the rights before, all the reads from a batch of operations. So that for the real operations we don't read any like a stale state which, is happen to be updated by someone else in a batch. But. At the validation. Layer the, reordering, strategy, strategy, is less straightforward so, essentially, the question we are asked is that how can we create a sterilization, order with, the least number of ports. Given, batch of transactions. Now. Formally, the. Problem is to like give, up the actual transactions, the state B we want to construct a subset, B prime which. Is a subset of transactions. Such, as the B prime is 0 eyes about which means we can sterilize the transaction, in B prime without. Causing, any conflict, and the size of the department is maximum, among orders all such subsets the maximal action means we have least number of our ports now. Okay. Lost, its. Okay one minute, like I say that there is a way to, decide. When. The prom is survivable. And there is also an algorithm, to construct a, sterilization order after, we decide this is Sara as well this is something to do with the dependency, graph and has, something to do with how, many cycles or if there is in a psycho in the dependency graph so, you can look up in a paper for more details okay, so after we identify, this problem, we still have the, usual like we, know how to decide this is the so, this is a sterilized but now the question is how it can take the least number of transaction, out of the, out. Of the. Dependency. Graph to make the guapa cyclic now. There is people have strings study that and, unfortunately and, this is mg hard and apparently. We haven't figured out how to make em T equal to V so, we propose them greedy, organisms so that we can have, a design. Choice or trade-off between the overhead. Of the organism, and the. Performance. Of the organism, okay. So there is some extension, where you can like optimize for different goals instead of just minimize, the number of ports you can try to phrase, my reduce the tail lit and say or you can try to power, it has some transactions. Way so like Amina in dollar value instead, of a like a penny value transaction. So. This can also be handled under the same framework so, I guess they just give you a snip of the performance. Numbers, we. Have up, to two times the improvements, throughput as. Well as up to four times reduction, in percentage, presentation, 11 so that you don't lose anything you don't have a trade off here you actually get pissed, off passports, in. Terms of additional. Policies, we. Have like for example we have a college, that try to reduce the very tail light and see like the three nights four nights five nights and, in this case the new policy, can actually reduce that but up to six times.
Okay. I think, I can skip the conclusion. Any. Questions. Yes. Is it single corner or multi-port this, is not equal but but the recording process has to be like when you decide the sterilization order so we have an extension, to how to paralyze. Different. Operations. And we do like type like paralyzing, we, do we, do like horizontal, paralyzes, you whenever we, can but if we can we do can do pipeline paralyzation. Yes. How. Much do you need to know the read and write sets up front before you schedule, the transaction so don't know at all because in this case you also say you you actually decide, the Ceredigion order after, the transaction has finished it is present, so we are trying to decide the order you already know exactly, what are the Rhydian Rice's yes. There. Is no approximation oh, there, is so we can guarantee it and this is a correct, but we cannot guarantee this is the minimum, okay. Does. It make sense yes, but I wonder why you don't use an approximation, if, there are known approach if there if you can use an approximation then. You can always gather with no really long approximation, because we cannot sterilize, the transaction. See if we, find a cycle in a graph. You. Can take this offline. Okay. Thanks. Our. Next speaker is Alec turned off from Microsoft. He's going to talk about towards learning optimizer, for shared clubs Thanks. Hello. Everyone my name is Alec Jindal and I'm going to talk. About our work on the water learning optimizer for share clouds this is a joint work with our awesome intern, Chang Gangu and. A whole, set. Of my colleagues, here at Microsoft. So. We are witnessing a rise of decatur systems they're, really big and we. Have so many examples of systems like hive spark, flame calcite, and a, whole bunch of offerings even from the industry, and there's, a trend towards making the system's declarative. And having. A cost-based, optimizer. Which. Essentially says given. A user query. It'll, automatically figure out the best physical. Execution plan for. That query and the way it does is that it estimates the cardinalities. At each point in the query, plan estimate.
The Cost of the plan and picks the plan with the cheapest cost. So. Far so good and of, course a good plan leads to good performance. But. Problem. Is that the costs optimizers can make all sorts of mistakes and, the, most notorious. Being cardinality. Estimation. So. Much so sure and so that, people believe that it's the root of all evils. It's. A acnes he'll of good. Optimization and. It's a biggest, problem apparently, so. It's a really hard problem I would say, so. Put, into basically tune, our systems you know so, as to make them really performant. And, the. Things which people normally do is to think and do things like you know collect, statistics or, provide you know query, hints so, for example in sequel server you can you know add a row, count hint. Or. You can apply some DB. Administration. So. There's also been, a rise or in the clouds. As well and what. This means for this systems. Is that they're being offered as either. Managed, or even. As our less offerings. Which. Means that the end. Users, now don't have any control and they, can't apply these kinds of tunings any more so. Which means they may not they may not have admin or they not have you in expertise because it's where long tail of users who simply show up and sign up for any of these big. Data services, and they. Don't even actually have control because it's a managed service so. It, means the systems have to self tuned there's no other option. Self, fielding is a really hard problem and it's, super. Hard in these big. Data systems, so. But. There is some hope if you look at specific instances. Of these systems for. Instance if you look at shared cloud infrastructures. What. Do we mean by that so. If you look at enterprises, where there are sets of people who query. Or who like you know process the same sets of data some, sheer data over. A shared infrastructure. Then. They end up actually, producing. A massive log of queries. On the same in inputs, and that, could be used actually to you know tune the system, so. This essentially, visible, query log on the same in each search could be really valuable, this. Might see actually. It, turns out that this, kind of setting is actually very typical, in enterprise, analytics. In, fact at Microsoft. We have cosmos, which. Is a shared, cloud infrastructure, for, doing all of internal Microsoft analytics.
And, In. Cosmos, we have a scope query engine that. Does, batch processing. In a job, satisfactory. And it. Has a massive output of hundreds. Of thousands. Of jobs each day, which. Has been scripted, by thousands. Of users and they, also, checks the bytes of data. If. You look at the Carnegie estimation, in the scope jobs. So. We looked at one of the largest scope, customers, which is Asimov, and we, looked at like one day what, of the locks from them. That's. How the estimation looks like so. In this figure the. Green Line is the ideal and even. After lots of tuning, and constants, and best, effort estimation. There. Is a huge curve of underestimation, and there's, a, huge. Overestimation, happening, and, given. That the x-axis is log scale you can understand that it, could be all over the place basically. And. It's a hard problem because scope. Has this big. Data lots. Of data it's, not easy to know analyze or. Statuses on on that data the. Data is unstructured and, there is lot of custom. User code which, ends up being black, boxes. But. We have some observations, here actually and, one. Thing we observe is that much, of the scope work boards actually are collecting. Which means same sets of queries appear again and again at. Some, frequency, and they. Also have lots of shared, query sub expressions or sub graphs so. Which means that if you have two queries q1, q2 they, may end up scanning. The same items. Table and applying the same filter. Now. A question is how can we leverage this common. Sub-expressions, or common sub graphs to, build or to learn cardinality. Models. So. Of course the easiest thing is that we can simply cache so. If we see a similar. Sub. Expression, we can just completely simply record, the value and use it the, next, time you see it again the. Problem of however is that it's very very has, a very low coverage, because the inputs change we have to change this you know we can't use a cache, anymore and the. Feedback has to be online it, has to be materially as soon as we have seen get applied back immediately which, is kind of tedious. The. Other extreme is ok we learn models, a general, model for, any sub graph no. Saying, if you have a single model for any kind of a sub, graph that's. Very hard to fit your eyes we tried that we failed it's. Hard to train it. Has a very high latency. When scoring because we end up having a large a very, large feature vector. But, most important we had very low accuracy we could not fit a single model for all the cases, so. We took some middle. Ground you say okay let's amongst. Logical. Sub expression, parameters. And inputs let's, only fix expression and let's, allow there are parameters and inputs to change. Which. Means that we, don't learn a single model no, one-size-fits-all let's. Learn a set of models which, together capture the entire workload. So. Which means we now look at sub graph templates, which have the same logical, expression, but may have different physical implementations. And different parameters. And inputs as an. Example suppose we have this. Join, happening and this is a filter, happening, we. May have a ply filter, age 18 or age 20. And the inputs may have changed as well, so. If he tries this kind of sub, expressions and, used. A bunch of features which includes mainly metadata. And. Parameters. And input. Characteristics. And. We. Tried a bunch of models so we try it and we found the linear models to be actually very useful, especially. The. Poisson. Regression. It was the most. Accurate, somehow. The figure is not that, accurate but. It, makes sense because. Cardinalities. Can really the distribution is more exponential, so it can really fall, off really, and it's are also non-negative, so, it makes sense that it was most accurate if, you do all of that what, do we end up basically, so. We end up something like this basically so we see wherever we have models if they are super accurate basically, you know we can actually very. Close to the ideal almost, this. Is actually scary hey like how how actually can be and. If. You look at the default of course we have orders of magnitude more accurate, because. We look at we optimize, for specific instance that for this. Customer. Can be done the, cardinality. Models. But. The other question is what's about applicability, are. These mods applicable to the entire worker or not so. We looked at the applicability. What, percentage, of sub graphs we have models for turns. Out that we approximately, have after 2 days of training. Approximately. 65% of the sub graphs have models. And. We. Need to train around once. A month because, after a month we. Lose some, applicability.
So, We have to train up approximately, once a month so, as to make sure that we cover, those 65%. That's. Very enough. If. You look at the end-to-end feedback, loop how. Does how does this whole. Thing work so when, we have this our. Scope, engine which has the compiler, optimizer, scheduler and runtime, and we take all the logs you. Know we train. Them offline, and, periodically. And we. End up with a large number of very small models which super accurate and. Then. We feed it back to a, feedback. Service, which. Provides these models as hints to future queries so, the queries can choose to use or choose to ignore it's up to them it's a hint additional hint for every query in. The future and. Then. The optimizer. Uses the models to feature eyes and because, they're small it's easy to future eyes so we have no scooting overhead and, they. Are accurate and one, important thing is that once. If, some query fails or if somebody, has to go and debug. Then. They can actually also understand, the weights what. Features have no which way it's because there are simple linear, models. So. Each. Of these actually were a requirements. We had in the scope environments. So we were able to achieve, them and if you do all of that how, does the end-to-end performance. Look like, so. We looked at some. Of some sort of queries from Asimov, we. Really ran them with the default and, the new, basically, optimizer. And. If. You look at latency. Well. Sometimes, there is sometimes there's not improvements. But. If you look at the total processing, time we found interestingly. That there, were significant improvements, and the. Reason that happened was because we. Were. Actually scheduling, a lot less number of vertices. Or containers. Because we were in the overestimation which, was so so. Heavy so. By avoiding that we actually save resources, because give this kind of queries which is important, for this kind of in fact no infrastructures, medes which, is pay as you go. We. Also have looked, at the problem. Of learning. Bias because the, only learn, based on what we have album observed. And we, meant up basically no having a bias so. We look at how we can you know try. Out alternate orders of joints and maximize. Observation, so that we can train on those new observations, has been, more. Details of this are in the paper, which we have so. Feel free to have a look at that. But. In. Summary, and the takeaways, we had in our work was as follows. Big. Data systems, are increasingly cost, based you know, optimized. There's no like user, on manual tuning they have to be self tuned in, fact users cannot even tune the systems anymore within. Their managed in server less. It's. Very hard to achieve a one-size-fits-all, optimize another system we, cannot really optimize system for all instances, and scenarios because the clouds, and the workloads are so diverse, it's, very hard but. There is hope if you look at specific instances, for example a given shared infrastructure. Then. We can do something there. We saw very promising. Results on the scope workloads. You can achieve high accuracy, and. Large. Applicability, and we, could actually produce, significantly. The resource requirements. And. This is just a starting point, we. Believe that the learning cognitive. Models of step towards making, a self-learning optimizers. And systems and we are working more towards that thank. You. Question. How large were your mother is how many choices, model. What is the maximum number of choices a mother sweeping right. So. Even. In terms of each model we had was laissez like a simple linear, model. Like, so because Lana model at any sub-expression basically so, we and. We were looking at like the places. Where we had the maximum bang for the buck which means that we can really see. Them a large number of times so, it. Does not matter how, many number of joints, we have that think about is how many like features. Or how many parameters, we, have and. Interestingly. We had most parameters, where we had you, do is the, user-defined. Code basically so wherever they appeared we had very very large number of parameters. And that really make sure that we he tries them otherwise, things can go horribly wrong so. Dress was our main challenge. Do. You think this will be also sorry, again for welding, systems, reading, this will also make it to get back like is although the right.
So We are engaging. With some toasted cassava folks, and seeing if you can apply there as well. There, would be some changes probably how we apply feedback, and how much you go ahead you can always. So there are some concerns but, we are engaging industries we don't have answer yet they engage with them and see how we can apply it there has been. Thanks. And. We'd like next to a mere from guru who is going to talk about machine learning in Google. Machine. Learning in bakui. I've. Been, in the Vickery team for for. Four years I've been at Google for six. Before that I was at MSR before. That I was a Michigan, so. It's good. To be back here. I'm. Gonna give a very, short overview of equity, and. Motion. Laser is gonna give you a better overview probably. And. Then I talk about why, we decided to do bigquery ml ml, and. Briefly. Touch and touch on syntax, and then, talk about our, solution. To one of the simpler. Ml. Algorithms, that we support linear. Regressions we have two solutions I'm going to go, through why we decided to do each one and finally. We, have time you can do questions. So. Bigquery, for, those of you who, don't. Know B query goopy. Queries Google's, cloud, base sequel. Data warehouse as a service for, mostly, analytics. You, can use it right now for. Free it, has a free tier so you can try it basically, it's an enterprise data warehouse. It's. A fully managed service within. Sequel. With a sequel, API. It, can scale to petabytes, of storage and. Support. Encryption streaming. And all. The nice features that you will you, want from your data warehouse so. The, quarry ml. One. Of the things that we saw from enterprise. Customers, and. That. Use bigquery is that the. Sequel analysts, that they have. Their. Main job is, to extract, inside, from, beta and. Usually. It's based on some historic, data, that they stored they want to extract something. Based. On the history or they want to predict something in the future and. You. Can see two queries here that these are just simple queries the first one is trying to come up with the average income based on state, and the, second one is top, orders of, top.
And Orders, of customers. So. Our. Goal, here was to. Expand. The. The, sequel language to give, analysts. The, tools based. On math concepts. Statistical, methods ml. Algorithms. To. Do this insight, extraction. Without, needing to learn anything, new. Without. Needing to learn, Python, or tensorflow, scikit-learn. Without. Needing, to move their data in and out just. Staying where, they're comfortable and a video, with even minimal, need, to learn. Machine. Learning. So. That's. The first school that I've I, list. Here it's you, know democratizing. Ml for business customers without leaving. The seco environment, the. Second thing that we wanted to do was analyze being able to analyze large data, sets without sampling, them obviously. You can always sample, your data say take it into your local. Machine or one, node of your database for, some problems that's suitable, for some problems that is not suitable, so. Our goal was to use the power of bigquery. To. Scale to petabytes, of data for. ML algorithms, and. As. I said we, don't want to have this move. Data, out move. To. Do to do the, machine. Learning and then are there somehow bring that model back into, the database or move. More data out to use, that in in in that model for, system that's, ingesting. Data all the time and these models need to be retrained. Sometimes. Daily, that's. A that's an expensive pipeline and, it's beyond what the sequel analysts. Can do. There. Are examples, of. This. Effort before, to introduce, ml into database obviously they're not the first people who are doing this at. Least on the syntax side I have two. Samples here from systems, that are out there the first one is. Heavily, based on R and Python, it-it-it. To. Me it's a non descriptive. Syntax. That. You, know if you know are probably, you understand the red part if, you know sequel you know, the rest but mixing, the two for, someone who know who's a psychoanalyst. Doesn't, really make make, a lot of sense the. Second one is a little bit better. This, shows how they do generalized, linear models. One. Type of machine, learning model in sequel. There. Are there, are a few, issues with, that system or own name the system but there are a few issues one, is if you go through different models you see completely, different syntax, sometimes. There is tvf somehow sometimes, there's UDF sometimes. There is UDA. The, way that the model is applied to the data sometimes, is a join sometimes. There's some other special syntax so, every model is different. So. There's no unique. Syntax, or there is no common, syntax for training, and print prediction. So. What we decided to do was to, extend, C code EDL's, and. The same way that create table is the standard, form. Of creating tables we. Introduced, create, model, and. You. Just basically pass your model name there is a set, of options which. Are really optional and, those. Can control your ml algorithm, if you, want the knobs to tune some stuff and then you pass your data in the add statement, so, the same running example that we had for, the, income, based on census. Data here, you can see a model that we can create in in bigquery, a, linear, regression model that, predicts. The income based on the input data and you. Can in, fact change the model type to something else if if, you want to try different things you, can add a very close to the select statement, to predict. On a specific you, know time time, range or group, by state or whatever you want, the. Create, model statement stays the same and. Then. To apply prediction. We have one, tvf for prediction, across, all models, so. You pass your model name and then you tell pass your data, so.
Here You can see how prediction, can be done. For. The same model that we we trained you, just put the. Inputs where state and job and here. You just you, know just pass pass those in we, do have a bunch of other T vs for other interactions, that you want to do with. Your models for example if, you want to evaluate, the model based, on and. Get some metrics. Like how good this model is we, have another tvf but all of these CVS are same syntax we pass the model and in pass your data, so. The model is basically here is it. Is a top-level object, in the catalog of the database on and. You. Can store it however. You. Want you can have it as a table or as you. Know set up binary. Blobs that's, up to a specific database. So. That's. The, syntax and we, believe this syntax, improves. The usability of, ml for, psychoanalysts. Compared to the current workflows, which is either based on moving. Data using, python and r or using, notebooks and you know sample. Your data in memory of the notebook. Environment. So. Moving forward. I'm. Going to talk about two. Approaches, for, for. The same running example that we had so predicting, income based. On census. Data the first one i'm going to go over a. Very common, approach called. Iterative, gradient descent. So. Let's look at the problem first just, quickly. So. What. We want to solve is basically find a set of weights. That, we can apply to the features of, our. Input data features where state their. Job. And. State. So. We want to have we, want to find weights for those and the. Training data is modeled, as X, here rows, are training examples, columns are features so we have two columns in. Which. Are, state and and and job title and then the observations, are the income, that we want to predict so. We want to find w-4. Based, on our training data and the. The, common way to do this is to iteratively. Minimize. This. Objective function the, first part is a loss which, basically says how far we are from the values that we want to predict and the. Two other terms are regularization terms, to, control. Overfitting. Of, the model so. That's, what we want to solve and. What. We. Decided to do was. To basically. Given. That our data size input. Could be petabytes. We. Wanna use the power of the database so we want to use sequel, to solve these equations and we. Decided to basically, do the gradient descent in pure, sequel there, is no extra. Function, or any modification, to the database that we made for. Solving, these, these. These. Equations so we represent our data as a table our, model, although is an object, in the backend we store it as a table and. We. Basically run a set of queries for each iteration of gradient descent, and. So. There's a paper from, 12, my colleagues that you can you can look about you can look up about the details of all of these queries and all the techniques they use to optimize it. Basically. Each iteration is as, I said our sequel. Statements, and there, are a bunch of techniques to. Hide. The complexity, and and, make the query scale, I'm not going to go into the detail you. Can you can look, them up or you can talk to me after talk, the. Second, approach is closed, form solution, so if the data is not petabytes, and. Lots. Of customers may have smaller, data the data set now, we still want to be. Able to train. But. There. There could be a closed-form solution, that doesn't. Need all of the nodes of the database, - - to, solve it here, again we. Have the same same. Equation. And, it. Just. Doesn't note that the training data usually is an M by n matrix and, M which is the number of examples is usually, much much larger than n, which, is the number of features. So. The closed form solution. Usually. Just. Uses, linear algebra. To to.
Solve This and. The. Solution, for, W, so I wrote it here it's, it's a it's a well-known solution, I'm not going to go into the detail of how we got it this is you can look it up on Wikipedia it, basically, consists, of a bunch of a few. Multiplications. And a matrix inversion, and. These. Are both. Hard things to do in a database. Especially. The inversion, so the. Way we go, about. This is we do, the matrix multiplication, using, joins in Vickery because b, kotov does pretty, well for for for scaling, joints and the. Matrix, inversion is, basically. The. First, term that you see X. 3 X that's. An N by n matrix, and remember that M was much bigger than n and n, is the number of features so it's, a pretty small matrix, you don't you're not gonna have more than thousand. Features or 10,000 features so we can fit that in memory so. We basically run a query to, do the. Multiplications. Matrix. We, put, them in a table as like row row. Number column number the value and the, the multiplications. Our joins and. The. Matrix inversion because, it fits, in memory of one chart we, do use a fast solver. Off, the shelf to do the matrix inversion. Finally. How do we decide between the two as I said a psychoanalysts. In care about any of these they just want to say select star from T train. On this. So. All. Of these are handled, by us so we decide if the data fits in the memory by, Luke get the number of training features if you have more than 10,000, features we, know that that matrix is too big we are not gonna matrix do matrix inversion if, there is overfitting, issues so close form is not great, because it doesn't support edwin, regularization if we, see that number of training examples is, not enough. And we are gonna get into, overfitting issues we go. Back to the gradient, descent and if if they're explicit, options. For elbow regularization or warm start we go to gradient descent and, finally. Otherwise. We use normal equation and. Conclusion. Basically. As, I said I'm going to repeat the same stuff. This. Is meant for psychoanalysts. Not. Ml. Experts, and, our. Goal is to enable, this purely, in sequel, I. Get, some out of time and thanks. To the bigquery team. It. Depends, on on on, your, data size so. To give you an, example. In. A sequel, query if you do a group, I and a sum, on. 10 rows I'm in. Any, database I'm pretty, sure if you take that on your local machine it. Would be much much faster but. If you scale, that to a billion. Rows then. It, would be different so, it, depends, on if, your data is. Really. Small or if the data is big. So. If you have a billion rows you can knowledge on tensorflow, that's. How. Was that. Not. On a single machine if your data doesn't fit in memory oh. You. Have cancer. If, if everything, fits in the memory of a single machine is going to be faster than Bitcoin in. Our just for this for pretty much any query. Yes. Right. So my question, is training. Has this pattern of making multiple passes, over the data and. You are burning SQL, queries, in every pass versus. Exporting, the data one time and running. In a customized. Ml system, with hardware, acceleration so. How does the performance compare. Especially. When your datasets grow for one thing that we didn't touch here is. For. Doing any training. Maybe, training is so training is one step I would say is 10, percent 20 percent of the complexity, of what you want to do there's an 80 percent that you want to prove process and every. Time you through process and you want to train and see okay did my pre-processing, work you, have to export, your data train, your model. Show.
Up More data into the model and say okay my something. Went wrong let me go back get. Another slice of the data, exported. Train, again see, it works or not so it's an iterative process it's is not it's. Not just one. One shot I train, and ok I'm done you're. Right that the training per se may be faster, if you, give it enough resources. But, training. Is just one step. The. Data in. The same environment that you know how to protect data either an ml expert. You. Don't know so, the problem is that the psycho analyst knows how to prep the data the, person who does the machine learning doesn't. Necessarily, know or is, not in the database time bar but, it's not the same person who pushes, I. Agree. I should. Buy that you know are the same radius to do everything. Exactly, requested, said I remember, in, two. American, Sycamore there's a one paper, from College of London is about, actually heat cleanser. Cleanser. Because. You can. Be. The Vantage of hobbies, Asaka expression, for the symmetric. Yes approve, that's the the. Matrix calculation, is the subset, of proof I actually can be used the. Result, is actually a faster the tensor flow I think we get the pepper dip but that my my my my friends at the first questions that can, compare thank reproduce. At the result on a paper and Steve actually work, on the, car. I don't, have head-to-head, comparison. I guess. They. Do, to take. Advantage that that something. Special to my vision for the use the Superferry, to do the Dibble metrics, one. Of you take a bunch of the questions, offense Lorna sorry as well. Great. So our last speaker before the break is Mona, from Google and he's going to talk about big great analytics storage, and. Then. Afterwards, again we have a break short. Break let's get food come in here again to, talks and an arm and. Okay. So my, name is Masha poo Szymanski, but this stock was supposed to be done by another. Person. Pavan, she couldn't get here because of the snow so, he asked me to do it I saw, those slides first, time in my life couple of hours ago so I'll have to improvise a little bit. Let's. See what they say so. Yeah. So I walk in Google, and be creative for seven years before that there was a Microsoft, work on cosmos, and scope capita sheet still alive and before, that in sickle cell.
So. Be. Clear I made a describe, it beautifully, is this fully, managed it's, yellow, it was Cheryl s before there was a war shareware less, you. Know it's. Data. Was housed in the cloud so. In. This talk I will try. To focus on on the storage part and not talk about execution. Or runtime I just distributed. Parts of it and so. On so. Some. Of the things that. Big. Larry does is wait, additional, data warehouse link right, so it supports, you know and starting snowflake, schemas, and doing, joints, and GML, and all the classic stuff but, probably. It will be a little bit more interesting, to see what makes it special, so, when the scale now I welcome Google everything. Which is less than exabyte, you, know people think you're not shears so, it is it. Is exabyte. Multiple. Exabytes, of storage, and petabytes of queries. Another. Thing that makes. It kind, of interesting and, unique is that it. Is one, big system, which. Kind. Of there are no database, instances, so, for example we have you, know hundreds, of public datasets a ton of stock prices weather patterns, what have you you can immediately, join it with your data all the, data is available as, if it was in your, own instance. So. Another. Interesting thing is that even though have. Most. Of the features of the classic data warehouse we, also have, a lot, of features, to. Support. Data. Storage, for, the you. Know applications. More, modern applications, not done by the Kimball box right and some. Interesting characteristics. Of those you know people don't design. They. Like, they spoil. But this no sequel, right they say why don't we put some semi structured data in an, hour, used to be XML fuchsia gonads JSON and. When. I figure out what, happens later that is generalized. There are no joints, it's. Also very popular in the smaller and applications, kind. Of to have this team you know somebody said SCAF core or something like that just say you know let's write all the data and you, know steam it in and we'll figure out later, what to do so, bigquery has, very, very good solutions, to all of those and it. Does them in principle, database, way right not, eventual consistency like, real stuff so let's. Directly. Should go one by one and. Talk. About few of those so. At, the, data. Layout layer so, this column where sessions to data warehouse it's column, based system, but. With the twist that it has to deal with shame, structure data again, somebody. Before said you know we use nested data so it's not relational, well of course nested, data is relational, since equal 2011. Which introduced, erase, and moon cassettes and, aircraft. And so on so. Because. The latest image structure doesn't mean it doesn't have schema, it has will.
Have Some interesting, ways of how to deal with it and. How. To provide efficient storage and execution, on top of it now. For. Some of the other things. There. I guess, now become very standard. In column based data bases. You know many, different and columns what, can you do you know. Statistics. And bloom filters and things like that. Execution. Push to the compressed data so you can execute some, things like, predicates, and simple computes, without even decompression, kill data, row, the ordering, you know relations, don't have the orders of we're free to, change the order of the roles not, necessarily served by one of the columns but just try, to optimize some of the metrics for example. You. Know tank to put similar values, closer, together you can get better early early, runs. And things, like that so. So. All those happens. For. The day clout and by the way we have not. A paper but we have Freight up which goes into more depth about some, of those issues. So. Now, if we go to the very global, level so I said you. Know global. Managed. Database. It's actually, a more like a journal database. So. The. Wave of aqua storage so each zone right. The. Query has separation. Of storage, and compute and storage is, on. Distributed. File system so this already gives fault tolerance but, in say tej involves, so immediately, replicate, that into three zones and, there, is also off engine application so I guess, we store data so many times it daily must be a. Fault. All around. So. Now what happens within the zone so. One of the things that we do is. The. Way we start again remember we're the MotoGP. Data, warehouse, so the way metadata. About table is structured, we work with. You. Know. Split. Table, beta chunks, which a mutable right so every time and to commit them to the metadata storage so every, time there is any, type of change. There is a new immutable, block and I will go into a little bit more detail about it and, you. Know you can say you, waste a lot of storage but hey Google will have so much storage who don't know what to do with that but. Having, data immutable, makes a lot of things a lot a lot simpler right whether it's been caching, or for. Example there was talk about this you, know how, you apply courage analogous tomates if we thought originally estimates. Were good idea, which. We don't them cause, based optimization in, general, you know we could have computed them on those immutable chunks, and remember. For, the next execution. So. Again. This simplifies our, story a lot so. Let's, go through. Throw. One example what you know how to manage this immutable, chance. Since. We're fully managed service, without everything for the user okay so here's one example of up to my sister's suppose, there was think I mean sat in the firm time so all the three chunks their pens okay simplest possible case. You. Know at, some later time system. On its own resources schism, doesn't even know about it goes, in so came I don't you combine them together so one megabyte hundred kilobytes and two megabytes together give us two point five megabytes, because. Maybe it compares beta which was different and coding core whatnot so. From that point on I. Guess. It's from another slide but, from that point on right. It's exactly same data here. In this chunk or in this one so you, know you can vary either one and we'll probably create, this one unless, we do time travel and let's talk about time travel to bit later. So. This kind of stoic optimization, with a lot, of it but it's on individual. Block layer you know we can reorder columns, if we notice that decorate together we put them close to each other where.
The Rolls we. Pick the best encoding, credits, it's Salamone, replicated. And like. A lot lot of things so. Earlier. If our p.m. so here they would have wanted me to say we use a male to do this because we, have to assemble for everything but. We, use algorithms to, do this. But. We, do take into account you, know historic great patterns, and you, know how, this data is actually used I guess, I should say to the mouse okay. Time tell since, everything, is immutable we, get this really, nice feature for free I saw, in sequel it is the closed for system time as off. You can go back in time you can see how a table looked in the past you can look you know show. Me the data that was added in the last 30 minutes it's another time of form. Of time travel. So. It. Works kind, of, with. Immutable data pretty simple, so let's say were. Appended. To chunks, and then starship to miser, okay. That another one so. Now if you want to see. How. Data looked after g2. You can just steal it from here because contains, the same data but. If you want to see how they two looked between. Chi. 1 and Chi 2 you, go to the older one okay, and, so, on so. This. Is kind of almost side. Effect because, of the story he texture which shows. Now. What happens when you have GML so said you, know vana told cheap eva lesap. To mice to you know update single hall it's more like update, older us with same, condition, so. Again. What. We would do here is, we. Will write a new chunk I try not allowed. To do, they write existing clan but. When, we do that, the. Kind of remember so let's say the updated, customer I want to say four so updates a model test delete plus a pent-up and, say oh so long we need to care about Julius right so. The. Metadata, and transaction, manager and particular. Knows. Which. Data came from whether those errors, so it, we. Don't they write all the data when you update a little bit of it but we have those kind of pointers, we remember, which. Data actually, belongs to which chunk, okay so, again they immutable, they can even share the same you, know also, snapshot, the same data. And, everything. Folks time travel works with it and all, other features works with it. So. Last. Thing that I want to talk because I think it's kind of really. Cool and unique is. The. Stamen, congestion, that we'll have in. Big crater so as I mentioned in the beginning is. It. Is very popular with, kind. Of wave, of modern applications, kind of to say okay, everything, that we have you know all the events that we have just, open a firehose and, write. Them and we'll, figure out later what, to do maybe we can analyze, them in place maybe we can do a little bit of yield XI and transform, them but. First, let's steam everything, so. Again because we're database, we'll have to have you. Know good. Guarantees, about it so steaming, injection. Has, to be as reliable, and and, have all. The same characteristics. As if, you were App Engine data, let's. Say with batch or with GML, commands. So. We'll have special, sub, she no. I don't finish here, so have dedicated. Subsystem. To deal with streaming so what, it has to do so the way it works it is. You. Know HTTP. Endpoint where, you can send your data and. The. Moment you got the arc the data is committed, so, what we do inside, is. We. Write this data first, in the. Optimized. Format because it's the fastest right okay, and since, we're talking about big big fire hose we, need to write three really fast and, what. Really works fine, for us is if. You write only, a little bit of data then, one doesn't matter what we do performance, is going to be good, no matter what and if you are writing a lot of data it, gives great opportunity, to do basic right so. Also. Because one, system, which works like they're not database, instances, we can multiplex. Data from different tables together, and, bash it together and latex, figure out which, data belongs to whom, also. Do a lot of load balancing, so, if you start writing a lot, become, to, single table but. I magically can shout it even across different zones, and take and sell it all together so, it's all consistent. There. Is a special. Form of the stewardship to miser which kind, of looks at all this right. Optimize, data and, decides. What to do with it part of it will stay in memory because. You. Know it's, the. Data. That you're at most recently, has the highest chance of being cleared right you want to know like, let's say your instrumentation kV upset you want to know what's going on right, now not what was two, or three minutes ago, so. Part of it stays in memory in memory optimized thermit part, of it when there is enough gets converted, to the coulomb optimized format, and. So. On so. I guess a lot of time now there, are many, other interesting features, but I guess the most, interesting.
Oh. That. Would be a subject for Shepard, talk, but, basically, the problem of cost based optimization is, it's very unstable, something. Little chip it's like am i right something changes, and your plan changes and nobody knows why right. So. We have other mechanisms which. Are better than cause based estimation, like adaptive, optimization and, things like that which, are much more stable and. And. Also, walkabout like and you don't pay big price for the mistake because they can self correct themselves all the way as the CAG Claire goes on. A. Little, bit more complex than that but the idea is as the, Cray Pigasus it. Corrects, itself on the fly now. It is what did it see so far what, should I do next. So. Just a pendant but that only applies for other style queries. Hold up style queries or long brain queries right. What. Does old spell pose Rosina cute. Well. It applies to all the Claire's well I guess we're datavac house over on your other. Types of Claire's. Okay. And one more than yours, self-join. Or different, times. Yes. You, can, yeah. I mean as far as far. As query optimizer is concerned, it will see two different able, metadata, system we'll figure out where data for each one everything cool off yeah. All. Right let's thank the speaker again.