# Northwest Database Society Annual Meeting - Session 3

Show Video

Them Now, here, we have focus on the reorder, the operations, the retry request and the storage and also, reorder the transactions, and the valuation, so, in a storage is the strategy the optimal, strategy is actually fairly simple because we want to power it has all the rights before, all the reads from a batch of operations. So that for the real operations we don't read any like a stale state which, is happen to be updated by someone else in a batch. But. At the validation. Layer the, reordering, strategy, strategy, is less straightforward so, essentially, the question we are asked is that how can we create a sterilization, order with, the least number of ports. Given, batch of transactions. Now. Formally, the. Problem is to like give, up the actual transactions, the state B we want to construct a subset, B prime which. Is a subset of transactions. Such, as the B prime is 0 eyes about which means we can sterilize the transaction, in B prime without. Causing, any conflict, and the size of the department is maximum, among orders all such subsets the maximal action means we have least number of our ports now. Okay. Lost, its. Okay one minute, like I say that there is a way to, decide. When. The prom is survivable. And there is also an algorithm, to construct a, sterilization order after, we decide this is Sara as well this is something to do with the dependency, graph and has, something to do with how, many cycles or if there is in a psycho in the dependency graph so, you can look up in a paper for more details okay, so after we identify, this problem, we still have the, usual like we, know how to decide this is the so, this is a sterilized but now the question is how it can take the least number of transaction, out of the, out. Of the. Dependency. Graph to make the guapa cyclic now. There is people have strings study that and, unfortunately and, this is mg hard and apparently. We haven't figured out how to make em T equal to V so, we propose them greedy, organisms so that we can have, a design. Choice or trade-off between the overhead. Of the organism, and the. Performance. Of the organism, okay. So there is some extension, where you can like optimize for different goals instead of just minimize, the number of ports you can try to phrase, my reduce the tail lit and say or you can try to power, it has some transactions. Way so like Amina in dollar value instead, of a like a penny value transaction. So. This can also be handled under the same framework so, I guess they just give you a snip of the performance. Numbers, we. Have up, to two times the improvements, throughput as. Well as up to four times reduction, in percentage, presentation, 11 so that you don't lose anything you don't have a trade off here you actually get pissed, off passports, in. Terms of additional. Policies, we. Have like for example we have a college, that try to reduce the very tail light and see like the three nights four nights five nights and, in this case the new policy, can actually reduce that but up to six times.

Okay. I think, I can skip the conclusion. Any. Questions. Yes. Is it single corner or multi-port this, is not equal but but the recording process has to be like when you decide the sterilization order so we have an extension, to how to paralyze. Different. Operations. And we do like type like paralyzing, we, do we, do like horizontal, paralyzes, you whenever we, can but if we can we do can do pipeline paralyzation. Yes. How. Much do you need to know the read and write sets up front before you schedule, the transaction so don't know at all because in this case you also say you you actually decide, the Ceredigion order after, the transaction has finished it is present, so we are trying to decide the order you already know exactly, what are the Rhydian Rice's yes. There. Is no approximation oh, there, is so we can guarantee it and this is a correct, but we cannot guarantee this is the minimum, okay. Does. It make sense yes, but I wonder why you don't use an approximation, if, there are known approach if there if you can use an approximation then. You can always gather with no really long approximation, because we cannot sterilize, the transaction. See if we, find a cycle in a graph. You. Can take this offline. Okay. Thanks. Our. Next speaker is Alec turned off from Microsoft. He's going to talk about towards learning optimizer, for shared clubs Thanks. Hello. Everyone my name is Alec Jindal and I'm going to talk. About our work on the water learning optimizer for share clouds this is a joint work with our awesome intern, Chang Gangu and. A whole, set. Of my colleagues, here at Microsoft. So. We are witnessing a rise of decatur systems they're, really big and we. Have so many examples of systems like hive spark, flame calcite, and a, whole bunch of offerings even from the industry, and there's, a trend towards making the system's declarative. And having. A cost-based, optimizer. Which. Essentially says given. A user query. It'll, automatically figure out the best physical. Execution plan for. That query and the way it does is that it estimates the cardinalities. At each point in the query, plan estimate.

The Cost of the plan and picks the plan with the cheapest cost. So. Far so good and of, course a good plan leads to good performance. But. Problem. Is that the costs optimizers can make all sorts of mistakes and, the, most notorious. Being cardinality. Estimation. So. Much so sure and so that, people believe that it's the root of all evils. It's. A acnes he'll of good. Optimization and. It's a biggest, problem apparently, so. It's a really hard problem I would say, so. Put, into basically tune, our systems you know so, as to make them really performant. And, the. Things which people normally do is to think and do things like you know collect, statistics or, provide you know query, hints so, for example in sequel server you can you know add a row, count hint. Or. You can apply some DB. Administration. So. There's also been, a rise or in the clouds. As well and what. This means for this systems. Is that they're being offered as either. Managed, or even. As our less offerings. Which. Means that the end. Users, now don't have any control and they, can't apply these kinds of tunings any more so. Which means they may not they may not have admin or they not have you in expertise because it's where long tail of users who simply show up and sign up for any of these big. Data services, and they. Don't even actually have control because it's a managed service so. It, means the systems have to self tuned there's no other option. Self, fielding is a really hard problem and it's, super. Hard in these big. Data systems, so. But. There is some hope if you look at specific instances. Of these systems for. Instance if you look at shared cloud infrastructures. What. Do we mean by that so. If you look at enterprises, where there are sets of people who query. Or who like you know process the same sets of data some, sheer data over. A shared infrastructure. Then. They end up actually, producing. A massive log of queries. On the same in inputs, and that, could be used actually to you know tune the system, so. This essentially, visible, query log on the same in each search could be really valuable, this. Might see actually. It, turns out that this, kind of setting is actually very typical, in enterprise, analytics. In, fact at Microsoft. We have cosmos, which. Is a shared, cloud infrastructure, for, doing all of internal Microsoft analytics.

And, In. Cosmos, we have a scope query engine that. Does, batch processing. In a job, satisfactory. And it. Has a massive output of hundreds. Of thousands. Of jobs each day, which. Has been scripted, by thousands. Of users and they, also, checks the bytes of data. If. You look at the Carnegie estimation, in the scope jobs. So. We looked at one of the largest scope, customers, which is Asimov, and we, looked at like one day what, of the locks from them. That's. How the estimation looks like so. In this figure the. Green Line is the ideal and even. After lots of tuning, and constants, and best, effort estimation. There. Is a huge curve of underestimation, and there's, a, huge. Overestimation, happening, and, given. That the x-axis is log scale you can understand that it, could be all over the place basically. And. It's a hard problem because scope. Has this big. Data lots. Of data it's, not easy to know analyze or. Statuses on on that data the. Data is unstructured and, there is lot of custom. User code which, ends up being black, boxes. But. We have some observations, here actually and, one. Thing we observe is that much, of the scope work boards actually are collecting. Which means same sets of queries appear again and again at. Some, frequency, and they. Also have lots of shared, query sub expressions or sub graphs so. Which means that if you have two queries q1, q2 they, may end up scanning. The same items. Table and applying the same filter. Now. A question is how can we leverage this common. Sub-expressions, or common sub graphs to, build or to learn cardinality. Models. So. Of course the easiest thing is that we can simply cache so. If we see a similar. Sub. Expression, we can just completely simply record, the value and use it the, next, time you see it again the. Problem of however is that it's very very has, a very low coverage, because the inputs change we have to change this you know we can't use a cache, anymore and the. Feedback has to be online it, has to be materially as soon as we have seen get applied back immediately which, is kind of tedious. The. Other extreme is ok we learn models, a general, model for, any sub graph no. Saying, if you have a single model for any kind of a sub, graph that's. Very hard to fit your eyes we tried that we failed it's. Hard to train it. Has a very high latency. When scoring because we end up having a large a very, large feature vector. But, most important we had very low accuracy we could not fit a single model for all the cases, so. We took some middle. Ground you say okay let's amongst. Logical. Sub expression, parameters. And inputs let's, only fix expression and let's, allow there are parameters and inputs to change. Which. Means that we, don't learn a single model no, one-size-fits-all let's. Learn a set of models which, together capture the entire workload. So. Which means we now look at sub graph templates, which have the same logical, expression, but may have different physical implementations. And different parameters. And inputs as an. Example suppose we have this. Join, happening and this is a filter, happening, we. May have a ply filter, age 18 or age 20. And the inputs may have changed as well, so. If he tries this kind of sub, expressions and, used. A bunch of features which includes mainly metadata. And. Parameters. And input. Characteristics. And. We. Tried a bunch of models so we try it and we found the linear models to be actually very useful, especially. The. Poisson. Regression. It was the most. Accurate, somehow. The figure is not that, accurate but. It, makes sense because. Cardinalities. Can really the distribution is more exponential, so it can really fall, off really, and it's are also non-negative, so, it makes sense that it was most accurate if, you do all of that what, do we end up basically, so. We end up something like this basically so we see wherever we have models if they are super accurate basically, you know we can actually very. Close to the ideal almost, this. Is actually scary hey like how how actually can be and. If. You look at the default of course we have orders of magnitude more accurate, because. We look at we optimize, for specific instance that for this. Customer. Can be done the, cardinality. Models. But. The other question is what's about applicability, are. These mods applicable to the entire worker or not so. We looked at the applicability. What, percentage, of sub graphs we have models for turns. Out that we approximately, have after 2 days of training. Approximately. 65% of the sub graphs have models. And. We. Need to train around once. A month because, after a month we. Lose some, applicability.

So We are engaging. With some toasted cassava folks, and seeing if you can apply there as well. There, would be some changes probably how we apply feedback, and how much you go ahead you can always. So there are some concerns but, we are engaging industries we don't have answer yet they engage with them and see how we can apply it there has been. Thanks. And. We'd like next to a mere from guru who is going to talk about machine learning in Google. Machine. Learning in bakui. I've. Been, in the Vickery team for for. Four years I've been at Google for six. Before that I was at MSR before. That I was a Michigan, so. It's good. To be back here. I'm. Gonna give a very, short overview of equity, and. Motion. Laser is gonna give you a better overview probably. And. Then I talk about why, we decided to do bigquery ml ml, and. Briefly. Touch and touch on syntax, and then, talk about our, solution. To one of the simpler. Ml. Algorithms, that we support linear. Regressions we have two solutions I'm going to go, through why we decided to do each one and finally. We, have time you can do questions. So. Bigquery, for, those of you who, don't. Know B query goopy. Queries Google's, cloud, base sequel. Data warehouse as a service for, mostly, analytics. You, can use it right now for. Free it, has a free tier so you can try it basically, it's an enterprise data warehouse. It's. A fully managed service within. Sequel. With a sequel, API. It, can scale to petabytes, of storage and. Support. Encryption streaming. And all. The nice features that you will you, want from your data warehouse so. The, quarry ml. One. Of the things that we saw from enterprise. Customers, and. That. Use bigquery is that the. Sequel analysts, that they have. Their. Main job is, to extract, inside, from, beta and. Usually. It's based on some historic, data, that they stored they want to extract something. Based. On the history or they want to predict something in the future and. You. Can see two queries here that these are just simple queries the first one is trying to come up with the average income based on state, and the, second one is top, orders of, top.

Solve This and. The. Solution, for, W, so I wrote it here it's, it's a it's a well-known solution, I'm not going to go into the detail of how we got it this is you can look it up on Wikipedia it, basically, consists, of a bunch of a few. Multiplications. And a matrix inversion, and. These. Are both. Hard things to do in a database. Especially. The inversion, so the. Way we go, about. This is we do, the matrix multiplication, using, joins in Vickery because b, kotov does pretty, well for for for scaling, joints and the. Matrix, inversion is, basically. The. First, term that you see X. 3 X that's. An N by n matrix, and remember that M was much bigger than n and n, is the number of features so it's, a pretty small matrix, you don't you're not gonna have more than thousand. Features or 10,000 features so we can fit that in memory so. We basically run a query to, do the. Multiplications. Matrix. We, put, them in a table as like row row. Number column number the value and the, the multiplications. Our joins and. The. Matrix inversion because, it fits, in memory of one chart we, do use a fast solver. Off, the shelf to do the matrix inversion. Finally. How do we decide between the two as I said a psychoanalysts. In care about any of these they just want to say select star from T train. On this. So. All. Of these are handled, by us so we decide if the data fits in the memory by, Luke get the number of training features if you have more than 10,000, features we, know that that matrix is too big we are not gonna matrix do matrix inversion if, there is overfitting, issues so close form is not great, because it doesn't support edwin, regularization if we, see that number of training examples is, not enough. And we are gonna get into, overfitting issues we go. Back to the gradient, descent and if if they're explicit, options. For elbow regularization or warm start we go to gradient descent and, finally. Otherwise. We use normal equation and. Conclusion. Basically. As, I said I'm going to repeat the same stuff. This. Is meant for psychoanalysts. Not. Ml. Experts, and, our. Goal is to enable, this purely, in sequel, I. Get, some out of time and thanks. To the bigquery team. It. Depends, on on on, your, data size so. To give you an, example. In. A sequel, query if you do a group, I and a sum, on. 10 rows I'm in. Any, database I'm pretty, sure if you take that on your local machine it. Would be much much faster but. If you scale, that to a billion. Rows then. It, would be different so, it, depends, on if, your data is. Really. Small or if the data is big. So. If you have a billion rows you can knowledge on tensorflow, that's. How. Was that. Not. On a single machine if your data doesn't fit in memory oh. You. Have cancer. If, if everything, fits in the memory of a single machine is going to be faster than Bitcoin in. Our just for this for pretty much any query. Yes. Right. So my question, is training. Has this pattern of making multiple passes, over the data and. You are burning SQL, queries, in every pass versus. Exporting, the data one time and running. In a customized. Ml system, with hardware, acceleration so. How does the performance compare. Especially. When your datasets grow for one thing that we didn't touch here is. For. Doing any training. Maybe, training is so training is one step I would say is 10, percent 20 percent of the complexity, of what you want to do there's an 80 percent that you want to prove process and every. Time you through process and you want to train and see okay did my pre-processing, work you, have to export, your data train, your model. Show.

Up More data into the model and say okay my something. Went wrong let me go back get. Another slice of the data, exported. Train, again see, it works or not so it's an iterative process it's is not it's. Not just one. One shot I train, and ok I'm done you're. Right that the training per se may be faster, if you, give it enough resources. But, training. Is just one step. The. Data in. The same environment that you know how to protect data either an ml expert. You. Don't know so, the problem is that the psycho analyst knows how to prep the data the, person who does the machine learning doesn't. Necessarily, know or is, not in the database time bar but, it's not the same person who pushes, I. Agree. I should. Buy that you know are the same radius to do everything. Exactly, requested, said I remember, in, two. American, Sycamore there's a one paper, from College of London is about, actually heat cleanser. Cleanser. Because. You can. Be. The Vantage of hobbies, Asaka expression, for the symmetric. Yes approve, that's the the. Matrix calculation, is the subset, of proof I actually can be used the. Result, is actually a faster the tensor flow I think we get the pepper dip but that my my my my friends at the first questions that can, compare thank reproduce. At the result on a paper and Steve actually work, on the, car. I don't, have head-to-head, comparison. I guess. They. Do, to take. Advantage that that something. Special to my vision for the use the Superferry, to do the Dibble metrics, one. Of you take a bunch of the questions, offense Lorna sorry as well. Great. So our last speaker before the break is Mona, from Google and he's going to talk about big great analytics storage, and. Then. Afterwards, again we have a break short. Break let's get food come in here again to, talks and an arm and. Okay. So my, name is Masha poo Szymanski, but this stock was supposed to be done by another. Person. Pavan, she couldn't get here because of the snow so, he asked me to do it I saw, those slides first, time in my life couple of hours ago so I'll have to improvise a little bit. Let's. See what they say so. Yeah. So I walk in Google, and be creative for seven years before that there was a Microsoft, work on cosmos, and scope capita sheet still alive and before, that in sickle cell.

So. Be. Clear I made a describe, it beautifully, is this fully, managed it's, yellow, it was Cheryl s before there was a war shareware less, you. Know it's. Data. Was housed in the cloud so. In. This talk I will try. To focus on on the storage part and not talk about execution. Or runtime I just distributed. Parts of it and so. On so. Some. Of the things that. Big. Larry does is wait, additional, data warehouse link right, so it supports, you know and starting snowflake, schemas, and doing, joints, and GML, and all the classic stuff but, probably. It will be a little bit more interesting, to see what makes it special, so, when the scale now I welcome Google everything. Which is less than exabyte, you, know people think you're not shears so, it is it. Is exabyte. Multiple. Exabytes, of storage, and petabytes of queries. Another. Thing that makes. It kind, of interesting and, unique is that it. Is one, big system, which. Kind. Of there are no database, instances, so, for example we have you, know hundreds, of public datasets a ton of stock prices weather patterns, what have you you can immediately, join it with your data all the, data is available as, if it was in your, own instance. So. Another. Interesting thing is that even though have. Most. Of the features of the classic data warehouse we, also have, a lot, of features, to. Support. Data. Storage, for, the you. Know applications. More, modern applications, not done by the Kimball box right and some. Interesting characteristics. Of those you know people don't design. They. Like, they spoil. But this no sequel, right they say why don't we put some semi structured data in an, hour, used to be XML fuchsia gonads JSON and. When. I figure out what, happens later that is generalized. There are no joints, it's. Also very popular in the smaller and applications, kind. Of to have this team you know somebody said SCAF core or something like that just say you know let's write all the data and you, know steam it in and we'll figure out later, what to do so, bigquery has, very, very good solutions, to all of those and it. Does them in principle, database, way right not, eventual consistency like, real stuff so let's. Directly. Should go one by one and. Talk. About few of those so. At, the, data. Layout layer so, this column where sessions to data warehouse it's column, based system, but. With the twist that it has to deal with shame, structure data again, somebody. Before said you know we use nested data so it's not relational, well of course nested, data is relational, since equal 2011. Which introduced, erase, and moon cassettes and, aircraft. And so on so. Because. The latest image structure doesn't mean it doesn't have schema, it has will.