Patterns and Operational Insights from the First Users of Delta Lake

Hi, my name is Dominique Brezinski. I'm a distinguished engineer in Apple Information Security, and I'm going to talk about Delta Lake patterns and insights.

Necessity is the mother of invention, and in building big data systems there are a lot of challenges. About three years ago I was at Spark Summit and ran into Michael Armbrust. We talked about a system I had built at a prior employer that was pretty large scale, doing hundreds of terabytes of data a day. I had to build a similar system at Apple, and I described some of what we had done and the challenges we had hit with certain components. I saw a flash of inspiration, or brilliance, in Michael's eyes, and he said, "Yeah, I don't think our architecture is going to hold up to that, but I have this thing in my back pocket." As we discussed it a bit more, it really resonated with where I wanted to be and what I wanted to do for this project.

A couple of months into a POC, Michael reached out and said, "Hey, I have some code for you to try," and that turned out to be basically the alpha of Delta Lake; it was code-named Tahoe at the time. After running a few tests and verifications we just went all in, and within a short period of time we were doing tens of terabytes a day into Delta Lake. Now we're much bigger: hundreds of terabytes, approaching a petabyte a day of data ingested through our ETL pipeline, and we process much higher volumes of data out of our daily working set.

We learned a few things along the way and had some insights. I'm not going to spend much time on the standard meat and potatoes of having data in a table-like format. Instead I'll cover some of the more novel or interesting things Delta Lake has allowed us to do, and then some operational insights and gotchas we've experienced, hopefully with a few tips and tricks mixed in.

First, I'm going to talk about patterns. Since I said we run a big ETL pipeline, first up is an inversion of ETL: extract, transform, load. In the big data ecosystem we usually do something more like extract, load, and then transform, and that's exactly what our system looks like. We have an extract system that dumps a bunch of data into S3. We pick that data up and load it into a staging table, and then we have a bunch of streams off the staging table doing transformations that result in distinct data tables with well-defined schemas, presenting a stable interface for the downstream consumers of that data.

Before we get too far into these patterns, I'm going to pause for a second and talk a little bit about the parsing framework we use, because the sample code coming up relies on it, and it's easier to follow if we take a look at it first.

We have an abstract class called Parser. We broke parsing up into a set of base steps to give ourselves flexibility. Our parsers are basically DataFrame transforms: they take a DataFrame in and pass a DataFrame out. The prepare step, given the source DataFrame, is meant to do simple things like filtering or simple extractions that yield additional information we might need while parsing, and it may drop some columns coming in if necessary; beyond that there isn't much processing in it. The parse step is where we do the majority of our extraction and transformation of the data. Finally, complete is an opportunity, once parsing is done and validation has occurred, to do any last fix-up of the data before it gets committed to the table or emitted from the stream.

The interesting thing to notice about the apply function (this is all Scala) is that it runs these steps in a fixed sequence, with some work in between. After prepare runs, we create a struct of all the columns in the incoming DataFrame and put it into an origin column, so we've essentially transposed all the columns into a single struct column that we start from. Then we run the parse method. After that, setParseDetails runs automatically: it creates a column containing a struct, using the valid-conditions expression and the additional-details expression to validate the row and capture any extra detail that could be useful downstream, whether for dealing with a parsing error or for tracking lineage of the data. It also stamps the row with the time it was actually parsed. Once the parse details have been set, the complete method is finally called. This is a setup for everything we do through our ETL pipeline.

On the extract side, we have a fairly complicated system upstream (actually a couple of systems) responsible for collecting log data and telemetry, or extracting data from upstream systems: databases, APIs. Fundamentally it takes the raw data and wraps it in a JSON layer that includes, for instance, the event type, the source of the data, and timestamps from that system so we can track latencies. It's essentially metadata, plus a raw field containing the actual log event, piece of telemetry, or data. The extract system drops that into S3.

From there we use the s3-sqs source in Spark, tied to notifications on the S3 bucket. That's how we consume high volumes of data without doing expensive, slow list operations against the bucket. So that's the source of our data. The raw data is in a JSON-object-per-line format, but a note here: we don't use the JSON input format for the Spark streams. We use the text file format, so each record is captured as a single text line, and I'll talk a little more about why we do that and the advantages as we get into loading the data out of S3.
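As a rough illustration only (the talk doesn't show the actual class), a parser abstraction along these lines might look like the following in Scala. The method and column names here (prepare, parse, complete, validConditions, additionalDetails, origin, parse_details) follow the description above but are otherwise assumptions, not the real implementation.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Hypothetical sketch of the parser abstraction described in the talk.
abstract class Parser extends Serializable {
  // Expression that must evaluate to true for a row to be considered valid.
  def validConditions: Column
  // Extra per-row detail to carry along (lineage, debug info, etc.).
  def additionalDetails: Column

  // Light-weight filtering / pre-extraction on the incoming frame.
  def prepare(df: DataFrame): DataFrame = df
  // The bulk of the extraction and transformation.
  def parse(df: DataFrame): DataFrame
  // Final fix-up after validation, before the result is written out.
  def complete(df: DataFrame): DataFrame = df

  // Runs the steps in a fixed sequence, preserving the raw input in `origin`.
  final def apply(source: DataFrame): DataFrame = {
    val prepared = prepare(source)
    // Transpose all incoming columns into a single struct column.
    val withOrigin = prepared.select(struct(prepared.columns.map(col): _*).as("origin"))
    val parsed = parse(withOrigin)
    // Record validity, extra details, and the parse time alongside the data.
    val withDetails = parsed.withColumn(
      "parse_details",
      struct(
        validConditions.as("valid"),
        additionalDetails.as("details"),
        current_timestamp().as("parsed_at")))
    complete(withDetails)
  }
}
```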

Here's our staging parser. Our validation condition is simply that the record has a valid timestamp, and in the additional details one of the things we capture is the input file name. That gives us lineage from the row back to the file in S3 it came from, which helps with debugging and break-fix work, and it establishes at least a baseline history for the data. The parser itself is really simple: we're basically just extracting the JSON from the text input. The text file input creates a single column called value, which has been transposed into the origin struct, so we use from_json on it. We've actually extended from_json to include some fancy timestamp parsing we do and to add semantics that are consistent for our use case. We know the schema for this input data because it's our standard metadata wrapper, so we extract those fields and transpose them out of the resulting struct into top-level columns, preserving the origin column. We end up with the raw data as it came in, inside origin.value, plus the extracted data in top-level columns.

Internally, the parser does the validation and checks whether there is a valid timestamp. In our complete method we look to see whether the timestamp is null; if it is, we substitute the parsing time, and if not we use the extracted timestamp. Then we keep all of those standard columns, the parse details, and the origin that gives us lineage. The important things here are that we always end up with a valid record even if we fail to extract the JSON, we have the original string and its lineage, and we have parse details, so we know whether the row parsed correctly and have plenty of information to help if we had a problem loading.

The reason we use the text file input at this stage is that if you just use the JSON reader and it comes across a bad line, or worse, a line that doesn't match your extract schema, you get all nulls and you won't know exactly what happened. So we capture the whole string in case parsing fails; we have a copy of it and its lineage. This has helped a lot when the shape of the data changes upstream, because it allows us to reprocess against the same table and overwrite if we need to. It's had real advantages from a break-fix perspective.
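A minimal sketch of what the staging parse step could look like, assuming the stream was read with the text source (a single value column) and wrapped into the origin struct as described. The wrapper schema, column names, and the timestamp fallback shown here are illustrative, and the custom from_json extensions mentioned above are not reproduced.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed shape of the standard metadata wrapper; the real schema differs.
val wrapperSchema = StructType(Seq(
  StructField("eventType", StringType),
  StructField("source", StringType),
  StructField("eventTime", TimestampType),
  StructField("raw", StringType)))

def parseStaging(withOrigin: DataFrame): DataFrame = {
  withOrigin
    // `origin.value` is the single text line captured by the text file source.
    .withColumn("wrapper", from_json(col("origin.value"), wrapperSchema))
    // Transpose the extracted struct into top-level columns, keeping `origin`
    // so the raw line survives even if extraction fails (all-null wrapper).
    .select(col("wrapper.*"), col("origin"))
    // Lineage back to the S3 object the row came from.
    .withColumn("input_file", input_file_name())
    // Fall back to parse time when the wrapper timestamp could not be parsed,
    // so every record still lands with a usable timestamp.
    .withColumn("eventTime", coalesce(col("eventTime"), current_timestamp()))
}
```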

Then we get to the transform phase. The staging table is partitioned by date and event type, so most of our actual data parsers have a stream reading off the staging table with a predicate on it, looking for a particular event type, or sometimes poking into the raw log itself for fixed tokens, and that feeds the stream. In our prepare phase we drop the origin struct from the prior table, because we're going to create a new one from the incoming data: we drop origin, keep all the other fields, and a new origin gets created. Then we parse. A lot of our data sets are themselves JSON, so a simple example is doing an extraction with from_json using that payload's schema and transposing the result back out. Some of ours are more complicated, and the parse method has a bunch of code that deals with ugly, raw, semi-structured events; it really depends on the source type. The output from this goes to an actual clean data table, what some people would call silver or gold, depending. For our users that data is first class and super important; there are a lot of derivative datasets built from those tables, but the events themselves carry a lot of value in our environment.

So that's extract, load, transform.
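A sketch of what one of these event-specific transform streams might look like. The table paths, event type, and payload schema are placeholders, and an active SparkSession named spark is assumed.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Placeholder schema for one event type's raw payload.
val payloadSchema = StructType(Seq(
  StructField("host", StringType),
  StructField("action", StringType),
  StructField("ts", TimestampType)))

val cleaned = spark.readStream
  .format("delta")
  .load("/tables/staging")                    // staging table (partitioned by date/eventType)
  .filter(col("eventType") === "dhcp_lease")  // predicate selecting this parser's events
  .drop("origin")                             // a fresh origin is built from these columns downstream
  .withColumn("payload", from_json(col("raw"), payloadSchema))
  // Assumes payload field names don't collide with the top-level columns.
  .select(col("*"), col("payload.*"))
  .drop("payload")

// Append the result to the event-specific "clean" (silver) table.
val cleanQuery = cleaned.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/dhcp_clean")
  .start("/tables/dhcp_clean")
```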

Once we have these tables, we want to start doing things with them, and one of the unique features of Delta Lake is being able to do things like ACID upserts at very large scale. In an upsert you have a table and new data coming in, and based on what is essentially join criteria, whether certain columns match or have a relationship with one another, you may choose to update an existing row, insert a new row, or delete a row. That's super powerful, and we'll talk more about it in some of the use cases.

It turns out that sometimes your data doesn't look like you think it does, or your merge conditions, essentially the ON clause, aren't as specific as they need to be. What happens then is that multiple rows from your new input data match a single row in the existing table. That violates a constraint for merge, it throws an exception, and you're left wondering: was it my ON criteria, or did my input data actually have duplicates in it? You're sort of stuck debugging merge logic.

Often, if you're doing a merge in a stream, you're doing it in a foreachBatch method. A little trick you can use: you get the micro-batch DataFrame as input, and you can write it out to some other Delta table, just overwriting the data in it, before running the merge. If the actual merge then throws an exception, you're left with the exact input data that violated the constraint, and you can manually run a join with the same ON-clause conditions and check whether your input had duplicate rows, or whether your ON conditions are too wide and matching multiple rows that aren't duplicates. It makes that case super easy to debug. I wouldn't recommend leaving the intermediate write in, since it adds latency; in production, if you're not latency sensitive, it's not a terrible idea, because after each batch that table represents the last batch that was processed. But in development it's super useful, and if a production stream breaks, you can wedge that little bit of code in, run against the existing checkpoint, let it retry the last batch, watch it fail again, and you'll have the data you need. A super helpful thing.
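Here is a sketch of that trick: persist the micro-batch to a scratch Delta table before the merge, so a failure leaves the offending input behind. The paths, key column, and source stream below are placeholders, and an active SparkSession named spark is assumed.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Placeholder source stream of updates to merge into the target table.
val updates = spark.readStream.format("delta").load("/tables/updates")

def upsertBatch(batch: DataFrame, batchId: Long): Unit = {
  // Keep a copy of the exact input; overwrite so it only ever holds the last batch.
  batch.write
    .format("delta")
    .mode("overwrite")
    .save("/debug/last_merge_input")

  // If this merge fails (e.g. multiple source rows match one target row),
  // the table written above holds exactly the data that caused it.
  DeltaTable.forPath(spark, "/tables/target").as("t")
    .merge(batch.as("s"), "t.id = s.id")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}

val mergeQuery = updates.writeStream
  .foreachBatch((df: DataFrame, id: Long) => upsertBatch(df, id))
  .option("checkpointLocation", "/checkpoints/target_merge")
  .start()
```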

Another thing we do with merge marries it with a powerful feature of Spark Structured Streaming: stateful processing. We use this a lot for building things like DHCP sessions and VPN sessions. Any sessionization makes a lot of sense here, or any case where you have a stateful change or update you want to apply, because the two work really well together: you can run mapGroupsWithState or flatMapGroupsWithState and feed the output into a merge.

Take DHCP as the example. When a machine gets a new session and IP binding, we want to emit that as soon as possible, so any other event data coming in can be mapped by IP to the correct machine identity; we want the open session available super fast. Eventually that session closes and gets an end time, which becomes really important when processing older data, so that you only assign the host to the IP within the right time window relative to the time-series data you're looking at. So we emit an open, and through a foreachBatch writer it gets appended to the table. Then we emit a close when we see the end of the session, and through merge we find the corresponding open session and update that row in place to become the closed session, adding the end time and some other state and attribute information. That keeps the session table really small: we have either an open or a closed row for each session, never both. It keeps the processing logic simpler when we join against it to enrich other data, and it's easy to reason about the data you see in the table. mapGroupsWithState or flatMapGroupsWithState married up with Delta Lake merge: they're really peanut butter and jelly.
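A sketch of the session-upsert side of this pattern, assuming the stateful operator has already produced open/close rows. The table paths and column names (session_id, state, end_time) are placeholders.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Open events insert a row; close events update the matching open row in place,
// so the table holds exactly one row per session.
def upsertSessions(sessionEvents: DataFrame, batchId: Long): Unit = {
  DeltaTable.forPath(spark, "/tables/dhcp_sessions").as("t")
    .merge(sessionEvents.as("s"), "t.session_id = s.session_id")
    .whenMatched("s.state = 'closed'")
    .updateExpr(Map(
      "state"    -> "s.state",
      "end_time" -> "s.end_time"))
    .whenNotMatched()
    .insertAll()
    .execute()
}

// In practice this stream is the output of mapGroupsWithState /
// flatMapGroupsWithState over the raw DHCP events; a Delta table stands in here.
val openAndCloseEvents = spark.readStream.format("delta").load("/tables/session_events")

val sessionQuery = openAndCloseEvents.writeStream
  .foreachBatch((df: DataFrame, id: Long) => upsertSessions(df, id))
  .option("checkpointLocation", "/checkpoints/dhcp_sessions")
  .start()
```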
Another big one, and this is sort of a crazy thing we did: we have a few dozen tables with columns that contain either a single IP, a comma-delimited list of IPs, or an array of IPs. It is really common for us to have to find all the data we have around a given IP, a set of IPs, or IPs in some CIDR range. That's a very expensive operation, even with things like dynamic file skipping and Z-ordering on those IP columns, because some of the columns hold a single IP and others hold lists, so you end up doing contains or regex matching, which is terrible: you're doing full table scans. In the past, when we had a set of IPs and had to look across our long retention window over all those data sets, it could take up to 48 hours on a very large cluster. We decided we needed an index.

But we have 38 tables, and these tables are crazy big; we're writing hundreds of terabytes a day across them, so it's a lot of data. The first naive approach was to create a stream off each of those tables pulling out IP addresses, the date partition, and the table, union those streams together, run an aggregation, and write it out to a table. You can imagine how well that went: even on extremely large clusters it blew up, beyond the computational resources we had. So we had to start decomposing the problem a bit.

The really interesting thing is that Delta Lake semantics, and the operations you can do on a table, in some ways really complement Structured Streaming, or can be used instead of certain Structured Streaming features. What we did is hang one or more streams off each source data set, doing a foreachBatch group-by on source IP, dest IP, and the date partition, and including the source table name and column information. Using foreachBatch means we're running that aggregation only on each micro-batch of the stream, not in totality across the stream. We take the result and append it to a step-one table. So we have a bunch of source tables, each with streams coming off them, and all of those streams append to one table, within some reasonable balance, and that creates the step-one table.
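A sketch of one of those per-source step-one streams, with placeholder paths and column names (src_ip, dst_ip, date). The aggregation runs per micro-batch inside foreachBatch, and the result is appended to the shared step-one table.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def appendStep1(batch: DataFrame, batchId: Long): Unit = {
  batch
    // Aggregate only this micro-batch, not the whole stream.
    .groupBy(col("date"), col("src_ip"), col("dst_ip"))
    .agg(count(lit(1)).as("row_count"))
    // Tag where the IPs came from so the final index can report it.
    .withColumn("source_table", lit("dhcp_clean"))
    .withColumn("source_columns", array(lit("src_ip"), lit("dst_ip")))
    .write
    .format("delta")
    .mode("append")
    .save("/tables/ip_index_step1")
}

// One such stream per source table, all appending to the same step-one table.
val step1Query = spark.readStream
  .format("delta")
  .load("/tables/dhcp_clean")
  .writeStream
  .foreachBatch((df: DataFrame, id: Long) => appendStep1(df, id))
  .option("checkpointLocation", "/checkpoints/ip_index_step1_dhcp")
  .start()
```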

Off the step-one table we do another foreachBatch, actually two aggregations: one grouping by source IP and date, and another by dest IP and date, because ultimately the index is about mapping one IP to the tables, columns, and date partitions it showed up in. We do the two group-bys, union them together, and append the result to a step-two table. Then from the step-two table we do another foreachBatch, and here instead of a group-by we actually do a window aggregation; it turns out to be more performant, and I haven't really dug into why. We take the output of that aggregation and upsert it into the index table.

The final index table holds the original value from the source column, an individual extracted IP from that value, the date partition, an array of the data set names and columns where that IP showed up in that original value on that date partition, and a running total row count, which gives us a good approximation of how much activity there was for that IP on that day. Over the day, as new data comes in, that array may grow and the total count may grow, but we aren't creating a bunch of other rows. That gives us a much smaller, more compact table that we can Z-order by the extracted value, and we can very rapidly search for rare IPs. From the result we get metadata about which data sets they were in, the date partitions, and row counts, and we can do some interesting things with that, or dynamically compute queries against the source data sets using that information. Now we have a date-partition predicate, as well as the original value, so we're doing an equality search on that column value in that table. The search that used to take 48 hours, using this setup to get the real data from the source tables, took about a minute. That's a radical improvement. We spend some time up front to compute the index, but every time we run that type of search we save massive amounts of cost and resources, and our users get much faster responses for the data they need.
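The final upsert into the index table might look roughly like this. The schema (original_value, extracted_ip, date, datasets, total_rows) mirrors the description above, but the names and paths are assumptions.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// One row per (original value, extracted IP, date), carrying the set of
// datasets/columns it appeared in and a running row count.
def upsertIndex(agg: DataFrame, batchId: Long): Unit = {
  DeltaTable.forPath(spark, "/tables/ip_index").as("t")
    .merge(
      agg.as("s"),
      """t.extracted_ip   = s.extracted_ip AND
        |t.original_value = s.original_value AND
        |t.date           = s.date""".stripMargin)
    .whenMatched()
    .updateExpr(Map(
      // Grow the list of datasets/columns and the running count in place
      // instead of adding new rows.
      "datasets"   -> "array_union(t.datasets, s.datasets)",
      "total_rows" -> "t.total_rows + s.total_rows"))
    .whenNotMatched()
    .insertAll()
    .execute()
}
```

The index table is then periodically Z-ordered by the extracted IP column, which is what makes the point lookups fast.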
There's an interesting thing about tables that are merged into. I've been saying we build these index tables and session tables and do upserts into them; the problem is that if you want to take that data and get something like a slowly-changing-data feed from it, so you see new sessions, updated session state, or new index values as they land, you can't just hang a stream off that table. As soon as there is an update or a delete, the stream throws an exception, because that violates the standard append-only semantics. So you have to add the ignoreChanges option to the stream, and with updates and inserts what happens is this: if an underlying Parquet file is read because one row in it needs to be updated, that row gets updated and a new Parquet file is written out to replace the old one, and when the change set comes through the stream, the entire contents of that Parquet file get replayed downstream. You can imagine the tremendous number of duplicates: you're not just seeing new inserts and updated rows, you're seeing all the other unchanged rows from the same Parquet files.

The little trick we use to get a clean feed of just inserted and updated rows is to hang a stream off the table that's being upserted into, with ignoreChanges set, and have it do another upsert using the dedupe pattern. The dedupe pattern is essentially: check whether all the columns match, and if they do, ignore the row; if they don't, it's a value you haven't seen in the table, i.e. it's an inserted or updated record, and you insert it into a clean table. Now that clean table can be a nice source for a stream that gives you an SCD-type update feed. It's one extra step and a little more latency, but we've solved an interesting problem just by understanding the semantics of Delta, marrying that up with streaming, and chaining merges together with different semantics.
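A sketch of that chained dedupe merge, with placeholder paths and columns. The ON clause compares every column, so replayed-but-unchanged rows are ignored and only new or updated rows get inserted into the clean feed table.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Reading the upserted table replays whole rewritten files once updates occur,
// hence ignoreChanges plus the dedupe merge below.
val replayed = spark.readStream
  .format("delta")
  .option("ignoreChanges", "true")
  .load("/tables/dhcp_sessions")

def dedupeMerge(batch: DataFrame, batchId: Long): Unit = {
  DeltaTable.forPath(spark, "/tables/dhcp_sessions_feed").as("t")
    .merge(
      batch.as("s"),
      // Match on every column: rows replayed unchanged hit this and are ignored.
      """t.session_id = s.session_id AND
        |t.state      = s.state AND
        |t.end_time  <=> s.end_time""".stripMargin)
    .whenNotMatched()
    .insertAll()   // only genuinely new or changed rows land in the clean feed
    .execute()
}

val feedQuery = replayed.writeStream
  .foreachBatch((df: DataFrame, id: Long) => dedupeMerge(df, id))
  .option("checkpointLocation", "/checkpoints/dhcp_sessions_feed")
  .start()
```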

That's a really interesting and powerful pattern that we use in a few places. And that's it for patterns; now I'm going to get into some of the insights we've had.

Obviously we run some very large infrastructure. I said we take in about a petabyte a day, and we have a long retention window as well, so we have tables on the order of two petabytes, and a one-and-a-half-petabyte table sitting around that we operate against regularly.

From prior experience we knew you can have performance problems with S3 if you don't have enough entropy at the top of the path, so that the S3 bucket can shard well and distribute your workload. When we started this we said, hey, we know how to do that, we can get really high I/O on a bucket, so let's just put all our tables in one bucket; it seemed easier. It turns out it's not easier, and under certain conditions, or when S3 is having a bad day, we might still get throttled, and sometimes that throttling impacts other tables in the same bucket if they happen to share the same shard. So we found a best practice: for most tables, unless they're incredibly small and bounded, we put each table and its corresponding checkpoint in its own bucket, and we enable random prefixes on that Delta table. That means the data written to the table is spread nicely across a well-distributed key space, we get really high I/O from S3, and we don't get throttled.

It also has some other nice advantages. The bucket becomes a proxy for table access, so we can use standard IAM ACLs on it, which means we can share individual tables with non-Spark consumers. For instance, if you wanted to use Hive with the Delta Lake connector, or Presto or Starburst Presto with the Delta Lake connector, you could expose individual tables just by granting access on the bucket, even to consumers running in other accounts, which is a lot easier than doing ACLs on prefixes and trying to keep track of all of that; it's just cleaner. And if you want to replicate a table to another region, you can just use bucket replication, which largely replicates the table data, the table metadata, and the checkpoint data. Under most conditions the bucket in the other region will have a corresponding version of the table and even a checkpoint you can restart from within a reasonably short time, so there are some nice properties there.
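Setting that up might look something like the following. The bucket and table names are placeholders, and delta.randomizeFilePrefixes is the Databricks Delta table property as I understand it, so verify the exact property name and syntax against your runtime's documentation.

```scala
// One table (and its checkpoint) per bucket, with randomized file prefixes so
// writes spread across the S3 key space instead of hot-spotting one prefix.
spark.sql("""
  CREATE TABLE dhcp_clean (
    session_id STRING,
    host       STRING,
    ip         STRING,
    date       DATE
  )
  USING DELTA
  LOCATION 's3://security-dhcp-clean/table'
  TBLPROPERTIES ('delta.randomizeFilePrefixes' = 'true')
""")
// The stream's checkpointLocation would point at the same bucket,
// e.g. s3://security-dhcp-clean/checkpoint.
```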

The next insight, and this shows up in the patterns, is that we really came to see that Delta Lake, Structured Streaming, and Spark are very composable. If you understand the semantics you get through the various configuration options, through using a normal stream versus foreachBatch, and the semantics around Delta Lake, you can overcome scale hurdles. For instance, say we want to keep a running aggregation on a huge stream. In our case the timestamps on our streams often have a long-tail distribution, so trying to set up a window and a watermark is either not possible, or possible but with resource consumption that's sketchy and not very stable, and not well protected against huge spikes in data. What we found is that we can do foreachBatch aggregations on the stream, which don't have to keep any cross-batch state, and then use merge on a Delta table to keep updating the aggregation output, giving us a running total for the stream. So we're using Delta Lake semantics to overcome scaling issues we might otherwise hit in Spark Structured Streaming.

When you use foreachBatch you give up exactly-once and get at-least-once, but if you include a batch ID column in the Delta table you're merging into, you can include it in the merge conditions to make sure you don't apply results from a batch you've already seen, and now you're back to exactly-once. This kind of interplay between the capabilities of Delta Lake and Structured Streaming gives us super composable building blocks, and we've tackled many problems, including building a giant streaming inverted index, without introducing any other technologies into our stack, just using our expertise in those two.
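A sketch of that pattern: a per-batch aggregation merged into a running-totals table, with the batch ID stored on each row and checked in the merge so a replayed batch is not double-counted. The table path and column names are placeholders, and the idempotence check is placed on the matched clause here, which serves the same purpose as putting it in the ON condition.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def updateRunningCounts(batch: DataFrame, batchId: Long): Unit = {
  // Aggregate just this micro-batch; no stream state is kept.
  val counts = batch
    .groupBy(col("ip"))
    .agg(count(lit(1)).as("cnt"))
    .withColumn("batch_id", lit(batchId))

  DeltaTable.forPath(spark, "/tables/ip_running_counts").as("t")
    .merge(counts.as("s"), "t.ip = s.ip")
    // Only apply the increment if this batch has not been applied already,
    // which restores effectively-exactly-once behavior under replays.
    .whenMatched("s.batch_id > t.last_batch_id")
    .updateExpr(Map(
      "total"         -> "t.total + s.cnt",
      "last_batch_id" -> "s.batch_id"))
    .whenNotMatched()
    .insertExpr(Map(
      "ip"            -> "s.ip",
      "total"         -> "s.cnt",
      "last_batch_id" -> "s.batch_id"))
    .execute()
}
```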
A little thing to know, and this is actually covered well in the Delta Lake docs, but the interaction between schema ordering and stats has bitten us before even though it was documented, so I'm calling it out again. Stats collection in Delta, for the first 32 fields (I'll clarify fields versus columns in a second), generates things like min and max, captured on a per-Parquet-file basis in the metadata. Why do I say fields and not columns? It's awkward terminology, but Delta Lake tries not to make exceptions around types, so in a struct it considers each member of the struct to be a column, essentially. That means it's not the first 32 top-level columns that get stats: if you have a deep struct at an early column index, it may eat up those first 32 slots by itself, so you have to count across and then down when figuring out what falls within the limit. You can change the 32 using the dataSkippingNumIndexedCols setting, and you can set it to -1 if you want stats on everything, or to any value you want.

Dynamic file pruning uses the min/max column stats: when you have a predicate on a column value, if the value falls outside the min and max for a file, the answer can't be in that file, so it can be skipped, and reduced I/O means increased performance. Z-ordering, in the Databricks product, essentially maximizes the utility of min/max by doing a clustered sort across one or more columns, giving you tight min/max ranges without inter-file overlap, which maximizes the number of files you can skip for such a predicate. The key is: if you're Z-ordering or sorting on columns, make sure those columns actually have stats collected, i.e. they fall within the first dataSkippingNumIndexedCols (by default 32) fields. Otherwise it will happily go ahead, Z-order, and do nothing for you. It might catch that these days, but originally it didn't, and it would happily optimize on a field that didn't have stats collection.

Another hint: it's expensive to collect stats on long strings. I don't have an exact number for what constitutes long, but when you're talking about hundreds of bytes to a kilobyte plus, you're crossing the threshold; under a hundred bytes you're definitely fine. Since it's expensive, if you can move your long strings to the end of your schema, beyond the dataSkippingNumIndexedCols cutoff (and if your column count is less than 32, you can tune that value down for the table), your writes will be faster. If you really need stats on them, they're less helpful on strings anyway, and you pay the price on writes, so it depends on what you're really tuning and optimizing for.
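Adjusting the cutoff is a table property. A brief example, with a placeholder table name and column layout; check your Delta version's docs for the property details.

```scala
// Limit stats collection to the first 8 fields of this table's schema.
spark.sql("""
  ALTER TABLE http_logs
  SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8')
""")
// With the schema ordered so the columns used in predicates (timestamps, IPs,
// ids, ...) come first, long free-text columns such as a user agent or the raw
// request line sit past the cutoff and no longer pay the stats cost on write.
```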

Don't over-partition your Delta tables. This is worth talking through a little. When you add partitions to a table, you genuinely increase the object or file count underneath the table on the storage system. With no partitions, you can optimize the table down to essentially the fewest number of one-gigabyte (or whatever size you tune optimize for) objects, minimizing the object count and getting nice big files that give you great throughput from the underlying storage. As soon as you introduce partitioning, you're probably going to a larger number of objects, possibly smaller or unevenly sized, going into those partitions; even if you optimize, the partitions force a break in how the data can be commingled, so you're increasing the object count. And when you increase object count, there's a performance impact on full table scans, large aggregations, and joins, unless you're using the partition values in your predicates liberally and efficiently. Basically, if most of your queries, joins, and aggregation operations can specify partition values as a predicate, then you've probably picked something reasonably good to partition on. I think Michael has suggested that a partition should have at least two gigabytes of data in it as a general rule of thumb. But we often see cases of ten or even hundred-gigabyte tables where, given the way they're commonly accessed, partitioning doesn't make a ton of sense and the table is actually faster unpartitioned. The caveat is retention: if you have to keep a retention window or delete certain data over time, and you can set things up so you delete by partition value, say a date or a month, that makes deletes cleaner and disjoint from other operations on the table. But again, that often comes at the cost of a higher object count, which can reduce the performance of other operations.
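For the retention caveat, a minimal sketch of a partition-aligned delete, assuming a date partition column and a 90-day window (both placeholders):

```scala
import io.delta.tables.DeltaTable

// Deleting on the partition column keeps the retention delete cheap and
// disjoint from concurrent writers touching newer partitions.
DeltaTable.forPath(spark, "/tables/dhcp_clean")
  .delete("date < date_sub(current_date(), 90)")
```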

The one case where over-partitioning makes sense is when you're building a table that purely services point lookups on partition values, usually a small number of them, you're only appending data to the table, and you're only deleting by partition value. In that case, if a partition ends up with a very small amount of data but you're intentionally only looking for the data in that partition, it's fine; you're not really paying a throughput penalty. That's the narrow case where over-partitioning can be advantageous, to make point queries fast. But as soon as you have to mix a point-query workload with aggregations or joins, it's probably better to right-size the partitioning, or even under-partition a little, and fall back on something like Z-ordering or sorting on your primary key column to maximize dynamic file pruning off the min/max stats.

One thing you will run into with Delta Lake is conflicting transactions, and this really comes up when you're doing deletes and updates on a table; if you're only doing append-type operations you don't really have conflicts. Where we really hit this is on tables we're upserting or merging into, because the table is changing all the time, and if we then have to run something like an OPTIMIZE with Z-order on the table, which takes a while, it can stumble over the table changing out from underneath it. Our way of handling two related operations that conflict is either to make everything partition-disjoint, meaning they never operate on the same partition at the same time, or, for something that really operates on the full table like OPTIMIZE, to inline it: in the same method that's doing the upsert, for instance in foreachBatch, we take the batch ID modulo some number, and every N batches we execute the OPTIMIZE. That way the two operations are serialized and don't conflict. We take a bit more latency on that one batch, but we're ultimately keeping latency down because the table stays well ordered and nicely compacted.
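A sketch of that inlining, with placeholder table name, key columns, and interval; the OPTIMIZE/ZORDER statement shown is the Databricks SQL form.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

val optimizeEvery = 50  // placeholder: run maintenance every 50 batches

def upsertAndMaintain(batch: DataFrame, batchId: Long): Unit = {
  DeltaTable.forName(spark, "ip_index").as("t")
    .merge(batch.as("s"), "t.extracted_ip = s.extracted_ip AND t.date = s.date")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()

  // Runs on the same thread as the merge, so OPTIMIZE never races the upserts.
  if (batchId % optimizeEvery == 0) {
    spark.sql("OPTIMIZE ip_index ZORDER BY (extracted_ip)")
  }
}
```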
One last thing: we have some very large tables, as I mentioned, one and a half to two petabytes with millions of objects in them, which means the table metadata is larger too. One thing we found is that you can use the snapshotPartitions setting to use more partitions, and therefore more tasks, to read the metadata. If you have a larger cluster, boosting that into the hundreds or even to 1024 reduces the latency of reading the metadata, at least until the metadata-operation optimizations we're hoping for arrive. And sometimes you may want to turn down the log retention: if you're adding tons of data to a table that already has a lot of metadata, you'll build up more underlying metadata files and changes, so you might want to shorten the interval they're kept for. That can have impacts on time travel and other things, so dig into the details, but it can help keep your metadata size down. Really it comes down to adding more parallelism to reading the metadata and keeping the metadata within a reasonable size through settings. For most of our tables we don't touch anything except setting snapshotPartitions on a cluster-wide basis, based on the cluster size and how many resources we want available for reading these large tables' metadata.

That's all for the patterns and insights I have. Thank you, and get in contact with me if you have any questions; I'm also on the Delta Lake Slack often and happy to answer questions or work through issues with you. Thank you very much, take care, and stay safe and healthy.
