The Apache Spark File Format Ecosystem
Hi, everyone, and welcome to "The Apache Spark File Format Ecosystem" talk here at Spark + AI Summit 2020. My name is Vinoo Ganesh, and I'm the Chief Technology Officer at Veraset.

To frame this session, here is our high-level agenda. We'll start with my company, Veraset, and why this presentation is directly relevant for us. We'll continue on to the goals of this session, on-disk storage, OLTP and OLAP workflows, and specific file formats. Finally, we'll look at a few case studies in the format ecosystem and conclude by looking forward at some of the advancements coming for the industry.

Let's start with some quick background. Again, my name is Vinoo, and I'm the CTO of Veraset. Previously, I led compute and Spark at Palantir Technologies. Veraset is a data-as-a-service, or DaaS, company focused on delivering high-quality, anonymized geospatial data to power innovation across a number of verticals. Most recently, our datasets have been heavily used in COVID-19 investigations to measure the effectiveness of non-pharmaceutical interventions and social distancing policies. Each year we process, cleanse, optimize, and deliver over 2 petabytes of data.

As a DaaS company, data itself is our product. We don't build analytical tools or visualizations; in other words, we're just data. Every day we think about the bottlenecks associated with operationalizing our data and how we can reduce them. To achieve our mission, it is pivotal that our data is optimized from a workflow-specific storage, retrieval, and processing perspective. As an obligatory plug: if you find our mission, or any of the content that we're discussing today, motivating, we are hiring; my email is in the upper right-hand corner.

The singular goal of this session is to do a whirlwind tour of frequently used file formats in the Spark ecosystem. We'll look into the on-disk storage strategies of these formats and build frameworks for selecting a format based on your data access workflows. For each format, we'll talk through its feature set, provide a visual inspection of a file, and look at some frequently used configuration settings. Finally, we'll look forward at some of the new developments in the format ecosystem.

A format is just a specification for how bits are used to encode and decode data. You've undoubtedly seen many of these before. For our purposes, we're going to categorize them into three groups: unstructured, semi-structured, and structured.

Text files are about as unstructured as you can get. CSVs and TSVs, while they have structure, don't have that structure programmatically enforced in the way that some of the other semi-structured data formats do. For this reason, we can consider CSVs and TSVs as either semi-structured or unstructured data.

Semi-structured data is data that contains tags or markers that delineate semantic elements, but it isn't as rigid as structured data. Typical examples of semi-structured data include XML and JSON; in this session, we'll look at JSON as our example of semi-structured data.

Structured data formats rigidly enforce schema and data type rules. Many leverage this knowledge about the data to provide optimizations at query time, right out of the box. We'll spend the most time in this session talking about structured data, through the lens of Apache Avro, Apache ORC, and Apache Parquet.
To frame this conversation, we need to understand what happens on disk. On disk, data is stored in chunks known as blocks. A block is the smallest amount of data that can be read during a single read operation, meaning that if you only want to read one byte of data on a hard drive with a 512-byte block size, you would still need to read the full 512 bytes.

As the disk spins, the head reads data. To read data from elsewhere on the disk, the head physically has to move and reposition, a process known as a random seek. As you can guess, reading sequential data is much easier than reading fragmented data. To ensure optimal reads, we want to read only what we absolutely have to, while minimizing random seeks. The notion of blocks doesn't just exist on spinning disks, either; even technologies such as HDFS have a block size, 128 MB by default.

That leads us to a big insight: your goal as an application developer should be to lay data out on disk in a format that is optimized for your workflow. Later in this talk, we'll cover OLTP and OLAP, two common categorizations for these workflows.

To understand how different file formats lay data out on disk, let's look at some example data. Our sample data will consist of four rows and three columns, and will be in tabular form. We'll then break the data into pieces to understand the specific storage methodologies, as shown on this slide.

In the row-oriented, or row-wise, storage methodology, rows are stored contiguously on disk. As you can see, A0, B0, and C0 are stored one after another inside of Block 1. This type of storage methodology can be great if your goal is to access full rows at a time; for example, reading the third row of data (A2, B2, C2) is a pretty trivial operation in the row-oriented model that only requires me to read Block 2. As you can infer, writing additional rows is a pretty trivial operation as well.

However, let's try something slightly different. Let's say I'd like to reconstruct column B. Well, this would require a linear scan of the entire dataset on disk, meaning every single one of the blocks. Not only that, but as part of this process, we'd be reading data that we don't necessarily care about for the operation, specifically data from columns A and C. That seems not great. Let's try a different solution and see if we can fix this.

In the columnar storage methodology, columns are stored contiguously on disk. As you can see, A0, A1, A2, and A3 are stored one after another. This type of storage methodology is great for the column query that I posed earlier, but now trying to recreate the rows themselves has become a fairly intensive process. Not only that, but let's say I have a new row of data that I want to write: that process just became a lot more expensive.

Our takeaway here is that the columnar storage model allows us to quickly do column-level aggregate operations, but we'd really struggle with workflows that are write-heavy. If only there were a way we could get the best of both worlds. Well, lucky for us, it turns out there is.

The hybrid storage model is a combination of both the row-wise and the columnar models. In this model, we first select the groups of rows that we intend to lay out; for our purposes, and to be consistent with the naming scheme of Parquet as we'll see later on, we'll call these row groups. We'll then apply the columnar layout inside each of the row groups.
Let's break down what you're looking at. In Row Group 1, in the bottom half of the picture, you can see we've logically grouped together the first two rows of the table, while implementing a columnar partitioning scheme inside of the group.
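Before we map this onto disk blocks, here is a minimal, purely illustrative Python sketch (not from the talk) of the order in which the twelve values of the 4×3 example table would be written out under the three strategies:

```python
# Purely illustrative: how the same 4x3 table could be laid out on disk
# under row-wise, columnar, and hybrid storage strategies.
table = [
    ["a0", "b0", "c0"],
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
    ["a3", "b3", "c3"],
]

# Row-wise: rows are stored contiguously -> cheap row reads and appends.
row_wise = [value for row in table for value in row]
# ['a0', 'b0', 'c0', 'a1', 'b1', 'c1', ...]

# Columnar: columns are stored contiguously -> cheap single-column scans.
columnar = [table[r][c] for c in range(3) for r in range(4)]
# ['a0', 'a1', 'a2', 'a3', 'b0', 'b1', ...]

# Hybrid: group rows into "row groups" (here, 2 rows per group), then store
# each group column by column -- the scheme Parquet and ORC use.
hybrid = []
for start in range(0, 4, 2):                      # one row group per iteration
    group = table[start:start + 2]
    hybrid += [row[c] for c in range(3) for row in group]
# ['a0', 'a1', 'b0', 'b1', 'c0', 'c1', 'a2', 'a3', 'b2', 'b3', 'c2', 'c3']
```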
Now, what would this look like on disk? In the on-disk layout, we employ the row-group-based partitioning strategy, but we aim to fit one row group inside one block. This allows us to recreate individual rows in one single read, while also allowing us to do effective column-level manipulations when used in conjunction with some of the features of the formats that we'll soon discuss. I want to call this out explicitly: matching the block size to the row group size isn't always possible, as is shown in this slide, but if we're able to do it, it can lead to big performance wins.

In summary, we've seen that row-wise storage formats are best for write-heavy workflows that involve frequent row additions or deletions, which we'll call transactional workflows. On the other hand, columnar formats are best optimized for read-heavy workflows, which we'll call analytical workflows. Hybrid formats combine these for the best of both worlds, and for that reason, we'll see a number of formats that use the hybrid approach.

We've been speaking about the difference between transactional and analytical workflows for a bit of time now, and luckily for us, there are characterizations used in the industry to differentiate them. The most common ones are called OLTP and OLAP.

The OLTP, or online transaction processing, workflow generally involves a data-processing-focused workload. It operates with short-lived transactions and is geared towards row-based rather than column-based processing, meaning data is generally updated and deleted on a per-record basis much more often. We can thus reason that row-wise storage formats are generally better for OLTP workflows.

The OLAP, or online analytical processing, workflow generally involves an analytics-focused workload. These are generally workflows that involve column-based analytics. Since the queries here are generally more complex, the transaction times are longer as well. These types of workflows are best served by columnar storage formats.

This leads us to yet another insight: our data storage and access patterns should inform the selection of our file format. Let's put this into practice with some example data.

We're going to use this data throughout the rest of the examples. It is notional data that contains information about different students' scores in different classes.
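As a hedged sketch, here is what such a table might look like as a Spark DataFrame. The column names and values below are invented for illustration; only the three data types (integer, string, double) follow the description in the talk.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, StringType, DoubleType)

spark = SparkSession.builder.appName("file-format-demo").getOrCreate()

schema = StructType([
    StructField("student_id", IntegerType(), True),   # integer: ID of the student
    StructField("subject",    StringType(),  True),   # string: the subject
    StructField("score",      DoubleType(),  True),   # double: score in that subject
])

# Values are made up for illustration; only the column types mirror the talk.
rows = [
    (1, "math",    87.5),
    (2, "math",    91.0),
    (1, "history", 78.0),
    (2, "history", 84.5),
]
df = spark.createDataFrame(rows, schema)
df.show()
```

The later format snippets in this section assume a DataFrame like this `df`.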
You'll notice a few things immediately. First, we have three columns, all with different data types: the first column contains an integer, which describes the ID of the student; the second contains a string, which contains a subject; and the third column contains a double, describing the student's score in that subject. Additionally, the data types contained in these columns are relatively simple, meaning we don't have any structs, arrays, or enumerations here. Finally, each of these columns has pretty clearly defined minimum and maximum values, whether through mathematical comparison or string lexicographic comparison. With this in mind, let's kick off our whirlwind tour of file formats, starting with our good old friend, the CSV.

CSV stands for comma-separated values. It is a format that delineates columns with a comma and rows with a newline. The CSV format was originally developed by IBM in 1972 and was largely motivated by its ease of use with punched cards. You can see an example of the format on the right-hand side, where I cat out my table.csv.

The CSV format is a flexible, row-based, human-readable format. Its flexibility can come at a cost, though: without stringent file format rules, it's easy to have corrupt or unparsable CSV files floating around, as I'm sure many folks on data engineering teams here can attest to. The CSV format is compressible, and when the data is either raw or compressed with a splittable codec like bzip2, it's also splittable. Spark has a native CSV reader and writer, making it both fast and easy to read and write CSV files.

CSV is a pretty standard format in the ecosystem and can be especially powerful when working on smaller datasets that don't need some of the query-time optimizations that would be beneficial for larger datasets.
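For reference, a minimal sketch of Spark's native CSV writer and reader, using the hypothetical `df` from above (paths and options are placeholders):

```python
# Write the sample DataFrame out as CSV, then read it back.
df.write.mode("overwrite").option("header", "true").csv("/tmp/students_csv")

# CSV carries no type information, so without inferSchema every column comes
# back as a string; with it, Spark scans the data to infer the types.
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/tmp/students_csv"))
csv_df.printSchema()
```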
Now let's talk about a ubiquitous semi-structured data format: JSON. The JSON format was originally specified in the early 2000s. It's the first file format we've discussed that falls into the category of self-describing. A self-describing file is a file that contains all of the necessary information to read and interpret the structure and contents of the file, inside of the file itself. It generally includes attributes like the file schema or the data types of the columns.

The JSON format was specified not only as a storage format but also as a data exchange format, which makes it a bit more verbose than CSV. However, similar to the CSV format, it is row-based and human-readable. As you can see in the cat'd-out file on the right-hand side, JSON contains the column name as part of each row; however, we're still relying on Spark schema inference to select the correct data types for the columns.

JSON is both compressible and generally splittable, save one exception involving the whole-file read operation in Spark. JSON is also natively supported in Spark and has the benefit of supporting complex data types like arrays and nested columns. Finally, it's pretty fast from a write perspective as well. The JSON format can be a powerful format, especially when you have highly nested values; however, it lacks some of the optimizations of the structured data formats that we'll soon see.

Now let's talk about the heavy hitters in the Spark file format space, starting with Apache Avro.

Avro is unique in that it is both a data format as well as a serialization format. In other words, Avro can be used both for writing data to disk and for presenting data over the wire. It's a self-describing format that has a huge feature known as schema evolution. This unique feature gives users the ability to update the schema of a dataset without requiring a full rewrite of all of the files inside that dataset. The actual semantics of how schema evolution works are beyond the scope of this presentation, but suffice it to say that there is a complex set of governing rules to ensure things work properly. In schema evolution, both column additions and deletions are supported, with full backwards compatibility with previous schema versions.

Avro is a row-based, binary format. It's the first format we've spoken about thus far that is not easily human-readable. Now, there are ways to inspect the file, as we'll shortly see, but it won't be as easy as just cat-ing the file out. Avro is both compressible and splittable. Reading and writing Avro files is supported using an external library in Spark. Finally, Avro supports rich data structures like arrays, sub-records, and even enumerated types.

Using avro-tools, we can dump a file to inspect its contents. The first section shows the JSON representation of the data inside of the Avro file. You'll notice information such as the column names, data types, and actual values are all contained inside of the file. Equally interestingly, we can dump the metadata of the same file to view the schema and compression codec of the file. I want to draw your attention to the values, shown in blue, specified as the type in the fields block. These are complex types called union types, which specify that the value of the field may be either the data type specified or null. This is how we specify whether a field is nullable in the schema. You can use avro-tools to manipulate and understand the contents of Avro files in very powerful ways.
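A hedged sketch of reading and writing Avro from Spark via the external spark-avro module; the package coordinates below are an assumption and should be matched to your Spark version:

```python
# spark-avro is an external module, so it must be on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.0.0 ...
# (coordinates are an assumption -- match them to your Spark build)

# Optionally choose the Avro compression codec (e.g. snappy or deflate).
spark.conf.set("spark.sql.avro.compression.codec", "snappy")

df.write.mode("overwrite").format("avro").save("/tmp/students_avro")

avro_df = spark.read.format("avro").load("/tmp/students_avro")
avro_df.printSchema()   # the schema comes from the self-describing Avro files
```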
Here are some configurable Spark settings for Avro. Both of these settings control compression options, including the implementation and the deflate level. We can see that Avro has powerful features, but its row-oriented nature generally leads it to better support OLTP workflows.

Let's now look a little bit more into the formats that are better suited for OLAP workflows, starting with ORC. The Optimized Row Columnar, or ORC, file format has its origins in the Hive RC format. It was created in 2013 as part of the Stinger initiative to speed up Hive. It is a self-describing hybrid format that groups rows into structures called stripes. Similar to Avro, it is a binary format, which will require us to use special tools to inspect or visualize the file contents. ORC is both compressible and splittable and is natively supported in Spark. ORC also supports rich data structures and Hive data types such as structs, lists, maps, and unions.

To understand the power behind ORC, we'll need to delve a little bit more into its file structure. As I mentioned, ORC stores row group information in structures known as stripes, which are 250 MB each by default. Each stripe contains three sections: the index data, the row data, and the stripe footer. The index data section contains metadata about the values inside the data section, including the minimum and maximum values on a per-column basis, as well as the row offsets within the data. The actual data itself lives in the row data section of the stripe. As you can see in the image on the right, the columns are stored in a columnar manner inside the row data section, similar to what we saw in the hybrid storage method earlier. The stripe footer contains a directory of stream locations. Finally, each ORC file has a postscript that contains compression information.

To get a more concrete understanding of the data contained in these files, let's actually inspect one of them. Just as we used avro-tools before, we can use orc-tools to inspect an ORC file. You can see some of the structures that we spoke about here. From the top, we can see that this file has four rows. It was compressed with snappy and has a compression size of about 262,000 bytes, roughly 262 kilobytes. The schema and data types follow in the type section.

Under the stripe statistics block, you can see that we have one stripe that appears to have four columns; our actual data is in the latter three. We can see that each column in the stripe has four values, no null values, how many bytes on disk the column takes up, and the minimum and maximum values for each column. On the right-hand side, we can see detailed information about the index data inside the stripe: where the row index for each column starts, as well as the data itself. Finally, the bottom four lines show the encodings that are being used: DIRECT refers to RLE v1, or run-length encoding v1, and DIRECT_V2 refers to RLE v2, or run-length encoding v2.

Looking at just this output, the wealth of information contained in structured data formats like ORC should be apparent.
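ORC needs no external packages in Spark; a minimal sketch of writing and reading it follows (the codec setting is optional and shown only as an example):

```python
# ORC is natively supported in Spark.
spark.conf.set("spark.sql.orc.compression.codec", "snappy")   # optional

df.write.mode("overwrite").orc("/tmp/students_orc")

orc_df = spark.read.orc("/tmp/students_orc")
orc_df.printSchema()

# Running `orc-tools meta` on the files written here would show the structures
# described above: stripes, per-column statistics, and the compression codec.
```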
Spark has a few ORC-related config settings. The out-of-the-box defaults are mostly sane, so I would recommend being very deliberate and mindful before changing them. There are a few settings that I'd like to encourage you to be particularly mindful of. The first is the columnar reader batch size. This setting controls the number of rows included in each vectorized read batch; be mindful of this value, as tuning it incorrectly can cause out-of-memory errors. The second setting I'd like to call out is the new mergeSchema functionality available in Spark 3.0, which allows on-the-fly merging of the schemas of all of the files in the dataset. It's off by default, but if you have a use case for it, it is available for use.

ORC is a large step forward in the file format ecosystem, and specifically the structured file format ecosystem. Next, let's turn to Apache Parquet. Apache Parquet is one of the most prevalent formats in the Spark and big data ecosystem, and the majority of you have probably used it, or at a minimum at least seen it, before.

Parquet was originally built as a collaborative effort between Twitter and Cloudera. It is a self-describing, hybrid, binary format that is both compressible and splittable, and it is natively supported in Spark. If this list of attributes looks familiar from the ORC slide, don't be too surprised; the formats are actually fairly similar conceptually.
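A minimal sketch of writing and reading Parquet in Spark, again using the hypothetical `df`; the filter at the end is there only to show where the row-group statistics discussed below come into play:

```python
# Parquet is natively supported in Spark; snappy is the default codec.
df.write.mode("overwrite").parquet("/tmp/students_parquet")

pq_df = spark.read.parquet("/tmp/students_parquet")

# A filter like this can be pushed down to the Parquet reader, which can use
# the per-column min/max statistics in the row-group metadata to skip row
# groups that cannot possibly match.
pq_df.filter(pq_df.score > 90.0).explain()   # look for PushedFilters in the plan
```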
In Parquet, each row group is a container for a set of column chunks, and the column chunks are made up of a set of contiguous pages of columns. Just like ORC, there is information about the columns contained inside the metadata. That information includes the column type, path, encodings, and codec, as well as the number of values, the data and index page offsets, the compressed sizes, and any additional metadata. As we saw in the hybrid storage format, one Parquet file can have multiple row groups inside of it, as you can see in the image on the right-hand side.

We can inspect Parquet files using parquet-tools. Let's start by looking at the metadata of the file first, using the meta command. As you'd expect, we have information about the schema, the row groups, and the columns inside of the row groups. We can see the data type in blue, the compression in red, the encoding in yellow, and the minimum and maximum values, along with the number of null values, in green. We can also dump the file with parquet-tools to see the actual values contained in the data, as shown on the right-hand side. I know there's a lot of content on these inspection slides; I'll pause for a few seconds, but keep in mind that these slides will also be available afterwards.

Parquet has similar config settings to ORC. As such, I'd like to mention the columnar reader batch size setting again, as a reminder to be mindful when tuning it.

One thing we've seen a few times is that the metadata contained on columns has information about min and max values. These may seem like merely nice pieces of information to have, but how do they actually help us? The answer to that is known as predicate pushdown, or filter pushdown. The metadata that exists on columns allows for pruning of unnecessary or irrelevant data, meaning, for example, that if I have a query that's looking for column values above a certain number, then through the min and max attributes I can completely ignore, or prune out, certain row groups. Parquet actually has a wealth of these filter pushdown settings. You can see a full list of the types you can push down on on the right-hand side, and these optimizations are available to you right out of the box.

The structured format ecosystem brings a number of optimizations to data analytics, which can drastically decrease query times. To demonstrate this, let's take our discussion out of the abstract and move it into the concrete by looking at a case study about Veraset, the company I work at.

Veraset's data pipeline processes over three terabytes of data daily; the number is actually even higher now. Our entire tabular pipeline was previously written in gzip CSV. It took nearly six hours to run, and any type of analysis of the data was an exercise in patience. Recognizing that something had to be done, we revisited our workflow fundamentals. Veraset receives and writes data in bulk twice a day. We have a number of users, including ourselves, frequently doing analytics on individual columns in the data. Finally, we found that the queries we were running tended to be fairly long-running and increasingly complex.

This led to the obvious but fitting conclusion that we were using an OLTP system for an OLAP workflow; we needed to change the system to be consistent with our workflow. Additionally, as a data company, schema changes are blanket breaking changes for us, so we can effectively treat all of our schemas as fixed: no need for schema evolution capabilities. We also observed that most of our data is hot data, making snappy a much better choice for compression than gzip. Given all of this, we made the decision that our pipeline was much better suited to snappy Parquet rather than gzip CSV.
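The talk doesn't show the actual code, but here is a hedged sketch of what such a gzip-CSV-to-snappy-Parquet switch can look like:

```python
output_path = "/tmp/pipeline_output"   # placeholder path

# Before: gzip-compressed CSV
# df.write.mode("overwrite").option("header", "true") \
#     .option("compression", "gzip").csv(output_path)

# After: snappy-compressed Parquet
df.write.mode("overwrite").option("compression", "snappy").parquet(output_path)

# Downstream reads change from spark.read.csv(...) to:
result = spark.read.parquet(output_path)
```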
Once we made these changes, which were actually just two lines of changes in the code, our pipeline runtime dropped to a little over two hours. This seemingly minor change resulted in massive compute savings as well as engineering time savings. Not only that, but it was lauded across our customer base, who also saw large drops in their data processing and analytics workloads, from both a cost and a time perspective. One of our customer pipelines actually dropped from an eleven-day processing period to a two-day processing period.

So clearly file formats are the greatest thing since sliced bread, right? However, I do have one huge disclaimer: file formats are software, and software can have bugs. To illustrate this, let's look at a recent Parquet bug.

We've already covered the benefits and functionality of predicate pruning. Prior to PARQUET-1246, however, there was actually a pretty scary issue hidden in the code. This ticket resolved a bug where the lack of an ordering for negative 0.0, positive 0.0, and NaN values could result in incorrect predicate pruning, meaning entire row groups would inadvertently be pruned out of your query, returning incorrect results. Of course, once the issue was identified, it was swiftly fixed.
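As an aside (not from the talk), a tiny plain-Python illustration of why NaN and signed zeros are awkward inputs for min/max statistics in the first place:

```python
nan = float("nan")

print(-0.0 == 0.0)            # True  -> comparisons alone cannot tell the two zeros apart
print(nan < 1.0, nan > 1.0)   # False False -> NaN compares false against everything
print(min([nan, 1.0, 2.0]))   # nan
print(min([1.0, nan, 2.0]))   # 1.0  -> the computed "minimum" depends on element order
```

Statistics computed with naive comparisons over such values can misrepresent a column's true range, which is the class of problem the ticket addressed.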
I use this ticket to bring up an important lesson that all developers should know: keeping libraries up to date is imperative. New software releases bring bug fixes and performance improvements, and releases of data formats are no exception.

As we look forward in the format ecosystem, I wanted to briefly touch on Apache Arrow. Arrow is an in-memory data format that complements the on-disk formats we just spoke about. Its goal is to provide interoperability between different formats. Arrow uses a columnar layout in memory and zero-copy reads to minimize SerDe overhead. It is cache-efficient for OLAP workloads and allows for SIMD optimizations on modern CPUs. Finally, it has a flexible data model. Many are already seeing big performance wins in production with Arrow. If your workflow frequently involves data exchange, for example from Python to the JVM, you may want to experiment with it. Other sessions at Spark + AI Summit 2020 cover Arrow, and if you're interested in learning more, I recommend checking out some of those talks.

To wrap up: it's important to think critically about your workflows and your file formats. File format selection can have large performance implications, so it's also important to test your format changes before going to production. Compression is a topic we didn't cover too much in this session, but as we've seen in our case study, it can also have a pretty large impact on your data processing pipeline. Finally, keep your libraries up to date; you never know what bug fixes or performance improvements you may get for free.

And with that, thank you for your time and attention. If you're interested in learning more or have more questions, feel free to reach out. Happy formatting!