Powering Interactive BI Analytics with Presto and Delta Lake
All right, hello everyone. This is Kamil, CTO and co-founder at Starburst, and today I will be talking about powering interactive BI analytics with Presto and Delta Lake. In this presentation I first want to introduce Presto for those who don't know what it is, along with a little bit about Starburst and what we do to help enterprise adoption of Presto. The main topic of my presentation is the Delta Lake integration that we have done for Presto. Then I will show how we can combine Presto, Databricks Spark, and Delta together in one data platform architecture, and how to efficiently use the best capabilities of both technologies. Finally, I will show real use cases where that combination actually delivers the best results for your team. With that, let's get going.

So, Presto and Starburst. Presto itself is an open-source, community-driven project. At its core, it's a high-performance MPP SQL engine. It was designed specifically to be geared towards interactive analytics and to perform ANSI SQL analytics over large-scale data, ranging from gigabytes on the low end to petabytes on the high end. The design goal for Presto was also to help with high-concurrency queries and fast performance overall.

One of the unique capabilities of Presto is that it provides full separation of the compute and storage layers. That allows you to scale those two independently: add more storage without adding compute nodes, boost compute if you need to perform lots of analytics, and then scale back when you don't need that capacity. So it allows you to manage cost and performance elastically and cost-effectively.

Also, Presto itself is just a compute layer. It accepts a SQL statement, does all the execution of the query, and then it's done. There is nothing in Presto about storing
the data, so you have to bring your own storage, whichever storage is most efficient, effective, and appropriate for your use case.

In a sense, Presto allows you to run SQL on anything. With its powerful extensibility, you can connect to a variety of different sources and also write your own connectors to sources that are not yet available in open source or commercially. A side effect of this is that you can also run federated queries, meaning that in a single SQL statement you can refer to tables coming from different sources. That allows you to correlate data coming from different sources and perform interesting analytical queries that would otherwise have to wait until you bring all your data into one place, which obviously is a long process, and you may hit a lot of barriers, especially in enterprise settings.

Finally, you can deploy Presto basically anywhere. We've seen successful deployments of Presto initially on bare-metal on-premises systems, then in virtualized environments and the cloud, and now Kubernetes is really the way to go for deploying Presto in essentially any of those environments. That gives you a lot of flexibility: you don't have to be tied to any specific platform, and you can move your Presto analytical system from one place to another, wherever your center of gravity for data is.

Presto has been around for some time, about seven years. It was originally developed by the team at Facebook and got its initial adoption at large Silicon Valley internet companies, then progressively expanded to more and more companies globally, across different verticals and use cases. You can see some logos here on the screen, which give you a sense that Presto is applied pretty much anywhere for analytical purposes. It has been the fastest-growing SQL
engine anywhere, in open source for sure. You can run it at massive scale. Some of the larger deployments, especially at Facebook, LinkedIn, and Lyft, are talking about hundreds of machines, querying petabytes of data, and running thousands or hundreds of thousands of queries a day. It's really impressive what those companies are doing, pushing the limits of scale and performance.

Now, Starburst. We are the enterprise Presto company; we are the commercial arm behind
the open-source project, much like Databricks is for Spark. The value we bring to enterprise customers specifically is around simplifying the usage of Presto. We provide lots of security enhancements, with integrations for data encryption, masking, permissions, and authentication, including Ranger and other security-related protocols.

In enterprises you typically have lots of additional connectivity needs beyond just open-source formats. We are talking about querying Oracle, Teradata, DB2, Snowflake, et cetera, and those connectors are all packaged in the enterprise distribution of Presto that Starburst offers.

On the management side, as I mentioned, ease of use is super important in enterprise settings, so we make configuration, autoscaling, HA, monitoring, and deployment in all those different environments simplified and orchestrated effectively.

All of this is wrapped in support for Presto. We have the largest team of Presto experts in the world, we have the Presto creators behind the company right now, and we provide hotfixes, security patches, and 24x7 support. Another thing that's very important, and it's also true for Databricks, is that we are the leading contributors and committers to Presto itself, the open-source project. So we are driving the roadmap and enhancing Presto to get you more performance, more scale, and more functionality in open source. If you choose to go with our platform and managed enterprise offering, that's where you get additional benefits as well.

OK. I've already talked about how Presto can connect to many different sources. So why Delta Lake? We obviously got interested in Delta Lake when it was first announced by Databricks a couple of years ago, and then subsequently when it was open sourced last year. Delta
Lake is a really exciting technology for many reasons, so I'll mention several of those here that I think are fundamental.

First of all, it gives you ACID properties over the data lake. This is huge, because you can now delete individual rows, update individual rows, and insert individual rows effectively over your data lake. In the past that was quite cumbersome and required complex frameworks. That's now simplified: everyone can treat object storage as essentially a database table, which is amazing. That's how we want to work, and that's how Presto sees the world.

It's an open, open-source table format, so multiple different tools can use it. Spark obviously was the initial implementation; now there is Presto, there is connectivity for Hive, and I suspect it will be much more broadly adopted in the future as well.

The actual data underneath is stored as Parquet files, which is great because Parquet is a very efficient format for storing data. It's columnar, it's compressed, and it has built-in min/max indices and other performance optimizations.

Delta also supports object storage natively, on S3, ADLS, and Azure storage, which is where the world is going for storing large amounts of data these days.

On top of all those fundamental benefits, there are really great features around schema evolution and even time travel, which lets you see what the answer to a query would have been if you had asked it a day ago. I think that's very interesting for many workloads, especially in the analytics space.

And then, to further show the benefits of Delta: having the dedicated metadata
information outside of the Hive metastore, and having statistics on the data, helps with performance, because you can skip data, and if you have data arranged in a special order, that also helps to speed up many queries.

I think we are all very thankful to Databricks for inventing Delta and making it a popular format that is getting lots of adoption. Many of our joint customers are also users of Delta, so it made a lot of sense for us at Starburst to enhance Presto and allow it to query Delta effectively.

As I mentioned, Starburst developed a native Presto Delta Lake connector, and I would like to acknowledge the great help from engineers at Databricks, who assisted us with that integration and provided further details, in addition to the official specification, on how to implement this effectively.

We decided to build a native connector, written completely from scratch. There used to be a legacy solution that used manifest files and Hive metastore integration to simulate access from Presto to Delta. It was very inefficient, because it wasn't taking advantage of any of the inherent Delta Lake properties.

In this implementation, we natively read the Delta transaction log, we perform data skipping based on the metadata read from Delta, and we are able to fetch statistics about the data and leverage those as input to the optimizer. That allows us to effectively perform joins among Delta tables, as well as between Delta tables and tables coming from different sources. So it was really important to have this native integration built for Presto.

OK. Once we built the very first version, we were obviously curious how it performs compared to the legacy solution, which basically treated Delta as a collection of Parquet files.
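[Editor's illustration] The native read path described above (replay the transaction log, then use per-file statistics to skip data) can be sketched in a few lines of Python. This is only a simplified sketch under stated assumptions, not Starburst's connector code: the commits live in an in-memory dictionary instead of `_delta_log` files on object storage, the file names, the `order_date` column, and the helper functions are all hypothetical, and only the `add`/`remove` actions and the `numRecords`/`minValues`/`maxValues` statistics from the open Delta format are modeled.

```python
import json


def add(path, lo, hi, rows):
    """Build one hypothetical 'add' action; stats is a JSON string, as in the Delta log."""
    stats = {"numRecords": rows,
             "minValues": {"order_date": lo},
             "maxValues": {"order_date": hi}}
    return json.dumps({"add": {"path": path, "stats": json.dumps(stats)}})


def remove(path):
    """Build one hypothetical 'remove' action."""
    return json.dumps({"remove": {"path": path}})


# Version number -> list of JSON action lines (a stand-in for the commit files).
COMMITS = {
    0: [add("part-0.parquet", "2020-01-01", "2020-01-31", 100),
        add("part-1.parquet", "2020-02-01", "2020-02-29", 120)],
    1: [add("part-2.parquet", "2020-03-01", "2020-03-31", 90)],
    2: [remove("part-0.parquet"),
        add("part-3.parquet", "2020-01-01", "2020-01-31", 95)],
}


def live_files(commits, as_of_version=None):
    """Replay commits in order; return {path: stats} for files still in the table."""
    files = {}
    for version in sorted(commits):
        if as_of_version is not None and version > as_of_version:
            break  # stopping early yields an older snapshot (time travel)
        for line in commits[version]:
            action = json.loads(line)
            if "add" in action:
                entry = action["add"]
                files[entry["path"]] = json.loads(entry["stats"])
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
    return files


def files_to_scan(files, column, lo, hi):
    """Data skipping: keep only files whose [min, max] range overlaps [lo, hi]."""
    keep = []
    for path, stats in files.items():
        if stats["maxValues"][column] >= lo and stats["minValues"][column] <= hi:
            keep.append(path)
    return sorted(keep)


current = live_files(COMMITS)
# Row-count estimate of the kind a query optimizer could use as input.
estimated_rows = sum(stats["numRecords"] for stats in current.values())

print(sorted(current))                                                  # parts 1, 2, 3
print(files_to_scan(current, "order_date", "2020-03-10", "2020-03-20"))  # only part-2
print(sorted(live_files(COMMITS, as_of_version=1)))                     # parts 0, 1, 2
print(estimated_rows)                                                   # 305
```

In this toy table, a filter on ten days in March touches one of three live files, and the same replay stopped at version 1 reproduces the table as it looked before the rewrite of `part-0.parquet`.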
We ran the standard TPC-H benchmark, and across the 22 queries in this benchmark we saw a speedup on average for all the queries, and obviously those queries are doing much more than just reading data. The biggest gain we observed in this benchmark was for a query that was a single table scan doing some aggregation, which showed a 6x performance boost with the native reader. So that's pretty substantial.

Once we got an efficient implementation of the Delta reader done, we also shared it as a preview with our early customers, and the feedback we got was actually even more enthusiastic, because they were reporting speedups of over 10x for the native reader versus the previous solution. This drastically helps performance overall, and it makes users happier to come back, run more queries, and do more analytics. I definitely encourage everyone to try out this native integration of Delta with Starburst Presto; the documentation link is here on the screen.

Having that is great: obviously an engineering addition, yet another connector for Presto. But how do you tie all those things together so they work effectively in one environment? I thought I would spend a few slides on this topic: how to effectively leverage Databricks and Starburst in one architecture.

The Starburst platform itself obviously has Presto at its center, as the fast SQL engine that talks to many sources. In addition to leveraging open-source Presto, we have built additional connectors to more sources. I mentioned some of those: traditional commercial RDBMS sources, those very common in enterprises, being the primary examples. We have enhanced support for additional S3-compatible storage engines for on-premises deployments as well as the clouds: all
the clouds, obviously, today. Presto can also talk to your modern NoSQL stores such as MongoDB, Elasticsearch, Cassandra, et cetera. Having one tool to reach all those different sources, both legacy and modern places where you may store your data, is really powerful. However, there's always a challenge: with all the diversity of sources, how do you do it effectively, how do you do it securely, and how do you manage the whole experience for the end users? As part of our platform, we call this the data consumption layer. A big piece of that was building global security mechanisms that govern secure access to all the sources from Presto, and also from the users that are querying this data via Presto: managing permissions, masking sensitive
information, and ensuring data encryption, both for reading data encrypted at rest and for encrypting data in flight when it's moving from the source to Presto, within the Presto cluster, and then between the client tool and Presto. For all the queries, you can have very fine-grained access control, down to the table and column, and you can also apply row-level filters to further restrict access. All of those things are part of the Starburst platform.

On top of that, we have integrations and certifications with a number of your favorite BI tools: Tableau, Looker, Power BI, Qlik, et cetera. The same goes for all the modern tools that can issue SQL, starting from Jupyter notebooks, Superset, and Redash, which I think right now is part of Databricks. All those tools can talk to Presto natively and give all the users the power of the broad set of connectors and security features of Starburst.

So that's the Starburst side. Now the question is: I have Spark, I have Databricks, I have Presto and Starburst; how do these things work together? What's the best way to effectively leverage both? What we see in all the live deployments with our customers is that pretty much everyone who leverages Starburst and Presto also leverages Spark, and often Databricks Spark. The reason is that I think those technologies excel at different things and complement each other very well. If we're talking about streaming ingestion of data, running machine learning jobs, artificial intelligence, managing the data lake, or doing really heavy, long ETL jobs, whether you do it through native Spark syntax or SQL, all of those things are best positioned to happen inside
Databricks and inside Spark and Spark clusters, and that's the way to do this effectively. On the other hand, Presto really was designed for and excels in high-concurrency SQL. So if you're running tens or hundreds of queries at the same time, if you're doing BI reporting, analytics, or interactive data discovery using SQL, and you also want to federate different sources, say you have
Delta Lake, and Parquet files on S3, and Avro, and also relational databases and NoSQL engines, this is where Presto can provide a lot of value. We feel that interactive, fast performance is enough of a distinguishing factor to drive adoption of Presto and Starburst, and because we now have so many joint customers, we feel compelled to advocate for this joint architecture.

If you look holistically at the data ingestion and analytics ecosystem, this all comes together by having your raw data sources ingested by Databricks and Spark into the data lake, these days Delta Lake specifically, but you can also just put your data on either Amazon S3 or Azure ADLS. You can run all the machine learning and AI, with MLflow or SageMaker, over this data for those use cases. If instead you want SQL over this data, plus you want to correlate this information with lots of different data sources, whether RDBMS or NoSQL, all over the place, then Starburst and Presto are the perfect answer. They provide higher-concurrency, more responsive SQL access and allow you to leverage all the BI tools and SQL editors, such as Redash, Superset, Jupyter notebooks, and your favorite tools, for the more classical analytical purposes.

The deployment side of this is very interesting as well. These days we highly recommend deploying Presto on a Kubernetes cluster, and in fact I will be showing Presto deployed on the EKS service later in the demo. But we also see this pattern of Presto being deployed via Kubernetes on Azure AKS in practice. We work closely with the OpenShift platform for Kubernetes deployments, and we see people deploying Kubernetes pretty much anywhere, but
as I mentioned, you can obviously do it in many different ways as well at your company. We simplify all that configuration, deployment, and management significantly, so deployment and management are drastically simplified compared to what you've been used to in the past.

OK, so now bringing all this together. It's all great architecture and technology, but what are the real use cases that we can show here? The one use case that we want to advocate for, and that we've already seen being leveraged in a joint architecture of Databricks and Starburst, is this. We have some IoT data streaming from the IoT devices into Delta Lake. We may have some data coming from, say, an ERP system, where in hourly batches we are moving data from the application layer into, again, Delta Lake specifically. And we can go through all the different layers, from the ingestion or bronze layer, through the refined or silver layer of Delta Lake, to the aggregate store or gold layer, which is the one best positioned for fast analytics.

This is where Starburst comes in. Everything before that happened in Databricks; now Starburst comes in and allows you to run your fast, concurrent SQL, fetching data from your gold store as well as reaching out to more or less refined versions of those tables if needed. If your analyst or data scientist prefers to look at the very raw data, that's also possible from Presto.

Now, while we have all of that being ingested into this Delta Lake, it's not going to be everything you potentially have in your ecosystem. We've seen a lot of companies starting with the data lake and having the center of gravity there; however, your traditional RDBMS systems, Oracle, Teradata, DB2, et cetera,
are always there, and sometimes there is interesting data there that you had better correlate with the information coming from other sources. And for other data sets, like more textual data, say web logs and user comments, unstructured or less structured data, things like Elasticsearch or MongoDB are really
appropriate ways and places to store them. Since your data is spread across so many places, you quite often need the ability to query all of those at the same time to arrive at the best insights that can drive your business forward. This is what we can provide here with Starburst and all the connectors we have to those platforms.

So Databricks, as I mentioned, will be doing the real-time ingestion of event data, and I'll be showing this in a demo in a moment. It will also be doing the hourly jobs of copying the more traditional enterprise data, and then reaching out to your other data sources as well. We modify this, refine this, and prepare this data for further downstream needs as appropriate in our overall architecture.

Starburst provides this single access layer. It's no longer true that you have all your data in one place, so there is no single source of truth for data; instead there is a single access point to query this data. This virtualized query environment is where the power of Presto is. You can leverage the power of the appropriate data stores for your data and query the data right now, rather than wait for maybe an ETL job to bring this Oracle legacy data or other data into one place, because it's hard to imagine upfront how you might need to modify and prepare this data for your purposes. If you're given access to Elasticsearch, you can actually push down some processing there, find the appropriate data by essentially doing text analytics, and bring that back to the relational world. And the Presto service can provide global access control for the sources, so analysts are appropriately privileged to the data that they are meant to be authorized for.
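[Editor's illustration] The single access point idea, one SQL statement that joins tables living in separate stores, can be imitated with nothing but the Python standard library. To be clear about the substitution: the sketch below uses SQLite, not Presto. Two attached in-memory databases stand in for two catalogs, and the names `lake`, `crm`, `orders`, and `customers`, along with the sample rows, are hypothetical. In Presto the same shape of query would address tables by `catalog.schema.table` and the engine would fetch rows from each source through its connector.

```python
import sqlite3

# One connection plays the role of the query engine; each attached database
# plays the role of a separate data source (e.g. a data lake and an RDBMS).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

# Hypothetical tables and rows, one table per "source".
conn.execute("CREATE TABLE lake.orders (order_id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, country TEXT)")
conn.executemany("INSERT INTO lake.orders VALUES (?, ?, ?)",
                 [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)])
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(10, "PL"), (11, "US")])

# A single statement that refers to tables from both sources.
FEDERATED_SQL = """
SELECT c.country, COUNT(*) AS orders, SUM(o.total) AS revenue
FROM lake.orders AS o
JOIN crm.customers AS c ON c.customer_id = o.customer_id
GROUP BY c.country
ORDER BY c.country
"""

rows = conn.execute(FEDERATED_SQL).fetchall()
for country, orders, revenue in rows:
    print(country, orders, revenue)
```

The point of the sketch is only the query shape: the analyst writes one join and never copies either table into the other store first.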
Now, in this demo, end users will be leveraging BI tools and SQL editors, such as Looker, Tableau, and Power BI, leveraging
connectors like JDBC and ODBC drivers and bindings for other languages. With that, I want to show all of this in a demo and briefly comment on how you put all those things together.

OK, so we are in a Databricks notebook here. We have an S3 bucket mounted, and we are setting up a structured streaming ingestion from a Kafka topic into a stream that we will then save into a Delta table. We're setting this up here, making sure the connectivity is live, starting the stream, checking the schema and making sure it matches what we expect, and now basically issuing a command to receive this data and save it into Delta Lake, and verifying that the stream is actually live.

OK, so we have ingestion going on. Now we are switching to DBeaver, which is a SQL editor tool in which we have mounted all the different data sources, such as Delta Lake. We try querying Delta via Presto from the DBeaver client, and that came up very quickly, as you can see. Now showing the same for Elasticsearch: this is Elasticsearch set up in the Amazon cloud, and we can query that very quickly as well, individually. So that's perfect. Then, on Amazon RDS, we have an Oracle system with some customer information.

Now, bringing all this together into one SQL statement, we can correlate this information and do the join between orders, customers, and parts, and execute the SQL. Here in the Presto web UI we can see how the query is progressing, we can monitor the state of the cluster, and we can see the progress of this join. It's a pretty complex query, with 40 rows then in the results on the DBeaver screen.

Now, moving to a more classic BI tool, Tableau: we have a dashboard where we are querying all the same information and displaying the orders on a geographic dashboard, visualizing which orders come from which countries. Pretty
quickly, we have another statement being run, and all of this is reflected on the dashboard again. The difference here is that we have a virtual view federating all these sources, but they are still the same sources: Delta, Elasticsearch, Oracle, and S3. In addition to that, in Power BI we run a classic report. Here we are showing customers by country and region for visualization. All of this is interactive, live queries against the real data coming from all those different sources at query time.

I hope that this session was helpful and that you see the power of both tools together. Thank you all for attending; I'm really happy to answer any questions.
2020-07-25 16:11