So, today is a very special year for Apache Spark. Spark 3.0 is out, but it's also ten years since the first open source release of Spark, so I want to talk to you today about what got us here with the open source project, and what we learned in the process that we're using to guide Apache Spark development in the future.

To do that, I'll start by telling you my story with big data and how I got into this space when I started my PhD at UC Berkeley. I started my PhD in 2007, and I was very interested in distributed systems; I had actually been working on peer-to-peer networks before this, but I wanted to try doing something different, and as I looked around at what was happening in the industry I got very excited about datacenter-scale computing. These large web companies were starting to do computations on thousands of machines, store petabytes of data, and do interesting things with them, and I wanted to learn more about this, so I actually worked with some of the big data teams at Yahoo and Facebook early on.

In working with these teams I realized that there was a lot of potential for this technology beyond web companies; I thought it would be very useful for scientific datasets, industrial datasets, and many other things. But it was also very clear, working with them, that these technologies were too hard to use for anyone who was not at a large web company. In particular, all the data pipelines built on these technologies had to be written by professional software engineers, and the technologies were really focused on batch processing: there was no support for interactive queries, which is very important in data analysis, and no support for machine learning and advanced algorithms. When I came back to Berkeley after spending some time working on these, I also found that there were local research groups who wanted a more scalable engine, for machine learning in particular, so I thought this was a great opportunity to try building something in this space. In particular, I worked early on with Lester Mackey, who was one of the top machine learning students at Berkeley in my year and who was on a team for the Netflix Prize, the million-dollar competition to improve Netflix's recommendation algorithm. By seeing the kinds of applications he wanted to run, I started to design a programming model that would make it possible for people like Lester to develop these applications, and I started working on the Spark engine in 2009.

Once I started the work, the feedback on it was great; the machine learning researchers at Berkeley were actually using it, so we decided to open source the project in 2010, and this is what the initial version of it looked like. It was a very small open source project, really just focused on MapReduce-style computing with a cleaner and faster API. But the exciting thing is that within two months of us open sourcing it, we started getting pull requests from other community members, so there were really people picking up this very early project, trying to use it, and doing interesting things with it. So we spent the next two years working with early users in the Bay Area; I spent a lot of time visiting these companies, organizing meetups, and trying to make this a great platform for large-scale data processing.
In working with these early users, we were blown away by some of the early use cases; people were doing things that we as researchers had never anticipated, and we thought there was the potential to do a lot more here. For example, some of the early users were powering interactive operational apps, like a user interface using Spark on the back end. There was a group of neuroscientists at Janelia Farm who were monitoring data from the brains of animals and other sensors in real time while they were running neuroscience experiments, and they built that app using Spark. That was something we had never anticipated, and it's obviously very cool to be able to build these operational apps for exploring large datasets.
Another interesting group was the startup Quantifind, which had built a product for its customers to analyze social media and other data sources, and they were actually using Spark to update datasets in memory and implement streaming computation. This was before we added a streaming engine to Spark, so they had simply built streaming computation on top of the platform. And finally, there were groups in industry, for example Yahoo's data warehousing group, that were running SQL queries over Spark, data science workloads, and other workloads, and they were opening up the Spark engine to tens or even hundreds more users than we had initially been targeting.

So based on these use cases, and what Spark was able to do for these companies, we spent the next few years really working on expanding access to Spark, making sure that anyone who works with data can somehow connect to this engine and run computations at scale. There were three major efforts there. The first was around programming languages: we developed Python, R, and SQL interfaces to Spark, which today are by far the most popular ways of using the engine. The second was around libraries: we wanted great built-in libraries for machine learning, graphs, and stream processing, and these provide a lot of the value of the platform today. And finally, we also got a lot of feedback on the API; users found it difficult, for example, to set up map, reduce, and other operators to express a distributed computation, so we decided to make a big change to a high-level API called DataFrames, which runs most of the APIs in Spark on top of the Spark SQL engine. You get the same kinds of optimizations and query planning that a SQL engine does, but through easy-to-use programming language interfaces, and that quickly became the dominant way to use Apache Spark.

If you look at Apache Spark today, these changes have all had a huge impact on the project. Let's start with support for Python. Among the users of interactive notebooks on Databricks, where we can measure this, 68% of the commands coming into the platform are in Python. That's more than six times the amount of Scala; it's by far the most widely used programming language on Databricks today, and it means a wide range of software developers are able to connect their code to Spark and execute at scale. Interestingly, the next most common language in notebooks is SQL, and a similar amount of commands to the total we see in notebooks comes in directly through SQL connectors, so that's also very widely used.

The next thing is SQL. Even when developers are using Python or Scala, about 90% of the Spark API calls are actually running on Spark SQL through the DataFrame APIs and other APIs, which means they benefit from all the optimizations in the SQL engine, and on Databricks alone we're seeing exabytes of data queried per day using Spark SQL.
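As a rough illustration of that point, here is a minimal PySpark sketch (the tiny dataset is made up): the DataFrame call and the equivalent SQL string go through the same Spark SQL optimizer, which you can check by comparing their plans with explain().

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["key", "value"])
df.createOrReplaceTempView("events")

# The DataFrame call and the SQL string compile to the same optimized plan,
# which is why DataFrame users get the SQL engine's optimizations for free.
by_key_df = df.groupBy("key").agg(F.sum("value").alias("total"))
by_key_sql = spark.sql("SELECT key, SUM(value) AS total FROM events GROUP BY key")

by_key_df.explain()
by_key_sql.explain()
```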
As a result, the community has invested a lot in the Spark SQL engine, making it one of the best open source SQL engines on the planet. For example, this chart shows the improvement in performance on TPC-DS, the leading benchmark for data warehousing workloads, and you can see that Spark 3.0 now runs about two times faster than the previous version, and also about 1.5 times faster than Presto, a very highly regarded SQL engine for large-scale datasets. Another exciting piece of news on the SQL side is that earlier this year Alibaba set a new record on the TPC-DS benchmark using the Spark SQL engine, exceeding the cost-performance of all the other engines that have been submitted to this benchmark, so it's really the basis of a very efficient and powerful SQL engine today.

The final one I want to talk about is streaming. On Databricks alone we see about 5 billion records per day processed with Structured Streaming, which makes it super easy to take a DataFrame or SQL computation and turn it into a streaming one, and this number has been growing by close to a factor of four year over year, so we're seeing very fast adoption of this high-level streaming API.
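A minimal sketch of that idea, using the built-in rate source as a stand-in for a real stream such as Kafka: the aggregation is ordinary DataFrame code, and only the read and write calls change to make it streaming.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()

# The built-in "rate" source stands in for a real stream (e.g. Kafka).
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# This aggregation is the same DataFrame code you would write for a batch job.
counts = events.groupBy(window(events.timestamp, "1 minute")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")   # in production this would be a table, Kafka topic, etc.
         .start())
# query.awaitTermination()
```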
OK, so given these changes to the project, what are the major lessons we learned? For me, at least, there were two big ones. The first is to focus on ease of use, and to prioritize it for both data exploration and production. We found that a lot of the applications people built quickly became operational apps, streaming apps, or repeated reports, and we wanted to add features to the engine to make it easy to keep these running, and to tell you when something changes or breaks in a way that's easy to fix. A lot of the work in Spark now is about making that super easy. The second big lesson was to design the system around APIs that enable software development best practices and integrate with the broader ecosystem. We designed all the APIs in Spark so that they fit into standard programming environments like Python and Java, so that you can use best practices like composing libraries into an application, testability, and modularity, and you can build packages that other users in your company can safely use to run a computation, or build a wide open source community around them. This is something where we've made a lot of improvements over time as well.

OK, so given these lessons, what's happening in Apache Spark 3.0? This is our largest release yet, with over 3,000 patches from the community, and it's designed to be easy to switch to from Spark 2, so we definitely encourage you to check it out and switch when you can. This chart shows where the patches have gone, and you can see that almost half of them are in the Spark SQL engine, both for SQL support itself and because it's the underlying engine for all the DataFrame API calls, so this is the most active area of development, but there's of course a lot else going on as well. I just want to highlight a few of the features I'm excited about, focusing specifically on SQL and Python, though there's quite a bit of other stuff in 3.0 too.

I'm going to start with a major change to the Spark SQL engine, the largest change in recent years, which is called adaptive query execution.
This is a change where the engine can update the execution plan for a computation at runtime, based on observed properties of the data. For example, it can automatically tune the number of reducers when doing an aggregation, or the join algorithms, or it can even adapt to skew in the data and plan the computation as it goes. This makes it much easier to run Spark, because you don't need to configure these things in advance; it will adapt and optimize based on your data, and it also leads to better performance in many cases.

To give you a sense of how it works, let's look at a simple example: setting the number of reducers. We found that today about 60% of clusters have the number of reducers manually tuned by the user, so that's a lot of manual configuration we want to eliminate and automate. With AQE, after Spark runs the initial phase of an aggregation, it observes the result size, however many partitions it produced, and it can then set a different number of partitions for the reduce, for example coalescing everything down to 5 partitions, and optimize for the best performance based on the data that came out of that aggregation.

Even more interesting things happen with joins and more complicated operators. For example, when you're joining two tables, even if you have high-quality statistics about the data, it's hard to know how many records will show up before the join. Using AQE, Spark can observe this after the initial stages of the join have run and then choose the join algorithm that best optimizes performance downstream. In fact, it can adapt both to the size of the data and to the skew on different keys, so you don't have to worry about treating skewed keys in a special way anymore. This results in really big speedups on SQL workloads: for example, on TPC-DS queries we've seen up to a factor of 8 speedup with AQE. Even better, it means you can run Spark on a lot of datasets without pre-computing statistics and still get great performance, because it discovers the statistics as it goes along.
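For reference, a minimal sketch of how these behaviors are switched on in Spark 3.0 (AQE itself is off by default in 3.0; the two sub-features below correspond to the partition coalescing and skew handling described above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable adaptive query execution and its sub-features (config names as of Spark 3.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # auto-tune reducer count
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
```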
OK, that's just one of the changes that affects both SQL usability and performance; there's quite a bit more. On the performance side we have dynamic partition pruning, query compile time speedups, and optimizer hints, and as I said earlier these have led to a 2x reduction in execution time on TPC-DS and a lot of improvements on real workloads. And finally, there's been a big effort in 3.0 on SQL compatibility, in particular an ANSI SQL dialect that follows the standard SQL conventions in many areas of the language, which makes it very easy to port workloads from other SQL systems. This is one of the areas we're continuing to invest in, but we've made significant strides in 3.0.

OK, so that's a bit about SQL. The other thing I wanted to highlight is Python, where we've done quite a bit of work on both usability and performance. For usability, we've made it much easier to define pandas user-defined functions using type hints in Python, which let you specify the format of data you expect. In Spark it's now easy to write a function that is sent batches of data as pandas Series or DataFrames, or even as an iterator of Series, and you can specify all of this just with type hints, as sketched below. For comparison, the previous API required you to write a lot of boilerplate code to say what kind of input your function expects, which was error-prone; now that can go away, and it's super straightforward to define and use these functions. There's also been a lot of work on performance using Apache Arrow: we're seeing about 20 to 25% speedups in Python UDF performance using the latest enhancements in Apache Arrow, and up to a 40 times speedup in SparkR by using Arrow to exchange data between R and Spark. These are all transparent to the end user; you just upgrade and you get these speedups. And we have quite a few new APIs for combining pandas with Spark as well.
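Here is a small sketch of that type-hint style on a throwaway column: one Series-to-Series pandas UDF and one iterator variant, which is handy when there is expensive per-task setup you only want to run once.

```python
from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).toDF("x")

# Series -> Series: the type hints tell Spark how batches are passed in.
@pandas_udf("long")
def times_two(s: pd.Series) -> pd.Series:
    return s * 2

# Iterator[Series] -> Iterator[Series]: any one-time setup can go before the loop.
@pandas_udf("long")
def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for s in batches:
        yield s + 1

df.select(times_two("x"), plus_one("x")).show(3)
```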
And finally, there are features throughout the engine, including a new Structured Streaming UI for monitoring computations, a way to define custom observable metrics about your data that you want to check in streaming jobs, a SQL reference guide, and a powerful new API for data sources. If you want to learn more about Apache Spark 3.0, I invite you to check out the deep-dive talks at the summit and the many talks on these individual features.

The Spark project itself is not the only place where things are happening; there's a broad ecosystem around it, and I wanted to highlight some of the changes there. At Databricks, last year we released Koalas, a pandas API that runs directly on Spark to make it very easy to port workloads, and it has evolved a lot this year; I'll talk more about it in a moment. Of course there's Delta Lake for reliable table storage. We've also done quite a bit of work to add Spark as a scale-out backend in popular libraries including scikit-learn, Hyperopt, and joblib, so if you use these for machine learning you can just scale out your jobs on a Spark cluster. We're also collaborating on Glow, a widely used library for large-scale genomics, and NVIDIA has been developing RAPIDS, which provides a wide range of data science and machine learning algorithms you can call from Spark with GPU acceleration, and that really speeds up these workloads. Finally, we've also done a lot of work at Databricks on improving connectivity with both commercial and open source visualization tools, so that users can build interactive dashboards using Spark as the backend.

I'll just dive into some of the changes in Koalas specifically. If you're not familiar with Koalas, it's an implementation of the pandas API over Spark, to make it very easy to port data science code written in that really popular library. We launched it a year ago at Spark + AI Summit, and it's already up to 850,000 downloads per month, which is about a fifth of the total downloads of PySpark, so we're really excited with how the community has adopted this library, and we're investing quite a bit more in Koalas. At this event we're excited to announce Koalas version 1.0. This new release has close to 80 percent coverage of the pandas API, it's also quite a bit faster thanks to the improvements in Spark 3.0, and it supports a lot of features that were missing before, including missing values (NAs) and in-place updates, and it also has a faster distributed index type. It's very easy to install Koalas from PyPI and get started, and if you're a pandas user we believe this is the easiest way to migrate your workloads to Spark.
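A minimal sketch of what that migration looks like (the toy data is made up): the Koalas DataFrame keeps the pandas syntax while the work runs on Spark.

```python
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"name": ["muffins", "smoothie"], "minutes": [97, 0], "n_steps": [97, 3]})

kdf = ks.from_pandas(pdf)          # or ks.read_parquet(...) for a directory of files
quick = kdf[kdf["minutes"] < 30]   # same filtering syntax as pandas
print(quick["n_steps"].mean())     # computed by Spark, returned like a pandas scalar
```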
So that's enough of me talking; we'd also like to show you demos of all these new things, and I'm really excited to invite Brooke Wenig, the machine learning practice lead at Databricks, to give you a demo of the new features in Koalas and Spark 3.0.

Hi everyone. These are some crazy times, and while we're all still social distancing, many of us are staying at home trying to figure out what we're going to cook for dinner tonight. I know my whole SF office has gone through the banana bread and sourdough craze, but now we need some new recipes to impress our co-workers with on our food Slack channel. In particular, we need to find a recipe for Matei. Matei is a very busy person who loves to optimize everything. Given the increase in people contributing recipes, we now have millions of recipes that we need to analyze in a scalable manner, so that Matei can make a data-driven decision on what he's going to prepare tonight. Along the way we'll explore some new features in Spark 3.0 and Koalas. Let's get going.

Like any data scientist, I'm going to start off with some exploratory data analysis with pandas. pandas is a great way to get started with exploring your data because it's simple to use, has great documentation, and a fantastic community. Let's go ahead and visualize a subset of our data. You can see here that our recipes data contains the name of the recipe, when it was contributed, nutrition, ingredients, etc. However, we don't just have one file; we have a whole directory of Parquet files we need to read in, so let's go ahead and load our entire dataset with pandas. Unfortunately, pandas wasn't prepared to handle the rising recipe count, so if we allow this query to continue it will crash trying to load 30 gigabytes of data onto a single machine. Let's cancel it instead.

Now, let's use Koalas to load our dataset instead. Koalas provides the pandas-like syntax and features that you love, combined with the scalability of Apache Spark. To load our dataset with Koalas, we simply replace the pd calls with ks for Koalas, and now you can see just how quickly we can load our entire dataset.

Let's say Matei is pretty busy tonight and he wants to find a recipe that takes less than 30 minutes to prepare. You can see I've already written the pandas code to filter for recipes that take less than 30 minutes and then visualize the distribution of the number of steps these recipes take. To convert it to Koalas, I simply replace my pandas DataFrame with my Koalas DataFrame, and voila: I no longer need to downsample my data in order to visualize it; I can finally visualize big data at scale. We can see here the distribution of the number of steps: while most recipes take fewer than 10 steps, you can see the x-axis extends all the way out to 100. There's a universal muffin mix that takes 97 steps to prepare, but Matei won't be cooking that one tonight.

In addition to our recipes data, we're also interested in the ratings for those recipes. Because Koalas runs on Apache Spark under the hood, we can take advantage of the Spark SQL engine and issue SQL queries against our Koalas DataFrame. You'll see here that we have our ratings table. Now we want to join our ratings table with our recipes table, and you'll notice I can simply pass in the Koalas DataFrame with string substitution. Let's go ahead and run this query. Wow, we can see that this query took over a minute to run.
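A sketch of the kind of join query just described, with made-up miniature tables; if I recall the Koalas API correctly, ks.sql lets you reference in-scope Koalas DataFrames in the query string with curly-brace substitution, which is what the demo uses.

```python
import databricks.koalas as ks

recipes = ks.DataFrame({"id": [1, 2], "name": ["muffins", "smoothie"], "minutes": [97, 0]})
ratings = ks.DataFrame({"recipe_id": [1, 2], "rating": [3, 5]})

# {recipes} and {ratings} are substituted with the Koalas DataFrames above,
# and the query runs on the Spark SQL engine under the hood.
joined = ks.sql("""
    SELECT r.name, r.minutes, t.rating
    FROM {recipes} r JOIN {ratings} t ON r.id = t.recipe_id
""")
joined.head()
```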
Let's see how we can speed it up. I'm going to copy this query and move it down below, after enabling adaptive query execution. With adaptive query execution, Spark can optimize the query plan based on runtime statistics. Let's take a look at the query plan that's generated once it's enabled: I'm going to dig into the Spark UI and take a look at the associated SQL query. Here we can see that it passed a skew handling hint to the sort-merge join based on those runtime statistics, and as a result this query now takes only 16 seconds, for about a 4x speedup with no code changes.

In addition to using Koalas to scale your pandas code, you can also leverage Spark in other ways, such as using the pandas function APIs. We're going to use the pandas function APIs to apply arbitrary Python code to our DataFrame, and in this example we're going to apply a machine learning model. After seeing that universal muffin mix take 97 steps in 30 minutes, I'm a little bit skeptical of the time estimates for these recipes, and I want to better understand the relationship between the number of steps and ingredients and the number of minutes. As a data scientist, I built a linear regression model on our dataset using scikit-learn to predict the time that a recipe will take. Now I want to apply that model in parallel to all records of our DataFrame, and you'll notice here just how easily I can convert our Koalas DataFrame to a Spark DataFrame. To apply our model, I'm going to use the function called mapInPandas. mapInPandas accepts the function you want to apply, along with the return schema of the resulting DataFrame. The function I want to apply is called predict; predict accepts an iterator of pandas DataFrames and returns an iterator of pandas DataFrames. I'm then going to load in our model. If your model is very large, there's high overhead in repeatedly loading the model over and over again for every batch in the same Python worker process; by using iterators, we can load the model only once and apply it to every batch of our data.
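A condensed sketch of that pattern (the model and data here are stand-ins for the demo's scikit-learn model and recipes table): mapInPandas takes a function over an iterator of pandas DataFrames plus the output schema, so any one-time setup can happen before the loop and be reused for every batch.

```python
from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the pre-trained model and recipes data from the demo.
model = LinearRegression().fit([[2, 3], [12, 10], [97, 20]], [10, 45, 30])
sdf = spark.createDataFrame([(4, 5), (20, 12)], ["n_steps", "n_ingredients"])

def predict(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # If the model were loaded from disk or a registry, you would load it here,
    # once per task, rather than once per batch.
    for pdf in batches:
        pdf = pdf.copy()
        pdf["predicted_minutes"] = model.predict(pdf[["n_steps", "n_ingredients"]])
        yield pdf

result = sdf.mapInPandas(
    predict, schema="n_steps long, n_ingredients long, predicted_minutes double")
result.show()
```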
So now let's go ahead and see how well our model performs. We can see here that the predicted minutes are pretty close to the actual number of minutes for some recipes: in one case the true number of minutes is 45 and we predict 46; where the true value is 35, we predict 39. But in some cases our time estimates are a bit off, and we can actually see in the description, "don't be dissuaded by the long cooking time." So Matei has decided he doesn't want to cook anything tonight; he just wants to make a smoothie with the shortest preparation time. So, Matei, which recipe will you be preparing? Looks like Matei will be preparing a very berry smoothie with only four ingredients and a grand total of zero minutes. Wow, he's really optimizing for speed. Stay tuned for a very delicious surprise.

Wow, thanks Brooke, that was an amazing demo, and the smoothie turned out delicious. One other announcement I wanted to make, which also ties into Brooke, is that we've been working to publish a new edition of Learning Spark, the popular O'Reilly book; Brooke is one of the co-authors, actually, and we're giving away a free copy of the e-book to every attendee of the summit, so we invite you to check it out if you want to learn about the new features in Spark 3.0.

OK, the final thing I want to end on today is what's happening next in the Apache Spark ecosystem. If we step back and look at the state of data and AI software today, it has obviously made great strides over the past 10 years, but we still think that data and AI applications are more complex to develop than they should be, and we believe we can make that quite a bit easier. We have a lot of ongoing efforts in open source Apache Spark to do this, building on the same two lessons: ease of use in both exploration and production, and APIs that connect with the standard software ecosystem.
So I'm just going to briefly talk about the big initiatives we're working on at Databricks for the next few versions of Spark.

The first one is what we're calling Project Zen, and the goal is to greatly improve Python usability in Apache Spark. Because Python is the most widely used language now, we want to make sure that Python developers have a fantastic experience that's familiar from everything else they do in Python, and there are quite a lot of areas where we think we can improve the experience. We've called this Project Zen, by the way, after the Zen of Python, the set of principles for designing Python that have led to it being such an amazing environment today. Some of the things we're working on are better error reporting, porting some of the API changes from Koalas (we got to experiment with them there, we think they're useful, and now we want them to just be part of Spark), improved performance, and Pythonic API design for new APIs. I'll give a few examples of what we're doing with this next. We also have a continued initiative on adaptive query execution: we've been really pleased with what it's doing so far, and we think we can cover more and more of the SQL optimizer's decisions adaptively and dramatically reduce the amount of configuration or data preparation needed to get great performance with Spark. And finally, we're also continuing to push on ANSI SQL; the goal is to run unmodified queries from all the major SQL engines by having dialects that match them. We've been working a lot with the community to build these, and we think it will provide a lot of benefit in Spark.

Just to give you a couple of examples of the Project Zen features, one of them is around error messages. If you run a Python computation today and you have an error on your worker, such as dividing by zero, you get a very scary-looking error trace in your terminal: lots of stuff going on, and if you look at it closely there's a lot of Java in there, and maybe you can see that part of the trace is actually about the problem that happened in your Python worker, which in this case was a division by zero, but it's pretty unfriendly, especially if you're just a Python developer; it's hard to see exactly what went wrong. So as part of Project Zen we're simplifying the behavior in Python for a lot of common error types, so that if it's a Python-only error, or a SQL planning error, you see a very small message that lets you directly fix the problem. You can see this one has just the Python-relevant bits of that error, and you can see that there is a division by zero. Of course, this might be a bad example, because if you really accept Project Zen then you will come to a much better understanding of emptiness and you will probably never get a division by zero again, but it's just one example.

The second change I want to show is a new documentation site for PySpark that's designed to make it really easy to browse and find Python examples, and it's built on the latest best practices for Python documentation, and we think this will make it easier to get started for many of our users as well. So these are some of the changes; we're really excited about what's next for Apache Spark, and we look forward to seeing what you do with it.
2020-06-29