Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Hi there, welcome to this talk. We'll be presenting how we have leveraged Spark to develop AI-enabled products and services at Bosch. In this presentation we will cover two applications that we have built at Bosch using the Spark API. I am Prashant, a senior data mining engineer at Bosch, and I will be presenting the first use case, called the manufacturing analytics solution. The second use case will be presented by my colleague, a senior data scientist at Bosch, who will take you through the financial forecasting use case. We are both working for the Bosch Center for AI here in Sunnyvale, California.

Bosch is a world-class manufacturing company, with hundreds of manufacturing sites, thousands of assembly lines, and production of billions of parts every year. Therefore we have some of the biggest data sets in the world, comprising process, machine, and supplier data, just to name a few. Bosch operates in four main business domains: mobility, industrial technology, energy and building technology, and consumer goods. We have around half a million employees spread across all regions of the world. AI is a key enabler for us in the transformation towards an AI-driven IoT company: by the middle of this decade we want all our products to be equipped with AI, or for AI to have played some part in their development. The Bosch Center for AI has been established to help Bosch achieve this goal.

Since Bosch has many manufacturing plants across the globe, there is a need to use this production data to improve process efficiencies and increase product quality, so we developed the manufacturing analytics solution, which aims to solve these problems. To facilitate this, we built data pipelines and automated data preparation, created centralized storage for all assembly lines, and set up compute resources for in-house data analysts and engineers
to directly analyze the data using self-serve analytics and standardized dashboards. We have also developed advanced analytics tools, like root cause analysis, that help us identify the root causes of failures in a plant.

Below is the architecture of the pipeline. On the left you can see the data coming in from the assembly lines into our Kafka messaging system. From there the data is pushed into the Hadoop file system, where
we have several Spark jobs running to perform data transformation, cleaning, and augmentation. Many of these jobs are implemented using the Scala-based Spark API. We have also developed Spark jobs that perform advanced analytics, like root cause analysis, which have been implemented in Python. The output of these jobs is pushed into Tableau, our front-end application. The data is published to Tableau as static extracts as well as via a live connection between Tableau and Hadoop using Apache Impala.

In the rest of my presentation I will focus on the root cause analysis, where we are trying to answer the question: why are parts failing quality checks in an assembly line? Let me illustrate with an example. What you see here is a typical assembly line in a manufacturing plant. The raw material comes in from the left to process one, and then goes through process two, where there could be multiple machines working in parallel. The parts then move to processes three and four, and finally reach process five, where the end-of-line tests are performed. All the parts that pass these final tests are then shipped to the consumers as sensors, auto parts, or home appliances. But there are also certain parts that fail the quality checks, and we want to find the root cause for these failures. We need to do that because we have lost both time and raw material producing these parts, and that costs a lot. So the target of interest here is the failed tests from process five, and the potential root causes are the measured process parameters, the tools used, and the machine configurations from process one through process four.

We have implemented this root cause analysis as four different modules. The first one is part graph generation, where all the process data for a unique part is stored as a unique part graph. These
Part graphs are then, passed on to the feature extraction. Module where. The, where. We extract self defined features that. Could help us identify the. Failures. Or. Could, identify the potential root causes for the failures both. These modules, have, been implemented, using spark API and the networks API. In Python, the. Next module is the feature matrix generation. Where. We prepared the data so, that the different machine learning models and statistical algorithms. Could be applied we. Create a mapping between the target variables, that, is the dependent, variables, and the independent, variables, in the, final module we apply these machine, learning models on the, independent, partitions, of. Well-defined, data assets. For. This we have used scikit-learn, scifi and stat model CPI on top, of our SPARC API to accomplish, this. In. In, these few slides you. Can have a peek into our code, and implementation. Here. You, can see how we have implemented the feature extraction. Module as briefed. Earlier we. Generate a part graph to encapsulate all the information, of a, single part on the. Left you can see the spa data frame with two columns the. First one is the unique identifier, for a part and the, second column is the corresponding, part graphs. Examining. The part graphs you, can see that it is an acyclic graph, where. Each node represents the, location, that the part visited, and stores. All the information that is collected in that location. Each. Part, graph is then passed to the feature extractor, which. Extract, the features out of it and the final output is another, spark data frame where the first column would be the part identifier, and the second column is the list of features that have been extracted. Here. Is a small code. Snippet, from, the feature extraction, module, in. Line 4 you can see the definition of a typical feature extracted, that. Takes a part, graph as input, processing. And returns. A list of features that could be either categorical. 
or continuous. In line 9 we wrap the feature extractor with a UDF, depending on the return type of the extractor, and in line 12 we apply this UDF to the entire part graph DataFrame to extract the features for all parts.

In this code there is a clear separation of expertise: the engineers who implement the feature extractors do not have to know anything about the underlying Spark API and are free to define the logic according to customer needs, while the application of these extractors is handled exclusively by the Spark developers. So there is a smooth handshake and separation.

Extracting the root causes for the innumerable tests across plants within Bosch involves huge computational complexity. As an example, we have tens of thousands of assembly lines, and on average each line produces around 2 million parts.
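To make the feature-extraction pattern described above concrete, here is a minimal plain-Python sketch. The names and the dict-based part graphs are hypothetical stand-ins for the real NetworkX graphs, and an ordinary comprehension stands in for the Spark UDF applied to the DataFrame; this illustrates the separation of concerns, not the production code.

```python
# Sketch: extractor functions owned by domain engineers, a wrapper
# owned by the Spark developers (all names hypothetical).

def cycle_time_feature(part_graph):
    # Total processing time across every station the part visited.
    return [sum(node["duration"] for node in part_graph["nodes"])]

def station_count_feature(part_graph):
    # How many stations the part passed through.
    return [len(part_graph["nodes"])]

EXTRACTORS = [cycle_time_feature, station_count_feature]

def extract_features(part_graph):
    # The wrapper: run every registered extractor on the part graph
    # and concatenate the resulting feature lists.
    features = []
    for extractor in EXTRACTORS:
        features.extend(extractor(part_graph))
    return features

# Two toy rows of (part_id, part_graph):
rows = [
    ("part-001", {"nodes": [{"duration": 3.0}, {"duration": 4.5}]}),
    ("part-002", {"nodes": [{"duration": 2.5}]}),
]
feature_rows = [(part_id, extract_features(g)) for part_id, g in rows]
print(feature_rows)  # [('part-001', [7.5, 2]), ('part-002', [2.5, 1])]
```

The domain engineers only ever touch functions like `cycle_time_feature`; the wrapper and its Spark-side application never change.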
Each line thus holds around 30 billion data points, so it is incumbent upon us to be able to scale these modules.

Talking about scaling our code, I would like to bring to your notice one of the many challenges we faced, which is the feature matrix generation. As explained earlier, we want to create a mapping between the target variables, that is the dependent variables, and the independent variables. You can see two matrices on the left, where each represents a particular group of features. The goal is to create a certain Cartesian product so that we can build the data frame shown on the right. Joins are typically very hard in Spark, especially when we are talking about millions of rows per data frame, where each row contains lists of thousands of features. We implemented the required logic using UDFs and smart partitioning, and it was taking nearly seven hours for a typical assembly line. That was concerning, because we need to deliver results very frequently, and also for much bigger lines. Then we made an optimization to our code where we replaced all the loops and conditionals with Python functional constructs like map, filter, and reduce. This increased the speed threefold, and we were able to execute the code within two hours. So, at least from our experience, using functional constructs with the Spark API is a very good programming approach.

Due to the constraint of time I could not delve much deeper, but please feel free to ask questions or to reach out to me. I will now hand over the presentation to Octave.

Thank you. The next use case we will be covering today is the financial forecasting use case that we developed at our Center for Artificial Intelligence. First, let me set up the environment that this solution was developed in. We were a team comprised of controllers, who are the end users of the solution.
On the development side we had software engineers, data scientists, and data engineers. We wanted to achieve automatically generated sales forecasts at large scale, so that we could improve financial decision-making at Bosch and take it to a new level, one that combines the talents of AI and human intelligence and reaches better decisions for the future of the company.

In short, what we want to achieve is the forecast of several key performance indicators every month. Currently we are forecasting about 300,000 time series; at full scale we expect this number to be about 3 to 4 million time series. To forecast these we are using about 15 different statistical and machine learning models, together with several data transformations. Bosch is comprised of about 15 different companies internally, and each company has a specific business structure. Therefore the solution we developed has to take these structures into consideration and has to be able to break down a given KPI in terms of customers, products, regions, business divisions, or any other custom need. Given this constraint, we chose revenue forecasting for Bosch as our pilot application. The forecasts for revenue need to be generated every month, immediately after the month-closing calculations. There is a very intuitive reason why this needs to happen quickly: the point of these forecasts is that the company can see developments and take measures to react to them. If we take, say, 5 or 10 days to generate these forecasts, we are removing the opportunity for action and giving the organization less time to react to these changes. Therefore our target is to create these forecasts within hours of data availability. Let's
See what that would, be in terms of data complexity, so, if we assume we have 1 million time series, and we apply 5 models per I'm series at. Five seconds, per model we are looking at about 15 million seconds of computation, that, needs to happen every, month if we want to achieve this computation, within a couple, hours we, need to use thousands, of cores and this, is why we went towards using, spark. Here. Is the technical architecture, our. Data. Is stored in a sequel database, and we, ingest this data using our programming, language and we understand, the business structure.
In the forecasting domain, such a business structure is called a hierarchical time series. Once we have created our data structure for hierarchical time series, we use an in-house-built Python module to automatically select which statistical or machine learning models should be applied to each time series. We then take these time series and their models and distribute them using Spark and R, collect all the results back into our driver, consolidate the hierarchical time series, and push the results back to the SQL database. The model universe comprises well-known traditional models such as ARIMA, state-space models such as exponential trend smoothing, and machine learning models such as neural networks.

If you think about the task I just explained on the previous slide, it is embarrassingly parallelizable: at no point in this pipeline does a single time series need any information from the other time series. So we can take a single series and the models that need to be applied to it, send them over to multiple cores, get the results back, and then use all the results together for the consolidation step. This makes Spark a really good candidate for us. If you have a smaller scale to handle, you can of course distribute this computation on a single machine using packages such as parallel in R. But at the scale I have described, with hundreds of thousands or millions of time series, you actually need to do this across several servers, and you need a technology like Spark.

The next question I would like to answer is: why did we use R? The latest and greatest forecast algorithms are actually available in R, so we don't want to reinvent the wheel, but rather take what's available out there and quickly put it into production for our company's use. Given Spark and R, a user has two APIs to choose from to use the two technologies together.
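Before looking at those APIs, the embarrassingly parallel fan-out just described can be sketched in miniature. In this hypothetical stand-in, a Python thread pool plays the role of the Spark cluster and a trivial mean forecast plays the role of the model universe; in production the same shape is achieved by distributing the list of series with Spark:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "model": forecast the next value as the mean of the series.
def fit_and_forecast(series):
    return sum(series) / len(series)

# No series needs information from any other series -- this is
# what makes the workload embarrassingly parallel.
all_series = {
    "revenue/region-A": [100.0, 110.0, 120.0],
    "revenue/region-B": [50.0, 55.0],
}

# Fan the per-series tasks out to a pool of workers, then collect
# the results for the consolidation step.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {name: pool.submit(fit_and_forecast, s)
               for name, s in all_series.items()}
    forecasts = {name: f.result() for name, f in futures.items()}

print(forecasts)  # {'revenue/region-A': 110.0, 'revenue/region-B': 52.5}
```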
The two APIs are sparklyr and SparkR. sparklyr provides an API that accepts Spark data frames and also returns Spark data frames. SparkR has the same functionality with dapply and gapply, but can also distribute UDFs over lists using the spark.lapply functionality. spark.lapply therefore brings a new mode of distribution and enables use cases unlike what was presented earlier in this talk: instead of distributing computation over the partitions of a data frame, we can be a little more loose in our definition of data structures and distribute over lists.

Given this advantage of flexibility, we made the decision to use user-defined functions, similar to the root cause analysis module that Prashant shared with us, and to distribute the UDFs over lists using the spark.lapply function. The second reason we chose lists is the flexibility of changing the data as we move forward with our application and as our requirements develop. We can think of lists as folders instead of a data frame: when I distribute over a list, I can just put one more item into my folder and distribute; I do not change my architecture or my schema, I only have to make changes to my UDF. My data science team can make data science changes without interfering with the data engineering or Spark development team, and without modifying the whole architecture of the solution. That is why we have chosen the lists path. We also use spark.addFile to make a number of files readily available on the worker nodes.

Finally, I want to speak about a bug that was a very crucial issue for us at the beginning of our development. In Spark 2.3.0 you couldn't actually distribute UDFs
over a list larger than roughly four to six thousand elements. This was due to an integer overflow error; it has since been fixed, and I have listed the JIRA here for those who are interested. But until the fix made it into our software, we had to create a workaround: if our lists were larger than 4,000 elements, we broke them into chunks of 4,000 and distributed 4,000 tasks with Spark at a time. You can see the example code for this chunking logic on the top right.
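The chunking workaround can be sketched as follows. This is a plain-Python stand-in: the chunk size is illustrative, and the serial `map` takes the place of the distributed spark.lapply call that the real SparkR code makes on each chunk:

```python
def chunked(items, size):
    # Split a list into consecutive chunks of at most `size` elements.
    return [items[i:i + size] for i in range(0, len(items), size)]

def distribute(udf, tasks, chunk_size=4000):
    # Workaround: never hand the scheduler more than `chunk_size`
    # tasks at once.  The serial `map` below is a stand-in for the
    # distributed spark.lapply call made per chunk in SparkR.
    results = []
    for chunk in chunked(tasks, chunk_size):
        results.extend(map(udf, chunk))
    return results

out = distribute(lambda x: x * 2, list(range(10)), chunk_size=3)
print(out)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```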
Once we moved to Spark 2.4.0, we were able to simplify the code to what you see on the bottom right.

To conclude, here is a graph that shows the speed-up we achieved. In this experiment we used about 2,000 time series and scaled the number of cores from 8 to 64, and we saw that we can get a speed-up of roughly 7.5x when we increase the number of cores by 8x. That reminds us of the well-known idea that there is no free lunch, but this is actually a really good speed-up factor, one that currently enables us to create hundreds of thousands of forecasts within hours at Bosch.

We have had an incredible amount of collaboration and work put in by many of our colleagues during the development of these two applications, so we want to send our thanks to them and recognize their work here.

We really appreciate that you attended our session. Please send us your questions and we will answer as many as we can, and please also rate and review our session. Thank you very much.