Simplifying AI integration on Apache Spark


Hello everyone, my name is Haim Shungar and I'm a principal software engineer at Informatica. I look into various AI and ML initiatives at Informatica, focusing on bringing them to the Informatica platform and products. If you have any questions during the talk, please feel free to add them in the comment box and we will answer them as soon as possible.

Here is the agenda for the talk. We will first set the context and discuss the personas involved. After that, we will see how Informatica can run jobs on Spark. Then we will look at the problems we face while executing AI/ML solutions in various production environments, and at the solutions, where we'll see how we can simplify AI/ML integration on top of Spark. After that, we will see a demo of deploying an AI solution and how CI/CD can be achieved while integrating AI solutions, followed by a summary.

Since the topic of this talk, simplifying AI integration on top of Spark, is pretty broad, I want to set the right context so that it becomes easier for everyone to follow. When we talk about AI and Spark, two personas come to mind: data scientists and data engineers. Data scientists mainly deal with data exploration, model building, and model training, and they use tools like notebooks and RStudio and libraries like TensorFlow, Keras, and scikit-learn. Data engineers are mainly involved in data ingestion, data pre-processing, transformation, and cleansing, and they use libraries like Spark and tools like Informatica. You can see how different the tools and technologies used by these two personas are. In this talk we will discuss how difficult it is for these two teams to collaborate and how we can solve those problems.

Informatica offers a data integration tool for data engineers, where they can connect to various data sources, then drag and drop transformations to create a data flow. Once they are done, they can select the Spark engine as the runtime, and the platform will generate Spark code for them, which can then run on various clusters like Hadoop or Databricks. We also have a Cloud Data Integration product, part of our Intelligent Cloud Services platform, which offers a similar UI: customers can drag and drop to create a data flow, which is then executed on top of Spark on a Kubernetes cluster. There the platform takes care of auto-scaling and auto-provisioning of the cluster, and Spark runs in serverless mode. I'd also like to introduce our AI engine, CLAIRE, which, among other things, helps auto-tune data management tasks and achieve better performance, especially for high volumes of data. In this talk we'll assume that data engineers are using Informatica tools for their daily tasks, while data scientists are free to use whatever technologies they want.
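To make the tooling gap concrete, here is a rough, hypothetical sketch of the kind of code each persona writes day to day. All file names, columns, and paths are made up, and the Spark half is not the code Informatica actually generates; it is just a stand-in for a typical ingestion and cleansing flow.

```python
# Purely illustrative: the two personas' day-to-day code side by side.

# --- Data scientist: exploring data and training a model in a notebook ---
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")                 # small, local sample
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"], test_size=0.2)
model = RandomForestClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# --- Data engineer: ingestion and cleansing as a Spark job ---
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest_and_cleanse").getOrCreate()
raw = spark.read.option("header", True).csv("hdfs:///landing/events/")
clean = (raw.dropna(subset=["customer_id"])
            .withColumn("event_ts", F.to_timestamp("event_ts"))
            .dropDuplicates(["event_id"]))
clean.write.mode("overwrite").parquet("hdfs:///curated/events/")
```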
Now let's discuss the problems we face while integrating AI solutions on top of Spark. The use case we'll look at is an organization with a single data engineering team and multiple data science teams, where the data engineering team uses Informatica Data Engineering Integration (DEI) for running Spark jobs on the cluster. A data science team develops an algorithm and shares the binaries and Python code with the data engineering team. This is the first pain point: either the data engineer has to understand the technologies used by the data science team, or the data science team has to understand the data engineering platform. Next, the binaries have to be deployed manually, and if the data science team comes back after two months with a better version of the algorithm, we have to deploy that binary manually again, which causes downtime. Finally, there can be multiple data science teams working with different runtimes and different technologies, and integrating any one of them, let alone all of them, into a single Spark job is either not supported today or very costly to achieve.

Now let's discuss the solution. We are coming up with a new offering called the AI Solutions Repository, which is a collection of AI solutions. A solution is code plus metadata, along with its dependencies and runtime details. A solution can be written in any language, can have any dependencies, and can potentially run on GPUs, provided the right hardware and drivers are available. The solutions repository is a collection of solutions along with runtime details, where those runtime details can be associated with any of the added solutions. A solution is a combination of the code provided by the data scientist, its dependencies, and some code generated by the solutions repository so that the solution can execute on top of any supported platform. For example, a data scientist using a Jupyter notebook can upload an AI solution to the solutions repository, and once the solution is deployed it can be used from different environments like Java, Spark, or the Informatica data integration platform. The implementation is based on a generic solutions repository, where solutions can be uploaded from any language and consumed from any platform. It is implemented in a pluggable way, so that customers can write their own plugins to upload and consume the solutions in the repository directly from their own products.

Let's see how the AI Solutions Repository solves the problems we saw earlier. Each team can add its solution along with the runtime details, and the data engineer can then browse through all the solutions and select any of them; the platform takes care of executing that solution on top of Spark on the cluster. As you can see, there is minimal collaboration between the data scientist and the data engineer: the discussion is about which solution to use instead of how to use that particular solution. In case of any upgrade or version change, deployment is automatically taken care of by the solutions repository, with no downtime. And using this approach, multiple versions of the same solution, or different solutions with different runtimes, can be supported within the same Spark job.
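Since the repository is described as a generic, pluggable framework, one way to picture it is as a small plugin contract that each product implements to publish or consume solutions. The sketch below is purely hypothetical; none of these class or method names come from the actual product.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class SolutionsRepositoryPlugin(ABC):
    """Hypothetical plugin contract for a generic solutions repository:
    one side uploads code plus metadata and runtime details, the other
    side pulls down platform-specific wrappers to run the solution."""

    @abstractmethod
    def add_solution(self, name: str, code: Dict[str, str],
                     dependencies: List[str],
                     runtime: Dict[str, Any]) -> str:
        """Register a solution (init/infer/train code, dependencies,
        container details) and return its identifier."""

    @abstractmethod
    def get_consumable(self, solution_id: str, platform: str) -> bytes:
        """Fetch the generated binaries/wrapper for a target platform,
        for example 'java' or 'spark'."""
```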
Now let's jump into the demo. In the demo we'll see how easily data scientists and data engineers can collaborate, and how, in case of an upgrade, we can achieve CI/CD with no downtime. A data scientist will use a Jupyter notebook to develop an image classification algorithm and upload that solution to the AI Solutions Repository, from which data engineers can consume it from their platform and execute the job on top of Spark.

This is the demo setup: on the left-hand side we have a Jupyter notebook, which represents the data science environment; on the top right we have the AI Solutions Repository UI; and on the bottom right we have Informatica's Developer tool, which represents the data engineering platform. The data scientist formats his code in the following way and provides the following information: initialization code, prediction code, training code, parameters, inputs and outputs, container details, and dependent files. That's what has been done in the following cells: the initialization code does all the imports and initializes various variables and models; the inference code produces the actual predictions; and the training code can be used for training or retraining the model as new data arrives. After that come the dependent files, if there are any, and the parameters, which can be tweaked before the solution is used, for example how long a training or retraining run should last. The data scientist can then use the following code to download the notebook plugin for the AI Solutions Repository and, using the add_solution method, provide all of these details: the name of the solution, the initialization code, the inference code, the training code, the parameters, the inputs and outputs, the container details, and the files. Once he executes this cell, the solution is added to the solutions repository, and he can browse it using the link shown here.
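As a rough idea of what that notebook cell might look like: the talk names an add_solution method and the kinds of fields it accepts, but the module, client class, argument names, and values below are invented for illustration.

```python
# Hypothetical sketch of registering the notebook's solution; not the
# real plugin API.
from ai_solutions_repo import SolutionsRepositoryClient   # assumed module

init_code = "model = load_model('classifier.h5')"          # runs once
infer_code = "predictions = model.predict(batch)"          # scores a batch
train_code = "model.fit(new_images, epochs=params['epochs'])"  # retraining

client = SolutionsRepositoryClient("https://repo.example.com")  # assumed URL
solution_id = client.add_solution(
    name="image_classifier",
    init_code=init_code,
    infer_code=infer_code,
    train_code=train_code,
    parameters={"epochs": 5},                  # tweakable before use
    inputs=[{"name": "image", "type": "binary"}],
    outputs=[{"name": "label", "type": "string"},
             {"name": "confidence", "type": "double"}],
    container={"image": "python:3.8-slim", "gpu": False},
    files=["classifier.h5", "labels.txt"],     # dependent files
)
print("added solution:", solution_id)
```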
Now let's look into the AI Solutions Repository. If I refresh this page I can see the new solution that got added, and we can see all the components of this solution: the initialization code, the inference code, the ad hoc code (which is the training code), and the parameters. We can also see the inputs and outputs for this particular solution, the container details attached to it, and the different dependent files. After that we can deploy this solution; I have done that ahead of time for a similar solution. When we deploy, we land on a page that shows all the different consumption options from which this particular solution can be consumed. Behind the scenes, the AI Solutions Repository does the heavy lifting of generating binaries for all of these supported platforms. For example, you can select Java, follow the steps, and execute the solution locally on your system. If you want the execution to happen on the solutions repository instead, you can click on "Create REST endpoint", and it will create a REST endpoint that you can call from Java code or with a curl command.

Let's test whether this solution behaves as expected. We copy and paste this command and provide the data for which we need predictions; I'm giving it the refrigerator image here. When we run the command we can see what the output looks like and what the results are: it correctly predicts refrigerator with very good confidence. Now I'm happy with the results and I want to run this solution on top of Spark, so I can click on Spark and see the instructions: I can download all the binaries and follow this code, which takes a DataFrame and returns a DataFrame containing the predictions.

For this demo, however, we want to use the Informatica ecosystem. For that we can create a new transformation within Informatica's Developer tool, select "solution" in that transformation, browse all the deployed solutions, and easily select one of them to run on top of Spark. So now let's go to the data engineering platform and see how we can consume this solution from there. This is Informatica's Developer tool, where data engineers can create objects for various data sources, drag and drop those data sources, create various transformations, and form a data flow. For this case we are going to use the Solutions transformation, which provides an option to browse all the solutions already deployed in the AI Solutions Repository. When we select the solution here, the inputs and outputs get set automatically, so we can very easily connect them and run the mapping. Once we are done with that, we can select Spark as the runtime and run the mapping, which will then run on top of Spark.

I have already run the mapping ahead of time and we have the results. This is the result we got for the following input: we are able to correctly predict television and refrigerator, but for the watch the prediction is wrong; it is classified as a refrigerator with fairly low confidence. This is because the training data we used was limited to televisions and refrigerators. This is a very classic use case where the model has to be constantly updated as new data arrives, and the framework should be able to handle that transparently. So what we can do is upload images of watches so that we can train on them. Once we upload the data, we can click on "trigger docker run", which triggers the training job for us, and we can monitor the training job from here. Behind the scenes, the AI Solutions Repository takes care of repackaging the updated, trained model so that none of the platforms from which this solution is consumed have to make any changes.

Now the training is completed; let's verify whether our solution has been updated to predict watches. I can provide the image of a watch, this one here, and if I run this command now, the solution correctly predicts the watch as a wristwatch.
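For reference, the REST test used twice in this demo (first with the refrigerator image, then with the watch) might look roughly like the snippet below, written with Python's requests rather than curl. The URL, route, and response shape are invented; the demo only shows that the call returns the predicted class and a confidence score.

```python
# Hypothetical REST test of the deployed solution.
import base64
import requests

with open("wrist_watch.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("ascii")}

resp = requests.post(
    "https://repo.example.com/solutions/image_classifier/predict",  # assumed
    json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())    # e.g. {"label": "wristwatch", "confidence": 0.93}
```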
So we can go back to our data engineering platform and, without any modification to the mapping, simply rerun it. The Solutions transformation takes care of downloading all the required binaries and the updated model, and behind the scenes it transfers those binaries to the cluster and makes sure the updated model is used for the predictions. We can also check in the Hadoop cluster whether the job has been submitted: this particular job just got submitted and is running on top of Spark.

Let us understand what happens behind the scenes when a job gets submitted from Informatica to Spark. This is the setup we just discussed: within Informatica we have the AI Solutions Repository and DEI, which is used by the data engineer. The data scientist creates a solution and adds it to the solutions repository, which generates the required code so that the solution can execute on top of Spark. The data engineer then creates a mapping using this solution and submits it to the Data Engineering Integration platform. The platform takes care of caching the binaries, uploading the runtimes to the cluster, and submitting the Spark job. Once the job is submitted, executors are created and the binaries are localized. The executors consume these binaries, and the binaries have the intelligence to launch a container, start the data scientist's code inside that container, and then start getting predictions from it. In the same way, data scientists can add multiple solutions, and the solutions repository will take care of generating the required code so that they can be executed from all the supported platforms. Data scientists can also provide new runtimes, which can then be associated with any of these solutions.
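One way to picture this executor-side flow is the simplified PySpark sketch below: the localized binaries would start a local inference container on each executor and stream partition rows to it. Everything here (the docker command, image name, port, and scoring route) is an assumption for illustration, not Informatica's actual implementation.

```python
# Simplified, hypothetical sketch of the executor-side pattern: each
# partition ensures a local inference container is running, then sends
# its rows to it for predictions.
import subprocess
import time
import requests

SCORING_URL = "http://localhost:8501/predict"    # assumed container endpoint

def _ensure_container_running():
    """Start the data scientist's container on this executor if it is
    not already up (assumed image name and port)."""
    try:
        requests.get("http://localhost:8501/", timeout=1)
    except requests.exceptions.RequestException:
        subprocess.Popen(["docker", "run", "-d", "-p", "8501:8501",
                          "image-classifier:latest"])   # assumed image
        time.sleep(10)                                  # crude readiness wait

def score_partition(rows):
    _ensure_container_running()
    for row in rows:
        resp = requests.post(SCORING_URL,
                             json={"image_path": row.image_path}, timeout=30)
        yield (row.image_path, resp.json().get("label"))

# Usage against a DataFrame of image paths:
# predictions = images_df.rdd.mapPartitions(score_partition) \
#                            .toDF(["image_path", "label"])
```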
Our job has finished now and the mapping ran successfully, so let's look at the results and see whether the predictions are correct this time. These were the predictions last time for the data on the left-hand side, and if we look at the results now, we can see that the watch is now being predicted correctly as a wristwatch.

Let's have a recap of the demo. We saw how easily we can add an AI solution from a Jupyter notebook and then explore that solution in the AI Solutions Repository. We then deployed the solution, with the repository taking care of generating all the binaries through which it can be consumed from various platforms. We created a REST endpoint, used it to test the added solution, and saw what the data looks like. After that we went to the Data Engineering Integration platform, created a mapping using Informatica, added a transformation that uses the deployed solution, and ran the mapping on top of Spark. It was really easy to take the AI solution provided by the data scientist, use it on the data engineering platform, and run it on top of Spark. Then we found that our solution needed to be retrained; we were able to retrain it with a few clicks and use the retrained solution from the data engineering platform with no changes and no downtime.

Let's summarize the talk. We first looked at the two personas, data scientists and data engineers, and at the tools and technologies they use, and we saw how challenging it is for these two teams to collaborate. We then looked at Informatica, which provides a drag-and-drop way of creating Spark jobs and eases the life of data engineers. We saw how the AI Solutions Repository lets us integrate the work of data scientists and data engineers with minimal collaboration and helps us process AI solutions on top of Spark, at Spark scale, which yields better performance. We saw that CI/CD is built into the AI Solutions Repository, so in case of any upgrade to a solution there is no downtime and no change required on the data engineering side. We also saw that the AI Solutions Repository is based on a generic solutions repository framework, so our partners and customers can develop their own plugins and add or consume solutions directly from their own platforms and products. With this easy drag-and-drop way of creating Spark jobs, minimal collaboration between data scientists and data engineers, and built-in CI/CD, organizations can achieve an overall reduction in cost while delivering their projects, and we look forward to customers using this feature.

A few points about Informatica: we are leaders in the Gartner Magic Quadrants, especially in the data integration and data management domains; we have 9,500+ happy customers; we are leaders in five of the Magic Quadrants, as shown in the last slide; and most of the Fortune 100 companies are our customers. So if you have any data processing or data integration use cases, please feel free to contact us. Thanks a lot for your time and for attending this talk. Please provide your feedback, and don't forget to rate and review this session. Thank you.

2021-01-20
