AWS re:Invent 2020: How to use machine learning & the Data Cloud for advanced analytics (Snowflake)

Hello, I hope you are having a good virtual conference. My name is Ashish Singh and I lead the data analytics architecture team at Caterpillar. I'm going to talk about how you can use cloud-native machine learning services together with data cloud technologies for advanced analytics, which is a different approach than has traditionally been used.

My agenda today is divided into three sections. First, I'll talk about why we need a data cloud for advanced analytics and the compelling reasons behind it. Then I'll propose an architecture pattern for using the data cloud for advanced analytics and go deeper into a couple of topics within it: data modeling and partitioning, and how you use native external functions and built-in visualization as key capabilities. Lastly, I'll cover how you integrate the data cloud with cloud machine learning technologies: specifically, how you run ML models across the variety of runtimes available in the cloud, how you scale to potentially thousands of models running every minute, and how you do integrated monitoring between the data cloud and the native machine learning services.

Before I dive in, a quick background on my company and myself. Caterpillar, as you likely know, is a well-known heavy manufacturing company. We have been in business since 1925, more than 95 years. We make more than 200 finished products, much of it heavy machinery for construction, mining, and other industries. We had upwards of $50 billion in revenue last year, and Caterpillar is one of the key companies you would typically watch in the Dow Jones Industrial Average, for example. About me: as I mentioned, I lead the analytics and architecture group within Cat Digital. I have been heavily involved in building the next-generation digital platform for Caterpillar, and a key part of that is the analytics capability I'm going to talk about today. I spent about 18 years in commerce before I joined Caterpillar, and I have been with Caterpillar in various roles for about six years now.

Caterpillar's digital vision is similar to that of other manufacturing companies building digital capabilities: it helps us provide insight into our products, and that insight comes from the large volume of data streaming from the telematics hardware devices in our products. There are several improvements we can provide to our customers: productivity, which gives them better fleet management and fleet utilization of the machines they own; profitability, which lets us sell more parts and services based on leads derived from the telematics data; and safety improvements, which come through alerts and predictions about failures of parts in the machines. Specifically, Caterpillar has a strategic goal of doubling services revenue from $14 billion to $28 billion over ten years, starting in 2016, and the digital platform we are building is a key enabler of achieving that strategy for Caterpillar. With that, let's jump right into the material.
If you take a step back and look at the continuum of use cases in the analytics space, from traditional BI reporting and dashboards to modern and emerging machine learning and deep learning use cases, companies typically use a variety of data persistence technologies, from traditional relational databases to specialized time-series databases, and they tend to copy the same data over and over again for various use cases. The point I'm trying to make in this presentation is that you can simplify that architecture tremendously by replacing that variety of technologies with just two: an object store like S3, and a data cloud like Snowflake. If you use a combination of these two technologies in the way I'm going to expand on, you can avoid most of the other technologies in use today. And as data cloud technologies advance and mature, there is even further simplification possible in the future, which may require less and less use of the object store.

The key challenge with a variety of database technologies is that you end up duplicating data across them, which makes it very difficult to keep the data consistent, and you end up with multiple data models, not to mention the additional cost and time it takes to build and maintain all of them. The other problem is that you have different types of data: some structured, some semi-structured or unstructured coming from your telematics sources, which makes it very difficult to ingest and consume data in a consistent manner. Data scientists end up with different consumption patterns, so even simple things like coverage analysis of the data for a fleet of machines become a challenge, because the data is distributed among different database technologies. Creating a simple data science feature, such as a derived channel value, becomes very difficult to reuse across models because the data is fragmented across different databases.

So what I'm going to propose is an architecture that lets you build your machine learning models using only the two persistence technologies I mentioned: an object store and a data cloud. On the left-hand side of this picture you have different sources that come through different gateways: there may be a gateway specific to telematics data and another gateway for your internal application data, but all of this data lands in an object store like S3 in some canonical form, for example Parquet. You then use the native connectivity from the object store to the data cloud; Snowflake, for example, provides a mechanism to ingest that data natively through Snowpipe, and you persist it in the data cloud. It is very important, when you persist the data in the data cloud, that you have a well-thought-out data model that accommodates both structured and semi-structured data inside the same persistent store, and that you define a good partitioning mechanism based on the consumption patterns you have identified for your BI use cases as well as your machine learning use cases. Once you have persisted the data in the right data model and partitioned it correctly, you can start leveraging the native functions of the data cloud to do much of the machine learning work that would otherwise require third-party tools.
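To make the ingestion step concrete, here is a minimal sketch of the S3-to-data-cloud hop, assuming the Snowflake Python connector and purely hypothetical object names (stage, table, pipe, bucket, storage integration); the talk does not prescribe these exact statements. In a real setup the S3 bucket's event notifications also have to be pointed at the pipe's notification channel for auto-ingest to fire.

```python
# Minimal sketch of the S3 -> Snowpipe ingestion step, run through the
# Snowflake Python connector. All names (stage, table, pipe, bucket, storage
# integration) are hypothetical examples.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="ingest_svc", password="...",
    warehouse="INGEST_WH", database="TELEMATICS", schema="CURATED",
)
cur = conn.cursor()

ddl_statements = [
    # External stage over the canonical Parquet landing area in S3.
    """
    CREATE STAGE IF NOT EXISTS telemetry_stage
      URL = 's3://my-telematics-bucket/canonical/'
      STORAGE_INTEGRATION = my_s3_integration
      FILE_FORMAT = (TYPE = PARQUET)
    """,
    # Blended landing table: structured columns plus a VARIANT channel payload.
    """
    CREATE TABLE IF NOT EXISTS telemetry (
      asset_id   STRING,
      reading_ts TIMESTAMP_NTZ,
      period     NUMBER,
      channels   VARIANT
    )
    """,
    # Snowpipe with auto-ingest: new files in the stage are copied continuously.
    """
    CREATE PIPE IF NOT EXISTS telemetry_pipe AUTO_INGEST = TRUE AS
      COPY INTO telemetry (asset_id, reading_ts, period, channels)
      FROM (
        SELECT $1:asset_id::STRING,
               $1:reading_ts::TIMESTAMP_NTZ,
               $1:period::NUMBER,
               $1:channels
        FROM @telemetry_stage
      )
    """,
]

for stmt in ddl_statements:
    cur.execute(stmt)

cur.close()
conn.close()
```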
For example, you can use the external functions available in the data cloud to create the reusable data science features I mentioned, such as derived channels: you store the feature as an external function within the data cloud and then use it across different analytic models. For simple data science models, such as rule-based or algorithm-based models, you can again use an external function to run and execute the model in production, because an external function can accommodate simple rules like that in a Python function; I'll talk more about that in an upcoming slide. This eliminates the need for a separate tool just for reusability.

Next, you have a CDC stream, again provided natively by your data cloud service, that you can use to identify incremental data as it arrives and schedule it for the more sophisticated machine learning work. For that you would use a function orchestrator, for example Step Functions in AWS, which gives you different paths for different types of model. If you have an algorithmic model that cannot run as an external function, you can use a serverless execution environment like AWS Lambda to run it. If you are running a model against a very large historical data set, you can use PySpark on something like an EMR cluster. And if you are running against incremental data and want a more real-time machine learning model, you can use a native cloud ML service like SageMaker. So you have a variety of runtime engines available for your machine learning models. The output from all of these then goes to an object store like S3, and using the same mechanism as before, the native connector from the object store to the data cloud such as Snowpipe, you ingest it back into the data cloud. You then use the data cloud's native visualization capability to look at the results, so you have avoided adding yet another external tool for visualization.

Last but not least, on the top left of the diagram you see data sharing. Once the data is in the object store and the data cloud, because these are cloud-native and cloud-friendly services, you can use the same persistent store to share data across different business units within your organization, or with external parties, without having to copy the data. Hopefully this gives you a sense of how much the architecture can be simplified.
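As one possible shape for the "CDC stream feeds an orchestrator" hand-off just described, here is a hedged sketch: a small poller that checks a Snowflake stream for incremental rows and starts a Step Functions execution per batch, leaving the routing to Lambda, EMR/PySpark, or SageMaker to the state machine. The stream, table, state machine ARN, and batching scheme are hypothetical; the talk only names the pattern, not this exact wiring.

```python
# Hedged sketch: poll a Snowflake stream for incremental rows and hand each
# batch to a Step Functions state machine that routes the work to Lambda,
# EMR/PySpark, or SageMaker depending on model type. Names/ARNs are examples.
import json
import boto3
import snowflake.connector

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:ml-model-router"

def handler(event, context):
    conn = snowflake.connector.connect(
        account="my_account", user="ml_svc", password="...",
        warehouse="ML_WH", database="TELEMATICS", schema="CURATED",
    )
    cur = conn.cursor()
    try:
        cur.execute("CREATE STREAM IF NOT EXISTS telemetry_stream ON TABLE telemetry")
        # A SELECT shows rows added since the stream's offset; a real pipeline
        # would also consume the stream in a DML statement to advance it.
        cur.execute(
            "SELECT asset_id, COUNT(*) AS new_rows "
            "FROM telemetry_stream GROUP BY asset_id"
        )
        batches = cur.fetchall()
    finally:
        cur.close()
        conn.close()

    # One execution per asset batch; the state machine chooses the runtime.
    for asset_id, new_rows in batches:
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"asset_id": asset_id, "new_rows": new_rows}),
        )
    return {"batches_started": len(batches)}
```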
Now let me dive into some specifics of the data cloud capabilities. First, the data modeling simplification I mentioned. Typically, in the past, you would keep structured data in a relational database and time-series data in a time-series database or in a file system somewhere, so you end up storing data in different databases with different models. We have typically used the tall format shown on the right-hand side of the slide, where each data point, such as a channel value for a given second, is an individual row. The idea here is to move away from that and use a wide format in a columnar database like Snowflake, which lets you blend structured and semi-structured data together in one data model, using the VARIANT column type as the blending mechanism. Your structured data, the asset ID, date-time, and period columns on the left-hand side in this example, stays in traditional structured columns, and all of the channel information, which is a variable number of values per data point, is stored in a VARIANT column. Now you get the best of both worlds: structured and semi-structured data in a single data model that you can query with one SQL statement. It also reduces storage by a factor of 100 or more: if you have 100 channels, for example, you store one row for all channels per second instead of one row per channel per second. The typical query performance improvement I have seen is a factor of 10, and that is the lower end; I have seen improvements upwards of 50x.

Next, partitioning. Once you have modeled and processed the data in that blended structured and semi-structured model, one very important aspect of making performance consistent across all use cases, from BI workloads to machine learning workloads, is to understand the consumption patterns of those use cases. Based on those consumption patterns, you identify the key columns and attributes that become the partitioning, or clustering, key for a columnar database like Snowflake. Getting this right can be the difference between minutes and seconds of query time. In my experience, for example, we had a table upwards of 100 billion rows, partitioned into roughly six million partitions because we identified the key attributes that partition it at that granular level, and we got one-to-two-second query times across the board, whether it was a BI dashboard in Tableau or a SageMaker workload running against the same data set.
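Here is a minimal sketch of what that blended, clustered table and a query over it might look like, again run through the Python connector. The table, column, and channel names are illustrative, and the clustering key is just one example of "cluster by the attributes your consumption patterns filter on", not a prescription from the talk.

```python
# Hedged sketch of the wide, blended data model and a single query that spans
# its structured and semi-structured parts. Names are illustrative only.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="analytics_svc", password="...",
    warehouse="QUERY_WH", database="TELEMATICS", schema="CURATED",
)
cur = conn.cursor()

# One row per asset per second; all channels packed into one VARIANT column.
cur.execute("""
    CREATE TABLE IF NOT EXISTS telemetry (
      asset_id   STRING,
      reading_ts TIMESTAMP_NTZ,
      period     NUMBER,
      channels   VARIANT
    )
    CLUSTER BY (asset_id, TO_DATE(reading_ts))
""")

# Structured predicates prune partitions; VARIANT paths pull channel values.
cur.execute("""
    SELECT asset_id,
           DATE_TRUNC('hour', reading_ts)      AS hour,
           AVG(channels:engine_rpm::FLOAT)     AS avg_rpm,
           MAX(channels:coolant_temp_c::FLOAT) AS max_coolant_temp_c
    FROM telemetry
    WHERE asset_id = 'CAT0001'
      AND reading_ts >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```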
I mentioned external functions in the reference architecture; here is a bit of a deep dive. External functions have been around in relational databases for a long time, but I don't think they have been used to their full potential. They work very similarly in concept, but quite differently in actual implementation, in a data cloud. External functions allow you to link a function to individual data rows in a table within the data cloud. The way it works is that when you call an external function inside a SELECT statement, the data cloud invokes it through the native cloud API: if you are running on AWS, for example, it invokes an AWS API Gateway endpoint, and that gateway in turn invokes a serverless function such as Lambda to execute your Python function externally. The linkage to invoke that Lambda function, and its scaling up and down, is all managed by the data cloud; you really don't have to do anything except register the external function in the data cloud, which then works natively with the serverless functions of whichever cloud it is hosted on. Going forward, you will also see the data cloud vendors working even more closely with providers like AWS to integrate with the native machine learning services, and as that happens, this SQL-based invocation of models will become more and more powerful.

The last data cloud feature I'll touch on is built-in visualization. Typically, to visualize the data they are exploring, or the output they are analyzing from their models, data scientists have either used Python notebooks or a visualization tool like Tableau. Neither solution is really efficient: in a notebook environment they have to keep writing code just to look at the data, and with Tableau you have to maintain another specialized environment and carry the additional cost of it. With native visualization capability now built into the data cloud vendors, such as Snowsight in Snowflake, you can write simple SQL and visualize the data natively within the data cloud ecosystem. Your input data is already persisted in the data cloud, and the model output, as I described in the reference architecture, goes back into the data cloud, so both input and output are available right there on that platform, and all you have to do is run SQL to visualize them. You can take it a step further and define dynamic parameters and build reusable dashboards and templates that data scientists can share across the company. This is very powerful if you think about it, because it eliminates a lot of the time and coding that has typically gone into visualization.

Now I'll move into the last section of my presentation: how the data cloud services and the native ML services blend together. Why do these two kinds of services work so well together? Because they are complementary pieces of a powerful analytics platform for data scientists: the data cloud provides a rich, powerful mechanism to persist both structured and semi-structured data in a consolidated data model, and the cloud ML services, such as SageMaker, provide a rich variety of compute options to run against that data. Marry them together and you have the best of both worlds, the data store and the compute, and it simplifies how data scientists build models using a blend of these two technologies.
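To make the external function mechanism described above concrete, here is a hedged sketch in two parts: the Lambda handler that the data cloud reaches through API Gateway, and the one-time registration in Snowflake. The integration name, role ARN, Gateway URL, and the derived-channel formula are all hypothetical; only the batched rows-in/rows-out request shape follows Snowflake's documented external function format.

```python
# Hedged sketch of an external function used as a reusable feature
# ("derived channel"). Part 1 is the Lambda handler Snowflake calls via
# API Gateway; part 2 registers the function in Snowflake. Names, ARNs, and
# the derived-channel formula are hypothetical.
import json
import snowflake.connector

# --- Part 1: Lambda behind API Gateway --------------------------------------
# Snowflake batches rows as {"data": [[row_number, arg1, ...], ...]} and
# expects {"data": [[row_number, result], ...]} back, in the same order.
def lambda_handler(event, context):
    rows = json.loads(event["body"])["data"]
    out = []
    for row_number, channels in rows:
        rpm = float(channels.get("engine_rpm", 0.0))
        load = float(channels.get("engine_load_pct", 0.0))
        # Hypothetical derived channel: a simple load-weighted RPM feature.
        derived = {"load_weighted_rpm": rpm * load / 100.0}
        out.append([row_number, derived])
    return {"statusCode": 200, "body": json.dumps({"data": out})}

# --- Part 2: one-time registration in the data cloud ------------------------
REGISTRATION_SQL = [
    """
    CREATE OR REPLACE API INTEGRATION ml_features_api
      API_PROVIDER = aws_api_gateway
      API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-extfunc'
      API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
      ENABLED = TRUE
    """,
    """
    CREATE OR REPLACE EXTERNAL FUNCTION derive_channels(channels VARIANT)
      RETURNS VARIANT
      API_INTEGRATION = ml_features_api
      AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/derive'
    """,
]

def register():
    conn = snowflake.connector.connect(
        account="my_account", user="admin_svc", password="...",
        warehouse="ADMIN_WH", database="TELEMATICS", schema="CURATED",
    )
    cur = conn.cursor()
    for stmt in REGISTRATION_SQL:
        cur.execute(stmt)
    # The feature is now reusable from any model or dashboard, e.g.:
    #   SELECT asset_id, derive_channels(channels) FROM telemetry LIMIT 10;
    cur.close()
    conn.close()
```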
Here is another quick reference architecture showing how these services typically work together. In this picture, numbers one and two are the typical integration with your enterprise authentication platform: a SAML- or OAuth-based federated login that you have at the corporate level and have integrated with, say, AWS. Once that federated login is in place, you have IAM roles within the AWS environment working with it, so you use that as the entry point, and it gives you native authentication to object stores like S3 with nothing additional to do. After that is set up, you use those same IAM roles to get access to a key vault, a native key vault in the AWS cloud, to store the credentials for your data cloud vendor. Those credentials are the link between data cloud authentication and native cloud authentication: you store application-specific passwords that have specific roles associated with them at the data cloud vendor, and those roles give your machine learning models access to the data through that key vault mechanism. The data cloud vendor of course also has a native console, which again uses your federated login and provides native access just as before, so nothing changes there. With this kind of mechanism you can create a seamless experience for your data scientists to jump between native cloud services and data cloud services.

When you run these ML models, there are a few learnings from my personal experience that I'll highlight here, in case they help you think about your own environment. First, use the native connectors provided by your data cloud vendor as much as possible. Snowflake, for example, provides a Python connector, so if you are writing your machine learning algorithms in Python, use the Python connector rather than general-purpose ODBC or JDBC connectors; there is a huge performance difference between the two. Second, create reusable Python functions within your ecosystem that insulate your data scientists from SQL, because they are typically very good in Python but not necessarily SQL savvy; reusable functions they can share across machine learning models help a lot. Third, from a network perspective, use the private connectivity, PrivateLink, between the VPC in your corporate account and the SaaS account your data cloud vendor provides; for example, you can create a PrivateLink connection between your AWS account and your Snowflake account, and that again makes a huge difference in performance. The last tip: there are some constraints built into the native ML services, for example SageMaker expects the input file to be a CSV file in an S3 bucket. You can easily get around that by giving SageMaker a dummy CSV file to satisfy the requirement, and then, in the preparation portion of your machine learning model, connecting to the data cloud service and querying the data that way.
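Tying the connector tip, the "insulate data scientists from SQL" tip, and the key vault pattern together, here is a hedged sketch of a small reusable helper: it fetches data cloud credentials from a cloud key vault (AWS Secrets Manager is assumed here purely as an example of such a vault) and exposes a query-to-DataFrame function so model code never handles connections or credentials directly. The secret name, its fields, and the helper's interface are hypothetical.

```python
# Hedged sketch: a reusable helper that hides credential handling and SQL
# plumbing from data scientists. Assumes AWS Secrets Manager as the key vault
# and the Snowflake Python connector; all names are hypothetical.
import json
import boto3
import snowflake.connector

def _get_data_cloud_secret(secret_id="prod/data-cloud/ml-reader"):
    """Fetch data cloud credentials stored in the cloud key vault."""
    sm = boto3.client("secretsmanager")
    return json.loads(sm.get_secret_value(SecretId=secret_id)["SecretString"])

def query_dataframe(sql, params=None):
    """Run a query and return a pandas DataFrame.

    Data scientists call this from their models instead of managing
    connections, credentials, or cursors themselves.
    """
    secret = _get_data_cloud_secret()
    conn = snowflake.connector.connect(
        account=secret["account"],
        user=secret["user"],
        password=secret["password"],
        role=secret.get("role", "ML_READER"),
        warehouse=secret.get("warehouse", "ML_WH"),
    )
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetch_pandas_all()
    finally:
        conn.close()

# Example use inside a model's data-preparation step:
# df = query_dataframe(
#     "SELECT reading_ts, channels:engine_rpm::FLOAT AS rpm "
#     "FROM TELEMATICS.CURATED.telemetry WHERE asset_id = %s",
#     ("CAT0001",),
# )
```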
How do you scale these ML models? The approach we found works best is to put a queue in front of them. The reason this matters is that a native ML service like SageMaker has its own framework that controls how machine learning models are scaled up and down, and that may or may not work for your particular environment. If it does, great, you don't need a queuing mechanism. If it doesn't, you can put a queue in front and run your own model controller that decides how jobs are submitted to the native ML service. For large data sets that you run as batch programs, you typically don't need a queue, because a few seconds here or there is not important; high throughput is what counts. But if you have a more real-time machine learning model that has to return a result within a second or two, you will most likely need a queuing mechanism to manage the workload yourself. In my personal experience with the reference architecture I've shared, you can very well get response times of a few seconds: most of our models return in under a second, and a few of the more sophisticated ones take four or five seconds. For batch data sets, which in some cases process hundreds of gigabytes and even terabytes of data, our jobs typically run about 20 to 30 minutes at most.

How do you manage this environment? It is actually fairly straightforward. You already have native capabilities built into your AWS services, such as CloudWatch logs and metrics, so you use that framework and extend it with custom metrics. Those custom metrics use Lambda functions, for example, to connect to the data cloud service, such as Snowflake, pull metrics out of it, and publish them to CloudWatch. CloudWatch then lets you create dashboards that blend native metrics and custom metrics, which again gives a seamless experience for your operations team. With that concept of custom metrics, we have been able to build different types of dashboards in our experience. Some are DevOps dashboards geared toward system support; those mostly come from the native metrics available from AWS services and let you monitor the various compute resources and so on. To provide a machine-learning operations dashboard for your data scientists, you want to define a structured logging mechanism and prescribe how they instrument their machine learning models, so that the prescribed logging can be captured and turned into dashboards seamlessly using the same framework. As you see in this example, you can create very specific, very custom metrics and dashboards for your data scientists so they understand how their models are actually performing in production: whether accuracy is drifting, which are the ten worst-performing models, and whatever other quantified view of model quality they want. You can then link those metrics to the native ML services, so they can not only see that their models are performing poorly but also drill down into why. This gives you, as I mentioned before, a blend of the data cloud and the ML services.
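Here is a hedged sketch of that custom-metric bridge: a scheduled Lambda that queries a model-run log table in the data cloud and publishes the values as CloudWatch custom metrics, where they can sit next to the native metrics on a dashboard. The metrics table, namespace, and dimension names are hypothetical, not something prescribed in the talk.

```python
# Hedged sketch: scheduled Lambda that bridges data cloud metrics into
# CloudWatch custom metrics. Table, namespace, and dimensions are examples.
import boto3
import snowflake.connector

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    conn = snowflake.connector.connect(
        account="my_account", user="metrics_svc", password="...",
        warehouse="OPS_WH", database="ML_OPS", schema="MONITORING",
    )
    cur = conn.cursor()
    try:
        # Hypothetical table populated by the prescribed structured logging.
        cur.execute("""
            SELECT model_name, AVG(accuracy) AS avg_accuracy, COUNT(*) AS runs
            FROM model_run_log
            WHERE run_ts >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
            GROUP BY model_name
        """)
        rows = cur.fetchall()
    finally:
        cur.close()
        conn.close()

    metric_data = []
    for model_name, avg_accuracy, runs in rows:
        dimensions = [{"Name": "ModelName", "Value": model_name}]
        metric_data.append({"MetricName": "ModelAccuracy",
                            "Dimensions": dimensions,
                            "Value": float(avg_accuracy), "Unit": "None"})
        metric_data.append({"MetricName": "ModelRuns",
                            "Dimensions": dimensions,
                            "Value": float(runs), "Unit": "Count"})

    # Publish in small batches to stay well within the per-call limit.
    for i in range(0, len(metric_data), 20):
        cloudwatch.put_metric_data(Namespace="MLOps/Models",
                                   MetricData=metric_data[i:i + 20])
```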
That brings me to a summary of everything I've covered. The contention I'm making in this presentation is that a data cloud service like Snowflake can provide a high-performance data store for all your needs, whether it's BI or a very advanced machine learning or deep learning model: one copy of the data in the data cloud and one copy in the object store should be sufficient for all your use cases. You do have to take care to model the data correctly and cluster it correctly to get the performance you are looking for. Then make extensive use of the native data cloud capabilities we talked about, such as auto-ingest linking to a native object store like S3 and native CDC such as streams and tasks in Snowflake, to simplify your overall architecture without adding a lot of third-party tools. And last but not least, design your network connectivity appropriately, using things like PrivateLink between your AWS account and the SaaS account, to make sure you are getting optimal performance. That is my last slide. Thank you very much, and if you have any questions, please feel free to ask them in the chat.
