Decoupling Your Legacy On-Premise Data into Operationalized Lakehouses
Good morning, everybody, and welcome to this session sponsored by Qlik. What we're looking to cover today is how Qlik is helping organizations decouple their legacy on-premise data and bring it into an operationalized lakehouse. My name is Ted, I head up data integration strategy here at Qlik, and I'm joined by my colleague Jordan.

Thanks, Ted. My name is Jordan Martz, and I'm a principal solutions architect for Qlik.

What we're going to cover today comes in two parts. I'm going to go first and give a little bit of an overview of Qlik, how we're helping organizations bring their data into modern platforms today, and look at some of the trends and themes we see in the market and how we're bringing that together into one unified platform for our customers. Then I'm going to hand over to Jordan, who will look in more detail at a specific customer example and then show the technology itself. Jordan, do you want to give a brief summary of what you're going to be showing today?

Yes. We have a customer in the trucking space with an incredible use case around how they've taken their mainframe and optimized that source. As we talk about the different sources and targets we have transformed, how we can integrate with tools like Databricks and Spark-based data lakes is one of the key tenets of our success stories. This success story in particular was unique in how that level of automation got them to success in, I think, about three months in totality, which was incredible, and the overall experience of a platform to support that was another part of the story. So Ted, why don't we get started, and we'll keep going.

Perfect. Just a little bit of housekeeping on our side about what we're showing today, and then I'd like to look in more detail at who Qlik is and how we're helping organizations. Qlik has been around for a long time and has a large number of customers and global partners delivering on our solutions and offerings today. We've been recognized as an industry leader by Gartner for many years, both in the analytics and the data integration space, and we keep driving innovation in our solutions, helping our customers with data and all components around it. You may be aware of Qlik as just being a visualization and analytics tool, but through the acquisition of Attunity last year and many other components in the data integration space, we have brought these pillars of the organization together.

So what is it that we do, and how do we help organizations? It is all about the challenge around data. Organizations struggle to take what we call actionable data and drive insight from it. The different silos of data that organizations have often make it difficult for business decision makers to feel as though they've got the right data to make the decisions they need, and often the analytics platforms are not really part of their day-to-day business. This is where we can embed the right analytics into business decision making.
So how are we helping organizations bring this data together? By bringing three components together: data integration, data analytics, and data literacy, turning that data into business value. And how do we do this? We do it by closing gaps: taking raw data and making it available, freeing the data that is locked within silos. Freeing data is only half the story; we then need to be able to find that data and understand it before gaining insight from it. That last step is what we call data literacy: being able to have a conversation, even an argument, with your data in order to gain insight from it. Data integration, data analytics, and data literacy, all brought together.

The phrase often used for this drive is digital decoupling. Digital decoupling is a great phrase in that it separates your legacy platforms, the systems that run the business today, from the modern platforms that need access to that data. It all starts with a simple building block called CDC, change data capture. Gartner has called us out as a leader in CDC, in independent CDC, because CDC unlocks your SAP systems, your mainframe systems, your Oracle, your SQL Server: the most valuable data you have today. It unlocks that data through real-time capture from those legacy platforms, bringing it into the modern analytics platforms you're building today to drive action and insight on the data you have.

Here we see three main trends in the industry that are driving the adoption of these newer analytics platforms. The first is cloud application development. This has been ongoing for many years, but the point is this: you may be a bank with your main application locked within a mainframe, yet you want to build cloud-based applications, so that you and I and everybody today has a banking application on their mobile phone with access to real-time data. Have I been paid today? Can I pay this bill? The open banking platforms that have been driving digital transformation for many years rely upon having real-time data out of that legacy on-premise platform. We see the same drive with SAP: SAP holds the most valuable data organizations have, but you need to be able to build modern applications on that data in cloud infrastructure.

The second big trend is the rise of the cloud data warehouse, and with it a methodology around failing fast, being more agile, and driving different types of business applications on the data. The traditional on-premise data warehouse just couldn't solve this. Traditional on-premise ETL batch loading was there to solve a single business use case and wasn't agile enough for modern cloud analytics, where people just want to consume the data they need and answer their business questions faster. Again, real-time data out of your operational systems into these modern cloud platforms is critical to drive the value these platforms are being built for today.
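To make the change data capture building block described above concrete, here is a toy Python sketch of the general pattern: each captured change carries an operation and a row image, and replaying those changes keeps a target copy in sync. The event format and field names are invented for illustration and are not Qlik Replicate's actual change-record layout.

```python
# Toy illustration of CDC semantics: replay captured source changes
# against a target copy so it stays in sync in near real time.
target = {}  # key -> row, standing in for a downstream table

change_events = [
    {"op": "INSERT", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "UPDATE", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "DELETE", "key": 1, "row": None},
]

def apply_change(event):
    """Apply one captured change to the target copy."""
    if event["op"] in ("INSERT", "UPDATE"):
        target[event["key"]] = event["row"]   # upsert the latest row image
    elif event["op"] == "DELETE":
        target.pop(event["key"], None)        # remove the deleted row

for event in change_events:
    apply_change(event)

print(target)  # {} -- the row was inserted, updated, then deleted
```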
Ted, do you mind if I hop in here and add some key components you often hear about with data warehouse modernization? There are components of streaming associated with terms like microservices, and also continuous integration and continuous delivery. As the schemas and the sources change, or as the code of the system changes, automation is incredibly valuable for continually integrating and deploying that, and that's a key consideration here. I'll pass it back to you, Ted, but I think that is an incredibly powerful component, and it only enhances partnerships between vendors like Qlik and Databricks with their lakehouse technology.

Exactly. And this is where just delivering the data isn't enough: you need to deliver the data in a usable format, based on the platforms people are building today. Traditional ETL tools often just batched up and landed data, but it wasn't usable for the people building on it, and that is even more true in what we call these next-generation data lakes, or data lakes as a service. All three of these environments are driven by the adoption of cloud and by the scalability the cloud can give organizations today. These modern cloud-based data lake environments traditionally have a file-based system underneath, and if you just keep landing file after file you can quickly get a swamp. Bringing the data in, making it real-time and then usable to the business, whatever the use case may be, is where we're seeing a lot of innovation today.

We also see these three main use cases blending. What I mean by that is: what is your most valuable data within the organization? That mainframe data, that SAP data, that ERP or CRM data is critical, and you don't want to build just another silo of data for another use case, created to answer a single question. You want to bring that data to answer multiple use cases, data as a service across platforms. Bringing together these three main trends, these three main use cases, is where we see organizations working today, because they don't want to create another silo for an individual use case; they know what the data is that they want.

That is exactly where Qlik Data Integration comes in: bringing the data using CDC, change data capture, capturing the data out of Oracle and SQL Server and DB2 and mainframes and SAP, unlocking it and bringing it into the use case. We stream data in real time out of operational systems and then make it usable based on the platforms being built today: streaming, data warehousing, or data lakes, and we're seeing the combination of these on multiple clouds, or any cloud. Above that, you find the data and use it with Qlik Catalog before gaining insight with analytics across it. This is the platform, and this is how we're helping organizations. It truly is in the integration, working with partners like Databricks that own the infrastructure and the platform itself, where we can stream that data and land it today.
So the combination of these three pillars really is where our customers go to the next level: bringing together data integration, data analytics, and data literacy, being able to have that conversation, that argument, with the data, and embedding the data into the business processes organizations run today. Freeing, finding, understanding, and taking action upon their data: this really is the uniqueness Qlik has to offer. And that is the bottom line of the nice quote from our work with IDC: the organizations that get the benefit of bringing this together, the integration of streaming real-time data with the power of Qlik analytics, are the ones that will drive the most value in their organizations today.

So that was an overview of Qlik, how we're helping, and the trends we see as organizations choose best-of-breed products to help them on their journey to cloud and to getting value out of Spark and Databricks today. With that, I'd like to pass over to Jordan to show you how this is being used in real life: a deep dive into a shared customer of ours, and then a look at the products, so you can actually see, feel, and understand the benefits we're helping organizations achieve today. Jordan, over to you.

Thanks, Ted. When you think about transformations, J.B. Hunt was due for one. They're one of the largest trucking partners on the entire North American continent. When you think about how product gets shipped and managed, there are real-time requirements around the shipping location and locale, the partners, and the products that need to be sourced from different locations and brought together across the supply line, as well as around maintaining the maintenance of that system. That system is not only logging individual trucks and how to maintain them, but also infrastructure across rail, air, and the trucks themselves, in an ecosystem that spans thousands of kilometers.

When you look at the overall ecosystem of J.B. Hunt, there's a focus on those operational concerns, so they brought in three partners: J.B. Hunt partnered with Qlik, with Databricks, and with Microsoft in this use case. They had a number of different databases; one specifically was the legacy platform of their mainframe, which has been running their business for a very long time. Those operational systems usually had a nightly operation that would then drive activity, and that nightly reload cycle became part of the process of accumulating a standard, legacy data warehouse environment. The operational intelligence that came from this scenario was that they ended up with an insights team focused on prediction, not on insight and learning from actions in real time from the data. There are aspects of real-time infrastructure that are enabled through a tool like Qlik Replicate, but when you're supporting this kind of bolted-on reporting, some of the time-intensive tasks have requirements, and the results may not always be accurate or relevant. The restrictions of the technologies were based around compute time, often the sourcing of the data,
and often manual changes that then changed the behavior of the overall system. What happened was that leadership came to a realization: now that the tools are available, there are six key components along the supply line to consider in the overall vision. For one, a lot of EDI created the visibility, and the lack of visibility, that needed to be addressed. Sourcing that from the mainframe, gaining that insight in real time, and applying data science immediately started a value chain that grew the organizational needs and requirements over time. As you look at the insights for each of the assets, whether shipment requirements or location requirements, telemetry along that supply chain became not only data-science enabled but also integrated into applications. Now analytics are becoming the actionable changes they're using to build their business for the 21st century and beyond. J.B. Hunt has been around for decades, and as businesses change for the many different suppliers shipping components across the world, supply chains are changing too; even in the challenging year we've had, supply chains have had to continue. What matters is the manageability not just of the physical supply line but of the data supply line, and that's where the four components you're going to see today become very relevant to their overall experience.

When you think about ingestion, flexibility in the types of data that come from the different sources is paramount. The repositories they write into need to be cost effective even if those systems change. Then there are the streaming requirements, as Ted noted, and the overall warehouse, the data lake, the lakehouse: all the changes that are part of the cloud transformation, warehouse transformation, and data lake transformation. That's where automation is required for managing the loads, as you monitor for those changes and monitor the overall system.

Let's go to an architectural diagram and talk about what moved and how it operated. There were certain structures for where the data about shipped materials lived: cloud-hosted applications on SQL Server, on-prem SQL Servers, and the mainframe. Replicate unlocked them, supporting that data science ecosystem, and partnered with Databricks through structured streaming, one of the core requirements for building a real-time, continuously updated scenario. That combined resources from Synapse and Databricks to serve models they can consume as applications, and to serve their analytics team as well. The security requirements they had on the mainframe had to be replicated, and that part of the system was enabled through Microsoft's cloud components. Databricks then enabled the transition to streaming data and gave them the scoring they needed to change their overall pattern and evolve.

This is also where the fundamental value of decoupling comes in, because the DB2, the mainframe, the production systems are locked in. They may have their own life cycle; there may be a two-, five-, or ten-year plan for migrating or moving on from those platforms. But the value of that data in the modern platforms is critical, and understanding the why, what people are doing and building with those applications, really does highlight what that architectural slide shows: the value of bringing real-time data out of production and building up an ecosystem around Databricks to prove value.
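As a rough sketch of the structured streaming pattern in that architecture, and not J.B. Hunt's actual pipeline, a Spark Structured Streaming job could pick up change files that Replicate lands in ADLS and append them to a bronze Delta table along these lines. The paths, schema, and table names are invented for illustration, and the sketch assumes a Databricks notebook where `spark` is already defined.

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

# Hypothetical layout for landed change files; a real feed will differ.
change_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("status", StringType()),
    StructField("operation", StringType()),   # e.g. INSERT / UPDATE / DELETE
    StructField("change_ts", TimestampType()),
])

# Read files as they arrive in the (invented) ADLS landing path.
changes = (
    spark.readStream                           # `spark` provided by the notebook
         .format("parquet")
         .schema(change_schema)
         .load("abfss://landing@example.dfs.core.windows.net/orders_changes/")
)

# Append the raw changes to a bronze Delta table for downstream merging.
query = (
    changes.writeStream
           .format("delta")
           .option("checkpointLocation", "/tmp/checkpoints/orders_bronze")
           .outputMode("append")
           .toTable("perf_bronze.orders_changes")
)
```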
I just love how you framed that, because when we talk about storing, processing, and maintaining the management of that over time, it's not as easy as it looks, and having that automation becomes really paramount. I think you nailed those four points, and that's what's important next as we look at the complexity. When you talk about complexity, the first thing you do when you solve any problem is break it down into smaller pieces, right? Decoupling. The components you highlighted, Ted, were these three: the catalog, the Qlik analytics platform, and the Qlik data integration platform, and all of that works together.

J.B. Hunt was focused on that classic slide you've seen from Google about the amount of infrastructure it takes to put a data lake and machine learning system together. Data collection is a big aspect of it; machine resource management, where Databricks is a big part; and process and automation, which was one of the key tenets Ted talked about, warehouse automation together with system automation. Spark gives you analysis, feature management, and verification, but between the Qlik components, the Databricks components, the Microsoft infrastructure and monitoring, and then getting to machine learning, it took quite a bit to bring all of that together. That's where the Qlik data integration platform comes in, and Ted, I'll definitely ask you to chime in here. When we are loading this overall ecosystem, this is what we call our core pillar slide, covering those key components and servicing consumption for all of the partners Qlik works with, as well as the other partners on the machine learning and data science side.

It really does; this one slide tells the whole story of Qlik data integration. On the left-hand side are the operational systems running the organization, whether that's J.B. Hunt or a company in banking, finance, or insurance; they may be Oracle, SQL Server, DB2, the mainframe, or SAP. We unlock that data and then deliver it based on the use case: we can read once and write many. If you're building out applications and streaming with Kafka, we'll stream the data to those environments. If you're building out modern warehousing on Snowflake or Synapse or Google and the like, we can land the data there and make it usable. On the modern data lake environment, whether that's Databricks, or storage such as S3 or ADLS, we will land that data and make it usable. It's critical then to understand the use case and the drivers of what we're trying to achieve. And to repeat the point: we're not looking to create another single silo of data to answer one line of business; those days are gone. You want the data to be usable and findable, and that's where the catalog sits, because the catalog lets the business find the right data for the use case. What is right for one use case is going to be right for another, so that data can be reused across multiple business lines, and analytics, machine learning, and data science can all consume it. This is the end-to-end flow that really highlights the overall independence of the platform: the data is an asset, and that asset can be consumed by multiple business units within the organization.
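Purely as a conceptual analogue to that read-once, write-many delivery, and not how Qlik Replicate works under the hood (Replicate performs this inside its own engine), a Spark Structured Streaming job could fan one change stream out to both a Delta table and a Kafka topic using foreachBatch. This reuses the hypothetical `changes` stream from the earlier sketch; the broker address, topic, and table names are likewise invented, and the Kafka connector (spark-sql-kafka) is assumed to be available on the cluster.

```python
def deliver(batch_df, batch_id):
    # Land the micro-batch in the lakehouse (Delta) ...
    batch_df.write.format("delta").mode("append") \
            .saveAsTable("perf_bronze.orders_changes")
    # ... and publish the same records to Kafka for streaming applications.
    (batch_df.selectExpr("CAST(order_id AS STRING) AS key",
                         "to_json(struct(*)) AS value")
             .write
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("topic", "orders-changes")
             .save())

(changes.writeStream                 # `changes` comes from the earlier sketch
        .foreachBatch(deliver)       # read the stream once, write to two sinks
        .option("checkpointLocation", "/tmp/checkpoints/orders_fanout")
        .start())
```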
Awesome, Ted. Let's go show them how it works. In this sequence we're going to discuss the change data capture components that Replicate provides: the batch initial load, the change data capture itself, and the handling of DDL changes. When you store this, you want to be able to automatically merge it into Delta, transform that Delta, and create a real-time operational data store. That data store can then be used within a Databricks notebook to build Spark ML and MLflow integrations, which we can then render back through Qlik Sense. What this product does is take the ingestion, whether it's a real-time stream or a bulk load, into the bronze layer of Delta tables, registering the actual files, the tables, and the associated metadata in DBFS, as well as merging into that layer.

In this section we're going to talk about how to create a connection to Databricks; we'll be loading into ADLS and also hooking into the metastore as we configure the operation. Then we'll walk through the Replicate console. As you can see here, we start with a new task, which we'll call Spark Summit EMEA. I'm going to incorporate some of the store-changes functionality to carry the full history of the transactions that have occurred. In that operation we configure the endpoints, such as this SQL Server (we can connect to a bunch of different platforms, whether on-prem or in the cloud), and we then target the Delta endpoint from our Replicate console and connect to it. So we bring over our source and target, and we tune the task both for the full load, optimizing the loading functionality, and for the number of tables we're going to pull, some of them being smaller, so we'll extract those very quickly. I'm also incorporating the store-changes functionality, which is aware of the partitioning requirements on the system as you bring these data sources over. Now we're selecting multiple schemas and multiple sets of data. I can also adjust my columns as I need to: do some math functions, add first name and last name, or put in datetimes and impute some null records as I go.

What I'm going to do now is kick this off to do some bulk loading via Replicate. We're going to be loading about 7.8 million records in this operation, so let's get started. We click Start Processing, which begins a loop over all the tables, 74 tables, loaded into what we call the perf bronze schema inside Delta, and we bulk load all of these over. As you can see on the screen, quite a few of the smaller ones are in the thousands or a couple of hundred records per table, often lookup or dimension tables in the data warehouse, but as we're loading you'll see us start to hit tables in the millions of records, such as the fact and sales information: right here you've got a large sales order table with about 4.8 million, and another on the order header with about 2.2 million. These bulk loads are bringing over a lot of records; you're seeing around 2,000 records a second moving across as we process this operation, and most of the data comes over in a compressed format. Those are the two components of a Replicate task: the integration to the source database, where it works like a native part of that database, and the in-memory task of transferring the data in a compressed and encrypted format. Some of this data, which is almost completely finished now, is mapped to common data types between the two systems, and the batches are handled by generating the type of code native to the target, often a bulk copy command such as bcp. We're almost done with this one, I think it's the sales order header, and it completes. That bulk load lasted all of about five minutes.
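In the demo, the merge of captured changes into Delta is generated and managed automatically by the product. Purely as a rough sketch of the underlying operation, using the open-source Delta Lake Python API with invented table and column names, applying a batch of change records to an operational data store table can look like this:

```python
from delta.tables import DeltaTable

# Latest captured changes (for example, the bronze change table landed earlier).
# Table and column names are hypothetical; assumes at most one change row per
# key in the batch (e.g. only the latest change per order_id).
changes = spark.table("perf_bronze.orders_changes")

ods = DeltaTable.forName(spark, "perf_ods.orders")

(ods.alias("t")
    .merge(changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.operation = 'DELETE'")  # apply captured deletes
    .whenMatchedUpdate(set={"status": "s.status"})          # apply captured updates
    .whenNotMatchedInsert(values={                          # apply captured inserts
        "order_id": "s.order_id",
        "status": "s.status",
    })
    .execute())
```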
Now that we've got all of that over, we'll go into part three, which covers capturing, merging, and managing those real-time changes. As we capture and manage these, what we're going to look at is the deletion of millions of records, which you can see right here; secondarily, we're also going to run a bunch of inserts. I think on one of the tables I put a million, on another 100,000, and in this utility we're going to kick off and manage multiple tables that are listening for inserts, updates, and deletes, the CRUD operations. When it runs these operations, Replicate behaves like a native deployment of the source system. For instance, we're hooking into a SQL Server, so it uses its own inherent ability to understand the SQL engine: not the replication APIs, but a very core integration to the SQL Server engine. Similarly, there's a binary log reader, which is one of the key components for reading the Oracle database, or for certain messages inside a mainframe, and this listening component gives you a really powerful extractor. As we generate the code and extraction components, you're seeing millions of rows of truncation; it has already captured what was in memory, so it has cached the change records to disk, having read them out of the database and put them into the transfer cache of the Replicate engine along with the command criteria it will generate when it loads into the target Delta. What we're looking at now are the optimizations: you can see the cluster is moving along really well, and I can go into the database and run a count and show I've got 4.8 million records in there, or go to the address table and show that we've collected sample data from those changes.

Thanks, everybody, for joining us. We'd love to hear your stories on how you're doing your digital transformation and decoupling your data.