Drive Enterprise Wide Adoption of your Databricks Lakehouse with Data Mesh


For those of you who just joined, this is our webinar, Drive Enterprise-Wide Adoption of Your Databricks Lakehouse with Data Mesh, presented by Ascend.io. I'm looking forward to a great conversation today about what data mesh is, how it is compatible with the Databricks lakehouse, and how it can encourage people to migrate off of their older platforms and paradigms into the lakehouse environment. To do that we've got a great presenter for you as well. My name is Paul Lacy, I'm the head of product marketing here at Ascend, and I will be your host and moderator for today. I'd also like to introduce my co-host, John Osborne. John has a long career in data; I think he describes himself as a recovering chief data officer, is that right, John?

That's right, Paul, absolutely right.

Lots of experience in data mesh as well, John, so if you want to introduce a little bit about your background and what you do here at Ascend.

Sure, thanks Paul, thanks for the intro. I'm always excited to talk about data mesh, and in particular data mesh designs on Databricks are super interesting to me, so I'm happy to spend 40 or 50 minutes talking through this, bring some road rash to bear and some opinions, and talk through how Ascend might be able to help with data mesh too, just a little bit, if we can.

Awesome. For those of you that have questions as we go through the session today, please feel free to put those in the chat or the Q&A; we will be saving some time towards the end to answer any questions that come up. The other thing is that we will be recording this session, and a copy of the recording will be sent out, so you don't have to furiously take notes as we go through. As we were saying before, if you're multitasking, don't worry, you're going to get another shot at going through all this great material, but please do add any questions as we go along.

Great, I love the questions.

Awesome. Well, shall we get started, John?

Sure, let's go. What I wanted to lead with, before we get to the agenda slide, is just a few interesting statistics about data-driven businesses in general. My assumption is, if you're exploring a data mesh, or you've heard about it, or you're researching it, or maybe you're in the middle of an implementation, chances are really good you're trying to help your business go forward with data. The current buzz is all about ChatGPT and AI and ML and large language models and all of that, but if you take a step back, it's all about making businesses data-driven. So here are some interesting statistics. The one on the left is from McKinsey: over the next decade there's going to be a quarter-million-person shortage of data experts, which is pretty crazy in our world; that's a lot of gap between where you want to be and where you're at. And we do a Pulse survey every year, we've done it for years and years, and we know that of the businesses we survey, about 95% are over capacity. That means their backlog is longer than they can possibly handle; they can't get all the work done. Most companies also believe that their data product needs are growing faster than their teams are, so they want to produce three or four times as many data products, but they're only getting a flat budget, or maybe 10 or 20 percent more, to hire the people to actually do the work.
So all of these factors are driving the need not only for strategies that will drive more business value more quickly, but also for technology and automation that allow fewer humans to be involved in producing more things, because this is what's happening in the market. I don't know, Paul, if you see these kinds of things in your marketing research, but I see this on a daily basis talking with potential customers and with people in my network. It's pretty amazing.

Yeah, it's certainly reflected in the most recent version of the Pulse survey that we did, and we're actually going to show a little bit of that data later on, John, which is super exciting, hot off the presses from the survey we conducted this year. But yes, data teams are definitely over capacity, and people want to get more done with less; that's kind of the theme for the year if we think about what's happening in the broader scheme of things. There's also a trend towards having people self-service more, which we've seen in a couple of the surveys that we've run, or that our partners have run recently. So how can we get more people conversant with data, actually using data? And that's something I think we're going to talk about too, right, John, with data mesh.

Definitely. I'm a huge fan of non-engineers helping with data meshes. And real quick, before the agenda, I just wanted to hit this for a second: why is Ascend.io interested in mesh? It's not just because I'm interested in mesh; the Ascend platform lends itself super well to data meshes, in that we are an intelligent data platform that gives you the automation and the end-to-end capability you need in order to produce an effective data mesh in a cost-effective way, within the constraints we just talked about. There aren't enough people, there isn't enough money, so how do you still get your work done? Ascend provides that capability, in particular on a lakehouse with Databricks.

From an agenda standpoint today, I know we went back and forth on this a little bit, Paul, but I'd like to hit four bigger topics. One is that data mesh is a solution that can help you out in large ways. We'll also talk about, if you're going to do a mesh, some considerations, some watch-outs, and some good design practices for your domains, and what to think about if you're moving toward a data mesh. Again, if you're going toward a data mesh, you're coming from somewhere, so we'll talk about the likely scenarios there. We're going to talk about why Databricks is a really great platform to put a mesh on, and how Ascend helps with that. And then, what about all your other data? For most companies, at least the ones I've talked with, Databricks isn't the only database; there are other databases in the organization. Databricks may be the center of the compute world, and it may be the desirable place to put all of the governance and all of those kinds of activities, but what about all this other data that might be out there? Let's talk about that a little bit also. So here's the agenda.

You mean enterprises have legacy technology, John? Is that what you're saying?

Oh yeah. Whether it's an on-prem data center, or you have multiple clouds, or you have another cloud database that shall not be named today, where data exists and you need to move it or somehow reference it in your Databricks environment.
Yeah, let's talk through that, for sure. I think that may be the definition of an enterprise: if you have one Postgres database, you're probably not an enterprise, but as soon as you have Databricks and maybe a dozen other things, like most enterprises do, you probably are. But let's get started.

Real quick, just doing some basic research: whether or not you're familiar with mesh, I highly recommend reading the Martin Fowler article on it. It's basically the origin of the thinking around the term data mesh and how it came to be. Some of it is kind of technical and some of it gets down into the weeds, but Zhamak Dehghani is basically credited with coining the term and then driving a bunch of the strategy for data mesh. If we want to summarize data mesh at a high level, it's really a composition of two things. One is bringing a product mindset to your data. So, Paul, to your earlier point, this is making the data approachable so that it can actually be used and shared, and then consumed into multiple data products from the business standpoint, whether those products are reports, or actual apps, or insights, or whatever is going to consume that data. A product mindset helps us bring that data more readily toward people who maybe aren't well-trained engineers; they may just be business people who know a little bit, so maybe we can use a product mindset to help there. The second concept that builds a mesh is making sure we're organizing the data in terms of data domains. The technical term is a bounded context, but this means that we're going to maintain data in groupings that make sense for our business, and we're going to match up those groupings with how our business operates. Then, as data engineers, we're going to help solve problems like: how do I generate a customer list out of my bounded contexts? There's a customer for a shipment and a customer for, say, a marketing activity, and they came together in different ways, so how do I design a data environment that services all those needs with all those different shapes of data? Of course, the term we're going to use today for that is data mesh. So when we say data mesh, you can think of this maybe woven thing (I'm not going to use the word fabric), this consolidation of all of these different contexts into something I can then use for real value.

Maybe that's a good thing to dwell on for just a moment, if you don't mind, John: the difference between data mesh and data fabric. In a lot of people's minds I think they're interchangeable, but there are some differences, right? When you talk about a data fabric, you're not necessarily talking about a data mesh.

No, you're actually not. The difference between them, and there's tons of research out there, but in 30 seconds or less: a data mesh is constructed from these domains that you're going to engineer to match up with your business. So you might end up with a finance domain for your business that has your finances in it, and you might end up with a customer domain because you have a customer service department. A data fabric is more about putting a layer on top of the data where it resides.
That layer then allows you to query the data, figure out where your data is, and compose data sets that you pull out of your fabric and that become useful to you. So if I go to a data fabric with a question like, hey, I want customer data and I want SKU data for this one SKU, go tell me where this data is and then compose a data set for me, that's more of a fabric-type implementation, where I'm not actively pursuing a consolidation of knowledge into domains; I'm doing it more at a metadata level and then querying it. So, two different concepts, both servicing very similar outcomes from a business standpoint, but implemented in totally different ways. Today we're going to talk about data mesh, though, because it overlays really nicely on top of Databricks, especially with Ascend on top.

So if we talk about where you're coming from: most people looking at a data mesh that I've talked to, across all of the talks I've done, want to go faster. What they mean is: I need to produce more data products for my business, I need to not take six months or a year to produce this report, I need to get these cycle times down, I need it to take fewer people and be less complex. Lots of the people, architects, engineers, and leaders who come to data mesh are coming from a monolithic data architecture. These are architectures, maybe in a private data center or somewhere similar, where you have a monolithic data model and you're burdened by hyper-specialization, either because your tooling is holding you to a single architecture or because your data experts are few and far between. You have two or three people who know how this model works, and nobody else can figure it out. It's got hundreds of tables in it; I've personally had them with 700 tables. In a monolithic architecture it's nearly impossible to have more than a couple of people who understand all of that, and a lot of that is driven from legacy thinking about normalized, everything-for-everyone data models. These are very complex, very hard to build, hard to maintain, and they tend to get very rigid in design because they're fragile: they're easy to break and hard to query, so we don't change them very much. My rate of change starts to slow down over time; every time I add a feature, I slow down a little bit more, because I add more debt onto this monolithic model.

The promise of a data mesh is a little different. Because I'm maintaining the ownership of the data in the area of the business, or with technical resources closely aligned with the business, I can keep this data separate and smaller, assign the ownership to the experts for that particular piece of the business, and not worry about building a whole model across the entire business. Now I have different areas or regions of my data infrastructure where the expertise is consolidated, there's less data in each of those regions, and I can manage the ownership across them. That helps with things like prioritization, it helps with agility, and it can also drive approachable self-service, because I can have data models that are simpler and easier to query, easier to document, and easier to govern.
So the premise of a data mesh is that if I'm able to do the things on the right, I can actually go a little faster, or at least not degrade every time I add something to my model; I can at least maintain a certain level of velocity as I go forward, and hopefully accelerate as I get better at it.

If we double-click a little more on data mesh, it really ends up looking like a consumer-friendly paradigm, consumer as in my data partners, my data consumers, my business friends who want to use this data. The data mesh is going to help us with these problems. It's going to force me to have discoverable and shareable data: if I have a data mesh with multiple data domains, one of the underlying assumptions is that I have to be able to discover where my data is and how to share it. If I have a customer data set and it's unshareable, and you can't see it but you need customer data, I don't have a data mesh and I can't drive self-service. So we need shareability as part of the data asset. The data also needs to be usable and approachable. Those sound really simple, but what it comes down to is that the language we use to define these data models, and the attributes within them, needs to be understandable by business people and by the cohorts that are going to consume this data. Very often on the technical side there's a conversion of business terms into column names that don't really mean much. You can patch that with data catalog solutions or other software and levels of complexity, but I can also choose to make my data model actually business-friendly; that's what we mean by approachability. Of course secure is important, but the other two, trustworthy and standardized, are also critical. Expanding a data mesh within an enterprise requires trustworthiness of the data: if people don't believe the data is correct, or they don't understand where it came from, and they're questioning it and can't get answers, that drives doubt into the data, and they'll be less apt to dive in and start to use it. Standardization is also important, and one of the ways standardization plays out is in how I access the data. If one data set is exposed with a REST interface, for example, and another with a SQL interface, that's not very standard, and it's very difficult for me as a data consumer to join that data together and use it. So having standardization in a platform is key. If you're putting your data mesh in Databricks, it's all queryable; there's a query engine doing either Python or SQL, and it doesn't really matter which, so that provides some level of standardization and I can actually query the data. The next levels of standardization might be in naming conventions and in some of my governance activities: how do I understand how to interpret this data, what are my business keys between my domains, and how do I standardize all that information? That's all governance. But if you pile all this together, now you can start to talk about having an actual data mesh. When you've addressed these problems, either theoretically and you're off to the races building a mesh, or you already have a mesh and you can check these boxes and say yes, I have all of these things, that means you'll have a data mesh, for sure.
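To make the standardization point a bit more concrete, here is a minimal sketch of what uniform access can look like on Databricks, where every domain is simply a governed table reached through the same SQL/Python engine. The catalog, schema, and table names (sales.customers, fulfillment.shipments) are hypothetical placeholders, not names from the webinar.

```python
# A minimal sketch of standardized, cross-domain access on Databricks.
# Assumes two hypothetical domain tables registered in the metastore:
#   sales.customers (customer_id, customer_name, region)
#   fulfillment.shipments (shipment_id, customer_id, shipped_at)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Every domain is reached the same way: a governed table behind one query engine.
customers = spark.table("sales.customers")
shipments = spark.table("fulfillment.shipments")

# A consumer can join across domains without caring how each domain is produced.
recent_shippers = (
    customers.join(shipments, "customer_id")
             .where("shipped_at >= date_sub(current_date(), 30)")
             .select("customer_id", "customer_name", "region")
             .distinct()
)

# The same question, expressed in SQL against the same tables.
recent_shippers_sql = spark.sql("""
    SELECT DISTINCT c.customer_id, c.customer_name, c.region
    FROM sales.customers c
    JOIN fulfillment.shipments s ON s.customer_id = c.customer_id
    WHERE s.shipped_at >= date_sub(current_date(), 30)
""")
```

Whether a consumer prefers the DataFrame API or plain SQL, they hit the same governed tables through the same engine, which is the kind of standardization being described here.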
You know, the other thing I find interesting, John, is that this entire list of attributes could also be applied to any product being created in the marketplace. Imagine going to a supermarket: everything on the shelf has all of these properties, right? It has to be discoverable so that I can find it and buy it, it has to be approachable and usable, you have to trust that if you consume this thing it's going to be good, and it has to be secure and standardized. You can't be getting Froot Loops when you're buying Cheerios.

Yeah, I think that's really true. Lately, in the past few years, data as a product has become a real thing, and just because data is a little ephemeral, hard to see, and not physical, you can't pick it up like a box of cereal, it is still a product, and if we treat it that way we also get the benefits of a product. We get the feedback, we get the improvement loops, we get all of those things that we would get in any sort of product management scenario. So data as a product is one of the key features of a mesh, definitely.

The other thing that will happen with a data mesh is that it's not just going to change the technical aspects of how the data is organized; it's also going to change the organization and how the organization operates. This may or may not be obvious, but in order to get the benefit out of the mesh you can't just have a technical solution; it also has to have a business or human-side solution that comes to bear. If you're in a monolithic architecture, for example, typical data work looks like this: I have a data engineer, they do a bunch of work over long cycles, they quote-unquote finish their requirements, they give it to a data analyst, and the analyst then does a little bit of work, puts out a report, and they're done. This is a lot like organizations that file tickets to get reports written by engineering. It's the opposite of self-service, and it relies on engineers to do lots of things that really aren't engineering; they're more like data analysis. But because the engineers are trained in the tools, in the monolithic architecture, and in this sophisticated data model that very few people understand, they're burdened with actually delivering the data products out of it. We can't let the analysts touch it, because they don't know how to use Git or the workflow process to deploy code and all of that. So we're hamstrung by these architectures and by the engineering required to manage them.

If we put in place a data mesh architecture, and we're doing it with tooling that allows more things to be true, a couple of things become really obvious in the organization. One is that development cycles will be a lot shorter, and the engineers working on those cycles will be more expert in the particular area they're working on than they ever were across the entire monolithic model. You can have an expert in the customer service data model, because it's not very complicated, and you can have an expert in your shipment or fulfillment data models, because those aren't very complicated as they stand alone; but having experts across the entire business is a near impossibility for lots of companies. Focusing engineering effort allows engineers to move more quickly and produce their outcomes faster, but it also allows the data to be picked up from the engineers sooner. Notice the vertical bar here for the engineers is further to the left.
That means the engineers can do less work and the analysts can do more: they can start to do things like actually build their report from the data mesh, instead of having a report built for them that they're just QAing. And then I can introduce even more cohorts of people, and more fingers on keyboards theoretically means more output. Now I can introduce data scientists who want the same data, maybe in a different shape, to do interesting AI or ML work with it, or voice-to-text, or whatever they're doing in their data science activities. So the data mesh architecture should make the org feel dramatically different, even down to the number of people in meetings: the meetings should be smaller, shorter, and simpler, and in general there should be better documentation across the entire enterprise, because the models are simpler and smaller.

Before we go too much further, let's talk about companies who are feeling the need here, and some statistics around what they're planning to do. In this Pulse survey, Paul, the 2023 one, it's very interesting that a ton of people want to implement data mesh architectures; I think there's a lot of traction for it. It's also interesting to me that a smaller percentage, only 20 percent, is already doing it, and almost 60 or 70 percent are going to do it but haven't started yet. Does that make sense to you?

Yeah, it does. It's been a buzzword in the industry for quite some time now, ever since that article came out, which, when was it, John, 2018, 2017, something like that? Through the pandemic is where this really caught fire as a buzzword, and we've seen some pretty dramatic shifts as people really start to lean into it. We do ask this question in a specific way, because there's some crossover: some people use the term data mesh, some people use the term data fabric, which is why we have to clarify sometimes when we're talking about this. One is a technology-first approach, one is an organization-first approach. But lumping those two things together, we do see that this decentralized organization of data is accelerating across the business. In the next slide, John, you can actually look at what it's doing year over year: the number of people who plan to implement data fabric or mesh within the next 12 months has grown by six percent, and the number of people who say they're already using a data mesh or data fabric type approach has grown by ten percent year over year. So I think we're starting to see the wave crest a little bit here, which is fascinating.

It's actually an interesting chart, right? It's grown by ten percent, but that's ten points across the whole cohort; in relative terms it's grown by 100 percent. There are almost double the number of people using a data mesh or data fabric from one year to the next, which is still only 20 percent overall, but that's a pretty big gain. And I think the other interesting one is that almost nobody moves away from it; once they go to a data mesh, they tend to stick with it. It ends up being a better place to be.
I think you grow out of your monolithic world and you need this data mesh or fabric thing, and once you're big enough to go there, you're going to stay there.

Absolutely. We have a couple more slides, though. This one is about whether it will actually help on the consumption side of data. I think one of the main problems that data mesh solves, or the solution people are seeking when they look at data mesh, is: hey, I have all this data, and we keep doing stuff on the engineering side, but nothing happens on the business side. They still want more reports, they're still not using enough data, or we're under-utilizing all the data we have. I think data mesh is one way folks are thinking through getting more value out of the data they already have.

Yeah, and it's interesting if we double-click in here, John; I think the next slide looks at the spread of roles of the people responding. Data executives are really all in on data mesh, with over 50 percent of them and their direct reports saying they strongly agree that data mesh or fabric will enhance the use of data across the organization. You look across the chart and there are a few bars that stick out; maybe some individual contributors can't see the forest for the trees sometimes in some of this stuff. But on the whole, it is a very top-down, executive-level initiative that we see rolling out across these organizations. I think the other thing we can take away from this is that for the individual contributors and team leads, the folks mired in the day-to-day, hope is there; there's a light at the end of the tunnel. You can see that with this level of executive support, we can definitely expect a lot of these programs to start getting traction over the next 12 months. And one thing you were just talking about, John, is that individual contributors' lives will get easier once the mesh is implemented, right? So there's hope, there's light at the end of the tunnel there.

Yeah, I think so. And I think oftentimes the strongly-agree number being low for individual contributors and team leads reflects a little skepticism that the leader is actually going to execute on the model, because it does take some intestinal fortitude to get a mesh across the goal line, not just technology-wise but from the business side: to actually drive that self-service, and to demand self-service from the business. Like, hey, we're building this thing, you need to use it, so you have to do your part, which is to come get trained on how to do a little bit of data work and then actually use this model. I've been there a couple of times, and it's hard work, for sure.

Absolutely. That's great. I love the statistics; we're all data people, so that's fantastic.

As we move on through the discussion, this is a good segue: if we are going to be changing our org, putting a mesh into place, and thinking through what a data mesh does for us and how those things play out on the business, I think it's really important to talk about the old paradigms we may be coming from.
So if you're coming from a monolithic model, or from something that is just really hard to work with and you want to do something different, I think it's really important to understand some of the biases that we bring forward into our models. One that I hear all the time in data mesh conversations is: duplicative data is going to cost me more. I've been trained my entire career that I should have a normalized model that eliminates all the duplicates, so there are no duplicates, I have one table that has all my customers in it, and it's all perfect. I call that full normalization bias. The second one is: I want, or I need, a single source of truth, so I need a single table that shows me the data that is the source of truth. I call this the ignoring-customer-need bias, because different consumers of data have different sources of truth for what actually matters to them and what their business context is. It's important to let those business contexts play out in the models so that they become really valuable, and therefore trustworthy, for our business partners. And then the last one, which is near and dear to my heart: my scheduling of all these jobs is too complicated, so I want fewer schedules and fewer pipelines; I want fewer, bigger ones to manage. I call this imperative design bias: I want one schedule that does everything, instead of a whole bunch of little things that each do the right thing.

I think the counter to these paradigms, as we move to mesh and likely move into the cloud too, or at least change our mindset from a monolithic, private data center mindset to actual cloud usage, is that there are some truths in the cloud we should be aware of and design for. One is that storage is less expensive than compute. Having a bunch of data sitting in the cloud doing nothing is way less expensive than taking a large data set and continually computing the same answer from it every day, wasting credits on compute. So storage is something we should leverage to our benefit, and I think Delta Lake on Databricks really helps with that, especially with the super high compression ratios you can get. It's also true that simpler data models use less compute. If I have a giant normalized model that takes a dozen or two dozen tables and really sophisticated join criteria to produce an outcome, and I instead physicalize those outcomes into simpler data models, that will consume less compute, albeit at the cost of maybe more storage and maybe more orchestration to build it; it will cost me less in the long run. If we take these first two things and draw a general conclusion, it means that more, simpler data models should have a lower total cost of ownership than my old monolithic single model, and this plays directly toward mesh. If you pull the technology away and look at the financial operations side of data meshes, this is playing out in spades: lots of companies save money by consuming a little more storage, because compute is maybe 10x more expensive than storage.
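As an illustration of that storage-for-compute trade, here is a minimal sketch of physicalizing a simpler, denormalized domain table in Delta Lake rather than re-running a many-table join for every consumer. The table and column names are hypothetical, and a real job would be rebuilt on whatever schedule and orchestration you use.

```python
# A minimal sketch: materialize a simple, consumer-friendly domain table once,
# instead of having every report re-run an expensive normalized join.
# Table names (raw.orders, raw.order_lines, raw.customers, sales.order_summary)
# are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE OR REPLACE TABLE sales.order_summary
    USING DELTA
    AS
    SELECT
        o.order_id,
        o.order_date,
        c.customer_name,
        c.region,
        SUM(l.quantity * l.unit_price) AS order_total
    FROM raw.orders o
    JOIN raw.order_lines l ON l.order_id = o.order_id
    JOIN raw.customers   c ON c.customer_id = o.customer_id
    GROUP BY o.order_id, o.order_date, c.customer_name, c.region
""")

# Downstream consumers now pay only for a cheap scan of the small, pre-joined table.
spark.table("sales.order_summary").where("region = 'EMEA'").show()
```

The join runs once when the domain table is rebuilt; every downstream query then pays only for a scan of compressed Delta storage instead of repeating the full join.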
This is all predicated, however, on having automation and determinism that can help you not only manage it but also build it this way. If you're relying on individualized, old-style imperative scheduling, you probably have no hope, so you should not try to do this. But if you're building a mesh with deterministic tools that are highly automated, then you can have your cake and eat it too: you can have smaller models that are less expensive to operate, with the support of automation that allows that to be true. From a financial management standpoint, this is really important to think through.

As we talk again through data domains and mesh, I do like to spend some time on the assumptions that meshes are built on, and the reason I like to talk through the data mesh assumptions is that oftentimes this is where the trouble spots lie. These are the potholes you're going to hit as you build a mesh, so understanding the assumptions, internalizing them, and coming up with mitigations, or at least strategies to make the assumptions true, means the mesh is going to work. If you're assuming something, it needs to be true, and if it's not true, you're going to fail. One of the most important assumptions that mesh makes, in Dehghani's writing and in pretty much any white paper, is that you actually have technology and data domain experts available to you to build a mesh. If I want to build, for example, a customer data model, and nobody in the business really understands how their customer data model works, or maybe they understand it but can't articulate it to an engineer, and the engineer can't understand the business partner, I'm going to have a really difficult time building a data model that hangs together, because my domain expertise is missing. This is an assumption made by the data mesh premise, but it's an assumption that needs to be true for you to be successful. Another one, and I've talked about this a couple of times, but we'll hit it directly: data pipelines and data sharing also need to exist. The technology, and not only the technology but the automation, needs to exist. This plays out because if I'm loading customer data into a customer model, and that customer data is coming from maybe a call center system, or from all kinds of different systems, and I'm accumulating it into a customer data domain, I need to pipeline that data to keep it up to date. And once I have this data, I need to be able to share it. If I'm in marketing, like you, Paul, and I need a list of current customers who have purchased a particular item, I need to be able to share the customer data in a way that includes the information you need from a marketing standpoint, but also in a way that joins it up with purchase history so you can understand what products they've purchased. Companies like Amazon are experts at this in spades; they have this capability built in, not only to pipeline the data in but also to make it shareable and usable across the different domains. So this is an assumption that needs to be true. The third one, which is a bit of a shadow assumption, is that you have a governance strategy that works, or that can be deployed.
That means your organization needs to be able to accept a governance strategy as part of your work, but it also means you have tools and processes in place that allow governance to be true. If you're building something in Databricks and you've heard of Unity Catalog, or you've started using it, this is what Databricks is bringing forward as their central governance tool across data, and I think Unity Catalog will really assist in making a lot of this governance a first-class citizen alongside the rest of the design activities. In my opinion, and in my experience, building a mesh with no governance quickly devolves from a mesh into a mess, so governance is key. And the last assumption, which needs to be true but which I chuckle about all the time, is that every data domain you use, whether it's customer data, shipment data, finance data, or whatever data you have in your business, needs to have some minimum maturity level. I don't know how many companies still do this, but a lot of them do: they might be a billion-dollar company and they manage their finances on a spreadsheet instead of a database or an ERP tool. Spreadsheets are probably not a great place to build a data mesh, so raising the maturity level in the areas you want to pull into a mesh model is going to be part of your pre-work. If you have those scenarios, it's important to acknowledge them and come up with plans and investments to improve the maturity level of the data, so that when you build a mesh it becomes sustainable and a much better product.

Once we understand our assumptions, there are some decisions you have to make when you first start building a mesh. As you get into the mesh and actually want to build one, these are things that need answers; you can't be wishy-washy about them, you actually have to make a decision. One is: if you have technology and data domain knowledge gaps, what's the plan to close the gap? Are you going to hire a consultant to bring some knowledge in? Are you going to sequester a business person for six months? Are you going to meet with them every morning for two hours for a couple of weeks until you understand it? How are you actually going to do this as a technology partner, and how are you going to engage with your business to pull that domain information in? The same for technology gaps: what's the plan, what technology are you going to use, are you going to use what you have, are you going to experiment? That's a very important decision. Another important decision, and this sounds a little off the wall but it's not: are you going to use a standard platform for every single domain? Are you going to consolidate all of your data mesh in, for example, Databricks as a standard platform, or are you going to have each data domain build its own mesh strategy on its own particular platform and then somehow consolidate the results later on, or not? What's the strategy for managing all of this data? A lot of people jump to "well, of course I have to consolidate all my data in one place to do this"; that's not actually true, depending on the technology you have, but you should make it an explicit decision. You shouldn't let it just be an accidental architecture; you should actually plan for it.
Then, I love build-versus-buy discussions: you should decide, am I going to build this mesh on my own, with my own tools and my own open source, am I going to use Databricks or Delta Lake on its own, or what am I doing from a technology solution standpoint? Or am I going to buy a solution that builds a mesh for me, or some combination, of course? This should actually be a decision. Again, accidental architecture is painful and not something you should get into by accident; you should make these decisions deliberately, and therefore make the budgetary investment, the planning, and all the scheduling you're going to need for these projects a real, upfront decision.

Then, whether you decide to build or buy a data mesh on a lakehouse, and lots of people just default to "I'm going to build it," which doesn't necessarily need to be true anymore now that Ascend is here, the two things you need to have when you're building it are data pipelines and data sharing. Having a functional data model for customers is one thing; actually keeping that model up to date, and actually sharing that data in a way that's consumable by other people, are the two problems you must solve in order to have a data mesh. Otherwise, you've taken a monolithic data model, moved it into a different area of your database, and called it a mesh, but it's not actually a data mesh; it's just a refactored data model doing exactly what it did before, presuming you didn't have sharing and pipelines the way you needed them before. So these two things are required. If you're building, plan for them, and if you're buying, we can show you some options there, for sure.

Sorry, when you say data pipelines, in the context of a Databricks deployment, what would that look like? Is that a notebook that I run on a regular basis? How would I actually implement that, and what would the defining characteristics be?

Yeah, so let's say, theoretically, my customer data is in Salesforce, for example. I have Salesforce Cloud, and I want to use Databricks as my data mesh environment because I have some reporting I want to do out of there. Fine. The data pipeline is the software that's going to connect to Salesforce on a regular basis, pull the data down, curate that information, and land it into a domain model for customer. You could write your own Python scripts for sure, you could use notebooks for that; there's any number of ways you could build it yourself in Databricks. You can also buy tools to do it; there's a data connector product we partner with that you could buy. But you can also have a platform that includes all of the other platform services, like Ascend does, from scheduling, to log consolidation, to Unity Catalog integration, doing all of that in an automated way for you, so that you don't have to write each step along the way. So there's a whole spectrum of potential solutions, and you and I are obviously biased toward buying a platform for that stuff, because it's not adding any value for you to write it yourself, but certainly you could use a notebook for some of it too.

Got it, thanks.
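For illustration, here is a minimal, hand-rolled sketch of the kind of notebook pipeline John describes, assuming the Salesforce extract has already landed as files in cloud storage (Ascend, a connector product, or your own API client would handle the actual extraction). The path, table, and column names are hypothetical placeholders.

```python
# A minimal, hand-rolled sketch of a "customer domain" ingestion step.
# Assumes extracted customer records already land as JSON files in cloud storage;
# the path and table names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# 1. Read the latest extract (dropped by an upstream connector or API job).
updates = spark.read.json("/mnt/raw/salesforce/customers/latest/")

# 2. Upsert into the customer domain table so it stays current without full reloads.
target = DeltaTable.forName(spark, "customer.customers")
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Run on a schedule, this covers the "keep the domain model up to date" half of the problem; the sharing half is what governance grants, Delta Sharing, or a platform layered on top would handle.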
Now I'll talk very specifically about Databricks. Databricks brings a bunch of things to bear on a data mesh that are really important for folks who want to do data mesh. One, of course, is that Unity Catalog can be a central point of governance for the data you're going to put in a mesh. This is the place where you're going to stage information for sharing activities outside of Databricks, do some security work, and do some catalog work; you're going to use Unity Catalog for these things. Databricks is making significant investments in Unity Catalog and continually improving it, so even if it doesn't do what you need today, I'm certain they'll be adding quite a few things to it. Databricks also uses cloud-native technologies like Delta Lake, which lends itself very well to data mesh architectures: you can have many, many things in a Delta Lake, it's highly parallel, you can load it easily, it's cheap to store information in, and you get capabilities like cloning that you can use. So it's very nice for doing data meshes at scale, when testing becomes important, and for some of these other topics. Databricks also does a great job of letting you use multiple languages: you can use SQL and Python basically together; you can write SQL, you can write Python. It does this in an elastic way, so you can define how much compute to consume, you can limit it or make it unlimited, and you have all this capability around executing the queries you need to execute. Databricks can also be scripted, and it's secure. These are just a few of the facets, but it brings a lot of these things together into a platform that lets you stop worrying about how your data is actually being stored, or about having some consumers who want to do data science in Python and others who are SQL-savvy; I can service both of them on the same platform with Databricks. So it's a very nice platform for a mesh.
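To make the Unity Catalog governance point a bit more tangible, here is a minimal sketch of the kind of grants you might issue so that each domain's tables are discoverable and shareable while access stays controlled. The catalog, schema, and group names (main, customer, marketing_analysts, customer_domain_engineers) are hypothetical; the GRANT statements are standard Unity Catalog SQL.

```python
# A minimal sketch of Unity Catalog grants for domain tables.
# Catalog, schema, and group names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let an analyst group discover and query the customer domain, read-only.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `marketing_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.customer TO `marketing_analysts`")
spark.sql("GRANT SELECT ON TABLE main.customer.customers TO `marketing_analysts`")

# The domain's owning engineers keep full control of their own schema only.
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA main.customer TO `customer_domain_engineers`")
```

With grants like these, discoverability and sharing are handled in one governed place rather than re-implemented per pipeline.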
When you start to look at Ascend on Databricks, Ascend brings an extra layer of platform services on top of what Databricks is already providing. On the left side here we have bunches of data sources, you name it and it's available, and on the right side you have all these destinations, some of which might be reporting tools, and some could be reverse ETL jobs back to Salesforce, for example. Ascend provides the ingestion and transformation activities and the deterministic scheduling, so you can have your cake and eat it too: you can have many, many data domains, they can all be managed for you, and they can also be shareable and secured in useful ways. We'll talk more about that in a minute or two. So Ascend ends up being this end-to-end intelligent automation platform.

Just to summarize meshing Databricks with Ascend on Databricks: Unity Catalog is automatically integrated with Ascend. Ingestion is gapless, so at-scale ingestion across pretty much any data source is possible. Data sharing is super simple, one or two clicks, and you can share data between an engineer and, say, a data scientist. You have polyglot languages, so you can use SQL and Python in the same pipeline and go back and forth between them; you can do machine learning right in the middle of regular SQL statements. There are integrated quality controls, so you can put in rules for what your data quality should be throughout the entire pipeline, from ingestion quality all the way to final delivery: this report needs to be perfect, here's my quality rule, and I can stop it from going out if I want; Ascend provides that in the platform. Then, of course, observability is critical: how much is my pipeline costing me, and if I'm doing showbacks to business partners, which pipelines are costing what, and how do I fairly divide up the cost of my Databricks compute across all of my consumers, which at enterprise scale are typically business units? And then finally, not finally but the last one on this list anyway, how do we drive agility? How do we get all of this really good stuff in a way that gives me source control integration with CI/CD, plus workflows and controls for compliance, audit, and other enterprise needs, when I have multiple developers working on the same platform, so I know it's all going on properly? Driving agility through automation in the workflows is critical. Ascend provides all of this additional functionality while still pushing all the work down to Databricks.

The last piece I'll talk about here with Ascend on Databricks is that we also support multiple instances of Databricks. If you have multiple cloud instances of Databricks, you can share data through our live sharing across those instances. And if you have data in other databases, maybe you have a large marketing database in GCP that you need to clean up and share into your Databricks environment, but you also have finance in Snowflake and you need that data in there too, Ascend provides live data sharing across all of these environments. It not only moves the data on a regular basis in a controlled way, but also notifies you of schema changes and other activity that may be happening in different platforms that Databricks can't see. Ascend provides this bridge across the different platforms, so you can share data across them without having to fully migrate it, which makes the data mesh assembly process much cleaner and simpler across multiple sources.

And then the last point about building a mesh on Databricks with Ascend: because Ascend provides not only the end-to-end efficiency of having a single tool across all of that, in addition to Databricks, but is also really good at consuming compute when compute is needed and not consuming it when it isn't, customers can in general spend less money and get more work done using Ascend, because we're not going to redo work that doesn't need to be redone. We're super good at ingesting data into Databricks in a very efficient way, but also, as you process pipelines downstream, make changes, and run your business, you're not going to reprocess data that doesn't need to be reprocessed. We can do mid-pipeline restarts; we can do all kinds of interesting things that save compute downstream, for both SQL and Python.
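To illustrate the underlying idea of not reprocessing data that hasn't changed, here is a generic, hand-rolled sketch of incremental processing with a watermark. This is not a description of Ascend's internal mechanism; the table and column names (customer.events_raw, customer.events_clean, ops.watermarks) are hypothetical.

```python
# A generic sketch of incremental processing: only rows newer than the last
# successfully processed watermark are read and appended downstream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Look up how far the pipeline got last run (assume one row per pipeline).
last_mark = (
    spark.table("ops.watermarks")
         .where("pipeline = 'customer_events'")
         .agg(F.max("watermark_ts"))
         .first()[0]
)

# Process only the new slice instead of recomputing the whole history.
new_rows = spark.table("customer.events_raw").where(F.col("event_ts") > F.lit(last_mark))
cleaned = (
    new_rows.dropDuplicates(["event_id"])
            .withColumn("processed_at", F.current_timestamp())
)
cleaned.write.mode("append").saveAsTable("customer.events_clean")

# Advance the watermark so the next run skips what was just processed.
new_mark = cleaned.agg(F.max("event_ts")).first()[0]
if new_mark is not None:
    spark.sql(f"""
        UPDATE ops.watermarks
        SET watermark_ts = '{new_mark}'
        WHERE pipeline = 'customer_events'
    """)
```

A platform would track these dependencies and restarts for you; the point of the sketch is only to show why skipping unchanged data saves compute.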
And then, because the platform takes over things like Unity Catalog integration, makes CI/CD really easy, handles security easily, and does all of these things that engineers would normally have to do, it means that engineers, and frankly data scientists and analysts, have more time to actually do engineering work. That means they can manage more pipelines, because they're focused on business rules and transformations and executing on the business needs, and they spend way less time, if not zero time, managing log consolidation and connections to databases and even creating tables, all of these toil-based activities that don't directly result in business value. So it's much less expensive on the tooling side, and you get much more efficiency out of engineers. If you're building a data mesh, you could expect a mesh to be built much more quickly on Ascend than you would build it by hand, analysis work aside; there's no magic there, somebody still has to figure out what the model looks like. What do you think, Paul?

I think it makes sense; I love it. It's a great topic, and I could definitely keep going on it for quite a while.

Yeah, I want to save 7x of the time we spend on that kind of stuff.

That would be fabulous, yeah. Good stuff. Well, thanks for that, John; that was super insightful and a really great primer and overview for the audience to think about when they're considering all aspects of implementing a data mesh on Databricks. This is the part of the program where you can ask us questions, so you'll find a little Q&A option in your Zoom chat window; please feel free to drop any questions in there if they come up, if there's anything you feel we didn't cover, or if you have open areas. But John, I have a question, and I realize this might devolve into a massive philosophical debate that takes more than three minutes, so apologies in advance. For those that have maybe been around for a while in the data modeling world and the data warehouse world and all of that, in your mind, what are some of the differences between a Data Vault and a data mesh? How would you compare those two things?

Yeah, I think a Data Vault has a very defined structure that you implement in order to call it a vault; there's a very defined way to build a Data Vault. A data mesh, by contrast, is going to be much more strategically oriented, in that we want the data mesh to reflect the way your business operates. The models you're building are essentially images of how your business wants to operate, and this reduces friction with your business partners and increases trustworthiness of the data, but it also means everybody has their own take on it, because every business makes money its own way; they have their own secret sauce, which ends up in a model somewhere. So as far as designing a mesh, there are some principles you follow, but it's very much about making friends with your business partners, figuring out why your business makes money, and reflecting that in a data model that works for you, versus a Data Vault, which is much more structured, even in the way the keys are set up and evolved.
It's kind of locked down, which I think is why they call it a vault, so I think that covers the confusion between the two, by the way. Data Vaults are very popular in places like financial services, where the actual movement of data has serious consequences, so the Data Vault has its place for sure; it solves a problem. But for most companies, I think a data mesh is just fine.

Yeah, that makes a ton of sense, and you probably don't have to go to a three-day boot camp just to learn how to do a data mesh, compared to a Data Vault.

Yeah, and in fact, if you had a happy hour with one of your business partners who was the main owner of the area you're interested in, and you got out a few napkins and a pen, I bet you'd come up with a pretty decent data model over a couple of beverages, and it wouldn't be too far from correct. You'd need to put a little of the technical stuff in later, but as far as meeting a business need, it's much less complicated than it seems in a lot of cases. In healthcare, and I was in healthcare for a while, some of the things I wish I had done, and would do now if I were actually driving, say, a claims model: I would have separated out all the pieces of healthcare claims, so diagnoses and procedures and finances, and not just loosely coupled them; I would make them separate domains and then worry about recombining them later. That would have taken a 700-table model down to maybe a dozen tables per domain, and it would have cut down the amount of business knowledge required to use it dramatically; it would have actually been approachable and usable. So that's a personal example of how this monolithic thinking tends to drive up complexity, whereas if you just sit down and sketch out what you want, actually make that true, and then have real business keys that work instead of surrogate keys, it tends to be much simpler to operate.

That makes a lot of sense. And you can also imagine that customizing the data models to the business can only add tremendous value, right? Because your business is unique and competitive because of the way you operate, and so your data fits that, versus your business having to fit into the data model. Then you're just leveraging your strengths.

Yeah, absolutely. If you follow me on Medium, I have an article on how to talk through that, and I use a two-by-two: one axis is business differentiation in the market, and the other is business value generation. If you prioritize your work in the upper-right corner, which is highly differentiated and highly valuable, that's where your engineers should be focused and spending most of their time, because that's where your business is making money. Anywhere else on that two-by-two, especially the bottom left, where it's not differentiated and it's not creating any value, you shouldn't be spending any time at all. This is why nobody writes their own HR system: you buy a license for an HR system, you never write your own software for that. So then the question is, why am I writing my own data platform? If my business is analyzing MRI imaging, my secret sauce is in the software that does the analysis; the rest of the pipeline stuff I should be buying, because I am not a software company, I'm an MRI imaging company, for example.
So those are ways to aggressively think through how do I go faster, one way to go faster: think that way through how you're managing your people and your dollars, and where you...

