Catalog and Cocktails #44: Why it’s time to mesh with your data architecture (Zhamak Dehghani)



This is Catalog and Cocktails. Don't forget to subscribe, rate, and review wherever you listen to your podcast. Here are your hosts, Juan Sequeda and Tim Gasper.

Hello everyone, welcome to Catalog and Cocktails, a weekly live hangout, an honest, no-BS, non-salesy conversation about enterprise data management with tasty beverages in hand. I'm Tim Gasper, director of product at data.world and longtime data nerd, joined by Juan. Hey Tim, I'm Juan Sequeda, principal scientist at data.world, and as always it's the middle of the week, middle of the day or end of the day, and a great time to pause and chat about data. Today we have, I think, the name that I hear almost every other day of my life right now: Zhamak Dehghani is the director of emerging technologies for Thoughtworks and the founder of the data mesh concept. If you have not heard about data mesh, or if her name has not come up, you have literally been living under a rock. I'm super excited to spend time talking with Zhamak and you, Tim. Zhamak, nice to see you; thank you so much for joining us here today. Welcome. Thank you, good to see you too. I have to disappoint: I don't drink, and it's 2 p.m. in San Francisco, so I'm drinking a mushroom coffee. [Laughter] No worries, no worries. Quick reminders: please give us your review on Apple Podcasts and follow us on Spotify. Also, we're partnering with the Knowledge Graph Conference, which takes place May 3rd to 6th, and we're going to have a special edition of Catalog and Cocktails where I'll be moderating the data architecture panel with Zhamak, who'll be on the panel together with Teresa Tung from Accenture, Ju from Intuit, and Mohamed Oscar from McKinsey. You can go to knowledgegraph.tech; the special 10% discount code is CCKGC. With that, let's do our tell-and-toast: what are we drinking and what are we toasting for? Zhamak, we know you're drinking mushroom coffee.
What is that? I've never had mushroom coffee. It's some potion I drink from the Four Sigmatic guys, a northern European potion of mushrooms that gives you superpowers; this particular one is supposed to make me smarter, so you can be the judge of that by the end of this. I want some tea that makes me smarter; that sounds good. How about you, Tim? I'm drinking a whiskey smash. I've got way too much mint growing in the backyard, so it's basically lemon, whiskey, and simple syrup with some backyard mint all smashed together. Well, I'm having a nice margarita, and I do want to make a special toast today. I think this is episode 44, and we have so many people behind the scenes at data.world who help us produce Catalog and Cocktails. One of them is our really good friend and colleague Sean Schweiko, who is going off to his next adventure; we would not be able to pull off everything we do with Catalog and Cocktails without him. So this is a toast to Sean. Sean, thank you so much for everything; I know you're listening. Cheers, Sean, you're destined for great things; thanks so much for all you've done for us.

We also have our warm-up question, our fun question: inspiring architectural designs outside the data space. So let's not talk about data architectures; let's talk about real architecture. I'll go outside of data technology and refer to the work of a female building architect, Zaha Hadid. I think she had a Middle Eastern background, Lebanese if I'm not mistaken, and she made these beautiful, organically influenced commercial and living spaces. She's actually no longer with us; she passed very young. Her work has been inspiring for me. That's awesome; I'm not familiar with her work, I've got to check that out. How about you, Tim? I don't know a ton about architecture, but I'll tell you about architectural designs I've liked lately: when you take traditional architecture and blend it with new, modern architecture. Lately I've been looking at house designs where the core house is there, whether it's a Tudor-style house or a colonial, but parts of the house are modern, maybe a little boxier or with interesting angles. I think that's cool, and I think you can make that work and blend it together. That's my little architectural insight for today. A couple of weeks ago I was
in IKEA, and I just love seeing the little living spaces of 200 square meters that they set up; I think that's great, cool design. But anyway, let's dive into the discussion we're here for. Honest, no BS: what is a data mesh, and what is not a data mesh? Let's kick off with that.

All right. Data mesh is an approach; it's not a thing. It's an approach for designing and architecting your big analytical data management based on a decentralized architecture, and governing that architecture based on a federated and computational governance. But it also has to address the concern of how you do that efficiently and effectively, so it also talks about the foundational infrastructure you have to put in place. That's why it confuses people: it started as an architectural paradigm for managing big data and analytics, but it had to go further and become more, to not create a mess. So it also addresses how to think about the architecture of infrastructure in that space, how to think about your organizational structure, and how to think about governance. That's the paradigm, and you can apply it using different technologies in your organization; it doesn't try to be prescriptive about what technology you use, even though I am very opinionated about that for data meshes.

I love that, and we definitely want to go into those opinions; I have my opinions too, so the interesting thing will be how much we align and don't align. Let's start on the technology side, then. One of the key things you just brought up for me was that it is an approach, and it's about decentralization. So what are the technologies that you're seeing, at a high level, that need to be involved in a data mesh? I have to disappoint you here, because I can't really call out any specific technologies that fit perfectly, but there are complementary technologies that we can use.

Let me pause there for a second, because this is one of the things we were talking about before. I call BS when people say, well, here's my product, here's my one-stop shop that does a data mesh; people are starting to say that, and I'm like, no, the whole point of the data mesh is that there isn't one thing. There are different ways, and you decide which ones you want to bring in. You don't buy a data mesh. Nobody's going to go off and say, hey, are you looking for a data mesh? I don't think so. That's my perspective; what do you think? Are we on the same page here?

Yeah, I completely agree. If you think about this as an ecosystem, you first need a set of standards and conventions that we agree upon as interfaces between components and agents within that ecosystem, and that doesn't exist for a large part of it. Then you have to think about which technologies plug in and provide different capabilities. I know I'm talking in the abstract, so let's put it into a concrete example. When we defined data mesh and started building it, there was no prior art, no language, no concept with which I could describe the smallest unit of this architecture and put a boundary on it. We call this thing a data product: the smallest unit of architecture around which you can form teams, like microservices in the operational world. But that thing doesn't actually exist to start with, because for the architecture to be truly distributed and to satisfy analytical use cases, that thing needs to have access to storage of data in a way that scales; it needs a computation engine you can inject computation into, because for a lot of analytical use cases you want to run your computation where the data is; it needs the APIs and interfaces to serve that polyglot data; and it needs a way of injecting your policies around it. I want to access the data but I don't have that access, so give me, say, the differential-privacy mode of access, so I can do analytics and see the forest without seeing the trees. So much needs to be encapsulated in something that can be a meaningful unit of your architecture before you can say, this is my data product, complete. And that thing doesn't exist. So how do we even talk about the technology when we don't have a language to describe the pieces of the architecture we need to build? We've got to build a language first, a system of dividing this world, and we've tried to create that language to some degree; then we can think about how to plug in the technology that exists today, underneath and above, and we can talk about those layers and where the gaps are. Does that help, or did I completely derail this conversation?

No, I think that's a good framework, and I think that elevates this to the way you want to approach the conversation, which is not to get pulled too much into "is it this tool or that tool." A question for you: you mentioned language, and it seems like the words we use and the frameworks we apply really help us define whether we need a data mesh and how it's going to play in our organization. I've heard the phrase "domains," for example, as a key aspect of how to change your thinking and prepare yourself for the mindset of a data mesh. What are some of the key terms you would point us to, the key drivers here?
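The "data product" unit Zhamak describes above, storage, compute, serving interfaces, and policy bundled into one addressable thing, can be sketched speculatively. Every name in this sketch (the classes, fields, and the `differential-privacy` policy string) is invented for illustration; it is not the API of any data mesh product.

```python
from dataclasses import dataclass, field

@dataclass
class OutputPort:
    name: str   # dataset this port serves, e.g. "orders_daily"
    mode: str   # one mode of polyglot access: "table", "stream", "file"

@dataclass
class DataProduct:
    domain: str
    name: str
    output_ports: list = field(default_factory=list)
    policies: list = field(default_factory=list)  # policies travel with the data

    def serve(self, port_name, clearances):
        """Enforce the attached policies before exposing a port."""
        port = next(p for p in self.output_ports if p.name == port_name)
        missing = [pol for pol in self.policies if pol not in clearances]
        if missing:
            raise PermissionError(f"unsatisfied policies: {missing}")
        return f"{self.domain}/{self.name}:{port.name} [{port.mode}]"

orders = DataProduct(
    domain="orders",
    name="order_history",
    output_ports=[OutputPort("orders_daily", "table")],
    policies=["differential-privacy"],
)
print(orders.serve("orders_daily", {"differential-privacy"}))
```

The point of the sketch is only the encapsulation: a consumer never reaches the storage directly, so the policy check cannot be bypassed, which is what makes the unit self-governing.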
Sure, that's a really good way of unpacking the problem. Domains are a big part of it, and here's the reason; maybe we can even go one level back and abstract away. Just for a moment, let every one of us stay quiet for a few seconds and imagine a world ten years down the track where everything we do is somehow augmented with a form of intelligence: recommendations, machine learning models that augment our understanding of the world. The data that feeds those things can come from every touchpoint, every place: my data, your data, the organization's data, medical data; it can come from anywhere. How can we build something that scales to that? Just take a few seconds and imagine that world.

And up to now, what we've been doing is just dumping things into a lake. Exactly: dump it somewhere, and someone else will define it and make meaning out of it. In that world, you have to bring the ownership, the quality, and all those affordances that make the data actually useful as close as possible to its source, give it an owner, and then give that owner all the tools so that, without friction, the data can be assured, consumed, and discovered, but in a truly decentralized and distributed way. Given where we are on the journey toward that world today, we break up our organizations around the functions and boundaries of domains, and we have boundaries of trust between organizations; that's how we're organizing our systems, at least. So if we break this big problem into smaller ones, with owners for the data as close to the source as possible, we end up with a domain-driven distribution of ownership of the data, and then of the structure of the data and everything around it. The domains are the bounded contexts within which we can establish a language and perform a function of the business: order management, customer management, whatever your business functions are. That's where we are in terms of defining it. If we sat together, drank a few more mushroom coffees, and thought about what that future world would look like, we'd probably end up with a different model, where ownership comes to the real owners of the data: my data would be organized around me, and I'd probably have some sort of grid where I keep my data. But that takes the conversation into the future, and we're not there, we're here, so I use domains as a way of decomposing a complex problem into smaller problems.

Yeah, this is a great exercise, because when you start doing it you realize, oh, I need data that comes from this place and this place and this place, and your original mindset says, well, we'll just put it in the same lake, that's where all data is. Then you realize, wait, that's what we've already been doing for twenty or thirty years, and we're still not able to accomplish this future-thinking idea. The way you're proposing to think about it, by domains, really goes back to taking this gigantic problem and splitting it into smaller pieces. Honestly, that's how computer science works: you take a big block and break it into something smaller, where the output of one black box is the input of the next, and then you keep breaking it down smaller and smaller. I think that's a great way of managing very messy problems. If you do that exercise, you end up realizing you have all these different domains, and at the end of the day everything should be decentralized. Now, this is something I want to get your take on: you're talking about decentralization; is everything decentralized, or is some part centralized? What's the true balance here?

I always try to be pragmatic and see this as an equilibrium that we constantly have to manage and sustain; I sometimes feel centralization and decentralization are in fact two sides of the same coin. The way I think about it is this: the moment you decentralize data ownership around domains and share it through domain APIs, you realize, oh, if I decentralize all the way down to the bottom of the stack that supports this model, down to the bare metal, does that mean every one of my teams and every domain builds its own stack, and we hope they all talk to each other? From a cost perspective, and for purely pragmatic reasons, is that possible? Probably not. So what you end up doing is giving these domains a layer of utilities, the tech stack they need to build their data products, and from their perspective they likely see it as a centralized layer of APIs, a centralized platform. Within that you can still have decentralization, with different teams looking after different aspects of it, but for ease of use it's probably, from the perception of the user, a centralized layer of utilities. And within that you can again decentralize: I do access management, you do encryption; I do storage, you do pipelines, whatever sits in there.

This is super interesting, and it makes me think of some questions I've gotten from people about how
to get started with data mesh. Usually we start with domains, and then we start talking about the right number of them, the balance between centralization and decentralization, and a lot of times you get into the question of how premeditated this all needs to be. Do I need to think ahead of time: okay, I don't want more than ten domains, so what are those ten domains going to be, we'd better premeditate it right now? How do you think about getting started with this kind of approach?

I find those academic exercises amusing and definitely engaging, but are they giving us results? I would think about it very pragmatically. Why did we want to decentralize in the first place? Because we wanted to mirror how we are decentralizing our business and other applications. If you haven't done that, then perhaps don't bother with data mesh. But if you have, and you already have different teams responsible for different functions or capabilities within your business, then just use that as a starting point. And if you don't yet have the platform capabilities to support these autonomous teams, maybe there's a point in time where you go from a centralized model to a decentralized one, because having every team run around doing its own thing with its own data, while these things stay connected, monitored, and understood at a global level, requires a level of platform maturity that enables that. So there is an axis of evolution: where you start on the adoption or transformation curve of data mesh within your organization looks very different from where you end, and you have to be pragmatic. Where I am today, does it make sense to have fifty of these things running around? Probably not. My thinking has been influenced by watching the migration to microservices more than a decade ago, and this was the same conversation: you have to be this tall to run microservices, and to run data mesh, and being this tall means having a set of data platform capabilities in a self-serve fashion. If you're not this tall, maybe you start with a smaller number, but mirror your business, mirror how your world is being distributed. Don't overthink it, don't try to boil the ocean; start where it makes sense, do what's natural for your organization, and iterate. And build the foundation so you can scale out: the whole purpose of having these domains is that you don't have to scale up like a lake; you can scale out based on boundaries of trust and boundaries of domains.

I think what you said, mimic the business and the domains that already exist in the business, is key, because that's how you want to start small. It's not "let's define what those ten domains are"; let's just start with one, the one that's most interested in participating, and that'll get the next domain involved, and so forth. Start with the marketing department or whatever. And at the same time you'll start building best practices, because at some point you can provide best practices that are fairly generic, but at the end of the day these things are part of the culture within a company: how you deal with data, how you set up teams, what type of governance style you have, whether you're really focused on risk or taking things to the next level and being more open about it. I think it
really depends on the culture. And work backwards, right? It's interesting that you mention marketing; marketing is in fact one of the hottest use cases for bringing the mesh to life, because when you look at the marketing function, they're one of the few parts of your organization that wants to look across your products and across your touchpoints, so they want data from many different domains. Even if you pick one use case from marketing and work backwards, say segmentation of my customers, or whatever it is for which I need to create machine learning models or reports, working backwards tells you which domains you need access to.

I'm working on a marketing project exactly like this, and it's fascinating, because everything they look at touches a customer or touches the product, so they're involved in so many places. That's probably another quick takeaway here: the marketing domain is one that gets you in touch with many different aspects of a business. But you're not going to consolidate that data in the marketing department; you're going to bring up a mesh that feeds the use cases. And this is the aspect of decentralization and centralization that I'm seeing; this is my opinion, and I want to hear what you think about it. Look, the typical thing: what do you call a customer? The marketing department has a definition of the customer; let them define it. Customer success has another definition; the sales folks have another definition. Okay, let every domain have its own definition of a customer. They'll write it down in natural language, they'll generate data for it, and at the end of the day they'll deliver a data product: here's the data, it probably involves customers, and the people consuming it are the ones who will say "I'm happy with this" or complain about it. And then there needs to be a central point that is cataloging those reviews, the complaints, the recommendations. What I always say is, let's enable friction: let people register that they agree or don't agree, and then put them in the same room and say, not only do we not agree on what a customer is (we already knew that), but here's the actual data you've generated, and here's why you should talk to Bob and Alice, because they're the ones who own that stuff. So I think at a central point you want to centralize those core models, and that centralized group, at some point, are the ones who pay attention to what the consumers are doing and the complaints they have, and they take that back to all the domains and let them know. That's how I think about it: it's a living organism within a company; you're never going to get it perfect so that everybody's happy; it's always going to be changing. That's my perspective.

I do agree with that, though I do think there's a slippery slope we have to watch out for, and I can't point to a convention and say "let's solve it that way." The slippery slope is this: in my customer domain and in the marketing domain, the customer actually looks different, because each of us looks at different aspects of the customer. While I agree that when the data is on the inside, inside my app, I can design it however I want, because it's just for my app, when the data becomes data on the outside, and the data-on-the-outside language in data mesh is the data products, particularly data products that look not just at the current state but at the historical state of the customer and the orders they put in, then that data on the outside does need a mapping context: a way to map the internal context to an external context that other people understand. But if the mechanism for doing that is "let's define the customer in one place and everybody agrees with it," we end up with a bloated definition of the customer that has to encompass all those different views, and nobody will actually use it. So the mechanism for arriving at consensus, so I can link the customer from this place to that place and still understand it's the same thing even though it looks different, comes down to minimal mechanisms: unification of IDs, and having links between those entities so you can connect them. Those are the fine-grained mechanisms you have to put in place.

That's exactly one: identifiers are something that needs to be managed in a central manner, because otherwise we'll just end up with more and more identifiers when you're telling people to go reuse this; the same goes for some types of schemas and models out there. But again, data on the inside versus data on the outside. Data mesh tries to respect that, and that's the difference from prior thinking like virtualization, fabric, and so on: let's be respectful of the autonomy of different domains and applications. The data on the inside is designed and optimized for them to move fast and do what they need to do, while the data on the outside, the data product, is designed to be shared, to reach consensus, and to correlate across domains. There may be a gap between the two, and the bigger the gap, the bigger the problem we end up in.
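The minimal mechanisms named above, ID unification plus links between each domain's own view of an entity, can be sketched as a toy. The table contents and the `global_id` field are invented for illustration: each domain keeps its own "customer" model, but publishes a link to a shared identifier instead of everyone agreeing on one bloated definition.

```python
# Two domains, two local customer models, one shared global identifier.
marketing_customers = {"mkt-17": {"segment": "premium", "global_id": "cust-001"}}
sales_customers     = {"sf-9":   {"region": "EMEA",     "global_id": "cust-001"}}

def link_by_global_id(*domain_tables):
    """Join domain-local records that point at the same global identifier."""
    linked = {}
    for table in domain_tables:
        for local_key, record in table.items():
            linked.setdefault(record["global_id"], {}).update(
                {k: v for k, v in record.items() if k != "global_id"}
            )
    return linked

print(link_by_global_id(marketing_customers, sales_customers))
```

Neither domain had to change its internal model; the consumer correlates the two views only through the published link.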
Because data on the inside turns into data products on the outside, and those things feed machine learning models that get embedded back into the application. The moment you say, okay, for this customer, recommend, I don't know, the next music track they should listen to, you've got a disconnection. So you need to keep those two close, yet allow them to be different, because they're built for different reasons: my database for my music-playing application has a very different access model from the data on the outside that says what music people have listened to. So there are some nuanced things in there about the differences.

Yeah, that's interesting, and your comment about respecting the differences while making sure things make sense and come together makes me think a lot about the governance side of things. Obviously, with decentralization and data mesh, there's this umbrella governance function that needs to be effective to keep everything together. How do you think about managing governance and handling that overhead? For example, how does stewardship play a role, and how do tools like catalogs play a role? Do you have frameworks you recommend around that kind of thing?

I have to caveat that I'm really no expert in this. When I thought about the governance model, which I call federated computational governance, I felt that as human beings we've been struggling and wrestling with this balance between individualism, particularly in the US (my domain, I want to do my own thing, I want to move fast, I know my own data), and, well, great, you're moving fast, but nobody can use your data and you're breaking everybody else. Since Aristotle's time we've wrestled with the balance between the common good and individualism. So mesh governance has to have both an incentive model and a structure of people and roles that constantly counterbalances those two poles. The thinking behind it is: we've got these data product owners who have local incentives to make their data product awesome. What does that mean? A lot of people are using it, data scientists are recommending it to their friends, it's easy to discover, all of those good things. Then counterbalance that with global incentives: your data product gets extra bonuses if it actually connects to other data products, if other people on the mesh are using it, if they're connecting into each other, because we want the network effect. So it's a dual incentive model, and the group that governs the mesh is federated from the folks that have that local responsibility. And then, to make this real, we have to push complexity down, make it automated, and make it embedded into every one of these nodes on the mesh. We've done this with zero-trust architectures and so on in the operational world, when we went from on-prem to cloud, in how we thought about policy execution and configuration at each microservice; we can take that learning and apply it to the data concerns. So everything we agree is a global policy that we all adhere to, say how we describe our schemas, for example, or what meta-language we use to describe SQL, let's put that into the platform layer and make it so easy that people just adopt it, and then give bonus points.
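Pushing a global policy into the platform layer and checking it automatically at every node, as described above, might look like the sketch below. The policy (a required schema version, a named owner) and the descriptor fields are invented for illustration; the real set would come from the federated governance group.

```python
# Hypothetical sketch: one global policy check, encoded once in the
# platform, applied to every data product's self-description.

REQUIRED_SCHEMA_VERSION = 3  # the version "we are all using"

def check_global_policies(descriptor):
    """Return a list of violations; an empty list means the node conforms."""
    violations = []
    if descriptor.get("schema_version", 0) < REQUIRED_SCHEMA_VERSION:
        violations.append("schema out of date")
    if not descriptor.get("owner"):
        violations.append("no accountable owner")
    return violations

good = {"name": "customers", "owner": "marketing", "schema_version": 3}
bad  = {"name": "legacy_dump", "schema_version": 1}

print(check_global_policies(good))  # []
print(check_global_policies(bad))
```

Because the check runs in the platform rather than in each team's pipeline, conforming is the path of least resistance, which is the "push complexity down and automate it" point.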
You know, global incentives: if you're up to date with your version of the schema we're all using, then you get extra votes.

I love this idea of getting bonus points. I've called this in the past "being a good data citizen." If you're part of the mesh but you don't use the schema people know about, and you're not documenting your stuff, people are not going to use it: "I'm not going to use Juan's data, that sucks." Very rarely are people measured on this kind of stuff, right? The idea of a KPI around how many users your data product has and whether they're happy: how many organizations are measuring that? Maybe they should be. And I really like what you said about a data product not being one thing on its own but being connected with other data products; I think that's important. Given my background, this is why I think knowledge graphs and graph technology are a natural implementation technology, because you get that connectivity, quote, for free: you're able to connect your data across things and share your metadata. From the technology perspective, I see catalogs as playing two roles. One is as a tool for the data engineers to catalog the existing data, which is not a data product; those data sets are ugly and unorganized, you don't want to release them, but you need to understand what they are. And once you've organized that and created a data product, that product needs to be cataloged so the consumers can use the catalog to search the products, not the underlying data sets, which are what I call the inscrutable, ugly enterprise data.

Yeah, I do agree. So the way I think about it (and Juan, I didn't answer your earlier question about the catalog either) is that you
know once you have this distributed well nicely playing nice citizens of the mesh data data products and they talk to each other they convert each other's data they connect because they have relationship with each other in terms of their semantic but you need to have a even though i i think very bottom-up in terms of the decentralization like each one of these nodes should be self-sustained autonomous you've got to be able to heat an api on this data product and discover it and have all the information about it available right then and there it's metadata it's timeliness schema all of that but you still need as a user to have the global view of this mesh right you need to have a way of searching it browse it all of those things and that's where again i refrain the word data catalog because i want to imagine even new words i want us to imagine that 10-year down the track so i just call it for an app data discoverability or data exploration tool and not use the mechanical so something that lets you discover and explore and what you can discover and explore could be a knowledge graph that has emerged from the mesh so knowledge graph emerges from the animations not the mesh itself because the mesh itself is data and computation and scheme all of those things um so it's an execution context as well as the data and then the knowledge graph emerges from it so you want to browse that knowledge graph and you have to have a window into it and and today's window is basically what we've called data catalog but it may be an inverted model that is instead of going as you said one go and look in the data inside and try to like apply a ton of machine learning to figure out what this column names and tables actually meant and what was their relationship try to invert it on its head and say well that's great to have some sort of intelligence at the top to look at this but let's assume the nodes themselves are self-descriptive and self-discoverable and have some type of quality 
submits somewhere in between where we have thought about data catalogs as these master tools that get all the you know intelligence out of non-intelligent beings yeah but by the way i always say 30 minutes fly by but this is really going so well and and and actually i think tim we're gonna i'm gonna do an executive decision here it's like let's keep going for a bit like i think this is i got a couple more things i want to go talk about yeah maybe a bonus section a bonus section here so one of the things first thing i'm talking we're talking about definitions is honest no bs what is a data product how do you how do you describe a data product my answer at this moment is it is a beautiful table that people understand the columns make sense you have the they have definitions like it's it's end up being a table that i will still open it up in power bi and thought and and and and that that's one for me uh that's my that's my def i want to hear from you what is a data product i had a lot more hopes for this little data product than just being a table so if i have if i want to see this data product grow and be the thing that i had hoped was in fact a new completely new uh architectural quantum a unit of architecture that abstract everything you need to compute and provide access to a domain data with ability to also execute policies on it so it's a new abstraction that maybe when you went three level down apis you actually get to a nice beautifully designed table but to create that table you need computation you need those transformations to actually create that table to serve that table in many modes of access table is just one mode of access um different words of access you need apis and projections and transformations that do that to actually get to that table with the right access control and make sure you have access to that table you need to have you know policy engines right next to it to to do it so the container that i put around all of this which is compute 
policy, and data as one unit of architecture, is something I can now put my hand on my heart and call an autonomous unit. The mesh can have many of these, and they connect to each other, so it becomes more than just a table. But that beautiful table you described has to be somewhere in there; it's all about data, anyway. We need a new word for it, and that's really hard to convey because we just don't have one.

Exactly, that's the thing I struggled with. In the end it kind of seems underwhelming: all of this, and I'm just getting a table, which still ends up in Excel. Yeah, but think about it. For me it's: look at the column; the column has a name you understand, it has a description, the data underneath it is well defined, it should be clean, you know where it comes from (the exact lineage), and if you don't like it, you know who to complain to, you know who's responsible for it. All of these are the things that go around treating data as a product. Even though in the simplest terms it looks like a table, there's much more around it, and we need to convey that and physically show it to people. That's one of the things I've been thinking a lot about.

The other thing we were chatting about (we were Slacking earlier today) is that you said we have a choice to reimagine or to rebrand. I really love that, because there's a choice of a path of change instead of just putting a fresh coat of paint over what we've always done. That's when you start thinking about something very, very different, and I think that's part of the message here: let's stop, sit down, think about what we've been doing for so long and what life should be; there's a big gap. Let's not put lipstick on a pig.

Yeah, it's not just "oh, data fabric's not cool anymore, data mesh is the cool thing, okay, cool, I'll say mesh now instead." That would be wrong.

Well, this gave us an idea for a new segment: the honest, no-BS lightning round. We've prepared five questions, they're yes-or-no answers, and we'll give you a small amount of time to support your answer. Kickoff question number one: is a data fabric a data mesh?

No, but they're complementary. If you think about data fabric when it was created by the NetApp folks, the problem they tried to solve was access to data wherever it is and being able to integrate it, at a point when people were going to the cloud, so they had to solve the hybrid problem. I've seen data fabric implementations that extract data from all sorts of databases placed everywhere, but at the end of the line they still dump it into a lake or a warehouse to actually run analytics on it. So I think they're complementary: fabric can be the bottom, bare-metal layer of the stack, and then you look at it logically, with a new set of technologies, as a mesh on top. There are synergies and they complement each other, but they're not the same thing.

I like that. So for everyone listening: you can do both; it's not one or the other. All right, question number two: is data mesh an architecture?

Yes and no. It's definitely an architecture, but as people familiar with Conway's law know, architectures and organizational structures are very tightly coupled; you can't talk about one without the other. So it does have an organizational transformation aspect to it, and it talks about both. It's hard to say it's just an architecture, but architecture is a big part of it.

All right: is
a data product a data service? You can imagine it that way, but the problem with that model is that we're again imagining the world based on what we've known, and that falls apart very quickly. What is a service? I think about it as, back in the day, a Unix process, or these days, cooler, a container: it's an execution context, a well-rounded execution context, and that's all it is. So you can implement a data product as a service, but that model falls apart very quickly when you think about your big data processing workloads and what it actually takes to run them, and then about all those cross-functional capabilities that need to be a sidecar on, or decorate, this service. It needs to have attachments, so suddenly it becomes a thing where there's a service somewhere but also all these satellites running around it. So a service could be a mechanical implementation, but a narrow one: narrow in that only some of the capabilities we want a data product to encapsulate can be encapsulated in it.

Interesting, okay, I like that answer. Question number four: I have a warehouse and I'm building data products in it. Am I doing data mesh?

We might all be living in parallel universes, so maybe in a parallel universe a data product could be a table in a warehouse. In my universe of meaning it can't, because I really imagine data as a product that has a heartbeat: it's still live, it's still processing data, it really is autonomous. I can move it anywhere; it's not bound and tightly linked to the rest of the data, and it doesn't disturb everybody else if I change something. To really get to that level of autonomy you bring execution into the same context, so execution is not an outside object, like a pipeline outside of the data; the separation of the pipeline and the data doesn't exist in the data product. It's all one unit. So you may find that you're thinking in a very limited way. However, if that's all you have, if you have a warehouse, good for you: start decoupling it and thinking about it in domains. But you will find that your technology stops you very quickly from taking that ambitious plan of decoupling further.

Final one: do spreadsheets disappear in a data mesh?

No. Spreadsheets are wonderful little tools that we humans like. Don't put your spreadsheets in production, but if you just want to do exploration, I think they're wonderful tools, and a data mesh should allow them. In fact, the nodes on the mesh are the shareable data, and the thinking behind giving native access to this wonderful spectrum of users, from analysts to scientists, is precisely so that someone can connect his or her spreadsheet to a port. There's a language around the data product's output port that allows them to get the data into a spreadsheet and play with it in their own tooling. So we've gone full circle: the technology ecosystem that the mesh enables should allow connecting a spreadsheet to one or many data products and getting the data into it.

Would the spreadsheet be a component of a data product?

I don't think so. I think it's just a tool that you're using to talk to the data products and get data out of them.

Well, this has been a fantastic discussion; I need to go back and listen to it myself many times. We always like to close with our takeaways, so, TTT: Tim, take us away with your takeaways.

All right, taking it away with some takeaways. Big takeaway: data mesh is not a thing. It's not something you buy, and it's not just an architecture; it's an approach. I like your vision statement of "where do we want to be in five or ten years?" It's a vision
for a better future around our data and a path to get there, and I like that we're talking about this path and trying to figure out the right way to go down it. I like that you talk about the autonomous unit being compute, policy, and data, and about how those things work together and fit into the different components and the ways the data mesh connects. And you made a statement today about pushing the complexity down, which I think is a really interesting concept. As people start to learn about data mesh, one of the things they get confused about is: "this sounds complicated, it sounds overly ambitious, it sounds like I need a bunch of people power to manage my data mesh." The interesting insight is: look at what we're doing around automation, software best practices, CI/CD, streaming architectures, and things like that. There are ways to make this approach less complex, and you have to do it that way if you're going to keep it sane.

My takeaway is the whole idea of thinking about that ideal world and working back from it, breaking the problem into smaller pieces. The result is: hey, you move back to the original sources as your domains and have that ownership. We talked about marketing; I think that's an interesting place to start, to go think about the marketing domain, because they touch so many parts of the business. You also said the nodes on the mesh are shareable data, and that you want to make your data product awesome. And, to be honest here, I think there's still ambiguity about what a data product is. I gave my definition, and I think yours is still a bit abstract, to be honest, but we've got to figure it out. There are also incentives that need to be around this: how do we incentivize people to make sure they use a data product, what are the KPIs behind that, and how do we bonus people on it? If a data product is connected to more data products, that's great, and you should be bonused for that. Finally, you said data has a heartbeat and always continues to live. I think that's a beautiful way to put it; I like it better than "data has gravity." We need some new analogies now.

Well, Zhamak, we always like to close with two questions for you. One: what's your advice, open-ended, about anything? And second: who should we invite next?

Okay, so, advice: just be critical. Don't take my word for it; poke holes in it. This is an open invitation to reimagining a distributed, decentralized world around data, so be critical of the things you hear, and let's work on it together. Who should you invite? Oh, this is a really hard one; there are a lot of good people out there. I feel like I should consult these people before calling their names out.

Oh, no worries, everybody just calls them out here. Yeah, it's just public shaming; it's just the policy we have here.

Well, I have a coach and someone I collaborate with who hired me out of university 25 years ago, and at the moment I think he's the VP of data infrastructure at Intuit. He's a good friend of mine; he's very quiet, but he's a very wise man and he's seen a lot. Mammad, I would like to hear him talk, but I have not consulted him.

All right, well, what we will do is: you will send him this podcast, and then he will hear his name. [Laughter]

Well, thank you, thank you so much for this time; we truly, truly appreciate it. This is super exciting. To wrap up: don't forget the Knowledge Graph Conference. Go get your tickets at knowledgegraph.tech, with a 10% discount with the code CCKGC, and we'll have a special edition of Catalog and Cocktails. We're going to be doing this again, but in addition to Zhamak we'll have Teresa Tung, Ju, and Mohammed Aaser. And then next week it's DataOps with Chris Bergh from DataKitchen, one of the authors of the DataOps Manifesto, so that should be really, really cool. And with that: cheers, everyone. Thank you so much.

2021-04-24 15:31
