Breaking Analysis: Technology & Architectural Considerations for Data Mesh



Announcer: From theCUBE studios in Palo Alto and in Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante.

Dave Vellante: The introduction and socialization of data mesh has caused practitioners, business technology executives, and technologists to pause and ask some probing questions about the organization of their data teams, their data strategies, future investments, and their current architectural approaches. Some in the technology community have embraced the concept, others have twisted the definition, while still others remain oblivious to the momentum building around data mesh. We are in the early days of data mesh adoption. Organizations that have taken the plunge will tell you that aligning stakeholders is a non-trivial effort, but necessary to break through the limitations that monolithic data architectures and highly specialized teams have imposed on frustrated business and domain leaders. However, practical data mesh examples often lie in the eyes of the implementer and may not strictly adhere to the principles of data mesh. Part of the problem is a lack of open technologies and standards that could accelerate adoption and reduce friction, and that's what we're going to talk about today: some of the key technology and architecture questions around data mesh.

Hello, and welcome to this week's Wikibon CUBE Insights, powered by ETR. In this Breaking Analysis we welcome back the founder of data mesh and director of emerging technologies at Thoughtworks, Zhamak Dehghani. Hello, Zhamak. Thanks for being here today.

Zhamak Dehghani: Hi, Dave. Thank you for having me back. It's always a delight to connect and have a conversation.

Dave Vellante: Great, looking forward to it. Okay, before we get into the technology details, I just want to quickly share some data from our friends at ETR. Despite the importance of data initiatives, since the pandemic CIOs and IT organizations have had to juggle a few other priorities, which is why, in the survey data, cyber and cloud computing are rated as the two most important priorities. Analytics and machine learning and AI, which are data topics, still make the top of the list, well ahead of many other categories. And look, a sound data architecture and strategy is fundamental to digital transformation, and much of the past two years, as we've often said, has been like a forced march into digital. So while organizations are moving forward, they really have to think hard about the data architecture decisions they make, because those decisions are going to impact them, Zhamak, for years to come, aren't they?

Zhamak Dehghani: Yes, absolutely. We are slowly moving from reason-based, logical, algorithmic computation and decision-making to model-based computation and decision-making, where we exploit the patterns and signals within the data. So data becomes a very important ingredient, not only of decision-making, analytics, and discovering trends, but also of the features and applications that we build for the future, so we can't really ignore it. And as we see, one of the existing challenges around getting value from data is no longer access to computation; it's access to trustworthy, reliable data at scale.

Dave Vellante: Yeah, and you see these domains coming together with cloud, and obviously it has to be secure and trusted, and that's why we're here today talking about data mesh. So let's get into it. Zhamak, first, your new book is out, Data Mesh: Delivering Data-Driven Value at Scale, just recently published, so congratulations on getting that done.
Now, in a recent presentation, you pulled excerpts from the book, and we're going to talk through some of the technology and architectural considerations. Just quickly, for the audience, the four principles of data mesh: domain-driven ownership, data as a product, the self-serve data platform, and federated computational governance. I want to start with the self-serve platform and some of the data that you shared recently. You say that data mesh serves autonomous, domain-oriented teams, versus existing platforms, which serve a centralized team. Can you elaborate?

Zhamak Dehghani: Sure. The role of the platform is to lower the cognitive load for domain teams, for the people who are focused on business outcomes and the technologists building the applications, so they can work with data, whether they're building analytics, automated decision-making, or intelligent models. They need to be able to get access to data and use it. So the role of the platform, just stepping back for a moment, is to empower and enable these teams. Data mesh, by definition, is a scale-out, decentralized model that wants to give autonomy to cross-functional teams, so at its core it requires a set of tools that work really well in that decentralized model.

When we look at the existing platforms, they try to achieve a similar outcome, right? Lower the cognitive load, give tools to data practitioners to manage data at scale. But today's centralized data teams have a job that isn't directly aligned with any one business unit or business outcome in terms of getting value from data; their job is to manage the data and make it available for the cross-functional teams or business units to use. So the platforms they've been given are really centralized around, or tuned to work with, that structure: the structure of a centralized team. And although on the surface it seems like you should be able to use your cloud storage, computation, or data warehouse in a decentralized way, some changes still need to happen to those underlying platforms. As an example, some cloud providers simply have hard limits on the number of storage accounts you can have, because they never envisioned you having hundreds of lakes; they envisioned one or two, maybe ten. They were really centralizing data, not decentralizing it. So I think we see a shift in thinking, toward enabling autonomous, independent teams versus a centralized team.
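To make that self-serve platform point concrete, here is a minimal sketch of the kind of interface a platform might expose so a domain team can provision a data product without a central team in the loop. Everything here, the class names, the URI scheme, and the returned handles, is hypothetical illustration, not any vendor's API.

```python
# Hypothetical sketch: a self-serve platform API that a domain team could call
# to provision everything a new data product needs, without filing tickets
# with a central data team. None of these names come from a real product.

from dataclasses import dataclass


@dataclass
class DataProductRequest:
    domain: str          # e.g., "orders": the business domain that owns the product
    name: str            # e.g., "daily-order-totals"
    owners: list[str]    # the cross-functional team accountable end to end


class SelfServePlatform:
    """Abstracts storage, compute, and catalog setup behind one call."""

    def provision(self, request: DataProductRequest) -> dict:
        # A real platform would create isolated storage, register the product
        # in a catalog, and wire up policy enforcement. Here we just return
        # the handles a domain team would receive.
        product_id = f"{request.domain}/{request.name}"
        return {
            "product_id": product_id,
            "storage_uri": f"mesh://storage/{product_id}",    # isolated, per-product storage
            "catalog_entry": f"mesh://catalog/{product_id}",  # discoverable by other domains
        }


if __name__ == "__main__":
    platform = SelfServePlatform()
    handles = platform.provision(
        DataProductRequest(domain="orders", name="daily-order-totals", owners=["orders-team"])
    )
    print(handles)
```

The design point is the single call: storage, catalog registration, and policy wiring are the platform's concern, not the domain team's, which is what lowers the cognitive load Dehghani describes.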
Dave Vellante: Just a follow-up, if I may, and we could be here for a while, but this assumes that you've sorted out the organizational considerations, that you've defined what a data product is and what a sub-product is. And people will say, of course, we use the term "monolithic" as a pejorative, let's face it, but the data warehouse crowd would say, well, that's what data marts did, so we've got that covered. But the premise of data mesh, if I understand it, is that whether it's a data mart or a data warehouse or a data lake or a Snowflake warehouse, it's a node on the mesh. So don't build your organization around the technology; let the technology serve the organization. Is that right?

Zhamak Dehghani: That's a perfect way of putting it, exactly. For a very long time we've looked at the decomposition of complexity around technology, and that's maybe a good segue to the next item on that list. We decompose based on whether I want access to raw data, which I put on the lake, or access to modeled data, which I put on the warehouse, or whether I need a team in the middle to move the data around, and then we try to fit the organization into that model. Data mesh really inverses that. As you said, look at the organizational structure first, the boundaries around which your organization and operation can scale, and then, as a second layer, look at the technology and how you decompose it.

Dave Vellante: Okay, so let's go to that next point and talk about how you serve and manage autonomous, interoperable data products, where code, data, and policy, you say, are treated as one unit, whereas your contention is that existing platforms have independent management and dashboards for catalogs, storage, et cetera. Maybe double-click on that a bit.

Zhamak Dehghani: Yeah. That functional or technical decomposition of concerns is one way, a very valid way, of decomposing complexity, and then we build independent solutions to address those concerns. That's what we see in the technology landscape today. You see technologies that take care of managing your data, bringing it under some sort of control and modeling; you see technology that moves that data around or performs various transformations and computations on it; and then you see technology that tries to overlay some level of meaning, metadata, understandability, and policy. That's where your data-processing pipeline technologies, your warehouse and lake storage technologies, and your governance technologies come into play. We decomposed, and now we have to deconstruct and reconstruct this back together. For data mesh to really become a reality, where independent teams and sources of data can responsibly share data in a way that can be understood right then and there, where policies can be imposed right at the moment the data gets accessed at the source, and in a resilient manner, so that changes to the structure or schema of the data don't cause downstream downtime, we've got to think about a new nucleus, a new unit of data sharing. We need to bring the transformation code, the governance of the data, and the data itself together around these decentralized nodes on the mesh. So that's another deconstruction and reconstruction that needs to happen in the technology: to organize ourselves around the domains, and around the data and the logic and the meaning of the data itself.
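Here is a rough sketch of what "code, data, and policy as one unit" could look like in practice. This is not from Dehghani's book; the names and structure are made up to show the shape of the idea, namely that the policy travels with the data and is enforced on every read.

```python
# Hypothetical sketch of a data product that versions and serves its code,
# its data, and its policies as one unit, rather than managing pipelines,
# storage, and governance in separate tools.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Policy:
    name: str                        # e.g., "mask-pii"
    applies: Callable[[dict], dict]  # enforced at access time, where the data lives


@dataclass
class DataProduct:
    """One node on the mesh: data, the code that produces it, and the
    policies that govern it, bundled and deployed together."""
    name: str
    transform: Callable[[list[dict]], list[dict]]          # the code
    records: list[dict] = field(default_factory=list)      # the data
    policies: list[Policy] = field(default_factory=list)   # the governance

    def serve(self) -> list[dict]:
        # Policies travel with the data: every read passes through them,
        # so governance is applied at the source, not bolted on downstream.
        out = self.transform(self.records)
        for policy in self.policies:
            out = [policy.applies(r) for r in out]
        return out


mask_email = Policy("mask-pii", lambda r: {**r, "email": "***"})
orders = DataProduct(
    name="orders/recent",
    transform=lambda rows: [r for r in rows if r["status"] == "shipped"],
    records=[{"id": 1, "status": "shipped", "email": "a@b.com"}],
    policies=[mask_email],
)
print(orders.serve())  # [{'id': 1, 'status': 'shipped', 'email': '***'}]
```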
Dave Vellante: Great, got it. And we're going to talk more about the importance of data sharing and its implications. But the third point deals with how operational and analytical technologies are constructed. You've got an app-dev stack and you've got a data stack. You've made the point many times that we've contextualized our operational systems, but not our data systems; they remain separate. Maybe you could elaborate on this point.

Zhamak Dehghani: Yes, this again has a historical background. For a really long time, applications have dealt with features and the logic of running the business, encapsulating the data and the state they need to run that feature or business function. Then, for anything analytically driven, which required access to data across these applications, across the longer dimension of time, around different subjects within the organization, we made a decision: okay, let's leave those applications aside, let's leave those databases aside, we will extract the data out, load it or transform it, and put it into the analytical data stack. Downstream from it we'd have the analytical data users, the data analysts, the data scientists, a growing portfolio of users of that data stack. And that led to this real separation, a dual stack with point-to-point integration. Applications went down the path of transactional databases or document stores, using APIs for communicating, and we've gone to lake storage or the data warehouse on the other side. And that, again, enforces the silo of data versus app.

If we're moving to a world where our ambitions are around making applications more intelligent, making them data-driven, these two worlds need to come closer: ML and analytics get embedded into those applications themselves, and data sharing, as a very essential ingredient of that, becomes closer to those applications. If you're looking at a cross-functional app-data-business team, then the technology stacks can't be so segregated. There has to be a continuum of experience, from app delivery, to sharing the data, to using that data to embed models back into those applications, and that continuum of experience requires well-integrated technologies. To give you an example, and in some sense we are somewhat moving in that direction: when applications share data or models, they use one set of APIs, HTTP-compliant GraphQL or REST APIs. On the other hand, for analytical data, you have a proprietary, SQL-like model: connect to my database and run SQL. Those are two very different models of representing and accessing data. We kind of have to harmonize, or integrate, those two worlds a bit more closely to achieve those domain-oriented, cross-functional teams.
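For reference, here are the two access models Dehghani is contrasting, side by side. The queries are invented examples; the point is that one couples the consumer to an open contract, while the other couples it to the database's internal schema.

```python
# The two access models, side by side. The endpoint, table, and field names
# are made up for illustration.

# Operational world: an application asks for what it needs through an
# open HTTP/GraphQL contract; the storage behind it stays invisible.
graphql_query = """
{
  customer(id: "42") {
    name
    openOrders { id total }
  }
}
"""

# Analytical world today: a consumer connects to the database itself and
# couples to its internal schema, which is what makes this model brittle.
sql_query = """
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE c.customer_id = 42 AND o.status = 'OPEN';
"""

# If the analytical side renames or re-models a column, every downstream
# query breaks; the GraphQL contract can evolve behind a stable interface.
```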
Dave Vellante: Yeah, and we're going to talk about some of the gaps later, and how you actually look at them as opportunities more than barriers; they are barriers, but they're opportunities for more innovation. Let's go on to the next point. It deals with the roles that the platform serves. Data mesh proposes that domain experts own the data, take responsibility for it end to end, and are served by the technology; we referenced that before. Whereas your contention is that today's data systems are really designed for specialists, I think you use the term "hyper-specialists" a lot, I love that term, and the generalists are kind of passive bystanders, waiting in line for the technical teams to serve them.

Zhamak Dehghani: Yes. The intention behind data mesh was creating a responsible data-sharing model that scales out, and I challenge any organization that has scaled ambitions around data whose usage of data relies on small pockets of very expensive specialist resources. We have no choice but to upskill and cross-skill the majority population of our technologists. We often call them generalists; that's a shorthand for people who can really move from one domain, one technology, to another. Sometimes we call them "paint-drip" people, sometimes we call them T-shaped people, but regardless, we need the ability to really mobilize our generalists. At Thoughtworks, we serve a lot of clients, and like many other organizations we're also challenged with hiring specialists, so we've tested the model of having a few specialists convey and translate the knowledge to generalists and bring them forward. And of course the platform is a big enabler of that: what is the language of using the technology, what are the APIs that give that generalist experience? This doesn't mean no-code or low-code, and it doesn't mean we throw away good engineering practices. Good software engineering practices remain, of course; they get adapted to the world of data, to build resilient and sustainable solutions. But specialty, especially around proprietary technology, is going to be a hard one to scale.

Dave Vellante: Okay, I'm definitely going to come back and pick your brain on that one. And your point about scale: in the practical examples of companies that have implemented data mesh that I've talked to, and there's only a handful that I've really gone deep with, in all cases it was their Hadoop instances, their clusters, that wouldn't scale; they couldn't scale the business around them. So that's really a key point; it was a common pattern that we've seen. I think in all cases they went to a data lake model in AWS, so that maybe has some violation of the principles, but we'll come back to that. Let me go on to the next one. Of course, data mesh leans heavily toward this concept of decentralization to support domain ownership over centralized approaches, and we certainly see the public cloud players and database companies as key actors here, with very large installed bases, pushing a centralized approach. So I guess my question is, how realistic is this next point, where you have decentralized technologies ruling the roost?

Zhamak Dehghani: I think if you look at the history of the places in our industry where decentralization has succeeded, they heavily relied on standardization of connectivity across different components of the technology. And right now, you're right, the way we get value from data relies, at the end of the day, on collection, collection of data. Whether you have a deep learning or machine learning model that you're training, or you have reports to generate, the model is: bring your data to a place where you can collect it, so that you can use it. And that naturally leads to a set of technologies that try to operate as a full stack, integrated and proprietary, with no intention of opening data up for sharing. Conversely, if you think about the internet itself, the web itself, microservices, even at the enterprise level rather than the planetary level, they succeeded as decentralized technologies to a large degree because of their emphasis on openness and sharing: API sharing. In the API world we don't say, "I will build a platform to manage your logic or your applications"; maybe to a degree, but we've actually moved away from that. We say, "I'll build a platform that opens up your applications, that manages your APIs, your interfaces, that gives you access through APIs."
So I think the shift needs to go to that definition of decentralized: really composable, open pieces of technology that can play nicely with each other, rather than a full stack that says, "I'll have control of your data, yet be somewhat decentralized within the boundary of my platform." That's simply not going to scale if data needs to come from different platforms, different locations, different geographies. It needs a rethink.

Dave Vellante: Okay, thank you. And then the final point is that data mesh favors technologies that are domain-agnostic, versus those that are domain-aware, and I wonder if you could help me square the circle, because it's nuanced, and I'm kind of a 100-level student of your work. You have said, for example, that the data teams lack context of the domain. So help us understand what you mean here, in this case.

Zhamak Dehghani: Absolutely. As you said, data mesh tries to give autonomy, decision-making power, and responsibility to the people who have the context of those domains: the people who are really familiar with the different business domains and, naturally, the data those domains need or share. If the intention of the platform is really to give power to the people with the most relevant and timely context, the platform itself, as a shared component, naturally becomes domain-agnostic to a large degree. Of course, "platform" is a fairly overloaded word. If you think of it as a set of technologies that abstracts complexity and allows building the next level of solutions on top, those domains may have their own sets of platforms that are very much domain-aware. But the generalized, shareable set of technologies and tools that allows us to share data, that piece of technology, needs to relinquish the knowledge of the context to the domain teams and become domain-agnostic.

Dave Vellante: Got it, okay, makes sense. All right, let's shift gears here and talk about some of the gaps and some of the standards that are needed. You and I have talked about this a little bit before, but this digs deeper. What types of standards are needed? Maybe you could walk us through this graphic.

Zhamak Dehghani: Sure. What I'm trying to depict here is that if we imagine a world where data can be shared from many different locations, for a variety of analytical use cases, the boundary of what we call a node on the mesh naturally encapsulates a fair few pieces internally. It's not just the data itself that the node is controlling, updating, and maintaining; it's also the computation and the code responsible for that data, and the policies that continue to govern the data as long as it exists. If that's the boundary, and we shift our focus away from implementation details, which we can leave for later, what becomes really important is the seam: the APIs and interfaces that this node exposes. And I think that's where the work needs to be done, and where the standards are missing. We want that seam, those interfaces, to be open, because that's what allows different organizations, with different boundaries of trust, to share data.
And not just share data by moving it to yet another location; share it in a way that distributed workloads, distributed analytics, and distributed machine-learning models can run on the data where it is. If you follow that line of thinking, decentralization and connection of data versus collection of data, then a very important piece of it, one that needs really deep thinking, and I don't claim that I have done that, is: how do we share data responsibly and sustainably, in a way that is not brittle? If you think about it, one of the very common ways we share data today is, "I'll give you a JDBC endpoint, an endpoint to your database of choice," and now you, as a user of that technology, have access to the schema of the underlying data and can run various SQL queries against it. That's very simple and easy to get started with; that's why SQL is an evergreen standard, or semi-standard, pseudo-standard, that we all use. But it's also very brittle, because we become dependent on an underlying schema and formatting of the data that was designed to tell the computer how to store and manage the data.

So I think the data-sharing APIs of the future really need to remove those brittle dependencies, and to share not only the data but what we call metadata: an additional set of characteristics that is always shared along with the data, to make the data's usage ethical, and also friendly for the users. The other element of a data-sharing API is to allow computation to run where the data exists. If you think about SQL again as a simple, primitive example of computation: when we select, when we filter, when we join, the computation is happening on the data. So maybe there is a next level of articulating distributed computation on data, training models, say, where your language primitives change to allow sophisticated analytical workloads to run on the data more responsibly, with policies and access control enforced. So that output port I mentioned is really about next-generation, responsible data-sharing APIs, suitable for decentralized analytical workloads.
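Here is one hedged way to picture such an output port: data is returned together with its metadata, policy is enforced at the moment of access, and the internal representation stays hidden. All of the names below are hypothetical; this is a sketch of the idea, not a standard.

```python
# Hypothetical sketch of an "output port": a data-sharing API that returns
# data together with the metadata a consumer needs to use it responsibly,
# enforces policy at access time, and never exposes the underlying schema
# or storage.

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SharedDataset:
    records: list[dict]  # the data itself
    metadata: dict       # always travels with the data


class OutputPort:
    def __init__(self, records: list[dict], allowed_fields: set[str]):
        self._records = records        # internal representation, not exposed
        self._allowed = allowed_fields # policy lives at the port

    def read(self, requested_fields: set[str]) -> SharedDataset:
        # Consumers name the fields they want; policy filters what they get.
        fields = requested_fields & self._allowed
        data = [{k: r[k] for k in sorted(fields) if k in r} for r in self._records]
        return SharedDataset(
            records=data,
            metadata={
                "fields": sorted(fields),
                "retrieved_at": datetime.now(timezone.utc).isoformat(),
                "owner": "orders-domain",       # who is accountable for the data
                "freshness": "updated hourly",  # usability info, shared by default
            },
        )


port = OutputPort(
    records=[{"order_id": 1, "total": 30.0, "card_number": "4111..."}],
    allowed_fields={"order_id", "total"},  # card_number is never shared
)
print(port.read({"order_id", "total", "card_number"}).records)
# [{'order_id': 1, 'total': 30.0}] -- policy applied where the data lives
```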
Dave Vellante: Okay, so I'm not trying to bait you here, but I have a follow-up as well. The schema, for all its good, creates constraints. But no schema, schema-on-read, well, that didn't work, because it was just a free-for-all and it created the data swamps. Now you have technology companies trying to solve that problem. Take Snowflake, for example: enabling data sharing, but within its proprietary environment. And certainly Databricks is doing something similar, trying to come at it from its angle, bringing some of the best of the data warehouse together with data science. Is your contention that those remain sort of proprietary, de facto standards, and that what we need is more open standards? Maybe you could comment.

Zhamak Dehghani: Sure, I think there are two points. One is, as you mentioned, open standards that actually make the underlying platform invisible. My litmus test for a technology provider that says "I'm data mesh compliant" is: is your platform invisible? As in, can I replace it with another and still get the same data-sharing experience I need? So part of it is open standards, so that they're not really proprietary. The other angle, for sharing data across different platforms, so that we don't get stuck with one technology or another, is around APIs, around code that protects that internal schema. Where we are on the curve of technology evolution right now, we're exposing the internal structure of the data, which was designed to optimize certain modes of access, to the end client and the application APIs. The APIs that use the data today are very much aware that this database was optimized for machine-learning workloads, hence you will deal with a columnar file format, while this other API is optimized for a very different, report-type, relational access, and is optimized around rows. I think that should become irrelevant in the data-sharing APIs of the future, because as a user, I shouldn't care how this data is internally optimized. The language primitives I'm using should be agnostic to the machine optimization underneath. And if you did that, perhaps this war between the warehouse, the lake, and the rest would actually become irrelevant. We're optimizing for the best human experience, as opposed to the best machine experience. We still have to do the machine optimization, but we make it invisible, an implementation concern. So if we daydream together about the best, most resilient experience of data usage, these APIs become agnostic to the internal storage structure.
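A tiny sketch of that storage agnosticism: the same consumer code runs against a row-oriented and a column-oriented backend, because it is written against a logical interface rather than a physical layout. Both stores are toy stand-ins, not real engines.

```python
# Sketch of "the API should be agnostic to the machine optimization
# underneath": the same logical read works whether the product stores its
# data row-wise or column-wise.

class RowStore:
    def __init__(self, rows):      # e.g., [{"id": 1, "total": 30.0}]
        self._rows = rows

    def column(self, name):
        return [r[name] for r in self._rows]


class ColumnStore:
    def __init__(self, columns):   # e.g., {"id": [1], "total": [30.0]}
        self._columns = columns

    def column(self, name):
        return list(self._columns[name])


def average(store, field):
    # Consumer code: written once against the logical interface,
    # never against the physical layout.
    values = store.column(field)
    return sum(values) / len(values)


rows = RowStore([{"id": 1, "total": 30.0}, {"id": 2, "total": 50.0}])
cols = ColumnStore({"id": [1, 2], "total": [30.0, 50.0]})
assert average(rows, "total") == average(cols, "total") == 40.0
```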
Dave Vellante: Great, thank you for that. Okay, we're up to our ankles now in the controversy, so we might as well wade all the way in; I can't let you go without addressing some of this, which you've catalyzed, and which, by the way, I see as a sign of progress. This gentleman, Paul Andrew, is an architect, and he gave a presentation, I think last night, and he teased it as, quote, "the theory from Zhamak Dehghani versus the practical experience of a technical architect, aka me," meaning him. And Zhamak, you were quick to shoot back that data mesh is not theory, it's based on practice, and some practices are experimental, some are more baked, and data mesh really avoids, by design, the specificity of vendor or technology. And then you say, "perhaps you intend to frame your post as a technology- or vendor-specific implementation." So, touché, that was excellent. Now, you don't need me to defend you, but I will anyway. You spent 14-plus years as a software engineer, and the better part of a decade consulting with some of the most technically advanced companies in the world. But I'm going to push you a little bit here and say that some of this tension is of your own making, because you purposefully don't talk about technologies and vendors, and sometimes doing so is instructive for us neophytes. So why don't you ever use specific examples of technology for frames of reference?

Zhamak Dehghani: Yes, my role is to push us to the next level. Everybody picks their battles; my role in this battle is to push us to think beyond what's available today. Of course, that's my public persona. On a day-to-day basis I actually work with clients and existing technology, and at Thoughtworks we gave a case-study talk with a colleague of mine, and I intentionally got him to talk about the technology that we used to implement data mesh. The reason I haven't really embraced specific technologies in my conversations is, one, I feel the technology solutions we're using today are still not ready for the vision. We have to take transitional steps, no matter what. We have to be pragmatic, of course, and practical, and use the existing vendors, and I wholeheartedly embrace that; it's just not my role. I've gone through this transformation once before in my life. When microservices happened, we were building microservices-like architectures with technology that wasn't ready for it: big web application servers that were designed to run giant monolithic applications, and now we were trying to run little microservices on them. The tail was wagging the dog; the environmental complexity of running these services consumed so much of our effort that we couldn't really pay attention to the business logic, the business value. And that's where we are today: the complexity of integrating existing technologies overwhelmingly captures a lot of our attention, cost, and effort, as opposed to really focusing on the data products themselves. So that's just the role I have. But it doesn't mean we have to rebuild the world; we've got to do with what we have in this transitional phase, until the new generation of technologies comes around and reshapes our landscape of tools.

Dave Vellante: Well, impressive public discipline. Your point about microservices is interesting, because a lot of those early microservices weren't so micro, and for the naysayers, past is not prologue; Thoughtworks was really early on in the whole concept of microservices, so I'll be very excited to see how this plays out. Now, there were some other good comments. There was one from a gentleman who said the most interesting aspects of data mesh are organizational, and that's how my colleague Sanjeev Mohan frames data mesh versus data fabric. I'm not sure; I think we've only scratched the surface today, and data mesh is more than that. I still think data fabric is what NetApp defined as a software-defined storage infrastructure that could serve on-prem and public cloud workloads, back in, whenever, 2016.
But the point you make in the thread we're showing here is a warning, and you referenced this earlier, that segregating different modes of access will lead to fragmentation, and we don't want to repeat the mistakes of the past.

Zhamak Dehghani: Yes. Going back to that original conversation: at the macro level, we've got this tendency to decompose complexity based on technical solutions. The conversation becomes, "Oh, I do batch and you do stream, and we are different." We create these bifurcations in our decisions based on the technology: I do events and you do tables. That sort of segregation of modes of access causes accidental complexity that we keep dealing with, because every time you create a new branch in this tree, you create a new set of tools that then somehow need to be point-to-point integrated, and you create new specialization around them. So the fewer branches we have, the better. Think about the continuum of experiences we need to create, and about technologies that simplify that continuum. As an example, given my past experience, I was really excited about the papers and the work that came out around Apache Beam, and generally around flow-based programming and stream processing, because basically they were saying: whether you're doing batch or whether you're doing streaming, it's all one stream. Sometimes the window of time over which you're computing narrows, and sometimes it widens, but at the end of the day you're just doing stream processing. It's those sorts of notions, simplifying and creating a continuum of experience, that resonate with me personally, far more than creating tribal fights of this type versus that mode of access. That's why the mesh naturally selects multimodal access, to support the personas of the end users.
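The Beam model she's describing looks roughly like this in the real Beam Python SDK (pip install apache-beam); only the pipeline content is invented. A batch source and a streaming source feed the same code, and the window size is the only thing that changes.

```python
# Batch and streaming as one model: only the window over which you compute
# changes. Uses the Apache Beam Python SDK; the totals-per-minute pipeline
# itself is a made-up example.

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create([("orders", 30.0), ("orders", 50.0), ("refunds", 5.0)])
        | "Stamp" >> beam.Map(lambda kv: TimestampedValue(kv, 0))  # event time; a streaming source would supply this
        | "Window" >> beam.WindowInto(FixedWindows(60))            # the only knob: narrow for streaming, wide for batch
        | "Sum" >> beam.CombinePerKey(sum)
        | beam.Map(print)                                          # ('orders', 80.0), ('refunds', 5.0)
    )
```

Swapping the bounded Create for an unbounded source (say, a Kafka or Pub/Sub read) leaves the windowing, combining, and output stages untouched, which is exactly the continuum-of-experience argument.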
Dave Vellante: Okay, so the last topic I want to hit: this whole discussion, the topic of data mesh, is highly nuanced, it's new, and people are going to shoehorn data mesh into their respective views of the world. We talked about lakehouses and S3 buckets, and of course the gentleman from LinkedIn with Azure; Microsoft has a data mesh community. So you're going to have to enlist a serious army of enforcers to adjudicate. And I just wrote some of this down: Monte Carlo has a data mesh calculator, Starburst is leaning in, ChaosSearch sees themselves as an enabler, Oracle and Snowflake both use the term data mesh. And then of course you've got big practitioners: JPMC, we've talked to Intuit's Orlando, HelloFresh has been on, Netflix has its event-based streaming implementation. So my question is: how realistic is it that the clarity of your vision can be implemented and not polluted by really rich technology companies and others? Is it even possible?

Zhamak Dehghani: Right, is it even possible? Yes, I think it's going to be hard. What I'm hopeful about is that the socio-technical labeling of data mesh, that this is a socio-technical concern and solution, not just a technology solution, always brings us back to reality. When vendors try to sell you something they say solves all of your data mesh problems, that's just going to cause more problems down the track. So we'll see; time will tell. And Dave, I count on you as one of the folks who will continue to use their platform to bring us back to the root, the "why," why we're doing this in the first place. I dedicated a whole part of the book to "why," because, as you said, we get carried away with vendors, and technology solutions try to ride the wave, and in that story we forget the reason for which we're even making this change and spending all of these resources. So hopefully we can always come back to that.

Dave Vellante: Yeah, and I think we can. You have really given this some deep thought, and as we pointed out, this was based on practical knowledge and experience. And look, we've been trying to solve this data problem for a long, long time; you've not only articulated it well, but you've come up with solutions. So Zhamak, thank you so much. We're going to leave it there, and I'd love to have you back.

Zhamak Dehghani: Thank you for the conversation. I really enjoyed it, and thank you for sharing your platform to talk about data mesh.

Dave Vellante: You bet. All right, I want to thank my colleague Stephanie Chan, who helps research topics for us. Alex Myerson is on production, and Kristen Martin, Cheryl Knight, and Rob Hof are on editorial. Remember, all these episodes are available as podcasts; wherever you listen, just search "Breaking Analysis podcast." Check out ETR's website at etr.ai for all the data. We publish a full report every week on wikibon.com and siliconangle.com.

You can reach me by email at david.vellante@siliconangle.com or DM me @dvellante, and hit us up on our LinkedIn post. This is Dave Vellante for theCUBE Insights, powered by ETR. Have a great week, stay safe, be well, and we'll see you next time.


