Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks
In 2009, Hal Varian, Google's chief economist, said that statistician would be the sexiest job of the coming decade. The modern big data movement really took off later the following year, after the second Hadoop World, which was hosted by Cloudera in New York City. Jeff Hammerbacher famously declared to me and John Furrier on theCUBE that the best minds of his generation were trying to figure out how to get people to click on ads, and he said "that sucks." The industry was abuzz with the realization that data was the new competitive weapon, and Hadoop was heralded as the new data management paradigm.

What actually transpired over the next 10 years was that only a small handful of companies could really master the complexities of big data and attract the data science talent necessary to realize massive returns. Back then, at the beginning of the last decade, cloud was in the early stages of its adoption. As the years passed, more and more data moved to the cloud, and the number of data sources absolutely exploded. Experimentation accelerated, as did the pace of change. Complexity overwhelmed big data infrastructures and data teams, leading to a continuous stream of incremental technical improvements designed to try to keep pace: data lakes, data hubs, new open source projects, new tools, all of which piled on even more complexity. And as we've reported, we believe what's needed is a complete bit flip in how we approach data architectures.

Our next guest is Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks. Zhamak is a software engineer, architect, thought leader, and advisor to some of the world's most prominent enterprises. She is, in my view, one of the foremost advocates for rethinking and changing the way we create and manage data architectures: favoring decentralized over monolithic structures, and elevating domain knowledge as a primary criterion in how we organize so-called big data teams and platforms. Zhamak, welcome to theCUBE. It's a pleasure to have you on the program.

Hi David, it's wonderful to be here.

Okay, so you're pretty outspoken about the need for a paradigm shift in how we manage our data and our platforms at scale. Why do you feel we need such a radical change? What are your thoughts?

Well, I think if you just look back over the last decades (you gave a summary of what happened since 2010, but it holds even if we go before then), what we have done is basically repeat, and, as you mentioned, incrementally improve, how we manage data, based on certain assumptions around centralization: data has to be in one place so we can get value from it. But if you look at the parallel movement of our industry in general since the birth of the internet, we are actually moving toward decentralization. Set data aside for a moment: if we said the only way the web could work, the only way we could get access to the various applications and pages on the web, was to centralize it all, we would laugh at that idea. Yet for some reason we don't question it when it comes to data. So I think it's time to embrace the complexity that comes with the growth and proliferation of sources and consumption models; to embrace the fact that the sources of data are distributed, that they're not just within one part of the organization, not even within the bounds of the organization, but beyond them; and then look back and say:
okay, if that's the trend of our industry in general, given the fabric of computation and data we have put in place globally, how do the architecture, the technology, and the organizational structures and incentives need to change to embrace that complexity? To me that requires a paradigm shift across the full stack: how we organize our organizations, how we organize our teams, how we put technology in place, all approached from a decentralized angle.

Okay, so let's unpack that a little bit. You've spoken and written about how today's big data architecture, as you've basically just said, is flawed. So I want to bring up one of your diagrams (I love your diagrams); you have a simple one. Guys, if you could bring up figure one. So on the left here we're ingesting data from the operational systems, other enterprise data sets, and of course external data; we cleanse it, because you've got to do the quality thing; and then we serve it up to the business. So what's wrong with that picture we just described, granted that it's a simplified form?

Yeah, quite a few things. I would flip the question maybe back to you, or to the audience. There are so many sources of data, and the data comes from systems and teams that are very diverse in terms of domains. Just think about retail: e-commerce versus order management versus customer are very diverse domains. So the data comes from many different domains, and yet we expect to put it under the control of a centralized team and a centralized system. And that centralization: if you zoom out it's centralized, but if you zoom in it's compartmentalized based on functions, and we can talk about that. We assume that this centralized model can take that data, make sense of it, cleanse and transform it, and then satisfy the needs of a very diverse set of consumers, without really understanding the domains, because the teams responsible for it are not close to the source of the data. So there is a cognitive gap, a domain-understanding gap, and a gap in knowing how the data is actually going to be used.

When I came up with this idea, I talked to a lot of data teams globally, just to see what the pain points were and how they were doing things. One thing that was evident in all of those conversations was that after they built these pipelines and landed the data, whether in data warehouse tables or in a lake, they didn't know how the data was being used; yet they were responsible for making it available for this diverse set of use cases. So a centralized, monolithic system often becomes a bottleneck. What you find is that a lot of these teams are struggling to satisfy the needs of the consumers and struggling to really understand the data. The domain knowledge is lost; there is a loss of understanding in that transformation. Often we end up training machine learning models on data that is not really representative of the reality of the business, and when we put them into production they don't work, because the semantics and the syntax of the data get lost in that translation. And we are struggling to find people to manage these centralized systems, because the technology is still, in my opinion, fairly low level,
and it exposes its users, say the users of a warehouse, to a lot of complexity. So in summary, I think it's a bottleneck; it's not going to keep up with the pace of change, the pace of innovation, and the pace at which new sources become available. It's disconnected and fragmented, even though it's centralized: disconnected and fragmented from where the data comes from and where the data gets used. And it's managed by a team of hyper-specialized people who struggle to understand the actual value and the actual format of the data. So it's not going to get us where our aspirations and ambitions need to be.

Yeah, so the big data platform is essentially, I think you call it, context-agnostic. And as data becomes more important in our lives, you've got all these new data sources injected into the system, and experimentation, as we said, becomes much, much easier with the cloud. So one of the blockers you've cited, and you just mentioned it, is these hyper-specialized roles: the data engineer, the quality engineer, the data scientist. And it's illusory: they seemingly are independent and can scale independently, but I think you've made the point that in fact they can't, that a change in a data source has an effect across the entire data lifecycle, the entire data pipeline. So maybe you could add some color to why that's problematic for some of the organizations you work with, and maybe give some examples.

Yeah, absolutely. In fact, the initial hypothesis around data mesh came from a series of requests we received from clients that were both large-scale and progressive, progressive in terms of their investment in data architecture. These were clients that were large, at scale, with a diverse and rich set of domains: some were big tech companies, some were big retail companies, big healthcare companies, so they had that diversity of data and of domain sources. They had invested for quite a few years in multiple generations of proprietary on-prem data warehouses that they were moving to the cloud; they had moved through various revisions of Hadoop clusters. And the challenge they were facing, to simplify it into one phrase, was that they were not getting value from the data they were collecting. They were continuously struggling to shift the culture, because there was so much friction across all three phases: consuming the data from the sources, transforming it and making it available, and then serving it to the consumers. That whole process was full of friction, and everybody was unhappy.

The bottom line is that you're collecting all this data, and there is delay, and there is a lack of trust in the data itself, because the data is not representative of the reality: it has gone through transformations by people who didn't really understand what it was, and it got delayed, so there's no trust. It's hard to get to the data, and ultimately it's hard to create value from it. People are working really hard, under a lot of pressure, but they're still struggling. So we often point at the technology as the solution. We go: okay,
this version of the proprietary data warehouse we're using isn't the right one; we should go to the cloud, and that will certainly solve our problems. Or: the warehouse wasn't the right approach, let's do the lake version instead. With a warehouse, you extract, then transform, then load into the database, and that transformation is a heavy process, because you have fundamentally made the assumption that if I transform this data into a perfectly designed multi-dimensional schema, then everybody can run whatever query they want, and that will solve everybody's problem. In reality it doesn't, because you're delayed, and there is no universal model that serves everybody's needs; data scientists, for one, don't necessarily want perfectly modeled data, because they're looking for both the signal and the noise. So then we went from ETL to the lake: let's move the transformation to the last mile, just load the data into object stores as semi-structured files, and let the data scientists use it. But they're still struggling, because of the problems we mentioned.

So then what is the solution? Well, a next-generation data platform: let's put it on the cloud. And we saw clients that had actually gone through a year, or multiple years, of migration to the cloud, call it 18 months on average: I've seen nine-month migrations of the warehouse and two-year migrations of the various data sources to the cloud. But ultimately the result is the same: unsatisfied, frustrated data users and data providers, with no ability to innovate quickly on relevant data, and without the experience they deserve, a delightful experience of discovering and exploring data that they trust. All of that was still a miss. So something more fundamental needed to change than just the technology.

So the linchpin of your scenario is this notion of context. You made the observation that, look, we've made our operational systems context-aware, but our data platforms are not. Like CRM systems: sales people are very comfortable with what's in the CRM system, because they own the data. So let's talk about the answer that you and your colleagues are proposing. You're essentially flipping the architecture, whereby the domain knowledge workers, the builders, if you will, of data products or data services, become first-class citizens in the data flow, injecting domain knowledge into the system by design. So I want to put up another one of your charts. Guys, bring up figure two. It talks about convergence: you show distributed domain-driven architecture, self-serve platform design, and this notion of product thinking. So maybe you could explain why this approach is so desirable, in your view.

Sure. The motivation and inspiration for the approach came from studying what has happened over the last few decades in operational systems. We had a very similar problem prior to microservices, with monolithic systems: the monolith was the bottleneck, and the changes we needed to make were always orthogonal to how the architecture was centralized. And we found a nice niche. I'm not saying it's the perfect way of decoupling a monolith, but for where we currently are in our journey to become data-driven, it
is a nice place to be: distribution, a decomposition, of your system and of your organization. Whenever we talk about systems we have to talk about the people and teams responsible for managing those systems. So: decompose the systems, the teams, and the data around domains, because that's how we are decoupling our businesses today, and that's a good thing. What does that really do for us? It localizes change to the bounded context of that business; it creates clear boundaries, interfaces, and contracts between that particular team and the rest of the organization; and so it removes the friction we often have in both managing change and serving data or capabilities.

So the first principle of data mesh is: let's decouple this world of analytical data to mirror the way we have decoupled our systems, our teams, and our business. Why should data be any different? The moment you do that, the moment you bring ownership to the people who understand the data best, you get the question: well, how is that any different from the silos of disconnected databases we have today, where nobody can get to the data? The rest of the principles really address the challenges that come with this first principle of decomposition around domain context.

The second principle is that we have to expect a certain level of quality, accountability, and responsibility from the teams that provide the data. So let's bring product thinking, treating data as a product, to the data these teams now share, and put accountability around it. We need a new set of incentives and metrics for domain teams to share their data, and a new set of quality metrics that define what it means for data to be a product; we can go through that conversation perhaps later. So the second principle is: the domain teams responsible for their analytical data need to provide it with a certain level of quality and assurance. Let's call it a product, and bring product thinking to it.

Then the next question, which you get asked by CIOs and CTOs and the people who build the infrastructure and spend the money, is: it's actually quite complex to manage big data, and now you want every independent team to manage a full stack of storage, computation, pipelines, access control, and all the rest? Well, we've solved that problem in the operational world, and it requires a new level of platform thinking: providing infrastructure and tooling to the domain teams so that they can manage and serve their own big data. I think that ultimately requires reimagining the world of our tooling and technology, but for now let's just say we need a new level of abstraction to hide away the ton of complexity people are unnecessarily exposed to. That's the third principle: creating self-serve infrastructure that allows autonomous teams to build their domains.

But then the last fundamental pillar is this: once you distribute a problem into smaller problems, you find yourself with another set of problems, namely, how am I going to connect this data? Insight happens and emerges from the interconnection of the data domains; it's not necessarily locked into one
domain. So the concerns around interoperability, standardization, and getting value from the composition and interconnection of these domains require a new approach to governance, and we have to think about governance very differently: based on a federated model, and based on a computational model. Once we have this powerful self-serve platform, we can computationally automate a lot of governance, security, and policy decisions, and apply them across this fabric of the mesh, not just within a single domain, and not in a centralized mode. So really, as you mentioned, the most important component of the dynamic is the distribution of ownership, of architecture, and of data; the rest of the principles exist to solve the problems that come with that.

Very powerful. And guys, we actually have a picture of what Zhamak just described: bring up figure three, if you would. So essentially you're advocating pushing the pipeline, and all its various functions, into the lines of business, and abstracting the complexity of the underlying infrastructure, which you show here in this figure as the data infrastructure platform down below. And you know what I love about this, Zhamak? To me it underscores that data is not the new oil, because I can put oil in my car, or I can put it in my house, but I can't put the same quart in both places. I think you call it polyglot data, which is really different forms, batch or whatever, of the same data. Data doesn't follow the laws of scarcity: I can use the same data for many, many uses, and that's what this graphic shows. And then you brought in the really important sticking point, which is governance, which is now not command-and-control but federated governance. So maybe you could add some thoughts on that.

Sure, absolutely. I keep referring to data mesh as a paradigm shift, and not just to make it sound grand and exciting or important. It's because I want to point out that we need to question ourselves at every moment when we make a decision about how we're going to design security, or governance, or the modeling of the data. We need to reflect and ask: am I applying cognitive biases from how this has worked for the last 40 years, where I have seen it work, or do I really need to question it? And we do need to question the way we have applied governance. At the end of the day, the role and the objective of data governance remain the same: we all want quality data accessible to a diverse set of users, and those users now have different personas, the data analyst, the data scientist, the data application user, a very diverse set of people. At the end of the day we want quality data that is accessible to them, trustworthy, and easily consumable. However, how we get there looks very different. As you mentioned, the governance model in the old world has been very command-and-control, very centralized: they were responsible for quality, for certification of the data, for making sure the data complies with all sorts of regulations, for making sure data gets discovered and made available. In the world of data mesh, the job of data governance as a function really becomes finding the equilibrium between which decisions need to be made and enforced globally, and which decisions need to be made
locally, so that we can have an interoperable mesh of data sets that can move fast and change fast. Instead of putting the system in a straitjacket of staying constant and never changing, we embrace change, the continuous change of the landscape, because that's a reality we can't escape. So the governance model I call federated and computational. By federated I mean every domain needs to have a representative on the governance team: the role of the domain data product owner, someone who understands the data of that domain really well but also wears the hat of a product owner, is an important role that has to have representation there. So it's a federation of domains coming together, plus the SMEs, the subject matter experts who understand the regulations in that environment and the data security concerns. But instead of trying to enforce things as a central team, they decide what needs to be standardized and enforced globally, and push that, computationally and in an automated fashion, into the platform itself. For example, instead of being part of the data quality pipeline and injecting ourselves as people into that process, let's, as a group, define what constitutes quality, how we measure it, and then automate that and codify it into the platform, so that every data product has a CI/CD pipeline, and as part of that pipeline the quality metrics get validated, and every data product publishes its SLOs, its service-level objectives: whatever we choose as measures of quality, maybe the integrity of the data, the delay in the data, the timeliness of it. Whatever those decisions are, let's codify them.
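[As a rough illustration of what "codifying quality into the platform" could look like, here is a minimal sketch in Python, assuming a hypothetical setup in which each domain team publishes SLOs for its data product and a shared, governance-defined check runs in every product's CI/CD pipeline. The class, metric names, and thresholds are invented for illustration and are not part of any particular platform.]

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical SLOs a domain team publishes alongside its data product.
@dataclass
class DataProductSLOs:
    max_staleness: timedelta   # the newest record may be at most this old
    min_completeness: float    # required fraction of non-null fields

# A check defined once by the federated governance group and run in every
# data product's CI/CD pipeline, so the policy is enforced computationally
# rather than by a central team inspecting each dataset by hand.
def validate_slos(slos: DataProductSLOs,
                  last_updated: datetime,
                  completeness: float) -> list[str]:
    violations = []
    if datetime.now(timezone.utc) - last_updated > slos.max_staleness:
        violations.append("data is staler than the published SLO allows")
    if completeness < slos.min_completeness:
        violations.append(f"completeness {completeness:.2%} is below "
                          f"the {slos.min_completeness:.2%} SLO")
    return violations

# Example: an imagined 'orders' data product declares its SLOs, and the
# pipeline fails the build on any violation.
slos = DataProductSLOs(max_staleness=timedelta(hours=6),
                       min_completeness=0.99)
violations = validate_slos(
    slos,
    last_updated=datetime.now(timezone.utc) - timedelta(hours=2),
    completeness=0.995)
assert not violations, violations
```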
So the objectives of the governance function stay the same, but how it meets them is very, very different. I wrote a new article recently trying to explain the logical architecture that emerges from applying these principles, and I included a light table comparing and contrasting how we do governance today with how we would do it differently, just to give people a flavor of what it means to embrace decentralization and continuous change. So hopefully that can be helpful.

Yes, very. I have so many questions. But on the point you make about data quality: sometimes I feel like quality is treated as the end game, whereas the end game should be how fast you can go from idea to monetization with a data service. What happens (and again, you've sort of addressed this) to the underlying infrastructure? I mean, spinning up EC2 instances and S3 buckets and PyTorch and TensorFlow: where does that live in the business, and who's responsible for it?

Yeah, I'm glad you're asking this question, David, because I truly believe we need to reimagine that world. I think there are many pieces we can use as utilities and foundational pieces, but I can see, for myself, a five-to-seven-year roadmap of building this new tooling. In terms of ownership, it remains with a platform team, but perhaps a domain-agnostic, technology-focused team, in the sense that they are providing a set of products themselves, and the users of those products are the data product developers, the data domain teams, who now have really high expectations around low friction and short lead time to create a new data product. So we need a new set of tooling, and I think the language needs to shift from "I need a storage bucket," or "I need a storage account," or "I need a cluster to run my Spark jobs," to: "Here's the declaration of my data product. This is where its data will come from; this is the data I want to serve; these are the policies I need applied, perhaps around encryption or access control. Go make it happen, platform; go provision everything I need." So that as a data product developer, all I focus on is the data itself, the representation of the semantics and the syntax, and making sure the data meets the quality I have to assure and is available. The provisioning of everything that sits underneath has to be taken care of by the platform, and that's what I mean by requiring a reimagination.
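[A minimal sketch of that shift in language, again in Python and again hypothetical: the spec fields, the example values, and the commented-out provision call are stand-ins for whatever a real self-serve platform would expose, not an actual API.]

```python
from dataclasses import dataclass, field

# A hypothetical declarative data product spec: the developer states
# intent, and the platform is responsible for provisioning the storage,
# compute, pipelines, and access control underneath it.
@dataclass
class DataProductSpec:
    name: str
    domain: str
    inputs: list[str]          # upstream sources the data comes from
    output_ports: list[str]    # how the data is served to consumers
    policies: dict[str, str] = field(default_factory=dict)

spec = DataProductSpec(
    name="customer-orders",
    domain="order-management",
    inputs=["orders-service.events"],
    output_ports=["parquet-files", "sql-table"],
    policies={"encryption": "at-rest", "access": "internal-analysts"},
)

# platform.provision(spec)  # imagined platform call: it would turn the
# declaration into buckets, clusters, pipelines, and access rules.
```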
And in fact, there will be a data platform team. The data platform teams we set up for our clients themselves have a fair bit of complexity internally: they divide into multiple teams, multiple planes. There would be a plane, as in a group of capabilities, that serves the data product developer experience. There would be a set of capabilities that deals with the nitty-gritty underlying utilities; I call them utilities at this point because, to me, the level of abstraction of the platform has to go higher than where it is today, so what we call a platform today is really a set of utilities: we will continue using object storage, we will continue using relational databases, and so on. So there will be a plane, and a group of people, responsible for that. And there will be a group of people responsible for the capabilities that enable mesh-level functionality: for example, being able to correlate, connect, and query data across multiple nodes is a mesh-level capability, and so is being able to discover and explore the mesh's data products. So it would be a set of teams, platform product teams, with strong product thinking and product ownership embedded, to serve the experience of these now business-oriented domain data teams. We have a lot of work to do.

I could go on, but unfortunately we're out of time. I do want to tell people there are two pieces you've put out so far: one is "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh," which you should read, and the other is "Data Mesh Principles and Logical Architecture," kind of a part two. I guess my last question, in the very limited time we have, is: are organizations ready for this?

I think the desire is there. I've been overwhelmed by the number of large, medium, and small organizations, private and public and government, that have reached out to us globally; this is a global movement, and I'm humbled by the response of the industry. The desire is there, the pains are real, and people acknowledge that something needs to change, so that's the first step. Awareness is spreading, and organizations are becoming more and more aware. In fact, many technology providers are reaching out to us asking what they should do, because their clients are already asking for it: we need the data mesh, we need the tooling to support it. So in terms of that first step of readiness, the awareness is there.

However, the ingredients of a successful transformation require both top-down and bottom-up support. It requires support from chief data and analytics officers or above; the most successful data mesh clients we have are the ones where the CEO has made a statement, "we want to change the experience of every single customer using data, and we're going to commit to this," so that investment and support exist from the top through all the layers, the engineers are excited, and perhaps the traditional data teams are open to change. A lot of ingredients of transformation need to come together. Are we really ready for it? I think the pioneers: if you think about the innovation adoption curve, the pioneers, the innovators, and the early adopters are making moves toward it. And hopefully, as the technology becomes more available, the organizations that are less engineering-oriented, that don't have the capability in-house today but can buy it, will come next. Maybe those are not the ones who are quite ready for it yet, because the technology is not readily available, and today it requires internal investment.

I think you're right on. I think the leaders are going to lean in hard, and they're going to show us the path over the next several years. And I think the end of this decade is going to be defined a lot differently than the beginning. Zhamak, thanks so much for coming on theCUBE and participating in the program.

All right, keep it right there, everybody. We're back right after this short break.