Zhamak Dehghani, Kafka Summit Europe 2021 Keynote: How to Build the Data Mesh Foundation


(smooth upbeat music) - If you haven't heard anyone talking about data mesh yet, I don't wanna say I'm surprised, but it's like when somebody tells you they haven't seen some classic movie like "Star Wars" or something. The good news is that they get to see it for the first time. And that's the deal with data mesh and ThoughtWorker, Zhamak Dehghani. She's gonna tell us how legacy data systems haven't quite delivered what we've wanted.

That's probably an easy sell for this audience, and she's really gonna dig into why, in a social and architectural sense, they were not structured properly and how to fix that. Listen for her to give the four principles of data mesh systems. I think they're really compelling, and I think as you follow along you're gonna see that this is a system that cries out to be implemented with Kafka. It's like Kafka was waiting patiently for an idea like this to come along. So please let me welcome to the virtual stage, Zhamak Dehghani. (smooth upbeat music) - Hi everyone, I'm Zhamak Dehghani.

I'm the Director of Emerging Technologies at ThoughtWorks North America. I'm excited to be here with you today talking about data mesh, sharing my experience along the way with you and explaining what it is and why this concept came to exist a few years back. As a technology consultant, I can share this with you: there are great expectations around data. Every company and executive that I work with has ambitious plans around data-driven initiatives sprinkled all over their strategy. And just as an example, I wanna share a few mission statements with you. The first one is from a wonderful financial SaaS company, and their goal is to bring new value and new products to their customers using AI and the data that they have about their customers.

The second one is from a telco in North America. Their ambition is around streamlining the workflows and internal processes to serve their customers with their staff augmented with intelligence and AI solutions. And finally, my favorite is from a healthcare provider here in North America. Their mission is to improve the experience of their members at every touch point with their company using data and ML. And I'm sure your company has equally audacious plans. But the inconvenient truth is that for decades, almost, we've been trying to bootstrap ourselves to these goals, spending on our data platforms at an ever-accelerating pace and in ever-larger amounts, and yet failing on the transformational metrics.

Really failing to see value based on the investments that we're making. This is from the NewVantage Partners report, a report that I keep an eye on every year, and what it shows is that despite the accelerated pace and amount of investment, the results on the transformational metrics, changing your data culture, becoming data-driven, competing on data, aren't improving. Being able to go from the idea for an experiment, to getting access to the data, to running that experiment and validating whether your hypothesis was true or not: we're still not seeing that transformational change.

The failure symptoms that I saw, the ones that triggered the thought of questioning the existing paradigm, were these, and the most important ones are the two in the middle: the failure to scale. As we see more data sources becoming available to us, and as we become more ambitious around using that data to get value for our businesses, we're seeing these two pressure points that our existing technology solutions and architectures can't handle, from the input side and from the output side. And the solution that's often offered is, well, build the next data platform. So we find ourselves in this repeated cycle of bootstrapping yet another next data platform and ultimately failing to materialize data-driven value.

Let's look at the current data landscape. How have we been trying to address these plans and goals that we've had? For decades, up to today, this is how we have imagined data. We've had this great divide between operational data, the data that runs our business, the state of our stateful workloads like microservices and applications; and then on the other side of the organization, with different teams, different technologies, different architecture, we have our analytical data plane, which is intended to gather data from all of those systems and put it in one big place, a warehouse or a lake, so that we can run analytics or feed our machine learning algorithms to intelligently augment the services that we have. And in between, we've had our pipelines connecting the two planes.

And if you zoom into the blue box, the analytical data plane, we've had evolutionary change in the technology and our approach. We've had the first-generation data architecture, the data warehouse, where we get data out of all of those stateful workloads, out of our operational databases, and put it into pristine and perfectly designed canonical models that we can run SQL queries on top of and then visualize. And we've had our data lake. In about 2010 we decided, well, this process of perfectly modeling the data isn't satisfying the rapid experiments of data scientists, who are looking for the signal within the noise and want to see the noise. So we decided, let's get all the data out of those operational systems and still put it in one big place, but in a more semi-structured format, and then model it downstream. And now we are in this multimodal phase where we still put data onto the cloud...

It's very cloud-driven. So in the third generation, companies are mostly getting comfortable with putting their data on the cloud, and their architectures are multimodal in that they include big storage for the data lake as well as the warehouse for analytics, and then we have the lakehouse solutions, which bring kind of the SQL and transactional nature to the lake storage. So we're seeing this kind of multimodal change. So definitely we've had evolutionary improvement in the architecture. Most certainly we've been busy building technologies.

We've had this Cambrian explosion of tools and technologies around data solutions and analytical solutions. I didn't put this picture up for you to read the icons; I put it there so you can have a look, feel impressed, and then feel dizzy at how busy we've been building technology. So then what hasn't changed? What assumptions remained the same that led to those problems of scale that we're seeing? I wanna run through very quickly some of those basic assumptions that data mesh challenges.

And they're gonna be a little bit uncomfortable. If you're coming from decades of doing big data analytics, these next few slides are gonna make you uncomfortable. First and foremost, I think we have assumed that for data to be useful, it has to be centralized in one place.

And that has led to these monolithic architectures: getting data from ubiquitous sources around the organization, within the bounds of the organization or outside, into one big monolithic solution with centralized data answering the needs of the consumers. And then, as every architect knows, when we find ourselves constrained by the limitations of big, monolithic solutions, what do we do? We try to break them into their parts and integrate those parts together. And the easiest way to decouple a monolithic solution, in the case of big data solutions, is often around its technical decomposition.

So let's take the technical pieces, the technical functions or tasks, and use them to break up the architecture: ingestion, processing, serving, or other modes of technical decomposition. When you make a technical decomposition, the consequence is that you decompose your teams around the tasks, not necessarily around the outcomes and features: the task of ingestion, the task of serving, and the task of processing.

But change, whether it's new features, adding new sources, or addressing new use cases, is not usually localized to these neatly drawn boxes. It's actually orthogonal to them. And of course, we can't talk about architecture without talking about people.

So the folks who are building these solutions are hyper-specialized around the technology and are often stuck in the middle between the people who are generating the data, the point of origin for that data, and the folks who need the data. They sit in between, trying to make sense of the data and provide it to the consumers without really being involved in the domain, whether from the use-case perspective or from the perspective of the source and the nature of the data. So in short, data mesh challenges these assumptions: centralization of the data to get value out of it, centralized monolithic architectures that are built around centralized data solutions, technical decomposition of the architecture, and the way we have organized our teams, with centralized teams around the big data solutions.

So where do we go from here? What's an alternative? What I hope is for you to come out of this talk and say, that kind of made sense, I could have thought of this myself. So hopefully the principles I share with you are intuitive.

For decades now in the operational world, where we have seen an accelerating number of innovations in how we build microservices and the infrastructure around them, we've decided that monolithic solutions don't scale to meet the needs of our operational applications. So what did we do? Based on the work of Eric Evans, we decided that a complex system is best decomposed around its domain functions, around how the business decomposes itself. So the first principle of data mesh is to think differently about drawing the boundaries between the components of the architecture, and between the data itself, using the natural evolution that we've been going down the path of over the last few decades in the operational world and extending that to data.

So the first principle is decomposition of the data and the architecture around domains, which is how we decompose our business today. Once we do that, there are a lot of challenges that arise. Silos of data: how do we make sure the data doesn't just get siloed within one domain and become inaccessible? So the second principle of data mesh is really treating data as a product and building a new foundation of architecture, an architectural quantum, that represents this thing that we now call a data product. Well, when we now distribute ownership to these domains and we ask them to also build these data products, then we may end up with a highly costly solution.

The technology stack and the complexity around it. So the third principle is about building a platform foundation that takes that complexity away from the teams so that they can easily provide these autonomous data products. And finally, any distributed solution needs governance that really embraces change and distribution. So it's a federated model of governance, relying heavily on computation and automation in the platform layer to bring those governance concerns, around interoperability, around security, to life. So I'm just gonna touch on every one of these to see what it entails.

So if we imagine this centralized solution differently, around domains, what we end up with is decomposition of the data based on the domains. So if you have order management, customer management, logistics, these are the business functions you already have, and very likely, as a modern digital organization, you have organized your teams and your IT systems around them. So extend that to the data and make every domain responsible and accountable for providing its analytical data, the temporal, time-variant view of the data as it has happened within that domain, and serving that to its users. And then we end up with different styles or different classes of domains: domains aligned with the source, domains aligned with the consumption, or the ones in the middle, the aggregations like customer. So really the transformation around domain thinking is to imagine data differently.
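
As a concrete illustration of a source-aligned domain serving its own analytical data as an event stream, here is a minimal sketch. It assumes a Kafka broker on localhost:9092 and the confluent_kafka Python client; the topic name, event fields, and values are made up for illustration and are not prescribed by data mesh.

```python
# Minimal sketch: the order-management domain publishing its own analytical
# events to a topic it owns, rather than handing raw tables to a central team.
# Assumes a local Kafka broker and the confluent_kafka client; names are illustrative.
import json
from datetime import datetime, timezone

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

order_placed = {
    "event_type": "OrderPlaced",          # the domain fact, as it happened
    "order_id": "o-1234",
    "customer_id": "c-42",
    "total_amount": 99.50,
    "occurred_at": datetime.now(timezone.utc).isoformat(),
}

# The domain team owns this topic: it is one output port of its data product.
producer.produce(
    "order-management.order-events.v1",
    key=order_placed["order_id"],
    value=json.dumps(order_placed),
)
producer.flush()
```

The point is not the mechanics of producing a message; it is that the topic, its schema, and its quality are owned by the domain that knows the data best.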

Right now we imagine data as the source of life itself, as water, right? We flow it from the source through the pipelines, and we cleanse it, then bottle it, and then we serve it in the lake or in the warehouse. And I think that's a dangerous model of thinking, because we assume the data is not really usable at the source and that it has to go through this cleansing process by someone else to become useful downstream. Data mesh thinks that data actually can be useful right at the source, but we have to invent a new set of capabilities for what it means for data to be useful right at the source. It embraces change at scale, because we don't need just one canonical model to rule them all.

We don't need one big enterprise data model. We can think of data as really smaller models with intentional context mapping, intentional relationships described between them. Source of truth: you know, that's one of those goalposts that we've been chasing and never reached, and I think it's time to really rethink that.

I think we should look for the most relevant source, not just the one source; there could be many sources of the data available in the organization. And really change our language, change our language around pipelines. Pipelines are a useful architectural style, but they lack abstractions. They lack those fundamental pieces of architecture that can autonomously provide value to you, and they often result in a fragile architecture. So let's change that: make the pipeline an internal implementation detail, not the first-level construct of the architecture, and decompose the architecture around domains.
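
To make that shift concrete, here is a minimal sketch of a pipeline living inside a domain's data product rather than being a first-class architectural construct: it consumes the domain's own events, applies a small transformation, and publishes to the data product's output port. It again assumes confluent_kafka and a local broker; the topic names, consumer group, and derived fields are illustrative only.

```python
# Sketch: the pipeline as an *internal implementation detail* of one data
# product, consuming the domain's operational events and serving an
# analytical view on the product's output topic.
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-analytics-data-product",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["order-management.order-events.v1"])   # internal input
producer = Producer({"bootstrap.servers": "localhost:9092"})

try:
    while True:                       # simple polling loop, for illustration
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Internal transformation: derive an analytical fact from the event.
        daily_fact = {
            "order_id": event["order_id"],
            "order_date": event["occurred_at"][:10],
            "revenue": event["total_amount"],
        }
        producer.produce(
            "order-management.daily-order-facts.v1",        # output port
            key=daily_fact["order_id"],
            value=json.dumps(daily_fact),
        )
finally:
    producer.flush()
    consumer.close()
```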

So the challenge of distributed systems: distributing the architecture and the ownership of the data around domains is gonna cause a lot of trouble. The first question you'd get asked if you presented this in your company would be, well, how is that any better than the silos of databases I already have in my domains? Well, we need to create a new way of measuring success around data. We need to create new roles around data ownership.

And I use this idea of data as a product as a metaphor to describe that transformation. So what does product mean? It's kind of hard to define what a product is; there are whole books about it, but I love this one characterization by Marty Cagan: a successful product sits at the intersection of being usable, so delightful in the experience of its users and easy to use; valuable, so its users get real value out of it; and feasible to build, from both a technology and a business perspective. So let's bring these to data.

What does it mean for data to be usable? Let's just look at the basic usability characteristics. I list here a few of those that I think are absolutely critical in making data usable. When we think about the users of the data, we have the data analysts who are writing reports. They're comfortable with a set of tools that are mostly oriented around SQL queries.

And then we have, perhaps downstream, people who are gonna build new views of the data, new transformations of the data, and those folks may be more comfortable using events as the form of the data. So the very first thing, to really invert the model of our thinking, is that every domain provides its data as products in polyglot form, in formats that suit its users. So you might have customer data or order management data, or multiple data products out of that domain, but that same data can be represented in tabular format, as file-based columnar data for data scientists, or as events for downstream data product consumers. But put yourself in the shoes of that data analyst, that data scientist. What does their journey look like? The very first thing is that they wanna discover what data is out there. They have a hypothesis: they think that maybe if we change the way we personalize the experience of our users, we're gonna get more conversions from our sales.

Let's say you're an eCommerce system. So the very first thing is, what data do I have to prove or disprove my hypothesis? Discover it. Can I understand the data that I found? Can I actually look at, I dunno, a Jupyter Notebook and see how that data is being used? Can I look at its schema, semantics, and syntax and see if I can use it? Is it addressable? Can I securely request access to it without waiting for weeks to get approval? Is it interoperable with the rest of the data on this mesh, so I can actually connect this customer information with our orders, with our user behavior on the website? Can I trust it? So there is a set of characteristics that we need to build.
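
One way to picture those characteristics travelling with the data itself is a self-describing descriptor that each data product carries. The sketch below is only an illustration of the idea, not a standard: the field names, SLO values, and addresses are hypothetical.

```python
# Sketch of a self-describing data product descriptor carrying the usability
# affordances (discoverable, understandable, addressable, interoperable,
# trustworthy) with the product itself instead of in an external catalog.
from dataclasses import dataclass, field


@dataclass
class OutputPort:
    name: str        # e.g. "events", "files", "table"
    protocol: str    # how consumers access the data
    address: str     # addressable: a stable, resolvable location


@dataclass
class DataProductDescriptor:
    name: str
    domain: str
    owner: str                                        # accountable product owner
    description: str                                  # understandable
    schema_ref: str                                   # understandable: schema + semantics
    global_customer_id_field: str                     # interoperable across the mesh
    slos: dict = field(default_factory=dict)          # trustworthy: declared guarantees
    output_ports: list = field(default_factory=list)  # polyglot access


orders_dp = DataProductDescriptor(
    name="daily-order-facts",
    domain="order-management",
    owner="order-management-team",
    description="Orders as they happened, aggregated per day.",
    schema_ref="schemas/daily-order-facts-v1.avsc",
    global_customer_id_field="customer_id",
    slos={"timeliness": "<= 1 hour", "completeness": ">= 99.9%"},
    output_ports=[
        OutputPort("events", "kafka", "kafka://order-management.daily-order-facts.v1"),
        OutputPort("files", "parquet", "s3://analytics/order-management/daily-order-facts/"),
    ],
)
```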

And unfortunately, for decades the solutions that we've been building are external to the data itself. We think that data is just this dead body of information, some bytes that we dump on the disk, and then we layer discoverability on top through catalogs, or understandability through other documentation systems. Data mesh actually thinks that at every quantum of your architecture, you need to provide these affordances within that quantum of the architecture. So what does that quantum of architecture look like? What is the smallest piece of your architecture that embeds these usability characteristics to start with? And this is a departure from past thinking as well.

I mean, if you all close your eyes for a minute and imagine: what do you think of when I say the word data? You probably think of files or tables, this static thing, or maybe of streams. I mean, you're coming to a Kafka Summit, so maybe you think about moving bits and bytes on streams.

So that is definitely a component of a data product as a quantum of architecture, as the smallest part of the architecture with the structural elements to do its job, which is serving data to its consumers. That's definitely true, but I argue that this quantum of architecture needs to have more than that. It needs to have, of course, the metadata, all of the information that makes the data useful. It needs to have the code that serves and transforms it, and the part of the infrastructure that it runs on. Part of why microservices have been so successful is that we came up with the idea of a container that holds the API to get access to the logic, the logic execution itself, and the data that the microservice encapsulates. So I hope that we can bring a similar self-sustaining container to data: this architectural quantum I call a data product on the mesh.

And that's what we've been trying to build for the last couple of years. So the transformation from data to data as a product really needs us, again, to change mindset and language. We use this language of data as an asset. If data is an asset, what do you do? You just want it for yourself, you wanna hoard it, you wanna collect it, and you wanna have more of it.

When data is a product, how does that shift the perspective? You actually wanna share it. You get measured not by how many petabytes of data or thousands of data tables you have; you get measured by how many users your data has, how happy they are, their net promoter score. So we shift the measure of success, our language and behavior, and the design of our systems. And I think that instead of data being just an output of code, the data and the code for analytical data, as a data product adjacent to your microservice, become one unit of architecture. All right, so we've put a lot of demand on our domains to provide this data as a product.

What support can we give to our domains? Today's technology landscape, we saw that in the landscape slide earlier, is full of technology that is very complex to manage. So the idea of data mesh is that we've got to somehow reduce that level of complexity and abstract it into our data and ML platforms. And platform building is an idea that we're very familiar with: platform teams and platform infrastructure teams that create this self-serve layer of capabilities. And even though I'm simplifying here as one layer, we're really talking about a new class of capabilities. Today, working with a lot of organizations, we already see a data platform team or data infrastructure team of some sort, and they give you self-serve capabilities to work with the utilities that you have.

Work with your big data storage: do you want a warehouse, or do you want a topic on your event stream? They take care of those utility-layer backbones that would run this mesh. But in addition to that, when we think about reducing complexity and breaking down the silos between those hyper-specialized people, we need to rethink and reimagine our platforms as well. So improving the experience of a data product user or a data product developer really needs a new experience layer, new experience capabilities, to be able to very easily and declaratively create these data products, version them, or secure them. And most importantly, once you distribute the data over an interconnected mesh of data products, we need new experiences at the mesh level: manage the graph that these data products develop, manage security at scale across the mesh, and be able to query and discover data across them.
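
To give a feel for what "declaratively create these data products" could look like, here is a hypothetical sketch of a data product experience plane. The class and its behavior are invented for illustration; a real platform would delegate each step to Kafka admin APIs, object storage, a schema registry, IAM, and so on.

```python
# Hypothetical sketch of a self-serve "data product experience plane":
# one declarative call that provisions everything a domain team needs.
class DataProductExperiencePlane:
    def provision(self, descriptor: dict) -> None:
        """Turn a declarative descriptor into running infrastructure (stubbed)."""
        for port in descriptor["output_ports"]:
            print(f"creating output port: {port}")          # e.g. topic, bucket, table
        print(f"registering schema: {descriptor['schema_ref']}")
        print(f"applying access policy owned by: {descriptor['owner']}")
        print(f"indexing '{descriptor['name']}' for mesh-wide discovery")


# A domain team declares intent; the platform hides the utility-plane details.
platform = DataProductExperiencePlane()
platform.provision({
    "name": "daily-order-facts",
    "owner": "order-management-team",
    "schema_ref": "schemas/daily-order-facts-v1.avsc",
    "output_ports": ["kafka://order-management.daily-order-facts.v1"],
})
```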

So there's a new set of capabilities that really needs to be imagined, thought about, designed, and implemented. Again, there is some transformational thinking that needs to happen, and I wanna just touch on a few things.

I think in the infrastructure space, in the platform space, there are some distributed solutions, but not enough. I think we need to think more about distribution. I mean, Kafka is a great example of enabling distributed architectures; we need to build upon that and create more solutions like it. And instead of really thinking about mechanics: I hear this phrase data catalog, and it's a hot topic.

Everybody's building a data catalog, but a data catalog is a mechanism; it's a mechanic through which we provide a certain experience. So let's step back and think about that experience of discoverability, searchability, understandability, and maybe that opens up a new set of tools beyond the static mechanics of a data catalog. And think about the users of this platform.

Instead of thinking of them as specialists, think of them as generalists: everyone with a bit of empathy for data, or interest and curiosity, should be able to go from that hypothesis to delivering a data-driven solution. Well, then we need a new world order, right? We've distributed the data, we've distributed the ownership and accountability.

This makes folks who are in governance very, very uncomfortable. And I feel you; I know you have had enormous responsibilities and accountability as a centralized function in your organizations. You've been responsible for making secure, quality data available to everyone, accessible at the enterprise level. So this really changes that model of operation. We need a new way of governance that embraces, first and foremost, decentralization.

So push the responsibility of ownership and accountability for quality as close to the source as possible, to the people who are most intimately familiar with that data, to the domains themselves. We've got to think about interoperability as a big aspect: if I have customers represented in two or three domains, how can I actually make that customer information interoperable? So there's a set of techniques that come around that. And we've got to embrace a dynamic topology.

I mean, data products will be created, will be removed, will be revised, and new versions of them will come online. So what does governance look like if we need to embrace that dynamism? And finally, we've got a platform. We have powerful technology today; let's automate, let's use automation and computation as a way of enabling this. As an example, if we move quality back to the source, to those points of origin, and say you are responsible for analytical data generation, then the governance team is responsible for defining what quality entails.

Like what SLOs we need to describe data that is of quality. And then it's the accountability of the domains to say, I guarantee these SLOs: I guarantee how timely my data is gonna be, I guarantee how many missing events I might or might not have, and what the accuracy of my data is. And then I will use the platform capabilities, the automated testing, the integrated testing, and all of the other observability capabilities that the platform gives me, to get my data product to assure that quality.
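
As a small illustration of what computational governance can look like at the data product level, here is a sketch of an automated check of declared SLOs. The SLO thresholds, field names, and numbers are made up; in practice such checks would run in the platform's test and observability tooling.

```python
# Sketch: the domain declares its SLOs, and an automated check (run by the
# platform, e.g. as a scheduled test) verifies them before the data product
# is trusted on the mesh. Values and logic are illustrative only.
from datetime import datetime, timedelta, timezone

DECLARED_SLOS = {
    "max_staleness": timedelta(hours=1),   # timeliness guarantee
    "min_completeness": 0.999,             # share of expected events present
}


def check_slos(last_event_time: datetime, received: int, expected: int) -> dict:
    """Return pass/fail per declared SLO for one data product."""
    staleness = datetime.now(timezone.utc) - last_event_time
    completeness = received / expected if expected else 1.0
    return {
        "timeliness_ok": staleness <= DECLARED_SLOS["max_staleness"],
        "completeness_ok": completeness >= DECLARED_SLOS["min_completeness"],
    }


# Example run against made-up observability numbers for the day:
result = check_slos(
    last_event_time=datetime.now(timezone.utc) - timedelta(minutes=20),
    received=99_950,
    expected=100_000,
)
print(result)   # {'timeliness_ok': True, 'completeness_ok': True}
```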

So essentially we need to get rid of this idea that we have to put a straitjacket over our data, to canonically model it, to be able to really use it. The mind shift here is moving from thinking about governance as one centralized function to thinking about it as a federated function, with the domain data product owners coming together. The domains are responsible for executing all of those compliance concerns around quality and security; the federated group is responsible for designing what those entail; and then it really falls to the platform to enable that and to the teams to adopt it. It means putting the incentives and the accountability structure in place and getting comfortable with the fact that we will have multiple domains, and those domains are distributed and constantly changing. And it means changing the measure of success: how you know your governance function is successful is not how many petabytes of golden-stamped, approved data they have; it's really about how many connections on that mesh you have created between your data products, which represents how much value we're getting from that data and how easily people can use the data that we have.

So in short, data mesh really is a paradigm shift around architecture, organizational structure, and the technology that supports it, and it embraces federated computational governance. We set up global principles that get embedded into every domain. We are bringing the analytical and operational planes closer to each other, based in the same team.

So we will have domains responsible not only for their microservices but also for their data products, and data products get served really as products that are shareable, with a set of standard protocols for interoperability and for getting access to the data. And the platform is really a self-serve platform with multiple planes providing seamless experiences for data product developers and data product owners. And where do you get started? A journey of a thousand miles begins with the first step, and you have taken the first step by coming and listening to me. There will be more content available, and communities are forming around data mesh online, on Slack, so I hope to see you in those communities. Thank you for being here and listening to me. (gentle music)

2021-05-16
