Introduction to Data Mesh - Zhamak Dehghani


Hi, everyone. I'm Zhamak, and I'm the Director of Emerging Technologies at ThoughtWorks, North America. I'm excited to be here with you today to talk about Data Mesh: why Data Mesh, what the challenges are that we've been facing for almost half a century with big data management, and what it is, a real deep dive into the core tenets of Data Mesh that hopefully you can take away and apply to your context and your organization.

One of the perks of being a technology consultant at ThoughtWorks is that you get to work with many leaders at many companies across the spectrum of industries. And one thing has been evident to me: data is at the core of every company's strategy. There are great expectations around it, and I want to share a few mission statements with you to make this point clear.

Our mission is to power prosperity around the world as an AI-driven expert platform. This is from a financial software provider whose products many of us, in fact, use. What they want to do is use data and AI to really change their offerings: what they offer and how they empower their users and the small and medium business owners that use their platforms. The second one is from a telco in North America: by people, for people, we incorporate human oversight into AI.

With people at the core, AI can enhance the workforce. So the ambitions are around streamlining their workforce, optimizing their business, and then how they serve their customers using data. And finally, my favorite, from a healthcare provider: our mission is to improve every single member's experience at every single touchpoint with our organization through data and AI. And they truly hold these goals.

They want to improve the experience of every single member that receives care or other services from this healthcare provider, using their data, knowing their customers, and changing how they interface with them. But the inconvenient truth is that, despite such audacious and admirable goals, and despite the large amount of investment happening around data and AI, the results on really measurable, transformational metrics are low. Do we have a data culture? Have we become data-driven? Are we competing on data? The percentage of companies that measure themselves as having achieved these goals is quite low compared to the amount of investment. These numbers are from the NewVantage Partners report, which I highly recommend you keep an eye on every year. I've been following their work for a few years to see how the investment versus results are trending.

And even though we are improving, we're still not there yet. So the reason Data Mesh came about, initially as a hypothesis, was that we started seeing these failure symptoms in the wild. We started seeing companies really trying to build these big data platforms, but once they succeeded in building them, they failed to scale them.

They fail to scale sourcing data from the diverse set of domains and corners of the organization that produce data at large volume and rapid pace. And we see failure in scaling consumption of that data: how quickly we can run experiments that use that data to improve operations, or how we optimize the experience of our customers.

Or how many of these experiments can we actually run? Often, the answer to these points of friction, around scaling the sources or scaling the number of consumers, and number here may refer to the diversity of consumers, the diversity of transformations and use cases, is: let's go and build the next big platform. And then we get stuck bootstrapping that next big platform, ultimately failing to materialize any value from the data we're collecting. So at least I felt, a few years back, that we are in a crisis. We need to think out of the box.

We need to challenge some of these assumptions and really see what we can do to approach managing and collecting big data differently. So I just want to give you a quick run-through of the current state of big data, where we have come from, and what assumptions we have made.

So as a starting point, we are in this great divide of operational data and analytical data, for good reason. Operational data is responsible for running the business, serving the customers, keeping the current state, and often we use APIs to serve that data. And on the analytical side, we have these big data platforms.

Whether it's a warehouse or a lake, they try to aggregate data from all of these sources and put it into some form of time-variant [INAUDIBLE] subject-focused aggregation of data. And in between, we have these ETLs to get data from one end and put it into the other. I don't have to tell folks who've worked with ETLs how problematic they are, and how fragile this architecture is that separates the two planes, and the teams that manage them, with the integration in between.

So then let's focus on the blue part, the analytical data. For decades, we've been working on different approaches to manage or aggregate data at scale. We've had data analytics and the initial versions of data warehousing to support business intelligence and various business reporting since the 60s. And there are various technologies, often proprietary, that allow you to extract the data from these sources and then model it into multidimensional systems on top of which you can run queries and build reports and metrics. Around 2010, we realized that this cumbersome and elaborate modeling process, getting data from one end and modeling it perfectly into star schemas or various other forms of schemas so that we can run queries, doesn't really serve the needs of today's analytical workloads.

Such as training and running machine learning models. So we decided: let's go with the lake model, which is still get the data from those operational systems, but don't transform it so much. Dump it into some object store, semi-structured, so that downstream data scientists or data analysts can model it however they wish, but can still see the signals as they came, in their raw format.

And despite millions in investment in the various Hadoop installations I've seen, we still had a lot of challenges. So the current-generation approach is: let's do the same, but on the cloud, in a multi-modal way. Let's get the best of breed from the cloud providers, get the data out of these operational systems, put it into some big storage, and then downstream use data warehousing or other ways of getting access to the data.

So this is a combination of both, done on the cloud. We have certainly been going through evolutionary improvement: the warehouse, the lake, multi-modal on the cloud. And we have certainly seen a Cambrian explosion of tooling. The diagram here is from a [INAUDIBLE] in New York; they publish the landscape of data analytics tooling every year.

This is from 2020, and I really don't expect you to read every icon, but you can just glance at this and feel a bit dizzy about how busy we've been building various solutions to manage the data and build intelligence on top of it. So then why? Why are we still seeing big failures in the wild?

And why are we still challenged? I was in a wonderful presentation yesterday, Women in Data Science, and almost every second presentation on data science ends with: we don't have access to data at scale.

So why can't we bootstrap ourselves out of this situation? I think, when I really step back to 50,000 feet and look down, there are a few assumptions that we haven't questioned for half a century. So I want us to question these assumptions: are they still relevant? The very first one, and you've heard all of these phrases, is: let's collect data in one place, let's hydrate the lake, let's centralize the data that's been siloed in organizations. So we've ended up with these monolithic solutions, centralizing data that is ubiquitously available across the organization, yes, in the silos of the operational systems.

And yet we want to centralize it so we can feed our [INAUDIBLE] innovation agenda, those data-driven experiments. And as we all know, monolithic and centralized solutions are perfect to start with. They're wonderful, easy to start with.

Deal with one vendor, maybe a small set of technology stacks. But they are very difficult to scale as the expectations of that platform grow. So then what do we do? The easiest path to scaling a monolithic solution is to technically decompose it: look at the technical functions that occur within that solution, put a boundary around them, and put them in boxes as separate pieces of the puzzle.

And particularly in a data platform, we end up with various forms of technical decomposition, one of which is the one I have on the diagram here: you optimize one part of your solution for ingestion, another for processing, another for serving. And if you recall the diagram I showed of the multi-modal implementation of the data warehouse on the cloud, these phases map to the technologies that a lot of vendors sell us today.

And then what happens to the people? When we have this technical decomposition, the natural way to create teams is to create task-oriented teams, as opposed to teams oriented around outcomes or end-to-end capability or feature development, or in this particular case, data development. So we have teams responsible for one end and teams responsible for the other. But change doesn't happen locally to a particular phase; change often happens orthogonally, across these phases.

So this all leads to a slow pace of development, a slow pace of access to data or of onboarding new data sets. And we find ourselves with a set of tooling that requires specialization. I mean, who is not struggling to find data engineers?

Everybody is, because the tooling today expects that monolithic piece in the middle to be built by highly specialized people who really understand how to transform data, or how to read and write data in various technologies. But they've been put in a silo. I have a lot of empathy for people working in this space, because they've been given the task of making sense of data from domains they have no visibility into, provided to teams whose use of it they have no understanding of.

And yet they have to make sense of this data provided to them. Why? Because, well, they know the tooling they have to work with today. And like any large monolithic solution, building this in a connected way, from the domains where the data comes from to the domains where the data is used, is quite challenging. So we end up with disconnected executions: start at one end and never finish, or build it and hope the users come. All forms of disconnected execution of a big-bang solution have been evident, at least from what I have seen.

So for over half a century, we've been trying to manage and share data at scale with a centralized architecture, with a siloed group of hyper-specialized teams, and in a disconnected fashion. And despite going through the evolution from warehouse to lake, to lake on the cloud, we are still seeing these failure modes in the wild. So then, where do we go from here? If I have a hypothesis, if I challenge the current paradigm, I'd better have a solution to look into. The principles underpinning Data Mesh are directly designed to address the assumptions and challenges I've shared with you. The first principle is an alternative way to decompose the architecture, to decompose ownership of the data.

So far, we have decomposed around technology: warehouse, lake. Data Mesh suggests an alternative way, and it's really a natural evolution of what our organizations have been doing. For almost a decade, we've been moving toward domain-oriented decomposition of our organizations, building technical teams that mirror how the business decomposes itself around domains and capabilities.

So Data Mesh suggests: let's continue that trend. When it comes to analytical and big data, bring the ownership to the domains, to the people who are most intimately familiar with that data. The second pillar is around that: OK, if we did that, how do we avoid the siloing of data that we're facing today? Aren't we going to have yet another way of siloing the ownership of analytical data? The second pillar addresses that: with that ownership comes the accountability and responsibility of sharing that data as a product.

Delighting the experience of all of those data consumers, and putting in place an incentive structure that enables that. The third pillar addresses the cost of ownership: if different teams own their big analytical data, capturing it, sharing it, curating it, then how do we enable them? We need a new generation of technology, automation, and platforms that really enables that autonomy. And finally, as with any distributed system, we need a new world order, a new way of governing data that enables an ecosystem play.

One that enables interoperability of the data without introducing risks, security issues, and so on. So let me quickly go through these four principles. And if I'm completely honest with you, I think the main principle is the first one.

The rest are there to compensate for and address the challenges that arise from the first principle. So, domain-oriented ownership of the data: what does that look like? In the past, we've had this big monolith, whether it was the data team or the BI team or the data lake team, that was responsible for getting the data and owning it.

Data Mesh introduces the concept of domain-aligned data products, which is really about finding, in each domain that runs operational systems, the analytical data that it can produce and share with the rest of the organization. It could be domains aligned with the systems that interact with the customer or the back office, domains aligned with the source, or domains aligned with the consumers. And maybe there are a few consumer cases that require the same data, so we can take ownership of that and create new data products or aggregates: that master data sitting between customers and orders, entities that go across multiple domains. What does that look like? Let's look at a simple example. I use the world of digital streaming and digital content because we all listen to some form of digital content, so this might relate more quickly.

So imagine an organization like Spotify or iTunes. They really have different domains: domains that manage onboarding the artists, or producing podcasts, or managing the users, or the playlists, or the streams of music. Just look at podcasts: the podcast domain today has operational capabilities, such as creating podcasts and releasing podcasts.

They provide these capabilities either as applications or as APIs. What we're asking these domains now is: you're also responsible for providing your analytical data. Podcast listener demographics, for example, is an analytical data set that could be quite valuable for advertising, event management, growth marketing, any of those capabilities; now the domain is responsible for providing that. Or, based on listener behavior, they can provide the daily top podcasts. So these are some examples of analytical data that these domains will now be accountable for providing.
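To make this concrete, here is a minimal sketch of what that accountability could look like in code. It is purely illustrative; the class name and fields are hypothetical assumptions, not part of any Data Mesh specification.

```python
# Hypothetical sketch: a podcast domain that owns operational capabilities
# and is now also accountable for analytical data products.
from dataclasses import dataclass


@dataclass
class AnalyticalDataProduct:
    name: str              # e.g. "podcast-listener-demographics"
    description: str       # what the data represents, for discoverability
    refresh_interval: str  # how often the domain republishes it


# Operational capabilities the podcast domain already exposes as apps or APIs.
podcast_operational_capabilities = ["create_podcast", "release_podcast"]

# Analytical data the same domain is now responsible for serving to the
# rest of the organization.
podcast_data_products = [
    AnalyticalDataProduct(
        name="podcast-listener-demographics",
        description="Aggregated listener demographics per podcast",
        refresh_interval="daily",
    ),
    AnalyticalDataProduct(
        name="daily-top-podcasts",
        description="Top podcasts ranked by listening behavior, per day",
        refresh_interval="daily",
    ),
]
```

The point of the sketch is simply that the same team that owns the operational capabilities also owns and serves the analytical data.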

So this transformation to domain ownership challenges some basic assumptions that I want to call out here. In the past, we've been obsessed with the analogy of water, the source of life itself.

Data flows like water through the pipelines into the lakes so that we can use it. And underneath that water analogy, I think, lurks a dangerous notion: that data must flow from the source to somewhere else, handled by someone downstream, before it can be useful. Data Mesh challenges that.

Data should be served and usable at the source, by pushing the cleansing, the curation, and all of the accountability for getting the data into a usable shape as far upstream as possible, as close to the source as possible.

For decades, we've been trying to create this one canonical model for our organizations, and domain-driven design, by Eric Evans, actually challenges that. That's from 2003, in his seminal book on the topic: moving toward a multi-model way of designing the organization. He introduces really interesting concepts like bounded contexts and context maps to break apart a complex model into multiple models with intentional relationships between them. And the last one, I think, might be a little more difficult to accept.

For decades, we've been chasing the moving goalposts of defining the single source of truth for the data. Data moves around, data gets reshaped, reformed, reused, and it's very hard to define that one source of truth. What we can do is create reliable and relevant reshaped copies of the data if need be, and, I talk about this in further slides, put the right automated governance in place to keep those relevant reshaped copies of the data in check. We also really need to move away from the pipeline. The pipeline, as an architectural concept, is quite challenging when you think about decomposition at scale.

Let's move those pipelines into the internal implementations of domains, and let's define interfaces with clear contracts for sharing data across domain boundaries. And let's move away from a technology-driven decomposition of our architecture to a domain-oriented decomposition and distribution of ownership. So then, how do we address the silos? How do we avoid a disconnected and fragmented set of maybe beautifully curated and shared data sets that don't talk to each other, that can't really be used together?

If I go back to the definition of a successful product, coined by Marty Cagan in his book Inspired: every successful product sits at the intersection of being usable, being valuable, and being feasible, feasible both technically and business-wise to create. So what does that look like if we apply these characteristics to data? Let's zoom into usable.

Let's assume that the data is valuable, somebody needs it, somebody has a use for it, and that it's technically feasible to build. What does usability look like? One of my favorite books is by Don Norman, the father of cognitive engineering, The Design of Everyday Things. He starts his book with the idea that every product we design has to be discoverable and understandable. And I think those are among the basic characteristics for data to be usable.

So, discoverability: where is the data, how can I get to it? And documentation, both static documentation and computational documentation, with the likes of computational notebooks, allows that understanding, takes you from discovery to understanding. And I really think we need to move away from the mechanics of "oh, I need a data catalogue or a data registry"; those are mechanical implementations. Really think about the first-day experience of the data scientist who wants to run an experiment. Remember, we talked about the litmus test of running an experiment. What does that journey look like for them? Because those are the users of the data whose experience we need to delight.

How do they go about discovering and understanding data? What are the native tools they use? Let's build something that satisfies their needs. The second usability characteristic I think is worth calling out is that the data is truthful and trustworthy. Trust is such a difficult word to define. One of the most wonderful definitions I love comes from the work of Rachel Botsman; she has a few books on this, but Trust Thinkers is the website, and in a blog post she wrote: "Trust is a confident relationship with the unknown."

So we talked about building documentation and discoverability so that we can start bridging the gap between "I don't know what this data is" and "let me learn about it". What else do we need to do to bridge that gap between the known and the unknown, to trust the data? We need more information. We often refer to this as metadata.

I try to avoid these overloaded terms that have lost their meaning and really get to what we're actually talking about. So let's talk about a few characteristics of the data: how often does the data get updated, what's the interval of change? What's the skew between the time it was produced and the time it gets consumed? How complete is the data, what's its statistical shape? Those are just a few of the characteristics sprinkled here.

Every data set, to be usable, should have these basic characteristics. So here are eight characteristics that I think are the baseline for getting data usability in place in each of our data products, and I'll leave those with you to go through later on. So then, what does it look like, if the data now needs to provide all of these characteristics, needs to be trustworthy, with integrity, security, and documentation built in? We need a new unit of architecture, a new architecture quantum.

We call this the data product; it's a placeholder name for now. But a data product is really an aggregation of the code that transforms, serves, and shares the data, as well as the data itself, and the characteristics, the metadata information we talked about.

And this unit should be able to be built, deployed, and maintained as one component. But all of these requirements around the characteristics, and around serving data as a product, don't come about just out of good intentions. So in a Data Mesh implementation we define new roles, such as the data product owner, whose responsibility is serving data as a product. It's a long-term responsibility. The KPIs she or he would measure are reduced lead time for a data user to find, understand, use, and experiment with the data; growth of usage of the data, how many people are actually using it; and their satisfaction.
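As an illustration only, a data product as one deployable unit might be sketched roughly like this; the class names, fields, and the build_and_deploy step are hypothetical assumptions, not an established API.

```python
# Hypothetical sketch of a data product as one deployable unit ("architecture
# quantum"): transformation code, the data it serves, and its descriptive
# metadata travel and deploy together. Names and fields are illustrative only.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DataProductDescriptor:
    owner: str                 # the domain / data product owner
    documentation_url: str     # static and computational docs (e.g. notebooks)
    refresh_interval: str      # how often the data is updated
    freshness_skew: str        # lag between production and availability
    completeness: float        # fraction of expected records present
    # ... other baseline characteristics: statistical shape, lineage, etc.


@dataclass
class DataProduct:
    descriptor: DataProductDescriptor
    transform: Callable[[Iterable[dict]], Iterable[dict]]  # code that shapes the data
    output_port: str           # where consumers read the served data

    def build_and_deploy(self) -> None:
        """Build, test, and deploy code + data + metadata as one component."""
        # In a real platform this would run the transformation, publish the
        # data to the output port, and register the descriptor for discovery.
        ...
```

The design intent is simply that none of these pieces can drift apart: the code, the data, and the usability characteristics ship as one unit.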

So what sort of transformational thinking do we need to put in place? For decades, we've talked about data as an asset. If data is an asset, what do we do? We hoard it; we collect more.

And that results in gigabytes of data that nobody can really use. So we need to change our mindset from data as an asset to data as a product; from collecting data to connecting data.

From data as a byproduct of running the business that I really don't care about, to data as a product that my domain serves to the rest of the organization. And finally, from data as a dump of some compute, as the output of the compute, to data and compute together as one unit of architecture.

So we've had a lot of asks of my domains now. We're asking them to not only run the business with the applications and systems they build, but also to provide analytical data. So how can we enable them? This is where platform thinking really comes into play. We've got to abstract away the complexities that exist today so that domains are able to host, transform, and share the data. What does that look like? When we think about platforms, we often think about the basics, the metalwork: the networking, CI/CD, big data and block storage, and a system in front of those so that users can request these capabilities in an automated fashion with less friction. But Data Mesh needs more than that.

We need to create a new set of capabilities. And honestly, to be completely transparent with you, none of these exist today. These are new things we have to build, and hopefully the tooling and technology will catch up to provide them to us as offerings. But we've got to think about the experience of data product developers and the experience of data product users, and create new levels of abstraction, new automation, for them to easily and declaratively manage the life cycle of their data products.

And on top of that, we need new capabilities to be able to run operations at the mesh level. It's great that we can build one or two data products easily, but what does it look like to connect data products together? What does it look like to run semantic queries across the mesh? What does it look like to actually see the health of the ecosystem?

That requires a new set of capabilities at the mesh level. So again, to challenge the way we've been thinking about platforms today: we've got to stop thinking about the next big platform to build.

We've got to think instead about the protocols we need to enable so that different technologies can be composed together to satisfy the capabilities listed on the previous slides. Instead of assuming that everyone is a data specialist, we've got to think about how we can enable data generalists, data enthusiasts, generalist developers, to enter this space. You have no idea how many conversations I'm in where the data platform team is complaining that the business units or technical units don't have the right data engineering skills. Well, we've got to change that, because we will never meet those expectations with the number of data specialists available.

We need to raise the level of abstraction: move away from forcing everyone to say imperatively how to build the platform or the data product, toward a simpler declarative model of stating what the data product entails and letting the platform take care of the metalwork of provisioning it. And move away from thinking about mechanics, a catalogue here, a registry there, and think instead about the experiences we want to enable. That would really open up a world of opportunities in terms of building solutions.
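As a purely hypothetical illustration of that declarative model, a data product specification might look something like the following; the field names and the platform.apply call are assumptions made for the sake of the sketch, not a real interface.

```python
# Hypothetical, purely illustrative: a declarative specification of a data
# product. The developer states *what* the product is; a (hypothetical)
# self-serve platform provisions storage, pipelines, and access policies.
podcast_listener_demographics = {
    "name": "podcast-listener-demographics",
    "domain": "podcasts",
    "inputs": ["podcasts.playback-events"],       # upstream data the product reads
    "output": {
        "format": "parquet",                      # how consumers read it
        "schema": "schemas/listener_demographics.json",
    },
    "refresh": "daily",
    "access": {"default": "org-wide-read"},       # access policy, enforced by platform
}

# A hypothetical platform call; in practice this would be whatever self-serve
# automation the organization builds or buys.
# platform.apply(podcast_listener_demographics)
```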

So what is this new world order that would, in fact, enable interoperability? Now we've decoupled the ownership, decoupled where the data sits and how the data gets served, and we have this wonderful automation as part of our platform. It may not be one platform; I know in this picture I show one platform, but that's just notation for a composition of automated capabilities. The new world order requires embracing a new value system: embracing decentralization and domain self-sovereignty. The domains are in control of their data, how they model it, how they share it.

But we need to enable interoperability between these domains. What are the standards we need to put in place so that, while we design and model our data differently across domains, we still speak the same language and can join data across the boundaries of these domains? The governance needs to embrace a dynamic topology: data products will come and go, split and converge, and the governance needs to be able to handle that.

That's just part of the nature of the mesh. And finally, take advantage of the platform: automate, automate, automate all of those policies that we create.

Automate them into the platform. Placing a system in the straitjacket of constancy has only gotten us fragility, and that's what we've been doing for a long time with our data modeling and canonical models; it simply doesn't work for the world we're in. So, very briefly, the components of this governance to think about and apply in your organization include thinking about the global concerns.

Because really it's about balancing, finding the equilibrium between which decisions we're going to decentralize and leave local to the domains, and which decisions we're going to globalize, and we want as few of those as possible, just enough for the sake of interoperability. So: deciding what those global concerns are; creating a federated team from the domains, the data product owners, and ancillary roles like subject matter experts, security, compliance, and so on;

and defining the principles that define the boundary and scope of your governance. The principles should really define how you make decisions, how you decide whether a concern is a local concern or a global concern.

And then defining those global policies, and as part of that definition, defining how they're going to be automated by the platform. If we decide to use the same syntax and semantic modeling language for all of our domains, regardless of which domain they are in, then how can we enable the platform, with automation, to create those schemas: automatically putting the scaffolding in place, putting the verification in place. There is a ton of tooling we can build into the platform for verification: do the data products on the mesh follow these policies and conventions? And embed those computational policies into every single data product blueprint.
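Here is a hedged sketch of what such a computational policy check could look like; the policy rules and descriptor fields are hypothetical and would differ per organization.

```python
# Hypothetical sketch of a computational policy embedded in a data product
# blueprint: a check the platform runs automatically against every product.
# Field names and rules are illustrative, not a real standard.
def check_global_policies(descriptor: dict) -> list[str]:
    """Return a list of policy violations for a data product descriptor."""
    violations = []
    # Global interoperability decision: every product documents its schema.
    if not descriptor.get("output", {}).get("schema"):
        violations.append("missing schema definition")
    # Global trust decision: every product publishes its freshness guarantee.
    if not descriptor.get("refresh"):
        violations.append("missing refresh/freshness guarantee")
    # Global security decision: access policy must be declared, never implied.
    if "access" not in descriptor:
        violations.append("missing access policy")
    return violations


# Example: run the check against the hypothetical declarative spec sketched earlier.
# violations = check_global_policies(podcast_listener_demographics)
```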

We put a lot of emphasis on automation: a blueprint for every data product, the SDKs, libraries, and sidecars, essentially, that we use in every data product to really embed these capabilities. Governance is going to look very different, despite having the same objective of providing quality data across the organization safely and securely. The implementation is going to look different: from a centralized team of data experts to a federated team of domain experts. Instead of being responsible for the quality or security of the data, they're responsible for defining what constitutes quality and security and then embedding that into the platform. And it's the responsibility of the data product owners to actually meet the assurances, the agreements and guarantees, that their data products must meet.

They're really not responsible for defining the canonical model; they're responsible for modeling the polysemes, the entities that cross domain boundaries and require standardization. And again, moving from asset to product requires changing the measure of success: moving away from vanity metrics like volume, how many tables and petabytes of data you have, toward how many connections you have established on your mesh. Because connections represent somebody using your data and getting value out of it, and that calls for a different formula.
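As a rough illustration of that different formula, under the assumption that the platform can observe which consumers read which data products, a connection-oriented metric might be as simple as this:

```python
# Hypothetical illustration of a connection-oriented success metric: count the
# consumer-to-data-product links on the mesh rather than the stored volume.
def mesh_connection_count(usage_log: list[tuple[str, str]]) -> int:
    """usage_log holds (consumer, data_product) pairs observed on the mesh."""
    # Each distinct pair is one established connection, i.e. evidence of use.
    return len(set(usage_log))


# Example: three reads, but only two distinct consumer/product connections.
print(mesh_connection_count([
    ("growth-marketing", "daily-top-podcasts"),
    ("ads-team", "podcast-listener-demographics"),
    ("ads-team", "podcast-listener-demographics"),
]))  # -> 2
```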

So Data Mesh, in summary, is a paradigm shift in architecture, technology, and organizational structure that brings your operational plane and analytical plane together, with a feedback loop between them: your operational systems and analytical systems coexisting as part of the domain, served by an automated platform, and governed by a federated, computational model.

The journey of a thousand miles begins with the first step, and you have taken that first step by being curious enough to listen to me, think about Data Mesh, and think about applying it. I wish you all the luck in taking this to your organization and applying it at scale. Thank you.

