The thing I'd like to focus today here, is data modelling in the context of Data Vault and how do we use conceptual level modelling to solve some of these issues that we see in Data Vault problems, the problems happening in the Data Vault space. I've always liked Data Vault as an approach, I've done it myself many times. Recently we did a thing with Data Vault North America user group with Cindi Meyersohn and John Giles was involved as well. It was called Data Vault end-to-end done right and the core thing of that, is that it has to start from the business. So I mean you've read the
books, you've been on the classes, and so on you know that you know the business entities that's the core of the Data Vault, that's the whole you know point of the approach. We're not very good at that are we... and I think it's good for us to look at some of the kind of failure states really that Data Vault initiatives occasionally have and I kind of try to emphasise here that conceptual modelling and focusing on the business, is actually one of the kind of key success factors of any Data Vault project, regardless of technology, regardless of the organisation, regardless of the complexity of your data sources and whatnot is. So, this is kind of a way to emphasise the role of conceptual data modelling in this process of doing Data Vault right.
If we start from very basics and I know many of you know the Data Vault methodology inside out, I don't claim to be an expert on that but I think the if we try to kind of get to the core of the issue with this, that we're integrating data right that's the whole point. We're building a kind of centralised Big Data thing that integrates data from different sorts of places, integrates it in a way that aims to provide value for the whole enterprise. We want to have reusable data, we want to have data that is business centred, really business-centric kind of in the sense that we're perhaps not that keen on following the structures that happen to exist in the source systems. We want to
have something that can be utilised to actually produce value for the business. So the Data Vault approach of course is of integration to business keys. What is a business key? It's the identifier of the actual business entity, it's not a primary key field okay it could be a primary key field in the system somewhere but that's not really the point we want to know what are the core things that you have in your business? You have customers, you have products, you have invoices and advertising campaigns or whatnot and these are things that kind of exist across your system landscape. Different sorts of systems might handle these things differently. They might have different surrogate keys or technical identifiers for things but we try to find the actual business case, you know how do you separate customer A from customer B? Maybe you're using a customer ID, maybe you're using the if you're a public sector organisation, the social security number or something like that.
And the idea of course is that integrating the data that we have across the enterprise from multiple systems around this business case, ensures that we can actually have an enterprise wide view to what the data really is. So we're not just looking at the source system A and source system B, we want to look at customer as a thing that exists across the organisation. Now of course source systems, they have their own data models they are systems built for some operational purpose, usually you have your ERP's and CRM's and HR tools and whatnot and the thing with those, is that the internal data model of such a system, is a response basically kind of a design for the problems that, that particular solution needs to solve but it does necessarily not completely match with what the business is kind of about. And this is an interesting thing and of course many of you who might have been working with European companies, you know what SAP does and what SAP does is basically tell them, we have a such a good model internally that you can mould your business into it and that works quite well but occasionally that's not quite the case, that the tool and the business would somehow you know mould into each other. The tool might internally have some kind of conception of
what data structures exist and what objects exist, what entities exist, that do not necessarily match with the real or actions that you have in your business. So technical key field in the source system it might be a kind of real business key it might be social security number or it might not. A table in a source system database might match with a concept like a lead or prospect. You have all the prospects here, one row is a prospect. Or it might not. So if you want to build a Data Vault and integrate the data solution of the source of data sources it must be business centric, it must kind of be above and beyond the technical structures and solutions and lines of thinking that those individual systems might have, and I think the kind of core of this is there on that's quote I believe from Dan's blog post from a few years back, but I think this kind of emphasises this properly that if you build a source system Data Vault model the value of the solution drops to one tenth of one percent overall.
Now that's pretty rough isn't it? But I also think that's completely true and the, Data Vault initiatives that I've seen that I've been personally involved in, that I've audited, or looked after or heard about, I seem to sense kind of pattern there, that those that are not really valuable tends to be source system oriented and those that are most valuable and successful, tend to be properly business oriented. That's kind of anecdotal data, but I do believe that everyone that I have talked with has been saying pretty much the same thing and some like Dan put it in in rather stark terms even. Now the problem is, that it's easy to say that we want to be business centric, we want to be business oriented, we want to integrate around real business case but do we really know what the business is? Now the approach that at least Data Vault 2.0 as a methodology has, is it talks about getting taxonomies and ontologies right first and then you go into your Data Vault model from there. So all right uh taxonomy and ontology are words that I personally try to avoid using as much as possible. There's the reason for that being that they are not quite straightforward but I've tried to put this in a straightforward terms here and my apologies if someone here is a kind of academical ontologist they might, me for this but I'm trying to put this in it's kind of simple terms, so what we usually mean with a taxonomy is that it's kind of a hierarchical classification of types of things, and I have an example coming up there.
So you have like persons and persons can be customers and so on and customers are loyal customers and non-loyal customers I don't know different classifications hierarchy, hierarchical classification of things like in the animal kingdom you have animals and you have birds and fish and mushrooms and whatnot and you have the various types of raccoons there and so on so forth. It's a biological taxonomy, but when we start adding relationships between these things into the taxonomy we get roughly speaking, something that is an ontology. The ontology is kind of well you probably know where the word comes from, but it explains what the nature of existence is kind of. Things exist like this and they have something to do with other things. So we have the taxonomy of the hierarchical classification of all the things in the world and then we have an understanding of how those things interact with it each other. And a business can be described in these terms.
Now the way that Data Vault models should be created is that you take the taxonomy you take the ontology of the real business, not the source systems you take the taxonomy and ontology okay these are things that exist in the nature of this particular business and this is how they interact and now I choose from here what are going to be my Data Vault objects, this is going to be a hub and because there's a link to the an association between this thing and another thing we're going to create a link type table table between. That's kind of how it should work. The problem might be that you don't necessarily have it like this. Now this is very kind
of clean and same very simple example, and again my apologies for probably misrepresenting ontology here or at least very roughly presenting it in this simplified form, but this is a hierarchical classification resources. We have two types of resources: we have vehicles, we have buildings. Where there are two types of vehicles, there are cars and helicopters I don't know what the business is but anyways we have garages where cars are stored and so on. Now this taxonomy is enhanced with the associations between things we could approximately something that could be called an ontology at least in this context. So, we classify the things that exist in the business and again we're not looking at systems, we're looking at the business as it is in actual reality. Usually we don't have that.
So the reason that I've seen for kind of usual failure in Data Vault initiatives, every data initiative every type of data initiative has failures and I think the overall failure rate in data initiative is something like 70 percent but Data Vault initiatives also fail and the usual path is that first of all you don't have that picture, you don't have the taxonomy, you don't have the ontology, most likely no one in the organisation has ever heard the words taxonomy or ontology, which is by the way why I try to avoid them. They say okay you know they hear about the Data Vault methodology, they say taxonomy and ontology about the business, this is an ID project we don't want to start figuring that out. It's something that sounds like you know business analysts might be doing or, process development people something like, no we are Data Architects we don't care about that, instead what we have is a fancy automation tool and that can connect into all of these different source systems in our landscape, and we press a couple of buttons and it automatically generates hubs and links and satellites. The stuff that Data Vaults are made of. So there's the data, the tables there are primary keys and whatnot and keys are also something the Data Vault needs, press a couple of buttons generate the model. We get a lot of outputs do we get a lot of outcome? Usually not.
Now, I will come back to the topic of automation I have nothing against the automation tools as I said, we've been doing a lot of stuff with Petr from Data Vault Builder, we've done lots of stuff with Vaultspeed, we've done lots of stuff with WhereScape and we are happily doing, as a company, a lot more stuff and I'm personally also very happy to work with them because the tools are awesome. But if you follow this kind of a process with that kind of a tool I think there is a kind of built-in sensitivity in Data Vault as a methodology, for this kind of failure. Precisely because it is so kind of well-structured methodology that it is easy to automate and generate stuff and it is very tempting to go directly into automating and generating. Now... basically it should be so, that we look at the ontology here, we have the taxonomy yeah we pick okay what do we want in our Data Vault. We won't have hubs for cars and
helicopters and buildings and there's clearly a link between cars and buildings so you know that's going to be a link. This is kind of the the idea if we had the taxonomy and ontology. So just you know point at that and pick your objects from there. Here we have an example, this is actually based on a true story I am anonymising this completely, but it is based on a true story. They had a large ERP system which had a massive table.
ABC doc one, two, three, which had document numbers, what not done they asked the business all right what does this table actually contain and this well you know it contains financial documents. Good, so these Data Vault guys that were working in this organisation, they were thinking that yes okay we're smart, we know that this has to be about business open kind of entity's, so if this contains financial documents that's how we're going to name the hub and then we're going to bring all these attributes as into a satellite around this hub and link it to other hubs but if this table contains financial documents that means that we create a hub called financial document. So they did and the project went on and at some point when the kind of usage of that data was becoming quite difficult, there was a bunch of people, me included, looking at the models and looking at the business requests around the data and what we kind of separately discussed with the business was that okay yeah, we in separate use cases and separate processes and with separate groups of people we're interested in sales contrast variance it's interested in sales order data, deliveries according to product according to target country, invoices, different types of invoices and so on and so forth and they said yeah it's a little bit difficult, you have this financial document table so that's kind of the what we get from the data warehouse. So,
this kind of a system, this is this is a source system Data Vault, you're just renaming the hub for something that sounds like a business entity. This is a highly generalised structure, theoretically in this case, all these entities that the business was interested in were to some degree contained in this table. There were financial documents about all of these events, these are practically events right? There were financial documents recorded in this fancy ERP system, about all these events but for example, if you wanted to have information about specific kinds of invoices, you had to go and find the right types of documents and then you had to figure out are there perhaps multiple different kinds of financial documents per invoice, or something like that. And with sales contract data you had actually the sales contract was recorded as a role in the financial document it had its own key but half of the data about the sales contract, was actually in a completely different system and that system had its own data model and as a result it had its own hub.
So now you were actually forced in the kind of consumption layer of this thing, you're forced to actually combine attributes, from satellites of two different hubs, and that's not how this thing works. It's a source-based approach it's not based on the business reality. So, if we look at the types of failure, this generalisation, but I have categorised this in three types of failures. One, is that the source oriented hub like we saw here on the financial document, this is actually kind of, it is sort of a business object but it's on the wrong level of taxonomy. So here we have we're creating a highly generalised financial document hub,
whereas we would would have actually wanted to get something that was you know more detailed. But then again, if you go too low on the taxonomy you go to the levels of I don't know individual types of contract that might be a fractured data landscape. You are kind of taking a business object but you're picking a wrong one; that requires the users and whoever is working on the consumption layer to constantly apply more business logic to combine this data or to split this data, and that is done every time someone needs something for every single information model whatever you're building on top of that you're always figuring out okay how do I get my events from this generalised hub and its satellites.
Another thing is that, if you pick a source oriented hub, you could have the data related to that thing also in another system but record it completely differently. So, imagine if we have a system where we have people as part of resources there's table for resources and people are individual rows there, there are also cars there's buildings but we have another system where there is a single table for employees and one table for contractors both of which are also people. How do we integrate if we are only building our hubs around resources employees and contractors but we're really interested in people? The third type of issue with this, this is kind of the obvious one is that the source system structure is actually kind of, somehow weird. It's highly technical, it's something that cannot be kind of comprehended it might be built around log messages or something things that do not actually even belong to the taxonomy of things in the business world, these are technical objects like messages sent between systems or something like that. If you built a hub
message and that doesn't correspond to any event that the business understands then you know what does one row mean there? What's the business key? There is no business key because there's no business related to those messages. It's a way of the systems inside the systems how they manage data relating to things. So these are just some of these kind of data source system, Data Vault failures that I've seen and I do think quite many of these kind of bigger scale failures of Data Vault initiatives relate to one or more of these cases. So, how do we do it right then? We had a webinar like I don't know, a year and a half ago with Cindi Meyersohn, and we were kind of drafting this quick and dirty process how to do Data Vault right, obviously based on the actual methodology but just trying to create kind of quick steps here and and what we did, we tried to figure that you know first of all business needs okay. We need to figure
out what the business need is, but then we take a step away from the kind of usual Data Vault talk, we're not talking about Data Vault at that point, we're figuring out what's taxonomy and ontology we're capturing those and a good way to capture those is conceptual data modelling, and this theoretically or methodologically has nothing to do with data world in the sense that we would be using evolved objects we're just figuring out what is the business actually about. And we're working together with the business experts. Then, we have the model, we have the kind of taxonomical different levels of hierarchy, we pick what we actually want from that conceptual data model. We want this to be on the level of you know cars and helicopters but in terms of buildings, we want to just generalise a building, we don't want different kinds of buildings. And once we have done that, we've picked the ends that we really want to use. Then we figure out okay what's the business key. How do we recognize a building? What's the identifier for
a building? What's the identifier for a helicopter? After we've done that, we have picked the entities, we have figured out what the business keys are, then we can go designing the Data Vault, and this is the first, kind of place, where we actually go into the logical datable model itself. All of this is part of the kind of Data Vault process but only here do we go into the Data Vault model and only after we've done that, do we look at the source systems. I think this is kind of one of the most important things, also in the Data Vault done right thing. We're only looking at the source systems once we've already defined what the Data Vault basic structure looks like, obviously the satellites are populated from the sources and whatnot and it's pointless to design all the satellites beforehand, but we should have the hub link structure at least already there. And then we go creating the scripts and scheduling and
monitoring all the technical stuff that is fortunately done mostly automatically. This kind of approach requires an understanding of data modelling on different levels so, we need to understand that there's kind of a business it gap which we need to be able to bridge and it starts from the business side, so every time we start doing Data Vault actually, any project we need to start with business glossary really. What do you mean by a customer okay, you want invoice data, what do you mean by an invoice? What's an invoice? What do you mean by a product is it different from product category? This is us trying to understand the business need, then we create the conceptual data model because it is a good way of capturing the taxonomy and ontology, so what things do we need data about? How are they related? How are they related in real life of that business? Not in store system databases and then when we have that, then we can figure out alright now we want to make a Data Vault out of that, what does that look like? What are the hubs? What are these satellites? What are the links? And then, we start mapping this and maybe we have concerns around the technical storage. I mean theoretically you could do Data Vault in any technical storage, doesn't have to be even a relational database I suppose. How do you implement that Data Vault design in this particular technology which is the physical data model and only then, do we figure out the different kind of integrations with tools and so on and so forth, and that is breaching the business ID gap with different levels of models we can't just say that yep we're doing data modelling now because we are generating the Data Vault model from whatever source system might exist.
Conceptual data modelling is therefore kind of an important step in bridging that gap, and I have here on a large font, a quote which I practically always use in every presentation I've ever done that's by Alex Sharp, explaining what a data model in the conceptual sense is, it's a description of a business in terms of the things it needs to know about so, we're modeling a business we're not modeling a system and we figure out what things does the business need to know about. It needs to know about customers invoices and products good, that's the part model there so we create a model that describes the actual real-life things the events people places resources whatever you have we figure out how they are related in the real life, we work together with the business experts and this is super important, we try to ensure that we understand those words in the same way and this is kind of the business glossary part because everyone has different description or definition of a customer or a product which are super difficult things. And this all is technology agnostic so we don't care if it's SAP, we don't care if it's whatever, the reality of the business can be described as a conceptual data model regardless of the systems or technology they use. So, the thinking process involved in this is that we are modelling the real world. The way I like to kind of go at this is
that I think of a slice of reality, I pick a kind of scope area from the real life of the business. There's logistics and sales and whatnot you have an area there that you kind of point at, and you say I'm going to model that. I'm not going to model a system. I'm not going to model database. I'm not going to model Data Vault even. I'm going to model that piece of reality there and I picked the entities in the conceptual model from that reality, customers invoices, website visits, products, whatever and I figure out how these are related to each other by kind of describing the narrative part of that reality a deliveries dispatched from my warehouse means that there must be a relationship an association between the delivery and a warehouse called dispatched from. Again, I'm not looking at foreign keys, I'm thinking about how the narrative actually goes and then turning the verbs from that narrative, into relationships. Who cares about technologies, our database is not important at this point we collaborate with the business we find out the people who actually know how this works, not the systems administrators of the system in which this work happens, but the people who actually do this work because they know how it goes. We ask them how does your reality work and then we draw the model on that. Which
results in us creating a technological solution but that technological solution is designed based on the slice of reality that we encountered. So here's an example of model this is done in Ellie. Not advertising much at least, but what a conceptual data model should do, it should only contain this business entities, it should capture the taxonomy and the ontology and there should be if not in the model itself, at least close to it somewhere the definitions of those entities the glossary. So here you can see this is I think this is called a barker notation or something of
subtypes this is basically a taxonomy part. We have resources of type building and vehicle, we have vehicles of type, car and helicopter. We have parties of type, organisational, person. And then we have these lines telling that you know organisation owned resources, a car is stored in the garage, a person has a license to a vehicle, and so on. And then we can describe what an
organisation is, there's the description in Ellie. These are captured in the conceptual data model. It's taxonomy and ontology the hierarchical classification and the relationships. Then we go thinking about the Data Vault. We haven't, obviously we would pick also the kind of the entities that we are interested here, it might be that the data model on the conceptual level includes also stuff that needs to exist there because you know we figured out that okay you know this is also related and we want to separate this and that entity, so let's put them both on the model but we decide what do we want to actually work on and we decide that by kind of pointing at the model and saying yep that's going to be my hub and based on the relationships and entities between these hubs, we can then decide what would be the links.
And this decision is made here on looking at the conceptual model and I would actually even go and and point physically my finger at the end of the saying hub, hub, hub, hub, hub and then we are happy with the selection there. The conceptual data model might contain more stuff, but we have to remember that because we did that by looking at reality instead of systems it is one hundred percent valid, it's completely reusable no matter what we pick here. We're just saying that okay reality looks like this, we're going to pick these parts from it for our Data Vault purposes. So, we have done that we have our hubs, buildings, cars, persons whatever that is, we can create links we know we can look again at the model say okay there's a line between those that means there's a link, cars stored in the building. We need to identify another business case, so figure that out what's the business key for a person? Social security number do we have those? License plates whatever, and then we can go and create these logical models. In a later tool, we have a separate layer for logical models, or of course we could use this conceptual data model wherever it's created and then push that as kind of basic design into another tool, WhereScape, Vaultspeed, Data Vault Builder are ones that we've worked with, which are all great tools and create the kind of actual Data Vault model inside that tool, but the point still is that we have these two models that are separate but connected.
Now here's an example just super simplified I left out all the attributes and everything, created this in Ellie, such a super simplified model where we have the hubs, the person, the current building, we have the links and I have linked these entities this kind of logical Data Vault and this to the kind of business entities from the conceptual model so that I can see that the hub building is actually about the building which has its own definition in the glossary. It's obvious here because, the naming is good, but it might not be always obvious and that's why it's very important to be aware that my selection of this logical entities here corresponds to those kind of business entities in reality in some way. And this is also a good place to have a kind of quick peek at the source systems, we often talk about doing kind of a reality check against the systems that you have, it is often so that the model that you create based on what the business should look like might not actually be completely true because you don't have even any systems that would record some piece of that information. So obviously we would involve someone who knows about the source data, who could tell us that you know actually we don't have any information about the kind of building in which the car is stored. That the state doesn't exist at all. Then we
can you know edit this model, but the core design of this whole thing is derived from the conceptual data model without looking at the source systems other than doing this kind of reality check. So, automation steps in at this point and this is also kind of where I will not go much further with this anymore because now we get then into the kind of nitty-gritty details of of doing the actual Data Vault modeling and figuring out the actual mappings and the actual ELD scripts and so on and so forth. You need to have automation, you need to have right tooling for that some people do it by hand you know writing thousands of lines of SQL but it's not really a good use of anyone's time anymore, because so much better options exist. But the point is that we have got this far without resorting to automation and the thing the beautiful and annoying thing about this conceptual data model is that we cannot automatically create it. It's about the business reality. You don't have the business reality reflected in a structure of a database, unless you have already done this exercise. So after we create, and to understand what the business is about we create the model, we derive the Data Vault model's basic structure from that and then we start automating things and mapping stuff around, and go just generated and we're happy. And of course when
the source systems change that's very easy for us now, because we have mapped them to business entities which rarely change, they do change but they rarely change, so we can just you know automate again so that the new sources are also mapped into these things. If the business itself changes, then of course the conceptual model also changes, and that means that there's going to be manual work you, have to update the conceptual model, you have to figure out how that is reflected in a new version of the Data Vault model and so on. Data Vault's or luckily could in the sense, that it rarely breaks down in that kind of situation, but the point here is that I'm actually saying here, you cannot automate away the conceptual model, you cannot automate away the understanding of the business concepts, but you can automate away many steps after that.
So, I think that brings us to our conclusions here. I know many of these things is stuff that you've heard from, from many people all the time and read about you know this is business-centric things it's always been so, all the books say it, to achieve this we have to do a little bit of leg work. We have to model it after the actual business entities and we have to figure out what the actual business entities are. We have the kind of tempting, like one of those sirens or what they were in the Odyssey, that are you know singing there, luring the seamen to treacherous rocks that, you can go and and automate this from the source systems, but that's risky and it will lead often to failures. So to capture the taxonomy and ontology, conceptual data modelling, it's been there for decades already so everyone knows it works exactly in that part of this work. So just go and do it, capture the actual
business needs and then we design the Data Vault without looking at the systems. So, whereas many of the kind of more technical aspects of Data Vault methodology are about doing the Data Vault in a right way following the steps, having the business case, and hashes and whatnot, this part of the process which we cannot automate away is about doing the right things not doing things right away but doing the right things in the first place. We pick the right entities the things that the business is actually about and the things that the business needs information about, and that's really what conceptual data modelling is in this, whole big process of Data Vault development and design it's ensuring that we do actually the right thing and thus avoid these traps that source system Data Vault solutions usually have.
2023-02-09