Capture your business needs with conceptual data modelling

Show video

The thing I'd like to focus today here,  is data modelling in the context of Data Vault and how do we use conceptual  level modelling to solve some of these issues   that we see in Data Vault problems, the  problems happening in the Data Vault space.   I've always liked Data Vault as an  approach, I've done it myself many times.   Recently we did a thing with Data Vault North  America user group with Cindi Meyersohn and John Giles was involved as well. It was called  Data Vault end-to-end done right and the   core thing of that, is that it has to start  from the business. So I mean you've read the  

books, you've been on the classes, and so on you  know that you know the business entities that's   the core of the Data Vault, that's the whole you  know point of the approach. We're not very good at that are we... and I think it's good for us to look at some of the kind of failure states really that Data Vault initiatives occasionally have and I kind of try to emphasise here that conceptual modelling and focusing on the business, is actually one of the kind of key   success factors of any Data Vault project,  regardless of technology, regardless of the   organisation, regardless of the complexity of  your data sources and whatnot is. So, this is kind of a way to emphasise the role of conceptual data modelling in this process of doing Data Vault right.

If we start from very basics and I know  many of you know the Data Vault methodology   inside out, I don't claim to be an expert on  that but I think the if we try to kind of   get to the core of the issue with this, that we're integrating data right that's   the whole point. We're building a kind of  centralised Big Data thing that integrates data   from different sorts of places, integrates  it in a way that aims to provide value for   the whole enterprise. We want to have reusable  data, we want to have data that is business   centred, really business-centric kind  of in the sense that we're perhaps   not that keen on following the structures that  happen to exist in the source systems. We want to  

have something that can be utilised to actually produce value for the business.   So the Data Vault approach of course is of  integration to business keys. What is a business key? It's the identifier of the  actual business entity, it's not a primary key field okay it could be a primary key field in the system somewhere but that's not really the point we want to know what are the core things that you have in your business? You have customers, you have products, you have invoices and advertising campaigns or whatnot and these are things that kind of exist across your system landscape. Different sorts of systems might handle these things differently. They might have different surrogate keys or technical identifiers for things but we  try to find the actual business case, you   know how do you separate customer A from customer  B? Maybe you're using a customer ID, maybe you're   using the if you're a public sector organisation, the social security number or something like that. 

And the idea of course is that integrating  the data that we have across the enterprise   from multiple systems around this business  case, ensures that we can actually have an   enterprise wide view to what the data really is.  So we're not just looking at the source system A and source system B, we want to look at customer  as a thing that exists across the organisation.   Now of course source systems, they have  their own data models they are systems built   for some operational purpose, usually you have  your ERP's and CRM's and HR tools and whatnot   and the thing with those, is that the internal  data model of such a system, is a response   basically kind of a design for the problems  that, that particular solution needs to solve   but it does necessarily not completely match with  what the business is kind of about. And this is an   interesting thing and of course many of you who  might have been working with European companies,   you know what SAP does and what SAP does is  basically tell them, we have a such a good model   internally that you can mould your business  into it and that works quite well but occasionally that's not quite the case,  that the tool and the business would somehow   you know mould into each other. The tool might  internally have some kind of conception of  

what data structures exist and what objects exist,  what entities exist, that do not necessarily match   with the real or actions that you have in your  business. So technical key field in the source   system it might be a kind of real business key it  might be social security number or it might not. A table in a source system database might match  with a concept like a lead or prospect. You have all the prospects here, one row is a prospect. Or it might not. So if you want to build a Data   Vault and integrate the data solution of the  source of data sources it must be business centric,   it must kind of be above and beyond the technical  structures and solutions and lines of   thinking that those individual systems might  have, and I think the kind of core of this   is there on that's quote I believe from  Dan's blog post from a few years back, but   I think this kind of emphasises this properly that  if you build a source system Data Vault model the   value of the solution drops to one tenth of one  percent overall.

Now that's pretty rough isn't it? But I also think that's completely true and the, Data Vault initiatives that I've seen that I've been personally involved in, that I've audited, or looked after or heard about, I seem to sense kind of pattern there, that those that are not really valuable tends to be source system   oriented and those that are most valuable and  successful, tend to be properly business oriented.   That's kind of anecdotal data, but I do believe  that everyone that I have talked with has been saying pretty much the same thing and some like Dan put it in in rather stark terms even.   Now the problem is, that it's easy to say that  we want to be business centric, we want to be   business oriented, we want to integrate around  real business case but do we really know what the business is? Now the approach that at least Data Vault 2.0 as a methodology has, is it talks about getting taxonomies and ontologies right first and then you go into your Data Vault model from there. So all right uh taxonomy and ontology are words that I  personally try to avoid using as much as possible. There's the reason for that being that they are not quite straightforward but I've tried to put this in a straightforward terms here and my apologies if someone here is a kind of academical ontologist they might, me for this  but I'm trying to put this in it's kind of simple   terms, so what we usually mean with a taxonomy is  that it's kind of a hierarchical classification   of types of things, and I have an example coming up  there.

So you have like persons and persons can be customers and so on and customers are loyal customers and non-loyal customers I don't know   different classifications hierarchy,  hierarchical classification of things like in   the animal kingdom you have animals and you have birds and fish and mushrooms   and whatnot and you have the various  types of raccoons there and so on so forth.    It's a biological taxonomy, but when we start  adding relationships between these things into the   taxonomy we get roughly speaking, something that  is an ontology. The ontology is kind of well you   probably know where the word comes from, but it  explains what the nature of existence is kind of.   Things exist like this and they have something to  do with other things. So we have the taxonomy of the hierarchical classification of all the things  in the world and then we have an understanding of   how those things interact with it each other. And  a business can be described in these terms.

Now the way that Data Vault models should be created  is that you take the taxonomy you take the   ontology of the real business, not the source  systems you take the taxonomy and ontology okay   these are things that exist in the nature of this  particular business and this is how they interact   and now I choose from here what are going to be  my Data Vault objects, this is going to be a hub   and because there's a link to the an association  between this thing and another thing we're going   to create a link type table table between. That's kind of how it should work. The problem might be that you don't necessarily  have it like this. Now this is very kind

of clean and same very simple example, and again  my apologies for probably misrepresenting ontology   here or at least very roughly presenting it in  this simplified form, but this is a hierarchical   classification resources. We have two types of  resources: we have vehicles, we have buildings. Where there are two types of vehicles, there are cars  and helicopters I don't know what the business   is but anyways we have garages where cars  are stored and so on. Now this taxonomy is   enhanced with the associations between things we  could approximately something that could be   called an ontology at least in this context. So, we classify the things that exist in the business   and again we're not looking at systems, we're  looking at the business as it is in actual reality. Usually we don't have that.

So the reason that I've seen for kind of usual failure in Data Vault initiatives, every data initiative every type of data  initiative has failures and I think   the overall failure rate in data initiative is something like 70 percent but Data Vault initiatives also fail and the usual path is  that first of all you don't have that picture,   you don't have the taxonomy, you don't have the  ontology, most likely no one in the organisation   has ever heard the words taxonomy or ontology,  which is by the way why I try to avoid them. They say okay you know they hear about the Data  Vault methodology, they say taxonomy and ontology about the business, this is an ID project we don't want to start figuring that out. It's something that sounds like you know business  analysts might be doing or, process  development people something like, no we are Data Architects we don't care about that, instead   what we have is a fancy automation tool and that  can connect into all of these different source   systems in our landscape, and we press a couple of  buttons and it automatically generates hubs and   links and satellites. The stuff that Data Vaults are made of. So there's the data, the tables   there are primary keys and whatnot and keys are  also something the Data Vault needs, press a   couple of buttons generate the model. We get a lot of outputs do we get a lot of outcome? Usually not.

Now, I will come back to the topic of automation  I have nothing against the automation tools as   I said, we've been doing a lot of stuff with  Petr from Data Vault Builder, we've done   lots of stuff with Vaultspeed, we've done lots  of stuff with WhereScape and we are happily   doing, as a company, a lot more stuff and I'm  personally also very happy to work with them   because the tools are awesome. But if you follow  this kind of a process with that kind of a tool   I think there is a kind of built-in sensitivity in  Data Vault as a methodology, for this kind of failure.  Precisely because it is so kind of well-structured  methodology that it is easy to automate and   generate stuff and it is very tempting to  go directly into automating and generating. Now... basically it should be so, that we look at  the ontology here, we have the taxonomy yeah   we pick okay what do we want in our  Data Vault. We won't have hubs for cars and  

helicopters and buildings and there's clearly  a link between cars and buildings so you know   that's going to be a link. This is kind of the  the idea if we had the taxonomy and ontology.   So just you know point at that and pick your objects from there. Here we have an example, this is actually based  on a true story I am anonymising this completely, but it is based on a true story. They had  a large ERP system which had a massive table. 

ABC doc one, two, three, which had document numbers,  what not done they asked the business   all right what does this table actually contain  and this well you know it contains financial   documents. Good, so these Data Vault guys that  were working in this organisation, they were   thinking that yes okay we're smart, we know that this has to be about business open   kind of entity's, so if this contains financial  documents that's how we're going to name the hub   and then we're going to bring all these attributes  as into a satellite around this hub and link it to other hubs but if this table contains  financial documents that means that we create a   hub called financial document. So they did and the project went on and at some point when the kind of usage of that data was becoming quite difficult, there was a bunch of people, me included, looking at the models and looking at the business requests around the data   and what we kind of separately discussed with  the business was that okay yeah, we in   separate use cases and separate processes and  with separate groups of people we're interested   in sales contrast variance it's interested in  sales order data, deliveries according to product according  to target country, invoices, different types of   invoices and so on and so forth and they said yeah  it's a little bit difficult, you have   this financial document table so that's kind of the what we get from the data warehouse. So,

this kind of a system, this is this is a source  system Data Vault, you're just renaming the hub for   something that sounds like a business entity. This  is a highly generalised structure, theoretically in   this case, all these entities that the business  was interested in were to some degree contained in   this table. There were financial documents about  all of these events, these are practically   events right? There were financial documents  recorded in this fancy ERP system, about all   these events but for example, if you wanted to  have information about specific kinds of invoices,   you had to go and find the right  types of documents and then you had to figure out   are there perhaps multiple different kinds of  financial documents per invoice, or something like   that. And with sales contract data you had actually  the sales contract was recorded as a role in the   financial document it had its own key but half  of the data about the sales contract, was actually   in a completely different system and that system  had its own data model and as a result   it had its own hub.

So now you were actually forced in the kind of consumption layer of this thing, you're forced to actually combine attributes, from satellites of two different hubs, and that's not how this thing works. It's a source-based approach it's not based on the business reality. So, if we look at the types of failure, this  generalisation, but I have categorised this in   three types of failures. One, is that the source  oriented hub like we saw here on the financial   document, this is actually kind of, it is sort  of a business object but it's on the wrong level   of taxonomy. So here we have we're creating  a highly generalised financial document hub,  

whereas we would would have actually wanted to  get something that was you know more detailed.   But then again, if you go too low on the taxonomy  you go to the levels of I don't know   individual types of contract that might be a  fractured data landscape. You are kind of taking   a business object but you're picking a wrong one; that requires the users and whoever is   working on the consumption layer to constantly  apply more business logic to combine this data   or to split this data, and that is done every  time someone needs something for every single   information model whatever you're building on  top of that you're always figuring out okay   how do I get my events from this generalised  hub and its satellites.

Another thing is that, if you pick a source oriented hub, you could have the data related to that thing also in another   system but record it completely differently. So,  imagine if we have a system where we have people   as part of resources there's table for resources  and people are individual rows there, there are   also cars there's buildings but we have another  system where there is a single table for employees   and one table for contractors both of which are  also people. How do we integrate if we are only   building our hubs around resources employees and  contractors but we're really interested in people? The third type of issue with this, this is kind of the obvious one is that the source system structure is actually kind of, somehow weird. It's highly technical, it's something that cannot be kind of comprehended it  might be built around log messages or something   things that do not actually even belong to the  taxonomy of things in the business world, these   are technical objects like messages sent between  systems or something like that. If you built a hub  

message and that doesn't correspond to any event  that the business understands then you know what   does one row mean there? What's the business  key? There is no business key because there's no   business related to those messages. It's a way of  the systems inside the systems how they   manage data relating to things. So these are just some of these kind of data source system, Data Vault failures that I've seen and I do think quite many of these kind of bigger scale   failures of Data Vault initiatives relate to  one or more of these cases. So, how do we do   it right then? We had a webinar like I don't  know, a year and a half ago with Cindi Meyersohn, and we were kind of drafting this quick and dirty  process how to do Data Vault right, obviously   based on the actual methodology but  just trying to create kind of quick steps here and   and what we did, we tried to figure that you know  first of all business needs okay. We need to figure  

out what the business need is, but then we take a  step away from the kind of usual Data Vault talk,   we're not talking about Data Vault at that  point, we're figuring out what's taxonomy and   ontology we're capturing those and a good way to  capture those is conceptual data modelling, and this   theoretically or methodologically has nothing to do with data world in the sense that we   would be using evolved objects we're just figuring  out what is the business actually about. And we're   working together with the business experts. Then,  we have the model, we have the kind of taxonomical   different levels of hierarchy, we pick what we  actually want from that conceptual data model.   We want this to be on the level of you know cars and  helicopters but in terms of buildings, we want to   just generalise a building, we don't want different  kinds of buildings. And once we have done that, we've picked the ends that we really want to  use. Then we figure out okay what's the business   key. How do we recognize a building? What's the identifier for 

a building? What's the identifier for a helicopter?  After we've done that, we have picked the entities, we have figured out what the business keys  are, then we can go designing the Data Vault, and   this is the first, kind of place, where we actually  go into the logical datable model itself. All of   this is part of the kind of Data Vault process  but only here do we go into the Data Vault model   and only after we've done that, do we look at  the source systems. I think this is kind of one   of the most important things, also in the Data Vault done right thing. We're only looking at the source systems  once we've already defined what the Data Vault   basic structure looks like, obviously the satellites are populated from the sources and   whatnot and it's pointless to design all  the satellites beforehand, but we should have the   hub link structure at least already there. And then  we go creating the scripts and scheduling and  

monitoring all the technical stuff that is fortunately done mostly automatically. This kind of approach requires an understanding  of data modelling on different levels so, we need to understand that there's kind of  a business it gap which we need to be able   to bridge and it starts from the business  side, so every time we start doing Data Vault actually, any project we need to start  with business glossary really. What do you mean   by a customer okay, you want invoice data, what  do you mean by an invoice? What's an invoice?   What do you mean by a product is it different from  product category? This is us trying to understand   the business need, then we create the conceptual  data model because it is a good way of capturing   the taxonomy and ontology, so what things do  we need data about? How are they related? How   are they related in real life of that business? Not in store system databases and then   when we have that, then we can figure out alright now we want to make a Data Vault out of   that, what does that look like? What are the hubs?  What are these satellites? What are the links? And then, we start mapping this and  maybe we have concerns around the technical   storage. I mean theoretically you could do Data  Vault in any technical storage, doesn't have to   be even a relational database I suppose. How  do you implement that Data Vault design in this   particular technology which is the physical  data model and only then, do we figure out   the different kind of integrations with tools and  so on and so forth, and that is breaching the   business ID gap with different levels of models we  can't just say that yep we're doing data modelling   now because we are generating the Data Vault model  from whatever source system might exist.

Conceptual data modelling is therefore kind of an  important step in bridging that gap, and I have   here on a large font, a quote which I practically  always use in every presentation I've ever done   that's by Alex Sharp, explaining what a  data model in the conceptual sense is, it's a   description of a business in terms of the things  it needs to know about so, we're   modeling a business we're not modeling a system  and we figure out what things does the business   need to know about. It needs to know about  customers invoices and products good, that's   the part model there so we create a model  that describes the actual real-life things the   events people places resources whatever you have  we figure out how they are related in the real   life, we work together with the business experts  and this is super important, we try to ensure   that we understand those words in the same way  and this is kind of the business glossary part   because everyone has different description or  definition of a customer or a product which are   super difficult things. And this all is technology  agnostic so we don't care if it's SAP, we don't   care if it's whatever, the reality of  the business can be described as a conceptual   data model regardless of the systems or technology  they use. So, the thinking process involved in this is that we are modelling the real world. The way I like to kind of go at this is

that I think of a slice of reality, I pick a kind of scope area from the real life of the business. There's logistics and sales and  whatnot you have an area there that you kind of   point at, and you say I'm going to model that. I'm  not going to model a system. I'm not going to model   database. I'm not going to model Data Vault even.  I'm going to model that piece of reality there and   I picked the entities in the conceptual model from  that reality, customers invoices, website visits,   products, whatever and I figure out how these are  related to each other by kind of describing the   narrative part of that reality a deliveries  dispatched from my warehouse means that there   must be a relationship an association between the  delivery and a warehouse called dispatched from. Again, I'm not looking at foreign keys, I'm  thinking about how the narrative actually   goes and then turning the verbs  from that narrative, into relationships. Who cares about technologies, our database is  not important at this point we collaborate with   the business we find out the people who actually  know how this works, not the systems administrators   of the system in which this work happens, but the  people who actually do this work because they know   how it goes. We ask them how does your reality  work and then we draw the model on that. Which  

results in us creating a technological solution  but that technological solution is designed based   on the slice of reality that we encountered. So here's an example of model this is done in Ellie. Not advertising much at least, but what a conceptual data model should do, it should only   contain this business entities, it should capture  the taxonomy and the ontology and there should be   if not in the model itself, at least close to  it somewhere the definitions of those entities   the glossary. So here you can see this is I think  this is called a barker notation or something of  

subtypes this is basically a taxonomy part. We  have resources of type building and vehicle,   we have vehicles of type, car and helicopter. We  have parties of type, organisational, person. And   then we have these lines telling that you  know organisation owned resources, a car is   stored in the garage, a person has a license to a  vehicle, and so on. And then we can describe what an  

organisation is, there's the description in Ellie. These are captured in the conceptual data model.   It's taxonomy and ontology the hierarchical  classification and the relationships. Then we go thinking about the Data Vault. We haven't, obviously we would pick also the kind   of the entities that we are interested  here, it might be that the data model on   the conceptual level includes also stuff that needs to exist there because you know   we figured out that okay you know this is also  related and we want to separate this and that   entity, so let's put them both on the model but we  decide what do we want to actually work on and we   decide that by kind of pointing at the model and  saying yep that's going to be my hub and based   on the relationships and entities between these  hubs, we can then decide what would be the links.  

And this decision is made here on looking  at the conceptual model and I would actually   even go and and point physically my finger at  the end of the saying hub, hub, hub, hub, hub and   then we are happy with the selection there. The  conceptual data model might contain more stuff, but we have to remember that because we did that  by looking at reality instead of systems it is   one hundred percent valid, it's completely reusable no matter  what we pick here. We're just saying that okay   reality looks like this, we're going to pick these parts from it for our Data Vault purposes. So, we have done that we have our hubs, buildings,  cars, persons whatever that is, we can create links   we know we can look again at the model say  okay there's a line between those that means   there's a link, cars stored in the building.  We need to identify another business case,   so figure that out what's the business key  for a person? Social security number do we   have those? License plates whatever, and then  we can go and create these logical models. In a later tool, we have a separate layer for  logical models, or of course we could use this   conceptual data model wherever it's created  and then push that as kind of basic design into   another tool, WhereScape, Vaultspeed, Data Vault Builder are ones that we've worked with, which   are all great tools and create the kind of actual Data Vault model inside that tool, but   the point still is that we have these two models that are separate but connected.

Now here's an example just super simplified I left  out all the attributes and everything, created   this in Ellie, such a super simplified model where we have the hubs, the person, the current building,   we have the links and I have linked these entities  this kind of logical Data Vault and this to   the kind of business entities from the conceptual  model so that I can see that the hub building   is actually about the building which has its  own definition in the glossary. It's obvious   here because, the naming is good, but it  might not be always obvious and that's why it's   very important to be aware that my selection  of this logical entities here corresponds   to those kind of business entities in reality in  some way. And this is also a good place to have a   kind of quick peek at the source systems, we often  talk about doing kind of a reality check against   the systems that you have, it is often so that the  model that you create based on what the business   should look like might not actually be completely  true because you don't have even any systems that   would record some piece of that information. So  obviously we would involve someone who knows about   the source data, who could tell us that you know  actually we don't have any information about the   kind of building in which the car is stored. That  the state doesn't exist at all. Then we  

can you know edit this model, but the core design  of this whole thing is derived from the conceptual   data model without looking at the source systems  other than doing this kind of reality check. So, automation steps in at this point and this  is also kind of where I will not go much further with this anymore because now we get then into the kind of   nitty-gritty details of of doing the actual  Data Vault modeling and figuring out the actual   mappings and the actual ELD scripts and so on and  so forth. You need to have automation, you need to   have right tooling for that some people do it by  hand you know writing thousands of lines of SQL   but it's not really a good use of anyone's  time anymore, because so much better options exist.  But the point is that we have got this far  without resorting to automation and the thing   the beautiful and annoying thing  about this conceptual data model is that we   cannot automatically create it. It's about the  business reality. You don't have the business   reality reflected in a structure of a database,  unless you have already done this exercise.  So after we create, and to understand what the  business is about we create the model, we derive   the Data Vault model's basic structure  from that and then we start automating   things and mapping stuff around, and go just  generated and we're happy. And of course when  

the source systems change that's very easy  for us now, because we have mapped them to   business entities which rarely change, they  do change but they rarely change, so we can   just you know automate again so that the  new sources are also mapped into these things.   If the business itself changes, then of course  the conceptual model also changes, and that means   that there's going to be manual work you, have to  update the conceptual model, you have to figure   out how that is reflected in a new version of the  Data Vault model and so on. Data Vault's or luckily could   in the sense, that it rarely breaks down in that  kind of situation, but the point here is that   I'm actually saying here, you cannot automate  away the conceptual model, you cannot automate   away the understanding of the business concepts,  but you can automate away many steps after that.

So, I think that brings us to our conclusions  here. I know many of these things is stuff that you've heard from, from many people all the time and read about   you know this is business-centric things it's  always been so, all the books say it,   to achieve this we have to do a little bit of  leg work. We have to model it after the actual   business entities and we have to figure  out what the actual business entities are.  We have the kind of tempting, like one of those sirens or what they  were in the Odyssey, that are you know   singing there, luring the seamen to treacherous  rocks that, you can go and and automate this   from the source systems, but that's risky and it will lead often to failures. So to capture the taxonomy and ontology, conceptual data  modelling, it's been there for decades already   so everyone knows it works exactly in that part of  this work. So just go and do it, capture the actual  

business needs and then we design the Data Vault without looking at the systems. So, whereas many of the kind of more technical aspects of Data Vault methodology are about doing   the Data Vault in a right way following the steps,  having the business case, and hashes and   whatnot, this part of the process which we cannot  automate away is about doing the right things   not doing things right away but doing the  right things in the first place. We pick the right   entities the things that the business is actually  about and the things that the business needs   information about, and that's really what conceptual data modelling is in this, whole   big process of Data Vault development and design  it's ensuring that we do actually the right   thing and thus avoid these traps that source system Data Vault solutions usually have.

2023-02-09

Show video