Return of the Hypercube, Simply Business
I'm going to talk to you today about a presentation I've called "Return of the Hypercube, or How I Learned to Stop Worrying and Love the OBT". The way I do presentations is that I come up with a title first and figure out the content later, and I really like this one, so I hope the presentation follows through.

A little bit about me. This is my first time at a Boston dbt meetup; I've been a user of dbt for a while, but I'm based in the UK. We do have offices in Boston and analytics engineers based there. I'm Head of Customer Analytics, primarily focused on marketing operations: trying to make data valuable at the right point in the customer journey, so being predictive and active. That part of my role doesn't cover traditional business intelligence concerns like dashboards.

A little more about me, apart from being Head of Customer Analytics: I used to run a dance music party in London. Data is my first love, but before that it was trance music; if anybody knows it, that's a sub-genre of a sub-genre. I started running during the pandemic, maybe like many of you, and I love the Garmin: it's got data, it's a data lover's watch, so if you haven't got one, get one. And an interesting fact: I'm a co-founder of a whisky bar in Scotland. As opposed to Drizly, I'm trying to encourage people to get out and drink, so if you're ever in Scotland, go to Inverness and have some whisky.

A little bit about Simply Business: we're the UK's number one SME insurance broker. We service small businesses, and we've been successful because we've brought tech and data into one of the traditional professions, the second oldest profession, and we're expanding in both the UK and the US. We've seen continued growth despite the enormous challenges of the pandemic. In the UK we're behind all the major UK financial price comparison sites; I think the US is still a way away from having those kinds of consolidated giants where you can really compare financial products, but it's quite established in the UK, and for SME insurance we're behind basically all of them. And, much like everybody, we're hiring for dbt, so get in touch.

The agenda I've got to go through: why are we here, what's this all about; WTF is a hypercube, which is a good name, but is it anything else; using dbt to build the blocks of the hypercube, and where dbt fits in the architecture; what this did for us; and then hopefully a little bit of time for Q&A.

So, a very existential question: why are we here, for this particular dbt meetup? It's about this: our analytics architecture changed. I'm a little bit old-school, probably older than a lot of people on the call, and I came from a more traditional BI background. We had a stack that was pretty long in the tooth. I think probably not that many people will still be on this sort of stack, but we were on SQL Server, using SQL Server Integration Services (SSIS) as our ETL and integration tool. Our end users were using Excel, and on the back of that they used something called SQL Server Analysis Services. SSAS is basically an OLAP technology: something that takes data, aggregates it up, and makes interactive queries lightning fast for users.

If we look at our current architecture, SQL Server has obviously been replaced by Snowflake, SSIS has been replaced by dbt, and Excel, probably like for many of you, has been replaced by Looker. And SSAS, what's been replaced by that? It's been replaced, kind of, by Snowflake. There's no real parallel; I think the thought is that Looker runs natively on these columnar, massively parallel cloud databases.
So why do we need an OLAP cube, if everything is blindingly fast anyway? Let's have a look at a typical data set. This is something I've got off the internet, a data model for e-commerce by Barry Williams, so I'll give him his full credit; it's nothing to do with me. What we can see is that it's really highly normalised: there are lots of relationships, everything's in third normal form, no redundancy, joins everywhere, lots of one-to-many joins, and many of them might have outer join requirements; you can see the little circles around those little crow's feet everywhere.

That's nasty, because writing aggregate queries that span these tables is absolutely non-trivial. Just trying to get any kind of measure out of that, as soon as you're combining information from several tables, you've got to think really hard about it. What's more, running aggregate queries against those tables is slow. Even if you've got some massively fast parallel cloud database, it's still going to take time, and probably it's going to be expensive as well, because we're paying for that querying.

Now, we've used Looker, and some of you might too, against something that looks like that. Looker solves the first problem, that writing aggregate queries spanning tables is non-trivial. But look at how they solve it: has anybody ever taken the time to understand the SQL it generates? I never have. So they solve that problem, great, but then what's the next problem? Has anybody seen this screen before? This is basically the loading-data screen, and it's awful. And this is a curious cat. I think analysts inherently need to be curious, right? That's how we get to the bottom of what's happening. If every time we want to ask a question we fire off an enormous query that surges out onto the ether, gathers up that data and costs money, then you're going to kill the curiosity inherent in the cat. Not good. And at the same time somebody is making a lot of money out of that; as analysts we're paying for some very lavish lifestyles out there. We don't like that; we want to live in a more equal world, don't we?

So what is the hypercube, and what can it do for us? Maybe a pop quiz: what is a hypercube? Is it (a) a way to make interrogating data faster, (b) a multi-dimensional array of data, or (c) a name that makes data sound cooler than it actually is? Any thoughts? Whoever gets it wins a prize; I'm not sure what the prize is yet. Yes, good, it was a trick question, well done. Danny's been through this before, so I know Danny knows. It's all of them.

If we go to a standard definition: "cube" is just shorthand for a multi-dimensional data set, and given that data can have an arbitrary number of dimensions, it doesn't have to be three like a cube. When you've got more than three, you call it a hypercube, and then data sounds all sexy and cool and people are interested in what you're talking about.

To get cube data you frequently prepare it using a star schema. Many analysts nowadays don't know what a star schema is, because they're off in their own normalised world, but the star schema is a traditional business intelligence structure; if you haven't come across it, you can read up on it. It's fundamentally a way of preparing data: you split your data into measures and dimensions, and once you've done that, you can load the data into a cube.

Many people will already know this, so just to recap. We have facts, which are anything quantitative that you can measure, a piece of information like a sale or a download. Then you have your conformed dimensions. A conformed dimension is basically a set of data attributes that are referenced in multiple database tables and are consistently addressed, so they're always talking about the same domain entity in the same way. That's an important part of this: making sure that a person is a person and an order is an order, consistently, across the different data sets. And finally the star schema itself, which is just where the facts are stored in a fact table, only the facts, with surrogate keys out to those dimensions, and it looks like a star. This is probably very much old hat to a lot of people who grew up with it.
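To make the shape concrete, here is a minimal sketch of a star; the table and column names are made up for illustration, not taken from Simply Business's models:

    -- Two conformed dimensions and one fact table that holds only
    -- measures plus surrogate keys out to those dimensions.
    create table dim_customer (
        dim_customer_key      integer,
        dim_customer_name     varchar,
        dim_customer_segment  varchar
    );

    create table dim_date (
        dim_date_key           integer,
        dim_date_calendar_day  date,
        dim_date_month_name    varchar
    );

    create table fct_sales (
        sale_id           integer,
        dim_customer_key  integer,      -- surrogate key to dim_customer
        dim_date_key      integer,      -- surrogate key to dim_date
        sale_amount       number(18,2)  -- the measure
    );

Everything you would ever group by lives in the dimensions; everything you would ever sum lives in the fact table.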
But even once you've got this done, you still have lots of joins, and in Looker there's no way to do fact-to-fact joins: you always choose a root, which would be your fact table or a dimension table, and then you only ever join out to dimensions. A multi-fact data structure just isn't suitable for a relational tool like Looker, and it doesn't really get us away from all these joins. So we need something else.

Going back to the cube analogy, what's the replacement for our OLAP cube? Building blocks. What do we need to build? We need to build this cube, and we first need to design it. So this is what we do. We choose events as the universal grain. Because what we're talking about is multi-dimensional cubes with multiple facts, we say that "something happened" is universal: you can always talk about something that happened. A transaction was made, a policy was purchased, a person visited a store. You design your facts and your conformed dimensions, you build those dimensions, and once you have them, you use those dimensions to build denormalised fact tables. You start with the traditional approach of a fact table, then you add all the dimension attributes onto the fact row and denormalise it: you squash it, you flatten it, and you get a very long, wide fact table. We use dbt to encode all business logic onto that fact table, so you don't need any inherent knowledge of the relationships between a fact and its parent dimensions, and you don't need to apply any calculation that isn't an aggregate query (you obviously can't encode an aggregate onto a single row). Then finally you bring all of those multiple fact tables together into one big table. They don't have to be perfectly conformed, in the sense that not every fact has to have every dimension, but they share most of the dimensions, and you bring them all into your one big table. And finally you create aggregations on top of that.

That's what we do at Simply Business for a subset of our data sets. In that model, every dimension is a reusable analytical asset: once you've created a dimension, you treat it like gold dust, and you can use it across any fact process. And every fact table is golden too, because no business logic is needed to derive its context. As an analyst you don't even have to know how to write a join; you just need to know how to use sum or max or whatever function you want to apply, and you don't have to worry about the relationship of a measure to anything else in the business. Roughly, the layers look like the sketch below.
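As a hypothetical layout (the folder and model names here are illustrative, not the actual Simply Business project), the dbt side of those building blocks might be organised like this:

    models/
      dimensions/
        dim_consultant.sql                   -- conformed dimensions: built once, reused everywhere
        dim_contact_request.sql
        dim_team.sql
        dim_date.sql
      facts/
        fct_contact_request_events.sql       -- one event-grain fact per business process
        fct_call_events.sql
      denormalised/
        fct_contact_request_events_flat.sql  -- each fact joined to all of its dimensions and flattened
        fct_call_events_flat.sql
      obt/
        obt_contact_centre.sql               -- union of all the flattened facts: the hypercube
        agg_contact_centre_daily.sql         -- pre-computed aggregations on key dimensions

The individual models behind each of those layers come up over the next few steps.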
Just to walk through that step by step, here's a relational model that we actually had at Simply Business, very simplified: our contact centre. In the contact centre, you can imagine that we make calls. We have contact requests coming in from customers saying "call me", we have consultants that make calls, and a consultant is part of an organisational chart: they'll be in a team, which is itself part of a hierarchy, and we might need to know more of their metadata. And then you might have sales off the back of all that. So that's a relational model, and we actually used to have it in Looker, with these full outer one-to-many joins and the lovely symmetric aggregate queries that come with them, and the concurrent querying speed was just atrocious.

Step one is to model the first star. In this case, you can see we have contact requests, so contact requests become a fact: fact contact request events, the things that can happen to a contact request. A contact request will go through various stages: it might be inspected, it might be passed to the dialler, it might be killed. Those are all in the domain space of the contact request, things happening to the contact request, so that's fct_contact_request_events. And we take all of the conformed dimensions and split them out into dimension tables. I've put dim_date there because dim_date is a universal thing: when something happened, for an event, is universal. I've marked it as "maybe implied", because in Looker you never actually need to build that dimension, but in some tools you might have to construct your own date hierarchies.

Any numeric elements in dimensions are not facts; they are merely attributes and should never be used in aggregate querying. That's just to make the point, and it makes the distinction between what is a dimension and what is a fact really clear. In the previous model, you had dimensions and measures all mixed together in the same model.

A little point of housekeeping: every attribute in a dimension gets that dimension's prefix. Something like name, or telephone number, or something generic like location, could be shared across different dimensions, so always prefix every single attribute with the dimension name. This helps ensure column names are unique later, when we squash it all down.

Then we want to denormalise the star. We've created the star; now we're going to denormalise it. How do we do that? We can simply do a lovely dim.* select. I'm sure some people are horrified by looking at those stars, but we've constructed these dimensions very carefully, creating the attributes up front. We use dim.* precisely because it keeps things conformed: it makes sure that everything we build into a dimension is always available across any of the denormalised fact tables. And if, for a very particular purpose, you need to control which dimension columns appear in a fact table, just put a view on top of it that's specific to that fact table and restrict the dimensional columns if needed for a specific cube. One of these flattened models might look something like the sketch below.
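As a minimal sketch, assuming hypothetical model and column names (one flattened star, built roughly as just described):

    -- fct_contact_request_events_flat.sql
    -- The fact's own event columns, plus every attribute of every
    -- conformed dimension it touches, pulled in with dim.* selects.
    select
        fct.contact_request_event_id,
        fct.contact_request_event_type,
        fct.contact_request_event_at,

        dim_contact_request.*,
        dim_consultant.*,
        dim_team.*,
        dim_date.*

    from {{ ref('fct_contact_request_events') }} as fct
    left join {{ ref('dim_contact_request') }} as dim_contact_request
        on fct.dim_contact_request_key = dim_contact_request.dim_contact_request_key
    left join {{ ref('dim_consultant') }} as dim_consultant
        on fct.dim_consultant_key = dim_consultant.dim_consultant_key
    left join {{ ref('dim_team') }} as dim_team
        on fct.dim_team_key = dim_team.dim_team_key
    left join {{ ref('dim_date') }} as dim_date
        on fct.dim_date_key = dim_date.dim_date_key

Because every dimension column already carries its dimension's prefix, the dim.* selects can't collide with each other when everything is squashed into one wide table.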
Then we model the next star. Remember, the first one was contact requests; now we want to talk about, say, call events. So now we've got dim_call, the call measures, and the other dimensions that sat under the first denormalised fact table, but now we've got more dimensions, because we're further down the process, so more dimensions are involved. And then you just keep denormalising each star; just keep going, and repeat lots of times.

Now we want to union them all together, and this little function, I call it stardust, or gold dust. I love it; it's such a simple thing, but it means that every column is always unioned into the appropriate location. If anybody's ever had to manage union relationships, and the fact that every column has to be aligned, you don't have to do that any more; dbt, or rather the dbt-utils package, has this wonderful piece of code, union_relations, and I love it. That's what it looks like for this cube: we've got all of the different fact tables that have been denormalised, and we just use that lovely union_relations, as in the sketch below. What we end up with is really one very large table. I've said it's three and a half kilometres long just because it's that big, probably on the scale we were talking about beforehand, and you can see it's got every single attribute from every single dimension, and all the measures, all there.

You can then get pre-filtered versions of the master OBT; OBT is just "one big table". So you could filter that master OBT and say, actually, we just want a territory view, or a state view, or a country view, however it makes sense to filter for the business, because then it's just: here's your subset of the data, there you go, you're only ever going to be querying that slice, maybe a product line. And again you can use dbt to make those super simple: you can just loop through all of your territories and off you go; it's a very simple query now.

Maybe some people are thinking: isn't this horrifically redundant? It absolutely is. In the old days of SSIS and SQL Server this would be anathema, but we have Snowflake now, storage is like water, and redundancy in data storage is absolutely fine. And when we accept redundancy in storage, we can get away from the intense compute costs. So this is great.

Finally, you define your pre-computed aggregations on the key dimensions and metrics to mirror that OLAP cube functionality. Different tool sets might have different ways of doing this. Looker facilitates it greatly because it already has what they call aggregate awareness, so you can create these very easily. I have seen that in dbt there's an outstanding request for materialised views, which would effectively be aggregates; I don't know where that's landed, and different architectures might have different ways of creating them. But once you've got that flattened OBT, aggregations are so much faster to compute: you're not having to join, you literally just scan down the table and your aggregation query is straightforward, so you can create bunches of aggregations at very little time and cost. And if you're querying dimensions and measures that are in an aggregation, performance will be even faster than the raw OBT, but remember, performance on the raw OBT is already blazingly fast, because it's literally a scan down a single table.
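As a minimal sketch of the union step and one pre-filtered cut, reusing the hypothetical model names from earlier (the dbt-utils macro is real; everything else is illustrative):

    -- obt_contact_centre.sql: the one big table.
    -- union_relations lines every column up by name and fills any column a
    -- relation doesn't have with null, so the flattened facts don't all
    -- have to carry exactly the same dimensions.
    {{ dbt_utils.union_relations(
        relations=[
            ref('fct_contact_request_events_flat'),
            ref('fct_call_events_flat'),
            ref('fct_sale_events_flat')
        ]
    ) }}

    -- obt_contact_centre_uk.sql: a pre-filtered cut of the master OBT,
    -- one of these per territory, product line, or whatever slice the
    -- business thinks in.
    select *
    from {{ ref('obt_contact_centre') }}
    where dim_territory_name = 'UK'

Aggregation models then sit on top of the OBT in the same way: a group-by over a handful of the prefixed dimension columns, summing the measures.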
So here we can go from the massively aggregated view right down to the absolute in-depth detail of every single event that has ever happened, and even when we're in the detail of the events it's still fast. That's something we could never really do with OLAP cubes, because you always had to aggregate to a certain level of navigation, otherwise the memory requirements would be enormous.

That's it, really, but just a couple of pointers. Slowly changing dimensions: with a type 1 slowly changing dimension, if an attribute changes in a single dimension, watch out, because you've now got to replace that value everywhere, so think about the strategy you're going to use; we have a strategy that works very well for us. Type 2s, where an attribute holds for a certain time and then changes to another value, inherently work perfectly on this model; there's basically zero SCD type 2 management, which is great.

So what did this do for us as a business? It reduced query time massively. If you think back to the previous model, with those one-to-many joins, full outer joins in some cases, it dropped query times from minutes to seconds, and sometimes less. We have reusable data assets, both dimensions and facts, that can then be used in other processes and other cubes. It really simplifies querying by analysts: anybody can go in and understand everything about the context of what's happening. It enables data applications: we've now hooked these data tables up to close-to-real-time consultant dashboards, which don't have to know any business logic; they just take the data as a flat table. And for me it continues that encoding of best practice in data modelling for our new starters, because obviously not everybody is comfortable with, or knows about, facts and dimensions, and this guides us towards best practice. So that's the end of my presentation; I hope you found it useful.

Thank you so much, Alex, that was incredible. OK, so we've got some questions. Actually, I'm going to call you out: I know you didn't have a question, but you did drop in a comment, and I know Ashley, I worked with her at HubSpot, and I know she understands the Looker side of things very deeply. Ashley, did you want to speak to your comment about why multi-fact resonated so well with you?

Ah, yeah, absolutely. I really appreciated that succinct way of putting it. The way I had been trying to describe it to people is: I have a classic situation where my subject matter expertise is in one of our core business dimensions, the partner accounts we have at HubSpot, and then of course the events that everybody cares about the most are all of our revenue changes. So I have an explore that's built around the partner as a dimension, which joins in just about anything you could ever want to join in for a partner, so we have that for agile analytics, and then separately we have explores that are about the revenue events that go on for those partners. I've been trying to articulate this to people: use the one that's about partners for anything that's about partners, unless money is involved, and as soon as money is involved, go use the money explore that has the same dimensions, because we have that partner account dimension modelled well.
But it doesn't have all of the extra partner-side detail, because we get very detailed on one end or the other: everything about revenue, or everything about the accounts. So that's a really helpful way to put it. For me, trying to put two facts together would be trying to put all of this information about what's happening to partner accounts next to all of this information about what's happening to our revenue, and there was just enough of a gut instinct with how I was setting things up in Looker that I thought, ah, this is where we draw the line. But I really appreciate having some words to put to that.

I've seen what business users obviously try to do in Looker: merged queries, which is again the end of the world, isn't it?

Yeah, I actually use merged queries pretty heavily, but you have to have an intention and know why you're doing it. You can get a lot done with them, but certainly, if you can solve that problem with better data modelling upstream, that is the way to solve it.

Exactly, and I really found that dbt was essential in that process.

Yeah, that makes a ton of sense. Awesome, thanks. OK, Reuben, you had a question; are you able to come off mute? There we go.

Yeah, I was just wondering, how big are your OBTs? And then, as a follow-up, are they clustered, and how?

So they're three and a half kilometres! Truly, we have millions and millions of flattened fact rows, and I've never actually stopped to count the number of columns, but we're in the realm of hundreds. With regard to clustering, we haven't gone out of our way to performance-tune; at the moment, at the scale we're dealing with, the performance is absolutely fine. I wouldn't say we're in the realm of big data. We haven't done modelling of page views, for example, of people going through the site, which would probably lead us to think about some form of aggregation up front, but when we're dealing with things like calls in the contact centre, it's quite manageable.

Yeah, I was wondering because we're struggling with some of our web analytics stuff, which is approaching the scale of hundreds of millions of records a day.

Sure, yeah, we're far away from that, but I think the principle can hold firm. I suppose it's a question of what level you need to drill down to, and at what point. I would work through whether you can deal with some form of aggregation first and then apply the same principles.

Thank you. Great. Anna, did you want to come off mute and ask your question?

I was very curious how, operationally, you manage it when some of this changes. Do you drop your entire OBT and recreate it?

Yes, we do. We actually have a process that will drop it, weekly in fact. The way we manage our SCD type 1s is that we do incremental builds that take the previous seven days' worth of data, and then we do full batch recreations every seven days, in a downtime window on a Sunday, roughly the pattern in the sketch below. We have the luxury that it's not 24/7 access, so that's the way we've done it, but you can imagine doing things in parallel that would get you to the same result. And just to drive home a point: these things rebuild every half an hour for our contact centre. These are massive, wide tables, but each one is not a significant overhead on Snowflake. The complexity of the queries needed to build the model is the same complexity an end user would otherwise face trying to get anything more than basic analysis out of the front end; the difference is that we just do it once, and then it's available for everybody.
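A minimal sketch of that refresh pattern, assuming hypothetical model and column names rather than the actual Simply Business code:

    -- Incremental runs reprocess a trailing seven-day window, so type-1
    -- attribute changes inside the window are picked up on the next build.
    {{ config(
        materialized = 'incremental',
        unique_key   = 'contact_request_event_id'
    ) }}

    select *
    from {{ ref('fct_contact_request_events_flat') }}

    {% if is_incremental() %}
    where contact_request_event_at >= dateadd('day', -7, current_date)
    {% endif %}

    -- And once a week, in the Sunday downtime window, rebuild from scratch
    -- with something like:
    --   dbt run --full-refresh --models obt_contact_centre+

The weekly full refresh is what sweeps up any type-1 dimension changes older than the incremental lookback window.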
Thanks. Thank you. Awesome. OK, Alex, I've got one. It sounds like, the way you're describing it, this is an evolution in terms of how you were previously building and designing things, so I'm curious how you've had to change and adapt your development workflow to support designing towards this architecture. What part of your organisation owns this work, what roles or teams, and what types of challenges have there been?

OK, so yeah. We got onto the Looker bus quite early on and we were building PDTs, probably like any early Looker user was, and it really led us to jettison best-practice data modelling principles. The introduction of dbt is what precipitated the change, although not immediately, because I think what initially happened was that the models just got taken wholesale from PDTs into dbt, and the modelling principles still weren't there. So the challenge is: are we going to invest our engineers' time in recreating these things? For me, the difference was that our operational analytics are really complicated, they drive half of our business, and we've got something like a hundred people running dashboards frequently throughout the day. At that point you just think something's got to give: we've got escalating Snowflake costs, we're contributing to those luxury yachts, and we've lost that curiosity; we've got analysts who are scared to look into the data, they'll kick something off, leave it running and walk away. That's what I really wanted to change. Luckily, in my team we've got people who are obviously strong in SQL, they upskilled in dbt quickly, and then we just thought about how we would structure this. The journey's not complete: people talk about these golden dimensional and fact data sets, and we've now got them, but they're not used everywhere, because we still have plenty of legacy, and my challenge is to introduce them process by process. We started off in something very specific, I guess you would call it a data mart, but the key thing is that those dimensions are conformed to the business domain, so they can then be rolled out to new business areas.

Have you had any challenges with adoption of the hypercube, where teams don't want to make the cut-over and want to continue with what they had before?

No. As soon as people start using it, it's day and night, so it sells itself.

That's awesome. OK, let me just check if there are any other questions. I think I got them all. Oh, we just had another one come in. Neil, if you want to come off mute?

Sure, thank you so much for this talk, it was really interesting. I know this has been focused on the engineering side, the technical side of it, but I was curious about the impact on the business side. What I've seen is that often we, as analysts or analytics engineers, build tables, and then metrics on top of those tables, or maybe Looker explores, for one particular team: there are metrics for a product team, for a marketing team. In your experience, do you think there's a way in which having all of the relevant information in one larger table can help break down those silos?
It's not so much in the data as in the operations, because a product team is only concerned with what it's measured on, right? But the thing is, because each part of the cube is componentised, you can reuse it. If you're talking about calls, for example, this domain space was built around contact centre operations, but a product manager might want to know the flow of a customer from online to offline, and what they're interested in then is the calls that customer is making. You can reuse these fact tables because you've conformed the dimensions: I've got a conformed customer dimension, I've got a conformed date dimension, so I now have a way of joining those processes together, and you can just join them in that way, something like the sketch below. Does that make sense? Because the hypercube OBT is just a unioning of lots of minor OBTs, of flattened fact rows, each flattened fact table can also be used independently, without any knowledge of how it fits together, because the dimensions conform it all.

That makes sense. So it seems like, if you want to explore outside of your domain, the data that you would need is in some ways already present?

Exactly, through the conformed dimensions. OK, thank you.
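The kind of cross-process join described in that last answer might look roughly like this, with hypothetical model names; the only thing holding the two facts together is the conformed customer and date keys:

    -- Daily web sessions next to daily calls for the same customer,
    -- joined purely on conformed dimension keys.
    with web as (
        select dim_customer_key, dim_date_key, count(*) as web_sessions
        from {{ ref('fct_web_session_events_flat') }}
        group by 1, 2
    ),
    calls as (
        select dim_customer_key, dim_date_key, count(*) as calls_made
        from {{ ref('fct_call_events_flat') }}
        group by 1, 2
    )
    select
        coalesce(web.dim_customer_key, calls.dim_customer_key) as dim_customer_key,
        coalesce(web.dim_date_key, calls.dim_date_key)         as dim_date_key,
        web.web_sessions,
        calls.calls_made
    from web
    full outer join calls
        on  web.dim_customer_key = calls.dim_customer_key
        and web.dim_date_key     = calls.dim_date_key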