Today I will start this presentation, called Zero to Trillions, and we'll talk about how we scaled the warehouse at Canva to fulfil all the business requirements. The agenda we have for today is focused on four different business problems that we faced along the way: the first one will be accelerating insights, the next one scaling for growth, followed by faster data onboarding, and we will wrap up with special data access. I'm sure many of you will relate to these business problems, which are very common in our field, so let's jump in.

The first one today is accelerating insights. A bit of context here: roughly two years ago, a bit more, a bit less, when I joined the company, the main business problem was how slow the access to insights was. Back then the warehouse builds were taking more than 20 hours. It was crazy, and it kept going up and up and up. Our data engineering team at the time was made up of just a few folks, more or less five-ish, and we were looking into different things; in my case I was the main one looking after the warehouse. So this is the first scenario I was presented with: very slow access to insights, with builds taking more than 20 hours.

So I started digging in to see why all these issues were happening. The first thing I noticed is that every day we were re-parsing all the JSON; we were repeating operations on a daily basis, which was very low performance. We were also, to my surprise, building the warehouse using full builds instead of incremental ones: facts and reports were built from scratch every day, which now sounds crazy to me.

The solution we came up with was a big investment in foundations. One of the first things I did to clean up the house was to introduce a layered architecture in the data warehouse, a very common practice in data warehousing: we have different places, or layers, where the data is stored in a particular way to fulfil different business needs. This then allowed me to implement incremental builds and save a lot of time by building facts and reports in an incremental fashion rather than from scratch. Finally, I also standardized the ingestion methods. The warehouse was growing; we had MySQL snapshots, events from the front end and the back end, and a few other sources, and everything was ingested in very different ways. We standardized it so that we were loading everything from a single place, which is the data lake.

So what was the impact of this initiative? Well, it was a stronger team and tech. The builds were much faster: we reduced the build time by more than six hours, which also meant cost savings, and ultimately a happier analyst team. Last but not least, solid foundations in the warehouse: more automation, less operation. This let us focus on bigger problems, and I cannot overstate how important it is to have solid foundations. It requires a bit of early investment, but then it opens the doors for many, many more opportunities in the future.
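To make the incremental-build idea concrete, here is a minimal sketch of an incremental fact model written the way we would express it today in dbt (which we only adopted later in the journey, as the next section covers). The model and column names (stg_events, fact_daily_events) are invented for illustration; the point is that after the first full build, each run only scans data that is new since the last run and merges it into the existing table.

```sql
-- fact_daily_events.sql: a hypothetical incremental fact model.
-- On the first run dbt builds the full table; on later runs the
-- is_incremental() branch restricts the scan to recent data only,
-- and rows are merged on the unique_key instead of rebuilt from scratch.
{{ config(
    materialized = 'incremental',
    unique_key   = 'event_date_user_id'
) }}

select
    to_date(event_timestamp)                   as event_date,
    user_id,
    to_date(event_timestamp) || '-' || user_id as event_date_user_id,
    count(*)                                   as event_count
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- {{ this }} is the already-built target table in the warehouse;
  -- re-process only the most recent (possibly partial) day.
  where to_date(event_timestamp) >= (select max(event_date) from {{ this }})
{% endif %}

group by 1, 2, 3
```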
Moving on to the next section, we have scaling for growth. This was shortly after: we had solved the incremental builds and had the layers in place, and the problem hit us again. The build time was continuing to increase because, naturally, we had more customers, more users, more data, and more and different services that we needed to support. So the build time was going up, and back then we were using a solution that was not elastic, and this was the main problem. We couldn't just extend the warehouse's capabilities to process more and more information; rather, we needed to upgrade to a different tier, which would be more or less fine until a few months down the line. So it was a very rigid warehouse solution that we were using, and due to this fact it was also pretty hard, if not impossible, to handle very, very large data sets. As a consequence, low velocity of contribution was another problem we were facing: even though the analytics team was growing, and it's one of the data teams that has grown the most in the last couple of years, contributions were still very slow, because we needed to test, and there wasn't much free compute or many resources on the warehouse to go through all of those steps.

So what we started working towards was migrating to a different solution. We implemented a few technology changes, the main ones being Snowflake and dbt. After a few evaluations of different warehouse solutions we decided to go with Snowflake, and the main selling point for me back then was the flexibility that Snowflake provides and the JSON support, which was very impressive and, in retrospect, very important. Moving off our previous warehouse platform, we also needed a solution to handle the orchestration of the SQL models. Our previous solution did have something similar to dbt, not even close to being that powerful and useful, but there was an orchestration method, so by moving off that solution we needed to jump onto something new. I had known about dbt for a while, but as I said, I was pretty much fighting fires and didn't have the time to evaluate the product. At this point in time we needed to migrate the warehouse anyway, so we might as well take a look at dbt, and in the end we embraced both changes.

Doing the migration is a big task: you have to rewrite a bunch of code, adapt to the new technology, and so on. So the approach we came up with was to use templates and generate a bunch of code. Within a few days or weeks we had generated that code and got all the raw data loading into the warehouse. This, again, was more automation, less operation, more free time.

There was another change in this part of the solution, because as we changed the technology we needed to change some processes, and the team as well. From the warehouse team's point of view, we wanted to step away from the analytics work, so to speak. At this point in time Canva was growing in size, and so were the data specialists: we had a warehouse team, which I'm part of, and a bigger analytics team, and the latter is the one responsible for creating the transformations and all the business-logic models. We wanted to give them ownership, so we only provided SQL code reviews to make sure the models were performant on the new tech stack, but ultimately they were the code owners, and we got involved on request, usually for query optimization or large deployments.
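Because most of our raw data lands in the data lake as JSON, the generated code was largely thin staging models over Snowflake VARIANT columns. Here is a rough sketch of the shape of model those templates produced; the source, table and field names are invented for illustration, but the colon-path casts and LATERAL FLATTEN are standard Snowflake features for querying semi-structured data, which is the JSON support mentioned above.

```sql
-- stg_backend_events.sql: hypothetical generated staging model.
-- The raw table has a single VARIANT column (payload) loaded from the lake.
-- Fields are extracted and cast once here, so downstream models never
-- have to re-parse the JSON.
select
    payload:event_id::string      as event_id,
    payload:user_id::string       as user_id,
    payload:event_type::string    as event_type,
    payload:created_at::timestamp as event_timestamp,
    payload:properties            as properties  -- kept as VARIANT for later use
from {{ source('lake', 'backend_events') }}

-- Nested arrays can be exploded with LATERAL FLATTEN when needed, e.g.:
--   select e.payload:event_id::string as event_id, f.value:sku::string as sku
--   from {{ source('lake', 'backend_events') }} e,
--        lateral flatten(input => e.payload:items) f
```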
The impact this change had was massive, and we could say that analytics went through the roof. We were now able to handle complex analytics models, such as session and marketing attribution and search impressions at Canva; these are very, very large-scale data sets that we weren't able to handle before, and now we could. Contributions skyrocketed: if you take a look at the chart we have from GitHub here, the commits went through the roof, with very good momentum and acceleration at that point in time. And thanks to the elasticity of Snowflake, we were able to reduce the daily builds down to even five hours.

A few lessons learned that I can share with you, because in retrospect this is also quite important: it's about communication and training. Whenever you're doing such a big project, especially if you're migrating to new technology, communication can never be underestimated. We could have done this a little bit better, and you can always do it a bit better. Now we do use design docs for proposing and reviewing changes, and that is the first step of a new project; that's something positive that I like about Canva, and I highly encourage all of you folks to take a look at that. Training is another very important one: we should have had more training with the analytics teams. We did have a few sessions, it's not that we just left them alone, but the more the better. At the same time the analytics team was hiring really, really fast, so if I ran a training session today, then one, two or three months down the line there was a new cohort of people who didn't know about these things because they had missed it. So what we are doing now is recording every training or knowledge-sharing session, and we even have a private YouTube channel of our own to which we upload every single piece of content. It's really great for onboarding newbies, and we have great feedback from everyone there.

I'm taking a look at the questions coming into the chat. What was your former data warehouse? I don't want to dunk on them; it was good. They have many features, they have connectors, they have this orchestration tool; it was good for a medium-sized company, we just outgrew its capacity. What is the new balance between data engineers and data analysts, in terms of roles and responsibilities? Well, in terms of responsibilities, we, the warehouse engineers, take care of the load, or the integration with different services, and then the analytics team creates all the transformations and the business logic: facts, dimensions, reports and dashboards, so to speak. We do collaborate on some code reviews for performance issues. Rough size of the data you were handling? Well, the biggest one we have is trillions, so that's the name of the session. A little about these design docs? Yes, I will share something about design docs later in the chat. Okay, just conscious of the time. Biggest challenge for analysts new to this way of working? I'll jump to that one now. Okay, so this concludes the first half of the presentation; I will hand over now to Krishna.

Thank you. So Jose just gave us the first two chapters of our journey, and this part really follows on from where Jose left off. Now I would like to talk about how we onboarded not Canva's first-party data sets, but data sets coming from outside Canva. Next slide please. Thank you. The challenge at hand for us in data engineering was that we had to build integrations to third-party data sets that are very useful, like Facebook and other marketing platforms, and the process for building these integrations is involved: understanding APIs, developing, testing and then deploying it in a nice way. So after our migration, when we started doing these integrations with third-party data, we would spend anywhere between one and three weeks to get these integrations built.
While this was happening, you can imagine that the marketing folks and the data analysts are blocked; not only are they blocked, but they also can't see the value of this data and make a determination of whether it is actually valuable. Furthermore, another challenge was that once we got these data sets going and integrated, we had to make sure we were on top of any API changes or schema changes, which means more time to build and deploy. And lastly, there are so many of these; marketing analytics in particular has lots of data sets we could integrate with, and it was difficult to standardize and automate. To be honest, for the engineering team that was building these, it wasn't the most rewarding of tasks.

So we decided to go with Fivetran, a data integration platform that we tested out, and it literally made things so much easier for us. The hard work of integrating with a multitude of third-party data sets was done; it was a point-and-click interface. We did some initial setup in terms of a process for getting data requests in, but we tried to keep it so that the ownership of these data sets and integrations remained with the business as much as possible. So we would request authentication details from marketing, for example, set the connector up initially, and then we would really say: this data set is owned by you, and we will support it. So data analytics and data engineering play a supporting role in the ownership of these data sets.

The impact of this is that we got access to richer data in less time, going from what used to take weeks down to a couple of hours, and it unlocked a lot of time for the analytics team and the data engineering team, because we could then focus on the governance aspect of these data sets, onboarding and off-boarding them, and on some of the more challenging integrations, for which we could build connectors and introduce a release framework so that these could actually be driven by the analysts instead. We have examples of two connectors in production right now that were built entirely by analysts with very little support from data engineering.

I'd say that this is not a silver bullet for every ingestion scenario, and I've added a few tips and learnings from our experience. Generally speaking, if it's a very large data set, I think you're better off trying to use the warehouse's bulk load tools (there's a sketch of that just after this section). The numbers I've given here are indicative only; they're from our experience, and the context in your organization might be different, but that's what we found. Also, the initial syncs with third-party data sets might take a while, and by a while I mean several days, so it's important to manage expectations with your stakeholders, because it's not always point-and-click with data in the next hour. Finally, we also use the CSV and Google Sheets integrations that are available in Fivetran; analysts and anyone using the data warehouse love these because they can get external data sets into the warehouse quickly, which is very convenient. So really, self-regulation and periodic checks are recommended when it comes to these ad hoc data sets.
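For those very large data sets, the pattern we mean is the warehouse's own bulk load path rather than a connector. A hedged sketch of what that can look like on Snowflake is below; the bucket, stage, file format and table names are all hypothetical, and the credentials setup is omitted.

```sql
-- Bulk-loading a large export from S3 with Snowflake's native tooling.
-- All object names here are invented for the example.
create or replace file format parquet_format
  type = parquet;

create or replace stage big_export_stage
  url = 's3://example-bucket/exports/large_dataset/'
  file_format = (format_name = 'parquet_format');
  -- storage integration / credentials omitted for brevity

-- Load every file under the stage path into the target table in parallel.
copy into raw.large_dataset
  from @big_export_stage
  match_by_column_name = case_insensitive
  on_error = 'abort_statement';
```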
Next slide please. Okay, so having a warehouse that is able to scale, a code-generated first-party data ingestion process, and faster onboarding for third-party data sets, we now had a lot of momentum going our way. This was really opening the doors for lots of other folks who normally didn't use the data warehouse to come on board, and with that came special data access requirements. Next slide please. Thank you.

So, diverse data access. As soon as we had this nice platform built, we had requests from our internal teams to have access to PII, access to all sorts of things really. For general access to the warehouse we hide PII, but some use cases do need it, so that was something we had to deal with. As other teams saw the power of Snowflake and our data platform, they wanted to use it, and I should add that all the great data content helped; content is king, after all. Many more people wanted to use these data sets in conjunction with theirs; this was external folks and internal teams, for example an internal team that wanted to use our models. Now that we're trying to build more data apps, we have had data access requests for different levels of access. Some cases do need PII, even though we naturally hide it in the warehouse, and there were cases in which different kinds of sharing and extracts were also required. So we saw this trend of new applications being built on top of the data we had in the warehouse, and that was really the problem at the time.

As a solution, we introduced the concept of data marts, a very common data warehousing principle: just a dedicated place in the warehouse with specific data and access for a particular business unit or team. We separate the special access needed to handle PII data into different data marts with restricted access. We assigned owners in GitHub using code owners, so we were able to basically hand over ownership to the team members who should rule what happens in that area of the warehouse, and we dedicated resources to them: this is a Snowflake feature whereby we were able to give them their own compute unit, tailored to their requirements. So basically we gave them a place and the resources to go crazy.

For the solution around data sharing and extracts, what we usually do is extract to S3; that has been the main output from the warehouse. It is usually done by dbt models running on particular data marts, in the context of those data marts, and then we make use of dbt post-hooks and the COPY command to get the data out into S3 (there's a sketch of this pattern just below). This S3 bucket may be owned by that particular team or service, and from that moment onwards they take over and do what they need to do. Another feature from Snowflake is data sharing: it's very easy to set up, with full control, and we do use it in our security lake, which is another big and important use case. On the applications side, we have simple pipelines in which we aggregate data to produce a single analytics data set. As an example, we have customer-facing analytics for the Customer Happiness team, in which they are able to query for particular users and see the latest events those users have performed, so they can help them better.
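To give a feel for how those pieces fit together, here is a hedged sketch of a model inside one of those data marts: the snowflake_warehouse config points the model at the team's dedicated compute, and a post-hook uses COPY INTO an external stage to unload the result into a team-owned S3 bucket. The warehouse, stage and model names are invented for the example (reusing the hypothetical fact model from the earlier sketch), and this is only one way to wire it up.

```sql
-- team_x_extract.sql: hypothetical model in a team's data mart.
-- It runs on that team's dedicated Snowflake warehouse and, once built,
-- unloads itself as gzipped CSV to an external stage backed by the
-- team-owned S3 bucket.
{{ config(
    materialized        = 'table',
    snowflake_warehouse = 'TEAM_X_WH',  -- compute dedicated to this data mart
    post_hook = [
        "copy into @team_x_s3_stage/daily_extract/
           from {{ this }}
           file_format = (type = 'CSV' compression = 'GZIP')
           header = true
           overwrite = true"
    ]
) }}

select
    user_id,
    event_type,
    count(*) as events
from {{ ref('fact_daily_events') }}
group by 1, 2
```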
Just conscious of the time, I'm going to speed through the last slides, sorry if I'm too fast. The impact: insights and savings. We were able to share data and replace paid products with our own, and that's a big one. We also have separated pipelines, so now we have micro-pipelines rather than one big transform: every data mart has its own compute engine, its own schedule and its own pipeline. And on warehouse latency, we are moving towards low-latency, high-concurrency pipelines. And that concludes the end of our session.