Combining the Best Cloud Technologies with Innovative Engineering: How We Built Neo4j Aura


And we're ready to go. Okay, so thanks very much everyone for coming — I'm sorry we're starting a little bit late. Let me introduce myself: I'm Alistair Jones, I work at Neo4j, and I've done that for over 10 years. I live in London. In my job at Neo4j I manage some of the engineering teams, and I also do architecture for Neo4j Aura.

So what is Neo4j Aura? Well, as we say on the website, it's a fast, scalable, always-on, fully automated, fully managed graph platform offered as a cloud service. We have two flavors of Aura: for developers who are building applications we have Neo4j AuraDB, and for data scientists who are doing graph analytics we have Neo4j AuraDS.

Now let's talk about what it means to be fully managed. If you want to run a database in the cloud, I think you need to think about at least these five things. First of all, you need some servers to run it on — some server infrastructure. You probably need some kind of clustering, or some means of assuring availability. You need to be able to upgrade the database software to take bug fixes or security patches. You probably want to adjust to different levels of load, so you're going to need the ability to scale — to change the size of the hardware that you're running on. And finally you need backups: some way to go back in time in case you delete your data for some reason, and you need to plan for that.

If you self-manage this in the cloud, you need to worry about all of these things — that's why I've colored them all gray here; none of them are done for you. But you can get these managed for you. The next level up on this picture is something I've called platform managed, and what I mean here is something like the Helm charts that my colleagues Bledi and Harshit were talking about earlier today. That's great, because then you can run on one of the managed Kubernetes services, and it will sort out the server infrastructure for you.
Using the Helm charts, it'll also do the clustering for you. But it's not going to solve the other things: at this stage it won't do the upgrades, it won't do the scaling, it won't do the backups. If you want all those things taken care of, then you're in the fully managed service realm — and that's where we are with Neo4j Aura.

Now, this talk is "How We Built Neo4j Aura", and what I mean by that — what I'm going to talk about — is the time period from 2017, when we kicked off the project in earnest, to now: so about five years. I'm going to start off with a little recap of where we were before we started the project, and at the end there will be a glimpse of the future — what's coming next.

Back in 2017 we already had the world's leading graph database. We had people all around the world, with all sorts of use cases, using Neo4j really widely — and the vast majority of them were running in the cloud. At that point you could probably say Neo4j had already been running in the cloud for 10 years, so we knew it was possible to offer this service. And, as Emil was talking about this morning, we also had a really strong engineering team: we had the graph database experts on our staff.

Facing us on this project were a number of challenges — this is how I remember it, anyway. We were looking to support tens of thousands of production databases. Since they were production, we needed to ensure uptime and availability. We were going to want to run them on multiple clouds, because we could see that was going to be a long-term direction. We could already predict that customers were going to have slightly different needs, based on the type of use case that they had. And finally, we needed to design the system so that we could keep on evolving it — keep on developing it to add all the new features that we could see coming in the future.
So, the way that we chose to address all these challenges — I've framed it as three core principles. First of all, it's about using the best cloud infrastructure: we want to run in the best way that we can in the cloud, using the best capabilities of the cloud that are available. But we don't want to build everything ourselves, so sometimes we're also going to use the best third-party services that are available for our non-core capabilities. And finally, in the way that we organize and the way that we work, we want to use the best possible practices — to think about the structure that we use in the team and the process we follow. So let's go through each of those in turn.

First of all, the cloud infrastructure. The first thing to say is multi-cloud — I already touched on that. Right now Aura is running on AWS and on GCP, and coming soon on Azure. Why do we do that? Well, it's because of customers. We know that there are customers who use these clouds — they've chosen, for whatever reason, that they want to be on those clouds — and they want us to meet them there, on those cloud platforms.

Now, the big story here is Kubernetes. I'm sure everyone has at least heard of Kubernetes. We've chosen to build Aura on Kubernetes, and that's a conscious decision for us — I'm calling it a Kubernetes opportunity. At the top of this picture I'm trying to show how some of the other database-as-a-service offerings operate — and there are other database-as-a-service offerings out there. Especially if they started before Kubernetes was available, they run on the lower-level pieces of cloud infrastructure: virtual machines, which were the canonical piece of cloud infrastructure. We're more like the bottom part of this diagram, below the dotted line: Aura is running on Kubernetes infrastructure.
That's our interface to the cloud world in general. Now, this has been pretty controversial in the past. I don't know whether people have heard this quote before — this is Kelsey Hightower, over at Google. Back in 2016 he said that if you do exactly what we're doing, you will lose your job. That should make you stop and think for a moment, when you've got somebody saying something like that, and it was quite widely reported at the time. What I can say is that Kelsey has publicly changed his mind since then — and that's mostly because Kubernetes has moved on a great deal. But there are things that are still true about his reasoning: it's a big undertaking to run a database on Kubernetes and get it right — to get the kind of reliability and data durability that you need to run a database as a service. We were prepared to invest in that, because we're a database company; but as a general choice, if you're just running any old piece of database software, it still might not be a good decision for you.

We've got a strategy for how we use Kubernetes. First of all, we use it for what it's really good at — and from my point of view that's managing the complexity of a pretty complicated infrastructure, all the different pieces that we need. Especially the declarative state mechanism, which makes it really easy to implement self-healing systems — exactly what we need in a very large deployment. When we go beyond what Kubernetes has built in, we extend it following the same patterns: we introduce our own higher-level resources, which follow the same pattern and extend it cleanly. But the final point is that we don't delegate everything to the cloud platforms: there are some capabilities that we provide using the core database itself.
That's especially the durability and consistency elements — we rely on the Raft protocol, and its implementation in the core database, to do that.

Let's talk about what Raft is, since I've just mentioned it. Here's a little animation — I won't explain everything that's going on — but Raft is a distributed consensus protocol, and the Neo4j database uses that protocol to safely replicate data, in a consistent way, around a cluster. It's a really cool protocol. We've had it in Neo4j since 2016, so it's pretty battle-tested by now, and it's a very solid way of operating. And you don't just get consensus with Raft — you also get a pretty robust membership protocol. It gives you the mechanisms to safely change the membership of the cluster without getting into a split-brain situation, or some kind of divergence of data. That's a really powerful thing that we can lean on in the design of our database as a service.

So let's take an example. Here's a really straightforward three-member cluster. As is going to happen from time to time, one of those members is likely to fail — something goes wrong: maybe a hardware failure, maybe some software abnormality of some kind. Built into Kubernetes, we have the capability to recognize that something is down — to use availability metrics to remove a server from the load balancing; I've colored it gray to indicate that. We can also start up a new member to replace it: here's the new member starting up, it's gray, and then it goes to the white state; eventually the damaged server is removed completely and shut down. It's important that we have the Raft protocol here, because otherwise something could really go wrong in this situation: we could easily start up a new server with no data on it, have that become the master, and lose all our data.
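Raft rules out that failure mode because both commits and leader election require a majority. As a minimal sketch of those two rules — illustrative Python, not Neo4j's actual implementation, and simplified (real Raft compares term as well as log index when granting votes):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of members that forms a quorum."""
    return cluster_size // 2 + 1

def can_commit(acks: int, cluster_size: int) -> bool:
    """A write is durable only once a majority has acknowledged it."""
    return acks >= majority(cluster_size)

def grants_vote(voter_last_index: int, candidate_last_index: int) -> bool:
    """A member only votes for a candidate whose log is at least as
    up-to-date as its own, so a freshly started, empty server cannot
    win an election while a quorum still holds the committed data."""
    return candidate_last_index >= voter_last_index

# Three-member cluster: writes need 2 acks, so one failed (or brand-new,
# empty) member can never decide anything alone.
assert majority(3) == 2
assert can_commit(acks=2, cluster_size=3) is True
assert can_commit(acks=1, cluster_size=3) is False
# An empty server (log index 0) is refused by any member that has data.
assert grants_vote(voter_last_index=5, candidate_last_index=0) is False
```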
So it's the Raft protocol that's helping us keep consistent in that state.

For a more complicated example — this lump of circles — what we're doing here is scaling the servers from a 16-gigabyte size to a 32-gigabyte size. We have a protocol where we completely replace all of the hardware without any downtime for reads or writes, and we're able to do that because of the underlying protocol inside.

The third example is kind of interesting: it's exactly the same shape as the scaling example, but here we're upgrading the software. I've drawn this as v23 to v24 — meaning a new version of the Neo4j database software. We need to take new versions for bug fixes or security patches, and we want to do that without any kind of downtime. We can, because the database has been engineered so that adjacent versions cooperate, and we can upgrade without any interruption of service.

Okay, so that's the process we want to follow — how do we actually implement it in Kubernetes? Well, Kubernetes works on a core pattern called the reconcile pattern, and I've tried to draw a picture of it here. You interface with the reconcile pattern by writing a desired state into the system, and then you have a controller which compares that desired state with its perception — its measurement — of the actual state of the system. In general those are going to match: this runs on a constant loop, constantly checking, and in steady state no actions are required. But sometimes there's a mismatch, and when there's a mismatch the controller decides to take an action — and that action eventually leads to the actual state matching the desired state. This is the core way that Kubernetes operates.
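The loop just described can be sketched in a few lines — a toy controller with illustrative names, not Aura's actual code:

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    members: int  # the only property this toy controller manages

def reconcile(desired: ClusterState, actual: ClusterState) -> str:
    """One pass of the loop: compare desired and actual state and
    take at most one corrective action."""
    if actual.members < desired.members:
        actual.members += 1          # e.g. start a new cluster member
        return "add-member"
    if actual.members > desired.members:
        actual.members -= 1          # e.g. shut a member down
        return "remove-member"
    return "no-op"                   # steady state: nothing to do

# A member has failed: desired is 3, observed is 2. The controller
# keeps looping until the states converge.
desired, actual = ClusterState(3), ClusterState(2)
actions = []
while (action := reconcile(desired, actual)) != "no-op":
    actions.append(action)
assert actions == ["add-member"]
assert actual == desired
```

The self-healing property falls out for free: the controller never needs to know *why* the states diverged, only that they did.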
As a slightly more concrete example: say we want a cluster with three members. On the left-hand side there are three members — nothing to do. On the right-hand side there are only two, for whatever reason, and the controller will take the action to add a new cluster member. This seems very simple, and that's for good reason: it's how Kubernetes is built up. It's a layered architecture, where each controller often delegates to another resource type managed by another controller, so you end up with a kind of hierarchy. And this is exactly what we do in Neo4j. We have something called the Neo4j operator — this is vastly simplified, but some of the Kubernetes resources it uses are a StatefulSet, a Pod (the lowest-level resource here), and a PersistentVolume. We're following the same pattern that you get in core Kubernetes.

So that's great — we've got an operator. But the next question is: where do we run this thing? How are we going to structure the overall architecture? Are we going to have one operator that operates tens of thousands of databases, or one controller per database? Basically, what we're asking is: what is N in this picture — how many databases should we manage with one controller?

To help us answer this question we introduced a concept specific to Aura: the orchestra — and I hope this looks like an orchestra to you. An orchestra, for us, is our deployment of a Kubernetes cluster, divided into a control plane, where the operator lives, and a data plane, where all of the databases managed by that operator live. We also need some other services to make the whole thing work, but it's completely self-contained — isolated, independent — and it just keeps going in that way.

The advantages of this are, first, the isolation — I've already talked about that. And in terms of that number N, here's what we can do.
We test with a certain number of databases, and then we know that the orchestra works to our acceptable characteristics with that number and that kind of load. If we want to scale — if we want to add more databases — we can add another whole orchestra, another whole Kubernetes cluster, and that can take us up to a very large scale. It also gives us the cross-cloud capability: we can use roughly this pattern — it needs customization, but roughly this pattern — on each of the clouds.

How many of these do we need? Well, the minimum is one orchestra per combination of region and cloud provider. You'll see on this picture — it's schematic — that in some regions we have more orchestras, and that's to do with demand: how many customers want to run their database in that particular region. So it sounds like we shouldn't need that many orchestras — maybe tens or hundreds, depending on how many regions there are. But actually there's a bit more of a wrinkle, because — as we talked about right at the beginning of the challenges — customers have different needs. In particular, we have enterprise customers who would really like the highest possible level of isolation of their data from everything else. That's why we also implement a single-tenant kind of system, where we dedicate an entire orchestra — the same concept, the same capabilities — to an individual customer. That's less efficient, because there's less sharing, so we charge more money for it. But it means that we're actually running hundreds of Kubernetes clusters in production — and, talking to other companies that use Kubernetes, that's actually a pretty high number of clusters in production.

Okay, that was the cloud section; let's look at third-party services. As a recap, there are some things that we definitely want to build ourselves: obviously the core data platform, the core database.
A small example is the role-based access control that's built into the core database; and then the consistency, replication, and durability — that's obviously another thing that lives in the core database. But there are some things that aren't really our core business — they're not really what we're into — yet are still really important for providing the service. Those are the things we delegate out. I've listed some of them here — authentication, billing, you get the idea — with the vendors that we're using at the moment. This isn't exhaustive, but these are things we're using as part of the live system. I'll especially call out Datadog, which is a really big one for us: all of our monitoring and alerting, which is hugely important to Aura. We're using Snyk for analyzing our dependencies from a security perspective. And from a process point of view we're using a tool called Athenian — which is appropriate, because now we're going to talk about process.

So what are we doing in engineering? How do we think about developing a service like this? Well, first of all — and this is really important to operating a service — it's very handy that our friends at Google wrote down a lot of stuff in this book, and in all the community around it: site reliability engineering. Having that coined is very helpful for us. There's a lot to say about the SRE world; I'm just going to call out a couple of my takeaways from it.

First of all, it's about how we think about errors and reliability, and about being data-driven in that area. In the SRE world we talk about error budgets, and about deciding what to do based on how many errors have occurred during certain periods — data-driven decision making. This is a very simple model: the idea is just that the faster you're accruing errors, the less time it takes before you need to change your mind and take a different course of action.
Now, maybe some people are wondering: it's a database as a service — why are we talking about errors? Surely databases should be super reliable, super solid, never go wrong? That's what we expect. And yes, Aura is super reliable and has great uptime — but there's actually a lot of stuff going on in there all the time. Maybe the key way to think about this is that we're offering a service on which people run their own code. That's a whole other level of noise compared to offering a service that just runs your one application. So in terms of monitoring what's going on, we have to think about what those users are doing.

My favorite example here is syntax errors. Imagine this Cypher query — you probably recognize there's something wrong with it: there's a missing parenthesis. This can't be executed by the database; it's going to give an error, and we should be tracking those errors somewhere. But it's perfectly legitimate for a user to run a query like this — it's not a problem, it's not our fault that it didn't work; it's their fault for writing the wrong query. On the other hand, if we totally ignore this type of error, that's also a bad thing: if there were some situation where suddenly we started throwing a lot of syntax errors incorrectly, we wouldn't catch that at all. So you need to think quite carefully about distinguishing between noise and genuine failure — something genuinely going wrong. And that leads to much more sophisticated ways of thinking about burn rates and error alerting: we end up with this very complicated picture — which I can't really explain here — of multi-window, multi-burn-rate alerting.
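A hedged sketch of what a multi-window, multi-burn-rate check looks like, in the style described in the SRE books — the SLO, windows, and thresholds here are illustrative textbook defaults, not Aura's real configuration:

```python
SLO = 0.999                 # illustrative: 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO      # the fraction of requests allowed to fail

def burn_rate(error_ratio: float) -> float:
    """How fast the error budget is being consumed. A rate of 1.0
    means the budget would be used up exactly at the end of the
    SLO period; higher means faster."""
    return error_ratio / ERROR_BUDGET

def should_page(long_window_ratio: float, short_window_ratio: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH a long window (e.g. 1h) and a short window
    (e.g. 5m) are burning fast. The long window filters transient
    noise; the short window lets the alert reset quickly once the
    problem is over. 14.4 is the textbook threshold for 1h/5m."""
    return (burn_rate(long_window_ratio) >= threshold
            and burn_rate(short_window_ratio) >= threshold)

# 2% errors over the last hour AND the last five minutes: page.
assert should_page(long_window_ratio=0.02, short_window_ratio=0.02) is True
# A spike that has already recovered in the short window: don't page.
assert should_page(long_window_ratio=0.02, short_window_ratio=0.0005) is False
```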
That's the kind of world we get into in SRE. The other thing I want to talk about in SRE is what we do when something does happen — when we notice something in a chart, some alert landing, some piece of infrastructure that isn't working as we expect. We're getting metrics from our cloud infrastructure, from Kubernetes, and also from the Neo4j operator, which gives the user perspective on what's going wrong. All of those report metrics; our alerts come off those metrics, and those alerts are handled by our engineers. They use tools to automate that process, and they use playbooks so that we have a clear set of steps to follow. It's great that we have this process written down, and we make it as slick as possible so that we can keep things running at a good level. But I think the really important thing is the feedback cycle that comes out of this: we go through a postmortem process, and that hopefully allows us to fix the product — or, if that's not the right thing to do right now, to improve the tools and the playbooks so that we can recover more quickly from the situation, or automate away the toil of recovery. It's really that feedback loop that matters here.

The next thing I want to talk about on process is continuous delivery and this family of practices. In Neo4j as a company this has really been ingrained in us for a long time. I'm putting this book cover up here mostly for nostalgia, from working with Jez and Dave back 15 years ago — a lot of us come from that background, the early days of agile practices about 20 years ago: following agile, working out what the agile principles should be, valuing working software over anything else. We're especially keen on testing — automated testing — in Neo4j engineering.
We automate aggressively — that's how I describe it. I've got a little model I've been playing with about the journey we've been through. You can think about software development in two steps, if you like: you have to write the code, and you have to integrate it with the work of your colleagues. For the last 20 years we've had the idea of continuous integration — we've had the name for 20 years now — and that encourages us to go around this loop as quickly as possible, to take the smallest units of code and integrate them, so that we can go faster. For most of our history we've been releasing a packaged product, and that means that occasionally we branch off of this cycle and make a release.

But it's different when you're operating a service. Instead, your process looks more like this: there's a deploy stage that happens straight away after you've integrated, and that takes us into continuous deployment. This is the cycle we actually want to be in, in a service world, and it's a very different world from packaged software. For each piece of code, we really want to get it all the way into production — and that's something we weren't able to do when we were releasing a packaged product. To give a sense of the impact: we were releasing the core database every six months or so — that was our tempo — so if you wrote some code, that's how long it might be until it reached any customers at all. Now, running a service, what we're aiming for is that it's possible to get code from being typed in to being in production in 15 minutes. That's our target. We have lots of mitigations and extra levels of control around this so that we can do it in a safe way, but that's certainly what we're aiming for, and it's made a big difference to us.
Rather than releasing every six months — I just looked at the stats before doing this — we're doing about 35 deploys to production per week at the moment, and our aspiration is to be much higher than that, with lots more, smaller units going out. But going back to continuous deployment: it's all great to do things more frequently, but where are we really aiming? What's the goal of this? Really, what we want to be heading towards is continuous improvement: actually measuring the impact of the changes we put into production, and analyzing and designing the next piece of work straight away. That's really the direction we're heading in.

Okay, so that was all the process stuff. It's time for a glimpse into the future — which, obviously, looks like a graph. A lot of this is roadmap stuff, and I'm not going to talk in detail about roadmap features at all; that's not the thing for this talk. But I'm sure you can see that there's lots of data and tool integration coming — that's what I call it; there was some of that in Emil's keynote this morning, about some of the exciting things happening there — more integrations with more platforms, more data integrations. I'm sure there'll also be a lot more, deeper integration with the cloud platforms — at the data layer, but also at the infrastructure layer, and in the way that you pay for these services; I'm sure that's going to come as well. But finally, the thing I'm really passionate about is the improvements to the core data platform. This is really the proposition of Aura: that this is the best graph platform, and it's all about making that graph part absolutely the best that it can be. Just thinking about that, here's a little example of the kind of thing that we do in this space right now.
Users write Cypher queries and send them to Aura — through their application, or through Neo4j Browser — and they eventually find their way to a Neo4j database that executes those queries. What we're able to do, because we're operating this service, is take those same queries that are running on users' databases and aggressively anonymize them: completely scrub them of anything you could interpret — any data, even variable names — so that they become completely abstract. But that still gives us enough of a picture of what people are doing in the database that we can put it into our analytics engine, work out what people are actually doing, and make sure that our testing covers that stuff. Already we're starting to find gaps in our benchmarking coverage that we couldn't have predicted: we're finding that people are doing complicated, unexpected things in the database, and we can make sure that we make those things go faster. Ultimately this is about making the database, and the data platform, faster — and better — in the long term. So that's my glimpse of the future: this feedback cycle.

That was my run through the three principles, and a little glance into the future. I'm just going to leave you with a summary — some takeaways. How I would describe it is this: Neo4j Aura is a fully managed service — we talked about that at the beginning. It runs on multiple clouds, and on those clouds it's running on Kubernetes — in fact, managed Kubernetes — and we're running hundreds of Kubernetes clusters; that's quite a big deal. And then, for us in Neo4j engineering: we automate aggressively as part of our culture, and we've implemented the data-driven SRE methods that we talked about a little bit before.
And finally: we're the experts in graph. That's kind of the point — that's the thing we're very good at, and we're going to continue to get better at it through this mechanism. So that's it — thank you very much. I think I'm going to go outside into the corridor, because we're a little bit over time, and I'll take questions out there, if that's okay with everyone. Great, thanks.

2022-09-16
