Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Hi, Spark + AI Summit attendees. I'm glad you could join me for my virtual talk today. I'll be talking about some of the lessons we learned from building the Databricks large-scale, multi-cloud data platform.

First, an introduction. I'm Jeff Pang, and I'm a principal engineer at Databricks on our platform engineering team. Databricks helps data teams solve some of the world's toughest problems, and the Databricks platform team's mission is to provide a world-class multi-cloud platform that enables us to expand fast and iterate quickly, so that we can continually improve our product for data teams. If you're interested, you can find more information about our platform team at databricks.com/careers.

As you may know, Databricks was founded a few years ago by the original creators of Apache Spark. We provide a data and AI platform that now serves more than 5,000 customers in a variety of industries. We're still a startup, but we've grown to over a thousand employees and over 200 engineers, and we now have over 20 million in annual revenue. We're still growing.

Our data platform provides unified analytics and AI to data scientists, data engineers, and business users. It includes a data science workspace that integrates tools like notebooks, MLflow, and TensorFlow. It also has a unified data service built on best-in-class deployments of Apache Spark and Delta Lake. And it's built on top of an enterprise cloud service that's simple, scalable, and secure, because we manage all the operations for our customers.

So what's in this talk? First, I'm going to take you inside our unified analytics platform and show you a bit about how we architected our multi-cloud data platform. Then I'll describe three challenges and lessons that we learned while building it: first, how we grew and evolved our data platform over time; second, how to operate successfully in multiple clouds; and finally,
how we use data and AI to accelerate the data platform itself.

But first, let's look at the architecture of the Databricks data platform. In a simple data engineering architecture, you might use a single Spark cluster to process the data from your data lake, which could be data stored in S3 or HDFS, for example. You might take this data through the bronze, silver, and gold stages of a data processing pipeline to refine it for use, and finally you'd have analytics and reporting applications on top of this data to actually provide value for your users.

Now, most modern data platforms actually involve many different data pipelines, not just a single one. They would involve streaming data in addition to your data lake. They might involve scheduling more complicated workflows, where the output of one job is processed as the input of another. You'd probably use a modern data format like Delta, and your applications would involve streaming analytics, notebooks, machine learning, AI, and probably more. To do all these things, you'd probably need many different Spark clusters for scale, and to manage these Spark clusters you'd probably need some cluster management system like Mesos or Kubernetes. This is the type of architecture that many organizations are using, or trying to build, in order to get more and more value out of their data.

The Databricks data platform provides this for thousands of our customers. We manage a control plane, which handles the analytics, collaboration, AI workflows, cluster management, reporting, business insights, security, and more, so our customers don't have to manage that themselves. And we manage many Spark clusters in our customers' preferred networks to process the data from all of their existing streaming sources and data lakes.

Furthermore, our data platform is deployed in many regions throughout the world, because we have customers deployed everywhere and it's critical
for the data platform to be close to the data. And because our customers have their data in more than one cloud, we replicate these deployments in several clouds and integrate with the best features of each of these clouds, such as AWS and Microsoft Azure. The Databricks data platform is thus a global-scale, multi-cloud platform that manages the data platform for thousands of customers around the world.

As a consequence, the Databricks data platform manages millions of virtual machines in Spark clusters every day. As you can see from this graph of VMs managed per day, we anticipate that the scale of our data platform will continue to grow pretty rapidly.

That's the Databricks control plane, the control plane of our data platform, in a nutshell, and that's what will be the subject of this talk. Our
control plane has hundreds of thousands of users, hundreds of thousands of Spark clusters every day, and millions of VMs per day, and it processes exabytes of data each day. Our data platform supports everyone from university students just trying out Spark for the first time in our free Community Edition to Fortune 500 companies with thousands of users and many complex workloads. In the remainder of this talk, I'll discuss some of the big lessons that we learned from building this large-scale, multi-cloud data platform.

The first challenge I'll discuss is how we grew our software-as-a-service data platform over time. As a startup, we obviously didn't start out with a global-scale, multi-cloud data platform. In fact, we started off with something pretty small. The challenge was how to grow a data platform from providing a lot of value for one customer to providing it for thousands, all within the span of a few years.

Since we manage the data platform for our customers and want to keep expanding the capabilities of what our customers can do with it, we learned that the data platform itself isn't actually the most important thing. The factory that builds and evolves the data platform is more important than the data platform itself, because that allows us to rapidly provide more value and support more use cases in the data platform over time. And this all happens in a way that's transparent to our users.

So what were some of the keys to success in building a great data platform factory? First, we needed to quickly deliver value to the market. For example, we provided a version one of our data platform, which included some analytics tools like notebooks, but very quickly had to provide a version two, which included more features like job scheduling and management. The
challenge in quickly iterating on our data platform is that our users are using and relying on it at the same time that we're trying to change it, so we need to make sure we don't break things while we're continuously changing the state of the platform.
The keys to success in establishing this virtuous cycle are our modern continuous integration infrastructure, fast developer tools, and a lot of testing. For example, at Databricks we make heavy use of Scala, Bazel, and React.js for our developer platform. We spent a lot of time optimizing our Scala builds to get up to 500x speedups compared to default tooling, for example, and we run tens of millions of tests every day to make sure that our changes don't break things. Our developers also create hundreds of "Databricks in a box" full control plane environments every day so that they can develop, test, and demo new features. This allows us to keep adding new features to our data platform very rapidly.

Second, we needed to very quickly expand the total addressable market by replicating our control planes in many environments. The challenge is that each environment is slightly different: there are different clouds, different regions, and different data governance zones that require different configurations. So this quickly becomes very complex if we want to support all these different environments and iterate quickly to update them with new features. The key to success in expanding quickly, for us, was to focus heavily on declarative infrastructure, where we rely solely on templates that describe our different control planes and use a modern continuous deployment infrastructure to deploy them. At Databricks, we use a templating language called Jsonnet, along with Terraform and Spinnaker, to form our deployment pipelines. This allows us to express over 10 million lines of configuration in many fewer lines of code, for example, and it allows us to deliver new features in our data platform globally several times each month.

Finally, we wanted to expand the workloads that our customers could run on Databricks after they had adopted our data platform. This
meant that we needed to expand the scale at which our data platform can operate and simultaneously scale the rate at which new features can be added to the data platform over time. As more and more of our engineers developed features for the data platform, we needed to make sure that they weren't reinventing the wheel and rebuilding everything that everyone else had already built.

The first key to success in scaling quickly within customers who had already adopted our data platform was to have a service framework that did much of the heavy lifting for all data platform features: things like container management and replica management, APIs, RPCs, rate limits, metrics, logging, secret management, et cetera. These are things that all features and all teams at Databricks need in order to build their features. Notebooks, SQL, and MLflow are all pieces of a data platform, but these framework concerns aren't really core to their functionality.

The second key to success was to decompose our monolithic services into microservices. For example, when we first started, our cluster manager service, which manages all of our clusters, ran on a single machine, and it just managed a few hundred VMs for a customer. But to support the millions of VMs that we manage today, we needed to break it apart into different core functions so that it could scale. A well-rounded service framework was the key to doing this rapidly, so that we could spin up new services with different functionalities very quickly.

So in summary, the Databricks data platform factory allowed us to iterate on our data platform very quickly, rapidly replicate it in many regions and many clouds, and expand the scale and breadth of workloads that it could support. One of the great things about having a solid platform factory is that we often use it to improve the factory itself as well. You
can see from this diagram that we use a lot of cloud-native open source technologies at Databricks, everything from Envoy and GraphQL for RPCs and APIs, to Kubernetes for container management, and Terraform and Spinnaker for deployments. But we didn't actually start out with all these technologies. In fact, when Databricks first started, many of these open source projects didn't even exist yet. But by taking a factory approach to our development process, we continuously retool our factory to incorporate and extend the best tools out there over time, essentially using our factory to improve and refine itself in addition to our data platform.

Next, let's talk about how we run Databricks on multiple clouds. Databricks runs on multiple clouds, including AWS and Azure. Why do we do this? Well, because the data platform needs to be where the data is. Putting the data platform in the same data center as the data impacts performance, latency, and data transfer costs. It allows us to integrate with the best features of each cloud, and it allows us to adhere to the data governance policies
that our customers demand from their desired cloud.

The biggest challenge in supporting multiple clouds is to do so without sacrificing the velocity of feature development on the data platform itself. As I just mentioned, iterating quickly was the key to extending and expanding the data platform, and we didn't want to sacrifice that. What we found is that a cloud-agnostic platform layer is key to maintaining developer velocity, but this layer also needs to integrate with the standards of each cloud and manage their quirks. In the next slides I'll discuss what I mean.

The developer experience on multiple clouds is pretty divergent, so it's pretty hard to build the same thing on each cloud. For example, many cloud services have no direct equivalents: AWS has a great scalable key-value document store called DynamoDB, but the same interface really doesn't exist in Azure. Cloud APIs also don't look anything like each other; even the authentication and access control of each cloud is very different. Operational tools for each cloud are also very different, and you can't even consume logs in the same format. This makes it pretty challenging to support multiple clouds without doing a lot of extra work.

To overcome this, we built a cloud-agnostic developer platform to support our data platform. Between these blue lines here are the services that make up our data platform, things like SQL, notebooks, job scheduling, MLflow, cluster management, et cetera. To make sure that these services can operate seamlessly in multiple clouds, we developed a service framework API that's common across the different clouds that we support. This API builds on a few lowest-common-denominator cloud services that are pretty similar across each cloud, things like virtual machines, networks, databases, and load balancers.

When a Databricks engineer works on a data platform
service like MLflow, for example, they would use our service framework API to manage APIs, RPCs, users, permissions, billing, testing, and deployment. They would write the code once, and they wouldn't need to worry about the differences in each cloud; it would just work.

Now, we learned that not everything can be cloud agnostic, no matter how much we try to abstract away the details. First of all, customers actually want to integrate with some of what's best about each cloud, and we want to support the standards of each cloud. Second, even lowest-common-denominator cloud services like VMs have implementation quirks that reveal themselves once you start building on them, and therefore they're not exactly identical.

To handle integration with the standards of each cloud, our service
framework also acts as an abstraction layer for the differences in each cloud that we have to integrate with. For example, the ways that each cloud does authentication, authorization, key management, billing, and storage are pretty different, yet most data platform features need to deeply integrate with all of these properties. So to harmonize the differences in each cloud, our service framework API abstracts away the differences as much as possible, so that a service and data platform feature like MLflow just needs to know about our service framework's notion of authentication and key management rather than that of each cloud. For example, it wouldn't need to know about KMS or Azure Key Vault; it would just know about our own notion of bring-your-own-key for encryption.

As another example, consider storage. Storage is a very special example, because both S3 and Azure Data Lake are the most common and cost-effective ways to store large data objects in each cloud, and these storage layers typically form the basis for the data lakes that most of our customers use. However, S3 is eventually consistent, whereas Azure Data Lake is strongly consistent, so they have pretty different semantics. That is, after writing a file into S3, the next read of the same file isn't guaranteed to see the same data, whereas it is in Azure Data Lake. Obviously, having each application deal with this semantic difference would be a pretty big headache, so we abstracted away this difference with an S3 commit service, which basically makes interaction with S3 as strongly consistent as Azure Data Lake, thereby harmonizing the data access layer for our data platform factory services.

The abstraction layer to handle cloud standards is necessary, but it is not sufficient to make our service platform cloud agnostic. In reality, even the so-called common cloud services have quirks that are a pretty
big nightmare to deal with if you just try to lift something that's working in one cloud and try to make it work in another.

For example, consider virtual machines. With tools like Packer, it's pretty easy to create the exact same image in each cloud, so you can launch the exact same binary that's running the exact same code, the exact same operating system, et cetera. However, when using VMs as elastic compute, the creation time and deletion time of those VMs matter a lot to the user experience, for example when you're trying to spin up or tear down Spark clusters. And if you're not careful about adapting to cloud limits and APIs, you might end up creating and deleting VMs faster than the cloud can return the quota to you, and you might end up in a deadlock state where you can't actually create more VMs even though you think you have more quota. This is one example of an experience we ran into that we had to work around in order to make sure that we can create VMs and elastic Spark clusters successfully in both clouds.

Similarly, we found that different clouds handle the humble TCP connection very differently. There are a lot of invisible middleboxes that will time out and close your connections without warning, and there are small differences in data center network hardware reliability that can cause pretty catastrophic application failures if you're not aware of them. For example, we found that in 2019 there was a major TCP bug in the Linux kernel, and basically what this did is that it caused some TCP connections to hang forever. But we really only experienced this bug in one of our clouds, because of very small differences in reliability.

Finally, even though you can run things like MySQL and other databases in all clouds, it isn't exactly the same. Things that we thought would be trivial, like differences in the case sensitivity or insensitivity
of the underlying operating system, for example whether files can be case-sensitively named or not, can reach into the database and bite you if you aren't careful.

So to make a truly cloud-agnostic platform layer, we are continuously sussing out these quirks and masking
them from our everyday developers so they don't have to worry about them.

The final lesson I'll talk to you about today is how important it is to use data and AI to accelerate the data platform itself. It's actually really hard to build and operate a data platform without data and AI; you need a lot of data and AI to actually build a data platform itself. Thus Databricks is actually one of its own biggest customers, and we use Databricks very heavily.

A data platform needs data for many things. It needs data to track usage and maintain security, it needs data to observe users of the data platform and improve it for them, and it needs data to keep itself up and running. We found that having a simple and powerful data platform like Databricks has been essential to building the data platform itself. Thus, data and AI accelerate the data platform.

We use Databricks for many things. This includes key platform features like usage and billing reports, where we want to deliver reports about customers' usage to them, and also our audit logs, where we need to deliver security logs to customers so they can figure out what their users are doing. We use Databricks for analytics on feature usage, to look at trends, to model growth, and to make churn and other forecasts. These are some of the same things that our customers are also using Databricks for, and obviously we're also a Databricks customer because we use a lot of data analytics to analyze our own product.

And finally, a data platform itself is a managed service, and a managed service needs real-time data to operate effectively. This is for things like mission-critical DevOps, monitoring, and observability, so that we can gather and analyze everything from KPIs to APIs and other logs, like Spark debug logs, which are necessary both
to help our customers debug their workloads and to debug problems with the platform when there are issues.

So to do all this, we've built several globally distributed data pipelines with the Databricks data platform at their foundation. Like our service deployments, we use declarative templates to deploy our data pipelines so that we can rapidly iterate on and replicate them. We have a templating tool called the Stack CLI, which you can find in our documentation, that you can use yourself to deploy these data pipelines in Databricks.

Each of our globally distributed Databricks deployments basically streams ETL data into our data processing engine, which is built on top of Databricks and Delta, but it also tightly integrates with other big data tools like Prometheus, Elasticsearch, and Jaeger for distributed tracing. This allows us to see, analyze, and model the usage and health of the Databricks data platform in real time and historically. Right now, our data pipeline processes hundreds of terabytes of logs per day, and it analyzes millions of time series in real time.

This data pipeline has really been essential to the continued evolution of the data platform, that is, the Databricks that it's actually built on, and thus Databricks actually accelerates itself.

So in summary, the Databricks architecture manages millions of VMs around the world in multiple clouds. We've
learned several important things while building this massive data platform. First, the factory that builds and evolves the data platform is more important than the data platform itself. The data platform of the future will not look like the one of today, and the factory is what will build the data platform of the future.

Second, a cloud-agnostic platform that integrates with each cloud's standards and quirks was the key to building a multi-cloud data platform, and it's harder to do than it first appears because of the performance, scaling, and integration needs of big data.

Finally, data and AI can accelerate data platform features, product analytics, and DevOps. You can't really build a large-scale data platform without actually analyzing a lot of data.

And that brings me to the end of my talk. Thank you for listening. Oh, and P.S., if you're excited by anything in my talk today, Databricks engineering is hiring, and I hope some of you out there will come join us. Thank you.
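
As a footnote to the second lesson, the idea of hiding storage consistency differences behind one interface can be sketched in Python. This is only an illustration under stated assumptions: the talk does not describe how the S3 commit service works internally, and every name below (ObjectStore, StrongStore, LaggingStore, CommitStore, save_and_reload) is hypothetical. The two backends are in-memory stand-ins rather than real cloud clients, and the version-tag-and-retry scheme is just one simple way to get read-after-write behavior, not necessarily the one Databricks uses.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical cloud-agnostic storage interface that feature code targets."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class StrongStore(ObjectStore):
    """In-memory stand-in for a strongly consistent store."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class LaggingStore(ObjectStore):
    """In-memory stand-in for an eventually consistent store: after an
    overwrite, the next `lag` reads of that key return the previous value."""

    def __init__(self, lag: int = 2) -> None:
        self._lag = lag
        self._old = {}
        self._new = {}
        self._stale_reads_left = {}

    def put(self, key: str, data: bytes) -> None:
        if key in self._new:
            self._old[key] = self._new[key]
            self._stale_reads_left[key] = self._lag
        self._new[key] = data

    def get(self, key: str) -> bytes:
        if self._stale_reads_left.get(key, 0) > 0:
            self._stale_reads_left[key] -= 1
            return self._old[key]
        return self._new[key]


class CommitStore(ObjectStore):
    """Assumed design: tag each object with a version, remember the latest
    committed version per key, and retry reads until the tag matches."""

    def __init__(self, backend: ObjectStore, max_retries: int = 10) -> None:
        self._backend = backend
        self._committed = {}  # plays the role of a small consistent table
        self._max_retries = max_retries

    def put(self, key: str, data: bytes) -> None:
        version = self._committed.get(key, 0) + 1
        # Prefix the payload with an 8-byte version tag.
        self._backend.put(key, version.to_bytes(8, "big") + data)
        self._committed[key] = version

    def get(self, key: str) -> bytes:
        expected = self._committed[key]
        for _ in range(self._max_retries):
            raw = self._backend.get(key)
            if int.from_bytes(raw[:8], "big") == expected:
                return raw[8:]
        raise TimeoutError(f"{key!r} did not reach version {expected}")


def save_and_reload(store: ObjectStore, key: str, data: bytes) -> bytes:
    """Feature code: written once against ObjectStore, unaware of the backend."""
    store.put(key, data)
    return store.get(key)
```

Feature code such as save_and_reload behaves the same on StrongStore and on CommitStore(LaggingStore()), whereas on a bare LaggingStore the same call can return the previous value after an overwrite. That difference is the kind of semantic gap the talk describes masking in the platform layer rather than in each application.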