- Welcome to Data Platform Week. I'm Pete Hunt, CEO at Dagster Labs. Over the past year, we've signed on hundreds of new customers and had thousands of conversations with them. One of the most common themes we've heard is that the best companies now take a platform mindset when they think about their data teams. Over the course of this week, we're going to lay out a vision for how we believe organizations should assemble their data teams, their technologies, and their assets as a platform, and we'll be launching new technologies and features that make this data platform vision a reality.
With that, I'll hand it off to Nick Schrock, our Founder and CTO. - Thanks Pete, I'm Nick Schrock, the Founder and CTO of Dagster Labs. Today's keynote will dive right into that vision in detail, and into Dagster's role as the unified control plane for your data platform.
This is an organizational and technological vision, with a centralized data platform team enabling practitioners to own their data pipelines end-to-end. The core pillar of the control plane is asset-oriented orchestration, and we continue to invest in those capabilities. In Day 2's presentation, we will showcase our next-generation approach to event-driven data pipelines, Declarative Automation. Intelligent, powerful, and easy to use: practitioners attach policies directly to the data assets they own, and Dagster does the rest.
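To make that concrete, here is a minimal sketch of what attaching a policy can look like, assuming Dagster's `AutomationCondition` API; the asset names are purely illustrative:

```python
import dagster as dg


# An upstream asset refreshed on a cron cadence (names here are illustrative).
@dg.asset(automation_condition=dg.AutomationCondition.on_cron("0 6 * * *"))
def orders() -> None: ...


# The downstream asset declares its policy inline: materialize eagerly
# whenever new upstream data lands. Dagster's scheduler does the rest.
@dg.asset(
    deps=[orders],
    automation_condition=dg.AutomationCondition.eager(),
)
def daily_summary() -> None: ...
```

The point is that the policy lives on the asset definition itself, owned by the practitioner, rather than in a separate scheduling system.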
Orchestration is just a means, not an end. The control plane must richly integrate any tool or runtime used by any practitioner shipping business logic to production. This requires an ecosystem approach that provides out-of-the-box integrations for broadly used tools, but also patterns and protocols to incorporate any tool, including your own. In that spirit, we are proud to announce that Pipes, our toolkit and protocol for invoking external execution environments like Kubernetes, Spark, and ML-focused runtimes, is going to general availability (GA). We will also discuss the evolution of our embedded ELT and transformation integration families.
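As a rough sketch of the shape of a Pipes invocation, assuming the `PipesSubprocessClient` that ships with Dagster (the script path is hypothetical):

```python
import dagster as dg


@dg.asset
def external_job(
    context: dg.AssetExecutionContext,
    pipes_subprocess_client: dg.PipesSubprocessClient,
) -> dg.MaterializeResult:
    # Launch an external process; Pipes injects Dagster context into it and
    # streams logs, events, and metadata back over the protocol. The script
    # can use the dagster-pipes package to report metadata to the control plane.
    cmd = ["python", "/path/to/external_script.py"]  # hypothetical script
    return pipes_subprocess_client.run(
        command=cmd, context=context
    ).get_materialize_result()


defs = dg.Definitions(
    assets=[external_job],
    resources={"pipes_subprocess_client": dg.PipesSubprocessClient()},
)
```

The same pattern extends to clients for other runtimes: the business logic executes wherever it needs to, while the control plane keeps the full record of what happened.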
The control plane doesn't just invoke external tools. It also ingests the definitions, lineage, and metadata of the assets defined in those external tools. That means, unlike other orchestrators, Dagster serves as the system of record for your production data assets and the structure of your data platform. On Day 4, we're expanding this vision by announcing integrations with BI platforms like Tableau, Looker, Power BI, and Sigma.
These integrations make BI assets like dashboards and semantic models first-class citizens in the data platform. And finally, on Day 5, we will be announcing a new toolkit in preview, codenamed "Airlift", to accelerate migration from Airflow to Dagster while lowering its cost and risk. It enables you to richly observe an Airflow instance without modifying any Airflow code, followed by a low-risk, incremental, accelerated migration process. We will also preview our vision for orchestration federation, which will enable Dagster to observe and control other orchestrators with minimal effort. This strategy will enable a centralized team to assemble a unified control plane without having to migrate all legacy or peripheral orchestration instances.
In today's keynote, we're going to lay out Dagster's role as a unified control plane for your data platform, breaking that phrase down into its constituent parts in order to explain our vision. Let's start with the phrase "data platform." Many refer to companies and products like Snowflake, Databricks, and other processing systems as data platforms.
We use the term differently. We believe that every company has a data platform: the platform that manages all the production data pipelines producing the data assets that drive your business.
We define the data platform as the systems and tools that produce data. Heterogeneity is the norm: many tools, many sources of data, many storage engines, many personas, many use cases. And this platform is context-specific.
Every business has its unique needs. Every business has its own history of technology and business choices. Every business has a unique combination of production apps, SaaS apps, internal tools and use cases.
You must build the right data platform for your organization. Data platforms model the entire domain of your business and the external world it interacts with. It's a complicated world, a demanding world. Without proper care and forethought, it can be the source of much toil.
It is the focus of massive investment in time, money, and energy. It's worth the investment because it's foundational. These data platforms and their constituent pipelines and assets drive essential functions in your business, from operations, to data-informed decision making, to ML and AI applications, and their value and importance are only increasing over time. The mission of Dagster is to empower every organization to build a productive, scalable data platform. This has been our mission for years, but its salience and relevance have only grown over time.
We see more and more organizations thinking in terms of data platforms throughout the industry. There is a growing recognition of the data platform as a strategic asset. Historically, this has often happened in a bottom-up fashion, with data engineers and data scientists shifting their focus to platform engineering out of sheer necessity. Now we are seeing more formal job titles and organizational structures using the term directly, such as "data platform engineering" and "data platform teams".
This is a welcome development. We see teams achieving great success for themselves and their organizations. We believe Dagster is the natural tool of choice for this generation of data platform teams who need to provide a unified development workflow, orchestration and observability across the entire data platform. Let's move on to another component in this phrase: the control plane. Now lots of companies talk about control planes in lots of domains.
It's worth asking the question: what is a control plane? What does it even mean? Very few people define it when they talk about it. It's more of a "you know it when you see it" type of phenomenon. The term comes from software-defined networking, which emerged in the 2000s and 2010s as the way to manage networks at organizations. Instead of IT dealing with underlying, inflexible physical infrastructure to configure and operate networks, they instead managed a software representation of that network. This led to much more efficient and flexible network management by smaller, higher-leverage teams.
Now, this is the age of AI, so in the process of researching this talk, I asked various AI services to define the term "control plane". LLMs are a lossy compression of all human knowledge. So, while imperfect, it's often a good way to summarize the commonly understood definition and usage of terms. Two statements really stuck out to me: One, "responsible for making decisions about how to handle data or resources."
And two, "in modern network architectures like Software-Defined Networking (SDN), the control plane is often decoupled from hardware and centralized, allowing for more flexible and programmable network management." Let's adapt these sentences to the data platform context. The first becomes "responsible for orchestrating data or resources." In computing platforms, it is the orchestrator that "makes decisions" about resources at runtime. The second becomes "in modern data platforms, the control plane is often decoupled from data processing and centralized, allowing for more flexible and programmable data management."
This is starting to sound pretty familiar. So to summarize, a control plane is one, an orchestrator that makes runtime decisions. Two, it's programmable with the right abstractions. Three, it is flexible enough to handle any set of tools and any infrastructure topology. And four, it can be centralized and managed by a single highly leveraged team.
Let's go through those in detail. In a data platform, the orchestrator takes responsibility for runtime decisions about when to run computations, in what order, and with what configuration. That is the function of orchestration, and it's at the core of the control plane. Dagster takes that further with asset-oriented orchestration. The job of a data platform is to produce and manage data assets used by the organization for operations, analytics, and AI. We believe it is important for the core abstraction to reflect the core purpose of the platform and to use the vocabulary and mental model that all stakeholders use when interacting with it.
That is why Dagster is centered on the software-defined asset: a declaration, in code, of an asset that should exist. This makes Dagster not only a more natural user experience, but also a reliable system of record for the structure of the data platform itself. Dagster is the control layer that contains software definitions of the assets in the data platform, such as tables, machine learning models, and reports, together with an orchestration engine that schedules the production of those assets. This control layer is programmable, accessible, and operable via open APIs. Dagster's built-in lineage cataloging and operational tools are built on those APIs.
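A minimal sketch of such a definition, assuming Dagster's `@asset` decorator and `MaterializeResult` API (the asset name and metadata are illustrative):

```python
import dagster as dg


# A software-defined asset: a declaration in code that this table should
# exist, plus the logic that produces it. The name is illustrative.
@dg.asset(group_name="analytics")
def customer_summary() -> dg.MaterializeResult:
    row_count = 0  # ...produce the table here, counting rows as you go...
    # Runtime metadata is recorded against the asset's history, so the
    # control plane is a system of record, not just a scheduler.
    return dg.MaterializeResult(metadata={"row_count": row_count})
```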
This structure allows our customers and ecosystem partners to build their own custom tools and integrations on top of Dagster. The control layer lies atop the data processing layer. Dagster is able to ingest all the definitions and metadata within those constituent tools, such as models in transformation tools like dbt or sources in ELT tools like Sling, and understand them natively. It can then invoke the computations that back those assets and ingest their runtime metadata.
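Here is a rough sketch of that ingestion in the dbt case, assuming the dagster-dbt integration and a hypothetical project path:

```python
from pathlib import Path

import dagster as dg
from dagster_dbt import DbtCliResource, DbtProject, dbt_assets

# Hypothetical path to your dbt project.
dbt_project = DbtProject(project_dir=Path("path/to/dbt_project"))


# One decorator ingests every model in the dbt manifest as a Dagster asset,
# with lineage and metadata understood natively by the control plane.
@dbt_assets(manifest=dbt_project.manifest_path)
def my_dbt_models(context: dg.AssetExecutionContext, dbt: DbtCliResource):
    # Invoke `dbt build` and stream per-model events back to Dagster.
    yield from dbt.cli(["build"], context=context).stream()
```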
We believe it is critical that the control plane integrate with any data processing or pipelining tool. Why? Data platforms are diverse, bringing together many practitioners from different backgrounds: data engineers, analytics engineers, and machine learning engineers, to name a few. They are not going to consolidate on a single pipelining or transformation tool.
Their needs and preferences are too different. Therefore it must support any tool so that it can be used by any practitioner. And that's not all. The data needs of organizations vary considerably based on their business context. Organizations also come with history, baggage and technical debt. A landscape of heterogeneous data and infrastructure technologies is the norm, not the exception. A unified control plane must be able to flexibly deploy over any existing topology.
Only with the right abstractions, programmability, and flexibility does the opportunity to centralize the control plane arise. Without the right abstractions, the control plane does not provide enough richness and value to users and the application layer. Without programmability, you cannot manage the complexity, and without flexibility you cannot cover all the use cases and contexts. It is that centralization of the control plane that gives the data platform team leverage and the ability to provide a compelling, unified experience for everyone operating within the platform. This naturally leads to our final pillar, unification. And this is not just about technology, but about the way we organize our teams and our work.
Let's start with goals. We'll frame it as how we want data practitioner teams and data platform teams to operate within this unified platform. With data practitioners, we want them to work autonomously.
This means they're able to focus on the business logic and the tool of their choice, onboarding onto Dagster with minimal effort. They can self-serve their data pipelines within a software development lifecycle: detecting errors, fixing bugs, and shipping code quickly to resolve issues. This is huge, as practitioner teams can work independently without putting the data platform team in the critical path. Finally, they can work hand in hand, as close as possible to their business stakeholders. The flip side of this coin is that it allows data platform teams to do leveraged work.
They can ensure all practitioners are self-sufficient and productive by providing a high-quality developer experience and abstracting away all infrastructure concerns. They are able to focus on high-impact, platform-wide investments that smooth data operations, improve efficiency, and deliver organization-wide impact. And they can partner with cross-cutting policy teams, like finance and compliance, to efficiently drive projects of organization-wide scope, rather than forcing those teams to laboriously navigate a siloed data infrastructure. So why isn't this happening now with existing technologies and practices? Let's talk about how this plays out in the current reality.
At first, you start out with a legacy orchestrator, in this case Airflow. You start to set things up and you have simple pipelines running daily. However, as you add tools and more practitioners to the platform, you notice that things start going wrong. The practitioners become less productive and self-sufficient and the platform engineers are bogged down in the work of just keeping the infrastructure alive.
Why? Because these tools were not designed with this thinking in mind. Here are just a few examples: the lack of multi-tenancy means that practitioners stomp on each other. Dependency hell. Breakages in one pipeline cascade into another. It's a nightmare.
There's a lack of testability and no functional software development lifecycle, which means that practitioners have low productivity. And there's a lack of rich asset-oriented tooling, which means a siloed, fractured debugging and operational experience. And this is just a taste of the problems. The net result should sound familiar: both the practitioners and the platform engineers fear changing things and are constantly firefighting. This inflexible monolith is no longer sustainable.
Now the team typically has two paths forward: you either break it apart or you lock it down. Both of these have very nasty tradeoffs. Let's explore the "break it apart" option. Breaking apart means in practice moving the ownership of the data infrastructure to the data practitioners and their teams.
This manifests itself as numerous orchestration instances and technologies, managed by the practitioner teams themselves, with limited to no support from the data platform team. There are a lot of problems with this. There's often not enough platform or infrastructure expertise in the organization to execute on this. It's nearly impossible to deliver cross-cutting value to practitioners in the form of productivity or tooling improvements, and it leads to the siloing of the data teams themselves. For example, ML and data teams no longer have shared tooling and no longer collaborate in practice. They work off of different sources of truth.
This, at least temporarily, leads to increased autonomy for the practitioner teams, but the leverage in the platform is very low. It is now difficult for a platform team to efficiently deliver productivity or workflow improvements to every practitioner team. And instead of, say, the finance team working with a single platform team to control cost, they're forced to work with multiple practitioner teams with limited tooling support. The other option is to lock down the platform with policy.
This centralizes more work on the data platform team, limiting the autonomy of the practitioner teams. For example, they are not allowed to add or upgrade Python dependencies. Adding new technologies involves process and submitting tickets to the centralized team. This results in the data platform team inevitably taking ownership of more pipelines and business logic, which means they are on call more and have to track down, triage, and fix problems themselves.
The data platform team has gained some control and the ability to build centralized tooling, but practitioner autonomy has been severely reduced. This is frustrating for everyone. The only way to break this cycle and have "the best of both worlds" is to make different organizational and technology choices. We believe a rich, unified control plane for the data platform is the linchpin technology to successfully execute this. This way of thinking manifests itself across Dagster.
For example, native multi-tenancy allows teams to autonomously and safely deploy their pipelines onto a shared deployment. Features like branch deployments create a lightweight staging environment for every PR. This provides a unified developer experience when building and deploying pipelines on the platform, customizable to your infrastructure. Quality operational tools enable all practitioners to self-serve and operate production pipelines, and Dagster's single pane of glass and asset graph provide platform-spanning observability and visibility. This technology choice must match an organizational choice.
How do you organize your teams and platform in order to execute on this strategy? This is how we think about it structurally: First, there is a unified data platform team that owns the control plane. The data practitioners in the organization are their customers. The data platform team provides a unified developer experience, operations, and observability experience for all practitioners.
These practitioners and practitioner teams own their pipelines end-to-end, enabled on the shared platform. They focus on the business logic and work as closely as possible to their business stakeholders within a highly productive software development lifecycle. Finally, through a common abstraction layer, they collectively deliver a unified asset graph to the entire organization, which serves as a system of record for all production data.
We've been using this term, the "asset graph." What is it and why do we think it is so important? It is a graph of your production data assets built in any technology by every practitioner in your organization. Rather than some idealized concept, with Dagster it is an in-product manifestation of the structure of the data platform itself. By integrating with the orchestrator, it is not just an artifact of what happened but a living, breathing system of record for your data platform. And it's a rich substrate where you can directly observe, debug, and operate your entire system. It is that unified experience collectively built by all practitioners.
Thanks for coming and listening to our vision for the data ecosystem. It centers on the fact that every company has a data platform and should staff and build around it. And like we said, it's not just about technology, but about the way that we organize our teams and our work. We've covered how it's your data platform: the systems and tools that produce data.
It's a foundational asset, the basis of decision making, operations, and AI. It's costly, the focus of massive investment in time, money, and energy, and it's context-specific to your organization. Like in other domains, data platforms need a control plane. It's the orchestrator, making scheduling and resourcing decisions about when and what is computed. It is flexible and programmable, able to integrate with any tool used by any practitioner, deployed on any infrastructure.
And it is centralized, operable and managed by a single, high-leverage team. And lastly, it's a unified experience for all practitioners with a unified development and deployment experience, powerful operational tools and a unified asset graph that serves as a living, breathing system of record. In the end, organizationally, we want autonomous data practitioners that have end-to-end ownership over their data pipelines, working directly with their business stakeholders, enabled by a centralized data platform team able to invest in high-impact, high-leverage projects that deliver organization-wide value. We believe that Dagster is the way to implement this vision and we thank you for coming and listening and we look forward to the rest of Data Platform Week.