AWS re:Invent 2020: How Disney+ uses fast data ubiquity to improve the customer experience


Hi, I'm Prasad Srishti, principal product manager for Amazon Kinesis Data Streams. It's my pleasure to introduce the topic and speaker for this session, "How Disney+ uses fast data ubiquity to improve the customer experience," by Martin Zapletal, Director of Software Engineering at Disney+. Martin has a very interesting session planned today, where he talks about the evolution of streaming data and the pivotal role it plays at Disney+. So let's get started right away. Martin, the floor is all yours.

Thank you, and thank you so much for watching my presentation. My name is Martin Zapletal, and I'm a member of the Disney+ team. Today I'll talk about how Disney+ uses fast data ubiquity to improve the customer experience and support a number of features in Disney+. We'll spend a little time introducing Disney+ and how we use streaming data, then go through the evolution of streaming data, and of data in general, at Disney+. Most of this presentation, though, will be about the streaming data platform, an internal project we built. It is not a single use case but a platform that allows teams to build their own streaming use cases. We'll finish with some more concrete examples from the Disney+ platform.

We launched Disney+ some time ago, and hopefully it needs no introduction. You can see the Disney+ home page here in the background, and a number of the features on this slide are supported by streaming and fast data. At the bottom is the "Recommended for you" row: recommendations personalized to the user and updated automatically in near real time as you take actions and use the platform. Other use cases include monitoring our APIs and applying machine learning models to detect potential fraud or misuse of those APIs, and the continue watching feature, which tracks your progress in a video so that when you leave and come back you can resume where you stopped. We'll talk about more use cases later, but I wanted to give you an introduction to streaming data at Disney+.

To give you an idea of the scale at which we operate: we have many dozens of millions of users and hundreds of Amazon Kinesis streams, some with just a couple of shards and some with hundreds, deployed across multiple AWS regions and processing many billions of events and terabytes of data every hour.

To understand where we are today, let's set the stage with how we got there and the evolution of our approach to data. As usual, this is not something we were able to build from scratch; what Disney+ is very good at is innovating and improving our solutions over time to stay at the cutting edge and support our use cases. We went through a number of stages, starting with microservices backed by databases and data warehouses. Teams were able to use batch processing, but the insights were slow and limited, the teams had only limited skills and tools, and, more importantly, the teams and the information were siloed, so they could not extract all of the value the data held.

As we evolved, we started using event-driven solutions, both for transporting data between databases and data systems and for approaches like event-driven integrations and event sourcing.
Event-driven approaches have a lot of advantages: decoupling of the domain, deployment, and scalability, as well as replayability, back pressure, and a large ecosystem of processing and analytics tools. This was a great advancement in technology, but it was not yet an organizational change. Teams were still isolated; we kept the silos, because each team used its own approach to data formats, schema management, and data quality. So although it was a great step forward, it still did not let us scale the organization and use the data to its fullest extent to support all of the features we need in Disney+, especially at the scale we operate, with many teams, many streams, and many data types.

Which brings us to where we are today: a data-driven organization, or a data democracy. We needed to make data, analytics, machine learning, and experimentation first-class concepts in the organization, break those silos, and give teams the tools to succeed and innovate. Teams and services now have access to the data they need at the time they need it, so they can make decisions, take actions, and run experiments. By experiments I mean A/B tests, or experiments with multiple variants that we can measure and decide on in real time, as well as quickly built prototypes and production-ready applications that support their use cases. Two things are worth emphasizing here. First, it should not be just executives or product teams who have access to the data; anyone at any level in the organization should be empowered to solve their business problems and keep improving how they solve them. Second, it's not just people: machines should also be able to make decisions and use these capabilities in near real time.

That brings us to the streaming data platform. It's an internal project we built that creates an abstraction and management layer on top of the underlying technologies and fills the gap between the technology and the use cases teams build. To reach the level of enablement I just described, it must be easy, accessible, and almost natural for teams to start leveraging these data-driven features. This is just a small subset of the overall data-driven ecosystem, but today we'll focus on the streaming data side of it.

Here is where the streaming data platform fits at a high level in our architecture. Services, devices, and other producers expose their data sets and make them available in this ubiquitous, standardized platform for information exchange. They can also become consumers of other teams' data sets, or their own, which decouples producers and consumers and their capabilities and creates an information exchange ecosystem. They can also leverage streaming pipelines to derive new data from existing data, applying typical streaming operations such as windowing, aggregations, or joins of streams. The most important part of this slide is the triangle in the middle: the integration between the streaming data platform, the data platform, and analytics, machine learning, and experimentation, because those are what allow us to extract value from the data. Again, the results should be available not only to executives and product teams but also to engineering teams, as well as to the real-time applications and services that can use them in near real time.

We use Amazon Kinesis Data Streams as the backbone, the building block, of this whole ecosystem, and on top of it a number of technologies: Amazon Kinesis Data Analytics for Apache Flink, Databricks and Spark, AWS Lambda, the AWS client libraries (the SDK, KPL, and KCL), and Amazon Kinesis Data Firehose. The streaming data platform comes into play as the management and abstraction layer on top of these technologies, filling the gap between them and the use cases teams build on top of them, and also letting teams easily use the experimentation, analytics, machine learning, and AI platforms so they can focus on their business use cases instead of the infrastructure and the common problems.

We had a need for a reliable, cost-efficient, scalable event log. There are a couple of options, such as Kafka, Pulsar, Kinesis, and others, and we actually use some of the others for specific use cases, but we decided to use Amazon Kinesis Data Streams for most of the use cases in the ecosystem we'll talk about today. The reason is that it is a replicated, partitioned, ordered, distributed commit log; it is fully managed by AWS; it replicates writes to three Availability Zones; it integrates really well with other AWS services; and, importantly for us, it offers near real-time processing, scalability, and elasticity.
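To make the commit-log model concrete, here is a minimal sketch of publishing a batch of events to a Kinesis data stream with the AWS SDK for Python (boto3). The stream name, event fields, and partition-key choice are hypothetical, and the platform's real producers go through the producer APIs and SDK tooling described later rather than calling Kinesis directly.

```python
import json
import boto3

# Hypothetical stream name and event shape; real producers batch larger volumes,
# back off on retries, and spread load with well-chosen partition keys.
STREAM_NAME = "playback-events"

kinesis = boto3.client("kinesis")

def publish_events(events):
    """Publish a small batch of events. The partition key determines the shard,
    and Kinesis preserves order only within a shard."""
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": event["session_id"],  # keeps one session's events in order
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName=STREAM_NAME, Records=records)
    # PutRecords is not all-or-nothing: failed entries must be retried explicitly.
    if response["FailedRecordCount"]:
        retries = [rec for rec, res in zip(records, response["Records"]) if "ErrorCode" in res]
        kinesis.put_records(StreamName=STREAM_NAME, Records=retries)

publish_events([{"session_id": "abc-123", "event_type": "playback_progress", "position_sec": 542}])
```

The partition key matters because ordering is a per-shard property, which is why a session or account identifier is a natural choice when downstream consumers need a given entity's events in order.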
Let's talk about some more concrete examples of how we use streaming and fast data in Disney+. On the left-hand side you can see the Disney+ home page again, and on the right-hand side playback of a piece of content. There are a number of features, some that you can see and some that you cannot because they happen in the background, but they have a major impact on the customer experience. I mentioned recommendations and personalization, which improve over time as you use the platform. I mentioned the continue watching feature: on the right-hand side you can see your progress in the video, and as you leave and come back you should be able to resume from the same spot. I also mentioned monitoring API usage and using machine learning to detect fraud or unusual patterns in that behavior. But there are plenty of other features: monitoring quality of service and video quality and taking action on that information, customer support, social watching features, CDN and traffic routing decisions, and monitoring device, application, and service performance in near real time so we can react if we detect unusual patterns, plus plenty of other analytics and service-to-service asynchronous integration use cases. All of these are powered by streaming and fast data, and as you can imagine, many of them need to run in near real time. We cannot use batch processing or similar approaches, because these features directly or indirectly impact the customer experience, so we need to be able to react in real time.

There are three main goals we set for the streaming data platform project: ubiquity, platform, and culture. I'll spend some time going over those three to explain what we mean by them, why they matter, and how we implement and expose them in the streaming data platform.

The first concept is ubiquity. It's the idea that teams expose their data sets and provide universal, near real-time availability of data, while still providing the appropriate framework for managing and controlling access to that data. There's a lot going on in this diagram; we'll only talk about it at a high level, so don't worry if you cannot capture everything yet, because we'll spend more time later on concrete pieces of the pipeline. Note also that this is just a tiny subset of the overall ecosystem (we have many more streams, producers, and consumers) and it's simplified to fit on the slide; the real solutions are considerably more complex.

This pipeline processes data coming from devices. An Amazon ECS service processes the data and publishes it to Kinesis Data Streams. At the very top of the pipeline there is another stream, produced by a service: imagine a service that receives a request, applies some business logic, and then publishes information about what happened. This could be a service that manages sessions, or that holds information the device does not have access to, which is why we merge, or join, those two streams together. We do some validation, routing, and filtering, enrich the information coming from the device with the information from the other stream, and send it on to the downstream consumers. In this instance there are two downstream streams: one for data that passed validation, the other for data that failed validation. Both are ingested into Amazon S3, which is our data platform. Then we have two real-time use cases at the bottom of the slide. One uses Spark to do some aggregations, say rebuilding sessions using a state machine, and stores the results in Amazon Elasticsearch Service using Kinesis Data Firehose; you can imagine this being used for near real-time operational analytics. At the very bottom, Amazon Kinesis Data Analytics for Apache Flink applies some basic aggregations and then a machine learning model, and produces results downstream to another Kinesis data stream, whose consumers can take actions and make decisions based on that information. A use case like fraud detection could look something like this.
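As a rough illustration of the validate-and-route step in this pipeline, the sketch below reads from a source stream and forwards each event to either a "passed validation" or a "failed validation" stream. It is deliberately simplified: a single shard, no checkpointing, no join with the service-produced stream, and made-up stream names and a made-up validation rule; the production pipeline runs on ECS and uses the platform's SDK and code generation.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream names; the real pipeline also enriches events by joining
# a second, service-produced stream before routing.
SOURCE, VALID_OUT, INVALID_OUT = "device-events", "device-events-valid", "device-events-invalid"

def is_valid(event):
    # Stand-in for the schema-registry-driven validation described later.
    return "event_type" in event and "session_id" in event

def route(shard_id="shardId-000000000000"):
    it = kinesis.get_shard_iterator(
        StreamName=SOURCE, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=500)
        for record in resp["Records"]:
            event = json.loads(record["Data"])
            target = VALID_OUT if is_valid(event) else INVALID_OUT
            kinesis.put_record(
                StreamName=target,
                Data=json.dumps(event).encode("utf-8"),
                PartitionKey=event.get("session_id", "unknown"),
            )
        it = resp["NextShardIterator"]
        time.sleep(1)  # stay under per-shard GetRecords limits
```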
One of the first things we have to think about to make data usable across the organization is automated data management. By data management I mean schema management, but also evolution, quality, governance, access control, discoverability, security, privacy, lineage, ethics, and whatever else comes with the data, and we need to manage it automatically to make the data accessible and easy to use. This is especially important for streaming use cases, because unlike API-driven services, the producer usually does not know, and usually does not want to know, who the consumers of the data are and how they use it. So we have to create a contract between them to manage the quality and all of these properties throughout the ecosystem.

We have a standardized transport layer, envelopes and serialization formats and so on, but that's not enough. We need to manage the quality of the data, because we don't want downstream consumers to run into issues deserializing or working with it; that would introduce problems around consistency and reliable delivery. We essentially need a declarative way to define the schema, and all of the other properties, as well as ways to enforce them throughout the ecosystem.

So we built our own solution, which we call the schema registry. There are existing solutions out there, but we needed some specific properties and we needed to integrate it with our other data engineering tools. It offers a centralized view of the schema as well as the other properties, such as evolution, expectations, and data quality, and it makes that view available to all of the producers and consumers, so they share the same understanding of those properties, along with the tools to enforce the checks throughout the ecosystem. We have code generation that produces models, serialization, and validation, and that generates adversarial examples; an SDK that allows teams to more easily use all of the capabilities of the streaming data platform; automated QA; producer APIs that front the Kinesis data streams and apply all of the validations before producing data to the Amazon Kinesis Data Streams instance; and checks in stream processing and batch processing. We try to apply as much validation as far upstream as possible, and as early in the development lifecycle as possible. That creates a protected ecosystem where everyone agrees on the properties they expect from the data, which keeps problems from spreading through the ecosystem and lets us shorten iterations, remove bugs, and so on.
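The schema registry itself is internal, so the following is only a sketch of the idea behind those producer APIs: validate an event against a declared schema before it is ever written to the stream. The schema, stream name, and the use of the jsonschema package are assumptions for illustration, not Disney+'s actual formats or tooling.

```python
import json
import boto3
from jsonschema import Draft7Validator

# Hypothetical schema for a "playback_progress" event; in the real platform the
# definition, its evolution rules, and quality expectations live in the schema registry.
PLAYBACK_PROGRESS_SCHEMA = {
    "type": "object",
    "required": ["event_type", "session_id", "content_id", "position_sec"],
    "properties": {
        "event_type": {"const": "playback_progress"},
        "session_id": {"type": "string"},
        "content_id": {"type": "string"},
        "position_sec": {"type": "number", "minimum": 0},
    },
}

validator = Draft7Validator(PLAYBACK_PROGRESS_SCHEMA)
kinesis = boto3.client("kinesis")

def accept(event, stream_name="playback-progress"):
    """Producer-API-style gate: reject malformed events before they reach the stream."""
    errors = [e.message for e in validator.iter_errors(event)]
    if errors:
        # Rejecting here keeps bad data from spreading to every downstream consumer.
        raise ValueError(f"event failed schema validation: {errors}")
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["session_id"],
    )
```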
The second concept I mentioned is platform. Platform refers to the tools and practices that allow us to build streaming solutions. It consists of tools for declarative pipelines and deployment of streaming solutions, as well as a control plane that focuses on shared concerns such as management and monitoring.

Back to this pipeline: some parts of these streaming solutions can be custom made, and teams should always have the ability to build their own custom solutions, ideally using the code generation and the SDKs so it's easier to align with the expectations. But there are parts of these pipelines that repeat, that we can turn into patterns, deploy using just configuration, and compose into larger pipelines.

The first example is what we call domain events. It's a typical scenario where a service receives an HTTP request, applies some business logic, stores some state in a database, and emits an event, a fact about what happened, to a stream. The downstream consumers of that stream can then build views or otherwise use that data, but in many cases they need the data to be delivered reliably so that their views stay consistent with the source of truth. However, if you store data to a database on one line of your code and then emit the information to the stream on the next line, you essentially lose the transactional guarantees: you lose atomicity and isolation, so downstream consumers may receive the data out of order or even miss events that were stored in the database. This is a typical distributed systems problem, sometimes referred to as dual writes, and there are many solutions to it. One of them, shown here, uses DynamoDB Streams: essentially replaying the commit log of the database and applying some transformations to create well-formed domain events and make them available to the downstream consumers.
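One common way to implement that DynamoDB Streams approach is a Lambda function subscribed to the table's change stream that turns each change record into a well-formed domain event on a Kinesis stream. The table attributes, event name, and stream name below are hypothetical; the point is that the event is derived from the database's own change log, so a write the service committed cannot be silently missed.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
DOMAIN_EVENT_STREAM = "watchlist-domain-events"  # hypothetical

def handler(event, context):
    """Lambda handler for a DynamoDB stream: replays the table's change log
    as domain events so that consumers stay consistent with the source of truth."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]  # DynamoDB-typed attributes
        domain_event = {
            "event_type": "watchlist_item_updated",
            "account_id": new_image["accountId"]["S"],
            "content_id": new_image["contentId"]["S"],
            "sequence": record["dynamodb"]["SequenceNumber"],
        }
        kinesis.put_record(
            StreamName=DOMAIN_EVENT_STREAM,
            Data=json.dumps(domain_event).encode("utf-8"),
            PartitionKey=domain_event["account_id"],  # preserves per-account ordering
        )
```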
Another typical example is the ability to validate, route, and filter information while decoupling producers and consumers and their capabilities, semantics, and needs. Ideally the consumers don't even have to know what the underlying streams are; they are abstracted from that and can simply subscribe and get the information they need with the semantics they need.

The ability to join and enrich streams is another common use case, either enriching the data in a stream from a static source or from another stream. In this instance we join two Kinesis data streams using Amazon ECS, so a somewhat custom solution, but Amazon Kinesis Data Analytics for Apache Flink is actually really good for this use case.

Another example is ingestion, or convergence, of data into some sort of sink, some sort of data storage. In this instance we ingest data from a stream into Amazon S3, which is our data platform. We can use the information from the schema registry to automatically manage the downstream tables and their evolution, and we can build these kinds of pipelines using just configuration. We also use the Delta format in our data platform, which is a streaming source as well as a streaming sink, so we can continue streaming downstream within the data platform if we need to.

These solutions allow us to build and deploy streaming applications, but it's important to focus on their quality as well, because, as I mentioned, some of them support features that are directly visible or that indirectly impact the customer experience, so we need to be able to consider them production ready. The maturity of streaming applications is not always at the same level as applications with APIs, so we built a number of tools and a control plane that focuses on the runtime aspects of these pipelines: common stream management, observability, and deployment, along with performance testing, elasticity, auto scaling, deployment automation, reliability and resilience, multi-region patterns, and many more. We need tools that make these easy and that raise the maturity of these applications so we can use them as part of our production solutions.

Over the next couple of slides I'll go through some examples, and the main focus will be trade-offs, because as with any other solution there are choices you have to make, and I'll talk about how they may impact your solutions. There are different types of use cases, such as analytics and service-to-service integration, and they may differ in a number of ways: their state, their throughput requirements, their latency requirements, their semantics and guarantees such as reliable delivery and ordering, how they handle multi-region patterns, and plenty of other things.

In this particular chart you see a metric, I believe CPU or memory usage, of an Amazon Kinesis Data Analytics for Apache Flink application during a deployment. In the middle there's a gap in the reported metrics, which may mean the application was simply not reporting metrics during that period, so it may not have been running and processing our events. That's perfectly fine for a lot of use cases, but for those that need near real-time processing guarantees it could be a problem, which is why we built automation around deploying these kinds of solutions. We built a blue/green-style deployment solution that creates a new stack, waits for it to become healthy (which means different things for different applications and may require rebuilding state), then starts routing traffic to the new stack and destroys the old one. There are other things to worry about when latency matters. If processing takes longer because of retries or anything else, that delays not only the record the application is currently processing but potentially other records in the same shard if records are processed sequentially, and that can be a problem for near real-time applications. So we have to think about trade-offs such as parallel processing versus ordering and batching versus latency; all of these are important to set correctly, but sometimes they work against each other, so you can only pick some.

This second example focuses on streaming application elasticity. You can see in the chart that we were able to scale our Amazon Kinesis data streams up during peak traffic hours and back down during low-traffic periods to save costs. Kinesis is elastic: you can change the number of shards in the console or through an API. To leverage that to its fullest we built a number of solutions around it, including a tool that manages the scaling process and a system that monitors metrics such as incoming bytes and records and scales the streams up and down automatically. We do the same with services: we monitor metrics such as latency, iterator age, or batch duration, depending on the use case and the technology, and scale our services up and down automatically. But while we're talking about trade-offs, elasticity has trade-offs too, because scaling both the streams and the applications may cause rebalancing of shard and lease assignments and can impact latency for a short period of time.
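Our internal tooling around scaling is more involved than this (gradual resharding, cooldowns, multiple metrics), but a minimal version of the monitor-and-scale loop could look like the sketch below. The stream name, the single IncomingBytes metric, and the 50% headroom target are assumptions for illustration.

```python
import datetime
import boto3

kinesis = boto3.client("kinesis")
cloudwatch = boto3.client("cloudwatch")

STREAM = "device-events"               # hypothetical
BYTES_PER_SHARD_PER_SEC = 1024 * 1024  # 1 MB/s write limit per shard

def recent_incoming_bytes_per_sec(stream, minutes=15):
    """Average write throughput over the last `minutes`, from CloudWatch."""
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(minutes=minutes)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "StreamName", "Value": stream}],
        StartTime=start,
        EndTime=end,
        Period=minutes * 60,
        Statistics=["Sum"],
    )
    datapoints = stats["Datapoints"]
    return datapoints[0]["Sum"] / (minutes * 60) if datapoints else 0.0

def rescale(stream=STREAM, headroom=0.5):
    current = kinesis.describe_stream_summary(StreamName=stream)[
        "StreamDescriptionSummary"]["OpenShardCount"]
    bytes_per_sec = recent_incoming_bytes_per_sec(stream)
    # Target enough shards to keep utilisation around 50% of the per-shard write limit.
    target = max(1, int(bytes_per_sec / (BYTES_PER_SHARD_PER_SEC * headroom)) + 1)
    if target != current:
        # UpdateShardCount can only move a limited distance per call and triggers
        # resharding, so a production tool scales gradually and with cooldowns.
        kinesis.update_shard_count(
            StreamName=stream, TargetShardCount=target, ScalingType="UNIFORM_SCALING"
        )
```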
This other example focuses on reliability. The picture shows the throughput of the same stream in two different AWS regions, and you can see that at one point the traffic on one of the streams drops to zero, the other one picks it up, and eventually everything stabilizes and continues as usual. Fortunately, this was a scheduled part of our chaos testing initiatives, so we expected it and wanted it to happen. But as I mentioned, some applications require strict reliable-delivery guarantees, such as at-least-once or exactly-once, and sometimes ordering, and we need to maintain the guarantees they need even during failure scenarios. As you can imagine, that can be very challenging by itself, but there are also some non-obvious things in the Kinesis ecosystem to watch out for. As an example, the KCL skips records by default if an exception is thrown while processing a record. That may not be the behavior you want; you may want to retry instead, but if you don't know this, you may silently start dropping records. And the KPL does not guarantee ordering, even within a single shard, because it uses the PutRecords API, and retries in some scenarios can result in out-of-order records. The point is that Kinesis provides the building blocks to build the guarantees you need into your applications, but you have to make the right choices for your particular use case, maintain them end to end, and monitor them to make sure they are actually being met.
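The sketch below is not the KCL; it is a simplified polling consumer that only illustrates the choice just described: retry a failing record a bounded number of times instead of silently skipping it, and make an explicit decision about what happens when retries are exhausted. The retry count, backoff, and single-shard handling are assumptions.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")
MAX_ATTEMPTS = 5  # assumed bound before a record is treated as poison

def process(event):
    """Business logic; an exception here must not mean the record is silently lost."""
    ...

def consume(stream, shard_id="shardId-000000000000"):
    it = kinesis.get_shard_iterator(
        StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=500)
        for record in resp["Records"]:
            event = json.loads(record["Data"])
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    process(event)
                    break
                except Exception:
                    if attempt == MAX_ATTEMPTS:
                        # At-least-once choice: surface the failure (park or alert)
                        # rather than drop the record and keep going.
                        raise
                    # Backing off blocks the shard: the ordering-versus-latency trade-off.
                    time.sleep(2 ** attempt)
        it = resp["NextShardIterator"]
```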
The last component of the streaming data platform is culture. I won't spend a lot of time on culture and how we embed it in the organization, but I think it's really important, sometimes even more important than the technology itself. We need to embed data-centric thinking in the company's culture, so we focus on documentation, best practices, and training, and on working very closely with teams: helping them succeed, learning from them, learning new requirements, building new tools, and overall making this whole ecosystem easy to use. Lastly, integration with all of the other technologies in the data-driven ecosystem, the data platform, the experimentation platform, and the machine learning and AI platform, becomes really important to fully use the value in the data.

In conclusion, we needed to build a data-driven organization, a data democracy, and make data and insights first-class concepts so that teams are empowered to be data-driven and to build and improve their solutions. We approached it by focusing on three things: making data ubiquitous, so it's very easy to access and manage; providing tools and services for building streaming solutions, the streaming data platform concepts; and embedding all of this in the culture and tooling of the organization. The Amazon Kinesis ecosystem allows us to do this reliably, cost-efficiently, and at scale.

Here are a couple of resources if you're interested: references to our blogs and some of our open-source projects that describe in more detail some of the solutions I briefly mentioned. And lastly, a big thank you to the many colleagues who participated in this presentation and who built a lot of the tools and solutions I talked about. Thank you so much.

2021-02-13
