AWS re:Invent 2020 - Under the hood: How Amazon uses AWS for analytics at petabyte scale

Hello everyone, welcome to AWS re:Invent 2020. My name is Karthik, and I'm a senior solutions architect at AWS supporting Amazon.com as a customer. I have with me today my colleague Josefa Igatpuriwala, also a solutions architect at Amazon, who will join us for the second half of the presentation. Today we will be talking about "Under the hood: how Amazon uses AWS for analytics at petabyte scale." In today's presentation we will cover our history and how we evolved to where we are today, the migration patterns, the current architecture, and evolution, which is key for every architecture on AWS.

Let's start with history, to give you a high-level overview of the business landscape. Amazon is a globally distributed business; we have users around the world who contribute to and interact with this system. In the legacy days, this system ran one of the largest Oracle clusters in the world, and it was being used by multiple teams within Amazon: Prime Video, Alexa, Music, all of our businesses. We have SDEs, analysts, and machine learning scientists with a plethora of use cases and access patterns for the underlying data. We have about 38,000 active data sets and 80,000 users running about 900,000 daily jobs on this platform.

Here's a quick summary of our legacy environment. It was a single fleet that managed all of our data via the Oracle clusters. Back in the day, there was only one environment due to the massive scale: there was no separate ETL environment, but rather an ELT environment in which the jobs were handled inside the warehouse itself. We have our source systems on the left, whose underlying data stores are Amazon DynamoDB and Amazon Aurora. Within the legacy data warehouse we had multiple services, such as workflow, a web interface, APIs for the workflow, data ingestion, and so on, and these services leveraged a combination of Oracle and Redshift to handle the data warehousing needs. The downstream dependencies for these internal services were software applications, reporting systems, and the users who interacted with this data.

Within Amazon, one of the largest data lakes we own is named the Andes data lake. The top three goals we had when building this data lake were: first, have an ecosystem that scales with Amazon's business. Every day we add new products for our customers on our website, add more TV shows and live sports to Prime Video, and more, so we needed a data lake that could keep up with this ever-growing demand. Second, we wanted an open architecture that allowed evolution; this gives engineers the option to pick the right tool for the right job across all the services that constitute the data lake. Third, our underlying technology should be AWS services. AWS provides the building blocks that allow customers like Amazon to build services that meet their unique business demands.

Now let's talk about the steps we took as part of this migration. As you can tell, we had some challenges in our legacy environment. Our data storage and underlying compute were coupled together, which meant we could not scale individual components but instead had to scale up the entire platform. Amazon engineers were spending hundreds of hours per month on maintenance and patching, which is not a good use of an engineer's time, not to mention the expensive licensing and hardware costs. The hardware had to be pre-provisioned for peak workloads, which meant we were paying peak cost for the entire year. That was definitely not frugal.
So, some of the requirements we had for our new platform: the business will not stop when interacting with this platform, peak workloads or not; our customers expect Amazon services to be online 24x7, 365 days a year. Our analytics solutions needed a redesign due to some of the challenges mentioned earlier. Most importantly, we needed a decentralized system, a system decoupled down to the lowest layer so we could scale individual components when needed rather than provisioning for scale.

Now let's talk about how we made that happen. This took us about two years. It was a huge technical challenge, but it was necessary as part of our evolution. First, we built out our metadata service and governance service. Then we built a data mover that moved the data from the Oracle system over to S3; S3 was our primary data store, or landing zone, and in the new destination we leveraged new file formats versus the old. Regarding the load strategy, we ensured that the data was present in two systems in parallel. This was to ensure no data loss: we intended to run both the legacy system and the new one at the same time until we finished the entire migration. This was very hard; however, migrating the jobs was the hardest part. Nine hundred thousand jobs was no small feat. We built a query conversion service that helped us convert our legacy transforms into the new, migrated transforms. Once the data was completely migrated, we disabled our legacy ecosystems one by one and had just one platform, which was our Andes data lake.

Here are some key takeaways for our customers. Leadership support is a must; it's a much easier approach top-down versus bottom-up. Self-service tools and mechanisms are the key to scale; without them there's a huge operational overhead for the teams that own these ecosystems. The plan to run legacy and new in parallel was a decision made per our business requirements, so we worked backwards from those to arrive at that decision. During this migration there was a learning curve for Amazon engineers; engineers had to retrain themselves on the newer tools being leveraged. Downstream analysts and engineers had to redo their reporting and charting, which also meant training for them to learn the newer tools. And, as always, communication is key: every change we made was communicated to all stakeholders, and all the important decisions were made together. Here at Amazon, we're all owners. Now I would like to bring up Josefa to talk about the current architecture.

Hello everyone, thank you for joining this session today. My name is Josefa Igatpuriwala and I'm a solutions architect with the Business Data Technologies team within Amazon. My job is to work with teams within Amazon and help them architect solutions using Andes and AWS services. During the migration, my team was heavily involved in helping users move from the legacy data warehouse to the Andes data lake architecture. So today we are going to dive deep into the Andes data lake and how it helped us make this migration successful.

Here's what we built in our Andes data lake. This slide shows how data flows from the source systems to Andes and from Andes to downstream targets; I'm going to dive into the different sections of the slide one by one. For now, let's focus on the box in the center called the Andes data lake. The workflow boxes you see inside on the left were inherited from the legacy system: we kept our workflow systems largely the same and enhanced them to read from S3 in addition to Oracle. During the migration we kept the old system running in parallel, and we added a set of services to manage the data on S3.
These services, which we collectively call Andes, consist of a UI at the top where users can interact with data sets. Users can search for and find data sets, modify them by adding a column or changing the schema, and version and deprecate data sets in a coherent way. Apart from managing the data sets on Andes, these services also consistently propagate data to downstream compute targets. They sit at the core of Andes, on top of S3, allowing the files in S3 to be used or treated like tables.

One of the important things we also built was merge semantics for these data sets. If you've worked with S3, you know that data in S3 follows an append-only model: you keep adding data to the system, but when you get multiple versions of the same record, they're just sitting across multiple files. This system builds a merged version of the data set in S3 so that users can run compute on top of it, and similarly, when we propagate the data to Redshift, it appears as if a merge had been performed on the records. As I said earlier, these services make the files on S3 look like tables, so users can define primary keys for their data sets, and we use that information to perform the merge on Redshift by deleting the old records and inserting the new records in their place (a sketch of this pattern appears at the end of this section). All of these services helped us emulate the behavior of the legacy data warehouse for accessing data backed by S3.

One of the big pieces here is the box in the middle of the right-hand side, called the subscription service. As changes occur to data sets on Andes, the subscription service propagates the data or metadata out of the analytics environment across Amazon. Users can filter on rows and pull data onto their local compute; if they need more storage, they can just add more storage, and if they need more compute, they can just add more compute. If they prefer Redshift, they can go with Redshift; if they want to use EMR, they can go with EMR. We also integrated with Glue, which was a centerpiece in our non-relational approach to accessing data sets. Many of you are aware of the Elastic MapReduce service, which can be used as a delivery platform for analytics; our integration with Glue helps software engineering teams access data sets via EMR. Synchronizers for Glue propagate metadata to the Glue catalog for any new updates or additions on Andes. The last target we had in scope was Redshift Spectrum. Redshift Spectrum is, in some ways, like Glue plus Redshift: we were able to synchronize data sets so that users could also use Spectrum along with their Redshift cluster.

One of the key things about the subscription service was that it was managed by our orchestration service, inherited from the legacy data warehouse. This service knew when a data set had been refreshed and the data had been propagated to one of the targets, so that jobs could be triggered on that data set once we knew the data had arrived on that target. This service sat at the heart of our legacy system, and we integrated and extended it with the new Andes architecture. Once we built all these services and got them stood up, and people started migrating, we were able to disable the legacy Oracle environment and eventually were running only S3 as our backing store and the distributed analytics environment solely on AWS services.
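The delete-and-insert merge described above is a common staged upsert pattern on Redshift. As a rough illustration only (this is not Amazon's internal code), the Python sketch below assumes the new record versions have already landed from S3 into a hypothetical staging table and that `order_id` is the primary key users declared for the data set; all table, column, and connection names are made up.

```python
# A minimal sketch of the "delete then insert" merge pattern described above:
# new record versions are staged in a table loaded from S3 (for example via
# COPY), then replace the rows with matching primary keys in the target table.
# Table, column, and connection names are hypothetical.
import psycopg2

PRIMARY_KEY = "order_id"
TARGET = "analytics.orders"
STAGE = "analytics.orders_stage"

MERGE_SQL = f"""
DELETE FROM {TARGET}
USING {STAGE}
WHERE {TARGET}.{PRIMARY_KEY} = {STAGE}.{PRIMARY_KEY};
INSERT INTO {TARGET} SELECT * FROM {STAGE};
"""

def run_merge() -> None:
    """Apply the staged delta to the target table in a single transaction."""
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="...",  # in practice, fetch credentials from a secret store
    )
    try:
        # psycopg2 opens a transaction implicitly, so the DELETE and INSERT
        # commit together and readers never see a half-applied merge.
        with conn.cursor() as cur:
            cur.execute(MERGE_SQL)
        conn.commit()
    finally:
        conn.close()
```

Running the DELETE and INSERT inside one transaction is what makes the table appear to downstream consumers as if the merge had been performed in place.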
So, in the first slide we looked at the heterogeneous sources; now let's dive into that a little more. What do we support currently? From Amazon Kinesis, data can be ingested in a real-time or batch manner using Kinesis Data Streams or Kinesis Data Firehose. From DynamoDB Streams, users can ingest data at a grain as low as a minute. From Redshift and RDS, data can be ingested into Andes in a batch manner. Recently we also started supporting registration of metadata for data sets in AWS Glue backed by S3; in this case, metadata is synchronized to Andes based on CloudWatch events.

Now let's dive a little more into the Andes metadata service, starting with the synchronization primitives. This is where we write the data out to compute targets: we keep track of the data when it arrives and coherently propagate it to users' compute. It also has the merge semantics integrated into it, which helps us mimic the merge logic of a relational database. Next is the completion service. Many of our customers wanted to use Glue, Airflow, or even custom EMR orchestration, for example, to get notified about a data set, such as when a specific logical partition had been updated; the completion service helps users coordinate updates from step to step across these different orchestration services. We built the concept of a manifest, which is nothing but a way of bundling together the changes made to many different files: a manifest consists of the list of ARNs for the files being updated, and it also captures the metadata and merge semantics. We partnered here with the Redshift team so that we use the same manifest spec the Redshift team had published, which helped us integrate with Spectrum (a minimal example appears at the end of this section). For access management and governance, we built an abstraction layer on top of data sets to control permissions on who has access to which data sets; data governance is still a work in progress, but we were able to build strong access control for Andes data sets using this abstraction layer. For tracking workflow analytics, we have the ability to perform lineage and trace upstream and downstream workflows, which helps us with operational troubleshooting and workflow discovery and reduces engineering effort. And last is data formats: even though Andes was built to be data-format agnostic, propagation of data to downstream targets currently happens only for TSV and Parquet formats.

Let's dive into the subscription service. The subscription service is a very powerful tool when you're thinking about analytics at scale, especially at Amazon scale; there is no single fleet, system, or solution that could manage the compute requirements for all of your analytics at that scale. We built the subscription service to propagate data to diverse compute targets in a coherent and consistent way. A subscription mainly consists of two things. First is the contract, which generally holds the metadata about governance (who has what kind of access to the data), security, and SLAs, such as a declaration of how frequently data sets will be updated or what kind of support the data set owners have agreed to provide. Second is synchronization: a subscription synchronizes data and metadata to compute targets, as well as propagating schema changes or performing data backfills. We are still continuously optimizing the parts where synchronization involves large backfills or the modification of tables with a high number of columns.
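As a rough illustration of what such a manifest can look like, the sketch below follows the manifest format Redshift documents for COPY and Spectrum: a list of file entries with their S3 URLs and content lengths. The talk mentions ARNs, whereas the published spec uses S3 URLs, which is what is shown here, and the real Andes manifests presumably carry additional metadata (primary keys, merge semantics) beyond this; bucket and file names are made up.

```python
# Build a manifest in the format Redshift publishes for COPY / Spectrum.
# All bucket names, paths, and sizes here are hypothetical.
import json

manifest = {
    "entries": [
        {
            "url": "s3://example-andes-bucket/orders/delta/part-0000.parquet",
            "mandatory": True,
            "meta": {"content_length": 52_428_800},  # required for Spectrum
        },
        {
            "url": "s3://example-andes-bucket/orders/delta/part-0001.parquet",
            "mandatory": True,
            "meta": {"content_length": 48_103_219},
        },
    ]
}

# Writing the manifest next to the data lets a Redshift COPY (with the MANIFEST
# option) or a Spectrum external table pick up exactly this set of files.
with open("orders-delta.manifest", "w") as f:
    json.dump(manifest, f, indent=2)
```

Because the same spec is understood by Redshift, the list of files in a delta can be consumed consistently by both the COPY path and Spectrum.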
All right, let's look at the synchronizer targets. Why did we go with Amazon Redshift? In the legacy data warehouse we saw a powerful pattern among our users: coming from diverse technical backgrounds, they were able to sit down, write queries, and get the data they wanted from our legacy system. Obviously they had to have the business acumen around the data sets, know how to write the SQL, and understand the data. We wanted to preserve this, and Redshift gave us the power to support that level of consumer requirement. We do support ETL fleets; we know that the difference between an ETL and a non-ETL query is a bit fuzzy, but ETL queries scan large amounts of data and are expensive everywhere, whether you run them on Redshift or on EMR, so along with the transform queries we also supported ETL queries. Then there is the integration with Glue that I talked about earlier. Software engineering teams within Amazon are comfortable enough to work with EMR, which requires a bit more knowledge beyond just writing SQL: you have to understand the clusters, the data skew, and the administration. With the option of Glue, software engineering teams had the ability to act on data directly using EMR. We had a team within Amazon that provided shared EMR fleets with orchestration services to make it easier for consumers to use Glue; this was fantastic, and we later merged with this team. That service is now considered a corporate enterprise resource for running shared EMR fleets, and we've seen great adoption of it as well; I'll touch upon it a little more in my next slide. We also integrated with Spectrum, and it opened a wide range of possibilities where users could query large data sets without even having to worry about storage on Redshift.

So, I was talking about shared EMR fleets on my previous slide; what do we actually provide here? Users have the ability to run queries on Spark, via Scala or Spark SQL, acting directly on Andes, thereby creating orchestrated DAG workflows on EMR. Users can build heterogeneous data pipelines and perform data aggregations to bring data to Andes and eventually to their Redshift cluster. Users can also make the data accessible to the analytics stack on AWS, and perform ML on Spark using Scala (a small sketch of such a job follows).
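To make the shared-EMR-fleet pattern concrete, here is a minimal PySpark sketch; the talk mentions Scala or Spark SQL, and Python is used here only for consistency with the earlier examples. It assumes an EMR cluster configured to use the AWS Glue Data Catalog as its metastore, and the database, table, column, and bucket names are hypothetical.

```python
# A minimal Spark job reading a Glue-cataloged table and writing an aggregate
# back to S3. Names are hypothetical; this is an illustration, not Andes code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("andes-orders-daily-aggregation")
    .enableHiveSupport()  # resolve tables through the Glue Data Catalog on EMR
    .getOrCreate()
)

# Read a Glue-cataloged table that the Andes synchronizers keep up to date.
orders = spark.table("andes_example_db.orders")

# A simple aggregation: daily revenue per marketplace.
daily_revenue = (
    orders
    .where(F.col("order_status") == "SHIPPED")
    .groupBy("marketplace_id", F.to_date("order_datetime").alias("order_date"))
    .agg(F.sum("order_amount").alias("revenue"))
)

# Write the result back to S3 as Parquet, where it could be registered as a new
# data set and propagated onward, for example to a Redshift cluster.
(
    daily_revenue.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-andes-bucket/aggregates/daily_revenue/")
)
```

A job like this would typically run as one step in an orchestrated DAG, triggered once the completion service signals that the upstream Andes data set has been refreshed.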
Now let's look at some of the metrics we captured during this migration. We found that purpose-built databases allowed Amazon.com to choose the right tool for the right purpose, and that one size no longer fits all. Under cost reduction, we adopted the pay-as-you-go, pay-for-what-you-use model, and teams reported 40 to 90 percent operational cost savings. Under operational excellence, we reduced the peak scaling effort by a factor of ten and reduced the admin overhead with the use of AWS managed services. Under performance efficiency, with the adoption of Redshift for running queries faster and Glue and EMR for acting on data directly, teams reported latency improvements of 40 percent at two to four times the load.

Let's look at what we are going to work on in the future. Serverless technologies: we are looking to leverage more and more managed services for both upstream and downstream purposes. Service discovery framework: we'll be focusing on providing service-to-service authentication and making services available for folks in different or less popular regions. Under compliance, we're looking to automate the work teams must do to remain compliant with different data regulation requirements. Under optimization, we recently transitioned from a centralized Redshift cluster model to a bring-your-own-Redshift-cluster model, and we are continuously striving to optimize compute and storage and provide cost accountability. For machine learning, we are looking to make Andes interoperable with more and more AWS ML services in the near future.

That ends my talk for today. Here are some resources I'd like to share on similar sessions we've done in the past, and I'd also like to share this link to the AWS migration playbooks, which may come in handy for your database migration projects. Thank you once again for joining us today. Stay safe and stay well.
