AWS re:Invent 2020: AstraZeneca genomics on AWS: A journey from petabytes to new medicines

Hello everybody, and thank you for joining us today. Today we'll be talking with AstraZeneca about their genomics pipelines on AWS, focusing on the journey from petabytes to new medicines and how their genomic processing supports that effort. I'll be speaking as a genomicist and bioinformaticist at AWS, part of worldwide business development supporting the genomics industry, and I'll be joined by Slavé Petrovski, Vice President and Head of Genome Analytics and Informatics at AstraZeneca's Centre for Genomics Research. Slavé is responsible for defining and implementing the genomics initiative's scientific and technical strategies. He leads a diverse team of bioinformaticians, statisticians, clinicians, clinical informatics developers, and architects to build the capabilities required to apply conventional and cutting-edge statistical approaches on top of population-scale sequence data, providing human genetics support across the AstraZeneca pipeline. Today we'll start with Slavé's introduction to the genomics initiative and mission at AstraZeneca, followed by a review of the technical and business challenges, and then I'll speak about the project status and outcomes as well as an architectural deep dive. So, Slavé, over to you.

Thank you for that excellent introduction, Lisa. It's a real pleasure to be joining you for today's session. What we're going to focus on is the capabilities we've had to build within AstraZeneca to ensure that we maximize the value of our investment in genomics. Genomics is particularly important within AstraZeneca because we've now embedded it into our pipeline and use it to transform the way we approach drug discovery and development. This includes guiding key business decisions throughout the pipeline. In addition to the opportunities to identify the best human-validated drug targets, we're also using genomics to advance our existing drug programs. This includes a better understanding of disease mechanisms and biology, and it includes stratifying our patient populations to better identify the molecular subtypes, the individuals that may respond better to our medicines on the basis of the underlying genetic cause, a concept commonly referred to in the community as precision medicine.

To capture these insights, we need to screen petabytes of human sequence data from large populations to robustly identify the relevant genomic signatures driving these differences in how patients respond to treatments and the differences in outcomes. We then partner with our translational and clinical expert teams to maximize the power of human genetics to help guide the design and development of existing and future therapies.

We collect these large populations via multiple sources. One of them is our own clinical trials, sequencing participants from AstraZeneca's clinical trials. We also collaborate with international academic experts who have shared interests in particular disease or therapy areas. The third main source of the genomics data is a large pre-competitive biopharma consortium; the value here is that each of these peer biopharma companies contributes to the cost of generating the sequence data, and where we differ is in how we process and analyze those data to derive new biological insight, which forms the seeds for potential new medicines.

Before I go into the updates on where we are today, I thought it would be useful to reflect on how we got to where we are.
In particular, it's worth looking at the landscape when I joined the company a few years ago. At the time we had very powerful on-prem compute resources. However, we knew very early on that the genomics data we were going to be receiving from the various sources would come in large bursts; it wasn't a constant stream. We knew that at certain periods we'd get hundreds of thousands of exomes, and the space and speed at which we could process them and make them available for analysis was critical. We also looked around and saw that we had quite a few POCs and some internal implementations for our informatics and analytics tools. However, it was quite clear that to get the best value out of those tools we needed to build something internally, and the reason is that the science evolves constantly in genetics: the best way to analyze and process data today may not necessarily be the best way to process it in two or three years' time. Making sure we can adapt and evolve with the emerging science was a key part of our decision to build the informatics and analytics capabilities internally. We also had a lot of bioinformatics talent in the company that was responsible for moving data from one step to the next in the bioinformatics pipelines. This didn't feel like the best use of our people and talent, so this was an area we really wanted to automate for the greatest operational efficiency, so that we could use that talent to focus on developing new tools and exploring machine learning and advanced analytics frameworks to better derive insight from the data, rather than on the processing part, which is something we can automate and which is lower value than the science.

If I talk you through where we are today, in the top right-hand corner: we have processed to date, since 2016, up to 500,000 exomes and genomes. This equates to about 3.5 petabytes of raw, uncompressed genomics data, and all of these half-million exomes and genomes come with rich clinical outcomes and phenotypic data on those participants. To emphasize the performance of our analytics frameworks, we can now run over 12 billion statistical tests, studying the effects of individual mutations or individual genes against a broad range of phenotypes, in under 24 hours. This has already had a big impact internally. In a month's time we'll hit our two-year anniversary of going live with our internal capabilities, and we've already contributed multiple novel targets into the AstraZeneca pipeline that are being pursued and looked into. One of the reasons I'm quite excited about these targets is that most of them are for patients with unmet clinical need, patients for whom current therapy is less than ideal and where there are real opportunities for transformations in their care. We've also sequenced and provided genetics research input into over 42 of our AstraZeneca clinical trials using these capabilities, and we're quite excited about publishing our findings and new science. This is best reflected by our 15 high-impact papers in the past two years since going live, including in top-tier journals like The Lancet, the New England Journal of Medicine, and Nature.

So if we take a high-level view of what we wanted to achieve, what we've achieved, and what's next, we have that in the bottom panel.
You can see we set out on this journey really wanting a cloud-based, automated solution that enables us to process the genomics data at pace, at scale, and in bursts, and we wanted to do this at a fraction of the cost it was costing us at the time through external informatics providers. Where we are today, I'm quite comfortable saying that, at least to our knowledge, we have one of the fastest and most operationally efficient genomics bioinformatics pipelines in the world, and we know for a fact that we are processing our sequences at less than 10% of the cost it was prior to building our own internal capabilities. Where next? Well, the dataset is going to continue to grow. We're about a quarter of the way through our journey to our 2 million genomes goal, which is set for 2026, and as we progress towards this we also need to care for the fact that we're doing more genomes. To put this into context, exome versus genome, a genome has about 50-fold more information in it than an exome does, so more and more we're transitioning to these larger data types. We'll also continue to integrate new tools into our pipeline, evaluating the existing ones and seeing where there are opportunities to update them, while looking at what is emerging, both from internal research and development and from external publications, as new opportunities for tools to better process and analyze these data. And importantly, we'll keep reinforcing cloud best practices, which is done in close collaboration with our colleagues and friends at Amazon Web Services.

If we look at the five key benefits we've achieved by moving our platform to the cloud and collaborating with Amazon Web Services, it would be these. First, scale: we can scale the resources as we need them, and this is absolutely critical because, as I mentioned earlier, we don't have a steady stream; we get large bursts, often hundreds of thousands of sequences approaching petabytes of data, that we need to process rapidly. Second, we need to be able to deliver massive compute power when new data arrives, which again often comes in bursts. Third, we want to save on costs, and I want to pause here for a second because I think this is an important point: what we're trying to achieve is not to reduce our overall cost; rather, we want to minimize the costs in the low-value bioinformatics processing that turns raw data into information, so that every cent, every penny we save in that step can be invested in the high-value scientific work, where we're applying advanced analytical tools and novel machine learning frameworks and our analysts have more resources available to do the fun science. Fourth, there is the richness of services provided by AWS that we can build from; we've already had situations where we could go from an idea to a solution rapidly by leveraging existing services and tools. And finally, the world-class architecture and technical consultancy to help leverage what we're building and achieve the scale and speed, all the while optimizing for cost where it's relevant. On that note, I'll pass it back over to Lisa.

Thank you, Slavé. To continue, I wanted to dive in and talk a bit about the AWS architecture and the orchestration behind processing the number of samples, at the size and scale, that AstraZeneca is handling for its genomic analysis. To start, I wanted to give a high-level overview of genomic secondary analysis and the processing pipelines.
There are a number of steps that have to happen in this pipeline architecture, starting from the sequencing data coming off the machine and being loaded into the cloud, then running the secondary analysis pipeline, which converts those raw files and produces the variants of interest used in tertiary and downstream processing, with QC steps and profiling done along the way. This can easily be generalized to other types of applications, as it really helps orchestrate the dependencies among these workflows and the number of processes that have to happen in between. At the top you can see the number of different workflows that need to run to take that raw data and turn it into insights the scientists can analyze and use, and in the lower portion of the slide you can see the number of steps: primary validation, getting and ingesting the files, and the processing steps along each of the ways. So there is a large amount of coordination that has to happen among these workflows to make the data useful.

To start, I want to work from the architecture layer and then move down into more and more detail. From this architectural overview, the process starts when the pipeline manager submits a work demand via a custom CLI, the command line interface. From there, the demand analyzer uses a Lambda function to route that demand, the start of that workflow, to a dispatcher based on the demand type, that is, depending on which workflow is being run. This also logs the demand into a request registry, so the system can track which workflow is running and its status along the way. The dispatcher Lambda then reads from its queue and invokes the Step Functions state machine for that demand type (a minimal sketch of such a dispatcher is shown below), and this is where the majority of the work happens. The Step Functions workflows invoke Batch, where each job runs its commands. The genomic workflow orchestrator has a utility bucket, as you can see in the center there, that acts as a scratch space for errors and other artifacts, as well as the genomic file storage, which has both a bucket and a database for object and catalog storage, including the metadata. As the jobs are processed, the demand registry table is updated as work progresses, and the status is published to an SNS topic upon completion of each of those jobs. If it's an error, the error info is collected and sent to the data manager to be logged, captured, and mitigated; if it's a success, the consumer submits the demand for the next step in the pipeline. Through this higher-level orchestration and use of Step Functions, the automated pipeline can move among the different workflows, capture and log the steps along the way, both in the scratch space and files and in the databases of status and reports, and automatically trigger the next steps within the workflow.

Within the workflows, the simple example on the right shows the scatter-gather approach, where multiple steps may be running at a particular time and depend on each other, so it can handle some very complex logic within the orchestration. The majority of this logic is captured within a JSON file, which carries the instructions, is stored in S3, and is then run within Batch.
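To make the dispatcher pattern described above more concrete, here is a minimal sketch of what such a Lambda could look like: it reads queued demands, starts the Step Functions state machine registered for that demand type, and records the execution in a demand registry table. The environment variable names, table name, and message fields (demand_id, demand_type) are illustrative assumptions, not AstraZeneca's actual implementation.

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")
dynamodb = boto3.resource("dynamodb")

# Hypothetical mapping of demand types to state machine ARNs, supplied via environment variables.
STATE_MACHINES = {
    "secondary_analysis": os.environ.get("SECONDARY_ANALYSIS_SM_ARN", ""),
    "qc_profiling": os.environ.get("QC_PROFILING_SM_ARN", ""),
}
registry = dynamodb.Table(os.environ.get("DEMAND_REGISTRY_TABLE", "demand-registry"))


def handler(event, context):
    """Route each queued demand to the Step Functions workflow for its demand type."""
    for record in event.get("Records", []):
        demand = json.loads(record["body"])
        demand_type = demand["demand_type"]            # e.g. "secondary_analysis"
        execution = sfn.start_execution(
            stateMachineArn=STATE_MACHINES[demand_type],
            name=demand["demand_id"],                  # one execution per demand id
            input=json.dumps(demand),
        )
        # Record the running workflow so its status can be tracked end to end.
        registry.update_item(
            Key={"demand_id": demand["demand_id"]},
            UpdateExpression="SET #s = :s, execution_arn = :arn",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": "RUNNING", ":arn": execution["executionArn"]},
        )
```

Keeping the routing table outside the code of the individual workflows is one way to preserve the loosely coupled, plug-and-play behaviour described in the talk.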
This is the next layer down, the execution and the work unit, and how Step Functions comes into play to manage that orchestration and the samples along the way through a loosely coupled and modular approach. This helps maintain flexibility in the code and eases maintenance and transparency for additional implementations. Within the work unit, for each job the JSON file includes instructions for the program, the inputs, and the expected outputs, as well as the demand history, which tracks the duration of steps and date-time stamps, really to start tracking who did what and when, which is very important for downstream logging, the error reports, and also the tool messages in standard output and standard error, to help create those job dependencies along the way.

Within each of these Step Functions workflows there are the Batch jobs, and this is the task execution layer: this is where the work is performed, the compute environment is defined, and the containers are provisioned. What you see on the right is the JSON structure of the files and the information used throughout each of those workflows and the tasks being performed, which helps with the plug-and-play nature of determining what jobs are performed, what parameters are used, and how to log that. What you see on the left-hand side is a high-level overview of the structure of that information. The name and value are stored in the demand registry to help track progress as the workflows run and move through the stages of the pipeline, whereas Batch is the work engine that provisions the container and the compute environment. It sends job instructions to the container; the commands go into a common entry point that allows for fetch-and-run within a Python wrapper, which iterates over the array of commands, again logging all of the labels in use in the demand history, so that you have plug-and-play analysis for each of those steps (a minimal sketch of such a wrapper follows below).

Within each of these steps there is a significant amount of data being provisioned and stored. There's the file store catalog, which captures each of the files generated through the Step Functions and the workflows and tracks all the files being created, as well as the project registry, which oversees the projects and workflows, and then the buckets themselves. There's the file store inventory, which is automatically created when you turn on versioning for the buckets, and the buckets have separate folders based on the different algorithms being run, whether it's short tandem repeats, call-level associations, or telomere length; whatever algorithms you're running, each will have inputs and outputs stored through the process and contained within those buckets. From the downstream processing you also have the QC metrics, which are important for validating the data that has been run, as well as the variant database, which is then used by the scientists downstream. So there is a significant amount of data being generated and processed over this time, with distributed information used and viewed in the way best suited to each of those storage types. But because this data is distributed, you still want to be able to look across those resources and view them in a single table for reporting.
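The fetch-and-run entry point mentioned above can be sketched as a small Python wrapper inside the container: it pulls the job's JSON instruction file from S3 and iterates over an array of shell commands, echoing tool output so Batch ships it to CloudWatch Logs. The field names (demand_id, commands) and the way the bucket and key are passed in are assumptions for illustration rather than the actual file layout.

```python
import json
import subprocess
import sys

import boto3

s3 = boto3.client("s3")


def fetch_and_run(bucket: str, key: str) -> None:
    """Fetch the job's JSON instruction file from S3 and run each command in order."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    job = json.loads(body)                       # e.g. {"demand_id": ..., "commands": [...], ...}
    for command in job["commands"]:
        print(f"[{job['demand_id']}] running: {command}", flush=True)
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        # Echo tool stdout/stderr so they land in the Batch job's CloudWatch log stream.
        sys.stdout.write(result.stdout)
        sys.stderr.write(result.stderr)
        if result.returncode != 0:
            sys.exit(result.returncode)          # fail fast so the workflow records the error


if __name__ == "__main__":
    # Bucket and key would be passed in by the Batch job definition, e.g. as container overrides.
    fetch_and_run(sys.argv[1], sys.argv[2])
```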
There are tens of millions of files and millions of transactions being run, so you want to be able to bring that data back into view, both for operational metrics and for scientific inquiry. From an operational perspective, you may be interested in which employee generated which file, at what time of day, and when the workflow started and ended. It helps capture what time the program ran and the number of variants output to the tertiary analysis database, and it allows you to do QC checks against the expected output. The quality checking of the pipeline can make sure the files were produced, say from the DRAGEN output, and stored in the variant database, so you have confidence in the processing and automation of this pipeline along the way. This type of tracking also helps support compliance, and the operational reporting through Athena, with queries on demand, allows you to easily retrace those steps (a minimal sketch of such a query follows at the end of this section). Aggregating this data within a data lake requires a series of Glue ETL processes to track and monitor it over time, so a series of jobs runs nightly and refreshes the data lake for constant updates, and a QuickSight dashboard provides the querying and visualization mechanism for a near-real-time view of what has been processed and the status of those jobs.

Something particular to the genomics processing here is the scientific output. For the data processing, each sample is often processed independently, with single-sample variant calling. But there are other scientific analyses and algorithms you would want to run that look across the database of processed samples, such as looking at relatedness, and that's what you're seeing in one of those additional boxes: you want to compare new samples to existing samples and do things like understand sample bias, diversity, and relatedness among what has been processed to date. This is useful for cohort creation, and it's good for the data scientists to be able to query, using Athena, for both the downstream analytics for the scientists and the operational metrics for the data managers. Ultimately, this helps support automating the data from the first step of the pipeline, coming off the sequencer, running through each of these pipelines and workflows, and getting to the scientists as quickly as possible, with logging and confidence along the way.

There are key considerations in creating this type of architecture and being able to accelerate and deliver at such scale. One is to consider that samples are processed in batches, which is what the first graph shows: of the approximately 500,000 samples run, looking at the sample counts by month, the samples processed range from roughly 50,000 approximately every three months and can spike up to around 150,000 in that more recent bar. So it's important to be able to scale up and manage that burstiness and spikiness in sample processing, to have data compliance at each step, and to keep transparency around costs really front and center as you see these types of influxes of data.
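As a sketch of the kind of on-demand operational query this reporting layer enables, the snippet below submits an Athena query over a hypothetical file store catalog table in the data lake; the Glue database, table, column names, and results bucket are all assumptions for illustration.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical Glue database/table and columns; the real catalog schema will differ.
QUERY = """
SELECT workflow_name,
       date_trunc('day', completed_at) AS day,
       count(*)                        AS files_produced,
       sum(size_bytes) / 1e12          AS terabytes_written
FROM genomics_lake.file_store_catalog
WHERE completed_at >= date_add('day', -30, current_timestamp)
GROUP BY 1, 2
ORDER BY day DESC
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "genomics_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/operational-reports/"},
)
print("Started query:", response["QueryExecutionId"])
```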
It's also important to be able to duplicate those results years from now, because the data processing changes over time and you want reproducibility over time as well. As those samples are processed, they also generate a large number of files per month: those half-million samples have generated about 13 million files, stored in that file store catalog. Each file can range from tens of gigabytes to hundreds of gigabytes per sample, depending on what you're processing and doing, and this goes from an uncompressed data size of 3.5 petabytes, scaled down for cost optimization to 1.2 petabytes of total file storage.

Again, this shows the need for understanding cost, compliance, and the processing over time. As you're processing all of this, there are also varied computational needs, which vary based on the genomic data type you're using as well as the tool. The graph you're looking at here, in blue and yellow, shows the number of Batch jobs per month in blue, with yellow being the approximate CPU minutes consumed. Being able to monitor, track, and automate the scaling of this will be important as you build this out into a production workload, as will being able to track who did what and when, so that you can backtrack and check and also scale for the future.

There are key considerations for running at this scale and managing that type of spikiness. I'll start on the left with throttling: consider the hard limits that exist for API calls. These may cause failures and generate errors that will be logged in CloudWatch, for things like Lambda calling Step Functions, entering data into DynamoDB, or publishing to topics and queues, and IAM may have additional limits, because each service has its own API limits. It's important to understand what those limits are, monitor them, and address them at the outset, rather than letting the service be what inhibits your scaling capacity.

For S3, it's important to look at the file sizes and movement. Again, these files can range from tens to hundreds of gigabytes, so you may need to create concurrent threads with one-gigabyte chunks to manage the file storage and movement, and monitor the KMS encryption on those files along the way (a minimal sketch of such a transfer follows at the end of this section). Ultimately, on the concept of throttling, you don't want AWS and its services to be throttling those jobs; you want to implement the throttling up front. You may start out hitting quota limits of, say, 300 samples per hour, but by taking the time to do those estimates you can provision the resources appropriately to run thousands of whole-exome sequences per hour. One way of doing that is to look into the service quotas, consider what the out-of-the-box limits are, and then use the AWS console to submit self-service increase requests. The instances are coordinated by AWS Batch, which limits the number of runnable and running instances at any time, and the best places to throttle are what triggers the Step Functions executions for the overall workflows and AWS Batch for the individual jobs, optimizing the balance of those limits. In that estimation for AWS Batch, you'll want to understand things like which instances are running concurrently, how many samples are being run per hour, and how many volumes will be attached to those instances, to optimize all of them for the number of samples you'll be able to run at a given time.

Within EBS, it's also helpful to enable auto scaling for the volume provisioning, so that you never run out of storage space as these containers are running. It's also helpful to think about the network connectivity. This architecture uses VPC endpoints as much as possible, wherever services are available through them. This gives better performance for S3 copying and keeps traffic on the AWS backbone instead of going over the internet, which also helps with the security requirements, and a customer proxy is not needed in this architecture with the VPC endpoints.
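As a concrete example of the chunked, parallel S3 transfers mentioned above, here is a minimal boto3 sketch that copies a large per-sample file using one-gigabyte parts, several concurrent threads, and KMS encryption on the destination object; the bucket and key names are hypothetical, and the settings should be tuned to your file sizes and instance bandwidth.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

GiB = 1024 ** 3

# Multipart settings approximating the "concurrent threads with one-gigabyte chunks" approach.
config = TransferConfig(
    multipart_threshold=GiB,    # use multipart for objects larger than 1 GiB
    multipart_chunksize=GiB,    # 1 GiB parts
    max_concurrency=10,         # concurrent threads per transfer
    use_threads=True,
)

# Hypothetical buckets/keys: a managed, server-side copy of a large per-sample file.
s3.copy(
    {"Bucket": "example-landing-bucket", "Key": "incoming/sample-123.cram"},
    "example-file-store-bucket",
    "exomes/sample-123/sample-123.cram",
    ExtraArgs={"ServerSideEncryption": "aws:kms"},
    Config=config,
)
```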
If you're not going to use VPC endpoints, it's important to note that there is no internet gateway within this VPC structure, so you would need to go through the customer proxy to reach the internet.

To summarize: as you consider moving to a production-scale version of these types of analyses, and of the architecture that AstraZeneca was able to bring together using Step Functions, Batch, and the service quota and throttling considerations above, you'll want to work with your solutions architects and technical account managers to review those limits and do things like Infrastructure Event Management, so you can review and understand the cascading events of these pipelines and have support for monitoring as it goes live. Being able to build this type of architecture to process and analyze millions of genomic sequences at reduced cost and with high performance is helping to identify novel insights, both within the R&D targets and the clinical trials that Slavé mentioned, and will ultimately help continue to inform that precision medicine application and improve patient outcomes. It's been a pleasure working with AstraZeneca to build out this pipeline with them, and we look forward to the kinds of impact we can have within the genomics and healthcare community. Thank you for your time.
