Speeding Up Research in Genomics (Cloud Next '18)
Thank you so much for coming to the session Speeding Up Research in Genomics. My name is Jonathan Sheffi; I'm the product manager at Google Cloud for biomedical data, including genomics, gene expression, and so on. I'm really excited about our panelists here today: we've got panelists from Duke, Michigan, and Stanford. Before we get to their talks, I'm going to tell you a little bit about the team that I'm a part of, a couple of important announcements in case you missed them yesterday, and a little about what my team is working on.

I'm part of a team called the Cloud for Healthcare and Life Sciences team. We are a set of product managers, engineers, solution architects, and go-to-market folks focused on how we can help healthcare and life sciences build things on Google Cloud that our industry will find useful and that increase the scale of their work. Within the product team we're focused not just on genomic data, and not just biomedical data, but everything. How do we make medical imaging data more useful, for example through our DICOM API, which helps you structure imaging metadata in BigQuery? Or take our clinical team: you may have heard this week that our Healthcare API launched to alpha, helping provider systems more effectively manage their clinical data.

One of the things that's really exciting about this kind of work is what it enables us to unlock as more and more data comes to the cloud. The other announcement you may have heard, in Diane Greene's keynote yesterday morning, was our new partnership with the National Institutes of Health. Under this really groundbreaking partnership, we will be making many of the high-value public and controlled-access datasets funded by the NIH available to users of Google Cloud. This will include many of the datasets you're already working with, and you will no longer need to worry about finding space for them in your data center.

Actually, a story about this. A few years ago I was sitting at a research university that shall remain nameless, and in the meeting there were three PIs at one end of the table and the head of research computing at the other end. Over the course of the discussion, as we learned about the work they were doing, it turned out that each of the PIs had separately downloaded their own copy of 1000 Genomes. I see some smiles of recognition; this may have happened to you, or you may have seen it happen. I looked at the other end of the table and saw a research computing director whose face was basically melting, realizing how much data was being duplicated and stored in their data center. These are some of the problems we're trying to solve: democratizing access to data for researchers, but also solving these operational problems for research IT leaders at universities and other folks doing biomedical research. It also sets us up to build some really fantastic products, so let me tell you about one from my own team.

One of the things we've built is called Variant Transforms. It's an open-source tool; you can find it on GitHub. Not only is the code online, and not only are the docs online, but our roadmap is online too, so please feel free to vote on the issues. What Variant Transforms does: it lets you take your VCF data, your processed genome data, and import it directly into BigQuery. BigQuery is an incredible tool, a managed data warehouse solution; one of my colleagues mentioned that it processes something like 13 copies of the web every day. So it's an incredible data warehouse, and we're setting it up so that you can use BigQuery to manage your variant data.

What does this mean in practice? I want to tell you about a company called Color Genomics. They're right here in the Bay Area; they're a health services company providing affordable genetic testing focused on breast cancer. You may have heard of them. They've got a 30-gene panel in cancer, and they also had, separately, an existing database of phenotypic data, health information, and so on about all the samples that have come through their lab. So they've got the VCF data, and they've got this Postgres database of phenotypic information. What can we do here? There's a genotype-to-phenotype problem we could probably investigate. They looked at existing cloud providers that will remain nameless, that are not Google Cloud, and found that those providers really didn't have the data warehouse solution they were looking for to handle this kind of problem. So they used Variant Transforms, the tool I mentioned, to import all of that genomic data directly into BigQuery; they used the BigQuery import tool to bring in all of the phenotypic data; and their data scientists were able to stand up a set of machine learning tools on top of that integrated, joined dataset.

So what happened? They were able to do all the feature engineering in one week, actually during their quarterly company hack week when they get to work on whatever they want, and they were able to create models that dramatically outperformed an existing clinical model. They were actually able to show that this 30-gene cancer panel could predict a biomarker not related to cancer, which is a publishable result, pretty exciting, and possibly a product opportunity for the company. More importantly, it really transformed them from thinking of themselves as a lab company that just makes VCFs, much like the informatics cores at your own institutions, to thinking of themselves as a data organization that's really doing data mining. And that's because of Google Cloud.
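To make concrete what Variant Transforms does, here is a toy sketch of the core transformation: flattening a VCF record into a nested, BigQuery-style row with one entry per sample call. This is illustrative only, not the actual tool's code; the real tool also parses INFO and FORMAT fields, generates the table schema, and shards the import, and the field names below merely echo its general shape.

```python
# Toy sketch of a VCF-to-BigQuery import: each tab-separated VCF data
# line becomes one nested row keyed by locus, with one "call" entry
# per sample. Not the real Variant Transforms code.

def vcf_line_to_row(line, sample_names):
    """Convert one VCF data line into a BigQuery-style nested dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    calls = []
    # Per-sample genotype columns start at index 9 in VCF; we assume
    # plain "GT" values here for simplicity.
    for name, gt in zip(sample_names, fields[9:]):
        calls.append({"sample": name, "genotype": gt})
    return {
        "reference_name": chrom,
        "start": pos - 1,  # VCF positions are 1-based; use 0-based starts
        "reference_bases": ref,
        "alternate_bases": alt.split(","),
        "calls": calls,
    }

row = vcf_line_to_row(
    "chr1\t10177\trs367896724\tA\tAC\t.\t.\t.\tGT\t0|1\t1|1",
    ["NA12878", "NA12891"])
```

Once rows like this land in a BigQuery table, the genotype-to-phenotype join described above becomes an ordinary SQL join against the phenotype table on the sample name.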
All right, you didn't come just to listen to me, so I want to introduce our fantastic panelists. We will be hearing from Alex Waldrop and Razvan Panea from Duke University, then Jonathan LeFaive from the University of Michigan, and finally Sowmi Utiramerur, I think I got that right, from Stanford University. I'm so excited to hear what they've got to share with us. To lead us off, Alex and Razvan.

Thank you, Jonathan, for that introduction. Today I want to talk to you all a little bit about what our lab is doing in scaling up to the challenge of cancer genomics with Google Cloud, and about a software package that we've created called CloudConductor.

I want to tell you something about the state of cancer research that you may not know: although we've made tremendous progress in the past 50 years in understanding how cancers arise and how to treat them, we've only scratched the surface of understanding the true diversity of the known cancer types in existence. For instance, the World Health Organization currently recognizes over 1,000 cancer types, and we have good data on maybe 200 of those. So the next decades will be about filling in those knowledge gaps, and part of that is cancer genomics, where the state of the field actually lags even further behind. Take for example The Cancer Genome Atlas, TCGA, which Jonathan just mentioned: in that original study it took them eight years to sequence 10,000 tumors, but from just 13 different cancer types. So we have good sequence and genomic data on less than 1% of the known cancers.

Our lab's goal is simple, and that is to systematically study the genomics of all cancers. Now, before you laugh so hard that you fall out of your chair, I want to make a claim that's going to be even more audacious: our goal is to do that within the next five years.

We're making progress toward that. This past year, in a pilot study published in Cell, we looked at 1,000 tumors from a single lymphoma subtype called DLBCL. Eventually the goal is to scale up to 100,000 tumors sequenced, with each of those 1,000 cancers represented. How are we making strides on that? First off, we did the hard work and founded a consortium of over 30 leading academic hospitals from around the world. They're helping us gather patients and enroll them in the studies, so that we can have the diversity of cancer types represented in the dataset. We've also created a collaborative cloud-based system for collecting the metadata on those tumors, so we can store things like pathology data and pathology reviews. Eventually we'll be able to link that metadata with the genomic data and understand how patient outcomes are influenced by patient genotypes.

The more tangible goal we'll be reaching this year is sequencing 10,000 blood cancers from over 100 different lymphoma subtypes. Why blood cancers? For one, they're number one in terms of cancer diagnosis and treatment across cancers. For two, with over a hundred types, they're extremely diverse, so we'll be knocking out about ten percent of our goal of all cancers just by sequencing this one category. We've already enrolled more than 10,000 patients in the study, collected more than 10,000 tumor samples, and begun exome and whole-transcriptome sequencing on this project.

But the bigger question is: how can a group of four bioinformaticians do in four months what it took TCGA, with hundreds of bioinformaticians, almost eight years to do, and sequence ten thousand genomes? To understand how we're saving that time, it helps to talk about why it is so time-consuming to sequence a genome in the first place. It would be really great if a DNA sequencer were a machine that picked up a genome and let you read it like a book. That machine doesn't exist. In practice, a DNA sequencer is a little more like a shredder that you put the book into: it rips it into a billion pieces that you get back and have to assemble later.
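To see why the shredded book is expensive to put back together, here is a toy greedy assembler that repeatedly merges the pair of fragments with the longest suffix-prefix overlap. The fragments are made up, and real assemblers and read mappers are vastly more sophisticated; this sketch just shows the flavor of the problem, and even this naive version does quadratic work on every merge.

```python
# Toy illustration of the "shredded book" problem: greedily merge
# fragments by their longest suffix/prefix overlap until one string
# remains. Real genome assembly works on billions of reads and uses
# far more sophisticated algorithms.

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def assemble(fragments):
    pieces = list(fragments)
    while len(pieces) > 1:
        # Find the ordered pair with the largest overlap and merge it.
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(pieces)
                      for j, b in enumerate(pieces) if i != j)
        merged = pieces[i] + pieces[j][k:]
        pieces = [p for n, p in enumerate(pieces) if n not in (i, j)]
        pieces.append(merged)
    return pieces[0]

print(assemble(["GGTAAC", "AACCTT", "CTTGA"]))  # reassembles "GGTAACCTTGA"
```

Note that the `max` line alone compares every fragment against every other fragment; with a billion reads that is why the "fifteen or twenty third-party tools" mentioned next exist, each attacking a slice of this problem.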
So we need to read the story back out of that bag of a billion pieces, and by reading the story, in our case, I mean identifying a variant that may be underlying a cancer. In any case, the process of assembling a bag of a billion slices of paper back into the original book, or reassembling a genome, is very computationally expensive and operationally complex. It usually involves fifteen or twenty third-party tools strung together in what we call a bioinformatics pipeline, where you're piping the output of one program into the input of another. The tools have different versions, and each of them has its own runtime environment; it's a real pain to get these things going. From a computing perspective, the ad hoc approaches that people generally use are not reproducible, they are time-consuming, and they're very error-prone. That's why we created, and are using, CloudConductor: to help solve those challenges on our way to the ten thousand, and even the hundred thousand, genomes goal. To talk about how CloudConductor is doing that, I'm going to turn it over to my colleague Razvan Panea.

Thank you, Alex. CloudConductor is a cloud-based bioinformatics workflow management system. We developed CloudConductor to address the four central challenges in biocomputing. The first challenge is diversity: analysis pipelines in research use different bioinformatics tools in different orders, so we implemented a system that allows the researcher to run any pipeline. Secondly, scalability is a problem in bioinformatics, as input sample sizes are continuously increasing while institutional computing resources are limited; fortunately, cloud computing platforms such as Google Cloud are exactly the solution to this problem. Furthermore, the total cost of processing is also a problem; however, we managed to address this by using preemptible instances on Google Cloud. And finally, portability is a very serious problem, because data analysis needs to be reproducible; we use Docker technology to containerize the tools in the pipelines, so the researcher can run the same pipeline anywhere without worrying about it affecting the final output.

Having addressed these challenges, CloudConductor has become a comprehensive workflow management system that works in three big phases. In the first phase, the researcher selects from a list of predefined and user-defined bioinformatics tools to generate an analysis workflow, after which CloudConductor validates and interprets this workflow. In the next phase, CloudConductor obtains these tools and runs them in the correct order, allocating the necessary resources on a cloud computing platform such as Google Cloud and generating a processing environment using the Docker system. After the analysis is complete, CloudConductor transfers the analysis statistics to the database and the final output to a cloud storage system such as Google Cloud Storage.

Now, in order to enable our lab to use CloudConductor, we implemented an additional infrastructure layer that fully automates this entire system. The system starts from the moment the sequencing data enters our lab. At the click of one single button, we update our lab database with the new sequencing information. Almost immediately after that, the CloudConductor daemon, a service we implemented that continuously watches the database, identifies the new addition and generates a new instance of CloudConductor on Google Cloud Platform. After the analysis is complete, CloudConductor reports the status of the analysis to the lab database, and the final output files are transferred to Google Cloud Storage. As you can observe, the overall system is not only fully automated but also reproducible.

As a proof of concept, we used an early version of this system to process a large cohort of a thousand lymphoma patients, which was published in Cell, and in the future we're planning to use this system to increase the sample size not just ten times but even a hundred times. Before I close, I would like to acknowledge our lab, including the core developers of CloudConductor. In conclusion, CloudConductor is helping our lab create some of the largest cancer datasets in the world. Using the right tools, cloud computing helps small labs like ours do data analysis at the largest scale, and using these systems we're able to get faster breakthroughs, develop better treatments, and get closer and closer to curing the diseases that we study. We'll take questions later. Thank you.

And now I'd like to introduce Jonathan LeFaive from the University of Michigan.

Thank you, and good morning everyone. I suspect most of you have heard the term precision medicine. The idea is to provide medical treatments and screenings that are tailored to an individual's unique characteristics. Take high cholesterol, for example. LDL cholesterol, often referred to as bad cholesterol, can build up in your arteries, resulting in heart disease. The usual cause is unhealthy lifestyle choices pertaining to diet and exercise, but sometimes it can be inherited. One gene responsible for this is called PCSK9. This gene produces a troublesome enzyme of the same name, which binds to naturally occurring LDL receptors, preventing those receptors from removing the bad cholesterol. After this was discovered, PCSK9 inhibitors were developed: drugs that neutralize the enzyme, affecting cholesterol levels. But the treatment won't be equally effective for everyone; rare mutations in this gene can either make it overactive or knock out its activity altogether. By looking at a patient's DNA, though, it's possible for doctors to determine whether or not these inhibitors would be the most effective treatment for their patient.

This is an example of a genetic effect that we do know about, but there's still plenty to be learned before precision medicine can reach its full potential. The National Heart, Lung, and Blood Institute is sponsoring a project called TOPMed that is focused on closing this knowledge gap. But for projects like this to study DNA, they must first collect DNA. A traditional method for this uses what's called a microarray. It targets small subsets of the genome, supplying researchers with tens to hundreds of thousands of genetic markers to analyze. This technique is relatively inexpensive, which makes it attractive for use in both research and consumer services like 23andMe and Ancestry.com. And while this method is useful for targeting common variations in DNA, it lacks the scope needed to find rare associations like those in PCSK9. For this we need to use a method called whole genome sequencing, which looks at the entire genome. In this process, a lab physically sequences short, accurate reads of DNA. These reads are then algorithmically aligned, or mapped, to a reference genome, with a lot of overlap, or depth, to ensure that what we're seeing in the data isn't due to random error. The TOPMed initiative is employing this technique with 130,000 individuals and has produced datasets of up to 600 million variants; in other words, 600 million potential avenues for discovery. Forty percent of these are extremely rare and have likely never been seen before. Only with this level of information can we study how rare mutations are associated with disease.

Now at this point some of you may be thinking: well, that's great, but what does this have to do with Google Cloud?
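The depth idea mentioned a moment ago is simple to picture: count how many aligned reads cover each position of the reference. In practice this comes out of BAM files via tools like samtools; the sketch below, on made-up read intervals, just shows the underlying arithmetic.

```python
# Sketch of sequencing "depth": how many aligned reads cover each
# reference position. Real pipelines compute this from BAM files;
# here reads are just (start, end) half-open intervals.

from collections import Counter

def coverage_depth(alignments):
    """Per-position read depth for a list of (start, end) intervals."""
    depth = Counter()
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

reads = [(0, 5), (2, 7), (4, 9)]
depth = coverage_depth(reads)
# Position 4 is covered by all three reads, while positions 0 and 8
# are each covered only once: a variant seen at depth 1 is far more
# likely to be a random sequencing error than one seen at depth 30.
```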
Going forward, I'm going to talk about the technical challenges of whole genome sequencing, how and why we leverage Google Cloud Platform to surmount those challenges, and how the cloud is helping us foster collaboration in research.

Here's how whole genome sequencing breaks down in terms of data size and compute. At the scale of TOPMed, it produces approximately 3 petabytes of highly compressed sequence data, and mapping these reads takes around 6,000 core-years of compute. That is a long time to wait, and we haven't even gotten to the science yet. At the University of Michigan, we are warehousing, mapping, and generating aggregate datasets of TOPMed data so that the genomes can be used in association analyses. Using our local cluster, which has a few thousand cores, it would take years to tackle this amount of sequence data. But with the seemingly limitless resources of Google Compute Engine, we can compress a couple of years of data prep into a few months. This means researchers can start analyzing the data and making discoveries much sooner.

This convenience does come at a cost, and researchers tend to have limited budgets, but there are a couple of features, beyond simply having competitive prices, that make Google Cloud Platform stand out in this regard. Unlike other cloud providers that restrict users to predefined machine types, Compute Engine is completely modular, allowing resource allocations that meet the exact needs of a given task. These fine-tuned controls let those of us with limited budgets maximize every penny. Preemptible machines can take those cost savings even further: as many of you already know, GCP offers its idle resource reserves at a fraction of the cost, but with the caveat that they can be preempted, or taken away, at any moment. In order to reduce the time lost to preemption, we have altered our pipelines to be more resilient. We've done this by chunking our compute jobs into smaller pieces and checkpointing the output files into Cloud Storage. With this approach, we can still make progress even when preemption occurs.

Cloud Storage has also been serving another purpose. TOPMed is a collaboration of approximately 30 studies across many institutions. Each study focuses on particular activities and diseases, such as asthma, sickle cell disease, atrial fibrillation, and heart disease. Making these discoveries requires a diverse knowledge base, so genomic datasets like these have traditionally been distributed to the experts to be analyzed on their local clusters. But this is becoming more and more difficult to handle as data sizes increase, especially for smaller labs. The National Institutes of Health is piloting an ecosystem using TOPMed data that follows a different approach: instead of sending the data to the scientists, we are bringing the scientists to the data. It's called the NIH Data Commons, and it will empower researchers from institutions of all sizes to perform high-throughput computations within Google's platform. But even within this space, data access needs to be regulated. When DNA is donated, it is often given with specific consents that determine what it can be used for, and these consents vary widely from study to study. The Data Commons is building frameworks for access control.
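The chunk-and-checkpoint pattern described above for surviving preemption can be sketched in a few lines. This is a hypothetical minimal version: a local directory stands in for a Cloud Storage bucket, and a trivial sum stands in for the real compute step.

```python
# Sketch of the chunk-and-checkpoint pattern for preemptible VMs:
# split the work into small chunks and persist each finished chunk,
# so a preempted job resumes where it left off instead of restarting
# from scratch. A local directory stands in for a storage bucket.

import json
import os
import tempfile

def process_chunk(chunk):
    return sum(chunk)  # stand-in for a real compute step

def run_with_checkpoints(chunks, checkpoint_dir):
    os.makedirs(checkpoint_dir, exist_ok=True)
    results = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(checkpoint_dir, "chunk-%d.json" % i)
        if os.path.exists(path):
            # Finished before a previous preemption; reuse the result.
            with open(path) as f:
                results.append(json.load(f))
            continue
        result = process_chunk(chunk)
        with open(path, "w") as f:  # checkpoint before moving on
            json.dump(result, f)
        results.append(result)
    return results

work_dir = tempfile.mkdtemp()  # stand-in for gs://bucket/checkpoints
totals = run_with_checkpoints([[1, 2], [3, 4], [5]], work_dir)
# If the VM is preempted and the job reruns with the same directory,
# finished chunks load from their checkpoints instead of recomputing.
```

The smaller the chunks, the less work a preemption can destroy, at the cost of more checkpoint writes; the sweet spot depends on how long a chunk takes relative to how often preemption strikes.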
These frameworks sit on top of the existing Cloud Storage APIs, and we're building them to ensure that those consents are respected. They range from programmatically generating signed URLs to FUSE file system implementations that turn buckets into a mounted disk. Google Cloud's container support is another great feature for this environment. Scientists love containers, and not just because they make it easy to deploy complicated pipelines, but because they make it easier for peers to reproduce your work, and reproducibility is critical in science. Google's private container repositories are backed by Cloud Storage, which allows container images to be pulled more quickly, and just like typical objects in Cloud Storage, access controls can be applied, allowing you to safely share images that contain sensitive health or genetic information.

So by leveraging the cloud in these ways, we can scale to unprecedented levels of data storage and compute and align genomes more quickly; we can bring researchers together to work on a level playing field in a highly collaborative environment; and in doing so, we can quicken the pace of medical progress. Thank you.

Up next we have the director of bioinformatics at Stanford University, Sowmi Utiramerur.

Thanks, Jonathan. Morning, everyone, I'm Sowmi. As any true academic would do, I have collected multiple titles, which I will not bore you with today, but in essence my role is to drive adoption of genomics in medicine. Jonathan already mentioned precision medicine, so what is precision medicine? Since I made pretty slides with lightbulbs, I'm going to reiterate the point. We take genomic sequencing data for a patient; we combine it with sequencing data from other patients with similar diseases or disorders, or even healthy individuals for that matter; we analyze them together with all publicly available annotations to come up with a precise diagnosis for that patient; and finally, we tailor the treatment not just to the disease but to the particular genetic makeup of the patient concerned. In essence, we are trying to be the Starbucks of medicine. At Stanford, our vision is to deliver on the promise of precision medicine for all our patients, and we do so primarily by using genomic sequencing. We have focused on childhood rare diseases, inherited cancer, and cardiac and neurological disorders.

So that's the promise of precision medicine, but what are the challenges? There are several, principal among which is that we have yet to understand the regulatory mechanisms of many genes and how their functions correlate at the pathway level. But there are more mundane, or fundamental, challenges we have yet to overcome, and principal among those, as Jonathan mentioned, is repeatability and reproducibility, which is paramount when you're using anything in a diagnostic setting. If you are diabetic and you take a blood test to measure sugar levels, and the instrument gives two widely different measurements at different times, you're not going to rely on it. Genetic tests are extremely complex, primarily because we use multiple instruments and hundreds of different analytical methods that run on different hardware platforms, whether a laptop, an in-house cluster, or the cloud, yet we are expected to reproduce the same result, not just on a given day but up to two years into the future. And the second biggest point, as the other panelists mentioned, is security. Anybody who has done anything on the cloud worries about data security, and this is extremely important when you put patient health information in the cloud; it essentially gives people fits and makes us lose sleep every night.

How do we overcome these challenges? Primarily with the help of Container-Optimized OS virtual machines from Google, and the HIPAA compliance and certification of all GCP components, which help us achieve this goal much more easily and remove the heavy lift of maintaining this hardware from small clinical labs like ours, which are, essentially, not experts in that.

Here is an example of how we have solved it at Stanford. You've already heard about Docker, which is essentially crucial for reproducibility. We have our own workflow manager; of course we have to have our own, right? We cannot just use others'. Ours is called Loom. It understands the workflow language, pulls in the Docker container that you need for each execution step, and imports all the files that are needed, hashing them to make sure the files have not changed; if they have, it warns that the workflow is not reproducible. At the end of the execution, it shuts down the VM and copies all the files back to the specified storage. This way we can ship the container and the workflow definition to any institution, and they can reproduce the results.

But let's get to a bigger problem, and the bigger problem is that for any genetic test, the average diagnostic yield is around 30%; of course for certain diseases it is much higher than 30, and for certain diseases it is lower. What do we mean by that? Of all the patients who walk into our lab with a genetic disorder, we are able to give a precise diagnosis for, on average, only 30% of the cases, which we have to improve upon. So I'll give you two specific patient case examples in the next couple of slides to highlight the importance of adopting novel sequencing methods, analytical methods, and data mining and data sharing in order to improve upon that diagnostic yield.

The first example is a patient case of Carney complex with cardiac myxoma. Carney complex leads to benign tumors in different tissue types, primarily skin, but when it gets to the heart it leads to weakening of the valves and ultimately requires a heart transplant. In this patient case, we had a suspicion that this was a genetic disorder, and we know the gene that causes it, but when we sequenced it we could not find the change that was causal in this patient. So in addition to the Illumina short-read sequencing, which we typically do for all patients, we used long-read sequencing from PacBio, which led us to identify a long deletion, a two-exon deletion, in this gene. It is deleted in one copy of the gene but not the other, which makes it much harder to find with short-read sequencing, and we had to develop novel analytical methods in order to analyze this data. This is where the contributions of Google AI and Google genomics, like DeepVariant from Mark DePristo's team, are very critical for us, advancing this process of identifying novel genetic variants that we were not able to find otherwise.

The second example highlights the importance of data sharing and data mining. This is the case of two siblings affected with several neurological phenotypes: primarily delayed intellectual development and cerebellar ataxia, which is essentially losing control of your muscles, with atrophy of the brain itself. We sequenced these patients and found a mutation in a gene that was implicated in brain development in mouse models but had not been seen in human patients, so we were unsure whether it was the real causal mutation or not. Purely by chance, we identified, in the sidelines of a conference like this one, another patient who had the same symptoms and the same mutation in that gene, which led us to conclude that this is the true causal variant for these patients. That ended a ten-year diagnostic odyssey for this family. In most of these cases the diseases are not curable, but just finding the cause of the disease is so much help for these patients; there's hope, at least, that there will be a cure for them in the future. So data mining and data sharing are really important, because these are rare diseases where only a handful of people and a handful of mutations are found in genes like this.

Finally, I would like to conclude by raising the importance of platforms like Google Cloud, and of contributions even outside their core business: the contributions made by Google AI and Google genomics help drive genomics research forward and let us adopt it in precision medicine a whole lot faster. With that, I'd like to thank everyone and invite Jonathan back for Q&A. Thanks so much.

All right, thank you all.