Cloud OnAir: Powering Genomics Research on Google Cloud Platform
Hi, and welcome to Cloud OnAir, live webinars from Google Cloud. We're hosting webinars every Tuesday. My name is Jonathan Sheffi; I'm the product manager for genomics and life sciences at Google Cloud. Today we're talking about powering genomics research on Google Cloud Platform. You can ask questions anytime on the platform; we have Googlers on standby to answer them. Let's get started.

A few topics we're going to talk about today. First, I'm going to tell you a little bit about how you can accomplish genomic and clinical data mining together using BigQuery. Then I'll tell you a little bit about how you can run batch workflows at scale using the Pipelines API. Then I'll hand it over to my friend Geraldine from the Broad Institute to tell you a bit about GATK4 and the FireCloud platform. She'll also tell you about how their team optimized GATK4 for Google Cloud. And then Keith from Google will tell you a bit about some of the really cool innovations that are happening right here at GCP.

But first I want to tell you a little bit about the team that I'm part of. This is a team that started at Google just last year; it's called the Cloud for Health and Life Sciences team. We've brought together product managers, software engineers, solution architects, and a commercial team to help organizations do all sorts of amazing things with health and life sciences data using Google Cloud. Whether it's an academic institution, a pharma company, a provider, a payer, or a health products or services company, we want to help you use any kind of health and life sciences data type in Google Cloud more effectively.

In particular, the product teams that I'm a part of are focused on a few key modalities: clinical data, so EMR data, EHR data, and other kinds of phenotypic data sets; imaging data, that's DICOM, PACS, PET scans, MRIs, X-rays; and my
team, the Genomics and Life Sciences team, where we're dealing with not only genomic data but all kinds of other biomedical data sets: it could be microbiome, RNA-seq, flow cytometry, and others. Why
we're doing this is that our vision is that there's going to be a convergence of these data types in the cloud, where for whatever sample you're working with, whether it's yeast or an entire human population, it'll be helpful for you to be able to analyze data across these different modalities together. And that's what we're enabling.

So let's talk a bit about what that can look like, especially coming from the genomics and life sciences point of view. One of the products from my team is called Variant Transforms. Variant Transforms is an open-source tool to help you manage your genomic data using BigQuery. You can think about it as a way to take your VCFs from panels, exomes, or genomes and import them directly into BigQuery using the inherent schema of the VCF you're presenting. From there, you can do things like calculate your allele frequency, build a cohort, or do ad hoc exploration of that data set. But I think we all know that where these kinds of data sets get really exciting is when we join them with other kinds of data. That's exactly what a company called Color in the Bay Area did, and I want to tell you a little bit more about that.

Color is a health services company. They offer very affordable and accessible genetic testing, and they have a 30-gene panel in cancer. They also had an existing database of phenotypic data, some health history data and report data, that was sitting in a Postgres database. The problem they had is that they wanted to be able to expand that test by mining the raw sequence data alongside the other data they had available on those samples. The cloud provider they had been using up until now didn't really have a mature enough managed data warehouse solution like BigQuery, which is so great for these kinds of scalable queries.

So what did they do? They used Variant Transforms, as I mentioned a totally open-source tool, for bringing all of that data into BigQuery. They
took the data from the Postgres database, which had a bunch of biomarker information, put that in BigQuery, and joined it together. On top of that, their data scientists stood up all kinds of really cool machine learning tools and built a whole bunch of great models. So what happened? Well, they got all the feature engineering done in one week, and they created three novel models that outperformed an existing major clinical model for predicting certain biomarkers. In fact, they found out that their test, which is actually designed for cancer, worked in a completely unrelated therapeutic area. This is really exciting. It may become a publication, it may become a novel test, but we're really excited about the kinds of results that our customers are achieving by joining their genomic data with other data types using BigQuery.

I'm going to switch gears for a minute here, from tertiary analysis to secondary analysis. No doubt, if you're watching this, you're probably processing lots of samples, perhaps in a cloud environment, perhaps somewhere else, and you're thinking about how you can really scale up the work of your pipelines. Our team also offers an API called, not surprisingly, the Pipelines API. The Pipelines API helps you run bioinformatics pipelines at scale, achieving great metrics around turnaround time and cost. You
can stand up all kinds of great industry-standard tools like GATK, which you're going to hear more about; you can optimize for parallel execution; and you can leverage a lot of great innovations in Google Cloud, like preemptible virtual machines, GPUs, and regional storage. You're going to hear a little bit more about that later today.

We've also brought together a phenomenal ecosystem of platform partners and workflow engines to help you with all this. Soon you're going to hear more about the Broad Institute and their FireCloud platform. You'll also hear about the Cromwell open-source workflow engine. These are great ways of running your workflows and leveraging the power of the Pipelines API. We also have other partners, such as WuXi NextCODE, DNAstack, and Seven Bridges, and the Nextflow open-source workflow engine.

The kind of result that people get from leveraging the Pipelines API is massive scale. For example, the team at Stanford School of Medicine is able to run thousands of genomes in parallel with what they say feels like infinite resources, and to boot up a lot of machines at once to accomplish a batch workflow in a very short period of time while still focusing on low cost.

Our friends at the Broad have used the Pipelines API quite a bit, shall we say. They process over two hundred and fifty genomes every single day, they are now working with over 35 petabytes of data in Google Cloud, and just last year they consumed over 100 million compute core hours. And they're scaling up their work because of the innovation within the Pipelines API.

So we'll talk a little bit about some of the work that they're doing. The Broad gnomAD team recently called variants on the gnomAD data set with the GRCh38 reference. The problem they had was that gnomAD, as you might know, has 79,000 whole genomes that were previously called with hg19. They
wanted to be able to call variants for that entire data set using GATK4 and GRCh38. So they were able to spin up 79,000 workflows and call the variants for the entire data set in three days. It's a really exciting result, and I think it's a great example of the kind of scale you can achieve using the Pipelines API, engines such as Cromwell, and platforms such as FireCloud. What's next for that team? They're going to be calling variants for exome data on gnomAD as well, again using some of the same technologies and techniques.

Speaking of innovations through the Pipelines API, just last month we released a brand new version of the Pipelines API for our ecosystem of platform partners and workflow engines. What's new? Well, first, you're able to run multiple Docker containers per instance. This is really helpful because you can now run service containers in the background while the workflow is in progress. We also make it a lot easier for tool authors to inject containers around a workflow. Localization and delocalization of files is much, much simpler; it's now just a container instruction. And there's also probably my favorite feature: a lot more information is available to our platform partners and workflow engine partners about the transitions between states and the failure modes. For example, when I have a pipeline that fails, why did it fail? Did it run out of memory? Did it simply crash? Did something happen to the node? Did I get preempted? That information was a little hard to access before; now it's much easier, and it provides a better experience for whatever workflow engine or platform you're using. We've also made it easier to specify regions and zones, and for you to specify the minimum CPU platform options. Keith will tell you a little bit more about that soon, in terms of how you can bring up custom VMs.
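To make the value of those failure reasons concrete, here is a minimal Python sketch of how a workflow engine might decide what to do next once it knows why a task failed. Everything in it is hypothetical: the `FailureReason` values, the `next_action` function, and the retry limit are illustrative names chosen for this sketch, not the Pipelines API's actual schema or any engine's real policy.

```python
from enum import Enum


class FailureReason(Enum):
    """Hypothetical failure categories, mirroring the questions asked in the talk."""
    PREEMPTED = "preempted"          # the preemptible VM was reclaimed
    OUT_OF_MEMORY = "out_of_memory"  # the task needed more RAM than it was given
    NODE_FAILURE = "node_failure"    # something happened to the machine
    CRASHED = "crashed"              # the tool itself exited with an error


def next_action(reason: FailureReason, attempt: int,
                max_preemptible_attempts: int = 3) -> str:
    """Pick an illustrative next step for a task that failed on `attempt` (1-based)."""
    if reason is FailureReason.PREEMPTED:
        # Keep using cheap preemptible VMs until the attempt budget is spent.
        if attempt < max_preemptible_attempts:
            return "retry on a preemptible VM"
        return "retry on a regular VM"
    if reason is FailureReason.OUT_OF_MEMORY:
        return "retry with more memory"
    if reason is FailureReason.NODE_FAILURE:
        return "retry on a new VM"
    # A plain crash is most likely a tool or input problem; retrying will not help.
    return "stop and report the error"
```

Without a reliable failure reason, all four cases would look the same to the engine, and it could only retry blindly or give up.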
All this great talk about secondary analysis; let's learn a little bit more about how people are doing this in the real world. So now I'm going to invite up my friend Geraldine from the Broad to tell you more about GATK4 and FireCloud.

Thanks, Jonathan. It's great to be here. I'm going to talk today about GATK4 and FireCloud, and the way we have been working to make our pipelines available and accessible to the wider community of researchers.

First, I want to start with a little bit of a recap of what GATK is exactly and what it covers. GATK is the Genome Analysis Toolkit, and this is roughly the process for genomic analysis, specifically with a focus on variant discovery. If you think that you're starting with biological material, for example blood samples, we're going to extract DNA from that and put
that through the sequencing machine. That's going to produce a huge pile of short reads, and those short reads are the raw data that we're going to be using in our pipelines. The end result that we want to get to is a list of variations, or mutations, in the individual genome or the genomes of a group of individuals. So the pipeline, which is what GATK constitutes, is going to take that raw data, do some cleanup, do some mapping to find out where the reads belong in the genome, and then do the analysis that enables us to identify those variations, those mutations in individual genomes, which is ultimately what the researcher cares about.

Now, these tools have been around for a long time, about a decade, which honestly in genomics time is pretty much like a century. For a long time they were evolving kind of naturally; we were just improving our ability to push the scientific boundaries of what we could identify. Over the last few years, as Jonathan mentioned, there's been a huge explosion in the size of genomic data and the data sets that we need to be able to analyze. So we've put a lot of work into making sure that the tools and pipelines can cope with that huge explosion in the size of data sets, and specifically we built what we call GATK4, which is version 4 of the GATK toolkit. I'm not going to go into a detailed laundry list of features, but the main themes are that we've really improved the technical performance, speed, and scalability of the tools. That has involved a complete rewrite: we've rewritten the entire software codebase from scratch to be able to cope with the demands of the genomics workloads that are today's reality.

But it's not just about the technical performance. That's very important, but we've also put a lot of work into expanding
the scope of analysis to be able to cover more variant types. Where previously we were very focused on germline short variants, which was kind of what we were known for, we've now expanded that to cover all the major variant classes, including copy number and structural variation, in both germline and somatic mutations. We're now able to cover almost all of these use cases, and the remaining one that's not quite implemented today, somatic structural variation, is planned. We have that on the roadmap, and we hope to deliver it later this year. So the idea is that you have all these pipelines that allow you to really do all the scientific work that you need related to variant discovery.

There have been two big pushes in that direction. One is developing the right methods to produce those end results; the other is improving the speed and performance of the tools. That has involved leveraging a lot of the improvements in, for example, machine learning that have really flourished over the past few years.

Now, if we look at these pipelines, all of them involve a sizable number of steps involving different tools, and for individual researchers it's not easy to implement this at scale. So one of the things we've been focused on lately is formalizing pipeline definitions that researchers can take and apply to their data without having to figure out by themselves how to implement this. We call that the GATK Best Practices pipelines. But
there's still a big step between having the pipeline definitions and being able to implement them. So we've gone a step further, and now we're making these pipelines available as ready-to-run pipelines that are preloaded in a platform that we call FireCloud, which runs on Google Cloud. We have these workspaces that contain the pipeline as well as example data, so anyone can go and clone one of those workspaces and run the pipeline as it is, preconfigured, on the example data. They can also upload their own data and run the pipeline on their own data. That way you have the GATK Best Practices pipelines available in a ready-to-run state out of the box. So hopefully I've got you interested; now I want to tell you a little bit more about the FireCloud platform itself.

FireCloud is meant to be a self-service analysis portal that anybody can access, and specifically that anybody can access for free. We built this originally for Broad researchers, but we make it available for free. Of course, it's using Google Cloud resources, so you still have to pay Google Cloud for storage, compute costs, and things like that, but we do have a free credits program that you can use to get started with the platform and test it out without taking any financial risk yourself.

The platform itself offers a number of services, including the ability to manage your data, to share your data securely, and to run workflows on your data. We've got a number of pipelines there as presets that you can run on your data in this platform, but you can also bring your own pipelines, either variations of ours or just different workloads that you need to run, because we understand that many people use other tools besides GATK. You can use this through the web portal, which is a graphical user interface; you can also use it through the API if you prefer to interact programmatically
with the platform. And as of last week, we also now have the ability to spin up Jupyter notebooks on Spark clusters, so you can follow up your batch execution, your pipeline execution, with interactive analysis of the results that they produce.

Now let's look a little more closely at how batch execution of workflows works in FireCloud. Don't worry about the details yet, but let's look at the overall system. Here we have two researchers who have access to this workspace. The workspace
encapsulates everything that's involved in a particular project, including the definition of the data that's involved and the methods, and that's also the place where you're going to be executing pipelines. You also have access to pipelines, or workflows; that's the stack of scripts that you see above the workspace. And you have a number of data buckets that are available.

The way it works is that the data is stored in the buckets. Your data is going to live in Google Cloud Storage, either associated with the workspace or in other buckets, and you can also compute on data that's elsewhere, in the data library, which is public data that we're providing. The workflows are WDL scripts; WDL is a workflow definition language. The GATK pipelines are all provided as preloaded WDL scripts, and those contain the definition of your actual workflows.

When you go to run a workflow on your data, all you really need to do is connect the workflow with the inputs. The inputs are the data that's defined in your data model, which describes what files you have available and how they plug into the workflow. So you have your data definition, and you launch one of these workflows on your data. That's going to be handed off on the back end to Cromwell, which is the execution engine, and Cromwell is going to take care of everything for you. It's going to execute your workflow, talk to the Pipelines API, which Jonathan talked about, run your workflow on Google Compute Engine, and then write the outputs back to your workspace data bucket.

So that's how you run a pipeline in FireCloud. Specifically, if you're using one of the GATK Best Practices pipelines, which are preloaded, you can get started with this very quickly, and it abstracts away all the difficulties of implementing these kinds of complicated pipelines on the cloud. And what's
really neat is, like I said, that as of last week you can follow up and use the results that you have: you can now explore your results interactively in the Jupyter notebooks.

All right, so that shows you how to access and run the GATK pipelines in FireCloud. Let's talk a little bit about what it costs, because when we started migrating our workflows to the Google platform, it became clear to us that there were some great opportunities to cut down the cost of operation. Here I'm going to talk briefly about the various strategies that we used to dramatically bring down the cost of operating our genome analysis pipeline.

The main strategies that we used, and we developed this with quite a lot of help from our friends at Google, have been in several different categories. One was this idea of task splitting and persistent disk. The idea here is that for each task in a workflow you're allocating some resources, some disk size, some processor size, and the beefier the machine and the bigger the disk, the more you're going to pay for this hardware. So what you want to do is make sure that the allocation is commensurate with the needs of your task. We have found that there are various ways that we can optimize how tasks are either split up or bundled together, depending on what their computational requirements are and what size of inputs and outputs we need to store locally, to really optimize the
allocation and resource requests.

The other big strategy that we've applied to bring down the cost, and this is personally my favorite, is the use of preemptible VMs. Preemptible VMs are machines that you get at a really steep discount; you pay, you know, 20% of the normal price for a machine. But that comes with the understanding that you can get kicked off the machine at any time. So for really long-running jobs that's not so desirable, but if you're running really short jobs, up to a few hours, it's actually a really good deal, because you're paying a lot less for the compute. There's some logic built in so that if you do get kicked off the machine, there will be some automatic retries, and you can even tune how many attempts you want to make on preemptibles before defaulting back to a regular machine that you keep for sure. By doing that, and by switching virtually all the steps in our pipelines to use preemptible VMs, we've gained an additional 35 percent savings on the cost of running our genome pipeline.

And now the final piece in the puzzle, the last big strategy that we've applied to cut down the cost of our pipeline operation, is to be really smart about how we're accessing the data. JP Martin at Verily Life Sciences contributed some really valuable code to HTSJDK, which is an open-source library that underlies GATK; it's what GATK uses to read in and write out files. The code that JP contributed allows us to read data directly from the bucket instead of having to copy the file to the local disk before we can compute on it. That makes a big difference because, for example, if we're parallelizing across genomic regions, normally you would have to copy an entire genome BAM file to local disk. But now you can skip that and just copy the little piece of data
that you're actually working on. That allows you to go much faster, because you don't have to wait for a 100-gig BAM file to get copied, and it also allows you to allocate much less disk space. That ties into the first strategy that I mentioned, where we can use much smaller disks for each task. If you think that this is going to be used across thousands or tens of thousands of genomic intervals, and then replicated across tens of thousands of genomes themselves, that amounts to fairly significant savings again, in this case 15% savings on the total cost of the pipeline.

The result of all of that in aggregate is that from an initial cost of $45 per genome to process an entire genome on the cloud, we've cut that down to $5 per genome. And we've done some additional tweaking that I can share: we're now seeing that with some of those additional tweaks, we can actually bring down the cost of the pipeline right now to three dollars and sixty-five cents per genome. That just gives us so much more bang for our research buck; it has made a huge difference in terms of what we can do at the Broad Institute with our genome pipelines.

I want to be clear that we're still offering Best Practices pipelines that are generic and can run anywhere, but we have this version of the GATK4 pipelines that is very deeply optimized for Google Cloud. It's what we run in production, and we make it available so that everybody can benefit from this optimization work. You can try this pipeline in FireCloud; it's preloaded in the workspace that lives at the URL that's on your screen, the $5 genome analysis pipeline, which we might have to rename the three-dollar genome pipeline, and hopefully that will just keep going down. So that covers the work that we've done so
far with GATK and FireCloud. I'm going to throw it over to my friend Keith from Google, who is going to talk about platform innovation.

Great, thanks, Geraldine. Innovation is something that we hold dearly at Google, and as you heard from Jonathan and Geraldine, our innovation ranges from things like the Pipelines API to BigQuery, even to stuff like preemptible VMs. It's fascinating to think that for the price of a cup of coffee in New York, you can do an entire genome analysis.
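As a rough check on the per-genome figures quoted in this session, the short Python sketch below works out what the drop from $45 to $5, and then to $3.65, means as a fold change and as a percentage saved. The dollar figures come from the talk; the function names are made up for this sketch, and the 79,000-genome batch size is borrowed from the gnomAD example purely as an illustration, not as a statement of what that project actually cost.

```python
def fold_reduction(before: float, after: float) -> float:
    """How many times cheaper `after` is than `before`."""
    return before / after


def percent_saved(before: float, after: float) -> float:
    """Savings as a percentage of the original price."""
    return 100.0 * (before - after) / before


def batch_cost(per_genome: float, genomes: int) -> float:
    """Total cost of processing `genomes` genomes at a flat per-genome price."""
    return per_genome * genomes


# Per-genome costs quoted in the talk, in US dollars.
INITIAL, OPTIMIZED, TWEAKED = 45.00, 5.00, 3.65

if __name__ == "__main__":
    for label, cost in [("optimized", OPTIMIZED), ("tweaked", TWEAKED)]:
        print(
            f"{label}: {fold_reduction(INITIAL, cost):.1f}x cheaper, "
            f"{percent_saved(INITIAL, cost):.1f}% saved, "
            f"${batch_cost(cost, 79_000):,.0f} for 79,000 genomes"
        )
```

The same three functions can be reused with any other per-genome price, which makes it easy to see how much a further small saving is worth at cohort scale.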
To me, that's amazing. But our innovation doesn't just stop there. At Google, we don't believe you should simply lift your workload or application into the cloud. We believe that you should lift, shift, and improve upon that workload or application, and the improvements take the form of some of the cost innovation and technical innovation that we have.

For instance, we have a live migration capability, so your instances can seamlessly migrate to another machine during maintenance windows without any user intervention and without any planned downtime. This is something we took advantage of during the recent Spectre and Meltdown vulnerabilities, whereby we migrated our entire customer environment seamlessly, without customers having to be involved, in order to address those vulnerabilities.

Additionally, we have custom machine types. We know that applications sometimes don't fall into a specific ratio of CPU and memory, so we allow you to define what that ratio is. Maybe you need four cores and 80 gigs of RAM; we don't stop you from doing that.

Additionally, we have recommendations that use our vast history of machine learning to look at how an instance has performed and say: you know what, if you change the core count on this machine, or maybe reduce the memory footprint, not only will your instance still run, but you'll save a couple of dollars as well.

We believe that you should pay for what you consume, and to that end we have per-second billing. If your instance runs for 73 seconds, you're going to pay for 73 seconds; you're not going to pay for 120 seconds or even more.

We also have extremely fast startups for our instances; we're talking about a matter of seconds, so that gets you up and running as fast as possible. And lastly, we allow you to resize your disks on the fly. If you need a bigger disk, we allow you to do that seamlessly, as well as increasing the IOPS for that disk as you make it bigger.
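The 73-second example can be sketched in a few lines of Python. This is a simplified model, not Google's billing logic: it only rounds a runtime up to the next billing increment, ignores any minimum charge a provider may apply, and the hourly rate passed to `charge` is a made-up number for illustration.

```python
import math


def billed_seconds(runtime_seconds: float, increment_seconds: int = 1) -> int:
    """Round a runtime up to the next whole billing increment."""
    return math.ceil(runtime_seconds / increment_seconds) * increment_seconds


def charge(runtime_seconds: float, hourly_rate: float,
           increment_seconds: int = 1) -> float:
    """Cost of one run at `hourly_rate` dollars per hour (hypothetical rate)."""
    return billed_seconds(runtime_seconds, increment_seconds) * hourly_rate / 3600.0
```

With `increment_seconds=1` a 73-second run is billed for 73 seconds; with `increment_seconds=60` the same run is billed for 120 seconds, and with `increment_seconds=3600` for a full hour. The difference per task is small, but a pipeline made of many short tasks pays that rounding on every one of them.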
Innovation isn't only about technology and cost. We think security is also a strategic part of innovation at Google Cloud, and it starts with you, the customer. By default, without you having to do anything, all of your data in transit and at rest is encrypted. You don't have to worry about any PKI infrastructure; you don't have to worry about any key management. Now, of course, if you're the type that likes to manage your own keys, we allow you to do that, but knowing that you don't have to is an extremely important benefit.

This notion of security and encryption extends into our cloud stack. It starts at the silicon level. We made our own in-house secure chip that we put on every peripheral device, and that secure chip checks to make sure that there is a trusted root during boot. If it detects any abnormality, during the manufacturing cycle or even during steady state, that machine in our data center will not boot and will be reported to our engineers.

That extends into our server environment. We have our own custom servers that we manufacture, and by making our own servers we eliminate all the extraneous stuff that you may have in an off-the-shelf server. We don't necessarily have to put in video cards if they're not needed. So not only do we free up valuable real estate on that server to introduce additional CPU and RAM, but we're actually eliminating a threat vector by eliminating potential vulnerabilities in those peripheral devices that you don't even need.

We also extend this to our storage environment and our network environment. They are both custom and allow us to own not only the hardware but the software and the firmware stack as well. And lastly, at a holistic level, our data centers have both logical and physical security. To
that end, because we own the hardware, the software, and the firmware, we're not reliant upon any third party to provide patches in case of a vulnerability. As we control it, we can issue those patches as soon as possible.

Now, you don't have to take our word for it. We do annual audits with third-party consultants to attest to our environment. And when it comes to genomics and the life sciences, one certification that you may see on the bottom left is of utmost importance, and that's HIPAA. With Google, our entire cloud environment is HIPAA certified, which means you don't have to worry about selecting a certain cloud or instance to put that PHI data on. As long as you put it on Google Cloud, it's automatically covered by our HIPAA certification. And you can see we have a vast number of other certifications as well, such as PCI, FedRAMP, et cetera.

And lastly, innovation is not about keeping all of it
close to the chest. We are firm proponents of the open source community. To that end, we've incubated numerous projects within Google that we've released to the open source community. Two of those projects in particular, Kubernetes for container orchestration and TensorFlow for machine learning, have the number one and number two highest engagement in all of GitHub. And not only do we incubate products and release them to the open source community, but our employees, our Googlers, are also active contributors. As of 2017, there were eleven hundred projects on GitHub contributed by Googlers, and those Googlers totaled over nine hundred employees.

So stay tuned for a live Q&A. We'll be back in less than a minute.

All right, we're back. Some questions from the audience. I've been told this first question comes from Jesus Gomez in New York, who asks: do I have to bid for preemptible instances?

The answer is no. This is a big difference with Google Cloud, in that you don't have to bid for preemptible instances; you just select them. The Pipelines API provides an option for you to employ both preemptible CPUs and preemptible GPUs. You just use them, you get the discount, and you go to work.

And that's a flat 80% off for all instances, and for GPUs it's a flat 50%, so definitely take advantage of that.

How do I get started with FireCloud?

firecloud.org is the URL that you're looking for. You can make an account, and you actually get $250 in credits just to get started. That's quite a few genomes, as you've heard about some of the really cool cost work that's being done by the Broad.

And, you know, I have a GCP instance, and I know FireCloud runs on GCP. Do I have to create a new billing account? How do I get my procurement team involved?
Can I use my existing account with FireCloud?

You actually can. There are instructions on how to do this both on our blog and on the FireCloud site. You can bring your own GCP billing account: you supply the billing account ID, and then the charge for all that compute goes to your home institution. So you're really getting all the benefits of FireCloud, and you're getting all the benefits of the optimizations.
And yet it's your home institution that's getting charged; you don't have to worry about doing a transaction with the Broad.

Excellent. Does the five-dollar genome include alignment, or is this just GATK?

Great question. The answer is yes. Five dollars gets you FASTQ to VCF, plus the quality metrics, so it's alignment and variant calling. It's really exciting. When we're talking about secondary analysis, there's so much innovation happening in this field, especially from a cost perspective. With GATK, it's really funny to think that just a couple of years ago it was nine to ten times more expensive. A lot of the buzz in our industry is always about the $1,000 genome and being able to generate the data cheaply, but bringing down the cost by 10x is pretty significant too on the informatics side. So I think that as we watch this field, we're going to see the cost of secondary analysis get closer and closer to zero, which is something I'm personally excited about, because I think it unleashes all kinds of new research questions and opportunities.

It sounds like we'll be at three dollars very soon.

Exactly.

Which I guess leads us to: how would I get started with GATK4?

Again, if you want to run on the FireCloud platform, it's firecloud.org. Also, if you want to get started using GATK4, we've put docs on our website, so if you go to cloud.google.com/genomics/gatk you'll see docs there that help you get up and running on a sample right away. And I believe the Broad website has some material as well, on their blog. Both institutions have blogged about this; it's such an exciting development for the community.

And if all else fails, you can Google it.

Yes, we have a search engine we recommend here.

How
would I get started with Variant Transforms? So Variant Transforms, as you heard, I'm so excited about the work that's already being done with it. You're getting sort of the out-of-the-box variant database based on BigQuery, which is such a powerful technology. So Variant Transforms, it's on our GitHub page. The URL is a little tricky, so probably Google that. But if you go to the Variant Transforms GitHub site, you'll see not
just the code, not just documentation. You'll see our product roadmap; we've actually open sourced the product roadmap as well. You'll see the work that we're doing to add annotation support. You'll see the work that we're doing to add a BigQuery to VCF pipeline, because sometimes you might be selecting a cohort based on all the data you've merged, and you want to go back to VCFs so you can use some file-based tools, so that's on the roadmap as well. We'd really look forward to folks commenting on that. We really do look at that page, and it does inform our product development on Variant Transforms. So another great example of contributing to the open source community. Exactly. We realized that it was so important to make that open source. It's also written in Python, like many other stages of bioinformatics pipelines. And by making it open source, again, we know that folks will have their own VCF schemas, they'll want to be able to tweak it, and we want to just make that easier. And we're excited to give something back to the community in this way. Awesome. Speaking of BigQuery, you know, I've used it once or twice, and I see this thing about public datasets. Can you tell me which ones are available? Yeah. Oh man, I don't know if I can recite all of them, but gosh, there's a bunch. So some of the ones you might expect: 1000 Genomes, TCGA. The MSSNG data set is an autism research data set, as is some data from the Simons Foundation. There's a whole list online, and we'll make sure to post the URL afterwards so you can find them. But we're always interested in public datasets that would be good for the community and that can be very helpful. We see that as an opportunity, again, to aid the research community by hosting those datasets for free and then enabling folks to compute on them using some of the tools we've been talking about here today. And speaking of tools, I know your team
was at the HIMSS conference last week. Anything you want to call out? Sure. Yeah. So HIMSS was a great opportunity for us to introduce the Healthcare API. The Healthcare API is actually built by our clinical data informatics team, and it makes it really easy for you to deal with EMR and EHR data in the cloud. We're really interested in helping people manage that data, process it, and work with it at scale. So we're getting the signal to wrap up. Thank you so much for joining us, it's been a lot of fun, and thank you for all the great questions.