Dynamic Compute Options for Research Institutes (Cloud Next '18)

Thank you for stepping out and coming to our little talk here. Our talk is on dynamic compute options; we'll introduce ourselves in a moment. My name is Mike, and I'm the VP of Delivery for Dev9. I've been in the industry for about 24 years. I was a very early adopter of Google -- I was on the original beta program for Google App Engine -- and I'm also a Google certified architect; I was number 170. My little joke is that we've got about 25 certified engineers, and there are about 2,000 certified worldwide, so we have about 1% of the world's population of certified engineers. I want to introduce Michael, who's with me, to start off the conversation.

Good morning, everyone. My name's Michael, and I'm with the Fred Hutchinson Cancer Research Center. I've been there about 15 years now, running HPC at various levels. Introducing the Fred Hutchinson: our mission is the elimination of cancer and related diseases as a cause of human suffering and death. What's not on the slide is that around 2015 our president gave us ten years to reach this goal. That's an ambitious goal, but we have over 300 faculty and over 3,000 employees who are all focused on reaching it, so I feel pretty confident that we're going to get there.

Our research has expanded beyond just cancer. We're starting to find links to viruses and the world's microbiome in the development and progression of cancer, so the research area expands with every passing day. I'm part of the scientific computing team. We're a small team of about seven folks, responsible for supporting our researchers and their compute needs. We support almost the full stack: the compute hardware, the software, the tools, as well as some high-performance storage.

We have a smallish cluster, about 1,200 cores over 510 systems. Right now we're holding about 1.2 petabytes of data, and that grows by 24 terabytes every day. We also have some miscellaneous systems -- large compute nodes that we use for casual prototyping work and other functions that aren't satisfied by the cluster -- so it's a pretty diverse environment, and it's a growing environment.

The core of our talk is high-performance computing. What is high-performance computing? At the Hutch, high-performance computing is pretty much any computational resource beyond what's available under your desk. We scale horizontally, meaning we use commodity hardware and just add more hardware as we need more capacity. It's what you might call a Beowulf or shared-nothing style cluster, versus something like a Blue Gene, a Cray, or one of those more exotic machines. Why do we need it? Well, we've got about seven years left to cure this cancer problem, so we need a lot of compute. We're asking a lot of questions that require a significant amount of compute capacity.

So how do we use HPC at the Hutch? I've picked out three research projects here to give you a quick introduction to the sort of work they're doing. The Optides project is a collaboration that identifies new drugs and therapies for very treatment-resistant tumors of the head, neck, and throat, and of the brain as well. One of their key problems is how you get drugs across the blood-brain barrier to deal with brain cancers and other diseases.
They're a very diverse project: they work with a lot of proteomics data, a lot of genetics data -- high-throughput sequencing base calls and alignments -- as well as protein structure and functional predictions. Just to give you an idea of the scale of their work, they've used 141,000 CPU hours since the first of May.

The Malik lab is run by Dr. Harmit Malik; he's part of our Basic Sciences Division. I've read through his area of research twice and I still don't entirely understand it, but he's basically looking at problems that can be classified under the genetics of evolutionary conflict. He does a lot of de novo genome assembly -- new genomes come in, they try to put them together and figure out what's interesting about them and how they've changed -- plus short-read mapping and single-nucleotide polymorphism discovery. They're probably on the smaller scale of compute for the center, at about eleven thousand CPU hours since the first of May.

And finally the GECCO project. This is a really large project that's been ongoing for a number of years at the center; we've got about forty thousand participants in this one. It's aimed at colorectal cancers and how the environment affects your risk of getting a colorectal cancer. They do a lot of work with whole-genome data.

They're looking for new genotypes in existing datasets, and at G-by-E -- genetics versus environment -- how the environment affects your genome and how that translates into your risk of getting a cancer. Like I said, they're one of the larger projects; they've used about 205,000 core hours since the first of May.

With an institution this large we have a very diverse toolset. We have a few of the big ones: R and Bioconductor are our bread and butter, the first-class tools that we maintain. We're getting more interest in machine learning these days as we move into more clinical data. And then we've got the old workhorses -- samtools and the like -- for your traditional genetic analysis. We've also got a lot of custom pipelines being built at the center, where people are chaining together different tools using a variety of systems -- scripts, more modern things like Airflow, et cetera -- to answer their problems. We maintain all of these using a couple of things, EasyBuild and Lmod, which let us build our tools in a reproducible manner and deliver them easily to our researchers.

Now, the cluster I mentioned at the beginning of the presentation: we're a big Slurm shop; we've been using it since the early 2.x days. It's an open source project that began at Lawrence Livermore labs. It's a very high-performance solution that's used on, as you can see, six of the top ten supercomputers. The emphasis on performance is a key part of how it was developed, and it's been very useful for us. The large-system tuning guide indicates they ran almost two million tasks on 122,000 compute nodes and started those in 322 seconds, which is fairly impressive. A little more down to earth: thirty thousand tasks on 15,000 Linux nodes took 30 seconds. So we're looking at a very high-performance solution for managing and distributing workload to a cluster.

Unfortunately, this comes with some costs. To get this kind of performance, all the nodes in the cluster must share the same configuration; they all have to know about each other. You can't have nodes just joining randomly throughout the day -- the cluster has to have a consistent view of what the rest of the cluster looks like. So when we add or remove nodes, it's a disruptive task where you have to reconfigure the cluster and tell everybody there's a new configuration.

We decided we really needed to stay with Slurm. Like I said, it's a key strategic component of our environment, and we don't want our researchers to have to learn a new job submission system. We wanted them to be able to keep using their existing scripts and run with the same commands and the same options whether they were running on premises or in the cloud.
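Since every node has to agree on the cluster configuration, one small thing that can be scripted is a consistency check. The sketch below is only an illustration, not part of the Hutch's tooling: the node names are made up, it assumes passwordless ssh to each node, and it assumes the config lives at /etc/slurm/slurm.conf (the path varies by site).

```python
#!/usr/bin/env python3
"""Minimal sketch: verify every node sees the same slurm.conf.

Assumptions (not from the talk): nodes are reachable over ssh without a
password prompt, and the config path is /etc/slurm/slurm.conf.
"""
import subprocess

NODES = ["node01", "node02", "node03"]   # hypothetical node names
CONF = "/etc/slurm/slurm.conf"           # assumed config location

def conf_hash(node: str) -> str:
    # md5sum prints "<hash>  <path>"; keep only the hash.
    out = subprocess.run(
        ["ssh", node, f"md5sum {CONF}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()[0]

hashes = {node: conf_hash(node) for node in NODES}
if len(set(hashes.values())) > 1:
    print("Cluster view is inconsistent -- nodes disagree on slurm.conf:")
    for node, digest in hashes.items():
        print(f"  {node}: {digest}")
else:
    print("All nodes share the same configuration.")
```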

Let me make a real quick note there. SchedMD -- the company that builds and develops Slurm -- actually did a partnership with Google and put out a kind of alpha release of a dynamic capability: basically build up a cluster, do your work, and tear it down. That still hasn't really been released yet. We'll cover it in a little bit when we get into more of the technical side, but it's important to note that the challenge we had was really around the fact that the configuration has to be known ahead of time, once you build your infrastructure, so the cluster can communicate -- there's no cluster communication with new nodes.

Right. There are a number of ways of doing this, and in our approach we didn't want to change anything for our community. We had to tie into a lot of resources that are back on campus, so that's a big -- what do you want to call it -- a big boat anchor on using the cloud: we have to tie into a lot of on-campus resources.

I don't know how many of you have worked with or are familiar with Slurm. It's pretty straightforward. You've got a controller in the middle, you've got daemons that run on the individual nodes and accept work from the controller, and then you have a database daemon that sits in the background and does all the accounting work, keeping records about what jobs were run and what jobs are running. There are a few different commands for interfacing with the cluster -- srun, sbatch, and so on -- that allow you to dispatch jobs to the cluster, cancel them, or see how they're progressing through the queue. There are two options, a batch option and an interactive option. We do have a lot of users who use a lot of cores interactively, where they will fire up a job and, from a shell, fire off various R tasks and things like that. So it's a very diverse workflow throughout the center.

Data management: where's the data? Well, it's all on campus. Right now we come back to the campus from our various cloud clusters to read and write data. That's a bit of another anchor, another constraint on our design.
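To make the batch path concrete, here is a minimal sketch of wrapping a batch submission from Python. It isn't the Hutch's tooling: the job script name is a placeholder, and it only relies on the standard sbatch and squeue commands being on the PATH.

```python
#!/usr/bin/env python3
"""Minimal sketch: submit a Slurm batch job and poll its state.

The job script path is hypothetical; only the standard sbatch/squeue
commands are assumed.
"""
import re
import subprocess
import time

def submit(script: str) -> str:
    # sbatch prints e.g. "Submitted batch job 123456"
    out = subprocess.run(["sbatch", script], check=True,
                         capture_output=True, text=True)
    match = re.search(r"Submitted batch job (\d+)", out.stdout)
    if not match:
        raise RuntimeError(f"unexpected sbatch output: {out.stdout!r}")
    return match.group(1)

def state(job_id: str) -> str:
    # -h suppresses the header; %T prints just the job state.
    out = subprocess.run(["squeue", "-h", "-j", job_id, "-o", "%T"],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip() or "COMPLETED_OR_GONE"

if __name__ == "__main__":
    job = submit("align_reads.sh")   # hypothetical job script
    print(f"Submitted job {job}")
    while state(job) in ("PENDING", "CONFIGURING", "RUNNING"):
        time.sleep(30)
    print(f"Job {job} left the queue")
```

The interactive workflow the speakers describe would instead go through srun, for example `srun --pty bash` to get a shell on a compute node and run R tasks from there.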

Added to that is the fact that we're still using LDAP and Kerberos authentication back to on-campus hosts. So that's the problem: we have a tightly coupled, infrastructure-dependent computing environment, and we want to take that and replicate it in the cloud.

Going forward, cloud is going to be a big part of our computing strategy, and we are adopting sort of a cloud-first approach to computing. It gives us a lot of options: it allows us to scale capacity on demand and then scale it back as circumstances warrant, and we get new options and new technologies easily and quickly rather than having to wait to provision things on campus. We've deployed into another cloud provider -- I'll leave it to you to guess which one -- and that's been moderately successful given the constraints of having to use on-campus resources for data access. For some work, where we're doing a lot of simulation, it's been highly successful in giving people the capacity they need to accomplish their goals.

What we wanted to do next was expand into GCP. There are a lot of exciting things going on there -- some new capabilities and different ways of managing resources -- so it was fairly exciting to approach Google and see what we could gain. It's a chance for us to really diversify our compute and our infrastructure, as well as mitigate the risks that come along with a single-vendor solution for our cloud computing.

But how do we solve this problem? We didn't really have any experience with GCP -- we'd been working with Amazon for quite some time -- and we wanted to make this happen quickly. We wanted to use best practices for a solid deployment, but we didn't really know what those are for Google. So this is where we looked to bring in a partner with Google experience to help us get to the solution. We have Dev9: Google subject matter experts, with experience in GCP, in biology and healthcare, and they're our Pacific Northwest neighbors. So at this point I'll turn it over to Mike.

All right. So as Michael mentioned, we're a partner here with Google, and the nice thing is -- I get to mention this -- every time I go to downtown Seattle I drive by their campus; they have a big sign, it's a cool campus. So it's nice that they were kind of in our backyard. Our company does have some healthcare and research background as well, through some of the solutions that we've built. I'll go through this real fast: who are we? We're Dev9, a custom software development company based out of the Pacific Northwest. We introduced ourselves into the cloud world about six or so years ago. At that time we realized cloud was not a trend, and we still stand by that -- we think it's going to stick around for a while. That kind of got us into the game; we started working with Google about two years ago and became a Premier Partner, and we've got 25-some certified engineers.

When we came in and got introduced to Fred Hutch, it was actually a really awesome time to share how we build infrastructure and how we build out our solutions -- we do it slightly differently, at least we think we do -- and we'll talk about what that solution is here.

To recap what we were trying to accomplish, the problem we were trying to solve: they have some unique constraints. They have 400-some projects going on and thousands of researchers around the world who are actually using these tools, and we didn't want to interrupt that; we didn't want to change that world for them. But we wanted to give them a different option -- essentially a switch they could flip to say, "I want to try a different cluster," or "work with a different cluster." One of the nice things we ended up with is that we get to use all Skylake processors, so a lot of the performance is coming out pretty strong.

This is what we started with when we jumped in: there was already an AWS implementation, and Michael and his team did a great job putting that together. To show the two sides: the bottom portion of the diagram was the original -- everything was on-premises, everything was set up with the networks, storage arrays, and so on, and that's where all the data actually lives. It was expanded into basically an almost exact copy in AWS, tied together with a VPN. Where possible they dropped in bucket storage -- S3 -- and other such technologies.

Before I get into what we did for our side of it, I want to take a small step back and talk about what multi-cloud is, and why we agree with their approach -- we see it consistently; we have a lot of clients who end up approaching both AWS and GCP, maybe a little Azure here and there. So what is multi-cloud? I think you've heard this multiple times over, and I'm not going to go too deep into it, but an interesting note: research that just came out says 81% of enterprises have a multi-cloud strategy. A lot of the time they say the other cloud is a private cloud -- they generally take a private cloud internally and then expand out to one public cloud -- but we're starting to see that trend change, with more adoption across multiple public clouds, not just one. And that's where you start building the tools you need in order to facilitate and interface into the different clouds.

There are some big benefits. One is avoiding vendor lock-in -- and in a moment we'll describe that as also being a negative -- but avoiding lock-in means if you need to move your applications around, or your compute around, it's easy to do. It's also nice when there are outages: your company is still able to continue its work when somebody fat-fingers S3 and it goes down; that's kind of nice. Also, with multi-cloud you get to take the best-of-breed tools from each of the different vendors. I think that's an important piece to think about: vendor lock-in is okay if
you're using it for the right purpose. For example, here we're now going to enable the researchers to start using things like the Genomics API, which they didn't really have easy access to before; now that we have that relationship with Google and the data is closer, they can start using the API. That's the best-of-breed type of tool.

There are challenges to doing multi-cloud, though. I think this is kind of a cool little Venn diagram I drew up here: when you get into it, if you really want to be completely agnostic, you end up with the lowest common denominator,

meaning that you're only doing compute and networking -- and, well, that's about it, actually. If you want to use any of the cool features and functionality that come with a managed provider, they're not available. So keep that in mind when you're thinking about where your strategy sits with multi-cloud. It also increases maintenance and governance: you have to maintain multiple platforms, and operationally you now have teams who have to understand all of the platforms. That can be a challenge for a lot of companies as well.

So how do we go about doing this? As an industry we've built the tools now that allow us to have a somewhat homogeneous infrastructure. We heavily use -- we're very big fans of -- HashiCorp in general as a company. HashiCorp has a product called Packer. What it allows you to do is define what you want an image or virtual machine to look like, and it will build that image for you and either push it into the cloud you want or build it within the cloud you want. With that, we can have the same base-level image in AWS, the same one in GCP, and the same one on-premises -- largely just the same file, run with different providers. That sets us up to create a base level.

On top of that, we look at what other binaries and applications need to be on those machines. In this case a lot of that is Slurm -- the Slurm daemons and related applications have to be on each of the different machines -- so for that we use Chef. Chef is a great tool for configuration management: you write it once and use it in multiple different environments.

We expanded this with Fred Hutch here, but as a company we embrace infrastructure as code. It's something we think every company needs to do, and we basically automate everything -- that's kind of my little tagline; I talk about it way too much, but it needs to be said.

When we came in, for Google -- and this is very specific to Google -- we wrote a foundation layer. I'm sure if you've done anything on Google you immediately know that it's not the easiest thing in the world to start with: you get a whole pile of docs and then, now what? What we did is we wrote a foundation layer using Terraform. What this does is let us spell out what a base network and project structure looks like within a Google project. We set it up really simply, run it, and it defines that, so we can move projects around in a folder structure should we need to. It also sets up identity management in a layered structure, so we can do organization, folders, and projects in a hierarchy. That foundation gets run once and sets everything up. Then we have what we call our services layer, and this is where we actually provision managed services -- things such as Cloud SQL or other dependent managed services that are needed within the project. And then the last layer, what we call the cap layer, is where you're actively installing, updating, changing, and deploying rapidly changing services within your infrastructure. A rough sketch of this image-then-layers flow follows below.
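The sketch below is only an illustration of that flow, not Dev9's actual pipeline: the Packer template name, the layer directory names, and the variable file are all hypothetical. It only assumes the packer and terraform CLIs are installed.

```python
#!/usr/bin/env python3
"""Minimal sketch of the image-then-layers flow described above.

Hypothetical names throughout; only the packer and terraform CLIs are
assumed to exist.
"""
import subprocess

def run(cmd, cwd=None):
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

# 1. Build the same base image for every provider from one Packer template.
#    The template would hold e.g. a googlecompute builder and an amazon-ebs
#    builder, so one file produces matching images in each cloud.
run(["packer", "build", "base-image.json"])

# 2. Apply the Terraform layers in order: foundation (run once), then the
#    managed-services layer, then the frequently changing "cap" layer.
for layer in ["foundation", "services", "cap"]:
    run(["terraform", "init"], cwd=layer)
    run(["terraform", "apply", "-auto-approve",
         "-var-file=../common.tfvars"], cwd=layer)
```

In practice, as the next part of the talk covers, a run like this would live in a CI pipeline rather than on someone's laptop.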
I don't know if I'm jumping ahead to the next slide -- no -- I'll say it anyway. One little tip we like to give at all of our talks: when you're using infrastructure as code, if there's a possibility to run it in a continuous integration environment, definitely do that.

Terraform is another HashiCorp product. It allows you to be declarative about what you want your infrastructure to look like. You can say, "this is the type of machine I want, and I want 200 of them" -- it's just a variable; you change it and run it, and it gives you a declarative format for what your infrastructure is. When you run that with CI, you no longer have an individual who has to run it from a laptop, or even a jump box; the same run happens every single time, with all the benefits you get from CI/CD. A little side effect as well: when Terraform builds your infrastructure it creates a state file, so it knows what your infrastructure looks like without necessarily having to go back to the cloud -- it can kind of manage itself. Along with that, if you want to heavily govern your infrastructure -- so you're not running up high costs, or you want to make sure individuals only have certain access -- you generally don't want to give access to the console, or to gcloud, which is the command-line tool. Instead, use Terraform to make all your changes. It forces you to run through a pattern: with a CI-driven pattern you actually get to do code reviews on all of your infrastructure changes going out, so you know exactly what your infrastructure is at any given time, and you can stop a change that shouldn't go through.
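As a rough illustration of that CI pattern -- not Dev9's actual pipeline -- a build step might look like the sketch below: format-check and plan on every change, and only apply a previously reviewed plan file. The file names and the APPLY environment variable are hypothetical; the terraform subcommands and flags are standard.

```python
#!/usr/bin/env python3
"""Minimal sketch of a CI step for Terraform changes.

Hypothetical layout: the pipeline runs this in the repository root, and
APPLY is only set on the main branch after the plan has been reviewed.
"""
import os
import subprocess

def tf(*args):
    subprocess.run(["terraform", *args], check=True)

tf("fmt", "-check")            # fail the build on unformatted code
tf("init", "-input=false")
tf("validate")
tf("plan", "-input=false", "-out=release.tfplan")  # reviewers inspect this plan

# The apply stage runs the exact plan that was reviewed -- nobody applies
# ad-hoc changes from a laptop or the cloud console.
if os.environ.get("APPLY") == "true":
    tf("apply", "-input=false", "release.tfplan")
```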

So, I talked about our base images. In our case everything is using Ubuntu -- we use some CentOS sometimes, but it's mostly Ubuntu. The general idea is that you create an image and that image gets pushed up. Here's what the provisioning looked like for us: we started out with a base OS image, really just vanilla. On top of that, depending on the machine, a lot of them needed the tools the researchers were using -- for example samtools, TensorFlow, et cetera. By the way, TensorFlow is something we're currently looking into, asking whether we can add it and eventually have it sit on GPUs in these clusters. We haven't done that yet, but it's on the roadmap. And then at the very top you have all the Slurm dependencies, and you're up and working.

Another thing that was a bit of a challenge we had to fit: all the data. We said we'd go multi-cloud, but all the data actually stays back at HQ. Researchers throw all their data there -- everything comes in onto network storage -- and we use NFS to transfer it out to the cloud for the actual compute itself. There are some challenges in that solution as well, just due to the latency that comes with moving very large amounts of data; a sketch of one mitigation follows below.

All right, I've actually already covered this, so I'm going to skip that slide. Largely, the solution itself from a technology perspective wasn't super difficult, but what it did is create a homogeneous infrastructure between all the different providers they were working with: on-prem with VMware, AWS, and GCP. I mentioned previously that operating multiple clouds can be a nightmare if you don't have some level of commonality between them, and this gives us that ability. So here's the final diagram. I've grayed out the AWS side because it's less important here. We actually run this cluster over a couple of different zones, which is nice, so we do have some high availability built into the clustering of the application -- of the Slurm cluster -- and along with that we have a lot of attached storage that works with it. All the different systems run over VPN back into the general networks, so that you have accessibility to the data back at HQ.

So, ongoing challenges. We alluded to the fact that Slurm itself as a product doesn't have dynamic clustering -- it does not have dynamic scale. You can't just throw a new machine into the cluster and have it auto-recognized. Google did do a product, like I said, that kind of helps with that, but it's not public yet -- the press release is out, but the code isn't.
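Because the data stays on campus and reaches the cloud nodes over NFS, one common way to soften the latency is to stage files to local scratch for the duration of a job. This is just an illustrative sketch of that pattern, not the Hutch's actual job wrapper; the mount point, scratch directory, file names, and the samtools invocation are all placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: stage NFS-resident input to local scratch, compute,
then copy results back.

All paths are hypothetical; the NFS mount point in particular depends
entirely on how the site exports campus storage.
"""
import shutil
import subprocess
import tempfile
from pathlib import Path

CAMPUS = Path("/mnt/campus/projectX")   # hypothetical NFS mount from HQ
SAMPLE = "sample_001.bam"               # hypothetical input file

with tempfile.TemporaryDirectory(dir="/tmp") as scratch_dir:
    scratch = Path(scratch_dir)

    # Pay the WAN/NFS latency once, up front, instead of on every read.
    local_input = scratch / SAMPLE
    shutil.copy2(CAMPUS / "input" / SAMPLE, local_input)

    # Run the analysis entirely against fast local disk.
    result = scratch / "flagstat.txt"
    with open(result, "w") as out:
        subprocess.run(["samtools", "flagstat", str(local_input)],
                       stdout=out, check=True)

    # Ship only the small result file back over NFS to campus storage.
    shutil.copy2(result, CAMPUS / "results" / "sample_001.flagstat.txt")
```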
Another challenge is identity management -- IAM in general -- which is another one we're still making sure we stay on top of: which machines have access to what data, and which researchers have access to what. Right now we have a very generic policy that is maybe a little too permissive, but it's a challenge because you have to match it up with AWS and also with what you're doing on-prem.
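One small, generic starting point for tightening that up is auditing which principals hold broad roles on the project. The sketch below isn't part of the Hutch setup; the project ID is a placeholder, and it only relies on `gcloud projects get-iam-policy`, which prints a project's IAM policy.

```python
#!/usr/bin/env python3
"""Minimal sketch: list overly broad IAM bindings on a GCP project.

PROJECT is a placeholder; the gcloud CLI is assumed to be authenticated.
"""
import json
import subprocess

PROJECT = "my-research-project"                 # hypothetical project ID
BROAD_ROLES = {"roles/owner", "roles/editor"}   # primitive roles worth reviewing

policy = json.loads(subprocess.run(
    ["gcloud", "projects", "get-iam-policy", PROJECT, "--format=json"],
    check=True, capture_output=True, text=True).stdout)

for binding in policy.get("bindings", []):
    if binding["role"] in BROAD_ROLES:
        print(binding["role"])
        for member in binding["members"]:
            print("  ", member)
```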

At the very end, what did we get by going through all of this? One, we took the existing solution that Michael and his team put together and, I would say, strengthened it in certain ways: we brought in a lot of Dev9's best practices for how we put together infrastructure as code and how we run it. So we did help out the AWS and on-premises versions of the cluster, but we also added the capability for Google. We were able to provide a 50/50 split between Google and AWS for cloud traffic, which means they can now scale even larger. The amount of data -- something like 1.2 petabytes -- is a lot of data, so they now have the capability of growing even larger as they need and matching the compute to the demand they have. They don't have a lot of resource contention now.

In the future, we're looking at adding the capability to use the Genomics API, along with other machine learning services like hosted TensorFlow. Right now a lot of the researchers are using TensorFlow, but it's self-managed and not GPU-based, let alone on TPUs. It's still very early days for that, and a lot of the work they're doing doesn't necessarily require it, but it's something they want to be able to start using, so we're expanding the solution to add those kinds of capabilities.

Also, right now all the data, as we mentioned, is on-premises, and all the compute is generally in the U.S. Fred Hutch has multiple centers around the world, and because we've now created this infrastructure as code, we can create that same infrastructure in multiple regions around the world and still have the same setup for them.

And the last thing -- it's a hope, but we'd like to build in some preemptible machines as well, to help lower the cost. A lot of that is going to come with the synchronization and automation we're doing for identifying when you want to build a new cluster, or add more to a cluster, and redeploy it. We're working on some of those pieces, but that will be a fun next step for us.
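For the preemptible idea, the basic building block on GCP is just a flag at instance-creation time; the hard part is the synchronization work the speakers describe -- re-registering nodes and recovering work after preemption. The sketch below shows only the trivial piece, with hypothetical instance name, zone, machine type, and image, using the standard gcloud flag.

```python
#!/usr/bin/env python3
"""Minimal sketch: create a preemptible compute node with gcloud.

Instance name, zone, machine type, and image family are placeholders.
Re-registering the node with Slurm and resubmitting work when the VM is
preempted is not shown here.
"""
import subprocess

subprocess.run([
    "gcloud", "compute", "instances", "create", "hpc-node-preempt-001",
    "--zone=us-west1-b",
    "--machine-type=n1-standard-16",
    "--image-family=ubuntu-1804-lts",
    "--image-project=ubuntu-os-cloud",
    "--preemptible",   # can be reclaimed at any time, at a lower price
], check=True)
```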
