Industrialize Your Data Platform with GCP + OSS (Cloud Next '19)
Good afternoon. This session is about how to industrialize your data platform with Google Cloud Platform and open source technology. I'm Jason Clark; I've been with Monsanto — now Bayer — for a little over nine years.

To start: who here has seen an ear of corn up close? If you have, you've probably noticed that on an individual ear there are a lot of individual kernels — anywhere between 800 and 1,200 on a single ear. Each of those kernels has a lot of information packed inside it, information that's very useful to us within Bayer Crop Science for making decisions about how we drive our product pipeline. What I want to talk about today is essentially this question: what if we could take all the data packed inside each of those kernels — on each of those ears, in fields with thousands of corn plants — and engineer an information pipeline that helps us mass-produce knowledge about our product pipeline, and ultimately make more informed decisions that help bring useful products to market and aid in producing more bountiful harvests?

What I want to convince you of is that, as an engineering team, we have produced an information pipeline that leverages the historic observations and crop data we have to further characterize our products and ultimately improve decision making — at a fraction of the cost we would have incurred to do it previously.

So let's take a step back. I'm with Bayer AG — more specifically, I'm part of Bayer Crop Science. When we look at the product pipeline for Bayer Crop Science, it really comes down to three high-level steps. Number one, we identify market needs: what treatments can we pair with seeds, and with the genetic characteristics we know about our corn plants, to produce a product that ultimately improves yield for our farmer customers? Once we've identified those market needs, we start to actually develop the product concepts — step two. Developing those means, over the course of years, taking each one of those seeds, planting them, pollinating them, harvesting them, shipping seed around the globe, and starting that whole cycle over, year after year. While we're planting and harvesting, in between we're measuring and recording a lot of data about those seeds.

Fundamentally there are two data sets our team cares about that contribute to driving that product pipeline: genetic data, which includes genomics — what do we know about the genetic makeup, the genome, of the seeds we're moving through the product pipeline — and phenotypic data, which is what we can actually observe. Is this ear of corn yellow, or is it white? That's a very simple characteristic, but you can imagine much more complex characteristics that we're measuring as well.
Then, ultimately, once we've identified and developed that product concept and we're sure it's the right fit for our customers, we'll commercialize it. At each of these steps along the way, different elements of our data stack are important. What I want to do today is tell you who we are as an engineering team, and then tell you how we've applied some principles to industrialize our platform and allow for large scale — across our consumers and within our platform — to ultimately produce knowledge.

So who are we, and how do I fit into this picture? I work on a really strong engineering team called Product 360 within our data landscape. What we do is engineer a versatile, enterprise-wide data asset platform that's core to other business layers. What does that mean? Fundamentally, we provide access to core product data sets in various ways — various modes of access — across various personas: everything from other software products being built within the company, to data scientists trying to identify what the next product concepts could be, all the way up to our domain experts — the people working in the labs, doing the hands-on work to measure and observe characteristics of the seeds moving through our pipeline.
Now, with that variety of consumers, and spanning a pipeline that long — again, over the course of years — we really build up some demand on the platform we've built. To give you a sense of that usage: API requests. The fundamental, active mode of access for the data sets our team delivers is through APIs, and we'll get into the different modes of access for those APIs in a few moments. What you can see here is that, on average, over the course of the last year we've had 323 million API requests per month. We're not talking internet scale, but that's pretty sizable usage when you consider that all of these consumers are within our company. Another thing you can see is peak usage: agriculture is fundamentally a seasonal business, so we see spikes at various times throughout the year. We've seen one month where we had 1.16 billion API requests across our platform.

So who are those coming from? We measure that as well, and we'll talk a little later about how we measure it. Active applications — an application being a software product that's using our platform in some way, an analytical model that's using our platform in some way, or an individual data scientist using the platform in some way. We've seen an average of 85 active applications per month, spiking up to 116 with those seasonal workloads.
Now, our platform is not stagnant, not stale — it's changing every day. We release to production 23 times a month on average, with a peak of releasing more than once a day. We're not releasing hundreds of times a day; we don't have that need. But it does show that our platform is still evolving, and some of the principles I'm going to talk about give us the wiggle room to keep evolving the platform as we expand usage, provide value for our consumers, and produce new knowledge for the organization. To give you a sense of the breadth of the platform: we're looking at a little over 40 resource collections. Those data sets — those resources — are all different shapes and sizes: some are graph-structured data, some are traditional relational data. They're all different, but they all fit into the puzzle in some way. And we don't just distribute our databases, we distribute our engineering team as well: we've got 30-plus data engineers, split into five-to-seven-person teams, each focused on a different vertical area within Product 360.

So with demand like that, how do we go about building a platform that we can scale, that our consumers can use, and that our consumers can scale? To do that we've had to develop some basic principles. Think of it as designing a factory of knowledge and laying out the factory floor — what are the principles we've tried to follow in that layout? Number one, we want to minimize retooling: we don't want to have to reconfigure that factory floor and incur downtime in the production of knowledge if we don't have to. Number two, we want standardization: we want each component on that factory floor to have an obvious job and purpose — someone should be able to look at the layout and say, I know why this particular piece is here. But we also want indirection, so that as an engineering team we can swap in a new component that does the job at least a little bit better if necessary. We want the ability to refactor without forcing our consumers to retool. And finally, if our factory floor stops working, we're no good to our consumers — we're not producing knowledge. So we need a really clear view of the health of that factory floor: key performance indicators that tell us how it's operating and where the bottlenecks are — where we're introducing bottlenecks into this process of producing knowledge.

Over the next few slides I want to walk you through how we apply some of these principles and how we design for them — somewhat abstractly at first — and then come back and make it real with an actual use case, showing how we've applied those principles in a real information pipeline. So let's start with standardization: we want those components to be easily swapped in and out, but we want their purpose to be obvious.
That comes back, to me, to readability and flexibility — those two things. Let's start with readability. If we think about readability from a code perspective, it means my code base is fundamentally more maintainable, because someone else can come in and understand what the software is intending to do. We want our data model to operate the same way: we want consumers to be able to look at one resource, get familiar and comfortable with it, and then, when they go to use another resource, not have to reconfigure their mental model. We want our resources to look similar, so that consumers can jump across them with a minimum amount of friction and really combine data together to produce new information for the company.
Standardization is how we get there, and the way we've standardized is by adopting Google's resource-oriented design guidelines. If you've ever used the Google Cloud APIs this will feel familiar — these are a lot of the same principles applied in the Google Cloud APIs, and you'll see some things show up here in a few moments directly out of those design guidelines. One thing that really cemented the importance of API design for me was Martin Nally's talk at Cloud Next last year, "Designing Quality APIs." If you're interested in API design — why it's important and how to go about it — it's a really good watch and a really good place to start.

So we've talked about readability; standardization is helping give us that readability, and it's helping give us the ability to swap in new components as an engineering team. But what about flexibility? If we're so rigid that we have to change our APIs every time a new attribute is introduced into a business workflow — maybe something hyper-specific to that workflow — that's retooling for us, and it's potentially retooling for our consumers by way of a breaking change to our API or a new API version they have to adopt. It's going to cause churn, and potentially downtime and lost productivity.

So we've adopted a concept from the resource-oriented design guidelines. With these guidelines we try not to be dogmatic; we try to be pragmatic. We start with the resource-oriented design guidelines as our base, and if something makes more sense because we have more context about our development workflow or our customers' workflows, we'll adapt it — but the guidelines are always our base. Resource annotations are an example of that. There's a concept in resource-oriented design called labels, and if you've used the Google Cloud APIs — or really any cloud API — you've seen this concept in some form. It's essentially bring-your-own-context: I don't have to ask Google, "hey, can you add this attribute to the Compute Engine API so I can capture it?" I can add it myself, and it's specific to me, relevant to me, understandable to me. We've adopted that same principle in our API design. We've got a lot of workflows internally that spin up and spin down quickly; we want to be the base for them, and we don't want to change our APIs and introduce downtime with every change to a business workflow. Annotations enable that bring-your-own-context. It's really just key-value pair data: there's a base schema that defines the core enterprise attributes for a particular resource, and then there are annotations you can apply on top of that resource to bring your own context. For example, if I wanted to label a particular seed as being part of a drought initiative, I could add that annotation by way of a key-value pair. In some cases we do have a controlled vocabulary implemented over some of those annotations, but mostly it's freeform — it's meant to be a piece of flexibility for our consumers.
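To make that concrete, here's a minimal sketch in plain Go — with made-up field names, not our actual schema — of what an annotated resource looks like: a few core, enterprise-defined attributes plus a freeform key-value map that the consumer controls.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Seed is a hypothetical resource: a handful of core, enterprise-defined
// attributes plus a free-form Annotations map for consumer-supplied context.
type Seed struct {
	Name        string            `json:"name"`        // e.g. "seeds/12345"
	CropType    string            `json:"cropType"`    // core enterprise attribute
	Annotations map[string]string `json:"annotations"` // bring-your-own-context key/value pairs
}

func main() {
	s := Seed{
		Name:     "seeds/12345",
		CropType: "corn",
		// A consumer tags this seed as part of a drought initiative without
		// any change to the API surface or the base schema.
		Annotations: map[string]string{
			"drought-initiative": "2019",
		},
	}
	b, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(b))
}
```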
Another way we enable flexibility is through the concept of resource sets. We expose resource collections as APIs, and what we've seen a lot is that people like their things, and they like their groups of things — they want to bring context by taxonomically organizing data in some way that's meaningful and relevant to them. Resource sets give us that. A set is just another resource, structured as a resource with a standard API, but it gives you the ability to group a particular resource by way of filter values. Say I want to keep a list of all the products moving through the pipeline for 2019: if I have a products resource and a product-sets resource, I can do that — I provide the filter parameters and either maintain that set manually, or use what we're moving more and more towards: dynamic sets. A dynamic set is basically filter parameters with context — give the filter a name, and it can be reused across the organization. It's a way to deduplicate that data, but it also gives our consumers some control and flexibility in how that data is managed.
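As a toy illustration of the dynamic-set idea — a named, reusable filter evaluated against the live collection — here's a plain-Go sketch; the types and names are invented for the example and are not our API.

```go
package main

import (
	"fmt"
	"strings"
)

// Product is a stand-in for a resource in a collection.
type Product struct {
	Name string
	Crop string
	Year int
}

// DynamicSet is a named, reusable filter: "filter parameters with context".
type DynamicSet struct {
	Name   string
	Filter func(Product) bool
}

// Resolve evaluates the set against the current state of the collection,
// so membership stays up to date without anyone maintaining a static list.
func (s DynamicSet) Resolve(products []Product) []Product {
	var out []Product
	for _, p := range products {
		if s.Filter(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	products := []Product{
		{Name: "products/1", Crop: "corn", Year: 2019},
		{Name: "products/2", Crop: "soy", Year: 2018},
		{Name: "products/3", Crop: "corn", Year: 2019},
	}
	corn2019 := DynamicSet{
		Name: "productSets/corn-pipeline-2019",
		Filter: func(p Product) bool {
			return strings.EqualFold(p.Crop, "corn") && p.Year == 2019
		},
	}
	fmt.Println(corn2019.Name, "->", corn2019.Resolve(products))
}
```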
The other side of this is that we don't want to impose trade-offs on our consumers where we don't have to; we want to push the ability to make trade-offs back into their control. There are a couple of ways we do that. I've talked in generality about resource APIs — a resource just being a particular data set — and we provide multiple modes of access to each of those data sets. One is gRPC: we've wholesale adopted gRPC across our platform as a mode of access for each data set. We've also enabled REST. We definitely see more usage on the REST side from less technical consumers — people like data scientists who aren't programmers, who are just trying to get their job done and need to interact with these APIs to access these data sets. REST works with more out-of-the-box tooling and is just an easier experience. But gRPC gives you performance: we've done head-to-head benchmarks of REST versus gRPC, and there's a definite, noticeable performance gain with gRPC. As a consumer — as a data scientist — though, I don't want a high barrier to entry, so we provide both modes of access.

The kicker is that if I want to string together multiple resources to really make something meaningful, that can be more difficult. I have to be somewhat of a programmer to call these APIs, do some sort of transformation, pipe that into another API, and transform again. So how do we make it easy for consumers to join this data together? That's where GraphQL comes in for us. We've got a GraphQL server that can be used by any of our consumer personas — bench scientists, data scientists, enterprise applications — and it in turn calls our gRPC APIs. We get the performance of calling gRPC APIs, but we can do a prescriptive join of that data for the consumers. This again gives the consumer the ability to trade off who does the join: if they know a better way to pre-filter data and join it with another data set, they can do that themselves by way of gRPC or REST; if they don't want to join the data manually, they can call our GraphQL API and we'll do the join for them.

Both of those are synchronous, request-response modes of access. The third pattern we follow is that we've adopted the Cloud Native Computing Foundation's CloudEvents spec to publish an event stream for each of our resources. That event stream notifies on the standard create, update, and delete operations, and it gives our consumers a hook into our ecosystem to react asynchronously to events — to the goings-on in our ecosystem.
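Here's a minimal sketch of that prescriptive-join idea, using the open source graphql-go library as a stand-in (not necessarily what we run). The schema, field names, and the stubbed fetch functions are invented for illustration; in our real server those resolvers call the gRPC resource APIs.

```go
package main

import (
	"fmt"
	"log"

	"github.com/graphql-go/graphql"
)

// In the real setup these would be gRPC calls to the resource APIs;
// they are stubbed here so the sketch is self-contained.
func fetchProduct(id string) map[string]interface{} {
	return map[string]interface{}{"id": id, "name": "example-hybrid"}
}

func fetchGenotypes(productID string) []string {
	return []string{"marker-001:AA", "marker-002:AT"}
}

func main() {
	productType := graphql.NewObject(graphql.ObjectConfig{
		Name: "Product",
		Fields: graphql.Fields{
			"id":   &graphql.Field{Type: graphql.String},
			"name": &graphql.Field{Type: graphql.String},
			// The nested field is where the "prescriptive join" happens:
			// the server fans out to a second backend on the consumer's behalf.
			"genotypes": &graphql.Field{
				Type: graphql.NewList(graphql.String),
				Resolve: func(p graphql.ResolveParams) (interface{}, error) {
					parent := p.Source.(map[string]interface{})
					return fetchGenotypes(parent["id"].(string)), nil
				},
			},
		},
	})

	query := graphql.NewObject(graphql.ObjectConfig{
		Name: "Query",
		Fields: graphql.Fields{
			"product": &graphql.Field{
				Type: productType,
				Args: graphql.FieldConfigArgument{
					"id": &graphql.ArgumentConfig{Type: graphql.String},
				},
				Resolve: func(p graphql.ResolveParams) (interface{}, error) {
					return fetchProduct(p.Args["id"].(string)), nil
				},
			},
		},
	})

	schema, err := graphql.NewSchema(graphql.SchemaConfig{Query: query})
	if err != nil {
		log.Fatal(err)
	}

	result := graphql.Do(graphql.Params{
		Schema:        schema,
		RequestString: `{ product(id: "products/1") { name genotypes } }`,
	})
	fmt.Printf("%+v\n", result.Data)
}
```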
This again gives our consumers a way to decide which model fits their particular workflow — whether that's an enterprise application that wants to react as new enterprise data elements are created, or a data scientist who really just wants to get these data sets combined and move on, so they can focus on their core responsibilities.

That's all heavily focused on the consumer side. As an engineering team, we also take that idea of standardization and use it as our indirection on the back end, and that gives us the ability to use data structure as a competitive advantage. Again, we've got 40-plus resource collections, all different shapes and sizes. We've got data sets like our pedigree, or ancestry, data set that tracks seed lineage. Think about that corn kernel: we plant it, pollinate it in some way, harvest new seed, and those seeds are linked together genetically — one came from the other. We do that over the course of years, sometimes multiple times a year for a particular line of seed. That produces a lot of data, and that data is naturally structured as a graph. What you see on the right-hand side of the slide is an image my colleague Tim Williamson tweeted a while back: it's actually a representation of our ancestry data set for corn. Each of the stars in that galaxy is a corn seed, and the links between them are genetic relationships — these seeds came from these seeds. It gives you a sense of the scale of that data set, and this is just corn; we work in a lot more crops than corn. If this is interesting to you, I'd highly encourage you to watch Tim's presentation from GraphConnect a couple of years ago; he goes more in depth on how we've modeled that pedigree data in Neo4j. The point here is that we can specialize from an engineering perspective — we have that indirection. None of this is novel, but the standardization gives us a good way to plug and play: we could start with a relational database for a data set like this if we wanted to, and then optimize the back end later, with minimal friction — ideally no friction — for our consumers.
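To illustrate why a pedigree data set is naturally a graph — every seed points back at the seeds it came from, and ancestry questions are traversals — here's a toy, in-memory sketch in plain Go. Our actual store for this data is Neo4j; the types here are invented just to show the shape.

```go
package main

import "fmt"

// Seed is a node in the pedigree graph: each seed links back to the seeds
// it was derived from.
type Seed struct {
	Name    string
	Parents []*Seed
}

// Ancestors walks the lineage graph and collects every ancestor reachable
// from s — exactly the kind of traversal a pedigree query needs.
func Ancestors(s *Seed, seen map[string]bool) []*Seed {
	var out []*Seed
	for _, p := range s.Parents {
		if seen[p.Name] {
			continue
		}
		seen[p.Name] = true
		out = append(out, p)
		out = append(out, Ancestors(p, seen)...)
	}
	return out
}

func main() {
	grandparent := &Seed{Name: "seeds/A"}
	mother := &Seed{Name: "seeds/B", Parents: []*Seed{grandparent}}
	father := &Seed{Name: "seeds/C", Parents: []*Seed{grandparent}}
	child := &Seed{Name: "seeds/D", Parents: []*Seed{mother, father}}

	for _, a := range Ancestors(child, map[string]bool{}) {
		fmt.Println(a.Name)
	}
}
```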
The other side of our data footprint is that Cloud Spanner has rapidly become our primary database. We're coming from a legacy environment with a large Oracle instance that was really good at its job for what we needed at the time, but as we've moved to more of a microservice model, something like Cloud Spanner has become very useful for us. We've made smaller, thinner, more well-defined resources — we're not trying to facilitate seven-table joins to get data back — and Cloud Spanner has been a really good fit. Again, data sets come in different shapes and sizes, and size is the key word here: we have a few data sets that are very, very large, and those are the ones we'll walk through in our real-life example. To give you a sense of the scale of our Spanner footprint, right now we're at about 17 terabytes across our two environments — about 10 terabytes in production and the other seven in non-production — and our instance size floats between 10 and 50 nodes, which can get expensive.
We'll talk a little later about how we try to control that cost without sacrificing performance — with Spanner specifically.

So we've talked a lot about getting data out of our platform; let's talk about how we get data into it. Our most basic pattern is a data generator calling a resource API by way of gRPC or REST. Again, nothing novel, but there are some characteristics of that workflow that are important. One, it pushes resource validation closer to the data generator: the call is synchronous, so if the data generator gives us bad data we can immediately reject it and give the generator context as to why it's invalid. That keeps muck out of our enterprise systems, and it also prevents us as an engineering team from having to wade into the muck if bad data is generated. The other characteristic is that the consistency model follows that of the resource API. Cloud Spanner is essentially our primary database at this point, so with its global consistency model we can provide that same capability to our consumers. The flip side — the trade-off — is that the data generator is sensitive to the performance of the resource API. If our resource API is slow and we don't have alarm bells sounding from our KPIs, that data generator is going to slow down, or they're going to have to engineer complexity into their system to accommodate it. So in cases where it's not feasible for us to make something hyper-performant, or where write latency is just high enough that it impacts the data generator and slows their work down, this pattern may not fit. It works well in the new world we're building, where new workflows live out in the cloud and we're engineering systems to bring us into a new landscape.

Another write pattern: we have large legacy systems that have collected data over the course of 20 to 30 years. That data is still important — we still use it to make a lot of decisions in our business, because we want to look at data over time. The way we get legacy data into our systems, in most cases, is through a product we have internally called the Enterprise Data Hub. The Enterprise Data Hub is a set of replicated Kafka clusters managed by a central engineering team within our company, and it takes care of bi-directional replication between our different cloud environments: we have a Kafka cluster on premises, a Kafka cluster in AWS, and a Kafka cluster in GCP, and the data hub moves data in and out of those environments for us. On the front end of that we have change data capture mechanisms — Informatica Data Replicator in the case of SAP, and Oracle GoldenGate in the case of our Oracle environment — pushing change capture records out. Those change capture records make their way to Go Kafka consumers that we have running out in Google Cloud, which push the data in through our resource APIs.
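The shape of one of those consumers is roughly the following — a minimal sketch using the segmentio/kafka-go client as a stand-in, with the broker address, topic name, and resource-API call all assumptions rather than our actual code.

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// writeToResourceAPI stands in for the gRPC/REST call into one of our
// resource APIs; in the real pipeline this would be a generated gRPC client.
func writeToResourceAPI(ctx context.Context, record []byte) error {
	log.Printf("would write %d bytes through the resource API", len(record))
	return nil
}

func main() {
	ctx := context.Background()

	// Consume change-capture records replicated into the GCP Kafka cluster.
	// Broker address, group ID, and topic name are illustrative.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-gcp:9092"},
		GroupID: "oracle-cdc-to-resource-api",
		Topic:   "oracle.cdc.seed_inventory",
	})
	defer r.Close()

	for {
		m, err := r.FetchMessage(ctx)
		if err != nil {
			log.Fatalf("fetch: %v", err)
		}
		// Push the change through the resource API rather than writing
		// directly to the database, so validation and consistency rules
		// are applied exactly as they are for any other consumer.
		if err := writeToResourceAPI(ctx, m.Value); err != nil {
			log.Printf("write failed, not committing offset: %v", err)
			continue
		}
		if err := r.CommitMessages(ctx, m); err != nil {
			log.Printf("commit: %v", err)
		}
	}
}
```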
So we're dogfooding those resource APIs even in this context — we never make direct writes around the API to our databases. That does two things: one, it hardens our workflows, because if we're dogfooding the API we'll harden that workflow much more quickly than waiting for our consumers to find issues; and two, it gives us the same benefits our consumers get — we can leverage those flexibility concepts and adapt our workflows much more quickly. One thing I'd stress here: if you're interested in learning more about how our data hub environment is set up, my former colleague Bob Lehmann spoke about that approach at Kafka Summit last year. It's another really good watch that goes more in depth into the data hub setup.

That particular change pattern is generally lightweight events: the change capture records are maybe 10 to 15 columns — not big — and a lot of them can be treated in isolation. However, we have a set of workflows, particularly in the genomics world, where a lot of data comes from a single event. A single change capture record comes across, and that one record can lead to the creation of thousands of resources, all of which are fairly heavyweight — there's some meat to each of those individual resources. Kafka really doesn't encourage putting super-large messages on a topic partition, so to accommodate that workflow we still stream out the lightweight event, then transform it by joining it back to the legacy system and exchanging it for that number of large resources. We take each of those resources and sync them as objects in Google Cloud Storage. From there we rely on bucket notifications to publish out the other side — that's our change capture mechanism on the other side of Cloud Storage. The bucket notification fires, in some cases we do a little more manipulation and massaging of that data, and then we again write it in through the resource API. It's a similar approach, but it accommodates the large transformation that has to happen in the middle, and we do it with GCS and Pub/Sub.

One of the things we get out of this workflow is scaling. With Kafka we can't arbitrarily scale our consumption: scaling a Kafka consumer is bound by the number of partitions you have, and re-partitioning is generally not a pleasant exercise. Pub/Sub gives us the ability to scale arbitrarily. We have a single, infrequent event that we exchange for a larger amount of information, and then we want to parallelize really wide to consume that information into our resource APIs as rapidly as possible. Pub/Sub lets us scale out much, much wider with essentially a button push — a change to a Kubernetes deployment's replica count.
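A minimal sketch of that Pub/Sub side — a subscriber reacting to GCS bucket notifications and handing the object off for ingest — might look like this. The project, subscription, and bucket are illustrative, and the actual ingest through the resource API is stubbed.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// Project and subscription IDs are illustrative.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("pubsub client: %v", err)
	}
	defer client.Close()

	sub := client.Subscription("genomics-bucket-notifications")

	// Receive fans work out across goroutines, and the deployment itself can
	// be scaled by bumping the replica count on the Kubernetes deployment —
	// unlike Kafka, we're not capped by a partition count.
	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		// GCS bucket notifications carry the event details as attributes.
		if msg.Attributes["eventType"] != "OBJECT_FINALIZE" {
			msg.Ack()
			return
		}
		bucket := msg.Attributes["bucketId"]
		object := msg.Attributes["objectId"]

		// Here we'd read the heavyweight resource from GCS, massage it as
		// needed, and write it in through the resource API; stubbed out here.
		log.Printf("new object gs://%s/%s ready for ingest", bucket, object)
		msg.Ack()
	})
	if err != nil {
		log.Fatalf("receive: %v", err)
	}
}
```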
That's one of the benefits we get out of this particular workflow as well. Now, we don't always have control of our legacy systems — and by control I mean the ability to go back and exchange information with them directly. An example of that for us is SAP. We never touch our SAP environment by querying it, but we do have change data capture generated out of the SAP system. To accommodate that, we've paired Kafka and those change capture records with Spanner and Dataflow. This is more of the traditional replicate-as-is: we replicate the data as-is into Spanner, so you'll see tables in Cloud Spanner that look like SAP tables.
We then use Cloud Dataflow to give us the concept of a materialized view: Cloud Dataflow runs periodically as a scheduled job, takes data out of Cloud Spanner, reshapes it into the format we want — a more descriptive resource for our consumers — and stores it back in Cloud Spanner, with our resource APIs built on top of that. So it's essentially a cache. One of the things we get out of using Cloud Dataflow here — we're heavily using the Java SDK — is testable business logic. We can test the data transformations we're doing, whereas if we were traditionally building a thousand-line SQL statement for a materialized view, testing gets really hard, error-prone, and all the other things that make it difficult to ensure we're not introducing breaking changes. With Dataflow and Java we can unit test in a similar way to any of our other software. Another important aspect of Dataflow is that we can scale it: as our data set size scales, we can increase the worker count — again, a button push. That gives us the ability to ingest a lot more data if need be, without a lot of operational overhead or having to retool our environment. The trade-off with this workflow is that it is not near-real-time. The other workflows we've looked at are near-real-time: moments after data is written to the legacy system, it's available in our resource APIs for consumption by our consumers. This one is scheduled — we run these transformations every few hours. We could squeeze that down to a shorter frequency; we haven't had the need to, but the workflow does suffer from that particular limitation. What you see on the right is an example of one of our simple Dataflow jobs — when I talk about testability, this is the kind of job where testing really matters.

We talked a little earlier about how important it is for us to have key performance indicators — a view of our factory floor. We don't ever want to stop that production of knowledge; we want it available 24/7. So now I want to talk about some of the reliability engineering aspects we've introduced into our stack. The most basic, but also one of the most useful, has been instrumenting our APIs to capture access log data — API request data. This captures essentially the who, the what, the when, and the how long for each API request that comes into our platform, and it gives us a sense of how those APIs are operating for our consumers — whether something is slow. We surface this data through BigQuery, written through Pub/Sub: each API hands the request record off to Pub/Sub, a process syncs that into BigQuery, and then we surface it for our engineering teams using Data Studio. That gives you the end-to-end experience and gives each engineering team a view of their specific segments of the factory floor.
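As a sketch of what that instrumentation can look like on a gRPC service: a unary interceptor that publishes a small access-log record to a Pub/Sub topic. The record fields, topic name, and the way the consumer is identified are simplified assumptions, not our actual schema.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
	"google.golang.org/grpc"
)

// accessLog is a hypothetical record: the who, what, when, and how long.
type accessLog struct {
	Method     string    `json:"method"`
	Consumer   string    `json:"consumer"`
	StartTime  time.Time `json:"startTime"`
	DurationMS int64     `json:"durationMs"`
	Error      string    `json:"error,omitempty"`
}

// accessLogInterceptor wraps every unary RPC and hands a record off to
// Pub/Sub; a separate process drains the topic into BigQuery.
func accessLogInterceptor(topic *pubsub.Topic) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {

		start := time.Now()
		resp, err := handler(ctx, req)

		rec := accessLog{
			Method:     info.FullMethod,
			Consumer:   "extracted-from-auth-metadata", // placeholder
			StartTime:  start,
			DurationMS: time.Since(start).Milliseconds(),
		}
		if err != nil {
			rec.Error = err.Error()
		}
		if b, merr := json.Marshal(rec); merr == nil {
			// Publish is asynchronous; the RPC doesn't block on the result.
			topic.Publish(ctx, &pubsub.Message{Data: b})
		}
		return resp, err
	}
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	topic := client.Topic("api-access-logs")

	// Attach the interceptor to the gRPC server; services register as usual.
	srv := grpc.NewServer(grpc.UnaryInterceptor(accessLogInterceptor(topic)))
	_ = srv // RegisterXxxServer(srv, ...) and srv.Serve(listener) would follow.
}
```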
Another aspect: you've seen how many of our data ingress patterns rely on Kafka. So we've created, essentially, a monitor for Kafka. That's not novel, but what we did in this case is use Kubernetes CRDs — custom resource definitions — to allow our engineering team to define the monitoring requirements for their Kafka consumers at deployment time. Instead of deploying my Kafka consumer and then going somewhere else to say "here's how I should be alerted for this thing," you push that information close to the deployment. This is a place where, to me, Kubernetes really shines: that extensibility allows us to describe our deployment workflow in the terms we use and make it discoverable for our engineers, all through the surface they're already interacting with for their daily application development. It alerts to Slack — push the message as close to where the engineers are — but under the hood it's just a set of really simple Go processes watching for those custom resources to change.

Last but not least: we've talked a lot about Spanner, and I mentioned that running a large Spanner instance can get expensive. We also don't want to run a small Spanner instance and have a seasonal workload spike on us all of a sudden, so that our API performance goes down the drain. We want stability, but we also want cost control. So we've written a simple autoscaler component for Cloud Spanner that runs in Kubernetes. It uses the Stackdriver API to watch the CPU utilization time series and scales Cloud Spanner up and down using the Spanner API as needed. We've seen this drastically reduce our Spanner bill, because before we were over-allocating to guard against poor performance if we saturated the instance. With the autoscaler, CPU utilization now hovers between 30 and 50 percent — and we want to be using as much of that capacity as we can; we can't put those CPU cycles in the bank and save them for later, so we're getting what we pay for at that point in time. This has been a big addition in terms of operations: it's hands-off for our engineering team, because the autoscaler can take action itself. There's a little bit of risk in that the Stackdriver API is a minute behind, so if we see a really sudden burst we may see a few failed requests, but overall, over time, it helps keep our SLOs in check.
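The core loop of an autoscaler like that can be sketched roughly as follows: read the Spanner CPU-utilization metric from Cloud Monitoring (Stackdriver) and nudge the node count through the instance admin API. The project, instance, thresholds, and one-node-at-a-time policy are all assumptions, and a production version would need cooldowns and more careful error handling.

```go
package main

import (
	"context"
	"log"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	monitoringpb "cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	instance "cloud.google.com/go/spanner/admin/instance/apiv1"
	instancepb "cloud.google.com/go/spanner/admin/instance/apiv1/instancepb"
	"google.golang.org/api/iterator"
	"google.golang.org/protobuf/types/known/fieldmaskpb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

const (
	project      = "my-project"                               // illustrative
	instanceName = "projects/my-project/instances/product360" // illustrative
	highWater    = 0.65                                       // scale up above this CPU fraction
	lowWater     = 0.40                                       // scale down below this
)

// recentCPU returns the highest Spanner CPU utilization (reported as a
// fraction of 1.0) observed over roughly the last ten minutes.
func recentCPU(ctx context.Context, mc *monitoring.MetricClient) (float64, error) {
	it := mc.ListTimeSeries(ctx, &monitoringpb.ListTimeSeriesRequest{
		Name: "projects/" + project,
		Filter: `metric.type="spanner.googleapis.com/instance/cpu/utilization"` +
			` AND resource.labels.instance_id="product360"`,
		Interval: &monitoringpb.TimeInterval{
			StartTime: timestamppb.New(time.Now().Add(-10 * time.Minute)),
			EndTime:   timestamppb.New(time.Now()),
		},
		View: monitoringpb.ListTimeSeriesRequest_FULL,
	})
	peak := 0.0
	for {
		ts, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			return 0, err
		}
		for _, p := range ts.Points {
			if v := p.GetValue().GetDoubleValue(); v > peak {
				peak = v
			}
		}
	}
	return peak, nil
}

func main() {
	ctx := context.Background()
	mc, err := monitoring.NewMetricClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	ac, err := instance.NewInstanceAdminClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		time.Sleep(time.Minute)
		cpu, err := recentCPU(ctx, mc)
		if err != nil {
			log.Printf("metrics: %v", err)
			continue
		}
		inst, err := ac.GetInstance(ctx, &instancepb.GetInstanceRequest{Name: instanceName})
		if err != nil {
			log.Printf("get instance: %v", err)
			continue
		}
		nodes := inst.NodeCount
		switch {
		case cpu > highWater && nodes < 50:
			nodes++ // scale up one node at a time
		case cpu < lowWater && nodes > 10:
			nodes-- // scale back down, respecting a floor
		default:
			continue
		}
		log.Printf("cpu=%.2f, setting node_count=%d", cpu, nodes)
		op, err := ac.UpdateInstance(ctx, &instancepb.UpdateInstanceRequest{
			Instance:  &instancepb.Instance{Name: instanceName, NodeCount: nodes},
			FieldMask: &fieldmaskpb.FieldMask{Paths: []string{"node_count"}},
		})
		if err != nil {
			log.Printf("update: %v", err)
			continue
		}
		if _, err := op.Wait(ctx); err != nil {
			log.Printf("wait: %v", err)
		}
	}
}
```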
Know this, marker or these set of markers lead, to this, corn, cart, or this kernel being white or yellow right things that we can observe that, have meaning and value for our customers. The. Way we make those observations we. Have two different resolutions. The levels of resolution in, red. You see low-density resolution, what. That is is a smaller, number of genetic markers that we capture spread. Across the genome and. Yellow. You see high density observations. That's. A lot more markers it's also a lot more expensive to. Collect right, so these are things where we're actually sending seeds to the lab and we're.
Now, capturing low-density observations is cheaper and easier, and it can be done more quickly — but it still costs money, and it's not just the cost of running the experiment; there are shipping costs and everything else that comes into play to move seed around and actually make those measurements. What we've done is leverage mathematics, essentially, and programming, and the technologies you've seen me talk about, to infer these genotypes at a much cheaper cost. We can take the data we've measured at those two levels of resolution and fill in the gaps: we can up-sample the low-density resolution to high density, and we can fill in the middle areas where maybe we didn't collect any data at all, but where we know genetically that this seed came from this seed, which came from this seed. Genetics gives us certain rules we can put into play to make estimations about what the genome of that seed in the middle — the one we made no direct observations of — actually looks like. We get a probabilistic value, which again allows our consumers to make the choice of how confident they need to be about that picture of the genome — again, putting the decision in the consumer's hands.

To give you a sense of how much data we've generated with this information pipeline: what you see in yellow is the high-density observations — a lot of markers collected on a smaller number of seeds. With the low-density observations we've observed far fewer markers, but on a much larger number of seeds. The blue is the area we filled in. I talked to a colleague this morning just to make sure my number was right on that blue area: we're talking about 197 billion genotypes that we were able to generate and infer from those data sets. That's something that was previously unachievable, and we've done it at a much lower cost.

And we've done it this way — so let's walk through this diagram real quick. This is our architecture for the genotype inference pipeline, and it brings together the principles we laid out earlier. Going left to right, starting from the bottom: we've got a lab workflow — a lab system — pushing data into our observed-genotypes resource, which is backed by Cloud Spanner; all of that data is in Cloud Spanner. From there you get a cloud event — that asynchronous notification that new data has arrived. A component called the orchestrator listens for those events and schedules inference jobs on Kubernetes, by way of GKE. It publishes Pub/Sub messages that get distributed across multiple Kubernetes clusters in multiple regions, and those pods' sole purpose in life is to do that calculation for us. To infer the genotypes, they reach back into our resource APIs to grab the observed genotypes.
When they complete their purpose, they write the new data set out as inferred genotypes into our inferred-genotypes resource. One interesting thing about that: in one of the Spanner talks earlier this morning, they talked about using Spanner for GCS metadata, and we take a similar approach here. We store the metadata for these data sets in Spanner so that we can search over it rapidly, but the bulk of the data — the payload — is stored in GCS, because it's large. We're talking 85,000 genotypes for a single seed, and millions of seeds that we're inferring genotypes for, so that data set grows rapidly.
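The write path for one of those data sets follows this general shape — the payload goes to a GCS object, plus a small, searchable metadata row in Spanner that points at it. This is a minimal sketch; the bucket, database, table, and column names are invented and are not our actual schema.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/spanner"
	"cloud.google.com/go/storage"
)

// writeInferredGenotypes stores the large payload as a GCS object and a small
// metadata row in Spanner that points at it.
func writeInferredGenotypes(ctx context.Context, seedID string, payload []byte) error {
	// 1. Payload to Cloud Storage.
	gcs, err := storage.NewClient(ctx)
	if err != nil {
		return err
	}
	defer gcs.Close()

	object := fmt.Sprintf("inferred-genotypes/%s.bin", seedID)
	w := gcs.Bucket("genotype-payloads").Object(object).NewWriter(ctx)
	if _, err := w.Write(payload); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}

	// 2. Metadata to Cloud Spanner, keyed so the resource API can find it fast.
	db := "projects/my-project/instances/product360/databases/genotypes"
	sp, err := spanner.NewClient(ctx, db)
	if err != nil {
		return err
	}
	defer sp.Close()

	m := spanner.InsertOrUpdate("InferredGenotypeSets",
		[]string{"SeedId", "GcsObject", "GenotypeCount", "CreatedAt"},
		[]interface{}{seedID, object, 85000, time.Now()})
	_, err = sp.Apply(ctx, []*spanner.Mutation{m})
	return err
}

func main() {
	if err := writeInferredGenotypes(context.Background(), "seeds-12345", []byte("...")); err != nil {
		log.Fatal(err)
	}
}
```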
A few interesting characteristics facilitate this workflow. The workers actually run on preemptible VMs, and that's been a huge mechanism for us to control cost in this process: you get preemptible VMs at a fraction of the cost of a normal VM, but you get those compute resources for a limited amount of time. To engineer for resilience to preemption, we've done a couple of things. One is checkpointing: we checkpoint intermediate state out to GCS as these workers run, so that if they do get preempted they can pick back up from where they left off. These jobs can be long-running — on the order of minutes at times, depending on how much data there is for a particular seed. With that long-running characteristic in mind, we've also implemented a job lock using Cloud Memorystore. Pub/Sub likes to try to redeliver these messages, sometimes before we're ready to be done with them, so we use Memorystore as a simple job lock: we write some metadata to Memorystore as the job starts, and if we get a redelivered message from Pub/Sub we can check it and ensure we don't duplicate effort.
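A minimal sketch of that kind of lock against Memorystore (which speaks the Redis protocol), using the go-redis client — the key naming, TTL, worker identity, and the client library itself are assumptions, not our actual implementation.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/go-redis/redis/v8"
)

// acquireJobLock returns true if this worker won the lock for jobID.
// SET NX only succeeds for the first caller; the TTL guards against a
// preempted worker holding the lock forever.
func acquireJobLock(ctx context.Context, rdb *redis.Client, jobID string) (bool, error) {
	return rdb.SetNX(ctx, "inference-job-lock:"+jobID, "worker-1", 30*time.Minute).Result()
}

func main() {
	ctx := context.Background()
	// Address of the Memorystore (Redis) instance is illustrative.
	rdb := redis.NewClient(&redis.Options{Addr: "10.0.0.3:6379"})

	ok, err := acquireJobLock(ctx, rdb, "seeds-12345")
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		// A redelivered Pub/Sub message: another worker already has this job.
		log.Println("job already in progress, acking and skipping")
		return
	}
	log.Println("lock acquired, running inference")
}
```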
The last thing I want to call out here is that, again, this all comes back to the resource APIs: we're reading from those resource APIs and writing output to those resource APIs. We're dogfooding all of the resource APIs that we expose to our consumers — we see the same surface they do — and that forces us to make sure the surface is coherent and usable.

To give you a sense of how much data we've generated — again, 197 billion genotypes — the process itself is meant to be always on: as new data arrives, we make new inferences. But we had to backfill it in some way, and we did that, again, by dogfooding the process. At a single moment in time — actually during Cloud Next last year — we had a Kubernetes cluster using 42,500 virtual CPUs, all on preemptible nodes, and 270 terabytes of memory. And when we talk about cost, we were doing this at one cent per CPU-hour, which means that as we scale this process up, our costs don't go out of control — it doesn't become prohibitive.

That's all I've got. That's my Twitter handle; we're always happy to talk about what we do and how we do it.