Hello everyone. My name is Sergei Sokolenko, and I'm a product manager at Google Cloud working on Cloud Dataflow. Today with me I have my colleagues Diego and David, and we'll be talking to you about advances in streaming analytics. Your feedback is greatly appreciated, so please take some time during or after the session to submit it.

In the session today we're going to cover a lot of exciting things, starting with a review of your options for building stream processing pipelines on Google Cloud, then a review of the recent improvements in Cloud Dataflow and Pub/Sub, diving more specifically into features such as autoscaling, Python support, SQL support, and Streaming Engine. We're going to have a demo. Then we're going to review Flex RS, an upcoming new feature that we're launching, and then we have Diego from Unity coming in to talk about how Unity is using Dataflow.

I want to start off with a very simple statistic: 25 percent. Twenty-five percent of data, by the year 2025, will be real-time in nature. That's a projection by the analyst firm IDC. What does it mean that 25 percent of data will be real-time in nature? It means that it will be most valuable to your business when it's consumed shortly after it was produced.

What is driving the growth of real-time data? It's growing much faster than other types of data, and there are lots of reasons; I'll highlight four important ones. The first is the explosion in the collection of user actions such as clicks, interactions in games, social media, photos, emails, and so on. The second source of real-time data is the machine and device data produced by IoT devices, your laptops, your PCs, your mobile phones. The next source is changes in databases, which are also widely available. And lastly, it's the emergence of AI and machine learning, and machine learning's need for real-time data.

So when you go to Google Cloud and build a real-time stream processing system, you would typically build something like this. You start by ingesting events using Pub/Sub, our globally available, distributed ingestion service. If you prefer Kafka, you also have choices: you can either run self-managed Kafka on GCP, or you can use Confluent Platform as another option for operating Kafka. After you've ingested your data, customers usually enrich it, aggregate it, and detect patterns in it, and they use Dataflow in streaming mode for this. If you prefer Spark or Flink, you also have options: you can run them on Dataproc.

Once you end up with enriched, cleaned-up, ready-to-use data, you typically want to do more advanced analytics on it, for example machine learning using our Cloud AI Platform, or you store your events in a data warehouse such as BigQuery and do further analysis using SQL.

Now, every streaming system needs a little bit of batch processing, and so lots of Dataflow customers are happy to realize that they can actually run the same code in streaming and in batch mode interchangeably. The only changes between the pipelines typically relate to the data source: in streaming mode that will be a streaming system such as Kafka or Pub/Sub, and in batch mode the initial data source will be something like files or databases.
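To make that concrete, here is a minimal Apache Beam Python sketch, my own illustration rather than code from the talk, where the transform logic stays identical and only the source changes between the two modes; the topic name, file pattern, and JSON parsing are assumptions.

```python
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_topic', help='Pub/Sub topic, e.g. projects/p/topics/t (streaming run)')
    parser.add_argument('--input_files', help='File pattern, e.g. gs://bucket/events/*.json (batch run)')
    known_args, pipeline_args = parser.parse_known_args(argv)

    options = PipelineOptions(pipeline_args)
    options.view_as(StandardOptions).streaming = known_args.input_topic is not None

    with beam.Pipeline(options=options) as p:
        if known_args.input_topic:
            # Streaming mode: an unbounded source (Pub/Sub).
            lines = p | 'ReadPubSub' >> beam.io.ReadFromPubSub(topic=known_args.input_topic)
        else:
            # Batch mode: a bounded source (files).
            lines = p | 'ReadFiles' >> beam.io.ReadFromText(known_args.input_files)

        # Everything downstream (enrichment, aggregation, warehouse writes, ...)
        # is the same transform code in both modes.
        events = (lines
                  | 'Decode' >> beam.Map(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)
                  | 'Parse' >> beam.Map(json.loads))
        events | 'DebugPrint' >> beam.Map(print)


if __name__ == '__main__':
    run()
```

Running it with --input_topic gives a streaming job; running it with --input_files gives a batch job over the same transforms.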
Now let's talk about the recent improvements in Pub/Sub and Dataflow, the ones from the last six months or so. We bucketed them into three major investment areas: one is improving the cost profile, the second is improving usability, and the third is improving security. On the cost side, we launched Streaming Engine and streaming autoscaling in GA, and we're going to talk more about them today.
They are now available in seven regions: two European, two Asian, and three US regions, and our goal is to deploy them everywhere.

Moving over to usability improvements, we have really invested in languages such as SQL and Python, to let you use what you already know for creating streaming pipelines. Quite a few folks like Python, quite a few folks like SQL, and we wanted them to be able to do streaming without needing to move to more advanced, more complex, and also more powerful languages such as Java. On the SQL side, a few months ago we launched Dataflow SQL in preview, and today we'll be announcing new features in it. On the Python side, we launched Python streaming support in GA and Python 3 support in GA as well. And lastly, we created a feature called hot key detection. It shows up in your worker logs: if we detect a condition that is causing your pipeline to slow down, we now tell you that there is a hot key so you can do something about it, and that alone often dramatically improves the performance of your pipeline.

And lastly, security. Here I will highlight two features: VPC Service Controls (VPC SC) and CMEK, which stands for customer-managed encryption keys. What do they mean? CMEK allows you to use your own keys to encrypt the state of your pipeline. If you don't use your own keys, you can still be assured that your pipeline state will be encrypted, just with Google-managed encryption keys; everything in Google Cloud, and certainly in Cloud Dataflow, is encrypted once it touches disk. With CMEK you have the additional control that the keys are yours and you are managing them. And what does VPC Service Controls do? It allows you to prevent data-exfiltration scenarios. You can set up a perimeter around your virtual private cloud, and that perimeter will not be violated by your applications: data will not leave it. So we ensure that data stays within this perimeter.

Now, I mentioned Streaming Engine previously, and it's an important feature; we've been working on it for quite a while, and lots of customers are already using it. To motivate why it is important, let me explain what problem it's trying to solve. Let's talk about traditional distributed data processing frameworks. You're seeing on the slide a visualization of what traditional distributed frameworks do: they create clusters of processing nodes. These are typically virtual machines with CPUs and memory, and they also have some sort of network-attached state storage associated with them. The individual processing nodes communicate with each other, and they also communicate over the network with a control plane that manages them.

Now, this traditional architecture works really well until you have pipelines that need to do lots of joining and grouping. Why is that? What happens during joining and grouping in a distributed pipeline? On a single worker, everything runs in memory, or maybe touches disk from time to time, and things typically work well. But once you end up with a distributed framework with lots of processing nodes of the same importance and function, you run into issues. The issues come from the fact that your individual VMs each store part of the state, and this state typically includes the key-value pairs that you're trying to group and join.
Well, you can't join or group them if the keys live on entirely different workers. You actually end up moving everything, in a sorted fashion, to the workers that own particular keys: first you have to decide which worker will own a key, and then you have to physically move all of the records for that key onto that worker. This process is called shuffling. You're shuffling data around, and it can get very involved, especially as your dataset grows. At the end of the shuffling process you end up with a sorted list on each worker, where each worker owns all of the records associated with its keys, and then you can group and join.
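To make the grouping step concrete, here is a tiny Beam Python sketch, my own illustration with made-up data, of the kind of operation that forces a shuffle; on Dataflow, the GroupByKey is where records are redistributed so that each worker ends up holding all the values for the keys it owns.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([
        ('CA', 120.0),   # (us_state, amount) pairs; illustrative data
        ('NY', 80.0),
        ('CA', 35.5),
        ('WA', 60.0),
    ])

    totals = (sales
              # GroupByKey forces a shuffle: all values for a given key are
              # brought together on the worker that owns that key.
              | 'GroupByState' >> beam.GroupByKey()
              | 'SumPerState' >> beam.MapTuple(lambda state, amounts: (state, sum(amounts))))

    totals | beam.Map(print)
```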
Things get even more complex in streaming pipelines because of the next issue: the difference between processing time and your business event time. Let me explain what I mean. Let's imagine we have a stream of sales transactions. They happen over a period of time; your customers are buying goods, and each purchase has a timestamp. Here's an example of a sales record: the time of sale, 10:30 a.m., then the good that was purchased, and the place where it was purchased. Now, this transaction will enter your stream processing pipeline not necessarily at 10:30 a.m., when the sale happened. Most frequently it will enter much later, and in many cases the delay can be hours, if not days.

Business users and data engineers usually want to abstract away the processing time. They don't care about the processing time; they really want to do analytics on the business event time. So they end up asking us to build technologies that let them run groupings and aggregations in time windows that are tied to event time, not processing time. And to accomplish this, we have to buffer data. You have to buffer data on the persistent disks attached to your workers, and it's a lot of data that needs to get buffered.

So in a traditional architecture you end up with two types of data stored on the disks attached to the workers: the shuffle data and the windowed data buffered for event-time processing. There's a lot of network communication between the workers, and as datasets grow, things get really brittle.

Streaming Engine is designed to solve this problem, and it solves it by separating compute from state storage. What is compute in this case? Compute is your CPU and memory, plus some temporary space on the worker, just enough to boot up. This compute unit communicates with a distributed database we call the Streaming Engine. It's a back-end service we operate for you: a distributed database that also provides useful functions such as shuffling. The benefit of this approach is that we're able to scale your pipelines much more easily, we use fewer resources on the worker side so your user code gets more resources, and we can support this architecture much better. Streaming Engine is now available in seven regions, and I would encourage you to use it today.
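In practice, turning these features on for an existing Beam Python pipeline is mostly a matter of launch options. A hedged sketch of such a configuration, with project, region, bucket, and key names as placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative flag values; project, region, bucket, and key names are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                  # assumed project id
    '--region=us-central1',                  # one of the regions where Streaming Engine is available
    '--temp_location=gs://my-bucket/tmp',    # assumed bucket
    '--streaming',
    '--enable_streaming_engine',             # move shuffle and state to the Streaming Engine service
    '--autoscaling_algorithm=THROUGHPUT_BASED',
    '--max_num_workers=20',
    # Optional: customer-managed encryption key (CMEK) for pipeline state.
    # '--dataflow_kms_key=projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key',
])
```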
Let me also explain how autoscaling is much better with Streaming Engine. I'm comparing the two architectural approaches with each other: on the left-hand side I have the Streaming Engine architecture, and on the right-hand side the traditional architecture. If you need to scale these pipelines, in the traditional case you end up scaling the entire worker, with its CPU, memory, and state, and because of that coupling to state you end up with a lot of inefficiency. On the Streaming Engine side, we can scale resources separately: we can scale compute separately and we can scale state storage separately. This allows us to be much more reactive to changes in the workload.

Here's an example of two pipelines processing the same workload. Every two hours there is a spike of data lasting about 30 minutes, and the pattern repeats, which is a very typical shape for streaming workloads. What happens in the Streaming Engine case is that we scale much more efficiently. We don't scale straight to the maximum number of workers you allow; we first scale to a reasonable number, and then we keep reviewing CPU utilization: can we extract a bit more performance out of these workers, or do we need to launch more resources? On the traditional architecture side, we scale pretty high right away and stay at that number for the duration of the workload. And when the time comes to scale down, when the workload is done, the Streaming Engine pipeline scales down very quickly. This is good for you as a user, because you're paying for the area under the worker graph: your bill is directly proportional to that area, and the smaller it is, the less you pay.

Well, this was an involved introduction to Streaming Engine. I know not everyone cares deeply about the performance internals of streaming pipelines, and we asked ourselves: do we have to make it so complicated? Can we just make streaming easier for every user? The answer is yes. We know that users like Python and users like SQL. We've launched, very recently, lots of good features in the Python SDK, including things like live pipeline update, the capability to drain a streaming pipeline, autoscaling with the Python streaming SDK, Streaming Engine support, and Python 3 support.
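To give a feel for what Python streaming looks like, here is a short, hedged Beam Python sketch, not code from the session, that reads from Pub/Sub and aggregates per key in fixed event-time windows; the topic name, the 'ts' timestamp attribute, and the message fields are assumptions.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Add --runner, --project, --region, --temp_location, etc. to run on Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # 'ts' is an assumed Pub/Sub attribute carrying the event timestamp, so the
     # windows below are based on event time rather than processing time.
     | 'Read' >> beam.io.ReadFromPubSub(
         topic='projects/my-project/topics/transactions',  # assumed topic
         timestamp_attribute='ts')
     | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
     | 'KeyByState' >> beam.Map(lambda tr: (tr['us_state'], tr['amount']))
     # Five-minute fixed (tumbling) windows on event time.
     | 'Window' >> beam.WindowInto(FixedWindows(5 * 60))
     | 'SumPerState' >> beam.CombinePerKey(sum)
     | 'Print' >> beam.Map(print))
```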
On the SQL side, as I mentioned before, a couple of months ago we launched the preview of Dataflow SQL. Today we are launching several new features. We're giving you the ability to use files as inputs into your pipelines, so now you can join a Pub/Sub stream with a set of GCS files. There are three sources currently supported: GCS files, which is new, BigQuery tables, and Pub/Sub topics. We also released a visual schema editor so that you can define the schemas of your data sources; without a schema there is no SQL capability, because we have to know the schema, and now it can be defined visually. And you now have the ability to store the schema in Cloud Data Catalog, which is our metadata storage service. At this point I would like to invite David, who will walk us through a Dataflow SQL example.

Good afternoon everybody, my name is David, and I'm a data analytics specialist at Google Cloud. For today's demo I'm going to pretend I'm a data analyst, and I've been tasked with creating some real-time dashboards to inform our regional sales managers. The data I'm receiving is a stream of sales transactions, and I basically have two challenges. The first is that I only speak SQL, so in terms of interacting with my data, the only lingua franca I use is SQL. The second is that I'm going to use only one hand for the demo. Can I ask the audience a question: raise your hand if you know SQL, if you would be able to write a SQL statement. Good, that means almost all of you will be able to apply this.

So what we have here is the technical implementation of that pipeline. The first piece is the stream of sales transactions: they come in through a Pub/Sub topic we're pushing them to. Then we have some mapping data sitting in a BigQuery table; that mapping data is the relation between the US states and particular sales regions, so when I aggregate the sales I can report them by sales region rather than by US state. There's also an amount and other attributes that I'll show in the demo. The next thing is that I need to process the data, joining it and calculating some aggregates; that's the SQL statement, which I'll show in more detail when I run the demo. Once the data is joined, I want to stream and write it into a BigQuery table, and ultimately I want to connect it to a real-time dashboard. There are multiple ways of connecting to that, and for the demo I'm going to use Google Sheets.

So let's go into the demo; can we switch to the demo, please? Okay, we're here. We start with the BigQuery UI. As Sergei described, we released the Dataflow SQL UI in public preview. To enable it, all I need to do is go into the BigQuery UI and then into the query settings.
Rather than using the BigQuery engine, I want to use the Cloud Dataflow engine, which I have here; I switch, and automatically you can see that I'm now able to create a Dataflow job based on the SQL statement I'm going to write. So the first thing I need to do is go and check which sources I'm going to use. As I described, I have Pub/Sub for my stream of sales transactions, and I have a table in BigQuery for the mapping. For the actual stream, I can go and see the schema of my transactions, which is really good, and I'm also quite happy to present the new feature Sergei described: we can now edit the schema of our transactions, so I can go here and change a field's type, which is really handy when I'm exploring the data I'm receiving. Now that I know the details of my events: I can see here the event timestamp, so I can do the aggregation based on that timestamp, and inside the payload I'm getting a few other things, a name, the city, the US state that I'm going to use to join with my mapping data, and, most importantly, the amount that I want to aggregate to calculate the total. So now I'm going to grab my SQL statement, because I don't have a very good memory and I don't want to tempt the demo gods, and copy-paste the query here.
So basically what we have here is a SQL statement that everybody will understand. What we're doing is aggregating the events in what's called a tumble window, which is the SQL-standard way to describe what we otherwise call a fixed window. Every five seconds I will aggregate the total of the amount I'm receiving from the events and emit the results. I want to group by sales region, because that's the mapping I'm doing for my regional sales managers, and for that I'm joining the BigQuery table I mentioned with the stream of events coming from my Pub/Sub transactions topic. Then, like in a normal SQL statement, I have my sum of the amount, and I also want to create the timestamps for my time series, because I'm streaming the results into BigQuery and building a time series; so I timestamp each row with the tumble start timestamp of its window, together with the sales region.

Now I can trigger the job here. The region is set by default, which is fine, and I'm going to use a table name that I can remember. I create the job and submit it. Because it takes a couple of minutes to spin everything up, the analysis, the query being transformed into Dataflow stages, and the workers starting, I can go to the job history and see an existing job: I have one I ran this morning, and I can open it in the query editor. I can first see the results of the events streaming into the target table, and more importantly, before I go into the results, let's have a look at the job, the actual Dataflow job. That's good: we can go into the details of what's happening behind the scenes and how that SQL statement got transformed into Beam stages.
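Based on that description, the statement is roughly of the following shape. This is a hedged reconstruction, kept as a Python string for reference, with assumed project, topic, dataset, and column names; in the demo it is simply pasted into the BigQuery UI with the Cloud Dataflow engine selected.

```python
# Approximate shape of the streaming Dataflow SQL query described in the demo.
# All identifiers (project, topic, table, columns) are illustrative assumptions.
STREAMING_QUERY = """
SELECT
  sr.sales_region,
  TUMBLE_START("INTERVAL 5 SECOND") AS period_start,
  SUM(tr.amount) AS total_amount
FROM pubsub.topic.`my-project`.transactions AS tr
INNER JOIN bigquery.table.`my-project`.demo_dataset.us_state_salesregions AS sr
  ON tr.state = sr.state_code
GROUP BY
  sr.sales_region,
  TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND")
"""
```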
Can Actually go with another level of detail we can see here that this. Is the pops up source, and this. Is actually the bigquery source, okay. So. That's really, good is giving me if I need to I can go into that level of detail and in. This particular case when I'm streaming and doing a side input with, the reference, data I'm taking, for bigquery so, I can actually have that data in every single worker to, join with with, my stream of data that's, good so I have that that view now, what I want is to, connect. That to a spreadsheet. Right to shits right so, I can go into the. Actual I'm, gonna switch into this top and where I'm using the bigquery engine, to create the the data that I am streaming, into that, particular table I describe it and I, have here this table again, I can see that I'm, getting all these calculations, there and. Actually I can export, that directly, into sheets. That's. Opening. It. You. See my data there I can, switch. This it will come up with another pop-up but that's fine I can then order, sort. In, the descending order, and hopefully. I should be getting. Most. Up-to-date. Aggregated. Events that's, good, thank. You. Okay, so that should update. Every couple of seconds let's keep it them running but you could see that actually I can do that easily, and connect from bigquery into, sheets and then stream. My data and see the results from there okay so that's good that actually, I. Probably. Solved, let's, say 80%, of my challenge, let's. See my and then we, have that, streaming of data working so we're actually capturing all this data but all the sudden I come I get from the salesman, is that actually the, mapping that it was done it was wrong so it was not really mapping properly, the US states with the region's so, there are few managers that I be pissed off because they are not getting compensated. In the right way so what. I need to do is actually change the mapping run so, the challenge is for the storica later for the data that already enriched I need, to backfill that right until.
Until now, there was no way to rerun that Dataflow job in batch mode using SQL and the UI. As Sergei mentioned, we now have a way to run batch jobs as well, which is really cool for exactly this kind of work, so as a data analyst I can solve this much more easily. The difference is that rather than running a streaming job, I want to run a batch job, and for that, as Sergei mentioned, we are now able to ingest data from an object store, from GCS. I can quickly show you the contents of this bucket: I have historical transactions there. So rather than getting the transactions through Pub/Sub, I'm going to get them from an object store; I have a couple of CSV files there. In Data Catalog I can define a file set that points to these CSV files, with a pattern like asterisk-dot-CSV, so I can gather all of them together. Even better, I can go into the Cloud Dataflow sources and see that definition; I have the schema defined there, and again I can edit the schema, which is super helpful.

What I can then do is run that same query. You can see the slight differences: I need to change the FROM clause, because it's no longer Pub/Sub, it's the Data Catalog definition of my file set, my CSV files. I also need to change the timestamp: I cast it from a particular column in the file, because it no longer arrives in the message metadata, and there's no longer a payload, just columns from the file that I can reference directly. Again, I don't want to tempt the demo gods; good, it's always psychologically reassuring to get the green tick. So I can submit the job, and this time, because the sources are bounded, what I actually trigger is a batch job. Let's see if that's correct. Okay, that's running, and again I'm going to go directly to a job I already ran, because it takes a couple of minutes to spin up the whole thing and we don't want to spend all our time waiting. So we have a job here, and you can see it's a batch job. Let's click on it. In the details I can see the input of the mapping table from BigQuery, and on the other side the CSV, a text-file source, again using Beam. And here we use something different: when we run in batch, Beam is also able to optimize, so in this particular case, rather than using a side-input join, it uses a CoGroupByKey.
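For anyone unfamiliar with that transform, here is a small, hedged Beam Python sketch, my own example rather than the generated pipeline, of joining two keyed collections with CoGroupByKey, much like joining a bounded transactions source with a states-to-regions mapping; the data is illustrative.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # (state, amount) records, standing in for the CSV transactions.
    transactions = p | 'Tx' >> beam.Create([('CA', 120.0), ('NY', 80.0), ('CA', 35.5)])
    # (state, sales_region) records, standing in for the BigQuery mapping table.
    regions = p | 'Regions' >> beam.Create([('CA', 'West'), ('NY', 'East')])

    joined = (
        {'amounts': transactions, 'region': regions}
        # CoGroupByKey shuffles both collections so that, per state key, a worker
        # sees all amounts and the matching region together.
        | 'JoinOnState' >> beam.CoGroupByKey()
        | 'TotalPerState' >> beam.MapTuple(
            lambda state, grouped: {
                'state': state,
                'sales_region': grouped['region'][0] if grouped['region'] else None,
                'total_amount': sum(grouped['amounts']),
            }))

    joined | beam.Map(print)
```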
A CoGroupByKey is another way to group and shuffle your data together. Then, ultimately, I can look at the results: I scroll all the way down to see which table the job wrote to, go back to BigQuery in BigQuery-engine mode, and have a look at that table. I can see that the data was recalculated historically, so if I had actually changed the mapping, it would have recalculated everything with the correct mapping. And on top of that, I would be able to rerun or resume my streaming job, and it would continue from there with the right mapping. So that's a nice way to solve the challenge. I think that's all I wanted to demo, and I think I managed to do it with one hand, so I'll hand it back to Sergei. Thank you.

Thanks, David. That was our most important announcement of the day: you can now develop Dataflow pipelines with just one hand.

As David motivated, and as the next feature addresses, every streaming system needs a little batch processing: there's the historical data processing use case and the error-correction use case. So today we're launching Flex RS. What is Flex RS? Flex RS is a much cheaper way to do batch processing with Dataflow. You can run the batch pipelines you have today, without code changes, in Flex RS mode and get immediate cost savings. How do you accomplish this? You launch with a new parameter, the FlexRS goal set to cost-optimized. That is all that's required for you to start generating roughly 40 percent savings on your worker costs. What's the catch? There's one important thing you need to remember about Flex RS: it submits your jobs into a processing queue, and there will be a delay between the time we accept the job and the time we actually run it. So if you have daily or weekly jobs where you can accept up to a six-hour delay in execution, that's a workload you can easily move to Flex RS. I would suggest using it for large-scale batch jobs, for historical onboarding jobs, and for anything that runs at a daily or weekly frequency.
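To illustrate how small that change is, here is a hedged Beam Python options sketch; the flag spelling is the Python SDK form, and the project, region, and bucket values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Same batch pipeline, submitted as a FlexRS (delayed, cost-optimized) job.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                # assumed project id
    '--region=us-central1',                # a region where Flex RS is available
    '--temp_location=gs://my-bucket/tmp',  # assumed bucket
    '--flexrs_goal=COST_OPTIMIZED',        # queue the job and run it on discounted resources
])
```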
How do we accomplish the cost savings? We pool together multiple types of resources: regular virtual machines together with preemptible VMs and other types of resources. We also use Dataflow Shuffle, which is our way of storing shuffle data in a distributed data store, to minimize the effect of preemptions. You will not see any effect from the preemptions that might happen during the lifetime of a preemptible VM; we hide this from you, and because we're pooling the resources, with a set of regular VMs and a set of preemptible VMs, your jobs will be able to keep operating on the regular VMs even in the case where all of the preemptible VMs have to be returned. Flex RS is available in seven regions, the same seven regions where Streaming Engine and Dataflow Shuffle are available, and we're planning to launch it in further regions down the road. And with this, I would like to invite Diego from Unity to talk more about how Unity is using Dataflow.

All right, thank you, Sergei. Hi guys, I'm Diego, an engineer on Unity's Data Platform team, and I'd like to tell you a little bit about how we used some of the tools that Sergei and David just showed you to renovate our data platform at Unity. It has let us consolidate our data, it has let us bring structure to our unstructured data as it arrives, and of course we've also been able to bring our latency way, way down with real-time streaming pipelines.

As the Data Platform team, our goal is to provide a single source of truth for Unity to discover, consume, and transform data into insights, decisions, and products. It's a pretty straightforward but also pretty broad mission statement for a data team. In case you're not aware, what Unity does is provide tools for game developers: our core engine product is used to develop everything from mobile to triple-A games, and we also run one of the world's largest mobile ad networks, which serves ads in those mobile games.

As the Data Platform we support ingest for a bunch of Unity services, so that's things like developer analytics, crash and performance reporting, the Unity Asset Store, Cloud Collaborate services, which allow developers to collaborate on cloud-hosted Unity projects, and other real-time event sourcing as well. Now, a huge part of that dataset comes from the ad network, which commands about 45 percent of the top 1,000 mobile games, and all in all this event stream adds up to about 25 billion events per day, so a modest daily data volume. As of late, Unity is also making an attempt to push beyond just game development: we're hoping we can penetrate industries that require any kind of 3D computer-aided design, so naturally that's things like film and animation, but also architecture, civil engineering, construction, and automotive.
And so, as the Data Platform, we expect more data sources to emerge for us, and possibly richer data sources as well.

All right, so this goal we had of a single-source-of-truth platform at Unity has some historical motivation. Over the past couple of years Unity has made several acquisitions, and each of those acquisitions came with its own data systems and data pipelines, which were all sitting around the company, in effect, in different data silos. This meant that as a data user at Unity, say a business intelligence analyst or a data scientist, I probably needed to be familiar with the particular idiosyncrasies of the systems I was building on top of, and there was no good way for me to join against other data silos in the company and potentially extract more value there; the barrier to doing that was really high for us. A lot of the pipelines in the system were also built on daily ETLs, so you can imagine that a failure case here means potentially days of latency. In addition, all of this data flying around the organization was pretty much entirely unstructured JSON, so things like schema evolution, schema management, and even event validation were pretty much non-existent at that time. The final problem was that a lot of these components were built before the introduction of more rigorous privacy laws like GDPR, so compliance was not a first-class consideration in these designs, and you can imagine that encrypting everything we have in storage for compliance would just waste all of our resources.

So the net effect of this situation, where we have all of these disparate data systems lying around, is that one of the earlier iterations of building a central data platform at Unity had an architecture that looked something like this. I don't need to tell you that this looks horrible; it's horrendously complicated, I personally can't even tell you what most of those components are, and there's really just no way that this thing is consistent or behaves as you would want it to.
Similarly, the set of technologies in use in that crazy architecture looks like this: also a total mess. We have all of these workflow managers and processing engines and a ton of different warehousing solutions that came with all of those acquisitions, and as you can imagine, it's a maintenance nightmare. Again, as a user of some of these systems, the latency for me is going to depend entirely on what fell over last night and whether the data engineers had time to re-kick those jobs.

So the upshot of this situation is that we clearly saw a need to drastically simplify everything about it, and as we moved to Google Cloud last year, we had the opportunity to do just that. We have now selected a much more reasonable set of GCP technologies to meet our needs. BigQuery becomes our main warehousing solution. Dataflow is used for both stream and batch; as Sergei mentioned, it's awesome. We do use the Confluent Platform to run our Kafka nodes in this system. Cloud Composer still helps us with a few auxiliary workflow-management things. Our HTTP ingest endpoints run mostly on GKE, and Cloud Bigtable is in there for some magical runtime-loaded configuration for our Dataflow pipelines, which I can explain a bit more later.

As we embarked on this journey to totally rebuild this ingest and simplify everything, we saw the need to focus on three big goals: consistency, latency, and compliance. And this is an architecture diagram of the new ingest pipeline that we now run on Google Cloud; hopefully it looks a lot more palatable than the last one. On the left there we have a few of those HTTP ingest endpoints, which are services that post events to applications running on GKE; those applications then forward the events to a few Kafka clusters, which are divided partly according to geographical location and partly according to business domain. The top fork you can see in the diagram is scoped to the ads and monetization organization, and the bottom is kind of everything else. So naturally, as part of this ingest pipeline, there's still some organizational division expressed at ingest time, which is natural, but you can see that all of that quickly goes away when we consolidate in Dataflow. So how do we get all of that unstructured JSON out of Kafka, get it structured, loaded to warehousing, and ready to query, all in real time? Dataflow lets us do all of that, which is great.

So remember, the first goal is consistency. Even though all of that incoming data is still unstructured JSON, we now have, in the Dataflow processing stage, a requirement
for any event that arrives at that stage to match an explicit Avro schema. Those schemas are stored in a central repository and published as configuration into Bigtable, as I mentioned earlier, and that configuration is loaded into the pipeline at runtime to verify the events. Any events that come in for a particular event type and do not match their specified schema are routed to a dead-letter queue for possible reprocessing later.

And thanks to the unified stream and batch processing model in Beam, we can reuse the same core business logic and code for any raw data we need to process, whether that's the real-time event processing from Kafka, historical data sitting in storage that needs to be backfilled and onboarded to the new platform, or reprocessing of events from that dead-letter queue.

So here's an example of a few of the pipelines we actually run. The two on the left are real-time streaming: on the far left we're reading from one Kafka cluster, and in the middle we're reading from three separate Kafka clusters across the world. The pipeline on the right was a batch backfill of historical data, pulling raw JSON straight out of GCS from the old platform and onboarding it to the new platform. Those ingest stages all look different, but the great thing is that the main processing blocks in all three of those pipelines are exactly the same: the process-messages block contains the exact same business logic for all of them. And that fork you see in the pipeline is the extraction of PII data for compliance, which I'll go into a little more later. For anyone curious, this business logic is written in pure Scala, and it uses a popular library out of Spotify called Scio.
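Unity's implementation is in Scala with Scio, but the dead-letter pattern itself is straightforward to sketch. Here is a hedged Beam Python approximation, not their code: events that fail validation are tagged to a side output instead of failing the pipeline, with a simple required-field check standing in for real Avro schema validation against a registry.

```python
import json

import apache_beam as beam
from apache_beam import pvalue

REQUIRED_FIELDS = {'event_type', 'event_timestamp', 'payload'}  # assumed schema requirement


class ValidateEvent(beam.DoFn):
    """Emit well-formed events on the main output, everything else to 'dead_letter'."""

    def process(self, raw):
        try:
            event = json.loads(raw)
            if not REQUIRED_FIELDS.issubset(event):
                raise ValueError('missing required fields')
            yield event
        except Exception:
            # Keep the raw record so it can be reprocessed later.
            yield pvalue.TaggedOutput('dead_letter', raw)


with beam.Pipeline() as p:
    raw_events = p | beam.Create([
        '{"event_type": "purchase", "event_timestamp": 1576300000, "payload": {}}',
        'not valid json',
    ])

    results = raw_events | beam.ParDo(ValidateEvent()).with_outputs('dead_letter', main='valid')
    results.valid | 'PrintValid' >> beam.Map(print)
    results.dead_letter | 'PrintDead' >> beam.Map(print)
```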
The second goal was latency. Remember that on the old platform, the time until data was available to query could be measured in days. We have now made the leap to real-time streaming, and it's clearly a complete paradigm shift for us. You can see in the graph on the left that the data watermark lag stays under ten minutes, and that graph is actually a little old: when I look at our healthy pipelines today, I typically see watermark lag below two minutes, which is pretty awesome. The load to BigQuery, however, is still batched at 15-minute intervals, but those loads are triggered and fired from the same pipeline, so we still get Beam's and Dataflow's event-delivery guarantees all the way through to the BigQuery warehouse.

The autoscaling feature that Sergei talked about is also hugely helpful to us here. As we've onboarded more services this year and the data volume has grown organically, we've seen a load increase of about 150 percent and a natural scale-up that met it easily. The other thing about autoscaling is that it allows us to run in a follow-the-sun mode: any particular pipeline's load might have a certain geographic distribution and therefore a time distribution, and the cost saving we see here is great, about 35 percent compared to the baseline of running the peak allocation at all hours.

The final goal was compliance, and since we were able to build this in from the beginning this time, it actually turned out to be pretty easy; we just had to write the business logic for it. So, the fork in the pipeline I mentioned earlier: when an event arrives at that processing block, some of its fields may be annotated as PII. Those fields are extracted, encrypted, and written to a separate datastore with stricter access control. So we're encrypting only the data we need to in order to ensure user privacy, and we have compliance.

Cool, so that's our three goals met. Where can we go from here? Well, some of the things that Sergei and David showed you today are pretty exciting for us. Streaming Engine sounds super cool: we ended up spending a lot of our time this year worrying about optimizing shuffle behavior, around event-destination routing and some hot-keying issues we had to tune around, and hopefully this decoupled shuffle and storage layer can help us worry a lot less about that sort of thing. The improvement in autoscaling behavior is obviously also super attractive. I mentioned that we've had to do a good amount of historical backfill from the old platform, and even last month we had yet another acquisition that we might need to backfill data from, so using Flex RS to schedule those batch jobs sounds great to me. The customer-managed encryption keys: if we can encrypt the entire pipeline state, that lets us protect PII data, and really all the data that's in flight at any time, which sounds awesome too. And lastly, and perhaps most excitingly, the streaming SQL tools that David just demoed: as we've solidified this ingest pipeline and it's in a pretty good state now, we're starting to think about building data-discovery and exploration tooling on top of it, and these look perfect for that sort of thing.

So that's about all I have.
If you were thinking about using these technologies in a similar capacity, hopefully this was at least a little bit useful to you, and I'm going to hand it back to Sergei now to close out.

Thanks, and thank you so much for joining us today. Let's quickly recap what we saw. Separation of compute and state storage: hopefully we were able to show you how this concept lets you save costs, get much more responsive autoscaling, and get lots of other goodies. Python streaming is now in GA, as is Python 3 support. Dataflow SQL is in preview, and we launched a bunch of new features today. Flex RS is also generally available, so go and save on those batch costs. And if you would like to improve the manageability of your encryption keys, CMEK is a great choice for you. Please don't forget to fill out the feedback if you liked the session, and if you didn't, then I'll see you at the after-party. Thank you so much.