Rethinking Business: Data Analytics With Google Cloud (Cloud Next '19)
Thank you for coming to the session. I know it's afternoon; I will try to make sure I don't let you fall asleep. We are going to have an exciting set of product demos, and we have a couple of our customers sharing their stories of how they are using the platform. This morning in the keynote, did you get a chance to see the smart analytics demo? Was it good? Great. So a key thing is we want to take that whole theme about smart analytics and give you more context: what other product announcements we are making, what we are launching, and all the details that you need.
But before that, let's talk about what's changing in the world. Industries across the board are changing. If you think about the automotive industry, 10, 15, 20 years back it was a very different industry. Now organizations like Cruise Automation are collecting real-time information from all of their vehicles, building self-driving cars, and making real-time decisions at massive scale. So the whole ability to create these large amounts of data and make decisions is super important in organizations. Another great example is AirAsia. Democratizing data insights within an organization is becoming more and more critical. AirAsia was here with me at the session at Next last year, and they shared that they've saved roughly five to ten percent on their operational costs. In an airline, that's a pretty large number. And this is possible only because they were able to take all the data and insights within the organization and make them available to all of their users, who are doing different kinds of activities
within the organization. So it's very critical not just to have the infrastructure that can scale to do massive amounts of queries, but also to make those insights available to everybody within the organization. Then there is another aspect to it. A lot of customers in different industries, like the Broad Institute, which was founded by MIT, Harvard, and Harvard hospitals, are producing something like 12 terabytes of data per day and leveraging cloud computing infrastructure to do genome sequencing every 12 minutes. So it's interesting to see how organizations across different industries are using cloud computing, especially our data analytics platform, to take massive amounts of data, derive insights from it, and make decisions at every point. We're seeing momentum with the platform across all industry verticals and all regions. There are a few interesting facts I wanted to share with you today. One: with BigQuery, we
now have more than an exabyte of data managed by BigQuery for our customers. Our largest customer now has more than 250 petabytes of data in a single data warehouse, and last year we had roughly 300 percent growth in data analyzed on the platform. So it's fascinating to see all this growth, with organizations across the world now leveraging the platform and the insights that they can gather from Google Cloud.
So let me talk a bit about our philosophy around our investments in the analytics platform. We have this theme called radical simplicity: deriving insights from data needs to be super simple, to the point where anybody within an organization should be able to do it. How do we make that happen? One, the most important thing, is investing in serverless: you should be able to bring any data, not have to worry about infrastructure, put it into Google Cloud, and start analyzing it. The second is providing a comprehensive solution that covers the end-to-end lifecycle of your data management. Then embedding ML: not just using ML to improve our products, but also making sure ML is available to everybody within the organization. Then, we are a firm believer in open cloud, in making different open-source components available for you to run at scale within our environment. And finally, all the enterprise capabilities that you expect us to have within the platform are super important. So, a quick visual. The key thing about a serverless data platform: in traditional platforms you would have to spend time figuring out your capacity requirements, how many servers you're going to need, the provisioning, the monitoring. So much goes into that. But our key thing is, you
shouldn't have to worry about all that. That's all managed by Google; we take care of it. You just bring as much data as you need and start analyzing it from there.
From a platform perspective, I know there are a lot of logos here, and I'm not going to go in depth on every one of them. But from an end-to-end lifecycle perspective, we have services that allow you to ingest data. That could be real-time streaming at scale with Pub/Sub, where you can collect millions or billions of events per second. There are services for transferring data from on-premises systems and different SaaS applications. We have a service for IoT data coming in. So all the ingestion services are available to you. Then for all the real-time and batch processing we have Dataflow, which is our streaming engine; it allows you to do batch and streaming with a single
API. With Dataproc you can run managed Hadoop and Spark environments. Dataprep allows your data analysts to do data wrangling. All of these are available to you. Then in data warehousing you have BigQuery, and you can use Cloud Storage for storing massive amounts of unstructured data. And on the advanced analytics side you have our Cloud AI services. So that's the whole portfolio. In addition to that, we have Cloud Composer, managed Airflow, for workflow orchestration. And we are announcing two new services you heard about today: Cloud Data Fusion (more about that shortly) and Data Catalog. That completes the whole portfolio that we have.
With that, let me share a few more things on different scenarios. When we talk to customers, there are three main scenarios that they leverage the platform for. Thomas mentioned these earlier today, and I will try to take them to the next step and give you more details. One is modernizing data warehouses, making them broader than just data warehousing for reporting and dashboarding; it's more about intelligent decision making, predictions, and things like that. We'll talk more about that. The second is taking the large-scale Hadoop clusters that customers are running on premises and moving them into the cloud, to get much better TCO but also the scalability that cloud can provide. The third is streaming analytics. I think by 2025 more than 25 percent of the data generated will be in streaming form. As industries change, you will need capabilities that can collect this streaming data and make real-time decisions on it in the application that you are in. So that is super critical, and that's what customers are using. Other than that, we heard a lot from our customers about breaking
the data silos, making it easy to get data into the platform, and also protecting and governing the data. So we have those solutions available.
So let's talk about the first thing. Earlier today you saw a demo of Data Fusion. Cloud Data Fusion is our fully managed, code-free data integration service. The whole idea is to make sure that bringing data to GCP is super easy for our customers. Data Fusion is based on an open-source project called CDAP, and it gives you a visual tool to drag and drop and pick from hundreds of connectors that we already have for on-prem systems, different applications, and so on. Then you can transform the data that's coming in and publish it into any one of the data stores that we have. It could be BigQuery, it could be Cloud SQL, it could be any of the other data stores available. The goal here is simplifying migration of your data to the cloud, transforming it as it comes in, and making sure you have a single place to manage all your data pipelines. Finally, it gives you the ability to do visual transformations as the data comes in, track lineage of that data, and provide data quality on top of it. This is one of our big releases for this Next. It's available in beta, so you can go ahead and use it; you saw some of the demos earlier today.
There are two other things we have in the same realm. If you have used BigQuery, it has had connectors from our first-party services, like AdWords and DoubleClick, that make it easy for customers to bring data in from these applications, put it into BigQuery, and analyze it. We have extended that to our partner ecosystem. So now I'm happy to announce we have, to be exact,
I think, 135 connectors across different applications, starting with Salesforce, Marketo, Adobe Analytics, Facebook analytics, and Workday. All these different SaaS applications are now available, and you can all start using them in your BigQuery environments. The third thing: we know there's a big challenge in migrating the traditional data warehouses that are running on premises, let's say Teradata, or Redshift if you are using that. We now have tooling available to easily migrate those to BigQuery. So that's the key thing: we're providing all this tooling to make it easy to bring data into GCP so that you can start leveraging the other capabilities that we already have.
Customers want us to be able to help them understand their business better; they don't just want us to do their banking. Our employees' expectations are changing as well. They'd like us to provide them with relevant data and insights so that they can make smart decisions and
do so in a timely manner. And so to do that we need to look at digital transformation, and a key part of that digital transformation is data.
Got it. So can you share some use cases with BigQuery, or other things that you're using, so that we can get more insight into what you're doing?
Yeah, absolutely; let's talk about some of the BigQuery use. Yeah, of course. So BigQuery has been one of the key tools that our data scientists use on a daily basis, and it has helped us a lot in terms of scalability and handling heavy computational queries on top of the different data sets. I will give you a real story. Some of our data scientists are using customer transaction data to build aggregated, de-identified insights for our institutional-level clients, to help them understand their customers better. Those analyses include what your loyal customers look like, where they are living, and who is lapsing from your business. We are analyzing billions of transactions. Back at that time, the data was around 17 terabytes for a single table, and it took literally five days to extract the data and get insights from the data set. That was quite costly for us in delivering insights to our clients, and it also limited our data scientists' ability to continuously develop new insights and add new innovations to the data set. By moving that pipeline to BigQuery, we successfully reduced the time to get the insights from five days to 20 seconds, which is a big achievement for us. It not only enhanced the efficiency of our data scientists but also allowed us to start rethinking the data science process in the organization. So our
data scientists started to meet our clients directly, rather than authoring queries sitting at the back end. They are bringing their insights to the clients and taking direct feedback from them. Together we even conducted a customer-led design workshop to gather customized insight requirements and support our clients better. The reason we are taking on customer insights is that we have the confidence that BigQuery can help us handle the heavy computation on the back end. We've managed to work with the airline industry to help them analyze their customers' shopping behavior before and after flying, so they can use those insights to optimize their campaign effectiveness. We've also been working with a few retail companies in Australia to help them identify which location is the best for opening a new store. This kind of customized analysis helps us position ourselves not only as a service provider but also as a strategic partner for our clients from the data and analytics perspective. Currently we have streams of data scientists working on bringing more data, like payments, supply chain, and credit rating data, onto GCP, and combining
those different data sets together, commingling them to unlock the value of the data in the bank.
I think that's awesome. I just heard five days dropping to roughly a few seconds. That's the power of what you can do with data processing and analytics at scale, and I think it's super interesting to hear. Can you share more about Cloud Composer usage, how you're using Composer for orchestration?
Yeah, of course. The team has been exploring different orchestration tools, and we've been using Composer since early on. It's a great tool for teams to get going, and our data scientists love it because it's Python-based and it's very easy to manage your dependencies and the multiple layers of data pipelines. We currently have daily and weekly data pipelines running on Composer to generate hundreds of features and terabytes of data. However, with the growth of the teams and the complexity of the data pipelines, we are starting to meet challenges, like running multi-tenancy on Composer, and we've been very excited to hear more announcements from the Composer side this time.
That's great; we'll share some more. So Keith, can you share more on how you made the decision to go to GCP, and what everybody here, especially in industries like yours, should think about as they move to cloud and make that decision?
Absolutely. I think when looking at moving to a cloud provider, one of our key requirements was that we needed a provider to help us get the most out of our data. So the core data capability is very important. Services on top of data, AI services, ML services, and partnering with someone who has those services, are also absolutely critical. But also, data doesn't live in a vacuum, and where the data is is where application
delivery starts to converge. So when looking at GCP and Google, we found a provider that has those AI and ML services and also has the application delivery components. So we're also very heavy users of GKE, Cloud SQL, and a number of other components, as well as the underlying data capability. I think as an ecosystem that's great. As a financial services organization, there is a whole other suite of considerations that need to be overlaid on an implementation. As a heavily regulated industry, it's very important that when implementing a cloud environment it's not only the awesome technology that's there; it's balanced with great controls that can meet the expectations of your regulators, and that can ensure you hit the privacy expectations of regulators and also of your customers. And as a final point, an interesting piece about the cloud implementation journey is that once you're there, things become a lot faster in terms of your ability to deliver on them, but it can then shine a bit of a spotlight on yourself internally as an organization, on your processes and your ability to actually internalize and deliver on that change.
Good. Thank you, thanks a lot for sharing. Thank you very much. Thank you.
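As an aside on the Composer pipelines described in the conversation above: Composer runs Apache Airflow, where a pipeline is declared in Python as tasks plus the dependencies between them. The sketch below is not Airflow code and is not the customer's pipeline. It uses only Python's standard `graphlib` module (Python 3.9+), with hypothetical task names, to illustrate how declared dependencies determine the layers in which a multi-layer pipeline can run.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical task names (assumptions for illustration only).
# Each task maps to the set of tasks that must finish before it starts.
PIPELINE: Dict[str, Set[str]] = {
    "extract_transactions": set(),
    "extract_customers": set(),
    "build_daily_features": {"extract_transactions", "extract_customers"},
    "build_weekly_features": {"build_daily_features"},
    "publish_insights": {"build_daily_features", "build_weekly_features"},
}


def execution_layers(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into layers; tasks in one layer have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    layers: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        layers.append(ready)
        sorter.done(*ready)  # mark the layer finished to unlock the next one
    return layers


if __name__ == "__main__":
    for depth, layer in enumerate(execution_layers(PIPELINE)):
        print(depth, layer)
```

In this toy graph the two extract tasks form the first layer and could run in parallel, while each later layer waits on the one before it. An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this ordering.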
So it's very interesting, as we have seen in the last couple of years, how different industries have been starting to adopt cloud and use some of our analytics capabilities for different scenarios. With that, let me share a few things about what's new with BigQuery at this conference and the different things we are announcing. One: we had a goal last year to launch BigQuery everywhere, and we have been steadily increasing our footprint globally. This is super critical, as organizations want to keep their data in specific geographical locations. We've launched around 12 regions in the last year, and we'll continue the momentum going forward to make sure we are available in every region where Google data centers exist. So the work is not done; we will continue doing that. But we're already in all of these different regions, and we should be in your region, wherever you are.
The second big thing that we are announcing today: BigQuery supports two different pricing models. It has an on-demand model, where you pay per query for whatever data you're accessing. The second is a flat-rate model, which gives you price predictability: you buy X number of slots for the whole month and then use them. What we are announcing today, in alpha, is our Reservations API, which will give you two capabilities. One, you will be able to go online and, if you're registered for the alpha, start buying slots directly, which means you can say, 'I want 2,000 slots.' We are also lowering the entry point, and we are making a 500-slot BigQuery
flat rate available, which should reduce the cost of entry if you want to get started at a lower level. So that's one. The second thing it allows you to do is quickly and easily manage resources. Let's say you have 2,000 slots and you want to distribute them across four different teams and say, 'Everybody gets 500 each,' since different kinds of queries can have different priorities and so on. You can do that, but still make sure that any unused slots are available for everybody. So the key thing is that the compute resources you have are always available for you to use, but you can allocate them across your organization very easily. This has been one of the asks from our customers for quite some time now, and so this is going to be available.
The second thing: a couple of months back we announced the Storage API. There are a lot of organizations who are putting all of their data in BigQuery. BigQuery storage is their structured
storage layer for all of that data in the organization. And SQL is a great language for a lot of things, but not for all things. We have a lot of customers who wanted to use Spark or Hadoop on top of the same data. Why would you want to have the same data copied into GCS and BigQuery and all the different storage layers just so that you can process it? So we have a high-speed Storage API available. With this, the data that's stored in BigQuery is now available from any of your Spark or Hadoop workloads. You can use Dataflow for batch jobs from BigQuery if you want to. ML Engine, ODBC drivers, all of them will be able to directly leverage the same storage layer at high speed, and you'll be able to run all these different types of workloads on the data in BigQuery. So this just expands what you can do.
The third thing that we have coming in BigQuery: last year at Next, I think in July, we announced the beta of BigQuery ML. BigQuery ML will be going to GA a few weeks from now. Along with that, based on the demand that we are getting, we have k-means clustering available. So if you want to do segmentation, customer segmentation, those kinds of scenarios, you'll be able to do that very easily with just a couple of lines of SQL code. You can do matrix factorization, so you can build recommender systems, and you can import TensorFlow models directly into BQML. So those are the three key things.
The fourth key announcement, which we announced earlier in the keynote, is around BI Engine. The whole idea of BI Engine is that it's a fast, low-latency analysis service. You
don't have to create any kind of models or anything; it automatically accelerates queries on top of the data that's in BigQuery. Our goal is to have response times in milliseconds, under a second in most cases, so that you can do interactive reporting and interactive dashboarding very easily across your organization at high concurrency. So that's another thing that's available. So I've
talked about a lot of things here. Earlier I also mentioned the 100-plus SaaS application connectors, and we talked about BI Engine. So let's do a quick demo. Let me call upon Michael to come and show us some of the capabilities that we're launching.
So I'm going to show you what it's actually like inside BigQuery to get data from an external source and bring it in through a transfer. I'm going to try to go through this pretty quickly. Let's click the Transfers button in BigQuery, and that takes us to a view where I can see the active transfers that I have now. I'll hit the Create Transfer button here. We have Google's built-in transfers down here; I can transfer from Google Play, Google Ads, sources like that. But now I can click Explore Data Sources there and, just like Sudhir said, here we have a long list: more than a hundred external data sources built by our providers show up in this list. For example, here is an Adobe Analytics connector, a highly requested source. This one is made by Supermetrics, one of our close partners. We can see details about this connector, I can enroll in it, or I can search for others on the marketplace. Another example: Facebook connectors. We have some Facebook Ads data here that you can see. I can search for Salesforce as well, and when I search there, the top result is a Salesforce connector built by Fivetran, another one of our really great partners. I can enroll in this connector right on the marketplace and choose the project that I want to enroll. I've already enrolled for this project, and because I did, it shows up automatically in my drop-down list right there. So I hit Salesforce by Fivetran, and then I can enter the name of the transfer that I want and choose the schedule that's right for me. We can go weekly or, in this case, daily. I'll
select the destination dataset inside BigQuery that I want the data to go to and then hit Connect Source. Right here I get a warning: this is asking me for permission for that connector to write data into the BigQuery dataset that I selected. I'll hit Accept, and then up comes this pop-up from Fivetran where I can authorize the connector that I'm interested in. Normally this would ask me for my Salesforce password. I've already done that, though, so when I click Save this is going to create the connection for me and take me back to the transfers page, where I can complete my settings for the transfer. I can also choose to get notifications in case the transfer fails. So let's click Save, and that configures the transfer for me. There you can see it: the transfer run is now pending. That's really all that I needed to do. That's how easy it is to go all the way from choosing one of 100-plus sources to getting it into a transfer that goes right into your BigQuery dataset.
This particular transfer takes about seven minutes, so I have a dataset already set up where you can see what it looks like once the data is actually inside BigQuery. Let me show you. Here's that Salesforce data. We can preview the leads table; there's city data and data on the company for each of these rows. We could query that, join it with other data, and integrate it with other information inside BigQuery. But now I'm going to show you the BI Engine feature with
this data. Like Sudhir mentioned, with BI Engine we have the capability to run really fast, sub-second-latency queries. That's because it's running from memory, from RAM, inside GCP. I can go ahead, create a reservation, and decide on the capacity that I want with BI Engine. Once I've done that, what can I do with it? Here's a great example. This is a Data Studio dashboard running off BigQuery on BI Engine. As I'm clicking around here, say filtering down to nurturing and new leads, or maybe slicing and dicing by Houston and Dallas, it's reacting really fast because it's using BI Engine. So please try out BI Engine today; it's in beta. And check out the external data sources on the marketplace. Thank you.
I think the key value we can get is by connecting all of these different types of applications. Organizations are using various types of applications now, and bringing all that data together, having analytics across all of it, and deriving insights is going to be very interesting for organizations. Making that easy is one of our key goals. Other than that, there are a lot more things that we are working on and announcing. I won't go in depth on each of them, but here's some additional information. Later this year we will have the ability to do federated queries on top of Parquet or ORC files directly on GCS. You'll be able to do federation across Cloud SQL, which would be another data source for queries from within BigQuery. Other than that, there's a good economic advantage report created by ESG. You should take a look at it if you're moving to cloud, especially with BigQuery; there are massive savings you can get from a total cost of ownership perspective. Let's
switch gears and talk about the second key scenario: running large-scale Hadoop and Spark workloads on GCP. One of the key things in our value proposition is that we let you pick any of the open-source projects that you want to run, through Dataproc and Composer, the technologies that we have. For example, we have been continuously adding more and more projects. Now you can leverage Presto; you could already run Hadoop, Spark, and various other projects. Underneath, it's secure, and we do the management of it: we can launch the clusters, we can shut down the clusters automatically, and so on. If you really look at the value proposition here, and I won't go into depth on each of these points, the key thing is the comparison between being on-prem, managing it yourself on Compute Engine, and using a managed service. Green is what you will have to do, and blue is what we take care of. So just focus on the last column: you just have to manage your code, write the code and deploy it, and we take care of the cluster management and everything else. That's the biggest value proposition for
the cloud. Especially with ephemeral clusters, you can achieve massive savings, because you don't have to have static clusters running at scale throughout the day. With that, let me call upon Jonathan and [name unclear] from Booking.com to share more of what they are doing.
So why don't you introduce yourself and tell us more about Booking.com?
Sure. I'm [name unclear], a principal developer at Booking.com. I work on enabling cloud technology for Booking.com and opening that up to Booking. Booking is the largest online travel agent in the world. We employ over 17,000 people and we have offices in over 120 countries. So of course we work with a lot of data as well. I'm joined by Jonathan.
I'm John. I'm a data scientist working in data quality at Booking, so I hear quite a lot about the products we're putting together here.
Got it. So what were the key challenges at Booking.com before you started the migration to cloud?
Sure. At Booking we run quite a large installation of Hadoop, and the workloads on it are mostly Hive and Spark workloads, both production workloads and human-interactive ones. We have over a thousand daily users on these clusters. Because they all work together, there is a lot of contention for resources on these clusters. That was a huge challenge for us, and it was an opportunity for us to use cloud and give the data scientists, especially for the more data-intensive workloads, personalized capacity: basically, dedicated clusters per user. That was our proof-of-concept work, which we started in late 2017. It was very successful with our data scientists, and it became the business case for later triggering a big data migration to cloud.
So that means every data scientist can have their own cluster that they can spin up and then work on. Is that right?
Yes, that's right. The
default is a large multi-tenant cluster where they contend for resources, but they all have the option to elevate that to a dedicated cluster for themselves, where the data scientist decides, up to a certain limit, the size of the capacity that he needs.
Got it. That's interesting, because that's one of the benefits of moving these things to cloud: having scenarios where you have static clusters but also burst for specific workloads. That's really good. I know we have worked together on some interesting challenges you had, and we have incorporated some of them into the product portfolio. Can you talk about that?
Yes. Aside from the challenges of moving the data to cloud, integrating the data, and making these clusters appear with the data on them and available to the data scientists, the first thing the data scientists asked for was for their toolbox to be the very same as on-prem: the same libraries, the same integrations with on-prem technologies, and so on. This required customization of Dataproc, installation of libraries and tools, et cetera. We found out pretty quickly that the time to spin up such clusters went up quite high, and that was impacting the user experience. So our ask to your team was to make it possible for us to create customized images for Dataproc, which, in collaboration with Google, we have now managed to get to GA. So that is what we have now.
That's great. I think we always learn from our customers. It was a great scenario, and we were able to put that in quickly so that everybody can benefit. So,
Jonathan, why don't you share more about what you have been up to with the whole set of technologies?

I'd love to. Working with our Google counterparts, we started an exploration, saying: now that we are in cloud, there are some tools available to us, like BigQuery and BigQuery ML, that we don't have on-premise, so let's see if we can use them for a case close to my heart. Can we surface some data quality issues that would be very difficult to find otherwise? The scenario we chose to explore is very Booking in nature. We of course serve many properties on the website. Each property has many room types, and a room type might represent many rooms, but each of those room types is quite particular. At our scale that means millions of properties, so tens of millions of room types. And maybe most particular for a visitor is that those room types have lots and lots of facilities, or potential facilities: something like 176 of them. So the scenario would be: a customer visits the website, and maybe they have a particular facility in mind that they'd really like to make sure is there, a bathroom, a TV, who knows. They can filter for this, which is really helpful. From our side, however, we then need to make sure that that data is correct. How could this go wrong? Well, if you forget to list a facility as a property manager or owner, you can lose customers by way of the filter. And going the other way, if you accidentally say that you have it, then you might be misrepresenting yourself, and the customer experience is quite odd when they arrive and the bathroom isn't there. That's not so good. I don't know about your trips.

Yeah, yeah.

So how do we fix these potential issues? They also might tell us something about ourselves: maybe we can ask better questions of the property owners. We
can learn things about how we're putting this in a form, so that certain repeated mistakes can be eliminated. So this is certainly added value for Booking if we can get it right, since we're the intermediary.

Okay. So again, I mentioned the data: it's reasonably long, tens of millions of rows, and quite wide. It's very boolean, so it's yes or no to having a facility, but 176 of these is quite a lot. We wanted to attack this using something in BigQuery ML; in particular, we ended up using k-means clustering. Now, why do we want to try that perspective? Well, we could attack it with rules. We could say, ah, we know for sure that if you have pay-per-view channels you'd better have a TV to watch them on, right? That's pretty reasonable. However, 176 facilities lends itself to lots and lots of subsets of rules. That is very difficult to manage and keep up, because you could certainly be adding lots more facilities in the future; maybe all kinds of HoloLens or something like this will be available in your room. So it would be very difficult to manage by hand. Let's see if we can surface these by throwing math at the problem, and especially in a quick and iterable way via BigQuery ML and SQL.

So we have one premise, one assumption, backing our project here, which is that most of the data is pretty healthy; it's pretty representative of the truth. If that's true, then we hope that similar things will end up next to each other in such a clustering, and oddities will stick out, so we'll be able to find the odd things. I think it's a pretty good assumption. We know our data relatively well and it's not our first time looking at this kind of thing, so we're hoping math will find the ones we haven't caught yet.
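
The idea described above (cluster the boolean facility vectors, assume most rows are healthy, then look at the rows that sit farthest from their own cluster centroid) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the BigQuery ML code used in the talk: the facility names, the room ids, the toy data, the deterministic farthest-point seeding and every function name below are assumptions made up for the example.

```python
import math

# Hypothetical facility columns; the talk mentions about 176 of them.
FACILITIES = ["toilet", "shower", "bathroom", "toilet_paper",
              "tv", "cable_channels", "kitchen", "balcony"]

# Hypothetical toy data: room id -> one 0/1 flag per facility above.
ROOMS = {
    "basic-1":   [1, 1, 1, 1, 0, 0, 0, 0],
    "basic-2":   [1, 1, 1, 1, 0, 0, 0, 0],
    "basic-3":   [1, 1, 1, 1, 0, 0, 0, 0],
    "basic-odd": [1, 1, 0, 1, 0, 0, 0, 0],  # toilet and shower, but no bathroom
    "suite-1":   [1, 1, 1, 1, 1, 1, 1, 1],
    "suite-2":   [1, 1, 1, 1, 1, 1, 1, 1],
    "suite-3":   [1, 1, 1, 1, 1, 1, 1, 1],
    "suite-odd": [1, 1, 1, 1, 0, 1, 1, 1],  # cable channels, but no TV
}


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def initial_centroids(vectors, k):
    """Deterministic farthest-point seeding (a simplification for this sketch)."""
    chosen = [list(vectors[0])]
    while len(chosen) < k:
        farthest = max(vectors, key=lambda v: min(distance(v, c) for c in chosen))
        chosen.append(list(farthest))
    return chosen


def kmeans(vectors, k, max_iterations=50):
    """Return (centroids, assignment); assignment[i] is the cluster of vectors[i]."""
    centroids = initial_centroids(vectors, k)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: centroids already match this assignment
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = centroid(members)
    return centroids, assignment


def rank_outliers(rooms, k):
    """Rank rooms by distance to their own cluster centroid, farthest first."""
    ids = list(rooms)
    vectors = [rooms[room_id] for room_id in ids]
    centroids, assignment = kmeans(vectors, k)
    scored = [
        (distance(v, centroids[a]), room_id, a)
        for room_id, v, a in zip(ids, vectors, assignment)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Show the two rooms that sit farthest from their cluster centroid.
    for dist, room_id, cluster in rank_outliers(ROOMS, k=2)[:2]:
        print(f"{room_id}: cluster {cluster}, distance {dist:.2f}")
```

On this toy data the two rooms with a missing facility are the ones farthest from their centroids, which is the kind of candidate one would then review by hand. The sketch relies on the same premise as the talk: if most rows were wrong, the centroids themselves would be unreliable.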
Okay, so k-means clustering. There was a very nice talk earlier today where we gave a longer version of this; I hope you'll visit it on YouTube. If we throw k-means clustering at this, we need just a few lines of code to build several different versions of the model. We only have to tune maybe one parameter, and we can very quickly see what comes out. Let's visualize some of those. Well, actually, sorry, let me take a step back: what would clustering look like? This is two dimensions; of course we have 176. Remember, our goal is to find the things in the triangles, the ones that really stand out. If we can cluster well, the ones that are very far from their centroid, their middle, are maybe the ones where something is going on. So this is visually what we're aiming for.

Okay. Here's Data Studio; it's very convenient that we can look at this through the GCP pipeline. We have a lot going on, but if you look at the red on the right, this relates both the size of the clusters and how far they are from each other. Notice that cluster 10 there is actually quite far from any other cluster. That tells us something: maybe it's odd. That's one of the ways a cluster could be odd: it could be very far from the others. It turns out cluster 10 is very weird for another reason. If you look at the green there, this is the distribution of distances within the cluster. For each cluster, if you are at 0 that means you're at the centroid, and if you're at the top that means you're as far away from the centroid as anything in that cluster. Now, I care about cluster 10 again because its farthest points are the farthest of any from their centroid, so we should probably look there first. It gives us an indication of where we could peek. So let's take a look at a few examples from cluster 10. If we drill in, what do we see here? Well, let's look at the things that this item
has. I've drilled in and I'm looking at exactly one item. The blue bars there represent something about the whole of cluster 10: how common each one of those room facilities, which are on the bottom, is. So let's take the example of, say, toilet. The bar says it's available in my outlier, and it's also very commonly available in the cluster in total. I see a toilet, I see a shower, I see body soap, I see free toiletries. These are all very good things. Where would you put them? Probably in a bathroom, which is not available here. The outlier does not have a bathroom listed, anyway; maybe it does have one, I don't know. And no toilet paper, so maybe it's BYO toilet paper, I'm not sure. Maybe you filtered for that yourself, maybe that's something you want, but for my trip I'd like to be sure that there's toilet paper waiting for me when I arrive. That's good to know; we should at least follow up with the property. Let's take one more example and go in a different direction. Of course this one might have other bathroom problems, but let's see what it does have. I see cable TV channels, I see satellite channels, but I don't see a TV or a flat-screen TV. I hinted at this before: you have plenty of channels to watch but nothing to watch them on. We would definitely want to surface this for the property, because they might be missing out. Anyone filtering for a TV is going to miss this property, but it's likely that they have one. So we found some really interesting stuff. I think these are things we wouldn't have found otherwise, or would have had a very hard time identifying. The next step for us would be automation improvements, so that we could do this on a very regular basis and not have to explore by hand the same way we did.

Got it. Thank you, Jonathan, thanks a lot. Some interesting use cases there.

Thank you.

Thank you. So let's continue with some of the new investments
that we are announcing right now. We are investing in the security features of Dataproc, so Kerberos is now available and you'll be able to use the same security models that you're using on-prem. You have autoscaling capabilities. One of the other big investments that we're making with Composer is Composer Flex, which will give you a completely serverless Composer capability. And one of the other things is that we just announced our partnership with Qubole. A lot of enterprise organizations are using Qubole for their Hadoop and Spark workloads, and its whole unified experience, with the workbench, notebooks and dashboards, is super valuable for enterprises. Now it is available on GCP, so you will be able to use it. They have great enterprise security features, controls and governance, as well as seamless workload migration, so if you're using Qubole today you'll be able to continue using
that on GCP, from your on-prem.

The next key thing is, as I said, 25% of the data generated in 2025 will be in streaming form, and we have great capabilities on the platform for streaming, starting from ingestion with Pub/Sub, through transform and analyze, across the board. You can also do that with our open-source technologies or with partners that are available. With that, one of the key announcements that we have today is Dataflow SQL. The whole idea is that a lot of organizations use Dataflow from different platforms; you can write Java code in Beam and all that and do it, but SQL is a good interface and a lot of customers like it, so we are making that available. Let's do a quick demo from Sergey on that. Welcome, Sergey.

Thanks, Sudhir. So in this demo I'm going to take a Pub/Sub topic, I'm going to associate a schema with this topic, and I'm going to join it with a BigQuery table to do some stream enrichment. Once I have a topic with a schema and an enriched stream, I'm going to group it by time and insert the results into a BigQuery data warehouse. The goal of this demo is to show you that you can very quickly calculate aggregate statistics on a stream of events.

I'm going to start with BigQuery, and many BigQuery users will find it quite useful that they can now access Dataflow right from Query settings. You get the choice of the Dataflow engine as the execution backend, and once you choose the Dataflow engine and save the setting, you will be able to create Dataflow jobs. I actually have a SQL statement saved in my notepad, just for demo purposes, so that I can avoid typing it. I'll quickly explain what's going on here. I have a Pub/Sub topic. Note that this is not a table; this is actually a stream of events. I'm going to join it with a static table in BigQuery, allowing me to do stream enrichment. Here's my join condition, and the key portion of this SQL
statement, this streaming SQL statement, is the tumbling function. This is the piece that allows you to do streaming analytics. It creates fixed five-second windows, and you can run aggregations on top of these windows. That's exactly what's going to happen in the projection part of my SQL statement: I'm going to create statistics for sales regions, for all of the sales events in my stream, and I'm going to have a timestamp for these calculations and the sales amount. Oh, and by the way, I mentioned that we use a schema. We now store the schema of a Pub/Sub topic in the catalog. Here's
the schema of my Pub/Sub topic. This is what enables me to run SQL on streams: having a schema. My events have very simple attributes. We have a timestamp and we have a payload, and the payload contains the person who purchased the good, the good itself, where it was purchased (the state of purchase), and the amount. Great, so let's run the job. In my next screen I just need to type in the destination table. Initially we support BigQuery as the destination, but we'll add more destinations in the future as well.

Great. Within a second or two I'm going to get a Dataflow job ID. That's the job ID, and if I click on it I'm going to get rerouted to Dataflow. Once the resources have been provisioned and the SQL query gets executed and transformed into a Dataflow graph, I will see things in the middle of the screen. Now, I don't want to wait, so for the purposes of the demo I launched a job just like it a few minutes ago. Here's what you will see in the Dataflow experience. For those of you who are familiar with execution plans, that's exactly what's going on here. Dataflow took a SQL statement and created an execution plan for it using the Beam framework. I have my input flow from BigQuery, I have my input flow from Pub/Sub, and I have a join condition in the middle. For this particular SQL statement we are using very efficient side-input joins. To conclude the demo, I also wanted to show you that data is actually flowing through this SQL statement. So I switch back to BigQuery, and I have a SELECT * style statement here. Let me quickly run it. Here are the results. Let me rerun it again. All right, as you can see, I get my data updated every five seconds.

Awesome, thank you, Sergey. As I said earlier today when I talked about our one philosophy, which is making
sure we make it really simple to do the activities that you do today. You could have done the exact same thing in Java, written
a few lines of code, and made it happen, but we want to make it really easy for all the analysts to do a similar activity on streaming data. With this SQL-based language, Dataflow now becomes accessible to everyone in the organization who can write SQL. In addition to that, we also have FlexRS, which is flexible resource scheduling. For delayed data pipelines, you can save up to 40% by using preemptible VMs. The main thing there is that on our side we will guarantee that the jobs finish in any case, because we mix regular VMs and preemptibles. So you'll get a lot of savings, with a guarantee that your jobs are going to get done. There are a lot of other announcements. I won't go into depth on each one of them, but you can take a look at them.

From a governance perspective, we have different things that we already offer, right? We have the built-in encryption that is there and customer-managed encryption keys; we talked about that earlier today. Access transparency: Thomas touched upon it. We also have tools for efficient governance, as well as compliance with HIPAA and all the other compliance regimes. The key thing we are announcing today is Data Catalog. Data Catalog is our fully managed and scalable metadata service, which will allow you to search for all your data assets: where they are, what they are, who has... you know, all the different details. Along with that, it also allows you to define schemas for Pub/Sub, which is streaming data, so once you have that you can start using it in SQL. It's a very simple search experience. It gives you the ability to do auto-tagging with DLP: we can run DLP, mark PII data and all, and then you can also add your own business metadata
that you can define, and from there we will be able to define policies on top of it. So you can say: for anything that's PII, do not give access to this group of people, or give access only to these people. You will be able to do those kinds of activities, but fundamentally it's about easy discovery of your data within your organization. We are going to solve that across all the different assets within GCP.

Other than that, the most important thing is that you have a lot of investments in different tools that you may have acquired from your different
organizations. We have a huge partner ecosystem, so they should just run as is, without any problems. We have great partners in the BI space, like Tableau and Looker; all those tools are there. Informatica, Fivetran, which we showed earlier today, as well as Talend: with all of these partners for ingestion, too, we have a huge partner ecosystem for you to leverage.

Other than that, Google has been identified as a leader in both the Forrester Waves and by Gartner, so we are trending in the right direction. We have a lot of investment going into the analytics platform generally, and it's ready for enterprise adoption. We have a lot of customers, and it would be great if you could take a look at some of the key new capabilities, start playing with the products and give us more feedback. Thank you.