Access and Analyze Large-scale Public Datasets on Google Cloud (Cloud Next '18)


So we'll talk about public datasets today. My name is Shane Glass, and I'm the program manager for the Google Cloud Public Datasets Program. I've got a couple of esteemed colleagues who will be joining me on stage in a minute, but we're going to talk about accessing and analyzing large-scale public datasets in Google Cloud.

Here are today's speakers. You can see the one on the left; that's obviously the one with the great jokes, that's me. The other two gentlemen joining me are Felipe Hoffa, one of our developer advocates, and Dr. Ed Kearns, the chief data officer for the National Oceanic and Atmospheric Administration.

I want to start with a quote. This is from Fei-Fei Li, from last year at Next. I would have loved to quote her this year, but I didn't know what the keynote was going to be, so I'm using last year's quote. It says: "What we need is a more scalable and effective way to democratize data for data scientists." I love this quote because it is such a dead-on description of what our program, the Google Cloud Public Datasets Program, is looking to do. That's working with public dataset providers like NOAA, and like NIH with the partnership that was announced earlier this week, to host copies of their high-value, high-demand datasets on our cloud, to reduce the burden for data scientists, to make their workflows easier, and to help them get to where they need to be. We're talking about eliminating barriers to entry for things like basic utilization of data, and being able to open up these really valuable datasets to users who come from non-traditional backgrounds. So how do we do that? Well, we load them into some of our tools, we curate the datasets, and we can dive into that a little later, but it's
about making data available to everybody. You don't have to know where to go, what to look for, or what you're trying to find before you find it.

A little bit about myself. If you were at Next last year and you have an exceptionally good memory, then you know that I was not the one giving this presentation last year. I am the public datasets program manager. I joined Google in April of this year, coming over from NOAA, where I managed the Big Data Project before this. And I'll answer that question up front: no, I do not know when the next hurricane is going to hit. Ed might; you can ask him later.

My background is in data analytics, so I come into this space from a NOAA perspective, but I come at it with the understanding that it can be really, really difficult to access and find open data. When I was working on my master's degree, I very distinctly remember that the biggest frustration I had wasn't learning Python, it wasn't learning R, it wasn't visualizing data (that was actually a lot of fun). It was finding open datasets to do all these things on. That was a huge frustration for me, and it's made me really passionate about open data. That's one of the reasons I wanted to come over: I want to make these data more democratized for our users.

Currently in the program we have a little over 90 datasets, somewhere between 90 and 100, and you can see the typical scattershot, right? It's not really a session at Next if you don't have a scattershot of other people's logos. You'll see these are primarily government agencies: some states, some cities, some local. But what you're really seeing here is a list of the public dataset providers whose data we're hosting. These are onboarded and maintained by Googlers, with input and guidance, of course, from the subject matter experts that are the data providers, and what this really means is that it's
not just a random collection of data that someone sticks in an object store and never looks at again. There's a Googler going in and discovering, finding, downloading, extracting, transforming, curating, and maintaining these datasets, and we think that provides a better user experience than other public dataset programs, because we are putting in the effort, the time, our expertise, and the engineering to bring these datasets to you. We work closely with these providers to make sure that we understand what the best practices are for them and that we can describe the metadata as properly as possible.

So we'll just go over some really high-level numbers that sound really impressive, and you can decide if they mean anything. In BigQuery alone, since Next 2017, so approximately one year ago, we have had more than 80 petabytes queried. And I'm hearing from Chad that that number is actually higher than the last time I looked, which was about a week ago.

That number is probably closer to 90 now, actually. So 90 petabytes. Higher than 90? 95? Do I hear 95? 100? OK, so I was some way off. What I saw was from a small sample of the datasets I was looking at, and I probably didn't extrapolate as accurately as I could have. But yeah, so: a lot, right? That's really the answer there. People are using a lot of public data.

But it's in BigQuery. Now hang on, because the public data we get doesn't always come in a CSV, and it doesn't come preloaded in a BigQuery table, right? That's what I talked about earlier: there are Googlers building pipelines and extracting this data into BigQuery, because it's an easy, SQL, serverless implementation that you can come in and build on top of right away. One of the really exciting announcements from this week is the introduction of BigQuery machine learning, which allows you to build machine learning applications on top of your SQL queries using declarative statements.

We have more than 2,000 tables in BigQuery for public datasets, with more than 42 billion rows. Yes, I did count each of them individually, thank you for asking. And more than 5,000 distinct users per week. This is a number we're really excited about because, you know, if one user is querying 80 petabytes of data, that's great, but it doesn't necessarily mean that we're democratizing that data for the broader community. To see that volume number grow with the number of users and with the number of distinct projects is really exciting for us, because what that means is that we're improving the access for users. We're making it easier for users who may or may not have been data users in the past to come in and leverage this data.
It's not just BigQuery. We have public data in GCS as well, so Google Cloud Storage. We have more than three petabytes of data there. That includes things like radar imagery, so we have weather radar imagery going back to 1991, all the way until about 15 minutes ago. And then we have satellite imagery as well, from the Geostationary Operational Environmental Satellite 16, or GOES-16, a meteorological satellite that orbits over the same point on the earth at all times. That's all I know about satellites, but it is high resolution, both in space and time, for imagery, and it provides us the ability to better understand weather patterns as they're unfolding. We have the archive of that going back to last year as well, when it became available.

So, a quick overview of Google BigQuery for those who aren't familiar. I mentioned earlier that it's a serverless data warehouse that's built to support big data analysis. Felipe's got some great demos on BigQuery coming up, but the one thing I'm going to hit on is that you don't have to worry about developing, managing, or building the infrastructure. It's already built. All you have to do is put your data in BigQuery and write your SQL statements on top of it.

It scales seamlessly and allows you to share your analysis very easily. There's a free tier for storage and for queries; the values are listed up there. What that means is that you can go in and start using the program today without incurring costs, assuming you're below these values.

So how can this support open data use? We have a few very general profiles up here. The first one is businesses. I've talked to several businesses who have told us that they exist because the public data is in GCP. What it allows them to do is focus on their idea. They don't have to focus on building data access mechanisms, or on maintaining these data access mechanisms, or on storing the data. They can focus on building their product and delivering it to their users, and they can do so on top of these public datasets, either by joining them with their proprietary data, which we think is a really valuable way to use them, or by building their product on top of the data to deliver new insights to users.

For researchers, you can join multiple public datasets together to do an analysis in ways that have never been done before. One of the things that really excites me is the ability to bring together data that has previously lived in silos that aren't even on the same farm, if we're going to stick with the silo analogy. You're looking at data that has just never been brought together before, or has not been brought together in such an easy-to-use format before, and that can be joined together to find some new insights.

And then for data providers, you can provide fast and simple access to your data without having to scale your on-premises services. Your storage scales to a million users or to one user for the same effort.

So without further introduction: Felipe.

Hello, can you hear me? Thank you. Thank you, Shane. So I'm Felipe Hoffa, and Shane
gave me this time to talk about issues that matter to me. Last year I showed spaces versus tabs in code, but today I want to speak about a real issue. People are fighting on the internet about this: where do we put our commas? And I'm talking about SQL. Who knows SQL here? Excellent.

So SQL is awesome. It's a language invented in 1978. We still use it, we still love it, and you must have learned SQL with a query like this: select name from employees where they come from some state. Then you will want to get more columns there, and then you add a third column, and then suddenly things don't work. Why? Because I have this hanging comma here, and this is because the language creators decided that you could not put a comma there. Every other modern language will allow a comma, but SQL will not.

So now people are in two camps. Some people put their commas at the end; some people put their commas at the start. Who puts their commas at the end? OK. Who puts their commas at the start? Two, three, four. OK. That's super ugly, but that's what I do too. It's much more efficient for me if I put my commas at the start, given the language limitation. So now the question is: can I use data to prove to everyone else that this is what you should do?

So, trailing commas, leading commas. We are Google, we have Google Cloud, we have BigQuery, we have data that everyone can see, share, and analyze. I will look at GitHub, just to count the number of events that we have. GitHub last year gave me, in three seconds, five, four, three... hundred million events. And I can go further: I can count, for example, the number of distinct users, and I can look at where they are coming from. What are the most important countries here? Most developers come from the null country, then the US, then China, India.
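Going back to the hanging comma Felipe describes: the failure is easy to reproduce locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a SQL engine (it is not BigQuery, and other engines may treat a trailing comma differently). The `employees` table, its columns, and the `runs` helper are hypothetical, made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in SQL engine (an assumption, not BigQuery).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, state TEXT, salary INTEGER)")
conn.execute("INSERT INTO employees VALUES ('Ada', 'CA', 100)")

# Trailing-comma style: commenting out the last column leaves a hanging comma.
trailing_style = """
SELECT name,
       state,
       -- salary
FROM employees
"""

# Leading-comma style: commenting out the last column leaves valid SQL.
leading_style = """
SELECT name
     , state
     -- , salary
FROM employees
"""

def runs(sql):
    """Return True if the statement executes, False on a syntax error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.OperationalError:
        return False

trailing_ok = runs(trailing_style)
leading_ok = runs(leading_style)
print("trailing style runs:", trailing_ok)
print("leading style runs:", leading_ok)
```

In SQLite the first statement is rejected with a syntax error near `FROM`, while the second one runs, which is the practical argument for leading commas when the last column of a list is the one most often added or commented out.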
That's not a fair ranking. I should do it per capita. So who has more programmers per capita? Turns out it's Iceland, Sweden, Norway. Cold countries.

Is it true that it's cold countries, or not? I also have NOAA data, I have weather data. So if I take on one hand my programmers per capita, and I take the average weather of each country, I can get to a chart like this, which shows that yes, there is a correlation. People in cold countries have a high concentration of programmers; warmer countries have a lower concentration of programmers. Except Singapore, there, because they have very good AC, so programmers prefer to stay inside.

A real query. So, let's go back to this issue.
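The per-capita ranking and the temperature correlation described above can be sketched in a few lines of plain Python. Every number below is a made-up placeholder (the country names are fictional too), so this only illustrates the calculation, not the actual result from the GitHub and NOAA data.

```python
from math import sqrt

# Hypothetical figures for fictional countries; NOT real data.
countries = {
    "Coldland": {"developers": 5_000, "population": 350_000, "avg_temp_c": 2.0},
    "Mildland": {"developers": 400_000, "population": 60_000_000, "avg_temp_c": 11.0},
    "Warmland": {"developers": 900_000, "population": 1_300_000_000, "avg_temp_c": 25.0},
}

def per_capita(row):
    """Developers per inhabitant."""
    return row["developers"] / row["population"]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Rank by developers per capita instead of by raw developer count.
ranking = sorted(countries, key=lambda c: per_capita(countries[c]), reverse=True)

temps = [countries[c]["avg_temp_c"] for c in countries]
density = [per_capita(countries[c]) for c in countries]
r = pearson(temps, density)

print(ranking)
print(round(r, 2))
```

With these placeholder values the smallest, coldest country tops the per-capita ranking even though it has the fewest developers, and the correlation between temperature and developer density comes out negative, which is the shape of the chart described in the talk.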

Also, we have GitHub code inside BigQuery, and I'm able to analyze: are people using more commas at the end or at the beginning? This is my query, and the results show that way more people use a comma at the end.

But this is not really what I want to know. I don't want to know what's more popular; I want to know which projects are more successful. And how do you measure success? A larger number of stars, stars last year, the number of contributors, activity. So I can write a bigger query that looks for all of these, and it turns out that these are my results: projects that allow people to use leading commas are double as successful as the other projects. So yes, we win.

And yes, BigQuery connects to all these tools, and you will be able to use open data to share, and even if these fights do not stop, at least you can embrace that. I won't go through them. I want to show you a couple more features that are new, and I'm super excited about them.

Have you heard about Data Studio? Yes. So with Data Studio I'm now able to take my BigQuery analysis and create interactive visualizations. I recently created this one for Stack Overflow, and I can embed my visualizations in the middle of my blog post. I can make them interactive. These are the trends in Stack Overflow. Let's look, for example, at AngularJS, one of the top tags on this site. Is AngularJS going up or down? Down. Why is AngularJS going down? Because they replaced it with a new tag, which is Angular. So whatever technology you are interested in, people can come play here and use public data.

On the satellite side, this is outside BigQuery. My friend Lak, who is a great friend of everyone that knows about machine learning and weather forecasting, created this great sample looking at hurricanes. We have in BigQuery a list of all the hurricanes, and you can see what's happening. So we don't know when the next one is, but we know all the previous ones. You
can use that to get their data, and then you can find the files, the satellite images, that correspond to those hurricanes, and finally you can get individual images, create animations, and look at what is happening all around the world with just a few lines of code. That's the power of having the public datasets available.

And the last thing I want to show is for everyone here who is a fan of BigQuery and has seen my Wikipedia demo. For example, in this demo I'm going to process 377 gigabytes of data. I'm going through billions of page views that Wikipedia has seen, and when I run this query I just want the sum of views for Google. This query is processing 377 gigabytes, which is an awesome demo. But since you have one free terabyte every month, you would only be able to run three of these queries.
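The free-tier arithmetic behind that remark can be checked quickly. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and takes the 1 TB monthly allowance and the 377 GB query size from the talk; the function name is made up for illustration.

```python
# Assumption: decimal units; the talk does not say whether TB/GB are decimal or binary.
GB = 10**9
TB = 10**12

FREE_QUERY_BYTES_PER_MONTH = 1 * TB   # "one free terabyte every month"
WIKIPEDIA_DEMO_BYTES = 377 * GB       # bytes processed by the Wikipedia demo query

def queries_within_free_tier(bytes_per_query, free_bytes=FREE_QUERY_BYTES_PER_MONTH):
    """Number of complete runs of a query that fit in the monthly free allowance."""
    return free_bytes // bytes_per_query

ratio = FREE_QUERY_BYTES_PER_MONTH / WIKIPEDIA_DEMO_BYTES
full_runs = queries_within_free_tier(WIKIPEDIA_DEMO_BYTES)

print(f"allowance / query size = {ratio:.2f}")
print(f"complete runs inside the free tier: {full_runs}")
```

Under these assumptions the allowance covers roughly 2.65 runs, so "three" in the talk is a round figure: two runs fit completely and the third would go past the free allowance.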

Good as a demo, but not so good for public data. And the best news we got, for me, yesterday, is that now we can cluster tables. We can set data together. So now, when I run my query over this new table (same query, same amount of data), we think it will process 375 gigabytes. Instead, it only went over 80 gigabytes of data. So where we are taking our public datasets program now: we are going to cluster all tables and make it way more effective. Yes, GDELT, I'm looking at you, Kalev. We are going to make these tables all time partitioned and clustered by the columns people query. And with that, let me welcome Shane back to take it over. Thank you.

Thanks, Felipe. In the interest of full disclosure, I didn't start using leading commas until I saw Felipe's slides and realized how wrong I was. Now, I did see in Felipe's demo, however, that there are people who are using both leading and trailing commas, and those people just want to watch the world burn. I'm terrified of you. You're monsters.

OK, so next up we have Ed Kearns here from NOAA. Ed is NOAA's chief data officer. He's the first chief data officer of NOAA, he's been a great collaborator of the public dataset program, and I look forward to hearing from him.

Thank you very much. So you wonder why there's a government employee up here talking at Google, but we have this partnership that I'm going to describe, and you've seen the results of this partnership. It's very, very powerful. One of the things I do as NOAA's chief data officer is try to find the maximum exploitation of the taxpayer's dollar. You've paid for these expensive satellites; you've paid for a lot of collection systems that bring data from the environment into NOAA, and we use that for our mission, OK?
But we also realize there's a secondary use for this data. We're using it for managing fisheries, we're using it for weather forecasts, we're using it for managing our ecosystems along the coastlines, but we understand that you, as maybe a small business owner, can also use this public open data for other purposes. We understand there's value there, and we want to maximize that value.

So, like I said, we have our mission, but there are the secondary uses. How can we maximize this? This started about four years ago. We realized that we don't have the resources within the federal government to fully exploit the data, but hey, you know, Google does. Can we find a way that we can work together in a partnership to actually bring that value forward and democratize the data, as Shane was mentioning?

Let me first explain how big NOAA is. We're a fairly big organization, and we have a very wide scope of what we do for the American public. We have about 12,000 people. Of those 12,000 people, we've got 7,000 scientists, and all those scientists work with data every day. They are oceanographers, like I am; they're meteorologists, they're biologists, they're fisheries experts. Every single one of them is also a data scientist in their realm of expertise, because data is the heart blood of our organization. Nothing happens in NOAA, none of these products get out to the American people, without data flowing through them.

I also have an enormous challenge if I want to share this data. On any given day, we've got about 200 petabytes of data residing across NOAA's federal systems. These are largely not in the cloud, OK? These are behind firewalls on federal systems across the United States. We've got about 70,000

datasets, OK? So that's an enormous number to try to understand. If you just go out on the regular internet and try to find all these datasets, probably 95% of them are publicly available today in an open fashion, if you can find where they are. And that's a challenge. We've got 1,200 registered web domains, so you've got 240,000 individual websites out there that all have data. It's very hard for you all to find and consume the data that we have.

But we try to find different ways of doing this, and we're trying to embrace the cloud, trying to bring in new technologies. Last year we did use the cloud to try to make sure that everybody who needed hurricane data, hurricane forecast data, before Hurricane Irma hit Florida had access to it. And last year, yes indeed, we were getting a billion hits a day, and because we were using cloud technologies in front of our federal services, we were able to do that. So a great success story. But we are also usually the number two or number three website in all of the United States government, because so many people need our data every day.

So in order to use this, there are two challenges. One is technical: how do you get the data? The example that Felipe just showed, which Lak had put together, really shows how working with Google has simplified this. Before that, if you wanted to work with our satellite data and with our hurricane data, you had to download all these different things back to your own local computer, say, or wherever it might be, and then try to do things with it. What this example showed is that you can do everything in place. Some of our datasets, our climate models, are about a petabyte in size. It's really hard even to move that data within our own organization, much
less get it out to you so you can use it. So these are some of the technical challenges. But the bigger challenge that we're trying to fix is the understanding: trying to figure out what these data mean and how to convey that understanding to you. I've got 7,000 scientists who know exactly what these data mean, but how do they interact with you to convey the meaning? Because, like I said, we've got the data parked out there already. We've had the data parked out there for many years, and it's not getting picked up and used in the kind of way that we're seeing when we put it on the Google platform. That's what we're trying to learn more about.

So what we did four years ago is we signed what's called a Cooperative Research and Development Agreement, or CRADA. It's a government term, but it's a way of actually having an experiment going on with industry, where there's no money exchanged but where we have agreed to work on a common problem. We've been doing this for four years with Google right now. So this comes at no cost to the taxpayer. The trick here is that these

data are open data, and they remain open. Google is not selling the data; they're selling services around the data, so they can recoup their investment, but they are hosting the taxpayers' data there for free. So we understand how that can scale. You just saw examples of this, of how the cloud can scale, the open data can scale, the availability can scale. But to really get over the hurdle, we're focused on the expertise, because I can park a lot of this complicated data out there and you still won't know exactly how to use it, right? So we have to solve this.

You've heard about some of the examples and some of the successes already. The sweet spot that we found, instead of just parking the data out there, is loading the data into the tools that people are already using, and BigQuery is a perfect example of this. By loading NOAA's weather data into BigQuery, people can come and use that data, join on that data, consume that data. They don't have to break down a complicated scientific data format. They don't have to try to understand what all these mean. They understand: hey, it's temperature, it's precipitation, it's cloudiness, it's wind. They understand what it is. It's already loaded into the tool they're using, and they can immediately consume it and use it. That's a very exciting paradigm.

We're trying to see how we leverage this, so we're comparing. It's sort of an A/B test: how's this working on Google versus how it's working from our own federal servers? And we're seeing at least 10 times, probably between 10 and 100 times, more consumption of these weather data coming out of BigQuery than we're seeing from our own internal services. And this came when we first started this, about a year ago now, I guess, when we first got this off the ground. One of the interesting things is that we didn't advertise it. NOAA didn't advertise the availability of these
weather data in BigQuery yet, and Google did not advertise it yet. They were still thinking about how they were going to proceed, right? But while we were talking about what we were going to do next, users of BigQuery came in and found the data anyway, and they started to consume it. It required no advertising. We didn't have to put out any press release that these data were available. People were just able to find and use them, and that's a great story.

Being able to combine datasets: the example that Felipe showed, with the hurricanes, that Lak put together is again just a great example of bringing multiple datasets together. That otherwise may be hard to do, but when they're all in the same place, on the same cloud, it becomes really trivial.

So where are we going with this? We're trying to find those increased levels of usage for our data. We're trying to make sure that you, the data consumer, can get the data more quickly. We have our archive system; we have 30 petabytes in our archive. It is still a tape-based archive. The archive is primarily for preservation, but we make all those data open. It is very hard to get 30 petabytes out of a tape archive, as you all know, right? However, when these data are sitting on the cloud, it's fairly trivial. So we're looking at ways that we can continue to raise that level of service, and do so in a way that does not burden the taxpayer. We're trying to leverage the value that's in the data, the value to Google's users, Google's customers, to enable this new way of providing access to federal data. We've been doing this for four years, or
a little over three years; it will be four years when we're done. But we're really trying to say: how can we do this sustainably? We're doing this experimentally right now. We're excited about the results, but how do we take this to the next level? And I really need to hear from you all, OK? The federal government needs to hear: is this something that you like? Do you like to get NOAA's federal data on the

Google Cloud? If you find this valuable, we need to hear about it, as we're going into our next level of discussions with Google about how this can work, how this partnership can work sustainably for many years into the future. NOAA's bringing the expertise to the table, all right? We're not just bringing the data, we're bringing the expertise, and Google's bringing their infrastructure. How do we maintain that business relationship? We've found it very valuable, and Google's found it very valuable, to be able to have those discussions with NOAA experts, to literally do the deep dive on ocean data, to actually have those one-on-one discussions, and to be able to turn that around and figure out how those products fit into the Google tools and meet the demands of their users. So we're looking for that right now.

There's also a Federal Data Strategy effort that has just begun as well. If you go to strategy.data.gov, we're in an open public comment period right now, where the federal government is asking the data users across the country: what do you want to see with federal data? Do you like things like this kind of partnership with industry that allows federal data to be available on the cloud? Or would you rather pick it up from, you know, a federal service? These are real questions that we're dealing with right now. The timing is fantastic, but honestly, this is a time that we need to hear your voice. If you go to BigQuery and you've been experimenting with these public datasets, if this resonates with you, there's a time right now to be able to speak up and be heard, and to have this kind of partnership, this cloud-based data access, baked into how the federal data strategy is going to be rolling into the future. Timing is everything in life, right? And so this

CRADA came along at the perfect time. The results have come along, again, just in the nick of time, so that these things can be integrated into the wider strategy. And yeah, please: if you want to drop me an email at this email address, or you want to go enter your comments at strategy.data.gov, I encourage you to do so, and I welcome your feedback. And I think I'm turning it back over to Shane.

Yeah, thanks, Ed. I think Ed touched on something big, and there's an example of it here on this slide. So in case you haven't heard enough about Lak's demo, we've got a little bit more here for you. I think this is a true big data application, right? Big data is one of those buzzwords that's used to mean everything, and as a result it really means nothing, because it's used so broadly that it doesn't really have a meaning these days. But I think this really is a big data application.

Now, data is the new oil, right? We've all heard that. This is not an original thought that Shane has had, but I've thought about this phrase a lot. I've gone back and forth on it, and my opinion on that phrase is: sort of, right? Data is sort of the new oil. The analogy is imperfect, but I think it has some value in that data in its rawest form has value, right? Having raw oil, just like having raw data, yes, has some value. But being able to process that data, to convert it into insight, into information, just like being able to process oil into gasoline, into plastics: that is really where the valuable application of this product is, and that's what we're trying to unlock for our users.

What you see on the left is a sample of the BigQuery table that Lak used for his demonstration on visualizing the imagery from Hurricane Maria.
So what that is, it's called IBTrACS. There's a very long acronym there, but it's essentially the international community coming together and saying: we agree that at this point in time, this hurricane was at this place. It is a shockingly difficult thing to come to agreement on. But what we found is that it's not always associated with the satellite imagery you see on the right there, and that those are in such different places, in such disparate locations, that they're very rarely combined.

Moreover, while the data on the left are available in CSV, the data on the right are in NetCDF-4. Is anybody familiar with the NetCDF-4 file format who is not employed by the federal government? Yeah, so we have a handful of people, which is a handful more than I expected. NetCDF is great for the scientific community, because it does compression very well and it does a lot of other things very well, but it's very, very specific to the scientific community. The general data science community, the data scientists out there as opposed to the environmental scientists out there, aren't used to using this. So that's a barrier to entry. You have to go out, you have to find these data, you have to find the Python library for it, you have to find out how to use the Python library for it, you have to find out where to find out how to use the Python library for it, you have to go back, you have to write your Python for it, you have to go eat dinner because at this point it's already been all day, and then maybe you could get an analysis. And that's working with federal government data.

But what we can do here is lower that barrier to entry. By putting the data from IBTrACS in BigQuery, it's now a SQL query away, and we can subsample, in this case by hurricane name.
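As a rough illustration of what "a SQL query away" means here, the sketch below subsamples a track table by hurricane name. It uses Python's built-in `sqlite3` as a local stand-in for BigQuery, and the `tracks` table, its column names, and all coordinate values are hypothetical placeholders rather than the real IBTrACS schema or data.

```python
import sqlite3

# Local stand-in for a track table; schema and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (name TEXT, iso_time TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?, ?)",
    [
        ("MARIA", "2017-09-20 12:00", 18.0, -66.0),  # placeholder position
        ("MARIA", "2017-09-20 06:00", 17.0, -65.0),  # placeholder position
        ("IRMA", "2017-09-10 12:00", 25.0, -81.0),   # placeholder position
    ],
)

# Subsample by hurricane name and return the center points in time order.
maria_track = conn.execute(
    """
    SELECT iso_time, latitude, longitude
    FROM tracks
    WHERE name = ?
    ORDER BY iso_time
    """,
    ("MARIA",),
).fetchall()

for iso_time, lat, lon in maria_track:
    print(iso_time, lat, lon)
```

The ordered list of timestamps and center points is the piece that would then be matched against satellite image files covering the same times and locations.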
We can take that very easily, and on the right we can associate it with the same center points from the satellite imagery that has the hurricane. Now you can go from data to insights faster. You can go from oil to gasoline faster. You can get the value out of your data as soon as you can, as opposed to having to wait and go through this process. We've worked very hard to eliminate all those barriers to entry. Now, this slide was made before the announcement was made, but BigQuery launched its geovisualization capability this week, and that takes us even further.

So we've talked a lot about what we've done in the past, right? You've heard a lot of our metrics, you've heard Ed talk about the great collaboration we've had so far, and you've heard Felipe talk about some of the things that can be done with the datasets we already have. And we've talked a little bit about what we want to do, like when we talked about how we're working with NOAA right now to figure out what a sustainable long-term partnership looks like beyond the end of our agreement next year. So let's talk a little bit about what we're going to do.

This is a look at our mission and vision statement for the program going into next year. Our mission is to maximize the availability and usability of flagship public datasets in Google Cloud. We're looking at things like large-scale datasets that are putting such a burden on our providers that they have to limit the access available to them now, in order to protect their systems from being brought down and overwhelmed. And our vision here is to enable the development of, and to leverage, collaborations with key public dataset providers like NOAA, built around high-impact verticals. The goal here is to take these weather datasets and not just make them available to the meteorologists of the world, but to make them available to the entire community for use. Let's see what cool stuff comes out of it, let's see what people are able to learn, let's see what people are able to build for themselves going forward.

So, looking at the way ahead, we're really, really excited about the launch of BigQuery ML, which allows you to build regression models and cluster analyses, among other things, on top of your datasets. We think this helps create the only machine-learning-ready public dataset program available today. All you have to do to
use machine learning on top of BigQuery public datasets today is go write SQL. Yes, you still have to clean your data, unfortunately; I know very few people who like to clean data. But you don't have to find it, you don't have to discover it, you don't have to move it, and you don't have to learn TensorFlow. It's great if you want to learn TensorFlow, but you can just write your analysis and get started.

The BigQuery geovisualization capability: we're really excited about that in combination with the second point on the slide, which is utility datasets. What we're focused on going forward is enabling our partners, like NOAA, to onboard their datasets into our program. Now, does that mean that Google is never going to onboard and curate datasets again? No. What it means is that we're going to continue our focus on datasets that are more broadly used across all domains. We have a group of datasets that we call the utility datasets. They're just things that we generally find useful for all users, or for large swaths of users, regardless of what their interests are.

One of these datasets that's available today is ZIP code areas. It gives you your five-digit ZIP code, the town associated with it, and a few other metrics. What we've added to it recently is the polygon associated with that ZIP code area, from the Census Bureau's website. So you could take your data today, visualize it by ZIP code, and take it out to the BigQuery geovisualization capability. We're hoping to have a demonstration of that and a walkthrough query for that launched very soon, but that capability is available today, as was announced earlier. So our focus is going to be on providing these datasets that we generally hear across the community that people need, and allowing our partners and their expertise to scale out on our platform, to reduce that burden for them.

So I'm really proud to announce that, in support of this, we are making a five-petabyte commitment to public datasets in BigQuery. My personal take is that this is an unmatched commitment in the community right now. We are making five petabytes of BigQuery storage available to host public datasets going forward. And that's great, right? That's a ton of structured data. But what good does it do to extract and store data if it's going to be gone tomorrow? We hear this concern a lot from the public dataset providers that we collaborate with: why should I invest my effort, my team's effort, into uploading the data if it's going to be gone tomorrow? And because of that, we're really proud to announce that, in
addition to this five-petabyte commitment, we are making this commitment for five years. So this five petabytes will be available each year for the next five years to host public datasets in Google BigQuery. We're really excited about this, and I think it's going to enable us to do some really great things in collaboration with our partners.

So obviously you want to get started today, because Felipe, Ed, and I have gotten you so excited about Google's public dataset program. I did not see any eye-rolling, that's good. So if you want to use public datasets, I encourage you to go to these sites right here: that's /bigquery/public-data or /public-datasets. There you'll find some descriptions and overviews of some of the datasets we have, as well as some connections out to sample queries. If they're good, they're written by me; if not, they're written by Felipe. That's obviously the way that would work. So you can connect out, and I think that's a really great way to get started immediately. You don't have to know everything about the dataset, but you can get an idea of what the structure of the data is and how to use them, and then build from there, so you don't have to build from scratch.

Do you have public data? I want to be friends with you. If you have public data, you can get started today. You can go to BigQuery, or you can go to Google Cloud Storage, and leverage the free tier of storage that we talked about earlier today. So if you're under 10 gigabytes a month for BigQuery, you can go in today, store that, make it public, and share it with anybody. And because of the way BigQuery bills, it's the person running the query who is billed for it, so you aren't billed for that. Same thing with Google
Cloud Storage. I would just note there that before you share with other people, unless you want to pay the cost, I would put a Requester Pays mechanism on it. So I think I'm safe on that one. Okay, but maybe you're over the free tier of storage, maybe you have a ton of public data. That's even better; I want to be even better friends with you. Feel free to reach out to us at gcp-public-data@google.com. That goes to myself and to a couple of our other team members as well, and we'd love to get started working with you on hosting public data.

So thank you so much, everybody, for coming out. Let's have a round of applause for Felipe and for Ed, and thank you very much.

2018-08-05 04:32



That's actually pretty big. Good job guys.
